resource-tracker – Usage Guide

resource-tracker is a lightweight Linux resource tracker. It polls CPU, memory, disk, network, and GPU metrics at a configurable interval and emits each sample as a newline-delimited JSON (JSONL) or CSV line, written to stderr or to a target file.


Quick start

# Build
cargo build --release

# Run with defaults to track resources used by hashing for 5 seconds
./target/release/resource-tracker timeout 5s sha512sum /dev/zero

# Track a specific process tree
./target/release/resource-tracker --pid 1234 --job-name "my-job"

By default, each line of output is one complete JSON object representing one sample. The example below is pretty-printed for readability:

{
  "timestamp_secs": 1718000000,
  "job_name": "my-benchmark",
  "cpu": { "utilization_pct": 4.6, "per_core_pct": [12.5, 38.0, "..."], "process_cores_used": 3.8, "process_child_count": 4 },
  "memory": { "total_mib": 64000, "used_mib": 30468, "used_pct": 47.6, "free_mib": 2289, "available_mib": 18432, "buffers_mib": 263, "cached_mib": 8472, "active_mib": 8157, "inactive_mib": 7404, "swap_total_mib": 0, "swap_used_mib": 0, "swap_used_pct": 0.0 },
  "network": [{ "interface": "eth0", "rx_bytes_per_sec": 1200.0, "tx_bytes_per_sec": 400.0, "rx_bytes_total": 9834200, "tx_bytes_total": 312400, "driver": "virtio_net", "operstate": "up", "speed_mbps": 1000, "mtu": 1500, "mac_address": "02:00:00:aa:bb:cc" }],
  "disk": [{ "device": "nvme0n1", "model": "Samsung SSD 990 PRO", "device_type": "nvme", "capacity_bytes": 1000204886016, "read_bytes_per_sec": 0.0, "write_bytes_per_sec": 204800.0, "mounts": [{ "mount_point": "/", "filesystem": "ext4", "total_bytes": 999292796928, "used_bytes": 841676800000, "available_bytes": 142023000000, "used_pct": 84.2 }] }],
  "gpu": [{ "name": "NVIDIA GeForce RTX 4090", "utilization_pct": 98.0, "vram_used_pct": 72.3, "vram_used_bytes": 17394819072, "vram_total_bytes": 24026849280, "temperature_celsius": 74, "power_watts": 318.5, "frequency_mhz": 2520 }]
}
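
As a rough illustration of consuming this format, the sketch below decodes one sample line with Python's standard json module and pulls out a few headline values. The summarize helper and its output keys are hypothetical names chosen for this example; the input line is a trimmed version of the sample above.

```python
import json

# One sample as it would appear on a single output line (fields trimmed for brevity).
line = (
    '{"timestamp_secs": 1718000000, "job_name": "my-benchmark", '
    '"cpu": {"utilization_pct": 4.6, "process_cores_used": 3.8}, '
    '"memory": {"used_mib": 30468, "total_mib": 64000}, "gpu": []}'
)

def summarize(sample: dict) -> dict:
    """Pull a few headline numbers out of one decoded sample (hypothetical helper)."""
    cpu = sample.get("cpu", {})
    mem = sample.get("memory", {})
    return {
        "t": sample["timestamp_secs"],
        "job": sample.get("job_name"),
        "cores_busy": cpu.get("utilization_pct"),
        "mem_used_mib": mem.get("used_mib"),
        "gpu_count": len(sample.get("gpu", [])),  # empty list on hosts without GPUs
    }

summary = summarize(json.loads(line))
print(summary)
```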

CLI flags

| Flag | Short | Default | Description |
|---|---|---|---|
| --pid PID | -p | (none) | Root PID of the process tree to attribute CPU usage to. Includes all child processes. |
| --interval SECS | -i | 1 | How often to emit a sample, in seconds. |
| --config FILE | -c | resource-tracker.toml | Path to a TOML config file. Silently ignored if the file does not exist. |
| --format FORMAT | -f | json | Output format: json or csv. |
| --output FILE | -o | | Path to the output file. Defaults to stderr. |
| --quiet | | | Suppress metric output entirely, e.g. when streaming metrics to Sentinel and local output is not needed. |
| --help | -h | | Print help. |
| --version | -V | | Print version. |

Precedence: CLI flags > config file > built-in defaults.
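
The precedence rule can be pictured as a layered merge. The following is only a sketch of the idea, not the tool's actual implementation: the resolve function and the dictionary keys are hypothetical, and None stands for "not supplied on this layer".

```python
# Built-in defaults as listed in the flags table (key names are illustrative).
DEFAULTS = {"interval": 1, "format": "json", "output": None}

def resolve(cli: dict, config: dict, defaults: dict = DEFAULTS) -> dict:
    """Merge settings so that CLI flags beat the config file, which beats defaults."""
    merged = dict(defaults)
    for layer in (config, cli):  # later layers override earlier ones
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

settings = resolve(
    cli={"interval": 5, "format": None},        # only --interval given on the CLI
    config={"interval": 10, "format": "csv"},   # values from the TOML file
)
print(settings)
```

Here interval comes from the CLI layer, format from the config layer, and output falls through to the default.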


Config file (resource-tracker.toml)

The TOML config file lets you persist settings so you don’t have to repeat CLI flags on every invocation. It is optional – the tool works with no config file at all. Any field set on the CLI overrides the corresponding field in the file.

The default lookup path is resource-tracker.toml in the current working directory. Use --config /path/to/file.toml to point elsewhere.

Full reference

[job]
# Human-readable label for this tracking session.
# Appears as "job_name" in every emitted JSON sample.
# Useful when multiple runs are collected into the same data store so you can
# filter and group by job.
name = "gpu-benchmark-run-42"

# Root PID of the process to track.
# resource-tracker will walk the full process tree (parent + all descendants)
# and sum their CPU tick usage to report process_cores_used.
# Leave unset to collect system-wide metrics only.
pid = 12345

[tracker]
# Sampling interval in seconds.  Lower values give finer resolution at the
# cost of more output volume and slightly higher observer overhead.
# Default: 1
interval_secs = 10

Minimal example – system-wide monitoring

[tracker]
interval_secs = 30

Example – named job with process tracking

[job]
name    = "my_job_i_want_to_track"
pid     = 98231

[tracker]
interval_secs = 5

Sentinel API streaming and S3 output

When SENTINEL_API_TOKEN is set, the tracker registers the run with the Sentinel API and streams metric batches to S3 in the background. No network connections are ever made when the token is absent.

How it works

  1. At startup, the tracker calls the start_run API endpoint to register the run and obtain temporary S3 upload credentials from the Sentinel API.
  2. A background upload thread wakes every TRACKER_UPLOAD_INTERVAL seconds (default 60), drains the in-memory sample buffer, serializes the samples as CSV, gzip-compresses the result, and PUTs the file to the S3 prefix returned by the API.
  3. On clean exit (SIGTERM, or the shell-wrapper child exiting), any samples not yet uploaded are base64-encoded and sent inline to finish_run inside a gzip-compressed JSON body. If S3 uploads did occur, only the S3 URIs are sent.
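
The drain, serialize, and compress part of step 2 can be sketched as below. This is an illustration under stated assumptions rather than the tracker's real code: the column names are invented, no API or S3 call is made, and the base64 line only shows the kind of encoding step 3 mentions (whether the inline payload is the compressed CSV is an assumption).

```python
import base64
import csv
import gzip
import io

def drain(buffer: list) -> list:
    """Take everything currently buffered and leave the buffer empty."""
    drained = buffer[:]
    buffer.clear()
    return drained

def encode_batch(samples: list, columns: list) -> bytes:
    """Serialize samples as CSV text, then gzip-compress the bytes."""
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(samples)
    return gzip.compress(text.getvalue().encode("utf-8"))

# Hypothetical flat samples; the real CSV column layout is not specified here.
columns = ["timestamp_secs", "cpu_utilization_pct"]
buffer = [
    {"timestamp_secs": 1718000000, "cpu_utilization_pct": 4.6},
    {"timestamp_secs": 1718000001, "cpu_utilization_pct": 5.1},
]

payload = encode_batch(drain(buffer), columns)      # bytes that would be PUT to S3
inline = base64.b64encode(payload).decode("ascii")  # text-safe form for a JSON body
print(len(payload), len(buffer))
```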

Environment variables

| Variable | Required | Default | Description |
|---|---|---|---|
| SENTINEL_API_TOKEN | Yes | | Bearer token for the Sentinel API. Streaming is disabled when absent or empty. |
| SENTINEL_API_URL | No | https://api.sentinel.sparecores.net | Override the Sentinel API base URL. |
| TRACKER_UPLOAD_INTERVAL | No | 60 | Seconds between S3 batch uploads. |

Job metadata environment variables

All Section 9.3 metadata fields can be set via environment variable instead of CLI flags. Environment variables are overridden by the corresponding CLI flag when both are supplied.

| Variable | CLI flag |
|---|---|
| TRACKER_JOB_NAME | --job-name |
| TRACKER_PROJECT_NAME | --project-name |
| TRACKER_STAGE_NAME | --stage-name |
| TRACKER_TASK_NAME | --task-name |
| TRACKER_TEAM | --team |
| TRACKER_ENV | --env |
| TRACKER_LANGUAGE | --language |
| TRACKER_ORCHESTRATOR | --orchestrator |
| TRACKER_EXECUTOR | --executor |
| TRACKER_EXTERNAL_RUN_ID | --external-run-id |
| TRACKER_CONTAINER_IMAGE | --container-image |
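
Every pair in this table follows the same naming pattern: drop the leading dashes, replace the remaining dashes with underscores, upper-case the result, and prefix it with TRACKER_. The sketch below encodes that pattern together with the "CLI flag wins" rule. Both helper functions are hypothetical and exist only to make the mapping concrete.

```python
import os
from typing import Mapping, Optional

def env_name_for(flag: str) -> str:
    """Derive the environment variable name that pairs with a metadata flag."""
    return "TRACKER_" + flag.lstrip("-").replace("-", "_").upper()

def resolve_metadata(flag: str, cli_value: Optional[str],
                     environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the CLI value when given, otherwise fall back to the environment."""
    if cli_value is not None:
        return cli_value
    return environ.get(env_name_for(flag))

fake_env = {"TRACKER_JOB_NAME": "gpu-benchmark", "TRACKER_TEAM": "ml-platform"}

from_env = resolve_metadata("--job-name", None, fake_env)
from_cli = resolve_metadata("--job-name", "override", fake_env)
print(env_name_for("--external-run-id"), from_env, from_cli)
```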

Example

export SENTINEL_API_TOKEN="your-token-here"
export TRACKER_JOB_NAME="gpu-benchmark"
export TRACKER_UPLOAD_INTERVAL=30

./resource-tracker --interval 1 -- python train.py

The tracker spawns python train.py, monitors it, uploads a gzip-compressed CSV batch to S3 every 30 seconds, and calls finish_run when the script exits.


When to use the config file vs CLI flags

| Situation | Recommended approach |
|---|---|
| One-off interactive run | CLI flags – faster, no file to manage |
| Recurring job (cron, SLURM, systemd unit) | TOML file alongside the job definition |
| CI / benchmark pipeline | TOML file checked into the repository |
| Multiple named jobs on the same host | One TOML file per job, point to it with --config |
| Containerized workload | Set config via CLI flags in the CMD / ENTRYPOINT |

Capturing output

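Samples are emitted as newline-delimited JSON, so standard Unix tools work directly with the output. The examples below read the samples from stdout; if they arrive on stderr in your setup (the default listed under CLI flags), add 2>&1 before the pipe, or write to a file with --output.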

# Write to a file
./resource-tracker > run.jsonl

# Tail live output
./resource-tracker | tee run.jsonl

# Pretty-print with jq
./resource-tracker | jq .

# Extract only CPU utilization over time
./resource-tracker | jq '{ t: .timestamp_secs, cpu: .cpu.utilization_pct }'

# Watch GPU VRAM usage
./resource-tracker --interval 1 | jq '.gpu[] | { name, vram_used_pct }'
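
For post-processing a captured file without jq, a small script along these lines may be enough. It is a sketch that assumes one JSON object per line with the cpu.utilization_pct and gpu[].vram_used_pct fields shown earlier; peak_metrics is a hypothetical helper, and the two demo lines are made-up data.

```python
import io
import json

def peak_metrics(lines) -> dict:
    """Scan JSONL samples and report peak CPU use and peak VRAM use per GPU."""
    peak_cpu = 0.0
    peak_vram = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        sample = json.loads(raw)
        peak_cpu = max(peak_cpu, sample["cpu"]["utilization_pct"])
        for gpu in sample.get("gpu", []):
            name = gpu["name"]
            peak_vram[name] = max(peak_vram.get(name, 0.0), gpu["vram_used_pct"])
    return {"peak_cpu": peak_cpu, "peak_vram_pct": peak_vram}

# Stand-in for an open file such as run.jsonl.
demo = io.StringIO(
    '{"timestamp_secs": 1, "cpu": {"utilization_pct": 2.0}, '
    '"gpu": [{"name": "gpu0", "vram_used_pct": 10.0}]}\n'
    '{"timestamp_secs": 2, "cpu": {"utilization_pct": 4.6}, '
    '"gpu": [{"name": "gpu0", "vram_used_pct": 72.3}]}\n'
)

peaks = peak_metrics(demo)
print(peaks)
```

With a real capture, pass the open file object instead of the StringIO stand-in, for example inside a with open("run.jsonl") block.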

Shell-wrapper mode

Pass a command after -- to have the tracker spawn and monitor it:

./resource-tracker --interval 1 --job-name "training-run" -- python train.py --epochs 50

The tracker sets --pid automatically to the spawned child’s PID, emits one final sample when the child exits, then exits with the child’s exit code.

Rationale: eliminates the two-process boilerplate (tracker & python ...; wait) and guarantees the tracker always exits with the job’s exit code, making it transparent to CI systems.
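
The wrapper behaviour described above (spawn, sample while the child is alive, emit one final sample, propagate the exit code) can be mimicked in a few lines. This is a conceptual sketch, not the tracker's implementation: run_wrapped is a hypothetical function, and the sample counter is a placeholder for real metric collection.

```python
import subprocess
import sys
import time

def run_wrapped(command, interval_secs=0.05):
    """Spawn a command, 'sample' until it exits, and report its exit code."""
    child = subprocess.Popen(command)
    samples = 0
    while child.poll() is None:   # child still running
        samples += 1              # placeholder for collecting and emitting a sample
        time.sleep(interval_secs)
    samples += 1                  # one final sample after the child has exited
    return child.pid, child.returncode, samples

# A child that exits with status 3, standing in for the monitored job.
pid, exit_code, samples = run_wrapped(
    [sys.executable, "-c", "import sys; sys.exit(3)"]
)
print(pid, exit_code, samples)
```

A real wrapper would finish by exiting with exit_code itself, which is what makes it transparent to CI systems.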


Process tree tracking (--pid)

When --pid is set, every sample includes two extra fields under cpu:

  • process_cores_used – fractional cores consumed by the process tree (e.g. 3.8 means the tree is using the equivalent of 3.8 full cores).
  • process_child_count – number of live child/descendant processes at the time of sampling (does not include the root PID itself).

If the tracked PID exits during a run, its contribution drops to zero and process_child_count drops to zero. The tracker itself keeps running.

Rationale: Python’s SystemTracker tracks only the calling process’s own ticks. Rust walks the full /proc tree so multi-process and multi-threaded workloads (e.g. PyTorch data-loader workers, MPI ranks, Spark executors) are attributed correctly under a single root PID.
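
To make the two fields concrete, here is a sketch of the bookkeeping involved. It assumes a pid-to-parent-pid map has already been collected (on Linux this information is available under /proc) and that CPU time is counted in clock ticks; the function names and the formula delta_ticks / (ticks_per_sec * interval_secs) are illustrative assumptions, not taken from the tracker's source.

```python
from collections import defaultdict

def descendants(root: int, ppid_of: dict) -> set:
    """Return every live descendant of root, excluding root itself."""
    children = defaultdict(list)
    for pid, ppid in ppid_of.items():
        children[ppid].append(pid)
    found, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def cores_used(delta_ticks: int, ticks_per_sec: int, interval_secs: float) -> float:
    """Convert CPU ticks consumed during one interval into fractional cores."""
    return delta_ticks / (ticks_per_sec * interval_secs)

# Hypothetical snapshot: pid -> parent pid. PID 200 belongs to another tree.
ppid_of = {100: 1, 101: 100, 102: 100, 103: 101, 200: 1}

tree = descendants(100, ppid_of)
process_child_count = len(tree)
process_cores_used = cores_used(delta_ticks=380, ticks_per_sec=100, interval_secs=1.0)
print(sorted(tree), process_child_count, process_cores_used)
```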

Finding the PID of a running process:

# By name
pgrep -x python

# Most recently launched
pgrep -n my-training-script

# Already know the command? Launch and capture PID
my-training-script &
./resource-tracker --pid $! --job-name "training-run-1"

GPU support

GPUs are detected automatically at startup via NVML (NVIDIA) and libamdgpu_top (AMD). No configuration is needed. On hosts without GPU hardware or without the relevant driver libraries installed, the gpu array in each sample will be empty – the tracker continues running normally.

Supported accelerators: NVIDIA GPUs (NVML), AMD GPUs (ROCm/AMDGPU).

Rationale: per-GPU temperature, power draw, and clock frequency are not emitted by Python’s SystemTracker. These fields enable thermal throttle detection and power-efficiency analysis without a separate monitoring tool.


Metrics reference

cpu

| Field | Unit | Description |
|---|---|---|
| utilization_pct | fractional cores | Aggregate cores in use (0.0..N_cores). 4.6 on a 16-core host means ~4.6 vCPUs fully utilized. |
| per_core_pct | % each | Per-logical-core utilization array (0.0–100.0). |
| utime_secs | seconds | User+nice CPU time across all cores this interval. |
| stime_secs | seconds | System CPU time across all cores this interval. |
| process_count | count | Runnable processes (procs_running from /proc/stat). |
| process_cores_used | fractional cores | Cores consumed by tracked process tree (null if no PID). |
| process_child_count | count | Live descendant processes (null if no PID). |
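
Because utilization_pct is expressed in cores rather than as a percentage of the host, it can help to convert between the two views. The helpers below are a hypothetical sketch: the first assumes the aggregate is simply the sum of the per-core percentages divided by 100, which is an assumption about how the fields relate and not a documented guarantee.

```python
def cores_in_use(per_core_pct: list) -> float:
    """Sum per-core percentages (0-100 each) into fractional cores."""
    return sum(per_core_pct) / 100.0

def host_utilization_pct(utilization_cores: float, n_cores: int) -> float:
    """Express fractional cores in use as a percentage of total host capacity."""
    return 100.0 * utilization_cores / n_cores

busy = cores_in_use([100.0, 50.0, 0.0, 25.0])  # four-core toy example
host_pct = host_utilization_pct(4.6, 16)       # the 16-core example from the table
print(busy, round(host_pct, 2))
```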

memory

All values in mebibytes (MiB = 1,048,576 bytes).

| Field | Description |
|---|---|
| total_mib | Total installed RAM |
| free_mib | Truly free RAM (MemFree from /proc/meminfo) |
| available_mib | Free + reclaimable RAM (MemAvailable); better estimate of headroom |
| used_mib | total - free - buffers - cached (excludes reclaimable cache) |
| used_pct | Percentage of total RAM in use (0–100) |
| buffers_mib | Kernel I/O buffer cache |
| cached_mib | Page cache including slab-reclaimable (Cached + SReclaimable) |
| active_mib | Active pages (recently accessed) |
| inactive_mib | Inactive pages (candidates for reclaim) |
| swap_total_mib | Total swap space (0 if no swap) |
| swap_used_mib | Used swap |
| swap_used_pct | Percentage of swap in use (0–100) |

Rationale: Python’s SystemTracker reports memory in KiB and omits available_mib, active_mib, inactive_mib, swap_*. Rust reports all fields in MiB (matching Python resource-tracker PR #9) and adds available_mib (MemAvailable) which is a more reliable headroom estimate than free_mib alone on systems with large page caches.
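
The used_mib and cached_mib definitions in the table can be reproduced from /proc/meminfo-style text as in the sketch below. The sample values are invented, the helper names are hypothetical, and the code relies on the fact that the kernel's "kB" figures are units of 1024 bytes; whether the tracker computes used_pct exactly as used / total is an assumption.

```python
MEMINFO = """\
MemTotal:       16777216 kB
MemFree:         2097152 kB
MemAvailable:    8388608 kB
Buffers:          262144 kB
Cached:          3145728 kB
SReclaimable:    1048576 kB
"""

def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo key to its integer value in KiB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])
    return values

def memory_fields(kib: dict) -> dict:
    """Derive MiB fields following the definitions in the table above."""
    total = kib["MemTotal"] / 1024
    free = kib["MemFree"] / 1024
    buffers = kib["Buffers"] / 1024
    cached = (kib["Cached"] + kib["SReclaimable"]) / 1024  # Cached + SReclaimable
    used = total - free - buffers - cached
    return {
        "total_mib": total, "free_mib": free, "buffers_mib": buffers,
        "cached_mib": cached, "available_mib": kib["MemAvailable"] / 1024,
        "used_mib": used, "used_pct": 100.0 * used / total,
    }

mem = memory_fields(parse_meminfo(MEMINFO))
print(mem["used_mib"], mem["used_pct"])
```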

disk (one entry per whole-disk block device)

| Field | Unit | Description |
|---|---|---|
| device | | Kernel device name, e.g. nvme0n1, sda |
| model | | Drive model string from /sys/block/ |
| vendor | | Vendor string from /sys/block/ |
| serial | | Serial number or WWID |
| device_type | | nvme, ssd, or hdd |
| capacity_bytes | bytes | Raw device capacity |
| mounts | | Array of mounted filesystems on this device |
| mounts[].mount_point | | e.g. /, /home |
| mounts[].filesystem | | e.g. ext4, xfs, btrfs |
| mounts[].total_bytes | bytes | Filesystem total size |
| mounts[].used_bytes | bytes | Space in use |
| mounts[].available_bytes | bytes | Space available to non-root users |
| mounts[].used_pct | % | Fraction of filesystem in use |
| read_bytes_per_sec | bytes/s | Disk read throughput |
| write_bytes_per_sec | bytes/s | Disk write throughput |
| read_bytes_total | bytes | Cumulative bytes read since boot |
| write_bytes_total | bytes | Cumulative bytes written since boot |

Rationale: Python aggregates disk space across all mounts into three scalar CSV columns. Rust retains per-device, per-mount detail in the JSON output, enabling per-volume capacity tracking and per-device I/O attribution that the aggregated CSV cannot express.

network (one entry per non-loopback interface)

| Field | Unit | Description |
|---|---|---|
| interface | | Interface name, e.g. eth0, ens3 |
| mac_address | | Hardware MAC address |
| driver | | Kernel driver name, e.g. igc, virtio_net |
| operstate | | Link state: up, down, unknown |
| speed_mbps | Mbps | Negotiated link speed (-1 if not reported) |
| mtu | bytes | Maximum transmission unit |
| rx_bytes_per_sec | bytes/s | Received throughput |
| tx_bytes_per_sec | bytes/s | Transmitted throughput |
| rx_bytes_total | bytes | Cumulative bytes received since boot |
| tx_bytes_total | bytes | Cumulative bytes sent since boot |

Rationale: Python’s SystemTracker emits only cumulative rx/tx byte totals per interface. Rust adds per-interval rates, driver identity, link state, negotiated speed, and MTU, enabling network saturation and driver-level diagnostics without a separate tool.
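
A per-interval rate such as rx_bytes_per_sec is typically the difference between two cumulative counter readings divided by the elapsed time, and the same arithmetic applies to the disk read/write rates. The sketch below shows that calculation; the function name and the choice to report 0.0 when a counter goes backwards (for example after an interface reset) are assumptions, not documented tracker behaviour.

```python
def bytes_per_sec(prev_total: int, curr_total: int, interval_secs: float) -> float:
    """Turn two cumulative byte counters into a throughput for the interval."""
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    delta = curr_total - prev_total
    if delta < 0:
        # Counter reset or wrap-around: assumed here to be reported as zero.
        return 0.0
    return delta / interval_secs

rx_rate = bytes_per_sec(prev_total=9_833_000, curr_total=9_834_200, interval_secs=1.0)
tx_rate = bytes_per_sec(prev_total=310_400, curr_total=312_400, interval_secs=5.0)
print(rx_rate, tx_rate)
```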

gpu (one entry per detected accelerator)

| Field | Unit | Description |
|---|---|---|
| uuid | | Vendor-assigned device UUID |
| name | | Device name, e.g. NVIDIA GeForce RTX 4090 |
| device_type | | GPU, NPU, TPU, etc. |
| host_id | | Host-level device identifier (PCIe slot or platform index) |
| detail | | Driver-specific key/value map (PCI IDs, ASIC name, driver version, …) |
| utilization_pct | % | Core utilization |
| vram_total_bytes | bytes | Total VRAM |
| vram_used_bytes | bytes | Used VRAM |
| vram_used_pct | % | Fraction of VRAM in use |
| temperature_celsius | deg C | Die temperature |
| power_watts | W | Power draw |
| frequency_mhz | MHz | Core clock |
| core_count | count | Shader/compute cores (null if not reported) |