resource-tracker – Usage Guide
resource-tracker is a lightweight resource tracker for Linux. It polls CPU,
memory, disk, network, and GPU metrics at a configurable interval and emits
each sample as one line of newline-delimited JSON (JSONL) or CSV, written to stderr or to a target file.
Quick start
# Build
cargo build --release
# Run with defaults to track resources used by hashing for 5 seconds
./target/release/resource-tracker -- timeout 5s sha512sum /dev/zero
# Track a specific process tree
./target/release/resource-tracker --pid 1234 --job-name "my-job"
By default, each line of output is a complete JSON object representing one sample (pretty-printed here for readability):
{
"timestamp_secs": 1718000000,
"job_name": "my-benchmark",
"cpu": { "utilization_pct": 4.6, "per_core_pct": [12.5, 38.0, "..."], "process_cores_used": 3.8, "process_child_count": 4 },
"memory": { "total_mib": 64000, "used_mib": 30468, "used_pct": 47.6, "free_mib": 2289, "available_mib": 18432, "buffers_mib": 263, "cached_mib": 8472, "active_mib": 8157, "inactive_mib": 7404, "swap_total_mib": 0, "swap_used_mib": 0, "swap_used_pct": 0.0 },
"network": [{ "interface": "eth0", "rx_bytes_per_sec": 1200.0, "tx_bytes_per_sec": 400.0, "rx_bytes_total": 9834200, "tx_bytes_total": 312400, "driver": "virtio_net", "operstate": "up", "speed_mbps": 1000, "mtu": 1500, "mac_address": "02:00:00:aa:bb:cc" }],
"disk": [{ "device": "nvme0n1", "model": "Samsung SSD 990 PRO", "device_type": "nvme", "capacity_bytes": 1000204886016, "read_bytes_per_sec": 0.0, "write_bytes_per_sec": 204800.0, "mounts": [{ "mount_point": "/", "filesystem": "ext4", "total_bytes": 999292796928, "used_bytes": 841676800000, "available_bytes": 142023000000, "used_pct": 84.2 }] }],
"gpu": [{ "name": "NVIDIA GeForce RTX 4090", "utilization_pct": 98.0, "vram_used_pct": 72.3, "vram_used_bytes": 17394819072, "vram_total_bytes": 24026849280, "temperature_celsius": 74, "power_watts": 318.5, "frequency_mhz": 2520 }]
}
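As a minimal sketch of consuming one sample, the Python snippet below parses a single JSONL line and pulls out a few headline numbers. The field names come from the sample above; the helper name `summarize_sample` and the trimmed-down input line are illustrative assumptions, not part of the tool.

```python
import json

def summarize_sample(line: str) -> dict:
    """Parse one JSONL line and return a few headline numbers."""
    sample = json.loads(line)
    memory = sample.get("memory", {})
    return {
        "timestamp_secs": sample["timestamp_secs"],
        "job_name": sample.get("job_name"),
        "cpu_utilization": sample["cpu"]["utilization_pct"],
        "memory_used_pct": memory.get("used_pct"),
        # "gpu" is an empty list on hosts without a supported GPU.
        "gpu_names": [gpu["name"] for gpu in sample.get("gpu", [])],
    }

# Trimmed-down sample for illustration; real samples carry more fields.
line = (
    '{"timestamp_secs": 1718000000, "job_name": "my-benchmark", '
    '"cpu": {"utilization_pct": 4.6}, "memory": {"used_pct": 47.6}, "gpu": []}'
)
summary = summarize_sample(line)
print(summary)
```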
CLI flags
| Flag | Short | Default | Description |
|---|---|---|---|
| --pid PID | -p | (none) | Root PID of the process tree to attribute CPU usage to. Includes all child processes. |
| --interval SECS | -i | 1 | How often to emit a sample, in seconds. |
| --config FILE | -c | resource-tracker.toml | Path to a TOML config file. Silently ignored if the file does not exist. |
| --format FORMAT | -f | json | Output format: json or csv. |
| --output FILE | -o | | Path to the output file. Defaults to stderr. |
| --quiet | | | Suppress metric output entirely, e.g. when streaming metrics to Sentinel and local output is not needed. |
| --help | -h | | Print help. |
| --version | -V | | Print version. |
Precedence: CLI flags > config file > built-in defaults.
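The precedence rule can be pictured as a layered merge. The sketch below is only an illustration of that rule, not the tool's actual implementation; the function name `resolve_settings` and the flattened keys `interval` and `format` are hypothetical, and `None` stands for "not supplied at this layer".

```python
# Built-in defaults taken from the CLI flags table.
DEFAULTS = {"interval": 1, "format": "json"}

def resolve_settings(cli: dict, config: dict, defaults: dict = DEFAULTS) -> dict:
    """Merge settings so that CLI flags > config file > built-in defaults."""
    resolved = dict(defaults)
    # Layers applied later win; None means the layer did not set the key.
    for layer in (config, cli):
        resolved.update({key: value for key, value in layer.items() if value is not None})
    return resolved

settings = resolve_settings(
    cli={"interval": 5, "format": None},        # only --interval given on the CLI
    config={"interval": 30, "format": "csv"},   # both set in the config file
)
print(settings)  # {'interval': 5, 'format': 'csv'}
```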
Config file (resource-tracker.toml)
The TOML config file lets you persist settings so you don’t have to repeat CLI flags on every invocation. It is optional – the tool works with no config file at all. Any field set on the CLI overrides the corresponding field in the file.
The default lookup path is resource-tracker.toml in the current working directory.
Use --config /path/to/file.toml to point elsewhere.
Full reference
[job]
# Human-readable label for this tracking session.
# Appears as "job_name" in every emitted JSON sample.
# Useful when multiple runs are collected into the same data store so you can
# filter and group by job.
name = "gpu-benchmark-run-42"
# Root PID of the process to track.
# resource-tracker will walk the full process tree (parent + all descendants)
# and sum their CPU tick usage to report process_cores_used.
# Leave unset to collect system-wide metrics only.
pid = 12345
[tracker]
# Sampling interval in seconds. Lower values give finer resolution at the
# cost of more output volume and slightly higher observer overhead.
# Default: 1
interval_secs = 10
Minimal example – system-wide monitoring
[tracker]
interval_secs = 30
Example – named job with process tracking
[job]
name = "my_job_i_want_to_track"
pid = 98231
[tracker]
interval_secs = 5
Sentinel API streaming and S3 output
When SENTINEL_API_TOKEN is set, the tracker registers the run with the
Sentinel API and streams metric batches to S3 in the background.
No network connections are ever made when the token is absent.
How it works
- At startup, the start_run API endpoint is called to register the run and obtain temporary S3 upload credentials from the Sentinel API.
- A background upload thread wakes every TRACKER_UPLOAD_INTERVAL seconds (default 60), drains the in-memory sample buffer, serializes it as CSV, gzip-compresses it, and PUTs the file to the S3 prefix returned by the API.
- On clean exit (SIGTERM, shell-wrapper child exits), any samples not yet uploaded are base64-encoded and sent inline to finish_run inside a gzip-compressed JSON body. If S3 uploads did occur, only the S3 URIs are sent.
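The serialization steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the CSV column names, the JSON key `inline_csv_gz_b64`, and the choice to base64-encode the gzipped CSV are all hypothetical, since the actual payload layout is not described here. No network call is made.

```python
import base64
import csv
import gzip
import io
import json

def encode_batch(samples: list[dict]) -> bytes:
    """Serialize buffered samples as CSV and gzip-compress the result."""
    if not samples:
        return gzip.compress(b"")
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=list(samples[0].keys()))
    writer.writeheader()
    writer.writerows(samples)
    return gzip.compress(text.getvalue().encode("utf-8"))

def inline_finish_body(samples: list[dict]) -> bytes:
    """Build a gzip-compressed JSON body carrying base64-encoded leftover samples."""
    encoded = base64.b64encode(encode_batch(samples)).decode("ascii")
    # Hypothetical key name; the real finish_run schema may differ.
    return gzip.compress(json.dumps({"inline_csv_gz_b64": encoded}).encode("utf-8"))

buffer = [
    {"timestamp_secs": 1718000000, "cpu_utilization": 4.6},
    {"timestamp_secs": 1718000001, "cpu_utilization": 5.1},
]
batch = encode_batch(buffer)        # what a periodic upload would PUT
body = inline_finish_body(buffer)   # what a final inline request would carry
print(gzip.decompress(batch).decode("utf-8"))
```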
Environment variables
| Variable | Required | Default | Description |
|---|---|---|---|
| SENTINEL_API_TOKEN | Yes | – | Bearer token for the Sentinel API. Streaming is disabled when absent or empty. |
| SENTINEL_API_URL | No | https://api.sentinel.sparecores.net | Override the Sentinel API base URL. |
| TRACKER_UPLOAD_INTERVAL | No | 60 | Seconds between S3 batch uploads. |
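A small sketch of how these variables could be interpreted, using only the names and defaults from the table. The function `streaming_config` and the shape of the returned dictionary are illustrative assumptions.

```python
import os
from typing import Mapping, Optional

DEFAULT_API_URL = "https://api.sentinel.sparecores.net"
DEFAULT_UPLOAD_INTERVAL = 60

def streaming_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return streaming settings, or None when streaming is disabled."""
    token = env.get("SENTINEL_API_TOKEN", "")
    if not token:
        # Absent or empty token: streaming stays off.
        return None
    return {
        "token": token,
        "api_url": env.get("SENTINEL_API_URL", DEFAULT_API_URL),
        "upload_interval_secs": int(env.get("TRACKER_UPLOAD_INTERVAL", DEFAULT_UPLOAD_INTERVAL)),
    }

print(streaming_config({}))  # None
config = streaming_config({"SENTINEL_API_TOKEN": "your-token-here", "TRACKER_UPLOAD_INTERVAL": "30"})
print(config["api_url"], config["upload_interval_secs"])
```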
Job metadata environment variables
All Section 9.3 metadata fields can be set via environment variables instead of CLI flags. When both are supplied, the CLI flag overrides the corresponding environment variable.
| Variable | CLI flag |
|---|---|
| TRACKER_JOB_NAME | --job-name |
| TRACKER_PROJECT_NAME | --project-name |
| TRACKER_STAGE_NAME | --stage-name |
| TRACKER_TASK_NAME | --task-name |
| TRACKER_TEAM | --team |
| TRACKER_ENV | --env |
| TRACKER_LANGUAGE | --language |
| TRACKER_ORCHESTRATOR | --orchestrator |
| TRACKER_EXECUTOR | --executor |
| TRACKER_EXTERNAL_RUN_ID | --external-run-id |
| TRACKER_CONTAINER_IMAGE | --container-image |
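The override rule for these pairs can be sketched as a per-field lookup where the CLI value is consulted first. The mapping is copied from the table above; the function name `resolve_metadata` and the dictionary-based inputs are assumptions made for illustration.

```python
ENV_TO_FLAG = {
    "TRACKER_JOB_NAME": "--job-name",
    "TRACKER_PROJECT_NAME": "--project-name",
    "TRACKER_STAGE_NAME": "--stage-name",
    "TRACKER_TASK_NAME": "--task-name",
    "TRACKER_TEAM": "--team",
    "TRACKER_ENV": "--env",
    "TRACKER_LANGUAGE": "--language",
    "TRACKER_ORCHESTRATOR": "--orchestrator",
    "TRACKER_EXECUTOR": "--executor",
    "TRACKER_EXTERNAL_RUN_ID": "--external-run-id",
    "TRACKER_CONTAINER_IMAGE": "--container-image",
}

def resolve_metadata(env: dict, cli: dict) -> dict:
    """Resolve each metadata field; a CLI value overrides the environment."""
    resolved = {}
    for variable, flag in ENV_TO_FLAG.items():
        value = cli.get(flag, env.get(variable))
        if value is not None:
            resolved[flag] = value
    return resolved

meta = resolve_metadata(
    env={"TRACKER_JOB_NAME": "gpu-benchmark", "TRACKER_TEAM": "ml"},
    cli={"--job-name": "training-run"},
)
print(meta)  # {'--job-name': 'training-run', '--team': 'ml'}
```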
Example
export SENTINEL_API_TOKEN="your-token-here"
export TRACKER_JOB_NAME="gpu-benchmark"
export TRACKER_UPLOAD_INTERVAL=30
./resource-tracker --interval 1 -- python train.py
The tracker spawns python train.py, monitors it, uploads a gzip-compressed
CSV batch to S3 every 30 seconds, and calls finish_run when the script exits.
When to use the config file vs CLI flags
| Situation | Recommended approach |
|---|---|
| One-off interactive run | CLI flags – faster, no file to manage |
| Recurring job (cron, SLURM, systemd unit) | TOML file alongside the job definition |
| CI / benchmark pipeline | TOML file checked into the repository |
| Multiple named jobs on the same host | One TOML file per job, point to it with --config |
| Containerized workload | Set config via CLI flags in the CMD / ENTRYPOINT |
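For the "one TOML file per job" case, a short generator script can keep the files consistent. The sketch below only uses the keys shown in the full reference ([job] name, pid and [tracker] interval_secs); the helper names, file naming scheme, and job names are hypothetical, and job names containing double quotes are not escaped.

```python
import tempfile
from pathlib import Path

def render_config(name: str, interval_secs: int, pid=None) -> str:
    """Render a config body using the keys from the full reference."""
    lines = ["[job]", f'name = "{name}"']
    if pid is not None:
        lines.append(f"pid = {pid}")
    lines += ["", "[tracker]", f"interval_secs = {interval_secs}", ""]
    return "\n".join(lines)

def write_job_configs(directory: Path, jobs: dict) -> dict:
    """Write one <job>.toml per job and return the matching command lines."""
    commands = {}
    for name, interval_secs in jobs.items():
        path = directory / f"{name}.toml"
        path.write_text(render_config(name, interval_secs))
        commands[name] = ["./resource-tracker", "--config", str(path)]
    return commands

with tempfile.TemporaryDirectory() as tmp:
    commands = write_job_configs(Path(tmp), {"etl-nightly": 30, "gpu-benchmark": 1})
    for command in commands.values():
        print(" ".join(command))
```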
Capturing output
Because samples are emitted as newline-delimited JSON to stdout, standard Unix tools work directly with the output.
# Write to a file
./resource-tracker > run.jsonl
# Tail live output
./resource-tracker | tee run.jsonl
# Pretty-print with jq
./resource-tracker | jq .
# Extract only CPU utilization over time
./resource-tracker | jq '{ t: .timestamp_secs, cpu: .cpu.utilization_pct }'
# Watch GPU VRAM usage
./resource-tracker --interval 1 | jq '.gpu[] | { name, vram_used_pct }'
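The same extraction can be done in Python when jq is not available. This is a minimal sketch that assumes the JSON format and the field names shown in the sample; `cpu_timeline` is an illustrative helper, and the in-memory `demo` list stands in for a captured file such as run.jsonl.

```python
import json
from typing import Iterable, Iterator

def cpu_timeline(lines: Iterable[str]) -> Iterator[tuple]:
    """Yield (timestamp_secs, CPU utilization) for every non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sample = json.loads(line)
        yield sample["timestamp_secs"], sample["cpu"]["utilization_pct"]

# With a captured file: `with open("run.jsonl") as fh: list(cpu_timeline(fh))`
demo = [
    '{"timestamp_secs": 1718000000, "cpu": {"utilization_pct": 4.6}}',
    "",
    '{"timestamp_secs": 1718000001, "cpu": {"utilization_pct": 5.2}}',
]
timeline = list(cpu_timeline(demo))
print(timeline)  # [(1718000000, 4.6), (1718000001, 5.2)]
```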
Shell-wrapper mode
Pass a command after -- to have the tracker spawn and monitor it:
./resource-tracker --interval 1 --job-name "training-run" -- python train.py --epochs 50
The tracker sets --pid automatically to the spawned child’s PID, emits one
final sample when the child exits, then exits with the child’s exit code.
Rationale: eliminates the two-process boilerplate (tracker & python ...; wait)
and guarantees the tracker always exits with the job’s exit code, making it
transparent to CI systems.
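The exit-code behaviour can be illustrated with a tiny Python wrapper. This is a sketch of the general spawn-wait-propagate pattern, not the tracker's own code; `run_wrapped` is a hypothetical name and the sampling step is left as a comment.

```python
import subprocess
import sys

def run_wrapped(command: list) -> int:
    """Spawn the command, wait for it to finish, and return its exit code."""
    child = subprocess.Popen(command)
    # A tracker would sample metrics for child.pid here until the child exits.
    return child.wait()

# Demo child that exits with status 3.
exit_code = run_wrapped([sys.executable, "-c", "import sys; sys.exit(3)"])
print(exit_code)  # 3
```

A wrapper built this way would finish with `sys.exit(exit_code)`, so a CI system sees the job's own status rather than the wrapper's.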
Process tree tracking (--pid)
When --pid is set, every sample includes two extra fields under cpu:
- process_cores_used – fractional cores consumed by the process tree (e.g. 3.8 means the tree is using the equivalent of 3.8 full cores).
- process_child_count – number of live child/descendant processes at the time of sampling (does not include the root PID itself).
If the tracked PID exits during a run, its contribution drops to zero and
process_child_count drops to zero. The tracker itself keeps running.
Rationale: Python’s SystemTracker tracks only the calling process’s own
ticks. Rust walks the full /proc tree so multi-process and multi-threaded
workloads (e.g. PyTorch data-loader workers, MPI ranks, Spark executors) are
attributed correctly under a single root PID.
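To make the tree walk concrete, here is a hedged Python sketch that works on already-collected data rather than reading /proc directly. It assumes a PID-to-parent map (the ppid field of /proc/<pid>/stat) and per-process CPU tick totals (utime + stime from the same file), and it assumes the tick rate is 100 per second, which is typical on Linux but should be read from `os.sysconf("SC_CLK_TCK")` in practice. The function names and the exact formula are illustrative, not taken from the tracker's source.

```python
from collections import defaultdict

def descendants(root_pid: int, ppid_of: dict) -> set:
    """Return all live descendants of root_pid (the root itself is excluded)."""
    children = defaultdict(list)
    for pid, ppid in ppid_of.items():
        children[ppid].append(pid)
    found, stack = set(), [root_pid]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def cores_used(before: dict, after: dict, tree: set, interval_secs: float, clk_tck: int = 100) -> float:
    """Convert the tree's CPU tick delta over one interval into fractional cores."""
    delta = sum(after[pid] - before.get(pid, 0) for pid in tree if pid in after)
    return delta / (clk_tck * interval_secs)

# Synthetic snapshot: PID 100 is the root, PID 200 is unrelated.
ppid_of = {100: 1, 101: 100, 102: 100, 103: 101, 200: 1}
kids = descendants(100, ppid_of)
ticks_before = {100: 1000, 101: 500, 102: 500, 103: 0, 200: 900}
ticks_after = {100: 1100, 101: 600, 102: 600, 103: 80, 200: 990}
used = cores_used(ticks_before, ticks_after, kids | {100}, interval_secs=1.0)
print(len(kids), used)  # 3 3.8
```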
Finding the PID of a running process:
# By name
pgrep -x python
# Most recently launched
pgrep -n my-training-script
# Already know the command? Launch and capture PID
my-training-script &
./resource-tracker --pid $! --job-name "training-run-1"
GPU support
GPUs are detected automatically at startup via NVML (NVIDIA) and
libamdgpu_top (AMD). No configuration is needed. On hosts without GPU
hardware or without the relevant driver libraries installed, the gpu array
in each sample will be empty – the tracker continues running normally.
Supported accelerators: NVIDIA GPUs (NVML), AMD GPUs (ROCm/AMDGPU).
Rationale: per-GPU temperature, power draw, and clock frequency are not
emitted by Python’s SystemTracker. These fields enable thermal throttle
detection and power-efficiency analysis without a separate monitoring tool.
Metrics reference
cpu
| Field | Unit | Description |
|---|---|---|
| utilization_pct | fractional cores | Aggregate cores in use (0.0..N_cores). 4.6 on a 16-core host means ~4.6 vCPUs fully utilized. |
| per_core_pct | % each | Per-logical-core utilization array (0.0–100.0). |
| utime_secs | seconds | User+nice CPU time across all cores this interval. |
| stime_secs | seconds | System CPU time across all cores this interval. |
| process_count | count | Runnable processes (procs_running from /proc/stat). |
| process_cores_used | fractional cores | Cores consumed by tracked process tree (null if no PID). |
| process_child_count | count | Live descendant processes (null if no PID). |
memory
All values in mebibytes (MiB = 1,048,576 bytes).
| Field | Description |
|---|---|
| total_mib | Total installed RAM |
| free_mib | Truly free RAM (MemFree from /proc/meminfo) |
| available_mib | Free + reclaimable RAM (MemAvailable); better estimate of headroom |
| used_mib | total - free - buffers - cached (excludes reclaimable cache) |
| used_pct | Percentage of total RAM in use |
| buffers_mib | Kernel I/O buffer cache |
| cached_mib | Page cache including slab-reclaimable (Cached + SReclaimable) |
| active_mib | Active pages (recently accessed) |
| inactive_mib | Inactive pages (candidates for reclaim) |
| swap_total_mib | Total swap space (0 if no swap) |
| swap_used_mib | Used swap |
| swap_used_pct | Percentage of swap in use |
Rationale: Python’s SystemTracker reports memory in KiB and omits
available_mib, active_mib, inactive_mib, swap_*. Rust reports all
fields in MiB (matching Python resource-tracker PR #9) and adds
available_mib (MemAvailable) which is a more reliable headroom estimate
than free_mib alone on systems with large page caches.
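The derived memory fields can be reproduced from /proc/meminfo, whose values are reported in KiB (labelled kB). The sketch below follows the formulas in the table: cached = Cached + SReclaimable and used = total - free - buffers - cached. It assumes used_pct is used_mib / total_mib * 100, and rounding is left out because the tool's rounding rules are not stated. The function names and the synthetic input are illustrative.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into a {key: KiB} mapping."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values

def memory_fields(meminfo_kib: dict) -> dict:
    """Derive MiB-based fields following the formulas in the table above."""
    def mib(key: str) -> float:
        return meminfo_kib.get(key, 0) / 1024
    total, free, buffers = mib("MemTotal"), mib("MemFree"), mib("Buffers")
    cached = mib("Cached") + mib("SReclaimable")
    used = total - free - buffers - cached
    return {
        "total_mib": total,
        "free_mib": free,
        "available_mib": mib("MemAvailable"),
        "buffers_mib": buffers,
        "cached_mib": cached,
        "used_mib": used,
        "used_pct": used / total * 100 if total else 0.0,
    }

# Synthetic input; on a Linux host use open("/proc/meminfo").read() instead.
MEMINFO = """MemTotal:       16384000 kB
MemFree:         2048000 kB
MemAvailable:    8192000 kB
Buffers:          512000 kB
Cached:          3072000 kB
SReclaimable:    1024000 kB
"""
fields = memory_fields(parse_meminfo(MEMINFO))
print(fields["used_mib"], fields["used_pct"])  # 9500.0 59.375
```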
disk (one entry per whole-disk block device)
| Field | Unit | Description |
|---|---|---|
| device | – | Kernel device name, e.g. nvme0n1, sda |
| model | – | Drive model string from /sys/block/ |
| vendor | – | Vendor string from /sys/block/ |
| serial | – | Serial number or WWID |
| device_type | – | nvme, ssd, or hdd |
| capacity_bytes | bytes | Raw device capacity |
| mounts | – | Array of mounted filesystems on this device |
| mounts[].mount_point | – | e.g. /, /home |
| mounts[].filesystem | – | e.g. ext4, xfs, btrfs |
| mounts[].total_bytes | bytes | Filesystem total size |
| mounts[].used_bytes | bytes | Space in use |
| mounts[].available_bytes | bytes | Space available to non-root users |
| mounts[].used_pct | % | Percentage of filesystem space in use |
| read_bytes_per_sec | bytes/s | Disk read throughput |
| write_bytes_per_sec | bytes/s | Disk write throughput |
| read_bytes_total | bytes | Cumulative bytes read since boot |
| write_bytes_total | bytes | Cumulative bytes written since boot |
Rationale: Python aggregates disk space across all mounts into three scalar CSV columns. Rust retains per-device, per-mount detail in the JSON output, enabling per-volume capacity tracking and per-device I/O attribution that the aggregated CSV cannot express.
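As an example of what the per-mount detail allows, the sketch below lists the mounts in one sample that are at or above a fill threshold. The field names follow the table above; the helper name `mounts_over`, the 80% threshold, and the second mount in the demo data are illustrative assumptions.

```python
def mounts_over(sample: dict, threshold_pct: float) -> list:
    """List (device, mount_point, used_pct) for mounts at or above the threshold."""
    hits = []
    for disk in sample.get("disk", []):
        for mount in disk.get("mounts", []):
            if mount["used_pct"] >= threshold_pct:
                hits.append((disk["device"], mount["mount_point"], mount["used_pct"]))
    return hits

sample = {
    "disk": [
        {
            "device": "nvme0n1",
            "mounts": [
                {"mount_point": "/", "used_pct": 84.2},
                {"mount_point": "/boot", "used_pct": 12.0},  # hypothetical extra mount
            ],
        }
    ]
}
nearly_full = mounts_over(sample, threshold_pct=80.0)
print(nearly_full)  # [('nvme0n1', '/', 84.2)]
```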
network (one entry per non-loopback interface)
| Field | Unit | Description |
|---|---|---|
| interface | – | Interface name, e.g. eth0, ens3 |
| mac_address | – | Hardware MAC address |
| driver | – | Kernel driver name, e.g. igc, virtio_net |
| operstate | – | Link state: up, down, unknown |
| speed_mbps | Mbps | Negotiated link speed (-1 if not reported) |
| mtu | bytes | Maximum transmission unit |
| rx_bytes_per_sec | bytes/s | Received throughput |
| tx_bytes_per_sec | bytes/s | Transmitted throughput |
| rx_bytes_total | bytes | Cumulative bytes received since boot |
| tx_bytes_total | bytes | Cumulative bytes sent since boot |
Rationale: Python’s SystemTracker emits only cumulative rx/tx byte
totals per interface. Rust adds per-interval rates, driver identity,
link state, negotiated speed, and MTU, enabling network saturation and
driver-level diagnostics without a separate tool.
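One way to estimate link saturation from these fields is to convert the busier direction from bytes per second to megabits per second and compare it with speed_mbps. This is a sketch under the assumption of a full-duplex link where each direction can use the full negotiated speed; `link_saturation_pct` is an illustrative helper, and it returns None when the speed is not reported (-1).

```python
from typing import Optional

def link_saturation_pct(iface: dict) -> Optional[float]:
    """Return the busier direction's share of link speed in percent, or None."""
    speed_mbps = iface.get("speed_mbps", -1)
    if speed_mbps <= 0:
        # Speed not reported (e.g. -1): saturation cannot be computed.
        return None
    busiest_bytes_per_sec = max(iface["rx_bytes_per_sec"], iface["tx_bytes_per_sec"])
    return busiest_bytes_per_sec * 8 / (speed_mbps * 1_000_000) * 100

busy = {"interface": "eth0", "speed_mbps": 1000,
        "rx_bytes_per_sec": 62_500_000.0, "tx_bytes_per_sec": 400.0}
unknown = {"interface": "ens3", "speed_mbps": -1,
           "rx_bytes_per_sec": 1200.0, "tx_bytes_per_sec": 400.0}
print(link_saturation_pct(busy))     # 50.0
print(link_saturation_pct(unknown))  # None
```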
gpu (one entry per detected accelerator)
| Field | Unit | Description |
|---|---|---|
| uuid | – | Vendor-assigned device UUID |
| name | – | Device name, e.g. NVIDIA GeForce RTX 4090 |
| device_type | – | GPU, NPU, TPU, etc. |
| host_id | – | Host-level device identifier (PCIe slot or platform index) |
| detail | – | Driver-specific key/value map (PCI IDs, ASIC name, driver version, …) |
| utilization_pct | % | Core utilization |
| vram_total_bytes | bytes | Total VRAM |
| vram_used_bytes | bytes | Used VRAM |
| vram_used_pct | % | Percentage of VRAM in use |
| temperature_celsius | deg C | Die temperature |
| power_watts | W | Power draw |
| frequency_mhz | MHz | Core clock |
| core_count | count | Shader/compute cores (null if not reported) |