# Rationale
resource-tracker is a Rust rewrite of the Python
resource-tracker library.
It preserves full CSV column parity with the Python implementation while adding
new capabilities that are difficult or impossible to express in the original.
## Why Rust
| Property | Python resource-tracker | resource-tracker |
|---|---|---|
| Runtime dependency | Python interpreter + psutil | Single static binary |
| Startup overhead | ~200-500 ms | < 5 ms |
| Observer CPU overhead | ~0.5-1% per core | < 0.1% per core |
| Memory footprint | ~30-60 MiB (interpreter) | ~2-4 MiB |
| Deployment | pip / uv install | Copy binary |
The lower observer overhead matters when tracking short-lived or CPU-intensive workloads where the tracker itself would otherwise appear in the numbers it is collecting.
## New user-facing functionality

### Shell-wrapper mode
```sh
./resource-tracker --interval 1 -- python train.py --epochs 50
```
Pass any command after `--` and the tracker spawns it, sets `--pid`
automatically, emits one final sample on exit, and forwards the child’s
exit code. This eliminates the two-process boilerplate
(`tracker & child; wait`) and makes the tracker transparent to CI systems
and schedulers that check exit codes.
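Because the exit code is forwarded, a CI step can wrap its test or build command directly; the `pytest` invocation below is only an illustrative workload:

```sh
# The tracker exits with the wrapped command's status, so CI pass/fail
# behaviour is unchanged; pytest here is just an example workload.
./resource-tracker --interval 1 -- pytest tests/
echo "child exit code: $?"
```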
### Full process tree tracking (`--pid`)
Python’s `SystemTracker` attributes CPU ticks only to the root process.
The Rust implementation walks the full `/proc` tree and sums every descendant (workers,
threads, MPI ranks, Spark executors) under the given root PID; a rough shell
sketch of the walk follows the field list below. Two fields
appear in every JSON sample when `--pid` is active:
- `cpu.process_cores_used` – fractional cores consumed by the whole tree
- `cpu.process_child_count` – live descendant count at each sample
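For intuition, a rough shell equivalent of that walk (the binary does this internally by reading `/proc`; `pgrep -P` lists the children of a PID):

```sh
# Illustrative only: recursively print a root PID and every descendant,
# the same set the tracker aggregates under --pid.
walk() {
  echo "$1"
  local child
  for child in $(pgrep -P "$1"); do
    walk "$child"
  done
}
walk "$$"   # start from the current shell, for example
```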
### Sentinel API streaming and S3 upload
When SENTINEL_API_TOKEN is set, the tracker registers the run, streams
gzip-compressed CSV batches to S3 every TRACKER_UPLOAD_INTERVAL seconds
(default 60), and posts a finish_run call on clean exit. No network
connections are made when the token is absent.
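Enabling the upload path is therefore purely a matter of environment; a typical invocation looks like this (token value elided):

```sh
# Streaming is active only because SENTINEL_API_TOKEN is set;
# unset it and the same command runs fully offline.
export SENTINEL_API_TOKEN="..."       # registers the run at startup
export TRACKER_UPLOAD_INTERVAL=120    # override the 60 s default
./resource-tracker --interval 1 -- python train.py --epochs 50
```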
### TOML config file + environment variable overrides
All settings (interval, job name, PID, metadata) can be persisted in a
resource-tracker.toml file alongside the job definition. Every field
also has a TRACKER_* environment variable override, which is convenient
for containerized or CI environments where config files are impractical.
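A minimal sketch of the idea; the key names and the `TRACKER_JOB_NAME` variable below are illustrative placeholders rather than the documented schema:

```sh
# Hypothetical field names; consult the configuration reference for
# the real schema.
cat > resource-tracker.toml <<'EOF'
interval = 5
job_name = "nightly-train"

[metadata]
team = "ml-platform"
EOF

# Any TRACKER_* variable overrides the corresponding file setting,
# handy when the image is immutable but one value must change per run.
TRACKER_JOB_NAME="nightly-train-gpu" ./resource-tracker -- python train.py
```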
### Richer metrics (JSON superset)
The CSV output matches the Python implementation column for column. The JSON output is a superset, carrying additional fields that cannot be represented as flat CSV scalars.
#### CPU
- `per_core_pct[]` – per-logical-core utilization; identifies hot cores and NUMA imbalance (see the check below)
- `utilization_pct` – expressed as fractional cores (0.0..N_cores), not a percentage clamped to 100; more useful on multi-core hosts
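A hot-core check can be run straight against the JSON samples with `jq`; the `samples.jsonl` filename and the assumption that `per_core_pct` sits under the `cpu` object (as the process fields do) are illustrative, so adjust the paths to the real output:

```sh
# Flag samples where one logical core exceeds 95% while the whole tree
# uses fewer than two cores: a single-thread bottleneck signature.
# Field nesting and the samples.jsonl filename are assumptions.
jq 'select((.cpu.per_core_pct | max) > 95 and .cpu.process_cores_used < 2)
    | .cpu.per_core_pct' samples.jsonl
```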
#### Memory
- `available_mib` (`MemAvailable`) – free + reclaimable; a more reliable headroom estimate than `free_mib` on systems with large page caches
- `swap_total_mib`, `swap_used_mib`, `swap_used_pct` – swap pressure visible before OOM; Python omits swap entirely
- `active_mib` / `inactive_mib` – distinguish working-set pressure from cold cache
#### Disk
- Per-device, per-mount detail instead of three aggregated scalars; enables per-volume capacity tracking and per-device I/O attribution
- `device_type` (`nvme`, `ssd`, `hdd`), `model`, `vendor`, `serial` – correlate metrics with physical hardware without a separate `lsblk` call
- Per-device hardware sector size read from sysfs; correct byte counts on 4K-native drives where a hard-coded 512 would under-count I/O by 8x
#### Network
- Per-interval rates (`rx_bytes_per_sec`, `tx_bytes_per_sec`) in addition to cumulative totals; no client-side diff required
- `driver`, `operstate`, `speed_mbps`, `mtu` per interface; identify which NIC is under load and whether the link is running at full negotiated speed (see the check below)
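The rate and link-speed fields combine directly into a saturation check; the `.network[]` array layout and `.name` key assumed below are guesses at the schema, so verify against real output:

```sh
# Percent of negotiated line rate in use per interface:
# rx_bytes_per_sec * 8 gives bits/s; speed_mbps * 1e6 gives capacity in bits/s.
# Array layout and key names are assumptions about the JSON schema.
jq '.network[]? | select((.speed_mbps // 0) > 0)
    | {iface: .name?, rx_line_pct:
        ((.rx_bytes_per_sec // 0) * 8 * 100 / (.speed_mbps * 1e6))}' samples.jsonl
```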
#### GPU (NVIDIA and AMD)
Python emits no GPU metrics at all. Rust supports both NVIDIA (NVML) and AMD (ROCm/AMDGPU) accelerators via runtime dynamic loading, with no build-time driver dependencies. Additional fields beyond utilization and VRAM:
- `temperature_celsius` – detect thermal throttling in real time (see the example below)
- `power_watts` – power-efficiency analysis; watts-per-FLOP budgeting
- `frequency_mhz` – confirm boost clock is active; correlate with thermal state
- `uuid`, `name`, `host_id` – attribute metrics to specific devices in multi-GPU systems
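Thermal throttling suspects can be surfaced from the same samples; whether GPUs appear as an array under a `gpu` key is an assumption about the schema, and the 85 °C cut-off is arbitrary, chosen purely for illustration:

```sh
# List devices running hot, with enough identity to map the reading back
# to physical hardware. The gpu[] nesting and the threshold are assumptions.
jq '.gpu[]? | select(.temperature_celsius > 85)
    | {uuid, name, temperature_celsius, power_watts, frequency_mhz}' samples.jsonl
```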