Introduction
This file contains the initial specification/ideation of the resource-tracker-rs project.
Background
The resource-tracker Python package
was created in 2025 to track the resources used by long-running
DS/ML/AI jobs in the cloud and to recommend better cloud resource allocations.
It started as an experiment and resulted in the following features:
- Supports Linux, macOS, and Windows. No dependencies on Linux; requires psutil on other operating systems.
- Tracks CPU, memory, NVIDIA GPU and VRAM (even at the process level), disk usage, and network usage at the system and process level.
- Monitoring happens at a configurable interval (defaults to 1 second), and collects metrics to local (temp) CSV files.
- Overhead is negligible at the 1-second frequency, but the interval cannot go much lower without significant performance overhead.
- Computes aggregated statistics on the metrics (e.g. average and peak values).
- Recommends optimal cloud resource allocations based on the metrics.
- Recommends best-priced cloud servers for the given workload.
- Renders a local HTML report with all the metrics and recommendations.
- Has an R package wrapper for the same functionality.
- Integrates well with Metaflow.
While it worked well for Python and R, we also wanted a standalone tool that can be used as a CLI wrapper to track any process in any environment, and eventually be integrated back into the existing Python and R packages. The overall goal is to have a lightweight binary, compiled cross-platform, that can
- Track a wide range of resource utilization metrics locally – including CPU, memory, GPU and VRAM, disk usage, and network usage.
- Optionally stream these metrics to a remote server for centralized analysis, visualization, and further optimization.
This allows us to avoid embedding any complex logic in the binary and to focus on data collection and delivery, so that an accompanying free/commercial service can deliver the centralized visibility, recommendations, automation, and optimization – while keeping most of the ecosystem open-source and open to extension with other tools and services.
Data Collection
Discovery Tools
What worked great in the Python implementation was the ability to discover:
- The most important specs of the host machine, such as CPU core count, memory amount, etc.
- The cloud environment of the server (when available), such as vendor, region, and instance type.
These limited tools are implemented at
- https://github.com/SpareCores/resource-tracker/blob/main/src/resource_tracker/server_info.py
- https://github.com/SpareCores/resource-tracker/blob/main/src/resource_tracker/cloud_info.py
We are sure the hardware discovery could be improved further, and we aim to
collect at least the following (all prefixed with host_ in the data ingestion
endpoint):
- host_id (text): Unique identifier of the host machine, such as AWS EC2 instance ID or the server S/N.
- host_name (text): Hostname of the machine.
- host_ip (text): IP address of the machine.
- host_allocation (enum): If the server is dedicated to the monitored process, or shared with other processes.
- host_vcpus (int): Number of logical virtual CPU cores.
- host_cpu_model (text): Model of the CPU (e.g. from lscpu output).
- host_memory_mib (int): Amount of memory in MiB.
- host_gpu_model (text): Model of the GPU (e.g. from nvidia-smi output).
- host_gpu_count (int): Number of GPUs.
- host_gpu_vram_mib (int): Amount of VRAM in MiB.
- host_storage_gb (float): Amount of storage in GB.
All these fields are optional, and only collected when available. Users should be able to suppress any sensitive fields, such as the host IP address.
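As an illustrative sketch (not the actual implementation), the discovery can be strictly best-effort, emitting only the host_-prefixed fields that are resolvable on the current machine; collect_host_info is a hypothetical helper name:

```python
import os
import socket

def collect_host_info():
    """Best-effort host discovery: emit only the host_-prefixed fields we can resolve."""
    info = {}
    vcpus = os.cpu_count()
    if vcpus:
        info["host_vcpus"] = vcpus
    try:
        info["host_name"] = socket.gethostname()
    except OSError:
        pass
    try:
        # Linux-only: MemTotal in /proc/meminfo is reported in kB
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    info["host_memory_mib"] = int(line.split()[1]) // 1024
                    break
    except OSError:
        pass  # not Linux, or /proc unavailable: simply omit the field
    return info
```

Suppressing sensitive fields (e.g. the IP address) then amounts to dropping keys from this dict before submission.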
The cloud discovery is implemented via probing the Metadata server endpoints of
the supported cloud providers. We should try to get information about the
following fields (all using the cloud_ prefix in the data ingestion endpoint):
- cloud_vendor_id (text): The cloud provider’s id, mapped to the Spare Cores Navigator’s vendor table reference (e.g. aws).
- cloud_account_id (text): The cloud account id.
- cloud_region_id (text): The cloud region id, mapped to the Spare Cores Navigator’s region table reference (e.g. us-east-1).
- cloud_zone_id (text): The cloud zone id, mapped to the Spare Cores Navigator’s zone table reference (e.g. us-east-1a).
- cloud_instance_type (text): The cloud instance type, mapped to the Spare Cores Navigator’s server table’s api_reference field (e.g. t3a.nano).
Find the Spare Cores Navigator’s vendor, region, zone and server tables at https://github.com/SpareCores/sc-data-dumps/tree/main/data and schemas described at https://dbdocs.io/spare-cores/sc-crawler.
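As a sketch of the probing approach for one provider, AWS exposes these fields via the IMDSv2 instance-identity document (the endpoint and headers below are AWS's documented ones; the short timeout makes the probe cheap off-cloud, where the function simply returns an empty dict):

```python
import json
from urllib.request import Request, urlopen
from urllib.error import URLError

def probe_aws_imds(timeout=0.2):
    """Probe the AWS EC2 IMDSv2 metadata endpoint; return {} when not on AWS."""
    base = "http://169.254.169.254"
    try:
        # IMDSv2 requires a session token obtained via PUT
        tok_req = Request(f"{base}/latest/api/token", method="PUT",
                          headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
        token = urlopen(tok_req, timeout=timeout).read().decode()
        doc_req = Request(f"{base}/latest/dynamic/instance-identity/document",
                          headers={"X-aws-ec2-metadata-token": token})
        doc = json.load(urlopen(doc_req, timeout=timeout))
        return {
            "cloud_vendor_id": "aws",
            "cloud_account_id": doc.get("accountId"),
            "cloud_region_id": doc.get("region"),
            "cloud_zone_id": doc.get("availabilityZone"),
            "cloud_instance_type": doc.get("instanceType"),
        }
    except (URLError, OSError, ValueError):
        return {}  # not on AWS, or metadata service unreachable
```

Other providers (GCP, Azure, etc.) would need analogous probes against their own metadata endpoints, with the results mapped to the same cloud_ fields.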
Metrics to Track
The data ingestion endpoint is rather liberal and any arbitrary metric can be
tracked. The only restriction is that the submitted data needs to be a CSV file
with at least one column named timestamp, which should be a UNIX timestamp in
seconds.
All other columns are treated as metrics. We recommend storing machine-wide
metrics prefixed with system_ and the process-level metrics prefixed with
process_. If distinguishing between machine-wide and process-level metrics is
not feasible, metrics can be submitted without any prefix.
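A minimal sketch of a conforming batch, using the Python stdlib csv module (column names beyond the required timestamp are just examples following the recommended process_ prefix):

```python
import csv
import io

def write_batch(rows):
    """Serialize one batch of samples; 'timestamp' (UNIX seconds) is the
    only required column, everything else is treated as a metric."""
    buf = io.StringIO()
    fieldnames = ["timestamp", "process_cpu_usage", "process_memory_mib"]
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

batch = write_batch([
    {"timestamp": 1735689600, "process_cpu_usage": 1.42, "process_memory_mib": 812},
    {"timestamp": 1735689601, "process_cpu_usage": 1.38, "process_memory_mib": 815},
])
```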
Recommended column names for commonly tracked process-level metrics that are taken into consideration in the backend:
- children: The number of child processes.
- utime: The total user+nice mode CPU time in seconds.
- stime: The total system mode CPU time in seconds.
- cpu_usage: The current CPU usage between 0 and the number of CPUs.
- memory_mib: Current memory usage in MiB. Preferably PSS (Proportional Set Size) on Linux, falling back to RSS (Resident Set Size).
- disk_read_bytes: The total number of bytes read from disk.
- disk_write_bytes: The total number of bytes written to disk.
- gpu_usage: The current GPU utilization between 0 and GPU count.
- gpu_vram_mib: The current GPU memory used in MiB.
- gpu_utilized: The number of GPUs with utilization > 0.
Recommended column names for commonly tracked machine-wide metrics that are taken into consideration in the backend:
- processes: The number of running processes.
- utime: The total user+nice mode CPU time in seconds.
- stime: The total system mode CPU time in seconds.
- cpu_usage: The current CPU usage between 0 and the number of CPUs.
- memory_free_mib: The amount of free memory in MiB.
- memory_used_mib: The amount of used memory in MiB.
- memory_buffers_mib: The amount of memory used for buffers in MiB.
- memory_cached_mib: The amount of memory used for caching in MiB.
- memory_active_mib: The amount of memory used for active pages in MiB.
- memory_inactive_mib: The amount of memory used for inactive pages in MiB.
- disk_read_bytes: The total number of bytes read from all disks.
- disk_write_bytes: The total number of bytes written to all disks.
- disk_space_total_gb: The total disk space in GB.
- disk_space_used_gb: The used disk space in GB.
- disk_space_free_gb: The free disk space in GB.
- net_recv_bytes: The total number of bytes received over network.
- net_sent_bytes: The total number of bytes sent over network.
- gpu_usage: The current GPU utilization between 0 and GPU count.
- gpu_vram_mib: The current GPU memory used in MiB.
- gpu_utilized: The number of GPUs with utilization > 0.
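Note that utime/stime are cumulative counters while cpu_usage is an instantaneous rate; a sketch of how a collector could derive the latter from two consecutive samples of the former (function and field names as used in the tables above):

```python
def cpu_usage(prev, curr, elapsed_seconds):
    """Derive cpu_usage (in cores, 0..N) from two samples of the
    cumulative utime/stime counters taken elapsed_seconds apart."""
    busy = (curr["utime"] - prev["utime"]) + (curr["stime"] - prev["stime"])
    return busy / elapsed_seconds

# Two samples 2 seconds apart during which 3 s of CPU time was burned:
usage = cpu_usage({"utime": 100.0, "stime": 20.0},
                  {"utime": 102.5, "stime": 20.5}, 2.0)  # 1.5 cores busy
```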
No other metrics are officially supported by the backend at the moment, but the user can submit any arbitrary values (even strings!) for future use.
Wishlist for future metrics:
- CPU saturation and efficiency metrics:
  - Load average (1m)
  - L1/L2/L3 cache hit rate
  - TLB miss rate
  - Major page faults
  - iowait
  - IPC (Instructions Per Cycle)
  - Context switches
- GPU saturation and efficiency metrics:
  - PCIe TX and RX throughput, NVLink throughput + theoretical max throughput (e.g. nvidia-smi nvlink -c)
  - Power usage (W)
  - Temperature (C)
- Disk saturation and efficiency metrics:
  - Disk latency (ms)
  - Disk queue length
Overall, we are looking for metrics that can help identify potential bottlenecks and find better cloud servers for the monitored workload.
Metadata
We also want to support collecting the following metadata about the monitored process:
- pid (int): The process ID.
- container_image (text): The container image, including optional tag.
- command (json): JSON array of the command and its arguments.
- env (text): The environment (e.g. dev or prod).
- language (text): The language of the process (e.g. python or r).
- orchestrator (text): The orchestrator of the process (e.g. metaflow).
- executor (text): The executor of the process (e.g. k8s).
- team (text): The team of the process.
- project_name (text): The project name of the process.
- job_name (text): The job name of the process (e.g. flow in metaflow, workflow in flyte).
- stage_name (text): The stage name of the process (e.g. step in metaflow, node in flyte).
- task_name (text): The task name of the process (e.g. task both in metaflow and flyte).
- external_run_id (text): The external run id of the process (e.g. Jenkins build number – internal to the orchestrator).
Most of these fields (all except perhaps command) are to be provided
voluntarily and manually by the user (or the job orchestrator) and should be optional.
Privacy and security concerns are addressed in the public service’s legal docs.
The user should be also able to provide any ad-hoc key-value pairs (tags) for tracking purposes.
Status
The data ingestion endpoint automatically captures the start and end time of the process, and calculates the duration in seconds. It also captures user and organization information based on the user’s credentials. Once a job is finished, statistics and recommendations are calculated and stored in a database, made available to the user via a web interface, API, and potentially via the CLI tool as well in the future.
The CLI tool needs to collect the following fields and pass them to the data ingestion endpoint:
- exit_code (int): The exit code of the process.
- run_status (enum): The status of the run (e.g. success, failure, etc).
Data Streaming
To authenticate with the data ingestion API endpoint, the Resource Tracker needs
to use a long-lived API token set by the user in the SENTINEL_API_TOKEN
environment variable. This needs to be passed as the Authorization header with
the value Bearer <token>.
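A minimal sketch of building that header from the environment (auth_headers is a hypothetical helper name):

```python
import os

def auth_headers():
    """Build the Authorization header from SENTINEL_API_TOKEN; fail early if unset."""
    token = os.environ.get("SENTINEL_API_TOKEN")
    if not token:
        raise RuntimeError("SENTINEL_API_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}
```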
At the start of the Resource Tracker, hit the data ingestion endpoint to
register the start of a Run along with the following optional parameters:
- metadata (e.g. project_name etc.)
- server and cloud discovery information (e.g. number of CPUs and/or actual instance type)
The response contains:
- run_id: Should be stored until the end of the run, as all future API calls will need to reference it.
- upload_uri_prefix: An S3 URI prefix to upload the metrics to.
- upload_credentials: Temporary AWS STS session credentials for upload authentication, including an expires_at timestamp.
Then the Resource Tracker should start a background thread (or similar solution)
to upload collected metrics in batches (e.g. every 1 minute) as new objects
under the upload_uri_prefix as gzipped CSV files. The Resource Tracker should
also keep track of the uploaded URIs.
When the temporary upload credentials expire, the Resource Tracker should hit the data ingestion endpoint to refresh the credentials.
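A sketch of the session state the tracker could hold between uploads, with a proactive expiry check so credentials are refreshed slightly before the STS session lapses (the class and a UNIX-seconds expires_at are assumptions, not the actual data contract):

```python
import time
from dataclasses import dataclass

@dataclass
class UploadSession:
    run_id: str
    upload_uri_prefix: str     # S3 URI prefix for batch objects
    credentials: dict          # temporary STS credentials, incl. "expires_at" (UNIX s)

    def needs_refresh(self, margin_seconds=120):
        """Refresh shortly *before* expiry so an in-flight upload never fails."""
        return self.credentials["expires_at"] - time.time() < margin_seconds
```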
When the tracked process finishes, the Resource Tracker should hit the data ingestion endpoint to register the end of the run. This takes:
- The run_id,
- The status of the run (e.g. success, failure, etc.) along with an optional exit_code as described above,
- And either the list of the uploaded URIs as data_uris along with data_source set to s3, or if no S3 uploads happened yet (e.g. a short-duration run), then the CSV file as data_csv along with data_source set to local.
The endpoint processes the data synchronously and returns statistics.
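The branching between the two data sources can be sketched as follows (field names are the ones listed above; the helper itself is illustrative):

```python
def finish_payload(run_id, run_status, exit_code=None,
                   uploaded_uris=None, csv_text=None):
    """Close a run: reference S3 objects if any batches were uploaded,
    otherwise inline the CSV collected so far."""
    payload = {"run_id": run_id, "run_status": run_status}
    if exit_code is not None:
        payload["exit_code"] = exit_code
    if uploaded_uris:
        payload["data_source"] = "s3"
        payload["data_uris"] = uploaded_uris
    else:  # short run, nothing uploaded yet
        payload["data_source"] = "local"
        payload["data_csv"] = csv_text
    return payload
```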
More Details
Find the data ingestion API endpoints docs at https://api.sentinel.sparecores.net/docs, including the data contracts and API references.
Rationale
resource-tracker is a Rust rewrite of the Python
resource-tracker library.
It preserves full CSV column parity with the Python implementation while adding
new capabilities that are difficult or impossible to express in the original.
Why Rust
| Property | Python resource-tracker | resource-tracker |
|---|---|---|
| Runtime dependency | Python interpreter + psutil | Single static binary |
| Startup overhead | ~200-500 ms | < 5 ms |
| Observer CPU overhead | ~0.5-1% per core | < 0.1% per core |
| Memory footprint | ~30-60 MiB (interpreter) | ~2-4 MiB |
| Deployment | pip / uv install | Copy binary |
The lower observer overhead matters when tracking short-lived or CPU-intensive workloads where the tracker itself would otherwise appear in the numbers it is collecting.
New user-facing functionality
Shell-wrapper mode
./resource-tracker --interval 1 -- python train.py --epochs 50
Pass any command after -- and the tracker spawns it, sets --pid
automatically, emits one final sample on exit, and forwards the child’s
exit code. This eliminates the two-process boilerplate
(tracker & child; wait) and makes the tracker transparent to CI systems
and schedulers that check exit codes.
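The wrapper semantics boil down to spawn, sample, and forward. A stripped-down Python sketch of that control flow (the Rust binary's actual sampling loop is elided as a comment):

```python
import subprocess
import sys

def run_wrapped(argv, interval=1.0):
    """Spawn the tracked command; a real tracker would sample the child's
    process tree every `interval` seconds here. Forward the child's exit code."""
    child = subprocess.Popen(argv)
    # ... sampling loop polling child.pid at `interval` would go here ...
    return child.wait()

# Forwarding the exit code keeps the wrapper transparent to CI and schedulers.
code = run_wrapped([sys.executable, "-c", "import sys; sys.exit(7)"])
```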
Full process tree tracking (--pid)
Python’s SystemTracker attributes CPU ticks only to the root process.
Rust walks the full /proc tree and sums every descendant (workers,
threads, MPI ranks, Spark executors) under the given root PID. Two fields
appear in every JSON sample when --pid is active:
- cpu.process_cores_used – fractional cores consumed by the whole tree
- cpu.process_child_count – live descendant count at each sample
Sentinel API streaming and S3 upload
When SENTINEL_API_TOKEN is set, the tracker registers the run, streams
gzip-compressed CSV batches to S3 every TRACKER_UPLOAD_INTERVAL seconds
(default 60), and posts a finish_run call on clean exit. No network
connections are made when the token is absent.
TOML config file + environment variable overrides
All settings (interval, job name, PID, metadata) can be persisted in a
resource-tracker.toml file alongside the job definition. Every field
also has a TRACKER_* environment variable override, which is convenient
for containerized or CI environments where config files are impractical.
Richer metrics (JSON superset)
The CSV output matches Python column-for-column. The JSON output carries additional fields not expressible as Python CSV scalars.
CPU
- per_core_pct[] – per-logical-core utilization; identifies hot cores and NUMA imbalance
- utilization_pct – expressed as fractional cores (0.0..N_cores), not a percentage clamped to 100; more useful on multi-core hosts
Memory
- available_mib (MemAvailable) – free + reclaimable; a more reliable headroom estimate than free_mib on systems with large page caches
- swap_total_mib, swap_used_mib, swap_used_pct – swap pressure visible before OOM; Python omits swap entirely
- active_mib / inactive_mib – distinguish working-set pressure from cold cache
Disk
- Per-device, per-mount detail instead of three aggregated scalars; enables per-volume capacity tracking and per-device I/O attribution
- device_type (nvme, ssd, hdd), model, vendor, serial – correlate metrics with physical hardware without a separate lsblk call
- Per-device hardware sector size read from sysfs; correct byte counts on 4K-native drives where a hard-coded 512 would under-count I/O by 8x
Network
- Per-interval rates (rx_bytes_per_sec, tx_bytes_per_sec) in addition to cumulative totals; no client-side diff required
- driver, operstate, speed_mbps, mtu per interface; identify which NIC is under load and whether the link is running at full negotiated speed
GPU (NVIDIA and AMD)
Python emits no GPU metrics at all. Rust supports both NVIDIA (NVML) and AMD (ROCm/AMDGPU) accelerators via runtime dynamic loading, with no build-time driver dependencies. Additional fields beyond utilization and VRAM:
- temperature_celsius – detect thermal throttling in real time
- power_watts – power-efficiency analysis; watts-per-FLOP budgeting
- frequency_mhz – confirm boost clock is active; correlate with thermal state
- uuid, name, host_id – attribute metrics to specific devices in multi-GPU systems
Open-Source Resource Monitoring Landscape
Competitive Analysis for resource-tracker (SpareCores)
Prepared: March 25, 2026
Context: Phase 1 feasibility assessment for a Rust/Linux CLI implementation of ResourceTracker
Reference tool: https://github.com/SpareCores/resource-tracker
Executive Summary
resource-tracker occupies a specific and underserved niche: a lightweight, zero-dependency, batch-job-oriented process + system resource monitor with workflow framework integration (Metaflow), visualization via cards, and cloud server recommendations. The open-source landscape has many partial overlaps but no single tool matches all its characteristics simultaneously.
The tools below are organized into meaningful categories. Most tools are either:
- Too low-level (profilers that require code instrumentation or produce flame graphs rather than time-series resource logs)
- Too heavy (system daemons, full observability stacks)
- Too narrow (single-resource: CPU only, or memory only, or GPU only)
- Not batch-job oriented (designed for long-running services, not scripts that run and exit)
Category 1: Python Libraries for Process/System Resource Monitoring
These are the closest functional analogues to resource-tracker in the Python ecosystem.
1.1 psutil
- URL: https://github.com/giampaolo/psutil
- Language: Python (C extension)
- Description: The foundational library for cross-platform system/process information in Python.
resource-tracker itself uses psutil as an optional backend on non-Linux systems. psutil retrieves CPU, memory, disk, network, and process-level data programmatically but provides no time-series tracking, no decorator/wrapper API, no visualization, and no batch job reporting.
- Key features: CPU %, memory (RSS/PSS/USS/VMS), per-process I/O, network I/O, disk usage, process tree traversal. Cross-platform (Linux, macOS, Windows).
- Difference: Raw data API only. No tracking loop, no reports, no workflow integration. It is a building block, not a solution.
1.2 memory_profiler
- URL: https://github.com/pythonprofilers/memory_profiler
- Language: Python
- Description: Line-by-line memory usage profiler for Python scripts. Uses the @profile decorator and the mprof CLI to record memory usage over time and plot it. Built on psutil.
- Key features: Line-level memory profiling, time-series memory plot via mprof, @profile decorator, memory_usage() API.
- Difference: Memory only (no CPU, GPU, disk, network). Requires code instrumentation for line-level profiling. Targeted at developers finding memory leaks, not at batch job operators seeking resource utilization logs.
1.3 Scalene
- URL: https://github.com/plasma-umass/scalene
- Language: Python + C++
- Description: High-performance, high-precision CPU, GPU, and memory profiler for Python. Uniquely profiles CPU time, GPU time, and memory at the line level simultaneously. Includes AI-powered optimization suggestions and an interactive web UI.
- Key features: Line-level CPU + GPU + memory profiling, separates Python vs native time, web-based interactive report, minimal overhead (~10-20%).
- Difference: A developer profiler (find bottlenecks in code), not a resource utilization logger for batch jobs. Does not track network or disk I/O, does not integrate with workflow tools, does not produce time-series utilization logs for operational use.
1.4 Memray
- URL: https://github.com/bloomberg/memray
- Language: Python + C++
- Description: Bloomberg’s memory profiler for Python. Tracks every allocation in Python, native extensions, and the interpreter itself. Produces flame graphs, heap charts, and other visualizations.
- Key features: Full allocation tracking (Python + C/C++), flame graphs, live mode, Jupyter integration, reporter API.
- Difference: Memory only, developer-oriented (find leaks/hotspots in code). Does not track CPU, GPU, disk, or network. Not designed for batch job monitoring.
1.5 Fil (filprofiler)
- URL: https://github.com/pythonspeed/filprofiler
- Language: Python + Rust
- Description: Memory profiler from pythonspeed targeting data scientists and scientific computing. Finds peak memory usage and identifies what code caused the peak. Produces flame graphs.
- Key features: Peak memory tracking (captures C and Python allocations), flame graphs, designed for NumPy/Pandas workloads, CLI usage.
- Difference: Memory only, developer-oriented. No CPU, GPU, disk, network. Produces offline profiling reports, not operational time-series logs.
1.6 pyinstrument
- URL: https://github.com/joerick/pyinstrument
- Language: Python
- Description: Sampling call-stack profiler for Python. Samples the call stack every 1ms and shows a readable summary of where time is spent. Supports context manager and decorator API.
- Key features: Low-overhead sampling, context manager (with Profiler()), decorator, CLI, HTML/text/JSON output, async support.
- Difference: CPU time only (call stack), no memory/GPU/disk/network. Developer-oriented (why is code slow?), not a resource utilization monitor.
1.7 py-spy
- URL: https://github.com/benfred/py-spy
- Language: Rust
- Description: Sampling profiler for Python programs written in Rust. Attaches to a running Python process without modifying it. Can generate flame graphs or a top-like display.
- Key features: Attaches to running process (no code changes), flame graphs, top-like live view, very low overhead, works across OS.
- Difference: CPU only (call stack). No memory, GPU, disk, or network tracking. Attach-to-process model differs from resource-tracker’s wrap-a-job model.
1.8 Austin
- URL: https://github.com/P403n1x87/austin
- Language: C
- Description: Python frame stack sampler for CPython. Samples the Python interpreter’s memory space directly to retrieve running thread stacks. Extremely low overhead.
- Key features: Zero-instrumentation, pure C, very low overhead, multi-thread and multi-process support, output compatible with flame graph tools.
- Difference: CPU/call stack profiling only. No resource utilization metrics (memory, GPU, disk, network).
1.9 Glances
- URL: https://github.com/nicolargo/glances
- Language: Python
- Description: Cross-platform system monitoring tool with a rich curses/web UI. Shows CPU, memory, disk, network, process list, temperatures, GPU (via plugin), Docker containers, and more. Can export data to InfluxDB, CSV, Prometheus, etc.
- Key features: Real-time monitoring, web UI, REST API, exporters (InfluxDB, Prometheus, CSV, JSON), Docker/container awareness, GPU plugin, cross-platform (Linux, macOS, Windows, BSD).
- Difference: A long-running system monitor daemon/interactive tool, not designed to wrap a batch job, produce a per-job report, or integrate with workflow frameworks. No job-level summary reports.
1.10 nvitop
- URL: https://github.com/XuehaiPan/nvitop
- Language: Python
- Description: Interactive NVIDIA GPU process viewer with a rich terminal UI. Goes beyond nvidia-smi by showing per-process GPU/VRAM usage in real time, and supports programmatic API access.
- Key features: Per-process GPU utilization and VRAM, process tree, interactive kill/signal, rich terminal UI, Python API (ResourceMetricCollector).
- Difference: GPU-only (NVIDIA). Covers system- and process-level GPU metrics well. Its ResourceMetricCollector API is a meaningful overlap with resource-tracker for GPU tracking. No CPU/memory/disk/network integration.
1.11 gpustat
- URL: https://github.com/wookayin/gpustat
- Language: Python
- Description: Simple command-line utility for querying and monitoring NVIDIA GPU status. Aggregates nvidia-smi output with color-coded display. Supports --watch mode.
- Key features: GPU utilization, VRAM usage, temperature, power draw, per-process GPU use, JSON output, watch mode.
- Difference: NVIDIA GPU only, read-only display tool, no time-series logging, no CPU/memory/disk/network.
1.12 pynvml / nvidia-ml-py
- URL: https://github.com/gpuopenanalytics/pynvml
- Language: Python (NVML binding)
- Description: Python bindings for NVIDIA’s NVML C library, enabling programmatic GPU diagnostics. Used as a building block by gpustat, nvitop, and resource-tracker itself.
- Key features: Full NVML API access: GPU utilization, VRAM, temperature, power, clock speed, process-level GPU usage, fan speed.
- Difference: Raw API, no tracking loop, no reporting. A building block.
1.13 CodeCarbon
- URL: https://github.com/mlco2/codecarbon
- Language: Python
- Description: Tracks CPU, GPU, and RAM energy consumption and converts it to estimated CO2 emissions. Designed for ML training runs. Provides decorator and context manager APIs.
- Key features: @track_emissions decorator, context manager, estimates CO2 equivalent, per-run reporting, dashboard, supports Intel RAPL and NVML.
- Difference: Focused on energy/carbon footprint rather than raw resource utilization metrics. Does not track disk I/O or network. Closest in UX philosophy (decorator for batch scripts) but different output goal.
1.14 CarbonTracker
- URL: https://github.com/lfwa/carbontracker
- Language: Python
- Description: Tracks and predicts energy consumption and carbon footprint of deep learning model training. Can stop training when predicted impact exceeds a threshold.
- Key features: Predictive carbon footprint, supports GPU and CPU energy, training-run oriented, can send alerts.
- Difference: Energy/carbon focused, ML training specific, no disk/network tracking.
1.15 pyRAPL
- URL: https://github.com/powerapi-ng/pyRAPL
- Language: Python
- Description: Measures energy consumption of Python code using Intel RAPL (Running Average Power Limit) hardware counters. Provides decorator and context manager APIs.
- Key features: CPU socket, DRAM, and integrated GPU energy measurement, decorator and with-block APIs, per-domain granularity.
- Difference: Intel RAPL only (Intel CPUs since Sandy Bridge); measures energy, not utilization percentage; no GPU computation metrics, no disk/network.
1.16 pyJoules
- URL: https://github.com/powerapi-ng/pyJoules
- Language: Python
- Description: Captures energy consumption of code snippets using Intel RAPL and NVIDIA NVML. Provides decorator and context manager APIs with breakpoints.
- Key features: Multi-device energy capture (CPU, DRAM, NVIDIA GPU), decorator API, MongoDB and Pandas export handlers.
- Difference: Energy measurement, not utilization tracking. Requires Intel RAPL-capable hardware.
1.17 PowerAPI
- URL: https://github.com/powerapi-ng/powerapi
- Language: Python
- Description: Middleware framework for building software-defined power meters. Estimates power at process, container, VM, or application level. Can use hardware counters or performance counters.
- Key features: Pluggable sensors and estimators, multiple granularity levels (process, container, VM), real-time power estimation.
- Difference: Power/energy framework requiring configuration and sensor setup. Not a drop-in decorator for batch jobs.
1.18 eco2AI
- URL: https://github.com/sb-ai-lab/eco2AI
- Language: Python
- Description: Tracks carbon emissions while training/inferring Python ML models. Accounts for CPU, GPU, and RAM energy consumption.
- Key features: @track_emissions decorator, real-time emission monitoring, CSV reporting.
- Difference: Carbon/energy focus, similar decorator pattern to resource-tracker, no disk/network.
1.19 pyperf
- URL: https://github.com/psf/pyperf
- Language: Python
- Description: Python Software Foundation toolkit for writing and running benchmarks. Includes memory tracking (--track-memory, --tracemalloc) as part of benchmark metadata collection.
- Key features: Benchmark calibration, worker process management, memory peak tracking, JSON results, statistical analysis.
- Difference: Benchmarking framework, not a general resource monitor. Memory tracking is incidental to benchmarking.
1.20 ClearML
- URL: https://github.com/clearml/clearml
- Language: Python
- Description: Open-source MLOps platform. Automatically tracks GPU, CPU, memory, and network metrics during ML experiment runs. Provides an experiment tracker, data manager, orchestrator, and more.
- Key features: Automatic system metric logging (GPU, CPU, memory, network), experiment tracking, model registry, pipeline orchestration, web UI.
- Difference: Full MLOps platform (not a lightweight library). Requires a ClearML server. Targets ML experiments rather than general batch jobs.
1.21 python-resmon
- URL: https://github.com/xybu/python-resmon
- Language: Python
- Description: Lightweight resource monitor that records CPU usage, RAM usage, disk I/O, and NIC speed, outputting data in CSV format for post-processing.
- Key features: CSV output, configurable polling interval, system-level metrics, easy post-processing.
- Difference: System-level only (no per-process tracking), no GPU, no visualization, no workflow integration. Small utility script rather than a library.
Category 2: Interactive Terminal Monitors (System-Level)
These tools provide real-time visual monitoring of system resources. They do not produce per-job reports or integrate with batch workflows, but they are widely used for manual resource observation.
2.1 htop
- URL: https://github.com/htop-dev/htop
- Language: C
- Description: Interactive process viewer and system monitor. The modern replacement for top. Shows per-CPU usage, memory, swap, and a process list with tree view.
- Key features: Interactive (kill, renice, filter), color-coded per-CPU bars, tree view, mouse support, cross-platform.
- Difference: Interactive visual tool only. No data capture, no time-series, no batch job integration.
2.2 btop / btop++
- URL: https://github.com/aristocratos/btop
- Language: C++
- Description: Advanced terminal resource monitor. Third generation of bashtop->bpytop->btop++. Shows CPU, memory, disk I/O, network, and process list with rich ASCII art graphs.
- Key features: Responsive UI, mouse support, GPU support (Nvidia/AMD/Intel via plugins), disk I/O, network I/O, process filtering, themes.
- Difference: Interactive visual tool only. No data export, no batch job tracking.
2.3 bpytop
- URL: https://github.com/aristocratos/bpytop
- Language: Python
- Description: Python predecessor to btop++. Linux/macOS/FreeBSD resource monitor with animated ASCII graphs.
- Key features: CPU, memory, disk, network, process list, ASCII graphs.
- Difference: Interactive visual tool. Superseded by btop++.
2.4 bashtop
- URL: https://github.com/aristocratos/bashtop
- Language: Bash
- Description: Original Bash-based resource monitor from the same developer. Ancestor of bpytop and btop++.
- Key features: CPU, memory, disk, network, process monitoring in pure Bash.
- Difference: Superseded by btop++. Interactive visual only.
2.5 glances (see 1.9 above)
- Interactive + exportable, see Category 1 entry.
2.6 atop
- URL: https://github.com/Atoptool/atop
- Language: C
- Description: Advanced interactive system and process monitor for Linux. Records all system activity and writes to binary log files for later replay/analysis. Integrates with atopsar for historical reporting.
- Key features: Full system activity logging (CPU, memory, disk, network, process), persistent binary logs, replay mode, atopsar for reporting.
- Difference: Long-running daemon for system-wide logging. Not designed to wrap a specific job; tracks the whole system. Closest among CLI tools to providing historical per-process data.
2.7 nmon (Nigel’s Monitor)
- URL: http://nmon.sourceforge.net/
- Language: C
- Description: Performance monitoring tool for AIX and Linux. Provides real-time view and can capture data to CSV for later analysis with nmon Analyser.
- Key features: CPU, memory, disk I/O, network, filesystem, processes; CSV capture mode, lightweight.
- Difference: System-wide monitor. No batch job integration or workflow decorator. The CSV output mode is useful for offline analysis.
2.8 collectl
- URL: http://collectl.sourceforge.net/
- Language: Perl
- Description: Collects a broad set of Linux system statistics (CPU, memory, network, disk, inodes, processes, NFS, TCP, sockets) and can write to files, print to stdout, or feed to Graphite/ganglia.
- Key features: Wide metric coverage, multiple output formats (CSV, plot, etc.), daemon or one-shot mode.
- Difference: System-wide collection daemon. No batch job wrapping, no workflow integration.
2.9 sysstat (sar/sadc/sadf/iostat/pidstat/mpstat)
- URL: https://github.com/sysstat/sysstat
- Language: C
- Description: Collection of Linux performance monitoring utilities. `sar` collects and reports system activity historically; `pidstat` reports per-process CPU, memory, and I/O; `iostat` reports disk I/O; `sadc` is the backend data collector.
- Key features: Historical data collection, per-process stats via `pidstat`, JSON/CSV/XML output via `sadf`, schedulable via cron/systemd, very low overhead.
- Difference: System and process monitoring utilities, not designed for batch job wrapping. `pidstat` is the closest to per-job process monitoring but requires manual invocation.
2.10 nvtop
- URL: https://github.com/Syllo/nvtop
- Language: C
- Description: (h)top-like task monitor for GPUs and accelerators. Supports AMD, Apple M1/M2 (limited), Huawei Ascend, Intel, NVIDIA, Qualcomm, Broadcom, Rockchip.
- Key features: Multi-GPU and multi-vendor support, real-time GPU/VRAM utilization, per-process GPU use, interactive UI.
- Difference: GPU-focused interactive monitor. No data export, no CPU/memory/disk/network integration.
2.11 vtop
- URL: https://github.com/MrRio/vtop
- Language: JavaScript (Node.js)
- Description: Graphical terminal activity monitor with Unicode braille charts. Groups processes sharing the same name (e.g., NGINX master + workers).
- Key features: ASCII charts, process grouping, extensible via plugins.
- Difference: Interactive visual only, no data capture. Note: project appears unmaintained.
2.12 Netdata
- URL: https://github.com/netdata/netdata
- Language: C (agent core)
- Description: Real-time performance monitoring with per-second metrics and a powerful web UI. 800+ integrations. Most-starred monitoring project on GitHub (76k+ stars).
- Key features: Per-second metrics, web dashboard, alerts, ML anomaly detection, 800+ integrations (Docker, Kubernetes, StatsD, OpenMetrics), process-level metrics, GPU plugins.
- Difference: Full-stack observability daemon. Requires installation as a service. Not designed for wrapping a batch job.
Category 3: eBPF / Kernel-Level Tracing Tools
These tools use Linux eBPF (extended Berkeley Packet Filter) for highly efficient, zero-instrumentation tracing deep in the kernel. Most relevant for system-level visibility with very low overhead.
3.1 BCC (BPF Compiler Collection)
- URL: https://github.com/iovisor/bcc
- Language: C + Python/Lua frontends
- Description: Toolkit for creating efficient kernel tracing and manipulation programs using eBPF. Includes ready-made tools (execsnoop, biolatency, tcplife, memleak, etc.) and a framework for writing custom eBPF programs with Python frontends.
- Key features: Kernel + userspace tracing, network/disk/memory/CPU tools, Python API for custom programs, very low overhead.
- Difference: Requires kernel support (Linux 4.1+), root privileges, and knowledge of eBPF to build custom tools. Not a drop-in batch job monitor.
3.2 bpftrace
- URL: https://github.com/bpftrace/bpftrace
- Language: C++ (awk/DTrace-like scripting language)
- Description: High-level tracing language for Linux eBPF. Write concise one-liners or short scripts for ad-hoc analysis.
- Key features: High-level scripting, LLVM backend, supports tracepoints, kprobes, uprobes, usdt. One-liner analysis.
- Difference: Ad-hoc kernel tracing tool. Requires root and kernel support. Not designed for operational batch job monitoring.
3.3 Parca / Parca Agent
- URL: https://github.com/parca-dev/parca
- Language: Go
- Description: Continuous profiling for CPU and memory usage, down to the line number and throughout time. Parca Agent is an eBPF-based always-on profiler with Kubernetes auto-discovery. Uses pprof format.
- Key features: Zero-instrumentation eBPF profiling, <1% overhead, continuous collection, icicle graph UI, SQL-queryable profile storage, multi-language support.
- Difference: Continuous profiling infrastructure (runs as a DaemonSet on Kubernetes nodes). Not a per-job wrapper. Heavy infrastructure requirement.
3.4 Pyroscope (Grafana)
- URL: https://github.com/grafana/pyroscope
- Language: Go
- Description: Continuous profiling database and platform (formed from merger of Phlare + Pyroscope). Stores profiling data from applications instrumented with Pyroscope SDKs or from eBPF agents. Integrates with Grafana.
- Key features: SDK-based push profiling (Python, Go, Java, Ruby, .NET, Rust, PHP, Node.js), eBPF pull mode, flame graphs, Grafana integration, scalable storage.
- Difference: Continuous profiling infrastructure. Requires a server and SDK integration. Not a lightweight batch job wrapper.
Category 4: Linux Performance Profiling Tools (C/C++/Native)
These tools profile native code at a low level. Most are developer-focused profilers rather than operational monitors.
4.1 perf (Linux perf_events)
- URL: https://perfwiki.github.io/main/
- Language: C (Linux kernel subsystem)
- Description: The primary Linux performance tool. Samples CPU events using hardware performance counters, traces system calls, and instruments kernel/userspace functions. Foundation for many other tools.
- Key features: Hardware counter sampling, call graph recording, per-process and system-wide, flame graph generation (via FlameGraph scripts), supports all architectures.
- Difference: Low-level developer profiler. Requires root for many features. No time-series resource logging, no workflow integration.
4.2 FlameGraph
- URL: https://github.com/brendangregg/FlameGraph
- Language: Perl
- Description: Stack trace visualization toolkit by Brendan Gregg. Generates SVG flame graphs from perf, DTrace, SystemTap, and other profiler output.
- Key features: CPU, memory, and off-CPU flame graphs, works with many backends.
- Difference: Visualization tool for profiler output, not a monitoring tool itself.
4.3 gperftools (Google Performance Tools)
- URL: https://github.com/gperftools/gperftools
- Language: C++
- Description: Collection from Google: fast malloc (TCMalloc), CPU profiler, heap profiler, and heap checker. Used via `LD_PRELOAD` or explicit linking.
- Key features: CPU profiling (sampling), heap profiling, heap leak detection, pprof visualization, multi-threaded support.
- Difference: Developer profiler requiring code linking or LD_PRELOAD. No time-series operational monitoring, no disk/network/GPU.
4.4 Valgrind / Massif / Callgrind
- URL: https://valgrind.org/
- Language: C
- Description: Instrumentation framework for building dynamic analysis tools. Massif is its heap profiler; Callgrind is its call graph profiler; Memcheck is its memory error detector.
- Key features: Complete heap tracking, memory leak detection, call graph analysis, massif-visualizer GUI.
- Difference: High-overhead instrumentation (10-50x slowdown). Developer tool, not operational monitor. No GPU, disk, or network metrics.
4.5 Heaptrack
- URL: https://github.com/KDE/heaptrack
- Language: C++ + Python
- Description: Fast heap memory profiler for Linux, designed as a faster, lower-overhead alternative to Valgrind/Massif. Traces all allocations and annotates with stack traces.
- Key features: Lower overhead than Valgrind, flame graph output, heaptrack_gui for visualization, finds memory leaks and allocation hotspots.
- Difference: Memory only, developer profiler. No GPU, CPU utilization, disk, or network.
4.6 Perfetto
- URL: https://github.com/google/perfetto
- Language: C++
- Description: Google’s open-source production-grade system profiling and tracing tool. Default tracing system for Android and used in Chromium. Can capture CPU scheduling, memory, I/O, GPU events, and custom trace points.
- Key features: Multi-process system trace, SQL-based analysis, browser-based UI, heap profiling (heapprofd), CPU frequency and scheduling, Android + Linux support.
- Difference: Complex tracing infrastructure primarily targeting Android/embedded and browser use cases. Not a lightweight batch job wrapper.
4.7 async-profiler
- URL: https://github.com/async-profiler/async-profiler
- Language: C (JVM agent)
- Description: Low-overhead sampling CPU and heap profiler for JVM (Java/Kotlin/Scala/Clojure). Uses AsyncGetCallTrace + perf_events to avoid safepoint bias.
- Key features: CPU + heap sampling, flame graphs, JFR files, tracks native + JVM code, suitable for production.
- Difference: JVM-specific. No Python/R/general process monitoring. No disk, network, or GPU.
4.8 TAU (Tuning and Analysis Utilities)
- URL: https://www.cs.uoregon.edu/research/tau/home.php
- Language: C++ (with Python, Fortran, Java support)
- Description: Comprehensive profiling and tracing toolkit for HPC parallel programs (MPI, OpenMP, CUDA). Supports hardware counters, GPU profiling, and generates call graphs.
- Key features: Parallel program profiling (MPI, OpenMP), hardware counters, GPU support, ParaProf visualization, call graph.
- Difference: HPC research tool for parallel program performance analysis. Complex setup, not a lightweight batch job wrapper.
4.9 HPCToolkit
- URL: https://hpctoolkit.org/
- Language: C/C++
- Description: Sampling-based measurement and analysis suite for HPC programs on CPUs and GPUs, designed to scale to supercomputers.
- Key features: 1-5% overhead sampling, full calling context, hpcviewer GUI, GPU support.
- Difference: HPC research tool, complex setup, not designed for general batch jobs or Python/R scripts.
Category 5: Rust Tools
5.1 below (Facebook/Meta)
- URL: https://github.com/facebookincubator/below
- Language: Rust
- Description: Time-traveling resource monitor for modern Linux systems. Records system activity to disk and allows replay of historical data. Cgroup-aware with PSI (Pressure Stall Information) support.
- Key features: Record + replay mode, cgroup hierarchy view, PSI metrics, process-level stats, live mode, persistent storage. Built on cgroupv2.
- Difference: System-wide monitoring daemon. Designed for Linux infrastructure monitoring, not for wrapping individual batch jobs. No workflow integration. Very strong on cgroup/container awareness.
5.2 samply
- URL: https://github.com/mstange/samply
- Language: Rust
- Description: Command-line sampling CPU profiler for macOS, Linux, and Windows. Uses Linux perf events. Spawns the target process as a subprocess and profiles it, then opens Firefox Profiler UI.
- Key features: Subprocess wrapping (`samply record ./your_program`), Firefox Profiler UI, local symbol resolution, flame graphs.
- Difference: CPU profiling only (call stacks). No memory, GPU, disk, or network tracking. Developer profiler.
5.3 Bytehound
- URL: https://github.com/koute/bytehound
- Language: Rust
- Description: Memory profiler for Linux. Intercepts all heap allocations via `LD_PRELOAD`. Produces detailed allocation timelines with stack traces.
- Key features: Full allocation tracking, web-based GUI, Rhai scripting for analysis, multi-architecture (AMD64, ARM, AArch64, MIPS64).
- Difference: Memory only. Developer profiler. Requires `LD_PRELOAD`; no GPU/disk/network.
5.4 pprof-rs
- URL: https://github.com/tikv/pprof-rs
- Language: Rust
- Description: Rust CPU profiler using backtrace-rs. Generates pprof-compatible output.
- Key features: CPU profiling for Rust applications, pprof output, flame graphs, low overhead.
- Difference: CPU profiler for Rust programs only.
Category 6: System-Level Daemons and Metrics Collection Infrastructure
These tools are designed for long-running infrastructure monitoring, not individual batch jobs, but represent the broader ecosystem.
6.1 Prometheus + node_exporter
- URL: https://github.com/prometheus/node_exporter
- Language: Go
- Description: Prometheus exporter for hardware and OS metrics from `/proc` and `/sys`. Exposes CPU, memory, disk, network, filesystem, and more as Prometheus metrics.
- Key features: Pull-based metrics, scrape-able endpoint, very broad metric coverage, alerting via Prometheus + Alertmanager.
- Difference: Infrastructure monitoring daemon. Requires Prometheus server. No per-job tracking.
6.2 Prometheus Pushgateway
- URL: https://github.com/prometheus/pushgateway
- Language: Go
- Description: Push acceptor for ephemeral and batch jobs. Allows short-lived jobs to push metrics to Prometheus (which normally pulls). Stores last-received metrics until explicitly deleted.
- Key features: HTTP push endpoint, labels/grouping by job, integrates with Prometheus.
- Difference: Infrastructure component. Not a resource tracker itself; requires a separate process to collect and push metrics. Most relevant for a Rust implementation that needs to output to Prometheus.
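Since the Pushgateway is the standard push path for ephemeral batch jobs, a Rust CLI could deliver its final metrics with a single HTTP PUT of the Prometheus text exposition format. A minimal sketch of building that payload and push URL — the metric names, labels, and base URL here are illustrative, not taken from any existing tool:

```rust
// Hypothetical sketch: render collected samples in the Prometheus text
// exposition format, ready to push to a Pushgateway.

/// Render one gauge line in the exposition format: `name{labels} value`.
fn gauge(name: &str, labels: &[(&str, &str)], value: f64) -> String {
    let labels: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{}=\"{}\"", k, v))
        .collect();
    format!("{}{{{}}} {}", name, labels.join(","), value)
}

/// The Pushgateway groups metrics by job name encoded in the URL path.
fn push_url(base: &str, job: &str) -> String {
    format!("{}/metrics/job/{}", base, job)
}

fn main() {
    let body = [
        gauge("process_cpu_percent", &[("pid", "1234")], 42.5),
        gauge("process_rss_bytes", &[("pid", "1234")], 1.2e9),
    ]
    .join("\n");
    // A real client would PUT/POST this body with
    // `Content-Type: text/plain; version=0.0.4`.
    println!("{}", push_url("http://localhost:9091", "training-job"));
    println!("{}", body);
}
```

Keeping the payload a plain string like this means the binary needs no metrics library at all, matching the zero-dependency goal.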
6.3 Prometheus process-exporter
- URL: https://github.com/ncabatoff/process-exporter
- Language: Go
- Description: Prometheus exporter that reads `/proc` to report on selected processes. Groups processes by name or regex and exposes CPU, memory, file descriptors, I/O, and thread counts.
- Key features: Per-process-group CPU and memory metrics, `/proc`-based, configurable process selection, Prometheus compatible.
- Difference: Infrastructure daemon, not a batch job wrapper. Monitors selected processes continuously.
6.4 cAdvisor (Container Advisor)
- URL: https://github.com/google/cadvisor
- Language: Go
- Description: Google’s container resource usage and performance analysis agent. Exposes Prometheus metrics for running containers.
- Key features: Container-level CPU, memory, disk, and network metrics, Prometheus endpoint, supports Docker and other runtimes.
- Difference: Container/cgroup focused daemon. Not for general process monitoring.
6.5 Telegraf
- URL: https://github.com/influxdata/telegraf
- Language: Go
- Description: Plugin-driven metrics collection agent from InfluxData. Single agent collecting system metrics (CPU, memory, disk, network, GPU, containers) and writing to InfluxDB or other backends.
- Key features: 300+ input plugins (system, Docker, SNMP, statsd, etc.), multiple output backends, flexible configuration.
- Difference: Infrastructure agent daemon. Not designed for per-job wrapping.
6.6 Netdata (see 2.12)
6.7 kube-state-metrics
- URL: https://github.com/kubernetes/kube-state-metrics
- Language: Go
- Description: Kubernetes add-on that generates metrics about Kubernetes object state (pod resource requests/limits, deployment status, etc.) for Prometheus.
- Key features: Pod/node resource quota metrics, deployment health, Prometheus format.
- Difference: Kubernetes-only, no process-level metrics.
6.8 OpenTelemetry (OTel)
- URL: https://opentelemetry.io/ / https://github.com/open-telemetry/opentelemetry-python
- Language: Multi-language (Go, Python, Java, .NET, etc.)
- Description: CNCF standard for collecting traces, metrics, and logs. Includes system metrics via the OTel Collector. Growing support for profiling via OTel.
- Key features: Traces + metrics + logs, vendor-neutral, collector, SDKs in all major languages, exporters to Prometheus, Jaeger, OTLP.
- Difference: General observability framework, not a resource tracker per se. Relevant for instrumenting a Rust CLI to expose metrics in a standard format.
6.9 NVIDIA DCGM + dcgm-exporter
- URL: https://github.com/NVIDIA/DCGM / https://github.com/NVIDIA/dcgm-exporter
- Language: C (DCGM) + Go (exporter)
- Description: NVIDIA Data Center GPU Manager for GPU telemetry in large Linux clusters. dcgm-exporter exposes GPU metrics for Prometheus.
- Key features: Per-GPU and per-process GPU metrics, health monitoring, diagnostics, Kubernetes integration, Prometheus exporter.
- Difference: NVIDIA GPU infrastructure daemon for data center clusters. Not a batch job wrapper.
Category 7: Per-Process Network and Disk I/O Monitors
7.1 nethogs
- URL: https://github.com/raboof/nethogs
- Language: C++
- Description: Linux “net top” tool that groups network bandwidth by process using `/proc/net/tcp` and libpcap.
- Key features: Per-process network bandwidth (upload/download), real-time top-like display.
- Difference: Network only, interactive display, no data capture to file.
7.2 iftop
- URL: https://www.ex-parrot.com/pdw/iftop/
- Language: C
- Description: Shows network bandwidth grouped by source/destination host pairs. Does not show per-process breakdown.
- Key features: Per-connection bandwidth, host name resolution.
- Difference: Network only, host-pair level (not process level).
7.3 iotop
- URL: https://github.com/Tomas-M/iotop
- Language: C (rewrite of original Python version)
- Description: Top-like tool for disk I/O. Shows per-process disk read/write rates using kernel I/O accounting.
- Key features: Per-process disk I/O, real-time display, accumulated I/O counters.
- Difference: Disk I/O only, interactive display, no data capture.
7.4 dstat
- URL: https://github.com/dagwieers/dstat
- Language: Python
- Description: Versatile system statistics tool combining vmstat, iostat, netstat, and ifstat. Outputs columns of metrics to terminal, can write to CSV.
- Key features: CPU, disk, network, memory, system statistics; CSV output; pluggable.
- Difference: System-wide only (not per-process), no GPU. CSV output mode is useful for offline analysis.
Category 8: ML Experiment Tracking Platforms with Resource Monitoring
These platforms include resource metric tracking as one feature among many.
8.1 Weights & Biases (W&B)
- URL: https://github.com/wandb/wandb
- Language: Python
- Description: ML experiment tracking platform with automatic system metric logging. Tracks GPU, CPU, memory, and network during training runs.
- Key features: Automatic system metric logging (GPU, CPU, RAM, network), experiment tracking, model registry, artifacts, collaborative dashboards.
- Difference: Primarily an ML experiment tracker. Resource monitoring is automatic and integrated but secondary to experiment logging. Requires W&B account (cloud-first, has open-source local server option).
8.2 MLflow
- URL: https://github.com/mlflow/mlflow
- Language: Python
- Description: Open-source ML lifecycle management. Does not natively log CPU/GPU metrics; requires external integration.
- Key features: Experiment tracking, model registry, deployment. No built-in system resource monitoring.
- Difference: No native resource tracking.
8.3 ClearML (see 1.20)
Category 9: HPC Batch Job Monitoring
9.1 Jobstats
- URL: https://github.com/PrincetonUniversity/jobstats
- Language: Python + Prometheus stack
- Description: Slurm-compatible job monitoring platform for CPU and GPU clusters. Displays per-job CPU and GPU efficiency summaries using Prometheus, Grafana, and Slurm Prolog/Epilog hooks.
- Key features: Per-Slurm-job efficiency report (CPU utilization, memory, GPU utilization), compares requested vs. used resources, automatically stores data in Slurm AdminComment field.
- Difference: Slurm HPC specific. Requires full Prometheus + Grafana + Slurm infrastructure. Very close in concept to resource-tracker (per-job resource reports) but for HPC/Slurm, not general Python/R scripts.
9.2 Open XDMoD
- URL: https://open.xdmod.org/
- Language: PHP + Python
- Description: Open-source tool for analyzing HPC center usage and job efficiency. Tracks CPU, memory, GPU, and I/O for Slurm/PBS/SGE jobs.
- Key features: Job-level resource utilization reports, efficiency recommendations, web portal.
- Difference: HPC management tool. Requires full HPC stack. Not for general batch jobs.
Category 10: R Language Profiling Tools
Resource-tracker explicitly supports R scripts. These are the closest R-ecosystem analogues.
10.1 profvis
- URL: https://github.com/rstudio/profvis
- Language: R
- Description: Interactive visualization of R code profiling data. Uses `Rprof()` to collect call stack samples and displays an interactive flame graph and memory timeline in a web browser.
- Key features: Interactive flame graph, memory timeline, line-level time attribution, RStudio integration.
- Difference: CPU + memory profiling for R code, developer-oriented. No disk, network, or GPU. No batch job wrapping or time-series operational logging.
10.2 bench
- URL: https://github.com/r-lib/bench
- Language: R
- Description: High-precision benchmarking for R with memory tracking.
- Key features: High-resolution timing, memory allocation tracking, comparison of multiple expressions.
- Difference: Benchmarking tool. No operational resource monitoring.
10.3 microbenchmark
- URL: https://github.com/joshuaulrich/microbenchmark
- Language: R
- Description: R package for sub-millisecond timing benchmarks.
- Key features: High-precision CPU timing.
- Difference: CPU timing only, micro-benchmarking specific.
10.4 profmem
- URL: https://github.com/HenrikBengtsson/profmem
- Language: R
- Description: Simple memory profiling for R expressions. Uses `tracemem`/R internals to log all memory allocations.
- Key features: Per-expression memory allocation log.
- Difference: Memory only, developer-oriented.
Category 11: Python Standard Library / Built-in Profiling
11.1 cProfile / profile
- URL: https://docs.python.org/3/library/profile.html
- Language: Python (stdlib)
- Description: Python’s built-in deterministic profiler. Records function call counts and cumulative time.
- Key features: Function-level timing, call count, cumulative/per-call time, pstats for analysis.
- Difference: CPU time only, function-level. No memory, GPU, disk, or network.
11.2 tracemalloc
- URL: https://docs.python.org/3/library/tracemalloc.html
- Language: Python (stdlib, since 3.4)
- Description: Traces Python memory allocations with tracebacks to allocation sites.
- Key features: Peak memory tracking, traceback to allocation sites, snapshot comparison.
- Difference: Python-managed memory only. No native/C allocations, no GPU/disk/network.
11.3 yappi
- URL: https://github.com/sumerc/yappi
- Language: Python + C
- Description: Yet Another Python Profiler. Supports both wall clock and CPU time, multi-threaded profiling, and async code.
- Key features: Wall + CPU time, multi-thread awareness, async support, pstats/callgrind output.
- Difference: CPU profiling only.
11.4 line_profiler
- URL: https://github.com/pyutils/line_profiler
- Language: Python + C
- Description: Line-by-line CPU time profiler for Python using the `@profile` decorator.
- Key features: Line-level execution time, `@profile` decorator.
- Difference: CPU time only, requires decoration.
Summary Comparison Table
| Tool | Lang | CPU | Mem | GPU | Disk | Net | Batch-job wrap | Per-job report | Workflow integration | Output |
|---|---|---|---|---|---|---|---|---|---|---|
| resource-tracker | Python | Y | Y | Y | Y | Y | Y | Y | Metaflow, Flyte, Airflow | Metrics + card visualization |
| psutil | Python | Y | Y | — | Y | Y | — | — | — | Raw API |
| memory_profiler | Python | — | Y | — | — | — | Y (mprof) | Y (plot) | — | Plot + log |
| Scalene | Python | Y | Y | Y | — | — | Y (CLI) | Y (web UI) | — | Interactive web report |
| Memray | Python | — | Y | — | — | — | Y (CLI) | Y (flame graph) | — | Flame graphs |
| Fil | Python | — | Y | — | — | — | Y (CLI) | Y (flame graph) | — | Flame graph |
| pyinstrument | Python | Y | — | — | — | — | Y | Y | — | HTML/text |
| py-spy | Rust | Y | — | — | — | — | Y (attach) | Y (flame graph) | — | Flame graph |
| Austin | C | Y | — | — | — | — | Y | — | — | Stack samples |
| Glances | Python | Y | Y | Y* | Y | Y | — | — | — | TUI + web API |
| nvitop | Python | — | — | Y | — | — | — | — | — | TUI + Python API |
| gpustat | Python | — | — | Y | — | — | — | — | — | CLI display |
| CodeCarbon | Python | Y* | Y* | Y* | — | — | Y (decorator) | Y (CSV) | — | CO2 report |
| ClearML | Python | Y | Y | Y | — | Y | Y (auto) | Y (web) | ML frameworks | Web dashboard |
| below | Rust | Y | Y | — | Y | Y | — | — | — | TUI + replay |
| samply | Rust | Y | — | — | — | — | Y (subprocess) | Y (flame graph) | — | Firefox profiler |
| Bytehound | Rust | — | Y | — | — | — | Y (LD_PRELOAD) | Y (web GUI) | — | Web GUI |
| atop | C | Y | Y | — | Y | Y | — | — | — | TUI + binary log |
| sysstat/pidstat | C | Y | Y | — | Y | Y | — | — | — | CLI + CSV |
| htop | C | Y | Y | — | Y | Y | — | — | — | TUI |
| btop++ | C++ | Y | Y | Y* | Y | Y | — | — | — | TUI |
| Jobstats | Python | Y | Y | Y | — | — | Y* (Slurm) | Y (Slurm) | Slurm | CLI + DB |
| Pyroscope | Go | Y | Y | — | — | — | Y (SDK) | — | — | Flame graphs |
| Parca | Go | Y | Y | — | — | — | — | — | Kubernetes | Icicle graphs |
| perf | C | Y | — | — | Y | — | Y (subprocess) | — | — | Raw perf data |
| Valgrind | C | Y | Y | — | — | — | Y (subprocess) | Y | — | Text + GUI |
| nethogs | C++ | — | — | — | — | Y | — | — | — | TUI |
| iotop | C | — | — | — | Y | — | — | — | — | TUI |
| PowerAPI | Python | Y* | Y* | — | — | — | — | — | — | Power estimates |
| W&B | Python | Y | Y | Y | — | Y | Y (auto) | Y (web) | ML frameworks | Web dashboard |
| Prometheus stack | Go | Y | Y | Y* | Y | Y | — | — | Kubernetes | Time-series DB |
Y* = partial/plugin-based support
Key Findings for Rust CLI Implementation
Based on this landscape analysis, the following observations are most relevant to the planned Rust/Linux CLI implementation:
- No existing Rust tool covers the full feature set of resource-tracker (CPU + memory + GPU + disk + network + batch job wrapping + per-job reporting). `below` (Rust) is the closest in scope but is a system-wide daemon, not a per-job wrapper.
- procfs is the right foundation for Linux. The `/proc` filesystem is used by psutil, process-exporter, sysstat, and resource-tracker itself. A Rust implementation can use the `procfs` crate or read `/proc` directly with zero external dependencies.
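To illustrate the zero-dependency option, here is a sketch of reading per-process CPU ticks and resident memory straight from `/proc` with only the Rust standard library. Field positions follow proc(5); the hard-coded 4 KiB page size is an assumption (real code should query the system page size):

```rust
// Minimal /proc reading on Linux, in the spirit of psutil and the `procfs`
// crate, using only std.
use std::fs;

/// CPU ticks (user + system) consumed by a process, from /proc/<pid>/stat.
/// utime/stime are fields 14 and 15 (1-indexed); the process name in field 2
/// may contain spaces, so split after the closing parenthesis first.
fn cpu_ticks(pid: u32) -> Option<u64> {
    let stat = fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?;
    let after_comm = stat.rsplit(')').next()?;
    let fields: Vec<&str> = after_comm.split_whitespace().collect();
    // after_comm starts at field 3 (state), so utime/stime sit at index 11/12.
    let utime: u64 = fields.get(11)?.parse().ok()?;
    let stime: u64 = fields.get(12)?.parse().ok()?;
    Some(utime + stime)
}

/// Resident set size in bytes, from /proc/<pid>/statm (second field, pages).
fn rss_bytes(pid: u32) -> Option<u64> {
    let statm = fs::read_to_string(format!("/proc/{}/statm", pid)).ok()?;
    let pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    Some(pages * 4096) // assumes 4 KiB pages; query sysconf(_SC_PAGESIZE) in real code
}

fn main() {
    let pid = std::process::id(); // sample ourselves as a demo
    println!("cpu ticks: {:?}", cpu_ticks(pid));
    println!("rss bytes: {:?}", rss_bytes(pid));
}
```

Converting ticks to CPU percent additionally needs the clock tick rate (`sysconf(_SC_CLK_TCK)`, typically 100 Hz) and two samples spaced by the polling interval.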
- GPU support requires dynamic linking (NVML via `pynvml` in Python, or `libnvidia-ml.so` directly). This is a hard constraint noted in the SOW. A Rust NVML binding (such as the `nvml-wrapper` crate, or similar) will be needed.
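Until an NVML binding is wired up, one hedged fallback is shelling out to `nvidia-smi` and parsing its CSV query output. The `--query-gpu`/`--format` flags below are standard `nvidia-smi` options; the function names are illustrative:

```rust
// Sketch of a GPU metrics fallback that degrades gracefully when no NVIDIA
// driver is present, by invoking nvidia-smi instead of linking NVML.
use std::process::Command;

/// Parse one line of `nvidia-smi --query-gpu=utilization.gpu,memory.used
/// --format=csv,noheader,nounits` output, e.g. "45, 3072" (percent, MiB).
fn parse_gpu_line(line: &str) -> Option<(u32, u64)> {
    let mut parts = line.split(',').map(str::trim);
    let util: u32 = parts.next()?.parse().ok()?;
    let mem_mib: u64 = parts.next()?.parse().ok()?;
    Some((util, mem_mib))
}

/// Query all GPUs; returns None when nvidia-smi cannot be run at all.
fn query_gpus() -> Option<Vec<(u32, u64)>> {
    let out = Command::new("nvidia-smi")
        .args([
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader,nounits",
        ])
        .output()
        .ok()?;
    let text = String::from_utf8(out.stdout).ok()?;
    Some(text.lines().filter_map(parse_gpu_line).collect())
}

fn main() {
    match query_gpus() {
        Some(gpus) => println!("{} GPU(s): {:?}", gpus.len(), gpus),
        None => println!("nvidia-smi not available"),
    }
}
```

Shelling out is slower and coarser than NVML (no per-process VRAM attribution), so it is a stopgap, not a substitute for the dynamic-linking approach noted above.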
- The Pushgateway integration (Extra Component: S3 PUT) is unique to resource-tracker and not present in any comparable tool. This makes it particularly well-suited for cloud batch job environments.
- The decorator/wrapper pattern (similar to `samply record ./program`) is present in py-spy, samply, Austin, and Fil — wrapping a subprocess is the right architectural pattern for a CLI tool.
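The wrapper pattern these tools share can be sketched in a few lines of std-only Rust: spawn the target as a child process, poll it on the sampling interval, and emit a summary once it exits. The `track` helper and its CLI shape are illustrative assumptions, not an existing interface:

```rust
// Sketch of the wrapper pattern (`tracker run -- <cmd>`): spawn the target,
// sample on a fixed interval until it exits, then report. Sampling here only
// counts ticks; a real tracker would read /proc/<pid>/* at each tick.
use std::process::Command;
use std::time::{Duration, Instant};

/// Run `cmd` with `args`, polling every `interval`; returns (exit code, tick count).
fn track(cmd: &str, args: &[&str], interval: Duration) -> std::io::Result<(i32, u32)> {
    let start = Instant::now();
    let mut child = Command::new(cmd).args(args).spawn()?;
    let mut ticks = 0;
    let status = loop {
        match child.try_wait()? {
            Some(status) => break status, // target finished
            None => {
                ticks += 1; // collect one metrics sample here
                std::thread::sleep(interval);
            }
        }
    };
    println!("wall time: {:?}, samples: {}", start.elapsed(), ticks);
    Ok((status.code().unwrap_or(-1), ticks))
}

fn main() -> std::io::Result<()> {
    // Wrap a short-lived command; the child's exit code is propagated,
    // as a transparent CLI wrapper should do.
    let (code, _ticks) = track("sleep", &["1"], Duration::from_millis(100))?;
    std::process::exit(code);
}
```

Propagating the child's exit code is what lets the wrapper drop into existing shell pipelines and schedulers without changing their failure semantics.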
- The closest functional analogues (tools that wrap a job, collect multi-resource metrics, and produce a per-job report) are:
  - Scalene (Python, CPU+GPU+memory, developer-oriented)
  - memory_profiler (Python, memory only, has mprof)
  - Jobstats (HPC/Slurm specific)
  - resource-tracker itself (the reference implementation)
None of these is in Rust, and none covers all six resource dimensions (CPU, memory, GPU, VRAM, network, disk) in a single zero-dependency binary.
Sources
- https://github.com/SpareCores/resource-tracker
- https://github.com/giampaolo/psutil
- https://github.com/pythonprofilers/memory_profiler
- https://github.com/plasma-umass/scalene
- https://github.com/bloomberg/memray
- https://github.com/pythonspeed/filprofiler
- https://github.com/joerick/pyinstrument
- https://github.com/benfred/py-spy
- https://github.com/P403n1x87/austin
- https://github.com/nicolargo/glances
- https://github.com/XuehaiPan/nvitop
- https://github.com/wookayin/gpustat
- https://github.com/gpuopenanalytics/pynvml
- https://github.com/mlco2/codecarbon
- https://github.com/lfwa/carbontracker
- https://github.com/powerapi-ng/pyRAPL
- https://github.com/powerapi-ng/pyJoules
- https://github.com/powerapi-ng/powerapi
- https://github.com/sb-ai-lab/eco2AI
- https://github.com/psf/pyperf
- https://github.com/clearml/clearml
- https://github.com/xybu/python-resmon
- https://github.com/htop-dev/htop
- https://github.com/aristocratos/btop
- https://github.com/aristocratos/bpytop
- https://github.com/aristocratos/bashtop
- https://github.com/Atoptool/atop
- https://github.com/sysstat/sysstat
- https://github.com/Syllo/nvtop
- https://github.com/MrRio/vtop
- https://github.com/netdata/netdata
- https://github.com/iovisor/bcc
- https://github.com/bpftrace/bpftrace
- https://github.com/parca-dev/parca
- https://github.com/grafana/pyroscope
- https://github.com/brendangregg/FlameGraph
- https://github.com/gperftools/gperftools
- https://valgrind.org/
- https://github.com/KDE/heaptrack
- https://github.com/google/perfetto
- https://github.com/async-profiler/async-profiler
- https://github.com/facebookincubator/below
- https://github.com/mstange/samply
- https://github.com/koute/bytehound
- https://github.com/tikv/pprof-rs
- https://github.com/prometheus/node_exporter
- https://github.com/prometheus/pushgateway
- https://github.com/ncabatoff/process-exporter
- https://github.com/google/cadvisor
- https://github.com/influxdata/telegraf
- https://github.com/kubernetes/kube-state-metrics
- https://opentelemetry.io/
- https://github.com/NVIDIA/DCGM
- https://github.com/NVIDIA/dcgm-exporter
- https://github.com/raboof/nethogs
- https://github.com/wandb/wandb
- https://github.com/mlflow/mlflow
- https://github.com/PrincetonUniversity/jobstats
- https://github.com/rstudio/profvis
- https://github.com/r-lib/bench
- https://github.com/sumerc/yappi
- https://github.com/pyutils/line_profiler
- https://github.com/msaroufim/awesome-profiling
- https://lambda.ai/blog/keeping-an-eye-on-your-gpus-2
- https://sparecores.com/article/metaflow-resource-tracker
- https://developers.facebook.com/blog/post/2021/09/21/below-time-travelling-resource-monitoring-tool/
Open-Source Tools with Similar Functionality to resource-tracker
resource-tracker is a lightweight, zero-dependency Python package for monitoring CPU, memory, GPU, network, and disk utilization across processes and at the system level, designed for batch jobs (Python/R scripts, Metaflow steps), with decorator-based workflow integration and per-job visualization reports.
The tools below are organized into meaningful categories. No single open-source tool matches all of resource-tracker’s characteristics simultaneously — most are either too narrow (single metric), too heavy (infrastructure daemons), or not batch-job oriented.
Category 1: Python Libraries for Process/System Resource Monitoring
(Closest functional analogues)
| Tool | Notes | Details |
|---|---|---|
| psutil | The foundational building block used by resource-tracker itself. Raw API only, no tracking loop or reports. | Linux; no CLI; CPU/Mem/Disk/Net/Process; no batch wrap; no report |
| memory_profiler | Line-by-line memory, @profile decorator, mprof plot. No CPU/GPU/disk/network. | Linux; CLI (mprof); Memory; batch wrap (mprof CLI); report (plot) |
| Scalene | High-precision line-level profiler with AI optimization suggestions. No disk/network. Developer profiler. | Linux; CLI; CPU/GPU/Mem; batch wrap (CLI); report (web UI) |
| Memray | Bloomberg. Tracks every allocation including C/C++. No CPU/GPU/disk/network. | Linux; CLI; Memory; batch wrap (CLI); report (flame graphs) |
| Fil | Peak memory focus for data scientists (NumPy/Pandas). Written in Rust+Python. Linux/macOS only. | Linux; CLI; Memory (peak); batch wrap (CLI); report (flame graph) |
| pyinstrument | Context manager + decorator. 1ms sampling. No memory/GPU/disk/network. | Linux; CLI; CPU; batch wrap; report |
| py-spy | Written in Rust. Attaches to a running process. No memory/GPU/disk/network. | Linux; CLI; CPU; batch wrap (attach); report (flame graph) |
| Austin | Pure C, extremely low overhead CPython frame stack sampler. | Linux; CLI; CPU; batch wrap; no report |
| Glances | Full system monitor with REST API, web UI, and exporters. Long-running daemon, not a batch-job wrapper. | Linux; CLI; CPU/Mem/Disk/Net/GPU; no batch wrap; no report |
| nvitop | Best GPU process viewer. Has programmatic ResourceMetricCollector API. No CPU/mem/disk/net. | Linux; CLI; NVIDIA GPU; no batch wrap; no report |
| gpustat | Simple NVIDIA GPU status CLI. No time-series logging. | Linux; CLI; NVIDIA GPU; no batch wrap; no report |
| pynvml / nvidia-ml-py | Python NVML bindings. Building block only. | Linux; no CLI; GPU (raw API); no batch wrap; no report |
| CodeCarbon | @track_emissions decorator. CO2/energy focus, not utilization %. No disk/network. | Linux; partial CLI; CPU/Mem/GPU energy; batch wrap (decorator); report (CSV + dashboard) |
| CarbonTracker | Predicts carbon footprint, can halt training. ML training specific. | Linux; no CLI; CPU/GPU energy; batch wrap; report |
| pyRAPL | Intel RAPL via /sys/class/powercap. Intel CPUs only. Energy joules, not utilization %. | Linux only; no CLI; CPU/DRAM energy; batch wrap (decorator); no report |
| pyJoules | Multi-device energy (Intel RAPL + NVML). Context manager and decorator. | Linux only; no CLI; CPU/DRAM/GPU energy; batch wrap (decorator); no report |
| PowerAPI | Framework for software-defined power meters. Process/container/VM granularity. Complex setup. | Linux only; partial CLI; CPU/Mem power; no batch wrap; no report |
| eco2AI | ML training focused CO2 tracking. | Linux; no CLI; CPU/GPU/RAM energy; batch wrap (decorator); report (CSV) |
| pyperf | PSF benchmarking toolkit. --track-memory and --tracemalloc options. Not an operational monitor. | Linux; CLI; Memory (benchmarks); batch wrap; report |
| ClearML | Full MLOps platform. Auto-logs system metrics. Requires ClearML server. | Linux; CLI; CPU/Mem/GPU/Net; auto batch wrap; report (web UI) |
| python-resmon | Lightweight script outputting CSV. System-level only, no per-process or GPU tracking. | Linux; CLI; CPU/Mem/Disk/Net; no batch wrap; report (CSV) |
| yappi | CPU + wall time profiler with multi-thread and async support. | Linux; no CLI; CPU; batch wrap; report |
| line_profiler | Line-by-line CPU time. No memory/GPU/disk/network. | Linux; CLI (kernprof); CPU; batch wrap (@profile); report |
Category 2: Interactive Terminal System Monitors
(Real-time visual monitoring; do not produce per-job reports or integrate with batch workflows)
| Tool | Notes | Details |
|---|---|---|
| htop | Interactive process viewer; no data capture | C; Linux; CLI; CPU/Mem/Proc |
| btop++ | Most modern TUI monitor; GPU via plugins | C++; Linux; CLI; CPU/Mem/Disk/Net/GPU |
| bpytop | Predecessor to btop++ | Python; Linux; CLI; CPU/Mem/Disk/Net |
| bashtop | Predecessor to bpytop | Bash; Linux; CLI; CPU/Mem/Disk/Net |
| atop | Writes persistent binary logs; replay mode; strong process-level detail | C; Linux only; CLI; CPU/Mem/Disk/Net/Proc |
| nmon | CSV capture mode for offline analysis; primarily Linux/AIX | C; Linux; CLI; CPU/Mem/Disk/Net |
| collectl | Wide metric coverage; daemon or one-shot mode | Perl; Linux only; CLI; CPU/Mem/Disk/Net |
| sysstat (sar/pidstat) | pidstat for per-process; sadf for JSON/CSV/XML export; schedulable via cron | C; Linux only; CLI; CPU/Mem/Disk/Net/Proc |
| nvtop | AMD, Apple, Intel, NVIDIA, Qualcomm support; interactive GPU monitor | C; Linux; CLI; GPU (multi-vendor) |
| vtop | Node.js, Unicode charts | JS; Linux; CLI; CPU/Mem/Proc |
| Netdata | 76k+ GitHub stars. Per-second metrics, web UI, ML anomaly detection | C; Linux; CLI; all (800+ plugins) |
Category 3: eBPF / Kernel Tracing Tools
(Zero-overhead kernel-level observability; require root + Linux kernel 4.1+)
| Tool | Notes | Details |
|---|---|---|
| BCC | Toolkit for writing eBPF programs; 70+ ready-made tools | C/Python/Lua; Linux only; CLI |
| bpftrace | DTrace-like one-liners for eBPF; ad-hoc analysis | C++ DSL; Linux only; CLI |
| Parca + Parca Agent | Continuous eBPF-based CPU profiling; pprof format; <1% overhead | Go; Linux only; CLI |
| Pyroscope (Grafana) | Continuous profiling database + eBPF agent; multi-language SDK; Grafana integration | Go; Linux only; CLI |
Category 4: Native C/C++ Profiling Tools
| Tool | Notes | Details |
|---|---|---|
| perf (Linux perf_events) | Foundation for many other tools; hardware counter sampling | C (kernel); Linux only; CLI; CPU/kernel events |
| FlameGraph | Visualizes perf/DTrace output as SVG flame graphs | Perl; Linux; CLI; visualization |
| gperftools | Google Performance Tools: CPU profiler, heap profiler, TCMalloc | C++; Linux; partial CLI (pprof); CPU/Memory |
| Valgrind / Massif | High-overhead instrumentation; Massif=heap profiler; 10–50× slowdown | C; Linux; CLI; CPU/Memory |
| Heaptrack | KDE; faster alternative to Valgrind/Massif for heap profiling | C++; Linux only; CLI; Memory |
| Perfetto | Google; default Android profiler; SQL-queryable traces; browser UI | C++; Linux; CLI; CPU/Mem/GPU/Disk/Sched |
| async-profiler | Low-overhead JVM profiler; flame graphs; JVM only | C (JVM agent); Linux; CLI (asprof); CPU/Heap |
| TAU | HPC parallel profiling suite; complex setup | C++; Linux; CLI; CPU/GPU/MPI |
| HPCToolkit | HPC sampling profiler; 1–5% overhead; supercomputer use | C/C++; Linux; CLI; CPU/GPU |
Category 5: Rust Tools
| Tool | Notes | Details |
|---|---|---|
| below | Facebook/Meta. Time-traveling system monitor with cgroup/PSI support; record+replay mode. System-wide daemon, not a batch-job wrapper. Architecturally most relevant Rust project. | Linux only; CLI |
| samply | Sampling CPU profiler; wraps a subprocess (samply record ./program); uses Linux perf events; Firefox Profiler UI. CPU only. | Linux; CLI |
| Bytehound | Heap memory profiler; LD_PRELOAD-based; multi-arch (AMD64, ARM, AArch64, MIPS64); web-based GUI. Memory only. | Linux only; CLI |
| pprof-rs | CPU profiler for Rust programs using backtrace-rs; pprof output format. Library only. | Linux; no CLI |
Category 6: Infrastructure Metrics Collection (Daemons & Exporters)
(Not batch-job wrappers; relevant for pipeline integration and metric output targets)
| Tool | Notes | Details |
|---|---|---|
| Prometheus node_exporter | System-level Prometheus exporter; /proc-based | Go; Linux; CLI |
| Prometheus Pushgateway | Allows batch jobs to push metrics to Prometheus; standard solution for short-lived jobs | Go; Linux; CLI |
| process-exporter | Per-process-group Prometheus metrics from /proc | Go; Linux only; CLI |
| cAdvisor | Container resource usage and performance; Prometheus exporter | Go; Linux only; CLI |
| Telegraf | Plugin-driven metrics agent; 300+ inputs; InfluxDB backend | Go; Linux; CLI |
| OpenTelemetry | CNCF standard for traces/metrics/logs; structured output for jobs | Multi-lang; Linux; CLI (otelcol) |
| NVIDIA DCGM + dcgm-exporter | GPU telemetry for Kubernetes/data center; Prometheus exporter | C/Go; Linux only; CLI |
| kube-state-metrics | Kubernetes object state metrics for Prometheus | Go; Linux; CLI |
| Jobstats (HPC) | Slurm-compatible per-job efficiency reports (CPU+GPU). Conceptually very close to resource-tracker but Slurm-specific. | Python; Linux only; CLI |
Category 7: Per-Process Network and Disk I/O Monitors
| Tool | Notes | Details |
|---|---|---|
| nethogs | Per-process network bandwidth using /proc/net/tcp + libpcap | C++; Linux only; CLI |
| iftop | Per-connection (not per-process) bandwidth monitor | C; Linux; CLI |
| iotop | Per-process disk I/O using kernel I/O accounting | C; Linux only; CLI |
| dstat | System-wide CPU+disk+network+memory with CSV output | Python; Linux only; CLI |
Category 8: ML Experiment Tracking with Resource Monitoring
| Tool | Notes | Details |
|---|---|---|
| Weights & Biases | Auto-logs GPU, CPU, memory, network during training runs; cloud-first; rich dashboards | Linux; CLI (wandb) |
| ClearML | Open-source MLOps platform; auto-logs GPU+CPU+memory+network; requires ClearML server | Linux; CLI |
| MLflow | Experiment tracking but no native system resource monitoring | Linux; CLI (mlflow) |
Category 9: R Language Profiling
| Tool | Notes | Details |
|---|---|---|
| profvis | Interactive R profiling visualization; CPU + memory timeline; used within R session | Linux; R session only |
| bench | Benchmarking with memory tracking; used within R session | Linux; R session only |
| microbenchmark | Micro-benchmarking tool; used within R session | Linux; R session only |
| profmem | Memory allocation tracing for R expressions; used within R session | Linux; R session only |
Category 10: Python Standard Library Profiling Tools
| Tool | Notes | Details |
|---|---|---|
| cProfile / profile | Function-level CPU time; stdlib | Linux; CLI (python -m cProfile) |
| tracemalloc | Python memory allocation tracing with tracebacks; stdlib since Python 3.4; used within code | Linux; no CLI (used within code) |
Summary: Key Differentiators of resource-tracker
The table below highlights what makes resource-tracker stand out relative to the landscape:
| Feature | resource-tracker | Most profilers | System monitors | ML trackers |
|---|---|---|---|---|
| CPU + Memory + GPU + Disk + Net | All 5 | Usually 1–2 | All 5 | CPU+Mem+GPU |
| Batch-job / script wrapper | Yes | Yes | No (daemons) | Yes |
| Zero runtime dependencies | Yes | Varies | No | No |
| Per-job visual report / card | Yes | Often | No | Yes (cloud) |
| Workflow integration (Metaflow) | Yes | No | No | Varies |
| Cloud instance recommendations | Yes | No | No | No |
| Lightweight process footprint | Yes | Yes | No | No |
| Process-level granularity | Yes | Yes | Partial | No |
| Runs on Linux | Yes | Yes | Yes | Yes |
| CLI invocation | Yes | Yes (most) | Yes | Yes |
Rust Crate-Level Competitive Landscape: Resource Monitoring
This document surveys Rust crates relevant to resource monitoring — tracking CPU, memory, GPU, network, and disk utilization — with particular focus on use cases analogous to the Python resource-tracker package (batch job wrapping, structured output, low overhead).
It also covers dial9-tokio-telemetry, a notable 2026 Rust telemetry crate that is not a resource monitor but is included here to explain why it falls outside this landscape.
Section 1: Core System Information Libraries
(Foundational libraries; highest relevance as building blocks)
| Crate | Notes | Details |
|---|---|---|
| sysinfo | The dominant Rust system-info library. Cross-platform (Linux, macOS, Windows, FreeBSD). Covers everything resource-tracker needs except GPU. Used internally by most other crates here. ~2,700 GitHub stars. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; active; 123M downloads |
| procfs | Direct interface to Linux /proc. Most granular per-process data available (CPU time, RSS, VMS, I/O counters, smaps). Authoritative source for Linux-first tools. | Linux only; no CLI; CPU/Mem/Net/Disk; process-level; active; 51M downloads |
| psutil | Rust port of Python’s psutil. Modular feature flags. Linux + macOS. README self-describes as “not well maintained” despite a July 2025 update. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; active*; 3.1M downloads |
| systemstat | Pure Rust (no C bindings). Cross-platform. System-wide only — no per-process metrics. | Linux; no CLI; CPU/Mem/Net/Disk; system-wide only; active; 3.6M downloads |
| libproc | Per-process data on Linux + macOS. Useful complement to procfs for cross-platform support. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; active; 5M downloads |
| memory-stats | Cross-platform. Reports the current process’s own RSS and virtual memory only. Narrow scope but zero-dependency and reliable. | Linux; no CLI; Mem only; self-process only; active; 10.3M downloads |
| perf_monitor | Larksuite (Lark/Feishu). Designed explicitly as a monitoring foundation: per-process CPU, memory, FDs, disk I/O. Cross-platform. Archived January 2026 — do not adopt for new projects. | Linux; no CLI; CPU/Mem/Disk; process-level; archived; 36K downloads |
| heim | Async-first psutil/gopsutil equivalent. Conceptually ideal but last released 2020; 74 open issues. Not safe to adopt. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; abandoned; 490K downloads |
*psutil: stated as “not well maintained” in README despite recent activity.
Section 2: GPU Monitoring Libraries
| Crate | Notes | Details |
|---|---|---|
| nvml-wrapper | Safe, ergonomic Rust wrapper for NVIDIA NVML. Covers GPU utilization, memory, temperature, power, fan speed, running compute processes. The standard library for NVIDIA GPU metrics in Rust. | Linux; no CLI; NVIDIA GPU; active; 3.5M downloads |
| all-smi | Most comprehensive multi-vendor GPU CLI in Rust. Prometheus metrics integration. Display-oriented but scriptable. | Linux; CLI + Prometheus; NVIDIA/AMD/Intel/Apple/TPU/NPU GPU; active; 8.3K downloads |
| nviwatch | Interactive TUI + InfluxDB integration. NVIDIA-only. | Linux; TUI; NVIDIA GPU; active; 4.9K downloads |
| gpuinfo | Minimal CLI for GPU status with --watch and --format flags. Scriptable. NVIDIA-only. | Linux; CLI; NVIDIA GPU; active; 5.9K downloads |
Section 3: CLI Tools for Batch Job / Process Resource Tracking
(Most directly comparable to resource-tracker’s execution model)
| Crate | Notes | Details |
|---|---|---|
| denet | Closest Rust analogue to resource-tracker. denet run <cmd> wraps a command and streams CPU%, memory (RSS+VMS), and I/O metrics. JSON/JSONL/CSV output. Adaptive sampling. Child process aggregation. Python API bindings. No GPU or network monitoring. | Linux; CLI; CPU/Mem/Disk; active; 2.6K downloads |
| session-process-monitor | Kubernetes-focused but spm run pattern directly wraps a batch job with monitoring + OOM protection + headless JSON logging. Tracks USS/PSS/RSS memory and disk I/O rate. Very new (March 2026). No GPU or network. | Linux only; CLI (spm run); CPU/Mem/Disk; active; 173 downloads |
| stop-cli | Modern process viewer with JSON/CSV structured output designed for piping to jq. Per-process CPU%, memory, disk I/O, FDs. Very early stage (v0.0.1, November 2025). | Linux; CLI; CPU/Mem/Disk; active; 72 downloads |
| procrec | Records and plots CPU + memory for a process. Conceptually aligned but last updated 2021. | Linux; CLI; CPU/Mem; abandoned; 1.7K downloads |
| radvisor | Container/Kubernetes batch monitoring at 50ms granularity via cgroups. CSVY output. CPU (including throttling), memory, block I/O. Dormant since 2022. | Linux only; CLI; CPU/Mem/Disk; dormant; 1.7K downloads |
| pidtree_mon | CLI monitor for CPU load across entire process trees (parent + all descendants). CPU-only; no memory/disk/network/GPU. | Linux only; CLI; CPU only; active; 6.2K downloads |
| gotta-watch-em-all | CLI memory monitor for process trees. Memory-only. Dormant since 2022. | Linux; CLI; Mem only; dormant; 6.5K downloads |
| procweb-rust | Web interface for per-process Linux resource usage. No structured data output. Stale since 2023. | Linux only; web UI; CPU/Mem; stale; 5.5K downloads |
| systrack | Library for tracking CPU and memory usage over configurable time intervals (rolling windows) — the exact pattern resource-tracker uses. Single release in 2023; dormant since. | Linux; no CLI; CPU/Mem; dormant; 1.4K downloads |
Section 4: Interactive TUI System Monitors
(Visual monitors; not designed for non-interactive batch job instrumentation)
| Crate | Notes | Details |
|---|---|---|
| bottom (btm) | Most popular Rust TUI monitor. Cross-platform. No GPU. Uses sysinfo internally. Interactive only — not suitable for batch job instrumentation. | Linux; TUI; CPU/Mem/Net/Disk; active; 13,100 stars |
| mltop | ML-focused TUI combining CPU + NVIDIA GPU (via NVML). Directly targets the ML engineer use case. Interactive only. | Linux; TUI; CPU/Mem/NVIDIA GPU; active; 14 stars |
| rtop | TUI with optional NVIDIA GPU support. Covers all five resource types in a single tool. Interactive only. | Linux; TUI; CPU/Mem/NVIDIA GPU/Net/Disk; active; 36 stars |
| ttop | TUI with multi-vendor GPU (NVIDIA, AMD, Apple Silicon). Very new (March 2026). Interactive only. | Linux; TUI; CPU/Mem/multi-vendor GPU; active |
| hegemon | Modular safe-Rust TUI. Last release 2018. Historical reference only. | Linux only; TUI; CPU/Mem; abandoned; 336 stars |
Section 5: Comprehensive Hardware Monitoring
| Crate | Notes | Details |
|---|---|---|
| silicon-monitor | Most comprehensive hardware monitoring scope of any crate here. NVIDIA (NVML) + AMD (ROCm/sysfs) + Intel (i915) GPU. Also covers temperatures, SMART disk data, USB, audio, per-process GPU attribution. Provides CLI (JSON output), TUI, GUI, library (simonlib), and MCP/AI agent server. Very new (133 downloads, 1 star as of March 2026); unclear stability. Worth watching. | Linux; CLI (JSON); CPU/Mem/multi-vendor GPU/Net/Disk; active |
Section 6: Kernel / Low-Level Profiling Crates
(Measure hardware counters, not high-level resource utilization)
| Crate | Notes | Details |
|---|---|---|
| perf-event | Safe Rust interface to perf_event_open. Exposes hardware counters: CPU cycles, instructions, cache hits/misses, branch predictions, page faults, context switches. Deep profiling of batch jobs; not high-level resource tracking. | Linux only; no CLI; active; 4.2M downloads |
| pprof | CPU profiler for Rust programs (stack sampling → flamegraph/pprof output). Profiler, not a resource monitor. | Linux; no CLI; active; 34M downloads |
| metrics | Application metrics facade (counters, gauges, histograms). Used to emit measurements; not a collector of system resources. | Linux; no CLI; active; 74M downloads |
Section 7: dial9-tokio-telemetry — Async Runtime Telemetry (Out of Scope)
dial9-tokio-telemetry is a runtime telemetry “flight recorder” for the Tokio async runtime in Rust, announced on the Tokio blog on March 18, 2026 (authored by Russell Cohen, with AWS contributions). It is included here to explain why it is not a resource monitor and does not belong in this landscape.
What it does
dial9 hooks into Tokio’s internal instrumentation to capture a microsecond-resolution event log of every:
- Task poll (timing per poll)
- Worker park / unpark event
- Task wake event and lifecycle (creation, worker migration)
- Queue depth change
- Lock contention event (with stack traces on Linux)
- Linux kernel scheduling delay (gap between “ready to run” and “actually scheduled”)
- CPU profile samples (Linux perf/eBPF-style)
- Application-level `tracing` spans and logs
Traces are written to compact rotating binary files (or directly to S3) with <5% overhead, enabling continuous production deployment. A web-based trace viewer renders the results.
Why it is not a resource monitor
| Dimension | resource-tracker | dial9-tokio-telemetry |
|---|---|---|
| Target workload | Batch jobs (ML, HPC, pipelines) | Long-running async Rust services |
| Metrics tracked | CPU%, RAM, GPU, network, disk | Tokio task polls, scheduling delays, lock contention |
| Integration | Decorator / subprocess wrap | Must be compiled into the Rust binary |
| Output | Time-series resource usage / plots | Binary event traces for async runtime debugging |
| Question answered | “How much CPU/RAM did this job use?” | “Why did this async request take 18ms instead of 1ms?” |
| Platform | Cross-platform | Linux-primary |
dial9 is an async runtime debugger. It tracks none of the metrics — CPU utilization %, memory, GPU, network bandwidth, disk I/O — that define the resource-tracker use case. It is relevant to Rust async service reliability engineering, not to batch job resource instrumentation.
Summary: Key Findings
No single Rust crate fully replicates resource-tracker
No existing Rust crate combines: subprocess/batch-job wrapping + CPU% + memory + GPU + network + disk + structured JSON/CSV output + low overhead. The gap is real.
Closest existing tools
| Crate | Why it is close | What is missing |
|---|---|---|
| denet | denet run <cmd> wraps a command; JSON/CSV output; Python bindings | GPU, network |
| session-process-monitor | spm run pattern; OOM protection; headless JSON logging | GPU, network |
| stop-cli | Structured JSON/CSV; scripting-friendly | Not a job wrapper; no GPU/network |
Recommended building blocks for a Rust resource-tracker port
| Purpose | Crate |
|---|---|
| CPU, memory, disk, network (system + process) | sysinfo |
| Fine-grained Linux per-process I/O and memory | procfs |
| NVIDIA GPU metrics | nvml-wrapper |
| Multi-vendor GPU CLI | all-smi |
The GPU gap
No Rust library cleanly integrates CPU + memory + multi-vendor GPU + network + disk in a single programmatic API suitable for batch job wrapping. silicon-monitor attempts this scope but is brand new and unproven. nvml-wrapper covers NVIDIA programmatically; multi-vendor GPU support requires either all-smi (CLI) or direct vendor SDK bindings.
Specification Proposal — resource-tracker
- Status: Proposal / Work-in-Progress
- Date: 2026-03-30
- Based on: README.md (SpareCores), `src/prototype`, Python PR #9, `s3_upload.py`
- AI large language model tools were used throughout the research, specification, and implementation phases of this project to accelerate and improve the quality of the work.
0. Conventions
The key words MUST, MUST NOT, REQUIRED, SHALL, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
A verifiable requirement is one that can be confirmed by an automated test without manual inspection. Every normative statement below (MUST/SHALL) is intended to be verifiable.
1. Purpose and Scope
resource-tracker is a lightweight, statically self-contained Linux binary that:
- Polls system- and process-level resource utilization at a configurable interval.
- Emits structured samples to stdout (JSON Lines or CSV).
- Optionally streams those samples to the Sentinel API (SpareCores data ingestion endpoint) via gzip-compressed CSV, TSV, or JSONL files uploaded to S3 using temporary STS credentials.
The binary is intended as a drop-in CLI wrapper: run it alongside any process and it will transparently record how that process consumes hardware.
Out of scope (v1): macOS, Windows, eBPF-based tracing, container image introspection beyond environment variables, multi-host federation.
2. Platform Requirements
| Requirement | Detail |
|---|---|
| Operating System | Linux only (kernel ≥ 4.18 recommended for full /proc coverage) |
| CPU Architectures | x86_64 and aarch64 (ARM64) |
| Linkage | Dynamic linkage for GPU libraries; all other code statically linked or carried as crate dependencies |
| Minimum Rust Edition | 2024 |
GPU support MUST NOT be required for the binary to build or run.
On a CPU-only host GpuCollector::collect() SHALL return an empty Vec and no error.
3. Configuration
3.1 Precedence (highest to lowest)
CLI flags > TOML config file > built-in defaults
> Future enhancement: Support `RESOURCE_TRACKER_`-prefixed environment variables (e.g. `RESOURCE_TRACKER_INTERVAL`, `RESOURCE_TRACKER_FORMAT`) as an additional configuration layer between CLI flags and the TOML file. Environment variables are more practical than file-based config for containerized and scripted workloads and are preferred for the Sentinel integration use case.
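The precedence rule can be expressed as a simple `Option` chain. A minimal sketch, assuming hypothetical `Option<u64>` values already parsed from the CLI and the TOML file (the function name `effective_interval` is illustrative, not part of the spec):

```rust
/// Resolve the polling interval per the precedence rule:
/// CLI flag > TOML config file > built-in default.
fn effective_interval(cli: Option<u64>, file: Option<u64>) -> u64 {
    const DEFAULT_INTERVAL_SECS: u64 = 1;
    cli.or(file).unwrap_or(DEFAULT_INTERVAL_SECS)
}

fn main() {
    // T-CFG-05: a CLI --interval 2 overrides a TOML interval_secs = 5.
    assert_eq!(effective_interval(Some(2), Some(5)), 2);
    // T-CFG-04: TOML value is used when no flag is given.
    assert_eq!(effective_interval(None, Some(3)), 3);
    // T-CFG-06: defaults apply when neither source provides a value.
    assert_eq!(effective_interval(None, None), 1);
}
```

The same `or`-chain extends naturally to a future environment-variable layer slotted between the CLI and file values.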
3.2 CLI Parameters
The binary MUST accept the following flags via a command line parser:
| Short | Long | Type | Default | Description |
|---|---|---|---|---|
| -n | --job-name | String | none | Human-readable label attached to every sample |
| -p | --pid | i32 | none | Root PID of the process tree to track (CPU attribution) |
| -i | --interval | u64 | 1 | Polling interval in seconds (≥ 1) |
| -c | --config | path | resource-tracker.toml | Path to TOML config file |
| -f | --format | enum | json | Output format: json or csv |
| — | --version | flag | — | Print binary version and exit |
All metadata fields listed in Section 9.3 (job_name, project_name, stage_name, etc.) MUST also be accepted as CLI flags. See Section 9.3 for the full flag and environment variable table.
Shell-wrapper mode (MVP target): The binary SHOULD support being used as a
transparent process wrapper, where the command to monitor is passed as trailing
arguments after a -- separator or as positional arguments:
```sh
resource-tracker Rscript model.R
resource-tracker -- python train.py --epochs 10
```
In this mode the binary spawns the given command as a child process, sets
--pid to that child’s PID automatically, and exits when the child exits
(propagating the child exit code). This is a significant usability improvement
over the Python implementation and is a first-class v1 goal.
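The spawn-wait-propagate behavior described above can be sketched with `std::process` alone. This is a minimal illustration (assuming a POSIX `sh` is available; error handling, signal forwarding, and the polling loop itself are omitted):

```rust
use std::process::Command;

/// Spawn the wrapped command, note its PID for tracking, wait for it,
/// and return its exit code (signal-terminated children map to non-zero).
fn run_wrapped(cmd: &str, args: &[&str]) -> i32 {
    let mut child = Command::new(cmd)
        .args(args)
        .spawn()
        .expect("failed to spawn wrapped command");
    let _tracked_pid = child.id(); // this PID would become the --pid target
    let status = child.wait().expect("failed to wait on child");
    status.code().unwrap_or(1)
}

fn main() {
    // e.g. `resource-tracker -- sh -c "exit 7"` must itself exit with 7
    assert_eq!(run_wrapped("sh", &["-c", "exit 7"]), 7);
    assert_eq!(run_wrapped("sh", &["-c", "exit 0"]), 0);
}
```

In the real binary, `std::process::exit(code)` would be called with the propagated value after the final samples are flushed.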
--interval MUST be > 0. Values of 0 SHALL be rejected with a non-zero exit code and a descriptive error message.
3.3 TOML Config File
The config file is optional. If the file does not exist or cannot be parsed, the binary MUST continue using defaults (no error, no warning).
Schema:
```toml
[job]
name = "my-benchmark"  # String; optional
pid = 12345            # i32; optional

[tracker]
interval_secs = 5      # u64; optional; default 1
```
Unrecognized keys MUST be silently ignored.
3.4 Verifiable Configuration Tests
- T-CFG-01: Running with no flags produces valid JSON Lines output on stdout.
- T-CFG-02: `--format csv` emits a header line matching the exact column list in Section 6.2 before the first data row.
- T-CFG-03: `--interval 0` exits with code ≠ 0.
- T-CFG-04: A TOML file with `[tracker] interval_secs = 3` results in samples separated by ≈ 3 seconds when no `--interval` flag is provided.
- T-CFG-05: A CLI `--interval 2` overrides a TOML `interval_secs = 5`.
- T-CFG-06: A missing TOML file path silently falls back to defaults.
4. Startup Behavior
On startup the binary MUST:
- Parse configuration (Section 3).
- Initialize all collectors.
- Execute one warm-up collection pass to prime delta state in stateful collectors (`CpuCollector`, `NetworkCollector`, `DiskCollector`).
- Sleep exactly one full interval.
- Emit the CSV header (if format = CSV) before the first data row.
- Enter the polling loop (Section 5).
The warm-up pass result MUST NOT be emitted to stdout.
5. Polling Loop
The loop MUST:
- Record `timestamp_secs` = current Unix time as `u64` (seconds since UNIX epoch, UTC).
- Collect all metric subsystems (Section 6.1) in the order: CPU, Memory, Network, Disk, GPU.
- Serialize and emit one line to stdout per the chosen format (Section 6.2, Section 6.3).
- Sleep the configured interval.
- Repeat indefinitely until killed.
Collection of any subsystem MUST NOT block the other subsystems. Failures in optional subsystems (GPU) MUST be surfaced as empty/zero values, not panics.
6. Data Model
6.1 Sample
A Sample is a point-in-time snapshot of all tracked resources.
```rust
pub struct Sample {
    pub timestamp_secs: u64,          // Unix time (seconds)
    pub job_name: Option<String>,
    pub cpu: CpuMetrics,
    pub memory: MemoryMetrics,
    pub network: Vec<NetworkMetrics>, // one per interface
    pub disk: Vec<DiskMetrics>,       // one per block device
    pub gpu: Vec<GpuMetrics>,         // one per GPU; empty if none
}
```
6.1.1 CpuMetrics
Source: /proc/stat tick deltas; /proc/<pid>/stat for process tracking.
> Note: `total_cores` (logical CPU count) is a static host property that rarely changes. It belongs in the host discovery snapshot (Section 8.1) rather than in every per-second sample. It is referenced here only for computing `cpu_usage` in the CSV output (Section 7.2).
| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
utilization_pct | f64 | fractional cores | /proc/stat | Aggregate utilization expressed as cores-in-use (0.0..N_cores) |
per_core_pct | Vec<f64> | % | /proc/stat | Per logical CPU percentage; len == host_vcpus; range 0.0–100.0 |
utime_secs | f64 | seconds | /proc/stat | Δ(user+nice ticks) / ticks_per_second for this interval |
stime_secs | f64 | seconds | /proc/stat | Δ(system ticks) / ticks_per_second for this interval |
process_count | u32 | count | /proc numeric dirs | Number of running processes visible to the OS |
process_cores_used | Option<f64> | fractional cores | /proc/<pid>/stat | None when no PID tracked |
process_child_count | Option<u32> | count | /proc/<pid>/stat | Descendant count; excludes root PID; None when no PID tracked |
Computation rules:
- `utilization_pct` = `(Δtotal − Δidle) / Δtotal × N_cores`, where N_cores is the logical CPU count from host discovery. The result is expressed as fractional cores in use (e.g. 4.6 on a 16-core host means ~4.6 vCPUs are fully utilized). Do NOT clamp this value; values very slightly above N_cores are valid under kernel accounting rounding. Δtotal = Δ(user + nice + system + idle + iowait + irq + softirq + steal). Δidle = Δ(idle + iowait).
- `utime_secs` = Δ(user + nice) / `ticks_per_second`.
- `stime_secs` = Δ(system) / `ticks_per_second`.
- `process_cores_used` = Σ Δ(utime + stime) for root PID and all descendants / (elapsed_wall_clock_seconds × ticks_per_second). Must be ≥ 0.
- On the first collection call (no previous snapshot), all delta-based fields MUST return 0. The caller MUST discard this result (warm-up pass).
Verifiable CpuMetrics Tests:
- T-CPU-01: `utilization_pct` is in [0.0, N_cores] for all samples (N_cores from host discovery).
- T-CPU-02: `len(per_core_pct)` == `host_vcpus` for all samples.
- T-CPU-03: When `--pid` is not set, `process_cores_used` and `process_child_count` are `None`.
- T-CPU-04: When `--pid <self>` is set, `process_cores_used` ≥ 0.
- T-CPU-05: `process_count` ≥ 1 on any running Linux system.
- T-CPU-06: First `collect()` call returns 0.0 for all delta fields.
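The delta arithmetic above can be sketched in a few lines. A minimal illustration, using a hypothetical `CpuTicks` snapshot type (field names mirror the `/proc/stat` columns but are not the actual structs in the implementation):

```rust
/// One /proc/stat snapshot, in clock ticks.
#[derive(Clone, Copy)]
struct CpuTicks { user: u64, nice: u64, system: u64, idle: u64,
                  iowait: u64, irq: u64, softirq: u64, steal: u64 }

/// utilization_pct = (Δtotal − Δidle) / Δtotal × N_cores (fractional cores).
fn utilization_cores(prev: CpuTicks, curr: CpuTicks, n_cores: f64) -> f64 {
    let total = |t: CpuTicks| t.user + t.nice + t.system + t.idle
                            + t.iowait + t.irq + t.softirq + t.steal;
    let idle = |t: CpuTicks| t.idle + t.iowait;
    let d_total = (total(curr) - total(prev)) as f64;
    let d_idle = (idle(curr) - idle(prev)) as f64;
    if d_total == 0.0 { return 0.0; } // warm-up pass / no elapsed ticks
    (d_total - d_idle) / d_total * n_cores // deliberately NOT clamped
}

fn main() {
    let prev = CpuTicks { user: 100, nice: 0, system: 50, idle: 800,
                          iowait: 50, irq: 0, softirq: 0, steal: 0 };
    let curr = CpuTicks { user: 200, nice: 0, system: 100, idle: 840,
                          iowait: 60, irq: 0, softirq: 0, steal: 0 };
    // Δtotal = 200, Δidle = 50 → 150/200 × 4 cores = 3.0 cores in use
    assert_eq!(utilization_cores(prev, curr, 4.0), 3.0);
    // Identical snapshots (first call) yield 0.0
    assert_eq!(utilization_cores(prev, prev, 4.0), 0.0);
}
```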
6.1.2 MemoryMetrics
Source: /proc/meminfo. All values in mebibytes (MiB = 1024 × 1024 bytes),
standardized to match Python resource-tracker PR #9 which also adopts MiB
throughout.
| Field | Type | Unit | /proc/meminfo key(s) | Notes |
|---|---|---|---|---|
total_mib | u64 | MiB | MemTotal | |
free_mib | u64 | MiB | MemFree | Truly free RAM |
available_mib | u64 | MiB | MemAvailable | Free + reclaimable |
used_mib | u64 | MiB | MemTotal − MemFree − Buffers − Cached | Matches Python memory_used |
used_pct | f64 | % | derived | used_mib / total_mib × 100; range 0.0–100.0 |
buffers_mib | u64 | MiB | Buffers | Kernel I/O buffers |
cached_mib | u64 | MiB | Cached + SReclaimable | Page cache + slab reclaimable |
swap_total_mib | u64 | MiB | SwapTotal | |
swap_used_mib | u64 | MiB | SwapTotal − SwapFree | |
swap_used_pct | f64 | % | derived | 0.0 when SwapTotal == 0 |
active_mib | u64 | MiB | Active | |
inactive_mib | u64 | MiB | Inactive |
Verifiable MemoryMetrics Tests:
- T-MEM-01: `free_mib + used_mib + buffers_mib + cached_mib ≤ total_mib` (accounting for kernel-reserved memory).
- T-MEM-02: `used_pct` is in [0.0, 100.0].
- T-MEM-03: `swap_used_pct` is 0.0 when `swap_total_mib` == 0.
- T-MEM-04: `available_mib ≤ total_mib`.
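The derived fields follow directly from the table's formulas. A minimal sketch (hypothetical helper names; real code first parses the kB values from `/proc/meminfo` and converts them to MiB):

```rust
/// used_mib = MemTotal − MemFree − Buffers − Cached (all in MiB).
/// saturating_sub guards against transient inconsistent readings.
fn used_mib(total: u64, free: u64, buffers: u64, cached: u64) -> u64 {
    total.saturating_sub(free + buffers + cached)
}

/// used_pct = used_mib / total_mib × 100; 0.0 on a (degenerate) zero total.
fn used_pct(used: u64, total: u64) -> f64 {
    if total == 0 { 0.0 } else { used as f64 / total as f64 * 100.0 }
}

fn main() {
    // 16000 MiB total, 4000 free, 500 buffers, 3500 cached → 8000 used (50%)
    let used = used_mib(16000, 4000, 500, 3500);
    assert_eq!(used, 8000);
    assert_eq!(used_pct(used, 16000), 50.0);
}
```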
6.1.3 NetworkMetrics
Source: /proc/net/dev (throughput), /sys/class/net/<iface>/ (identity/link state).
One NetworkMetrics record per non-loopback interface.
> Architecture note: Fields such as `mac_address`, `driver`, `operstate`, `speed_mbps`, and `mtu` are static properties that do not change every second. They are candidates for promotion to a host-discovery snapshot (Section 8.1) rather than being repeated in every per-second sample. This applies similarly to static fields in Section 6.1.4 (disk) and Section 6.1.5 (GPU). The current spec includes them here for completeness; a future revision should separate static identity fields from dynamic rate fields.
| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
interface | String | — | interface name | e.g. "eth0" |
mac_address | Option<String> | — | /sys/class/net/<iface>/address | "00:11:22:33:44:55" |
driver | Option<String> | — | /sys/class/net/<iface>/device/driver symlink | e.g. "igc" |
operstate | Option<String> | — | /sys/class/net/<iface>/operstate | "up", "down", "unknown" |
speed_mbps | Option<i64> | Mbps | /sys/class/net/<iface>/speed | −1 when not reported |
mtu | Option<u32> | bytes | /sys/class/net/<iface>/mtu | |
rx_bytes_per_sec | f64 | bytes/s | /proc/net/dev Δ | Rate for this interval |
tx_bytes_per_sec | f64 | bytes/s | /proc/net/dev Δ | Rate for this interval |
rx_bytes_total | u64 | bytes | /proc/net/dev | Cumulative since boot |
tx_bytes_total | u64 | bytes | /proc/net/dev | Cumulative since boot |
Verifiable NetworkMetrics Tests:
- T-NET-01: `rx_bytes_per_sec` ≥ 0.0 and `tx_bytes_per_sec` ≥ 0.0 for all interfaces.
- T-NET-02: `rx_bytes_total` monotonically non-decreasing between consecutive samples (absent interface reset).
- T-NET-03: The loopback interface (`lo`) is NOT included in the output.
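The per-interval rates derive from deltas of the cumulative `/proc/net/dev` counters. A minimal sketch; the `saturating_sub` guard handles the interface-reset caveat noted in T-NET-02 while preserving T-NET-01's non-negativity invariant:

```rust
/// rate = Δ(cumulative byte counter) / elapsed seconds.
/// A counter that went backwards (interface re-created) yields 0.0
/// rather than a negative or wildly large rate.
fn bytes_per_sec(prev_total: u64, curr_total: u64, elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 { return 0.0; }
    curr_total.saturating_sub(prev_total) as f64 / elapsed_secs
}

fn main() {
    // 2000 bytes over a 2-second interval → 1000 bytes/s
    assert_eq!(bytes_per_sec(1000, 3000, 2.0), 1000.0);
    // Counter reset: clamp to 0.0 instead of underflowing
    assert_eq!(bytes_per_sec(3000, 1000, 2.0), 0.0);
}
```

The same helper applies unchanged to the `/proc/diskstats` read/write rates in Section 6.1.4.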
6.1.4 DiskMetrics
Source: /proc/diskstats (throughput), /sys/block/<dev>/ (identity),
statvfs(3) (space). One DiskMetrics record per block device (excluding
partitions and device-mapper synthetic devices unless mounted independently).
| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
| `device` | `String` | — | kernel device name | e.g. `"sda"`, `"nvme0n1"` |
| `model` | `Option<String>` | — | `/sys/block/<dev>/device/model` | |
| `vendor` | `Option<String>` | — | `/sys/block/<dev>/device/vendor` | |
| `serial` | `Option<String>` | — | `/sys/block/<dev>/device/wwid` or serial | |
| `device_type` | `Option<DiskType>` | — | `/sys/block/<dev>/queue/rotational` | `Nvme`, `Ssd`, or `Hdd`; `None` when the type cannot be determined |
| `capacity_bytes` | `Option<u64>` | bytes | `/sys/block/<dev>/size` × 512 | |
| `mounts` | `Vec<DiskMountMetrics>` | — | `statvfs(3)` | One per mount point |
| `read_bytes_per_sec` | `f64` | bytes/s | `/proc/diskstats` Δ | |
| `write_bytes_per_sec` | `f64` | bytes/s | `/proc/diskstats` Δ | |
| `read_bytes_total` | `u64` | bytes | `/proc/diskstats` sectors × sector_size | Cumulative since boot; see sector size note |
| `write_bytes_total` | `u64` | bytes | `/proc/diskstats` sectors × sector_size | Cumulative since boot; see sector size note |
DiskMountMetrics fields:
| Field | Type | Unit | Notes |
|---|---|---|---|
| `mount_point` | `String` | — | e.g. `"/"` |
| `filesystem` | `String` | — | Filesystem type from `/proc/mounts`; e.g. `"ext4"`, `"xfs"` |
| `total_bytes` | `u64` | bytes | `statvfs.f_blocks` × `f_bsize` |
| `available_bytes` | `u64` | bytes | `statvfs.f_bavail` × `f_bsize` (unprivileged) |
| `used_bytes` | `u64` | bytes | `total_bytes` − (`statvfs.f_bfree` × `f_bsize`) |
| `used_pct` | `f64` | % | `used_bytes / total_bytes × 100`; 0.0 when total == 0 |
Sector size note: The current implementation hard-codes 512 bytes/sector for `/proc/diskstats` conversions. Python's `get_sector_sizes()` reads `/sys/block/<dev>/queue/hw_sector_size` (fallback 512). On 4K-native drives (some NVMe) the Rust code will under-count I/O bytes by up to 8×. A future fix should read `/sys/block/<dev>/queue/logical_block_size` at startup and use the actual sector size. See implementation plan P-DSK-SECTOR.
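A sketch of the proposed P-DSK-SECTOR fix, reading the sysfs attribute named in the note above once at startup; the helper name is hypothetical:

```rust
use std::fs;

/// Logical sector size for a block device, read once at startup from
/// /sys/block/<dev>/queue/logical_block_size. Falls back to 512 bytes
/// when the attribute is missing or unparsable (matching the Python
/// implementation's fallback).
fn logical_sector_size(device: &str) -> u64 {
    let path = format!("/sys/block/{}/queue/logical_block_size", device);
    fs::read_to_string(&path)
        .ok()
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(512)
}
```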
Verifiable DiskMetrics Tests:
- T-DSK-01: `read_bytes_per_sec` ≥ 0.0 and `write_bytes_per_sec` ≥ 0.0.
- T-DSK-02: For each mount, `used_bytes + available_bytes ≤ total_bytes`.
- T-DSK-03: `capacity_bytes` (when `Some`) > 0.
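The `DiskMountMetrics` arithmetic reduces to pure integer math over `statvfs(3)` fields; a sketch with the `libc::statvfs` call itself omitted and names chosen for illustration:

```rust
/// Mount-point space figures derived from statvfs(3) fields, per the
/// DiskMountMetrics table. `f_bsize` is the filesystem block size.
struct MountSpace {
    total_bytes: u64,
    available_bytes: u64,
    used_bytes: u64,
    used_pct: f64,
}

fn mount_space(f_blocks: u64, f_bfree: u64, f_bavail: u64, f_bsize: u64) -> MountSpace {
    let total_bytes = f_blocks * f_bsize;
    // f_bavail is the unprivileged view (excludes root-reserved blocks),
    // while used_bytes is derived from f_bfree, so the two need not sum
    // to total_bytes exactly — hence "≤" in T-DSK-02.
    let available_bytes = f_bavail * f_bsize;
    let used_bytes = total_bytes - f_bfree * f_bsize;
    let used_pct = if total_bytes == 0 {
        0.0
    } else {
        used_bytes as f64 / total_bytes as f64 * 100.0
    };
    MountSpace { total_bytes, available_bytes, used_bytes, used_pct }
}
```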
6.1.5 GpuMetrics
Source: NVML (nvml-wrapper crate, runtime-loads libnvidia-ml.so) for
NVIDIA GPUs; libamdgpu_top (runtime-loads libdrm) for AMD GPUs.
| Field | Type | Unit | Notes |
|---|---|---|---|
| `uuid` | `String` | — | Stable vendor UUID; AMD uses the PCI bus address |
| `name` | `String` | — | Human-readable device name |
| `device_type` | `String` | — | `"GPU"`, `"NPU"`, `"TPU"` |
| `host_id` | `String` | — | Host-level device identifier |
| `detail` | `HashMap<String, String>` | — | Vendor-specific extras (driver version, PCI bus ID, ROCm version) |
| `utilization_pct` | `f64` | % | Core utilization; range 0.0–100.0 |
| `vram_total_bytes` | `u64` | bytes | |
| `vram_used_bytes` | `u64` | bytes | |
| `vram_used_pct` | `f64` | % | `vram_used / vram_total × 100`; 0.0 when total == 0 |
| `temperature_celsius` | `u32` | °C | Die temperature |
| `power_watts` | `f64` | W | NVML reports mW; converted to W |
| `frequency_mhz` | `u32` | MHz | Core/graphics clock |
| `core_count` | `Option<u32>` | count | Shader/compute cores; `None` if not reported |
AMD-specific: When /sys/module/amdgpu does not exist the AMD collection path MUST be skipped entirely (no panic).
NVIDIA-specific: power_watts = raw NVML milliwatt value / 1000.
Verifiable GpuMetrics Tests:
- T-GPU-01: On a CPU-only host, the `gpu` Vec is empty and no error is returned.
- T-GPU-02: `utilization_pct` is in [0.0, 100.0] for each GPU.
- T-GPU-03: `vram_used_bytes ≤ vram_total_bytes` for each GPU.
- T-GPU-04: `vram_used_pct` is 0.0 when `vram_total_bytes` == 0.
- T-GPU-05: On a host with an AMD GPU, `uuid` equals the PCI bus address string.
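The zero-VRAM guard behind T-GPU-04 can be isolated into a small helper (the name is illustrative):

```rust
/// vram_used_pct per the GpuMetrics table: guard against division by
/// zero when a device reports no VRAM (T-GPU-04).
fn vram_used_pct(vram_used_bytes: u64, vram_total_bytes: u64) -> f64 {
    if vram_total_bytes == 0 {
        0.0
    } else {
        vram_used_bytes as f64 / vram_total_bytes as f64 * 100.0
    }
}
```

The same guard pattern applies to `used_pct` in Section 6.1.4.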
7. Output Formats
7.1 JSON Lines (default)
Each sample is emitted as a single JSON object followed by \n. The binary
MUST include a version field keyed as "<crate-name>-version" with the value
being the Cargo package version string.
Example (abbreviated):
{"timestamp_secs":1743300000,"job_name":null,"cpu":{...},"memory":{...},"network":[...],"disk":[...],"gpu":[],"resource-tracker-version":"0.1.0"}
Requirements:
- T-OUT-01: Each line MUST be valid JSON parseable with any standard JSON library.
- T-OUT-02: `timestamp_secs` MUST be present and be a positive integer.
- T-OUT-03: The version key `"resource-tracker-version"` MUST be present.
- T-OUT-04: Consecutive samples MUST have non-decreasing `timestamp_secs`.
7.2 CSV Format
CSV is the primary and required output format for Sentinel S3 streaming
(Section 9.2.2). It uses the same column names and units as the Python
resource-tracker so the Sentinel backend can ingest both without schema
changes. When uploaded to S3 the CSV content MUST be gzip-compressed and the
object key MUST carry the extension .csv.gz.
When --format csv is selected for stdout output the raw (uncompressed) CSV
bytes are written. Gzip compression is applied only when writing the S3 batch
upload payload (Section 9.2.2).
When --format csv is selected:
- The header line MUST be emitted exactly once, before the first data row.
- The header MUST match the following column names in this exact order:
timestamp,processes,utime,stime,cpu_usage,memory_free,memory_used,memory_buffers,memory_cached,memory_active,memory_inactive,disk_read_bytes,disk_write_bytes,disk_space_total_gb,disk_space_used_gb,disk_space_free_gb,net_recv_bytes,net_sent_bytes,gpu_usage,gpu_vram,gpu_utilized
Column definitions:
| CSV Column | Source Field | Unit | Computation |
|---|---|---|---|
| `timestamp` | `timestamp_secs` | Unix seconds | direct |
| `processes` | `cpu.process_count` | count | direct |
| `utime` | `cpu.utime_secs` | seconds | direct; 3 decimal places |
| `stime` | `cpu.stime_secs` | seconds | direct; 3 decimal places |
| `cpu_usage` | `cpu.utilization_pct` | fractional cores | `utilization_pct` directly; the field is already in fractional cores (0..N_cores); 4 decimal places |
| `memory_free` | `memory.free_mib` | MiB | direct |
| `memory_used` | `memory.used_mib` | MiB | direct |
| `memory_buffers` | `memory.buffers_mib` | MiB | direct |
| `memory_cached` | `memory.cached_mib` | MiB | direct |
| `memory_active` | `memory.active_mib` | MiB | direct |
| `memory_inactive` | `memory.inactive_mib` | MiB | direct |
| `disk_read_bytes` | disk subsystem | bytes | Σ `read_bytes_per_sec × interval_secs` across all devices; integer |
| `disk_write_bytes` | disk subsystem | bytes | Σ `write_bytes_per_sec × interval_secs` across all devices; integer |
| `disk_space_total_gb` | disk mounts | GB (10⁹) | Σ `total_bytes / 1_000_000_000` across all mounts; 6 decimal places |
| `disk_space_used_gb` | disk mounts | GB (10⁹) | `disk_space_total_gb − disk_space_free_gb`; 6 decimal places |
| `disk_space_free_gb` | disk mounts | GB (10⁹) | Σ `available_bytes / 1_000_000_000` across all mounts; 6 decimal places |
| `net_recv_bytes` | network subsystem | bytes | Σ `rx_bytes_per_sec × interval_secs` across all interfaces; integer |
| `net_sent_bytes` | network subsystem | bytes | Σ `tx_bytes_per_sec × interval_secs` across all interfaces; integer |
| `gpu_usage` | gpu subsystem | fractional GPUs | Σ `utilization_pct / 100` across all GPUs; 4 decimal places |
| `gpu_vram` | gpu subsystem | MiB | Σ `vram_used_bytes / 1_048_576`; 4 decimal places |
| `gpu_utilized` | gpu subsystem | count | count of GPUs where `utilization_pct > 0.0` |
Verifiable CSV Tests:
- T-CSV-01: Header is emitted exactly once, as the first line.
- T-CSV-02: Column count per data row equals the column count in the header.
- T-CSV-03: The `cpu_usage` column equals `utilization_pct` directly (the field is already fractional cores, 0..N_cores) to 4 dp.
- T-CSV-04: `disk_space_used_gb = disk_space_total_gb − disk_space_free_gb` for all rows.
- T-CSV-05: CSV output for a given sample is byte-for-byte reproducible (deterministic).
- T-CSV-06: No trailing commas; no quoted fields (all values are numbers or bare identifiers).
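The Σ rate × interval derivations for the byte columns and the fractional-GPU sum can be sketched as plain folds over the per-device records (helper names are illustrative):

```rust
/// Aggregate per-device rates into a per-interval CSV byte column
/// (the disk_read_bytes / net_recv_bytes derivation from the table
/// above): sum of rate × interval, truncated to an integer.
fn bytes_this_interval(rates_bytes_per_sec: &[f64], interval_secs: f64) -> u64 {
    rates_bytes_per_sec
        .iter()
        .map(|r| r * interval_secs)
        .sum::<f64>() as u64
}

/// gpu_usage: per-GPU utilization percentages expressed as a sum of
/// fractional GPUs (e.g. two half-busy GPUs -> 1.0).
fn gpu_usage(utilization_pcts: &[f64]) -> f64 {
    utilization_pcts.iter().map(|p| p / 100.0).sum()
}
```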
8. Host and Cloud Discovery
The binary SHOULD collect machine-level metadata once at startup and include it
in the Sentinel API registration payload (Section 9.1). Collected fields use the prefix host_ or cloud_.
8.1 Host Discovery
All fields are optional; collection failure MUST be silently swallowed.
| Field | Type | Source |
|---|---|---|
| `host_id` | `Option<String>` | AWS: `/sys/class/dmi/id/board_asset_tag`; fallback: `/etc/machine-id` |
| `host_name` | `Option<String>` | `gethostname(3)` |
| `host_ip` | `Option<String>` | First non-loopback IPv4 from `getifaddrs(3)` |
| `host_allocation` | `Option<String>` | `"dedicated"` or `"shared"`; heuristic TBD |
| `host_vcpus` | `Option<u32>` | Count of logical CPUs (`/proc/cpuinfo` processor entries) |
| `host_cpu_model` | `Option<String>` | `/proc/cpuinfo` model name field |
| `host_memory_mib` | `Option<u64>` | `MemTotal / 1024` from `/proc/meminfo` |
| `host_gpu_model` | `Option<String>` | First GPU name from `GpuCollector` |
| `host_gpu_count` | `Option<u32>` | Length of the GPU Vec |
| `host_gpu_vram_mib` | `Option<u64>` | Sum of `vram_total_bytes / 1_048_576` across all GPUs |
| `host_storage_gb` | `Option<f64>` | Sum of `capacity_bytes / 1_000_000_000` across all block devices |
Users MUST be able to suppress any field by setting the corresponding
environment variable to "0" or "" (exact mechanism TBD in implementation).
8.2 Cloud Discovery
Cloud metadata is probed by making HTTP GET requests to each cloud provider’s Instance Metadata Service (IMDS) with a short timeout (≤ 2 seconds per provider). Probes MUST be attempted in the background and MUST NOT delay the first sample emission.
| Field | Probe endpoint | Notes |
|---|---|---|
| `cloud_vendor_id` | AWS: `169.254.169.254/latest/meta-data/`; GCP: `metadata.google.internal`; Azure: `169.254.169.254/metadata/instance` | Infer vendor from which endpoint responds |
| `cloud_account_id` | AWS: `/latest/meta-data/identity-credentials/ec2/info` | |
| `cloud_region_id` | AWS: `/latest/meta-data/placement/region` | |
| `cloud_zone_id` | AWS: `/latest/meta-data/placement/availability-zone` | |
| `cloud_instance_type` | AWS: `/latest/meta-data/instance-type` | |
Verifiable Cloud Discovery Tests:
- T-CLD-01: On a non-cloud host, all `cloud_*` fields are `None` and the binary does not hang for more than 5 seconds total on startup.
- T-CLD-02: The IMDS probe timeout is ≤ 2 seconds per provider.
9. Sentinel API Streaming (Extra Component)
Activation is gated on the SENTINEL_API_TOKEN environment variable being set.
Resolved design decisions:
- Streaming is enabled automatically whenever `SENTINEL_API_TOKEN` is set; no additional flag is needed.
- The upload format is `csv.gz` only; `jsonl.gz` is not supported.
- Streaming is not separately configurable via TOML or CLI beyond the token env var.
- On network unavailability: `start_run` logs a warning and disables streaming; local stdout output continues normally (see Section 11 error handling).
9.1 Authentication
The binary MUST read the API token from the environment variable
SENTINEL_API_TOKEN. Every Sentinel API request MUST include the HTTP header:
Authorization: Bearer <token>
If SENTINEL_API_TOKEN is not set, all streaming functionality MUST be silently disabled. Local stdout emission continues normally.
9.2 Run Lifecycle
9.2.1 Start of Run
At startup (after host/cloud discovery), the binary MUST POST to the data ingestion endpoint to register a new Run.
POST /runs (default base URL: https://api.sentinel.sparecores.net).
Request payload (JSON, Content-Type: application/json): all metadata, host, and
cloud fields are merged into a flat top-level object (no nesting):
{
"job_name": "...",
"project_name": "...",
"pid": 12345,
"host_vcpus": 8,
"cloud_vendor_id": "aws",
...
}
Response fields the binary MUST store:
| Response Field | Type | Usage |
|---|---|---|
| `run_id` | String | Referenced in all subsequent API calls |
| `upload_uri_prefix` | String | S3 URI prefix for metric uploads |
| `upload_credentials.access_key` | String | STS credential |
| `upload_credentials.secret_key` | String | STS credential |
| `upload_credentials.session_token` | String | STS credential |
| `upload_credentials.expiration` | String (ISO 8601) | STS credential expiry; optional |
9.2.2 Batch Upload (Background Thread)
The binary MUST start a background thread that:
- Every 60 seconds (configurable, default 60), takes all samples collected since the previous upload.
- Serializes them as CSV (same column layout as Section 7.2) – CSV is the only accepted format for the Sentinel S3 bucket.
- Gzip-compresses the CSV bytes.
- Generates a unique S3 object key under `upload_uri_prefix`: `<upload_uri_prefix>/<run_id>/<batch_seq_number>.csv.gz`
- Uploads via AWS Signature V4 (Section 10).
- Appends the uploaded URI to an internal list `uploaded_uris`.
If STS credentials are within 5 minutes of expiration, the binary MUST refresh
them by POSTing to /runs/{run_id}/refresh-credentials before attempting the upload.
Upload failures MUST be retried at least once with exponential back-off before being recorded as errors. After 3 consecutive upload failures the background thread MUST log a warning and continue buffering (data is not lost).
Verifiable Streaming Tests:
- T-STR-01: Without `SENTINEL_API_TOKEN`, no HTTP connection is made.
- T-STR-02: A batch upload request contains `Content-Encoding: gzip` and the body decompresses to valid CSV.
- T-STR-03: `uploaded_uris` contains the S3 URIs of all successfully uploaded batches.
- T-STR-04: Credential refresh is triggered when ≤ 5 minutes remain before the credential `expiration` timestamp.
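One possible shape for the retry policy described above (at least one retry, exponential back-off, never losing buffered data); the attempt cap and delays are illustrative, not mandated by the spec:

```rust
use std::thread;
use std::time::Duration;

/// Retry an upload with exponential back-off, per Section 9.2.2.
/// `upload` returns true on success. Returns false after all attempts
/// fail, in which case the caller logs a warning and keeps buffering
/// (data is not lost).
fn upload_with_retry<F>(mut upload: F, max_attempts: u32, base_delay: Duration) -> bool
where
    F: FnMut() -> bool,
{
    for attempt in 0..max_attempts {
        if upload() {
            return true;
        }
        if attempt + 1 < max_attempts {
            // 1x, 2x, 4x, ... the base delay between attempts
            thread::sleep(base_delay * 2u32.pow(attempt));
        }
    }
    false
}
```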
9.2.3 End of Run
When the tracked process terminates (or the binary receives SIGTERM), the binary MUST:
SIGINT note: An explicit SIGINT handler is not installed. When the binary is used in shell-wrapper mode, Ctrl-C is delivered to the entire process group, so both the child and the tracker receive SIGINT and exit together. Explicit SIGTERM forwarding to the child process is a future enhancement.
- Flush any remaining samples as a final batch upload (if `uploaded_uris` is non-empty).
- POST to `/runs/{run_id}/finish` to close the Run, including:
  - `run_id`
  - `exit_code` (i32, if the tracked process exited cleanly; else None)
  - `run_status` enum: `"finished"` (exit 0 or SIGTERM) or `"failed"` (non-zero exit)
  - `data_source`: `"s3"` + `data_uris: Vec<String>` if any S3 uploads succeeded; `"inline"` + `data_csv: <base64(gzip(csv))>` for short runs with no S3 uploads.
Verifiable End-of-Run Tests:
- T-EOR-01: On SIGTERM, the binary exits with code 0 after flushing remaining data.
- T-EOR-02: The close-run request body contains a `run_id` matching the start-run response.
- T-EOR-03: `data_source` is `"inline"` when no S3 uploads occurred.
- T-EOR-04: `data_source` is `"s3"` when at least one S3 upload succeeded.
9.3 Metadata Fields
The following metadata MAY be supplied by the user via CLI flags or environment variables. All are optional strings unless noted.
| Field | CLI Flag | Env Variable |
|---|---|---|
| `job_name` | `--job-name` | `TRACKER_JOB_NAME` |
| `project_name` | `--project-name` | `TRACKER_PROJECT_NAME` |
| `stage_name` | `--stage-name` | `TRACKER_STAGE_NAME` |
| `task_name` | `--task-name` | `TRACKER_TASK_NAME` |
| `team` | `--team` | `TRACKER_TEAM` |
| `env` | `--env` | `TRACKER_ENV` |
| `language` | `--language` | `TRACKER_LANGUAGE` |
| `orchestrator` | `--orchestrator` | `TRACKER_ORCHESTRATOR` |
| `executor` | `--executor` | `TRACKER_EXECUTOR` |
| `external_run_id` | `--external-run-id` | `TRACKER_EXTERNAL_RUN_ID` |
| `container_image` | `--container-image` | `TRACKER_CONTAINER_IMAGE` |
Users MUST also be able to supply arbitrary key-value tags via repeated --tag key=value flags.
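A minimal sketch of `--tag key=value` parsing as it might feed a clap value parser; rejecting an empty key is an assumption here, since the spec does not define malformed-tag handling:

```rust
/// Parse one repeated `--tag key=value` argument into a (key, value)
/// pair. Returns None when there is no '=' or the key is empty
/// (assumed behavior; the spec leaves malformed tags undefined).
fn parse_tag(arg: &str) -> Option<(String, String)> {
    let (key, value) = arg.split_once('=')?;
    if key.is_empty() {
        return None;
    }
    Some((key.to_string(), value.to_string()))
}
```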
10. S3 Upload — AWS Signature V4
The upload is implemented in pure Rust without any AWS SDK dependency (zero
additional transitive deps for this path). The implementation mirrors the
Python s3_upload.py module from PR #9.
10.1 URI Parsing
An S3 URI has the form s3://bucket/path/to/object. Parsing MUST:
- Require scheme == `"s3"`.
- Require a non-empty bucket name.
- Require a non-empty key (path after the bucket).
- Return an error for any other form.
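The rules above (and tests T-S3-01 through T-S3-03 in Section 10.6) can be satisfied with a few lines of std string handling; the error messages are illustrative:

```rust
/// Parse `s3://bucket/key` per Section 10.1; any other form is an error.
fn parse_s3_uri(uri: &str) -> Result<(String, String), String> {
    let rest = uri
        .strip_prefix("s3://")
        .ok_or_else(|| format!("not an s3:// URI: {uri}"))?;
    let (bucket, key) = rest
        .split_once('/')
        .ok_or_else(|| format!("missing key in {uri}"))?;
    if bucket.is_empty() {
        return Err(format!("empty bucket in {uri}"));
    }
    if key.is_empty() {
        return Err(format!("empty key in {uri}"));
    }
    Ok((bucket.to_string(), key.to_string()))
}
```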
10.2 Bucket Region Detection
If the upload region is not supplied, the binary MUST determine it by sending
an HTTP HEAD request to https://<bucket>.s3.amazonaws.com/ and reading the
x-amz-bucket-region response header. The header is present even on 3xx/4xx
responses. Results MUST be cached in-process for the lifetime of the run.
Default fallback: "eu-central-1".
10.3 Request Construction
A PUT request to https://<bucket>.s3.<region>.amazonaws.com/<key> with:
- `Content-Length`: byte count of the body.
- `x-amz-content-sha256`: SHA-256 hex of the body.
- `x-amz-date`: `YYYYMMDDTHHMMSSZ` UTC.
- `x-amz-security-token`: STS session token.
- `Authorization`: AWS4-HMAC-SHA256 signature (see Section 10.4).
10.4 AWS Signature V4
Signing key derivation:
kDate = HMAC-SHA256("AWS4" + secret_key, date_stamp)
kRegion = HMAC-SHA256(kDate, region)
kService = HMAC-SHA256(kRegion, "s3")
kSigning = HMAC-SHA256(kService, "aws4_request")
Canonical request:
PUT
/<key>
host:<bucket>.s3.<region>.amazonaws.com
x-amz-content-sha256:<payload_hash>
x-amz-date:<amz_date>
x-amz-security-token:<session_token>
host;x-amz-content-sha256;x-amz-date;x-amz-security-token
<payload_hash>
String to sign:
AWS4-HMAC-SHA256
<amz_date>
<date_stamp>/<region>/s3/aws4_request
<canonical_request_sha256>
Authorization header:
AWS4-HMAC-SHA256 Credential=<access_key>/<credential_scope>, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token, Signature=<hex_sig>
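The canonical request and string-to-sign are pure string construction; a sketch with the hashing (sha2) and HMAC (hmac) steps elided, so `payload_hash` and `canonical_request_sha256` stand in for hex SHA-256 digests. The blank canonical-query-string line and the blank line terminating the header block follow AWS's SigV4 canonical-request layout:

```rust
/// Canonical request for the S3 PUT in Section 10.3. The empty line
/// after "PUT\n/<key>" is the (empty) canonical query string; the
/// blank line after the headers terminates the header block.
fn canonical_request(
    bucket: &str,
    region: &str,
    key: &str,
    amz_date: &str,
    session_token: &str,
    payload_hash: &str,
) -> String {
    let host = format!("{bucket}.s3.{region}.amazonaws.com");
    let signed = "host;x-amz-content-sha256;x-amz-date;x-amz-security-token";
    format!(
        "PUT\n/{key}\n\nhost:{host}\nx-amz-content-sha256:{payload_hash}\nx-amz-date:{amz_date}\nx-amz-security-token:{session_token}\n\n{signed}\n{payload_hash}"
    )
}

/// String to sign, per Section 10.4.
fn string_to_sign(
    amz_date: &str,
    date_stamp: &str,
    region: &str,
    canonical_request_sha256: &str,
) -> String {
    format!(
        "AWS4-HMAC-SHA256\n{amz_date}\n{date_stamp}/{region}/s3/aws4_request\n{canonical_request_sha256}"
    )
}
```

The golden-value test T-S3-04 would feed these through SHA-256 and the HMAC key-derivation chain with fixed inputs.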
10.5 Upload Success Criteria
HTTP 200 or 201 response from S3 = success. Any other status = error (with response body included in the error message).
10.6 Verifiable S3 Upload Tests
- T-S3-01: `parse_s3_uri("s3://bucket/path/obj")` returns `("bucket", "path/obj")`.
- T-S3-02: `parse_s3_uri("https://bucket/path")` returns an error.
- T-S3-03: `parse_s3_uri("s3://bucket/")` returns an error (empty key).
- T-S3-04: Given known access_key, secret_key, session_token, region, and a fixed timestamp, the generated `Authorization` header MUST match a pre-computed golden value.
- T-S3-05: The bucket region cache prevents duplicate HEAD requests for the same bucket.
- T-S3-06: An upload to a mock S3 server returns the S3 URI on success.
11. Error Handling
| Scenario | Required behavior |
|---|---|
| `/proc` file is unreadable for a single metric | Return 0 / `None` for that field; do not abort |
| GPU library absent | GPU Vec is empty; no error propagated |
| Sentinel API unreachable at start | Log warning; streaming disabled; local output continues |
| S3 upload fails | Retry once; after 3 consecutive failures log warning and continue |
| Config TOML parse error | Silently fall back to defaults |
| `--interval 0` | Exit with code ≠ 0 before starting collectors |
| Tracked PID not found | process_cores_used = None; do not abort |
The binary MUST NEVER panic in production code. `expect()` is only permissible during development; all `expect()` calls MUST be replaced with proper error handling before the v1.0 release.
12. Non-Functional Requirements
| Requirement | Target |
|---|---|
| Binary size | < 15 MiB stripped (CPU-only build) |
| Startup latency | < 1 × configured interval before first sample |
| CPU overhead of the tracker itself | < 1% of one core at 1-second interval on a 4-core host |
| Memory footprint | < 20 MiB RSS at steady state |
| Stdout buffering | Each line MUST be flushed atomically (no partial lines) |
13. Compatibility with Python resource-tracker
The CSV output format MUST maintain byte-for-byte column-name compatibility
with the Python SystemTracker output so that the Sentinel API backend can
ingest both without schema changes.
Confirmed equivalent columns (see Section 7.2 for derivation):
| Python column | Rust CSV column | Python unit | Rust unit |
|---|---|---|---|
| `timestamp` | `timestamp` | Unix seconds | Unix seconds |
| `processes` | `processes` | count | count |
| `utime` | `utime` | seconds | seconds |
| `stime` | `stime` | seconds | seconds |
| `cpu_usage` | `cpu_usage` | fractional cores | fractional cores |
| `memory_free` | `memory_free` | MiB | MiB |
| `memory_used` | `memory_used` | MiB | MiB |
| `memory_buffers` | `memory_buffers` | MiB | MiB |
| `memory_cached` | `memory_cached` | MiB | MiB |
| `memory_active` | `memory_active` | MiB | MiB |
| `memory_inactive` | `memory_inactive` | MiB | MiB |
| `disk_read_bytes` | `disk_read_bytes` | bytes/interval | bytes/interval |
| `disk_write_bytes` | `disk_write_bytes` | bytes/interval | bytes/interval |
| `disk_space_total_gb` | `disk_space_total_gb` | GB (10⁹) | GB (10⁹) |
| `disk_space_used_gb` | `disk_space_used_gb` | GB (10⁹) | GB (10⁹) |
| `disk_space_free_gb` | `disk_space_free_gb` | GB (10⁹) | GB (10⁹) |
| `net_recv_bytes` | `net_recv_bytes` | bytes/interval | bytes/interval |
| `net_sent_bytes` | `net_sent_bytes` | bytes/interval | bytes/interval |
| `gpu_usage` | `gpu_usage` | fractional GPUs | fractional GPUs |
| `gpu_vram` | `gpu_vram` | MiB | MiB |
| `gpu_utilized` | `gpu_utilized` | count | count |
Verifiable compatibility test:
- T-COMPAT-01: Run the Python and Rust trackers in parallel on the same host for 60 seconds. For each interval, the difference between corresponding scalar columns MUST be within 5% of the Python value (allowing for measurement-time skew).
14. Open Questions / Future Work
- eBPF integration: Using `aya-rs` or `libbpf-rs` for sub-millisecond tracing (CPU saturation, IPC, TLB misses, cache hit rates) — currently considered v2.
- Process-level memory (PSS): Preferred over RSS; requires reading `/proc/<pid>/smaps_rollup`, which may be slow for large processes.
- Per-process disk and network I/O: `/proc/<pid>/io` and network namespaces; currently only system-wide.
- Configurable metric suppression: Allow users to opt out of fields containing PII (e.g. `host_ip`, hostname).
- ARM-specific GPU support: Apple Metal is not in scope (Linux only); Qualcomm Adreno / Mali GPU metrics TBD.
- Static linking of NVML: Currently not possible; NVML requires a dynamically loaded vendor library.
- Heartbeat endpoint: Periodic ping to the Sentinel API while tracking is active (distinct from batch S3 uploads).
Project Dependencies
This is a Rust project requiring the Rust toolchain, including `cargo`, the Rust build system and package manager.
In addition to the base toolchain, this project also makes use of the following:
| Tool | Description | Rationale |
|---|---|---|
| uv | An extremely fast Python package and project manager | Solely for benchmarking against the Python implementation |
| just | A handy way to save and run project-specific commands | Convenience |
| jq | A handy way to slice and filter JSON output | Convenience tool for JSON and JSONL. |
| mdbook | A tool to create books with Markdown. | This project is documented via mdbook. |
Rust Crate Dependencies
Dependencies are declared in Cargo.toml and managed by cargo.
Runtime dependencies
| Crate | Version | Purpose |
|---|---|---|
| `nvml-wrapper` | 0.12 | NVIDIA GPU monitoring via NVML; loaded at runtime with `libloading` – no build-time system deps; returns empty on non-NVIDIA hosts |
| `clap` | 4 | CLI argument parsing; stripped to `derive`, `std`, `help`, `usage`, `error-context`, `env` features only |
| `procfs` | 0.18 | Linux `/proc` parsing for CPU, memory, disk, and network metrics |
| `ureq` | 3 | Lightweight synchronous HTTP client for Sentinel API and S3 PUT; avoids tokio runtime overhead |
| `serde` | 1 | Serialization/deserialization framework with derive macros |
| `serde_json` | 1 | JSON serialization for metric output and API payloads |
| `toml` | 1.0 | TOML config file parsing; `parse` + `serde` features only, no display overhead |
| `hmac` | 0.13.0-rc.6 | HMAC-SHA256 for manual AWS Signature Version 4 signing of S3 PUT requests |
| `sha2` | 0.11.0 | SHA-256 hashing required by AWS Sig V4; paired with `hmac` |
| `hex` | 0.4 | Hex encoding of HMAC digests for Sig V4 canonical request construction |
| `libc` | 0.2 | FFI bindings for `statvfs` (filesystem space), `gethostname`, and SIGTERM signal handling |
| `flate2` | 1.1.9 (pinned) | Gzip compression for `.csv.gz` S3 batch uploads; the `rust_backend` feature uses pure Rust (no `zlib-sys` C dep) |
| `libamdgpu_top` | 0.11.2 | AMD GPU monitoring via libdrm; the `libdrm_dynamic_loading` feature loads the library at runtime – gracefully skipped on non-AMD hosts |
Dev dependencies
| Crate | Version | Purpose |
|---|---|---|
| `num_cpus` | 1 | Smoke tests: verifies `cpu.utilization_pct` is expressed as fractional cores (bounded by the logical CPU count), not a percentage |
resource-tracker — Design Notes
Spec Summary
- Linux resource tracker (x86 + ARM), using `procfs` where appropriate
- Configurable polling interval for: CPU, memory, GPU, VRAM, network in/out, disk read/write
- GPU support requires dynamic linking (no static link)
- CLI tool with optional params (job name/metadata); TOML config file with sane defaults
- Basic HTTP client: hit API endpoints at start, stop, and every X minutes (heartbeat)
- Lightweight S3 PUT using AWS creds to stream resource utilization data
Dependency Assessment
Current Cargo.toml dependencies
| Crate | Version | Purpose |
|---|---|---|
| `nvml-wrapper` | 0.12 | NVIDIA GPU/VRAM monitoring via NVML; runtime dynamic loading |
| `libamdgpu_top` | 0.11.2, no defaults, `libdrm_dynamic_loading` | AMD GPU monitoring via libdrm; runtime dynamic loading |
| `clap` | 4, no defaults, `derive`+`std`+`help`+`usage`+`error-context`+`env` | CLI argument parsing, minimal footprint |
| `procfs` | 0.18, `serde` feature only | Linux `/proc` – CPU, memory, network, disk |
| `ureq` | 3, `json` feature | Lightweight sync HTTP – no tokio, no async runtime |
| `serde` | 1, `derive` | Serialization/deserialization |
| `serde_json` | 1 | JSON payload encoding for API and S3 |
| `toml` | 1.0, no defaults, `parse`+`serde` features | TOML config file parsing |
| `hmac` | 0.13.0-rc.6 | AWS Signature V4 HMAC signing |
| `sha2` | 0.11.0 | SHA-256 hashing for AWS Sig V4 |
| `hex` | 0.4 | Hex encoding for AWS Sig V4 signature |
| `libc` | 0.2 | `statvfs` for filesystem space, `gethostname`, SIGTERM |
| `flate2` | =1.1.9 (pinned), no defaults, `rust_backend` | Gzip compression for S3 batch uploads; pure Rust, no `zlib-sys` |
Release profile
[profile.release]
opt-level = "z" # optimize for size
lto = true # link-time optimization
codegen-units = 1 # better dead-code elimination
strip = true # strip symbols
panic = "abort" # smaller panic handler
Key decisions
- `nvml-wrapper` + `libamdgpu_top` over `all-smi`: `all-smi` required `protoc` at build time. Replaced with `nvml-wrapper` (NVIDIA, no build-time deps) and `libamdgpu_top` with `libdrm_dynamic_loading` (AMD, runtime-only). Both load their respective drivers at runtime and degrade gracefully when absent.
- `ureq` over `reqwest`: `reqwest` v0.13 pulls in `tokio` (full async runtime), `hyper`, and TLS stacks – adds ~5-10 MB. `ureq` v3 is synchronous, no runtime, comparable API surface.
- `procfs` features trimmed: Dropped `chrono` (heavy date/time lib, `std::time` suffices) and `flate2` (only needed for gzip-compressed `/proc` files, which are rare).
- `clap` defaults disabled: Default clap features include terminal color, unicode width, etc. Stripped to the functional minimum; the `env` feature added to support `TRACKER_*` environment variable overrides.
- Manual AWS Sig V4 (`hmac` + `sha2` + `hex`): Avoids `aws-sdk-s3` (~50+ transitive deps, large binary). S3 PUT only needs ~100-150 lines of signing logic.
- `toml` v1.0 defaults disabled: `parse` + `serde` features; the `serde` feature is required for `toml::from_str` deserialization into config structs.
- `flate2` pinned to `=1.1.9` with `rust_backend`: Pure Rust gzip implementation; avoids a `zlib-sys` C build dependency. Version pinned to prevent unexpected breakage from pre-1.0 semver.
- `libc` for sysfs/POSIX calls: `statvfs` for filesystem space, `gethostname` for host identity, and SIGTERM signal handling – pure FFI bindings with no additional binary size overhead.
Implementation Approaches
Option A — Single-file polling loop
All logic in main.rs. One tight loop: sleep → collect → diff deltas → buffer → flush.
main.rs
├── CLI parsing (clap)
├── Config loading (toml)
├── Polling loop
│ ├── procfs → CPU/mem/net/disk snapshots + delta computation
│ ├── all-smi → GPU/VRAM snapshots
│ └── Vec<Sample> batch buffer
├── HTTP calls (ureq) — start / stop / heartbeat
└── AWS Sig V4 signing + ureq PUT (inline)
Pros:
- Simplest to read and audit end-to-end
- Zero abstraction overhead
- Fastest to prototype
Cons:
- `main.rs` grows large and hard to navigate
- No isolation between collectors — hard to unit test
- Tight coupling makes it hard to disable/swap individual collectors
Best for: MVP / proof of concept.
Option B — Module-per-resource + collector trait (current)
A Collector trait drives a scheduler. Each resource lives in its own module with its own delta state.
src/
├── main.rs — CLI, config, scheduler loop
├── config.rs — TOML config struct + CLI override merge
├── sample.rs — Sample / Report structs (serde)
├── collector/
│ ├── mod.rs — Collector trait: fn collect(&mut self) -> Metric
│ ├── cpu.rs — procfs::CpuTime, delta between ticks
│ ├── memory.rs — procfs::Meminfo
│ ├── network.rs — procfs::Net, bytes delta
│ ├── disk.rs — procfs::DiskStats, read/write delta
│ └── gpu.rs — all-smi wrapper
└── reporter/
├── mod.rs — Reporter trait: fn report(&self, batch: &[Sample])
├── http.rs — ureq: start/stop/heartbeat endpoints
└── s3.rs — AWS Sig V4 + ureq PUT (batch upload)
Collector trait sketch:
pub trait Collector {
    fn collect(&mut self) -> Metric;
}
Reporter trait sketch:
pub trait Reporter {
    fn on_start(&self, meta: &JobMeta);
    fn on_sample(&self, batch: &[Sample]);
    fn on_stop(&self, meta: &JobMeta);
}
Pros:
- Each collector is independently testable with mock `/proc` data
- Clean ownership: delta state lives inside each collector struct
- Easy to add/remove resources without touching other collectors
- Reporter abstraction allows multiple outputs (HTTP + S3 simultaneously)
Cons:
- Slightly more upfront boilerplate (trait definitions, module layout)
- Minor indirection vs. inline code
Best for: Production implementation. Right level of structure for the spec.
Option C — Config-driven pipeline with Cargo feature flags
Extends Option B with #[cfg(feature = "...")] gates. GPU collector is behind feature = "gpu" since it requires dynamic linking. This enables a statically-linked build for non-GPU targets.
[features]
default = ["gpu", "s3", "http"]
gpu = ["dep:all-smi"]
s3 = []
http = []
src/
├── main.rs
├── config.rs
├── sample.rs
├── collector/
│ ├── cpu.rs
│ ├── memory.rs
│ ├── network.rs
│ ├── disk.rs
│ └── gpu.rs — #[cfg(feature = "gpu")]
└── reporter/
├── http.rs — #[cfg(feature = "http")]
└── s3.rs — #[cfg(feature = "s3")]
Build variants:
# Full build (default)
cargo build --release
# No GPU — allows static linking (musl target)
cargo build --release --no-default-features --features http,s3
cargo build --release --target x86_64-unknown-linux-musl --no-default-features --features http,s3
# Minimal — metrics only, no reporting
cargo build --release --no-default-features
Pros:
- Truly minimal binary for constrained/embedded/container targets
- Static linking possible when GPU excluded
- Clean separation of optional functionality
Cons:
- `#[cfg(...)]` gates add noise throughout the code
- More complex CI/build matrix (multiple feature combinations to test)
- Premature if targets are homogeneous
Best for: Distributing to heterogeneous environments — e.g., some hosts have GPUs, some don’t; or when a stripped container image is a requirement.
Status
Implement Option B first. This provides the right structure for the spec without over-engineering. The Collector and Reporter traits give clean boundaries for testing and future extension.
Option C’s feature-flag layer can be added on top of B later with minimal refactoring; the module boundaries are already in place.
Implementation order (Option B)
1. `config.rs` — TOML struct + CLI merge (clap + toml)
2. `sample.rs` — data model (serde + serde_json)
3. `collector/cpu.rs`, `memory.rs`, `network.rs`, `disk.rs` — procfs collectors
4. `collector/gpu.rs` — all-smi wrapper
5. `reporter/http.rs` — ureq start/stop/heartbeat
6. `reporter/s3.rs` — AWS Sig V4 + ureq PUT
7. `main.rs` — wire scheduler loop
Benchmarks
Comparison with https://github.com/SpareCores/resource-tracker
Status
The Rust binary collects every field that Python’s SystemTracker emits,
and emits them as either JSON Lines (default) or CSV (--format csv).
The CSV output has parity with Python for all columns (same names, units, and computation formulas). The JSON output is a strict superset – it carries all CSV fields plus additional metrics not available in Python.
CSV Column Mapping
| Column | Python formula | Rust CSV source | Unit | Parity? |
|---|---|---|---|---|
timestamp | time.time() (float) | timestamp_secs (integer) | Unix seconds | approx (see note 1) |
processes | count of all /proc/[0-9]+ entries | cpu.process_count – same /proc count | count | yes |
utime | per-interval delta(user+nice ticks) / ticks_per_sec | cpu.utime_secs – same delta calculation | seconds/interval | yes |
stime | per-interval delta(system ticks) / ticks_per_sec | cpu.stime_secs – same delta calculation | seconds/interval | yes |
cpu_usage | fractional cores (0..N) | cpu.utilization_pct directly (field is already fractional cores) | fractional cores | yes |
memory_free | MemFree from /proc/meminfo | memory.free_mib (MemFree / 1,048,576) | MiB | yes |
memory_used | MemTotal - MemFree - Buffers - (Cached+SReclaimable) | memory.used_mib – same formula | MiB | yes |
memory_buffers | Buffers | memory.buffers_mib | MiB | yes |
memory_cached | Cached + SReclaimable | memory.cached_mib – same formula | MiB | yes |
memory_active | Active | memory.active_mib | MiB | yes |
memory_inactive | Inactive | memory.inactive_mib | MiB | yes |
disk_read_bytes | per-interval delta(sectors_read) x sector_size, all non-partition diskstats entries | sum of rate x interval across all /sys/block whole-disk entries | bytes/interval | approx (see note 2) |
disk_write_bytes | same, write side | same, write side | bytes/interval | approx (see note 2) |
disk_space_total_gb | sum of all non-virtual mount points (incl. snap/loop) | sum of all mounts under /sys/block devices (incl. loop mounts) | GB | approx (see note 3) |
disk_space_used_gb | same, total - free (incl. reserved-for-root blocks) | same formula | GB | approx (see note 3) |
disk_space_free_gb | f_bavail from statvfs | f_bavail from statvfs | GB | approx (see note 3) |
net_recv_bytes | per-interval delta(rx_bytes) across all interfaces | sum of rate x interval across all interfaces | bytes/interval | yes |
net_sent_bytes | same, tx side | same, tx side | bytes/interval | yes |
gpu_usage | fractional GPUs (0..N) | sum gpu[].utilization_pct / 100 | fractional GPUs | yes |
gpu_vram | used VRAM in MiB | sum gpu[].vram_used_bytes / 1,048,576 | MiB | yes |
gpu_utilized | count of GPUs with utilization > 0 | count gpu[].utilization_pct > 0 | count | yes |
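The three CSV GPU columns can be derived from the per-GPU JSON fields as sketched below; the struct mirrors the JSON field names, while the function itself is illustrative:

```rust
struct Gpu {
    utilization_pct: f64,
    vram_used_bytes: u64,
}

// Returns (gpu_usage, gpu_vram, gpu_utilized) per the mapping table:
// fractional GPUs, used VRAM in MiB, and count of GPUs with utilization > 0.
fn gpu_csv_columns(gpus: &[Gpu]) -> (f64, f64, usize) {
    let usage: f64 = gpus.iter().map(|g| g.utilization_pct / 100.0).sum();
    let vram_mib: f64 = gpus
        .iter()
        .map(|g| g.vram_used_bytes as f64 / 1_048_576.0)
        .sum();
    let utilized = gpus.iter().filter(|g| g.utilization_pct > 0.0).count();
    (usage, vram_mib, utilized)
}

fn main() {
    let gpus = [
        Gpu { utilization_pct: 98.0, vram_used_bytes: 2 * 1_048_576 },
        Gpu { utilization_pct: 0.0, vram_used_bytes: 1_048_576 },
    ];
    println!("{:?}", gpu_csv_columns(&gpus));
}
```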
Documented Semantic Differences
Note 1 – Timestamp precision
Python’s timestamp is a float (sub-second resolution). Rust emits an integer
Unix timestamp. When aligning rows for comparison, use a +/-0.5 s tolerance.
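A minimal sketch of that alignment, assuming a slice of Python float timestamps and one Rust integer timestamp (the helper name is illustrative):

```rust
// Find the Python row whose float timestamp is nearest to a Rust integer
// timestamp, within the given tolerance; returns its index, if any.
fn align(py_ts: &[f64], rust_ts: u64, tolerance: f64) -> Option<usize> {
    py_ts
        .iter()
        .enumerate()
        .map(|(i, &t)| (i, (t - rust_ts as f64).abs()))
        .filter(|&(_, d)| d <= tolerance)
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
}

fn main() {
    let py = [1718000000.42, 1718000001.43, 1718000002.44];
    assert_eq!(align(&py, 1718000001, 0.5), Some(1));
    assert_eq!(align(&py, 1718000010, 0.5), None);
    println!("alignment ok");
}
```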
Note 2 – Disk I/O: device set and sector size
Both Python and Rust use /proc/diskstats deltas and iterate all
whole-disk (non-partition) entries. The device sets should match on most
Linux systems.
Python’s device filter (is_partition from resource_tracker.helpers):
# Returns True only for names matching (sd*, nvme*, mmcblk*) partition patterns
# where a parent device exists in /sys/block. Everything else -- including
# loop*, dm-*, zram* -- is treated as a whole-disk device and included.
Rust’s device filter:
// Reads /sys/block/ directory entries into a HashSet and keeps every
// diskstats entry whose name is a direct /sys/block/<name> entry.
// Logically equivalent to Python's filter: partitions like nvme0n1p1
// appear under /sys/block/nvme0n1/ (not top-level) and are excluded.
let block_set: HashSet<String> = fs::read_dir("/sys/block")?
    .filter_map(|entry| entry.ok())
    .map(|entry| entry.file_name().to_string_lossy().into_owned())
    .collect();
let devs = diskstats.filter(|d| block_set.contains(&d.name));
Sector size: both Python and Rust read the actual hardware sector size per
device from /sys/block/<dev>/queue/hw_sector_size, falling back to 512 bytes.
This was implemented in Rust as P-DSK-SECTOR.
Rationale for explicit sector size: on 4K-native drives the logical sector size is 4,096 bytes; using a hard-coded 512 would under-count I/O bytes by 8x. Reading the actual value from sysfs ensures correctness on all drive types.
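A sketch of that lookup, matching the fallback behavior described above (the helper name is illustrative):

```rust
use std::fs;

// Read /sys/block/<dev>/queue/hw_sector_size; fall back to 512 bytes when
// the file is missing or unparsable, as both implementations do.
fn sector_size(dev: &str) -> u64 {
    fs::read_to_string(format!("/sys/block/{dev}/queue/hw_sector_size"))
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(512)
}

fn main() {
    // Nonexistent devices hit the 512-byte fallback.
    println!("sector size: {}", sector_size("nvme0n1"));
}
```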
Note 2a – ZFS volumes
Python’s disk I/O implementation handles ZFS volumes, where disk usage is
reported differently at /sys/block. Rust does not currently account for
this. ZFS support is a planned enhancement (not required for MVP).
Note 3 – Disk space: mount set
Python sums all mount points that psutil.disk_partitions() reports as
non-virtual (including snap squashfs loop mounts). Rust sums all mount points
found in /proc/mounts whose source device matches a /sys/block entry.
On systems with many snap packages, Python includes the squashfs read-only
mounts for each snap. Because /dev/loop* devices appear in /sys/block,
Rust’s mounts_for_device("loopN") will pick these up too. However,
psutil may enumerate mount points that are not under /dev/ (e.g., tmpfs,
overlay, cgroup2) which Rust’s /dev/<device> prefix filter skips. This
can cause small differences in disk_space_total_gb on container hosts or
systems with unusual mount configurations.
To investigate: run mount | grep -v '^/dev' | grep -v ' type tmpfs' to see
which mount points Python may be counting that Rust is not.
Running the comparison
Prerequisites
- uv >= 0.9 (Astral): which uv
- Rust release binary: cargo build --release
Directory layout
benchmarks/
+-- pyproject.toml # uv project -- resource-tracker dependency
+-- run_python.py # SystemTracker -> results/python_metrics.csv
+-- run_rust.sh # resource-tracker --format csv -> results/rust_metrics.csv
+-- compare.py # merge on timestamp, print diff table
+-- results/ # populated at runtime (gitignore this)
+-- python_metrics.csv
+-- rust_metrics.csv
Step 1 – Set up Python environment
cd benchmarks
uv init --no-workspace
uv add resource-tracker
Step 2 – run_python.py
"""Collect SystemTracker metrics for DURATION seconds -> results/python_metrics.csv"""
import time
from resource_tracker import SystemTracker
DURATION = 60
INTERVAL = 1
tracker = SystemTracker(interval=INTERVAL, output_file="results/python_metrics.csv")
time.sleep(DURATION)
tracker.stop()
print("Done -> results/python_metrics.csv")
Step 3 – run_rust.sh
#!/usr/bin/env bash
set -euo pipefail
DURATION=60
INTERVAL=1
mkdir -p results
timeout "$DURATION" \
../target/release/resource-tracker --interval "$INTERVAL" --format csv \
> results/rust_metrics.csv || true
echo "Collected $(( $(wc -l < results/rust_metrics.csv) - 1 )) rows -> results/rust_metrics.csv"
Step 4 – compare.py
Strategy:
- Load both CSVs, parse the timestamp columns.
- Diff Python's cumulative I/O columns with diff() to get rates, matching Rust's per-interval values.
- Merge on nearest timestamp (tolerance +/-0.5 x interval).
- For each shared metric, report: mean, std, min/max for each side plus mean absolute difference (MAD) and % deviation.
"""Compare python_metrics.csv and rust_metrics.csv side by side."""
import csv
from pathlib import Path

IO_COLS = {"disk_read_bytes", "disk_write_bytes", "net_recv_bytes", "net_sent_bytes"}

def load(path):
    rows = list(csv.DictReader(Path(path).open()))
    return [{k: float(v) if v else 0.0 for k, v in row.items()} for row in rows]

def diff_col(rows, col):
    """Replace cumulative totals with per-row deltas (rate proxy)."""
    for i in range(len(rows) - 1, 0, -1):
        rows[i][col] = rows[i][col] - rows[i - 1][col]
    rows[0][col] = 0.0

py = load("results/python_metrics.csv")
rs = load("results/rust_metrics.csv")
for col in IO_COLS:
    if col in (py[0] if py else {}):
        diff_col(py, col)

shared_cols = (set(py[0]) & set(rs[0])) - {"timestamp"} if py and rs else set()
print(f"{'column':<30} {'py_mean':>12} {'rs_mean':>12} {'MAD':>12} {'%dev':>8}")
print("-" * 80)
for col in sorted(shared_cols):
    py_vals = [r[col] for r in py]
    rs_vals = [r[col] for r in rs]
    py_mean = sum(py_vals) / len(py_vals)
    rs_mean = sum(rs_vals) / len(rs_vals)
    # zip() truncates to the shorter run, so divide by that length
    n = min(len(py_vals), len(rs_vals))
    mad = sum(abs(a - b) for a, b in zip(py_vals, rs_vals)) / n
    pct = (mad / py_mean * 100) if py_mean != 0 else float("inf")
    print(f"{col:<30} {py_mean:>12.3f} {rs_mean:>12.3f} {mad:>12.3f} {pct:>7.1f}%")
Results
To be populated after running the benchmark on target hardware.
Fill in: host specs (CPU model, RAM, OS, kernel), Rust git SHA, Python
resource-tracker version, output table from compare.py, and observations on where the two implementations agree and diverge.
Remaining known differences
| Aspect | Python | Rust | Status |
|---|---|---|---|
| Timestamp precision | Float (sub-second) | Integer (Unix seconds) | By design; use +/-0.5 s tolerance when aligning rows |
| Disk I/O sector size | Per-device from /sys/block/<dev>/queue/hw_sector_size, fallback 512 | Per-device from same sysfs path, fallback 512 | Implemented (P-DSK-SECTOR); parity achieved |
| Disk space: non-/dev/ mounts | psutil includes overlay/tmpfs/cgroup mounts if reported non-virtual | Only /dev/<device> prefixed sources in /proc/mounts | Low impact on physical hosts; notable on container/VM hosts |
| ZFS volumes | Handled via psutil disk partition enumeration | Not yet implemented | Planned enhancement |
JSON superset fields (not in Python CSV)
The JSON output carries richer data than any Python CSV column can express.
Rationale: the CSV columns match Python for downstream compatibility. The JSON output is the primary format for new consumers and exposes all available data without being constrained by the Python column set.
| Type | Field | Description | Rationale |
|---|---|---|---|
| cpu | cpu.per_core_pct[] | Per-logical-core utilization (0–100 each) | Identify hot cores and NUMA imbalance; not expressible as a single CSV scalar |
| cpu | cpu.process_cores_used | Fractional cores consumed by tracked PID tree | Covers multi-process workloads (workers, MPI ranks); Python tracks only the root process |
| cpu | cpu.process_child_count | Live descendants under tracked root PID | Detect fork/thread storms without external tooling |
| memory | memory.total_mib | Total installed RAM | Baseline for capacity planning |
| memory | memory.available_mib | MemAvailable: free + reclaimable | Better headroom estimate than free_mib alone on systems with large page caches |
| memory | memory.used_pct | RAM usage as a percentage | Convenient derived field; avoids client-side division |
| memory | memory.active_mib / memory.inactive_mib | Active and inactive page counts | Distinguish working-set pressure from cold cache |
| memory | memory.swap_total_mib / memory.swap_used_mib / memory.swap_used_pct | Swap metrics | Detect swap pressure before OOM; Python omits swap entirely |
| network | network[].interface etc. | Interface name, MAC, driver, operstate, speed, MTU | Identify which NIC is under load and whether the link is at full speed |
| network | network[].rx_bytes_total / tx_bytes_total | Cumulative byte counters | Enables client-side rate computation at any granularity |
| disk | disk[].device_type | nvme, ssd, or hdd | Correlate latency with drive class without parsing device names |
| disk | disk[].capacity_bytes | Raw device capacity | Capacity planning without a separate lsblk call |
| disk | disk[].mounts[] | Per-mount-point space (total/used/available/pct) | Python aggregates all mounts into three scalars; Rust retains per-volume detail |
| disk | disk[].model / vendor / serial | Drive identity | Correlate metrics with physical hardware inventory |
| gpu | gpu[].temperature_celsius | Die temperature | Detect thermal throttling in real time |
| gpu | gpu[].power_watts | Power draw | Power-efficiency analysis; watts-per-FLOP budgeting |
| gpu | gpu[].frequency_mhz | Core clock | Confirm boost clock is active; correlate with thermal state |
| gpu | gpu[].vram_total_bytes | Total VRAM | Baseline for VRAM utilization percentage |
| gpu | gpu[].uuid / name / device_type / host_id | GPU identity | Multi-GPU systems: attribute metrics to specific devices |
resource-tracker – Usage Guide
resource-tracker is a lightweight Linux resource tracker. It polls CPU,
memory, disk, network, and GPU metrics at a configurable interval and emits
each sample as newline-delimited JSON (JSONL) or CSV to stderr or a target file.
Quick start
# Build
cargo build --release
# Run with defaults to track resources used by hashing for 5 seconds
./target/release/resource-tracker -- timeout 5s sha512sum /dev/zero
# Track a specific process tree
./target/release/resource-tracker --pid 1234 --job-name "my-job"
Each line of output is a complete JSON object representing one sample by default:
{
"timestamp_secs": 1718000000,
"job_name": "my-benchmark",
"cpu": { "utilization_pct": 4.6, "per_core_pct": [12.5, 38.0, "..."], "process_cores_used": 3.8, "process_child_count": 4 },
"memory": { "total_mib": 64000, "used_mib": 30468, "used_pct": 47.6, "free_mib": 2289, "available_mib": 18432, "buffers_mib": 263, "cached_mib": 8472, "active_mib": 8157, "inactive_mib": 7404, "swap_total_mib": 0, "swap_used_mib": 0, "swap_used_pct": 0.0 },
"network": [{ "interface": "eth0", "rx_bytes_per_sec": 1200.0, "tx_bytes_per_sec": 400.0, "rx_bytes_total": 9834200, "tx_bytes_total": 312400, "driver": "virtio_net", "operstate": "up", "speed_mbps": 1000, "mtu": 1500, "mac_address": "02:00:00:aa:bb:cc" }],
"disk": [{ "device": "nvme0n1", "model": "Samsung SSD 990 PRO", "device_type": "nvme", "capacity_bytes": 1000204886016, "read_bytes_per_sec": 0.0, "write_bytes_per_sec": 204800.0, "mounts": [{ "mount_point": "/", "filesystem": "ext4", "total_bytes": 999292796928, "used_bytes": 841676800000, "available_bytes": 142023000000, "used_pct": 84.2 }] }],
"gpu": [{ "name": "NVIDIA GeForce RTX 4090", "utilization_pct": 98.0, "vram_used_pct": 72.3, "vram_used_bytes": 17394819072, "vram_total_bytes": 24026849280, "temperature_celsius": 74, "power_watts": 318.5, "frequency_mhz": 2520 }]
}
CLI flags
| Flag | Short | Default | Description |
|---|---|---|---|
| --pid PID | -p | (none) | Root PID of the process tree to attribute CPU usage to. Includes all child processes. |
| --interval SECS | -i | 1 | How often to emit a sample, in seconds. |
| --config FILE | -c | resource-tracker.toml | Path to a TOML config file. Silently ignored if the file does not exist. |
| --format FORMAT | -f | json | Output format: json or csv. |
| --output FILE | -o | stderr | Path to the output file. |
| --quiet | | (off) | Suppress metric output entirely, e.g. when streaming metrics to Sentinel and local output is not needed. |
| --help | -h | | Print help. |
| --version | -V | | Print version. |
Precedence: CLI flags > config file > built-in defaults.
Config file (resource-tracker.toml)
The TOML config file lets you persist settings so you don’t have to repeat CLI flags on every invocation. It is optional – the tool works with no config file at all. Any field set on the CLI overrides the corresponding field in the file.
The default lookup path is resource-tracker.toml in the current working directory.
Use --config /path/to/file.toml to point elsewhere.
Full reference
[job]
# Human-readable label for this tracking session.
# Appears as "job_name" in every emitted JSON sample.
# Useful when multiple runs are collected into the same data store so you can
# filter and group by job.
name = "gpu-benchmark-run-42"
# Root PID of the process to track.
# resource-tracker will walk the full process tree (parent + all descendants)
# and sum their CPU tick usage to report process_cores_used.
# Leave unset to collect system-wide metrics only.
pid = 12345
[tracker]
# Sampling interval in seconds. Lower values give finer resolution at the
# cost of more output volume and slightly higher observer overhead.
# Default: 1
interval_secs = 10
Minimal example – system-wide monitoring
[tracker]
interval_secs = 30
Example – named job with process tracking
[job]
name = "my_job_i_want_to_track"
pid = 98231
[tracker]
interval_secs = 5
Sentinel API streaming and S3 output
When SENTINEL_API_TOKEN is set, the tracker registers the run with the
Sentinel API and streams metric batches to S3 in the background.
No network connections are ever made when the token is absent.
How it works
- At startup, the start_run API endpoint is called to register the run and obtain temporary S3 upload credentials from the Sentinel API.
- A background upload thread wakes every TRACKER_UPLOAD_INTERVAL seconds (default 60), drains the in-memory sample buffer, serializes it as CSV, gzip-compresses it, and PUTs the file to the S3 prefix returned by the API.
- On clean exit (SIGTERM, or the shell-wrapper child exiting), any samples not yet uploaded are base64-encoded and sent inline to finish_run inside a gzip-compressed JSON body. If S3 uploads did occur, only the S3 URIs are sent.
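The drain step of the upload thread can be sketched as follows; serialization, gzip compression, and the S3 PUT are elided, and the names are illustrative:

```rust
use std::mem;

struct SampleBuffer {
    // Serialized CSV rows in the real tool; plain strings here.
    samples: Vec<String>,
}

impl SampleBuffer {
    // Take the whole batch at once, leaving the buffer empty so the
    // collector thread can keep appending without losing samples.
    fn drain(&mut self) -> Vec<String> {
        mem::take(&mut self.samples)
    }
}

fn main() {
    let mut buf = SampleBuffer { samples: vec!["row1".into(), "row2".into()] };
    let batch = buf.drain();
    println!("drained {} rows, {} left", batch.len(), buf.samples.len());
}
```

Swapping the vector out in one move keeps the critical section short, which matters when the buffer is shared behind a mutex between the collector and uploader threads.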
Environment variables
| Variable | Required | Default | Description |
|---|---|---|---|
SENTINEL_API_TOKEN | Yes | – | Bearer token for the Sentinel API. Streaming is disabled when absent or empty. |
SENTINEL_API_URL | No | https://api.sentinel.sparecores.net | Override the Sentinel API base URL. |
TRACKER_UPLOAD_INTERVAL | No | 60 | Seconds between S3 batch uploads. |
Job metadata environment variables
All Section 9.3 metadata fields can be set via environment variable instead of CLI flags. Environment variables are overridden by the corresponding CLI flag when both are supplied.
| Variable | CLI flag |
|---|---|
TRACKER_JOB_NAME | --job-name |
TRACKER_PROJECT_NAME | --project-name |
TRACKER_STAGE_NAME | --stage-name |
TRACKER_TASK_NAME | --task-name |
TRACKER_TEAM | --team |
TRACKER_ENV | --env |
TRACKER_LANGUAGE | --language |
TRACKER_ORCHESTRATOR | --orchestrator |
TRACKER_EXECUTOR | --executor |
TRACKER_EXTERNAL_RUN_ID | --external-run-id |
TRACKER_CONTAINER_IMAGE | --container-image |
Example
export SENTINEL_API_TOKEN="your-token-here"
export TRACKER_JOB_NAME="gpu-benchmark"
export TRACKER_UPLOAD_INTERVAL=30
./resource-tracker --interval 1 -- python train.py
The tracker spawns python train.py, monitors it, uploads a gzip-compressed
CSV batch to S3 every 30 seconds, and calls finish_run when the script exits.
When to use the config file vs CLI flags
| Situation | Recommended approach |
|---|---|
| One-off interactive run | CLI flags – faster, no file to manage |
| Recurring job (cron, SLURM, systemd unit) | TOML file alongside the job definition |
| CI / benchmark pipeline | TOML file checked into the repository |
| Multiple named jobs on the same host | One TOML file per job, point to it with --config |
| Containerized workload | Set config via CLI flags in the CMD / ENTRYPOINT |
Capturing output
Because samples are emitted as newline-delimited JSON to stderr by default, standard Unix tools work directly with the output once stderr is redirected.
# Write to a file
./resource-tracker 2> run.jsonl
# Tail live output
./resource-tracker 2>&1 | tee run.jsonl
# Pretty-print with jq
./resource-tracker 2>&1 | jq .
# Extract only CPU utilization over time
./resource-tracker 2>&1 | jq '{ t: .timestamp_secs, cpu: .cpu.utilization_pct }'
# Watch GPU VRAM usage
./resource-tracker --interval 1 2>&1 | jq '.gpu[] | { name, vram_used_pct }'
Shell-wrapper mode
Pass a command after -- to have the tracker spawn and monitor it:
./resource-tracker --interval 1 --job-name "training-run" -- python train.py --epochs 50
The tracker sets --pid automatically to the spawned child’s PID, emits one
final sample when the child exits, then exits with the child’s exit code.
Rationale: eliminates the two-process boilerplate (tracker & python ...; wait)
and guarantees the tracker always exits with the job’s exit code, making it
transparent to CI systems.
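A minimal sketch of that spawn-and-propagate flow using std::process; the sampling loop and the final sample are elided, and the helper name is illustrative:

```rust
use std::process::Command;

// Spawn the wrapped command, remember its PID for process-tree tracking,
// wait for it, and return its exit code.
fn run_wrapped(cmd: &str, args: &[&str]) -> i32 {
    let mut child = Command::new(cmd).args(args).spawn().expect("spawn failed");
    let _tracked_pid = child.id(); // fed to the same machinery as --pid
    child.wait().expect("wait failed").code().unwrap_or(1)
}

fn main() {
    let code = run_wrapped("sh", &["-c", "exit 7"]);
    println!("child exited with {code}");
    // A real tracker would now call std::process::exit(code).
}
```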
Process tree tracking (--pid)
When --pid is set, every sample includes two extra fields under cpu:
- process_cores_used – fractional cores consumed by the process tree (e.g. 3.8 means the tree is using the equivalent of 3.8 full cores).
- process_child_count – number of live child/descendant processes at the time of sampling (does not include the root PID itself).
If the tracked PID exits during a run, its contribution drops to zero and
process_child_count drops to zero. The tracker itself keeps running.
Rationale: Python’s SystemTracker tracks only the calling process’s own
ticks. Rust walks the full /proc tree so multi-process and multi-threaded
workloads (e.g. PyTorch data-loader workers, MPI ranks, Spark executors) are
attributed correctly under a single root PID.
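The descendant walk can be sketched as a pure function over (pid, ppid) pairs, which the real collector would read from /proc/<pid>/status on each sample; the names here are illustrative:

```rust
use std::collections::HashSet;

// Returns all live descendants of `root`, excluding the root itself,
// matching the process_child_count semantics above. The fixed-point loop
// handles process lists in arbitrary order.
fn descendants(procs: &[(u32, u32)], root: u32) -> HashSet<u32> {
    let mut out = HashSet::new();
    let mut changed = true;
    while changed {
        changed = false;
        for &(pid, ppid) in procs {
            if (ppid == root || out.contains(&ppid)) && out.insert(pid) {
                changed = true;
            }
        }
    }
    out
}

fn main() {
    // root 100 -> children 101 and 102; grandchild 103 under 101
    let procs = [(101, 100), (102, 100), (103, 101), (200, 1)];
    println!("{} descendants", descendants(&procs, 100).len());
}
```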
Finding the PID of a running process:
# By name
pgrep -x python
# Most recently launched
pgrep -n my-training-script
# Already know the command? Launch and capture PID
my-training-script &
./resource-tracker --pid $! --job-name "training-run-1"
GPU support
GPUs are detected automatically at startup via NVML (NVIDIA) and
libamdgpu_top (AMD). No configuration is needed. On hosts without GPU
hardware or without the relevant driver libraries installed, the gpu array
in each sample will be empty – the tracker continues running normally.
Supported accelerators: NVIDIA GPUs (NVML), AMD GPUs (ROCm/AMDGPU).
Rationale: per-GPU temperature, power draw, and clock frequency are not
emitted by Python’s SystemTracker. These fields enable thermal throttle
detection and power-efficiency analysis without a separate monitoring tool.
Metrics reference
cpu
| Field | Unit | Description |
|---|---|---|
utilization_pct | fractional cores | Aggregate cores in use (0.0..N_cores). 4.6 on a 16-core host means ~4.6 vCPUs fully utilized. |
per_core_pct | % each | Per-logical-core utilization array (0.0–100.0). |
utime_secs | seconds | User+nice CPU time across all cores this interval. |
stime_secs | seconds | System CPU time across all cores this interval. |
process_count | count | Runnable processes (procs_running from /proc/stat). |
process_cores_used | fractional cores | Cores consumed by tracked process tree (null if no PID). |
process_child_count | count | Live descendant processes (null if no PID). |
memory
All values in mebibytes (MiB = 1,048,576 bytes).
| Field | Description |
|---|---|
total_mib | Total installed RAM |
free_mib | Truly free RAM (MemFree from /proc/meminfo) |
available_mib | Free + reclaimable RAM (MemAvailable); better estimate of headroom |
used_mib | total - free - buffers - cached (excludes reclaimable cache) |
used_pct | Fraction of total RAM in use |
buffers_mib | Kernel I/O buffer cache |
cached_mib | Page cache including slab-reclaimable (Cached + SReclaimable) |
active_mib | Active pages (recently accessed) |
inactive_mib | Inactive pages (candidates for reclaim) |
swap_total_mib | Total swap space (0 if no swap) |
swap_used_mib | Used swap |
swap_used_pct | Fraction of swap in use |
Rationale: Python’s SystemTracker reports memory in KiB and omits
available_mib, active_mib, inactive_mib, swap_*. Rust reports all
fields in MiB (matching Python resource-tracker PR #9) and adds
available_mib (MemAvailable) which is a more reliable headroom estimate
than free_mib alone on systems with large page caches.
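The used/used_pct derivation can be sketched directly from the table's formula: used = total - free - buffers - cached, where cached already includes SReclaimable. All values are in MiB; the helper names are illustrative:

```rust
fn mem_used_mib(total: f64, free: f64, buffers: f64, cached: f64) -> f64 {
    total - free - buffers - cached
}

fn mem_used_pct(total: f64, used: f64) -> f64 {
    // Guard against a zero total rather than dividing by it.
    if total > 0.0 { used / total * 100.0 } else { 0.0 }
}

fn main() {
    let used = mem_used_mib(64000.0, 2289.0, 263.0, 8472.0);
    println!("used = {used} MiB ({:.1}%)", mem_used_pct(64000.0, used));
}
```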
disk (one entry per whole-disk block device)
| Field | Unit | Description |
|---|---|---|
device | – | Kernel device name, e.g. nvme0n1, sda |
model | – | Drive model string from /sys/block/ |
vendor | – | Vendor string from /sys/block/ |
serial | – | Serial number or WWID |
device_type | – | nvme, ssd, or hdd |
capacity_bytes | bytes | Raw device capacity |
mounts | – | Array of mounted filesystems on this device |
mounts[].mount_point | – | e.g. /, /home |
mounts[].filesystem | – | e.g. ext4, xfs, btrfs |
mounts[].total_bytes | bytes | Filesystem total size |
mounts[].used_bytes | bytes | Space in use |
mounts[].available_bytes | bytes | Space available to non-root users |
mounts[].used_pct | % | Fraction of filesystem in use |
read_bytes_per_sec | bytes/s | Disk read throughput |
write_bytes_per_sec | bytes/s | Disk write throughput |
read_bytes_total | bytes | Cumulative bytes read since boot |
write_bytes_total | bytes | Cumulative bytes written since boot |
Rationale: Python aggregates disk space across all mounts into three scalar CSV columns. Rust retains per-device, per-mount detail in the JSON output, enabling per-volume capacity tracking and per-device I/O attribution that the aggregated CSV cannot express.
network (one entry per non-loopback interface)
| Field | Unit | Description |
|---|---|---|
interface | – | Interface name, e.g. eth0, ens3 |
mac_address | – | Hardware MAC address |
driver | – | Kernel driver name, e.g. igc, virtio_net |
operstate | – | Link state: up, down, unknown |
speed_mbps | Mbps | Negotiated link speed (-1 if not reported) |
mtu | bytes | Maximum transmission unit |
rx_bytes_per_sec | bytes/s | Received throughput |
tx_bytes_per_sec | bytes/s | Transmitted throughput |
rx_bytes_total | bytes | Cumulative bytes received since boot |
tx_bytes_total | bytes | Cumulative bytes sent since boot |
Rationale: Python’s SystemTracker emits only cumulative rx/tx byte
totals per interface. Rust adds per-interval rates, driver identity,
link state, negotiated speed, and MTU, enabling network saturation and
driver-level diagnostics without a separate tool.
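The per-interval rate derivation behind rx/tx_bytes_per_sec can be sketched as rate = (current - previous) / elapsed; the guard against counter resets (e.g. an interface being re-created) is an assumption here, not documented behavior:

```rust
fn rate_bytes_per_sec(prev: u64, curr: u64, elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 || curr < prev {
        0.0 // counter reset or bad interval: report zero rather than garbage
    } else {
        (curr - prev) as f64 / elapsed_secs
    }
}

fn main() {
    println!("{} B/s", rate_bytes_per_sec(9_833_000, 9_834_200, 1.0));
}
```

Because the JSON also carries the cumulative rx_bytes_total/tx_bytes_total, consumers can recompute rates at any granularity with the same formula.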
gpu (one entry per detected accelerator)
| Field | Unit | Description |
|---|---|---|
uuid | – | Vendor-assigned device UUID |
name | – | Device name, e.g. NVIDIA GeForce RTX 4090 |
device_type | – | GPU, NPU, TPU, etc. |
host_id | – | Host-level device identifier (PCIe slot or platform index) |
detail | – | Driver-specific key/value map (PCI IDs, ASIC name, driver version, …) |
utilization_pct | % | Core utilization |
vram_total_bytes | bytes | Total VRAM |
vram_used_bytes | bytes | Used VRAM |
vram_used_pct | % | Fraction of VRAM in use |
temperature_celsius | deg C | Die temperature |
power_watts | W | Power draw |
frequency_mhz | MHz | Core clock |
core_count | count | Shader/compute cores (null if not reported) |