Introduction
This file contains the initial specification/ideation of the resource-tracker-rs project.
Background
The resource-tracker Python package
was created in 2025 to track the resources used by long-running
DS/ML/AI jobs in the cloud and to recommend better cloud resource allocations.
It started as an experiment and resulted in the following features:
- Supports Linux, macOS, and Windows. No dependencies on Linux; requires psutil on other operating systems.
- Tracks CPU, memory, NVIDIA GPU and VRAM (even at the process level), disk usage, and network usage at the system and process level.
- Monitoring happens at a configurable interval (defaults to 1 second), and collects metrics to local (temp) CSV files.
- Overhead is negligible at the 1-second frequency, but the interval cannot go much lower without significant performance overhead.
- Computes aggregated statistics on the metrics (e.g. average and peak values).
- Recommends optimal cloud resource allocations based on the metrics.
- Recommends best-priced cloud servers for the given workload.
- Renders a local HTML report with all the metrics and recommendations.
- Has an R package wrapper for the same functionality.
- Integrates well with Metaflow.
While it worked well for Python and R, we also wanted a standalone tool that can be used as a CLI wrapper to track any process in any environment, and eventually be integrated back into the existing Python and R packages. The overall goal is to have a lightweight binary, compiled cross-platform, that can
- Track a wide range of resource utilization metrics locally – including CPU, memory, GPU and VRAM, disk usage, and network usage.
- Optionally stream these metrics to a remote server for centralized analysis, visualization, and further optimization.
This allows us to avoid embedding any complex logic in the binary and to focus on data collection and delivery, so that an accompanying free/commercial service can deliver the centralized visibility, recommendations, automation, and optimization – while keeping most of the ecosystem open-source and open to extension with other tools and services.
Data Collection
Discovery Tools
What worked great in the Python implementation was the ability to discover:
- The most important specs of the host machine, such as CPU core count, memory amount, etc.
- The cloud environment of the server (when available), such as vendor, region, and instance type.
These limited tools are implemented at
- https://github.com/SpareCores/resource-tracker/blob/main/src/resource_tracker/server_info.py
- https://github.com/SpareCores/resource-tracker/blob/main/src/resource_tracker/cloud_info.py
We are sure the hardware discovery could be improved further, and we aim to
collect at least the following (all prefixed with host_ in the data ingestion
endpoint):
- host_id (text): Unique identifier of the host machine, such as AWS EC2 instance ID or the server S/N.
- host_name (text): Hostname of the machine.
- host_ip (text): IP address of the machine.
- host_allocation (enum): If the server is dedicated to the monitored process, or shared with other processes.
- host_vcpus (int): Number of logical virtual CPU cores.
- host_cpu_model (text): Model of the CPU (e.g. from lscpu output).
- host_memory_mib (int): Amount of memory in MiB.
- host_gpu_model (text): Model of the GPU (e.g. from nvidia-smi output).
- host_gpu_count (int): Number of GPUs.
- host_gpu_vram_mib (int): Amount of VRAM in MiB.
- host_storage_gb (float): Amount of storage in GB.
All these fields are optional, and only collected when available. Users should be able to suppress any sensitive fields, such as the host IP address.
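As an illustrative sketch (not the actual implementation), the discovery can be strictly best-effort, emitting only the host_-prefixed fields that are resolvable on the current machine; collect_host_info is a hypothetical helper name:

```python
import os
import socket

def collect_host_info():
    """Best-effort host discovery: emit only the host_-prefixed fields we can resolve."""
    info = {}
    vcpus = os.cpu_count()
    if vcpus:
        info["host_vcpus"] = vcpus
    try:
        info["host_name"] = socket.gethostname()
    except OSError:
        pass
    try:
        # Linux-only: MemTotal in /proc/meminfo is reported in kB
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    info["host_memory_mib"] = int(line.split()[1]) // 1024
                    break
    except OSError:
        pass  # not Linux, or /proc unavailable: simply omit the field
    return info
```

Suppressing sensitive fields (e.g. the IP address) then amounts to dropping keys from this dict before submission.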
The cloud discovery is implemented via probing the Metadata server endpoints of
the supported cloud providers. We should try to get information about the
following fields (all using the cloud_ prefix in the data ingestion endpoint):
- cloud_vendor_id (text): The cloud provider’s id, mapped to the Spare Cores Navigator’s vendor table reference (e.g. aws).
- cloud_account_id (text): The cloud account id.
- cloud_region_id (text): The cloud region id, mapped to the Spare Cores Navigator’s region table reference (e.g. us-east-1).
- cloud_zone_id (text): The cloud zone id, mapped to the Spare Cores Navigator’s zone table reference (e.g. us-east-1a).
- cloud_instance_type (text): The cloud instance type, mapped to the Spare Cores Navigator’s server table’s api_reference field (e.g. t3a.nano).
Find the Spare Cores Navigator’s vendor, region, zone and server tables at https://github.com/SpareCores/sc-data-dumps/tree/main/data and schemas described at https://dbdocs.io/spare-cores/sc-crawler.
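As a sketch of the probing approach for one provider, AWS exposes these fields via the IMDSv2 instance-identity document (the endpoint and headers below are AWS's documented ones; the short timeout makes the probe cheap off-cloud, where the function simply returns an empty dict):

```python
import json
from urllib.request import Request, urlopen
from urllib.error import URLError

def probe_aws_imds(timeout=0.2):
    """Probe the AWS EC2 IMDSv2 metadata endpoint; return {} when not on AWS."""
    base = "http://169.254.169.254"
    try:
        # IMDSv2 requires a session token obtained via PUT
        tok_req = Request(f"{base}/latest/api/token", method="PUT",
                          headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
        token = urlopen(tok_req, timeout=timeout).read().decode()
        doc_req = Request(f"{base}/latest/dynamic/instance-identity/document",
                          headers={"X-aws-ec2-metadata-token": token})
        doc = json.load(urlopen(doc_req, timeout=timeout))
        return {
            "cloud_vendor_id": "aws",
            "cloud_account_id": doc.get("accountId"),
            "cloud_region_id": doc.get("region"),
            "cloud_zone_id": doc.get("availabilityZone"),
            "cloud_instance_type": doc.get("instanceType"),
        }
    except (URLError, OSError, ValueError):
        return {}  # not on AWS, or metadata service unreachable
```

Other providers (GCP, Azure, etc.) would need analogous probes against their own metadata endpoints, with the results mapped to the same cloud_ fields.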
Metrics to Track
The data ingestion endpoint is rather liberal and any arbitrary metric can be
tracked. The only restriction is that the submitted data needs to be a CSV file
with at least one column named timestamp, which should be a UNIX timestamp in
seconds.
All other columns are treated as metrics. We recommend storing machine-wide
metrics prefixed with system_ and the process-level metrics prefixed with
process_. If distinguishing between machine-wide and process-level metrics is
not feasible, metrics can be submitted without any prefix.
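A minimal sketch of a conforming batch, using the Python stdlib csv module (column names beyond the required timestamp are just examples following the recommended process_ prefix):

```python
import csv
import io

def write_batch(rows):
    """Serialize one batch of samples; 'timestamp' (UNIX seconds) is the
    only required column, everything else is treated as a metric."""
    buf = io.StringIO()
    fieldnames = ["timestamp", "process_cpu_usage", "process_memory_mib"]
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

batch = write_batch([
    {"timestamp": 1735689600, "process_cpu_usage": 1.42, "process_memory_mib": 812},
    {"timestamp": 1735689601, "process_cpu_usage": 1.38, "process_memory_mib": 815},
])
```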
Recommended column names for commonly tracked process-level metrics that are taken into consideration in the backend:
- children: The number of child processes.
- utime: The total user+nice mode CPU time in seconds.
- stime: The total system mode CPU time in seconds.
- cpu_usage: The current CPU usage between 0 and the number of CPUs.
- memory_mib: Current memory usage in MiB. Preferably PSS (Proportional Set Size) on Linux, falling back to RSS (Resident Set Size).
- disk_read_bytes: The total number of bytes read from disk.
- disk_write_bytes: The total number of bytes written to disk.
- gpu_usage: The current GPU utilization between 0 and GPU count.
- gpu_vram_mib: The current GPU memory used in MiB.
- gpu_utilized: The number of GPUs with utilization > 0.
Recommended column names for commonly tracked machine-wide metrics that are taken into consideration in the backend:
- processes: The number of running processes.
- utime: The total user+nice mode CPU time in seconds.
- stime: The total system mode CPU time in seconds.
- cpu_usage: The current CPU usage between 0 and the number of CPUs.
- memory_free_mib: The amount of free memory in MiB.
- memory_used_mib: The amount of used memory in MiB.
- memory_buffers_mib: The amount of memory used for buffers in MiB.
- memory_cached_mib: The amount of memory used for caching in MiB.
- memory_active_mib: The amount of memory used for active pages in MiB.
- memory_inactive_mib: The amount of memory used for inactive pages in MiB.
- disk_read_bytes: The total number of bytes read from all disks.
- disk_write_bytes: The total number of bytes written to all disks.
- disk_space_total_gb: The total disk space in GB.
- disk_space_used_gb: The used disk space in GB.
- disk_space_free_gb: The free disk space in GB.
- net_recv_bytes: The total number of bytes received over network.
- net_sent_bytes: The total number of bytes sent over network.
- gpu_usage: The current GPU utilization between 0 and GPU count.
- gpu_vram_mib: The current GPU memory used in MiB.
- gpu_utilized: The number of GPUs with utilization > 0.
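Note that utime/stime are cumulative counters while cpu_usage is an instantaneous rate; a sketch of how a collector could derive the latter from two consecutive samples of the former (function and field names as used in the tables above):

```python
def cpu_usage(prev, curr, elapsed_seconds):
    """Derive cpu_usage (in cores, 0..N) from two samples of the
    cumulative utime/stime counters taken elapsed_seconds apart."""
    busy = (curr["utime"] - prev["utime"]) + (curr["stime"] - prev["stime"])
    return busy / elapsed_seconds

# Two samples 2 seconds apart during which 3 s of CPU time was burned:
usage = cpu_usage({"utime": 100.0, "stime": 20.0},
                  {"utime": 102.5, "stime": 20.5}, 2.0)  # 1.5 cores busy
```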
No other metrics are officially supported by the backend at the moment, but the user can submit any arbitrary values (even strings!) for future use.
Wishlist for future metrics:
- CPU saturation and efficiency metrics:
  - Load average (1m)
  - L1/L2/L3 cache hit rate
  - TLB miss rate
  - Major page faults
  - iowait
  - IPC (Instructions Per Cycle)
  - Context switches
- GPU saturation and efficiency metrics:
  - PCIe TX and RX throughput, NVLink throughput + theoretical max throughput (e.g. nvidia-smi nvlink -c)
  - Power usage (W)
  - Temperature (C)
- Disk saturation and efficiency metrics:
  - Disk latency (ms)
  - Disk queue length
Overall, we are looking for metrics that can help identify potential bottlenecks and find better cloud servers for the monitored workload.
Metadata
We also want to support collecting the following metadata about the monitored process:
- pid (int): The process ID.
- container_image (text): The container image, including optional tag.
- command (json): JSON array of the command and its arguments.
- env (text): The environment (e.g. dev or prod).
- language (text): The language of the process (e.g. python or r).
- orchestrator (text): The orchestrator of the process (e.g. metaflow).
- executor (text): The executor of the process (e.g. k8s).
- team (text): The team of the process.
- project_name (text): The project name of the process.
- job_name (text): The job name of the process (e.g. flow in metaflow, workflow in flyte).
- stage_name (text): The stage name of the process (e.g. step in metaflow, node in flyte).
- task_name (text): The task name of the process (e.g. task both in metaflow and flyte).
- external_run_id (text): The external run id of the process (e.g. Jenkins build number – internal to the orchestrator).
Most of these fields (all except perhaps command) are to be provided
voluntarily and manually by the user (or the job orchestrator) and should be optional.
Privacy and security concerns are addressed in the public service’s legal docs.
The user should be also able to provide any ad-hoc key-value pairs (tags) for tracking purposes.
Status
The data ingestion endpoint automatically captures the start and end time of the process, and calculates the duration in seconds. It also captures user and organization information based on the user’s credentials. Once a job is finished, statistics and recommendations are calculated and stored in a database, made available to the user via a web interface, API, and potentially via the CLI tool as well in the future.
The CLI tool needs to collect the following fields and pass them to the data ingestion endpoint:
- exit_code (int): The exit code of the process.
- run_status (enum): The status of the run (e.g. success, failure, etc).
Data Streaming
To authenticate with the data ingestion API endpoint, the Resource Tracker needs
to use a long-lived API token set by the user in the SENTINEL_API_TOKEN
environment variable. This needs to be passed as the Authorization header with
the value Bearer <token>.
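A minimal sketch of building that header from the environment (auth_headers is a hypothetical helper name):

```python
import os

def auth_headers():
    """Build the Authorization header from SENTINEL_API_TOKEN; fail early if unset."""
    token = os.environ.get("SENTINEL_API_TOKEN")
    if not token:
        raise RuntimeError("SENTINEL_API_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}
```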
At the start of the Resource Tracker, hit the data ingestion endpoint to
register the start of a Run along with the following optional parameters:
- metadata (e.g. project_name etc.)
- server and cloud discovery information (e.g. number of CPUs and/or actual instance type)
The response contains:
- run_id: Should be stored until the end of the run, as all future API calls will need to reference it.
- upload_uri_prefix: An S3 URI prefix to upload the metrics to.
- upload_credentials: Temporary AWS STS session credentials for upload authentication, including an expires_at timestamp.
Then the Resource Tracker should start a background thread (or similar solution)
to upload collected metrics in batches (e.g. every 1 minute) as new objects
under the upload_uri_prefix as gzipped CSV files. The Resource Tracker should
also keep track of the uploaded URIs.
When the temporary upload credentials expire, the Resource Tracker should hit the data ingestion endpoint to refresh the credentials.
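A sketch of the session state the tracker could hold between uploads, with a proactive expiry check so credentials are refreshed slightly before the STS session lapses (the class and a UNIX-seconds expires_at are assumptions, not the actual data contract):

```python
import time
from dataclasses import dataclass

@dataclass
class UploadSession:
    run_id: str
    upload_uri_prefix: str     # S3 URI prefix for batch objects
    credentials: dict          # temporary STS credentials, incl. "expires_at" (UNIX s)

    def needs_refresh(self, margin_seconds=120):
        """Refresh shortly *before* expiry so an in-flight upload never fails."""
        return self.credentials["expires_at"] - time.time() < margin_seconds
```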
When the tracked process finishes, the Resource Tracker should hit the data ingestion endpoint to register the end of the run. This takes:
- The run_id,
- The status of the run (e.g. success, failure, etc.) along with an optional exit_code as described above,
- And either the list of the uploaded URIs as data_uris along with data_source set to s3, or if no S3 uploads happened yet (e.g. a short-duration run), then the CSV file as data_csv along with data_source set to local.
The endpoint processes the data synchronously and returns statistics.
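The branching between the two data sources can be sketched as follows (field names are the ones listed above; the helper itself is illustrative):

```python
def finish_payload(run_id, run_status, exit_code=None,
                   uploaded_uris=None, csv_text=None):
    """Close a run: reference S3 objects if any batches were uploaded,
    otherwise inline the CSV collected so far."""
    payload = {"run_id": run_id, "run_status": run_status}
    if exit_code is not None:
        payload["exit_code"] = exit_code
    if uploaded_uris:
        payload["data_source"] = "s3"
        payload["data_uris"] = uploaded_uris
    else:  # short run, nothing uploaded yet
        payload["data_source"] = "local"
        payload["data_csv"] = csv_text
    return payload
```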
More Details
Find the data ingestion API endpoints docs at https://api.sentinel.sparecores.net/docs, including the data contracts and API references.
Rationale
resource-tracker is a Rust rewrite of the Python
resource-tracker library.
It preserves full CSV column parity with the Python implementation while adding
new capabilities that are difficult or impossible to express in the original.
Why Rust
| Property | Python resource-tracker | resource-tracker |
|---|---|---|
| Runtime dependency | Python interpreter + psutil | Single static binary |
| Startup overhead | ~200-500 ms | < 5 ms |
| Observer CPU overhead | ~0.5-1% per core | < 0.1% per core |
| Memory footprint | ~30-60 MiB (interpreter) | ~2-4 MiB |
| Deployment | pip / uv install | Copy binary |
The lower observer overhead matters when tracking short-lived or CPU-intensive workloads where the tracker itself would otherwise appear in the numbers it is collecting.
New user-facing functionality
Shell-wrapper mode
./resource-tracker --interval 1 -- python train.py --epochs 50
Pass any command after -- and the tracker spawns it, sets --pid
automatically, emits one final sample on exit, and forwards the child’s
exit code. This eliminates the two-process boilerplate
(tracker & child; wait) and makes the tracker transparent to CI systems
and schedulers that check exit codes.
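The wrapper semantics boil down to spawn, sample, and forward. A stripped-down Python sketch of that control flow (the Rust binary's actual sampling loop is elided as a comment):

```python
import subprocess
import sys

def run_wrapped(argv, interval=1.0):
    """Spawn the tracked command; a real tracker would sample the child's
    process tree every `interval` seconds here. Forward the child's exit code."""
    child = subprocess.Popen(argv)
    # ... sampling loop polling child.pid at `interval` would go here ...
    return child.wait()

# Forwarding the exit code keeps the wrapper transparent to CI and schedulers.
code = run_wrapped([sys.executable, "-c", "import sys; sys.exit(7)"])
```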
Full process tree tracking (--pid)
Python’s SystemTracker attributes CPU ticks only to the root process.
Rust walks the full /proc tree and sums every descendant (workers,
threads, MPI ranks, Spark executors) under the given root PID. Two fields
appear in every JSON sample when --pid is active:
- cpu.process_cores_used – fractional cores consumed by the whole tree
- cpu.process_child_count – live descendant count at each sample
Sentinel API streaming and S3 upload
When SENTINEL_API_TOKEN is set, the tracker registers the run, streams
gzip-compressed CSV batches to S3 every TRACKER_UPLOAD_INTERVAL seconds
(default 60), and posts a finish_run call on clean exit. No network
connections are made when the token is absent.
TOML config file + environment variable overrides
All settings (interval, job name, PID, metadata) can be persisted in a
resource-tracker.toml file alongside the job definition. Every field
also has a TRACKER_* environment variable override, which is convenient
for containerized or CI environments where config files are impractical.
Richer metrics (JSON superset)
The CSV output matches Python column-for-column. The JSON output carries additional fields not expressible as Python CSV scalars.
CPU
- per_core_pct[] – per-logical-core utilization; identifies hot cores and NUMA imbalance
- utilization_pct – expressed as fractional cores (0.0..N_cores), not a percentage clamped to 100; more useful on multi-core hosts
Memory
- available_mib (MemAvailable) – free + reclaimable; a more reliable headroom estimate than free_mib on systems with large page caches
- swap_total_mib, swap_used_mib, swap_used_pct – swap pressure visible before OOM; Python omits swap entirely
- active_mib / inactive_mib – distinguish working-set pressure from cold cache
Disk
- Per-device, per-mount detail instead of three aggregated scalars; enables per-volume capacity tracking and per-device I/O attribution
- device_type (nvme, ssd, hdd), model, vendor, serial – correlate metrics with physical hardware without a separate lsblk call
- Per-device hardware sector size read from sysfs; correct byte counts on 4K-native drives where a hard-coded 512 would under-count I/O by 8x
Network
- Per-interval rates (rx_bytes_per_sec, tx_bytes_per_sec) in addition to cumulative totals; no client-side diff required
- driver, operstate, speed_mbps, mtu per interface; identify which NIC is under load and whether the link is running at full negotiated speed
GPU (NVIDIA and AMD)
Python emits no GPU metrics at all. Rust supports both NVIDIA (NVML) and AMD (ROCm/AMDGPU) accelerators via runtime dynamic loading, with no build-time driver dependencies. Additional fields beyond utilization and VRAM:
- temperature_celsius – detect thermal throttling in real time
- power_watts – power-efficiency analysis; watts-per-FLOP budgeting
- frequency_mhz – confirm boost clock is active; correlate with thermal state
- uuid, name, host_id – attribute metrics to specific devices in multi-GPU systems
Open-Source Resource Monitoring Landscape
Competitive Analysis for resource-tracker (SpareCores)
Prepared: March 25, 2026
Context: Phase 1 feasibility assessment for a Rust/Linux CLI implementation of ResourceTracker
Reference tool: https://github.com/SpareCores/resource-tracker
Executive Summary
resource-tracker occupies a specific and underserved niche: a lightweight, zero-dependency, batch-job-oriented process + system resource monitor with workflow framework integration (Metaflow), visualization via cards, and cloud server recommendations. The open-source landscape has many partial overlaps but no single tool matches all its characteristics simultaneously.
The tools below are organized into meaningful categories. Most tools are either:
- Too low-level (profilers that require code instrumentation or produce flame graphs rather than time-series resource logs)
- Too heavy (system daemons, full observability stacks)
- Too narrow (single-resource: CPU only, or memory only, or GPU only)
- Not batch-job oriented (designed for long-running services, not scripts that run and exit)
Category 1: Python Libraries for Process/System Resource Monitoring
These are the closest functional analogues to resource-tracker in the Python ecosystem.
1.1 psutil
- URL: https://github.com/giampaolo/psutil
- Language: Python (C extension)
- Description: The foundational library for cross-platform system/process information in Python.
resource-tracker itself uses psutil as an optional backend on non-Linux systems. psutil retrieves CPU, memory, disk, network, and process-level data programmatically but provides no time-series tracking, no decorator/wrapper API, no visualization, and no batch job reporting.
- Key features: CPU %, memory (RSS/PSS/USS/VMS), per-process I/O, network I/O, disk usage, process tree traversal. Cross-platform (Linux, macOS, Windows).
- Difference: Raw data API only. No tracking loop, no reports, no workflow integration. It is a building block, not a solution.
1.2 memory_profiler
- URL: https://github.com/pythonprofilers/memory_profiler
- Language: Python
- Description: Line-by-line memory usage profiler for Python scripts. Uses the @profile decorator and the mprof CLI to record memory usage over time and plot it. Built on psutil.
- Key features: Line-level memory profiling, time-series memory plot via mprof, @profile decorator, memory_usage() API.
- Difference: Memory only (no CPU, GPU, disk, network). Requires code instrumentation for line-level profiling. Targeted at developers finding memory leaks, not at batch job operators seeking resource utilization logs.
1.3 Scalene
- URL: https://github.com/plasma-umass/scalene
- Language: Python + C++
- Description: High-performance, high-precision CPU, GPU, and memory profiler for Python. Uniquely profiles CPU time, GPU time, and memory at the line level simultaneously. Includes AI-powered optimization suggestions and an interactive web UI.
- Key features: Line-level CPU + GPU + memory profiling, separates Python vs native time, web-based interactive report, minimal overhead (~10-20%).
- Difference: A developer profiler (find bottlenecks in code), not a resource utilization logger for batch jobs. Does not track network or disk I/O, does not integrate with workflow tools, does not produce time-series utilization logs for operational use.
1.4 Memray
- URL: https://github.com/bloomberg/memray
- Language: Python + C++
- Description: Bloomberg’s memory profiler for Python. Tracks every allocation in Python, native extensions, and the interpreter itself. Produces flame graphs, heap charts, and other visualizations.
- Key features: Full allocation tracking (Python + C/C++), flame graphs, live mode, Jupyter integration, reporter API.
- Difference: Memory only, developer-oriented (find leaks/hotspots in code). Does not track CPU, GPU, disk, or network. Not designed for batch job monitoring.
1.5 Fil (filprofiler)
- URL: https://github.com/pythonspeed/filprofiler
- Language: Python + Rust
- Description: Memory profiler from pythonspeed targeting data scientists and scientific computing. Finds peak memory usage and identifies what code caused the peak. Produces flame graphs.
- Key features: Peak memory tracking (captures C and Python allocations), flame graphs, designed for NumPy/Pandas workloads, CLI usage.
- Difference: Memory only, developer-oriented. No CPU, GPU, disk, network. Produces offline profiling reports, not operational time-series logs.
1.6 pyinstrument
- URL: https://github.com/joerick/pyinstrument
- Language: Python
- Description: Sampling call-stack profiler for Python. Samples the call stack every 1ms and shows a readable summary of where time is spent. Supports context manager and decorator API.
- Key features: Low-overhead sampling, context manager (with Profiler()), decorator, CLI, HTML/text/JSON output, async support.
- Difference: CPU time only (call stack), no memory/GPU/disk/network. Developer-oriented (why is code slow?), not a resource utilization monitor.
1.7 py-spy
- URL: https://github.com/benfred/py-spy
- Language: Rust
- Description: Sampling profiler for Python programs written in Rust. Attaches to a running Python process without modifying it. Can generate flame graphs or a top-like display.
- Key features: Attaches to running process (no code changes), flame graphs, top-like live view, very low overhead, works across OS.
- Difference: CPU only (call stack). No memory, GPU, disk, or network tracking. Attach-to-process model differs from resource-tracker’s wrap-a-job model.
1.8 Austin
- URL: https://github.com/P403n1x87/austin
- Language: C
- Description: Python frame stack sampler for CPython. Samples the Python interpreter’s memory space directly to retrieve running thread stacks. Extremely low overhead.
- Key features: Zero-instrumentation, pure C, very low overhead, multi-thread and multi-process support, output compatible with flame graph tools.
- Difference: CPU/call stack profiling only. No resource utilization metrics (memory, GPU, disk, network).
1.9 Glances
- URL: https://github.com/nicolargo/glances
- Language: Python
- Description: Cross-platform system monitoring tool with a rich curses/web UI. Shows CPU, memory, disk, network, process list, temperatures, GPU (via plugin), Docker containers, and more. Can export data to InfluxDB, CSV, Prometheus, etc.
- Key features: Real-time monitoring, web UI, REST API, exporters (InfluxDB, Prometheus, CSV, JSON), Docker/container awareness, GPU plugin, cross-platform (Linux, macOS, Windows, BSD).
- Difference: A long-running system monitor daemon/interactive tool, not designed to wrap a batch job, produce a per-job report, or integrate with workflow frameworks. No job-level summary reports.
1.10 nvitop
- URL: https://github.com/XuehaiPan/nvitop
- Language: Python
- Description: Interactive NVIDIA GPU process viewer with a rich terminal UI. Goes beyond nvidia-smi by showing per-process GPU/VRAM usage in real time, and supports programmatic API access.
- Key features: Per-process GPU utilization and VRAM, process tree, interactive kill/signal, rich terminal UI, Python API (ResourceMetricCollector).
- Difference: GPU-only (NVIDIA). Covers system- and process-level GPU metrics well. Its ResourceMetricCollector API is a meaningful overlap with resource-tracker for GPU tracking. No CPU/memory/disk/network integration.
1.11 gpustat
- URL: https://github.com/wookayin/gpustat
- Language: Python
- Description: Simple command-line utility for querying and monitoring NVIDIA GPU status. Aggregates nvidia-smi output with color-coded display. Supports --watch mode.
- Key features: GPU utilization, VRAM usage, temperature, power draw, per-process GPU use, JSON output, watch mode.
- Difference: NVIDIA GPU only, read-only display tool, no time-series logging, no CPU/memory/disk/network.
1.12 pynvml / nvidia-ml-py
- URL: https://github.com/gpuopenanalytics/pynvml
- Language: Python (NVML binding)
- Description: Python bindings for NVIDIA’s NVML C library, enabling programmatic GPU diagnostics. Used as a building block by gpustat, nvitop, and resource-tracker itself.
- Key features: Full NVML API access: GPU utilization, VRAM, temperature, power, clock speed, process-level GPU usage, fan speed.
- Difference: Raw API, no tracking loop, no reporting. A building block.
1.13 CodeCarbon
- URL: https://github.com/mlco2/codecarbon
- Language: Python
- Description: Tracks CPU, GPU, and RAM energy consumption and converts it to estimated CO2 emissions. Designed for ML training runs. Provides decorator and context manager APIs.
- Key features: @track_emissions decorator, context manager, estimates CO2 equivalent, per-run reporting, dashboard, supports Intel RAPL and NVML.
- Difference: Focused on energy/carbon footprint rather than raw resource utilization metrics. Does not track disk I/O or network. Closest in UX philosophy (decorator for batch scripts) but different output goal.
1.14 CarbonTracker
- URL: https://github.com/lfwa/carbontracker
- Language: Python
- Description: Tracks and predicts energy consumption and carbon footprint of deep learning model training. Can stop training when predicted impact exceeds a threshold.
- Key features: Predictive carbon footprint, supports GPU and CPU energy, training-run oriented, can send alerts.
- Difference: Energy/carbon focused, ML training specific, no disk/network tracking.
1.15 pyRAPL
- URL: https://github.com/powerapi-ng/pyRAPL
- Language: Python
- Description: Measures energy consumption of Python code using Intel RAPL (Running Average Power Limit) hardware counters. Provides decorator and context manager APIs.
- Key features: CPU socket, DRAM, and integrated GPU energy measurement, decorator and with-block APIs, per-domain granularity.
- Difference: Intel RAPL only (Intel CPUs since Sandy Bridge); measures energy, not utilization percentage; no GPU computation metrics, no disk/network.
1.16 pyJoules
- URL: https://github.com/powerapi-ng/pyJoules
- Language: Python
- Description: Captures energy consumption of code snippets using Intel RAPL and NVIDIA NVML. Provides decorator and context manager APIs with breakpoints.
- Key features: Multi-device energy capture (CPU, DRAM, NVIDIA GPU), decorator API, MongoDB and Pandas export handlers.
- Difference: Energy measurement, not utilization tracking. Requires Intel RAPL-capable hardware.
1.17 PowerAPI
- URL: https://github.com/powerapi-ng/powerapi
- Language: Python
- Description: Middleware framework for building software-defined power meters. Estimates power at process, container, VM, or application level. Can use hardware counters or performance counters.
- Key features: Pluggable sensors and estimators, multiple granularity levels (process, container, VM), real-time power estimation.
- Difference: Power/energy framework requiring configuration and sensor setup. Not a drop-in decorator for batch jobs.
1.18 eco2AI
- URL: https://github.com/sb-ai-lab/eco2AI
- Language: Python
- Description: Tracks carbon emissions while training/inferring Python ML models. Accounts for CPU, GPU, and RAM energy consumption.
- Key features: @track_emissions decorator, real-time emission monitoring, CSV reporting.
- Difference: Carbon/energy focus, similar decorator pattern to resource-tracker, no disk/network.
1.19 pyperf
- URL: https://github.com/psf/pyperf
- Language: Python
- Description: Python Software Foundation toolkit for writing and running benchmarks. Includes memory tracking (--track-memory, --tracemalloc) as part of benchmark metadata collection.
- Key features: Benchmark calibration, worker process management, memory peak tracking, JSON results, statistical analysis.
- Difference: Benchmarking framework, not a general resource monitor. Memory tracking is incidental to benchmarking.
1.20 ClearML
- URL: https://github.com/clearml/clearml
- Language: Python
- Description: Open-source MLOps platform. Automatically tracks GPU, CPU, memory, and network metrics during ML experiment runs. Provides an experiment tracker, data manager, orchestrator, and more.
- Key features: Automatic system metric logging (GPU, CPU, memory, network), experiment tracking, model registry, pipeline orchestration, web UI.
- Difference: Full MLOps platform (not a lightweight library). Requires a ClearML server. Targets ML experiments rather than general batch jobs.
1.21 python-resmon
- URL: https://github.com/xybu/python-resmon
- Language: Python
- Description: Lightweight resource monitor that records CPU usage, RAM usage, disk I/O, and NIC speed, outputting data in CSV format for post-processing.
- Key features: CSV output, configurable polling interval, system-level metrics, easy post-processing.
- Difference: System-level only (no per-process tracking), no GPU, no visualization, no workflow integration. Small utility script rather than a library.
Category 2: Interactive Terminal Monitors (System-Level)
These tools provide real-time visual monitoring of system resources. They do not produce per-job reports or integrate with batch workflows, but they are widely used for manual resource observation.
2.1 htop
- URL: https://github.com/htop-dev/htop
- Language: C
- Description: Interactive process viewer and system monitor. The modern replacement for top. Shows per-CPU usage, memory, swap, and a process list with tree view.
- Key features: Interactive (kill, renice, filter), color-coded per-CPU bars, tree view, mouse support, cross-platform.
- Difference: Interactive visual tool only. No data capture, no time-series, no batch job integration.
2.2 btop / btop++
- URL: https://github.com/aristocratos/btop
- Language: C++
- Description: Advanced terminal resource monitor. Third generation of bashtop->bpytop->btop++. Shows CPU, memory, disk I/O, network, and process list with rich ASCII art graphs.
- Key features: Responsive UI, mouse support, GPU support (Nvidia/AMD/Intel via plugins), disk I/O, network I/O, process filtering, themes.
- Difference: Interactive visual tool only. No data export, no batch job tracking.
2.3 bpytop
- URL: https://github.com/aristocratos/bpytop
- Language: Python
- Description: Python predecessor to btop++. Linux/macOS/FreeBSD resource monitor with animated ASCII graphs.
- Key features: CPU, memory, disk, network, process list, ASCII graphs.
- Difference: Interactive visual tool. Superseded by btop++.
2.4 bashtop
- URL: https://github.com/aristocratos/bashtop
- Language: Bash
- Description: Original Bash-based resource monitor from the same developer. Ancestor of bpytop and btop++.
- Key features: CPU, memory, disk, network, process monitoring in pure Bash.
- Difference: Superseded by btop++. Interactive visual only.
2.5 glances (see 1.9 above)
- Interactive + exportable, see Category 1 entry.
2.6 atop
- URL: https://github.com/Atoptool/atop
- Language: C
- Description: Advanced interactive system and process monitor for Linux. Records all system activity and writes to binary log files for later replay/analysis. Integrates with atopsar for historical reporting.
- Key features: Full system activity logging (CPU, memory, disk, network, process), persistent binary logs, replay mode, atopsar for reporting.
- Difference: Long-running daemon for system-wide logging. Not designed to wrap a specific job; tracks the whole system. Closest among CLI tools to providing historical per-process data.
2.7 nmon (Nigel’s Monitor)
- URL: http://nmon.sourceforge.net/
- Language: C
- Description: Performance monitoring tool for AIX and Linux. Provides real-time view and can capture data to CSV for later analysis with nmon Analyser.
- Key features: CPU, memory, disk I/O, network, filesystem, processes; CSV capture mode, lightweight.
- Difference: System-wide monitor. No batch job integration or workflow decorator. The CSV output mode is useful for offline analysis.
2.8 collectl
- URL: http://collectl.sourceforge.net/
- Language: Perl
- Description: Collects a broad set of Linux system statistics (CPU, memory, network, disk, inodes, processes, NFS, TCP, sockets) and can write to files, print to stdout, or feed to Graphite/ganglia.
- Key features: Wide metric coverage, multiple output formats (CSV, plot, etc.), daemon or one-shot mode.
- Difference: System-wide collection daemon. No batch job wrapping, no workflow integration.
2.9 sysstat (sar/sadc/sadf/iostat/pidstat/mpstat)
- URL: https://github.com/sysstat/sysstat
- Language: C
- Description: Collection of Linux performance monitoring utilities. `sar` collects and reports system activity historically; `pidstat` reports per-process CPU, memory, and I/O; `iostat` reports disk I/O; `sadc` is the backend data collector.
- Key features: Historical data collection, per-process stats via `pidstat`, JSON/CSV/XML output via `sadf`, schedulable via cron/systemd, very low overhead.
- Difference: System and process monitoring utilities, not designed for batch job wrapping. `pidstat` is the closest to per-job process monitoring but requires manual invocation.
2.10 nvtop
- URL: https://github.com/Syllo/nvtop
- Language: C
- Description: (h)top-like task monitor for GPUs and accelerators. Supports AMD, Apple M1/M2 (limited), Huawei Ascend, Intel, NVIDIA, Qualcomm, Broadcom, Rockchip.
- Key features: Multi-GPU and multi-vendor support, real-time GPU/VRAM utilization, per-process GPU use, interactive UI.
- Difference: GPU-focused interactive monitor. No data export, no CPU/memory/disk/network integration.
2.11 vtop
- URL: https://github.com/MrRio/vtop
- Language: JavaScript (Node.js)
- Description: Graphical terminal activity monitor with Unicode braille charts. Groups processes sharing the same name (e.g., NGINX master + workers).
- Key features: ASCII charts, process grouping, extensible via plugins.
- Difference: Interactive visual only, no data capture. Note: project appears unmaintained.
2.12 Netdata
- URL: https://github.com/netdata/netdata
- Language: C (agent core)
- Description: Real-time performance monitoring with per-second metrics and a powerful web UI. 800+ integrations. Most-starred monitoring project on GitHub (76k+ stars).
- Key features: Per-second metrics, web dashboard, alerts, ML anomaly detection, 800+ integrations (Docker, Kubernetes, StatsD, OpenMetrics), process-level metrics, GPU plugins.
- Difference: Full-stack observability daemon. Requires installation as a service. Not designed for wrapping a batch job.
Category 3: eBPF / Kernel-Level Tracing Tools
These tools use Linux eBPF (extended Berkeley Packet Filter) for highly efficient, zero-instrumentation tracing deep in the kernel. Most relevant for system-level visibility with very low overhead.
3.1 BCC (BPF Compiler Collection)
- URL: https://github.com/iovisor/bcc
- Language: C + Python/Lua frontends
- Description: Toolkit for creating efficient kernel tracing and manipulation programs using eBPF. Includes ready-made tools (execsnoop, biolatency, tcplife, memleak, etc.) and a framework for writing custom eBPF programs with Python frontends.
- Key features: Kernel + userspace tracing, network/disk/memory/CPU tools, Python API for custom programs, very low overhead.
- Difference: Requires kernel support (Linux 4.1+), root privileges, and knowledge of eBPF to build custom tools. Not a drop-in batch job monitor.
3.2 bpftrace
- URL: https://github.com/bpftrace/bpftrace
- Language: C++ (awk/DTrace-like scripting language)
- Description: High-level tracing language for Linux eBPF. Write concise one-liners or short scripts for ad-hoc analysis.
- Key features: High-level scripting, LLVM backend, supports tracepoints, kprobes, uprobes, usdt. One-liner analysis.
- Difference: Ad-hoc kernel tracing tool. Requires root and kernel support. Not designed for operational batch job monitoring.
3.3 Parca / Parca Agent
- URL: https://github.com/parca-dev/parca
- Language: Go
- Description: Continuous profiling for CPU and memory usage, down to the line number and throughout time. Parca Agent is an eBPF-based always-on profiler with Kubernetes auto-discovery. Uses pprof format.
- Key features: Zero-instrumentation eBPF profiling, <1% overhead, continuous collection, icicle graph UI, SQL-queryable profile storage, multi-language support.
- Difference: Continuous profiling infrastructure (runs as a DaemonSet on Kubernetes nodes). Not a per-job wrapper. Heavy infrastructure requirement.
3.4 Pyroscope (Grafana)
- URL: https://github.com/grafana/pyroscope
- Language: Go
- Description: Continuous profiling database and platform (formed from merger of Phlare + Pyroscope). Stores profiling data from applications instrumented with Pyroscope SDKs or from eBPF agents. Integrates with Grafana.
- Key features: SDK-based push profiling (Python, Go, Java, Ruby, .NET, Rust, PHP, Node.js), eBPF pull mode, flame graphs, Grafana integration, scalable storage.
- Difference: Continuous profiling infrastructure. Requires a server and SDK integration. Not a lightweight batch job wrapper.
Category 4: Linux Performance Profiling Tools (C/C++/Native)
These tools profile native code at a low level. Most are developer-focused profilers rather than operational monitors.
4.1 perf (Linux perf_events)
- URL: https://perfwiki.github.io/main/
- Language: C (Linux kernel subsystem)
- Description: The primary Linux performance tool. Samples CPU events using hardware performance counters, traces system calls, and instruments kernel/userspace functions. Foundation for many other tools.
- Key features: Hardware counter sampling, call graph recording, per-process and system-wide, flame graph generation (via FlameGraph scripts), supports all architectures.
- Difference: Low-level developer profiler. Requires root for many features. No time-series resource logging, no workflow integration.
4.2 FlameGraph
- URL: https://github.com/brendangregg/FlameGraph
- Language: Perl
- Description: Stack trace visualization toolkit by Brendan Gregg. Generates SVG flame graphs from perf, DTrace, SystemTap, and other profiler output.
- Key features: CPU, memory, and off-CPU flame graphs, works with many backends.
- Difference: Visualization tool for profiler output, not a monitoring tool itself.
4.3 gperftools (Google Performance Tools)
- URL: https://github.com/gperftools/gperftools
- Language: C++
- Description: Collection from Google: fast malloc (TCMalloc), CPU profiler, heap profiler, and heap checker. Used via `LD_PRELOAD` or explicit linking.
- Key features: CPU profiling (sampling), heap profiling, heap leak detection, pprof visualization, multi-threaded support.
- Difference: Developer profiler requiring code linking or LD_PRELOAD. No time-series operational monitoring, no disk/network/GPU.
4.4 Valgrind / Massif / Callgrind
- URL: https://valgrind.org/
- Language: C
- Description: Instrumentation framework for building dynamic analysis tools. Massif is its heap profiler; Callgrind is its call graph profiler; Memcheck is its memory error detector.
- Key features: Complete heap tracking, memory leak detection, call graph analysis, massif-visualizer GUI.
- Difference: High-overhead instrumentation (10-50x slowdown). Developer tool, not operational monitor. No GPU, disk, or network metrics.
4.5 Heaptrack
- URL: https://github.com/KDE/heaptrack
- Language: C++ + Python
- Description: Fast heap memory profiler for Linux, designed as a faster, lower-overhead alternative to Valgrind/Massif. Traces all allocations and annotates with stack traces.
- Key features: Lower overhead than Valgrind, flame graph output, heaptrack_gui for visualization, finds memory leaks and allocation hotspots.
- Difference: Memory only, developer profiler. No GPU, CPU utilization, disk, or network.
4.6 Perfetto
- URL: https://github.com/google/perfetto
- Language: C++
- Description: Google’s open-source production-grade system profiling and tracing tool. Default tracing system for Android and used in Chromium. Can capture CPU scheduling, memory, I/O, GPU events, and custom trace points.
- Key features: Multi-process system trace, SQL-based analysis, browser-based UI, heap profiling (heapprofd), CPU frequency and scheduling, Android + Linux support.
- Difference: Complex tracing infrastructure primarily targeting Android/embedded and browser use cases. Not a lightweight batch job wrapper.
4.7 async-profiler
- URL: https://github.com/async-profiler/async-profiler
- Language: C (JVM agent)
- Description: Low-overhead sampling CPU and heap profiler for JVM (Java/Kotlin/Scala/Clojure). Uses AsyncGetCallTrace + perf_events to avoid safepoint bias.
- Key features: CPU + heap sampling, flame graphs, JFR files, tracks native + JVM code, suitable for production.
- Difference: JVM-specific. No Python/R/general process monitoring. No disk, network, or GPU.
4.8 TAU (Tuning and Analysis Utilities)
- URL: https://www.cs.uoregon.edu/research/tau/home.php
- Language: C++ (with Python, Fortran, Java support)
- Description: Comprehensive profiling and tracing toolkit for HPC parallel programs (MPI, OpenMP, CUDA). Supports hardware counters, GPU profiling, and generates call graphs.
- Key features: Parallel program profiling (MPI, OpenMP), hardware counters, GPU support, ParaProf visualization, call graph.
- Difference: HPC research tool for parallel program performance analysis. Complex setup, not a lightweight batch job wrapper.
4.9 HPCToolkit
- URL: https://hpctoolkit.org/
- Language: C/C++
- Description: Sampling-based measurement and analysis suite for HPC programs on CPUs and GPUs, designed to scale to supercomputers.
- Key features: 1-5% overhead sampling, full calling context, hpcviewer GUI, GPU support.
- Difference: HPC research tool, complex setup, not designed for general batch jobs or Python/R scripts.
Category 5: Rust Tools
5.1 below (Facebook/Meta)
- URL: https://github.com/facebookincubator/below
- Language: Rust
- Description: Time-traveling resource monitor for modern Linux systems. Records system activity to disk and allows replay of historical data. Cgroup-aware with PSI (Pressure Stall Information) support.
- Key features: Record + replay mode, cgroup hierarchy view, PSI metrics, process-level stats, live mode, persistent storage. Built on cgroupv2.
- Difference: System-wide monitoring daemon. Designed for Linux infrastructure monitoring, not for wrapping individual batch jobs. No workflow integration. Very strong on cgroup/container awareness.
5.2 samply
- URL: https://github.com/mstange/samply
- Language: Rust
- Description: Command-line sampling CPU profiler for macOS, Linux, and Windows. Uses Linux perf events. Spawns the target process as a subprocess and profiles it, then opens Firefox Profiler UI.
- Key features: Subprocess wrapping (`samply record ./your_program`), Firefox Profiler UI, local symbol resolution, flame graphs.
- Difference: CPU profiling only (call stacks). No memory, GPU, disk, or network tracking. Developer profiler.
5.3 Bytehound
- URL: https://github.com/koute/bytehound
- Language: Rust
- Description: Memory profiler for Linux. Intercepts all heap allocations via `LD_PRELOAD`. Produces detailed allocation timelines with stack traces.
- Key features: Full allocation tracking, web-based GUI, Rhai scripting for analysis, multi-architecture (AMD64, ARM, AArch64, MIPS64).
- Difference: Memory only. Developer profiler. Requires `LD_PRELOAD`; no GPU/disk/network.
5.4 pprof-rs
- URL: https://github.com/tikv/pprof-rs
- Language: Rust
- Description: Rust CPU profiler using backtrace-rs. Generates pprof-compatible output.
- Key features: CPU profiling for Rust applications, pprof output, flame graphs, low overhead.
- Difference: CPU profiler for Rust programs only.
Category 6: System-Level Daemons and Metrics Collection Infrastructure
These tools are designed for long-running infrastructure monitoring, not individual batch jobs, but represent the broader ecosystem.
6.1 Prometheus + node_exporter
- URL: https://github.com/prometheus/node_exporter
- Language: Go
- Description: Prometheus exporter for hardware and OS metrics from `/proc` and `/sys`. Exposes CPU, memory, disk, network, filesystem, and more as Prometheus metrics.
- Key features: Pull-based metrics, scrape-able endpoint, very broad metric coverage, alerting via Prometheus + Alertmanager.
- Difference: Infrastructure monitoring daemon. Requires Prometheus server. No per-job tracking.
6.2 Prometheus Pushgateway
- URL: https://github.com/prometheus/pushgateway
- Language: Go
- Description: Push acceptor for ephemeral and batch jobs. Allows short-lived jobs to push metrics to Prometheus (which normally pulls). Stores last-received metrics until explicitly deleted.
- Key features: HTTP push endpoint, labels/grouping by job, integrates with Prometheus.
- Difference: Infrastructure component. Not a resource tracker itself; requires a separate process to collect and push metrics. Most relevant for a Rust implementation that needs to output to Prometheus.
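Since the Pushgateway is the standard push path for ephemeral batch jobs, a Rust CLI could deliver its final metrics with a single HTTP PUT of the Prometheus text exposition format. A minimal sketch of building that payload and push URL — the metric names, labels, and base URL here are illustrative, not taken from any existing tool:

```rust
// Hypothetical sketch: render collected samples in the Prometheus text
// exposition format, ready to push to a Pushgateway.

/// Render one gauge line in the exposition format: `name{labels} value`.
fn gauge(name: &str, labels: &[(&str, &str)], value: f64) -> String {
    let labels: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{}=\"{}\"", k, v))
        .collect();
    format!("{}{{{}}} {}", name, labels.join(","), value)
}

/// The Pushgateway groups metrics by job name encoded in the URL path.
fn push_url(base: &str, job: &str) -> String {
    format!("{}/metrics/job/{}", base, job)
}

fn main() {
    let body = [
        gauge("process_cpu_percent", &[("pid", "1234")], 42.5),
        gauge("process_rss_bytes", &[("pid", "1234")], 1.2e9),
    ]
    .join("\n");
    // A real client would PUT/POST this body with
    // `Content-Type: text/plain; version=0.0.4`.
    println!("{}", push_url("http://localhost:9091", "training-job"));
    println!("{}", body);
}
```

Keeping the payload a plain string like this means the binary needs no metrics library at all, matching the zero-dependency goal.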
6.3 Prometheus process-exporter
- URL: https://github.com/ncabatoff/process-exporter
- Language: Go
- Description: Prometheus exporter that reads `/proc` to report on selected processes. Groups processes by name or regex and exposes CPU, memory, file descriptors, I/O, and thread counts.
- Key features: Per-process-group CPU and memory metrics, `/proc`-based, configurable process selection, Prometheus compatible.
- Difference: Infrastructure daemon, not a batch job wrapper. Monitors selected processes continuously.
6.4 cAdvisor (Container Advisor)
- URL: https://github.com/google/cadvisor
- Language: Go
- Description: Google’s container resource usage and performance analysis agent. Exposes Prometheus metrics for running containers.
- Key features: Container-level CPU, memory, disk, and network metrics, Prometheus endpoint, supports Docker and other runtimes.
- Difference: Container/cgroup focused daemon. Not for general process monitoring.
6.5 Telegraf
- URL: https://github.com/influxdata/telegraf
- Language: Go
- Description: Plugin-driven metrics collection agent from InfluxData. Single agent collecting system metrics (CPU, memory, disk, network, GPU, containers) and writing to InfluxDB or other backends.
- Key features: 300+ input plugins (system, Docker, SNMP, statsd, etc.), multiple output backends, flexible configuration.
- Difference: Infrastructure agent daemon. Not designed for per-job wrapping.
6.6 Netdata (see 2.12)
6.7 kube-state-metrics
- URL: https://github.com/kubernetes/kube-state-metrics
- Language: Go
- Description: Kubernetes add-on that generates metrics about Kubernetes object state (pod resource requests/limits, deployment status, etc.) for Prometheus.
- Key features: Pod/node resource quota metrics, deployment health, Prometheus format.
- Difference: Kubernetes-only, no process-level metrics.
6.8 OpenTelemetry (OTel)
- URL: https://opentelemetry.io/ / https://github.com/open-telemetry/opentelemetry-python
- Language: Multi-language (Go, Python, Java, .NET, etc.)
- Description: CNCF standard for collecting traces, metrics, and logs. Includes system metrics via the OTel Collector. Growing support for profiling via OTel.
- Key features: Traces + metrics + logs, vendor-neutral, collector, SDKs in all major languages, exporters to Prometheus, Jaeger, OTLP.
- Difference: General observability framework, not a resource tracker per se. Relevant for instrumenting a Rust CLI to expose metrics in a standard format.
6.9 NVIDIA DCGM + dcgm-exporter
- URL: https://github.com/NVIDIA/DCGM / https://github.com/NVIDIA/dcgm-exporter
- Language: C (DCGM) + Go (exporter)
- Description: NVIDIA Data Center GPU Manager for GPU telemetry in large Linux clusters. dcgm-exporter exposes GPU metrics for Prometheus.
- Key features: Per-GPU and per-process GPU metrics, health monitoring, diagnostics, Kubernetes integration, Prometheus exporter.
- Difference: NVIDIA GPU infrastructure daemon for data center clusters. Not a batch job wrapper.
Category 7: Per-Process Network and Disk I/O Monitors
7.1 nethogs
- URL: https://github.com/raboof/nethogs
- Language: C++
- Description: Linux “net top” tool that groups network bandwidth by process using `/proc/net/tcp` and libpcap.
- Key features: Per-process network bandwidth (upload/download), real-time top-like display.
- Difference: Network only, interactive display, no data capture to file.
7.2 iftop
- URL: https://www.ex-parrot.com/pdw/iftop/
- Language: C
- Description: Shows network bandwidth grouped by source/destination host pairs. Does not show per-process breakdown.
- Key features: Per-connection bandwidth, host name resolution.
- Difference: Network only, host-pair level (not process level).
7.3 iotop
- URL: https://github.com/Tomas-M/iotop
- Language: C (rewrite of original Python version)
- Description: Top-like tool for disk I/O. Shows per-process disk read/write rates using kernel I/O accounting.
- Key features: Per-process disk I/O, real-time display, accumulated I/O counters.
- Difference: Disk I/O only, interactive display, no data capture.
7.4 dstat
- URL: https://github.com/dagwieers/dstat
- Language: Python
- Description: Versatile system statistics tool combining vmstat, iostat, netstat, and ifstat. Outputs columns of metrics to terminal, can write to CSV.
- Key features: CPU, disk, network, memory, system statistics; CSV output; pluggable.
- Difference: System-wide only (not per-process), no GPU. CSV output mode is useful for offline analysis.
Category 8: ML Experiment Tracking Platforms with Resource Monitoring
These platforms include resource metric tracking as one feature among many.
8.1 Weights & Biases (W&B)
- URL: https://github.com/wandb/wandb
- Language: Python
- Description: ML experiment tracking platform with automatic system metric logging. Tracks GPU, CPU, memory, and network during training runs.
- Key features: Automatic system metric logging (GPU, CPU, RAM, network), experiment tracking, model registry, artifacts, collaborative dashboards.
- Difference: Primarily an ML experiment tracker. Resource monitoring is automatic and integrated but secondary to experiment logging. Requires W&B account (cloud-first, has open-source local server option).
8.2 MLflow
- URL: https://github.com/mlflow/mlflow
- Language: Python
- Description: Open-source ML lifecycle management. Does not natively log CPU/GPU metrics; requires external integration.
- Key features: Experiment tracking, model registry, deployment. No built-in system resource monitoring.
- Difference: No native resource tracking.
8.3 ClearML (see 1.20)
Category 9: HPC Batch Job Monitoring
9.1 Jobstats
- URL: https://github.com/PrincetonUniversity/jobstats
- Language: Python + Prometheus stack
- Description: Slurm-compatible job monitoring platform for CPU and GPU clusters. Displays per-job CPU and GPU efficiency summaries using Prometheus, Grafana, and Slurm Prolog/Epilog hooks.
- Key features: Per-Slurm-job efficiency report (CPU utilization, memory, GPU utilization), compares requested vs. used resources, automatically stores data in Slurm AdminComment field.
- Difference: Slurm HPC specific. Requires full Prometheus + Grafana + Slurm infrastructure. Very close in concept to resource-tracker (per-job resource reports) but for HPC/Slurm, not general Python/R scripts.
9.2 Open XDMoD
- URL: https://open.xdmod.org/
- Language: PHP + Python
- Description: Open-source tool for analyzing HPC center usage and job efficiency. Tracks CPU, memory, GPU, and I/O for Slurm/PBS/SGE jobs.
- Key features: Job-level resource utilization reports, efficiency recommendations, web portal.
- Difference: HPC management tool. Requires full HPC stack. Not for general batch jobs.
Category 10: R Language Profiling Tools
Resource-tracker explicitly supports R scripts. These are the closest R-ecosystem analogues.
10.1 profvis
- URL: https://github.com/rstudio/profvis
- Language: R
- Description: Interactive visualization of R code profiling data. Uses `Rprof()` to collect call stack samples and displays an interactive flame graph and memory timeline in a web browser.
- Key features: Interactive flame graph, memory timeline, line-level time attribution, RStudio integration.
- Difference: CPU + memory profiling for R code, developer-oriented. No disk, network, or GPU. No batch job wrapping or time-series operational logging.
10.2 bench
- URL: https://github.com/r-lib/bench
- Language: R
- Description: High-precision benchmarking for R with memory tracking.
- Key features: High-resolution timing, memory allocation tracking, comparison of multiple expressions.
- Difference: Benchmarking tool. No operational resource monitoring.
10.3 microbenchmark
- URL: https://github.com/joshuaulrich/microbenchmark
- Language: R
- Description: R package for sub-millisecond timing benchmarks.
- Key features: High-precision CPU timing.
- Difference: CPU timing only, micro-benchmarking specific.
10.4 profmem
- URL: https://github.com/HenrikBengtsson/profmem
- Language: R
- Description: Simple memory profiling for R expressions. Uses `tracemem`/R internals to log all memory allocations.
- Key features: Per-expression memory allocation log.
- Difference: Memory only, developer-oriented.
Category 11: Python Standard Library / Built-in Profiling
11.1 cProfile / profile
- URL: https://docs.python.org/3/library/profile.html
- Language: Python (stdlib)
- Description: Python’s built-in deterministic profiler. Records function call counts and cumulative time.
- Key features: Function-level timing, call count, cumulative/per-call time, pstats for analysis.
- Difference: CPU time only, function-level. No memory, GPU, disk, or network.
11.2 tracemalloc
- URL: https://docs.python.org/3/library/tracemalloc.html
- Language: Python (stdlib, since 3.4)
- Description: Traces Python memory allocations with tracebacks to allocation sites.
- Key features: Peak memory tracking, traceback to allocation sites, snapshot comparison.
- Difference: Python-managed memory only. No native/C allocations, no GPU/disk/network.
11.3 yappi
- URL: https://github.com/sumerc/yappi
- Language: Python + C
- Description: Yet Another Python Profiler. Supports both wall clock and CPU time, multi-threaded profiling, and async code.
- Key features: Wall + CPU time, multi-thread awareness, async support, pstats/callgrind output.
- Difference: CPU profiling only.
11.4 line_profiler
- URL: https://github.com/pyutils/line_profiler
- Language: Python + C
- Description: Line-by-line CPU time profiler for Python using the `@profile` decorator.
- Key features: Line-level execution time, `@profile` decorator.
- Difference: CPU time only, requires decoration.
Summary Comparison Table
| Tool | Lang | CPU | Mem | GPU | Disk | Net | Batch-job wrap | Per-job report | Workflow integration | Output |
|---|---|---|---|---|---|---|---|---|---|---|
| resource-tracker | Python | Y | Y | Y | Y | Y | Y | Y | Metaflow, Flyte, Airflow | Metrics + card visualization |
| psutil | Python | Y | Y | — | Y | Y | — | — | — | Raw API |
| memory_profiler | Python | — | Y | — | — | — | Y (mprof) | Y (plot) | — | Plot + log |
| Scalene | Python | Y | Y | Y | — | — | Y (CLI) | Y (web UI) | — | Interactive web report |
| Memray | Python | — | Y | — | — | — | Y (CLI) | Y (flame graph) | — | Flame graphs |
| Fil | Python | — | Y | — | — | — | Y (CLI) | Y (flame graph) | — | Flame graph |
| pyinstrument | Python | Y | — | — | — | — | Y | Y | — | HTML/text |
| py-spy | Rust | Y | — | — | — | — | Y (attach) | Y (flame graph) | — | Flame graph |
| Austin | C | Y | — | — | — | — | Y | — | — | Stack samples |
| Glances | Python | Y | Y | Y* | Y | Y | — | — | — | TUI + web API |
| nvitop | Python | — | — | Y | — | — | — | — | — | TUI + Python API |
| gpustat | Python | — | — | Y | — | — | — | — | — | CLI display |
| CodeCarbon | Python | Y* | Y* | Y* | — | — | Y (decorator) | Y (CSV) | — | CO2 report |
| ClearML | Python | Y | Y | Y | — | Y | Y (auto) | Y (web) | ML frameworks | Web dashboard |
| below | Rust | Y | Y | — | Y | Y | — | — | — | TUI + replay |
| samply | Rust | Y | — | — | — | — | Y (subprocess) | Y (flame graph) | — | Firefox profiler |
| Bytehound | Rust | — | Y | — | — | — | Y (LD_PRELOAD) | Y (web GUI) | — | Web GUI |
| atop | C | Y | Y | — | Y | Y | — | — | — | TUI + binary log |
| sysstat/pidstat | C | Y | Y | — | Y | Y | — | — | — | CLI + CSV |
| htop | C | Y | Y | — | Y | Y | — | — | — | TUI |
| btop++ | C++ | Y | Y | Y* | Y | Y | — | — | — | TUI |
| Jobstats | Python | Y | Y | Y | — | — | Y* (Slurm) | Y (Slurm) | Slurm | CLI + DB |
| Pyroscope | Go | Y | Y | — | — | — | Y (SDK) | — | — | Flame graphs |
| Parca | Go | Y | Y | — | — | — | — | — | Kubernetes | Icicle graphs |
| perf | C | Y | — | — | Y | — | Y (subprocess) | — | — | Raw perf data |
| Valgrind | C | Y | Y | — | — | — | Y (subprocess) | Y | — | Text + GUI |
| nethogs | C++ | — | — | — | — | Y | — | — | — | TUI |
| iotop | C | — | — | — | Y | — | — | — | — | TUI |
| PowerAPI | Python | Y* | Y* | — | — | — | — | — | — | Power estimates |
| W&B | Python | Y | Y | Y | — | Y | Y (auto) | Y (web) | ML frameworks | Web dashboard |
| Prometheus stack | Go | Y | Y | Y* | Y | Y | — | — | Kubernetes | Time-series DB |
Y* = partial/plugin-based support
Key Findings for Rust CLI Implementation
Based on this landscape analysis, the following observations are most relevant to the planned Rust/Linux CLI implementation:
- No existing Rust tool covers the full feature set of resource-tracker (CPU + memory + GPU + disk + network + batch job wrapping + per-job reporting). `below` (Rust) is the closest in scope but is a system-wide daemon, not a per-job wrapper.
- procfs is the right foundation for Linux. The `/proc` filesystem is used by psutil, process-exporter, sysstat, and resource-tracker itself. A Rust implementation can use the `procfs` crate or read `/proc` directly with zero external dependencies.
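To illustrate the zero-dependency option, here is a sketch of reading per-process CPU ticks and resident memory straight from `/proc` with only the Rust standard library. Field positions follow proc(5); the hard-coded 4 KiB page size is an assumption (real code should query the system page size):

```rust
// Minimal /proc reading on Linux, in the spirit of psutil and the `procfs`
// crate, using only std.
use std::fs;

/// CPU ticks (user + system) consumed by a process, from /proc/<pid>/stat.
/// utime/stime are fields 14 and 15 (1-indexed); the process name in field 2
/// may contain spaces, so split after the closing parenthesis first.
fn cpu_ticks(pid: u32) -> Option<u64> {
    let stat = fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?;
    let after_comm = stat.rsplit(')').next()?;
    let fields: Vec<&str> = after_comm.split_whitespace().collect();
    // after_comm starts at field 3 (state), so utime/stime sit at index 11/12.
    let utime: u64 = fields.get(11)?.parse().ok()?;
    let stime: u64 = fields.get(12)?.parse().ok()?;
    Some(utime + stime)
}

/// Resident set size in bytes, from /proc/<pid>/statm (second field, pages).
fn rss_bytes(pid: u32) -> Option<u64> {
    let statm = fs::read_to_string(format!("/proc/{}/statm", pid)).ok()?;
    let pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    Some(pages * 4096) // assumes 4 KiB pages; query sysconf(_SC_PAGESIZE) in real code
}

fn main() {
    let pid = std::process::id(); // sample ourselves as a demo
    println!("cpu ticks: {:?}", cpu_ticks(pid));
    println!("rss bytes: {:?}", rss_bytes(pid));
}
```

Converting ticks to CPU percent additionally needs the clock tick rate (`sysconf(_SC_CLK_TCK)`, typically 100 Hz) and two samples spaced by the polling interval.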
- GPU support requires dynamic linking (NVML via `pynvml` in Python, or `libnvidia-ml.so` directly). This is a hard constraint noted in the SOW. A Rust NVML binding (such as the `nvml-wrapper` crate, or similar) will be needed.
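Until an NVML binding is wired up, one hedged fallback is shelling out to `nvidia-smi` and parsing its CSV query output. The `--query-gpu`/`--format` flags below are standard `nvidia-smi` options; the function names are illustrative:

```rust
// Sketch of a GPU metrics fallback that degrades gracefully when no NVIDIA
// driver is present, by invoking nvidia-smi instead of linking NVML.
use std::process::Command;

/// Parse one line of `nvidia-smi --query-gpu=utilization.gpu,memory.used
/// --format=csv,noheader,nounits` output, e.g. "45, 3072" (percent, MiB).
fn parse_gpu_line(line: &str) -> Option<(u32, u64)> {
    let mut parts = line.split(',').map(str::trim);
    let util: u32 = parts.next()?.parse().ok()?;
    let mem_mib: u64 = parts.next()?.parse().ok()?;
    Some((util, mem_mib))
}

/// Query all GPUs; returns None when nvidia-smi cannot be run at all.
fn query_gpus() -> Option<Vec<(u32, u64)>> {
    let out = Command::new("nvidia-smi")
        .args([
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader,nounits",
        ])
        .output()
        .ok()?;
    let text = String::from_utf8(out.stdout).ok()?;
    Some(text.lines().filter_map(parse_gpu_line).collect())
}

fn main() {
    match query_gpus() {
        Some(gpus) => println!("{} GPU(s): {:?}", gpus.len(), gpus),
        None => println!("nvidia-smi not available"),
    }
}
```

Shelling out is slower and coarser than NVML (no per-process VRAM attribution), so it is a stopgap, not a substitute for the dynamic-linking approach noted above.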
- The Pushgateway integration (Extra Component: S3 PUT) is unique to resource-tracker and not present in any comparable tool. This makes it particularly well-suited for cloud batch job environments.
- The decorator/wrapper pattern (similar to `samply record ./program`) is present in py-spy, samply, Austin, and Fil — wrapping a subprocess is the right architectural pattern for a CLI tool.
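The wrapper pattern these tools share can be sketched in a few lines of std-only Rust: spawn the target as a child process, poll it on the sampling interval, and emit a summary once it exits. The `track` helper and its CLI shape are illustrative assumptions, not an existing interface:

```rust
// Sketch of the wrapper pattern (`tracker run -- <cmd>`): spawn the target,
// sample on a fixed interval until it exits, then report. Sampling here only
// counts ticks; a real tracker would read /proc/<pid>/* at each tick.
use std::process::Command;
use std::time::{Duration, Instant};

/// Run `cmd` with `args`, polling every `interval`; returns (exit code, tick count).
fn track(cmd: &str, args: &[&str], interval: Duration) -> std::io::Result<(i32, u32)> {
    let start = Instant::now();
    let mut child = Command::new(cmd).args(args).spawn()?;
    let mut ticks = 0;
    let status = loop {
        match child.try_wait()? {
            Some(status) => break status, // target finished
            None => {
                ticks += 1; // collect one metrics sample here
                std::thread::sleep(interval);
            }
        }
    };
    println!("wall time: {:?}, samples: {}", start.elapsed(), ticks);
    Ok((status.code().unwrap_or(-1), ticks))
}

fn main() -> std::io::Result<()> {
    // Wrap a short-lived command; the child's exit code is propagated,
    // as a transparent CLI wrapper should do.
    let (code, _ticks) = track("sleep", &["1"], Duration::from_millis(100))?;
    std::process::exit(code);
}
```

Propagating the child's exit code is what lets the wrapper drop into existing shell pipelines and schedulers without changing their failure semantics.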
- The closest functional analogues (tools that wrap a job, collect multi-resource metrics, and produce a per-job report) are:
  - Scalene (Python, CPU+GPU+memory, developer-oriented)
  - memory_profiler (Python, memory only, has mprof)
  - Jobstats (HPC/Slurm specific)
  - resource-tracker itself (the reference implementation)
None of these is in Rust, and none covers all six resource dimensions (CPU, memory, GPU, VRAM, network, disk) in a single zero-dependency binary.
Sources
- https://github.com/SpareCores/resource-tracker
- https://github.com/giampaolo/psutil
- https://github.com/pythonprofilers/memory_profiler
- https://github.com/plasma-umass/scalene
- https://github.com/bloomberg/memray
- https://github.com/pythonspeed/filprofiler
- https://github.com/joerick/pyinstrument
- https://github.com/benfred/py-spy
- https://github.com/P403n1x87/austin
- https://github.com/nicolargo/glances
- https://github.com/XuehaiPan/nvitop
- https://github.com/wookayin/gpustat
- https://github.com/gpuopenanalytics/pynvml
- https://github.com/mlco2/codecarbon
- https://github.com/lfwa/carbontracker
- https://github.com/powerapi-ng/pyRAPL
- https://github.com/powerapi-ng/pyJoules
- https://github.com/powerapi-ng/powerapi
- https://github.com/sb-ai-lab/eco2AI
- https://github.com/psf/pyperf
- https://github.com/clearml/clearml
- https://github.com/xybu/python-resmon
- https://github.com/htop-dev/htop
- https://github.com/aristocratos/btop
- https://github.com/aristocratos/bpytop
- https://github.com/aristocratos/bashtop
- https://github.com/Atoptool/atop
- https://github.com/sysstat/sysstat
- https://github.com/Syllo/nvtop
- https://github.com/MrRio/vtop
- https://github.com/netdata/netdata
- https://github.com/iovisor/bcc
- https://github.com/bpftrace/bpftrace
- https://github.com/parca-dev/parca
- https://github.com/grafana/pyroscope
- https://github.com/brendangregg/FlameGraph
- https://github.com/gperftools/gperftools
- https://valgrind.org/
- https://github.com/KDE/heaptrack
- https://github.com/google/perfetto
- https://github.com/async-profiler/async-profiler
- https://github.com/facebookincubator/below
- https://github.com/mstange/samply
- https://github.com/koute/bytehound
- https://github.com/tikv/pprof-rs
- https://github.com/prometheus/node_exporter
- https://github.com/prometheus/pushgateway
- https://github.com/ncabatoff/process-exporter
- https://github.com/google/cadvisor
- https://github.com/influxdata/telegraf
- https://github.com/kubernetes/kube-state-metrics
- https://opentelemetry.io/
- https://github.com/NVIDIA/DCGM
- https://github.com/NVIDIA/dcgm-exporter
- https://github.com/raboof/nethogs
- https://github.com/wandb/wandb
- https://github.com/mlflow/mlflow
- https://github.com/PrincetonUniversity/jobstats
- https://github.com/rstudio/profvis
- https://github.com/r-lib/bench
- https://github.com/sumerc/yappi
- https://github.com/pyutils/line_profiler
- https://github.com/msaroufim/awesome-profiling
- https://lambda.ai/blog/keeping-an-eye-on-your-gpus-2
- https://sparecores.com/article/metaflow-resource-tracker
- https://developers.facebook.com/blog/post/2021/09/21/below-time-travelling-resource-monitoring-tool/
Open-Source Tools with Similar Functionality to resource-tracker
resource-tracker is a lightweight, zero-dependency Python package for monitoring CPU, memory, GPU, network, and disk utilization across processes and at the system level, designed for batch jobs (Python/R scripts, Metaflow steps), with decorator-based workflow integration and per-job visualization reports.
The tools below are organized into meaningful categories. No single open-source tool matches all of resource-tracker’s characteristics simultaneously — most are either too narrow (single metric), too heavy (infrastructure daemons), or not batch-job oriented.
Category 1: Python Libraries for Process/System Resource Monitoring
(Closest functional analogues)
| Tool | Notes | Details |
|---|---|---|
| psutil | The foundational building block used by resource-tracker itself. Raw API only, no tracking loop or reports. | Linux; no CLI; CPU/Mem/Disk/Net/Process; no batch wrap; no report |
| memory_profiler | Line-by-line memory, @profile decorator, mprof plot. No CPU/GPU/disk/network. | Linux; CLI (mprof); Memory; batch wrap (mprof CLI); report (plot) |
| Scalene | High-precision line-level profiler with AI optimization suggestions. No disk/network. Developer profiler. | Linux; CLI; CPU/GPU/Mem; batch wrap (CLI); report (web UI) |
| Memray | Bloomberg. Tracks every allocation including C/C++. No CPU/GPU/disk/network. | Linux; CLI; Memory; batch wrap (CLI); report (flame graphs) |
| Fil | Peak memory focus for data scientists (NumPy/Pandas). Written in Rust+Python. Linux/macOS only. | Linux; CLI; Memory (peak); batch wrap (CLI); report (flame graph) |
| pyinstrument | Context manager + decorator. 1ms sampling. No memory/GPU/disk/network. | Linux; CLI; CPU; batch wrap; report |
| py-spy | Written in Rust. Attaches to a running process. No memory/GPU/disk/network. | Linux; CLI; CPU; batch wrap (attach); report (flame graph) |
| Austin | Pure C, extremely low overhead CPython frame stack sampler. | Linux; CLI; CPU; batch wrap; no report |
| Glances | Full system monitor with REST API, web UI, and exporters. Long-running daemon, not a batch-job wrapper. | Linux; CLI; CPU/Mem/Disk/Net/GPU; no batch wrap; no report |
| nvitop | Best GPU process viewer. Has programmatic ResourceMetricCollector API. No CPU/mem/disk/net. | Linux; CLI; NVIDIA GPU; no batch wrap; no report |
| gpustat | Simple NVIDIA GPU status CLI. No time-series logging. | Linux; CLI; NVIDIA GPU; no batch wrap; no report |
| pynvml / nvidia-ml-py | Python NVML bindings. Building block only. | Linux; no CLI; GPU (raw API); no batch wrap; no report |
| CodeCarbon | @track_emissions decorator. CO2/energy focus, not utilization %. No disk/network. | Linux; partial CLI; CPU/Mem/GPU energy; batch wrap (decorator); report (CSV + dashboard) |
| CarbonTracker | Predicts carbon footprint, can halt training. ML training specific. | Linux; no CLI; CPU/GPU energy; batch wrap; report |
| pyRAPL | Intel RAPL via /sys/class/powercap. Intel CPUs only. Energy joules, not utilization %. | Linux only; no CLI; CPU/DRAM energy; batch wrap (decorator); no report |
| pyJoules | Multi-device energy (Intel RAPL + NVML). Context manager and decorator. | Linux only; no CLI; CPU/DRAM/GPU energy; batch wrap (decorator); no report |
| PowerAPI | Framework for software-defined power meters. Process/container/VM granularity. Complex setup. | Linux only; partial CLI; CPU/Mem power; no batch wrap; no report |
| eco2AI | ML training focused CO2 tracking. | Linux; no CLI; CPU/GPU/RAM energy; batch wrap (decorator); report (CSV) |
| pyperf | PSF benchmarking toolkit. --track-memory and --tracemalloc options. Not an operational monitor. | Linux; CLI; Memory (benchmarks); batch wrap; report |
| ClearML | Full MLOps platform. Auto-logs system metrics. Requires ClearML server. | Linux; CLI; CPU/Mem/GPU/Net; auto batch wrap; report (web UI) |
| python-resmon | Lightweight script outputting CSV. System-level only, no per-process or GPU tracking. | Linux; CLI; CPU/Mem/Disk/Net; no batch wrap; report (CSV) |
| yappi | CPU + wall time profiler with multi-thread and async support. | Linux; no CLI; CPU; batch wrap; report |
| line_profiler | Line-by-line CPU time. No memory/GPU/disk/network. | Linux; CLI (kernprof); CPU; batch wrap (@profile); report |
Category 2: Interactive Terminal System Monitors
(Real-time visual monitoring; do not produce per-job reports or integrate with batch workflows)
| Tool | Notes | Details |
|---|---|---|
| htop | Interactive process viewer; no data capture | C; Linux; CLI; CPU/Mem/Proc |
| btop++ | Most modern TUI monitor; GPU via plugins | C++; Linux; CLI; CPU/Mem/Disk/Net/GPU |
| bpytop | Predecessor to btop++ | Python; Linux; CLI; CPU/Mem/Disk/Net |
| bashtop | Predecessor to bpytop | Bash; Linux; CLI; CPU/Mem/Disk/Net |
| atop | Writes persistent binary logs; replay mode; strong process-level detail | C; Linux only; CLI; CPU/Mem/Disk/Net/Proc |
| nmon | CSV capture mode for offline analysis; primarily Linux/AIX | C; Linux; CLI; CPU/Mem/Disk/Net |
| collectl | Wide metric coverage; daemon or one-shot mode | Perl; Linux only; CLI; CPU/Mem/Disk/Net |
| sysstat (sar/pidstat) | pidstat for per-process; sadf for JSON/CSV/XML export; schedulable via cron | C; Linux only; CLI; CPU/Mem/Disk/Net/Proc |
| nvtop | AMD, Apple, Intel, NVIDIA, Qualcomm support; interactive GPU monitor | C; Linux; CLI; GPU (multi-vendor) |
| vtop | Node.js, Unicode charts | JS; Linux; CLI; CPU/Mem/Proc |
| Netdata | 76k+ GitHub stars. Per-second metrics, web UI, ML anomaly detection | C; Linux; CLI; all (800+ plugins) |
Category 3: eBPF / Kernel Tracing Tools
(Zero-overhead kernel-level observability; require root + Linux kernel 4.1+)
| Tool | Notes | Details |
|---|---|---|
| BCC | Toolkit for writing eBPF programs; 70+ ready-made tools | C/Python/Lua; Linux only; CLI |
| bpftrace | DTrace-like one-liners for eBPF; ad-hoc analysis | C++ DSL; Linux only; CLI |
| Parca + Parca Agent | Continuous eBPF-based CPU profiling; pprof format; <1% overhead | Go; Linux only; CLI |
| Pyroscope (Grafana) | Continuous profiling database + eBPF agent; multi-language SDK; Grafana integration | Go; Linux only; CLI |
Category 4: Native C/C++ Profiling Tools
| Tool | Notes | Details |
|---|---|---|
| perf (Linux perf_events) | Foundation for many other tools; hardware counter sampling | C (kernel); Linux only; CLI; CPU/kernel events |
| FlameGraph | Visualizes perf/DTrace output as SVG flame graphs | Perl; Linux; CLI; visualization |
| gperftools | Google Performance Tools: CPU profiler, heap profiler, TCMalloc | C++; Linux; partial CLI (pprof); CPU/Memory |
| Valgrind / Massif | High-overhead instrumentation; Massif=heap profiler; 10–50× slowdown | C; Linux; CLI; CPU/Memory |
| Heaptrack | KDE; faster alternative to Valgrind/Massif for heap profiling | C++; Linux only; CLI; Memory |
| Perfetto | Google; default Android profiler; SQL-queryable traces; browser UI | C++; Linux; CLI; CPU/Mem/GPU/Disk/Sched |
| async-profiler | Low-overhead JVM profiler; flame graphs; JVM only | C (JVM agent); Linux; CLI (asprof); CPU/Heap |
| TAU | HPC parallel profiling suite; complex setup | C++; Linux; CLI; CPU/GPU/MPI |
| HPCToolkit | HPC sampling profiler; 1–5% overhead; supercomputer use | C/C++; Linux; CLI; CPU/GPU |
Category 5: Rust Tools
| Tool | Notes | Details |
|---|---|---|
| below | Facebook/Meta. Time-traveling system monitor with cgroup/PSI support; record+replay mode. System-wide daemon, not a batch-job wrapper. Architecturally most relevant Rust project. | Linux only; CLI |
| samply | Sampling CPU profiler; wraps a subprocess (samply record ./program); uses Linux perf events; Firefox Profiler UI. CPU only. | Linux; CLI |
| Bytehound | Heap memory profiler; LD_PRELOAD-based; multi-arch (AMD64, ARM, AArch64, MIPS64); web-based GUI. Memory only. | Linux only; CLI |
| pprof-rs | CPU profiler for Rust programs using backtrace-rs; pprof output format. Library only. | Linux; no CLI |
Category 6: Infrastructure Metrics Collection (Daemons & Exporters)
(Not batch-job wrappers; relevant for pipeline integration and metric output targets)
| Tool | Notes | Details |
|---|---|---|
| Prometheus node_exporter | System-level Prometheus exporter; /proc-based | Go; Linux; CLI |
| Prometheus Pushgateway | Allows batch jobs to push metrics to Prometheus; standard solution for short-lived jobs | Go; Linux; CLI |
| process-exporter | Per-process-group Prometheus metrics from /proc | Go; Linux only; CLI |
| cAdvisor | Container resource usage and performance; Prometheus exporter | Go; Linux only; CLI |
| Telegraf | Plugin-driven metrics agent; 300+ inputs; InfluxDB backend | Go; Linux; CLI |
| OpenTelemetry | CNCF standard for traces/metrics/logs; structured output for jobs | Multi-lang; Linux; CLI (otelcol) |
| NVIDIA DCGM + dcgm-exporter | GPU telemetry for Kubernetes/data center; Prometheus exporter | C/Go; Linux only; CLI |
| kube-state-metrics | Kubernetes object state metrics for Prometheus | Go; Linux; CLI |
| Jobstats (HPC) | Slurm-compatible per-job efficiency reports (CPU+GPU). Conceptually very close to resource-tracker but Slurm-specific. | Python; Linux only; CLI |
Category 7: Per-Process Network and Disk I/O Monitors
| Tool | Notes | Details |
|---|---|---|
| nethogs | Per-process network bandwidth using /proc/net/tcp + libpcap | C++; Linux only; CLI |
| iftop | Per-connection (not per-process) bandwidth monitor | C; Linux; CLI |
| iotop | Per-process disk I/O using kernel I/O accounting | C; Linux only; CLI |
| dstat | System-wide CPU+disk+network+memory with CSV output | Python; Linux only; CLI |
Category 8: ML Experiment Tracking with Resource Monitoring
| Tool | Notes | Details |
|---|---|---|
| Weights & Biases | Auto-logs GPU, CPU, memory, network during training runs; cloud-first; rich dashboards | Linux; CLI (wandb) |
| ClearML | Open-source MLOps platform; auto-logs GPU+CPU+memory+network; requires ClearML server | Linux; CLI |
| MLflow | Experiment tracking but no native system resource monitoring | Linux; CLI (mlflow) |
Category 9: R Language Profiling
| Tool | Notes | Details |
|---|---|---|
| profvis | Interactive R profiling visualization; CPU + memory timeline; used within R session | Linux; R session only |
| bench | Benchmarking with memory tracking; used within R session | Linux; R session only |
| microbenchmark | Micro-benchmarking tool; used within R session | Linux; R session only |
| profmem | Memory allocation tracing for R expressions; used within R session | Linux; R session only |
Category 10: Python Standard Library Profiling Tools
| Tool | Notes | Details |
|---|---|---|
| cProfile / profile | Function-level CPU time; stdlib | Linux; CLI (python -m cProfile) |
| tracemalloc | Python memory allocation tracing with tracebacks; stdlib since Python 3.4; used within code | Linux; no CLI (used within code) |
Summary: Key Differentiators of resource-tracker
The table below highlights what makes resource-tracker stand out relative to the landscape:
| Feature | resource-tracker | Most profilers | System monitors | ML trackers |
|---|---|---|---|---|
| CPU + Memory + GPU + Disk + Net | All 5 | Usually 1–2 | All 5 | CPU+Mem+GPU |
| Batch-job / script wrapper | Yes | Yes | No (daemons) | Yes |
| Zero runtime dependencies | Yes | Varies | No | No |
| Per-job visual report / card | Yes | Often | No | Yes (cloud) |
| Workflow integration (Metaflow) | Yes | No | No | Varies |
| Cloud instance recommendations | Yes | No | No | No |
| Lightweight process footprint | Yes | Yes | No | No |
| Process-level granularity | Yes | Yes | Partial | No |
| Runs on Linux | Yes | Yes | Yes | Yes |
| CLI invocation | Yes | Yes (most) | Yes | Yes |
Rust Crate-Level Competitive Landscape: Resource Monitoring
This document surveys Rust crates relevant to resource monitoring — tracking CPU, memory, GPU, network, and disk utilization — with particular focus on use cases analogous to the Python resource-tracker package (batch job wrapping, structured output, low overhead).
It also covers dial9-tokio-telemetry, a notable 2026 Rust telemetry crate that is not a resource monitor but is included here to explain why it falls outside this landscape.
Section 1: Core System Information Libraries
(Foundational libraries; highest relevance as building blocks)
| Crate | Notes | Details |
|---|---|---|
| sysinfo | The dominant Rust system-info library. Cross-platform (Linux, macOS, Windows, FreeBSD). Covers everything resource-tracker needs except GPU. Used internally by most other crates here. ~2,700 GitHub stars. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; active; 123M downloads |
| procfs | Direct interface to Linux /proc. Most granular per-process data available (CPU time, RSS, VMS, I/O counters, smaps). Authoritative source for Linux-first tools. | Linux only; no CLI; CPU/Mem/Net/Disk; process-level; active; 51M downloads |
| psutil | Rust port of Python’s psutil. Modular feature flags. Linux + macOS. README self-describes as “not well maintained” despite a July 2025 update. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; active*; 3.1M downloads |
| systemstat | Pure Rust (no C bindings). Cross-platform. System-wide only — no per-process metrics. | Linux; no CLI; CPU/Mem/Net/Disk; system-wide only; active; 3.6M downloads |
| libproc | Per-process data on Linux + macOS. Useful complement to procfs for cross-platform support. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; active; 5M downloads |
| memory-stats | Cross-platform. Reports the current process’s own RSS and virtual memory only. Narrow scope but zero-dependency and reliable. | Linux; no CLI; Mem only; self-process only; active; 10.3M downloads |
| perf_monitor | Larksuite (Lark/Feishu). Designed explicitly as a monitoring foundation: per-process CPU, memory, FDs, disk I/O. Cross-platform. Archived January 2026 — do not adopt for new projects. | Linux; no CLI; CPU/Mem/Disk; process-level; archived; 36K downloads |
| heim | Async-first psutil/gopsutil equivalent. Conceptually ideal but last released 2020; 74 open issues. Not safe to adopt. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; abandoned; 490K downloads |
*psutil: stated as “not well maintained” in README despite recent activity.
Section 2: GPU Monitoring Libraries
| Crate | Notes | Details |
|---|---|---|
| nvml-wrapper | Safe, ergonomic Rust wrapper for NVIDIA NVML. Covers GPU utilization, memory, temperature, power, fan speed, running compute processes. The standard library for NVIDIA GPU metrics in Rust. | Linux; no CLI; NVIDIA GPU; active; 3.5M downloads |
| all-smi | Most comprehensive multi-vendor GPU CLI in Rust. Prometheus metrics integration. Display-oriented but scriptable. | Linux; CLI + Prometheus; NVIDIA/AMD/Intel/Apple/TPU/NPU GPU; active; 8.3K downloads |
| nviwatch | Interactive TUI + InfluxDB integration. NVIDIA-only. | Linux; TUI; NVIDIA GPU; active; 4.9K downloads |
| gpuinfo | Minimal CLI for GPU status with --watch and --format flags. Scriptable. NVIDIA-only. | Linux; CLI; NVIDIA GPU; active; 5.9K downloads |
Section 3: CLI Tools for Batch Job / Process Resource Tracking
(Most directly comparable to resource-tracker’s execution model)
| Crate | Notes | Details |
|---|---|---|
| denet | Closest Rust analogue to resource-tracker. denet run <cmd> wraps a command and streams CPU%, memory (RSS+VMS), and I/O metrics. JSON/JSONL/CSV output. Adaptive sampling. Child process aggregation. Python API bindings. No GPU or network monitoring. | Linux; CLI; CPU/Mem/Disk; active; 2.6K downloads |
| session-process-monitor | Kubernetes-focused but spm run pattern directly wraps a batch job with monitoring + OOM protection + headless JSON logging. Tracks USS/PSS/RSS memory and disk I/O rate. Very new (March 2026). No GPU or network. | Linux only; CLI (spm run); CPU/Mem/Disk; active; 173 downloads |
| stop-cli | Modern process viewer with JSON/CSV structured output designed for piping to jq. Per-process CPU%, memory, disk I/O, FDs. Very early stage (v0.0.1, November 2025). | Linux; CLI; CPU/Mem/Disk; active; 72 downloads |
| procrec | Records and plots CPU + memory for a process. Conceptually aligned but last updated 2021. | Linux; CLI; CPU/Mem; abandoned; 1.7K downloads |
| radvisor | Container/Kubernetes batch monitoring at 50ms granularity via cgroups. CSVY output. CPU (including throttling), memory, block I/O. Dormant since 2022. | Linux only; CLI; CPU/Mem/Disk; dormant; 1.7K downloads |
| pidtree_mon | CLI monitor for CPU load across entire process trees (parent + all descendants). CPU-only; no memory/disk/network/GPU. | Linux only; CLI; CPU only; active; 6.2K downloads |
| gotta-watch-em-all | CLI memory monitor for process trees. Memory-only. Dormant since 2022. | Linux; CLI; Mem only; dormant; 6.5K downloads |
| procweb-rust | Web interface for per-process Linux resource usage. No structured data output. Stale since 2023. | Linux only; web UI; CPU/Mem; stale; 5.5K downloads |
| systrack | Library for tracking CPU and memory usage over configurable time intervals (rolling windows) — the exact pattern resource-tracker uses. Single release in 2023; dormant since. | Linux; no CLI; CPU/Mem; dormant; 1.4K downloads |
Section 4: Interactive TUI System Monitors
(Visual monitors; not designed for non-interactive batch job instrumentation)
| Crate | Notes | Details |
|---|---|---|
| bottom (btm) | Most popular Rust TUI monitor. Cross-platform. No GPU. Uses sysinfo internally. Interactive only — not suitable for batch job instrumentation. | Linux; TUI; CPU/Mem/Net/Disk; active; 13,100 stars |
| mltop | ML-focused TUI combining CPU + NVIDIA GPU (via NVML). Directly targets the ML engineer use case. Interactive only. | Linux; TUI; CPU/Mem/NVIDIA GPU; active; 14 stars |
| rtop | TUI with optional NVIDIA GPU support. Covers all five resource types in a single tool. Interactive only. | Linux; TUI; CPU/Mem/NVIDIA GPU/Net/Disk; active; 36 stars |
| ttop | TUI with multi-vendor GPU (NVIDIA, AMD, Apple Silicon). Very new (March 2026). Interactive only. | Linux; TUI; CPU/Mem/multi-vendor GPU; active |
| hegemon | Modular safe-Rust TUI. Last release 2018. Historical reference only. | Linux only; TUI; CPU/Mem; abandoned; 336 stars |
Section 5: Comprehensive Hardware Monitoring
| Crate | Notes | Details |
|---|---|---|
| silicon-monitor | Most comprehensive hardware monitoring scope of any crate here. NVIDIA (NVML) + AMD (ROCm/sysfs) + Intel (i915) GPU. Also covers temperatures, SMART disk data, USB, audio, per-process GPU attribution. Provides CLI (JSON output), TUI, GUI, library (simonlib), and MCP/AI agent server. Very new (133 downloads, 1 star as of March 2026); unclear stability. Worth watching. | Linux; CLI (JSON); CPU/Mem/multi-vendor GPU/Net/Disk; active |
Section 6: Kernel / Low-Level Profiling Crates
(Measure hardware counters, not high-level resource utilization)
| Crate | Notes | Details |
|---|---|---|
| perf-event | Safe Rust interface to perf_event_open. Exposes hardware counters: CPU cycles, instructions, cache hits/misses, branch predictions, page faults, context switches. Deep profiling of batch jobs; not high-level resource tracking. | Linux only; no CLI; active; 4.2M downloads |
| pprof | CPU profiler for Rust programs (stack sampling → flamegraph/pprof output). Profiler, not a resource monitor. | Linux; no CLI; active; 34M downloads |
| metrics | Application metrics facade (counters, gauges, histograms). Used to emit measurements; not a collector of system resources. | Linux; no CLI; active; 74M downloads |
Section 7: dial9-tokio-telemetry — Async Runtime Telemetry (Out of Scope)
dial9-tokio-telemetry is a runtime telemetry “flight recorder” for the Tokio async runtime in Rust, announced on the Tokio blog on March 18, 2026 (authored by Russell Cohen, with AWS contributions). It is included here to explain why it is not a resource monitor and does not belong in this landscape.
What it does
dial9 hooks into Tokio’s internal instrumentation to capture a microsecond-resolution event log of every:
- Task poll (timing per poll)
- Worker park / unpark event
- Task wake event and lifecycle (creation, worker migration)
- Queue depth change
- Lock contention event (with stack traces on Linux)
- Linux kernel scheduling delay (gap between “ready to run” and “actually scheduled”)
- CPU profile samples (Linux perf/eBPF-style)
- Application-level `tracing` spans and logs
Traces are written to compact rotating binary files (or directly to S3) with <5% overhead, enabling continuous production deployment. A web-based trace viewer renders the results.
Why it is not a resource monitor
| Dimension | resource-tracker | dial9-tokio-telemetry |
|---|---|---|
| Target workload | Batch jobs (ML, HPC, pipelines) | Long-running async Rust services |
| Metrics tracked | CPU%, RAM, GPU, network, disk | Tokio task polls, scheduling delays, lock contention |
| Integration | Decorator / subprocess wrap | Must be compiled into the Rust binary |
| Output | Time-series resource usage / plots | Binary event traces for async runtime debugging |
| Question answered | “How much CPU/RAM did this job use?” | “Why did this async request take 18ms instead of 1ms?” |
| Platform | Cross-platform | Linux-primary |
dial9 is an async runtime debugger. It tracks none of the metrics — CPU utilization %, memory, GPU, network bandwidth, disk I/O — that define the resource-tracker use case. It is relevant to Rust async service reliability engineering, not to batch job resource instrumentation.
Summary: Key Findings
No single Rust crate fully replicates resource-tracker
No existing Rust crate combines: subprocess/batch-job wrapping + CPU% + memory + GPU + network + disk + structured JSON/CSV output + low overhead. The gap is real.
Closest existing tools
| Crate | Why it is close | What is missing |
|---|---|---|
| denet | denet run <cmd> wraps a command; JSON/CSV output; Python bindings | GPU, network |
| session-process-monitor | spm run pattern; OOM protection; headless JSON logging | GPU, network |
| stop-cli | Structured JSON/CSV; scripting-friendly | Not a job wrapper; no GPU/network |
Recommended building blocks for a Rust resource-tracker port
| Purpose | Crate |
|---|---|
| CPU, memory, disk, network (system + process) | sysinfo |
| Fine-grained Linux per-process I/O and memory | procfs |
| NVIDIA GPU metrics | nvml-wrapper |
| Multi-vendor GPU CLI | all-smi |
The GPU gap
No Rust library cleanly integrates CPU + memory + multi-vendor GPU + network + disk in a single programmatic API suitable for batch job wrapping. silicon-monitor attempts this scope but is brand new and unproven. nvml-wrapper covers NVIDIA programmatically; multi-vendor GPU support requires either all-smi (CLI) or direct vendor SDK bindings.
Specification Proposal — resource-tracker
- Status: Proposal / Work-in-Progress
- Date: 2026-03-30
- Based on: README.md (SpareCores), `src/prototype`, Python PR #9, `s3_upload.py`
- AI large language model tools were used throughout the research, specification, and implementation phases of this project to accelerate and improve the quality of the work.
0. Conventions
The key words MUST, MUST NOT, REQUIRED, SHALL, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
A verifiable requirement is one that can be confirmed by an automated test without manual inspection. Every normative statement below (MUST/SHALL) is intended to be verifiable.
1. Purpose and Scope
resource-tracker is a lightweight, statically self-contained Linux binary that:
- Polls system- and process-level resource utilization at a configurable interval.
- Emits structured samples to stdout (JSON Lines or CSV).
- Optionally streams those samples to the Sentinel API (SpareCores data ingestion endpoint) via gzip-compressed CSV, TSV, or JSONL files uploaded to S3 using temporary STS credentials.
The binary is intended as a drop-in CLI wrapper: run it alongside any process and it will transparently record how that process consumes hardware.
Out of scope (v1): macOS, Windows, eBPF-based tracing, container image introspection beyond environment variables, multi-host federation.
2. Platform Requirements
| Requirement | Detail |
|---|---|
| Operating System | Linux only (kernel ≥ 4.18 recommended for full /proc coverage) |
| CPU Architectures | x86_64 and aarch64 (ARM64) |
| Linkage | Dynamic linkage for GPU libraries; all other code statically linked or carried as crate dependencies |
| Minimum Rust Edition | 2024 |
GPU support MUST NOT be required for the binary to build or run.
On a CPU-only host GpuCollector::collect() SHALL return an empty Vec and no error.
3. Configuration
3.1 Precedence (highest to lowest)
CLI flags > TOML config file > built-in defaults
> Future enhancement: Support `RESOURCE_TRACKER_`-prefixed environment variables (e.g. `RESOURCE_TRACKER_INTERVAL`, `RESOURCE_TRACKER_FORMAT`) as an additional configuration layer between CLI flags and the TOML file. Environment variables are more practical than file-based config for containerized and scripted workloads and are preferred for the Sentinel integration use case.
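The precedence rule can be expressed as a simple `Option` chain. A minimal sketch, assuming hypothetical `Option<u64>` values already parsed from the CLI and the TOML file (the function name `effective_interval` is illustrative, not part of the spec):

```rust
/// Resolve the polling interval per the precedence rule:
/// CLI flag > TOML config file > built-in default.
fn effective_interval(cli: Option<u64>, file: Option<u64>) -> u64 {
    const DEFAULT_INTERVAL_SECS: u64 = 1;
    cli.or(file).unwrap_or(DEFAULT_INTERVAL_SECS)
}

fn main() {
    // T-CFG-05: a CLI --interval 2 overrides a TOML interval_secs = 5.
    assert_eq!(effective_interval(Some(2), Some(5)), 2);
    // T-CFG-04: TOML value is used when no flag is given.
    assert_eq!(effective_interval(None, Some(3)), 3);
    // T-CFG-06: defaults apply when neither source provides a value.
    assert_eq!(effective_interval(None, None), 1);
}
```

The same `or`-chain extends naturally to a future environment-variable layer slotted between the CLI and file values.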
3.2 CLI Parameters
The binary MUST accept the following flags via a command line parser:
| Short | Long | Type | Default | Description |
|---|---|---|---|---|
| -n | --job-name | String | none | Human-readable label attached to every sample |
| -p | --pid | i32 | none | Root PID of the process tree to track (CPU attribution) |
| -i | --interval | u64 | 1 | Polling interval in seconds (≥ 1) |
| -c | --config | path | resource-tracker.toml | Path to TOML config file |
| -f | --format | enum | json | Output format: json or csv |
| — | --version | flag | — | Print binary version and exit |
All metadata fields listed in Section 9.3 (job_name, project_name, stage_name, etc.) MUST also be accepted as CLI flags. See Section 9.3 for the full flag and environment variable table.
Shell-wrapper mode (MVP target): The binary SHOULD support being used as a
transparent process wrapper, where the command to monitor is passed as trailing
arguments after a -- separator or as positional arguments:
```sh
resource-tracker Rscript model.R
resource-tracker -- python train.py --epochs 10
```
In this mode the binary spawns the given command as a child process, sets
--pid to that child’s PID automatically, and exits when the child exits
(propagating the child exit code). This is a significant usability improvement
over the Python implementation and is a first-class v1 goal.
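The spawn-wait-propagate behavior described above can be sketched with `std::process` alone. This is a minimal illustration (assuming a POSIX `sh` is available; error handling, signal forwarding, and the polling loop itself are omitted):

```rust
use std::process::Command;

/// Spawn the wrapped command, note its PID for tracking, wait for it,
/// and return its exit code (signal-terminated children map to non-zero).
fn run_wrapped(cmd: &str, args: &[&str]) -> i32 {
    let mut child = Command::new(cmd)
        .args(args)
        .spawn()
        .expect("failed to spawn wrapped command");
    let _tracked_pid = child.id(); // this PID would become the --pid target
    let status = child.wait().expect("failed to wait on child");
    status.code().unwrap_or(1)
}

fn main() {
    // e.g. `resource-tracker -- sh -c "exit 7"` must itself exit with 7
    assert_eq!(run_wrapped("sh", &["-c", "exit 7"]), 7);
    assert_eq!(run_wrapped("sh", &["-c", "exit 0"]), 0);
}
```

In the real binary, `std::process::exit(code)` would be called with the propagated value after the final samples are flushed.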
--interval MUST be > 0. Values of 0 SHALL be rejected with a non-zero exit code and a descriptive error message.
3.3 TOML Config File
The config file is optional. If the file does not exist or cannot be parsed, the binary MUST continue using defaults (no error, no warning).
Schema:
```toml
[job]
name = "my-benchmark"  # String; optional
pid = 12345            # i32; optional

[tracker]
interval_secs = 5      # u64; optional; default 1
```
Unrecognized keys MUST be silently ignored.
3.4 Verifiable Configuration Tests
- T-CFG-01: Running with no flags produces valid JSON Lines output on stdout.
- T-CFG-02: `--format csv` emits a header line matching the exact column list in Section 6.2 before the first data row.
- T-CFG-03: `--interval 0` exits with code ≠ 0.
- T-CFG-04: A TOML file with `[tracker] interval_secs = 3` results in samples separated by ≈ 3 seconds when no `--interval` flag is provided.
- T-CFG-05: A CLI `--interval 2` overrides a TOML `interval_secs = 5`.
- T-CFG-06: A missing TOML file path silently falls back to defaults.
4. Startup Behavior
On startup the binary MUST:
- Parse configuration (Section 3).
- Initialize all collectors.
- Execute one warm-up collection pass to prime delta state in stateful collectors (`CpuCollector`, `NetworkCollector`, `DiskCollector`).
- Sleep exactly one full interval.
- Emit the CSV header (if format = CSV) before the first data row.
- Enter the polling loop (Section 5).
The warm-up pass result MUST NOT be emitted to stdout.
5. Polling Loop
The loop MUST:
- Record `timestamp_secs` = current Unix time as `u64` (seconds since UNIX epoch, UTC).
- Collect all metric subsystems (Section 6.1) in the order: CPU, Memory, Network, Disk, GPU.
- Serialize and emit one line to stdout per the chosen format (Section 6.2, Section 6.3).
- Sleep the configured interval.
- Repeat indefinitely until killed.
Collection of any subsystem MUST NOT block the other subsystems. Failures in optional subsystems (GPU) MUST be surfaced as empty/zero values, not panics.
6. Data Model
6.1 Sample
A Sample is a point-in-time snapshot of all tracked resources.
```rust
pub struct Sample {
    pub timestamp_secs: u64,          // Unix time (seconds)
    pub job_name: Option<String>,
    pub cpu: CpuMetrics,
    pub memory: MemoryMetrics,
    pub network: Vec<NetworkMetrics>, // one per interface
    pub disk: Vec<DiskMetrics>,       // one per block device
    pub gpu: Vec<GpuMetrics>,         // one per GPU; empty if none
}
```
6.1.1 CpuMetrics
Source: /proc/stat tick deltas; /proc/<pid>/stat for process tracking.
> Note: `total_cores` (logical CPU count) is a static host property that rarely changes. It belongs in the host discovery snapshot (Section 8.1) rather than in every per-second sample. It is referenced here only for computing `cpu_usage` in the CSV output (Section 7.2).
| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
utilization_pct | f64 | fractional cores | /proc/stat | Aggregate utilization expressed as cores-in-use (0.0..N_cores) |
per_core_pct | Vec<f64> | % | /proc/stat | Per logical CPU percentage; len == host_vcpus; range 0.0–100.0 |
utime_secs | f64 | seconds | /proc/stat | Δ(user+nice ticks) / ticks_per_second for this interval |
stime_secs | f64 | seconds | /proc/stat | Δ(system ticks) / ticks_per_second for this interval |
process_count | u32 | count | /proc numeric dirs | Number of running processes visible to the OS |
process_cores_used | Option<f64> | fractional cores | /proc/<pid>/stat | None when no PID tracked |
process_child_count | Option<u32> | count | /proc/<pid>/stat | Descendant count; excludes root PID; None when no PID tracked |
Computation rules:
- `utilization_pct` = `(Δtotal − Δidle) / Δtotal × N_cores`, where N_cores is the logical CPU count from host discovery. The result is expressed as fractional cores in use (e.g. 4.6 on a 16-core host means ~4.6 vCPUs are fully utilized). Do NOT clamp this value; values very slightly above N_cores are valid under kernel accounting rounding. Δtotal = Δ(user + nice + system + idle + iowait + irq + softirq + steal). Δidle = Δ(idle + iowait).
- `utime_secs` = Δ(user + nice) / `ticks_per_second`.
- `stime_secs` = Δ(system) / `ticks_per_second`.
- `process_cores_used` = Σ Δ(utime + stime) for root PID and all descendants / (elapsed_wall_clock_seconds × ticks_per_second). Must be ≥ 0.
- On the first collection call (no previous snapshot), all delta-based fields MUST return 0. The caller MUST discard this result (warm-up pass).
Verifiable CpuMetrics Tests:
- T-CPU-01: `utilization_pct` is in [0.0, N_cores] for all samples (N_cores from host discovery).
- T-CPU-02: `len(per_core_pct)` == `host_vcpus` for all samples.
- T-CPU-03: When `--pid` is not set, `process_cores_used` and `process_child_count` are `None`.
- T-CPU-04: When `--pid <self>` is set, `process_cores_used` ≥ 0.
- T-CPU-05: `process_count` ≥ 1 on any running Linux system.
- T-CPU-06: First `collect()` call returns 0.0 for all delta fields.
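The delta arithmetic above can be sketched in a few lines. A minimal illustration, using a hypothetical `CpuTicks` snapshot type (field names mirror the `/proc/stat` columns but are not the actual structs in the implementation):

```rust
/// One /proc/stat snapshot, in clock ticks.
#[derive(Clone, Copy)]
struct CpuTicks { user: u64, nice: u64, system: u64, idle: u64,
                  iowait: u64, irq: u64, softirq: u64, steal: u64 }

/// utilization_pct = (Δtotal − Δidle) / Δtotal × N_cores (fractional cores).
fn utilization_cores(prev: CpuTicks, curr: CpuTicks, n_cores: f64) -> f64 {
    let total = |t: CpuTicks| t.user + t.nice + t.system + t.idle
                            + t.iowait + t.irq + t.softirq + t.steal;
    let idle = |t: CpuTicks| t.idle + t.iowait;
    let d_total = (total(curr) - total(prev)) as f64;
    let d_idle = (idle(curr) - idle(prev)) as f64;
    if d_total == 0.0 { return 0.0; } // warm-up pass / no elapsed ticks
    (d_total - d_idle) / d_total * n_cores // deliberately NOT clamped
}

fn main() {
    let prev = CpuTicks { user: 100, nice: 0, system: 50, idle: 800,
                          iowait: 50, irq: 0, softirq: 0, steal: 0 };
    let curr = CpuTicks { user: 200, nice: 0, system: 100, idle: 840,
                          iowait: 60, irq: 0, softirq: 0, steal: 0 };
    // Δtotal = 200, Δidle = 50 → 150/200 × 4 cores = 3.0 cores in use
    assert_eq!(utilization_cores(prev, curr, 4.0), 3.0);
    // Identical snapshots (first call) yield 0.0
    assert_eq!(utilization_cores(prev, prev, 4.0), 0.0);
}
```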
6.1.2 MemoryMetrics
Source: /proc/meminfo. All values in mebibytes (MiB = 1024 × 1024 bytes),
standardized to match Python resource-tracker PR #9 which also adopts MiB
throughout.
| Field | Type | Unit | /proc/meminfo key(s) | Notes |
|---|---|---|---|---|
total_mib | u64 | MiB | MemTotal | |
free_mib | u64 | MiB | MemFree | Truly free RAM |
available_mib | u64 | MiB | MemAvailable | Free + reclaimable |
used_mib | u64 | MiB | MemTotal − MemFree − Buffers − Cached | Matches Python memory_used |
used_pct | f64 | % | derived | used_mib / total_mib × 100; range 0.0–100.0 |
buffers_mib | u64 | MiB | Buffers | Kernel I/O buffers |
cached_mib | u64 | MiB | Cached + SReclaimable | Page cache + slab reclaimable |
swap_total_mib | u64 | MiB | SwapTotal | |
swap_used_mib | u64 | MiB | SwapTotal − SwapFree | |
swap_used_pct | f64 | % | derived | 0.0 when SwapTotal == 0 |
active_mib | u64 | MiB | Active | |
inactive_mib | u64 | MiB | Inactive |
Verifiable MemoryMetrics Tests:
- T-MEM-01: `free_mib + used_mib + buffers_mib + cached_mib ≤ total_mib` (accounting for kernel-reserved memory).
- T-MEM-02: `used_pct` is in [0.0, 100.0].
- T-MEM-03: `swap_used_pct` is 0.0 when `swap_total_mib` == 0.
- T-MEM-04: `available_mib ≤ total_mib`.
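The derived fields follow directly from the table's formulas. A minimal sketch (hypothetical helper names; real code first parses the kB values from `/proc/meminfo` and converts them to MiB):

```rust
/// used_mib = MemTotal − MemFree − Buffers − Cached (all in MiB).
/// saturating_sub guards against transient inconsistent readings.
fn used_mib(total: u64, free: u64, buffers: u64, cached: u64) -> u64 {
    total.saturating_sub(free + buffers + cached)
}

/// used_pct = used_mib / total_mib × 100; 0.0 on a (degenerate) zero total.
fn used_pct(used: u64, total: u64) -> f64 {
    if total == 0 { 0.0 } else { used as f64 / total as f64 * 100.0 }
}

fn main() {
    // 16000 MiB total, 4000 free, 500 buffers, 3500 cached → 8000 used (50%)
    let used = used_mib(16000, 4000, 500, 3500);
    assert_eq!(used, 8000);
    assert_eq!(used_pct(used, 16000), 50.0);
}
```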
6.1.3 NetworkMetrics
Source: /proc/net/dev (throughput), /sys/class/net/<iface>/ (identity/link state).
One NetworkMetrics record per non-loopback interface.
> Architecture note: Fields such as `mac_address`, `driver`, `operstate`, `speed_mbps`, and `mtu` are static properties that do not change every second. They are candidates for promotion to a host-discovery snapshot (Section 8.1) rather than being repeated in every per-second sample. This applies similarly to static fields in Section 6.1.4 (disk) and Section 6.1.5 (GPU). The current spec includes them here for completeness; a future revision should separate static identity fields from dynamic rate fields.
| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
interface | String | — | interface name | e.g. "eth0" |
mac_address | Option<String> | — | /sys/class/net/<iface>/address | "00:11:22:33:44:55" |
driver | Option<String> | — | /sys/class/net/<iface>/device/driver symlink | e.g. "igc" |
operstate | Option<String> | — | /sys/class/net/<iface>/operstate | "up", "down", "unknown" |
speed_mbps | Option<i64> | Mbps | /sys/class/net/<iface>/speed | −1 when not reported |
mtu | Option<u32> | bytes | /sys/class/net/<iface>/mtu | |
rx_bytes_per_sec | f64 | bytes/s | /proc/net/dev Δ | Rate for this interval |
tx_bytes_per_sec | f64 | bytes/s | /proc/net/dev Δ | Rate for this interval |
rx_bytes_total | u64 | bytes | /proc/net/dev | Cumulative since boot |
tx_bytes_total | u64 | bytes | /proc/net/dev | Cumulative since boot |
Verifiable NetworkMetrics Tests:
- T-NET-01: `rx_bytes_per_sec` ≥ 0.0 and `tx_bytes_per_sec` ≥ 0.0 for all interfaces.
- T-NET-02: `rx_bytes_total` monotonically non-decreasing between consecutive samples (absent interface reset).
- T-NET-03: The loopback interface (`lo`) is NOT included in the output.
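The per-interval rates derive from deltas of the cumulative `/proc/net/dev` counters. A minimal sketch; the `saturating_sub` guard handles the interface-reset caveat noted in T-NET-02 while preserving T-NET-01's non-negativity invariant:

```rust
/// rate = Δ(cumulative byte counter) / elapsed seconds.
/// A counter that went backwards (interface re-created) yields 0.0
/// rather than a negative or wildly large rate.
fn bytes_per_sec(prev_total: u64, curr_total: u64, elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 { return 0.0; }
    curr_total.saturating_sub(prev_total) as f64 / elapsed_secs
}

fn main() {
    // 2000 bytes over a 2-second interval → 1000 bytes/s
    assert_eq!(bytes_per_sec(1000, 3000, 2.0), 1000.0);
    // Counter reset: clamp to 0.0 instead of underflowing
    assert_eq!(bytes_per_sec(3000, 1000, 2.0), 0.0);
}
```

The same helper applies unchanged to the `/proc/diskstats` read/write rates in Section 6.1.4.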
6.1.4 DiskMetrics
Source: /proc/diskstats (throughput), /sys/block/<dev>/ (identity),
statvfs(3) (space). One DiskMetrics record per block device (excluding
partitions and device-mapper synthetic devices unless mounted independently).
| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
| `device` | `String` | — | kernel device name | e.g. `"sda"`, `"nvme0n1"` |
| `model` | `Option<String>` | — | `/sys/block/<dev>/device/model` | |
| `vendor` | `Option<String>` | — | `/sys/block/<dev>/device/vendor` | |
| `serial` | `Option<String>` | — | `/sys/block/<dev>/device/wwid` or serial | |
| `device_type` | `Option<DiskType>` | — | `/sys/block/<dev>/queue/rotational` | `Nvme`, `Ssd`, or `Hdd`; `None` when the type cannot be determined |
| `capacity_bytes` | `Option<u64>` | bytes | `/sys/block/<dev>/size` × 512 | |
| `mounts` | `Vec<DiskMountMetrics>` | — | `statvfs(3)` | One per mount point |
| `read_bytes_per_sec` | `f64` | bytes/s | `/proc/diskstats` Δ | |
| `write_bytes_per_sec` | `f64` | bytes/s | `/proc/diskstats` Δ | |
| `read_bytes_total` | `u64` | bytes | `/proc/diskstats` sectors × sector_size | Cumulative since boot; see sector size note |
| `write_bytes_total` | `u64` | bytes | `/proc/diskstats` sectors × sector_size | Cumulative since boot; see sector size note |
DiskMountMetrics fields:
| Field | Type | Unit | Notes |
|---|---|---|---|
| `mount_point` | `String` | — | e.g. `"/"` |
| `filesystem` | `String` | — | Filesystem type from `/proc/mounts`; e.g. `"ext4"`, `"xfs"` |
| `total_bytes` | `u64` | bytes | `statvfs.f_blocks` × `f_bsize` |
| `available_bytes` | `u64` | bytes | `statvfs.f_bavail` × `f_bsize` (unprivileged) |
| `used_bytes` | `u64` | bytes | `total_bytes` − (`statvfs.f_bfree` × `f_bsize`) |
| `used_pct` | `f64` | % | `used_bytes / total_bytes × 100`; 0.0 when total == 0 |
Sector size note: The current implementation hard-codes 512 bytes/sector for `/proc/diskstats` conversions. Python's `get_sector_sizes()` reads `/sys/block/<dev>/queue/hw_sector_size` (fallback 512). On 4K-native drives (some NVMe) the Rust code will under-count I/O bytes by up to 8×. A future fix should read `/sys/block/<dev>/queue/logical_block_size` at startup and use the actual sector size. See implementation plan P-DSK-SECTOR.
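A sketch of the proposed P-DSK-SECTOR fix, reading the sysfs attribute named in the note above once at startup; the helper name is hypothetical:

```rust
use std::fs;

/// Logical sector size for a block device, read once at startup from
/// /sys/block/<dev>/queue/logical_block_size. Falls back to 512 bytes
/// when the attribute is missing or unparsable (matching the Python
/// implementation's fallback).
fn logical_sector_size(device: &str) -> u64 {
    let path = format!("/sys/block/{}/queue/logical_block_size", device);
    fs::read_to_string(&path)
        .ok()
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(512)
}
```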
Verifiable DiskMetrics Tests:
- T-DSK-01: `read_bytes_per_sec` ≥ 0.0 and `write_bytes_per_sec` ≥ 0.0.
- T-DSK-02: For each mount, `used_bytes + available_bytes ≤ total_bytes`.
- T-DSK-03: `capacity_bytes` (when `Some`) > 0.
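The `DiskMountMetrics` arithmetic reduces to pure integer math over `statvfs(3)` fields; a sketch with the `libc::statvfs` call itself omitted and names chosen for illustration:

```rust
/// Mount-point space figures derived from statvfs(3) fields, per the
/// DiskMountMetrics table. `f_bsize` is the filesystem block size.
struct MountSpace {
    total_bytes: u64,
    available_bytes: u64,
    used_bytes: u64,
    used_pct: f64,
}

fn mount_space(f_blocks: u64, f_bfree: u64, f_bavail: u64, f_bsize: u64) -> MountSpace {
    let total_bytes = f_blocks * f_bsize;
    // f_bavail is the unprivileged view (excludes root-reserved blocks),
    // while used_bytes is derived from f_bfree, so the two need not sum
    // to total_bytes exactly — hence "≤" in T-DSK-02.
    let available_bytes = f_bavail * f_bsize;
    let used_bytes = total_bytes - f_bfree * f_bsize;
    let used_pct = if total_bytes == 0 {
        0.0
    } else {
        used_bytes as f64 / total_bytes as f64 * 100.0
    };
    MountSpace { total_bytes, available_bytes, used_bytes, used_pct }
}
```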
6.1.5 GpuMetrics
Source: NVML (nvml-wrapper crate, runtime-loads libnvidia-ml.so) for
NVIDIA GPUs; libamdgpu_top (runtime-loads libdrm) for AMD GPUs.
| Field | Type | Unit | Notes |
|---|---|---|---|
| `uuid` | `String` | — | Stable vendor UUID; AMD uses the PCI bus address |
| `name` | `String` | — | Human-readable device name |
| `device_type` | `String` | — | `"GPU"`, `"NPU"`, `"TPU"` |
| `host_id` | `String` | — | Host-level device identifier |
| `detail` | `HashMap<String, String>` | — | Vendor-specific extras (driver version, PCI bus ID, ROCm version) |
| `utilization_pct` | `f64` | % | Core utilization; range 0.0–100.0 |
| `vram_total_bytes` | `u64` | bytes | |
| `vram_used_bytes` | `u64` | bytes | |
| `vram_used_pct` | `f64` | % | `vram_used / vram_total × 100`; 0.0 when total == 0 |
| `temperature_celsius` | `u32` | °C | Die temperature |
| `power_watts` | `f64` | W | NVML reports mW; converted to W |
| `frequency_mhz` | `u32` | MHz | Core/graphics clock |
| `core_count` | `Option<u32>` | count | Shader/compute cores; `None` if not reported |
AMD-specific: When /sys/module/amdgpu does not exist the AMD collection path MUST be skipped entirely (no panic).
NVIDIA-specific: power_watts = raw NVML milliwatt value / 1000.
Verifiable GpuMetrics Tests:
- T-GPU-01: On a CPU-only host, the `gpu` Vec is empty and no error is returned.
- T-GPU-02: `utilization_pct` is in [0.0, 100.0] for each GPU.
- T-GPU-03: `vram_used_bytes ≤ vram_total_bytes` for each GPU.
- T-GPU-04: `vram_used_pct` is 0.0 when `vram_total_bytes` == 0.
- T-GPU-05: On a host with an AMD GPU, `uuid` equals the PCI bus address string.
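The zero-VRAM guard behind T-GPU-04 can be isolated into a small helper (the name is illustrative):

```rust
/// vram_used_pct per the GpuMetrics table: guard against division by
/// zero when a device reports no VRAM (T-GPU-04).
fn vram_used_pct(vram_used_bytes: u64, vram_total_bytes: u64) -> f64 {
    if vram_total_bytes == 0 {
        0.0
    } else {
        vram_used_bytes as f64 / vram_total_bytes as f64 * 100.0
    }
}
```

The same guard pattern applies to `used_pct` in Section 6.1.4.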
7. Output Formats
7.1 JSON Lines (default)
Each sample is emitted as a single JSON object followed by \n. The binary
MUST include a version field keyed as "<crate-name>-version" with the value
being the Cargo package version string.
Example (abbreviated):
{"timestamp_secs":1743300000,"job_name":null,"cpu":{...},"memory":{...},"network":[...],"disk":[...],"gpu":[],"resource-tracker-version":"0.1.0"}
Requirements:
- T-OUT-01: Each line MUST be valid JSON parseable with any standard JSON library.
- T-OUT-02: `timestamp_secs` MUST be present and be a positive integer.
- T-OUT-03: The version key `"resource-tracker-version"` MUST be present.
- T-OUT-04: Consecutive samples MUST have non-decreasing `timestamp_secs`.
7.2 CSV Format
CSV is the primary and required output format for Sentinel S3 streaming
(Section 9.2.2). It uses the same column names and units as the Python
resource-tracker so the Sentinel backend can ingest both without schema
changes. When uploaded to S3 the CSV content MUST be gzip-compressed and the
object key MUST carry the extension .csv.gz.
When --format csv is selected for stdout output the raw (uncompressed) CSV
bytes are written. Gzip compression is applied only when writing the S3 batch
upload payload (Section 9.2.2).
When --format csv is selected:
- The header line MUST be emitted exactly once, before the first data row.
- The header MUST match the following column names in this exact order:
timestamp,processes,utime,stime,cpu_usage,memory_free,memory_used,memory_buffers,memory_cached,memory_active,memory_inactive,disk_read_bytes,disk_write_bytes,disk_space_total_gb,disk_space_used_gb,disk_space_free_gb,net_recv_bytes,net_sent_bytes,gpu_usage,gpu_vram,gpu_utilized
Column definitions:
| CSV Column | Source Field | Unit | Computation |
|---|---|---|---|
| `timestamp` | `timestamp_secs` | Unix seconds | direct |
| `processes` | `cpu.process_count` | count | direct |
| `utime` | `cpu.utime_secs` | seconds | direct; 3 decimal places |
| `stime` | `cpu.stime_secs` | seconds | direct; 3 decimal places |
| `cpu_usage` | `cpu.utilization_pct` | fractional cores | `utilization_pct` directly; the field is already in fractional cores (0..N_cores); 4 decimal places |
| `memory_free` | `memory.free_mib` | MiB | direct |
| `memory_used` | `memory.used_mib` | MiB | direct |
| `memory_buffers` | `memory.buffers_mib` | MiB | direct |
| `memory_cached` | `memory.cached_mib` | MiB | direct |
| `memory_active` | `memory.active_mib` | MiB | direct |
| `memory_inactive` | `memory.inactive_mib` | MiB | direct |
| `disk_read_bytes` | disk subsystem | bytes | Σ `read_bytes_per_sec × interval_secs` across all devices; integer |
| `disk_write_bytes` | disk subsystem | bytes | Σ `write_bytes_per_sec × interval_secs` across all devices; integer |
| `disk_space_total_gb` | disk mounts | GB (10⁹) | Σ `total_bytes / 1_000_000_000` across all mounts; 6 decimal places |
| `disk_space_used_gb` | disk mounts | GB (10⁹) | `disk_space_total_gb − disk_space_free_gb`; 6 decimal places |
| `disk_space_free_gb` | disk mounts | GB (10⁹) | Σ `available_bytes / 1_000_000_000` across all mounts; 6 decimal places |
| `net_recv_bytes` | network subsystem | bytes | Σ `rx_bytes_per_sec × interval_secs` across all interfaces; integer |
| `net_sent_bytes` | network subsystem | bytes | Σ `tx_bytes_per_sec × interval_secs` across all interfaces; integer |
| `gpu_usage` | gpu subsystem | fractional GPUs | Σ `utilization_pct / 100` across all GPUs; 4 decimal places |
| `gpu_vram` | gpu subsystem | MiB | Σ `vram_used_bytes / 1_048_576`; 4 decimal places |
| `gpu_utilized` | gpu subsystem | count | count of GPUs where `utilization_pct > 0.0` |
Verifiable CSV Tests:
- T-CSV-01: Header is emitted exactly once, as the first line.
- T-CSV-02: Column count per data row equals the column count in the header.
- T-CSV-03: The `cpu_usage` column equals `utilization_pct` directly (the field is already fractional cores, 0..N_cores) to 4 dp.
- T-CSV-04: `disk_space_used_gb = disk_space_total_gb − disk_space_free_gb` for all rows.
- T-CSV-05: CSV output for a given sample is byte-for-byte reproducible (deterministic).
- T-CSV-06: No trailing commas; no quoted fields (all values are numbers or bare identifiers).
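The Σ rate × interval derivations for the byte columns and the fractional-GPU sum can be sketched as plain folds over the per-device records (helper names are illustrative):

```rust
/// Aggregate per-device rates into a per-interval CSV byte column
/// (the disk_read_bytes / net_recv_bytes derivation from the table
/// above): sum of rate × interval, truncated to an integer.
fn bytes_this_interval(rates_bytes_per_sec: &[f64], interval_secs: f64) -> u64 {
    rates_bytes_per_sec
        .iter()
        .map(|r| r * interval_secs)
        .sum::<f64>() as u64
}

/// gpu_usage: per-GPU utilization percentages expressed as a sum of
/// fractional GPUs (e.g. two half-busy GPUs -> 1.0).
fn gpu_usage(utilization_pcts: &[f64]) -> f64 {
    utilization_pcts.iter().map(|p| p / 100.0).sum()
}
```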
8. Host and Cloud Discovery
The binary SHOULD collect machine-level metadata once at startup and include it
in the Sentinel API registration payload (Section 9.1). Collected fields use the prefix host_ or cloud_.
8.1 Host Discovery
All fields are optional; collection failure MUST be silently swallowed.
| Field | Type | Source |
|---|---|---|
| `host_id` | `Option<String>` | AWS: `/sys/class/dmi/id/board_asset_tag`; fallback: `/etc/machine-id` |
| `host_name` | `Option<String>` | `gethostname(3)` |
| `host_ip` | `Option<String>` | First non-loopback IPv4 from `getifaddrs(3)` |
| `host_allocation` | `Option<String>` | `"dedicated"` or `"shared"`; heuristic TBD |
| `host_vcpus` | `Option<u32>` | Count of logical CPUs (`/proc/cpuinfo` processor entries) |
| `host_cpu_model` | `Option<String>` | `/proc/cpuinfo` model name field |
| `host_memory_mib` | `Option<u64>` | `MemTotal / 1024` from `/proc/meminfo` |
| `host_gpu_model` | `Option<String>` | First GPU name from `GpuCollector` |
| `host_gpu_count` | `Option<u32>` | Length of the GPU Vec |
| `host_gpu_vram_mib` | `Option<u64>` | Sum of `vram_total_bytes / 1_048_576` across all GPUs |
| `host_storage_gb` | `Option<f64>` | Sum of `capacity_bytes / 1_000_000_000` across all block devices |
Users MUST be able to suppress any field by setting the corresponding
environment variable to "0" or "" (exact mechanism TBD in implementation).
8.2 Cloud Discovery
Cloud metadata is probed by making HTTP GET requests to each cloud provider’s Instance Metadata Service (IMDS) with a short timeout (≤ 2 seconds per provider). Probes MUST be attempted in the background and MUST NOT delay the first sample emission.
| Field | Probe endpoint | Notes |
|---|---|---|
| `cloud_vendor_id` | AWS: `169.254.169.254/latest/meta-data/`; GCP: `metadata.google.internal`; Azure: `169.254.169.254/metadata/instance` | Infer vendor from which endpoint responds |
| `cloud_account_id` | AWS: `/latest/meta-data/identity-credentials/ec2/info` | |
| `cloud_region_id` | AWS: `/latest/meta-data/placement/region` | |
| `cloud_zone_id` | AWS: `/latest/meta-data/placement/availability-zone` | |
| `cloud_instance_type` | AWS: `/latest/meta-data/instance-type` | |
Verifiable Cloud Discovery Tests:
- T-CLD-01: On a non-cloud host, all `cloud_*` fields are `None` and the binary does not hang for more than 5 seconds total on startup.
- T-CLD-02: The IMDS probe timeout is ≤ 2 seconds per provider.
9. Sentinel API Streaming (Extra Component)
Activation is gated on the SENTINEL_API_TOKEN environment variable being set.
Resolved design decisions:
- Streaming is enabled automatically whenever `SENTINEL_API_TOKEN` is set; no additional flag is needed.
- The upload format is `csv.gz` only; `jsonl.gz` is not supported.
- Streaming is not separately configurable via TOML or CLI beyond the token env var.
- On network unavailability: `start_run` logs a warning and disables streaming; local stdout output continues normally (see Section 11 error handling).
9.1 Authentication
The binary MUST read the API token from the environment variable
SENTINEL_API_TOKEN. Every Sentinel API request MUST include the HTTP header:
Authorization: Bearer <token>
If SENTINEL_API_TOKEN is not set, all streaming functionality MUST be silently disabled. Local stdout emission continues normally.
9.2 Run Lifecycle
9.2.1 Start of Run
At startup (after host/cloud discovery), the binary MUST POST to the data ingestion endpoint to register a new Run.
POST /runs (default base URL: https://api.sentinel.sparecores.net).
Request payload (JSON, Content-Type: application/json): all metadata, host, and
cloud fields are merged into a flat top-level object (no nesting):
{
"job_name": "...",
"project_name": "...",
"pid": 12345,
"host_vcpus": 8,
"cloud_vendor_id": "aws",
...
}
Response fields the binary MUST store:
| Response Field | Type | Usage |
|---|---|---|
| `run_id` | String | Referenced in all subsequent API calls |
| `upload_uri_prefix` | String | S3 URI prefix for metric uploads |
| `upload_credentials.access_key` | String | STS credential |
| `upload_credentials.secret_key` | String | STS credential |
| `upload_credentials.session_token` | String | STS credential |
| `upload_credentials.expiration` | String (ISO 8601) | STS credential expiry; optional |
9.2.2 Batch Upload (Background Thread)
The binary MUST start a background thread that:
- Every 60 seconds (configurable, default 60), takes all samples collected since the previous upload.
- Serializes them as CSV (same column layout as Section 7.2) – CSV is the only accepted format for the Sentinel S3 bucket.
- Gzip-compresses the CSV bytes.
- Generates a unique S3 object key under `upload_uri_prefix`: `<upload_uri_prefix>/<run_id>/<batch_seq_number>.csv.gz`
- Uploads via AWS Signature V4 (Section 10).
- Appends the uploaded URI to an internal list `uploaded_uris`.
If STS credentials are within 5 minutes of expiration, the binary MUST refresh
them by POSTing to /runs/{run_id}/refresh-credentials before attempting the upload.
Upload failures MUST be retried at least once with exponential back-off before being recorded as errors. After 3 consecutive upload failures the background thread MUST log a warning and continue buffering (data is not lost).
Verifiable Streaming Tests:
- T-STR-01: Without `SENTINEL_API_TOKEN`, no HTTP connection is made.
- T-STR-02: A batch upload request contains `Content-Encoding: gzip` and the body decompresses to valid CSV.
- T-STR-03: `uploaded_uris` contains the S3 URIs of all successfully uploaded batches.
- T-STR-04: Credential refresh is triggered when ≤ 5 minutes remain before the credential `expiration` timestamp.
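One possible shape for the retry policy described above (at least one retry, exponential back-off, never losing buffered data); the attempt cap and delays are illustrative, not mandated by the spec:

```rust
use std::thread;
use std::time::Duration;

/// Retry an upload with exponential back-off, per Section 9.2.2.
/// `upload` returns true on success. Returns false after all attempts
/// fail, in which case the caller logs a warning and keeps buffering
/// (data is not lost).
fn upload_with_retry<F>(mut upload: F, max_attempts: u32, base_delay: Duration) -> bool
where
    F: FnMut() -> bool,
{
    for attempt in 0..max_attempts {
        if upload() {
            return true;
        }
        if attempt + 1 < max_attempts {
            // 1x, 2x, 4x, ... the base delay between attempts
            thread::sleep(base_delay * 2u32.pow(attempt));
        }
    }
    false
}
```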
9.2.3 End of Run
When the tracked process terminates (or the binary receives SIGTERM), the binary MUST:
SIGINT note: An explicit SIGINT handler is not installed. When the binary is used in shell-wrapper mode, Ctrl-C is delivered to the entire process group, so both the child and the tracker receive SIGINT and exit together. Explicit SIGTERM forwarding to the child process is a future enhancement.
- Flush any remaining samples as a final batch upload (if `uploaded_uris` is non-empty).
- POST to `/runs/{run_id}/finish` to close the Run, including:
  - `run_id`
  - `exit_code` (i32, if the tracked process exited cleanly; else None)
  - `run_status` enum: `"finished"` (exit 0 or SIGTERM) or `"failed"` (non-zero exit)
  - `data_source`: `"s3"` + `data_uris: Vec<String>` if any S3 uploads succeeded; `"inline"` + `data_csv: <base64(gzip(csv))>` for short runs with no S3 uploads.
Verifiable End-of-Run Tests:
- T-EOR-01: On SIGTERM, the binary exits with code 0 after flushing remaining data.
- T-EOR-02: The close-run request body contains a `run_id` matching the start-run response.
- T-EOR-03: `data_source` is `"inline"` when no S3 uploads occurred.
- T-EOR-04: `data_source` is `"s3"` when at least one S3 upload succeeded.
9.3 Metadata Fields
The following metadata MAY be supplied by the user via CLI flags or environment variables. All are optional strings unless noted.
| Field | CLI Flag | Env Variable |
|---|---|---|
| `job_name` | `--job-name` | `TRACKER_JOB_NAME` |
| `project_name` | `--project-name` | `TRACKER_PROJECT_NAME` |
| `stage_name` | `--stage-name` | `TRACKER_STAGE_NAME` |
| `task_name` | `--task-name` | `TRACKER_TASK_NAME` |
| `team` | `--team` | `TRACKER_TEAM` |
| `env` | `--env` | `TRACKER_ENV` |
| `language` | `--language` | `TRACKER_LANGUAGE` |
| `orchestrator` | `--orchestrator` | `TRACKER_ORCHESTRATOR` |
| `executor` | `--executor` | `TRACKER_EXECUTOR` |
| `external_run_id` | `--external-run-id` | `TRACKER_EXTERNAL_RUN_ID` |
| `container_image` | `--container-image` | `TRACKER_CONTAINER_IMAGE` |
Users MUST also be able to supply arbitrary key-value tags via repeated --tag key=value flags.
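A minimal sketch of `--tag key=value` parsing as it might feed a clap value parser; rejecting an empty key is an assumption here, since the spec does not define malformed-tag handling:

```rust
/// Parse one repeated `--tag key=value` argument into a (key, value)
/// pair. Returns None when there is no '=' or the key is empty
/// (assumed behavior; the spec leaves malformed tags undefined).
fn parse_tag(arg: &str) -> Option<(String, String)> {
    let (key, value) = arg.split_once('=')?;
    if key.is_empty() {
        return None;
    }
    Some((key.to_string(), value.to_string()))
}
```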
10. S3 Upload — AWS Signature V4
The upload is implemented in pure Rust without any AWS SDK dependency (zero
additional transitive deps for this path). The implementation mirrors the
Python s3_upload.py module from PR #9.
10.1 URI Parsing
An S3 URI has the form s3://bucket/path/to/object. Parsing MUST:
- Require scheme == `"s3"`.
- Require a non-empty bucket name.
- Require a non-empty key (path after the bucket).
- Return an error for any other form.
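The rules above (and tests T-S3-01 through T-S3-03 in Section 10.6) can be satisfied with a few lines of std string handling; the error messages are illustrative:

```rust
/// Parse `s3://bucket/key` per Section 10.1; any other form is an error.
fn parse_s3_uri(uri: &str) -> Result<(String, String), String> {
    let rest = uri
        .strip_prefix("s3://")
        .ok_or_else(|| format!("not an s3:// URI: {uri}"))?;
    let (bucket, key) = rest
        .split_once('/')
        .ok_or_else(|| format!("missing key in {uri}"))?;
    if bucket.is_empty() {
        return Err(format!("empty bucket in {uri}"));
    }
    if key.is_empty() {
        return Err(format!("empty key in {uri}"));
    }
    Ok((bucket.to_string(), key.to_string()))
}
```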
10.2 Bucket Region Detection
If the upload region is not supplied, the binary MUST determine it by sending
an HTTP HEAD request to https://<bucket>.s3.amazonaws.com/ and reading the
x-amz-bucket-region response header. The header is present even on 3xx/4xx
responses. Results MUST be cached in-process for the lifetime of the run.
Default fallback: "eu-central-1".
10.3 Request Construction
A PUT request to https://<bucket>.s3.<region>.amazonaws.com/<key> with:
- `Content-Length`: byte count of the body.
- `x-amz-content-sha256`: SHA-256 hex of the body.
- `x-amz-date`: `YYYYMMDDTHHMMSSZ` UTC.
- `x-amz-security-token`: STS session token.
- `Authorization`: AWS4-HMAC-SHA256 signature (see Section 10.4).
10.4 AWS Signature V4
Signing key derivation:
kDate = HMAC-SHA256("AWS4" + secret_key, date_stamp)
kRegion = HMAC-SHA256(kDate, region)
kService = HMAC-SHA256(kRegion, "s3")
kSigning = HMAC-SHA256(kService, "aws4_request")
Canonical request:
PUT
/<key>
host:<bucket>.s3.<region>.amazonaws.com
x-amz-content-sha256:<payload_hash>
x-amz-date:<amz_date>
x-amz-security-token:<session_token>
host;x-amz-content-sha256;x-amz-date;x-amz-security-token
<payload_hash>
String to sign:
AWS4-HMAC-SHA256
<amz_date>
<date_stamp>/<region>/s3/aws4_request
<canonical_request_sha256>
Authorization header:
AWS4-HMAC-SHA256 Credential=<access_key>/<credential_scope>, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token, Signature=<hex_sig>
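The canonical request and string-to-sign are pure string construction; a sketch with the hashing (sha2) and HMAC (hmac) steps elided, so `payload_hash` and `canonical_request_sha256` stand in for hex SHA-256 digests. The blank canonical-query-string line and the blank line terminating the header block follow AWS's SigV4 canonical-request layout:

```rust
/// Canonical request for the S3 PUT in Section 10.3. The empty line
/// after "PUT\n/<key>" is the (empty) canonical query string; the
/// blank line after the headers terminates the header block.
fn canonical_request(
    bucket: &str,
    region: &str,
    key: &str,
    amz_date: &str,
    session_token: &str,
    payload_hash: &str,
) -> String {
    let host = format!("{bucket}.s3.{region}.amazonaws.com");
    let signed = "host;x-amz-content-sha256;x-amz-date;x-amz-security-token";
    format!(
        "PUT\n/{key}\n\nhost:{host}\nx-amz-content-sha256:{payload_hash}\nx-amz-date:{amz_date}\nx-amz-security-token:{session_token}\n\n{signed}\n{payload_hash}"
    )
}

/// String to sign, per Section 10.4.
fn string_to_sign(
    amz_date: &str,
    date_stamp: &str,
    region: &str,
    canonical_request_sha256: &str,
) -> String {
    format!(
        "AWS4-HMAC-SHA256\n{amz_date}\n{date_stamp}/{region}/s3/aws4_request\n{canonical_request_sha256}"
    )
}
```

The golden-value test T-S3-04 would feed these through SHA-256 and the HMAC key-derivation chain with fixed inputs.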
10.5 Upload Success Criteria
HTTP 200 or 201 response from S3 = success. Any other status = error (with response body included in the error message).
10.6 Verifiable S3 Upload Tests
- T-S3-01: `parse_s3_uri("s3://bucket/path/obj")` returns `("bucket", "path/obj")`.
- T-S3-02: `parse_s3_uri("https://bucket/path")` returns an error.
- T-S3-03: `parse_s3_uri("s3://bucket/")` returns an error (empty key).
- T-S3-04: Given known access_key, secret_key, session_token, region, and a fixed timestamp, the generated `Authorization` header MUST match a pre-computed golden value.
- T-S3-05: The bucket region cache prevents duplicate HEAD requests for the same bucket.
- T-S3-06: An upload to a mock S3 server returns the S3 URI on success.
11. Error Handling
| Scenario | Required behavior |
|---|---|
| `/proc` file is unreadable for a single metric | Return 0 / `None` for that field; do not abort |
| GPU library absent | GPU Vec is empty; no error propagated |
| Sentinel API unreachable at start | Log warning; streaming disabled; local output continues |
| S3 upload fails | Retry once; after 3 consecutive failures log warning and continue |
| Config TOML parse error | Silently fall back to defaults |
| `--interval 0` | Exit with code ≠ 0 before starting collectors |
| Tracked PID not found | process_cores_used = None; do not abort |
The binary MUST NEVER panic in production code. `expect()` is only permissible during development; all `expect()` calls MUST be replaced with proper error handling before the v1.0 release.
12. Non-Functional Requirements
| Requirement | Target |
|---|---|
| Binary size | < 15 MiB stripped (CPU-only build) |
| Startup latency | < 1 × configured interval before first sample |
| CPU overhead of the tracker itself | < 1% of one core at 1-second interval on a 4-core host |
| Memory footprint | < 20 MiB RSS at steady state |
| Stdout buffering | Each line MUST be flushed atomically (no partial lines) |
13. Compatibility with Python resource-tracker
The CSV output format MUST maintain byte-for-byte column-name compatibility
with the Python SystemTracker output so that the Sentinel API backend can
ingest both without schema changes.
Confirmed equivalent columns (see Section 7.2 for derivation):
| Python column | Rust CSV column | Python unit | Rust unit |
|---|---|---|---|
| `timestamp` | `timestamp` | Unix seconds | Unix seconds |
| `processes` | `processes` | count | count |
| `utime` | `utime` | seconds | seconds |
| `stime` | `stime` | seconds | seconds |
| `cpu_usage` | `cpu_usage` | fractional cores | fractional cores |
| `memory_free` | `memory_free` | MiB | MiB |
| `memory_used` | `memory_used` | MiB | MiB |
| `memory_buffers` | `memory_buffers` | MiB | MiB |
| `memory_cached` | `memory_cached` | MiB | MiB |
| `memory_active` | `memory_active` | MiB | MiB |
| `memory_inactive` | `memory_inactive` | MiB | MiB |
| `disk_read_bytes` | `disk_read_bytes` | bytes/interval | bytes/interval |
| `disk_write_bytes` | `disk_write_bytes` | bytes/interval | bytes/interval |
| `disk_space_total_gb` | `disk_space_total_gb` | GB (10⁹) | GB (10⁹) |
| `disk_space_used_gb` | `disk_space_used_gb` | GB (10⁹) | GB (10⁹) |
| `disk_space_free_gb` | `disk_space_free_gb` | GB (10⁹) | GB (10⁹) |
| `net_recv_bytes` | `net_recv_bytes` | bytes/interval | bytes/interval |
| `net_sent_bytes` | `net_sent_bytes` | bytes/interval | bytes/interval |
| `gpu_usage` | `gpu_usage` | fractional GPUs | fractional GPUs |
| `gpu_vram` | `gpu_vram` | MiB | MiB |
| `gpu_utilized` | `gpu_utilized` | count | count |
Verifiable compatibility test:
- T-COMPAT-01: Run the Python and Rust trackers in parallel on the same host for 60 seconds. For each interval, the difference between corresponding scalar columns MUST be within 5% of the Python value (allowing for measurement-time skew).
14. Open Questions / Future Work
- eBPF integration: Using `aya-rs` or `libbpf-rs` for sub-millisecond tracing (CPU saturation, IPC, TLB misses, cache hit rates) — currently considered v2.
- Process-level memory (PSS): Preferred over RSS; requires reading `/proc/<pid>/smaps_rollup`, which may be slow for large processes.
- Per-process disk and network I/O: `/proc/<pid>/io` and network namespaces; currently only system-wide.
- Configurable metric suppression: Allow users to opt out of fields containing PII (e.g. `host_ip`, hostname).
- ARM-specific GPU support: Apple Metal is not in scope (Linux only); Qualcomm Adreno / Mali GPU metrics TBD.
- Static linking of NVML: Currently not possible; NVML requires a dynamically loaded vendor library.
- Heartbeat endpoint: Periodic ping to the Sentinel API while tracking is active (distinct from batch S3 uploads).
Project Dependencies
This is a Rust project requiring the Rust toolchain, including `cargo`, the Rust build system and package manager.
In addition to the base toolchain, this project also makes use of the following:
| Tool | Description | Rationale |
|---|---|---|
| uv | An extremely fast Python package and project manager | Solely for benchmarking against the Python implementation |
| just | A handy way to save and run project-specific commands | Convenience |
| jq | A handy way to slice and filter JSON output | Convenience tool for JSON and JSONL. |
| mdbook | A tool to create books with Markdown. | This project is documented via mdbook. |
Rust Crate Dependencies
Dependencies are declared in Cargo.toml and managed by cargo.
Runtime dependencies
| Crate | Version | Purpose |
|---|---|---|
| `nvml-wrapper` | 0.12 | NVIDIA GPU monitoring via NVML; loaded at runtime with `libloading` – no build-time system deps; returns empty on non-NVIDIA hosts |
| `clap` | 4 | CLI argument parsing; stripped to `derive`, `std`, `help`, `usage`, `error-context`, `env` features only |
| `procfs` | 0.18 | Linux `/proc` parsing for CPU, memory, disk, and network metrics |
| `ureq` | 3 | Lightweight synchronous HTTP client for Sentinel API and S3 PUT; avoids tokio runtime overhead |
| `serde` | 1 | Serialization/deserialization framework with derive macros |
| `serde_json` | 1 | JSON serialization for metric output and API payloads |
| `toml` | 1.0 | TOML config file parsing; `parse` + `serde` features only, no display overhead |
| `hmac` | 0.13.0-rc.6 | HMAC-SHA256 for manual AWS Signature Version 4 signing of S3 PUT requests |
| `sha2` | 0.11.0 | SHA-256 hashing required by AWS Sig V4; paired with `hmac` |
| `hex` | 0.4 | Hex encoding of HMAC digests for Sig V4 canonical request construction |
| `libc` | 0.2 | FFI bindings for `statvfs` (filesystem space), `gethostname`, and SIGTERM signal handling |
| `flate2` | 1.1.9 (pinned) | Gzip compression for `.csv.gz` S3 batch uploads; the `rust_backend` feature uses pure Rust (no `zlib-sys` C dep) |
| `libamdgpu_top` | 0.11.2 | AMD GPU monitoring via libdrm; the `libdrm_dynamic_loading` feature loads the library at runtime – gracefully skipped on non-AMD hosts |
Dev dependencies
| Crate | Version | Purpose |
|---|---|---|
| `num_cpus` | 1 | Smoke tests: verifies `cpu.utilization_pct` is expressed as fractional cores (bounded by the logical CPU count), not a percentage |
resource-tracker — Design Notes
Spec Summary
- Linux resource tracker (x86 + ARM), using `procfs` where appropriate
- Configurable polling interval for: CPU, memory, GPU, VRAM, network in/out, disk read/write
- GPU support requires dynamic linking (no static link)
- CLI tool with optional params (job name/metadata); TOML config file with sane defaults
- Basic HTTP client: hit API endpoints at start, stop, and every X minutes (heartbeat)
- Lightweight S3 PUT using AWS creds to stream resource utilization data
Dependency Assessment
Current Cargo.toml dependencies
| Crate | Version | Purpose |
|---|---|---|
| `nvml-wrapper` | 0.12 | NVIDIA GPU/VRAM monitoring via NVML; runtime dynamic loading |
| `libamdgpu_top` | 0.11.2, no defaults, `libdrm_dynamic_loading` | AMD GPU monitoring via libdrm; runtime dynamic loading |
| `clap` | 4, no defaults, `derive`+`std`+`help`+`usage`+`error-context`+`env` | CLI argument parsing, minimal footprint |
| `procfs` | 0.18, `serde` feature only | Linux `/proc` – CPU, memory, network, disk |
| `ureq` | 3, `json` feature | Lightweight sync HTTP – no tokio, no async runtime |
| `serde` | 1, `derive` | Serialization/deserialization |
| `serde_json` | 1 | JSON payload encoding for API and S3 |
| `toml` | 1.0, no defaults, `parse`+`serde` features | TOML config file parsing |
| `hmac` | 0.13.0-rc.6 | AWS Signature V4 HMAC signing |
| `sha2` | 0.11.0 | SHA-256 hashing for AWS Sig V4 |
| `hex` | 0.4 | Hex encoding for AWS Sig V4 signature |
| `libc` | 0.2 | `statvfs` for filesystem space, `gethostname`, SIGTERM |
| `flate2` | =1.1.9 (pinned), no defaults, `rust_backend` | Gzip compression for S3 batch uploads; pure Rust, no `zlib-sys` |
Release profile
[profile.release]
opt-level = "z" # optimize for size
lto = true # link-time optimization
codegen-units = 1 # better dead-code elimination
strip = true # strip symbols
panic = "abort" # smaller panic handler
Key decisions
- `nvml-wrapper` + `libamdgpu_top` over `all-smi`: `all-smi` required `protoc` at build time. Replaced with `nvml-wrapper` (NVIDIA, no build-time deps) and `libamdgpu_top` with `libdrm_dynamic_loading` (AMD, runtime-only). Both load their respective drivers at runtime and degrade gracefully when absent.
- `ureq` over `reqwest`: `reqwest` v0.13 pulls in `tokio` (full async runtime), `hyper`, and TLS stacks – adds ~5-10 MB. `ureq` v3 is synchronous, no runtime, comparable API surface.
- `procfs` features trimmed: Dropped `chrono` (heavy date/time lib, `std::time` suffices) and `flate2` (only needed for gzip-compressed `/proc` files, which are rare).
- `clap` defaults disabled: Default clap features include terminal color, unicode width, etc. Stripped to the functional minimum; the `env` feature added to support `TRACKER_*` environment variable overrides.
- Manual AWS Sig V4 (`hmac` + `sha2` + `hex`): Avoids `aws-sdk-s3` (~50+ transitive deps, large binary). S3 PUT only needs ~100-150 lines of signing logic.
- `toml` v1.0 defaults disabled: `parse` + `serde` features; the `serde` feature is required for `toml::from_str` deserialization into config structs.
- `flate2` pinned to `=1.1.9` with `rust_backend`: Pure Rust gzip implementation; avoids a `zlib-sys` C build dependency. Version pinned to prevent unexpected breakage from pre-1.0 semver.
- `libc` for sysfs/POSIX calls: `statvfs` for filesystem space, `gethostname` for host identity, and SIGTERM signal handling – pure FFI bindings with no additional binary size overhead.
Implementation Approaches
Option A — Single-file polling loop
All logic in main.rs. One tight loop: sleep → collect → diff deltas → buffer → flush.
main.rs
├── CLI parsing (clap)
├── Config loading (toml)
├── Polling loop
│ ├── procfs → CPU/mem/net/disk snapshots + delta computation
│ ├── all-smi → GPU/VRAM snapshots
│ └── Vec<Sample> batch buffer
├── HTTP calls (ureq) — start / stop / heartbeat
└── AWS Sig V4 signing + ureq PUT (inline)
Pros:
- Simplest to read and audit end-to-end
- Zero abstraction overhead
- Fastest to prototype
Cons:
- `main.rs` grows large and hard to navigate
- No isolation between collectors — hard to unit test
- Tight coupling makes it hard to disable/swap individual collectors
Best for: MVP / proof of concept.
Option B — Module-per-resource + collector trait (current)
A Collector trait drives a scheduler. Each resource lives in its own module with its own delta state.
src/
├── main.rs — CLI, config, scheduler loop
├── config.rs — TOML config struct + CLI override merge
├── sample.rs — Sample / Report structs (serde)
├── collector/
│ ├── mod.rs — Collector trait: fn collect(&mut self) -> Metric
│ ├── cpu.rs — procfs::CpuTime, delta between ticks
│ ├── memory.rs — procfs::Meminfo
│ ├── network.rs — procfs::Net, bytes delta
│ ├── disk.rs — procfs::DiskStats, read/write delta
│ └── gpu.rs — all-smi wrapper
└── reporter/
├── mod.rs — Reporter trait: fn report(&self, batch: &[Sample])
├── http.rs — ureq: start/stop/heartbeat endpoints
└── s3.rs — AWS Sig V4 + ureq PUT (batch upload)
Collector trait sketch:
pub trait Collector {
    fn collect(&mut self) -> Metric;
}
Reporter trait sketch:
pub trait Reporter {
    fn on_start(&self, meta: &JobMeta);
    fn on_sample(&self, batch: &[Sample]);
    fn on_stop(&self, meta: &JobMeta);
}
Pros:
- Each collector is independently testable with mock `/proc` data
- Clean ownership: delta state lives inside each collector struct
- Easy to add/remove resources without touching other collectors
- Reporter abstraction allows multiple outputs (HTTP + S3 simultaneously)
Cons:
- Slightly more upfront boilerplate (trait definitions, module layout)
- Minor indirection vs. inline code
Best for: Production implementation. Right level of structure for the spec.
Option C — Config-driven pipeline with Cargo feature flags
Extends Option B with #[cfg(feature = "...")] gates. GPU collector is behind feature = "gpu" since it requires dynamic linking. This enables a statically-linked build for non-GPU targets.
[features]
default = ["gpu", "s3", "http"]
gpu = ["dep:all-smi"]
s3 = []
http = []
src/
├── main.rs
├── config.rs
├── sample.rs
├── collector/
│ ├── cpu.rs
│ ├── memory.rs
│ ├── network.rs
│ ├── disk.rs
│ └── gpu.rs — #[cfg(feature = "gpu")]
└── reporter/
├── http.rs — #[cfg(feature = "http")]
└── s3.rs — #[cfg(feature = "s3")]
Build variants:
# Full build (default)
cargo build --release
# No GPU — allows static linking (musl target)
cargo build --release --no-default-features --features http,s3
cargo build --release --target x86_64-unknown-linux-musl --no-default-features --features http,s3
# Minimal — metrics only, no reporting
cargo build --release --no-default-features
Pros:
- Truly minimal binary for constrained/embedded/container targets
- Static linking possible when GPU excluded
- Clean separation of optional functionality
Cons:
- `#[cfg(...)]` gates add noise throughout the code
- More complex CI/build matrix (multiple feature combinations to test)
- Premature if targets are homogeneous
Best for: Distributing to heterogeneous environments — e.g., some hosts have GPUs, some don’t; or when a stripped container image is a requirement.
Status
Implement Option B first. This provides the right structure for the spec without over-engineering. The Collector and Reporter traits give clean boundaries for testing and future extension.
Option C’s feature-flag layer can be added on top of B later with minimal refactoring; the module boundaries are already in place.
Implementation order (Option B)
1. `config.rs` — TOML struct + CLI merge (clap + toml)
2. `sample.rs` — data model (serde + serde_json)
3. `collector/cpu.rs`, `memory.rs`, `network.rs`, `disk.rs` — procfs collectors
4. `collector/gpu.rs` — all-smi wrapper
5. `reporter/http.rs` — ureq start/stop/heartbeat
6. `reporter/s3.rs` — AWS Sig V4 + ureq PUT
7. `main.rs` — wire scheduler loop
Benchmarks
Comparison with https://github.com/SpareCores/resource-tracker
Status
The Rust binary collects every field that Python’s SystemTracker emits,
and emits them as either JSON Lines (default) or CSV (--format csv).
The CSV output has parity with Python for all columns (same names, units, and computation formulas). The JSON output is a strict superset – it carries all CSV fields plus additional metrics not available in Python.
CSV Column Mapping
| Column | Python formula | Rust CSV source | Unit | Parity? |
|---|---|---|---|---|
timestamp | time.time() (float) | timestamp_secs (integer) | Unix seconds | approx (see note 1) |
processes | count of all /proc/[0-9]+ entries | cpu.process_count – same /proc count | count | yes |
utime | per-interval delta(user+nice ticks) / ticks_per_sec | cpu.utime_secs – same delta calculation | seconds/interval | yes |
stime | per-interval delta(system ticks) / ticks_per_sec | cpu.stime_secs – same delta calculation | seconds/interval | yes |
cpu_usage | fractional cores (0..N) | cpu.utilization_pct directly (field is already fractional cores) | fractional cores | yes |
memory_free | MemFree from /proc/meminfo | memory.free_mib (MemFree / 1,048,576) | MiB | yes |
memory_used | MemTotal - MemFree - Buffers - (Cached+SReclaimable) | memory.used_mib – same formula | MiB | yes |
memory_buffers | Buffers | memory.buffers_mib | MiB | yes |
memory_cached | Cached + SReclaimable | memory.cached_mib – same formula | MiB | yes |
memory_active | Active | memory.active_mib | MiB | yes |
memory_inactive | Inactive | memory.inactive_mib | MiB | yes |
disk_read_bytes | per-interval delta(sectors_read) x sector_size, all non-partition diskstats entries | sum of rate x interval across all /sys/block whole-disk entries | bytes/interval | approx (see note 2) |
disk_write_bytes | same, write side | same, write side | bytes/interval | approx (see note 2) |
disk_space_total_gb | sum of all non-virtual mount points (incl. snap/loop) | sum of all mounts under /sys/block devices (incl. loop mounts) | GB | approx (see note 3) |
disk_space_used_gb | same, total - free (incl. reserved-for-root blocks) | same formula | GB | approx (see note 3) |
disk_space_free_gb | f_bavail from statvfs | f_bavail from statvfs | GB | approx (see note 3) |
net_recv_bytes | per-interval delta(rx_bytes) across all interfaces | sum of rate x interval across all interfaces | bytes/interval | yes |
net_sent_bytes | same, tx side | same, tx side | bytes/interval | yes |
gpu_usage | fractional GPUs (0..N) | sum gpu[].utilization_pct / 100 | fractional GPUs | yes |
gpu_vram | used VRAM in MiB | sum gpu[].vram_used_bytes / 1,048,576 | MiB | yes |
gpu_utilized | count of GPUs with utilization > 0 | count gpu[].utilization_pct > 0 | count | yes |
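The three CSV GPU columns can be derived from the per-GPU JSON fields as sketched below; the struct mirrors the JSON field names, while the function itself is illustrative:

```rust
struct Gpu {
    utilization_pct: f64,
    vram_used_bytes: u64,
}

// Returns (gpu_usage, gpu_vram, gpu_utilized) per the mapping table:
// fractional GPUs, used VRAM in MiB, and count of GPUs with utilization > 0.
fn gpu_csv_columns(gpus: &[Gpu]) -> (f64, f64, usize) {
    let usage: f64 = gpus.iter().map(|g| g.utilization_pct / 100.0).sum();
    let vram_mib: f64 = gpus
        .iter()
        .map(|g| g.vram_used_bytes as f64 / 1_048_576.0)
        .sum();
    let utilized = gpus.iter().filter(|g| g.utilization_pct > 0.0).count();
    (usage, vram_mib, utilized)
}

fn main() {
    let gpus = [
        Gpu { utilization_pct: 98.0, vram_used_bytes: 2 * 1_048_576 },
        Gpu { utilization_pct: 0.0, vram_used_bytes: 1_048_576 },
    ];
    println!("{:?}", gpu_csv_columns(&gpus));
}
```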
Documented Semantic Differences
Note 1 – Timestamp precision
Python’s timestamp is a float (sub-second resolution). Rust emits an integer
Unix timestamp. When aligning rows for comparison, use a +/-0.5 s tolerance.
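A minimal sketch of that alignment, assuming a slice of Python float timestamps and one Rust integer timestamp (the helper name is illustrative):

```rust
// Find the Python row whose float timestamp is nearest to a Rust integer
// timestamp, within the given tolerance; returns its index, if any.
fn align(py_ts: &[f64], rust_ts: u64, tolerance: f64) -> Option<usize> {
    py_ts
        .iter()
        .enumerate()
        .map(|(i, &t)| (i, (t - rust_ts as f64).abs()))
        .filter(|&(_, d)| d <= tolerance)
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
}

fn main() {
    let py = [1718000000.42, 1718000001.43, 1718000002.44];
    assert_eq!(align(&py, 1718000001, 0.5), Some(1));
    assert_eq!(align(&py, 1718000010, 0.5), None);
    println!("alignment ok");
}
```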
Note 2 – Disk I/O: device set and sector size
Both Python and Rust use /proc/diskstats deltas and iterate all
whole-disk (non-partition) entries. The device sets should match on most
Linux systems.
Python’s device filter (is_partition from resource_tracker.helpers):
# Returns True only for names matching (sd*, nvme*, mmcblk*) partition patterns
# where a parent device exists in /sys/block. Everything else -- including
# loop*, dm-*, zram* -- is treated as a whole-disk device and included.
Rust’s device filter:
// Reads /sys/block/ directory entries into a HashSet and keeps every
// diskstats entry whose name is a direct /sys/block/<name> entry.
// Logically equivalent to Python's filter: partitions like nvme0n1p1
// appear under /sys/block/nvme0n1/ (not top-level) and are excluded.
let block_set: HashSet<String> = fs::read_dir("/sys/block")?
    .filter_map(|entry| entry.ok())
    .map(|entry| entry.file_name().to_string_lossy().into_owned())
    .collect();
let devs = diskstats.filter(|d| block_set.contains(&d.name));
Sector size: both Python and Rust read the actual hardware sector size per
device from /sys/block/<dev>/queue/hw_sector_size, falling back to 512 bytes.
This was implemented in Rust as P-DSK-SECTOR.
Rationale for explicit sector size: on 4K-native drives the logical sector size is 4,096 bytes; using a hard-coded 512 would under-count I/O bytes by 8x. Reading the actual value from sysfs ensures correctness on all drive types.
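A sketch of that lookup, matching the fallback behavior described above (the helper name is illustrative):

```rust
use std::fs;

// Read /sys/block/<dev>/queue/hw_sector_size; fall back to 512 bytes when
// the file is missing or unparsable, as both implementations do.
fn sector_size(dev: &str) -> u64 {
    fs::read_to_string(format!("/sys/block/{dev}/queue/hw_sector_size"))
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(512)
}

fn main() {
    // Nonexistent devices hit the 512-byte fallback.
    println!("sector size: {}", sector_size("nvme0n1"));
}
```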
Note 2a – ZFS volumes
Python’s disk I/O implementation handles ZFS volumes, where disk usage is
reported differently at /sys/block. Rust does not currently account for
this. ZFS support is a planned enhancement (not required for MVP).
Note 3 – Disk space: mount set
Python sums all mount points that psutil.disk_partitions() reports as
non-virtual (including snap squashfs loop mounts). Rust sums all mount points
found in /proc/mounts whose source device matches a /sys/block entry.
On systems with many snap packages, Python includes the squashfs read-only
mounts for each snap. Because /dev/loop* devices appear in /sys/block,
Rust’s mounts_for_device("loopN") will pick these up too. However,
psutil may enumerate mount points that are not under /dev/ (e.g., tmpfs,
overlay, cgroup2) which Rust’s /dev/<device> prefix filter skips. This
can cause small differences in disk_space_total_gb on container hosts or
systems with unusual mount configurations.
To investigate: run mount | grep -v '^/dev' | grep -v ' type tmpfs' to see
which mount points Python may be counting that Rust is not.
Running the comparison
Prerequisites
- uv >= 0.9 (Astral): which uv
- Rust release binary: cargo build --release
Directory layout
benchmarks/
+-- pyproject.toml # uv project -- resource-tracker dependency
+-- run_python.py # SystemTracker -> results/python_metrics.csv
+-- run_rust.sh # resource-tracker --format csv -> results/rust_metrics.csv
+-- compare.py # merge on timestamp, print diff table
+-- results/ # populated at runtime (gitignore this)
+-- python_metrics.csv
+-- rust_metrics.csv
Step 1 – Set up Python environment
cd benchmarks
uv init --no-workspace
uv add resource-tracker
Step 2 – run_python.py
"""Collect SystemTracker metrics for DURATION seconds -> results/python_metrics.csv"""
import time
from resource_tracker import SystemTracker
DURATION = 60
INTERVAL = 1
tracker = SystemTracker(interval=INTERVAL, output_file="results/python_metrics.csv")
time.sleep(DURATION)
tracker.stop()
print("Done -> results/python_metrics.csv")
Step 3 – run_rust.sh
#!/usr/bin/env bash
set -euo pipefail
DURATION=60
INTERVAL=1
mkdir -p results
timeout "$DURATION" \
../target/release/resource-tracker --interval "$INTERVAL" --format csv \
> results/rust_metrics.csv || true
echo "Collected $(( $(wc -l < results/rust_metrics.csv) - 1 )) rows -> results/rust_metrics.csv"
Step 4 – compare.py
Strategy:
- Load both CSVs, parse the timestamp columns.
- Diff Python's cumulative I/O columns with diff() to get rates, matching Rust's per-interval values.
- Merge on nearest timestamp (tolerance +/-0.5 x interval).
- For each shared metric, report: mean, std, min/max for each side plus mean absolute difference (MAD) and % deviation.
"""Compare python_metrics.csv and rust_metrics.csv side by side."""
import csv
from pathlib import Path

IO_COLS = {"disk_read_bytes", "disk_write_bytes", "net_recv_bytes", "net_sent_bytes"}

def load(path):
    rows = list(csv.DictReader(Path(path).open()))
    return [{k: float(v) if v else 0.0 for k, v in row.items()} for row in rows]

def diff_col(rows, col):
    """Replace cumulative totals with per-row deltas (rate proxy)."""
    for i in range(len(rows) - 1, 0, -1):
        rows[i][col] = rows[i][col] - rows[i - 1][col]
    rows[0][col] = 0.0

py = load("results/python_metrics.csv")
rs = load("results/rust_metrics.csv")
for col in IO_COLS:
    if col in (py[0] if py else {}):
        diff_col(py, col)

shared_cols = (set(py[0]) & set(rs[0])) - {"timestamp"} if py and rs else set()
print(f"{'column':<30} {'py_mean':>12} {'rs_mean':>12} {'MAD':>12} {'%dev':>8}")
print("-" * 80)
for col in sorted(shared_cols):
    py_vals = [r[col] for r in py]
    rs_vals = [r[col] for r in rs]
    py_mean = sum(py_vals) / len(py_vals)
    rs_mean = sum(rs_vals) / len(rs_vals)
    # zip() truncates to the shorter run, so divide by that length
    n = min(len(py_vals), len(rs_vals))
    mad = sum(abs(a - b) for a, b in zip(py_vals, rs_vals)) / n
    pct = (mad / py_mean * 100) if py_mean != 0 else float("inf")
    print(f"{col:<30} {py_mean:>12.3f} {rs_mean:>12.3f} {mad:>12.3f} {pct:>7.1f}%")
Results
To be populated after running the benchmark on target hardware.
Fill in: host specs (CPU model, RAM, OS, kernel), Rust git SHA, Python
resource-tracker version, output table from compare.py, and observations on where the two implementations agree and diverge.
Remaining known differences
| Aspect | Python | Rust | Status |
|---|---|---|---|
| Timestamp precision | Float (sub-second) | Integer (Unix seconds) | By design; use +/-0.5 s tolerance when aligning rows |
| Disk I/O sector size | Per-device from /sys/block/<dev>/queue/hw_sector_size, fallback 512 | Per-device from same sysfs path, fallback 512 | Implemented (P-DSK-SECTOR); parity achieved |
| Disk space: non-/dev/ mounts | psutil includes overlay/tmpfs/cgroup mounts if reported non-virtual | Only /dev/<device> prefixed sources in /proc/mounts | Low impact on physical hosts; notable on container/VM hosts |
| ZFS volumes | Handled via psutil disk partition enumeration | Not yet implemented | Planned enhancement |
JSON superset fields (not in Python CSV)
The JSON output carries richer data than any Python CSV column can express.
Rationale: the CSV columns match Python for downstream compatibility. The JSON output is the primary format for new consumers and exposes all available data without being constrained by the Python column set.
| Type | Field | Description | Rationale |
|---|---|---|---|
| cpu | cpu.per_core_pct[] | Per-logical-core utilization (0–100 each) | Identify hot cores and NUMA imbalance; not expressible as a single CSV scalar |
| cpu | cpu.process_cores_used | Fractional cores consumed by tracked PID tree | Covers multi-process workloads (workers, MPI ranks); Python tracks only the root process |
| cpu | cpu.process_child_count | Live descendants under tracked root PID | Detect fork/thread storms without external tooling |
| memory | memory.total_mib | Total installed RAM | Baseline for capacity planning |
| memory | memory.available_mib | MemAvailable: free + reclaimable | Better headroom estimate than free_mib alone on systems with large page caches |
| memory | memory.used_pct | RAM usage as a percentage | Convenient derived field; avoids client-side division |
| memory | memory.active_mib / memory.inactive_mib | Active and inactive page counts | Distinguish working-set pressure from cold cache |
| memory | memory.swap_total_mib / memory.swap_used_mib / memory.swap_used_pct | Swap metrics | Detect swap pressure before OOM; Python omits swap entirely |
| network | network[].interface etc. | Interface name, MAC, driver, operstate, speed, MTU | Identify which NIC is under load and whether the link is at full speed |
| network | network[].rx_bytes_total / tx_bytes_total | Cumulative byte counters | Enables client-side rate computation at any granularity |
| disk | disk[].device_type | nvme, ssd, or hdd | Correlate latency with drive class without parsing device names |
| disk | disk[].capacity_bytes | Raw device capacity | Capacity planning without a separate lsblk call |
| disk | disk[].mounts[] | Per-mount-point space (total/used/available/pct) | Python aggregates all mounts into three scalars; Rust retains per-volume detail |
| disk | disk[].model / vendor / serial | Drive identity | Correlate metrics with physical hardware inventory |
| gpu | gpu[].temperature_celsius | Die temperature | Detect thermal throttling in real time |
| gpu | gpu[].power_watts | Power draw | Power-efficiency analysis; watts-per-FLOP budgeting |
| gpu | gpu[].frequency_mhz | Core clock | Confirm boost clock is active; correlate with thermal state |
| gpu | gpu[].vram_total_bytes | Total VRAM | Baseline for VRAM utilization percentage |
| gpu | gpu[].uuid / name / device_type / host_id | GPU identity | Multi-GPU systems: attribute metrics to specific devices |
resource-tracker – Usage Guide
resource-tracker is a lightweight Linux resource tracker. It polls CPU,
memory, disk, network, and GPU metrics at a configurable interval and emits
each sample as newline-delimited JSON (JSONL) or CSV to stderr or a target file.
Quick start
# Build
cargo build --release
# Run with defaults to track resources used by hashing for 5 seconds
./target/release/resource-tracker -- timeout 5s sha512sum /dev/zero
# Track a specific process tree
./target/release/resource-tracker --pid 1234 --job-name "my-job"
Each line of output is a complete JSON object representing one sample by default:
{
"timestamp_secs": 1718000000,
"job_name": "my-benchmark",
"cpu": { "utilization_pct": 4.6, "per_core_pct": [12.5, 38.0, "..."], "process_cores_used": 3.8, "process_child_count": 4 },
"memory": { "total_mib": 64000, "used_mib": 30468, "used_pct": 47.6, "free_mib": 2289, "available_mib": 18432, "buffers_mib": 263, "cached_mib": 8472, "active_mib": 8157, "inactive_mib": 7404, "swap_total_mib": 0, "swap_used_mib": 0, "swap_used_pct": 0.0 },
"network": [{ "interface": "eth0", "rx_bytes_per_sec": 1200.0, "tx_bytes_per_sec": 400.0, "rx_bytes_total": 9834200, "tx_bytes_total": 312400, "driver": "virtio_net", "operstate": "up", "speed_mbps": 1000, "mtu": 1500, "mac_address": "02:00:00:aa:bb:cc" }],
"disk": [{ "device": "nvme0n1", "model": "Samsung SSD 990 PRO", "device_type": "nvme", "capacity_bytes": 1000204886016, "read_bytes_per_sec": 0.0, "write_bytes_per_sec": 204800.0, "mounts": [{ "mount_point": "/", "filesystem": "ext4", "total_bytes": 999292796928, "used_bytes": 841676800000, "available_bytes": 142023000000, "used_pct": 84.2 }] }],
"gpu": [{ "name": "NVIDIA GeForce RTX 4090", "utilization_pct": 98.0, "vram_used_pct": 72.3, "vram_used_bytes": 17394819072, "vram_total_bytes": 24026849280, "temperature_celsius": 74, "power_watts": 318.5, "frequency_mhz": 2520 }]
}
CLI flags
| Flag | Short | Default | Description |
|---|---|---|---|
| --pid PID | -p | (none) | Root PID of the process tree to attribute CPU usage to. Includes all child processes. |
| --interval SECS | -i | 1 | How often to emit a sample, in seconds. |
| --config FILE | -c | resource-tracker.toml | Path to a TOML config file. Silently ignored if the file does not exist. |
| --format FORMAT | -f | json | Output format: json or csv. |
| --output FILE | -o | stderr | Path to the output file. |
| --quiet | | (off) | Suppress metric output entirely, e.g. when streaming metrics to Sentinel and local output is not needed. |
| --help | -h | | Print help. |
| --version | -V | | Print version. |
Precedence: CLI flags > config file > built-in defaults.
Config file (resource-tracker.toml)
The TOML config file lets you persist settings so you don’t have to repeat CLI flags on every invocation. It is optional – the tool works with no config file at all. Any field set on the CLI overrides the corresponding field in the file.
The default lookup path is resource-tracker.toml in the current working directory.
Use --config /path/to/file.toml to point elsewhere.
Full reference
[job]
# Human-readable label for this tracking session.
# Appears as "job_name" in every emitted JSON sample.
# Useful when multiple runs are collected into the same data store so you can
# filter and group by job.
name = "gpu-benchmark-run-42"
# Root PID of the process to track.
# resource-tracker will walk the full process tree (parent + all descendants)
# and sum their CPU tick usage to report process_cores_used.
# Leave unset to collect system-wide metrics only.
pid = 12345
[tracker]
# Sampling interval in seconds. Lower values give finer resolution at the
# cost of more output volume and slightly higher observer overhead.
# Default: 1
interval_secs = 10
Minimal example – system-wide monitoring
[tracker]
interval_secs = 30
Example – named job with process tracking
[job]
name = "my_job_i_want_to_track"
pid = 98231
[tracker]
interval_secs = 5
Sentinel API streaming and S3 output
When SENTINEL_API_TOKEN is set, the tracker registers the run with the
Sentinel API and streams metric batches to S3 in the background.
No network connections are ever made when the token is absent.
How it works
- At startup, the start_run API endpoint is called to register the run and obtain temporary S3 upload credentials from the Sentinel API.
- A background upload thread wakes every TRACKER_UPLOAD_INTERVAL seconds (default 60), drains the in-memory sample buffer, serializes it as CSV, gzip-compresses it, and PUTs the file to the S3 prefix returned by the API.
- On clean exit (SIGTERM, or the shell-wrapper child exiting), any samples not yet uploaded are base64-encoded and sent inline to finish_run inside a gzip-compressed JSON body. If S3 uploads did occur, only the S3 URIs are sent.
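The drain step of the upload thread can be sketched as follows; serialization, gzip compression, and the S3 PUT are elided, and the names are illustrative:

```rust
use std::mem;

struct SampleBuffer {
    // Serialized CSV rows in the real tool; plain strings here.
    samples: Vec<String>,
}

impl SampleBuffer {
    // Take the whole batch at once, leaving the buffer empty so the
    // collector thread can keep appending without losing samples.
    fn drain(&mut self) -> Vec<String> {
        mem::take(&mut self.samples)
    }
}

fn main() {
    let mut buf = SampleBuffer { samples: vec!["row1".into(), "row2".into()] };
    let batch = buf.drain();
    println!("drained {} rows, {} left", batch.len(), buf.samples.len());
}
```

Swapping the vector out in one move keeps the critical section short, which matters when the buffer is shared behind a mutex between the collector and uploader threads.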
Environment variables
| Variable | Required | Default | Description |
|---|---|---|---|
SENTINEL_API_TOKEN | Yes | – | Bearer token for the Sentinel API. Streaming is disabled when absent or empty. |
SENTINEL_API_URL | No | https://api.sentinel.sparecores.net | Override the Sentinel API base URL. |
TRACKER_UPLOAD_INTERVAL | No | 60 | Seconds between S3 batch uploads. |
Job metadata environment variables
All Section 9.3 metadata fields can be set via environment variable instead of CLI flags. Environment variables are overridden by the corresponding CLI flag when both are supplied.
| Variable | CLI flag |
|---|---|
TRACKER_JOB_NAME | --job-name |
TRACKER_PROJECT_NAME | --project-name |
TRACKER_STAGE_NAME | --stage-name |
TRACKER_TASK_NAME | --task-name |
TRACKER_TEAM | --team |
TRACKER_ENV | --env |
TRACKER_LANGUAGE | --language |
TRACKER_ORCHESTRATOR | --orchestrator |
TRACKER_EXECUTOR | --executor |
TRACKER_EXTERNAL_RUN_ID | --external-run-id |
TRACKER_CONTAINER_IMAGE | --container-image |
Example
export SENTINEL_API_TOKEN="your-token-here"
export TRACKER_JOB_NAME="gpu-benchmark"
export TRACKER_UPLOAD_INTERVAL=30
./resource-tracker --interval 1 -- python train.py
The tracker spawns python train.py, monitors it, uploads a gzip-compressed
CSV batch to S3 every 30 seconds, and calls finish_run when the script exits.
When to use the config file vs CLI flags
| Situation | Recommended approach |
|---|---|
| One-off interactive run | CLI flags – faster, no file to manage |
| Recurring job (cron, SLURM, systemd unit) | TOML file alongside the job definition |
| CI / benchmark pipeline | TOML file checked into the repository |
| Multiple named jobs on the same host | One TOML file per job, point to it with --config |
| Containerized workload | Set config via CLI flags in the CMD / ENTRYPOINT |
Capturing output
Because samples are emitted as newline-delimited JSON to stderr by default, standard Unix tools work directly with the output once stderr is redirected.
# Write to a file
./resource-tracker 2> run.jsonl
# Tail live output
./resource-tracker 2>&1 | tee run.jsonl
# Pretty-print with jq
./resource-tracker 2>&1 | jq .
# Extract only CPU utilization over time
./resource-tracker 2>&1 | jq '{ t: .timestamp_secs, cpu: .cpu.utilization_pct }'
# Watch GPU VRAM usage
./resource-tracker --interval 1 2>&1 | jq '.gpu[] | { name, vram_used_pct }'
Shell-wrapper mode
Pass a command after -- to have the tracker spawn and monitor it:
./resource-tracker --interval 1 --job-name "training-run" -- python train.py --epochs 50
The tracker sets --pid automatically to the spawned child’s PID, emits one
final sample when the child exits, then exits with the child’s exit code.
Rationale: eliminates the two-process boilerplate (tracker & python ...; wait)
and guarantees the tracker always exits with the job’s exit code, making it
transparent to CI systems.
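A minimal sketch of that spawn-and-propagate flow using std::process; the sampling loop and the final sample are elided, and the helper name is illustrative:

```rust
use std::process::Command;

// Spawn the wrapped command, remember its PID for process-tree tracking,
// wait for it, and return its exit code.
fn run_wrapped(cmd: &str, args: &[&str]) -> i32 {
    let mut child = Command::new(cmd).args(args).spawn().expect("spawn failed");
    let _tracked_pid = child.id(); // fed to the same machinery as --pid
    child.wait().expect("wait failed").code().unwrap_or(1)
}

fn main() {
    let code = run_wrapped("sh", &["-c", "exit 7"]);
    println!("child exited with {code}");
    // A real tracker would now call std::process::exit(code).
}
```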
Process tree tracking (--pid)
When --pid is set, every sample includes two extra fields under cpu:
- process_cores_used – fractional cores consumed by the process tree (e.g. 3.8 means the tree is using the equivalent of 3.8 full cores).
- process_child_count – number of live child/descendant processes at the time of sampling (does not include the root PID itself).
If the tracked PID exits during a run, its contribution drops to zero and
process_child_count drops to zero. The tracker itself keeps running.
Rationale: Python’s SystemTracker tracks only the calling process’s own
ticks. Rust walks the full /proc tree so multi-process and multi-threaded
workloads (e.g. PyTorch data-loader workers, MPI ranks, Spark executors) are
attributed correctly under a single root PID.
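The descendant walk can be sketched as a pure function over (pid, ppid) pairs, which the real collector would read from /proc/<pid>/status on each sample; the names here are illustrative:

```rust
use std::collections::HashSet;

// Returns all live descendants of `root`, excluding the root itself,
// matching the process_child_count semantics above. The fixed-point loop
// handles process lists in arbitrary order.
fn descendants(procs: &[(u32, u32)], root: u32) -> HashSet<u32> {
    let mut out = HashSet::new();
    let mut changed = true;
    while changed {
        changed = false;
        for &(pid, ppid) in procs {
            if (ppid == root || out.contains(&ppid)) && out.insert(pid) {
                changed = true;
            }
        }
    }
    out
}

fn main() {
    // root 100 -> children 101 and 102; grandchild 103 under 101
    let procs = [(101, 100), (102, 100), (103, 101), (200, 1)];
    println!("{} descendants", descendants(&procs, 100).len());
}
```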
Finding the PID of a running process:
# By name
pgrep -x python
# Most recently launched
pgrep -n my-training-script
# Already know the command? Launch and capture PID
my-training-script &
./resource-tracker --pid $! --job-name "training-run-1"
GPU support
GPUs are detected automatically at startup via NVML (NVIDIA) and
libamdgpu_top (AMD). No configuration is needed. On hosts without GPU
hardware or without the relevant driver libraries installed, the gpu array
in each sample will be empty – the tracker continues running normally.
Supported accelerators: NVIDIA GPUs (NVML), AMD GPUs (ROCm/AMDGPU).
Rationale: per-GPU temperature, power draw, and clock frequency are not
emitted by Python’s SystemTracker. These fields enable thermal throttle
detection and power-efficiency analysis without a separate monitoring tool.
Metrics reference
cpu
| Field | Unit | Description |
|---|---|---|
utilization_pct | fractional cores | Aggregate cores in use (0.0..N_cores). 4.6 on a 16-core host means ~4.6 vCPUs fully utilized. |
per_core_pct | % each | Per-logical-core utilization array (0.0–100.0). |
utime_secs | seconds | User+nice CPU time across all cores this interval. |
stime_secs | seconds | System CPU time across all cores this interval. |
process_count | count | Runnable processes (procs_running from /proc/stat). |
process_cores_used | fractional cores | Cores consumed by tracked process tree (null if no PID). |
process_child_count | count | Live descendant processes (null if no PID). |
memory
All values in mebibytes (MiB = 1,048,576 bytes).
| Field | Description |
|---|---|
total_mib | Total installed RAM |
free_mib | Truly free RAM (MemFree from /proc/meminfo) |
available_mib | Free + reclaimable RAM (MemAvailable); better estimate of headroom |
used_mib | total - free - buffers - cached (excludes reclaimable cache) |
used_pct | Fraction of total RAM in use |
buffers_mib | Kernel I/O buffer cache |
cached_mib | Page cache including slab-reclaimable (Cached + SReclaimable) |
active_mib | Active pages (recently accessed) |
inactive_mib | Inactive pages (candidates for reclaim) |
swap_total_mib | Total swap space (0 if no swap) |
swap_used_mib | Used swap |
swap_used_pct | Fraction of swap in use |
Rationale: Python’s SystemTracker reports memory in KiB and omits
available_mib, active_mib, inactive_mib, swap_*. Rust reports all
fields in MiB (matching Python resource-tracker PR #9) and adds
available_mib (MemAvailable) which is a more reliable headroom estimate
than free_mib alone on systems with large page caches.
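The used/used_pct derivation can be sketched directly from the table's formula: used = total - free - buffers - cached, where cached already includes SReclaimable. All values are in MiB; the helper names are illustrative:

```rust
fn mem_used_mib(total: f64, free: f64, buffers: f64, cached: f64) -> f64 {
    total - free - buffers - cached
}

fn mem_used_pct(total: f64, used: f64) -> f64 {
    // Guard against a zero total rather than dividing by it.
    if total > 0.0 { used / total * 100.0 } else { 0.0 }
}

fn main() {
    let used = mem_used_mib(64000.0, 2289.0, 263.0, 8472.0);
    println!("used = {used} MiB ({:.1}%)", mem_used_pct(64000.0, used));
}
```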
disk (one entry per whole-disk block device)
| Field | Unit | Description |
|---|---|---|
device | – | Kernel device name, e.g. nvme0n1, sda |
model | – | Drive model string from /sys/block/ |
vendor | – | Vendor string from /sys/block/ |
serial | – | Serial number or WWID |
device_type | – | nvme, ssd, or hdd |
capacity_bytes | bytes | Raw device capacity |
mounts | – | Array of mounted filesystems on this device |
mounts[].mount_point | – | e.g. /, /home |
mounts[].filesystem | – | e.g. ext4, xfs, btrfs |
mounts[].total_bytes | bytes | Filesystem total size |
mounts[].used_bytes | bytes | Space in use |
mounts[].available_bytes | bytes | Space available to non-root users |
mounts[].used_pct | % | Fraction of filesystem in use |
read_bytes_per_sec | bytes/s | Disk read throughput |
write_bytes_per_sec | bytes/s | Disk write throughput |
read_bytes_total | bytes | Cumulative bytes read since boot |
write_bytes_total | bytes | Cumulative bytes written since boot |
Rationale: Python aggregates disk space across all mounts into three scalar CSV columns. Rust retains per-device, per-mount detail in the JSON output, enabling per-volume capacity tracking and per-device I/O attribution that the aggregated CSV cannot express.
network (one entry per non-loopback interface)
| Field | Unit | Description |
|---|---|---|
interface | – | Interface name, e.g. eth0, ens3 |
mac_address | – | Hardware MAC address |
driver | – | Kernel driver name, e.g. igc, virtio_net |
operstate | – | Link state: up, down, unknown |
speed_mbps | Mbps | Negotiated link speed (-1 if not reported) |
mtu | bytes | Maximum transmission unit |
rx_bytes_per_sec | bytes/s | Received throughput |
tx_bytes_per_sec | bytes/s | Transmitted throughput |
rx_bytes_total | bytes | Cumulative bytes received since boot |
tx_bytes_total | bytes | Cumulative bytes sent since boot |
Rationale: Python’s SystemTracker emits only cumulative rx/tx byte
totals per interface. Rust adds per-interval rates, driver identity,
link state, negotiated speed, and MTU, enabling network saturation and
driver-level diagnostics without a separate tool.
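The per-interval rate derivation behind rx/tx_bytes_per_sec can be sketched as rate = (current - previous) / elapsed; the guard against counter resets (e.g. an interface being re-created) is an assumption here, not documented behavior:

```rust
fn rate_bytes_per_sec(prev: u64, curr: u64, elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 || curr < prev {
        0.0 // counter reset or bad interval: report zero rather than garbage
    } else {
        (curr - prev) as f64 / elapsed_secs
    }
}

fn main() {
    println!("{} B/s", rate_bytes_per_sec(9_833_000, 9_834_200, 1.0));
}
```

Because the JSON also carries the cumulative rx_bytes_total/tx_bytes_total, consumers can recompute rates at any granularity with the same formula.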
gpu (one entry per detected accelerator)
| Field | Unit | Description |
|---|---|---|
uuid | – | Vendor-assigned device UUID |
name | – | Device name, e.g. NVIDIA GeForce RTX 4090 |
device_type | – | GPU, NPU, TPU, etc. |
host_id | – | Host-level device identifier (PCIe slot or platform index) |
detail | – | Driver-specific key/value map (PCI IDs, ASIC name, driver version, …) |
utilization_pct | % | Core utilization |
vram_total_bytes | bytes | Total VRAM |
vram_used_bytes | bytes | Used VRAM |
vram_used_pct | % | Fraction of VRAM in use |
temperature_celsius | deg C | Die temperature |
power_watts | W | Power draw |
frequency_mhz | MHz | Core clock |
core_count | count | Shader/compute cores (null if not reported) |