
Introduction

This file contains the initial specification/ideation of the resource-tracker-rs project.

Background

The resource-tracker Python package was created in 2025 to track the resources used by long-running DS/ML/AI jobs in the cloud and to recommend better cloud resource allocations. It started as an experiment and resulted in the following features:

  • Supports Linux, macOS, and Windows. Has no dependencies on Linux; requires psutil on other operating systems.
  • Tracks CPU, memory, NVIDIA GPU and VRAM (even at the process level), disk usage, and network usage at both the system and process level.
  • Monitoring happens at a configurable interval (defaults to 1 second), and collects metrics to local (temp) CSV files.
  • The tracker's overhead is unnoticeable at the default 1-second frequency, but the interval cannot go much lower without significant performance overhead.
  • Computes aggregated statistics on the metrics (e.g. average and peak values).
  • Recommends optimal cloud resource allocations based on the metrics.
  • Recommends best-priced cloud servers for the given workload.
  • Renders a local HTML report with all the metrics and recommendations.
  • Has an R package wrapper for the same functionality.
  • Integrates well with Metaflow.

While it worked well for Python and R, we also wanted a standalone tool that can be used as a CLI wrapper to track any process in any environment, and that can eventually be integrated back into the existing Python and R packages. The overall goal is a lightweight, cross-platform-compiled binary that can

  • Track a wide range of resource utilization metrics locally – including CPU, memory, GPU and VRAM, disk usage, network usage.

  • Optionally stream these metrics to a remote server for centralized analysis, visualization, and further optimization.

    This keeps complex logic out of the binary, which focuses purely on data collection and delivery, so that an accompanying free/commercial service can deliver the centralized visibility, recommendations, automation, and optimization – while keeping most of the ecosystem open source and open to extension with other tools and services.

Data Collection

Discovery Tools

What worked well in the Python implementation was the ability to discover:

  • The most important specs of the host machine, such as CPU core count and memory amount.
  • The cloud environment of the server (when available), such as vendor, region, and instance type.

These limited tools are implemented at

We are sure the hardware discovery could be improved further, and we aim to collect at least the following (all prefixed with host_ in the data ingestion endpoint):

  • host_id (text): Unique identifier of the host machine, such as AWS EC2 instance ID or the server S/N.
  • host_name (text): Hostname of the machine.
  • host_ip (text): IP address of the machine.
  • host_allocation (enum): If the server is dedicated to the monitored process, or shared with other processes.
  • host_vcpus (int): Number of logical virtual CPU cores.
  • host_cpu_model (text): Model of the CPU (e.g. from lscpu output).
  • host_memory_mib (int): Amount of memory in MiB.
  • host_gpu_model (text): Model of the GPU (e.g. from nvidia-smi output).
  • host_gpu_count (int): Number of GPUs.
  • host_gpu_vram_mib (int): Amount of VRAM in MiB.
  • host_storage_gb (float): Amount of storage in GB.

All these fields are optional, and only collected when available. Users should be able to suppress any sensitive fields, such as the host IP address.
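
A minimal sketch of how these optional fields could be modeled in the Rust implementation (the struct, enum, and suppression helper below are illustrative, not a finalized API):

    /// Host discovery result; every field is optional and only
    /// populated when the information is available on the platform.
    #[derive(Debug, Default)]
    pub struct HostInfo {
        pub host_id: Option<String>,
        pub host_name: Option<String>,
        pub host_ip: Option<String>,
        pub host_allocation: Option<HostAllocation>,
        pub host_vcpus: Option<u32>,
        pub host_cpu_model: Option<String>,
        pub host_memory_mib: Option<u64>,
        pub host_gpu_model: Option<String>,
        pub host_gpu_count: Option<u32>,
        pub host_gpu_vram_mib: Option<u64>,
        pub host_storage_gb: Option<f64>,
    }

    /// Whether the server is dedicated to the monitored process or shared.
    #[derive(Debug)]
    pub enum HostAllocation {
        Dedicated,
        Shared,
    }

    impl HostInfo {
        /// Drop fields the user marked as sensitive (e.g. the IP address).
        pub fn suppress(&mut self, fields: &[&str]) {
            if fields.contains(&"host_ip") {
                self.host_ip = None;
            }
            if fields.contains(&"host_name") {
                self.host_name = None;
            }
        }
    }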

Cloud discovery is implemented by probing the metadata server endpoints of the supported cloud providers. We should try to collect the following fields (all using the cloud_ prefix in the data ingestion endpoint):

  • cloud_vendor_id (text): The cloud provider’s id, mapped to the Spare Cores Navigator’s vendor table reference (e.g. aws).
  • cloud_account_id (text): The cloud account id.
  • cloud_region_id (text): The cloud region id, mapped to the Spare Cores Navigator’s region table reference (e.g. us-east-1).
  • cloud_zone_id (text): The cloud zone id, mapped to the Spare Cores Navigator’s zone table reference (e.g. us-east-1a).
  • cloud_instance_type (text): The cloud instance type, mapped to the Spare Cores Navigator’s server table’s api_reference field (e.g. t3a.nano).

Find the Spare Cores Navigator’s vendor, region, zone and server tables at https://github.com/SpareCores/sc-data-dumps/tree/main/data and schemas described at https://dbdocs.io/spare-cores/sc-crawler.
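
To illustrate the probing approach, here is a minimal sketch using the reqwest crate against the AWS instance metadata endpoint (IMDSv1 shown for brevity; a real implementation would also handle the IMDSv2 token flow and the other vendors' endpoints):

    use std::time::Duration;

    /// Probe the AWS instance metadata service for the instance type.
    /// Returns None when not running on AWS or when the endpoint does
    /// not answer within a short timeout.
    fn probe_aws_instance_type() -> Option<String> {
        let client = reqwest::blocking::Client::builder()
            .timeout(Duration::from_millis(500))
            .build()
            .ok()?;
        let resp = client
            .get("http://169.254.169.254/latest/meta-data/instance-type")
            .send()
            .ok()?;
        if resp.status().is_success() {
            resp.text().ok()
        } else {
            None
        }
    }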

Metrics to Track

The data ingestion endpoint is rather liberal, and any arbitrary metric can be tracked. The only restriction is that the submitted data needs to be a CSV file with at least one column named timestamp, which should be a UNIX timestamp in seconds.

All other columns are treated as metrics. We recommend storing machine-wide metrics prefixed with system_ and the process-level metrics prefixed with process_. If distinguishing between machine-wide and process-level metrics is not feasible, metrics can be submitted without any prefix.
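
For illustration, a minimal submission tracking one machine-wide and one process-level metric could look like this (values are made up):

    timestamp,system_cpu_usage,process_memory_mib
    1735689600,1.52,2048
    1735689601,1.47,2051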

Recommended column names for commonly tracked process-level metrics that are taken into consideration in the backend:

  • children: The number of child processes.
  • utime: The total user+nice mode CPU time in seconds.
  • stime: The total system mode CPU time in seconds.
  • cpu_usage: The current CPU usage between 0 and the number of CPUs.
  • memory_mib: Current memory usage in MiB. Preferably PSS (Proportional Set Size) on Linux, falling back to RSS (Resident Set Size).
  • disk_read_bytes: The total number of bytes read from disk.
  • disk_write_bytes: The total number of bytes written to disk.
  • gpu_usage: The current GPU utilization between 0 and the GPU count.
  • gpu_vram_mib: The current GPU memory used in MiB.
  • gpu_utilized: The number of GPUs with utilization > 0.

Recommended column names for commonly tracked machine-wide metrics that are taken into consideration in the backend:

  • processes: The number of running processes.
  • utime: The total user+nice mode CPU time in seconds.
  • stime: The total system mode CPU time in seconds.
  • cpu_usage: The current CPU usage between 0 and the number of CPUs.
  • memory_free_mib: The amount of free memory in MiB.
  • memory_used_mib: The amount of used memory in MiB.
  • memory_buffers_mib: The amount of memory used for buffers in MiB.
  • memory_cached_mib: The amount of memory used for caching in MiB.
  • memory_active_mib: The amount of memory used for active pages in MiB.
  • memory_inactive_mib: The amount of memory used for inactive pages in MiB.
  • disk_read_bytes: The total number of bytes read from all disks.
  • disk_write_bytes: The total number of bytes written to all disks.
  • disk_space_total_gb: The total disk space in GB.
  • disk_space_used_gb: The used disk space in GB.
  • disk_space_free_gb: The free disk space in GB.
  • net_recv_bytes: The total number of bytes received over network.
  • net_sent_bytes: The total number of bytes sent over network.
  • gpu_usage: The current GPU utilization between 0 and the GPU count.
  • gpu_vram_mib: The current GPU memory used in MiB.
  • gpu_utilized: The number of GPUs with utilization > 0.

No other metrics are officially supported by the backend at the moment, but the user can submit any arbitrary values (even strings!) for future use.
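
For the cumulative utime/stime metrics listed above, here is a minimal Linux sketch that reads the values for a single process from /proc/<pid>/stat (assumes the libc crate for the clock-tick rate; error handling is simplified):

    use std::fs;

    /// Cumulative user and system CPU time (in seconds) for a PID,
    /// read from /proc/<pid>/stat.
    fn cpu_times(pid: u32) -> Option<(f64, f64)> {
        let stat = fs::read_to_string(format!("/proc/{pid}/stat")).ok()?;
        // The comm field is wrapped in parentheses and may contain
        // spaces, so split on the last closing parenthesis first.
        let rest = stat.rsplit_once(')')?.1;
        let fields: Vec<&str> = rest.split_whitespace().collect();
        // utime and stime are fields 14 and 15 of the stat line, i.e.
        // indices 11 and 12 after the comm field.
        let utime: f64 = fields.get(11)?.parse().ok()?;
        let stime: f64 = fields.get(12)?.parse().ok()?;
        let ticks = unsafe { libc::sysconf(libc::_SC_CLK_TCK) } as f64;
        Some((utime / ticks, stime / ticks))
    }

Summing these values over the process tree yields the process-level utime/stime columns.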

Wishlist for future metrics:

  • CPU saturation and efficiency metrics:

    • Load average (1m)
    • L1/L2/L3 cache hit rate
    • TLB miss rate
    • Major page faults
    • iowait
    • IPC (Instructions Per Cycle)
    • Context switches
  • GPU saturation and efficiency metrics:

    • PCIe TX and RX throughput, NVLink throughput + theoretical max throughput (e.g. nvidia-smi nvlink -c)
    • Power usage (W)
    • Temperature (C)
  • Disk saturation and efficiency metrics:

    • Disk latency (ms)
    • Disk queue length

Overall, we are looking for metrics that can help identify potential bottlenecks and find better cloud servers for the monitored workload.

Metadata

We also want to support collecting the following metadata about the monitored process:

  • pid (int): The process ID.
  • container_image (text): The container image, including optional tag.
  • command (json): JSON array of the command and its arguments.
  • env (text): The environment (e.g. dev or prod).
  • language (text): The language of the process (e.g. python or r).
  • orchestrator (text): The orchestrator of the process (e.g. metaflow).
  • executor (text): The executor of the process (e.g. k8s).
  • team (text): The team of the process.
  • project_name (text): The project name of the process.
  • job_name (text): The job name of the process (e.g. flow in metaflow, workflow in flyte).
  • stage_name (text): The stage name of the process (e.g. step in metaflow, node in flyte).
  • task_name (text): The task name of the process (e.g. task both in metaflow and flyte).
  • external_run_id (text): The external run id of the process (e.g. Jenkins build number – internal to the orchestrator).

Most of these fields (all of them, in fact, except for command) are to be provided voluntarily and manually by the user (or the job orchestrator) and should be optional. Privacy and security concerns are addressed in the public service's legal docs.

The user should also be able to provide ad-hoc key-value pairs (tags) for tracking purposes.

Status

The data ingestion endpoint automatically captures the start and end time of the process, and calculates the duration in seconds. It also captures user and organization information based on the user’s credentials. Once a job is finished, statistics and recommendations are calculated and stored in a database, made available to the user via a web interface, API, and potentially via the CLI tool as well in the future.

The CLI tool needs to collect the following fields and pass them to the data ingestion endpoint:

  • exit_code (int): The exit code of the process.
  • run_status (enum): The status of the run (e.g. success, failure, etc).

Data Streaming

To authenticate with the data ingestion API endpoint, the Resource Tracker uses a long-lived API token set by the user in the SENTINEL_API_TOKEN environment variable. The token is passed in the Authorization header with the value Bearer <token>.

At the start of the Resource Tracker, hit the data ingestion endpoint to register the start of a Run along with the following optional parameters:

  • metadata (e.g. project_name etc.)
  • server and cloud discovery information (e.g. number of CPUs and/or actual instance type)

The response contains:

  • run_id: Should be stored until the end of the run, as all subsequent API calls must reference it.
  • upload_uri_prefix: An S3 URI prefix to upload the metrics to.
  • upload_credentials: The temporary AWS STS session credentials for the upload authentication, including an expires_at timestamp.
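
A minimal sketch of the registration call (the URL path and the response struct are assumptions for illustration; the authoritative contract is in the API docs linked below):

    use serde::Deserialize;

    /// Illustrative shape of the start-run response.
    #[derive(Deserialize)]
    struct StartRunResponse {
        run_id: String,
        upload_uri_prefix: String,
        upload_credentials: serde_json::Value,
    }

    fn register_run(metadata: &serde_json::Value) -> anyhow::Result<StartRunResponse> {
        let token = std::env::var("SENTINEL_API_TOKEN")?;
        let resp = reqwest::blocking::Client::new()
            // Hypothetical path; see the API docs for the actual route.
            .post("https://api.sentinel.sparecores.net/runs")
            .header("Authorization", format!("Bearer {token}"))
            .json(metadata)
            .send()?
            .error_for_status()?;
        Ok(resp.json()?)
    }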

Then the Resource Tracker should start a background thread (or similar solution) to upload collected metrics in batches (e.g. every 1 minute) as new objects under the upload_uri_prefix as gzipped CSV files. The Resource Tracker should also keep track of the uploaded URIs.
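
Compressing a batch is straightforward with the flate2 crate (the actual S3 PUT using the temporary STS credentials is elided; this sketch only shows the batching format):

    use flate2::{write::GzEncoder, Compression};
    use std::io::Write;

    /// Gzip one batch of CSV rows into the payload of a new object
    /// to be uploaded under upload_uri_prefix.
    fn gzip_batch(csv_rows: &[String]) -> std::io::Result<Vec<u8>> {
        let mut enc = GzEncoder::new(Vec::new(), Compression::default());
        for row in csv_rows {
            enc.write_all(row.as_bytes())?;
            enc.write_all(b"\n")?;
        }
        enc.finish()
    }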

When the temporary upload credentials expire, the Resource Tracker should hit the data ingestion endpoint to refresh the credentials.

When the tracked process finishes, the Resource Tracker should hit the data ingestion endpoint to register the end of the run. This takes

  • The run_id,
  • The status of the run (e.g. success, failure, etc.) along with an optional exit_code as described above,
  • And either the list of uploaded URIs as data_uris with data_source set to s3, or, if no S3 uploads have happened yet (e.g. a short-duration run), the CSV file itself as data_csv with data_source set to local.

The endpoint processes the data synchronously and returns statistics.
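
An illustrative finish-call payload for a run that already uploaded batches to S3 (field names follow the description above; the exact JSON shape is defined by the API docs):

    /// Build the finish-run payload; the run_status mapping is simplified.
    fn finish_payload(run_id: &str, exit_code: i32, uris: &[String]) -> serde_json::Value {
        serde_json::json!({
            "run_id": run_id,
            "run_status": if exit_code == 0 { "success" } else { "failure" },
            "exit_code": exit_code,
            "data_source": "s3",
            "data_uris": uris,
        })
    }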

More Details

Find the data ingestion API endpoints docs at https://api.sentinel.sparecores.net/docs, including the data contracts and API references.

Rationale

resource-tracker is a Rust rewrite of the Python resource-tracker library. It preserves full CSV column parity with the Python implementation while adding new capabilities that are difficult or impossible to express in the original.


Why Rust

| Property | Python resource-tracker | resource-tracker |
|---|---|---|
| Runtime dependency | Python interpreter + psutil | Single static binary |
| Startup overhead | ~200-500 ms | < 5 ms |
| Observer CPU overhead | ~0.5-1% per core | < 0.1% per core |
| Memory footprint | ~30-60 MiB (interpreter) | ~2-4 MiB |
| Deployment | pip / uv install | Copy binary |

The lower observer overhead matters when tracking short-lived or CPU-intensive workloads where the tracker itself would otherwise appear in the numbers it is collecting.


New user-facing functionality

Shell-wrapper mode

./resource-tracker --interval 1 -- python train.py --epochs 50

Pass any command after -- and the tracker spawns it, sets --pid automatically, emits one final sample on exit, and forwards the child’s exit code. This eliminates the two-process boilerplate (tracker & child; wait) and makes the tracker transparent to CI systems and schedulers that check exit codes.
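
A minimal sketch of the wrapper mechanics with std::process (the sampling loop is elided; assumes argv holds everything after -- and is non-empty):

    use std::process::{exit, Command};

    /// Spawn the wrapped command, wait for it, and forward its exit
    /// code so CI systems and schedulers see the child's status.
    fn run_wrapped(argv: &[String]) -> ! {
        let mut child = Command::new(&argv[0])
            .args(&argv[1..])
            .spawn()
            .expect("failed to spawn wrapped command");
        // ... the sampling loop would run here, keyed on child.id() ...
        let status = child.wait().expect("failed to wait on child");
        // Emit one final sample before exiting (elided).
        exit(status.code().unwrap_or(1));
    }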

Full process tree tracking (--pid)

Python’s SystemTracker attributes CPU ticks only to the root process. Rust walks the full /proc tree and sums every descendant (workers, threads, MPI ranks, Spark executors) under the given root PID. Two fields appear in every JSON sample when --pid is active:

  • cpu.process_cores_used – fractional cores consumed by the whole tree
  • cpu.process_child_count – live descendant count at each sample
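
The tree walk could look like this minimal sketch (builds a parent-to-children map from /proc and collects the root's descendants; races with processes exiting mid-walk are ignored):

    use std::collections::HashMap;
    use std::fs;

    /// PIDs of `root` and all of its live descendants.
    fn process_tree(root: u32) -> Vec<u32> {
        let mut children: HashMap<u32, Vec<u32>> = HashMap::new();
        for entry in fs::read_dir("/proc").into_iter().flatten().flatten() {
            let Ok(pid) = entry.file_name().to_string_lossy().parse::<u32>() else {
                continue;
            };
            // The parent PID is the "PPid:" line in /proc/<pid>/status.
            if let Ok(status) = fs::read_to_string(format!("/proc/{pid}/status")) {
                if let Some(line) = status.lines().find(|l| l.starts_with("PPid:")) {
                    if let Ok(ppid) = line[5..].trim().parse::<u32>() {
                        children.entry(ppid).or_default().push(pid);
                    }
                }
            }
        }
        // Breadth-first walk from the root PID.
        let mut tree = vec![root];
        let mut i = 0;
        while i < tree.len() {
            if let Some(kids) = children.get(&tree[i]) {
                tree.extend(kids);
            }
            i += 1;
        }
        tree
    }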

Sentinel API streaming and S3 upload

When SENTINEL_API_TOKEN is set, the tracker registers the run, streams gzip-compressed CSV batches to S3 every TRACKER_UPLOAD_INTERVAL seconds (default 60), and posts a finish_run call on clean exit. No network connections are made when the token is absent.

TOML config file + environment variable overrides

All settings (interval, job name, PID, metadata) can be persisted in a resource-tracker.toml file alongside the job definition. Every field also has a TRACKER_* environment variable override, which is convenient for containerized or CI environments where config files are impractical.
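
A sketch of the layering with serde and the toml crate (the field set and the TRACKER_INTERVAL / TRACKER_JOB_NAME variable names are illustrative; only TRACKER_UPLOAD_INTERVAL is specified elsewhere in this document):

    use serde::Deserialize;

    #[derive(Deserialize, Default)]
    struct Config {
        interval: Option<u64>,
        job_name: Option<String>,
        pid: Option<u32>,
    }

    fn load_config() -> Config {
        // The file is optional; fall back to defaults when absent or invalid.
        let mut cfg: Config = std::fs::read_to_string("resource-tracker.toml")
            .ok()
            .and_then(|s| toml::from_str(&s).ok())
            .unwrap_or_default();
        // Environment variables take precedence over the file.
        if let Ok(v) = std::env::var("TRACKER_INTERVAL") {
            cfg.interval = v.parse().ok();
        }
        if let Ok(v) = std::env::var("TRACKER_JOB_NAME") {
            cfg.job_name = Some(v);
        }
        cfg
    }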


Richer metrics (JSON superset)

The CSV output matches Python column-for-column. The JSON output carries additional fields not expressible as Python CSV scalars.

CPU

  • per_core_pct[] – per-logical-core utilization; identifies hot cores and NUMA imbalance
  • utilization_pct expressed as fractional cores (0.0..N_cores), not a percentage clamped to 100; more useful on multi-core hosts

Memory

  • available_mib (MemAvailable) – free + reclaimable; a more reliable headroom estimate than free_mib on systems with large page caches
  • swap_total_mib, swap_used_mib, swap_used_pct – swap pressure visible before OOM; Python omits swap entirely
  • active_mib / inactive_mib – distinguish working-set pressure from cold cache

Disk

  • Per-device, per-mount detail instead of three aggregated scalars; enables per-volume capacity tracking and per-device I/O attribution
  • device_type (nvme, ssd, hdd), model, vendor, serial – correlate metrics with physical hardware without a separate lsblk call
  • Per-device hardware sector size read from sysfs; correct byte counts on 4K-native drives where a hard-coded 512 would under-count I/O by 8x
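
The sector-size lookup is a one-line sysfs read; a sketch with a conservative fallback (see the note above on why a hard-coded 512 can be wrong):

    use std::fs;

    /// Sector size for a block device (e.g. "nvme0n1") from sysfs,
    /// used when converting per-device sector counts to bytes.
    fn sector_size(device: &str) -> u64 {
        fs::read_to_string(format!("/sys/block/{device}/queue/hw_sector_size"))
            .ok()
            .and_then(|s| s.trim().parse().ok())
            .unwrap_or(512)
    }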

Network

  • Per-interval rates (rx_bytes_per_sec, tx_bytes_per_sec) in addition to cumulative totals; no client-side diff required
  • driver, operstate, speed_mbps, mtu per interface; identify which NIC is under load and whether the link is running at full negotiated speed

GPU (NVIDIA and AMD)

Python emits no GPU metrics at all. Rust supports both NVIDIA (NVML) and AMD (ROCm/AMDGPU) accelerators via runtime dynamic loading, with no build-time driver dependencies. Additional fields beyond utilization and VRAM:

  • temperature_celsius – detect thermal throttling in real time
  • power_watts – power-efficiency analysis; watts-per-FLOP budgeting
  • frequency_mhz – confirm boost clock is active; correlate with thermal state
  • uuid, name, host_id – attribute metrics to specific devices in multi-GPU systems
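
The runtime dynamic loading can be sketched with the libloading crate; libnvidia-ml.so.1 and nvmlInit_v2 are part of the public NVML ABI, and nothing is linked at build time:

    use libloading::{Library, Symbol};

    /// True when an NVIDIA driver is installed and NVML initializes.
    fn nvml_available() -> bool {
        unsafe {
            let lib = match Library::new("libnvidia-ml.so.1") {
                Ok(lib) => lib,
                Err(_) => return false,
            };
            let init: Symbol<unsafe extern "C" fn() -> i32> =
                match lib.get(b"nvmlInit_v2\0") {
                    Ok(sym) => sym,
                    Err(_) => return false,
                };
            init() == 0 // 0 == NVML_SUCCESS
        }
    }

The same pattern would apply to the AMD (ROCm/AMDGPU) side via its SMI library.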

Open-Source Resource Monitoring Landscape

Competitive Analysis for resource-tracker (SpareCores)

Prepared: March 25, 2026
Context: Phase 1 feasibility assessment for a Rust/Linux CLI implementation of ResourceTracker
Reference tool: https://github.com/SpareCores/resource-tracker


Executive Summary

resource-tracker occupies a specific and underserved niche: a lightweight, zero-dependency, batch-job-oriented process + system resource monitor with workflow framework integration (Metaflow), visualization via cards, and cloud server recommendations. The open-source landscape has many partial overlaps but no single tool matches all its characteristics simultaneously.

The tools below are organized into meaningful categories. Most tools are either:

  • Too low-level (profilers that require code instrumentation or produce flame graphs rather than time-series resource logs)
  • Too heavy (system daemons, full observability stacks)
  • Too narrow (single-resource: CPU only, or memory only, or GPU only)
  • Not batch-job oriented (designed for long-running services, not scripts that run and exit)

Category 1: Python Libraries for Process/System Resource Monitoring

These are the closest functional analogues to resource-tracker in the Python ecosystem.


1.1 psutil

  • URL: https://github.com/giampaolo/psutil
  • Language: Python (C extension)
  • Description: The foundational library for cross-platform system/process information in Python. resource-tracker itself uses psutil as an optional backend on non-Linux systems. psutil retrieves CPU, memory, disk, network, and process-level data programmatically but provides no time-series tracking, no decorator/wrapper API, no visualization, and no batch job reporting.
  • Key features: CPU %, memory (RSS/PSS/USS/VMS), per-process I/O, network I/O, disk usage, process tree traversal. Cross-platform (Linux, macOS, Windows).
  • Difference: Raw data API only. No tracking loop, no reports, no workflow integration. It is a building block, not a solution.

1.2 memory_profiler

  • URL: https://github.com/pythonprofilers/memory_profiler
  • Language: Python
  • Description: Line-by-line memory usage profiler for Python scripts. Uses @profile decorator and mprof CLI to record memory usage over time and plot it. Built on psutil.
  • Key features: Line-level memory profiling, time-series memory plot via mprof, @profile decorator, memory_usage() API.
  • Difference: Memory only (no CPU, GPU, disk, network). Requires code instrumentation for line-level profiling. Targeted at developers finding memory leaks, not at batch job operators seeking resource utilization logs.

1.3 Scalene

  • URL: https://github.com/plasma-umass/scalene
  • Language: Python + C++
  • Description: High-performance, high-precision CPU, GPU, and memory profiler for Python. Uniquely profiles CPU time, GPU time, and memory at the line level simultaneously. Includes AI-powered optimization suggestions and an interactive web UI.
  • Key features: Line-level CPU + GPU + memory profiling, separates Python vs native time, web-based interactive report, minimal overhead (~10-20%).
  • Difference: A developer profiler (find bottlenecks in code), not a resource utilization logger for batch jobs. Does not track network or disk I/O, does not integrate with workflow tools, does not produce time-series utilization logs for operational use.

1.4 Memray

  • URL: https://github.com/bloomberg/memray
  • Language: Python + C++
  • Description: Bloomberg’s memory profiler for Python. Tracks every allocation in Python, native extensions, and the interpreter itself. Produces flame graphs, heap charts, and other visualizations.
  • Key features: Full allocation tracking (Python + C/C++), flame graphs, live mode, Jupyter integration, reporter API.
  • Difference: Memory only, developer-oriented (find leaks/hotspots in code). Does not track CPU, GPU, disk, or network. Not designed for batch job monitoring.

1.5 Fil (filprofiler)

  • URL: https://github.com/pythonspeed/filprofiler
  • Language: Python + Rust
  • Description: Memory profiler from pythonspeed targeting data scientists and scientific computing. Finds peak memory usage and identifies what code caused the peak. Produces flame graphs.
  • Key features: Peak memory tracking (captures C and Python allocations), flame graphs, designed for NumPy/Pandas workloads, CLI usage.
  • Difference: Memory only, developer-oriented. No CPU, GPU, disk, network. Produces offline profiling reports, not operational time-series logs.

1.6 pyinstrument

  • URL: https://github.com/joerick/pyinstrument
  • Language: Python
  • Description: Sampling call-stack profiler for Python. Samples the call stack every 1ms and shows a readable summary of where time is spent. Supports context manager and decorator API.
  • Key features: Low-overhead sampling, context manager (with Profiler()), decorator, CLI, HTML/text/JSON output, async support.
  • Difference: CPU time only (call stack), no memory/GPU/disk/network. Developer-oriented (why is code slow?), not a resource utilization monitor.

1.7 py-spy

  • URL: https://github.com/benfred/py-spy
  • Language: Rust
  • Description: Sampling profiler for Python programs written in Rust. Attaches to a running Python process without modifying it. Can generate flame graphs or a top-like display.
  • Key features: Attaches to running process (no code changes), flame graphs, top-like live view, very low overhead, works across OS.
  • Difference: CPU only (call stack). No memory, GPU, disk, or network tracking. Attach-to-process model differs from resource-tracker’s wrap-a-job model.

1.8 Austin

  • URL: https://github.com/P403n1x87/austin
  • Language: C
  • Description: Python frame stack sampler for CPython. Samples the Python interpreter’s memory space directly to retrieve running thread stacks. Extremely low overhead.
  • Key features: Zero-instrumentation, pure C, very low overhead, multi-thread and multi-process support, output compatible with flame graph tools.
  • Difference: CPU/call stack profiling only. No resource utilization metrics (memory, GPU, disk, network).

1.9 Glances

  • URL: https://github.com/nicolargo/glances
  • Language: Python
  • Description: Cross-platform system monitoring tool with a rich curses/web UI. Shows CPU, memory, disk, network, process list, temperatures, GPU (via plugin), Docker containers, and more. Can export data to InfluxDB, CSV, Prometheus, etc.
  • Key features: Real-time monitoring, web UI, REST API, exporters (InfluxDB, Prometheus, CSV, JSON), Docker/container awareness, GPU plugin, cross-platform (Linux, macOS, Windows, BSD).
  • Difference: A long-running system monitor daemon/interactive tool, not designed to wrap a batch job, produce a per-job report, or integrate with workflow frameworks. No job-level summary reports.

1.10 nvitop

  • URL: https://github.com/XuehaiPan/nvitop
  • Language: Python
  • Description: Interactive NVIDIA GPU process viewer with a rich terminal UI. Goes beyond nvidia-smi by showing per-process GPU/VRAM usage in real time, supports programmatic API access.
  • Key features: Per-process GPU utilization and VRAM, process tree, interactive kill/signal, rich terminal UI, Python API (ResourceMetricCollector).
  • Difference: GPU-only (NVIDIA). Covers system + process level GPU metrics well. Its ResourceMetricCollector API is a meaningful overlap with resource-tracker for GPU tracking. No CPU/memory/disk/network integration.

1.11 gpustat

  • URL: https://github.com/wookayin/gpustat
  • Language: Python
  • Description: Simple command-line utility for querying and monitoring NVIDIA GPU status. Aggregates nvidia-smi output with color-coded display. Supports --watch mode.
  • Key features: GPU utilization, VRAM usage, temperature, power draw, per-process GPU use, JSON output, watch mode.
  • Difference: NVIDIA GPU only, read-only display tool, no time-series logging, no CPU/memory/disk/network.

1.12 pynvml / nvidia-ml-py

  • URL: https://github.com/gpuopenanalytics/pynvml
  • Language: Python (NVML binding)
  • Description: Python bindings for NVIDIA’s NVML C library, enabling programmatic GPU diagnostics. Used as a building block by gpustat, nvitop, and resource-tracker itself.
  • Key features: Full NVML API access: GPU utilization, VRAM, temperature, power, clock speed, process-level GPU usage, fan speed.
  • Difference: Raw API, no tracking loop, no reporting. A building block.

1.13 CodeCarbon

  • URL: https://github.com/mlco2/codecarbon
  • Language: Python
  • Description: Tracks CPU, GPU, and RAM energy consumption and converts it to estimated CO2 emissions. Designed for ML training runs. Provides decorator and context manager APIs.
  • Key features: @track_emissions decorator, context manager, estimates CO2 equivalent, per-run reporting, dashboard, supports Intel RAPL and NVML.
  • Difference: Focused on energy/carbon footprint rather than raw resource utilization metrics. Does not track disk I/O or network. Closest in UX philosophy (decorator for batch scripts) but different output goal.

1.14 CarbonTracker

  • URL: https://github.com/lfwa/carbontracker
  • Language: Python
  • Description: Tracks and predicts energy consumption and carbon footprint of deep learning model training. Can stop training when predicted impact exceeds a threshold.
  • Key features: Predictive carbon footprint, supports GPU and CPU energy, training-run oriented, can send alerts.
  • Difference: Energy/carbon focused, ML training specific, no disk/network tracking.

1.15 pyRAPL

  • URL: https://github.com/powerapi-ng/pyRAPL
  • Language: Python
  • Description: Measures energy consumption of Python code using Intel RAPL (Running Average Power Limit) hardware counters. Provides decorator and context manager APIs.
  • Key features: CPU socket, DRAM, and integrated GPU energy measurement, decorator and with block APIs, per-domain granularity.
  • Difference: Intel RAPL only (Intel CPUs since Sandy Bridge), energy not utilization percentage, no GPU computation metrics, no disk/network.

1.16 pyJoules

  • URL: https://github.com/powerapi-ng/pyJoules
  • Language: Python
  • Description: Captures energy consumption of code snippets using Intel RAPL and NVIDIA NVML. Provides decorator and context manager APIs with breakpoints.
  • Key features: Multi-device energy capture (CPU, DRAM, NVIDIA GPU), decorator API, MongoDB and Pandas export handlers.
  • Difference: Energy measurement, not utilization tracking. Requires Intel RAPL-capable hardware.

1.17 PowerAPI

  • URL: https://github.com/powerapi-ng/powerapi
  • Language: Python
  • Description: Middleware framework for building software-defined power meters. Estimates power at process, container, VM, or application level. Can use hardware counters or performance counters.
  • Key features: Pluggable sensors and estimators, multiple granularity levels (process, container, VM), real-time power estimation.
  • Difference: Power/energy framework requiring configuration and sensor setup. Not a drop-in decorator for batch jobs.

1.18 eco2AI

  • URL: https://github.com/sb-ai-lab/eco2AI
  • Language: Python
  • Description: Tracks carbon emissions while training/inferring Python ML models. Accounts for CPU, GPU, and RAM energy consumption.
  • Key features: @track_emissions decorator, real-time emission monitoring, CSV reporting.
  • Difference: Carbon/energy focus, similar decorator pattern to resource-tracker, no disk/network.

1.19 pyperf

  • URL: https://github.com/psf/pyperf
  • Language: Python
  • Description: Python Software Foundation toolkit for writing and running benchmarks. Includes memory tracking (--track-memory, --tracemalloc) as part of benchmark metadata collection.
  • Key features: Benchmark calibration, worker process management, memory peak tracking, JSON results, statistical analysis.
  • Difference: Benchmarking framework, not a general resource monitor. Memory tracking is incidental to benchmarking.

1.20 ClearML

  • URL: https://github.com/clearml/clearml
  • Language: Python
  • Description: Open-source MLOps platform. Automatically tracks GPU, CPU, memory, and network metrics during ML experiment runs. Provides an experiment tracker, data manager, orchestrator, and more.
  • Key features: Automatic system metric logging (GPU, CPU, memory, network), experiment tracking, model registry, pipeline orchestration, web UI.
  • Difference: Full MLOps platform (not a lightweight library). Requires a ClearML server. Targets ML experiments rather than general batch jobs.

1.21 python-resmon

  • URL: https://github.com/xybu/python-resmon
  • Language: Python
  • Description: Lightweight resource monitor that records CPU usage, RAM usage, disk I/O, and NIC speed, outputting data in CSV format for post-processing.
  • Key features: CSV output, configurable polling interval, system-level metrics, easy post-processing.
  • Difference: System-level only (no per-process tracking), no GPU, no visualization, no workflow integration. Small utility script rather than a library.

Category 2: Interactive Terminal Monitors (System-Level)

These tools provide real-time visual monitoring of system resources. They do not produce per-job reports or integrate with batch workflows, but they are widely used for manual resource observation.


2.1 htop

  • URL: https://github.com/htop-dev/htop
  • Language: C
  • Description: Interactive process viewer and system monitor. The modern replacement for top. Shows per-CPU usage, memory, swap, and a process list with tree view.
  • Key features: Interactive (kill, renice, filter), color-coded per-CPU bars, tree view, mouse support, cross-platform.
  • Difference: Interactive visual tool only. No data capture, no time-series, no batch job integration.

2.2 btop / btop++

  • URL: https://github.com/aristocratos/btop
  • Language: C++
  • Description: Advanced terminal resource monitor. Third generation of bashtop->bpytop->btop++. Shows CPU, memory, disk I/O, network, and process list with rich ASCII art graphs.
  • Key features: Responsive UI, mouse support, GPU support (Nvidia/AMD/Intel via plugins), disk I/O, network I/O, process filtering, themes.
  • Difference: Interactive visual tool only. No data export, no batch job tracking.

2.3 bpytop

  • URL: https://github.com/aristocratos/bpytop
  • Language: Python
  • Description: Python predecessor to btop++. Linux/macOS/FreeBSD resource monitor with animated ASCII graphs.
  • Key features: CPU, memory, disk, network, process list, ASCII graphs.
  • Difference: Interactive visual tool. Superseded by btop++.

2.4 bashtop

  • URL: https://github.com/aristocratos/bashtop
  • Language: Bash
  • Description: Original Bash-based resource monitor from the same developer. Ancestor of bpytop and btop++.
  • Key features: CPU, memory, disk, network, process monitoring in pure Bash.
  • Difference: Superseded by btop++. Interactive visual only.

2.5 glances (see 1.9 above)

  • Interactive + exportable, see Category 1 entry.

2.6 atop

  • URL: https://github.com/Atoptool/atop
  • Language: C
  • Description: Advanced interactive system and process monitor for Linux. Records all system activity and writes to binary log files for later replay/analysis. Integrates with atopsar for historical reporting.
  • Key features: Full system activity logging (CPU, memory, disk, network, process), persistent binary logs, replay mode, atopsar for reporting.
  • Difference: Long-running daemon for system-wide logging. Not designed to wrap a specific job; tracks the whole system. Closest among CLI tools to providing historical per-process data.

2.7 nmon (Nigel’s Monitor)

  • URL: http://nmon.sourceforge.net/
  • Language: C
  • Description: Performance monitoring tool for AIX and Linux. Provides real-time view and can capture data to CSV for later analysis with nmon Analyser.
  • Key features: CPU, memory, disk I/O, network, filesystem, processes; CSV capture mode, lightweight.
  • Difference: System-wide monitor. No batch job integration or workflow decorator. The CSV output mode is useful for offline analysis.

2.8 collectl

  • URL: http://collectl.sourceforge.net/
  • Language: Perl
  • Description: Collects a broad set of Linux system statistics (CPU, memory, network, disk, inodes, processes, NFS, TCP, sockets) and can write to files, print to stdout, or feed to Graphite/ganglia.
  • Key features: Wide metric coverage, multiple output formats (CSV, plot, etc.), daemon or one-shot mode.
  • Difference: System-wide collection daemon. No batch job wrapping, no workflow integration.

2.9 sysstat (sar/sadc/sadf/iostat/pidstat/mpstat)

  • URL: https://github.com/sysstat/sysstat
  • Language: C
  • Description: Collection of Linux performance monitoring utilities. sar collects and reports system activity historically. pidstat reports per-process CPU, memory, and I/O. iostat reports disk I/O. sadc is the backend data collector.
  • Key features: Historical data collection, per-process stats via pidstat, JSON/CSV/XML output via sadf, schedulable via cron/systemd, very low overhead.
  • Difference: System and process monitoring utilities, not designed for batch job wrapping. pidstat is the closest to per-job process monitoring but requires manual invocation.

2.10 nvtop

  • URL: https://github.com/Syllo/nvtop
  • Language: C
  • Description: (h)top-like task monitor for GPUs and accelerators. Supports AMD, Apple M1/M2 (limited), Huawei Ascend, Intel, NVIDIA, Qualcomm, Broadcom, Rockchip.
  • Key features: Multi-GPU and multi-vendor support, real-time GPU/VRAM utilization, per-process GPU use, interactive UI.
  • Difference: GPU-focused interactive monitor. No data export, no CPU/memory/disk/network integration.

2.11 vtop

  • URL: https://github.com/MrRio/vtop
  • Language: JavaScript (Node.js)
  • Description: Graphical terminal activity monitor with Unicode braille charts. Groups processes sharing the same name (e.g., NGINX master + workers).
  • Key features: ASCII charts, process grouping, extensible via plugins.
  • Difference: Interactive visual only, no data capture. Note: project appears unmaintained.

2.12 Netdata

  • URL: https://github.com/netdata/netdata
  • Language: C (agent core)
  • Description: Real-time performance monitoring with per-second metrics and a powerful web UI. 800+ integrations. Most-starred monitoring project on GitHub (76k+ stars).
  • Key features: Per-second metrics, web dashboard, alerts, ML anomaly detection, 800+ integrations (Docker, Kubernetes, StatsD, OpenMetrics), process-level metrics, GPU plugins.
  • Difference: Full-stack observability daemon. Requires installation as a service. Not designed for wrapping a batch job.

Category 3: eBPF / Kernel-Level Tracing Tools

These tools use Linux eBPF (extended Berkeley Packet Filter) for highly efficient, zero-instrumentation tracing deep in the kernel. Most relevant for system-level visibility with very low overhead.


3.1 BCC (BPF Compiler Collection)

  • URL: https://github.com/iovisor/bcc
  • Language: C + Python/Lua frontends
  • Description: Toolkit for creating efficient kernel tracing and manipulation programs using eBPF. Includes ready-made tools (execsnoop, biolatency, tcplife, memleak, etc.) and a framework for writing custom eBPF programs with Python frontends.
  • Key features: Kernel + userspace tracing, network/disk/memory/CPU tools, Python API for custom programs, very low overhead.
  • Difference: Requires kernel support (Linux 4.1+), root privileges, and knowledge of eBPF to build custom tools. Not a drop-in batch job monitor.

3.2 bpftrace

  • URL: https://github.com/bpftrace/bpftrace
  • Language: C++ (awk/DTrace-like scripting language)
  • Description: High-level tracing language for Linux eBPF. Write concise one-liners or short scripts for ad-hoc analysis.
  • Key features: High-level scripting, LLVM backend, supports tracepoints, kprobes, uprobes, usdt. One-liner analysis.
  • Difference: Ad-hoc kernel tracing tool. Requires root and kernel support. Not designed for operational batch job monitoring.

3.3 Parca / Parca Agent

  • URL: https://github.com/parca-dev/parca
  • Language: Go
  • Description: Continuous profiling for CPU and memory usage, down to the line number and throughout time. Parca Agent is an eBPF-based always-on profiler with Kubernetes auto-discovery. Uses pprof format.
  • Key features: Zero-instrumentation eBPF profiling, <1% overhead, continuous collection, icicle graph UI, SQL-queryable profile storage, multi-language support.
  • Difference: Continuous profiling infrastructure (runs as a DaemonSet on Kubernetes nodes). Not a per-job wrapper. Heavy infrastructure requirement.

3.4 Pyroscope (Grafana)

  • URL: https://github.com/grafana/pyroscope
  • Language: Go
  • Description: Continuous profiling database and platform (formed from merger of Phlare + Pyroscope). Stores profiling data from applications instrumented with Pyroscope SDKs or from eBPF agents. Integrates with Grafana.
  • Key features: SDK-based push profiling (Python, Go, Java, Ruby, .NET, Rust, PHP, Node.js), eBPF pull mode, flame graphs, Grafana integration, scalable storage.
  • Difference: Continuous profiling infrastructure. Requires a server and SDK integration. Not a lightweight batch job wrapper.

Category 4: Linux Performance Profiling Tools (C/C++/Native)

These tools profile native code at a low level. Most are developer-focused profilers rather than operational monitors.


4.1 perf (Linux perf_events)

  • URL: https://perfwiki.github.io/main/
  • Language: C (Linux kernel subsystem)
  • Description: The primary Linux performance tool. Samples CPU events using hardware performance counters, traces system calls, and instruments kernel/userspace functions. Foundation for many other tools.
  • Key features: Hardware counter sampling, call graph recording, per-process and system-wide, flame graph generation (via FlameGraph scripts), supports all architectures.
  • Difference: Low-level developer profiler. Requires root for many features. No time-series resource logging, no workflow integration.

4.2 FlameGraph

  • URL: https://github.com/brendangregg/FlameGraph
  • Language: Perl
  • Description: Stack trace visualization toolkit by Brendan Gregg. Generates SVG flame graphs from perf, DTrace, SystemTap, and other profiler output.
  • Key features: CPU, memory, and off-CPU flame graphs, works with many backends.
  • Difference: Visualization tool for profiler output, not a monitoring tool itself.

4.3 gperftools (Google Performance Tools)

  • URL: https://github.com/gperftools/gperftools
  • Language: C++
  • Description: Collection from Google: fast malloc (TCMalloc), CPU profiler, heap profiler, and heap checker. Used via LD_PRELOAD or explicit linking.
  • Key features: CPU profiling (sampling), heap profiling, heap leak detection, pprof visualization, multi-threaded support.
  • Difference: Developer profiler requiring code linking or LD_PRELOAD. No time-series operational monitoring, no disk/network/GPU.

4.4 Valgrind / Massif / Callgrind

  • URL: https://valgrind.org/
  • Language: C
  • Description: Instrumentation framework for building dynamic analysis tools. Massif is its heap profiler; Callgrind is its call graph profiler; Memcheck is its memory error detector.
  • Key features: Complete heap tracking, memory leak detection, call graph analysis, massif-visualizer GUI.
  • Difference: High-overhead instrumentation (10-50x slowdown). Developer tool, not operational monitor. No GPU, disk, or network metrics.

4.5 Heaptrack

  • URL: https://github.com/KDE/heaptrack
  • Language: C++ + Python
  • Description: Fast heap memory profiler for Linux, designed as a faster, lower-overhead alternative to Valgrind/Massif. Traces all allocations and annotates with stack traces.
  • Key features: Lower overhead than Valgrind, flame graph output, heaptrack_gui for visualization, finds memory leaks and allocation hotspots.
  • Difference: Memory only, developer profiler. No GPU, CPU utilization, disk, or network.

4.6 Perfetto

  • URL: https://github.com/google/perfetto
  • Language: C++
  • Description: Google’s open-source production-grade system profiling and tracing tool. Default tracing system for Android and used in Chromium. Can capture CPU scheduling, memory, I/O, GPU events, and custom trace points.
  • Key features: Multi-process system trace, SQL-based analysis, browser-based UI, heap profiling (heapprofd), CPU frequency and scheduling, Android + Linux support.
  • Difference: Complex tracing infrastructure primarily targeting Android/embedded and browser use cases. Not a lightweight batch job wrapper.

4.7 async-profiler

  • URL: https://github.com/async-profiler/async-profiler
  • Language: C (JVM agent)
  • Description: Low-overhead sampling CPU and heap profiler for JVM (Java/Kotlin/Scala/Clojure). Uses AsyncGetCallTrace + perf_events to avoid safepoint bias.
  • Key features: CPU + heap sampling, flame graphs, JFR files, tracks native + JVM code, suitable for production.
  • Difference: JVM-specific. No Python/R/general process monitoring. No disk, network, or GPU.

4.8 TAU (Tuning and Analysis Utilities)

  • URL: https://www.cs.uoregon.edu/research/tau/home.php
  • Language: C++ (with Python, Fortran, Java support)
  • Description: Comprehensive profiling and tracing toolkit for HPC parallel programs (MPI, OpenMP, CUDA). Supports hardware counters, GPU profiling, and generates call graphs.
  • Key features: Parallel program profiling (MPI, OpenMP), hardware counters, GPU support, ParaProf visualization, call graph.
  • Difference: HPC research tool for parallel program performance analysis. Complex setup, not a lightweight batch job wrapper.

4.9 HPCToolkit

  • URL: https://hpctoolkit.org/
  • Language: C/C++
  • Description: Sampling-based measurement and analysis suite for HPC programs on CPUs and GPUs. Supports supercomputers.
  • Key features: 1-5% overhead sampling, full calling context, hpcviewer GUI, GPU support.
  • Difference: HPC research tool, complex setup, not designed for general batch jobs or Python/R scripts.

Category 5: Rust Tools


5.1 below (Facebook/Meta)

  • URL: https://github.com/facebookincubator/below
  • Language: Rust
  • Description: Time-traveling resource monitor for modern Linux systems. Records system activity to disk and allows replay of historical data. Cgroup-aware with PSI (Pressure Stall Information) support.
  • Key features: Record + replay mode, cgroup hierarchy view, PSI metrics, process-level stats, live mode, persistent storage. Built on cgroupv2.
  • Difference: System-wide monitoring daemon. Designed for Linux infrastructure monitoring, not for wrapping individual batch jobs. No workflow integration. Very strong on cgroup/container awareness.

5.2 samply

  • URL: https://github.com/mstange/samply
  • Language: Rust
  • Description: Command-line sampling CPU profiler for macOS, Linux, and Windows. Uses Linux perf events. Spawns the target process as a subprocess and profiles it, then opens Firefox Profiler UI.
  • Key features: Subprocess wrapping (samply record ./your_program), Firefox Profiler UI, local symbol resolution, flame graphs.
  • Difference: CPU profiling only (call stack). No memory, GPU, disk, or network tracking. Developer profiler.

5.3 Bytehound

  • URL: https://github.com/koute/bytehound
  • Language: Rust
  • Description: Memory profiler for Linux. Intercepts all heap allocations via LD_PRELOAD. Produces detailed allocation timelines with stack traces.
  • Key features: Full allocation tracking, web-based GUI, Rhai scripting for analysis, multi-architecture (AMD64, ARM, AArch64, MIPS64).
  • Difference: Memory only. Developer profiler. Requires LD_PRELOAD, no GPU/disk/network.

5.4 pprof-rs

  • URL: https://github.com/tikv/pprof-rs
  • Language: Rust
  • Description: Rust CPU profiler using backtrace-rs. Generates pprof-compatible output.
  • Key features: CPU profiling for Rust applications, pprof output, flame graphs, low overhead.
  • Difference: CPU profiler for Rust programs only.

Category 6: System-Level Daemons and Metrics Collection Infrastructure

These tools are designed for long-running infrastructure monitoring, not individual batch jobs, but represent the broader ecosystem.


6.1 Prometheus + node_exporter

  • URL: https://github.com/prometheus/node_exporter
  • Language: Go
  • Description: Prometheus exporter for hardware and OS metrics from /proc and /sys. Exposes CPU, memory, disk, network, filesystem, and more as Prometheus metrics.
  • Key features: Pull-based metrics, scrape-able endpoint, very broad metric coverage, alerting via Prometheus + Alertmanager.
  • Difference: Infrastructure monitoring daemon. Requires Prometheus server. No per-job tracking.

6.2 Prometheus Pushgateway

  • URL: https://github.com/prometheus/pushgateway
  • Language: Go
  • Description: Push acceptor for ephemeral and batch jobs. Allows short-lived jobs to push metrics to Prometheus (which normally pulls). Stores last-received metrics until explicitly deleted.
  • Key features: HTTP push endpoint, labels/grouping by job, integrates with Prometheus.
  • Difference: Infrastructure component. Not a resource tracker itself; requires a separate process to collect and push metrics. Most relevant for a Rust implementation that needs to output to Prometheus.

6.3 Prometheus process-exporter

  • URL: https://github.com/ncabatoff/process-exporter
  • Language: Go
  • Description: Prometheus exporter that reads /proc to report on selected processes. Groups processes by name or regex and exposes CPU, memory, file descriptors, I/O, and thread counts.
  • Key features: Per-process-group CPU and memory metrics, /proc-based, configurable process selection, Prometheus compatible.
  • Difference: Infrastructure daemon, not a batch job wrapper. Monitors selected processes continuously.

6.4 cAdvisor (Container Advisor)

  • URL: https://github.com/google/cadvisor
  • Language: Go
  • Description: Google’s container resource usage and performance analysis agent. Exposes Prometheus metrics for running containers.
  • Key features: Container-level CPU, memory, disk, and network metrics, Prometheus endpoint, supports Docker and other runtimes.
  • Difference: Container/cgroup focused daemon. Not for general process monitoring.

6.5 Telegraf

  • URL: https://github.com/influxdata/telegraf
  • Language: Go
  • Description: Plugin-driven metrics collection agent from InfluxData. Single agent collecting system metrics (CPU, memory, disk, network, GPU, containers) and writing to InfluxDB or other backends.
  • Key features: 300+ input plugins (system, Docker, SNMP, statsd, etc.), multiple output backends, flexible configuration.
  • Difference: Infrastructure agent daemon. Not designed for per-job wrapping.

6.6 Netdata (see 2.12)


6.7 kube-state-metrics

  • URL: https://github.com/kubernetes/kube-state-metrics
  • Language: Go
  • Description: Kubernetes add-on that generates metrics about Kubernetes object state (pod resource requests/limits, deployment status, etc.) for Prometheus.
  • Key features: Pod/node resource quota metrics, deployment health, Prometheus format.
  • Difference: Kubernetes-only, no process-level metrics.

6.8 OpenTelemetry (OTel)

  • URL: https://opentelemetry.io/ / https://github.com/open-telemetry/opentelemetry-python
  • Language: Multi-language (Go, Python, Java, .NET, etc.)
  • Description: CNCF standard for collecting traces, metrics, and logs. Includes system metrics via the OTel Collector. Growing support for profiling via OTel.
  • Key features: Traces + metrics + logs, vendor-neutral, collector, SDKs in all major languages, exporters to Prometheus, Jaeger, OTLP.
  • Difference: General observability framework, not a resource tracker per se. Relevant for instrumenting a Rust CLI to expose metrics in a standard format.

6.9 NVIDIA DCGM + dcgm-exporter

  • URL: https://github.com/NVIDIA/DCGM / https://github.com/NVIDIA/dcgm-exporter
  • Language: C (DCGM) + Go (exporter)
  • Description: NVIDIA Data Center GPU Manager for GPU telemetry in large Linux clusters. dcgm-exporter exposes GPU metrics for Prometheus.
  • Key features: Per-GPU and per-process GPU metrics, health monitoring, diagnostics, Kubernetes integration, Prometheus exporter.
  • Difference: NVIDIA GPU infrastructure daemon for data center clusters. Not a batch job wrapper.

Category 7: Per-Process Network and Disk I/O Monitors


7.1 nethogs

  • URL: https://github.com/raboof/nethogs
  • Language: C++
  • Description: Linux “net top” tool that groups network bandwidth by process using /proc/net/tcp and libpcap.
  • Key features: Per-process network bandwidth (upload/download), real-time top-like display.
  • Difference: Network only, interactive display, no data capture to file.

7.2 iftop

  • URL: https://www.ex-parrot.com/pdw/iftop/
  • Language: C
  • Description: Shows network bandwidth grouped by source/destination host pairs. Does not show per-process breakdown.
  • Key features: Per-connection bandwidth, host name resolution.
  • Difference: Network only, host-pair level (not process level).

7.3 iotop

  • URL: https://github.com/Tomas-M/iotop
  • Language: C (rewrite of original Python version)
  • Description: Top-like tool for disk I/O. Shows per-process disk read/write rates using kernel I/O accounting.
  • Key features: Per-process disk I/O, real-time display, accumulated I/O counters.
  • Difference: Disk I/O only, interactive display, no data capture.

7.4 dstat

  • URL: https://github.com/dagwieers/dstat
  • Language: Python
  • Description: Versatile system statistics tool combining vmstat, iostat, netstat, and ifstat. Outputs columns of metrics to terminal, can write to CSV.
  • Key features: CPU, disk, network, memory, system statistics; CSV output; pluggable.
  • Difference: System-wide only (not per-process), no GPU. CSV output mode is useful for offline analysis.

Category 8: ML Experiment Tracking Platforms with Resource Monitoring

These platforms include resource metric tracking as one feature among many.


8.1 Weights & Biases (W&B)

  • URL: https://github.com/wandb/wandb
  • Language: Python
  • Description: ML experiment tracking platform with automatic system metric logging. Tracks GPU, CPU, memory, and network during training runs.
  • Key features: Automatic system metric logging (GPU, CPU, RAM, network), experiment tracking, model registry, artifacts, collaborative dashboards.
  • Difference: Primarily an ML experiment tracker. Resource monitoring is automatic and integrated but secondary to experiment logging. Requires W&B account (cloud-first, has open-source local server option).

8.2 MLflow

  • URL: https://github.com/mlflow/mlflow
  • Language: Python
  • Description: Open-source ML lifecycle management. Does not natively log CPU/GPU metrics; requires external integration.
  • Key features: Experiment tracking, model registry, deployment. No built-in system resource monitoring.
  • Difference: No native resource tracking.

8.3 ClearML (see 1.20)


Category 9: HPC Batch Job Monitoring


9.1 Jobstats

  • URL: https://github.com/PrincetonUniversity/jobstats
  • Language: Python + Prometheus stack
  • Description: Slurm-compatible job monitoring platform for CPU and GPU clusters. Displays per-job CPU and GPU efficiency summaries using Prometheus, Grafana, and Slurm Prolog/Epilog hooks.
  • Key features: Per-Slurm-job efficiency report (CPU utilization, memory, GPU utilization), compares requested vs. used resources, automatically stores data in Slurm AdminComment field.
  • Difference: Slurm HPC specific. Requires full Prometheus + Grafana + Slurm infrastructure. Very close in concept to resource-tracker (per-job resource reports) but for HPC/Slurm, not general Python/R scripts.

9.2 Open XDMoD

  • URL: https://open.xdmod.org/
  • Language: PHP + Python
  • Description: Open-source tool for analyzing HPC center usage and job efficiency. Tracks CPU, memory, GPU, and I/O for Slurm/PBS/SGE jobs.
  • Key features: Job-level resource utilization reports, efficiency recommendations, web portal.
  • Difference: HPC management tool. Requires full HPC stack. Not for general batch jobs.

Category 10: R Language Profiling Tools

Resource-tracker explicitly supports R scripts. These are the closest R-ecosystem analogues.


10.1 profvis

  • URL: https://github.com/rstudio/profvis
  • Language: R
  • Description: Interactive visualization of R code profiling data. Uses Rprof() to collect call stack samples and displays an interactive flame graph and memory timeline in a web browser.
  • Key features: Interactive flame graph, memory timeline, line-level time attribution, RStudio integration.
  • Difference: CPU + memory profiling for R code, developer-oriented. No disk, network, or GPU. No batch job wrapping or time-series operational logging.

10.2 bench

  • URL: https://github.com/r-lib/bench
  • Language: R
  • Description: High-precision benchmarking for R with memory tracking.
  • Key features: High-resolution timing, memory allocation tracking, comparison of multiple expressions.
  • Difference: Benchmarking tool. No operational resource monitoring.

10.3 microbenchmark

  • URL: https://github.com/joshuaulrich/microbenchmark
  • Language: R
  • Description: R package for sub-millisecond timing benchmarks.
  • Key features: High-precision CPU timing.
  • Difference: CPU timing only, micro-benchmarking specific.

10.4 profmem

  • URL: https://github.com/HenrikBengtsson/profmem
  • Language: R
  • Description: Simple memory profiling for R expressions. Uses tracemem/R internals to log all memory allocations.
  • Key features: Per-expression memory allocation log.
  • Difference: Memory only, developer-oriented.

Category 11: Python Standard Library / Built-in Profiling


11.1 cProfile / profile

  • URL: https://docs.python.org/3/library/profile.html
  • Language: Python (stdlib)
  • Description: Python’s built-in deterministic profiler. Records function call counts and cumulative time.
  • Key features: Function-level timing, call count, cumulative/per-call time, pstats for analysis.
  • Difference: CPU time only, function-level. No memory, GPU, disk, or network.

11.2 tracemalloc

  • URL: https://docs.python.org/3/library/tracemalloc.html
  • Language: Python (stdlib, since 3.4)
  • Description: Traces Python memory allocations with tracebacks to allocation sites.
  • Key features: Peak memory tracking, traceback to allocation sites, snapshot comparison.
  • Difference: Python-managed memory only. No native/C allocations, no GPU/disk/network.

11.3 yappi

  • URL: https://github.com/sumerc/yappi
  • Language: Python + C
  • Description: Yet Another Python Profiler. Supports both wall clock and CPU time, multi-threaded profiling, and async code.
  • Key features: Wall + CPU time, multi-thread awareness, async support, pstats/callgrind output.
  • Difference: CPU profiling only.

11.4 line_profiler

  • URL: https://github.com/pyutils/line_profiler
  • Language: Python + C
  • Description: Line-by-line CPU time profiler for Python using @profile decorator.
  • Key features: Line-level execution time, @profile decorator.
  • Difference: CPU time only, requires decoration.

Summary Comparison Table

| Tool | Lang | CPU | Mem | GPU | Disk | Net | Batch-job wrap | Per-job report | Workflow integration | Output |
|---|---|---|---|---|---|---|---|---|---|---|
| resource-tracker | Python | Y | Y | Y | Y | Y | Y | Y | Metaflow, Flyte, Airflow | Metrics + card visualization |
| psutil | Python | Y | Y | | Y | Y | | | | Raw API |
| memory_profiler | Python | | Y | | | | Y (mprof) | Y (plot) | | Plot + log |
| Scalene | Python | Y | Y | Y | | | Y (CLI) | Y (web UI) | | Interactive web report |
| Memray | Python | | Y | | | | Y (CLI) | Y (flame graph) | | Flame graphs |
| Fil | Python | | Y | | | | Y (CLI) | Y (flame graph) | | Flame graph |
| pyinstrument | Python | Y | | | | | Y | Y | | HTML/text |
| py-spy | Rust | Y | | | | | Y (attach) | Y (flame graph) | | Flame graph |
| Austin | C | Y | | | | | Y | | | Stack samples |
| Glances | Python | Y | Y | Y* | Y | Y | | | | TUI + web API |
| nvitop | Python | | | Y | | | | | | TUI + Python API |
| gpustat | Python | | | Y | | | | | | CLI display |
| CodeCarbon | Python | Y* | Y* | Y* | | | Y (decorator) | Y (CSV) | | CO2 report |
| ClearML | Python | Y | Y | Y | | Y | Y (auto) | Y (web) | ML frameworks | Web dashboard |
| below | Rust | Y | Y | | Y | Y | | | | TUI + replay |
| samply | Rust | Y | | | | | Y (subprocess) | Y (flame graph) | | Firefox profiler |
| Bytehound | Rust | | Y | | | | Y (LD_PRELOAD) | Y (web GUI) | | Web GUI |
| atop | C | Y | Y | | Y | Y | | | | TUI + binary log |
| sysstat/pidstat | C | Y | Y | | Y | Y | | | | CLI + CSV |
| htop | C | Y | Y | | Y | Y | | | | TUI |
| btop++ | C++ | Y | Y | Y* | Y | Y | | | | TUI |
| Jobstats | Python | Y | Y | Y | | | Y* (Slurm) | Y (Slurm) | Slurm | CLI + DB |
| Pyroscope | Go | Y | Y | | | | Y (SDK) | | | Flame graphs |
| Parca | Go | Y | Y | | | | | | Kubernetes | Icicle graphs |
| perf | C | Y | Y | | | | Y (subprocess) | | | Raw perf data |
| Valgrind | C | Y | Y | | | | Y (subprocess) | Y | | Text + GUI |
| nethogs | C++ | | | | | Y | | | | TUI |
| iotop | C | | | | Y | | | | | TUI |
| PowerAPI | Python | Y* | Y* | | | | | | | Power estimates |
| W&B | Python | Y | Y | Y | | Y | Y (auto) | Y (web) | ML frameworks | Web dashboard |
| Prometheus stack | Go | Y | Y | Y* | Y | Y | | | Kubernetes | Time-series DB |

Y* = partial/plugin-based support


Key Findings for Rust CLI Implementation

Based on this landscape analysis, the following observations are most relevant to the planned Rust/Linux CLI implementation:

  1. No existing Rust tool covers the full feature set of resource-tracker (CPU + memory + GPU + disk + network + batch job wrapping + per-job reporting). below (Rust) is the closest in scope but is a system-wide daemon, not a per-job wrapper.

  2. procfs is the right foundation for Linux. The /proc filesystem is used by psutil, process-exporter, sysstat, and resource-tracker itself. A Rust implementation can use the procfs crate or read /proc directly with zero external dependencies.

  3. GPU support requires dynamic linking (NVML via libpynvml or direct libnvidia-ml.so). This is a hard constraint noted in the SOW. The Rust NVML binding (nvidia-management-library crate or similar) will be needed.

  4. The push-based metrics delivery (Extra Component: S3 PUT) is unique to resource-tracker and not present in any comparable tool. This makes it particularly well-suited for cloud batch job environments.

  5. The decorator/wrapper pattern (similar to samply record ./program) is present in py-spy, samply, Austin, and Fil — wrapping a subprocess is the right architectural pattern for a CLI tool.

  6. The closest functional analogues (tools that wrap a job, collect multi-resource metrics, and produce a per-job report) are:

    • Scalene (Python, CPU+GPU+memory, developer-oriented)
    • memory_profiler (Python, memory only, has mprof)
    • Jobstats (HPC/Slurm specific)
    • resource-tracker itself (the reference implementation)

    None of these is written in Rust, and none covers all six resource dimensions (CPU, memory, GPU, VRAM, network, disk) in a single zero-dependency binary.


Sources

  • https://github.com/SpareCores/resource-tracker
  • https://github.com/giampaolo/psutil
  • https://github.com/pythonprofilers/memory_profiler
  • https://github.com/plasma-umass/scalene
  • https://github.com/bloomberg/memray
  • https://github.com/pythonspeed/filprofiler
  • https://github.com/joerick/pyinstrument
  • https://github.com/benfred/py-spy
  • https://github.com/P403n1x87/austin
  • https://github.com/nicolargo/glances
  • https://github.com/XuehaiPan/nvitop
  • https://github.com/wookayin/gpustat
  • https://github.com/gpuopenanalytics/pynvml
  • https://github.com/mlco2/codecarbon
  • https://github.com/lfwa/carbontracker
  • https://github.com/powerapi-ng/pyRAPL
  • https://github.com/powerapi-ng/pyJoules
  • https://github.com/powerapi-ng/powerapi
  • https://github.com/sb-ai-lab/eco2AI
  • https://github.com/psf/pyperf
  • https://github.com/clearml/clearml
  • https://github.com/xybu/python-resmon
  • https://github.com/htop-dev/htop
  • https://github.com/aristocratos/btop
  • https://github.com/aristocratos/bpytop
  • https://github.com/aristocratos/bashtop
  • https://github.com/Atoptool/atop
  • https://github.com/sysstat/sysstat
  • https://github.com/Syllo/nvtop
  • https://github.com/MrRio/vtop
  • https://github.com/netdata/netdata
  • https://github.com/iovisor/bcc
  • https://github.com/bpftrace/bpftrace
  • https://github.com/parca-dev/parca
  • https://github.com/grafana/pyroscope
  • https://github.com/brendangregg/FlameGraph
  • https://github.com/gperftools/gperftools
  • https://valgrind.org/
  • https://github.com/KDE/heaptrack
  • https://github.com/google/perfetto
  • https://github.com/async-profiler/async-profiler
  • https://github.com/facebookincubator/below
  • https://github.com/mstange/samply
  • https://github.com/koute/bytehound
  • https://github.com/tikv/pprof-rs
  • https://github.com/prometheus/node_exporter
  • https://github.com/prometheus/pushgateway
  • https://github.com/ncabatoff/process-exporter
  • https://github.com/google/cadvisor
  • https://github.com/influxdata/telegraf
  • https://github.com/kubernetes/kube-state-metrics
  • https://opentelemetry.io/
  • https://github.com/NVIDIA/DCGM
  • https://github.com/NVIDIA/dcgm-exporter
  • https://github.com/raboof/nethogs
  • https://github.com/wandb/wandb
  • https://github.com/mlflow/mlflow
  • https://github.com/PrincetonUniversity/jobstats
  • https://github.com/rstudio/profvis
  • https://github.com/r-lib/bench
  • https://github.com/sumerc/yappi
  • https://github.com/pyutils/line_profiler
  • https://github.com/msaroufim/awesome-profiling
  • https://lambda.ai/blog/keeping-an-eye-on-your-gpus-2
  • https://sparecores.com/article/metaflow-resource-tracker
  • https://developers.facebook.com/blog/post/2021/09/21/below-time-travelling-resource-monitoring-tool/

Open-Source Tools with Similar Functionality to resource-tracker

resource-tracker is a lightweight, zero-dependency Python package for monitoring CPU, memory, GPU, network, and disk utilization across processes and at the system level, designed for batch jobs (Python/R scripts, Metaflow steps), with decorator-based workflow integration and per-job visualization reports.

The tools below are organized into meaningful categories. No single open-source tool matches all of resource-tracker’s characteristics simultaneously — most are either too narrow (single metric), too heavy (infrastructure daemons), or not batch-job oriented.


Category 1: Python Libraries for Process/System Resource Monitoring

(Closest functional analogues)

| Tool | Notes | Details |
|---|---|---|
| psutil | The foundational building block used by resource-tracker itself. Raw API only, no tracking loop or reports. | Linux; no CLI; CPU/Mem/Disk/Net/Process; no batch wrap; no report |
| memory_profiler | Line-by-line memory, @profile decorator, mprof plot. No CPU/GPU/disk/network. | Linux; CLI (mprof); Memory; batch wrap (mprof CLI); report (plot) |
| Scalene | High-precision line-level profiler with AI optimization suggestions. No disk/network. Developer profiler. | Linux; CLI; CPU/GPU/Mem; batch wrap (CLI); report (web UI) |
| Memray | Bloomberg. Tracks every allocation including C/C++. No CPU/GPU/disk/network. | Linux; CLI; Memory; batch wrap (CLI); report (flame graphs) |
| Fil | Peak memory focus for data scientists (NumPy/Pandas). Written in Rust+Python. Linux/macOS only. | Linux; CLI; Memory (peak); batch wrap (CLI); report (flame graph) |
| pyinstrument | Context manager + decorator. 1ms sampling. No memory/GPU/disk/network. | Linux; CLI; CPU; batch wrap; report |
| py-spy | Written in Rust. Attaches to a running process. No memory/GPU/disk/network. | Linux; CLI; CPU; batch wrap (attach); report (flame graph) |
| Austin | Pure C, extremely low overhead CPython frame stack sampler. | Linux; CLI; CPU; batch wrap; no report |
| Glances | Full system monitor with REST API, web UI, and exporters. Long-running daemon, not a batch-job wrapper. | Linux; CLI; CPU/Mem/Disk/Net/GPU; no batch wrap; no report |
| nvitop | Best GPU process viewer. Has programmatic ResourceMetricCollector API. No CPU/mem/disk/net. | Linux; CLI; NVIDIA GPU; no batch wrap; no report |
| gpustat | Simple NVIDIA GPU status CLI. No time-series logging. | Linux; CLI; NVIDIA GPU; no batch wrap; no report |
| pynvml / nvidia-ml-py | Python NVML bindings. Building block only. | Linux; no CLI; GPU (raw API); no batch wrap; no report |
| CodeCarbon | @track_emissions decorator. CO2/energy focus, not utilization %. No disk/network. | Linux; partial CLI; CPU/Mem/GPU energy; batch wrap (decorator); report (CSV + dashboard) |
| CarbonTracker | Predicts carbon footprint, can halt training. ML training specific. | Linux; no CLI; CPU/GPU energy; batch wrap; report |
| pyRAPL | Intel RAPL via /sys/class/powercap. Intel CPUs only. Energy joules, not utilization %. | Linux only; no CLI; CPU/DRAM energy; batch wrap (decorator); no report |
| pyJoules | Multi-device energy (Intel RAPL + NVML). Context manager and decorator. | Linux only; no CLI; CPU/DRAM/GPU energy; batch wrap (decorator); no report |
| PowerAPI | Framework for software-defined power meters. Process/container/VM granularity. Complex setup. | Linux only; partial CLI; CPU/Mem power; no batch wrap; no report |
| eco2AI | ML training focused CO2 tracking. | Linux; no CLI; CPU/GPU/RAM energy; batch wrap (decorator); report (CSV) |
| pyperf | PSF benchmarking toolkit. --track-memory and --tracemalloc options. Not an operational monitor. | Linux; CLI; Memory (benchmarks); batch wrap; report |
| ClearML | Full MLOps platform. Auto-logs system metrics. Requires ClearML server. | Linux; CLI; CPU/Mem/GPU/Net; auto batch wrap; report (web UI) |
| python-resmon | Lightweight script outputting CSV. System-level only, no per-process or GPU tracking. | Linux; CLI; CPU/Mem/Disk/Net; no batch wrap; report (CSV) |
| yappi | CPU + wall time profiler with multi-thread and async support. | Linux; no CLI; CPU; batch wrap; report |
| line_profiler | Line-by-line CPU time. No memory/GPU/disk/network. | Linux; CLI (kernprof); CPU; batch wrap (@profile); report |

Category 2: Interactive Terminal System Monitors

(Real-time visual monitoring; do not produce per-job reports or integrate with batch workflows)

| Tool | Notes | Details |
|---|---|---|
| htop | Interactive process viewer; no data capture | C; Linux; CLI; CPU/Mem/Proc |
| btop++ | Most modern TUI monitor; GPU via plugins | C++; Linux; CLI; CPU/Mem/Disk/Net/GPU |
| bpytop | Predecessor to btop++ | Python; Linux; CLI; CPU/Mem/Disk/Net |
| bashtop | Predecessor to bpytop | Bash; Linux; CLI; CPU/Mem/Disk/Net |
| atop | Writes persistent binary logs; replay mode; strong process-level detail | C; Linux only; CLI; CPU/Mem/Disk/Net/Proc |
| nmon | CSV capture mode for offline analysis; primarily Linux/AIX | C; Linux; CLI; CPU/Mem/Disk/Net |
| collectl | Wide metric coverage; daemon or one-shot mode | Perl; Linux only; CLI; CPU/Mem/Disk/Net |
| sysstat (sar/pidstat) | pidstat for per-process; sadf for JSON/CSV/XML export; schedulable via cron | C; Linux only; CLI; CPU/Mem/Disk/Net/Proc |
| nvtop | AMD, Apple, Intel, NVIDIA, Qualcomm support; interactive GPU monitor | C; Linux; CLI; GPU (multi-vendor) |
| vtop | Node.js, Unicode charts | JS; Linux; CLI; CPU/Mem/Proc |
| Netdata | 76k+ GitHub stars. Per-second metrics, web UI, ML anomaly detection | C; Linux; CLI; all (800+ plugins) |

Category 3: eBPF / Kernel Tracing Tools

(Zero-overhead kernel-level observability; require root + Linux kernel 4.1+)

| Tool | Notes | Details |
|---|---|---|
| BCC | Toolkit for writing eBPF programs; 70+ ready-made tools | C/Python/Lua; Linux only; CLI |
| bpftrace | DTrace-like one-liners for eBPF; ad-hoc analysis | C++ DSL; Linux only; CLI |
| Parca + Parca Agent | Continuous eBPF-based CPU profiling; pprof format; <1% overhead | Go; Linux only; CLI |
| Pyroscope (Grafana) | Continuous profiling database + eBPF agent; multi-language SDK; Grafana integration | Go; Linux only; CLI |

Category 4: Native C/C++ Profiling Tools

| Tool | Notes | Details |
|---|---|---|
| perf (Linux perf_events) | Foundation for many other tools; hardware counter sampling | C (kernel); Linux only; CLI; CPU/kernel events |
| FlameGraph | Visualizes perf/DTrace output as SVG flame graphs | Perl; Linux; CLI; visualization |
| gperftools | Google Performance Tools: CPU profiler, heap profiler, TCMalloc | C++; Linux; partial CLI (pprof); CPU/Memory |
| Valgrind / Massif | High-overhead instrumentation; Massif=heap profiler; 10–50× slowdown | C; Linux; CLI; CPU/Memory |
| Heaptrack | KDE; faster alternative to Valgrind/Massif for heap profiling | C++; Linux only; CLI; Memory |
| Perfetto | Google; default Android profiler; SQL-queryable traces; browser UI | C++; Linux; CLI; CPU/Mem/GPU/Disk/Sched |
| async-profiler | Low-overhead JVM profiler; flame graphs; JVM only | C (JVM agent); Linux; CLI (asprof); CPU/Heap |
| TAU | HPC parallel profiling suite; complex setup | C++; Linux; CLI; CPU/GPU/MPI |
| HPCToolkit | HPC sampling profiler; 1–5% overhead; supercomputer use | C/C++; Linux; CLI; CPU/GPU |

Category 5: Rust Tools

| Tool | Notes | Details |
|---|---|---|
| below | Facebook/Meta. Time-traveling system monitor with cgroup/PSI support; record+replay mode. System-wide daemon, not a batch-job wrapper. Architecturally most relevant Rust project. | Linux only; CLI |
| samply | Sampling CPU profiler; wraps a subprocess (samply record ./program); uses Linux perf events; Firefox Profiler UI. CPU only. | Linux; CLI |
| Bytehound | Heap memory profiler; LD_PRELOAD-based; multi-arch (AMD64, ARM, AArch64, MIPS64); web-based GUI. Memory only. | Linux only; CLI |
| pprof-rs | CPU profiler for Rust programs using backtrace-rs; pprof output format. Library only. | Linux; no CLI |

Category 6: Infrastructure Metrics Collection (Daemons & Exporters)

(Not batch-job wrappers; relevant for pipeline integration and metric output targets)

| Tool | Notes | Details |
|---|---|---|
| Prometheus node_exporter | System-level Prometheus exporter; /proc-based | Go; Linux; CLI |
| Prometheus Pushgateway | Allows batch jobs to push metrics to Prometheus; standard solution for short-lived jobs | Go; Linux; CLI |
| process-exporter | Per-process-group Prometheus metrics from /proc | Go; Linux only; CLI |
| cAdvisor | Container resource usage and performance; Prometheus exporter | Go; Linux only; CLI |
| Telegraf | Plugin-driven metrics agent; 300+ inputs; InfluxDB backend | Go; Linux; CLI |
| OpenTelemetry | CNCF standard for traces/metrics/logs; structured output for jobs | Multi-lang; Linux; CLI (otelcol) |
| NVIDIA DCGM + dcgm-exporter | GPU telemetry for Kubernetes/data center; Prometheus exporter | C/Go; Linux only; CLI |
| kube-state-metrics | Kubernetes object state metrics for Prometheus | Go; Linux; CLI |
| Jobstats (HPC) | Slurm-compatible per-job efficiency reports (CPU+GPU). Conceptually very close to resource-tracker but Slurm-specific. | Python; Linux only; CLI |

Category 7: Per-Process Network and Disk I/O Monitors

| Tool | Notes | Details |
|---|---|---|
| nethogs | Per-process network bandwidth using /proc/net/tcp + libpcap | C++; Linux only; CLI |
| iftop | Per-connection (not per-process) bandwidth monitor | C; Linux; CLI |
| iotop | Per-process disk I/O using kernel I/O accounting | C; Linux only; CLI |
| dstat | System-wide CPU+disk+network+memory with CSV output | Python; Linux only; CLI |

Category 8: ML Experiment Tracking with Resource Monitoring

| Tool | Notes | Details |
|---|---|---|
| Weights & Biases | Auto-logs GPU, CPU, memory, network during training runs; cloud-first; rich dashboards | Linux; CLI (wandb) |
| ClearML | Open-source MLOps platform; auto-logs GPU+CPU+memory+network; requires ClearML server | Linux; CLI |
| MLflow | Experiment tracking but no native system resource monitoring | Linux; CLI (mlflow) |

Category 9: R Language Profiling

| Tool | Notes | Details |
|---|---|---|
| profvis | Interactive R profiling visualization; CPU + memory timeline; used within R session | Linux; R session only |
| bench | Benchmarking with memory tracking; used within R session | Linux; R session only |
| microbenchmark | Micro-benchmarking tool; used within R session | Linux; R session only |
| profmem | Memory allocation tracing for R expressions; used within R session | Linux; R session only |

Category 10: Python Standard Library Profiling Tools

| Tool | Notes | Details |
|---|---|---|
| cProfile / profile | Function-level CPU time; stdlib | Linux; CLI (python -m cProfile) |
| tracemalloc | Python memory allocation tracing with tracebacks; stdlib since Python 3.4; used within code | Linux; no CLI (used within code) |

Summary: Key Differentiators of resource-tracker

The table below highlights what makes resource-tracker stand out relative to the landscape:

| Feature | resource-tracker | Most profilers | System monitors | ML trackers |
|---|---|---|---|---|
| CPU + Memory + GPU + Disk + Net | All 5 | Usually 1–2 | All 5 | CPU+Mem+GPU |
| Batch-job / script wrapper | Yes | Yes | No (daemons) | Yes |
| Zero runtime dependencies | Yes | Varies | No | No |
| Per-job visual report / card | Yes | Often | No | Yes (cloud) |
| Workflow integration (Metaflow) | Yes | No | No | Varies |
| Cloud instance recommendations | Yes | No | No | No |
| Lightweight process footprint | Yes | Yes | No | No |
| Process-level granularity | Yes | Yes | Partial | No |
| Runs on Linux | Yes | Yes | Yes | Yes |
| CLI invocation | Yes | Yes (most) | Yes | Yes |

Rust Crate-Level Competitive Landscape: Resource Monitoring

This document surveys Rust crates relevant to resource monitoring — tracking CPU, memory, GPU, network, and disk utilization — with particular focus on use cases analogous to the Python resource-tracker package (batch job wrapping, structured output, low overhead).

It also covers dial9-tokio-telemetry, a notable 2026 Rust telemetry crate that is not a resource monitor but is included here to explain why it falls outside this landscape.


Section 1: Core System Information Libraries

(Foundational libraries; highest relevance as building blocks)

| Crate | Notes | Details |
|---|---|---|
| sysinfo | The dominant Rust system-info library. Cross-platform (Linux, macOS, Windows, FreeBSD). Covers everything resource-tracker needs except GPU. Used internally by most other crates here. ~2,700 GitHub stars. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; active; 123M downloads |
| procfs | Direct interface to Linux /proc. Most granular per-process data available (CPU time, RSS, VMS, I/O counters, smaps). Authoritative source for Linux-first tools. | Linux only; no CLI; CPU/Mem/Net/Disk; process-level; active; 51M downloads |
| psutil | Rust port of Python's psutil. Modular feature flags. Linux + macOS. README self-describes as "not well maintained" despite a July 2025 update. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; active*; 3.1M downloads |
| systemstat | Pure Rust (no C bindings). Cross-platform. System-wide only — no per-process metrics. | Linux; no CLI; CPU/Mem/Net/Disk; system-wide only; active; 3.6M downloads |
| libproc | Per-process data on Linux + macOS. Useful complement to procfs for cross-platform support. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; active; 5M downloads |
| memory-stats | Cross-platform. Reports the current process's own RSS and virtual memory only. Narrow scope but zero-dependency and reliable. | Linux; no CLI; Mem only; self-process only; active; 10.3M downloads |
| perf_monitor | Larksuite (Lark/Feishu). Designed explicitly as a monitoring foundation: per-process CPU, memory, FDs, disk I/O. Cross-platform. Archived January 2026 — do not adopt for new projects. | Linux; no CLI; CPU/Mem/Disk; process-level; archived; 36K downloads |
| heim | Async-first psutil/gopsutil equivalent. Conceptually ideal but last released 2020; 74 open issues. Not safe to adopt. | Linux; no CLI; CPU/Mem/Net/Disk; process-level; abandoned; 490K downloads |

*psutil: stated as “not well maintained” in README despite recent activity.


Section 2: GPU Monitoring Libraries

| Crate | Notes | Details |
|---|---|---|
| nvml-wrapper | Safe, ergonomic Rust wrapper for NVIDIA NVML. Covers GPU utilization, memory, temperature, power, fan speed, running compute processes. The standard library for NVIDIA GPU metrics in Rust. | Linux; no CLI; NVIDIA GPU; active; 3.5M downloads |
| all-smi | Most comprehensive multi-vendor GPU CLI in Rust. Prometheus metrics integration. Display-oriented but scriptable. | Linux; CLI + Prometheus; NVIDIA/AMD/Intel/Apple/TPU/NPU GPU; active; 8.3K downloads |
| nviwatch | Interactive TUI + InfluxDB integration. NVIDIA-only. | Linux; TUI; NVIDIA GPU; active; 4.9K downloads |
| gpuinfo | Minimal CLI for GPU status with --watch and --format flags. Scriptable. NVIDIA-only. | Linux; CLI; NVIDIA GPU; active; 5.9K downloads |

Section 3: CLI Tools for Batch Job / Process Resource Tracking

(Most directly comparable to resource-tracker’s execution model)

| Crate | Notes | Details |
|---|---|---|
| denet | Closest Rust analogue to resource-tracker. `denet run <cmd>` wraps a command and streams CPU%, memory (RSS+VMS), and I/O metrics. JSON/JSONL/CSV output. Adaptive sampling. Child process aggregation. Python API bindings. No GPU or network monitoring. | Linux; CLI; CPU/Mem/Disk; active; 2.6K downloads |
| session-process-monitor | Kubernetes-focused but the `spm run` pattern directly wraps a batch job with monitoring + OOM protection + headless JSON logging. Tracks USS/PSS/RSS memory and disk I/O rate. Very new (March 2026). No GPU or network. | Linux only; CLI (spm run); CPU/Mem/Disk; active; 173 downloads |
| stop-cli | Modern process viewer with JSON/CSV structured output designed for piping to jq. Per-process CPU%, memory, disk I/O, FDs. Very early stage (v0.0.1, November 2025). | Linux; CLI; CPU/Mem/Disk; active; 72 downloads |
| procrec | Records and plots CPU + memory for a process. Conceptually aligned but last updated 2021. | Linux; CLI; CPU/Mem; abandoned; 1.7K downloads |
| radvisor | Container/Kubernetes batch monitoring at 50ms granularity via cgroups. CSVY output. CPU (including throttling), memory, block I/O. Dormant since 2022. | Linux only; CLI; CPU/Mem/Disk; dormant; 1.7K downloads |
| pidtree_mon | CLI monitor for CPU load across entire process trees (parent + all descendants). CPU-only; no memory/disk/network/GPU. | Linux only; CLI; CPU only; active; 6.2K downloads |
| gotta-watch-em-all | CLI memory monitor for process trees. Memory-only. Dormant since 2022. | Linux; CLI; Mem only; dormant; 6.5K downloads |
| procweb-rust | Web interface for per-process Linux resource usage. No structured data output. Stale since 2023. | Linux only; web UI; CPU/Mem; stale; 5.5K downloads |
| systrack | Library for tracking CPU and memory usage over configurable time intervals (rolling windows) — the exact pattern resource-tracker uses. Single release in 2023; dormant since. | Linux; no CLI; CPU/Mem; dormant; 1.4K downloads |

Section 4: Interactive TUI System Monitors

(Visual monitors; not designed for non-interactive batch job instrumentation)

| Crate | Notes | Details |
|---|---|---|
| bottom (btm) | Most popular Rust TUI monitor. Cross-platform. No GPU. Uses sysinfo internally. Interactive only — not suitable for batch job instrumentation. | Linux; TUI; CPU/Mem/Net/Disk; active; 13,100 stars |
| mltop | ML-focused TUI combining CPU + NVIDIA GPU (via NVML). Directly targets the ML engineer use case. Interactive only. | Linux; TUI; CPU/Mem/NVIDIA GPU; active; 14 stars |
| rtop | TUI with optional NVIDIA GPU support. Covers all five resource types in a single tool. Interactive only. | Linux; TUI; CPU/Mem/NVIDIA GPU/Net/Disk; active; 36 stars |
| ttop | TUI with multi-vendor GPU (NVIDIA, AMD, Apple Silicon). Very new (March 2026). Interactive only. | Linux; TUI; CPU/Mem/multi-vendor GPU; active |
| hegemon | Modular safe-Rust TUI. Last release 2018. Historical reference only. | Linux only; TUI; CPU/Mem; abandoned; 336 stars |

Section 5: Comprehensive Hardware Monitoring

| Crate | Notes | Details |
|---|---|---|
| silicon-monitor | Most comprehensive hardware monitoring scope of any crate here. NVIDIA (NVML) + AMD (ROCm/sysfs) + Intel (i915) GPU. Also covers temperatures, SMART disk data, USB, audio, per-process GPU attribution. Provides CLI (JSON output), TUI, GUI, library (simonlib), and MCP/AI agent server. Very new (133 downloads, 1 star as of March 2026); unclear stability. Worth watching. | Linux; CLI (JSON); CPU/Mem/multi-vendor GPU/Net/Disk; active |

Section 6: Kernel / Low-Level Profiling Crates

(Measure hardware counters, not high-level resource utilization)

| Crate | Notes | Details |
|---|---|---|
| perf-event | Safe Rust interface to perf_event_open. Exposes hardware counters: CPU cycles, instructions, cache hits/misses, branch predictions, page faults, context switches. Deep profiling of batch jobs; not high-level resource tracking. | Linux only; no CLI; active; 4.2M downloads |
| pprof | CPU profiler for Rust programs (stack sampling → flamegraph/pprof output). Profiler, not a resource monitor. | Linux; no CLI; active; 34M downloads |
| metrics | Application metrics facade (counters, gauges, histograms). Used to emit measurements; not a collector of system resources. | Linux; no CLI; active; 74M downloads |

Section 7: dial9-tokio-telemetry — Async Runtime Telemetry (Out of Scope)

dial9-tokio-telemetry is a runtime telemetry “flight recorder” for the Tokio async runtime in Rust, announced on the Tokio blog on March 18, 2026 (authored by Russell Cohen, with AWS contributions). It is included here to explain why it is not a resource monitor and does not belong in this landscape.

What it does

dial9 hooks into Tokio’s internal instrumentation to capture a microsecond-resolution event log of every:

  • Task poll (timing per poll)
  • Worker park / unpark event
  • Task wake event and lifecycle (creation, worker migration)
  • Queue depth change
  • Lock contention event (with stack traces on Linux)
  • Linux kernel scheduling delay (gap between “ready to run” and “actually scheduled”)
  • CPU profile samples (Linux perf/eBPF-style)
  • Application-level tracing spans and logs

Traces are written to compact rotating binary files (or directly to S3) with <5% overhead, enabling continuous production deployment. A web-based trace viewer renders the results.

Why it is not a resource monitor

| Dimension | resource-tracker | dial9-tokio-telemetry |
|---|---|---|
| Target workload | Batch jobs (ML, HPC, pipelines) | Long-running async Rust services |
| Metrics tracked | CPU%, RAM, GPU, network, disk | Tokio task polls, scheduling delays, lock contention |
| Integration | Decorator / subprocess wrap | Must be compiled into the Rust binary |
| Output | Time-series resource usage / plots | Binary event traces for async runtime debugging |
| Question answered | "How much CPU/RAM did this job use?" | "Why did this async request take 18ms instead of 1ms?" |
| Platform | Cross-platform | Linux-primary |

dial9 is an async runtime debugger. It tracks none of the metrics — CPU utilization %, memory, GPU, network bandwidth, disk I/O — that define the resource-tracker use case. It is relevant to Rust async service reliability engineering, not to batch job resource instrumentation.


Summary: Key Findings

No single Rust crate fully replicates resource-tracker

No existing Rust crate combines: subprocess/batch-job wrapping + CPU% + memory + GPU + network + disk + structured JSON/CSV output + low overhead. The gap is real.

Closest existing tools

| Crate | Why it is close | What is missing |
|---|---|---|
| denet | `denet run <cmd>` wraps a command; JSON/CSV output; Python bindings | GPU, network |
| session-process-monitor | `spm run` pattern; OOM protection; headless JSON logging | GPU, network |
| stop-cli | Structured JSON/CSV; scripting-friendly | Not a job wrapper; no GPU/network |

Recommended building blocks

| Purpose | Crate |
|---|---|
| CPU, memory, disk, network (system + process) | sysinfo |
| Fine-grained Linux per-process I/O and memory | procfs |
| NVIDIA GPU metrics | nvml-wrapper |
| Multi-vendor GPU CLI | all-smi |

The GPU gap

No Rust library cleanly integrates CPU + memory + multi-vendor GPU + network + disk in a single programmatic API suitable for batch job wrapping. silicon-monitor attempts this scope but is brand new and unproven. nvml-wrapper covers NVIDIA programmatically; multi-vendor GPU support requires either all-smi (CLI) or direct vendor SDK bindings.

Specification Proposal — resource-tracker

  • Status: Proposal / Work-in-Progress
  • Date: 2026-03-30
  • Based on: README.md (SpareCores), src/ prototype, Python PR #9, s3_upload.py
  • AI large language model tools were used throughout the research, specification, and implementation phases of this project to accelerate the work and improve its quality.

0. Conventions

The key words MUST, MUST NOT, REQUIRED, SHALL, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

A verifiable requirement is one that can be confirmed by an automated test without manual inspection. Every normative statement below (MUST/SHALL) is intended to be verifiable.


1. Purpose and Scope

resource-tracker is a lightweight, statically self-contained Linux binary that:

  1. Polls system- and process-level resource utilization at a configurable interval.
  2. Emits structured samples to stdout (JSON Lines or CSV).
  3. Optionally streams those samples to the Sentinel API (SpareCores data ingestion endpoint) via gzip-compressed (CSV, TSV, or JSONL) files uploaded to S3 using temporary STS credentials.

The binary is intended as a drop-in CLI wrapper: run it alongside any process and it will transparently record how that process consumes hardware.

Out of scope (v1): macOS, Windows, eBPF-based tracing, container image introspection beyond environment variables, multi-host federation.


2. Platform Requirements

| Requirement | Detail |
|---|---|
| Operating System | Linux only (kernel ≥ 4.18 recommended for full /proc coverage) |
| CPU Architectures | x86_64 and aarch64 (ARM64) |
| Linkage | Dynamic linkage for GPU libraries; all other code statically linked or carried as crate dependencies |
| Minimum Rust Edition | 2024 |

GPU support MUST NOT be required for the binary to build or run.
On a CPU-only host GpuCollector::collect() SHALL return an empty Vec and no error.


3. Configuration

3.1 Precedence (highest to lowest)

CLI flags  >  TOML config file  >  built-in defaults

Future enhancement: Support RESOURCE_TRACKER_-prefixed environment variables (e.g. RESOURCE_TRACKER_INTERVAL, RESOURCE_TRACKER_FORMAT) as an additional configuration layer between CLI flags and the TOML file. Environment variables are more practical than file-based config for containerized and scripted workloads and are preferred for the Sentinel integration use case.

3.2 CLI Parameters

The binary MUST accept the following flags via a command line parser:

| Short | Long | Type | Default | Description |
|---|---|---|---|---|
| -n | --job-name | String | none | Human-readable label attached to every sample |
| -p | --pid | i32 | none | Root PID of the process tree to track (CPU attribution) |
| -i | --interval | u64 | 1 | Polling interval in seconds (≥ 1) |
| -c | --config | path | resource-tracker.toml | Path to TOML config file |
| -f | --format | enum | json | Output format: json or csv |
| | --version | flag | | Print binary version and exit |

All metadata fields listed in Section 9.3 (job_name, project_name, stage_name, etc.) MUST also be accepted as CLI flags. See Section 9.3 for the full flag and environment variable table.

Shell-wrapper mode (MVP target): The binary SHOULD support being used as a transparent process wrapper, where the command to monitor is passed as trailing arguments after a -- separator or as positional arguments:

resource-tracker Rscript model.R
resource-tracker -- python train.py --epochs 10

In this mode the binary spawns the given command as a child process, sets --pid to that child’s PID automatically, and exits when the child exits (propagating the child exit code). This is a significant usability improvement over the Python implementation and is a first-class v1 goal.
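A minimal sketch of this wrapper mode, assuming a hypothetical background hook (`spawn_tracker_thread`) that runs the Section 5 polling loop; names are illustrative, not part of the spec:

```rust
use std::process::{exit, Command};

// Spawn the trailing command, point the tracker at its PID, and
// propagate its exit code when it terminates.
fn wrap_and_track(cmd: &[String]) {
    let mut child = Command::new(&cmd[0])
        .args(&cmd[1..])
        .spawn()
        .expect("failed to spawn tracked command"); // sketch only; v1 code must not panic

    let _pid = child.id() as i32; // equivalent to passing --pid automatically
    // spawn_tracker_thread(_pid);  // hypothetical: polling loop in the background

    let status = child.wait().expect("failed to wait on child");
    exit(status.code().unwrap_or(1)); // propagate the child's exit code
}
```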

--interval MUST be > 0. Values of 0 SHALL be rejected with a non-zero exit code and a descriptive error message.

3.3 TOML Config File

The config file is optional. If the file does not exist or cannot be parsed, the binary MUST continue using defaults (no error, no warning).

Schema:

[job]
name = "my-benchmark"   # String; optional
pid  = 12345            # i32;   optional

[tracker]
interval_secs = 5       # u64;   optional; default 1

Unrecognized keys MUST be silently ignored.
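A sketch of this schema with serde, which ignores unrecognized keys by default; struct names are illustrative:

```rust
use serde::Deserialize;

#[derive(Debug, Default, Deserialize)]
struct ConfigFile {
    #[serde(default)]
    job: JobSection,
    #[serde(default)]
    tracker: TrackerSection,
}

#[derive(Debug, Default, Deserialize)]
struct JobSection {
    name: Option<String>,
    pid: Option<i32>,
}

#[derive(Debug, Default, Deserialize)]
struct TrackerSection {
    interval_secs: Option<u64>, // default of 1 applied at a later merge step
}

// A missing or unparsable file falls back to defaults: no error, no warning.
fn load_config(path: &str) -> ConfigFile {
    std::fs::read_to_string(path)
        .ok()
        .and_then(|s| toml::from_str(&s).ok())
        .unwrap_or_default()
}
```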

3.4 Verifiable Configuration Tests

  • T-CFG-01: Running with no flags produces valid JSON Lines output on stdout.
  • T-CFG-02: --format csv emits a header line matching the exact column list in Section 6.2 before the first data row.
  • T-CFG-03: --interval 0 exits with code ≠ 0.
  • T-CFG-04: A TOML file with [tracker] interval_secs = 3 results in samples separated by ≈ 3 seconds when no --interval flag is provided.
  • T-CFG-05: A CLI --interval 2 overrides a TOML interval_secs = 5.
  • T-CFG-06: A missing TOML file path silently falls back to defaults.

4. Startup Behavior

On startup the binary MUST:

  1. Parse configuration (Section 3).
  2. Initialize all collectors.
  3. Execute one warm-up collection pass to prime delta state in stateful collectors (CpuCollector, NetworkCollector, DiskCollector).
  4. Sleep exactly one full interval.
  5. Emit the CSV header (if format = CSV) before the first data row.
  6. Enter the polling loop (Section 5).

The warm-up pass result MUST NOT be emitted to stdout.


5. Polling Loop

The loop MUST:

  1. Record timestamp_secs = current Unix time as u64 (seconds since UNIX epoch, UTC).
  2. Collect all metric subsystems (Section 6.1) in the order: CPU, Memory, Network, Disk, GPU.
  3. Serialize and emit one line to stdout per the chosen format (Section 6.2, Section 6.3).
  4. Sleep the configured interval.
  5. Repeat indefinitely until killed.

Collection of any subsystem MUST NOT block the other subsystems. Failures in optional subsystems (GPU) MUST be surfaced as empty/zero values, not panics.


6. Data Model

6.1 Sample

A Sample is a point-in-time snapshot of all tracked resources.

```rust
pub struct Sample {
    pub timestamp_secs: u64,          // Unix time (seconds)
    pub job_name:       Option<String>,
    pub cpu:            CpuMetrics,
    pub memory:         MemoryMetrics,
    pub network:        Vec<NetworkMetrics>,  // one per interface
    pub disk:           Vec<DiskMetrics>,     // one per block device
    pub gpu:            Vec<GpuMetrics>,      // one per GPU; empty if none
}
```

6.1.1 CpuMetrics

Source: /proc/stat tick deltas; /proc/<pid>/stat for process tracking.

Note: total_cores (logical CPU count) is a static host property that rarely changes. It belongs in the host discovery snapshot (Section 8.1) rather than in every per-second sample. It is referenced here only for computing cpu_usage in the CSV output (Section 7.2).

| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
| utilization_pct | `f64` | fractional cores | /proc/stat | Aggregate utilization expressed as cores-in-use (0.0..N_cores) |
| per_core_pct | `Vec<f64>` | % | /proc/stat | Per logical CPU percentage; len == host_vcpus; range 0.0–100.0 |
| utime_secs | `f64` | seconds | /proc/stat | Δ(user+nice ticks) / ticks_per_second for this interval |
| stime_secs | `f64` | seconds | /proc/stat | Δ(system ticks) / ticks_per_second for this interval |
| process_count | `u32` | count | /proc numeric dirs | Number of running processes visible to the OS |
| process_cores_used | `Option<f64>` | fractional cores | `/proc/<pid>/stat` | None when no PID tracked |
| process_child_count | `Option<u32>` | count | `/proc/<pid>/stat` | Descendant count; excludes root PID; None when no PID tracked |

Computation rules:

  • utilization_pct = (Δtotal − Δidle) / Δtotal × N_cores where N_cores is the logical CPU count from host discovery. The result is expressed as fractional cores in use (e.g. 4.6 on a 16-core host means ~4.6 vCPUs are fully utilized). Do NOT clamp this value; values very slightly above N_cores are valid under kernel accounting rounding. Δtotal = Δ(user + nice + system + idle + iowait + irq + softirq + steal). Δidle = Δ(idle + iowait).
  • utime_secs = Δ(user + nice) / ticks_per_second.
  • stime_secs = Δ(system) / ticks_per_second.
  • process_cores_used = Σ Δ(utime+stime) for root PID and all descendants / (elapsed_wall_clock_seconds × ticks_per_second). Must be ≥ 0.
  • On the first collection call (no previous snapshot), all delta-based fields MUST return 0. The caller MUST discard this result (warm-up pass).
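A sketch of the delta arithmetic above; `StatSnapshot` is an illustrative container for one cumulative /proc/stat read:

```rust
struct StatSnapshot {
    user: u64, nice: u64, system: u64, idle: u64,
    iowait: u64, irq: u64, softirq: u64, steal: u64,
}

fn utilization_cores(prev: &StatSnapshot, cur: &StatSnapshot, n_cores: f64) -> f64 {
    let total = |s: &StatSnapshot| {
        s.user + s.nice + s.system + s.idle + s.iowait + s.irq + s.softirq + s.steal
    };
    let idle = |s: &StatSnapshot| s.idle + s.iowait;

    let d_total = (total(cur) - total(prev)) as f64;
    let d_idle = (idle(cur) - idle(prev)) as f64;
    if d_total == 0.0 {
        return 0.0; // warm-up pass / no elapsed ticks
    }
    // Fractional cores in use; deliberately NOT clamped to n_cores.
    (d_total - d_idle) / d_total * n_cores
}
```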

Verifiable CpuMetrics Tests:

  • T-CPU-01: utilization_pct is in [0.0, N_cores] for all samples (N_cores from host discovery).
  • T-CPU-02: len(per_core_pct) == host_vcpus for all samples.
  • T-CPU-03: When --pid is not set, process_cores_used and process_child_count are None.
  • T-CPU-04: When --pid <self> is set, process_cores_used ≥ 0.
  • T-CPU-05: process_count ≥ 1 on any running Linux system.
  • T-CPU-06: First collect() call returns 0.0 for all delta fields.

6.1.2 MemoryMetrics

Source: /proc/meminfo. All values in mebibytes (MiB = 1024 × 1024 bytes), standardized to match Python resource-tracker PR #9 which also adopts MiB throughout.

| Field | Type | Unit | /proc/meminfo key(s) | Notes |
|---|---|---|---|---|
| total_mib | `u64` | MiB | MemTotal | |
| free_mib | `u64` | MiB | MemFree | Truly free RAM |
| available_mib | `u64` | MiB | MemAvailable | Free + reclaimable |
| used_mib | `u64` | MiB | MemTotal − MemFree − Buffers − Cached | Matches Python memory_used |
| used_pct | `f64` | % | derived | used_mib / total_mib × 100; range 0.0–100.0 |
| buffers_mib | `u64` | MiB | Buffers | Kernel I/O buffers |
| cached_mib | `u64` | MiB | Cached + SReclaimable | Page cache + slab reclaimable |
| swap_total_mib | `u64` | MiB | SwapTotal | |
| swap_used_mib | `u64` | MiB | SwapTotal − SwapFree | |
| swap_used_pct | `f64` | % | derived | 0.0 when SwapTotal == 0 |
| active_mib | `u64` | MiB | Active | |
| inactive_mib | `u64` | MiB | Inactive | |
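A sketch of the underlying /proc/meminfo lookup (helper name illustrative; meminfo reports values in "kB", i.e. KiB, so one division by 1024 yields MiB):

```rust
// Find one key (pass it with the trailing colon, e.g. "MemTotal:") in the
// full /proc/meminfo text and convert its KiB value to MiB.
fn meminfo_mib(meminfo: &str, key: &str) -> Option<u64> {
    meminfo
        .lines()
        .find(|l| l.starts_with(key))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|v| v.parse::<u64>().ok())
        .map(|kib| kib / 1024) // KiB -> MiB
}
```

used_mib then follows from the table's formula: MemTotal − MemFree − Buffers − Cached.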

Verifiable MemoryMetrics Tests:

  • T-MEM-01: free_mib + used_mib + buffers_mib + cached_mib ≤ total_mib (accounting for kernel reserved memory).
  • T-MEM-02: used_pct is in [0.0, 100.0].
  • T-MEM-03: swap_used_pct is 0.0 when swap_total_mib == 0.
  • T-MEM-04: available_mib ≤ total_mib.

6.1.3 NetworkMetrics

Source: /proc/net/dev (throughput), /sys/class/net/<iface>/ (identity/link state). One NetworkMetrics record per non-loopback interface.

Architecture note: Fields such as mac_address, driver, operstate, speed_mbps, and mtu are static properties that do not change every second. They are candidates for promotion to a host-discovery snapshot (Section 8.1) rather than being repeated in every per-second sample. This applies similarly to static fields in Section 6.1.4 (disk) and Section 6.1.5 (GPU). The current spec includes them here for completeness; a future revision should separate static identity fields from dynamic rate fields.

| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
| interface | `String` | | interface name | e.g. "eth0" |
| mac_address | `Option<String>` | | `/sys/class/net/<iface>/address` | "00:11:22:33:44:55" |
| driver | `Option<String>` | | `/sys/class/net/<iface>/device/driver` symlink | e.g. "igc" |
| operstate | `Option<String>` | | `/sys/class/net/<iface>/operstate` | "up", "down", "unknown" |
| speed_mbps | `Option<i64>` | Mbps | `/sys/class/net/<iface>/speed` | −1 when not reported |
| mtu | `Option<u32>` | bytes | `/sys/class/net/<iface>/mtu` | |
| rx_bytes_per_sec | `f64` | bytes/s | /proc/net/dev Δ | Rate for this interval |
| tx_bytes_per_sec | `f64` | bytes/s | /proc/net/dev Δ | Rate for this interval |
| rx_bytes_total | `u64` | bytes | /proc/net/dev | Cumulative since boot |
| tx_bytes_total | `u64` | bytes | /proc/net/dev | Cumulative since boot |
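The per-second rates are plain counter deltas divided by elapsed wall-clock time; a sketch (helper name illustrative):

```rust
// Rate from two consecutive cumulative /proc/net/dev counter reads.
// saturating_sub guards against counter resets (e.g. an interface re-created).
fn bytes_per_sec(prev_total: u64, cur_total: u64, elapsed_secs: f64) -> f64 {
    cur_total.saturating_sub(prev_total) as f64 / elapsed_secs
}
```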

Verifiable NetworkMetrics Tests:

  • T-NET-01: rx_bytes_per_sec ≥ 0.0 and tx_bytes_per_sec ≥ 0.0 for all interfaces.
  • T-NET-02: rx_bytes_total monotonically non-decreasing between consecutive samples (absent interface reset).
  • T-NET-03: The loopback interface (lo) is NOT included in the output.

6.1.4 DiskMetrics

Source: /proc/diskstats (throughput), /sys/block/<dev>/ (identity), statvfs(3) (space). One DiskMetrics record per block device (excluding partitions and device-mapper synthetic devices unless mounted independently).

| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
| device | `String` | | kernel device name | e.g. "sda", "nvme0n1" |
| model | `Option<String>` | | `/sys/block/<dev>/device/model` | |
| vendor | `Option<String>` | | `/sys/block/<dev>/device/vendor` | |
| serial | `Option<String>` | | `/sys/block/<dev>/device/wwid` or serial | |
| device_type | `Option<DiskType>` | | `/sys/block/<dev>/queue/rotational` | Nvme, Ssd, or Hdd; None when type cannot be determined |
| capacity_bytes | `Option<u64>` | bytes | `/sys/block/<dev>/size` × 512 | |
| mounts | `Vec<DiskMountMetrics>` | | statvfs(3) | One per mount point |
| read_bytes_per_sec | `f64` | bytes/s | /proc/diskstats Δ | |
| write_bytes_per_sec | `f64` | bytes/s | /proc/diskstats Δ | |
| read_bytes_total | `u64` | bytes | /proc/diskstats sectors × sector_size | Cumulative since boot; see sector size note |
| write_bytes_total | `u64` | bytes | /proc/diskstats sectors × sector_size | Cumulative since boot; see sector size note |

DiskMountMetrics fields:

| Field | Type | Unit | Notes |
|---|---|---|---|
| mount_point | `String` | | e.g. "/" |
| filesystem | `String` | | Filesystem type from /proc/mounts; e.g. "ext4", "xfs" |
| total_bytes | `u64` | bytes | statvfs.f_blocks × f_bsize |
| available_bytes | `u64` | bytes | statvfs.f_bavail × f_bsize (unprivileged) |
| used_bytes | `u64` | bytes | total_bytes − (statvfs.f_bfree × f_bsize) |
| used_pct | `f64` | % | used_bytes / total_bytes × 100; 0.0 when total == 0 |

Sector size note: The current implementation hard-codes 512 bytes/sector for /proc/diskstats conversions. Python’s get_sector_sizes() reads /sys/block/<dev>/queue/hw_sector_size (fallback 512). On 4K-native drives (some NVMe) the Rust code will under-count I/O bytes by up to 8×. A future fix should read /sys/block/<dev>/queue/logical_block_size at startup and use the actual sector size. See implementation plan P-DSK-SECTOR.
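A sketch of the P-DSK-SECTOR fix, assuming the sysfs queue attribute is readable once at startup:

```rust
// Read the device's logical block size instead of hard-coding 512 bytes.
fn logical_block_size(dev: &str) -> u64 {
    std::fs::read_to_string(format!("/sys/block/{dev}/queue/logical_block_size"))
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(512) // fall back to the traditional 512-byte sector
}
```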

Verifiable DiskMetrics Tests:

  • T-DSK-01: read_bytes_per_sec ≥ 0.0 and write_bytes_per_sec ≥ 0.0.
  • T-DSK-02: For each mount, used_bytes + available_bytes ≤ total_bytes.
  • T-DSK-03: capacity_bytes (when Some) > 0.

6.1.5 GpuMetrics

Source: NVML (nvml-wrapper crate, runtime-loads libnvidia-ml.so) for NVIDIA GPUs; libamdgpu_top (runtime-loads libdrm) for AMD GPUs.

| Field | Type | Unit | Notes |
|---|---|---|---|
| uuid | `String` | | Stable vendor UUID; AMD uses PCI bus address |
| name | `String` | | Human-readable device name |
| device_type | `String` | | "GPU", "NPU", "TPU" |
| host_id | `String` | | Host-level device identifier |
| detail | `HashMap<String,String>` | | Vendor-specific extras (driver version, PCI bus ID, ROCm version) |
| utilization_pct | `f64` | % | Core utilization; range 0.0–100.0 |
| vram_total_bytes | `u64` | bytes | |
| vram_used_bytes | `u64` | bytes | |
| vram_used_pct | `f64` | % | vram_used / vram_total × 100; 0.0 when total == 0 |
| temperature_celsius | `u32` | °C | Die temperature |
| power_watts | `f64` | W | NVML reports mW; converted to W |
| frequency_mhz | `u32` | MHz | Core/graphics clock |
| core_count | `Option<u32>` | count | Shader/compute cores; None if not reported |

AMD-specific: When /sys/module/amdgpu does not exist the AMD collection path MUST be skipped entirely (no panic).

NVIDIA-specific: power_watts = raw NVML milliwatt value / 1000.
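A sketch of the NVIDIA path against the nvml-wrapper API, with the return type simplified to a tuple for illustration; on a CPU-only host the init failure yields an empty Vec, matching T-GPU-01:

```rust
use nvml_wrapper::Nvml;

// Returns (uuid, utilization %, vram_total, vram_used, watts) per GPU.
fn collect_nvidia() -> Vec<(String, f64, u64, u64, f64)> {
    let Ok(nvml) = Nvml::init() else { return Vec::new(); };
    let count = nvml.device_count().unwrap_or(0);
    let mut out = Vec::new();
    for i in 0..count {
        let Ok(dev) = nvml.device_by_index(i) else { continue; };
        let util = dev.utilization_rates().map(|u| u.gpu as f64).unwrap_or(0.0);
        let (total, used) = dev
            .memory_info()
            .map(|m| (m.total, m.used))
            .unwrap_or((0, 0));
        // NVML reports milliwatts; the spec stores watts.
        let watts = dev.power_usage().map(|mw| mw as f64 / 1000.0).unwrap_or(0.0);
        out.push((dev.uuid().unwrap_or_default(), util, total, used, watts));
    }
    out
}
```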

Verifiable GpuMetrics Tests:

  • T-GPU-01: On a CPU-only host, gpu Vec is empty and no error is returned.
  • T-GPU-02: utilization_pct is in [0.0, 100.0] for each GPU.
  • T-GPU-03: vram_used_bytes ≤ vram_total_bytes for each GPU.
  • T-GPU-04: vram_used_pct is 0.0 when vram_total_bytes == 0.
  • T-GPU-05: On a host with AMD GPU, uuid equals the PCI bus address string.

7. Output Formats

7.1 JSON Lines (default)

Each sample is emitted as a single JSON object followed by \n. The binary MUST include a version field keyed as "<crate-name>-version" with the value being the Cargo package version string.

Example (abbreviated):

{"timestamp_secs":1743300000,"job_name":null,"cpu":{...},"memory":{...},"network":[...],"disk":[...],"gpu":[],"resource-tracker-version":"0.1.0"}

Requirements:

  • T-OUT-01: Each line MUST be valid JSON parseable with any standard JSON library.
  • T-OUT-02: timestamp_secs MUST be present and be a positive integer.
  • T-OUT-03: The version key "resource-tracker-version" MUST be present.
  • T-OUT-04: Consecutive samples MUST have non-decreasing timestamp_secs.

7.2 CSV Format

CSV is the primary and required output format for Sentinel S3 streaming (Section 9.2.2). It uses the same column names and units as the Python resource-tracker so the Sentinel backend can ingest both without schema changes. When uploaded to S3 the CSV content MUST be gzip-compressed and the object key MUST carry the extension .csv.gz.

When --format csv is selected for stdout output, the raw (uncompressed) CSV bytes are written. Gzip compression is applied only when writing the S3 batch upload payload (Section 9.2.2).

When --format csv is selected:

  • The header line MUST be emitted exactly once, before the first data row.
  • The header MUST match the following column names in this exact order:
timestamp,processes,utime,stime,cpu_usage,memory_free,memory_used,memory_buffers,memory_cached,memory_active,memory_inactive,disk_read_bytes,disk_write_bytes,disk_space_total_gb,disk_space_used_gb,disk_space_free_gb,net_recv_bytes,net_sent_bytes,gpu_usage,gpu_vram,gpu_utilized

Column definitions:

| CSV Column | Source Field | Unit | Computation |
|---|---|---|---|
| timestamp | timestamp_secs | Unix seconds | direct |
| processes | cpu.process_count | count | direct |
| utime | cpu.utime_secs | seconds | direct; 3 decimal places |
| stime | cpu.stime_secs | seconds | direct; 3 decimal places |
| cpu_usage | cpu.utilization_pct | fractional cores | utilization_pct directly; field is already in fractional cores (0..N_cores); 4 decimal places |
| memory_free | memory.free_mib | MiB | direct |
| memory_used | memory.used_mib | MiB | direct |
| memory_buffers | memory.buffers_mib | MiB | direct |
| memory_cached | memory.cached_mib | MiB | direct |
| memory_active | memory.active_mib | MiB | direct |
| memory_inactive | memory.inactive_mib | MiB | direct |
| disk_read_bytes | disk subsystem | bytes | Σ read_bytes_per_sec × interval_secs across all devices; integer |
| disk_write_bytes | disk subsystem | bytes | Σ write_bytes_per_sec × interval_secs across all devices; integer |
| disk_space_total_gb | disk mounts | GB (10⁹) | Σ total_bytes / 1_000_000_000 across all mounts; 6 decimal places |
| disk_space_used_gb | disk mounts | GB (10⁹) | disk_space_total_gb − disk_space_free_gb; 6 decimal places |
| disk_space_free_gb | disk mounts | GB (10⁹) | Σ available_bytes / 1_000_000_000 across all mounts; 6 decimal places |
| net_recv_bytes | network subsystem | bytes | Σ rx_bytes_per_sec × interval_secs across all interfaces; integer |
| net_sent_bytes | network subsystem | bytes | Σ tx_bytes_per_sec × interval_secs across all interfaces; integer |
| gpu_usage | gpu subsystem | fractional GPUs | Σ utilization_pct / 100 across all GPUs; 4 decimal places |
| gpu_vram | gpu subsystem | MiB | Σ vram_used_bytes / 1_048_576; 4 decimal places |
| gpu_utilized | gpu subsystem | count | count of GPUs where utilization_pct > 0.0 |
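For illustration, a sketch of one such aggregation (the disk_read_bytes column; function name illustrative):

```rust
// Sum the per-device read rates and scale back to bytes per interval,
// truncating to an integer as the column definition requires.
fn disk_read_bytes(read_rates_bytes_per_sec: &[f64], interval_secs: u64) -> u64 {
    (read_rates_bytes_per_sec.iter().sum::<f64>() * interval_secs as f64) as u64
}
```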

Verifiable CSV Tests:

  • T-CSV-01: Header is emitted exactly once, as the first line.
  • T-CSV-02: Column count per data row equals column count in header.
  • T-CSV-03: cpu_usage column equals utilization_pct directly (field is already fractional cores, 0..N_cores) to 4 dp.
  • T-CSV-04: disk_space_used_gb = disk_space_total_gb − disk_space_free_gb for all rows.
  • T-CSV-05: CSV output for a given sample is byte-for-byte reproducible (deterministic).
  • T-CSV-06: No trailing commas; no quoted fields (all values are numbers or bare identifiers).

8. Host and Cloud Discovery

The binary SHOULD collect machine-level metadata once at startup and include it in the Sentinel API registration payload (Section 9.1). Collected fields use the prefix host_ or cloud_.

8.1 Host Discovery

All fields are optional; collection failure MUST be silently swallowed.

| Field | Type | Source |
|---|---|---|
| host_id | `Option<String>` | AWS: /sys/class/dmi/id/board_asset_tag; fallback: /etc/machine-id |
| host_name | `Option<String>` | gethostname(3) |
| host_ip | `Option<String>` | First non-loopback IPv4 from getifaddrs(3) |
| host_allocation | `Option<String>` | "dedicated" or "shared"; heuristic TBD |
| host_vcpus | `Option<u32>` | Count of logical CPUs (/proc/cpuinfo processor entries) |
| host_cpu_model | `Option<String>` | /proc/cpuinfo model name field |
| host_memory_mib | `Option<u64>` | MemTotal / 1024 from /proc/meminfo |
| host_gpu_model | `Option<String>` | First GPU name from GpuCollector |
| host_gpu_count | `Option<u32>` | Length of GPU Vec |
| host_gpu_vram_mib | `Option<u64>` | Sum of vram_total_bytes / 1_048_576 across all GPUs |
| host_storage_gb | `Option<f64>` | Sum of capacity_bytes / 1_000_000_000 across all block devices |

Users MUST be able to suppress any field by setting the corresponding environment variable to "0" or "" (exact mechanism TBD in implementation).

8.2 Cloud Discovery

Cloud metadata is probed by making HTTP GET requests to each cloud provider’s Instance Metadata Service (IMDS) with a short timeout (≤ 2 seconds per provider). Probes MUST be attempted in the background and MUST NOT delay the first sample emission.

| Field | Probe endpoint | Notes |
|---|---|---|
| cloud_vendor_id | AWS: 169.254.169.254/latest/meta-data/; GCP: metadata.google.internal; Azure: 169.254.169.254/metadata/instance | Infer vendor from which endpoint responds |
| cloud_account_id | AWS: /latest/meta-data/identity-credentials/ec2/info | |
| cloud_region_id | AWS: /latest/meta-data/placement/region | |
| cloud_zone_id | AWS: /latest/meta-data/placement/availability-zone | |
| cloud_instance_type | AWS: /latest/meta-data/instance-type | |
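A sketch of one probe, written against the ureq 2-style builder API (the pinned ureq 3 exposes equivalent agent-level timeout configuration); the endpoint is the AWS instance-type path from the table above:

```rust
use std::time::Duration;

// Probe the AWS IMDS with a hard 2-second cap (T-CLD-02); any failure
// simply yields None so startup is never blocked on a non-cloud host.
fn probe_aws_instance_type() -> Option<String> {
    let agent = ureq::builder()
        .timeout(Duration::from_secs(2))
        .build();
    agent
        .get("http://169.254.169.254/latest/meta-data/instance-type")
        .call()
        .ok()
        .and_then(|resp| resp.into_string().ok())
}
```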

Verifiable Cloud Discovery Tests:

  • T-CLD-01: On a non-cloud host, all cloud_* fields are None and the binary does not hang for more than 5 seconds total on startup.
  • T-CLD-02: IMDS probe timeout is ≤ 2 seconds per provider.

9. Sentinel API Streaming (Extra Component)

Activation is gated on the SENTINEL_API_TOKEN environment variable being set.

Resolved design decisions:

  1. Streaming is enabled automatically whenever SENTINEL_API_TOKEN is set; no additional flag needed.
  2. Upload format is csv.gz only; jsonl.gz is not supported.
  3. Streaming is not separately configurable via TOML or CLI beyond the token env var.
  4. On network unavailability: start_run logs a warning and disables streaming; local stdout output continues normally (see Section 11 error handling).

9.1 Authentication

The binary MUST read the API token from the environment variable SENTINEL_API_TOKEN. Every Sentinel API request MUST include the HTTP header:

Authorization: Bearer <token>

If SENTINEL_API_TOKEN is not set, all streaming functionality MUST be silently disabled. Local stdout emission continues normally.

9.2 Run Lifecycle

9.2.1 Start of Run

At startup (after host/cloud discovery), the binary MUST POST to the data ingestion endpoint to register a new Run.

POST /runs (default base URL: https://api.sentinel.sparecores.net).

Request payload (JSON, Content-Type: application/json): all metadata, host, and cloud fields are merged into a flat top-level object (no nesting):

{
  "job_name": "...",
  "project_name": "...",
  "pid": 12345,
  "host_vcpus": 8,
  "cloud_vendor_id": "aws",
  ...
}

Response fields the binary MUST store:

| Response Field | Type | Usage |
|---|---|---|
| run_id | String | Referenced in all subsequent API calls |
| upload_uri_prefix | String | S3 URI prefix for metric uploads |
| upload_credentials.access_key | String | STS credential |
| upload_credentials.secret_key | String | STS credential |
| upload_credentials.session_token | String | STS credential |
| upload_credentials.expiration | String (ISO 8601) | STS credential expiry; optional |

9.2.2 Batch Upload (Background Thread)

The binary MUST start a background thread that:

  1. Every 60 seconds (configurable, default 60), takes all samples collected since the previous upload.
  2. Serializes them as CSV (same column layout as Section 7.2) – CSV is the only accepted format for the Sentinel S3 bucket.
  3. Gzip-compresses the CSV bytes.
  4. Generates a unique S3 object key under upload_uri_prefix: <upload_uri_prefix>/<run_id>/<batch_seq_number>.csv.gz
  5. Uploads via AWS Signature V4 (Section 10).
  6. Appends the uploaded URI to an internal list uploaded_uris.

If STS credentials are within 5 minutes of expiration, the binary MUST refresh them by POSTing to /runs/{run_id}/refresh-credentials before attempting the upload.

Upload failures MUST be retried at least once with exponential back-off before being recorded as errors. After 3 consecutive upload failures the background thread MUST log a warning and continue buffering (data is not lost).
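A sketch of this retry policy (helper signature illustrative):

```rust
use std::time::Duration;

// Exponential back-off between attempts; surfaces the last error
// once the attempts are exhausted so the caller can log and continue.
fn upload_with_retry<F>(mut upload: F, max_attempts: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut delay = Duration::from_secs(1);
    let mut last_err = String::from("no attempts made");
    for _ in 0..max_attempts {
        match upload() {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = e;
                std::thread::sleep(delay);
                delay *= 2; // exponential back-off: 1s, 2s, 4s, ...
            }
        }
    }
    Err(last_err)
}
```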

Verifiable Streaming Tests:

  • T-STR-01: Without SENTINEL_API_TOKEN, no HTTP connection is made.
  • T-STR-02: A batch upload request contains Content-Encoding: gzip and the body decompresses to valid CSV (csv.gz is the only supported upload format).
  • T-STR-03: uploaded_uris contains the S3 URIs of all successfully uploaded batches.
  • T-STR-04: Credential refresh is triggered when ≤ 5 minutes remain before the credential expiration (upload_credentials.expiration).

9.2.3 End of Run

When the tracked process terminates (or the binary receives SIGTERM), the binary MUST:

SIGINT note: An explicit SIGINT handler is not installed. When the binary is used in shell-wrapper mode, Ctrl-C is delivered to the entire process group, so both the child and the tracker receive SIGINT and exit together. Explicit SIGTERM forwarding to the child process is a future enhancement.

  1. Flush any remaining samples as a final batch upload (if uploaded_uris is non-empty).
  2. POST to /runs/{run_id}/finish to close the Run, including:
    • run_id
    • exit_code (i32, if tracked process exited cleanly; else None)
    • run_status enum: "finished" (exit 0 or SIGTERM) or "failed" (non-zero exit)
    • data_source:
      • "s3" + data_uris: Vec<String> if any S3 uploads succeeded.
      • "inline" + data_csv: <base64(gzip(csv))> for short runs with no S3 uploads.

Verifiable End-of-Run Tests:

  • T-EOR-01: On SIGTERM, the binary exits with code 0 after flushing remaining data.
  • T-EOR-02: The close-run request body contains run_id matching the start-run response.
  • T-EOR-03: data_source is "inline" when no S3 uploads occurred.
  • T-EOR-04: data_source is "s3" when at least one S3 upload succeeded.

9.3 Metadata Fields

The following metadata MAY be supplied by the user via CLI flags or environment variables. All are optional strings unless noted.

| Field | CLI Flag | Env Variable |
|---|---|---|
| job_name | --job-name | TRACKER_JOB_NAME |
| project_name | --project-name | TRACKER_PROJECT_NAME |
| stage_name | --stage-name | TRACKER_STAGE_NAME |
| task_name | --task-name | TRACKER_TASK_NAME |
| team | --team | TRACKER_TEAM |
| env | --env | TRACKER_ENV |
| language | --language | TRACKER_LANGUAGE |
| orchestrator | --orchestrator | TRACKER_ORCHESTRATOR |
| executor | --executor | TRACKER_EXECUTOR |
| external_run_id | --external-run-id | TRACKER_EXTERNAL_RUN_ID |
| container_image | --container-image | TRACKER_CONTAINER_IMAGE |

Users MUST also be able to supply arbitrary key-value tags via repeated --tag key=value flags.


10. S3 Upload — AWS Signature V4

The upload is implemented in pure Rust without any AWS SDK dependency (zero additional transitive deps for this path). The implementation mirrors the Python s3_upload.py module from PR #9.

10.1 URI Parsing

An S3 URI has the form s3://bucket/path/to/object. Parsing MUST:

  • Require scheme == "s3".
  • Require a non-empty bucket name.
  • Require a non-empty key (path after bucket).
  • Return an error for any other form.
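A sketch matching these rules and tests T-S3-01 through T-S3-03:

```rust
// Split "s3://bucket/path/to/object" into (bucket, key), rejecting
// any other scheme, an empty bucket, or an empty key.
fn parse_s3_uri(uri: &str) -> Result<(String, String), String> {
    let rest = uri
        .strip_prefix("s3://")
        .ok_or_else(|| format!("not an s3:// URI: {uri}"))?;
    let (bucket, key) = rest
        .split_once('/')
        .ok_or_else(|| format!("missing key in {uri}"))?;
    if bucket.is_empty() || key.is_empty() {
        return Err(format!("empty bucket or key in {uri}"));
    }
    Ok((bucket.to_string(), key.to_string()))
}
```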

10.2 Bucket Region Detection

If the upload region is not supplied, the binary MUST determine it by sending an HTTP HEAD request to https://<bucket>.s3.amazonaws.com/ and reading the x-amz-bucket-region response header. The header is present even on 3xx/4xx responses. Results MUST be cached in-process for the lifetime of the run. Default fallback: "eu-central-1".

10.3 Request Construction

A PUT request to https://<bucket>.s3.<region>.amazonaws.com/<key> with:

  • Content-Length: byte count of body.
  • x-amz-content-sha256: SHA-256 hex of body.
  • x-amz-date: YYYYMMDDTHHMMSSZ (UTC).
  • x-amz-security-token: STS session token.
  • Authorization: AWS4-HMAC-SHA256 signature (see Section 10.4).

10.4 AWS Signature V4

Signing key derivation:

kDate    = HMAC-SHA256("AWS4" + secret_key, date_stamp)
kRegion  = HMAC-SHA256(kDate, region)
kService = HMAC-SHA256(kRegion, "s3")
kSigning = HMAC-SHA256(kService, "aws4_request")
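A sketch of this derivation using the hmac and sha2 crates (written against the hmac 0.12-style Mac API; the pinned release candidate may differ slightly):

```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

fn hmac_sha256(key: &[u8], data: &[u8]) -> Vec<u8> {
    let mut mac = HmacSha256::new_from_slice(key).expect("HMAC accepts any key length");
    mac.update(data);
    mac.finalize().into_bytes().to_vec()
}

// kDate -> kRegion -> kService -> kSigning, exactly as listed above.
fn derive_signing_key(secret_key: &str, date_stamp: &str, region: &str) -> Vec<u8> {
    let k_date = hmac_sha256(format!("AWS4{secret_key}").as_bytes(), date_stamp.as_bytes());
    let k_region = hmac_sha256(&k_date, region.as_bytes());
    let k_service = hmac_sha256(&k_region, b"s3");
    hmac_sha256(&k_service, b"aws4_request")
}
```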

Canonical request:

PUT
/<key>

host:<bucket>.s3.<region>.amazonaws.com
x-amz-content-sha256:<payload_hash>
x-amz-date:<amz_date>
x-amz-security-token:<session_token>

host;x-amz-content-sha256;x-amz-date;x-amz-security-token
<payload_hash>

String to sign:

AWS4-HMAC-SHA256
<amz_date>
<date_stamp>/<region>/s3/aws4_request
<canonical_request_sha256>

Authorization header:

AWS4-HMAC-SHA256 Credential=<access_key>/<credential_scope>, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token, Signature=<hex_sig>

10.5 Upload Success Criteria

HTTP 200 or 201 response from S3 = success. Any other status = error (with response body included in the error message).

10.6 Verifiable S3 Upload Tests

  • T-S3-01: parse_s3_uri("s3://bucket/path/obj") returns ("bucket", "path/obj").
  • T-S3-02: parse_s3_uri("https://bucket/path") returns an error.
  • T-S3-03: parse_s3_uri("s3://bucket/") returns an error (empty key).
  • T-S3-04: Given known access_key, secret_key, session_token, region, and a fixed timestamp, the generated Authorization header MUST match a pre-computed golden value.
  • T-S3-05: Bucket region cache prevents duplicate HEAD requests for the same bucket.
  • T-S3-06: An upload to a mock S3 server returns the S3 URI on success.

11. Error Handling

| Scenario | Required behavior |
|---|---|
| /proc file is unreadable for a single metric | Return 0 / None for that field; do not abort |
| GPU library absent | GPU Vec is empty; no error propagated |
| Sentinel API unreachable at start | Log warning; streaming disabled; local output continues |
| S3 upload fails | Retry once; after 3 consecutive failures log warning and continue |
| Config TOML parse error | Silently fall back to defaults |
| --interval 0 | Exit with code ≠ 0 before starting collectors |
| Tracked PID not found | process_cores_used = None; do not abort |

The binary MUST NEVER panic in production code. expect() is only permissible during development; all expect() calls MUST be replaced with proper error handling before v1.0 release.


12. Non-Functional Requirements

| Requirement | Target |
|---|---|
| Binary size | < 15 MiB stripped (CPU-only build) |
| Startup latency | < 1 × configured interval before first sample |
| CPU overhead of the tracker itself | < 1% of one core at 1-second interval on a 4-core host |
| Memory footprint | < 20 MiB RSS at steady state |
| Stdout buffering | Each line MUST be flushed atomically (no partial lines) |
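A sketch of the atomic line-emission requirement from the table above (function name illustrative):

```rust
use std::io::Write;

// Take the stdout lock once per sample so a line is never interleaved
// or partially flushed.
fn emit_line(json_line: &str) -> std::io::Result<()> {
    let stdout = std::io::stdout();
    let mut handle = stdout.lock();
    writeln!(handle, "{json_line}")?;
    handle.flush()
}
```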

13. Compatibility with Python resource-tracker

The CSV output format MUST maintain byte-for-byte column-name compatibility with the Python SystemTracker output so that the Sentinel API backend can ingest both without schema changes.

Confirmed equivalent columns (see Section 7.2 for derivation):

| Python column | Rust CSV column | Python unit | Rust unit |
|---|---|---|---|
| timestamp | timestamp | Unix seconds | Unix seconds |
| processes | processes | count | count |
| utime | utime | seconds | seconds |
| stime | stime | seconds | seconds |
| cpu_usage | cpu_usage | fractional cores | fractional cores |
| memory_free | memory_free | MiB | MiB |
| memory_used | memory_used | MiB | MiB |
| memory_buffers | memory_buffers | MiB | MiB |
| memory_cached | memory_cached | MiB | MiB |
| memory_active | memory_active | MiB | MiB |
| memory_inactive | memory_inactive | MiB | MiB |
| disk_read_bytes | disk_read_bytes | bytes/interval | bytes/interval |
| disk_write_bytes | disk_write_bytes | bytes/interval | bytes/interval |
| disk_space_total_gb | disk_space_total_gb | GB (10⁹) | GB (10⁹) |
| disk_space_used_gb | disk_space_used_gb | GB (10⁹) | GB (10⁹) |
| disk_space_free_gb | disk_space_free_gb | GB (10⁹) | GB (10⁹) |
| net_recv_bytes | net_recv_bytes | bytes/interval | bytes/interval |
| net_sent_bytes | net_sent_bytes | bytes/interval | bytes/interval |
| gpu_usage | gpu_usage | fractional GPUs | fractional GPUs |
| gpu_vram | gpu_vram | MiB | MiB |
| gpu_utilized | gpu_utilized | count | count |

Verifiable compatibility test:

  • T-COMPAT-01: Run Python and Rust trackers in parallel on the same host for 60 seconds. For each interval, the difference between corresponding scalar columns MUST be within 5% of the Python value (allowing for measurement-time skew).

14. Open Questions / Future Work

  1. eBPF integration: Using aya-rs or libbpf-rs for sub-millisecond tracing (CPU saturation, IPC, TLB misses, cache hit rates) — currently considered v2.
  2. Process-level memory (PSS): Preferred over RSS; requires reading /proc/<pid>/smaps_rollup which may be slow for large processes.
  3. Per-process disk and network I/O: /proc/<pid>/io and network namespaces; currently only system-wide.
  4. Configurable metric suppression: Allow users to opt out of fields containing PII (e.g. host_ip, hostname).
  5. ARM-specific GPU support: Apple Metal not in scope (Linux only); Qualcomm Adreno / Mali GPU metrics TBD.
  6. Static linking of NVML: Currently not possible; NVML requires a dynamically loaded vendor library.
  7. Heartbeat endpoint: Periodic ping to Sentinel API while tracking is active (distinct from batch S3 uploads).

Project Dependencies

This is a Rust project requiring the Rust toolchain, including cargo, Rust's build system and package manager.

In addition to the base toolchain, this project also makes use of the following:

| Tool | Description | Rationale |
|---|---|---|
| uv | An extremely fast Python package and project manager | Solely for benchmarking against the Python implementation |
| just | A handy way to save and run project-specific commands | Convenience |
| jq | A handy way to slice and filter JSON output | Convenience tool for JSON and JSONL |
| mdbook | A tool to create books with Markdown | This project is documented via mdbook |

Rust Crate Dependencies

Dependencies are declared in Cargo.toml and managed by cargo.

Runtime dependencies

| Crate | Version | Purpose |
|---|---|---|
| nvml-wrapper | 0.12 | NVIDIA GPU monitoring via NVML; loaded at runtime with libloading – no build-time system deps; returns empty on non-NVIDIA hosts |
| clap | 4 | CLI argument parsing; stripped to derive, std, help, usage, error-context, env features only |
| procfs | 0.18 | Linux /proc parsing for CPU, memory, disk, and network metrics |
| ureq | 3 | Lightweight synchronous HTTP client for Sentinel API and S3 PUT; avoids tokio runtime overhead |
| serde | 1 | Serialization/deserialization framework with derive macros |
| serde_json | 1 | JSON serialization for metric output and API payloads |
| toml | 1.0 | TOML config file parsing; parse + serde features only, no display overhead |
| hmac | 0.13.0-rc.6 | HMAC-SHA256 for manual AWS Signature Version 4 signing of S3 PUT requests |
| sha2 | 0.11.0 | SHA-256 hashing required by AWS Sig V4; paired with hmac |
| hex | 0.4 | Hex encoding of HMAC digests for Sig V4 canonical request construction |
| libc | 0.2 | FFI bindings for statvfs (filesystem space), gethostname, and SIGTERM signal handling |
| flate2 | 1.1.9 (pinned) | Gzip compression for .csv.gz S3 batch uploads; rust_backend feature uses pure Rust (no zlib-sys C dep) |
| libamdgpu_top | 0.11.2 | AMD GPU monitoring via libdrm; libdrm_dynamic_loading feature loads the library at runtime – gracefully skipped on non-AMD hosts |

Dev dependencies

| Crate | Version | Purpose |
|---|---|---|
| num_cpus | 1 | Smoke tests: verifies cpu.utilization_pct is expressed as fractional cores (bounded by logical CPU count), not a percentage |

resource-tracker — Design Notes

Spec Summary

  1. Linux resource tracker (x86 + ARM), using procfs where appropriate
  2. Configurable polling interval for: CPU, memory, GPU, VRAM, network in/out, disk read/write
  3. GPU support requires dynamic linking (no static link)
  4. CLI tool with optional params (job name/metadata); TOML config file with sane defaults
  5. Basic HTTP client: hit API endpoints at start, stop, and every X minutes (heartbeat)
  6. Lightweight S3 PUT using AWS creds to stream resource utilization data

Dependency Assessment

Current Cargo.toml dependencies

| Crate | Version | Purpose |
|---|---|---|
| nvml-wrapper | 0.12 | NVIDIA GPU/VRAM monitoring via NVML; runtime dynamic loading |
| libamdgpu_top | 0.11.2, no defaults, libdrm_dynamic_loading | AMD GPU monitoring via libdrm; runtime dynamic loading |
| clap | 4, no defaults, derive+std+help+usage+error-context+env | CLI argument parsing, minimal footprint |
| procfs | 0.18, serde feature only | Linux /proc – CPU, memory, network, disk |
| ureq | 3, json feature | Lightweight sync HTTP – no tokio, no async runtime |
| serde | 1, derive | Serialization/deserialization |
| serde_json | 1 | JSON payload encoding for API and S3 |
| toml | 1.0, no defaults, parse+serde features | TOML config file parsing |
| hmac | 0.13.0-rc.6 | AWS Signature V4 HMAC signing |
| sha2 | 0.11.0 | SHA-256 hashing for AWS Sig V4 |
| hex | 0.4 | Hex encoding for AWS Sig V4 signature |
| libc | 0.2 | statvfs for filesystem space, gethostname, SIGTERM |
| flate2 | =1.1.9 (pinned), no defaults, rust_backend | Gzip compression for S3 batch uploads; pure Rust, no zlib-sys |

Release profile

[profile.release]
opt-level = "z"      # optimize for size
lto = true           # link-time optimization
codegen-units = 1    # better dead-code elimination
strip = true         # strip symbols
panic = "abort"      # smaller panic handler

Key decisions

  • nvml-wrapper + libamdgpu_top over all-smi: all-smi required protoc at build time. Replaced with nvml-wrapper (NVIDIA, no build-time deps) and libamdgpu_top with libdrm_dynamic_loading (AMD, runtime-only). Both load their respective drivers at runtime and degrade gracefully when absent.
  • ureq over reqwest: reqwest v0.13 pulls in tokio (full async runtime), hyper, and TLS stacks – adds ~5-10 MB. ureq v3 is synchronous, no runtime, comparable API surface.
  • procfs features trimmed: Dropped chrono (heavy date/time lib, std::time suffices) and flate2 (only needed for gzip-compressed /proc files, which are rare).
  • clap defaults disabled: Default clap features include terminal color, unicode width, etc. Stripped to the functional minimum; env feature added to support TRACKER_* environment variable overrides.
  • Manual AWS Sig V4 (hmac + sha2 + hex): Avoids aws-sdk-s3 (~50+ transitive deps, large binary). S3 PUT only needs ~100-150 lines of signing logic (see the key-derivation sketch after this list).
  • toml v1.0 defaults disabled: parse + serde features; serde feature required for toml::from_str deserialization into config structs.
  • flate2 pinned to =1.1.9 with rust_backend: Pure Rust gzip implementation; avoids a zlib-sys C build dependency. Version pinned to prevent unexpected breakage from pre-1.0 semver.
  • libc for sysfs/POSIX calls: statvfs for filesystem space, gethostname for host identity, and SIGTERM signal handling – pure FFI bindings with no additional binary size overhead.
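
A hedged sketch of the Sig V4 key-derivation chain those three crates exist for (API shown as in the hmac 0.12 line; the pinned 0.13 RC may differ slightly):

use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

fn hmac_sha256(key: &[u8], msg: &[u8]) -> Vec<u8> {
    // HMAC accepts keys of any length, so this cannot fail in practice
    let mut mac = HmacSha256::new_from_slice(key).expect("any key length is valid");
    mac.update(msg);
    mac.finalize().into_bytes().to_vec()
}

// AWS Sig V4 signing key: chained HMACs over date, region, and service.
// The final signature is hex::encode(hmac_sha256(&key, string_to_sign)).
fn signing_key(secret_key: &str, date: &str, region: &str, service: &str) -> Vec<u8> {
    let k_date = hmac_sha256(format!("AWS4{secret_key}").as_bytes(), date.as_bytes());
    let k_region = hmac_sha256(&k_date, region.as_bytes());
    let k_service = hmac_sha256(&k_region, service.as_bytes());
    hmac_sha256(&k_service, b"aws4_request")
}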

Implementation Approaches

Option A — Single-file polling loop

All logic in main.rs. One tight loop: sleep → collect → diff deltas → buffer → flush.

main.rs
 ├── CLI parsing (clap)
 ├── Config loading (toml)
 ├── Polling loop
 │    ├── procfs → CPU/mem/net/disk snapshots + delta computation
 │    └── nvml-wrapper / libamdgpu_top → GPU/VRAM snapshots
 │    └── Vec<Sample> batch buffer
 ├── HTTP calls (ureq) — start / stop / heartbeat
 └── AWS Sig V4 signing + ureq PUT (inline)

Pros:

  • Simplest to read and audit end-to-end
  • Zero abstraction overhead
  • Fastest to prototype

Cons:

  • main.rs grows large and hard to navigate
  • No isolation between collectors — hard to unit test
  • Tight coupling makes it hard to disable/swap individual collectors

Best for: MVP / proof of concept.


Option B — Module-per-resource + collector trait (current)

A Collector trait drives a scheduler. Each resource lives in its own module with its own delta state.

src/
 ├── main.rs            — CLI, config, scheduler loop
 ├── config.rs          — TOML config struct + CLI override merge
 ├── sample.rs          — Sample / Report structs (serde)
 ├── collector/
 │    ├── mod.rs        — Collector trait: fn collect(&mut self) -> Metric
 │    ├── cpu.rs        — procfs::CpuTime, delta between ticks
 │    ├── memory.rs     — procfs::Meminfo
 │    ├── network.rs    — procfs::Net, bytes delta
 │    ├── disk.rs       — procfs::DiskStats, read/write delta
 │    └── gpu.rs        — nvml-wrapper + libamdgpu_top wrapper
 └── reporter/
      ├── mod.rs        — Reporter trait: fn report(&self, batch: &[Sample])
      ├── http.rs       — ureq: start/stop/heartbeat endpoints
      └── s3.rs         — AWS Sig V4 + ureq PUT (batch upload)

Collector trait sketch:

pub trait Collector {
    fn collect(&mut self) -> Metric;
}

Reporter trait sketch:

pub trait Reporter {
    fn on_start(&self, meta: &JobMeta);
    fn on_sample(&self, batch: &[Sample]);
    fn on_stop(&self, meta: &JobMeta);
}
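
A rough sketch of how the scheduler loop in main.rs could drive these traits (the collector type names, Sample::from_metrics, and interval are hypothetical):

let reporters: Vec<Box<dyn Reporter>> = vec![/* http, s3, local output ... */];
let mut collectors: Vec<Box<dyn Collector>> = vec![
    Box::new(CpuCollector::default()),
    Box::new(MemoryCollector::default()),
    // network, disk, gpu collectors ...
];
loop {
    std::thread::sleep(interval);
    // each collector owns its delta state and returns one Metric per tick
    let metrics: Vec<Metric> = collectors.iter_mut().map(|c| c.collect()).collect();
    let batch = [Sample::from_metrics(metrics)];
    for reporter in &reporters {
        reporter.on_sample(&batch);
    }
}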

Pros:

  • Each collector independently testable with mock /proc data
  • Clean ownership: delta state lives inside each collector struct
  • Easy to add/remove resources without touching other collectors
  • Reporter abstraction allows multiple outputs (HTTP + S3 simultaneously)

Cons:

  • Slightly more upfront boilerplate (trait definitions, module layout)
  • Minor indirection vs. inline code

Best for: Production implementation. Right level of structure for the spec.


Option C — Config-driven pipeline with Cargo feature flags

Extends Option B with #[cfg(feature = "...")] gates. GPU collector is behind feature = "gpu" since it requires dynamic linking. This enables a statically-linked build for non-GPU targets.

[features]
default = ["gpu", "s3", "http"]
gpu     = ["dep:nvml-wrapper", "dep:libamdgpu_top"]
s3      = []
http    = []

src/
 ├── main.rs
 ├── config.rs
 ├── sample.rs
 ├── collector/
 │    ├── cpu.rs
 │    ├── memory.rs
 │    ├── network.rs
 │    ├── disk.rs
 │    └── gpu.rs          — #[cfg(feature = "gpu")]
 └── reporter/
      ├── http.rs         — #[cfg(feature = "http")]
      └── s3.rs           — #[cfg(feature = "s3")]

Build variants:

# Full build (default)
cargo build --release

# No GPU — allows static linking (musl target)
cargo build --release --no-default-features --features http,s3
cargo build --release --target x86_64-unknown-linux-musl --no-default-features --features http,s3

# Minimal — metrics only, no reporting
cargo build --release --no-default-features

Pros:

  • Truly minimal binary for constrained/embedded/container targets
  • Static linking possible when GPU excluded
  • Clean separation of optional functionality

Cons:

  • #[cfg(...)] gates add noise throughout the code
  • More complex CI/build matrix (multiple feature combinations to test)
  • Premature if targets are homogeneous

Best for: Distributing to heterogeneous environments — e.g., some hosts have GPUs, some don’t; or when a stripped container image is a requirement.


Status

Implement Option B first. This provides the right structure for the spec without over-engineering. The Collector and Reporter traits give clean boundaries for testing and future extension.

Option C’s feature-flag layer can be added on top of B later with minimal refactoring; the module boundaries are already in place.

Implementation order (Option B)

  1. config.rs — TOML struct + CLI merge (clap + toml)
  2. sample.rs — data model (serde + serde_json)
  3. collector/cpu.rs, memory.rs, network.rs, disk.rs — procfs collectors
  4. collector/gpu.rs — nvml-wrapper + libamdgpu_top wrapper
  5. reporter/http.rs — ureq start/stop/heartbeat
  6. reporter/s3.rs — AWS Sig V4 + ureq PUT
  7. main.rs — wire scheduler loop

Benchmarks

Comparison with https://github.com/SpareCores/resource-tracker

Status

The Rust binary collects every field that Python’s SystemTracker emits, and emits them as either JSON Lines (default) or CSV (--format csv).

The CSV output has parity with Python for all columns (same names, units, and computation formulas). The JSON output is a strict superset – it carries all CSV fields plus additional metrics not available in Python.


CSV Column Mapping

| Column | Python formula | Rust CSV source | Unit | Parity? |
|---|---|---|---|---|
| timestamp | time.time() (float) | timestamp_secs (integer) | Unix seconds | approx (see note 1) |
| processes | count of all /proc/[0-9]+ entries | cpu.process_count – same /proc count | count | yes |
| utime | per-interval delta(user+nice ticks) / ticks_per_sec | cpu.utime_secs – same delta calculation | seconds/interval | yes |
| stime | per-interval delta(system ticks) / ticks_per_sec | cpu.stime_secs – same delta calculation | seconds/interval | yes |
| cpu_usage | fractional cores (0..N) | cpu.utilization_pct directly (field is already fractional cores) | fractional cores | yes |
| memory_free | MemFree from /proc/meminfo | memory.free_mib (MemFree / 1,048,576) | MiB | yes |
| memory_used | MemTotal - MemFree - Buffers - (Cached+SReclaimable) | memory.used_mib – same formula | MiB | yes |
| memory_buffers | Buffers | memory.buffers_mib | MiB | yes |
| memory_cached | Cached + SReclaimable | memory.cached_mib – same formula | MiB | yes |
| memory_active | Active | memory.active_mib | MiB | yes |
| memory_inactive | Inactive | memory.inactive_mib | MiB | yes |
| disk_read_bytes | per-interval delta(sectors_read) × sector_size, all non-partition diskstats entries | sum of rate × interval across all /sys/block whole-disk entries | bytes/interval | approx (see note 2) |
| disk_write_bytes | same, write side | same, write side | bytes/interval | approx (see note 2) |
| disk_space_total_gb | sum of all non-virtual mount points (incl. snap/loop) | sum of all mounts under /sys/block devices (incl. loop mounts) | GB | approx (see note 3) |
| disk_space_used_gb | same, total - free (incl. reserved-for-root blocks) | same formula | GB | approx (see note 3) |
| disk_space_free_gb | f_bavail from statvfs | f_bavail from statvfs | GB | approx (see note 3) |
| net_recv_bytes | per-interval delta(rx_bytes) across all interfaces | sum of rate × interval across all interfaces | bytes/interval | yes |
| net_sent_bytes | same, tx side | same, tx side | bytes/interval | yes |
| gpu_usage | fractional GPUs (0..N) | sum gpu[].utilization_pct / 100 | fractional GPUs | yes |
| gpu_vram | used VRAM in MiB | sum gpu[].vram_used_bytes / 1,048,576 | MiB | yes |
| gpu_utilized | count of GPUs with utilization > 0 | count gpu[].utilization_pct > 0 | count | yes |

Documented Semantic Differences

Note 1 – Timestamp precision

Python’s timestamp is a float (sub-second resolution). Rust emits an integer Unix timestamp. When aligning rows for comparison, use a +/-0.5 s tolerance.

Note 2 – Disk I/O: device set and sector size

Both Python and Rust use /proc/diskstats deltas and iterate all whole-disk (non-partition) entries. The device sets should match on most Linux systems.

Python’s device filter (is_partition from resource_tracker.helpers):

# Returns True only for names matching (sd*, nvme*, mmcblk*) partition patterns
# where a parent device exists in /sys/block. Everything else -- including
# loop*, dm-*, zram* -- is treated as a whole-disk device and included.

Rust’s device filter:

// Reads /sys/block/ directory entries into a HashSet.
// Keeps every diskstats entry whose name is a direct /sys/block/<name> entry.
// Logically equivalent to Python's filter: partitions like nvme0n1p1
// appear under /sys/block/nvme0n1/ (not top-level) and are excluded.
let block_set: HashSet<String> = read_dir("/sys/block")...;
let devs = diskstats.filter(|d| block_set.contains(&d.name));

Sector size: both Python and Rust read the actual hardware sector size per device from /sys/block/<dev>/queue/hw_sector_size, falling back to 512 bytes. This was implemented in Rust as P-DSK-SECTOR.

Rationale for explicit sector size: on 4K-native drives the logical sector size is 4,096 bytes; using a hard-coded 512 would under-count I/O bytes by 8x. Reading the actual value from sysfs ensures correctness on all drive types.
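
A sketch of the per-device lookup with the 512-byte fallback described above (helper name hypothetical):

// Read the hardware sector size for a device, falling back to 512 bytes
// when the sysfs attribute is missing or unparsable.
fn sector_size(dev: &str) -> u64 {
    std::fs::read_to_string(format!("/sys/block/{dev}/queue/hw_sector_size"))
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(512)
}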

Note 2a – ZFS volumes

Python’s disk I/O implementation handles ZFS volumes, where disk usage is reported differently at /sys/block. Rust does not currently account for this. ZFS support is a planned enhancement (not required for MVP).

Note 3 – Disk space: mount set

Python sums all mount points that psutil.disk_partitions() reports as non-virtual (including snap squashfs loop mounts). Rust sums all mount points found in /proc/mounts whose source device matches a /sys/block entry.

On systems with many snap packages, Python includes the squashfs read-only mounts for each snap. Because /dev/loop* devices appear in /sys/block, Rust’s mounts_for_device("loopN") will pick these up too. However, psutil may enumerate mount points that are not under /dev/ (e.g., tmpfs, overlay, cgroup2) which Rust’s /dev/<device> prefix filter skips. This can cause small differences in disk_space_total_gb on container hosts or systems with unusual mount configurations.

To investigate: run mount | grep -v '^/dev' | grep -v ' type tmpfs' to see which mount points Python may be counting that Rust is not.


Running the comparison

Prerequisites

  • uv >= 0.9 (Astral) – verify with which uv
  • Rust release binary: cargo build --release

Directory layout

benchmarks/
+-- pyproject.toml      # uv project -- resource-tracker dependency
+-- run_python.py       # SystemTracker -> results/python_metrics.csv
+-- run_rust.sh         # resource-tracker --format csv -> results/rust_metrics.csv
+-- compare.py          # merge on timestamp, print diff table
+-- results/            # populated at runtime (gitignore this)
    +-- python_metrics.csv
    +-- rust_metrics.csv

Step 1 – Set up Python environment

cd benchmarks
uv init --no-workspace
uv add resource-tracker

Step 2 – run_python.py

"""Collect SystemTracker metrics for DURATION seconds -> results/python_metrics.csv"""
import time
from resource_tracker import SystemTracker

DURATION = 60
INTERVAL = 1

tracker = SystemTracker(interval=INTERVAL, output_file="results/python_metrics.csv")
time.sleep(DURATION)
tracker.stop()
print(f"Done -> results/python_metrics.csv")

Step 3 – run_rust.sh

#!/usr/bin/env bash
set -euo pipefail
DURATION=60
INTERVAL=1
mkdir -p results
timeout "$DURATION" \
  ../target/release/resource-tracker --interval "$INTERVAL" --format csv \
  > results/rust_metrics.csv || true
echo "Collected $(( $(wc -l < results/rust_metrics.csv) - 1 )) rows -> results/rust_metrics.csv"

Step 4 – compare.py

Strategy:

  1. Load both CSVs, parse timestamp columns.
  2. Differentiate Python’s cumulative I/O columns with diff() to get rates, matching Rust’s per-interval values.
  3. Merge on nearest timestamp (tolerance +/-0.5 x interval).
  4. For each shared metric, report: mean, std, min/max for each side plus mean absolute difference (MAD) and % deviation.
"""Compare python_metrics.csv and rust_metrics.csv side by side."""
import csv, sys
from pathlib import Path

IO_COLS = {"disk_read_bytes", "disk_write_bytes", "net_recv_bytes", "net_sent_bytes"}

def load(path):
    rows = list(csv.DictReader(Path(path).open()))
    return [{k: float(v) if v else 0.0 for k, v in row.items()} for row in rows]

def diff_col(rows, col):
    """Replace cumulative totals with per-row deltas (rate proxy)."""
    for i in range(len(rows) - 1, 0, -1):
        rows[i][col] = rows[i][col] - rows[i-1][col]
    rows[0][col] = 0.0

py  = load("results/python_metrics.csv")
rs  = load("results/rust_metrics.csv")

for col in IO_COLS:
    if col in (py[0] if py else {}):
        diff_col(py, col)

shared_cols = set(py[0]) & set(rs[0]) - {"timestamp"} if py and rs else set()

print(f"{'column':<30} {'py_mean':>12} {'rs_mean':>12} {'MAD':>12} {'%dev':>8}")
print("-" * 80)
for col in sorted(shared_cols):
    py_vals = [r[col] for r in py]
    rs_vals = [r[col] for r in rs]
    py_mean = sum(py_vals) / len(py_vals)
    rs_mean = sum(rs_vals) / len(rs_vals)
    mad = sum(abs(a - b) for a, b in zip(py_vals, rs_vals)) / len(py_vals)
    pct = (mad / py_mean * 100) if py_mean != 0 else float("inf")
    print(f"{col:<30} {py_mean:>12.3f} {rs_mean:>12.3f} {mad:>12.3f} {pct:>7.1f}%")

Results

To be populated after running the benchmark on target hardware.

Fill in: host specs (CPU model, RAM, OS, kernel), Rust git SHA, Python resource-tracker version, output table from compare.py, and observations on where the two implementations agree and diverge.


Remaining known differences

| Aspect | Python | Rust | Status |
|---|---|---|---|
| Timestamp precision | Float (sub-second) | Integer (Unix seconds) | By design; use +/-0.5 s tolerance when aligning rows |
| Disk I/O sector size | Per-device from /sys/block/<dev>/queue/hw_sector_size, fallback 512 | Per-device from same sysfs path, fallback 512 | Implemented (P-DSK-SECTOR); parity achieved |
| Disk space: non-/dev/ mounts | psutil includes overlay/tmpfs/cgroup mounts if reported non-virtual | Only /dev/<device> prefixed sources in /proc/mounts | Low impact on physical hosts; notable on container/VM hosts |
| ZFS volumes | Handled via psutil disk partition enumeration | Not yet implemented | Planned enhancement |

JSON superset fields (not in Python CSV)

The JSON output carries richer data than any Python CSV column can express.

Rationale: the CSV columns match Python for downstream compatibility. The JSON output is the primary format for new consumers and exposes all available data without being constrained by the Python column set.

| Type | Field | Description | Rationale |
|---|---|---|---|
| cpu | cpu.per_core_pct[] | Per-logical-core utilization (0–100 each) | Identify hot cores and NUMA imbalance; not expressible as a single CSV scalar |
| cpu | cpu.process_cores_used | Fractional cores consumed by tracked PID tree | Covers multi-process workloads (workers, MPI ranks); Python tracks only the root process |
| cpu | cpu.process_child_count | Live descendants under tracked root PID | Detect fork/thread storms without external tooling |
| memory | memory.total_mib | Total installed RAM | Baseline for capacity planning |
| memory | memory.available_mib | MemAvailable: free + reclaimable | Better headroom estimate than free_mib alone on systems with large page caches |
| memory | memory.used_pct | RAM usage as a percentage | Convenient derived field; avoids client-side division |
| memory | memory.active_mib / memory.inactive_mib | Active and inactive page counts | Distinguish working-set pressure from cold cache |
| memory | memory.swap_total_mib / memory.swap_used_mib / memory.swap_used_pct | Swap metrics | Detect swap pressure before OOM; Python omits swap entirely |
| network | network[].interface etc. | Interface name, MAC, driver, operstate, speed, MTU | Identify which NIC is under load and whether the link is at full speed |
| network | network[].rx_bytes_total / tx_bytes_total | Cumulative byte counters | Enables client-side rate computation at any granularity |
| disk | disk[].device_type | nvme, ssd, or hdd | Correlate latency with drive class without parsing device names |
| disk | disk[].capacity_bytes | Raw device capacity | Capacity planning without a separate lsblk call |
| disk | disk[].mounts[] | Per-mount-point space (total/used/available/pct) | Python aggregates all mounts into three scalars; Rust retains per-volume detail |
| disk | disk[].model / vendor / serial | Drive identity | Correlate metrics with physical hardware inventory |
| gpu | gpu[].temperature_celsius | Die temperature | Detect thermal throttling in real time |
| gpu | gpu[].power_watts | Power draw | Power-efficiency analysis; watts-per-FLOP budgeting |
| gpu | gpu[].frequency_mhz | Core clock | Confirm boost clock is active; correlate with thermal state |
| gpu | gpu[].vram_total_bytes | Total VRAM | Baseline for VRAM utilization percentage |
| gpu | gpu[].uuid / name / device_type / host_id | GPU identity | Multi-GPU systems: attribute metrics to specific devices |

resource-tracker – Usage Guide

resource-tracker is a lightweight Linux resource tracker. It polls CPU, memory, disk, network, and GPU metrics at a configurable interval and emits each sample as a newline-delimited JSON (JSONL) or CSV line, to stderr or to a target file.


Quick start

# Build
cargo build --release

# Run with defaults, tracking a 5-second hashing workload in shell-wrapper mode
./target/release/resource-tracker -- timeout 5s sha512sum /dev/zero

# Track a specific process tree
./target/release/resource-tracker --pid 1234 --job-name "my-job"

By default, each line of output is a complete JSON object representing one sample:

{
  "timestamp_secs": 1718000000,
  "job_name": "my-benchmark",
  "cpu": { "utilization_pct": 4.6, "per_core_pct": [12.5, 38.0, "..."], "process_cores_used": 3.8, "process_child_count": 4 },
  "memory": { "total_mib": 64000, "used_mib": 30468, "used_pct": 47.6, "free_mib": 2289, "available_mib": 18432, "buffers_mib": 263, "cached_mib": 8472, "active_mib": 8157, "inactive_mib": 7404, "swap_total_mib": 0, "swap_used_mib": 0, "swap_used_pct": 0.0 },
  "network": [{ "interface": "eth0", "rx_bytes_per_sec": 1200.0, "tx_bytes_per_sec": 400.0, "rx_bytes_total": 9834200, "tx_bytes_total": 312400, "driver": "virtio_net", "operstate": "up", "speed_mbps": 1000, "mtu": 1500, "mac_address": "02:00:00:aa:bb:cc" }],
  "disk": [{ "device": "nvme0n1", "model": "Samsung SSD 990 PRO", "device_type": "nvme", "capacity_bytes": 1000204886016, "read_bytes_per_sec": 0.0, "write_bytes_per_sec": 204800.0, "mounts": [{ "mount_point": "/", "filesystem": "ext4", "total_bytes": 999292796928, "used_bytes": 841676800000, "available_bytes": 142023000000, "used_pct": 84.2 }] }],
  "gpu": [{ "name": "NVIDIA GeForce RTX 4090", "utilization_pct": 98.0, "vram_used_pct": 72.3, "vram_used_bytes": 17394819072, "vram_total_bytes": 24026849280, "temperature_celsius": 74, "power_watts": 318.5, "frequency_mhz": 2520 }]
}

CLI flags

| Flag | Short | Default | Description |
|---|---|---|---|
| --pid PID | -p | (none) | Root PID of the process tree to attribute CPU usage to. Includes all child processes. |
| --interval SECS | -i | 1 | How often to emit a sample, in seconds. |
| --config FILE | -c | resource-tracker.toml | Path to a TOML config file. Silently ignored if the file does not exist. |
| --format FORMAT | -f | json | Output format: json or csv. |
| --output FILE | -o | | Path to the output file. Defaults to stderr. |
| --quiet | | | Suppress metric output entirely, e.g. when streaming metrics to Sentinel and local output is not needed. |
| --help | -h | | Print help. |
| --version | -V | | Print version. |

Precedence: CLI flags > config file > built-in defaults.
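
In code this merge is just an Option chain per field (field names hypothetical):

// CLI flag wins, then the TOML file, then the built-in default.
let interval_secs = cli.interval
    .or(file_config.tracker.interval_secs)
    .unwrap_or(1);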


Config file (resource-tracker.toml)

The TOML config file lets you persist settings so you don’t have to repeat CLI flags on every invocation. It is optional – the tool works with no config file at all. Any field set on the CLI overrides the corresponding field in the file.

The default lookup path is resource-tracker.toml in the current working directory. Use --config /path/to/file.toml to point elsewhere.

Full reference

[job]
# Human-readable label for this tracking session.
# Appears as "job_name" in every emitted JSON sample.
# Useful when multiple runs are collected into the same data store so you can
# filter and group by job.
name = "gpu-benchmark-run-42"

# Root PID of the process to track.
# resource-tracker will walk the full process tree (parent + all descendants)
# and sum their CPU tick usage to report process_cores_used.
# Leave unset to collect system-wide metrics only.
pid = 12345

[tracker]
# Sampling interval in seconds.  Lower values give finer resolution at the
# cost of more output volume and slightly higher observer overhead.
# Default: 1
interval_secs = 10

Minimal example – system-wide monitoring

[tracker]
interval_secs = 30

Example – named job with process tracking

[job]
name    = "my_job_i_want_to_track"
pid     = 98231

[tracker]
interval_secs = 5

Sentinel API streaming and S3 output

When SENTINEL_API_TOKEN is set, the tracker registers the run with the Sentinel API and streams metric batches to S3 in the background. No network connections are ever made when the token is absent.

How it works

  1. At startup, the start_run API endpoint is called to register the run and obtain temporary S3 upload credentials from the Sentinel API.
  2. A background upload thread wakes every TRACKER_UPLOAD_INTERVAL seconds (default 60), drains the in-memory sample buffer, serializes it as CSV, gzip-compresses it (see the sketch after this list), and PUTs the file to the S3 prefix returned by the API.
  3. On clean exit (SIGTERM, shell-wrapper child exits), any samples not yet uploaded are base64-encoded and sent inline to finish_run inside a gzip-compressed JSON body. If S3 uploads did occur, only the S3 URIs are sent.
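
A sketch of step 2's compression stage using the pinned flate2 crate (function name hypothetical):

use flate2::{write::GzEncoder, Compression};
use std::io::Write;

// Gzip-compress a serialized CSV batch before the S3 PUT.
fn gzip_batch(csv: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(csv)?;
    enc.finish() // returns the compressed bytes
}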

Environment variables

| Variable | Required | Default | Description |
|---|---|---|---|
| SENTINEL_API_TOKEN | Yes | | Bearer token for the Sentinel API. Streaming is disabled when absent or empty. |
| SENTINEL_API_URL | No | https://api.sentinel.sparecores.net | Override the Sentinel API base URL. |
| TRACKER_UPLOAD_INTERVAL | No | 60 | Seconds between S3 batch uploads. |

Job metadata environment variables

All Section 9.3 metadata fields can be set via environment variable instead of CLI flags. Environment variables are overridden by the corresponding CLI flag when both are supplied.

| Variable | CLI flag |
|---|---|
| TRACKER_JOB_NAME | --job-name |
| TRACKER_PROJECT_NAME | --project-name |
| TRACKER_STAGE_NAME | --stage-name |
| TRACKER_TASK_NAME | --task-name |
| TRACKER_TEAM | --team |
| TRACKER_ENV | --env |
| TRACKER_LANGUAGE | --language |
| TRACKER_ORCHESTRATOR | --orchestrator |
| TRACKER_EXECUTOR | --executor |
| TRACKER_EXTERNAL_RUN_ID | --external-run-id |
| TRACKER_CONTAINER_IMAGE | --container-image |

Example

export SENTINEL_API_TOKEN="your-token-here"
export TRACKER_JOB_NAME="gpu-benchmark"
export TRACKER_UPLOAD_INTERVAL=30

./resource-tracker --interval 1 -- python train.py

The tracker spawns python train.py, monitors it, uploads a gzip-compressed CSV batch to S3 every 30 seconds, and calls finish_run when the script exits.


When to use the config file vs CLI flags

| Situation | Recommended approach |
|---|---|
| One-off interactive run | CLI flags – faster, no file to manage |
| Recurring job (cron, SLURM, systemd unit) | TOML file alongside the job definition |
| CI / benchmark pipeline | TOML file checked into the repository |
| Multiple named jobs on the same host | One TOML file per job, point to it with --config |
| Containerized workload | Set config via CLI flags in the CMD / ENTRYPOINT |

Capturing output

Because samples are emitted as newline-delimited JSON on stderr by default, standard Unix tools work directly with the output once stderr is redirected.

# Write to a file
./resource-tracker 2> run.jsonl

# Tail live output
./resource-tracker 2>&1 | tee run.jsonl

# Pretty-print with jq
./resource-tracker 2>&1 | jq .

# Extract only CPU utilization over time
./resource-tracker 2>&1 | jq '{ t: .timestamp_secs, cpu: .cpu.utilization_pct }'

# Watch GPU VRAM usage
./resource-tracker --interval 1 2>&1 | jq '.gpu[] | { name, vram_used_pct }'

Shell-wrapper mode

Pass a command after -- to have the tracker spawn and monitor it:

./resource-tracker --interval 1 --job-name "training-run" -- python train.py --epochs 50

The tracker sets --pid automatically to the spawned child’s PID, emits one final sample when the child exits, then exits with the child’s exit code.

Rationale: eliminates the two-process boilerplate (tracker & python ...; wait) and guarantees the tracker always exits with the job’s exit code, making it transparent to CI systems.
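
A minimal sketch of the wrapper behavior (function name hypothetical, error handling elided):

use std::process::Command;

// Spawn the wrapped command, remember its PID for tracking,
// then mirror its exit code so CI sees the job's real status.
fn run_wrapped(cmd: &str, args: &[&str]) -> std::io::Result<i32> {
    let mut child = Command::new(cmd).args(args).spawn()?;
    let _tracked_pid = child.id(); // becomes the implicit --pid
    let status = child.wait()?;    // final sample is emitted here
    Ok(status.code().unwrap_or(1))
}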


Process tree tracking (--pid)

When --pid is set, every sample includes two extra fields under cpu:

  • process_cores_used – fractional cores consumed by the process tree (e.g. 3.8 means the tree is using the equivalent of 3.8 full cores).
  • process_child_count – number of live child/descendant processes at the time of sampling (does not include the root PID itself).

If the tracked PID exits during a run, process_cores_used and process_child_count drop to zero. The tracker itself keeps running.

Rationale: Python’s SystemTracker tracks only the calling process’s own ticks. Rust walks the full /proc tree so multi-process and multi-threaded workloads (e.g. PyTorch data-loader workers, MPI ranks, Spark executors) are attributed correctly under a single root PID.
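
A sketch of the /proc tree walk (helper name hypothetical; ppid is the first field after the parenthesized comm in /proc/<pid>/stat):

use std::collections::HashMap;
use std::fs;

// Collect all live descendants of `root` by building a parent -> children
// map from every /proc/<pid>/stat, then walking down from the root.
fn descendants(root: u32) -> Vec<u32> {
    let mut children: HashMap<u32, Vec<u32>> = HashMap::new();
    for entry in fs::read_dir("/proc").into_iter().flatten().flatten() {
        let Ok(pid) = entry.file_name().to_string_lossy().parse::<u32>() else { continue };
        let Ok(stat) = fs::read_to_string(format!("/proc/{pid}/stat")) else { continue };
        // comm may contain spaces or parens, so split on the *last* ')'
        let Some((_, rest)) = stat.rsplit_once(')') else { continue };
        let mut fields = rest.split_whitespace();
        let _state = fields.next();
        let Some(ppid) = fields.next().and_then(|p| p.parse().ok()) else { continue };
        children.entry(ppid).or_default().push(pid);
    }
    let (mut out, mut queue) = (Vec::new(), vec![root]);
    while let Some(pid) = queue.pop() {
        if let Some(kids) = children.get(&pid) {
            out.extend(kids.iter().copied());
            queue.extend(kids.iter().copied());
        }
    }
    out
}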

Finding the PID of a running process:

# By name
pgrep -x python

# Most recently launched
pgrep -n my-training-script

# Already know the command? Launch and capture PID
my-training-script &
./resource-tracker --pid $! --job-name "training-run-1"

GPU support

GPUs are detected automatically at startup via NVML (NVIDIA) and libamdgpu_top (AMD). No configuration is needed. On hosts without GPU hardware or without the relevant driver libraries installed, the gpu array in each sample will be empty – the tracker continues running normally.

Supported accelerators: NVIDIA GPUs (NVML), AMD GPUs (ROCm/AMDGPU).

Rationale: per-GPU temperature, power draw, and clock frequency are not emitted by Python’s SystemTracker. These fields enable thermal throttle detection and power-efficiency analysis without a separate monitoring tool.


Metrics reference

cpu

| Field | Unit | Description |
|---|---|---|
| utilization_pct | fractional cores | Aggregate cores in use (0.0..N_cores). 4.6 on a 16-core host means ~4.6 vCPUs fully utilized. |
| per_core_pct | % each | Per-logical-core utilization array (0.0–100.0). |
| utime_secs | seconds | User+nice CPU time across all cores this interval. |
| stime_secs | seconds | System CPU time across all cores this interval. |
| process_count | count | Runnable processes (procs_running from /proc/stat). |
| process_cores_used | fractional cores | Cores consumed by tracked process tree (null if no PID). |
| process_child_count | count | Live descendant processes (null if no PID). |
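
utilization_pct falls out of /proc/stat tick deltas; a sketch of the conversion (fragment; variable names hypothetical, ticks_per_sec is USER_HZ, typically 100):

// Busy ticks accumulated across all cores during this interval.
let busy_ticks = (user + nice + system) - prev_busy_ticks;
// Convert ticks to CPU-seconds, then to fractional cores by dividing
// by the wall-clock interval.
let busy_secs = busy_ticks as f64 / ticks_per_sec as f64;
let utilization_cores = busy_secs / interval_secs; // e.g. 4.6 fractional cores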

memory

All values in mebibytes (MiB = 1,048,576 bytes).

| Field | Description |
|---|---|
| total_mib | Total installed RAM |
| free_mib | Truly free RAM (MemFree from /proc/meminfo) |
| available_mib | Free + reclaimable RAM (MemAvailable); better estimate of headroom |
| used_mib | total - free - buffers - cached (excludes reclaimable cache) |
| used_pct | Percentage of total RAM in use |
| buffers_mib | Kernel I/O buffer cache |
| cached_mib | Page cache including slab-reclaimable (Cached + SReclaimable) |
| active_mib | Active pages (recently accessed) |
| inactive_mib | Inactive pages (candidates for reclaim) |
| swap_total_mib | Total swap space (0 if no swap) |
| swap_used_mib | Used swap |
| swap_used_pct | Percentage of swap in use |

Rationale: Python’s SystemTracker reported memory in KiB before resource-tracker PR #9 and omits available_mib and the swap_* fields. Rust reports all fields in MiB (matching Python since PR #9) and adds available_mib (MemAvailable), which is a more reliable headroom estimate than free_mib alone on systems with large page caches.

disk (one entry per whole-disk block device)

| Field | Unit | Description |
|---|---|---|
| device | | Kernel device name, e.g. nvme0n1, sda |
| model | | Drive model string from /sys/block/ |
| vendor | | Vendor string from /sys/block/ |
| serial | | Serial number or WWID |
| device_type | | nvme, ssd, or hdd |
| capacity_bytes | bytes | Raw device capacity |
| mounts | | Array of mounted filesystems on this device |
| mounts[].mount_point | | e.g. /, /home |
| mounts[].filesystem | | e.g. ext4, xfs, btrfs |
| mounts[].total_bytes | bytes | Filesystem total size |
| mounts[].used_bytes | bytes | Space in use |
| mounts[].available_bytes | bytes | Space available to non-root users |
| mounts[].used_pct | % | Percentage of filesystem in use |
| read_bytes_per_sec | bytes/s | Disk read throughput |
| write_bytes_per_sec | bytes/s | Disk write throughput |
| read_bytes_total | bytes | Cumulative bytes read since boot |
| write_bytes_total | bytes | Cumulative bytes written since boot |

Rationale: Python aggregates disk space across all mounts into three scalar CSV columns. Rust retains per-device, per-mount detail in the JSON output, enabling per-volume capacity tracking and per-device I/O attribution that the aggregated CSV cannot express.

network (one entry per non-loopback interface)

| Field | Unit | Description |
|---|---|---|
| interface | | Interface name, e.g. eth0, ens3 |
| mac_address | | Hardware MAC address |
| driver | | Kernel driver name, e.g. igc, virtio_net |
| operstate | | Link state: up, down, unknown |
| speed_mbps | Mbps | Negotiated link speed (-1 if not reported) |
| mtu | bytes | Maximum transmission unit |
| rx_bytes_per_sec | bytes/s | Received throughput |
| tx_bytes_per_sec | bytes/s | Transmitted throughput |
| rx_bytes_total | bytes | Cumulative bytes received since boot |
| tx_bytes_total | bytes | Cumulative bytes sent since boot |

Rationale: Python’s SystemTracker emits only cumulative rx/tx byte totals per interface. Rust adds per-interval rates, driver identity, link state, negotiated speed, and MTU, enabling network saturation and driver-level diagnostics without a separate tool.

gpu (one entry per detected accelerator)

| Field | Unit | Description |
|---|---|---|
| uuid | | Vendor-assigned device UUID |
| name | | Device name, e.g. NVIDIA GeForce RTX 4090 |
| device_type | | GPU, NPU, TPU, etc. |
| host_id | | Host-level device identifier (PCIe slot or platform index) |
| detail | | Driver-specific key/value map (PCI IDs, ASIC name, driver version, …) |
| utilization_pct | % | Core utilization |
| vram_total_bytes | bytes | Total VRAM |
| vram_used_bytes | bytes | Used VRAM |
| vram_used_pct | % | Percentage of VRAM in use |
| temperature_celsius | °C | Die temperature |
| power_watts | W | Power draw |
| frequency_mhz | MHz | Core clock |
| core_count | count | Shader/compute cores (null if not reported) |
