# resource-tracker — Design Notes
## Spec Summary
- Linux resource tracker (x86 + ARM), using `procfs` where appropriate
- Configurable polling interval for: CPU, memory, GPU, VRAM, network in/out, disk read/write
- GPU support requires dynamic linking (no static link)
- CLI tool with optional params (job name/metadata); TOML config file with sane defaults (example after this list)
- Basic HTTP client: hit API endpoints at start, stop, and every X minutes (heartbeat)
- Lightweight S3 PUT using AWS creds to stream resource utilization data
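To make the TOML-config-with-defaults item concrete, one possible shape of the file (all key names here are illustrative, not a committed schema):

```toml
# Hypothetical config.toml: every key should have a sane default in code,
# so the file itself stays optional.
job_name = "nightly-training"      # optional; can also come from --job-name
poll_interval_secs = 5             # collector sampling cadence
heartbeat_interval_mins = 10       # "every X minutes" API heartbeat

[api]
base_url = "https://example.invalid/tracker"   # start / stop / heartbeat endpoints

[s3]
bucket = "my-metrics-bucket"
region = "us-east-1"
prefix = "resource-tracker/"
```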
## Dependency Assessment
### Current `Cargo.toml` dependencies
| Crate | Version | Purpose |
|---|---|---|
| `nvml-wrapper` | 0.12 | NVIDIA GPU/VRAM monitoring via NVML; runtime dynamic loading |
| `libamdgpu_top` | 0.11.2, no defaults, `libdrm_dynamic_loading` | AMD GPU monitoring via libdrm; runtime dynamic loading |
| `clap` | 4, no defaults, derive+std+help+usage+error-context+env | CLI argument parsing, minimal footprint |
| `procfs` | 0.18, serde feature only | Linux `/proc` – CPU, memory, network, disk |
| `ureq` | 3, json feature | Lightweight sync HTTP – no tokio, no async runtime |
| `serde` | 1, derive | Serialization/deserialization |
| `serde_json` | 1 | JSON payload encoding for API and S3 |
| `toml` | 1.0, no defaults, parse+serde features | TOML config file parsing |
| `hmac` | 0.13.0-rc.6 | AWS Signature V4 HMAC signing |
| `sha2` | 0.11.0 | SHA-256 hashing for AWS Sig V4 |
| `hex` | 0.4 | Hex encoding for AWS Sig V4 signature |
| `libc` | 0.2 | statvfs for filesystem space, gethostname, SIGTERM |
| `flate2` | =1.1.9 (pinned), no defaults, rust_backend | Gzip compression for S3 batch uploads; pure Rust, no zlib-sys |
### Release profile

```toml
[profile.release]
opt-level = "z"     # optimize for size
lto = true          # link-time optimization
codegen-units = 1   # better dead-code elimination
strip = true        # strip symbols
panic = "abort"     # smaller panic handler
```
### Key decisions
- `nvml-wrapper` + `libamdgpu_top` over `all-smi`: `all-smi` required `protoc` at build time. Replaced with `nvml-wrapper` (NVIDIA, no build-time deps) and `libamdgpu_top` with `libdrm_dynamic_loading` (AMD, runtime-only). Both load their respective drivers at runtime and degrade gracefully when absent.
- `ureq` over `reqwest`: `reqwest` v0.13 pulls in `tokio` (full async runtime), `hyper`, and TLS stacks – adds ~5-10 MB. `ureq` v3 is synchronous, no runtime, comparable API surface.
- `procfs` features trimmed: dropped `chrono` (heavy date/time lib; `std::time` suffices) and `flate2` (only needed for gzip-compressed `/proc` files, which are rare).
- `clap` defaults disabled: default clap features include terminal color, unicode width, etc. Stripped to the functional minimum; the `env` feature added to support `TRACKER_*` environment variable overrides.
- Manual AWS Sig V4 (`hmac` + `sha2` + `hex`): avoids `aws-sdk-s3` (~50+ transitive deps, large binary). An S3 PUT only needs ~100-150 lines of signing logic (see the sketch after this list).
- `toml` v1.0 defaults disabled: `parse` + `serde` features; the `serde` feature is required for `toml::from_str` deserialization into config structs.
- `flate2` pinned to `=1.1.9` with `rust_backend`: pure-Rust gzip implementation; avoids a `zlib-sys` C build dependency. Version pinned to prevent unexpected breakage from pre-1.0 semver.
- `libc` for sysfs/POSIX calls: `statvfs` for filesystem space, `gethostname` for host identity, and SIGTERM signal handling – pure FFI bindings with no additional binary size overhead.
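Since the SDK is skipped, the signing logic is worth sketching. Below is the core of the Sig V4 key derivation and payload hash, written against the familiar `hmac` 0.12 / `sha2` 0.10 style `Mac`/`Digest` APIs; the release-candidate versions pinned above may differ slightly. The remaining lines of the ~100-150 budget are string assembly (canonical request, string to sign, `Authorization` header) around these two helpers.

```rust
// Sketch only: AWS Signature V4 signing-key derivation and payload hash.
// API style follows hmac 0.12 / sha2 0.10; the rc versions in Cargo.toml may differ.
use hmac::{Hmac, Mac};
use sha2::{Digest, Sha256};

type HmacSha256 = Hmac<Sha256>;

fn hmac_sha256(key: &[u8], data: &[u8]) -> Vec<u8> {
    let mut mac = HmacSha256::new_from_slice(key).expect("HMAC-SHA256 accepts any key length");
    mac.update(data);
    mac.finalize().into_bytes().to_vec()
}

/// Sig V4 signing key: chained HMACs over date, region, service, and "aws4_request".
fn signing_key(secret_key: &str, date_yyyymmdd: &str, region: &str, service: &str) -> Vec<u8> {
    let k_date = hmac_sha256(format!("AWS4{secret_key}").as_bytes(), date_yyyymmdd.as_bytes());
    let k_region = hmac_sha256(&k_date, region.as_bytes());
    let k_service = hmac_sha256(&k_region, service.as_bytes());
    hmac_sha256(&k_service, b"aws4_request")
}

/// Hex SHA-256 of the request body, used as the x-amz-content-sha256 header value
/// and as the payload hash inside the canonical request.
fn payload_hash(body: &[u8]) -> String {
    hex::encode(Sha256::digest(body))
}
```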
## Implementation Approaches
### Option A — Single-file polling loop

All logic in `main.rs`. One tight loop: sleep → collect → diff deltas → buffer → flush.
```
main.rs
├── CLI parsing (clap)
├── Config loading (toml)
├── Polling loop
│   ├── procfs → CPU/mem/net/disk snapshots + delta computation
│   ├── nvml-wrapper / libamdgpu_top → GPU/VRAM snapshots
│   └── Vec<Sample> batch buffer
├── HTTP calls (ureq) — start / stop / heartbeat
└── AWS Sig V4 signing + ureq PUT (inline)
```
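The polling loop sketched above might look roughly like this; `Sample`, `collect_sample`, and `flush` are placeholders, not APIs from the listed crates:

```rust
use std::time::{Duration, Instant};

struct Sample;                                  // placeholder; real fields TBD
fn collect_sample() -> Sample { Sample }        // procfs + GPU reads would go here
fn flush(batch: &[Sample]) { let _ = batch; }   // ureq POST / S3 PUT would go here

fn run(poll_interval: Duration, flush_every: usize) {
    let mut batch: Vec<Sample> = Vec::new();
    loop {
        let tick = Instant::now();
        batch.push(collect_sample());           // snapshot + delta against the last tick
        if batch.len() >= flush_every {
            flush(&batch);
            batch.clear();
        }
        // Sleep only for the remainder of the interval to keep a steady cadence.
        std::thread::sleep(poll_interval.saturating_sub(tick.elapsed()));
    }
}
```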
Pros:
- Simplest to read and audit end-to-end
- Zero abstraction overhead
- Fastest to prototype
Cons:
- `main.rs` grows large and hard to navigate
- No isolation between collectors — hard to unit test
- Tight coupling makes it hard to disable/swap individual collectors
Best for: MVP / proof of concept.
### Option B — Module-per-resource + collector trait (current)
A Collector trait drives a scheduler. Each resource lives in its own module with its own delta state.
```
src/
├── main.rs        — CLI, config, scheduler loop
├── config.rs      — TOML config struct + CLI override merge
├── sample.rs      — Sample / Report structs (serde)
├── collector/
│   ├── mod.rs     — Collector trait: fn collect(&mut self) -> Metric
│   ├── cpu.rs     — procfs::CpuTime, delta between ticks
│   ├── memory.rs  — procfs::Meminfo
│   ├── network.rs — procfs::Net, bytes delta
│   ├── disk.rs    — procfs::DiskStats, read/write delta
│   └── gpu.rs     — nvml-wrapper + libamdgpu_top wrapper
└── reporter/
    ├── mod.rs     — Reporter trait: on_start / on_sample / on_stop
    ├── http.rs    — ureq: start/stop/heartbeat endpoints
    └── s3.rs      — AWS Sig V4 + ureq PUT (batch upload)
```
Collector trait sketch:
```rust
pub trait Collector {
    fn collect(&mut self) -> Metric;
}
```
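For illustration, a memory collector against that trait could look as follows. The `Metric` stand-in, the `procfs` constructor, and the field names are assumptions based on earlier `procfs` versions and may need adjusting for 0.18:

```rust
use procfs::prelude::*;   // brings the Current trait (Meminfo::current) into scope
use procfs::Meminfo;

// Stand-in metric type; the real Metric/Sample shapes belong in sample.rs.
pub struct Metric {
    pub name: &'static str,
    pub value: f64,
}

// Repeated here so the snippet stands alone.
pub trait Collector {
    fn collect(&mut self) -> Metric;
}

pub struct MemoryCollector;

impl Collector for MemoryCollector {
    fn collect(&mut self) -> Metric {
        // Parses /proc/meminfo; procfs reports values in bytes.
        // (Constructor name has shifted between procfs versions: new() vs current().)
        let info = Meminfo::current().expect("failed to read /proc/meminfo");
        let available = info.mem_available.unwrap_or(info.mem_free) as f64;
        let used_pct = 100.0 * (1.0 - available / info.mem_total as f64);
        Metric { name: "memory.used_pct", value: used_pct }
    }
}
```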
Reporter trait sketch:
```rust
pub trait Reporter {
    fn on_start(&self, meta: &JobMeta);
    fn on_sample(&self, batch: &[Sample]);
    fn on_stop(&self, meta: &JobMeta);
}
```
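And a matching HTTP reporter sketch. Endpoint paths and the `JobMeta`/`Sample` shapes are made up for illustration; `send_json` comes from ureq's `json` feature, and builder details may differ between ureq 2 and 3:

```rust
use serde::Serialize;

// Illustrative shapes; the real ones belong in sample.rs.
#[derive(Serialize)]
pub struct JobMeta { pub job_name: String }

#[derive(Serialize)]
pub struct Sample { pub cpu_pct: f64 }

pub trait Reporter {
    fn on_start(&self, meta: &JobMeta);
    fn on_sample(&self, batch: &[Sample]);
    fn on_stop(&self, meta: &JobMeta);
}

pub struct HttpReporter { pub base_url: String }

impl HttpReporter {
    fn post(&self, path: &str, body: &impl Serialize) {
        let url = format!("{}/{}", self.base_url, path);
        // Failures are logged, not propagated: a dead API must never kill the tracked job.
        if let Err(e) = ureq::post(url.as_str()).send_json(body) {
            eprintln!("POST {url} failed: {e}");
        }
    }
}

impl Reporter for HttpReporter {
    fn on_start(&self, meta: &JobMeta) { self.post("start", meta) }
    fn on_sample(&self, batch: &[Sample]) { self.post("heartbeat", &batch) }
    fn on_stop(&self, meta: &JobMeta) { self.post("stop", meta) }
}
```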
Pros:
- Each collector independently testable with mock `/proc` data
- Clean ownership: delta state lives inside each collector struct
- Easy to add/remove resources without touching other collectors
- Reporter abstraction allows multiple outputs (HTTP + S3 simultaneously)
Cons:
- Slightly more upfront boilerplate (trait definitions, module layout)
- Minor indirection vs. inline code
Best for: Production implementation. Right level of structure for the spec.
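With the two traits in place, the scheduler loop in `main.rs` reduces to iterating trait objects; a sketch with stub types (the real ones come from `sample.rs` and the modules above):

```rust
use std::time::Duration;

// Stubs so the sketch stands alone.
pub struct Sample;
pub trait Collector { fn collect(&mut self) -> Sample; }
pub trait Reporter { fn on_sample(&self, batch: &[Sample]); }

pub fn run(
    mut collectors: Vec<Box<dyn Collector>>,
    reporters: Vec<Box<dyn Reporter>>,
    interval: Duration,
    batch_size: usize,
) {
    let mut batch: Vec<Sample> = Vec::new();
    loop {
        for c in collectors.iter_mut() {
            batch.push(c.collect());        // each collector owns its own delta state
        }
        if batch.len() >= batch_size {
            for r in &reporters {
                r.on_sample(&batch);        // HTTP and S3 reporters can both be registered
            }
            batch.clear();
        }
        std::thread::sleep(interval);
    }
}
```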
### Option C — Config-driven pipeline with Cargo feature flags

Extends Option B with `#[cfg(feature = "...")]` gates. The GPU collector sits behind `feature = "gpu"` since it requires dynamic linking; this enables a statically-linked build for non-GPU targets. The gating is sketched in code after the source layout below.
```toml
[features]
default = ["gpu", "s3", "http"]
gpu = ["dep:nvml-wrapper", "dep:libamdgpu_top"]
s3 = []
http = []
```
```
src/
├── main.rs
├── config.rs
├── sample.rs
├── collector/
│   ├── cpu.rs
│   ├── memory.rs
│   ├── network.rs
│   ├── disk.rs
│   └── gpu.rs     — #[cfg(feature = "gpu")]
└── reporter/
    ├── http.rs    — #[cfg(feature = "http")]
    └── s3.rs      — #[cfg(feature = "s3")]
```
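In code, the gate on `gpu.rs` surfaces roughly like this (collector names are illustrative):

```rust
// Illustrative only: the "gpu" feature keeps the GPU collector, and with it the
// dynamic-linking requirement, out of non-GPU (e.g. musl) builds.
pub trait Collector { fn collect(&mut self); }

pub struct CpuCollector;
impl Collector for CpuCollector { fn collect(&mut self) {} }

#[cfg(feature = "gpu")]
pub struct GpuCollector;
#[cfg(feature = "gpu")]
impl Collector for GpuCollector { fn collect(&mut self) {} }

#[allow(unused_mut)]
pub fn build_collectors() -> Vec<Box<dyn Collector>> {
    let mut collectors: Vec<Box<dyn Collector>> = vec![Box::new(CpuCollector)];
    #[cfg(feature = "gpu")]
    collectors.push(Box::new(GpuCollector));
    collectors
}
```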
Build variants:
```sh
# Full build (default)
cargo build --release

# No GPU — allows static linking (musl target)
cargo build --release --no-default-features --features http,s3
cargo build --release --target x86_64-unknown-linux-musl --no-default-features --features http,s3

# Minimal — metrics only, no reporting
cargo build --release --no-default-features
```
Pros:
- Truly minimal binary for constrained/embedded/container targets
- Static linking possible when GPU excluded
- Clean separation of optional functionality
Cons:
- `#[cfg(...)]` gates add noise throughout the code
- More complex CI/build matrix (multiple feature combinations to test)
- Premature if targets are homogeneous
Best for: Distributing to heterogeneous environments — e.g., some hosts have GPUs, some don’t; or when a stripped container image is a requirement.
## Status
Implement Option B first. This provides the right structure for the spec without over-engineering. The Collector and Reporter traits give clean boundaries for testing and future extension.
Option C’s feature-flag layer can be added on top of B later with minimal refactoring; the module boundaries are already in place.
### Implementation order (Option B)

1. `config.rs` — TOML struct + CLI merge (clap + toml); see the config sketch below
2. `sample.rs` — data model (serde + serde_json)
3. `collector/cpu.rs`, `memory.rs`, `network.rs`, `disk.rs` — procfs collectors
4. `collector/gpu.rs` — nvml-wrapper + libamdgpu_top wrapper
5. `reporter/http.rs` — ureq start/stop/heartbeat
6. `reporter/s3.rs` — AWS Sig V4 + ureq PUT
7. `main.rs` — wire the scheduler loop
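For step 1, the config struct can lean on serde defaults so the TOML file stays optional; field names here are illustrative and the clap override merge is omitted:

```rust
use serde::Deserialize;

// Illustrative config shape, matching no committed schema.
#[derive(Debug, Deserialize)]
#[serde(default)]
pub struct Config {
    pub job_name: Option<String>,
    pub poll_interval_secs: u64,
    pub heartbeat_interval_mins: u64,
}

impl Default for Config {
    fn default() -> Self {
        Config { job_name: None, poll_interval_secs: 5, heartbeat_interval_mins: 10 }
    }
}

pub fn load(path: &str) -> Config {
    // A missing file falls back to defaults; a present-but-invalid file is an error.
    match std::fs::read_to_string(path) {
        Ok(text) => toml::from_str(&text).expect("invalid TOML config"),
        Err(_) => Config::default(),
    }
}
```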
config.rs— TOML struct + CLI merge (clap + toml)sample.rs— data model (serde + serde_json)collector/cpu.rs,memory.rs,network.rs,disk.rs— procfs collectorscollector/gpu.rs— all-smi wrapperreporter/http.rs— ureq start/stop/heartbeatreporter/s3.rs— AWS Sig V4 + ureq PUTmain.rs— wire scheduler loop