Specification Proposal — resource-tracker
- Status: Proposal / Work-in-Progress
- Date: 2026-03-30
- Based on: README.md (SpareCores), src/prototype, Python PR #9, s3_upload.py
- AI large language model tools were used throughout the research, specification, and implementation phases of this project to accelerate and improve the quality of the work.
0. Conventions
The key words MUST, MUST NOT, REQUIRED, SHALL, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
A verifiable requirement is one that can be confirmed by an automated test without manual inspection. Every normative statement below (MUST/SHALL) is intended to be verifiable.
1. Purpose and Scope
resource-tracker is a lightweight, statically self-contained Linux binary that:
- Polls system- and process-level resource utilization at a configurable interval.
- Emits structured samples to stdout (JSON Lines or CSV).
- Optionally streams those samples to the Sentinel API (SpareCores data ingestion endpoint) via gzip-compressed (CSV, TSV, or JSONL) files uploaded to S3 using temporary STS credentials.
The binary is intended as a drop-in CLI wrapper: run it alongside any process and it will transparently record how that process consumes hardware.
Out of scope (v1): macOS, Windows, eBPF-based tracing, container image introspection beyond environment variables, multi-host federation.
2. Platform Requirements
| Requirement | Detail |
|---|---|
| Operating System | Linux only (kernel ≥ 4.18 recommended for full /proc coverage) |
| CPU Architectures | x86_64 and aarch64 (ARM64) |
| Linkage | Dynamic linkage for GPU libraries; all other code statically linked or carried as crate dependencies |
| Minimum Rust Edition | 2024 |
GPU support MUST NOT be required for the binary to build or run.
On a CPU-only host GpuCollector::collect() SHALL return an empty Vec and no error.
3. Configuration
3.1 Precedence (highest to lowest)
CLI flags > TOML config file > built-in defaults
Future enhancement: Support RESOURCE_TRACKER_-prefixed environment variables (e.g. RESOURCE_TRACKER_INTERVAL, RESOURCE_TRACKER_FORMAT) as an additional configuration layer between CLI flags and the TOML file. Environment variables are more practical than file-based config for containerized and scripted workloads and are preferred for the Sentinel integration use case.
3.2 CLI Parameters
The binary MUST accept the following flags via a command line parser:
| Short | Long | Type | Default | Description |
|---|---|---|---|---|
| -n | --job-name | String | none | Human-readable label attached to every sample |
| -p | --pid | i32 | none | Root PID of the process tree to track (CPU attribution) |
| -i | --interval | u64 | 1 | Polling interval in seconds (≥ 1) |
| -c | --config | path | resource-tracker.toml | Path to TOML config file |
| -f | --format | enum | json | Output format: json or csv |
| — | --version | flag | — | Print binary version and exit |
All metadata fields listed in Section 9.3 (job_name, project_name, stage_name, etc.) MUST also be accepted as CLI flags. See Section 9.3 for the full flag and environment variable table.
Shell-wrapper mode (MVP target): The binary SHOULD support being used as a
transparent process wrapper, where the command to monitor is passed as trailing
arguments after a -- separator or as positional arguments:
resource-tracker Rscript model.R
resource-tracker -- python train.py --epochs 10
In this mode the binary spawns the given command as a child process, sets
--pid to that child’s PID automatically, and exits when the child exits
(propagating the child exit code). This is a significant usability improvement
over the Python implementation and is a first-class v1 goal.
--interval MUST be > 0. Values of 0 SHALL be rejected with a non-zero exit code and a descriptive error message.
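The interval rule above can be sketched as a standalone validation helper; the function name is illustrative, not the shipped API (a real binary would wire this into its CLI parser and map the error to a non-zero exit code):

```rust
// Sketch: reject --interval 0 with a descriptive error, per Section 3.2.
fn validate_interval(interval_secs: u64) -> Result<u64, String> {
    if interval_secs == 0 {
        // The caller is expected to print this message and exit non-zero.
        Err("--interval must be greater than 0 seconds".to_string())
    } else {
        Ok(interval_secs)
    }
}
```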
3.3 TOML Config File
The config file is optional. If the file does not exist or cannot be parsed, the binary MUST continue using defaults (no error, no warning).
Schema:
[job]
name = "my-benchmark" # String; optional
pid = 12345 # i32; optional
[tracker]
interval_secs = 5 # u64; optional; default 1
Unrecognized keys MUST be silently ignored.
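As an illustrative sketch only, the two-table schema above can be parsed with a minimal hand-rolled reader that falls back to defaults on any error; a real implementation would use the `toml` crate, and the names here (`Config`, `parse_config`) are assumptions, not the shipped API:

```rust
// Sketch: parse the [job]/[tracker] schema, silently ignoring unknown keys
// and falling back to defaults on malformed values (per Section 3.3).
#[derive(Debug, PartialEq)]
struct Config {
    job_name: Option<String>,
    pid: Option<i32>,
    interval_secs: u64, // default 1
}

fn parse_config(text: &str) -> Config {
    let mut cfg = Config { job_name: None, pid: None, interval_secs: 1 };
    let mut section = String::new();
    for line in text.lines() {
        // Strip trailing comments (naive: assumes no '#' inside quoted values).
        let line = line.split('#').next().unwrap_or("").trim();
        if line.starts_with('[') && line.ends_with(']') {
            section = line[1..line.len() - 1].to_string();
        } else if let Some((key, value)) = line.split_once('=') {
            let (key, value) = (key.trim(), value.trim().trim_matches('"'));
            match (section.as_str(), key) {
                ("job", "name") => cfg.job_name = Some(value.to_string()),
                ("job", "pid") => cfg.pid = value.parse().ok(),
                ("tracker", "interval_secs") => {
                    cfg.interval_secs = value.parse().unwrap_or(1)
                }
                _ => {} // unrecognized keys are silently ignored
            }
        }
    }
    cfg
}
```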
3.4 Verifiable Configuration Tests
- T-CFG-01: Running with no flags produces valid JSON Lines output on stdout.
- T-CFG-02: --format csv emits a header line matching the exact column list in Section 7.2 before the first data row.
- T-CFG-03: --interval 0 exits with code ≠ 0.
- T-CFG-04: A TOML file with [tracker] interval_secs = 3 results in samples separated by ≈ 3 seconds when no --interval flag is provided.
- T-CFG-05: A CLI --interval 2 overrides a TOML interval_secs = 5.
- T-CFG-06: A missing TOML file path silently falls back to defaults.
4. Startup Behavior
On startup the binary MUST:
- Parse configuration (Section 3).
- Initialize all collectors.
- Execute one warm-up collection pass to prime delta state in stateful collectors (CpuCollector, NetworkCollector, DiskCollector).
- Sleep exactly one full interval.
- Emit the CSV header (if format = CSV) before the first data row.
- Enter the polling loop (Section 5).
The warm-up pass result MUST NOT be emitted to stdout.
5. Polling Loop
The loop MUST:
- Record timestamp_secs = current Unix time as u64 (seconds since UNIX epoch, UTC).
- Collect all metric subsystems (Section 6.1) in the order: CPU, Memory, Network, Disk, GPU.
- Serialize and emit one line to stdout per the chosen format (Section 7.1, Section 7.2).
- Sleep the configured interval.
- Repeat indefinitely until killed.
Collection of any subsystem MUST NOT block the other subsystems. Failures in optional subsystems (GPU) MUST be surfaced as empty/zero values, not panics.
6. Data Model
6.1 Sample
A Sample is a point-in-time snapshot of all tracked resources.
pub struct Sample {
    pub timestamp_secs: u64,          // Unix time (seconds)
    pub job_name: Option<String>,
    pub cpu: CpuMetrics,
    pub memory: MemoryMetrics,
    pub network: Vec<NetworkMetrics>, // one per interface
    pub disk: Vec<DiskMetrics>,       // one per block device
    pub gpu: Vec<GpuMetrics>,         // one per GPU; empty if none
}
6.1.1 CpuMetrics
Source: /proc/stat tick deltas; /proc/<pid>/stat for process tracking.
Note: total_cores (logical CPU count) is a static host property that rarely changes. It belongs in the host discovery snapshot (Section 8.1) rather than in every per-second sample. It is referenced here only for computing cpu_usage in the CSV output (Section 7.2).
| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
utilization_pct | f64 | fractional cores | /proc/stat | Aggregate utilization expressed as cores-in-use (0.0..N_cores) |
per_core_pct | Vec<f64> | % | /proc/stat | Per logical CPU percentage; len == host_vcpus; range 0.0–100.0 |
utime_secs | f64 | seconds | /proc/stat | Δ(user+nice ticks) / ticks_per_second for this interval |
stime_secs | f64 | seconds | /proc/stat | Δ(system ticks) / ticks_per_second for this interval |
process_count | u32 | count | /proc numeric dirs | Number of running processes visible to the OS |
process_cores_used | Option<f64> | fractional cores | /proc/<pid>/stat | None when no PID tracked |
process_child_count | Option<u32> | count | /proc/<pid>/stat | Descendant count; excludes root PID; None when no PID tracked |
Computation rules:
- utilization_pct = (Δtotal − Δidle) / Δtotal × N_cores, where N_cores is the logical CPU count from host discovery. The result is expressed as fractional cores in use (e.g. 4.6 on a 16-core host means ~4.6 vCPUs are fully utilized). Do NOT clamp this value; values very slightly above N_cores are valid under kernel accounting rounding.
- Δtotal = Δ(user + nice + system + idle + iowait + irq + softirq + steal).
- Δidle = Δ(idle + iowait).
- utime_secs = Δ(user + nice) / ticks_per_second.
- stime_secs = Δ(system) / ticks_per_second.
- process_cores_used = Σ Δ(utime + stime) for root PID and all descendants / (elapsed_wall_clock_seconds × ticks_per_second). Must be ≥ 0.
- On the first collection call (no previous snapshot), all delta-based fields MUST return 0. The caller MUST discard this result (warm-up pass).
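The utilization rule can be sketched as pure delta arithmetic over two snapshots of the aggregate /proc/stat counters; `CpuTicks` and `utilization_cores` are hypothetical names for illustration:

```rust
// Sketch: utilization_pct = (Δtotal − Δidle) / Δtotal × N_cores, unclamped.
#[derive(Clone, Copy)]
struct CpuTicks {
    user: u64, nice: u64, system: u64, idle: u64,
    iowait: u64, irq: u64, softirq: u64, steal: u64,
}

impl CpuTicks {
    fn total(&self) -> u64 {
        self.user + self.nice + self.system + self.idle
            + self.iowait + self.irq + self.softirq + self.steal
    }
    fn idle_all(&self) -> u64 {
        self.idle + self.iowait // Δidle counts idle + iowait per the rules above
    }
}

fn utilization_cores(prev: CpuTicks, curr: CpuTicks, n_cores: u64) -> f64 {
    let d_total = curr.total().saturating_sub(prev.total());
    if d_total == 0 {
        return 0.0; // no elapsed ticks, e.g. the discarded warm-up pass
    }
    let d_idle = curr.idle_all().saturating_sub(prev.idle_all());
    d_total.saturating_sub(d_idle) as f64 / d_total as f64 * n_cores as f64
}
```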
Verifiable CpuMetrics Tests:
- T-CPU-01: utilization_pct is in [0.0, N_cores] for all samples (N_cores from host discovery).
- T-CPU-02: len(per_core_pct) == host_vcpus for all samples.
- T-CPU-03: When --pid is not set, process_cores_used and process_child_count are None.
- T-CPU-04: When --pid <self> is set, process_cores_used ≥ 0.
- T-CPU-05: process_count ≥ 1 on any running Linux system.
- T-CPU-06: First collect() call returns 0.0 for all delta fields.
6.1.2 MemoryMetrics
Source: /proc/meminfo. All values in mebibytes (MiB = 1024 × 1024 bytes),
standardized to match Python resource-tracker PR #9 which also adopts MiB
throughout.
| Field | Type | Unit | /proc/meminfo key(s) | Notes |
|---|---|---|---|---|
total_mib | u64 | MiB | MemTotal | |
free_mib | u64 | MiB | MemFree | Truly free RAM |
available_mib | u64 | MiB | MemAvailable | Free + reclaimable |
used_mib | u64 | MiB | MemTotal − MemFree − Buffers − Cached | Matches Python memory_used |
used_pct | f64 | % | derived | used_mib / total_mib × 100; range 0.0–100.0 |
buffers_mib | u64 | MiB | Buffers | Kernel I/O buffers |
cached_mib | u64 | MiB | Cached + SReclaimable | Page cache + slab reclaimable |
swap_total_mib | u64 | MiB | SwapTotal | |
swap_used_mib | u64 | MiB | SwapTotal − SwapFree | |
swap_used_pct | f64 | % | derived | 0.0 when SwapTotal == 0 |
active_mib | u64 | MiB | Active | |
inactive_mib | u64 | MiB | Inactive |
Verifiable MemoryMetrics Tests:
- T-MEM-01: free_mib + used_mib + buffers_mib + cached_mib ≤ total_mib (accounting for kernel reserved memory).
- T-MEM-02: used_pct is in [0.0, 100.0].
- T-MEM-03: swap_used_pct is 0.0 when swap_total_mib == 0.
- T-MEM-04: available_mib ≤ total_mib.
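The two derived fields (used_mib and used_pct) follow directly from the table; this is a minimal sketch with illustrative function names, assuming the /proc/meminfo values are already converted to MiB:

```rust
// Sketch: used = MemTotal − MemFree − Buffers − Cached (Section 6.1.2).
fn memory_used_mib(total: u64, free: u64, buffers: u64, cached: u64) -> u64 {
    // saturating_sub floors at 0 in case of inconsistent reads
    total.saturating_sub(free)
        .saturating_sub(buffers)
        .saturating_sub(cached)
}

// used_pct = used / total × 100; 0.0 when total == 0 (mirrors swap_used_pct).
fn memory_used_pct(used_mib: u64, total_mib: u64) -> f64 {
    if total_mib == 0 {
        0.0
    } else {
        used_mib as f64 / total_mib as f64 * 100.0
    }
}
```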
6.1.3 NetworkMetrics
Source: /proc/net/dev (throughput), /sys/class/net/<iface>/ (identity/link state).
One NetworkMetrics record per non-loopback interface.
Architecture note: Fields such as mac_address, driver, operstate, speed_mbps, and mtu are static properties that do not change every second. They are candidates for promotion to a host-discovery snapshot (Section 8.1) rather than being repeated in every per-second sample. This applies similarly to static fields in Section 6.1.4 (disk) and Section 6.1.5 (GPU). The current spec includes them here for completeness; a future revision should separate static identity fields from dynamic rate fields.
| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
interface | String | — | interface name | e.g. "eth0" |
mac_address | Option<String> | — | /sys/class/net/<iface>/address | "00:11:22:33:44:55" |
driver | Option<String> | — | /sys/class/net/<iface>/device/driver symlink | e.g. "igc" |
operstate | Option<String> | — | /sys/class/net/<iface>/operstate | "up", "down", "unknown" |
speed_mbps | Option<i64> | Mbps | /sys/class/net/<iface>/speed | −1 when not reported |
mtu | Option<u32> | bytes | /sys/class/net/<iface>/mtu | |
rx_bytes_per_sec | f64 | bytes/s | /proc/net/dev Δ | Rate for this interval |
tx_bytes_per_sec | f64 | bytes/s | /proc/net/dev Δ | Rate for this interval |
rx_bytes_total | u64 | bytes | /proc/net/dev | Cumulative since boot |
tx_bytes_total | u64 | bytes | /proc/net/dev | Cumulative since boot |
Verifiable NetworkMetrics Tests:
- T-NET-01: rx_bytes_per_sec ≥ 0.0 and tx_bytes_per_sec ≥ 0.0 for all interfaces.
- T-NET-02: rx_bytes_total is monotonically non-decreasing between consecutive samples (absent interface reset).
- T-NET-03: The loopback interface (lo) is NOT included in the output.
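The per-interval rates derive from the cumulative /proc/net/dev counters; a minimal sketch (illustrative function name) that also satisfies T-NET-01 by clamping counter resets to zero rather than producing a negative or wrapped rate:

```rust
// Sketch: rate = Δ(cumulative bytes) / interval, guarded against resets.
fn bytes_per_sec(prev_total: u64, curr_total: u64, interval_secs: u64) -> f64 {
    if interval_secs == 0 {
        return 0.0;
    }
    // saturating_sub guards against counter resets (e.g. an interface being
    // re-created), which would otherwise yield a huge bogus rate.
    curr_total.saturating_sub(prev_total) as f64 / interval_secs as f64
}
```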
6.1.4 DiskMetrics
Source: /proc/diskstats (throughput), /sys/block/<dev>/ (identity),
statvfs(3) (space). One DiskMetrics record per block device (excluding
partitions and device-mapper synthetic devices unless mounted independently).
| Field | Type | Unit | Source | Notes |
|---|---|---|---|---|
device | String | — | kernel device name | e.g. "sda", "nvme0n1" |
model | Option<String> | — | /sys/block/<dev>/device/model | |
vendor | Option<String> | — | /sys/block/<dev>/device/vendor | |
serial | Option<String> | — | /sys/block/<dev>/device/wwid or serial | |
device_type | Option<DiskType> | — | /sys/block/<dev>/queue/rotational | Nvme, Ssd, or Hdd; None when type cannot be determined |
capacity_bytes | Option<u64> | bytes | /sys/block/<dev>/size × 512 | |
mounts | Vec<DiskMountMetrics> | — | statvfs(3) | One per mount point |
read_bytes_per_sec | f64 | bytes/s | /proc/diskstats Δ | |
write_bytes_per_sec | f64 | bytes/s | /proc/diskstats Δ | |
read_bytes_total | u64 | bytes | /proc/diskstats sectors × sector_size | Cumulative since boot; see sector size note |
write_bytes_total | u64 | bytes | /proc/diskstats sectors × sector_size | Cumulative since boot; see sector size note |
DiskMountMetrics fields:
| Field | Type | Unit | Notes |
|---|---|---|---|
mount_point | String | — | e.g. "/" |
filesystem | String | — | Filesystem type from /proc/mounts; e.g. "ext4", "xfs" |
total_bytes | u64 | bytes | statvfs.f_blocks × f_bsize |
available_bytes | u64 | bytes | statvfs.f_bavail × f_bsize (unprivileged) |
used_bytes | u64 | bytes | total_bytes − (statvfs.f_bfree × f_bsize) |
used_pct | f64 | % | used_bytes / total_bytes × 100; 0.0 when total == 0 |
Sector size note: The current implementation hard-codes 512 bytes/sector for /proc/diskstats conversions. Python's get_sector_sizes() reads /sys/block/<dev>/queue/hw_sector_size (fallback 512). On 4K-native drives (some NVMe) the Rust code will under-count I/O bytes by up to 8×. A future fix should read /sys/block/<dev>/queue/logical_block_size at startup and use the actual sector size. See implementation plan P-DSK-SECTOR.
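The proposed P-DSK-SECTOR fix could look like the following sketch (function name is illustrative): read the logical block size once at startup and fall back to 512 on any error, so the behavior on older kernels is unchanged.

```rust
use std::fs;

// Sketch: per-device sector size with the historical 512-byte fallback.
fn logical_block_size(device: &str) -> u64 {
    let path = format!("/sys/block/{}/queue/logical_block_size", device);
    fs::read_to_string(&path)
        .ok()
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(512) // default when the sysfs file is absent or unreadable
}
```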
Verifiable DiskMetrics Tests:
- T-DSK-01: read_bytes_per_sec ≥ 0.0 and write_bytes_per_sec ≥ 0.0.
- T-DSK-02: For each mount, used_bytes + available_bytes ≤ total_bytes.
- T-DSK-03: capacity_bytes (when Some) > 0.
6.1.5 GpuMetrics
Source: NVML (nvml-wrapper crate, runtime-loads libnvidia-ml.so) for
NVIDIA GPUs; libamdgpu_top (runtime-loads libdrm) for AMD GPUs.
| Field | Type | Unit | Notes |
|---|---|---|---|
uuid | String | — | Stable vendor UUID; AMD uses PCI bus address |
name | String | — | Human-readable device name |
device_type | String | — | "GPU", "NPU", "TPU" |
host_id | String | — | Host-level device identifier |
detail | HashMap<String,String> | — | Vendor-specific extras (driver version, PCI bus ID, ROCm version) |
utilization_pct | f64 | % | Core utilization; range 0.0–100.0 |
vram_total_bytes | u64 | bytes | |
vram_used_bytes | u64 | bytes | |
vram_used_pct | f64 | % | vram_used / vram_total × 100; 0.0 when total == 0 |
temperature_celsius | u32 | °C | Die temperature |
power_watts | f64 | W | NVML reports mW; converted to W |
frequency_mhz | u32 | MHz | Core/graphics clock |
core_count | Option<u32> | count | Shader/compute cores; None if not reported |
AMD-specific: When /sys/module/amdgpu does not exist the AMD collection path MUST be skipped entirely (no panic).
NVIDIA-specific: power_watts = raw NVML milliwatt value / 1000.
Verifiable GpuMetrics Tests:
- T-GPU-01: On a CPU-only host, the gpu Vec is empty and no error is returned.
- T-GPU-02: utilization_pct is in [0.0, 100.0] for each GPU.
- T-GPU-03: vram_used_bytes ≤ vram_total_bytes for each GPU.
- T-GPU-04: vram_used_pct is 0.0 when vram_total_bytes == 0.
- T-GPU-05: On a host with an AMD GPU, uuid equals the PCI bus address string.
7. Output Formats
7.1 JSON Lines (default)
Each sample is emitted as a single JSON object followed by \n. The binary
MUST include a version field keyed as "<crate-name>-version" with the value
being the Cargo package version string.
Example (abbreviated):
{"timestamp_secs":1743300000,"job_name":null,"cpu":{...},"memory":{...},"network":[...],"disk":[...],"gpu":[],"resource-tracker-version":"0.1.0"}
Requirements:
- T-OUT-01: Each line MUST be valid JSON parseable with any standard JSON library.
- T-OUT-02: timestamp_secs MUST be present and be a positive integer.
- T-OUT-03: The version key "resource-tracker-version" MUST be present.
- T-OUT-04: Consecutive samples MUST have non-decreasing timestamp_secs.
7.2 CSV Format
CSV is the primary and required output format for Sentinel S3 streaming
(Section 9.2.2). It uses the same column names and units as the Python
resource-tracker so the Sentinel backend can ingest both without schema
changes. When uploaded to S3 the CSV content MUST be gzip-compressed and the
object key MUST carry the extension .csv.gz.
When --format csv is selected for stdout output the raw (uncompressed) CSV
bytes are written. Gzip compression is applied only when writing the S3 batch
upload payload (Section 9.2.2).
When --format csv is selected:
- The header line MUST be emitted exactly once, before the first data row.
- The header MUST match the following column names in this exact order:
timestamp,processes,utime,stime,cpu_usage,memory_free,memory_used,memory_buffers,memory_cached,memory_active,memory_inactive,disk_read_bytes,disk_write_bytes,disk_space_total_gb,disk_space_used_gb,disk_space_free_gb,net_recv_bytes,net_sent_bytes,gpu_usage,gpu_vram,gpu_utilized
Column definitions:
| CSV Column | Source Field | Unit | Computation |
|---|---|---|---|
timestamp | timestamp_secs | Unix seconds | direct |
processes | cpu.process_count | count | direct |
utime | cpu.utime_secs | seconds | direct; 3 decimal places |
stime | cpu.stime_secs | seconds | direct; 3 decimal places |
cpu_usage | cpu.utilization_pct | fractional cores | utilization_pct directly; field is already in fractional cores (0..N_cores); 4 decimal places |
memory_free | memory.free_mib | MiB | direct |
memory_used | memory.used_mib | MiB | direct |
memory_buffers | memory.buffers_mib | MiB | direct |
memory_cached | memory.cached_mib | MiB | direct |
memory_active | memory.active_mib | MiB | direct |
memory_inactive | memory.inactive_mib | MiB | direct |
disk_read_bytes | disk subsystem | bytes | Σ read_bytes_per_sec × interval_secs across all devices; integer |
disk_write_bytes | disk subsystem | bytes | Σ write_bytes_per_sec × interval_secs across all devices; integer |
disk_space_total_gb | disk mounts | GB (10⁹) | Σ total_bytes / 1_000_000_000 across all mounts; 6 decimal places |
disk_space_used_gb | disk mounts | GB (10⁹) | disk_space_total_gb − disk_space_free_gb; 6 decimal places |
disk_space_free_gb | disk mounts | GB (10⁹) | Σ available_bytes / 1_000_000_000 across all mounts; 6 decimal places |
net_recv_bytes | network subsystem | bytes | Σ rx_bytes_per_sec × interval_secs across all interfaces; integer |
net_sent_bytes | network subsystem | bytes | Σ tx_bytes_per_sec × interval_secs across all interfaces; integer |
gpu_usage | gpu subsystem | fractional GPUs | Σ utilization_pct / 100 across all GPUs; 4 decimal places |
gpu_vram | gpu subsystem | MiB | Σ vram_used_bytes / 1_048_576; 4 decimal places |
gpu_utilized | gpu subsystem | count | count of GPUs where utilization_pct > 0.0 |
Verifiable CSV Tests:
- T-CSV-01: Header is emitted exactly once, as the first line.
- T-CSV-02: Column count per data row equals column count in header.
- T-CSV-03: The cpu_usage column equals utilization_pct directly (field is already fractional cores, 0..N_cores) to 4 dp.
- T-CSV-04: disk_space_used_gb = disk_space_total_gb − disk_space_free_gb for all rows.
- T-CSV-05: CSV output for a given sample is byte-for-byte reproducible (deterministic).
- T-CSV-06: No trailing commas; no quoted fields (all values are numbers or bare identifiers).
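The fixed decimal precisions in the column table (3 dp for utime/stime, 4 dp for cpu_usage, 6 dp for the disk-space columns, integers for byte counters) are what make T-CSV-05's byte-for-byte determinism achievable. A hypothetical helper covering a fragment of the row, not the shipped writer:

```rust
// Sketch: deterministic fixed-precision formatting for a few CSV columns.
fn format_csv_fragment(
    timestamp: u64,
    utime_secs: f64,          // 3 decimal places
    cpu_usage_cores: f64,     // 4 decimal places
    disk_space_total_gb: f64, // 6 decimal places
    net_recv_bytes: f64,      // rate × interval, truncated to integer bytes
) -> String {
    format!(
        "{},{:.3},{:.4},{:.6},{}",
        timestamp,
        utime_secs,
        cpu_usage_cores,
        disk_space_total_gb,
        net_recv_bytes as u64,
    )
}
```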
8. Host and Cloud Discovery
The binary SHOULD collect machine-level metadata once at startup and include it
in the Sentinel API registration payload (Section 9.1). Collected fields use the prefix host_ or cloud_.
8.1 Host Discovery
All fields are optional; collection failure MUST be silently swallowed.
| Field | Type | Source |
|---|---|---|
host_id | Option<String> | AWS: /sys/class/dmi/id/board_asset_tag; fallback: /etc/machine-id |
host_name | Option<String> | gethostname(3) |
host_ip | Option<String> | First non-loopback IPv4 from getifaddrs(3) |
host_allocation | Option<String> | "dedicated" or "shared"; heuristic TBD |
host_vcpus | Option<u32> | Count of logical CPUs (/proc/cpuinfo processor entries) |
host_cpu_model | Option<String> | /proc/cpuinfo model name field |
host_memory_mib | Option<u64> | MemTotal / 1024 from /proc/meminfo |
host_gpu_model | Option<String> | First GPU name from GpuCollector |
host_gpu_count | Option<u32> | Length of GPU Vec |
host_gpu_vram_mib | Option<u64> | Sum of vram_total_bytes / 1_048_576 across all GPUs |
host_storage_gb | Option<f64> | Sum of capacity_bytes / 1_000_000_000 across all block devices |
Users MUST be able to suppress any field by setting the corresponding
environment variable to "0" or "" (exact mechanism TBD in implementation).
8.2 Cloud Discovery
Cloud metadata is probed by making HTTP GET requests to each cloud provider’s Instance Metadata Service (IMDS) with a short timeout (≤ 2 seconds per provider). Probes MUST be attempted in the background and MUST NOT delay the first sample emission.
| Field | Probe endpoint | Notes |
|---|---|---|
cloud_vendor_id | AWS: 169.254.169.254/latest/meta-data/; GCP: metadata.google.internal; Azure: 169.254.169.254/metadata/instance | Infer vendor from which endpoint responds |
cloud_account_id | AWS: /latest/meta-data/identity-credentials/ec2/info | |
cloud_region_id | AWS: /latest/meta-data/placement/region | |
cloud_zone_id | AWS: /latest/meta-data/placement/availability-zone | |
cloud_instance_type | AWS: /latest/meta-data/instance-type |
Verifiable Cloud Discovery Tests:
- T-CLD-01: On a non-cloud host, all cloud_* fields are None and the binary does not hang for more than 5 seconds total on startup.
- T-CLD-02: IMDS probe timeout is ≤ 2 seconds per provider.
9. Sentinel API Streaming (Extra Component)
Activation is gated on the SENTINEL_API_TOKEN environment variable being set.
Resolved design decisions:
- Streaming is enabled automatically whenever SENTINEL_API_TOKEN is set; no additional flag is needed.
- Upload format is csv.gz only; jsonl.gz is not supported.
- Streaming is not separately configurable via TOML or CLI beyond the token env var.
- On network unavailability: start_run logs a warning and disables streaming; local stdout output continues normally (see Section 11 error handling).
9.1 Authentication
The binary MUST read the API token from the environment variable
SENTINEL_API_TOKEN. Every Sentinel API request MUST include the HTTP header:
Authorization: Bearer <token>
If SENTINEL_API_TOKEN is not set, all streaming functionality MUST be silently disabled. Local stdout emission continues normally.
9.2 Run Lifecycle
9.2.1 Start of Run
At startup (after host/cloud discovery), the binary MUST POST to the data ingestion endpoint to register a new Run.
POST /runs (default base URL: https://api.sentinel.sparecores.net).
Request payload (JSON, Content-Type: application/json): all metadata, host, and
cloud fields are merged into a flat top-level object (no nesting):
{
"job_name": "...",
"project_name": "...",
"pid": 12345,
"host_vcpus": 8,
"cloud_vendor_id": "aws",
...
}
Response fields the binary MUST store:
| Response Field | Type | Usage |
|---|---|---|
run_id | String | Referenced in all subsequent API calls |
upload_uri_prefix | String | S3 URI prefix for metric uploads |
upload_credentials.access_key | String | STS credential |
upload_credentials.secret_key | String | STS credential |
upload_credentials.session_token | String | STS credential |
upload_credentials.expiration | String (ISO 8601) | STS credential expiry; optional |
9.2.2 Batch Upload (Background Thread)
The binary MUST start a background thread that:
- Every 60 seconds (configurable, default 60), takes all samples collected since the previous upload.
- Serializes them as CSV (same column layout as Section 7.2); CSV is the only accepted format for the Sentinel S3 bucket.
- Gzip-compresses the CSV bytes.
- Generates a unique S3 object key under upload_uri_prefix: <upload_uri_prefix>/<run_id>/<batch_seq_number>.csv.gz
- Uploads via AWS Signature V4 (Section 10).
- Appends the uploaded URI to an internal list uploaded_uris.
If STS credentials are within 5 minutes of expiration, the binary MUST refresh
them by POSTing to /runs/{run_id}/refresh-credentials before attempting the upload.
Upload failures MUST be retried at least once with exponential back-off before being recorded as errors. After 3 consecutive upload failures the background thread MUST log a warning and continue buffering (data is not lost).
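The retry requirement can be sketched generically; the delays, attempt count, and names below are illustrative assumptions (the spec only mandates at least one retry with exponential back-off):

```rust
use std::{thread, time::Duration};

// Sketch: call `upload` up to `max_attempts` times, doubling the delay
// between attempts, and return the last error if all attempts fail.
fn upload_with_retry<F>(mut upload: F, max_attempts: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut delay = Duration::from_millis(100);
    let mut last_err = String::new();
    for attempt in 1..=max_attempts {
        match upload() {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
        if attempt < max_attempts {
            thread::sleep(delay);
            delay *= 2; // exponential back-off between attempts
        }
    }
    Err(last_err)
}
```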
Verifiable Streaming Tests:
- T-STR-01: Without SENTINEL_API_TOKEN, no HTTP connection is made.
- T-STR-02: A batch upload request contains Content-Encoding: gzip and the body decompresses to valid CSV.
- T-STR-03: uploaded_uris contains the S3 URIs of all successfully uploaded batches.
- T-STR-04: Credential refresh is triggered when ≤ 5 minutes remain before the credential expiration.
9.2.3 End of Run
When the tracked process terminates (or the binary receives SIGTERM), the binary MUST:
- Flush any remaining samples as a final batch upload (if uploaded_uris is non-empty).
- POST to /runs/{run_id}/finish to close the Run, including:
  - run_id
  - exit_code (i32, if the tracked process exited cleanly; else None)
  - run_status enum: "finished" (exit 0 or SIGTERM) or "failed" (non-zero exit)
  - data_source: "s3" + data_uris: Vec<String> if any S3 uploads succeeded; "inline" + data_csv: <base64(gzip(csv))> for short runs with no S3 uploads.
SIGINT note: An explicit SIGINT handler is not installed. When the binary is used in shell-wrapper mode, Ctrl-C is delivered to the entire process group, so both the child and the tracker receive SIGINT and exit together. Explicit SIGTERM forwarding to the child process is a future enhancement.
Verifiable End-of-Run Tests:
- T-EOR-01: On SIGTERM, the binary exits with code 0 after flushing remaining data.
- T-EOR-02: The close-run request body contains run_id matching the start-run response.
- T-EOR-03: data_source is "inline" when no S3 uploads occurred.
- T-EOR-04: data_source is "s3" when at least one S3 upload succeeded.
9.3 Metadata Fields
The following metadata MAY be supplied by the user via CLI flags or environment variables. All are optional strings unless noted.
| Field | CLI Flag | Env Variable |
|---|---|---|
job_name | --job-name | TRACKER_JOB_NAME |
project_name | --project-name | TRACKER_PROJECT_NAME |
stage_name | --stage-name | TRACKER_STAGE_NAME |
task_name | --task-name | TRACKER_TASK_NAME |
team | --team | TRACKER_TEAM |
env | --env | TRACKER_ENV |
language | --language | TRACKER_LANGUAGE |
orchestrator | --orchestrator | TRACKER_ORCHESTRATOR |
executor | --executor | TRACKER_EXECUTOR |
external_run_id | --external-run-id | TRACKER_EXTERNAL_RUN_ID |
container_image | --container-image | TRACKER_CONTAINER_IMAGE |
Users MUST also be able to supply arbitrary key-value tags via repeated --tag key=value flags.
10. S3 Upload — AWS Signature V4
The upload is implemented in pure Rust without any AWS SDK dependency (zero
additional transitive deps for this path). The implementation mirrors the
Python s3_upload.py module from PR #9.
10.1 URI Parsing
An S3 URI has the form s3://bucket/path/to/object. Parsing MUST:
- Require scheme == "s3".
- Require a non-empty bucket name.
- Require a non-empty key (path after bucket).
- Return an error for any other form.
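The rules above fit in a small stdlib-only function; this sketch uses the parse_s3_uri name from the tests in Section 10.6, though the error type and messages are illustrative:

```rust
// Sketch: split "s3://bucket/key" into (bucket, key), rejecting anything else.
fn parse_s3_uri(uri: &str) -> Result<(String, String), String> {
    let rest = uri
        .strip_prefix("s3://")
        .ok_or_else(|| format!("not an s3:// URI: {uri}"))?;
    let (bucket, key) = rest
        .split_once('/')
        .ok_or_else(|| format!("missing object key: {uri}"))?;
    if bucket.is_empty() {
        return Err(format!("empty bucket name: {uri}"));
    }
    if key.is_empty() {
        return Err(format!("empty object key: {uri}"));
    }
    Ok((bucket.to_string(), key.to_string()))
}
```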
10.2 Bucket Region Detection
If the upload region is not supplied, the binary MUST determine it by sending
an HTTP HEAD request to https://<bucket>.s3.amazonaws.com/ and reading the
x-amz-bucket-region response header. The header is present even on 3xx/4xx
responses. Results MUST be cached in-process for the lifetime of the run.
Default fallback: "eu-central-1".
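The caching rule (and T-S3-05) can be sketched with the network lookup injected as a closure standing in for the HEAD request; RegionCache and region_for are illustrative names, not the shipped API:

```rust
use std::collections::HashMap;

// Sketch: in-process, per-run cache of bucket → region.
struct RegionCache {
    regions: HashMap<String, String>,
}

impl RegionCache {
    fn new() -> Self {
        Self { regions: HashMap::new() }
    }

    /// Returns the bucket's region, calling `probe` at most once per bucket.
    fn region_for<F>(&mut self, bucket: &str, probe: F) -> String
    where
        F: FnOnce(&str) -> Option<String>,
    {
        if let Some(region) = self.regions.get(bucket) {
            return region.clone(); // cache hit: no HEAD request
        }
        // Fall back to the spec default when the probe yields nothing.
        let region = probe(bucket).unwrap_or_else(|| "eu-central-1".to_string());
        self.regions.insert(bucket.to_string(), region.clone());
        region
    }
}
```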
10.3 Request Construction
A PUT request to https://<bucket>.s3.<region>.amazonaws.com/<key> with:
- Content-Length: byte count of body.
- x-amz-content-sha256: SHA-256 hex of body.
- x-amz-date: YYYYMMDDTHHmmSSZ (UTC).
- x-amz-security-token: STS session token.
- Authorization: AWS4-HMAC-SHA256 signature (see Section 10.4).
10.4 AWS Signature V4
Signing key derivation:
kDate = HMAC-SHA256("AWS4" + secret_key, date_stamp)
kRegion = HMAC-SHA256(kDate, region)
kService = HMAC-SHA256(kRegion, "s3")
kSigning = HMAC-SHA256(kService, "aws4_request")
Canonical request:
PUT
/<key>
host:<bucket>.s3.<region>.amazonaws.com
x-amz-content-sha256:<payload_hash>
x-amz-date:<amz_date>
x-amz-security-token:<session_token>
host;x-amz-content-sha256;x-amz-date;x-amz-security-token
<payload_hash>
String to sign:
AWS4-HMAC-SHA256
<amz_date>
<date_stamp>/<region>/s3/aws4_request
<canonical_request_sha256>
Authorization header:
AWS4-HMAC-SHA256 Credential=<access_key>/<credential_scope>, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token, Signature=<hex_sig>
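Assembling the header from already-computed parts is plain string formatting; this sketch (illustrative name) omits the HMAC chain itself, which needs a SHA-256 implementation such as the `sha2`/`hmac` crates:

```rust
// Sketch: build the Authorization header from pre-computed SigV4 parts.
fn authorization_header(
    access_key: &str,
    date_stamp: &str, // YYYYMMDD
    region: &str,
    hex_signature: &str,
) -> String {
    let credential_scope = format!("{date_stamp}/{region}/s3/aws4_request");
    format!(
        "AWS4-HMAC-SHA256 Credential={access_key}/{credential_scope}, \
         SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token, \
         Signature={hex_signature}"
    )
}
```

Combined with a fixed timestamp and known credentials, this is the piece exercised by the golden-value test T-S3-04.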
10.5 Upload Success Criteria
HTTP 200 or 201 response from S3 = success. Any other status = error (with response body included in the error message).
10.6 Verifiable S3 Upload Tests
- T-S3-01: parse_s3_uri("s3://bucket/path/obj") returns ("bucket", "path/obj").
- T-S3-02: parse_s3_uri("https://bucket/path") returns an error.
- T-S3-03: parse_s3_uri("s3://bucket/") returns an error (empty key).
- T-S3-04: Given known access_key, secret_key, session_token, region, and a fixed timestamp, the generated Authorization header MUST match a pre-computed golden value.
- T-S3-05: The bucket region cache prevents duplicate HEAD requests for the same bucket.
- T-S3-06: An upload to a mock S3 server returns the S3 URI on success.
11. Error Handling
| Scenario | Required behavior |
|---|---|
| /proc file is unreadable for a single metric | Return 0 / None for that field; do not abort |
| GPU library absent | GPU Vec is empty; no error propagated |
| Sentinel API unreachable at start | Log warning; streaming disabled; local output continues |
| S3 upload fails | Retry once; after 3 consecutive failures log warning and continue |
| Config TOML parse error | Silently fall back to defaults |
| --interval 0 | Exit with code ≠ 0 before starting collectors |
| Tracked PID not found | process_cores_used = None; do not abort |
The binary MUST NOT panic in production code. expect() is permissible only during development;
all expect() calls MUST be replaced with proper error handling before the v1.0 release.
12. Non-Functional Requirements
| Requirement | Target |
|---|---|
| Binary size | < 15 MiB stripped (CPU-only build) |
| Startup latency | < 1 × configured interval before first sample |
| CPU overhead of the tracker itself | < 1% of one core at 1-second interval on a 4-core host |
| Memory footprint | < 20 MiB RSS at steady state |
| Stdout buffering | Each line MUST be flushed atomically (no partial lines) |
13. Compatibility with Python resource-tracker
The CSV output format MUST maintain byte-for-byte column-name compatibility
with the Python SystemTracker output so that the Sentinel API backend can
ingest both without schema changes.
Confirmed equivalent columns (see Section 7.2 for derivation):
| Python column | Rust CSV column | Python unit | Rust unit |
|---|---|---|---|
timestamp | timestamp | Unix seconds | Unix seconds |
processes | processes | count | count |
utime | utime | seconds | seconds |
stime | stime | seconds | seconds |
cpu_usage | cpu_usage | fractional cores | fractional cores |
memory_free | memory_free | MiB | MiB |
memory_used | memory_used | MiB | MiB |
memory_buffers | memory_buffers | MiB | MiB |
memory_cached | memory_cached | MiB | MiB |
memory_active | memory_active | MiB | MiB |
memory_inactive | memory_inactive | MiB | MiB |
disk_read_bytes | disk_read_bytes | bytes/interval | bytes/interval |
disk_write_bytes | disk_write_bytes | bytes/interval | bytes/interval |
disk_space_total_gb | disk_space_total_gb | GB (10⁹) | GB (10⁹) |
disk_space_used_gb | disk_space_used_gb | GB (10⁹) | GB (10⁹) |
disk_space_free_gb | disk_space_free_gb | GB (10⁹) | GB (10⁹) |
net_recv_bytes | net_recv_bytes | bytes/interval | bytes/interval |
net_sent_bytes | net_sent_bytes | bytes/interval | bytes/interval |
gpu_usage | gpu_usage | fractional GPUs | fractional GPUs |
gpu_vram | gpu_vram | MiB | MiB |
gpu_utilized | gpu_utilized | count | count |
Verifiable compatibility test:
T-COMPAT-01: Run Python and Rust trackers in parallel on the same host for 60 seconds. For each interval, the difference between corresponding scalar columns MUST be within 5% of the Python value (allowing for measurement-time skew).
14. Open Questions / Future Work
- eBPF integration: Using aya-rs or libbpf-rs for sub-millisecond tracing (CPU saturation, IPC, TLB misses, cache hit rates); currently considered v2.
- Process-level memory (PSS): Preferred over RSS; requires reading /proc/<pid>/smaps_rollup, which may be slow for large processes.
- Per-process disk and network I/O: /proc/<pid>/io and network namespaces; currently only system-wide.
- Configurable metric suppression: Allow users to opt out of fields containing PII (e.g. host_ip, hostname).
- ARM-specific GPU support: Apple Metal not in scope (Linux only); Qualcomm Adreno / Mali GPU metrics TBD.
- Static linking of NVML: Currently not possible; NVML requires a dynamically loaded vendor library.
- Heartbeat endpoint: Periodic ping to Sentinel API while tracking is active (distinct from batch S3 uploads).