autopilot/_docs/02_document/tests/resource-limit-tests.md

# Resource Limit Tests

Authored by `/test-spec` Phase 2 (2026-05-19). Resource-limit tests assert that the SUT stays within a quantified resource ceiling for the configured duration. Short bursts do not satisfy these tests — every scenario has an explicit sustained-monitoring window.

---

### NFT-RES-LIM-Re1: Combined onboard RSS ≤ 6 GB sustained
**Summary**: Combined process RSS on the deployed compute device for everything autopilot owns onboard (excluding Tier 1) MUST stay ≤ 6 GB throughout a 5-minute steady-state window with the full onboard workload active.
**Traces to**: AC `Resources & Data — Combined RSS on the deployed compute device, for everything autopilot owns onboard (excluding Tier 1), MUST stay within ≤ 6 GB / Re1`, RESTRICT `Hardware — Compute device: Jetson Orin Nano Super, 8 GB shared LPDDR5; Tier 1 consumes ~2 GB, leaving ~6 GB for autopilot`.

**Tier**: HW (representative Jetson Orin Nano Super). Pure-x86 runs report informational results only and do NOT satisfy the project-level Acceptance Gate.

**Preconditions**:
- Full onboard workload active: frame ingest from `rtsp-loopback`, Tier-2 + Tier-3 (when enabled) inferring at the documented steady-state load, gimbal commands flowing, MAVLink stream consumed at 10 Hz, operator-stream connected, MapObjects store hydrated for a 30×30 km region.
- Warm-up: 60 s before measurement starts (any first-load model warm-up complete).
- Tier-1 process is RUNNING in parallel but its RSS is EXCLUDED from the measurement (the AC scope is autopilot-owned RSS, excluding Tier 1).

**Monitoring**:
- Cgroup-level RSS for every process the SUT owns (the SUT binary plus any child processes it spawns — e.g., the VLM IPC peer if it lives in autopilot's cgroup), sampled at 1 Hz.
- Cgroup-level RSS for Tier 1 sampled at the same cadence (for the Re2 cross-reference).
- Per-process RSS captured to `reports/<run-id>/rss-trace.csv` for forensic review on failure.

**Duration**: 5 minutes of measurement after warm-up.

**Pass criteria**:
- `threshold_max`: per 1 s sample, `sum(autopilot_owned_RSS) ≤ 6 GB`.
- No single 1 s sample exceeds the ceiling.
- (Reporting only — not pass/fail): peak RSS, mean RSS, P95 RSS recorded in the CSV report.
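A minimal sketch of how the per-sample ceiling and the reporting statistics above could be evaluated offline from the 1 Hz trace. The function names, the in-memory sample layout (one dict of process name to RSS bytes per tick), and the reading of "6 GB" as decimal gigabytes are all assumptions made for illustration; they are not taken from the harness.

```python
# Assumption: "6 GB" is read as decimal gigabytes; swap in 6 * 1024**3 for GiB.
RSS_CEILING_BYTES = 6 * 1000**3


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def evaluate_rss_trace(samples, ceiling=RSS_CEILING_BYTES):
    """samples: one dict per 1 s tick, mapping process name -> RSS in bytes.

    Tier-1 processes are expected to be absent from these dicts, since their
    RSS is excluded from the measurement.
    """
    totals = [sum(tick.values()) for tick in samples]
    breaches = [i for i, total in enumerate(totals) if total > ceiling]
    return {
        "passed": not breaches,        # a single breaching sample fails the run
        "breach_indices": breaches,
        "peak": max(totals),           # reporting only
        "mean": sum(totals) / len(totals),
        "p95": percentile(totals, 95),
    }
```

Returning the breach indices alongside the verdict makes it easier to line a failure up with the corresponding rows in `rss-trace.csv`.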

**Test status**: DEFERRED — `<DEFERRED: long-running scenario harness exercising the full onboard workload for 5 min; inline-authorable but requires that the SUT be operational end-to-end first>`.

---

### NFT-RES-LIM-Re2: Tier-1 non-degradation under autopilot workload
**Summary**: When autopilot's full onboard workload runs concurrently with Tier 1 on the same Jetson, Tier-1 per-frame latency MUST NOT degrade by more than ± 5 ms versus the Tier-1-alone baseline (recorded by NFT-PERF-L1).
**Traces to**: AC `Resources & Data — Tier 1 per-frame latency MUST NOT degrade by more than ± 5 ms when autopilot's own onboard workload is running concurrently / Re2`, RESTRICT `Tier 1 (YOLO) and any local large model with GPU memory pressure share the Jetson GPU — only one of them may execute at any wall-clock instant`.

**Tier**: HW (the only meaningful environment for this assertion — GPU contention behaviour does not reproduce on x86).

**Preconditions**:
- NFT-PERF-L1 has been run on the same HW configuration in the SAME session and a baseline `tier1_baseline_p95_ms` recorded.
- Full onboard workload active (same as Re1).

**Monitoring**:
- Tier-1 per-frame latency sampled per frame for the duration of the test.
- The same metric source as NFT-PERF-L1 — for direct delta comparison.

**Duration**: 5 minutes of measurement after warm-up (matches Re1 window so both can run in the same session).

**Pass criteria**:
- `numeric_tolerance`: `|p95(tier1_with_autopilot) - tier1_baseline_p95_ms| ≤ 5 ms`.
- (Reporting only): mean, P95, max delta over the window.
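The `numeric_tolerance` check above can be sketched as follows, assuming the per-frame latencies are available as a plain list of milliseconds and that P95 is computed by nearest rank (the source does not say which percentile method NFT-PERF-L1 uses, so the two must be kept identical in practice). `check_tier1_delta` is a hypothetical helper name.

```python
TOLERANCE_MS = 5.0  # from the Re2 pass criterion


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check_tier1_delta(latencies_ms, tier1_baseline_p95_ms, tolerance_ms=TOLERANCE_MS):
    """Compare the colocated P95 against the Tier-1-alone baseline."""
    observed = p95(latencies_ms)
    delta = observed - tier1_baseline_p95_ms
    return {
        "p95_ms": observed,
        "delta_ms": delta,                       # signed, for the report
        "passed": abs(delta) <= tolerance_ms,    # |delta| <= 5 ms
    }
```

Keeping the signed delta in the result preserves whether Tier 1 got slower or faster, even though the verdict only looks at the magnitude.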

**Test status**: DEFERRED — same fixture dependency as Re1; requires SUT operational + Tier 1 colocated on HW.

---

### NFT-RES-LIM-Storage: On-device persistent store stays under 95 % for in-flight operation
**Summary**: During a steady-state mission run (no abnormal load), the on-device persistent store MUST NOT exceed 95 % full. This protects the takeoff gate (R3) from being silently violated mid-mission and protects the post-flight push (Mp4) from running out of room to persist a failed diff.
**Traces to**: AC `Reliability & Safety — On-device storage MUST be bounded` (via R3 BIT gate), RESTRICT `On-device storage MUST be bounded`.

**Tier**: B + HW.

**Preconditions**:
- SUT mid-flight; persistent store at typical post-takeoff utilisation (e.g. 30 %).
- Normal-operation event volume: telemetry persistence, ignored-item appends, pending map-diff buffer (empty in this scenario).

**Monitoring**:
- Volume utilisation sampled at 10 Hz throughout the duration.

**Duration**: 60 minutes (representative mission duration per Mp3).

**Pass criteria**:
- `threshold_max`: `volume_used / volume_total ≤ 0.95` at every sample point.
- At 85 %: structured-log INFO `storage_pressure` with current utilisation.
- At 90 %: structured-log WARN with current utilisation; health.storage transitions to yellow.
- At 95 %: health.storage transitions to red and the SUT begins its documented eviction policy (this scenario does NOT test the policy semantics, which belong to their own scenario; it only asserts that the policy IS triggered).
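The threshold ladder above can be expressed as a small classifier that a harness might use to derive the expected signals for each utilisation sample. This is an illustrative sketch only: `classify_storage` is a hypothetical name, "green" is an assumed label for the nominal health state, and the source names no log level at 95 %, so none is returned there.

```python
def classify_storage(volume_used, volume_total):
    """Return (within_ceiling, expected_health, expected_log_level) for one sample.

    Both arguments are integer byte counts. Integer cross-multiplication keeps
    the comparison exact at the 85 / 90 / 95 % boundaries.
    """
    scaled = volume_used * 100
    within_ceiling = scaled <= 95 * volume_total  # the threshold_max criterion
    if scaled >= 95 * volume_total:
        return within_ceiling, "red", None        # eviction policy expected to start
    if scaled >= 90 * volume_total:
        return within_ceiling, "yellow", "WARN"
    if scaled >= 85 * volume_total:
        return within_ceiling, "green", "INFO"    # storage_pressure log expected
    return within_ceiling, "green", None
```

A sample at exactly 95 % still satisfies the ceiling while already requiring the red transition, which matches the pass criteria as written.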

**Test status**: READY (no external fixture beyond the SUT itself; the persistent-store seed file controls starting utilisation).

---

### NFT-RES-LIM-CPU: CPU headroom for the Tier-1 colocation guarantee
**Summary**: Combined CPU utilisation of every autopilot-owned process MUST leave enough Jetson CPU headroom for Tier 1 to keep its NFT-PERF-L1 budget. Concretely: per-second sustained CPU usage by autopilot-owned processes MUST stay ≤ the configured budget (default 60 % of total CPU cycles measured at the cgroup level) for the duration of the run.
**Traces to**: AC `Resources & Data — Tier 1 per-frame latency MUST NOT degrade by more than ± 5 ms / Re2` (CPU-side mechanism backing Re2), RESTRICT `Hardware — Jetson Orin Nano Super`.

**Tier**: HW (CPU contention does not reproduce on x86).

**Preconditions**:
- Same workload as Re1 + Re2.

**Monitoring**:
- Cgroup CPU usage at 1 Hz.

**Duration**: 5 minutes after warm-up.

**Pass criteria**:
- `threshold_max`: per 1 s sample, `sum(autopilot_cpu_usage) ≤ 60 %` of total CPU (the default configured budget).
- Reporting: mean, P95, max.
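One way to derive the per-second percentage is from the cumulative `usage_usec` counter in the cgroup v2 `cpu.stat` file. The sketch below assumes cgroup v2 is in use, that "total CPU" means all online cores combined, and that readings are taken exactly `interval_s` apart; the function names are hypothetical.

```python
def parse_usage_usec(cpu_stat_text):
    """Extract the cumulative usage_usec counter from cgroup v2 cpu.stat text."""
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value)
    raise ValueError("usage_usec not found in cpu.stat")


def check_cpu_budget(usage_readings_usec, n_cpus, interval_s=1.0, budget_pct=60.0):
    """usage_readings_usec: consecutive cumulative readings, one per interval.

    Each adjacent pair yields one sample: CPU time consumed in the interval
    divided by the CPU time available across all cores.
    """
    available_usec = interval_s * 1_000_000 * n_cpus
    percents = [
        (curr - prev) * 100 / available_usec
        for prev, curr in zip(usage_readings_usec, usage_readings_usec[1:])
    ]
    return {
        "passed": all(p <= budget_pct for p in percents),
        "mean_pct": sum(percents) / len(percents),  # reporting only
        "max_pct": max(percents),                   # reporting only
    }
```

Because the counter is cumulative, N + 1 readings are needed to produce N one-second samples.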

**Test status**: DEFERRED — same dependency as Re1/Re2.

---

### NFT-RES-LIM-GPU: GPU mutual exclusion contract (Tier 1 vs local large model)
**Summary**: Per RESTRICT (`Tier 1 (YOLO) and any local large model with GPU memory pressure share the Jetson GPU — only one of them may execute at any wall-clock instant`), the SUT MUST NOT issue a GPU compute call (e.g. Tier-3 VLM inference) while Tier 1 is executing on the GPU. The serialisation MUST be observable: a single GPU is busy at one instant.
**Traces to**: RESTRICT `Tier 1 and any local large model … only one of them may execute at any wall-clock instant`.

**Tier**: HW.

**Preconditions**:
- Tier 1 active; SUT in a ZoomedIn hold with deep-analysis enabled (Tier-3 will fire).

**Monitoring**:
- GPU-instance occupancy via `tegrastats` / equivalent at the highest available sampling rate.
- The SUT's own internal "compute-class" telemetry exposed on the health endpoint as `gpu_owner_current` ∈ { "tier1", "tier3", "idle" }.

**Duration**: 60 s containing ≥ 5 Tier-3 hold cycles.

**Pass criteria**:
- `exact`: at every sample point, `gpu_owner_current` takes exactly one value from { "tier1", "tier3", "idle" }; Tier 1 and Tier 3 are never reported as concurrent owners.
- `tegrastats` peak GPU occupancy attributable to autopilot processes never overlaps Tier 1's known activity window for the same wall-clock instant.
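Both criteria above reduce to simple checks once the raw data has been turned into owner samples and activity windows. The sketch below assumes that preprocessing has already happened: how windows are extracted from `tegrastats` output is not covered, and the helper names and half-open `(start_s, end_s)` window representation are hypothetical.

```python
ALLOWED_OWNERS = {"tier1", "tier3", "idle"}


def invalid_owner_samples(samples):
    """Return every sampled gpu_owner_current value outside the allowed set."""
    return [value for value in samples if value not in ALLOWED_OWNERS]


def overlapping_windows(tier1_windows, autopilot_windows):
    """Return pairs of windows that share any wall-clock instant.

    Windows are half-open (start_s, end_s) intervals, so a window that ends
    exactly when another begins does not count as an overlap.
    """
    overlaps = []
    for a_start, a_end in autopilot_windows:
        for t_start, t_end in tier1_windows:
            if a_start < t_end and t_start < a_end:
                overlaps.append(((a_start, a_end), (t_start, t_end)))
    return overlaps
```

The scenario passes only when both helpers return empty lists for the whole 60 s run.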

**Test status**: DEFERRED — depends on the SUT being operational end-to-end + Tier-3 enabled; also depends on the SUT exposing `gpu_owner_current` (which is an architectural choice not yet locked).

---

### NFT-RES-LIM-FileHandles: File-descriptor and socket bound
**Summary**: Sustained operation MUST NOT leak file descriptors or sockets. The count MUST stay within a documented headroom of the initial-post-warmup baseline for the duration of the run.
**Traces to**: RESTRICT `On-device storage MUST be bounded` (general bounded-resource principle), security principle `No silent error swallowing for security-relevant failures` (FD exhaustion would silently break the operator-stream).

**Tier**: B + HW.

**Preconditions**:
- Warm-up: 60 s.
- Workload: full onboard workload at steady state.

**Monitoring**:
- `/proc/<pid>/fd` count per autopilot process at 1 Hz.

**Duration**: 60 minutes.

**Pass criteria**:
- `threshold_max`: at every sample point, `fd_count ≤ fd_baseline_post_warmup + 50` (50 = documented churn headroom for intermittent operator reconnects).
- A monotonically rising trend (slope > 0 over the run) is a TEST FAILURE even if the absolute ceiling is not breached.
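The two criteria above are independent, so a run can fail on trend alone. A minimal sketch, assuming the 1 Hz counts are held in a list and that "slope" means the ordinary least-squares slope over sample index (the source does not say how the slope is computed); `check_fd_trace` is a hypothetical name.

```python
def fd_slope(counts):
    """Least-squares slope of fd count against sample index (needs >= 2 samples)."""
    n = len(counts)
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    numerator = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(counts))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def check_fd_trace(counts, fd_baseline_post_warmup, headroom=50):
    """Apply the absolute ceiling and the rising-trend criterion to one process."""
    ceiling = fd_baseline_post_warmup + headroom
    breaches = [i for i, c in enumerate(counts) if c > ceiling]
    slope = fd_slope(counts)
    return {
        "passed": not breaches and slope <= 0,  # slope > 0 fails even under the ceiling
        "breach_indices": breaches,
        "slope_per_sample": slope,
    }
```

On Linux the per-sample count itself could be taken as `len(os.listdir(f"/proc/{pid}/fd"))`, which requires permission to read the target process's `fd` directory.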

**Test status**: READY for a Tier-B run; gains its real value once HW + sustained-workload land.

---

## Common assertions for every resource-limit scenario

- **Sustained-monitoring is non-negotiable.** Each scenario specifies a duration ≥ 60 s; short bursts that pass do not satisfy the test. The CSV report records the full sample trace path under `artifacts_path`.
- **No silent eviction.** Where a ceiling is approached, the SUT MUST surface the pressure (structured-log INFO at 85 %, WARN at 90 %, transition to yellow/red on health) BEFORE reaching the ceiling. A pass with no observable pressure signal at thresholds is a TEST FAILURE.
- **HW reporting vs gating.** Pure-x86 runs report informational deltas only; they do NOT satisfy the project-level Acceptance Gate. Every CSV row records its tier so this distinction stays auditable.
- **Re1 + Re2 are paired.** Re1 establishes the autopilot RSS ceiling; Re2 establishes that respecting Re1 does not cost Tier 1 latency. They MUST be run in the same session to make the Re2 baseline meaningful.