[AZ-626] Decompose complete: 47 tasks + docs + module layout

Greenfield Steps 1-6 baseline for the autopilot rewrite from legacy
Qt/C++ to a Rust workspace.

- Remove legacy Qt/C++ tree (ai_controller, drone_controller,
  misc/camera, python_scaffold, root Dockerfile, autopilot.pro,
  legacy main.py / requirements.txt).
- Add _docs/00_problem (problem, restrictions, acceptance criteria,
  security approach, input data + fixtures).
- Add _docs/01_solution/solution_draft01.
- Add _docs/02_document (architecture, system-flows, data_model,
  glossary, decision-rationale, deployment, 13 component descriptions,
  tests/ specs, FINAL_report, module-layout).
- Add _docs/02_tasks/todo with 47 task specs (AZ-640..AZ-686, one
  bootstrap + 46 component tasks) and _dependencies_table.md.
- Add .cursor/rules/artifact-srp.mdc (single-responsibility rule for
  canonical _docs artifacts).
- Track autodev state in _docs/_autodev_state.md (Step 6 completed,
  ready for Step 7 Implement).

Jira: bootstrap AZ-626; component epics AZ-627..AZ-639; tasks
AZ-640..AZ-686. Total complexity 173 points across 12 epics.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-19 11:02:01 +03:00
parent f7d6cb4a3a
commit bc40ea7300
235 changed files with 12585 additions and 15097 deletions
@@ -0,0 +1,535 @@
# Blackbox Tests
Authored by `/test-spec` Phase 2 (2026-05-19). Every scenario observes the SUT only through public surfaces (RTSP, gRPC, MAVLink, REST, operator stream, gimbal UDP, VLM IPC, health endpoint, structured logs). No scenario imports internal modules or peeks at on-device state directly.
Each scenario header records:
- **Summary** — one-line behaviour validated.
- **Traces to** — AC ID(s) and any RESTRICT ID.
- **Tier** — execution tier required (U / I / B / E / HW).
- **Test status** — `READY` or `DEFERRED — <reason>` (per the 2026-05-19 override, deferred scenarios are kept in this file as release-gate items).
The `Expected result` field gives the inline pass/fail criterion; the authoritative comparison lives in `_docs/00_problem/input_data/expected_results/results_report.md` (referenced by row id).
---
## Positive Scenarios — Detection Quality (functional)
### FT-P-001: Tier-1 normalised-box contract conformance
**Summary**: Frame in → autopilot must consume and re-emit the Tier-1 detection stream conforming to the suite's normalised-box schema (class ids 0..18, coords ∈ [0,1]).
**Traces to**: AC `Detection Quality / D6`, RESTRICT `Suite-level architectural splits — Tier 1 lives in ../detections`.
**Tier**: B (mock detector) + E (live `../detections`).
**Test status**: READY.
**Preconditions**:
- SUT started; `detections-mock` serving recorded Tier-1 stream for `image-set-existing`.
- `e2e-consumer` subscribed to the SUT's outbound normalised-box stream (observable via the operator-stream channel and via the internal-test-only `/debug/detections` socket IF exposed in test build; otherwise via operator-stream only).
**Input data**: `fixtures/images/4d6e1830d211ad50.jpg` (1280 px aerial frame) → encoded into `rtsp-loopback` as a 1-second loop.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Begin RTSP playback of the frame loop | SUT consumes the frame; emits a normalised-box detection record on the operator-stream channel |
| 2 | Capture one emitted detection record | Record validates against `fixtures/schemas/expected_detections.schema.json`; every bbox coord ∈ [0,1]; class_id ∈ {0..18} |
**Expected outcome**: D6 — schema-match + range comparison passes.
**Max execution time**: 10 s.
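
The range half of the D6 check in step 2 can be sketched as below. This is a minimal illustration only: the `NormBox` struct and its field names are assumptions, since the real record shape is defined by `fixtures/schemas/expected_detections.schema.json`, which is not reproduced here.

```rust
/// Hypothetical in-memory form of one normalised-box detection record.
/// Field names are assumptions; the authoritative shape is the JSON schema.
#[derive(Debug, Clone, PartialEq)]
pub struct NormBox {
    pub class_id: u32,
    pub x_min: f64,
    pub y_min: f64,
    pub x_max: f64,
    pub y_max: f64,
}

/// Returns Ok(()) when class_id is in 0..=18 and every coordinate is in [0, 1].
pub fn check_norm_box(b: &NormBox) -> Result<(), String> {
    if b.class_id > 18 {
        return Err(format!("class_id {} outside 0..=18", b.class_id));
    }
    let coords = [
        ("x_min", b.x_min),
        ("y_min", b.y_min),
        ("x_max", b.x_max),
        ("y_max", b.y_max),
    ];
    for (name, v) in coords {
        // NaN is rejected too, because `contains` is false for NaN.
        if !(0.0..=1.0).contains(&v) {
            return Err(format!("{name} = {v} outside [0, 1]"));
        }
    }
    Ok(())
}
```

The sketch does not check corner ordering (for example `x_min <= x_max`), because the scenario only states the range rules.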
---
### FT-P-002: Existing-class regression vs documented baseline
**Summary**: Per-class precision and recall must stay within ±2 percentage points of the pinned baseline (P=0.816, R=0.852).
**Traces to**: AC `Detection Quality — Existing-class regression / D2`.
**Tier**: E + HW (HW required for the project-level Acceptance Gate).
**Test status**: DEFERRED — expected_results baseline JSON not yet recorded (`<DEFERRED: expected_results/existing_classes_baseline.json>`). Visual fixtures (5 frames) are on disk; baseline numbers depend on a recording against the currently pinned `../detections` model.
**Preconditions**:
- Baseline JSON recorded against pinned `../detections` model (DEFERRED).
- SUT + live `../detections` running (Tier E) or HW Jetson (HW).
**Input data**: `fixtures/images/{4d6e1830...,54f6459...,6dd601b7...,805bcf1e...,f997d093...}.jpg` (5 frames).
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Stream each frame through RTSP | SUT emits detections per frame |
| 2 | Compare aggregated per-class P/R to baseline | each per-class P, R within ± 0.02 absolute of baseline |
**Expected outcome**: D2 — `numeric_tolerance` passes.
**Max execution time**: 60 s.
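
The `numeric_tolerance` comparison in step 2 might look like the following sketch. The `Pr` struct, the function name, and the use of integer class ids as map keys are assumptions for illustration; the real baseline lives in the deferred `existing_classes_baseline.json`.

```rust
use std::collections::BTreeMap;

/// Precision/recall pair for one class (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
pub struct Pr {
    pub precision: f64,
    pub recall: f64,
}

/// Returns the class ids whose precision or recall differs from the baseline
/// by more than `tol` (absolute). A class missing from `measured` is reported
/// as a failure, which is an assumption of this sketch.
pub fn regressed_classes(
    measured: &BTreeMap<u32, Pr>,
    baseline: &BTreeMap<u32, Pr>,
    tol: f64,
) -> Vec<u32> {
    let mut out = Vec::new();
    for (class_id, base) in baseline {
        let ok = match measured.get(class_id) {
            Some(m) => {
                (m.precision - base.precision).abs() <= tol
                    && (m.recall - base.recall).abs() <= tol
            }
            None => false,
        };
        if !ok {
            out.push(*class_id);
        }
    }
    out
}
```

An empty return value corresponds to D2 passing with `tol = 0.02`.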
---
### FT-P-003: New-class precision and recall ≥80%
**Summary**: New target classes (black entrances, branch piles, footpaths, roads, trees, tree blocks) reach precision ≥0.80 AND recall ≥0.80 per class.
**Traces to**: AC `Detection Quality — New target classes / D1`.
**Tier**: E + HW.
**Test status**: DEFERRED — multi-season annotated new-class eval set not acquired; annotation campaign owned by `../ai-training` repo. `<DEFERRED: expected_results/new_classes_pr.json>`.
**Preconditions**:
- Multi-season annotated new-class eval set acquired.
- Tier-1 model updated to include the new classes listed in the Summary.
**Input data**: `<DEFERRED: new-class eval set across all four seasons>`.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Stream eval-set frames through RTSP | SUT emits detections including new-class items |
| 2 | Compute per-class P, R | each ≥ 0.80 |
**Expected outcome**: D1 — `threshold_min` passes for every new class.
**Max execution time**: 120 s.
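
The `threshold_min` gates used by this scenario and by FT-P-004 to FT-P-006 all reduce to precision and recall computed from match counts. A minimal sketch follows, assuming detections have already been matched against ground truth into TP/FP/FN counts (the matching rule itself is not specified in this file); the `Counts` type and function names are hypothetical.

```rust
/// Match counts for one class (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
pub struct Counts {
    pub tp: u64,
    pub fp: u64,
    pub fn_: u64,
}

/// Precision = TP / (TP + FP); None when there are no predictions.
pub fn precision(c: &Counts) -> Option<f64> {
    let denom = c.tp + c.fp;
    if denom == 0 { None } else { Some(c.tp as f64 / denom as f64) }
}

/// Recall = TP / (TP + FN); None when there is no ground truth.
pub fn recall(c: &Counts) -> Option<f64> {
    let denom = c.tp + c.fn_;
    if denom == 0 { None } else { Some(c.tp as f64 / denom as f64) }
}

/// threshold_min gate: both metrics must be defined and at or above their minimum.
/// Treating an undefined metric as a failure is an assumption of this sketch.
pub fn meets_min(c: &Counts, min_precision: f64, min_recall: f64) -> bool {
    match (precision(c), recall(c)) {
        (Some(p), Some(r)) => p >= min_precision && r >= min_recall,
        _ => false,
    }
}
```

For D1 the call would be `meets_min(&counts, 0.80, 0.80)` per new class; for the concealed-position gates D3 and D4 the minimums are 0.60 recall and 0.20 precision.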
---
### FT-P-004: Concealed-position recall ≥60% (initial gate)
**Summary**: System surfaces concealed positions (FPV hideouts, dugouts) with recall ≥0.60, accepting high false-positive rate as operators filter.
**Traces to**: AC `Detection Quality — Concealed-position recall / D3`.
**Tier**: E + HW.
**Test status**: DEFERRED — only 4 starter PNGs on disk; full multi-season annotated set required.
**Input data**: `fixtures/semantic/semantic0[1-4].png` (starter) + `<DEFERRED full set>`.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Stream concealed-position frames | SUT emits concealed-structure POIs |
| 2 | Compute aggregate recall against ground truth | recall ≥ 0.60 |
**Expected outcome**: D3 — `threshold_min` passes.
**Max execution time**: 120 s.
---
### FT-P-005: Concealed-position precision ≥20% (initial gate)
**Summary**: Concealed-position precision ≥0.20 (operators filter; high-FP-accepting gate).
**Traces to**: AC `Detection Quality — Concealed-position precision / D4`.
**Tier**: E + HW.
**Test status**: DEFERRED — same dataset gap as FT-P-004.
**Input data**: same as FT-P-004.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Stream concealed-position frames | SUT emits POIs |
| 2 | Compute aggregate precision against ground truth | precision ≥ 0.20 |
**Expected outcome**: D4 — `threshold_min` passes.
---
### FT-P-006: Footpath recall ≥70%
**Summary**: Footpath recall ≥0.70 across multi-season polyline-annotated eval set.
**Traces to**: AC `Detection Quality — Footpath recall / D5`.
**Tier**: E + HW.
**Test status**: DEFERRED — `<DEFERRED: footpath sequences (fresh + stale, all seasons), polyline-annotated>`.
**Input data**: `fixtures/semantic/semantic0[1-4].png` (starter; 4 frames feature footpaths leading to concealment) + `<DEFERRED full multi-season set>`.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Stream footpath-bearing frames | SUT emits footpath polyline annotations |
| 2 | Compute recall against polyline ground truth | recall ≥ 0.70 |
**Expected outcome**: D5 — `threshold_min` passes.
---
## Positive Scenarios — Movement Detection Behaviour
### FT-P-007: Ego-motion compensation rejects stable scene elements
**Summary**: With the camera platform itself moving, stable elements (tree rows, houses, roads) MUST NOT generate movement candidates; only the actual mover does.
**Traces to**: AC `Movement Detection — Stable objects MUST NOT be treated as moving / M1`, RESTRICT `Operational — moving camera platform`.
**Tier**: B (with paired CSVs) + E.
**Test status**: DEFERRED — `<DEFERRED: paired gimbal.csv + telemetry.csv for video01.mp4; scene must contain 1 stable tree row + 1 moving vehicle>`.
**Preconditions**:
- `rtsp-loopback` plays `fixtures/movement/video01.mp4` at 30 fps.
- `gimbal-mock` replays paired gimbal.csv synchronised to RTSP frame timestamps.
- `mavlink-sitl` replays paired telemetry (position + attitude) for the same duration.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Begin synchronised playback (video + gimbal + telemetry) | SUT begins consuming frames and ego-motion compensating |
| 2 | Capture every movement candidate emitted on operator-stream for the clip duration | exactly 1 candidate (the vehicle); tree-row position is NOT among candidates |
**Expected outcome**: M1 — `set_contains` passes; candidate set == {vehicle}; tree-row position ∉ candidates.
**Max execution time**: clip_duration + 10 s.
---
### FT-P-008: Movement detection continues during zoomed-in hold
**Summary**: While the camera is in a zoomed-in hold on a confirmed POI, a new mover appearing in the ROI is still detected and enqueued; current ROI is preempted only if the new candidate's priority exceeds it.
**Traces to**: AC `Movement Detection — MUST continue during the zoomed-in inspection / M2`.
**Tier**: B + E.
**Test status**: DEFERRED — `<DEFERRED zoomed-in gimbal.csv + telemetry.csv pair; 1 small mover>`.
**Input data**: `fixtures/movement/video02.mp4` + DEFERRED CSV pair.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Drive SUT into ZoomedIn hold via prior FT-P-016 setup | SUT in `ZoomedIn { roi, hold_started_at }` |
| 2 | Begin playback of the zoomed-in scene with the small mover | Movement candidate enqueued within ≤ 1.5 s (timing checked by NFT-PERF-L7) |
| 3 | Observe ROI lifecycle | ROI is preempted only if new candidate's `confidence × proximity × age_factor` exceeds the held ROI's; otherwise the held ROI completes |
**Expected outcome**: M2 — `exact` passes; 1 candidate enqueued; ROI preempt decision matches the documented priority rule.
---
### FT-P-009: Per-zoom-band threshold honoured (no false candidate at edge)
**Summary**: When a movement-cluster persists for one frame BELOW the configured per-zoom-band threshold, no candidate is emitted.
**Traces to**: AC `Movement Detection — configurable per-zoom-band false-positive budget MUST be honoured / M3`.
**Tier**: B.
**Test status**: DEFERRED — `<DEFERRED gimbal.csv simulating threshold edge>`.
**Input data**: `fixtures/movement/video03.mp4` + DEFERRED CSV.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Replay scene at the threshold edge | SUT processes frames |
| 2 | Observe candidate count over the clip duration | count == 0 |
**Expected outcome**: M3 — `exact` passes.
---
### FT-P-010: Movement zoomed-in benchmark FP-rate budget
**Summary**: Across the zoom-out + zoomed-in benchmark suite, false-positive rate per zoom band stays within the configurable per-zoom-band budget (Q14 fallback trigger).
**Traces to**: AC `Q-tagged criteria — Movement detection FP rate at zoomed-in inspection / M4` (depends on Q14).
**Tier**: B + E.
**Test status**: DEFERRED — `<DEFERRED: zoom-out + zoomed-in benchmark suite + expected_results/movement_benchmark_caps.json; Q14>`.
**Input data**: `fixtures/movement/video04.mp4` (visual ref) + DEFERRED benchmark suite.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Replay the benchmark suite end-to-end | SUT processes all frames |
| 2 | Aggregate FP candidates per zoom band | rate per band ≤ configured cap (default ≤ 0.5 / min at zoomed-in) |
**Expected outcome**: M4 — `threshold_max` passes per zoom band.
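
As a rough sketch of the per-band aggregation in step 2, assuming the cap is expressed as false positives per minute of replayed footage (which is how the default `≤ 0.5 / min` reads) and that FP candidates have already been counted per zoom band. The function names are hypothetical.

```rust
/// False-positive rate in candidates per minute; None for a non-positive duration.
pub fn fp_rate_per_min(fp_count: u32, clip_seconds: f64) -> Option<f64> {
    if clip_seconds > 0.0 {
        Some(fp_count as f64 * 60.0 / clip_seconds)
    } else {
        None
    }
}

/// threshold_max gate for one zoom band: the rate must be defined and within the cap.
pub fn within_budget(fp_count: u32, clip_seconds: f64, cap_per_min: f64) -> bool {
    matches!(fp_rate_per_min(fp_count, clip_seconds), Some(rate) if rate <= cap_per_min)
}
```

With the default zoomed-in cap, 2 false positives over 240 s of footage is exactly 0.5 per minute and still passes; a third false positive would fail the band.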
---
## Positive Scenarios — Scan & Camera Control
### FT-P-011: Sweep → zoomed-inspection transition + POI enqueue
**Summary**: A POI detected mid-sweep triggers a transition into zoomed-inspection within 2 s (timing: NFT-PERF-L8) AND the POI is enqueued correctly.
**Traces to**: AC `Scan & Camera Control — Transition from sweep to detailed inspection / S1`.
**Tier**: B + E.
**Test status**: DEFERRED — `<DEFERRED: scripted mission with planned route + simulated POI detected mid-sweep>`.
**Input data**: scripted MAVLink mission + scripted Tier-1 detection injection at known frame index.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Start SUT with scripted mission; begin RTSP playback | SUT enters `ZoomedOut`, performs sweep |
| 2 | Inject Tier-1 detection of a high-confidence target at frame N | SUT transitions to `ZoomedIn { roi, hold_started_at }`; ROI bbox matches the injected detection's bbox; POI queue length increments by 1 |
**Expected outcome**: S1 — `exact (transition)` + `exact (ROI matches POI bbox)` + `exact (queue Δ+1)`.
---
### FT-P-012: Footpath-pan during zoomed-in hold
**Summary**: During a zoomed-in hold on a footpath ROI, the camera pans along the footpath while the airframe continues to fly. The footpath stays in the centre 50% of frame for the duration of the hold.
**Traces to**: AC `Scan & Camera Control — pan to keep features visible / S2`.
**Tier**: B + E.
**Test status**: DEFERRED — `<DEFERRED: zoomed-inspection scenario with footpath polyline overlapping the ROI>`.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Drive SUT into ZoomedIn hold on a footpath ROI | SUT in `ZoomedIn { roi, hold_started_at }` |
| 2 | Continue airframe flight; observe gimbal commands stream | SUT issues pan commands to track the footpath; observed centre offset ≤ 25% per frame |
**Expected outcome**: S2 — `numeric_tolerance` passes; per-frame centre offset ≤ 0.25 × frame_dim.
---
### FT-P-013: Target-follow centre-window
**Summary**: After operator confirmation, target-follow mode keeps the target within the centre 25% of frame while visible.
**Traces to**: AC `Scan & Camera Control — target-follow mode / S3`.
**Tier**: B + E.
**Test status**: DEFERRED — `<DEFERRED: operator-confirmed target + 60 s follow window>`.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Drive SUT into `TargetFollow { target_id, started_at }` via prior FT-P-016 | mode == target-follow |
| 2 | Observe gimbal commands + per-frame target position for 60 s | per-frame \|dx\|, \|dy\| ≤ 0.125 × frame_size |
**Expected outcome**: S3 — `threshold_max` passes per frame.
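
The centre-window check used here (and, with a wider window, in FT-P-012) can be sketched as a per-axis offset test. This assumes the "centre 25% of frame" is read per axis, so the allowed offset from the frame centre is half the window fraction times the frame dimension; the function name and pixel-coordinate inputs are hypothetical.

```rust
/// True when the target centre lies inside a centred window that covers
/// `window_fraction` of each frame axis (0.25 for S3, 0.50 for S2).
pub fn in_centre_window(
    target_x: f64,
    target_y: f64,
    frame_w: f64,
    frame_h: f64,
    window_fraction: f64,
) -> bool {
    // Offsets of the target from the frame centre, in pixels.
    let dx = (target_x - frame_w / 2.0).abs();
    let dy = (target_y - frame_h / 2.0).abs();
    // Half the window on each side of the centre, e.g. 0.125 for a 25% window.
    let half = window_fraction / 2.0;
    dx <= half * frame_w && dy <= half * frame_h
}
```

On a 1920 × 1080 frame the 25% window allows a horizontal offset of up to 240 px and a vertical offset of up to 135 px.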
---
### FT-P-014: POI queue ordering by `confidence × proximity × age_factor`
**Summary**: With 3 POIs varying in confidence × proximity × age_factor, the system pops them in the documented relative order.
**Traces to**: AC `Scan & Camera Control — POI queue MUST be ordered by … / S4`.
**Tier**: B.
**Test status**: READY (synthetic-poi-feeds inline-authorable).
**Input data**: `synthetic-poi-feeds` ordering test — 3 POIs with confidence ∈ {0.50, 0.80, 0.60}, proximity ∈ {near, mid, far}, age_factor ∈ {fresh, fresh, stale} chosen to produce a known relative ordering.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Inject the 3 POIs as Tier-1 detections | all 3 enter the queue |
| 2 | Observe ZoomedIn transitions over the next N seconds | SUT inspects POIs in the documented relative order |
**Expected outcome**: S4 — `exact (order)` passes.
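
The ordering rule, and the preemption rule FT-P-008 derives from it, can be illustrated with the sketch below. The numeric encodings of proximity (near / mid / far) and age_factor (fresh / stale) are not given in this file, so the sketch simply takes them as already-computed factors; the `Poi` type and function names are hypothetical.

```rust
/// Hypothetical POI record carrying the three priority factors.
#[derive(Debug, Clone, PartialEq)]
pub struct Poi {
    pub id: u32,
    pub confidence: f64,
    pub proximity: f64,
    pub age_factor: f64,
}

impl Poi {
    /// Priority as documented: confidence × proximity × age_factor.
    pub fn priority(&self) -> f64 {
        self.confidence * self.proximity * self.age_factor
    }
}

/// Ids in inspection order, highest priority first.
/// The sort is stable, so ties keep insertion order (an assumption).
pub fn inspection_order(mut pois: Vec<Poi>) -> Vec<u32> {
    pois.sort_by(|a, b| b.priority().total_cmp(&a.priority()));
    pois.into_iter().map(|p| p.id).collect()
}

/// A held ROI is preempted only when the candidate's priority strictly exceeds it.
pub fn should_preempt(held: &Poi, candidate: &Poi) -> bool {
    candidate.priority() > held.priority()
}
```

Note that the highest-confidence POI is not necessarily inspected first: a nearer, fresher POI with lower confidence can outrank it once the three factors are multiplied.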
---
### FT-P-015: Zoomed-in hold cap interacts with deep-analysis
**Summary**: Zoomed-in hold defaults to 5 s/POI, but deep analysis is capped at 2 s: with deep analysis enabled, actual hold duration = min(2 s, deep_analysis_complete_at); with it disabled, the hold runs to the 5 s per-POI timeout.
**Traces to**: AC `Scan & Camera Control — hold endpoints up to 2 s for deep analysis … per-POI timeout (default 5 s/POI) / S5`.
**Tier**: B + E.
**Test status**: DEFERRED — `<DEFERRED: VLM-enabled hold scenario with vlm_io_pair returning within 2 s>`.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Drive SUT into ZoomedIn hold; enable deep-analysis | SUT begins VLM IPC call on enter |
| 2a | Case A: VLM returns at 1.5 s | hold ends at 1.5 s (deep_analysis_complete) |
| 2b | Case B: VLM returns at 3.0 s | hold ends at 2.0 s (deep-analysis cap) |
| 2c | Case C: deep-analysis disabled | hold ends at 5.0 s (per-POI timeout) |
**Expected outcome**: S5 — `exact` passes for each case.
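
Cases A to C in the step table can be expressed as one small function. This is a sketch under the defaults stated in the scenario (5 s per-POI timeout, 2 s deep-analysis cap); the enum and constant names are hypothetical and do not come from the SUT.

```rust
use std::time::Duration;

pub const PER_POI_TIMEOUT: Duration = Duration::from_secs(5);
pub const DEEP_ANALYSIS_CAP: Duration = Duration::from_secs(2);

/// Whether deep analysis is enabled for the hold, and when the VLM answers.
pub enum DeepAnalysis {
    Disabled,
    Enabled { vlm_returns_after: Duration },
}

/// Expected hold duration for the three cases in the step table.
pub fn expected_hold(deep: &DeepAnalysis) -> Duration {
    match deep {
        // Case C: no deep analysis, so the per-POI timeout ends the hold.
        DeepAnalysis::Disabled => PER_POI_TIMEOUT,
        // Cases A and B: the hold ends when the VLM returns or at the cap,
        // whichever comes first.
        DeepAnalysis::Enabled { vlm_returns_after } => {
            (*vlm_returns_after).min(DEEP_ANALYSIS_CAP)
        }
    }
}
```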
---
## Positive Scenarios — Operator Workflow
### FT-P-016: Operator confirm → middle waypoint inserted + target-follow
**Summary**: Valid + signed operator-confirm command results in a middle waypoint POSTed to `missions` AND a transition into target-follow mode.
**Traces to**: AC `Operator Workflow — Operator confirmation MUST result in … / O8`.
**Tier**: B + E.
**Test status**: READY for happy path (default placeholder envelope until Q9 resolves; envelope replaced when Q9 ships).
**Input data**: `operator-envelopes` (valid happy path) + `mission-suite-fixture` (DEFERRED full version) + `operator-session-scripts` (nominal session).
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | SUT in ZoomedIn hold on a POI surfaced to the operator | mode == ZoomedIn |
| 2 | Replay operator-confirm envelope on the return path | SUT validates envelope; commits decision |
| 3 | Observe HTTPS POST to `missions-mock` | `POST /missions/{id}` with a middle waypoint at the POI MGRS; HTTP 200 |
| 4 | Observe scan-mode state | mode == `TargetFollow { target_id, started_at }` |
**Expected outcome**: O8 — `exact (HTTP 200)` + `exact (mode == TargetFollow)`.
---
### FT-P-017: Decision window = 30 s at conf = 0.40
**Summary**: At confidence = 0.40 the decision window surfaced to the operator MUST equal 30 s (lower-bound anchor of the linear scale).
**Traces to**: AC `Operator Workflow — decision window … 40% confidence → 30 s / O1`.
**Tier**: B.
**Test status**: READY.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Inject a synthetic POI at conf=0.40 | POI surfaced on operator-stream with `decision_window_seconds: 30` |
**Expected outcome**: O1 — `exact (window == 30 s)`.
---
### FT-P-018: Decision window = 120 s at conf = 1.00
**Summary**: At confidence = 1.00 the decision window MUST equal 120 s (upper-bound anchor).
**Traces to**: AC `O2`.
**Tier**: B.
**Test status**: READY.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Inject a synthetic POI at conf=1.00 | window == 120 s |
**Expected outcome**: O2 — `exact`.
---
### FT-P-019: Decision window linear interpolation at conf = 0.70
**Summary**: At conf=0.70 the window is interpolated linearly between (0.40, 30 s) and (1.00, 120 s) → 75 s ± 0.5 s.
**Traces to**: AC `O3`.
**Tier**: B.
**Test status**: READY.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Inject a synthetic POI at conf=0.70 | window ≈ 75 s ± 0.5 s |
**Expected outcome**: O3 — `numeric_tolerance ± 0.5 s`.
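
The linear scale behind O1 to O3 (and the below-threshold rule O4) can be sketched as follows. The anchors (0.40, 30 s) and (1.00, 120 s) come from the scenarios above; the function name and the choice to return `None` for confidences above 1.00 are assumptions.

```rust
/// Decision window in seconds for a given confidence, or None when the POI
/// must not be surfaced (confidence below 0.40) or the input is out of range.
pub fn decision_window_seconds(confidence: f64) -> Option<f64> {
    const LO: (f64, f64) = (0.40, 30.0); // (confidence, seconds)
    const HI: (f64, f64) = (1.00, 120.0);
    if !(LO.0..=HI.0).contains(&confidence) {
        return None;
    }
    // Fraction of the way from the lower anchor to the upper anchor.
    let t = (confidence - LO.0) / (HI.0 - LO.0);
    Some(LO.1 + t * (HI.1 - LO.1))
}
```

At conf = 0.70 the fraction is 0.30 / 0.60 = 0.5, so the window is 30 + 0.5 × 90 = 75 s, which matches the value asserted in FT-P-019.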
---
### FT-P-020: Operator decline → persistent ignored-item
**Summary**: Operator-decline on a surfaced POI MUST persist an ignored-item entry keyed by `(MGRS cell, class_group)`.
**Traces to**: AC `Operator Workflow — Operator-decline MUST result in a persistent ignored-item entry / O5`.
**Tier**: B + E.
**Test status**: READY (operator-session-scripts inline-authorable; envelope uses default placeholder until Q9 resolves).
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Surface a POI to the operator | POI on operator-stream |
| 2 | Replay operator-decline envelope | SUT validates; ignored-item count via health endpoint increments by 1; new item has `(MGRS, class_group)` matching the declined POI |
**Expected outcome**: O5 — `exact (count Δ+1)` + `schema_match` (ignored-item record shape).
---
### FT-P-021: Ignored-item suppresses future matching detections
**Summary**: A new detection whose `(MGRS, class_group)` matches an existing ignored-item MUST NOT be surfaced to the operator.
**Traces to**: AC `Operator Workflow — A new detection whose (MGRS, class_group) matches an existing ignored-item MUST NOT be surfaced / O6`.
**Tier**: B + E.
**Test status**: READY.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Seed an ignored-item for `(MGRS=X, class_group=Y)` via FT-P-020 | ignored-item present |
| 2 | Inject a new detection at `(MGRS=X, class_group=Y)` | operator-stream emits NO POI for this detection; counter `pois_suppressed_by_ignored_total` increments |
**Expected outcome**: O6 — `exact (count surfaced == 0)`.
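
The decline-then-suppress behaviour of FT-P-020 and FT-P-021 amounts to a set lookup keyed by `(MGRS cell, class_group)`. The sketch below is in-memory only and so does not model persistence; the type names, the string representation of both key parts, and the counter field are assumptions that mirror the `pois_suppressed_by_ignored_total` counter named above.

```rust
use std::collections::HashSet;

/// Key of an ignored-item entry (string encodings are assumptions).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct IgnoreKey {
    pub mgrs_cell: String,
    pub class_group: String,
}

/// In-memory stand-in for the ignored-item store.
#[derive(Default)]
pub struct IgnoredItems {
    keys: HashSet<IgnoreKey>,
    suppressed_total: u64,
}

impl IgnoredItems {
    /// Operator declined a surfaced POI: remember its key.
    pub fn decline(&mut self, key: IgnoreKey) {
        self.keys.insert(key);
    }

    /// Returns true when a new detection may be surfaced to the operator;
    /// a match against an ignored key suppresses it and bumps the counter.
    pub fn should_surface(&mut self, key: &IgnoreKey) -> bool {
        if self.keys.contains(key) {
            self.suppressed_total += 1;
            false
        } else {
            true
        }
    }

    pub fn count(&self) -> usize {
        self.keys.len()
    }

    pub fn suppressed_total(&self) -> u64 {
        self.suppressed_total
    }
}
```

An operator timeout (FT-P-022) would never call `decline`, so the count stays unchanged in that path.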
---
### FT-P-022: Operator timeout = forget (no ignored-item)
**Summary**: If the decision window expires with no operator response, the POI is removed from the queue but NO ignored-item is created (forget, do not blacklist).
**Traces to**: AC `Operator Workflow — Timeout (no operator response within the window) MUST NOT create an ignored-item entry / O7`.
**Tier**: B + E.
**Test status**: READY.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Surface a POI at conf=0.40 (30 s window) | POI on operator-stream |
| 2 | Wait > 30 s with no response | POI removed from queue; ignored-item count UNCHANGED |
**Expected outcome**: O7 — `exact (queue Δ−1)` + `exact (ignored-item count unchanged)`.
---
## Positive Scenarios — Pre-flight & Map Reconciliation
### FT-P-023: BIT pre-flight pass with every dependency healthy
**Summary**: When every external dependency is reachable + healthy AND on-device storage < 95 % full AND wall-clock is bound, BIT passes and takeoff is permitted.
**Traces to**: AC `Reliability & Safety — Pre-flight self-test MUST pass / R1`, RESTRICT `Reliability & Safety obligations — Pre-flight self-test (BIT) MUST gate takeoff`.
**Tier**: B + E.
**Test status**: READY (bit-scenarios inline-authorable).
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Bring up all mocks healthy + clean autopilot-state volume | every dependency green |
| 2 | Trigger BIT via the BIT-arm operator command (or scripted in `operator-session-scripts`) | health endpoint returns `{ "ok": true, "deps": { ...all green }, "takeoff_permitted": true }` |
**Expected outcome**: R1 — `exact (takeoff_permitted == true)` + `exact (health.all == "green")`.
---
### FT-P-024: Pre-flight map pull ≤ 30 s for a 30×30 km region
**Summary**: Pulling the area-level map of previously-detected objects for a 30 km × 30 km mission area MUST complete within 30 s wall-clock.
**Traces to**: AC `Map Reconciliation — Pre-flight map pull / Mp1`.
**Tier**: B + E.
**Test status**: DEFERRED — `<DEFERRED: mock central area-map service with ~10000 map objects for the 30 km × 30 km region>`.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Configure `missions-mock` with the 30×30 km mapobjects fixture | mock ready |
| 2 | Trigger BIT (which pulls the map) | SUT issues `GET /missions/{id}/mapobjects`; local copy hydrated within 30 s |
| 3 | Confirm BIT proceeds normally afterwards | takeoff permitted |
**Expected outcome**: Mp1 — `threshold_max` passes (NFT-PERF measures the latency; this scenario asserts the functional pathway).
---
### FT-P-025: Post-flight map diff push for a 60-minute mission
**Summary**: Pushing the post-flight pass diff (~17 500 records: NEW + MOVED + REMOVED + CONFIRMED-EXISTING) for a 60-minute mission MUST complete within 120 s wall-clock.
**Traces to**: AC `Map Reconciliation — Post-flight pass diff push / Mp3`.
**Tier**: B + E.
**Test status**: DEFERRED — `<DEFERRED: 60-minute mission pass diff fixture>`.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Land the SUT after a 60-minute mission (scripted) | SUT enters post-flight reconciliation |
| 2 | Observe HTTPS POST to `missions-mock` | `POST /missions/{id}/mapobjects` with the diff; HTTP 200 within 120 s |
**Expected outcome**: Mp3 — `threshold_max` passes (NFT-PERF measures latency).
---
### FT-P-026: MapObjects conflict resolution (append-only + projection)
**Summary**: When two map updates conflict for the same `(spatial-cell, class_group)`, the SUT records both observations append-only AND computes the current view per the documented resolution rule.
**Traces to**: AC `Q-tagged — MapObjects conflict resolution / Mp5` (depends on Q8).
**Tier**: B + E.
**Test status**: DEFERRED — `<DEFERRED: conflict pair fixture + expected_results/mapobjects_conflict_resolution.json; Q8>`.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Seed the local mapobjects store via map pull | local store hydrated |
| 2 | Trigger two conflicting observations for `(cell=X, class=Y)` | both appended to the observation log |
| 3 | Observe the projected current view (via the operator-stream map-overlay channel or health debug) | current view matches the resolution rule (Q8) |
**Expected outcome**: Mp5 — `json_diff` passes against the reference.
---
## Negative Scenarios
### FT-N-001: BIT inhibits takeoff when Tier-1 detection is unreachable
**Summary**: When `../detections` is unreachable at BIT, takeoff MUST be inhibited and the detection dependency MUST report red.
**Traces to**: AC `Reliability & Safety — Pre-flight self-test MUST pass / R2`, RESTRICT `Suite-level architectural splits — Tier 1 lives in ../detections`.
**Tier**: B + E.
**Test status**: READY.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Stop `detections-mock` | dependency unreachable |
| 2 | Trigger BIT | health endpoint returns `takeoff_permitted: false`; `deps.detection == "red"`; operator-stream surfaces a BIT-failure event with category `detection` |
| 3 | Attempt to issue a takeoff MAVLink command (scripted) | SUT refuses; no MAVLink takeoff command observed on `mavlink-sitl` |
**Expected outcome**: R2 — `exact (takeoff inhibited)`.
---
### FT-N-002: BIT inhibits takeoff when persistent storage ≥ 95 % full
**Summary**: When the on-device persistent store is ≥ 95 % full at BIT, takeoff MUST be inhibited.
**Traces to**: AC `Reliability & Safety — Pre-flight self-test MUST pass / R3`, RESTRICT `On-device storage MUST be bounded`.
**Tier**: B.
**Test status**: READY.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Pre-fill `autopilot-state` volume to ≥ 95 % via seed file | storage threshold tripped |
| 2 | Trigger BIT | `takeoff_permitted: false`; `deps.storage == "red"` |
**Expected outcome**: R3 — `exact (takeoff inhibited)`.
---
### FT-N-003: Cache-fallback on map-pull timeout requires operator acknowledgement
**Summary**: When the pre-flight map pull times out, the SUT falls back to last-known cached MapObjects, reports `map_sync == "cached_fallback"`, AND MUST require explicit operator acknowledgement before takeoff is permitted.
**Traces to**: AC `Map Reconciliation — Cache-fallback on timeout is acceptable only with explicit operator acknowledgement / Mp2`.
**Tier**: B + E.
**Test status**: READY (operator-session-scripts inline-authorable; cached state seeded from prior pull).
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Seed `autopilot-state` with a known prior MapObjects snapshot | cached map present |
| 2 | Configure `missions-mock` to time out on `GET /missions/{id}/mapobjects` | mock returns 504 / silent timeout |
| 3 | Trigger BIT | SUT falls back to cached; `map_sync == "cached_fallback"`; BIT reports `takeoff_permitted: false, awaiting_ack: ["map_cache_fallback"]` |
| 4 | Replay operator-ack envelope for `map_cache_fallback` | BIT now reports `takeoff_permitted: true`; one structured-log entry at WARN with `map_cache_fallback_acked_by_operator` |
| 5 | Replay a takeoff scenario WITHOUT the ack | takeoff remains inhibited |
**Expected outcome**: Mp2 — `exact (cached_fallback)` + `exact (BIT requires explicit ack)`.
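
The takeoff gate exercised by FT-P-023 and FT-N-001 to FT-N-003 can be summarised in a sketch like the one below. It is a reading of the scenarios, not the SUT's implementation: the struct layout is hypothetical, and only the field names `takeoff_permitted`, `awaiting_ack`, and the ack id `map_cache_fallback` are taken from the expected responses above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DepStatus {
    Green,
    Red,
}

/// Inputs to the gate (hypothetical layout).
#[derive(Debug, Clone)]
pub struct BitInputs {
    pub deps: Vec<(String, DepStatus)>,
    pub storage_used_fraction: f64,
    pub wall_clock_bound: bool,
    pub map_cached_fallback: bool,
    pub acks: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub struct BitReport {
    pub takeoff_permitted: bool,
    pub awaiting_ack: Vec<String>,
}

pub fn evaluate_bit(i: &BitInputs) -> BitReport {
    let deps_ok = i.deps.iter().all(|(_, s)| *s == DepStatus::Green);
    // Takeoff is inhibited at 95 % full or above.
    let storage_ok = i.storage_used_fraction < 0.95;
    let mut awaiting_ack = Vec::new();
    // A cached-map fallback needs an explicit operator acknowledgement.
    let acked = i.acks.iter().any(|a| a.as_str() == "map_cache_fallback");
    if i.map_cached_fallback && !acked {
        awaiting_ack.push("map_cache_fallback".to_string());
    }
    BitReport {
        takeoff_permitted: deps_ok
            && storage_ok
            && i.wall_clock_bound
            && awaiting_ack.is_empty(),
        awaiting_ack,
    }
}
```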
---
### FT-N-004: Below-threshold POI suppression (conf < 40 %)
**Summary**: A POI at confidence < 0.40 MUST NOT be surfaced to the operator at all.
**Traces to**: AC `Operator Workflow — Below 40% confidence, the POI MUST NOT be surfaced at all / O4`.
**Tier**: B.
**Test status**: READY.
| Step | Consumer Action | Expected System Response |
|---|---|---|
| 1 | Inject a synthetic POI at conf=0.39 | POI does NOT appear on operator-stream; counter `pois_below_threshold_total` increments by 1 |
**Expected outcome**: O4 — `exact (count surfaced == 0)`.
---
## Notes for downstream skills
- Decompose: every `READY` scenario above maps to at least one blackbox test task. DEFERRED scenarios MUST still produce a task spec (so the implementation has a placeholder), but the task spec's `Acceptance` section will reference the leftover entry that gates the fixture.
- Implement Tests: per-scenario assertion helpers (RTSP playback orchestration, MAVLink observer, operator-stream observer) are likely shared across scenarios — Phase 4's runner scripts will assume a thin `e2e/consumer/lib/` module that all scenarios depend on.
- Test-Spec Sync (cycle-update mode): post-implementation, scenarios may be split (e.g. FT-P-015's three sub-cases may become FT-P-015a/b/c) or merged. The traceability-matrix is the source of truth — every scenario MUST trace to at least one AC or RESTRICT.
@@ -0,0 +1,320 @@
# Test Environment
Authored by `/test-spec` Phase 2 (2026-05-19) against:
- `_docs/00_problem/problem.md`, `acceptance_criteria.md`, `restrictions.md`, `security_approach.md`
- `_docs/01_solution/solution_draft01.md`
- `_docs/02_document/architecture.md` (incl. §6 NFR Targets, §7 Detailed Design)
- `_docs/00_problem/input_data/data_parameters.md`, `services.md`, `fixtures/README.md`, `expected_results/results_report.md`
Per `.cursor/rules/artifact-srp.mdc` this artifact owns ONLY the test environment / harness shape — measurable thresholds belong in `acceptance_criteria.md`, fixture inventory belongs in `test-data.md`, and per-test specs belong in the sibling `*-tests.md` files.
---
## Overview
**System under test (SUT)**: `autopilot` — a single Rust binary that runs on the Jetson Orin Nano Super of a reconnaissance UAV. Its observable external surfaces:
| Surface | Direction | Protocol | Source/Sink in production |
|---|---|---|---|
| Tier-1 detection RPC | autopilot ⇄ detector | bi-directional gRPC streaming (local) | `../detections` |
| MAVLink command/telemetry | autopilot ⇄ airframe | MAVLink v2 over UDP (or serial) | ArduPilot / PX4 |
| Camera RTSP feed | camera → autopilot | H.264/265 1080p, 30/60 fps | ViewPro A40 |
| Gimbal control + telemetry | autopilot ⇄ camera | ViewPro vendor UDP | ViewPro A40 |
| Mission + MapObjects REST | autopilot ⇄ central | HTTPS JSON | `missions` service |
| Operator stream (telemetry out, commands in) | autopilot ⇄ GS | Suite-level modem protocol, signed commands | Ground Station |
| Deep-analysis VLM IPC (optional) | autopilot ⇄ VLM | Unix-domain socket | local-onboard VLM |
| Health endpoint | autopilot → ops | HTTP/JSON | scraped by ops |
| Structured logs | autopilot → ops | JSON to stdout | log shipper |
The harness exercises every one of those surfaces from outside the SUT process. No test reaches inside the binary (no module imports, no direct DB peeks, no shared memory).
**Consumer app purpose**: a black-box test runner (`e2e-consumer`) that:
1. Brings up the SUT in a controlled topology (with mock or live peers).
2. Drives inputs through public surfaces.
3. Captures every observable: outbound network frames, MAVLink commands, gimbal UDP commands, REST calls, operator-stream messages, health-endpoint JSON, log lines, plus passive resource metrics (RSS, CPU, GPU).
4. Compares each observation against the expected result tagged in `_docs/00_problem/input_data/expected_results/results_report.md` and emits a CSV report.
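
Step 4 of the list above could be sketched as follows for the numeric comparison kinds that appear in the scenario files (`exact`, `numeric_tolerance`, `threshold_min`, `threshold_max`). The enum, the function names, and the three-column CSV layout are assumptions for illustration; the actual report format is not defined in this file, and non-numeric kinds such as `schema_match` or `json_diff` are left out.

```rust
/// Numeric subset of the comparison kinds named in the scenario files.
#[derive(Debug, Clone, Copy)]
pub enum Expect {
    Exact(f64),
    NumericTolerance { target: f64, tol: f64 },
    ThresholdMin(f64),
    ThresholdMax(f64),
}

/// Applies one expectation to one observed value.
pub fn passes(expect: Expect, observed: f64) -> bool {
    match expect {
        Expect::Exact(v) => observed == v,
        Expect::NumericTolerance { target, tol } => (observed - target).abs() <= tol,
        Expect::ThresholdMin(min) => observed >= min,
        Expect::ThresholdMax(max) => observed <= max,
    }
}

/// One report line: scenario id, results_report row id, verdict.
/// The column layout is hypothetical.
pub fn csv_row(scenario_id: &str, result_row_id: &str, passed: bool) -> String {
    format!(
        "{scenario_id},{result_row_id},{}",
        if passed { "PASS" } else { "FAIL" }
    )
}
```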
## Test execution tiers
Five execution tiers exist; each test scenario declares which tier(s) it must run in:
| Tier | Purpose | What is real vs mocked | When it runs |
|---|---|---|---|
| **U** — unit | Pure in-process logic with no external surface (state-machine transitions, geometry helpers, schema validators) | Everything in-process | Per commit (cargo test) |
| **I** — component-integration | One autopilot component against mocks for every peer | SUT component real; all peers stubbed/replayed | Per commit; isolates contract drift |
| **B** — blackbox / harness | Full SUT binary against mock peers in containers | SUT binary real; every external peer mocked (HTTPS mock, gRPC replay, MAVLink SITL, scripted operator trace, RTSP loopback) | Per commit + nightly |
| **E** — suite-e2e | Full SUT against live siblings (`../detections`, `../missions`, ArduPilot SITL, Ground Station replay) | All real services in the suite-e2e compose | Nightly + pre-release |
| **HW** — hardware/replay benchmark | SUT binary on representative Jetson hardware OR on a benchmarked replay of that hardware | Real Jetson Orin Nano Super OR benchmarked replay | Pre-release; the only path that satisfies the `acceptance_criteria.md → Acceptance Gates (project-level)` hardware gate |
Hardware-dependency analysis (which AC rows require HW vs replay vs commodity) is produced by the test-spec `phases/hardware-assessment.md` step before Phase 4 runner scripts are generated and is appended to this file as `## Hardware Execution Matrix`.
## Docker environment (Tier B + E)
The suite-e2e compose lives at the monorepo level (`../e2e/docker-compose.suite-e2e.yml`, owned by the `monorepo-e2e` skill — see `_docs/00_problem/input_data/services.md`). The autopilot-local harness lives at `e2e/docker-compose.autopilot-e2e.yml` (created by Phase 4) and brings up only the SUT + mocks needed for Tier-B runs.
### Services (Tier B — autopilot-local harness)
| Service | Image / Build | Purpose | Ports |
|---|---|---|---|
| `autopilot` | build: `.` (cross to `aarch64-unknown-linux-gnu` for HW, native for Tier B) | SUT | health: 9100/tcp; log: stdout; MAVLink: 14550/udp; gimbal: 9201/udp; operator: 9301/tcp |
| `detections-mock` | build: `e2e/mocks/detections-mock` (Python) | Bi-directional gRPC mock replaying recorded `Detections` streams | 50051/tcp |
| `missions-mock` | build: `e2e/mocks/missions-mock` (Python FastAPI) | HTTPS REST mock — `GET/POST /missions/{id}` + `/mapobjects` | 8443/tcp (TLS) |
| `rtsp-loopback` | image: `bluenviron/mediamtx` | RTSP server playing back recorded `.mp4` frame sequences at 30/60 fps | 8554/tcp |
| `gimbal-mock` | build: `e2e/mocks/gimbal-mock` (Rust) | ViewPro UDP echo + scripted yaw/pitch/zoom telemetry replays | 9200/udp |
| `mavlink-sitl` | image: `ardupilot/ardupilot-sitl` | ArduPilot SITL — MAVLink v2 endpoint for the autopilot to drive | 14551/udp |
| `vlm-mock` | build: `e2e/mocks/vlm-mock` (Python, UDS) | Optional Tier-3 VLM IPC mock; replays recorded `VlmAssessment` JSON | (UDS only) |
| `operator-replay` | build: `e2e/mocks/operator-replay` (Python) | Scripted Ground Station session trace: connect / push frame / push telemetry / operator-click / modem-drop / reconnect / lost-link | 9300/tcp |
| `time-injector` | build: `e2e/mocks/time-injector` (Rust) | Injects clock-drift / NTP-loss scenarios into the SUT container's clock via `faketime` LD_PRELOAD shim | — |
| `e2e-consumer` | build: `e2e/consumer` (Rust + assert crates) | The black-box test runner that drives scenarios + compares observables to expected results | — |
### Networks
| Network | Services | Purpose |
|---|---|---|
| `autopilot-e2e` | all | Isolated test network; no egress |
### Volumes
| Volume | Mounted to | Purpose |
|---|---|---|
| `fixtures-ro` | every mock service (read-only) | Mounts `_docs/00_problem/input_data/fixtures/` for replay sources |
| `expected-ro` | `e2e-consumer:/expected:ro` | Mounts `_docs/00_problem/input_data/expected_results/` for assertion comparison |
| `reports-rw` | `e2e-consumer:/reports` | CSV + JSON test output |
| `autopilot-state` | `autopilot:/var/lib/autopilot` | On-device persistent store (R3, Mp4) — wiped between runs |
### docker-compose structure (outline only — not runnable)
```yaml
services:
  autopilot:
    build: .
    depends_on: [detections-mock, missions-mock, rtsp-loopback, gimbal-mock, mavlink-sitl, operator-replay]
    networks: [autopilot-e2e]
    environment:
      DETECTOR_GRPC: detections-mock:50051
      MISSIONS_URL: https://missions-mock:8443
      RTSP_URL: rtsp://rtsp-loopback:8554/feed
      GIMBAL_UDP: gimbal-mock:9200
      MAVLINK_UDP: mavlink-sitl:14551
      OPERATOR_TCP: operator-replay:9300
      VLM_SOCK: /tmp/vlm.sock
      AUTOPILOT_CONFIG: /etc/autopilot/test.toml
    volumes:
      - autopilot-state:/var/lib/autopilot
  detections-mock: { build: e2e/mocks/detections-mock, volumes: [fixtures-ro:/fixtures:ro] }
  missions-mock: { build: e2e/mocks/missions-mock, volumes: [fixtures-ro:/fixtures:ro] }
  rtsp-loopback: { image: bluenviron/mediamtx, volumes: [fixtures-ro:/fixtures:ro] }
  gimbal-mock: { build: e2e/mocks/gimbal-mock, volumes: [fixtures-ro:/fixtures:ro] }
  mavlink-sitl: { image: ardupilot/ardupilot-sitl }
  vlm-mock: { build: e2e/mocks/vlm-mock, volumes: [fixtures-ro:/fixtures:ro] }
  operator-replay: { build: e2e/mocks/operator-replay, volumes: [fixtures-ro:/fixtures:ro] }
  time-injector: { build: e2e/mocks/time-injector }
  e2e-consumer:
    build: e2e/consumer
    depends_on: [autopilot]
    volumes: [expected-ro:/expected:ro, reports-rw:/reports]
networks:
  autopilot-e2e: {}
volumes:
  # Quoted because "{" and "}" are not allowed in a plain scalar inside a flow mapping.
  fixtures-ro: { driver_opts: { type: none, o: bind, device: "${PWD}/_docs/00_problem/input_data/fixtures" } }
  expected-ro: { driver_opts: { type: none, o: bind, device: "${PWD}/_docs/00_problem/input_data/expected_results" } }
  reports-rw: {}
  autopilot-state: {}
```
### Suite-e2e compose (Tier E) — referenced, not redefined
For Tier-E runs the harness uses `../e2e/docker-compose.suite-e2e.yml` (owned by `monorepo-e2e`). It adds the real `../detections`, real `../missions`, and a richer `mavlink-sitl` configuration. Autopilot's Tier-E entries in this file MUST mirror the suite-e2e topology — drift is reconciled by the `monorepo-e2e` skill, not here.
## Consumer application (`e2e-consumer`)
**Tech stack**: Rust + `assert_cmd` + `testcontainers-rs` + `prost`/`tonic` (for gRPC observation) + `mavlink-rs` (for MAVLink observation) + `reqwest`/`hyper` (for HTTPS observation) + `tokio-tungstenite` (for operator-stream observation). Tests are organised one-scenario-per-file under `e2e/consumer/tests/scenarios/`.
**Entry point**: `cargo test --release --test scenarios` (orchestrated by `scripts/run-tests.sh`, produced in Phase 4).
### Communication with the system under test
| Interface | Protocol | Endpoint / Topic | Authentication |
|---|---|---|---|
| Health endpoint | HTTP GET | `http://autopilot:9100/health` | none (loopback) |
| Structured log stream | line-delimited JSON on stdout | docker-compose log tail | none |
| MAVLink observed | MAVLink v2 / UDP | `mavlink-sitl:14551` (the harness records both sides of the link) | per Q6: MAVLink-2 message signing if configured |
| Gimbal observed | ViewPro UDP | `gimbal-mock:9200` (commands recorded + telemetry replayed) | none |
| RTSP delivered | RTSP | `rtsp://rtsp-loopback:8554/feed` (consumer schedules which clip plays per scenario) | none |
| Detection RPC observed | gRPC streaming | `detections-mock:50051` (consumer scripts the recorded replay served) | none |
| Mission REST observed | HTTPS | `missions-mock:8443` (consumer scripts JSON fixtures + asserts captured request bodies) | TLS cert (self-signed for test) |
| Operator stream observed | Suite modem protocol | `operator-replay:9300` (consumer scripts session traces + signed-command envelopes) | per Q9: signed envelope (HMAC / ed25519 / MAVLink-2-ext) |
| VLM IPC observed (when enabled) | Unix-domain socket | `/tmp/vlm.sock` shared with `vlm-mock` | peer-credential check (security_approach §"Local IPC peer authorisation") |
### What the consumer does NOT have access to
- No direct database access to the autopilot's on-device persistent store (`autopilot-state` volume) — the consumer reads it only via the health endpoint, the operator telemetry stream, or as a post-run forensic check (the storage AC R3 is checked via the BIT health response, not by peeking at SQLite rows).
- No internal Rust module imports — the consumer is a separate crate compiled against published public proto/schema files only.
- No shared memory, no `/proc/$pid/...` inspection beyond passive resource metrics.
- No direct reading of in-flight POI queue ordering — ordering is observed indirectly via the operator-stream emission order and the gimbal command stream.
## External dependency mocks
| Dependency | Mock service | Determinism guarantee | Source fixture(s) |
|---|---|---|---|
| `../detections` Tier-1 RPC | `detections-mock` | Replays recorded `Detections` stream byte-for-byte; same input → same output | `<DEFERRED: tier1_replay/*.replay; services.md §1>` (live `../detections` used as fallback in Tier-E) |
| `missions` API | `missions-mock` | Static JSON responses per scenario; recorded round-trip captured for `POST` | `<DEFERRED: missions_fixtures/*.json; services.md §2>` |
| ViewPro A40 camera frames | `rtsp-loopback` (mediamtx) | Plays back `.mp4` at exact configured fps; frame timestamps deterministic | `fixtures/videos/94d42580bd1ad6ff.mp4`, `fixtures/movement/video0[1-4].mp4` |
| ViewPro A40 gimbal control | `gimbal-mock` | Replays `gimbal.csv` per scenario; echoes commands with bounded latency budget per scenario | `<DEFERRED: gimbal_csv/*.csv paired with movement videos; services.md §6>` |
| ArduPilot airframe | `mavlink-sitl` (ArduPilot SITL) | Deterministic seed + scripted mission | scripted per scenario; no fixture file required for Tier B (SITL is the fixture) |
| Ground Station modem session | `operator-replay` | Replays `(t, event)` script per scenario | `<DEFERRED: operator_sessions/*.script; services.md §3>` |
| Local VLM (Tier-3 optional) | `vlm-mock` | Returns paired `(roi.png → VlmAssessment)` from disk; schema-violation fixtures for fail-closed tests | `<DEFERRED: vlm_io_pairs/*.json; services.md §7>` |
| Wall-clock / GPS / NTP | `time-injector` (faketime LD_PRELOAD) | Scripted offset / jump / source-loss; injected at SUT process start | scripted per scenario; no fixture file required |
Mocks that are marked `<DEFERRED:>` are bridged through `_docs/_process_leftovers/2026-05-19_autopilot_test_fixtures.md`. Scenarios that consume those mocks declare `Test status: DEFERRED — input fixture not yet acquired (see leftover row N)` in their entry under the relevant `*-tests.md` file.
## CI/CD integration
| Stage | Tier(s) | When | Gate | Timeout |
|---|---|---|---|---|
| PR pipeline | U, I | on every PR push | block merge on FAIL | 10 min |
| dev-branch nightly | U, I, B | nightly | warn on FAIL; report attached | 60 min |
| weekly suite-e2e | U, I, B, E | weekly + on release branch | block release on FAIL | 180 min |
| pre-release HW benchmark | HW | manual + pre-release | block release on FAIL | 240 min |
Pipeline orchestration is owned by `_docs/02_document/deployment/ci_cd_pipeline.md`. This file only declares which tier each scenario MUST run in.
## Reporting
**Format**: CSV (one row per scenario per run).
**Columns**:
| Column | Type | Notes |
|---|---|---|
| `test_id` | string | e.g. `FT-P-001`, `NFT-PERF-L1`, `NFT-SEC-O9` |
| `test_name` | string | short title from the scenario header |
| `tier` | enum | U / I / B / E / HW |
| `seed` | int | deterministic seed used (where applicable) |
| `start_ts_utc` | ISO 8601 | scenario start |
| `duration_ms` | int | total execution time |
| `result` | enum | PASS / FAIL / SKIP / DEFERRED |
| `expected_result_ref` | string | row id in `expected_results/results_report.md` (e.g. `L1`, `Mp3`) |
| `actual_value` | string | quantitative observation (latency_ms, count, etc.) |
| `compare_method` | string | one of `expected-results.md` methods |
| `tolerance` | string | as declared in the expected-results row |
| `failure_reason` | string | populated only on FAIL or DEFERRED |
| `artifacts_path` | string | path under `/reports/<run-id>/` for captured logs / pcaps / mavlink dumps |
**Output path**: `e2e/consumer/reports/<run-id>/report.csv` (mounted host-side to `./reports/<run-id>/report.csv`).
**Sidecar artifacts** per scenario (one folder per `test_id`): `stdout.log`, `stderr.log`, `mavlink.tlog` (where applicable), `pcap.bin` (where applicable), `health-trace.jsonl`, `actual-output.json`.
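A minimal sketch of how one report row could be serialised, assuming RFC 4180 style quoting (the spec above names the columns but not the quoting rules). `COLUMNS`, `csv_field`, and `csv_line` are hypothetical helper names, not part of the harness.

```rust
/// Column order copied from the reporting table; the constant name is illustrative.
pub const COLUMNS: [&str; 13] = [
    "test_id", "test_name", "tier", "seed", "start_ts_utc", "duration_ms", "result",
    "expected_result_ref", "actual_value", "compare_method", "tolerance",
    "failure_reason", "artifacts_path",
];

/// Quotes a field when it contains a comma, a double quote, or a line break,
/// doubling any embedded double quotes (RFC 4180 style; an assumption here).
pub fn csv_field(raw: &str) -> String {
    if raw.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", raw.replace('"', "\"\""))
    } else {
        raw.to_string()
    }
}

/// Joins already-stringified fields into one CSV line (no trailing newline).
pub fn csv_line(fields: &[&str]) -> String {
    fields
        .iter()
        .map(|f| csv_field(f))
        .collect::<Vec<_>>()
        .join(",")
}
```

Quoting matters mostly for `failure_reason`, which is free text and may contain commas; `csv_line(&COLUMNS)` would produce the header row.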
## Test Execution
**Decision** (recorded 2026-05-19 by `phases/hardware-assessment.md`): **local-only on Jetson Orin Nano Super**. Every scenario — Tier B, Tier E, Tier HW — runs on representative Jetson hardware (the same hardware the airborne payload deploys to). Docker is used for **service orchestration** (mocks, sibling services) on the Jetson host, NOT for SUT execution on x86.
### Hardware dependencies found
| File | Dependency surfaced |
|---|---|
| `_docs/00_problem/restrictions.md → "Hardware"` | Jetson Orin Nano Super (aarch64), 8 GB shared LPDDR5, 67 TOPS INT8; ViewPro A40 (40× optical zoom + vendor UDP); ViewPro Z40K compatibility |
| `_docs/00_problem/restrictions.md → "Software environment"` | FP16 precision (INT8 rejected); no cloud egress; Tier 1 + local large models share Jetson GPU with mutual exclusion |
| `_docs/01_solution/solution_draft01.md` | "single Rust binary on Jetson Orin Nano Super (aarch64)"; TensorRT FP16; Tokio + Unix-domain-socket VLM IPC |
| `_docs/02_document/architecture.md §6` (NFR Targets) + `§7.6` (Solution Architecture) + `§7.14` (Tech Stack) | cross-compile target `aarch64-unknown-linux-gnu`; TensorRT engine; gimbal UDP; MAVLink-v2 transport |
| `_docs/02_document/components/*/description.md` (13 components) | physical UDP (gimbal_controller), RTSP capture (frame_ingest), MAVLink airframe link (mavlink_layer), local-onboard model (semantic_analyzer + vlm_client) |
### Why local-only on Jetson
The choice rejects two alternatives:
- **Docker-only on x86** would leave Tier-HW rows (L1–L9, Re1, Re2, NFT-RES-LIM-CPU, NFT-RES-LIM-GPU) `SKIPPED-NO-HW`. That defeats the project-level Acceptance Gate (`acceptance_criteria.md → "Acceptance Gates (project-level)"`: every latency criterion MUST be measured on the deployed compute device).
- **Both x86 + Jetson** would split the test surface and let Tier-B scenarios pass on x86 while masking real-hardware regressions (e.g. GPU contention is invisible on x86). The honest path is to exercise the actual hardware path uniformly.
### Execution instructions (local on Jetson)
**Prerequisites** (one-time, per Jetson runner):
- JetPack 6.x SDK + L4T r36.x (matches the airborne deployment image).
- Rust toolchain pinned to the workspace's `rust-toolchain.toml` (added by Step 7 Implement); rustup target `aarch64-unknown-linux-gnu` already native here.
- Docker + Docker Compose v2 (for orchestrating the mock services + sibling repos in Tier-E mode).
- `mavlink-router`, `tegrastats`, `iperf3`, `tc` (network shaping).
- ViewPro A40 (or Z40K for the Z40K-swap regression run) connected over Ethernet at the documented control endpoint.
- ArduPilot SITL binary installed natively (the Docker image is x86-only; on Jetson aarch64 we run SITL natively or via Apptainer).
- A representative ViewPro A40 RTSP feed source — either the physical camera or a recorded `.mp4` looped through a local `mediamtx`.
**How to start services**: `docker compose -f e2e/docker-compose.autopilot-e2e.yml up -d` brings up `detections-mock`, `missions-mock`, `rtsp-loopback`, `gimbal-mock`, `vlm-mock`, `operator-replay`, `time-injector` on the Jetson host. The SUT (`autopilot` binary) runs **outside** the compose — `cargo run --release` on the Jetson directly, with env vars pointing at the compose-side mock endpoints. For Tier E, swap `detections-mock` → live `../detections` and `missions-mock` → live `missions` per `../e2e/docker-compose.suite-e2e.yml`.
**How to run the test runner**: `scripts/run-tests.sh` (to be created by a Decompose task per `traceability-matrix.md → "Phase 4 SKIPPED"` handoff) orchestrates: bring up compose → start SUT → run `cargo test --release --test scenarios -p e2e-consumer` → tear down. The runner reads `RUN_TIER ∈ {B, E, HW}` to decide which scenarios to execute.
**Environment variables** (consumed by both the SUT and the consumer):
- `RUN_TIER` (`B` | `E` | `HW`) — selects scenario set per the matrix below.
- `AUTOPILOT_CONFIG` — path to the test profile TOML (overrides per-scenario thresholds + Q-tagged defaults).
- `AUTOPILOT_RNG_SEED` — deterministic-seed per scenario; captured in the CSV report.
- `JETSON_RUNNER_ID` — identifier for the physical Jetson + camera+gimbal hardware combo; carried into every CSV row for forensic comparison across runners.
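As a hedged illustration of the `RUN_TIER` contract, the sketch below parses the variable and decides whether a scenario is selected. The `RunTier` type, `should_run`, and `run_tier_from_env` are hypothetical names, and the selection rule (run a scenario when its declared tier list contains the requested tier) is an assumption, since the runner script does not exist yet.

```rust
use std::str::FromStr;

/// Tiers accepted by `RUN_TIER` per the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunTier {
    B,
    E,
    Hw,
}

impl FromStr for RunTier {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim() {
            "B" => Ok(RunTier::B),
            "E" => Ok(RunTier::E),
            "HW" => Ok(RunTier::Hw),
            other => Err(format!("unsupported RUN_TIER value: {other:?}")),
        }
    }
}

/// Assumed selection rule: a scenario runs when the requested tier appears
/// in the tier list declared in its header.
pub fn should_run(declared: &[RunTier], requested: RunTier) -> bool {
    declared.contains(&requested)
}

/// Reads `RUN_TIER` from the environment. Treating an unset variable as an
/// error (rather than picking a default) is a choice made for this sketch.
pub fn run_tier_from_env() -> Result<RunTier, String> {
    let raw = std::env::var("RUN_TIER").map_err(|e| format!("RUN_TIER not readable: {e}"))?;
    raw.parse()
}
```

Failing loudly on an unknown value keeps a typo such as `RUN_TIER=hw` from silently running an empty scenario set.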
### CI/CD addendum (overrides the earlier `## CI/CD integration` table)
The earlier table assumed a Docker-on-x86 PR pipeline. Under this decision, every tier runs on a Jetson runner. Operationally that means:
| Stage | Tier(s) | When | Gate | Timeout | Runner |
|---|---|---|---|---|---|
| PR pipeline | U, I | on every PR push | block merge on FAIL | 10 min | Jetson runner (native cargo test for U + I) |
| dev-branch nightly | U, I, B | nightly | warn on FAIL; report attached | 60 min | Jetson runner |
| weekly suite-e2e | U, I, B, E | weekly + on release branch | block release on FAIL | 180 min | Jetson runner + live siblings reachable from it |
| pre-release HW benchmark | HW | manual + pre-release | block release on FAIL | 240 min | Jetson runner + physical A40 + airframe SITL/HW |
Capacity note: the PR pipeline running on Jetson trades x86 throughput for execution honesty. If PR latency becomes painful, the team's mitigation is to add more Jetson runners — NOT to fall back to x86 for Tier B (that would defeat the choice).
## Hardware Execution Matrix
Per the local-only-on-Jetson decision, every tier runs on Jetson. The matrix below is collapsed accordingly: it records **what each scenario actually exercises on the Jetson** (which hardware surface is the load-bearing one) so that a runner-capacity planner can predict which scenarios contend for the same physical resource.
| Scenario | Tier | Jetson surface exercised | Concurrent-with constraint |
|---|---|---|---|
| FT-P-001 (D6 Tier-1 contract) | B + E | GPU (Tier 1 inference) | conflicts with NFT-RES-LIM-Re2 / GPU |
| FT-P-002 to FT-P-006 (D1–D5) | E + HW | GPU (Tier 1 inference) | as above |
| FT-P-007 to FT-P-010 (M1–M4) | B + E | CPU (movement) + GPU (Tier 1 inputs) | as above |
| FT-P-011 to FT-P-015 (S1–S5) | B + E | CPU + gimbal UDP + GPU (Tier 3 in S5) | gimbal contention serialises S1/S2/S3 |
| FT-P-016 to FT-P-022 (O1–O7, O8 happy) | B + E | CPU + operator-stream | low contention |
| FT-P-023 (R1 BIT pass) | B + E | every dep mocked | none |
| FT-N-001 to FT-N-002 (R2/R3) | B + E | none (storage seed manipulation) | none |
| FT-N-003 (Mp2 cache-fallback) | B + E | mock timeout on `missions-mock` | none |
| FT-N-004 (O4 below-threshold) | B | CPU only | none |
| FT-P-024 / FT-P-025 / FT-P-026 (Mp1/Mp3/Mp5) | B + E | network + persistent store | persistent-store contention serialises |
| NFT-PERF-L1 | **HW** | GPU (Tier 1) | dedicate runner — measurement integrity |
| NFT-PERF-L2 | HW + B | GPU (Tier 2) | conflicts with L1/L3/L8 — serialise |
| NFT-PERF-L3 | HW + B (vlm-mock) | GPU (Tier 3 VLM) | conflicts with L1/L2 — serialise |
| NFT-PERF-L4 | **HW** | A40 physical zoom motor | dedicate runner — physical motion |
| NFT-PERF-L5 | HW + B | CPU + gimbal UDP | serialise with L4/L8 |
| NFT-PERF-L6 / L7 | B + E | CPU + ego-motion + GPU (Tier 1 inputs) | serialise with L1 |
| NFT-PERF-L8 | HW + B | A40 physical zoom + Tier 1 GPU | dedicate runner |
| NFT-PERF-L9 | B + E | CPU + operator-stream | low contention |
| NFT-PERF-T1 | B | CPU + queue | none |
| NFT-PERF-T2 | B + E | airframe link | low |
| NFT-PERF-T3 | B | RTSP throttling + health | none |
| NFT-RES-R4–R9 | B + E | airframe link + persistent store | serialise per-mission |
| NFT-RES-Mp2 / Mp4 | B + E | network + persistent store | low |
| NFT-SEC-O9 / O10 | B + E | operator-stream + crypto path | low |
| NFT-SEC-CraftedFrame / OversizeCrop | B | decoder CPU | low |
| NFT-SEC-VlmSchemaViolation / FreeFormText | B (vlm-mock) | UDS IPC | low |
| NFT-SEC-IpcPeerAuth | B | UDS IPC + peer-cred | low |
| NFT-SEC-Tier1SchemaViolation | B | Tier-1 RPC | none |
| NFT-SEC-MavlinkUnsigned | B + E | airframe link (Q6 dep) | low |
| NFT-SEC-HealthExposesSecurity | B | counters + health | low |
| NFT-RES-LIM-Re1 | **HW** | full Jetson workload (RSS) | dedicate runner — measurement integrity |
| NFT-RES-LIM-Re2 | **HW** | Tier 1 + autopilot workload concurrent | runs back-to-back with NFT-PERF-L1 in same session |
| NFT-RES-LIM-Storage | B + HW | persistent store | low |
| NFT-RES-LIM-CPU | **HW** | full CPU | dedicate runner |
| NFT-RES-LIM-GPU | **HW** | GPU mutex (Tier 1 vs Tier 3) | dedicate runner |
| NFT-RES-LIM-FileHandles | B + HW | `/proc/<pid>/fd` | low |
**Bold Tier values** mark scenarios that REQUIRE physical Jetson + (sometimes) physical A40 to satisfy the project-level Acceptance Gate; surrogate replay does NOT count for those rows.
**Capacity rule**: scenarios marked `dedicate runner` MUST NOT run concurrently with any other scenario on the same Jetson — measurement integrity depends on the workload being exclusively the SUT.
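The capacity rule can be sketched as a batch planner for one Jetson runner. This is a simplified illustration: `Scenario` and `plan_batches` are hypothetical names, and the sketch models only the `dedicate runner` flag, not the finer-grained "serialise" and "conflicts with" constraints in the matrix, which a real planner would also need to honour.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Scenario {
    pub id: &'static str,
    /// True for rows marked `dedicate runner` in the matrix.
    pub dedicate_runner: bool,
}

/// Returns batches to execute one after another on a single Jetson.
/// Every dedicated scenario gets a batch of its own; the remaining
/// scenarios share one final batch (a simplification of this sketch).
pub fn plan_batches(scenarios: &[Scenario]) -> Vec<Vec<&'static str>> {
    let mut batches = Vec::new();
    let mut shared = Vec::new();
    for s in scenarios {
        if s.dedicate_runner {
            batches.push(vec![s.id]);
        } else {
            shared.push(s.id);
        }
    }
    if !shared.is_empty() {
        batches.push(shared);
    }
    batches
}
```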
## Open dependencies that affect the harness
| Open Q | Affects | Default until resolved |
|---|---|---|
| Q6 (MAVLink-2 signing) | `mavlink-sitl` config + observed-MAVLink assertions | signing disabled; tests skip signing assertions until Q6 lands |
| Q8 (MapObjects conflict resolution) | Mp5 fixture shape | `<DEFERRED>` |
| Q9 (Operator-command auth scheme) | `operator-replay` envelope format + signature validator | `<DEFERRED>` for O9/O10; O8 runs the happy path only |
| Q11 (multi-operator session policy) | `operator-replay` session-id semantics | single-operator only |
| Q14 (movement-detection classical vs learned-CV) | M4 benchmark fixture shape | `<DEFERRED>` |
@@ -0,0 +1,270 @@
# Performance Tests
Authored by `/test-spec` Phase 2 (2026-05-19). Performance tests measure latency / rate / sustained-load characteristics. Functional behaviour that those characteristics enable lives in `blackbox-tests.md`. Resource ceilings live in `resource-limit-tests.md`.
Every scenario records steady-state metrics — cold-start measurements are explicitly excluded by a warm-up precondition. Pass criteria use the methods in `_docs/00_problem/input_data/expected_results/results_report.md` (referenced by row id).
---
## Latency
### NFT-PERF-L1: Tier-1 per-frame end-to-end latency ≤ 100 ms
**Summary**: Per-frame end-to-end latency through the Tier-1 contract (frame in → normalised-box record out) ≤ 100 ms at 1280 px input.
**Traces to**: AC `Latency — Primitive (Tier 1) object detection / L1`.
**Tier**: HW (representative Jetson Orin Nano Super) OR benchmarked replay (the only way to satisfy the project-level Acceptance Gate).
**Metric**: per-frame wall-clock from RTSP frame-receive timestamp to normalised-box emission timestamp.
**Preconditions**:
- Warm-up: 100 frames played before measurement starts (TensorRT engine warm, autopilot's frame pipeline in steady state).
- Single 1280 px frame replayed via `rtsp-loopback`; the live Tier-1 service is colocated on the same Jetson.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Play `fixtures/images/4d6e1830d211ad50.jpg` as a 60 s loop at 30 fps | record per-frame (frame_receive_ts, normalised_box_emit_ts); compute Δms |
| 2 | Aggregate over the measurement window | report p50, p95, p99, max |
**Pass criteria**: `p95 ≤ 100 ms` AND `max ≤ 150 ms` (max gives a soft headroom; AC enforces the p95 line).
**Duration**: 60 s after warm-up.
**Test status**: READY (fixture present); Tier requires HW for the release gate.
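The aggregation and pass rule for this scenario could look like the sketch below. The spec does not name a percentile method, so nearest-rank is an assumption; `LatencySummary`, `summarise`, and `l1_passes` are hypothetical names.

```rust
/// Summary statistics over per-frame latencies, all in milliseconds.
#[derive(Debug, PartialEq)]
pub struct LatencySummary {
    pub p50: f64,
    pub p95: f64,
    pub p99: f64,
    pub max: f64,
}

/// Nearest-rank percentile over a non-empty, ascending-sorted slice (q in 1..=100).
fn nearest_rank(sorted: &[f64], q: usize) -> f64 {
    // rank = ceil(q / 100 * n), computed in integers to avoid float rounding.
    let rank = (q * sorted.len() + 99) / 100;
    sorted[rank.max(1) - 1]
}

/// Returns `None` for an empty sample set or when any sample is not finite.
pub fn summarise(samples_ms: &[f64]) -> Option<LatencySummary> {
    if samples_ms.is_empty() || samples_ms.iter().any(|v| !v.is_finite()) {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(LatencySummary {
        p50: nearest_rank(&sorted, 50),
        p95: nearest_rank(&sorted, 95),
        p99: nearest_rank(&sorted, 99),
        max: sorted[sorted.len() - 1],
    })
}

/// NFT-PERF-L1 pass rule as stated above: p95 <= 100 ms AND max <= 150 ms.
pub fn l1_passes(s: &LatencySummary) -> bool {
    s.p95 <= 100.0 && s.max <= 150.0
}
```

The same `summarise` helper would serve the other latency scenarios; only the threshold check differs per scenario.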
---
### NFT-PERF-L2: Tier-2 per-ROI semantic confirmation ≤ 200 ms
**Summary**: Per-ROI latency through Tier-2 semantic confirmation ≤ 200 ms.
**Traces to**: AC `Latency — Semantic confirmation (Tier 2) / L2`.
**Tier**: HW + Tier-B (inline ROI crop generation).
**Metric**: per-ROI wall-clock from ROI submitted to Tier-2 to Tier-2 emits semantic confirmation.
**Preconditions**:
- Warm-up: 50 ROIs processed before measurement.
- Test runner derives a ~640×640 ROI inline from `fixtures/images/4d6e1830d211ad50.jpg` and injects it directly into the SUT's Tier-2 entry (via a test-only ROI submission API exposed in test builds).
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Submit 1000 ROIs at 5 Hz | per-ROI Δms |
| 2 | Aggregate | p50, p95, p99 |
**Pass criteria**: `p95 ≤ 200 ms`.
**Duration**: 200 s.
**Test status**: READY.
---
### NFT-PERF-L3: Tier-3 deep-analysis ≤ 5 s per ROI
**Summary**: Per-ROI deep-analysis (Tier-3 / VLM, when enabled) ≤ 5 s.
**Traces to**: AC `Latency — Deep semantic confirmation (Tier 3 / VLM, when enabled) / L3`.
**Tier**: HW + Tier-B (vlm-mock).
**Metric**: per-ROI wall-clock from SUT issuing a Tier-3 IPC call to VLM response received and schema-validated.
**Preconditions**:
- Warm-up: 5 Tier-3 calls.
- `vlm-mock` configured to respond from `vlm-io-pairs` fixture; Tier-3 enabled via SUT config.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Trigger 100 Tier-3 calls via injected ROIs | per-call Δms |
| 2 | Aggregate | p50, p95, p99 |
**Pass criteria**: `p95 ≤ 5000 ms`.
**Duration**: as needed for 100 calls.
**Test status**: DEFERRED — `<DEFERRED: vlm-io-pairs (real I/O) and the pinned local VLM model>`.
---
### NFT-PERF-L4: Camera zoom transition (medium → high) ≤ 2 s
**Summary**: Wall-clock from issuing the medium→high zoom command to the physical zoom transition completing ≤ 2 s, including the 1–2 s physical floor (restriction).
**Traces to**: AC `Latency — Camera zoom transition / L4`, RESTRICT `Hardware — 40× optical zoom traversal takes 1–2 s wall-clock`.
**Tier**: HW (physical A40 OR benchmarked replay) — pure-emulator runs not acceptable per `expected_results/results_report.md → Notes on this spec`.
**Metric**: wall-clock from outbound zoom command (observed on gimbal UDP) to gimbal-mock zoom telemetry reporting target_zoom_band.
**Preconditions**:
- SUT in `ZoomedIn` mode after a sweep-to-zoom transition; gimbal at medium zoom.
- HW Jetson OR `gimbal-mock` replaying recorded A40 zoom telemetry with realistic traversal time.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Trigger 30 medium→high zoom transitions via scripted POI sequence | per-transition Δms |
| 2 | Aggregate | p50, p95, max |
**Pass criteria**: `p95 ≤ 2000 ms`.
**Test status**: DEFERRED — `<DEFERRED: SITL or hardware-in-loop ViewPro A40 zoom command capture>`.
---
### NFT-PERF-L5: Decision-to-movement latency ≤ 500 ms
**Summary**: From the internal scan-control decision (POI detected mid-sweep) to the camera physically beginning to move ≤ 500 ms.
**Traces to**: AC `Latency — Decision-to-movement latency / L5`.
**Tier**: HW + Tier-B.
**Metric**: wall-clock from Tier-1 detection received at the scan-controller to first gimbal command observed on `gimbal-mock`.
**Preconditions**:
- Warm-up: 10 scripted POI events.
- Scripted scan-decision events followed by camera physical motion observed on the gimbal UDP channel.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Inject 100 POI detections at random sweep positions | per-event Δms (detection-receive-ts → gimbal-command-out-ts) |
| 2 | Aggregate | p95 |
**Pass criteria**: `p95 ≤ 500 ms`.
**Test status**: DEFERRED — `<DEFERRED: scripted scan decision events with paired gimbal telemetry capture>`.
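One way to derive the per-event Δms is to pair each detection timestamp with the first gimbal command that follows it. The sketch below assumes both timestamp lists are sorted ascending and share a clock; `pair_latencies` is a hypothetical helper, and treating an unanswered detection as `None` (left for the caller to score) is an assumption.

```rust
/// For each trigger (detection-receive-ts), returns the delay in ms to the
/// first response (gimbal-command-out-ts) at or after it, provided that
/// response arrives before the next trigger. Both inputs must be sorted
/// ascending. `None` marks a trigger with no response in its slot.
pub fn pair_latencies(triggers_ms: &[u64], responses_ms: &[u64]) -> Vec<Option<u64>> {
    let mut out = Vec::with_capacity(triggers_ms.len());
    let mut r = 0;
    for (i, &t) in triggers_ms.iter().enumerate() {
        // Skip responses that precede this trigger.
        while r < responses_ms.len() && responses_ms[r] < t {
            r += 1;
        }
        let next_trigger = triggers_ms.get(i + 1).copied().unwrap_or(u64::MAX);
        match responses_ms.get(r) {
            Some(&resp) if resp < next_trigger => {
                out.push(Some(resp - t));
                r += 1; // each response is attributed to one trigger only
            }
            _ => out.push(None),
        }
    }
    out
}
```

Counting a `None` as a failed sample rather than dropping it would keep a silently ignored POI from improving the p95.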
---
### NFT-PERF-L6: Movement candidate enqueue ≤ 1 s (wide sweep)
**Summary**: From the movement event in the visual stream to candidate enqueued for zoomed inspection ≤ 1 s during the wide-area sweep.
**Traces to**: AC `Latency — Movement candidate enqueue / L6`.
**Tier**: B + E.
**Metric**: wall-clock from ground-truth movement-event timestamp (annotated in the fixture) to candidate appearing on operator-stream.
**Preconditions**:
- Warm-up: 30 s of sweep playback.
- Synchronised RTSP + gimbal.csv + telemetry.csv (DEFERRED CSV pair).
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Replay `fixtures/movement/video01.mp4` + paired CSVs | record per-event Δms |
| 2 | Aggregate over ~20 movement events | p95 |
**Pass criteria**: `p95 ≤ 1000 ms`.
**Test status**: DEFERRED — `<DEFERRED: paired gimbal.csv + telemetry.csv for video01.mp4 with annotated movement-event timestamps>`.
---
### NFT-PERF-L7: Movement candidate enqueue ≤ 1.5 s (zoomed-in)
**Summary**: Same as L6 but during a zoomed-in hold; budget relaxed to 1.5 s to accommodate gimbal slew.
**Traces to**: AC `Latency — Movement candidate enqueue … during the zoomed-in inspection / L7`.
**Tier**: B + E.
**Metric**: same as L6 but starting from a ZoomedIn hold.
**Preconditions**:
- SUT in ZoomedIn hold; small mover appears mid-hold.
- DEFERRED zoomed-in CSV pair.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Drive SUT into ZoomedIn hold; replay zoomed-in scene with small mover | per-event Δms |
| 2 | Aggregate over ~10 movement events | p95 |
**Pass criteria**: `p95 ≤ 1500 ms`.
**Test status**: DEFERRED — `<DEFERRED: paired gimbal.csv + telemetry.csv at zoomed-in band>`.
---
### NFT-PERF-L8: Zoom-out → zoom-in transition ≤ 2 s
**Summary**: From POI detected during sweep to ROI fully zoomed and held ≤ 2 s wall-clock.
**Traces to**: AC `Latency — Zoom-out → zoom-in transition / L8`.
**Tier**: HW + Tier-B.
**Metric**: wall-clock from Tier-1 detection injected → first frame at full zoom on the ROI (observed via gimbal-mock zoom telemetry and the operator-stream ROI overlay).
**Preconditions**:
- Warm-up.
- Scripted sweep + injected POI.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Inject 30 mid-sweep POIs | per-transition Δms |
| 2 | Aggregate | p95 |
**Pass criteria**: `p95 ≤ 2000 ms`.
**Test status**: DEFERRED — `<DEFERRED: sweep → zoomed-inspection transition capture with annotated transition-complete timestamps>`.
---
### NFT-PERF-L9: Operator command → action ≤ 500 ms
**Summary**: From operator click event (entering the SUT on the operator-stream return path) to the corresponding outbound command observed on its destination channel ≤ 500 ms; modem RTT is explicitly excluded by measuring on the SUT side of the modem.
**Traces to**: AC `Latency — Operator command → action / L9`.
**Tier**: B + E.
**Metric**: wall-clock from operator-stream message arrival at SUT → first outbound command observed on the affected channel (MAVLink waypoint POST, gimbal command, mode-change emission).
**Preconditions**:
- Operator-session-scripts include click events at deterministic offsets.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Replay scripted operator-click sequence (50 clicks across confirm / decline / target-follow / abort) | per-click Δms |
| 2 | Aggregate | p95 |
**Pass criteria**: `p95 ≤ 500 ms`.
**Test status**: DEFERRED — `<DEFERRED: operator-envelopes once Q9 resolves>` for signed commands; happy-path placeholder usable today for an early measurement (mark interim baseline only).
---
## Throughput / Rate
### NFT-PERF-T1: POI rate to operator capped at ≤ 5 / min
**Summary**: Even when Tier-1 produces detections faster than the cap, the rate of POIs SURFACED to the operator MUST stay ≤ 5 / min (hard cap, frozen 2026-05-06).
**Traces to**: AC `Throughput / Rate — POI rate surfaced to the operator / T1`.
**Tier**: B.
**Metric**: count of POIs emitted on operator-stream per rolling 60 s window.
**Preconditions**:
- Synthetic POI feed sustained at 20 POIs / min via `synthetic-poi-feeds`.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Inject sustained 20 POI/min feed for 10 minutes | per-minute count of surfaced POIs |
| 2 | Compute max over any rolling 60 s window | rolling-max |
**Pass criteria**: `rolling-max ≤ 5` POIs/min for every 60 s window.
**Duration**: 10 min.
**Test status**: READY (synthetic feeds inline-authorable).
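The rolling-max measurement in step 2 can be sketched with a sliding window over the surfaced-POI timestamps. This is an illustrative sketch only; it assumes half-open windows `[t, t + 60 s)` and a hypothetical input of timestamps in seconds.

```python
def rolling_max_count(event_times_s, window_s=60.0):
    """Largest number of events that fall inside any half-open
    window [t, t + window_s) anchored at an event."""
    times = sorted(event_times_s)
    best = 0
    left = 0
    for right, t in enumerate(times):
        # Shrink the window until it spans strictly less than window_s.
        while t - times[left] >= window_s:
            left += 1
        best = max(best, right - left + 1)
    return best


def t1_passes(surfaced_poi_times_s, cap_per_min=5):
    return rolling_max_count(surfaced_poi_times_s) <= cap_per_min
```

Anchoring windows at event timestamps is enough: any window with the maximum count can be slid forward to start at its first event without losing an event.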
---
### NFT-PERF-T2: Position telemetry rate ∈ [1 Hz, 10 Hz]
**Summary**: The position telemetry the SUT consumes from the airframe link MUST sustain ≥1 Hz, target 10 Hz, over a 60 s window.
**Traces to**: AC `Throughput / Rate — Position telemetry rate / T2`.
**Tier**: B (with MAVLink replay) + E (live SITL).
**Metric**: count of `GLOBAL_POSITION_INT` messages consumed by the SUT per second.
**Preconditions**:
- MAVLink stream replayed at the configured target rate (10 Hz).
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Replay 60 s of GLOBAL_POSITION_INT at 10 Hz | per-second consumed count |
| 2 | Aggregate | min, mean |
**Pass criteria**: `min ≥ 1 Hz` AND `mean ≥ 9.5 Hz` (target 10 Hz with ≤ 5 % tolerance).
**Test status**: DEFERRED — `<DEFERRED: MAVLink replay fixture over a 60 s window>`.
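The per-second aggregation and the two-part pass criterion can be sketched as follows. This is a sketch under assumptions: arrival timestamps are integer milliseconds since the start of the window, and the helper names are hypothetical.

```python
from collections import Counter


def per_second_counts(arrival_ms, duration_s=60):
    """Count consumed messages in each 1 s bucket of the window."""
    buckets = Counter(
        t // 1000 for t in arrival_ms if 0 <= t < duration_s * 1000
    )
    # Counter returns 0 for seconds in which nothing arrived.
    return [buckets[s] for s in range(duration_s)]


def t2_verdict(arrival_ms, duration_s=60):
    counts = per_second_counts(arrival_ms, duration_s)
    mean_hz = sum(counts) / len(counts)
    return {
        "min_hz": min(counts),
        "mean_hz": mean_hz,
        "passed": min(counts) >= 1 and mean_hz >= 9.5,
    }
```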
---
### NFT-PERF-T3: Frame-rate floor → suppress zoom-in + health yellow
**Summary**: When the sustained camera frame rate drops below 10 fps for ≥5 s, zoom-in transitions MUST be suppressed AND overall health MUST surface yellow.
**Traces to**: AC `Throughput / Rate — Sustained camera frame-rate floor / T3`.
**Tier**: B.
**Metric**: pair: (boolean — was a zoom-in suppressed during the low-FPS window?), (boolean — did health surface yellow?).
**Preconditions**:
- SUT in normal sweep mode.
- `rtsp-loopback` plays `fixtures/videos/94d42580bd1ad6ff.mp4` with throttled decode injecting frame drops to keep FPS < 10 for ≥ 5 s.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Start playback at normal 30 fps | health remains green; zoom-in proceeds normally on detection |
| 2 | Throttle decode + drop frames to push FPS below 10 for ≥ 5 s | record: (a) whether a zoom-in-required event during this window was suppressed; (b) whether `GET /health` returns `overall == "yellow"` |
**Pass criteria**: both observations TRUE.
**Duration**: 30 s (5 s low-FPS window + buffer).
**Test status**: READY (fixture present; throttling implemented by consumer).
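The trigger condition (FPS below 10 for at least 5 s) can be expressed as a run-length check over per-second FPS samples. This is a minimal sketch of what the consumer might compute to decide which signals to expect; the one-sample-per-second cadence and the function names are assumptions.

```python
def longest_low_fps_run(fps_samples, floor=10.0):
    """Longest run, in samples, with FPS strictly below the floor.
    Assumes one FPS sample per second."""
    longest = run = 0
    for fps in fps_samples:
        run = run + 1 if fps < floor else 0
        longest = max(longest, run)
    return longest


def t3_expected_signals(fps_samples, floor=10.0, hold_s=5):
    """Signals the scenario expects once the low-FPS hold is met."""
    triggered = longest_low_fps_run(fps_samples, floor) >= hold_s
    return {
        "suppress_zoom_in": triggered,
        "health_overall": "yellow" if triggered else "green",
    }
```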
---
## Sustained-load (handoff to resource-limit-tests)
The two sustained-resource AC rows (Re1, Re2) live as resource-limit tests rather than performance tests because the pass criterion is "stays within ceiling for the duration", not "is fast enough":
- Re1 — combined RSS ≤ 6 GB onboard for everything autopilot owns — see `resource-limit-tests.md → NFT-RES-LIM-Re1`.
- Re2 — Tier-1 per-frame latency Δ ≤ 5 ms when autopilot's workload runs concurrently — see `resource-limit-tests.md → NFT-RES-LIM-Re2`. Re2 is the Tier-1 non-degradation contract; the absolute Tier-1 latency target is L1.
---
## Common preconditions for every performance scenario
- **Warm-up**: every scenario MUST include an explicit warm-up phase whose duration is recorded in the CSV report. This separates cold-start cost from steady-state behaviour.
- **Steady-state window**: pass criteria apply only to the steady-state window (after warm-up), not to the warm-up itself.
- **Hardware honesty**: scenarios that name Tier HW MUST run on representative Jetson Orin Nano Super OR on a benchmarked replay. Pure-x86-emulator runs report results but do NOT contribute to the project-level Acceptance Gate.
- **Concurrent workload disclosure**: every scenario records whether other autopilot subsystems were running concurrently (Tier-1 inference, VLM, MAVLink, etc.). Re2 is the only scenario that REQUIRES concurrent workload; the others MUST report it for context.
- **Seed + determinism**: where a test's inputs involve randomness (e.g., synthetic-POI ordering tie-breakers), the seed is captured in the CSV report.
@@ -0,0 +1,196 @@
# Resilience Tests
Authored by `/test-spec` Phase 2 (2026-05-19). Resilience tests inject a fault, observe behaviour during the fault, observe recovery behaviour, and assert against both. The fault and the recovery contract are both quantifiable.
BIT pre-flight pathways (positive R1, negatives R2/R3) are in `blackbox-tests.md` because they assert a functional gate. The runtime fault scenarios live here.
---
### NFT-RES-R4: Lost operator/Ground-Station link → RTL at 30 s grace (default)
**Summary**: Sustained loss of the operator/Ground-Station radio link MUST trigger an RTL exactly at the configured grace window (default 30 s), and operator-link health MUST flip red.
**Traces to**: AC `Reliability & Safety — Loss of operator/Ground-Station radio link MUST trigger a known mission-safe outcome / R4`, RESTRICT `Reliability & Safety — Lost operator-link failsafe MUST be deterministic and bounded`.
**Tier**: B + E.
**Preconditions**:
- SUT mid-flight (scripted MAVLink stream + active operator session).
- Operator session in steady state for ≥ 30 s before fault injection.
- Grace window configured to default 30 s.
**Fault injection**:
- `operator-replay` issues `lost-link` event at T=0 and STAYS silent (no reconnect) for the remainder of the window.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Inject lost-link event at T=0 | health endpoint immediately shows `deps.operator_link == "red"`; `last_seen_at` frozen |
| 2 | Wait 25 s (within grace) | NO RTL command yet on `mavlink-sitl`; SUT continues mission |
| 3 | Wait until T=30 s | RTL command observed on `mavlink-sitl` at T = 30 s ± 1 s; operator-stream emits a `failsafe_triggered` event with reason `operator_link_lost` |
| 4 | Optionally reconnect operator-replay after RTL | RTL persists (operator cannot un-RTL silently — requires explicit operator override per AC); health.operator_link transitions back to green when traffic resumes |
**Pass criteria**: `exact (RTL command observed)` at T = 30 s ± 1 s; `exact (deps.operator_link == "red")` while the link is down.
**Recovery time bound**: RTL must be issued within 31 s of fault start.
**Test status**: READY (operator-session-scripts inline-authorable; mavlink-sitl runs an ArduPilot SITL accepting RTL).
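The timing assertion can be sketched as a check on the first RTL observation relative to fault injection. This is an illustrative sketch; the input (RTL observation times in seconds since T=0) and the verdict strings are hypothetical.

```python
def r4_verdict(rtl_times_s, grace_s=30.0, tol_s=1.0):
    """Classify the first RTL observed after lost-link injection at T=0."""
    if not rtl_times_s:
        return "fail: no RTL observed"
    first = min(rtl_times_s)
    if first < grace_s - tol_s:
        return "fail: RTL issued inside grace window"
    if first > grace_s + tol_s:
        return "fail: RTL issued late"
    return "pass"
```

An early RTL is treated as a failure here because step 2 requires the SUT to continue the mission inside the grace window.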
---
### NFT-RES-R5: Battery at RTL-floor → RTL
**Summary**: When the airframe battery sample drops to the configured RTL floor (e.g. 25 %), the SUT MUST issue an RTL and health MUST surface yellow.
**Traces to**: AC `Reliability & Safety — Battery at or below the configured RTL floor / R5`.
**Tier**: B + E.
**Preconditions**:
- SUT mid-flight; battery telemetry replayed via `mavlink-sitl` at 1 Hz.
**Fault injection**:
- `mavlink-sitl` scripted battery curve: starts at 80 %; ramps down to 25 % at T=T0; held at 25 % afterwards.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | At T=T0, battery reads 25 % | within 1 sample period (1 s) the SUT issues RTL on `mavlink-sitl`; health transitions to `overall == "yellow"`; operator-stream emits `failsafe_triggered` with reason `battery_rtl_floor` |
| 2 | Battery continues at 25 % | RTL persists; no oscillation |
**Pass criteria**: `exact (RTL command observed)` + `exact (health.overall == "yellow")`.
**Test status**: DEFERRED — `<DEFERRED: mid-flight battery sample at RTL-floor via mavlink-sitl battery curve script>`.
---
### NFT-RES-R6: Battery at hard floor → land-now
**Summary**: When the battery hits the configured hard floor (e.g. 15 %), the SUT MUST issue land-now and ONLY an authenticated operator command may override.
**Traces to**: AC `Reliability & Safety — battery at or below the hard floor / R6`.
**Tier**: B + E.
**Preconditions**:
- SUT mid-flight; battery ramps to 15 %.
**Fault injection**: same as R5 but ramp continues to 15 %.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | At T=T0, battery reads 15 % | within 1 sample period the SUT issues land-now (`MAV_CMD_NAV_LAND` or equivalent) on `mavlink-sitl`; health red; operator-stream emits `failsafe_triggered` with reason `battery_hard_floor` |
| 2 | Replay an UNAUTHENTICATED operator-override command | SUT REFUSES; land-now persists |
| 3 | Replay an AUTHENTICATED operator-override (placeholder until Q9; full once Q9 resolves) | land-now cancelled; SUT returns to prior mode |
**Pass criteria**: `exact (land_now observed)`; `exact (refusal of unauthenticated override)`; `exact (acceptance of authenticated override)`.
**Test status**: DEFERRED — same fixture gap as R5; step 3's full authentication semantics also `<DEFERRED: Q9>`.
---
### NFT-RES-R7: Airframe link exhaustion → health red after max-retry
**Summary**: When MAVLink commands fail through the configured bounded-retry budget (no airframe response), the airframe-link dependency MUST flip health red.
**Traces to**: AC `Reliability & Safety — MAVLink command exhaustion (bounded retry with exponential backoff fails through max-retry) / R7`.
**Tier**: B + E.
**Preconditions**:
- SUT mid-flight; max-retry configured (e.g., 5 attempts; exponential backoff base 100 ms).
**Fault injection**:
- `mavlink-sitl` configured to drop all command-ack messages for the duration of the test (peer non-responsive).
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | SUT issues a MAVLink command (e.g., waypoint upload) | command sent; no ack received |
| 2 | Backoff + retry loop executes through max-retry | retries observed on the wire with exponential backoff |
| 3 | After final retry exhausts | health.airframe_link transitions to red; operator-stream emits a `dependency_degraded` event with reason `airframe_link_retry_exhausted` |
**Pass criteria**: `exact (health.airframe_link == "red")` after max-retry; retries observed with backoff base 100 ms ± 20 ms.
**Test status**: DEFERRED — `<DEFERRED: airframe link command + bounded retry/backoff with peer not responding through max-retries>`.
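The on-the-wire backoff check can be sketched as below. Several details are assumptions not fixed by the spec: a doubling factor between retries, and the ± 20 ms tolerance on the 100 ms base applied proportionally (20 %) to later gaps. Input is the list of send timestamps (ms) for the original attempt plus its retries.

```python
def backoff_ok(send_times_ms, max_retries=5, base_ms=100.0,
               factor=2.0, rel_tol=0.2):
    """True if the command was retried exactly max_retries times and
    every inter-send gap matches base_ms * factor**i within rel_tol."""
    retries = len(send_times_ms) - 1  # first send is the original attempt
    if retries != max_retries:
        return False
    gaps = [b - a for a, b in zip(send_times_ms, send_times_ms[1:])]
    for i, gap in enumerate(gaps):
        expected = base_ms * factor ** i
        if abs(gap - expected) > rel_tol * expected:
            return False
    return True
```

Requiring exactly `max_retries` retries covers both halves of the bounded-behaviour invariant: the SUT must exhaust the budget and must not exceed it.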
---
### NFT-RES-R8: Wall-clock drift > 200 ms → time-source yellow
**Summary**: When wall-clock drift versus GPS or NTP source exceeds 200 ms, the time-source dependency MUST report yellow, AND `clock_source` + `last_sync_at` MUST reflect the drift.
**Traces to**: AC `Reliability & Safety — Wall-clock drift greater than 200 ms / R8`, RESTRICT `Wall-clock MUST be bound to GPS time once GPS is locked, or NTP at boot`.
**Tier**: B.
**Preconditions**:
- SUT running with `time-injector` LD_PRELOAD active.
- GPS source initially locked via `mavlink-sitl` GPS_RAW_INT messages.
**Fault injection**:
- `time-injector` advances the SUT process clock by 250 ms over a 1 s window while keeping GPS source locked.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Bind clock to GPS at boot | health.time_source == green; `clock_source == "gps"`; `last_sync_at` recent |
| 2 | Inject 250 ms drift | within 5 s health.time_source transitions to yellow; `clock_source` and `last_sync_at` updated to reflect the drift |
| 3 | Stop drift | health.time_source returns to green within the next sync cycle |
**Pass criteria**: `exact (health.time_source == "yellow")` during step 2; `exact (clock_source updated)` + `exact (last_sync_at updated)`.
**Test status**: READY (time-drift-scripts inline-authorable).
---
### NFT-RES-R9: Geofence EXCLUSION crossing → waypoint refusal + RTL
**Summary**: When a simulated waypoint crosses an EXCLUSION polygon, the SUT MUST refuse the waypoint AND trigger RTL. Symmetric behaviour for INCLUSION violations.
**Traces to**: AC `Reliability & Safety — Geofence INCLUSION and EXCLUSION violations MUST both result in waypoint refusal + RTL / R9`, RESTRICT `Geofence enforcement MUST be symmetric`.
**Tier**: B + E.
**Preconditions**:
- SUT mid-flight; geofence INCLUSION + EXCLUSION polygons loaded as part of the mission.
**Fault injection**:
- Scripted waypoint upload that crosses the EXCLUSION polygon; subsequently INCLUSION-exit test.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Upload waypoint crossing EXCLUSION polygon | SUT refuses the waypoint; structured-log WARN with `geofence_violation_exclusion`; RTL command observed on `mavlink-sitl` |
| 2 | Reset; upload waypoint exiting the INCLUSION polygon | identical behaviour — refused + RTL |
**Pass criteria**: `exact (waypoint rejected)` + `exact (RTL command observed)` for both EXCLUSION and INCLUSION cases.
**Test status**: DEFERRED — `<DEFERRED: geofence EXCLUSION polygon crossed by simulated waypoint via mavlink-sitl scripted mission>`.
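The symmetric INCLUSION/EXCLUSION rule can be illustrated with a point-in-polygon check. This is a simplified sketch: it tests only the waypoint itself in planar coordinates using ray casting, whereas the scenario also concerns legs that cross a polygon; the verdict strings and function names are hypothetical.

```python
def point_in_polygon(pt, poly):
    """Ray-casting test; poly is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal ray at height y count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def waypoint_verdict(pt, inclusion, exclusions):
    """Both violation kinds map to the same outcome: refuse + RTL."""
    if not point_in_polygon(pt, inclusion):
        return "refuse+rtl: inclusion"
    if any(point_in_polygon(pt, ex) for ex in exclusions):
        return "refuse+rtl: exclusion"
    return "accept"
```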
---
### NFT-RES-Mp2: Map-pull timeout → cache-fallback (functional coverage in FT-N-003)
**Summary**: When the pre-flight map pull times out, the SUT falls back to last-known cached MapObjects and surfaces `map_sync == "cached_fallback"` with an operator-ack gate. (Functional gate semantics are tested in `blackbox-tests.md → FT-N-003`; this scenario adds the **timing+recovery** dimension.)
**Traces to**: AC `Map Reconciliation — Cache-fallback on timeout / Mp2`.
**Tier**: B.
**Preconditions**:
- `autopilot-state` seeded with a known prior MapObjects snapshot.
- `missions-mock` configured to time out on `GET /missions/{id}/mapobjects` for a configurable duration.
**Fault injection**:
- `missions-mock` returns 504 / silent timeout for 60 s; then responds normally.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Trigger BIT | SUT issues `GET /missions/{id}/mapobjects`; observes timeout (per its configured request timeout); within 5 s falls back to cached snapshot; `map_sync == "cached_fallback"`; BIT requires explicit operator ack (see FT-N-003) |
| 2 | Mock recovers (responds normally) | next periodic resync re-attempts; once successful, `map_sync == "live"`; structured-log INFO `map_resync_recovered` |
**Pass criteria**: `exact (cached_fallback within 5 s of timeout)`; recovery within the next resync cycle.
**Test status**: READY (no external fixture beyond `mission-suite-fixture` (DEFERRED) for the cached snapshot seed; cached snapshot can be authored inline at minimal scale).
---
### NFT-RES-Mp4: Post-flight map-push 5xx → persist + bounded retry + operator warning
**Summary**: When the post-flight `POST /missions/{id}/mapobjects` returns 5xx, the pending diff MUST be persisted on durable on-device storage, an operator-visible warning MUST surface, AND bounded retry MUST execute (capped at the configured retry limit).
**Traces to**: AC `Map Reconciliation — Failure MUST persist the pending diff to durable on-device storage with bounded retry / Mp4`, RESTRICT `On-device storage MUST be bounded`.
**Tier**: B + E.
**Preconditions**:
- SUT post-landing; pending diff ready to push.
- `missions-mock` configured to return 5xx N times then 200.
**Fault injection**:
- `missions-mock` returns 503 for the first N attempts (N = configured retry-cap + 1); then returns 200.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Trigger post-flight reconciliation | SUT issues `POST /missions/{id}/mapobjects`; receives 503 |
| 2 | Observe persistence | pending diff file exists under `autopilot-state/pending_map_diff/<mission-id>.json`; size > 0 |
| 3 | Observe operator-stream | warning event `map_push_failed` surfaced |
| 4 | Observe retry loop | retries observed within the configured cap; backoff with jitter |
| 5 | After retry-cap reached without success | SUT stops retrying; pending file remains for next session pickup |
| 6 | Eventual success (mock returns 200) | next attempt succeeds; pending file removed; warning cleared |
**Pass criteria**: `exact (pending file exists)` + `exact (warning surfaced)` + `threshold_max (retries ≤ configured cap)`.
**Test status**: DEFERRED — `<DEFERRED: same fixture as Mp3 (60-minute pass diff)>`.
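The expected persist-and-retry behaviour across sessions can be modelled with a small simulation. This is a sketch, not the SUT's implementation: it assumes one initial attempt plus up to `retry_cap` retries per session, omits backoff timing, and uses a plain iterator of status codes as a stand-in for `missions-mock`.

```python
def push_session(mock_statuses, retry_cap):
    """Run one push session against an iterator of HTTP status codes.
    Returns the attempts made, whether the pending diff must remain on
    disk, and whether the operator warning is still raised."""
    warning = False
    for attempt in range(1, retry_cap + 2):  # initial try + retry_cap retries
        status = next(mock_statuses)
        if 200 <= status < 300:
            return {"attempts": attempt, "pending": False, "warning": False}
        warning = True  # map_push_failed surfaced on first failure
    return {"attempts": retry_cap + 1, "pending": True, "warning": warning}
```

With `retry_cap = 3` and a mock that returns 503 four times then 200, the first session stops after four attempts with the pending diff kept, and the next session succeeds on its first attempt.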
---
## Recovery-time invariants common to every scenario
- **No silent error swallowing.** Every fault scenario MUST observe a corresponding structured-log entry at WARN+ AND a corresponding health-endpoint transition. A fault that the SUT handles without surfacing through both channels is a TEST FAILURE per `security_approach.md → "No silent error swallowing for security-relevant failures"` (extended here to operational faults per `coderule.mdc → "Never suppress errors silently"`).
- **Bounded behaviour.** Every retry/backoff loop MUST be bounded — the scenario asserts the cap on retry count and the cap on backoff window. Open-ended retry is a test failure.
- **State integrity post-recovery.** After fault recovery (when applicable), the scenario asserts that the SUT returns to a known state — mode unchanged unless the fault legitimately altered it (e.g., RTL stays RTL until operator override).
- **Symmetry assertions.** R9 explicitly tests both INCLUSION and EXCLUSION because the AC names symmetric behaviour. Wherever an AC pairs two outcomes (`fail-fast` + `fail-closed`, `red` + `yellow`, etc.), the resilience scenario MUST cover both halves.
@@ -0,0 +1,156 @@
# Resource Limit Tests
Authored by `/test-spec` Phase 2 (2026-05-19). Resource-limit tests assert that the SUT stays within a quantified resource ceiling for the configured duration. Short bursts do not satisfy these tests — every scenario has an explicit sustained-monitoring window.
---
### NFT-RES-LIM-Re1: Combined onboard RSS ≤ 6 GB sustained
**Summary**: Combined process RSS on the deployed compute device for everything autopilot owns onboard (excluding Tier 1) MUST stay ≤ 6 GB throughout a 5-minute steady-state window with the full onboard workload active.
**Traces to**: AC `Resources & Data — Combined RSS on the deployed compute device, for everything autopilot owns onboard (excluding Tier 1), MUST stay within ≤ 6 GB / Re1`, RESTRICT `Hardware — Compute device: Jetson Orin Nano Super, 8 GB shared LPDDR5; Tier 1 consumes ~2 GB, leaving ~6 GB for autopilot`.
**Tier**: HW (representative Jetson Orin Nano Super) — pure-x86 reports informational only and does NOT satisfy the project-level Acceptance Gate.
**Preconditions**:
- Full onboard workload active: frame ingest from `rtsp-loopback`, Tier-2 + Tier-3 (when enabled) inferring at the documented steady-state load, gimbal commands flowing, MAVLink stream consumed at 10 Hz, operator-stream connected, MapObjects store hydrated for a 30×30 km region.
- Warm-up: 60 s before measurement starts (any first-load model warm-up complete).
- Tier-1 process is RUNNING in parallel but its RSS is EXCLUDED from the measurement (the AC scope is autopilot-owned RSS, excluding Tier 1).
**Monitoring**:
- Cgroup-level RSS for every process the SUT owns (the SUT binary plus any child processes it spawns — e.g., the VLM IPC peer if it lives in autopilot's cgroup), sampled at 1 Hz.
- Cgroup-level RSS for Tier 1 sampled at the same cadence (for the Re2 cross-reference).
- Per-process RSS captured to `reports/<run-id>/rss-trace.csv` for forensic review on failure.
**Duration**: 5 minutes of measurement after warm-up.
**Pass criteria**:
- `threshold_max`: per 1 s sample, `sum(autopilot_owned_RSS) ≤ 6 GB`.
- No single 1 s sample exceeds the ceiling.
- (Reporting only — not pass/fail): peak RSS, mean RSS, P95 RSS recorded in the CSV report.
**Test status**: DEFERRED — `<DEFERRED: long-running scenario harness exercising the full onboard workload for 5 min; inline-authorable but requires that the SUT be operational end-to-end first>`.
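The per-sample ceiling check and the reporting-only aggregates can be sketched as below. Assumptions: "6 GB" is read as 6 GiB, each 1 Hz sample is a mapping of autopilot-owned process name to RSS in bytes (Tier 1 already excluded), and the function name is hypothetical.

```python
CEILING_BYTES = 6 * 1024 ** 3  # assumption: "6 GB" interpreted as 6 GiB


def re1_report(samples):
    """samples: list of {process_name: rss_bytes}, one dict per 1 s sample."""
    totals = [sum(sample.values()) for sample in samples]
    return {
        # Pass only if no single sample exceeds the ceiling.
        "passed": all(total <= CEILING_BYTES for total in totals),
        "peak_bytes": max(totals),
        "mean_bytes": sum(totals) / len(totals),
    }
```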
---
### NFT-RES-LIM-Re2: Tier-1 non-degradation under autopilot workload
**Summary**: When autopilot's full onboard workload runs concurrently with Tier 1 on the same Jetson, Tier-1 per-frame latency MUST NOT degrade by more than ± 5 ms versus the Tier-1-alone baseline (recorded by NFT-PERF-L1).
**Traces to**: AC `Resources & Data — Tier 1 per-frame latency MUST NOT degrade by more than ± 5 ms when autopilot's own onboard workload is running concurrently / Re2`, RESTRICT `Tier 1 (YOLO) and any local large model with GPU memory pressure share the Jetson GPU — only one of them may execute at any wall-clock instant`.
**Tier**: HW (the only meaningful environment for this assertion — GPU contention behaviour does not reproduce on x86).
**Preconditions**:
- NFT-PERF-L1 has been run on the same HW configuration in the SAME session and a baseline `tier1_baseline_p95_ms` recorded.
- Full onboard workload active (same as Re1).
**Monitoring**:
- Tier-1 per-frame latency sampled per frame for the duration of the test.
- The same metric source as NFT-PERF-L1 — for direct delta comparison.
**Duration**: 5 minutes of measurement after warm-up (matches Re1 window so both can run in the same session).
**Pass criteria**:
- `numeric_tolerance`: `|p95(tier1_with_autopilot) - tier1_baseline_p95_ms| ≤ 5 ms`.
- (Reporting only): mean, P95, max delta over the window.
**Test status**: DEFERRED — same fixture dependency as Re1; requires SUT operational + Tier 1 colocated on HW.
---
### NFT-RES-LIM-Storage: On-device persistent store stays under 95 % for in-flight operation
**Summary**: During a steady-state mission run (no abnormal load), the on-device persistent store MUST NOT exceed 95 % full. This protects the takeoff gate (R3) from being silently violated mid-mission and protects the post-flight push (Mp4) from running out of room to persist a failed diff.
**Traces to**: AC `Reliability & Safety — On-device storage MUST be bounded` (via R3 BIT gate), RESTRICT `On-device storage MUST be bounded`.
**Tier**: B + HW.
**Preconditions**:
- SUT mid-flight; persistent store at typical post-takeoff utilisation (e.g. 30 %).
- Normal-operation event volume: telemetry persistence, ignored-item appends, pending map-diff buffer (empty in this scenario).
**Monitoring**:
- Volume utilisation sampled at 10 Hz throughout the duration.
**Duration**: 60 minutes (representative mission duration per Mp3).
**Pass criteria**:
- `threshold_max`: `volume_used / volume_total ≤ 0.95` at every sample point.
- On approach to 85 %: structured-log INFO `storage_pressure` with current utilisation.
- On approach to 90 %: structured-log WARN with current utilisation; health.storage transitions to yellow.
- On 95 %: health.storage transitions to red; the SUT begins its documented eviction policy (this scenario does NOT test the policy semantics — that belongs to its own scenario; this scenario only asserts the policy IS triggered).
**Test status**: READY (no external fixture beyond the SUT itself; the persistent-store seed file controls starting utilisation).
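The threshold ladder in the pass criteria can be sketched as a mapping from utilisation to the expected signals. This is a sketch with stated assumptions: health is green below 90 %, and the log level at 95 % is taken to be WARN (the spec names only the health transition there).

```python
def storage_signals(used_bytes, total_bytes):
    """Return (health, log_level) expected at a given utilisation.
    Integer cross-multiplication avoids float rounding at the boundaries."""
    if used_bytes * 100 >= 95 * total_bytes:
        return ("red", "WARN")      # log level here is an assumption
    if used_bytes * 100 >= 90 * total_bytes:
        return ("yellow", "WARN")
    if used_bytes * 100 >= 85 * total_bytes:
        return ("green", "INFO")    # storage_pressure entry
    return ("green", None)
```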
---
### NFT-RES-LIM-CPU: CPU headroom for the Tier-1 colocation guarantee
**Summary**: Combined CPU utilisation of every autopilot-owned process MUST leave enough Jetson CPU headroom for Tier 1 to keep its NFT-PERF-L1 budget. Concretely: per-second sustained CPU usage by autopilot-owned processes MUST stay ≤ the configured budget (default 60 % of total CPU cycles measured at the cgroup level) for the duration of the run.
**Traces to**: AC `Resources & Data — Tier 1 per-frame latency MUST NOT degrade by more than ± 5 ms / Re2` (CPU-side mechanism backing Re2), RESTRICT `Hardware — Jetson Orin Nano Super`.
**Tier**: HW (CPU contention does not reproduce on x86).
**Preconditions**:
- Same workload as Re1 + Re2.
**Monitoring**:
- Cgroup CPU usage at 1 Hz.
**Duration**: 5 minutes after warm-up.
**Pass criteria**:
- `threshold_max`: per 1 s sample, `sum(autopilot_cpu_usage) ≤ 60 %` of total CPU.
- Reporting: mean, P95, max.
**Test status**: DEFERRED — same dependency as Re1/Re2.
---
### NFT-RES-LIM-GPU: GPU mutual exclusion contract (Tier 1 vs local large model)
**Summary**: Per RESTRICT (`Tier 1 (YOLO) and any local large model with GPU memory pressure share the Jetson GPU — only one of them may execute at any wall-clock instant`), the SUT MUST NOT issue a GPU compute call (e.g. Tier-3 VLM inference) while Tier 1 is executing on the GPU. The serialisation MUST be observable: a single GPU is busy at one instant.
**Traces to**: RESTRICT `Tier 1 and any local large model … only one of them may execute at any wall-clock instant`.
**Tier**: HW.
**Preconditions**:
- Tier 1 active; SUT in a ZoomedIn hold with deep-analysis enabled (Tier-3 will fire).
**Monitoring**:
- GPU-instance occupancy via `tegrastats` / equivalent at the highest available sampling rate.
- The SUT's own internal "compute-class" telemetry exposed on the health endpoint as `gpu_owner_current` ∈ { "tier1", "tier3", "idle" }.
**Duration**: 60 s containing ≥ 5 Tier-3 hold cycles.
**Pass criteria**:
- `exact`: at every sample point, `gpu_owner_current ∈ { "tier1", "tier3", "idle" }`; never simultaneously both.
- `tegrastats` peak GPU occupancy attributable to autopilot processes never overlaps Tier 1's known activity window for the same wall-clock instant.
**Test status**: DEFERRED — depends on the SUT being operational end-to-end + Tier-3 enabled; also depends on the SUT exposing `gpu_owner_current` (which is an architectural choice not yet locked).
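The mutual-exclusion assertion reduces to an interval-overlap check between Tier-1 and Tier-3 GPU activity windows. This is an illustrative sketch; it assumes the windows have already been derived from `tegrastats` or similar as half-open `(start, end)` pairs, which is not something the spec defines.

```python
def overlapping_windows(tier1_windows, tier3_windows):
    """Return every (tier1, tier3) pair of half-open (start, end)
    windows that share at least one instant."""
    return [
        (a, b)
        for a in tier1_windows
        for b in tier3_windows
        if a[0] < b[1] and b[0] < a[1]
    ]


def gpu_exclusion_holds(tier1_windows, tier3_windows):
    return not overlapping_windows(tier1_windows, tier3_windows)
```

Back-to-back windows such as `(0, 10)` and `(10, 20)` do not count as overlapping under the half-open convention.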
---
### NFT-RES-LIM-FileHandles: File-descriptor and socket bound
**Summary**: Sustained operation MUST NOT leak file descriptors or sockets. The count MUST stay within a documented headroom of the initial-post-warmup baseline for the duration of the run.
**Traces to**: RESTRICT `On-device storage MUST be bounded` (general bounded-resource principle), security principle `No silent error swallowing for security-relevant failures` (FD exhaustion would silently break the operator-stream).
**Tier**: B + HW.
**Preconditions**:
- Warm-up: 60 s.
- Workload: full onboard workload at steady state.
**Monitoring**:
- `/proc/<pid>/fd` count per autopilot process at 1 Hz.
**Duration**: 60 minutes.
**Pass criteria**:
- `threshold_max`: at every sample point, `fd_count ≤ fd_baseline_post_warmup + 50` (50 = documented churn headroom for intermittent operator reconnects).
- A monotonically rising trend (slope > 0 over the run) is a TEST FAILURE even if the absolute ceiling is not breached.
**Test status**: READY for a Tier-B run; gains its real value once HW + sustained-workload land.
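Both pass criteria (absolute ceiling and no rising trend) can be sketched over the 1 Hz FD-count trace. This sketch assumes the trend is measured as the ordinary least-squares slope of count against sample index; the `slope_eps` parameter is a hypothetical knob that defaults to the literal "slope > 0" rule.

```python
def fd_verdict(counts, baseline, headroom=50, slope_eps=0.0):
    """counts: FD count per 1 s sample after warm-up."""
    if max(counts) > baseline + headroom:
        return "fail: ceiling"
    n = len(counts)
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    num = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(counts))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den if den else 0.0
    return "fail: rising trend" if slope > slope_eps else "pass"
```

Applied literally, any positive slope fails, including one caused by noise in a short trace; a real harness may want a small non-zero `slope_eps`.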
---
## Common assertions for every resource-limit scenario
- **Sustained-monitoring is non-negotiable.** Each scenario specifies a duration ≥ 60 s; short bursts that pass do not satisfy the test. The CSV report records the full sample trace path under `artifacts_path`.
- **No silent eviction.** Where a ceiling is approached, the SUT MUST surface the pressure (structured-log INFO at 85 %, WARN at 90 %, transition to yellow/red on health) BEFORE reaching the ceiling. A pass with no observable pressure signal at thresholds is a TEST FAILURE.
- **HW reporting vs gating.** Pure-x86 runs report informational deltas only; they do NOT satisfy the project-level Acceptance Gate. Every CSV row records its tier so this distinction stays auditable.
- **Re1 + Re2 are paired.** Re1 establishes the autopilot RSS ceiling; Re2 establishes that respecting Re1 does not cost Tier 1 latency. They MUST be run in the same session to make the Re2 baseline meaningful.
@@ -0,0 +1,215 @@
# Security Tests
Authored by `/test-spec` Phase 2 (2026-05-19). Security tests validate blackbox-observable security properties derived from `_docs/00_problem/security_approach.md` and the AC operator-command rules. Code-level vulnerability scanning is out of scope at this layer (see deploy-time security audit `Step 14` of the autodev flow).
Each scenario observes the SUT through its public surfaces only; pass criteria assert that an attack attempt produces no state change AND surfaces a structured-log entry / health signal — silent rejection is a test failure.
---
### NFT-SEC-O9: Operator-command replay protection
**Summary**: An operator command envelope replayed within (or outside) the replay-protection window MUST be rejected; system state MUST NOT change; security WARN logged with reason `replay`.
**Traces to**: AC `Operator Workflow — A replayed or unsigned operator command MUST be rejected with a logged security warning / O9`, security principle `Operator commands MUST be authenticated, signed, and replay-protected`.
**Tier**: B + E.
**Preconditions**:
- SUT in steady state; a prior valid operator-confirm envelope already accepted.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | Capture the valid envelope from the prior FT-P-016 run | envelope captured (sequence_id S, timestamp T) |
| 2 | Replay the exact same envelope a second time | SUT rejects at the boundary; no `POST /missions/{id}` observed; no mode change; counter `operator_cmd_rejected_replay_total` += 1; structured-log WARN with `reason: "replay"`, `sequence_id: S`, `originating_envelope_id` recorded |
| 3 | Replay an envelope with sequence_id S but timestamp T+window+1s (outside replay window) | rejected as expired; counter `operator_cmd_rejected_expired_total` += 1; structured-log WARN reason `expired` |
**Pass criteria**: `exact (state unchanged)` AND `substring (log contains "replay")` for step 2; `exact (state unchanged)` AND `substring (log contains "expired")` for step 3.
**Test status**: DEFERRED — `<DEFERRED: operator-envelopes (replayed) fixture; services.md §8 — blocked on Q9 operator-command auth scheme>`. Until Q9 resolves, this scenario asserts only that a duplicate envelope at the byte level is rejected (placeholder behaviour); the full replay-window semantics land with Q9.
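The two rejection reasons in steps 2 and 3 can be illustrated with a toy replay guard. This is purely a sketch: the real scheme is blocked on Q9, so the class, its fields, the check order (expiry before replay) and the window semantics (envelope age relative to receipt time) are all assumptions, and signature checking is omitted.

```python
class ReplayGuard:
    """Toy model: reject envelopes older than the window, then reject
    any sequence_id that has already been accepted."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.seen = set()

    def check(self, sequence_id, timestamp_s, now_s):
        if abs(now_s - timestamp_s) > self.window_s:
            return "expired"
        if sequence_id in self.seen:
            return "replay"
        self.seen.add(sequence_id)
        return "accepted"
```

A rejected envelope never mutates the guard, which mirrors the "state unchanged" half of the pass criteria.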
---
### NFT-SEC-O10: Operator-command signature validation
**Summary**: A malformed / unsigned operator command MUST be rejected with `reason: "invalid"`; state MUST NOT change.
**Traces to**: AC `O10`, security principle `Operator commands MUST be authenticated, signed, and replay-protected`.
**Tier**: B + E.
**Preconditions**:
- SUT in steady state.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | Send a malformed envelope (signature bytes flipped) | rejected; no state change; counter `operator_cmd_rejected_signature_total` += 1; structured-log WARN reason `invalid_signature` |
| 2 | Send an UNSIGNED envelope (signature field absent / zero) | rejected; counter increments; structured-log WARN reason `unsigned` |
| 3 | Send a well-formed envelope but signed with a key NOT in the operator's authorised set | rejected; counter increments; reason `unauthorised_signer` |
| 4 | Send a valid envelope (control case) | accepted; state changes as per the command type |
**Pass criteria**: steps 1–3 all `exact (state unchanged)` + `substring (log contains "invalid"|"unsigned"|"unauthorised")`; step 4 succeeds normally.
**Test status**: DEFERRED — `<DEFERRED: operator-envelopes (malformed / unsigned / wrong-key); blocked on Q9>`.
---
### NFT-SEC-CraftedFrame: Crafted RTSP frame → no decoder OOM / no crash
**Summary**: A crafted H.264/265 frame (oversize SPS, malformed NAL, truncated slice) MUST NOT crash or hang the SUT and MUST NOT consume unbounded memory. Frame is dropped with a counter increment.
**Traces to**: security principle `Bounded input for any model call`, RESTRICT `On-device storage / RSS budgets`.
**Tier**: B.
**Preconditions**:
- SUT in normal sweep mode; `rtsp-loopback` switched to a corpus of crafted clips.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | Stream a fuzzed clip corpus (≥ 100 crafted frames) | each crafted frame dropped at decode; counter `frame_decode_error_total` increments per drop; structured-log WARN with `reason: "decode_error"` |
| 2 | Observe SUT process | RSS does NOT exceed 1.2 × baseline; no crash; no hang; gimbal & operator-stream still responsive within their normal latency budgets |
**Pass criteria**: `exact (no crash)`; `threshold_max (RSS ≤ 1.2 × baseline)`; counter consistent with crafted-frame count.
**Test status**: READY (crafted-clip corpus authorable inline using afl++ / honggfuzz output against a vanilla H.264 decoder; corpus stored in `e2e/consumer/fixtures/fuzzed_clips/`).
---
### NFT-SEC-OversizeCrop: Bounded crop enforcement
**Summary**: An attempt to submit an oversize ROI crop (above the configured max bytes or outside the format allow-list) to any onboard model entry point MUST be rejected at the boundary; downstream models MUST NOT be invoked.
**Traces to**: security principle `Bounded input for any model call`.
**Tier**: B.
**Preconditions**:
- SUT with Tier-2 + Tier-3 enabled.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | Submit a 5000 × 5000 PNG (above the configured 1024 × 1024 cap) to the Tier-2 ROI entry | rejected; Tier-2 inference NOT invoked (verified via `tier2_inference_total` counter unchanged); structured-log WARN `reason: "roi_too_large"` |
| 2 | Submit a BMP (not in the allow-list) | rejected; reason `roi_format_not_allowed` |
| 3 | Submit a well-formed 640×640 JPEG (control) | accepted; Tier-2 invoked normally |
**Pass criteria**: `exact (downstream model not invoked)` for steps 1–2; `exact (downstream invoked)` for step 3.
**Test status**: READY (oversize PNG + BMP generated inline).
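
The boundary check exercised by this scenario could look roughly like the sketch below. It is an illustration under stated assumptions, not the SUT's implementation: `RoiFormat`, `RoiPolicy`, `RoiRejection`, and `check_roi` are hypothetical names, and the allow-list contents are assumed (the scenario only establishes that BMP is outside it and that a JPEG control is accepted).

```rust
/// Formats a submitted crop may declare. Which ones are allowed is policy,
/// not part of this type (the allow-list contents here are an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RoiFormat {
    Jpeg,
    Png,
    Bmp,
}

/// Why a crop was refused at the boundary.
#[derive(Debug, PartialEq, Eq)]
pub enum RoiRejection {
    TooLarge,
    FormatNotAllowed,
}

impl RoiRejection {
    /// Reason codes as they appear in the scenario's structured-log WARN.
    pub fn reason(&self) -> &'static str {
        match self {
            RoiRejection::TooLarge => "roi_too_large",
            RoiRejection::FormatNotAllowed => "roi_format_not_allowed",
        }
    }
}

/// Hypothetical policy object holding the configured caps.
pub struct RoiPolicy {
    pub max_width: u32,
    pub max_height: u32,
    pub max_bytes: usize,
    pub allowed: &'static [RoiFormat],
}

/// Returns Ok(()) only when the crop may be handed to a model entry point.
/// The caller must not invoke any downstream model on Err.
pub fn check_roi(
    policy: &RoiPolicy,
    format: RoiFormat,
    width: u32,
    height: u32,
    bytes: usize,
) -> Result<(), RoiRejection> {
    if !policy.allowed.contains(&format) {
        return Err(RoiRejection::FormatNotAllowed);
    }
    if width > policy.max_width || height > policy.max_height || bytes > policy.max_bytes {
        return Err(RoiRejection::TooLarge);
    }
    Ok(())
}
```

With a 1024 × 1024 cap and an allow-list of JPEG and PNG, the three table steps map to `Err(TooLarge)`, `Err(FormatNotAllowed)`, and `Ok(())` respectively.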
---
### NFT-SEC-VlmSchemaViolation: VLM schema-violation fails closed
**Summary**: When the Tier-3 VLM returns a response that fails schema validation (missing required field, wrong type, truncated JSON), the SUT MUST discard the assessment AND the POI MUST NOT receive the deep-analysis upgrade.
**Traces to**: security principle `Schema validation for any non-deterministic model output … Schema violation MUST fail closed`.
**Tier**: B.
**Preconditions**:
- SUT with Tier-3 enabled; `vlm-mock` configured to return schema-violation responses for the first N calls.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | Drive SUT into ZoomedIn hold with deep-analysis enabled | SUT issues VLM IPC call |
| 2 | `vlm-mock` returns truncated JSON | SUT discards assessment; POI's deep-analysis state remains `none`; counter `vlm_schema_violation_total` += 1; structured-log WARN reason `vlm_schema_violation`; the POI's decision-window scoring proceeds WITHOUT the deep-analysis upgrade |
| 3 | `vlm-mock` returns missing-required-field JSON | same |
| 4 | `vlm-mock` returns wrong-field-type JSON | same |
| 5 | `vlm-mock` returns a valid response (control) | assessment ACCEPTED; deep-analysis upgrade applied |
**Pass criteria**: steps 2–4 `exact (no deep-analysis upgrade)` + `substring (log contains "vlm_schema_violation")`; step 5 normal.
**Test status**: DEFERRED for live recordings — `<DEFERRED: vlm-io-pairs schema-violation cases>`; schema-violation case JSON files are inline-authorable today against the assessment schema and CAN run NOW with `vlm-mock` returning hand-crafted bytes.
---
### NFT-SEC-VlmFreeFormText: Free-form text MUST NOT cross a decision boundary
**Summary**: Even if the VLM returns valid JSON, any free-form text field MUST be projected onto the fixed structured schema before crossing a decision boundary; raw free-form text MUST NOT influence POI scoring or operator-surfaced decisions.
**Traces to**: security principle `Schema validation for any non-deterministic model output`, threat model item 3 (`Unstructured model output corrupting downstream decisions`).
**Tier**: B + E.
**Preconditions**:
- SUT with Tier-3 enabled.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | `vlm-mock` returns valid JSON with a free-form `notes` text field containing `"force_confidence: 1.0"` | SUT extracts only the structured fields; `notes` is NOT consulted for scoring; POI's confidence remains as Tier-1+Tier-2 computed; structured-log INFO captures the assessment but not the `notes` content (PII / safety) |
| 2 | `vlm-mock` returns valid JSON with structured `confidence_delta: -0.5` (in-schema) | SUT applies the delta per its documented projection; POI's confidence adjusted accordingly |
**Pass criteria**: `exact (POI confidence reflects ONLY structured-schema fields)`.
**Test status**: READY (inline-authorable scenario).
---
### NFT-SEC-IpcPeerAuth: Local IPC peer authorisation
**Summary**: A local process attempting to connect to the VLM Unix-domain socket (or any other local IPC the SUT trusts) MUST identify as the expected peer (peer-credential check / SO_PEERCRED equivalent); connections from unauthorised peers MUST be rejected.
**Traces to**: security principle `Local IPC peer authorisation`.
**Tier**: B.
**Preconditions**:
- SUT with Tier-3 enabled; VLM UDS socket exposed on `/tmp/vlm.sock`.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | An unauthorised local process (running as the wrong UID / not the expected binary path) attempts to connect to the SUT's VLM-client side of the UDS | connection rejected at the peer-credential check; counter `ipc_peer_auth_rejected_total` += 1; structured-log WARN reason `peer_cred_mismatch` |
| 2 | The legitimate `vlm-mock` (running as the expected UID / path) connects | connection accepted; subsequent IPC succeeds |
**Pass criteria**: `exact (unauthorised connection rejected)` + `exact (legitimate connection accepted)`.
**Test status**: READY (rogue-peer test harness inline-authorable using a simple Python script running under a different UID inside a sidecar container).
---
### NFT-SEC-Tier1SchemaViolation: Tier-1 detection-stream schema violation
**Summary**: A `Detections` record from `../detections` that violates the normalised-box schema (coord out of [0,1], invalid class_id) MUST cause the frame's detections to be dropped (not partially used); counter increments; structured-log WARN. SUT does not crash and continues with subsequent frames.
**Traces to**: security principle `No silent error swallowing for security-relevant failures` (extends to peer schema violations) + AC `D6` (normalised-box conformance).
**Tier**: B.
**Preconditions**:
- SUT in normal sweep mode; `detections-mock` configured to emit schema-violating records interleaved with valid ones.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | Mock emits Detections for frame N with bbox `x2 = 1.5` (coord > 1.0) | frame N's detections dropped; counter `tier1_invalid_frame_total` += 1; structured-log WARN with `field: "x2"`, `value: 1.5` |
| 2 | Mock emits Detections for frame N with `class_id = 99` (not in 0..18) | dropped; reason `class_id_out_of_range` |
| 3 | Mock emits valid Detections for frame N+1 | processed normally |
**Pass criteria**: `exact (no operator-stream emission for frame N)` + `exact (counter incremented per dropped frame)`.
**Test status**: READY (inline-authorable injection by `detections-mock`).
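
The all-or-nothing drop rule can be sketched as a validator that fails the whole frame on the first violating record. This is a minimal sketch, not the SUT's code: the `Detection` struct, its field names other than `x2`, and `validate_frame` are hypothetical; the ranges (coords and confidence in [0, 1], class ids 0..18) come from the scenario and the data validation rules.

```rust
/// One normalised-box record. Field names other than `x2` are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct Detection {
    pub class_id: u32,
    pub confidence: f32,
    pub x1: f32,
    pub y1: f32,
    pub x2: f32,
    pub y2: f32,
}

/// First violation found in a frame; carries what the WARN log needs.
#[derive(Debug, PartialEq)]
pub enum FrameRejection {
    CoordOutOfRange { field: &'static str, value: f32 },
    ConfidenceOutOfRange(f32),
    ClassIdOutOfRange(u32),
}

/// Highest valid class id in the 19-class catalogue (0..18).
pub const MAX_CLASS_ID: u32 = 18;

/// True for finite values inside the closed unit interval.
fn in_unit_interval(v: f32) -> bool {
    v.is_finite() && (0.0..=1.0).contains(&v)
}

/// Ok(()) means every record conforms. On Err the caller drops the entire
/// frame's detections; nothing from the frame is partially used.
pub fn validate_frame(detections: &[Detection]) -> Result<(), FrameRejection> {
    for d in detections {
        for (field, value) in [("x1", d.x1), ("y1", d.y1), ("x2", d.x2), ("y2", d.y2)] {
            if !in_unit_interval(value) {
                return Err(FrameRejection::CoordOutOfRange { field, value });
            }
        }
        if !in_unit_interval(d.confidence) {
            return Err(FrameRejection::ConfidenceOutOfRange(d.confidence));
        }
        if d.class_id > MAX_CLASS_ID {
            return Err(FrameRejection::ClassIdOutOfRange(d.class_id));
        }
    }
    Ok(())
}
```

Returning the offending field and value in the error keeps the structured-log WARN (`field: "x2"`, `value: 1.5`) a direct projection of the validation result.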
---
### NFT-SEC-MavlinkUnsigned: Optional MAVLink-2 signing enforcement
**Summary**: When MAVLink-2 message signing is configured ON (per Q6 once resolved), unsigned messages on the airframe link MUST be dropped with a security WARN; signed messages flow normally. When signing is OFF (current default until Q6), no signing assertion runs.
**Traces to**: security principle `Airframe MAVLink integrity` (Q6).
**Tier**: B + E.
**Preconditions**:
- SUT configured with MAVLink-2 signing ENABLED (test profile).
- `mavlink-sitl` configured to send a mix of signed and unsigned messages.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | `mavlink-sitl` sends a valid signed message | accepted; processed normally |
| 2 | `mavlink-sitl` sends an unsigned message | dropped; counter `mavlink_unsigned_dropped_total` += 1; structured-log WARN reason `mavlink_unsigned`; airframe-link health unaffected for an isolated drop |
| 3 | Sustained unsigned-only stream | airframe-link health flips red after the configured tolerance window (same threshold as R7 retry exhaustion) |
**Pass criteria**: `exact (unsigned dropped)` + `exact (signed accepted)`; sustained-unsigned escalates per the documented threshold.
**Test status**: DEFERRED — `<DEFERRED: Q6 (MAVLink-2 message signing decision)>`. When Q6 lands and signing is mandated, this scenario becomes READY.
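
For orientation, the drop decision in this scenario can be approximated from the MAVLink 2 frame header alone: the start byte is `0xFD`, the third byte carries the incompatibility flags, and bit `0x01` marks a signed frame, which then carries a 13-byte signature trailer. The sketch below checks only that a signature is present and that the frame length is consistent; it does NOT verify the signature or the checksum, and `signing_gate` and `SigningVerdict` are hypothetical names.

```rust
pub const MAVLINK_V2_STX: u8 = 0xFD;
pub const MAVLINK_IFLAG_SIGNED: u8 = 0x01;

const HEADER_LEN: usize = 10; // STX, len, incompat, compat, seq, sysid, compid, msgid (3 bytes)
const CHECKSUM_LEN: usize = 2;
const SIGNATURE_LEN: usize = 13; // link id (1) + timestamp (6) + signature (6)

#[derive(Debug, PartialEq, Eq)]
pub enum SigningVerdict {
    Pass,
    DropUnsigned,
    DropMalformed,
}

/// Presence-only gate. A real implementation would also verify the
/// signature bytes and the checksum before accepting the frame.
pub fn signing_gate(frame: &[u8], signing_required: bool) -> SigningVerdict {
    if frame.len() < HEADER_LEN + CHECKSUM_LEN || frame[0] != MAVLINK_V2_STX {
        return SigningVerdict::DropMalformed;
    }
    if !signing_required {
        // Signing OFF: no signing assertion runs.
        return SigningVerdict::Pass;
    }
    let signed = (frame[2] & MAVLINK_IFLAG_SIGNED) != 0;
    if !signed {
        return SigningVerdict::DropUnsigned;
    }
    let payload_len = frame[1] as usize;
    let expected_len = HEADER_LEN + payload_len + CHECKSUM_LEN + SIGNATURE_LEN;
    if frame.len() == expected_len {
        SigningVerdict::Pass
    } else {
        SigningVerdict::DropMalformed
    }
}
```

A `DropUnsigned` verdict corresponds to step 2 of the table (counter `mavlink_unsigned_dropped_total` and a WARN with reason `mavlink_unsigned`); the sustained-stream escalation in step 3 is a separate concern layered on top.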
---
### NFT-SEC-HealthExposesSecurity: Health endpoint surfaces security state
**Summary**: The `/health` endpoint MUST reflect security state: the rates of operator-command signature failures, peer-credential mismatches, and schema violations MUST all be visible to ops.
**Traces to**: security principle `Health endpoint MUST reflect security state`.
**Tier**: B.
**Preconditions**:
- SUT in steady state; counters baselined.
| Step | Consumer Action | Expected Response |
|---|---|---|
| 1 | Drive sustained signature-failure rate (10 / s) for 10 s via the NFT-SEC-O10 flow | `GET /health` exposes a `security` sub-object that includes `operator_cmd_rejected_signature_rate_60s` non-zero; if rate exceeds the configured alert threshold, the security sub-object transitions to yellow |
| 2 | Drive sustained peer-credential-mismatch attempts (1 / s) for 60 s via NFT-SEC-IpcPeerAuth | `security.ipc_peer_auth_rejected_rate_60s` non-zero; transitions to yellow at threshold |
| 3 | Drive sustained Tier-1 schema-violation rate (1 / s) via NFT-SEC-Tier1SchemaViolation | `security.tier1_invalid_rate_60s` non-zero |
**Pass criteria**: `exact (health.security exposes each rate)` + `exact (transition to yellow at threshold)`.
**Test status**: READY.
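
The `*_rate_60s` fields suggest a sliding-window rate. One possible shape is sketched below, assuming the rate is "events in the last 60 s divided by 60" and that yellow is entered when the rate strictly exceeds the threshold; neither detail is confirmed by the spec, and `RateWindow`, `HealthLevel`, and `security_level` are hypothetical names. Timestamps are passed in explicitly (as offsets from an arbitrary monotonic epoch) so the sketch stays deterministic under test.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Sliding window over event timestamps.
pub struct RateWindow {
    window: Duration,
    events: VecDeque<Duration>,
}

impl RateWindow {
    pub fn new(window: Duration) -> Self {
        RateWindow { window, events: VecDeque::new() }
    }

    /// Record one rejected event observed at `now`.
    pub fn record(&mut self, now: Duration) {
        self.evict(now);
        self.events.push_back(now);
    }

    /// Events per second averaged over the whole window.
    pub fn rate_per_sec(&mut self, now: Duration) -> f64 {
        self.evict(now);
        self.events.len() as f64 / self.window.as_secs_f64()
    }

    /// Drop events that are at least one full window old.
    fn evict(&mut self, now: Duration) {
        while let Some(&oldest) = self.events.front() {
            if now.saturating_sub(oldest) >= self.window {
                self.events.pop_front();
            } else {
                break;
            }
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum HealthLevel {
    Green,
    Yellow,
}

/// Yellow once the rate exceeds the configured alert threshold.
pub fn security_level(rate: f64, alert_threshold: f64) -> HealthLevel {
    if rate > alert_threshold {
        HealthLevel::Yellow
    } else {
        HealthLevel::Green
    }
}
```

Under these assumptions, step 1 (10 failures per second for 10 s) yields 100 events in the window, so a rate of 100 / 60 per second at the end of the burst, decaying to zero once the burst is more than 60 s old.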
---
## Out of scope at this layer
Per `security_approach.md → "Out of scope"`, the following are NOT covered by blackbox security tests because they are owned elsewhere in the suite:
- Modem-link encryption setup (radio layer below autopilot).
- Suite-wide TLS / certificate provisioning (suite-level deployment, `../_infra/`).
- OTA update signing (Watchtower; autopilot consumes signed images only). Boot-time self-check + rollback is Q10 — when it lands, it becomes a new scenario here.
- Annotation / training-data security (`../ai-training` repo).
- Operator browser UI auth (Ground Station owns it; only the modem-side handshake is jointly specified per Q9, covered by O8/O9/O10).
- Multi-operator session policy (Q11 — when it lands, becomes a new scenario here).
## Common assertions
- **No silent rejection.** Every rejected security event MUST produce both a counter increment AND a structured-log entry at WARN+. A rejection that occurs silently is a TEST FAILURE.
- **Fail-closed everywhere.** When an authentication / signature / schema check is uncertain, the SUT MUST fail closed (reject) rather than fail open. Tests assert this by sending borderline / ambiguous inputs and checking for rejection.
- **No information leak in error paths.** Error responses (where the SUT exposes any to the operator-stream or health endpoint) MUST NOT leak the rejected payload contents beyond the minimum needed for ops to triage. Tests inspect log/health output for absence of crafted-payload byte sequences.
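
Two of these assertions lend themselves to small consumer-side helpers. The sketch below assumes the test harness has already scraped a counter value before and after the action and parsed the structured log into entries; `LogEntry`, `Level`, `rejection_is_observable`, and `leaks_payload` are hypothetical names, not part of any existing harness.

```rust
/// Log severities in ascending order, so `>= Level::Warn` means WARN+.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Debug,
    Info,
    Warn,
    Error,
}

/// Minimal view of one parsed structured-log entry.
pub struct LogEntry {
    pub level: Level,
    pub reason: String,
}

/// "No silent rejection": the counter moved AND a WARN+ entry with the
/// expected reason exists. Either one alone is a test failure.
pub fn rejection_is_observable(
    counter_before: u64,
    counter_after: u64,
    logs: &[LogEntry],
    reason: &str,
) -> bool {
    let counted = counter_after > counter_before;
    let logged = logs
        .iter()
        .any(|e| e.level >= Level::Warn && e.reason == reason);
    counted && logged
}

/// "No information leak": true if the crafted payload's byte sequence
/// appears anywhere in the captured log or health output.
pub fn leaks_payload(output: &[u8], payload: &[u8]) -> bool {
    !payload.is_empty() && output.windows(payload.len()).any(|w| w == payload)
}
```

A scenario would then assert `rejection_is_observable(...)` and `!leaks_payload(...)` after each rejected input.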
@@ -0,0 +1,152 @@
# Test Data Management
Authored by `/test-spec` Phase 2 (2026-05-19). Owns the **mapping** from fixtures to tests, mock data shapes, isolation strategy, and the deferred-fixture inventory bridge.
- Per-row input-to-expected-result binding lives in `_docs/00_problem/input_data/expected_results/results_report.md` — this file references it but never duplicates it.
- Fixture manifest (SHA-pinned files + provenance) lives in `_docs/00_problem/input_data/fixtures/README.md`.
- Per-service mock catalogue (what shape each mock returns) lives in `_docs/00_problem/input_data/services.md`.
- Deferred fixture inventory + replay obligation lives in `_docs/_process_leftovers/2026-05-19_autopilot_test_fixtures.md`.
## Seed data sets
| Data Set | Description | Used by Tests | How Loaded | Cleanup |
|---|---|---|---|---|
| `image-set-existing` | `fixtures/images/{4d6e1830d211ad50,54f6459dbddb93d8,6dd601b7d2dc1b30,805bcf1e9f271a58,f997d0934726b555}.jpg` — 5 aerial frames | FT-P-Tier1Contract, NFT-PERF-L1, NFT-PERF-L2, FT-P-DetectExisting | mounted read-only via `fixtures-ro:/fixtures` on `rtsp-loopback` (encoded to `.mp4` clip) and on `detections-mock` (paired with `expected_detections.json` per frame) | volume detached on container teardown |
| `video-recon` | `fixtures/videos/94d42580bd1ad6ff.mp4` | NFT-PERF-T3 | mounted read-only on `rtsp-loopback`; consumer requests stream at 30 fps then throttles decode + drops frames per scenario script | as above |
| `video-movement` | `fixtures/movement/video0[1-4].mp4` (4 wide-area clips) | FT-P-MoveStarter (visual reference only), FT-P-MoveBenchmark (deferred) | mounted on `rtsp-loopback`; played at 30 fps; consumer schedules which clip per scenario | as above |
| `image-semantic-starter` | `fixtures/semantic/semantic0[1-4].png` (1 winter + 3 unmarked season) | FT-P-ConcealStarter, FT-P-FootpathStarter (visual reference only; assertion semantics deferred) | mounted on `detections-mock` and `rtsp-loopback` as a single-frame loop | as above |
| `schemas-detection` | `fixtures/schemas/expected_detections.{json,schema.json}` | FT-P-Tier1Contract, FT-P-NormalisedBoxes (D6) | mounted on `e2e-consumer:/expected:ro` | as above |
| `sql-init-suite` | `fixtures/sql/init.sql` | NOT USED by autopilot tests (suite-only artefact; recorded here for traceability) | n/a | n/a |
| `mission-suite-fixture` | `<DEFERRED: missions_fixtures/mission_30x30km.json + mapobjects_10k.json; services.md §2>` | FT-P-MissionStart, FT-P-MapPull (Mp1), FT-P-MapPush (Mp3), NFT-RES-Mp2, NFT-RES-Mp4 | mounted on `missions-mock` once acquired | as above |
| `mavlink-sitl-scripts` | scripted `ardupilot/sitl` scenarios (waypoint upload, geofence in/out, RTL on link loss, RTL on battery floor) | FT-P-WaypointInsert (O8), NFT-RES-R4, NFT-RES-R5, NFT-RES-R6, NFT-RES-R7, NFT-RES-R9 | run in `mavlink-sitl` via `--script` argument per scenario | SITL container restarted per scenario |
| `operator-session-scripts` | scripted `(t, event)` traces — nominal, drop+reconnect, lost-link 30 s, sustained lost-link | FT-P-DecisionWindow (O1–O3, O4), FT-P-OperatorDecline (O5), FT-P-OperatorIgnoredSuppress (O6), FT-P-OperatorTimeout (O7), FT-P-OperatorConfirm (O8), NFT-RES-R4 | replayed by `operator-replay` per scenario | per-scenario |
| `operator-envelopes` | `<DEFERRED: operator_envelopes/{valid,replayed,malformed,unsigned,expired}.bin; services.md §8 (Q9-blocked)>` | NFT-SEC-O9, NFT-SEC-O10, FT-P-OperatorConfirm (O8 happy path uses a default placeholder envelope) | replayed by `operator-replay` | per-scenario |
| `vlm-io-pairs` | `<DEFERRED: vlm_io_pairs/{roi,prompt,response}.* + schema-violation cases; services.md §7>` | NFT-PERF-L3, FT-P-DeepAnalysisHold (S5), NFT-SEC-VlmSchemaViolation | mounted on `vlm-mock` | per-scenario |
| `gimbal-csv-pairs` | `<DEFERRED: gimbal_csv/video0[1-4].csv paired with movement videos at zoomed-in band + threshold-edge cluster; services.md §6>` | FT-P-EgoMotion (M1), FT-P-MoveDuringHold (M2), FT-P-ThresholdEdge (M3), FT-P-MoveBenchmark (M4), NFT-PERF-L6, NFT-PERF-L7 | replayed by `gimbal-mock` synchronised to RTSP frame timestamps | per-scenario |
| `tier1-replay-streams` | `<DEFERRED: tier1_replay/*.replay; services.md §1>` | FT-P-Tier1ContractIsolated (Tier B variant); Tier-E uses live `../detections` | served by `detections-mock` | per-scenario |
| `time-drift-scripts` | scripted clock offsets (50 ms ramp, 250 ms jump, NTP loss, GPS unlock) | NFT-RES-R8 | injected by `time-injector` via faketime LD_PRELOAD shim | per-scenario |
| `synthetic-poi-feeds` | inline-authorable: confidence={0.39, 0.40, 0.70, 1.00}, ordering-test feed, sustained-rate feed >5 POI/min | FT-P-DecisionWindow (O1–O4), FT-P-POIOrdering (S4), NFT-PERF-T1 | authored in Rust under `e2e/consumer/fixtures/synthetic_poi/`; pumped into the SUT by injecting recorded `Detections` into `detections-mock` | n/a (in-memory) |
| `bit-scenarios` | inline-authorable: every-dep-green, tier1-unreachable, storage-95pct-full | FT-P-BitPass (R1), NFT-RES-R2, NFT-RES-R3 | manipulated by toggling mock services up/down + `autopilot-state` volume seed file | volume seed file removed |
## Data isolation strategy
- **Per scenario, fresh containers.** Each scenario starts with `docker compose down -v && docker compose up -d` (the `e2e-consumer` orchestrates this via `testcontainers-rs`). No state leaks between scenarios.
- **`autopilot-state` volume** is named per `(test_id, run_id)` so parallel scenario runs do not collide.
- **Deterministic seeds.** Every randomness source in the SUT (POI age-factor tie-breaking, retry jitter, replay-window nonce window) is configured to a per-scenario seed via env vars (`AUTOPILOT_RNG_SEED=<test_id>`). The seed is captured in the CSV report.
- **Wall-clock control.** Scenarios that depend on absolute time (NFT-RES-R8, NFT-RES-R4 grace window, FT-P-DecisionWindow timeouts) use `time-injector` (faketime LD_PRELOAD). The SUT's `time.now()` calls are intercepted; GPS-source state is set via the `mavlink-sitl` GLOBAL_POSITION_INT message stream.
- **Network determinism.** All inter-service traffic stays on the `autopilot-e2e` Docker network (no internet egress). Latency injection (for L9 modem RTT exclusion checks) uses `tc qdisc` inside the `operator-replay` container.
- **No shared mocks between scenarios.** Even when two scenarios use the same fixture, each gets its own mock container instance — this avoids stale state in `missions-mock`'s POST-buffer or `gimbal-mock`'s last-command cache.
## Input data mapping (fixtures → scenarios)
This is the **fixture-side index**; the scenario-side index is in each `*-tests.md` file's `Input data` field.
| Input data file | Source location | Description | Covers scenarios |
|---|---|---|---|
| `fixtures/images/4d6e1830d211ad50.jpg` | `_docs/00_problem/input_data/fixtures/images/` | Aerial frame, 1280 px input | FT-P-Tier1Contract (D6), NFT-PERF-L1, NFT-PERF-L2 |
| `fixtures/images/{54f6...,6dd6...,805b...,f997...}.jpg` | same dir | 4 additional aerial frames for existing-class regression | FT-P-DetectExisting (D2) |
| `fixtures/videos/94d42580bd1ad6ff.mp4` | same dir | Reconnaissance clip, 30 fps; consumer throttles to drop below 10 fps for ≥5 s | NFT-PERF-T3 |
| `fixtures/movement/video01.mp4` | same dir | Wide-area movement clip (visual reference only) | FT-P-EgoMotion (M1) [DEFERRED — needs gimbal.csv] |
| `fixtures/movement/video02.mp4` | same dir | Wide-area movement clip (visual reference only) | FT-P-MoveDuringHold (M2) [DEFERRED — needs zoomed-in gimbal.csv] |
| `fixtures/movement/video03.mp4` | same dir | Wide-area movement clip (visual reference only) | FT-P-ThresholdEdge (M3) [DEFERRED — needs threshold-edge gimbal.csv] |
| `fixtures/movement/video04.mp4` | same dir | Wide-area movement clip (visual reference only) | FT-P-MoveBenchmark (M4) [DEFERRED — needs zoom-band benchmark CSV] |
| `fixtures/semantic/semantic01.png` | same dir | Winter concealed-position reference (starter only) | FT-P-ConcealStarter (D3, D4), FT-P-FootpathStarter (D5) [DEFERRED — needs annotated multi-season set] |
| `fixtures/semantic/semantic0[2-4].png` | same dir | 3 unmarked-season concealed-position references | as above |
| `fixtures/schemas/expected_detections.json` | same dir | Reference output for D6 | FT-P-Tier1Contract (D6), FT-P-NormalisedBoxes |
| `fixtures/schemas/expected_detections.schema.json` | same dir | Schema for normalised-box output | FT-P-NormalisedBoxes, NFT-SEC-Tier1SchemaViolation |
| `fixtures/sql/init.sql` | same dir | (suite-only — recorded for traceability) | none |
## Expected results mapping (scenario → comparison row)
Every scenario in `*-tests.md` traces to a row id in `_docs/00_problem/input_data/expected_results/results_report.md`. The comparison method + tolerance is owned by that row — this table is the **scenario-side index** so a reader can navigate from a test to its assertion contract.
| Scenario ID | Input data | Expected result row | Comparison method | Tolerance | Source |
|---|---|---|---|---|---|
| FT-P-Tier1Contract | `image-set-existing` (1 frame) | `D6` | `schema_match` + `range` | each coord ∈ [0,1] | `fixtures/schemas/expected_detections.schema.json` |
| FT-P-DetectExisting | `image-set-existing` (5 frames) | `D2` | `numeric_tolerance` | ± 0.02 (P, R) | `<DEFERRED: expected_results/existing_classes_baseline.json>` |
| FT-P-DetectNew | `<DEFERRED: new-class eval set>` | `D1` | `threshold_min` | P ≥ 0.80 AND R ≥ 0.80 | `<DEFERRED: expected_results/new_classes_pr.json>` |
| FT-P-ConcealRecall | `image-semantic-starter` + `<DEFERRED: full set>` | `D3` | `threshold_min` | recall ≥ 0.60 | `<DEFERRED: expected_results/concealed_positions.json>` |
| FT-P-ConcealPrecision | same | `D4` | `threshold_min` | precision ≥ 0.20 | same |
| FT-P-FootpathRecall | `image-semantic-starter` + `<DEFERRED>` | `D5` | `threshold_min` | recall ≥ 0.70 | `<DEFERRED: expected_results/footpaths.json>` |
| NFT-PERF-L1 | `image-set-existing` (1 frame) | `L1` | `threshold_max` | ≤ 100 ms | inline |
| NFT-PERF-L2 | derived ROI from same | `L2` | `threshold_max` | ≤ 200 ms | inline |
| NFT-PERF-L3 | `vlm-io-pairs` | `L3` | `threshold_max` | ≤ 5000 ms | inline |
| NFT-PERF-L4 | `<DEFERRED: SITL or HW zoom-cmd capture>` | `L4` | `threshold_max` | ≤ 2000 ms | inline |
| NFT-PERF-L5 | `<DEFERRED: scripted scan→movement>` | `L5` | `threshold_max` | ≤ 500 ms | inline |
| NFT-PERF-L6 | `video-movement` (visual ref) + `<DEFERRED gimbal.csv>` | `L6` | `threshold_max` | ≤ 1000 ms | inline |
| NFT-PERF-L7 | `video-movement` + `<DEFERRED zoomed-in gimbal.csv>` | `L7` | `threshold_max` | ≤ 1500 ms | inline |
| NFT-PERF-L8 | `<DEFERRED: sweep→zoomed transition capture>` | `L8` | `threshold_max` | ≤ 2000 ms | inline |
| NFT-PERF-L9 | `<DEFERRED: operator-click → outbound>` | `L9` | `threshold_max` | ≤ 500 ms | inline |
| NFT-PERF-T1 | `synthetic-poi-feeds` (sustained > cap) | `T1` | `threshold_max` | ≤ 5 / min | inline |
| NFT-PERF-T2 | `<DEFERRED: MAVLink replay 60 s>` | `T2` | `range` | 1 Hz ≤ r ≤ 10 Hz | inline |
| NFT-PERF-T3 | `video-recon` (throttled) | `T3` | `exact` × 2 | suppression bool + health=yellow | inline |
| FT-P-EgoMotion (M1) | `video-movement/video01.mp4` + `<DEFERRED gimbal.csv + telemetry.csv>` | `M1` | `set_contains` | candidate set == {vehicle}; ∉ tree row | inline |
| FT-P-MoveDuringHold (M2) | `video02.mp4` + `<DEFERRED zoomed-in CSV pair>` | `M2` | `exact` | 1 candidate; preempt per priority rule | inline |
| FT-P-ThresholdEdge (M3) | `video03.mp4` + `<DEFERRED threshold-edge CSV>` | `M3` | `exact` | count == 0 | inline |
| FT-P-MoveBenchmark (M4) | `video04.mp4` + `<DEFERRED benchmark suite>` | `M4` | `threshold_max` | per-zoom-band FP rate budget | `<DEFERRED: expected_results/movement_benchmark_caps.json>` |
| FT-P-SweepToZoom (S1) | `<DEFERRED scripted mission + POI>` | `S1` | `exact` × 3 | transition + ROI + queue+=1 | inline |
| FT-P-FootpathPan (S2) | `<DEFERRED hold + footpath polyline>` | `S2` | `numeric_tolerance` | centre offset ≤ 25% per frame | inline |
| FT-P-TargetFollow (S3) | `<DEFERRED confirmed target>` | `S3` | `threshold_max` | per-frame \|dx,dy\| ≤ 0.125 | inline |
| FT-P-POIOrdering (S4) | `synthetic-poi-feeds` (ordering test) | `S4` | `exact (order)` | ordering matches `conf × prox × age` | inline |
| FT-P-DeepAnalysisHold (S5) | `<DEFERRED VLM-enabled hold>` | `S5` | `exact` | hold = min(5 s, vlm_complete) | inline |
| FT-P-DecisionWindow30s (O1) | `synthetic-poi-feeds` (conf=0.40) | `O1` | `exact` | window = 30 s | inline |
| FT-P-DecisionWindow120s (O2) | conf=1.00 | `O2` | `exact` | window = 120 s | inline |
| FT-P-DecisionWindow75s (O3) | conf=0.70 | `O3` | `numeric_tolerance` | window ≈ 75 s ± 0.5 s | inline |
| FT-N-BelowThreshold (O4) | conf=0.39 | `O4` | `exact` | not surfaced | inline |
| FT-P-OperatorDecline (O5) | `operator-session-scripts` (nominal + decline) | `O5` | `exact (count Δ+1)` + `schema_match` | ignored-item appended | inline |
| FT-P-IgnoredSuppress (O6) | matching MGRS + class_group | `O6` | `exact` | not surfaced | inline |
| FT-P-OperatorTimeout (O7) | no-response + > window | `O7` | `exact` × 2 | queue 1; ignored unchanged | inline |
| FT-P-OperatorConfirm (O8) | `operator-envelopes` (valid happy path) | `O8` | `exact (HTTP 200)` + `exact (mode)` | mission POST + target-follow | inline |
| NFT-SEC-O9 | `operator-envelopes` (replayed) | `O9` | `exact` + `substring` | state unchanged; log contains "replay" | inline |
| NFT-SEC-O10 | `operator-envelopes` (malformed/unsigned) | `O10` | `exact` + `substring` | state unchanged; log contains "invalid" | inline |
| FT-P-BitPass (R1) | `bit-scenarios` (every dep green) | `R1` | `exact` × 2 | takeoff permitted + health all green | inline |
| FT-N-BitDetectionDown (R2) | tier1 unreachable | `R2` | `exact` | takeoff inhibited + detection red | inline |
| FT-N-BitStorageFull (R3) | storage ≥ 95 % | `R3` | `exact` | takeoff inhibited + storage red | inline |
| NFT-RES-R4 | `operator-session-scripts` (sustained lost-link) | `R4` | `exact (RTL at 30 s ± 1 s)` | RTL command + operator-link red | inline |
| NFT-RES-R5 | `mavlink-sitl-scripts` (battery at RTL-floor) | `R5` | `exact` × 2 | RTL + health yellow | inline |
| NFT-RES-R6 | battery at hard-floor | `R6` | `exact` | land-now | inline |
| NFT-RES-R7 | `mavlink-sitl-scripts` (no-response retry exhaustion) | `R7` | `exact` | health red after max-retry | inline |
| NFT-RES-R8 | `time-drift-scripts` (250 ms drift) | `R8` | `exact` | time-source yellow + clock_source/last_sync_at updated | inline |
| NFT-RES-R9 | `mavlink-sitl-scripts` (EXCLUSION cross) | `R9` | `exact` × 2 | waypoint rejected + RTL | inline |
| NFT-RES-LIM-Re1 | `<DEFERRED long-running RSS harness>` | `Re1` | `threshold_max` | combined RSS ≤ 6 GB | inline |
| NFT-RES-LIM-Re2 | Re1 + concurrent Tier-1 traffic | `Re2` | `numeric_tolerance` | Tier-1 ms/frame Δ ± 5 ms | inline |
| FT-P-MapPull (Mp1) | `<DEFERRED 30×30 km area + ~10k mapobjects>` | `Mp1` | `threshold_max` | ≤ 30 s | inline |
| NFT-RES-Mp2 | mock unreachable | `Mp2` | `exact` × 2 | cached_fallback + BIT requires ack | inline |
| FT-P-MapPush (Mp3) | `<DEFERRED 60 min diff>` | `Mp3` | `threshold_max` | ≤ 120 s | inline |
| NFT-RES-Mp4 | POST returns 5xx | `Mp4` | `exact` × 2 + `threshold_max` | file exists + warning + retries ≤ cap | inline |
| FT-P-MapConflict (Mp5) | `<DEFERRED conflict pair>` | `Mp5` | `json_diff` | conflict resolution per Q8 | `<DEFERRED: expected_results/mapobjects_conflict_resolution.json>` |
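
The three decision-window rows (O1 to O3) are consistent with a linear interpolation between (0.40, 30 s) and (1.00, 120 s): at confidence 0.70 that gives 30 + (0.30 / 0.60) × 90 = 75 s, and O4 places 0.39 below the surfacing threshold. The sketch below encodes that reading. The linear shape is an inference from these data points only; the canonical rule lives in `acceptance_criteria.md`, and `decision_window_secs` is a hypothetical name.

```rust
/// Lowest confidence that is surfaced to the operator (row O4: 0.39 is not).
pub const MIN_SURFACED_CONFIDENCE: f64 = 0.40;
/// Window at the minimum surfaced confidence (row O1).
pub const MIN_WINDOW_SECS: f64 = 30.0;
/// Window at confidence 1.00 (row O2).
pub const MAX_WINDOW_SECS: f64 = 120.0;

/// Decision-window length in seconds, or None when the POI is not surfaced.
/// ASSUMPTION: linear between the O1 and O2 anchor points.
pub fn decision_window_secs(confidence: f64) -> Option<f64> {
    // Written as a negated `>=` so that NaN is also treated as "not surfaced".
    if !(confidence >= MIN_SURFACED_CONFIDENCE) {
        return None;
    }
    let c = confidence.min(1.0);
    let t = (c - MIN_SURFACED_CONFIDENCE) / (1.0 - MIN_SURFACED_CONFIDENCE);
    Some(MIN_WINDOW_SECS + t * (MAX_WINDOW_SECS - MIN_WINDOW_SECS))
}
```

The O3 row's ± 0.5 s tolerance comfortably absorbs the floating-point rounding at 0.70, which is why that row uses `numeric_tolerance` while O1 and O2 can use `exact`.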
## External dependency mocks
(Index-only; per-mock acquisition status owned by `services.md`.)
| External service | Mock/stub | How provided | Behavior |
|---|---|---|---|
| `../detections` Tier-1 RPC | `detections-mock` (gRPC bi-stream) | Docker container; serves `.replay` files | Returns recorded `Detections` byte-stream for the input frame's hash; serves a 19-class catalogue (0..18) deterministically; supports schema-violation injection for NFT-SEC tests |
| `missions` API | `missions-mock` (HTTPS FastAPI) | Docker container; TLS via self-signed test cert | Static JSON for `GET /missions/{id}`, `GET /missions/{id}/mapobjects`; records POST bodies for assertion; can be configured to return 5xx for NFT-RES-Mp4 |
| ViewPro A40 RTSP | `rtsp-loopback` (mediamtx) | Docker container | Plays back `.mp4` at scheduled fps with frame-drop injection (T3) |
| ViewPro A40 gimbal | `gimbal-mock` (Rust UDP) | Docker container | Replays `gimbal.csv` synchronised to RTSP frame timestamps; echoes received commands with bounded latency budget |
| ArduPilot | `mavlink-sitl` (official ardupilot/ardupilot-sitl image) | Docker container | Deterministic SITL run from a scripted mission file |
| Ground Station modem | `operator-replay` (Python) | Docker container | Replays `(t, event)` script per scenario; signs envelopes per Q9 once resolved |
| Local VLM | `vlm-mock` (Python over UDS) | Docker container; UDS shared via `/tmp` volume | Returns paired `VlmAssessment` JSON; can return schema-violation responses for NFT-SEC tests |
| Wall-clock / GPS / NTP | `time-injector` (Rust) | LD_PRELOAD faketime shim into the SUT container at start | Scripted offset/jump/source-loss |
## Data validation rules
| Data Type | Validation | Invalid Examples | Expected System Behaviour |
|---|---|---|---|
| Mission JSON | `mission-schema` (shared with `missions` repo) | missing required field; coord out of [-180, 180]; unknown enum value | system refuses; mission-state stays at last-known; health flips mission-config-source = yellow; structured-log at WARN with `schema_violation_field` |
| Map-object record | suite-level mapobjects schema | non-finite coordinate; class_group not in catalogue; missing MGRS | record dropped; counter `mapobjects_rejected_total` increments; structured-log at WARN |
| Tier-1 `Detections` stream | `expected_detections.schema.json` (normalised-box) | bbox coord ∉ [0, 1]; confidence ∉ [0, 1]; class_id ∉ {0..18} | frame's detections dropped (not partially used); `tier1_invalid_frame_total` increments; per AC D6 the system must surface a structured WARN |
| MAVLink message | MAVLink v2 dialect (per ArduPilot) | unknown MSG_ID; CRC mismatch; (if Q6 resolves to "signing on") missing signature | message dropped; if signing required and missing → security WARN; airframe-link health unaffected for individual drops |
| Operator command envelope | Q9 scheme (TBD) | replay (sequence_id seen recently); signature invalid; timestamp outside replay window | rejected at the boundary; no state mutation; security WARN with reason code; counters `operator_cmd_rejected_replay_total`, `..._signature_total`, `..._expired_total` |
| VLM `VlmAssessment` response | structured assessment schema | missing required field; wrong type; truncated JSON | fail-closed: assessment discarded; POI does NOT get the deep-analysis upgrade; structured WARN |
| RTSP frame | container-level decode | malformed H.264/H.265 NAL; oversized SPS | frame dropped; `frame_decode_error_total` increments; if rate falls below 10 fps for ≥5 s → T3 path triggers (zoom-in suppressed + health yellow) |
| Camera frame size | bounded crop policy (security_approach §Bounded input) | crop > configured max bytes; format not in allow-list | rejected at boundary; security WARN |
| Time source | wall-clock binding | GPS unlocked AND no NTP sync at boot | clock_source = `none`; health red until either source available |
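
As an illustration of the operator-command row, the replay and expiry checks could be combined as below. The Q9 envelope scheme is still TBD, so everything here is an assumption: the `u64` `sequence_id`, millisecond timestamps, a symmetric freshness window, and the names `ReplayGuard` and `EnvelopeRejection`. Signature verification is deliberately omitted and would have to run before this check.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum EnvelopeRejection {
    /// sequence_id was already accepted within the replay window.
    Replay,
    /// Envelope timestamp lies outside the replay window.
    Expired,
}

/// Remembers accepted sequence ids for as long as their timestamp would
/// still pass the freshness check, so a replay can never be both fresh
/// and forgotten.
pub struct ReplayGuard {
    window_ms: u64,
    seen: HashMap<u64, u64>, // sequence_id -> envelope timestamp (ms)
}

impl ReplayGuard {
    pub fn new(window_ms: u64) -> Self {
        ReplayGuard { window_ms, seen: HashMap::new() }
    }

    /// Ok(()) accepts the envelope and records its id. On Err the caller
    /// must not mutate any state and must emit the security WARN.
    pub fn check(
        &mut self,
        sequence_id: u64,
        sent_at_ms: u64,
        now_ms: u64,
    ) -> Result<(), EnvelopeRejection> {
        let window = self.window_ms;
        // Forget ids whose timestamps have aged out of the window.
        self.seen.retain(|_, ts| now_ms.abs_diff(*ts) <= window);

        if now_ms.abs_diff(sent_at_ms) > window {
            return Err(EnvelopeRejection::Expired);
        }
        if self.seen.contains_key(&sequence_id) {
            return Err(EnvelopeRejection::Replay);
        }
        self.seen.insert(sequence_id, sent_at_ms);
        Ok(())
    }
}
```

Keying retention on the envelope's own timestamp rather than on acceptance time is a design choice: a replayed envelope carries the same timestamp, so it is either still remembered (rejected as replay) or already stale (rejected as expired).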
## Deferred-fixture bridge (replay obligation)
Every `<DEFERRED:>` row above maps 1-to-1 to an entry in `_docs/_process_leftovers/2026-05-19_autopilot_test_fixtures.md → "What is needed before /autodev can resume"` table. On every `/autodev` invocation, the leftovers step must re-evaluate whether any deferred fixture has landed; once landed, the corresponding scenario(s) become unblocked and their `Test status` line in the matching `*-tests.md` file moves from `DEFERRED — input fixture not yet acquired` to `READY`.
Inline-authorable categories (10 and 11 in the leftover) — `synthetic-poi-feeds`, `time-drift-scripts`, `operator-session-scripts`, `bit-scenarios` — are NOT marked `<DEFERRED:>` in this file because they have no external dependency. They are authored by Phase 4's `e2e/consumer/fixtures/` generators when the runner scripts come online.
@@ -0,0 +1,202 @@
# Traceability Matrix
Authored by `/test-spec` Phase 2 (2026-05-19).
This matrix maps every acceptance-criterion bullet from `_docs/00_problem/acceptance_criteria.md` and every restriction bullet from `_docs/00_problem/restrictions.md` to the test scenarios that exercise them. Coverage is **scenario-level**, not fixture-level — scenarios marked `DEFERRED` in the underlying `*-tests.md` files still count as covered for the purpose of "the test is specified"; the fixture-acquisition status is tracked separately in `_docs/_process_leftovers/2026-05-19_autopilot_test_fixtures.md`.
## Acceptance Criteria Coverage
| AC ID | Acceptance criterion (paraphrased; canonical text in `acceptance_criteria.md`) | Test IDs | Coverage |
|---|---|---|---|
| AC-L1 | Tier 1 per-frame ≤ 100 ms at 1280 px on deployed compute | NFT-PERF-L1, FT-P-001 (functional contract) | Covered |
| AC-L2 | Tier 2 per-ROI ≤ 200 ms | NFT-PERF-L2 | Covered |
| AC-L3 | Tier 3 per-ROI ≤ 5 s (when enabled) | NFT-PERF-L3 | Covered (fixture DEFERRED) |
| AC-L4 | Camera zoom transition (medium→high) ≤ 2 s | NFT-PERF-L4 | Covered (fixture DEFERRED) |
| AC-L5 | Decision-to-movement latency ≤ 500 ms | NFT-PERF-L5 | Covered (fixture DEFERRED) |
| AC-L6 | Movement candidate enqueue ≤ 1 s (wide sweep) | NFT-PERF-L6 | Covered (fixture DEFERRED — gimbal.csv) |
| AC-L7 | Movement candidate enqueue ≤ 1.5 s (zoomed-in) | NFT-PERF-L7 | Covered (fixture DEFERRED — zoomed gimbal.csv) |
| AC-L8 | Zoom-out → zoom-in transition ≤ 2 s | NFT-PERF-L8 | Covered (fixture DEFERRED) |
| AC-L9 | Operator command → outbound action ≤ 500 ms | NFT-PERF-L9 | Covered (fixture DEFERRED for signed envelopes; placeholder usable today) |
| AC-T1 | POI rate surfaced to operator ≤ 5 / min (hard cap) | NFT-PERF-T1 | Covered |
| AC-T2 | Position telemetry rate ∈ [1, 10] Hz (target 10) | NFT-PERF-T2 | Covered (fixture DEFERRED — MAVLink replay) |
| AC-T3 | Frame-rate floor < 10 fps for ≥ 5 s → suppress zoom-in AND health yellow | NFT-PERF-T3 | Covered |
| AC-D1 | New classes per-class P ≥ 0.80 AND R ≥ 0.80 | FT-P-003 | Covered (fixture DEFERRED — annotated eval set) |
| AC-D2 | Existing-class regression Δ ≤ ± 0.02 vs baseline | FT-P-002 | Covered (baseline JSON DEFERRED; visual fixtures present) |
| AC-D3 | Concealed-position recall ≥ 0.60 (initial gate) | FT-P-004 | Covered (fixture DEFERRED — multi-season set) |
| AC-D4 | Concealed-position precision ≥ 0.20 (initial gate) | FT-P-005 | Covered (fixture DEFERRED — same as D3) |
| AC-D5 | Footpath recall ≥ 0.70 | FT-P-006 | Covered (fixture DEFERRED — polyline-annotated set) |
| AC-D6 | Tier-1 normalised-box contract conformance (class ids 0..18, coords ∈ [0,1]) | FT-P-001, NFT-SEC-Tier1SchemaViolation | Covered |
| AC-Mov-EnqueueWideSweep | Small movers during wide sweep MUST be detected and enqueued ≤ 1 s | FT-P-007 (M1 behavioural), NFT-PERF-L6 (latency dimension) | Covered |
| AC-Mov-ContinueDuringZoom | Movement detection continues during zoomed-in inspection | FT-P-008 (M2), NFT-PERF-L7 | Covered |
| AC-Mov-StableObjectsRejected | Stable objects (trees, houses, roads) NOT treated as moving solely due to camera platform motion | FT-P-007 (M1 — set_contains explicitly excludes tree row) | Covered |
| AC-Mov-FPBudgetHonoured | Configurable per-zoom-band FP budget honoured | FT-P-009 (M3), FT-P-010 (M4 — Q14) | Covered (M4 fixture DEFERRED) |
| AC-Scan-SweepCoverage | Wide-area sweep covers planned route at wide/light/medium zoom | implicitly by FT-P-011 setup + scenario-runner BIT scenarios; NOT covered as a distinct test | NOT COVERED (see Uncovered Items § §1) |
| AC-Scan-SweepToZoomTransition | Sweep → detailed inspection transition ≤ 2 s | FT-P-011, NFT-PERF-L8 | Covered |
| AC-Scan-TargetLock | Lock + pan + 2 s deep-analysis hold + per-POI timeout default 5 s | FT-P-015 (S5 — three cases) | Covered (fixture DEFERRED — vlm-mock with realistic timing) |
| AC-Scan-TargetFollowCentre | Target-follow within centre 25 % of frame | FT-P-013 | Covered (fixture DEFERRED) |
| AC-Scan-GimbalLatency | Gimbal decision-to-movement ≤ 500 ms (links to L5) | NFT-PERF-L5 | Covered |
| AC-Scan-POIOrdering | POI queue ordered by `confidence × proximity × age_factor` | FT-P-014 | Covered |
| AC-Op-DecisionWindowScale | Decision window scales 30 s @ 0.40 → 120 s @ 1.00 linearly | FT-P-017 (O1), FT-P-018 (O2), FT-P-019 (O3), FT-N-004 (O4 below-threshold) | Covered |
| AC-Op-DeclinePersistsIgnored | Operator-decline → persistent ignored-item per (MGRS, class_group) | FT-P-020 (O5) | Covered |
| AC-Op-TimeoutForget | Timeout (no response) MUST NOT create ignored-item | FT-P-022 (O7) | Covered |
| AC-Op-IgnoredSuppress | New detection matching existing ignored-item NOT surfaced | FT-P-021 (O6) | Covered |
| AC-Op-ConfirmWaypointFollow | Operator-confirm → middle waypoint POST + target-follow mode | FT-P-016 (O8) | Covered (Q9 envelope DEFERRED; happy path uses placeholder) |
| AC-Op-ReplayUnsignedRejected | Replayed or unsigned operator command REJECTED with logged security WARN; state UNCHANGED | NFT-SEC-O9, NFT-SEC-O10 | Covered (Q9 DEFERRED for full semantics) |
| AC-Rel-BITGatesTakeoff | BIT MUST pass before takeoff permitted | FT-P-023 (R1), FT-N-001 (R2), FT-N-002 (R3), FT-N-003 (Mp2 BIT gate) | Covered |
| AC-Rel-LostLinkRTL30s | Lost operator/GS link → known mission-safe outcome within configurable grace (default 30 s → RTL) | NFT-RES-R4 | Covered |
| AC-Rel-AirframeLinkRedImmediate | Airframe command link loss → health red immediately; defer to airframe failsafe | NFT-RES-R7 (extension), implicitly by airframe-link health observation in NFT-RES-R5/R6 | Covered |
| AC-Rel-BatteryFloors | Battery ≤ RTL floor → RTL; battery ≤ hard floor → land-now; operator override only | NFT-RES-R5, NFT-RES-R6 | Covered (fixture DEFERRED) |
| AC-Rel-MavlinkExhaustionRed | MAVLink command exhaustion → airframe-link health red | NFT-RES-R7 | Covered (fixture DEFERRED) |
| AC-Rel-DriftYellow | Wall-clock drift > 200 ms → health yellow | NFT-RES-R8 | Covered |
| AC-Rel-GeofenceSymmetric | Geofence INCLUSION + EXCLUSION violations → waypoint refusal + RTL | NFT-RES-R9 (both cases) | Covered (fixture DEFERRED) |
| AC-Res-RSS6GB | Combined RSS on Jetson (excluding Tier 1) ≤ 6 GB sustained | NFT-RES-LIM-Re1, NFT-RES-LIM-CPU (CPU dimension), NFT-RES-LIM-FileHandles (FD dimension) | Covered (HW DEFERRED) |
| AC-Res-Tier1NonDegradation | Tier 1 per-frame latency Δ ± 5 ms under concurrent autopilot workload | NFT-RES-LIM-Re2, NFT-RES-LIM-GPU (GPU mutual exclusion) | Covered (HW DEFERRED) |
| AC-Mp-PreFlightPull30s | Pre-flight map pull ≤ 30 s; cache-fallback only with explicit operator ack | FT-P-024 (Mp1), FT-N-003 (Mp2 cache-fallback gate), NFT-RES-Mp2 (timing+recovery) | Covered |
| AC-Mp-PostFlightPush120s | Post-flight pass diff push ≤ 120 s; failure → persist + bounded retry | FT-P-025 (Mp3), NFT-RES-Mp4 | Covered (fixture DEFERRED) |
| AC-Gate-HWBench | HW/replay benchmark suite MUST pass before product implementation | every Tier-HW row in environment.md `Hardware Execution Matrix` (filled by `hardware-assessment.md`) | Covered as a gate, executed at the Acceptance-Gates milestone |
| AC-Gate-SeasonCoverage | Per-season dataset coverage demonstrated before MVP sign-off (Q13) | NOT COVERED at blackbox test level — gated on annotation campaign and the `../ai-training` repo | NOT COVERED (see Uncovered Items § §2) |
| AC-Gate-MavlinkSITLConformance | MAVLink command surface MUST pass SITL conformance | implicitly by FT-P-016 (O8 confirms waypoint POST through SITL) + NFT-RES-R4/R5/R6/R7/R9 (all run through SITL); a dedicated conformance suite is recommended | Partially Covered (see Uncovered Items § §3) |
| AC-Q-Mov-Zoomed-FPRate | Movement detection FP rate at zoomed-in inspection (Q14) | FT-P-010 (M4) | Covered (Q14 DEFERRED) |
| AC-Q-MapObjectsConflict | MapObjects conflict resolution rule (Q8) | FT-P-026 (Mp5) | Covered (Q8 DEFERRED) |
| AC-Q-OperatorCmdAuth | Operator-command authentication conformance (Q9) | NFT-SEC-O9, NFT-SEC-O10, FT-P-016 (O8) | Covered (Q9 DEFERRED — placeholders used today) |
| AC-Q-MAVLinkSigning | Airframe MAVLink-2 message signing (Q6) | NFT-SEC-MavlinkUnsigned | Covered (Q6 DEFERRED) |
| AC-Q-SeasonGates | Per-season flight-test gates (Q13) | NOT COVERED — same as AC-Gate-SeasonCoverage | NOT COVERED |
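
The linear scaling in the AC-Op-DecisionWindowScale row can be made concrete with a small interpolation sketch. The two endpoints (30 s at 0.40, 120 s at 1.00) come from the row itself. The function name `decision_window_s` and the choice to return `None` outside the confidence range are assumptions for illustration, not the autopilot's API.

```rust
// Endpoints taken from the AC-Op-DecisionWindowScale row.
const MIN_CONF: f64 = 0.40;
const MAX_CONF: f64 = 1.00;
const MIN_WINDOW_S: f64 = 30.0;
const MAX_WINDOW_S: f64 = 120.0;

/// Decision window in seconds for a detection confidence.
/// Returns `None` outside [MIN_CONF, MAX_CONF]; treating such
/// detections as "no window" is an assumption of this sketch.
fn decision_window_s(confidence: f64) -> Option<f64> {
    if !(MIN_CONF..=MAX_CONF).contains(&confidence) {
        return None;
    }
    // Fraction of the way from MIN_CONF to MAX_CONF, in [0, 1].
    let t = (confidence - MIN_CONF) / (MAX_CONF - MIN_CONF);
    Some(MIN_WINDOW_S + t * (MAX_WINDOW_S - MIN_WINDOW_S))
}

fn main() {
    for c in [0.39, 0.40, 0.70, 1.00] {
        println!("confidence {:.2} -> {:?}", c, decision_window_s(c));
    }
}
```

Under these assumptions a mid-range confidence of 0.70 yields a window of about 75 s, which is the kind of value the O1..O3 scenarios would compare against with a numeric tolerance.
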
## Restrictions Coverage
| Restriction ID | Restriction (paraphrased; canonical in `restrictions.md`) | Test IDs | Coverage |
|---|---|---|---|
| RESTRICT-HW-Jetson | Compute device Jetson Orin Nano Super; 8 GB shared LPDDR5; ~6 GB residual after Tier 1 | NFT-RES-LIM-Re1, NFT-RES-LIM-CPU, NFT-RES-LIM-Re2, all Tier-HW rows | Covered (HW DEFERRED) |
| RESTRICT-HW-A40 | Primary camera ViewPro A40; vendor protocol mandatory | FT-P-011, FT-P-012, FT-P-013, NFT-PERF-L4 (zoom traversal floor) | Covered (HW DEFERRED for L4) |
| RESTRICT-HW-Z40K | Alternative camera ViewPro Z40K — system must remain compatible | NOT COVERED at autopilot test level — verified by component-swap regression run on the Z40K HW | NOT COVERED (see Uncovered Items § §4) |
| RESTRICT-HW-ThermalLater | Thermal sensor may be added later; not assumed today | implicit (no test depends on thermal) | Covered by absence (negative assumption) |
| RESTRICT-HW-ZoomFloor | 40× optical zoom traversal 1–2 s wall-clock | NFT-PERF-L4 (asserts the ≤ 2 s ceiling that includes the physical floor) | Covered (HW DEFERRED) |
| RESTRICT-Op-Altitude | Flight altitude 600–1000 m | implicitly by every mission-trace fixture; no dedicated test | Covered by fixture assumption |
| RESTRICT-Op-AllSeasons | All four seasons in scope; winter-first-only rejected | FT-P-002, FT-P-003, FT-P-004, FT-P-005, FT-P-006 — multi-season fixtures required | Covered (all DEFERRED on multi-season fixtures) |
| RESTRICT-Op-AllTerrains | Forest, open field, urban edges, mixed terrain | same as RESTRICT-Op-AllSeasons | Covered (DEFERRED) |
| RESTRICT-Op-IntermittentModem | Modem operator/GS link intermittent | NFT-RES-R4, FT-P-016 (O8 nominal session), NFT-SEC-O9/O10 | Covered |
| RESTRICT-SW-JetsonResidualBudget | Onboard inference path runs within 6 GB residual RAM | NFT-RES-LIM-Re1 | Covered (HW DEFERRED) |
| RESTRICT-SW-FP16 | Models use FP16 precision (INT8 rejected for MVP) | NOT COVERED at autopilot test level — pinned at the model-loading layer (Tier 1 in `../detections`; Tier 2/3 in autopilot config) | NOT COVERED (see Uncovered Items § §5) |
| RESTRICT-SW-NoCloudInference | No cloud egress for inference | NFT-SEC-CraftedFrame (process boundary), implicit by environment.md `autopilot-e2e` network having no egress | Covered |
| RESTRICT-SW-GPUMutualExclusion | Tier 1 + any local large model serialise on the Jetson GPU | NFT-RES-LIM-GPU | Covered (HW DEFERRED) |
| RESTRICT-SW-MissionSchemaShared | Autopilot consumes shared `mission-schema`; cannot fork | FT-P-016 (O8 — POST validates against schema), FT-P-024 (Mp1 — schema-validated pull) | Covered (fixtures DEFERRED) |
| RESTRICT-Arch-Tier1External | Tier 1 lives in `../detections`; autopilot consumes | FT-P-001 (D6), NFT-SEC-Tier1SchemaViolation, FT-N-001 (R2 — Tier 1 unreachable inhibits BIT) | Covered |
| RESTRICT-Arch-MissionExternal | Mission state from `missions` service; autopilot doesn't author | FT-P-024, FT-P-025, FT-P-016 | Covered (fixtures DEFERRED) |
| RESTRICT-Arch-MapInMissions | Central area map in `missions /mapobjects` | FT-P-024, FT-P-025, FT-P-026 (Mp5), NFT-RES-Mp2, NFT-RES-Mp4 | Covered (fixtures DEFERRED) |
| RESTRICT-Arch-GPSDeniedExternal | GPS coords from separate GPS-denied service; autopilot does NOT implement | NOT COVERED at autopilot test level — verified at suite-e2e tier via the live GPS-denied service | NOT COVERED at autopilot tier (covered at suite-e2e tier) |
| RESTRICT-Arch-OperatorUIExternal | Operator browser UI owned by Ground Station; autopilot pushes data | implicit by NOT testing any UI rendering; verified by operator-stream protocol assertions in FT-P-016, FT-P-017–022 | Covered by absence |
| RESTRICT-Arch-AnnotationTrainingExternal | Annotation + training in `../annotations`, `../ai-training`; autopilot doesn't own | NOT TESTABLE at autopilot blackbox tier — process boundary | NOT TESTABLE (intentional scope exclusion) |
| RESTRICT-Rel-BITGate | Pre-flight BIT MUST gate takeoff | FT-P-023 (R1), FT-N-001 (R2), FT-N-002 (R3), FT-N-003 (Mp2) | Covered |
| RESTRICT-Rel-LostLinkDeterministic | Lost operator-link failsafe deterministic + bounded | NFT-RES-R4 | Covered |
| RESTRICT-Rel-AirframeLossRedImmediate | Airframe MAVLink loss → health red immediately | NFT-RES-R7 (red after retry exhaustion); a dedicated "immediate red on link loss" scenario MAY be desirable (currently rolled into R7) | Partially Covered (see Uncovered Items § §6) |
| RESTRICT-Rel-BatteryThresholds | Battery RTL + land-now triggers (override only via operator) | NFT-RES-R5, NFT-RES-R6 | Covered (fixtures DEFERRED) |
| RESTRICT-Rel-GeofenceSymmetric | Geofence INCLUSION + EXCLUSION enforcement | NFT-RES-R9 (both) | Covered (fixture DEFERRED) |
| RESTRICT-Rel-OperatorCmdAuth | Operator commands authenticated + signed + replay-protected | NFT-SEC-O9, NFT-SEC-O10, FT-P-016 happy path | Covered (Q9 DEFERRED) |
| RESTRICT-Rel-StorageBounded | On-device storage bounded; full = takeoff blocker; mid-flight eviction policy | FT-N-002 (R3 — BIT block), NFT-RES-LIM-Storage | Covered |
| RESTRICT-Rel-NoSilentErrors | No silent error swallowing | every NFT-SEC-* scenario asserts a counter + log entry; every NFT-RES-* asserts a structured-log + health transition | Covered |
| RESTRICT-Rel-ClockBound | Wall-clock bound to GPS once locked, else NTP at boot | NFT-RES-R8 | Covered |
| RESTRICT-Rel-MavlinkConformance | MAVLink command surface MUST conform to ArduPilot/PX4 SITL | every MAVLink-emitting scenario runs through `mavlink-sitl`; a dedicated conformance suite is recommended | Partially Covered (see Uncovered Items § §3) |
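
The two-floor rule in RESTRICT-Rel-BatteryThresholds (RTL floor, then hard floor) reduces to an ordered comparison. The sketch below is illustrative only: the type names, the use of percent as the unit, and the floor values 30 and 15 are placeholders, not values from `restrictions.md`, and the operator-override path is not modelled.

```rust
#[derive(Debug, PartialEq)]
enum BatteryAction {
    Continue,
    ReturnToLaunch,
    LandNow,
}

/// Hypothetical configuration; the real floors live in project config.
struct BatteryFloors {
    rtl_pct: f64,
    hard_pct: f64,
}

/// Check the hard floor first so that it wins when both floors are crossed.
fn battery_action(level_pct: f64, floors: &BatteryFloors) -> BatteryAction {
    if level_pct <= floors.hard_pct {
        BatteryAction::LandNow
    } else if level_pct <= floors.rtl_pct {
        BatteryAction::ReturnToLaunch
    } else {
        BatteryAction::Continue
    }
}

fn main() {
    // Placeholder floors chosen only for the demonstration.
    let floors = BatteryFloors { rtl_pct: 30.0, hard_pct: 15.0 };
    for level in [80.0, 30.0, 15.0, 5.0] {
        println!("{:>5.1} % -> {:?}", level, battery_action(level, &floors));
    }
}
```

A resilience scenario such as NFT-RES-R5 or NFT-RES-R6 would observe the resulting action only through the MAVLink surface, so a helper like this belongs in the test consumer's expectations, not in an import of SUT internals.
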
## Coverage Summary
| Category | Total Items | Covered | Partially Covered | Not Covered | Coverage % (counting Partially as 0.5) |
|---|---|---|---|---|---|
| Acceptance Criteria | 47 | 43 | 1 | 3 | (43 + 0.5×1) / 47 ≈ **92.6 %** |
| Restrictions | 30 | 25 | 2 | 3 | (25 + 0.5×2) / 30 ≈ **86.7 %** |
| **Total** | 77 | 68 | 3 | 6 | **(68 + 1.5) / 77 ≈ 90.3 %** |
(Coverage here is "test scenario exists for the item", not "fixture has been acquired and the test currently passes". Fixture status is tracked in `_docs/_process_leftovers/2026-05-19_autopilot_test_fixtures.md`.)
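
The weighting used in the Coverage Summary (each Partially Covered item counts as 0.5) can be reproduced with a few lines of arithmetic. This is a minimal sketch; the `Tally` struct and function names are made up for illustration, and the counts are copied from the table above.

```rust
/// Row counts for one category of the Coverage Summary table.
struct Tally {
    total: u32,
    covered: u32,
    partial: u32,
}

/// Weighted coverage in percent: a partially covered item counts as 0.5.
fn coverage_pct(t: &Tally) -> f64 {
    (f64::from(t.covered) + 0.5 * f64::from(t.partial)) / f64::from(t.total) * 100.0
}

/// Round to one decimal place, the precision reported in the table.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}

fn main() {
    let ac = Tally { total: 47, covered: 43, partial: 1 };
    let restrict = Tally { total: 30, covered: 25, partial: 2 };
    let all = Tally { total: 77, covered: 68, partial: 3 };
    for (name, t) in [("AC", &ac), ("RESTRICT", &restrict), ("Total", &all)] {
        println!("{}: {:.1} %", name, round1(coverage_pct(t)));
    }
}
```

Running the sketch with the table's counts gives 92.6, 86.7 and 90.3, matching the percentages stated in the summary.
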
## Uncovered Items Analysis
| § | Item | Reason not covered | Risk | Mitigation |
|---|---|---|---|---|
| §1 | AC-Scan-SweepCoverage (wide-area sweep covers planned route) | The "covers the planned route" property is a path-coverage assertion best tested by component-level tests in the `scan_controller` component (geometry coverage) rather than at the blackbox level | Medium — incorrect sweep pattern leaks observation gaps | Component test in `scan_controller` (added by `/decompose` test tasks); a Tier-E "did the camera point at every planned waypoint area for ≥ N seconds" scenario can be added if needed |
| §2 | AC-Gate-SeasonCoverage / AC-Q-SeasonGates | Per-season coverage gates depend on dataset acquisition owned by `../ai-training` and per-season flight tests (Q13) | High — model performance on un-evaluated seasons unknown | Tracked as release-gate item; D3/D4/D5/D1 scenarios DEFERRED until each season's dataset lands |
| §3 | AC-Gate-MavlinkSITLConformance / RESTRICT-Rel-MavlinkConformance | A dedicated "every command in `architecture.md §7.7` exercised against SITL" suite is recommended in addition to the implicit coverage by R-scenarios | Medium — could miss a rarely-used command | Add an `NFT-MavlinkConformance` suite during Step 9 (Decompose Tests) — explicit per-command SITL exercise |
| §4 | RESTRICT-HW-Z40K (Z40K compatibility) | Requires a second camera HW for the swap test | Medium — could miss an A40-specific assumption | Run the Tier-HW rows on Z40K as a post-MVP smoke step |
| §5 | RESTRICT-SW-FP16 (model precision) | Pinned at config + model-loading layer; not externally observable beyond perf/latency | Low — incorrect precision would manifest as either L1 latency or D2 regression failure | Add a startup log assertion: "Tier 2/3 models loaded with precision=FP16" via the SUT's structured boot log |
| §6 | RESTRICT-Rel-AirframeLossRedImmediate (immediate red on airframe link loss) | NFT-RES-R7 asserts red after retry exhaustion; the "immediate red on link loss" path (no retries) is implicit | Low–Medium — depends on timing window between "link silent" and "considered lost" | Add `NFT-RES-AirframeImmediate` scenario in Step 9 (Decompose Tests) — sustained zero MAVLink traffic for N seconds → immediate health red (no retry phase) |
## Scenario index by file
| File | Scenarios | Read-back ID prefix |
|---|---|---|
| `blackbox-tests.md` | 26 positive + 4 negative | FT-P-001..FT-P-026, FT-N-001..FT-N-004 |
| `performance-tests.md` | 9 latency + 3 rate | NFT-PERF-L1..L9, NFT-PERF-T1..T3 |
| `resilience-tests.md` | 6 R-rows + 2 Mp-rows | NFT-RES-R4..R9, NFT-RES-Mp2, NFT-RES-Mp4 |
| `security-tests.md` | 10 SEC rows | NFT-SEC-O9, NFT-SEC-O10, NFT-SEC-CraftedFrame, NFT-SEC-OversizeCrop, NFT-SEC-VlmSchemaViolation, NFT-SEC-VlmFreeFormText, NFT-SEC-IpcPeerAuth, NFT-SEC-Tier1SchemaViolation, NFT-SEC-MavlinkUnsigned, NFT-SEC-HealthExposesSecurity |
| `resource-limit-tests.md` | 6 LIM rows | NFT-RES-LIM-Re1, Re2, Storage, CPU, GPU, FileHandles |
**Total scenarios authored**: 66.
## Open dependencies summary
| Dependency | Affects (scenarios) | Tracking |
|---|---|---|
| `<DEFERRED: gimbal.csv + telemetry.csv pairs>` | FT-P-007/008/009/010, NFT-PERF-L6/L7 | Leftover row "Gimbal CSV pairs" |
| `<DEFERRED: multi-season annotated datasets (concealed, footpath, new classes, existing baseline)>` | FT-P-002/003/004/005/006 | Leftover row "Concealed position image set + Footpath sequences + new-class eval set" |
| `<DEFERRED: SITL or HW capture for L4/L5/L8>` | NFT-PERF-L4/L5/L8 | Leftover row "MAVLink SITL traces" + camera frame sequences with zoom-band labelling |
| `<DEFERRED: missions API mock fixtures (Mp1/Mp3/Mp4)>` | FT-P-024/025, NFT-RES-Mp4 | Leftover row "Mock central area-map service responses" |
| `<DEFERRED: vlm-io-pairs (real recordings)>` | NFT-PERF-L3, FT-P-015 (S5), NFT-SEC-VlmSchemaViolation real-recording variant | Leftover row "Deep-analysis I/O pairs" |
| `<DEFERRED: operator-envelopes (Q9-blocked)>` | NFT-SEC-O9/O10, full semantics of FT-P-016 | Leftover row "Operator-command envelopes" + Q9 |
| `<DEFERRED: HW Jetson Orin Nano Super OR benchmarked replay>` | every Tier-HW scenario (L1, L2, L4, L5, L8, Re1, Re2, CPU, GPU) | Leftover does not enumerate HW directly — tracked via the project-level Acceptance Gate |
| `<DEFERRED: Q6 — MAVLink-2 signing decision>` | NFT-SEC-MavlinkUnsigned | architecture.md §8 Q6 |
| `<DEFERRED: Q8 — MapObjects conflict resolution rule>` | FT-P-026 (Mp5) | architecture.md §8 Q8 |
| `<DEFERRED: Q9 — operator-command auth scheme>` | NFT-SEC-O9/O10 full semantics | architecture.md §8 Q9 |
| `<DEFERRED: Q13 — per-season gates>` | AC-Gate-SeasonCoverage | architecture.md §8 Q13 |
| `<DEFERRED: Q14 — movement-detection classical vs learned-CV>` | FT-P-010 (M4) | architecture.md §8 Q14 |
When any of the above dependencies resolves, the corresponding leftover entry is replayed (per `tracker.mdc → Leftovers Mechanism`) and the affected scenarios' `Test status` lines move from `DEFERRED` to `READY` in the source files.
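
The `DEFERRED` to `READY` transition described above can be sketched as a tiny status model. Everything here is hypothetical: the `TestStatus` type, the parsing of the `Test status` value, and matching a resolved dependency by substring of the deferral reason are assumptions for illustration, not the leftover-replay mechanism defined in `tracker.mdc`.

```rust
#[derive(Debug, PartialEq)]
enum TestStatus {
    Ready,
    Deferred(String),
}

/// Parse the value of a `Test status` line: `READY`, or `DEFERRED`
/// followed by a dash separator and a free-text reason.
fn parse_status(value: &str) -> Option<TestStatus> {
    let v = value.trim();
    if v == "READY" {
        return Some(TestStatus::Ready);
    }
    let rest = v.strip_prefix("DEFERRED")?;
    // Drop the separator (U+2014) and surrounding whitespace.
    let reason = rest.trim_start().trim_start_matches('\u{2014}').trim();
    Some(TestStatus::Deferred(reason.to_string()))
}

/// Promote a deferred scenario once one of the resolved dependency
/// names appears in its reason (substring match is an assumption).
fn promote_if_resolved(status: TestStatus, resolved: &[&str]) -> TestStatus {
    match status {
        TestStatus::Deferred(reason) if resolved.iter().any(|d| reason.contains(*d)) => {
            TestStatus::Ready
        }
        other => other,
    }
}

fn main() {
    let status = parse_status("DEFERRED \u{2014} gimbal.csv pairs").unwrap();
    println!("before: {:?}", status);
    println!("after:  {:?}", promote_if_resolved(status, &["gimbal.csv"]));
}
```

A scenario whose reason names no resolved dependency keeps its `Deferred` status unchanged, which mirrors the rule that scenarios are moved, never removed.
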
## Phase 3 — Test Data & Expected Results Validation Gate Outcome
Recorded by `/test-spec` Phase 3 on 2026-05-19.
### Mechanical gate
Phase 3's mechanical contract is: every scenario MUST have either (a) a provided input + provided quantifiable expected result, OR (b) a behavioural trigger + observable behaviour + quantifiable pass/fail criterion. Scenarios that fail this contract are normally REMOVED. The 75 % final-coverage check then applies.
| Shape | Total scenarios | Quantifiable comparison declared | Input/trigger fully provided today | Input/trigger DEFERRED (release-gate item) |
|---|---|---|---|---|
| Input/output | 56 | 56 | 16 | 40 |
| Behavioural | 10 | 10 | 10 | 0 |
| **Total** | **66** | **66 (100 %)** | **26 (39 %)** | **40 (61 %)** |
Every scenario carries a `Comparison` method drawn from `.cursor/skills/test-spec/templates/expected-results.md` (`exact`, `numeric_tolerance`, `threshold_min/max`, `range`, `regex`, `substring`, `set_contains`, `json_diff`, `schema_match`, `file_reference`) — none of the 66 fail the quantifiability check.
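
To illustrate what a quantifiable comparison looks like, the sketch below models the numeric subset of the methods listed above. The enum, its field names, and the exact semantics (inclusive bounds, absolute tolerance) are assumptions; the authoritative definitions are in the `expected-results.md` template, which is not reproduced here. The pairing of acceptance criteria with methods in `main` is likewise only an example.

```rust
/// Numeric subset of the comparison methods; semantics are assumed.
enum Comparison {
    Exact(f64),
    NumericTolerance { expected: f64, tol: f64 },
    ThresholdMax(f64),
    ThresholdMin(f64),
    Range { lo: f64, hi: f64 },
}

impl Comparison {
    /// True when the observed value satisfies the pass criterion.
    fn passes(&self, observed: f64) -> bool {
        match *self {
            Comparison::Exact(e) => observed == e,
            Comparison::NumericTolerance { expected, tol } => (observed - expected).abs() <= tol,
            Comparison::ThresholdMax(max) => observed <= max,
            Comparison::ThresholdMin(min) => observed >= min,
            Comparison::Range { lo, hi } => (lo..=hi).contains(&observed),
        }
    }
}

fn main() {
    // Example pairings only; each scenario declares its own method.
    let checks = [
        ("AC-L1 latency ms", Comparison::ThresholdMax(100.0), 87.0),
        ("AC-T2 telemetry Hz", Comparison::Range { lo: 1.0, hi: 10.0 }, 10.0),
        ("AC-D1 precision", Comparison::ThresholdMin(0.80), 0.78),
    ];
    for (name, cmp, observed) in &checks {
        println!("{}: observed {} -> pass = {}", name, observed, cmp.passes(*observed));
    }
}
```

Non-numeric methods such as `regex`, `json_diff` or `schema_match` would need their own payload types and are left out of this sketch.
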
### Project-policy override (recorded 2026-05-19)
The Phase 3 75 % fixture-coverage gate is **intentionally overridden** for this project, per the decision recorded in `_docs/00_problem/input_data/expected_results/results_report.md → "Decision (project policy)"`:
> rather than block on the Phase 3 75 % gate, each deferred row is now registered with a structured `<DEFERRED:>` tag and surfaces in `data_parameters.md → "Gaps that block /test-spec downstream"`. `/test-spec` Phase 2 can author scenarios for all 56 rows; deferred rows become **release-gate items**, not development-gate items. The `acceptance_criteria.md → "Acceptance Gates (project-level)"` hardware/replay benchmark requirement is preserved as the hard release gate — that one is NOT being deferred.
Under this policy:
- **No scenarios are removed by Phase 3.** Every authored scenario remains in the spec; its `Test status` line in the source file (`blackbox-tests.md`, `performance-tests.md`, etc.) carries either `READY` or `DEFERRED — <reason>`.
- **Final coverage** is computed at the **scenario level**, not the fixture level. Per the matrix above:
  - AC coverage: 92.6 % ((43 + 0.5 × 1) / 47)
  - RESTRICT coverage: 86.7 % ((25 + 0.5 × 2) / 30)
  - **Total: 90.3 %** ((68 + 0.5 × 3) / 77), well above the 75 % gate.
- **Fixture acquisition** is tracked as a release-gate concern in `_docs/_process_leftovers/2026-05-19_autopilot_test_fixtures.md`; on every `/autodev` invocation the leftover-replay step re-evaluates whether any deferred fixture has landed and moves the affected scenarios from `DEFERRED` to `READY`.
- **The project-level Acceptance Gate** (`acceptance_criteria.md → "Acceptance Gates"` — HW/replay benchmark, per-season coverage, MAVLink SITL conformance) remains a hard release blocker. The override does NOT relax that gate.
### Phase 3 verdict
**PASSED** — scenario-level coverage 90.3 % ≥ 75 % gate; every scenario has a quantifiable comparison; deferred-fixture tracking handled via leftovers replay; no scenarios removed.
## Phase 4 — Test Runner Script Generation: SKIPPED in this invocation
Per `phases/04-runner-scripts.md → "Skip condition"`:
> If this skill was invoked from the `/plan` skill (planning context, no code exists yet), skip Phase 4 entirely. Script creation should instead be planned as a task during decompose — the decomposer creates a task for creating these scripts. Phase 4 only runs when invoked from the existing-code flow (where source code already exists) or standalone.
This invocation is greenfield Step 5 (Test Spec) and no source code exists yet — the `_docs/02_document/components/*/description.md` files describe 13 Rust components that the Implement step (Step 7) will create. Producing runner scripts here would write `scripts/run-tests.sh` and `scripts/run-performance-tests.sh` against a binary that does not yet exist.
**Handoff to Step 6 (Decompose)**: the decomposer MUST create at least two task specs covering the test runner scripts:
1. A task to create `scripts/run-tests.sh` (Tier B/E orchestration; calls `docker compose -f e2e/docker-compose.autopilot-e2e.yml up` and runs `cargo test --release --test scenarios` in `e2e-consumer`).
2. A task to create `scripts/run-performance-tests.sh` (Tier HW orchestration; per `environment.md → Hardware Execution Matrix`).
Both tasks should be tagged as part of the test-infrastructure decomposition (`Step 1t` of decompose tests-only mode) so they land before any Tier-B test scenarios are implemented.