autopilot/_docs/02_document/tests/performance-tests.md
Oleksandr Bezdieniezhnykh bc40ea7300 [AZ-626] Decompose complete: 47 tasks + docs + module layout
Greenfield Steps 1-6 baseline for the autopilot rewrite from legacy
Qt/C++ to a Rust workspace.

- Remove legacy Qt/C++ tree (ai_controller, drone_controller,
  misc/camera, python_scaffold, root Dockerfile, autopilot.pro,
  legacy main.py / requirements.txt).
- Add _docs/00_problem (problem, restrictions, acceptance criteria,
  security approach, input data + fixtures).
- Add _docs/01_solution/solution_draft01.
- Add _docs/02_document (architecture, system-flows, data_model,
  glossary, decision-rationale, deployment, 13 component descriptions,
  tests/ specs, FINAL_report, module-layout).
- Add _docs/02_tasks/todo with 47 task specs (AZ-640..AZ-686, one
  bootstrap + 46 component tasks) and _dependencies_table.md.
- Add .cursor/rules/artifact-srp.mdc (single-responsibility rule for
  canonical _docs artifacts).
- Track autodev state in _docs/_autodev_state.md (Step 6 completed,
  ready for Step 7 Implement).

Jira: bootstrap AZ-626; component epics AZ-627..AZ-639; tasks
AZ-640..AZ-686. Total complexity 173 points across 12 epics.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 11:02:01 +03:00

# Performance Tests
Authored by `/test-spec` Phase 2 (2026-05-19). Performance tests measure latency / rate / sustained-load characteristics. Functional behaviour that those characteristics enable lives in `blackbox-tests.md`. Resource ceilings live in `resource-limit-tests.md`.
Every scenario records steady-state metrics — cold-start measurements are explicitly excluded by a warm-up precondition. Pass criteria use the methods in `_docs/00_problem/input_data/expected_results/results_report.md` (referenced by row id).
---
## Latency
### NFT-PERF-L1: Tier-1 per-frame end-to-end latency ≤ 100 ms
**Summary**: Per-frame end-to-end latency through the Tier-1 contract (frame in → normalised-box record out) ≤ 100 ms at 1280 px input.
**Traces to**: AC `Latency — Primitive (Tier 1) object detection / L1`.
**Tier**: HW (representative Jetson Orin Nano Super) OR benchmarked replay. These are the only two ways to satisfy the project-level Acceptance Gate.
**Metric**: per-frame wall-clock from RTSP frame-receive timestamp to normalised-box emission timestamp.
**Preconditions**:
- Warm-up: 100 frames played before measurement starts (TensorRT engine warm, autopilot's frame pipeline in steady state).
- Single 1280 px frame replayed via `rtsp-loopback`; the live Tier-1 service is colocated on the same Jetson.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Play `fixtures/images/4d6e1830d211ad50.jpg` as a 60 s loop at 30 fps | record per-frame (frame_receive_ts, normalised_box_emit_ts); compute Δms |
| 2 | Aggregate over the measurement window | report p50, p95, p99, max |
**Pass criteria**: `p95 ≤ 100 ms` AND `max ≤ 150 ms` (max gives a soft headroom; AC enforces the p95 line).
**Duration**: 60 s after warm-up.
**Test status**: READY (fixture present); Tier requires HW for the release gate.
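
The aggregation in step 2 and the pass criteria above can be sketched as follows. This is a minimal illustration, not the project's harness: the function names are hypothetical, timestamps are assumed to be in milliseconds, and the nearest-rank percentile method is an assumption (the authoritative method is whatever `results_report.md` specifies).

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def summarise_latency(pairs_ms):
    """pairs_ms: iterable of (frame_receive_ts, normalised_box_emit_ts), both in ms."""
    deltas = [emit - recv for recv, emit in pairs_ms]
    return {
        "p50": percentile(deltas, 50),
        "p95": percentile(deltas, 95),
        "p99": percentile(deltas, 99),
        "max": max(deltas),
    }

def l1_passes(summary, p95_limit_ms=100.0, max_limit_ms=150.0):
    """NFT-PERF-L1 pass criteria: p95 <= 100 ms AND max <= 150 ms."""
    return summary["p95"] <= p95_limit_ms and summary["max"] <= max_limit_ms
```

The same `percentile` helper would serve the other latency scenarios, with only the limits changing.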
---
### NFT-PERF-L2: Tier-2 per-ROI semantic confirmation ≤ 200 ms
**Summary**: Per-ROI latency through Tier-2 semantic confirmation ≤ 200 ms.
**Traces to**: AC `Latency — Semantic confirmation (Tier 2) / L2`.
**Tier**: HW + Tier-B (inline ROI crop generation).
**Metric**: per-ROI wall-clock from ROI submission to Tier-2 until Tier-2 emits the semantic confirmation.
**Preconditions**:
- Warm-up: 50 ROIs processed before measurement.
- Test runner derives a ~640×640 ROI inline from `fixtures/images/4d6e1830d211ad50.jpg` and injects it directly into the SUT's Tier-2 entry (via a test-only ROI submission API exposed in test builds).
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Submit 1000 ROIs at 5 Hz | per-ROI Δms |
| 2 | Aggregate | p50, p95, p99 |
**Pass criteria**: `p95 ≤ 200 ms`.
**Duration**: 200 s.
**Test status**: READY.
---
### NFT-PERF-L3: Tier-3 deep-analysis ≤ 5 s per ROI
**Summary**: Per-ROI deep-analysis (Tier-3 / VLM, when enabled) ≤ 5 s.
**Traces to**: AC `Latency — Deep semantic confirmation (Tier 3 / VLM, when enabled) / L3`.
**Tier**: HW + Tier-B (vlm-mock).
**Metric**: per-ROI wall-clock from the SUT issuing a Tier-3 IPC call until the VLM response is received and schema-validated.
**Preconditions**:
- Warm-up: 5 Tier-3 calls.
- `vlm-mock` configured to respond from `vlm-io-pairs` fixture; Tier-3 enabled via SUT config.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Trigger 100 Tier-3 calls via injected ROIs | per-call Δms |
| 2 | Aggregate | p50, p95, p99 |
**Pass criteria**: `p95 ≤ 5000 ms`.
**Duration**: as needed for 100 calls.
**Test status**: DEFERRED — `<DEFERRED: vlm-io-pairs (real I/O) and the pinned local VLM model>`.
---
### NFT-PERF-L4: Camera zoom transition (medium → high) ≤ 2 s
**Summary**: Wall-clock from issuing the medium→high zoom command to the physical zoom transition completing ≤ 2 s, including the 12 s physical floor (restriction).
**Traces to**: AC `Latency — Camera zoom transition / L4`, RESTRICT `Hardware — 40× optical zoom traversal takes 12 s wall-clock`.
**Tier**: HW (physical A40 OR benchmarked replay) — pure-emulator runs not acceptable per `expected_results/results_report.md → Notes on this spec`.
**Metric**: wall-clock from outbound zoom command (observed on gimbal UDP) to gimbal-mock zoom telemetry reporting target_zoom_band.
**Preconditions**:
- SUT in `ZoomedIn` mode after a sweep-to-zoom transition; gimbal at medium zoom.
- HW Jetson OR `gimbal-mock` replaying recorded A40 zoom telemetry with realistic traversal time.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Trigger 30 medium→high zoom transitions via scripted POI sequence | per-transition Δms |
| 2 | Aggregate | p50, p95, max |
**Pass criteria**: `p95 ≤ 2000 ms`.
**Test status**: DEFERRED — `<DEFERRED: SITL or hardware-in-loop ViewPro A40 zoom command capture>`.
---
### NFT-PERF-L5: Decision-to-movement latency ≤ 500 ms
**Summary**: From the internal scan-control decision (POI detected mid-sweep) to the camera physically beginning to move ≤ 500 ms.
**Traces to**: AC `Latency — Decision-to-movement latency / L5`.
**Tier**: HW + Tier-B.
**Metric**: wall-clock from Tier-1 detection received at the scan-controller to first gimbal command observed on `gimbal-mock`.
**Preconditions**:
- Warm-up: 10 scripted POI events.
- Scripted scan-decision events followed by camera physical motion observed on the gimbal UDP channel.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Inject 100 POI detections at random sweep positions | per-event Δms (detection-receive-ts → gimbal-command-out-ts) |
| 2 | Aggregate | p95 |
**Pass criteria**: `p95 ≤ 500 ms`.
**Test status**: DEFERRED — `<DEFERRED: scripted scan decision events with paired gimbal telemetry capture>`.
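
Computing the per-event Δms requires pairing each injected detection with the gimbal command it caused. One possible pairing rule is sketched below. It is an assumption, not part of the spec: each detection is matched to the first command at or after it, and a command that arrives only after the next detection is not credited to the earlier one. The function name and the use of `None` for unmatched detections are hypothetical.

```python
import bisect

def pair_deltas(detect_ts, command_ts):
    """Latency from each detection to the first gimbal command at or after it.

    Timestamps share one clock and one unit (e.g. ms). A detection with no
    command before the next detection yields None, so misses stay visible.
    """
    detections = sorted(detect_ts)
    commands = sorted(command_ts)
    deltas = []
    for i, det in enumerate(detections):
        # Commands at or after the next detection belong to that later event.
        horizon = detections[i + 1] if i + 1 < len(detections) else float("inf")
        j = bisect.bisect_left(commands, det)
        if j < len(commands) and commands[j] < horizon:
            deltas.append(commands[j] - det)
        else:
            deltas.append(None)
    return deltas
```

How unmatched detections count against the p95 (as failures, or excluded and reported separately) is a reporting decision the spec does not state.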
---
### NFT-PERF-L6: Movement candidate enqueue ≤ 1 s (wide sweep)
**Summary**: From the movement event in the visual stream to candidate enqueued for zoomed inspection ≤ 1 s during the wide-area sweep.
**Traces to**: AC `Latency — Movement candidate enqueue / L6`.
**Tier**: B + E.
**Metric**: wall-clock from ground-truth movement-event timestamp (annotated in the fixture) to candidate appearing on operator-stream.
**Preconditions**:
- Warm-up: 30 s of sweep playback.
- Synchronised RTSP + gimbal.csv + telemetry.csv (DEFERRED CSV pair).
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Replay `fixtures/movement/video01.mp4` + paired CSVs | record per-event Δms |
| 2 | Aggregate over ~20 movement events | p95 |
**Pass criteria**: `p95 ≤ 1000 ms`.
**Test status**: DEFERRED — `<DEFERRED: paired gimbal.csv + telemetry.csv for video01.mp4 with annotated movement-event timestamps>`.
---
### NFT-PERF-L7: Movement candidate enqueue ≤ 1.5 s (zoomed-in)
**Summary**: Same as L6 but during a zoomed-in hold; budget relaxed to 1.5 s to accommodate gimbal slew.
**Traces to**: AC `Latency — Movement candidate enqueue … during the zoomed-in inspection / L7`.
**Tier**: B + E.
**Metric**: same as L6 but starting from a ZoomedIn hold.
**Preconditions**:
- SUT in ZoomedIn hold; small mover appears mid-hold.
- DEFERRED zoomed-in CSV pair.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Drive SUT into ZoomedIn hold; replay zoomed-in scene with small mover | per-event Δms |
| 2 | Aggregate over ~10 movement events | p95 |
**Pass criteria**: `p95 ≤ 1500 ms`.
**Test status**: DEFERRED — `<DEFERRED: paired gimbal.csv + telemetry.csv at zoomed-in band>`.
---
### NFT-PERF-L8: Zoom-out → zoom-in transition ≤ 2 s
**Summary**: From POI detected during sweep to ROI fully zoomed and held ≤ 2 s wall-clock.
**Traces to**: AC `Latency — Zoom-out → zoom-in transition / L8`.
**Tier**: HW + Tier-B.
**Metric**: wall-clock from Tier-1 detection injected → first frame at full zoom on the ROI (observed via gimbal-mock zoom telemetry and the operator-stream ROI overlay).
**Preconditions**:
- Warm-up.
- Scripted sweep + injected POI.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Inject 30 mid-sweep POIs | per-transition Δms |
| 2 | Aggregate | p95 |
**Pass criteria**: `p95 ≤ 2000 ms`.
**Test status**: DEFERRED — `<DEFERRED: sweep → zoomed-inspection transition capture with annotated transition-complete timestamps>`.
---
### NFT-PERF-L9: Operator command → action ≤ 500 ms
**Summary**: From operator click event (entering the SUT on the operator-stream return path) to the corresponding outbound command observed on its destination channel ≤ 500 ms. Modem RTT is explicitly excluded by measuring on the SUT side of the modem.
**Traces to**: AC `Latency — Operator command → action / L9`.
**Tier**: B + E.
**Metric**: wall-clock from operator-stream message arrival at SUT → first outbound command observed on the affected channel (MAVLink waypoint POST, gimbal command, mode-change emission).
**Preconditions**:
- Operator-session-scripts include click events at deterministic offsets.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Replay scripted operator-click sequence (50 clicks across confirm / decline / target-follow / abort) | per-click Δms |
| 2 | Aggregate | p95 |
**Pass criteria**: `p95 ≤ 500 ms`.
**Test status**: DEFERRED — `<DEFERRED: operator-envelopes once Q9 resolves>` for signed commands; happy-path placeholder usable today for an early measurement (mark interim baseline only).
---
## Throughput / Rate
### NFT-PERF-T1: POI rate to operator capped at ≤ 5 / min
**Summary**: Even when Tier-1 produces detections faster than the cap, the rate of POIs SURFACED to the operator MUST stay ≤ 5 / min (hard cap, frozen 2026-05-06).
**Traces to**: AC `Throughput / Rate — POI rate surfaced to the operator / T1`.
**Tier**: B.
**Metric**: count of POIs emitted on operator-stream per rolling 60 s window.
**Preconditions**:
- Synthetic POI feed sustained at 20 POIs / min via `synthetic-poi-feeds`.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Inject sustained 20 POI/min feed for 10 minutes | per-minute count of surfaced POIs |
| 2 | Compute max over any rolling 60 s window | rolling-max |
**Pass criteria**: `rolling-max ≤ 5` POIs/min for every 60 s window.
**Duration**: 10 min.
**Test status**: READY (synthetic feeds inline-authorable).
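
The rolling-max in step 2 can be computed with a two-pointer sweep over the surfaced-POI timestamps, as in the sketch below. The function name is hypothetical and the half-open window `[t, t + 60 s)` is an assumption; the spec only says "rolling 60 s window".

```python
def max_in_rolling_window(event_ts, window_s=60):
    """Largest number of events that fall inside any half-open window of window_s seconds."""
    ordered = sorted(event_ts)
    best = 0
    start = 0
    for end, ts in enumerate(ordered):
        # Advance the left edge until the span from ordered[start] to ts is < window_s.
        while ts - ordered[start] >= window_s:
            start += 1
        best = max(best, end - start + 1)
    return best
```

For example, POIs surfaced exactly every 12 s give a rolling-max of 5, which meets the cap, while the unthrottled 20 POI/min feed would give 20.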
---
### NFT-PERF-T2: Position telemetry rate ∈ [1 Hz, 10 Hz]
**Summary**: The position telemetry the SUT consumes from the airframe link MUST sustain ≥1 Hz, target 10 Hz, over a 60 s window.
**Traces to**: AC `Throughput / Rate — Position telemetry rate / T2`.
**Tier**: B (with MAVLink replay) + E (live SITL).
**Metric**: count of `GLOBAL_POSITION_INT` messages consumed by the SUT per second.
**Preconditions**:
- MAVLink stream replayed at the configured target rate (10 Hz).
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Replay 60 s of GLOBAL_POSITION_INT at 10 Hz | per-second consumed count |
| 2 | Aggregate | min, mean |
**Pass criteria**: `min ≥ 1 Hz` AND `mean ≥ 9.5 Hz` (target 10 Hz with ≤ 5 % tolerance).
**Test status**: DEFERRED — `<DEFERRED: MAVLink replay fixture over a 60 s window>`.
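
A sketch of the per-second aggregation follows, assuming message-consumption timestamps in seconds on a single clock. The function names are hypothetical. Seconds in which nothing was consumed are counted as 0, since otherwise a stall would not show up in `min`.

```python
from collections import Counter

def per_second_rate(msg_ts, window_start, window_s=60):
    """Return (min, mean) messages per 1 s bucket over [window_start, window_start + window_s)."""
    counts = Counter()
    for ts in msg_ts:
        bucket = int((ts - window_start) // 1)  # floor, so earlier timestamps go negative
        if 0 <= bucket < window_s:
            counts[bucket] += 1
    # Empty buckets must appear as 0 so that a stall lowers the minimum.
    per_bucket = [counts.get(b, 0) for b in range(window_s)]
    return min(per_bucket), sum(per_bucket) / window_s

def t2_passes(min_hz, mean_hz):
    """NFT-PERF-T2 pass criteria: min >= 1 Hz AND mean >= 9.5 Hz."""
    return min_hz >= 1 and mean_hz >= 9.5
```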
---
### NFT-PERF-T3: Frame-rate floor → suppress zoom-in + health yellow
**Summary**: When the sustained camera frame rate drops below 10 fps for ≥5 s, zoom-in transitions MUST be suppressed AND overall health MUST surface yellow.
**Traces to**: AC `Throughput / Rate — Sustained camera frame-rate floor / T3`.
**Tier**: B.
**Metric**: pair: (boolean — was a zoom-in suppressed during the low-FPS window?), (boolean — did health surface yellow?).
**Preconditions**:
- SUT in normal sweep mode.
- `rtsp-loopback` plays `fixtures/videos/94d42580bd1ad6ff.mp4` with throttled decode injecting frame drops to keep FPS < 10 for ≥ 5 s.
| Step | Consumer Action | Measurement |
|---|---|---|
| 1 | Start playback at normal 30 fps | health remains green; zoom-in proceeds normally on detection |
| 2 | Throttle decode + drop frames to push FPS below 10 for ≥ 5 s | record: (a) whether a zoom-in-required event during this window was suppressed; (b) whether `GET /health` returns `overall == "yellow"` |
**Pass criteria**: both observations TRUE.
**Duration**: 30 s (5 s low-FPS window + buffer).
**Test status**: READY (fixture present; throttling implemented by consumer).
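
Because the throttling is implemented by the consumer, the test runner may want to confirm that the low-FPS window really lasted ≥ 5 s before evaluating the two observations. A minimal sketch, assuming frame-delivery timestamps in seconds and 1 s counting buckets (both are assumptions; the spec does not say how the SUT itself measures sustained frame rate):

```python
import math

def longest_low_fps_run(frame_ts, start, duration_s, fps_floor=10):
    """Length in seconds of the longest run of consecutive 1 s buckets below fps_floor frames."""
    counts = [0] * duration_s
    for ts in frame_ts:
        bucket = math.floor(ts - start)
        if 0 <= bucket < duration_s:
            counts[bucket] += 1
    longest = current = 0
    for c in counts:
        current = current + 1 if c < fps_floor else 0
        longest = max(longest, current)
    return longest

def t3_passes(zoom_in_suppressed, health_overall):
    """NFT-PERF-T3 pass criteria: both observations TRUE."""
    return zoom_in_suppressed and health_overall == "yellow"
```

A run shorter than 5 s would mean the precondition was not met, so the result should be treated as invalid rather than as a failure of the SUT.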
---
## Sustained-load (handoff to resource-limit-tests)
The two sustained-resource AC rows (Re1, Re2) live as resource-limit tests rather than performance tests because the pass criterion is "stays within ceiling for the duration", not "is fast enough":
- Re1 — combined RSS ≤ 6 GB onboard for everything autopilot owns — see `resource-limit-tests.md → NFT-RES-LIM-Re1`.
- Re2 — Tier-1 per-frame latency Δ ≤ 5 ms when autopilot's workload runs concurrently — see `resource-limit-tests.md → NFT-RES-LIM-Re2`. Re2 is the Tier-1 non-degradation contract; the absolute Tier-1 latency target is L1.
---
## Common preconditions for every performance scenario
- **Warm-up**: every scenario MUST include an explicit warm-up phase whose duration is recorded in the CSV report. This separates cold-start cost from steady-state behaviour.
- **Steady-state window**: pass criteria apply only to the steady-state window (after warm-up), not to the warm-up itself.
- **Hardware honesty**: scenarios that name Tier HW MUST run on a representative Jetson Orin Nano Super OR on a benchmarked replay. Pure-x86-emulator runs report results but do NOT contribute to the project-level Acceptance Gate.
- **Concurrent workload disclosure**: every scenario records whether other autopilot subsystems were running concurrently (Tier-1 inference, VLM, MAVLink, etc.). Re2 is the only scenario that REQUIRES concurrent workload; the others MUST report it for context.
- **Seed + determinism**: where the test inputs involve randomness (e.g., synthetic-POI ordering tie-breakers), the seed is captured in the CSV report.
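
The warm-up, steady-state, disclosure and seed rules above could be applied by a runner roughly as follows. This is a sketch under assumptions: the CSV column names, the function names and the `;`-joined workload field are hypothetical, since the report schema is not defined in this file.

```python
import csv
import io

FIELDS = ["scenario_id", "warmup_s", "seed", "concurrent_workload", "p95_ms"]

def steady_state(samples, warmup_s):
    """samples: list of (timestamp_s, value). Drop everything inside the warm-up phase."""
    if not samples:
        return []
    t0 = samples[0][0]
    return [(t, v) for t, v in samples if t - t0 >= warmup_s]

def report_csv(scenario_id, warmup_s, seed, concurrent_workload, p95_ms):
    """Render a one-row CSV report that records warm-up duration, seed and concurrent workload."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow({
        "scenario_id": scenario_id,
        "warmup_s": warmup_s,
        "seed": seed,
        "concurrent_workload": ";".join(concurrent_workload),
        "p95_ms": p95_ms,
    })
    return buf.getvalue()
```

Pass criteria would then be evaluated only on the output of `steady_state`, while the warm-up duration still appears in the report.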