[AZ-424] [AZ-425] [AZ-426] Implement negatives set (FT-N-01/03/04)

Adds three pure-logic evaluators + scenarios + unit tests covering the
project's failure-mode robustness ladder (AC-3.1, AC-3.4, AC-3.5,
AC-NEW-8):

* outlier_tolerance_evaluator (AZ-424 / FT-N-01): per-event 50 m drift
  bound + 3-frame covariance-monotonic window over the AZ-408 outlier
  injector's medium-density manifest.
* outage_request_evaluator (AZ-425 / FT-N-03): detects 3+ consecutive
  missing-frame windows; validates OPERATOR_RELOC_REQUEST STATUSTEXT
  arrives at 2 s ±500 ms, dead_reckoned label during outage, and no
  FC EKF divergence.
* blackout_spoof_evaluator (AZ-426 / FT-N-04): eight-AC ladder across
  the 5 s / 15 s / 35 s sub-windows — switch latency, spoof rejection,
  monotonic covariance, honest horiz_accuracy, STATUSTEXT 1-2 Hz,
  35 s escalation thresholds, and recovery gate.

Each scenario is skip-gated on the AZ-441 / AZ-407 / AZ-416 replay /
SITL / mavproxy helpers; unit tests (14 + 18 + 29 = 61) cover the
AC logic today. Full e2e unit-test suite: 527 passed (+67).

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-17 08:26:16 +03:00
parent a644debdb7
commit 2d6d44af5d
16 changed files with 3343 additions and 1 deletions
@@ -0,0 +1,141 @@
# Batch 73 Report — Test Implementation (cycle 1, batch 7 of test phase)
**Batch**: 73
**Date**: 2026-05-17
**Context**: Test implementation (greenfield Step 10 — Implement Tests)
**Tasks**: AZ-424 (3pt), AZ-425 (3pt), AZ-426 (5pt) — 11 cp / 3 tasks
**Cycle**: 1
**Verdict**: COMPLETE — PASS (self-reviewed; see `reviews/batch_73_review.md`)
## Summary
This batch implements the negatives set (FT-N-01 / FT-N-03 / FT-N-04),
the project's failure-mode robustness suite (AC-3.1, AC-3.4, AC-3.5,
AC-NEW-8). It follows the same pattern as the prior batches in this phase:
* Pure-logic evaluator under `e2e/runner/helpers/` (everything the
scenario can express without docker-bound SITL access).
* Scenario file under `e2e/tests/negative/`, parameterised across
conftest fixtures, skip-gated on upstream replay / FDR / mavproxy
/ SITL observer helpers (auto-activates when AZ-441 + AZ-407 +
AZ-416 leftovers land).
* Helper-driven unit test file under `e2e/_unit_tests/helpers/`.
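The skip-gating described above could be sketched as follows. This is a minimal illustration, not the project's actual gate: the module names in `UPSTREAM_HELPERS` are hypothetical (the report only names the AZ-441 / AZ-407 / AZ-416 helpers by role), and both function names are assumptions.

```python
import importlib.util

# Hypothetical module names: the report identifies the upstream replay, FDR,
# mavproxy and SITL observer helpers by role only, not by import path.
UPSTREAM_HELPERS = ("replay_driver", "fdr_reader", "mavproxy_tlog", "sitl_observer")


def missing_helpers(names=UPSTREAM_HELPERS):
    """Return the top-level helper modules that cannot be imported yet."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def skip_reason(names=UPSTREAM_HELPERS):
    """Return a skip message, or None once every helper is importable."""
    missing = missing_helpers(names)
    if not missing:
        return None
    return "upstream helpers not landed: " + ", ".join(missing)
```

A scenario module could feed this into `pytest.mark.skipif(skip_reason() is not None, reason=skip_reason() or "")`, so the scenario starts running without edits once the helpers become importable.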
### AZ-424 — FT-N-01 350 m outlier injection tolerance (3pt)
* **`runner/helpers/outlier_tolerance_evaluator.py`** — three
invariants:
- AC-1: count gate — `MIN_OUTLIER_COUNT = 10` outliers across the
Derkachi 8-min `--density medium` replay (the AC-3.1 envelope).
- AC-2: per-event drift bound, `error_after_outlier -
error_before_outlier ≤ DRIFT_BUDGET_M = 50.0`. `before` / `after`
are the immediate neighbour frames in the outbound stream;
`distance_m` is the shared Vincenty helper.
- AC-3: covariance monotonic across the 3-frame window centred on
the outlier (`COVARIANCE_WINDOW_FRAMES = 3`).
- Plus `load_outlier_manifest` (reads the AZ-408 injector's
`manifest.csv`) and `write_csv_evidence`.
* **`tests/negative/test_ft_n_01_outlier_tolerance.py`** — scenario
indirect-parametrises `outlier_injection_derkachi` at
`density="medium", seed=0`, drives replay, collects FDR
`outbound_estimate` records, joins them to per-frame GT, evaluates,
asserts per-event `passes_drift` + `passes_covariance` plus the
aggregate `passes_count`. Records NFR metrics
`ft_n_01.total_outliers`, `ft_n_01.failed_event_count`, per-event
`drift_m` + `cov_non_decreasing`.
* **14 unit tests** in `test_outlier_tolerance_evaluator.py`.
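As a rough illustration of the three invariants, the checks might look like the sketch below. The constants come from this report; the `OutlierEvent` record, its field names, and the reading of "monotonic" as non-decreasing (suggested by the `cov_non_decreasing` metric) are assumptions, not the evaluator's actual API.

```python
from dataclasses import dataclass
from typing import Sequence

MIN_OUTLIER_COUNT = 10        # AC-1
DRIFT_BUDGET_M = 50.0         # AC-2
COVARIANCE_WINDOW_FRAMES = 3  # AC-3


@dataclass(frozen=True)
class OutlierEvent:
    """Hypothetical per-event record; field names are illustrative only."""
    error_before_m: float        # position error on the frame before the outlier
    error_after_m: float         # position error on the frame after the outlier
    cov_window: Sequence[float]  # covariance over the frames centred on the outlier


def passes_count(events):
    """AC-1: the replay must contain at least MIN_OUTLIER_COUNT outliers."""
    return len(events) >= MIN_OUTLIER_COUNT


def passes_drift(event):
    """AC-2: error growth across the outlier stays within the drift budget."""
    return event.error_after_m - event.error_before_m <= DRIFT_BUDGET_M


def passes_covariance(event):
    """AC-3: covariance is non-decreasing across the 3-frame window (assumed reading)."""
    window = event.cov_window
    if len(window) != COVARIANCE_WINDOW_FRAMES:
        return False
    return all(later >= earlier for earlier, later in zip(window, window[1:]))
```

Under these assumptions an event with errors of 12 m before and 40 m after passes the drift check (28 m of growth), while one growing by 50.5 m fails.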
### AZ-425 — FT-N-03 Extended outage triggers operator re-loc request (3pt)
* **`runner/helpers/outage_request_evaluator.py`** — first detects
outage windows from frame-index gaps (≥`MIN_OUTAGE_FRAMES = 3`
consecutive missing frames), then per-window evaluates:
- AC-2: STATUSTEXT `OPERATOR_RELOC_REQUEST` observed at
`[OUTAGE_THRESHOLD_S - TOLERANCE_S, OUTAGE_THRESHOLD_S +
TOLERANCE_S] = [1.5, 2.5] s` after outage onset.
- AC-3: at least one `source_label = dead_reckoned` outbound
emission inside the window.
- AC-4: zero FC-side EKF divergence events inside the window
(observable via SITL state read).
- Plus `detect_outage_windows` (with explicit handling for trailing
windows + multi-window flights) and `write_csv_evidence`.
* **`tests/negative/test_ft_n_03_outage_reloc.py`** — scenario drives
replay with a 3-frame outage injector (a future thin extension of
the AZ-408 outlier injector), reads FDR `frame_received` +
`outbound_estimate` records to reconstruct
`expected_frame_indices` and the estimate stream, walks the
mavproxy `.tlog` for STATUSTEXT, and pulls EKF divergence events
via `sitl_observer.read_ekf_divergence_events()`. Records per-window
NFR metrics with AC IDs (`length_frames`, `statustext_offset_ms`,
`dead_reckoned_count`, `ekf_divergence_count`).
* **18 unit tests** in `test_outage_request_evaluator.py`.
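The gap detection and the AC-2 timing window can be sketched as below. The constant values follow from the `[1.5, 2.5] s` window and the 3-frame minimum stated above; the function names and signatures are assumptions and are deliberately not the evaluator's own `detect_outage_windows`.

```python
MIN_OUTAGE_FRAMES = 3
OUTAGE_THRESHOLD_S = 2.0
TOLERANCE_S = 0.5


def detect_outage_windows_sketch(expected_frame_indices, received_frame_indices):
    """Return (first_missing, last_missing) pairs for each run of at least
    MIN_OUTAGE_FRAMES consecutive missing frames."""
    received = set(received_frame_indices)
    windows = []
    run = []  # current run of consecutive missing frame indices
    for index in expected_frame_indices:
        if index not in received:
            run.append(index)
            continue
        if len(run) >= MIN_OUTAGE_FRAMES:
            windows.append((run[0], run[-1]))
        run = []
    # A run that reaches the end of the flight is a trailing window.
    if len(run) >= MIN_OUTAGE_FRAMES:
        windows.append((run[0], run[-1]))
    return windows


def statustext_in_window(offset_s):
    """AC-2: the request must arrive 1.5 s to 2.5 s after outage onset."""
    low = OUTAGE_THRESHOLD_S - TOLERANCE_S
    high = OUTAGE_THRESHOLD_S + TOLERANCE_S
    return low <= offset_s <= high
```

For example, with ten expected frames and only frames 0, 1, 5 and 6 received, the sketch reports one mid-flight window (2 to 4) and one trailing window (7 to 9); a two-frame gap is ignored.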
### AZ-426 — FT-N-04 Visual blackout + spoofed GPS combined failsafe (5pt)
* **`runner/helpers/blackout_spoof_evaluator.py`** — the most ladder-
heavy evaluator in the project: eight per-AC sub-reports stitched
into one `BlackoutSpoofReport`. Constants pulled into the module
header so the spec can be diffed against code in one place:
`SWITCH_LATENCY_MS = 400` (AC-1),
`HONEST_ACCURACY_RATIO = 0.95` (AC-4),
`STATUSTEXT_RATE_MIN_HZ = 1.0` / `STATUSTEXT_RATE_MAX_HZ = 2.0` (AC-5),
`ESCALATION_COV_2D_M = 100.0` (AC-6),
`ESCALATION_COV_FAILSAFE_M = 500.0`, `ESCALATION_DURATION_FAILSAFE_S = 30.0`,
`ESCALATION_LATENCY_MS = 500` (AC-7),
`RECOVERY_STABLE_S = 10.0` (AC-8).
Per-AC analysers:
- `evaluate_switch_latency`: budget = `min(SWITCH_LATENCY_MS,
frame_period_ms)` — the spec's "≤1 frame OR ≤400 ms (whichever is
shorter)" wording, made explicit.
- `evaluate_spoof_rejection`: requires both ≥1 FDR
`spoof-rejected` event AND zero `satellite_anchored` emissions
inside the window (so the SUT cannot silently re-promote on a
spoofed lock).
- `evaluate_covariance_monotonic`: reports the timestamp of the first
sample that breaks the non-decreasing covariance invariant, plus a
binary pass flag.
- `evaluate_honest_accuracy`: per-sample `horiz_accuracy ≥ 0.95 ×
cov_semi_major_m`. Boundary test pins the spec budget.
- `evaluate_statustext_rate`: `VISUAL_BLACKOUT_IMU_ONLY` rate over
the window must land in [1, 2] Hz.
- `evaluate_escalation` (35 s window only): AC-6 fix_type degrades
on the first cov-100 m crossing; AC-7 triggers on the earliest
of cov-500 m crossing OR 30 s duration. Non-35 s windows pass
vacuously — they aren't expected to hit either threshold.
- `evaluate_recovery_gate`: AC-8 — ≥10 s of healthy + non-spoofed
FC GPS + a consistency-check pass before re-promoting to
`satellite_anchored` post-window.
* **`tests/negative/test_ft_n_04_blackout_spoof.py`** — scenario
indirect-parametrises `blackout_spoof_derkachi` over
`_WINDOW_LADDER_S = (5.0, 15.0, 35.0)` with ids `["5s", "15s",
"35s"]`. Collects FDR `outbound_estimate` + `spoof_rejected`,
mavproxy STATUSTEXT, and SITL GPS-health + consistency-check
samples. Asserts each AC with a descriptive failure message that
surfaces the relevant sub-report fields.
* **29 unit tests** in `test_blackout_spoof_evaluator.py`.
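A few of the per-AC rules above reduce to short arithmetic checks, sketched here. The constants are the ones listed in this report; the function names, argument shapes, the count-over-window definition of the STATUSTEXT rate, and the use of `>=` for a threshold crossing are assumptions rather than the evaluator's actual behaviour.

```python
SWITCH_LATENCY_MS = 400                # AC-1
HONEST_ACCURACY_RATIO = 0.95           # AC-4
STATUSTEXT_RATE_MIN_HZ = 1.0           # AC-5
STATUSTEXT_RATE_MAX_HZ = 2.0           # AC-5
ESCALATION_COV_FAILSAFE_M = 500.0      # AC-7
ESCALATION_DURATION_FAILSAFE_S = 30.0  # AC-7


def switch_latency_budget_ms(frame_period_ms):
    """AC-1: one frame or 400 ms, whichever is shorter."""
    return min(SWITCH_LATENCY_MS, frame_period_ms)


def honest_accuracy_ok(samples):
    """AC-4: samples are (horiz_accuracy_m, cov_semi_major_m) pairs."""
    return all(acc >= HONEST_ACCURACY_RATIO * cov for acc, cov in samples)


def statustext_rate_ok(message_count, window_s):
    """AC-5: mean message rate over the window must land in [1, 2] Hz."""
    rate_hz = message_count / window_s
    return STATUSTEXT_RATE_MIN_HZ <= rate_hz <= STATUSTEXT_RATE_MAX_HZ


def failsafe_trigger_s(cov_samples, window_s):
    """AC-7: earliest of the 500 m covariance crossing or 30 s of blackout.

    cov_samples holds (elapsed_s, cov_m) pairs ordered by time. Returns None
    when neither threshold is reached, which is the vacuous-pass case for the
    shorter windows.
    """
    crossing = next(
        (t for t, cov in cov_samples if cov >= ESCALATION_COV_FAILSAFE_M), None
    )
    timeout = (
        ESCALATION_DURATION_FAILSAFE_S
        if window_s >= ESCALATION_DURATION_FAILSAFE_S
        else None
    )
    candidates = [t for t in (crossing, timeout) if t is not None]
    return min(candidates) if candidates else None
```

With a 1000 ms frame period the switch budget is capped at 400 ms, whereas a 333 ms frame period makes the single-frame bound the tighter one. In a 35 s window whose covariance never reaches 500 m, the sketch falls back to the 30 s duration trigger.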
## Layout invariant
`e2e/_unit_tests/test_directory_layout.py` now lists the three new
evaluators and the three new scenario files.
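
A layout check of this kind can be approximated as below. The six relative paths are the ones named in this report; the helper name `missing_required_paths` and the idea of resolving them against an `e2e` root argument are assumptions, not the contents of `test_directory_layout.py`.

```python
from pathlib import Path

# The three evaluators and three scenario files added by this batch,
# relative to the e2e directory.
REQUIRED_PATHS = (
    "runner/helpers/outlier_tolerance_evaluator.py",
    "runner/helpers/outage_request_evaluator.py",
    "runner/helpers/blackout_spoof_evaluator.py",
    "tests/negative/test_ft_n_01_outlier_tolerance.py",
    "tests/negative/test_ft_n_03_outage_reloc.py",
    "tests/negative/test_ft_n_04_blackout_spoof.py",
)


def missing_required_paths(e2e_root, required=REQUIRED_PATHS):
    """Return the required relative paths that are not regular files."""
    root = Path(e2e_root)
    return [rel for rel in required if not (root / rel).is_file()]
```

A parametrized test could assert one path per case, which would account for the six new `test_required_path_exists` entries reported below.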
## Test Results
* New unit tests: 14 + 18 + 29 = **61**.
* Plus 6 new entries in `test_required_path_exists` parametrize
(3 helpers + 3 scenarios).
* Full `e2e/_unit_tests` suite: **527 passed in 130 s** (previous
cumulative: 460 → +67 net).
* Scenario collection across the three negatives: 48 parametrized
items collected. The session-end `/e2e-results/evidence/per-nfr`
teardown error is the pre-existing `nfr_recorder` issue documented in
batches 69-72; it is not a regression from this batch and does not
block unit-suite collection.
## State
* Specs moved: `_docs/02_tasks/todo/AZ-{424,425,426}_*.md` →
`_docs/02_tasks/done/`.
* `_docs/_autodev_state.md` advanced to
`last_completed_batch: 73`.
* Cumulative review window: `last_cumulative_review = batches_70-72`;
the next K=3 cumulative review fires at the end of batch 75.