Oleksandr Bezdieniezhnykh 2d6d44af5d [AZ-424] [AZ-425] [AZ-426] Implement negatives set (FT-N-01/03/04)
Adds three pure-logic evaluators + scenarios + unit tests covering the
project's failure-mode robustness ladder (AC-3.1, AC-3.4, AC-3.5,
AC-NEW-8):

* outlier_tolerance_evaluator (AZ-424 / FT-N-01): per-event 50 m drift
  bound + 3-frame covariance-monotonic window over the AZ-408 outlier
  injector's medium-density manifest.
* outage_request_evaluator (AZ-425 / FT-N-03): detects 3+ consecutive
  missing-frame windows; validates OPERATOR_RELOC_REQUEST STATUSTEXT
  arrives at 2 s ±500 ms, dead_reckoned label during outage, and no
  FC EKF divergence.
* blackout_spoof_evaluator (AZ-426 / FT-N-04): eight-AC ladder across
  the 5 s / 15 s / 35 s sub-windows — switch latency, spoof rejection,
  monotonic covariance, honest horiz_accuracy, STATUSTEXT 1-2 Hz,
  35 s escalation thresholds, and recovery gate.

Each scenario is skip-gated on the AZ-441 / AZ-407 / AZ-416 replay /
SITL / mavproxy helpers; unit tests (14 + 18 + 29 = 61) cover the
AC logic today. Full e2e unit-test suite: 527 passed (+67).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-17 08:26:16 +03:00


Batch 73 Report — Test Implementation (cycle 1, batch 7 of test phase)

Batch: 73
Date: 2026-05-17
Context: Test implementation (greenfield Step 10, Implement Tests)
Tasks: AZ-424 (3pt), AZ-425 (3pt), AZ-426 (5pt); 11 cp / 3 tasks
Cycle: 1
Verdict: COMPLETE, PASS (self-reviewed; see reviews/batch_73_review.md)

Summary

This batch implements the negatives set (FT-N-01 / FT-N-03 / FT-N-04), the project's failure-mode robustness suite (AC-3.1, AC-3.4, AC-3.5, AC-NEW-8). It follows the same pattern as the prior batches in this phase:

  • Pure-logic evaluator under e2e/runner/helpers/ (everything the scenario can express without docker-bound SITL access).
  • Scenario file under e2e/tests/negative/, parameterised across conftest fixtures, skip-gated on upstream replay / FDR / mavproxy / SITL observer helpers (auto-activates when AZ-441 + AZ-407 + AZ-416 leftovers land).
  • Helper-driven unit test file under e2e/_unit_tests/helpers/.
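The skip-gating in the second bullet can be pictured with a small stdlib-only sketch. It assumes the gate is decided by whether the upstream helper modules are importable; the module names below are hypothetical placeholders, not the project's real import paths, and the actual scenarios may gate on something else.

```python
import importlib.util

# Hypothetical module names standing in for the AZ-441 / AZ-407 / AZ-416
# helpers; the real import paths are not given in this report.
UPSTREAM_HELPERS = (
    "runner.helpers.replay_driver",
    "runner.helpers.sitl_observer",
    "runner.helpers.mavproxy_reader",
)


def missing_helpers(module_names):
    """Return the subset of module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # A missing parent package raises instead of returning None.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# A scenario would skip while this list is non-empty and run automatically
# once every helper has landed.
SKIP_REASON = "upstream helpers missing: " + ", ".join(
    missing_helpers(UPSTREAM_HELPERS)
)
```

In a pytest scenario file the result would typically feed `pytest.mark.skipif(bool(missing), reason=...)`, which is what lets a scenario auto-activate without further edits.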

AZ-424 — FT-N-01 350 m outlier injection tolerance (3pt)

  • runner/helpers/outlier_tolerance_evaluator.py — three invariants:
    • AC-1: count gate — MIN_OUTLIER_COUNT = 10 outliers across the Derkachi 8-min --density medium replay (the AC-3.1 envelope).
    • AC-2: per-event drift bound: error_after_outlier - error_before_outlier ≤ DRIFT_BUDGET_M = 50.0. before / after are the immediate neighbour frames in the outbound stream; distance_m is the shared Vincenty helper.
    • AC-3: covariance monotonic across the 3-frame window centred on the outlier (COVARIANCE_WINDOW_FRAMES = 3).
    • Plus load_outlier_manifest (reads the AZ-408 injector's manifest.csv) and write_csv_evidence.
  • tests/negative/test_ft_n_01_outlier_tolerance.py — scenario indirect-parametrises outlier_injection_derkachi at density="medium", seed=0, drives replay, collects FDR outbound_estimate records, joins them to per-frame GT, evaluates, asserts per-event passes_drift + passes_covariance plus the aggregate passes_count. Records NFR metrics ft_n_01.total_outliers, ft_n_01.failed_event_count, per-event drift_m + cov_non_decreasing.
  • 14 unit tests in test_outlier_tolerance_evaluator.py.
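As a rough illustration of AC-2 and AC-3 above, the two per-event checks could look like the sketch below. The constants come from this report; the `OutlierEvent` record and its field names are hypothetical, and the real evaluator derives the errors from FDR records and GT via distance_m rather than taking them as inputs.

```python
from dataclasses import dataclass
from typing import Tuple

DRIFT_BUDGET_M = 50.0          # AC-2 budget, from the report
COVARIANCE_WINDOW_FRAMES = 3   # AC-3 window size, from the report


@dataclass(frozen=True)
class OutlierEvent:
    """Hypothetical per-event record; field names are illustrative only."""
    error_before_m: float            # position error on the frame before the outlier
    error_after_m: float             # position error on the frame after the outlier
    cov_window: Tuple[float, ...]    # covariance over the frames centred on the outlier


def passes_drift(event: OutlierEvent) -> bool:
    """AC-2: the outlier may add at most DRIFT_BUDGET_M of error."""
    return event.error_after_m - event.error_before_m <= DRIFT_BUDGET_M


def passes_covariance(event: OutlierEvent) -> bool:
    """AC-3: covariance is non-decreasing across the 3-frame window."""
    window = event.cov_window
    if len(window) != COVARIANCE_WINDOW_FRAMES:
        return False  # assumption: an incomplete window counts as a failure
    return all(later >= earlier for earlier, later in zip(window, window[1:]))
```

Treating an incomplete window as a failure is a design choice made for this sketch; the real helper may handle events at the edges of the replay differently.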

AZ-425 — FT-N-03 Extended outage triggers operator re-loc request (3pt)

  • runner/helpers/outage_request_evaluator.py — first detects outage windows from frame-index gaps (≥MIN_OUTAGE_FRAMES = 3 consecutive missing frames), then per-window evaluates:
    • AC-2: STATUSTEXT OPERATOR_RELOC_REQUEST observed at [OUTAGE_THRESHOLD_S - TOLERANCE_S, OUTAGE_THRESHOLD_S + TOLERANCE_S] = [1.5, 2.5] s after outage onset.
    • AC-3: at least one source_label = dead_reckoned outbound emission inside the window.
    • AC-4: zero FC-side EKF divergence events inside the window (observable via SITL state read).
    • Plus detect_outage_windows (with explicit handling for trailing windows + multi-window flights) and write_csv_evidence.
  • tests/negative/test_ft_n_03_outage_reloc.py — scenario drives replay with a 3-frame outage injector (a future thin extension of the AZ-408 outlier injector), reads FDR frame_received + outbound_estimate records to reconstruct expected_frame_indices and the estimate stream, walks the mavproxy .tlog for STATUSTEXT, and pulls EKF divergence events via sitl_observer.read_ekf_divergence_events(). Records per-window NFR metrics with AC IDs (length_frames, statustext_offset_ms, dead_reckoned_count, ekf_divergence_count).
  • 18 unit tests in test_outage_request_evaluator.py.
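The gap detection and the AC-2 timing check described above can be sketched as follows. The constants match the report (OUTAGE_THRESHOLD_S = 2.0 and TOLERANCE_S = 0.5 follow from the [1.5, 2.5] s interval). The function signatures and the (first_missing, last_missing) return shape are assumptions, so this `detect_outage_windows` is a stand-in and not the helper's real interface.

```python
MIN_OUTAGE_FRAMES = 3     # from the report
OUTAGE_THRESHOLD_S = 2.0  # implied by the [1.5, 2.5] s interval
TOLERANCE_S = 0.5


def detect_outage_windows(expected_indices, received_indices):
    """Return (first_missing, last_missing) for every run of at least
    MIN_OUTAGE_FRAMES consecutive expected frames that never arrived."""
    received = set(received_indices)
    windows = []
    run = []
    for idx in expected_indices:
        if idx not in received:
            run.append(idx)
            continue
        if len(run) >= MIN_OUTAGE_FRAMES:
            windows.append((run[0], run[-1]))
        run = []
    # Trailing window: the flight ended while frames were still missing.
    if len(run) >= MIN_OUTAGE_FRAMES:
        windows.append((run[0], run[-1]))
    return windows


def statustext_on_time(outage_onset_s, statustext_s):
    """AC-2: OPERATOR_RELOC_REQUEST lands 2 s +/- 500 ms after outage onset."""
    offset_s = statustext_s - outage_onset_s
    return (OUTAGE_THRESHOLD_S - TOLERANCE_S
            <= offset_s
            <= OUTAGE_THRESHOLD_S + TOLERANCE_S)
```

For example, with expected frames 0..9 and received frames 0, 1, 5, 6, the sketch reports two windows: (2, 4) mid-flight and (7, 9) as a trailing window.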

AZ-426 — FT-N-04 Visual blackout + spoofed GPS combined failsafe (5pt)

  • runner/helpers/blackout_spoof_evaluator.py: the most ladder-heavy evaluator in the project, with eight per-AC sub-reports stitched into one BlackoutSpoofReport. Constants are pulled into the module header so the spec can be diffed against code in one place: SWITCH_LATENCY_MS = 400 (AC-1), HONEST_ACCURACY_RATIO = 0.95 (AC-4), STATUSTEXT_RATE_MIN_HZ = 1.0 / STATUSTEXT_RATE_MAX_HZ = 2.0 (AC-5), ESCALATION_COV_2D_M = 100.0 (AC-6), ESCALATION_COV_FAILSAFE_M = 500.0, ESCALATION_DURATION_FAILSAFE_S = 30.0, ESCALATION_LATENCY_MS = 500 (AC-7), RECOVERY_STABLE_S = 10.0 (AC-8). Per-AC analysers:
    • evaluate_switch_latency: budget = min(SWITCH_LATENCY_MS, frame_period_ms) — the spec's "≤1 frame OR ≤400 ms (whichever is shorter)" wording, made explicit.
    • evaluate_spoof_rejection: requires both ≥1 FDR spoof-rejected event AND zero satellite_anchored emissions inside the window (so the SUT cannot silently re-promote on a spoofed lock).
    • evaluate_covariance_monotonic: reports the timestamp of the first sample that violates the non-decreasing covariance invariant, plus a binary pass.
    • evaluate_honest_accuracy: per-sample horiz_accuracy ≥ 0.95 × cov_semi_major_m. Boundary test pins the spec budget.
    • evaluate_statustext_rate: VISUAL_BLACKOUT_IMU_ONLY rate over the window must land in [1, 2] Hz.
    • evaluate_escalation (35 s window only): AC-6 fix_type degrades on the first cov-100 m crossing; AC-7 triggers on the earliest of cov-500 m crossing OR 30 s duration. Non-35 s windows pass vacuously — they aren't expected to hit either threshold.
    • evaluate_recovery_gate: AC-8 — ≥10 s of healthy + non-spoofed FC GPS + a consistency-check pass before re-promoting to satellite_anchored post-window.
  • tests/negative/test_ft_n_04_blackout_spoof.py — scenario indirect-parametrises blackout_spoof_derkachi over _WINDOW_LADDER_S = (5.0, 15.0, 35.0) with ids ["5s", "15s", "35s"]. Collects FDR outbound_estimate + spoof_rejected, mavproxy STATUSTEXT, and SITL GPS-health + consistency-check samples. Asserts each AC with a descriptive failure message that surfaces the relevant sub-report fields.
  • 29 unit tests in test_blackout_spoof_evaluator.py.
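Four of the analysers above reduce to short threshold checks, sketched here under stated assumptions. The constants are the ones listed in the module header; the function names, the input shapes, the count-over-duration rate, and the use of ≥ at the escalation thresholds are all assumptions and not the evaluator's actual code.

```python
SWITCH_LATENCY_MS = 400
HONEST_ACCURACY_RATIO = 0.95
STATUSTEXT_RATE_MIN_HZ = 1.0
STATUSTEXT_RATE_MAX_HZ = 2.0
ESCALATION_COV_FAILSAFE_M = 500.0
ESCALATION_DURATION_FAILSAFE_S = 30.0


def switch_latency_budget_ms(frame_period_ms):
    """AC-1: one frame or 400 ms, whichever is shorter."""
    return min(SWITCH_LATENCY_MS, frame_period_ms)


def honest_accuracy_ok(samples):
    """AC-4: samples are (horiz_accuracy_m, cov_semi_major_m) pairs; the
    reported accuracy may not understate the covariance by more than 5%."""
    return all(h >= HONEST_ACCURACY_RATIO * c for h, c in samples)


def statustext_rate_ok(message_count, window_s):
    """AC-5: mean rate over the window must fall in [1, 2] Hz.
    Assumes the rate is a simple count over the window duration."""
    rate_hz = message_count / window_s
    return STATUSTEXT_RATE_MIN_HZ <= rate_hz <= STATUSTEXT_RATE_MAX_HZ


def failsafe_trigger_time_s(samples):
    """AC-7: samples are time-ordered (t_since_onset_s, cov_m) pairs.
    Returns the first sample time at which either the 500 m covariance
    threshold or the 30 s duration threshold is reached, else None."""
    for t_s, cov_m in samples:
        if (cov_m >= ESCALATION_COV_FAILSAFE_M
                or t_s >= ESCALATION_DURATION_FAILSAFE_S):
            return t_s
    return None
```

Returning None when neither threshold is reached mirrors the note that the 5 s and 15 s windows pass vacuously: a caller can treat a missing trigger as "nothing to check" for those windows.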

Layout invariant

e2e/_unit_tests/test_directory_layout.py now lists the three new evaluators and the three new scenario files.
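A layout invariant of this kind amounts to checking a fixed list of repository-relative paths. The sketch below is a plain-function stand-in for the parametrized test, not its actual code: the six paths are assembled from the file names in this report, and `missing_required_paths` is a hypothetical helper.

```python
from pathlib import Path

# Paths assembled from the helper and scenario names listed in this report.
REQUIRED_PATHS = (
    "e2e/runner/helpers/outlier_tolerance_evaluator.py",
    "e2e/runner/helpers/outage_request_evaluator.py",
    "e2e/runner/helpers/blackout_spoof_evaluator.py",
    "e2e/tests/negative/test_ft_n_01_outlier_tolerance.py",
    "e2e/tests/negative/test_ft_n_03_outage_reloc.py",
    "e2e/tests/negative/test_ft_n_04_blackout_spoof.py",
)


def missing_required_paths(repo_root, required=REQUIRED_PATHS):
    """Return every required path that is not a regular file under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_file()]
```

Run against the repository root, an empty result means the layout invariant holds; each returned entry would correspond to one failing parametrized case.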

Test Results

  • New unit tests: 14 + 18 + 29 = 61.
  • Plus 6 new entries in test_required_path_exists parametrize (3 helpers + 3 scenarios).
  • Full e2e/_unit_tests suite: 527 passed in 130 s (previous cumulative: 460 → +67 net).
  • Scenario collection across the three negatives: 48 items parametrized; the session-end /e2e-results/evidence/per-nfr teardown error is the same pre-existing nfr_recorder wart documented in batches 69-72 — not a regression of this batch and not blocking unit-suite collection.

State

  • Specs moved: _docs/02_tasks/todo/AZ-{424,425,426}_*.md → _docs/02_tasks/done/.
  • _docs/_autodev_state.md advanced to last_completed_batch: 73.
  • Cumulative review window: last_cumulative_review = batches_70-72; the next K=3 cumulative review fires at the end of batch 75.