[AZ-234] [AZ-235] [AZ-236] [AZ-237] Add replay tests

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 17:21:13 +00:00 · 2026-05-05 06:24:10 +03:00
parent c30fd4f67d
commit 5acd14b792
12 changed files with 616 additions and 3 deletions
@@ -1,88 +0,0 @@
-# Replay Geolocation And Confidence Tests
-
-**Task**: AZ-234_replay_geolocation_confidence_tests
-**Name**: Replay Geolocation And Confidence Tests
-**Description**: Implement blackbox tests for still-image geolocation, confidence/source-label output, and replay latency smoke.
-**Complexity**: 3 points
-**Dependencies**: AZ-233_test_infrastructure
-**Component**: Blackbox Tests
-**Tracker**: AZ-234
-**Epic**: AZ-218
-
-## Problem
-
-The project needs deterministic blackbox evidence that the 60-image replay path emits WGS84 frame-center estimates with required confidence fields and latency metrics.
-
-## Outcome
-
-- Still-image replay reports per-frame coordinate error and aggregate threshold results.
-- Every emitted estimate includes covariance, source label, and anchor-age fields.
-- Replay smoke latency and dropped-frame metrics are captured in the shared report format.
-
-## Scope
-
-### Included
-
-- FT-P-01 Still-Image Frame Center Geolocation.
-- FT-P-02 Position Confidence Output Contract.
-- NFT-PERF-01 Per-Frame Latency On Project Still Images.
-- CSV and Markdown evidence output for these scenarios.
-
-### Excluded
-
-- Synchronized VIO video/IMU replay.
-- Satellite-anchor VPR/local matching.
-- Jetson-only release-gate profiling.
-
-## Acceptance Criteria
-
-**AC-1: Still-image coordinates are validated**
-Given the 60-image project fixture and expected frame-center coordinates
-When the replay test runs
-Then per-frame WGS84 error is reported and aggregate 50 m / 20 m thresholds are evaluated.
-
-**AC-2: Confidence output contract is validated**
-Given emitted position estimates from the replay
-When the test inspects public output fields
-Then each estimate includes WGS84 coordinates, 95% covariance semi-major axis, source label, and anchor age.
-
-**AC-3: Replay latency is measured**
-Given the still-image replay runs at the configured smoke rate
-When processing completes
-Then capture-to-output latency and dropped-frame rate are recorded with pass/fail or blocked status.
-
-## Non-Functional Requirements
-
-**Performance**
-- Replay smoke evidence includes p50/p95/p99 latency and dropped-frame rate.
-
-**Reliability**
-- Missing or invalid expected-coordinate fixtures fail fixture validation before scenario execution.
-
-## Unit Tests
-
-| AC Ref | What to Test | Required Outcome |
-|--------|--------------|------------------|
-| AC-1 | Expected-coordinate loader validation | Invalid coordinates are rejected before replay |
-| AC-2 | Report field validation | Missing confidence/source fields fail the scenario |
-| AC-3 | Latency metric aggregation | p50/p95/p99 and dropped-frame metrics are emitted |
-
-## Blackbox Tests
-
-| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
-|--------|-------------------------|--------------|-------------------|----------------|
-| AC-1 | `project_60_still_images`, `expected_frame_centers` | FT-P-01 | >=80% within 50 m and >=50% within 20 m or explicit failure | Reliability |
-| AC-2 | Same replay output | FT-P-02 | 100% of emitted estimates include required confidence fields | Reliability |
-| AC-3 | Replay smoke run | NFT-PERF-01 | Latency and drop-rate metrics are recorded | Performance |
-
-## Constraints
-
-- Tests must use public replay input and output artifacts only.
-- Input fixtures must be mounted read-only.
-- Blocked prerequisites must be reported as `blocked`, not `passed`.
-
-## Risks & Mitigation
-
-**Risk 1: Calibration limits are mistaken for product failure**
-- *Risk*: Fixture limits can make absolute accuracy inconclusive.
-- *Mitigation*: Report the fixture source and threshold basis with each failure.
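The aggregate check described for FT-P-01 in the removed spec (>=80% of frames within 50 m and >=50% within 20 m) could be evaluated roughly as follows. This is a minimal sketch and not code from this commit: `ThresholdResult` and `evaluate_frame_errors` are hypothetical names, and treating the 50 m and 20 m bounds as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ThresholdResult:
    within_50m: float  # fraction of frames with error <= 50 m
    within_20m: float  # fraction of frames with error <= 20 m
    passed: bool


def evaluate_frame_errors(errors_m: Sequence[float]) -> ThresholdResult:
    """Evaluate per-frame WGS84 errors (metres) against the aggregate thresholds.

    Hypothetical helper: inclusive bounds are an assumption, not taken from the spec.
    """
    if not errors_m:
        # An empty replay cannot be evaluated; surface it instead of passing silently.
        raise ValueError("no per-frame errors to evaluate")
    n = len(errors_m)
    within_50 = sum(e <= 50.0 for e in errors_m) / n
    within_20 = sum(e <= 20.0 for e in errors_m) / n
    passed = within_50 >= 0.80 and within_20 >= 0.50
    return ThresholdResult(within_50, within_20, passed)
```

Returning both fractions alongside the verdict keeps the threshold basis available for the failure report that the risk mitigation above asks for.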
@@ -1,89 +0,0 @@
-# VIO Replay Performance Tests
-
-**Task**: AZ-235_vio_replay_performance_tests
-**Name**: VIO Replay Performance Tests
-**Description**: Implement synchronized video/IMU replay tests for VIO output, covariance evidence, and replay performance metrics.
-**Complexity**: 5 points
-**Dependencies**: AZ-233_test_infrastructure, AZ-240_native_vio_backend_integration
-**Component**: Blackbox Tests
-**Tracker**: AZ-235
-**Epic**: AZ-218
-
-## Problem
-
-The runtime needs blackbox evidence that synchronized navigation video and flight-controller telemetry can drive VIO/wrapper output with honest confidence and measurable performance.
-
-This test task must run after AZ-240 so it validates the real native VIO path rather than the deterministic scaffold.
-
-## Outcome
-
-- Derkachi video/telemetry fixture alignment is validated before replay.
-- Synchronized replay produces frame-by-frame output or a clear blocked/failure reason.
-- Latency, completion rate, memory, trajectory comparison, and calibration-gated checks are reported.
-
-## Scope
-
-### Included
-
-- FT-P-03 BASALT VIO Replay With Synchronized Video/Telemetry.
-- NFT-PERF-02 BASALT + Wrapper Replay Latency.
-- Public/representative dataset prerequisite reporting.
-
-### Excluded
-
-- Satellite-anchor local verification.
-- SITL spoofing/failsafe scenarios.
-- Thermal/endurance release gates.
-
-## Acceptance Criteria
-
-**AC-1: Replay fixture alignment is validated**
-Given the Derkachi MP4 and telemetry CSV
-When fixture validation runs
-Then duration, frame-to-telemetry ratio, and timestamp monotonicity are verified before replay.
-
-**AC-2: Synchronized replay emits estimates**
-Given a valid synchronized video/IMU replay fixture
-When replay executes
-Then estimates are emitted frame-by-frame with source labels, covariance, and segment evidence.
-
-**AC-3: VIO performance evidence is reported**
-Given replay completed or blocked
-When reporting finishes
-Then latency, completion rate, memory, and calibration/public-dataset prerequisite status are written.
-
-## Non-Functional Requirements
-
-**Performance**
-- Reports include per-frame latency and memory metrics where the environment can measure them.
-
-**Reliability**
-- Calibration-gated absolute accuracy checks must be marked explicitly instead of silently passing.
-
-## Unit Tests
-
-| AC Ref | What to Test | Required Outcome |
-|--------|--------------|------------------|
-| AC-1 | Video/telemetry validator | Invalid duration or timestamp alignment blocks replay |
-| AC-2 | Replay result parser | Missing per-frame confidence fields fail the scenario |
-| AC-3 | Calibration gate reporting | Missing calibration/public data is reported as blocked |
-
-## Blackbox Tests
-
-| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
-|--------|-------------------------|--------------|-------------------|----------------|
-| AC-1 | `derkachi_video_telemetry` | FT-P-03 fixture validation | Fixture accepted only when alignment rules pass | Reliability |
-| AC-2 | Valid synchronized replay | FT-P-03 output | Continuous estimates for normal overlapping segments or explicit degradation | Reliability |
-| AC-3 | Replay performance run | NFT-PERF-02 | Latency, completion rate, and memory evidence are recorded | Performance |
-
-## Constraints
-
-- Tests must not import BASALT/OpenVINS/Kimera internals directly.
-- Public/representative datasets are optional prerequisites and may produce blocked results.
-- Raw input video and telemetry fixtures remain read-only.
-
-## Risks & Mitigation
-
-**Risk 1: Hardware or dataset prerequisites are unavailable**
-- *Risk*: The scenario cannot produce final accuracy evidence locally.
-- *Mitigation*: Emit blocked results with exact missing prerequisite and continue other scenario groups.
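The fixture checks named in AC-1 of the removed spec (duration, frame-to-telemetry ratio, timestamp monotonicity) might be sketched as below. Everything here is illustrative: `validate_alignment`, its parameters, and the default tolerances are assumptions, since the spec does not state the actual alignment rules.

```python
from typing import List, Sequence


def validate_alignment(
    video_duration_s: float,
    frame_count: int,
    telemetry_ts_s: Sequence[float],
    *,
    expected_ratio: float,
    ratio_tolerance: float = 0.05,   # assumed relative tolerance
    max_duration_gap_s: float = 1.0  # assumed absolute tolerance
) -> List[str]:
    """Return a list of alignment problems; an empty list means the fixture is accepted."""
    if len(telemetry_ts_s) < 2:
        return ["telemetry has fewer than two samples"]
    problems: List[str] = []
    # Timestamp monotonicity: every sample must be later than the previous one.
    if any(b <= a for a, b in zip(telemetry_ts_s, telemetry_ts_s[1:])):
        problems.append("telemetry timestamps are not strictly increasing")
    # Duration: telemetry span should roughly match the video length.
    telemetry_duration = telemetry_ts_s[-1] - telemetry_ts_s[0]
    if abs(telemetry_duration - video_duration_s) > max_duration_gap_s:
        problems.append("video and telemetry durations differ")
    # Frame-to-telemetry ratio: frames per telemetry row should match expectation.
    ratio = frame_count / len(telemetry_ts_s)
    if abs(ratio - expected_ratio) > ratio_tolerance * expected_ratio:
        problems.append("frame-to-telemetry ratio out of tolerance")
    return problems
```

Collecting every problem rather than stopping at the first one would let a blocked result list the exact failing rule, in line with the mitigation above.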
@@ -1,102 +0,0 @@
-# Satellite Anchor Cache Tests
-
-**Task**: AZ-236_satellite_anchor_cache_tests
-**Name**: Satellite Anchor Cache Tests
-**Description**: Implement blackbox, security, and performance tests for satellite-anchor retrieval, local verification, cache integrity, and no in-flight external access.
-**Complexity**: 5 points
-**Dependencies**: AZ-233_test_infrastructure, AZ-241_real_satellite_vpr_descriptor_retrieval, AZ-242_real_anchor_feature_matching_ransac
-**Component**: Blackbox Tests
-**Tracker**: AZ-236
-**Epic**: AZ-218
-
-## Problem
-
-Satellite anchors and cache fixtures are safety-critical: invalid, stale, poisoned, or externally fetched data must not become trusted localization output.
-
-This test task must run after AZ-241 and AZ-242 so it validates real local VPR retrieval and real anchor feature matching rather than scaffold evidence gates.
-
-## Outcome
-
-- Accepted anchors include retrieval, matching, geometry, freshness, and provenance evidence.
-- Invalid/stale/poisoned cache fixtures cannot produce trusted anchors or trusted generated tiles.
-- No in-flight Satellite Service or provider access occurs when cache data is missing.
-
-## Scope
-
-### Included
-
-- FT-P-04 Satellite Service And Anchor Verification.
-- FT-N-01 Repetitive Or Low-Texture Imagery.
-- FT-N-03 Invalid Or Stale Satellite Cache.
-- NFT-PERF-03 Relocalization Trigger Path Latency.
-- NFT-RES-04 Tile Cache Freshness Degradation.
-- NFT-SEC-01 Signed Cache Manifest Enforcement.
-- NFT-SEC-02 Cache Poisoning Write Gate.
-- NFT-SEC-04 No In-Flight Satellite Provider Access.
-- NFT-RES-LIM-03 Satellite Cache Storage Budget.
-
-### Excluded
-
-- VIO synchronized replay.
-- MAVLink spoofing/failsafe behavior.
-- Jetson thermal endurance.
-
-## Acceptance Criteria
-
-**AC-1: Verified anchors include evidence**
-Given a valid local cache/index fixture and relocalization trigger
-When retrieval and verification run
-Then accepted anchors include candidate IDs, scores, MRE, inliers, covariance, and tile provenance.
-
-**AC-2: Unsafe candidates are rejected**
-Given low-texture, stale, unsigned, hash-mismatched, or low-resolution fixtures
-When anchor/cache tests run
-Then no invalid candidate emits a trusted `satellite_anchored` estimate or trusted generated tile.
-
-**AC-3: No in-flight external access occurs**
-Given flight-mode replay with missing cache data
-When relocalization is requested
-Then the system reports degraded/no-candidate behavior without satellite-provider or Suite service network calls.
-
-**AC-4: Cache and trigger-path metrics are reported**
-Given cache and relocalization scenarios complete
-When reporting finishes
-Then latency, MRE, trust level, freshness, and storage-budget evidence are written.
-
-## Non-Functional Requirements
-
-**Security**
-- Invalid cache data must not be trusted or promoted.
-
-**Performance**
-- Trigger-path latency and bounded top-K behavior are measured.
-
-## Unit Tests
-
-| AC Ref | What to Test | Required Outcome |
-|--------|--------------|------------------|
-| AC-1 | Anchor evidence parser | Required evidence fields are present |
-| AC-2 | Invalid cache fixture generator | Stale/unsigned/hash-mismatched fixtures are produced deterministically |
-| AC-3 | Network-block assertion | Unexpected external calls fail the scenario |
-| AC-4 | Cache metrics report | Latency, freshness, and storage metrics are present |
-
-## Blackbox Tests
-
-| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
-|--------|-------------------------|--------------|-------------------|----------------|
-| AC-1 | Public/cache fixture | FT-P-04 | Accepted anchors meet MRE/evidence requirements | Performance |
-| AC-2 | Ambiguous and invalid cache fixtures | FT-N-01, FT-N-03, NFT-SEC-01, NFT-SEC-02 | 0 unsafe trusted outputs | Security |
-| AC-3 | Network-blocked flight-mode replay | NFT-SEC-04 | Missing cache causes degraded behavior, not fetch | Security |
-| AC-4 | Relocalization/cache runs | NFT-PERF-03, NFT-RES-04, NFT-RES-LIM-03 | Metrics and storage evidence are recorded | Performance |
-
-## Constraints
-
-- Tests must use local preloaded cache/index fixtures only.
-- External network access during flight-mode scenarios is a failure.
-- VPAir and UZH FPV licensing must be respected before use as commercial acceptance evidence.
-
-## Risks & Mitigation
-
-**Risk 1: Dataset licensing blocks final anchor evidence**
-- *Risk*: Public dataset terms prevent commercial acceptance use.
-- *Mitigation*: Mark dataset-specific checks blocked and keep generated cache fixtures for deterministic security coverage.
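The anchor-evidence check behind AC-1 of the removed spec (candidate IDs, scores, MRE, inliers, covariance, tile provenance) could look something like this. The key names and the `missing_evidence_fields` helper are hypothetical; the spec lists the evidence items but not the output schema.

```python
from typing import Any, List, Mapping

# Hypothetical key names for the evidence items listed in AC-1.
REQUIRED_EVIDENCE = (
    "candidate_id",
    "score",
    "mre",
    "inliers",
    "covariance",
    "tile_provenance",
)


def missing_evidence_fields(anchor: Mapping[str, Any]) -> List[str]:
    """Return the required evidence keys that are absent or None in an accepted anchor."""
    return [name for name in REQUIRED_EVIDENCE if anchor.get(name) is None]
```

A scenario would then fail whenever an accepted anchor yields a non-empty list, which matches the unit-test row "Required evidence fields are present".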
@@ -1,94 +0,0 @@
-# MAVLink Blackout Spoofing Tests
-
-**Task**: AZ-237_mavlink_blackout_spoofing_tests
-**Name**: MAVLink Blackout Spoofing Tests
-**Description**: Implement SITL/replay tests for visual blackout, spoofed GPS, MAVLink source validation, degraded covariance, no-fix thresholds, and QGC status.
-**Complexity**: 5 points
-**Dependencies**: AZ-233_test_infrastructure
-**Component**: Blackbox Tests
-**Tracker**: AZ-237
-**Epic**: AZ-218
-
-## Problem
-
-The system must prove that spoofed GPS and unauthorized MAVLink messages cannot override estimator state during visual blackout or degraded operation.
-
-## Outcome
-
-- Blackout and spoofing traces drive visible degraded-mode transitions.
-- Covariance, `GPS_INPUT`, QGC status, and FDR evidence match the safety thresholds.
-- Unauthorized MAVLink sources are rejected and recorded.
-
-## Scope
-
-### Included
-
-- FT-N-02 GPS Spoofing During Total Visual Blackout.
-- NFT-RES-01 Total Visual Blackout With GPS Spoofing.
-- NFT-SEC-03 MAVLink Source And Spoofing Rejection.
-
-### Excluded
-
-- Still-image geolocation accuracy.
-- Satellite-anchor cache poisoning.
-- Cold-start and restart trials.
-
-## Acceptance Criteria
-
-**AC-1: Blackout transitions to dead reckoning**
-Given a replay/SITL trace with total camera blackout and spoofed GPS
-When the scenario runs
-Then the system enters `dead_reckoned` mode within the required frame or timing threshold.
-
-**AC-2: Degraded output thresholds are enforced**
-Given blackout continues beyond configured thresholds
-When estimates are emitted
-Then covariance grows monotonically and `GPS_INPUT` fields degrade to no-fix/failsafe values at the specified limits.
-
-**AC-3: Spoofed or unauthorized MAVLink inputs are rejected**
-Given spoofed real-GPS measurements or unauthorized MAVLink source IDs
-When messages arrive during normal or blackout operation
-Then no confident position estimate is produced from those inputs.
-
-**AC-4: Operator and FDR evidence is visible**
-Given degraded-mode transitions occur
-When reporting completes
-Then QGC status and FDR evidence show promotion, demotion, blackout, and failsafe events at expected rates.
-
-## Non-Functional Requirements
-
-**Safety**
-- Spoofed GPS must not be promoted during blackout without the documented recovery gates.
-
-**Reliability**
-- Missing SITL prerequisites are reported as blocked with exact setup evidence.
-
-## Unit Tests
-
-| AC Ref | What to Test | Required Outcome |
-|--------|--------------|------------------|
-| AC-1 | Scenario trigger builder | Blackout and spoofing events are generated deterministically |
-| AC-2 | Threshold assertion logic | Fix type, covariance, and `horiz_accuracy` thresholds are checked |
-| AC-3 | MAVLink source filter assertion | Unauthorized source messages fail the scenario |
-| AC-4 | Status/FDR parser | Expected status events and rates are validated |
-
-## Blackbox Tests
-
-| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
-|--------|-------------------------|--------------|-------------------|----------------|
-| AC-1 | SITL or replay spoofing trace | FT-N-02, NFT-RES-01 | Dead-reckoned transition within timing threshold | Safety |
-| AC-2 | Continued blackout | FT-N-02, NFT-RES-01 | Monotonic covariance and no-fix/failsafe fields | Safety |
-| AC-3 | Unauthorized/spoofed MAVLink messages | NFT-SEC-03 | No confident estimate from bad source | Safety |
-| AC-4 | QGC/FDR outputs | FT-N-02, NFT-SEC-03 | Status and evidence are visible and rate-limited | Reliability |
-
-## Constraints
-
-- ArduPilot Plane SITL is the authoritative autopilot target.
-- v1 asserts `GPS_INPUT` output and intentional absence of ODOMETRY.
-- Tests must not depend on Mission Planner or PX4 behavior.
-
-## Risks & Mitigation
-
-**Risk 1: SITL setup varies by environment**
-- *Risk*: Local runs may not have SITL installed or configured.
-- *Mitigation*: Report blocked prerequisites clearly and keep replay-level assertions runnable where possible.
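The "covariance grows monotonically" assertion from AC-2 of the removed spec could be checked at replay level with a helper along these lines. It is a sketch only: `covariance_grows_monotonically` is a hypothetical name, the input is assumed to be one scalar per emitted estimate (for example the 95% semi-major axis), and whether equal consecutive values are acceptable is not stated in the spec, so it is exposed as a flag.

```python
from typing import Sequence


def covariance_grows_monotonically(values: Sequence[float], *, strict: bool = False) -> bool:
    """Check that a per-estimate covariance scalar never shrinks during blackout.

    With strict=True, equal consecutive values also count as a violation.
    """
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(later > earlier for earlier, later in pairs)
    return all(later >= earlier for earlier, later in pairs)
```

Because it needs only the emitted estimates, a check like this stays runnable when SITL is unavailable, which fits the mitigation above.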