[AZ-234] [AZ-235] [AZ-236] [AZ-237] Add replay tests

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-05 06:24:10 +03:00
parent c30fd4f67d
commit 5acd14b792
12 changed files with 616 additions and 3 deletions
@@ -1,88 +0,0 @@
# Replay Geolocation And Confidence Tests
**Task**: AZ-234_replay_geolocation_confidence_tests
**Name**: Replay Geolocation And Confidence Tests
**Description**: Implement blackbox tests for still-image geolocation, confidence/source-label output, and replay latency smoke.
**Complexity**: 3 points
**Dependencies**: AZ-233_test_infrastructure
**Component**: Blackbox Tests
**Tracker**: AZ-234
**Epic**: AZ-218
## Problem
The project needs deterministic blackbox evidence that the 60-image replay path emits WGS84 frame-center estimates with required confidence fields and latency metrics.
## Outcome
- Still-image replay reports per-frame coordinate error and aggregate threshold results.
- Every emitted estimate includes covariance, source label, and anchor-age fields.
- Replay smoke latency and dropped-frame metrics are captured in the shared report format.
## Scope
### Included
- FT-P-01 Still-Image Frame Center Geolocation.
- FT-P-02 Position Confidence Output Contract.
- NFT-PERF-01 Per-Frame Latency On Project Still Images.
- CSV and Markdown evidence output for these scenarios.
### Excluded
- Synchronized VIO video/IMU replay.
- Satellite-anchor VPR/local matching.
- Jetson-only release-gate profiling.
## Acceptance Criteria
**AC-1: Still-image coordinates are validated**
Given the 60-image project fixture and expected frame-center coordinates
When the replay test runs
Then per-frame WGS84 error is reported and the aggregate thresholds (at least 80% of frames within 50 m and at least 50% within 20 m) are evaluated.
**AC-2: Confidence output contract is validated**
Given emitted position estimates from the replay
When the test inspects public output fields
Then each estimate includes WGS84 coordinates, 95% covariance semi-major axis, source label, and anchor age.
**AC-3: Replay latency is measured**
Given the still-image replay runs at the configured smoke rate
When processing completes
Then capture-to-output latency and dropped-frame rate are recorded with pass/fail or blocked status.
## Non-Functional Requirements
**Performance**
- Replay smoke evidence includes p50/p95/p99 latency and dropped-frame rate.
**Reliability**
- Missing or invalid expected-coordinate fixtures fail fixture validation before scenario execution.
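A minimal sketch of the p50/p95/p99 and dropped-frame aggregation named above. The function names, the result keys, and the nearest-rank percentile method are illustrative assumptions; the task does not specify how percentiles are computed.

```python
def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list; q is an integer 1..100."""
    if not sorted_values:
        raise ValueError("no latency samples")
    n = len(sorted_values)
    # Integer ceiling of q * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (q * n + 99) // 100)
    return sorted_values[rank - 1]


def summarize_latency(latencies_ms, frames_in, frames_out):
    """Aggregate capture-to-output latencies and the dropped-frame rate.

    latencies_ms: one latency per emitted frame (hypothetical input shape).
    frames_in / frames_out: frames offered to the replay and frames emitted.
    """
    if frames_in <= 0:
        raise ValueError("frames_in must be positive")
    ordered = sorted(latencies_ms)
    return {
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "dropped_frame_rate": (frames_in - frames_out) / frames_in,
    }
```

Nearest-rank always returns an observed sample, which keeps small smoke runs easy to audit against the per-frame CSV.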
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|--------------|------------------|
| AC-1 | Expected-coordinate loader validation | Invalid coordinates are rejected before replay |
| AC-2 | Report field validation | Missing confidence/source fields fail the scenario |
| AC-3 | Latency metric aggregation | p50/p95/p99 and dropped-frame metrics are emitted |
## Blackbox Tests
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|-------------------------|--------------|-------------------|----------------|
| AC-1 | `project_60_still_images`, `expected_frame_centers` | FT-P-01 | >=80% of frames within 50 m and >=50% within 20 m; otherwise an explicit failure is reported | Reliability |
| AC-2 | Same replay output | FT-P-02 | 100% of emitted estimates include required confidence fields | Reliability |
| AC-3 | Replay smoke run | NFT-PERF-01 | Latency and drop-rate metrics are recorded | Performance |
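The FT-P-01 row can be sketched as a per-frame great-circle error followed by the two aggregate checks. This is an illustration only: the haversine formula on a spherical Earth, the inclusive `<=` comparison, and all function and key names are assumptions, not the project's method.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def evaluate_thresholds(errors_m):
    """Apply the >=80% within 50 m and >=50% within 20 m rule to per-frame errors."""
    if not errors_m:
        raise ValueError("no per-frame errors to evaluate")
    n = len(errors_m)
    within_50 = sum(e <= 50.0 for e in errors_m) / n
    within_20 = sum(e <= 20.0 for e in errors_m) / n
    return {
        "within_50m": within_50,
        "within_20m": within_20,
        "passed": within_50 >= 0.80 and within_20 >= 0.50,
    }
```

At the 50 m scale the spherical approximation error is far below the thresholds, so a full ellipsoidal geodesic is not needed for this check.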
## Constraints
- Tests must use public replay input and output artifacts only.
- Input fixtures must be mounted read-only.
- Blocked prerequisites must be reported as `blocked`, not `passed`.
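One way to keep `blocked` from collapsing into `passed` is to resolve the status from the prerequisite check first and from the assertions second. The sketch below assumes a three-value status; the `failed` label and the `resolve_status` helper are hypothetical.

```python
from enum import Enum


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"    # assumed label; the spec only names `blocked` and `passed`
    BLOCKED = "blocked"


def resolve_status(missing_prerequisites, assertions_ok):
    """Return BLOCKED whenever any prerequisite is missing, regardless of assertions."""
    if missing_prerequisites:
        return Status.BLOCKED
    return Status.PASSED if assertions_ok else Status.FAILED
```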
## Risks & Mitigation
**Risk 1: Calibration limits are mistaken for product failure**
- *Risk*: Fixture limits can make absolute accuracy inconclusive.
- *Mitigation*: Report the fixture source and threshold basis with each failure.
@@ -1,89 +0,0 @@
# VIO Replay Performance Tests
**Task**: AZ-235_vio_replay_performance_tests
**Name**: VIO Replay Performance Tests
**Description**: Implement synchronized video/IMU replay tests for VIO output, covariance evidence, and replay performance metrics.
**Complexity**: 5 points
**Dependencies**: AZ-233_test_infrastructure, AZ-240_native_vio_backend_integration
**Component**: Blackbox Tests
**Tracker**: AZ-235
**Epic**: AZ-218
## Problem
The runtime needs blackbox evidence that synchronized navigation video and flight-controller telemetry can drive VIO/wrapper output with confidence values that reflect actual estimate uncertainty and with measurable performance.
This test task must run after AZ-240 so it validates the real native VIO path rather than the deterministic scaffold.
## Outcome
- Derkachi video/telemetry fixture alignment is validated before replay.
- Synchronized replay produces frame-by-frame output or a clear blocked/failure reason.
- Latency, completion rate, memory, trajectory comparison, and calibration-gated checks are reported.
## Scope
### Included
- FT-P-03 BASALT VIO Replay With Synchronized Video/Telemetry.
- NFT-PERF-02 BASALT + Wrapper Replay Latency.
- Public/representative dataset prerequisite reporting.
### Excluded
- Satellite-anchor local verification.
- SITL spoofing/failsafe scenarios.
- Thermal/endurance release gates.
## Acceptance Criteria
**AC-1: Replay fixture alignment is validated**
Given the Derkachi MP4 and telemetry CSV
When fixture validation runs
Then duration, frame-to-telemetry ratio, and timestamp monotonicity are verified before replay.
**AC-2: Synchronized replay emits estimates**
Given a valid synchronized video/IMU replay fixture
When replay executes
Then estimates are emitted frame-by-frame with source labels, covariance, and segment evidence.
**AC-3: VIO performance evidence is reported**
Given replay has completed or been blocked
When reporting finishes
Then latency, completion rate, memory, and calibration/public-dataset prerequisite status are written.
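The AC-1 fixture checks (duration, frame-to-telemetry ratio, timestamp monotonicity) can be sketched as a pure function that returns a list of problems. The tolerances, the expected ratio, the strict-increase rule, and the function name are assumptions for illustration; the real values would come from the fixture definition.

```python
def validate_alignment(video_duration_s, frame_count, telemetry_ts_s,
                       expected_ratio, duration_tol_s=1.0, ratio_tol=0.05):
    """Return a list of alignment problems; an empty list means the fixture is accepted.

    telemetry_ts_s: telemetry timestamps in seconds, in file order.
    expected_ratio: expected video frames per telemetry row (hypothetical parameter).
    """
    if len(telemetry_ts_s) < 2:
        return ["telemetry has fewer than two samples"]

    problems = []
    # Monotonicity: every timestamp must be later than the one before it.
    if any(b <= a for a, b in zip(telemetry_ts_s, telemetry_ts_s[1:])):
        problems.append("telemetry timestamps are not strictly increasing")

    # Duration: video and telemetry must cover roughly the same time span.
    telemetry_duration_s = telemetry_ts_s[-1] - telemetry_ts_s[0]
    if abs(video_duration_s - telemetry_duration_s) > duration_tol_s:
        problems.append("video and telemetry durations differ beyond tolerance")

    # Ratio: frames per telemetry row must match the expected sampling relationship.
    ratio = frame_count / len(telemetry_ts_s)
    if abs(ratio - expected_ratio) > ratio_tol * expected_ratio:
        problems.append("frame-to-telemetry ratio is outside tolerance")

    return problems
```

Returning every problem rather than stopping at the first one lets the report state exactly why the fixture blocked the replay.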
## Non-Functional Requirements
**Performance**
- Reports include per-frame latency and memory metrics where the environment can measure them.
**Reliability**
- Calibration-gated absolute accuracy checks must be marked explicitly instead of silently passing.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|--------------|------------------|
| AC-1 | Video/telemetry validator | Invalid duration or timestamp alignment blocks replay |
| AC-2 | Replay result parser | Missing per-frame confidence fields fail the scenario |
| AC-3 | Calibration gate reporting | Missing calibration/public data is reported as blocked |
## Blackbox Tests
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|-------------------------|--------------|-------------------|----------------|
| AC-1 | `derkachi_video_telemetry` | FT-P-03 fixture validation | Fixture accepted only when alignment rules pass | Reliability |
| AC-2 | Valid synchronized replay | FT-P-03 output | Continuous estimates for normal overlapping segments or explicit degradation | Reliability |
| AC-3 | Replay performance run | NFT-PERF-02 | Latency, completion rate, and memory evidence are recorded | Performance |
## Constraints
- Tests must not import BASALT/OpenVINS/Kimera internals directly.
- Public/representative datasets are optional prerequisites and may produce blocked results.
- Raw input video and telemetry fixtures remain read-only.
## Risks & Mitigation
**Risk 1: Hardware or dataset prerequisites are unavailable**
- *Risk*: The scenario cannot produce final accuracy evidence locally.
- *Mitigation*: Emit blocked results that name the exact missing prerequisite, and continue with the other scenario groups.
@@ -1,102 +0,0 @@
# Satellite Anchor Cache Tests
**Task**: AZ-236_satellite_anchor_cache_tests
**Name**: Satellite Anchor Cache Tests
**Description**: Implement blackbox, security, and performance tests for satellite-anchor retrieval, local verification, cache integrity, and no in-flight external access.
**Complexity**: 5 points
**Dependencies**: AZ-233_test_infrastructure, AZ-241_real_satellite_vpr_descriptor_retrieval, AZ-242_real_anchor_feature_matching_ransac
**Component**: Blackbox Tests
**Tracker**: AZ-236
**Epic**: AZ-218
## Problem
Satellite anchors and cache fixtures are safety-critical: invalid, stale, poisoned, or externally fetched data must not become trusted localization output.
This test task must run after AZ-241 and AZ-242 so it validates real local VPR retrieval and real anchor feature matching rather than scaffold evidence gates.
## Outcome
- Accepted anchors include retrieval, matching, geometry, freshness, and provenance evidence.
- Invalid/stale/poisoned cache fixtures cannot produce trusted anchors or trusted generated tiles.
- No in-flight Satellite Service or provider access occurs when cache data is missing.
## Scope
### Included
- FT-P-04 Satellite Service And Anchor Verification.
- FT-N-01 Repetitive Or Low-Texture Imagery.
- FT-N-03 Invalid Or Stale Satellite Cache.
- NFT-PERF-03 Relocalization Trigger Path Latency.
- NFT-RES-04 Tile Cache Freshness Degradation.
- NFT-SEC-01 Signed Cache Manifest Enforcement.
- NFT-SEC-02 Cache Poisoning Write Gate.
- NFT-SEC-04 No In-Flight Satellite Provider Access.
- NFT-RES-LIM-03 Satellite Cache Storage Budget.
### Excluded
- VIO synchronized replay.
- MAVLink spoofing/failsafe behavior.
- Jetson thermal endurance.
## Acceptance Criteria
**AC-1: Verified anchors include evidence**
Given a valid local cache/index fixture and relocalization trigger
When retrieval and verification run
Then accepted anchors include candidate IDs, scores, MRE, inliers, covariance, and tile provenance.
**AC-2: Unsafe candidates are rejected**
Given low-texture, stale, unsigned, hash-mismatched, or low-resolution fixtures
When anchor/cache tests run
Then no invalid candidate emits a trusted `satellite_anchored` estimate or trusted generated tile.
**AC-3: No in-flight external access occurs**
Given flight-mode replay with missing cache data
When relocalization is requested
Then the system reports degraded/no-candidate behavior without satellite-provider or Suite service network calls.
**AC-4: Cache and trigger-path metrics are reported**
Given cache and relocalization scenarios complete
When reporting finishes
Then latency, MRE, trust level, freshness, and storage-budget evidence are written.
## Non-Functional Requirements
**Security**
- Invalid cache data must not be trusted or promoted.
**Performance**
- Trigger-path latency and bounded top-K behavior are measured.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|--------------|------------------|
| AC-1 | Anchor evidence parser | Required evidence fields are present |
| AC-2 | Invalid cache fixture generator | Stale/unsigned/hash-mismatched fixtures are produced deterministically |
| AC-3 | Network-block assertion | Unexpected external calls fail the scenario |
| AC-4 | Cache metrics report | Latency, freshness, and storage metrics are present |
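The AC-2 row asks for stale, unsigned, and hash-mismatched fixtures to be produced deterministically. A minimal sketch follows, assuming a manifest entry with `sha256`, `captured_at_s`, and `signature` keys and using HMAC-SHA256 as a stand-in for whatever signing scheme the real manifest uses. Every name here is hypothetical.

```python
import hashlib
import hmac


def make_entry(tile_bytes, key, captured_at_s, signed=True):
    """Build a deterministic manifest entry for a tile (hypothetical layout)."""
    digest = hashlib.sha256(tile_bytes).hexdigest()
    entry = {"sha256": digest, "captured_at_s": captured_at_s, "signature": None}
    if signed:
        # Stand-in signature: HMAC-SHA256 over the tile digest.
        entry["signature"] = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return entry


def rejection_reasons(tile_bytes, entry, key, now_s, max_age_s):
    """Return why a tile must not be trusted; an empty list means no objection."""
    reasons = []
    signature = entry.get("signature")
    if not signature:
        reasons.append("unsigned")
    else:
        expected = hmac.new(key, entry["sha256"].encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(signature, expected):
            reasons.append("bad_signature")
    if hashlib.sha256(tile_bytes).hexdigest() != entry["sha256"]:
        reasons.append("hash_mismatch")
    if now_s - entry["captured_at_s"] > max_age_s:
        reasons.append("stale")
    return reasons
```

Because the fixtures are derived from fixed bytes, a fixed key, and explicit timestamps, each invalid variant is reproducible across runs without touching any external service.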
## Blackbox Tests
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|-------------------------|--------------|-------------------|----------------|
| AC-1 | Public/cache fixture | FT-P-04 | Accepted anchors meet MRE/evidence requirements | Performance |
| AC-2 | Ambiguous and invalid cache fixtures | FT-N-01, FT-N-03, NFT-SEC-01, NFT-SEC-02 | 0 unsafe trusted outputs | Security |
| AC-3 | Network-blocked flight-mode replay | NFT-SEC-04 | Missing cache data causes degraded behavior, not an external fetch | Security |
| AC-4 | Relocalization/cache runs | NFT-PERF-03, NFT-RES-04, NFT-RES-LIM-03 | Metrics and storage evidence are recorded | Performance |
## Constraints
- Tests must use local preloaded cache/index fixtures only.
- External network access during flight-mode scenarios is a failure.
- VPAir and UZH FPV licensing must be respected before use as commercial acceptance evidence.
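For the constraint that external network access during flight-mode scenarios is a failure, an in-process guard can be sketched with the standard library. This only intercepts Python-level `socket.socket.connect` calls in the test process; a system under test running in a separate process or container would need network isolation at that level instead. `forbid_network` and `ExternalAccessError` are hypothetical names.

```python
import socket
from contextlib import contextmanager
from unittest import mock


class ExternalAccessError(AssertionError):
    """Raised when code under test attempts an outbound connection."""


@contextmanager
def forbid_network():
    """Fail any socket.connect() attempted inside the with-block."""

    def guarded_connect(self, address, *args, **kwargs):
        raise ExternalAccessError(f"unexpected network call to {address!r}")

    # mock.patch.object restores the original method on exit, even on error.
    with mock.patch.object(socket.socket, "connect", guarded_connect):
        yield
```

A scenario would wrap the flight-mode replay in `with forbid_network():` so that any fetch attempt surfaces as a scenario failure rather than a silent fallback.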
## Risks & Mitigation
**Risk 1: Dataset licensing blocks final anchor evidence**
- *Risk*: Public dataset terms prevent commercial acceptance use.
- *Mitigation*: Mark dataset-specific checks blocked and keep generated cache fixtures for deterministic security coverage.
@@ -1,94 +0,0 @@
# MAVLink Blackout Spoofing Tests
**Task**: AZ-237_mavlink_blackout_spoofing_tests
**Name**: MAVLink Blackout Spoofing Tests
**Description**: Implement SITL/replay tests for visual blackout, spoofed GPS, MAVLink source validation, degraded covariance, no-fix thresholds, and QGC status.
**Complexity**: 5 points
**Dependencies**: AZ-233_test_infrastructure
**Component**: Blackbox Tests
**Tracker**: AZ-237
**Epic**: AZ-218
## Problem
The system must prove that spoofed GPS and unauthorized MAVLink messages cannot override estimator state during visual blackout or degraded operation.
## Outcome
- Blackout and spoofing traces drive visible degraded-mode transitions.
- Covariance, `GPS_INPUT`, QGC status, and FDR evidence match the safety thresholds.
- Unauthorized MAVLink sources are rejected and recorded.
## Scope
### Included
- FT-N-02 GPS Spoofing During Total Visual Blackout.
- NFT-RES-01 Total Visual Blackout With GPS Spoofing.
- NFT-SEC-03 MAVLink Source And Spoofing Rejection.
### Excluded
- Still-image geolocation accuracy.
- Satellite-anchor cache poisoning.
- Cold-start and restart trials.
## Acceptance Criteria
**AC-1: Blackout transitions to dead reckoning**
Given a replay/SITL trace with total camera blackout and spoofed GPS
When the scenario runs
Then the system enters `dead_reckoned` mode within the required frame or timing threshold.
**AC-2: Degraded output thresholds are enforced**
Given blackout continues beyond configured thresholds
When estimates are emitted
Then covariance grows monotonically and `GPS_INPUT` fields degrade to no-fix/failsafe values at the specified limits.
**AC-3: Spoofed or unauthorized MAVLink inputs are rejected**
Given spoofed real-GPS measurements or unauthorized MAVLink source IDs
When messages arrive during normal or blackout operation
Then no confident position estimate is produced from those inputs.
**AC-4: Operator and FDR evidence is visible**
Given degraded-mode transitions occur
When reporting completes
Then QGC status and FDR evidence show promotion, demotion, blackout, and failsafe events at expected rates.
## Non-Functional Requirements
**Safety**
- Spoofed GPS must not be promoted during blackout without the documented recovery gates.
**Reliability**
- Missing SITL prerequisites are reported as blocked with exact setup evidence.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|--------------|------------------|
| AC-1 | Scenario trigger builder | Blackout and spoofing events are generated deterministically |
| AC-2 | Threshold assertion logic | Fix type, covariance, and `horiz_accuracy` thresholds are checked |
| AC-3 | MAVLink source filter assertion | Unauthorized source messages fail the scenario |
| AC-4 | Status/FDR parser | Expected status events and rates are validated |
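The AC-2 threshold logic can be sketched as a check over emitted samples. `fix_type` and `horiz_accuracy` are `GPS_INPUT` fields, and `fix_type` values 0 and 1 mean no fix in MAVLink. The sample layout, the `cov_semi_major_m` key, the limit parameters, and the reading of "monotonically" as non-decreasing are assumptions.

```python
def check_degraded_output(samples, no_fix_after_s, max_horiz_accuracy_m):
    """Return threshold violations for a blackout run; an empty list means compliant.

    samples: dicts with t_s (seconds since blackout start), cov_semi_major_m,
             fix_type, and horiz_accuracy, in emission order (hypothetical layout).
    """
    violations = []

    # Covariance must never shrink while the blackout continues.
    for prev, cur in zip(samples, samples[1:]):
        if cur["cov_semi_major_m"] < prev["cov_semi_major_m"]:
            violations.append(f"covariance shrank at t={cur['t_s']}")

    # Past either limit, GPS_INPUT must report no fix (fix_type 0 or 1).
    for s in samples:
        past_limit = (s["t_s"] >= no_fix_after_s
                      or s["horiz_accuracy"] > max_horiz_accuracy_m)
        if past_limit and s["fix_type"] > 1:
            violations.append(f"fix_type {s['fix_type']} past no-fix limit at t={s['t_s']}")

    return violations
```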
## Blackbox Tests
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|-------------------------|--------------|-------------------|----------------|
| AC-1 | SITL or replay spoofing trace | FT-N-02, NFT-RES-01 | Dead-reckoned transition within timing threshold | Safety |
| AC-2 | Continued blackout | FT-N-02, NFT-RES-01 | Monotonic covariance and no-fix/failsafe fields | Safety |
| AC-3 | Unauthorized/spoofed MAVLink messages | NFT-SEC-03 | No confident estimate from bad source | Safety |
| AC-4 | QGC/FDR outputs | FT-N-02, NFT-SEC-03 | Status and evidence are visible and rate-limited | Reliability |
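The NFT-SEC-03 row can be illustrated with a source filter over parsed messages. The sketch uses plain dictionaries rather than a MAVLink library, and the allow-list contents, the `sysid`/`compid` keys, and both helper names are hypothetical.

```python
# Hypothetical allow-list of (system ID, component ID) pairs for trusted senders.
ALLOWED_SOURCES = {(1, 1)}


def split_by_source(messages, allowed=ALLOWED_SOURCES):
    """Partition messages into (accepted, rejected) by their source IDs."""
    accepted, rejected = [], []
    for msg in messages:
        bucket = accepted if (msg["sysid"], msg["compid"]) in allowed else rejected
        bucket.append(msg)
    return accepted, rejected


def unauthorized_sources(messages, allowed=ALLOWED_SOURCES):
    """Return the sorted set of source pairs that must be rejected and recorded."""
    _, rejected = split_by_source(messages, allowed)
    return sorted({(m["sysid"], m["compid"]) for m in rejected})
```

A scenario assertion would then require that every pair returned by `unauthorized_sources` appears in the rejection record and that no confident estimate was emitted from those messages.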
## Constraints
- ArduPilot Plane SITL is the authoritative autopilot target.
- v1 asserts `GPS_INPUT` output and intentional absence of ODOMETRY.
- Tests must not depend on Mission Planner or PX4 behavior.
## Risks & Mitigation
**Risk 1: SITL setup varies by environment**
- *Risk*: Local runs may not have SITL installed or configured.
- *Mitigation*: Report blocked prerequisites clearly and keep replay-level assertions runnable where possible.