Test Data Management
Seed Data Sets
| Data Set | Description | Used by Tests | How Loaded | Cleanup |
|---|---|---|---|---|
| still-image-set-60 | 60 nadir aerial images AD000001-60.jpg from _docs/00_problem/input_data/ with WGS84 frame-center GT in coordinates.csv and per-image accuracy table in expected_results/position_accuracy.csv. Captured at 400 m AGL with ADTi 20MP 20L V1 (per data_parameters.md). Slow cadence (~1 per 2-3 s), so suitable for satellite-anchor frame-center tests, NOT frame-to-frame VIO. | FT-P-01, FT-P-03, FT-P-05, FT-P-06, FT-P-15, FT-P-19, NFT-RES-03 (Monte Carlo), NFT-PERF-04 | Bind-mounted from _docs/00_problem/input_data/ to /test-data in e2e-runner (read-only) | None (read-only fixture) |
| still-image-sat-refs-2 | Two paired Google Maps reference images AD000001_gmaps.png, AD000002_gmaps.png. Insufficient for full satellite-anchor coverage of the 60-image set; supplements the tile-cache fixture for AC-2.1b cross-validation only. | FT-P-05 (subset), FT-P-19 | Same as above | Same |
| derkachi-fixture | Cropped nadir flight footage flight_derkachi/flight_derkachi.mp4 (H.264, 880×720, 30 fps, ~490.07 s = 14,700 frames) plus synchronized FC telemetry flight_derkachi/data_imu.csv (4,900 rows @ 10 Hz, columns timestamp(ms), Time, SCALED_IMU2.*, GLOBAL_POSITION_INT.*). Three video frames per telemetry row. The GLOBAL_POSITION_INT columns are the trajectory ground truth. | FT-P-02, FT-P-04, FT-P-07, FT-P-10, FT-N-01 (synth on top), FT-N-02, FT-N-03 (synth), FT-N-04 (synth), NFT-PERF-01, NFT-PERF-02, NFT-RES-01, NFT-RES-02, NFT-RES-03 (Monte Carlo), NFT-RES-04, NFT-LIM-02 (8 h synth load loop) | Same bind mount as above | Same |
| tile-cache-fixture | Pre-built FAISS HNSW index + tile filesystem covering: (a) the 60 still-image footprints at 0.3-0.5 m/px, (b) the Derkachi route bbox at the same resolution. Built once per CI run by tests/fixtures/tile-cache-builder/ from the _gmaps.png references and from a curated public-data subset (once D-PROJ-3 is resolved; until then, stub-tile content for footprints not paired with _gmaps.png). Tile manifest schema per restrictions.md § Satellite Imagery. | FT-P-01, FT-P-05, FT-P-15, FT-P-16, FT-P-17, FT-P-19, FT-N-05, FT-N-06, NFT-LIM-03, NFT-PERF-01, NFT-PERF-04, NFT-SEC-01 (poisoning test), NFT-SEC-02 (egress) | Built into named Docker volume tile-cache-fixture; mounted read-only into SUT at /var/azaion/tile-cache | Volume removed at teardown |
| synth-age-tile-set | Two clones of the tile-cache-fixture with manifest capture_date field synthetically aged: synth-age-7mo (>6 mo, exceeds AC-8.2 active-conflict threshold) and synth-age-13mo (>12 mo, exceeds rear threshold). Tile pixels unchanged; only manifest dates differ. | FT-N-05, FT-N-06 | Built from tile-cache-fixture by date-mutating script in tests/fixtures/age-injector/ | Volume removed at teardown |
| outlier-injection-derkachi | Synthetic adversarial overlay on derkachi-fixture: every Nth frame replaced by a random crop from a far-away tile (>350 m offset, per AC-3.1) to inject a visual outlier. Three injection densities: light (1 in 100), medium (1 in 10), heavy (1 in 3). Generated at runtime by tests/fixtures/injectors/outlier.py. | FT-N-01 | Generated at scenario start, written to tmpfs in e2e-runner, mounted into SUT as a derived frame source | Auto-cleared at teardown (tmpfs) |
| blackout-spoof-derkachi | Synthetic overlay on derkachi-fixture: pure-black frames inserted in 5 s / 15 s / 35 s windows AND simultaneous spoofed-GPS injection on the FC inbound stream. Spoof pattern: realistic-looking GPS fixes that jump the trajectory 200-500 m in north_east_random_direction. The three windows produce three sub-scenarios per AC-NEW-8. Generated at runtime. | FT-N-04, NFT-RES-04 | Same | Same |
| multi-segment-derkachi | Synthetic overlay: 3+ blackout segments distributed across the Derkachi flight to exercise satellite-reference re-localization (AC-3.3) without spoofing. Generated at runtime. | FT-P-08 | Same | Same |
| cold-boot-fixture | The state needed to validate AC-NEW-1: a frozen FC pose (GLOBAL_POSITION_INT snapshot at flight-resume time) + the tile-cache-fixture + a blank FDR. Test cold-boots the SUT and measures TTFF. | NFT-PERF-03 (AC-NEW-1) | The frozen FC pose is a JSON fixture in tests/fixtures/cold-boot/; SUT is restarted (docker compose restart gps-denied-onboard) and TTFF is measured from container-ready event to first valid GPS_INPUT / MSP2_SENSOR_GPS arrival at SITL | Container restart only |
| mavlink-passkey | A test-only MAVLink 2.0 signing passkey (32-byte hex). Used for the D-C8-9 ArduPilot-track signing channel. NEVER reused outside the test environment; checked in as tests/fixtures/secrets/mavlink-test-passkey.txt with an explicit "TEST ONLY" comment. | FT-P-09 (AP track), NFT-SEC-03 | Loaded via Docker secret into SUT environment | None (fixture file) |
| cve-jpeg-fixture | Crafted JPEG that triggers CVE-2025-53644 (uninitialized stack pointer → heap buffer write) in OpenCV 4.10/4.11. The pinned OpenCV version (≥4.12.0) must process it without crashing and either decode it safely or reject it. | NFT-SEC-04 | Local-data-only fixture file at tests/fixtures/security/cve-2025-53644.jpg (sourced from public PoC, license-checked) | None (fixture file) |
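
The derkachi-fixture row states that each telemetry row covers three video frames (30 fps video against 10 Hz telemetry). A minimal sketch of that index mapping follows. It assumes zero-based indices and that frame 0 lines up with row 0 with no time offset; flight_derkachi/README.md may define a different alignment, and the function names are hypothetical.

```python
VIDEO_FPS = 30        # flight_derkachi.mp4 frame rate, from the fixture description
TELEMETRY_HZ = 10     # data_imu.csv row rate, from the fixture description
FRAMES_PER_ROW = VIDEO_FPS // TELEMETRY_HZ  # three frames per telemetry row


def telemetry_row_for_frame(frame_index: int) -> int:
    """Return the zero-based telemetry row covering a zero-based video frame.

    Assumes frame 0 and row 0 start at the same instant (hypothetical alignment).
    """
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    return frame_index // FRAMES_PER_ROW


def frames_for_telemetry_row(row_index: int) -> range:
    """Return the zero-based video frames that share one telemetry row."""
    if row_index < 0:
        raise ValueError("row_index must be non-negative")
    start = row_index * FRAMES_PER_ROW
    return range(start, start + FRAMES_PER_ROW)
```

Under these assumptions the last of the 14,700 frames (index 14699) maps to the last of the 4,900 telemetry rows (index 4899).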
Data Isolation Strategy
Each pytest test case runs against a fresh gps-denied-onboard container (docker compose restart between tests, OR --forked pytest mode that brings a clean compose stack per case for hermetic-critical tests). The tile-cache-fixture and input-data mounts are read-only so cross-contamination between tests is impossible at the SUT-input layer. The fdr-output volume is reset between tests (docker volume rm + recreate) so each test sees a blank FDR.
For Tier-2 (Jetson hardware), the same isolation discipline applies at the systemd-service level: systemctl restart gps-denied-onboard.service runs between tests, and /var/azaion/fdr is wiped between tests.
Synthetic-injection fixtures (outlier-injection-derkachi, blackout-spoof-derkachi, multi-segment-derkachi, synth-age-tile-set) are generated into per-test tmpfs and never written back to a persistent volume.
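
The Tier-1 FDR reset described above can be sketched as a small helper that builds the Docker commands. This is an illustration only: it assumes the fdr-output volume is declared as an external volume under that literal name in the compose file, and it removes the SUT container first, because Docker refuses to remove a volume that a container still references. The helper names are hypothetical.

```python
import subprocess
from typing import Callable, List

FDR_VOLUME = "fdr-output"            # volume name taken from the text
SUT_SERVICE = "gps-denied-onboard"   # service name taken from the text


def reset_commands(volume: str = FDR_VOLUME,
                   service: str = SUT_SERVICE) -> List[List[str]]:
    """Build the command sequence that gives the next test a blank FDR."""
    return [
        # Stop and remove the SUT container so the volume is no longer referenced.
        ["docker", "compose", "rm", "--stop", "--force", service],
        # Drop the old FDR contents, then recreate an empty volume.
        ["docker", "volume", "rm", "--force", volume],
        ["docker", "volume", "create", volume],
        # Bring the SUT back up against the fresh volume.
        ["docker", "compose", "up", "--detach", service],
    ]


def run_reset(runner: Callable = subprocess.run,
              volume: str = FDR_VOLUME,
              service: str = SUT_SERVICE) -> None:
    """Execute the reset; `runner` is injectable so the sequence can be dry-run."""
    for cmd in reset_commands(volume, service):
        runner(cmd, check=True)
```

A pytest fixture could call run_reset() in its teardown; passing a recording function as runner allows the sequence to be inspected without a Docker daemon.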
Input Data Mapping
| Input Data File | Source Location | Description | Covers Scenarios |
|---|---|---|---|
| AD000001.jpg ... AD000060.jpg | _docs/00_problem/input_data/ | 60 nadir still images, ADTi 20MP @ 400 m AGL | FT-P-01, FT-P-03, FT-P-05, FT-P-06, FT-P-15, FT-P-19, NFT-PERF-04, NFT-RES-03 |
| coordinates.csv | _docs/00_problem/input_data/ | 60-row WGS84 frame-center GT (image, lat, lon) | Same as above |
| AD000001_gmaps.png, AD000002_gmaps.png | _docs/00_problem/input_data/ | Google Maps satellite reference for images 1-2 | FT-P-05, FT-P-19 |
| data_parameters.md | _docs/00_problem/input_data/ | AGL height (400 m) + camera model | All (global metadata) |
| flight_derkachi/flight_derkachi.mp4 | _docs/00_problem/input_data/flight_derkachi/ | H.264 nadir video, 880×720 @ 30 fps, ~490 s | FT-P-02, FT-P-04, FT-P-07, FT-P-10, FT-N-01..04, NFT-PERF-01..04, NFT-RES-01..04, NFT-LIM-02 |
| flight_derkachi/data_imu.csv | _docs/00_problem/input_data/flight_derkachi/ | 4,900 rows @ 10 Hz of SCALED_IMU2 + GLOBAL_POSITION_INT | Same as above |
| flight_derkachi/README.md | _docs/00_problem/input_data/flight_derkachi/ | Fixture metadata | Documentation only |
| expected_results/results_report.md | _docs/00_problem/input_data/expected_results/ | Pass/fail rules + still-image and Derkachi mappings | All FT-P / FT-N scenarios that load this fixture |
| expected_results/position_accuracy.csv | _docs/00_problem/input_data/expected_results/ | Per-image accuracy threshold flags | FT-P-01, NFT-RES-03 |
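
As an illustration of how a test might consume coordinates.csv, the sketch below parses the ground-truth rows into a lookup keyed by image name. It assumes the header row uses exactly the column names image, lat, lon listed in the table above; the real file may name or order them differently, and load_ground_truth is a hypothetical helper.

```python
import csv
import io
from typing import Dict, Tuple


def load_ground_truth(csv_text: str) -> Dict[str, Tuple[float, float]]:
    """Parse frame-center ground truth into {image: (lat, lon)}.

    Assumes a header row with columns image, lat, lon (assumption, see lead-in).
    """
    truth: Dict[str, Tuple[float, float]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat, lon = float(row["lat"]), float(row["lon"])
        # Reject values that cannot be WGS84 degrees before any test uses them.
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            raise ValueError(f"out-of-range coordinate for {row['image']}")
        truth[row["image"]] = (lat, lon)
    return truth
```

A caller would typically read the file from the /test-data bind mount and assert that the result has 60 entries before running any accuracy check.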
Expected Results Mapping
This table closes the gap between each test scenario and the quantifiable expected result it asserts on. Comparison methods follow .cursor/skills/test-spec/templates/expected-results.md. The Expected Result Source column points at the canonical source of truth for the assertion.
Position accuracy
| Test Scenario ID | Input Data | Expected Result | Comparison Method | Tolerance | Expected Result Source |
|---|---|---|---|---|---|
| FT-P-01 | still-image-set-60 + tile-cache-fixture | pass_count(error≤50m) ≥ 48 (≥80% of 60) AND pass_count(error≤20m) ≥ 30 (≥50% of 60) | threshold_min on aggregate counts; per-image error via numeric_tolerance against Vincenty geodesic distance to GT in coordinates.csv | ±50 m / ±20 m | expected_results/results_report.md § Pass/Fail Rules + expected_results/position_accuracy.csv |
| FT-P-02 | derkachi-fixture | At each anchor frame, ‖propagated_centre − next_anchor_centre‖ < 100 m (visual-only) AND < 50 m (IMU-fused). Drift binned by last_satellite_anchor_age_ms. | threshold_max per anchor pair, then aggregate rule: ≥95% of anchor pairs must satisfy the threshold | < 100 m / < 50 m | AC-1.3 + Derkachi GLOBAL_POSITION_INT GT |
| FT-P-03 | still-image-set-60 (any 1 image) | Estimate output schema fields present: lat:float, lon:float, cov_semi_major_m:float, source_label ∈ {satellite_anchored, visual_propagated, dead_reckoned}, last_satellite_anchor_age_ms:int | schema_match (presence + type) AND set_contains (label) | N/A | AC-1.4 + AC-4.3 |
| FT-P-19 | tile-cache-fixture + still-image-sat-refs-2 | Scale-ratio: any UAV-frame footprint at 400 m AGL retrievable from cache (FAISS top-K=10 includes a tile with center within 100 m of true position). Scene-change subset (PARTIAL, flag-marked; see traceability matrix). | set_contains (top-K result includes correct tile) | top-K hit | AC-8.6 |
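
The FT-P-01 aggregate rule can be sketched as follows. The table calls for Vincenty geodesic distance; to stay dependency-free this sketch swaps in the haversine formula on a mean-radius sphere, which is only an approximation and should not replace the Vincenty computation in the real comparison. Function names are hypothetical.

```python
import math
from typing import Sequence

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical stand-in for Vincenty


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 points (spherical model)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def ft_p_01_passes(errors_m: Sequence[float]) -> bool:
    """Apply the aggregate rule: >=80% within 50 m AND >=50% within 20 m."""
    n = len(errors_m)
    if n == 0:
        raise ValueError("no per-image errors supplied")
    within_50 = sum(e <= 50.0 for e in errors_m)
    within_20 = sum(e <= 20.0 for e in errors_m)
    # Integer comparison avoids floating-point rounding at the exact boundary.
    return within_50 * 100 >= 80 * n and within_20 * 100 >= 50 * n
```

For the 60-image set these percentages reduce to the counts in the table: at least 48 images within 50 m and at least 30 within 20 m.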
Image processing quality
| Test Scenario ID | Input Data | Expected Result | Comparison Method | Tolerance | Expected Result Source |
|---|---|---|---|---|---|
| FT-P-04 | derkachi-fixture | Frame-to-frame registration succeeds for ≥95% of "normal" segments (defined per AC-2.1a: nadir ±10° bank/pitch from data_imu.csv SCALED_IMU2 quaternion-derived attitude estimate, ≥40% inferred prior-frame overlap). Sharp-turn frames excluded from this denominator. | threshold_min on success ratio | ≥95% | AC-2.1a |
| FT-P-05 | still-image-set-60 (with _gmaps.png subset for ground-truth match) | Satellite-anchor registration succeeds AND satisfies AC-1.1/1.2 accuracy AND MRE < 2.5 px | threshold_max MRE | < 2.5 px | AC-2.1b + AC-2.2 |
| FT-P-06 | derkachi-fixture (frame-to-frame) AND still-image-set-60 (sat-anchor) | Mean Reprojection Error: < 1.0 px frame-to-frame, < 2.5 px satellite-anchored cross-domain | threshold_max per shape | < 1.0 / < 2.5 px | AC-2.2 |
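
A mean reprojection error check along the lines of FT-P-06 might look like the sketch below. It assumes the registration result is a 3×3 homography applied to matched pixel coordinates; the document does not state which transform model the SUT uses, so treat that, and all names here, as assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Homography = Sequence[Sequence[float]]  # 3x3, row-major

# Limits from the table, in pixels; the keys are hypothetical labels.
MRE_LIMITS_PX = {"frame_to_frame": 1.0, "satellite_anchored": 2.5}


def project(h: Homography, pt: Point) -> Point:
    """Apply a homography to one pixel coordinate."""
    x, y = pt
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def mean_reprojection_error(h: Homography, src: Sequence[Point],
                            dst: Sequence[Point]) -> float:
    """Mean Euclidean distance between projected src points and observed dst points."""
    if not src or len(src) != len(dst):
        raise ValueError("src and dst must be non-empty and the same length")
    return sum(math.dist(project(h, s), d) for s, d in zip(src, dst)) / len(src)


def mre_within_limit(mre_px: float, kind: str) -> bool:
    """Strict threshold_max comparison against the limit for this match type."""
    return mre_px < MRE_LIMITS_PX[kind]
```

Whether the error is averaged over all matches or only over inliers is left open in the table; this sketch averages over whatever correspondences it is given.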
Resilience
| Test Scenario ID | Input Data | Expected Result | Comparison Method | Tolerance | Expected Result Source |
|---|---|---|---|---|---|
| FT-N-01 | outlier-injection-derkachi | Up to 350 m offset in a single frame is rejected as outlier; estimate continues from prior valid state with grown covariance; airframe tilt up to ±20° handled | Per-injected-outlier: error_after_outlier ≤ error_before_outlier + 50 m AND covariance_growth_monotonic | ±50 m drift budget | AC-3.1 |
| FT-N-02 | derkachi-fixture (sharp-turn segment, identified via SCALED_IMU2 gyro_z spikes) | Sharp-turn frames may fail frame-to-frame registration; recovery via satellite-reference re-localization within next 3 frames | Boolean recovery within 3 frames | N/A | AC-3.2 |
| FT-P-08 | multi-segment-derkachi | ≥3 disconnected segments handled; satellite-reference re-localization succeeds at each gap; trajectory remains continuous (no >100 m jump) | threshold_max discontinuity | < 100 m | AC-3.3 |
| FT-N-03 | derkachi-fixture + synthetic 3-frame outage injector | After ≥3 consecutive frames AND ≥2 s without estimate: STATUSTEXT containing OPERATOR_RELOC_REQUEST emitted to GCS via mavproxy-listener; estimates labeled dead_reckoned continue | regex on STATUSTEXT + set_contains on labels | regex | AC-3.4 |
| FT-N-04 | blackout-spoof-derkachi (5 s / 15 s / 35 s windows) | Within ≤1 frame OR ≤400 ms: label switches to dead_reckoned; spoofed GPS rejected; covariance grows monotonically; horiz_accuracy not under-reported; VISUAL_BLACKOUT_IMU_ONLY STATUSTEXT at 1-2 Hz | threshold_max switch latency + regex STATUSTEXT + monotonic check | ≤400 ms | AC-3.5 |
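
Two of the FT-N-04 assertions, the label-switch latency and the monotonic covariance check, can be sketched as below. The source_label and cov_semi_major_m field names come from the FT-P-03 output schema; the t_ms timestamp field, the helper names, and the reading of "monotonic" as non-decreasing are assumptions. Only the 400 ms branch of the "≤1 frame OR ≤400 ms" criterion is covered.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Estimate:
    t_ms: int                 # emit time in milliseconds (hypothetical field)
    source_label: str         # satellite_anchored / visual_propagated / dead_reckoned
    cov_semi_major_m: float   # covariance semi-major axis in metres


def switch_latency_ms(estimates: Sequence[Estimate],
                      blackout_start_ms: int) -> Optional[int]:
    """Time from blackout start to the first dead_reckoned estimate, or None."""
    for e in estimates:
        if e.t_ms >= blackout_start_ms and e.source_label == "dead_reckoned":
            return e.t_ms - blackout_start_ms
    return None


def covariance_non_decreasing(estimates: Sequence[Estimate],
                              start_ms: int, end_ms: int) -> bool:
    """True if the covariance never shrinks inside the [start_ms, end_ms] window."""
    window = [e.cov_semi_major_m for e in estimates if start_ms <= e.t_ms <= end_ms]
    return all(later >= earlier for earlier, later in zip(window, window[1:]))
```

A scenario would run both checks once per blackout window (5 s, 15 s, 35 s) and fail if the latency is None or exceeds 400 ms.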
FC contract & startup
| Test Scenario ID | Input Data | Expected Result | Comparison Method | Tolerance | Expected Result Source |
|---|---|---|---|---|---|
| FT-P-09-AP | derkachi-fixture + mavlink-passkey + ardupilot-plane-sitl | GPS_INPUT messages reach AP SITL; AP EKF accepts them as EK3_SRC1_POSXY=3 (GPS); MAVLink 2.0 signing handshake completes (D-C8-9); messages without valid signature are rejected | exact (AP source-set state via param read) + boolean (signing handshake success) + exact (rejection of unsigned in NFT-SEC-03) | N/A | AC-4.3 + D-C8-9 |
| FT-P-09-iNav | derkachi-fixture + inav-sitl | MSP2_SENSOR_GPS (ID 0x1F03) messages reach iNav SITL via TCP 5760; iNav GPS provider state shows provider=MSP and fix is acquired | exact on iNav GPS provider state via MSP read | N/A | AC-4.3 + Source #4 |
| FT-P-10 | derkachi-fixture | Per Mode B Fact #107: GTSAM iSAM2 smoothed past-keyframe pose estimates differ from raw single-shot estimates AND smoothed estimates are closer to GLOBAL_POSITION_INT GT than raw (IT-11). NOT validated as FC-side retroactive correction (out of scope per Mode B revision). | numeric_tolerance improvement check | smoothed_error < raw_error | AC-4.5 (revised) + Mode B Fact #107 |
| FT-P-11 | cold-boot-fixture + ardupilot-plane-sitl | On boot, SUT initializes from FC EKF's last valid GPS + IMU-extrapolated position | numeric_tolerance initial-pose-vs-FC-pose | ±50 m | AC-5.1 |
| NFT-RES-01 | derkachi-fixture + 4 s outage injector | After >3 s without estimate, FC falls back to IMU-only dead reckoning; SUT emits a NO_ESTIMATE_TIMEOUT failure log | boolean on FC EKF source-set transition + regex on log | N/A | AC-5.2 |
| NFT-RES-02 | derkachi-fixture + container restart mid-replay | After companion reboot, SUT re-initializes from FC's current IMU-extrapolated position; first emitted GPS_INPUT / MSP2_SENSOR_GPS is within ±100 m of FC's IMU-extrapolated pose at boot-complete time | numeric_tolerance pose at first emit | ±100 m | AC-5.3 |
Performance
| Test Scenario ID | Input Data | Expected Result | Comparison Method | Tolerance | Expected Result Source |
|---|---|---|---|---|---|
| NFT-PERF-01 (Tier-2 only) | derkachi-fixture resampled to 3 Hz on Jetson Orin Nano Super | End-to-end latency (camera capture → GPS to FC) | threshold_max p95 | ≤ 400 ms | AC-4.1 + D-CROSS-LATENCY-1 |
| NFT-PERF-02 (Tier-1+2) | derkachi-fixture | Estimates emitted frame-by-frame (no batching > 1 frame); inter-emit interval p95 ≤ inter-frame interval × 1.05 | threshold_max p95 inter-emit | ≤ 350 ms (at 3 Hz target) | AC-4.4 |
| NFT-PERF-03 (Tier-2 only) | cold-boot-fixture | Cold-start TTFF: from container-ready to first valid GPS_INPUT / MSP2_SENSOR_GPS | threshold_max p95 over 50 cold boots | < 30 s | AC-NEW-1 |
| NFT-PERF-04 | still-image-set-60 + spoofed FC GPS injection in ardupilot-plane-sitl | Spoofing-promotion latency: from FC GPS-denial / spoof signal to SUT estimate becoming AP primary position source | threshold_max p95 over 50 trials per FC | < 3 s | AC-NEW-2 |
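
The p95 comparisons in this table, and the NFT-PERF-02 budget in particular (333.3 ms × 1.05 = 350 ms at 3 Hz), can be sketched as below. The nearest-rank percentile definition is an assumption, since the table does not say which percentile method the comparison uses, and the function names are hypothetical.

```python
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples supplied")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def inter_emit_budget_ms(frame_rate_hz: float, slack: float = 1.05) -> float:
    """Inter-frame interval in ms times the 1.05 slack factor from NFT-PERF-02."""
    return 1000.0 / frame_rate_hz * slack


def inter_emit_ok(emit_times_ms: Sequence[float], frame_rate_hz: float) -> bool:
    """True if the p95 gap between consecutive emits stays within the budget."""
    gaps = [b - a for a, b in zip(emit_times_ms, emit_times_ms[1:])]
    return percentile_nearest_rank(gaps, 95) <= inter_emit_budget_ms(frame_rate_hz)
```

The same percentile helper applies to the NFT-PERF-01, NFT-PERF-03 and NFT-PERF-04 samples, each compared against its own limit from the table.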
Resource limits
| Test Scenario ID | Input Data | Expected Result | Comparison Method | Tolerance | Expected Result Source |
|---|---|---|---|---|---|
| NFT-LIM-01 (Tier-2) | derkachi-fixture 8 h replay loop | Memory < 8 GB shared on Jetson Orin Nano Super throughout | threshold_max peak RSS over duration | ≤ 8 GB | AC-4.2 |
| NFT-LIM-02 (Tier-1) | 8 h Derkachi replay loop | FDR ≤ 64 GB; no payload class silently dropped without a logged rollover | threshold_max total FDR size + regex on rollover-event presence | ≤ 64 GB | AC-NEW-3 |
| NFT-LIM-03 | tile-cache-fixture plus exercised manifests/overviews/indices | Cache budget ≤ 10 GB for the ~400 km² operational area unless solution defines a separate descriptor budget | threshold_max total cache size | ≤ 10 GB | RESTRICT-SAT-2 + AC-8.3 |
| NFT-LIM-04 (Tier-2) | derkachi-fixture 8 h | CPU/GPU/temp/throttle telemetry recorded; no thermal throttling at 25 W TDP at the upper temp envelope (deferred to chamber for AC-NEW-5) | threshold_max throttle event count = 0 (workstation thermal-day) | 0 events | RESTRICT-HW-1 + AC-NEW-5 (Tier-2 partial) |
Security
| Test Scenario ID | Input Data | Expected Result | Comparison Method | Tolerance | Expected Result Source |
|---|---|---|---|---|---|
| NFT-SEC-01 | Synthetic over-confidence injection: deflate covariance ×1.5-3 in 3 trial flights, observe AC-NEW-7 cache-poisoning behavior at the mock-suite-sat-service ingest | Per flight: P(geo-misalign > 30 m) < 1%, P(> 100 m) < 0.1% of written tiles. PARTIAL: multi-flight Monte Carlo (≥100 flights per AC text) is reduced-confidence with current single Derkachi fixture; trace flag in matrix. | threshold_max on probability | < 1% / < 0.1% | AC-NEW-7 |
| NFT-SEC-02 | Network egress probe from SUT container | All non-e2e-net egress attempts blocked by Docker internal: true; per-attempt logged as security event in SUT log | exact (egress count = 0) + regex (security-event log emission) | N/A | RESTRICT-SAT-1 + AC-8.1 |
| NFT-SEC-03 | ardupilot-plane-sitl + unsigned MAVLink GPS_INPUT injection | AP SITL rejects unsigned messages on the signed channel; SUT-emitted (signed) messages pass; SBOM check confirms passkey configuration | exact (AP rejection of unsigned) + boolean (SBOM passkey present) | N/A | D-C8-9 + Mode B Fact #109 + AC-NEW-2 |
| NFT-SEC-04 | cve-jpeg-fixture fed to SUT image pipeline (C1 + C4 paths) | OpenCV ≥4.12.0 either decodes safely or rejects the file; no crash, no buffer overflow detected by AddressSanitizer | boolean on no-crash + ASan clean | N/A | D-CROSS-CVE-1 + Mode B Fact #112 |
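
The NFT-SEC-01 thresholds can be sketched as an empirical-fraction check over the tiles written in one flight. Treating the observed fraction as the probability estimate is an assumption of this sketch; AC-NEW-7 may prescribe a different estimator or a confidence bound, and the function name is hypothetical.

```python
from typing import Sequence


def nft_sec_01_passes(misalign_m: Sequence[float]) -> bool:
    """Check per-flight tile misalignment rates against the table's limits.

    Passes when fewer than 1% of written tiles are misaligned by more than 30 m
    and fewer than 0.1% by more than 100 m.
    """
    n = len(misalign_m)
    if n == 0:
        raise ValueError("no written tiles to evaluate")
    over_30 = sum(m > 30.0 for m in misalign_m)
    over_100 = sum(m > 100.0 for m in misalign_m)
    # Integer comparisons keep the strict '<' exact at the boundary.
    return over_30 * 100 < n and over_100 * 1000 < n
```

Note that a flight writing fewer than 1,000 tiles can only satisfy the 0.1% limit with zero tiles beyond 100 m, which is one reason the row is flagged as reduced-confidence.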
External Dependency Mocks
| External Service | Mock/Stub | How Provided | Behavior |
|---|---|---|---|
| Azaion Suite Satellite Service (ingest API for AC-NEW-7 voting layer) | mock-suite-sat-service Docker service | Local FastAPI stub returning canned tile-publish-acknowledgement responses with deterministic IDs; logs every received tile + per-tile quality metadata to a file the e2e-runner reads back | Returns 202 Accepted on every well-formed publish; returns 400 on malformed; never simulates real voting (the project's role is to publish, the Service's role is to vote per Mode B Fact #105 / D-PROJ-2) |
| ArduPilot Plane FC | ardupilot-plane-sitl Docker service | Open-source SITL build of ArduPilot Plane stable; configured with GPS_TYPE=14 per Source #2 to accept MAVLink GPS_INPUT | Real ArduPilot EKF behavior; we observe but do not patch |
| iNav FC | inav-sitl Docker service | Open-source iNav SITL; GPS provider configured to MSP per docs/SITL/SITL.md | Real iNav GPS subsystem behavior; we observe but do not patch |
| QGroundControl GCS | mavproxy-listener Docker service | Passive MAVLink listener that forwards SUT → GCS stream into a .tlog file the e2e-runner parses | Captures all STATUSTEXT, NAMED_VALUE_FLOAT, downsampled position frames for assertions |
| AI camera (AC-7.x) | NOT MOCKED — out of scope per Phase 1 gate | N/A | NOT COVERED in current matrix — see traceability matrix |
Data Validation Rules
| Data Type | Validation | Invalid Examples | Expected System Behavior |
|---|---|---|---|
| Nav-camera frame | Resolution within ADTi spec (~5472×3648 production, downscaled equivalents allowed in Tier-1 Docker) | 0×0 frame, corrupt JPEG (CVE fixture), wrong color depth | Reject frame, log invalid-input event, do NOT advance estimator state |
| FC IMU sample | SCALED_IMU2 fields present; timestamp monotonic; non-zero accelerometer norm | Missing field, backwards timestamp, NaN | Reject sample, log invalid-input event, propagate estimator from prior valid state |
| Satellite tile manifest | Required fields per restrictions.md: CRS, tile matrix, dimension, lat-adjusted m/px, capture date, source, compression. m/px ≤ 0.5 (0.5 m/px or finer, matching the 0.3-0.5 m/px tile-cache-fixture). capture_date within AC-8.2 freshness window. | Missing capture_date, m/px = 1.0 (coarser than the 0.5 m/px resolution floor), capture_date older than freshness threshold | Reject tile load OR downgrade to non-satellite_anchored source label per AC-NEW-6 |
| Spoofed FC GPS | (FC-side input the SUT detects) | GPS jump >200 m between consecutive 5 Hz frames; FC GPS-health flag toggled to spoofed | SUT switches estimator label to dead_reckoned, stops promoting FC GPS, continues per AC-NEW-8 |
| MAVLink GPS_INPUT outbound | Honest covariance: horiz_accuracy ≥ estimator's 95% covariance semi-major axis | Under-reported covariance | This is a defect (AC-NEW-4); fail NFT-PERF-04 if observed |
| MAVLink message signature | MAVLink 2.0 signed on AP wired channel per D-C8-9 | Unsigned message on signed channel | AP-side rejection (NFT-SEC-03 expected behavior) |
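
The FC IMU sample rule from the table can be sketched as a single validation function. The xacc, yacc and zacc names are the accelerometer fields of the MAVLink SCALED_IMU2 message; representing a sample as a dict, the timestamp_ms key, the reason strings, and treating only a strictly earlier timestamp as "backwards" are assumptions of this sketch.

```python
import math
from typing import Mapping, Optional, Tuple

REQUIRED_FIELDS = ("timestamp_ms", "xacc", "yacc", "zacc")  # hypothetical key names


def validate_imu_sample(sample: Mapping[str, float],
                        prev_timestamp_ms: Optional[float]) -> Tuple[bool, str]:
    """Return (is_valid, reason) for one IMU sample, per the validation table."""
    if any(key not in sample for key in REQUIRED_FIELDS):
        return False, "missing field"
    values = [float(sample[key]) for key in REQUIRED_FIELDS]
    if any(math.isnan(v) for v in values):
        return False, "NaN value"
    # Only a strictly earlier timestamp is rejected here (assumption).
    if prev_timestamp_ms is not None and values[0] < prev_timestamp_ms:
        return False, "backwards timestamp"
    # A zero accelerometer norm is physically implausible for a real sensor.
    if math.hypot(values[1], values[2], values[3]) == 0.0:
        return False, "zero accelerometer norm"
    return True, "ok"
```

On a rejection the caller would log the invalid-input event with the returned reason and keep propagating the estimator from its prior valid state, as the table requires.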