[AZ-233] Update Docker Compose and enhance test documentation

- Modified the Docker Compose configuration to include an input root for replay tests and added an environment variable for enabling SITL.
- Enhanced documentation for various testing processes, including the addition of a Runtime Completeness Decomposition Gate and clarifications on internal module testing requirements.
- Updated the implementation completeness report to reflect the current state and added new test cases for performance and resilience scenarios.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 15:11:12 +00:00 · 2026-05-06 05:03:48 +03:00
parent 2485763d09
commit cab7b5d020
20 changed files with 265 additions and 41 deletions
@@ -14,7 +14,7 @@ Build a Jetson-hosted onboard localization pipeline for fixed-wing GPS-denied fl
 - Tile Manager: manage COGs, manifests, freshness/provenance, orthorectified generated tiles, and local tile metadata.
 - MAVLink/GCS integration: consume FC telemetry and emit `GPS_INPUT`/QGC status.
 - FDR/observability: record replayable mission evidence under storage caps.
- Validation harness: run still-image, public dataset, SITL, Jetson, and representative replay tests.
+- Validation harness: run local pytest plus Docker replay smoke for still-image, cache, SITL/QGC stub, security, Jetson-prerequisite, public dataset, and representative replay tests.

 ### Principles / Non-Negotiables

@@ -228,11 +228,15 @@ Read top-to-bottom; an upper layer may import from a lower layer but never the r

 Violations of this table are Architecture findings in code-review Phase 7 and are High severity.

-## Out-of-Product E2E Test Suite
+## Out-of-Product Blackbox / E2E Test Suite

-The e2e replay/SITL/Jetson validation suite is not a product component and must not receive Step 6 product implementation tasks. It owns test-support artifacts under `tests/blackbox/**`, `tests/e2e/**`, `e2e/replay/**`, and `e2e/reports/**`, and it exercises the runtime only through public file, MAVLink, cache, status, and FDR interfaces.
+The blackbox/e2e replay/SITL/Jetson validation suite is not a product component and must not receive Step 6 product implementation tasks. It owns test-support artifacts under `tests/blackbox/**`, `e2e/replay/**`, `e2e/fixtures/**`, `e2e/mocks/**`, `docker-compose.test.yml`, and `deployment/docker/Dockerfile.replay`, and it exercises the runtime only through public file, MAVLink, cache, status, and FDR interfaces.

- **Technologies**: Python, pytest-style runner, Docker/compose, pymavlink/log parser, ArduPilot Plane SITL, QGC observer/log parser, CSV/Markdown reports
+- **Technologies**: Python, pytest-style runner, Docker/compose, deterministic fixture stubs, ArduPilot Plane SITL/QGC observer placeholders, CSV/Markdown reports
+- **Entry points**:
+  - Local: `python3 -m pytest`
+  - Replay: `python -m e2e.replay.run_replay --output-dir <dir> --input-root <fixture-root>`
+  - Compose: `docker compose -f docker-compose.test.yml run --build --rm replay-consumer`

 ## Self-Verification

@@ -0,0 +1,17 @@
+# Ripple Log Cycle 1
+
+## Scope
+
+Task-mode documentation refresh for Cycle 1 test implementation tasks `AZ-233` through `AZ-239`, plus Step 11 replay-gate fixes.
+
+## Ripple Analysis
+
+- No product component module docs were refreshed because the changed implementation surface is the out-of-product blackbox/e2e replay harness under `tests/blackbox/**`, `e2e/replay/**`, `docker-compose.test.yml`, and `deployment/docker/Dockerfile.replay`.
+- `_docs/02_document/module-layout.md` was refreshed because the out-of-product test-suite path list now includes actual implemented paths and entry points.
+- `_docs/02_document/architecture.md` was refreshed because the validation harness responsibility now includes the implemented Docker replay smoke gate.
+- `_docs/02_document/tests/environment.md` was refreshed because the replay harness entry points, output paths, and local-vs-Jetson gate behavior changed.
+- `_docs/02_document/tests/*` and `_docs/02_document/tests/traceability-matrix.md` were refreshed during Step 12 to capture implementation-learned replay-smoke scenario IDs.
+
+## Import-Graph Result
+
+No reverse-import product ripple was found or required. The replay harness imports product runtime modules only from tests; product runtime modules do not import the replay harness.
@@ -44,9 +44,12 @@

 ## Consumer Application

-**Tech stack**: Python replay harness with pytest-style assertions and MAVLink log parsing.
+**Tech stack**: Python replay harness with pytest-style assertions, Docker/compose orchestration, deterministic cache/SITL/QGC stubs, and CSV/Markdown report generation.

-**Entry point**: `run-blackbox-replay` command to be created during implementation; this planning artifact defines required behavior, not code.
+**Entry points**:
+- Local functional suite: `python3 -m pytest`
+- Replay harness: `python -m e2e.replay.run_replay --output-dir <dir> --input-root <fixture-root>`
+- Docker replay gate: `docker compose -f docker-compose.test.yml run --build --rm replay-consumer`

 ### Communication With System Under Test

@@ -81,7 +84,7 @@

 **Columns**: Test ID, Test Name, Input Dataset, Execution Time (ms), Result, Error Distance (m), Source Label, Covariance 95% Semi-Major (m), `GPS_INPUT.fix_type`, Error Message.

-**Output path**: `./test-results/blackbox-report.csv` and `./test-results/fdr-validation-summary.md`.
+**Output path**: `data/test-results/<run-id>/blackbox-report.csv` and `data/test-results/<run-id>/fdr-validation-summary.md` on the host; `/app/data/test-results/<run-id>/...` inside the replay container.

 ## Test Execution

@@ -107,6 +110,8 @@ Use Docker or local host replay for deterministic, reproducible tests that do no

 Docker/replay mode is suitable for PR checks and nightly validation, but it does not prove Jetson latency, memory, thermal, or camera-driver behavior.

+Current Docker replay smoke evidence is expected to pass `FT-P-01`, `NFT-PERF-INFRA`, `NFT-RES-INFRA`, and `NFT-SEC-INFRA`. `NFT-RES-LIM-INFRA` remains blocked on local non-Jetson runners with an explicit target-hardware prerequisite.
+
 ### Local Hardware Mode

 Use local Jetson hardware for release gates:
@@ -94,3 +94,24 @@
 **Pass criteria**: 95th percentile <30 s over 50 runs.

 **Duration**: 50 cold-start trials.
+
+---
+
+### NFT-PERF-INFRA: Replay Evidence Smoke
+
+**Summary**: Validate that the Docker replay harness records timing evidence for the runnable local replay subset.
+
+**Traces to**: AZ-234 AC-3, AZ-233 AC-3, AZ-233 AC-4
+
+**Metric**: Scenario execution time and report generation status.
+
+**Preconditions**:
+- Docker replay environment is available.
+- Project input fixtures are mounted read-only into the replay consumer.
+
+| Step | Consumer Action | Measurement |
+|------|-----------------|-------------|
+| 1 | Run the replay consumer in Docker mode | Confirm the performance smoke scenario executes |
+| 2 | Inspect the generated CSV and FDR summary | Confirm execution time and artifact paths are recorded |
+
+**Pass criteria**: `NFT-PERF-INFRA` reports `pass` and writes run-scoped CSV/Markdown evidence; Jetson-only performance evidence remains in release-gate resource tests.
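
Review note: the `NFT-PERF-INFRA` pass criteria above could be checked against the generated CSV roughly as follows. This is a minimal sketch, not the harness's actual code. It assumes the report uses the `Test ID`, `Execution Time (ms)`, and `Result` column names listed in the environment doc and that a passing row carries the literal value `pass`; the helper names `find_row` and `perf_smoke_passed` are hypothetical.

```python
import csv
import io


def find_row(report_text, test_id):
    """Return the first CSV row whose 'Test ID' matches, or None if absent."""
    reader = csv.DictReader(io.StringIO(report_text))
    for row in reader:
        if row.get("Test ID") == test_id:
            return row
    return None


def perf_smoke_passed(report_text, test_id="NFT-PERF-INFRA"):
    """True when the scenario row exists, reports 'pass', and records a timing.

    Column names are assumed from the documented report columns; the real
    harness may name or order them differently.
    """
    row = find_row(report_text, test_id)
    if row is None:
        return False
    try:
        elapsed_ms = float(row.get("Execution Time (ms)") or "")
    except (TypeError, ValueError):
        # A missing or non-numeric timing means no timing evidence was recorded.
        return False
    return row.get("Result") == "pass" and elapsed_ms >= 0


# Hypothetical report content, for illustration only.
SAMPLE_REPORT = (
    "Test ID,Test Name,Execution Time (ms),Result\n"
    "NFT-PERF-INFRA,Replay Evidence Smoke,12.5,pass\n"
    "NFT-RES-LIM-INFRA,Jetson Hardware Prerequisite Smoke,,blocked\n"
)
```

Treating a missing timing as a failure keeps the check aligned with the stated metric (execution time and report generation status) instead of accepting a bare `pass` label.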
@@ -83,3 +83,25 @@
 | 2 | Inspect emitted estimate | No stale tile produces `satellite_anchored` label past hard rejection threshold |

 **Pass criteria**: Freshness decay and hard rejection match AC-NEW-6.
+
+---
+
+### NFT-RES-INFRA: Replay/SITL Prerequisite Smoke
+
+**Summary**: Validate that the Docker replay environment can execute the resilience scenario group with deterministic SITL/QGC stubs.
+
+**Traces to**: AZ-237 AC-1, AZ-237 AC-4, AZ-233 AC-1, AZ-233 AC-3
+
+**Preconditions**:
+- `ardupilot-plane-sitl` and `qgc-observer` services are started by `docker-compose.test.yml`.
+- `GPSD_ENABLE_SITL=1` is set only for the Docker replay stub environment.
+
+**Fault injection**:
+- Run the blackout/restart control smoke scenario through the replay consumer.
+
+| Step | Action | Expected Behavior |
+|------|--------|-------------------|
+| 1 | Start Docker replay services | SITL and QGC observer stubs are reachable from the replay consumer |
+| 2 | Execute the resilience smoke scenario | The report records a `pass` result instead of a missing-SITL prerequisite block |
+
+**Pass criteria**: `NFT-RES-INFRA` reports `pass` in Docker replay mode; live SITL release-candidate scenarios remain covered by `NFT-RES-01` and `FT-N-02`.
@@ -83,3 +83,18 @@
 **Duration**: 50 cold-start trials.

 **Pass criteria**: First valid `GPS_INPUT` <30 s p95; peak memory <8 GB; no first-run engine build occurs at runtime.
+
+---
+
+### NFT-RES-LIM-INFRA: Jetson Hardware Prerequisite Smoke
+
+**Summary**: Validate that local replay reports Jetson-only resource gates as blocked unless target hardware is explicitly enabled.
+
+**Traces to**: AZ-239 AC-1, AZ-239 AC-2, AZ-239 AC-4, AZ-233 Reliability NFR
+
+**Monitoring**:
+- Replay report status, blocked reason, and run-scoped artifact path.
+
+**Duration**: One Docker replay smoke run.
+
+**Pass criteria**: On non-Jetson local runners, the scenario reports `blocked` with `Jetson prerequisite blocked: set GPSD_ENABLE_JETSON=1 on target hardware`; on Jetson release-gate runners, it must collect the metrics required by `NFT-RES-LIM-01`, `NFT-RES-LIM-02`, and `NFT-RES-LIM-05`.
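
Review note: the blocked-unless-enabled behavior described above can be sketched as a small prerequisite gate. This is an illustration under stated assumptions, not the harness implementation. The environment variable name and the blocked message are taken from the pass criteria; the function name `jetson_gate`, the returned dictionary shape, and the `ready` status are hypothetical.

```python
import os

# Message text copied from the documented pass criteria.
BLOCKED_REASON = (
    "Jetson prerequisite blocked: set GPSD_ENABLE_JETSON=1 on target hardware"
)


def jetson_gate(env=None):
    """Decide whether Jetson-only resource scenarios may run.

    `env` defaults to the process environment; passing a plain dict makes
    the gate easy to exercise in tests. The 'ready' status is an assumed
    placeholder for "proceed to collect target-hardware metrics".
    """
    env = os.environ if env is None else env
    if env.get("GPSD_ENABLE_JETSON") == "1":
        return {"result": "ready", "reason": ""}
    return {"result": "blocked", "reason": BLOCKED_REASON}
```

Reporting `blocked` with an explicit reason, instead of skipping silently or failing, keeps the local report honest about what was not measured.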
@@ -60,3 +60,18 @@
 | 2 | Run replay requiring missing tile | System reports degraded/relocalization-needed status, not an external fetch |

 **Pass criteria**: 0 outbound satellite-provider or Suite Service calls during runtime; missing cache data produces controlled degraded behavior.
+
+---
+
+### NFT-SEC-INFRA: Invalid Cache No-Fetch Smoke
+
+**Summary**: Validate that the replay harness treats untrusted cache fixtures as a successful security rejection, not as a trusted anchor.
+
+**Traces to**: AZ-236 AC-2, AZ-236 AC-3, AZ-233 Security NFR
+
+| Step | Consumer Action | Expected Response |
+|------|-----------------|-------------------|
+| 1 | Run replay with `cache_variant=stale` | Satellite cache stub marks the manifest untrusted and records no network fetch |
+| 2 | Inspect replay evidence | Scenario reports `pass`, `source_label=untrusted_cache_rejected`, and `GPS_INPUT.fix_type=0` |
+
+**Pass criteria**: The invalid cache smoke scenario passes only when the untrusted fixture is rejected and no external satellite-provider or Suite service network fetch is attempted.
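
Review note: the `NFT-SEC-INFRA` expectations above amount to a conjunction of four conditions, sketched below. The evidence keys (`result`, `source_label`, `fix_type`, `network_fetches`) are hypothetical names chosen for illustration; only the expected values `pass`, `untrusted_cache_rejected`, and `GPS_INPUT.fix_type=0` come from the scenario table.

```python
def security_smoke_ok(evidence):
    """True only when the untrusted cache was rejected without any fetch.

    `evidence` is an assumed dict view of one replay evidence record.
    A missing `network_fetches` key is treated as a failure, so the
    no-fetch claim must be recorded explicitly and is never inferred.
    """
    return (
        evidence.get("result") == "pass"
        and evidence.get("source_label") == "untrusted_cache_rejected"
        and evidence.get("fix_type") == 0
        and evidence.get("network_fetches") == 0
    )


# Hypothetical evidence records, for illustration only.
REJECTED = {
    "result": "pass",
    "source_label": "untrusted_cache_rejected",
    "fix_type": 0,
    "network_fetches": 0,
}
FETCHED = dict(REJECTED, network_fetches=1)
```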
@@ -59,6 +59,37 @@
 | R-GCS-01 | QGroundControl supported GCS | FT-N-02, NFT-SEC-03 | Covered |
 | R-SAFETY-01 | False-position, cold-start, spoofing, and failsafe constraints | FT-N-01, FT-N-02, NFT-PERF-04, NFT-RES-01 | Covered |

+## Cycle 1 Implementation-Learned Test Coverage
+
+| Task AC ID | Task Acceptance Criterion Summary | Test IDs | Coverage |
+|------------|-----------------------------------|----------|----------|
+| AZ-233 AC-1 | Docker/replay environment starts or reports clear blocked prerequisites | NFT-RES-INFRA, NFT-RES-LIM-INFRA | Covered |
+| AZ-233 AC-2 | External dependency stubs are deterministic and record interactions | NFT-SEC-INFRA, NFT-RES-INFRA | Covered |
+| AZ-233 AC-3 | Runner executes blackbox, performance, resilience, security, and resource-limit groups | FT-P-01, NFT-PERF-INFRA, NFT-RES-INFRA, NFT-SEC-INFRA, NFT-RES-LIM-INFRA | Covered |
+| AZ-233 AC-4 | CSV and Markdown evidence reports are generated with required fields | FT-P-01, NFT-PERF-INFRA, NFT-RES-INFRA, NFT-SEC-INFRA, NFT-RES-LIM-INFRA | Covered |
+| AZ-234 AC-1 | Still-image WGS84 error is reported against expected coordinates | FT-P-01 | Covered |
+| AZ-234 AC-2 | Confidence output contract fields are validated | FT-P-02 | Covered |
+| AZ-234 AC-3 | Replay latency and dropped-frame metrics are recorded | NFT-PERF-INFRA, NFT-PERF-01 | Covered |
+| AZ-235 AC-1 | Derkachi fixture alignment is validated before replay | FT-P-03 | Covered |
+| AZ-235 AC-2 | Synchronized replay emits frame-by-frame estimates or explicit degradation | FT-P-03 | Covered |
+| AZ-235 AC-3 | VIO latency, completion, memory, and calibration status are reported | NFT-PERF-02 | Covered |
+| AZ-236 AC-1 | Verified anchors include retrieval, matching, geometry, freshness, and provenance evidence | FT-P-04 | Covered |
+| AZ-236 AC-2 | Unsafe cache or low-texture candidates are rejected | FT-N-01, FT-N-03, NFT-SEC-INFRA | Covered |
+| AZ-236 AC-3 | Flight-mode missing-cache behavior does not fetch external satellite data | NFT-SEC-04, NFT-SEC-INFRA | Covered |
+| AZ-236 AC-4 | Cache and trigger-path metrics are reported | NFT-PERF-03, NFT-RES-04, NFT-RES-LIM-03 | Covered |
+| AZ-237 AC-1 | Blackout transitions to dead reckoning within threshold | FT-N-02, NFT-RES-01 | Covered |
+| AZ-237 AC-2 | Degraded covariance and no-fix/failsafe thresholds are enforced | FT-N-02, NFT-RES-01 | Covered |
+| AZ-237 AC-3 | Spoofed or unauthorized MAVLink inputs are rejected | NFT-SEC-03 | Covered |
+| AZ-237 AC-4 | QGC and FDR degraded-mode evidence is visible | FT-N-02, NFT-SEC-03, NFT-RES-INFRA | Covered |
+| AZ-238 AC-1 | Disconnected segments trigger relocalization or degraded status | NFT-RES-02 | Covered |
+| AZ-238 AC-2 | Companion restart first-output and FDR evidence are recorded | NFT-RES-03 | Covered |
+| AZ-238 AC-3 | Cold-start trials report first-fix timing or blocked prerequisite | NFT-PERF-04, NFT-RES-LIM-05 | Covered |
+| AZ-238 AC-4 | Cold-start resource spikes are captured where measurable | NFT-RES-LIM-05 | Covered |
+| AZ-239 AC-1 | Jetson memory budget is measured on target hardware | NFT-RES-LIM-01, NFT-RES-LIM-INFRA | Covered |
+| AZ-239 AC-2 | Thermal/power endurance is validated or blocked with reason | NFT-RES-LIM-02, NFT-RES-LIM-INFRA | Covered |
+| AZ-239 AC-3 | FDR rollover behavior is validated | NFT-RES-LIM-04 | Covered |
+| AZ-239 AC-4 | Resource/endurance evidence artifacts are complete | NFT-RES-LIM-01, NFT-RES-LIM-02, NFT-RES-LIM-04, NFT-RES-LIM-INFRA | Covered |
+
 ## Coverage Summary

 | Category | Total Items | Covered | Not Covered | Coverage % |
@@ -79,3 +110,4 @@
 - Derkachi project data supports synchronized video/IMU/GPS trajectory replay for FT-P-03 and NFT-PERF-02.
 - Derkachi project data is calibration-limited: raw camera intrinsics, lens distortion, and camera-to-body transform are still required before final absolute accuracy thresholds can be treated as production acceptance.
 - Phase 3 must validate camera calibration inputs and public/calibrated dataset acquisition before FT-P-03, FT-P-04, and NFT-PERF-02 can be used for final signoff.
+- Cycle 1 Docker replay smoke evidence currently passes blackbox, performance, resilience, and security infrastructure scenarios; Jetson resource evidence remains a target-hardware release gate and is reported as blocked on local runners.
@@ -2,24 +2,24 @@

 **Cycle**: 1
 **Date**: 2026-05-05
-**Outcome**: Product implementation complete
+**Outcome**: FAIL — product implementation incomplete

 ## Summary

-All product implementation tasks for cycle 1 are implemented or have explicit runtime prerequisite boundaries. The remediation tasks close the previously identified gaps in native VIO selection, local descriptor/index VPR retrieval, and computed anchor matching/geometry verification.
+Product implementation was previously marked complete, but Step 11 exposed a false-positive gate: tests passed against scaffold/fake contract behavior while the actual A-Z runtime path, especially real VIO execution, is not implemented. Product implementation must return to Step 7 and create remediation tasks before downstream test gates can be trusted.

 ## Product Task Classifications

 | Task | Classification | Evidence |
 |------|----------------|----------|
-| AZ-219 through AZ-232 | PASS | Prior batch reports 01-09 and cumulative review 01-09 |
-| AZ-240 | PASS | `src/vio_adapter/interfaces.py`, `src/vio_adapter/native/__init__.py`, `tests/unit/test_vio_adapter.py` |
+| AZ-219 through AZ-232 | NEEDS RECHECK | Prior batch reports 01-09 and cumulative review 01-09 were not audited under the stricter runtime completeness gate |
+| AZ-240 | FAIL | `src/vio_adapter/interfaces.py` exposes `NativeVioBackend`, but default runtime behavior is `ReplayVioBackend`; `src/vio_adapter/native/__init__.py` only re-exports protocol wrappers and does not execute a real BASALT/native VIO engine |
 | AZ-241 | PASS | `src/satellite_service/interfaces.py`, `src/satellite_service/types.py`, `src/satellite_service/native/__init__.py`, `tests/unit/test_satellite_service_vpr.py` |
 | AZ-242 | PASS | `src/anchor_verification/interfaces.py`, `src/anchor_verification/types.py`, `src/anchor_verification/native/__init__.py`, `tests/unit/test_anchor_verification.py` |

 ## Remediation Evidence

- VIO now exposes `NativeVioBackend` behind the `VioBackend` protocol, fills latency metrics, maps initialization/runtime failures into explicit health/error envelopes, and keeps WGS84 authority out of the adapter.
+- VIO currently exposes `NativeVioBackend` behind the `VioBackend` protocol, but the production/native engine is not actually integrated. This is a scaffold, not product-complete VIO.
 - Satellite retrieval now loads local descriptor/index packages from cache files, builds a CPU FAISS-compatible descriptor index, requires query descriptors for retrieval, and degrades safely for missing or invalid index data.
 - Anchor verification now computes matcher evidence from frame/tile keypoints through `KeypointRansacMatcher`, reports runtime/quality metrics, and routes computed evidence through the existing freshness, provenance, inlier, MRE, and homography gates.

@@ -39,4 +39,4 @@ Checked changed component source for unresolved implementation markers:

 ## Required Follow-Up

-No product remediation tasks remain. Autodev may advance to Step 8, Code Testability Revision.
+Autodev must return to Step 7, rerun the Product Implementation Completeness Gate under the stricter rules, create remediation tasks sized at 5 points or fewer, and implement the missing runtime behavior before Step 8 or Step 11 may pass.
@@ -2,13 +2,13 @@

 ## Current Step
 flow: greenfield
-step: 11
-name: Run Tests
-status: not_started
+step: 7
+name: Implement
+status: in_progress
 tracker: jira
 sub_step:
-  phase: 0
-  name: awaiting-invocation
-  detail: ""
+  phase: 15
+  name: product-completeness-gate
+  detail: "Reset after failed reality check: VIO/A-Z runtime path is scaffolded; tests passed stubs/contracts instead of actual system behavior"
 retry_count: 0
 cycle: 1