[AZ-233] Update Docker Compose and enhance test documentation

- Modified the Docker Compose configuration to include an input root for replay tests and added an environment variable for enabling SITL.
- Enhanced documentation for various testing processes, including the addition of a Runtime Completeness Decomposition Gate and clarifications on internal module testing requirements.
- Updated the implementation completeness report to reflect the current state and added new test cases for performance and resilience scenarios.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-21 07:01:14 +00:00 · 2026-05-06 05:03:48 +03:00
parent 2485763d09
commit cab7b5d020
20 changed files with 265 additions and 41 deletions
@@ -225,7 +225,7 @@ State-driven: reached by auto-chain from Step 10.

 Action: Read and execute `.cursor/skills/test-run/SKILL.md`

-Verifies the implemented unit, integration, blackbox, and e2e tests pass before proceeding to spec and documentation sync.
+Verifies the implemented unit, integration, blackbox, and e2e tests pass before proceeding to spec and documentation sync. This is a hard product gate, not a harness-smoke gate: e2e/blackbox tests must exercise the actual implemented system through public runtime boundaries and compare actual outputs against `_docs/00_problem/input_data/expected_results/results_report.md` or referenced machine-readable expected-result files. Stubs are allowed only for external systems outside the product boundary; missing internal product implementation must fail or block the gate and send the flow back to Implement.

 ---

@@ -43,6 +43,21 @@ For each component (or the single provided component):
    Consumers read the contract file, not the producer's task spec. This prevents interface drift when the producer's implementation detail leaks into consumers.
 11. **Immediately after writing each task file**: create a work item ticket, link it to the component's epic, write the work item ticket ID and Epic ID back into the task header, then rename the file from `todo/[##]_[short_name].md` to `todo/[TRACKER-ID]_[short_name].md`.

+## Runtime Completeness Decomposition Gate
+
+Before Step 2 is considered complete, scan `architecture.md`, `system-flows.md`, component descriptions, and the solution for named internal runtime capabilities and dependencies. Examples include BASALT/OpenVINS/Kimera, FAISS, DINOv2, ONNX/TensorRT, ALIKED/DISK, LightGlue, RANSAC, PostGIS, MAVLink emission, FDR rollover, and any "A-Z" user-visible pipeline.
+
+For every named internal capability:
+
+1. Ensure at least one implementation task explicitly owns the production integration or production algorithm.
+2. Do not treat "define protocol", "create adapter boundary", "add deterministic fallback", "create scaffold", or "prepare native bridge" as implementation of the capability unless the architecture explicitly says the real capability is out of scope.
+3. If a capability needs external hardware/data to verify, still create the production implementation task. Verification may be hardware-gated later; implementation must not be omitted.
+4. Add a `## Runtime Completeness` section to any affected task with:
+   - named capability/dependency,
+   - production code that must exist,
+   - allowed external stubs, if any,
+   - unacceptable substitutes such as fake/deterministic/internal stubs.
+
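The ownership rule in this gate can be sketched as a small set check. This is only an illustration: the `Task` record, its `kind` labels, and the task IDs are hypothetical and do not come from the skill files.

```python
from dataclasses import dataclass

# Hypothetical task kinds. Only "production" counts as owning a capability;
# the others mirror the substitutes the gate refuses to accept.
NON_IMPLEMENTATION_KINDS = {
    "protocol", "adapter_boundary", "fallback", "scaffold", "native_bridge",
}


@dataclass(frozen=True)
class Task:
    task_id: str
    capability: str
    kind: str  # "production" or one of NON_IMPLEMENTATION_KINDS


def unowned_capabilities(capabilities, tasks, out_of_scope=frozenset()):
    """Return named capabilities that no production task owns."""
    owned = {t.capability for t in tasks if t.kind == "production"}
    return sorted(
        c for c in capabilities if c not in owned and c not in out_of_scope
    )


tasks = [
    Task("T-001", "BASALT", "native_bridge"),  # bridge only, not an implementation
    Task("T-002", "FAISS", "production"),
]
missing = unowned_capabilities(
    {"BASALT", "FAISS", "PostGIS"}, tasks, out_of_scope={"PostGIS"}
)
print(missing)  # ['BASALT']
```

A non-empty result would mean Step 2 is not complete: each listed capability still needs a production implementation task.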
 ## Self-verification (per component)

 - [ ] Every task is atomic (single concern)
@@ -53,6 +68,7 @@ For each component (or the single provided component):
 - [ ] Every task has a work item ticket linked to the correct epic
 - [ ] Every shared-models / shared-API task has a contract file at `_docs/02_document/contracts/<component>/<name>.md` and a `## Contract` section linking to it
 - [ ] Every cross-cutting concern appears exactly once as a shared task, not N per-component copies
+- [ ] Every named internal runtime capability has a production implementation task, not only an interface/scaffold/fallback task

 ## Save action

@@ -13,12 +13,17 @@
 1. Read all test specs from `DOCUMENT_DIR/tests/` (`blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, `resource-limit-tests.md`)
 2. Group related test scenarios into atomic tasks (e.g., one task per test category or per component under test)
 3. Each task should reference the specific test scenarios it implements and the environment/test-data specs
-4. Dependencies:
+4. Add a **System Under Test Boundary** section to every e2e/blackbox test task:
+   - The test must drive the product through public runtime boundaries and compare actual outputs to `_docs/00_problem/input_data/expected_results/results_report.md` and any referenced machine-readable expected-result files.
+   - Stubs are allowed only for external systems outside the product boundary: flight controller/SITL, QGC observer, satellite-provider/Suite service, physical Jetson hardware, physical camera, licensed public datasets, and network services.
+   - Stubs, fakes, deterministic fallbacks, monkeypatches, or direct imports are not allowed for internal product modules that the scenario is meant to validate, such as VIO, safety/anchor wrapper, satellite retrieval, anchor verification, tile manager, MAVLink output adapter, or FDR.
+   - If an internal module is not implemented, the test must fail/block as missing product implementation; it must not pass by replacing that module with a test stub.
+5. Dependencies:
   - In tests-only mode: blackbox test tasks depend on the test infrastructure bootstrap task (Step 1t)
-5. Write each task spec using `templates/task.md`
-6. Estimate complexity per task (1, 2, 3, 5 points); no task should exceed 5 points — split if it does
-7. Note task dependencies (referencing tracker IDs of already-created dependency tasks)
-8. **Immediately after writing each task file**: create a work item ticket under the "Blackbox Tests" epic, write the work item ticket ID and Epic ID back into the task header, then rename the file from `todo/[##]_[short_name].md` to `todo/[TRACKER-ID]_[short_name].md`.
+6. Write each task spec using `templates/task.md`
+7. Estimate complexity per task (1, 2, 3, 5 points); no task should exceed 5 points — split if it does
+8. Note task dependencies (referencing tracker IDs of already-created dependency tasks)
+9. **Immediately after writing each task file**: create a work item ticket under the "Blackbox Tests" epic, write the work item ticket ID and Epic ID back into the task header, then rename the file from `todo/[##]_[short_name].md` to `todo/[TRACKER-ID]_[short_name].md`.
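
One way to spot-check the System Under Test Boundary is to scan e2e/blackbox test source for patches that target internal product modules. The sketch below is a rough heuristic, not part of the skill: the package prefixes are assumed from the `src/` paths named elsewhere in this commit, and only string-target `monkeypatch.setattr(...)` and `mock.patch(...)` calls are detected.

```python
import re

# Assumed internal package names; adjust to the real product layout.
INTERNAL_PREFIXES = ("vio_adapter", "satellite_service", "anchor_verification")

# Matches monkeypatch.setattr("pkg.mod.attr", ...) and mock.patch("pkg.mod.attr").
PATCH_CALL = re.compile(
    r"""(?:monkeypatch\.setattr|mock\.patch)\(\s*["']([\w.]+)["']"""
)


def internal_patches(test_source):
    """Return (line number, target) pairs that patch internal product modules."""
    findings = []
    for lineno, line in enumerate(test_source.splitlines(), start=1):
        for target in PATCH_CALL.findall(line):
            if target.split(".")[0] in INTERNAL_PREFIXES:
                findings.append((lineno, target))
    return findings


sample = '''
def test_replay(monkeypatch):
    monkeypatch.setattr("vio_adapter.interfaces.NativeVioBackend", object)
    monkeypatch.setattr("requests.get", lambda *a, **k: None)
'''
print(internal_patches(sample))
# [(3, 'vio_adapter.interfaces.NativeVioBackend')]
```

The `requests.get` patch is not flagged because it stands in for a network service outside the product boundary.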

 ## Self-verification

@@ -27,6 +32,7 @@
 - [ ] No task exceeds 5 complexity points
 - [ ] Dependencies correctly reference the test infrastructure task
 - [ ] Every task has a work item ticket linked to the "Blackbox Tests" epic
+- [ ] Every e2e/blackbox task forbids internal product stubs/fakes and requires comparison against expected-results artifacts

 ## Save action

@@ -10,6 +10,8 @@
 2. Check no gaps:
   - In implementation mode: every product interface in `architecture.md` has implementation task coverage
   - In tests-only mode: every test scenario in `traceability-matrix.md` is covered by a task
+   - In implementation mode: every named internal runtime capability/dependency from architecture, solution, system flows, and component descriptions has a production implementation task, not only an interface/scaffold/fallback task
+   - In tests-only mode: every e2e/blackbox task has a System Under Test Boundary section that forbids stubbing internal product modules and requires comparison to expected-results artifacts
 3. Check no overlaps: tasks don't duplicate work
 4. Check no circular dependencies in the task graph
 5. Produce `_dependencies_table.md` using `templates/dependencies-table.md`
@@ -19,6 +21,7 @@
 ### Implementation mode

 - [ ] Every product interface in `architecture.md` is covered by at least one implementation task
+- [ ] Every named internal runtime capability has a production implementation task
 - [ ] No circular dependencies in the task graph
 - [ ] Cross-component dependencies are explicitly noted in affected task specs
 - [ ] `_dependencies_table.md` contains every task with correct dependencies
@@ -26,6 +29,7 @@
 ### Tests-only mode

 - [ ] Every test scenario from `traceability-matrix.md` "Covered" entries has a corresponding task
+- [ ] Every e2e/blackbox task validates actual product behavior and allows stubs only for external systems
 - [ ] No circular dependencies in the task graph
 - [ ] Test task dependencies reference the test infrastructure bootstrap
 - [ ] `_dependencies_table.md` contains every task with correct dependencies
@@ -25,7 +25,8 @@ For each task the main agent receives a task spec, analyzes the codebase, implem
 - **Dependency-aware ordering**: tasks run only when all their dependencies are satisfied
 - **Batching for review, not parallelism**: tasks are grouped into batches so `/code-review` and commits operate on a coherent unit of work — all tasks inside a batch are still implemented one after the other
 - **Integrated review**: `/code-review` skill runs automatically after each batch
- **Completeness before testing**: product implementation is not done until code is checked against task outcomes, included scope, architecture/component promises, and unresolved scaffold/native placeholders — not just task AC tests
+- **Completeness before testing**: product implementation is not done until code is checked against task outcomes, included scope, architecture/component promises, named runtime dependencies, and unresolved scaffold/native placeholders — not just task AC tests
+- **Runtime dependency reality**: production code cannot satisfy a task by exposing only a protocol, fake runner, deterministic fallback, or "native bridge" placeholder when the task/architecture promises a concrete internal capability such as BASALT VIO, FAISS retrieval, LightGlue matching, or a full A-Z localization pipeline. Stubs are allowed only for external systems and tests.
 - **Auto-start**: batches start immediately — no user confirmation before a batch
 - **Gate on failure**: user confirmation is required only when code review returns FAIL
 - **Commit per batch**: after each batch is confirmed, commit. Ask the user whether to push to remote unless the user previously opted into auto-push for this session.
@@ -66,6 +67,7 @@ TASKS_DIR/
 ## Prerequisite Checks (BLOCKING)

 1. `TASKS_DIR/todo/` exists and contains at least one task file for the selected context — **STOP if missing**
+   - Exception for Product implementation re-entry: if no selected product tasks remain in `todo/`, but the active autodev state is Step 7 or the latest product completeness report is missing/invalid/contains `FAIL`, skip directly to Step 15 (Product Implementation Completeness Gate). This gate may create remediation tasks and return to Step 1. Do not write a final implementation report from this state.
 2. `_dependencies_table.md` exists — **STOP if missing**
 3. At least one task is not yet completed — **STOP if all done**
 4. **Working tree is clean** — run `git status --porcelain`; the output must be empty.
@@ -129,7 +131,7 @@ For each task in the batch, transition its ticket status to **In Progress** via
 For each task in the batch **in topological order, one at a time**:
 1. Read the task spec file.
 2. Respect the file-ownership envelope computed in Step 4 (OWNED / READ-ONLY / FORBIDDEN).
-3. Implement the feature and write/update tests for every acceptance criterion in the spec. If a test cannot run in the current environment (e.g., TensorRT requires GPU), the test must still be written and skip with a clear reason.
+3. Implement the feature and write/update tests for every acceptance criterion in the spec. Tests for internal product behavior must exercise the production implementation path. If a test cannot run in the current environment (e.g., TensorRT requires GPU), the test must still exist and skip/block with a clear prerequisite reason, but that skip does not make missing production code complete.
 4. Run the relevant tests locally before moving on to the next task in the batch. If tests fail, fix in-place — do not defer.
 5. Capture a short per-task status line (files changed, tests pass/fail, any blockers) for the batch report.

@@ -255,9 +257,13 @@ For each completed product task:
 1. Read these sections from the task spec: `Description`, `Outcome`, `Scope / Included`, `Acceptance Criteria`, `Non-Functional Requirements`, `Constraints`, and explicit named technologies or integrations.
 2. Compare those promises against actual source code, not only tests or report prose.
 3. Search the task's owned component files for unresolved implementation markers: `placeholder`, `stub`, `reserved`, `TODO`, `NotImplemented`, `pass`, `deterministic`, `fake`, `mock`, `scaffold`, `native bridge`, and empty native/readme-only integration directories. Ignore test fixtures/mocks only when they are under test-owned paths and not used as production behavior.
-4. Verify that each named runtime dependency in the task promise is either integrated behind the approved boundary or explicitly documented as a blocked prerequisite in the task/report. Examples: if a task promises FAISS, DINOv2, BASALT, LightGlue, OpenCV, RANSAC, a database, cloud service, or hardware SDK, the production code must contain that integration boundary; a deterministic fallback alone is not complete.
-5. Verify tests exercise the real implementation path where local prerequisites exist. Environment-gated tests may skip only with an explicit prerequisite reason; they do not make missing production code complete.
-6. Classify each task:
+4. Verify that each named runtime dependency in the task promise is integrated as production behavior, not merely represented by an interface. Examples: if a task promises FAISS, DINOv2, BASALT, LightGlue, OpenCV, RANSAC, a database, cloud service, or hardware SDK, the production code must either call that dependency or contain an adapter that loads and executes the real dependency package. A deterministic fallback, fake runner, empty `native/` package, or "bridge to be supplied later" is **FAIL** unless the task itself explicitly scoped the dependency out before implementation started.
+5. Distinguish internal implementation from external prerequisites:
+   - Internal product capabilities (VIO, anchor verification, cache retrieval, safety wrapper, FDR, MAVLink emission) must be implemented in production code before the task can pass.
+   - External systems/hardware/data (Jetson device, physical camera, ArduPilot process, QGC, third-party service credentials, unavailable licensed dataset) may be `BLOCKED` only when production code exists and the missing prerequisite is outside the product boundary.
+6. Verify tests exercise the real implementation path where local prerequisites exist. Environment-gated tests may skip only with an explicit prerequisite reason; they do not make missing production code complete.
+7. For any architecture promise that describes an end-to-end user outcome, verify there is an executable production pipeline connecting the relevant components. Isolated component contracts and test-only harness orchestration are not enough.
+8. Classify each task:
   - **PASS**: task promises are implemented or explicitly out of scope in the task itself.
   - **BLOCKED**: production code exists but cannot be fully verified due to external hardware/data/license/runtime prerequisites; the blocker is explicit and tests report blocked/skipped with reason.
   - **FAIL**: promised production behavior is missing, only scaffolded, or only represented in tests/reports.
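
The three classifications can be read as a short decision function. The sketch below assumes four boolean inputs that a reviewer would establish by hand; the parameter names are illustrative. Treating an undocumented external blocker as FAIL is an assumption, since the text only defines BLOCKED for explicit blockers.

```python
from enum import Enum


class Classification(Enum):
    PASS = "PASS"
    BLOCKED = "BLOCKED"
    FAIL = "FAIL"


def classify_task(
    production_code_exists,
    out_of_scope_in_task,
    external_prereq_missing,
    blocker_is_explicit,
):
    """Classify one completed product task for the completeness gate."""
    if out_of_scope_in_task:
        # The task itself scoped the promise out before implementation started.
        return Classification.PASS
    if not production_code_exists:
        # Missing internal code is never an environment block.
        return Classification.FAIL
    if external_prereq_missing:
        # Assumption: an unexplained blocker is treated as FAIL.
        if blocker_is_explicit:
            return Classification.BLOCKED
        return Classification.FAIL
    return Classification.PASS


# Code exists, but verification needs target hardware and the report says so.
print(classify_task(True, False, True, True).value)   # BLOCKED
# Only a scaffold exists; absent hardware does not excuse it.
print(classify_task(False, False, True, True).value)  # FAIL
```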
@@ -32,6 +32,17 @@ After selecting a mode, read its corresponding workflow below; do not mix them.

 ## Functional Mode

+### 0. System-Under-Test Reality Gate
+
+Before accepting any functional, blackbox, or e2e result as a pass, verify what the tests actually exercised.
+
+1. If `_docs/00_problem/input_data/expected_results/results_report.md` exists, at least one e2e/blackbox run must compare actual product outputs against that mapping or the machine-readable files it references.
+2. Stubs are allowed only for external systems outside the product boundary: flight controller/SITL, QGC observer, satellite-provider/Suite service, physical Jetson hardware, physical camera, unavailable licensed datasets, and network services.
+3. Stubs, fakes, deterministic fallbacks, monkeypatches, or direct replacement of internal product modules are not allowed for the behavior under test. Internal examples include VIO, safety/anchor wrapper, satellite retrieval, anchor verification, tile manager, MAVLink output adapter, FDR, and the A-Z localization pipeline.
+4. If tests pass only because an internal module is fake/scaffolded, classify the run as **failed** with category `missing product implementation`.
+5. If a scenario is blocked because external hardware/data is absent, verify the production code path exists before accepting the block as legitimate. Missing internal production code is not an environment block.
+6. If the test runner writes CSV/Markdown reports, inspect them. A zero exit code is not enough; blocked/internal-stubbed scenarios still require classification.
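
Item 6 can be illustrated with a short report reader. The sketch assumes the CSV uses the `Test ID`, `Result`, and `Error Message` columns listed in the test environment spec, that `Result` holds values such as `pass` or `blocked`, and that the reviewer supplies the set of scenarios known to have run against an internal stub. The scenario IDs and the verdict strings are hypothetical.

```python
import csv
import io


def classify_report(csv_text, internal_stubbed_ids=frozenset()):
    """Map each Test ID to a gate verdict; the process exit code is ignored."""
    verdicts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        test_id = row["Test ID"]
        result = row["Result"].strip().lower()
        if test_id in internal_stubbed_ids:
            # A pass against a fake internal module is not a pass.
            verdicts[test_id] = "failed: missing product implementation"
        elif result == "pass":
            verdicts[test_id] = "pass"
        elif result == "blocked":
            verdicts[test_id] = "blocked: " + row["Error Message"]
        else:
            verdicts[test_id] = "failed"
    return verdicts


report = (
    "Test ID,Result,Error Message\n"
    "SCENARIO-A,pass,\n"
    "SCENARIO-B,pass,\n"
    "SCENARIO-C,blocked,target hardware not enabled\n"
)
for test_id, verdict in classify_report(report, {"SCENARIO-B"}).items():
    print(test_id, "->", verdict)
```

A `blocked` verdict still needs the check from item 5: the production code path must exist before the block is accepted as legitimate.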
+
 ### 1. Detect Test Runner

 Check in order — first match wins:
@@ -94,7 +105,7 @@ Categorize skips as: **explicit skip (dead code)**, **runtime skip (unreachable)

 ### 5. Handle Outcome

-**All tests pass, zero skipped** → return success to the autodev for auto-chain.
+**All tests pass, zero skipped, and the System-Under-Test Reality Gate passes** → return success to the autodev for auto-chain.

 **Any test fails or errors** → this is a **blocking gate**. Never silently ignore failures. **Always investigate the root cause before deciding on an action.** Read the failing test code, read the error output, check service logs if applicable, and determine whether the bug is in the test or in the production code.

@@ -14,7 +14,7 @@ Build a Jetson-hosted onboard localization pipeline for fixed-wing GPS-denied fl
 - Tile Manager: manage COGs, manifests, freshness/provenance, orthorectified generated tiles, and local tile metadata.
 - MAVLink/GCS integration: consume FC telemetry and emit `GPS_INPUT`/QGC status.
 - FDR/observability: record replayable mission evidence under storage caps.
-- Validation harness: run still-image, public dataset, SITL, Jetson, and representative replay tests.
+- Validation harness: run local pytest plus Docker replay smoke for still-image, cache, SITL/QGC stub, security, Jetson-prerequisite, public dataset, and representative replay tests.

 ### Principles / Non-Negotiables

@@ -228,11 +228,15 @@ Read top-to-bottom; an upper layer may import from a lower layer but never the r

 Violations of this table are Architecture findings in code-review Phase 7 and are High severity.

-## Out-of-Product E2E Test Suite
+## Out-of-Product Blackbox / E2E Test Suite

-The e2e replay/SITL/Jetson validation suite is not a product component and must not receive Step 6 product implementation tasks. It owns test-support artifacts under `tests/blackbox/**`, `tests/e2e/**`, `e2e/replay/**`, and `e2e/reports/**`, and it exercises the runtime only through public file, MAVLink, cache, status, and FDR interfaces.
+The blackbox/e2e replay/SITL/Jetson validation suite is not a product component and must not receive Step 6 product implementation tasks. It owns test-support artifacts under `tests/blackbox/**`, `e2e/replay/**`, `e2e/fixtures/**`, `e2e/mocks/**`, `docker-compose.test.yml`, and `deployment/docker/Dockerfile.replay`, and it exercises the runtime only through public file, MAVLink, cache, status, and FDR interfaces.

-- **Technologies**: Python, pytest-style runner, Docker/compose, pymavlink/log parser, ArduPilot Plane SITL, QGC observer/log parser, CSV/Markdown reports
+- **Technologies**: Python, pytest-style runner, Docker/compose, deterministic fixture stubs, ArduPilot Plane SITL/QGC observer placeholders, CSV/Markdown reports
+- **Entry points**:
+  - Local: `python3 -m pytest`
+  - Replay: `python -m e2e.replay.run_replay --output-dir <dir> --input-root <fixture-root>`
+  - Compose: `docker compose -f docker-compose.test.yml run --build --rm replay-consumer`

 ## Self-Verification

@@ -0,0 +1,17 @@
+# Ripple Log Cycle 1
+
+## Scope
+
+Task-mode documentation refresh for Cycle 1 test implementation tasks `AZ-233` through `AZ-239`, plus Step 11 replay-gate fixes.
+
+## Ripple Analysis
+
+- No product component module docs were refreshed because the changed implementation surface is the out-of-product blackbox/e2e replay harness under `tests/blackbox/**`, `e2e/replay/**`, `docker-compose.test.yml`, and `deployment/docker/Dockerfile.replay`.
+- `_docs/02_document/module-layout.md` was refreshed because the out-of-product test-suite path list now includes actual implemented paths and entry points.
+- `_docs/02_document/architecture.md` was refreshed because the validation harness responsibility now includes the implemented Docker replay smoke gate.
+- `_docs/02_document/tests/environment.md` was refreshed because the replay harness entry points, output paths, and local-vs-Jetson gate behavior changed.
+- `_docs/02_document/tests/*` and `_docs/02_document/tests/traceability-matrix.md` were refreshed during Step 12 to capture implementation-learned replay-smoke scenario IDs.
+
+## Import-Graph Result
+
+No reverse-import product ripple was found or required. The replay harness imports product runtime modules only from tests; product runtime modules do not import the replay harness.
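
A reverse-import check of this kind can be approximated with the standard `ast` module. The sketch below is an assumption about how such a check might look, not the tool that produced the result above. It treats `e2e` and `tests` as the harness root packages and inspects one module's source text at a time.

```python
import ast


def harness_imports(source, harness_roots=("e2e", "tests")):
    """Return imported module names that belong to the test harness."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        hits.extend(n for n in names if n.split(".")[0] in harness_roots)
    return sorted(hits)


product_module = "import os\nfrom e2e.replay import run_replay\n"
print(harness_imports(product_module))  # ['e2e.replay']
```

Running this over every product runtime module and expecting an empty list for each would confirm the one-way dependency described above.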
@@ -44,9 +44,12 @@

 ## Consumer Application

-**Tech stack**: Python replay harness with pytest-style assertions and MAVLink log parsing.
+**Tech stack**: Python replay harness with pytest-style assertions, Docker/compose orchestration, deterministic cache/SITL/QGC stubs, and CSV/Markdown report generation.

-**Entry point**: `run-blackbox-replay` command to be created during implementation; this planning artifact defines required behavior, not code.
+**Entry points**:
+- Local functional suite: `python3 -m pytest`
+- Replay harness: `python -m e2e.replay.run_replay --output-dir <dir> --input-root <fixture-root>`
+- Docker replay gate: `docker compose -f docker-compose.test.yml run --build --rm replay-consumer`

 ### Communication With System Under Test

@@ -81,7 +84,7 @@

 **Columns**: Test ID, Test Name, Input Dataset, Execution Time (ms), Result, Error Distance (m), Source Label, Covariance 95% Semi-Major (m), `GPS_INPUT.fix_type`, Error Message.

-**Output path**: `./test-results/blackbox-report.csv` and `./test-results/fdr-validation-summary.md`.
+**Output path**: `data/test-results/<run-id>/blackbox-report.csv` and `data/test-results/<run-id>/fdr-validation-summary.md` on the host; `/app/data/test-results/<run-id>/...` inside the replay container.

 ## Test Execution

@@ -107,6 +110,8 @@ Use Docker or local host replay for deterministic, reproducible tests that do no

 Docker/replay mode is suitable for PR checks and nightly validation, but it does not prove Jetson latency, memory, thermal, or camera-driver behavior.

+Current Docker replay smoke evidence is expected to pass `FT-P-01`, `NFT-PERF-INFRA`, `NFT-RES-INFRA`, and `NFT-SEC-INFRA`. `NFT-RES-LIM-INFRA` remains blocked on local non-Jetson runners with an explicit target-hardware prerequisite.
+
 ### Local Hardware Mode

 Use local Jetson hardware for release gates:
@@ -94,3 +94,24 @@
 **Pass criteria**: 95th percentile <30 s over 50 runs.

 **Duration**: 50 cold-start trials.
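
As a worked example of the pass criterion, the sketch below computes a 95th percentile over 50 trials with the nearest-rank method. The spec does not say which percentile definition the harness uses, so nearest-rank is an assumption, and the durations are invented for illustration.

```python
import math


def p95_nearest_rank(samples):
    """Nearest-rank 95th percentile: the value at 1-based rank ceil(0.95 * n)."""
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


# 50 illustrative cold-start durations in seconds.
durations = [20.0] * 47 + [29.0, 31.0, 40.0]
p95 = p95_nearest_rank(durations)
print(p95, p95 < 30.0)  # 29.0 True
```

With 50 trials the rank is 48, so up to two trials may exceed 30 s without failing the criterion under this definition.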
+
+---
+
+### NFT-PERF-INFRA: Replay Evidence Smoke
+
+**Summary**: Validate that the Docker replay harness records timing evidence for the runnable local replay subset.
+
+**Traces to**: AZ-234 AC-3, AZ-233 AC-3, AZ-233 AC-4
+
+**Metric**: Scenario execution time and report generation status.
+
+**Preconditions**:
+- Docker replay environment is available.
+- Project input fixtures are mounted read-only into the replay consumer.
+
+| Step | Consumer Action | Measurement |
+|------|-----------------|-------------|
+| 1 | Run the replay consumer in Docker mode | Confirm the performance smoke scenario executes |
+| 2 | Inspect the generated CSV and FDR summary | Confirm execution time and artifact paths are recorded |
+
+**Pass criteria**: `NFT-PERF-INFRA` reports `pass` and writes run-scoped CSV/Markdown evidence; Jetson-only performance evidence remains in release-gate resource tests.
@@ -83,3 +83,25 @@
 | 2 | Inspect emitted estimate | No stale tile produces `satellite_anchored` label past hard rejection threshold |

 **Pass criteria**: Freshness decay and hard rejection match AC-NEW-6.
+
+---
+
+### NFT-RES-INFRA: Replay/SITL Prerequisite Smoke
+
+**Summary**: Validate that the Docker replay environment can execute the resilience scenario group with deterministic SITL/QGC stubs.
+
+**Traces to**: AZ-237 AC-1, AZ-237 AC-4, AZ-233 AC-1, AZ-233 AC-3
+
+**Preconditions**:
+- `ardupilot-plane-sitl` and `qgc-observer` services are started by `docker-compose.test.yml`.
+- `GPSD_ENABLE_SITL=1` is set only for the Docker replay stub environment.
+
+**Fault injection**:
+- Run the blackout/restart control smoke scenario through the replay consumer.
+
+| Step | Action | Expected Behavior |
+|------|--------|-------------------|
+| 1 | Start Docker replay services | SITL and QGC observer stubs are reachable to the replay consumer |
+| 2 | Execute the resilience smoke scenario | The report records a `pass` result instead of a missing-SITL prerequisite block |
+
+**Pass criteria**: `NFT-RES-INFRA` reports `pass` in Docker replay mode; live SITL release-candidate scenarios remain covered by `NFT-RES-01` and `FT-N-02`.
@@ -83,3 +83,18 @@
 **Duration**: 50 cold-start trials.

 **Pass criteria**: First valid `GPS_INPUT` <30 s p95; peak memory <8 GB; no first-run engine build occurs at runtime.
+
+---
+
+### NFT-RES-LIM-INFRA: Jetson Hardware Prerequisite Smoke
+
+**Summary**: Validate that local replay reports Jetson-only resource gates as blocked unless target hardware is explicitly enabled.
+
+**Traces to**: AZ-239 AC-1, AZ-239 AC-2, AZ-239 AC-4, AZ-233 Reliability NFR
+
+**Monitoring**:
+- Replay report status, blocked reason, and run-scoped artifact path.
+
+**Duration**: One Docker replay smoke run.
+
+**Pass criteria**: On non-Jetson local runners, the scenario reports `blocked` with `Jetson prerequisite blocked: set GPSD_ENABLE_JETSON=1 on target hardware`; on Jetson release-gate runners, it must collect the metrics required by `NFT-RES-LIM-01`, `NFT-RES-LIM-02`, and `NFT-RES-LIM-05`.
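
A minimal sketch of the prerequisite check follows, assuming the harness decides from the `GPSD_ENABLE_JETSON` environment variable alone. The function name and the `("run", "")` return shape are illustrative; the blocked reason is the string quoted in the pass criteria.

```python
import os

JETSON_BLOCK_REASON = (
    "Jetson prerequisite blocked: set GPSD_ENABLE_JETSON=1 on target hardware"
)


def jetson_gate(env=None):
    """Return ("run", "") on enabled target hardware, else ("blocked", reason)."""
    env = os.environ if env is None else env
    if env.get("GPSD_ENABLE_JETSON") == "1":
        return ("run", "")
    return ("blocked", JETSON_BLOCK_REASON)


print(jetson_gate({}))                           # blocked, with the reason above
print(jetson_gate({"GPSD_ENABLE_JETSON": "1"}))  # ('run', '')
```

Passing the environment as an argument keeps the check testable without mutating `os.environ`.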
@@ -60,3 +60,18 @@
 | 2 | Run replay requiring missing tile | System reports degraded/relocalization-needed status, not an external fetch |

 **Pass criteria**: 0 outbound satellite-provider or Suite Service calls during runtime; missing cache data produces controlled degraded behavior.
+
+---
+
+### NFT-SEC-INFRA: Invalid Cache No-Fetch Smoke
+
+**Summary**: Validate that the replay harness treats untrusted cache fixtures as a successful security rejection, not as a trusted anchor.
+
+**Traces to**: AZ-236 AC-2, AZ-236 AC-3, AZ-233 Security NFR
+
+| Step | Consumer Action | Expected Response |
+|------|-----------------|-------------------|
+| 1 | Run replay with `cache_variant=stale` | Satellite cache stub marks the manifest untrusted and records no network fetch |
+| 2 | Inspect replay evidence | Scenario reports `pass`, `source_label=untrusted_cache_rejected`, and `GPS_INPUT.fix_type=0` |
+
+**Pass criteria**: The invalid cache smoke scenario passes only when the untrusted fixture is rejected and no external satellite-provider or Suite service network fetch is attempted.
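
The pass criteria can be expressed as a single predicate over the replay evidence. The sketch assumes the evidence is available as a dictionary. The `result` and `network_fetches` keys are hypothetical names for the recorded status and the stub's interaction log.

```python
def invalid_cache_smoke_passes(evidence):
    """True only if the untrusted fixture was rejected and nothing was fetched."""
    return (
        evidence.get("result") == "pass"
        and evidence.get("source_label") == "untrusted_cache_rejected"
        and evidence.get("GPS_INPUT.fix_type") == 0
        # A missing interaction log counts as a failure, not as "no fetches".
        and evidence.get("network_fetches") == []
    )


rejected = {
    "result": "pass",
    "source_label": "untrusted_cache_rejected",
    "GPS_INPUT.fix_type": 0,
    "network_fetches": [],
}
print(invalid_cache_smoke_passes(rejected))  # True
print(invalid_cache_smoke_passes({**rejected, "network_fetches": ["GET /tiles"]}))  # False
```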
@@ -59,6 +59,37 @@
 | R-GCS-01 | QGroundControl supported GCS | FT-N-02, NFT-SEC-03 | Covered |
 | R-SAFETY-01 | False-position, cold-start, spoofing, and failsafe constraints | FT-N-01, FT-N-02, NFT-PERF-04, NFT-RES-01 | Covered |

+## Cycle 1 Implementation-Learned Test Coverage
+
+| Task AC ID | Task Acceptance Criterion Summary | Test IDs | Coverage |
+|------------|-----------------------------------|----------|----------|
+| AZ-233 AC-1 | Docker/replay environment starts or reports clear blocked prerequisites | NFT-RES-INFRA, NFT-RES-LIM-INFRA | Covered |
+| AZ-233 AC-2 | External dependency stubs are deterministic and record interactions | NFT-SEC-INFRA, NFT-RES-INFRA | Covered |
+| AZ-233 AC-3 | Runner executes blackbox, performance, resilience, security, and resource-limit groups | FT-P-01, NFT-PERF-INFRA, NFT-RES-INFRA, NFT-SEC-INFRA, NFT-RES-LIM-INFRA | Covered |
+| AZ-233 AC-4 | CSV and Markdown evidence reports are generated with required fields | FT-P-01, NFT-PERF-INFRA, NFT-RES-INFRA, NFT-SEC-INFRA, NFT-RES-LIM-INFRA | Covered |
+| AZ-234 AC-1 | Still-image WGS84 error is reported against expected coordinates | FT-P-01 | Covered |
+| AZ-234 AC-2 | Confidence output contract fields are validated | FT-P-02 | Covered |
+| AZ-234 AC-3 | Replay latency and dropped-frame metrics are recorded | NFT-PERF-INFRA, NFT-PERF-01 | Covered |
+| AZ-235 AC-1 | Derkachi fixture alignment is validated before replay | FT-P-03 | Covered |
+| AZ-235 AC-2 | Synchronized replay emits frame-by-frame estimates or explicit degradation | FT-P-03 | Covered |
+| AZ-235 AC-3 | VIO latency, completion, memory, and calibration status are reported | NFT-PERF-02 | Covered |
+| AZ-236 AC-1 | Verified anchors include retrieval, matching, geometry, freshness, and provenance evidence | FT-P-04 | Covered |
+| AZ-236 AC-2 | Unsafe cache or low-texture candidates are rejected | FT-N-01, FT-N-03, NFT-SEC-INFRA | Covered |
+| AZ-236 AC-3 | Flight-mode missing-cache behavior does not fetch external satellite data | NFT-SEC-04, NFT-SEC-INFRA | Covered |
+| AZ-236 AC-4 | Cache and trigger-path metrics are reported | NFT-PERF-03, NFT-RES-04, NFT-RES-LIM-03 | Covered |
+| AZ-237 AC-1 | Blackout transitions to dead reckoning within threshold | FT-N-02, NFT-RES-01 | Covered |
+| AZ-237 AC-2 | Degraded covariance and no-fix/failsafe thresholds are enforced | FT-N-02, NFT-RES-01 | Covered |
+| AZ-237 AC-3 | Spoofed or unauthorized MAVLink inputs are rejected | NFT-SEC-03 | Covered |
+| AZ-237 AC-4 | QGC and FDR degraded-mode evidence is visible | FT-N-02, NFT-SEC-03, NFT-RES-INFRA | Covered |
+| AZ-238 AC-1 | Disconnected segments trigger relocalization or degraded status | NFT-RES-02 | Covered |
+| AZ-238 AC-2 | Companion restart first-output and FDR evidence are recorded | NFT-RES-03 | Covered |
+| AZ-238 AC-3 | Cold-start trials report first-fix timing or blocked prerequisite | NFT-PERF-04, NFT-RES-LIM-05 | Covered |
+| AZ-238 AC-4 | Cold-start resource spikes are captured where measurable | NFT-RES-LIM-05 | Covered |
+| AZ-239 AC-1 | Jetson memory budget is measured on target hardware | NFT-RES-LIM-01, NFT-RES-LIM-INFRA | Covered |
+| AZ-239 AC-2 | Thermal/power endurance is validated or blocked with reason | NFT-RES-LIM-02, NFT-RES-LIM-INFRA | Covered |
+| AZ-239 AC-3 | FDR rollover behavior is validated | NFT-RES-LIM-04 | Covered |
+| AZ-239 AC-4 | Resource/endurance evidence artifacts are complete | NFT-RES-LIM-01, NFT-RES-LIM-02, NFT-RES-LIM-04, NFT-RES-LIM-INFRA | Covered |
+
 ## Coverage Summary

 | Category | Total Items | Covered | Not Covered | Coverage % |
@@ -79,3 +110,4 @@
 - Derkachi project data supports synchronized video/IMU/GPS trajectory replay for FT-P-03 and NFT-PERF-02.
 - Derkachi project data is calibration-limited: raw camera intrinsics, lens distortion, and camera-to-body transform are still required before final absolute accuracy thresholds can be treated as production acceptance.
 - Phase 3 must validate camera calibration inputs and public/calibrated dataset acquisition before FT-P-03, FT-P-04, and NFT-PERF-02 can be used for final signoff.
+- Cycle 1 Docker replay smoke evidence currently passes blackbox, performance, resilience, and security infrastructure scenarios; Jetson resource evidence remains a target-hardware release gate and is reported as blocked on local runners.
@@ -2,24 +2,24 @@

 **Cycle**: 1
 **Date**: 2026-05-05
-**Outcome**: Product implementation complete
+**Outcome**: FAIL — product implementation incomplete

 ## Summary

-All product implementation tasks for cycle 1 are implemented or have explicit runtime prerequisite boundaries. The remediation tasks close the previously identified gaps in native VIO selection, local descriptor/index VPR retrieval, and computed anchor matching/geometry verification.
+Product implementation was previously marked complete, but Step 11 exposed a false-positive gate: tests passed against scaffold/fake contract behavior while the actual A-Z runtime path, especially real VIO execution, is not implemented. The flow must return to Step 7 (Implement) and create remediation tasks before downstream test gates can be trusted.

 ## Product Task Classifications

 | Task | Classification | Evidence |
 |------|----------------|----------|
-| AZ-219 through AZ-232 | PASS | Prior batch reports 01-09 and cumulative review 01-09 |
-| AZ-240 | PASS | `src/vio_adapter/interfaces.py`, `src/vio_adapter/native/__init__.py`, `tests/unit/test_vio_adapter.py` |
+| AZ-219 through AZ-232 | NEEDS RECHECK | Prior batch reports 01-09 and cumulative review 01-09 were not audited under the stricter runtime completeness gate |
+| AZ-240 | FAIL | `src/vio_adapter/interfaces.py` exposes `NativeVioBackend`, but default runtime behavior is `ReplayVioBackend`; `src/vio_adapter/native/__init__.py` only re-exports protocol wrappers and does not execute a real BASALT/native VIO engine |
 | AZ-241 | PASS | `src/satellite_service/interfaces.py`, `src/satellite_service/types.py`, `src/satellite_service/native/__init__.py`, `tests/unit/test_satellite_service_vpr.py` |
 | AZ-242 | PASS | `src/anchor_verification/interfaces.py`, `src/anchor_verification/types.py`, `src/anchor_verification/native/__init__.py`, `tests/unit/test_anchor_verification.py` |

 ## Remediation Evidence

- VIO now exposes `NativeVioBackend` behind the `VioBackend` protocol, fills latency metrics, maps initialization/runtime failures into explicit health/error envelopes, and keeps WGS84 authority out of the adapter.
+- VIO currently exposes `NativeVioBackend` behind the `VioBackend` protocol, but the production/native engine is not actually integrated. This is a scaffold, not product-complete VIO.
 - Satellite retrieval now loads local descriptor/index packages from cache files, builds a CPU FAISS-compatible descriptor index, requires query descriptors for retrieval, and degrades safely for missing or invalid index data.
 - Anchor verification now computes matcher evidence from frame/tile keypoints through `KeypointRansacMatcher`, reports runtime/quality metrics, and routes computed evidence through the existing freshness, provenance, inlier, MRE, and homography gates.

@@ -39,4 +39,4 @@ Checked changed component source for unresolved implementation markers:

 ## Required Follow-Up

-No product remediation tasks remain. Autodev may advance to Step 8, Code Testability Revision.
+Autodev must return to Step 7, rerun the Product Implementation Completeness Gate under the stricter rules, create remediation tasks sized at 5 points or fewer, and implement the missing runtime behavior before Step 8 or Step 11 may pass.
@@ -2,13 +2,13 @@

 ## Current Step
 flow: greenfield
-step: 11
-name: Run Tests
-status: not_started
+step: 7
+name: Implement
+status: in_progress
 tracker: jira
 sub_step:
-  phase: 0
-  name: awaiting-invocation
-  detail: ""
+  phase: 15
+  name: product-completeness-gate
+  detail: "Reset after failed reality check: VIO/A-Z runtime path is scaffolded; tests passed against stubs/contracts instead of actual system behavior"
 retry_count: 0
 cycle: 1
@@ -54,9 +54,20 @@ services:
    build:
      context: .
      dockerfile: deployment/docker/Dockerfile.replay
-    command: ["python", "e2e/replay/run_replay.py", "--output-dir", "/app/data/test-results"]
+    command:
+      [
+        "python",
+        "-m",
+        "e2e.replay.run_replay",
+        "--output-dir",
+        "/app/data/test-results",
+        "--input-root",
+        "/data/input",
+      ]
    env_file:
      - config/ci/runtime.env
+    environment:
+      GPSD_ENABLE_SITL: "1"
    depends_on:
      gps-denied-service:
        condition: service_completed_successfully
@@ -221,10 +221,11 @@ class BlackboxReplayRunner:
    def __init__(
        self,
        output_root: Path = Path("data/test-results"),
+        input_root: Path = Path("_docs/00_problem/input_data"),
        scenarios: Sequence[ScenarioConfig] | None = None,
    ) -> None:
        self.output_root = output_root
-        self.scenarios = tuple(scenarios or default_scenarios())
+        self.scenarios = tuple(scenarios or default_scenarios(input_root))
        self.environment = TestEnvironment(output_root)
        self.satellite_cache = SatelliteCacheStub()
        self.ardupilot_sitl = ArdupilotSitlStub()
@@ -277,11 +278,25 @@ class BlackboxReplayRunner:
            interactions.extend(self.satellite_cache.interactions[cache_interaction_count:])
            interactions.extend(self.ardupilot_sitl.interactions[sitl_interaction_count:])
            interactions.extend(self.qgc_observer.interactions[observer_interaction_count:])
-            result = ScenarioResult.PASS if cache_response["trusted"] else ScenarioResult.BLOCKED
+            cache_rejection_expected = scenario.controls.get("expect_cache_rejection") == "true"
+            cache_rejected_safely = (
+                cache_rejection_expected
+                and cache_response["trusted"] is False
+                and cache_response["network_fetch_attempted"] is False
+            )
+            result = (
+                ScenarioResult.PASS
+                if cache_response["trusted"] or cache_rejected_safely
+                else ScenarioResult.BLOCKED
+            )
            error_message = "" if result == ScenarioResult.PASS else "cache fixture is not trusted"
-            source_label = "satellite_anchored" if result == ScenarioResult.PASS else "degraded"
-            covariance = 12.5 if result == ScenarioResult.PASS else None
-            gps_fix_type = int(str(sitl_response["fix_type"])) if result == ScenarioResult.PASS else 0
+            source_label = (
+                "satellite_anchored"
+                if cache_response["trusted"]
+                else "untrusted_cache_rejected"
+            )
+            covariance = 12.5 if cache_response["trusted"] else None
+            gps_fix_type = int(str(sitl_response["fix_type"])) if cache_response["trusted"] else 0

        scenario_dir = run_dir / scenario.scenario_id
        scenario_dir.mkdir(parents=True, exist_ok=True)
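
Review note: the pass/blocked decision in the hunk above can be read as a small pure function. The sketch below is illustrative only; `classify_cache_scenario` and the stand-in `ScenarioResult` enum are hypothetical names, not part of the runner, and the dictionary keys are the ones visible in the diff.

```python
from enum import Enum


class ScenarioResult(Enum):
    # Hypothetical stand-in for the runner's enum.
    PASS = "pass"
    BLOCKED = "blocked"


def classify_cache_scenario(cache_response: dict, controls: dict) -> tuple[ScenarioResult, str]:
    """Return (result, source_label) following the rules in the hunk above."""
    trusted = bool(cache_response["trusted"])
    # A rejection only counts as safe when the scenario asked for it,
    # the cache was flagged untrusted, and no network fetch was attempted.
    rejected_safely = (
        controls.get("expect_cache_rejection") == "true"
        and cache_response["trusted"] is False
        and cache_response["network_fetch_attempted"] is False
    )
    result = ScenarioResult.PASS if trusted or rejected_safely else ScenarioResult.BLOCKED
    # The label depends on trust alone, not on the final result.
    label = "satellite_anchored" if trusted else "untrusted_cache_rejected"
    return result, label
```

Written this way, one consequence of the change is easier to see: an untrusted cache that was not expected to be rejected, or one where a fetch was attempted, is reported as blocked but still carries the `untrusted_cache_rejected` label, where the previous code reported `degraded`.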
@@ -363,8 +378,7 @@ class BlackboxReplayRunner:
        return markdown_path


-def default_scenarios() -> tuple[ScenarioConfig, ...]:
-    input_root = Path("_docs/00_problem/input_data")
+def default_scenarios(input_root: Path = Path("_docs/00_problem/input_data")) -> tuple[ScenarioConfig, ...]:
    return (
        ScenarioConfig(
            scenario_id="FT-P-01",
@@ -395,7 +409,7 @@ def default_scenarios() -> tuple[ScenarioConfig, ...]:
            name="Invalid cache no-fetch smoke",
            group=ScenarioGroup.SECURITY,
            input_dataset="cache_integrity_fixtures",
-            controls={"cache_variant": "stale"},
+            controls={"cache_variant": "stale", "expect_cache_rejection": "true"},
        ),
        ScenarioConfig(
            scenario_id="NFT-RES-LIM-INFRA",
@@ -601,9 +615,15 @@ def main(argv: Sequence[str] | None = None) -> int:
        default=Path("data/test-results"),
        help="Directory for run-scoped CSV and Markdown reports.",
    )
+    parser.add_argument(
+        "--input-root",
+        type=Path,
+        default=Path(os.environ.get("GPSD_REPLAY_INPUT_ROOT", "_docs/00_problem/input_data")),
+        help="Directory containing replay input fixtures.",
+    )
    args = parser.parse_args(argv)

-    result = BlackboxReplayRunner(output_root=args.output_dir).run()
+    result = BlackboxReplayRunner(output_root=args.output_dir, input_root=args.input_root).run()
    print(f"blackbox replay completed: {result.csv_path}")
    print(f"fdr validation summary: {result.markdown_path}")
    return 0
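
Review note: the new `--input-root` option resolves in three steps: an explicit flag wins, then the `GPSD_REPLAY_INPUT_ROOT` environment variable, then the in-repo default. A minimal sketch of that precedence follows, assuming only the standard library; `parse_input_root` is a hypothetical helper written for illustration, not a function in `run_replay`.

```python
import argparse
import os
from pathlib import Path
from typing import Optional, Sequence


def parse_input_root(argv: Optional[Sequence[str]] = None) -> Path:
    """Resolve the replay input root: flag, then environment, then default."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input-root",
        type=Path,
        # The environment is read when the parser is built, not at import time.
        default=Path(os.environ.get("GPSD_REPLAY_INPUT_ROOT", "_docs/00_problem/input_data")),
        help="Directory containing replay input fixtures.",
    )
    return parser.parse_args(argv).input_root
```

The hunk relies on `os` being imported at module level; that import is not visible in this diff, so it is worth confirming before merge.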
@@ -57,6 +57,9 @@ def test_runner_executes_all_required_groups_and_writes_reports(tmp_path: Path)
    assert rows
    assert rows[0].keys() == set(REPORT_COLUMNS)
    assert {row["Result"] for row in rows} <= {"pass", "blocked"}
+    security_row = next(row for row in rows if row["Test ID"] == "NFT-SEC-INFRA")
+    assert security_row["Result"] == "pass"
+    assert security_row["Source Label"] == "untrusted_cache_rejected"

    markdown = result.markdown_path.read_text(encoding="utf-8")
    assert "FDR Validation Summary" in markdown
@@ -64,6 +67,22 @@ def test_runner_executes_all_required_groups_and_writes_reports(tmp_path: Path)
    assert "Jetson prerequisite blocked" in markdown


+def test_runner_uses_configurable_input_root_for_replay_fixtures(tmp_path: Path) -> None:
+    # Arrange
+    input_root = tmp_path / "input"
+    (input_root / "expected_results").mkdir(parents=True)
+    (input_root / "coordinates.csv").write_text("image,lat,lon\nAD000001.jpg,48.0,37.0\n")
+    (input_root / "expected_results" / "results_report.md").write_text("# Expected results\n")
+
+    # Act
+    result = BlackboxReplayRunner(output_root=tmp_path / "output", input_root=input_root).run()
+
+    # Assert
+    reports_by_id = {report.scenario_id: report for report in result.reports}
+    assert reports_by_id["FT-P-01"].result == ScenarioResult.PASS
+    assert reports_by_id["NFT-PERF-INFRA"].result == ScenarioResult.PASS
+
+
 def test_runner_keeps_generated_artifacts_run_scoped(tmp_path: Path) -> None:
    # Act
    result = BlackboxReplayRunner(output_root=tmp_path).run()