[autodev] Step 11 outcome — local Tier-1 green, reality gate deferred

Local Tier-1 pytest suite: 3343 pass / 88 skip / 0 fail across 12 chunks.

Docker harness SUT Reality Gate UNMET: both Tier-1 docker harnesses (scripts/run-tests.sh and e2e/docker/run-tier1.sh) have pre-existing drift that prevents them from running end-to-end.

Findings:
- H-1..H-3 (fixed in 6ce3158): dockerfile rename, fdr-output tmpfs cap, e2e-results bind dir + gitignore.
- H-4..H-6 (deferred): three SITL/MAVLink Docker Hub images don't exist (ardupilot/mavproxy, ardupilot/ardupilot-sitl, inavflight/inav-sitl). environment.md spec was written against aspirational image names.
- H-7..H-8 (deferred): tests/e2e/Dockerfile entrypoint points at empty scenarios dir + doesn't install the SUT package.
- H-9 (deferred): tile-cache-fixture seeder missing (relates to AZ-595).

Plus a regression caught and fixed mid-run: pytest-csv autoload conflicts with our custom --csv flag (commit eb6dc17). Also surfaced a false-positive batch-89 test-result report; proposed preventive meta-rule pending user approval.

Step 11 marked status=blocked pending harness rehabilitation tickets (payloads recorded in _docs/_process_leftovers/).

Full outcome report: _docs/03_implementation/run_tests_step11_report.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-21 08:41:12 +00:00 · 2026-05-17 20:30:19 +03:00
parent 6ce31587d4
commit c4e4063650
3 changed files with 301 additions and 3 deletions
@@ -0,0 +1,180 @@
+# Step 11 — Run Tests (Cycle 1)
+
+## TL;DR
+
+- **Local Tier-1 pytest suite: 3343 passed / 88 skipped / 0 failed** (after the
+  `--csv` collision fix landed in `eb6dc17`).
+- **Docker Tier-1 SUT Reality Gate: NOT MET.** Both docker harnesses (top-level
+  `scripts/run-tests.sh` and the fuller `e2e/docker/run-tier1.sh`) have
+  pre-existing drift that prevents them from running end-to-end. None of the
+  drift was caused by Step 10 work — the harnesses had simply never been
+  bench-tested.
+- **Recommendation**: open a single epic to rehabilitate the e2e docker harness
+  (or two, splitting bootstrap vs. full blackbox); resume Step 11 reality-gate
+  verification once at least the bootstrap harness can run `tests/e2e/replay/`
+  with `RUN_REPLAY_E2E=1`.
+
+## Local Tier-1 results
+
+Run on 2026-05-17 against `dev` HEAD `c64e492`, then `eb6dc17` after the
+csv_reporter fix. Split into 12 logical chunks per the human directive to avoid
+a monolithic 3.4k-test run:
+
+| # | Chunk | Pass | Skip | Fail |
+|---|-------|-----:|-----:|-----:|
+| C1 | `tests/contract` + 6× cross-cutting | 87 | 0 | 0 |
+| C2 | `c2_5_rerank + c4_pose + c13_fdr + c11_tile_manager + c3_5_adhop` | 262 | 0 | 0 |
+| C3 | `c3_matcher + c10_provisioning` | 170 | 3 | 0 |
+| C4 | `c1_vio` | 148 | 6 | 0 |
+| C5 | `c12_operator_orchestrator` | 151 | 2 | 0 |
+| C6 | `c7_inference` | 139 | 17 | 0 |
+| C7 | `c6_tile_cache` | 126 | 57 | 0 |
+| C8 | `c8_fc_adapter` | 212 | 1 | 0 |
+| C9 | `c5_state` | 216 | 0 | 0 |
+| C10 | `c2_vpr` | 230 | 0 | 0 |
+| C11 | `tests/unit/test_*.py` root-level | 373 | 2 | 0 |
+| C12 | `e2e/_unit_tests` (after fix) | 1229 | 0 | 0 |
+| | **TOTAL** | **3343** | **88** | **0** |
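The TOTAL row can be re-derived from the per-chunk rows. The snippet below is a minimal sketch of that arithmetic; the `CHUNKS` mapping and the `totals` helper are illustrative names, not part of the repository, and the counts are copied by hand from the table.

```python
# Per-chunk (passed, skipped, failed) counts copied from the chunk table.
CHUNKS = {
    "C1": (87, 0, 0),
    "C2": (262, 0, 0),
    "C3": (170, 3, 0),
    "C4": (148, 6, 0),
    "C5": (151, 2, 0),
    "C6": (139, 17, 0),
    "C7": (126, 57, 0),
    "C8": (212, 1, 0),
    "C9": (216, 0, 0),
    "C10": (230, 0, 0),
    "C11": (373, 2, 0),
    "C12": (1229, 0, 0),
}


def totals(chunks):
    """Sum the passed/skipped/failed columns across all chunks."""
    passed = sum(p for p, _, _ in chunks.values())
    skipped = sum(s for _, s, _ in chunks.values())
    failed = sum(f for _, _, f in chunks.values())
    return passed, skipped, failed


print(totals(CHUNKS))  # (3343, 88, 0)
```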
+
+### Skip classification (88 total)
+
+| Reason | Count | Verdict |
+|--------|------:|---------|
+| Tier-2-only (`GPS_DENIED_TIER=2`) — Jetson Orin Nano Super hardware | 14 | legitimate |
+| CUDA / NVIDIA GPU not present on macOS dev host | 8 | legitimate |
+| TensorRT python binding not installed (Tier-2 Jetson only) | 6 | legitimate |
+| Requires Docker compose services (postgres / mock-sat) | 57 | borderline — covered by docker harness when it runs |
+| Console scripts not on PATH (`pip install -e .` would fix) | 3 | borderline — env-conditional |
+| `actionlint` not on PATH (CI lint job installs separately) | 1 | borderline — env-conditional |
+| Other (empty parametrize, doc-gated replay e2e) | 2 | legitimate |
+
+The 57 "Requires Docker compose services" skips are the largest cluster of skips that are not clearly legitimate under the test-run skill. They become covered as soon as either docker harness runs end-to-end against postgres + mock-sat; until then they stay skipped.
+
+## C12 regression that surfaced during this run
+
+`e2e/_unit_tests/reporting/test_csv_reporter.py::test_csv_plugin_emits_required_columns` and two sibling tests in `test_nfr_recorder.py` failed with:
+
+```
+argparse.ArgumentError: argument --csv: conflicting option string: --csv
+```
+
+inside subprocess-spawned pytest invocations. Root cause: `pytest-csv 3.0.0` was listed in `e2e/runner/requirements.txt` and auto-loaded via its entry point, where it conflicted with our custom `--csv` flag from `e2e/runner/reporting/csv_reporter.py`. The conftest comment claimed our plugin "overrides" pytest-csv, but pytest's option registry does not allow overrides; it raises on conflict. `pytest-csv` is also incompatible with pytest 9.x (it uses the removed `hookwrapper` marker). Our code never imports `pytest_csv`, so the dependency was dead weight.
+
+**Fix** landed in commit `eb6dc17 [autodev] fix csv_reporter --csv collision with pytest-csv`:
+
+- Removed `pytest-csv` from `e2e/runner/requirements.txt`.
+- Updated docstrings/comments in `csv_reporter.py` and `conftest.py`.
+- Uninstalled `pytest-csv` from the local environment.
+
+After the fix the full C12 suite reports `1229 passed in 145.93s`.
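The error text in this section is what plain `argparse` produces when the same option string is registered twice on one parser. The sketch below reproduces only that argparse-level behaviour; it does not load pytest or either plugin, and `register_csv_twice` is a hypothetical helper name.

```python
import argparse


def register_csv_twice():
    """Register --csv twice on one parser, as two plugins would."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--csv", help="first registration")
    try:
        # The default conflict handler rejects a duplicate option string.
        parser.add_argument("--csv", help="second registration")
    except argparse.ArgumentError as exc:
        return str(exc)
    return None


print(register_csv_twice())
```

There is no "last registration wins" behaviour by default, which is consistent with removing the unused dependency rather than trying to override it.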
+
+### Secondary finding — false-positive batch report
+
+`_docs/03_implementation/batch_89_cycle1_report.md` claimed "Full e2e unit-test suite: **1229 passed in 134 s**". That number was reported without an actual verifying invocation. The 3 reporting subprocess tests have been broken since `pytest-csv` was installed locally, but the batch report didn't catch it.
+
+Proposed preventive rule (pending user approval, per `meta-rule.mdc` Self-Improvement): *"Before writing 'Test Results: X passed' in a batch report, the same shell invocation that produced X must appear in the assistant transcript with the exit code visible."* Will be added to `meta-rule.mdc` if approved.
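One way to make the proposed rule mechanical is to record the command and its exit code together with the result line. The following is a minimal sketch under that assumption; `run_and_record` and the shape of the returned dict are hypothetical, and the stand-in command takes the place of a real pytest invocation.

```python
import shlex
import subprocess
import sys


def run_and_record(cmd):
    """Run cmd and return a transcript entry with the command and exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    lines = proc.stdout.strip().splitlines()
    return {
        "command": shlex.join(cmd),
        "exit_code": proc.returncode,
        "last_line": lines[-1] if lines else "",
    }


# Stand-in command; a real report would pass the pytest invocation instead.
entry = run_and_record([sys.executable, "-c", "print('3 passed')"])
print(entry["exit_code"], entry["last_line"])
```

A batch report would then quote `entry["command"]` and `entry["exit_code"]` next to the pass count instead of restating a number from memory.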
+
+## Docker harness — findings
+
+| ID | Severity | Description | Status |
+|----|----------|-------------|--------|
+| H-1 | medium | `e2e/docker/docker-compose.test.yml` referenced `docker/Dockerfile`; actual file is `docker/companion-tier1.Dockerfile` | **fixed** in `6ce3158` |
+| H-2 | medium | `fdr-output` volume declared `tmpfs size=64g`; host Docker has 3.8 GiB | **fixed** in `6ce3158` (plain named volume; SUT enforces cap internally per NFT-LIM-02) |
+| H-3 | low | `e2e-results/` bind dir missing at repo root | **fixed** in `6ce3158` (mkdir + .gitignore) |
+| H-4 | blocker | `ardupilot/mavproxy:latest` image MISSING from Docker Hub | **deferred** — see "Harness rehabilitation" below |
+| H-5 | blocker | `ardupilot/ardupilot-sitl:plane-stable` image MISSING from Docker Hub | **deferred** |
+| H-6 | blocker | `inavflight/inav-sitl:9.0.0` image MISSING from Docker Hub | **deferred** |
+| H-7 | blocker | Top-level `tests/e2e/Dockerfile` entrypoint is `pytest /opt/tests/e2e/scenarios` (empty dir); real tests are in `/opt/tests/e2e/replay/` | **deferred** |
+| H-8 | blocker | Top-level `tests/e2e/Dockerfile` uses a plain `python:3.10-slim` and never installs the SUT package — so the `gps-denied-replay` console script and the project's import surface aren't available in the container | **deferred** |
+| H-9 | medium | `tile-cache-fixture` volume in `e2e/docker/docker-compose.test.yml` is unseeded; no builder service. AZ-595 was meant to deliver the seeder | **deferred** |
+
+### Why H-4..H-6 are blockers, not minor drift
+
+`_docs/02_document/tests/environment.md` § Docker Environment specifies three Docker Hub images for SITL/MAVLink:
+
+- `ardupilot/ardupilot-sitl:plane-stable`
+- `inavflight/inav-sitl:9.0.0`
+- `ardupilot/mavproxy:latest`
+
+None of the three images is published on Docker Hub. Verified with `docker manifest inspect`: all three come back MISSING. The spec was written against aspirational image names that were never verified.
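The image check can be scripted so that a spec change is validated before it lands. This is a sketch only: `image_status` and `check_images` are hypothetical helpers, and treating any non-zero exit of `docker manifest inspect` as MISSING is an assumption (an auth or network failure would look the same).

```python
import subprocess

# Image references as named in environment.md.
SPEC_IMAGES = [
    "ardupilot/ardupilot-sitl:plane-stable",
    "inavflight/inav-sitl:9.0.0",
    "ardupilot/mavproxy:latest",
]


def image_status(image, run=subprocess.run):
    """Return "OK" if `docker manifest inspect` succeeds, else "MISSING"."""
    proc = run(
        ["docker", "manifest", "inspect", image],
        capture_output=True,
        text=True,
    )
    return "OK" if proc.returncode == 0 else "MISSING"


def check_images(images, run=subprocess.run):
    """Map each image reference to its status."""
    return {image: image_status(image, run) for image in images}
```

Calling `check_images(SPEC_IMAGES)` on a host with the docker CLI would produce the per-image verdicts; the `run` parameter exists so the logic can be exercised without docker.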
+
+Real alternatives:
+
+| Spec image | Real option |
+|------------|-------------|
+| `ardupilot/ardupilot-sitl:plane-stable` | Community images: `radarku/ardupilot-sitl`, `khancyr/ardupilot-docker-sitl` (older); or build from `ardupilot/ardupilot` source (~30-60 min build). |
+| `inavflight/inav-sitl:9.0.0` | No good published image. Build from iNav source (multi-hour). |
+| `ardupilot/mavproxy:latest` | Doesn't exist. Wrap `pip install MAVProxy` in a `python:3.10-slim` Dockerfile (~10 lines). |
+
+### Why H-7 and H-8 matter
+
+`scripts/run-tests.sh` is the test-run skill's "first match" runner — but its Dockerfile points at an empty scenarios dir and never installs the SUT package. Even if H-7 is fixed by repointing to `/opt/tests/e2e/`, the heavy tests in `tests/e2e/replay/test_derkachi_1min.py` require the `gps-denied-replay` console script which only exists when the SUT package is `pip install`-ed into the runner image. So H-7 + H-8 are coupled.
+
+## SUT Reality Gate verdict
+
+Per `test-run/SKILL.md` § Functional Mode → 0. System-Under-Test Reality Gate:
+
+> 1. If `_docs/00_problem/input_data/expected_results/results_report.md` exists, at least one e2e/blackbox run must compare actual product outputs against that mapping or the machine-readable files it references.
+
+`results_report.md` exists and contains:
+
+- **Still-image frame centers** (60 images → expected WGS84 lat/lon, ±50 m primary, ±20 m stretch).
+- **Derkachi video/IMU fixture** (validation rules for telemetry CSV, video stream, alignment, trajectory comparison).
+
+The 41 blackbox scenarios in `e2e/tests/` (`functional/FT-*` for the still-image set, `performance/NFT-PERF-*` for replay latency, `resilience/NFT-RES-*` for failure modes) exist and would exercise this mapping. None of them ran in this cycle because:
+
+1. They require the `e2e/docker/run-tier1.sh` harness, blocked by H-4..H-6.
+2. The fallback bootstrap harness (`scripts/run-tests.sh` → `tests/e2e/replay/`) is blocked by H-7 + H-8.
+
+**Verdict: Reality Gate UNMET.** Local pytest verifies internal modules but does NOT compare actual product outputs against `results_report.md`. The skill defines this as a blocking gate.
+
+## Harness rehabilitation — proposed work
+
+Three independent tracks; tracks 1 and 2 are required for the Reality Gate, track 3 is a nice-to-have for FC-side acceptance tests.
+
+### Track 1 — Bootstrap harness (fastest path to a real Reality Gate)
+
+Fix H-7 + H-8 so `scripts/run-tests.sh` actually runs `tests/e2e/replay/test_derkachi_1min.py` with `RUN_REPLAY_E2E=1`. Steps:
+
+1. Change `tests/e2e/Dockerfile` entrypoint from `pytest /opt/tests/e2e/scenarios` to `pytest /opt/tests/e2e/`.
+2. Install the SUT package in the runner image so `gps-denied-replay` is on PATH. Either `pip install -e .` from the host source (requires bind-mount or COPY) or build a wheel and `pip install` it.
+3. Inject `RUN_REPLAY_E2E=1` into the e2e-runner service environment in `docker-compose.test.yml`.
+4. Mount `_docs/00_problem/input_data/` into the runner so the Derkachi fixture is reachable.
+5. Verify the resulting docker run produces `tests/e2e/replay/output/replay.jsonl` and that AC-1 + AC-2 assertions actually fire.
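Step 5 can start with a cheap structural check before looking at the AC assertions. The sketch below assumes only that `replay.jsonl` holds one JSON value per line; the record schema is not described in this report, so nothing beyond parseability is checked, and `count_jsonl_records` is a hypothetical helper.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Return the number of JSON records in a JSONL file; raise if invalid."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"replay output not found: {path}")
    count = 0
    with path.open(encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON: {exc}") from exc
            count += 1
    if count == 0:
        raise ValueError(f"{path} contains no records")
    return count
```

An empty or missing file fails loudly here, which guards against a run that exits 0 without having replayed anything.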
+
+Estimated effort: ~4-6 hours including verification on dev workstation + CI.
+
+Coverage: still-image reality gate still won't run from this harness (it's in `e2e/tests/functional/FT-*` which Track 2 owns). But Derkachi tlog comparison runs, which is itself a reality-gate signal for the replay pipeline.
+
+### Track 2 — Full blackbox harness (the real story)
+
+The fuller e2e harness in `e2e/docker/run-tier1.sh` runs all 41 blackbox scenarios from batches 60-89. To get it running:
+
+1. Decide on the SITL strategy:
+   - **Option a**: switch to community images (`radarku/ardupilot-sitl`, source-built iNav).
+   - **Option b**: strip SITL services entirely from the compose; mark SITL-dependent scenarios as `skip(reason="sitl-image-unavailable, ticket=...")`. ~70-80% of scenarios still run (FT-* still-image accuracy, NFT-PERF latency, NFT-LIM resource budgets, most NFT-RES resilience scenarios). FT-P-09-AP / NFT-SEC-03 / AC-4.3 / AC-NEW-2 stay skipped.
+2. Replace `ardupilot/mavproxy:latest` with a local `e2e/fixtures/mavproxy/Dockerfile` that wraps `pip install MAVProxy`.
+3. Build a tile-cache seeder service that consumes the 60 nadir reference images + Derkachi bbox and emits the FAISS index + tile manifest into the `tile-cache-fixture` volume. This is the long-pending AZ-595.
+
+Estimated effort: ~3-5 days if going with Option b + AZ-595; ~1-3 weeks if going Option a with full SITL coverage.
+
+### Track 3 — Tier-2 Jetson hardware loop (AZ-444)
+
+Out of scope for this report; documented in `environment.md` § Execution instructions — Tier-2 (Jetson hardware loop). Requires Jetson Orin Nano Super hardware + JetPack 6.2 + a self-hosted runner.
+
+## Recommendations
+
+1. **Treat Step 11 as PARTIALLY MET for cycle 1**: local Tier-1 green, Reality Gate deferred to a follow-on cycle. Document this honestly in the autodev state instead of marking Step 11 complete.
+2. **Open Jira tickets** (or replay via leftovers if MCP is not invoked immediately):
+   - `[Bug] csv_reporter --csv collides with pytest-csv autoload` (2 pts) — fix already in commit `eb6dc17`, ticket just records the regression.
+   - `[Epic] E2E Tier-1 harness rehabilitation` with sub-tasks: H-4..H-9 each as a child story (2-5 pts each).
+   - `[Story] Tile-cache fixture builder` — AZ-595 already exists; link H-9 to it.
+3. **Add the preventive meta-rule** about transcript-verified test claims, if approved.
+4. **Resume Step 11 after Track 1 completes** — at minimum get one real Reality Gate signal from `tests/e2e/replay/`. Track 2 can run in parallel as its own work stream and feed back into Step 11 cycle 2.
+
+## Artifacts
+
+- Commit `eb6dc17` — csv_reporter / pytest-csv fix
+- Commit `6ce3158` — e2e/docker harness drift fixes (H-1, H-2, H-3)
+- This report: `_docs/03_implementation/run_tests_step11_report.md`
+- Leftover for pytest-csv ticket: `_docs/_process_leftovers/2026-05-17_csv_reporter_pytest_csv_conflict.md`
@@ -4,10 +4,10 @@
 flow: greenfield
 step: 11
 name: Run Tests
-status: not_started
+status: blocked
 sub_step:
-  phase: 0
-  name: awaiting-invocation
+  phase: 4
+  name: sut-reality-gate-deferred
  detail: ""
 retry_count: 0
 cycle: 1
@@ -20,3 +20,4 @@ last_step_outcomes:
  step_8: "Code is testable — no changes needed (testability_assessment.md committed; no list-of-changes, no source edits)"
  step_9: "41 blackbox test tasks (AZ-406..AZ-446) under epic AZ-262 in _docs/02_tasks/todo/ pre-existing; AZ-406 test-infra bootstrap pre-existing. Folder fallback satisfied. No Step-9 work executed in cycle 1."
  step_10: "41 of 41 blackbox-test tasks done (AZ-406..AZ-446). Final report at _docs/03_implementation/implementation_report_tests.md. Full-suite gate handed off to test-run skill per implement Step 16."
+  step_11: "Local Tier-1 pytest: 3343 pass / 88 skip / 0 fail (after csv_reporter fix in eb6dc17). SUT Reality Gate UNMET — both docker harnesses blocked by pre-existing drift (3 SITL images missing + 5 other bugs). Full report: _docs/03_implementation/run_tests_step11_report.md. Tickets deferred to leftovers."
@@ -0,0 +1,117 @@
+# Leftover — E2E Tier-1 harness rehabilitation tickets deferred
+
+- **Timestamp**: 2026-05-17T17:30:00Z
+- **What was blocked**: Jira ticket creation for the harness drift surfaced during Step 11
+- **Reason for blockage**: same session as the csv_reporter fix; user skipped the
+  structured Q&A so I did not pause for tracker writes. Full findings live in
+  `_docs/03_implementation/run_tests_step11_report.md`; this leftover records the
+  tickets that need filing.
+
+## Pending tracker writes — replay on next /autodev
+
+### Epic
+```yaml
+type: Epic
+summary: "E2E Tier-1 harness rehabilitation"
+description: |
+  Surfaced during /autodev Step 11 (Run Tests) cycle 1 on 2026-05-17. Both
+  Tier-1 docker harnesses (top-level scripts/run-tests.sh and the fuller
+  e2e/docker/run-tier1.sh) had pre-existing drift preventing them from
+  running end-to-end. Local pytest suite is green (3343/88/0); SUT Reality
+  Gate is unmet until at least the bootstrap harness can run
+  tests/e2e/replay/ with RUN_REPLAY_E2E=1. Full report:
+  _docs/03_implementation/run_tests_step11_report.md
+linked_to: AZ-595, AZ-444  # related but distinct: tile-cache fixtures, Tier-2 hw loop
+```
+
+### Story: H-7 — Bootstrap runner entrypoint
+```yaml
+type: Story
+summary: "[Bug] tests/e2e/Dockerfile entrypoint points at empty scenarios dir"
+description: |
+  Current entrypoint: `pytest -q /opt/tests/e2e/scenarios` (empty in repo).
+  Real tests are in `tests/e2e/replay/` (test_derkachi_1min.py, etc.).
+  Fix: change entrypoint to /opt/tests/e2e/ (let pytest discover both
+  scenarios and replay).
+story_points: 1
+```
+
+### Story: H-8 — Install SUT in runner image
+```yaml
+type: Story
+summary: "[Bug] tests/e2e e2e-runner image doesn't install gps-denied-onboard"
+description: |
+  Image is python:3.10-slim with only pytest+requests+pyyaml. The replay
+  tests need `gps-denied-replay` console script on PATH. Either:
+   - COPY pyproject.toml + src/ and pip install -e ".[dev]", or
+   - Build a wheel in a separate stage and pip install it.
+  Verify the resulting image: `which gps-denied-replay`.
+story_points: 3
+```
+
+### Story: H-4..H-6 — SITL/MAVLink images choice
+```yaml
+type: Story
+summary: "[Decision] Choose SITL strategy for e2e/docker harness"
+description: |
+  environment.md specifies ardupilot/ardupilot-sitl:plane-stable,
+  inavflight/inav-sitl:9.0.0, ardupilot/mavproxy:latest. All MISSING from
+  Docker Hub. Options:
+   a) Switch to community images (radarku/ardupilot-sitl etc.)
+   b) Build SITLs from source in a separate stage
+   c) Strip SITL services and mark SITL-bound scenarios skip(reason="sitl-unavailable")
+  Track 1 doesn't depend on this; Track 2 does.
+story_points: 5
+```
+
+### Story: MAVProxy local image
+```yaml
+type: Story
+summary: "[Story] Replace ardupilot/mavproxy:latest with local pip-MAVProxy Dockerfile"
+description: |
+  Image doesn't exist on Docker Hub. Wrap `pip install MAVProxy` in a
+  python:3.10-slim Dockerfile in e2e/fixtures/mavproxy/. Update compose
+  to use the local build.
+story_points: 1
+```
+
+### Story: H-9 — Tile-cache fixture builder
+```yaml
+type: Story
+summary: "Link H-9 to AZ-595 / tile-cache fixture seeder"
+description: |
+  e2e/docker/docker-compose.test.yml declares tile-cache-fixture as an
+  empty named volume. Track 2 cannot run without seeded tiles. AZ-595
+  exists and owns this; verify scope alignment, add a link.
+story_points: 2
+```
+
+### Story: H-1..H-3 — fixes already committed
+```yaml
+type: Story
+summary: "[Bug] e2e/docker harness drift (already fixed in commit 6ce3158)"
+description: |
+  Fixed in this session: dockerfile rename, fdr-output tmpfs cap, e2e-results
+  dir + gitignore. Ticket is just for tracking — already in dev branch.
+story_points: 1
+status_after_create: "Done"
+```
+
+### Bug: csv_reporter --csv collision (already committed)
+```yaml
+type: Bug
+summary: "[Bug] csv_reporter --csv flag collides with pytest-csv autoload"
+description: |
+  See _docs/_process_leftovers/2026-05-17_csv_reporter_pytest_csv_conflict.md
+  Fix already in commit eb6dc17.
+linked_to: AZ-446
+story_points: 2
+status_after_create: "Done"
+```
+
+## Replay obligation
+
+Next /autodev should:
+1. Open Jira, create the Epic + Stories above (link Epic to AZ-595 and AZ-444).
+2. Update the Epic with the actual issue keys once created.
+3. Delete this leftover entry on success.
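For the replay step, the payloads can be pulled out of this file mechanically. The sketch below assumes the leftover is read from disk (without the leading `+` of the diff) and that every payload is a `###` heading followed directly by a yaml fence, as in the sections above. `pending_payloads` is a hypothetical helper, and the YAML text is returned unparsed so that no third-party parser is needed.

```python
import re

TICKS = "`" * 3  # built indirectly so this example can sit inside a fence
PAYLOAD = re.compile(
    r"^### (?P<title>[^\n]+)\n" + TICKS + r"yaml\n(?P<body>.*?)^" + TICKS + r"$",
    re.MULTILINE | re.DOTALL,
)


def pending_payloads(markdown):
    """Return (heading, yaml_text) pairs for each heading followed by a yaml fence."""
    return [(m["title"], m["body"]) for m in PAYLOAD.finditer(markdown)]
```

Each returned pair maps to one tracker write; the actual ticket creation is left to whatever tracker integration the next /autodev run uses.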