Local Tier-1 pytest suite: 3343 pass / 88 skip / 0 fail across 12 chunks. Docker harness SUT Reality Gate UNMET: both Tier-1 docker harnesses (scripts/run-tests.sh and e2e/docker/run-tier1.sh) have pre-existing drift that prevents them from running end-to-end. Findings: H-1..H-3 (fixed in 6ce3158): dockerfile rename, fdr-output tmpfs cap, e2e-results bind dir + gitignore. H-4..H-6 (deferred): three SITL/MAVLink Docker Hub images don't exist (ardupilot/mavproxy, ardupilot/ardupilot-sitl, inavflight/inav-sitl). environment.md spec was written against aspirational image names. H-7..H-8 (deferred): tests/e2e/Dockerfile entrypoint points at empty scenarios dir + doesn't install the SUT package. H-9 (deferred): tile-cache-fixture seeder missing (relates to AZ-595). Plus a regression caught and fixed mid-run: pytest-csv autoload conflicts with our custom --csv flag (commit eb6dc17). Also surfaced a false-positive batch-89 test-result report; proposed preventive meta-rule pending user approval. Step 11 marked status=blocked pending harness rehabilitation tickets (payloads recorded in _docs/_process_leftovers/). Full outcome report: _docs/03_implementation/run_tests_step11_report.md. Co-authored-by: Cursor <cursoragent@cursor.com>
Step 11 — Run Tests (Cycle 1)
TL;DR
- Local Tier-1 pytest suite: 3343 passed / 88 skipped / 0 failed (after the --csv collision fix landed in eb6dc17).
- Docker Tier-1 SUT Reality Gate: NOT MET. Both docker harnesses (top-level scripts/run-tests.sh and the fuller e2e/docker/run-tier1.sh) have pre-existing drift that prevents them from running end-to-end. None of the drift was caused by Step 10 work; the harnesses had simply never been bench-tested.
- Recommendation: open a single epic to rehabilitate the e2e docker harness (or two, splitting bootstrap vs. full blackbox); resume Step 11 reality-gate verification once at least the bootstrap harness can run tests/e2e/replay/ with RUN_REPLAY_E2E=1.
Local Tier-1 results
Run on 2026-05-17 against dev HEAD c64e492, then eb6dc17 after the
csv_reporter fix. Split into 12 logical chunks per the human directive to avoid
a monolithic 3.4k-test run:
| # | Chunk | Pass | Skip | Fail |
|---|---|---|---|---|
| C1 | tests/contract + 6× cross-cutting | 87 | 0 | 0 |
| C2 | c2_5_rerank + c4_pose + c13_fdr + c11_tile_manager + c3_5_adhop | 262 | 0 | 0 |
| C3 | c3_matcher + c10_provisioning | 170 | 3 | 0 |
| C4 | c1_vio | 148 | 6 | 0 |
| C5 | c12_operator_orchestrator | 151 | 2 | 0 |
| C6 | c7_inference | 139 | 17 | 0 |
| C7 | c6_tile_cache | 126 | 57 | 0 |
| C8 | c8_fc_adapter | 212 | 1 | 0 |
| C9 | c5_state | 216 | 0 | 0 |
| C10 | c2_vpr | 230 | 0 | 0 |
| C11 | tests/unit/test_*.py root-level | 373 | 2 | 0 |
| C12 | e2e/_unit_tests (after fix) | 1229 | 0 | 0 |
| TOTAL | | 3343 | 88 | 0 |
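The TOTAL row can be cross-checked mechanically. The sketch below copies the per-chunk counts from the table and sums them; the `CHUNKS` list and the `totals` helper are illustrative names, not part of the repository.

```python
# Per-chunk (id, passed, skipped, failed) counts copied from the table above.
CHUNKS = [
    ("C1", 87, 0, 0),
    ("C2", 262, 0, 0),
    ("C3", 170, 3, 0),
    ("C4", 148, 6, 0),
    ("C5", 151, 2, 0),
    ("C6", 139, 17, 0),
    ("C7", 126, 57, 0),
    ("C8", 212, 1, 0),
    ("C9", 216, 0, 0),
    ("C10", 230, 0, 0),
    ("C11", 373, 2, 0),
    ("C12", 1229, 0, 0),
]


def totals(rows):
    """Sum the passed / skipped / failed columns across all chunks."""
    passed = sum(row[1] for row in rows)
    skipped = sum(row[2] for row in rows)
    failed = sum(row[3] for row in rows)
    return passed, skipped, failed


print(totals(CHUNKS))
```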
Skip classification (88 total)
| Reason | Count | Verdict |
|---|---|---|
| Tier-2-only (GPS_DENIED_TIER=2): Jetson Orin Nano Super hardware | 14 | legitimate |
| CUDA / NVIDIA GPU not present on macOS dev host | 8 | legitimate |
| TensorRT python binding not installed (Tier-2 Jetson only) | 6 | legitimate |
| Requires Docker compose services (postgres / mock-sat) | 57 | borderline; covered by docker harness when it runs |
| Console scripts not on PATH (pip install -e . would fix) | 3 | borderline; env-conditional |
| actionlint not on PATH (CI lint job installs separately) | 1 | borderline; env-conditional |
| Other (empty parametrize, doc-gated replay e2e) | 2 | legitimate |
The 57 "Requires Docker compose services" skips are the largest cluster that the skill does not treat as clearly legitimate. They become covered the moment any docker harness runs end-to-end against postgres + mock-sat. Until then, they remain skipped.
C12 regression that surfaced during this run
e2e/_unit_tests/reporting/test_csv_reporter.py::test_csv_plugin_emits_required_columns and two sibling tests in test_nfr_recorder.py failed with:
argparse.ArgumentError: argument --csv: conflicting option string: --csv
inside subprocess-spawned pytest invocations. Root cause: pytest-csv 3.0.0 was listed in e2e/runner/requirements.txt and auto-loaded via its entry point, which conflicted with our custom --csv flag from e2e/runner/reporting/csv_reporter.py. The conftest comment claimed our plugin "overrides" pytest-csv, but pytest's option registry does not allow overrides; it raises on conflict. pytest-csv is also incompatible with pytest 9.x (it uses the removed hookwrapper marker). Our code never imports pytest_csv, so the dependency was dead weight.
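The failure mode can be reproduced without pytest at all, since the error above is raised by argparse itself. The following minimal sketch registers the same option string twice on a plain `argparse.ArgumentParser`; `register_twice` is a hypothetical helper, and the two `add_argument` calls merely stand in for the two plugins.

```python
import argparse


def register_twice(flag="--csv"):
    """Register the same option string twice and return the error text, if any."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument(flag, help="stand-in for the first plugin's option")
    try:
        # A second registration of the same option string is rejected by
        # argparse's default conflict handler.
        parser.add_argument(flag, help="stand-in for the second plugin's option")
    except argparse.ArgumentError as err:
        return str(err)
    return None


print(register_twice())
```

This shows why registration order cannot let one plugin "override" the other: the second registration fails instead of replacing the first.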
Fix landed in commit eb6dc17 ([autodev] fix csv_reporter --csv collision with pytest-csv):
- Removed pytest-csv from e2e/runner/requirements.txt.
- Updated docstrings/comments in csv_reporter.py and conftest.py.
- Uninstalled pytest-csv from the local environment.
After the fix the full C12 suite reports 1229 passed in 145.93s.
Secondary finding — false-positive batch report
_docs/03_implementation/batch_89_cycle1_report.md claimed "Full e2e unit-test suite: 1229 passed in 134 s". That number was reported without an actual verifying invocation. The 3 reporting subprocess tests have been broken since pytest-csv was installed locally, but the batch report didn't catch it.
Proposed preventive rule (pending user approval, per meta-rule.mdc Self-Improvement): "Before writing 'Test Results: X passed' in a batch report, the same shell invocation that produced X must appear in the assistant transcript with the exit code visible." Will be added to meta-rule.mdc if approved.
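One way to make such a rule easy to follow is to capture the command and its exit code together with the result line. The sketch below is only an illustration of that idea: `run_and_record` is a hypothetical helper, not an existing project tool, and the demo command is a stand-in for a real pytest invocation.

```python
import shlex
import subprocess
import sys


def run_and_record(cmd):
    """Run a command and keep the evidence a batch report would need."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    lines = proc.stdout.strip().splitlines()
    return {
        "command": shlex.join(cmd),      # exact invocation, copy-pasteable
        "exit_code": proc.returncode,    # must be visible next to the claim
        "last_line": lines[-1] if lines else "",
    }


# Stand-in for a real test run; replace with the actual pytest command.
record = run_and_record([sys.executable, "-c", "print('3 passed in 0.01s')"])
print(record)
```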
Docker harness — findings
| ID | Severity | Description | Status |
|---|---|---|---|
| H-1 | medium | e2e/docker/docker-compose.test.yml referenced docker/Dockerfile; actual file is docker/companion-tier1.Dockerfile | fixed in 6ce3158 |
| H-2 | medium | fdr-output volume declared tmpfs size=64g; host Docker has 3.8 GiB | fixed in 6ce3158 (plain named volume; SUT enforces cap internally per NFT-LIM-02) |
| H-3 | low | e2e-results/ bind dir missing at repo root | fixed in 6ce3158 (mkdir + .gitignore) |
| H-4 | blocker | ardupilot/mavproxy:latest image MISSING from Docker Hub | deferred; see "Harness rehabilitation" below |
| H-5 | blocker | ardupilot/ardupilot-sitl:plane-stable image MISSING from Docker Hub | deferred |
| H-6 | blocker | inavflight/inav-sitl:9.0.0 image MISSING from Docker Hub | deferred |
| H-7 | blocker | Top-level tests/e2e/Dockerfile entrypoint is pytest /opt/tests/e2e/scenarios (empty dir); real tests are in /opt/tests/e2e/replay/ | deferred |
| H-8 | blocker | Top-level tests/e2e/Dockerfile uses a plain python:3.10-slim and never installs the SUT package, so the gps-denied-replay console script and the project's import surface aren't available in the container | deferred |
| H-9 | medium | tile-cache-fixture volume in e2e/docker/docker-compose.test.yml is unseeded; no builder service. AZ-595 was meant to deliver the seeder | deferred |
Why H-4..H-6 are blockers, not minor drift
_docs/02_document/tests/environment.md § Docker Environment specifies three Docker Hub images for SITL/MAVLink:
- ardupilot/ardupilot-sitl:plane-stable
- inavflight/inav-sitl:9.0.0
- ardupilot/mavproxy:latest
None of the named org accounts publish to Docker Hub. Verified by docker manifest inspect — all three return MISSING. The spec was written against aspirational/imagined image names and never verified.
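A check of this kind is straightforward to script so that the spec can be re-verified later. The sketch below wraps `docker manifest inspect`, which exits non-zero when the manifest cannot be fetched. The `image_exists` and `report` helpers are hypothetical, the OK/MISSING labels are chosen here to match this report's wording, and a non-zero exit can also mean an auth or network problem, so treat MISSING as "could not be resolved".

```python
import subprocess

# Image references copied from the environment.md list above.
SPEC_IMAGES = [
    "ardupilot/ardupilot-sitl:plane-stable",
    "inavflight/inav-sitl:9.0.0",
    "ardupilot/mavproxy:latest",
]


def image_exists(image, run=subprocess.run):
    """Return True when `docker manifest inspect <image>` exits with 0."""
    proc = run(
        ["docker", "manifest", "inspect", image],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def report(images, run=subprocess.run):
    """Map each image reference to 'OK' or 'MISSING'."""
    return {img: "OK" if image_exists(img, run) else "MISSING" for img in images}


# Usage (requires the docker CLI and network access):
#   print(report(SPEC_IMAGES))
```

The `run` parameter exists only so the logic can be exercised without a Docker daemon.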
Real alternatives:
| Spec image | Real option |
|---|---|
| ardupilot/ardupilot-sitl:plane-stable | Community images: radarku/ardupilot-sitl, khancyr/ardupilot-docker-sitl (older); or build from ardupilot/ardupilot source (~30-60 min build). |
| inavflight/inav-sitl:9.0.0 | No good published image. Build from iNav source (multi-hour). |
| ardupilot/mavproxy:latest | Doesn't exist. Wrap pip install MAVProxy in a python:3.10-slim Dockerfile (~10 lines). |
Why H-7 and H-8 matter
scripts/run-tests.sh is the test-run skill's "first match" runner — but its Dockerfile points at an empty scenarios dir and never installs the SUT package. Even if H-7 is fixed by repointing to /opt/tests/e2e/, the heavy tests in tests/e2e/replay/test_derkachi_1min.py require the gps-denied-replay console script which only exists when the SUT package is pip install-ed into the runner image. So H-7 + H-8 are coupled.
SUT Reality Gate verdict
Per test-run/SKILL.md § Functional Mode → 0. System-Under-Test Reality Gate:
- If _docs/00_problem/input_data/expected_results/results_report.md exists, at least one e2e/blackbox run must compare actual product outputs against that mapping or the machine-readable files it references.
results_report.md exists and contains:
- Still-image frame centers (60 images → expected WGS84 lat/lon, ±50 m primary, ±20 m stretch).
- Derkachi video/IMU fixture (validation rules for telemetry CSV, video stream, alignment, trajectory comparison).
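For the still-image set, the comparison the gate asks for reduces to a distance check between an estimated and an expected frame center. The sketch below is a minimal illustration under stated assumptions: it uses the haversine formula on a spherical Earth with a mean radius of 6371 km, and `haversine_m` and `classify` are hypothetical names rather than the project's actual comparison code. Only the ±50 m and ±20 m thresholds come from the list above.

```python
import math

# Assumption: spherical Earth with mean radius 6371 km.
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def classify(error_m, primary_m=50.0, stretch_m=20.0):
    """Bucket a position error against the primary and stretch tolerances."""
    if error_m <= stretch_m:
        return "stretch"
    if error_m <= primary_m:
        return "primary"
    return "fail"


# Placeholder coordinates for illustration only; not taken from results_report.md.
error = haversine_m(50.0, 36.0, 50.0003, 36.0)
print(round(error, 1), classify(error))
```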
The 41 blackbox scenarios in e2e/tests/ (functional/FT-* for the still-image set, performance/NFT-PERF-* for replay latency, resilience/NFT-RES-* for failure modes) exist and would exercise this mapping. None of them ran in this cycle because:
- They require the e2e/docker/run-tier1.sh harness, blocked by H-4..H-6.
- The fallback bootstrap harness (scripts/run-tests.sh → tests/e2e/replay/) is blocked by H-7 + H-8.
Verdict: Reality Gate UNMET. Local pytest verifies internal modules but does NOT compare actual product outputs against results_report.md. The skill defines this as a blocking gate.
Harness rehabilitation — proposed work
Three independent tracks; tracks 1 and 2 are required for the Reality Gate, track 3 is a nice-to-have for FC-side acceptance tests.
Track 1 — Bootstrap harness (fastest path to a real Reality Gate)
Fix H-7 + H-8 so scripts/run-tests.sh actually runs tests/e2e/replay/test_derkachi_1min.py with RUN_REPLAY_E2E=1. Steps:
- Change tests/e2e/Dockerfile entrypoint from pytest /opt/tests/e2e/scenarios to pytest /opt/tests/e2e/.
- Install the SUT package in the runner image so gps-denied-replay is on PATH. Either pip install -e . from the host source (requires bind-mount or COPY) or build a wheel and pip install it.
- Inject RUN_REPLAY_E2E=1 into the e2e-runner service environment in docker-compose.test.yml.
- Mount _docs/00_problem/input_data/ into the runner so the Derkachi fixture is reachable.
- Verify the resulting docker run produces tests/e2e/replay/output/replay.jsonl and that AC-1 + AC-2 assertions actually fire.
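A small preflight inside the runner image could confirm the two preconditions from the steps above before any replay test starts. This is a sketch only: `replay_preflight` is a hypothetical helper, and the assumption that the replay tests gate on RUN_REPLAY_E2E being exactly "1" is inferred from this report, not read from the test code.

```python
import os
import shutil


def replay_preflight(env=None, which=shutil.which):
    """Return a list of problems that would keep the replay e2e tests from running."""
    env = os.environ if env is None else env
    problems = []
    # Assumption: the replay tests are enabled only when RUN_REPLAY_E2E is "1".
    if env.get("RUN_REPLAY_E2E") != "1":
        problems.append("RUN_REPLAY_E2E is not set to 1")
    # The console script exists only once the SUT package is installed.
    if which("gps-denied-replay") is None:
        problems.append("gps-denied-replay console script is not on PATH")
    return problems


for problem in replay_preflight():
    print("preflight:", problem)
```

An empty list means both preconditions hold; anything else points back at H-8 or at the compose environment.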
Estimated effort: ~4-6 hours including verification on dev workstation + CI.
Coverage: still-image reality gate still won't run from this harness (it's in e2e/tests/functional/FT-* which Track 2 owns). But Derkachi tlog comparison runs, which is itself a reality-gate signal for the replay pipeline.
Track 2 — Full blackbox harness (the real story)
The fuller e2e harness in e2e/docker/run-tier1.sh runs all 41 batch-60-89 scenarios. To get it running:
- Decide on the SITL strategy:
  - Option a: switch to community images (radarku/ardupilot-sitl, source-built iNav).
  - Option b: strip SITL services entirely from the compose; mark SITL-dependent scenarios as skip(reason="sitl-image-unavailable, ticket=..."). ~70-80% of scenarios still run (FT-* still-image accuracy, NFT-PERF latency, NFT-LIM resource budgets, most NFT-RES resilience scenarios). FT-P-09-AP / NFT-SEC-03 / AC-4.3 / AC-NEW-2 stay skipped.
- Replace ardupilot/mavproxy:latest with a local e2e/fixtures/mavproxy/Dockerfile that wraps pip install MAVProxy.
- Build a tile-cache seeder service that consumes the 60 nadir reference images + Derkachi bbox and emits the FAISS index + tile manifest into the tile-cache-fixture volume. This is the long-pending AZ-595.
Estimated effort: ~3-5 days if going with Option b + AZ-595; ~1-3 weeks if going Option a with full SITL coverage.
Track 3 — Tier-2 Jetson hardware loop (AZ-444)
Out of scope for this report; documented in environment.md § Execution instructions — Tier-2 (Jetson hardware loop). Requires Jetson Orin Nano Super hardware + JetPack 6.2 + a self-hosted runner.
Recommendations
- Treat Step 11 as PARTIALLY MET for cycle 1: local Tier-1 green, Reality Gate deferred to a follow-on cycle. Document this honestly in the autodev state instead of marking Step 11 complete.
- Open Jira tickets (or replay via leftovers if MCP is not invoked immediately):
  - [Bug] csv_reporter --csv collides with pytest-csv autoload (2 pts): fix already in commit eb6dc17; the ticket just records the regression.
  - [Epic] E2E Tier-1 harness rehabilitation with sub-tasks: H-4..H-9 each as a child story (2-5 pts each).
  - [Story] Tile-cache fixture builder: AZ-595 already exists; link H-9 to it.
- Add the preventive meta-rule about transcript-verified test claims, if approved.
- Resume Step 11 after Track 1 completes; at minimum, get one real Reality Gate signal from tests/e2e/replay/. Track 2 can run in parallel as its own work stream and feed back into Step 11 cycle 2.
Artifacts
- Commit eb6dc17: csv_reporter / pytest-csv fix
- Commit 6ce3158: e2e/docker harness drift fixes (H-1, H-2, H-3)
- This report: _docs/03_implementation/run_tests_step11_report.md
- Leftover for pytest-csv ticket: _docs/_process_leftovers/2026-05-17_csv_reporter_pytest_csv_conflict.md