mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-21 20:41:13 +00:00

Oleksandr Bezdieniezhnykh c2934b8686 [AZ-603] [AZ-604] e2e-runner: install SUT, fix entrypoint (Track 1)

Multi-stage Ubuntu 22.04 e2e-runner image installs gps-denied-onboard
(editable) into /opt/venv so the AZ-404 replay tests can subprocess
gps-denied-replay against the Derkachi fixture. Image layout mirrors
the host repo (/opt/pyproject.toml + /opt/src + /opt/tests bind mount)
so Path(__file__).parents[3] resolves to /opt and AC-4's AST scan
finds the components dir.

Entrypoint now runs `pytest /opt/tests/e2e/` instead of the empty
`scenarios/` dir. The bootstrap harness collects 24 tests vs. 0 before.

Compose: e2e-runner env mirrors the companion service (FullSystemConfig
requirements) plus RUN_REPLAY_E2E=1, BUILD_REPLAY_SINK_JSONL=ON;
bind-mounts the Derkachi fixture dir; adds writable fdr-data /
tile-data volumes the SUT requires.

Reality Gate signal is now real: 17 pass / 5 fail / 1 skip / 1 xfail.
The 5 heavy-AC failures share root cause AZ-614 (tlog synth time-base
mismatch, surfaced by the now-functional harness).

Also archives the replayed leftover entries (csv_reporter -> AZ-601,
harness rehab -> AZ-602 epic + 11 child stories).

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-18 01:28:36 +03:00


Step 11 — Run Tests (Cycle 1)

TL;DR

Local Tier-1 pytest suite: 3343 passed / 88 skipped / 0 failed (after the --csv collision fix landed in eb6dc17).
Docker Tier-1 SUT Reality Gate: NOT MET. Both docker harnesses (top-level scripts/run-tests.sh and the fuller e2e/docker/run-tier1.sh) have pre-existing drift that prevents them from running end-to-end. None of the drift was caused by Step 10 work — the harnesses had simply never been bench-tested.
Recommendation: open a single epic to rehabilitate the e2e docker harness (or two, splitting bootstrap vs. full blackbox); resume Step 11 reality-gate verification once at least the bootstrap harness can run tests/e2e/replay/ with RUN_REPLAY_E2E=1.

Local Tier-1 results

Run on 2026-05-17 against dev HEAD c64e492, then eb6dc17 after the csv_reporter fix. Split into 12 logical chunks per the human directive to avoid a monolithic 3.4k-test run:

#	Chunk	Pass	Skip
C1	`tests/contract` + 6× cross-cutting	87	0
C2	`c2_5_rerank + c4_pose + c13_fdr + c11_tile_manager + c3_5_adhop`	262	0
C3	`c3_matcher + c10_provisioning`	170	3
C4	`c1_vio`	148	6
C5	`c12_operator_orchestrator`	151	2
C6	`c7_inference`	139	17
C7	`c6_tile_cache`	126	57
C8	`c8_fc_adapter`	212	1
C9	`c5_state`	216	0
C10	`c2_vpr`	230	0
C11	`tests/unit/test_*.py` root-level	373	2
C12	`e2e/_unit_tests` (after fix)	1229	0
	TOTAL	3343	88
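As a quick cross-check, the totals row can be recomputed from the per-chunk counts. The sketch below only re-adds the numbers copied from the table; the dictionary name `CHUNKS` is illustrative and not part of the repo.

```python
# Per-chunk (passed, skipped) counts copied from the table above.
CHUNKS = {
    "C1": (87, 0),
    "C2": (262, 0),
    "C3": (170, 3),
    "C4": (148, 6),
    "C5": (151, 2),
    "C6": (139, 17),
    "C7": (126, 57),
    "C8": (212, 1),
    "C9": (216, 0),
    "C10": (230, 0),
    "C11": (373, 2),
    "C12": (1229, 0),
}

total_passed = sum(passed for passed, _ in CHUNKS.values())
total_skipped = sum(skipped for _, skipped in CHUNKS.values())

print(total_passed, total_skipped)  # 3343 88
```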

Skip classification (88 total)

Reason	Count	Verdict
Tier-2-only (`GPS_DENIED_TIER=2`) — Jetson Orin Nano Super hardware	14	legitimate
CUDA / NVIDIA GPU not present on macOS dev host	8	legitimate
TensorRT python binding not installed (Tier-2 Jetson only)	6	legitimate
Requires Docker compose services (postgres / mock-sat)	57	borderline — covered by docker harness when it runs
Console scripts not on PATH (`pip install -e .` would fix)	3	borderline — env-conditional
`actionlint` not on PATH (CI lint job installs separately)	1	borderline — env-conditional
Other (empty parametrize, doc-gated replay e2e)	2	legitimate

The 57 "Requires Docker compose services" skips are the largest cluster that the skill does not accept as legitimate. They become covered as soon as any docker harness runs end-to-end against postgres + mock-sat; until then, they stay skipped.

C12 regression that surfaced during this run

e2e/_unit_tests/reporting/test_csv_reporter.py::test_csv_plugin_emits_required_columns and two sibling tests in test_nfr_recorder.py failed with:

argparse.ArgumentError: argument --csv: conflicting option string: --csv

inside subprocess-spawned pytest invocations. Root cause: pytest-csv 3.0.0 was listed in e2e/runner/requirements.txt and auto-loaded via its entry point, which conflicted with our custom --csv flag from e2e/runner/reporting/csv_reporter.py. The conftest comment claimed our plugin "overrides" pytest-csv, but pytest's option registry does not allow overrides; it raises on conflict. pytest-csv is also incompatible with pytest 9.x (it uses the removed hookwrapper marker). Our code never imports pytest_csv, so the dependency was dead weight.
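The conflict can be reproduced outside pytest. The following is a minimal sketch using plain argparse, which pytest's option parser is built on; the two `add_argument` calls merely stand in for the two plugins and are not the actual plugin code.

```python
import argparse

parser = argparse.ArgumentParser(prog="pytest-like")

# First registration, standing in for the auto-loaded pytest-csv plugin.
parser.add_argument("--csv", help="registered by plugin A")

# Second registration, standing in for the custom csv_reporter plugin.
# With the default conflict handler, argparse refuses the duplicate flag.
try:
    parser.add_argument("--csv", help="registered by plugin B")
except argparse.ArgumentError as exc:
    conflict_message = str(exc)
else:
    conflict_message = None

print(conflict_message)
```

The second registration does not replace the first; it raises, which is why removing the unused dependency was the right fix rather than trying to "override" it.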

Fix landed in commit eb6dc17 [autodev] fix csv_reporter --csv collision with pytest-csv:

Removed pytest-csv from e2e/runner/requirements.txt.
Updated docstrings/comments in csv_reporter.py and conftest.py.
Uninstalled pytest-csv from the local environment.

After the fix the full C12 suite reports 1229 passed in 145.93s.

Secondary finding — false-positive batch report

_docs/03_implementation/batch_89_cycle1_report.md claimed "Full e2e unit-test suite: 1229 passed in 134 s". That number was reported without an actual verifying invocation. The 3 reporting subprocess tests have been broken since pytest-csv was installed locally, but the batch report didn't catch it.

Proposed preventive rule (pending user approval, per meta-rule.mdc Self-Improvement): "Before writing 'Test Results: X passed' in a batch report, the same shell invocation that produced X must appear in the assistant transcript with the exit code visible." Will be added to meta-rule.mdc if approved.

Docker harness — findings

ID	Severity	Description	Status
H-1	medium	`e2e/docker/docker-compose.test.yml` referenced `docker/Dockerfile`; actual file is `docker/companion-tier1.Dockerfile`	fixed in `6ce3158`
H-2	medium	`fdr-output` volume declared `tmpfs size=64g`; host Docker has 3.8 GiB	fixed in `6ce3158` (plain named volume; SUT enforces cap internally per NFT-LIM-02)
H-3	low	`e2e-results/` bind dir missing at repo root	fixed in `6ce3158` (mkdir + .gitignore)
H-4	blocker	`ardupilot/mavproxy:latest` image MISSING from Docker Hub	deferred — see "Harness rehabilitation" below
H-5	blocker	`ardupilot/ardupilot-sitl:plane-stable` image MISSING from Docker Hub	deferred
H-6	blocker	`inavflight/inav-sitl:9.0.0` image MISSING from Docker Hub	deferred
H-7	blocker	Top-level `tests/e2e/Dockerfile` entrypoint is `pytest /opt/tests/e2e/scenarios` (empty dir); real tests are in `/opt/tests/e2e/replay/`	deferred
H-8	blocker	Top-level `tests/e2e/Dockerfile` uses a plain `python:3.10-slim` and never installs the SUT package — so the `gps-denied-replay` console script and the project's import surface aren't available in the container	deferred
H-9	medium	`tile-cache-fixture` volume in `e2e/docker/docker-compose.test.yml` is unseeded; no builder service. AZ-595 was meant to deliver the seeder	deferred

Why H-4..H-6 are blockers, not minor drift

_docs/02_document/tests/environment.md § Docker Environment specifies three Docker Hub images for SITL/MAVLink:

ardupilot/ardupilot-sitl:plane-stable
inavflight/inav-sitl:9.0.0
ardupilot/mavproxy:latest

None of these three images is published on Docker Hub. Verified with docker manifest inspect: all three return MISSING. The spec was written against aspirational image names and never verified.
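A check of this kind could be scripted roughly as follows. This is a sketch, not the command sequence actually used: the function names are hypothetical, and a non-zero exit from `docker manifest inspect` is treated as MISSING even though it can also mean a network or authentication problem.

```python
import subprocess
from typing import Callable, Dict, Iterable

# Image names copied from the spec list above.
SPEC_IMAGES = [
    "ardupilot/ardupilot-sitl:plane-stable",
    "inavflight/inav-sitl:9.0.0",
    "ardupilot/mavproxy:latest",
]


def docker_manifest_exit_code(image: str) -> int:
    """Run `docker manifest inspect IMAGE`; return its exit code (127 if docker is absent)."""
    try:
        result = subprocess.run(
            ["docker", "manifest", "inspect", image],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        return 127
    return result.returncode


def classify_images(
    images: Iterable[str],
    probe: Callable[[str], int] = docker_manifest_exit_code,
) -> Dict[str, str]:
    """Map each image to "OK" (exit code 0) or "MISSING" (anything else)."""
    return {image: ("OK" if probe(image) == 0 else "MISSING") for image in images}
```

Calling `classify_images(SPEC_IMAGES)` on a host with Docker and network access runs the real probe; the `probe` parameter exists so the classification logic can be exercised without Docker.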

Real alternatives:

Spec image	Real option
`ardupilot/ardupilot-sitl:plane-stable`	Community images: `radarku/ardupilot-sitl`, `khancyr/ardupilot-docker-sitl` (older); or build from `ardupilot/ardupilot` source (~30-60 min build).
`inavflight/inav-sitl:9.0.0`	No good published image. Build from iNav source (multi-hour).
`ardupilot/mavproxy:latest`	Doesn't exist. Wrap `pip install MAVProxy` in a `python:3.10-slim` Dockerfile (~10 lines).

Why H-7 and H-8 matter

scripts/run-tests.sh is the test-run skill's "first match" runner, but its Dockerfile points at an empty scenarios dir and never installs the SUT package. Even if H-7 is fixed by repointing to /opt/tests/e2e/, the heavy tests in tests/e2e/replay/test_derkachi_1min.py require the gps-denied-replay console script, which exists only once the SUT package has been pip-installed into the runner image. So H-7 and H-8 are coupled.
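The H-8 precondition (console script on PATH) is cheap to check before any heavy test starts. A minimal sketch, assuming a hypothetical helper name; only `gps-denied-replay` comes from this report:

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_console_scripts(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the console scripts that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


# Example with a stubbed lookup that knows only about `pytest`,
# mimicking a runner image that has pytest but not the SUT installed.
stub_path = {"pytest": "/usr/local/bin/pytest"}
print(missing_console_scripts(["pytest", "gps-denied-replay"], which=stub_path.get))
# ['gps-denied-replay']
```

Run inside the runner image with the default `shutil.which`, an empty list would indicate that the SUT install step took effect.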

SUT Reality Gate verdict

Per test-run/SKILL.md § Functional Mode → 0. System-Under-Test Reality Gate:

If _docs/00_problem/input_data/expected_results/results_report.md exists, at least one e2e/blackbox run must compare actual product outputs against that mapping or the machine-readable files it references.

results_report.md exists and contains:

Still-image frame centers (60 images → expected WGS84 lat/lon, ±50 m primary, ±20 m stretch).
Derkachi video/IMU fixture (validation rules for telemetry CSV, video stream, alignment, trajectory comparison).

The 41 blackbox scenarios in e2e/tests/ (functional/FT-* for the still-image set, performance/NFT-PERF-* for replay latency, resilience/NFT-RES-* for failure modes) exist and would exercise this mapping. None of them ran in this cycle because:

They require the e2e/docker/run-tier1.sh harness, blocked by H-4..H-6.
The fallback bootstrap harness (scripts/run-tests.sh → tests/e2e/replay/) is blocked by H-7 + H-8.

Verdict: Reality Gate UNMET. Local pytest verifies internal modules but does NOT compare actual product outputs against results_report.md. The skill defines this as a blocking gate.

Harness rehabilitation — proposed work

Three independent tracks; tracks 1 and 2 are required for the Reality Gate, track 3 is a nice-to-have for FC-side acceptance tests.

Track 1 — Bootstrap harness (fastest path to a real Reality Gate)

Fix H-7 + H-8 so scripts/run-tests.sh actually runs tests/e2e/replay/test_derkachi_1min.py with RUN_REPLAY_E2E=1. Steps:

1. Change tests/e2e/Dockerfile entrypoint from pytest /opt/tests/e2e/scenarios to pytest /opt/tests/e2e/.
2. Install the SUT package in the runner image so gps-denied-replay is on PATH. Either pip install -e . from the host source (requires bind-mount or COPY) or build a wheel and pip install it.
3. Inject RUN_REPLAY_E2E=1 into the e2e-runner service environment in docker-compose.test.yml.
4. Mount _docs/00_problem/input_data/ into the runner so the Derkachi fixture is reachable.
5. Verify the resulting docker run produces tests/e2e/replay/output/replay.jsonl and that AC-1 + AC-2 assertions actually fire.

Estimated effort: ~4-6 hours including verification on dev workstation + CI.

Coverage: still-image reality gate still won't run from this harness (it's in e2e/tests/functional/FT-* which Track 2 owns). But Derkachi tlog comparison runs, which is itself a reality-gate signal for the replay pipeline.

Track 2 — Full blackbox harness (the real story)

The fuller e2e harness in e2e/docker/run-tier1.sh runs all 41 batch-60-89 scenarios. To get it running:

Decide on the SITL strategy:
- Option a: switch to community images (radarku/ardupilot-sitl, source-built iNav).
- Option b: strip SITL services entirely from the compose; mark SITL-dependent scenarios as skip(reason="sitl-image-unavailable, ticket=..."). ~70-80% of scenarios still run (FT-* still-image accuracy, NFT-PERF latency, NFT-LIM resource budgets, most NFT-RES resilience scenarios). FT-P-09-AP / NFT-SEC-03 / AC-4.3 / AC-NEW-2 stay skipped.
Replace ardupilot/mavproxy:latest with a local e2e/fixtures/mavproxy/Dockerfile that wraps pip install MAVProxy.
Build a tile-cache seeder service that consumes the 60 nadir reference images + Derkachi bbox and emits the FAISS index + tile manifest into the tile-cache-fixture volume. This is the long-pending AZ-595.

Estimated effort: ~3-5 days if going with Option b + AZ-595; ~1-3 weeks if going Option a with full SITL coverage.

Track 3 — Tier-2 Jetson hardware loop (AZ-444)

Out of scope for this report; documented in environment.md § Execution instructions — Tier-2 (Jetson hardware loop). Requires Jetson Orin Nano Super hardware + JetPack 6.2 + a self-hosted runner.

Recommendations

Treat Step 11 as PARTIALLY MET for cycle 1: local Tier-1 green, Reality Gate deferred to a follow-on cycle. Document this honestly in the autodev state instead of marking Step 11 complete.
Open Jira tickets (or replay via leftovers if MCP is not invoked immediately):
- [Bug] csv_reporter --csv collides with pytest-csv autoload (2 pts) — fix already in commit eb6dc17, ticket just records the regression.
- [Epic] E2E Tier-1 harness rehabilitation with sub-tasks: H-4..H-9 each as a child story (2-5 pts each).
- [Story] Tile-cache fixture builder — AZ-595 already exists; link H-9 to it.
Add the preventive meta-rule about transcript-verified test claims, if approved.
Resume Step 11 after Track 1 completes — at minimum get one real Reality Gate signal from tests/e2e/replay/. Track 2 can run in parallel as its own work stream and feed back into Step 11 cycle 2.

Path 3 attempt — Full SITL with community images (2026-05-17, post-blocker)

Per user direction, attempted the "Full path" rehab: switch ArduPilot SITL to sparlane/ardupilot-sitl:Plane-latest (verified pullable), build iNav SITL from source, write MAVProxy Dockerfile, then run FT-P-01 / FT-P-02 against the real fixture builders.

Key reframe discovered during attempt: e2e/runner/helpers/sitl_observer.py is pure offline fixture replay, not a live SITL client (see file docstring + _FdrReplayObserver class). Setting E2E_SITL_REPLAY_DIR=... switches the observer to read pre-built JSON fixtures (observer_<fc_kind>_<host>.json). No live SITL container needed for the existing blackbox FT-P-* and NFT-* tests. The compose-file SITL services in environment.md are aspirational future state.
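The switch described above can be pictured as a small path lookup. This is a sketch under stated assumptions, not the code in sitl_observer.py: the function name and the example `fc_kind` / `host` values are hypothetical; only the environment variable and the `observer_<fc_kind>_<host>.json` naming pattern come from this report.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_observer_fixture(
    fc_kind: str,
    host: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Return the replay fixture path, or None when replay mode is not enabled."""
    replay_dir = env.get("E2E_SITL_REPLAY_DIR")
    if not replay_dir:
        return None
    return Path(replay_dir) / f"observer_{fc_kind}_{host}.json"


# Illustrative values only; real fc_kind / host strings are not listed in this report.
example_env = {"E2E_SITL_REPLAY_DIR": "e2e/fixtures/sitl_replay/p01"}
print(resolve_observer_fixture("ardupilot", "fc1", example_env))
```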

So the realistic Full Path is:

1. Install SUT locally (pip install -e .). Status: DONE.
2. Run e2e.fixtures.sitl_replay_builder.build_p01_fixtures to produce e2e/fixtures/sitl_replay/p01/. Status: BLOCKED (see below).
3. Run pytest on e2e/tests/positive/test_ft_p_01_still_image_accuracy.py with E2E_SITL_REPLAY_DIR=e2e/fixtures/sitl_replay/p01. Status: BLOCKED on step 2.

Trying step 2 surfaced 5 new integration drifts (H-10..H-14), on top of H-1..H-9 from the prior section:

ID	Severity	Description	Status
H-10	blocker	Fixture builder calls `gps-denied-replay --fdr-out PATH`. The CLI's actual arg name is `--output`.	not fixed
H-11	blocker	Fixture builder doesn't pass the CLI's required `--camera-calibration`, `--config`, `--mavlink-signing-key` args. Need to add fields to `FixtureBuilderConfig` and update `build_p01_fixtures.py` / `build_p02_fixtures.py`.	not fixed
H-12	medium	`tests/fixtures/calibration/adti26.json` declared `body_to_camera_se3` as `{rotation_xyzw, translation_xyz_m}` dict; loader at `runtime_root/_replay_branch.py:308` strictly expects a 4×4 matrix via `np.asarray(..., dtype=np.float64)`. The dict form was never parseable.	fixed — converted to 4×4 identity (`tests/fixtures/calibration/adti26.json`). Equivalent rotation/translation, no behavior change.
H-13	blocker	Auto-sync AC-8 validation hard-fails on still-image + stationary fixtures even when `--time-offset-ms 0` is supplied. Validator computes a "frame-window match %" (default 95% threshold) that requires real video motion + IMU takeoff signal. The FT-P-01 fixture (60 stills + stationary IMU) has neither by design. No `--skip-auto-sync` or `--accept-low-confidence-offset` escape hatch exists.	not fixed
H-14	env-conditional	CLI requires env vars including `BUILD_REPLAY_SINK_JSONL=ON` to use `NoopMavlinkTransport`. This is documented in code comments but not in `.env.example`.	needs doc update
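For H-12, the relationship between the dict form and the 4×4 form is the standard quaternion-to-homogeneous-matrix conversion. The sketch below is illustrative, not the loader's code: the function name is hypothetical, it assumes a Hamilton unit quaternion in x, y, z, w order, and it uses plain lists rather than NumPy so that it runs without dependencies.

```python
import math
from typing import Dict, List, Sequence


def se3_dict_to_matrix(pose: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Convert {rotation_xyzw, translation_xyz_m} into a row-major 4x4 transform."""
    x, y, z, w = pose["rotation_xyzw"]
    tx, ty, tz = pose["translation_xyz_m"]

    norm = math.sqrt(x * x + y * y + z * z + w * w)
    if norm == 0.0:
        raise ValueError("rotation_xyzw must not be the zero quaternion")
    x, y, z, w = x / norm, y / norm, z / norm, w / norm

    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w), tx],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w), ty],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y), tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


# Identity rotation with zero translation maps to the 4x4 identity matrix.
identity = se3_dict_to_matrix(
    {"rotation_xyzw": [0.0, 0.0, 0.0, 1.0], "translation_xyz_m": [0.0, 0.0, 0.0]}
)
print(identity)
```

If the original dict did hold an identity quaternion and zero translation, as the "equivalent rotation/translation" note implies, the converted fixture carries the same pose.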

Total live harness drift count: 14 distinct items (4 fixed: H-1..H-3 and H-12; 10 open). Each of H-10..H-13 individually takes 30-60 min to fix with the right design decisions; together they exceed the safe single-session budget given the surface-area uncertainty.

Pattern: The fixture builders (AZ-598/599/600), the CLI signature (AZ-401/402), the calibration JSON schema, and the replay protocol auto-sync (AZ-405) were each implemented well in isolation but never integrated end-to-end. This is exactly what the SUT Reality Gate is designed to surface.
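One way to keep the builder and the CLI signature from drifting apart again (H-10, H-11) is to build the argument vector in a single typed place. The sketch below is hypothetical: the dataclass and its field names are not the real `FixtureBuilderConfig`, each flag is assumed to take exactly one value, and the CLI may require further arguments that this report does not list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReplayCliArgs:
    """Hypothetical container for the flags named in H-10 and H-11."""

    output: str
    camera_calibration: str
    config: str
    mavlink_signing_key: str


def build_replay_argv(args: ReplayCliArgs) -> List[str]:
    """Assemble a gps-denied-replay argv using --output rather than --fdr-out."""
    return [
        "gps-denied-replay",
        "--output", args.output,
        "--camera-calibration", args.camera_calibration,
        "--config", args.config,
        "--mavlink-signing-key", args.mavlink_signing_key,
    ]


# Placeholder values for illustration only.
argv = build_replay_argv(
    ReplayCliArgs(
        output="out/replay.jsonl",
        camera_calibration="tests/fixtures/calibration/adti26.json",
        config="replay_config_minimal.yaml",
        mavlink_signing_key="PLACEHOLDER",
    )
)
print(argv)
```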

Path 3 verdict

Cannot reach the SUT Reality Gate in this session. Even after fixing H-12, the next gate (H-13: auto-sync hard-fail on stationary fixtures) requires a design decision: either expand the auto-sync escape hatch in the SUT, or change the fixture builder to inject a single-frame motion event, or relax AC-8 validation thresholds for stationary scenarios. Each is a non-trivial design call that warrants a Jira ticket and review, not a unilateral mid-session fix.

Updated recommendation

Track 2 ("Full blackbox harness") from the previous section needs to expand to include H-10..H-14 as additional sub-stories. Realistic effort: +1-2 days on top of the prior estimate. Path 3 is achievable but requires 3-5 days of focused harness rehab, not a single session.

Artifacts

Commit eb6dc17 — csv_reporter / pytest-csv fix
Commit 6ce3158 — e2e/docker harness drift fixes (H-1, H-2, H-3)
Commit 5c1c35d — H-12 4×4 SE3 calibration fix + replay_config_minimal.yaml
This report: _docs/03_implementation/run_tests_step11_report.md
(Replayed and removed) Leftover for pytest-csv ticket → AZ-601
(Replayed and removed) Leftover for harness epic → AZ-602 + 11 child stories

Cycle-2 Update: Track 1 Bootstrap Harness Outcome (2026-05-17 22:00 UTC)

Status: Track 1 done — Reality Gate signal is now REAL

The harness rehabilitation Epic landed as AZ-602 with 11 child stories. The user picked Track 1 (AZ-603 + AZ-604) for the shortest path to a genuine SUT Reality Gate signal. Both stories shipped together in a single PR.

What changed

tests/e2e/Dockerfile rewritten as a three-stage Ubuntu 22.04 build:
- stage 1: system deps (build-essential, libpq-dev, libspatialindex-dev, python3.10-venv, python3-pip)
- stage 2: SUT editable install (pip install -e ".[dev]" into /opt/venv)
- stage 3: slim runtime with python3, python3.10, libpq5, libspatialindex-c6, libgl1, libglib2.0-0 (OpenCV's runtime libs)
Image layout: /opt/pyproject.toml + /opt/src/... + /opt/tests/... (bind-mounted) — mirrors the host repo so Path(__file__).resolve().parents[3] resolves to /opt and AC-4's AST scan finds src/gps_denied_onboard/components/ correctly.
Entrypoint: pytest -q /opt/tests/e2e/ (not the empty scenarios/ dir).
docker-compose.test.yml e2e-runner service gets the full env set (GPS_DENIED_FC_PROFILE, CAMERA_CALIBRATION_PATH, LOG_LEVEL, LOG_SINK, INFERENCE_BACKEND, FDR_PATH, TILE_CACHE_PATH, MAVLINK_SIGNING_KEY, RUN_REPLAY_E2E=1, BUILD_REPLAY_SINK_JSONL=ON) plus mounts for _docs/00_problem/input_data and writable fdr-data / tile-data named volumes.
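The layout requirement can be illustrated with pathlib. This sketch assumes the scanning test sits three directories below the repo root, for example under tests/e2e/replay/; the file name used here is hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical test module path inside the image; the file name is illustrative.
test_file = PurePosixPath("/opt/tests/e2e/replay/test_example.py")

# parents[0] = /opt/tests/e2e/replay, [1] = /opt/tests/e2e, [2] = /opt/tests, [3] = /opt
repo_root = test_file.parents[3]
components_dir = repo_root / "src" / "gps_denied_onboard" / "components"

print(repo_root)       # /opt
print(components_dir)  # /opt/src/gps_denied_onboard/components
```

If the image placed the tests anywhere other than /opt/tests, `parents[3]` would point at a different directory and the components lookup would fail, which is why the image mirrors the host layout.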

Reality Gate run

Standalone docker run of the e2e-runner (no companion / mock-sat / db needed for AZ-404):

docker run --rm \
  -v "$PWD/tests:/opt/tests:ro" \
  -v "$PWD/_docs/00_problem/input_data:/opt/_docs/00_problem/input_data:ro" \
  -e RUN_REPLAY_E2E=1  -e BUILD_REPLAY_SINK_JSONL=ON \
  ... (full env set) ... \
  --entrypoint pytest gps-denied-onboard/e2e-runner:dev \
  -v --tb=short /opt/tests/e2e/

Result:

Outcome	Count	Tests
PASSED	17	AC-4 AST scan, AC-4b encoder byte-equality, AC-7 skip-gate, all AC-9 helpers (`test_helpers.py`)
FAILED	5	AC-1, AC-2, AC-5, AC-6 pace-realtime, AC-6 pace-asap
SKIPPED	1	AC-8 operator workflow (D-PROJ-2 mock-suite-sat-service not implemented)
XFAIL	1	AC-3 (calibration intrinsics unknown — documented)
Total collected	24	(vs. 0 before Track 1 — empty `scenarios/` dir)

Before vs. after

Metric	Before Track 1	After Track 1
Tests collected by `scripts/run-tests.sh`	0 (entrypoint points at empty `scenarios/`)	24 (full `tests/e2e/`)
Tests that actually exercise the SUT	0	5 heavy ACs invoke `gps-denied-replay` subprocess
Exit code semantics	Vacuous 0 (no tests collected ≠ no SUT bugs)	Reflects real test outcomes
`gps-denied-replay` on PATH inside e2e-runner image	no (image was python:3.10-slim + pytest only)	yes (multi-stage SUT install)
Source-tree layout inside image matches repo	no (no src present)	yes (`/opt/src/...`, AC-4 passes)
Real SUT wall-clock per heavy AC	n/a	~21 s for the auto-sync probe (see below)
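The "vacuous 0" row suggests a guard that refuses to treat an empty collection as a pass. The following is a sketch with hypothetical names; the enum mirrors a subset of pytest's documented exit codes (0 all passed, 1 some failed, 5 no tests collected), and how the earlier harness ended up reporting 0 is not stated in this report.

```python
from enum import IntEnum


class PytestExit(IntEnum):
    """Subset of pytest's documented exit codes."""

    OK = 0
    TESTS_FAILED = 1
    NO_TESTS_COLLECTED = 5


def gate_signal(exit_code: int, collected: int) -> str:
    """Classify a run; an empty collection never counts as a pass."""
    if collected == 0 or exit_code == PytestExit.NO_TESTS_COLLECTED:
        return "vacuous"
    if exit_code == PytestExit.OK:
        return "pass"
    if exit_code == PytestExit.TESTS_FAILED:
        return "fail"
    return "error"


print(gate_signal(0, 0))   # vacuous  (before Track 1)
print(gate_signal(1, 24))  # fail     (after Track 1: 5 of 24 failing)
```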

Real bug discovered

The 5 failing heavy ACs share a single root cause: tlog synth time-base mismatch.

tests/e2e/replay/_tlog_synth.py:62:

_TLOG_BASE_TIMESTAMP_US: Final[int] = 1_700_000_000_000_000  # 2023-11-14
# "The absolute value is irrelevant for replay-mode determinism;
#  only the delta-between-rows matters."  ← STALE COMMENT

The auto-sync detector in replay_input.tlog_video_adapter DOES use absolute timestamps to compute the video↔tlog offset. With the tlog anchored at Nov 2023 absolute and the synthetic video at relative t=0, auto-sync reports offset_ms=1699999995666 (~54 years) and hard-fails AC-8 (95% frame-window match threshold).

Surface signal from the SUT (the kind of log the Reality Gate was meant to surface):

ERROR replay_input.tlog_video_adapter
  kind=replay.auto_sync.ac8_validation_failed
  msg=auto-sync hard-fail: frame-window match below 95.0% with offset_ms=1699999995666
  tlog_takeoff_ns=1700000000000000000  video_motion_onset_ns=4333333333
  imu_sample_count=3000  video_frame_count=301
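The numbers in this log line can be reproduced by hand. The sketch below assumes the offset is simply the tlog takeoff time minus the video motion onset, truncated to milliseconds; that formula is an inference from the logged values, not a reading of the adapter's code.

```python
from datetime import datetime, timezone

TLOG_BASE_TIMESTAMP_US = 1_700_000_000_000_000  # constant from _tlog_synth.py
tlog_takeoff_ns = 1_700_000_000_000_000_000     # from the log line above
video_motion_onset_ns = 4_333_333_333           # from the log line above

# The synthetic tlog is anchored at an absolute wall-clock time...
base_utc = datetime.fromtimestamp(TLOG_BASE_TIMESTAMP_US // 1_000_000, tz=timezone.utc)

# ...while the video timeline starts near zero, so the difference is huge.
offset_ms = (tlog_takeoff_ns - video_motion_onset_ns) // 1_000_000
offset_years = offset_ms / 1000 / (365.25 * 24 * 3600)

print(base_utc.isoformat())    # 2023-11-14T22:13:20+00:00
print(offset_ms)               # 1699999995666
print(round(offset_years, 1))  # 53.9
```

Under that assumption the reported offset_ms follows directly from mixing an absolute time base with a relative one, so no realistic frame-window match is possible until both timelines share a base.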

This is the same family as H-13 / AZ-611 (stationary FT-P-01) but on the moving Derkachi fixture with a different root cause (synth time-base, not stationary kinematics). Filed as AZ-614.

Jira state at end of cycle 2

Issue	Title	Status
AZ-602	E2E Tier-1 harness rehabilitation (Epic)	TO DO
AZ-601	csv_reporter `--csv` collision (fixed `eb6dc17`)	IN TESTING
AZ-603	H-7 Dockerfile entrypoint (Track 1)	DONE (this cycle)
AZ-604	H-8 install SUT in runner image (Track 1)	DONE (this cycle)
AZ-605	H-4..H-6 SITL strategy decision	TO DO
AZ-606	MAVProxy local Dockerfile	TO DO
AZ-607	H-9 tile-cache seeder (linked to AZ-595)	TO DO
AZ-608	H-10 fixture builder `--fdr-out` → `--output`	TO DO
AZ-609	H-11 fixture builder missing CLI args	TO DO
AZ-610	H-12 calibration JSON 4×4 (fixed `5c1c35d`)	DONE
AZ-611	H-13 auto-sync hard-fail on stationary	TO DO (Track 2, decision)
AZ-612	H-14 `.env.example` BUILD_REPLAY_SINK_JSONL	TO DO
AZ-613	H-1..H-3 harness drift (fixed `6ce3158`)	DONE
AZ-614	Derkachi tlog synth time-base mismatch	TO DO (Track 2, unblocks AC-1..AC-6)

Reality Gate verdict

Cycle-2 verdict for Step 11: Reality Gate signal is now REAL — the SUT runs end-to-end for ~21 s on the Derkachi fixture and surfaces a real auto-sync bug. Pre-Track 1, the gate was a vacuous "exit 0 with 0 tests collected" that hid every SUT issue. Track 1 was the minimum investment to make the gate honest; future cycles (Track 2 + AZ-614) will turn the failing ACs green.
