Step 11 — Run Tests (Cycle 1)
TL;DR
- Local Tier-1 pytest suite: 3343 passed / 88 skipped / 0 failed (after the --csv collision fix landed in eb6dc17).
- Docker Tier-1 SUT Reality Gate: NOT MET. Both docker harnesses (top-level scripts/run-tests.sh and the fuller e2e/docker/run-tier1.sh) have pre-existing drift that prevents them from running end-to-end. None of the drift was caused by Step 10 work; the harnesses had simply never been bench-tested.
- Recommendation: open a single epic to rehabilitate the e2e docker harness (or two, splitting bootstrap vs. full blackbox); resume Step 11 reality-gate verification once at least the bootstrap harness can run tests/e2e/replay/ with RUN_REPLAY_E2E=1.
Local Tier-1 results
Run on 2026-05-17 against dev HEAD c64e492, then eb6dc17 after the
csv_reporter fix. Split into 12 logical chunks per the human directive to avoid
a monolithic 3.4k-test run:
| # | Chunk | Pass | Skip | Fail |
|---|---|---|---|---|
| C1 | tests/contract + 6× cross-cutting | 87 | 0 | 0 |
| C2 | c2_5_rerank + c4_pose + c13_fdr + c11_tile_manager + c3_5_adhop | 262 | 0 | 0 |
| C3 | c3_matcher + c10_provisioning | 170 | 3 | 0 |
| C4 | c1_vio | 148 | 6 | 0 |
| C5 | c12_operator_orchestrator | 151 | 2 | 0 |
| C6 | c7_inference | 139 | 17 | 0 |
| C7 | c6_tile_cache | 126 | 57 | 0 |
| C8 | c8_fc_adapter | 212 | 1 | 0 |
| C9 | c5_state | 216 | 0 | 0 |
| C10 | c2_vpr | 230 | 0 | 0 |
| C11 | tests/unit/test_*.py root-level | 373 | 2 | 0 |
| C12 | e2e/_unit_tests (after fix) | 1229 | 0 | 0 |
| | TOTAL | 3343 | 88 | 0 |
Skip classification (88 total)
| Reason | Count | Verdict |
|---|---|---|
| Tier-2-only (GPS_DENIED_TIER=2): Jetson Orin Nano Super hardware | 14 | legitimate |
| CUDA / NVIDIA GPU not present on macOS dev host | 8 | legitimate |
| TensorRT python binding not installed (Tier-2 Jetson only) | 6 | legitimate |
| Requires Docker compose services (postgres / mock-sat) | 57 | borderline: covered by docker harness when it runs |
| Console scripts not on PATH (pip install -e . would fix) | 3 | borderline: env-conditional |
| actionlint not on PATH (CI lint job installs separately) | 1 | borderline: env-conditional |
| Other (empty parametrize, doc-gated replay e2e) | 2 | legitimate |
The 57 "Requires Docker compose services" skips are the largest illegitimate-per-skill cluster. They become covered the moment any docker harness runs end-to-end against postgres + mock-sat. Until then, they remain.
C12 regression that surfaced during this run
e2e/_unit_tests/reporting/test_csv_reporter.py::test_csv_plugin_emits_required_columns and two sibling tests in test_nfr_recorder.py failed with:
argparse.ArgumentError: argument --csv: conflicting option string: --csv
inside subprocess-spawned pytest invocations. Root cause: pytest-csv 3.0.0 was listed in e2e/runner/requirements.txt and auto-loaded via its entry point, which conflicts with our custom --csv flag from e2e/runner/reporting/csv_reporter.py. The conftest comment claimed our plugin "overrides" pytest-csv, but pytest's option registry does not allow overrides; it raises on conflict. pytest-csv is also incompatible with pytest 9.x (it uses the removed hookwrapper marker). Our code never imports pytest_csv, so the dependency was dead weight.
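The collision can be reproduced in isolation. The sketch below uses plain argparse rather than pytest's own parser, on the assumption (suggested by the argparse.ArgumentError in the traceback above) that the duplicate registration ultimately reaches argparse. The helper name register_twice is hypothetical.

```python
import argparse


def register_twice(flag: str) -> str:
    """Register the same option string twice and return the resulting error text."""
    parser = argparse.ArgumentParser(prog="demo")
    # First registration: stands in for the auto-loaded pytest-csv plugin.
    parser.add_argument(flag, help="first owner of the flag")
    try:
        # Second registration: stands in for our csv_reporter plugin.
        parser.add_argument(flag, help="second owner of the flag")
    except argparse.ArgumentError as exc:
        return str(exc)
    return "no conflict"


message = register_twice("--csv")
print(message)
```

With the default conflict handler there is no "last registration wins" behaviour, which is why removing the unused dependency was the appropriate fix rather than reordering plugins.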
Fix landed in commit eb6dc17 [autodev] fix csv_reporter --csv collision with pytest-csv:
- Removed pytest-csv from e2e/runner/requirements.txt.
- Updated docstrings/comments in csv_reporter.py and conftest.py.
- Uninstalled pytest-csv from the local environment.
After the fix the full C12 suite reports 1229 passed in 145.93s.
Secondary finding — false-positive batch report
_docs/03_implementation/batch_89_cycle1_report.md claimed "Full e2e unit-test suite: 1229 passed in 134 s". That number was reported without an actual verifying invocation. The 3 reporting subprocess tests have been broken since pytest-csv was installed locally, but the batch report didn't catch it.
Proposed preventive rule (pending user approval, per meta-rule.mdc Self-Improvement): "Before writing 'Test Results: X passed' in a batch report, the same shell invocation that produced X must appear in the assistant transcript with the exit code visible." Will be added to meta-rule.mdc if approved.
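One way to make the proposed rule mechanical is to derive every reported number from a captured invocation. The following is a minimal sketch, not an existing project tool: run_and_record is a hypothetical helper, and the command shown is a stand-in for a real pytest invocation.

```python
import subprocess
import sys


def run_and_record(cmd: list[str]) -> dict:
    """Run cmd and keep the evidence a batch-report line would need."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    lines = proc.stdout.strip().splitlines()
    return {
        "command": " ".join(cmd),
        "exit_code": proc.returncode,
        # pytest prints its pass/fail summary as the final stdout line.
        "summary": lines[-1] if lines else "",
    }


# Stand-in command; a real report would pass the actual pytest command line.
record = run_and_record([sys.executable, "-c", "print('3 passed')"])
print(record["exit_code"], record["summary"])
```

A report line would then be generated from record instead of being typed by hand, so the count and the exit code cannot drift apart.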
Docker harness — findings
| ID | Severity | Description | Status |
|---|---|---|---|
| H-1 | medium | e2e/docker/docker-compose.test.yml referenced docker/Dockerfile; actual file is docker/companion-tier1.Dockerfile | fixed in 6ce3158 |
| H-2 | medium | fdr-output volume declared tmpfs size=64g; host Docker has 3.8 GiB | fixed in 6ce3158 (plain named volume; SUT enforces cap internally per NFT-LIM-02) |
| H-3 | low | e2e-results/ bind dir missing at repo root | fixed in 6ce3158 (mkdir + .gitignore) |
| H-4 | blocker | ardupilot/mavproxy:latest image MISSING from Docker Hub | deferred (see "Harness rehabilitation" below) |
| H-5 | blocker | ardupilot/ardupilot-sitl:plane-stable image MISSING from Docker Hub | deferred |
| H-6 | blocker | inavflight/inav-sitl:9.0.0 image MISSING from Docker Hub | deferred |
| H-7 | blocker | Top-level tests/e2e/Dockerfile entrypoint is pytest /opt/tests/e2e/scenarios (empty dir); real tests are in /opt/tests/e2e/replay/ | deferred |
| H-8 | blocker | Top-level tests/e2e/Dockerfile uses a plain python:3.10-slim and never installs the SUT package, so the gps-denied-replay console script and the project's import surface aren't available in the container | deferred |
| H-9 | medium | tile-cache-fixture volume in e2e/docker/docker-compose.test.yml is unseeded; no builder service. AZ-595 was meant to deliver the seeder | deferred |
Why H-4..H-6 are blockers, not minor drift
_docs/02_document/tests/environment.md § Docker Environment specifies three Docker Hub images for SITL/MAVLink:
- ardupilot/ardupilot-sitl:plane-stable
- inavflight/inav-sitl:9.0.0
- ardupilot/mavproxy:latest
None of the named org accounts publish to Docker Hub. Verified by docker manifest inspect — all three return MISSING. The spec was written against aspirational/imagined image names and never verified.
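The availability check can be scripted so the spec is re-verified whenever image names change. This is a sketch under the assumption that a non-zero exit from docker manifest inspect means the image is not pullable; classify and docker_manifest_exists are hypothetical helper names, and the probe is injectable so the logic can be exercised without Docker.

```python
import subprocess
from typing import Callable, Iterable

SPEC_IMAGES = [
    "ardupilot/ardupilot-sitl:plane-stable",
    "inavflight/inav-sitl:9.0.0",
    "ardupilot/mavproxy:latest",
]


def docker_manifest_exists(image: str) -> bool:
    """True when `docker manifest inspect IMAGE` exits with status 0."""
    proc = subprocess.run(
        ["docker", "manifest", "inspect", image],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def classify(
    images: Iterable[str],
    exists: Callable[[str], bool] = docker_manifest_exists,
) -> dict:
    """Map each image reference to 'OK' or 'MISSING' using the given probe."""
    return {image: ("OK" if exists(image) else "MISSING") for image in images}
```

Calling classify(SPEC_IMAGES) on a host with the Docker CLI performs the real registry lookups; passing a fake exists callable keeps unit tests offline.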
Real alternatives:
| Spec image | Real option |
|---|---|
| ardupilot/ardupilot-sitl:plane-stable | Community images: radarku/ardupilot-sitl, khancyr/ardupilot-docker-sitl (older); or build from ardupilot/ardupilot source (~30-60 min build). |
| inavflight/inav-sitl:9.0.0 | No good published image. Build from iNav source (multi-hour). |
| ardupilot/mavproxy:latest | Doesn't exist. Wrap pip install MAVProxy in a python:3.10-slim Dockerfile (~10 lines). |
Why H-7 and H-8 matter
scripts/run-tests.sh is the test-run skill's "first match" runner — but its Dockerfile points at an empty scenarios dir and never installs the SUT package. Even if H-7 is fixed by repointing to /opt/tests/e2e/, the heavy tests in tests/e2e/replay/test_derkachi_1min.py require the gps-denied-replay console script which only exists when the SUT package is pip install-ed into the runner image. So H-7 + H-8 are coupled.
SUT Reality Gate verdict
Per test-run/SKILL.md § Functional Mode → 0. System-Under-Test Reality Gate:
- If _docs/00_problem/input_data/expected_results/results_report.md exists, at least one e2e/blackbox run must compare actual product outputs against that mapping or the machine-readable files it references.
results_report.md exists and contains:
- Still-image frame centers (60 images → expected WGS84 lat/lon, ±50 m primary, ±20 m stretch).
- Derkachi video/IMU fixture (validation rules for telemetry CSV, video stream, alignment, trajectory comparison).
The 41 blackbox scenarios in e2e/tests/ (functional/FT-* for the still-image set, performance/NFT-PERF-* for replay latency, resilience/NFT-RES-* for failure modes) exist and would exercise this mapping. None of them ran in this cycle because:
- They require the e2e/docker/run-tier1.sh harness, blocked by H-4..H-6.
- The fallback bootstrap harness (scripts/run-tests.sh → tests/e2e/replay/) is blocked by H-7 + H-8.
Verdict: Reality Gate UNMET. Local pytest verifies internal modules but does NOT compare actual product outputs against results_report.md. The skill defines this as a blocking gate.
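To make concrete what "compare actual product outputs" means for the still-image set, the sketch below grades one frame-center estimate against the ±50 m primary and ±20 m stretch tolerances. It assumes a spherical-Earth haversine distance is an acceptable error metric at these scales; the function names and the coordinates in the example are hypothetical and are not taken from results_report.md.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius, spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def grade(error_m: float, primary_m: float = 50.0, stretch_m: float = 20.0) -> str:
    """Classify a position error against the stretch and primary tolerances."""
    if error_m <= stretch_m:
        return "stretch"
    if error_m <= primary_m:
        return "primary"
    return "fail"


# Hypothetical expected vs. estimated frame center, 0.0003 degrees of latitude apart.
error = haversine_m(50.0000, 36.0000, 50.0003, 36.0000)
print(round(error, 1), grade(error))
```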
Harness rehabilitation — proposed work
Three independent tracks; tracks 1 and 2 are required for the Reality Gate, track 3 is a nice-to-have for FC-side acceptance tests.
Track 1 — Bootstrap harness (fastest path to a real Reality Gate)
Fix H-7 + H-8 so scripts/run-tests.sh actually runs tests/e2e/replay/test_derkachi_1min.py with RUN_REPLAY_E2E=1. Steps:
- Change tests/e2e/Dockerfile entrypoint from pytest /opt/tests/e2e/scenarios to pytest /opt/tests/e2e/.
- Install the SUT package in the runner image so gps-denied-replay is on PATH. Either pip install -e . from the host source (requires bind-mount or COPY) or build a wheel and pip install it.
- Inject RUN_REPLAY_E2E=1 into the e2e-runner service environment in docker-compose.test.yml.
- Mount _docs/00_problem/input_data/ into the runner so the Derkachi fixture is reachable.
- Verify the resulting docker run produces tests/e2e/replay/output/replay.jsonl and that AC-1 + AC-2 assertions actually fire.
Estimated effort: ~4-6 hours including verification on dev workstation + CI.
Coverage: still-image reality gate still won't run from this harness (it's in e2e/tests/functional/FT-* which Track 2 owns). But Derkachi tlog comparison runs, which is itself a reality-gate signal for the replay pipeline.
Track 2 — Full blackbox harness (the real story)
The fuller e2e harness in e2e/docker/run-tier1.sh runs all 41 batch-60-89 scenarios. To get it running:
- Decide on the SITL strategy:
  - Option a: switch to community images (radarku/ardupilot-sitl, source-built iNav).
  - Option b: strip SITL services entirely from the compose; mark SITL-dependent scenarios as skip(reason="sitl-image-unavailable, ticket=..."). ~70-80% of scenarios still run (FT-* still-image accuracy, NFT-PERF latency, NFT-LIM resource budgets, most NFT-RES resilience scenarios). FT-P-09-AP / NFT-SEC-03 / AC-4.3 / AC-NEW-2 stay skipped.
- Replace ardupilot/mavproxy:latest with a local e2e/fixtures/mavproxy/Dockerfile that wraps pip install MAVProxy.
- Build a tile-cache seeder service that consumes the 60 nadir reference images + Derkachi bbox and emits the FAISS index + tile manifest into the tile-cache-fixture volume. This is the long-pending AZ-595.
Estimated effort: ~3-5 days if going with Option b + AZ-595; ~1-3 weeks if going Option a with full SITL coverage.
Track 3 — Tier-2 Jetson hardware loop (AZ-444)
Out of scope for this report; documented in environment.md § Execution instructions — Tier-2 (Jetson hardware loop). Requires Jetson Orin Nano Super hardware + JetPack 6.2 + a self-hosted runner.
Recommendations
- Treat Step 11 as PARTIALLY MET for cycle 1: local Tier-1 green, Reality Gate deferred to a follow-on cycle. Document this honestly in the autodev state instead of marking Step 11 complete.
- Open Jira tickets (or replay via leftovers if MCP is not invoked immediately):
  - [Bug] csv_reporter --csv collides with pytest-csv autoload (2 pts): fix already in commit eb6dc17, ticket just records the regression.
  - [Epic] E2E Tier-1 harness rehabilitation with sub-tasks: H-4..H-9 each as a child story (2-5 pts each).
  - [Story] Tile-cache fixture builder: AZ-595 already exists; link H-9 to it.
- Add the preventive meta-rule about transcript-verified test claims, if approved.
- Resume Step 11 after Track 1 completes: at minimum get one real Reality Gate signal from tests/e2e/replay/. Track 2 can run in parallel as its own work stream and feed back into Step 11 cycle 2.
Path 3 attempt — Full SITL with community images (2026-05-17, post-blocker)
Per user direction, attempted the "Full path" rehab: switch ArduPilot SITL to sparlane/ardupilot-sitl:Plane-latest (verified pullable), build iNav SITL from source, write MAVProxy Dockerfile, then run FT-P-01 / FT-P-02 against the real fixture builders.
Key reframe discovered during attempt: e2e/runner/helpers/sitl_observer.py is pure offline fixture replay, not a live SITL client (see file docstring + _FdrReplayObserver class). Setting E2E_SITL_REPLAY_DIR=... switches the observer to read pre-built JSON fixtures (observer_<fc_kind>_<host>.json). No live SITL container needed for the existing blackbox FT-P-* and NFT-* tests. The compose-file SITL services in environment.md are aspirational future state.
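The replay switch described above amounts to an environment-driven path lookup. The sketch below illustrates that idea only; it is not the code in sitl_observer.py. The helper names, the fc_kind and host values, and the JSON payload are hypothetical, while the environment variable and the observer_<fc_kind>_<host>.json naming pattern come from the text.

```python
import json
import os
import tempfile
from pathlib import Path


def observer_fixture_path(fc_kind: str, host: str, env=os.environ):
    """Return the replay fixture path, or None when replay mode is not configured."""
    replay_dir = env.get("E2E_SITL_REPLAY_DIR")
    if not replay_dir:
        return None
    return Path(replay_dir) / f"observer_{fc_kind}_{host}.json"


def load_observer_fixture(fc_kind: str, host: str, env=os.environ):
    """Load the pre-built JSON fixture for one observer; fail loudly if unset."""
    path = observer_fixture_path(fc_kind, host, env)
    if path is None:
        raise RuntimeError("E2E_SITL_REPLAY_DIR is not set")
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


# Self-contained demo with a throwaway directory and a placeholder payload.
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "observer_ardupilot_sitl-host.json"
    fixture.write_text(json.dumps({"samples": []}), encoding="utf-8")
    loaded = load_observer_fixture("ardupilot", "sitl-host", env={"E2E_SITL_REPLAY_DIR": tmp})
print(loaded)
```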
So the realistic Full Path is:
1. Install SUT locally (pip install -e .): DONE.
2. Run e2e.fixtures.sitl_replay_builder.build_p01_fixtures to produce e2e/fixtures/sitl_replay/p01/: BLOCKED (see below).
3. Run pytest on e2e/tests/positive/test_ft_p_01_still_image_accuracy.py with E2E_SITL_REPLAY_DIR=e2e/fixtures/sitl_replay/p01: BLOCKED on step 2.
Trying step 2 surfaced 4 new integration drifts, on top of H-1..H-9 from the prior section:
| ID | Severity | Description | Status |
|---|---|---|---|
| H-10 | blocker | Fixture builder calls gps-denied-replay --fdr-out PATH. The CLI's actual arg name is --output. | not fixed |
| H-11 | blocker | Fixture builder doesn't pass the CLI's required --camera-calibration, --config, --mavlink-signing-key args. Need to add fields to FixtureBuilderConfig and update build_p01_fixtures.py / build_p02_fixtures.py. | not fixed |
| H-12 | medium | tests/fixtures/calibration/adti26.json declared body_to_camera_se3 as {rotation_xyzw, translation_xyz_m} dict; loader at runtime_root/_replay_branch.py:308 strictly expects a 4×4 matrix via np.asarray(..., dtype=np.float64). The dict form was never parseable. | fixed: converted to 4×4 identity (tests/fixtures/calibration/adti26.json). Equivalent rotation/translation, no behavior change. |
| H-13 | blocker | Auto-sync AC-8 validation hard-fails on still-image + stationary fixtures even when --time-offset-ms 0 is supplied. Validator computes a "frame-window match %" (default 95% threshold) that requires real video motion + IMU takeoff signal. The FT-P-01 fixture (60 stills + stationary IMU) has neither by design. No --skip-auto-sync or --accept-low-confidence-offset escape hatch exists. | not fixed |
| H-14 | env-conditional | CLI requires env vars including BUILD_REPLAY_SINK_JSONL=ON to use NoopMavlinkTransport. This is documented in code comments but not in .env.example. | needs doc update |
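For H-12, the two calibration encodings are related by the standard quaternion-to-rotation-matrix formula. The sketch below shows that conversion in pure Python, assuming rotation_xyzw is a quaternion in x, y, z, w order and translation_xyz_m is in metres, as the field names suggest. se3_from_quat_translation is a hypothetical helper, not part of the loader.

```python
def se3_from_quat_translation(rotation_xyzw, translation_xyz_m):
    """Build a 4x4 homogeneous transform (nested lists) from quaternion + translation."""
    x, y, z, w = rotation_xyzw
    norm = (x * x + y * y + z * z + w * w) ** 0.5
    if norm == 0.0:
        raise ValueError("zero-length quaternion")
    # Normalise so slightly non-unit input still yields a proper rotation.
    x, y, z, w = x / norm, y / norm, z / norm, w / norm
    tx, ty, tz = translation_xyz_m
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w), tx],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w), ty],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y), tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


# Identity quaternion with zero translation maps to the 4x4 identity.
identity = se3_from_quat_translation([0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0])
print(identity)
```

A matrix produced this way can be written to the calibration JSON as a nested list, the shape that np.asarray(..., dtype=np.float64) accepts.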
Total live harness drift count: 14 distinct items (4 fixed: H-1..H-3 and H-12; 10 still open). Each of H-10..H-13 individually takes 30-60 min to fix once the design decisions are made; together they exceed the safe single-session budget given the surface-area uncertainty.
Pattern: The fixture builders (AZ-598/599/600), the CLI signature (AZ-401/402), the calibration JSON schema, and the replay protocol auto-sync (AZ-405) were each implemented well in isolation but never integrated end-to-end. This is exactly what the SUT Reality Gate is designed to surface.
Path 3 verdict
Cannot reach the SUT Reality Gate in this session. Even after fixing H-12, the next gate (H-13: auto-sync hard-fail on stationary fixtures) requires a design decision: either expand the auto-sync escape hatch in the SUT, or change the fixture builder to inject a single-frame motion event, or relax AC-8 validation thresholds for stationary scenarios. Each is a non-trivial design call that warrants a Jira ticket and review, not a unilateral mid-session fix.
Updated recommendation
Track 2 ("Full blackbox harness") from the previous section needs to expand to include H-10..H-14 as additional sub-stories. Realistic effort: +1-2 days on top of the prior estimate. Path 3 is achievable but requires 3-5 days of focused harness rehab, not a single session.
Artifacts
- Commit eb6dc17: csv_reporter / pytest-csv fix
- Commit 6ce3158: e2e/docker harness drift fixes (H-1, H-2, H-3)
- Commit 5c1c35d: H-12 4×4 SE3 calibration fix + replay_config_minimal.yaml
- This report: _docs/03_implementation/run_tests_step11_report.md
- (Replayed and removed) Leftover for pytest-csv ticket → AZ-601
- (Replayed and removed) Leftover for harness epic → AZ-602 + 11 child stories
Cycle-2 Update: Track 1 Bootstrap Harness Outcome (2026-05-17 22:00 UTC)
Status: Track 1 done — Reality Gate signal is now REAL
The harness rehabilitation Epic landed as AZ-602 with 11 child stories. The user picked Track 1 (AZ-603 + AZ-604) for the shortest path to a genuine SUT Reality Gate signal. Both stories shipped together in a single PR.
What changed
- tests/e2e/Dockerfile rewritten as a three-stage Ubuntu 22.04 build:
  - stage 1: system deps (build-essential, libpq-dev, libspatialindex-dev, python3.10-venv, python3-pip)
  - stage 2: SUT editable install (pip install -e ".[dev]" into /opt/venv)
  - stage 3: slim runtime with python3, python3.10, libpq5, libspatialindex-c6, libgl1, libglib2.0-0 (OpenCV's runtime libs)
- Image layout: /opt/pyproject.toml + /opt/src/... + /opt/tests/... (bind-mounted). This mirrors the host repo so Path(__file__).resolve().parents[3] resolves to /opt and AC-4's AST scan finds src/gps_denied_onboard/components/ correctly.
- Entrypoint: pytest -q /opt/tests/e2e/ (not the empty scenarios/ dir).
- docker-compose.test.yml e2e-runner service gets the full env set (GPS_DENIED_FC_PROFILE, CAMERA_CALIBRATION_PATH, LOG_LEVEL, LOG_SINK, INFERENCE_BACKEND, FDR_PATH, TILE_CACHE_PATH, MAVLINK_SIGNING_KEY, RUN_REPLAY_E2E=1, BUILD_REPLAY_SINK_JSONL=ON) plus mounts for _docs/00_problem/input_data and writable fdr-data / tile-data named volumes.
Reality Gate run
Standalone docker run of the e2e-runner (no companion / mock-sat / db needed for AZ-404):
docker run --rm \
-v "$PWD/tests:/opt/tests:ro" \
-v "$PWD/_docs/00_problem/input_data:/opt/_docs/00_problem/input_data:ro" \
-e RUN_REPLAY_E2E=1 -e BUILD_REPLAY_SINK_JSONL=ON \
... (full env set) ... \
--entrypoint pytest gps-denied-onboard/e2e-runner:dev \
-v --tb=short /opt/tests/e2e/
Result:
| Outcome | Count | Tests |
|---|---|---|
| PASSED | 17 | AC-4 AST scan, AC-4b encoder byte-equality, AC-7 skip-gate, all AC-9 helpers (test_helpers.py) |
| FAILED | 5 | AC-1, AC-2, AC-5, AC-6 pace-realtime, AC-6 pace-asap |
| SKIPPED | 1 | AC-8 operator workflow (D-PROJ-2 mock-suite-sat-service not implemented) |
| XFAIL | 1 | AC-3 (calibration intrinsics unknown — documented) |
| Total collected | 24 | (vs. 0 before Track 1 — empty scenarios/ dir) |
Before vs. after
| Metric | Before Track 1 | After Track 1 |
|---|---|---|
| Tests collected by scripts/run-tests.sh | 0 (entrypoint points at empty scenarios/) | 24 (full tests/e2e/) |
| Tests that actually exercise the SUT | 0 | 5 heavy ACs invoke gps-denied-replay subprocess |
| Exit code semantics | Vacuous 0 (no tests collected ≠ no SUT bugs) | Reflects real test outcomes |
| gps-denied-replay on PATH inside e2e-runner image | no (image was python:3.10-slim + pytest only) | yes (multi-stage SUT install) |
| Source-tree layout inside image matches repo | no (no src present) | yes (/opt/src/..., AC-4 passes) |
| Real SUT wall-clock per heavy AC | n/a | ~21 s for the auto-sync probe (see below) |
Real bug discovered
The 5 failing heavy ACs share a single root cause: tlog synth time-base mismatch.
tests/e2e/replay/_tlog_synth.py:62:
_TLOG_BASE_TIMESTAMP_US: Final[int] = 1_700_000_000_000_000 # 2023-11-14
# "The absolute value is irrelevant for replay-mode determinism;
# only the delta-between-rows matters." ← STALE COMMENT
The auto-sync detector in replay_input.tlog_video_adapter DOES use absolute timestamps to compute the video↔tlog offset. With the tlog anchored at Nov 2023 absolute and the synthetic video at relative t=0, auto-sync reports offset_ms=1699999995666 (~54 years) and hard-fails AC-8 (95% frame-window match threshold).
Surface signal from the SUT (the kind of log the Reality Gate was meant to surface):
ERROR replay_input.tlog_video_adapter
kind=replay.auto_sync.ac8_validation_failed
msg=auto-sync hard-fail: frame-window match below 95.0% with offset_ms=1699999995666
tlog_takeoff_ns=1700000000000000000 video_motion_onset_ns=4333333333
imu_sample_count=3000 video_frame_count=301
This is the same family as H-13 / AZ-611 (stationary FT-P-01) but on the moving Derkachi fixture with a different root cause (synth time-base, not stationary kinematics). Filed as AZ-614.
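The logged numbers are consistent with a simple anchor difference. The sketch below assumes the offset is the tlog takeoff timestamp minus the video motion onset, floor-divided from nanoseconds to milliseconds; the detector's actual formula is not shown in this report, so treat offset_ms here as a hypothetical reconstruction that happens to reproduce the logged value.

```python
NS_PER_MS = 1_000_000
MS_PER_YEAR = 1000 * 86_400 * 365.25


def offset_ms(tlog_takeoff_ns: int, video_motion_onset_ns: int) -> int:
    """Offset between the tlog and video anchors, floored to whole milliseconds."""
    return (tlog_takeoff_ns - video_motion_onset_ns) // NS_PER_MS


# Values copied from the ac8_validation_failed log line above.
broken = offset_ms(1_700_000_000_000_000_000, 4_333_333_333)
print(broken)                           # 1699999995666
print(round(broken / MS_PER_YEAR, 1))   # 53.9

# Same video onset with the tlog anchored at t=0 instead of an absolute epoch.
rebased = offset_ms(0, 4_333_333_333)
print(rebased)                          # -4334
```

Under this assumption, an absolute epoch anchor on one side and a relative t=0 anchor on the other can only produce an offset on the order of the epoch itself, so no amount of real motion in the fixture could satisfy the frame-window match.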
Jira state at end of cycle 2
| Issue | Title | Status |
|---|---|---|
| AZ-602 | E2E Tier-1 harness rehabilitation (Epic) | TO DO |
| AZ-601 | csv_reporter --csv collision (fixed eb6dc17) | IN TESTING |
| AZ-603 | H-7 Dockerfile entrypoint (Track 1) | DONE (this cycle) |
| AZ-604 | H-8 install SUT in runner image (Track 1) | DONE (this cycle) |
| AZ-605 | H-4..H-6 SITL strategy decision | TO DO |
| AZ-606 | MAVProxy local Dockerfile | TO DO |
| AZ-607 | H-9 tile-cache seeder (linked to AZ-595) | TO DO |
| AZ-608 | H-10 fixture builder --fdr-out → --output | TO DO |
| AZ-609 | H-11 fixture builder missing CLI args | TO DO |
| AZ-610 | H-12 calibration JSON 4×4 (fixed 5c1c35d) | DONE |
| AZ-611 | H-13 auto-sync hard-fail on stationary | TO DO (Track 2, decision) |
| AZ-612 | H-14 .env.example BUILD_REPLAY_SINK_JSONL | TO DO |
| AZ-613 | H-1..H-3 harness drift (fixed 6ce3158) | DONE |
| AZ-614 | Derkachi tlog synth time-base mismatch | TO DO (Track 2, unblocks AC-1..AC-6) |
Reality Gate verdict
Cycle-2 verdict for Step 11: Reality Gate signal is now REAL — the SUT runs end-to-end for ~21 s on the Derkachi fixture and surfaces a real auto-sync bug. Pre-Track 1, the gate was a vacuous "exit 0 with 0 tests collected" that hid every SUT issue. Track 1 was the minimum investment to make the gate honest; future cycles (Track 2 + AZ-614) will turn the failing ACs green.
Cycle-2 addendum: Jetson harness brought online (AZ-615)
The Colima harness above is "Tier-1" — ARM Linux without GPU. The SUT's
pytorch_fp16_runtime (and tensorrt_runtime) hard-code .cuda() calls,
so anything past auto-sync can ONLY be exercised against a real GPU. The
operator's Jetson Orin Nano (JetPack 6.2.2+b24, L4T R36.5.0,
nvidia-container-toolkit ≥ 1.16) was wired in as the Tier-2 harness.
Net-new artifacts (committed under AZ-615):
- tests/e2e/Dockerfile.jetson: FROM dustynv/l4t-pytorch:r36.4.0 with Tegra-tuned torch / torchvision pre-baked. Wipes the image's stale /etc/pip.conf (jetson.webredirect.org is maintainer-LAN only), upgrades pip 24→26 so the gtsam<5.0,>=4.2 constraint resolves to the only PyPI wheel for aarch64 (4.3a0, same as Colima), installs the SUT editable via system-pip + --break-system-packages.
- docker-compose.test.jetson.yml: mirror of docker-compose.test.yml with runtime: nvidia, deploy.resources.reservations.devices, and GPS_DENIED_TIER: "2" so the auto-skip hook in tests/conftest.py runs the heavy ACs instead of skipping them.
- scripts/run-tests-jetson.sh: rsync → ssh build → ssh up wrapper. Operator-side SSH alias jetson-e2e documented in _docs/03_implementation/jetson_harness_setup.md.
- @pytest.mark.tier2 applied to AC-1, AC-2, AC-3, AC-5, AC-6 in tests/e2e/replay/test_derkachi_1min.py so the same test file is the source of truth for both harnesses (Colima auto-skips tier2 via the existing pytest_collection_modifyitems hook).
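The tier gating can be pictured as a small decision function that a collection hook consults. This is a sketch of the idea, not the contents of tests/conftest.py: tier2_skip_reason is a hypothetical helper, and the assumption is that the hook keys off the GPS_DENIED_TIER environment variable described above.

```python
import os


def tier2_skip_reason(marker_names, env=os.environ):
    """Return a skip reason for tier2-marked tests outside Tier 2, else None."""
    if "tier2" not in marker_names:
        return None  # light-path test: runs on every harness
    if env.get("GPS_DENIED_TIER") == "2":
        return None  # Jetson harness: heavy ACs are allowed to run
    return "tier2 test: requires GPS_DENIED_TIER=2 (Jetson harness)"


# How a conftest.py hook could use it (needs pytest, shown as comments only):
#
# def pytest_collection_modifyitems(config, items):
#     for item in items:
#         reason = tier2_skip_reason({m.name for m in item.iter_markers()})
#         if reason:
#             item.add_marker(pytest.mark.skip(reason=reason))

print(tier2_skip_reason({"tier2"}, env={}))
print(tier2_skip_reason({"tier2"}, env={"GPS_DENIED_TIER": "2"}))
```

Keeping the decision in a plain function makes the gating testable without spinning up a pytest session.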
Jetson smoke run (first end-to-end, 2026-05-18)
| Outcome | Count | Tests |
|---|---|---|
| PASSED | 17 | AC-4 AST scan, AC-7 skip-gate, 14× AC-9 helpers |
| FAILED | 5 | AC-1, AC-2, AC-5, AC-6 pace-realtime, AC-6 pace-asap |
| SKIPPED | 1 | AC-8 (unchanged: D-PROJ-2 mock-sat stub) |
| XFAIL | 1 | AC-3 (unchanged: calibration intrinsics unknown) |
| Wall clock | 10m09s | (vs ~5m on Colima) |
Same 5 failures as Colima, same root cause (replay.auto_sync.ac8_validation_failed,
offset_ms=1699999995666). AZ-614 reproduces on Jetson because the synth
tlog time-base bug is architecture-independent — heavy ACs die at
auto-sync, BEFORE any frame reaches the GPU. So this run validated the
infrastructure (image builds, GPU exposed, SUT runs, pytest collects 24)
but did NOT yet exercise ALIKED / DISK LightGlue on the actual GPU. The
2× wall delta vs Colima is the cost of CUDA + torch + TensorRT
initialization in the per-test SUT subprocess.
Implication for Track 2: fixing AZ-614 is the gating prerequisite for ANY Reality-Gate-grade signal from the heavy ACs. Until then, Jetson and Colima are indistinguishable: the same light ACs pass and the same heavy ACs fail. Once AZ-614 lands, the two harnesses divide cleanly: Colima keeps exercising the light path (AC-4 / AC-7 / AC-9 plus auto-sync), Jetson covers the heavy path (AC-1 / AC-2 / AC-5 / AC-6 plus the GPU inference stages they entail).
Lessons learned (committed to setup doc)
- nvcr.io/nvidia/l4t-base is deprecated in JetPack 6; l4t-pytorch has no R36 tags; l4t-jetpack:r36.4.0 exists but ships no PyTorch. dustynv/l4t-pytorch:r36.4.0 (Docker Hub) is the only off-the-shelf Jetson base image with Tegra-tuned PyTorch wheels for R36.
- nvidia-container-runtime mounts nvidia-smi + CUDA libs from the host into any container at runtime, so the GPU-exposure smoke test doesn't need a 5 GB l4t-jetpack pull; ubuntu:22.04 nvidia-smi (80 MB) suffices.
- The dustynv image bakes a private pip mirror into /etc/pip.conf; builds in any other network must wipe it AND pin --index-url to upstream PyPI.
- git LFS-tracked fixtures (the 269 MB Derkachi mp4) must be pre-smudged on the Mac BEFORE the rsync step; otherwise the Jetson receives the 134 B pointer and tests fail at fixture-load.
Cycle-3 addendum: AZ-614 + AZ-611 landed; next blocker is airborne bootstrap (AZ-618)
Date: 2026-05-18. Track-2 Reality-Gate work continued on Jetson with two SUT-side fixes layered on top of the AZ-615 harness.
Commits landed this cycle
| Commit | Ticket | What |
|---|---|---|
| e114bfd | AZ-614 | _TLOG_BASE_TIMESTAMP_US = 0 so the synth tlog shares the video's t=0 anchor |
| bd41956 | AZ-611 | --skip-auto-sync CLI flag + ReplayConfig.skip_auto_sync_validation field + 5 unit tests |
| 324bbd6 | AZ-602 | docker-compose.test.yml + docker-compose.test.jetson.yml: set all three replay BUILD_* flags |
| b7012d2 | AZ-615 | scripts/run-tests-jetson.sh: resolve ~ against remote $HOME before the heredoc cd |
Jetson rerun progression
Each rerun isolated one fix to keep the diagnostic signal clean.
| Rerun | Run scope | tlog_takeoff_ns | resolved offset_ms | Failure layer |
|---|---|---|---|---|
| Pre-fix (915484.txt) | AZ-614 unverified | 1.7e18 (53 yr anchor) | 1.7e12 (~53 yr) | auto-sync AC-9 |
| Rerun 1 (527631.txt) | AZ-614 only | 0 | -4334 (~4.3 s) | auto-sync AC-9 (false-positive motion) |
| Rerun 2 (224191.txt) | AZ-614 + AZ-611 | 0 | 0 (manual, validation skipped) | runtime_root build flag (BUILD_VIDEO_FILE_FRAME_SOURCE) |
| Rerun 3 (110515.txt) | + AZ-602 build flags | 0 | 0 | runtime_root airborne_bootstrap ← NEW |
Reality-Gate verdict (cycle 3)
The Jetson run now successfully:
- Reads the synth tlog (message_counts: SCALED_IMU2/ATTITUDE/GPS_RAW_INT/HEARTBEAT)
- Opens VideoFileFrameSource against the 269 MB Derkachi mp4
- Opens TlogReplayFcAdapter and JsonlReplaySink
- Logs replay.compose_root.ready: pace=asap resolved_offset_ms=0 auto_sync_used=false
…then immediately crashes inside runtime_root.airborne_bootstrap:
runtime_root: airborne_bootstrap: component 'c4_pose' requires
pre_constructed['c282_ransac_filter'] to be populated before
compose_root() runs; available keys in constructed:
['clock', 'fc_adapter', 'frame_source', 'mavlink_transport', 'replay_sink'].
Production main() must build infrastructure (c13_fdr, c6_*, c7_inference, etc.)
into pre_constructed and pass it to compose_root(config, pre_constructed=...).
This affects both live and replay binaries. Every prior "green" Reality-Gate
run died at auto-sync (AZ-614 root cause) BEFORE the composition graph was
walked, so the gap stayed hidden through Track 1 + AZ-615. AZ-591 registered the
strategy wrappers; runtime_root.main() still does not construct the
infrastructure dependencies those wrappers consume. The 38 unit tests for
compose_root pass only because they inject a stub factory via the
replay_components_factory kwarg, bypassing the bootstrap entirely.
Filed as AZ-618 (Story under AZ-602, 5 pts capped per local rules, with a 6-subtask split recommended during refinement: c13_fdr+clock, c6_*, c7_inference, c3_lightglue+feature_extractor, c2_82_ransac_filter, integration wiring).
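The shape of the gap can be expressed as a pre-flight check on the pre_constructed mapping. This is an illustrative sketch only: missing_infrastructure is a hypothetical helper, the required names are the ones listed in this addendum, and c6_* is kept as a prefix because the report does not enumerate the individual c6 components.

```python
REQUIRED_KEYS = (
    "c13_fdr",
    "c7_inference",
    "c3_lightglue_runtime",
    "c3_feature_extractor",
    "c2_82_ransac_filter",
)
REQUIRED_PREFIXES = ("c6_",)  # report only says "c6_*"; members are not listed


def missing_infrastructure(pre_constructed, required=REQUIRED_KEYS, prefixes=REQUIRED_PREFIXES):
    """List infrastructure entries that are absent from pre_constructed."""
    missing = [key for key in required if key not in pre_constructed]
    for prefix in prefixes:
        if not any(name.startswith(prefix) for name in pre_constructed):
            missing.append(prefix + "*")
    return missing


# Keys that the crash message above reports as available in Rerun 3.
available = dict.fromkeys(
    ["clock", "fc_adapter", "frame_source", "mavlink_transport", "replay_sink"]
)
gaps = missing_infrastructure(available)
print(gaps)
```

A check of this kind, run in a unit test against the mapping that production main() actually builds rather than against a stub factory, would have exposed the gap without needing a Jetson run.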
Tier-2 e2e count breakdown (Cycle 3)
Same 5 failures, three layers deeper into the SUT than Cycle 2.
| Outcome | Count | Tests |
|---|---|---|
| PASSED | 17 | AC-4 AST scan, AC-7 skip-gate, 14× AC-9 helpers |
| FAILED | 5 | AC-1, AC-2, AC-5, AC-6 pace-realtime, AC-6 pace-asap |
| SKIPPED | 1 | AC-8 (unchanged: D-PROJ-2 mock-sat stub) |
| XFAIL | 1 | AC-3 (unchanged: calibration intrinsics unknown) |
| Wall clock | 20s | (vs ~10 min Cycle 2: now fails fast at composition root instead of timing out at auto-sync) |
Jira state at end of cycle 3
| Issue | Title | Status |
|---|---|---|
| AZ-602 | E2E Tier-1 harness rehabilitation (Epic) | TO DO (Track-2 still in progress) |
| AZ-611 | Auto-sync escape hatch (--skip-auto-sync) | DONE this cycle |
| AZ-614 | Derkachi tlog synth time-base mismatch | DONE this cycle |
| AZ-615 | Jetson Tier-2 harness | DONE (+ tilde-fix this cycle) |
| AZ-618 | Airborne main() must build pre_constructed infrastructure for compose_root | NEW, TO DO (next Reality-Gate blocker) |
Discovered followup (no commits, just a note)
scripts/run-tests-jetson.sh still does docker compose up against the
docker-compose.test.jetson.yml runner stack, which tries to pull
operator-orchestrator:dev etc. from Docker Hub (they only exist as local
build tags). Rerun 3 worked around this by skipping compose entirely and
invoking docker run --rm --runtime=nvidia --gpus all gps-denied-onboard/e2e-runner:jetson …
directly. Compose isn't needed until the test reaches into the DB / mock-sat /
companion services — which currently never happens because the run dies at
airborne_bootstrap. Recommend revisiting the script after AZ-618 lands so the
compose dependency graph is meaningful.