[AZ-615] Step-11 report + state: Jetson harness first end-to-end run

Records the first Jetson Tier-2 run results in the step-11 report:
17 pass / 5 fail / 1 skip / 1 xfail (24 total, 10m09s) — identical to
Colima because all 5 failures hit AZ-614 (tlog time-base mismatch)
BEFORE reaching the GPU. So the infrastructure is proven (image
builds, GPU exposed inside container, SUT subprocess runs to the
auto-sync stage) but the heavy ACs haven't yet exercised ALIKED /
DISK LightGlue. Fixing AZ-614 is the gating prerequisite to actually
drive the GPU stages.

Also captures lessons learned that are now in the setup doc:
  * Only dustynv/l4t-pytorch:r36.4.0 is a usable Jetson PyTorch base
    on Docker Hub for R36 / JetPack 6 (l4t-base deprecated, official
    l4t-pytorch has no R36 tags).
  * The dustynv image bakes a maintainer-LAN-only pip mirror into
    /etc/pip.conf — must be wiped + --index-url pinned to pypi.org.
  * pip 24.2 (image default) rejects gtsam-4.3a0 pre-release; pip 26.x
    accepts the same wheel for `gtsam<5.0,>=4.2` because there are no
    stable aarch64 builds. Upgrade pip in the build; don't relax the pin.
  * nvidia-container-runtime mounts nvidia-smi from host, so the GPU
    smoke test needs only ubuntu:22.04 (80 MB), not l4t-jetpack (5 GB).

Autodev state advances to phase 7 / jetson-harness-online.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-18 08:14:26 +03:00
parent 58a1678417
commit 8e563efd4c
2 changed files with 75 additions and 3 deletions
@@ -314,3 +314,75 @@ This is the same family as H-13 / `AZ-611` (stationary FT-P-01) but on the movin
### Reality Gate verdict
**Cycle-2 verdict for Step 11**: Reality Gate signal is now REAL — the SUT runs end-to-end for ~21 s on the Derkachi fixture and surfaces a real auto-sync bug. Pre-Track 1, the gate was a vacuous "exit 0 with 0 tests collected" that hid every SUT issue. Track 1 was the minimum investment to make the gate honest; future cycles (Track 2 + AZ-614) will turn the failing ACs green.
## Cycle-2 addendum: Jetson harness brought online (AZ-615)
The Colima harness above is "Tier-1" — ARM Linux without GPU. The SUT's
`pytorch_fp16_runtime` (and `tensorrt_runtime`) hard-code `.cuda()` calls,
so anything past auto-sync can ONLY be exercised against a real GPU. The
operator's Jetson Orin Nano (JetPack 6.2.2+b24, L4T R36.5.0,
nvidia-container-toolkit ≥ 1.16) was wired in as the Tier-2 harness.

Net-new artifacts (committed under AZ-615):
* `tests/e2e/Dockerfile.jetson`: `FROM dustynv/l4t-pytorch:r36.4.0` with
Tegra-tuned torch / torchvision pre-baked. Wipes the image's stale
`/etc/pip.conf` (jetson.webredirect.org is maintainer-LAN only),
upgrades pip 24→26 so the `gtsam<5.0,>=4.2` constraint resolves to
the only PyPI wheel for aarch64 (`4.3a0`, same as Colima), installs
the SUT editable via system-pip + `--break-system-packages`.
* `docker-compose.test.jetson.yml` — mirror of `docker-compose.test.yml`
with `runtime: nvidia`, `deploy.resources.reservations.devices`, and
`GPS_DENIED_TIER: "2"` so the auto-skip hook in `tests/conftest.py`
runs the heavy ACs instead of skipping them.
* `scripts/run-tests-jetson.sh` — rsync → ssh build → ssh up wrapper.
Operator-side SSH alias `jetson-e2e` documented in
`_docs/03_implementation/jetson_harness_setup.md`.
* `@pytest.mark.tier2` applied to AC-1, AC-2, AC-3, AC-5, AC-6 in
`tests/e2e/replay/test_derkachi_1min.py` so the same test file is the
source of truth for both harnesses (Colima auto-skips tier2 via the
existing `pytest_collection_modifyitems` hook).
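
The tier gating described in the last two bullets can be sketched roughly as follows. This is a minimal illustration, not the project's actual `tests/conftest.py`: the helper name `tier2_enabled`, the default of tier 1 when the variable is unset, and the rule that any integer value of 2 or more enables the heavy ACs are all assumptions. Only the `tier2` marker, the `GPS_DENIED_TIER` variable, and the `pytest_collection_modifyitems` hook come from the text above.

```python
import os


def tier2_enabled(environ=None):
    """Return True when the harness declares itself Tier-2 (GPU available).

    Assumption: GPS_DENIED_TIER holds an integer and any value >= 2
    enables the tier2-marked tests; unset means Tier-1.
    """
    env = os.environ if environ is None else environ
    try:
        return int(env.get("GPS_DENIED_TIER", "1")) >= 2
    except ValueError:
        # Unparseable values fall back to the light (Tier-1) path.
        return False


def pytest_collection_modifyitems(config, items):
    """Skip tests marked @pytest.mark.tier2 unless the harness is Tier-2."""
    if tier2_enabled():
        return
    import pytest  # lazy import keeps tier2_enabled usable without pytest

    skip_tier2 = pytest.mark.skip(reason="tier2 test: needs GPS_DENIED_TIER >= 2")
    for item in items:
        if item.get_closest_marker("tier2") is not None:
            item.add_marker(skip_tier2)
```

Keeping the decision in a small pure function makes the gate easy to check on its own, independent of pytest collection.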
### Jetson smoke run (first end-to-end, 2026-05-18)
| Outcome | Count | Tests |
|---------|-------|-------|
| PASSED | 17 | AC-4 AST scan, AC-7 skip-gate, 14× AC-9 helpers |
| FAILED | 5 | AC-1, AC-2, AC-5, AC-6 pace-realtime, AC-6 pace-asap |
| SKIPPED | 1 | AC-8 (unchanged: D-PROJ-2 mock-sat stub) |
| XFAIL | 1 | AC-3 (unchanged: calibration intrinsics unknown) |
| **Wall clock** | **10m09s** | (vs ~5m on Colima) |

**Same 5 failures as Colima, same root cause** (`replay.auto_sync.ac8_validation_failed`,
offset_ms=1699999995666). AZ-614 reproduces on Jetson because the synth
tlog time-base bug is architecture-independent: the heavy ACs die at
auto-sync, BEFORE any frame reaches the GPU. This run therefore validated
the infrastructure (image builds, GPU exposed, SUT runs, pytest collects 24)
but did NOT yet exercise ALIKED / DISK LightGlue on the actual GPU. The
roughly 2× wall-clock increase over Colima is the cost of CUDA + torch +
TensorRT initialization in the per-test SUT subprocess.
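
To put the reported `offset_ms` in perspective, the sketch below converts it to a calendar date and compares it with a clip length. Reading the value as a Unix-epoch timestamp in milliseconds is an assumption on our part, not something the run log states. The helper name `describe_offset` and the 60 s default clip length (guessed from the `test_derkachi_1min.py` file name) are likewise hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def describe_offset(offset_ms, clip_duration_s=60.0):
    """Interpret a sync offset two ways: as a duration and as an epoch timestamp."""
    as_date = EPOCH + timedelta(milliseconds=offset_ms)
    return {
        "offset_days": offset_ms / 1000 / 86400,
        "as_epoch_utc": as_date.isoformat(),
        # An offset far beyond the clip length cannot be a genuine sync offset.
        "exceeds_clip": abs(offset_ms) / 1000 > clip_duration_s,
    }


info = describe_offset(1699999995666)
print(info)
```

Under that assumption the offset corresponds to a date in November 2023, which would be consistent with one stream carrying wall-clock time and the other a relative time base. The report itself only identifies the cause as a time-base mismatch in the synthesized tlog.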

**Implication for Track 2**: fixing AZ-614 is the gating prerequisite for
ANY Reality-Gate-grade signal from the heavy ACs. Until then, Jetson and
Colima produce indistinguishable results: the same light ACs pass and the
same heavy ACs fail. Once AZ-614 lands, the two harnesses divide cleanly: Colima keeps
exercising the light path (AC-4 / AC-7 / AC-9 plus auto-sync), Jetson
covers the heavy path (AC-1 / AC-2 / AC-5 / AC-6 plus the GPU inference
stages they entail).
### Lessons learned (committed to setup doc)
* `nvcr.io/nvidia/l4t-base` is deprecated in JetPack 6; `l4t-pytorch`
has no R36 tags; `l4t-jetpack:r36.4.0` exists but ships no PyTorch.
`dustynv/l4t-pytorch:r36.4.0` (Docker Hub) is the only off-the-shelf
Jetson base image with Tegra-tuned PyTorch wheels for R36.
* `nvidia-container-runtime` mounts `nvidia-smi` + CUDA libs from the
host into any container at runtime, so the GPU-exposure smoke test
doesn't need a 5 GB `l4t-jetpack` pull — `ubuntu:22.04 nvidia-smi`
(80 MB) suffices.
* The dustynv image bakes a private pip mirror into `/etc/pip.conf`;
builds on any other network must wipe it AND pin `--index-url` to
upstream PyPI.
* Git LFS-tracked fixtures (the 269 MB Derkachi mp4) must be
pre-smudged on the Mac BEFORE the rsync step; otherwise the Jetson
receives only the 134 B pointer file and tests fail at fixture-load.
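
A pre-flight check along these lines could catch un-smudged fixtures before the rsync step. It is a sketch, not part of `scripts/run-tests-jetson.sh`: the function name `is_lfs_pointer` and the 1 KiB size cutoff are assumptions, while the `version https://git-lfs.github.com/spec/v1` first line is the standard Git LFS pointer header.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path, max_pointer_bytes=1024):
    """Return True if `path` looks like a Git LFS pointer rather than real content."""
    p = Path(path)
    if p.stat().st_size > max_pointer_bytes:
        return False  # pointers are tiny text files; real fixtures are far larger
    with p.open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

Running such a check over the fixture directory on the Mac and aborting when any file matches would surface the problem before the transfer, not at fixture-load on the Jetson.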
+3 -3
@@ -6,9 +6,9 @@ step: 11
name: Run Tests
status: passed_with_followups
sub_step:
-phase: 6
-name: track-1-complete
-detail: "Track 1 done (AZ-603 + AZ-604 Done). Reality Gate signal now REAL: 17 pass / 5 fail / 1 skip / 1 xfail across 24 tests. AC-1..AC-6 share root cause AZ-614 (tlog synth time-base mismatch). Tracks 2/3 queued for cycle 2."
+phase: 7
+name: jetson-harness-online
+detail: "Track 1 done + AZ-615 Jetson Tier-2 harness wired. First Jetson run: identical to Colima (17 pass / 5 fail / 1 skip / 1 xfail, 10m09s). Same 5 failures hit AZ-614 (tlog synth time-base, arch-independent) BEFORE reaching the GPU. Image builds, GPU exposed, SUT runs — infrastructure proven. Next: fix AZ-614 to actually exercise the GPU. AZ-616 (real ../satellite-provider) + AZ-617 (tier2 marks done) queued."
retry_count: 0
cycle: 1
tracker: jira