mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-21 08:41:12 +00:00


Oleksandr Bezdieniezhnykh 6586208f83 [AZ-615] Fix Jetson harness base image (l4t-base/l4t-pytorch tags don't exist)

Operator-reported: `nvcr.io/nvidia/l4t-base:r36.4.0` fails to pull.
Investigation against the live registries confirmed:

  * `nvcr.io/nvidia/l4t-base` — deprecated in JetPack 6, no r36 tags
    (forum thread "L4T Base docker image for Jetpack 6.2 (r36.4.3)",
    GitHub dusty-nv/jetson-containers#883).
  * `nvcr.io/nvidia/l4t-pytorch` — no r36 tags at all. Newest is
    r35.2.1-pth2.0-py3 (too old for our torch>=2.2 floor).
  * `nvcr.io/nvidia/l4t-jetpack:r36.4.0` — exists but ships no PyTorch.
  * `dustynv/l4t-pytorch:r36.4.0` (Docker Hub) — exists, ~6.3 GB ARM64,
    PyTorch + torchvision + opencv pre-baked, maintained by dusty-nv
    (NVIDIA's Jetson containers maintainer).

Switched Dockerfile.jetson base to `dustynv/l4t-pytorch:r36.4.0`.
Forward-compatible with the host's R36.5 BSP (NVIDIA containers
tolerate one minor BSP ahead on the host side).

Setup doc fixes:
  * smoke-test command now uses `l4t-jetpack:r36.4.0` (the official
    replacement for the deprecated `l4t-base`)
  * keygen step explicitly states it produces BOTH halves (private +
    .pub) in one go
  * ssh-copy-id + ssh config show how to specify a custom port
  * troubleshooting table gets a new row for the `l4t-base not found`
    case so the next dev hits the answer in 30 seconds

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-18 02:02:26 +03:00

7.7 KiB


Jetson e2e Harness — Operator Setup

AZ-615 / AZ-602 cycle-2. Documents the one-time operator-side setup that makes scripts/run-tests-jetson.sh work against a Jetson Orin Nano reachable from the developer Mac over SSH.

Why a separate Jetson harness exists

The Colima/Tier-1 smoke harness (docker-compose.test.yml + tests/e2e/Dockerfile) verifies wiring, env config, fixture loading, auto-sync, and JSONL schema — everything UP TO the GPU boundary. But all three C7 inference strategies (pytorch_fp16_runtime.py, tensorrt_runtime.py, onnx_trt_ep_runtime.py) are CUDA-only by design (model.half().cuda() on pytorch_fp16_runtime.py:189, no CPU fallback). The full Reality Gate — including C3 matcher + C7 inference — therefore needs a CUDA-capable host.

The Jetson harness runs the same test tree (tests/e2e/) on the Jetson with GPS_DENIED_TIER=2, which turns OFF the auto-skip for @pytest.mark.tier2 tests (see tests/conftest.py:31-44).

Hardware contract

Operator-confirmed environment (2026-05-17):

Jetson Orin Nano dev kit
JetPack 6.2.2+b24
L4T R36.5.0 (Jan 2026)
nvidia-container-toolkit 1.16.2
≥ 30 GB free on /var/lib/docker (l4t-pytorch base image ~7 GB + build cache + fixture volumes)
Swap enabled (Orin Nano has 8 GB RAM; PyTorch + TensorRT loads spike)
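
The L4T release in the list above can be read back on the Jetson from the first line of /etc/nv_tegra_release. The following is a minimal sketch, assuming that line has the usual "# R36 (release), REVISION: 4.0, ..." shape; the helper names parse_l4t_release and l4t_major_is are hypothetical and not part of this repo.

```shell
# Sketch: turn the header line of /etc/nv_tegra_release into "MAJOR.MINOR.PATCH".
# Assumed input shape: "# R36 (release), REVISION: 4.0, GCID: ..., BOARD: ..."
parse_l4t_release() {
    printf '%s\n' "$1" |
        sed -n 's/^# R\([0-9][0-9]*\) (release), REVISION: \([0-9.][0-9.]*\),.*/\1.\2/p'
}

# Hypothetical helper: succeed when the release's major number matches $2.
l4t_major_is() {
    [ "${1%%.*}" = "$2" ]
}

# On the Jetson (not executed here):
#   release="$(parse_l4t_release "$(head -n 1 /etc/nv_tegra_release)")"
#   l4t_major_is "$release" 36 || echo "unexpected L4T release: $release"
```

Checking only the major number keeps the check tolerant of the minor-version gap between the R36.4.0 base image and the R36.5 host.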

One-time setup

1. SSH key + alias (on the Mac)

# Generate a dedicated keypair (separate from your daily-dev key).
# This command produces BOTH halves in one go:
#   ~/.ssh/id_ed25519_jetson_e2e       — private (keep secret, never share)
#   ~/.ssh/id_ed25519_jetson_e2e.pub   — public (push to Jetson below)
ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519_jetson_e2e \
    -C "jetson-e2e $(date +%Y-%m-%d)"

# Push the public half to the Jetson (asks for the Jetson password once).
# Add `-p <port>` if the Jetson's sshd listens on a non-default port:
ssh-copy-id -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>
# or with a custom port:
# ssh-copy-id -p <port> -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>

# Verify the Jetson's host key (run this ON the Jetson, via HDMI/serial,
# not over the LAN you're about to trust):
#   ssh-keygen -lf /etc/ssh/ssh_host_ed25519_key.pub
# Then compare against what the Mac sees on first connect. Accept only
# if they match.

# Wire up ~/.ssh/config (gitignored, never committed). Add `Port <port>`
# if the Jetson's sshd listens on a non-default port.
cat >> ~/.ssh/config <<'EOF'
Host jetson-e2e
    HostName <jetson-ip>
    User <jetson-user>
    Port 22
    IdentityFile ~/.ssh/id_ed25519_jetson_e2e
    IdentitiesOnly yes
    AddKeysToAgent yes
    UseKeychain yes
    StrictHostKeyChecking yes
    ServerAliveInterval 30
    ServerAliveCountMax 4
EOF

# Cache the passphrase into macOS Keychain (one-time)
ssh-add --apple-use-keychain ~/.ssh/id_ed25519_jetson_e2e
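
Once the alias is in place, it helps to confirm that it resolves as intended and that it works without any prompt, since the harness drives SSH non-interactively. A small sketch using standard OpenSSH options (`ssh -G` and `BatchMode`); the function names are hypothetical.

```shell
# Print the settings OpenSSH resolves for an alias from ~/.ssh/config.
show_ssh_settings() {
    ssh -G "$1" | grep -E '^(hostname|user|port|identityfile|identitiesonly) '
}

# Succeed only if the alias is reachable with no password or passphrase
# prompt (BatchMode=yes makes ssh fail instead of asking).
check_ssh_noninteractive() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Usage (on the Mac, not executed here):
#   show_ssh_settings jetson-e2e
#   check_ssh_noninteractive jetson-e2e && echo "jetson-e2e reachable"
```

If the second check fails while an interactive `ssh jetson-e2e` succeeds, the key is most likely not loaded into the agent yet.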

2. Restrict the key's scope on the Jetson (recommended)

Edit ~/.ssh/authorized_keys on the Jetson and prefix the line that the ssh-copy-id step appended:

from="<mac-lan-ip>",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA…  jetson-e2e

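Optionally, lock the key to running only the e2e driver by also adding command="docker compose -f /home/jetson/gps-denied-onboard/docker-compose.test.jetson.yml up --abort-on-container-exit". The key then cannot open a general shell; it can only invoke that one command. Note that a forced command also applies to the rsync and build steps that scripts/run-tests-jetson.sh sends over SSH, so use it only with a key that is not shared with those steps.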
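
To avoid hand-editing mistakes, the restricted line can be generated from the public key and pasted in. This is a sketch only; restrict_key_line is a hypothetical helper and the IP address in the usage comment is a placeholder.

```shell
# Prefix one public-key line with the restriction options from step 2.
#   $1 = the Mac's LAN IP (placeholder value)
#   $2 = the unmodified public-key line
restrict_key_line() {
    printf 'from="%s",no-port-forwarding,no-X11-forwarding,no-agent-forwarding %s\n' "$1" "$2"
}

# Usage (hypothetical; review the output before putting it in authorized_keys):
#   restrict_key_line 192.168.1.10 "$(cat ~/.ssh/id_ed25519_jetson_e2e.pub)"
```

Keep a second session open on the Jetson while testing the new line, so a typo in the options cannot lock you out.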

3. Harden sshd (optional, recommended for an exposed test rig)

On the Jetson, create /etc/ssh/sshd_config.d/10-e2e.conf:

PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes

Then sudo systemctl reload ssh.
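
Before reloading, it is worth validating the configuration and checking the effective values, because a later drop-in or the main sshd_config can override this file. A minimal sketch that parses `sshd -T` output (the effective configuration, printed with lowercase keywords); sshd_hardening_ok is a hypothetical name.

```shell
# Read `sshd -T` output on stdin; succeed only if all three settings from
# 10-e2e.conf are in effect.
sshd_hardening_ok() {
    awk '
        $1 == "passwordauthentication" && $2 == "no"  { ok++ }
        $1 == "permitrootlogin"        && $2 == "no"  { ok++ }
        $1 == "pubkeyauthentication"   && $2 == "yes" { ok++ }
        END { exit (ok == 3 ? 0 : 1) }
    '
}

# On the Jetson (not executed here):
#   sudo sshd -t && sudo sshd -T | sshd_hardening_ok && sudo systemctl reload ssh
```

`sshd -t` only checks syntax; the pipeline through sshd_hardening_ok is what confirms the three values actually took effect.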

4. Verify the Jetson Docker + GPU pipeline

nvcr.io/nvidia/l4t-base was deprecated in JetPack 6 — use l4t-jetpack (the official replacement) for the smoke test:

ssh jetson-e2e 'docker run --rm --runtime=nvidia --gpus all \
    nvcr.io/nvidia/l4t-jetpack:r36.4.0 nvidia-smi'

Expected output: an nvidia-smi-style table listing the Orin GPU. If this fails with "runtime not found" or "no GPU devices", install nvidia-container-toolkit, then run sudo systemctl restart docker. If it fails with pull access denied, run docker login nvcr.io once, with $oauthtoken as the username and an NGC API key (from developer.nvidia.com) as the password. Most public images don't require auth, but the registry sometimes prompts.

If nvidia-smi works on the host directly (it does: driver 540.5.0, CUDA 12.6, Orin detected) but the container can't see the GPU, the problem is almost always nvidia-container-toolkit, not the driver.
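
A quick way to separate toolkit problems from driver problems is to ask the Docker daemon which runtimes it has registered. The sketch below assumes the JSON printed by `docker info --format '{{json .Runtimes}}'`; has_nvidia_runtime is a hypothetical helper.

```shell
# Read the daemon's runtime map (JSON) on stdin; succeed if it contains a
# runtime whose key is exactly "nvidia".
has_nvidia_runtime() {
    grep -q '"nvidia"'
}

# On the Jetson (not executed here):
#   docker info --format '{{json .Runtimes}}' | has_nvidia_runtime \
#       || echo "nvidia runtime not registered with dockerd"
```

If the runtime is missing here, the container test above cannot succeed regardless of the driver state.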

5. Confirm disk + swap

ssh jetson-e2e 'df -h /var/lib/docker && swapon --show && free -h'

Need ≥ 30 GB free on /var/lib/docker. Swap should be at least 4 GB (JetPack default is 4 GB zram).
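
The disk threshold can also be checked mechanically. A minimal sketch, assuming GNU df on the Jetson (the `-BG` and `--output=avail` options); the helper names are hypothetical, and `-BG` reports GiB, which is slightly stricter than the 30 GB stated above.

```shell
# Read `df -BG --output=avail <path>` output on stdin and print the
# available space as a bare integer (GiB).
avail_gib_from_df() {
    awk 'NR == 2 { sub(/G$/, "", $1); print $1 }'
}

# Succeed when $1 (available GiB) is at least $2 (required GiB).
meets_min_gib() {
    [ "$1" -ge "$2" ]
}

# Usage (from the Mac, not executed here):
#   avail="$(ssh jetson-e2e 'df -BG --output=avail /var/lib/docker' | avail_gib_from_df)"
#   meets_min_gib "$avail" 30 || echo "only ${avail} GiB free on /var/lib/docker"
```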

Running the harness

From the developer Mac, repo root:

bash scripts/run-tests-jetson.sh

What happens:

1. rsync source → jetson-e2e:~/gps-denied-onboard/ (excludes .git, __pycache__, build artefacts; LFS pointers transfer as text).
2. ssh jetson-e2e docker compose -f docker-compose.test.jetson.yml build e2e-runner
3. ssh jetson-e2e docker compose ... up --abort-on-container-exit --exit-code-from e2e-runner
4. stdout / stderr stream to the Mac terminal; exit code propagates.

Override the alias or remote dir if your setup differs:

JETSON_SSH_ALIAS=other-host JETSON_REMOTE_DIR=~/somewhere/else \
    bash scripts/run-tests-jetson.sh
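
The two overrides behave like ordinary environment defaults. The sketch below shows one way such defaults could be resolved; it is an illustration, not the contents of scripts/run-tests-jetson.sh, and resolve_jetson_target is a hypothetical name. The default values are the alias and remote directory used elsewhere in this document.

```shell
# Keep the tilde literal so the remote side expands it, not the Mac.
DEFAULT_REMOTE_DIR='~/gps-denied-onboard'

# Print the rsync destination ("alias:dir/") after applying the overrides.
resolve_jetson_target() {
    printf '%s:%s/\n' \
        "${JETSON_SSH_ALIAS:-jetson-e2e}" \
        "${JETSON_REMOTE_DIR:-$DEFAULT_REMOTE_DIR}"
}

# Usage:
#   resolve_jetson_target
#   JETSON_SSH_ALIAS=other-host resolve_jetson_target
```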

Smoke vs. Reality Gate split — at a glance

Test category	Marker	Colima (Tier-1)	Jetson (Tier-2)
AC-4a AST scan	(none)	runs	runs
AC-4b byte-equality	(none)	runs	runs
AC-7 skip-gate self-check	(none)	runs	runs
AC-9 helper unit tests	(none)	runs	runs
AC-1 / AC-2 / AC-3 / AC-5 / AC-6 (heavy)	`tier2`	SKIPPED	runs
AC-8 operator workflow	`skip` (AZ-616 blocks)	skipped	skipped

GPS_DENIED_TIER env var controls the auto-skip:

GPS_DENIED_TIER=1 (Colima default) → tests marked tier2, gpu, or docker are auto-skipped via tests/conftest.py:31-44.
GPS_DENIED_TIER=2 (Jetson default) → all markers active; everything runs (subject to other skip gates like RUN_REPLAY_E2E).
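
The rule described above can be summarized as a small decision function. This is an illustration of the stated behavior only; it does not reproduce tests/conftest.py, and it ignores the other skip gates such as RUN_REPLAY_E2E. The name tier_decision is hypothetical.

```shell
# Print "skip" or "run" for a test marker under a given GPS_DENIED_TIER.
#   $1 = tier value (1 or 2), $2 = marker name (empty for unmarked tests)
tier_decision() {
    case "$1:$2" in
        1:tier2|1:gpu|1:docker) echo skip ;;
        *) echo run ;;
    esac
}

# Usage:
#   tier_decision 1 tier2
#   tier_decision 2 tier2
```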

Troubleshooting

Symptom	Likely cause	Fix
`cannot reach 'ssh jetson-e2e' non-interactively`	Agent isn't unlocked or key not in `authorized_keys`	`ssh-add -l` on Mac; check `~/.ssh/authorized_keys` on Jetson
`docker: Error response from daemon: could not select device driver "nvidia"`	nvidia-container-toolkit missing or daemon not restarted after install	`sudo apt install nvidia-container-toolkit && sudo systemctl restart docker`
`torch.cuda.is_available() == False` inside the container	`runtime: nvidia` block missing, or building on x86 host	Verify `docker-compose.test.jetson.yml` has `runtime: nvidia`; rebuild on the Jetson
`replay.auto_sync.ac8_validation_failed`	AZ-614 (tlog time-base mismatch) — not a harness bug	Fix AZ-614 in `tests/e2e/replay/_tlog_synth.py`
`not found` / `tag not found` on `nvcr.io/nvidia/l4t-base:r36.*`	`l4t-base` was deprecated in JetPack 6	use `l4t-jetpack:r36.4.0` for smoke tests; the harness itself uses `dustynv/l4t-pytorch:r36.4.0`
`pull access denied for nvcr.io/nvidia/...`	NGC requires login for some tags	`docker login nvcr.io` (use NGC API key from developer.nvidia.com)
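
For the `torch.cuda.is_available() == False` row, the compose file can be checked for the runtime setting without starting anything. A minimal sketch; it is a plain text match rather than a YAML parse, and compose_declares_nvidia_runtime is a hypothetical helper.

```shell
# Succeed if the file contains an indented `runtime: nvidia` line
# (optionally double-quoted). Text heuristic only, not a YAML parser.
compose_declares_nvidia_runtime() {
    grep -Eq '^[[:space:]]+runtime:[[:space:]]*"?nvidia"?[[:space:]]*$' "$1"
}

# Usage (repo root, not executed here):
#   compose_declares_nvidia_runtime docker-compose.test.jetson.yml \
#       || echo "runtime: nvidia missing from docker-compose.test.jetson.yml"
```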

Related tickets

AZ-615 — this harness (Jetson runner story)
AZ-616 — replace mock-sat with real ../satellite-provider service
AZ-617 — mark heavy ACs with tier2 (already applied; this story documents and verifies the auto-skip)
AZ-614 — tlog time-base mismatch (currently blocks the heavy ACs from reaching the GPU stage)
AZ-602 — parent Epic: E2E Tier-1 harness rehabilitation
