Two doc lessons learned from on-Jetson verification:
1. The `cat >> ~/.ssh/config <<'EOF'` heredoc needs a leading blank
line. Without it, the appended block fused onto the previous
file line and produced "unsupported option yesHost" at parse
time. Added an explicit blank line + comment.
2. The smoke test for nvidia-container-runtime doesn't need a 5 GB
l4t-jetpack pull — nvidia-container-runtime mounts nvidia-smi
from the host into any container, so `ubuntu:22.04 nvidia-smi`
(80 MB) is sufficient. Switched the doc.
Operator verified end-to-end:
* `ssh jetson-e2e true` works from both terminal and Cursor Shell
* `jetson` user already in `docker` group (no sudo needed)
* `docker run --runtime=nvidia ubuntu:22.04 nvidia-smi` returns
Orin GPU info inside the container
Co-authored-by: Cursor <cursoragent@cursor.com>
Jetson e2e Harness — Operator Setup
AZ-615 / AZ-602 cycle-2. Documents the one-time operator-side setup
that makes scripts/run-tests-jetson.sh work against a Jetson Orin Nano
reachable from the developer Mac over SSH.
Why a separate Jetson harness exists
The Colima/Tier-1 smoke harness (docker-compose.test.yml +
tests/e2e/Dockerfile) verifies wiring, env config, fixture loading,
auto-sync, and JSONL schema — everything UP TO the GPU boundary. But
all three C7 inference strategies
(pytorch_fp16_runtime.py, tensorrt_runtime.py,
onnx_trt_ep_runtime.py) are CUDA-only by design (model.half().cuda()
on pytorch_fp16_runtime.py:189, no CPU fallback). The full Reality
Gate — including C3 matcher + C7 inference — therefore needs a
CUDA-capable host.
The Jetson harness runs the same test tree (tests/e2e/) on the Jetson
with GPS_DENIED_TIER=2, which turns OFF the auto-skip for
@pytest.mark.tier2 tests (see tests/conftest.py:31-44).
Hardware contract
Operator-confirmed environment (2026-05-17):
- Jetson Orin Nano dev kit
- JetPack 6.2.2+b24
- L4T R36.5.0 (Jan 2026)
- nvidia-container-toolkit 1.16.2
- ≥ 30 GB free on /var/lib/docker (l4t-pytorch base image ~7 GB + build cache + fixture volumes)
- Swap enabled (Orin Nano has 8 GB RAM; PyTorch + TensorRT loads spike)
One-time setup
1. SSH key + alias (on the Mac)
# Generate a dedicated keypair (separate from your daily-dev key).
# This command produces BOTH halves in one go:
# ~/.ssh/id_ed25519_jetson_e2e — private (keep secret, never share)
# ~/.ssh/id_ed25519_jetson_e2e.pub — public (push to Jetson below)
ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519_jetson_e2e \
-C "jetson-e2e $(date +%Y-%m-%d)"
# Push the public half to the Jetson (asks for the Jetson password once).
# Add `-p <port>` if the Jetson's sshd listens on a non-default port:
ssh-copy-id -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>
# or with a custom port:
# ssh-copy-id -p <port> -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>
# Verify the Jetson's host key (run this ON the Jetson, via HDMI/serial,
# not over the LAN you're about to trust):
# ssh-keygen -lf /etc/ssh/ssh_host_ed25519_key.pub
# Then compare against what the Mac sees on first connect. Accept only
# if they match.
# Wire up ~/.ssh/config (gitignored, never committed). Add `Port <port>`
# if the Jetson's sshd listens on a non-default port.
#
# IMPORTANT: the leading blank line inside the heredoc is intentional.
# Without it, the appended block can fuse onto the previous file line
# (`IdentitiesOnly yesHost jetson-e2e` was a real failure mode).
cat >> ~/.ssh/config <<'EOF'

Host jetson-e2e
    HostName <jetson-ip>
    User <jetson-user>
    Port 22
    IdentityFile ~/.ssh/id_ed25519_jetson_e2e
    IdentitiesOnly yes
    AddKeysToAgent yes
    UseKeychain yes
    StrictHostKeyChecking accept-new
    ServerAliveInterval 30
    ServerAliveCountMax 4
EOF
# Cache the passphrase into macOS Keychain (one-time)
ssh-add --apple-use-keychain ~/.ssh/id_ed25519_jetson_e2e
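The fused-line failure mentioned in the comments above happens when ~/.ssh/config does not end in a newline before the append. A minimal sketch of a guard that could run before appending, assuming a hypothetical helper name ensure_trailing_newline (not part of the repo) and demonstrated on a scratch file rather than the real config:

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): make sure FILE ends with a
# newline so that text appended afterwards starts on its own line.
ensure_trailing_newline() {
    file="$1"
    # tail -c 1 prints the last byte. Command substitution strips a trailing
    # newline, so a non-empty result means the last byte is NOT a newline.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
}

# Demo on a scratch file, not on ~/.ssh/config.
demo_cfg="$(mktemp)"
printf 'Host other\n    IdentitiesOnly yes' > "$demo_cfg"   # no final newline
ensure_trailing_newline "$demo_cfg"
printf 'Host jetson-e2e\n' >> "$demo_cfg"
cat "$demo_cfg"
rm -f "$demo_cfg"
```

After the real append, ssh -G jetson-e2e prints the options that ssh resolves for the alias, which is a quick way to confirm the new block parsed.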
2. Restrict the key's scope on the Jetson (recommended)
Edit ~/.ssh/authorized_keys on the Jetson and prefix the line that the
ssh-copy-id step appended:
from="<mac-lan-ip>",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA… jetson-e2e
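Optionally lock the key to "only run the e2e driver" by also adding
command="docker compose -f /home/jetson/gps-denied-onboard/docker-compose.test.jetson.yml up --abort-on-container-exit".
With that option the key cannot get a general shell; sshd runs that one command no matter what the client requests. The forced command also replaces the rsync and build steps that scripts/run-tests-jetson.sh sends over the same key, so use it only for a workflow that triggers the compose run alone.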
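Editing authorized_keys by hand makes it easy to drop a comma or a quote. A small sketch that builds the restricted line from its two inputs, assuming a hypothetical helper name restrict_key_line and placeholder values for the IP and key:

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): prefix a public-key line with
# the restrictions from this section.
#   $1 = Mac LAN IP, $2 = the public-key line that ssh-copy-id appended
restrict_key_line() {
    printf 'from="%s",no-port-forwarding,no-X11-forwarding,no-agent-forwarding %s\n' "$1" "$2"
}

# Placeholder values for illustration only; substitute the real IP and key.
restrict_key_line "192.0.2.10" "ssh-ed25519 AAAAexample jetson-e2e"
```

The printed line replaces the unprefixed entry in ~/.ssh/authorized_keys on the Jetson.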
3. Harden sshd (optional, recommended for an exposed test rig)
On the Jetson, create /etc/ssh/sshd_config.d/10-e2e.conf:
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
Then sudo systemctl reload ssh.
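Before reloading, sshd -t (run as root) checks the configuration for syntax errors, and sshd -T prints the effective settings with lowercase keywords. A sketch of a check over that output, assuming a hypothetical helper name check_sshd_hardening that reads the sshd -T text on stdin:

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): read `sshd -T` output on stdin
# and succeed only if all three settings from 10-e2e.conf are in effect.
check_sshd_hardening() {
    awk '
        $1 == "passwordauthentication" && $2 == "no"  { ok++ }
        $1 == "permitrootlogin"        && $2 == "no"  { ok++ }
        $1 == "pubkeyauthentication"   && $2 == "yes" { ok++ }
        END { exit (ok == 3 ? 0 : 1) }
    '
}

# On the Jetson this would be:  sudo sshd -T | check_sshd_hardening
# Here a captured sample stands in for the real output.
printf 'passwordauthentication no\npermitrootlogin no\npubkeyauthentication yes\n' \
    | check_sshd_hardening && echo "hardening active"
```

Keep an existing SSH session open while testing, so a mistake in the drop-in file cannot lock you out.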
4. Verify the Jetson Docker + GPU pipeline
nvidia-container-runtime mounts nvidia-smi + CUDA libs from the
host into the container at runtime, so a tiny base image works for the
smoke test (no need to pull the 5 GB l4t-jetpack image just to check
GPU exposure):
ssh jetson-e2e 'docker run --rm --runtime=nvidia --gpus all \
ubuntu:22.04 nvidia-smi'
Expected output: an nvidia-smi-style table listing the Orin GPU. If
this fails with 'could not select device driver "nvidia"' or 'no GPU
devices', reinstall nvidia-container-toolkit and run
sudo systemctl restart docker.
If nvidia-smi works on the host directly but not inside a container,
the problem is almost certainly nvidia-container-toolkit, not the driver.
5. Confirm disk + swap
ssh jetson-e2e 'df -h /var/lib/docker && swapon --show && free -h'
Need ≥ 30 GB free on /var/lib/docker. Swap should be at least 4 GB
(JetPack default is 4 GB zram).
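The two thresholds can also be checked by a script instead of by eye. A minimal sketch, assuming hypothetical helper names has_free_gb and swap_total_mb (neither is part of the repo); it is meant to run on the Jetson, for example through ssh jetson-e2e:

```shell
#!/bin/sh
# Hypothetical helpers (not part of the repo).

# Succeed if the filesystem holding PATH has at least MIN_GB GiB available.
#   $1 = path, $2 = minimum free space in GiB
has_free_gb() {
    path="$1"; min_gb="$2"
    # df -P prints one POSIX-format line per filesystem; -k reports KiB.
    avail_kb="$(df -Pk "$path" | awk 'NR == 2 { print $4 }')"
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Print total swap in MiB from a /proc/meminfo-style file (default: the real one).
swap_total_mb() {
    awk '$1 == "SwapTotal:" { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# On the Jetson, for example:
#   has_free_gb /var/lib/docker 30 || echo "not enough space for images"
#   swap_total_mb      # compare against the 4 GB guidance above
has_free_gb / 0 && echo "disk check ran"
```

swap_total_mb only prints a number and leaves the comparison to the operator, because zram sized from usable RAM can report slightly under a round 4096 MiB.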
Running the harness
From the developer Mac, repo root:
bash scripts/run-tests-jetson.sh
What happens:
- rsync source → jetson-e2e:~/gps-denied-onboard/ (excludes .git, __pycache__, build artefacts; LFS pointers transfer as text).
- ssh jetson-e2e docker compose -f docker-compose.test.jetson.yml build e2e-runner
- ssh jetson-e2e docker compose ... up --abort-on-container-exit --exit-code-from e2e-runner
- stdout / stderr stream to the Mac terminal; exit code propagates.
Override the alias or remote dir if your setup differs:
# Quote the remote dir so that ~ expands on the Jetson, not on the Mac.
JETSON_SSH_ALIAS=other-host JETSON_REMOTE_DIR='~/somewhere/else' \
    bash scripts/run-tests-jetson.sh
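The sequence described above can be sketched as a short driver. This is not the real scripts/run-tests-jetson.sh: the rsync flags, the cd into the remote directory, the reuse of the same -f argument for the up command (the text above elides it), and the DRY_RUN switch are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only; mirrors the documented steps, not the real harness script.
alias_name="${JETSON_SSH_ALIAS:-jetson-e2e}"
remote_dir="${JETSON_REMOTE_DIR:-}"
# Keep the default as a literal ~ so the Jetson expands it, not the Mac.
[ -n "$remote_dir" ] || remote_dir='~/gps-denied-onboard'

# Assumed to be identical for the build and up commands.
compose="docker compose -f docker-compose.test.jetson.yml"

# Hypothetical dry-run wrapper: print the command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

jetson_e2e() {
    # 1. Sync the working tree (flags are an assumption).
    run rsync -az --exclude .git --exclude __pycache__ \
        ./ "$alias_name:$remote_dir/" &&
    # 2. Build the runner image on the Jetson.
    run ssh "$alias_name" "cd $remote_dir && $compose build e2e-runner" &&
    # 3. Run the tests; ssh returns the remote exit status to the caller.
    run ssh "$alias_name" \
        "cd $remote_dir && $compose up --abort-on-container-exit --exit-code-from e2e-runner"
}

DRY_RUN=1      # print the commands instead of contacting a Jetson
jetson_e2e
```

Chaining the steps with && stops at the first failure, and because ssh exits with the status of the remote command, the result of the compose run becomes the result of the function.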
Smoke vs. Reality Gate split — at a glance
| Test category | Marker | Colima (Tier-1) | Jetson (Tier-2) |
|---|---|---|---|
| AC-4a AST scan | (none) | runs | runs |
| AC-4b byte-equality | (none) | runs | runs |
| AC-7 skip-gate self-check | (none) | runs | runs |
| AC-9 helper unit tests | (none) | runs | runs |
| AC-1 / AC-2 / AC-3 / AC-5 / AC-6 (heavy) | tier2 | SKIPPED | runs |
| AC-8 operator workflow | skip (AZ-616 blocks) | skipped | skipped |
GPS_DENIED_TIER env var controls the auto-skip:
- GPS_DENIED_TIER=1 (Colima default) → tier2 / gpu / docker marked tests auto-skipped via tests/conftest.py:31-44.
- GPS_DENIED_TIER=2 (Jetson default) → all markers active; everything runs (subject to other skip gates like RUN_REPLAY_E2E).
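The rule reads as a small decision function. The real logic lives in tests/conftest.py and is not shown here; the sketch below only restates the two documented cases, using a hypothetical name is_auto_skipped. Treating an unset variable as tier 1, and any value other than 1 as "nothing skipped", are assumptions.

```shell
#!/bin/sh
# Sketch of the documented rule, not the conftest.py implementation.
#   $1 = pytest marker, $2 = tier (defaults to $GPS_DENIED_TIER, then 1)
# Returns 0 when a test with that marker would be auto-skipped.
is_auto_skipped() {
    marker="$1"
    tier="${2:-${GPS_DENIED_TIER:-1}}"
    [ "$tier" = "1" ] || return 1          # tier 2: nothing is auto-skipped
    case "$marker" in
        tier2|gpu|docker) return 0 ;;      # heavy markers skip on tier 1
        *)                return 1 ;;
    esac
}

for m in tier2 gpu docker unmarked; do
    if is_auto_skipped "$m" 1; then
        echo "tier 1, $m: skipped"
    else
        echo "tier 1, $m: runs"
    fi
done
```

Other skip gates such as RUN_REPLAY_E2E sit outside this rule and are not modelled.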
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| cannot reach 'ssh jetson-e2e' non-interactively | Agent isn't unlocked or key not in authorized_keys | ssh-add -l on Mac; check ~/.ssh/authorized_keys on Jetson |
| docker: Error response from daemon: could not select device driver "nvidia" | nvidia-container-toolkit missing or daemon not restarted after install | sudo apt install nvidia-container-toolkit && sudo systemctl restart docker |
| torch.cuda.is_available() == False inside the container | runtime: nvidia block missing, or building on x86 host | Verify docker-compose.test.jetson.yml has runtime: nvidia; rebuild on the Jetson |
| replay.auto_sync.ac8_validation_failed | AZ-614 (tlog time-base mismatch); not a harness bug | Fix AZ-614 in tests/e2e/replay/_tlog_synth.py |
| not found / tag not found on nvcr.io/nvidia/l4t-base:r36.* | l4t-base was deprecated in JetPack 6 | Use l4t-jetpack:r36.4.0 for smoke tests; the harness itself uses dustynv/l4t-pytorch:r36.4.0 |
| pull access denied for nvcr.io/nvidia/... | NGC requires login for some tags | docker login nvcr.io (use NGC API key from developer.nvidia.com) |
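The first row can be reproduced from the Mac without running the whole harness. A minimal sketch, assuming a hypothetical helper name can_ssh_noninteractive; BatchMode and ConnectTimeout are standard OpenSSH client options:

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): succeed only if the alias is
# reachable without any password or passphrase prompt.
can_ssh_noninteractive() {
    # BatchMode=yes makes ssh fail instead of prompting;
    # ConnectTimeout bounds the wait when the host is unreachable.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

if can_ssh_noninteractive "${JETSON_SSH_ALIAS:-jetson-e2e}"; then
    echo "ssh OK"
else
    echo "ssh failed: check 'ssh-add -l' on the Mac and authorized_keys on the Jetson"
fi
```

A failure here with a working interactive login usually points at the agent or Keychain rather than at the Jetson.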
Related Jira
- AZ-615: this harness (Jetson runner story)
- AZ-616: replace mock-sat with real ../satellite-provider service
- AZ-617: mark heavy ACs with tier2 (already applied; this story documents and verifies the auto-skip)
- AZ-614: tlog time-base mismatch (currently blocks the heavy ACs from reaching the GPU stage)
- AZ-602: parent Epic (E2E Tier-1 harness rehabilitation)