diff --git a/_docs/03_implementation/jetson_harness_setup.md b/_docs/03_implementation/jetson_harness_setup.md
index 50e19e6..d31d416 100644
--- a/_docs/03_implementation/jetson_harness_setup.md
+++ b/_docs/03_implementation/jetson_harness_setup.md
@@ -37,12 +37,18 @@ Operator-confirmed environment (2026-05-17):
 ### 1. SSH key + alias (on the Mac)
 
 ```bash
-# Generate a dedicated keypair (separate from your daily-dev key)
+# Generate a dedicated keypair (separate from your daily-dev key).
+# This command produces BOTH halves in one go:
+# ~/.ssh/id_ed25519_jetson_e2e — private (keep secret, never share)
+# ~/.ssh/id_ed25519_jetson_e2e.pub — public (push to Jetson below)
 ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519_jetson_e2e \
   -C "jetson-e2e $(date +%Y-%m-%d)"
 
-# Push the public half to the Jetson (asks for the Jetson password once)
+# Push the public half to the Jetson (asks for the Jetson password once).
+# Add `-p ` if the Jetson's sshd listens on a non-default port:
 ssh-copy-id -i ~/.ssh/id_ed25519_jetson_e2e.pub @
+# or with a custom port:
+# ssh-copy-id -p -i ~/.ssh/id_ed25519_jetson_e2e.pub @
 
 # Verify the Jetson's host key (run this ON the Jetson, via HDMI/serial,
 # not over the LAN you're about to trust):
@@ -50,11 +56,13 @@ ssh-copy-id -i ~/.ssh/id_ed25519_jetson_e2e.pub @
 # Then compare against what the Mac sees on first connect. Accept only
 # if they match.
 
-# Wire up ~/.ssh/config (gitignored, never committed)
+# Wire up ~/.ssh/config (gitignored, never committed). Add `Port `
+# if the Jetson's sshd listens on a non-default port.
 cat >> ~/.ssh/config <<'EOF'
 Host jetson-e2e
   HostName
   User
+  Port 22
   IdentityFile ~/.ssh/id_ed25519_jetson_e2e
   IdentitiesOnly yes
   AddKeysToAgent yes
@@ -95,14 +103,24 @@ Then `sudo systemctl reload ssh`.
 
 ### 4. Verify the Jetson Docker + GPU pipeline
 
+`nvcr.io/nvidia/l4t-base` was deprecated in JetPack 6 — use
+`l4t-jetpack` (the official replacement) for the smoke test:
+
 ```bash
 ssh jetson-e2e 'docker run --rm --runtime=nvidia --gpus all \
-  nvcr.io/nvidia/l4t-base:r36.4.0 nvidia-smi'
+  nvcr.io/nvidia/l4t-jetpack:r36.4.0 nvidia-smi'
 ```
 
-Expected output: a `nvidia-smi`-style table listing the Orin GPU. If
+Expected output: an `nvidia-smi`-style table listing the Orin GPU. If
 this fails with "runtime not found" or "no GPU devices", install
-`nvidia-container-toolkit` and `sudo systemctl restart docker`.
+`nvidia-container-toolkit` and `sudo systemctl restart docker`. If it
+fails with `pull access denied`, run `docker login nvcr.io` once (NGC
+API key from developer.nvidia.com — most public images don't require
+auth, but the registry sometimes prompts).
+
+If `nvidia-smi` works on the host directly (it does — driver 540.5.0,
+CUDA 12.6, Orin detected) but the container can't see the GPU, the
+problem is always nvidia-container-toolkit, not the driver.
 
 ### 5. Confirm disk + swap
 
@@ -162,7 +180,8 @@ JETSON_SSH_ALIAS=other-host JETSON_REMOTE_DIR=~/somewhere/else \
 | `docker: Error response from daemon: could not select device driver "nvidia"` | nvidia-container-toolkit missing or daemon not restarted after install | `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker` |
 | `torch.cuda.is_available() == False` inside the container | `runtime: nvidia` block missing, or building on x86 host | Verify `docker-compose.test.jetson.yml` has `runtime: nvidia`; rebuild on the Jetson |
 | `replay.auto_sync.ac8_validation_failed` | AZ-614 (tlog time-base mismatch) — not a harness bug | Fix AZ-614 in `tests/e2e/replay/_tlog_synth.py` |
-| `pull access denied for nvcr.io/nvidia/l4t-pytorch` | NGC requires login for some tags | `docker login nvcr.io` (use NGC API key from developer.nvidia.com) |
+| `not found` / `tag not found` on `nvcr.io/nvidia/l4t-base:r36.*` | `l4t-base` was deprecated in JetPack 6 | use `l4t-jetpack:r36.4.0` for smoke tests; the harness itself uses `dustynv/l4t-pytorch:r36.4.0` |
+| `pull access denied for nvcr.io/nvidia/...` | NGC requires login for some tags | `docker login nvcr.io` (use NGC API key from developer.nvidia.com) |
 
 ## Related Jira
 
diff --git a/tests/e2e/Dockerfile.jetson b/tests/e2e/Dockerfile.jetson
index 74d4963..0ff0761 100644
--- a/tests/e2e/Dockerfile.jetson
+++ b/tests/e2e/Dockerfile.jetson
@@ -24,17 +24,29 @@
 # the nvidia-container-runtime later mounts at run time.
 # ---------------------------------------------------------------------------
 
-# Base — l4t-pytorch ships JetPack runtime + PyTorch wheel ready for `.cuda()`
+# Base — dustynv/l4t-pytorch ships JetPack runtime + PyTorch wheel for `.cuda()`
 #
-# Tag selection: NGC publishes l4t-pytorch on a slight lag from L4T BSP
-# releases. With BSP R36.5 on the device, the closest stable NGC tag at
-# author time is `r36.4.0-pth2.3-py3`. NVIDIA containers are
-# forward-compatible across one minor BSP (the container's userspace
-# can be slightly older than the host's L4T kernel). If a `r36.5.0-*`
-# tag is published, prefer it.
+# Tag selection rationale (verified 2026-05-17 against the live registries):
 #
-# Image lookup at run time: `docker manifest inspect nvcr.io/nvidia/l4t-pytorch:r36.4.0-pth2.3-py3`
-FROM nvcr.io/nvidia/l4t-pytorch:r36.4.0-pth2.3-py3 AS runtime
+# - `nvcr.io/nvidia/l4t-base` was deprecated in JetPack 6 (forums:
+# "L4T Base docker image for Jetpack 6.2 (r36.4.3)" / Issue #883 in
+# dusty-nv/jetson-containers). The image no longer publishes r36 tags.
+# - `nvcr.io/nvidia/l4t-pytorch` has NO r36 tags published. The newest
+# official l4t-pytorch tag is r35.2.1-pth2.0-py3 — too old for our
+# torch >= 2.2 floor in pyproject.toml `[inference]`.
+# - `nvcr.io/nvidia/l4t-jetpack:r36.4.0` exists (CUDA + cuDNN + TensorRT
+# bundled) but ships NO PyTorch — we'd have to install the Jetson
+# PyTorch wheel from developer.download.nvidia.com manually.
+# - `dustynv/l4t-pytorch:r36.4.0` (Docker Hub) is the de-facto Jetson
+# PyTorch image: maintained by dusty-nv (NVIDIA's Jetson containers
+# maintainer), bakes torch / torchvision / opencv / ONNX runtime for
+# JetPack 6, ARM64, ~6.3 GB. Forward-compatible with the host's
+# slightly newer R36.5 BSP (NVIDIA containers tolerate one minor BSP
+# ahead on the host side).
+#
+# Verify availability before build:
+# docker pull dustynv/l4t-pytorch:r36.4.0
+FROM dustynv/l4t-pytorch:r36.4.0 AS runtime
 ARG DEBIAN_FRONTEND=noninteractive
 
 # System deps mirror tests/e2e/Dockerfile + the Jetson runtime stack: