[AZ-615] Fix Jetson harness base image (l4t-base/l4t-pytorch tags don't exist)

Operator-reported: `nvcr.io/nvidia/l4t-base:r36.4.0` fails to pull. Investigation against the live registries confirmed:

* `nvcr.io/nvidia/l4t-base`: deprecated in JetPack 6, no r36 tags (forum thread "L4T Base docker image for Jetpack 6.2 (r36.4.3)", GitHub dusty-nv/jetson-containers#883).
* `nvcr.io/nvidia/l4t-pytorch`: no r36 tags at all. Newest is r35.2.1-pth2.0-py3 (too old for our torch>=2.2 floor).
* `nvcr.io/nvidia/l4t-jetpack:r36.4.0`: exists but ships no PyTorch.
* `dustynv/l4t-pytorch:r36.4.0` (Docker Hub): exists, ~6.3 GB ARM64, PyTorch + torchvision + opencv pre-baked, maintained by dusty-nv (NVIDIA's Jetson containers maintainer).

Switched Dockerfile.jetson base to `dustynv/l4t-pytorch:r36.4.0`. Forward-compatible with the host's R36.5 BSP (NVIDIA containers tolerate one minor BSP ahead on the host side).

Setup doc fixes:

* smoke-test command now uses `l4t-jetpack:r36.4.0` (the official replacement for the deprecated `l4t-base`)
* keygen step explicitly states it produces BOTH halves (private + .pub) in one go
* ssh-copy-id + ssh config show how to specify a custom port
* troubleshooting table gets a new row for the `l4t-base not found` case so the next dev hits the answer in 30 seconds

Co-authored-by: Cursor <cursoragent@cursor.com>
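
The registry findings in this commit message can be re-checked with a sketch like the one below. It assumes a Docker CLI that provides `docker manifest inspect`. The `check_tag` helper and the `DOCKER_BIN` override are hypothetical names introduced only for this illustration and are not part of the harness.

```bash
#!/usr/bin/env bash
# Sketch: report which candidate base-image tags resolve in their registry.
# Assumption: the Docker CLI on this machine supports `docker manifest inspect`.
# DOCKER_BIN is a hypothetical override so the CLI binary can be swapped out.
DOCKER_BIN="${DOCKER_BIN:-docker}"

check_tag() {
    # `manifest inspect` exits non-zero when the tag cannot be resolved.
    # A missing tag, an auth failure, and a network error all look the
    # same here, so "missing" means "could not be resolved", nothing more.
    if "$DOCKER_BIN" manifest inspect "$1" >/dev/null 2>&1; then
        echo "found    $1"
    else
        echo "missing  $1"
    fi
}

# Image references taken from the commit message above.
for image in \
    nvcr.io/nvidia/l4t-base:r36.4.0 \
    nvcr.io/nvidia/l4t-jetpack:r36.4.0 \
    dustynv/l4t-pytorch:r36.4.0
do
    check_tag "$image"
done
```

Because the check only reads the manifest, it does not download the multi-gigabyte image layers, which makes it a cheap first step before changing a `FROM` line.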
2026-06-22 17:21:13 +00:00 · 2026-05-18 02:02:26 +03:00
parent 9c13ab3bd0
commit 6586208f83
2 changed files with 47 additions and 16 deletions
@@ -37,12 +37,18 @@ Operator-confirmed environment (2026-05-17):
 ### 1. SSH key + alias (on the Mac)

 ```bash
-# Generate a dedicated keypair (separate from your daily-dev key)
+# Generate a dedicated keypair (separate from your daily-dev key).
+# This command produces BOTH halves in one go:
+#   ~/.ssh/id_ed25519_jetson_e2e       — private (keep secret, never share)
+#   ~/.ssh/id_ed25519_jetson_e2e.pub   — public (push to Jetson below)
 ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519_jetson_e2e \
    -C "jetson-e2e $(date +%Y-%m-%d)"

-# Push the public half to the Jetson (asks for the Jetson password once)
+# Push the public half to the Jetson (asks for the Jetson password once).
+# Add `-p <port>` if the Jetson's sshd listens on a non-default port:
 ssh-copy-id -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>
+# or with a custom port:
+# ssh-copy-id -p <port> -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>

 # Verify the Jetson's host key (run this ON the Jetson, via HDMI/serial,
 # not over the LAN you're about to trust):
@@ -50,11 +56,13 @@ ssh-copy-id -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>
 # Then compare against what the Mac sees on first connect. Accept only
 # if they match.

-# Wire up ~/.ssh/config (gitignored, never committed)
+# Wire up ~/.ssh/config (gitignored, never committed). Change the `Port`
+# value below if the Jetson's sshd listens on a non-default port.
 cat >> ~/.ssh/config <<'EOF'
 Host jetson-e2e
    HostName <jetson-ip>
    User <jetson-user>
+    Port 22
    IdentityFile ~/.ssh/id_ed25519_jetson_e2e
    IdentitiesOnly yes
    AddKeysToAgent yes
@@ -95,14 +103,24 @@ Then `sudo systemctl reload ssh`.

 ### 4. Verify the Jetson Docker + GPU pipeline

+`nvcr.io/nvidia/l4t-base` was deprecated in JetPack 6 — use
+`l4t-jetpack` (the official replacement) for the smoke test:
+
 ```bash
 ssh jetson-e2e 'docker run --rm --runtime=nvidia --gpus all \
-    nvcr.io/nvidia/l4t-base:r36.4.0 nvidia-smi'
+    nvcr.io/nvidia/l4t-jetpack:r36.4.0 nvidia-smi'
 ```

-Expected output: a `nvidia-smi`-style table listing the Orin GPU. If
+Expected output: an `nvidia-smi`-style table listing the Orin GPU. If
 this fails with "runtime not found" or "no GPU devices", install
-`nvidia-container-toolkit` and `sudo systemctl restart docker`.
+`nvidia-container-toolkit` and `sudo systemctl restart docker`. If it
+fails with `pull access denied`, run `docker login nvcr.io` once (NGC
+API key from developer.nvidia.com — most public images don't require
+auth, but the registry sometimes prompts).
+
+If `nvidia-smi` works on the host directly (it does: driver 540.5.0,
+CUDA 12.6, Orin detected) but the container can't see the GPU, the
+problem is almost always nvidia-container-toolkit, not the driver.

 ### 5. Confirm disk + swap

@@ -162,7 +180,8 @@ JETSON_SSH_ALIAS=other-host JETSON_REMOTE_DIR=~/somewhere/else \
 | `docker: Error response from daemon: could not select device driver "nvidia"` | nvidia-container-toolkit missing or daemon not restarted after install | `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker` |
 | `torch.cuda.is_available() == False` inside the container | `runtime: nvidia` block missing, or building on x86 host | Verify `docker-compose.test.jetson.yml` has `runtime: nvidia`; rebuild on the Jetson |
 | `replay.auto_sync.ac8_validation_failed` | AZ-614 (tlog time-base mismatch) — not a harness bug | Fix AZ-614 in `tests/e2e/replay/_tlog_synth.py` |
-| `pull access denied for nvcr.io/nvidia/l4t-pytorch` | NGC requires login for some tags | `docker login nvcr.io` (use NGC API key from developer.nvidia.com) |
+| `not found` / `tag not found` on `nvcr.io/nvidia/l4t-base:r36.*` | `l4t-base` was deprecated in JetPack 6 and has no r36 tags | Use `nvcr.io/nvidia/l4t-jetpack:r36.4.0` for smoke tests; the harness itself uses `dustynv/l4t-pytorch:r36.4.0` |
+| `pull access denied for nvcr.io/nvidia/...` | NGC requires login for some tags | `docker login nvcr.io` (use NGC API key from developer.nvidia.com) |

 ## Related Jira