gps-denied-onboard/_docs/03_implementation/jetson_harness_setup.md
Oleksandr Bezdieniezhnykh a7b3e60716
[autodev] Update Jetson test environment and satellite-provider integration
- Added `.env.test` to `.gitignore` to exclude test environment variables.
- Enhanced `docker-compose.test.jetson.yml` to include the real satellite-provider .NET service and its PostgreSQL database, replacing the mock service.
- Updated test execution policy to mandate all tests run exclusively on Jetson hardware, deprecating the previous two-tier model.
- Revised documentation in `_docs/LESSONS.md`, `_docs/02_document/tests/environment.md`, and `_docs/04_deploy/ci_cd_pipeline.md` to reflect the new testing strategy and environment setup.
- Improved `run-tests-jetson.sh` script to ensure proper environment variable handling and satellite-provider integration.

This commit aligns the testing framework with production environments, enhancing reliability and coverage.
2026-05-20 13:22:51 +03:00

Jetson e2e Harness — Operator Setup

AZ-615 / AZ-602 cycle-2. Documents the one-time operator-side setup that makes scripts/run-tests-jetson.sh work against a Jetson Orin Nano reachable from the developer Mac over SSH.

Why a separate Jetson harness exists

The Colima/Tier-1 smoke harness (docker-compose.test.yml + tests/e2e/Dockerfile) verifies wiring, env config, fixture loading, auto-sync, and JSONL schema — everything UP TO the GPU boundary. But all three C7 inference strategies (pytorch_fp16_runtime.py, tensorrt_runtime.py, onnx_trt_ep_runtime.py) are CUDA-only by design (model.half().cuda() on pytorch_fp16_runtime.py:189, no CPU fallback). The full Reality Gate — including C3 matcher + C7 inference — therefore needs a CUDA-capable host.

The Jetson harness runs the same test tree (tests/e2e/) on the Jetson with GPS_DENIED_TIER=2, which turns OFF the auto-skip for @pytest.mark.tier2 tests (see tests/conftest.py:31-44).

Hardware contract

Operator-confirmed environment (2026-05-17):

  • Jetson Orin Nano dev kit
  • JetPack 6.2.2+b24
  • L4T R36.5.0 (Jan 2026)
  • nvidia-container-toolkit 1.16.2
  • ≥ 30 GB free on /var/lib/docker (l4t-pytorch base image ~7 GB + build cache + fixture volumes)
  • Swap enabled (Orin Nano has 8 GB RAM; PyTorch + TensorRT loads spike)

One-time setup

1. SSH key + alias (on the Mac)

# Generate a dedicated keypair (separate from your daily-dev key).
# This command produces BOTH halves in one go:
#   ~/.ssh/id_ed25519_jetson_e2e       — private (keep secret, never share)
#   ~/.ssh/id_ed25519_jetson_e2e.pub   — public (push to Jetson below)
ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519_jetson_e2e \
    -C "jetson-e2e $(date +%Y-%m-%d)"

# Push the public half to the Jetson (asks for the Jetson password once).
# Add `-p <port>` if the Jetson's sshd listens on a non-default port:
ssh-copy-id -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>
# or with a custom port:
# ssh-copy-id -p <port> -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>

# Verify the Jetson's host key before answering "yes" to the first-connect
# prompt (the ssh-copy-id step above is normally that first connect).
# Run this ON the Jetson, via HDMI/serial, not over the LAN you're about
# to trust:
#   ssh-keygen -lf /etc/ssh/ssh_host_ed25519_key.pub
# Compare the fingerprint with the one the Mac shows, and accept only if
# they match.

# Wire up ~/.ssh/config (gitignored, never committed). Add `Port <port>`
# if the Jetson's sshd listens on a non-default port.
#
# IMPORTANT: the leading blank line inside the heredoc is intentional.
# Without it, the appended block can fuse onto the previous file line
# (`IdentitiesOnly yesHost jetson-e2e` was a real failure mode).
cat >> ~/.ssh/config <<'EOF'

Host jetson-e2e
    HostName <jetson-ip>
    User <jetson-user>
    Port 22
    IdentityFile ~/.ssh/id_ed25519_jetson_e2e
    IdentitiesOnly yes
    AddKeysToAgent yes
    UseKeychain yes
    StrictHostKeyChecking accept-new
    ServerAliveInterval 30
    ServerAliveCountMax 4
EOF

# Cache the passphrase into macOS Keychain (one-time)
ssh-add --apple-use-keychain ~/.ssh/id_ed25519_jetson_e2e
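
Once the alias and key are in place, it helps to confirm that the connection works without any prompt, since the harness drives ssh non-interactively. A minimal sketch follows; `check_jetson_alias` and the `SSH_BIN` override are hypothetical names used for illustration, not part of the repo.

```shell
# Sketch: confirm the alias connects without prompting.
# SSH_BIN is a hypothetical override so the check can be exercised offline.
check_jetson_alias() {
    local alias_name="${1:-jetson-e2e}"
    # BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$alias_name" true
}
```

Typical use is `check_jetson_alias && echo "alias OK"`. A non-zero exit usually means the agent has not unlocked the key or the public key is not yet in authorized_keys on the Jetson.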

2. Restrict the key (on the Jetson)

Edit ~/.ssh/authorized_keys on the Jetson and prefix the line that the ssh-copy-id step appended:

from="<mac-lan-ip>",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA…  jetson-e2e

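Optionally lock the key to "only run the e2e driver" by also adding command="docker compose -f /home/jetson/gps-denied-onboard/docker-compose.test.jetson.yml up --abort-on-container-exit" to that prefix (adjust /home/jetson to the home directory of <jetson-user>). The key then can't get a general shell; it can only invoke that one command. Note that sshd applies a forced command to every session opened with the key, including the rsync and docker compose build sessions that scripts/run-tests-jetson.sh opens, so use this lock only with a key that the harness does not sync with.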
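
The restricted entry can also be generated rather than edited by hand. The sketch below assumes a hypothetical helper name, `restrict_key_line`, and simply prepends the options shown above to a public key line.

```shell
# Sketch: build the restricted authorized_keys entry from a public key line.
# restrict_key_line is a hypothetical helper, not part of the repo.
restrict_key_line() {
    local mac_ip="$1"       # LAN address of the developer Mac
    local pubkey_line="$2"  # e.g. the contents of ~/.ssh/id_ed25519_jetson_e2e.pub
    printf 'from="%s",no-port-forwarding,no-X11-forwarding,no-agent-forwarding %s\n' \
        "$mac_ip" "$pubkey_line"
}
```

For example, `restrict_key_line 192.0.2.10 "$(cat ~/.ssh/id_ed25519_jetson_e2e.pub)"` prints a line that can replace the unrestricted one; 192.0.2.10 is a placeholder address.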

3. Harden sshd (on the Jetson)

On the Jetson, create /etc/ssh/sshd_config.d/10-e2e.conf:

PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes

Then sudo systemctl reload ssh.
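
To confirm the drop-in actually took effect, the effective configuration can be inspected with `sshd -T`, which prints one lowercase `keyword value` pair per line. The sketch below assumes a hypothetical helper name, `sshd_is_hardened`, and reads that dump on stdin.

```shell
# Sketch: reads `sshd -T` output on stdin and succeeds only if all three
# settings from 10-e2e.conf are effective. sshd_is_hardened is a hypothetical name.
sshd_is_hardened() {
    local dump want
    dump="$(cat)"
    for want in 'passwordauthentication no' 'permitrootlogin no' 'pubkeyauthentication yes'; do
        # -x matches the whole line, -i ignores case, -q suppresses output.
        printf '%s\n' "$dump" | grep -qix "$want" || { echo "missing: $want" >&2; return 1; }
    done
}
```

On the Jetson this would be run as `sudo sshd -T | sshd_is_hardened`. A failure suggests that another file in sshd_config.d or the main sshd_config overrides the drop-in.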

4. Verify the Jetson Docker + GPU pipeline

nvidia-container-runtime mounts nvidia-smi + CUDA libs from the host into the container at runtime, so a tiny base image works for the smoke test (no need to pull the 5 GB l4t-jetpack image just to check GPU exposure):

ssh jetson-e2e 'docker run --rm --runtime=nvidia --gpus all \
    ubuntu:22.04 nvidia-smi'

Expected output: an nvidia-smi-style table listing the Orin GPU. If this fails with could not select device driver "nvidia" or with no GPU devices, reinstall nvidia-container-toolkit, then run sudo systemctl restart docker.

If nvidia-smi works on the host directly but not inside a container, the problem is almost always nvidia-container-toolkit, not the driver.

5. Confirm disk + swap

ssh jetson-e2e 'df -h /var/lib/docker && swapon --show && free -h'

Need ≥ 30 GB free on /var/lib/docker. Swap should be at least 4 GB (JetPack default is 4 GB zram).
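
The disk threshold can be checked mechanically instead of by eye. This is a sketch under stated assumptions: `has_free_gb` is a hypothetical helper, it expects POSIX-format `df -Pk` output on stdin, and it treats 1 GB as 1024 × 1024 KiB.

```shell
# Sketch: reads `df -Pk <path>` output on stdin and succeeds if the
# available space is at least min_gb. has_free_gb is a hypothetical name.
has_free_gb() {
    local min_gb="$1" avail_kb
    # With -P, line 2 holds the filesystem row and column 4 is "Available" in KiB.
    avail_kb="$(awk 'NR==2 {print $4}')"
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

Usage from the Mac: `ssh jetson-e2e 'df -Pk /var/lib/docker' | has_free_gb 30 || echo "free up disk first"`.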

Running the harness

Pre-flight (one-time, then on JWT secret rotation)

AZ-688 added the real ../satellite-provider .NET service to the Jetson compose graph. Two extra setup steps before the first run:

# 1. Sibling repo must be checked out alongside gps-denied-onboard/.
#    The harness rsyncs both repos to the Jetson; the relative `../satellite-provider`
#    path in docker-compose.test.jetson.yml resolves identically on Mac and Jetson.
ls ../satellite-provider/SatelliteProvider.sln    # sanity check

# 2. Copy the env template and fill in the dev JWT secret. .env.test is
#    gitignored; the script refuses to start if it's missing or if any
#    of JWT_SECRET / JWT_ISSUER / JWT_AUDIENCE are unset.
cp .env.test.example .env.test
# Generate a fresh dev secret (≥32 bytes for HMAC-SHA256):
openssl rand -hex 32
# Paste into JWT_SECRET=… in .env.test. The same secret is later used by
# AZ-690 (dev JWT minting helper) to sign tokens that this same provider
# validates. Issuer/audience defaults are pre-filled.
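
The fail-fast behaviour described in the comments above can be approximated as follows. This is only a sketch of the idea, not the code in run-tests-jetson.sh; `validate_env_file` is a hypothetical name, and the length check counts characters, which equals bytes for the ASCII hex secret produced by openssl.

```shell
# Sketch: verify that an env file exists and defines usable JWT settings.
# validate_env_file is a hypothetical helper; the real script may differ.
validate_env_file() {
    local env_file="$1" name value
    [ -f "$env_file" ] || { echo "missing $env_file" >&2; return 1; }
    # Subshell: sourced variables do not leak into the caller's environment.
    (
        . "$env_file"
        for name in JWT_SECRET JWT_ISSUER JWT_AUDIENCE; do
            eval "value=\${$name:-}"
            [ -n "$value" ] || { echo "$name is unset" >&2; exit 1; }
        done
        [ "${#JWT_SECRET}" -ge 32 ] || { echo "JWT_SECRET shorter than 32 bytes" >&2; exit 1; }
    )
}
```

Running `validate_env_file ./.env.test` before the first harness run gives the same kind of early failure without touching the Jetson.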

The dev TLS cert (../satellite-provider/certs/{api.pfx,api.crt,api.key}) is regenerated on demand by scripts/ensure-dev-cert.sh, which run-tests-jetson.sh calls automatically. The cert is self-signed, gitignored in both repos, and pinned to the SANs api, satellite-provider, localhost, and 127.0.0.1; see the script for the openssl recipe.
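
For orientation, a self-signed cert with those SANs could be produced roughly as below. This is an illustrative sketch, not the recipe in scripts/ensure-dev-cert.sh: `build_san` and `make_dev_cert` are hypothetical names, the RSA key type and 365-day lifetime are assumptions, and `-addext` needs OpenSSL 1.1.1 or newer.

```shell
# Sketch: build an openssl subjectAltName value from hostnames and IPv4 addresses.
build_san() {
    local entry out=""
    for entry in "$@"; do
        case "$entry" in
            *[!0-9.]*) out="${out:+$out,}DNS:$entry" ;;  # contains a non-digit, non-dot: hostname
            *)         out="${out:+$out,}IP:$entry" ;;   # digits and dots only: IPv4 literal
        esac
    done
    printf '%s\n' "$out"
}

# Sketch: write api.key and api.crt into the current directory (assumed parameters).
make_dev_cert() {
    openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
        -keyout api.key -out api.crt -subj "/CN=api" \
        -addext "subjectAltName=$(build_san "$@")"
}
```

Calling `make_dev_cert api satellite-provider localhost 127.0.0.1` would cover the four names listed above; the .pfx bundle is left out of this sketch.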

Run

From the developer Mac, repo root:

bash scripts/run-tests-jetson.sh

What happens:

  1. Load .env.test (fail-fast if missing / JWT vars unset / JWT_SECRET < 32 bytes).
  2. scripts/ensure-dev-cert.sh on the Mac — idempotent dev TLS cert generation into ../satellite-provider/certs/.
  3. rsync source → jetson-e2e:~/gps-denied-onboard/ (excludes .git, __pycache__, build artefacts; LFS pointers transfer as text).
  4. rsync ../satellite-provider/ → jetson-e2e:~/satellite-provider/ (sibling of gps-denied-onboard/ so the compose path resolves).
  5. ssh jetson-e2e docker compose ... build e2e-runner satellite-provider (env vars exported through the heredoc so the upstream compose's ${JWT_SECRET} interpolation resolves on the Jetson side).
  6. ssh jetson-e2e docker compose ... up --abort-on-container-exit --exit-code-from e2e-runner.
  7. stdout / stderr stream to the Mac terminal; exit code propagates.
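
The sync-then-run part of that sequence (steps 3 to 6) can be sketched as follows. This is a simplified illustration, not the actual script: the `run` wrapper and `DRY_RUN` switch are hypothetical, the exclude list is shortened, remote paths are written relative to the remote home directory, and the JWT environment export from step 5 is omitted.

```shell
# Sketch of steps 3-6. With DRY_RUN=1 each command is printed instead of executed.
JETSON_SSH_ALIAS="${JETSON_SSH_ALIAS:-jetson-e2e}"
COMPOSE_FILE="docker-compose.test.jetson.yml"

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

sync_and_run() {
    # Steps 3 and 4: push both repos so ../satellite-provider resolves on the Jetson.
    run rsync -az --exclude .git --exclude __pycache__ \
        ./ "$JETSON_SSH_ALIAS:gps-denied-onboard/" &&
    run rsync -az --exclude .git \
        ../satellite-provider/ "$JETSON_SSH_ALIAS:satellite-provider/" &&
    # Steps 5 and 6: build, then run; ssh returns the e2e-runner exit code.
    run ssh "$JETSON_SSH_ALIAS" \
        "cd gps-denied-onboard && docker compose -f $COMPOSE_FILE build e2e-runner satellite-provider" &&
    run ssh "$JETSON_SSH_ALIAS" \
        "cd gps-denied-onboard && docker compose -f $COMPOSE_FILE up --abort-on-container-exit --exit-code-from e2e-runner"
}
```

With `DRY_RUN=1` set, `sync_and_run` only prints the four commands, which is a cheap way to review what would be sent to the Jetson.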

Override the alias or remote dir if your setup differs:

JETSON_SSH_ALIAS=other-host JETSON_REMOTE_DIR=~/somewhere/else \
    bash scripts/run-tests-jetson.sh

JETSON_REMOTE_DIR MUST be a path whose parent directory is writable, because the harness places satellite-provider/ next to it. With the default ~/gps-denied-onboard, satellite-provider lands at ~/satellite-provider/ on the Jetson.
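
The sibling rule can be made concrete with a one-line path computation. The helper name `sibling_provider_dir` is hypothetical, and the sketch assumes the harness derives the location from the parent of JETSON_REMOTE_DIR, as the paragraph above describes.

```shell
# Sketch: where satellite-provider/ lands for a given JETSON_REMOTE_DIR.
# sibling_provider_dir is a hypothetical helper, not part of the repo.
sibling_provider_dir() {
    local remote_dir="${1%/}"   # drop one trailing slash, if any
    printf '%s/satellite-provider\n' "$(dirname "$remote_dir")"
}
```

For the override shown earlier, `sibling_provider_dir '~/somewhere/else'` yields `~/somewhere/satellite-provider`, so `~/somewhere` is the directory that must be writable.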

Smoke vs. Reality Gate split — at a glance

| Test category | Marker | Colima (Tier-1) | Jetson (Tier-2) |
|---|---|---|---|
| AC-4a AST scan | (none) | runs | runs |
| AC-4b byte-equality | (none) | runs | runs |
| AC-7 skip-gate self-check | (none) | runs | runs |
| AC-9 helper unit tests | (none) | runs | runs |
| AC-1 / AC-2 / AC-3 / AC-5 / AC-6 (heavy) | tier2 | SKIPPED | runs |
| AC-8 operator workflow | skip (AZ-616 blocks) | skipped | skipped |

GPS_DENIED_TIER env var controls the auto-skip:

  • GPS_DENIED_TIER=1 (Colima default) → tier2 / gpu / docker marked tests auto-skipped via tests/conftest.py:31-44.
  • GPS_DENIED_TIER=2 (Jetson default) → all markers active; everything runs (subject to other skip gates like RUN_REPLAY_E2E).
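
The rule in these two bullets can be written as a small predicate. This is purely illustrative: the real logic lives in tests/conftest.py, `marker_is_skipped` is a hypothetical name, and treating any value other than 2 as Tier-1 is an assumption.

```shell
# Illustrative predicate only; the real auto-skip is implemented in tests/conftest.py.
# Succeeds (exit 0) when a test carrying the given marker would be auto-skipped.
marker_is_skipped() {
    local tier="${1:-1}" marker="$2"
    [ "$tier" = "2" ] && return 1      # Tier-2: nothing is auto-skipped
    case "$marker" in
        tier2|gpu|docker) return 0 ;;  # heavy markers are skipped on Tier-1
        *)                return 1 ;;  # unmarked tests always run
    esac
}
```

For example, `marker_is_skipped "${GPS_DENIED_TIER:-1}" tier2 && echo "would skip"` reports the outcome for the current environment. Other skip gates such as RUN_REPLAY_E2E are outside this sketch.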

Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| cannot reach 'ssh jetson-e2e' non-interactively | Agent isn't unlocked or key not in authorized_keys | ssh-add -l on Mac; check ~/.ssh/authorized_keys on Jetson |
| docker: Error response from daemon: could not select device driver "nvidia" | nvidia-container-toolkit missing or daemon not restarted after install | sudo apt install nvidia-container-toolkit && sudo systemctl restart docker |
| torch.cuda.is_available() == False inside the container | runtime: nvidia block missing, or building on x86 host | Verify docker-compose.test.jetson.yml has runtime: nvidia; rebuild on the Jetson |
| replay.auto_sync.ac8_validation_failed | AZ-614 (tlog time-base mismatch); not a harness bug | Fix AZ-614 in tests/e2e/replay/_tlog_synth.py |
| not found / tag not found on nvcr.io/nvidia/l4t-base:r36.* | l4t-base was deprecated in JetPack 6 | Use l4t-jetpack:r36.4.0 for smoke tests; the harness itself uses dustynv/l4t-pytorch:r36.4.0 |
| pull access denied for nvcr.io/nvidia/... | NGC requires login for some tags | docker login nvcr.io (use NGC API key from developer.nvidia.com) |

Related tickets

  • AZ-615 — this harness (Jetson runner story)
  • AZ-616 — umbrella: replace mock-sat with real ../satellite-provider service
    • AZ-688 — Compose-include real satellite-provider + Postgres (this doc)
    • AZ-689 — Seed Derkachi-bbox fixture tile set for hermetic e2e
    • AZ-690 — Long-lived dev JWT minting helper
    • AZ-691 — Python SatelliteProviderClient
    • AZ-692 — Wire client into composition root; retire mock-sat
    • AZ-693 — Docs: client contract + test env + containerization
    • AZ-694 — AC-8 unskip + diagnose (sibling Story, not a subtask)
  • AZ-617 — mark heavy ACs with tier2 (already applied; this story documents and verifies the auto-skip)
  • AZ-614 — tlog time-base mismatch (currently blocks the heavy ACs from reaching the GPU stage)
  • AZ-602 — parent Epic: E2E Tier-1 harness rehabilitation