mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-21 19:51:12 +00:00
9c13ab3bd0
C7 inference (PytorchFp16Runtime / TensorRTRuntime / OnnxTrtEpRuntime)
is CUDA-only by design: `model.half().cuda()` is hard-wired with no
CPU fallback. The Colima/Tier-1 smoke harness can never exercise C3
matcher or C7 inference. Once AZ-614 fixes the tlog time-base mismatch
and the pipeline reaches those stages, Colima runs would hard-fail at
`.cuda()` instead of cleanly skipping.

This commit lays down the Jetson companion harness and wires the
existing `tier2` auto-skip:

* tests/e2e/Dockerfile.jetson: l4t-pytorch:r36.4.0-pth2.3-py3 base,
  same /opt layout as the Colima image so AC-4 AST scan + bind mounts
  work identically. Built ON the Jetson via run-tests-jetson.sh.
* docker-compose.test.jetson.yml: mirrors docker-compose.test.yml
  but with `runtime: nvidia`, GPU device exposure, and
  GPS_DENIED_TIER=2 (turns OFF the tier2 auto-skip).
* scripts/run-tests-jetson.sh: rsync → ssh build → ssh up, with
  `--exit-code-from e2e-runner` so the local exit code reflects the
  remote test verdict. No credentials in the repo; uses the
  `ssh jetson-e2e` alias resolved via ~/.ssh/config.
* _docs/03_implementation/jetson_harness_setup.md: one-time SSH
  key + alias + sshd hardening + GPU verification steps. Documents
  the smoke vs. Reality Gate split + the GPS_DENIED_TIER switch.

AZ-617 (mark heavy ACs with tier2): adds @pytest.mark.tier2 to AC-1,
AC-2, AC-3, AC-5, AC-6 in tests/e2e/replay/test_derkachi_1min.py.
Reuses the existing tier2 marker + auto-skip in tests/conftest.py
(scope revision documented as a comment on AZ-617). AC-4a/4b/AC-7/AC-9
stay unmarked because they don't touch CUDA.

Defers to follow-up Jira:

* AZ-614: Derkachi tlog synth time-base mismatch (unblocks tier2 ACs
  actually reaching the GPU stage on the Jetson)
* AZ-616: replace mock-sat with real ../satellite-provider service

Not run yet: the harness needs operator-side SSH setup to come online
before scripts/run-tests-jetson.sh can be executed end-to-end. Setup
steps documented in jetson_harness_setup.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
176 lines
6.6 KiB
Markdown
# Jetson e2e Harness — Operator Setup

AZ-615 / AZ-602 cycle-2. Documents the one-time operator-side setup
that makes `scripts/run-tests-jetson.sh` work against a Jetson Orin Nano
reachable from the developer Mac over SSH.

## Why a separate Jetson harness exists

The Colima/Tier-1 smoke harness (`docker-compose.test.yml` +
`tests/e2e/Dockerfile`) verifies wiring, env config, fixture loading,
auto-sync, and JSONL schema: everything UP TO the GPU boundary. But
all three C7 inference strategies
(`pytorch_fp16_runtime.py`, `tensorrt_runtime.py`,
`onnx_trt_ep_runtime.py`) are CUDA-only by design (`model.half().cuda()`
on `pytorch_fp16_runtime.py:189`, no CPU fallback). The full Reality
Gate (including C3 matcher + C7 inference) therefore needs a
CUDA-capable host.

The Jetson harness runs the same test tree (`tests/e2e/`) on the Jetson
with `GPS_DENIED_TIER=2`, which turns OFF the auto-skip for
`@pytest.mark.tier2` tests (see `tests/conftest.py:31-44`).

## Hardware contract

Operator-confirmed environment (2026-05-17):

* Jetson Orin Nano dev kit
* JetPack 6.2.2+b24
* L4T R36.5.0 (Jan 2026)
* nvidia-container-toolkit 1.16.2
* ≥ 30 GB free on `/var/lib/docker` (l4t-pytorch base image ~7 GB +
  build cache + fixture volumes)
* Swap enabled (Orin Nano has 8 GB RAM; PyTorch + TensorRT loads spike)

## One-time setup

### 1. SSH key + alias (on the Mac)

```bash
# Generate a dedicated keypair (separate from your daily-dev key)
ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519_jetson_e2e \
    -C "jetson-e2e $(date +%Y-%m-%d)"

# Before the first connect, read the Jetson's host key fingerprint ON the
# Jetson (via HDMI/serial, not over the LAN you're about to trust):
#   ssh-keygen -lf /etc/ssh/ssh_host_ed25519_key.pub

# Push the public half to the Jetson (asks for the Jetson password once).
# This is the first connect: compare the fingerprint the Mac shows against
# the one read above, and accept only if they match.
ssh-copy-id -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>

# Wire up ~/.ssh/config (lives outside the repo, never committed)
cat >> ~/.ssh/config <<'EOF'
Host jetson-e2e
    HostName <jetson-ip>
    User <jetson-user>
    IdentityFile ~/.ssh/id_ed25519_jetson_e2e
    IdentitiesOnly yes
    AddKeysToAgent yes
    UseKeychain yes
    StrictHostKeyChecking yes
    ServerAliveInterval 30
    ServerAliveCountMax 4
EOF

# Cache the passphrase into macOS Keychain (one-time)
ssh-add --apple-use-keychain ~/.ssh/id_ed25519_jetson_e2e
```

### 2. Restrict the key's scope on the Jetson (recommended)
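
Edit `~/.ssh/authorized_keys` on the Jetson and prefix the line that the
`ssh-copy-id` step appended:

```
from="<mac-lan-ip>",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA… jetson-e2e
```

Optionally lock the key to "only run the e2e driver" by adding
`command="docker compose -f /home/jetson/gps-denied-onboard/docker-compose.test.jetson.yml up --abort-on-container-exit"`
to the same option list (adjust `/home/jetson` to the home directory of
`<jetson-user>`). The key then can't get a general shell, only invoke
that one command. Be aware that sshd runs a forced command in place of
whatever the client requests, so with this option set the `rsync` and
`docker compose ... build` steps of `scripts/run-tests-jetson.sh` cannot
run over the same key.
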
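The prefixing step above is easy to get wrong by hand (a stray space inside the option list makes sshd reject the entry). A minimal sketch of a helper that builds the restricted line follows. `restrict_key_line` and `RESTRICTIONS` are hypothetical names, not part of the repo, and the helper only handles key types whose name starts with `ssh-`.

```python
# Options taken from the authorized_keys example above.
RESTRICTIONS = ("no-port-forwarding", "no-X11-forwarding", "no-agent-forwarding")


def restrict_key_line(key_line, mac_lan_ip):
    """Prefix one authorized_keys entry with a from= filter and the restrictions.

    Hypothetical helper: refuses lines that do not start with 'ssh-', which
    also prevents prefixing an entry that already carries options.
    """
    key_line = key_line.strip()
    if not key_line.startswith("ssh-"):
        raise ValueError("expected a bare 'ssh-<type> <base64> <comment>' entry")
    # Options are comma-separated with no spaces; one space precedes the key.
    options = ",".join((f'from="{mac_lan_ip}"',) + RESTRICTIONS)
    return f"{options} {key_line}"
```

Calling it with the line that `ssh-copy-id` appended and the Mac's LAN address returns the single line to write back into `~/.ssh/authorized_keys`.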
### 3. Harden sshd (optional, recommended for an exposed test rig)

On the Jetson, create `/etc/ssh/sshd_config.d/10-e2e.conf`:

```
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
```

Then `sudo systemctl reload ssh`.

### 4. Verify the Jetson Docker + GPU pipeline

```bash
ssh jetson-e2e 'docker run --rm --runtime=nvidia --gpus all \
    nvcr.io/nvidia/l4t-base:r36.4.0 nvidia-smi'
```

Expected output: an `nvidia-smi`-style table listing the Orin GPU. If
this fails with "runtime not found" or "no GPU devices", install
`nvidia-container-toolkit` and run `sudo systemctl restart docker`.

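`nvidia-smi` only proves that the container runtime exposes the GPU. The failure the Troubleshooting table describes happens one level up, in PyTorch. A minimal sketch of a probe that could be run inside the `e2e-runner` container is shown here. `cuda_status` is a hypothetical helper, not something the repo is known to ship.

```python
def cuda_status():
    """Return (ok, detail) describing whether PyTorch can see a CUDA device.

    Hypothetical helper for manual checks; it never raises when torch is
    missing, so it also runs on the Colima/Tier-1 image.
    """
    try:
        import torch  # imported lazily: Tier-1 images may not ship torch
    except ImportError:
        return False, "torch is not installed"
    if not torch.cuda.is_available():
        return False, "torch imported, but no CUDA device is visible"
    # Device 0 is the integrated GPU on a single-GPU board.
    return True, torch.cuda.get_device_name(0)


if __name__ == "__main__":
    ok, detail = cuda_status()
    print(("OK: " if ok else "FAIL: ") + detail)
```

A `FAIL` here while `nvidia-smi` works would point at the compose file or the image rather than at the host setup.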
### 5. Confirm disk + swap

```bash
ssh jetson-e2e 'df -h /var/lib/docker && swapon --show && free -h'
```

Need ≥ 30 GB free on `/var/lib/docker`. Swap should be at least 4 GB
(JetPack default is 4 GB zram).

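Reading the two thresholds off human-readable output is error-prone, so the check can also be made mechanical. The sketch below assumes the byte-exact variants `df -B1 --output=avail /var/lib/docker` and `swapon --show=SIZE --bytes --noheadings` are run on the Jetson and their output is fed in as text. The function names are hypothetical, and treating "GB" as GiB is an assumption; loosen the limits if the rig reports slightly less than the nominal figure.

```python
GIB = 1024 ** 3


def parse_df_avail(df_output):
    """Parse `df -B1 --output=avail <path>`: a header line, then available bytes."""
    lines = [ln.strip() for ln in df_output.splitlines() if ln.strip()]
    return int(lines[-1])


def parse_swap_total(swapon_output):
    """Sum `swapon --show=SIZE --bytes --noheadings`: one byte count per device."""
    return sum(int(size) for size in swapon_output.split())


def check_capacity(df_output, swapon_output, min_disk_gib=30, min_swap_gib=4):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    disk = parse_df_avail(df_output) / GIB
    swap = parse_swap_total(swapon_output) / GIB
    if disk < min_disk_gib:
        problems.append(f"disk: {disk:.1f} GiB free, need {min_disk_gib}")
    if swap < min_swap_gib:
        problems.append(f"swap: {swap:.1f} GiB, need {min_swap_gib}")
    return problems
```

With no swap configured, `swapon` prints nothing and the swap total comes out as 0, so the swap check fails rather than crashing.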
## Running the harness

From the developer Mac, repo root:

```bash
bash scripts/run-tests-jetson.sh
```

What happens:

1. `rsync` source → `jetson-e2e:~/gps-denied-onboard/` (excludes `.git`,
   `__pycache__`, build artefacts; LFS pointers transfer as text).
2. `ssh jetson-e2e docker compose -f docker-compose.test.jetson.yml build e2e-runner`
3. `ssh jetson-e2e docker compose ... up --abort-on-container-exit --exit-code-from e2e-runner`
4. stdout / stderr stream to the Mac terminal; the exit code of
   `e2e-runner` propagates to the local shell.

Override the alias or remote dir if your setup differs:

```bash
JETSON_SSH_ALIAS=other-host JETSON_REMOTE_DIR=~/somewhere/else \
    bash scripts/run-tests-jetson.sh
```

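The sequence and the two overrides above can be made explicit in a few lines. The real driver is a bash script whose contents are not reproduced here; the Python rendering below is only a sketch of the same flow. The rsync flags, the `cd <dir> &&` prefix, and the names `build_plan` and `run` are assumptions for illustration.

```python
import os
import subprocess

COMPOSE_FILE = "docker-compose.test.jetson.yml"


def build_plan(environ=os.environ):
    """Return the driver's three commands as argv lists, in execution order."""
    alias = environ.get("JETSON_SSH_ALIAS", "jetson-e2e")
    remote_dir = environ.get("JETSON_REMOTE_DIR", "~/gps-denied-onboard")
    compose = f"cd {remote_dir} && docker compose -f {COMPOSE_FILE}"
    return [
        # 1. Sync the working tree (flags are illustrative, not the script's).
        ["rsync", "-az", "--delete", "--exclude", ".git",
         "--exclude", "__pycache__", "./", f"{alias}:{remote_dir}/"],
        # 2. Build the runner image on the Jetson itself.
        ["ssh", alias, f"{compose} build e2e-runner"],
        # 3. Run; compose exits with e2e-runner's status, and ssh relays it.
        ["ssh", alias,
         f"{compose} up --abort-on-container-exit --exit-code-from e2e-runner"],
    ]


def run(plan, runner=subprocess.call):
    """Run each step; stop at the first non-zero exit code and return it."""
    for argv in plan:
        code = runner(argv)
        if code != 0:
            return code
    return 0
```

Calling `raise SystemExit(run(build_plan()))` would give the local process the remote verdict, which is the property step 4 relies on. The `runner` parameter exists so the plan can be inspected or dry-run without touching the network.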
## Smoke vs. Reality Gate split — at a glance

| Test category | Marker | Colima (Tier-1) | Jetson (Tier-2) |
|---------------|--------|-----------------|-----------------|
| AC-4a AST scan | (none) | runs | runs |
| AC-4b byte-equality | (none) | runs | runs |
| AC-7 skip-gate self-check | (none) | runs | runs |
| AC-9 helper unit tests | (none) | runs | runs |
| AC-1 / AC-2 / AC-3 / AC-5 / AC-6 (heavy) | `tier2` | **SKIPPED** | runs |
| AC-8 operator workflow | `skip` (AZ-616 blocks) | skipped | skipped |

The `GPS_DENIED_TIER` env var controls the auto-skip:

* `GPS_DENIED_TIER=1` (Colima default) → tests marked `tier2` / `gpu` /
  `docker` are auto-skipped via `tests/conftest.py:31-44`.
* `GPS_DENIED_TIER=2` (Jetson default) → all markers active; everything
  runs (subject to other skip gates like `RUN_REPLAY_E2E`).

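For readers who have not opened `tests/conftest.py`, the gating described above can be pictured roughly as follows. This is a sketch of the general pytest technique, not a copy of lines 31-44. The helper names, and the fallback to tier 1 when the variable is unset, are assumptions.

```python
import os

# Markers that need hardware or services the Tier-1 host lacks.
# Mirrors the markers named above; the real conftest may differ.
TIER2_ONLY_MARKERS = frozenset({"tier2", "gpu", "docker"})


def current_tier(environ=os.environ):
    """Read GPS_DENIED_TIER; falling back to 1 when unset is an assumption."""
    return int(environ.get("GPS_DENIED_TIER", "1"))


def should_skip(marker_names, tier):
    """True when a test carries a Tier-2-only marker but runs below tier 2."""
    return tier < 2 and bool(TIER2_ONLY_MARKERS & set(marker_names))


def pytest_collection_modifyitems(config, items):
    """pytest hook: attach a skip marker to gated tests at collection time."""
    import pytest  # imported here so the helpers above work without pytest

    tier = current_tier()
    skip = pytest.mark.skip(reason=f"needs GPS_DENIED_TIER>=2 (got {tier})")
    for item in items:
        if should_skip((m.name for m in item.iter_markers()), tier):
            item.add_marker(skip)
```

Skipping at collection time means the heavy tests show up as skipped in the Colima report instead of failing at `.cuda()`, which matches the **SKIPPED** column in the table.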
## Troubleshooting

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `cannot reach 'ssh jetson-e2e' non-interactively` | Agent isn't unlocked or key not in `authorized_keys` | `ssh-add -l` on Mac; check `~/.ssh/authorized_keys` on Jetson |
| `docker: Error response from daemon: could not select device driver "nvidia"` | nvidia-container-toolkit missing or daemon not restarted after install | `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker` |
| `torch.cuda.is_available() == False` inside the container | `runtime: nvidia` block missing, or building on x86 host | Verify `docker-compose.test.jetson.yml` has `runtime: nvidia`; rebuild on the Jetson |
| `replay.auto_sync.ac8_validation_failed` | AZ-614 (tlog time-base mismatch); not a harness bug | Fix AZ-614 in `tests/e2e/replay/_tlog_synth.py` |
| `pull access denied for nvcr.io/nvidia/l4t-pytorch` | NGC requires login for some tags | `docker login nvcr.io` (use NGC API key from developer.nvidia.com) |

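The first row can be reproduced in isolation before running the whole driver. How the script itself performs its reachability check is not shown in this document, so the sketch below is an assumption: it uses the standard OpenSSH options `BatchMode=yes` (never prompt) and `ConnectTimeout` (fail fast), and the function names are hypothetical.

```python
import subprocess


def probe_command(alias="jetson-e2e", timeout_s=5):
    """argv for a login attempt that fails fast instead of prompting."""
    return ["ssh", "-o", "BatchMode=yes",
            "-o", f"ConnectTimeout={timeout_s}", alias, "true"]


def reachable(alias="jetson-e2e", runner=subprocess.call):
    """True when the probe exits 0, i.e. key auth works without any prompt."""
    return runner(probe_command(alias)) == 0
```

If `reachable()` returns `False` while an interactive `ssh jetson-e2e` succeeds, the usual suspect is a locked agent, which is what `ssh-add -l` in the table checks.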
## Related Jira

* AZ-615: this harness (Jetson runner story)
* AZ-616: replace `mock-sat` with real `../satellite-provider` service
* AZ-617: mark heavy ACs with `tier2` (already applied; this story
  documents and verifies the auto-skip)
* AZ-614: tlog time-base mismatch (currently blocks the heavy ACs
  from reaching the GPU stage)
* AZ-602: parent Epic: E2E Tier-1 harness rehabilitation