Oleksandr Bezdieniezhnykh a7b3e60716
[autodev] Update Jetson test environment and satellite-provider integration
- Added `.env.test` to `.gitignore` to exclude test environment variables.
- Enhanced `docker-compose.test.jetson.yml` to include the real satellite-provider .NET service and its PostgreSQL database, replacing the mock service.
- Updated test execution policy to mandate all tests run exclusively on Jetson hardware, deprecating the previous two-tier model.
- Revised documentation in `_docs/LESSONS.md`, `_docs/02_document/tests/environment.md`, and `_docs/04_deploy/ci_cd_pipeline.md` to reflect the new testing strategy and environment setup.
- Improved `run-tests-jetson.sh` script to ensure proper environment variable handling and satellite-provider integration.

This commit aligns the testing framework with production environments, enhancing reliability and coverage.
2026-05-20 13:22:51 +03:00


# Jetson e2e Harness — Operator Setup
AZ-615 / AZ-602 cycle-2. Documents the one-time operator-side setup
that makes `scripts/run-tests-jetson.sh` work against a Jetson Orin Nano
reachable from the developer Mac over SSH.
## Why a separate Jetson harness exists
The Colima/Tier-1 smoke harness (`docker-compose.test.yml` +
`tests/e2e/Dockerfile`) verifies wiring, env config, fixture loading,
auto-sync, and JSONL schema — everything UP TO the GPU boundary. But
all three C7 inference strategies
(`pytorch_fp16_runtime.py`, `tensorrt_runtime.py`,
`onnx_trt_ep_runtime.py`) are CUDA-only by design (`model.half().cuda()`
on `pytorch_fp16_runtime.py:189`, no CPU fallback). The full Reality
Gate — including C3 matcher + C7 inference — therefore needs a
CUDA-capable host.
The Jetson harness runs the same test tree (`tests/e2e/`) on the Jetson
with `GPS_DENIED_TIER=2`, which turns OFF the auto-skip for
`@pytest.mark.tier2` tests (see `tests/conftest.py:31-44`).
## Hardware contract
Operator-confirmed environment (2026-05-17):
* Jetson Orin Nano dev kit
* JetPack 6.2.2+b24
* L4T R36.5.0 (Jan 2026)
* nvidia-container-toolkit 1.16.2
* ≥ 30 GB free on `/var/lib/docker` (l4t-pytorch base image ~7 GB +
build cache + fixture volumes)
* Swap enabled (Orin Nano has 8 GB RAM; PyTorch + TensorRT loads spike)
## One-time setup
### 1. SSH key + alias (on the Mac)
```bash
# Generate a dedicated keypair (separate from your daily-dev key).
# This command produces BOTH halves in one go:
# ~/.ssh/id_ed25519_jetson_e2e — private (keep secret, never share)
# ~/.ssh/id_ed25519_jetson_e2e.pub — public (push to Jetson below)
ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519_jetson_e2e \
-C "jetson-e2e $(date +%Y-%m-%d)"
# Push the public half to the Jetson (asks for the Jetson password once).
# Add `-p <port>` if the Jetson's sshd listens on a non-default port:
ssh-copy-id -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>
# or with a custom port:
# ssh-copy-id -p <port> -i ~/.ssh/id_ed25519_jetson_e2e.pub <jetson-user>@<jetson-ip>
# Verify the Jetson's host key (run this ON the Jetson, via HDMI/serial,
# not over the LAN you're about to trust):
# ssh-keygen -lf /etc/ssh/ssh_host_ed25519_key.pub
# Then compare against what the Mac sees on first connect. Accept only
# if they match.
# Wire up ~/.ssh/config (gitignored, never committed). Add `Port <port>`
# if the Jetson's sshd listens on a non-default port.
#
# IMPORTANT: the leading blank line inside the heredoc is intentional.
# Without it, the appended block can fuse onto the previous file line
# (`IdentitiesOnly yesHost jetson-e2e` was a real failure mode).
cat >> ~/.ssh/config <<'EOF'

Host jetson-e2e
    HostName <jetson-ip>
    User <jetson-user>
    Port 22
    IdentityFile ~/.ssh/id_ed25519_jetson_e2e
    IdentitiesOnly yes
    AddKeysToAgent yes
    UseKeychain yes
    StrictHostKeyChecking accept-new
    ServerAliveInterval 30
    ServerAliveCountMax 4
EOF
# Cache the passphrase into macOS Keychain (one-time)
ssh-add --apple-use-keychain ~/.ssh/id_ed25519_jetson_e2e
```
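
The fused-line failure described in the comments above occurs when `~/.ssh/config` does not end in a newline. Instead of relying on the blank line alone, a small helper could check the last byte before appending. This is a minimal sketch; `append_block` is a hypothetical name and is not part of the harness scripts.

```bash
# append_block <file>: append stdin to <file>, first adding a newline
# if the file is non-empty and its last byte is not already a newline.
# Hypothetical helper for illustration; not part of the repo scripts.
append_block() {
    local file="$1"
    # $(...) strips a trailing newline, so the substitution is empty
    # exactly when the last byte of the file is "\n".
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
    cat >> "$file"
}
```

It would be called as `append_block ~/.ssh/config <<'EOF' ... EOF`, with the same `Host jetson-e2e` block as the heredoc body.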
### 2. Restrict the key's scope on the Jetson (recommended)
Edit `~/.ssh/authorized_keys` on the Jetson and prefix the line that the
`ssh-copy-id` step appended:
```
from="<mac-lan-ip>",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA… jetson-e2e
```
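Optionally lock the key to "only run the e2e driver" by also adding
`command="docker compose -f /home/jetson/gps-denied-onboard/docker-compose.test.jetson.yml up --abort-on-container-exit"`
to the option list. The key then can't get a general shell, only invoke that one command.
Note that a forced command replaces whatever command the client requests, so the
`rsync` and `docker compose ... build` steps that `run-tests-jetson.sh` performs over
the same alias would need a separate key without the `command=` option.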
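
The prefixing step can be scripted rather than done by hand. The sketch below assumes the key comment starts with `jetson-e2e`, as set by the `-C` flag in step 1. `restrict_e2e_key` is a hypothetical helper name, and it adds only the `from=` and forwarding restrictions, not the optional `command=` lock.

```bash
# restrict_e2e_key <mac-lan-ip>: read an authorized_keys file on stdin
# and write it to stdout with the restriction options prefixed to the
# e2e key. Hypothetical helper; assumes the key comment begins with
# "jetson-e2e" (see the ssh-keygen -C flag in step 1).
restrict_e2e_key() {
    local opts="from=\"$1\",no-port-forwarding,no-X11-forwarding,no-agent-forwarding"
    # Lines that already carry options start with the option list, not
    # the key type, so they fail the first test and pass through as-is.
    awk -v opts="$opts" '
        $1 == "ssh-ed25519" && $3 == "jetson-e2e" { print opts " " $0; next }
        { print }
    '
}
```

On the Jetson, one way to use it is to write the result to a scratch file (`restrict_e2e_key <mac-lan-ip> < ~/.ssh/authorized_keys > ~/.ssh/authorized_keys.new`), review it, and only then move it into place, keeping the current session open until a fresh login has been verified.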
### 3. Harden sshd (optional, recommended for an exposed test rig)
On the Jetson, create `/etc/ssh/sshd_config.d/10-e2e.conf`:
```
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
```
Then `sudo systemctl reload ssh`.
### 4. Verify the Jetson Docker + GPU pipeline
`nvidia-container-runtime` mounts `nvidia-smi` + CUDA libs from the
host into the container at runtime, so a tiny base image works for the
smoke test (no need to pull the 5 GB `l4t-jetpack` image just to check
GPU exposure):
```bash
ssh jetson-e2e 'docker run --rm --runtime=nvidia --gpus all \
ubuntu:22.04 nvidia-smi'
```
Expected output: an `nvidia-smi`-style table listing the Orin GPU. If
this fails with "could not select device driver \"nvidia\"" or "no GPU
devices", reinstall `nvidia-container-toolkit` and
`sudo systemctl restart docker`.
If `nvidia-smi` works on the host directly but not inside a container,
the problem is almost always nvidia-container-toolkit, not the driver.
### 5. Confirm disk + swap
```bash
ssh jetson-e2e 'df -h /var/lib/docker && swapon --show && free -h'
```
Need ≥ 30 GB free on `/var/lib/docker`. Swap should be at least 4 GB
(JetPack default is 4 GB zram).
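
The two thresholds can also be checked mechanically. The following is a minimal sketch, not part of the harness: `check_capacity` and the two `MIN_*` variables are hypothetical names, and the thresholds are interpreted as GiB, which is an assumption about the figures above.

```bash
# Thresholds taken from this section; treating "GB" as GiB is an assumption.
MIN_DISK_GIB=30
MIN_SWAP_GIB=4

# check_capacity <free_disk_kib> <swap_total_kib>
# Prints one line per failed check and returns non-zero if any failed.
# Hypothetical helper for illustration only.
check_capacity() {
    local disk_kib="$1" swap_kib="$2" rc=0
    if [ "$disk_kib" -lt $((MIN_DISK_GIB * 1024 * 1024)) ]; then
        echo "disk: need >= ${MIN_DISK_GIB} GiB free on /var/lib/docker"
        rc=1
    fi
    if [ "$swap_kib" -lt $((MIN_SWAP_GIB * 1024 * 1024)) ]; then
        echo "swap: need >= ${MIN_SWAP_GIB} GiB"
        rc=1
    fi
    return "$rc"
}
```

The inputs could come from `ssh jetson-e2e 'df -k --output=avail /var/lib/docker | tail -n 1'` and from the `SwapTotal` line of `/proc/meminfo` on the Jetson (both report KiB). A zram device sized from an 8 GB board may report slightly under 4 GiB, so the swap threshold may need loosening.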
## Running the harness
### Pre-flight (one-time, then on JWT secret rotation)
AZ-688 added the real `../satellite-provider` .NET service to the Jetson
compose graph. Two extra setup steps before the first run:
```bash
# 1. Sibling repo must be checked out alongside gps-denied-onboard/.
# The harness rsyncs both repos to the Jetson; the relative `../satellite-provider`
# path in docker-compose.test.jetson.yml resolves identically on Mac and Jetson.
ls ../satellite-provider/SatelliteProvider.sln # sanity check
# 2. Copy the env template and fill in the dev JWT secret. .env.test is
# gitignored; the script refuses to start if it's missing or if any
# of JWT_SECRET / JWT_ISSUER / JWT_AUDIENCE are unset.
cp .env.test.example .env.test
# Generate a fresh dev secret (≥32 bytes for HMAC-SHA256):
openssl rand -hex 32
# Paste into JWT_SECRET=… in .env.test. The same secret is later used by
# AZ-690 (dev JWT minting helper) to sign tokens that this same provider
# validates. Issuer/audience defaults are pre-filled.
```
The dev TLS cert (`../satellite-provider/certs/{api.pfx,api.crt,api.key}`)
is regenerated on demand by `scripts/ensure-dev-cert.sh`, which
`run-tests-jetson.sh` calls automatically. The cert is self-signed,
gitignored in both repos, and pinned to SAN `api`/`satellite-provider`/
`localhost`/`127.0.0.1` — see the script for the openssl recipe.
### Run
From the developer Mac, repo root:
```bash
bash scripts/run-tests-jetson.sh
```
What happens:
1. Load `.env.test` (fail-fast if missing / JWT vars unset / `JWT_SECRET` < 32 bytes).
2. `scripts/ensure-dev-cert.sh` on the Mac — idempotent dev TLS cert generation
into `../satellite-provider/certs/`.
3. `rsync` source → `jetson-e2e:~/gps-denied-onboard/` (excludes `.git`,
`__pycache__`, build artefacts; LFS pointers transfer as text).
4. `rsync` `../satellite-provider/` → `jetson-e2e:~/satellite-provider/`
(sibling of `gps-denied-onboard/` so the compose path resolves).
5. `ssh jetson-e2e docker compose ... build e2e-runner satellite-provider`
(env vars exported through the heredoc so the upstream compose's
`${JWT_SECRET}` interpolation resolves on the Jetson side).
6. `ssh jetson-e2e docker compose ... up --abort-on-container-exit --exit-code-from e2e-runner`.
7. stdout / stderr stream to the Mac terminal; exit code propagates.
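
The fail-fast check in step 1 can be pictured roughly as follows. This is a sketch under assumptions, not the script's actual code: `validate_jwt_env` is a hypothetical name, and it expects `.env.test` to have been loaded into the shell already (for example with `set -a; . ./.env.test; set +a`).

```bash
# validate_jwt_env: return non-zero with a message on stderr if any of
# the three JWT variables is unset or empty, or if JWT_SECRET is too short.
# Hypothetical sketch of step 1; the real script may differ.
validate_jwt_env() {
    local name value
    for name in JWT_SECRET JWT_ISSUER JWT_AUDIENCE; do
        # Indirect lookup of the variable named by $name. The names are
        # fixed literals from the list above, so eval is safe here.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing or empty: $name" >&2
            return 1
        fi
    done
    # ${#VAR} counts characters; for an ASCII hex secret that equals bytes.
    if [ "${#JWT_SECRET}" -lt 32 ]; then
        echo "JWT_SECRET is shorter than 32 bytes" >&2
        return 1
    fi
}
```

A secret produced by `openssl rand -hex 32` is 64 hex characters long, so it passes the length check with room to spare.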

Override the alias or remote dir if your setup differs:
```bash
JETSON_SSH_ALIAS=other-host JETSON_REMOTE_DIR=~/somewhere/else \
bash scripts/run-tests-jetson.sh
```
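`JETSON_REMOTE_DIR` MUST be a path whose parent directory is writable,
because the harness places `satellite-provider/` next to it. With the default
`~/gps-denied-onboard`, the satellite-provider lands at
`~/satellite-provider/` on the Jetson. Note that the local shell expands an
unquoted `~` in the assignment to the Mac's home directory before the script
sees it, so prefer an absolute path that exists on the Jetson if the two home
directories differ.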
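
The sibling rule is just "same parent, different last component". The sketch below shows that relationship; `sibling_dir` is a hypothetical helper, and how `run-tests-jetson.sh` actually derives the path is not shown in this document.

```bash
# sibling_dir <remote_dir>: print where satellite-provider/ would land
# for a given JETSON_REMOTE_DIR. Hypothetical helper for illustration.
sibling_dir() {
    local parent
    # dirname ignores a trailing slash and returns the parent directory.
    parent=$(dirname "$1")
    # Avoid a doubled slash when the parent is the filesystem root.
    if [ "$parent" = "/" ]; then
        parent=""
    fi
    printf '%s/satellite-provider\n' "$parent"
}
```

For the default, `sibling_dir '~/gps-denied-onboard'` prints `~/satellite-provider`, matching the location stated above.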
## Smoke vs. Reality Gate split — at a glance
| Test category | Marker | Colima (Tier-1) | Jetson (Tier-2) |
|---------------|--------|-----------------|-----------------|
| AC-4a AST scan | (none) | runs | runs |
| AC-4b byte-equality | (none) | runs | runs |
| AC-7 skip-gate self-check | (none) | runs | runs |
| AC-9 helper unit tests | (none) | runs | runs |
| AC-1 / AC-2 / AC-3 / AC-5 / AC-6 (heavy) | `tier2` | **SKIPPED** | runs |
| AC-8 operator workflow | `skip` (AZ-616 blocks) | skipped | skipped |

The `GPS_DENIED_TIER` env var controls the auto-skip:
* `GPS_DENIED_TIER=1` (Colima default) → `tier2` / `gpu` / `docker`
marked tests auto-skipped via `tests/conftest.py:31-44`.
* `GPS_DENIED_TIER=2` (Jetson default) → all markers active; everything
runs (subject to other skip gates like `RUN_REPLAY_E2E`).
## Troubleshooting
| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `cannot reach 'ssh jetson-e2e' non-interactively` | Agent isn't unlocked or key not in `authorized_keys` | `ssh-add -l` on Mac; check `~/.ssh/authorized_keys` on Jetson |
| `docker: Error response from daemon: could not select device driver "nvidia"` | nvidia-container-toolkit missing or daemon not restarted after install | `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker` |
| `torch.cuda.is_available() == False` inside the container | `runtime: nvidia` block missing, or building on x86 host | Verify `docker-compose.test.jetson.yml` has `runtime: nvidia`; rebuild on the Jetson |
| `replay.auto_sync.ac8_validation_failed` | AZ-614 (tlog time-base mismatch) — not a harness bug | Fix AZ-614 in `tests/e2e/replay/_tlog_synth.py` |
| `not found` / `tag not found` on `nvcr.io/nvidia/l4t-base:r36.*` | `l4t-base` was deprecated in JetPack 6 | Use `l4t-jetpack:r36.4.0` for smoke tests; the harness itself uses `dustynv/l4t-pytorch:r36.4.0` |
| `pull access denied for nvcr.io/nvidia/...` | NGC requires login for some tags | `docker login nvcr.io` (use NGC API key from developer.nvidia.com) |
## Related Jira
* AZ-615 — this harness (Jetson runner story)
* AZ-616 — umbrella: replace `mock-sat` with real `../satellite-provider` service
* AZ-688 — Compose-include real satellite-provider + Postgres (this doc)
* AZ-689 — Seed Derkachi-bbox fixture tile set for hermetic e2e
* AZ-690 — Long-lived dev JWT minting helper
* AZ-691 — Python `SatelliteProviderClient`
* AZ-692 — Wire client into composition root; retire `mock-sat`
* AZ-693 — Docs: client contract + test env + containerization
* AZ-694 — AC-8 unskip + diagnose (sibling Story, not a subtask)
* AZ-617 — mark heavy ACs with `tier2` (already applied; this story
documents and verifies the auto-skip)
* AZ-614 — tlog time-base mismatch (currently blocks the heavy ACs
from reaching the GPU stage)
* AZ-602 — parent Epic: E2E Tier-1 harness rehabilitation