gps-denied-onboard/docker-compose.test.jetson.yml
Oleksandr Bezdieniezhnykh 97f5f9793c [AZ-965] NetVLAD-VGG16 backbone checkpoint + YAML/compose wiring
AZ-965 ships the NetVLAD .pt checkpoint that clears the AZ-839
empty-c10_provisioning.backbones SKIP gate. Pipeline-integration
scaffold — encoder is real, NetVLAD tail is honestly labelled as
untrained.

Composition:

* Encoder (26 keys, encoder.0..encoder.28): torchvision
  vgg16(weights=IMAGENET1K_V1) features [:-2], BSD-3-Clause.
  Real ImageNet-pretrained VGG16 conv stack.
* NetVLAD pool + PCA tail (5 keys: pool.conv.{weight,bias},
  pool.centroids, pca.{weight,bias}): random-init via
  torch.manual_seed(0). NOT trained for visual place recognition.
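
The 26-key encoder figure can be cross-checked without torchvision. The sketch below walks the standard VGG16 layer configuration, drops the last two feature modules (as in features[:-2]), and lists the parameter keys the remaining conv layers would contribute. The encoder.N.weight / encoder.N.bias naming is an assumption inferred from the encoder.0..encoder.28 range above, and the helper names are hypothetical.

```python
# Standard VGG16 ("D") configuration: conv output channels, "M" = max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_feature_layers(cfg=VGG16_CFG):
    """Layer kinds in torchvision order: every conv is followed by a ReLU."""
    layers = []
    for item in cfg:
        if item == "M":
            layers.append("maxpool")
        else:
            layers.extend(["conv", "relu"])
    return layers


def encoder_state_dict_keys(prefix="encoder", drop_last=2):
    """Keys contributed by the conv layers of features[:-drop_last]."""
    layers = vgg16_feature_layers()[:-drop_last]
    keys = []
    for index, kind in enumerate(layers):
        if kind == "conv":  # ReLU and max-pool carry no parameters
            keys += [f"{prefix}.{index}.weight", f"{prefix}.{index}.bias"]
    return keys


keys = encoder_state_dict_keys()
print(len(keys), keys[0], keys[-1])  # 26 encoder.0.weight encoder.28.bias
```

The last conv sits at index 28 because the full feature stack has 31 modules (0..30) and the slice removes the final ReLU and max-pool.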

Total: 149,002,112 params (568.4 MiB fp32, sha256=745c6f29...).
Round-trip verified locally: torch.load(weights_only=True) +
load_state_dict(strict=True) succeed; forward(1,3,480,480) emits
{'vlad_descriptor': (1, 4096) fp32} — matches NetVladStrategy
contract per net_vlad.py:247-251.
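
The parameter total and fp32 size can be reproduced with plain arithmetic. The sketch below is a consistency check under stated assumptions: 64 NetVLAD clusters over 512-channel features, a 1x1 pool conv with bias, and a biased PCA projection from 64*512 to 4096 dimensions. None of these sizes is stated in this message except the 4096-dim output, but together they reproduce the reported total exactly.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a square-kernel conv layer."""
    return c_in * c_out * kernel * kernel + c_out


def encoder_params():
    """The 13 VGG16 3x3 conv layers, starting from 3 input channels."""
    channels = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
    total, c_in = 0, 3
    for c_out in channels:
        total += conv_params(c_in, c_out, 3)
        c_in = c_out
    return total


def tail_params(clusters=64, dim=512, out_dim=4096):
    """NetVLAD pool + PCA tail; cluster count and layer shapes are assumed."""
    pool_conv = conv_params(dim, clusters, 1)     # pool.conv.{weight,bias}
    centroids = clusters * dim                    # pool.centroids
    pca = clusters * dim * out_dim + out_dim      # pca.{weight,bias}
    return pool_conv + centroids + pca


total = encoder_params() + tail_params()
mib = total * 4 / 2**20  # fp32 = 4 bytes per parameter
print(total, round(mib, 1))  # 149002112 568.4
```

Under these assumptions the PCA projection accounts for roughly 90% of the parameters, which is why the checkpoint is so much larger than the VGG16 conv stack alone.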

Two material discoveries documented in the AZ-965 spec:

1. The NetVLAD-VGG16 architecture already lives in repo at
   src/gps_denied_onboard/components/c2_vpr/_net_vlad_architecture.py,
   so we instantiate it and save a state_dict rather than sourcing
   it externally.
2. The PyTorch FP16 runtime expects a .pt state_dict (NOT .onnx).
   BackboneConfig.onnx_path is a misnomer for NetVLAD: per AZ-321
   design + c2_vpr description.md §1, NetVLAD runs on PyTorch FP16
   (NOT TRT). compile_engine is a no-op sha256+path wrap;
   deserialize_engine does torch.load(weights_only=True) +
   load_state_dict(strict=True).
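
A "no-op sha256+path wrap" can be pictured roughly as below. This is a minimal sketch, not the repository's compile_engine: EngineArtifact, compile_engine_sketch and check_before_deserialize are hypothetical names, and the torch.load / load_state_dict step is only mentioned in a comment so the example stays dependency-free.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EngineArtifact:
    """Hypothetical record: where the checkpoint lives and its digest."""
    path: Path
    sha256: str


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file so a large checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compile_engine_sketch(checkpoint_path):
    """'Compile' by recording path + digest; the bytes are left untouched."""
    path = Path(checkpoint_path)
    return EngineArtifact(path=path, sha256=sha256_of_file(path))


def check_before_deserialize(artifact):
    """Integrity gate that could precede the real load step, which per this
    message is torch.load(weights_only=True) followed by
    load_state_dict(strict=True) (not executed here)."""
    return sha256_of_file(artifact.path) == artifact.sha256


# Demo on a throwaway file standing in for the .pt checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    fake_checkpoint = Path(tmp) / "netvlad.pt"
    fake_checkpoint.write_bytes(b"abc")
    artifact = compile_engine_sketch(fake_checkpoint)
    intact = check_before_deserialize(artifact)
    fake_checkpoint.write_bytes(b"abd")  # simulate a corrupted file
    still_valid = check_before_deserialize(artifact)

print(intact, still_valid)  # True False
```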

The user skipped the Option A/B/C/D/E question, so the judgment call
is Option B (IMAGENET1K_V1 + random tail) per "use judgment, don't
block":
* Option A (Nanne translation) was 5-8 SP, at or above the 5 SP budget.
* Option B is 3 SP, fits the budget, honestly labelled.
* Option C (pure random) was borderline-dishonest per Real Results.

Files:

* scripts/mk_netvlad_checkpoint.py — deterministic generator.
* models/netvlad/netvlad.pt — 568 MiB, via git-lfs (.gitattributes
  extended for models/**/*.pt, *.onnx, *.engine).
* configs/operator_replay.yaml — c2_vpr + c10_provisioning blocks
  populated; the field literally named onnx_path actually points
  at the .pt for NetVLAD per the runtime semantics noted above.
* docker-compose.test.jetson.yml — ./models:/opt/models:ro bind
  mount added to e2e-runner.
* _docs/03_ip_attribution/netvlad.md — provenance, licence, how-to-
  reproduce, honest scope statement ("NOT a real-retrieval
  checkpoint; ESKF divergence under garbage retrievals is the
  expected next gate").
* _docs/02_tasks/todo/AZ-965_netvlad_onnx_backbone_provisioning.md
  — rewritten to reflect the .pt-not-.onnx + Option B discoveries.
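
To see where the checkpoint lands inside e2e-runner, the short-syntax bind mount can be resolved by hand. The sketch below is illustrative only: the helper names are hypothetical, the parser handles just the simple POSIX host:container[:mode] form, and whether operator_replay.yaml references exactly the resulting path is not shown in this message.

```python
from pathlib import PurePosixPath


def parse_bind_mount(spec):
    """Split 'host:container[:mode]' into its parts (POSIX paths only)."""
    parts = spec.split(":")
    host, container = parts[0], parts[1]
    mode = parts[2] if len(parts) > 2 else "rw"
    return PurePosixPath(host), PurePosixPath(container), mode


def container_path(spec, host_path):
    """Map a host path under the mount source to its in-container path."""
    host_root, container_root, _mode = parse_bind_mount(spec)
    relative = PurePosixPath(host_path).relative_to(host_root)
    return container_root / relative


mount = "./models:/opt/models:ro"
print(container_path(mount, "./models/netvlad/netvlad.pt"))
# /opt/models/netvlad/netvlad.pt
```

The :ro suffix means the runner can read the checkpoint but cannot modify it.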

Tier-2 verification follows in a separate commit after the harness
run confirms the empty-backbones SKIP gate clears.

Out of scope (filed as follow-ups):

* Real-retrieval NetVLAD weights (Nanne Pittsburgh-30k translation
  or internal team checkpoint) — separate ticket.
* AZ-840 orchestrator PASSing end-to-end (depends on retrieval
  quality + ESKF stability).
* AZ-963 60s smoke ESKF divergence (independent chain).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-29 18:03:32 +03:00


# Tier-2 e2e harness — Jetson Orin Nano (JetPack 6.x, L4T R36.x).
#
# AZ-615: companion compose file to `docker-compose.test.yml` that runs
# the full Reality Gate on a CUDA-capable host. Used via `ssh jetson-e2e
# "docker compose -f docker-compose.test.jetson.yml up ..."` driven by
# `scripts/run-tests-jetson.sh`.
#
# Difference vs. docker-compose.test.yml:
# * `runtime: nvidia` + a `deploy.resources.reservations.devices` GPU
# reservation (`count: all`) on `e2e-runner` so the SUT can resolve
# `model.half().cuda()` against the Orin GPU.
# * `GPS_DENIED_TIER=2` — turns OFF the auto-skip for `@pytest.mark.tier2`
# ACs (see tests/conftest.py:31-44). The heavy ACs (AC-1, AC-2, AC-3,
# AC-5, AC-6) actually run.
# * Builds from `tests/e2e/Dockerfile.jetson` (l4t-pytorch base).
# * Companion / db / mock-sat continue to come from the root
# `docker-compose.yml` via `extends:` (same as Colima) — they have ARM64
# tags via the existing build pipeline.
#
# AZ-688 (sibling of AZ-616): the real satellite-provider .NET service is
# defined inline below (services.satellite-provider + services.satellite-
# provider-postgres). `run-tests-jetson.sh` rsyncs `../satellite-provider/`
# to a sibling directory on the Jetson so the build context resolves
# identically on the workstation and on the Jetson.
#
# Why inline instead of `include: ../satellite-provider/docker-compose.yml`:
# Compose's `include:` rejects same-name service overrides ("conflicts with
# imported resource"). We need to customize the api service (healthcheck,
# network alias, internal-only ports) so the upstream compose's verbatim
# `include:` doesn't work. Inline is cleaner than the multi-`-f` ordering
# games required to make overlay precedence work.
#
# `mock-sat` remains in the graph for now — AZ-692 retires it once the
# gps-denied client (AZ-691) lands.
services:
  # ------------------------------------------------------------------
  # Init services (profiles: [setup]): NOT started by the default
  # `docker compose up`. They are invoked explicitly as one-shot jobs
  # via `docker compose --profile setup run --rm <service>` before the
  # main harness run:
  #
  #   1. db-migrate: applies Alembic migrations so companion's
  #      FreshnessGate / PostgresFilesystemStore find their tables.
  #      (AZ-618 ordering gap: build_pre_constructed queries the DB
  #      before the composition root can call apply_migrations.)
  #
  #   2. tile-init: writes a minimal valid HNSW32 FAISS descriptor
  #      index into the tile-data volume so FaissDescriptorIndex._load()
  #      succeeds during build_pre_constructed.
  #      (AZ-618 gap: the production provisioning pipeline normally
  #      writes the index; in the test harness it must pre-exist.)
  #
  # They are in profile "setup" so they do not participate in the
  # default `docker compose up` and do not trip --abort-on-container-exit.
  # ------------------------------------------------------------------
  db-migrate:
    profiles: ["setup"]
    image: gps-denied-onboard/e2e-runner:jetson
    entrypoint: ["alembic"]
    command: ["upgrade", "head"]
    working_dir: /opt/project
    volumes:
      - .:/opt/project:ro
    depends_on:
      db:
        condition: service_healthy
    restart: "no"

  tile-init:
    profiles: ["setup"]
    image: gps-denied-onboard/e2e-runner:jetson
    entrypoint: ["python3"]
    command: ["/opt/project/scripts/mk_test_faiss_fixture.py"]
    volumes:
      - .:/opt/project:ro
      - tile-data:/var/lib/gps-denied/tiles
    restart: "no"

  # companion and operator-orchestrator are intentionally absent from
  # the Jetson e2e test harness.
  #
  # Every test in tests/e2e/replay/ invokes the ``gps-denied-replay``
  # console-script directly as a subprocess and does not call the
  # companion or operator-orchestrator HTTP APIs. Including either
  # service caused the harness to abort before any test could run:
  #
  #   * companion crashes at startup because live-mode requires a
  #     production-provisioned C7 inference engine (PyTorch FP16 or
  #     TensorRT) that is absent from the test environment. This is the
  #     pre-existing AZ-618 gap (build_pre_constructed fails before the
  #     composition root can apply_migrations + engine artifacts).
  #   * operator-orchestrator crashed for the same C7 inference reason.
  #
  # When the AZ-618 epic ships the full airborne boot-up in a sandboxed
  # environment (Phase E / engine stubs), companion can be re-added here.
  db:
    extends:
      file: docker-compose.yml
      service: db

  e2e-runner:
    build:
      context: .
      dockerfile: tests/e2e/Dockerfile.jetson
    image: gps-denied-onboard/e2e-runner:jetson
    # nvidia-container-runtime exposes the Tegra GPU + libcuda mounts.
    # Without this block the container starts but `torch.cuda.is_available()`
    # returns False and every tier2 AC errors at `.cuda()`.
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    depends_on:
      db:
        condition: service_healthy
      satellite-provider:
        condition: service_healthy
    environment:
      # Same FullSystemConfig env block as Colima; see comments in
      # docker-compose.test.yml for the per-var rationale.
      GPS_DENIED_FC_PROFILE: ardupilot_plane
      # Tier-2 turns OFF the `tier2` / `gpu` auto-skip in tests/conftest.py
      # so the heavy ACs in tests/e2e/replay/test_derkachi_1min.py actually
      # execute. This is the WHOLE POINT of the Jetson harness.
      GPS_DENIED_TIER: "2"
      DB_URL: postgresql://gps_denied:dev@db:5432/gps_denied
      # AZ-777 Phase 1: e2e-runner consumes the real parent-suite
      # satellite-provider .NET service over its compose-DNS name. The
      # dev TLS cert is self-signed against `localhost`, so the
      # suite-internal probe must skip cert verification; see SECURITY
      # note in `.env.test.example`. Production deploys ship a real CA-issued
      # cert and MUST set SATELLITE_PROVIDER_TLS_INSECURE="0" (or omit it).
      SATELLITE_PROVIDER_URL: https://satellite-provider:8080
      SATELLITE_PROVIDER_TLS_INSECURE: "1"
      SATELLITE_PROVIDER_API_KEY: ${SATELLITE_PROVIDER_API_KEY:?SATELLITE_PROVIDER_API_KEY must be set via .env.test; see scripts/mint_dev_jwt.py}
      # AZ-777 Phase 1 also forwards the JWT triple so the smoke test
      # can mint its own dev token in-container as a fallback when
      # SATELLITE_PROVIDER_API_KEY is rotated mid-session.
      JWT_SECRET: ${JWT_SECRET}
      JWT_ISSUER: ${JWT_ISSUER}
      JWT_AUDIENCE: ${JWT_AUDIENCE}
      COMPANION_URL: http://companion:8080
      CAMERA_CALIBRATION_PATH: /opt/tests/fixtures/calibration/adti26.json
      LOG_LEVEL: INFO
      LOG_SINK: console
      INFERENCE_BACKEND: pytorch_fp16
      FDR_PATH: /var/lib/gps-denied/fdr
      TILE_CACHE_PATH: /var/lib/gps-denied/tiles
      MAVLINK_SIGNING_KEY: /opt/tests/fixtures/mavlink_signing/dev_key
      RUN_REPLAY_E2E: "1"
      # Replay-mode build flags (Invariant 9). See identical block in
      # docker-compose.test.yml: all three are required for the
      # composition root to construct the replay strategies. The
      # original harness only set BUILD_REPLAY_SINK_JSONL because
      # every Reality-Gate run died at auto-sync before the other
      # two flags were checked.
      BUILD_VIDEO_FILE_FRAME_SOURCE: "ON"
      BUILD_TLOG_REPLAY_ADAPTER: "ON"
      BUILD_REPLAY_SINK_JSONL: "ON"
      # AZ-894 / AZ-895: the CSV-driven path is now the PRIMARY replay
      # surface (auto-sync was deprecated). `_replay_branch._build_csv_bundle`
      # constructs `CsvReplayFcAdapter`, which fails fast at __init__ when
      # this flag is OFF; every test in tests/e2e/replay/ that runs the
      # `replay_runner` fixture trips that gate without this line.
      BUILD_CSV_REPLAY_ADAPTER: "ON"
      BUILD_FAISS_INDEX: "ON"
      # AZ-964: build_inference_runtime gates pytorch_fp16 behind
      # this flag. The dustynv/l4t-pytorch base image bakes the
      # Tegra-tuned PyTorch wheel, so the strategy module imports
      # cleanly when the flag is ON. build_engine_compiler (called
      # by the AZ-839 fixture) requires c7 inference runtime, so
      # the flag must be ON for the orchestrator test to run.
      BUILD_PYTORCH_FP16_RUNTIME: "ON"
      # AZ-962: the AZ-839 C3 fixture (operator_pre_flight_setup) skips
      # the AZ-840 orchestrator test when this var is missing. The YAML
      # bind-mounted at /opt/configs/operator_replay.yaml declares the
      # four blocks the fixture consumes (c6/c7/c10/c11). AZ-965 populates
      # c10.backbones with the NetVLAD .pt checkpoint (bind-mounted from
      # ./models); the field is still named onnx_path.
      GPS_DENIED_OPERATOR_CONFIG_PATH: /opt/configs/operator_replay.yaml
    volumes:
      - ./tests:/opt/tests:ro
      - ./_docs/00_problem/input_data:/opt/_docs/00_problem/input_data:ro
      - ./configs:/opt/configs:ro
      - ./models:/opt/models:ro
      - fdr-data:/var/lib/gps-denied/fdr
      - tile-data:/var/lib/gps-denied/tiles

  # AZ-688: real satellite-provider .NET service. Mirrors the upstream
  # compose at ../satellite-provider/docker-compose.yml with three
  # deliberate customizations:
  #   * service name = `satellite-provider` (clearer than the upstream's
  #     generic `api`) so AZ-692's client uses https://satellite-provider:8080
  #   * TCP-level healthcheck via bash /dev/tcp so other services can
  #     `depends_on: service_healthy`. The base image
  #     (mcr.microsoft.com/dotnet/aspnet:10.0, debian-12-slim) ships
  #     bash and /dev/tcp is a bash builtin; no extra package needed.
  #   * no host port mappings: internal-only access via compose DNS;
  #     keeps host ports free for nested e2e runs.
  satellite-provider:
    build:
      context: ../satellite-provider
      dockerfile: SatelliteProvider.Api/Dockerfile
    image: gps-denied-onboard/satellite-provider:dev
    container_name: gps-denied-e2e-satellite-provider
    environment:
      ASPNETCORE_ENVIRONMENT: Development
      ASPNETCORE_URLS: https://+:8080
      ASPNETCORE_Kestrel__Certificates__Default__Path: /app/certs/api.pfx
      ASPNETCORE_Kestrel__Certificates__Default__Password: satellite-dev-cert
      ConnectionStrings__DefaultConnection: Host=satellite-provider-postgres;Port=5432;Database=satelliteprovider;Username=postgres;Password=postgres
      MapConfig__ApiKey: ${GOOGLE_MAPS_API_KEY:-}
      # Suite JWT contract; see _docs/10_auth.md. Sourced from .env.test
      # via run-tests-jetson.sh; the API fails fast at startup if any of
      # the three are missing or whitespace-only.
      JWT_SECRET: ${JWT_SECRET:?JWT_SECRET must be set via .env.test}
      JWT_ISSUER: ${JWT_ISSUER:?JWT_ISSUER must be set via .env.test}
      JWT_AUDIENCE: ${JWT_AUDIENCE:?JWT_AUDIENCE must be set via .env.test}
    volumes:
      - ../satellite-provider/certs/api.pfx:/app/certs/api.pfx:ro
      - ../satellite-provider/tiles:/app/tiles
      - ../satellite-provider/ready:/app/ready
      - ../satellite-provider/logs:/app/logs
    healthcheck:
      test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/127.0.0.1/8080"]
      interval: 5s
      timeout: 3s
      retries: 12
      start_period: 30s
    depends_on:
      satellite-provider-postgres:
        condition: service_healthy

  satellite-provider-postgres:
    image: postgres:16
    container_name: gps-denied-e2e-satellite-provider-postgres
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: satelliteprovider
    volumes:
      - satellite-provider-postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

volumes:
  db-data: {}
  fdr-data: {}
  tile-data: {}
  satellite-provider-postgres-data: {}
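
The satellite-provider healthcheck above only proves that something accepts TCP connections on port 8080; it does not speak TLS or HTTP. A rough Python equivalent of that probe, useful when debugging the harness from inside e2e-runner, might look like the sketch below. The function names are hypothetical, and the retry loop is only a loose approximation of Compose's interval/retries behaviour (it ignores start_period and consecutive-failure accounting).

```python
import socket
import time


def tcp_port_open(host, port, timeout=3.0):
    """True if a TCP connection can be opened, like `exec 3<>/dev/tcp/...`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure
        return False


def wait_until_healthy(host, port, retries=12, interval=5.0, timeout=3.0,
                       sleep=time.sleep):
    """Probe up to `retries` times, sleeping `interval` seconds in between."""
    for attempt in range(retries):
        if tcp_port_open(host, port, timeout):
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False
```

For example, wait_until_healthy("satellite-provider", 8080) could be called from a debugging session inside the compose network; the defaults mirror the interval, timeout and retries values in the healthcheck block.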