gps-denied-onboard/_docs/02_tasks/todo/AZ-406_test_infrastructure.md
Oleksandr Bezdieniezhnykh 880eabcb3f Decompose Step 6 snapshot: 140 task specs + contract docs
Closes out greenfield Step 6 (Decompose) for all 14 components
(C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446
plus the _dependencies_table.md and component contract documents.

State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 00:39:48 +03:00


Test Infrastructure

Task: AZ-406_test_infrastructure
Name: Blackbox Test Infrastructure Bootstrap (Tier-1 Docker + Tier-2 Jetson harness scaffold)
Description: Scaffold the e2e blackbox test project: e2e/ directory, pytest runner, docker-compose.test.yml, mock services, fixture builders, secrets handling, CSV reporter. This is the foundation every blackbox test depends on.
Complexity: 5 points
Dependencies: None
Component: Blackbox Tests (epic AZ-262 / E-BBT)
Tracker: AZ-406
Epic: AZ-262 (E-BBT)

Problem

The product (gps-denied-onboard) needs a behavioral verification layer that drives the SUT exclusively through its declared public boundaries (frame source, FC inbound, tile cache mount, FC outbound via SITL, GCS via mavproxy, FDR filesystem). Without a unified test harness — Docker compose for Tier-1, Jetson runner harness for Tier-2, fixture builders, mock Suite Sat Service, MAVLink listener, CSV reporter — every individual blackbox / performance / resilience / security / resource-limit task would re-invent its own scaffolding. This task delivers that shared scaffold once.

Outcome

  • A single cd e2e/docker && docker compose -f docker-compose.test.yml up --build --abort-on-container-exit e2e-runner command brings up the full Tier-1 environment and runs the full pytest suite (when test tasks land).
  • A single ./e2e/jetson/run-tier2.sh --fc-adapter <ardupilot|inav> --vio-strategy <okvis2|klt_ransac> runs the Tier-2 hardware-loop with the same CSV reporter contract.
  • The matrix dimensions FC_ADAPTER × VIO_STRATEGY × build_kind (per environment.md § CI runner mapping) are first-class CI parameters; CI YAML scaffold provided.
  • Every external dependency named in environment.md (ardupilot-plane-sitl, inav-sitl, mock-suite-sat-service, mavproxy-listener) is provisioned by the compose file and reachable inside e2e-net.
  • Egress isolation (internal: true on e2e-net) is enforced by default, satisfying RESTRICT-SAT-1 / NFT-SEC-02 at the network layer.
  • A single pytest-based runner discovers tests under e2e/tests/, parameterizes by FC_ADAPTER + VIO_STRATEGY, and emits the CSV report at e2e-results/run-${RUN_ID}/report.csv with the columns specified in environment.md § Reporting.
  • Fixture builders for tile-cache-fixture, synth-age-tile-set, outlier-injection-derkachi, blackout-spoof-derkachi, multi-segment-derkachi, cold-boot-fixture, cve-jpeg-fixture, mavlink-passkey exist as separate Dockerized helpers under e2e/fixtures/.

Test Project Folder Layout

e2e/
├── docker/
│   ├── docker-compose.test.yml         # Tier-1 entrypoint
│   ├── docker-compose.tier2-bridge.yml # optional override for Jetson-attached SITLs
│   └── secrets/
│       └── mavlink_passkey             # Docker-secret mount target (test passkey)
├── jetson/
│   ├── run-tier2.sh                    # Tier-2 entrypoint
│   ├── tier2.service                   # systemd unit template for SUT lifecycle
│   ├── tegrastats_parser.py            # parse tegrastats → per-sample CSV rows
│   └── jtop_parser.py                  # parse jetson-stats jtop API → per-sample CSV
├── runner/
│   ├── Dockerfile                      # e2e-runner image (Python 3.12 + pytest 8.x)
│   ├── requirements.txt                # pytest, pymavlink, msp_gps_toy bridge, opencv-python>=4.12.0, numpy, scipy, geopy
│   ├── conftest.py                     # session/module/function fixtures, FC_ADAPTER/VIO_STRATEGY parameterization
│   ├── reporting/
│   │   ├── csv_reporter.py             # pytest plugin emitting environment.md § Reporting columns
│   │   └── evidence_bundler.py         # collects .tlog, FDR archives, profiler traces, screenshots
│   └── helpers/
│       ├── frame_source_replay.py      # replay images / video to V4L2 file source
│       ├── imu_replay.py               # replay data_imu.csv at 10 Hz to FC inbound
│       ├── sitl_observer.py            # AP/iNav state-read helpers (param read, GPS_RAW_INT, MSP queries)
│       ├── mavproxy_tlog_reader.py     # parse .tlog from mavproxy-listener
│       ├── fdr_reader.py               # post-run filesystem read of FDR archive
│       └── geo.py                      # Vincenty / WGS84 geodesic helpers
├── fixtures/
│   ├── tile-cache-builder/             # builds tile-cache-fixture from input_data + curated public subset
│   ├── age-injector/                   # mutates tile manifest dates → synth-age-tile-set
│   ├── injectors/
│   │   ├── outlier.py                  # outlier-injection-derkachi
│   │   ├── blackout_spoof.py           # blackout-spoof-derkachi (5/15/35 s windows)
│   │   ├── multi_segment.py            # multi-segment-derkachi
│   │   └── cold_boot.py                # cold-boot-fixture (frozen FC pose JSON)
│   ├── secrets/
│   │   └── mavlink-test-passkey.txt    # 32-byte hex; "TEST ONLY"
│   ├── security/
│   │   └── cve-2025-53644.jpg          # crafted JPEG for NFT-SEC-04 (license-checked PoC)
│   └── mock-suite-sat/                 # FastAPI stub for mock-suite-sat-service
│       ├── Dockerfile
│       └── app.py                      # 202 on well-formed publish; 400 on malformed
└── tests/
    ├── positive/                       # FT-P-* scenarios
    ├── negative/                       # FT-N-* scenarios
    ├── performance/                    # NFT-PERF-*
    ├── resilience/                     # NFT-RES-*
    ├── security/                       # NFT-SEC-*
    └── resource_limit/                 # NFT-LIM-*

Layout Rationale

  • e2e/docker/ and e2e/jetson/ separate the two execution tiers, mirroring environment.md § Two-tier execution profile. Each tier has its own entrypoint script — the runner image and CSV-reporter contract are shared.
  • e2e/runner/helpers/ keeps reusable boundary-driving primitives (frame replay, IMU replay, SITL observers, FDR reader, MAVLink listener) out of individual test modules — every blackbox task imports from here, not from the SUT.
  • e2e/fixtures/ holds fixture builders, not the data itself. Heavy fixture content (Derkachi video, 60 still images) is bind-mounted from _docs/00_problem/input_data/ per test-data.md.
  • e2e/tests/<category>/ mirrors the _docs/02_document/tests/*-tests.md grouping so a reader can map any spec scenario to its test file.

Mock Services

| Mock Service | Replaces | Endpoints | Behavior |
|---|---|---|---|
| mock-suite-sat-service | Azaion Suite Satellite Service ingest API | POST /tiles (publish), GET /tiles/audit (test-side audit retrieval) | Returns 202 on well-formed publish; 400 on malformed; logs every received tile + per-tile quality metadata to /audit/<run-id>.jsonl; GET /audit reads the log back. Deterministic; same input → same response. |
| ardupilot-plane-sitl | ArduPilot Plane FC | UDP 14550 (MAVLink) | Real ardupilot/ardupilot-sitl:plane-stable; GPS_TYPE=14. Tests OBSERVE, do not patch. |
| inav-sitl | iNav FC | TCP 5760 (MSP2) | Real inavflight/inav-sitl:9.0.0; GPS provider configured to MSP per docs/SITL/SITL.md. Tests OBSERVE. |
| mavproxy-listener | QGroundControl GCS | UDP 14551 | Passive MAVLink listener; captures SUT → GCS stream into /var/log/tlogs/<run-id>.tlog for assertions. |

Mock Control Surface

The mock-suite-sat-service exposes:

  • POST /mock/config — test-time behavior control (e.g., simulate downtime, inject 400 errors for negative-path scenarios)
  • GET /mock/audit — returns received tiles + their declared quality metadata for assertion
  • POST /mock/reset — clears audit log between tests for isolation

The two SITL services (ArduPilot, iNav) are NOT control-surface mocks — they are real flight-controller stacks running in simulation. Tests interact via standard MAVLink / MSP2 protocols.
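
The publish, audit, config, and reset behavior of the mock can be sketched as a small in-memory core, independent of the web framework that wraps it. This is a minimal sketch under assumptions: the required field names (tile_id, quality) and the forced_status knob are hypothetical stand-ins, since the real schema comes from the contract sketch referenced under Risk 3.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical required fields; the real list comes from the contract sketch.
REQUIRED_FIELDS = ("tile_id", "quality")


@dataclass
class MockSuiteSat:
    """In-memory core of the mock: publish, audit, config, reset."""

    audit: list = field(default_factory=list)
    forced_status: Optional[int] = None  # set through the config surface

    def configure(self, forced_status=None):
        # POST /mock/config: force a status code (e.g. 400 or 503), or clear it.
        self.forced_status = forced_status

    def publish(self, payload):
        # POST /tiles: return the HTTP status code the stub would send.
        if self.forced_status is not None:
            return self.forced_status
        if not isinstance(payload, dict) or any(k not in payload for k in REQUIRED_FIELDS):
            return 400
        self.audit.append(dict(payload))
        return 202

    def get_audit(self):
        # GET /mock/audit: return copies so callers cannot mutate the log.
        return [dict(entry) for entry in self.audit]

    def reset(self):
        # POST /mock/reset: clear the audit log and any forced behavior.
        self.audit.clear()
        self.forced_status = None
```

Keeping this logic free of HTTP concerns means the same class can back the FastAPI stub in app.py and be unit-tested without starting a container.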

Docker Test Environment

docker-compose.test.yml structure

The full structure is defined in _docs/02_document/tests/environment.md § Docker Environment. The test infrastructure task implements that structure verbatim with the following behaviors:

  • All services on the e2e-net bridge network with internal: true (no external connectivity — RESTRICT-SAT-1 / NFT-SEC-02).
  • Volumes: tile-cache-fixture (RO mount into SUT), fdr-output (RW from SUT, RO from runner), input-data (RO bind from _docs/00_problem/input_data/), expected-results (RO bind from _docs/00_problem/input_data/expected_results/).
  • fdr-output sized exactly 64 GB via Docker --storage-opt size=64g to enforce AC-NEW-3 capacity at the volume layer (NFT-LIM-02 cross-checks rotation behavior).
  • MAVLINK_SIGNING_PASSKEY_FILE injected as a Docker secret from e2e/docker/secrets/mavlink_passkey.
  • FC_ADAPTER and VIO_STRATEGY pulled from environment, default ardupilot + okvis2. CI matrix sets these per job.

Networks and Volumes

| Network / Volume | Type | Purpose |
|---|---|---|
| e2e-net | bridge, internal: true | All test traffic; enforces no-external-egress at network layer |
| tile-cache-fixture | named volume | Pre-built FAISS HNSW index + tile filesystem; built once per CI run |
| fdr-output | named volume, 64 GB cap | Per-flight FDR write target |
| input-data | bind mount | RO bind of _docs/00_problem/input_data/ |
| expected-results | bind mount | RO bind of _docs/00_problem/input_data/expected_results/ |

Tier-2 Bridge

Tier-2 runs the SUT on real Jetson hardware with systemctl start gps-denied-onboard.service. SITL containers (ArduPilot, iNav) run either on the same Jetson (constrained CPU sharing) OR on a paired x86 host on the same network — docker-compose.tier2-bridge.yml provisions the SITL-only subset with the same e2e-net definition so the runner observes the same boundaries.

Test Runner Configuration

Framework: pytest 8.x

Plugins and libraries:

  • pytest-csv: CSV emission per environment.md § Reporting (one row per test)
  • pytest-xdist: parallel test execution within a tier (Tier-1 only; Tier-2 runs serially due to single-Jetson constraint)
  • pytest-timeout: per-test wall-clock budget enforcement (matches the per-scenario Max execution time in test specs)
  • pymavlink: MAVLink ground side
  • msp_gps_toy (Rust binary, called via subprocess): MSP2 ground side
  • opencv-python>=4.12.0: frame source replay (CVE-mitigated per D-CROSS-CVE-1)
  • numpy + scipy + geopy (Vincenty): geodesic-distance assertions in WGS84
  • pytest-forked: --forked mode for hermetic-critical scenarios

Entry point: pytest e2e/tests/ --csv=e2e-results/run-${RUN_ID}/report.csv --csv-columns="test_id,test_name,traces_to,fc_adapter,vio_strategy,tier,started_at_utc,execution_time_ms,result,error_message,evidence_paths"

Fixture Strategy

| Fixture | Scope | Purpose |
|---|---|---|
| fc_adapter | session | parametrized over {ardupilot, inav}; selects which SITL to bind |
| vio_strategy | session | parametrized over {okvis2, klt_ransac} (production); vins_mono only on research-build sessions |
| tile_cache | session | mounts tile-cache-fixture once per session |
| clean_sut | function | docker compose restart gps-denied-onboard between tests; resets fdr-output |
| clean_sut_forked | function (--forked) | full docker compose down/up per test; for hermetic-critical scenarios |
| mavproxy_tlog | function | starts a fresh .tlog capture window for the duration of the test |
| fdr_reader | function | post-run helper for parsing the FDR archive |
| sitl_observer | function | AP/iNav state-read helper |

Parameterization

Every test file is parameterized by (fc_adapter, vio_strategy) unless the spec explicitly skips one or both. The conftest skip-rules:

  • AC-7.x scenarios: skipped on every run (NOT COVERED per traceability matrix; pytest.skip(reason="AC-7.x deferred — see traceability matrix")).
  • vins_mono: skipped on production-build runs (pytest.skip(reason="vins_mono is research-build-only per D-C1-1-SUB-A")).
  • Tier-2-only scenarios (NFT-PERF-01, NFT-LIM-01, NFT-PERF-03, NFT-LIM-04): skipped on Tier-1 with pytest.skip(reason="Tier-2 only — Jetson hardware required").
  • Chamber-only scenarios (AC-NEW-5 hot-soak portion): skipped on Tier-2 workstation-thermal runs; gated by --enable-chamber flag.
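
The first three skip rules above can be sketched as one pure function that conftest could call before deciding to skip. This is a sketch under assumptions: the function name and signature are hypothetical, any tier value other than "tier2-jetson" is treated as Tier-1, and any build_kind other than "research" is treated as a production build. The chamber rule is left out because the spec gives no reason string for it.

```python
from typing import Optional

# Tier-2-only scenario IDs, taken from the skip rules above.
TIER2_ONLY = {"NFT-PERF-01", "NFT-LIM-01", "NFT-PERF-03", "NFT-LIM-04"}


def skip_reason(test_id, traces_to, vio_strategy, tier, build_kind) -> Optional[str]:
    """Return the skip reason for a test, or None if it should run.

    traces_to is the list of AC identifiers the test maps to.
    """
    if any(ac.startswith("AC-7.") for ac in traces_to):
        return "AC-7.x deferred — see traceability matrix"
    if vio_strategy == "vins_mono" and build_kind != "research":
        return "vins_mono is research-build-only per D-C1-1-SUB-A"
    if test_id in TIER2_ONLY and tier != "tier2-jetson":
        return "Tier-2 only — Jetson hardware required"
    return None
```

A conftest hook would then pass a non-None result to pytest.skip(reason=...), which keeps the reason strings in one place for the CSV reporter to record.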

Test Data Fixtures

| Data Set | Source | Format | Used By |
|---|---|---|---|
| still-image-set-60 | bind mount from _docs/00_problem/input_data/ | JPEG + CSV GT | FT-P-01, FT-P-03, FT-P-05, FT-P-06, FT-P-15, FT-P-19, NFT-RES-03, NFT-PERF-04 |
| still-image-sat-refs-2 | same | PNG | FT-P-05, FT-P-19 |
| derkachi-fixture | bind mount | MP4 + CSV | FT-P-02, FT-P-04, FT-P-07, FT-P-10, FT-N-01..04, NFT-PERF-01..02, NFT-RES-01..04, NFT-LIM-02, NFT-LIM-04 |
| tile-cache-fixture | named volume built by tests/fixtures/tile-cache-builder/ | FAISS HNSW + tile filesystem | FT-P-01, FT-P-05, FT-P-15..17, FT-P-19, FT-N-05..06, NFT-LIM-03, NFT-PERF-01, NFT-PERF-04, NFT-SEC-01..02 |
| synth-age-tile-set | built from tile-cache-fixture by age-injector/ | tile filesystem with mutated manifest dates | FT-N-05, FT-N-06 |
| outlier-injection-derkachi | runtime-generated by injectors/outlier.py | tmpfs frame source | FT-N-01 |
| blackout-spoof-derkachi | runtime-generated by injectors/blackout_spoof.py | tmpfs + spoof injector on FC inbound | FT-N-04, NFT-RES-04 |
| multi-segment-derkachi | runtime-generated by injectors/multi_segment.py | tmpfs frame source | FT-P-08 |
| cold-boot-fixture | static JSON fixture in tests/fixtures/cold-boot/ | JSON pose snapshot | FT-P-11, NFT-PERF-03 |
| mavlink-passkey | tests/fixtures/secrets/mavlink-test-passkey.txt | 32-byte hex | FT-P-09-AP, NFT-SEC-03 |
| cve-jpeg-fixture | tests/fixtures/security/cve-2025-53644.jpg | crafted JPEG | NFT-SEC-04 |

Data Isolation

Per test-data.md § Data Isolation Strategy:

  • Each test runs against a fresh SUT container (docker compose restart between tests, OR --forked pytest mode for hermetic-critical scenarios).
  • tile-cache-fixture and input-data mounts are read-only — cross-contamination at the SUT input layer is impossible.
  • fdr-output volume is reset between tests (docker volume rm + recreate).
  • Synthetic-injection fixtures generate to per-test tmpfs; never written back to a persistent volume.
  • For Tier-2: same isolation discipline at the systemd-service level (systemctl restart); /var/azaion/fdr wiped between tests.

Test Reporting

Format: CSV (one row per test) — exactly per environment.md § Reporting.

Columns: test_id, test_name, traces_to, fc_adapter, vio_strategy, tier, started_at_utc, execution_time_ms, result, error_message, evidence_paths

Output path: e2e-results/run-${RUN_ID}/report.csv plus a per-run bundle of evidence at e2e-results/run-${RUN_ID}/evidence/ (assembled by evidence_bundler.py from .tlog files, FDR archives, screenshots, profiler traces, tegrastats CSV, jtop CSV).
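
The column contract above can be sketched with the standard csv module. This is a minimal illustration, not the csv_reporter.py plugin itself: write_report is a hypothetical helper, and how multi-valued fields such as traces_to and evidence_paths are encoded inside a cell is left open, because the spec does not state it here.

```python
import csv
import io

# Column order as listed in the Test Reporting section.
COLUMNS = [
    "test_id", "test_name", "traces_to", "fc_adapter", "vio_strategy",
    "tier", "started_at_utc", "execution_time_ms", "result",
    "error_message", "evidence_paths",
]


def write_report(rows, stream):
    """Write a header plus one CSV row per test.

    Rows with missing or extra keys are rejected so the report always has
    exactly the contracted columns.
    """
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if set(row) != set(COLUMNS):
            raise ValueError(f"row keys do not match report columns: {sorted(row)}")
        writer.writerow(row)


if __name__ == "__main__":
    example = {column: "" for column in COLUMNS}
    example.update(test_id="FT-P-01", tier="tier2-jetson", result="PASS")
    buffer = io.StringIO()
    write_report([example], buffer)
    print(buffer.getvalue())
```

Rejecting malformed rows at write time is one way to keep the Tier-1 and Tier-2 reports identical in shape, as the Constraints section requires.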

Acceptance Criteria

AC-1: Test environment starts (Tier-1)
Given a clean checkout of the repo
When cd e2e/docker && docker compose -f docker-compose.test.yml up --build --abort-on-container-exit e2e-runner is executed
Then all services in environment.md § Docker Environment start, the e2e-runner image builds, and pytest discovers ≥1 test file (sample test in e2e/tests/positive/test_smoke.py).

AC-2: Mock services respond
Given the test environment is running
When the e2e-runner POSTs a well-formed tile-publish JSON to mock-suite-sat-service
Then the service responds 202 and records the tile in its audit log; subsequent GET /mock/audit returns the recorded entry.

AC-3: SITL services accept SUT output
Given the test environment is running with a placeholder SUT that emits one valid GPS_INPUT (AP) AND one valid MSP2_SENSOR_GPS (iNav)
When the e2e-runner reads EK3_SRC1_POSXY from ardupilot-plane-sitl AND queries iNav GPS state via MSP from inav-sitl
Then both SITLs reflect the test-injected GPS source as primary.

AC-4: CSV report generated with required columns
Given the test runner executes
When the test run completes
Then e2e-results/run-${RUN_ID}/report.csv exists with exactly the columns from environment.md § Reporting, and a per-run evidence bundle exists at e2e-results/run-${RUN_ID}/evidence/.

AC-5: Egress isolation enforced
Given the test environment is running with e2e-net.internal: true
When the e2e-runner attempts a TCP connect to 1.1.1.1:443 from inside the SUT container
Then the connection fails (network-layer block); no DNS resolution succeeds for non-e2e-net names.
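
The connect attempt in AC-5 can be sketched as a small stdlib probe. The helper name can_connect and the 3-second default timeout are assumptions; how the probe is launched inside the SUT container is outside this sketch.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout_s.

    Refused connections, timeouts, unreachable networks, and failed name
    resolution all surface as OSError subclasses and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

On an isolated e2e-net, can_connect("1.1.1.1", 443) is expected to return False, and the same helper can confirm that in-network services such as mock-suite-sat-service stay reachable.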

AC-6: Tier-2 runner harness contract
Given a Jetson Orin Nano Super with the SUT installed via systemd
When ./e2e/jetson/run-tier2.sh --fc-adapter ardupilot --vio-strategy okvis2 --duration 5min is executed
Then the same CSV report format is produced at e2e-results/run-${RUN_ID}/report.csv, with tier=tier2-jetson for every row, and tegrastats + jtop per-sample CSVs land in the evidence bundle.

AC-7: Fixture builders are reproducible
Given a clean Docker volume state
When tests/fixtures/tile-cache-builder/build.sh runs
Then the same tile-cache-fixture content is produced bit-for-bit twice in a row (same FAISS index, same tile manifest hashes); same idempotency for age-injector and the static JSON fixtures.
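
One way to check the bit-for-bit claim in AC-7 is to hash the whole output tree in a deterministic order and compare the digests of two consecutive builds. The helper below is a sketch; tree_digest is a hypothetical name, and it covers file paths and contents only, not permissions or timestamps.

```python
import hashlib
import os


def tree_digest(root):
    """Return a SHA-256 hex digest over relative paths and file contents.

    Directories and files are visited in sorted order so that two trees
    with identical content always produce the same digest.
    """
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes os.walk descend deterministically
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(rel.encode("utf-8") + b"\0")
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            digest.update(b"\0")
    return digest.hexdigest()
```

A reproducibility test would run the builder twice into separate directories and assert that both calls to tree_digest return the same value.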

AC-8: Parameterization matrix coverage
Given the conftest sets up (fc_adapter, vio_strategy, tier, build_kind) parameterization
When CI runs the standard matrix
Then every produced report row has well-defined values for fc_adapter, vio_strategy, tier; vins_mono rows appear only on build_kind=research runs; Tier-2-only test_ids are SKIP on Tier-1 with the expected reason string.

AC-9: Skips per traceability matrix
Given the e2e-runner starts
When the discoverer encounters a test mapped to AC-7.1, AC-7.2, AC-NEW-5 chamber portion, AC-8.6 scene-change subset, RESTRICT-CAM-2, or RESTRICT-HW-2 chamber portion
Then those test rows show result=SKIP (or XFAIL for AC-8.6 scene-change PARTIAL) with an error_message referencing the traceability-matrix mitigation entry.

Constraints

  • Public-boundary discipline: the e2e-runner image MUST NOT import any module from the SUT source tree. The only legal interaction surfaces are MAVLink / MSP2 / HTTP / filesystem — same as a real consumer would have.
  • OpenCV pin: the runner image's OpenCV version MUST be >=4.12.0 (D-CROSS-CVE-1); pinned in e2e/runner/requirements.txt.
  • MAVLink-passkey provenance: the test passkey is a checked-in fixture explicitly labeled "TEST ONLY"; the production passkey path is /run/secrets/mavlink_passkey per environment.md and is never the test fixture.
  • Tier separation: Tier-1 and Tier-2 produce IDENTICAL CSV row formats so the same downstream tooling (badge generators, regression detectors) can consume both.
  • No internal state probes: no test may read SUT internal state (GTSAM iSAM2 graph, FAISS in-memory index, internal Python/C++ buffers, logger handles). Only public boundaries + FDR archive + SITL observation are legal evidence sources.
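
The public-boundary constraint can be checked mechanically by scanning runner sources for imports of SUT modules. The sketch below uses the standard ast module. The package name gps_denied_onboard is an assumption, since the spec does not name the SUT's Python package, and forbidden_imports is a hypothetical helper.

```python
import ast

# Hypothetical top-level package name of the SUT; adjust to the real one.
FORBIDDEN_ROOTS = frozenset({"gps_denied_onboard"})


def forbidden_imports(source, forbidden=FORBIDDEN_ROOTS):
    """Return a sorted list of forbidden modules imported by a source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        found.update(n for n in names if n.split(".")[0] in forbidden)
    return sorted(found)
```

Run over every .py file under e2e/, a non-empty result would fail the build. Dynamic imports through importlib are not caught by this static check.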

Risks & Mitigation

Risk 1: Tier-1 runner image build slow / large

  • Risk: pulling tensorrt, gtsam, faiss-gpu, opencv-python>=4.12.0 plus dev dependencies into a single image bloats the e2e-runner build to ≥30 min and ≥10 GB.
  • Mitigation: the e2e-runner image is separate from the SUT image — the runner only needs ground-side libs (pymavlink, msp_gps_toy, opencv-python, numpy, scipy, geopy, pytest). The SUT image is what carries the heavy ML stack. Keep the runner image lean (target ≤2 GB).

Risk 2: SITL containers flaky / non-deterministic timing

  • Risk: ardupilot-plane-sitl and inav-sitl boot times vary; tests may race the SITL's parameter-init phase.
  • Mitigation: a conftest fixture polls SITL readiness via a known parameter read (e.g., EK3_ENABLE on ArduPilot) before any test runs. Failure to reach readiness within 60 s fails the SITL fixture, not the individual test.
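
The readiness gate described in this mitigation can be sketched as a generic polling loop. The probe callable is an assumption: in practice it would wrap a parameter read against the SITL (for example through pymavlink), which is not shown here. The clock and sleep arguments are injectable only so the loop can be tested without waiting.

```python
import time


def wait_until_ready(probe, timeout_s=60.0, interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True; raise TimeoutError after timeout_s.

    OSError raised by the probe (e.g. connection refused while the SITL is
    still booting) is treated as "not ready yet" rather than as a failure.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return
        except OSError:
            pass
        if clock() >= deadline:
            raise TimeoutError(f"SITL not ready after {timeout_s} s")
        sleep(interval_s)
```

Calling this from a session-scoped fixture means a slow SITL boot surfaces once as a fixture error instead of as many unrelated test failures.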

Risk 3: Mock Suite Sat Service drift from D-PROJ-2 contract

  • Risk: when the real Suite Sat Service ingest contract ships (D-PROJ-2), the mock may diverge silently.
  • Mitigation: the mock's request/response schema is sourced from the contract sketch in _docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md; a contract test in NFT-SEC-01 asserts the mock's accepted-fields match that sketch. When the real endpoint ships, the mock is retired (per F7 in traceability matrix).

Risk 4: --storage-opt size=64g not portable

  • Risk: Docker's --storage-opt size=64g for volumes requires specific storage drivers (overlay2 with xfs backing); may not work on all CI runners.
  • Mitigation: fallback strategy in the docker-compose: if the volume cannot be size-capped at the Docker layer, the SUT enforces the cap internally per AC-NEW-3 and NFT-LIM-02 verifies via volume-size sampling. CI runner config flagged in the runner README.
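
The volume-size sampling mentioned in this fallback can be sketched as a directory walk that sums file sizes. Assumptions: the helper names are hypothetical, "64 GB" is read as 64 × 10^9 bytes (switch to 64 × 2^30 if the cap is defined in GiB), and the measure is logical file size, not allocated disk blocks.

```python
import os

# Assumed interpretation of the 64 GB cap; see the note above.
FDR_CAP_BYTES = 64 * 10**9


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                continue  # file rotated away between listing and stat
    return total


def within_cap(root, cap_bytes=FDR_CAP_BYTES):
    """Return True if the tree under root is at or below the cap."""
    return dir_size_bytes(root) <= cap_bytes
```

Sampling within_cap on the FDR mount at intervals during a run gives NFT-LIM-02 a check that does not depend on the storage driver.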

Risk 5: cve-jpeg-fixture license / distribution

  • Risk: PoC JPEG for CVE-2025-53644 may have unclear redistribution terms.
  • Mitigation: license-checked at fixture-import time; if license unclear, the fixture is generated programmatically following the published PoC structure (no copyrighted bytes); generation script is itself part of tests/fixtures/security/.

Document Dependencies

  • _docs/02_document/tests/environment.md — full Docker environment spec (services, networks, volumes, secrets, ports)
  • _docs/02_document/tests/test-data.md — fixture sources, formats, isolation strategy, validation rules
  • _docs/02_document/tests/traceability-matrix.md — AC coverage map (drives the SKIP / XFAIL rules in conftest)
  • _docs/02_document/tests/blackbox-tests.md + performance-tests.md + resilience-tests.md + security-tests.md + resource-limit-tests.md — list of test categories that the e2e/tests/<category>/ folders mirror

Excluded

  • The SUT (gps-denied-onboard) container build — owned by the BUILD-side epics (E-CC-CONF / E-BOOT and per-component epics). The test infrastructure references the SUT image but does NOT build it.
  • Individual test scenario implementations (FT-P-*, FT-N-*, NFT-*), owned by the per-scenario tasks decomposed in Step 3.
  • The Suite Sat Service real endpoint contract — owned by the parent suite (D-PROJ-2); the mock here mirrors a sketch only.
  • The thermal chamber AC-NEW-5 hot-soak run — physical hardware, deferred per traceability matrix.
  • The AI-camera fixture (AC-7.x) — out of scope per Phase 1 gate.