Cycle-3 addendum captures the layered Jetson rerun progression:
synth time-base fix (AZ-614) drops offset_ms from 1.7e12 to -4334;
AZ-611 skip-auto-sync then crosses the AC-9 validator; AZ-602
build-flag completeness opens VideoFileFrameSource and
TlogReplayFcAdapter; composition root logs
'replay.compose_root.ready: auto_sync_used=false', then crashes
inside runtime_root.airborne_bootstrap because production main()
never builds c13_fdr / c6_* / c7_inference / c3_lightglue_runtime /
c3_feature_extractor / c2_82_ransac_filter into pre_constructed.
The bootstrap gap is filed as AZ-618 (Story under AZ-602). It
affects both live and replay binaries -- every prior Reality-Gate
run died at auto-sync before the composition graph was walked, so
the gap was hidden. The 38 compose_root unit tests pass only via
the replay_components_factory stub kwarg, which bypasses the
bootstrap entirely.
Autodev sub_step advances to phase 8
'az614-az611-landed-bootstrap-gap-discovered' pending the user's
decision on whether to start AZ-618 immediately or close out
Step 11 with the current Reality-Gate signal.
Co-authored-by: Cursor <cursoragent@cursor.com>
Records the first Jetson Tier-2 run results in the step-11 report:
17 pass / 5 fail / 1 skip / 1 xfail (24 total, 10m09s) — identical to
Colima because all 5 failures hit AZ-614 (tlog time-base mismatch)
BEFORE reaching the GPU. So the infrastructure is proven (image
builds, GPU exposed inside container, SUT subprocess runs to the
auto-sync stage) but the heavy ACs haven't yet exercised ALIKED /
DISK LightGlue. Fixing AZ-614 is the gating prerequisite to actually
drive the GPU stages.
Also captures lessons learned that are now in the setup doc:
* Only dustynv/l4t-pytorch:r36.4.0 is a usable Jetson PyTorch base
on Docker Hub for R36 / JetPack 6 (l4t-base deprecated, official
l4t-pytorch has no R36 tags).
* The dustynv image bakes a maintainer-LAN-only pip mirror into
/etc/pip.conf — must be wiped + --index-url pinned to pypi.org.
* pip 24.2 (image default) rejects gtsam-4.3a0 pre-release; pip 26.x
accepts the same wheel for `gtsam<5.0,>=4.2` because there are no
stable aarch64 builds. Upgrade pip in the build, don't relax pin.
* nvidia-container-runtime mounts nvidia-smi from host, so the GPU
smoke test needs only ubuntu:22.04 (80 MB), not l4t-jetpack (5 GB).
Autodev state advances to phase 7 / jetson-harness-online.
Co-authored-by: Cursor <cursoragent@cursor.com>
Multi-stage Ubuntu 22.04 e2e-runner image installs gps-denied-onboard
(editable) into /opt/venv so the AZ-404 replay tests can subprocess
gps-denied-replay against the Derkachi fixture. Image layout mirrors
the host repo (/opt/pyproject.toml + /opt/src + /opt/tests bind mount)
so Path(__file__).parents[3] resolves to /opt and AC-4's AST scan
finds the components dir.
Entrypoint now runs `pytest /opt/tests/e2e/` instead of the empty
`scenarios/` dir. The bootstrap harness collects 24 tests vs. 0 before.
Compose: e2e-runner env mirrors the companion service (FullSystemConfig
requirements) plus RUN_REPLAY_E2E=1, BUILD_REPLAY_SINK_JSONL=ON;
bind-mounts the Derkachi fixture dir; adds writable fdr-data /
tile-data volumes the SUT requires.
Reality Gate signal is now real: 17 pass / 5 fail / 1 skip / 1 xfail.
The 5 heavy-AC failures share root cause AZ-614 (tlog synth time-base
mismatch, surfaced by the now-functional harness).
Also archives the replayed leftover entries (csv_reporter -> AZ-601,
harness rehab -> AZ-602 epic + 11 child stories).
Co-authored-by: Cursor <cursoragent@cursor.com>
Attempted Path-3 (Full SITL with community images) for the SUT Reality
Gate. Discovered sitl_observer is offline-fixture replay, not a live
SITL client -- compose-file SITL services in environment.md are
aspirational. The real Path-3 needs the fixture builders + SUT CLI
end-to-end, which surfaced 5 additional integration drifts (H-10..H-14)
on top of the prior 9.
Fixes:
- tests/fixtures/calibration/adti26.json: body_to_camera_se3 was a
{rotation_xyzw, translation_xyz_m} dict; runtime_root/_replay_branch.py
loader strictly expects a 4x4 SE3. Identity quaternion + zero
translation = identity 4x4, semantically equivalent.
New files:
- tests/fixtures/replay_config_minimal.yaml: minimal replay-mode config
for harness reproduction (mode=replay, ardupilot_plane defaults).
- .gitignore: e2e/fixtures/sitl_replay/ (generated by build_p0X_fixtures).
Documentation:
- Step 11 report: appended Path-3 attempt section.
- Leftover doc: H-10..H-14 ticket payloads added.
- Autodev state: reflects Path-3 outcome.
Step 11 stays blocked; H-13 (auto-sync AC-8 hard-fails on stationary
fixtures) requires a SUT design decision and cannot be unilaterally
fixed mid-session.
Co-authored-by: Cursor <cursoragent@cursor.com>
Local Tier-1 pytest suite: 3343 pass / 88 skip / 0 fail across 12 chunks.
Docker harness SUT Reality Gate UNMET — both Tier-1 docker harnesses
(scripts/run-tests.sh and e2e/docker/run-tier1.sh) have pre-existing
drift that prevents them from running end-to-end. Findings:
H-1..H-3 (fixed in 6ce3158): dockerfile rename, fdr-output tmpfs cap,
e2e-results bind dir + gitignore.
H-4..H-6 (deferred): three SITL/MAVLink Docker Hub images don't exist
(ardupilot/mavproxy, ardupilot/ardupilot-sitl,
inavflight/inav-sitl). environment.md spec was
written against aspirational image names.
H-7..H-8 (deferred): tests/e2e/Dockerfile entrypoint points at empty
scenarios dir + doesn't install the SUT package.
H-9 (deferred): tile-cache-fixture seeder missing (relates to AZ-595).
Plus a regression caught and fixed mid-run: pytest-csv autoload
conflicts with our custom --csv flag (commit eb6dc17). Also surfaced a
false-positive batch-89 test-result report; proposed preventive
meta-rule pending user approval.
Step 11 marked status=blocked pending harness rehabilitation tickets
(payloads recorded in _docs/_process_leftovers/). Full outcome report:
_docs/03_implementation/run_tests_step11_report.md.
Co-authored-by: Cursor <cursoragent@cursor.com>
Final test-implementation report written at
_docs/03_implementation/implementation_report_tests.md. All 41
blackbox-test tasks (AZ-406..AZ-446) under epic AZ-262 are done.
Full-suite gate handed off to .cursor/skills/test-run/SKILL.md per
implement skill Step 16.
Co-authored-by: Cursor <cursoragent@cursor.com>
Batch 89 — adds optional `band`, `ci95_low`, `ci95_high` kw-only
parameters to `_NfrRecorder.record_metric` and emits a new per-metric
report.csv artifact (one row per scenario × metric, columns:
scenario_id, metric_name, value, value_band, ci95_low, ci95_high,
ac_id, outcome). Backwards compatible — existing 4-arg callers
unchanged; unbalanced ci95 pair raises ValueError. report.csv is
written once per pytest session from `pytest_sessionfinish` so the
annotation pass runs once per CI invocation regardless of
(fc_adapter, vio_strategy) (AC-3). `regression-baseline.json`
intentionally kept flat to preserve the diff contract used by
regression-detection tooling.
NFT-RES-03 + NFT-PERF-01 scenarios updated to pass real bands and
compute empirical 2.5/97.5-percentile ci95 from their own sample
streams (per-iteration envelope ratios for Monte Carlo,
per-frame latency samples for N-sample latency).
Tests: 1229 e2e/_unit_tests pass (+6 vs. batch 88 for AZ-446
band/CI behavior, value-error on unbalanced ci95, report.csv columns,
explicit-path override, and end-to-end emission via the pytest
plugin). Code review: PASS_WITH_WARNINGS — 1 Low (empirical-CI
semantics, documented inline), 1 Medium carried over from batch 88's
cumulative-review backlog (write_csv_evidence + _resolve_fixture_path
duplication is outside AZ-446 reporting scope).
This commit closes Step 10 Implement Tests for cycle 1 (41 of 41
blackbox-test tasks done, AZ-406..AZ-446). Greenfield auto-chains to
Step 11 Run Tests next.
Co-authored-by: Cursor <cursoragent@cursor.com>
Batch 88 — adds four resource-limit blackbox scenarios + pure-logic
helpers + unit tests:
- NFT-LIM-01 Jetson memory (AC-NEW-13): tier2_only; Plan A/B budgets;
AC-4 OOM-event scan; 30 s warm-up window; VmRSS + tegrastats streams.
- NFT-LIM-02 FDR size (AC-7.3): 30 min → 8 h linear extrapolation
against 50 GiB; ±60 s replay-window slack for AC-1.
- NFT-LIM-03+05 storage (AC-7.4 + AC-NEW-12 + RESTRICT-STORAGE):
aggregate ≤ 100 GiB across tile-cache + tile-cache-write +
fdr-output; thumbnail-log < 1 GiB strict 8 h-extrapolated.
- NFT-LIM-04 thermal (AC-NEW-5 PARTIAL): tier2_only; CPU/SoC p99
≤ T_throttle − 5 °C; throttle-event scan; PARTIAL annotation written
to traceability-status.json. Thresholds fixture lives at
e2e/fixtures/jetson/thermal-thresholds.json (moved from the
task spec's suggested tests/fixtures/ path so the file stays
inside the blackbox_tests Owns: e2e/** envelope).
All four helpers are public-boundary-only (no src/gps_denied_onboard
imports). Scenarios skip cleanly in the Tier-1 docker harness pending
AZ-595 (SITL replay builder) for the four shared fixture inputs and
AZ-444 (Tier-2 Jetson runner) for the tier2_only scenarios.
Code review: PASS_WITH_WARNINGS (0/0/2/1). Both Mediums are
carried-over write_csv_evidence + _resolve_fixture_path duplication,
deferred to AZ-446 (batch 89). Low is the self-resolved AZ-443 fixture
ownership drift documented in the review.
Tests: 1223 e2e/_unit_tests passing (+1 vs. batch 87 from the new
directory-layout entry); 24 resource_limit scenarios collect and skip
cleanly under runner/pytest.ini.
Co-authored-by: Cursor <cursoragent@cursor.com>
FT-P-15: parse FDR `cache-self-check` records; assert every tile-manifest
entry has CRS, tile_matrix, dimension, m_per_px, capture_date, source,
compression; m_per_px >= 0.5 (or rejected by FDR `tile-load-rejected`).
FT-P-16: read `docker network inspect e2e-net` + `docker inspect <sut>`
snapshots; assert `Internal == true` AND SUT attached only to e2e-net.
The 0-egress semantic of AC-8.3 is enforced structurally.
FT-P-18: walk FDR + tile-cache, probe JPEG dimensions via stdlib SOF
parser, reject any file matching nav-camera raw pattern (5472x3648 or
880x720). Extrapolate thumbnail-log size to 8h; assert < 1 GB.
Adds runner.helpers.tile_cache_inspector with five evaluators
(manifest schema, offline mode, raw-frame detection, thumbnail budget,
JPEG dimension probe) + walk_files helper. Pure-logic coverage: 43
new unit tests; full e2e/_unit_tests/ suite 793 passing (was 746).
Scenarios skip locally when SITL replay fixture or docker-inspect
env vars are missing; production hooks (cache-self-check FDR record,
tile-load-rejected events, docker-inspect snapshots) are tracked
outside this task.
See _docs/03_implementation/batch_82_report.md +
reviews/batch_82_review.md.
Co-authored-by: Cursor <cursoragent@cursor.com>
Archive AZ-420 to done/; add cumulative review for batches 79-81 (PASS,
no new findings); advance autodev state to await batch 82.
Co-authored-by: Cursor <cursoragent@cursor.com>
FT-P-12: parse mavproxy-listener tlog over a 60 s Derkachi replay and
assert SUT->GCS GLOBAL_POSITION_INT cadence lands in [1, 2] Hz (AC-6.1).
FT-P-13: inject `RELOC:<lat>,<lon>,<radius_m>` STATUSTEXT while the SUT
is in dead_reckoned; verify FDR `c8.gcs.operator_command` ack <=2s,
`anchor_search_region` centre shifts toward the hint, and no
BAD_SIGNATURE / UNAUTHORIZED / REJECTED STATUSTEXT lands in the
post-inject window (AC-6.2).
Adds runner.helpers.gcs_telemetry_evaluator (rate, hint-ack correlation,
haversine search-region shift, rejection scan) and
sitl_observer.capture_gcs_tlog (parity surface to capture_ap_tlog).
Pure-logic coverage: 39 new unit tests; full e2e/_unit_tests/ suite
746 passing (was 700). Scenarios skip locally on missing SITL replay
fixture; production hooks (inbound STATUSTEXT parser, anchor_search_region
FDR emitter) tracked outside this task.
See _docs/03_implementation/batch_81_report.md +
reviews/batch_81_review.md.
Co-authored-by: Cursor <cursoragent@cursor.com>
Phase 1: extend sitl_observer with cursor-based `wait_for_outbound`
returning `OutboundMessage` from `outbound_messages_<fc_kind>_<host>.json`
fixtures. Three outcomes: message, TimeoutError (null entries), or
RuntimeError (missing/malformed). Fix FT-P-01 + FT-P-05 scenarios to
use `fc_kind=` kwarg.
Phase 2: FT-P-01 vertical-slice fixture builder under
`e2e/fixtures/sitl_replay_builder/`. Reuses the production
`gps-denied-replay` CLI + `ReplayInputAdapter`: encode 60 stills as
1 fps MP4 + synthetic stationary tlog (pymavlink); run replay;
project FDR outbound estimates into the schema. Avoids the
13+ cp of SUT-side frame-ingestion that a live-SITL-capture path
would have required. Live execution remains a manual operator step.
+35 unit tests (664 total, up from 637). K=3 cumulative review for
b76-b78 documents the offline-replay arc convergence.
Co-authored-by: Cursor <cursoragent@cursor.com>
Add `runner/helpers/replay_mode.py` (NullFrameSink, NullFcInboundEmitter,
default_frame_period_ms, load_replay_json, resolve_replay_subdir,
imu_replay_noop) and rewire all 13 scenarios off their local
`_resolve_*` / `_drive_*` / `_push_*` NotImplementedError stubs.
Closes the offline FDR-replay execution path. `grep raise
NotImplementedError` under `e2e/tests/` now returns zero matches. +17
unit tests (626 total, up from 608). Unit-test behaviour unchanged
(scenarios still skip via b75 sitl_replay_ready gate when
E2E_SITL_REPLAY_DIR is unset).
Co-authored-by: Cursor <cursoragent@cursor.com>
Add `runner/helpers/fc_proxy_runtime.py` wrapping the existing
`BlackoutSpoofProxy` (AZ-406) with a scenario-facing `drive_fc_proxy`
entry point. FDR-replay mode only: loads `schedule.json`, optionally
activates the proxy against a caller clock for alignment verification,
and writes a `proxy_drive_report.json` audit record into
`${E2E_SITL_REPLAY_DIR}` for downstream evaluators.
Replaces the local `_drive_fc_proxy` stub in FT-N-04. Adds 3
@property accessors on `BlackoutSpoofProxy` so the wrapper does not
reach into private attributes. +11 unit tests (608 total, up from
596). Live-mode router wiring remains out of scope (future ticket).
Co-authored-by: Cursor <cursoragent@cursor.com>
Implement all 11 `sitl_observer` public surfaces as an offline
FDR-replay strategy (reads JSON fixtures under `${E2E_SITL_REPLAY_DIR}`
instead of live pymavlink/yamspy). Replace 12 per-scenario
`_harness_helpers_implemented` probes with one shared session-scoped
`sitl_replay_ready` fixture in `e2e/tests/conftest.py`.
Net: -636 LoC of duplicated scenario gating, +17 LoC shared fixture,
+38 new unit tests (596 total, up from 558). Includes K=3 cumulative
review for batches 73-75 (PASS).
Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces the NotImplementedError stubs AZ-406 reserved on three runner-
side helpers; these were stranded from any tracker ticket since
AZ-407/408 never came back to fill them. Concrete bodies:
* fdr_reader.iter_records: JSONL parser + wire-envelope validator;
recursive *.jsonl walk; projects {schema_version, ts, producer_id,
kind, payload} to runner-side FdrRecord with record_type/monotonic_ms
renames; yields oldest-first.
* frame_source_replay.replay_video: OpenCV VideoCapture decode + JPEG
re-encode; auto-detects file vs directory; injectable sleep_fn for
unit-test pacing.
* imu_replay.ImuReplayer.replay: csv.DictReader parse; degrees->radians
attitude conversion; tolerates scientific notation; same sleep_fn
injection pattern.
Adds 34 unit tests (14 + 10 + 10). Full e2e unit suite: 558 passed (+31).
Existing scenario _harness_helpers_implemented probes still return False
because they also depend on sitl_observer / fc_proxy_runtime stubs that
remain pending; scenario probe cleanup is out of AZ-594 scope.
Co-authored-by: Cursor <cursoragent@cursor.com>
Three blackbox-harness tasks landed together — all depend only on
AZ-406 and unblock the FT-* / NFT-* scenario tasks scheduled for
batches 69+.
AZ-407 — Static fixture builders (3pt):
* tile-cache-builder/{builder.py, Dockerfile, build.sh} produces a
deterministic tile-cache-fixture Docker volume from
_docs/00_problem/input_data/. Reproducibility primitives: sorted
iteration, frozen PIL JPEG settings, FAISS HNSW32 built single-
threaded with seeded stub descriptors.
* age-injector/{age_injector.py, inject.sh} clones the volume and
shifts capture_date by N×30.44 days; tile JPEG bytes preserved
bit-identical. Emits synth-age-7mo + synth-age-13mo volumes.
* cold-boot/cold_boot_fixture.json: frozen FC pose snapshot at
Derkachi sector centre, schema v1.
* secrets/mavlink-test-passkey.txt: 64-hex with required
`# TEST ONLY` header line per AC-5. Passkey-equality test now
compares the secret line after stripping the header.
* security/cve-2025-53644.jpg: synthetic 158-byte malformed JPEG
(truncated SOS marker). OpenCV 4.11.x rejects gracefully with
imdecode → None. AZ-439 will sharpen for ASan instrumentation.
* Top-level Makefile with `make fixtures` / `make fixtures-*` /
`make e2e-tier1*` / `make unit-tests` targets.
AZ-444 — Tier-2 Jetson harness wrapper (5pt):
* run-tier2.sh rewritten as orchestrator. Detects local
(aarch64 + TIER2_HOST=localhost) vs remote (ssh into TIER2_HOST).
New flags: -k/--selector, --build-kind production|asan,
--reflash (gated behind TIER2_REFLASH_ACK=1 two-key gate),
--dry-run.
* tier2-on-jetson.sh (new) — on-device delegate. Verifies
gps-denied-onboard{,-asan}.service health; restarts with 5s
tolerance; spawns tegrastats + jtop parallel samplers; tails
ASan unit's journal in asan mode; drives docker compose with
TIER=tier2-jetson; forwards SELECTOR to pytest -k.
* docker/run-tier1.sh (new) — selector-parity sibling.
* AC-1 (selector parity) and AC-6 (reflash gating) unit-tested via
--dry-run output assertions. AC-2/AC-3/AC-4/AC-5 are hardware-
loop ACs verified by the Tier-2 runtime smoke (no Jetson in the
unit-test layer).
AZ-445 — CSV reporter + evidence bundler refinements (2pt):
* reporting/nfr_recorder.py (new) — pytest plugin. Provides the
`nfr_recorder` fixture with record_metric(name, value, ac_id)
and partial(ac_id, reason). At session end emits:
- per-nfr/<scenario_id>.json (AC-1)
- traceability-status.json with every AC ID parsed from
traceability-matrix.md, classified Covered/PARTIAL/NOT
COVERED with source scenario IDs (AC-2)
- regression-baseline.json with all numeric metrics (AC-3)
* csv_reporter.py extended — `_outcome_to_result` consults the
aggregator; rows flip PASS → PARTIAL when an AC was marked
PARTIAL by nfr_recorder (AC-4). Graceful fallback when
aggregator isn't registered (unit-test contexts).
* conftest.py registers nfr_recorder in pytest_plugins.
* New --traceability-matrix CLI flag seeds the NOT COVERED rows.
Build / config:
* pyproject.toml dev extras: added Pillow>=10.4,<13.0 for the
tile-cache-builder unit test (broad enough to keep torchvision's
Pillow 12 pin happy; the production builder runs inside its own
Docker image with its own pin).
* Updated test_directory_layout.py to cover 10 new files + replaced
the byte-equal passkey assertion with the header-stripping
variant.
Test results:
* 157 focused tests pass (was 97 in batch 67; +60 new across this
batch). No regressions.
Module-layout / spec drift:
* AZ-407 spec text says `tests/fixtures/...`; module-layout
blackbox_tests entry (commit d7a17a8) authoritatively places the
harness under `e2e/`. Implementation followed the layout entry.
* AZ-444 spec mentions `e2e/tier2/run-tier2.sh`; AZ-406 placed it
at `e2e/jetson/run-tier2.sh`. Kept at `e2e/jetson/` for
consistency.
* Cold-boot README ownership: corrected from AZ-419 to AZ-407 per
AZ-419's own Dependencies field.
Specs archived to _docs/02_tasks/done/. Jira tickets transitioned to
In Testing on commit.
Co-authored-by: Cursor <cursoragent@cursor.com>
Bootstraps the public-boundary blackbox test harness owned by epic
AZ-262 (E-BBT). Establishes the e2e/ directory tree at the repo root,
fully separated from src/gps_denied_onboard/** and from the in-process
tests/** tree, and commits to the contracts every subsequent test
ticket (AZ-407..AZ-446) builds against.
Tier-1 (workstation Docker):
- docker/docker-compose.test.yml wires SUT + ArduPilot SITL + iNav SITL
+ mock Suite Sat Service + mavproxy listener + e2e-runner onto one
e2e-net bridge with internal: true (enforces RESTRICT-SAT-1 /
NFT-SEC-02 egress isolation at the network layer).
- docker/docker-compose.tier2-bridge.yml override disables the in-
compose SUT so Tier-2 pairs SITLs + mock + runner on an x86 host
while the SUT runs natively on the Jetson under systemd.
Tier-2 (Jetson):
- jetson/run-tier2.sh + tier2.service systemd unit + tegrastats /
jtop parsers feed per-sample telemetry into the evidence bundle.
Runner image (e2e/runner/):
- Dockerfile + requirements.txt install ONLY ground-side libs
(pymavlink, opencv-python>=4.12, numpy/scipy/geopy/pyproj, httpx,
orjson, pydantic, structlog, pytest 8.x). The runner deliberately
does NOT install the SUT package.
- conftest.py implements the AC-9 skip-rule mapping (tier2_only,
chamber_only, vins_mono, deferred_ac) tied to environment.md
parametrize axes.
- reporting/csv_reporter.py is a pytest plugin emitting one row per
test with the exact 11-column schema from environment.md §
Reporting (test_id, test_name, traces_to, fc_adapter, vio_strategy,
tier, started_at_utc, execution_time_ms, result, error_message,
evidence_paths). XFAIL surfaced only when a test carries
@pytest.mark.deferred_ac(verdict="xfail", reason=...).
- reporting/evidence_bundler.py exposes the attach_evidence fixture
that copies per-test artifacts (.tlog, FDR archives, screenshots,
tegrastats / jtop CSVs) into the run bundle and records relative
paths into the reporter's evidence_paths column.
- helpers/{frame_source_replay,imu_replay,sitl_observer,
mavproxy_tlog_reader,fdr_reader}.py declare the public surfaces
(concrete implementations owned by AZ-407 / AZ-408 / AZ-416 /
AZ-417 / AZ-441 per the dependency table); helpers/geo.py ships
today (no downstream task dep) — WGS84 distance / forward-bearing
/ offset via pyproj with NaN rejection.
Mock Suite Sat Service (e2e/fixtures/mock-suite-sat/):
- FastAPI app: POST /tiles (ingest contract from D-PROJ-2 follow-up),
GET /tiles/audit + /mock/audit (per-run read-back), POST
/mock/config (force-status, response delay), POST /mock/reset
(clears audit between tests), GET /mock/health.
Fixture scaffolds (e2e/fixtures/{tile-cache-builder, age-injector,
injectors, cold-boot, secrets, security}/):
- Public surfaces only. Concrete builders land in AZ-407 (static
fixtures), AZ-408 (runtime synthetic injection), AZ-419 (cold-boot
fixture), AZ-439 (CVE-2025-53644 JPEG generator).
Test tree (e2e/tests/{positive,negative,performance,resilience,
security,resource_limit}/):
- Mirror of the test-spec category grouping in
_docs/02_document/tests/*-tests.md.
- tests/positive/test_smoke.py is the AC-1 harness-boot smoke run
inside the e2e-runner image once Docker brings everything up.
Out-of-container unit tests (e2e/_unit_tests/):
- Exercises the harness internals (CSV reporter plugin lifecycle,
conftest skip rules, helper modules, parsers, mock app, compose
YAML structural contract, public-boundary enforcement) without
Docker / SITL. 97 unit tests, all passing.
Build / config:
- pyproject.toml: testpaths extended with e2e/_unit_tests; pythonpath
extended with e2e; fastapi>=0.111,<0.120 added to dev extras for the
mock-app TestClient unit test.
AC coverage:
- AC-1 (Tier-1 boot) → compose YAML test + directory layout
+ smoke test (Docker-bound)
- AC-2 (mock services) → 6 FastAPI TestClient unit tests
- AC-3 (SITLs accept output) → contract present; concrete check
deferred to AZ-416 / AZ-417
- AC-4 (CSV columns) → in-process plugin lifecycle test
emits the exact 11-column schema
- AC-5 (egress isolation) → static config test + runtime probe
in Docker-bound smoke
- AC-6 (Tier-2 contract) → tegrastats + jtop parser unit tests
+ jetson/* layout test; full Tier-2
contract is AZ-444
- AC-7 (fixture reproducibility) → deferred to AZ-407 per task spec
- AC-8 (parametrize matrix) → vins_mono skip-rule cases +
tests/positive/test_smoke
- AC-9 (skip semantics) → 9 conftest skip-rule unit tests
Module layout entry for blackbox_tests was added in 2026-05-16
preparatory commit d7a17a8 so this diff stays focused on the harness
scaffold. AZ-406 advances to In Testing on commit.
Co-authored-by: Cursor <cursoragent@cursor.com>
The 41 blackbox/e2e test tasks (AZ-406..AZ-446 under epic AZ-262) all
declare Component=Blackbox Tests, but module-layout.md had no matching
Per-Component Mapping entry. The implement skill's Step 4 (File
Ownership) requires every batch's component to be resolvable in
module-layout.md.
Add a `blackbox_tests` entry in the Shared / Cross-Cutting section
that owns the top-level `e2e/` directory (separate from `tests/`),
documents the public-boundary discipline (no SUT imports), and
clarifies that boundary-driven performance/resilience/security
scenarios live under `e2e/tests/<category>/` rather than under
`tests/perf|security|resilience/`.
Also update Layout Rule #7 to reflect the harness split and the
state file's sub_step to parse-and-detect-progress (Step 10 entry).
Co-authored-by: Cursor <cursoragent@cursor.com>
41 blackbox test task specs (AZ-406..AZ-446) under epic AZ-262 already
exist in _docs/02_tasks/todo/. Dependencies table reflects them
(155 = 114 product + 41 test, 133 blackbox-test pts).
tests/e2e/conftest.py + tests/e2e/Dockerfile placeholders confirm the
bootstrap was decomposed in a prior pass.
Folder fallback for Step 9 is satisfied. No new work executed.
State advanced to Step 10 (Implement Tests) — session boundary per
greenfield flow; suggest fresh conversation before continuing.
Co-authored-by: Cursor <cursoragent@cursor.com>
Autodev greenfield Step 8 closes with outcome
"Code is testable — no changes needed" after reviewing the 41 test
scenarios in _docs/02_document/tests/ against the codebase against the
Step-8 allowed-changes checklist.
Key findings:
- Hardcoded paths are config defaults, overridable via Config dataclass
- All mutable registries expose clear_*_registry()/_reset_for_tests()
- Hot-path timing uses injected Clock; cosmetic timestamps are
monkeypatch-safe (2105-test unit suite proves it)
- Heavy strategies (OKVIS2, VINS-Mono, FAISS, TRT) are BUILD_* gated
- compose_root(pre_constructed=...) (AZ-591) is the Tier-1 injection
seam; tests/e2e/replay already drives it end-to-end
Artifacts:
- _docs/04_refactoring/01-testability-refactoring/
testability_assessment.md
- State advanced to Step 9 (Decompose Tests)
- last_step_outcomes.step_8 recorded
Co-authored-by: Cursor <cursoragent@cursor.com>
Batch 66 — fixes the production gap surfaced during the cycle-1
completeness-gate post-mortem: the central _STRATEGY_REGISTRY was
empty in production source, so compose_root() raised
StrategyNotLinkedError on the first component lookup and the
airborne binary couldn't reach takeoff.
Changes:
- New module `src/.../runtime_root/airborne_bootstrap.py` exposes
`register_airborne_strategies()` and a documented
`AIRBORNE_REQUIRED_PRE_CONSTRUCTED_KEYS` table. The function
registers 14 entries into the central registry across 7
strategy-selecting slots (c1_vio + c2_vpr + c2_5_rerank +
c3_matcher + c3_5_adhop + c4_pose + c5_state). Per-slot wrappers
adapt the registry-factory signature (config, constructed) to each
per-component factory's kwarg surface and surface a
AirborneBootstrapError when a required infrastructure dep is
missing from constructed.
- `compose_root` gains a `pre_constructed` kwarg in live mode,
symmetric with the replay-mode seam. Replay entries still take
precedence on key collision (ADR-011). Existing callers unaffected
(kwarg defaults to None).
- `runtime_root/__init__.py::main()` now calls
`register_airborne_strategies()` before `compose_root(config)` so
production binaries no longer crash at the registry-lookup step.
- Lazy-loading preserved: state_factory's private _STATE_REGISTRY is
populated lazily inside the c5_state wrapper, gated by
BUILD_STATE_GTSAM_ISAM2 / BUILD_STATE_ESKF env flags. pose_factory's
own lazy-import fallback handles c4_pose without an explicit
register() call.
- 7 new unit tests in `tests/unit/runtime_root/test_az591_airborne_\
bootstrap.py` cover AC-1..AC-5 plus the negative-path
AirborneBootstrapError contract. Full unit suite 2105 passed / 88
environment-gated skips / 0 failures.
End-to-end takeoff still needs a follow-up task to wire infrastructure
pre-construction (c13_fdr / c6_* / c7_inference / etc.) into the
pre_constructed dict passed to compose_root. That follow-up is gated
by AZ-591 landing first; recommended split into per-component
infrastructure-prep tasks (3pt each).
Co-authored-by: Cursor <cursoragent@cursor.com>
Cycle 1 Product Implementation Completeness Gate post-mortem.
AZ-589 + AZ-590 were the wrong abstraction:
- AZ-589 targeted `okvis::ThreadedKFVio` (OKVIS v1 API) which does
not exist in the vendored OKVIS2 upstream; smartroboticslab/okvis2
exposes `okvis::ThreadedSlam` instead.
- AZ-590 assumed a "de-ROSified VINS-Mono pin" submodule exists;
`cpp/vins_mono/upstream/` has no `.gitmodules` entry.
- The actual production gap is the empty central
`_STRATEGY_REGISTRY`: `register_strategy(...)` is never called
outside test fixtures, so `compose_root()` raises
`StrategyNotLinkedError` for every component slug with a
strategy-selecting config field. Affects c1_vio + c2_vpr +
c2_5_rerank + c3_matcher + c3_5_adhop + c4_pose + c5_state.
Re-classification:
- AZ-589 + AZ-590 closed Won't Fix (Jira); spec files removed
from todo/ but rows retained in the dependencies table as
audit-trail.
- AZ-591 created (todo/, 5pt) — cross-cutting compose_root
per-binary bootstrap that populates `_STRATEGY_REGISTRY` for
the airborne binary. Scheduled as Batch 66 sole task.
- AZ-592 created (backlog/, 5pt placeholder) — AZ-332 Tier-2
validation bundle (real `okvis::ThreadedSlam` wiring + Linux CI
apt-install + DBoW2 vocab + Jetson). BLOCKED on Tier-2
prerequisites; honors AZ-332's `AZ-332_tier2_validation`
self-deferral handle.
- AZ-593 created (backlog/, 5pt placeholder) — AZ-333 Tier-2
validation bundle (de-ROSified VINS-Mono upstream + binding +
CI + Jetson). BLOCKED on upstream vendoring decision plus
Tier-2 prerequisites; honors AZ-333's parallel deferral pattern.
- AZ-332 + AZ-333 re-classified in cycle1 gate report from FAIL
to BLOCKED-on-Tier-2.
Step 7 stays in_progress until AZ-591 lands; after that it can
advance to Step 8 with AZ-592 + AZ-593 parked in backlog/.
Co-authored-by: Cursor <cursoragent@cursor.com>
The Product Implementation Completeness Gate (cycle 1, 2026-05-16)
audited 107 done product tasks. 105 PASS / 0 BLOCKED / 2 FAIL.
FAIL findings — both AZ-332 (OKVIS2) and AZ-333 (VINS-Mono) ship a
real Python facade + AC-tested fake backend, but their native pybind11
bindings (_native/okvis2_binding.cpp, _native/vins_mono_binding.cpp)
are skeletons: _build_estimator() sets estimator_built_ = false; the
first add_frame() raises *FatalException("estimator not yet wired").
Production-default VIO and the comparative-study path both crash on
the first nav-camera frame.
Remediation tasks created in _docs/02_tasks/todo/:
- AZ-589 remediate_okvis2_threadedkfvio_wiring (5pt)
- AZ-590 remediate_vins_mono_estimator_wiring (5pt)
Both tasks also seed the per-binary bootstrap register_strategy() call
sites — the existing strategy registry in runtime_root/__init__.py is
never invoked in src/ today.
Artifacts:
- _docs/03_implementation/implementation_completeness_cycle1_report.md
- _docs/02_tasks/todo/AZ-589_remediate_okvis2_threadedkfvio_wiring.md
- _docs/02_tasks/todo/AZ-590_remediate_vins_mono_estimator_wiring.md
- _docs/02_tasks/_dependencies_table.md (+2 rows; totals refreshed)
- _docs/_autodev_state.md (Step 7 phase 1 parse;
current_batch: 66)
Returning to implement-skill Step 1 to parse Batch 66 against these
remediation tasks (per Step 15 option A).
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds an opt-in C5-internal orthorectifier (`_orthorectifier.py`) that
emits at most one tile-aligned JPEG candidate per nav frame to the
C6 `TileStore.write_tile` API. Quality gates fire before any
OpenCV work: covariance Frobenius, inlier floor, source-label
(`SATELLITE_ANCHORED` only), and once-per-frame rate limit.
Cross-component import rule (AZ-507) is preserved: c5_state never
imports c6_tile_cache. `runtime_root.state_factory` carries a new
`_C6MidFlightIngestAdapter` that builds the canonical
`TileMetadata` (`ONBOARD_INGEST` / `FRESH` / `PENDING`), hashes
the JPEG, and translates `FreshnessRejectionError` to a `None`
return so the orthorectifier silently swallows freshness
rejection per AC-NEW-3.
Wiring is opt-in via `C5StateConfig.orthorectifier.enabled`;
existing tests/binaries default to disabled and are unaffected.
Both `GtsamIsam2StateEstimator` and `EskfStateEstimator`
participate through new `attach_orthorectifier` /
`set_latest_nav_frame` extension methods (Protocol surface
unchanged).
Tests: 22 new unit tests cover AC-1..AC-9 plus inlier-floor
gate plus the composition-root adapter. 216/216 c5_state and
38/38 runtime-root + compose tests pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
All FC adapter outbound MAVLink bytes now go through the AZ-401
MavlinkTransport seam (NoopMavlinkTransport in replay,
SerialMavlinkTransport in live). New helpers in
_outbound_mavlink_payloads.py extract encode/pack/seq-bump so the four
AP _send sites and the iNav statustext _send site become
encode -> pack -> transport.write. TlogReplayFcAdapter emits real
AP-shape MAVLink bytes through the injected NoopMavlinkTransport,
satisfying replay protocol Invariant 5 and unblocking AZ-401 AC-9.
Closes AZ-558. Also unskips AZ-401 AC-9 and AZ-404 AC-4b. Live wire
output remains byte-identical (proven via two-instance MAVLink
byte-equivalence tests). AST scan asserts no .mav.<name>_send( calls
remain in the retrofit set (AP / iNav / tlog adapters).
Out of scope (logged in review): GCS adapter retrofit; airborne live
strategy registration that would activate the SerialMavlinkTransport
factory injection path.
Tests: 2110 passed, 92 environmental skips, 1 unrelated pre-existing
macOS cold-start flake deselected.
Co-authored-by: Cursor <cursoragent@cursor.com>
AZ-389's task spec assumed the existence of `tile_store.put_mid_flight_
candidate(MidFlightTileCandidate)` (in Excluded: "owned by AZ-303 / E-C6"),
but the current TileStore Protocol has only the four-method baseline
shipped under AZ-303 — there is no put_mid_flight_candidate, no
MidFlightTileCandidate DTO, and no MID_FLIGHT_INGEST TileSource enum value.
Filed AZ-559 as a 5pt task to close the C6 storage gap (Protocol method
+ DTO + enum + persistence + freshness/LRU integration + contract
update). Updated AZ-389 spec to depend on AZ-559 (replacing the stale
AZ-303 dep) with a Status: BLOCKED note. Updated the dependencies
table totals: 151 tasks / 502 complexity points.
This is the same dep-gap pattern surfaced for AZ-401 in batch 61
(missing AZ-400 transport-seam retrofit) — the autodev replay-track
sequence is exposing under-spec deliveries upstream. Tracker remains
the source of truth via the new AZ-559 issue + Blocks link.
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements the replay-mode CLI dispatcher per ADR-011 (replay-as-
configuration):
- src/gps_denied_onboard/cli/replay.py: argparse with all 6 required
args (--video, --tlog, --output, --camera-calibration, --config,
--mavlink-signing-key) plus --pace and --time-offset-ms; path
validation, calibration JSON schema-validation, config mutation
(mode='replay' + replay sub-block + signing-key hex on dev_static
field), dispatch into runtime_root.main(config).
- runtime_root.main() now accepts an optional Config (additive,
backward-compat). Adds dedicated catch for ReplayInputAdapterError
mapping to EXIT_FDR_OPEN_FAILURE (2) so the CLI's exit-code matrix
holds end-to-end (AC-9 + epic AZ-265 AC-8).
- Signing-key contents stored as hex; redacted in startup banner.
- Top-level except logs full traceback via logger.exception + stderr
print and exits 1.
The CLI does NOT call compose_root directly — it builds a Config and
hands it to the shared airborne main, which calls compose_root, which
branches on config.mode (AZ-401 / replay protocol Invariant 11).
Tests: 22 unit tests covering AC-1..AC-10 + extras (signing-key
redaction, file-not-dir validation, dev_static propagation, unhandled
exception traceback). Full regression: 2085 passed (+22) green; no
new flaky tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
Wires the airborne composition root for replay-as-configuration (ADR-011):
- compose_root(config) branches on config.mode in {"live", "replay"}.
Live behaviour is unchanged; replay builds ReplayInputAdapter,
attaches JsonlReplaySink, and injects NoopMavlinkTransport.
- New private module runtime_root/_replay_branch.py holds the
replay-only strategy graph + build-flag gate + calibration loader.
- Config gains Config.mode (Literal["live","replay"]) plus
Config.replay sub-block with nested ReplayAutoSyncConfig that mirrors
the AZ-405 AutoSyncConfig DTO; YAML loader + ENV map updated.
Absorbs the AZ-400 transport-seam retrofit that AZ-401 strictly
required but AZ-400 had not delivered:
- New MavlinkTransport Protocol (write/bytes_written/close).
- NoopMavlinkTransport (replay; build-flag gated, idempotent close,
thread-safe byte counter).
- SerialMavlinkTransport (live, no-op restructure of existing pymavlink
byte path; encoder retrofit to actually USE it is the AZ-558
follow-up).
AZ-401 AC-9 (NoopMavlinkTransport.bytes_written > 0 after C8 encoders
run) is BLOCKED on AZ-558 — the encoder routing retrofit is out of
the AZ-401 task envelope (FORBIDDEN files: pymavlink_ardupilot_adapter,
msp2_inav_adapter). AZ-558 spec, batch_61_review.md, and the test's
@pytest.mark.skip rationale all carry the deferral reason.
Tests: 22 compose_root replay-branch tests + 17 transport tests.
Full regression: 2063 passed, 86 environment-skips, 1 documented
skip (AC-9 / AZ-558), 1 pre-existing flaky perf test deselected.
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the Layer-4 cross-cutting `replay_input/` module per ADR-011:
ReplayInputAdapter converges (video, tlog) into the standard
FrameSource + FcAdapter + Clock surfaces the airborne composition
root consumes. Owns time-alignment between video frames and tlog
IMU/attitude ticks (manual via --time-offset-ms or auto via the
AZ-405 IMU-take-off detector + Farneback motion-onset detector).
Auto-sync algorithm (auto_sync.py):
- Tlog take-off detector: sustained vertical-accel excess > 0.5 g for
>= 0.5 s + sustained attitude-rate magnitude > 1 rad/s.
- Video motion-onset detector: dense Farneback flow magnitude > 1.5 px
sustained >= 0.5 s (deterministic per AC-10).
- compute_offset combines the two; confidence = min(tlog, video).
- validate_offset_or_fail implements the AC-9 95 % frame-window match
validator with configurable threshold + window.
ReplayInputAdapter.open() ordering (AC-13):
1. Load tlog samples + fail-fast on missing RAW_IMU/SCALED_IMU2 or
ATTITUDE BEFORE any video read.
2. Resolve offset (auto-sync OR manual override; manual bypasses the
detectors entirely per AC-8).
3. Run AC-9 validator on resolved offset; raise auto-sync hard-fail
for AC-7 (CLI exit 2 mapping).
4. Build single Clock instance per pace (TlogDerived/ASAP, Wall/REAL).
5. Construct VideoFileFrameSource and TlogReplayFcAdapter with the
resolved offset baked in (replay protocol Invariant 8).
Structured log + FDR records on auto-sync detected / low-confidence /
AC-8 hard-fail kinds. Idempotent close (AC-12).
Tests: 25 unit tests across tests/unit/replay_input/ covering all 13
ACs (kernel-level synthetic fixtures for AC-1..AC-10; coordinator-
level OpenCV synthetic videos + faked pymavlink for AC-6..AC-13).
Contract update: replay_protocol.md v2.0.0 added fdr_client to the
ReplayInputAdapter __init__ signature (was missing in the prose; the
task spec already listed it in the allowed-imports section).
Co-authored-by: Cursor <cursoragent@cursor.com>
Re-design replay mode per user direction: replay is no longer a fourth
Docker image with a reduced component set, but a `config.mode = "replay"`
branch of the single airborne binary. The pre-flight workflow (route in
suite UI -> C12 tile download via real satellite-provider -> C10
manifest+engines build) is identical between live and replay; only three
strategies swap at compose time:
FrameSource: Live <-> Video
FcAdapter: Pymavlink/MSP2 <-> TlogReplay
MavlinkTransport: Serial <-> Noop
The C8 outbound MAVLink encoders run unchanged in both modes; their
bytes hit `NoopMavlinkTransport` in replay and disappear. A new
`JsonlReplaySink` taps C5's `EstimatorOutput` stream so the parent-suite
UI sees per-tick coordinates by tailing `results.jsonl`. MAVLink 2.0
signing key remains mandatory (operator supplies a dummy file).
A new `replay_input/` Layer-4 cross-cutting coordinator owns
`(video, tlog) -> (FrameSource, FcAdapter, Clock)` convergence; the
composition root sees only standard interfaces past `.open()`.
Docs:
- architecture.md: new ADR-011 with full rationale; ADR-002 binary
narrative updated.
- contracts/replay/replay_protocol.md: bumped to v2.0.0; 12 invariants
(notably mode-agnosticism + encoder byte-equality + signing key
mandatory + real C6 cache in replay).
- module-layout.md: Build-Time Exclusion Map dropped from 4 to 3 binary
columns; replay-mode `BUILD_*` flags default ON in airborne;
`shared/replay_input` cross-cutting entry added.
- epics.md: E-DEMO-REPLAY scope reframed; story points 27-32 -> 19-24.
Task respecs:
- AZ-401: shrunk 3 -> 2 pts; `compose_root` mode branch + JSONL sink +
NoopMavlinkTransport wiring; legacy `compose_replay` export deleted.
- AZ-402: console-script wrapper that mutates `config.mode = "replay"`
and dispatches into the shared airborne main; `--mavlink-signing-key`
mandatory.
- AZ-403: CANCELLED. Moved to done/ with banner; Jira transition deferred
via `_docs/_process_leftovers/2026-05-14_az_403_cancellation_pending_tracker.md`.
- AZ-404: AC-4 reworded as mode-agnosticism AST scan + encoder
byte-equality test; new AC-8 operator-workflow rehearsal.
- AZ-405: also owns the `replay_input/` module + `ReplayInputAdapter`.
_dependencies_table.md updated: AZ-401 gains AZ-405 dep; AZ-404 drops
AZ-403 dep; AZ-403 row marked CANCELLED.
Co-authored-by: Cursor <cursoragent@cursor.com>