Land the fallback InferenceRuntime strategy that satisfies C7-IT-05:
when the TRT-direct path (AZ-298) cannot deserialise a cached engine
or when the operator explicitly selects ORT, the system stays in the
air at degraded latency rather than dropping the request. Conforms to
the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep".
Production
- onnx_trt_ep_runtime.py: compile_engine is a no-op returning an
EngineCacheEntry pointing at the source .onnx; deserialize_engine
is gate-first for .engine entries and gate-skip for .onnx, builds
an ORT InferenceSession with the provider list
[TensorrtExecutionProvider, CUDAExecutionProvider,
CPUExecutionProvider], stages cached engines into the ORT TRT EP
cache directory via symlink-or-copy, warms up with one session.run
after construction, and honours
config.inference.ort_disallow_cpu_fallback by raising
EngineDeserializeError when the active provider resolves to CPU;
infer emits a one-shot c7.fallback_to_onnx_trt_ep WARN log plus
gcs_alert callback on first call when is_fallback=True;
release_engine is idempotent. _build_provider_args is the
single point that pins TRT EP option-key names (Risk-3) and caps
trt_max_workspace_size at gpu_memory_budget_bytes // 4 (AC-8).
- config.py: adds ort_trt_cache_dir (validated non-empty) and
ort_disallow_cpu_fallback to C7InferenceConfig.
- fdr_client/records.py: adds c7.fallback_to_onnx_trt_ep and
c7.cpu_fallback FDR record kinds.
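A minimal sketch of the provider-argument pinning and the CPU-fallback refusal described above. The function names, signatures, and the local EngineDeserializeError stand-in are hypothetical; the option keys are the ones the ORT TensorRT execution provider documents, and the real _build_provider_args may differ.

```python
from typing import Any


class EngineDeserializeError(Exception):
    """Local stand-in for the project's error type (assumption)."""


def build_provider_args(
    gpu_memory_budget_bytes: int, cache_dir: str, fp16: bool = True
) -> list[tuple[str, dict[str, Any]]]:
    """Return the provider list in priority order, with pinned TRT EP option keys."""
    trt_options: dict[str, Any] = {
        # Cap the TRT workspace at a quarter of the GPU memory budget.
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
        "trt_fp16_enable": fp16,
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
    }
    return [
        ("TensorrtExecutionProvider", trt_options),
        ("CUDAExecutionProvider", {}),
        ("CPUExecutionProvider", {}),
    ]


def ensure_not_cpu(active_providers: list[str], disallow_cpu_fallback: bool) -> None:
    """Refuse the session when the highest-priority active provider is the CPU."""
    if disallow_cpu_fallback and active_providers[:1] == ["CPUExecutionProvider"]:
        raise EngineDeserializeError("active provider resolved to CPU")
```

With a real session, the list returned by build_provider_args would be passed as the providers argument of the ORT InferenceSession, and ensure_not_cpu would receive session.get_providers().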
Tests
- test_onnx_trt_ep_runtime.py: covers AC-1..AC-8 + Risk-2 CPU-fallback
alert + Risk-3 option-key pin + NFR-reliability error rewrap; Tier-1
via fake ORT session; Tier-2 placeholders skip on macOS dev for
numerical FP16 comparison and session-creation perf NFR.
- test_protocol_conformance.py: drops onnx_trt_ep from the missing-
module parametrize now that the module ships.
- test_az272_fdr_record_schema.py: extends per-kind fixture builder
to cover the two new C7 FDR kinds in the roundtrip / schema-version
AC tests.
Docs
- module-layout.md: replaces the pending onnx_trt_runtime row with
the shipped onnx_trt_ep_runtime row + capabilities list.
- batch_32_cycle1_report.md + reviews/batch_32_review.md: full batch
+ self-review (PASS_WITH_WARNINGS, 4 Low findings accepted).
Tests run: c7_inference 139 passing + 17 Tier-2 skips; combined unit
suite (excluding pending components) 529 passing, 19 env-skipped.
Co-authored-by: Cursor <cursoragent@cursor.com>
Implement the production-default InferenceRuntime strategy on JetPack
6.2 + TensorRT 10.3 (per D-C7-9). The runtime owns the full TRT
lifecycle: compile_engine via the Polygraphy + trtexec + IBuilderConfig
hybrid (FP16 / INT8 / Mixed precision), deserialize_engine with
EngineGate-first ordering and a pre-allocation GPU memory budget gate,
infer via H2D -> enqueueV3 -> D2H -> stream sync on the owned CUDA
stream, idempotent release_engine, and an injected
ThermalStatePublisher delegation for thermal_state.
INT8 calibration cache trust (D-C10-6, AC-2/3/4) is enforced by a
.calib_cache.sha256 file-integrity sidecar (AZ-280) plus a new
.calib_cache.dataset_sha256 sidecar that records the dataset content
hash at compile time; reuse only when both agree, rebuild silently on
dataset hash mismatch, raise CalibrationCacheError on corrupt sidecar
(never silently overwritten).
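A minimal sketch of the two-sidecar reuse decision, under stated assumptions: sidecars are named by appending .sha256 and .dataset_sha256 to the cache filename, each holds a lowercase hex SHA-256 digest, and both a malformed sidecar and a file-integrity mismatch count as "corrupt". The function name and the local error class are hypothetical.

```python
import hashlib
from pathlib import Path


class CalibrationCacheError(Exception):
    """Local stand-in for the project's error type (assumption)."""


def _read_digest(sidecar: Path) -> str:
    text = sidecar.read_text().strip()
    if len(text) != 64 or any(c not in "0123456789abcdef" for c in text):
        raise CalibrationCacheError(f"corrupt sidecar: {sidecar.name}")
    return text


def can_reuse_calib_cache(cache: Path, dataset_sha256: str) -> bool:
    """Return True to reuse the cache, False to rebuild it silently."""
    integrity = cache.with_name(cache.name + ".sha256")
    dataset = cache.with_name(cache.name + ".dataset_sha256")
    if not (cache.exists() and integrity.exists() and dataset.exists()):
        return False  # nothing trustworthy to reuse yet
    actual = hashlib.sha256(cache.read_bytes()).hexdigest()
    if _read_digest(integrity) != actual:
        # Assumption: an integrity mismatch is treated like a corrupt sidecar.
        raise CalibrationCacheError("calibration cache failed its integrity check")
    # A dataset change is not an error: the caller simply recalibrates.
    return _read_digest(dataset) == dataset_sha256
```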
GPU memory budget (NFT-LIM-01, default 4 GiB) is checked BEFORE any
TRT call beyond the gate (AC-6); a pre-allocation refusal raises
OutOfMemoryError and leaves the resident state unchanged.
TensorRT 10.3 / Polygraphy / PyCUDA are lazy-imported inside the
methods that need them so the module loads cleanly on Tier-0 hosts.
A standalone CLI entry (python -m
gps_denied_onboard.components.c7_inference.tensorrt_runtime compile
<onnx> <build_config.json>) is wired for C10 CacheProvisioner
(AZ-321) to invoke pre-flight without holding a runtime instance.
C7InferenceConfig gains gpu_memory_budget_bytes (default 4 GiB) and
trtexec_timeout_s (default 600 s, Risk 4 mitigation), both validated
in __post_init__.
Tests: 26 active + 6 Tier-2-gated skips; AC-1 / AC-3 / AC-4 / AC-5
/ AC-6 / AC-7 / AC-10 + NFR-reliability fully covered on Tier-1
via fake CUDA / TRT modules; AC-2 / AC-8 / AC-9 / NFR-perf-deserialize
placeholders skip with prerequisite reason and live in the AZ-298
Tier-2 microbench harness. Code review verdict
PASS_WITH_WARNINGS (1 Medium hot-path hoist fix auto-applied).
Co-authored-by: Cursor <cursoragent@cursor.com>
PASS_WITH_WARNINGS verdict for batches 28-30 (AZ-305, AZ-307, AZ-308);
five findings, all Medium/Low — module-layout drift + cross-batch DRY.
No Critical/High, no auto-fix gate; per implement Step 14.5,
PASS_WITH_WARNINGS continues to the next batch.
Co-authored-by: Cursor <cursoragent@cursor.com>
CacheBudgetEnforcer.reserve_headroom(needed_bytes) returns immediately
when total_disk_bytes() + needed_bytes <= budget, otherwise iterates
lru_candidates in eviction_batch_size batches, deletes via delete_tile,
emits one INFO log per evicted tile (c6.evicted) and one FDR record per
eviction batch (c6.eviction_batch, evicted_tile_ids capped to 5).
Raises CacheBudgetExhaustedError AFTER a full sweep if the budget
cannot be met. BudgetEnforcedTileStore decorates a TileStore so the
policy stays separable from PostgresFilesystemStore. Composition root
in storage_factory.build_tile_store wires the wrapper unconditionally.
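The eviction loop can be sketched as follows. The in-memory store, the function signature, and the choice to stop mid-batch once the budget is met are assumptions for illustration; logging and FDR emission are left out.

```python
class CacheBudgetExhaustedError(Exception):
    """Local stand-in for the project's error type (assumption)."""


class InMemoryTileStore:
    """Hypothetical store: dict insertion order doubles as LRU order (oldest first)."""

    def __init__(self, tiles: dict[str, int]) -> None:
        self.tiles = dict(tiles)  # tile_id -> size in bytes

    def total_disk_bytes(self) -> int:
        return sum(self.tiles.values())

    def lru_candidates(self, limit: int) -> list[str]:
        return list(self.tiles)[:limit]

    def delete_tile(self, tile_id: str) -> None:
        del self.tiles[tile_id]


def reserve_headroom(
    store: InMemoryTileStore, needed_bytes: int, budget: int, eviction_batch_size: int = 2
) -> list[list[str]]:
    """Evict LRU tiles in batches until needed_bytes fits; return the evicted batches."""
    batches: list[list[str]] = []
    while store.total_disk_bytes() + needed_bytes > budget:
        candidates = store.lru_candidates(eviction_batch_size)
        if not candidates:
            # Full sweep done and the request still does not fit.
            raise CacheBudgetExhaustedError(f"cannot free {needed_bytes} bytes")
        evicted: list[str] = []
        for tile_id in candidates:
            store.delete_tile(tile_id)
            evicted.append(tile_id)
            if store.total_disk_bytes() + needed_bytes <= budget:
                break
        batches.append(evicted)  # one FDR record per batch would go here
    return batches
```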
PostgresFilesystemStore now accepts lru_clock: Clock | None = None;
when set, read_tile_pixels calls record_lru_access(tile_id, now) so
eviction picks the right LRU candidates. Production wiring injects
WallClock(); AZ-305 unit tests still construct without the clock and
keep their pass-through semantics. Contract tile_store.md bumped to
v1.1.0 to add CacheBudgetExhaustedError to the TileCacheError family;
shared FDR schema bumped to v1.3.0 for the new c6.eviction_batch kind.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces the AZ-305 pass-through _evaluate_freshness hook with the
production FreshnessGate. Loads tile_freshness_rules + sector
classifications once at construction, builds an rtree index, and on
every evaluate() either returns metadata unchanged (FRESH), stamps
freshness_label=DOWNGRADED (stable_rear + stale), or raises
FreshnessRejectionError carrying tile_id / age_seconds /
classification / rule diagnostics (active_conflict + stale).
Constructed inside PostgresFilesystemStore.from_config; the public
storage_factory signature is preserved so AZ-305 unit tests still
build the store with freshness_gate=None for the pass-through path.
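A minimal sketch of the three-way freshness decision, assuming one maximum age per sector classification. The TileMeta shape, the evaluate signature, and the threshold values in the usage note are hypothetical; the rtree lookup that maps a tile to its classification is omitted.

```python
from dataclasses import dataclass, replace


class FreshnessRejectionError(Exception):
    """Local stand-in carrying the diagnostics named in the commit (assumption)."""

    def __init__(self, tile_id: str, age_seconds: float, classification: str, rule: float):
        super().__init__(f"{tile_id}: {age_seconds:.0f}s old exceeds {rule:.0f}s ({classification})")
        self.tile_id = tile_id
        self.age_seconds = age_seconds
        self.classification = classification
        self.rule = rule


@dataclass(frozen=True)
class TileMeta:
    tile_id: str
    age_seconds: float
    freshness_label: str = "FRESH"


def evaluate(meta: TileMeta, classification: str, max_age_s: dict[str, float]) -> TileMeta:
    limit = max_age_s[classification]
    if meta.age_seconds <= limit:
        return meta  # FRESH: metadata returned unchanged
    if classification == "stable_rear":
        return replace(meta, freshness_label="DOWNGRADED")
    if classification == "active_conflict":
        raise FreshnessRejectionError(meta.tile_id, meta.age_seconds, classification, limit)
    raise ValueError(f"unknown classification: {classification}")
```

For example, with hypothetical thresholds {"stable_rear": 3600.0, "active_conflict": 600.0}, a two-hour-old tile is downgraded in a stable_rear sector and rejected in an active_conflict sector.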
FDR schema bumped to v1.2.0: adds c6.freshness.rejected and
c6.freshness.downgraded kinds (non-breaking; v1.1 readers route them
opaquely). Operator CLI `python -m c6_tile_cache.freshness_gate
explain` dry-runs the decision for a (lat, lon, capture_ts).
Adjacent hygiene: c6_tile_cache.tools._dump_tile now passes
os.environ to load_config (AZ-305 regression — load_config requires
the env mapping).
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the production PostgresFilesystemStore implementing both protocols
in a single class. Filesystem-backed JPEG I/O (atomic sidecar write,
read-only mmap) + Postgres-backed metadata (spatial bbox, LRU, voting,
upload bookkeeping). Wires composition via `from_config` classmethod.
Key behaviors:
- AC-3 strict reading: INSERT runs first inside an open transaction;
duplicate-key collisions raise `TileMetadataError` BEFORE any byte is
written, leaving the original file + sidecar byte-identical. Atomic
sidecar write happens inside the same transaction; commit closes it.
Comp-delete remains as a safety net for the rare commit-after-write
failure path.
- AC-2 content-hash gate runs before any I/O.
- Construction performs an orphan-file reconciliation scan and emits an
INFO `c6.store.construct` log with steady-state stats.
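The insert-first ordering can be illustrated with a small sketch. It uses sqlite3 as a stand-in for Postgres so it runs anywhere; the table layout, the write_tile name, and the file naming are assumptions, and the comp-delete safety net is not shown.

```python
import os
import sqlite3
from pathlib import Path


class TileMetadataError(Exception):
    """Local stand-in for the project's error type (assumption)."""


def write_tile(conn: sqlite3.Connection, root: Path, tile_id: str, jpeg: bytes) -> Path:
    final = root / f"{tile_id}.jpg"
    tmp = root / f"{tile_id}.jpg.tmp"
    try:
        with conn:  # commits on success, rolls back if the block raises
            # INSERT first: a duplicate key fails here, before any byte is written.
            conn.execute("INSERT INTO tiles (tile_id) VALUES (?)", (tile_id,))
            tmp.write_bytes(jpeg)
            os.replace(tmp, final)  # atomic rename within the same directory
    except sqlite3.IntegrityError as exc:
        raise TileMetadataError(f"duplicate tile {tile_id}") from exc
    return final
```

Because the INSERT runs before the write, a colliding second call leaves the first tile's bytes untouched.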
Adds `c6.write` and `c6.write_failed` FDR record kinds (schema v1.1.0,
forward-compatible) and a thin operator CLI at
`c6_tile_cache.tools dump` for inspection.
Dependencies: adds `psycopg-pool>=3.2,<4.0` for the connection pool used
on the F3 read-hot path.
Tests: 25 new tests for c6_tile_cache cover AC-1..AC-15 plus
MmapTilePixelHandle + helper round-trips. Full Tier-2 unit suite:
1215 passed, 8 skipped, 1 pre-existing unrelated failure
(`test_ac8_read_host_tuple_on_jetson`: missing `pynvml` on macOS,
Jetson-only).
Co-authored-by: Cursor <cursoragent@cursor.com>
Cumulative review of batches 23-27 (cycle 1) surfaced three Medium
documentation-drift findings on module-layout.md. All three fixed
inline per user direction:
F1: c7_inference Internal list expanded with architecture_registry,
config, engine_gate, errors, manifest, thermal_publisher (added
across AZ-300/301/302).
F2: c6_tile_cache `connection.py` re-attributed from AZ-304 (which
deferred it) to AZ-305 with a "planned, not landed yet" tag.
F3: c7_inference Public API description rewritten by category
(Protocol + DTOs + component services + config + error family)
with a pointer to __init__.py's __all__ for the canonical list.
Cumulative review report:
_docs/03_implementation/cumulative_review_batches_23-27_cycle1_report.md
(PASS_WITH_WARNINGS).
Autodev state moved to status: paused_user_requested per user
choice; /autodev will resume at greenfield Step 7 (next batch
selection) on next invocation.
Co-authored-by: Cursor <cursoragent@cursor.com>
Strictly additive Alembic migration on the AZ-263 baseline
(data_model.md § 6.1 / § 6.3): six new tiles columns (tile_uuid UNIQUE,
location_hash, content_sha256, disk_bytes, accessed_at, uploaded_at),
four new btree indices, one UNIQUE expression index over the
COALESCE-zero-uuid natural key, CHECK widening of
ck_tiles_freshness_status to the AZ-263 + AZ-303 vocabulary UNION,
four NULLable bbox columns on sector_classifications, and a new
tile_freshness_rules table seeded with the two default thresholds.
Pinned UUIDv5 namespace (TILE_NAMESPACE_UUID =
5b8d0c2e-1a4f-4b3a-8c9d-e7f6a3b2c1d0) + derive_tile_id /
derive_location_hash helpers cross-coordinated with
satellite-provider. Migration runner apply_migrations(config) drives
Alembic command.upgrade("head") against the AZ-263 env with one
retry on PG SQLSTATE 40001 and structured INFO logs on apply / no-op.
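A sketch of deterministic UUIDv5 derivation under the pinned namespace. The namespace value comes from the commit message; the name string layout (source, zoom, x, y) and the function signature are purely hypothetical, since the real layout is whatever was agreed with satellite-provider.

```python
import uuid

# Pinned namespace, as stated in the commit message.
TILE_NAMESPACE_UUID = uuid.UUID("5b8d0c2e-1a4f-4b3a-8c9d-e7f6a3b2c1d0")


def derive_tile_id(source: str, zoom: int, x: int, y: int) -> uuid.UUID:
    """Derive a stable tile id; the name format below is an assumption."""
    name = f"{source}/{zoom}/{x}/{y}"
    return uuid.uuid5(TILE_NAMESPACE_UUID, name)
```

Since UUIDv5 is a pure function of namespace and name, two services that share both derive identical ids without coordination at runtime.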
Contract bump tile_metadata_store.md v1.1.0 -> v1.2.0 adds
TileMetadata.location_hash: UUID | None = None (non-breaking).
module-layout.md updated so c6_tile_cache explicitly Owns
db/migrations/**.
Tier-1 tests: UUIDv5 determinism + locked vectors + DSN resolution +
retry on a mocked DBAPIError; 1180 passed, 32 skipped. Tier-2 docker
schema tests gated by @pytest.mark.docker run against the existing
docker-compose.test.yml db service.
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton
background-thread publisher that polls jtop (preferred) or pynvml
(fallback) at config.thermal_poll_hz, stores an atomic ThermalState
snapshot, and emits c7.thermal_transition FDR records on every throttle
flip with a WARN log on entry and an INFO log on exit. Default-safe on
TelemetryUnavailableError per Invariant I-6 with a 1-Hz rate-limited
WARN.
Sources return a raw ThermalReading; the publisher stamps measured_at_ns
via its injected Clock so _JtopSource / _PynvmlSource stay clean of
direct time.* calls (Invariant 2). _poll_once is the deterministic test
seam — start() spawns the production thread.
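A minimal sketch of the deterministic poll seam, assuming the source and the clock are injected as plain callables. Class and field names mirror the description but the signatures are hypothetical; the background thread, FDR emission, and the default-safe error path are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ThermalReading:
    temp_c: float
    throttled: bool


@dataclass(frozen=True)
class ThermalState:
    temp_c: float
    throttled: bool
    measured_at_ns: int


class ThermalPublisher:
    def __init__(
        self,
        source: Callable[[], ThermalReading],
        clock_ns: Callable[[], int],
        on_transition: Callable[[ThermalState], None],
    ) -> None:
        self._source = source
        self._clock_ns = clock_ns
        self._on_transition = on_transition
        self.state: Optional[ThermalState] = None

    def poll_once(self) -> ThermalState:
        reading = self._source()
        # The publisher, not the source, stamps the time via the injected clock.
        new = ThermalState(reading.temp_c, reading.throttled, self._clock_ns())
        was_throttled = self.state.throttled if self.state else False
        self.state = new
        if new.throttled != was_throttled:
            self._on_transition(new)  # one callback per throttle flip
        return new
```

A test can then drive poll_once with scripted readings and a counter clock, with no thread or sleep involved.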
- c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS
- [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject
- 14 unit tests (AC-1..AC-6, AC-8, NFR-default-safe, structural)
green; AC-7 / AC-1 microbench / NFR-perf-poll Tier-2 deferred
- full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the AZ-301 takeoff-side validator that every InferenceRuntime
strategy calls before deserialize_engine. Five-step deterministic
refusal pipeline, in order:
1. filename schema parse -> EngineSchemaMismatchError(reason=...)
2. schema tuple match -> EngineSchemaMismatchError(expected,got)
3. sidecar present -> EngineSidecarMissingError
4. sidecar trust -> EngineHashMismatchError(stage=sidecar)
5. manifest match -> EngineHashMismatchError(stage=manifest)
Refusal order is part of the public contract (AC-7 verifies that a
fixture that is BOTH schema-mismatched AND missing-sidecar is refused
at step 1).
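The fixed refusal order can be sketched as a straight-line function in which each step raises before the next one runs. The filename layout ("<model>__<field>__...__<field>.engine"), the sidecar name, the four-field tuple, and the dict-based manifest are hypothetical stand-ins; only the error names and the step order come from the description above.

```python
import hashlib
from pathlib import Path


class EngineSchemaMismatchError(Exception): ...
class EngineSidecarMissingError(Exception): ...


class EngineHashMismatchError(Exception):
    def __init__(self, stage: str) -> None:
        super().__init__(f"hash mismatch at stage={stage}")
        self.stage = stage


def validate_engine(engine: Path, expected: tuple[str, ...], manifest: dict[str, str]) -> None:
    # 1. filename schema parse (hypothetical layout)
    parts = engine.stem.split("__")
    if engine.suffix != ".engine" or len(parts) != 1 + len(expected):
        raise EngineSchemaMismatchError("reason=unparseable filename")
    # 2. schema tuple match
    if tuple(parts[1:]) != expected:
        raise EngineSchemaMismatchError(f"expected={expected} got={tuple(parts[1:])}")
    # 3. sidecar present
    sidecar = engine.with_name(engine.name + ".sha256")
    if not sidecar.exists():
        raise EngineSidecarMissingError(sidecar.name)
    # 4. sidecar trust
    digest = hashlib.sha256(engine.read_bytes()).hexdigest()
    if sidecar.read_text().strip() != digest:
        raise EngineHashMismatchError("sidecar")
    # 5. manifest match
    if manifest.get(engine.name) != digest:
        raise EngineHashMismatchError("manifest")
```

Because the checks are sequential, a file that fails several of them always reports the earliest failing step.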
Production code (new):
- components/c7_inference/engine_gate.py -- EngineGate, HostTuple,
read_host_tuple (Jetson: pynvml + /etc/nv_tegra_release +
tensorrt.__version__; raises RuntimeError on Tier-1)
- components/c7_inference/manifest.py -- DeploymentManifest,
ManifestReader, ManifestReaderProtocol. Risk-2 enforced at the
type level: __getitem__ raises EngineHashMismatchError on
missing key, NEVER KeyError, so the gate cannot silently pass
- components/c7_inference/__init__.py -- re-exports the new
public surface
Tests (new): tests/unit/c7_inference/test_engine_gate.py covers
AC-1..AC-7 + NFR-reliability-no-write + manifest reader + refusal
log emission. 14 tests unconditional + AC-8 Tier-2 skip (needs
real NVML + L4T release file + tensorrt binding).
Three task-spec -> as-built deltas documented in
_docs/02_tasks/done/AZ-301_c7_engine_gate.md Implementation Notes:
1. HostTuple lives in engine_gate.py (the only consumer);
re-exported from package __init__.py.
2. read_host_tuple takes precision as a keyword argument — three
of four fields come from the host, precision is engine-build
metadata supplied by the caller.
3. AC-8 is Tier-2-only; AC-1..AC-7 + NFR-reliability + extras
run on every CI host.
Risk-2 (manifest reader silently treats missing entry as pass):
DeploymentManifest.__getitem__ raises EngineHashMismatchError with
"missing manifest entry for {path}" — covered by
test_manifest_missing_entry_raises_hash_mismatch.
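A minimal sketch of the type-level guarantee: the lookup converts a missing key into the domain error so that a caller catching KeyError cannot accidentally treat it as a pass. The constructor shape and the local error class are assumptions.

```python
class EngineHashMismatchError(Exception):
    """Local stand-in for the project's error type (assumption)."""


class DeploymentManifest:
    def __init__(self, entries: dict[str, str]) -> None:
        self._entries = dict(entries)  # path -> expected sha256 hex digest

    def __getitem__(self, path: str) -> str:
        try:
            return self._entries[path]
        except KeyError:
            # Never leak KeyError: a missing entry is a refusal, not a pass.
            raise EngineHashMismatchError(f"missing manifest entry for {path}") from None
```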
NFR-perf-validate (p99 <= 50 ms): tier-2 only — a real 500 MB
engine streaming sha256 cannot be benchmarked on Tier-1 fixtures.
AZ-302 (ThermalStatePublisher) + AZ-304 (C6 Postgres schema)
deferred to batches 26 / 27 to keep the 1-task batch cadence and
isolate their respective env / testcontainer surface areas.
Suite: 1134 passed / 11 skipped. No regressions outside the new
files.
Co-authored-by: Cursor <cursoragent@cursor.com>
Python facade (`Okvis2Strategy`) is production-quality and satisfies
AZ-331's `VioStrategy` protocol; full AC-1..10 coverage with
AC-9 + NFR-perf marked `tier2`. The C++ pybind11 binding compiles
and loads but throws `OkvisFatalException("estimator not yet wired")`
on first `add_frame`; the `okvis::ThreadedKFVio` wiring is a tier2
follow-up that the Step-15 Product Completeness Gate is expected to
track as a remediation task.
Resolved contradictions:
* Constructor signature aligned with the AZ-331 factory: `(config, *,
fdr_client, clock=None)`. Calibration / preintegrator / logger
built internally from config. No churn on AZ-331.
* IMU substrate: OKVIS2 owns its internal estimator IMU integration;
the AZ-276 `ImuPreintegrator` is a separate substrate consumed by
E-C5's fusion graph. Single source of truth lives at the sample
stream, not the integrator instance.
* FDR API: `FdrClient.enqueue(record)` with new `vio.health` kind
added to AZ-272 `KNOWN_PAYLOAD_KEYS`.
CI matrix forces `-DBUILD_OKVIS2=OFF` until the tier2 wiring task
brings Ceres / SuiteSparse / OKVIS2 vendored submodules into the
Linux build.
Files: 17 added/modified across `c1_vio/`, `fdr_client/records.py`,
`cpp/okvis2/CMakeLists.txt`, CI workflow, AZ-332 task spec
(implementation-notes section), batch 23 report.
Tests: 17 new (15 tier1 + 2 tier2). Full Tier-1 suite: 1109 pass,
2 skipped (env), 2 deselected (tier2). No regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
Handoff artifacts from the prior /autodev session that stopped at
Step 7 sub_step compute-next-batch:
- _docs/_autodev_state.md: pointer updated to batch 23, AZ-332 only
(AZ-345 deferred — dep AZ-346 not yet in done/).
- _docs/03_implementation/AZ-332_implementation_plan.md: locked-in
decisions (no ROS 2, no Python re-impl, three-env split: macOS dev /
Ubuntu CI / Jetson tier2) + step-by-step playbook for next session.
Pre-batch chore commit per implement skill prereq #4 (clean tree
required before AZ-332 commit so the batch diff stays focused).
Co-authored-by: Cursor <cursoragent@cursor.com>
F1 (High/Architecture) from cumulative review of batches 01-22:
`ISam2GraphHandleImpl` did not satisfy C4's `ISam2GraphHandle`
Protocol stub (AZ-355) because it lacked `get_pose_key`.
`pose_factory`'s isinstance gate would have raised at composition.
Two Protocols (C4 minimal consumer cut, C5 richer producer surface)
are intentional per AZ-355 Risk 1 — the impl just needed to expose
the canonical name. Delegates to estimator.key_for_frame.
Added cross-component conformance test asserting the C5 impl
satisfies both Protocols, so future drift trips a unit test.
F2 (Medium/Maintainability): added justifying comments at four
`except: pass` sites in runtime_root, c8_fc_adapter (ap + inav),
and c13_fdr writer. No behavioral change.
Updated cumulative review report verdict from FAIL to PASS and
recorded a post-mortem on the initial misframing
(treated the dual-Protocol design as duplication on first read).
Autodev state: batch 22 done, cumulative-review PASS,
ready for batch 23.
Co-authored-by: Cursor <cursoragent@cursor.com>
Add operator warm-start path to C5 StateEstimator Protocol and both
implementations (GtsamIsam2StateEstimator, EskfStateEstimator), plus
the third clause of the AZ-385 spoof-promotion gate.
- StateEstimator Protocol: set_takeoff_origin(origin, sigma_horiz_m,
sigma_vert_m) -> None.
- iSAM2: PriorFactorPose3 at origin with diagonal sigmas, single
isam2.update().
- ESKF: zero _nominal_pos, overwrite _P position block with sigma**2.
- SourceLabelStateMachine.process_gps_sample bounded-delta clause:
WgsConverter.horizontal_distance_m vs smoother estimate; reject
resets the dwell-time counter so AZ-385 cannot re-promote off bad
GPS.
- New EstimatorAlreadyStartedError (StateEstimatorConfigError
subclass) on late call after first add_*.
- C5StateConfig: spoof_promotion_bounded_delta_m=200,
default_takeoff_origin_sigma_horiz_m=5,
default_takeoff_origin_sigma_vert_m=10.
- New GpsSample DTO + WgsConverter.horizontal_distance_m helper.
- 4 new FDR kinds (cold_start_origin.{set,unavailable},
gps_bounded_delta.{accept,reject}) registered in AZ-272 schema.
- 33 new unit tests cover AC-1..AC-15; full repo 750 passed / 2
skipped (pre-existing CI tooling skips).
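The bounded-delta clause can be sketched as a distance check between the GPS sample and the smoother estimate. This version uses the haversine formula on a spherical Earth, which is an assumption: the real WgsConverter.horizontal_distance_m may use an ellipsoidal model. The function names and tuple-based inputs are hypothetical; the 200 m default matches spoof_promotion_bounded_delta_m.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean radius, spherical approximation (assumption)


def horizontal_distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def within_bounded_delta(
    gps: tuple[float, float], estimate: tuple[float, float], bound_m: float = 200.0
) -> bool:
    """Accept the GPS sample only if it lies within bound_m of the estimate."""
    return horizontal_distance_m(*gps, *estimate) <= bound_m
```

On a reject, the state machine described above would also reset its dwell-time counter; that bookkeeping is not shown here.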
Docs synced: protocol contract, C5 component description,
architecture, glossary, system-flows, C10 provisioning description.
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements the mandatory simple-baseline StateEstimator per AC-2.1a
engine-rule at C5 (IT-12 comparative study vs iSAM2). NumPy-only;
no GTSAM dependency so BUILD_STATE_ESKF=ON binaries ship without
GTSAM at all.
- 16-state error vector (pos 3 + vel 3 + rot 3 + ba 3 + bg 3 + dt 1)
over a textbook nominal-state / error-state ESKF split.
- add_fc_imu: full nonlinear IMU integration + linearised F P F^T + Q
covariance propagation per IMU sample.
- add_vio: simplified relative-pose update (snapshot-based; baseline
scope, documented).
- add_pose_anchor: absolute-pose update; integrates BOTH marginals and
jacobian modes (no skip — ESKF has no graph; AC-4).
- AC-9 divergence test: Mahalanobis r^T S^-1 r > 100 (10 sigma) on the
innovation covariance S = H P H^T + R.
- AC-5 SPD: Cholesky-positive enforcement on every emitted covariance;
non-SPD raises EstimatorFatalError and locks state to LOST.
- AC-6 honesty: smoothed_history entries carry smoothed=False; deviation
from C5 contract Invariant 7 documented in module + report.
- AC-7 / AC-10 BUILD_STATE_ESKF gating: works through existing factory
infra (state_factory._STATE_BUILD_FLAGS).
- AC-8: SourceLabelStateMachine + FallbackWatcher auto-wired eagerly
in __init__, same pattern as the iSAM2 estimator.
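The AC-5 and AC-9 checks share one primitive: a Cholesky factorisation both proves a matrix is SPD and gives a cheap way to evaluate r^T S^-1 r. The sketch below is written in dependency-free Python rather than the NumPy the module actually uses, and the function names and local error class are hypothetical.

```python
import math


class EstimatorFatalError(Exception):
    """Local stand-in for the project's error type (assumption)."""


def cholesky(m: list[list[float]]) -> list[list[float]]:
    """Return lower-triangular L with L L^T = m; raise if m is not SPD."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise EstimatorFatalError("covariance is not SPD")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def mahalanobis_sq(r: list[float], s_mat: list[list[float]]) -> float:
    """Compute r^T S^-1 r as ||y||^2 where L y = r (forward substitution)."""
    low = cholesky(s_mat)
    y: list[float] = []
    for i in range(len(r)):
        y.append((r[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    return sum(v * v for v in y)


def is_divergent(r: list[float], s_mat: list[list[float]], threshold: float = 100.0) -> bool:
    """Divergence test: squared Mahalanobis distance above 100, i.e. 10 sigma."""
    return mahalanobis_sq(r, s_mat) > threshold
```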
Tests: 20 new unit tests covering AC-1..AC-10 + robustness checks.
Full suite: 660 passed, 2 skipped (CI-only).
The AZ-386 Jira transition to Done is deferred (Atlassian MCP returned
'Not connected'); recorded in _docs/_process_leftovers/ for replay on
the next autodev invocation per the Leftovers Mechanism.
Co-authored-by: Cursor <cursoragent@cursor.com>
After every successful current_estimate(), emit one
c5.state.smoothed_history FDR record per newly-smoothed past
keyframe from IncrementalFixedLagSmoother. AC-4.5 (revised): the
smoothed stream goes ONLY to FDR; the C8 outbound forward-time
stream is unaffected.
Idempotency via _smoothed_fdr_watermark_s (smoother-native float
seconds); the same pose key is never emitted twice. Hook is
best-effort — internal failures log warnings but do not raise, so
a smoother divergence cannot contaminate the forward-time path.
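A minimal sketch of watermark-based idempotent emission, assuming the smoothed history arrives as (timestamp_s, pose) pairs and the FDR client is a callable. The function name, record shape, and the choice to stop at the first failure are assumptions.

```python
import logging
from typing import Any, Callable, Iterable

log = logging.getLogger("c5.smoothed_history_sketch")


def emit_newly_smoothed(
    history: Iterable[tuple[float, Any]],
    watermark_s: float,
    enqueue: Callable[[dict], None],
) -> float:
    """Emit entries newer than the watermark; return the advanced watermark."""
    try:
        for ts_s, pose in sorted(history, key=lambda entry: entry[0]):
            if ts_s <= watermark_s:
                continue  # already emitted on an earlier call
            enqueue({"kind": "c5.state.smoothed_history", "ts_s": ts_s, "pose": pose})
            watermark_s = ts_s
    except Exception:
        # Best-effort hook: log and carry on so the forward-time path is unaffected.
        log.warning("smoothed-history emission failed", exc_info=True)
    return watermark_s
```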
Cross-task invariants documented:
- AC-3 ESKF no-op — AZ-386 installs an inert hook on the ESKF.
- AC-4 No C8 leak — enforced at the C8 boundary by AZ-261.
8 new unit tests against AC-1/2/5/6 + robustness (no-FDR-client,
marginals failure). Full suite: 640 passed, 2 skipped.
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements Invariants 5 + 8 + AC-NEW-2 / AC-NEW-8: the
EstimatorOutput.source_label now reflects a real state machine
(DEAD_RECKONED → SATELLITE_ANCHORED ↔ VISUAL_PROPAGATED) governed by
a spoof-promotion gate that latches closed on FC SPOOFED GPS health
and re-opens only when BOTH conditions hold — ≥10 s
STABLE_NON_SPOOFED AND next anchor within
spoof_promotion_visual_consistency_tol_m.
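The latch-and-reopen behaviour of the gate can be sketched as below. The class name, method signatures, string-valued health states, and second-based timestamps are hypothetical; only the two reopening conditions (10 s of STABLE_NON_SPOOFED and an anchor within the tolerance) come from the description.

```python
from typing import Optional


class SpoofPromotionGate:
    def __init__(self, tol_m: float, stable_dwell_s: float = 10.0) -> None:
        self._tol_m = tol_m
        self._dwell_s = stable_dwell_s
        self._stable_since_s: Optional[float] = None
        self.blocked = False

    def notify_gps_health(self, health: str, now_s: float) -> None:
        if health == "SPOOFED":
            self.blocked = True  # latch closed
            self._stable_since_s = None
        elif health == "STABLE_NON_SPOOFED":
            if self._stable_since_s is None:
                self._stable_since_s = now_s
        else:
            self._stable_since_s = None  # any other state restarts the dwell

    def try_promote(self, anchor_delta_m: float, now_s: float) -> bool:
        """Return True if promotion is allowed; reopen only when BOTH clauses hold."""
        if not self.blocked:
            return True
        dwell_ok = (
            self._stable_since_s is not None
            and now_s - self._stable_since_s >= self._dwell_s
        )
        if dwell_ok and anchor_delta_m <= self._tol_m:
            self.blocked = False
            return True
        return False
```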
Every reject emits a c5.state.spoof_rejected FDR record plus a
subscriber-fan-out STATUSTEXT (severity WARNING, 50-char cap per
MAVLink). FDR and subscriber paths bypass the standard logger so
silencing logs cannot suppress the spoof trail (R07 / AC-6).
GtsamIsam2StateEstimator now eagerly builds the SM from C5StateConfig
in __init__; new public methods notify_gps_health() (delegates to
SM, called by composition root from C8 inbound) and
subscribe_spoof_rejection() (composition root attaches C8's
QgcTelemetryAdapter here). health_snapshot.spoof_promotion_blocked
+ current_estimate.source_label now flow from the live SM.
25 new unit tests across all 12 ACs plus cancellation, subscriber
exception isolation, and estimator wire-up integration cases. One
AZ-384 test renamed + updated to expect DEAD_RECKONED before any
anchor (was VISUAL_PROPAGATED placeholder pre-AZ-385).
Full suite: 632 passed, 2 skipped.
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements Invariant 9 / AC-5.2: when current_estimate cannot return a
fresh output for >= state.no_estimate_fallback_s (default 3.0 s), emit
ONE engagement signal (FDR kind=c5.state.no_estimate_fallback_engaged
+ GCS STATUSTEXT severity CRITICAL); on recovery, ONE recovery signal
(FDR kind=c5.state.no_estimate_fallback_recovered + STATUSTEXT NOTICE).
Rate-limited via a single _in_fallback latch (AC-2: 30 s sustained
no-estimate still emits exactly one engagement).
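A minimal sketch of the single-latch rate limit, assuming an injected emit callable and explicit nanosecond timestamps. The constructor shape and the on_success method name are hypothetical; the FDR kinds and the 3.0 s default come from the description above.

```python
from typing import Callable


class FallbackWatcher:
    def __init__(
        self, emit: Callable[[str], None], start_ns: int, threshold_s: float = 3.0
    ) -> None:
        self._emit = emit
        self._threshold_ns = int(threshold_s * 1_000_000_000)
        self._last_success_ns = start_ns
        self._in_fallback = False

    def on_success(self, now_ns: int) -> None:
        """Call after every fresh estimate; emits one recovery signal if latched."""
        self._last_success_ns = now_ns
        if self._in_fallback:
            self._in_fallback = False
            self._emit("c5.state.no_estimate_fallback_recovered")

    def check_fallback_state(self, now_ns: int) -> bool:
        """Watchdog poll; emits one engagement signal per outage."""
        overdue = now_ns - self._last_success_ns >= self._threshold_ns
        if overdue and not self._in_fallback:
            self._in_fallback = True
            self._emit("c5.state.no_estimate_fallback_engaged")
        return self._in_fallback
```

The latch is what keeps a sustained outage from emitting more than once, however often the watchdog is polled.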
New FallbackWatcher class owns the state machine; estimator wires it
through constructor + current_estimate entry/success hooks. Public
check_fallback_state(now_ns) watchdog (NFR p99 <= 5 us) + subscribe
APIs let C8 outbound react without coupling C5 to a concrete GCS
adapter at construction. Severity enum extended with CRITICAL=2 and
NOTICE=5 to match MAVLink MAV_SEVERITY.
18 new unit tests across all 8 ACs, deterministic synthetic clock,
integration tests patch monotonic_ns through GtsamIsam2StateEstimator
to drive AC-7 iSAM2 leg (ESKF leg deferred to AZ-386).
Full suite: 607 passed, 2 skipped.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces the last three NotImplementedError placeholders on
GtsamIsam2StateEstimator with real Marginals + output methods:
- current_estimate(): recovers the 6x6 Marginals covariance for the
most-recently committed pose key, enforces the SPD invariant via
np.linalg.cholesky (Invariant 10), converts the local-ENU pose
translation to WGS84 via the shared WgsConverter, derives a
body->world quaternion, and emits a fresh EstimatorOutput
(smoothed=False, Invariant 4). On SPD failure transitions
isam2_state -> LOST and raises EstimatorFatalError (AC-5.2 path).
- smoothed_history(n): iterates the smoother's active POSE keys via
_smoother.calculateEstimate().keys() (filtered by GTSAM symbol
char) and the smoother timestamps via ts_map.at(key) - workaround
for the pinned gtsam_unstable build's non-iterable
FixedLagSmootherKeyTimestampMap. Bounded by K (Invariant 6); every
entry has smoothed=True (Invariant 7).
- health_snapshot(): cheap O(1) accumulator read; reports
IsamState lifecycle, pose-key count, AC-NEW-8
cov_norm_growing_for_s rolling 60s deque-backed counter, and
spoof_promotion_blocked via the AZ-385 state machine injection
point.
Adds two public injection points for AZ-385/composition root:
set_enu_origin(LatLonAlt) and attach_source_label_state_machine(machine).
Defaults: (0, 0, 0) ENU origin, VISUAL_PROPAGATED source label,
spoof_promotion_blocked=False.
Wires _record_committed_pose_key into the three add_* success paths
so current_estimate only reads keys that have real values in iSAM2.
The JACOBIAN path in add_pose_anchor deliberately skips this call -
Invariant 3 keeps the JACOBIAN pose out of the iSAM2 graph.
Tests: +27 in tests/unit/c5_state/test_az384_marginals_outputs.py
covering all 10 ACs. Three obsolete AZ-382 tests
(test_ac10_*_raises_named_az384) removed. Full suite: 589 passed,
2 skipped.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces AZ-382 NotImplementedError placeholders with real GTSAM factor
adds wired against the iSAM2 graph handle:
- add_vio -> BetweenFactorPose3 between consecutive VIO pose keys
(first call primes the chain; AZ-388 owns first-keyframe seeding).
- add_pose_anchor -> mode-dispatch per pose.covariance_mode:
"marginals" -> PriorFactorPose3 + handle.update();
"jacobian" -> skip iSAM2 add per AZ-361 contract.
Both paths bump _last_anchor_ns via time.monotonic_ns().
- add_fc_imu -> shared ImuPreintegrator.integrate_window +
reset_for_new_keyframe; builds a CombinedImuFactor between the
prev/curr (X, V, B) keyframe triple. Introduces new 'v' (velocity)
and 'b' (bias) GTSAM key namespaces decoupled from the VIO/pose
frame_id mapping.
Invariant 2 - non-decreasing timestamps - enforced per call with
EstimatorDegradedError + c5.state.out_of_order log. Every successful
add emits a structured DEBUG *_ok log; every failure emits a
structured ERROR *_failed log and raises through the C5 error
hierarchy (R05).
Contract-vs-reality fix-ups also landed:
- StateEstimator Protocol: add_fc_imu(ImuWindow) - was incorrectly
annotated as ImuTelemetrySample by AZ-381.
- _last_anchor_ns semantics switched to monotonic_ns() to match
last_anchor_age_ms.
- create() factory back-wires the ISam2GraphHandle to the estimator
via the new attach_handle() method.
Tests: +21 in tests/unit/c5_state/test_az383_factor_adds.py covering
all 8 ACs with mock ISam2GraphHandle instances. Three obsolete
AZ-382 tests (test_ac10_add_*_raises_named_az383) removed. Full
suite: 565 passed, 2 skipped.
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the C8 inbound producer side:
- TelemetryRing[T]: bounded drop-oldest ring; first-overflow INFO log
+ monotonic dropped_count.
- SubscriptionBus + SubscriptionHandle: synchronous fan-out, lock-
released-before-callback to avoid deadlock; subscriber crash caught
+ DEBUG-logged so one bad subscriber cannot kill the decode loop.
- PymavlinkInboundDecoder: pymavlink-based AP decoder for RAW_IMU,
SCALED_IMU2, ATTITUDE, GPS_RAW_INT, GPS2_RAW, HEARTBEAT, STATUSTEXT.
Out-of-order drop (Invariant 7) per-kind WARN. STATUSTEXT spoofing
sentinel promotes subsequent GPS to GpsStatus.SPOOFED within 5 s.
AC-5.1 warm-start hint cached on first 3D+ fix; embedded into
every FlightStateSignal.
- Msp2InavInboundDecoder: YAMSPy-based iNav polling decoder for IMU /
attitude / GPS / flight-state. signed=False always (RESTRICT-COMM-2);
GpsStatus.SPOOFED is unreachable on iNav.
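The drop-oldest ring from the first bullet can be sketched with a deque. Method names (push, snapshot) and the logger name are hypothetical; thread-safety is not addressed in this sketch.

```python
import logging
from collections import deque
from typing import Deque, Generic, List, TypeVar

T = TypeVar("T")
log = logging.getLogger("c8.telemetry_ring_sketch")


class TelemetryRing(Generic[T]):
    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: Deque[T] = deque()
        self.dropped_count = 0  # monotonic: never reset

    def push(self, item: T) -> None:
        if len(self._items) == self._capacity:
            self._items.popleft()  # drop the oldest entry
            if self.dropped_count == 0:
                log.info("telemetry ring overflowed for the first time")
            self.dropped_count += 1
        self._items.append(item)

    def snapshot(self) -> List[T]:
        return list(self._items)
```

The deque is left unbounded on purpose (no maxlen) so that every drop passes through push and can be counted.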
Adds yamspy>=0.3.3 + pyserial>=3.5 to pyproject.toml.
Tests: 443 pass / 2 skip / 0 fail (+33 in batch 9).
Contract: no drift on fc_adapter_protocol.md v1.0.0; this batch
implements the inbound producer side without changing signatures.
Co-authored-by: Cursor <cursoragent@cursor.com>
AZ-294: MidFlightTileSnapshotSink writes orthorectified tile JPEGs
atomically to flight_root/<flight_id>/tiles/<tile_id>.jpg, emits a
kind="mid_flight_tile_snapshot" pointer record, and evicts the oldest
tile when the per-flight 64 MiB cap is exceeded. Adds optional
frame_id to the snapshot payload (fdr_record_schema bump).
AZ-295: RecordKindPolicy with two paired gates:
- enforce_or_raise (producer-side) raises RawFrameWriteForbiddenError
for raw_nav_frame / raw_ai_cam_frame at the call site, defending
AC-8.5 / RESTRICT-UAV-4.
- gate_for_writer (writer-side) tumbling-window rate-caps
failed_tile_thumbnail records at <= 0.1 Hz; over-cap drops are
coalesced into kind="overrun" records with the originating
producer slug.
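A tumbling-window cap of 0.1 Hz amounts to admitting at most one record per 10 s window. The sketch below assumes explicit second-based timestamps; the class name, method names, and the overrun record shape are hypothetical.

```python
from typing import Optional


class TumblingWindowRateCap:
    def __init__(self, window_s: float = 10.0, max_per_window: int = 1) -> None:
        self._window_s = window_s
        self._max = max_per_window
        self._window_index: Optional[int] = None
        self._admitted = 0
        self._dropped = 0

    def admit(self, now_s: float) -> bool:
        """Return True if the record may be written in the current window."""
        index = int(now_s // self._window_s)
        if index != self._window_index:
            self._window_index = index  # a new window starts with a fresh quota
            self._admitted = 0
        if self._admitted < self._max:
            self._admitted += 1
            return True
        self._dropped += 1
        return False

    def drain_overrun(self, producer: str) -> Optional[dict]:
        """Coalesce all drops since the last drain into one overrun record."""
        if self._dropped == 0:
            return None
        record = {"kind": "overrun", "producer": producer, "dropped_count": self._dropped}
        self._dropped = 0
        return record
```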
AZ-296: take_off() composition-root sequence with strict ordering
(writer.__init__ -> start -> open_flight -> fc_adapter.__init__ ->
fc_adapter.open). On FdrOpenError, logs ERROR record, calls
writer.stop(), prints the documented FATAL line to stderr, and
sys.exit(EXIT_FDR_OPEN_FAILURE=2). composition_root_protocol bumped
to v1.1.0 with the new constants + takeoff-sequence section.
29 new tests; full suite 356 passed / 2 skipped / 0 failures.
No new dependencies (stdlib only).
Co-authored-by: Cursor <cursoragent@cursor.com>
AZ-291 — FileFdrWriter: single writer thread draining every registered
FdrClient SPSC ring buffer to per-flight segment files; per-segment
size rotation; cross-process fcntl.flock filelock on flight_root;
ENOSPC degraded mode with rate-capped ERROR logs and one GCS alert.
AZ-292 — FlightHeader/FlightFooter dataclasses + open_flight /
close_flight lifecycle methods; four per-flight monotonic counters
(records_written, records_dropped_overrun, bytes_written,
rollover_count) reported by the footer; flight_id mismatch and
close-without-open are typed errors.
AZ-293 — CapacityCapPolicy (post-rotation hook): walks the flight
directory, drops the oldest CLOSED segment when total > cap (default
64 GiB), emits a kind="segment_rollover" record per drop. Never drops
the currently-open segment or segment 0 alone; cap_misconfigured path
logs ERROR + GCS alert. No config flag disables emission (C13-ST-01).
Schema: bumped fdr_record_schema flight_header / flight_footer payload
key sets to match the AZ-292 task spec (effective 1.0.0 -> 1.1.0; no
prior producer); KNOWN_PAYLOAD_KEYS updated. Added FdrWriterConfig
nested in FdrConfig (segment_size_bytes, batch_size, flight_cap_bytes,
debug_log_per_record).
Tests: 29 new unit tests (8 AC + 1 invariant per task); full suite
323 passed, 2 pre-existing skips, 0 regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
E-CC-HELPERS closes with the three remaining Layer-1 helpers and
E-CC-CONF closes with the env > YAML > defaults precedence test
gate. All four tickets ship with frozen public surfaces, hermetic
unit tests, and no upward (components.*) imports.
* AZ-271 — tests/unit/shared/config/test_precedence.py (5 ACs + smoke
test + helper that names the layer in failure messages).
* AZ-282 — helpers/ransac_filter.py: static RansacFilter +
RansacResult; cv2.setRNGSeed(0) for byte-equal determinism;
median residual semantics pinned by contract.
* AZ-276 — helpers/imu_preintegrator.py + make_imu_preintegrator;
GTSAM PreintegratedCombinedMeasurements; strict-monotonic ts_ns
guard runs before any state mutation. Adjacent hygiene:
_types/nav.py ImuSample/ImuWindow now use ts_ns:int and the
spec-mandated ImuBias dataclass.
* AZ-278 — helpers/lightglue_runtime.py: structural R14 fix.
LightGlueRuntime + non-blocking concurrent-access guard that
raises rather than serialising. EngineHandle Protocol in
_types/manifests.py + KeypointSet/CorrespondenceSet in
_types/matching.py (Protocol surface adds approved by spec).
Dependency conflict (Finding 1, user-approved): gtsam 4.2 (PyPI) is
numpy-1.x-ABI only; opencv-python>=4.12 needs numpy>=2 at runtime.
Resolution: opencv-python pin relaxed to >=4.11.0.86,<4.12. The
D-CROSS-CVE-1 ratchet at ci/opencv_pin_gate.py is held at 4.11.0;
the original 4.12.0 floor will be restored once a numpy-2-compatible
gtsam wheel ships. Full replay procedure in
_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md.
Tests: 294 passed, 2 skipped (cmake/actionlint env-skips,
pre-existing). 43 new tests added for batch 5. Ruff check + format
clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-
two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError
on the public surface. make_fdr_client caches one client per producer_id
and reads capacity from config.fdr.per_producer_capacity with fallback
to queue_size.
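A rough sketch of why a power-of-two capacity helps: slot indices reduce to a bit mask over monotonically increasing counters. SpscRing and its method names are hypothetical, and this pure-Python version says nothing about the memory-ordering details of the real AZ-273 buffer.

```python
class SpscRing:
    """Single-producer / single-consumer ring with pre-allocated slots."""

    def __init__(self, capacity):
        # A power of two has exactly one bit set, so n & (n - 1) == 0.
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity
        self._mask = capacity - 1
        self._head = 0  # advanced only by the consumer
        self._tail = 0  # advanced only by the producer

    def try_enqueue(self, item):
        if self._tail - self._head == len(self._slots):
            return False  # full
        self._slots[self._tail & self._mask] = item
        self._tail += 1
        return True

    def try_dequeue(self):
        """Return the oldest item, or None when the ring is empty."""
        if self._head == self._tail:
            return None
        idx = self._head & self._mask
        item = self._slots[idx]
        self._slots[idx] = None  # release the reference held by the slot
        self._head += 1
        return item
```

Because each counter has exactly one writer, the producer and the consumer never write the same field.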
AZ-274: default_overrun_policy implements drop-oldest + retry + immediate
marker emission; _evict_one folds the dropped_count of any prior marker it
evicts, so user-loss counts are never lost across iterations. ERROR diagnostic is
rate-limited to <=1/sec per producer.
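The folding idea in AZ-274 can be illustrated as follows, assuming records are plain dicts and the overrun marker carries a dropped_count field. The enqueue_drop_oldest function, this evict_one, and the record shapes are hypothetical; the retry and rate-limit parts of the real policy are left out.

```python
from collections import deque


def evict_one(buf):
    """Pop the oldest entry and return how many user records it stood for."""
    oldest = buf.popleft()
    if oldest.get("kind") == "overrun":
        # An evicted marker carries its earlier losses forward.
        return oldest["dropped_count"]
    return 1


def enqueue_drop_oldest(buf, capacity, record):
    """Append record; on overrun, evict oldest entries and emit one marker."""
    if capacity < 2:
        raise ValueError("capacity must leave room for a marker and a record")
    if len(buf) < capacity:
        buf.append(record)
        return 0
    dropped = 0
    while len(buf) > capacity - 2:  # make room for marker + record
        dropped += evict_one(buf)
    buf.append({"kind": "overrun", "dropped_count": dropped})
    buf.append(record)
    return dropped
```

If a second overrun evicts the marker written by the first, the new marker's dropped_count includes the earlier losses, so the running total survives.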
AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the
production default_overrun_policy via a duck-typed _PolicyAdapter. The
test-only records/all_records_ever properties let component tests assert
both in-buffer and lifetime state. tests/conftest.py registers the
fake_fdr_sink fixture and an AST architecture lint forbids production
imports of fakes.
AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge
and forwards only WARN+ERROR records into the FDR with kind="log".
Thread-local recursion guard short-circuits internal logging; saturated-
queue diagnostics go to stderr every N=1000 drops.
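A sketch of the thread-local recursion guard described for AZ-267, assuming the sink is a plain callable that accepts a dict. FdrBridgeSketch and the payload keys are hypothetical. The level threshold here also admits CRITICAL, which may differ from the shipped handler.

```python
import logging
import threading


class FdrBridgeSketch(logging.Handler):
    """Forward WARNING-and-above records to a sink, guarding re-entry."""

    def __init__(self, sink):
        super().__init__(level=logging.WARNING)
        self._sink = sink
        self._local = threading.local()

    def emit(self, record):
        # If the sink itself logs on this thread, short-circuit here.
        if getattr(self._local, "busy", False):
            return
        self._local.busy = True
        try:
            self._sink({
                "kind": "log",
                "level": record.levelname,
                "msg": record.getMessage(),
            })
        finally:
            self._local.busy = False
```

The flag is per thread, so a sink that logs on one thread does not suppress forwarding on another.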
AZ-268: tests/contract/log_schema.py covers every row of the schema's
Test Cases table plus the "DEBUG+INFO never reach FDR" invariant.
pyproject.toml registers the contract pytest marker and the
contract-mandated log_schema.py file-name.
251 unit + contract tests pass (48 new). Review verdict:
PASS_WITH_WARNINGS; findings are NFR-perf deferrals + a documented
relaxation of AZ-274 AC-2 coalescing under a permanently stalled consumer.
Co-authored-by: Cursor <cursoragent@cursor.com>
AZ-270: composition root with strategy registry, tier-gated lookup,
topo-order construction, all-or-nothing teardown, StrategyNotLinkedError
payload.
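Topo-order construction with teardown on failure might look like the sketch below, using the standard-library graphlib (Python 3.9+). The build_all function, the factory signature, and the close() method are assumptions for illustration; "all-or-nothing" is read here as "if any component fails to build, everything already built is closed in reverse order".

```python
from graphlib import TopologicalSorter


def build_all(factories, deps):
    """Build components in dependency order.

    factories: name -> callable(built_so_far) returning an object with close()
    deps:      name -> set of names that must be built first
    """
    graph = {name: deps.get(name, set()) for name in factories}
    order = list(TopologicalSorter(graph).static_order())
    built = {}
    try:
        for name in order:
            built[name] = factories[name](built)
    except Exception:
        # Unwind in reverse construction order, then re-raise.
        for name in reversed(list(built)):
            built[name].close()
        raise
    return built
```

static_order() yields each node after its predecessors, so a factory can rely on its dependencies already being present in the dict it receives.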
AZ-272: orjson-backed FdrRecord serialise/parse with forward-compat for
unknown payload + top-level fields and canonical overrun-record shape.
AZ-279: pyproj-backed WGS84/ECEF/ENU + OSM slippy-map tile math with
WgsConversionError for shape/range/zoom guards.
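For reference, the standard OSM slippy-map tile formula can be sketched as below. latlon_to_tile is a hypothetical name, and ValueError stands in for the project's WgsConversionError; the pyproj-backed ECEF/ENU conversions are not shown.

```python
import math

# Web Mercator is only defined up to roughly this latitude.
MAX_LAT_DEG = 85.05112878


def latlon_to_tile(lat_deg, lon_deg, zoom):
    """Return the (x, y) tile indices containing a WGS84 point."""
    if not 0 <= zoom <= 30:
        raise ValueError("zoom out of range")
    if not -MAX_LAT_DEG <= lat_deg <= MAX_LAT_DEG:
        raise ValueError("latitude outside Web Mercator range")
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude out of range")
    n = 1 << zoom  # number of tiles per axis at this zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or the southern limit stays inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)
```

As a spot check, central London (51.5074, -0.1278) at zoom 10 falls in tile (511, 340).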
AZ-281: strict EngineFilenameSchema build/parse/matches_host with
anchored regex + enum validation; round-trip identity by construction.
AZ-283: dtype-preserving (fp16/fp32) single + batch L2 normaliser with
zero-norm safety and descriptor_metric() source-of-truth.
pyproject.toml pins pyproj>=3.6 and orjson>=3.9 (named-backend deps per
the AZ-272 / AZ-279 contracts). New DTOs LatLonAlt + BoundingBox and
EngineCacheKey + HostCapabilities land in _types/ to back the helper
contracts.
203 unit tests pass (64 new). Review verdict: PASS_WITH_WARNINGS;
findings are perf-NFR deferrals + dep amendment + minor docstring polish.
Co-authored-by: Cursor <cursoragent@cursor.com>
- Changed the autodev state to reflect the new phase and task name for remediation related to AZ-243.
- Updated the dependencies table to include the new task AZ-243 and adjusted dependencies for AZ-233.
- Added a section in the implementation completeness report to document the creation of the AZ-243 remediation task aimed at integrating the production native VIO runtime.
- Modified the Docker Compose configuration to include an input root for replay tests and added an environment variable for enabling SITL.
- Enhanced documentation for various testing processes, including the addition of a Runtime Completeness Decomposition Gate and clarifications on internal module testing requirements.
- Updated the implementation completeness report to reflect the current state and added new test cases for performance and resilience scenarios.
Co-authored-by: Cursor <cursoragent@cursor.com>
- Refined task decomposition steps to ensure implementation tasks are atomic and complexity does not exceed 5 points.
- Enhanced the product implementation process with a completeness gate to verify task outcomes against architecture promises before proceeding to testing.
- Updated dependencies table to reflect new tasks and their relationships, ensuring all test tasks are linked to product remediation tasks.
- Adjusted workflow documentation to clarify entry points for task decomposition and implementation contexts.
Co-authored-by: Cursor <cursoragent@cursor.com>