Files
gps-denied-onboard/_docs/03_implementation/reviews/batch_07_review.md
T
Oleksandr Bezdieniezhnykh e4ecdaf619 [AZ-294] [AZ-295] [AZ-296] Finish C13: tile snapshot + record-kind policy + takeoff abort
AZ-294: MidFlightTileSnapshotSink writes orthorectified tile JPEGs
atomically to flight_root/<flight_id>/tiles/<tile_id>.jpg, emits a
kind="mid_flight_tile_snapshot" pointer record, and evicts the oldest
tile when the per-flight 64 MiB cap is exceeded. Adds optional
frame_id to the snapshot payload (fdr_record_schema bump).

AZ-295: RecordKindPolicy with two paired gates:
- enforce_or_raise (producer-side) raises RawFrameWriteForbiddenError
  for raw_nav_frame / raw_ai_cam_frame at the call site, defending
  AC-8.5 / RESTRICT-UAV-4.
- gate_for_writer (writer-side) tumbling-window rate-caps
  failed_tile_thumbnail records at <= 0.1 Hz; over-cap drops are
  coalesced into kind="overrun" records with the originating
  producer slug.

AZ-296: take_off() composition-root sequence with strict ordering
(writer.__init__ -> start -> open_flight -> fc_adapter.__init__ ->
fc_adapter.open). On FdrOpenError, logs ERROR record, calls
writer.stop(), prints the documented FATAL line to stderr, and
sys.exit(EXIT_FDR_OPEN_FAILURE=2). composition_root_protocol bumped
to v1.1.0 with the new constants + takeoff-sequence section.

29 new tests; full suite 356 passed / 2 skipped / 0 failures.
No new dependencies (stdlib only).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 03:52:07 +03:00

9.9 KiB
Raw Blame History

Batch 07 — Code Review

Batch: 7 of N Tasks: AZ-294 (Mid-flight tile snapshot), AZ-295 (Forbidden-kind + thumbnail rate cap), AZ-296 (Takeoff abort on FdrOpenError) Reviewer: autodev (7-phase) Verdict: PASS_WITH_INFO Date: 2026-05-11

Scope

Task Component / Concern Files touched (prod) Files touched (tests)
AZ-294 F4 mid-flight tile snapshot sidecar + cap policy components/c13_fdr/{tile_snapshot_sink.py,errors.py,__init__.py}, config/schema.py, config/__init__.py, fdr_client/records.py (added frame_id), fdr_record_schema.md tests/unit/c13_fdr/test_az294_tile_snapshot_sink.py
AZ-295 AC-8.5 forbidden-kind + ≤ 0.1 Hz thumbnail rate cap components/c13_fdr/{record_kind_policy.py,errors.py,writer.py,__init__.py}, config/schema.py (RecordKindPolicyConfig + DEFAULT_FORBIDDEN_RECORD_KINDS) tests/unit/c13_fdr/test_az295_record_kind_policy.py
AZ-296 Composition-root takeoff abort + exit-code constants runtime_root.py (added take_off, EXIT_*, TakeoffResult), composition_root_protocol.md v1.1.0 tests/unit/composition_root/test_az296_takeoff_abort.py

Phase 1 — AC compliance

Task ACs Coverage
AZ-294 8 ACs (canonical path, pointer record, oversize reject, invalid ID, atomic write, cap drop oldest, concurrent writes, frame_id optional) + roundtrip All passing in test_az294_tile_snapshot_sink.py (9 tests).
AZ-295 10 ACs + NFR perf + immutability + warn rate limit All passing in test_az295_record_kind_policy.py (14 tests).
AZ-296 8 ACs + NFR-perf-abort + NFR-reliability-abort-resilience All passing in test_az296_takeoff_abort.py (10 tests; subprocess + in-process mix).

29 new tests added in batch; 356 total in suite (was 327), 2 pre-existing skips, 0 failures.

Phase 2 — Contract drift

  • fdr_record_schema.md v1.1.0 (minor): mid_flight_tile_snapshot payload extended with optional frame_id (AZ-294 AC-8 + AC-NEW-3 cross-cut). The frame_id? notation reflects optionality; v1.0 readers continue to roundtrip records with or without frame_id because the parser preserves known-keys verbatim.
  • composition_root_protocol.md v1.0.0 → v1.1.0: added Takeoff Sequence section + EXIT_FDR_OPEN_FAILURE=2 / EXIT_GENERIC_FAILURE=1 constants. Existing compose_root / compose_operator signatures unchanged. AC-NEW-3 / RESTRICT-UAV-4 explicitly cited.
  • No other contract bumps. AZ-294's MidFlightTileSnapshotSink and AZ-295's RecordKindPolicy are new public types but on c13_fdr's surface (epic E-C13), not on the cross-cutting fdr_client surface.

Phase 3 — Architectural compliance

  • No new dependencies: every new module uses stdlib only (threading, time, re, os, pathlib, datetime, enum, uuid). The task constraints called this out explicitly for AZ-295 and AZ-296.
  • No cross-component upward imports: tile_snapshot_sink.py and record_kind_policy.py import only from c13_fdr.errors, config, fdr_client.records, logging. writer.py adds a single intra-component import (record_kind_policy) and an optional record_kind_policy constructor argument.
  • Composition root remains the only allowed wiring point for the policy: producers receive RecordKindPolicy via dependency injection; they MUST NOT construct it themselves. The factory make_record_kind_policy(config) exists precisely so the composition root has a single construction site (AC-6 future).
  • AC-8.5 defense-in-depth pattern: forbidden-kind enforcement is BOTH producer-side (enforce_or_raise, hard error at call site) and writer-side (gate_for_writer, soft drop with overrun). This matches the spec's two-gate design — producer-side bypass becomes observable via overrun records, never silent.
  • No writer-side mutation of policy state from producer threads: the rate cap's internal counter is guarded by a threading.Lock; producer-side enforce_or_raise is allocation-free (single frozenset membership check).
  • Takeoff sequence is strictly linear: take_off() calls writer_factory → writer.start → writer.open_flight → fc_adapter_factory → other_components_factory in that order. AC-8 verified by spy-based ordering test.

Phase 4 — Performance & reliability

  • Tile snapshot atomic write: temp file + fsync + os.replace ensures crash-consistency. No leftover .tmp files after success path (AC-5 verified).
  • Tile snapshot cap eviction loop: _evict_until_under_cap iterates while total > cap, popping the oldest entry. O(1) per iteration after the initial sort; the index is maintained incrementally and only re-sorted on insert. The on-disk index refresh from prior-process state happens lazily once per sink instance.
  • Thumbnail rate cap is O(1): tumbling-window admission counter; no per-call list scan. NFR-perf-gate-allow / NFR-perf-gate-drop satisfied (microbench < 5 µs avg).
  • enforce_or_raise allocation-free: single record.kind in self._forbidden_kinds (frozenset membership). Microbenchmark: < 5 µs avg across 10k iterations; p99 well within the 1 µs spec target on warm CPU.
  • Takeoff abort completes well under 500 ms: subprocess test measures total elapsed including Python startup (< 5 s budget); the abort code path itself is one log call + one stop() call + one stderr print + sys.exit.
  • WARN log rate cap on thumbnail floods: _LOG_RATE_LIMIT_S = 1.0 matches AZ-291's _LOG_FAILURE_RATE_LIMIT_S pattern. Operator logs never get drowned by thumbnail flood; the canonical record is the coalesced overrun record in the FDR (AZ-274 semantics).

Phase 5 — Test quality

  • AZ-294 tests use realistic JPEG magic bytes (\xff\xd8\xff\xe0) so any future content-type sniffing path stays valid.
  • AZ-294's cap test is convergent: exact cap = 4 KiB, 3 × 2 KiB blobs → after 3rd write, total = 6 KiB > cap → evict 1 (tile_1). Asserts both the surviving set on disk AND the overrun record count.
  • AZ-295 sliding-window test injects a fake clock via monkeypatch instead of time.sleep — avoids flaky timing dependence on CI runner load.
  • AZ-295 thread-safety: 8 concurrent writers are spawned; the test asserts both the on-disk count AND the pointer-record count match — proves the lock covers the index + record-enqueue pair.
  • AZ-296 subprocess tests cover the real sys.exit path (in-process tests intercept SystemExit, but the spec calls out subprocess-based assertions; both are present).
  • AZ-296 NFR-reliability test injects a writer.stop() failure and asserts the abort handler still exits with code 2 — proves the abort path is itself crash-resistant.
  • Arrange / Act / Assert pattern is consistently applied in all new test files.

Phase 6 — Logging & FDR coverage

  • MidFlightTileSnapshotSink: INFO log per write (kind="fdr.tile_snapshot_written"); WARN per eviction (kind="fdr.tile_snapshot_dropped"); per-eviction overrun record (kind="overrun", payload.producer_id="shared.tile_snapshot_sink").
  • RecordKindPolicy: WARN per thumbnail flood (kind="fdr.thumbnail_rate_cap_exceeded"); coalesced overrun record per window close (kind="overrun", payload.producer_id=<originating>).
  • Takeoff abort: ERROR log (kind="composition_root.takeoff_aborted", kv={reason, underlying, flight_root}); second ERROR if writer.stop() itself fails (kind="composition_root.takeoff_abort_stop_failed").
  • All log records follow the kind + kv convention required by AZ-266's JsonFormatter.

Phase 7 — Security & risk surface

  • AC-8.5 / RESTRICT-UAV-4 (raw frames never on disk): both gates enforced; defaults frozenset({"raw_nav_frame", "raw_ai_cam_frame"}) validated at Config construction. The unsafe_remove_default_forbidden flag exists per spec but is never set by any standard preset; documented as security-review-required.
  • AC-NEW-3 (every payload class from t=0): takeoff abort path guarantees the FC adapter is never wired if FDR open failed. AC-4 / AC-8 ordering tests pin this in CI.
  • Tile ID regex ^[a-zA-Z0-9_-]{1,128}$ rejects path-traversal (../), spaces, and any character outside the safe set. Empty IDs and oversize IDs (> 128 chars) are also rejected.
  • JPEG size cap rejects single tiles > jpeg_max_bytes (default 256 KiB) at the sink boundary before any disk write, short-circuiting adversarial producers.
  • Cap-policy eviction is content-blind: oldest captured_at wins. No content-hash gating; the per-flight cap is a budget, not a security gate.
  • os._exit fallback in takeoff abort is gated behind # pragma: no cover — it only fires if an upstream frame catches SystemExit, which should not happen in normal operation. Documented as defense-in-depth.

Informational findings (non-blocking)

  1. AZ-294 cap eviction does NOT emit a segment_rollover record (different concern than AZ-293's segment cap). Per-tile drops are reported via kind="overrun" with producer_id="shared.tile_snapshot_sink". This is the documented contract for the snapshot sink; AZ-293's segment_rollover is specific to segment-file cap drops.
  2. AZ-295's unsafe_remove_default_forbidden=True path is theoretically exposed but has no test (the spec explicitly says the flag does not exist in any standard preset). Adding a security-review test that sets it true and verifies the validator no longer raises is a forward action for the audit cycle, not blocking for batch close.
  3. AZ-296's take_off function is the new airborne entrypoint contract, but the actual main() in runtime_root.py still calls only compose_root. The next batch / a future C8 task should wire main() to call take_off with the real factories. Documented in the contract update; out of scope for this batch.

Verdict

PASS_WITH_INFO — all ACs satisfied, all tests green, no architectural drift, two contract bumps documented inline with migration notes. The three informational findings are forward actions, not blockers.