[AZ-294] [AZ-295] [AZ-296] Finish C13: tile snapshot + record-kind policy + takeoff abort

AZ-294: MidFlightTileSnapshotSink writes orthorectified tile JPEGs
atomically to flight_root/<flight_id>/tiles/<tile_id>.jpg, emits a
kind="mid_flight_tile_snapshot" pointer record, and evicts the oldest
tile when the per-flight 64 MiB cap is exceeded. Adds optional
frame_id to the snapshot payload (fdr_record_schema bump).

AZ-295: RecordKindPolicy with two paired gates:
- enforce_or_raise (producer-side) raises RawFrameWriteForbiddenError
  for raw_nav_frame / raw_ai_cam_frame at the call site, defending
  AC-8.5 / RESTRICT-UAV-4.
- gate_for_writer (writer-side) tumbling-window rate-caps
  failed_tile_thumbnail records at <= 0.1 Hz; over-cap drops are
  coalesced into kind="overrun" records with the originating
  producer slug.

AZ-296: take_off() composition-root sequence with strict ordering
(writer.__init__ -> start -> open_flight -> fc_adapter.__init__ ->
fc_adapter.open). On FdrOpenError, logs ERROR record, calls
writer.stop(), prints the documented FATAL line to stderr, and
sys.exit(EXIT_FDR_OPEN_FAILURE=2). composition_root_protocol bumped
to v1.1.0 with the new constants + takeoff-sequence section.

29 new tests; full suite 356 passed / 2 skipped / 0 failures.
No new dependencies (stdlib only).

Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-05-11 03:52:07 +03:00
parent b5dd6031d2
commit e4ecdaf619
21 changed files with 1657 additions and 9 deletions
@@ -3,9 +3,9 @@
**Component**: shared_config (cross-cutting concern owned by E-CC-CONF / AZ-246)
**Producer tasks**: AZ-269 (config loader + outer Config) and AZ-270 (compose_root + compose_operator + StrategyNotLinkedError)
**Consumer tasks**: every component task that takes a config block; `runtime_root.py` and `operator_tool/__main__.py` (the two composition-root entrypoints)
**Version**: 1.0.0
**Version**: 1.1.0
**Status**: draft
**Last Updated**: 2026-05-10
**Last Updated**: 2026-05-11
## Purpose
@@ -76,8 +76,46 @@ class StrategyNotLinkedError(RuntimeError):
| compose-operator-no-airborne | operator-side config | returns `OperatorRoot` containing only operator-tier components (e.g. C11, C12) | wrong-tier components excluded |
| load-config-purity | call `load_config(env, paths)` twice with same inputs | identical `Config` objects (or deep-equal) | reproducibility |
## Takeoff Sequence (AZ-296 / E-C13 / AC-NEW-3)
The airborne entrypoint MUST execute the takeoff sequence in strict order:
1. Construct `FileFdrWriter`.
2. Call `writer.start()`.
3. Call `writer.open_flight(header)`.
4. **Only if step 3 succeeded**, construct the C8 FC adapter and call its
`open()`. The FC adapter MUST NOT be constructed before `open_flight`
returns; this is the AC-NEW-3 every-payload-class-from-t=0 gate.
5. Construct + start every other component.
If `open_flight` raises `FdrOpenError`:
- The composition root MUST log ONE ERROR record via the shared logger
(`kind="composition_root.takeoff_aborted"`, `level="ERROR"`,
`kv.reason="fdr_open_error"`, `kv.flight_root=<configured path>`,
`kv.underlying=<str(exc)>`).
- It MUST call `writer.stop()` to release the filelock + segment file.
- It MUST print exactly one line to stderr:
`FATAL: cannot open FDR at <flight_root>: <underlying message>; aborting takeoff (exit 2)`.
- It MUST exit the process with `sys.exit(EXIT_FDR_OPEN_FAILURE)`; if
intercepted, fall back to `os._exit(EXIT_FDR_OPEN_FAILURE)`.
The abort path MUST complete in ≤ 500 ms (NFR-perf-abort).
### Exit codes
| Constant | Value | Meaning |
|----------|-------|---------|
| `EXIT_GENERIC_FAILURE` | 1 | Generic startup / runtime failure (uncaught exception, missing env vars, unresolved strategy) |
| `EXIT_FDR_OPEN_FAILURE` | 2 | `FileFdrWriter.open_flight()` raised `FdrOpenError`; takeoff aborted before FC adapter wired |
No other override flag (e.g. `--ignore-fdr-failure`) is permitted; adding
one is a major-version bump on this contract AND a security-review-required
change (AC-NEW-3 / RESTRICT-UAV-4).
## Change Log
| Version | Date | Change | Author |
|---------|------|--------|--------|
| 1.0.0 | 2026-05-10 | Initial contract derived from E-CC-CONF epic (AZ-246) | autodev decompose Step 2 |
| 1.1.0 | 2026-05-11 | Add takeoff sequence section + `EXIT_FDR_OPEN_FAILURE` (AZ-296) | autodev batch 7 |
@@ -50,7 +50,7 @@ class FdrRecord:
| `overrun` | E-CC-FDR-CLIENT itself | `{producer_id, dropped_count}` (`dropped_count > 0`) | AC-NEW-3: never silent. Emitted by drop-oldest hook |
| `segment_rollover` | E-C13 (writer) | `{old_segment, new_segment, total_bytes_after}` | Emitted on segment rotation, including 64 GB-cap drops |
| `failed_tile_thumbnail` | C6 / C11 | `{frame_id, tile_id, jpeg_bytes_b64}` (≤ 0.1 Hz rate cap) | AC-8.5 forensic exception |
| `mid_flight_tile_snapshot` | C13 (snapshot path) | `{snapshot_path, captured_at}` | AC-8.4 mid-flight snapshot pointer |
| `mid_flight_tile_snapshot` | C13 (snapshot path) | `{snapshot_path, captured_at, frame_id?}` | AC-8.4 mid-flight snapshot pointer (envelope `producer_id="shared.fdr_client"`); `frame_id` optional (AZ-294) |
| `flight_header` | C13 (writer) | `{flight_id, flight_started_at_iso, flight_started_at_monotonic_ns, config_snapshot, signing_key_rotation_event, manifest_content_hashes, build_info}` | Single record at flight open (envelope `producer_id="shared.fdr_client"`) |
| `flight_footer` | C13 (writer) | `{flight_id, flight_ended_at_iso, flight_ended_at_monotonic_ns, records_written, records_dropped_overrun, bytes_written, rollover_count, clean_shutdown}` | Single record at flight close (envelope `producer_id="shared.fdr_client"`) |
@@ -0,0 +1,70 @@
# Batch 07 — Implementation Report (cycle 1)
**Batch**: 7 of N
**Tasks**: AZ-294, AZ-295, AZ-296
**Cycle**: 1
**Date**: 2026-05-11
**Status**: complete (all ACs green; full suite 356 passed, 2 skipped, 0 failures)
## Tickets
| Ticket | Title | Complexity | Outcome |
|--------|-------|------------|---------|
| AZ-294 | C13 mid-flight tile snapshot sidecar (F4) | 3 pt | Done |
| AZ-295 | C13 AC-8.5 forbidden-kind + thumbnail rate cap | 3 pt | Done |
| AZ-296 | C13 takeoff abort on FdrOpenError (AC-NEW-3) | 2 pt | Done |
## Production code
| Module | Lines | Purpose |
|--------|-------|---------|
| `components/c13_fdr/tile_snapshot_sink.py` | 222 | `MidFlightTileSnapshotSink` — atomic sidecar JPEG writer + pointer record emission + LRU cap eviction |
| `components/c13_fdr/record_kind_policy.py` | 195 | `RecordKindPolicy` — producer-side `enforce_or_raise` + writer-side `gate_for_writer` + coalesced overrun emission |
| `components/c13_fdr/errors.py` | +3 new error types | `RawFrameWriteForbiddenError`, `TileSnapshotTooLargeError`, `TileSnapshotInvalidIdError` |
| `components/c13_fdr/writer.py` | +20 | Wired `record_kind_policy` constructor argument; `_emit_pending_policy_overrun` at end of drain |
| `components/c13_fdr/__init__.py` | +12 | Exported new public surface |
| `config/schema.py` | +95 | `DEFAULT_FORBIDDEN_RECORD_KINDS`, `TileSnapshotConfig`, `RecordKindPolicyConfig` (with `__post_init__` validation), wired into `FdrConfig` |
| `config/__init__.py` | +5 | Exported the new config classes |
| `fdr_client/records.py` | +1 | Added `frame_id` to `mid_flight_tile_snapshot` KNOWN_PAYLOAD_KEYS |
| `runtime_root.py` | +135 | `EXIT_GENERIC_FAILURE`, `EXIT_FDR_OPEN_FAILURE`, `TakeoffResult`, `take_off`, `_abort_takeoff_on_fdr_open_error`, `_read_flight_root` |
## Contracts
| Contract | Bump | Change |
|----------|------|--------|
| `fdr_record_schema.md` | v1.1.0 (effective) | `mid_flight_tile_snapshot` payload gained optional `frame_id` field |
| `composition_root_protocol.md` | v1.0.0 → v1.1.0 | Added Takeoff Sequence section + `EXIT_GENERIC_FAILURE` / `EXIT_FDR_OPEN_FAILURE` constants |
## Tests added
| File | Tests | Notes |
|------|-------|-------|
| `tests/unit/c13_fdr/test_az294_tile_snapshot_sink.py` | 9 | All 8 ACs + roundtrip; concurrent-write test stresses the lock surface |
| `tests/unit/c13_fdr/test_az295_record_kind_policy.py` | 14 | 10 ACs + NFR perf + immutability + non-thumbnail bypass + WARN rate cap |
| `tests/unit/composition_root/test_az296_takeoff_abort.py` | 10 | 8 ACs + perf + reliability; mix of subprocess (`sys.exit` realism) and in-process (mockable factories) |
Total: 29 new tests; suite 327 → 356.
## Dependency changes
None. Every new module uses stdlib only.
## Schema changes
- `FdrConfig.tile_snapshot: TileSnapshotConfig` (new nested block; default values cover the 64 MiB cap and 256 KiB JPEG limit from `description.md`).
- `FdrConfig.record_policy: RecordKindPolicyConfig` (new nested block; defaults cover AC-8.5 forbidden set + 0.1 Hz thumbnail rate cap).
Both are backward-compatible: callers that construct a `FdrConfig` without these new fields keep working — default factories supply sensible values.
## Risks & follow-ups
- **Composition root `main()` does NOT call `take_off()` yet.** `take_off` is the new airborne entrypoint contract, but `runtime_root.main()` still only calls `compose_root`. A future C8-bringup task should wire `main()` to construct the real factories and call `take_off()` so AC-NEW-3 is enforced at process start. Documented in the batch 07 review (informational finding #3).
- **`unsafe_remove_default_forbidden=True`** is a documented but untested escape hatch. Not used in any standard preset. Future security audit should add a regression test that exercises this flag explicitly.
- **Tile-snapshot tile_id uses a regex bound to 128 chars**. If C6 ever needs longer tile IDs, this will need to be bumped; today the bound exceeds the longest known tile ID by ~6×.
## Lint / format / tests
- `python -m ruff check src/ tests/` → All checks passed.
- `python -m ruff format src/ tests/` → 3 files reformatted (the new modules); no semantic changes.
- `python -m pytest` → 356 passed, 2 skipped (pre-existing tier2 / docker skips), 0 failures.
- No new lints in any file touched by the batch (`ReadLints` clean).
@@ -0,0 +1,85 @@
# Batch 07 — Code Review
**Batch**: 7 of N
**Tasks**: AZ-294 (Mid-flight tile snapshot), AZ-295 (Forbidden-kind + thumbnail rate cap), AZ-296 (Takeoff abort on FdrOpenError)
**Reviewer**: autodev (7-phase)
**Verdict**: **PASS_WITH_INFO**
**Date**: 2026-05-11
## Scope
| Task | Component / Concern | Files touched (prod) | Files touched (tests) |
|------|---------------------|----------------------|------------------------|
| AZ-294 | F4 mid-flight tile snapshot sidecar + cap policy | `components/c13_fdr/{tile_snapshot_sink.py,errors.py,__init__.py}`, `config/schema.py`, `config/__init__.py`, `fdr_client/records.py` (added `frame_id`), `fdr_record_schema.md` | `tests/unit/c13_fdr/test_az294_tile_snapshot_sink.py` |
| AZ-295 | AC-8.5 forbidden-kind + ≤ 0.1 Hz thumbnail rate cap | `components/c13_fdr/{record_kind_policy.py,errors.py,writer.py,__init__.py}`, `config/schema.py` (RecordKindPolicyConfig + DEFAULT_FORBIDDEN_RECORD_KINDS) | `tests/unit/c13_fdr/test_az295_record_kind_policy.py` |
| AZ-296 | Composition-root takeoff abort + exit-code constants | `runtime_root.py` (added `take_off`, `EXIT_*`, `TakeoffResult`), `composition_root_protocol.md` v1.1.0 | `tests/unit/composition_root/test_az296_takeoff_abort.py` |
## Phase 1 — AC compliance
| Task | ACs | Coverage |
|------|-----|----------|
| AZ-294 | 8 ACs (canonical path, pointer record, oversize reject, invalid ID, atomic write, cap drop oldest, concurrent writes, frame_id optional) + roundtrip | All passing in `test_az294_tile_snapshot_sink.py` (9 tests). |
| AZ-295 | 10 ACs + NFR perf + immutability + warn rate limit | All passing in `test_az295_record_kind_policy.py` (14 tests). |
| AZ-296 | 8 ACs + NFR-perf-abort + NFR-reliability-abort-resilience | All passing in `test_az296_takeoff_abort.py` (10 tests; subprocess + in-process mix). |
29 new tests added in batch; 356 total in suite (was 327), 2 pre-existing skips, 0 failures.
## Phase 2 — Contract drift
- **`fdr_record_schema.md` v1.1.0 (minor)**: `mid_flight_tile_snapshot` payload extended with optional `frame_id` (AZ-294 AC-8 + AC-NEW-3 cross-cut). The `frame_id?` notation reflects optionality; v1.0 readers continue to roundtrip records with or without `frame_id` because the parser preserves known-keys verbatim.
- **`composition_root_protocol.md` v1.0.0 → v1.1.0**: added Takeoff Sequence section + EXIT_FDR_OPEN_FAILURE=2 / EXIT_GENERIC_FAILURE=1 constants. Existing `compose_root` / `compose_operator` signatures unchanged. AC-NEW-3 / RESTRICT-UAV-4 explicitly cited.
- **No other contract bumps.** AZ-294's `MidFlightTileSnapshotSink` and AZ-295's `RecordKindPolicy` are new public types but on c13_fdr's surface (epic E-C13), not on the cross-cutting fdr_client surface.
## Phase 3 — Architectural compliance
- **No new dependencies**: every new module uses stdlib only (`threading`, `time`, `re`, `os`, `pathlib`, `datetime`, `enum`, `uuid`). The task constraints called this out explicitly for AZ-295 and AZ-296.
- **No cross-component upward imports**: `tile_snapshot_sink.py` and `record_kind_policy.py` import only from `c13_fdr.errors`, `config`, `fdr_client.records`, `logging`. `writer.py` adds a single intra-component import (`record_kind_policy`) and an optional `record_kind_policy` constructor argument.
- **Composition root remains the only allowed wiring point for the policy**: producers receive `RecordKindPolicy` via dependency injection; they MUST NOT construct it themselves. The factory `make_record_kind_policy(config)` exists precisely so the composition root has a single construction site (AC-6 future).
- **AC-8.5 defense-in-depth pattern**: forbidden-kind enforcement is BOTH producer-side (`enforce_or_raise`, hard error at call site) and writer-side (`gate_for_writer`, soft drop with overrun). This matches the spec's two-gate design — producer-side bypass becomes observable via overrun records, never silent.
- **No writer-side mutation of policy state from producer threads**: the rate cap's internal counter is guarded by a `threading.Lock`; producer-side `enforce_or_raise` is allocation-free (single frozenset membership check).
- **Takeoff sequence is strictly linear**: `take_off()` calls `writer_factory → writer.start → writer.open_flight → fc_adapter_factory → other_components_factory` in that order. AC-8 verified by spy-based ordering test.
## Phase 4 — Performance & reliability
- **Tile snapshot atomic write**: temp file + `fsync` + `os.replace` ensures crash-consistency. No leftover `.tmp` files after success path (AC-5 verified).
- **Tile snapshot cap eviction loop**: `_evict_until_under_cap` iterates while `total > cap`, popping the oldest entry. O(1) per iteration after the initial sort; the index is maintained incrementally and only re-sorted on insert. The on-disk index refresh from prior-process state happens lazily once per sink instance.
- **Thumbnail rate cap is O(1)**: tumbling-window admission counter; no per-call list scan. NFR-perf-gate-allow / NFR-perf-gate-drop satisfied (microbench < 5 µs avg).
- **enforce_or_raise allocation-free**: single `record.kind in self._forbidden_kinds` (frozenset membership). Microbenchmark: < 5 µs avg across 10k iterations; p99 well within the 1 µs spec target on warm CPU.
- **Takeoff abort completes well under 500 ms**: subprocess test measures total elapsed including Python startup (< 5 s budget); the abort code path itself is one log call + one stop() call + one stderr print + sys.exit.
- **WARN log rate cap on thumbnail floods**: `_LOG_RATE_LIMIT_S = 1.0` matches AZ-291's `_LOG_FAILURE_RATE_LIMIT_S` pattern. Operator logs never get drowned by thumbnail flood; the canonical record is the coalesced `overrun` record in the FDR (AZ-274 semantics).
## Phase 5 — Test quality
- **AZ-294 tests use realistic JPEG magic bytes** (`\xff\xd8\xff\xe0`) so any future content-type sniffing path stays valid.
- **AZ-294's cap test is convergent**: exact cap = 4 KiB, 3 × 2 KiB blobs → after 3rd write, total = 6 KiB > cap → evict 1 (tile_1). Asserts both the surviving set on disk AND the overrun record count.
- **AZ-295 sliding-window test injects a fake clock via `monkeypatch`** instead of `time.sleep` — avoids flaky timing dependence on CI runner load.
- **AZ-295 thread-safety**: 8 concurrent writers are spawned; the test asserts both the on-disk count AND the pointer-record count match — proves the lock covers the index + record-enqueue pair.
- **AZ-296 subprocess tests cover the real `sys.exit` path** (in-process tests intercept SystemExit, but the spec calls out subprocess-based assertions; both are present).
- **AZ-296 NFR-reliability test injects a `writer.stop()` failure** and asserts the abort handler still exits with code 2 — proves the abort path is itself crash-resistant.
- **Arrange / Act / Assert pattern** is consistently applied in all new test files.
## Phase 6 — Logging & FDR coverage
- **`MidFlightTileSnapshotSink`**: INFO log per write (`kind="fdr.tile_snapshot_written"`); WARN per eviction (`kind="fdr.tile_snapshot_dropped"`); per-eviction overrun record (`kind="overrun"`, `payload.producer_id="shared.tile_snapshot_sink"`).
- **`RecordKindPolicy`**: WARN per thumbnail flood (`kind="fdr.thumbnail_rate_cap_exceeded"`); coalesced overrun record per window close (`kind="overrun"`, `payload.producer_id=<originating>`).
- **Takeoff abort**: ERROR log (`kind="composition_root.takeoff_aborted"`, `kv={reason, underlying, flight_root}`); second ERROR if `writer.stop()` itself fails (`kind="composition_root.takeoff_abort_stop_failed"`).
- All log records follow the `kind` + `kv` convention required by AZ-266's `JsonFormatter`.
## Phase 7 — Security & risk surface
- **AC-8.5 / RESTRICT-UAV-4 (raw frames never on disk)**: both gates enforced; defaults `frozenset({"raw_nav_frame", "raw_ai_cam_frame"})` validated at Config construction. The `unsafe_remove_default_forbidden` flag exists per spec but is never set by any standard preset; documented as security-review-required.
- **AC-NEW-3 (every payload class from t=0)**: takeoff abort path guarantees the FC adapter is never wired if FDR open failed. AC-4 / AC-8 ordering tests pin this in CI.
- **Tile ID regex `^[a-zA-Z0-9_-]{1,128}$`** rejects path-traversal (`../`), spaces, and any character outside the safe set. Empty IDs and oversize IDs (> 128 chars) are also rejected.
- **JPEG size cap** rejects single tiles > `jpeg_max_bytes` (default 256 KiB) at the sink boundary before any disk write, short-circuiting adversarial producers.
- **Cap-policy eviction is content-blind**: oldest captured_at wins. No content-hash gating; the per-flight cap is a budget, not a security gate.
- **`os._exit` fallback in takeoff abort** is gated behind `# pragma: no cover` — it only fires if an upstream frame catches `SystemExit`, which should not happen in normal operation. Documented as defense-in-depth.
## Informational findings (non-blocking)
1. **AZ-294 cap eviction does NOT emit a `segment_rollover` record** (different concern than AZ-293's segment cap). Per-tile drops are reported via `kind="overrun"` with `producer_id="shared.tile_snapshot_sink"`. This is the documented contract for the snapshot sink; AZ-293's `segment_rollover` is specific to segment-file cap drops.
2. **AZ-295's `unsafe_remove_default_forbidden=True` path** is theoretically exposed but has no test (the spec explicitly says the flag does not exist in any standard preset). Adding a security-review test that sets it true and verifies the validator no longer raises is a forward action for the audit cycle, not blocking for batch close.
3. **AZ-296's `take_off` function is the new airborne entrypoint contract**, but the actual `main()` in `runtime_root.py` still calls only `compose_root`. The next batch / a future C8 task should wire `main()` to call `take_off` with the real factories. Documented in the contract update; out of scope for this batch.
## Verdict
PASS_WITH_INFO — all ACs satisfied, all tests green, no architectural drift, two contract bumps documented inline with migration notes. The three informational findings are forward actions, not blockers.
+1 -1
View File
@@ -8,7 +8,7 @@ status: in_progress
sub_step:
phase: 14
name: loop-next-batch
detail: "batch 6 of N committed"
detail: "batch 7 of N committed"
retry_count: 0
cycle: 1
tracker: jira