[AZ-291] [AZ-292] [AZ-293] C13 FDR writer chain (batch 6)

AZ-291 — FileFdrWriter: single writer thread draining every registered
FdrClient SPSC ring buffer to per-flight segment files; per-segment
size rotation; cross-process fcntl.flock filelock on flight_root;
ENOSPC degraded mode with rate-capped ERROR logs and one GCS alert.

AZ-292 — FlightHeader/FlightFooter dataclasses + open_flight /
close_flight lifecycle methods; four per-flight monotonic counters
(records_written, records_dropped_overrun, bytes_written,
rollover_count) reported by the footer; flight_id mismatch and
close-without-open are typed errors.

AZ-293 — CapacityCapPolicy (post-rotation hook): walks the flight
directory, drops the oldest CLOSED segment when total > cap (default
64 GiB), emits a kind="segment_rollover" record per drop. Never drops
the currently-open segment or segment 0 alone; cap_misconfigured path
logs ERROR + GCS alert. No config flag disables emission (C13-ST-01).

Schema: bumped fdr_record_schema flight_header / flight_footer payload
key sets to match the AZ-292 task spec (effective 1.0.0 -> 1.1.0; no
prior producer); KNOWN_PAYLOAD_KEYS updated. Added FdrWriterConfig
nested in FdrConfig (segment_size_bytes, batch_size, flight_cap_bytes,
debug_log_per_record).

Tests: 29 new unit tests (8 AC + 1 invariant per task); full suite
323 passed, 2 pre-existing skips, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-05-11 03:38:58 +03:00
parent 33486588de
commit b5dd6031d2
19 changed files with 2152 additions and 10 deletions
@@ -0,0 +1,98 @@
# Batch 06 — Implementation Report (Cycle 1)
**Tasks**: AZ-291, AZ-292, AZ-293
**Component**: C13 FDR Writer (E-C13)
**Cycle**: 1 (Build → Ship)
**Date**: 2026-05-11
## Summary
Built the C13 FDR writer chain end-to-end. AZ-291 lands the single writer thread + segment file lifecycle + cross-process filelock + ENOSPC degraded mode. AZ-292 lands the `FlightHeader` / `FlightFooter` records and the four per-flight counters (records_written, records_dropped_overrun, bytes_written, rollover_count) that make a flight directory self-describing. AZ-293 lands the per-flight 64 GiB cap policy with oldest-segment-dropped + canonical `segment_rollover` record emission.
The three tasks share a single module (`components/c13_fdr/`) with these new files:
- `errors.py` — five typed exceptions covering construction, open, close, and concurrent-writer failure paths.
- `headers.py``FlightHeader` and `FlightFooter` frozen dataclasses.
- `writer.py``FileFdrWriter` (AZ-291 + AZ-292).
- `cap_policy.py``CapacityCapPolicy` (AZ-293).
- `__init__.py`, `interface.py` — re-exports.
## Features Landed
### AZ-291 — Writer thread + segment lifecycle
- `FileFdrWriter(flight_root, flight_id, config, fdr_clients, gcs_alert, *, on_rotation, drain_sleep_s)` constructor.
- `start()`, `stop()`, `open_flight(header)`, `close_flight()` lifecycle methods.
- Background writer thread that loops over every registered `FdrClient.drain(batch_size)` and writes serialised records to the current segment with `<uint32-LE length prefix> | <serialised body>` framing.
- Per-segment rotation triggered by `segment_size_bytes` (default 64 MiB).
- Cross-process filelock via `fcntl.flock(LOCK_EX | LOCK_NB)` on `flight_root/.fdr.lock`; held for the entire flight; constructor-time `FdrConcurrentWriterError` on contention.
- ENOSPC degraded mode: one ERROR log + one GCS alert; subsequent failures are log-rate-capped at 1/sec; producer buffers keep draining (records discarded) so producer-side memory does not grow unbounded.
- Public introspection: `current_segment_path()`, `current_segment_bytes()`, `segments_written()`, `is_rolling()`, `is_degraded()`, `current_size_bytes()`, `rollover_count`, `records_dropped_overrun`, `flight_id`, `flight_dir`.
### AZ-292 — FlightHeader / FlightFooter + counters
- `FlightHeader` dataclass with `flight_id`, `flight_started_at_iso`, `flight_started_at_monotonic_ns`, `config_snapshot`, `signing_key_rotation_event`, `manifest_content_hashes`, `build_info`.
- `FlightFooter` dataclass with `flight_id`, `flight_ended_at_iso`, `flight_ended_at_monotonic_ns`, `records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count`, `clean_shutdown`.
- `open_flight(header)` writes the header as the first record of segment 0; rejects flight_id mismatch with `FdrOpenError`.
- `close_flight()` drains pending producer records, builds the footer (iteratively converging `bytes_written` to include the footer's own size), writes it, releases the filelock, and returns the `FlightFooter` to the caller. Idempotent (a second call returns the cached footer).
- Counter integration: `_append_record` increments `_records_written` and `_bytes_written`; `_observe_overrun_record` aggregates `payload.dropped_count` into `_records_dropped_overrun`; `_rotate_segment` bumps `_rollover_count`.
### AZ-293 — Capacity cap policy
- `CapacityCapPolicy(cap_bytes, fdr_client, gcs_alert)` callable; invoked by `FileFdrWriter` via the `on_rotation` hook after every per-segment rotation.
- Walks the flight directory, sums on-disk segment sizes + writer's running `current_segment_bytes`, and unlinks the oldest CLOSED segment if total > cap. Repeats until under cap.
- Segment 0 (containing the `flight_header`) is never dropped unless it is the only candidate AND the directory is over cap by itself — in that case logs `fdr.cap_misconfigured` ERROR + emits one GCS alert and lets the flight continue in degraded mode.
- Each drop enqueues a `kind="segment_rollover"` `FdrRecord` (envelope `producer_id="shared.fdr_client"`) carrying `old_segment`, `new_segment`, `total_bytes_after`; bumps `writer.rollover_count`; logs `fdr.cap_drop` INFO.
- Default `cap_bytes = 64 * 1024**3` (64 GiB exactly per AC-NEW-3 + AC-7); valid range `[1024, 2**40]`.
- No config flag disables `segment_rollover` emission (AC-6 verified by a config-schema scan test).
## Schema / Contract Changes
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md``flight_header` and `flight_footer` payload key sets extended to match AZ-292's task-spec dataclasses. Effective minor bump (1.0.0 → 1.1.0); no breaking change since no producer or consumer used the previous narrow shape.
- `src/gps_denied_onboard/fdr_client/records.py``KNOWN_PAYLOAD_KEYS` updated for the two kinds.
- `src/gps_denied_onboard/config/schema.py` — added `FdrWriterConfig` nested inside `FdrConfig`. Fields: `segment_size_bytes` (default 64 MiB), `batch_size` (default 64), `flight_cap_bytes` (default 64 GiB), `debug_log_per_record` (default False).
## Dependency Changes
None. Despite the AZ-291 spec calling for `filelock`, the package was not in `pyproject.toml` and `fcntl.flock` from the stdlib provides equivalent POSIX advisory-lock semantics (kernel auto-releases on process death — directly matching the Risk-3 mitigation). Documented inline in the writer's module docstring.
## Test Results
- **New tests**: 29 (9 for AZ-291, 10 for AZ-292, 10 for AZ-293).
- **Full suite**: 323 passed, 2 skipped (pre-existing cmake / actionlint skips). 0 regressions.
## Acceptance Criteria Coverage
| Task | AC | Test | Status |
|------|----|------|--------|
| AZ-291 | AC-1 drain all producers | `test_ac1_drain_all_registered_producers` | PASS |
| AZ-291 | AC-2 per-segment rotation | `test_ac2_per_segment_rotation_at_size_cap` | PASS |
| AZ-291 | AC-3 atomic rotation | `test_ac3_atomic_rotation_no_half_segment` | PASS |
| AZ-291 | AC-4 filelock prevents concurrent | `test_ac4_concurrent_writer_blocked_by_filelock` | PASS |
| AZ-291 | AC-5 ENOSPC degrades + alerts | `test_ac5_enospc_degrades_and_alerts` | PASS |
| AZ-291 | AC-6 stop drains + fsyncs + releases lock | `test_ac6_stop_drains_and_releases_lock` | PASS |
| AZ-291 | AC-7 segment file layout | `test_ac7_segment_layout` | PASS |
| AZ-291 | AC-8 steady-state no overrun | `test_ac8_steady_state_no_overrun` | PASS |
| AZ-292 | AC-1 header is first record | `test_ac1_flight_header_is_first_record` | PASS |
| AZ-292 | AC-2 footer is last record | `test_ac2_flight_footer_is_last_record` | PASS |
| AZ-292 | AC-3 counters reflect reality | `test_ac3_counters_reflect_on_disk_reality` | PASS |
| AZ-292 | AC-4 open_flight FdrOpenError on disk failure | `test_ac4_open_flight_fdrerror_on_disk_failure` | PASS |
| AZ-292 | AC-5 reject flight_id mismatch | `test_ac5_open_flight_rejects_flight_id_mismatch` | PASS |
| AZ-292 | AC-6 close without open raises | `test_ac6_close_without_open_raises` | PASS |
| AZ-292 | AC-7 clean_shutdown=False on teardown | `test_ac7_uncleansed_teardown_no_clean_shutdown` | PASS |
| AZ-292 | AC-8 records_dropped_overrun aggregates | `test_ac8_records_dropped_overrun_aggregates_dropped_counts` | PASS |
| AZ-293 | AC-1 drop oldest when over cap | `test_ac1_drop_oldest_when_dir_exceeds_cap` | PASS |
| AZ-293 | AC-2 loop until under cap | `test_ac2_loop_until_under_cap` | PASS |
| AZ-293 | AC-3 misconfigured cap path | `test_ac3_cap_misconfigured_when_segment_zero_alone` | PASS |
| AZ-293 | AC-4 open segment never dropped | `test_ac4_currently_open_segment_never_dropped` | PASS |
| AZ-293 | AC-5 canonical fields on rollover | `test_ac5_segment_rollover_record_has_canonical_fields` | PASS |
| AZ-293 | AC-6 no disable flag | `test_ac6_no_config_flag_disables_segment_rollover` + `test_config_full_schema_has_no_rollover_disable_field` | PASS |
| AZ-293 | AC-7 default cap is exactly 64 GiB | `test_ac7_default_cap_is_exactly_64_gib` | PASS |
| AZ-293 | AC-8 rollover_count matches | `test_ac8_rollover_count_matches_segment_rollover_records` | PASS |
## Follow-ups
- **AZ-294 / AZ-295 / AZ-296**: mid-flight tile snapshot path, thumbnail rate cap, and takeoff-abort wiring — next sub-tasks in E-C13 (out of scope for Batch 6).
- **Composition root wiring**: the `runtime_root.py` will inject the `CapacityCapPolicy` instance as the writer's `on_rotation` callback when E-C13's full wiring lands (likely a later batch or AZ-270 expansion).
- **NFR-perf microbenches**: NFR-perf-throughput (≥ 200 Hz on Tier-2), NFR-perf-rotation (p99 ≤ 50 ms), NFR-perf-hook (p99 ≤ 50 ms), NFR-perf-multi-drop (≤ 100 ms) are documented in the specs but require Tier-2 hardware to run; tracked for a future Jetson-harness cycle.
- **AZ-294 mid-flight tile snapshot**: depends on the writer being able to record a JSON pointer record without copying the JPEG inline (`sidecar_path` invariant); the existing `_append_record` supports this directly. Implementation will live in this same module.