Mirror of https://github.com/azaion/gps-denied-onboard.git (synced 2026-06-22 08:31:13 +00:00)
[AZ-291] [AZ-292] [AZ-293] C13 FDR writer chain (batch 6)
AZ-291: FileFdrWriter, a single writer thread draining every registered FdrClient SPSC ring buffer to per-flight segment files; per-segment size rotation; cross-process fcntl.flock filelock on flight_root; ENOSPC degraded mode with rate-capped ERROR logs and one GCS alert.

AZ-292: FlightHeader/FlightFooter dataclasses + open_flight / close_flight lifecycle methods; four per-flight monotonic counters (records_written, records_dropped_overrun, bytes_written, rollover_count) reported by the footer; flight_id mismatch and close-without-open are typed errors.

AZ-293: CapacityCapPolicy (post-rotation hook) walks the flight directory, drops the oldest CLOSED segment when total > cap (default 64 GiB), and emits a kind="segment_rollover" record per drop. Never drops the currently-open segment or segment 0 alone; cap_misconfigured path logs ERROR + GCS alert. No config flag disables emission (C13-ST-01).

Schema: bumped fdr_record_schema flight_header / flight_footer payload key sets to match the AZ-292 task spec (effective 1.0.0 -> 1.1.0; no prior producer); KNOWN_PAYLOAD_KEYS updated. Added FdrWriterConfig nested in FdrConfig (segment_size_bytes, batch_size, flight_cap_bytes, debug_log_per_record).

Tests: 29 new unit tests (8 AC tests per task plus 5 invariant tests); full suite 323 passed, 2 pre-existing skips, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,98 @@
# Batch 06: Implementation Report (Cycle 1)

**Tasks**: AZ-291, AZ-292, AZ-293
**Component**: C13 FDR Writer (E-C13)
**Cycle**: 1 (Build → Ship)
**Date**: 2026-05-11

## Summary

Built the C13 FDR writer chain end-to-end. AZ-291 lands the single writer thread, the segment file lifecycle, the cross-process filelock, and ENOSPC degraded mode. AZ-292 lands the `FlightHeader` / `FlightFooter` records and the four per-flight counters (`records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count`) that make a flight directory self-describing. AZ-293 lands the per-flight 64 GiB cap policy, which drops the oldest closed segment and emits a canonical `segment_rollover` record per drop.

The three tasks share a single module (`components/c13_fdr/`) with these new files:

- `errors.py`: five typed exceptions covering construction, open, close, and concurrent-writer failure paths.
- `headers.py`: `FlightHeader` and `FlightFooter` frozen dataclasses.
- `writer.py`: `FileFdrWriter` (AZ-291 + AZ-292).
- `cap_policy.py`: `CapacityCapPolicy` (AZ-293).
- `__init__.py`, `interface.py`: re-exports.

## Features Landed

### AZ-291: Writer thread + segment lifecycle

- `FileFdrWriter(flight_root, flight_id, config, fdr_clients, gcs_alert, *, on_rotation, drain_sleep_s)` constructor.
- `start()`, `stop()`, `open_flight(header)`, `close_flight()` lifecycle methods.
- Background writer thread that calls `drain(batch_size)` on every registered `FdrClient` in a loop and writes the serialised records to the current segment with `<uint32-LE length prefix> | <serialised body>` framing.
- Per-segment rotation triggered by `segment_size_bytes` (default 64 MiB).
- Cross-process filelock via `fcntl.flock(LOCK_EX | LOCK_NB)` on `flight_root/.fdr.lock`; held for the entire flight; contention raises `FdrConcurrentWriterError` at construction time.
- ENOSPC degraded mode: one ERROR log + one GCS alert; subsequent failures are log-rate-capped at 1/sec; producer buffers keep draining (records discarded) so producer-side memory does not grow unbounded.
- Public introspection: `current_segment_path()`, `current_segment_bytes()`, `segments_written()`, `is_rolling()`, `is_degraded()`, `current_size_bytes()`, `rollover_count`, `records_dropped_overrun`, `flight_id`, `flight_dir`.

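
The record framing described above can be illustrated with a minimal sketch. It assumes the `|` in the framing notation means plain concatenation (no separator byte) and that a segment is simply a sequence of such frames; the helper names `frame` and `iter_frames` are hypothetical and are not taken from `writer.py`.

```python
import io
import struct

# uint32, little-endian length prefix, as named in the report.
_LEN = struct.Struct("<I")


def frame(body: bytes) -> bytes:
    """Prefix a serialised record body with its uint32-LE length."""
    return _LEN.pack(len(body)) + body


def iter_frames(stream):
    """Yield record bodies from a binary stream of framed records."""
    while True:
        prefix = stream.read(_LEN.size)
        if not prefix:
            return  # clean end of segment
        if len(prefix) < _LEN.size:
            raise ValueError("truncated length prefix")
        (length,) = _LEN.unpack(prefix)
        body = stream.read(length)
        if len(body) < length:
            raise ValueError("truncated record body")
        yield body


# Round-trip two illustrative record bodies through an in-memory "segment".
segment = io.BytesIO(frame(b'{"kind":"flight_header"}') + frame(b'{"kind":"tick"}'))
records = list(iter_frames(segment))
```

A reader built this way can detect a half-written trailing record (for example after power loss) because the body comes up shorter than the prefix announces.
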
### AZ-292: FlightHeader / FlightFooter + counters

- `FlightHeader` dataclass with `flight_id`, `flight_started_at_iso`, `flight_started_at_monotonic_ns`, `config_snapshot`, `signing_key_rotation_event`, `manifest_content_hashes`, `build_info`.
- `FlightFooter` dataclass with `flight_id`, `flight_ended_at_iso`, `flight_ended_at_monotonic_ns`, `records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count`, `clean_shutdown`.
- `open_flight(header)` writes the header as the first record of segment 0; rejects a `flight_id` mismatch with `FdrOpenError`.
- `close_flight()` drains pending producer records, builds the footer (iteratively converging `bytes_written` to include the footer's own size), writes it, releases the filelock, and returns the `FlightFooter` to the caller. Idempotent (a second call returns the cached footer).
- Counter integration: `_append_record` increments `_records_written` and `_bytes_written`; `_observe_overrun_record` aggregates `payload.dropped_count` into `_records_dropped_overrun`; `_rotate_segment` bumps `_rollover_count`.

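
The self-referential `bytes_written` field mentioned for `close_flight()` can be sketched as a small fixpoint iteration. This is an illustration under stated assumptions, not the writer's code: it assumes a JSON body with a uint32-LE length prefix, and the function name `converge_footer` and its parameters are hypothetical.

```python
import json
import struct


def converge_footer(bytes_before_footer: int, footer_fields: dict,
                    max_iters: int = 8) -> tuple[int, bytes]:
    """Find bytes_written equal to bytes_before_footer plus the framed footer size."""
    guess = bytes_before_footer
    for _ in range(max_iters):
        body = json.dumps({**footer_fields, "bytes_written": guess},
                          sort_keys=True).encode()
        framed = struct.pack("<I", len(body)) + body
        total = bytes_before_footer + len(framed)
        if total == guess:
            return total, framed  # the footer now describes its own size
        guess = total
    raise RuntimeError("footer size did not converge")


total, framed = converge_footer(1000, {"kind": "flight_footer"})
payload = json.loads(framed[4:])

# A starting value just below a power of ten forces one extra pass,
# because adding the footer size adds a digit to the field.
edge_total, edge_framed = converge_footer(9_999_950, {"kind": "flight_footer"})
```

The first pass always misses because the guess excludes the footer. The second pass usually matches, and a third is needed only when the total crosses a decimal-digit boundary.
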
### AZ-293: Capacity cap policy

- `CapacityCapPolicy(cap_bytes, fdr_client, gcs_alert)` callable; invoked by `FileFdrWriter` via the `on_rotation` hook after every per-segment rotation.
- Walks the flight directory, sums the on-disk segment sizes plus the writer's running `current_segment_bytes`, and unlinks the oldest CLOSED segment if total > cap. Repeats until under cap.
- The currently-open segment is never dropped. Segment 0 (containing the `flight_header`) is not dropped when it is the only candidate: if the directory is over cap with segment 0 alone, the policy logs `fdr.cap_misconfigured` at ERROR, emits one GCS alert, and lets the flight continue in degraded mode.
- Each drop enqueues a `kind="segment_rollover"` `FdrRecord` (envelope `producer_id="shared.fdr_client"`) carrying `old_segment`, `new_segment`, `total_bytes_after`; bumps `writer.rollover_count`; logs `fdr.cap_drop` at INFO.
- Default `cap_bytes = 64 * 1024**3` (64 GiB exactly per AC-NEW-3 + AC-7); valid range `[1024, 2**40]`.
- No config flag disables `segment_rollover` emission (AC-6 verified by a config-schema scan test).

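
The drop-oldest loop can be sketched roughly as follows. Everything named here is an assumption for illustration: the function `enforce_cap`, the zero-padded `segment_NNNNNN.bin` file naming, and the use of `stat()` for the open segment (the report says the real policy uses the writer's running byte count). The segment 0 guard, the `segment_rollover` record emission, and the GCS alert are deliberately left out.

```python
import tempfile
from pathlib import Path


def enforce_cap(flight_dir: Path, open_segment: Path,
                cap_bytes: int) -> tuple[list[Path], int]:
    """Unlink the oldest closed segments until the directory total is <= cap_bytes."""
    # Zero-padded names (an assumption) make lexical order equal age order.
    segments = sorted(flight_dir.glob("segment_*.bin"))
    sizes = {p: p.stat().st_size for p in segments}
    total = sum(sizes.values())
    closed = [p for p in segments if p != open_segment]  # oldest first
    dropped = []
    while total > cap_bytes and closed:
        oldest = closed.pop(0)
        oldest.unlink()
        total -= sizes[oldest]
        dropped.append(oldest)
    return dropped, total


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i in range(4):
        (root / f"segment_{i:06d}.bin").write_bytes(b"x" * 100)
    # Four 100-byte segments against a 250-byte cap; the last one is "open".
    dropped, total = enforce_cap(root, root / "segment_000003.bin", cap_bytes=250)
    dropped_names = [p.name for p in dropped]
    survivors = sorted(p.name for p in root.iterdir())
```

Because the open segment is excluded from the candidate list, the loop can end while still over cap; that is the situation the report routes to the `fdr.cap_misconfigured` path.
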
## Schema / Contract Changes

- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`: `flight_header` and `flight_footer` payload key sets extended to match AZ-292's task-spec dataclasses. Effective minor bump (1.0.0 → 1.1.0); no breaking change since no producer or consumer used the previous narrow shape.
- `src/gps_denied_onboard/fdr_client/records.py`: `KNOWN_PAYLOAD_KEYS` updated for the two kinds.
- `src/gps_denied_onboard/config/schema.py`: added `FdrWriterConfig` nested inside `FdrConfig`. Fields: `segment_size_bytes` (default 64 MiB), `batch_size` (default 64), `flight_cap_bytes` (default 64 GiB), `debug_log_per_record` (default False).

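
For orientation, the nested config shape can be sketched with plain dataclasses. The field names and defaults are the ones listed above; the use of `dataclasses` (the real `schema.py` may use a different schema library) and the attribute name `writer` on `FdrConfig` are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FdrWriterConfig:
    segment_size_bytes: int = 64 * 1024**2   # 64 MiB per segment
    batch_size: int = 64                     # records drained per client per pass
    flight_cap_bytes: int = 64 * 1024**3     # 64 GiB per flight
    debug_log_per_record: bool = False


@dataclass(frozen=True)
class FdrConfig:
    # Hypothetical attribute name; the report only says the config is nested.
    writer: FdrWriterConfig = field(default_factory=FdrWriterConfig)


cfg = FdrConfig()
segments_per_flight = cfg.writer.flight_cap_bytes // cfg.writer.segment_size_bytes
```

With the defaults, a flight directory holds at most 1024 full segments before the cap policy starts dropping the oldest ones.
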
## Dependency Changes

None. Although the AZ-291 spec calls for `filelock`, the package was not in `pyproject.toml`, and `fcntl.flock` from the stdlib provides equivalent advisory-lock semantics on the target platforms (BSD-style `flock(2)`; the kernel releases the lock automatically when the holding process dies, which directly matches the Risk-3 mitigation). Documented inline in the writer's module docstring.

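
A minimal sketch of the non-blocking exclusive lock described above, using only stdlib calls. The helper names and the `ConcurrentWriterError` class are hypothetical stand-ins (the report's own exception is `FdrConcurrentWriterError`); the lock file name `.fdr.lock` is taken from the report.

```python
import fcntl
import os
import tempfile


class ConcurrentWriterError(RuntimeError):
    """Hypothetical stand-in for the report's FdrConcurrentWriterError."""


def acquire_flight_lock(flight_root: str) -> int:
    """Take an exclusive, non-blocking flock on <flight_root>/.fdr.lock."""
    fd = os.open(os.path.join(flight_root, ".fdr.lock"),
                 os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError as exc:
        os.close(fd)
        raise ConcurrentWriterError(flight_root) from exc
    return fd  # keep this descriptor open for as long as the lock is needed


def release_flight_lock(fd: int) -> None:
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


with tempfile.TemporaryDirectory() as root:
    first = acquire_flight_lock(root)
    try:
        acquire_flight_lock(root)  # second open of the same file: contended
        contended = False
    except ConcurrentWriterError:
        contended = True
    release_flight_lock(first)
    second = acquire_flight_lock(root)  # succeeds once the first lock is gone
    release_flight_lock(second)
```

`flock` locks belong to the open file description, so a second `os.open` of the same file conflicts even inside one process. That makes the contention path testable without spawning a subprocess.
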
## Test Results

- **New tests**: 29 (9 for AZ-291, 10 for AZ-292, 10 for AZ-293).
- **Full suite**: 323 passed, 2 skipped (pre-existing cmake / actionlint skips). 0 regressions.

## Acceptance Criteria Coverage

| Task | AC | Test | Status |
|------|----|------|--------|
| AZ-291 | AC-1 drain all producers | `test_ac1_drain_all_registered_producers` | PASS |
| AZ-291 | AC-2 per-segment rotation | `test_ac2_per_segment_rotation_at_size_cap` | PASS |
| AZ-291 | AC-3 atomic rotation | `test_ac3_atomic_rotation_no_half_segment` | PASS |
| AZ-291 | AC-4 filelock prevents concurrent | `test_ac4_concurrent_writer_blocked_by_filelock` | PASS |
| AZ-291 | AC-5 ENOSPC degrades + alerts | `test_ac5_enospc_degrades_and_alerts` | PASS |
| AZ-291 | AC-6 stop drains + fsyncs + releases lock | `test_ac6_stop_drains_and_releases_lock` | PASS |
| AZ-291 | AC-7 segment file layout | `test_ac7_segment_layout` | PASS |
| AZ-291 | AC-8 steady-state no overrun | `test_ac8_steady_state_no_overrun` | PASS |
| AZ-292 | AC-1 header is first record | `test_ac1_flight_header_is_first_record` | PASS |
| AZ-292 | AC-2 footer is last record | `test_ac2_flight_footer_is_last_record` | PASS |
| AZ-292 | AC-3 counters reflect reality | `test_ac3_counters_reflect_on_disk_reality` | PASS |
| AZ-292 | AC-4 open_flight FdrOpenError on disk failure | `test_ac4_open_flight_fdrerror_on_disk_failure` | PASS |
| AZ-292 | AC-5 reject flight_id mismatch | `test_ac5_open_flight_rejects_flight_id_mismatch` | PASS |
| AZ-292 | AC-6 close without open raises | `test_ac6_close_without_open_raises` | PASS |
| AZ-292 | AC-7 clean_shutdown=False on teardown | `test_ac7_uncleansed_teardown_no_clean_shutdown` | PASS |
| AZ-292 | AC-8 records_dropped_overrun aggregates | `test_ac8_records_dropped_overrun_aggregates_dropped_counts` | PASS |
| AZ-293 | AC-1 drop oldest when over cap | `test_ac1_drop_oldest_when_dir_exceeds_cap` | PASS |
| AZ-293 | AC-2 loop until under cap | `test_ac2_loop_until_under_cap` | PASS |
| AZ-293 | AC-3 misconfigured cap path | `test_ac3_cap_misconfigured_when_segment_zero_alone` | PASS |
| AZ-293 | AC-4 open segment never dropped | `test_ac4_currently_open_segment_never_dropped` | PASS |
| AZ-293 | AC-5 canonical fields on rollover | `test_ac5_segment_rollover_record_has_canonical_fields` | PASS |
| AZ-293 | AC-6 no disable flag | `test_ac6_no_config_flag_disables_segment_rollover` + `test_config_full_schema_has_no_rollover_disable_field` | PASS |
| AZ-293 | AC-7 default cap is exactly 64 GiB | `test_ac7_default_cap_is_exactly_64_gib` | PASS |
| AZ-293 | AC-8 rollover_count matches | `test_ac8_rollover_count_matches_segment_rollover_records` | PASS |

## Follow-ups

- **AZ-294 / AZ-295 / AZ-296**: mid-flight tile snapshot path, thumbnail rate cap, and takeoff-abort wiring are the next sub-tasks in E-C13 (out of scope for Batch 6).
- **Composition root wiring**: `runtime_root.py` will inject the `CapacityCapPolicy` instance as the writer's `on_rotation` callback when E-C13's full wiring lands (likely a later batch or an AZ-270 expansion).
- **NFR-perf microbenches**: NFR-perf-throughput (≥ 200 Hz on Tier-2), NFR-perf-rotation (p99 ≤ 50 ms), NFR-perf-hook (p99 ≤ 50 ms), and NFR-perf-multi-drop (≤ 100 ms) are documented in the specs but require Tier-2 hardware to run; tracked for a future Jetson-harness cycle.
- **AZ-294 mid-flight tile snapshot**: depends on the writer being able to record a JSON pointer record without copying the JPEG inline (`sidecar_path` invariant); the existing `_append_record` supports this directly. Implementation will live in this same module.

@@ -0,0 +1,59 @@
# Batch 06: Code Review

**Batch**: 6 of N
**Tasks**: AZ-291 (C13 writer thread), AZ-292 (FlightHeader/Footer), AZ-293 (Capacity cap policy)
**Reviewer**: autodev (7-phase)
**Verdict**: **PASS_WITH_INFO**
**Date**: 2026-05-11

## Scope

| Task | Component / Concern | Files touched (prod) | Files touched (tests) |
|------|---------------------|----------------------|------------------------|
| AZ-291 | C13 writer thread + segment files + filelock + ENOSPC | `components/c13_fdr/{writer.py,errors.py,interface.py,__init__.py}`, `config/schema.py`, `fdr_client/records.py` | `tests/unit/c13_fdr/test_az291_writer_thread.py` |
| AZ-292 | Flight header/footer + counters | `components/c13_fdr/{headers.py,writer.py}`, `fdr_record_schema.md` contract bump | `tests/unit/c13_fdr/test_az292_flight_header_footer.py`, `tests/unit/test_az272_fdr_record_schema.py` |
| AZ-293 | 64 GiB cap + oldest-segment-dropped + segment_rollover | `components/c13_fdr/cap_policy.py`, writer hook surface | `tests/unit/c13_fdr/test_az293_capacity_cap_policy.py` |

## Phase 1: AC compliance

All 8 ACs per task verified via the new unit tests; 29 new tests added, all passing. See per-task AC coverage in `batch_06_cycle1_report.md`.

## Phase 2: Contract drift

- **`fdr_record_schema.md` v1.0.0 → v1.1.0 (effective)**: `flight_header` and `flight_footer` payload key sets were extended to match AZ-292's task-spec dataclass shape (`flight_started_at_iso`, `flight_started_at_monotonic_ns`, `config_snapshot`, `signing_key_rotation_event`, `manifest_content_hashes` on header; `flight_ended_at_iso`, `flight_ended_at_monotonic_ns`, `records_dropped_overrun`, `bytes_written`, `rollover_count`, `clean_shutdown` on footer). The previous narrow shape (`started_at` / `ended_at` / `records_dropped`) was an unimplemented draft; no producer or consumer relied on it. The change is a minor bump per the contract's own versioning rules ("new optional payload field appended → minor"); existing parsers stay forward-compatible (unknown keys end up in `payload.extra`). The AZ-272 round-trip test was updated to track the new canonical fields.

## Phase 3: Architectural compliance

- **R14 / single-writer SPSC contract**: `FileFdrWriter` is the sole consumer of every registered `FdrClient`; the writer thread is the only mutator of the four flight counters. No reader-side locks.
- **No cross-component upward imports**: `cap_policy.py` imports `FileFdrWriter` (allowed: same component); `writer.py` imports from `fdr_client.*` and `config.*` (allowed: cross-cutting); no component upward edges.
- **AZ-291 vs AZ-293 separation**: per-segment rotation (size-driven) lives in the writer; per-flight cap policy (cumulative size-driven) lives in `CapacityCapPolicy`, wired by the composition root via the `on_rotation` hook. The writer never imports the policy.
- **No new dependencies**: the contract said "atomicwrites + filelock" but `filelock` was not in `pyproject.toml`. Used `fcntl.flock` from the stdlib instead (BSD-style advisory locks via `flock(2)`; the kernel releases the lock on process death, matching the Risk-3 mitigation in the spec). Documented inline in the `writer.py` module docstring.

## Phase 4: Performance & reliability

- **`fsync` discipline (AC-3 / NFR-reliability-fsync)**: every segment close (rotation + stop + close_flight) calls `os.fsync` before `os.close`. No per-record fsync (NFR allows this).
- **No backward seeks (NFR-reliability-no-seek)**: file descriptor opened with `O_WRONLY | O_CREAT | O_APPEND`; only `os.write` and `os.fsync` are called on it.
- **Footer `bytes_written` self-reference (AC-3)**: the footer's `bytes_written` payload field must include the footer's own framed size. `close_flight()` iterates up to 8 times to converge (an integer field's ASCII length only changes at decimal-power boundaries, so the fixpoint is reached in ≤ 2 passes in practice).
- **ENOSPC degraded mode (AC-5)**: catches `OSError` around `os.write`, emits one ERROR log + one GCS alert, and drops further records while continuing to dequeue producer buffers so producers don't grow unbounded. A per-second log rate cap (`_LOG_FAILURE_RATE_LIMIT_S = 1.0`) limits repeated failure noise.
- **Filelock recovery (Risk 3)**: `fcntl.flock` is kernel-managed, so abrupt process death releases the lock automatically. AC-6 verifies the clean-release path (re-construct succeeds after `stop()`); the abrupt-death case relies on this kernel behaviour.

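
The ENOSPC behaviour (one alert, the first ERROR log, then at most one log per second while records are discarded) can be sketched as below. The class `EnospcGuard`, its callbacks, and the injectable `clock` are hypothetical illustration devices; only the 1.0 s interval and the "one log + one alert" rule come from the review.

```python
import errno
import time


class EnospcGuard:
    """Wraps a write callable and applies the degraded-mode logging rules."""

    def __init__(self, write, alert, log_error,
                 clock=time.monotonic, log_interval_s=1.0):
        self._write, self._alert, self._log_error = write, alert, log_error
        self._clock, self._interval = clock, log_interval_s
        self.degraded = False
        self._last_log = None

    def append(self, data: bytes) -> bool:
        """Return True if written, False if the record was discarded."""
        try:
            self._write(data)
            return True
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise  # other I/O errors are outside this sketch
        now = self._clock()
        if not self.degraded:
            self.degraded = True
            self._alert("FDR writer degraded: ENOSPC")  # exactly once
            self._log_error("fdr write failed: ENOSPC")
            self._last_log = now
        elif now - self._last_log >= self._interval:
            self._log_error("fdr write still failing: ENOSPC")
            self._last_log = now
        return False


def full_disk(_data):
    raise OSError(errno.ENOSPC, "No space left on device")


# Fake clock: four failures at t = 0.0, 0.2, 0.4 and 1.5 seconds.
logs, alerts = [], []
ticks = iter([0.0, 0.2, 0.4, 1.5])
guard = EnospcGuard(full_disk, alerts.append, logs.append, clock=lambda: next(ticks))
results = [guard.append(b"rec") for _ in range(4)]
```

Injecting the clock keeps a test of the rate cap independent of wall time: the failures at 0.2 s and 0.4 s stay silent, and the one at 1.5 s logs again.
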
## Phase 5: Test quality

- **AC coverage**: 8 ACs per task × 3 tasks = 24 explicit AC tests; plus 5 additional invariants (frozen dataclasses, double-start, double-close idempotency, cap-policy input validation, config-schema-no-disable-flag).
- **Determinism**: tests use a busy-wait loop with an explicit timeout (`deadline = time.monotonic() + 5.0`) to wait for the writer thread to drain producer buffers. This is preferred over a fixed `time.sleep`: the wait ends as soon as the condition holds on fast machines and tolerates slow ones.
- **Production isolation**: no `monkeypatch` of stdlib unless required (AC-5 ENOSPC mocks `os.write` once to inject `errno.ENOSPC`; AC-4 of AZ-292 mocks `os.write` to inject `PermissionError` for the read-only-mount path).
- **AC-7 of AZ-292 (clean_shutdown=False on unclean teardown)**: tests choose the "no footer at all" path (allowed by spec); the production composition root can add the partial-footer path later without breaking the contract.

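
The deadline-based wait used by the tests can be sketched with a small helper. The name `wait_until` and the short poll interval are assumptions (the review describes a busy-wait loop; a brief sleep between polls is added here to avoid spinning a core), and the background thread merely stands in for the writer thread draining a buffer.

```python
import threading
import time


def wait_until(predicate, timeout_s: float = 5.0, poll_s: float = 0.005) -> bool:
    """Poll predicate until it holds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return predicate()  # one last check at the deadline


def fake_writer(sink):
    time.sleep(0.05)       # simulated drain latency
    sink.append("record")


drained = []
worker = threading.Thread(target=fake_writer, args=(drained,))
worker.start()
ok = wait_until(lambda: len(drained) == 1)
worker.join()
```

In a test this reads as `assert wait_until(lambda: writer_has_drained())`, where `writer_has_drained` is whatever condition the test observes. A timeout then shows up as a plain assertion failure rather than a hang.
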
## Phase 6: Informational findings (no blockers)

1. **Footer `bytes_written` includes the footer record itself**: the convergence loop runs once or twice in practice; documented inline. If a future test pins the exact byte total against the file, the loop is the canonical answer; there is no known edge case where it diverges (the 8-iteration cap is a safety margin).

2. **Cap policy emits `segment_rollover` records via the shared FdrClient**: those records are themselves enqueued, drained, and written to the current open segment. Under aggressive test caps (cap ≤ 1024 bytes, segment ≤ 256 bytes), the cascade of rollover-record writes can extend the cap-drop loop. In production with cap = 64 GiB and segments = 64 MiB, the cascade is negligible (≤ tens of bytes per drop). Documented as a test-only consideration in the AC-2 test comments.

3. **Contract minor bump for `flight_header` / `flight_footer`**: see Phase 2. No consumer of the previous narrow shape exists; no migration needed; the v1.0.0 draft has been overwritten in place because no record matching the prior shape has ever been emitted.

4. **`filelock` dependency replaced with `fcntl.flock`**: see Phase 3. Net effect: one fewer transitive dependency, same semantics on Linux + macOS (target platforms). Windows is explicitly unsupported on the companion onboard runtime.

## Phase 7: Verdict

**PASS_WITH_INFO**: all ACs covered, all 323 project tests green (2 pre-existing skips), no lint or formatting issues, and no undocumented contract drift (the schema bump is covered in Phase 2). Informational findings (1–4 above) are documented and require no follow-up beyond their inline notes.