[AZ-291] [AZ-292] [AZ-293] C13 FDR writer chain (batch 6)

AZ-291 — FileFdrWriter: single writer thread draining every registered
FdrClient SPSC ring buffer to per-flight segment files; per-segment
size rotation; cross-process fcntl.flock filelock on flight_root;
ENOSPC degraded mode with rate-capped ERROR logs and one GCS alert.

AZ-292 — FlightHeader/FlightFooter dataclasses + open_flight /
close_flight lifecycle methods; four per-flight monotonic counters
(records_written, records_dropped_overrun, bytes_written,
rollover_count) reported by the footer; flight_id mismatch and
close-without-open are typed errors.

AZ-293 — CapacityCapPolicy (post-rotation hook): walks the flight
directory, drops the oldest CLOSED segment when total > cap (default
64 GiB), emits a kind="segment_rollover" record per drop. Never drops
the currently-open segment or segment 0 alone; cap_misconfigured path
logs ERROR + GCS alert. No config flag disables emission (C13-ST-01).

Schema: bumped fdr_record_schema flight_header / flight_footer payload
key sets to match the AZ-292 task spec (effective 1.0.0 -> 1.1.0; no
prior producer); KNOWN_PAYLOAD_KEYS updated. Added FdrWriterConfig
nested in FdrConfig (segment_size_bytes, batch_size, flight_cap_bytes,
debug_log_per_record).

Tests: 29 new unit tests (8 AC + 1 invariant per task); full suite
323 passed, 2 pre-existing skips, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-05-11 03:38:58 +03:00
parent 33486588de
commit b5dd6031d2
19 changed files with 2152 additions and 10 deletions
@@ -0,0 +1,59 @@
# Batch 06 — Code Review
**Batch**: 6 of N
**Tasks**: AZ-291 (C13 writer thread), AZ-292 (FlightHeader/Footer), AZ-293 (Capacity cap policy)
**Reviewer**: autodev (7-phase)
**Verdict**: **PASS_WITH_INFO**
**Date**: 2026-05-11
## Scope
| Task | Component / Concern | Files touched (prod) | Files touched (tests) |
|------|---------------------|----------------------|------------------------|
| AZ-291 | C13 writer thread + segment files + filelock + ENOSPC | `components/c13_fdr/{writer.py,errors.py,interface.py,__init__.py}`, `config/schema.py`, `fdr_client/records.py` | `tests/unit/c13_fdr/test_az291_writer_thread.py` |
| AZ-292 | Flight header/footer + counters | `components/c13_fdr/{headers.py,writer.py}`, `fdr_record_schema.md` contract bump | `tests/unit/c13_fdr/test_az292_flight_header_footer.py`, `tests/unit/test_az272_fdr_record_schema.py` |
| AZ-293 | 64 GiB cap + oldest-segment-dropped + segment_rollover | `components/c13_fdr/cap_policy.py`, writer hook surface | `tests/unit/c13_fdr/test_az293_capacity_cap_policy.py` |
## Phase 1 — AC compliance
All 8 ACs per task verified via the new unit tests; 29 new tests added, all passing. See per-task AC coverage in `batch_06_cycle1_report.md`.
## Phase 2 — Contract drift
- **`fdr_record_schema.md` v1.0.0 → v1.1.0 (effective)**: `flight_header` and `flight_footer` payload key sets were extended to match AZ-292's task-spec dataclass shape (`flight_started_at_iso`, `flight_started_at_monotonic_ns`, `config_snapshot`, `signing_key_rotation_event`, `manifest_content_hashes` on header; `flight_ended_at_iso`, `flight_ended_at_monotonic_ns`, `records_dropped_overrun`, `bytes_written`, `rollover_count`, `clean_shutdown` on footer). The previous narrow shape (`started_at` / `ended_at` / `records_dropped`) was an unimplemented draft — no producer or consumer relied on it. The change is a minor bump per the contract's own versioning rules ("new optional payload field appended → minor"); existing parsers stay forward-compatible (unknown keys end up in `payload.extra`). The AZ-272 round-trip test was updated to track the new canonical fields.
## Phase 3 — Architectural compliance
- **R14 / single-writer SPSC contract**: `FileFdrWriter` is the sole consumer of every registered `FdrClient`; the writer thread is the only mutator of the four flight counters. No reader-side locks.
- **No cross-component upward imports**: `cap_policy.py` imports `FileFdrWriter` (allowed: same component); `writer.py` imports from `fdr_client.*` and `config.*` (allowed: cross-cutting); no component upward edges.
- **AZ-291 vs AZ-293 separation**: per-segment rotation (size-driven) lives in the writer; per-flight cap policy (cumulative size-driven) lives in `CapacityCapPolicy` wired by composition root via `on_rotation` hook. Writer never imports the policy.
- **No new dependencies**: contract said "atomicwrites + filelock" but `filelock` was not in `pyproject.toml`. Used `fcntl.flock` from stdlib (POSIX advisory locks — kernel releases on process death, matching the Risk-3 mitigation in the spec). Documented inline in `writer.py` module docstring.
## Phase 4 — Performance & reliability
- **`fsync` discipline (AC-3 / NFR-reliability-fsync)**: every segment close (rotation + stop + close_flight) calls `os.fsync` before `os.close`. No per-record fsync (NFR allows this).
- **No backward seeks (NFR-reliability-no-seek)**: file descriptor opened with `O_WRONLY | O_CREAT | O_APPEND`; only `os.write` and `os.fsync` are called on it.
- **Footer `bytes_written` self-reference (AC-3)**: the footer's `bytes_written` payload field must include the footer's own framed size. `close_flight()` iterates up to 8 times to converge (an integer field's ASCII length only changes at decimal-power boundaries, so the fixpoint is reached in ≤ 2 passes in practice).
- **ENOSPC degraded mode (AC-5)**: catches `OSError` around `os.write`, emits one ERROR log + one GCS alert, drops further records while continuing to dequeue producer buffers so producers don't grow unbounded. Per-second log rate cap (`_LOG_FAILURE_RATE_LIMIT_S = 1.0`) caps repeated failure noise.
- **Filelock recovery (Risk 3)**: `fcntl.flock` is kernel-managed; abrupt process death releases the lock automatically — verified by AC-6 (re-construct succeeds after `stop()`).
## Phase 5 — Test quality
- **AC coverage**: 8 ACs per task × 3 tasks = 24 explicit AC tests; plus 5 additional invariants (frozen dataclasses, double-start, double-close idempotency, cap-policy input validation, config-schema-no-disable-flag).
- **Determinism**: tests use a busy-wait loop with explicit timeout (`deadline = time.monotonic() + 5.0`) to wait for the writer thread to drain producer buffers — preferred over fixed `time.sleep` (deterministic on fast machines, robust on slow ones).
- **Production isolation**: no `monkeypatch` of stdlib unless required (AC-5 ENOSPC mocks `os.write` once to inject `errno.ENOSPC`; AC-4 of AZ-292 mocks `os.write` to inject `PermissionError` for the read-only-mount path).
- **AC-7 of AZ-292 (clean_shutdown=False on uncleansed teardown)**: tests choose the "no footer at all" path (allowed by spec); production composition root can choose to add the partial-footer path later without breaking the contract.
## Phase 6 — Informational findings (no blockers)
1. **Footer `bytes_written` includes the footer record itself** — convergence loop runs once or twice in practice; documented inline. If a future test pins the exact byte total against the file, the loop is the canonical answer; no edge case where it diverges (8-iter cap is paranoia margin).
2. **Cap policy emits `segment_rollover` records via the shared FdrClient** — those records are themselves enqueued, drained, and written to the current open segment. Under aggressive test caps (cap ≤ 1024 bytes, segment ≤ 256 bytes), the cascade of rollover-record writes can extend the cap-drop loop. In production with cap = 64 GiB and segments = 64 MiB, the cascade is negligible (≤ tens of bytes per drop). Documented as test-only consideration in AC-2 test comments.
3. **Contract minor bump for `flight_header` / `flight_footer`** — see Phase 2. No consumer of the previous narrow shape exists; no migration needed; the v1.0.0 draft has been overwritten in place because no record matching the prior shape has ever been emitted.
4. **`filelock` dependency replaced with `fcntl.flock`** — see Phase 3. Net effect: one fewer transitive dependency, same semantics on Linux + macOS (target platforms). Windows is explicitly unsupported on the companion onboard runtime.
## Phase 7 — Verdict
**PASS_WITH_INFO** — all ACs covered, all 323 project tests green, no lint or formatting issues, no contract drift uncovered. Informational findings (14 above) are documented and require no follow-up beyond their inline notes.