Decompose Step 6 snapshot: 140 task specs + contract docs

Closes out greenfield Step 6 (Decompose) for all 14 components
(C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446
plus the _dependencies_table.md and component contract documents.

State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-11 00:39:48 +03:00
parent 8171fcb29e
commit 880eabcb3f
172 changed files with 22897 additions and 35 deletions
@@ -0,0 +1,171 @@
# C13 Writer Thread + Segment File Lifecycle
**Task**: AZ-291_c13_writer_thread
**Name**: C13 Writer Thread
**Description**: Implement the single-writer thread that drains every onboard producer's `FdrClient` SPSC ring buffer and persists records to per-flight segment files on the companion's NVM. Owns segment file open/append/close, atomic per-segment rotation when the configured per-segment size cap is reached, and the cross-process FDR-root `filelock` so the operator-side post-flight reader cannot collide with an in-flight writer. This task is the foundation every other E-C13 task (header/footer accounting, 64 GB cap policy, mid-flight tile snapshot, thumbnail rate cap, takeoff abort) builds on.
**Complexity**: 5 points
**Dependencies**: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf, AZ-266_log_module, AZ-269_config_loader
**Component**: c13_fdr (epic AZ-248 / E-C13)
**Tracker**: AZ-291
**Epic**: AZ-248 (E-C13)
### Document Dependencies
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — wire format for every record this thread serialises to the segment file.
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — defines `pop_one()` / `drain()` consumer-side surface this thread invokes per registered producer.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — operational log shape this thread uses for ERROR/WARN/INFO messages (segment open/rotate/write failure).
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that carries `flight_root`, segment-size cap, and registered producer set.
## Problem
Every onboard component publishes FDR records via its `FdrClient` SPSC ring buffer (AZ-273), but those buffers are write-only from the producer side. Without a single, contract-frozen writer thread:
- Buffers fill up and overruns dominate within seconds — the AC-NEW-3 "no silent drops" guarantee is unenforceable because nothing drains them.
- No segment file ever lands on disk — post-flight retrieval has nothing to read.
- Multiple ad-hoc writers would race on segment rotation, corrupting partially-written records.
- Operator workstation reads (post-flight via E-C12) and a misbehaving "still flying" writer process would race on the FDR root without `filelock` enforcement.
This task delivers exactly one thread that owns the entire write side of the FDR.
## Outcome
- A single `FileFdrWriter` instance, constructed once per flight by the composition root, runs one background thread that consumes records from every registered producer's `FdrClient` and appends them to the current open segment file in the per-flight directory under `flight_root`.
- Segment files roll over atomically when the configured per-segment size cap is reached: the current segment is closed and `fsync`ed, the next segment is opened via `atomicwrites`, and the writer continues without dropping records or losing wire-format alignment.
- The writer holds a `filelock` on the FDR root for the entire flight; the operator-side reader (future E-C12 retrieval task) MUST acquire the same lock before reading. A second airborne writer process against the same `flight_root` fails at construction time with `FdrConcurrentWriterError`.
- A mid-flight filesystem write failure (ENOSPC, EIO) is logged via the shared logger at ERROR, and a STATUSTEXT alert is requested through the C8 GCS adapter. The writer then transitions to a degraded "drop-and-log" mode so that the rest of the system keeps emitting external positions while operators are alerted.
## Scope
### Included
- `FileFdrWriter(flight_root: Path, config: FdrWriterConfig, fdr_clients: Sequence[FdrClient], gcs_alert: Callable[[str], None])` constructor.
- `start()` method that acquires the FDR-root `filelock`, opens segment 0 at `flight_root/<flight_id>/segment-0000.fdr`, and starts the background thread.
- `stop()` method that signals the thread to drain remaining records, closes the current segment with `fsync`, releases the `filelock`, and joins.
- Background thread loop: per registered producer, call `drain(max_records=batch_size)` (batch size from config), serialise each `FdrRecord` via `fdr_record_schema.serialise`, append to the current segment with a length-prefixed framing identical to what `parse` reads, and rotate when the segment exceeds the per-segment size cap.
- Atomic per-segment rotation using `atomicwrites`: open the next segment under a temp path, swap to the canonical name only after the previous segment is closed + `fsync`ed.
- Cross-process `filelock` on `flight_root/.fdr.lock` held for the entire flight; constructor-time `FdrConcurrentWriterError` if the lock is already held.
- Mid-flight write failure handling: catch `OSError` around segment append/rotate, log ERROR via the shared logger (`kind="fdr.write_failure"`), invoke `gcs_alert(message)`, set internal `is_degraded = True`. Subsequent `drain` calls continue to dequeue records (so producer buffers don't grow unboundedly) but discard them with a per-second-rate-capped ERROR log; recovery is out of scope (operator must land + retry).
- Public read-only introspection: `current_segment_path() -> Path`, `current_segment_bytes() -> int`, `segments_written() -> int`, `is_rolling() -> bool` (true while a rotation is in progress).
- Diagnostic INFO log on `start()` and on each successful segment rotation; DEBUG log per record only when explicitly enabled in config (defaults off — DEBUG-per-record would flood at 100 Hz aggregate).
- Filesystem layout: `flight_root/<flight_id>/segment-NNNN.fdr` (4-digit zero-padded segment number, `.fdr` suffix). The `<flight_id>` directory is created on `start()` from `FlightHeader.flight_id` (header content is owned by AZ-248-2 / task #2; this task accepts the flight_id as a constructor argument or via an open-time setter).
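
The `start()` / `stop()` lifecycle and the drain loop listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation: `StubClient` and `WriterLoop` are hypothetical names, the `drain(max_records=...)` signature is taken from the Scope text, and the segment file is replaced by a plain `sink` callable so the threading behaviour can be seen in isolation.

```python
import threading
from collections import deque
from typing import Callable, List, Sequence


class StubClient:
    """Hypothetical stand-in for FdrClient; only drain() is modelled."""

    def __init__(self, records: Sequence[bytes]) -> None:
        self._queue = deque(records)

    def drain(self, max_records: int) -> List[bytes]:
        out: List[bytes] = []
        while self._queue and len(out) < max_records:
            out.append(self._queue.popleft())
        return out


class WriterLoop:
    """One background thread that drains every client into `sink`."""

    def __init__(self, clients: Sequence[StubClient],
                 sink: Callable[[bytes], None],
                 batch_size: int = 32, idle_wait_s: float = 0.005) -> None:
        self._clients = clients
        self._sink = sink
        self._batch_size = batch_size
        self._idle_wait_s = idle_wait_s  # assumed idle back-off, not from the spec
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="fdr-writer")

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()      # ask the loop to finish
        self._thread.join()   # returns only after the final drain

    def _drain_once(self) -> int:
        count = 0
        for client in self._clients:
            for record in client.drain(max_records=self._batch_size):
                self._sink(record)
                count += 1
        return count

    def _run(self) -> None:
        while not self._stop.is_set():
            if self._drain_once() == 0:
                self._stop.wait(self._idle_wait_s)  # idle: wait for data or stop
        while self._drain_once():  # final drain so stop() loses nothing
            pass


sink: List[bytes] = []
clients = [StubClient([b"%c%d" % (tag, i) for i in range(100)]) for tag in b"abc"]
loop = WriterLoop(clients, sink.append, batch_size=8)
loop.start()
loop.stop()
```

In the real writer the `sink` would be the framed append to the current segment, and `stop()` would additionally `fsync`, close the segment, and release the lock.
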
### Excluded
- `FlightHeader` / `FlightFooter` records and `records_written` / `records_dropped_overrun` accounting — owned by task #2 of this epic.
- 64 GB total-flight cap + oldest-segment-dropped policy + `kind="segment_rollover"` record emission — owned by task #3 of this epic. (This task implements per-segment-size rotation only; per-flight-cap enforcement is a higher policy layer that observes segments rolled by this task.)
- Mid-flight tile snapshot path + `kind="mid_flight_tile_snapshot"` payload handling — owned by task #4.
- Failed-tile thumbnail rate limiter + AC-8.5 `RawFrameWriteForbiddenError` enforcement — owned by task #5.
- Takeoff abort wiring on `FdrOpenError` — owned by task #6.
- Producer-side `FdrClient` ring buffer + `on_overrun` policy — owned by AZ-273 + AZ-274.
- Post-flight segment file reader — out of scope this cycle (future E-C12 task).
- `FdrRecord` schema and `serialise` / `parse` implementations — owned by AZ-272.
## Acceptance Criteria
**AC-1: Single writer thread drains every registered producer**
Given 3 `FdrClient` instances each with 100 records buffered
When `FileFdrWriter.start()` is called and the test waits 1 s
Then segment 0 on disk contains all 300 records (parsed via `fdr_record_schema.parse`); order is deterministic per producer, and interleaving across producers is allowed
**AC-2: Per-segment rotation at configured size cap**
Given `FdrWriterConfig.segment_size_bytes = 4096` and a producer enqueuing fixed-size records that cross 4096 bytes after N writes
When the writer runs
Then segment 0 on disk is at most 4096 bytes plus one record's worth of overshoot, segment 1 is opened atomically, and `parse(segment_0_bytes ++ segment_1_bytes)` yields all records in order with no truncation, no overlap, and no corruption at the rotation boundary
**AC-3: Atomic rotation does not lose records under crash**
Given a writer that has just appended a record to segment N and is mid-rotation to segment N+1
When the test simulates a crash (kill before `atomicwrites` finalises N+1)
Then on restart segment N is intact and parseable to the last record before rotation; segment N+1 either does not exist or is intact and parseable from offset 0 — there is no half-written intermediate file at the canonical segment N+1 path
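
The ordering that AC-3 relies on (make segment N durable first, then publish segment N+1 under its canonical name in one step) can be sketched with the standard library alone. This is a stand-in for illustration, not the `atomicwrites` code path; `rotate` and `fsync_dir` are hypothetical helper names, and the directory `fsync` assumes a Linux filesystem.

```python
import os
import tempfile
from pathlib import Path
from typing import BinaryIO


def fsync_dir(directory: Path) -> None:
    """Flush directory metadata so a rename survives power loss (Linux)."""
    fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def rotate(current: BinaryIO, flight_dir: Path, next_index: int) -> BinaryIO:
    # 1. Make the finished segment durable before anything else happens.
    current.flush()
    os.fsync(current.fileno())
    current.close()

    # 2. Create the next segment under a temporary name in the same directory,
    #    so the later rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=flight_dir, prefix=".segment-", suffix=".tmp")
    os.close(fd)

    # 3. Publish it under the canonical name in a single rename. A crash before
    #    this line leaves no file at the canonical segment N+1 path.
    final = flight_dir / f"segment-{next_index:04d}.fdr"
    os.replace(tmp_name, final)
    fsync_dir(flight_dir)

    return open(final, "ab")  # append-only handle for the new segment
```

A crash between steps 2 and 3 can leave a stray `.segment-*.tmp` file behind; a real implementation would need to decide whether to clean those up on the next `start()`.
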
**AC-4: Cross-process filelock prevents concurrent writers**
Given `FileFdrWriter` is running and holds the lock at `flight_root/.fdr.lock`
When a second `FileFdrWriter` constructor is called against the same `flight_root`
Then the second constructor raises `FdrConcurrentWriterError` and does NOT create a second writer thread or touch any segment file
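
The behaviour in AC-4 can be illustrated with a non-blocking advisory lock on `flight_root/.fdr.lock`. The sketch below uses `fcntl.flock` from the standard library as a stand-in for the `filelock` package, so it is Unix-only; `acquire_root_lock` and `release_root_lock` are hypothetical helper names, and the base class chosen for `FdrConcurrentWriterError` is an assumption.

```python
import fcntl
import os
from pathlib import Path


class FdrConcurrentWriterError(RuntimeError):
    """Another writer already holds the FDR-root lock (base class assumed)."""


def acquire_root_lock(flight_root: Path) -> int:
    """Take an exclusive, non-blocking lock; return the descriptor holding it."""
    fd = os.open(flight_root / ".fdr.lock", os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError as exc:
        os.close(fd)  # do not leak the descriptor on failure
        raise FdrConcurrentWriterError(f"{flight_root} is locked") from exc
    return fd


def release_root_lock(fd: int) -> None:
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

The lock belongs to the open file description, so a second `acquire_root_lock` call fails while the first descriptor is open, and succeeds again once it has been released or its owning process has exited.
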
**AC-5: Mid-flight ENOSPC degrades gracefully + alerts via GCS**
Given the writer is running and the underlying filesystem returns `OSError(ENOSPC)` on the next segment append
When the writer encounters the failure
Then (a) one ERROR log record is emitted with `kind="fdr.write_failure"` carrying `errno=ENOSPC`, (b) `gcs_alert(message)` is invoked exactly once with a message identifying the failure, (c) `is_degraded` becomes True, (d) subsequent `drain` calls still dequeue from the producer buffers (no unbounded growth on the producer side), (e) the per-second ERROR-log cap kicks in if the failure repeats (≤ 1 ERROR/sec related to write failures)
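
Parts (a) through (d) of AC-5 can be sketched as a thin wrapper around the segment append. Everything here is illustrative: `DegradingAppender` is a hypothetical name, the log and alert message formats are assumptions, and the per-second cap on repeated ERROR logs from part (e) is deliberately left out.

```python
import errno
from typing import Callable, Iterable


class DegradingAppender:
    """Hypothetical helper: wraps the segment append and tracks degraded mode."""

    def __init__(self, append: Callable[[bytes], None],
                 gcs_alert: Callable[[str], None],
                 log_error: Callable[[str], None]) -> None:
        self._append = append
        self._gcs_alert = gcs_alert
        self._log_error = log_error
        self.is_degraded = False

    def write_batch(self, frames: Iterable[bytes]) -> int:
        """Consume every frame; return how many reached the segment."""
        written = 0
        for frame in frames:
            if self.is_degraded:
                continue  # already dequeued from the producer; discarded here
            try:
                self._append(frame)
                written += 1
            except OSError as exc:
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                self.is_degraded = True
                # Message formats below are placeholders, not the log schema.
                self._log_error(f"kind=fdr.write_failure errno={name}")
                self._gcs_alert(f"FDR write failure ({name}); recording degraded")
        return written
```

Because the batch is always consumed in full, the producer side keeps being drained even after the first failure, which is the property part (d) asks for.
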
**AC-6: stop() drains, fsyncs, releases lock**
Given a running writer with N records buffered across all producers
When `stop()` is called
Then (a) all N records are appended and `fsync`ed before the method returns, (b) the FDR-root `filelock` is released (a subsequent constructor against the same `flight_root` succeeds), (c) the current segment file is closed and not held open by any descriptor
**AC-7: Segment file layout is exactly `<flight_id>/segment-NNNN.fdr`**
Given `flight_id="abc123-def4-..."` and 3 segment rotations during the flight
When `stop()` returns
Then `flight_root/abc123-def4-.../` contains exactly `segment-0000.fdr`, `segment-0001.fdr`, `segment-0002.fdr`, `segment-0003.fdr` (and nothing else from this writer); each is independently parseable as a stream of length-prefixed `FdrRecord`s
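
The naming rule in AC-7 reduces to one format string. A small sketch, where `segment_path` is a hypothetical helper name and rejecting indices above 9999 is an assumption (the spec does not say what happens past four digits):

```python
from pathlib import Path


def segment_path(flight_root: Path, flight_id: str, index: int) -> Path:
    """Build flight_root/<flight_id>/segment-NNNN.fdr (4-digit, zero-padded)."""
    if not 0 <= index <= 9999:
        raise ValueError("segment index does not fit the 4-digit layout")
    return flight_root / flight_id / f"segment-{index:04d}.fdr"


# Three rotations after segment 0 give four files: indices 0 through 3.
names = [segment_path(Path("/fdr"), "flight-a", i).name for i in range(4)]
```
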
**AC-8: Steady-state writer thread does not block any producer**
Given a producer enqueuing at 200 Hz steady-state and a writer thread that takes 4 ms to serialise + append a record (under the 5 ms per-record period at 200 Hz)
When the test runs for 60 s
Then the producer's `FdrClient` reports zero `EnqueueResult.OVERRUN` results from this scenario (the writer keeps up with steady state; overrun under burst is a separate concern owned by AZ-273 + AZ-274)
## Non-Functional Requirements
**Performance**
- Aggregate writer throughput ≥ 200 Hz sustained on Tier-2 (Jetson Orin Nano Super) under the workload defined by C13-PT-01 (~100 Hz combined producer rate). Headroom of 2× is the design margin.
- Per-record serialise + append p95 ≤ 5 ms (matches C13-PT-01 budget).
- Segment rotation completes in ≤ 50 ms p99 (at the ~100 Hz combined producer rate, a 50 ms stall queues about 5 records, which must fit within the producer ring-buffer headroom).
- `start()` returns within 100 ms after segment 0 is open and the thread is running (not blocking takeoff readiness).
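
The performance figures above are related by simple arithmetic, worked through here as a check. The variable names are illustrative, and the catch-up estimate assumes the writer runs at exactly the 200 Hz floor once rotation finishes.

```python
PRODUCER_HZ = 100        # combined producer rate from C13-PT-01
WRITER_HZ = 200          # required sustained writer throughput
ROTATION_P99_MS = 50     # rotation budget

# Time available per record when writing at 200 Hz.
per_record_budget_ms = 1000 / WRITER_HZ

# Design margin of the writer over the producers.
headroom = WRITER_HZ / PRODUCER_HZ

# Records that queue up in producer buffers during one worst-case rotation.
backlog_records = PRODUCER_HZ * ROTATION_P99_MS / 1000

# Time to clear that backlog while producers keep enqueuing at 100 Hz.
catch_up_ms = backlog_records * 1000 / (WRITER_HZ - PRODUCER_HZ)
```

Under these assumptions the per-record budget is 5 ms, the margin is 2×, a worst-case rotation queues 5 records, and the writer needs a further 50 ms to work that backlog off.
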
**Reliability**
- The writer thread NEVER raises into the constructor's caller after `start()` returns. All runtime errors are caught and either (a) logged + degraded, or (b) coerced into a `stop()`-and-rethrow path that the composition root observes via a documented exit hook.
- Segment files are append-only between rotations: the writer NEVER seeks backward, NEVER overwrites a closed segment, NEVER truncates the current segment.
- `fsync` is called after every segment rotation (so a power loss preserves all closed segments). Per-record `fsync` is NOT required; the per-segment cap is the durability boundary.
**Concurrency**
- The writer thread is the ONLY consumer of every registered producer's `FdrClient` (matches AZ-273's SPSC contract — each `FdrClient` has exactly one consumer thread; this is it).
- The `start()` / `stop()` methods are NOT thread-safe with respect to each other; the composition root calls each exactly once per `FileFdrWriter` lifetime.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | 3 FdrClients × 100 buffered records → start writer, wait, parse segment 0 | All 300 records present, in-order per-producer |
| AC-2 | segment_size_bytes=4096; emit fixed-size records across the cap | Segment 0 ≤ 4096 + 1 record overshoot; segment 1 contains the rest; concatenated parse yields all records in order |
| AC-3 | Kill writer mid-rotation (after segment N close, before segment N+1 finalise) | On restart, segment N parses cleanly; segment N+1 is either absent or parseable from offset 0 |
| AC-4 | Two FileFdrWriter constructors against the same flight_root | Second raises `FdrConcurrentWriterError`; first remains untouched |
| AC-5 | Inject `OSError(ENOSPC)` on segment append | One ERROR log; gcs_alert called once; is_degraded=True; producers still drained; subsequent failures log-rate-capped |
| AC-6 | stop() with N records buffered | All N records on disk; fsync called; filelock released |
| AC-7 | Run a 3-rotation flight, inspect filesystem | Exactly 4 files: `segment-0000.fdr` through `segment-0003.fdr` |
| AC-8 | 200 Hz producer, 60 s, writer running | Zero overrun results from steady-state load |
| NFR-perf-throughput | C13-PT-01 microbench | ≥ 200 Hz sustained on Tier-2 |
| NFR-perf-rotation | Microbench rotation step | p99 ≤ 50 ms |
| NFR-reliability-fsync | Track fsync calls during a 5-segment flight | fsync called once per segment close |
| NFR-reliability-no-seek | Open the segment file with a tracing layer; assert no `lseek` backward | No backward seeks observed |
## Constraints
- One concrete writer per project (`FileFdrWriter`). Architecture description.md calls for a "single concrete `FileFdrWriter` behind a `FdrWriter` interface": the interface is the boundary the composition root injects against, but only one implementation exists this cycle, and no further `FdrWriter` Protocol abstraction is added unless and until a second writer is needed.
- Segment files use the same wire format as `serialise` / `parse` from AZ-272 (fdr_record_schema). The framing on disk is length-prefixed records back-to-back (length is a `uint32` little-endian header before each `serialise`d byte string); the framing is documented in the implementation report and is internal to C13 — no separate contract file this cycle.
- Dependencies pinned at AZ-263 / E-BOOT only: `atomicwrites`, `filelock`. No new project dependency is introduced by this task.
- The per-segment size cap and batch size for `drain()` are config-driven via `FdrWriterConfig` from `composition_root_protocol`; defaults are documented in the implementation report and chosen so steady-state Tier-2 throughput passes C13-PT-01.
- The writer thread runs at NORMAL priority. No real-time scheduling. The "writer must keep up at 200 Hz" budget is met by serialisation efficiency, not by priority elevation.
- Cross-process safety is `flight_root`-scoped, not segment-scoped. The lock is acquired ONCE on `start()` and released ONCE on `stop()`.
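
The on-disk framing described above (a `uint32` little-endian length before each serialised record) can be sketched with the `struct` module. `frame` and `iter_frames` are hypothetical names; the payloads are opaque bytes standing in for the output of `serialise`, and raising `ValueError` on a truncated tail is an assumption about how a reader would treat a torn final record.

```python
import io
import struct
from typing import BinaryIO, Iterator

_LEN = struct.Struct("<I")  # uint32, little-endian length header


def frame(payload: bytes) -> bytes:
    """Prefix one serialised record with its byte length."""
    return _LEN.pack(len(payload)) + payload


def iter_frames(stream: BinaryIO) -> Iterator[bytes]:
    """Yield record payloads from a segment until a clean end of file."""
    while True:
        header = stream.read(_LEN.size)
        if not header:
            return  # clean end of segment
        if len(header) < _LEN.size:
            raise ValueError("truncated length header")
        (length,) = _LEN.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        yield payload


segment = frame(b"alpha") + frame(b"") + frame(b"gamma!")
decoded = list(iter_frames(io.BytesIO(segment)))
```

Because every record carries its own length, each segment can be parsed independently from offset 0, which is what AC-7 requires.
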
## Risks & Mitigation
**Risk 1: `atomicwrites` fsyncs the directory on Linux but the underlying filesystem doesn't honour it**
- *Risk*: The Tier-2 filesystem (likely ext4 on the Jetson NVM) honours `fsync` but in degraded conditions (e.g. overlayfs, tmpfs for fixtures) the rotation atomicity guarantee weakens.
- *Mitigation*: AC-3 explicitly tests under a real ext4 mount (or `tmpfs` with documented caveat); the implementation report documents the supported filesystem set.
**Risk 2: Single writer thread becomes a bottleneck when a producer suddenly bursts**
- *Risk*: The writer thread serves N producers serially within a `drain` loop; one bursting producer's backlog can starve the others.
- *Mitigation*: `drain(max_records=batch_size)` enforces fair round-robin across producers — each producer's batch is bounded so no single producer monopolises a tick. AC-8 measures steady-state behaviour; burst-handling lives in producer-side overrun policy (AZ-274).
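
The fairness argument in the mitigation can be seen in a few lines. In this sketch `FakeClient` is a hypothetical stand-in for `FdrClient` that models only `drain(max_records=...)`, and `drain_tick` is an illustrative name for one pass over all producers.

```python
from collections import deque
from typing import List, Sequence


class FakeClient:
    """Hypothetical stand-in for FdrClient; only drain() is modelled."""

    def __init__(self, records: Sequence[str]) -> None:
        self._queue = deque(records)

    def drain(self, max_records: int) -> List[str]:
        out: List[str] = []
        while self._queue and len(out) < max_records:
            out.append(self._queue.popleft())
        return out


def drain_tick(clients: Sequence[FakeClient], batch_size: int) -> List[str]:
    """One round-robin pass: each producer contributes at most batch_size."""
    tick: List[str] = []
    for client in clients:
        tick.extend(client.drain(max_records=batch_size))
    return tick


bursty = FakeClient([f"a{i}" for i in range(10)])
quiet = FakeClient(["b0", "b1"])
first = drain_tick([bursty, quiet], batch_size=3)
second = drain_tick([bursty, quiet], batch_size=3)
```

The quiet producer's two records are written in the first tick even though the bursty producer still has seven records waiting, so the bound on each batch is what prevents starvation.
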
**Risk 3: `filelock` held across an unclean exit leaves the flight_root locked**
- *Risk*: Companion process killed (e.g. brownout) without `stop()` running; next boot finds the lock file present and refuses to construct a new writer.
- *Mitigation*: On Linux, `filelock` takes an advisory `flock` lock through Python's `fcntl` module, and the kernel releases it automatically when the process dies. The lock file itself may linger but the lock state does not. Documented in the implementation report; AC-4 verifies the live-process case.
**Risk 4: ENOSPC degraded mode produces unbounded log records**
- *Risk*: A persistent ENOSPC under sustained load could log 200/sec.
- *Mitigation*: Per-second rate cap on `kind="fdr.write_failure"` ERROR records (AC-5e). The first failure is always emitted; subsequent failures within the same second are coalesced.
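
One way to read the rate cap in the mitigation is "at most one ERROR per interval, counting the rest". The sketch below takes that reading; `PerSecondGate` is a hypothetical name, the injectable clock exists only to make the behaviour testable, and measuring the interval from the last emitted log (rather than wall-clock second boundaries) is an assumption.

```python
import time
from typing import Callable, Optional


class PerSecondGate:
    """Allow one event per interval; count the ones that were coalesced."""

    def __init__(self, clock: Callable[[], float] = time.monotonic,
                 interval_s: float = 1.0) -> None:
        self._clock = clock
        self._interval_s = interval_s
        self._last_emit: Optional[float] = None
        self.coalesced = 0  # failures swallowed since the gate was created

    def allow(self) -> bool:
        now = self._clock()
        if self._last_emit is None or now - self._last_emit >= self._interval_s:
            self._last_emit = now
            return True   # first failure, or the interval has elapsed
        self.coalesced += 1
        return False      # inside the interval: coalesce instead of logging


ticks = iter([0.0, 0.1, 0.9, 1.0, 1.5])
gate = PerSecondGate(clock=lambda: next(ticks))
decisions = [gate.allow() for _ in range(5)]
```

The writer would call `allow()` before each `kind="fdr.write_failure"` log and could include the `coalesced` count in the next emitted record so that suppressed failures remain visible.
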
## Runtime Completeness
- **Named capability**: single-writer thread + segment file lifecycle (architecture / E-C13 / AC-NEW-3 every-payload-class-from-t=0; no silent drops).
- **Production code that must exist**: real background thread, real `drain` loop across registered FdrClients, real segment file open/append/close with `atomicwrites`, real `filelock` acquire/release on `flight_root`, real ENOSPC handler with shared-logger ERROR + GCS alert.
- **Allowed external stubs**: tests MAY substitute a `FakeGcsAlert` (collects messages); production wiring uses the real C8 GCS adapter via the composition root.
- **Unacceptable substitutes**: `time.sleep`-driven polling without a real producer-buffer drain, in-memory list "for now" instead of segment files on disk, `pickle` or any non-`fdr_record_schema` serialiser, omitting `fsync` ("we'll add durability later"), or omitting `filelock` ("companion is single-process anyway").