# C13 Writer Thread + Segment File Lifecycle

**Task**: AZ-291_c13_writer_thread
**Name**: C13 Writer Thread
**Description**: Implement the single-writer thread that drains every onboard producer's `FdrClient` SPSC ring buffer and persists records to per-flight segment files on the companion's NVM. Owns segment file open/append/close, atomic per-segment rotation when the configured per-segment size cap is reached, and the cross-process FDR-root `filelock` so the operator-side post-flight reader cannot collide with an in-flight writer. This task is the foundation every other E-C13 task (header/footer accounting, 64 GB cap policy, mid-flight tile snapshot, thumbnail rate cap, takeoff abort) builds on.
**Complexity**: 5 points
**Dependencies**: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf, AZ-266_log_module, AZ-269_config_loader
**Component**: c13_fdr (epic AZ-248 / E-C13)
**Tracker**: AZ-291
**Epic**: AZ-248 (E-C13)

### Document Dependencies

- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`: wire format for every record this thread serialises to the segment file.
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`: defines the `pop_one()` / `drain()` consumer-side surface this thread invokes per registered producer.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md`: operational log shape this thread uses for ERROR/WARN/INFO messages (segment open/rotate/write failure).
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md`: Config object that carries `flight_root`, segment-size cap, and registered producer set.

## Problem

Every onboard component publishes FDR records via its `FdrClient` SPSC ring buffer (AZ-273), but those buffers are write-only from the producer side.
Without a single, contract-frozen writer thread:

- Buffers fill up and overruns dominate within seconds; the AC-NEW-3 "no silent drops" guarantee is unenforceable because nothing drains them.
- No segment file ever lands on disk, so post-flight retrieval has nothing to read.
- Multiple ad-hoc writers would race on segment rotation, corrupting partially-written records.
- Operator workstation reads (post-flight via E-C12) and a misbehaving "still flying" writer process would race on the FDR root without `filelock` enforcement.

This task delivers exactly one thread that owns the entire write side of the FDR.

## Outcome

- A single `FileFdrWriter` instance, constructed once per flight by the composition root, runs one background thread that consumes records from every registered producer's `FdrClient` and appends them to the current open segment file in the per-flight directory under `flight_root`.
- Segment files roll over atomically when the configured per-segment size cap is reached: the current segment is closed and `fsync`ed, the next segment is opened via `atomicwrites`, and the writer continues without dropping records or losing wire-format alignment.
- The writer holds a `filelock` on the FDR root for the entire flight; the operator-side reader (future E-C12 retrieval task) MUST acquire the same lock before reading. A second airborne writer process against the same `flight_root` fails with a constructor-time `FdrConcurrentWriterError`.
- A mid-flight filesystem write failure (ENOSPC, EIO) is logged via the shared logger at ERROR, and a STATUSTEXT alert is requested through the C8 GCS adapter; the writer transitions to a degraded "drop-and-log" mode so the rest of the system keeps emitting external positions, but operators are alerted.

## Scope

### Included

- `FileFdrWriter(flight_root: Path, config: FdrWriterConfig, fdr_clients: Sequence[FdrClient], gcs_alert: Callable[[str], None])` constructor.
- `start()` method that opens segment 0 under `flight_root/<flight_id>/segment-0000.fdr`, acquires the FDR-root `filelock`, and starts the background thread.
- `stop()` method that signals the thread to drain remaining records, closes the current segment with `fsync`, releases the `filelock`, and joins.
- Background thread loop: per registered producer, call `drain(max_records=batch_size)` (batch size from config), serialise each `FdrRecord` via `fdr_record_schema.serialise`, append to the current segment with a length-prefixed framing identical to what `parse` reads, and rotate when the segment exceeds the per-segment size cap.
- Atomic per-segment rotation using `atomicwrites`: open the next segment under a temp path, swap to the canonical name only after the previous segment is closed + `fsync`ed.
- Cross-process `filelock` on `flight_root/.fdr.lock` held for the entire flight; constructor-time `FdrConcurrentWriterError` if the lock is already held.
- Mid-flight write failure handling: catch `OSError` around segment append/rotate, log ERROR via the shared logger (`kind="fdr.write_failure"`), invoke `gcs_alert(message)`, set internal `is_degraded = True`. Subsequent `drain` calls continue to dequeue records (so producer buffers don't grow unboundedly) but discard them with a per-second-rate-capped ERROR log; recovery is out of scope (operator must land + retry).
- Public read-only introspection: `current_segment_path() -> Path`, `current_segment_bytes() -> int`, `segments_written() -> int`, `is_rolling() -> bool` (true while a rotation is in progress).
- Diagnostic INFO log on `start()` and on each successful segment rotation; DEBUG log per record only when explicitly enabled in config (defaults off, since DEBUG-per-record would flood at 100 Hz aggregate).
- Filesystem layout: `flight_root/<flight_id>/segment-NNNN.fdr` (4-digit zero-padded segment number, `.fdr` suffix).
  The `<flight_id>` directory is created on `start()` from `FlightHeader.flight_id` (header content is owned by AZ-248-2 / task #2; this task accepts the flight_id as a constructor argument or via an open-time setter).

### Excluded

- `FlightHeader` / `FlightFooter` records and `records_written` / `records_dropped_overrun` accounting: owned by task #2 of this epic.
- 64 GB total-flight cap + oldest-segment-dropped policy + `kind="segment_rollover"` record emission: owned by task #3 of this epic. (This task implements per-segment-size rotation only; per-flight-cap enforcement is a higher policy layer that observes segments rolled by this task.)
- Mid-flight tile snapshot path + `kind="mid_flight_tile_snapshot"` payload handling: owned by task #4.
- Failed-tile thumbnail rate limiter + AC-8.5 `RawFrameWriteForbiddenError` enforcement: owned by task #5.
- Takeoff abort wiring on `FdrOpenError`: owned by task #6.
- Producer-side `FdrClient` ring buffer + `on_overrun` policy: owned by AZ-273 + AZ-274.
- Post-flight segment file reader: out of scope this cycle (future E-C12 task).
- `FdrRecord` schema and `serialise` / `parse` implementations: owned by AZ-272.

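
The framing, naming, and rotation rules from the Included list can be illustrated with a small standalone sketch. It is not the `FileFdrWriter` implementation: `SegmentAppender`, `segment_path`, and `read_frames` are hypothetical names, the payloads are assumed to be the bytes that `fdr_record_schema.serialise` would return, and a plain temp-file-plus-`os.replace` step stands in for `atomicwrites`. The length prefix follows the little-endian `uint32` framing stated under Constraints.

```python
import os
import struct
from pathlib import Path

# Little-endian uint32 length header placed before every serialised record.
LENGTH_PREFIX = struct.Struct("<I")


def segment_path(flight_dir: Path, index: int) -> Path:
    """Canonical name: 4-digit zero-padded segment number, `.fdr` suffix."""
    return flight_dir / f"segment-{index:04d}.fdr"


class SegmentAppender:
    """Hypothetical helper: append framed records and rotate at a size cap."""

    def __init__(self, flight_dir: Path, segment_size_bytes: int) -> None:
        self._dir = flight_dir
        self._cap = segment_size_bytes
        self._dir.mkdir(parents=True, exist_ok=True)
        self._index = 0
        self._bytes = 0
        self._fh = self._open_segment(0)

    def _open_segment(self, index: int):
        # Assumption: temp path + os.replace stands in for atomicwrites.
        # The canonical name only appears once the rename has happened.
        final = segment_path(self._dir, index)
        tmp = final.with_name(final.name + ".tmp")
        tmp.touch()
        os.replace(tmp, final)
        return open(final, "ab")

    def _close_current(self) -> None:
        # Flush Python's buffer, then fsync so the closed segment is durable.
        self._fh.flush()
        os.fsync(self._fh.fileno())
        self._fh.close()

    def append(self, payload: bytes) -> None:
        frame = LENGTH_PREFIX.pack(len(payload)) + payload
        self._fh.write(frame)
        self._bytes += len(frame)
        if self._bytes >= self._cap:
            # Open the next segment only after the previous one is closed
            # and fsynced; records are never split across segments.
            self._close_current()
            self._index += 1
            self._bytes = 0
            self._fh = self._open_segment(self._index)

    def close(self) -> None:
        self._close_current()


def read_frames(data: bytes) -> list:
    """Split back-to-back length-prefixed frames into their payloads."""
    payloads, offset = [], 0
    while offset < len(data):
        (length,) = LENGTH_PREFIX.unpack_from(data, offset)
        offset += LENGTH_PREFIX.size
        payloads.append(data[offset:offset + length])
        offset += length
    return payloads
```

For example, with `segment_size_bytes=256` and 100-byte payloads (104-byte frames), a segment closes after its third frame at 312 bytes, which stays within the one-record overshoot that AC-2 allows. Because this sketch rotates eagerly, a flight that stops exactly on a rotation leaves an empty trailing segment; a real writer might prefer to rotate lazily before the next append.
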
## Acceptance Criteria

**AC-1: Single writer thread drains every registered producer**
Given 3 `FdrClient` instances each with 100 records buffered
When `FileFdrWriter.start()` is called and the test waits 1 s
Then segment 0 on disk contains all 300 records (parsed via `fdr_record_schema.parse` in deterministic order per-producer, interleaving allowed across producers)

**AC-2: Per-segment rotation at configured size cap**
Given `FdrWriterConfig.segment_size_bytes = 4096` and a producer enqueuing fixed-size records that cross 4096 bytes after N writes
When the writer runs
Then segment 0 on disk is ≤ 4096 bytes (within one record's worth of overshoot), segment 1 is opened atomically, and `parse(segment_0_bytes ++ segment_1_bytes)` yields all records in order with no truncation, no overlap, and no corruption at the rotation boundary

**AC-3: Atomic rotation does not lose records under crash**
Given a writer that has just appended a record to segment N and is mid-rotation to segment N+1
When the test simulates a crash (kill before `atomicwrites` finalises N+1)
Then on restart segment N is intact and parseable to the last record before rotation; segment N+1 either does not exist or is intact and parseable from offset 0; there is no half-written intermediate file at the canonical segment N+1 path

**AC-4: Cross-process filelock prevents concurrent writers**
Given `FileFdrWriter` is running and holds the lock at `flight_root/.fdr.lock`
When a second `FileFdrWriter` constructor is called against the same `flight_root`
Then the second constructor raises `FdrConcurrentWriterError` and does NOT create a second writer thread or touch any segment file

**AC-5: Mid-flight ENOSPC degrades gracefully + alerts via GCS**
Given the writer is running and the underlying filesystem returns `OSError(ENOSPC)` on the next segment append
When the writer encounters the failure
Then (a) one ERROR log record is emitted with `kind="fdr.write_failure"` carrying `errno=ENOSPC`, (b)
`gcs_alert(message)` is invoked exactly once with a message identifying the failure, (c) `is_degraded` becomes True, (d) subsequent `drain` calls still dequeue from the producer buffers (no unbounded growth on the producer side), (e) the per-second ERROR-log cap kicks in if the failure repeats (≤ 1 ERROR/sec related to write failures)

**AC-6: stop() drains, fsyncs, releases lock**
Given a running writer with N records buffered across all producers
When `stop()` is called
Then (a) all N records are appended and `fsync`ed before the method returns, (b) the FDR-root `filelock` is released (a subsequent constructor against the same `flight_root` succeeds), (c) the current segment file is closed and not held open by any descriptor

**AC-7: Segment file layout is exactly `<flight_id>/segment-NNNN.fdr`**
Given `flight_id="abc123-def4-..."` and 3 segment rotations during the flight
When `stop()` returns
Then `flight_root/abc123-def4-.../` contains exactly `segment-0000.fdr`, `segment-0001.fdr`, `segment-0002.fdr`, `segment-0003.fdr` (and nothing else from this writer); each is independently parseable as a stream of length-prefixed `FdrRecord`s

**AC-8: Steady-state writer thread does not block any producer**
Given a producer enqueuing at 200 Hz steady-state and a writer thread that takes 4 ms to serialise + append a record (under the 5 ms per-record budget)
When the test runs for 60 s
Then the producer's `FdrClient` reports zero `EnqueueResult.OVERRUN` results from this scenario (the writer keeps up with steady state; overrun under burst is a separate concern owned by AZ-273 + AZ-274)

## Non-Functional Requirements

**Performance**

- Aggregate writer throughput ≥ 200 Hz sustained on Tier-2 (Jetson Orin Nano Super) under the workload defined by C13-PT-01 (~100 Hz combined producer rate). Headroom of 2× is the design margin.
- Per-record serialise + append p95 ≤ 5 ms (matches C13-PT-01 budget).
- Segment rotation completes in ≤ 50 ms p99 (so a rotation does not stall the writer past one record's worth of producer buffer headroom).
- `start()` returns within 100 ms after segment 0 is open and the thread is running (not blocking takeoff readiness).

**Reliability**

- The writer thread NEVER raises into the constructor's caller after `start()` returns. All runtime errors are caught and either (a) logged + degraded, or (b) coerced into a `stop()`-and-rethrow path that the composition root observes via a documented exit hook.
- Segment files are append-only between rotations: the writer NEVER seeks backward, NEVER overwrites a closed segment, NEVER truncates the current segment.
- `fsync` is called after every segment rotation (so a power loss preserves all closed segments). Per-record `fsync` is NOT required; the per-segment cap is the durability boundary.

**Concurrency**

- The writer thread is the ONLY consumer of every registered producer's `FdrClient` (matches AZ-273's SPSC contract: each `FdrClient` has exactly one consumer thread, and this is it).
- The `start()` / `stop()` methods are NOT thread-safe with respect to each other; the composition root calls each exactly once per `FileFdrWriter` lifetime.

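
The "logged + degraded" path can be sketched as a small state holder that alerts once and then rate-caps its ERROR output. This is a minimal illustration under stated assumptions, not the writer's actual error path: `WriteFailureHandler` and its counters are hypothetical, stdlib `logging` stands in for the shared logger, and the clock is injectable only so the one-per-second cap can be exercised deterministically.

```python
import errno
import logging
import time
from typing import Callable, Optional

# Assumption: stdlib logging stands in for the project's shared logger.
log = logging.getLogger("fdr.writer")


class WriteFailureHandler:
    """Hypothetical helper: one GCS alert, then at most one ERROR per second."""

    def __init__(
        self,
        gcs_alert: Callable[[str], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._gcs_alert = gcs_alert
        self._clock = clock
        self._last_log_at: Optional[float] = None
        self.is_degraded = False
        self.errors_logged = 0   # ERROR records actually emitted
        self.suppressed = 0      # failures coalesced since the last ERROR

    def on_failure(self, exc: OSError) -> None:
        now = self._clock()
        if not self.is_degraded:
            # First failure: flip to degraded mode and alert exactly once.
            self.is_degraded = True
            self._gcs_alert(f"FDR write failure (errno={exc.errno})")
        if self._last_log_at is None or now - self._last_log_at >= 1.0:
            log.error(
                "segment write failed; %d earlier failures coalesced",
                self.suppressed,
                extra={"kind": "fdr.write_failure", "errno": exc.errno},
            )
            self._last_log_at = now
            self.errors_logged += 1
            self.suppressed = 0
        else:
            # Within the same one-second window: count it, do not log it.
            self.suppressed += 1


def try_append(append: Callable[[bytes], None], payload: bytes,
               handler: WriteFailureHandler) -> bool:
    """Append unless degraded; never let OSError escape the writer thread."""
    if handler.is_degraded:
        return False  # drop-and-log mode: the record is discarded
    try:
        append(payload)
        return True
    except OSError as exc:
        handler.on_failure(exc)
        return False
```

The first failure is always logged because `_last_log_at` starts as `None`; later failures inside the same second only increment `suppressed`, which is reported in the next ERROR record that is allowed through.
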
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | 3 FdrClients × 100 buffered records → start writer, wait, parse segment 0 | All 300 records present, in-order per-producer |
| AC-2 | segment_size_bytes=4096; emit fixed-size records across the cap | Segment 0 ≤ 4096 + 1 record overshoot; segment 1 contains the rest; concatenated parse yields all records in order |
| AC-3 | Kill writer mid-rotation (after segment N close, before segment N+1 finalise) | On restart, segment N parses cleanly; segment N+1 is either absent or parseable from offset 0 |
| AC-4 | Two FileFdrWriter constructors against the same flight_root | Second raises `FdrConcurrentWriterError`; first remains untouched |
| AC-5 | Inject `OSError(ENOSPC)` on segment append | One ERROR log; gcs_alert called once; is_degraded=True; producers still drained; subsequent failures log-rate-capped |
| AC-6 | stop() with N records buffered | All N records on disk; fsync called; filelock released |
| AC-7 | Run a 3-rotation flight, inspect filesystem | Exactly 4 files: `segment-0000.fdr` through `segment-0003.fdr` |
| AC-8 | 200 Hz producer, 60 s, writer running | Zero overrun results from steady-state load |
| NFR-perf-throughput | C13-PT-01 microbench | ≥ 200 Hz sustained on Tier-2 |
| NFR-perf-rotation | Microbench rotation step | p99 ≤ 50 ms |
| NFR-reliability-fsync | Track fsync calls during a 5-segment flight | fsync called once per segment close |
| NFR-reliability-no-seek | Open the segment file with a tracing layer; assert no `lseek` backward | No backward seeks observed |

## Constraints

- One concrete writer per project (`FileFdrWriter`); no `FdrWriter` Protocol abstraction unless and until a second writer is needed (per architecture description.md "single concrete `FileFdrWriter` behind a `FdrWriter` interface"; the interface is the boundary the composition root injects against, but only one implementation exists this cycle).
- Segment files use the same wire format as `serialise` / `parse` from AZ-272 (fdr_record_schema). The framing on disk is length-prefixed records back-to-back (length is a `uint32` little-endian header before each `serialise`d byte string); the framing is documented in the implementation report and is internal to C13, so there is no separate contract file this cycle.
- Dependencies pinned at AZ-263 / E-BOOT only: `atomicwrites`, `filelock`. No new project dependency is introduced by this task.
- The per-segment size cap and batch size for `drain()` are config-driven via `FdrWriterConfig` from `composition_root_protocol`; defaults are documented in the implementation report and chosen so steady-state Tier-2 throughput passes C13-PT-01.
- The writer thread runs at NORMAL priority. No real-time scheduling. The "writer must keep up at 200 Hz" budget is met by serialisation efficiency, not by priority elevation.
- Cross-process safety is `flight_root`-scoped, not segment-scoped. The lock is acquired ONCE on `start()` and released ONCE on `stop()`.

## Risks & Mitigation

**Risk 1: `atomicwrites` fsyncs the directory on Linux but the underlying filesystem doesn't honour it**

- *Risk*: The Tier-2 filesystem (likely ext4 on the Jetson NVM) honours `fsync` but in degraded conditions (e.g. overlayfs, tmpfs for fixtures) the rotation atomicity guarantee weakens.
- *Mitigation*: AC-3 explicitly tests under a real ext4 mount (or `tmpfs` with documented caveat); the implementation report documents the supported filesystem set.

**Risk 2: Single writer thread becomes a bottleneck when a producer suddenly bursts**

- *Risk*: The writer thread serves N producers serially within a `drain` loop; one slow producer's records starve others.
- *Mitigation*: `drain(max_records=batch_size)` enforces fair round-robin across producers: each producer's batch is bounded so no single producer monopolises a tick.
  AC-8 measures steady-state behaviour; burst-handling lives in producer-side overrun policy (AZ-274).

**Risk 3: `filelock` held across an unclean exit leaves the flight_root locked**

- *Risk*: Companion process killed (e.g. brownout) without `stop()` running; next boot finds the lock file present and refuses to construct a new writer.
- *Mitigation*: On Linux, `filelock` takes an advisory `flock` lock through Python's `fcntl` module; the kernel releases it automatically when the holding process dies. The lock file itself may linger but the lock state does not. Documented in the implementation report; AC-4 verifies the live-process case.

**Risk 4: ENOSPC degraded mode produces unbounded log records**

- *Risk*: A persistent ENOSPC under sustained load could log 200/sec.
- *Mitigation*: Per-second rate cap on `kind="fdr.write_failure"` ERROR records (AC-5e). The first failure is always emitted; subsequent failures within the same second are coalesced.

## Runtime Completeness

- **Named capability**: single-writer thread + segment file lifecycle (architecture / E-C13 / AC-NEW-3 every-payload-class-from-t=0; no silent drops).
- **Production code that must exist**: real background thread, real `drain` loop across registered FdrClients, real segment file open/append/close with `atomicwrites`, real `filelock` acquire/release on `flight_root`, real ENOSPC handler with shared-logger ERROR + GCS alert.
- **Allowed external stubs**: tests MAY substitute a `FakeGcsAlert` (collects messages); production wiring uses the real C8 GCS adapter via the composition root.
- **Unacceptable substitutes**: `time.sleep`-driven polling without a real producer-buffer drain, in-memory list "for now" instead of segment files on disk, `pickle` or any non-`fdr_record_schema` serialiser, omitting `fsync` ("we'll add durability later"), or omitting `filelock` ("companion is single-process anyway").
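
The bounded round-robin drain named in Risk 2 and required by the runtime-completeness list can be sketched as follows. This is an illustration only: `FakeFdrClient`, `drain_tick`, and `drain_until_empty` are hypothetical names, and the sketch assumes `drain(max_records=...)` returns a list of already-dequeued records, which the consumer protocol document may define differently.

```python
from collections import deque
from typing import Deque, Iterable, List, Sequence


class FakeFdrClient:
    """Test stand-in; only the drain(max_records=...) call comes from the spec."""

    def __init__(self, records: Iterable[bytes]) -> None:
        self._ring: Deque[bytes] = deque(records)

    def drain(self, max_records: int) -> List[bytes]:
        # Assumption: drain returns up to max_records in FIFO order.
        out: List[bytes] = []
        while self._ring and len(out) < max_records:
            out.append(self._ring.popleft())
        return out


def drain_tick(clients: Sequence[FakeFdrClient], batch_size: int) -> List[bytes]:
    """One pass over every producer, taking at most batch_size from each."""
    drained: List[bytes] = []
    for client in clients:
        drained.extend(client.drain(max_records=batch_size))
    return drained


def drain_until_empty(clients: Sequence[FakeFdrClient], batch_size: int) -> List[bytes]:
    """Repeat ticks until a full pass yields nothing (the stop() drain case)."""
    everything: List[bytes] = []
    while True:
        batch = drain_tick(clients, batch_size)
        if not batch:
            return everything
        everything.extend(batch)
```

Because each producer contributes at most `batch_size` records per tick, a bursting producer delays a quiet one by no more than one bounded batch, and per-producer order is preserved while interleaving across producers is allowed, as AC-1 permits.
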