gps-denied-onboard/_docs/02_tasks/done/AZ-291_c13_writer_thread.md
Oleksandr Bezdieniezhnykh b5dd6031d2 [AZ-291] [AZ-292] [AZ-293] C13 FDR writer chain (batch 6)
AZ-291 — FileFdrWriter: single writer thread draining every registered
FdrClient SPSC ring buffer to per-flight segment files; per-segment
size rotation; cross-process fcntl.flock filelock on flight_root;
ENOSPC degraded mode with rate-capped ERROR logs and one GCS alert.

AZ-292 — FlightHeader/FlightFooter dataclasses + open_flight /
close_flight lifecycle methods; four per-flight monotonic counters
(records_written, records_dropped_overrun, bytes_written,
rollover_count) reported by the footer; flight_id mismatch and
close-without-open are typed errors.

AZ-293 — CapacityCapPolicy (post-rotation hook): walks the flight
directory, drops the oldest CLOSED segment when total > cap (default
64 GiB), emits a kind="segment_rollover" record per drop. Never drops
the currently-open segment or segment 0 alone; cap_misconfigured path
logs ERROR + GCS alert. No config flag disables emission (C13-ST-01).

Schema: bumped fdr_record_schema flight_header / flight_footer payload
key sets to match the AZ-292 task spec (effective 1.0.0 -> 1.1.0; no
prior producer); KNOWN_PAYLOAD_KEYS updated. Added FdrWriterConfig
nested in FdrConfig (segment_size_bytes, batch_size, flight_cap_bytes,
debug_log_per_record).

Tests: 29 new unit tests (8 AC + 1 invariant per task); full suite
323 passed, 2 pre-existing skips, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 03:38:58 +03:00


C13 Writer Thread + Segment File Lifecycle

Task: AZ-291_c13_writer_thread
Name: C13 Writer Thread
Description: Implement the single-writer thread that drains every onboard producer's FdrClient SPSC ring buffer and persists records to per-flight segment files on the companion's NVM. Owns segment file open/append/close, atomic per-segment rotation when the configured per-segment size cap is reached, and the cross-process FDR-root filelock so the operator-side post-flight reader cannot collide with an in-flight writer. This task is the foundation every other E-C13 task (header/footer accounting, 64 GB cap policy, mid-flight tile snapshot, thumbnail rate cap, takeoff abort) builds on.
Complexity: 5 points
Dependencies: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf, AZ-266_log_module, AZ-269_config_loader
Component: c13_fdr (epic AZ-248 / E-C13)
Tracker: AZ-291
Epic: AZ-248 (E-C13)

Document Dependencies

  • _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md — wire format for every record this thread serialises to the segment file.
  • _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md — defines pop_one() / drain() consumer-side surface this thread invokes per registered producer.
  • _docs/02_document/contracts/shared_logging/log_record_schema.md — operational log shape this thread uses for ERROR/WARN/INFO messages (segment open/rotate/write failure).
  • _docs/02_document/contracts/shared_config/composition_root_protocol.md — Config object that carries flight_root, segment-size cap, and registered producer set.

Problem

Every onboard component publishes FDR records via its FdrClient SPSC ring buffer (AZ-273), but those buffers are write-only from the producer side. Without a single, contract-frozen writer thread:

  • Buffers fill up and overruns dominate within seconds — the AC-NEW-3 "no silent drops" guarantee is unenforceable because nothing drains them.
  • No segment file ever lands on disk — post-flight retrieval has nothing to read.
  • Multiple ad-hoc writers would race on segment rotation, corrupting partially-written records.
  • Operator workstation reads (post-flight via E-C12) and a misbehaving "still flying" writer process would race on the FDR root without filelock enforcement.

This task delivers exactly one thread that owns the entire write side of the FDR.

Outcome

  • A single FileFdrWriter instance, constructed once per flight by the composition root, runs one background thread that consumes records from every registered producer's FdrClient and appends them to the current open segment file in the per-flight directory under flight_root.
  • Segment files roll over atomically when the configured per-segment size cap is reached: the current segment is closed and fsynced, the next segment is opened via atomicwrites, and the writer continues without dropping records or losing wire-format alignment.
  • The FDR root holds a filelock for the entire flight; the operator-side reader (future E-C12 retrieval task) MUST acquire the same lock before reading. Constructing a second airborne writer against the same flight_root raises FdrConcurrentWriterError at construction time.
  • A mid-flight filesystem write failure (ENOSPC, EIO) is logged via the shared logger at ERROR, and a STATUSTEXT alert is requested through the C8 GCS adapter. The writer then transitions to a degraded "drop-and-log" mode, so the rest of the system keeps emitting external positions while operators are alerted.

Scope

Included

  • FileFdrWriter(flight_root: Path, config: FdrWriterConfig, fdr_clients: Sequence[FdrClient], gcs_alert: Callable[[str], None]) constructor.
  • start() method that opens segment 0 under flight_root/<flight_id>/segment-0000.fdr, acquires the FDR-root filelock, and starts the background thread.
  • stop() method that signals the thread to drain remaining records, closes the current segment with fsync, releases the filelock, and joins.
  • Background thread loop: per registered producer, call drain(max_records=batch_size) (batch size from config), serialise each FdrRecord via fdr_record_schema.serialise, append to the current segment with a length-prefixed framing identical to what parse reads, and rotate when the segment exceeds the per-segment size cap.
  • Atomic per-segment rotation using atomicwrites: open the next segment under a temp path, swap to the canonical name only after the previous segment is closed + fsynced.
  • Cross-process filelock on flight_root/.fdr.lock held for the entire flight; constructor-time FdrConcurrentWriterError if the lock is already held.
  • Mid-flight write failure handling: catch OSError around segment append/rotate, log ERROR via the shared logger (kind="fdr.write_failure"), invoke gcs_alert(message), set internal is_degraded = True. Subsequent drain calls continue to dequeue records (so producer buffers don't grow unboundedly) but discard them with a per-second-rate-capped ERROR log; recovery is out of scope (operator must land + retry).
  • Public read-only introspection: current_segment_path() -> Path, current_segment_bytes() -> int, segments_written() -> int, is_rolling() -> bool (true while a rotation is in progress).
  • Diagnostic INFO log on start() and on each successful segment rotation; DEBUG log per record only when explicitly enabled in config (defaults off — DEBUG-per-record would flood at 100 Hz aggregate).
  • Filesystem layout: flight_root/<flight_id>/segment-NNNN.fdr (4-digit zero-padded segment number, .fdr suffix). The <flight_id> directory is created on start() from FlightHeader.flight_id (header content is owned by AZ-248-2 / task #2; this task accepts the flight_id as a constructor argument or via an open-time setter).
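
The mid-flight write failure bullet above can be sketched in a few lines. This is a minimal illustration, not the shipped implementation: `DegradingAppender`, its `write` method, the `records_discarded` counter, and the alert message text are hypothetical names introduced here, and the shared-logger ERROR record is left out to keep the sketch short.

```python
import errno
from typing import Callable


class DegradingAppender:
    """Hypothetical helper: wraps an append callable and degrades on OSError."""

    def __init__(self, append: Callable[[bytes], None],
                 gcs_alert: Callable[[str], None]) -> None:
        self._append = append
        self._gcs_alert = gcs_alert
        self.is_degraded = False
        self.records_discarded = 0  # illustrative counter, not in the spec

    def write(self, payload: bytes) -> bool:
        """Return True if the payload was appended, False if it was discarded."""
        if self.is_degraded:
            # Degraded mode: the caller keeps dequeuing, the record is dropped.
            self.records_discarded += 1
            return False
        try:
            self._append(payload)
            return True
        except OSError as exc:
            # First failure: flip the flag and alert the GCS exactly once.
            self.is_degraded = True
            self.records_discarded += 1
            name = errno.errorcode.get(exc.errno, "UNKNOWN")
            self._gcs_alert(f"FDR write failure: {name}")
            return False
```

Because the flag is checked before the append is attempted, the failing filesystem call is not retried for every record, and the alert callable fires only on the transition into degraded mode.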

Excluded

  • FlightHeader / FlightFooter records and records_written / records_dropped_overrun accounting — owned by task #2 of this epic.
  • 64 GB total-flight cap + oldest-segment-dropped policy + kind="segment_rollover" record emission — owned by task #3 of this epic. (This task implements per-segment-size rotation only; per-flight-cap enforcement is a higher policy layer that observes segments rolled by this task.)
  • Mid-flight tile snapshot path + kind="mid_flight_tile_snapshot" payload handling — owned by task #4.
  • Failed-tile thumbnail rate limiter + AC-8.5 RawFrameWriteForbiddenError enforcement — owned by task #5.
  • Takeoff abort wiring on FdrOpenError — owned by task #6.
  • Producer-side FdrClient ring buffer + on_overrun policy — owned by AZ-273 + AZ-274.
  • Post-flight segment file reader — out of scope this cycle (future E-C12 task).
  • FdrRecord schema and serialise / parse implementations — owned by AZ-272.

Acceptance Criteria

AC-1: Single writer thread drains every registered producer
Given 3 FdrClient instances each with 100 records buffered
When FileFdrWriter.start() is called and the test waits 1 s
Then segment 0 on disk contains all 300 records (parsed via fdr_record_schema.parse, in deterministic order per producer; interleaving across producers is allowed)

AC-2: Per-segment rotation at configured size cap
Given FdrWriterConfig.segment_size_bytes = 4096 and a producer enqueuing fixed-size records that cross 4096 bytes after N writes
When the writer runs
Then segment 0 on disk is at most 4096 bytes plus one record's worth of overshoot, segment 1 is opened atomically, and parse(segment_0_bytes ++ segment_1_bytes) yields all records in order with no truncation, no overlap, and no corruption at the rotation boundary

AC-3: Atomic rotation does not lose records under crash
Given a writer that has just appended a record to segment N and is mid-rotation to segment N+1
When the test simulates a crash (kill before atomicwrites finalises N+1)
Then on restart segment N is intact and parseable to the last record before rotation; segment N+1 either does not exist or is intact and parseable from offset 0. There is no half-written intermediate file at the canonical segment N+1 path

AC-4: Cross-process filelock prevents concurrent writers
Given FileFdrWriter is running and holds the lock at flight_root/.fdr.lock
When a second FileFdrWriter constructor is called against the same flight_root
Then the second constructor raises FdrConcurrentWriterError and does NOT create a second writer thread or touch any segment file

AC-5: Mid-flight ENOSPC degrades gracefully + alerts via GCS
Given the writer is running and the underlying filesystem returns OSError(ENOSPC) on the next segment append
When the writer encounters the failure
Then (a) one ERROR log record is emitted with kind="fdr.write_failure" carrying errno=ENOSPC, (b) gcs_alert(message) is invoked exactly once with a message identifying the failure, (c) is_degraded becomes True, (d) subsequent drain calls still dequeue from the producer buffers (no unbounded growth on the producer side), (e) the per-second ERROR-log cap kicks in if the failure repeats (≤ 1 ERROR/sec related to write failures)

AC-6: stop() drains, fsyncs, releases lock
Given a running writer with N records buffered across all producers
When stop() is called
Then (a) all N records are appended and fsynced before the method returns, (b) the FDR-root filelock is released (a subsequent constructor against the same flight_root succeeds), (c) the current segment file is closed and not held open by any descriptor

AC-7: Segment file layout is exactly <flight_id>/segment-NNNN.fdr
Given flight_id="abc123-def4-..." and 3 segment rotations during the flight
When stop() returns
Then flight_root/abc123-def4-.../ contains exactly segment-0000.fdr, segment-0001.fdr, segment-0002.fdr, segment-0003.fdr (and nothing else from this writer); each is independently parseable as a stream of length-prefixed FdrRecords

AC-8: Steady-state writer thread does not block any producer
Given a producer enqueuing at 200 Hz steady-state and a writer thread that takes 4 ms to serialise + append a record (under the 5 ms per-record budget)
When the test runs for 60 s
Then the producer's FdrClient reports zero EnqueueResult.OVERRUN results from this scenario (the writer keeps up with steady state; overrun under burst is a separate concern owned by AZ-273 + AZ-274)
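
The layout required by AC-7 reduces to one path-building rule. A minimal sketch follows; `segment_path` is a hypothetical helper name, and rejecting negative indices is an assumption made here rather than something the task spec states.

```python
from pathlib import Path


def segment_path(flight_root: Path, flight_id: str, index: int) -> Path:
    """Build flight_root/<flight_id>/segment-NNNN.fdr for a segment index."""
    if index < 0:
        # Assumption for this sketch: segment numbering starts at 0.
        raise ValueError("segment index must be non-negative")
    # :04d zero-pads to four digits, e.g. 3 -> "0003".
    return flight_root / flight_id / f"segment-{index:04d}.fdr"
```

A flight with 3 rotations uses indices 0 through 3, which gives the four file names listed in AC-7.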

Non-Functional Requirements

Performance

  • Aggregate writer throughput ≥ 200 Hz sustained on Tier-2 (Jetson Orin Nano Super) under the workload defined by C13-PT-01 (~100 Hz combined producer rate). Headroom of 2× is the design margin.
  • Per-record serialise + append p95 ≤ 5 ms (matches C13-PT-01 budget).
  • Segment rotation completes in ≤ 50 ms p99 (so a rotation does not stall the writer past one record's worth of producer buffer headroom).
  • start() returns within 100 ms, by which point segment 0 is open and the thread is running (so it does not block takeoff readiness).
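
The throughput figures above can be cross-checked with simple arithmetic. The sketch below treats the 5 ms p95 bound as if it were the cost of every record, which is a deliberately pessimistic simplification; `utilisation` and `period_ms` are hypothetical helper names.

```python
def utilisation(rate_hz: float, per_record_ms: float) -> float:
    """Fraction of wall-clock time a serial writer spends handling records."""
    return rate_hz * per_record_ms / 1000.0


def period_ms(rate_hz: float) -> float:
    """Time between consecutive records at a given aggregate rate."""
    return 1000.0 / rate_hz
```

Under that simplification, the ~100 Hz C13-PT-01 workload at 5 ms per record keeps the writer busy about half the time, while the 200 Hz target at 5 ms per record leaves no slack, since 5 ms is exactly the period at 200 Hz.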

Reliability

  • The writer thread NEVER raises into the constructor's caller after start() returns. All runtime errors are caught and either (a) logged + degraded, or (b) coerced into a stop()-and-rethrow path that the composition root observes via a documented exit hook.
  • Segment files are append-only between rotations: the writer NEVER seeks backward, NEVER overwrites a closed segment, NEVER truncates the current segment.
  • fsync is called after every segment rotation (so a power loss preserves all closed segments). Per-record fsync is NOT required; the per-segment cap is the durability boundary.
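
The durability boundary described above (fsync at segment close, not per record) could look like the following. This is a sketch using only the standard library; `close_segment_durably` is a hypothetical name.

```python
import os
from typing import BinaryIO


def close_segment_durably(segment: BinaryIO) -> None:
    """Flush user-space buffers, fsync the file, then close it."""
    segment.flush()               # push Python's buffer to the OS
    os.fsync(segment.fileno())    # ask the OS to persist the file contents
    segment.close()
```

The flush has to come first: `os.fsync` only covers data the operating system has already received, so bytes still sitting in Python's buffer would otherwise be missed.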

Concurrency

  • The writer thread is the ONLY consumer of every registered producer's FdrClient (matches AZ-273's SPSC contract — each FdrClient has exactly one consumer thread; this is it).
  • The start() and stop() methods are NOT thread-safe with respect to each other; the composition root calls each exactly once per FileFdrWriter lifetime.

Unit Tests

| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | 3 FdrClients × 100 buffered records → start writer, wait, parse segment 0 | All 300 records present, in-order per-producer |
| AC-2 | segment_size_bytes=4096; emit fixed-size records across the cap | Segment 0 ≤ 4096 + 1 record overshoot; segment 1 contains the rest; concatenated parse yields all records in order |
| AC-3 | Kill writer mid-rotation (after segment N close, before segment N+1 finalise) | On restart, segment N parses cleanly; segment N+1 is either absent or parseable from offset 0 |
| AC-4 | Two FileFdrWriter constructors against the same flight_root | Second raises FdrConcurrentWriterError; first remains untouched |
| AC-5 | Inject OSError(ENOSPC) on segment append | One ERROR log; gcs_alert called once; is_degraded=True; producers still drained; subsequent failures log-rate-capped |
| AC-6 | stop() with N records buffered | All N records on disk; fsync called; filelock released |
| AC-7 | Run a 3-rotation flight, inspect filesystem | Exactly 4 files: segment-0000.fdr through segment-0003.fdr |
| AC-8 | 200 Hz producer, 60 s, writer running | Zero overrun results from steady-state load |
| NFR-perf-throughput | C13-PT-01 microbench | ≥ 200 Hz sustained on Tier-2 |
| NFR-perf-rotation | Microbench rotation step | p99 ≤ 50 ms |
| NFR-reliability-fsync | Track fsync calls during a 5-segment flight | fsync called once per segment close |
| NFR-reliability-no-seek | Open the segment file with a tracing layer; assert no lseek backward | No backward seeks observed |

Constraints

  • One concrete writer per project (FileFdrWriter); no FdrWriter Protocol abstraction unless and until a second writer is needed (per architecture description.md "single concrete FileFdrWriter behind a FdrWriter interface" — the interface is the boundary the composition root injects against, but only one implementation exists this cycle).
  • Segment files use the same wire format as serialise / parse from AZ-272 (fdr_record_schema). The framing on disk is length-prefixed records back-to-back (length is a uint32 little-endian header before each serialised byte string); the framing is documented in the implementation report and is internal to C13 — no separate contract file this cycle.
  • Dependencies pinned at AZ-263 / E-BOOT only: atomicwrites, filelock. No new project dependency is introduced by this task.
  • The per-segment size cap and batch size for drain() are config-driven via FdrWriterConfig from composition_root_protocol; defaults are documented in the implementation report and chosen so steady-state Tier-2 throughput passes C13-PT-01.
  • The writer thread runs at NORMAL priority. No real-time scheduling. The "writer must keep up at 200 Hz" budget is met by serialisation efficiency, not by priority elevation.
  • Cross-process safety is flight_root-scoped, not segment-scoped. The lock is acquired ONCE on start() and released ONCE on stop().
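
The on-disk framing named in the constraints (a uint32 little-endian length before each serialised byte string) can be sketched with the standard `struct` module. `frame` and `iter_frames` are hypothetical names, and the payloads stand in for whatever `fdr_record_schema.serialise` returns; raising `ValueError` on truncation is a choice made for this sketch only.

```python
import struct
from typing import Iterator

_LEN = struct.Struct("<I")  # uint32, little-endian


def frame(payload: bytes) -> bytes:
    """Prefix a serialised record with its length."""
    return _LEN.pack(len(payload)) + payload


def iter_frames(data: bytes) -> Iterator[bytes]:
    """Yield each payload from back-to-back length-prefixed records."""
    offset = 0
    while offset < len(data):
        if offset + _LEN.size > len(data):
            raise ValueError("truncated length header")
        (length,) = _LEN.unpack_from(data, offset)
        offset += _LEN.size
        end = offset + length
        if end > len(data):
            raise ValueError("truncated record body")
        yield data[offset:end]
        offset = end
```

Because every frame carries its own length, each segment file can be parsed independently from offset 0, which is the property AC-2 and AC-7 rely on at rotation boundaries.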

Risks & Mitigation

Risk 1: atomicwrites fsyncs the directory on Linux but the underlying filesystem doesn't honour it

  • Risk: The Tier-2 filesystem (likely ext4 on the Jetson NVM) honours fsync, but on other filesystems (e.g. overlayfs, or tmpfs used for test fixtures) the rotation atomicity guarantee weakens.
  • Mitigation: AC-3 explicitly tests under a real ext4 mount (or tmpfs with documented caveat); the implementation report documents the supported filesystem set.
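
For reference, the temp-path-then-rename pattern that this risk concerns can be written with the standard library alone. The sketch below is a stand-in for illustration and does not use the atomicwrites package; `publish_empty_segment` and the temp-file naming are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def publish_empty_segment(canonical: Path) -> None:
    """Create `canonical` atomically: temp file in the same directory, then rename."""
    directory = canonical.parent
    fd, tmp_name = tempfile.mkstemp(dir=directory, prefix=".segment-", suffix=".tmp")
    try:
        os.fsync(fd)                      # persist the (empty) temp file
    finally:
        os.close(fd)
    try:
        os.replace(tmp_name, canonical)   # atomic within a single filesystem
    except OSError:
        os.unlink(tmp_name)               # do not leave the temp file behind
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                  # persist the directory entry itself
    finally:
        os.close(dir_fd)
```

The final directory fsync is the step whose guarantees depend on the underlying filesystem, which is why the supported filesystem set matters for AC-3.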

Risk 2: Single writer thread becomes a bottleneck when a producer suddenly bursts

  • Risk: The writer thread serves N producers serially within a drain loop; a bursting producer's backlog can starve the others.
  • Mitigation: drain(max_records=batch_size) enforces fair round-robin across producers — each producer's batch is bounded so no single producer monopolises a tick. AC-8 measures steady-state behaviour; burst-handling lives in producer-side overrun policy (AZ-274).
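
The bounded round-robin in this mitigation can be illustrated with a stand-in client. `FakeClient` is a hypothetical test double that models only `drain(max_records=...)`; returning a list of byte strings is an assumption made for this sketch, not the documented FdrClient return type.

```python
from collections import deque
from typing import Deque, Iterable, List, Sequence


class FakeClient:
    """Hypothetical stand-in for FdrClient; only drain() is modelled."""

    def __init__(self, records: Iterable[bytes]) -> None:
        self._queue: Deque[bytes] = deque(records)

    def drain(self, max_records: int) -> List[bytes]:
        out: List[bytes] = []
        while self._queue and len(out) < max_records:
            out.append(self._queue.popleft())
        return out


def drain_tick(clients: Sequence[FakeClient], batch_size: int) -> List[bytes]:
    """One pass over all producers, taking at most batch_size from each."""
    collected: List[bytes] = []
    for client in clients:
        collected.extend(client.drain(max_records=batch_size))
    return collected
```

With a batch size of 2, a producer holding five records and a producer holding one are both served in the first tick: the quiet producer's record is not stuck behind the other producer's backlog.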

Risk 3: filelock held across an unclean exit leaves the flight_root locked

  • Risk: Companion process killed (e.g. brownout) without stop() running; next boot finds the lock file present and refuses to construct a new writer.
  • Mitigation: filelock takes an advisory flock lock through the fcntl module, and the kernel releases it automatically when the holding process dies. The lock file itself may linger, but the lock state does not. Documented in the implementation report; AC-4 verifies the live-process case.
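
The difference between the lock file and the lock state can be demonstrated with `fcntl.flock` directly. This sketch is Unix-only and bypasses the filelock package; `try_lock` is a hypothetical helper. Closing the descriptor stands in for process death here, since the kernel closes every descriptor a process holds when it exits.

```python
import fcntl
import os
from typing import Optional


def try_lock(path: str) -> Optional[int]:
    """Return a locked fd, or None if another open file description holds the lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock; contention raises BlockingIOError.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A second `try_lock` on the same path returns None while the first descriptor is open, and succeeds once it is closed, even though the lock file is still present on disk.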

Risk 4: ENOSPC degraded mode produces unbounded log records

  • Risk: A persistent ENOSPC under sustained load could emit up to 200 ERROR log records per second (one per failed append at the 200 Hz target rate).
  • Mitigation: Per-second rate cap on kind="fdr.write_failure" ERROR records (AC-5e). The first failure is always emitted; subsequent failures within the same second are coalesced.
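
One way to realise such a cap is sketched below. `PerSecondCap` is a hypothetical name; the clock value is passed in explicitly so the behaviour is deterministic (production code would likely pass `time.monotonic()`), and using a sliding one-second interval with a suppressed-count suffix is an assumption of this sketch rather than the specified coalescing rule.

```python
from typing import Callable, Optional


class PerSecondCap:
    """Emit at most one message per interval; count the ones held back."""

    def __init__(self, emit: Callable[[str], None], interval_s: float = 1.0) -> None:
        self._emit = emit
        self._interval_s = interval_s
        self._last_emit: Optional[float] = None
        self.suppressed = 0

    def log(self, message: str, now: float) -> bool:
        """Return True if the message was emitted, False if it was coalesced."""
        if self._last_emit is None or now - self._last_emit >= self._interval_s:
            suffix = f" (+{self.suppressed} suppressed)" if self.suppressed else ""
            self._emit(message + suffix)
            self._last_emit = now
            self.suppressed = 0
            return True
        self.suppressed += 1
        return False
```

The first failure is always emitted because `_last_emit` starts as None, which matches the mitigation's requirement that the first failure is never swallowed.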

Runtime Completeness

  • Named capability: single-writer thread + segment file lifecycle (architecture / E-C13 / AC-NEW-3 every-payload-class-from-t=0; no silent drops).
  • Production code that must exist: real background thread, real drain loop across registered FdrClients, real segment file open/append/close with atomicwrites, real filelock acquire/release on flight_root, real ENOSPC handler with shared-logger ERROR + GCS alert.
  • Allowed external stubs: tests MAY substitute a FakeGcsAlert (collects messages); production wiring uses the real C8 GCS adapter via the composition root.
  • Unacceptable substitutes: time.sleep-driven polling without a real producer-buffer drain, in-memory list "for now" instead of segment files on disk, pickle or any non-fdr_record_schema serialiser, omitting fsync ("we'll add durability later"), or omitting filelock ("companion is single-process anyway").