mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 16:21:12 +00:00
[AZ-291] [AZ-292] [AZ-293] C13 FDR writer chain (batch 6)
AZ-291 — FileFdrWriter: single writer thread draining every registered FdrClient SPSC ring buffer to per-flight segment files; per-segment size rotation; cross-process fcntl.flock filelock on flight_root; ENOSPC degraded mode with rate-capped ERROR logs and one GCS alert.

AZ-292 — FlightHeader/FlightFooter dataclasses + open_flight / close_flight lifecycle methods; four per-flight monotonic counters (records_written, records_dropped_overrun, bytes_written, rollover_count) reported by the footer; flight_id mismatch and close-without-open are typed errors.

AZ-293 — CapacityCapPolicy (post-rotation hook): walks the flight directory, drops the oldest CLOSED segment when total > cap (default 64 GiB), emits a kind="segment_rollover" record per drop. Never drops the currently-open segment or segment 0 alone; cap_misconfigured path logs ERROR + GCS alert. No config flag disables emission (C13-ST-01).

Schema: bumped fdr_record_schema flight_header / flight_footer payload key sets to match the AZ-292 task spec (effective 1.0.0 -> 1.1.0; no prior producer); KNOWN_PAYLOAD_KEYS updated. Added FdrWriterConfig nested in FdrConfig (segment_size_bytes, batch_size, flight_cap_bytes, debug_log_per_record).

Tests: 29 new unit tests (8 AC + 1 invariant per task); full suite 323 passed, 2 pre-existing skips, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -1,171 +0,0 @@
# C13 Writer Thread + Segment File Lifecycle

**Task**: AZ-291_c13_writer_thread
**Name**: C13 Writer Thread
**Description**: Implement the single-writer thread that drains every onboard producer's `FdrClient` SPSC ring buffer and persists records to per-flight segment files on the companion's NVM. Owns segment file open/append/close, atomic per-segment rotation when the configured per-segment size cap is reached, and the cross-process FDR-root `filelock` so the operator-side post-flight reader cannot collide with an in-flight writer. This task is the foundation every other E-C13 task (header/footer accounting, 64 GB cap policy, mid-flight tile snapshot, thumbnail rate cap, takeoff abort) builds on.
**Complexity**: 5 points
**Dependencies**: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf, AZ-266_log_module, AZ-269_config_loader
**Component**: c13_fdr (epic AZ-248 / E-C13)
**Tracker**: AZ-291
**Epic**: AZ-248 (E-C13)

### Document Dependencies

- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — wire format for every record this thread serialises to the segment file.
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — defines `pop_one()` / `drain()` consumer-side surface this thread invokes per registered producer.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — operational log shape this thread uses for ERROR/WARN/INFO messages (segment open/rotate/write failure).
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that carries `flight_root`, segment-size cap, and registered producer set.
## Problem

Every onboard component publishes FDR records via its `FdrClient` SPSC ring buffer (AZ-273), but those buffers are write-only from the producer side. Without a single, contract-frozen writer thread:

- Buffers fill up and overruns dominate within seconds — the AC-NEW-3 "no silent drops" guarantee is unenforceable because nothing drains them.
- No segment file ever lands on disk — post-flight retrieval has nothing to read.
- Multiple ad-hoc writers would race on segment rotation, corrupting partially-written records.
- Operator workstation reads (post-flight via E-C12) and a misbehaving "still flying" writer process would race on the FDR root without `filelock` enforcement.

This task delivers exactly one thread that owns the entire write side of the FDR.
## Outcome

- A single `FileFdrWriter` instance, constructed once per flight by the composition root, runs one background thread that consumes records from every registered producer's `FdrClient` and appends them to the current open segment file in the per-flight directory under `flight_root`.
- Segment files roll over atomically when the configured per-segment size cap is reached: the current segment is closed and `fsync`ed, the next segment is opened via `atomicwrites`, and the writer continues without dropping records or losing wire-format alignment.
- The FDR root holds a `filelock` for the entire flight; the operator-side reader (future E-C12 retrieval task) MUST acquire the same lock before reading. A second writer process started against the same `flight_root` fails at construction time with `FdrConcurrentWriterError`.
- A mid-flight filesystem write failure (ENOSPC, EIO) is logged via the shared logger at ERROR, and a STATUSTEXT alert is requested through the C8 GCS adapter; the writer transitions to a degraded "drop-and-log" mode so the rest of the system keeps emitting external positions, but operators are alerted.
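The rollover ordering described in the Outcome list (close and `fsync` the finished segment first, publish the next one under its canonical name only afterwards) can be sketched with the standard library. This is a minimal illustration, not the task's implementation: it swaps in `tempfile.mkstemp` plus `os.replace` in place of the `atomicwrites` package the task names, and `fsync_dir` and `rotate_segment` are hypothetical helper names.

```python
import os
import tempfile
from pathlib import Path


def fsync_dir(directory: Path) -> None:
    """Flush directory metadata so a completed rename survives power loss."""
    fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def rotate_segment(current_file, flight_dir: Path, next_index: int):
    """Close segment N durably, then publish an empty segment N+1 atomically."""
    # 1. Make the finished segment durable before anything else happens.
    current_file.flush()
    os.fsync(current_file.fileno())
    current_file.close()

    # 2. Create the next segment under a temporary name in the same directory
    #    (a rename is only atomic within one filesystem).
    final_path = flight_dir / f"segment-{next_index:04d}.fdr"
    fd, tmp_name = tempfile.mkstemp(dir=flight_dir, prefix=".segment-", suffix=".tmp")
    os.close(fd)

    # 3. Swap the temp file to its canonical name, then persist the rename.
    os.replace(tmp_name, final_path)
    fsync_dir(flight_dir)
    return open(final_path, "ab")
```

If the process dies between steps 2 and 3, only a dot-prefixed temp file is left behind; the canonical `segment-NNNN.fdr` path never holds a half-created file.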
## Scope

### Included

- `FileFdrWriter(flight_root: Path, config: FdrWriterConfig, fdr_clients: Sequence[FdrClient], gcs_alert: Callable[[str], None])` constructor.
- `start()` method that opens segment 0 under `flight_root/<flight_id>/segment-0000.fdr`, acquires the FDR-root `filelock`, and starts the background thread.
- `stop()` method that signals the thread to drain remaining records, closes the current segment with `fsync`, releases the `filelock`, and joins.
- Background thread loop: per registered producer, call `drain(max_records=batch_size)` (batch size from config), serialise each `FdrRecord` via `fdr_record_schema.serialise`, append to the current segment with a length-prefixed framing identical to what `parse` reads, and rotate when the segment exceeds the per-segment size cap.
- Atomic per-segment rotation using `atomicwrites`: open the next segment under a temp path, swap to the canonical name only after the previous segment is closed + `fsync`ed.
- Cross-process `filelock` on `flight_root/.fdr.lock` held for the entire flight; constructor-time `FdrConcurrentWriterError` if the lock is already held.
- Mid-flight write failure handling: catch `OSError` around segment append/rotate, log ERROR via the shared logger (`kind="fdr.write_failure"`), invoke `gcs_alert(message)`, set internal `is_degraded = True`. Subsequent `drain` calls continue to dequeue records (so producer buffers don't grow unboundedly) but discard them with a per-second-rate-capped ERROR log; recovery is out of scope (operator must land + retry).
- Public read-only introspection: `current_segment_path() -> Path`, `current_segment_bytes() -> int`, `segments_written() -> int`, `is_rolling() -> bool` (true while a rotation is in progress).
- Diagnostic INFO log on `start()` and on each successful segment rotation; DEBUG log per record only when explicitly enabled in config (defaults off — DEBUG-per-record would flood at 100 Hz aggregate).
- Filesystem layout: `flight_root/<flight_id>/segment-NNNN.fdr` (4-digit zero-padded segment number, `.fdr` suffix). The `<flight_id>` directory is created on `start()` from `FlightHeader.flight_id` (header content is owned by AZ-248-2 / task #2; this task accepts the flight_id as a constructor argument or via an open-time setter).
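The background-loop item above bounds each producer's share of a pass with `drain(max_records=batch_size)`. A minimal sketch of one such pass follows, under stated assumptions: `StubClient` is a hypothetical stand-in for `FdrClient` that models only `drain()`, records are treated as already-serialised bytes, and `drain_tick` is an illustrative name rather than part of the task's API.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence


class StubClient:
    """Hypothetical stand-in for FdrClient; only drain() is modelled."""

    def __init__(self, records: Sequence[bytes]) -> None:
        self._queue: Deque[bytes] = deque(records)

    def drain(self, max_records: int) -> List[bytes]:
        # Hand back at most max_records items, oldest first.
        count = min(max_records, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]


def drain_tick(clients: Sequence[StubClient], batch_size: int,
               append: Callable[[bytes], None]) -> int:
    """Visit every client once, taking at most batch_size records from each."""
    written = 0
    for client in clients:
        for record in client.drain(max_records=batch_size):
            append(record)
            written += 1
    return written
```

With two clients holding three and one record and `batch_size=2`, the first pass appends two records from the first client and one from the second; the leftover record is picked up on the next pass, so per-producer order is preserved while no producer monopolises a pass.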
### Excluded

- `FlightHeader` / `FlightFooter` records and `records_written` / `records_dropped_overrun` accounting — owned by task #2 of this epic.
- 64 GB total-flight cap + oldest-segment-dropped policy + `kind="segment_rollover"` record emission — owned by task #3 of this epic. (This task implements per-segment-size rotation only; per-flight-cap enforcement is a higher policy layer that observes segments rolled by this task.)
- Mid-flight tile snapshot path + `kind="mid_flight_tile_snapshot"` payload handling — owned by task #4.
- Failed-tile thumbnail rate limiter + AC-8.5 `RawFrameWriteForbiddenError` enforcement — owned by task #5.
- Takeoff abort wiring on `FdrOpenError` — owned by task #6.
- Producer-side `FdrClient` ring buffer + `on_overrun` policy — owned by AZ-273 + AZ-274.
- Post-flight segment file reader — out of scope this cycle (future E-C12 task).
- `FdrRecord` schema and `serialise` / `parse` implementations — owned by AZ-272.
## Acceptance Criteria

**AC-1: Single writer thread drains every registered producer**
Given 3 `FdrClient` instances each with 100 records buffered
When `FileFdrWriter.start()` is called and the test waits 1 s
Then segment 0 on disk contains all 300 records (parsed via `fdr_record_schema.parse`; order is deterministic per producer, interleaving across producers is allowed)

**AC-2: Per-segment rotation at configured size cap**
Given `FdrWriterConfig.segment_size_bytes = 4096` and a producer enqueuing fixed-size records that cross 4096 bytes after N writes
When the writer runs
Then segment 0 on disk is at most 4096 bytes plus one record's worth of overshoot, segment 1 is opened atomically, and `parse(segment_0_bytes ++ segment_1_bytes)` yields all records in order with no truncation, no overlap, and no corruption at the rotation boundary

**AC-3: Atomic rotation does not lose records under crash**
Given a writer that has just appended a record to segment N and is mid-rotation to segment N+1
When the test simulates a crash (kill before `atomicwrites` finalises N+1)
Then on restart segment N is intact and parseable to the last record before rotation; segment N+1 either does not exist or is intact and parseable from offset 0 — there is no half-written intermediate file at the canonical segment N+1 path

**AC-4: Cross-process filelock prevents concurrent writers**
Given `FileFdrWriter` is running and holds the lock at `flight_root/.fdr.lock`
When a second `FileFdrWriter` constructor is called against the same `flight_root`
Then the second constructor raises `FdrConcurrentWriterError` and does NOT create a second writer thread or touch any segment file

**AC-5: Mid-flight ENOSPC degrades gracefully + alerts via GCS**
Given the writer is running and the underlying filesystem returns `OSError(ENOSPC)` on the next segment append
When the writer encounters the failure
Then (a) one ERROR log record is emitted with `kind="fdr.write_failure"` carrying `errno=ENOSPC`, (b) `gcs_alert(message)` is invoked exactly once with a message identifying the failure, (c) `is_degraded` becomes True, (d) subsequent `drain` calls still dequeue from the producer buffers (no unbounded growth on the producer side), (e) the per-second ERROR-log cap kicks in if the failure repeats (≤ 1 ERROR/sec related to write failures)

**AC-6: stop() drains, fsyncs, releases lock**
Given a running writer with N records buffered across all producers
When `stop()` is called
Then (a) all N records are appended and `fsync`ed before the method returns, (b) the FDR-root `filelock` is released (a subsequent constructor against the same `flight_root` succeeds), (c) the current segment file is closed and not held open by any descriptor

**AC-7: Segment file layout is exactly `<flight_id>/segment-NNNN.fdr`**
Given `flight_id="abc123-def4-..."` and 3 segment rotations during the flight
When `stop()` returns
Then `flight_root/abc123-def4-.../` contains exactly `segment-0000.fdr`, `segment-0001.fdr`, `segment-0002.fdr`, `segment-0003.fdr` (and nothing else from this writer); each is independently parseable as a stream of length-prefixed `FdrRecord`s

**AC-8: Steady-state writer thread does not block any producer**
Given a producer enqueuing at 200 Hz steady-state and a writer-thread that takes 4 ms to serialise + append a record (under the 5 ms per-record budget)
When the test runs for 60 s
Then the producer's `FdrClient` reports zero `EnqueueResult.OVERRUN` results from this scenario (the writer keeps up with steady state; overrun under burst is a separate concern owned by AZ-273 + AZ-274)
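The behaviour AC-4 asks for (a second writer against the same `flight_root` is refused immediately) can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in, not the `filelock` package the task names: it calls `fcntl.flock` directly, so it is Unix-only, and `ConcurrentWriterError`, `acquire_root_lock` and `release_root_lock` are hypothetical names.

```python
import fcntl
import os
from pathlib import Path


class ConcurrentWriterError(RuntimeError):
    """Illustrative stand-in for FdrConcurrentWriterError."""


def acquire_root_lock(flight_root: Path) -> int:
    """Take an exclusive, non-blocking advisory lock on flight_root/.fdr.lock."""
    fd = os.open(flight_root / ".fdr.lock", os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail at once instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        raise ConcurrentWriterError(f"{flight_root} is already locked") from exc
    return fd


def release_root_lock(fd: int) -> None:
    """Drop the lock and close the descriptor; the lock file itself stays."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Because the lock belongs to the open file description rather than to the lock file, it disappears when the holding process exits, even though `.fdr.lock` remains on disk.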
## Non-Functional Requirements

**Performance**
- Aggregate writer throughput ≥ 200 Hz sustained on Tier-2 (Jetson Orin Nano Super) under the workload defined by C13-PT-01 (~100 Hz combined producer rate). Headroom of 2× is the design margin.
- Per-record serialise + append p95 ≤ 5 ms (matches C13-PT-01 budget).
- Segment rotation completes in ≤ 50 ms p99 (so a rotation does not stall the writer past one record's worth of producer buffer headroom).
- `start()` returns within 100 ms after segment 0 is open and the thread is running (not blocking takeoff readiness).

**Reliability**
- The writer thread NEVER raises into the constructor's caller after `start()` returns. All runtime errors are caught and either (a) logged + degraded, or (b) coerced into a `stop()`-and-rethrow path that the composition root observes via a documented exit hook.
- Segment files are append-only between rotations: the writer NEVER seeks backward, NEVER overwrites a closed segment, NEVER truncates the current segment.
- `fsync` is called after every segment rotation (so a power loss preserves all closed segments). Per-record `fsync` is NOT required; the per-segment cap is the durability boundary.

**Concurrency**
- The writer thread is the ONLY consumer of every registered producer's `FdrClient` (matches AZ-273's SPSC contract — each `FdrClient` has exactly one consumer thread; this is it).
- The `start()` / `stop()` methods are NOT thread-safe to each other; the composition root calls each exactly once per `FileFdrWriter` lifetime.
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | 3 FdrClients × 100 buffered records → start writer, wait, parse segment 0 | All 300 records present, in-order per-producer |
| AC-2 | segment_size_bytes=4096; emit fixed-size records across the cap | Segment 0 ≤ 4096 + 1 record overshoot; segment 1 contains the rest; concatenated parse yields all records in order |
| AC-3 | Kill writer mid-rotation (after segment N close, before segment N+1 finalise) | On restart, segment N parses cleanly; segment N+1 is either absent or parseable from offset 0 |
| AC-4 | Two FileFdrWriter constructors against the same flight_root | Second raises `FdrConcurrentWriterError`; first remains untouched |
| AC-5 | Inject `OSError(ENOSPC)` on segment append | One ERROR log; gcs_alert called once; is_degraded=True; producers still drained; subsequent failures log-rate-capped |
| AC-6 | stop() with N records buffered | All N records on disk; fsync called; filelock released |
| AC-7 | Run a 3-rotation flight, inspect filesystem | Exactly 4 files: `segment-0000.fdr` through `segment-0003.fdr` |
| AC-8 | 200 Hz producer, 60 s, writer running | Zero overrun results from steady-state load |
| NFR-perf-throughput | C13-PT-01 microbench | ≥ 200 Hz sustained on Tier-2 |
| NFR-perf-rotation | Microbench rotation step | p99 ≤ 50 ms |
| NFR-reliability-fsync | Track fsync calls during a 5-segment flight | fsync called once per segment close |
| NFR-reliability-no-seek | Open the segment file with a tracing layer; assert no `lseek` backward | No backward seeks observed |
## Constraints

- One concrete writer per project (`FileFdrWriter`); no `FdrWriter` Protocol abstraction unless and until a second writer is needed (per architecture description.md "single concrete `FileFdrWriter` behind a `FdrWriter` interface" — the interface is the boundary the composition root injects against, but only one implementation exists this cycle).
- Segment files use the same wire format as `serialise` / `parse` from AZ-272 (fdr_record_schema). The framing on disk is length-prefixed records back-to-back (length is a `uint32` little-endian header before each `serialise`d byte string); the framing is documented in the implementation report and is internal to C13 — no separate contract file this cycle.
- Dependencies pinned at AZ-263 / E-BOOT only: `atomicwrites`, `filelock`. No new project dependency is introduced by this task.
- The per-segment size cap and batch size for `drain()` are config-driven via `FdrWriterConfig` from `composition_root_protocol`; defaults are documented in the implementation report and chosen so steady-state Tier-2 throughput passes C13-PT-01.
- The writer thread runs at NORMAL priority. No real-time scheduling. The "writer must keep up at 200 Hz" budget is met by serialisation efficiency, not by priority elevation.
- Cross-process safety is `flight_root`-scoped, not segment-scoped. The lock is acquired ONCE on `start()` and released ONCE on `stop()`.
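The on-disk framing named in the constraints (a `uint32` little-endian length before each serialised record, records back-to-back) can be sketched as follows. This is only an illustration of that framing: `write_frame` and `read_frames` are hypothetical names, the payload is assumed to be the bytes returned by AZ-272's `serialise`, and raising `ValueError` on truncation is this sketch's choice rather than documented behaviour.

```python
import io
import struct
from typing import BinaryIO, Iterator

_LENGTH = struct.Struct("<I")  # uint32, little-endian


def write_frame(stream: BinaryIO, payload: bytes) -> int:
    """Append one length-prefixed record; return the bytes added to the segment."""
    stream.write(_LENGTH.pack(len(payload)))
    stream.write(payload)
    return _LENGTH.size + len(payload)


def read_frames(stream: BinaryIO) -> Iterator[bytes]:
    """Yield record payloads until a clean end of segment."""
    while True:
        header = stream.read(_LENGTH.size)
        if not header:
            return  # clean end of segment
        if len(header) < _LENGTH.size:
            raise ValueError("truncated length prefix")
        (length,) = _LENGTH.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record body")
        yield payload
```

Because every frame carries its own length, a reader can tell a cleanly closed segment (the stream ends exactly on a frame boundary) from one cut short mid-record.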
## Risks & Mitigation

**Risk 1: `atomicwrites` fsyncs the directory on Linux but the underlying filesystem doesn't honour it**
- *Risk*: The Tier-2 filesystem (likely ext4 on the Jetson NVM) honours `fsync` but in degraded conditions (e.g. overlayfs, tmpfs for fixtures) the rotation atomicity guarantee weakens.
- *Mitigation*: AC-3 explicitly tests under a real ext4 mount (or `tmpfs` with documented caveat); the implementation report documents the supported filesystem set.

**Risk 2: Single writer thread becomes a bottleneck when a producer suddenly bursts**
- *Risk*: The writer thread serves N producers serially within a `drain` loop; one bursting producer's records starve the others.
- *Mitigation*: `drain(max_records=batch_size)` enforces fair round-robin across producers — each producer's batch is bounded so no single producer monopolises a tick. AC-8 measures steady-state behaviour; burst-handling lives in producer-side overrun policy (AZ-274).

**Risk 3: `filelock` held across an unclean exit leaves the flight_root locked**
- *Risk*: Companion process killed (e.g. brownout) without `stop()` running; next boot finds the lock file present and refuses to construct a new writer.
- *Mitigation*: On Linux, `filelock` takes an advisory lock via `fcntl.flock`; the kernel releases it automatically when the process dies. The lock file itself may linger but the lock state does not. Documented in the implementation report; AC-4 verifies the live-process case.

**Risk 4: ENOSPC degraded mode produces unbounded log records**
- *Risk*: A persistent ENOSPC under sustained load could emit 200 ERROR records per second.
- *Mitigation*: Per-second rate cap on `kind="fdr.write_failure"` ERROR records (AC-5e). The first failure is always emitted; subsequent failures within the same second are coalesced.
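The Risk 4 mitigation (first failure always logged, later failures within the same second coalesced) amounts to a small gate in front of the logger. A minimal sketch, assuming a monotonic clock and an injectable clock function for tests; `PerSecondCap` and its `suppressed` counter are hypothetical names, not part of the task's API.

```python
import time
from typing import Callable, Optional


class PerSecondCap:
    """Allow the first event, then at most one event per interval."""

    def __init__(self, interval_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval_s = interval_s
        self._clock = clock
        self._last_emit: Optional[float] = None
        self.suppressed = 0  # total events swallowed by the cap

    def allow(self) -> bool:
        """Return True when the caller may emit an ERROR record now."""
        now = self._clock()
        if self._last_emit is None or now - self._last_emit >= self._interval_s:
            self._last_emit = now
            return True
        self.suppressed += 1
        return False
```

The writer's failure path would call `allow()` before each `kind="fdr.write_failure"` log, which bounds the log volume at one record per second no matter how fast appends fail.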
## Runtime Completeness

- **Named capability**: single-writer thread + segment file lifecycle (architecture / E-C13 / AC-NEW-3 every-payload-class-from-t=0; no silent drops).
- **Production code that must exist**: real background thread, real `drain` loop across registered FdrClients, real segment file open/append/close with `atomicwrites`, real `filelock` acquire/release on `flight_root`, real ENOSPC handler with shared-logger ERROR + GCS alert.
- **Allowed external stubs**: tests MAY substitute a `FakeGcsAlert` (collects messages); production wiring uses the real C8 GCS adapter via the composition root.
- **Unacceptable substitutes**: `time.sleep`-driven polling without a real producer-buffer drain, in-memory list "for now" instead of segment files on disk, `pickle` or any non-`fdr_record_schema` serialiser, omitting `fsync` ("we'll add durability later"), or omitting `filelock` ("companion is single-process anyway").
@@ -1,150 +0,0 @@
# C13 FlightHeader / FlightFooter + Accounting

**Task**: AZ-292_c13_flight_header_footer
**Name**: C13 Flight Header/Footer + Accounting
**Description**: Wire the writer thread's flight-lifetime contract: an `open_flight(header: FlightHeader)` method that emits a single `kind="flight_header"` record as the first record of segment 0, a `close_flight() -> FlightFooter` method that emits a single `kind="flight_footer"` record as the last record before drain + stop, and the per-flight running counters (`records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count`) that the footer reports. This is what makes a flight directory self-describing — without it, post-flight tooling cannot verify completeness or attribute drops to producers.
**Complexity**: 3 points
**Dependencies**: AZ-291_c13_writer_thread, AZ-272_fdr_record_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
**Component**: c13_fdr (epic AZ-248 / E-C13)
**Tracker**: AZ-292
**Epic**: AZ-248 (E-C13)

### Document Dependencies

- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines the canonical shape of `kind="flight_header"` and `kind="flight_footer"` payloads (consumed: every required field on each kind).
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config snapshot + signing-key-rotation-event + manifest-content-hashes the composition root passes into the FlightHeader.
## Problem

The writer thread from AZ-291 drains and persists FdrRecords, but at flight-time there is currently no canonical first record (which would identify the flight + carry the build/config snapshot the operator needs to reproduce post-flight) and no canonical last record (which would close the flight + report what was actually written vs. dropped). Without:

- A `flight_header` record written FIRST, the operator post-flight has no flight_id, no build manifest hash, no config snapshot — so the FDR cannot be uniquely attributed and provenance is broken.
- A `flight_footer` record written LAST, post-flight tooling cannot distinguish a clean shutdown from a power-loss truncation, and AC-NEW-3 traceability ("how many records were dropped per producer this flight") has no canonical answer.
- Per-flight running counters fed into the footer, the AC-NEW-3 "every drop visible" guarantee degrades into "every drop visible only inside individual records" — there is no single number the operator can audit at landing time.
## Outcome

- The writer's `open_flight(header)` method opens segment 0 (the path is created by AZ-291's `start()`) and writes a `kind="flight_header"` record as the first record on disk; `open_flight` returning successfully is the precondition every other onboard component uses to consider the FDR "ready" (this is the AC-NEW-3 every-payload-class-from-t=0 readiness gate the takeoff path checks — task #6 wires the gate, this task makes it observable).
- The writer maintains four monotonic counters across the entire flight: `records_written` (per-record on every successful append), `records_dropped_overrun` (incremented when the writer observes a `kind="overrun"` record from any producer — `payload.dropped_count` is added), `bytes_written` (cumulative serialised bytes), `rollover_count` (incremented per per-segment rotation from AZ-291).
- The writer's `close_flight()` method writes a single `kind="flight_footer"` record carrying those four counters + flight-end timestamp + flight_id, drains remaining records (per AZ-291's `stop()` contract), `fsync`s, releases the filelock, and returns the same FlightFooter to the caller.
- The `FlightFooter` is the canonical authoritative summary: post-flight tooling that finds a footer record with mismatched counts vs. the actual segment file contents reports a corruption finding; tooling that does NOT find a footer record marks the flight as truncated.
## Scope

### Included

- `FlightHeader` dataclass: `flight_id: UUID`, `flight_started_at_iso: str`, `flight_started_at_monotonic_ns: int`, `config_snapshot: dict`, `signing_key_rotation_event: dict`, `manifest_content_hashes: dict[str, str]`, `build_info: dict` (commit hash, build date, BUILD_* flag set per ADR-002).
- `FlightFooter` dataclass: `flight_id: UUID`, `flight_ended_at_iso: str`, `flight_ended_at_monotonic_ns: int`, `records_written: int`, `records_dropped_overrun: int`, `bytes_written: int`, `rollover_count: int`, `clean_shutdown: bool`.
- `FileFdrWriter.open_flight(header: FlightHeader) -> None` (extends AZ-291's writer): validates that `header.flight_id` matches the `flight_id` the writer was constructed with; serialises `header` into a `kind="flight_header"` `FdrRecord` (envelope `producer_id="shared.fdr_client"`); appends as the first record of segment 0; raises `FdrOpenError` on failure (the actual takeoff-abort wiring is task #6, this task only raises the right exception type).
- `FileFdrWriter.close_flight() -> FlightFooter` (extends AZ-291's writer): synthesises the `FlightFooter` from the running counters; serialises into a `kind="flight_footer"` `FdrRecord`; appends as the last record before drain-and-stop; returns the FlightFooter to the caller.
- Counter integration with AZ-291's writer loop: `records_written` increments on each successful `serialise + append`; `bytes_written` increments by `len(serialised)`; `rollover_count` increments per AZ-291's rotation event; `records_dropped_overrun` is updated by inspecting incoming `kind="overrun"` records and adding `payload.dropped_count`.
- `current_size_bytes() -> int` and `is_rolling() -> bool` exposed on the writer (interface methods promised by `_docs/02_document/components/14_c13_fdr/description.md` § 2). `current_size_bytes` returns the cumulative `bytes_written`; `is_rolling` is task #1's per-segment-rotation flag re-exposed here for completeness of the public surface.
- A diagnostic INFO log on `open_flight` (one record: `kind="fdr.flight_open"; flight_id`) and `close_flight` (one record: `kind="fdr.flight_close"; records_written; records_dropped_overrun; bytes_written; rollover_count; clean_shutdown`).
- The `clean_shutdown` flag: set to `True` by `close_flight`, and `False` if the writer detects it is being torn down without `close_flight` ever having been called (e.g. via a process-exit hook the composition root installs — wiring of the hook is owned by the composition root, this task only writes the path that decides the flag value).
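The two dataclasses and the flight_id check listed above can be sketched directly from the field lists. The field names and types come from the task text; everything else is an assumption made for illustration: the classes are frozen, `to_payload` and `check_flight_id` are hypothetical helpers, the UUID is rendered as a string in the payload, and `FdrOpenError` is declared locally as a stand-in for the real exception type.

```python
from dataclasses import dataclass, fields
from typing import Dict
from uuid import UUID


class FdrOpenError(Exception):
    """Local stand-in for the project's FdrOpenError."""


@dataclass(frozen=True)
class FlightHeader:
    flight_id: UUID
    flight_started_at_iso: str
    flight_started_at_monotonic_ns: int
    config_snapshot: dict
    signing_key_rotation_event: dict
    manifest_content_hashes: Dict[str, str]
    build_info: dict


@dataclass(frozen=True)
class FlightFooter:
    flight_id: UUID
    flight_ended_at_iso: str
    flight_ended_at_monotonic_ns: int
    records_written: int
    records_dropped_overrun: int
    bytes_written: int
    rollover_count: int
    clean_shutdown: bool


def to_payload(record) -> dict:
    """Flatten a header or footer into a plain dict, rendering the UUID as text."""
    payload = {f.name: getattr(record, f.name) for f in fields(record)}
    payload["flight_id"] = str(record.flight_id)
    return payload


def check_flight_id(expected: UUID, header: FlightHeader) -> None:
    """Reject a header whose flight_id differs from the writer's own."""
    if header.flight_id != expected:
        raise FdrOpenError(
            f"flight_id mismatch: writer has {expected}, header has {header.flight_id}"
        )
```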
### Excluded

- Background writer thread + segment file lifecycle — owned by AZ-291.
- 64 GB total-flight cap + oldest-segment-dropped + `kind="segment_rollover"` record emission — owned by task #3 (the `rollover_count` this task maintains is incremented PER SEGMENT regardless of whether the cap-policy task is online; once task #3 ships, segment_rollover records are emitted on top of the existing per-segment rotations from task #1).
- Mid-flight tile snapshot path / failed-tile thumbnail rate cap — tasks #4 and #5.
- `FdrOpenError`-driven takeoff abort wiring in the composition root — owned by task #6 (this task only raises the right exception type from `open_flight`; the abort path that translates the exception into "do NOT open the FC adapter" is the next task).
- Composing the `FlightHeader` content (config snapshot, signing key state, manifest hashes) — that is the composition root's responsibility; this task accepts the constructed header.
- Process-exit hook installation — owned by the composition root; this task only sets the `clean_shutdown` flag based on whether `close_flight` was reached.
## Acceptance Criteria

**AC-1: flight_header is the first record of segment 0**
Given a valid `FlightHeader` and a constructed-but-not-yet-started writer
When `start()` followed by `open_flight(header)` runs
Then segment 0's first parsed record is `FdrRecord(kind="flight_header", payload=<header_dict>)` with `payload.flight_id == header.flight_id` and the record sits at byte offset 0 (no other record precedes it)

**AC-2: flight_footer is the last record before clean stop**
Given a writer with N producer records appended and `clean_shutdown` reachable
When `close_flight()` is called
Then the last parsed record across all segments is `FdrRecord(kind="flight_footer", payload=<footer_dict>)` with `clean_shutdown=True`; the returned FlightFooter is deep-equal to the on-disk footer payload

**AC-3: counters reflect actual on-disk reality**
Given a flight with R producer records, D overrun-record drops, S segment rotations
When `close_flight()` runs and the test parses the footer
Then `records_written == R + 2` (the +2 is the header + footer themselves), `records_dropped_overrun == D`, `bytes_written == sum(len(serialised(r)) for r in [header, *records, footer])`, `rollover_count == S`

**AC-4: open_flight raises FdrOpenError on disk failure**
Given a `flight_root` whose segment 0 path cannot be opened (e.g. read-only mount)
When `open_flight(header)` runs
Then `FdrOpenError` is raised; no `flight_header` record lands on disk; the writer is in the `start()`-failed state with the filelock released

**AC-5: open_flight rejects flight_id mismatch**
Given a writer constructed with `flight_id=A` and an `open_flight(header)` where `header.flight_id=B`
When `open_flight` runs
Then `FdrOpenError` is raised with a message naming the mismatch; no `flight_header` record lands on disk

**AC-6: close_flight without open_flight raises**
Given a writer where `start()` ran but `open_flight()` was never called
When `close_flight()` is called
Then `FdrCloseWithoutOpenError` is raised; no `flight_footer` is appended; the writer transitions to stopped (filelock released, segment closed if any data was written)

**AC-7: clean_shutdown=False on unclean teardown**
Given a writer on which `start()` and `open_flight()` ran and which was then torn down via the composition-root process-exit hook (without `close_flight()` having been called)
When the test parses the resulting FDR directory
Then either (a) no `flight_footer` exists (truncated flight detected), OR (b) a `flight_footer` exists with `clean_shutdown=False` — implementation chooses; the contract is that `clean_shutdown=True` MUST NOT appear when `close_flight` was not called, but writing a partial footer is allowed

**AC-8: records_dropped_overrun aggregates payload.dropped_count**
Given the writer observes 5 `kind="overrun"` records with `payload.dropped_count` values [3, 7, 2, 11, 4]
When `close_flight()` runs
Then `records_dropped_overrun == 27` (sum of all dropped_count values, NOT the count of overrun records — the count is observable from the records themselves)
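The accounting checked by AC-3 and AC-8 can be sketched as a small counter object updated from the writer thread. This is a minimal illustration under stated assumptions: `FlightCounters`, `on_append` and `on_rotation` are hypothetical names, the payload is modelled as a plain dict, and overrun records are assumed to count toward `records_written` like any other appended record.

```python
from dataclasses import dataclass


@dataclass
class FlightCounters:
    records_written: int = 0
    records_dropped_overrun: int = 0
    bytes_written: int = 0
    rollover_count: int = 0

    def on_append(self, kind: str, payload: dict, serialised: bytes) -> None:
        """Call once after each successful serialise + append."""
        self.records_written += 1
        self.bytes_written += len(serialised)
        if kind == "overrun":
            # Sum the drops the producer reported, not the number of reports.
            self.records_dropped_overrun += payload["dropped_count"]

    def on_rotation(self) -> None:
        """Call once per completed segment rotation."""
        self.rollover_count += 1
```

Feeding it the five AC-8 overrun records (3, 7, 2, 11, 4) leaves `records_dropped_overrun` at 27 while `records_written` is 5, which is the distinction AC-8 draws between dropped records and overrun reports.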
## Non-Functional Requirements

**Performance**
- `open_flight` returns within 50 ms p99 (it serialises one record + appends it; no network or compute beyond `serialise`).
- `close_flight` returns within 200 ms p99 for typical flights (it triggers the writer's drain-and-stop sequence, but the per-record cost is dominated by `fsync` and the typical residual buffer is small).
- Counter updates on the steady-state path add ≤ 0.5 µs per record (atomic increments; no locking — the writer thread is the sole mutator).

**Reliability**
- The four counters are write-once-per-record from the writer thread (the writer is the sole mutator); reads from outside the thread (e.g. `current_size_bytes()`) MUST be atomic snapshots — Python's GIL covers this for `int`, but the implementation MUST NOT introduce any non-atomic compound update.
- `close_flight()` is idempotent on success: a second call returns the same FlightFooter without writing again, OR raises `FdrAlreadyClosedError` — implementation chooses; the contract test covers either outcome and asserts no double-write of the footer.
|
||||
|
||||
## Unit Tests
|
||||
|
||||
| AC Ref | What to Test | Required Outcome |
|
||||
|--------|-------------|-----------------|
|
||||
| AC-1 | start + open_flight + parse segment 0 | Record at offset 0 is `flight_header` with matching flight_id |
|
||||
| AC-2 | open_flight + N producer records + close_flight | Last record across segments is `flight_footer`; returned footer == on-disk footer deep-equal; clean_shutdown=True |
|
||||
| AC-3 | Run a flight with known R, D, S; parse footer counters | counters match (records_written, records_dropped_overrun, bytes_written, rollover_count) |
|
||||
| AC-4 | open_flight against read-only flight_root | `FdrOpenError`; no header on disk; filelock released |
|
||||
| AC-5 | open_flight with mismatched flight_id | `FdrOpenError`; message names the mismatch |
|
||||
| AC-6 | close_flight without open_flight | `FdrCloseWithoutOpenError`; no footer written |
|
||||
| AC-7 | start + open_flight + tear down without close_flight | No flight_footer OR flight_footer with clean_shutdown=False |
|
||||
| AC-8 | Inject 5 overrun records with known dropped_counts | records_dropped_overrun == sum of dropped_count |
|
||||
| NFR-perf-open | Microbench open_flight | p99 ≤ 50 ms |
|
||||
| NFR-perf-close | Microbench close_flight | p99 ≤ 200 ms |
|
||||
| NFR-perf-counters | Microbench writer loop with counter updates vs. without | overhead ≤ 0.5 µs per record |
|
||||
| NFR-reliability-idempotent-close | call close_flight twice | second returns same footer OR raises FdrAlreadyClosedError; no double-write |
|
||||
|
||||
## Constraints
|
||||
|
||||
- `FlightHeader.config_snapshot` MUST be JSON-safe (no Python objects); the composition root is responsible for serialising the typed Config dataclass into a plain dict before constructing the header.
|
||||
- `FlightHeader.manifest_content_hashes` MUST be a `dict[str, str]` of `{relative_path: sha256_hex}`; relative-path keys are repository-rooted (matches the helper from AZ-280 sha256_sidecar's invariants).
|
||||
- The footer's `clean_shutdown` flag is the ONLY way to distinguish a graceful landing from a crash; do NOT add a separate "fault" record kind for this purpose.
|
||||
- This task does NOT add new Python dependencies — `uuid`, `datetime`, and `time.monotonic_ns` are stdlib.
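A pre-flight check of the two header constraints above might look like the sketch below. The function name `check_header_inputs` is hypothetical, and two details are assumptions rather than spec: digests are taken to be lowercase hex, and "repository-rooted" is approximated as "not an absolute path". Note that `json.dumps` is a loose proxy for JSON-safety, since it silently converts tuples and non-string scalar keys.

```python
import json
import re

# Assumption: sha256 digests are written as 64 lowercase hex characters.
_SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def check_header_inputs(config_snapshot: dict, manifest_content_hashes: dict) -> None:
    """Raise TypeError/ValueError if the header inputs break the constraints."""
    # json.dumps raises TypeError for arbitrary Python objects and, with
    # allow_nan=False, ValueError for NaN/Infinity floats.
    json.dumps(config_snapshot, allow_nan=False)

    for path, digest in manifest_content_hashes.items():
        if not isinstance(path, str) or path.startswith("/"):
            raise ValueError(f"manifest key is not a relative path: {path!r}")
        if not isinstance(digest, str) or not _SHA256_HEX.fullmatch(digest):
            raise ValueError(f"manifest value is not a sha256 hex digest: {path!r}")
```

Running this in the composition root, before the header is constructed, keeps the failure away from the writer thread.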
|
||||
|
||||
## Risks & Mitigation
|
||||
|
||||
**Risk 1: FlightHeader carries secrets via config_snapshot**
|
||||
- *Risk*: A composition-root config block contains an API key (e.g. satellite-provider) and ends up in the FDR — operator workstations now hold credentials in plain JSON.
|
||||
- *Mitigation*: The composition root scrubs known-secret fields (per the redacted-config helper from AZ-269) before constructing the header. AC validation here checks the dict is JSON-safe; the secret-scrub is owned by the composition root and is out of scope for this task. Documented in the constraints.
|
||||
|
||||
**Risk 2: Counters drift under writer-thread crash**
|
||||
- *Risk*: A crash mid-flight leaves the in-memory counters un-flushed; the post-flight reader infers different counts from segment-walking than the (absent) footer.
|
||||
- *Mitigation*: The footer is the authoritative summary on clean shutdown; on crash the operator MUST re-derive counters from segment scan and treat the absence of a footer as a known signal. AC-7 covers this.
|
||||
|
||||
**Risk 3: open_flight side-effects on failure**
|
||||
- *Risk*: `open_flight` opens segment 0, writes a partial header, then fails — leaving a half-written first record on disk.
|
||||
- *Mitigation*: `open_flight` writes the header via `serialise(header_record)` first, computes the byte string, then performs a single `write()` + `fsync()`; on failure the segment file is closed and unlinked (since segment 0 is empty by construction at this point, deletion is safe). AC-4 covers this.
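The Risk 3 mitigation can be sketched with stdlib calls only. This is an illustration under stated assumptions: `write_header_or_unlink` is a hypothetical helper, the header bytes are assumed to be already serialised, and mapping the `OSError` to `FdrOpenError` plus releasing the filelock are left to the caller.

```python
import os
from pathlib import Path


def write_header_or_unlink(segment_path: Path, header_bytes: bytes) -> None:
    """Write the pre-serialised header with one write() + fsync(); clean up on failure."""
    # O_EXCL: segment 0 must not already exist. If open itself fails,
    # nothing was created and there is nothing to clean up.
    fd = os.open(segment_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        try:
            written = os.write(fd, header_bytes)
            if written != len(header_bytes):
                raise OSError("short write on flight_header")
            os.fsync(fd)
        finally:
            os.close(fd)
    except OSError:
        # The file holds at most a partial header at this point, so removing it is safe.
        try:
            os.unlink(segment_path)
        except FileNotFoundError:
            pass
        raise
```

Serialising before opening the file means a serialisation error can never leave a segment on disk at all.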
|
||||
|
||||
## Runtime Completeness
|
||||
|
||||
- **Named capability**: per-flight self-describing FDR (architecture / E-C13 / AC-NEW-3 every-payload-class-from-t=0; AC-NEW-3 audit trail).
|
||||
- **Production code that must exist**: real `FlightHeader` and `FlightFooter` dataclasses, real header/footer record append paths, real four-counter accounting in the writer-thread loop, real `clean_shutdown` flag.
|
||||
- **Allowed external stubs**: none — the header/footer + counters are the production runtime audit capability.
|
||||
- **Unacceptable substitutes**: header-or-footer-only emission ("we'll add the other one later"), counter values stored only in logs ("the log file is the audit trail"), or counters that DON'T include header/footer in `records_written` ("only producer records count") — the latter would force operators to do special-case math at audit time and is exactly the kind of off-by-N bug AC-NEW-3 traceability is meant to prevent.
|
||||
@@ -1,158 +0,0 @@
|
||||
# C13 64 GB Capacity Cap + Oldest-Segment-Dropped Policy
|
||||
|
||||
**Task**: AZ-293_c13_capacity_cap_policy
|
||||
**Name**: C13 Capacity Cap Policy
|
||||
**Description**: Enforce the per-flight ≤ 64 GB cap from AC-NEW-3 by observing the segment files written by the writer thread (AZ-291), deleting the oldest CLOSED segment when the cumulative on-disk size of the flight directory crosses the configured cap (default 64 GB; configurable down for tests), and emitting a `kind="segment_rollover"` `FdrRecord` carrying the dropped segment number, byte count freed, and total bytes after the drop. The drop is ALWAYS recorded — there is no config flag that silences `segment_rollover` records (per AC-NEW-3 + ADR-008 + C13-ST-01). The currently-open segment is NEVER dropped; only sealed segments older than the current one are eligible.
|
||||
**Complexity**: 5 points
|
||||
**Dependencies**: AZ-291_c13_writer_thread, AZ-292_c13_flight_header_footer, AZ-272_fdr_record_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
|
||||
**Component**: c13_fdr (epic AZ-248 / E-C13)
|
||||
**Tracker**: AZ-293
|
||||
**Epic**: AZ-248 (E-C13)
|
||||
|
||||
### Document Dependencies
|
||||
|
||||
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines the canonical shape of `kind="segment_rollover"` payloads (consumed: `old_segment`, `new_segment`, `total_bytes_after`).
|
||||
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config block carrying `flight_cap_bytes` (default 64 GiB, i.e. `64 * 1024**3` bytes; lowered in tests).
|
||||
|
||||
## Problem
|
||||
|
||||
The writer thread from AZ-291 rotates per-segment when the per-segment size cap is reached, but does NOT enforce the per-flight 64 GB cap from AC-NEW-3. Without:
|
||||
|
||||
- Drop policy when the flight directory crosses 64 GB, the writer would either run out of disk (likely on the Jetson NVM where other binaries live) or fail with `ENOSPC` and degrade per AC-5 of AZ-291. AC-NEW-3 requires the cap to be ENFORCED, not detected.
|
||||
- Oldest-segment-dropped semantics, the cap could be enforced by truncating the current segment — which would corrupt records mid-write and break the wire-format invariant from AZ-272.
|
||||
- A `kind="segment_rollover"` record per drop, the drop is silent — directly violating AC-NEW-3 ("no silent drops") and the C13-ST-01 security test ("no config flag silences these record kinds"). The drop record is ALSO the post-flight tooling's only way to learn that the flight USED to have records the file directory no longer contains.
|
||||
|
||||
## Outcome
|
||||
|
||||
- After every per-segment rotation the writer performs (AZ-291), this task checks whether the flight directory's cumulative on-disk byte size exceeds `flight_cap_bytes`. If yes, it deletes the oldest CLOSED segment (segment 0 first, then segment 1, etc., never the currently-open segment) and repeats until the directory size is back under cap.
|
||||
- For each drop, a `kind="segment_rollover"` `FdrRecord` is enqueued via the shared `FdrClient` for `producer_id="shared.fdr_client"`. The record carries `payload.old_segment` (the segment number that was deleted), `payload.new_segment` (the writer's currently-open segment number), and `payload.total_bytes_after` (the post-drop on-disk byte count).
|
||||
- The cap is configurable via `composition_root_protocol`'s `flight_cap_bytes` field (default 64 GiB, i.e. `64 * 1024**3` bytes; tests use 4 KiB or similar to exercise the policy without filling real disks).
|
||||
- The cap policy NEVER drops the currently-open segment (would interrupt mid-record), and NEVER drops `segment-0000.fdr` when it still holds the `flight_header` and is the only closed segment left. In that case the flight has exceeded what the cap can absorb, and a hard ERROR + GCS alert path is triggered instead, distinct from the normal drop path.
|
||||
- The post-flight reader uses the sequence of `segment_rollover` records to reconstruct what was dropped vs. what was retained, and the `FlightFooter`'s `rollover_count` (from AZ-292) reports the total number of cap-driven drops.
|
||||
|
||||
## Scope
|
||||
|
||||
### Included
|
||||
|
||||
- A `CapacityCapPolicy(writer: FileFdrWriter, cap_bytes: int, fdr_client: FdrClient)` class wired into `FileFdrWriter` via a documented post-rotation hook.
|
||||
- The hook is invoked AFTER every successful per-segment rotation (AZ-291's rotation completion path); it walks `flight_root/<flight_id>/`, sums on-disk byte sizes of all `segment-NNNN.fdr` files (excluding the currently-open segment whose byte count comes from the writer's running `bytes_written` counter), and decides whether to drop.
|
||||
- Drop ordering: oldest segment first. Segment numbers are monotonic (per AZ-291's filesystem layout `segment-NNNN.fdr`), so "oldest" = lowest segment number among CLOSED segments.
|
||||
- Drop mechanics: `os.unlink` the segment file, increment the writer's `rollover_count` (the counter from AZ-292), enqueue the `kind="segment_rollover"` record via the shared `FdrClient`. The record's `payload.old_segment` is the deleted segment number; `payload.new_segment` is the writer's currently-open segment; `payload.total_bytes_after` is recomputed after the unlink.
|
||||
- Loop until under cap: if a single drop does not bring the directory under cap (e.g. very large segments + long flight), drop the next-oldest segment and emit another `segment_rollover` record. AC-3 covers loop termination.
|
||||
- Special-case "only segment 0 with header remains, AND it is over cap by itself": this is the operator-error case (cap configured smaller than a single segment + header). Hard-fail: log ERROR `kind="fdr.cap_misconfigured"`, invoke the GCS alert (the same one AZ-291 wires for ENOSPC), and refuse to drop `segment-0000.fdr`. The flight continues in degraded mode — segments accumulate on disk past the cap until either a normal drop becomes possible or the operator lands.
|
||||
- A diagnostic INFO log per drop (`kind="fdr.cap_drop"; old_segment; new_segment; total_bytes_after`) — distinct from the FDR record itself; the log line is for operator debugging, the FDR record is the canonical audit trail.
|
||||
- Configuration: `flight_cap_bytes` is a single integer field on the `FdrWriterConfig` consumed via `composition_root_protocol`; the default is `64 * 1024**3` (64 GiB exactly per AC-NEW-3); valid range is `1024 .. 2**40` (1 KiB minimum for tests, 1 TiB maximum sanity bound).
|
||||
- The cap policy does NOT have a config flag to disable it. The implementation MUST NOT expose a "disable cap" boolean on any Config block — verified by C13-ST-01 (that test scans the config schema for any flag that could disable rollover-drop emission).
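The drop loop described in this list can be sketched as below. It is a simplified illustration, not the `CapacityCapPolicy` class itself: `scan_segments` and `enforce_cap` are hypothetical names, the rollover record is reduced to its payload dict and handed to a callback, the open segment's size is read from disk rather than from the writer's counter, and the `rollover_count` increment, the INFO log and the cap-misconfigured ERROR + GCS alert are omitted.

```python
import os
import re
from pathlib import Path
from typing import Callable

_SEGMENT_RE = re.compile(r"segment-(\d+)\.fdr")


def scan_segments(flight_dir: Path) -> dict[int, tuple[str, int]]:
    """Map segment number -> (file name, on-disk size) for each segment file."""
    found = {}
    with os.scandir(flight_dir) as entries:
        for entry in entries:
            match = _SEGMENT_RE.fullmatch(entry.name)
            if match and entry.is_file():
                found[int(match.group(1))] = (entry.name, entry.stat().st_size)
    return found


def enforce_cap(flight_dir: Path, open_segment: int, cap_bytes: int,
                emit_rollover: Callable[[dict], None]) -> list[int]:
    """Drop the oldest closed segments until the directory total is under cap."""
    segments = scan_segments(flight_dir)  # refreshed from disk on every hook entry
    total = sum(size for _, size in segments.values())
    dropped = []
    while total > cap_bytes:
        closed = sorted(n for n in segments if n != open_segment)
        if not closed or closed == [0]:
            break  # nothing droppable: the cap-misconfigured path would run here
        oldest = closed[0]
        name, size = segments.pop(oldest)
        os.unlink(flight_dir / name)
        total -= size
        dropped.append(oldest)
        emit_rollover({"old_segment": oldest, "new_segment": open_segment,
                       "total_bytes_after": total})
    return dropped
```

Scaled down to bytes, ten closed 100-byte segments with an empty open segment 10 and a cap of 350 would lose segments 0 through 6 and end at 300 bytes, since six drops still leave 400 bytes on disk.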
|
||||
|
||||
### Excluded
|
||||
|
||||
- Per-segment file rotation itself — owned by AZ-291.
|
||||
- `FlightHeader` / `FlightFooter` accounting and `rollover_count` storage — owned by AZ-292 (this task increments the counter; the counter itself lives in the writer).
|
||||
- The `kind="segment_rollover"` payload schema — owned by AZ-272 (this task constructs records that conform to that schema).
|
||||
- Mid-flight tile snapshot path and failed-tile thumbnail rate cap — tasks #4 and #5.
|
||||
- ENOSPC degraded-mode handling — owned by AZ-291 (this task uses the same GCS alert callable for the cap-misconfigured edge case).
|
||||
- Post-flight reader logic that reconstructs dropped data from the rollover records — out of scope this cycle.
|
||||
- Cross-flight retention (deleting OLD flight directories to free disk) — out of scope; the cap is per-flight, the operator manages cross-flight cleanup.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**AC-1: Drop oldest closed segment when directory exceeds cap**
|
||||
Given a flight directory with segments 0..3 each sized 100 KiB, currently-open segment 4 at 50 KiB, and `flight_cap_bytes = 350 KiB`
|
||||
When the writer rotates to segment 5 (segment 4 is now closed at 100 KiB; total = 500 KiB > 350 KiB cap)
|
||||
Then segment 0 is unlinked from disk; the writer's `rollover_count` increments by 1; a `kind="segment_rollover"` record lands on the FDR with `payload.old_segment=0`, `payload.new_segment=5`, `payload.total_bytes_after == sum(file_sizes(segment-0001..segment-0005))`
|
||||
|
||||
**AC-2: Loop until under cap**
|
||||
Given a flight directory with segments 0..9 each 100 KiB and `flight_cap_bytes = 350 KiB`, currently-open segment 10
|
||||
When the post-rotation hook runs
|
||||
Then segments 0, 1, 2, 3, 4, 5, 6 are deleted (in order); 7 `kind="segment_rollover"` records land on the FDR (one per drop); the directory total falls to ≤ 350 KiB
|
||||
|
||||
**AC-3: Loop terminates even when bytes_after never reaches cap (degenerate case)**
|
||||
Given a contrived test where `cap_bytes` is 100 KiB but the currently-open segment alone is already 200 KiB, AND only segment 0 (containing the flight_header) closed before
|
||||
When the post-rotation hook runs
|
||||
Then segment 0 is NOT dropped (it contains the header); ONE ERROR log (`kind="fdr.cap_misconfigured"`) is emitted; ONE GCS alert is invoked; the loop terminates within bounded time (≤ 100 ms p99); the flight continues in degraded mode
|
||||
|
||||
**AC-4: Currently-open segment is NEVER dropped**
|
||||
Given a flight directory with segments 0..2 closed and segment 3 currently open
|
||||
When the post-rotation hook runs (after rotating to segment 4) AND the cap is exceeded by the currently-open segment alone
|
||||
Then segment 4 (the new currently-open segment) is NOT dropped; older segments (0, 1, 2, 3) are dropped first per the oldest-first rule
|
||||
|
||||
**AC-5: segment_rollover record contains canonical fields**
|
||||
Given any cap-driven drop event
|
||||
When the test parses the resulting `segment_rollover` record
|
||||
Then `payload` has exactly `old_segment` (int), `new_segment` (int), `total_bytes_after` (int >= 0); the OUTER envelope's `producer_id == "shared.fdr_client"` (per the schema contract); the record's `ts` is within 100 ms of the `os.unlink` call
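A contract test for AC-5 could lean on a small checker like the one below. It is a sketch that assumes the parsed record is a plain dict with `kind`, `producer_id` and `payload` keys; the actual envelope layout is owned by AZ-272, and the `ts` tolerance check is left out because it needs the unlink timestamp from the test harness.

```python
EXPECTED_PAYLOAD_KEYS = {"old_segment", "new_segment", "total_bytes_after"}


def rollover_record_problems(record: dict) -> list[str]:
    """Return a list of human-readable violations; an empty list means OK."""
    problems = []
    if record.get("kind") != "segment_rollover":
        problems.append("kind is not segment_rollover")
    if record.get("producer_id") != "shared.fdr_client":
        problems.append("producer_id is not shared.fdr_client")

    payload = record.get("payload", {})
    if set(payload) != EXPECTED_PAYLOAD_KEYS:
        problems.append(f"unexpected payload keys: {sorted(payload)}")
    for key in sorted(EXPECTED_PAYLOAD_KEYS & set(payload)):
        value = payload[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{key} is not an int")
        elif key == "total_bytes_after" and value < 0:
            problems.append("total_bytes_after is negative")
    return problems
```

Returning a list instead of raising on the first violation lets one test failure report every broken field at once.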
|
||||
|
||||
**AC-6: No config flag disables segment_rollover emission**
|
||||
Given the project's full Config schema and every documented config preset
|
||||
When the test scans config classes for a field that could suppress `kind="segment_rollover"` records (per C13-ST-01)
|
||||
Then no such field exists; injecting a synthetic preset that attempts to suppress the record fails type-check or runtime validation
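One way to approach the schema scan in AC-6 is to walk the config dataclasses recursively and flag suspicious field names. The sketch below assumes the Config hierarchy is built from stdlib dataclasses whose nested fields are annotated with real classes (not string annotations). The `*Sketch` classes and their default values are placeholders; only the four field names come from the commit description.

```python
import dataclasses
import re
from dataclasses import dataclass, field

# Name patterns taken from the unit-test table: disable_segment_rollover,
# suppress_*, no_rollover.
_SUSPECT = re.compile(r"^(suppress_|disable_.*rollover|no_.*rollover)")


def find_suppress_fields(config_cls, prefix: str = "") -> list[str]:
    """Return dotted paths of fields whose names look like a rollover opt-out."""
    hits = []
    for f in dataclasses.fields(config_cls):
        if _SUSPECT.search(f.name):
            hits.append(prefix + f.name)
        if dataclasses.is_dataclass(f.type):  # recurse into nested config blocks
            hits.extend(find_suppress_fields(f.type, prefix + f.name + "."))
    return hits


@dataclass
class FdrWriterConfigSketch:
    segment_size_bytes: int = 64 * 1024**2  # placeholder default
    batch_size: int = 64                    # placeholder default
    flight_cap_bytes: int = 64 * 1024**3
    debug_log_per_record: bool = False


@dataclass
class FdrConfigSketch:
    writer: FdrWriterConfigSketch = field(default_factory=FdrWriterConfigSketch)
```

A name-based scan only catches flags that follow the listed patterns, so it complements, rather than replaces, the synthetic-preset half of the criterion.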
|
||||
|
||||
**AC-7: Default cap is exactly 64 GiB**
|
||||
Given a default `FdrWriterConfig` constructed with no overrides
|
||||
When the test reads `flight_cap_bytes`
|
||||
Then `flight_cap_bytes == 64 * 1024**3` (exactly 64 GiB)
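The default and the valid range stated in the scope section can be expressed as a small dataclass. This sketch uses the documented field name `flight_cap_bytes`; the class name `FdrWriterConfigSketch`, the constant names and the choice of exceptions are assumptions, and the real `FdrWriterConfig` carries further fields.

```python
from dataclasses import dataclass

MIN_FLIGHT_CAP_BYTES = 1024      # 1 KiB minimum, for tests
MAX_FLIGHT_CAP_BYTES = 2**40     # 1 TiB sanity bound


@dataclass(frozen=True)
class FdrWriterConfigSketch:
    flight_cap_bytes: int = 64 * 1024**3  # 64 GiB

    def __post_init__(self) -> None:
        cap = self.flight_cap_bytes
        # bool is an int subclass; reject it so True is never read as 1 byte.
        if isinstance(cap, bool) or not isinstance(cap, int):
            raise TypeError("flight_cap_bytes must be an int")
        if not MIN_FLIGHT_CAP_BYTES <= cap <= MAX_FLIGHT_CAP_BYTES:
            raise ValueError("flight_cap_bytes out of range 1024 .. 2**40")


default_cap = FdrWriterConfigSketch().flight_cap_bytes
```

Validating at construction time means an out-of-range cap fails when the composition root builds the config, not at the first rotation in flight.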
|
||||
|
||||
**AC-8: rollover_count from FlightFooter reconciles with segment_rollover record count**
|
||||
Given a flight that triggered N cap-driven drops over its lifetime
|
||||
When `close_flight()` runs and the test parses the footer
|
||||
Then `footer.rollover_count == N + per_segment_rotations` (the AZ-292 counter increments on EVERY rotation; cap-driven drops add to it; the segment_rollover record count provides cross-validation against the cap-driven subset)
|
||||
|
||||
## Non-Functional Requirements
|
||||
|
||||
**Performance**
|
||||
- Post-rotation hook execution time p99 ≤ 50 ms per rotation under steady-state (one drop per rotation at most, typical case). Per AC-2 worst case, multiple drops may extend the hook; the implementation MUST NOT block the writer thread's drain loop for more than 100 ms total even under worst-case multi-drop bursts (cap configured very low for tests).
|
||||
- `os.unlink` on the per-flight NVM (typical Jetson Orin Nano Super filesystem) takes < 5 ms p99 for files up to 256 MiB; the implementation relies on this, no async unlink.
|
||||
- Directory scan for byte counting uses a per-flight sorted segment list cached by the policy class and refreshed from disk once per hook entry (that is, once per rotation), NOT a fresh `os.scandir` on every loop iteration; `os.scandir` cost grows with segment count and would dominate for long flights.
|
||||
|
||||
**Reliability**
|
||||
- The cap policy MUST NOT attempt to delete a segment that is already gone or whose deletion is in progress. Re-entry into the hook before `os.unlink` returns is impossible because the writer thread is the sole invoker, but a previous unlink can leave a stale entry in the cached segment list, so the policy MUST refresh the list from disk on every hook entry.
|
||||
- A failed `os.unlink` (e.g. read-only filesystem, ENOENT for an already-deleted segment due to operator manual intervention) is logged at WARN with `kind="fdr.cap_unlink_failed"` and the policy continues to the next-oldest segment; it does NOT halt the writer.
|
||||
- The `segment_rollover` record is enqueued via the shared `FdrClient` (which has its own overrun policy from AZ-274); if the FdrClient's buffer is full at the moment of drop, the record itself may overrun — that is fine, AZ-274's overrun policy emits a `kind="overrun"` record with `producer_id="shared.fdr_client"` and the drop is still observable through AZ-292's `records_dropped_overrun` counter.
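The "log and continue" behaviour for a failed unlink can be sketched as a small wrapper. The stdlib `logging` module stands in here for the project's log module (AZ-266), whose API is not shown in this document; `try_unlink` and the logger name are hypothetical.

```python
import logging
import os

log = logging.getLogger("c13_fdr.cap_policy")  # stand-in for the project logger


def try_unlink(path: str) -> bool:
    """Unlink a segment; on failure log WARN and report False instead of raising."""
    try:
        os.unlink(path)
        return True
    except OSError as exc:
        # Covers ENOENT (already deleted by an operator) and EROFS/EACCES alike.
        log.warning("kind=fdr.cap_unlink_failed path=%s errno=%s", path, exc.errno)
        return False
```

A caller would move on to the next-oldest segment when this returns `False`, so a single bad file never halts the writer.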
|
||||
|
||||
## Unit Tests
|
||||
|
||||
| AC Ref | What to Test | Required Outcome |
|
||||
|--------|-------------|-----------------|
|
||||
| AC-1 | 4 closed segments × 100 KiB + 50 KiB open + cap=350 KiB; trigger rotation to segment 5 | Segment 0 deleted; one segment_rollover record with correct payload fields |
|
||||
| AC-2 | 10 closed segments × 100 KiB + cap=350 KiB; one rotation | 7 oldest segments deleted (in order); 7 segment_rollover records; final dir total ≤ 350 KiB |
|
||||
| AC-3 | Cap=100 KiB, segment 3 currently open at 200 KiB, only segment 0 (header) closed | Segment 0 NOT deleted; one ERROR log "fdr.cap_misconfigured"; one GCS alert; hook terminates ≤ 100 ms |
|
||||
| AC-4 | Currently-open segment exceeds cap by itself; older segments exist | Older segments drop first; currently-open never dropped |
|
||||
| AC-5 | Trigger any drop; parse the resulting segment_rollover record | payload has exactly old_segment / new_segment / total_bytes_after; outer producer_id == "shared.fdr_client"; ts within 100 ms of unlink |
|
||||
| AC-6 | Scan Config class hierarchy for "disable_segment_rollover" / "suppress_*" / "no_rollover" fields | None found; synthetic config trying to disable fails validation |
|
||||
| AC-7 | Default `FdrWriterConfig()` | `flight_cap_bytes == 64 * 1024**3` |
|
||||
| AC-8 | Run a flight with N cap-driven drops + M per-segment rotations; parse footer + segment_rollover records | `footer.rollover_count == N + M`; segment_rollover record count == N |
|
||||
| NFR-perf-hook | Microbench post-rotation hook with 1 drop | p99 ≤ 50 ms |
|
||||
| NFR-perf-multi-drop | Microbench worst-case multi-drop burst | total ≤ 100 ms |
|
||||
| NFR-reliability-stale-list | Manually delete a segment file under the policy; trigger hook | WARN log "fdr.cap_unlink_failed"; policy continues |
|
||||
|
||||
## Constraints
|
||||
|
||||
- The cap is enforced ONLY via oldest-segment-dropped. The implementation MUST NOT truncate any segment file, MUST NOT modify any record once written, MUST NOT seek into closed segments. AZ-291's "append-only between rotations" invariant extends to "no in-place modification across the entire flight".
|
||||
- The cap is applied to the SUM of all on-disk segment file sizes (closed + currently-open). Sidecar files outside the segment files (e.g. mid-flight tile snapshots from task #4 — those land in a separate path under `flight_root/<flight_id>/tiles/`) are NOT counted toward the cap; their cap is owned by task #4. This task's cap is segment-file-only.
|
||||
- The cap policy hook is wired by the composition root, NOT by AZ-291's writer constructor (so AZ-291 stays focused on per-segment lifecycle without knowing about per-flight cap policy). The composition root injects the policy as a callback the writer invokes after each rotation.
|
||||
- The configuration field name is `flight_cap_bytes`; renaming is a breaking change requiring a major bump on `composition_root_protocol`.
|
||||
- The `kind="segment_rollover"` record is mandatory per AC-NEW-3 + ADR-008 + C13-ST-01. There is no future PBI that adds an opt-out flag — that is a contract test, not a code-review preference.
|
||||
|
||||
## Risks & Mitigation
|
||||
|
||||
**Risk 1: Filesystem reports cached size, drop appears not to free space**
|
||||
- *Risk*: Some filesystems lazily release `unlink`ed inodes; `os.statvfs` immediately after `unlink` shows the bytes still allocated; the cap policy thinks it needs to drop more.
|
||||
- *Mitigation*: The policy uses `os.path.getsize` summed across actual segment files, NOT `statvfs` of the mount. Once the segment file is `unlink`ed, it no longer appears in `os.scandir` and is not summed. This is correct independent of inode-release timing.
|
||||
|
||||
**Risk 2: Operator manually deletes a segment mid-flight**
|
||||
- *Risk*: An operator with shell access to the companion deletes `segment-0001.fdr`; the policy's cached segment list is now stale.
|
||||
- *Mitigation*: Per NFR-reliability-stale-list, the policy refreshes from `os.scandir` on every hook entry, logs WARN if a previously-tracked segment is missing, and continues. Treat operator interference as out-of-band noise, not a failure mode.
|
||||
|
||||
**Risk 3: Cap policy and AZ-291's per-segment rotation race**
|
||||
- *Risk*: The policy reads the segment list while AZ-291 is opening a new segment; the new segment file may exist but be empty.
|
||||
- *Mitigation*: The hook is invoked SYNCHRONOUSLY by AZ-291's rotation completion path (not by a separate thread or timer). The writer thread is the sole mutator; there is no concurrent rotation. AC-1 verifies this end-to-end.
|
||||
|
||||
**Risk 4: GCS alert flooded by cap-misconfigured edge case**
|
||||
- *Risk*: AC-3 path triggers GCS alerts on every rotation; alerts overwhelm the GCS link.
|
||||
- *Mitigation*: Per-flight rate cap on `kind="fdr.cap_misconfigured"` GCS alerts — at most one per flight, since the misconfig is a flight-level constant. After the first alert, subsequent occurrences are logged at ERROR but NOT alerted.
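The one-alert-per-flight rule from Risk 4 amounts to a latch around the GCS alert callable. The sketch below is illustrative: `MisconfigAlertLatch` is a hypothetical name, the alert callable is assumed to take a single message string, and stdlib `logging` stands in for the project's log module.

```python
import logging
from typing import Callable

log = logging.getLogger("c13_fdr.cap_policy")  # stand-in for the project logger


class MisconfigAlertLatch:
    """Log every cap_misconfigured occurrence, but alert the GCS only once per flight."""

    def __init__(self, alert_gcs: Callable[[str], None]) -> None:
        self._alert_gcs = alert_gcs
        self._alerted = False  # one latch instance per flight

    def report(self, message: str) -> None:
        log.error("kind=fdr.cap_misconfigured %s", message)
        if not self._alerted:
            self._alerted = True
            self._alert_gcs(message)
```

No lock is needed as long as the writer thread is the only caller, which matches the synchronous hook invocation described under Risk 3.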
|
||||
|
||||
## Runtime Completeness
|
||||
|
||||
- **Named capability**: per-flight 64 GB cap enforcement with oldest-segment-dropped + canonical drop-record emission (architecture / E-C13 / AC-NEW-3, ADR-008).
|
||||
- **Production code that must exist**: real `os.unlink` on segment files, real `FdrRecord(kind="segment_rollover")` enqueue via the shared FdrClient, real config-driven cap reading, real loop-until-under-cap with degenerate-case handling.
|
||||
- **Allowed external stubs**: tests MAY stub the `FdrClient` (use FakeFdrSink from AZ-275) and the GCS alert callable; production wiring uses the real instances via the composition root.
|
||||
- **Unacceptable substitutes**: cap detection without enforcement ("we just log a warning when we exceed cap"), per-record drop instead of per-segment drop ("simpler to drop the oldest record"), in-place segment truncation ("avoid the unlink overhead"), suppressing the segment_rollover record under any config preset ("debug builds don't need the audit trail"), or replacing the cap policy with cross-flight cleanup ("we'll delete old FLIGHTS to make room"). All of those break AC-NEW-3 + ADR-008.
|
||||