[AZ-273] [AZ-274] [AZ-275] [AZ-267] [AZ-268] FDR producer chain + log bridge + contract test

AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-
two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError
on the public surface. make_fdr_client caches one client per producer_id
and reads capacity from config.fdr.per_producer_capacity with fallback
to queue_size.
AZ-274: default_overrun_policy implements drop-oldest + retry + immediate
marker emission, with prior-marker dropped_count folding via _evict_one
so user-loss info is never lost across iterations. ERROR diagnostic is
rate-limited to <=1/sec per producer.
AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the
production default_overrun_policy via a duck-typed _PolicyAdapter. The
test-only records/all_records_ever properties let component tests assert
both in-buffer and lifetime state. tests/conftest.py registers the
fake_fdr_sink fixture and an AST architecture lint forbids production
imports of fakes.
AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge
and forwards only WARN+ERROR records into the FDR with kind="log".
Thread-local recursion guard short-circuits internal logging; saturated-
queue diagnostics go to stderr every N=1000 drops.
AZ-268: tests/contract/log_schema.py covers every row of the schema's
Test Cases table plus the "DEBUG+INFO never reach FDR" invariant.
pyproject.toml registers the contract pytest marker and the
contract-mandated log_schema.py file-name.
251 unit + contract tests pass (48 new). Review verdict:
PASS_WITH_WARNINGS; findings are NFR-perf deferrals + documented
relaxation of AZ-274 AC-2 coalescing under permanently-stalled consumer.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-11 03:00:49 +03:00
parent 3acc7f33dd
commit ba20c2d195
24 changed files with 2714 additions and 20 deletions
@@ -1,100 +0,0 @@
# FDR Log Bridge (ERROR + WARN forwarding)
**Task**: AZ-267_fdr_log_bridge
**Name**: FDR Log Bridge
**Description**: Subscribe a logging Handler to the shared logger that forwards every ERROR and WARN record into the Flight Data Recorder via the FDR producer client, tagged `kind="log"` so post-flight tooling can correlate log events with the rest of the recorded telemetry.
**Complexity**: 2 points
**Dependencies**: AZ-266_log_module, AZ-247 (forward — FDR producer + record schema not yet decomposed; this task's contract surface is satisfied once AZ-247's record schema contract is published)
**Component**: shared.logging (cross-cutting; epic AZ-245 / E-CC-LOG)
**Tracker**: AZ-267
**Epic**: AZ-245 (E-CC-LOG)
### Document Dependencies
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — log envelope this bridge consumes (produced by AZ-266).
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — FDR record schema this bridge writes into (produced by AZ-247; document does not yet exist — Step 4 cross-verification will catch the forward reference).
## Problem
The acceptance criterion "ERROR + WARN records appear in FDR with `kind = \"log\"` and a back-reference to the originating component" requires a bridge between the shared Python `logging` machinery and the FDR producer client. Without this bridge, post-flight tools cannot correlate a `c5_state` ERROR log with the surrounding telemetry frames captured at the same flight time.
## Outcome
- Every emitted log record at level WARN or ERROR is enqueued into the FDR producer queue with `kind="log"` and the originating component slug preserved.
- INFO and DEBUG records are NEVER enqueued into FDR (verified by the contract test in PBI #3 of this epic).
- The bridge never blocks the calling thread — it uses the FDR producer client's drop-oldest semantics so a saturated queue cannot stall a `logger.error(...)` call on the hot path.
## Scope
### Included
- A logging Handler subclass installed onto the root onboard logger (or each `get_logger(...)` instance, whichever the AZ-266 implementation chose) that subscribes to records at WARN and ERROR.
- Translation logic from `LogRecord` (per `log_record_schema` v1.0.0) into the FDR record envelope expected by the FDR producer client, with `kind="log"` and a `component` back-reference.
- Wire-up in the composition root (consumed from AZ-246 / E-CC-CONF) so the bridge is attached exactly once, after the logger and the FDR client are both initialised.
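The handler described in this scope can be sketched as follows. This is a minimal illustration, not the AZ-267 implementation: `StubFdrQueue`, `LogBridgeHandler`, the dict-shaped record, and the mapping of CRITICAL onto `"ERROR"` are all assumptions made for the example; only the `kind="log"` tag, the `component` back-reference, the `exc` field, and the thread-local recursion guard come from the spec.

```python
import logging
import threading


class StubFdrQueue:
    """Hypothetical stand-in for the FDR producer client; only enqueue() is assumed."""

    def __init__(self):
        self.records = []

    def enqueue(self, record):
        self.records.append(record)


class LogBridgeHandler(logging.Handler):
    """Forwards WARNING and above into an FDR-like sink, tagged kind="log"."""

    _guard = threading.local()  # per-thread "already inside the bridge" flag

    def __init__(self, sink):
        # The handler level filters out DEBUG and INFO before emit() is called.
        super().__init__(level=logging.WARNING)
        self._sink = sink

    def emit(self, record):
        if getattr(self._guard, "active", False):
            return  # a logging call made from inside the bridge: short-circuit
        self._guard.active = True
        try:
            exc = None
            if record.exc_info:
                exc = logging.Formatter().formatException(record.exc_info)
            self._sink.enqueue({
                "kind": "log",
                # Assumption: anything at ERROR or above is reported as "ERROR".
                "level": "WARN" if record.levelno < logging.ERROR else "ERROR",
                "component": record.name,
                "msg": record.getMessage(),
                "exc": exc,
            })
        except Exception:
            pass  # the bridge must never raise into the caller
        finally:
            self._guard.active = False


sink = StubFdrQueue()
logger = logging.getLogger("c5_state")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(LogBridgeHandler(sink))

logger.info("ignored by the bridge")
logger.warning("low battery")
try:
    1 / 0
except ZeroDivisionError:
    logger.exception("boom")
```

After this runs, `sink.records` holds two entries: the WARN record and the ERROR record with a formatted traceback in `exc`.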
### Excluded
- The FDR producer client itself — owned by AZ-247 / E-CC-FDR-CLIENT.
- The on-disk FDR segment writer thread — owned by AZ-248 / E-C13.
- The contract test that verifies "DEBUG + INFO never reach FDR" — owned by PBI #3 of this epic (next task).
- Per-component log call sites — owned by each component epic.
## Acceptance Criteria
**AC-1: WARN records reach FDR**
Given the bridge is installed and the FDR client's queue is below capacity
When any component emits `logger.warning(...)` via the shared logger
Then a single FDR record with `kind="log"`, `level="WARN"`, and `component=<originating component slug>` is enqueued
**AC-2: ERROR records reach FDR with traceback when applicable**
Given the bridge is installed
When a component emits `logger.exception(...)` from inside an `except` clause
Then the enqueued FDR record's `exc` field carries the formatted traceback string from the `LogRecord`
**AC-3: INFO and DEBUG never reach FDR**
Given the bridge is installed
When any component emits `logger.info(...)` or `logger.debug(...)`
Then no FDR record is enqueued for that log call (verified by both unit tests here and the contract test in the next task)
**AC-4: Backpressure is non-blocking**
Given the FDR producer queue is at its drop-oldest threshold
When a component emits `logger.error(...)` on the hot path
Then the call returns within the same latency budget as a stdout-only WARN call (no blocking on the queue), and the FDR client's existing drop counter is incremented
**AC-5: Single attachment**
Given `compose_root(config)` runs at process start
When the bridge wire-up is invoked
Then exactly one bridge Handler is attached to the logger; reinitialising the composition root in tests does not stack duplicates
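One way to satisfy the single-attachment criterion is to make the wire-up idempotent. The sketch below assumes a placeholder `BridgeHandler` class and a hypothetical helper name `attach_bridge_once`; the real wire-up function and its signature are not defined in this spec.

```python
import logging


class BridgeHandler(logging.Handler):
    """Placeholder bridge; a real handler would enqueue into the FDR client."""

    def emit(self, record):
        pass


def attach_bridge_once(logger):
    """Attach a BridgeHandler unless one is already present; return the attached one."""
    for existing in logger.handlers:
        if isinstance(existing, BridgeHandler):
            return existing
    handler = BridgeHandler(level=logging.WARNING)
    logger.addHandler(handler)
    return handler


demo_logger = logging.getLogger("bridge_once_demo")
first = attach_bridge_once(demo_logger)
second = attach_bridge_once(demo_logger)  # simulates a second compose_root call
bridge_count = sum(isinstance(h, BridgeHandler) for h in demo_logger.handlers)
```

Checking by handler type rather than keeping a module-level flag means the check still holds when a test tears down and rebuilds the composition root in the same process.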
## Non-Functional Requirements
**Performance**
- The bridge adds ≤ 0.05 ms p99 latency on top of the formatter's 0.2 ms budget (i.e. logger.error → bridge enqueue total p99 ≤ 0.25 ms on Tier-2).
**Reliability**
- A failure to enqueue (queue full + drop-oldest already saturated) MUST NOT raise into the caller; it MUST log an internal `WARN` record once every N occurrences, where N is at least 1000. The record goes to stdout only; recursion into the bridge is short-circuited by a thread-local flag.
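The "once every N occurrences" rule can be sketched with a plain counter. `DropReporter` and its message text are hypothetical. The default stream here is stderr, following the commit message, whereas the requirement above names stdout; the stream is therefore left as a parameter.

```python
import io
import sys


class DropReporter:
    """Counts failed enqueues and writes one diagnostic line per `every` drops."""

    def __init__(self, every=1000, stream=None):
        self._every = every
        self._stream = stream if stream is not None else sys.stderr
        self.dropped = 0

    def record_drop(self):
        self.dropped += 1
        if self.dropped % self._every == 0:
            # Written directly to the stream, so it never re-enters the logger.
            self._stream.write(f"fdr log bridge: {self.dropped} records dropped\n")


out = io.StringIO()
reporter = DropReporter(every=1000, stream=out)
for _ in range(2500):
    reporter.record_drop()
diagnostic_lines = out.getvalue().splitlines()
```

With 2500 drops the reporter writes two lines, at drop 1000 and drop 2000.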
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Emit a WARN through the shared logger with the bridge installed | Stub FDR queue receives one record with `kind="log"`, `level="WARN"`, `component` matching origin |
| AC-2 | Inside an `except` block, call `logger.exception("boom")` | Stub FDR queue's record carries non-empty `exc` traceback string |
| AC-3 | Emit INFO and DEBUG records | Stub FDR queue receives zero records |
| AC-4 | Pre-fill stub FDR queue to drop-oldest threshold, then emit an ERROR | Caller returns under 0.5 ms wall clock; FDR client's drop counter increments |
| AC-5 | Call `compose_root` twice with the same config in a single process | Logger has exactly one bridge Handler attached after the second call |
## Constraints
- The bridge has a forward dependency on AZ-247 (FDR producer client + record schema). It cannot pass its own AC tests until AZ-247 is implemented; Step 4 cross-verification will record this temporal dependency in `_dependencies_table.md`.
- The bridge's record translation MUST consume only the public surface of `log_record_schema` v1.0.0 — no peeking into formatter internals.
## Risks & Mitigation
**Risk 1: Recursion via internal `WARN` on enqueue failure**
- *Risk*: The "queue full" internal WARN itself goes through the bridge, recurses, and corrupts the queue further.
- *Mitigation*: Thread-local "in-bridge" flag short-circuits any logging call originating from the bridge itself; verified by a unit test that fills the queue and asserts no infinite loop.
**Risk 2: Forward dependency on AZ-247 contract not yet written**
- *Risk*: The FDR record schema is described in epic AZ-247's text but not yet a contract file; this task's expectations may drift from AZ-247's eventual contract.
- *Mitigation*: AZ-247's first PBI MUST publish `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` before AZ-247's other PBIs; this task's implementation begins only after that contract exists. Step 4 cross-verification flags the temporal dependency.
@@ -1,68 +0,0 @@
# Log Schema Contract Test
**Task**: AZ-268_log_schema_contract_test
**Name**: Log Schema Contract Test
**Description**: A standalone test module that verifies every shared logger emission conforms to `log_record_schema` v1.0.0 — field names, field ordering, required keys, and the "INFO + DEBUG never reach FDR" invariant.
**Complexity**: 2 points
**Dependencies**: AZ-266_log_module, AZ-267_fdr_log_bridge
**Component**: shared.logging (cross-cutting; epic AZ-245 / E-CC-LOG)
**Tracker**: AZ-268
**Epic**: AZ-245 (E-CC-LOG)
### Document Dependencies
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — the contract this test verifies.
## Problem
The shared logging contract (v1.0.0) declares a strict 8-field set with mandated ordering. Without an automated test that parses raw emitted bytes and asserts the contract, formatter changes can silently drift the schema and break post-flight FDR analysis tools that depend on stable column ordering.
## Outcome
- A single test module under `tests/contract/log_schema.py` runs in unit-test scope, fails CI fast on any schema drift, and is the single authority that enforces the contract at code-review time.
- "DEBUG + INFO never reach FDR" is verified by a paired test case that wires a stub FDR queue and asserts zero records after a fixed batch of INFO/DEBUG calls.
## Scope
### Included
- One test file (`tests/contract/log_schema.py` per epic AZ-245 AC-4) with cases for every row in the contract's "Test Cases" table (valid-info-no-frame, valid-warn-with-frame, valid-error-with-exc, invalid-bad-level, invalid-multiline-msg, invalid-non-serialisable-kv, ordering-stable).
- A "DEBUG + INFO never reach FDR" case that uses a stub FDR queue.
- A pytest marker (`contract`) so CI can run contract tests as a discrete stage if desired.
### Excluded
- Integration-level "every component logs at least one record" tests — owned by per-component test specs in their own epics (Step 9 Decompose Tests).
- Performance microbenchmarks for the formatter — owned by the AZ-266 unit tests.
## Acceptance Criteria
**AC-1: Contract cases all pass**
Given the AZ-266 + AZ-267 implementations are complete
When `pytest tests/contract/log_schema.py` runs
Then all test cases listed in `_docs/02_document/contracts/shared_logging/log_record_schema.md § Test Cases` pass
**AC-2: Schema drift fails fast**
Given a hypothetical formatter change that re-orders the JSON keys
When `pytest tests/contract/log_schema.py` runs
Then the `ordering-stable` case fails with a diff showing actual vs. expected key order
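A possible shape for the `ordering-stable` check is shown below. The four field names in `EXPECTED_ORDER` are hypothetical placeholders; the real contract defines eight fields that are not listed in this spec. The sketch relies on `json.loads` preserving key order in the resulting `dict`.

```python
import json

# Hypothetical subset; the real order comes from log_record_schema v1.0.0.
EXPECTED_ORDER = ("ts", "level", "component", "msg")


def check_key_order(line, expected=EXPECTED_ORDER):
    """Raise AssertionError with expected vs. actual order if the keys drifted."""
    actual = list(json.loads(line))
    if actual != list(expected):
        raise AssertionError(
            "key order drifted\n"
            f"  expected: {list(expected)}\n"
            f"  actual:   {actual}"
        )


good = '{"ts": 1, "level": "WARN", "component": "c5_state", "msg": "x"}'
drifted = '{"level": "WARN", "ts": 1, "component": "c5_state", "msg": "x"}'
check_key_order(good)  # passes silently
```

Calling `check_key_order(drifted)` raises, and the message shows both orders side by side, which is the diff the criterion asks for.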
**AC-3: FDR-suppression invariant verified**
Given a stub FDR queue wired into the bridge
When the test emits 100 INFO + 100 DEBUG records
Then the stub queue reports zero records received
**AC-4: Contract version pinned**
Given the test imports the contract version constant
When the contract is bumped to a new major version
Then the test fails until updated, preventing accidental coupling to an unreviewed contract change
## Non-Functional Requirements
**Reliability**
- The test never depends on real FDR I/O — only on the documented `enqueue` interface of the FDR producer client.
## Constraints
- Test file path is fixed at `tests/contract/log_schema.py` per epic AZ-245 AC-4 (allows the `traceability-matrix` reference to remain stable).
- Contract version constant must be sourced from a single location (the contract file or a generated constant) — never duplicated across the test and the formatter.
@@ -1,151 +0,0 @@
# FdrClient Lock-Free SPSC Ring Buffer + Public API
**Task**: AZ-273_fdr_client_ringbuf
**Name**: FdrClient Ring Buffer
**Description**: Implement the producer-side `FdrClient(producer_id)` and its lock-free single-producer / single-consumer (SPSC) ring buffer. `enqueue` is non-blocking even when the C13 writer thread is stalled. Capacity is configurable per producer via the cross-cutting Config block. The buffer exposes a hook the overrun-policy task (next PBI) plugs into; this task does NOT implement the drop-oldest emission itself.
**Complexity**: 5 points
**Dependencies**: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-269_config_loader, AZ-266_log_module
**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
**Tracker**: AZ-273
**Epic**: AZ-247 (E-CC-FDR-CLIENT)
### Document Dependencies
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — the record envelope this client enqueues.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — the Config object that carries this client's capacity setting.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — diagnostic logs emitted by this client (NOT on the steady-state hot path).
## Problem
Every onboard component needs to publish FDR records in real time without blocking on the writer thread, the disk, or any other producer. AC-NEW-3 ("no silent drops") and the steady-state `enqueue` p99 ≤ 5 µs budget rule out:
- Any lock-acquiring queue (Python `queue.Queue`, `threading.Lock`-protected list, asyncio queue).
- Any allocation on the steady-state path (no `dict.copy()`, no `list.append` that may resize, no `dataclasses.replace`).
- Any blocking I/O.
Without a shared, contract-frozen client, every component would re-implement its own queue, drift on overrun semantics, and break the AC-NEW-3 guarantee within weeks of parallel development.
## Outcome
- A single `FdrClient(producer_id)` is the only handle any onboard producer ever holds; constructed by the composition root and injected into each component.
- `enqueue` p99 ≤ 5 µs on Tier-2 with no allocation on the steady-state path (pre-sized buffers; reused slots).
- `enqueue` NEVER blocks, regardless of writer-thread state. When the buffer is full, control returns to the caller in O(1); the overrun policy (drop-oldest + emit `kind="overrun"`) is implemented by the next PBI via the buffer's documented hook.
- The dequeue side (`pop_one` / iterator) is consumed exclusively by the C13 writer thread; the contract documents it as SPSC — multi-consumer is undefined behaviour and rejected by the contract test.
## Scope
### Included
- `FdrClient(producer_id: str, capacity: int)` constructor + module-level `make_fdr_client(producer_id, config) -> FdrClient` factory that reads capacity from the cross-cutting `config.fdr_client.<producer_id>.capacity` block (with documented default).
- `FdrClient.enqueue(record: FdrRecord) -> EnqueueResult` — lock-free, non-blocking, allocation-free on the steady-state path. Returns `EnqueueResult.OK` or `EnqueueResult.OVERRUN` (the next PBI consumes `OVERRUN`).
- A documented `on_overrun: Callable[[FdrRecord], None] | None` hook the overrun-policy PBI populates with the drop-oldest + record-emit closure.
- Single-consumer dequeue API for the C13 writer: `pop_one() -> FdrRecord | None` and `drain(max_records: int) -> list[FdrRecord]`.
- `flush() -> None` test-only method that blocks until the buffer is empty (used by `FakeFdrSink` and contract tests; production callers MUST NOT call this on the hot path).
- Diagnostic INFO log on construction (one-time, NOT on the steady-state hot path) via the shared logger.
- Public interface contract published at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`.
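The enqueue and dequeue surface listed above can be illustrated with a small pure-Python ring. This is a sketch under stated assumptions, not the AZ-273 implementation: the class name `SpscRing` is hypothetical, the power-of-two capacity rule is taken from the commit message, index visibility between threads relies on CPython's GIL, and nothing here is claimed to be allocation-free or to meet the 5 µs budget.

```python
from enum import Enum


class EnqueueResult(Enum):
    OK = "ok"
    OVERRUN = "overrun"


class SpscRing:
    """Single-producer / single-consumer ring with pre-allocated slots."""

    def __init__(self, capacity):
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity  # allocated once, reused afterwards
        self._mask = capacity - 1
        self._head = 0  # next index to read; written only by the consumer
        self._tail = 0  # next index to write; written only by the producer
        self.on_overrun = None

    def enqueue(self, record):
        if self._tail - self._head > self._mask:  # tail - head == capacity: full
            if self.on_overrun is not None:
                self.on_overrun(record)
            return EnqueueResult.OVERRUN
        self._slots[self._tail & self._mask] = record
        self._tail += 1  # publish only after the slot is filled
        return EnqueueResult.OK

    def pop_one(self):
        if self._head == self._tail:
            return None  # empty
        index = self._head & self._mask
        record = self._slots[index]
        self._slots[index] = None  # release the reference held by the slot
        self._head += 1
        return record

    def drain(self, max_records):
        out = []
        while len(out) < max_records:
            record = self.pop_one()
            if record is None:
                break
            out.append(record)
        return out


overruns = []
ring = SpscRing(4)
ring.on_overrun = overruns.append
results = [ring.enqueue(f"rec-{i}") for i in range(5)]
```

Each index is written by exactly one side, which is what lets the producer and consumer avoid a lock; a second producer or consumer would break that assumption.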
### Excluded
- The drop-oldest behaviour and the `kind="overrun"` record emission — owned by the next PBI in this epic.
- The C13 writer thread itself, segment files, segment rotation, 64 GB cap — owned by E-C13 (AZ-248).
- The `FakeFdrSink` for tests — owned by the fourth PBI in this epic.
- Multi-producer / multi-consumer ring buffer — out of scope; the contract is SPSC.
- The actual `FdrRecord` schema and serialiser — owned by AZ-272.
## Acceptance Criteria
**AC-1: Lock-free, never blocks**
Given an FdrClient with capacity 1024 and a writer thread that is stalled (does not dequeue)
When the producer calls `enqueue(record)` 1025 times in rapid succession
Then every call returns within 50 µs (no thread state ever transitions to BLOCKED), and the 1025th call returns `EnqueueResult.OVERRUN`
**AC-2: Allocation-free steady-state**
Given an FdrClient warmed up with one prior `enqueue`
When the producer calls `enqueue(record)` for an in-buffer record (slot is free)
Then the call performs zero heap allocations (verified via `tracemalloc` snapshot diff: 0 new objects on the hot path)
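A measurement helper along these lines could back the allocation check. It is a sketch: `allocated_bytes` is a hypothetical name, it uses `tracemalloc.get_traced_memory()` rather than a full snapshot diff, and it reports net growth in traced bytes, so it says nothing about objects allocated and freed inside the call.

```python
import tracemalloc


def allocated_bytes(fn):
    """Return the net change in traced heap bytes across one call to fn()."""
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        fn()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # note: also stops tracing started by the caller
    return after - before


retained = []
grew_by = allocated_bytes(lambda: retained.append(bytearray(100_000)))
```

Here `grew_by` is at least 100,000 because the buffer stays referenced from `retained`; a hot-path test would call the helper around a single warmed-up `enqueue` and compare the result against its own threshold.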
**AC-3: Capacity is config-driven**
Given the cross-cutting Config block sets `config.fdr_client.<producer_id>.capacity = 4096`
When `make_fdr_client(producer_id, config)` runs
Then the returned client's internal buffer length is 4096 (verified via the test-only `_capacity()` introspection method)
**AC-4: SPSC dequeue contract**
Given two threads concurrently call `pop_one()`
When both calls race
Then the contract test detects undefined behaviour (asserted via a contract test that wraps `pop_one` in a guard which raises `FdrSpscViolationError` on concurrent entry — the guard is opt-in for tests but documents the SPSC invariant)
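The opt-in guard can be sketched as a context manager that refuses concurrent entry. `SpscGuard` is a hypothetical name. It uses a non-blocking `threading.Lock.acquire`, which is acceptable only because the guard is test-only; the production `pop_one` path would not go through it.

```python
import threading


class FdrSpscViolationError(RuntimeError):
    """Raised when a second consumer enters the guarded section concurrently."""


class SpscGuard:
    def __init__(self):
        self._busy = threading.Lock()

    def __enter__(self):
        # Never waits: a failed acquire means another thread is already inside.
        if not self._busy.acquire(blocking=False):
            raise FdrSpscViolationError("concurrent consumer detected")
        return self

    def __exit__(self, exc_type, exc, tb):
        self._busy.release()
        return False


guard = SpscGuard()
violations = []


def second_consumer():
    try:
        with guard:
            pass
    except FdrSpscViolationError as err:
        violations.append(err)


with guard:  # the first consumer is inside the guarded section
    worker = threading.Thread(target=second_consumer)
    worker.start()
    worker.join()
```

Holding the guard in one thread while the second thread tries to enter makes the race deterministic, so the test does not depend on timing.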
**AC-5: Overrun hook is wired**
Given an `FdrClient` with `on_overrun` set to a recording closure
When the buffer fills and the next `enqueue` would overrun
Then `on_overrun` is invoked exactly once per overrun event with the would-be-enqueued record (the closure decides what to do — drop-oldest + emit, log only, etc.; this PBI does NOT define that behaviour)
**AC-6: flush() drains buffer**
Given an FdrClient with N records buffered and a consumer thread draining
When the test calls `flush()`
Then `flush()` returns only after `pop_one()` has been called N times (no records left in the buffer)
**AC-7: producer_id is non-empty and stamped on every record**
Given a constructor call `FdrClient(producer_id="")` (empty string)
When construction runs
Then `ValueError` is raised — anonymous producers are forbidden
## Non-Functional Requirements
**Performance**
- `enqueue` p99 ≤ 5 µs on Tier-2 (Jetson Orin Nano Super) for a record carrying a `payload` dict of ≤ 16 scalar entries. Validated by a microbenchmark (10k iterations, warm cache).
- `pop_one` p99 ≤ 10 µs on Tier-2 under steady-state.
- Memory: per-producer ring buffer ≤ `capacity * sizeof(slot)` bytes; no unbounded growth. Pre-sized at construction.
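A microbenchmark harness for the p99 budgets might look like the sketch below. The function names are hypothetical, the percentile uses the nearest-rank definition, and each sample includes the overhead of two `time.perf_counter_ns()` calls, so small budgets need that overhead measured and accounted for separately.

```python
import time


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (len(ordered) * 99 + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def bench_p99_ns(fn, iterations=10_000, warmup=1_000):
    """Time fn() `iterations` times after a warm-up and return the p99 in ns."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return p99(samples)


baseline_ns = bench_p99_ns(lambda: None, iterations=1_000, warmup=100)
```

Running the harness on a no-op first gives a baseline for the timer overhead on the target board.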
**Reliability**
- `enqueue` never raises into the caller. Schema violations from `FdrRecord` are caught and forwarded to the same `on_overrun` hook with a synthetic flag (the overrun-policy PBI decides what to do); the producer's hot path stays clean.
- Multiple `make_fdr_client(producer_id, config)` calls with the same `producer_id` return the same cached instance — there is exactly one FdrClient per producer_id per process.
**Concurrency**
- SPSC: ONE producer thread MAY call `enqueue`, ONE consumer thread MAY call `pop_one` / `drain`. Multi-producer or multi-consumer use is undefined behaviour and detected by the contract guard (AC-4).
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Stalled consumer + 1025 enqueues into a 1024-capacity client | Every call returns within 50 µs; #1025 returns `OVERRUN` |
| AC-2 | `tracemalloc` snapshot diff across one `enqueue` after warmup | Zero new objects allocated |
| AC-3 | `make_fdr_client("c1_vio", config_with_capacity_4096)` | `client._capacity() == 4096` |
| AC-4 | Two threads call `pop_one()` concurrently with the SPSC guard enabled | `FdrSpscViolationError` raised |
| AC-5 | Wire a recording `on_overrun`; force overrun | Closure invoked exactly once with the offending record |
| AC-6 | Enqueue N records, start a draining consumer, call `flush()` | `flush()` returns only after buffer is empty |
| AC-7 | `FdrClient(producer_id="")` | `ValueError` |
| NFR-perf | Microbench `enqueue` over 10k iterations on Tier-2 | p99 ≤ 5 µs |
| NFR-perf-pop | Microbench `pop_one` over 10k iterations | p99 ≤ 10 µs |
| NFR-reliability | Two `make_fdr_client("c1_vio", config)` calls | same instance returned |
## Constraints
- Public surface frozen by `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` v1.0.0.
- SPSC only — multi-producer / multi-consumer is out of scope and the contract test asserts the SPSC guard exists.
- The lock-free implementation MAY use `multiprocessing.shared_memory`, `cffi`-backed atomics, a Cython extension, or pure Python with `array.array` + a single CAS-like primitive — implementation choice is internal to this PBI but MUST satisfy the allocation-free + non-blocking ACs above. Prefer the simplest working option that hits the budget; document the choice in the implementation report.
- No new dependency beyond what AZ-263 / E-BOOT pinned.
## Risks & Mitigation
**Risk 1: Pure-Python SPSC ring cannot hit the 5 µs p99 budget on Tier-2**
- *Risk*: CPython's GIL + dict operations push p99 above 5 µs on the Jetson.
- *Mitigation*: Bench against a `cffi` or Cython-backed SPSC ring as a fallback; the contract is library-agnostic so the implementation can swap without breaking consumers. Decision is taken inside this PBI's implementation phase with the microbench as the oracle.
**Risk 2: Overrun hook called with record that holds a reference to caller-mutable state**
- *Risk*: Producer mutates `record.payload` after `enqueue`; the overrun closure sees the mutated value.
- *Mitigation*: `FdrRecord` is `@frozen` (per AZ-272 contract); the contract test verifies a producer cannot legally mutate a constructed record. Documented in the contract `Invariants`.
**Risk 3: Cached FdrClient leaks across test cases**
- *Risk*: A pytest test mutates the module-level cache; subsequent tests get a stale FdrClient.
- *Mitigation*: A `_reset_for_tests()` private function (documented as test-only in the contract `Non-Goals`) clears the cache; integration test fixture calls it on teardown.
## Runtime Completeness
- **Named capability**: lock-free SPSC ring buffer + `FdrClient` public API (architecture / E-CC-FDR-CLIENT / AC-NEW-3, NFR `enqueue` p99 ≤ 5 µs).
- **Production code that must exist**: real lock-free SPSC primitive (no Python `queue.Queue`, no lock-acquiring fallback); real allocation-free hot path; real `on_overrun` hook plumbing.
- **Allowed external stubs**: none — the queue is the production runtime capability.
- **Unacceptable substitutes**: `queue.Queue`, `threading.Lock`-guarded list, `collections.deque` with a lock, "for now we just `time.sleep(0)` on overrun", or any implementation that allocates on the steady-state path. These would all silently break AC-NEW-3 the moment the writer thread stalls for >100 ms.
## Contract
This task produces the contract at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`.
Consumers MUST read that file — not this task spec — to discover the interface.
@@ -1,125 +0,0 @@
# Drop-Oldest Policy + `kind="overrun"` Record Emission
**Task**: AZ-274_fdr_overrun_emission
**Name**: FDR Overrun Policy
**Description**: Wire the producer-side overrun policy on top of the FdrClient ring buffer. When a producer's enqueue would overflow, the policy drops the OLDEST queued record from that producer's buffer to make room for the new record AND synthesises a `FdrRecord(kind="overrun", payload={producer_id, dropped_count})` that lands on the same queue. This is the production-side enforcement of AC-NEW-3 ("no silent drops").
**Complexity**: 2 points
**Dependencies**: AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf
**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
**Tracker**: AZ-274
**Epic**: AZ-247 (E-CC-FDR-CLIENT)
### Document Dependencies
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines the canonical shape of `kind="overrun"` records (consumed: `payload.producer_id` + `payload.dropped_count`).
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — defines the `on_overrun` hook this task implements + the "exactly-once" invariant.
## Problem
AZ-273 (FdrClient ring buffer) leaves the `on_overrun` hook unwired by default. In production, an unwired hook means the buffer silently drops `OVERRUN` events — directly violating AC-NEW-3 and breaking C13's invariant that every dropped record is recoverable from a `kind="overrun"` record on the FDR. This task closes that gap by providing the canonical drop-oldest hook and registering it via the composition root for every onboard producer.
## Outcome
- A single, contract-frozen drop-oldest hook is the only `on_overrun` callable any production FdrClient is wired to. Tests MAY substitute their own.
- For every burst that exceeds capacity, a coalesced `kind="overrun"` record is enqueued on the SAME producer's buffer carrying the originating producer's slug + `dropped_count` reflecting how many records were dropped in the burst (coalescing keeps the overrun record from itself triggering further overruns when bursts are sustained).
- The composition root wires the hook on every FdrClient created via `make_fdr_client` — consumers (component code) do not interact with the hook directly.
## Scope
### Included
- A `default_overrun_policy(client: FdrClient) -> Callable[[FdrRecord], None]` factory that returns the canonical drop-oldest closure for the given client.
- Drop-oldest semantics: when `enqueue` returns `OVERRUN`, the closure pops one record from the buffer's tail (oldest), discards it, retries the new record's enqueue (one retry only), and arranges for a `kind="overrun"` record to land on the same buffer. If the retry also fails, the policy logs an ERROR via the shared logger (`kind="fdr.overrun_retry_failed"`) — this is rare; it implies the consumer is making zero progress.
- Coalescing: while a burst of consecutive overruns is in flight (consecutive `OVERRUN` returns within the same producer "tick"), the policy increments `dropped_count` on the in-flight overrun record instead of synthesising a new one per drop. The overrun record itself is enqueued at the END of the burst (next successful `enqueue` slot).
- Composition-root wiring: `make_fdr_client` is updated (or a new `wire_fdr_client_overrun(client)` helper is exposed and called inside `make_fdr_client`) so every production FdrClient is constructed with this policy attached. Tests that explicitly construct `FdrClient(...)` directly opt out by leaving `on_overrun` as `None`.
- Diagnostic ERROR log only when the retry-after-drop also fails (NOT on every overrun — overruns are normal under bursty load and would flood the log).
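The drop-oldest, single-retry, and coalescing rules above can be sketched against a simple bounded buffer. Everything named here is hypothetical: `BoundedBuffer` stands in for the ring buffer and is not lock-free, records are plain dicts, and `DropOldestPolicy` wraps the buffer instead of plugging into `on_overrun`. Folding an evicted marker's `dropped_count` back into the pending count follows the commit message rather than this spec.

```python
from collections import deque


class BoundedBuffer:
    """Hypothetical stand-in for the FdrClient ring buffer (not lock-free)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def try_put(self, record):
        if len(self.items) >= self.capacity:
            return False
        self.items.append(record)
        return True

    def pop_oldest(self):
        return self.items.popleft() if self.items else None


class DropOldestPolicy:
    def __init__(self, buffer, producer_id):
        self._buffer = buffer
        self._producer_id = producer_id
        self.pending_dropped = 0  # drops not yet reported by a marker

    def put(self, record):
        if self._buffer.try_put(record):
            self.flush_marker()  # end of a burst: report it if a slot is free
            return True
        evicted = self._buffer.pop_oldest()
        if evicted is not None:
            if evicted.get("kind") == "overrun":
                # Fold the evicted marker's count so loss info is not lost.
                self.pending_dropped += evicted["payload"]["dropped_count"]
            else:
                self.pending_dropped += 1
        if self._buffer.try_put(record):  # one retry only
            return True
        self.pending_dropped += 1  # the new record is lost as well
        return False

    def flush_marker(self):
        """Enqueue one coalesced overrun marker if drops are pending and a slot is free."""
        if self.pending_dropped == 0:
            return False
        marker = {
            "kind": "overrun",
            "payload": {
                "producer_id": self._producer_id,
                "dropped_count": self.pending_dropped,
            },
        }
        if not self._buffer.try_put(marker):
            return False
        self.pending_dropped = 0
        return True


buf = BoundedBuffer(4)
policy = DropOldestPolicy(buf, "c1_vio")
for seq in range(1, 6):  # the fifth put overruns and evicts record 1
    policy.put({"kind": "user", "seq": seq})
buf.pop_oldest()  # the consumer drains one slot
policy.flush_marker()
```

After this sequence the buffer holds records 3, 4, 5 followed by one marker with `dropped_count == 1`, so the consumer sees the fifth user record before the marker.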
### Excluded
- The buffer itself, the `on_overrun` hook plumbing, and the SPSC contract — owned by AZ-273.
- The `FdrRecord` schema and the `kind="overrun"` payload definition — owned by AZ-272.
- The C13 writer thread's behaviour upon receiving an `overrun` record (it just logs it like any other record) — owned by E-C13 (AZ-248).
- `FakeFdrSink` — owned by the next PBI in this epic.
## Acceptance Criteria
**AC-1: Drop-oldest produces canonical overrun record**
Given an FdrClient with capacity 4 wired with `default_overrun_policy`, fully buffered with 4 user records
When the producer calls `enqueue` for a 5th record
Then the consumer side observes (in order): the 5th user record, then a `kind="overrun"` record whose `payload.producer_id` matches the originating producer and `payload.dropped_count == 1`
**AC-2: Coalescing across a burst**
Given an FdrClient with capacity 4, fully buffered, and the consumer is stalled
When the producer calls `enqueue` 10 times in a row (8 of them overrun)
Then exactly ONE `kind="overrun"` record is emitted at the end of the burst with `payload.dropped_count == 8`
**AC-3: Overrun record carries originating producer_id**
Given an FdrClient(producer_id="c1_vio") wired with the default policy
When the buffer overruns
Then the emitted overrun record has `payload.producer_id == "c1_vio"` (NOT `"shared.fdr_client"` — the OUTER envelope's `producer_id` may be `"shared.fdr_client"` per the schema contract, but the payload identifies the originating producer)
**AC-4: Composition root wires every FdrClient**
Given a production process initialised via `compose_root(config)`
When the test inspects every constructed `FdrClient` in the resulting `RuntimeRoot`
Then every client has a non-None `on_overrun` set to a callable from `default_overrun_policy`
**AC-5: Retry-after-drop failure logs ERROR**
Given a contrived test that monkey-patches the buffer so retry-after-drop ALSO returns `OVERRUN` (simulating a frozen consumer mid-policy)
When an overrun is triggered
Then exactly one ERROR log record is emitted with `kind="fdr.overrun_retry_failed"`; the policy does not loop indefinitely; the overrun record is dropped (test asserts no overrun record on the buffer in this pathological case)
**AC-6: No log flood under sustained overruns**
Given an FdrClient under sustained overrun (1000 consecutive overruns)
When the policy runs
Then the shared logger receives at most 1 ERROR record per second related to overruns (rate cap on the diagnostic log; the FDR record itself is the canonical record of overruns)
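The one-per-second cap on the diagnostic log can be sketched with a monotonic-clock gate. `ErrorRateLimiter` is a hypothetical name, and the injectable `clock` parameter exists only so a test can drive time without sleeping.

```python
import time


class ErrorRateLimiter:
    """Allows at most one event per `min_interval_s` seconds."""

    def __init__(self, min_interval_s=1.0, clock=time.monotonic):
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_emit = None
        self.suppressed = 0

    def allow(self):
        now = self._clock()
        if self._last_emit is not None and now - self._last_emit < self._min_interval_s:
            self.suppressed += 1
            return False
        self._last_emit = now
        return True


fake_now = [0.0]
limiter = ErrorRateLimiter(clock=lambda: fake_now[0])
allowed = 0
for i in range(1000):  # 1000 overruns spread over just under one second
    fake_now[0] = i * 0.001
    if limiter.allow():
        allowed += 1
```

Only the first of the 1000 calls is allowed. The `suppressed` counter could be included in the next ERROR record so the log still reflects how many diagnostics were skipped.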
## Non-Functional Requirements
**Performance**
- Steady-state overhead: when `on_overrun` is set but the buffer is NOT full (so the hook is never invoked), `enqueue` overhead from this PBI's wiring is ≤ 0.5 µs (effectively a single null-check per call). The 5 µs `enqueue` p99 budget MUST still hold.
- Overrun path overhead: the drop-oldest + retry sequence completes within 20 µs p99 on Tier-2 (it runs only on the cold path; cold-path budget is generous).
**Reliability**
- The policy NEVER loops indefinitely on retry. One retry only; then ERROR-log + drop.
- The policy NEVER raises into the producer's `enqueue` caller. Any exception inside the closure is logged via `kind="fdr.overrun_policy_error"` and swallowed; the producer's hot path stays clean.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Capacity-4 buffer fully filled, then 5th enqueue with `default_overrun_policy` | Consumer sees 5th record + canonical overrun record (`dropped_count == 1`) |
| AC-2 | 10 consecutive enqueues in one burst (8 of them overrun) | Exactly one overrun record with `dropped_count == 8` |
| AC-3 | Overrun on FdrClient(producer_id="c1_vio") | Emitted overrun record `payload.producer_id == "c1_vio"` |
| AC-4 | Boot a stub composition root with 3 producers; inspect all FdrClients | Every client has `on_overrun != None` |
| AC-5 | Monkey-patched retry-after-drop also fails | Exactly one ERROR log; no overrun record on buffer; no infinite loop |
| AC-6 | 1000 consecutive overruns | Logger receives ≤ 1 ERROR/sec related to overruns |
| NFR-perf-steady | Microbench `enqueue` with hook set but not invoked | p99 overhead ≤ 0.5 µs vs unhooked |
| NFR-perf-overrun | Microbench drop-oldest + retry sequence | p99 ≤ 20 µs |
| NFR-reliability | Inject an exception into the closure; trigger overrun | Producer call returns normally; ERROR logged |
## Constraints
- The policy plugs into AZ-273's `on_overrun` hook ONLY — no other extension point. Behavioural deviation requires a new contract.
- Coalescing window is bounded by "until the next successful enqueue" — NOT by wall-clock time. Rationale: the buffer is the only synchronisation point; the writer thread drains it; once it drains one slot, the producer's next enqueue succeeds and that is the natural emission point for the overrun record.
- The overrun record's OUTER envelope `producer_id` is `"shared.fdr_client"` (per schema contract); the originating producer's slug is in `payload.producer_id`.
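The coalescing and envelope constraints above can be illustrated with a small counter that is flushed on the next successful enqueue. This is a sketch under assumptions: `OverrunCoalescer` is a hypothetical helper, the record is shown as a plain dict rather than an `FdrRecord`, and placing `dropped_count` inside `payload` is a guess at the schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class OverrunCoalescer:
    originating_producer_id: str
    pending_drops: int = 0

    def note_drop(self) -> None:
        """Record one dropped user record in the current burst."""
        self.pending_drops += 1

    def flush_on_success(self) -> Optional[Dict[str, Any]]:
        """Call after a successful enqueue; returns one record or None."""
        if self.pending_drops == 0:
            return None
        record = {
            # Outer envelope names the shared client, not the originator.
            "producer_id": "shared.fdr_client",
            "kind": "overrun",
            "payload": {
                "producer_id": self.originating_producer_id,
                "dropped_count": self.pending_drops,  # field placement assumed
            },
        }
        self.pending_drops = 0  # the burst ends here
        return record
```

No timer appears in the sketch: the burst boundary is defined purely by the call to `flush_on_success`, which mirrors the "until the next successful enqueue" rule.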
## Risks & Mitigation
**Risk 1: Overrun record itself causes another overrun**
- *Risk*: At the moment of overflow, enqueueing the synthesised overrun record might also fail.
- *Mitigation*: The drop-oldest sequence is "drop one → retry the user record → if successful, then enqueue the overrun record at the next slot the consumer drains". The overrun record is emitted at the END of the burst, on a slot known to be free. If the buffer is so degenerate that one drop is insufficient, the AC-5 ERROR-log path catches it.
**Risk 2: Coalescing hides individual overruns under steady degradation**
- *Risk*: A long-stalled consumer produces one `dropped_count=10000` record at flush time; tooling cannot reconstruct fine-grained timing.
- *Mitigation*: The coalescing scope is "consecutive overruns until next successful enqueue". As soon as the consumer drains one slot, the overrun record is emitted with the count up to that point. Tooling can correlate against the drained record's `ts` to reconstruct timing windows. Documented in the schema contract's invariants.
**Risk 3: Composition-root wiring drift**
- *Risk*: A future component constructs `FdrClient(...)` directly instead of using `make_fdr_client(...)`, ending up with `on_overrun = None` and silent drops in production.
- *Mitigation*: AC-4's contract test scans the constructed `RuntimeRoot` for any FdrClient with `on_overrun is None` and fails. Documented as a code-review Phase 2 (Spec Compliance) check tied to the fdr_client_protocol contract.
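The Risk 3 scan could look roughly like the sketch below. The real `RuntimeRoot` layout is not shown in this document, so the example assumes clients are stored as plain attributes on the root object; `StubFdrClient` and `find_unwired_clients` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class StubFdrClient:  # hypothetical stand-in for the real FdrClient
    producer_id: str
    on_overrun: Optional[Callable[..., Any]] = None


def find_unwired_clients(root: Any) -> List[str]:
    """Return producer_ids of clients on `root` whose on_overrun is None."""
    unwired = []
    for value in vars(root).values():  # assumes a flat attribute layout
        if isinstance(value, StubFdrClient) and value.on_overrun is None:
            unwired.append(value.producer_id)
    return unwired
```

A contract test would then assert that `find_unwired_clients(root)` is empty and print the offending producer ids on failure.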
## Runtime Completeness
- **Named capability**: drop-oldest + `kind="overrun"` record emission policy (architecture / E-CC-FDR-CLIENT / AC-NEW-3).
- **Production code that must exist**: real drop-oldest closure, real overrun-record synthesis, real composition-root wiring of every producer.
- **Allowed external stubs**: tests MAY replace `on_overrun` with a recording closure; production wiring MUST NOT.
- **Unacceptable substitutes**: `pass` as the hook ("for now we just log a warning"), in-memory counter without record emission ("we'll add the record later"), or relying on the C13 writer to synthesise overrun records (it cannot — only the producer side knows the burst boundary).
@@ -1,128 +0,0 @@
# FakeFdrSink for Component-Level Tests
**Task**: AZ-275_fake_fdr_sink
**Name**: FakeFdrSink
**Description**: An in-process, in-memory test double for `FdrClient` that conforms to the `fdr_client_protocol` contract's public surface and lets component-level tests assert on every record their code emits to the FDR. Drop-in replacement for `FdrClient` everywhere it is injected; no writer thread, no segment files, no real ring buffer — just a list-of-records the test inspects.
**Complexity**: 2 points
**Dependencies**: AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf
**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
**Tracker**: AZ-275
**Epic**: AZ-247 (E-CC-FDR-CLIENT)
### Document Dependencies
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — the public surface this fake conforms to.
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — the record envelope this fake stores in memory.
## Problem
Component-level tests (every component under `tests/unit/components/<name>/` and `tests/integration/<name>/`) must assert on what their code writes to the FDR. Without a fake:
- Tests would have to spin up the C13 writer thread + a tmp segment file just to read records back — slow, brittle, cross-component coupling.
- Tests would all reach into `FdrClient`'s private buffer state, freezing internal layout into every test and blocking future implementation changes.
A simple, contract-conforming `FakeFdrSink` lets each component's test assert on records via a stable public API — and crucially, the same API every other component test uses, so test infrastructure does not fork per component.
## Outcome
- Tests inject `FakeFdrSink(producer_id="c1_vio")` wherever production code expects an `FdrClient`. The component code is unchanged; the test reads `sink.records` after exercising the component.
- Every assertion the contract test of `fdr_client_protocol` makes against a real `FdrClient` ALSO holds against `FakeFdrSink` — except the lock-free / allocation-free / SPSC-guard NFRs (those are real-buffer concerns and are explicitly out of scope for the fake).
- Tests can opt in to drop-oldest semantics (`FakeFdrSink(capacity=N, with_default_overrun_policy=True)`) when verifying overrun behaviour, or leave it disabled and rely on unbounded list mode for general assertions.
## Scope
### Included
- `FakeFdrSink(producer_id: str, capacity: int | None = None, with_default_overrun_policy: bool = False)` constructor implementing the `FdrClient` public surface from `fdr_client_protocol.md`:
- `enqueue`, `pop_one`, `drain`, `flush`, `producer_id`, `on_overrun` getter/setter.
- A `FakeFdrSink.records: list[FdrRecord]` property returning the records currently in-buffer in FIFO order. Tests use this directly for assertions.
- A `FakeFdrSink.all_records_ever: list[FdrRecord]` property returning every record ever enqueued, INCLUDING records dropped by the overrun policy when it is active. Lets tests assert on what the producer TRIED to send vs. what the buffer KEPT.
- Behaviour parity with `FdrClient` for the contract-relevant subset:
- Returns `EnqueueResult.OVERRUN` when `capacity` is set and the buffer is full.
- Invokes `on_overrun` exactly once per overrun event when wired.
- Stamps `producer_id` correctly per the protocol (does NOT mutate `record.producer_id`).
- A pytest fixture (`fake_fdr_sink`) under `tests/conftest.py` that constructs a default-configuration sink and yields it to tests.
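The surface listed above can be pictured with a minimal sketch. It is not the shipped `FakeFdrSink`: the class name `FakeFdrSinkSketch`, the `EnqueueResult.OK` member, the one-argument `on_overrun` signature, and the use of untyped records are all assumptions, and the `with_default_overrun_policy` option is left out because it depends on the AZ-274 closure.

```python
from collections import deque
from enum import Enum
from typing import Any, Callable, Deque, List, Optional


class EnqueueResult(Enum):
    OK = "ok"            # member name assumed for this sketch
    OVERRUN = "overrun"  # named in the spec


class FakeFdrSinkSketch:
    def __init__(self, producer_id: str, capacity: Optional[int] = None) -> None:
        self.producer_id = producer_id
        self.on_overrun: Optional[Callable[[Any], None]] = None
        self._capacity = capacity  # None means unbounded list mode
        self._buffer: Deque[Any] = deque()
        self._ever: List[Any] = []

    def enqueue(self, record: Any) -> EnqueueResult:
        self._ever.append(record)  # remembered even if it never enters the buffer
        if self._capacity is not None and len(self._buffer) >= self._capacity:
            if self.on_overrun is not None:
                self.on_overrun(record)  # invoked once per overrun event
            return EnqueueResult.OVERRUN
        self._buffer.append(record)  # stored as-is; producer_id is not rewritten
        return EnqueueResult.OK

    def pop_one(self) -> Optional[Any]:
        return self._buffer.popleft() if self._buffer else None

    def drain(self) -> List[Any]:
        drained = list(self._buffer)
        self._buffer.clear()
        return drained

    def flush(self) -> None:
        """No writer thread to wait on, so this is a no-op in the sketch."""

    @property
    def records(self) -> List[Any]:
        return list(self._buffer)  # copy, FIFO order

    @property
    def all_records_ever(self) -> List[Any]:
        return list(self._ever)
```

Both properties return copies, so a test that mutates the returned list cannot corrupt the sink's state.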
### Excluded
- The lock-free SPSC ring buffer, allocation-free hot path, and SPSC guards — owned by AZ-273 (this is a fake; real concurrency primitives are explicitly NOT replicated).
- The drop-oldest closure itself — owned by AZ-274; the fake imports and reuses it when the user opts in via `with_default_overrun_policy=True`.
- The `FdrRecord` schema — owned by AZ-272.
- The C13 writer thread, segment files, etc. — owned by E-C13 (AZ-248).
- A "fake C13 writer" that drains the sink — out of scope. Tests that need the drained side use `pop_one` / `drain` directly on the fake.
## Acceptance Criteria
**AC-1: Drop-in for FdrClient public surface**
Given any production code that takes an `FdrClient` parameter (e.g. `Vio(fdr=fdr_client, ...)`)
When the test passes a `FakeFdrSink` instead
Then the production code's calls (`enqueue`, `flush`) work identically; no AttributeError, no signature mismatch
**AC-2: records reflects in-buffer state**
Given a `FakeFdrSink` with no capacity limit
When the producer enqueues 3 records, then the test calls `pop_one()` once
Then `sink.records` returns the 2 remaining records in FIFO order
**AC-3: all_records_ever captures dropped records**
Given a `FakeFdrSink(capacity=2, with_default_overrun_policy=True)` filled to capacity
When the producer enqueues a 3rd record (drop-oldest fires)
Then `sink.records` has 2 entries (newest 2) AND `sink.all_records_ever` has 3 entries (all of them, including the dropped one)
**AC-4: Overrun policy parity with real FdrClient**
Given a `FakeFdrSink(capacity=4, with_default_overrun_policy=True)`
When the test reproduces AC-1 and AC-2 from AZ-274 (overflow + canonical overrun record, and burst coalescing)
Then the same assertion that holds against real `FdrClient` holds against `FakeFdrSink` — same overrun record shape, same coalescing across bursts
**AC-5: pytest fixture available**
Given a test file imports the standard project conftest
When the test signature is `def test_x(fake_fdr_sink): ...`
Then pytest injects a default-configuration `FakeFdrSink` and yields it; teardown clears the sink
**AC-6: producer_id is preserved**
Given `FakeFdrSink(producer_id="c2_vpr")` and an enqueued record carrying `producer_id="c2_vpr"`
When the test inspects `sink.records[0]`
Then `records[0].producer_id == "c2_vpr"` (the fake does NOT rewrite producer_id)
## Non-Functional Requirements
**Performance**
- `enqueue` p99 ≤ 100 µs on Tier-2 (developer machines + CI). The fake is not in the production critical path; the budget exists only to keep tests fast (at 100 µs each, 10k enqueues in a long fixture add at most about 1 s).
**Reliability**
- The fake is single-threaded only. Concurrent `enqueue` / `pop_one` is undefined behaviour and not tested. Documented in the docstring.
**Compatibility**
- The fake's public surface mirrors the `fdr_client_protocol.md` contract version it conforms to. The fake's docstring records the contract version. Bumping the protocol contract major version requires bumping the fake's surface in lock-step.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Inject `FakeFdrSink` into a stub component that expects `FdrClient` | No AttributeError; calls succeed |
| AC-2 | 3 enqueues + 1 pop on unbounded sink | `len(sink.records) == 2` in FIFO order |
| AC-3 | Capacity-2 sink with overrun policy + 3 enqueues | `len(sink.records) == 2`, `len(sink.all_records_ever) == 3` |
| AC-4 | Re-run AZ-274 AC-1 + AC-2 against the fake | Same overrun record shape; same coalescing |
| AC-5 | A trivial test using `fake_fdr_sink` fixture | Fixture provides a clean sink per test |
| AC-6 | Construct sink + enqueue with explicit producer_id | producer_id preserved on the popped record |
## Constraints
- Public surface is fixed by `fdr_client_protocol.md` v1.0.0. The fake is allowed to expose ADDITIONAL test-only attributes (`records`, `all_records_ever`) — these are documented as fake-only and never appear on the real `FdrClient` (so production code accidentally using them fails the type checker).
- The fake lives at `src/gps_denied_onboard/fdr_client/fakes.py` — a separate module from the production code so production imports never pick it up. Tests import `from gps_denied_onboard.fdr_client.fakes import FakeFdrSink`.
- The fake reuses `default_overrun_policy` from AZ-274 verbatim; it does NOT re-implement the policy.
## Risks & Mitigation
**Risk 1: Fake drift from real client**
- *Risk*: Engineers add a method to `FdrClient` and forget to mirror it on `FakeFdrSink`; tests pass against the fake but production fails.
- *Mitigation*: A contract test (`tests/contract/fdr_client_fake_parity.py`) iterates over every public method on `FdrClient` and asserts the same method exists on `FakeFdrSink` with a compatible signature. Failure mode is loud and immediate.
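One way to express the parity check from Risk 1 is with the standard `inspect` module. The sketch below is an assumption about how such a test might work, not the contents of `fdr_client_fake_parity.py`; `missing_or_mismatched` is a hypothetical helper, and it uses strict signature equality, which is stricter than the "compatible signature" wording above.

```python
import inspect
from typing import List


def missing_or_mismatched(real: type, fake: type) -> List[str]:
    """Names of public callables on `real` that `fake` lacks or declares differently."""
    problems = []
    for name, member in inspect.getmembers(real, callable):
        if name.startswith("_"):
            continue  # private and dunder members are outside the contract
        other = getattr(fake, name, None)
        if other is None or not callable(other):
            problems.append(name)
        elif inspect.signature(member) != inspect.signature(other):
            problems.append(name)
    return problems
```

Properties such as `producer_id` are not callables and are skipped by this sketch; a fuller check would compare them separately. Extra test-only attributes on the fake are ignored because the loop iterates over the real class only.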
**Risk 2: Tests reach into `_buffer` private state, freezing implementation**
- *Risk*: A test does `sink._buffer[3]` instead of `sink.records[3]`; later refactor breaks the test.
- *Mitigation*: `records` and `all_records_ever` are the documented public access; the `_` prefix marks `_buffer` as private by convention (pyright can report such access through its `reportPrivateUsage` check); code review catches private-state access.
## Runtime Completeness
- **Named capability**: `FakeFdrSink` test double — it is NOT a runtime capability; it is test infrastructure. Production code MUST NOT import from `fakes.py` (verified by import-linter rule in the project's `pyproject.toml`).
- **Production code that must exist**: import-linter rule preventing `src/gps_denied_onboard/**/*.py` (excluding `tests/`) from importing `gps_denied_onboard.fdr_client.fakes`. Otherwise none — this PBI's deliverable is test infrastructure.
- **Allowed external stubs**: this IS the stub. It is allowed in tests only.
- **Unacceptable substitutes**: production code wiring `FakeFdrSink` instead of `FdrClient` (would silently disable real FDR writes); per-test ad-hoc fakes that drift from the contract.
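The "production code MUST NOT import from `fakes.py`" rule can also be checked with a small AST scan. The sketch below is an alternative illustration using only the standard library, not an import-linter configuration; `forbidden_imports` is a hypothetical helper, and relative imports (`from . import fakes`) are deliberately not handled.

```python
import ast
from typing import List

FORBIDDEN = "gps_denied_onboard.fdr_client.fakes"
PARENT, LEAF = FORBIDDEN.rsplit(".", 1)


def forbidden_imports(source: str) -> List[int]:
    """Return line numbers of statements that import the fakes module."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import gps_denied_onboard.fdr_client.fakes[.something]
            if any(a.name == FORBIDDEN or a.name.startswith(FORBIDDEN + ".")
                   for a in node.names):
                hits.append(node.lineno)
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            module = node.module or ""
            # from gps_denied_onboard.fdr_client import fakes
            via_parent = module == PARENT and any(a.name == LEAF for a in node.names)
            if module == FORBIDDEN or module.startswith(FORBIDDEN + ".") or via_parent:
                hits.append(node.lineno)
    return sorted(hits)
```

A lint test would run this over every production source file and fail on any non-empty result, skipping `fakes.py` itself and everything under `tests/`.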