# FdrClient Lock-Free SPSC Ring Buffer + Public API

**Task**: AZ-273_fdr_client_ringbuf
**Name**: FdrClient Ring Buffer
**Description**: Implement the producer-side `FdrClient(producer_id)` and its lock-free single-producer / single-consumer (SPSC) ring buffer. `enqueue` is non-blocking even when the C13 writer thread is stalled. Capacity is configurable per producer via the cross-cutting Config block. The buffer exposes a hook the overrun-policy task (next PBI) plugs into; this task does NOT implement the drop-oldest emission itself.
**Complexity**: 5 points
**Dependencies**: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-269_config_loader, AZ-266_log_module
**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
**Tracker**: AZ-273
**Epic**: AZ-247 (E-CC-FDR-CLIENT)

### Document Dependencies

- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — the record envelope this client enqueues.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — the Config object that carries this client's capacity setting.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — diagnostic logs emitted by this client (NOT on the steady-state hot path).

## Problem

Every onboard component needs to publish FDR records in real time without blocking on the writer thread, the disk, or any other producer. AC-NEW-3 ("no silent drops") and the steady-state `enqueue` p99 ≤ 5 µs budget rule out:

- Any lock-acquiring queue (Python `queue.Queue`, `threading.Lock`-protected list, asyncio queue).
- Any allocation on the steady-state path (no `dict.copy()`, no `list.append` that may resize, no `dataclasses.replace`).
- Any blocking I/O.

Without a shared, contract-frozen client, every component would re-implement its own queue, drift on overrun semantics, and break the AC-NEW-3 guarantee within weeks of parallel development.

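
The allocation constraint listed above can be illustrated with CPython's own list behaviour. The following is a minimal sketch, assuming CPython (where `sys.getsizeof` on a list includes its backing array); `CAPACITY` and the variable names are illustrative and not taken from this spec. It contrasts overwriting pre-sized slots with growing a list through `append`.

```python
import sys

CAPACITY = 8  # illustrative size only, not a value from this spec

# Pre-sized slot storage: allocated once, then slots are overwritten in place.
slots = [None] * CAPACITY
size_before = sys.getsizeof(slots)
for i in range(1000):
    slots[i % CAPACITY] = i  # reuses an existing slot; the list never grows
presized_stable = sys.getsizeof(slots) == size_before

# Growing storage: append periodically reallocates the backing array.
grown = []
sizes_seen = {sys.getsizeof(grown)}
for i in range(1000):
    grown.append(i)
    sizes_seen.add(sys.getsizeof(grown))
append_resized = len(sizes_seen) > 1
```

The pre-sized list keeps the same footprint for its whole lifetime, while the appended list changes size several times. This only shows the container's behaviour; whether a full `enqueue` call is allocation-free still has to be verified as AC-2 describes.
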
## Outcome

- A single `FdrClient(producer_id)` is the only handle any onboard producer ever holds; constructed by the composition root and injected into each component.
- `enqueue` p99 ≤ 5 µs on Tier-2 with no allocation on the steady-state path (pre-sized buffers; reused slots).
- `enqueue` NEVER blocks, regardless of writer-thread state. When the buffer is full, control returns to the caller in O(1); the overrun policy (drop-oldest + emit `kind="overrun"`) is implemented by the next PBI via the buffer's documented hook.
- The dequeue side (`pop_one` / iterator) is consumed exclusively by the C13 writer thread; the contract documents it as SPSC — multi-consumer is undefined behaviour and rejected by the contract test.

## Scope

### Included

- `FdrClient(producer_id: str, capacity: int)` constructor + module-level `make_fdr_client(producer_id, config) -> FdrClient` factory that reads capacity from the cross-cutting `config.fdr_client..capacity` block (with documented default).
- `FdrClient.enqueue(record: FdrRecord) -> EnqueueResult` — lock-free, non-blocking, allocation-free on the steady-state path. Returns `EnqueueResult.OK` or `EnqueueResult.OVERRUN` (the next PBI consumes `OVERRUN`).
- A documented `on_overrun: Callable[[FdrRecord], None] | None` hook the overrun-policy PBI populates with the drop-oldest + record-emit closure.
- Single-consumer dequeue API for the C13 writer: `pop_one() -> FdrRecord | None` and `drain(max_records: int) -> list[FdrRecord]`.
- `flush() -> None` test-only method that blocks until the buffer is empty (used by `FakeFdrSink` and contract tests; production callers MUST NOT call this on the hot path).
- Diagnostic INFO log on construction (one-time, NOT on the steady-state hot path) via the shared logger.
- Public interface contract published at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`.

### Excluded

- The drop-oldest behaviour and the `kind="overrun"` record emission — owned by the next PBI in this epic.
- The C13 writer thread itself, segment files, segment rotation, 64 GB cap — owned by E-C13 (AZ-248).
- The `FakeFdrSink` for tests — owned by the fourth PBI in this epic.
- Multi-producer / multi-consumer ring buffer — out of scope; the contract is SPSC.
- The actual `FdrRecord` schema and serialiser — owned by AZ-272.

## Acceptance Criteria

**AC-1: Lock-free, never blocks**
Given an FdrClient with capacity 1024 and a writer thread that is stalled (does not dequeue)
When the producer calls `enqueue(record)` 1025 times in rapid succession
Then every call returns within 50 µs (no thread state ever transitions to BLOCKED), and the 1025th call returns `EnqueueResult.OVERRUN`

**AC-2: Allocation-free steady-state**
Given an FdrClient warmed up with one prior `enqueue`
When the producer calls `enqueue(record)` for an in-buffer record (slot is free)
Then the call performs zero heap allocations (verified via `tracemalloc` snapshot diff: 0 new objects on the hot path)

**AC-3: Capacity is config-driven**
Given the cross-cutting Config block sets `config.fdr_client..capacity = 4096`
When `make_fdr_client(producer_id, config)` runs
Then the returned client's internal buffer length is 4096 (verified via the test-only `_capacity()` introspection method)

**AC-4: SPSC dequeue contract**
Given two threads concurrently call `pop_one()`
When both calls race
Then the contract test detects undefined behaviour (asserted via a contract test that wraps `pop_one` in a guard which raises `FdrSpscViolationError` on concurrent entry — the guard is opt-in for tests but documents the SPSC invariant)

**AC-5: Overrun hook is wired**
Given an `FdrClient` with `on_overrun` set to a recording closure
When the buffer fills and the next `enqueue` would overrun
Then `on_overrun` is invoked exactly once per overrun event with the would-be-enqueued record (the closure decides what to do — drop-oldest + emit, log only, etc.; this PBI does NOT define that behaviour)

**AC-6: flush() drains buffer**
Given an FdrClient with N records buffered and a consumer thread draining
When the test calls `flush()`
Then `flush()` returns only after `pop_one()` has been called N times (no records left in the buffer)

**AC-7: producer_id is non-empty and stamped on every record**
Given a constructor call `FdrClient(producer_id="")` (empty string)
When construction runs
Then `ValueError` is raised — anonymous producers are forbidden

## Non-Functional Requirements

**Performance**

- `enqueue` p99 ≤ 5 µs on Tier-2 (Jetson Orin Nano Super) for a record carrying a `payload` dict of ≤ 16 scalar entries. Validated by a microbenchmark (10k iterations, warm cache).
- `pop_one` p99 ≤ 10 µs on Tier-2 under steady-state.
- Memory: per-producer ring buffer ≤ `capacity * sizeof(slot)` bytes; no unbounded growth. Pre-sized at construction.

**Reliability**

- `enqueue` never raises into the caller. Schema violations from `FdrRecord` are caught and forwarded to the same `on_overrun` hook with a synthetic flag (the overrun-policy PBI decides what to do); the producer's hot path stays clean.
- Multiple `make_fdr_client(producer_id, config)` calls with the same `producer_id` return the same cached instance — there is exactly one FdrClient per producer_id per process.

**Concurrency**

- SPSC: ONE producer thread MAY call `enqueue`, ONE consumer thread MAY call `pop_one` / `drain`. Multi-producer or multi-consumer use is undefined behaviour and detected by the contract guard (AC-4).

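
The enqueue / dequeue behaviour described in the criteria and requirements above can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the implementation this PBI must deliver. The class name `SpscRingSketch`, the stand-in `FdrRecord` fields, and the `EnqueueResult` member values are hypothetical. The sketch relies on the CPython GIL for ordering between the two threads, and it does not claim to meet the 5 µs budget or AC-2 (its monotonically growing counters create new `int` objects).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List, Optional


class EnqueueResult(Enum):
    OK = "ok"            # member values are assumptions
    OVERRUN = "overrun"


@dataclass(frozen=True)
class FdrRecord:
    """Hypothetical stand-in; the real schema is owned by AZ-272."""
    producer_id: str
    kind: str
    payload: Any = None


class SpscRingSketch:
    """Illustrative ring for ONE producer thread and ONE consumer thread."""

    def __init__(self, capacity: int,
                 on_overrun: Optional[Callable[[FdrRecord], None]] = None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        # Pre-sized once at construction; slots are reused afterwards.
        self._slots: List[Optional[FdrRecord]] = [None] * capacity
        self._head = 0  # records popped so far; written only by the consumer
        self._tail = 0  # records enqueued so far; written only by the producer
        self.on_overrun = on_overrun

    def enqueue(self, record: FdrRecord) -> EnqueueResult:
        tail = self._tail
        # A stale (smaller) head only makes this check more conservative.
        if tail - self._head >= self._capacity:
            hook = self.on_overrun
            if hook is not None:
                hook(record)  # the hook decides what an overrun means
            return EnqueueResult.OVERRUN
        self._slots[tail % self._capacity] = record
        self._tail = tail + 1  # publish only after the slot is written
        return EnqueueResult.OK

    def pop_one(self) -> Optional[FdrRecord]:
        head = self._head
        if head == self._tail:
            return None  # empty, or the producer has not published yet
        index = head % self._capacity
        record = self._slots[index]
        self._slots[index] = None  # release the reference held by the slot
        self._head = head + 1      # hand the slot back to the producer
        return record
```

Each index is written by exactly one side, so neither method takes a lock or waits: a full buffer is reported immediately as `OVERRUN`, and an empty one as `None`. A `cffi` or Cython-backed ring could keep the same method shapes while replacing the storage and counters.
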
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Stalled consumer + 1025 enqueues into a 1024-capacity client | Every call returns within 50 µs; #1025 returns `OVERRUN` |
| AC-2 | `tracemalloc` snapshot diff across one `enqueue` after warmup | Zero new objects allocated |
| AC-3 | `make_fdr_client("c1_vio", config_with_capacity_4096)` | `client._capacity() == 4096` |
| AC-4 | Two threads call `pop_one()` concurrently with the SPSC guard enabled | `FdrSpscViolationError` raised |
| AC-5 | Wire a recording `on_overrun`; force overrun | Closure invoked exactly once with the offending record |
| AC-6 | Enqueue N records, start a draining consumer, call `flush()` | `flush()` returns only after buffer is empty |
| AC-7 | `FdrClient(producer_id="")` | `ValueError` |
| NFR-perf | Microbench `enqueue` over 10k iterations on Tier-2 | p99 ≤ 5 µs |
| NFR-perf-pop | Microbench `pop_one` over 10k iterations | p99 ≤ 10 µs |
| NFR-reliability | Two `make_fdr_client("c1_vio", config)` calls | same instance returned |

## Constraints

- Public surface frozen by `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` v1.0.0.
- SPSC only — multi-producer / multi-consumer is out of scope and the contract test asserts the SPSC guard exists.
- The lock-free implementation MAY use `multiprocessing.shared_memory`, `cffi`-backed atomics, a Cython extension, or pure Python with `array.array` + a single CAS-like primitive — implementation choice is internal to this PBI but MUST satisfy the allocation-free + non-blocking ACs above. Prefer the simplest working option that hits the budget; document the choice in the implementation report.
- No new dependency beyond what AZ-263 / E-BOOT pinned.

## Risks & Mitigation

**Risk 1: Pure-Python SPSC ring cannot hit the 5 µs p99 budget on Tier-2**

- *Risk*: CPython's GIL + dict operations push p99 above 5 µs on the Jetson.
- *Mitigation*: Bench against a `cffi` or Cython-backed SPSC ring as a fallback; the contract is library-agnostic so the implementation can swap without breaking consumers. Decision is taken inside this PBI's implementation phase with the microbench as the oracle.

**Risk 2: Overrun hook called with record that holds a reference to caller-mutable state**

- *Risk*: Producer mutates `record.payload` after `enqueue`; the overrun closure sees the mutated value.
- *Mitigation*: `FdrRecord` is `@frozen` (per AZ-272 contract); the contract test verifies a producer cannot legally mutate a constructed record. Documented in the contract `Invariants`.

**Risk 3: Cached FdrClient leaks across test cases**

- *Risk*: A pytest test mutates the module-level cache; subsequent tests get a stale FdrClient.
- *Mitigation*: A `_reset_for_tests()` private function (documented as test-only in the contract `Non-Goals`) clears the cache; integration test fixture calls it on teardown.

## Runtime Completeness

- **Named capability**: lock-free SPSC ring buffer + `FdrClient` public API (architecture / E-CC-FDR-CLIENT / AC-NEW-3, NFR `enqueue` p99 ≤ 5 µs).
- **Production code that must exist**: real lock-free SPSC primitive (no Python `queue.Queue`, no lock-acquiring fallback); real allocation-free hot path; real `on_overrun` hook plumbing.
- **Allowed external stubs**: none — the queue is the production runtime capability.
- **Unacceptable substitutes**: `queue.Queue`, `threading.Lock`-guarded list, `collections.deque` with a lock, "for now we just `time.sleep(0)` on overrun", or any implementation that allocates on the steady-state path. These would all silently break AC-NEW-3 the moment the writer thread stalls for >100 ms.

## Contract

This task produces the contract at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`. Consumers MUST read that file — not this task spec — to discover the interface.