[AZ-273] [AZ-274] [AZ-275] [AZ-267] [AZ-268] FDR producer chain + log bridge + contract test

AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-
two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError
on the public surface. make_fdr_client caches one client per producer_id
and reads capacity from config.fdr.per_producer_capacity with fallback
to queue_size.
AZ-274: default_overrun_policy implements drop-oldest + retry + immediate
marker emission. When _evict_one evicts an earlier overrun marker, its
dropped_count is folded into the new marker, so the count of lost user
records survives across retry iterations. ERROR diagnostic is
rate-limited to <=1/sec per producer.
AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the
production default_overrun_policy via a duck-typed _PolicyAdapter. The
test-only records/all_records_ever properties let component tests assert
both in-buffer and lifetime state. tests/conftest.py registers the
fake_fdr_sink fixture and an AST architecture lint forbids production
imports of fakes.
AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge
and forwards only WARN+ERROR records into the FDR with kind="log".
Thread-local recursion guard short-circuits internal logging; saturated-
queue diagnostics go to stderr every N=1000 drops.
AZ-268: tests/contract/log_schema.py covers every row of the schema's
Test Cases table plus the "DEBUG+INFO never reach FDR" invariant.
pyproject.toml registers the contract pytest marker and the
contract-mandated log_schema.py file-name.
251 unit + contract tests pass (48 new). Review verdict:
PASS_WITH_WARNINGS; findings are NFR-perf deferrals + documented
relaxation of AZ-274 AC-2 coalescing under permanently-stalled consumer.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-11 03:00:49 +03:00
parent 3acc7f33dd
commit ba20c2d195
24 changed files with 2714 additions and 20 deletions
@@ -0,0 +1,151 @@
# FdrClient Lock-Free SPSC Ring Buffer + Public API
**Task**: AZ-273_fdr_client_ringbuf
**Name**: FdrClient Ring Buffer
**Description**: Implement the producer-side `FdrClient(producer_id)` and its lock-free single-producer / single-consumer (SPSC) ring buffer. `enqueue` is non-blocking even when the C13 writer thread is stalled. Capacity is configurable per producer via the cross-cutting Config block. The buffer exposes a hook the overrun-policy task (next PBI) plugs into; this task does NOT implement the drop-oldest emission itself.
**Complexity**: 5 points
**Dependencies**: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-269_config_loader, AZ-266_log_module
**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
**Tracker**: AZ-273
**Epic**: AZ-247 (E-CC-FDR-CLIENT)
### Document Dependencies
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — the record envelope this client enqueues.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — the Config object that carries this client's capacity setting.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — diagnostic logs emitted by this client (NOT on the steady-state hot path).
## Problem
Every onboard component needs to publish FDR records in real time without blocking on the writer thread, the disk, or any other producer. AC-NEW-3 ("no silent drops") and the steady-state `enqueue` p99 ≤ 5 µs budget rule out:
- Any lock-acquiring queue (Python `queue.Queue`, `threading.Lock`-protected list, asyncio queue).
- Any allocation on the steady-state path (no `dict.copy()`, no `list.append` that may resize, no `dataclasses.replace`).
- Any blocking I/O.
Without a shared, contract-frozen client, every component would re-implement its own queue, drift on overrun semantics, and break the AC-NEW-3 guarantee within weeks of parallel development.
## Outcome
- A single `FdrClient(producer_id)` is the only handle any onboard producer ever holds; constructed by the composition root and injected into each component.
- `enqueue` p99 ≤ 5 µs on Tier-2 with no allocation on the steady-state path (pre-sized buffers; reused slots).
- `enqueue` NEVER blocks, regardless of writer-thread state. When the buffer is full, control returns to the caller in O(1); the overrun policy (drop-oldest + emit `kind="overrun"`) is implemented by the next PBI via the buffer's documented hook.
- The dequeue side (`pop_one` / iterator) is consumed exclusively by the C13 writer thread; the contract documents it as SPSC — multi-consumer is undefined behaviour and rejected by the contract test.
## Scope
### Included
- `FdrClient(producer_id: str, capacity: int)` constructor + module-level `make_fdr_client(producer_id, config) -> FdrClient` factory that reads capacity from the cross-cutting `config.fdr_client.<producer_id>.capacity` block (with documented default).
- `FdrClient.enqueue(record: FdrRecord) -> EnqueueResult` — lock-free, non-blocking, allocation-free on the steady-state path. Returns `EnqueueResult.OK` or `EnqueueResult.OVERRUN` (the next PBI consumes `OVERRUN`).
- A documented `on_overrun: Callable[[FdrRecord], None] | None` hook the overrun-policy PBI populates with the drop-oldest + record-emit closure.
- Single-consumer dequeue API for the C13 writer: `pop_one() -> FdrRecord | None` and `drain(max_records: int) -> list[FdrRecord]`.
- `flush() -> None` test-only method that blocks until the buffer is empty (used by `FakeFdrSink` and contract tests; production callers MUST NOT call this on the hot path).
- Diagnostic INFO log on construction (one-time, NOT on the steady-state hot path) via the shared logger.
- Public interface contract published at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`.
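
To make the surface listed above concrete, here is a minimal pure-Python sketch of a power-of-two SPSC ring. `SketchSpscRing` and its internals are hypothetical and are not the shipped `FdrClient`; only the names `EnqueueResult`, `enqueue`, `pop_one`, `drain`, and `on_overrun` come from this spec. The sketch relies on each index having exactly one writer (producer owns the tail, consumer owns the head) and makes no claim about meeting the allocation or latency budgets.

```python
from enum import Enum
from typing import Any, Callable, List, Optional


class EnqueueResult(Enum):
    OK = "ok"
    OVERRUN = "overrun"


class SketchSpscRing:
    """Illustrative SPSC ring: one producer calls enqueue, one consumer pops."""

    def __init__(self, capacity: int,
                 on_overrun: Optional[Callable[[Any], None]] = None) -> None:
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a positive power of two")
        self._slots: List[Any] = [None] * capacity  # pre-allocated, reused
        self._mask = capacity - 1
        self._head = 0  # advanced only by the consumer
        self._tail = 0  # advanced only by the producer
        self.on_overrun = on_overrun

    def enqueue(self, record: Any) -> EnqueueResult:
        tail = self._tail
        if tail - self._head > self._mask:  # tail - head == capacity: full
            if self.on_overrun is not None:
                self.on_overrun(record)  # hand the would-be record to the hook
            return EnqueueResult.OVERRUN
        self._slots[tail & self._mask] = record
        self._tail = tail + 1  # publish only after the slot is written
        return EnqueueResult.OK

    def pop_one(self) -> Any:
        head = self._head
        if head == self._tail:
            return None  # empty
        index = head & self._mask
        record = self._slots[index]
        self._slots[index] = None  # drop the reference so the slot is reusable
        self._head = head + 1
        return record

    def drain(self, max_records: int) -> List[Any]:
        out: List[Any] = []
        while len(out) < max_records:
            record = self.pop_one()
            if record is None:
                break
            out.append(record)
        return out
```

Because `None` doubles as the "empty" signal, this sketch assumes records are never `None`. The mask replaces a modulo on every index computation, which is the usual reason for requiring a power-of-two capacity.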
### Excluded
- The drop-oldest behaviour and the `kind="overrun"` record emission — owned by the next PBI in this epic.
- The C13 writer thread itself, segment files, segment rotation, 64 GB cap — owned by E-C13 (AZ-248).
- The `FakeFdrSink` for tests — owned by the fourth PBI in this epic.
- Multi-producer / multi-consumer ring buffer — out of scope; the contract is SPSC.
- The actual `FdrRecord` schema and serialiser — owned by AZ-272.
## Acceptance Criteria
**AC-1: Lock-free, never blocks**
Given an FdrClient with capacity 1024 and a writer thread that is stalled (does not dequeue)
When the producer calls `enqueue(record)` 1025 times in rapid succession
Then every call returns within 50 µs (no thread state ever transitions to BLOCKED), and the 1025th call returns `EnqueueResult.OVERRUN`
**AC-2: Allocation-free steady-state**
Given an FdrClient warmed up with one prior `enqueue`
When the producer calls `enqueue(record)` for an in-buffer record (slot is free)
Then the call performs zero heap allocations (verified via `tracemalloc` snapshot diff: 0 new objects on the hot path)
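
One way the `tracemalloc` check in AC-2 might be structured is sketched below. The helper name `net_new_blocks` is hypothetical. Note that a snapshot diff only sees blocks still alive at the second snapshot, so it catches retained allocations rather than short-lived temporaries, and a real test may need to tolerate measurement noise from the harness itself.

```python
import itertools
import tracemalloc


def net_new_blocks(fn, repeat=1000):
    """Return the net change in live memory blocks across `repeat` calls to fn."""
    fn()  # warm-up call, mirroring the "one prior enqueue" precondition
    loop = itertools.repeat(None, repeat)  # built before tracing starts
    # Exclude blocks allocated by tracemalloc's own bookkeeping.
    filters = [tracemalloc.Filter(False, tracemalloc.__file__)]
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for _ in loop:
            fn()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    diff = after.filter_traces(filters).compare_to(
        before.filter_traces(filters), "lineno"
    )
    return sum(stat.count_diff for stat in diff)
```

A test would then call something like `net_new_blocks(lambda: client.enqueue(record))` with a pre-built `record` and compare the result against the AC-2 expectation.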
**AC-3: Capacity is config-driven**
Given the cross-cutting Config block sets `config.fdr_client.<producer_id>.capacity = 4096`
When `make_fdr_client(producer_id, config)` runs
Then the returned client's internal buffer length is 4096 (verified via the test-only `_capacity()` introspection method)
**AC-4: SPSC dequeue contract**
Given two threads concurrently call `pop_one()`
When both calls race
Then the contract test detects the violation: with the opt-in guard wrapped around `pop_one`, concurrent entry raises `FdrSpscViolationError`. The guard is enabled only in tests; its existence documents the SPSC invariant.
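
A guard of the kind AC-4 describes could look roughly like the sketch below. `SpscGuard` is a hypothetical name; only `FdrSpscViolationError` comes from this spec. It uses a non-blocking lock acquire, which is acceptable here only because the guard is opt-in for tests and never waits.

```python
import threading


class FdrSpscViolationError(RuntimeError):
    """Raised when a second caller enters a single-caller section."""


class SpscGuard:
    """Test-only context manager that rejects concurrent entry."""

    def __init__(self, role: str) -> None:
        self._role = role
        self._busy = threading.Lock()

    def __enter__(self) -> "SpscGuard":
        # Never block: a failed acquire means someone else is already inside.
        if not self._busy.acquire(blocking=False):
            raise FdrSpscViolationError(f"concurrent {self._role} entry")
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self._busy.release()
        return False  # do not swallow exceptions from the guarded body
```

A test-mode `pop_one` would wrap its body in `with guard:`; the production path would skip the guard entirely.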
**AC-5: Overrun hook is wired**
Given an `FdrClient` with `on_overrun` set to a recording closure
When the buffer fills and the next `enqueue` would overrun
Then `on_overrun` is invoked exactly once per overrun event with the would-be-enqueued record (the closure decides what to do — drop-oldest + emit, log only, etc.; this PBI does NOT define that behaviour)
**AC-6: flush() drains buffer**
Given an FdrClient with N records buffered and a consumer thread draining
When the test calls `flush()`
Then `flush()` returns only after `pop_one()` has been called N times (no records left in the buffer)
**AC-7: producer_id is non-empty and stamped on every record**
Given a constructor call `FdrClient(producer_id="")` (empty string)
When construction runs
Then `ValueError` is raised — anonymous producers are forbidden
## Non-Functional Requirements
**Performance**
- `enqueue` p99 ≤ 5 µs on Tier-2 (Jetson Orin Nano Super) for a record carrying a `payload` dict of ≤ 16 scalar entries. Validated by a microbenchmark (10k iterations, warm cache).
- `pop_one` p99 ≤ 10 µs on Tier-2 under steady-state.
- Memory: per-producer ring buffer ≤ `capacity * sizeof(slot)` bytes; no unbounded growth. Pre-sized at construction.
**Reliability**
- `enqueue` never raises into the caller. Schema violations from `FdrRecord` are caught and forwarded to the same `on_overrun` hook with a synthetic flag (the overrun-policy PBI decides what to do); the producer's hot path stays clean.
- Multiple `make_fdr_client(producer_id, config)` calls with the same `producer_id` return the same cached instance — there is exactly one FdrClient per producer_id per process.
**Concurrency**
- SPSC: ONE producer thread MAY call `enqueue`, ONE consumer thread MAY call `pop_one` / `drain`. Multi-producer or multi-consumer use is undefined behaviour and detected by the contract guard (AC-4).
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Stalled consumer + 1025 enqueues into a 1024-capacity client | Every call returns within 50 µs; #1025 returns `OVERRUN` |
| AC-2 | `tracemalloc` snapshot diff across one `enqueue` after warmup | Zero new objects allocated |
| AC-3 | `make_fdr_client("c1_vio", config_with_capacity_4096)` | `client._capacity() == 4096` |
| AC-4 | Two threads call `pop_one()` concurrently with the SPSC guard enabled | `FdrSpscViolationError` raised |
| AC-5 | Wire a recording `on_overrun`; force overrun | Closure invoked exactly once with the offending record |
| AC-6 | Enqueue N records, start a draining consumer, call `flush()` | `flush()` returns only after buffer is empty |
| AC-7 | `FdrClient(producer_id="")` | `ValueError` |
| NFR-perf | Microbench `enqueue` over 10k iterations on Tier-2 | p99 ≤ 5 µs |
| NFR-perf-pop | Microbench `pop_one` over 10k iterations | p99 ≤ 10 µs |
| NFR-reliability | Two `make_fdr_client("c1_vio", config)` calls | same instance returned |
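
For the two NFR-perf rows, a microbenchmark needs a p99 definition. The sketch below assumes the nearest-rank method and `time.perf_counter_ns` as the clock; both are assumptions, as are the helper names, since the spec fixes neither.

```python
import time


def nearest_rank_p99(samples):
    """Return the 99th percentile of samples using the nearest-rank method."""
    ordered = sorted(samples)
    rank = (len(ordered) * 99 + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def p99_ns(fn, iterations=10_000):
    """Time fn() `iterations` times and return the p99 latency in nanoseconds."""
    samples = [0] * iterations  # pre-sized so the loop does not grow the list
    clock = time.perf_counter_ns
    for i in range(iterations):
        start = clock()
        fn()
        samples[i] = clock() - start
    return nearest_rank_p99(samples)
```

The 5 µs budget corresponds to `p99_ns(...) <= 5_000`. The measured value includes the overhead of the two clock calls.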
## Constraints
- Public surface frozen by `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` v1.0.0.
- SPSC only — multi-producer / multi-consumer is out of scope and the contract test asserts the SPSC guard exists.
- The lock-free implementation MAY use `multiprocessing.shared_memory`, `cffi`-backed atomics, a Cython extension, or pure Python with `array.array` + a single CAS-like primitive — implementation choice is internal to this PBI but MUST satisfy the allocation-free + non-blocking ACs above. Prefer the simplest working option that hits the budget; document the choice in the implementation report.
- No new dependency beyond what AZ-263 / E-BOOT pinned.
## Risks & Mitigation
**Risk 1: Pure-Python SPSC ring cannot hit the 5 µs p99 budget on Tier-2**
- *Risk*: CPython's GIL and dict operations may push p99 above 5 µs on the Jetson.
- *Mitigation*: Bench against a `cffi` or Cython-backed SPSC ring as a fallback; the contract is library-agnostic so the implementation can swap without breaking consumers. Decision is taken inside this PBI's implementation phase with the microbench as the oracle.
**Risk 2: Overrun hook called with record that holds a reference to caller-mutable state**
- *Risk*: Producer mutates `record.payload` after `enqueue`; the overrun closure sees the mutated value.
- *Mitigation*: `FdrRecord` is `@frozen` (per AZ-272 contract); the contract test verifies a producer cannot legally mutate a constructed record. Documented in the contract `Invariants`.
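
The immutability this mitigation relies on can be illustrated with a stand-in record. The real `FdrRecord` is defined by AZ-272 and is not shown here; `SketchRecord`, its fields, and `make_record` are hypothetical, and the stdlib frozen dataclass is a substitute for whatever `@frozen` refers to. Freezing a class only blocks attribute rebinding, so the sketch also copies the payload and wraps it in a read-only view.

```python
from dataclasses import dataclass, FrozenInstanceError
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class SketchRecord:
    producer_id: str
    kind: str
    payload: Mapping[str, Any]


def make_record(producer_id: str, kind: str, payload: dict) -> SketchRecord:
    # Copy first so later changes to the caller's dict cannot leak in,
    # then expose the copy through a read-only mapping view.
    return SketchRecord(producer_id, kind, MappingProxyType(dict(payload)))
```

With this shape, rebinding a field raises `FrozenInstanceError` and assigning into `payload` raises `TypeError`.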
**Risk 3: Cached FdrClient leaks across test cases**
- *Risk*: A pytest test mutates the module-level cache; subsequent tests get a stale FdrClient.
- *Mitigation*: A `_reset_for_tests()` private function (documented as test-only in the contract `Non-Goals`) clears the cache; integration test fixture calls it on teardown.
## Runtime Completeness
- **Named capability**: lock-free SPSC ring buffer + `FdrClient` public API (architecture / E-CC-FDR-CLIENT / AC-NEW-3, NFR `enqueue` p99 ≤ 5 µs).
- **Production code that must exist**: real lock-free SPSC primitive (no Python `queue.Queue`, no lock-acquiring fallback); real allocation-free hot path; real `on_overrun` hook plumbing.
- **Allowed external stubs**: none — the queue is the production runtime capability.
- **Unacceptable substitutes**: `queue.Queue`, `threading.Lock`-guarded list, `collections.deque` with a lock, "for now we just `time.sleep(0)` on overrun", or any implementation that allocates on the steady-state path. These would all silently break AC-NEW-3 the moment the writer thread stalls for >100 ms.
## Contract
This task produces the contract at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`.
Consumers MUST read that file — not this task spec — to discover the interface.