AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError on the public surface. make_fdr_client caches one client per producer_id and reads capacity from config.fdr.per_producer_capacity with fallback to queue_size.

AZ-274: default_overrun_policy implements drop-oldest + retry + immediate marker emission, with prior-marker dropped_count folding via _evict_one so user-loss info is never lost across iterations. ERROR diagnostic is rate-limited to <=1/sec per producer.

AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the production default_overrun_policy via a duck-typed _PolicyAdapter. The test-only records/all_records_ever properties let component tests assert both in-buffer and lifetime state. tests/conftest.py registers the fake_fdr_sink fixture and an AST architecture lint forbids production imports of fakes.

AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge and forwards only WARN+ERROR records into the FDR with kind="log". Thread-local recursion guard short-circuits internal logging; saturated-queue diagnostics go to stderr every N=1000 drops.

AZ-268: tests/contract/log_schema.py covers every row of the schema's Test Cases table plus the "DEBUG+INFO never reach FDR" invariant. pyproject.toml registers the contract pytest marker and the contract-mandated log_schema.py file-name.

251 unit + contract tests pass (48 new). Review verdict: PASS_WITH_WARNINGS; findings are NFR-perf deferrals + documented relaxation of AZ-274 AC-2 coalescing under permanently-stalled consumer.

Co-authored-by: Cursor <cursoragent@cursor.com>
FdrClient Lock-Free SPSC Ring Buffer + Public API
Task: AZ-273_fdr_client_ringbuf
Name: FdrClient Ring Buffer
Description: Implement the producer-side FdrClient(producer_id) and its lock-free single-producer / single-consumer (SPSC) ring buffer. enqueue is non-blocking even when the C13 writer thread is stalled. Capacity is configurable per producer via the cross-cutting Config block. The buffer exposes a hook the overrun-policy task (next PBI) plugs into; this task does NOT implement the drop-oldest emission itself.
Complexity: 5 points
Dependencies: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-269_config_loader, AZ-266_log_module
Component: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
Tracker: AZ-273
Epic: AZ-247 (E-CC-FDR-CLIENT)
Document Dependencies
- _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md: the record envelope this client enqueues.
- _docs/02_document/contracts/shared_config/composition_root_protocol.md: the Config object that carries this client's capacity setting.
- _docs/02_document/contracts/shared_logging/log_record_schema.md: diagnostic logs emitted by this client (NOT on the steady-state hot path).
Problem
Every onboard component needs to publish FDR records in real time without blocking on the writer thread, the disk, or any other producer. AC-NEW-3 ("no silent drops") and the steady-state enqueue p99 ≤ 5 µs budget rule out:
- Any lock-acquiring queue (Python queue.Queue, threading.Lock-protected list, asyncio queue).
- Any allocation on the steady-state path (no dict.copy(), no list.append that may resize, no dataclasses.replace).
- Any blocking I/O.
Without a shared, contract-frozen client, every component would re-implement its own queue, drift on overrun semantics, and break the AC-NEW-3 guarantee within weeks of parallel development.
Outcome
- A single FdrClient(producer_id) is the only handle any onboard producer ever holds; it is constructed by the composition root and injected into each component.
- enqueue p99 ≤ 5 µs on Tier-2 with no allocation on the steady-state path (pre-sized buffers; reused slots).
- enqueue NEVER blocks, regardless of writer-thread state. When the buffer is full, control returns to the caller in O(1); the overrun policy (drop-oldest + emit kind="overrun") is implemented by the next PBI via the buffer's documented hook.
- The dequeue side (pop_one / iterator) is consumed exclusively by the C13 writer thread; the contract documents it as SPSC. Multi-consumer use is undefined behaviour and rejected by the contract test.
Scope
Included
- FdrClient(producer_id: str, capacity: int) constructor + module-level make_fdr_client(producer_id, config) -> FdrClient factory that reads capacity from the cross-cutting config.fdr_client.<producer_id>.capacity block (with documented default).
- FdrClient.enqueue(record: FdrRecord) -> EnqueueResult: lock-free, non-blocking, allocation-free on the steady-state path. Returns EnqueueResult.OK or EnqueueResult.OVERRUN (the next PBI consumes OVERRUN).
- A documented on_overrun: Callable[[FdrRecord], None] | None hook the overrun-policy PBI populates with the drop-oldest + record-emit closure.
- Single-consumer dequeue API for the C13 writer: pop_one() -> FdrRecord | None and drain(max_records: int) -> list[FdrRecord].
- flush() -> None test-only method that blocks until the buffer is empty (used by FakeFdrSink and contract tests; production callers MUST NOT call this on the hot path).
- Diagnostic INFO log on construction (one-time, NOT on the steady-state hot path) via the shared logger.
- Public interface contract published at _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md.
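The shape of the included surface can be sketched in pure Python as below. This is a minimal illustration, not the contract or the shipped implementation: the class name SpscRing is hypothetical, records are treated as opaque objects, and the sketch leans on CPython's GIL for visibility between the two threads. It also does not meet the allocation-free criterion as written, because the monotonically growing counters eventually allocate new int objects. The power-of-two masking follows the commit summary above.

```python
from enum import Enum
from typing import Callable, Optional


class EnqueueResult(Enum):
    OK = 0
    OVERRUN = 1


class SpscRing:
    """Illustrative ring: the producer only writes _tail, the consumer only writes _head."""

    def __init__(self, producer_id: str, capacity: int,
                 on_overrun: Optional[Callable[[object], None]] = None) -> None:
        if not producer_id:
            raise ValueError("producer_id must be non-empty")
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self.producer_id = producer_id
        self._mask = capacity - 1
        self._slots = [None] * capacity  # pre-sized once, slots are reused
        self._head = 0                   # next index to read (consumer-owned)
        self._tail = 0                   # next index to write (producer-owned)
        self.on_overrun = on_overrun

    def enqueue(self, record: object) -> EnqueueResult:
        # Full when the producer is a whole capacity ahead of the consumer.
        if self._tail - self._head > self._mask:
            if self.on_overrun is not None:
                self.on_overrun(record)  # policy is decided by the hook, not here
            return EnqueueResult.OVERRUN
        self._slots[self._tail & self._mask] = record
        self._tail += 1                  # publish only after the slot is written
        return EnqueueResult.OK

    def pop_one(self) -> Optional[object]:
        if self._head == self._tail:     # empty
            return None
        idx = self._head & self._mask
        record = self._slots[idx]
        self._slots[idx] = None          # drop the reference so the slot can be reused
        self._head += 1
        return record
```

Filling a ring of capacity 4 and enqueueing a fifth record returns EnqueueResult.OVERRUN and calls the hook once with that fifth record; after one pop_one() the next enqueue succeeds again.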
Excluded
- The drop-oldest behaviour and the kind="overrun" record emission: owned by the next PBI in this epic.
- The C13 writer thread itself, segment files, segment rotation, 64 GB cap: owned by E-C13 (AZ-248).
- The FakeFdrSink for tests: owned by the fourth PBI in this epic.
- Multi-producer / multi-consumer ring buffer: out of scope; the contract is SPSC.
- The actual FdrRecord schema and serialiser: owned by AZ-272.
Acceptance Criteria
AC-1: Lock-free, never blocks
Given an FdrClient with capacity 1024 and a writer thread that is stalled (does not dequeue)
When the producer calls enqueue(record) 1025 times in rapid succession
Then every call returns within 50 µs (no thread state ever transitions to BLOCKED), and the 1025th call returns EnqueueResult.OVERRUN
AC-2: Allocation-free steady-state
Given an FdrClient warmed up with one prior enqueue
When the producer calls enqueue(record) for an in-buffer record (slot is free)
Then the call performs zero heap allocations (verified via tracemalloc snapshot diff: 0 new objects on the hot path)
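One way the tracemalloc check in AC-2 might be written is sketched below. The helper name count_new_blocks is hypothetical. Note that a snapshot diff only sees blocks still alive at the second snapshot, so transient allocations that are freed inside the call would not be counted; the real test may need a stricter oracle.

```python
import tracemalloc


def count_new_blocks(fn, *args, warmup: int = 1) -> int:
    """Net number of memory blocks that fn(*args) leaves allocated after one call."""
    for _ in range(warmup):
        fn(*args)                      # warm caches before tracing starts
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        fn(*args)
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Ignore blocks allocated by tracemalloc's own bookkeeping.
    own = [tracemalloc.Filter(False, tracemalloc.__file__)]
    diff = after.filter_traces(own).compare_to(before.filter_traces(own), "lineno")
    return sum(stat.count_diff for stat in diff if stat.count_diff > 0)
```

A test for AC-2 would then call count_new_blocks(client.enqueue, record) and compare the result with zero.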
AC-3: Capacity is config-driven
Given the cross-cutting Config block sets config.fdr_client.<producer_id>.capacity = 4096
When make_fdr_client(producer_id, config) runs
Then the returned client's internal buffer length is 4096 (verified via the test-only _capacity() introspection method)
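The capacity lookup behind AC-3 could look roughly like this. The sketch assumes the Config block behaves like a nested mapping keyed as fdr_client -> producer_id -> capacity; the helper name resolve_capacity and the DEFAULT_CAPACITY value are hypothetical, since the spec only says a documented default exists.

```python
from typing import Any, Mapping

DEFAULT_CAPACITY = 1024  # hypothetical value; the spec only promises "a documented default"


def resolve_capacity(config: Mapping[str, Any], producer_id: str) -> int:
    """Read config.fdr_client.<producer_id>.capacity, falling back to the default."""
    per_producer = config.get("fdr_client", {}).get(producer_id, {})
    capacity = int(per_producer.get("capacity", DEFAULT_CAPACITY))
    if capacity <= 0:
        raise ValueError(f"capacity for {producer_id!r} must be positive, got {capacity}")
    return capacity
```

With {"fdr_client": {"c1_vio": {"capacity": 4096}}} the lookup for "c1_vio" yields 4096; a producer with no entry gets the default.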
AC-4: SPSC dequeue contract
Given two threads concurrently call pop_one()
When both calls race
Then the contract test detects undefined behaviour (asserted via a contract test that wraps pop_one in a guard which raises FdrSpscViolationError on concurrent entry — the guard is opt-in for tests but documents the SPSC invariant)
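The opt-in guard described in AC-4 might be built along these lines. SpscGuard and its entered() method are hypothetical names; only FdrSpscViolationError comes from the spec. The sketch uses a threading.Lock with a non-blocking acquire, which is acceptable only because the guard is test-only and never waits; it is not meant for the production hot path.

```python
import threading
from contextlib import contextmanager


class FdrSpscViolationError(RuntimeError):
    """Raised when a second caller enters a single-consumer section."""


class SpscGuard:
    """Test-only detector for concurrent entry; it never blocks the caller."""

    def __init__(self) -> None:
        self._busy = threading.Lock()

    @contextmanager
    def entered(self):
        # A failed non-blocking acquire means someone else is already inside.
        if not self._busy.acquire(blocking=False):
            raise FdrSpscViolationError("concurrent pop_one() detected")
        try:
            yield
        finally:
            self._busy.release()
```

A contract test would wrap pop_one in `with guard.entered():` and hold the first caller inside the section (for example with a threading.Event) while a second thread attempts entry.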
AC-5: Overrun hook is wired
Given an FdrClient with on_overrun set to a recording closure
When the buffer fills and the next enqueue would overrun
Then on_overrun is invoked exactly once per overrun event with the would-be-enqueued record (the closure decides what to do — drop-oldest + emit, log only, etc.; this PBI does NOT define that behaviour)
AC-6: flush() drains buffer
Given an FdrClient with N records buffered and a consumer thread draining
When the test calls flush()
Then flush() returns only after pop_one() has been called N times (no records left in the buffer)
AC-7: producer_id is non-empty and stamped on every record
Given a constructor call FdrClient(producer_id="") (empty string)
When construction runs
Then ValueError is raised — anonymous producers are forbidden
Non-Functional Requirements
Performance
- enqueue p99 ≤ 5 µs on Tier-2 (Jetson Orin Nano Super) for a record carrying a payload dict of ≤ 16 scalar entries. Validated by a microbenchmark (10k iterations, warm cache).
- pop_one p99 ≤ 10 µs on Tier-2 under steady-state.
- Memory: per-producer ring buffer ≤ capacity * sizeof(slot) bytes; no unbounded growth. Pre-sized at construction.
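A microbenchmark for the p99 budgets could be structured as in the following sketch. The function names are hypothetical, the percentile uses the nearest-rank definition, and each sample includes the overhead of two time.perf_counter_ns() calls, which may be a noticeable fraction of a 5 µs budget on the target hardware.

```python
import time


def p99_ns(samples) -> int:
    """Nearest-rank 99th percentile: smallest sample with >= 99% of samples at or below it."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100   # integer ceil(0.99 * n)
    return ordered[rank - 1]


def bench_p99_ns(fn, arg, iterations: int = 10_000) -> int:
    """Time fn(arg) per call and return the p99 latency in nanoseconds."""
    samples = [0] * iterations               # pre-sized so the loop does not grow a list
    clock = time.perf_counter_ns
    for i in range(iterations):
        t0 = clock()
        fn(arg)
        samples[i] = clock() - t0
    return p99_ns(samples)
```

The NFR-perf test would then compare bench_p99_ns(client.enqueue, record) against 5_000 ns on Tier-2.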
Reliability
- enqueue never raises into the caller. Schema violations from FdrRecord are caught and forwarded to the same on_overrun hook with a synthetic flag (the overrun-policy PBI decides what to do); the producer's hot path stays clean.
- Multiple make_fdr_client(producer_id, config) calls with the same producer_id return the same cached instance; there is exactly one FdrClient per producer_id per process.
Concurrency
- SPSC: ONE producer thread MAY call enqueue, ONE consumer thread MAY call pop_one / drain. Multi-producer or multi-consumer use is undefined behaviour and detected by the contract guard (AC-4).
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Stalled consumer + 1025 enqueues into a 1024-capacity client | Every call returns within 50 µs; #1025 returns OVERRUN |
| AC-2 | tracemalloc snapshot diff across one enqueue after warmup | Zero new objects allocated |
| AC-3 | make_fdr_client("c1_vio", config_with_capacity_4096) | client._capacity() == 4096 |
| AC-4 | Two threads call pop_one() concurrently with the SPSC guard enabled | FdrSpscViolationError raised |
| AC-5 | Wire a recording on_overrun; force overrun | Closure invoked exactly once with the offending record |
| AC-6 | Enqueue N records, start a draining consumer, call flush() | flush() returns only after buffer is empty |
| AC-7 | FdrClient(producer_id="") | ValueError |
| NFR-perf | Microbench enqueue over 10k iterations on Tier-2 | p99 ≤ 5 µs |
| NFR-perf-pop | Microbench pop_one over 10k iterations | p99 ≤ 10 µs |
| NFR-reliability | Two make_fdr_client("c1_vio", config) calls | same instance returned |
Constraints
- Public surface frozen by _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md v1.0.0.
- SPSC only: multi-producer / multi-consumer is out of scope and the contract test asserts the SPSC guard exists.
- The lock-free implementation MAY use multiprocessing.shared_memory, cffi-backed atomics, a Cython extension, or pure Python with array.array + a single CAS-like primitive. The implementation choice is internal to this PBI but MUST satisfy the allocation-free + non-blocking ACs above. Prefer the simplest working option that hits the budget; document the choice in the implementation report.
- No new dependency beyond what AZ-263 / E-BOOT pinned.
Risks & Mitigation
Risk 1: Pure-Python SPSC ring cannot hit the 5 µs p99 budget on Tier-2
- Risk: CPython's GIL + dict operations push p99 above 5 µs on the Jetson.
- Mitigation: Bench against a cffi or Cython-backed SPSC ring as a fallback; the contract is library-agnostic so the implementation can swap without breaking consumers. Decision is taken inside this PBI's implementation phase with the microbench as the oracle.
Risk 2: Overrun hook called with record that holds a reference to caller-mutable state
- Risk: Producer mutates record.payload after enqueue; the overrun closure sees the mutated value.
- Mitigation: FdrRecord is @frozen (per AZ-272 contract); the contract test verifies a producer cannot legally mutate a constructed record. Documented in the contract Invariants.
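The mitigation for Risk 2 can be illustrated with a stand-in record. The real FdrRecord is owned by AZ-272 and may be defined differently; RecordSketch and its fields are hypothetical. The sketch shows one point worth checking in the contract test: freezing the record blocks attribute rebinding, but a plain dict payload would still be mutable in place unless it is copied and exposed read-only.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class RecordSketch:
    """Stand-in for FdrRecord, used only to illustrate immutability."""

    kind: str
    payload: Mapping[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Copy the caller's dict and expose it read-only, so later mutation of
        # the caller's dict (or of record.payload) cannot change the record.
        object.__setattr__(self, "payload", MappingProxyType(dict(self.payload)))


def is_mutable(record: RecordSketch) -> bool:
    """Return True if either the attribute or the payload can be changed."""
    try:
        record.kind = "other"
        return True
    except FrozenInstanceError:
        pass
    try:
        record.payload["x"] = 1
        return True
    except TypeError:
        return False
```

The copy happens at record construction, so it stays off the enqueue path.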
Risk 3: Cached FdrClient leaks across test cases
- Risk: A pytest test mutates the module-level cache; subsequent tests get a stale FdrClient.
- Mitigation: A _reset_for_tests() private function (documented as test-only in the contract Non-Goals) clears the cache; integration test fixture calls it on teardown.
Runtime Completeness
- Named capability: lock-free SPSC ring buffer + FdrClient public API (architecture / E-CC-FDR-CLIENT / AC-NEW-3, NFR enqueue p99 ≤ 5 µs).
- Production code that must exist: real lock-free SPSC primitive (no Python queue.Queue, no lock-acquiring fallback); real allocation-free hot path; real on_overrun hook plumbing.
- Allowed external stubs: none. The queue is the production runtime capability.
- Unacceptable substitutes: queue.Queue, threading.Lock-guarded list, collections.deque with a lock, "for now we just time.sleep(0) on overrun", or any implementation that allocates on the steady-state path. These would all silently break AC-NEW-3 the moment the writer thread stalls for >100 ms.
Contract
This task produces the contract at _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md.
Consumers MUST read that file — not this task spec — to discover the interface.