gps-denied-onboard/_docs/02_tasks/done/AZ-273_fdr_client_ringbuf.md
Oleksandr Bezdieniezhnykh ba20c2d195 [AZ-273] [AZ-274] [AZ-275] [AZ-267] [AZ-268] FDR producer chain + log bridge + contract test
AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-
two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError
on the public surface. make_fdr_client caches one client per producer_id
and reads capacity from config.fdr.per_producer_capacity with fallback
to queue_size.
AZ-274: default_overrun_policy implements drop-oldest + retry + immediate
marker emission, with prior-marker dropped_count folding via _evict_one
so user-loss info is never lost across iterations. ERROR diagnostic is
rate-limited to <=1/sec per producer.
AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the
production default_overrun_policy via a duck-typed _PolicyAdapter. The
test-only records/all_records_ever properties let component tests assert
both in-buffer and lifetime state. tests/conftest.py registers the
fake_fdr_sink fixture and an AST architecture lint forbids production
imports of fakes.
AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge
and forwards only WARN+ERROR records into the FDR with kind="log".
Thread-local recursion guard short-circuits internal logging; saturated-
queue diagnostics go to stderr every N=1000 drops.
AZ-268: tests/contract/log_schema.py covers every row of the schema's
Test Cases table plus the "DEBUG+INFO never reach FDR" invariant.
pyproject.toml registers the contract pytest marker and the
contract-mandated log_schema.py file-name.
251 unit + contract tests pass (48 new). Review verdict:
PASS_WITH_WARNINGS; findings are NFR-perf deferrals + documented
relaxation of AZ-274 AC-2 coalescing under permanently-stalled consumer.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 03:00:49 +03:00


FdrClient Lock-Free SPSC Ring Buffer + Public API

Task: AZ-273_fdr_client_ringbuf
Name: FdrClient Ring Buffer
Description: Implement the producer-side FdrClient(producer_id) and its lock-free single-producer / single-consumer (SPSC) ring buffer. enqueue is non-blocking even when the C13 writer thread is stalled. Capacity is configurable per producer via the cross-cutting Config block. The buffer exposes a hook the overrun-policy task (next PBI) plugs into; this task does NOT implement the drop-oldest emission itself.
Complexity: 5 points
Dependencies: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-269_config_loader, AZ-266_log_module
Component: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
Tracker: AZ-273
Epic: AZ-247 (E-CC-FDR-CLIENT)

Document Dependencies

  • _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md — the record envelope this client enqueues.
  • _docs/02_document/contracts/shared_config/composition_root_protocol.md — the Config object that carries this client's capacity setting.
  • _docs/02_document/contracts/shared_logging/log_record_schema.md — diagnostic logs emitted by this client (NOT on the steady-state hot path).

Problem

Every onboard component needs to publish FDR records in real time without blocking on the writer thread, the disk, or any other producer. AC-NEW-3 ("no silent drops") and the steady-state enqueue p99 ≤ 5 µs budget rule out:

  • Any lock-acquiring queue (Python queue.Queue, threading.Lock-protected list, asyncio queue).
  • Any allocation on the steady-state path (no dict.copy(), no list.append that may resize, no dataclasses.replace).
  • Any blocking I/O.

Without a shared, contract-frozen client, every component would re-implement its own queue, drift on overrun semantics, and break the AC-NEW-3 guarantee within weeks of parallel development.

Outcome

  • A single FdrClient(producer_id) is the only handle any onboard producer ever holds; constructed by the composition root and injected into each component.
  • enqueue p99 ≤ 5 µs on Tier-2 with no allocation on the steady-state path (pre-sized buffers; reused slots).
  • enqueue NEVER blocks, regardless of writer-thread state. When the buffer is full, control returns to the caller in O(1); the overrun policy (drop-oldest + emit kind="overrun") is implemented by the next PBI via the buffer's documented hook.
  • The dequeue side (pop_one / iterator) is consumed exclusively by the C13 writer thread; the contract documents it as SPSC — multi-consumer is undefined behaviour and rejected by the contract test.

Scope

Included

  • FdrClient(producer_id: str, capacity: int) constructor + module-level make_fdr_client(producer_id, config) -> FdrClient factory that reads capacity from the cross-cutting config.fdr_client.<producer_id>.capacity block (with documented default).
  • FdrClient.enqueue(record: FdrRecord) -> EnqueueResult — lock-free, non-blocking, allocation-free on the steady-state path. Returns EnqueueResult.OK or EnqueueResult.OVERRUN (the next PBI consumes OVERRUN).
  • A documented on_overrun: Callable[[FdrRecord], None] | None hook the overrun-policy PBI populates with the drop-oldest + record-emit closure.
  • Single-consumer dequeue API for the C13 writer: pop_one() -> FdrRecord | None and drain(max_records: int) -> list[FdrRecord].
  • flush() -> None test-only method that blocks until the buffer is empty (used by FakeFdrSink and contract tests; production callers MUST NOT call this on the hot path).
  • Diagnostic INFO log on construction (one-time, NOT on the steady-state hot path) via the shared logger.
  • Public interface contract published at _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md.
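
The surface listed above can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the shipped implementation: it assumes CPython (it relies on the GIL for ordering between the slot write and the index publish), it assumes a power-of-two capacity so indices can be masked, and it treats records as opaque objects. The attribute names _slots, _mask, _head and _tail are hypothetical, and the sketch makes no claim about meeting the allocation or latency budgets.

```python
from enum import Enum
from typing import Any, Callable, List, Optional


class EnqueueResult(Enum):
    OK = "ok"
    OVERRUN = "overrun"


class FdrClient:
    """Illustrative SPSC ring: one producer calls enqueue, one consumer calls pop_one."""

    def __init__(self, producer_id: str, capacity: int) -> None:
        if not producer_id:
            raise ValueError("producer_id must be non-empty")
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two (assumption of this sketch)")
        self.producer_id = producer_id
        self._mask = capacity - 1
        self._slots: List[Any] = [None] * capacity  # pre-sized once, never resized
        self._head = 0  # advanced only by the consumer
        self._tail = 0  # advanced only by the producer
        self.on_overrun: Optional[Callable[[Any], None]] = None

    def enqueue(self, record: Any) -> EnqueueResult:
        tail = self._tail
        if tail - self._head > self._mask:  # all slots are occupied
            hook = self.on_overrun
            if hook is not None:
                hook(record)  # the policy closure decides what happens next
            return EnqueueResult.OVERRUN
        self._slots[tail & self._mask] = record
        self._tail = tail + 1  # publish only after the slot is written
        return EnqueueResult.OK

    def pop_one(self) -> Optional[Any]:
        head = self._head
        if head == self._tail:  # buffer is empty
            return None
        index = head & self._mask
        record = self._slots[index]
        self._slots[index] = None  # drop the reference so the slot can be reused
        self._head = head + 1
        return record

    def drain(self, max_records: int) -> List[Any]:
        out: List[Any] = []
        while len(out) < max_records:
            record = self.pop_one()
            if record is None:
                break
            out.append(record)
        return out
```

The producer only ever writes _tail and the consumer only ever writes _head, which is what lets the sketch avoid a lock. A cffi or Cython backend would keep the same method names and replace the two counters with real atomics.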

Excluded

  • The drop-oldest behaviour and the kind="overrun" record emission — owned by the next PBI in this epic.
  • The C13 writer thread itself, segment files, segment rotation, 64 GB cap — owned by E-C13 (AZ-248).
  • The FakeFdrSink for tests — owned by the fourth PBI in this epic.
  • Multi-producer / multi-consumer ring buffer — out of scope; the contract is SPSC.
  • The actual FdrRecord schema and serialiser — owned by AZ-272.

Acceptance Criteria

AC-1: Lock-free, never blocks
Given an FdrClient with capacity 1024 and a writer thread that is stalled (does not dequeue)
When the producer calls enqueue(record) 1025 times in rapid succession
Then every call returns within 50 µs (no thread state ever transitions to BLOCKED), and the 1025th call returns EnqueueResult.OVERRUN

AC-2: Allocation-free steady-state
Given an FdrClient warmed up with one prior enqueue
When the producer calls enqueue(record) for an in-buffer record (slot is free)
Then the call performs zero heap allocations (verified via tracemalloc snapshot diff: 0 new objects on the hot path)
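
One way the tracemalloc check in AC-2 could be expressed is sketched below. The helper name net_new_blocks is hypothetical. It assumes the callable under test exposes __code__ (a plain function or lambda), and it counts only blocks attributed to that callable's source file, so tracemalloc's own bookkeeping does not pollute the result.

```python
import tracemalloc


def net_new_blocks(fn):
    """Count memory blocks that one call to fn allocates and still holds afterwards."""
    fn()  # warm-up call, mirroring "warmed up with one prior enqueue"
    only_fn_file = [tracemalloc.Filter(True, fn.__code__.co_filename)]
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot().filter_traces(only_fn_file)
        fn()  # the single call being measured
        after = tracemalloc.take_snapshot().filter_traces(only_fn_file)
    finally:
        tracemalloc.stop()
    # compare_to reports per-line differences; summing count_diff gives net blocks.
    return sum(stat.count_diff for stat in after.compare_to(before, "lineno"))
```

A test would wrap a single enqueue into a free slot in a zero-argument function and compare the returned count against the zero-allocation requirement. The helper measures blocks that are still alive after the call, so short-lived temporaries that are freed before the second snapshot are not counted.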

AC-3: Capacity is config-driven
Given the cross-cutting Config block sets config.fdr_client.<producer_id>.capacity = 4096
When make_fdr_client(producer_id, config) runs
Then the returned client's internal buffer length is 4096 (verified via the test-only _capacity() introspection method)

AC-4: SPSC dequeue contract
Given two threads concurrently call pop_one()
When both calls race
Then the contract test detects undefined behaviour (asserted via a contract test that wraps pop_one in a guard which raises FdrSpscViolationError on concurrent entry; the guard is opt-in for tests but documents the SPSC invariant)

AC-5: Overrun hook is wired
Given an FdrClient with on_overrun set to a recording closure
When the buffer fills and the next enqueue would overrun
Then on_overrun is invoked exactly once per overrun event with the would-be-enqueued record (the closure decides what to do: drop-oldest + emit, log only, etc.; this PBI does NOT define that behaviour)

AC-6: flush() drains buffer
Given an FdrClient with N records buffered and a consumer thread draining
When the test calls flush()
Then flush() returns only after pop_one() has been called N times (no records left in the buffer)
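
The flush() behaviour in AC-6 could be sketched as a polling wait on the two ring indices. TinyRing below is a hypothetical stand-in that carries only what the example needs. The polling interval and the popped counter are assumptions made for illustration, and the power-of-two capacity is assumed as well.

```python
import threading
import time


class TinyRing:
    """Hypothetical stand-in for the client; only what flush() needs."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._mask = capacity - 1  # capacity assumed to be a power of two
        self._head = 0  # advanced only by the consumer
        self._tail = 0  # advanced only by the producer
        self.popped = 0

    def enqueue(self, record):
        if self._tail - self._head > self._mask:
            return False  # full
        self._slots[self._tail & self._mask] = record
        self._tail += 1
        return True

    def pop_one(self):
        if self._head == self._tail:
            return None  # empty
        record = self._slots[self._head & self._mask]
        self.popped += 1  # counted before the slot is released
        self._head += 1
        return record

    def flush(self, poll_interval_s=0.0005):
        """Test-only: block until the consumer has emptied the buffer."""
        while self._head != self._tail:
            time.sleep(poll_interval_s)


def run_consumer(ring, stop):
    """Drain the ring until asked to stop, sleeping briefly when it is empty."""
    while not stop.is_set():
        if ring.pop_one() is None:
            time.sleep(0.0005)
```

Because flush() has no timeout in this sketch, a test that forgets to start the consumer would hang; a real test helper might add a deadline. This also illustrates why the method is kept off the production hot path.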

AC-7: producer_id is non-empty and stamped on every record
Given a constructor call FdrClient(producer_id="") (empty string)
When construction runs
Then ValueError is raised; anonymous producers are forbidden

Non-Functional Requirements

Performance

  • enqueue p99 ≤ 5 µs on Tier-2 (Jetson Orin Nano Super) for a record carrying a payload dict of ≤ 16 scalar entries. Validated by a microbenchmark (10k iterations, warm cache).
  • pop_one p99 ≤ 10 µs on Tier-2 under steady-state.
  • Memory: per-producer ring buffer ≤ capacity * sizeof(slot) bytes; no unbounded growth. Pre-sized at construction.
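
A microbenchmark for the p99 budgets above could take roughly the following shape. The helper names nearest_rank and p99_ns are hypothetical, and the warm-up count is an assumption; only the 10k iteration count comes from the requirement. Each sample also includes the cost of the two clock reads, so the result is a slight overestimate.

```python
import time


def nearest_rank(sorted_samples, percent):
    """Nearest-rank percentile of an ascending list, using integer arithmetic only."""
    n = len(sorted_samples)
    rank = (percent * n + 99) // 100  # ceil(percent * n / 100)
    return sorted_samples[max(rank, 1) - 1]


def p99_ns(fn, iterations=10_000, warmup=1_000):
    """Time fn() per call in nanoseconds and return the 99th percentile."""
    for _ in range(warmup):
        fn()
    samples = [0] * iterations  # pre-sized so the timing loop does not grow a list
    clock = time.perf_counter_ns
    for i in range(iterations):
        start = clock()
        fn()
        samples[i] = clock() - start
    samples.sort()
    return nearest_rank(samples, 99)
```

In use, the enqueue budget would be checked by comparing p99_ns of a zero-argument wrapper around one enqueue against 5_000 ns, and the pop_one budget against 10_000 ns, with the consumer keeping the buffer from filling during the run.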

Reliability

  • enqueue never raises into the caller. Schema violations from FdrRecord are caught and forwarded to the same on_overrun hook with a synthetic flag (the overrun-policy PBI decides what to do); the producer's hot path stays clean.
  • Multiple make_fdr_client(producer_id, config) calls with the same producer_id return the same cached instance — there is exactly one FdrClient per producer_id per process.

Concurrency

  • SPSC: ONE producer thread MAY call enqueue, ONE consumer thread MAY call pop_one / drain. Multi-producer or multi-consumer use is undefined behaviour and detected by the contract guard (AC-4).
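
The opt-in guard referred to in AC-4 could be as small as a non-blocking lock attempt that raises on contention. SpscGuard is a hypothetical name. The lock exists only in the test-time wrapper, so the production dequeue path stays lock-free.

```python
import threading


class FdrSpscViolationError(RuntimeError):
    """Raised when a second thread enters a single-consumer section."""


class SpscGuard:
    """Context manager that fails fast instead of waiting when already entered."""

    def __init__(self) -> None:
        self._busy = threading.Lock()

    def __enter__(self) -> "SpscGuard":
        # acquire(blocking=False) returns immediately; False means someone is inside.
        if not self._busy.acquire(blocking=False):
            raise FdrSpscViolationError("concurrent entry into an SPSC section")
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self._busy.release()
        return False  # never swallow exceptions from the guarded body
```

A contract test would wrap pop_one in "with guard:" and make two threads overlap inside it. The guard is deliberately not re-entrant, so a nested entry from the same thread is also reported as a violation.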

Unit Tests

| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Stalled consumer + 1025 enqueues into a 1024-capacity client | Every call returns within 50 µs; #1025 returns OVERRUN |
| AC-2 | tracemalloc snapshot diff across one enqueue after warmup | Zero new objects allocated |
| AC-3 | make_fdr_client("c1_vio", config_with_capacity_4096) | client._capacity() == 4096 |
| AC-4 | Two threads call pop_one() concurrently with the SPSC guard enabled | FdrSpscViolationError raised |
| AC-5 | Wire a recording on_overrun; force overrun | Closure invoked exactly once with the offending record |
| AC-6 | Enqueue N records, start a draining consumer, call flush() | flush() returns only after buffer is empty |
| AC-7 | FdrClient(producer_id="") | ValueError |
| NFR-perf | Microbench enqueue over 10k iterations on Tier-2 | p99 ≤ 5 µs |
| NFR-perf-pop | Microbench pop_one over 10k iterations | p99 ≤ 10 µs |
| NFR-reliability | Two make_fdr_client("c1_vio", config) calls | Same instance returned |

Constraints

  • Public surface frozen by _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md v1.0.0.
  • SPSC only — multi-producer / multi-consumer is out of scope and the contract test asserts the SPSC guard exists.
  • The lock-free implementation MAY use multiprocessing.shared_memory, cffi-backed atomics, a Cython extension, or pure Python with array.array + a single CAS-like primitive — implementation choice is internal to this PBI but MUST satisfy the allocation-free + non-blocking ACs above. Prefer the simplest working option that hits the budget; document the choice in the implementation report.
  • No new dependency beyond what AZ-263 / E-BOOT pinned.

Risks & Mitigation

Risk 1: Pure-Python SPSC ring cannot hit the 5 µs p99 budget on Tier-2

  • Risk: CPython's GIL + dict operations push p99 above 5 µs on the Jetson.
  • Mitigation: Bench against a cffi or Cython-backed SPSC ring as a fallback; the contract is library-agnostic so the implementation can swap without breaking consumers. Decision is taken inside this PBI's implementation phase with the microbench as the oracle.

Risk 2: Overrun hook called with record that holds a reference to caller-mutable state

  • Risk: Producer mutates record.payload after enqueue; the overrun closure sees the mutated value.
  • Mitigation: FdrRecord is @frozen (per AZ-272 contract); the contract test verifies a producer cannot legally mutate a constructed record. Documented in the contract Invariants.
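
The immutability this mitigation relies on can be illustrated with a frozen dataclass. FrozenRecord below is a hypothetical stand-in, since the real FdrRecord is defined by AZ-272 and may use a different mechanism. Freezing a class blocks attribute rebinding but not in-place mutation of a dict it holds, so the sketch also wraps the payload in a read-only view.

```python
from dataclasses import FrozenInstanceError, dataclass
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class FrozenRecord:
    """Hypothetical record: attributes cannot be rebound, payload cannot be edited."""

    producer_id: str
    kind: str
    payload: Mapping[str, Any]

    def __post_init__(self) -> None:
        # Copy the caller's dict and expose it read-only; object.__setattr__ is
        # the usual way to assign inside a frozen dataclass during construction.
        object.__setattr__(self, "payload", MappingProxyType(dict(self.payload)))
```

The copy happens when the record is built, not inside enqueue, so it does not by itself add work to the enqueue call. Whether the real schema copies the payload or only documents the rule is a decision for the AZ-272 contract.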

Risk 3: Cached FdrClient leaks across test cases

  • Risk: A pytest test mutates the module-level cache; subsequent tests get a stale FdrClient.
  • Mitigation: A _reset_for_tests() private function (documented as test-only in the contract Non-Goals) clears the cache; integration test fixture calls it on teardown.

Runtime Completeness

  • Named capability: lock-free SPSC ring buffer + FdrClient public API (architecture / E-CC-FDR-CLIENT / AC-NEW-3, NFR enqueue p99 ≤ 5 µs).
  • Production code that must exist: real lock-free SPSC primitive (no Python queue.Queue, no lock-acquiring fallback); real allocation-free hot path; real on_overrun hook plumbing.
  • Allowed external stubs: none — the queue is the production runtime capability.
  • Unacceptable substitutes: queue.Queue, threading.Lock-guarded list, collections.deque with a lock, "for now we just time.sleep(0) on overrun", or any implementation that allocates on the steady-state path. These would all silently break AC-NEW-3 the moment the writer thread stalls for >100 ms.

Contract

This task produces the contract at _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md. Consumers MUST read that file — not this task spec — to discover the interface.