mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 18:11:14 +00:00
Decompose Step 6 snapshot: 140 task specs + contract docs
Closes out greenfield Step 6 (Decompose) for all 14 components (C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446 plus the _dependencies_table.md and component contract documents. State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,151 @@
# FdrClient Lock-Free SPSC Ring Buffer + Public API

**Task**: AZ-273_fdr_client_ringbuf
**Name**: FdrClient Ring Buffer
**Description**: Implement the producer-side `FdrClient(producer_id)` and its lock-free single-producer / single-consumer (SPSC) ring buffer. `enqueue` is non-blocking even when the C13 writer thread is stalled. Capacity is configurable per producer via the cross-cutting Config block. The buffer exposes a hook that the overrun-policy task (next PBI) plugs into; this task does NOT implement the drop-oldest emission itself.
**Complexity**: 5 points
**Dependencies**: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-269_config_loader, AZ-266_log_module
**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
**Tracker**: AZ-273
**Epic**: AZ-247 (E-CC-FDR-CLIENT)

### Document Dependencies

- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`: the record envelope this client enqueues.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md`: the Config object that carries this client's capacity setting.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md`: diagnostic logs emitted by this client (NOT on the steady-state hot path).

## Problem

Every onboard component needs to publish FDR records in real time without blocking on the writer thread, the disk, or any other producer. AC-NEW-3 ("no silent drops") and the steady-state `enqueue` p99 ≤ 5 µs budget rule out:
- Any lock-acquiring queue (Python `queue.Queue`, `threading.Lock`-protected list, asyncio queue).
- Any allocation on the steady-state path (no `dict.copy()`, no `list.append` that may resize, no `dataclasses.replace`).
- Any blocking I/O.

Without a shared, contract-frozen client, every component would re-implement its own queue, drift on overrun semantics, and break the AC-NEW-3 guarantee within weeks of parallel development.

## Outcome

- A single `FdrClient(producer_id)` is the only handle any onboard producer ever holds; it is constructed by the composition root and injected into each component.
- `enqueue` p99 ≤ 5 µs on Tier-2 with no allocation on the steady-state path (pre-sized buffers; reused slots).
- `enqueue` NEVER blocks, regardless of writer-thread state. When the buffer is full, control returns to the caller in O(1); the overrun policy (drop-oldest + emit `kind="overrun"`) is implemented by the next PBI via the buffer's documented hook.
- The dequeue side (`pop_one` / `drain`) is used exclusively by the C13 writer thread; the contract documents it as SPSC. Multi-consumer use is undefined behaviour and is rejected by the contract test.

## Scope

### Included

- `FdrClient(producer_id: str, capacity: int)` constructor + module-level `make_fdr_client(producer_id, config) -> FdrClient` factory that reads capacity from the cross-cutting `config.fdr_client.<producer_id>.capacity` block (with a documented default).
- `FdrClient.enqueue(record: FdrRecord) -> EnqueueResult`: lock-free, non-blocking, allocation-free on the steady-state path. Returns `EnqueueResult.OK` or `EnqueueResult.OVERRUN` (the next PBI consumes `OVERRUN`).
- A documented `on_overrun: Callable[[FdrRecord], None] | None` hook that the overrun-policy PBI populates with the drop-oldest + record-emit closure.
- Single-consumer dequeue API for the C13 writer: `pop_one() -> FdrRecord | None` and `drain(max_records: int) -> list[FdrRecord]`.
- `flush() -> None` test-only method that blocks until the buffer is empty (used by `FakeFdrSink` and contract tests; production callers MUST NOT call this on the hot path).
- Diagnostic INFO log on construction (one-time, NOT on the steady-state hot path) via the shared logger.
- Public interface contract published at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`.
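
The shape of the included API can be illustrated with a minimal pure-Python sketch. It is not the contract and not a claim about the final implementation: it leans on CPython's GIL instead of real atomics, it treats records as opaque objects (the real `FdrRecord` belongs to AZ-272), the string values of `EnqueueResult` are placeholders, and its ever-growing integer counters mean it is not expected to satisfy the allocation-free or 5 µs criteria as written.

```python
from enum import Enum
from typing import Any, Callable, Optional


class EnqueueResult(Enum):
    OK = "ok"            # placeholder values; the contract owns the real enum
    OVERRUN = "overrun"


class FdrClient:
    """Illustrative SPSC ring: one producer thread, one consumer thread."""

    def __init__(self, producer_id: str, capacity: int,
                 on_overrun: Optional[Callable[[Any], None]] = None) -> None:
        if not producer_id:
            raise ValueError("producer_id must be non-empty")
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.producer_id = producer_id
        self.on_overrun = on_overrun
        self._slots: list = [None] * capacity  # pre-sized once; slots are reused
        self._cap = capacity
        self._head = 0  # count of records read; written only by the consumer
        self._tail = 0  # count of records written; written only by the producer

    def enqueue(self, record: Any) -> EnqueueResult:
        tail = self._tail
        if tail - self._head >= self._cap:  # full: return in O(1), never wait
            if self.on_overrun is not None:
                self.on_overrun(record)
            return EnqueueResult.OVERRUN
        self._slots[tail % self._cap] = record
        self._tail = tail + 1  # publish only after the slot has been written
        return EnqueueResult.OK

    def pop_one(self) -> Optional[Any]:
        head = self._head
        if head == self._tail:  # empty
            return None
        idx = head % self._cap
        record = self._slots[idx]
        self._slots[idx] = None  # drop the reference so the slot can be reused
        self._head = head + 1  # free the slot only after it has been read
        return record

    def drain(self, max_records: int) -> list:
        out = []
        while len(out) < max_records:
            record = self.pop_one()
            if record is None:
                break
            out.append(record)
        return out
```

Each index has exactly one writer (`_tail` for the producer, `_head` for the consumer), which is what lets a ring like this avoid a lock under the SPSC assumption; a second producer or consumer would break that reasoning.
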

### Excluded

- The drop-oldest behaviour and the `kind="overrun"` record emission: owned by the next PBI in this epic.
- The C13 writer thread itself, segment files, segment rotation, 64 GB cap: owned by E-C13 (AZ-248).
- The `FakeFdrSink` for tests: owned by the fourth PBI in this epic.
- Multi-producer / multi-consumer ring buffer: out of scope; the contract is SPSC.
- The actual `FdrRecord` schema and serialiser: owned by AZ-272.

## Acceptance Criteria

**AC-1: Lock-free, never blocks**
Given an FdrClient with capacity 1024 and a writer thread that is stalled (does not dequeue)
When the producer calls `enqueue(record)` 1025 times in rapid succession
Then every call returns within 50 µs (no thread state ever transitions to BLOCKED), and the 1025th call returns `EnqueueResult.OVERRUN`

**AC-2: Allocation-free steady-state**
Given an FdrClient warmed up with one prior `enqueue`
When the producer calls `enqueue(record)` while a slot is free
Then the call performs zero heap allocations (verified via `tracemalloc` snapshot diff: 0 new memory blocks on the hot path)
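
One way the AC-2 check could be expressed with the standard library is sketched below. The helper name `new_blocks_during` is hypothetical, and filtering on the file that defines the callable is an assumption; a real test would more likely filter on the module that holds `enqueue` so that the test harness's own allocations are not counted.

```python
import tracemalloc
from typing import Callable


def new_blocks_during(fn: Callable[[], None]) -> int:
    """Count blocks allocated from fn's source file that are still alive after fn()."""
    # Only keep traces whose allocating frame lives in the file that defines fn.
    only_fn_file = [tracemalloc.Filter(True, fn.__code__.co_filename)]
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot().filter_traces(only_fn_file)
        fn()
        after = tracemalloc.take_snapshot().filter_traces(only_fn_file)
    finally:
        if started_here:
            tracemalloc.stop()
    # compare_to() yields one StatisticDiff per source line; sum the growth only.
    diffs = after.compare_to(before, "lineno")
    return sum(d.count_diff for d in diffs if d.count_diff > 0)
```

A test in this style would call the helper around a single warmed-up `enqueue` and assert that the result is 0. Note that the snapshot diff only sees blocks that are still alive at the second snapshot, so short-lived temporaries would need a different measurement.
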

**AC-3: Capacity is config-driven**
Given the cross-cutting Config block sets `config.fdr_client.<producer_id>.capacity = 4096`
When `make_fdr_client(producer_id, config)` runs
Then the returned client's internal buffer length is 4096 (verified via the test-only `_capacity()` introspection method)

**AC-4: SPSC dequeue contract**
Given two threads concurrently call `pop_one()`
When both calls race
Then the contract test detects the violation: it wraps `pop_one` in a guard that raises `FdrSpscViolationError` on concurrent entry (the guard is opt-in for tests, but it documents the SPSC invariant)
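
A guard of the kind AC-4 describes might look like the sketch below. The decorator name `spsc_guard` and the choice of `RuntimeError` as the base class are assumptions; only the exception name `FdrSpscViolationError` comes from this spec. The guard uses a lock, which is acceptable here only because it is opt-in test tooling and the acquire is non-blocking.

```python
import threading
from functools import wraps


class FdrSpscViolationError(RuntimeError):
    """Raised when a second thread enters a single-consumer method."""


def spsc_guard(fn):
    """Test-only wrapper: fail fast instead of waiting if fn is already running."""
    busy = threading.Lock()

    @wraps(fn)
    def guarded(*args, **kwargs):
        # Non-blocking try-acquire: a failure means another call is in progress.
        if not busy.acquire(blocking=False):
            raise FdrSpscViolationError(f"concurrent entry into {fn.__name__}")
        try:
            return fn(*args, **kwargs)
        finally:
            busy.release()

    return guarded
```

This detects overlapping calls only. Two threads that call `pop_one` strictly one after the other would pass unnoticed, so a stricter guard could also record the identity of the first calling thread.
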

**AC-5: Overrun hook is wired**
Given an `FdrClient` with `on_overrun` set to a recording closure
When the buffer fills and the next `enqueue` would overrun
Then `on_overrun` is invoked exactly once per overrun event with the would-be-enqueued record (the closure decides what to do: drop-oldest + emit, log only, etc.; this PBI does NOT define that behaviour)

**AC-6: flush() drains buffer**
Given an FdrClient with N records buffered and a consumer thread draining
When the test calls `flush()`
Then `flush()` returns only after `pop_one()` has been called N times (no records left in the buffer)

**AC-7: producer_id is non-empty and stamped on every record**
Given a constructor call `FdrClient(producer_id="")` (empty string)
When construction runs
Then `ValueError` is raised; anonymous producers are forbidden

## Non-Functional Requirements

**Performance**
- `enqueue` p99 ≤ 5 µs on Tier-2 (Jetson Orin Nano Super) for a record carrying a `payload` dict of ≤ 16 scalar entries. Validated by a microbenchmark (10k iterations, warm cache).
- `pop_one` p99 ≤ 10 µs on Tier-2 under steady-state.
- Memory: per-producer ring buffer ≤ `capacity * sizeof(slot)` bytes; no unbounded growth. Pre-sized at construction.
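
The p99 figures above could be measured with a microbenchmark along these lines. This is a sketch under assumptions: the helper names `p99_ns` and `bench` are hypothetical, the nearest-rank percentile definition is one reasonable choice among several, and each sample includes the overhead of the two clock reads.

```python
import time
from typing import Callable, List


def p99_ns(samples: List[int]) -> int:
    """Nearest-rank 99th percentile of a list of durations."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def bench(op: Callable[[], object], iterations: int = 10_000,
          warmup: int = 1_000) -> int:
    """Return the p99 latency of op() in nanoseconds after a warm-up phase."""
    for _ in range(warmup):
        op()
    samples = [0] * iterations  # pre-sized so list growth does not skew timing
    clock = time.perf_counter_ns
    for i in range(iterations):
        start = clock()
        op()
        samples[i] = clock() - start
    return p99_ns(samples)
```

With a client in hand, a check in this style would be `bench(lambda: client.enqueue(record)) <= 5_000`, run on the Tier-2 target and with a consumer draining the buffer so that the measurement covers the steady-state path and not the overrun path.
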

**Reliability**
- `enqueue` never raises into the caller. Schema violations from `FdrRecord` are caught and forwarded to the same `on_overrun` hook with a synthetic flag (the overrun-policy PBI decides what to do); the producer's hot path stays clean.
- Multiple `make_fdr_client(producer_id, config)` calls with the same `producer_id` return the same cached instance: there is exactly one FdrClient per producer_id per process.

**Concurrency**
- SPSC: ONE producer thread MAY call `enqueue`, ONE consumer thread MAY call `pop_one` / `drain`. Multi-producer or multi-consumer use is undefined behaviour and detected by the contract guard (AC-4).

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Stalled consumer + 1025 enqueues into a 1024-capacity client | Every call returns within 50 µs; #1025 returns `OVERRUN` |
| AC-2 | `tracemalloc` snapshot diff across one `enqueue` after warmup | Zero new objects allocated |
| AC-3 | `make_fdr_client("c1_vio", config_with_capacity_4096)` | `client._capacity() == 4096` |
| AC-4 | Two threads call `pop_one()` concurrently with the SPSC guard enabled | `FdrSpscViolationError` raised |
| AC-5 | Wire a recording `on_overrun`; force overrun | Closure invoked exactly once with the offending record |
| AC-6 | Enqueue N records, start a draining consumer, call `flush()` | `flush()` returns only after buffer is empty |
| AC-7 | `FdrClient(producer_id="")` | `ValueError` |
| NFR-perf | Microbench `enqueue` over 10k iterations on Tier-2 | p99 ≤ 5 µs |
| NFR-perf-pop | Microbench `pop_one` over 10k iterations | p99 ≤ 10 µs |
| NFR-reliability | Two `make_fdr_client("c1_vio", config)` calls | same instance returned |

## Constraints

- Public surface frozen by `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` v1.0.0.
- SPSC only: multi-producer / multi-consumer is out of scope and the contract test asserts the SPSC guard exists.
- The lock-free implementation MAY use `multiprocessing.shared_memory`, `cffi`-backed atomics, a Cython extension, or pure Python with `array.array` + a single CAS-like primitive. The implementation choice is internal to this PBI but MUST satisfy the allocation-free + non-blocking ACs above. Prefer the simplest working option that hits the budget; document the choice in the implementation report.
- No new dependency beyond what AZ-263 / E-BOOT pinned.

## Risks & Mitigation

**Risk 1: Pure-Python SPSC ring cannot hit the 5 µs p99 budget on Tier-2**
- *Risk*: CPython's GIL + dict operations push p99 above 5 µs on the Jetson.
- *Mitigation*: Bench against a `cffi` or Cython-backed SPSC ring as a fallback; the contract is library-agnostic so the implementation can swap without breaking consumers. Decision is taken inside this PBI's implementation phase with the microbench as the oracle.

**Risk 2: Overrun hook called with record that holds a reference to caller-mutable state**
- *Risk*: Producer mutates `record.payload` after `enqueue`; the overrun closure sees the mutated value.
- *Mitigation*: `FdrRecord` is `@frozen` (per AZ-272 contract); the contract test verifies a producer cannot legally mutate a constructed record. Documented in the contract `Invariants`.
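
The mitigation for Risk 2 depends on what "frozen" covers. The sketch below uses a standard-library frozen dataclass as a stand-in; the field names are hypothetical and the real `FdrRecord` is defined by AZ-272, which may use a different mechanism. It shows that freezing the record blocks attribute rebinding, while the payload mapping needs its own read-only wrapper if in-place edits are also to be prevented.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class FdrRecord:
    """Hypothetical stand-in; the real schema is owned by AZ-272."""

    kind: str
    payload: Mapping[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Copy, then wrap read-only, so later edits to the caller's dict are not
        # visible here. object.__setattr__ is needed because the class is frozen.
        object.__setattr__(self, "payload", MappingProxyType(dict(self.payload)))


source = {"x": 1}
record = FdrRecord(kind="pose", payload=source)
source["x"] = 2  # the caller mutates its own dict after construction

try:
    record.kind = "other"  # attribute rebinding is blocked by frozen=True
except FrozenInstanceError:
    pass

try:
    record.payload["x"] = 3  # in-place edits are blocked by the read-only view
except TypeError:
    pass
```

The copy in `__post_init__` happens at record construction, not inside `enqueue`, and it is shallow: nested mutable values would still be shared with the caller.
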

**Risk 3: Cached FdrClient leaks across test cases**
- *Risk*: A pytest test mutates the module-level cache; subsequent tests get a stale FdrClient.
- *Mitigation*: A `_reset_for_tests()` private function (documented as test-only in the contract `Non-Goals`) clears the cache; integration test fixture calls it on teardown.

## Runtime Completeness

- **Named capability**: lock-free SPSC ring buffer + `FdrClient` public API (architecture / E-CC-FDR-CLIENT / AC-NEW-3, NFR `enqueue` p99 ≤ 5 µs).
- **Production code that must exist**: real lock-free SPSC primitive (no Python `queue.Queue`, no lock-acquiring fallback); real allocation-free hot path; real `on_overrun` hook plumbing.
- **Allowed external stubs**: none; the queue is the production runtime capability.
- **Unacceptable substitutes**: `queue.Queue`, `threading.Lock`-guarded list, `collections.deque` with a lock, "for now we just `time.sleep(0)` on overrun", or any implementation that allocates on the steady-state path. These would all silently break AC-NEW-3 the moment the writer thread stalls for >100 ms.

## Contract

This task produces the contract at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`.
Consumers MUST read that file, not this task spec, to discover the interface.