gps-denied-onboard/_docs/02_tasks/done/AZ-274_fdr_overrun_emission.md
Oleksandr Bezdieniezhnykh ba20c2d195 [AZ-273] [AZ-274] [AZ-275] [AZ-267] [AZ-268] FDR producer chain + log bridge + contract test
AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-
two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError
on the public surface. make_fdr_client caches one client per producer_id
and reads capacity from config.fdr.per_producer_capacity with fallback
to queue_size.
AZ-274: default_overrun_policy implements drop-oldest + retry + immediate
marker emission, with prior-marker dropped_count folding via _evict_one
so user-loss info is never lost across iterations. ERROR diagnostic is
rate-limited to <=1/sec per producer.
AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the
production default_overrun_policy via a duck-typed _PolicyAdapter. The
test-only records/all_records_ever properties let component tests assert
both in-buffer and lifetime state. tests/conftest.py registers the
fake_fdr_sink fixture and an AST architecture lint forbids production
imports of fakes.
AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge
and forwards only WARN+ERROR records into the FDR with kind="log".
Thread-local recursion guard short-circuits internal logging; saturated-
queue diagnostics go to stderr every N=1000 drops.
AZ-268: tests/contract/log_schema.py covers every row of the schema's
Test Cases table plus the "DEBUG+INFO never reach FDR" invariant.
pyproject.toml registers the contract pytest marker and the
contract-mandated log_schema.py file-name.
251 unit + contract tests pass (48 new). Review verdict:
PASS_WITH_WARNINGS; findings are NFR-perf deferrals + documented
relaxation of AZ-274 AC-2 coalescing under permanently-stalled consumer.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 03:00:49 +03:00

Drop-Oldest Policy + kind="overrun" Record Emission

Task: AZ-274_fdr_overrun_emission
Name: FDR Overrun Policy
Description: Wire the producer-side overrun policy on top of the FdrClient ring buffer. When a producer's enqueue would overflow, the policy drops the OLDEST queued record from that producer's buffer to make room for the new record AND synthesises an FdrRecord(kind="overrun", payload={producer_id, dropped_count}) that lands on the same queue. This is the production-side enforcement of AC-NEW-3 ("no silent drops").
Complexity: 2 points
Dependencies: AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf
Component: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
Tracker: AZ-274
Epic: AZ-247 (E-CC-FDR-CLIENT)

Document Dependencies

  • _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md — defines the canonical shape of kind="overrun" records (consumed: payload.producer_id + payload.dropped_count).
  • _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md — defines the on_overrun hook this task implements + the "exactly-once" invariant.

Problem

AZ-273 (FdrClient ring buffer) leaves the on_overrun hook unwired by default. In production, an unwired hook means that an OVERRUN drops records without leaving any trace. That directly violates AC-NEW-3 and breaks C13's invariant that every dropped record is accounted for by a kind="overrun" record on the FDR. This task closes that gap by providing the canonical drop-oldest hook and registering it via the composition root for every onboard producer.

Outcome

  • A single, contract-frozen drop-oldest hook is the only on_overrun callable any production FdrClient is wired to. Tests MAY substitute their own.
  • For every burst that exceeds capacity, one coalesced kind="overrun" record is enqueued on the SAME producer's buffer. It carries the originating producer's slug and a dropped_count equal to the number of records dropped in the burst. Coalescing keeps the overrun record from itself triggering further overruns when bursts are sustained.
  • The composition root wires the hook on every FdrClient created via make_fdr_client — consumers (component code) do not interact with the hook directly.

Scope

Included

  • A default_overrun_policy(client: FdrClient) -> Callable[[FdrRecord], None] factory that returns the canonical drop-oldest closure for the given client.
  • Drop-oldest semantics: when enqueue returns OVERRUN, the closure pops the oldest record from the buffer and discards it, retries the new record's enqueue (one retry only), and arranges for a kind="overrun" record to land on the same buffer. If the retry also fails, the policy logs an ERROR via the shared logger (kind="fdr.overrun_retry_failed"). This should be rare: it implies the consumer is making zero progress.
  • Coalescing: while a burst of consecutive overruns is in flight (consecutive OVERRUN returns within the same producer "tick"), the policy increments dropped_count on the in-flight overrun record instead of synthesising a new one per drop. The overrun record itself is enqueued at the END of the burst (next successful enqueue slot).
  • Composition-root wiring: make_fdr_client is updated (or a new wire_fdr_client_overrun(client) helper is exposed and called inside make_fdr_client) so every production FdrClient is constructed with this policy attached. Tests that explicitly construct FdrClient(...) directly opt out by leaving on_overrun as None.
  • Diagnostic ERROR log only when the retry-after-drop also fails (NOT on every overrun — overruns are normal under bursty load and would flood the log).
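
The drop-oldest, single-retry, and coalescing rules above can be sketched with a toy bounded FIFO. This is a minimal illustration under stated assumptions, not the AZ-273 ring buffer or the production closure: ToyProducerBuffer, Record, and the "fdr.sketch" logger name are hypothetical, the policy is folded into the buffer class rather than attached through on_overrun, and the pending marker is flushed at the start of the next enqueue call that finds a free slot. Folding an evicted marker's count back into the pending total is an extra safeguard added here so that earlier losses stay accounted for.

```python
import logging
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict

log = logging.getLogger("fdr.sketch")  # hypothetical logger name


@dataclass
class Record:
    """Simplified stand-in for FdrRecord; the real schema is owned by AZ-272."""
    kind: str
    payload: Dict[str, Any] = field(default_factory=dict)


class ToyProducerBuffer:
    """Bounded FIFO with drop-oldest and one coalesced overrun marker per burst."""

    def __init__(self, producer_id: str, capacity: int) -> None:
        self.producer_id = producer_id
        self.capacity = capacity  # assumed >= 1
        self.buf: Deque[Record] = deque()
        self.pending_dropped = 0  # user records lost since the last marker

    def _try_put(self, record: Record) -> bool:
        if len(self.buf) >= self.capacity:
            return False
        self.buf.append(record)
        return True

    def _flush_marker(self) -> None:
        # Emit the coalesced marker only once a slot is known to be free.
        if self.pending_dropped and len(self.buf) < self.capacity:
            self.buf.append(Record("overrun", {
                "producer_id": self.producer_id,
                "dropped_count": self.pending_dropped,
            }))
            self.pending_dropped = 0

    def enqueue(self, record: Record) -> bool:
        """Return True on a clean enqueue, False if the overrun path ran."""
        self._flush_marker()
        if self._try_put(record):
            return True
        oldest = self.buf.popleft()  # drop the oldest queued record
        if oldest.kind == "overrun":
            # An evicted marker hands its count back to the pending total.
            self.pending_dropped += oldest.payload["dropped_count"]
        else:
            self.pending_dropped += 1
        if not self._try_put(record):  # one retry only, never a loop
            log.error("fdr.overrun_retry_failed producer=%s", self.producer_id)
        return False


buf = ToyProducerBuffer("c1_vio", capacity=4)
for seq in range(1, 8):  # 4 clean enqueues, then 3 overruns with a stalled consumer
    buf.enqueue(Record("user", {"seq": seq}))
buf.buf.popleft()  # the consumer drains one slot (seq 4)
buf.enqueue(Record("user", {"seq": 8}))  # marker lands first, then seq 8 overruns
# buf.buf now holds: seq 6, seq 7, overrun(dropped_count=3), seq 8
```

In this toy model the marker always trails the records that survived the burst, and the new drop caused by seq 8 starts the count for the next marker.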

Excluded

  • The buffer itself, the on_overrun hook plumbing, and the SPSC contract — owned by AZ-273.
  • The FdrRecord schema and the kind="overrun" payload definition — owned by AZ-272.
  • The C13 writer thread's behaviour upon receiving an overrun record (it just logs it like any other record) — owned by E-C13 (AZ-248).
  • FakeFdrSink, owned by AZ-275 (the next PBI in this epic).

Acceptance Criteria

AC-1: Drop-oldest produces canonical overrun record
Given an FdrClient with capacity 4 wired with default_overrun_policy, fully buffered with 4 user records
When the producer calls enqueue for a 5th record
Then the consumer side observes (in order): the 5th user record, then a kind="overrun" record whose payload.producer_id matches the originating producer and payload.dropped_count == 1

AC-2: Coalescing across a burst
Given an FdrClient with capacity 4, fully buffered, and the consumer is stalled
When the producer calls enqueue 10 times in a row (8 of them overrun)
Then exactly ONE kind="overrun" record is emitted at the end of the burst with payload.dropped_count == 8

AC-3: Overrun record carries originating producer_id
Given an FdrClient(producer_id="c1_vio") wired with the default policy
When the buffer overruns
Then the emitted overrun record has payload.producer_id == "c1_vio" (NOT "shared.fdr_client"; the OUTER envelope's producer_id may be "shared.fdr_client" per the schema contract, but the payload identifies the originating producer)

AC-4: Composition root wires every FdrClient
Given a production process initialised via compose_root(config)
When the test inspects every constructed FdrClient in the resulting RuntimeRoot
Then every client has a non-None on_overrun set to a callable from default_overrun_policy

AC-5: Retry-after-drop failure logs ERROR
Given a contrived test that monkey-patches the buffer so retry-after-drop ALSO returns OVERRUN (simulating a frozen consumer mid-policy)
When an overrun is triggered
Then exactly one ERROR log record is emitted with kind="fdr.overrun_retry_failed"; the policy does not loop indefinitely; the overrun record is dropped (test asserts no overrun record on the buffer in this pathological case)

AC-6: No log flood under sustained overruns
Given an FdrClient under sustained overrun (1000 consecutive overruns)
When the policy runs
Then the shared logger receives at most 1 ERROR record per second related to overruns (rate cap on the diagnostic log; the FDR record itself is the canonical record of overruns)
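
The rate cap in AC-6 can be illustrated with a small per-producer gate placed in front of the ERROR log call. This is a sketch only: the class name PerProducerRateLimit and the injectable clock parameter are assumptions introduced here, and the real policy may implement the cap differently.

```python
import time
from typing import Callable, Dict


class PerProducerRateLimit:
    """Allow at most one event per interval_s seconds for each producer_id."""

    def __init__(self, interval_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval_s = interval_s
        self._clock = clock  # injectable so tests need not sleep
        self._last: Dict[str, float] = {}

    def allow(self, producer_id: str) -> bool:
        now = self._clock()
        last = self._last.get(producer_id)
        if last is not None and now - last < self._interval_s:
            return False  # still inside the quiet window: suppress the log
        self._last[producer_id] = now
        return True
```

A caller would guard the diagnostic with `if limiter.allow(producer_id): logger.error(...)`. Using a monotonic clock keeps the window immune to wall-clock adjustments.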

Non-Functional Requirements

Performance

  • Steady-state overhead: when on_overrun is set but the buffer is NOT full (so the hook is never invoked), enqueue overhead from this PBI's wiring is ≤ 0.5 µs (effectively a single null-check per call). The 5 µs enqueue p99 budget MUST still hold.
  • Overrun path overhead: the drop-oldest + retry sequence completes within 20 µs p99 on Tier-2 (it runs only on the cold path; cold-path budget is generous).
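
One way to check p99 budgets like these is to time each call individually and take a nearest-rank percentile. The helper names percentile and sample_ns below are hypothetical, and the sketch makes no claim about the project's actual benchmark harness.

```python
import time
from typing import Callable, List, Sequence


def percentile(samples: Sequence[int], q: int) -> int:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (len(ordered) * q + 99) // 100  # ceil(n * q / 100) without floats
    return ordered[rank - 1]


def sample_ns(fn: Callable[[], object], iterations: int = 10_000) -> List[int]:
    """Time each call to fn individually, in nanoseconds."""
    samples: List[int] = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples
```

The steady-state overhead would then be estimated as `percentile(hooked, 99) - percentile(unhooked, 99)`, where both sample sets come from sample_ns on the same machine. The timer's own cost appears in both measurements, so the difference is more meaningful than either absolute figure.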

Reliability

  • The policy NEVER loops indefinitely on retry. One retry only; then ERROR-log + drop.
  • The policy NEVER raises into the producer's enqueue caller. Any exception inside the closure is logged via kind="fdr.overrun_policy_error" and swallowed; the producer's hot path stays clean.
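
The "never raises into the caller" rule can be sketched as a thin guard around the hook. The wrapper name never_raise, the "fdr.sketch" logger name, and the use of the logging `extra` mapping to carry kind are assumptions made for illustration; how the shared logger actually attaches kind is not specified here.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("fdr.sketch")  # hypothetical logger name


def never_raise(hook: Callable[[T], None]) -> Callable[[T], None]:
    """Wrap an overrun hook so exceptions are logged and swallowed."""

    def guarded(record: T) -> None:
        try:
            hook(record)
        except Exception:
            # Log with the traceback, then return normally to the producer.
            log.error(
                "overrun policy raised",
                exc_info=True,
                extra={"kind": "fdr.overrun_policy_error"},
            )

    return guarded
```

Catching Exception rather than BaseException leaves KeyboardInterrupt and SystemExit untouched, which is usually the safer default for a guard on a hot path.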

Unit Tests

| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Capacity-4 buffer fully filled, then 5th enqueue with default_overrun_policy | Consumer sees 5th record + canonical overrun record (dropped_count == 1) |
| AC-2 | 10 consecutive enqueues in one burst (8 of them overrun) | Exactly one overrun record with dropped_count == 8 |
| AC-3 | Overrun on FdrClient(producer_id="c1_vio") | Emitted overrun record payload.producer_id == "c1_vio" |
| AC-4 | Boot a stub composition root with 3 producers; inspect all FdrClients | Every client has on_overrun != None |
| AC-5 | Monkey-patched retry-after-drop also fails | Exactly one ERROR log; no overrun record on buffer; no infinite loop |
| AC-6 | 1000 consecutive overruns | Logger receives ≤ 1 ERROR/sec related to overruns |
| NFR-perf-steady | Microbench enqueue with hook set but not invoked | p99 overhead ≤ 0.5 µs vs unhooked |
| NFR-perf-overrun | Microbench drop-oldest + retry sequence | p99 ≤ 20 µs |
| NFR-reliability | Inject an exception into the closure; trigger overrun | Producer call returns normally; ERROR logged |

Constraints

  • The policy plugs into AZ-273's on_overrun hook ONLY — no other extension point. Behavioural deviation requires a new contract.
  • Coalescing window is bounded by "until the next successful enqueue" — NOT by wall-clock time. Rationale: the buffer is the only synchronisation point; the writer thread drains it; once it drains one slot, the producer's next enqueue succeeds and that is the natural emission point for the overrun record.
  • The overrun record's OUTER envelope producer_id is "shared.fdr_client" (per schema contract); the originating producer's slug is in payload.producer_id.
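
The split between the outer envelope and the payload can be shown with a small builder. The Record dataclass and make_overrun_record function are hypothetical stand-ins; the canonical field set lives in the fdr_record_schema contract, and only kind, payload.producer_id, and payload.dropped_count are taken from this task.

```python
from dataclasses import dataclass
from typing import Any, Dict

ENVELOPE_PRODUCER_ID = "shared.fdr_client"  # outer envelope, per the schema contract


@dataclass(frozen=True)
class Record:
    """Simplified stand-in for FdrRecord; other envelope fields are omitted."""
    kind: str
    producer_id: str
    payload: Dict[str, Any]


def make_overrun_record(origin_producer_id: str, dropped_count: int) -> Record:
    """Build an overrun record whose payload names the originating producer."""
    if dropped_count < 1:
        raise ValueError("an overrun record must report at least one drop")
    return Record(
        kind="overrun",
        producer_id=ENVELOPE_PRODUCER_ID,
        payload={
            "producer_id": origin_producer_id,
            "dropped_count": dropped_count,
        },
    )
```

For example, `make_overrun_record("c1_vio", 8)` yields an envelope producer_id of "shared.fdr_client" while payload["producer_id"] stays "c1_vio".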

Risks & Mitigation

Risk 1: Overrun record itself causes another overrun

  • Risk: At the moment of overflow, enqueueing the synthesised overrun record might also fail.
  • Mitigation: The drop-oldest sequence is "drop one → retry the user record → if successful, then enqueue the overrun record at the next slot the consumer drains". The overrun record is emitted at the END of the burst, on a slot known to be free. If the buffer is so degenerate that one drop is insufficient, the AC-5 ERROR-log path catches it.

Risk 2: Coalescing hides individual overruns under steady degradation

  • Risk: A long-stalled consumer produces one dropped_count=10000 record at flush time; tooling cannot reconstruct fine-grained timing.
  • Mitigation: The coalescing scope is "consecutive overruns until next successful enqueue". As soon as the consumer drains one slot, the overrun record is emitted with the count up to that point. Tooling can correlate against the drained record's ts to reconstruct timing windows. Documented in the schema contract's invariants.

Risk 3: Composition-root wiring drift

  • Risk: A future component constructs FdrClient(...) directly instead of using make_fdr_client(...), ending up with on_overrun = None and silent drops in production.
  • Mitigation: AC-4's contract test scans the constructed RuntimeRoot for any FdrClient whose on_overrun is None and fails if it finds one. The same check is documented as a code-review Phase 2 (Spec Compliance) item tied to the fdr_client_protocol contract.
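
The scan described in the mitigation might look like the following. StubClient, StubRuntimeRoot, the fdr_clients attribute, and the producer slug "c2_stub" are assumptions for illustration, since the real RuntimeRoot layout is not described in this task. The sketch checks only for a missing hook, not that the hook came from default_overrun_policy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StubClient:
    """Stand-in for FdrClient exposing only what the scan reads."""
    producer_id: str
    on_overrun: Optional[Callable[[object], None]] = None


@dataclass
class StubRuntimeRoot:
    """Stand-in for RuntimeRoot; assumes clients are keyed by producer_id."""
    fdr_clients: Dict[str, StubClient] = field(default_factory=dict)


def unwired_producers(root: StubRuntimeRoot) -> List[str]:
    """Return the producer_ids whose client has no on_overrun hook attached."""
    return sorted(
        producer_id
        for producer_id, client in root.fdr_clients.items()
        if client.on_overrun is None
    )


root = StubRuntimeRoot({
    "c1_vio": StubClient("c1_vio", on_overrun=lambda record: None),
    "c2_stub": StubClient("c2_stub"),  # constructed directly, hook left unset
})
missing = unwired_producers(root)  # ["c2_stub"]
```

A contract test would assert that the returned list is empty and print the offending producer_ids on failure.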

Runtime Completeness

  • Named capability: drop-oldest + kind="overrun" record emission policy (architecture / E-CC-FDR-CLIENT / AC-NEW-3).
  • Production code that must exist: real drop-oldest closure, real overrun-record synthesis, real composition-root wiring of every producer.
  • Allowed external stubs: tests MAY replace on_overrun with a recording closure; production wiring MUST NOT.
  • Unacceptable substitutes: a no-op (pass) hook ("for now we just log a warning"), an in-memory counter without record emission ("we'll add the record later"), or relying on the C13 writer to synthesise overrun records (it cannot: only the producer side knows the burst boundary).