[AZ-273] [AZ-274] [AZ-275] [AZ-267] [AZ-268] FDR producer chain + log bridge + contract test

AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-
two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError
on the public surface. make_fdr_client caches one client per producer_id
and reads capacity from config.fdr.per_producer_capacity with fallback
to queue_size.
AZ-274: default_overrun_policy implements drop-oldest + retry + immediate
marker emission, with prior-marker dropped_count folding via _evict_one
so user-loss info is never lost across iterations. ERROR diagnostic is
rate-limited to <=1/sec per producer.
AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the
production default_overrun_policy via a duck-typed _PolicyAdapter. The
test-only records/all_records_ever properties let component tests assert
both in-buffer and lifetime state. tests/conftest.py registers the
fake_fdr_sink fixture and an AST architecture lint forbids production
imports of fakes.
AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge
and forwards only WARN+ERROR records into the FDR with kind="log".
Thread-local recursion guard short-circuits internal logging; saturated-
queue diagnostics go to stderr every N=1000 drops.
AZ-268: tests/contract/log_schema.py covers every row of the schema's
Test Cases table plus the "DEBUG+INFO never reach FDR" invariant.
pyproject.toml registers the contract pytest marker and the
contract-mandated log_schema.py file-name.
251 unit + contract tests pass (48 new). Review verdict:
PASS_WITH_WARNINGS; findings are NFR-perf deferrals + documented
relaxation of AZ-274 AC-2 coalescing under permanently-stalled consumer.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-11 03:00:49 +03:00
parent 3acc7f33dd
commit ba20c2d195
24 changed files with 2714 additions and 20 deletions
@@ -0,0 +1,125 @@
# Drop-Oldest Policy + `kind="overrun"` Record Emission
**Task**: AZ-274_fdr_overrun_emission
**Name**: FDR Overrun Policy
**Description**: Wire the producer-side overrun policy on top of the FdrClient ring buffer. When a producer's enqueue would overflow, the policy drops the OLDEST queued record from that producer's buffer to make room for the new record AND synthesises a `FdrRecord(kind="overrun", payload={producer_id, dropped_count})` that lands on the same queue. This is the production-side enforcement of AC-NEW-3 ("no silent drops").
**Complexity**: 2 points
**Dependencies**: AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf
**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
**Tracker**: AZ-274
**Epic**: AZ-247 (E-CC-FDR-CLIENT)
### Document Dependencies
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines the canonical shape of `kind="overrun"` records (consumed: `payload.producer_id` + `payload.dropped_count`).
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — defines the `on_overrun` hook this task implements + the "exactly-once" invariant.
## Problem
AZ-273 (FdrClient ring buffer) leaves the `on_overrun` hook unwired by default. In production, an unwired hook means records are dropped silently whenever `enqueue` returns `OVERRUN`, which directly violates AC-NEW-3 and breaks C13's invariant that every dropped record is recoverable from a `kind="overrun"` record on the FDR. This task closes that gap by providing the canonical drop-oldest hook and registering it via the composition root for every onboard producer.
## Outcome
- A single, contract-frozen drop-oldest hook is the only `on_overrun` callable any production FdrClient is wired to. Tests MAY substitute their own.
- For every burst that exceeds capacity, one coalesced `kind="overrun"` record is enqueued on the SAME producer's buffer. It carries the originating producer's slug and a `dropped_count` equal to the number of records dropped in the burst. Coalescing keeps the overrun record from itself triggering further overruns when bursts are sustained.
- The composition root wires the hook on every FdrClient created via `make_fdr_client` — consumers (component code) do not interact with the hook directly.
## Scope
### Included
- A `default_overrun_policy(client: FdrClient) -> Callable[[FdrRecord], None]` factory that returns the canonical drop-oldest closure for the given client.
- Drop-oldest semantics: when `enqueue` returns `OVERRUN`, the closure pops one record from the buffer's tail (oldest), discards it, retries the new record's enqueue (one retry only), and arranges for a `kind="overrun"` record to land on the same buffer. If the retry also fails, the policy logs an ERROR via the shared logger (`kind="fdr.overrun_retry_failed"`) — this is rare; it implies the consumer is making zero progress.
- Coalescing: while a burst of consecutive overruns is in flight (consecutive `OVERRUN` returns within the same producer "tick"), the policy increments `dropped_count` on the in-flight overrun record instead of synthesising a new one per drop. The overrun record itself is enqueued at the END of the burst (next successful `enqueue` slot).
- Composition-root wiring: `make_fdr_client` is updated (or a new `wire_fdr_client_overrun(client)` helper is exposed and called inside `make_fdr_client`) so every production FdrClient is constructed with this policy attached. Tests that explicitly construct `FdrClient(...)` directly opt out by leaving `on_overrun` as `None`.
- Diagnostic ERROR log only when the retry-after-drop also fails (NOT on every overrun — overruns are normal under bursty load and would flood the log).
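
The drop-oldest, single-retry, and coalescing rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production code: `ToyClient` is a hypothetical `deque`-backed stand-in for the lock-free ring buffer, `DropOldestPolicy` and `flush_pending()` are invented names, and the way the real client signals "next successful enqueue" to the policy is not specified here.

```python
import logging
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("shared.fdr_client")


@dataclass
class FdrRecord:
    kind: str
    payload: dict = field(default_factory=dict)


class DropOldestPolicy:
    """Callable on_overrun hook plus a hypothetical flush_pending() helper."""

    def __init__(self, client: "ToyClient") -> None:
        self._client = client
        self.pending_dropped = 0  # drops not yet reported on the FDR

    def __call__(self, record: FdrRecord) -> None:
        self._client.slots.popleft()  # discard the oldest queued record
        self.pending_dropped += 1     # coalesce instead of emitting per drop
        if not self._client.try_push(record):  # one retry only, never a loop
            logger.error("retry after drop failed",
                         extra={"kind": "fdr.overrun_retry_failed"})

    def flush_pending(self) -> None:
        """Emit one coalesced overrun record once a slot is known to be free."""
        if self.pending_dropped == 0 or self._client.is_full():
            return
        marker = FdrRecord("overrun", {
            "producer_id": self._client.producer_id,
            "dropped_count": self.pending_dropped,
        })
        self._client.try_push(marker)
        self.pending_dropped = 0


class ToyClient:
    """Hypothetical stand-in for FdrClient (assumption: deque, not SPSC ring)."""

    def __init__(self, producer_id: str, capacity: int) -> None:
        self.producer_id = producer_id
        self.capacity = capacity
        self.slots: deque = deque()
        self.on_overrun: Optional[DropOldestPolicy] = None

    def is_full(self) -> bool:
        return len(self.slots) >= self.capacity

    def try_push(self, record: FdrRecord) -> bool:
        if self.is_full():
            return False
        self.slots.append(record)
        return True

    def enqueue(self, record: FdrRecord) -> None:
        if self.try_push(record):
            if self.on_overrun is not None:
                self.on_overrun.flush_pending()  # end of burst, if any
        elif self.on_overrun is not None:
            self.on_overrun(record)


def default_overrun_policy(client: ToyClient) -> DropOldestPolicy:
    return DropOldestPolicy(client)


def demo_burst() -> list:
    client = ToyClient("c1_vio", capacity=4)
    client.on_overrun = default_overrun_policy(client)
    for seq in range(10):  # 4 fit; each of the next 6 evicts the oldest
        client.enqueue(FdrRecord("user", {"seq": seq}))
    client.slots.popleft()  # the consumer drains two slots
    client.slots.popleft()
    client.enqueue(FdrRecord("user", {"seq": 10}))  # succeeds, ends the burst
    return list(client.slots)
```

In `demo_burst()` the six drops are reported by a single overrun record that lands directly after the first user record enqueued once the consumer has made progress.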
### Excluded
- The buffer itself, the `on_overrun` hook plumbing, and the SPSC contract — owned by AZ-273.
- The `FdrRecord` schema and the `kind="overrun"` payload definition — owned by AZ-272.
- The C13 writer thread's behaviour upon receiving an `overrun` record (it just logs it like any other record) — owned by E-C13 (AZ-248).
- `FakeFdrSink` — owned by the next PBI in this epic.
## Acceptance Criteria
**AC-1: Drop-oldest produces canonical overrun record**
Given an FdrClient with capacity 4 wired with `default_overrun_policy`, fully buffered with 4 user records
When the producer calls `enqueue` for a 5th record
Then the consumer side observes (in order): the 5th user record, then a `kind="overrun"` record whose `payload.producer_id` matches the originating producer and `payload.dropped_count == 1`
**AC-2: Coalescing across a burst**
Given an FdrClient with capacity 4, fully buffered, and the consumer is stalled
When the producer calls `enqueue` 10 times in a row (8 of them overrun)
Then exactly ONE `kind="overrun"` record is emitted at the end of the burst with `payload.dropped_count == 8`
**AC-3: Overrun record carries originating producer_id**
Given an FdrClient(producer_id="c1_vio") wired with the default policy
When the buffer overruns
Then the emitted overrun record has `payload.producer_id == "c1_vio"` (NOT `"shared.fdr_client"` — the OUTER envelope's `producer_id` may be `"shared.fdr_client"` per the schema contract, but the payload identifies the originating producer)
**AC-4: Composition root wires every FdrClient**
Given a production process initialised via `compose_root(config)`
When the test inspects every constructed `FdrClient` in the resulting `RuntimeRoot`
Then every client has a non-None `on_overrun` set to a callable from `default_overrun_policy`
**AC-5: Retry-after-drop failure logs ERROR**
Given a contrived test that monkey-patches the buffer so retry-after-drop ALSO returns `OVERRUN` (simulating a frozen consumer mid-policy)
When an overrun is triggered
Then exactly one ERROR log record is emitted with `kind="fdr.overrun_retry_failed"`; the policy does not loop indefinitely; the overrun record is dropped (test asserts no overrun record on the buffer in this pathological case)
**AC-6: No log flood under sustained overruns**
Given an FdrClient under sustained overrun (1000 consecutive overruns)
When the policy runs
Then the shared logger receives at most 1 ERROR record per second related to overruns (rate cap on the diagnostic log; the FDR record itself is the canonical record of overruns)
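
One way to meet the AC-6 rate cap is a small per-producer limiter in front of the diagnostic logger. The sketch below is an assumption about how that could look: `RateLimitedError` and its `clock` parameter are hypothetical names, and the injectable clock exists only so a test can run without sleeping.

```python
import logging
import time
from typing import Callable, Optional


class RateLimitedError:
    """Pass at most one ERROR per min_interval_s to the wrapped logger."""

    def __init__(self, logger: logging.Logger, min_interval_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._logger = logger
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_emit: Optional[float] = None
        self.suppressed = 0  # diagnostics skipped since construction

    def error(self, msg: str, kind: str) -> bool:
        """Log msg at ERROR unless one was logged too recently.

        Returns True when the record was emitted, False when suppressed.
        """
        now = self._clock()
        if (self._last_emit is not None
                and now - self._last_emit < self._min_interval_s):
            self.suppressed += 1
            return False
        self._last_emit = now
        self._logger.error(msg, extra={"kind": kind})
        return True


def demo_sustained_overrun(calls: int = 1000, step_s: float = 0.001) -> int:
    """Simulate `calls` failures spaced `step_s` apart on a fake clock."""
    log = logging.getLogger("demo.fdr.rate_limit")
    log.addHandler(logging.NullHandler())
    log.propagate = False  # keep the demo quiet
    fake_now = [0.0]
    limiter = RateLimitedError(log, clock=lambda: fake_now[0])
    emitted = 0
    for _ in range(calls):
        emitted += int(limiter.error("retry after drop failed",
                                     "fdr.overrun_retry_failed"))
        fake_now[0] += step_s
    return emitted
```

With the defaults, the 1000 simulated failures span just under one second of fake time, so only the first reaches the logger. The FDR overrun record, not this log line, remains the canonical account of what was dropped.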
## Non-Functional Requirements
**Performance**
- Steady-state overhead: when `on_overrun` is set but the buffer is NOT full (so the hook is never invoked), `enqueue` overhead from this PBI's wiring is ≤ 0.5 µs (effectively a single null-check per call). The 5 µs `enqueue` p99 budget MUST still hold.
- Overrun path overhead: the drop-oldest + retry sequence completes within 20 µs p99 on Tier-2 (it runs only on the cold path; cold-path budget is generous).
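
A p99 figure for either budget can be taken with a small harness like the one below. It is a rough sketch only: `p99_ns` and `nearest_rank_p99` are hypothetical helper names, the nearest-rank percentile definition is an assumption, and per-call timing in Python carries timer overhead of its own, so only the difference between a hooked and an unhooked run is meaningful.

```python
import time
from typing import Callable, List


def nearest_rank_p99(samples: List[int]) -> int:
    """Smallest sample with at least 99% of all samples at or below it."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def p99_ns(fn: Callable[[], object], iterations: int = 10_000) -> int:
    """Time each call to fn and return the p99 latency in nanoseconds."""
    samples: List[int] = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return nearest_rank_p99(samples)
```

The steady-state check would then compare `p99_ns(hooked_enqueue) - p99_ns(unhooked_enqueue)` against 500 ns, where both callables are supplied by the benchmark.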
**Reliability**
- The policy NEVER loops indefinitely on retry. One retry only; then ERROR-log + drop.
- The policy NEVER raises into the producer's `enqueue` caller. Any exception inside the closure is logged via `kind="fdr.overrun_policy_error"` and swallowed; the producer's hot path stays clean.
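
The "never raises" guarantee can be pictured as a thin guard around the hook. The wrapper name `never_raise` and the demo logger are hypothetical; the real policy may implement the same guarantee inline.

```python
import logging
from typing import Callable, List, Tuple


def never_raise(hook: Callable[[object], None],
                logger: logging.Logger) -> Callable[[object], None]:
    """Wrap an on_overrun hook so no exception reaches the enqueue caller."""

    def guarded(record: object) -> None:
        try:
            hook(record)
        except Exception:  # deliberately broad: the hot path must stay clean
            # logger.exception logs at ERROR and attaches the traceback.
            logger.exception("overrun policy failed",
                             extra={"kind": "fdr.overrun_policy_error"})

    return guarded


class _ListHandler(logging.Handler):
    """Collects log records in memory so the demo can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def demo_guard() -> List[Tuple[str, str]]:
    log = logging.getLogger("demo.fdr.policy")
    log.propagate = False
    handler = _ListHandler()
    log.addHandler(handler)

    def broken_hook(record: object) -> None:
        raise ValueError("boom")

    guarded = never_raise(broken_hook, log)
    guarded({"kind": "user"})  # returns normally despite the ValueError
    return [(r.levelname, r.kind) for r in handler.records]
```

`except Exception` leaves `KeyboardInterrupt` and `SystemExit` untouched, which is usually the desired behaviour for a guard of this kind.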
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Capacity-4 buffer fully filled, then 5th enqueue with `default_overrun_policy` | Consumer sees 5th record + canonical overrun record (`dropped_count == 1`) |
| AC-2 | 10 consecutive enqueues in one burst, 8 of which overrun | Exactly one overrun record with `dropped_count == 8` |
| AC-3 | Overrun on FdrClient(producer_id="c1_vio") | Emitted overrun record `payload.producer_id == "c1_vio"` |
| AC-4 | Boot a stub composition root with 3 producers; inspect all FdrClients | Every client has `on_overrun != None` |
| AC-5 | Monkey-patched retry-after-drop also fails | Exactly one ERROR log; no overrun record on buffer; no infinite loop |
| AC-6 | 1000 consecutive overruns | Logger receives ≤ 1 ERROR/sec related to overruns |
| NFR-perf-steady | Microbench `enqueue` with hook set but not invoked | p99 overhead ≤ 0.5 µs vs unhooked |
| NFR-perf-overrun | Microbench drop-oldest + retry sequence | p99 ≤ 20 µs |
| NFR-reliability | Inject an exception into the closure; trigger overrun | Producer call returns normally; ERROR logged |
## Constraints
- The policy plugs into AZ-273's `on_overrun` hook ONLY — no other extension point. Behavioural deviation requires a new contract.
- Coalescing window is bounded by "until the next successful enqueue" — NOT by wall-clock time. Rationale: the buffer is the only synchronisation point; the writer thread drains it; once it drains one slot, the producer's next enqueue succeeds and that is the natural emission point for the overrun record.
- The overrun record's OUTER envelope `producer_id` is `"shared.fdr_client"` (per schema contract); the originating producer's slug is in `payload.producer_id`.
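
The envelope/payload split in the last constraint can be illustrated as below. The `FdrRecord` dataclass is a simplified stand-in (the real schema is owned by AZ-272 and has fields not shown here), and `make_overrun_record` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Any, Dict

FDR_CLIENT_SLUG = "shared.fdr_client"


@dataclass(frozen=True)
class FdrRecord:
    """Simplified stand-in: only the fields relevant to the constraint."""
    producer_id: str           # OUTER envelope: who wrote the record
    kind: str
    payload: Dict[str, Any]


def make_overrun_record(origin_producer_id: str,
                        dropped_count: int) -> FdrRecord:
    """Build an overrun record attributed to the originating producer."""
    if dropped_count < 1:
        raise ValueError("an overrun record must account for at least one drop")
    return FdrRecord(
        producer_id=FDR_CLIENT_SLUG,  # envelope names the client library
        kind="overrun",
        payload={
            "producer_id": origin_producer_id,  # payload names the origin
            "dropped_count": dropped_count,
        },
    )
```

For example, `make_overrun_record("c1_vio", 8)` yields an envelope `producer_id` of `"shared.fdr_client"` while `payload["producer_id"]` stays `"c1_vio"`.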
## Risks & Mitigation
**Risk 1: Overrun record itself causes another overrun**
- *Risk*: At the moment of overflow, enqueueing the synthesised overrun record might also fail.
- *Mitigation*: The drop-oldest sequence is "drop one → retry the user record → if successful, then enqueue the overrun record at the next slot the consumer drains". The overrun record is emitted at the END of the burst, on a slot known to be free. If the buffer is so degenerate that one drop is insufficient, the AC-5 ERROR-log path catches it.
**Risk 2: Coalescing hides individual overruns under steady degradation**
- *Risk*: A long-stalled consumer produces one `dropped_count=10000` record at flush time; tooling cannot reconstruct fine-grained timing.
- *Mitigation*: The coalescing scope is "consecutive overruns until next successful enqueue". As soon as the consumer drains one slot, the overrun record is emitted with the count up to that point. Tooling can correlate against the drained record's `ts` to reconstruct timing windows. Documented in the schema contract's invariants.
**Risk 3: Composition-root wiring drift**
- *Risk*: A future component constructs `FdrClient(...)` directly instead of using `make_fdr_client(...)`, ending up with `on_overrun = None` and silent drops in production.
- *Mitigation*: AC-4's contract test scans the constructed `RuntimeRoot` for any FdrClient with `on_overrun is None` and fails. Documented as a code-review Phase 2 (Spec Compliance) check tied to the fdr_client_protocol contract.
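
The scan described in the Risk 3 mitigation could look roughly like this. The shape of `RuntimeRoot` is not given here, so `StubRuntimeRoot`, its `fdr_clients` mapping, and the `demo_a` / `demo_b` producer names are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StubFdrClient:
    """Hypothetical stand-in exposing only the attributes the check reads."""
    producer_id: str
    on_overrun: Optional[Callable[[object], None]] = None


@dataclass
class StubRuntimeRoot:
    """Assumed shape: producer_id -> client. The real RuntimeRoot may differ."""
    fdr_clients: Dict[str, StubFdrClient] = field(default_factory=dict)


def unwired_producers(root: StubRuntimeRoot) -> List[str]:
    """Return the producer_ids whose client has no on_overrun hook attached."""
    return sorted(
        pid for pid, client in root.fdr_clients.items()
        if client.on_overrun is None
    )


def build_demo_root() -> StubRuntimeRoot:
    def hook(record: object) -> None:
        """Placeholder for the closure returned by default_overrun_policy."""

    return StubRuntimeRoot({
        "c1_vio": StubFdrClient("c1_vio", on_overrun=hook),
        "demo_a": StubFdrClient("demo_a", on_overrun=hook),
        "demo_b": StubFdrClient("demo_b"),  # constructed directly, never wired
    })
```

A contract test would assert that `unwired_producers(root)` is empty. Checking that each hook actually originates from `default_overrun_policy` needs an additional marker on the closure, which this sketch does not cover.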
## Runtime Completeness
- **Named capability**: drop-oldest + `kind="overrun"` record emission policy (architecture / E-CC-FDR-CLIENT / AC-NEW-3).
- **Production code that must exist**: real drop-oldest closure, real overrun-record synthesis, real composition-root wiring of every producer.
- **Allowed external stubs**: tests MAY replace `on_overrun` with a recording closure; production wiring MUST NOT.
- **Unacceptable substitutes**: `pass` as the hook ("for now we just log a warning"), in-memory counter without record emission ("we'll add the record later"), or relying on the C13 writer to synthesise overrun records (it cannot — only the producer side knows the burst boundary).