Drop-Oldest Policy + kind="overrun" Record Emission
Task: AZ-274_fdr_overrun_emission
Name: FDR Overrun Policy
Description: Wire the producer-side overrun policy on top of the FdrClient ring buffer. When a producer's enqueue would overflow, the policy drops the OLDEST queued record from that producer's buffer to make room for the new record AND synthesises an FdrRecord(kind="overrun", payload={producer_id, dropped_count}) that lands on the same queue. This is the production-side enforcement of AC-NEW-3 ("no silent drops").
Complexity: 2 points
Dependencies: AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf
Component: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT)
Tracker: AZ-274
Epic: AZ-247 (E-CC-FDR-CLIENT)
Document Dependencies
- _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md: defines the canonical shape of kind="overrun" records (consumed: payload.producer_id + payload.dropped_count).
- _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md: defines the on_overrun hook this task implements + the "exactly-once" invariant.
Problem
AZ-273 (FdrClient ring buffer) leaves the on_overrun hook unwired by default. In production, an unwired hook means records are dropped silently whenever enqueue returns OVERRUN. That directly violates AC-NEW-3 and breaks C13's invariant that every dropped record is recoverable from a kind="overrun" record on the FDR. This task closes that gap by providing the canonical drop-oldest hook and registering it via the composition root for every onboard producer.
Outcome
- A single, contract-frozen drop-oldest hook is the only on_overrun callable any production FdrClient is wired to. Tests MAY substitute their own.
- For every burst that exceeds capacity, a coalesced kind="overrun" record is enqueued on the SAME producer's buffer, carrying the originating producer's slug and a dropped_count reflecting how many records were dropped in the burst (coalescing keeps the overrun record from itself triggering further overruns when bursts are sustained).
- The composition root wires the hook on every FdrClient created via make_fdr_client; consumers (component code) do not interact with the hook directly.
Scope
Included
- A default_overrun_policy(client: FdrClient) -> Callable[[FdrRecord], None] factory that returns the canonical drop-oldest closure for the given client.
- Drop-oldest semantics: when enqueue returns OVERRUN, the closure pops one record from the buffer's tail (oldest), discards it, retries the new record's enqueue (one retry only), and arranges for a kind="overrun" record to land on the same buffer. If the retry also fails, the policy logs an ERROR via the shared logger (kind="fdr.overrun_retry_failed"); this is rare and implies the consumer is making zero progress.
- Coalescing: while a burst of consecutive overruns is in flight (consecutive OVERRUN returns within the same producer "tick"), the policy increments dropped_count on the in-flight overrun record instead of synthesising a new one per drop. The overrun record itself is enqueued at the END of the burst (next successful enqueue slot).
- Composition-root wiring: make_fdr_client is updated (or a new wire_fdr_client_overrun(client) helper is exposed and called inside make_fdr_client) so every production FdrClient is constructed with this policy attached. Tests that explicitly construct FdrClient(...) directly opt out by leaving on_overrun as None.
- Diagnostic ERROR log only when the retry-after-drop also fails (NOT on every overrun; overruns are normal under bursty load and would flood the log).
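The drop-oldest, single-retry and coalescing behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the AZ-273 API: ToyFdrClient, try_put, the OK/OVERRUN string codes and the flush handle are all hypothetical stand-ins. In particular, this section does not say how the coalesced record reaches the buffer once a slot frees up, so the sketch assumes an explicit flush() call. The exception guard and AC-5's handling of the overrun record after a failed retry are left out.

```python
import logging
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional

log = logging.getLogger("fdr_overrun_sketch")  # stand-in for the shared logger
OK, OVERRUN = "OK", "OVERRUN"  # hypothetical enqueue return codes


@dataclass
class FdrRecord:  # toy shape; the real schema is owned by AZ-272
    kind: str
    payload: dict
    producer_id: str


class ToyFdrClient:
    """Hypothetical stand-in for the AZ-273 ring buffer, not its real API."""

    def __init__(self, producer_id: str, capacity: int) -> None:
        self.producer_id = producer_id
        self.capacity = capacity
        self.buf: Deque[FdrRecord] = deque()
        self.on_overrun: Optional[Callable[[FdrRecord], None]] = None

    def try_put(self, record: FdrRecord) -> str:
        if len(self.buf) >= self.capacity:
            return OVERRUN
        self.buf.append(record)
        return OK

    def enqueue(self, record: FdrRecord) -> str:
        status = self.try_put(record)
        if status == OVERRUN and self.on_overrun is not None:
            self.on_overrun(record)  # the policy decides what happens next
        return status


def default_overrun_policy(client: ToyFdrClient) -> Callable[[FdrRecord], None]:
    dropped = 0  # drops coalesced since the last emitted overrun record

    def on_overrun(record: FdrRecord) -> None:
        nonlocal dropped
        client.buf.popleft()  # discard the oldest queued record
        dropped += 1
        if client.try_put(record) == OVERRUN:  # one retry only, never a loop
            log.error("retry after drop failed",
                      extra={"kind": "fdr.overrun_retry_failed"})

    def flush() -> bool:
        """Try to emit the coalesced overrun record; True if it was queued."""
        nonlocal dropped
        if dropped == 0:
            return False
        overrun = FdrRecord(
            kind="overrun",
            payload={"producer_id": client.producer_id, "dropped_count": dropped},
            producer_id="shared.fdr_client",  # outer envelope slug
        )
        if client.try_put(overrun) == OVERRUN:
            return False  # no free slot yet; keep coalescing
        dropped = 0
        return True

    on_overrun.flush = flush  # hypothetical handle for the emission step
    return on_overrun
```

In this toy, a capacity-4 client that receives six records keeps the four newest, and the first flush() after the consumer drains one slot queues a single overrun record with dropped_count == 2.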
Excluded
- The buffer itself, the on_overrun hook plumbing, and the SPSC contract: owned by AZ-273.
- The FdrRecord schema and the kind="overrun" payload definition: owned by AZ-272.
- The C13 writer thread's behaviour upon receiving an overrun record (it just logs it like any other record): owned by E-C13 (AZ-248).
- FakeFdrSink: owned by the next PBI in this epic.
Acceptance Criteria
AC-1: Drop-oldest produces canonical overrun record
Given an FdrClient with capacity 4 wired with default_overrun_policy, fully buffered with 4 user records
When the producer calls enqueue for a 5th record
Then the consumer side observes (in order): the 5th user record, then a kind="overrun" record whose payload.producer_id matches the originating producer and payload.dropped_count == 1
AC-2: Coalescing across a burst
Given an FdrClient with capacity 4, fully buffered, and the consumer is stalled
When the producer calls enqueue 10 times in a row (8 of them overrun)
Then exactly ONE kind="overrun" record is emitted at the end of the burst with payload.dropped_count == 8
AC-3: Overrun record carries originating producer_id
Given an FdrClient(producer_id="c1_vio") wired with the default policy
When the buffer overruns
Then the emitted overrun record has payload.producer_id == "c1_vio" (NOT "shared.fdr_client" — the OUTER envelope's producer_id may be "shared.fdr_client" per the schema contract, but the payload identifies the originating producer)
AC-4: Composition root wires every FdrClient
Given a production process initialised via compose_root(config)
When the test inspects every constructed FdrClient in the resulting RuntimeRoot
Then every client has a non-None on_overrun set to a callable from default_overrun_policy
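A contract test for AC-4 could take roughly the following shape. The layout of RuntimeRoot is not described in this section, so StubRuntimeRoot, its fdr_clients list and the unwired_producers helper are hypothetical. The sketch only checks for a non-None hook; verifying that the callable actually came from default_overrun_policy would need an extra marker that is not shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StubFdrClient:  # hypothetical stand-in for FdrClient
    producer_id: str
    on_overrun: Optional[Callable[[object], None]] = None


@dataclass
class StubRuntimeRoot:  # hypothetical: assumes the root exposes its clients as a list
    fdr_clients: List[StubFdrClient] = field(default_factory=list)


def unwired_producers(root: StubRuntimeRoot) -> List[str]:
    """Return producer_ids of clients whose on_overrun hook was never wired."""
    return [c.producer_id for c in root.fdr_clients if c.on_overrun is None]


def assert_all_wired(root: StubRuntimeRoot) -> None:
    missing = unwired_producers(root)
    assert not missing, f"FdrClient(s) without on_overrun: {missing}"
```

Returning the offending producer_ids rather than a bare boolean makes the failure message point at the component that bypassed make_fdr_client.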
AC-5: Retry-after-drop failure logs ERROR
Given a contrived test that monkey-patches the buffer so retry-after-drop ALSO returns OVERRUN (simulating a frozen consumer mid-policy)
When an overrun is triggered
Then exactly one ERROR log record is emitted with kind="fdr.overrun_retry_failed"; the policy does not loop indefinitely; the overrun record is dropped (test asserts no overrun record on the buffer in this pathological case)
AC-6: No log flood under sustained overruns
Given an FdrClient under sustained overrun (1000 consecutive overruns)
When the policy runs
Then the shared logger receives at most 1 ERROR record per second related to overruns (rate cap on the diagnostic log; the FDR record itself is the canonical record of overruns)
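One way to get the rate cap in AC-6 is a small gate in front of the logger call. The sketch below is an assumption about how that could look, not the project's logger API: RateCappedLog, its emit callback and the injectable clock are hypothetical names, and the clock is injectable only so a test can drive time deterministically.

```python
import time
from typing import Callable, Optional


class RateCappedLog:
    """Forward at most one message per min_interval_s and count the rest."""

    def __init__(
        self,
        emit: Callable[[str], None],
        min_interval_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._emit = emit
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_emit: Optional[float] = None
        self.suppressed = 0  # messages swallowed by the cap

    def __call__(self, message: str) -> bool:
        now = self._clock()
        too_soon = (
            self._last_emit is not None
            and now - self._last_emit < self._min_interval_s
        )
        if too_soon:
            self.suppressed += 1
            return False
        self._last_emit = now
        self._emit(message)
        return True
```

A monotonic clock is used so that wall-clock adjustments cannot reopen the gate early.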
Non-Functional Requirements
Performance
- Steady-state overhead: when on_overrun is set but the buffer is NOT full (so the hook is never invoked), enqueue overhead from this PBI's wiring is ≤ 0.5 µs (effectively a single null-check per call). The 5 µs enqueue p99 budget MUST still hold.
- Overrun path overhead: the drop-oldest + retry sequence completes within 20 µs p99 on Tier-2 (it runs only on the cold path; cold-path budget is generous).
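The p99 figures above imply a microbenchmark that collects per-call latencies and reads off a percentile. A minimal sketch, assuming a nearest-rank percentile and time.perf_counter_ns as the timer (both helper names are hypothetical):

```python
import time
from typing import Callable, List


def sample_latencies_ns(fn: Callable[[], object], iterations: int) -> List[int]:
    """Time each call to fn individually and return the samples in nanoseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples


def percentile_nearest_rank(samples: List[int], pct: int) -> int:
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil, no float rounding
    return ordered[rank - 1]
```

Each sample includes the cost of the timer calls themselves, so the steady-state overhead is best read as the difference between the hooked and unhooked p99 measured the same way, not as an absolute number.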
Reliability
- The policy NEVER loops indefinitely on retry. One retry only; then ERROR-log + drop.
- The policy NEVER raises into the producer's enqueue caller. Any exception inside the closure is logged via kind="fdr.overrun_policy_error" and swallowed; the producer's hot path stays clean.
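The never-raise requirement can be met by wrapping the policy closure in a guard. The following is a sketch under assumptions: guarded is a hypothetical helper, the standard logging module stands in for the shared logger, and the kind value is attached through logging's extra parameter.

```python
import logging
from typing import Any, Callable

log = logging.getLogger("fdr_client_sketch")  # stand-in for the shared logger


def guarded(hook: Callable[[Any], None]) -> Callable[[Any], None]:
    """Wrap an on_overrun hook so no exception escapes into enqueue's caller."""

    def safe_hook(record: Any) -> None:
        try:
            hook(record)
        except Exception:
            # log.exception records at ERROR level and attaches the traceback
            log.exception(
                "overrun policy raised",
                extra={"kind": "fdr.overrun_policy_error"},
            )

    return safe_hook
```

Catching Exception rather than BaseException lets KeyboardInterrupt and SystemExit propagate, which is usually what a process shutdown needs.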
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Capacity-4 buffer fully filled, then 5th enqueue with default_overrun_policy | Consumer sees 5th record + canonical overrun record (dropped_count == 1) |
| AC-2 | 10 enqueues in one burst against a stalled consumer (8 of them overrun) | Exactly one overrun record with dropped_count == 8 |
| AC-3 | Overrun on FdrClient(producer_id="c1_vio") | Emitted overrun record payload.producer_id == "c1_vio" |
| AC-4 | Boot a stub composition root with 3 producers; inspect all FdrClients | Every client has on_overrun != None |
| AC-5 | Monkey-patched retry-after-drop also fails | Exactly one ERROR log; no overrun record on buffer; no infinite loop |
| AC-6 | 1000 consecutive overruns | Logger receives ≤ 1 ERROR/sec related to overruns |
| NFR-perf-steady | Microbench enqueue with hook set but not invoked | p99 overhead ≤ 0.5 µs vs unhooked |
| NFR-perf-overrun | Microbench drop-oldest + retry sequence | p99 ≤ 20 µs |
| NFR-reliability | Inject an exception into the closure; trigger overrun | Producer call returns normally; ERROR logged |
Constraints
- The policy plugs into AZ-273's on_overrun hook ONLY; there is no other extension point. Behavioural deviation requires a new contract.
- Coalescing window is bounded by "until the next successful enqueue", NOT by wall-clock time. Rationale: the buffer is the only synchronisation point; the writer thread drains it; once it drains one slot, the producer's next enqueue succeeds and that is the natural emission point for the overrun record.
- The overrun record's OUTER envelope producer_id is "shared.fdr_client" (per schema contract); the originating producer's slug is in payload.producer_id.
Risks & Mitigation
Risk 1: Overrun record itself causes another overrun
- Risk: At the moment of overflow, enqueueing the synthesised overrun record might also fail.
- Mitigation: The drop-oldest sequence is "drop one → retry the user record → if successful, then enqueue the overrun record at the next slot the consumer drains". The overrun record is emitted at the END of the burst, on a slot known to be free. If the buffer is so degenerate that one drop is insufficient, the AC-5 ERROR-log path catches it.
Risk 2: Coalescing hides individual overruns under steady degradation
- Risk: A long-stalled consumer produces one dropped_count=10000 record at flush time; tooling cannot reconstruct fine-grained timing.
- Mitigation: The coalescing scope is "consecutive overruns until next successful enqueue". As soon as the consumer drains one slot, the overrun record is emitted with the count up to that point. Tooling can correlate against the drained record's ts to reconstruct timing windows. Documented in the schema contract's invariants.
Risk 3: Composition-root wiring drift
- Risk: A future component constructs FdrClient(...) directly instead of using make_fdr_client(...), ending up with on_overrun = None and silent drops in production.
- Mitigation: AC-4's contract test scans the constructed RuntimeRoot for any FdrClient with on_overrun is None and fails. Documented as a code-review Phase 2 (Spec Compliance) check tied to the fdr_client_protocol contract.
Runtime Completeness
- Named capability: drop-oldest + kind="overrun" record emission policy (architecture / E-CC-FDR-CLIENT / AC-NEW-3).
- Production code that must exist: real drop-oldest closure, real overrun-record synthesis, real composition-root wiring of every producer.
- Allowed external stubs: tests MAY replace on_overrun with a recording closure; production wiring MUST NOT.
- Unacceptable substitutes: pass as the hook ("for now we just log a warning"), in-memory counter without record emission ("we'll add the record later"), or relying on the C13 writer to synthesise overrun records (it cannot, because only the producer side knows the burst boundary).