AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError on the public surface. make_fdr_client caches one client per producer_id and reads capacity from config.fdr.per_producer_capacity with fallback to queue_size.

AZ-274: default_overrun_policy implements drop-oldest + retry + immediate marker emission, with prior-marker dropped_count folding via _evict_one so user-loss info is never lost across iterations. ERROR diagnostic is rate-limited to <=1/sec per producer.

AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the production default_overrun_policy via a duck-typed _PolicyAdapter. The test-only records/all_records_ever properties let component tests assert both in-buffer and lifetime state. tests/conftest.py registers the fake_fdr_sink fixture and an AST architecture lint forbids production imports of fakes.

AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge and forwards only WARN+ERROR records into the FDR with kind="log". Thread-local recursion guard short-circuits internal logging; saturated-queue diagnostics go to stderr every N=1000 drops.

AZ-268: tests/contract/log_schema.py covers every row of the schema's Test Cases table plus the "DEBUG+INFO never reach FDR" invariant. pyproject.toml registers the contract pytest marker and the contract-mandated log_schema.py file-name.

251 unit + contract tests pass (48 new). Review verdict: PASS_WITH_WARNINGS; findings are NFR-perf deferrals + documented relaxation of AZ-274 AC-2 coalescing under permanently-stalled consumer.

Co-authored-by: Cursor <cursoragent@cursor.com>
FDR Log Bridge (ERROR + WARN forwarding)
Task: AZ-267_fdr_log_bridge
Name: FDR Log Bridge
Description: Subscribe a logging Handler to the shared logger that forwards every ERROR and WARN record into the Flight Data Recorder via the FDR producer client, tagged kind="log" so post-flight tooling can correlate log events with the rest of the recorded telemetry.
Complexity: 2 points
Dependencies: AZ-266_log_module, AZ-247 (forward — FDR producer + record schema not yet decomposed; this task's contract surface is satisfied once AZ-247's record schema contract is published)
Component: shared.logging (cross-cutting; epic AZ-245 / E-CC-LOG)
Tracker: AZ-267
Epic: AZ-245 (E-CC-LOG)
Document Dependencies
- _docs/02_document/contracts/shared_logging/log_record_schema.md: log envelope this bridge consumes (produced by AZ-266).
- _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md: FDR record schema this bridge writes into (produced by AZ-247; the document does not yet exist, and Step 4 cross-verification will catch the forward reference).
Problem
The acceptance criterion "ERROR + WARN records appear in FDR with kind = \"log\" and a back-reference to the originating component" requires a bridge between the shared Python logging machinery and the FDR producer client. Without this bridge, post-flight tools cannot correlate a c5_state ERROR log with the surrounding telemetry frames captured at the same flight time.
Outcome
- Every emitted log record at level WARN or ERROR is enqueued into the FDR producer queue with kind="log" and the originating component slug preserved.
- INFO and DEBUG records are NEVER enqueued into FDR (verified by the contract test in PBI #3 of this epic).
- The bridge never blocks the calling thread: it uses the FDR producer client's drop-oldest semantics so a saturated queue cannot stall a logger.error(...) call on the hot path.
Scope
Included
- A logging Handler subclass installed onto the root onboard logger (or each get_logger(...) instance, whichever the AZ-266 implementation chose) that subscribes to records at WARN and ERROR.
- Translation logic from LogRecord (per log_record_schema v1.0.0) into the FDR record envelope expected by the FDR producer client, with kind="log" and a component back-reference.
- Wire-up in the composition root (consumed from AZ-246 / E-CC-CONF) so the bridge is attached exactly once, after the logger and the FDR client are both initialised.
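A minimal sketch of the handler and translation described above may help make the scope concrete. The names StubFdrClient, its enqueue method, LogBridgeSketch, and the msg field are hypothetical; the real client surface and record envelope are owned by AZ-247. The sketch also assumes the logger name carries the component slug and folds any level above ERROR into "ERROR", neither of which this task specifies.

```python
import logging


class StubFdrClient:
    """Hypothetical stand-in for the FDR producer client (real API is owned by AZ-247)."""

    def __init__(self):
        self.records = []

    def enqueue(self, record):  # assumed method name
        self.records.append(record)


class LogBridgeSketch(logging.Handler):
    """Forwards WARN and ERROR records to an FDR-like client with kind="log"."""

    def __init__(self, fdr_client):
        # Handler-level threshold: DEBUG and INFO are filtered out before emit().
        super().__init__(level=logging.WARNING)
        self._fdr = fdr_client

    def emit(self, record):
        # Python names the level "WARNING"; the spec uses "WARN".
        # Assumption: anything at or above ERROR is recorded as "ERROR".
        level = "WARN" if record.levelno < logging.ERROR else "ERROR"
        self._fdr.enqueue({
            "kind": "log",
            "level": level,
            "component": record.name,  # assumption: logger name is the component slug
            "msg": record.getMessage(),  # hypothetical field name
        })


def demo():
    client = StubFdrClient()
    logger = logging.getLogger("c5_state.sketch")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = LogBridgeSketch(client)
    logger.addHandler(handler)
    try:
        logger.info("ignored")
        logger.warning("battery low")
        logger.error("estimator diverged")
    finally:
        logger.removeHandler(handler)
    return client.records
```

Calling demo() in this sketch yields two records, one per WARN/ERROR call, while the INFO call is filtered by the handler level.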
Excluded
- The FDR producer client itself — owned by AZ-247 / E-CC-FDR-CLIENT.
- The on-disk FDR segment writer thread — owned by AZ-248 / E-C13.
- The contract test that verifies "DEBUG + INFO never reach FDR" — owned by PBI #3 of this epic (next task).
- Per-component log call sites — owned by each component epic.
Acceptance Criteria
AC-1: WARN records reach FDR
Given the bridge is installed and the FDR client's queue is below capacity
When any component emits logger.warning(...) via the shared logger
Then a single FDR record with kind="log", level="WARN", and component=<originating component slug> is enqueued
AC-2: ERROR records reach FDR with traceback when applicable
Given the bridge is installed
When a component emits logger.exception(...) from inside an except clause
Then the enqueued FDR record's exc field carries the formatted traceback string from the LogRecord
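As an illustration of where the exc text can come from, the sketch below uses the standard library's logging.Formatter.formatException on the record's exc_info. The helper name format_exc_field and the capture handler are hypothetical, and the sketch does not claim this is how the log_record_schema envelope exposes the traceback.

```python
import logging


def format_exc_field(record):
    """Return the traceback text for a LogRecord, or None when no exception is attached."""
    if not record.exc_info:
        return None
    # formatException is the stdlib helper the default formatter itself uses.
    return logging.Formatter().formatException(record.exc_info)


class _Capture(logging.Handler):
    """Test-style handler that stores the computed exc field of each record."""

    def __init__(self):
        super().__init__()
        self.exc_fields = []

    def emit(self, record):
        self.exc_fields.append(format_exc_field(record))


def demo():
    capture = _Capture()
    logger = logging.getLogger("exc_field.sketch")
    logger.propagate = False
    logger.addHandler(capture)
    try:
        try:
            raise ValueError("boom")
        except ValueError:
            # logger.exception logs at ERROR and attaches the active exception.
            logger.exception("state update failed")
        logger.error("no exception attached")
    finally:
        logger.removeHandler(capture)
    return capture.exc_fields
```

In this sketch the first entry is a multi-line traceback ending in the exception message, and the second is None because a plain logger.error(...) call attaches no exception.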
AC-3: INFO and DEBUG never reach FDR
Given the bridge is installed
When any component emits logger.info(...) or logger.debug(...)
Then no FDR record is enqueued for that log call (verified by both unit tests here and the contract test in the next task)
AC-4: Backpressure is non-blocking
Given the FDR producer queue is at its drop-oldest threshold
When a component emits logger.error(...) on the hot path
Then the call returns within the same latency budget as a stdout-only WARN call (no blocking on the queue), and the FDR client's existing drop counter is incremented
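The drop-oldest behaviour this criterion relies on can be illustrated with a small bounded queue. This is a single-threaded sketch with hypothetical names (DropOldestQueue, put, dropped_count), not the FDR client's actual buffer or API; it only shows why an enqueue at capacity returns immediately instead of waiting for the consumer.

```python
from collections import deque


class DropOldestQueue:
    """Hypothetical bounded queue: put() never blocks; a full queue evicts its oldest item."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self.dropped_count = 0

    def put(self, item):
        # At capacity, discard the oldest entry and count the loss rather than wait.
        if len(self._items) >= self._capacity:
            self._items.popleft()
            self.dropped_count += 1
        self._items.append(item)

    def snapshot(self):
        return list(self._items)
```

With a capacity of 2, putting "a", "b", "c" leaves "b" and "c" in the queue and a drop count of 1.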
AC-5: Single attachment
Given compose_root(config) runs at process start
When the bridge wire-up is invoked
Then exactly one bridge Handler is attached to the logger; reinitialising the composition root in tests does not stack duplicates
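One way to get the single-attachment property is to check for an existing bridge handler before adding one. The sketch below assumes hypothetical names (BridgeHandlerSketch, attach_bridge_once); the actual wire-up function and its signature are not specified by this task.

```python
import logging


class BridgeHandlerSketch(logging.Handler):
    """Placeholder bridge handler; a real one would forward records to the FDR client."""

    def emit(self, record):
        pass


def attach_bridge_once(logger, handler_factory):
    """Attach a bridge handler only if none is present; return the attached handler."""
    for existing in logger.handlers:
        if isinstance(existing, BridgeHandlerSketch):
            return existing
    handler = handler_factory()
    logger.addHandler(handler)
    return handler
```

A second call returns the handler attached by the first. If a re-run of the composition root builds a new FDR client, a real wire-up would probably need to replace the existing handler rather than reuse it, so that the bridge does not keep a stale client.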
Non-Functional Requirements
Performance
- The bridge adds ≤ 0.05 ms p99 latency on top of the formatter's 0.2 ms budget (i.e. logger.error → bridge enqueue total p99 ≤ 0.25 ms on Tier-2).
Reliability
- A failure to enqueue (queue full + drop-oldest already saturated) MUST NOT raise into the caller; it MUST log a single internal WARN record (via stdout only; recursion into the bridge is short-circuited by a thread-local flag) once every N occurrences, where N is at least 1000.
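The "once every N occurrences" rule can be sketched as a counter that writes directly to a stream, bypassing the logging machinery so the diagnostic cannot re-enter the bridge. DropReporter and record_failure are hypothetical names, and the message text is illustrative only.

```python
import sys


class DropReporter:
    """Counts enqueue failures and emits one diagnostic line per every_n failures."""

    def __init__(self, every_n=1000, stream=None):
        self._every_n = every_n
        self._stream = stream if stream is not None else sys.stdout
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures % self._every_n != 0:
            return
        try:
            self._stream.write(
                f"WARN fdr log bridge: {self.failures} records failed to enqueue\n"
            )
        except Exception:
            # A broken stream must not raise into the logging caller.
            pass
```

In this sketch, 2500 failures with every_n=1000 produce two diagnostic lines, at the 1000th and 2000th failure.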
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Emit a WARN through the shared logger with the bridge installed | Stub FDR queue receives one record with kind="log", level="WARN", component matching origin |
| AC-2 | Inside an except block, call logger.exception("boom") | Stub FDR queue's record carries non-empty exc traceback string |
| AC-3 | Emit INFO and DEBUG records | Stub FDR queue receives zero records |
| AC-4 | Pre-fill stub FDR queue to drop-oldest threshold, then emit an ERROR | Caller returns under 0.5 ms wall clock; FDR client's drop counter increments |
| AC-5 | Call compose_root twice with the same config in a single process | Logger has exactly one bridge Handler attached after the second call |
Constraints
- The bridge has a forward dependency on AZ-247 (FDR producer client + record schema). It cannot pass its own AC tests until AZ-247 is implemented; Step 4 cross-verification will record this temporal dependency in _dependencies_table.md.
- The bridge's record translation MUST consume only the public surface of log_record_schema v1.0.0, with no peeking into formatter internals.
Risks & Mitigation
Risk 1: Recursion via internal WARN on enqueue failure
- Risk: The "queue full" internal WARN itself goes through the bridge, recurses, and corrupts the queue further.
- Mitigation: Thread-local "in-bridge" flag short-circuits any logging call originating from the bridge itself; verified by a unit test that fills the queue and asserts no infinite loop.
Risk 2: Forward dependency on AZ-247 contract not yet written
- Risk: The FDR record schema is described in epic AZ-247's text but not yet a contract file; this task's expectations may drift from AZ-247's eventual contract.
- Mitigation: AZ-247's first PBI MUST publish _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md before AZ-247's other PBIs; this task's implementation begins only after that contract exists. Step 4 cross-verification flags the temporal dependency.