[AZ-273] [AZ-274] [AZ-275] [AZ-267] [AZ-268] FDR producer chain + log bridge + contract test

AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-
two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError
on the public surface. make_fdr_client caches one client per producer_id
and reads capacity from config.fdr.per_producer_capacity with fallback
to queue_size.
AZ-274: default_overrun_policy implements drop-oldest + retry + immediate
marker emission, with prior-marker dropped_count folding via _evict_one
so user-loss info is never lost across iterations. ERROR diagnostic is
rate-limited to <=1/sec per producer.
AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the
production default_overrun_policy via a duck-typed _PolicyAdapter. The
test-only records/all_records_ever properties let component tests assert
both in-buffer and lifetime state. tests/conftest.py registers the
fake_fdr_sink fixture and an AST architecture lint forbids production
imports of fakes.
AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge
and forwards only WARN+ERROR records into the FDR with kind="log".
Thread-local recursion guard short-circuits internal logging; saturated-
queue diagnostics go to stderr every N=1000 drops.
AZ-268: tests/contract/log_schema.py covers every row of the schema's
Test Cases table plus the "DEBUG+INFO never reach FDR" invariant.
pyproject.toml registers the contract pytest marker and the
contract-mandated log_schema.py file-name.
251 unit + contract tests pass (48 new). Review verdict:
PASS_WITH_WARNINGS; findings are NFR-perf deferrals + documented
relaxation of AZ-274 AC-2 coalescing under permanently-stalled consumer.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-11 03:00:49 +03:00
parent 3acc7f33dd
commit ba20c2d195
24 changed files with 2714 additions and 20 deletions
@@ -0,0 +1,100 @@
# FDR Log Bridge (ERROR + WARN forwarding)
**Task**: AZ-267_fdr_log_bridge
**Name**: FDR Log Bridge
**Description**: Attach a logging Handler to the shared logger. The handler forwards every ERROR and WARN record into the Flight Data Recorder via the FDR producer client, tagged `kind="log"`, so post-flight tooling can correlate log events with the rest of the recorded telemetry.
**Complexity**: 2 points
**Dependencies**: AZ-266_log_module, AZ-247 (forward — FDR producer + record schema not yet decomposed; this task's contract surface is satisfied once AZ-247's record schema contract is published)
**Component**: shared.logging (cross-cutting; epic AZ-245 / E-CC-LOG)
**Tracker**: AZ-267
**Epic**: AZ-245 (E-CC-LOG)
### Document Dependencies
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — log envelope this bridge consumes (produced by AZ-266).
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — FDR record schema this bridge writes into (produced by AZ-247; document does not yet exist — Step 4 cross-verification will catch the forward reference).
## Problem
The acceptance criterion "ERROR + WARN records appear in FDR with `kind="log"` and a back-reference to the originating component" requires a bridge between the shared Python `logging` machinery and the FDR producer client. Without this bridge, post-flight tools cannot correlate a `c5_state` ERROR log with the surrounding telemetry frames captured at the same flight time.
## Outcome
- Every emitted log record at level WARN or ERROR is enqueued into the FDR producer queue with `kind="log"` and the originating component slug preserved.
- INFO and DEBUG records are NEVER enqueued into FDR (verified by the contract test in PBI #3 of this epic).
- The bridge never blocks the calling thread — it uses the FDR producer client's drop-oldest semantics so a saturated queue cannot stall a `logger.error(...)` call on the hot path.
## Scope
### Included
- A logging Handler subclass installed onto the root onboard logger (or each `get_logger(...)` instance, whichever the AZ-266 implementation chose) that subscribes to records at WARN and ERROR.
- Translation logic from `LogRecord` (per `log_record_schema` v1.0.0) into the FDR record envelope expected by the FDR producer client, with `kind="log"` and a `component` back-reference.
- Wire-up in the composition root (consumed from AZ-246 / E-CC-CONF) so the bridge is attached exactly once, after the logger and the FDR client are both initialised.
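
The handler and translation items above can be sketched as follows. This is a minimal illustration, not the AZ-267 implementation: `StubFdrClient`, its `enqueue(record)` method, the `msg` field, and the idea that the logger name carries the component slug are all assumptions, since the FDR record schema contract is not yet published.

```python
import logging


class StubFdrClient:
    """Hypothetical stand-in for the FDR producer client (not the real API)."""

    def __init__(self):
        self.records = []

    def enqueue(self, record):
        self.records.append(record)


class LogBridgeSketch(logging.Handler):
    """Forwards WARN and ERROR records to the FDR client with kind="log"."""

    # Assumed mapping from stdlib level numbers to the schema's level names.
    _LEVELS = {logging.WARNING: "WARN", logging.ERROR: "ERROR"}

    def __init__(self, fdr_client):
        # The handler threshold already filters out DEBUG and INFO.
        super().__init__(level=logging.WARNING)
        self._fdr = fdr_client

    def emit(self, record):
        level = self._LEVELS.get(record.levelno)
        if level is None:
            return  # levels other than WARN/ERROR are out of scope here
        fdr_record = {
            "kind": "log",
            "level": level,
            "component": record.name,  # assumption: logger name == component slug
            "msg": record.getMessage(),
        }
        if record.exc_info:
            # Formatted traceback string, as AC-2 expects in the `exc` field.
            fdr_record["exc"] = logging.Formatter().formatException(record.exc_info)
        try:
            self._fdr.enqueue(fdr_record)
        except Exception:
            pass  # an enqueue failure must never raise into the caller
```

Attaching `LogBridgeSketch(StubFdrClient())` to a logger named `c5_state` and calling `logger.warning(...)` would leave one dict in the stub's `records` list, while `logger.info(...)` leaves it unchanged.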
### Excluded
- The FDR producer client itself — owned by AZ-247 / E-CC-FDR-CLIENT.
- The on-disk FDR segment writer thread — owned by AZ-248 / E-C13.
- The contract test that verifies "DEBUG + INFO never reach FDR" — owned by PBI #3 of this epic (next task).
- Per-component log call sites — owned by each component epic.
## Acceptance Criteria
**AC-1: WARN records reach FDR**
Given the bridge is installed and the FDR client's queue is below capacity
When any component emits `logger.warning(...)` via the shared logger
Then a single FDR record with `kind="log"`, `level="WARN"`, and `component=<originating component slug>` is enqueued
**AC-2: ERROR records reach FDR with traceback when applicable**
Given the bridge is installed
When a component emits `logger.exception(...)` from inside an `except` clause
Then the enqueued FDR record's `exc` field carries the formatted traceback string from the `LogRecord`
**AC-3: INFO and DEBUG never reach FDR**
Given the bridge is installed
When any component emits `logger.info(...)` or `logger.debug(...)`
Then no FDR record is enqueued for that log call (verified by both unit tests here and the contract test in the next task)
**AC-4: Backpressure is non-blocking**
Given the FDR producer queue is at its drop-oldest threshold
When a component emits `logger.error(...)` on the hot path
Then the call returns within the same latency budget as a stdout-only WARN call (no blocking on the queue), and the FDR client's existing drop counter is incremented
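
The drop-oldest behaviour that AC-4 relies on can be illustrated with a bounded `collections.deque`. This is a swapped-in technique for illustration only: the real FDR producer client is a separate component, and `DropOldestQueueSketch` and its `dropped` attribute are hypothetical names.

```python
from collections import deque


class DropOldestQueueSketch:
    """Hypothetical bounded queue: when full, it evicts the oldest entry instead of blocking."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)
        self.dropped = 0  # stands in for the FDR client's drop counter

    def enqueue(self, item):
        if len(self._items) == self._items.maxlen:
            # deque(maxlen=...) discards the oldest item on append; count that loss.
            self.dropped += 1
        self._items.append(item)

    def snapshot(self):
        return list(self._items)
```

Because `enqueue` never waits for a consumer, a caller on the hot path pays only the cost of one append regardless of how full the queue is.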
**AC-5: Single attachment**
Given `compose_root(config)` runs at process start
When the bridge wire-up is invoked
Then exactly one bridge Handler is attached to the logger; reinitialising the composition root in tests does not stack duplicates
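
One way to satisfy AC-5 is to make the wire-up idempotent by checking for an existing bridge handler before attaching. The sketch below assumes hypothetical names (`BridgeHandlerSketch`, `wire_bridge_once`) and is not the project's actual wire-up function.

```python
import logging


class BridgeHandlerSketch(logging.Handler):
    """Placeholder handler type, used here only to identify the bridge."""

    def emit(self, record):
        pass


def wire_bridge_once(logger):
    """Attach the bridge unless one is already present; return the attached handler."""
    for handler in logger.handlers:
        if isinstance(handler, BridgeHandlerSketch):
            return handler  # already wired: do not stack a duplicate
    handler = BridgeHandlerSketch(level=logging.WARNING)
    logger.addHandler(handler)
    return handler
```

Calling `wire_bridge_once` twice on the same logger returns the same handler object both times, which is the property the AC-5 unit test checks after a second `compose_root` call.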
## Non-Functional Requirements
**Performance**
- The bridge adds ≤ 0.05 ms p99 latency on top of the formatter's 0.2 ms budget (i.e. `logger.error` → bridge enqueue total p99 ≤ 0.25 ms on Tier-2).
**Reliability**
- A failure to enqueue (queue full + drop-oldest already saturated) MUST NOT raise into the caller. Instead, the bridge MUST emit one internal `WARN` record per N such failures, where N is at least 1000. That record goes to stdout only; recursion into the bridge is short-circuited by a thread-local flag.
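
The every-N reporting rule can be sketched as a small counter. The class name, message text, and `stream` parameter are assumptions made for illustration; the default stream follows the stdout wording of this requirement.

```python
import sys


class DropDiagnosticSketch:
    """Counts enqueue failures and reports once per `every` occurrences."""

    def __init__(self, every=1000, stream=None):
        self._every = every
        self._stream = stream if stream is not None else sys.stdout
        self.failures = 0

    def record_failure(self):
        """Return True when this failure triggered a diagnostic line."""
        self.failures += 1
        if self.failures % self._every == 0:
            # Written directly to the stream, bypassing the logging machinery,
            # so the diagnostic cannot re-enter the bridge.
            self._stream.write(
                f"WARN fdr log bridge: {self.failures} records dropped\n"
            )
            return True
        return False
```

Writing to the stream directly is one way to keep the diagnostic out of the bridge; the thread-local flag named in the requirement covers the case where the diagnostic goes through a logger instead.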
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Emit a WARN through the shared logger with the bridge installed | Stub FDR queue receives one record with `kind="log"`, `level="WARN"`, `component` matching origin |
| AC-2 | Inside an `except` block, call `logger.exception("boom")` | Stub FDR queue's record carries non-empty `exc` traceback string |
| AC-3 | Emit INFO and DEBUG records | Stub FDR queue receives zero records |
| AC-4 | Pre-fill stub FDR queue to drop-oldest threshold, then emit an ERROR | Caller returns under 0.5 ms wall clock; FDR client's drop counter increments |
| AC-5 | Call `compose_root` twice with the same config in a single process | Logger has exactly one bridge Handler attached after the second call |
## Constraints
- The bridge has a forward dependency on AZ-247 (FDR producer client + record schema). It cannot pass its own AC tests until AZ-247 is implemented; Step 4 cross-verification will record this temporal dependency in `_dependencies_table.md`.
- The bridge's record translation MUST consume only the public surface of `log_record_schema` v1.0.0 — no peeking into formatter internals.
## Risks & Mitigation
**Risk 1: Recursion via internal `WARN` on enqueue failure**
- *Risk*: The "queue full" internal WARN itself goes through the bridge, recurses, and corrupts the queue further.
- *Mitigation*: Thread-local "in-bridge" flag short-circuits any logging call originating from the bridge itself; verified by a unit test that fills the queue and asserts no infinite loop.
**Risk 2: Forward dependency on AZ-247 contract not yet written**
- *Risk*: The FDR record schema is described in epic AZ-247's text but not yet a contract file; this task's expectations may drift from AZ-247's eventual contract.
- *Mitigation*: AZ-247's first PBI MUST publish `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` before AZ-247's other PBIs; this task's implementation begins only after that contract exists. Step 4 cross-verification flags the temporal dependency.