gps-denied-onboard/_docs/02_tasks/done/AZ-267_fdr_log_bridge.md
Oleksandr Bezdieniezhnykh ba20c2d195 [AZ-273] [AZ-274] [AZ-275] [AZ-267] [AZ-268] FDR producer chain + log bridge + contract test
AZ-273: lock-free SPSC ring buffer with pre-allocated slots, power-of-
two capacity, opt-in SPSC guard, and EnqueueResult / FdrSpscViolationError
on the public surface. make_fdr_client caches one client per producer_id
and reads capacity from config.fdr.per_producer_capacity with fallback
to queue_size.
AZ-274: default_overrun_policy implements drop-oldest + retry + immediate
marker emission, with prior-marker dropped_count folding via _evict_one
so user-loss info is never lost across iterations. ERROR diagnostic is
rate-limited to <=1/sec per producer.
AZ-275: FakeFdrSink mirrors the FdrClient public surface and reuses the
production default_overrun_policy via a duck-typed _PolicyAdapter. The
test-only records/all_records_ever properties let component tests assert
both in-buffer and lifetime state. tests/conftest.py registers the
fake_fdr_sink fixture and an AST architecture lint forbids production
imports of fakes.
AZ-267: FdrLogBridgeHandler installs on the root logger via wire_log_bridge
and forwards only WARN+ERROR records into the FDR with kind="log".
Thread-local recursion guard short-circuits internal logging; saturated-
queue diagnostics go to stderr every N=1000 drops.
AZ-268: tests/contract/log_schema.py covers every row of the schema's
Test Cases table plus the "DEBUG+INFO never reach FDR" invariant.
pyproject.toml registers the contract pytest marker and the
contract-mandated log_schema.py file-name.
251 unit + contract tests pass (48 new). Review verdict:
PASS_WITH_WARNINGS; findings are NFR-perf deferrals + documented
relaxation of AZ-274 AC-2 coalescing under permanently-stalled consumer.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 03:00:49 +03:00


FDR Log Bridge (ERROR + WARN forwarding)

Task: AZ-267_fdr_log_bridge
Name: FDR Log Bridge
Description: Subscribe a logging Handler to the shared logger that forwards every ERROR and WARN record into the Flight Data Recorder via the FDR producer client, tagged kind="log" so post-flight tooling can correlate log events with the rest of the recorded telemetry.
Complexity: 2 points
Dependencies: AZ-266_log_module, AZ-247 (forward dependency: FDR producer + record schema not yet decomposed; this task's contract surface is satisfied once AZ-247's record schema contract is published)
Component: shared.logging (cross-cutting; epic AZ-245 / E-CC-LOG)
Tracker: AZ-267
Epic: AZ-245 (E-CC-LOG)

Document Dependencies

  • _docs/02_document/contracts/shared_logging/log_record_schema.md — log envelope this bridge consumes (produced by AZ-266).
  • _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md — FDR record schema this bridge writes into (produced by AZ-247; document does not yet exist — Step 4 cross-verification will catch the forward reference).

Problem

The acceptance criterion 'ERROR + WARN records appear in FDR with kind="log" and a back-reference to the originating component' requires a bridge between the shared Python logging machinery and the FDR producer client. Without this bridge, post-flight tools cannot correlate a c5_state ERROR log with the surrounding telemetry frames captured at the same flight time.

Outcome

  • Every emitted log record at level WARN or ERROR is enqueued into the FDR producer queue with kind="log" and the originating component slug preserved.
  • INFO and DEBUG records are NEVER enqueued into FDR (verified by the contract test in PBI #3 of this epic).
  • The bridge never blocks the calling thread — it uses the FDR producer client's drop-oldest semantics so a saturated queue cannot stall a logger.error(...) call on the hot path.
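The three outcomes above can be sketched with the standard library alone. This is a minimal illustration, not the AZ-267 implementation: StubFdrClient, SketchBridgeHandler, the msg field, the use of the logger name as the component slug, and the mapping of any level above ERROR to "ERROR" are all assumptions made for the example. The stub models drop-oldest with a bounded deque, so enqueue never waits.

```python
import logging
from collections import deque


class StubFdrClient:
    """Hypothetical stand-in for the FDR producer client (not the AZ-247 API)."""

    def __init__(self, capacity: int) -> None:
        # A bounded deque evicts its oldest item on append when it is full.
        self._queue = deque(maxlen=capacity)
        self.dropped = 0

    def enqueue(self, record: dict) -> None:
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1  # the append below evicts the oldest record
        self._queue.append(record)

    def records(self) -> list:
        return list(self._queue)


class SketchBridgeHandler(logging.Handler):
    """Forwards WARNING-and-above records; the handler level filters the rest."""

    def __init__(self, client: StubFdrClient) -> None:
        super().__init__(level=logging.WARNING)
        self._client = client

    def emit(self, record: logging.LogRecord) -> None:
        # Assumption: anything at or above ERROR is reported as "ERROR".
        level = "WARN" if record.levelno < logging.ERROR else "ERROR"
        self._client.enqueue(
            {
                "kind": "log",
                "level": level,
                "component": record.name,  # assumption: logger name == component slug
                "msg": record.getMessage(),
            }
        )


client = StubFdrClient(capacity=2)
logger = logging.getLogger("c5_state")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo isolated from the root logger
logger.addHandler(SketchBridgeHandler(client))

logger.info("ignored")   # below WARNING: never reaches the stub queue
logger.warning("first")
logger.error("second")
logger.error("third")    # queue is full: the oldest record is evicted, no blocking
```

After these calls the stub holds only the two most recent records and its drop counter is 1, which is the behaviour the non-blocking outcome relies on.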

Scope

Included

  • A logging Handler subclass installed onto the root onboard logger (or each get_logger(...) instance, whichever the AZ-266 implementation chose) that subscribes to records at WARN and ERROR.
  • Translation logic from LogRecord (per log_record_schema v1.0.0) into the FDR record envelope expected by the FDR producer client, with kind="log" and a component back-reference.
  • Wire-up in the composition root (consumed from AZ-246 / E-CC-CONF) so the bridge is attached exactly once, after the logger and the FDR client are both initialised.
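The translation step in the list above might look like the following sketch. The field names kind, level, component, and exc come from this task; the msg field, the to_fdr_envelope name, and deriving the component from the logger name are assumptions, since neither contract file is reproduced here. Only public LogRecord attributes and logging.Formatter.formatException are used.

```python
import logging
import sys
from typing import Optional

# A plain Formatter is used only for its public formatException helper.
_FORMATTER = logging.Formatter()


def to_fdr_envelope(record: logging.LogRecord) -> dict:
    """Hypothetical translation of a LogRecord into an FDR-style dict."""
    exc: Optional[str] = None
    if record.exc_info and record.exc_info[0] is not None:
        exc = _FORMATTER.formatException(record.exc_info)
    return {
        "kind": "log",
        "level": "WARN" if record.levelno < logging.ERROR else "ERROR",
        "component": record.name,  # assumption: logger name == component slug
        "msg": record.getMessage(),
        "exc": exc,
    }


def _make_error_record() -> logging.LogRecord:
    """Build a record the way logger.exception(...) would inside an except block."""
    try:
        raise ValueError("boom")
    except ValueError:
        return logging.LogRecord(
            name="c5_state",
            level=logging.ERROR,
            pathname="sketch.py",
            lineno=0,
            msg="state update failed: %s",
            args=("boom",),
            exc_info=sys.exc_info(),
        )


envelope = to_fdr_envelope(_make_error_record())
```

Records without exception information produce exc=None in this sketch; whether the real schema omits the field or sets it to null is left to the FDR record schema contract.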

Excluded

  • The FDR producer client itself — owned by AZ-247 / E-CC-FDR-CLIENT.
  • The on-disk FDR segment writer thread — owned by AZ-248 / E-C13.
  • The contract test that verifies "DEBUG + INFO never reach FDR" — owned by PBI #3 of this epic (next task).
  • Per-component log call sites — owned by each component epic.

Acceptance Criteria

AC-1: WARN records reach FDR
Given the bridge is installed and the FDR client's queue is below capacity
When any component emits logger.warning(...) via the shared logger
Then a single FDR record with kind="log", level="WARN", and component=<originating component slug> is enqueued

AC-2: ERROR records reach FDR with traceback when applicable
Given the bridge is installed
When a component emits logger.exception(...) from inside an except clause
Then the enqueued FDR record's exc field carries the formatted traceback string from the LogRecord

AC-3: INFO and DEBUG never reach FDR
Given the bridge is installed
When any component emits logger.info(...) or logger.debug(...)
Then no FDR record is enqueued for that log call (verified by both unit tests here and the contract test in the next task)

AC-4: Backpressure is non-blocking
Given the FDR producer queue is at its drop-oldest threshold
When a component emits logger.error(...) on the hot path
Then the call returns within the same latency budget as a stdout-only WARN call (no blocking on the queue), and the FDR client's existing drop counter is incremented

AC-5: Single attachment
Given compose_root(config) runs at process start
When the bridge wire-up is invoked
Then exactly one bridge Handler is attached to the logger; reinitialising the composition root in tests does not stack duplicates
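One way to satisfy AC-5 is to make the wire-up idempotent by checking the logger's existing handlers before adding a new one. The sketch below assumes a hypothetical attach_bridge_once helper and a placeholder handler class; it does not reproduce compose_root or the project's actual wire-up function.

```python
import logging


class SketchBridgeHandler(logging.Handler):
    """Placeholder bridge handler; a real one would enqueue into the FDR client."""

    def emit(self, record: logging.LogRecord) -> None:
        pass


def attach_bridge_once(logger: logging.Logger) -> SketchBridgeHandler:
    """Attach a bridge handler unless one of the same type is already present."""
    for handler in logger.handlers:
        if isinstance(handler, SketchBridgeHandler):
            return handler  # already wired: reuse instead of stacking a duplicate
    bridge = SketchBridgeHandler(level=logging.WARNING)
    logger.addHandler(bridge)
    return bridge


demo_logger = logging.getLogger("az267.attach_demo")
first = attach_bridge_once(demo_logger)
second = attach_bridge_once(demo_logger)  # simulates a second compose_root call
bridge_count = sum(isinstance(h, SketchBridgeHandler) for h in demo_logger.handlers)
```

Reusing the existing handler keeps the sketch simple. If a test re-runs the composition root with a different FDR client, replacing the old handler rather than reusing it may be the better choice.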

Non-Functional Requirements

Performance

  • The bridge adds ≤ 0.05 ms p99 latency on top of the formatter's 0.2 ms budget (i.e. the total p99 from logger.error to bridge enqueue is ≤ 0.25 ms on Tier-2).

Reliability

  • A failure to enqueue (queue full + drop-oldest already saturated) MUST NOT raise into the caller. Instead, the bridge MUST emit one internal WARN record per N failures, where N is at least 1000. The internal record goes to stdout only, and a thread-local flag short-circuits any recursion into the bridge.
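The reliability rule can be sketched as a counter that swallows enqueue failures and writes a diagnostic once per N of them. Everything named here (EnqueueFailureReporter, safe_enqueue) is hypothetical. The sketch also assumes that an enqueue failure surfaces as an exception, whereas the real client may return a result value instead, and it writes straight to a stream rather than going through the logging module.

```python
import io
import sys
from typing import Callable, TextIO


class EnqueueFailureReporter:
    """Counts enqueue failures and writes one diagnostic line per every_n failures."""

    def __init__(self, every_n: int = 1000, stream: TextIO = sys.stdout) -> None:
        if every_n < 1:
            raise ValueError("every_n must be positive")
        self._every_n = every_n
        self._stream = stream
        self.failures = 0  # not thread-safe; a real bridge would need a lock or per-thread counts

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures % self._every_n == 0:
            self._stream.write(
                f"WARN fdr log bridge: {self.failures} records could not be enqueued\n"
            )


def safe_enqueue(
    enqueue: Callable[[dict], None], record: dict, reporter: EnqueueFailureReporter
) -> bool:
    """Call enqueue and never let a failure propagate to the logging caller."""
    try:
        enqueue(record)
        return True
    except Exception:
        reporter.record_failure()
        return False


def _always_full(record: dict) -> None:
    raise OverflowError("queue saturated")


sink = io.StringIO()  # stands in for stdout so the demo output can be inspected
reporter = EnqueueFailureReporter(every_n=1000, stream=sink)
results = [safe_enqueue(_always_full, {"kind": "log"}, reporter) for _ in range(2500)]
```

With 2500 failures and N=1000, the sketch writes two diagnostic lines (at failures 1000 and 2000) and raises nothing into the caller.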

Unit Tests

| AC Ref | What to Test | Required Outcome |
| --- | --- | --- |
| AC-1 | Emit a WARN through the shared logger with the bridge installed | Stub FDR queue receives one record with kind="log", level="WARN", component matching origin |
| AC-2 | Inside an except block, call logger.exception("boom") | Stub FDR queue's record carries non-empty exc traceback string |
| AC-3 | Emit INFO and DEBUG records | Stub FDR queue receives zero records |
| AC-4 | Pre-fill stub FDR queue to drop-oldest threshold, then emit an ERROR | Caller returns under 0.5 ms wall clock; FDR client's drop counter increments |
| AC-5 | Call compose_root twice with the same config in a single process | Logger has exactly one bridge Handler attached after the second call |
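The AC-4 row could be exercised with a harness along these lines. It is a sketch under stated assumptions: the stub queue and handler are hypothetical, and the harness reports the best of many timed calls to reduce scheduler noise, which is a weaker check than a p99 measurement.

```python
import logging
import time
from collections import deque


class CountingDropOldestQueue:
    """Hypothetical stub queue: bounded, drop-oldest, with a drop counter."""

    def __init__(self, capacity: int) -> None:
        self.items = deque(maxlen=capacity)
        self.dropped = 0

    def put(self, item: dict) -> None:
        if len(self.items) == self.items.maxlen:
            self.dropped += 1  # the append below evicts the oldest item
        self.items.append(item)


class StubQueueHandler(logging.Handler):
    """Minimal bridge stand-in that pushes WARNING-and-above records to the stub."""

    def __init__(self, queue: CountingDropOldestQueue) -> None:
        super().__init__(level=logging.WARNING)
        self._queue = queue

    def emit(self, record: logging.LogRecord) -> None:
        self._queue.put({"kind": "log", "msg": record.getMessage()})


def measure_saturated_error_calls(calls: int = 200, capacity: int = 8):
    """Return (best call duration in seconds, drops) for ERROR calls on a full queue."""
    queue = CountingDropOldestQueue(capacity)
    for i in range(capacity):  # pre-fill to the drop-oldest threshold
        queue.put({"kind": "log", "msg": f"prefill-{i}"})

    logger = logging.getLogger("az267.latency_demo")
    logger.propagate = False
    logger.setLevel(logging.DEBUG)
    handler = StubQueueHandler(queue)
    logger.addHandler(handler)
    try:
        timings = []
        for _ in range(calls):
            start = time.perf_counter()
            logger.error("hot path failure")
            timings.append(time.perf_counter() - start)
    finally:
        logger.removeHandler(handler)  # leave no handler behind for other tests
    return min(timings), queue.dropped


best_seconds, dropped = measure_saturated_error_calls()
```

Every timed call hits a full queue, so the drop counter should equal the number of calls, and the best observed duration can be compared against the 0.5 ms bound from the table.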

Constraints

  • The bridge has a forward dependency on AZ-247 (FDR producer client + record schema). It cannot pass its own AC tests until AZ-247 is implemented; Step 4 cross-verification will record this temporal dependency in _dependencies_table.md.
  • The bridge's record translation MUST consume only the public surface of log_record_schema v1.0.0 — no peeking into formatter internals.

Risks & Mitigation

Risk 1: Recursion via internal WARN on enqueue failure

  • Risk: The "queue full" internal WARN itself goes through the bridge, recurses, and corrupts the queue further.
  • Mitigation: Thread-local "in-bridge" flag short-circuits any logging call originating from the bridge itself; verified by a unit test that fills the queue and asserts no infinite loop.

Risk 2: Forward dependency on AZ-247 contract not yet written

  • Risk: The FDR record schema is described in epic AZ-247's text but not yet a contract file; this task's expectations may drift from AZ-247's eventual contract.
  • Mitigation: AZ-247's first PBI MUST publish _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md before AZ-247's other PBIs; this task's implementation begins only after that contract exists. Step 4 cross-verification flags the temporal dependency.