gps-denied-onboard/_docs/02_document/components/14_c13_fdr/description.md
Oleksandr Bezdieniezhnykh 12aba8139f [autodev] Step 13 partial: c10/c11/c12/c13 cycle-1 doc sync
Batch 4 of the cycle-1 component-doc sync. For each of C10
(provisioning), C11 (tilemanager), C12 (operator_orchestrator),
and C13 (fdr):

- Append "Cycle-1 operational reality" paragraph to § 1
  documenting the actual cycle-1 wiring path:
  - C10: operator-side / cross-tier; NOT in _STRATEGY_REGISTRY;
    composed via runtime_root/c10_factory.py with six per-service
    factories; reuses C7 InferenceRuntime for engine compile;
    AZ-323 Ed25519 signer + C10ManifestConfig signing-mode gate;
    AZ-324 ManifestVerifierImpl with airborne/operator modes;
    AZ-507 c6 cuts kept in c10_factory; AZ-687 N/A.
  - C11: operator-workstation-only; airborne build target
    excludes source tree (ADR-004 / AC-8.4); composed via
    runtime_root/c11_factory.py with three per-service factories;
    distinct FdrClient producer_ids for signing_key + tile_uploader;
    AZ-320 IdempotentRetryTileUploader wraps by default;
    AZ-507 keeps c6 surfaces caller-injected; AZ-687 N/A.
  - C12: operator-workstation CLI binary; airborne build excludes
    source tree (ADR-004 + Principle #9); composed via
    runtime_root/c12_factory.py; OperatorOrchestratorServices
    dataclass aggregates AZ-326/327/328/329/330/489 services with
    sibling fields defaulting to None; AZ-507 cuts via
    RemoteCacheProvisionerInvoker + TileDownloaderCut/UploaderCut;
    AZ-687 N/A.
  - C13: airborne infrastructure; pre_constructed[c13_fdr] seeded
    FIRST via make_fdr_client(AIRBORNE_MAIN_PRODUCER_ID, config)
    (AZ-619 Phase A); per-producer _CACHE gives AC-619.2 singleton;
    AZ-274 drop-oldest overrun policy wired at construction;
    c1_vio / c5_state require it, c2_5/c3/c3_5/c4 optional; AZ-687
    guard explicitly does NOT apply — seed runs before any block
    presence check so replay binaries still write FDR.

Also bump _docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md
replay timestamp to 17:18 (start of this /autodev invocation);
gtsam==4.2.1 still requires numpy<2.0.0 so the relaxed opencv pin
remains in effect.

Update _docs/_autodev_state.md sub_step.detail to record batch
4/~5 done; next batch is the 8 helpers under common-helpers/.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 17:25:53 +03:00


C13 — Flight Data Recorder (FDR)

1. High-Level Overview

Purpose: persist a per-flight ≤ 64 GB record of every payload class onboard (estimates, IMU traces, emitted MAVLink, system health, mid-flight tiles, ≤0.1 Hz failed-tile thumbnails) without silently dropping data (AC-NEW-3). Exclude raw nav/AI-cam frames (AC-8.5; only the failed-tile thumbnail forensic exception is allowed). The FDR is the system's audit log: every safety-critical decision, every emitted frame, every signing key rotation, every spoof-promotion-block lands here.

Architectural Pattern: a single concrete FileFdrWriter behind an FdrWriter interface. One writer thread is fed by lock-free in-process queues, one per producing component. The recorder is lossy only on writer-thread overrun, and even then never silently: every drop is logged as a rollover event.
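A minimal sketch of the single-writer pattern, in which producers only enqueue and exactly one thread performs I/O. `SingleWriter` and the `sink` callback are hypothetical names, and Python's `queue.Queue` is lock-based, so it only stands in for the lock-free per-producer rings; overrun handling is omitted here.

```python
import queue
import threading
from typing import Any, Callable


class SingleWriter:
    """One writer thread drains every producer queue; producers never touch the file.

    Hypothetical sketch: queue.Queue is lock-based and stands in for the
    lock-free per-producer rings. Overrun handling is deliberately omitted.
    """

    def __init__(self, producer_ids: list[str], sink: Callable[[str, Any], None]) -> None:
        self._queues = {pid: queue.Queue(maxsize=1024) for pid in producer_ids}
        self._sink = sink  # the only code path that performs I/O
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="fdr-writer", daemon=True)

    def start(self) -> None:
        self._thread.start()

    def enqueue(self, producer_id: str, record: Any) -> None:
        # Producers only enqueue; they never block on file I/O.
        self._queues[producer_id].put_nowait(record)

    def _drain_once(self) -> int:
        written = 0
        for pid, q in self._queues.items():
            while True:
                try:
                    record = q.get_nowait()
                except queue.Empty:
                    break
                self._sink(pid, record)  # FIFO order is preserved per producer
                written += 1
        return written

    def _run(self) -> None:
        while not self._stop.is_set():
            if self._drain_once() == 0:
                self._stop.wait(0.001)  # short idle back-off when all queues are empty
        self._drain_once()  # final flush once stop() has been requested

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

Because only the writer thread calls `sink`, the file handle needs no locking; contention is confined to the queues.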

Cycle-1 operational reality: C13 is airborne infrastructure, seeded as the very first slot in build_pre_constructed: constructed["c13_fdr"] = make_fdr_client(AIRBORNE_MAIN_PRODUCER_ID, config) (AZ-619 Phase A, where AIRBORNE_MAIN_PRODUCER_ID = "airborne_main"). The make_fdr_client(producer_id, config) factory in fdr_client/client.py keeps a process-level _CACHE keyed by producer_id, so any later make_fdr_client("airborne_main", config) call in the same process returns the SAME FdrClient instance; that is the AC-619.2 cross-component singleton guarantee.

Per-component callers can also obtain their OWN per-producer FdrClient via make_fdr_client("<their_slug>", config): C11 uses "c11_tile_manager.signing_key" (AZ-318) and "c11_tile_manager.tile_uploader" (AZ-319), C6's freshness_gate.py uses its own producer, and so on. Each cache entry corresponds to a distinct SpscRingBuffer consumer side. Per-producer capacity comes from config.fdr.per_producer_capacity[producer_id] (override) or config.fdr.queue_size (default), rounded UP to the next power of two and clipped to MIN_CAPACITY.

The drop-oldest overrun policy (AZ-274 default_overrun_policy) is wired automatically when the FdrClient is constructed. AZ-274 also routes each dropped record through the on_overrun hook, so the rollover-log event is emitted exactly once per overrun and never silently.

Required-key relationship: c1_vio and c5_state list c13_fdr in AIRBORNE_REQUIRED_PRE_CONSTRUCTED_KEYS (a missing key raises AirborneBootstrapError). c2_5_rerank, c3_matcher, c3_5_adhop, and c4_pose read it via constructed.get("c13_fdr"), which is optional: when the key is absent they silently pass None to the wrapper, the documented contract for "FDR off" test fixtures.

The AZ-687 replay-mode guard does NOT apply to C13: the slot is seeded unconditionally, before any _replay_omits_component_block(...) check, so a replay binary still writes FDR (TlogDerivedClock-stamped) and post-flight analysis tools can drain the queue.
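The per-producer cache and the capacity rule in the paragraphs above can be sketched as follows. This is not the code in fdr_client/client.py: `FdrClient` is a stand-in that only stores its capacity, the `MIN_CAPACITY` value of 16 and the default `queue_size` of 1024 are hypothetical, and the lock around `_CACHE` is an assumption about concurrent first calls.

```python
import threading
from dataclasses import dataclass, field

MIN_CAPACITY = 16  # hypothetical value; the real floor is not given in this document


@dataclass
class FdrConfig:
    queue_size: int = 1024  # hypothetical default
    per_producer_capacity: dict[str, int] = field(default_factory=dict)


@dataclass
class Config:
    fdr: FdrConfig


class FdrClient:
    """Stand-in for the real FdrClient: records only its producer_id and ring capacity."""

    def __init__(self, producer_id: str, capacity: int) -> None:
        self.producer_id = producer_id
        self.capacity = capacity


def _ring_capacity(producer_id: str, config: Config) -> int:
    # Per-producer override first, then the global default.
    requested = config.fdr.per_producer_capacity.get(producer_id, config.fdr.queue_size)
    rounded = 1 << max(0, requested - 1).bit_length()  # smallest power of two >= requested
    return max(MIN_CAPACITY, rounded)  # clip to the floor


_CACHE: dict[str, FdrClient] = {}
_CACHE_LOCK = threading.Lock()  # assumption: guards concurrent first calls


def make_fdr_client(producer_id: str, config: Config) -> FdrClient:
    with _CACHE_LOCK:
        client = _CACHE.get(producer_id)
        if client is None:  # first call for this producer builds the client
            client = FdrClient(producer_id, _ring_capacity(producer_id, config))
            _CACHE[producer_id] = client
        return client  # later calls return the SAME instance
```

Keying the cache on producer_id is what lets "airborne_main" act as a process-wide singleton while C11 and C6 still get isolated rings under their own slugs.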

Upstream dependencies: every component publishes to C13 via in-process pub/sub (drop-oldest-with-rollover-log on overrun).
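The drop-oldest policy with an overrun hook can be sketched as a power-of-two ring that evicts its oldest record and reports it before accepting the new one. `DropOldestRing` is a hypothetical, single-threaded illustration, not the project's SpscRingBuffer; a real SPSC ring must publish head and tail atomically between the producer and the writer thread.

```python
from typing import Any, Callable, Optional


class DropOldestRing:
    """Bounded ring that evicts the oldest record on overrun and reports it.

    Hypothetical single-threaded sketch of the drop-oldest + on_overrun idea.
    """

    def __init__(self, capacity: int, on_overrun: Callable[[Any], None]) -> None:
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._buf: list[Any] = [None] * capacity
        self._mask = capacity - 1  # cheap modulo for power-of-two sizes
        self._head = 0  # index of the next record to read
        self._tail = 0  # index of the next slot to write
        self._on_overrun = on_overrun
        self.dropped = 0

    def push(self, record: Any) -> None:
        if self._tail - self._head == len(self._buf):  # full: evict the oldest
            oldest = self._buf[self._head & self._mask]
            self._head += 1
            self.dropped += 1
            self._on_overrun(oldest)  # one callback per evicted record, never silent
        self._buf[self._tail & self._mask] = record
        self._tail += 1

    def pop(self) -> Optional[Any]:
        """Return the oldest record, or None when empty."""
        if self._head == self._tail:
            return None
        record = self._buf[self._head & self._mask]
        self._head += 1
        return record
```

Dropping the oldest rather than the newest record keeps the most recent state in the FDR, which is usually what matters for post-flight analysis of the moments before an incident.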

Downstream consumers:

  • Post-flight: operator workstation (read via C12 retrieval).
  • Real-time: nothing — C13 is write-only at runtime.

2. Internal Interfaces

Interface: FdrWriter

| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| open_flight | FlightHeader | None | No (called once at takeoff) | FdrOpenError |
| write_record | FdrRecord | None | No (lock-free enqueue) | FdrQueueOverrunError (logged but does not raise) |
| close_flight | () | FlightFooter | No (called once at landing) | |
| current_size_bytes | () | int | No | |
| is_rolling | () | bool | No | |
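A minimal sketch of the interface as a structural `typing.Protocol`. The DTO types are placeholders aliased to `Any` (their shapes are specified in this section and in data_model.md), and `InMemoryFdrWriter` is a hypothetical test double, not part of the component.

```python
from typing import Any, Protocol, runtime_checkable


class FdrOpenError(Exception):
    """open_flight failure; the error-handling strategy turns this into a refused takeoff."""


# Placeholders for the DTOs specified in this document and data_model.md.
FlightHeader = Any
FdrRecord = Any
FlightFooter = Any


@runtime_checkable
class FdrWriter(Protocol):
    def open_flight(self, header: FlightHeader) -> None: ...  # once, at takeoff
    def write_record(self, record: FdrRecord) -> None: ...  # enqueue only; overrun logged, not raised
    def close_flight(self) -> FlightFooter: ...  # once, at landing
    def current_size_bytes(self) -> int: ...
    def is_rolling(self) -> bool: ...


class InMemoryFdrWriter:
    """Hypothetical test double that satisfies FdrWriter structurally."""

    def __init__(self) -> None:
        self.header: Any = None
        self.records: list[Any] = []

    def open_flight(self, header: FlightHeader) -> None:
        if self.header is not None:
            raise FdrOpenError("flight already open")
        self.header = header

    def write_record(self, record: FdrRecord) -> None:
        self.records.append(record)

    def close_flight(self) -> FlightFooter:
        return {"records_written": len(self.records)}

    def current_size_bytes(self) -> int:
        return 0

    def is_rolling(self) -> bool:
        return False
```

Note that `isinstance` against a `runtime_checkable` Protocol only checks that the methods exist, not their signatures.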

Input/Output DTOs:

FlightHeader:
  flight_id:                       uuid
  flight_started_at:               ISO 8601 + monotonic_ns
  config_snapshot:                 JSON
  signing_key_rotation_event:      record
  manifest_content_hashes:         dict[Path, sha256]

FdrRecord:                       see data_model.md (FdrRecord; tagged union over payload classes)

FlightFooter:
  flight_ended_at:                 ISO 8601 + monotonic_ns
  records_written:                 int
  records_dropped_overrun:         int
  bytes_written:                   int
  rollover_count:                  int
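One way to carry the "ISO 8601 + monotonic_ns" pairing and the manifest hashes in code, assuming plain frozen dataclasses. `Stamp`, `sha256_hex`, and the dict shape of `signing_key_rotation_event` are hypothetical; the wall clock is recorded in UTC as an assumption.

```python
import hashlib
import time
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class Stamp:
    """Wall-clock ISO 8601 paired with a monotonic reading."""
    iso8601: str
    monotonic_ns: int

    @staticmethod
    def now() -> "Stamp":
        return Stamp(datetime.now(timezone.utc).isoformat(), time.monotonic_ns())


def sha256_hex(path: Path) -> str:
    """Stream a manifest file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


@dataclass(frozen=True)
class FlightHeader:
    flight_id: uuid.UUID
    flight_started_at: Stamp
    config_snapshot: dict[str, Any]             # JSON-compatible snapshot
    signing_key_rotation_event: dict[str, Any]  # record shape assumed
    manifest_content_hashes: dict[Path, str]    # path -> hex SHA-256


@dataclass(frozen=True)
class FlightFooter:
    flight_ended_at: Stamp
    records_written: int
    records_dropped_overrun: int
    bytes_written: int
    rollover_count: int
```

Keeping the monotonic reading next to the wall clock lets post-flight tools order records correctly even if the wall clock is stepped mid-flight.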

3. External API Specification

Not applicable.

4. Data Access Patterns

| Query | Frequency | Hot Path | Index Needed |
|---|---|---|---|
| write_record from every component | up to ~100 Hz aggregate | Yes | n/a |
| Post-flight read (operator retrieval) | once per flight | No | filesystem layout per (flight_id, segment) |
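A filesystem layout keyed by (flight_id, segment) can serve as the only "index" the post-flight reader needs. The `<root>/<flight_id>/segment-NNNNNN.fdr` naming below is a hypothetical choice; zero-padding keeps directory listings in segment order.

```python
import re
import uuid
from pathlib import Path

# Hypothetical naming scheme: six-digit zero-padded segment index.
_SEGMENT_RE = re.compile(r"^segment-(\d{6})\.fdr$")


def segment_path(root: Path, flight_id: uuid.UUID, segment: int) -> Path:
    """Location of one segment; zero-padding keeps lexical order == segment order."""
    return root / str(flight_id) / f"segment-{segment:06d}.fdr"


def list_segments(root: Path, flight_id: uuid.UUID) -> list[tuple[int, Path]]:
    """Return (index, path) pairs for a flight, oldest segment first."""
    flight_dir = root / str(flight_id)
    if not flight_dir.is_dir():
        return []
    found = []
    for p in flight_dir.iterdir():
        m = _SEGMENT_RE.match(p.name)
        if m:  # ignore stray files that are not segments
            found.append((int(m.group(1)), p))
    return sorted(found)
```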

Caching Strategy

| Data | Cache Type | TTL | Invalidation |
|---|---|---|---|
| In-process queue from each producer | bounded ring (drop-oldest with rollover log) | flight lifetime | per-record write |
| Writer-thread buffer | sized for ≥ 1 s of typical write load | flight lifetime | flush on segment rollover |

Storage Estimates

| Table/Collection | Est. Row Count (1 yr) | Row Size | Total Size | Growth Rate |
|---|---|---|---|---|
| Per-flight record file (segmented, oldest-segment-dropped policy) | bounded by 64 GB per AC-NEW-3 | varies per payload class | ≤ 64 GB / flight | bounded by AC-NEW-3 |
| Per-flight tile snapshots (mid-flight tiles) | ~few hundred / flight | 50–200 KB each | up to ~50 MB / flight | bounded by F4 mid-flight gen |
| Per-flight failed-tile thumbnails (AC-8.5 forensic exception) | ≤ 0.1 Hz × 8 h = ≤ 2880 thumbnails / flight | small JPEG | < 50 MB | bounded by the ≤ 0.1 Hz cap |
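The storage figures can be cross-checked with a few lines of arithmetic. This assumes decimal (SI) units, which the document does not state explicitly, and an 8 h flight.

```python
KB, MB, GB = 10**3, 10**6, 10**9  # decimal units assumed; SI vs binary is not stated
FLIGHT_S = 8 * 3600               # 8 h flight in seconds

# Failed-tile thumbnails: at most one every 10 s (0.1 Hz).
max_thumbnails = FLIGHT_S // 10
# Per-thumbnail budget that keeps the total under 50 MB.
thumb_budget_bytes = 50 * MB // max_thumbnails

# Mid-flight tiles: how many fit in ~50 MB at the 200 KB and 50 KB extremes.
tiles_at_200kb = 50 * MB // (200 * KB)
tiles_at_50kb = 50 * MB // (50 * KB)

# Average sustained write rate that exactly fills the 64 GB cap in 8 h.
avg_rate_bytes_per_s = 64 * GB / FLIGHT_S

print(max_thumbnails, thumb_budget_bytes, tiles_at_200kb, tiles_at_50kb,
      round(avg_rate_bytes_per_s / MB, 2))
# prints: 2880 17361 250 1000 2.22
```

So the "< 50 MB" thumbnail total implies roughly 17 KB per thumbnail, and the ~50 MB tile figure corresponds to about 250 tiles at the 200 KB upper size. The 64 GB cap allows an average of about 2.2 MB/s across the whole flight.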

Data Management

Seed data: none.

Rollback: the per-segment file layout makes per-segment deletion safe. The writer never overwrites a closed segment; it only appends to the currently open segment, and opens a new one when that segment reaches a configurable size cap.
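A sketch of append-only segment rotation, assuming one file per segment. `SegmentAppender` and the file naming are hypothetical. This variant rotates *before* a write would push the open segment past the cap, so segments never exceed it unless a single record is larger than the cap. Opening with mode `"xb"` makes overwriting an existing (closed) segment fail rather than succeed silently.

```python
from pathlib import Path
from typing import BinaryIO


class SegmentAppender:
    """Appends to one open segment and rotates to a fresh file at a size cap.

    Hypothetical sketch: naming and cap are illustrative.
    """

    def __init__(self, flight_dir: Path, segment_cap_bytes: int) -> None:
        self._dir = flight_dir
        self._cap = segment_cap_bytes
        self._dir.mkdir(parents=True, exist_ok=True)
        self.segment_index = 0
        self._size = 0
        self._fh = self._open(0)

    def _open(self, index: int) -> BinaryIO:
        # "xb" raises FileExistsError instead of truncating an existing segment.
        return open(self._dir / f"segment-{index:06d}.fdr", "xb")

    def append(self, blob: bytes) -> None:
        if self._size > 0 and self._size + len(blob) > self._cap:
            self._rotate()
        self._fh.write(blob)
        self._size += len(blob)

    def _rotate(self) -> None:
        self._fh.flush()
        self._fh.close()  # a closed segment is never reopened for writing
        self.segment_index += 1
        self._size = 0
        self._fh = self._open(self.segment_index)

    def close(self) -> None:
        self._fh.close()
```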

5. Implementation Details

Algorithmic Complexity: per-record cost is O(record_size) for serialisation plus write. Aggregate throughput is sized for the worst case permitted by the AC-NEW-3 cap.

State Management:

  • Owns the open per-flight segment file handle.
  • Owns the writer thread and the in-process producer queues.
  • Owns the rollover policy (oldest-segment-dropped first when total reaches 64 GB).
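The oldest-segment-dropped policy reduces to a small pure function. `segments_to_drop` is a hypothetical helper. It assumes segment index order equals age order and that the caller knows how many bytes it is about to write. It never selects the open segment.

```python
def segments_to_drop(sizes: list[int], open_index: int, total_cap: int, incoming: int) -> list[int]:
    """Indices of closed segments to delete, oldest first, so that `incoming` bytes fit.

    sizes[i] is the byte size of segment i (lower index == older). If even
    dropping every closed segment is not enough, the returned list is what can
    be dropped, and the caller must throttle or reject the write.
    """
    total = sum(sizes) + incoming
    drop: list[int] = []
    for i, size in enumerate(sizes):
        if total <= total_cap or i == open_index:
            break  # fits now, or reached the segment still being written
        drop.append(i)
        total -= size
    return drop
```

Usage: with segments of 4, 4, 4 and 2 bytes (the last one open), a 12-byte cap and a 2-byte write, only segment 0 is dropped.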

Key Dependencies:

| Library | Version | Purpose |
|---|---|---|
| orjson / msgpack | per project pin | Record serialisation (format to be chosen during the decompose phase) |
| atomicwrites | latest | Segment file rotation (atomic open of the new segment + close of the previous one) |
| filelock | per project pin | Cross-process safety for the FDR root: guards against the operator orchestrator reading while the companion writes; during flight, only the companion accesses it |

Error Handling Strategy:

  • FdrOpenError at takeoff: refuse takeoff (per AC-NEW-3 every payload class must be present from t=0).
  • FdrQueueOverrunError: per-producer drop-oldest, but the rollover event itself is ALWAYS logged (a separate "overrun" record in the FDR records the dropped count and producer-id). Never silent.
  • Filesystem write failure mid-flight: log to stdout/stderr (since we can't log to FDR at this point) + STATUSTEXT to GCS; the system continues to emit external positions because losing the audit log doesn't compromise navigation, but the operator must be alerted.
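The mid-flight write-failure path can be sketched as a sink that latches into a failed state. It reports once to stderr and once to the GCS, then lets the caller keep running. `FailSafeSink` is hypothetical, and `send_statustext` is a placeholder callback for whatever forwards a STATUSTEXT to the GCS, not a real MAVLink API.

```python
import errno
import sys
from typing import BinaryIO, Callable


class FailSafeSink:
    """Write path that survives a mid-flight filesystem failure (hypothetical sketch)."""

    def __init__(self, fh: BinaryIO, send_statustext: Callable[[str], None]) -> None:
        self._fh = fh
        self._send_statustext = send_statustext  # placeholder for the GCS alert path
        self.failed = False

    def write(self, blob: bytes) -> bool:
        if self.failed:
            return False  # audit log is gone; navigation carries on regardless
        try:
            self._fh.write(blob)
            return True
        except OSError as exc:
            self.failed = True
            name = errno.errorcode.get(exc.errno, exc.errno)  # e.g. 28 -> "ENOSPC" on Linux
            msg = f"C13 segment write failure: errno={name}"
            print(msg, file=sys.stderr)  # cannot log to the FDR that just failed
            self._send_statustext(msg)   # alert the operator exactly once
            return False
```

Latching `failed` avoids flooding stderr and the GCS link with one alert per record once the disk is full.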

6. Extensions and Helpers

| Helper | Purpose | Used By |
|---|---|---|
| RecordSchema | Versioned record schema for cross-version FDR compatibility | C13 only (internal) |
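One way a versioned record schema enables cross-version reads is to prefix each record with a schema version and a payload length. A reader can then skip versions it does not understand, and it stops cleanly at a truncated tail. The `>HI` header layout is a hypothetical choice, and stdlib `json` stands in for the orjson/msgpack decision that is still open.

```python
import json
import struct
from typing import Any, Callable, Iterator

# Hypothetical frame header: schema_version (u16, big-endian), payload_len (u32).
_HEADER = struct.Struct(">HI")


def encode_record(schema_version: int, record: dict[str, Any]) -> bytes:
    # json stands in for the serialisation format chosen later.
    payload = json.dumps(record, separators=(",", ":")).encode()
    return _HEADER.pack(schema_version, len(payload)) + payload


def iter_records(blob: bytes, decoders: dict[int, Callable[[bytes], Any]]) -> Iterator[tuple[int, Any]]:
    """Yield (version, decoded) frames; unknown versions are skipped, not fatal."""
    offset = 0
    while offset + _HEADER.size <= len(blob):
        version, length = _HEADER.unpack_from(blob, offset)
        offset += _HEADER.size
        if offset + length > len(blob):
            break  # truncated final frame (e.g. power loss mid-write)
        payload = blob[offset:offset + length]
        offset += length
        decoder = decoders.get(version)
        if decoder is not None:
            yield version, decoder(payload)
```

Because the length is outside the versioned payload, an older reader can still walk past records written by a newer build.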

7. Caveats & Edge Cases

Known limitations:

  • 64 GB cap is per AC-NEW-3. If payload-class throughput grows beyond what the cap supports for an 8 h flight, the producers MUST throttle or accept oldest-dropped — the FDR will not silently exceed the cap.
  • Failed-tile thumbnail forensic exception is the ONLY raw-imagery-adjacent persistence; AC-8.5 must be re-asserted if any new payload class is added.

Potential race conditions:

  • The writer thread is the single writer; producers enqueue lock-free. No filesystem contention from within the companion. Operator-tool reads happen post-landing only.

Performance bottlenecks:

  • Writer-thread serialisation throughput must exceed peak producer throughput. NFT-LIM-02 (an 8 h synthetic AC-NEW-3 run) validates this.

8. Dependency Graph

Must be implemented after: nothing internal — C13 is foundational along with C7.

Can be implemented in parallel with: every other component.

Blocks: every component (every component logs to C13).

9. Logging Strategy

| Log Level | When | Example |
|---|---|---|
| ERROR | FdrOpenError; mid-flight filesystem write failure | C13 segment write failure: errno=ENOSPC; STATUSTEXT to GCS |
| WARN | queue overrun (any producer) | C13 queue overrun: producer=c5_state; dropped_count=23 |
| INFO | open/close flight; segment rollover | C13 flight opened: flight_id=…; segment=0 |
| DEBUG | per-write timing (dev tier only) | C13 record written: kind=estimate; bytes=412; took=0.1ms |

Log format: structured JSON to stdout/journald. Log storage: stdout/journald, but C13's own ERROR events are never written to C13 itself, since that would mean writing to the component that just failed. FDR records serve as the project-level "logs" for everything except C13's own operational status.
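A minimal sketch of one-JSON-object-per-line logging to stdout using only the standard `logging` module. `JsonFormatter`, `make_c13_logger`, and the convention of passing structured fields via `extra={"fields": {...}}` are hypothetical choices, not the project's logger.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (journald-friendly)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))  # structured key/values from extra=
        return json.dumps(entry, separators=(",", ":"))


def make_c13_logger() -> logging.Logger:
    logger = logging.getLogger("c13_fdr")
    if not logger.handlers:  # idempotent: avoid duplicate handlers on repeat calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # DEBUG per-write timing stays off outside the dev tier
    logger.propagate = False
    return logger
```

Usage: `make_c13_logger().warning("C13 queue overrun", extra={"fields": {"producer": "c5_state", "dropped_count": 23}})` emits one JSON line with `producer` and `dropped_count` as top-level keys.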