gps-denied-onboard/_docs/02_document/components/14_c13_fdr/description.md

C13 — Flight Data Recorder (FDR)

1. High-Level Overview

Purpose: persist a per-flight ≤ 64 GB record of every payload class onboard (estimates, IMU traces, emitted MAVLink, system health, mid-flight tiles, ≤0.1 Hz failed-tile thumbnails) without silently dropping data (AC-NEW-3). Exclude raw nav/AI-cam frames (AC-8.5; only the failed-tile thumbnail forensic exception is allowed). The FDR is the system's audit log: every safety-critical decision, every emitted frame, every signing key rotation, every spoof-promotion-block lands here.

Architectural Pattern: a single concrete FileFdrWriter behind an FdrWriter interface. One writer thread is fed by lock-free in-process queues from every component. Records are dropped only when the writer thread is overrun, and every such drop is logged as a rollover event, never silently.

Upstream dependencies: every component publishes to C13 via in-process pub/sub (drop-oldest-with-rollover-log on overrun).
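
The drop-oldest-with-rollover-log behaviour can be sketched as below. This is a minimal illustration, not the project's queue: the class name `DropOldestQueue` and the shape of the overrun report are assumptions, and a plain lock stands in for the lock-free structure described above.

```python
import threading
from collections import deque
from typing import Any, Optional


class DropOldestQueue:
    """Bounded per-producer queue: when full, evict the oldest record and count it."""

    def __init__(self, producer_id: str, capacity: int) -> None:
        self.producer_id = producer_id
        self._capacity = capacity
        self._items: deque = deque()
        self._dropped = 0
        # Assumption: a lock keeps the sketch short; the real queue is lock-free.
        self._lock = threading.Lock()

    def put(self, record: Any) -> None:
        with self._lock:
            if len(self._items) >= self._capacity:
                self._items.popleft()   # drop-oldest
                self._dropped += 1      # remembered so the drop is never silent
            self._items.append(record)

    def get(self) -> Optional[Any]:
        with self._lock:
            return self._items.popleft() if self._items else None

    def take_overrun_report(self) -> Optional[dict]:
        """Return an overrun record covering drops since the last call, or None."""
        with self._lock:
            if self._dropped == 0:
                return None
            report = {
                "kind": "overrun",
                "producer": self.producer_id,
                "dropped_count": self._dropped,
            }
            self._dropped = 0
            return report


q = DropOldestQueue("c5_state", capacity=3)
for i in range(5):
    q.put(i)
report = q.take_overrun_report()
survivors = [q.get(), q.get(), q.get()]
print(report)     # {'kind': 'overrun', 'producer': 'c5_state', 'dropped_count': 2}
print(survivors)  # [2, 3, 4]
```

The writer thread would call `take_overrun_report()` on each pass and persist any non-empty report as its own FDR record, so the dropped count survives even though the dropped payloads do not.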

Downstream consumers:

  • Post-flight: operator workstation (read via C12 retrieval).
  • Real-time: nothing — C13 is write-only at runtime.

2. Internal Interfaces

Interface: FdrWriter

| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| open_flight | FlightHeader | None | No (called once at takeoff) | FdrOpenError |
| write_record | FdrRecord | None | No (lock-free enqueue) | FdrQueueOverrunError (logged but does not raise) |
| close_flight | () | FlightFooter | No (called once at landing) | |
| current_size_bytes | () | int | No | |
| is_rolling | () | bool | No | |

Input/Output DTOs:

FlightHeader:
  flight_id:                       uuid
  flight_started_at:               ISO 8601 + monotonic_ns
  config_snapshot:                 JSON
  signing_key_rotation_event:      record
  manifest_content_hashes:         dict[Path, sha256]

FdrRecord:                       see data_model.md (FdrRecord; tagged union over payload classes)

FlightFooter:
  flight_ended_at:                 ISO 8601 + monotonic_ns
  records_written:                 int
  records_dropped_overrun:         int
  bytes_written:                   int
  rollover_count:                  int
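
One possible Python rendering of the interface and DTOs above is sketched here. The `Timestamp` helper, the `dict` shapes used for `config_snapshot` and `signing_key_rotation_event`, and the `Any` placeholder for `FdrRecord` are assumptions made for illustration; the authoritative definitions live in data_model.md.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Protocol
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Timestamp:
    """Assumed pairing of wall-clock ISO 8601 text with a monotonic reading."""
    iso8601: str
    monotonic_ns: int


@dataclass(frozen=True)
class FlightHeader:
    flight_id: UUID
    flight_started_at: Timestamp
    config_snapshot: dict[str, Any]              # JSON document
    signing_key_rotation_event: dict[str, Any]   # record shape not specified here
    manifest_content_hashes: dict[Path, str]     # path -> sha256 hex digest


@dataclass(frozen=True)
class FlightFooter:
    flight_ended_at: Timestamp
    records_written: int
    records_dropped_overrun: int
    bytes_written: int
    rollover_count: int


class FdrWriter(Protocol):
    def open_flight(self, header: FlightHeader) -> None: ...
    def write_record(self, record: Any) -> None: ...   # FdrRecord in the real model
    def close_flight(self) -> FlightFooter: ...
    def current_size_bytes(self) -> int: ...
    def is_rolling(self) -> bool: ...


def now() -> Timestamp:
    return Timestamp(datetime.now(timezone.utc).isoformat(), time.monotonic_ns())


header = FlightHeader(
    flight_id=uuid4(),
    flight_started_at=now(),
    config_snapshot={},
    signing_key_rotation_event={},
    manifest_content_hashes={},
)
```

Using a `Protocol` keeps `FileFdrWriter` swappable for an in-memory fake in tests without inheritance.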

3. External API Specification

Not applicable.

4. Data Access Patterns

| Query | Frequency | Hot Path | Index Needed |
|---|---|---|---|
| write_record from every component | up to ~100 Hz aggregate | Yes | n/a |
| Post-flight read (operator retrieval) | once per flight | No | filesystem layout per (flight_id, segment) |

Caching Strategy

| Data | Cache Type | TTL | Invalidation |
|---|---|---|---|
| In-process queue from each producer | bounded ring (drop-oldest with rollover log) | flight lifetime | per-record write |
| Writer-thread buffer | sized for ≥1 s of typical write load | flight lifetime | flush on segment rollover |

Storage Estimates

| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|---|---|---|---|---|
| Per-flight record file (segmented, oldest-segment-dropped policy) | bounded by 64 GB per AC-NEW-3 | varies per payload class | ≤ 64 GB / flight | bounded by AC-NEW-3 |
| Per-flight tile snapshots (mid-flight tiles) | ~few hundred / flight | 50-200 KB each | up to ~50 MB / flight | bounded by F4 mid-flight gen |
| Per-flight failed-tile thumbnails (AC-8.5 forensic exception) | ≤ 0.1 Hz × 8 h = ≤ 2880 thumbnails / flight | small JPEG | <50 MB | bounded by ≤ 0.1 Hz cap |
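
The arithmetic behind these bounds can be checked with a few lines. The decimal interpretation of GB and MB is an assumption (the source does not say GB vs GiB), and the derived per-thumbnail size and sustained write rate are implications of the table, not figures stated in it.

```python
FLIGHT_SECONDS = 8 * 3600              # 8 h flight, as used in the table
THUMBNAIL_PERIOD_S = 10                # 0.1 Hz cap: at most one thumbnail per 10 s
FDR_CAP_BYTES = 64 * 10**9             # assumption: decimal GB
THUMBNAIL_BUDGET_BYTES = 50 * 10**6    # assumption: decimal MB

# Maximum number of failed-tile thumbnails in one flight.
max_thumbnails = FLIGHT_SECONDS // THUMBNAIL_PERIOD_S

# Mean thumbnail size that keeps the total under the 50 MB budget.
mean_thumbnail_bytes = THUMBNAIL_BUDGET_BYTES / max_thumbnails

# Average write rate that fills the 64 GB cap in exactly one 8 h flight;
# above this rate the oldest-segment-dropped policy has to engage.
sustained_rate_bytes_per_s = FDR_CAP_BYTES / FLIGHT_SECONDS

print(max_thumbnails)                               # 2880
print(round(mean_thumbnail_bytes / 1e3, 1))         # 17.4 (KB per thumbnail)
print(round(sustained_rate_bytes_per_s / 1e6, 2))   # 2.22 (MB/s)
```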

Data Management

Seed data: none.

Rollback: per-segment file layout makes per-segment deletion safe. The writer never overwrites a closed segment; it only appends to the current open segment, then opens a new segment when the previous reaches a configurable size cap.
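
The append-only segment rule can be illustrated with a small sketch. The class name `SegmentedAppender`, the `segment_NNNNNN.bin` naming, and the choice to roll over before the next write once the cap has been reached are assumptions; the actual layout is per (flight_id, segment) and is not specified further here.

```python
import tempfile
from pathlib import Path


class SegmentedAppender:
    """Append to one open segment; start a new file once the size cap is reached."""

    def __init__(self, root: Path, segment_cap_bytes: int) -> None:
        self._root = root
        self._cap = segment_cap_bytes
        self._index = 0
        self._size = 0
        self._fh = self._open_segment()

    def _open_segment(self):
        path = self._root / f"segment_{self._index:06d}.bin"
        # Mode "xb" raises FileExistsError if the file exists,
        # so a closed segment can never be overwritten.
        return open(path, "xb")

    def append(self, payload: bytes) -> None:
        if self._size >= self._cap:      # previous writes reached the cap
            self._fh.close()
            self._index += 1
            self._size = 0
            self._fh = self._open_segment()
        self._fh.write(payload)
        self._size += len(payload)

    @property
    def rollover_count(self) -> int:
        return self._index

    def close(self) -> None:
        self._fh.close()


with tempfile.TemporaryDirectory() as tmp:
    appender = SegmentedAppender(Path(tmp), segment_cap_bytes=10)
    for _ in range(4):
        appender.append(b"abcdef")       # 6 bytes per record
    appender.close()
    rollovers = appender.rollover_count
    sizes = [p.stat().st_size for p in sorted(Path(tmp).iterdir())]

print(rollovers)  # 1
print(sizes)      # [12, 12]
```

With this variant a segment may exceed the cap by at most one record, which keeps every record whole inside a single file.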

5. Implementation Details

Algorithmic Complexity: per-record cost is O(record_size) for serialisation + write. Aggregate throughput is sized for the worst-case AC-NEW-3 cap.

State Management:

  • Owns the open per-flight segment file handle.
  • Owns the writer thread and the in-process producer queues.
  • Owns the rollover policy (oldest-segment-dropped first when total reaches 64 GB).
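
A minimal sketch of the oldest-segment-dropped policy follows. The function name `drop_oldest_segments`, the zero-padded file naming (so that lexical order equals age order), and the rule that the open segment is never deleted are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def drop_oldest_segments(root: Path, total_cap_bytes: int, open_segment: Path) -> list:
    """Delete closed segments, oldest first, until the total fits under the cap."""
    segments = sorted(root.glob("segment_*.bin"))   # zero-padded names sort by age
    total = sum(seg.stat().st_size for seg in segments)
    dropped = []
    for seg in segments:
        if total <= total_cap_bytes or seg == open_segment:
            break                                    # under the cap, or only the open segment is left
        total -= seg.stat().st_size
        seg.unlink()
        dropped.append(seg.name)
    return dropped


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for index in range(4):
        (root / f"segment_{index:06d}.bin").write_bytes(b"x" * 10)
    dropped = drop_oldest_segments(
        root, total_cap_bytes=25, open_segment=root / "segment_000003.bin"
    )
    remaining = sorted(p.name for p in root.iterdir())

print(dropped)    # ['segment_000000.bin', 'segment_000001.bin']
print(remaining)  # ['segment_000002.bin', 'segment_000003.bin']
```

Each returned name would be recorded as a rollover event so that the footer's `rollover_count` and the audit trail stay consistent with what is on disk.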

Key Dependencies:

| Library | Version | Purpose |
|---|---|---|
| orjson / msgpack | per project pin | Record serialisation (serialised format choice during decompose phase) |
| atomicwrites | latest | Segment file rotation (atomic open of new segment + close of previous) |
| filelock | per project pin | Cross-process safety for the FDR root (operator-orchestrator reads while companion writes; companion-only access during flight) |

Error Handling Strategy:

  • FdrOpenError at takeoff: refuse takeoff (per AC-NEW-3 every payload class must be present from t=0).
  • FdrQueueOverrunError: per-producer drop-oldest, but the rollover event itself is ALWAYS logged (a separate "overrun" record in the FDR records the dropped count and producer-id). Never silent.
  • Filesystem write failure mid-flight: log to stdout/stderr (since we can't log to FDR at this point) + STATUSTEXT to GCS; the system continues to emit external positions because losing the audit log doesn't compromise navigation, but the operator must be alerted.
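
The mid-flight write-failure path can be sketched as follows. The helper `write_or_alert` and the `send_statustext` callback are hypothetical names; in the real system the callback would be whatever component emits STATUSTEXT to the GCS.

```python
import errno
import sys
from typing import Callable, Protocol


class Writable(Protocol):
    def write(self, payload: bytes) -> int: ...


def write_or_alert(fh: Writable, payload: bytes,
                   send_statustext: Callable[[str], None]) -> bool:
    """Try to write; on failure report via stderr and the operator channel, never raise."""
    try:
        fh.write(payload)
        return True
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        message = f"C13 segment write failure: errno={name}"
        print(message, file=sys.stderr)   # the FDR itself is unavailable at this point
        send_statustext(message)          # hypothetical hook toward the GCS
        return False                      # caller keeps flying; navigation is unaffected


class FullDisk:
    """Test double that behaves like a file on a full filesystem."""

    def write(self, payload: bytes) -> int:
        raise OSError(errno.ENOSPC, "No space left on device")


alerts = []
ok = write_or_alert(FullDisk(), b"record", alerts.append)
print(ok)      # False
print(alerts)  # ['C13 segment write failure: errno=ENOSPC']
```

Returning a flag instead of raising mirrors the stated policy: position output continues, while the operator is told the audit log is compromised.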

6. Extensions and Helpers

| Helper | Purpose | Used By |
|---|---|---|
| RecordSchema | versioned record schema for cross-version FDR compatibility | C13 only (internal) |

7. Caveats & Edge Cases

Known limitations:

  • 64 GB cap is per AC-NEW-3. If payload-class throughput grows beyond what the cap supports for an 8 h flight, the producers MUST throttle or accept oldest-dropped — the FDR will not silently exceed the cap.
  • Failed-tile thumbnail forensic exception is the ONLY raw-imagery-adjacent persistence; AC-8.5 must be re-asserted if any new payload class is added.

Potential race conditions:

  • The writer thread is the single writer; producers enqueue lock-free. No filesystem contention from within the companion. Operator-tool reads happen post-landing only.
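
The single-writer arrangement can be illustrated with the standard library. This is only a sketch: `queue.Queue` uses internal locks rather than being lock-free, and the producer ids and the list used as a sink are placeholders.

```python
import queue
import threading

_STOP = object()   # sentinel that tells the writer thread to exit


def writer_loop(inbox: queue.Queue, sink: list) -> None:
    """The only code that touches the sink, so no write contention is possible."""
    while True:
        item = inbox.get()
        if item is _STOP:
            return
        sink.append(item)


def produce(inbox: queue.Queue, producer_id: str, count: int) -> None:
    for seq in range(count):
        inbox.put((producer_id, seq))


inbox: queue.Queue = queue.Queue()
sink: list = []

writer = threading.Thread(target=writer_loop, args=(inbox, sink))
writer.start()

producers = [
    threading.Thread(target=produce, args=(inbox, pid, 100))
    for pid in ("producer_a", "producer_b")   # placeholder producer ids
]
for p in producers:
    p.start()
for p in producers:
    p.join()

inbox.put(_STOP)
writer.join()

print(len(sink))  # 200
```

Records from different producers may interleave, but each producer's own records arrive in the order they were enqueued.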

Performance bottlenecks:

  • Writer-thread serialisation throughput must exceed peak producer throughput. NFT-LIM-02 (8 h synthetic AC-NEW-3) validates this.

8. Dependency Graph

Must be implemented after: nothing internal — C13 is foundational along with C7.

Can be implemented in parallel with: every other component.

Blocks: every component (every component logs to C13).

9. Logging Strategy

| Log Level | When | Example |
|---|---|---|
| ERROR | FdrOpenError, mid-flight filesystem write failure | C13 segment write failure: errno=ENOSPC; STATUSTEXT to GCS |
| WARN | queue overrun (any producer) | C13 queue overrun: producer=c5_state; dropped_count=23 |
| INFO | open/close flight; segment rollover | C13 flight opened: flight_id=…; segment=0 |
| DEBUG | per-write timing (only in dev tier) | C13 record written: kind=estimate; bytes=412; took=0.1ms |

Log format: structured JSON to stdout/journald. Log storage: stdout / journald — but not C13 itself for ERROR (we'd be writing to the broken thing). FDR records are the project-level "logs" for everything except C13's own operational status.
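
Structured JSON logging of this kind can be sketched with the standard `logging` module. The field names (`level`, `component`, `message`), the logger name `c13_fdr`, and the use of `extra={"fields": ...}` are assumptions, and a `StringIO` stands in for stdout so the output can be inspected.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    LEVEL_NAMES = {"WARNING": "WARN"}   # match the level names used in the table above

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": self.LEVEL_NAMES.get(record.levelname, record.levelname),
            "component": record.name,
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))   # structured key/value extras
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()                  # stand-in for sys.stdout in this sketch
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("c13_fdr")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False                # avoid duplicate output via the root logger

logger.warning("queue overrun", extra={"fields": {"producer": "c5_state", "dropped_count": 23}})
line = stream.getvalue().strip()
print(line)
```

Pointing the handler at `sys.stdout` instead of the `StringIO` gives the stdout/journald path described above, independent of the FDR files.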