Batch 4 of the cycle-1 component-doc sync. For each of C10
(provisioning), C11 (tilemanager), C12 (operator_orchestrator),
and C13 (fdr):
- Append "Cycle-1 operational reality" paragraph to § 1
  documenting the actual cycle-1 wiring path:
  - C10: operator-side / cross-tier; NOT in _STRATEGY_REGISTRY;
    composed via runtime_root/c10_factory.py with six per-service
    factories; reuses C7 InferenceRuntime for engine compile;
    AZ-323 Ed25519 signer + C10ManifestConfig signing-mode gate;
    AZ-324 ManifestVerifierImpl with airborne/operator modes;
    AZ-507 c6 cuts kept in c10_factory; AZ-687 N/A.
  - C11: operator-workstation-only; airborne build target
    excludes source tree (ADR-004 / AC-8.4); composed via
    runtime_root/c11_factory.py with three per-service factories;
    distinct FdrClient producer_ids for signing_key + tile_uploader;
    AZ-320 IdempotentRetryTileUploader wraps by default;
    AZ-507 keeps c6 surfaces caller-injected; AZ-687 N/A.
  - C12: operator-workstation CLI binary; airborne build excludes
    source tree (ADR-004 + Principle #9); composed via
    runtime_root/c12_factory.py; OperatorOrchestratorServices
    dataclass aggregates AZ-326/327/328/329/330/489 services with
    sibling fields defaulting to None; AZ-507 cuts via
    RemoteCacheProvisionerInvoker + TileDownloaderCut/UploaderCut;
    AZ-687 N/A.
  - C13: airborne infrastructure; pre_constructed[c13_fdr] seeded
    FIRST via make_fdr_client(AIRBORNE_MAIN_PRODUCER_ID, config)
    (AZ-619 Phase A); per-producer _CACHE gives AC-619.2 singleton;
    AZ-274 drop-oldest overrun policy wired at construction;
    c1_vio / c5_state require it, c2_5/c3/c3_5/c4 optional; AZ-687
    guard explicitly does NOT apply: seed runs before any block
    presence check so replay binaries still write FDR.
Also bump _docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md
replay timestamp to 17:18 (start of this /autodev invocation);
gtsam==4.2.1 still requires numpy<2.0.0 so the relaxed opencv pin
remains in effect.
Update _docs/_autodev_state.md sub_step.detail to record batch
4/~5 done; next batch is the 8 helpers under common-helpers/.
Co-authored-by: Cursor <cursoragent@cursor.com>
C13 — Flight Data Recorder (FDR)
1. High-Level Overview
Purpose: persist a per-flight ≤ 64 GB record of every payload class onboard (estimates, IMU traces, emitted MAVLink, system health, mid-flight tiles, ≤0.1 Hz failed-tile thumbnails) without silently dropping data (AC-NEW-3). Exclude raw nav/AI-cam frames (AC-8.5; only the failed-tile thumbnail forensic exception is allowed). The FDR is the system's audit log: every safety-critical decision, every emitted frame, every signing key rotation, every spoof-promotion-block lands here.
Architectural Pattern: single concrete FileFdrWriter behind a FdrWriter interface. A single writer thread is fed by lock-free in-process queues from every component. When the writer thread is overrun the queues are lossy (drop-oldest), but every overrun is recorded as a rollover log event; data is never dropped silently.
Cycle-1 operational reality: C13 is airborne infrastructure seeded as the very first slot of build_pre_constructed: constructed["c13_fdr"] = make_fdr_client(AIRBORNE_MAIN_PRODUCER_ID, config) (AZ-619 Phase A, where AIRBORNE_MAIN_PRODUCER_ID = "airborne_main"). The make_fdr_client(producer_id, config) factory in fdr_client/client.py carries a process-level _CACHE keyed by producer_id, so any later make_fdr_client("airborne_main", config) call in the same process returns the SAME FdrClient instance; this is the AC-619.2 cross-component singleton guarantee. Per-component callers can also obtain their OWN per-producer FdrClient via make_fdr_client("<their_slug>", config): C11 uses "c11_tile_manager.signing_key" (AZ-318) and "c11_tile_manager.tile_uploader" (AZ-319), C6's freshness_gate.py uses its own producer, etc. Each entry in the cache is a distinct SpscRingBuffer consumer side. Per-producer capacity comes from config.fdr.per_producer_capacity[producer_id] (override) or config.fdr.queue_size (default), rounded UP to the next power of two and clamped to a floor of MIN_CAPACITY. The drop-oldest overrun policy (AZ-274 default_overrun_policy) is wired automatically at FdrClient construction time; AZ-274 also routes the dropped record through the on_overrun hook so the rollover-log event is emitted exactly once per overrun, never silently. Required-key relationship: c1_vio and c5_state list c13_fdr in AIRBORNE_REQUIRED_PRE_CONSTRUCTED_KEYS (a missing key raises AirborneBootstrapError); c2_5_rerank, c3_matcher, c3_5_adhop, and c4_pose read it via constructed.get("c13_fdr") (optional: a missing slot silently passes None to the wrapper, which is the documented contract for "FDR off" test fixtures). The AZ-687 replay-mode guard does NOT apply to C13: the slot is seeded unconditionally before any _replay_omits_component_block(...) check, so a replay binary still writes FDR (TlogDerivedClock-stamped) and post-flight analysis tools can drain the queue.
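The per-producer cache and the capacity rule can be illustrated with a small stand-alone sketch. It is not the code in fdr_client/client.py: the FdrClient, Config, and FdrSection classes below are simplified stand-ins, the MIN_CAPACITY value is hypothetical, and applying the floor after the power-of-two rounding is an assumption about ordering.

```python
from dataclasses import dataclass, field

MIN_CAPACITY = 8  # hypothetical floor; the real value is not stated in this document


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


@dataclass
class FdrSection:
    """Stand-in for the config.fdr block."""
    queue_size: int = 1000
    per_producer_capacity: dict = field(default_factory=dict)


@dataclass
class Config:
    """Stand-in for the runtime config object."""
    fdr: FdrSection = field(default_factory=FdrSection)


def resolve_capacity(producer_id: str, config: Config) -> int:
    """Per-producer override wins, else the default; round up, then apply the floor."""
    requested = config.fdr.per_producer_capacity.get(producer_id, config.fdr.queue_size)
    return max(MIN_CAPACITY, next_power_of_two(requested))


class FdrClient:
    """Simplified stand-in: only remembers who it belongs to and its ring capacity."""

    def __init__(self, producer_id: str, capacity: int) -> None:
        self.producer_id = producer_id
        self.capacity = capacity


_CACHE: dict = {}  # process-level cache keyed by producer_id


def make_fdr_client(producer_id: str, config: Config) -> FdrClient:
    """Return the cached client for producer_id, constructing it on first use."""
    client = _CACHE.get(producer_id)
    if client is None:
        client = FdrClient(producer_id, resolve_capacity(producer_id, config))
        _CACHE[producer_id] = client
    return client


config = Config()
main_a = make_fdr_client("airborne_main", config)
main_b = make_fdr_client("airborne_main", config)
signer = make_fdr_client("c11_tile_manager.signing_key", config)
print(main_a is main_b, main_a is signer, main_a.capacity)  # True False 1024
```

Under these assumptions, two callers that pass the same producer_id share one instance, while a caller with its own slug gets a separate client with its own capacity.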
Upstream dependencies: every component publishes to C13 via in-process pub/sub (drop-oldest-with-rollover-log on overrun).
Downstream consumers:
- Post-flight: operator workstation (read via C12 retrieval).
- Real-time: nothing — C13 is write-only at runtime.
2. Internal Interfaces
Interface: FdrWriter
| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| open_flight | FlightHeader | None | No (called once at takeoff) | FdrOpenError |
| write_record | FdrRecord | None | No (lock-free enqueue) | FdrQueueOverrunError (logged but does not raise) |
| close_flight | () | FlightFooter | No (called once at landing) | — |
| current_size_bytes | () | int | No | — |
| is_rolling | () | bool | No | — |
Input/Output DTOs:
FlightHeader:
    flight_id: uuid
    flight_started_at: ISO 8601 + monotonic_ns
    config_snapshot: JSON
    signing_key_rotation_event: record
    manifest_content_hashes: dict[Path, sha256]
FdrRecord: see data_model.md (FdrRecord; tagged union over payload classes)
FlightFooter:
    flight_ended_at: ISO 8601 + monotonic_ns
    records_written: int
    records_dropped_overrun: int
    bytes_written: int
    rollover_count: int
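As an illustration only, the interface and DTOs above could be expressed in Python roughly as follows. The Timestamp helper, the use of plain dicts for JSON and record fields, and typing FdrRecord as Any are assumptions made for this sketch; the authoritative FdrRecord definition lives in data_model.md.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Protocol
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Timestamp:
    """Hypothetical pairing of wall-clock ISO 8601 text with a monotonic reading."""
    iso8601: str
    monotonic_ns: int


def now() -> Timestamp:
    """Capture both clocks at (approximately) the same instant."""
    return Timestamp(datetime.now(timezone.utc).isoformat(), time.monotonic_ns())


@dataclass(frozen=True)
class FlightHeader:
    flight_id: UUID
    flight_started_at: Timestamp
    config_snapshot: Dict[str, Any]              # JSON-compatible mapping
    signing_key_rotation_event: Dict[str, Any]   # record shape assumed
    manifest_content_hashes: Dict[Path, str]     # path -> sha256 hex digest


@dataclass(frozen=True)
class FlightFooter:
    flight_ended_at: Timestamp
    records_written: int
    records_dropped_overrun: int
    bytes_written: int
    rollover_count: int


class FdrWriter(Protocol):
    """Structural type mirroring the method table; all calls are synchronous."""

    def open_flight(self, header: FlightHeader) -> None: ...
    def write_record(self, record: Any) -> None: ...
    def close_flight(self) -> FlightFooter: ...
    def current_size_bytes(self) -> int: ...
    def is_rolling(self) -> bool: ...


header = FlightHeader(
    flight_id=uuid4(),
    flight_started_at=now(),
    config_snapshot={},
    signing_key_rotation_event={},
    manifest_content_hashes={},
)
```

A Protocol is used here so that test doubles can satisfy the interface without inheriting from it; whether the project uses a Protocol or an abstract base class is not stated in this document.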
3. External API Specification
Not applicable.
4. Data Access Patterns
| Query | Frequency | Hot Path | Index Needed |
|---|---|---|---|
| write_record from every component | up to ~100 Hz aggregate | Yes | n/a |
| Post-flight read (operator retrieval) | once per flight | No | filesystem layout per (flight_id, segment) |
Caching Strategy
| Data | Cache Type | TTL | Invalidation |
|---|---|---|---|
| In-process queue from each producer | bounded ring (drop-oldest with rollover log) | flight lifetime | per-record write |
| Writer-thread buffer | sized for ≥1 s of typical write load | flight lifetime | flush on segment rollover |
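The drop-oldest behaviour of the per-producer queue can be sketched with a bounded queue that reports every eviction through a callback. This is a single-threaded model built on collections.deque, not the lock-free SPSC ring used in the project; the class name and the on_overrun signature are assumptions for illustration.

```python
from collections import deque
from typing import Any, Callable, Optional


class DropOldestQueue:
    """Bounded FIFO that evicts the oldest record when full and reports the eviction."""

    def __init__(self, capacity: int, on_overrun: Callable[[Any], None]) -> None:
        self._items: deque = deque()
        self._capacity = capacity
        self._on_overrun = on_overrun
        self.dropped = 0  # running count, e.g. for a footer's overrun total

    def push(self, record: Any) -> None:
        if len(self._items) == self._capacity:
            oldest = self._items.popleft()
            self.dropped += 1
            # Exactly one callback per evicted record, so no drop goes unreported.
            self._on_overrun(oldest)
        self._items.append(record)

    def pop(self) -> Optional[Any]:
        """Consumer side: return the next record, or None when empty."""
        return self._items.popleft() if self._items else None


overruns = []
queue = DropOldestQueue(capacity=4, on_overrun=overruns.append)
for i in range(6):
    queue.push(i)

remaining = []
while (item := queue.pop()) is not None:
    remaining.append(item)

print(overruns, remaining, queue.dropped)  # [0, 1] [2, 3, 4, 5] 2
```

In this model the callback is where an "overrun" record carrying the producer id and dropped count would be emitted.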
Storage Estimates
| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|---|---|---|---|---|
| Per-flight record file (segmented, oldest-segment-dropped policy) | bounded by 64 GB per AC-NEW-3 | varies per payload class | ≤ 64 GB / flight | bounded by AC-NEW-3 |
| Per-flight tile snapshots (mid-flight tiles) | ~few hundred / flight | 50–200 KB each | up to ~50 MB / flight | bounded by F4 mid-flight gen |
| Per-flight failed-tile thumbnails (AC-8.5 forensic exception) | ≤ 0.1 Hz × 8 h = ≤ 2880 thumbnails / flight | small JPEG | <50 MB | bounded by ≤ 0.1 Hz cap |
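The figures in the table can be cross-checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and an 8 h flight; the sustained-rate and per-thumbnail numbers are derived here and are not requirements stated elsewhere in this document.

```python
FLIGHT_SECONDS = 8 * 3600            # 8 h flight = 28 800 s
THUMBNAIL_RATE_HZ = 0.1              # cap on failed-tile thumbnails

# Upper bound on thumbnails per flight at the 0.1 Hz cap.
max_thumbnails = round(THUMBNAIL_RATE_HZ * FLIGHT_SECONDS)

# Average write rate that would exactly fill the 64 GB cap over the whole flight.
CAP_BYTES = 64 * 10**9               # assumption: decimal gigabytes
sustained_bytes_per_s = CAP_BYTES / FLIGHT_SECONDS

# Average size each thumbnail may have if all of them must fit in 50 MB.
THUMBNAIL_BUDGET_BYTES = 50 * 10**6  # assumption: decimal megabytes
avg_thumbnail_budget_bytes = THUMBNAIL_BUDGET_BYTES / max_thumbnails

print(max_thumbnails)                             # 2880
print(round(sustained_bytes_per_s / 1e6, 2))      # 2.22 (MB/s)
print(round(avg_thumbnail_budget_bytes / 1e3, 1)) # 17.4 (KB)
```

Read this way, an aggregate producer rate above roughly 2.2 MB/s sustained for 8 h would trigger the oldest-segment-dropped policy before landing.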
Data Management
Seed data: none.
Rollback: per-segment file layout makes per-segment deletion safe. The writer never overwrites a closed segment; it only appends to the current open segment, then opens a new segment when the previous reaches a configurable size cap.
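A minimal sketch of the append-then-rotate behaviour described above, assuming a hypothetical segment_NNNNNN.bin naming scheme and a size check performed before each write. Opening with mode "xb" makes the "never overwrite a closed segment" rule explicit, because the open fails if the file already exists.

```python
import tempfile
from pathlib import Path


class SegmentedAppender:
    """Appends to one open segment; starts a new file once the cap is reached."""

    def __init__(self, root: Path, segment_cap_bytes: int) -> None:
        self._root = root
        self._cap = segment_cap_bytes
        self._index = 0
        self._fh = self._open(self._index)

    def _open(self, index: int):
        # "x" = exclusive creation: raises FileExistsError rather than truncating.
        return open(self._root / f"segment_{index:06d}.bin", "xb")

    def append(self, payload: bytes) -> None:
        if self._fh.tell() >= self._cap:
            # Current segment has reached the cap: close it for good, open the next.
            self._fh.close()
            self._index += 1
            self._fh = self._open(self._index)
        self._fh.write(payload)

    def close(self) -> None:
        self._fh.close()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    appender = SegmentedAppender(root, segment_cap_bytes=10)
    for _ in range(5):
        appender.append(b"abcd")
    appender.close()
    sizes = [p.stat().st_size for p in sorted(root.iterdir())]

print(sizes)  # [12, 8]
```

Because the check happens before the write, a segment can overshoot the cap by at most one record; a stricter variant would compare the size after adding the pending payload.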
5. Implementation Details
Algorithmic Complexity: per-record cost is O(record_size) for serialisation + write. Aggregate throughput sized for the worst-case AC-NEW-3 cap.
State Management:
- Owns the open per-flight segment file handle.
- Owns the writer thread and the in-process producer queues.
- Owns the rollover policy (oldest-segment-dropped first when total reaches 64 GB).
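The oldest-segment-dropped rule can be modelled as a pure function over segment sizes. The function name, the (name, size) tuple shape, and the choice to never drop the newest (still open) segment are assumptions made for this sketch.

```python
from typing import List, Tuple


def segments_to_drop(segments: List[Tuple[str, int]], total_cap_bytes: int) -> List[str]:
    """Given segments ordered oldest first, pick the oldest ones to delete
    until the remaining total is at or under the cap."""
    total = sum(size for _, size in segments)
    to_drop: List[str] = []
    # The last entry is treated as the open segment and is never a candidate.
    for name, size in segments[:-1]:
        if total <= total_cap_bytes:
            break
        to_drop.append(name)
        total -= size
    return to_drop


segments = [("segment_000000", 30), ("segment_000001", 30), ("segment_000002", 30)]
print(segments_to_drop(segments, total_cap_bytes=64))   # ['segment_000000']
print(segments_to_drop(segments, total_cap_bytes=100))  # []
```

Keeping the decision separate from the filesystem calls makes the policy easy to unit-test; each deletion would also be the natural place to increment rollover_count.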
Key Dependencies:
| Library | Version | Purpose |
|---|---|---|
| orjson / msgpack | per project pin | Record serialisation (serialisation format to be chosen during the decompose phase) |
| atomicwrites | latest | Segment file rotation (atomic open of new segment + close of previous) |
| filelock | per project pin | Cross-process safety for the FDR root (guards against operator-orchestrator reads while the companion writes; access is companion-only during flight) |
Error Handling Strategy:
- FdrOpenError at takeoff: refuse takeoff (per AC-NEW-3 every payload class must be present from t=0).
- FdrQueueOverrunError: per-producer drop-oldest, but the rollover event itself is ALWAYS logged (a separate "overrun" record in the FDR records the dropped count and producer-id). Never silent.
- Filesystem write failure mid-flight: log to stdout/stderr (since we can't log to FDR at this point) + STATUSTEXT to GCS; the system continues to emit external positions because losing the audit log doesn't compromise navigation, but the operator must be alerted.
6. Extensions and Helpers
| Helper | Purpose | Used By |
|---|---|---|
| RecordSchema | versioned record schema for cross-version FDR compatibility | C13 only (internal) |
7. Caveats & Edge Cases
Known limitations:
- 64 GB cap is per AC-NEW-3. If payload-class throughput grows beyond what the cap supports for an 8 h flight, the producers MUST throttle or accept oldest-dropped — the FDR will not silently exceed the cap.
- Failed-tile thumbnail forensic exception is the ONLY raw-imagery-adjacent persistence; AC-8.5 must be re-asserted if any new payload class is added.
Potential race conditions:
- The writer thread is the single writer; producers enqueue lock-free. No filesystem contention from within the companion. Operator-tool reads happen post-landing only.
Performance bottlenecks:
- Writer-thread serialisation throughput must exceed peak producer throughput. NFT-LIM-02 (8 h synthetic AC-NEW-3) validates.
8. Dependency Graph
Must be implemented after: nothing internal — C13 is foundational along with C7.
Can be implemented in parallel with: every other component.
Blocks: every component (every component logs to C13).
9. Logging Strategy
| Log Level | When | Example |
|---|---|---|
| ERROR | FdrOpenError, mid-flight filesystem write failure | C13 segment write failure: errno=ENOSPC; STATUSTEXT to GCS |
| WARN | queue overrun (any producer) | C13 queue overrun: producer=c5_state; dropped_count=23 |
| INFO | open/close flight; segment rollover | C13 flight opened: flight_id=…; segment=0 |
| DEBUG | per-write timing (only in dev tier) | C13 record written: kind=estimate; bytes=412; took=0.1ms |
Log format: structured JSON to stdout/journald.
Log storage: stdout / journald. C13 ERROR events are never written to C13 itself, because at that point the FDR is the thing that is broken. FDR records are the project-level "logs" for everything except C13's own operational status.
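One way to produce structured JSON lines with the standard logging module is sketched below. The logger name, the "component" key, and the convention of passing structured values under extra={"fields": ...} are assumptions for this example, not the project's actual logging setup.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "component": "c13_fdr",
            "message": record.getMessage(),
        }
        # Structured key/value pairs supplied via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("c13_fdr_sketch")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger()
logger.warning(
    "C13 queue overrun",
    extra={"fields": {"producer": "c5_state", "dropped_count": 23}},
)
```

Note that Python's level name is WARNING rather than WARN; a mapping step would be needed if the table's level names must appear verbatim in the output.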