mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 04:41:13 +00:00
[AZ-294] [AZ-295] [AZ-296] Finish C13: tile snapshot + record-kind policy + takeoff abort
AZ-294: MidFlightTileSnapshotSink writes orthorectified tile JPEGs atomically to flight_root/<flight_id>/tiles/<tile_id>.jpg, emits a kind="mid_flight_tile_snapshot" pointer record, and evicts the oldest tile when the per-flight 64 MiB cap is exceeded. Adds optional frame_id to the snapshot payload (fdr_record_schema bump).

AZ-295: RecordKindPolicy with two paired gates:
- enforce_or_raise (producer-side) raises RawFrameWriteForbiddenError for raw_nav_frame / raw_ai_cam_frame at the call site, defending AC-8.5 / RESTRICT-UAV-4.
- gate_for_writer (writer-side) tumbling-window rate-caps failed_tile_thumbnail records at <= 0.1 Hz; over-cap drops are coalesced into kind="overrun" records with the originating producer slug.

AZ-296: take_off() composition-root sequence with strict ordering (writer.__init__ -> start -> open_flight -> fc_adapter.__init__ -> fc_adapter.open). On FdrOpenError, logs ERROR record, calls writer.stop(), prints the documented FATAL line to stderr, and sys.exit(EXIT_FDR_OPEN_FAILURE=2). composition_root_protocol bumped to v1.1.0 with the new constants + takeoff-sequence section.

29 new tests; full suite 356 passed / 2 skipped / 0 failures. No new dependencies (stdlib only).

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,165 @@
|
||||
# C13 Mid-Flight Tile Snapshot Path + Filesystem Layout
|
||||
|
||||
**Task**: AZ-294_c13_mid_flight_tile_snapshot
|
||||
**Name**: C13 Mid-Flight Tile Snapshot Path
|
||||
**Description**: Implement the sidecar-file path that persists mid-flight orthorectified tile snapshots produced by C6 / C11 (per AC-8.4 / F4) onto the per-flight FDR tree, and emit the corresponding `kind="mid_flight_tile_snapshot"` `FdrRecord` carrying a pointer (`snapshot_path` + `captured_at`) — NOT the JPEG bytes — so the FdrRecord schema's "embedded binary blobs ≤ 4 KiB" invariant is preserved. The sidecar files live under `flight_root/<flight_id>/tiles/<tile_id>.jpg`. This task does NOT generate the tiles (C6 / C11 own that); it provides the FDR-side storage layout, the sidecar write helper, and the pointer-record emission path.
|
||||
**Complexity**: 3 points
|
||||
**Dependencies**: AZ-291_c13_writer_thread, AZ-272_fdr_record_schema, AZ-263_initial_structure, AZ-269_config_loader
|
||||
**Component**: c13_fdr (epic AZ-248 / E-C13)
|
||||
**Tracker**: AZ-294
|
||||
**Epic**: AZ-248 (E-C13)
|
||||
|
||||
### Document Dependencies
|
||||
|
||||
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines the `kind="mid_flight_tile_snapshot"` payload shape (`snapshot_path`, `captured_at`) AND the ≤ 4 KiB inline-blob invariant this task respects by emitting a pointer instead of bytes.
|
||||
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md`: Config block carrying the per-flight tile cap byte budget (`tile_snapshot_cap_bytes`, default 64 MiB, sized to fit the ~50 MB per-flight storage estimate in `description.md`).
|
||||
|
||||
## Problem
|
||||
|
||||
Mid-flight tile snapshots are generated by C6 / C11 (per F4 mid-flight tile gen) at sizes 50–200 KiB each, up to ~50 MB per 8 h flight. They cannot be inlined into FdrRecords (the schema invariant caps inline blobs at 4 KiB) and they cannot live in the segment file (segment files are append-only streams of FdrRecords; appending arbitrary JPEG bytes would break the record framing AZ-291 + AZ-272 jointly establish).
|
||||
|
||||
Without a sidecar path:
|
||||
- Producers (C6 / C11) have no canonical filesystem location to write the JPEGs. Each component would invent its own, drifting on naming and breaking post-flight retrieval.
|
||||
- The FdrRecord that ties the JPEG to a frame_id / tile_id / timestamp would either go missing (no record at all) or violate the schema invariant (inlining the JPEG bytes), poisoning the whole FDR.
|
||||
- The per-flight tile cap (~50 MB per `description.md`) has no enforcement layer; a runaway tile producer could exhaust the same NVM the segment files compete for.
|
||||
|
||||
## Outcome
|
||||
|
||||
- A `MidFlightTileSnapshotSink(flight_root: Path, flight_id: UUID, fdr_client: FdrClient, config: TileSnapshotConfig)` class is the single sidecar write path. C6 / C11 producers call its `write_snapshot(tile_id: str, jpeg_bytes: bytes, captured_at: datetime, frame_id: int | None) -> Path` method; this task does NOT produce the JPEG itself.
|
||||
- Sidecar files land at `flight_root/<flight_id>/tiles/<tile_id>.jpg`; the directory `tiles/` is created on first write (lazy creation — empty flights leave no `tiles/` directory).
|
||||
- Per call, ONE `kind="mid_flight_tile_snapshot"` FdrRecord is enqueued via the shared FdrClient with `payload.snapshot_path = "tiles/<tile_id>.jpg"` (relative to `flight_root/<flight_id>/` so the FDR is portable) and `payload.captured_at = <ISO 8601>`. The JPEG bytes are NEVER inlined.
|
||||
- The per-flight tile cap (`tile_snapshot_cap_bytes`, default 64 MiB to comfortably fit the ~50 MB worst case from `description.md`) is enforced via oldest-tile-dropped policy, mirroring the segment cap policy from AZ-293 but scoped to the `tiles/` subdirectory and emitted as a `kind="overrun"` record (NOT `segment_rollover` — that kind is reserved for segment-file drops). Each tile drop emits a record with `payload.producer_id="shared.fdr_client"` and `payload.dropped_count=1`.
|
||||
- The sink is thread-safe for many producers (C6, C11 may call concurrently from different threads); the file write itself uses `atomicwrites` to avoid partial JPEGs on crash.
|
||||
|
||||
## Scope
|
||||
|
||||
### Included
|
||||
|
||||
- `MidFlightTileSnapshotSink` class as defined above.
|
||||
- `write_snapshot(tile_id: str, jpeg_bytes: bytes, captured_at: datetime, frame_id: int | None = None) -> Path`:
  1. Validate `len(jpeg_bytes) <= jpeg_max_bytes` (default 256 KiB); reject with `TileSnapshotTooLargeError`, since producers are not trusted to bound their own output.
  2. Validate that the whole `tile_id` matches `^[a-zA-Z0-9_-]{1,128}$`; reject with `TileSnapshotInvalidIdError`.
  3. Compute the absolute sidecar path; create `flight_root/<flight_id>/tiles/` if missing (`os.makedirs(exist_ok=True)`).
  4. Write the JPEG via `atomicwrites` (temp file + `os.rename` after `fsync`).
  5. Enqueue the `kind="mid_flight_tile_snapshot"` FdrRecord with relative path + ISO timestamp + optional `frame_id`.
  6. Check the cap (sum of bytes under `tiles/`). If over cap, drop the oldest tile (ordered by `captured_at`) and emit an overrun record.
  7. Return the absolute sidecar path to the caller (so the producer can log it if needed).
|
||||
- `tile_snapshot_cap_bytes` config field (`composition_root_protocol`); default `64 * 1024**2` (64 MiB).
|
||||
- `jpeg_max_bytes` config field; default `256 * 1024` (256 KiB; per `description.md` "50–200 KB each", 256 KiB gives a small safety margin while bounding adversarial growth).
|
||||
- Thread-safe API: a single `threading.Lock` around the cap-check + drop sequence (the file write itself is `atomicwrites` so it is independently safe). The lock is held for ≤ 5 ms p99; `write_snapshot` is NOT a hot path (tiles are sparse — ~0.01–0.1 Hz typical).
|
||||
- A diagnostic INFO log on each successful write (`kind="fdr.tile_snapshot_written"; tile_id; size_bytes`) and WARN on each cap-driven drop (`kind="fdr.tile_snapshot_dropped"; tile_id; size_bytes_freed; cap_bytes_after`).
|
||||
- Recovery on existing `tiles/` directory: on construction, the sink scans `flight_root/<flight_id>/tiles/` for any pre-existing tiles (e.g. from a crashed and resumed flight via the same flight_id); the cap policy treats them as in-cap unless they push the directory over cap. No tiles are auto-deleted on construction; only on overflow.
|
||||
- The sink does NOT interact with AZ-293's segment cap policy directly. The `tiles/` subdirectory is excluded from segment-cap accounting (per AZ-293 constraint "sidecar files outside the segment files are NOT counted toward the cap"); the tile cap is independent.
|
||||
|
||||
### Excluded
|
||||
|
||||
- Generating tile JPEGs (orthorectification, downsampling, encoding) — owned by F4 / C6 / C11 producers.
|
||||
- The `kind="mid_flight_tile_snapshot"` payload schema — owned by AZ-272.
|
||||
- Post-flight retrieval / upload of tile sidecars — owned by C12 post-landing upload trigger (out of scope this cycle).
|
||||
- Failed-tile thumbnail rate limiter — owned by task #5 (this task is for SUCCESS-path tile snapshots from F4; failed-tile thumbnails are a separate, AC-8.5-governed forensic category).
|
||||
- Per-segment file rotation, 64 GB cap on segments, header/footer accounting — owned by AZ-291 / AZ-292 / AZ-293.
|
||||
- Compression of the JPEG (already JPEG; no further compression).
|
||||
- Encryption / signing of the JPEG — out of scope this cycle.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**AC-1: write_snapshot persists JPEG to canonical sidecar path**
|
||||
Given a sink constructed for `flight_root=/tmp/fdr` and `flight_id=abc-123`, and a JPEG byte string of size 100 KiB
|
||||
When `write_snapshot(tile_id="t_001", jpeg_bytes=<100 KiB>, captured_at=now)` is called
|
||||
Then `/tmp/fdr/abc-123/tiles/t_001.jpg` exists on disk; its bytes equal the input JPEG byte-for-byte; the file is fully written (no temp file artifacts left)
|
||||
|
||||
**AC-2: Pointer record is enqueued, not inline bytes**
|
||||
Given the same call as AC-1
|
||||
When the consumer drains the FdrClient and parses the record
|
||||
Then ONE record with `kind="mid_flight_tile_snapshot"` is observed; `payload.snapshot_path == "tiles/t_001.jpg"` (relative to flight directory); `payload.captured_at` is the ISO 8601 string of the input timestamp; `payload.frame_id` matches the input (or is absent if input was None); NO `payload.jpeg_bytes` field exists; the serialised record is < 1 KiB
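
A minimal sketch of the pointer payload AC-2 describes. The helper name `build_snapshot_pointer_payload` and the plain-dict record envelope are assumptions for illustration only; the real `FdrRecord` shape is owned by AZ-272.

```python
import json
from datetime import datetime, timezone


def build_snapshot_pointer_payload(tile_id, captured_at, frame_id=None):
    """Build the pointer payload: a relative path and a timestamp, never JPEG bytes."""
    payload = {
        # Relative to flight_root/<flight_id>/ so the flight directory stays portable.
        "snapshot_path": f"tiles/{tile_id}.jpg",
        "captured_at": captured_at.isoformat(),
    }
    # frame_id is optional; it is omitted entirely when the producer passes None.
    if frame_id is not None:
        payload["frame_id"] = frame_id
    return payload


# Hypothetical envelope: a plain dict standing in for the real FdrRecord.
record = {
    "kind": "mid_flight_tile_snapshot",
    "producer_id": "shared.fdr_client",
    "payload": build_snapshot_pointer_payload(
        "t_001", datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc), frame_id=42
    ),
}
serialised = json.dumps(record).encode("utf-8")
```

Because the payload holds only a short path and a timestamp, the serialised record stays far below the 4 KiB inline-blob limit regardless of the JPEG size.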
|
||||
|
||||
**AC-3: Cap-driven drop emits overrun record + deletes oldest tile**
|
||||
Given `tile_snapshot_cap_bytes=200 KiB`, three tiles already on disk: `t_old.jpg=60 KiB` (captured_at=t0), `t_mid.jpg=40 KiB` (t1), `t_new.jpg=50 KiB` (t2)
|
||||
When `write_snapshot(tile_id="t_overflow", jpeg_bytes=<60 KiB>, captured_at=t3)` is called
|
||||
Then `t_old.jpg` is deleted (oldest by `captured_at`); the new tile is persisted; ONE `kind="overrun"` record is enqueued with `payload.producer_id="shared.fdr_client"` and `payload.dropped_count=1`
|
||||
|
||||
**AC-4: TileSnapshotTooLargeError on oversized JPEG**
|
||||
Given `jpeg_max_bytes=256 KiB` and an input of 300 KiB
|
||||
When `write_snapshot` is called
|
||||
Then `TileSnapshotTooLargeError` is raised before any file or record is written; the `tiles/` directory is unchanged; no FdrRecord lands on the FdrClient
|
||||
|
||||
**AC-5: TileSnapshotInvalidIdError on bad tile_id**
|
||||
Given an input `tile_id="../../../etc/passwd"` (path traversal attempt)
|
||||
When `write_snapshot` is called
|
||||
Then `TileSnapshotInvalidIdError` is raised before any file write; the `tiles/` directory is unchanged
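
The `tile_id` check in AC-5 could look roughly like the following. The function name `validate_tile_id` and the `ValueError` base class are assumptions; only the pattern and the exception name come from this task.

```python
import re

# Character class and length bounds from the task's tile_id contract.
TILE_ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]{1,128}")


class TileSnapshotInvalidIdError(ValueError):
    """Raised when a tile_id falls outside the allowed envelope (base class assumed)."""


def validate_tile_id(tile_id):
    """Return tile_id unchanged if it is valid; raise before any filesystem access."""
    # fullmatch requires the WHOLE string to match, so path separators, dots,
    # and trailing newlines are all rejected.
    if not isinstance(tile_id, str) or TILE_ID_PATTERN.fullmatch(tile_id) is None:
        raise TileSnapshotInvalidIdError(f"invalid tile_id: {tile_id!r}")
    return tile_id
```

`re.fullmatch` is used instead of `re.match` with `^...$` because in Python `$` also matches just before a trailing newline, which would let `"t_001\n"` through.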
|
||||
|
||||
**AC-6: Concurrent writes are serialised correctly**
|
||||
Given two threads each calling `write_snapshot` 100 times with distinct `tile_id`s under cap
|
||||
When both threads run concurrently
|
||||
Then all 200 sidecar files exist with byte-correct contents; 200 `mid_flight_tile_snapshot` records were enqueued (one per call); zero overruns; no partial files in `tiles/`
|
||||
|
||||
**AC-7: Existing tiles preserved on sink construction**
|
||||
Given `/tmp/fdr/abc-123/tiles/` already contains 3 tile files from a prior process (totaling 150 KiB; cap is 200 KiB)
|
||||
When the sink is constructed for the same `flight_id`
|
||||
Then the existing tiles are NOT deleted on construction; the sink's internal cap accounting includes them; a subsequent `write_snapshot` of 60 KiB triggers a drop of the oldest existing tile
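
The cap accounting in AC-3 and AC-7 can be sketched as in-memory bookkeeping. `TileEntry` and `TileCapIndex` are hypothetical names, the per-tile sizes in the usage note are illustrative, and the sketch does not address how `captured_at` is recovered for tiles found on disk (for example from file mtime), which this task does not specify.

```python
import bisect
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(order=True)
class TileEntry:
    captured_at: datetime                    # the only field used for ordering
    tile_id: str = field(compare=False)
    size_bytes: int = field(compare=False)


class TileCapIndex:
    """Tracks tiles oldest-first and reports which ones must be dropped."""

    def __init__(self, cap_bytes, existing=()):
        self.cap_bytes = cap_bytes
        # Pre-existing tiles are counted but never evicted at construction time.
        self._entries = sorted(existing)
        self.total_bytes = sum(e.size_bytes for e in self._entries)

    def add(self, entry):
        """Register a newly written tile; return the entries to delete, oldest first."""
        bisect.insort(self._entries, entry)
        self.total_bytes += entry.size_bytes
        evicted = []
        # Always keep at least one tile so a single large tile cannot empty the index.
        while self.total_bytes > self.cap_bytes and len(self._entries) > 1:
            oldest = self._entries.pop(0)
            self.total_bytes -= oldest.size_bytes
            evicted.append(oldest)
        return evicted
```

With three pre-existing 50 KiB tiles under a 200 KiB cap, adding a 60 KiB tile brings the total to 210 KiB, so `add` returns only the oldest entry and the total falls to 160 KiB. The caller would then delete that file and emit the overrun record while holding the sink's lock.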
|
||||
|
||||
**AC-8: Atomic write — no partial JPEGs on crash**
|
||||
Given a `write_snapshot` call that is interrupted (simulated kill) between the `atomicwrites` temp-file write and the rename
|
||||
When the test re-inspects the `tiles/` directory
|
||||
Then NO file exists at the canonical sidecar path with partial content; either the file is absent OR it is fully written and parseable
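
The temp-file-then-rename technique behind AC-8 can be illustrated with the standard library alone. This sketch does not use the `atomicwrites` package the task names, and `atomic_write_bytes` is a hypothetical helper, not the sink's actual API.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, data: bytes) -> None:
    """Write data so that target is either absent or complete, never partial."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to storage before the rename
        # os.replace swaps the name in one step; readers never see a half-written file.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temp file on any failure so no artifacts remain in tiles/.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash between the temp write and the rename leaves at most a `.part` file, which a startup scan can ignore or remove; the canonical `<tile_id>.jpg` name is only ever bound to fully written content.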
|
||||
|
||||
## Non-Functional Requirements
|
||||
|
||||
**Performance**
|
||||
- `write_snapshot` returns within 50 ms p99 for a 200 KiB JPEG on Tier-2 NVM (dominated by `fsync` on rename; tiles are sparse so no batching needed).
|
||||
- The cap-check sequence (lock acquire + scan tiles/ + drop if needed + lock release) p99 ≤ 5 ms when no drop is needed; p99 ≤ 50 ms when one drop is needed.
|
||||
- Producer-perceived latency must NOT exceed 100 ms p99 in any scenario — F4 mid-flight tile generation is NOT a hot path but operators do see the result.
|
||||
|
||||
**Reliability**
|
||||
- The sink's `write_snapshot` is at-most-once per call: a successful write is exactly one sidecar + exactly one FdrRecord; a failed write is zero sidecar (atomicwrites ensures this) + zero FdrRecord (the record is enqueued only after the sidecar write completes).
|
||||
- The cap-driven drop is also at-most-once per overflow event: the overrun record is enqueued exactly once even under contention (the lock covers the drop + emission sequence).
|
||||
|
||||
## Unit Tests
|
||||
|
||||
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | write_snapshot 100 KiB JPEG | sidecar file exists at canonical path with byte-correct content |
| AC-2 | Parse the enqueued record | kind=mid_flight_tile_snapshot; payload has snapshot_path + captured_at + frame_id; no jpeg_bytes; serialised < 1 KiB |
| AC-3 | Cap=200 KiB, 3 existing tiles + 60 KiB new | Oldest tile deleted; new tile present; one overrun record enqueued |
| AC-4 | 300 KiB JPEG with cap on size | TileSnapshotTooLargeError; no file or record written |
| AC-5 | tile_id with path traversal characters | TileSnapshotInvalidIdError; no file or record written |
| AC-6 | 2 threads × 100 calls each | All 200 sidecars present; 200 records enqueued; no partials |
| AC-7 | Pre-populate tiles/ then construct sink | Existing tiles untouched; cap accounting includes them |
| AC-8 | Kill mid-write | No partial file at canonical path; either complete or absent |
| NFR-perf-write | Microbench write_snapshot for 200 KiB | p99 ≤ 50 ms |
| NFR-perf-cap-check | Microbench cap-check no-drop path | p99 ≤ 5 ms |
| NFR-perf-cap-drop | Microbench cap-check drop path | p99 ≤ 50 ms |
| NFR-reliability-atomic | Inject failure between temp-write and rename | No half-written canonical file |
|
||||
|
||||
## Constraints
|
||||
|
||||
- The sidecar path is RELATIVE to `flight_root/<flight_id>/` in the FdrRecord (`payload.snapshot_path = "tiles/<tile_id>.jpg"`). This makes the FDR portable: the operator can copy the entire flight directory anywhere and the records still reference the right files.
|
||||
- `tile_id` validation regex `^[a-zA-Z0-9_-]{1,128}$` is the contract; producers may use any naming scheme inside that envelope.
|
||||
- The cap is `tile_snapshot_cap_bytes`, distinct from segment `flight_cap_bytes` in AZ-293. The two caps are independent — exceeding one does NOT trigger drops in the other domain.
|
||||
- The shared FdrClient's `producer_id` for the records emitted by this sink is `"shared.fdr_client"` (the sink itself is shared infrastructure); the originating producer (C6 / C11) is reflected ONLY in the optional `frame_id` payload field, not in the outer envelope. Rationale: F4 tiles may be produced collaboratively across multiple components and the canonical attribution is the captured_at timestamp + tile_id.
|
||||
- This task does NOT introduce any new dependency: `atomicwrites` is already pinned at AZ-263 / E-BOOT.
|
||||
|
||||
## Risks & Mitigation
|
||||
|
||||
**Risk 1: Producer (C6 / C11) flushes tiles faster than the cap can absorb**
|
||||
- *Risk*: A pathological case where 1000 small tiles per second push the cap into constant churn.
|
||||
- *Mitigation*: F4 tile generation is rate-limited at the producer side per the C6 / C11 specs (typical 0.01–0.1 Hz). The cap is sized at 64 MiB to comfortably hold the per-flight worst case. The cap-driven overrun record is the canonical signal if a producer misbehaves; AC-3 covers the policy.
|
||||
|
||||
**Risk 2: tile_id collisions across producers**
|
||||
- *Risk*: C6 and C11 both pick `tile_id="x_42"`; the second call overwrites the first.
|
||||
- *Mitigation*: None inside the sink; this is a documented limitation. `atomicwrites` writes to a temp file, but the rename targets the canonical name, so the second call OVERWRITES the first. Both records carry the same `payload.snapshot_path`, so the operator sees ONE file at that path (holding the second JPEG) and TWO records pointing to it. Producers MUST namespace their `tile_id`s (e.g. `c6_<area>_<index>`); the sink does NOT enforce uniqueness. Code-review Phase 7 (Architecture) catches collisions in `tile_id` schemes across components.
|
||||
|
||||
**Risk 3: A failed sidecar write leaves the FdrRecord pointing at a missing file**
|
||||
- *Risk*: `atomicwrites` succeeds in the temp file but `os.rename` fails; we already enqueued the FdrRecord pointing at the canonical name.
|
||||
- *Mitigation*: The order is FIRST sidecar write (must complete) THEN FdrRecord enqueue. AC-2 implicitly covers this — if the sidecar write raises, no record is enqueued. The implementation MUST NOT enqueue the record before `atomicwrites` returns.
|
||||
|
||||
**Risk 4: `os.scandir` of `tiles/` becomes slow with thousands of tiles**
|
||||
- *Risk*: A 100 MiB cap with tiny tiles ends up with ~10k files in `tiles/`; scanning that directory on every write becomes the bottleneck.
|
||||
- *Mitigation*: The sink caches the in-memory tile list (sorted by `captured_at`) and updates it on every write; `os.scandir` runs only once on construction (AC-7). Cache invalidation on a manually-deleted tile mirrors AZ-293's stale-list refresh.
|
||||
|
||||
## Runtime Completeness
|
||||
|
||||
- **Named capability**: per-flight mid-flight tile snapshot sidecar storage + pointer-record emission (architecture / E-C13 / AC-8.4 quality metadata, F4 mid-flight tile gen).
|
||||
- **Production code that must exist**: real `atomicwrites`-based sidecar writer, real FdrRecord pointer emission, real cap-policy + overrun record on overflow.
|
||||
- **Allowed external stubs**: tests MAY use `FakeFdrSink` (AZ-275) and a tmp `flight_root`; production wiring uses the real `FdrClient` from the composition root.
|
||||
- **Unacceptable substitutes**: inlining JPEG bytes into FdrRecords ("for now we don't have a sidecar path"), unbounded tile growth without cap enforcement ("the segment cap will catch it" — it won't, AZ-293 explicitly excludes the `tiles/` subdirectory), or skipping `atomicwrites` ("crash-tolerance is a nice-to-have") — operators ARE going to crash-resume mid-flight on Jetson hardware.
|
||||
@@ -0,0 +1,172 @@
|
||||
# C13 Failed-Tile Thumbnail Rate Limiter + AC-8.5 Forbidden-Kind Enforcement
|
||||
|
||||
**Task**: AZ-295_c13_thumbnail_rate_limiter
|
||||
**Name**: C13 AC-8.5 Forbidden-Kind + Thumbnail Rate Cap
|
||||
**Description**: Implement two paired record-policy gates required by AC-8.5 / C13-IT-03 / RESTRICT-UAV-4: (1) a synchronous producer-side validator that REFUSES `kind="raw_nav_frame"` (and any other AI-cam / nav-cam raw-frame kind) by raising `RawFrameWriteForbiddenError` BEFORE the record is enqueued, so the security violation is visible to the offending producer at the call site; (2) a writer-thread-side rate cap on `kind="failed_tile_thumbnail"` records (default ≤ 0.1 Hz per `description.md` § 7) that drops over-cap thumbnails with a WARN log + emits a `kind="overrun"` record carrying the dropped count, while letting in-cap thumbnails pass through to disk untouched. Together they enforce the only allowed raw-imagery-adjacent persistence path on the FDR.
|
||||
**Complexity**: 3 points
|
||||
**Dependencies**: AZ-291_c13_writer_thread, AZ-272_fdr_record_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
|
||||
**Component**: c13_fdr (epic AZ-248 / E-C13)
|
||||
**Tracker**: AZ-295
|
||||
**Epic**: AZ-248 (E-C13)
|
||||
|
||||
### Document Dependencies
|
||||
|
||||
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines `kind="failed_tile_thumbnail"` payload (`{frame_id, tile_id, jpeg_bytes_b64}`) and the ≤ 4 KiB inline-blob invariant the cap respects.
|
||||
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config block carrying `forbidden_record_kinds` (default frozen set including `raw_nav_frame`, `raw_ai_cam_frame`) and `failed_tile_thumbnail_max_hz` (default 0.1).
|
||||
|
||||
## Problem
|
||||
|
||||
Per AC-8.5 + RESTRICT-UAV-4, the FDR is the ONLY persistence path for raw-imagery-adjacent data, and the ONLY allowed raw-imagery-adjacent kind is `failed_tile_thumbnail`, capped at ≤ 0.1 Hz. Without:
|
||||
|
||||
- A synchronous validator that rejects `kind="raw_nav_frame"` (and equivalents) at the producer's call site, a careless or compromised producer could enqueue a stream of raw frames; by the time the writer thread sees them and drops them, gigabytes of raw imagery have already been serialised onto the wire format and (worst case) onto a segment file. Even an "asynchronous reject + drop" model leaks the bytes through transient memory.
|
||||
- A writer-side rate cap on `failed_tile_thumbnail`, a producer (C6 / C11) bug or thumbnail spam attack could push the inline-thumbnail throughput from the documented ≤ 0.1 Hz to many Hz, blowing past the inline-blob budget and burying real diagnostic records under thumbnail noise.
|
||||
|
||||
The two gates are intentionally asymmetric: forbidden-kind violation is a HARD security error visible to the caller (raw_nav_frame is never legitimate); over-cap thumbnails are a SOFT throughput control with WARN logging (over-eager producers are common; rate-limit and continue).
|
||||
|
||||
## Outcome
|
||||
|
||||
- A `RecordKindPolicy` object is the single source of truth for both gates. It exposes `enforce_or_raise(record: FdrRecord) -> None` (synchronous; raises `RawFrameWriteForbiddenError` for forbidden kinds; returns silently for everything else including `failed_tile_thumbnail`) and `gate_for_writer(record: FdrRecord) -> GateDecision` (returns `ENQUEUE` or `DROP` for thumbnail rate-cap purposes).
|
||||
- Producers (C6 / C11 thumbnail emission paths; future producers) call `policy.enforce_or_raise(record)` immediately before `fdr_client.enqueue(record)`. The composition root injects the policy; producers do not construct it themselves.
|
||||
- The writer thread (AZ-291) calls `policy.gate_for_writer(record)` immediately after dequeue. On `DROP`, the writer skips the append and emits a `kind="overrun"` record whose outer envelope `producer_id` is `"shared.fdr_client"`, with `payload.producer_id` set to the originating producer slug (see AC-10) and `payload.dropped_count` aggregated across the cap window.
|
||||
- The `failed_tile_thumbnail` rate cap uses a sliding-window counter (1-second windows summed over the last 10 seconds at 0.1 Hz default) so a producer that bursts 5 thumbnails in one second still gets averaged correctly across the window — instead of a tight token-bucket that would either reject every thumbnail after the burst or let through a steady-state too-fast trickle.
|
||||
- The forbidden-kind set is config-driven (`forbidden_record_kinds`) but its DEFAULT MUST include `raw_nav_frame` and `raw_ai_cam_frame`. Removing those defaults requires a major-version Config bump and is a security-critical review item.
|
||||
|
||||
## Scope
|
||||
|
||||
### Included
|
||||
|
||||
- `RecordKindPolicy` dataclass / class with two methods: `enforce_or_raise(record)` and `gate_for_writer(record)`.
|
||||
- Forbidden-kind enforcement:
|
||||
- `enforce_or_raise` raises `RawFrameWriteForbiddenError` if `record.kind` is in the configured forbidden set. The exception's message includes the offending kind and the producer slug from the record envelope so logs identify the source.
|
||||
- The forbidden set defaults to `frozenset({"raw_nav_frame", "raw_ai_cam_frame"})` and is configurable via `forbidden_record_kinds` on the Config. A preset may ADD kinds to the set, but the `Config` validator REJECTS any preset that REMOVES a default kind unless an explicit `unsafe_remove_default_forbidden=True` flag is set. That flag is a security-review-required path, documented as such, and does NOT exist in any standard preset.
|
||||
- Failed-tile thumbnail rate cap:
|
||||
- `gate_for_writer` checks the kind. For non-thumbnail kinds, returns `ENQUEUE`.
|
||||
- For `kind="failed_tile_thumbnail"`, applies a sliding-window rate cap at `failed_tile_thumbnail_max_hz` (default 0.1 Hz). The window is (1 / max_hz) seconds; up to one record per window passes through.
|
||||
- On `DROP`, increments a running `thumbnail_dropped_count` counter and emits ONE `kind="overrun"` record per cap window with `payload.dropped_count == accumulated_count_during_window` (coalesced; matches the AZ-274 overrun-coalescing semantics so post-flight tooling sees consistent overrun records regardless of whether the source is FdrClient queue overrun or thumbnail rate cap).
|
||||
- WARN log per drop window (`kind="fdr.thumbnail_rate_cap_exceeded"; producer_id; dropped_in_window`). Per-second rate cap on the WARN log itself (≤ 1 WARN/sec) so a thumbnail flood does not flood the operational log.
|
||||
- Composition-root wiring: `make_record_kind_policy(config)` factory; the composition root constructs ONE policy instance and injects it into both (a) every producer's enqueue path and (b) the `FileFdrWriter`'s post-dequeue gate.
|
||||
- `failed_tile_thumbnail_max_hz` config field (default 0.1; valid range > 0 .. 10.0); `0` is REJECTED at config validation (would silence thumbnails entirely; producers must declare intent explicitly via `disable_failed_tile_thumbnails=True` on a separate flag if they truly want to silence the kind — this requires a security-review-required preset, similar to forbidden-kind removal).
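
The two config-validation rules above (defaults may not be removed, and the rate must lie in the range above 0 up to 10.0) could be checked along these lines. `validate_policy_config` is a hypothetical function and the `ValueError` base class is an assumption; only the field names, defaults, and `ConfigValidationError` come from this task.

```python
DEFAULT_FORBIDDEN_KINDS = frozenset({"raw_nav_frame", "raw_ai_cam_frame"})
MAX_THUMBNAIL_HZ = 10.0


class ConfigValidationError(ValueError):
    """Raised for presets that weaken the record-kind policy (base class assumed)."""


def validate_policy_config(
    forbidden_record_kinds,
    failed_tile_thumbnail_max_hz,
    unsafe_remove_default_forbidden=False,
):
    """Return (forbidden_kinds, max_hz) or raise ConfigValidationError."""
    kinds = frozenset(forbidden_record_kinds)
    # Additions are fine; removing any default kind needs the explicit unsafe flag.
    missing = DEFAULT_FORBIDDEN_KINDS - kinds
    if missing and not unsafe_remove_default_forbidden:
        raise ConfigValidationError(
            f"forbidden_record_kinds is missing default kinds: {sorted(missing)}"
        )
    # 0 is rejected explicitly: it would silence thumbnails without stated intent.
    if not (0 < failed_tile_thumbnail_max_hz <= MAX_THUMBNAIL_HZ):
        raise ConfigValidationError(
            f"failed_tile_thumbnail_max_hz must be > 0 and <= {MAX_THUMBNAIL_HZ}, "
            f"got {failed_tile_thumbnail_max_hz!r}"
        )
    return kinds, float(failed_tile_thumbnail_max_hz)
```

Checking the set difference rather than set equality is what lets AC-7-style additions pass while AC-6-style removals fail.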
|
||||
|
||||
### Excluded
|
||||
|
||||
- Thumbnail GENERATION (orthorectification failure detection, JPEG encoding) — owned by C6 / C11 producers; this task only validates / rate-caps RECORDS already constructed.
|
||||
- Mid-flight tile snapshot SUCCESS path (sidecar storage of orthorectified tiles) — owned by AZ-294 / task #4. Failed-tile thumbnails are a DIFFERENT kind with inline (≤ 4 KiB) JPEG bytes, NOT sidecar.
|
||||
- The `kind="raw_nav_frame"` / `kind="failed_tile_thumbnail"` payload schemas — owned by AZ-272.
|
||||
- Per-segment / per-flight cap policies — owned by AZ-291 / AZ-293.
|
||||
- Producer-side rate limiting BEFORE thumbnails are constructed (e.g. C6's decision to attempt orthorectification at most every N frames) — that is per-producer concern; the C13 cap is a defense-in-depth global ceiling.
|
||||
- Cryptographic signing of records — out of scope this cycle.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**AC-1: enforce_or_raise rejects raw_nav_frame**
|
||||
Given `RecordKindPolicy` constructed with default config, and an `FdrRecord(kind="raw_nav_frame", producer_id="c1_vio", payload={...})`
|
||||
When the producer calls `enforce_or_raise(record)`
|
||||
Then `RawFrameWriteForbiddenError` is raised; the message includes both `"raw_nav_frame"` and `"c1_vio"`; no record is enqueued (the producer's subsequent `fdr_client.enqueue` is never reached because the exception propagates out of the call site first)
|
||||
|
||||
**AC-2: enforce_or_raise rejects raw_ai_cam_frame**
|
||||
Given the default-configured policy
|
||||
When `enforce_or_raise` is called with `kind="raw_ai_cam_frame"`
|
||||
Then `RawFrameWriteForbiddenError` is raised (same as AC-1)
|
||||
|
||||
**AC-3: enforce_or_raise passes through failed_tile_thumbnail**
|
||||
Given the default policy and `FdrRecord(kind="failed_tile_thumbnail", payload={frame_id: 1, tile_id: "x", jpeg_bytes_b64: "..."})`
|
||||
When `enforce_or_raise` is called
|
||||
Then the call returns silently; no exception is raised; the producer is free to enqueue
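
A minimal sketch of the producer-side gate covered by AC-1 to AC-3. The `FdrRecord` dataclass here is a hypothetical stand-in for the real schema type (owned by AZ-272), and the exception's base class and message wording are assumptions.

```python
from dataclasses import dataclass


class RawFrameWriteForbiddenError(Exception):
    """Raised at the producer's call site for forbidden record kinds."""


@dataclass(frozen=True)
class FdrRecord:
    # Hypothetical stand-in with only the fields this sketch needs.
    kind: str
    producer_id: str
    payload: dict


class RecordKindPolicy:
    def __init__(self, forbidden_kinds):
        # Copied into a frozenset once; never mutated for the policy's lifetime.
        self._forbidden_kinds = frozenset(forbidden_kinds)

    @property
    def forbidden_kinds(self):
        # Read-only view: there is no setter, so assignment raises AttributeError.
        return self._forbidden_kinds

    def enforce_or_raise(self, record):
        """Raise for forbidden kinds; return None for everything else."""
        if record.kind in self._forbidden_kinds:
            raise RawFrameWriteForbiddenError(
                f"record kind {record.kind!r} from producer "
                f"{record.producer_id!r} is forbidden"
            )
```

A producer would call `policy.enforce_or_raise(record)` on the line directly before `fdr_client.enqueue(record)`, so a raised exception prevents the enqueue from running at all.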
|
||||
|
||||
**AC-4: gate_for_writer admits in-cap thumbnails**
|
||||
Given `failed_tile_thumbnail_max_hz=0.1` (one per 10 s window) and the writer is starting fresh
|
||||
When `gate_for_writer(record)` is called once with a `failed_tile_thumbnail` record
|
||||
Then the return value is `ENQUEUE`; the record proceeds to disk
|
||||
|
||||
**AC-5: gate_for_writer drops over-cap thumbnails + emits coalesced overrun record**
|
||||
Given `failed_tile_thumbnail_max_hz=0.1` and 5 thumbnails arrive within a single 10-second window
|
||||
When the writer calls `gate_for_writer` on each
|
||||
Then the FIRST returns `ENQUEUE`; the next 4 return `DROP`; ONE `kind="overrun"` record is emitted at the end of the window with `payload.dropped_count==4` and `payload.producer_id=<originating producer>`; the WARN log fires at most once per second
|
||||
|
||||
**AC-6: Forbidden set REJECTS removal of defaults**
|
||||
Given a Config preset that attempts to set `forbidden_record_kinds = frozenset()` (empty — removing all defaults)
|
||||
When the Config is validated
|
||||
Then a `ConfigValidationError` is raised naming the missing default kinds; the policy cannot be constructed from this config
|
||||
|
||||
**AC-7: Forbidden set ALLOWS additions**
|
||||
Given a Config preset that sets `forbidden_record_kinds = frozenset({"raw_nav_frame", "raw_ai_cam_frame", "raw_thermal_frame"})`
|
||||
When the policy is constructed
|
||||
Then the policy rejects all three kinds via `enforce_or_raise`; the existing tests for the original two kinds still pass
|
||||
|
||||
**AC-8: Hz=0 is rejected at config validation**
|
||||
Given a Config preset with `failed_tile_thumbnail_max_hz=0`
|
||||
When the Config is validated
|
||||
Then a `ConfigValidationError` is raised; the policy cannot be constructed
|
||||
|
||||
**AC-9: Sliding window resets — bursts spread across windows are admitted**
|
||||
Given `failed_tile_thumbnail_max_hz=0.1` and one thumbnail at t=0, one at t=11s, one at t=22s
|
||||
When `gate_for_writer` is called for each
|
||||
Then ALL THREE return `ENQUEUE` (one per window); zero overrun records are emitted
|
||||
|
||||
**AC-10: Producer slug propagates to overrun.payload.producer_id under cap-driven drops**
|
||||
Given thumbnails arriving under cap-driven drop conditions, with the originating producer being `c6_tile_cache`
|
||||
When the overrun record is emitted
|
||||
Then `payload.producer_id == "c6_tile_cache"`, matching the producer the original thumbnails came from. Only the OUTER envelope's `producer_id` is `"shared.fdr_client"`, per the schema contract.
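
The writer-side behaviour in AC-4, AC-5, AC-9, and AC-10 can be sketched under one simple reading of "up to one record per window": a window of `1 / max_hz` seconds opens at each admitted thumbnail. `ThumbnailRateGate`, its `gate` signature, the injected `now_s` clock, and the dict-shaped overrun records are all assumptions; the task's actual method is `gate_for_writer(record) -> GateDecision`. In this sketch overruns are flushed lazily, when the next window opens or when `flush()` is called.

```python
from enum import Enum

THUMBNAIL_KIND = "failed_tile_thumbnail"


class GateDecision(Enum):
    ENQUEUE = "enqueue"
    DROP = "drop"


class ThumbnailRateGate:
    def __init__(self, max_hz):
        self._window_s = 1.0 / max_hz   # 10 s at the 0.1 Hz default
        self._window_start = None
        self._dropped = {}              # originating producer -> drops in this window

    def gate(self, kind, producer_id, now_s):
        """Return (decision, overrun_records_to_emit)."""
        if kind != THUMBNAIL_KIND:
            return GateDecision.ENQUEUE, []
        window_open = (
            self._window_start is not None
            and now_s - self._window_start < self._window_s
        )
        if not window_open:
            # A new window starts: report the previous window's drops, admit this one.
            overruns = self.flush()
            self._window_start = now_s
            return GateDecision.ENQUEUE, overruns
        self._dropped[producer_id] = self._dropped.get(producer_id, 0) + 1
        return GateDecision.DROP, []

    def flush(self):
        """Coalesce pending drops into one overrun record per originating producer."""
        records = [
            {
                "kind": "overrun",
                "producer_id": "shared.fdr_client",   # outer envelope
                "payload": {"producer_id": producer, "dropped_count": count},
            }
            for producer, count in self._dropped.items()
        ]
        self._dropped = {}
        return records
```

Passing the clock in as `now_s` keeps the gate deterministic under test; a production caller would presumably feed it a monotonic clock reading.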
|
||||
|
||||
## Non-Functional Requirements
|
||||
|
||||
**Performance**
|
||||
- `enforce_or_raise` p99 ≤ 1 µs (a single set membership check; no allocation).
|
||||
- `gate_for_writer` p99 ≤ 5 µs on the in-cap path; p99 ≤ 10 µs on the cap-driven drop path (sliding-window counter update + overrun-record construction).
|
||||
- Both methods are allocation-free on the steady-state in-cap path.
|
||||
|
||||
**Reliability**
|
||||
- The forbidden-kind set is read once at policy construction and stored as a `frozenset` (immutable across the policy's lifetime). Runtime mutation via reflection is detected by code-review Phase 7 (architecture/security).
|
||||
- The sliding-window counter is per-policy-instance, not global; a single policy serves the whole flight (one composition-root construction). Resetting between flights happens via a new policy instance at takeoff.
|
||||
- The policy's WARN-log rate cap uses the same `kind="fdr.write_failure"` rate cap pattern from AZ-291 (≤ 1 WARN/sec) — implemented inside the policy, no shared rate-limit state with the writer thread.
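
The immutability requirement can be illustrated with a read-only property over a `frozenset`. The class name `RecordKindPolicySketch` and its constructor argument are assumptions; the real policy's attribute layout may differ.

```python
from typing import Iterable, List


class RecordKindPolicySketch:
    """Holds the forbidden kinds as a frozenset behind a setter-less property."""

    def __init__(self, forbidden_record_kinds: Iterable[str]) -> None:
        self._forbidden_kinds = frozenset(forbidden_record_kinds)

    @property
    def forbidden_kinds(self) -> frozenset:
        return self._forbidden_kinds


def try_mutations(policy: RecordKindPolicySketch) -> List[str]:
    """Attempt the two obvious mutations and record which exception each raises."""
    outcomes = []
    try:
        policy.forbidden_kinds = frozenset()  # no setter on the property
    except AttributeError:
        outcomes.append("rebind: AttributeError")
    try:
        policy.forbidden_kinds.add("state.tick")  # frozenset has no add()
    except AttributeError:
        outcomes.append("add: AttributeError")
    return outcomes


policy = RecordKindPolicySketch(["raw_nav_frame", "raw_ai_cam_frame"])
outcomes = try_mutations(policy)
```

Assigning to the private `_forbidden_kinds` attribute would still succeed in this sketch, which is why the spec leaves reflection-style mutation to code review.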
|
||||
|
||||
## Unit Tests
|
||||
|
||||
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | enforce_or_raise on `kind="raw_nav_frame"` | RawFrameWriteForbiddenError; message contains kind + producer_id |
| AC-2 | enforce_or_raise on `kind="raw_ai_cam_frame"` | RawFrameWriteForbiddenError |
| AC-3 | enforce_or_raise on `kind="failed_tile_thumbnail"` | Returns silently |
| AC-4 | gate_for_writer for first thumbnail in fresh window | Returns ENQUEUE |
| AC-5 | 5 thumbnails in one 10 s window | First ENQUEUE; next 4 DROP; one overrun record (dropped_count=4); ≤ 1 WARN |
| AC-6 | Empty `forbidden_record_kinds` config | ConfigValidationError |
| AC-7 | Adding `raw_thermal_frame` to forbidden set | All three kinds rejected; defaults still rejected |
| AC-8 | `failed_tile_thumbnail_max_hz=0` | ConfigValidationError |
| AC-9 | Thumbnails at t=0, t=11, t=22 with 10 s window | All three ENQUEUE; zero overrun records |
| AC-10 | Drop scenario with originating producer `c6_tile_cache` | overrun record's `payload.producer_id == "c6_tile_cache"` |
| NFR-perf-enforce | Microbench `enforce_or_raise` 10k iter | p99 ≤ 1 µs |
| NFR-perf-gate-allow | Microbench `gate_for_writer` in-cap | p99 ≤ 5 µs |
| NFR-perf-gate-drop | Microbench `gate_for_writer` over-cap | p99 ≤ 10 µs |
| NFR-reliability-immutable | Attempt to mutate `policy.forbidden_kinds` after construction | TypeError (frozenset) or AttributeError (no setter) |
|
||||
|
||||
## Constraints
|
||||
|
||||
- The forbidden-kind set is defense-in-depth, NOT the primary line of defense. Producers MUST NOT construct `raw_nav_frame` records in the first place (that is owned by their respective component specs); this gate catches regressions and malicious producers.
|
||||
- The sliding-window counter MUST be O(1) update per call; an O(N) implementation that scans a list of timestamps is rejected at code-review Phase 7 (architecture).
|
||||
- The cap and forbidden set apply globally across all producers within a flight, NOT per-producer. The 0.1 Hz budget is a single shared capacity for the FDR's inline thumbnail throughput and is not partitioned between producers. (Per-producer caps, if needed, are owned by individual component specs.)
|
||||
- This task does NOT introduce new dependencies. Stdlib `time.monotonic_ns` + a fixed-size deque (or constant counter) suffice for the sliding window.
|
||||
|
||||
## Risks & Mitigation
|
||||
|
||||
**Risk 1: AC-7's "additions allowed" path is abused to add legitimate kinds (e.g. `state.tick`)**
|
||||
- *Risk*: A misconfigured deployment adds `state.tick` to the forbidden set and silently breaks the entire FDR.
|
||||
- *Mitigation*: Config validation cross-checks the forbidden set against the v1.0.0 schema's closed enum of legitimate kinds and REJECTS additions that are in the schema. The forbidden set is intended to be a SUBSET of "kinds that don't appear in v1.0.0 closed enum + raw-frame variants we explicitly want to ban". Documented in the Config validator + AC-6 tests.
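
The cross-check described in this mitigation might look like the sketch below. The function name `validate_forbidden_kinds`, the local `ConfigValidationError` stand-in, and the `SCHEMA_KINDS` set (an illustrative subset, not the real v1.0.0 enum) are assumptions.

```python
from typing import Iterable

DEFAULT_FORBIDDEN = frozenset({"raw_nav_frame", "raw_ai_cam_frame"})

# Illustrative subset only; the real closed enum lives in the FDR record schema.
SCHEMA_KINDS = frozenset(
    {"state.tick", "failed_tile_thumbnail", "mid_flight_tile_snapshot", "overrun"}
)


class ConfigValidationError(ValueError):
    """Local stand-in for the project's config validation error."""


def validate_forbidden_kinds(configured: Iterable[str]) -> frozenset:
    forbidden = frozenset(configured)
    if not forbidden:
        raise ConfigValidationError("forbidden_record_kinds must not be empty")
    missing = DEFAULT_FORBIDDEN - forbidden
    if missing:
        raise ConfigValidationError(f"default kinds cannot be removed: {sorted(missing)}")
    clash = forbidden & SCHEMA_KINDS
    if clash:
        # Banning a legitimate schema kind would silently break the FDR.
        raise ConfigValidationError(f"legitimate schema kinds cannot be forbidden: {sorted(clash)}")
    return forbidden


extended = validate_forbidden_kinds(DEFAULT_FORBIDDEN | {"raw_thermal_frame"})
```

Adding `raw_thermal_frame` passes because it is not a schema kind, while adding `state.tick` would be rejected.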
|
||||
|
||||
**Risk 2: Producer-side enforce_or_raise wrapper not actually called**
|
||||
- *Risk*: A future producer forgets to call `policy.enforce_or_raise` and goes straight to `fdr_client.enqueue` — bypassing the synchronous gate.
|
||||
- *Mitigation*: A code-review Phase 2 (Spec Compliance) check requires every producer calling `fdr_client.enqueue` to also call `policy.enforce_or_raise` immediately before. The writer-side `gate_for_writer` is the defense-in-depth catch — even if a forbidden-kind record sneaks past the producer, the writer drops it and emits an `overrun` record (the security AC is "no raw frame on disk", not "no raw frame in producer memory"). Both gates exist precisely so producer-side bypasses become observable in logs.
|
||||
|
||||
**Risk 3: Sliding-window counter clock drift**
|
||||
- *Risk*: `time.monotonic_ns` is per-process; if the process is suspended (Jetson power management) the window appears to compress.
|
||||
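- *Mitigation*: On Linux (the Jetson target), CPython's `time.monotonic_ns` is backed by `CLOCK_MONOTONIC`, which does NOT advance during suspend, so on resume the counter sees no gap at all. Suspended time is simply not counted toward the window: the current window resumes where it left off, and the cap can only admit fewer thumbnails per wall-clock second, never more. Documented; no special mitigation needed.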
|
||||
|
||||
**Risk 4: WARN log rate cap interferes with debugging**
|
||||
- *Risk*: An operator investigating a thumbnail flood sees only one WARN per second and misses the burst pattern.
|
||||
- *Mitigation*: The OVERRUN RECORD emitted into the FDR carries the per-window `dropped_count`; that is the canonical record. The WARN log is operator convenience only. Documented in the policy's docstring.
|
||||
|
||||
## Runtime Completeness
|
||||
|
||||
- **Named capability**: AC-8.5 forbidden-kind synchronous enforcement + failed-tile thumbnail rate cap (architecture / E-C13 / AC-8.5 / RESTRICT-UAV-4 / C13-IT-03).
|
||||
- **Production code that must exist**: real `RecordKindPolicy` with both methods, real composition-root wiring into producer paths AND the writer thread, real sliding-window counter, real overrun-record emission on drop.
|
||||
- **Allowed external stubs**: tests MAY use `FakeFdrSink` (AZ-275); production wiring uses the real shared FdrClient.
|
||||
- **Unacceptable substitutes**: writer-only enforcement without producer-side `enforce_or_raise` ("the writer will catch it" — too late, the bytes already crossed the wire format), config that allows removing `raw_nav_frame` from defaults silently ("operators know what they're doing"), token-bucket without coalescing ("we'll emit one overrun per drop") — all break C13-IT-03 + AC-NEW-3 + the fundamental AC-8.5 invariant that raw frames MUST NEVER touch durable storage.
|
||||
@@ -0,0 +1,152 @@
|
||||
# C13 FdrOpenError → Takeoff Abort Path
|
||||
|
||||
**Task**: AZ-296_c13_open_error_takeoff_abort
|
||||
**Name**: C13 Takeoff Abort on FdrOpenError
|
||||
**Description**: Wire the composition root's takeoff sequence so that `FdrOpenError` raised by `FileFdrWriter.open_flight()` (AZ-292) aborts takeoff BEFORE the C8 FC adapter is opened. This is the AC-NEW-3 every-payload-class-from-t=0 enforcement gate: if the FDR cannot persist records starting at t=0, the system MUST NOT emit external positions to the flight controller, because the audit trail proving "we made every safety-critical decision at t=0" would be missing. The abort is a HARD failure (the companion process exits with a non-zero status code so systemd / the Jetson init system surfaces it to the operator); it does NOT silently degrade.
|
||||
**Complexity**: 2 points
|
||||
**Dependencies**: AZ-291_c13_writer_thread, AZ-292_c13_flight_header_footer, AZ-263_initial_structure, AZ-266_log_module
|
||||
**Component**: composition_root + c13_fdr (epic AZ-248 / E-C13)
|
||||
**Tracker**: AZ-296
|
||||
**Epic**: AZ-248 (E-C13)
|
||||
|
||||
### Document Dependencies
|
||||
|
||||
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — defines the takeoff-sequence contract that this task amends with the FDR-first ordering invariant.
|
||||
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — operational ERROR log shape this task uses for the abort message.
|
||||
|
||||
## Problem
|
||||
|
||||
The takeoff sequence in the composition root currently has no enforced ordering between FDR open and FC adapter open. Without this task:
|
||||
|
||||
- A composition root could open C8 (FC adapter — pymavlink, MSP) BEFORE C13 (FDR), so external positions start streaming to the flight controller before `FileFdrWriter.open_flight()` has confirmed the segment file is writable. The companion would silently emit positions for which there is no audit record at t=0.
|
||||
- A misconfigured `flight_root` (read-only mount, missing parent directory, full filesystem) would surface only AFTER takeoff has begun — too late for the operator to fix the configuration on the ground.
|
||||
- C13-IT-06 ("refuse takeoff if `open_flight` fails") would fail because there is no take-off-abort path; the test would observe the FC adapter wired despite the FDR failing to open.
|
||||
|
||||
## Outcome
|
||||
|
||||
- The composition root's takeoff sequence is strictly ordered: (1) construct `FileFdrWriter`, (2) call `start()`, (3) call `open_flight(header)`, (4) ONLY IF (3) succeeded, construct + open the C8 FC adapter, (5) start every other component.
|
||||
- If step (3) raises `FdrOpenError`, the composition root catches the exception, logs an ERROR via the shared logger (`kind="composition_root.takeoff_aborted"; reason="fdr_open_error"; underlying=<str(exc)>`), tears down any partially-constructed components (the writer's `start()` is rolled back via its `stop()` so the filelock is released), and exits the process with a non-zero status code (specifically `2` — distinct from `1` which the project reserves for generic startup failures).
|
||||
- The exit message printed to stderr names the offending `flight_root` path so the operator can immediately see "the FDR root I configured is wrong" — no log-diving required.
|
||||
- The abort path is exercised end-to-end by an integration-style test that constructs a composition root with a read-only `flight_root`, runs it, and asserts (a) the FC adapter was NOT instantiated, (b) the process exits with status 2, (c) the stderr message names the path.
|
||||
- C13-IT-06 (per `_docs/02_document/components/14_c13_fdr/tests.md`) is fully satisfied by this task in combination with AZ-292.
|
||||
|
||||
## Scope
|
||||
|
||||
### Included
|
||||
|
||||
- Modification to the composition root's takeoff sequence to enforce the strict ordering above. The composition root is `src/gps_denied_onboard/runtime_root.py` per AZ-263 / module-layout.md; the change is localised to the takeoff section.
|
||||
- A `try/except FdrOpenError` block around `open_flight(header)` that:
|
||||
1. Logs ONE ERROR record via the shared logger (`kind="composition_root.takeoff_aborted"`, `level="ERROR"`, `kv={"reason": "fdr_open_error", "underlying": str(exc), "flight_root": str(config.fdr_writer.flight_root)}`).
|
||||
2. Calls `writer.stop()` to release the filelock + close any open segment file (no-op if `start()` failed before any segment was opened).
|
||||
3. Prints a single line to stderr: `FATAL: cannot open FDR at <flight_root>: <underlying message>; aborting takeoff (exit 2)`.
|
||||
4. Calls `sys.exit(2)`.
|
||||
- The exit status is exactly `2` for FDR-open failures; the constants `EXIT_GENERIC_FAILURE=1` and `EXIT_FDR_OPEN_FAILURE=2` are documented in the composition_root_protocol contract (this task adds the new constant and the contract entry).
|
||||
- An integration-style test fixture under `tests/integration/composition_root/` that constructs a composition root with a controlled `flight_root` path that fails to open (read-only directory) and asserts the documented behaviour.
|
||||
- Update to `_docs/02_document/contracts/shared_config/composition_root_protocol.md` to document the strict takeoff ordering and the `EXIT_FDR_OPEN_FAILURE=2` constant. The contract update is in scope (this task touches the contract that other consumers read).
|
||||
- Validation that the C8 FC adapter constructor / `open()` call sites are NOT reached on the FdrOpenError path. This is verified by the integration test (`assert fc_adapter_constructor.call_count == 0`) and by a code-review Phase 2 (Spec Compliance) check that walks the takeoff sequence statically.
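
The ordering and the four abort steps listed above can be sketched as a single function. This is a simplified illustration under stated assumptions: `take_off`'s parameter list, the `log_error` callable, the local `FdrOpenError` stand-in, and the `ReadOnlyWriter` test double are hypothetical, and the real composition root in `runtime_root.py` wires many more components.

```python
import io
import sys

EXIT_FDR_OPEN_FAILURE = 2


class FdrOpenError(Exception):
    """Local stand-in for the exception owned by AZ-292."""


def take_off(writer, header, flight_root, make_fc_adapter, log_error, stderr=None):
    out = stderr if stderr is not None else sys.stderr
    writer.start()
    try:
        writer.open_flight(header)
    except FdrOpenError as exc:
        # 1. One ERROR record, 2. roll back the writer, 3. FATAL line, 4. exit 2.
        log_error(
            kind="composition_root.takeoff_aborted",
            level="ERROR",
            kv={"reason": "fdr_open_error", "underlying": str(exc),
                "flight_root": str(flight_root)},
        )
        writer.stop()
        print(f"FATAL: cannot open FDR at {flight_root}: {exc}; "
              "aborting takeoff (exit 2)", file=out)
        sys.exit(EXIT_FDR_OPEN_FAILURE)
    # Reached only when open_flight succeeded.
    adapter = make_fc_adapter()
    adapter.open()
    return adapter


class ReadOnlyWriter:
    """Test double whose open_flight always fails."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def open_flight(self, header):
        self.calls.append("open_flight")
        raise FdrOpenError("Read-only file system")

    def stop(self):
        self.calls.append("stop")


logged, fc_constructed, captured = [], [], io.StringIO()
writer = ReadOnlyWriter()
exit_code = None
try:
    take_off(writer, {}, "/read-only/path",
             make_fc_adapter=lambda: fc_constructed.append(1),
             log_error=lambda **record: logged.append(record),
             stderr=captured)
except SystemExit as exit_info:
    exit_code = exit_info.code
```

Catching `SystemExit` in-process is only for demonstration here; it shows the ordering but not the real process exit status.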
|
||||
|
||||
### Excluded
|
||||
|
||||
- The actual implementation of `open_flight` and `FdrOpenError` — owned by AZ-292.
|
||||
- The writer's `start()` / `stop()` lifecycle — owned by AZ-291.
|
||||
- Recovery from `FdrOpenError` (e.g. retrying with a fallback `flight_root`) — explicitly NOT in scope. AC-NEW-3 says every payload class must be present from t=0; a fallback would violate the spirit of the AC by accepting a degraded FDR. The operator must fix the config and restart.
|
||||
- Other takeoff-abort triggers (e.g. C7 inference engine load failure, C8 FC handshake failure) — those have their own composition-root abort paths owned by the respective component epics and the composition_root contract.
|
||||
- GCS alert on takeoff abort — the companion is on the ground, not yet emitting to GCS; the abort surfaces via stderr + exit code, NOT GCS STATUSTEXT (which requires the FC adapter, which we are NOT opening). Documented as a constraint.
|
||||
- Runtime FDR failure (`OSError` mid-flight) — that is owned by AZ-291's degraded-mode path with its own GCS alert.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**AC-1: FdrOpenError raised → process exits with status 2**
|
||||
Given a composition root configured with `flight_root=/read-only/path` (where `open_flight()` will raise `FdrOpenError`)
|
||||
When the composition root's takeoff sequence runs
|
||||
Then the process exits with status code exactly 2; no other component (especially C8 FC adapter) is constructed; the writer's filelock is released
|
||||
|
||||
**AC-2: Stderr message names the flight_root path**
|
||||
Given AC-1's setup
|
||||
When the test captures stderr
|
||||
Then stderr contains exactly one line matching `^FATAL: cannot open FDR at /read-only/path: .*; aborting takeoff \(exit 2\)$`; no other FATAL lines are printed
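
A check for this criterion could apply the regex line by line. The helper name `count_fatal_lines` and the sample stderr text are assumptions for illustration; the pattern itself is the one quoted above.

```python
import re
from typing import Tuple

FATAL_LINE = re.compile(
    r"^FATAL: cannot open FDR at /read-only/path: .*; aborting takeoff \(exit 2\)$"
)


def count_fatal_lines(stderr_text: str) -> Tuple[int, int]:
    """Return (lines matching the documented pattern, lines starting with FATAL)."""
    lines = stderr_text.splitlines()
    matching = sum(1 for line in lines if FATAL_LINE.match(line))
    any_fatal = sum(1 for line in lines if line.startswith("FATAL"))
    return matching, any_fatal


sample = (
    "some earlier diagnostic output\n"
    "FATAL: cannot open FDR at /read-only/path: Read-only file system; "
    "aborting takeoff (exit 2)\n"
)
counts = count_fatal_lines(sample)
```

Requiring both counts to equal 1 covers the "no other FATAL lines" half of the criterion as well as the match itself.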
|
||||
|
||||
**AC-3: ERROR log record includes underlying exception message**
|
||||
Given AC-1's setup
|
||||
When the test parses the structured log records
|
||||
Then exactly one record exists with `kind="composition_root.takeoff_aborted"`, `level="ERROR"`, `kv.reason=="fdr_open_error"`, `kv.flight_root=="/read-only/path"`, `kv.underlying` containing the underlying `FdrOpenError`'s message
|
||||
|
||||
**AC-4: C8 FC adapter is NOT constructed on the abort path**
|
||||
Given AC-1's setup AND a test double for the C8 FC adapter that records constructor invocations
|
||||
When the takeoff sequence aborts
|
||||
Then the C8 FC adapter test double's constructor was called 0 times; no MAVLink / MSP socket is opened
|
||||
|
||||
**AC-5: Successful open_flight proceeds to FC adapter**
|
||||
Given a writable `flight_root` and a normal Config
|
||||
When the takeoff sequence runs
|
||||
Then `open_flight()` returns; the C8 FC adapter IS constructed AFTER `open_flight()` returns; the process does NOT exit with status 2
|
||||
|
||||
**AC-6: writer.stop() is called on the abort path**
|
||||
Given AC-1's setup AND a writer test double that records `start` / `stop` calls
|
||||
When the takeoff aborts
|
||||
Then `writer.stop()` was called exactly once after the FdrOpenError; the filelock is released (a subsequent process can construct a new writer for the same `flight_root` without error)
|
||||
|
||||
**AC-7: Non-FdrOpenError exceptions are NOT caught by this handler**
|
||||
Given a writer that raises `RuntimeError("boom")` from `open_flight` (NOT FdrOpenError)
|
||||
When the takeoff sequence runs
|
||||
Then the `RuntimeError` propagates UP (it is not swallowed by the FdrOpenError handler); the process exits with status 1 (generic failure path) — NOT status 2
|
||||
|
||||
**AC-8: Strict ordering — FdrWriter constructed and started before FC adapter constructor is called**
|
||||
Given a composition-root unit test that records the order of constructor calls
|
||||
When the takeoff sequence runs (success path)
|
||||
Then the order is: `FileFdrWriter.__init__` → `writer.start()` → `writer.open_flight(header)` → `<C8 adapter constructor>` → `<C8 adapter open()>`; any other order fails the test
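
One way to assert this order is a single parent `unittest.mock.Mock`, whose `mock_calls` list records calls on its children and their return values in sequence. The `run_takeoff` function below is a hypothetical stand-in for the success path of the takeoff sequence, and `FcAdapter` is a placeholder name for the C8 adapter class.

```python
from unittest.mock import Mock, call


def run_takeoff(writer_factory, fc_adapter_factory, header):
    """Hypothetical success-path sequence under test."""
    writer = writer_factory()
    writer.start()
    writer.open_flight(header)
    adapter = fc_adapter_factory()
    adapter.open()


# Children of one parent mock share a single ordered call log.
parent = Mock()
run_takeoff(parent.FileFdrWriter, parent.FcAdapter, header="hdr")

expected_order = [
    call.FileFdrWriter(),
    call.FileFdrWriter().start(),
    call.FileFdrWriter().open_flight("hdr"),
    call.FcAdapter(),
    call.FcAdapter().open(),
]
order_ok = parent.mock_calls == expected_order
```

Because the comparison is on the whole list, moving the adapter construction ahead of `open_flight` changes `mock_calls` and fails the check.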
|
||||
|
||||
## Non-Functional Requirements
|
||||
|
||||
**Performance**
|
||||
- The takeoff abort path completes within 500 ms of the FdrOpenError being raised (writer.stop() + log + stderr write + exit). Operators must see the abort signal immediately, not after a long teardown.
|
||||
|
||||
**Reliability**
|
||||
- The abort path MUST NOT itself raise into the caller. Any exception inside the abort handler (e.g. `writer.stop()` itself raising) is swallowed with a SECOND ERROR log; the process still exits with status 2 (with `os._exit(2)` if `sys.exit(2)` is intercepted somehow).
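
A possible shape for such a handler is sketched below. The function `abort_takeoff`, its injectable `soft_exit` / `hard_exit` parameters, and the `BrokenWriter` double are assumptions made so the example can run without terminating the interpreter. Note the limit of this approach: the fallback only fires when the soft exit call returns; it cannot react to a `SystemExit` that is caught further up the stack after the handler has unwound.

```python
import os
import sys


def abort_takeoff(writer, log_error, exit_code=2,
                  soft_exit=sys.exit, hard_exit=os._exit):
    try:
        writer.stop()
    except Exception as exc:
        # Second ERROR log; the failure must not escape the abort path.
        log_error(f"writer.stop() failed during takeoff abort: {exc}")
    soft_exit(exit_code)  # normally raises SystemExit and never returns
    hard_exit(exit_code)  # reached only if soft_exit returned


class BrokenWriter:
    """Test double whose stop() fails."""

    def stop(self):
        raise OSError("lock file already removed")


errors, exits = [], []
abort_takeoff(
    BrokenWriter(),
    log_error=errors.append,
    soft_exit=lambda code: exits.append(("soft", code)),  # simulates an intercepted exit
    hard_exit=lambda code: exits.append(("hard", code)),
)
```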
|
||||
|
||||
## Unit Tests
|
||||
|
||||
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | composition_root with read-only flight_root | exit status 2; no other component constructed; filelock released |
| AC-2 | Capture stderr | Exactly one matching FATAL line naming the flight_root path |
| AC-3 | Parse log records | Exactly one ERROR record with the documented kind + kv |
| AC-4 | Mock C8 adapter; trigger abort | C8 constructor `call_count == 0` |
| AC-5 | Writable flight_root | open_flight succeeds; C8 IS constructed after; no exit 2 |
| AC-6 | Mock writer; trigger abort | writer.stop() called exactly once |
| AC-7 | Writer raises RuntimeError from open_flight | RuntimeError propagates; exit status 1, not 2 |
| AC-8 | Spy on constructor / method invocation order | Strict order: writer init → start → open_flight → C8 init → C8 open |
| NFR-perf-abort | Time abort path from FdrOpenError to exit | ≤ 500 ms |
| NFR-reliability-abort-resilience | writer.stop() raises during abort | Second ERROR logged; process still exits with 2 |
|
||||
|
||||
## Constraints
|
||||
|
||||
- The takeoff abort exit code is FIXED at `2`; changing it is a breaking change to the composition_root contract and operator runbooks. The constant `EXIT_FDR_OPEN_FAILURE=2` lives in the composition root and is documented in `composition_root_protocol.md`.
|
||||
- The abort path uses `sys.exit(2)` first (so `atexit` handlers run and structured logs flush); only if `sys.exit` does not actually exit (e.g. caught somewhere up the stack — this should not happen but the abort handler is defensive) does it fall back to `os._exit(2)`.
|
||||
- The stderr message format is fixed (matches AC-2's regex). Operator runbooks grep for this exact pattern to surface FDR misconfigs in the field.
|
||||
- This task does NOT introduce new dependencies. `sys`, `os`, and the existing logger are sufficient.
|
||||
|
||||
## Risks & Mitigation
|
||||
|
||||
**Risk 1: A future refactor moves the C8 FC adapter constructor BEFORE the FDR open**
|
||||
- *Risk*: An optimization that "opens the FC adapter early to warm the link" silently breaks AC-NEW-3.
|
||||
- *Mitigation*: AC-8's strict-ordering test runs in CI on every change to `runtime_root.py`. Code-review Phase 2 (Spec Compliance) explicitly checks that the FdrWriter open precedes the FC adapter constructor.
|
||||
|
||||
**Risk 2: `sys.exit(2)` interferes with pytest test runners**
|
||||
- *Risk*: The test asserts on exit status 2, but an in-process `sys.exit(2)` only raises `SystemExit` inside the pytest process, so the test never observes the real process exit status.
|
||||
- *Mitigation*: The integration test runs the composition root in a subprocess (`subprocess.run([...])`) and asserts on `proc.returncode == 2`. Documented in the test fixture; pytest's in-process exit interception is sidestepped.
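
The subprocess approach can be sketched as follows. The inline `CHILD` script is a hypothetical stand-in for launching the real composition root against a read-only `flight_root`; only the `subprocess.run` pattern and the `returncode` assertion reflect the mitigation described here.

```python
import subprocess
import sys

# Stand-in child: prints the documented FATAL line and exits with status 2.
CHILD = (
    "import sys; "
    "print('FATAL: cannot open FDR at /read-only/path: Read-only file system; "
    "aborting takeoff (exit 2)', file=sys.stderr); "
    "sys.exit(2)"
)

proc = subprocess.run(
    [sys.executable, "-c", CHILD],
    capture_output=True,  # collect stdout and stderr for assertions
    text=True,
    timeout=30,
)
```

Because the exit happens in a separate interpreter, `proc.returncode` is the real process status and `proc.stderr` is exactly what an operator would see.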
|
||||
|
||||
**Risk 3: The abort handler swallows the FdrOpenError stack trace, making field debugging hard**
|
||||
- *Risk*: The operator sees `FATAL: cannot open FDR at /path: <one-line message>` but the underlying cause (e.g. ENOSPC vs. EACCES vs. ENOENT) is hidden.
|
||||
- *Mitigation*: AC-3's `kv.underlying` field carries the full `str(exc)` from the FdrOpenError; the structured log record preserves the full causal chain. The stderr line is the operator-facing summary; the log is the debug trail.
|
||||
|
||||
**Risk 4: Operators might want a "continue without FDR" override flag**
|
||||
- *Risk*: Field debugging pressure leads to a `--ignore-fdr-failure` CLI flag that violates AC-NEW-3.
|
||||
- *Mitigation*: This task EXPLICITLY excludes such an override (per the Excluded section). The contract update documents that no such override is permitted; adding one is a major-version bump on `composition_root_protocol` AND a security-review-required change. Documented as a constraint.
|
||||
|
||||
## Runtime Completeness
|
||||
|
||||
- **Named capability**: AC-NEW-3 every-payload-class-from-t=0 takeoff gate (architecture / E-C13 / AC-NEW-3 / C13-IT-06 / RESTRICT-UAV-4).
|
||||
- **Production code that must exist**: real composition-root takeoff-sequence ordering, real `try/except FdrOpenError` handler, real `sys.exit(2)` (with `os._exit(2)` fallback), real `writer.stop()` rollback, real ERROR log + stderr message.
|
||||
- **Allowed external stubs**: tests MAY use a subprocess + temp-directory `flight_root`; production wiring uses the real composition root.
|
||||
- **Unacceptable substitutes**: a "warning, not abort" path ("the operator can decide"), exit code 1 ("we don't need a separate FDR-failure code"), opening the FC adapter before the FDR ("optimisation; we'll close it if FDR fails") — all break C13-IT-06 and the AC-NEW-3 invariant.
|
||||