Decompose Step 6 snapshot: 140 task specs + contract docs

Closes out greenfield Step 6 (Decompose) for all 14 components
(C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446
plus the _dependencies_table.md and component contract documents.

State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-11 00:39:48 +03:00
parent 8171fcb29e
commit 880eabcb3f
172 changed files with 22897 additions and 35 deletions
@@ -0,0 +1,161 @@
# C7 EngineGate — D-C10-3 Content-Hash + D-C10-7 Filename Schema Enforcement
**Task**: AZ-301_c7_engine_gate
**Name**: C7 EngineGate
**Description**: Implement the takeoff-side `EngineGate` validator that every `InferenceRuntime` strategy invokes before deserialising a cached `.engine` file. Two refusal paths: (1) D-C10-7 filename-schema mismatch raises `EngineSchemaMismatchError` at parse time; (2) D-C10-3 manifest content-hash mismatch (or missing sidecar) raises `EngineHashMismatchError` / `EngineSidecarMissingError`. Pure validation: no GPU ops, and no I/O beyond reading the engine file (for hashing), its sidecar, and the deployed manifest. Isolated so all three runtime strategies (TRT / ONNX-RT / PyTorch) call the same validator.
**Complexity**: 3 points
**Dependencies**: AZ-297_c7_runtime_protocol, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-266_log_module
**Component**: c7_inference (epic AZ-249 / E-C7)
**Tracker**: AZ-301
**Epic**: AZ-249 (E-C7)
### Document Dependencies
- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — defines `EngineCacheEntry` (input) and the gate's error types.
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar verification contract.
- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — filename-schema parser.
## Problem
D-C10-3 (manifest content-hash takeoff gate) and D-C10-7 (engine filename schema) are two of the project's safety-critical gates. Both fire at takeoff (F2) when an `.engine` file is about to be deserialised. Without a centralised validator:
- Each runtime strategy (TRT / ONNX-RT / PyTorch) would re-implement the gate logic — three copies, three drift surfaces, three places to fix bugs.
- C7-IT-03 (D-C10-3 takeoff gate refuses mismatched engine) and C7-IT-04 (filename-schema enforcement) cannot share test fixtures.
- A future runtime (a hypothetical CUDA EP path) could silently skip the gate by forgetting to invoke it.
This task delivers ONE validator, used by every strategy, with a single point of refusal.
## Outcome
- An `EngineGate` class at `src/gps_denied_onboard/components/c7_inference/engine_gate.py` with a single public method, `validate(entry: EngineCacheEntry, host_tuple: HostTuple, manifest: DeploymentManifest) -> None`, that returns silently on success and raises one of `EngineSchemaMismatchError`, `EngineHashMismatchError`, `EngineSidecarMissingError` on refusal.
- `HostTuple` is a small frozen dataclass `(sm: int, jp: str, trt: str, precision: PrecisionMode)` derived from `nvidia-smi` / `pynvml` + the runtime's pinned TRT version + the engine's intended precision (read from the entry).
- `DeploymentManifest` is a typed wrapper over the deployed `manifest.json` (an ordered map of `engine_relative_path → sha256_hex`) that C10 produces during F1 and the deployment process delivers alongside the engines. The manifest schema is owned by E-C10; this task DEPENDS on its existence but does not write the manifest.
- The two refusal paths break down into five checks, evaluated in this order:
1. Schema parse: `helpers.engine_filename_schema.parse(entry.engine_path.name)` returns the `(sm, jp, trt, precision)` quadruple; if the parse fails, raise `EngineSchemaMismatchError(reason="parse failure: ...")`.
2. Schema match: if the parsed quadruple does not match `host_tuple`, raise `EngineSchemaMismatchError(expected=host_tuple, got=parsed)`.
3. Sidecar presence: if no `.sha256` sidecar exists alongside `entry.engine_path`, raise `EngineSidecarMissingError`.
4. Sidecar trust: `helpers.sha256_sidecar.verify(entry.engine_path)` — if the sidecar's recorded hash does not match the engine's actual sha256, raise `EngineHashMismatchError(stage="sidecar")`.
5. Manifest match: if `manifest[entry.engine_path.relative_to(manifest.root)]` does not equal the verified sha256, raise `EngineHashMismatchError(stage="manifest")`.
- Every refusal includes structured fields in the exception message (engine path, expected vs. got tuple / hash, manifest entry if any) suitable for the takeoff-abort log path that C10 / runtime_root wires up downstream.
- Diagnostic INFO log on success (`kind="c7.gate.pass"`, engine path, host tuple, manifest hash); ERROR log on each refusal (`kind="c7.gate.refuse"`, refusal reason, engine path).
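The five-step order above can be sketched as a plain function. This is a minimal illustration, not the task's implementation: it takes a bare `engine_path` and `root` instead of `EngineCacheEntry` / `DeploymentManifest`, uses a `str` in place of `PrecisionMode`, injects the filename parser as a `parse` callable, hashes the file with a one-shot stand-in named `file_sha256`, and omits the INFO/ERROR logs. The sidecar name (`<engine>.sha256`) and its format (hex digest as the first token) are assumptions; the real contract lives in `sha256_sidecar.md`.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Mapping


class EngineSchemaMismatchError(Exception):
    pass


class EngineSidecarMissingError(Exception):
    pass


class EngineHashMismatchError(Exception):
    def __init__(self, stage: str, detail: str) -> None:
        super().__init__(f"stage={stage}: {detail}")
        self.stage = stage


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: str  # stand-in for the spec's PrecisionMode


def file_sha256(path: Path) -> str:
    """Hypothetical stand-in for helpers.sha256_sidecar (one-shot read)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def validate(engine_path: Path, host_tuple: HostTuple,
             manifest: Mapping[str, str], root: Path,
             parse: Callable[[str], HostTuple]) -> None:
    # Step 1: schema parse. No disk access has happened yet.
    try:
        parsed = parse(engine_path.name)
    except ValueError as exc:
        raise EngineSchemaMismatchError(f"parse failure: {exc}") from exc
    # Step 2: schema match against the host.
    if parsed != host_tuple:
        raise EngineSchemaMismatchError(f"expected={host_tuple} got={parsed}")
    # Step 3: sidecar presence (assumed naming: <engine>.sha256).
    sidecar = engine_path.with_name(engine_path.name + ".sha256")
    if not sidecar.is_file():
        raise EngineSidecarMissingError(str(engine_path))
    # Step 4: sidecar trust. The recorded hash must equal the file hash.
    tokens = sidecar.read_text().split()
    recorded = tokens[0] if tokens else ""
    actual = file_sha256(engine_path)
    if recorded != actual:
        raise EngineHashMismatchError(
            "sidecar", f"expected={recorded} got={actual}")
    # Step 5: manifest match. A missing entry is a refusal, never a pass.
    key = engine_path.relative_to(root).as_posix()
    manifest_hash = manifest.get(key)
    if manifest_hash != actual:
        raise EngineHashMismatchError(
            "manifest", f"manifest_hash={manifest_hash} file_hash={actual}")
```

Because each check returns or raises before the next one starts, a fixture that fails two checks at once always surfaces the earlier one.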
## Scope
### Included
- `EngineGate` class with `validate(entry, host_tuple, manifest) -> None`.
- `HostTuple` dataclass and a stateless `read_host_tuple() -> HostTuple` helper that queries the GPU through `pynvml` where possible and otherwise parses `nvidia-smi --query-gpu=compute_cap,driver_version,...` output. The helper is in this task because the gate is the only consumer; future consumers can lift it out.
- The five refusal paths above, in the documented order. The order is deterministic so test fixtures can target each step.
- Parse-error vs. tuple-mismatch differentiation: the gate distinguishes "we could not parse the filename" from "we parsed it and it does not match" (via `EngineSchemaMismatchError(reason=...)` vs `(expected=..., got=...)`). Both are the same exception class; the kwargs differ.
- Manifest reader: a thin typed wrapper at `src/gps_denied_onboard/components/c7_inference/manifest.py` that reads the deployed `manifest.json` and exposes `__getitem__` and `root`. The actual manifest schema is owned by E-C10; this task implements only the reader sufficient for the gate's needs and references the canonical schema location.
- INFO-on-pass and ERROR-on-refuse logs.
- Constructor-injectable `ManifestReader` for tests (the production reader reads from disk; tests inject a dict-backed fake).
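One possible shape for the frozen `HostTuple` and for the single exception class that carries either a parse failure or a tuple mismatch is sketched below. The `PrecisionMode` members, the `is_parse_failure` property, and the use of `HostTuple` for the `got` value (the acceptance criteria name a `ParsedTuple`) are illustrative assumptions, not part of the spec.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PrecisionMode(Enum):
    # Hypothetical members; the real enum is defined elsewhere in the project.
    FP16 = "fp16"
    INT8 = "int8"


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: PrecisionMode


class EngineSchemaMismatchError(Exception):
    """One class, two forms: reason=... or expected=... plus got=..."""

    def __init__(self, *, reason: Optional[str] = None,
                 expected: Optional[HostTuple] = None,
                 got: Optional[HostTuple] = None) -> None:
        parse_form = reason is not None
        mismatch_form = expected is not None and got is not None
        if parse_form == mismatch_form:
            raise TypeError("pass either reason=... or expected=... and got=...")
        self.reason = reason
        self.expected = expected
        self.got = got
        if parse_form:
            message = f"schema refusal: {reason}"
        else:
            message = f"schema refusal: expected={expected} got={got}"
        super().__init__(message)

    @property
    def is_parse_failure(self) -> bool:
        # Hypothetical convenience so callers need not inspect kwargs.
        return self.reason is not None
```

Keeping both forms on one class lets the caller catch a single exception type while tests still tell the two cases apart by their fields.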
### Excluded
- AZ-298 / AZ-299 / AZ-300 strategy implementations — they CALL the gate.
- AZ-302 ThermalState publisher — unrelated.
- The deployment manifest's schema — owned by E-C10 (this task writes the reader, not the writer).
- F2 takeoff abort orchestration (the gate raises an error; the runtime caller propagates; the takeoff path catches and aborts) — owned by `runtime_root` and E-C10.
- C12 operator tooling diagnostics for refused engines — out of scope.
- A "tolerant mode" that allows minor SM differences — explicitly out of scope this cycle (would defeat the safety gate).
## Acceptance Criteria
**AC-1: filename-schema parse failure refused at parse time**
Given an engine file named `bogus_name.engine` (no schema)
When `validate(entry, ...)` is called
Then `EngineSchemaMismatchError(reason="parse failure: ...")` is raised; subsequent gate steps are NOT executed (no sidecar read, no manifest lookup); no GPU memory allocated by the caller (verifiable via NVML diff = 0 around the call)
**AC-2: filename-schema tuple mismatch refused at parse time**
Given an engine `ultravpr__sm86_jp6.2_trt10.3_fp16.engine` and a host with `sm=87`
When `validate` is called
Then `EngineSchemaMismatchError(expected=HostTuple(sm=87, ...), got=ParsedTuple(sm=86, ...))` is raised; no sidecar / manifest checks execute
**AC-3: missing sidecar refused before manifest lookup**
Given a schema-matched engine whose `.sha256` sidecar does NOT exist on disk
When `validate` is called
Then `EngineSidecarMissingError(engine_path=...)` is raised; the manifest is NOT read
**AC-4: sidecar trust failure**
Given a schema-matched engine whose sidecar exists but records a hash that does NOT match the engine's actual sha256
When `validate` is called
Then `EngineHashMismatchError(stage="sidecar", engine_path=..., expected=..., got=...)` is raised; the manifest is NOT consulted
**AC-5: manifest mismatch**
Given a schema-matched engine whose sidecar verifies (sidecar hash == file hash) but the deployment manifest's entry for this engine path records a DIFFERENT hash
When `validate` is called
Then `EngineHashMismatchError(stage="manifest", engine_path=..., manifest_hash=..., file_hash=...)` is raised
**AC-6: full-success path returns silently and logs INFO**
Given an engine that passes all five steps
When `validate` is called
Then the call returns `None` silently; one `kind="c7.gate.pass"` INFO log record was emitted; the caller proceeds to deserialise
**AC-7: refusal order is deterministic for fixture targeting**
Given a fixture engine that is BOTH schema-mismatched AND has a missing sidecar
When `validate` is called
Then `EngineSchemaMismatchError` is raised (NOT `EngineSidecarMissingError`) — the schema check runs first; this property is documented and tested
**AC-8: read_host_tuple matches the running Jetson**
Given a Jetson Orin Nano Super running JetPack 6.2 + TRT 10.3
When `read_host_tuple()` is called
Then the returned tuple has `sm=87, jp="6.2", trt="10.3"`; on a workstation (Tier-1 Docker) the values reflect that environment instead
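The `sm` part of the host tuple can be derived from the CUDA compute capability. The sketch below covers only that field, trying `pynvml` first and then `nvidia-smi`, and raising instead of returning a default. The function names are hypothetical, device index 0 is an assumption, and how `jp` and `trt` are obtained is not shown because the source does not specify it.

```python
import subprocess


def sm_from_compute_cap(major: int, minor: int) -> int:
    """Compute capability 8.7 maps to sm=87."""
    return major * 10 + minor


def parse_smi_compute_cap(text: str) -> int:
    """Parse the first line of `nvidia-smi --query-gpu=compute_cap` output."""
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty nvidia-smi output")
    major, minor = lines[0].strip().split(".")
    return sm_from_compute_cap(int(major), int(minor))


def read_sm() -> int:
    """Return the SM number of GPU 0, or raise RuntimeError. No defaults."""
    try:
        import pynvml  # optional; absence triggers the nvidia-smi fallback
    except ImportError:
        pynvml = None
    if pynvml is not None:
        try:
            pynvml.nvmlInit()
            try:
                handle = pynvml.nvmlDeviceGetHandleByIndex(0)
                major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
                return sm_from_compute_cap(major, minor)
            finally:
                pynvml.nvmlShutdown()
        except pynvml.NVMLError:
            pass  # fall through to nvidia-smi
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=5,
        )
        return parse_smi_compute_cap(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError) as exc:
        raise RuntimeError("cannot read host tuple") from exc
```

The pure helpers are split out so the parsing can be unit-tested on a workstation without a GPU.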
## Non-Functional Requirements
**Performance**
- `validate` p99 ≤ 50 ms (sidecar read + manifest dict lookup + sha256 streaming over the engine file). Sha256 over a 500 MB engine file dominates; that is the design budget.
- `read_host_tuple` p99 ≤ 100 ms (one `pynvml` call).
**Reliability**
- The gate is deterministic — same inputs always produce the same outcome. Test fixtures rely on this property.
- The gate makes NO network calls and NO writes; it is read-only.
- Errors carry structured fields suitable for the post-flight FDR record (the runtime caller forwards them).
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Bogus filename | `EngineSchemaMismatchError(reason=...)`; no further gate steps |
| AC-2 | Mismatched SM in filename | `EngineSchemaMismatchError(expected=..., got=...)` |
| AC-3 | Missing sidecar | `EngineSidecarMissingError`; manifest not read |
| AC-4 | Sidecar hash != file hash | `EngineHashMismatchError(stage="sidecar")` |
| AC-5 | Manifest hash != sidecar hash | `EngineHashMismatchError(stage="manifest")` |
| AC-6 | Happy path | Returns None; INFO log emitted |
| AC-7 | Both schema-fail AND sidecar-missing | Schema error wins (deterministic order) |
| AC-8 | host tuple read on Jetson Orin Nano Super | (sm=87, jp="6.2", trt="10.3") |
| NFR-perf-validate | Microbench validate × 100 with a 500 MB engine | p99 ≤ 50 ms |
| NFR-reliability-no-write | Run validate against a read-only directory | No writes attempted (sidecar stays untouched) |
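For the `NFR-perf-validate` row, a small timing harness is enough to produce a p99 figure. The sketch below is one way to do it, assuming a nearest-rank percentile and wall-clock timing with `time.perf_counter`; the helper names are hypothetical and the spec does not prescribe a percentile method.

```python
import time
from typing import Callable, List


def p99(samples_ms: List[float]) -> float:
    """Nearest-rank 99th percentile of the samples."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) in integer arithmetic, as a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def microbench(fn: Callable[[], None], runs: int = 100) -> List[float]:
    """Call fn `runs` times and return per-call durations in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations
```

A test would then assert `p99(microbench(lambda: gate.validate(...))) <= 50.0`, with `gate` and its arguments supplied by the fixture.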
## Constraints
- The five refusal paths execute in the documented order; the order is part of the public contract (AC-7 verifies it).
- The gate is read-only. NEVER writes to the engine file, sidecar, or manifest.
- The `ManifestReader` is constructor-injectable; the production reader reads `manifest.json` from disk; tests inject a dict-backed fake.
- The `read_host_tuple` helper uses `pynvml` first; falls back to parsing `nvidia-smi` output if `pynvml` is unavailable. NEVER returns a synthetic / default tuple — if the GPU cannot be queried, raises `RuntimeError("cannot read host tuple")` and the takeoff path aborts.
- Sha256 is computed using stdlib `hashlib.sha256` with chunked reads via `helpers.sha256_sidecar`; this task does NOT introduce a new sha256 library.
- This task introduces no new third-party dependencies beyond `pynvml` (which is already a project dependency for jetson-stats / NVML telemetry per the C7 description.md).
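The chunked-read constraint can be illustrated with the standard library alone. This is a generic sketch of streaming a file through `hashlib.sha256`, not the actual `helpers.sha256_sidecar` code; the function name and the 1 MiB chunk size are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The file is opened read-only, which matches the gate's no-write constraint.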
## Risks & Mitigation
**Risk 1: Sha256 over a 500 MB engine dominates takeoff latency**
- *Risk*: Per-engine 50 ms × 10 engines = 500 ms blocking takeoff.
- *Mitigation*: The sidecar's recorded hash is the trust anchor; once the sidecar verifies, the manifest match is a dict lookup. The file-streaming sha256 happens during sidecar verification (one streaming pass per engine). Per-engine 50 ms is the budget, and the test asserts it. If a future regression pushes past this budget, the gate could be fast-pathed by reusing a cached file hash computed at compile time (out of scope this cycle).
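The budget figures above can be turned into a required hashing throughput. The sketch below is plain arithmetic on the numbers in this spec, assuming 1 MB = 10^6 bytes; whether the target hardware sustains the resulting rate is not stated in the source.

```python
def required_throughput_gb_per_s(file_mb: float, budget_ms: float) -> float:
    """Sustained read+hash rate needed to hash file_mb within budget_ms."""
    bytes_per_ms = file_mb * 1_000_000 / budget_ms
    return bytes_per_ms * 1000 / 1e9  # bytes/ms -> GB/s


def total_gate_ms(engines: int, per_engine_ms: float) -> float:
    """Worst-case serial gate time across all engines."""
    return engines * per_engine_ms
```

With the spec's numbers, `required_throughput_gb_per_s(500, 50)` is 10 GB/s per engine, and `total_gate_ms(10, 50)` is the 500 ms quoted in the risk.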
**Risk 2: Manifest reader silently treats missing entry as pass**
- *Risk*: A typo in the manifest produces `KeyError` swallowed somewhere; the gate "passes" without checking.
- *Mitigation*: The manifest reader's `__getitem__` raises `EngineHashMismatchError(stage="manifest", reason="missing manifest entry for ...")` on missing key — NEVER returns None or treats absence as pass. AC-5 covers the mismatch case; an additional negative test covers the missing case.
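A reader with this behaviour might look like the sketch below. The class name `DictManifestReader`, the `from_file` constructor, the flat JSON layout (path to hex digest), and the choice of the manifest's parent directory as `root` are assumptions for illustration; the canonical schema is owned by E-C10.

```python
import json
from pathlib import Path
from typing import Dict, Union


class EngineHashMismatchError(Exception):
    def __init__(self, *, stage: str, reason: str) -> None:
        super().__init__(f"stage={stage}: {reason}")
        self.stage = stage
        self.reason = reason


class DictManifestReader:
    """Dict-backed reader; also usable as the injected fake in tests."""

    def __init__(self, root: Path, entries: Dict[str, str]) -> None:
        self._root = root
        self._entries = dict(entries)  # defensive copy; the reader is read-only

    @property
    def root(self) -> Path:
        return self._root

    def __getitem__(self, rel_path: Union[str, Path]) -> str:
        key = Path(rel_path).as_posix()
        try:
            return self._entries[key]
        except KeyError:
            # Absence is a refusal, never None and never a silent pass.
            raise EngineHashMismatchError(
                stage="manifest",
                reason=f"missing manifest entry for {key}",
            ) from None

    @classmethod
    def from_file(cls, manifest_path: Path) -> "DictManifestReader":
        # Assumed layout: a flat JSON object of relative path -> sha256 hex.
        entries = json.loads(manifest_path.read_text(encoding="utf-8"))
        return cls(manifest_path.parent, entries)
```

Raising the gate's own error type from `__getitem__` means no caller can accidentally swallow a bare `KeyError` and continue.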
**Risk 3: Refusal order changes silently across refactors**
- *Risk*: A future refactor reorders the five steps; AC-7's "schema wins over sidecar-missing" property regresses.
- *Mitigation*: AC-7 is a deterministic ordering test; any reorder fails it. The refusal order is part of the public contract documented in this task spec.
## Runtime Completeness
- **Named capability**: D-C10-3 takeoff content-hash gate + D-C10-7 filename-schema enforcement (architecture / E-C7 / E-C10 / risk_mitigations.md R04).
- **Production code that must exist**: real `EngineGate.validate` calling real `helpers.engine_filename_schema.parse`, real `helpers.sha256_sidecar.verify`, real `ManifestReader` reading the deployed manifest.json from disk.
- **Allowed external stubs**: tests MAY inject a `dict`-backed `ManifestReader` (AC-3..AC-7); production wiring reads the on-disk manifest.
- **Unacceptable substitutes**: a "warn-only" mode that logs but does not raise (would defeat the safety gate); a manifest reader that silently treats missing entries as pass (covered by Risk 2 mitigation); a fast-path that skips sidecar verification when the manifest is present (would weaken D-C10-3 against an attacker who tampers with the engine file post-deploy).