# C7 EngineGate — D-C10-3 Content-Hash + D-C10-7 Filename Schema Enforcement

**Task**: AZ-301_c7_engine_gate
**Name**: C7 EngineGate
**Description**: Implement the takeoff-side `EngineGate` validator that every `InferenceRuntime` strategy invokes before deserialising a cached `.engine` file. Two refusal paths: (1) D-C10-7 filename-schema mismatch raises `EngineSchemaMismatchError` at parse time; (2) D-C10-3 manifest content-hash mismatch (or missing sidecar) raises `EngineHashMismatchError` / `EngineSidecarMissingError`. Pure validation — no GPU ops, no I/O beyond reading the engine file (for hashing), its sidecar, and the deployed manifest. Isolated so all three runtime strategies (TRT / ONNX-RT / PyTorch) call the same validator.
**Complexity**: 3 points
**Dependencies**: AZ-297_c7_runtime_protocol, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-266_log_module
**Component**: c7_inference (epic AZ-249 / E-C7)
**Tracker**: AZ-301
**Epic**: AZ-249 (E-C7)

### Document Dependencies

- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — defines `EngineCacheEntry` (input) and the gate's error types.
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar verification contract.
- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — filename-schema parser.

## Problem

D-C10-3 (manifest content-hash takeoff gate) and D-C10-7 (engine filename schema) are two of the project's safety-critical gates. Both fire at takeoff (F2) when an `.engine` file is about to be deserialised. Without a centralised validator:

- Each runtime strategy (TRT / ONNX-RT / PyTorch) would re-implement the gate logic — three copies, three drift surfaces, three places to fix bugs.
- C7-IT-03 (D-C10-3 takeoff gate refuses mismatched engine) and C7-IT-04 (filename-schema enforcement) cannot share test fixtures.
- A future runtime (a hypothetical CUDA EP path) could silently skip the gate by forgetting to invoke it.


This task delivers ONE validator, used by every strategy, with a single point of refusal.

## Outcome

- An `EngineGate` class at `src/gps_denied_onboard/components/c7_inference/engine_gate.py` with a single public method: `validate(entry: EngineCacheEntry, host_tuple: HostTuple, manifest: DeploymentManifest) -> None` that returns silently on success and raises one of `EngineSchemaMismatchError`, `EngineHashMismatchError`, `EngineSidecarMissingError` on refusal.
- `HostTuple` is a small frozen dataclass `(sm: int, jp: str, trt: str, precision: PrecisionMode)` derived from `nvidia-smi` / `pynvml` + the runtime's pinned TRT version + the engine's intended precision (read from the entry).
- `DeploymentManifest` is a typed wrapper over the deployed `manifest.json` (an ordered map of `engine_relative_path → sha256_hex`) that C10 produces during F1 and the deployment process delivers alongside the engines. The manifest schema is owned by E-C10; this task DEPENDS on its existence but does not write the manifest.
- The two refusal paths are evaluated as five steps, in this order:
  1. Schema parse: `helpers.engine_filename_schema.parse(entry.engine_path.name)` returns the `(sm, jp, trt, precision)` quadruple; if the parse fails, raise `EngineSchemaMismatchError(reason="parse failure: ...")`.
  2. Schema match: if the parsed quadruple does not match `host_tuple`, raise `EngineSchemaMismatchError(expected=host_tuple, got=parsed)`.
  3. Sidecar presence: if no `.sha256` sidecar exists alongside `entry.engine_path`, raise `EngineSidecarMissingError`.
  4. Sidecar trust: `helpers.sha256_sidecar.verify(entry.engine_path)` — if the sidecar's recorded hash does not match the engine's actual sha256, raise `EngineHashMismatchError(stage="sidecar")`.
  5. Manifest match: if `manifest[entry.engine_path.relative_to(manifest.root)]` does not equal the verified sha256, raise `EngineHashMismatchError(stage="manifest")`.
- Every refusal includes structured fields in the exception message (engine path, expected vs. got tuple / hash, manifest entry if any) suitable for the takeoff-abort log path that C10 / runtime_root wires up downstream.
- Diagnostic INFO log on success (`kind="c7.gate.pass"`, engine path, host tuple, manifest hash); ERROR log on each refusal (`kind="c7.gate.refuse"`, refusal reason, engine path).

## Scope

### Included

- `EngineGate` class with `validate(entry, host_tuple, manifest) -> None`.
- `HostTuple` dataclass and a stateless `read_host_tuple() -> HostTuple` helper that queries the GPU via `pynvml` where possible and otherwise parses the output of `nvidia-smi --query-gpu=compute_cap,driver_version,...`. The helper is in this task because the gate is the only consumer; future consumers can lift it out.
- The five refusal steps above, in the documented order. The order is deterministic so test fixtures can target each step.
- Parse-error vs. tuple-mismatch differentiation: the gate distinguishes "we could not parse the filename" from "we parsed it and it does not match" (via `EngineSchemaMismatchError(reason=...)` vs `(expected=..., got=...)`). Both are the same exception class; the kwargs differ.
- Manifest reader: a thin typed wrapper at `src/gps_denied_onboard/components/c7_inference/manifest.py` that reads the deployed `manifest.json` and exposes `__getitem__` and `root`. The actual manifest schema is owned by E-C10; this task implements only the reader sufficient for the gate's needs and references the canonical schema location.
- INFO-on-pass and ERROR-on-refuse logs.
- Constructor-injectable `ManifestReader` for tests (the production reader reads from disk; tests inject a dict-backed fake).

### Excluded

- AZ-298 / AZ-299 / AZ-300 strategy implementations — they CALL the gate.
- AZ-302 ThermalState publisher — unrelated.
- The deployment manifest's schema — owned by E-C10 (this task writes the reader, not the writer).
- F2 takeoff abort orchestration (the gate raises an error; the runtime caller propagates; the takeoff path catches and aborts) — owned by `runtime_root` and E-C10.
- C12 operator tooling diagnostics for refused engines — out of scope.
- A "tolerant mode" that allows minor SM differences — explicitly out of scope this cycle (would defeat the safety gate).

## Acceptance Criteria

**AC-1: filename-schema parse failure refused at parse time**
Given an engine file named `bogus_name.engine` (no schema)
When `validate(entry, ...)` is called
Then `EngineSchemaMismatchError(reason="parse failure: ...")` is raised; subsequent gate steps are NOT executed (no sidecar read, no manifest lookup); no GPU memory allocated by the caller (verifiable via NVML diff = 0 around the call)

**AC-2: filename-schema tuple mismatch refused at parse time**
Given an engine `ultravpr__sm86_jp6.2_trt10.3_fp16.engine` and a host with `sm=87`
When `validate` is called
Then `EngineSchemaMismatchError(expected=HostTuple(sm=87, ...), got=ParsedTuple(sm=86, ...))` is raised; no sidecar / manifest checks execute

**AC-3: missing sidecar refused before manifest lookup**
Given a schema-matched engine whose `.sha256` sidecar does NOT exist on disk
When `validate` is called
Then `EngineSidecarMissingError(engine_path=...)` is raised; the manifest is NOT read

**AC-4: sidecar trust failure**
Given a schema-matched engine whose sidecar exists but records a hash that does NOT match the engine's actual sha256
When `validate` is called
Then `EngineHashMismatchError(stage="sidecar", engine_path=..., expected=..., got=...)` is raised; the manifest is NOT consulted

**AC-5: manifest mismatch**
Given a schema-matched engine whose sidecar verifies (sidecar hash == file hash) but the deployment manifest's entry for this engine path records a DIFFERENT hash
When `validate` is called
Then `EngineHashMismatchError(stage="manifest", engine_path=..., manifest_hash=..., file_hash=...)` is raised

**AC-6: full-success path returns silently and logs INFO**
Given an engine that passes all five steps
When `validate` is called
Then the call returns `None` silently; one `kind="c7.gate.pass"` INFO log record was emitted; the caller proceeds to deserialise

**AC-7: refusal order is deterministic for fixture targeting**
Given a fixture engine that is BOTH schema-mismatched AND has a missing sidecar
When `validate` is called
Then `EngineSchemaMismatchError` is raised (NOT `EngineSidecarMissingError`) — the schema check runs first; this property is documented and tested

**AC-8: read_host_tuple matches the running Jetson**
Given a Jetson Orin Nano Super running JetPack 6.2 + TRT 10.3
When `read_host_tuple()` is called
Then the returned tuple has `sm=87, jp="6.2", trt="10.3"`; on a workstation (Tier-1 Docker) the values reflect that environment instead

## Non-Functional Requirements

**Performance**

- `validate` p99 ≤ 50 ms (sidecar read + manifest dict lookup + sha256 streaming over the engine file). Sha256 over a 500 MB engine file dominates; that is the design budget.
- `read_host_tuple` p99 ≤ 100 ms (one `pynvml` call).

**Reliability**

- The gate is deterministic — same inputs always produce the same outcome. Test fixtures rely on this property.
- The gate makes NO network calls and NO writes; it is read-only.
- Errors carry structured fields suitable for the post-flight FDR record (the runtime caller forwards them).
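
The requirement that errors carry structured fields could look roughly like the sketch below. Only the three exception class names and the `kind="c7.gate.refuse"` value come from this spec; the `EngineGateError` base class, the `fields` attribute, the `error_type` key, and the `refusal_record` helper are hypothetical names introduced for illustration, and the real error types are defined by the `inference_runtime_protocol.md` contract.

```python
from __future__ import annotations

from typing import Any


class EngineGateError(Exception):
    """Hypothetical common base: keeps refusal details as data, not only as text."""

    def __init__(self, **fields: Any) -> None:
        self.fields = dict(fields)
        details = ", ".join(
            f"{key}={value!r}" for key, value in sorted(self.fields.items())
        )
        super().__init__(f"{type(self).__name__}({details})")


class EngineSchemaMismatchError(EngineGateError):
    """Raised for a parse failure (reason=...) or a tuple mismatch (expected=..., got=...)."""


class EngineSidecarMissingError(EngineGateError):
    """Raised when no .sha256 sidecar exists next to the engine."""


class EngineHashMismatchError(EngineGateError):
    """Raised when the sidecar or the manifest disagrees with the file hash."""


def refusal_record(error: EngineGateError) -> dict[str, Any]:
    """Flatten a refusal into a log-friendly record (hypothetical helper)."""
    return {
        "kind": "c7.gate.refuse",
        "error_type": type(error).__name__,
        **error.fields,
    }
```

Keeping the keyword arguments on the exception object means the runtime caller can forward them to the log or FDR path without parsing the message string.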
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Bogus filename | `EngineSchemaMismatchError(reason=...)`; no further gate steps |
| AC-2 | Mismatched SM in filename | `EngineSchemaMismatchError(expected=..., got=...)` |
| AC-3 | Missing sidecar | `EngineSidecarMissingError`; manifest not read |
| AC-4 | Sidecar hash != file hash | `EngineHashMismatchError(stage="sidecar")` |
| AC-5 | Manifest hash != sidecar hash | `EngineHashMismatchError(stage="manifest")` |
| AC-6 | Happy path | Returns None; INFO log emitted |
| AC-7 | Both schema-fail AND sidecar-missing | Schema error wins (deterministic order) |
| AC-8 | host tuple read on Jetson Orin Nano Super | (sm=87, jp="6.2", trt="10.3") |
| NFR-perf-validate | Microbench validate × 100 with a 500 MB engine | p99 ≤ 50 ms |
| NFR-reliability-no-write | Run validate against a read-only directory | No writes attempted (sidecar stays untouched) |

## Constraints

- The five refusal steps execute in the documented order; the order is part of the public contract (AC-7 verifies it).
- The gate is read-only. NEVER writes to the engine file, sidecar, or manifest.
- The `ManifestReader` is constructor-injectable; the production reader reads `manifest.json` from disk; tests inject a dict-backed fake.
- The `read_host_tuple` helper uses `pynvml` first; falls back to parsing `nvidia-smi` output if `pynvml` is unavailable. NEVER returns a synthetic / default tuple — if the GPU cannot be queried, raises `RuntimeError("cannot read host tuple")` and the takeoff path aborts.
- Sha256 is computed using stdlib `hashlib.sha256` with chunked reads via `helpers.sha256_sidecar`; this task does NOT introduce a new sha256 library.
- This task introduces no new third-party dependencies beyond `pynvml` (which is already a project dependency for jetson-stats / NVML telemetry per the C7 description.md).
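
As an illustration of the chunked-read constraint, a streaming sha256 over a file with stdlib `hashlib` might look like the sketch below. The function name `sha256_of_file` and the 1 MiB chunk size are assumptions made here; the project's actual implementation lives in `helpers.sha256_sidecar` and may differ.

```python
import hashlib
from pathlib import Path

# Assumed chunk size; the spec does not pin one.
CHUNK_SIZE = 1024 * 1024  # 1 MiB


def sha256_of_file(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Return the hex sha256 of a file, reading it in fixed-size chunks.

    Memory use stays bounded by chunk_size regardless of the file size,
    and the file is opened read-only, so nothing is written.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter(callable, sentinel) keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks matters for the 500 MB engines mentioned above: a single `read_bytes()` call would hold the whole file in memory at once.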
## Risks & Mitigation

**Risk 1: Sha256 over a 500 MB engine dominates takeoff latency**

- *Risk*: Per-engine 50 ms × 10 engines = 500 ms blocking takeoff.
- *Mitigation*: Sidecar's recorded hash is the trust anchor; once the sidecar verifies, the manifest match is a dict lookup. The actual file-streaming sha256 happens during sidecar verification (one streaming pass per engine). Per-engine 50 ms is the budget; the test asserts it. If a future regression pushes past this, the gate can be fast-pathed by reusing a cached file-hash computed at compile time (out of scope this cycle).

**Risk 2: Manifest reader silently treats missing entry as pass**

- *Risk*: A typo in the manifest produces `KeyError` swallowed somewhere; the gate "passes" without checking.
- *Mitigation*: The manifest reader's `__getitem__` raises `EngineHashMismatchError(stage="manifest", reason="missing manifest entry for ...")` on missing key — NEVER returns None or treats absence as pass. AC-5 covers the mismatch case; an additional negative test covers the missing case.

**Risk 3: Refusal order changes silently across refactors**

- *Risk*: A future refactor reorders the five steps; AC-7's "schema wins over sidecar-missing" property regresses.
- *Mitigation*: AC-7 is a deterministic ordering test; any reorder fails it. The refusal order is part of the public contract documented in this task spec.

## Runtime Completeness

- **Named capability**: D-C10-3 takeoff content-hash gate + D-C10-7 filename-schema enforcement (architecture / E-C7 / E-C10 / risk_mitigations.md R04).
- **Production code that must exist**: real `EngineGate.validate` calling real `helpers.engine_filename_schema.parse`, real `helpers.sha256_sidecar.verify`, real `ManifestReader` reading the deployed manifest.json from disk.
- **Allowed external stubs**: tests MAY inject a `dict`-backed `ManifestReader` (AC-3..AC-7); production wiring reads the on-disk manifest.
- **Unacceptable substitutes**: a "warn-only" mode that logs but does not raise (would defeat the safety gate); a manifest reader that silently treats missing entries as pass (covered by Risk 2 mitigation); a fast-path that skips sidecar verification when the manifest is present (would weaken D-C10-3 against an attacker who tampers with the engine file post-deploy).

## Implementation Notes (2026-05-12, batch 25)

Three minor task-spec → as-built deltas:

1. **`HostTuple` lives in `engine_gate.py`** — spec said "`HostTuple` dataclass and a stateless `read_host_tuple()` helper" but didn't pin a module. Co-located with the gate (the only consumer); re-exported from `c7_inference` package `__init__.py`. Future consumers can lift it out if needed.
2. **`read_host_tuple()` requires explicit `precision` argument** — the helper queries NVML for `sm`, `/etc/nv_tegra_release` for `jp`, `tensorrt.__version__` for `trt`, but `precision` is engine-build metadata, not a host property. Caller passes it. Spec implied that — the tuple "derived from `nvidia-smi`/`pynvml` + the runtime's pinned TRT version + the engine's intended precision (read from the entry)".
3. **AC-8 is Tier-2-only** — marked `@pytest.mark.tier2 + @pytest.mark.skipif(GPS_DENIED_TIER!=2)`. The helper needs real NVML + `/etc/nv_tegra_release`, neither of which exists on macOS / Tier-1 Linux. AC-1..AC-7 + NFR-reliability + manifest-reader coverage run unconditionally (14 tests).

### As-built file map

- `src/gps_denied_onboard/components/c7_inference/engine_gate.py` — `EngineGate.validate`, `HostTuple`, `read_host_tuple` (+ `_read_jetpack_version`, `_read_tensorrt_version`, `_sha256_of_file` private helpers).
- `src/gps_denied_onboard/components/c7_inference/manifest.py` — `DeploymentManifest`, `ManifestReader`, `ManifestReaderProtocol`. Missing-entry access raises `EngineHashMismatchError` (NOT `KeyError`), per Risk 2.
- `src/gps_denied_onboard/components/c7_inference/__init__.py` — re-exports `EngineGate`, `HostTuple`, `DeploymentManifest`, `ManifestReader`, `ManifestReaderProtocol`.
- `tests/unit/c7_inference/test_engine_gate.py` — 15 tests (14 unconditional + AC-8 tier-2 skip).

### Refusal-order discipline

The five steps execute in this exact order; AC-7 verifies the property by passing a fixture that is *both* schema-mismatched and missing-sidecar — the schema error wins because step 1 runs first. Future refactors that reorder the steps regress AC-7.
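
The five-step order can be sketched as a single function in which each step raises before the next one runs. This is a simplified illustration, not the as-built `EngineGate.validate`: the free function `validate` and its signature, the filename regex (inferred from the one example name in AC-2), the `<engine>.sha256` sidecar naming, the plain-text sidecar format, the `str` precision field, and the bare exception classes are all assumptions made here. The real parser and sidecar verifier are owned by the shared helpers.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


class EngineSchemaMismatchError(Exception): ...
class EngineSidecarMissingError(Exception): ...
class EngineHashMismatchError(Exception): ...


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: str  # the spec uses PrecisionMode; a plain str stands in here


# Hypothetical pattern, inferred from "ultravpr__sm86_jp6.2_trt10.3_fp16.engine".
_NAME = re.compile(r"^.+__sm(\d+)_jp([\d.]+)_trt([\d.]+)_([a-z0-9]+)\.engine$")


def validate(engine: Path, host: HostTuple, manifest: Mapping[str, str], root: Path) -> None:
    # Step 1: schema parse.
    match = _NAME.match(engine.name)
    if match is None:
        raise EngineSchemaMismatchError(f"parse failure: {engine.name!r}")
    sm, jp, trt, precision = match.groups()
    parsed = HostTuple(int(sm), jp, trt, precision)
    # Step 2: schema match against the host.
    if parsed != host:
        raise EngineSchemaMismatchError(f"expected={host} got={parsed}")
    # Step 3: sidecar presence (assumed naming: "<engine name>.sha256").
    sidecar = engine.with_name(engine.name + ".sha256")
    if not sidecar.is_file():
        raise EngineSidecarMissingError(str(engine))
    # Step 4: sidecar trust (assumed format: hex digest as the first token).
    tokens = sidecar.read_text().split()
    recorded = tokens[0] if tokens else ""
    # Whole-file read keeps the sketch short; production code streams in chunks.
    actual = hashlib.sha256(engine.read_bytes()).hexdigest()
    if recorded != actual:
        raise EngineHashMismatchError(f"stage=sidecar expected={recorded} got={actual}")
    # Step 5: manifest match; a missing entry is a refusal, never a pass.
    key = engine.relative_to(root).as_posix()
    if manifest.get(key) != actual:
        raise EngineHashMismatchError(f"stage=manifest manifest_hash={manifest.get(key)}")
```

Because every check raises immediately, an engine that is both schema-mismatched and missing its sidecar never reaches step 3, which is the property AC-7 pins down.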