C7 EngineGate — D-C10-3 Content-Hash + D-C10-7 Filename Schema Enforcement
Task: AZ-301_c7_engine_gate
Name: C7 EngineGate
Description: Implement the takeoff-side EngineGate validator that every InferenceRuntime strategy invokes before deserialising a cached .engine file. Two refusal paths: (1) a D-C10-7 filename-schema mismatch raises EngineSchemaMismatchError at parse time; (2) a D-C10-3 manifest content-hash mismatch (or missing sidecar) raises EngineHashMismatchError / EngineSidecarMissingError. Pure validation: no GPU ops, and no I/O beyond reading the engine file (to hash it), its sidecar, and the deployed manifest. Isolated so that all three runtime strategies (TRT / ONNX-RT / PyTorch) call the same validator.
Complexity: 3 points
Dependencies: AZ-297_c7_runtime_protocol, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-266_log_module
Component: c7_inference (epic AZ-249 / E-C7)
Tracker: AZ-301
Epic: AZ-249 (E-C7)
Document Dependencies
- _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md: defines EngineCacheEntry (input) and the gate's error types.
- _docs/02_document/contracts/shared_helpers/sha256_sidecar.md: sidecar verification contract.
- _docs/02_document/contracts/shared_helpers/engine_filename_schema.md: filename-schema parser.
Problem
D-C10-3 (manifest content-hash takeoff gate) and D-C10-7 (engine filename schema) are two of the project's safety-critical gates. Both fire at takeoff (F2) when an .engine file is about to be deserialised. Without a centralised validator:
- Each runtime strategy (TRT / ONNX-RT / PyTorch) would re-implement the gate logic — three copies, three drift surfaces, three places to fix bugs.
- C7-IT-03 (D-C10-3 takeoff gate refuses mismatched engine) and C7-IT-04 (filename-schema enforcement) cannot share test fixtures.
- A future runtime (a hypothetical CUDA EP path) could silently skip the gate by forgetting to invoke it.
This task delivers ONE validator, used by every strategy, with a single point of refusal.
Outcome
- An EngineGate class at src/gps_denied_onboard/components/c7_inference/engine_gate.py with a single public method: validate(entry: EngineCacheEntry, host_tuple: HostTuple, manifest: DeploymentManifest) -> None that returns silently on success and raises one of EngineSchemaMismatchError, EngineHashMismatchError, EngineSidecarMissingError on refusal.
- HostTuple is a small frozen dataclass (sm: int, jp: str, trt: str, precision: PrecisionMode) derived from nvidia-smi / pynvml + the runtime's pinned TRT version + the engine's intended precision (read from the entry).
- DeploymentManifest is a typed wrapper over the deployed manifest.json (an ordered map of engine_relative_path → sha256_hex) that C10 produces during F1 and the deployment process delivers alongside the engines. The manifest schema is owned by E-C10; this task DEPENDS on its existence but does not write the manifest.
- The two refusal paths break down into five checks, evaluated in this order:
  1. Schema parse: helpers.engine_filename_schema.parse(entry.engine_path.name) returns the (sm, jp, trt, precision) quadruple; if the parse fails, raise EngineSchemaMismatchError(reason="parse failure: ...").
  2. Schema match: if the parsed quadruple does not match host_tuple, raise EngineSchemaMismatchError(expected=host_tuple, got=parsed).
  3. Sidecar presence: if no .sha256 sidecar exists alongside entry.engine_path, raise EngineSidecarMissingError.
  4. Sidecar trust: call helpers.sha256_sidecar.verify(entry.engine_path); if the sidecar's recorded hash does not match the engine's actual sha256, raise EngineHashMismatchError(stage="sidecar").
  5. Manifest match: if manifest[entry.engine_path.relative_to(manifest.root)] does not equal the verified sha256, raise EngineHashMismatchError(stage="manifest").
- Every refusal includes structured fields in the exception message (engine path, expected vs. got tuple / hash, manifest entry if any) suitable for the takeoff-abort log path that C10 / runtime_root wires up downstream.
- Diagnostic INFO log on success (kind="c7.gate.pass", engine path, host tuple, manifest hash); ERROR log on each refusal (kind="c7.gate.refuse", refusal reason, engine path).
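The five-check order can be sketched as follows. This is a minimal illustration under stated assumptions, not the task's implementation: GateDeps and its three callables are hypothetical stand-ins for helpers.engine_filename_schema.parse, helpers.sha256_sidecar.verify, and the sidecar lookup (their real signatures are defined by their own contracts), and the entry, host tuple, and manifest are simplified to a path, a plain tuple, and a mapping.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Mapping, Optional, Tuple

Quad = Tuple[int, str, str, str]  # (sm, jp, trt, precision), simplified


class EngineSchemaMismatchError(Exception):
    pass


class EngineSidecarMissingError(Exception):
    pass


class EngineHashMismatchError(Exception):
    pass


@dataclass(frozen=True)
class GateDeps:
    """Hypothetical bundle of the helpers the gate calls (names are assumptions)."""
    parse: Callable[[str], Optional[Quad]]         # None means "could not parse"
    read_sidecar: Callable[[Path], Optional[str]]  # None means "no sidecar on disk"
    file_sha256: Callable[[Path], str]             # streaming hash of the engine file


def validate(engine_path: Path, root: Path, host: Quad,
             manifest: Mapping[str, str], deps: GateDeps) -> None:
    # 1. Schema parse: the filename must follow the schema at all.
    parsed = deps.parse(engine_path.name)
    if parsed is None:
        raise EngineSchemaMismatchError(f"parse failure: {engine_path.name}")
    # 2. Schema match: the parsed quadruple must equal the host's.
    if parsed != host:
        raise EngineSchemaMismatchError(f"expected={host} got={parsed}")
    # 3. Sidecar presence: checked before any hashing happens.
    recorded = deps.read_sidecar(engine_path)
    if recorded is None:
        raise EngineSidecarMissingError(str(engine_path))
    # 4. Sidecar trust: the recorded hash must equal the file's actual hash.
    actual = deps.file_sha256(engine_path)
    if recorded != actual:
        raise EngineHashMismatchError(
            f"stage=sidecar path={engine_path} expected={recorded} got={actual}")
    # 5. Manifest match: a missing entry yields None here and is refused too.
    key = engine_path.relative_to(root).as_posix()
    if manifest.get(key) != actual:
        raise EngineHashMismatchError(
            f"stage=manifest path={engine_path} "
            f"manifest_hash={manifest.get(key)} file_hash={actual}")
```

Because each step returns or raises before the next one starts, a fixture that fails several checks at once always reports the earliest one, and the expensive file hash is only computed once the cheaper filename and sidecar-presence checks have passed.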
Scope
Included
- EngineGate class with validate(entry, host_tuple, manifest) -> None.
- HostTuple dataclass and a stateless read_host_tuple() -> HostTuple helper that calls nvidia-smi --query-gpu=compute_cap,driver_version,... (via pynvml where possible). The helper is in this task because the gate is the only consumer; future consumers can lift it out.
- The five refusal paths above, in the documented order. The order is deterministic so test fixtures can target each step.
- Parse-error vs. tuple-mismatch differentiation: the gate distinguishes "we could not parse the filename" from "we parsed it and it does not match" (via EngineSchemaMismatchError(reason=...) vs (expected=..., got=...)). Both are the same exception class; the kwargs differ.
- Manifest reader: a thin typed wrapper at src/gps_denied_onboard/components/c7_inference/manifest.py that reads the deployed manifest.json and exposes __getitem__ and root. The actual manifest schema is owned by E-C10; this task implements only the reader sufficient for the gate's needs and references the canonical schema location.
- INFO-on-pass and ERROR-on-refuse logs.
- Constructor-injectable ManifestReader for tests (the production reader reads from disk; tests inject a dict-backed fake).
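One way the frozen HostTuple and the two shapes of EngineSchemaMismatchError could look is sketched below. The field names come from this spec; everything else is an assumption made for illustration: the PrecisionMode members, the keyword-only constructor, the is_parse_failure property, and the use of HostTuple as the type of got (AC-2 refers to a ParsedTuple, which is not defined here).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PrecisionMode(Enum):
    # Assumption: the real members are defined elsewhere in the project.
    FP16 = "fp16"
    INT8 = "int8"


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: PrecisionMode


class EngineSchemaMismatchError(Exception):
    """One class, two shapes: reason=... or expected=.../got=..."""

    def __init__(self, *, reason: Optional[str] = None,
                 expected: Optional[HostTuple] = None,
                 got: Optional[HostTuple] = None) -> None:
        parse_failure = reason is not None
        mismatch = expected is not None and got is not None
        # Exactly one of the two shapes must be used.
        if parse_failure == mismatch:
            raise ValueError("pass either reason=... or expected=.../got=...")
        self.reason = reason
        self.expected = expected
        self.got = got
        super().__init__(reason if parse_failure
                         else f"expected={expected} got={got}")

    @property
    def is_parse_failure(self) -> bool:
        # Lets callers and tests tell the two cases apart without string matching.
        return self.reason is not None
```

Keeping the fields as attributes, rather than only in the message string, is what would let the runtime caller forward them to a structured log or FDR record.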
Excluded
- AZ-298 / AZ-299 / AZ-300 strategy implementations — they CALL the gate.
- AZ-302 ThermalState publisher — unrelated.
- The deployment manifest's schema — owned by E-C10 (this task writes the reader, not the writer).
- F2 takeoff abort orchestration (the gate raises an error; the runtime caller propagates; the takeoff path catches and aborts): owned by runtime_root and E-C10.
- C12 operator tooling diagnostics for refused engines: out of scope.
- A "tolerant mode" that allows minor SM differences — explicitly out of scope this cycle (would defeat the safety gate).
Acceptance Criteria
AC-1: filename-schema parse failure refused at parse time
Given an engine file named bogus_name.engine (no schema)
When validate(entry, ...) is called
Then EngineSchemaMismatchError(reason="parse failure: ...") is raised; subsequent gate steps are NOT executed (no sidecar read, no manifest lookup); no GPU memory allocated by the caller (verifiable via NVML diff = 0 around the call)
AC-2: filename-schema tuple mismatch refused at parse time
Given an engine ultravpr__sm86_jp6.2_trt10.3_fp16.engine and a host with sm=87
When validate is called
Then EngineSchemaMismatchError(expected=HostTuple(sm=87, ...), got=ParsedTuple(sm=86, ...)) is raised; no sidecar / manifest checks execute
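For illustration, the filename in AC-2 suggests a layout of the form name__sm..._jp..._trt..._precision.engine. The sketch below parses that layout with a regular expression. The pattern is inferred from this single example only; the canonical schema is owned by the engine_filename_schema helper (AZ-281), and both the function name and the choice to return None on failure are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed layout, inferred from "ultravpr__sm86_jp6.2_trt10.3_fp16.engine".
_ENGINE_NAME = re.compile(
    r"(?P<model>.+)__sm(?P<sm>\d+)"
    r"_jp(?P<jp>\d+(?:\.\d+)*)"
    r"_trt(?P<trt>\d+(?:\.\d+)*)"
    r"_(?P<precision>[a-z0-9]+)\.engine"
)


def parse_engine_filename(name: str) -> Optional[Tuple[int, str, str, str]]:
    """Return (sm, jp, trt, precision), or None if the name does not fit."""
    match = _ENGINE_NAME.fullmatch(name)
    if match is None:
        return None
    return (int(match["sm"]), match["jp"], match["trt"], match["precision"])
```

With this pattern, the AC-2 filename yields sm=86 and therefore fails the schema-match check on an sm=87 host, while the AC-1 name bogus_name.engine does not parse at all.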
AC-3: missing sidecar refused before manifest lookup
Given a schema-matched engine whose .sha256 sidecar does NOT exist on disk
When validate is called
Then EngineSidecarMissingError(engine_path=...) is raised; the manifest is NOT read
AC-4: sidecar trust failure
Given a schema-matched engine whose sidecar exists but records a hash that does NOT match the engine's actual sha256
When validate is called
Then EngineHashMismatchError(stage="sidecar", engine_path=..., expected=..., got=...) is raised; the manifest is NOT consulted
AC-5: manifest mismatch
Given a schema-matched engine whose sidecar verifies (sidecar hash == file hash) but the deployment manifest's entry for this engine path records a DIFFERENT hash
When validate is called
Then EngineHashMismatchError(stage="manifest", engine_path=..., manifest_hash=..., file_hash=...) is raised
AC-6: full-success path returns silently and logs INFO
Given an engine that passes all five steps
When validate is called
Then the call returns None silently; one kind="c7.gate.pass" INFO log record was emitted; the caller proceeds to deserialise
AC-7: refusal order is deterministic for fixture targeting
Given a fixture engine that is BOTH schema-mismatched AND has a missing sidecar
When validate is called
Then EngineSchemaMismatchError is raised (NOT EngineSidecarMissingError) — the schema check runs first; this property is documented and tested
AC-8: read_host_tuple matches the running Jetson
Given a Jetson Orin Nano Super running JetPack 6.2 + TRT 10.3
When read_host_tuple() is called
Then the returned tuple has sm=87, jp="6.2", trt="10.3"; on a workstation (Tier-1 Docker) the values reflect that environment instead
Non-Functional Requirements
Performance
- validate p99 ≤ 50 ms (sidecar read + manifest dict lookup + sha256 streaming over the engine file). Sha256 over a 500 MB engine file dominates; that is the design budget.
- read_host_tuple p99 ≤ 100 ms (one pynvml call).
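A microbenchmark for the validate budget could be structured roughly as below. This is a sketch, not the project's test: sha256_file stands in for the chunked read done by helpers.sha256_sidecar, p99_ms is a hypothetical helper using the nearest-rank percentile, and the demo hashes a 1 MiB temporary file rather than a real 500 MB engine. Whether the 50 ms budget holds for a 500 MB file depends on the target's storage and hashing throughput, which is what such a harness would measure.

```python
import hashlib
import os
import tempfile
import time
from pathlib import Path
from typing import Callable


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through hashlib.sha256 in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def p99_ms(fn: Callable[[], object], runs: int = 100) -> float:
    """Run fn `runs` times and return the nearest-rank p99 latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    rank = (99 * len(samples) + 99) // 100  # ceil(0.99 * n) in integer math
    return samples[rank - 1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.engine"
        sample.write_bytes(os.urandom(1 << 20))  # 1 MiB stand-in file
        print(f"p99 = {p99_ms(lambda: sha256_file(sample)):.3f} ms")
```

Repeated runs over the same file mostly measure page-cache reads, so a takeoff-representative measurement would also need a cold-cache run on the target hardware.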
Reliability
- The gate is deterministic — same inputs always produce the same outcome. Test fixtures rely on this property.
- The gate makes NO network calls and NO writes; it is read-only.
- Errors carry structured fields suitable for the post-flight FDR record (the runtime caller forwards them).
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Bogus filename | EngineSchemaMismatchError(reason=...); no further gate steps |
| AC-2 | Mismatched SM in filename | EngineSchemaMismatchError(expected=..., got=...) |
| AC-3 | Missing sidecar | EngineSidecarMissingError; manifest not read |
| AC-4 | Sidecar hash != file hash | EngineHashMismatchError(stage="sidecar") |
| AC-5 | Manifest hash != sidecar hash | EngineHashMismatchError(stage="manifest") |
| AC-6 | Happy path | Returns None; INFO log emitted |
| AC-7 | Both schema-fail AND sidecar-missing | Schema error wins (deterministic order) |
| AC-8 | host tuple read on Jetson Orin Nano Super | (sm=87, jp="6.2", trt="10.3") |
| NFR-perf-validate | Microbench validate × 100 with a 500 MB engine | p99 ≤ 50 ms |
| NFR-reliability-no-write | Run validate against a read-only directory | No writes attempted (sidecar stays untouched) |
Constraints
- The five refusal paths execute in the documented order; the order is part of the public contract (AC-7 verifies it).
- The gate is read-only. NEVER writes to the engine file, sidecar, or manifest.
- The ManifestReader is constructor-injectable; the production reader reads manifest.json from disk; tests inject a dict-backed fake.
- The read_host_tuple helper uses pynvml first; falls back to parsing nvidia-smi output if pynvml is unavailable. NEVER returns a synthetic / default tuple; if the GPU cannot be queried, it raises RuntimeError("cannot read host tuple") and the takeoff path aborts.
- Sha256 is computed using stdlib hashlib.sha256 with chunked reads via helpers.sha256_sidecar; this task does NOT introduce a new sha256 library.
- This task introduces no new third-party dependencies beyond pynvml (which is already a project dependency for jetson-stats / NVML telemetry per the C7 description.md).
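The "pynvml first, nvidia-smi fallback, never a default" rule for read_host_tuple might be sketched as a probe chain. Only the SM component is shown; how jp and trt are obtained is not specified here and is left out. The function names and the injectable probes parameter are hypothetical, and whether NVML or the compute_cap query is available on the target Jetson is an assumption that would need checking on the device.

```python
import subprocess
from typing import Callable, Sequence


def sm_from_pynvml() -> int:
    import pynvml  # imported lazily so the fallback works when it is absent

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        return major * 10 + minor  # e.g. (8, 7) -> 87
    finally:
        pynvml.nvmlShutdown()


def sm_from_nvidia_smi() -> int:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True, timeout=5,
    ).stdout
    major, minor = out.strip().splitlines()[0].strip().split(".")
    return int(major) * 10 + int(minor)


def read_sm(
    probes: Sequence[Callable[[], int]] = (sm_from_pynvml, sm_from_nvidia_smi),
) -> int:
    """Try each probe in order; raise instead of inventing a value."""
    errors = []
    for probe in probes:
        try:
            return probe()
        except Exception as exc:  # any probe failure moves on to the next one
            errors.append(f"{probe.__name__}: {exc!r}")
    raise RuntimeError("cannot read host tuple: " + "; ".join(errors))
```

Making the probes injectable keeps the fallback order testable on a workstation with no GPU, while the final RuntimeError preserves the rule that a host which cannot be queried never passes the gate.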
Risks & Mitigation
Risk 1: Sha256 over a 500 MB engine dominates takeoff latency
- Risk: Per-engine 50 ms × 10 engines = 500 ms blocking takeoff.
- Mitigation: The sidecar's recorded hash is the trust anchor; once the sidecar verifies, the manifest match is a dict lookup. The actual file-streaming sha256 happens during sidecar verification (one streaming pass per engine). Per-engine 50 ms is the budget; the test asserts it. If a future regression pushes past this, the gate could be fast-pathed by reusing a cached file hash computed at compile time (out of scope this cycle).
Risk 2: Manifest reader silently treats missing entry as pass
- Risk: A typo in the manifest produces a KeyError that is swallowed somewhere; the gate "passes" without checking.
- Mitigation: The manifest reader's __getitem__ raises EngineHashMismatchError(stage="manifest", reason="missing manifest entry for ...") on a missing key; it NEVER returns None or treats absence as pass. AC-5 covers the mismatch case; an additional negative test covers the missing case.
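The mitigation can be illustrated with a dict-backed reader of the kind tests might inject. The class name DictManifestReader and the exception's constructor signature are assumptions for this sketch; only the behaviour on a missing key is taken from the mitigation above.

```python
import os
from pathlib import Path
from typing import Mapping, Union


class EngineHashMismatchError(Exception):
    def __init__(self, *, stage: str, reason: str) -> None:
        super().__init__(f"stage={stage} {reason}")
        self.stage = stage
        self.reason = reason


class DictManifestReader:
    """Hypothetical dict-backed fake exposing only __getitem__ and root."""

    def __init__(self, root: Path, entries: Mapping[str, str]) -> None:
        self.root = root
        self._entries = dict(entries)  # copy so later mutation cannot leak in

    def __getitem__(self, relative_path: Union[str, os.PathLike]) -> str:
        key = Path(relative_path).as_posix()
        try:
            return self._entries[key]
        except KeyError:
            # Absence is a refusal, never a pass and never None.
            raise EngineHashMismatchError(
                stage="manifest",
                reason=f"missing manifest entry for {key}",
            ) from None
```

The reader deliberately offers no get() with a default, so a caller cannot accidentally turn a missing entry into a silent pass.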
Risk 3: Refusal order changes silently across refactors
- Risk: A future refactor reorders the five steps; AC-7's "schema wins over sidecar-missing" property regresses.
- Mitigation: AC-7 is a deterministic ordering test; any reorder fails it. The refusal order is part of the public contract documented in this task spec.
Runtime Completeness
- Named capability: D-C10-3 takeoff content-hash gate + D-C10-7 filename-schema enforcement (architecture / E-C7 / E-C10 / risk_mitigations.md R04).
- Production code that must exist: real EngineGate.validate calling real helpers.engine_filename_schema.parse, real helpers.sha256_sidecar.verify, real ManifestReader reading the deployed manifest.json from disk.
- Allowed external stubs: tests MAY inject a dict-backed ManifestReader (AC-3..AC-7); production wiring reads the on-disk manifest.
- Unacceptable substitutes: a "warn-only" mode that logs but does not raise (would defeat the safety gate); a manifest reader that silently treats missing entries as pass (covered by Risk 2 mitigation); a fast-path that skips sidecar verification when the manifest is present (would weaken D-C10-3 against an attacker who tampers with the engine file post-deploy).