AZ-301 takeoff-side validator every InferenceRuntime strategy calls
before deserialize_engine. Five-step deterministic refusal pipeline,
in order:
1. filename schema parse -> EngineSchemaMismatchError(reason=...)
2. schema tuple match -> EngineSchemaMismatchError(expected,got)
3. sidecar present -> EngineSidecarMissingError
4. sidecar trust -> EngineHashMismatchError(stage=sidecar)
5. manifest match -> EngineHashMismatchError(stage=manifest)
Refusal order is part of the public contract (AC-7 verifies that a
fixture that is BOTH schema-mismatched AND missing-sidecar refuses
at the schema check, before the sidecar check).
Production code (new):
- components/c7_inference/engine_gate.py -- EngineGate, HostTuple,
read_host_tuple (Jetson: pynvml + /etc/nv_tegra_release +
tensorrt.__version__; raises RuntimeError on Tier-1)
- components/c7_inference/manifest.py -- DeploymentManifest,
ManifestReader, ManifestReaderProtocol. Risk-2 enforced at the
type level: __getitem__ raises EngineHashMismatchError on
missing key, NEVER KeyError, so the gate cannot silently pass
- components/c7_inference/__init__.py -- re-exports the new
public surface
Tests (new): tests/unit/c7_inference/test_engine_gate.py covers
AC-1..AC-7 + NFR-reliability-no-write + manifest reader + refusal
log emission. 14 tests unconditional + AC-8 Tier-2 skip (needs
real NVML + L4T release file + tensorrt binding).
Three task-spec -> as-built deltas documented in
_docs/02_tasks/done/AZ-301_c7_engine_gate.md Implementation Notes:
1. HostTuple lives in engine_gate.py (the only consumer);
re-exported from package __init__.py.
2. read_host_tuple takes precision as a keyword argument — three
of four fields come from the host, precision is engine-build
metadata supplied by the caller.
3. AC-8 is Tier-2-only; AC-1..AC-7 + NFR-reliability + extras
run on every CI host.
Risk-2 (manifest reader silently treats missing entry as pass):
DeploymentManifest.__getitem__ raises EngineHashMismatchError with
"missing manifest entry for {path}" — covered by
test_manifest_missing_entry_raises_hash_mismatch.
NFR-perf-validate (p99 <= 50 ms): tier-2 only — a real 500 MB
engine streaming sha256 cannot be benchmarked on Tier-1 fixtures.
AZ-302 (ThermalStatePublisher) + AZ-304 (C6 Postgres schema)
deferred to batches 26 / 27 to keep the 1-task batch cadence and
isolate their respective env / testcontainer surface areas.
Suite: 1134 passed / 11 skipped. No regressions outside the new
files.
Co-authored-by: Cursor <cursoragent@cursor.com>
C7 EngineGate — D-C10-3 Content-Hash + D-C10-7 Filename Schema Enforcement
Task: AZ-301_c7_engine_gate
Name: C7 EngineGate
Description: Implement the takeoff-side EngineGate validator that every InferenceRuntime strategy invokes before deserialising a cached .engine file. Two refusal paths: (1) D-C10-7 filename-schema mismatch raises EngineSchemaMismatchError at parse time; (2) D-C10-3 manifest content-hash mismatch (or missing sidecar) raises EngineHashMismatchError / EngineSidecarMissingError. Pure validation — no GPU ops, no I/O beyond reading the sidecar + the deployed manifest. Isolated so all three runtime strategies (TRT / ONNX-RT / PyTorch) call the same validator.
Complexity: 3 points
Dependencies: AZ-297_c7_runtime_protocol, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-266_log_module
Component: c7_inference (epic AZ-249 / E-C7)
Tracker: AZ-301
Epic: AZ-249 (E-C7)
Document Dependencies
- _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md: defines EngineCacheEntry (input) and the gate's error types.
- _docs/02_document/contracts/shared_helpers/sha256_sidecar.md: sidecar verification contract.
- _docs/02_document/contracts/shared_helpers/engine_filename_schema.md: filename-schema parser.
Problem
D-C10-3 (manifest content-hash takeoff gate) and D-C10-7 (engine filename schema) are two of the project's safety-critical gates. Both fire at takeoff (F2) when an .engine file is about to be deserialised. Without a centralised validator:
- Each runtime strategy (TRT / ONNX-RT / PyTorch) would re-implement the gate logic — three copies, three drift surfaces, three places to fix bugs.
- C7-IT-03 (D-C10-3 takeoff gate refuses mismatched engine) and C7-IT-04 (filename-schema enforcement) cannot share test fixtures.
- A future runtime (a hypothetical CUDA EP path) could silently skip the gate by forgetting to invoke it.
This task delivers ONE validator, used by every strategy, with a single point of refusal.
Outcome
- An EngineGate class at src/gps_denied_onboard/components/c7_inference/engine_gate.py with a single public method: validate(entry: EngineCacheEntry, host_tuple: HostTuple, manifest: DeploymentManifest) -> None that returns silently on success and raises one of EngineSchemaMismatchError, EngineHashMismatchError, EngineSidecarMissingError on refusal.
- HostTuple is a small frozen dataclass (sm: int, jp: str, trt: str, precision: PrecisionMode) derived from nvidia-smi / pynvml + the runtime's pinned TRT version + the engine's intended precision (read from the entry).
- DeploymentManifest is a typed wrapper over the deployed manifest.json (an ordered map of engine_relative_path → sha256_hex) that C10 produces during F1 and the deployment process delivers alongside the engines. The manifest schema is owned by E-C10; this task DEPENDS on its existence but does not write the manifest.
- The two refusal paths are evaluated as five steps, in this order:
  1. Schema parse: helpers.engine_filename_schema.parse(entry.engine_path.name) returns the (sm, jp, trt, precision) quadruple; if the parse fails, raise EngineSchemaMismatchError(reason="parse failure: ...").
  2. Schema match: if the parsed quadruple does not match host_tuple, raise EngineSchemaMismatchError(expected=host_tuple, got=parsed).
  3. Sidecar presence: if no .sha256 sidecar exists alongside entry.engine_path, raise EngineSidecarMissingError.
  4. Sidecar trust: helpers.sha256_sidecar.verify(entry.engine_path); if the sidecar's recorded hash does not match the engine's actual sha256, raise EngineHashMismatchError(stage="sidecar").
  5. Manifest match: if manifest[entry.engine_path.relative_to(manifest.root)] does not equal the verified sha256, raise EngineHashMismatchError(stage="manifest").
- Every refusal includes structured fields in the exception message (engine path, expected vs. got tuple / hash, manifest entry if any) suitable for the takeoff-abort log path that C10 / runtime_root wires up downstream.
- Diagnostic INFO log on success (kind="c7.gate.pass", engine path, host tuple, manifest hash); ERROR log on each refusal (kind="c7.gate.refuse", refusal reason, engine path).
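The five-step order above can be sketched as a single function. This is a minimal illustration, not the as-built EngineGate: the filename regex is inferred only from the AC-2 example name, the sidecar is assumed to be named "<engine name>.sha256", the manifest is a plain mapping, precision is a str rather than PrecisionMode, and logging is omitted.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


class EngineSchemaMismatchError(Exception):
    pass


class EngineSidecarMissingError(Exception):
    pass


class EngineHashMismatchError(Exception):
    pass


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: str  # the spec uses a PrecisionMode enum; str keeps the sketch short


# ASSUMPTION: pattern inferred from the AC-2 example file name only.
_NAME_RE = re.compile(
    r"^.+__sm(?P<sm>\d+)_jp(?P<jp>\d+(?:\.\d+)*)"
    r"_trt(?P<trt>\d+(?:\.\d+)*)_(?P<precision>[a-z0-9]+)\.engine$"
)


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through hashlib.sha256 in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate(engine_path: Path, host: HostTuple,
             manifest: Mapping[str, str], manifest_root: Path) -> None:
    # Step 1: schema parse. Only the file name is inspected here.
    match = _NAME_RE.match(engine_path.name)
    if match is None:
        raise EngineSchemaMismatchError(f"parse failure: {engine_path.name!r}")
    parsed = HostTuple(int(match["sm"]), match["jp"], match["trt"], match["precision"])

    # Step 2: schema match against the host.
    if parsed != host:
        raise EngineSchemaMismatchError(f"expected={host} got={parsed}")

    # Step 3: sidecar presence (ASSUMPTION: "<engine name>.sha256" next to it).
    sidecar = engine_path.with_name(engine_path.name + ".sha256")
    if not sidecar.is_file():
        raise EngineSidecarMissingError(str(engine_path))

    # Step 4: sidecar trust. The only streaming pass over the engine file.
    tokens = sidecar.read_text().split()
    recorded = tokens[0] if tokens else ""
    actual = sha256_of_file(engine_path)
    if recorded != actual:
        raise EngineHashMismatchError(f"stage=sidecar expected={recorded} got={actual}")

    # Step 5: manifest match. A missing entry is a refusal, never a pass.
    key = engine_path.relative_to(manifest_root).as_posix()
    if manifest.get(key) != actual:
        raise EngineHashMismatchError(
            f"stage=manifest manifest_hash={manifest.get(key)} file_hash={actual}"
        )
```

Because each step returns or raises before the next one starts, a fixture that fails several checks at once always surfaces the earliest failure, which is the property AC-7 relies on.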
Scope
Included
- EngineGate class with validate(entry, host_tuple, manifest) -> None.
- HostTuple dataclass and a stateless read_host_tuple() -> HostTuple helper that calls nvidia-smi --query-gpu=compute_cap,driver_version,... (via pynvml where possible). The helper is in this task because the gate is the only consumer; future consumers can lift it out.
- The five refusal paths above, in the documented order. The order is deterministic so test fixtures can target each step.
- Parse-error vs. tuple-mismatch differentiation: the gate distinguishes "we could not parse the filename" from "we parsed it and it does not match" (via EngineSchemaMismatchError(reason=...) vs (expected=..., got=...)). Both are the same exception class; the kwargs differ.
- Manifest reader: a thin typed wrapper at src/gps_denied_onboard/components/c7_inference/manifest.py that reads the deployed manifest.json and exposes __getitem__ and root. The actual manifest schema is owned by E-C10; this task implements only the reader sufficient for the gate's needs and references the canonical schema location.
- INFO-on-pass and ERROR-on-refuse logs.
- Constructor-injectable ManifestReader for tests (the production reader reads from disk; tests inject a dict-backed fake).
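One way to keep parse failures and tuple mismatches in the same exception class while still telling them apart is keyword-only construction. The sketch below is illustrative only: the actual error types are defined by the inference_runtime_protocol contract, the is_parse_failure accessor is a hypothetical convenience, and precision is a str standing in for PrecisionMode.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: str  # stands in for the spec's PrecisionMode enum


class EngineSchemaMismatchError(Exception):
    """One class, two shapes: reason=... or expected=... plus got=...."""

    def __init__(self, *, reason: Optional[str] = None,
                 expected: Optional[HostTuple] = None,
                 got: Optional[HostTuple] = None) -> None:
        parse_failure = reason is not None
        tuple_mismatch = expected is not None and got is not None
        # Reject ambiguous construction: exactly one shape must be supplied.
        if parse_failure == tuple_mismatch:
            raise TypeError("pass either reason=... or both expected=... and got=...")
        self.reason = reason
        self.expected = expected
        self.got = got
        super().__init__(reason if parse_failure else f"expected={expected} got={got}")

    @property
    def is_parse_failure(self) -> bool:  # hypothetical convenience accessor
        return self.reason is not None
```

A caller that only needs to abort can catch the single class; a test or log formatter can branch on which fields are populated.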
Excluded
- AZ-298 / AZ-299 / AZ-300 strategy implementations — they CALL the gate.
- AZ-302 ThermalState publisher — unrelated.
- The deployment manifest's schema — owned by E-C10 (this task writes the reader, not the writer).
- F2 takeoff abort orchestration (the gate raises an error; the runtime caller propagates; the takeoff path catches and aborts): owned by runtime_root and E-C10.
- C12 operator tooling diagnostics for refused engines: out of scope.
- A "tolerant mode" that allows minor SM differences — explicitly out of scope this cycle (would defeat the safety gate).
Acceptance Criteria
AC-1: filename-schema parse failure refused at parse time
Given an engine file named bogus_name.engine (no schema)
When validate(entry, ...) is called
Then EngineSchemaMismatchError(reason="parse failure: ...") is raised; subsequent gate steps are NOT executed (no sidecar read, no manifest lookup); no GPU memory allocated by the caller (verifiable via NVML diff = 0 around the call)
AC-2: filename-schema tuple mismatch refused at parse time
Given an engine ultravpr__sm86_jp6.2_trt10.3_fp16.engine and a host with sm=87
When validate is called
Then EngineSchemaMismatchError(expected=HostTuple(sm=87, ...), got=ParsedTuple(sm=86, ...)) is raised; no sidecar / manifest checks execute
AC-3: missing sidecar refused before manifest lookup
Given a schema-matched engine whose .sha256 sidecar does NOT exist on disk
When validate is called
Then EngineSidecarMissingError(engine_path=...) is raised; the manifest is NOT read
AC-4: sidecar trust failure
Given a schema-matched engine whose sidecar exists but records a hash that does NOT match the engine's actual sha256
When validate is called
Then EngineHashMismatchError(stage="sidecar", engine_path=..., expected=..., got=...) is raised; the manifest is NOT consulted
AC-5: manifest mismatch
Given a schema-matched engine whose sidecar verifies (sidecar hash == file hash) but the deployment manifest's entry for this engine path records a DIFFERENT hash
When validate is called
Then EngineHashMismatchError(stage="manifest", engine_path=..., manifest_hash=..., file_hash=...) is raised
AC-6: full-success path returns silently and logs INFO
Given an engine that passes all five steps
When validate is called
Then the call returns None silently; one kind="c7.gate.pass" INFO log record was emitted; the caller proceeds to deserialise
AC-7: refusal order is deterministic for fixture targeting
Given a fixture engine that is BOTH schema-mismatched AND has a missing sidecar
When validate is called
Then EngineSchemaMismatchError is raised (NOT EngineSidecarMissingError) — the schema check runs first; this property is documented and tested
AC-8: read_host_tuple matches the running Jetson
Given a Jetson Orin Nano Super running JetPack 6.2 + TRT 10.3
When read_host_tuple() is called
Then the returned tuple has sm=87, jp="6.2", trt="10.3"; on a workstation (Tier-1 Docker) the values reflect that environment instead
Non-Functional Requirements
Performance
- validate p99 ≤ 50 ms (sidecar read + manifest dict lookup + sha256 streaming over the engine file). Sha256 over a 500 MB engine file dominates; that is the design budget.
- read_host_tuple p99 ≤ 100 ms (one pynvml call).
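A p99 budget like the ones above can be checked with a small timing harness. The helper names below (nearest_rank_percentile, p99_ms, required_throughput_mb_per_s) are hypothetical and not part of the task's test suite; the sketch uses the nearest-rank percentile definition, which is one of several common conventions. As a sanity check on the budget itself, the last helper computes the sustained hashing throughput that a given file size and time budget imply: 500 MB in 50 ms works out to 10,000 MB/s.

```python
from __future__ import annotations

import math
import time
from typing import Callable, List, Sequence


def nearest_rank_percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_ms(call: Callable[[], object], runs: int = 100) -> float:
    """Time `call` `runs` times and return the p99 duration in milliseconds."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        durations.append((time.perf_counter() - start) * 1000.0)
    return nearest_rank_percentile(durations, 99)


def required_throughput_mb_per_s(size_mb: int, budget_ms: int) -> int:
    """Sustained throughput needed to stream size_mb within budget_ms."""
    return size_mb * 1000 // budget_ms
```

With 100 runs, nearest-rank p99 is the second-slowest sample, so a single outlier does not fail the budget but two do.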
Reliability
- The gate is deterministic — same inputs always produce the same outcome. Test fixtures rely on this property.
- The gate makes NO network calls and NO writes; it is read-only.
- Errors carry structured fields suitable for the post-flight FDR record (the runtime caller forwards them).
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Bogus filename | EngineSchemaMismatchError(reason=...); no further gate steps |
| AC-2 | Mismatched SM in filename | EngineSchemaMismatchError(expected=..., got=...) |
| AC-3 | Missing sidecar | EngineSidecarMissingError; manifest not read |
| AC-4 | Sidecar hash != file hash | EngineHashMismatchError(stage="sidecar") |
| AC-5 | Manifest hash != sidecar hash | EngineHashMismatchError(stage="manifest") |
| AC-6 | Happy path | Returns None; INFO log emitted |
| AC-7 | Both schema-fail AND sidecar-missing | Schema error wins (deterministic order) |
| AC-8 | host tuple read on Jetson Orin Nano Super | (sm=87, jp="6.2", trt="10.3") |
| NFR-perf-validate | Microbench validate × 100 with a 500 MB engine | p99 ≤ 50 ms |
| NFR-reliability-no-write | Run validate against a read-only directory | No writes attempted (sidecar stays untouched) |
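The no-write row can be approached by snapshotting the fixture directory before and after the call. This is a sketch under assumptions: snapshot_tree and assert_no_writes are hypothetical helper names, and hashing whole files in memory is only sensible for small test fixtures, not for a real 500 MB engine.

```python
from __future__ import annotations

import hashlib
from pathlib import Path
from typing import Callable, Dict, Tuple

Snapshot = Dict[str, Tuple[int, int, str]]


def snapshot_tree(root: Path) -> Snapshot:
    """Record size, mtime and content hash of every file under root."""
    snap: Snapshot = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            stat = path.stat()
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            snap[path.relative_to(root).as_posix()] = (
                stat.st_size, stat.st_mtime_ns, digest,
            )
    return snap


def assert_no_writes(root: Path, call: Callable[[], object]) -> None:
    """Run call() and fail if any file under root was added, removed or changed."""
    before = snapshot_tree(root)
    try:
        call()
    except Exception:
        pass  # a refusal is acceptable here; only side effects matter
    after = snapshot_tree(root)
    assert before == after, "gate call modified the fixture directory"
```

Running the same check on the refusal fixtures as well as the happy path covers the case where an error branch accidentally touches the sidecar.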
Constraints
- The five refusal paths execute in the documented order; the order is part of the public contract (AC-7 verifies it).
- The gate is read-only. NEVER writes to the engine file, sidecar, or manifest.
- The ManifestReader is constructor-injectable; the production reader reads manifest.json from disk; tests inject a dict-backed fake.
- The read_host_tuple helper uses pynvml first; falls back to parsing nvidia-smi output if pynvml is unavailable. NEVER returns a synthetic / default tuple: if the GPU cannot be queried, raises RuntimeError("cannot read host tuple") and the takeoff path aborts.
- Sha256 is computed using stdlib hashlib.sha256 with chunked reads via helpers.sha256_sidecar; this task does NOT introduce a new sha256 library.
- This task introduces no new third-party dependencies beyond pynvml (which is already a project dependency for jetson-stats / NVML telemetry per the C7 description.md).
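The "pynvml first, then fall back, never invent a value" rule for read_host_tuple could look roughly like the probe chain below, shown for the sm field only. This is a sketch, not the as-built helper: ProbeUnavailable, parse_compute_cap, sm_from_pynvml and read_sm are hypothetical names, and the nvidia-smi probe is left to the caller because its exact invocation is elided in the spec.

```python
from __future__ import annotations

from typing import Callable, Iterable


class ProbeUnavailable(Exception):
    """Hypothetical marker: this source cannot answer on this host."""


def parse_compute_cap(text: str) -> int:
    """Turn a compute-capability string such as '8.7' into sm=87."""
    major, minor = text.strip().split(".")
    return int(major) * 10 + int(minor)


def sm_from_pynvml() -> int:
    """Query device 0 through NVML; any failure means 'try the next probe'."""
    try:
        import pynvml  # imported lazily so hosts without NVML can still fall back

        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        finally:
            pynvml.nvmlShutdown()
    except Exception as exc:  # ImportError or any NVML error
        raise ProbeUnavailable("pynvml") from exc
    return major * 10 + minor


def read_sm(probes: Iterable[Callable[[], int]]) -> int:
    """Return the first probe's answer; never substitute a default value."""
    for probe in probes:
        try:
            return probe()
        except ProbeUnavailable:
            continue
    raise RuntimeError("cannot read host tuple")
```

Passing the probes in as a list keeps the fallback order explicit and lets Tier-1 tests exercise the "all probes failed" path without a GPU.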
Risks & Mitigation
Risk 1: Sha256 over a 500 MB engine dominates takeoff latency
- Risk: Per-engine 50 ms × 10 engines = 500 ms blocking takeoff.
- Mitigation: Sidecar's recorded hash is the trust anchor; once the sidecar verifies, the manifest match is a dict lookup. The actual file-streaming sha256 happens during sidecar verification (one streaming pass per engine). Per-engine 50 ms is the budget; the test asserts it. If a future regression pushes past this, the gate is fast-pathed by reusing a cached file-hash computed at compile time (out of scope this cycle).
Risk 2: Manifest reader silently treats missing entry as pass
- Risk: A typo in the manifest produces KeyError swallowed somewhere; the gate "passes" without checking.
- Mitigation: The manifest reader's __getitem__ raises EngineHashMismatchError(stage="manifest", reason="missing manifest entry for ...") on missing key; it NEVER returns None or treats absence as pass. AC-5 covers the mismatch case; an additional negative test covers the missing case.
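The Risk 2 mitigation amounts to translating a missing key into the domain error at the reader boundary. A minimal dict-backed sketch follows; the from_json_text constructor and the flat path-to-hash JSON shape are assumptions for illustration, since the canonical manifest schema is owned by E-C10.

```python
from __future__ import annotations

import json
from pathlib import Path, PurePath
from typing import Mapping, Union


class EngineHashMismatchError(Exception):
    def __init__(self, *, stage: str, reason: str) -> None:
        self.stage = stage
        self.reason = reason
        super().__init__(f"stage={stage} {reason}")


class DeploymentManifest:
    """Typed wrapper exposing root and __getitem__ over path -> sha256 hex."""

    def __init__(self, root: Path, entries: Mapping[str, str]) -> None:
        self.root = root
        self._entries = dict(entries)

    @classmethod
    def from_json_text(cls, root: Path, text: str) -> "DeploymentManifest":
        # ASSUMPTION: the manifest is a flat JSON object of path -> hash.
        return cls(root, json.loads(text))

    def __getitem__(self, relative_path: Union[str, PurePath]) -> str:
        key = PurePath(relative_path).as_posix()
        try:
            return self._entries[key]
        except KeyError:
            # Convert at the boundary so a stray `except KeyError` upstream
            # cannot turn a missing entry into a silent pass.
            raise EngineHashMismatchError(
                stage="manifest", reason=f"missing manifest entry for {key}"
            ) from None
```

Because the error is not a KeyError subclass, generic mapping-style handling such as dict.get-like fallbacks in calling code will not mask it.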
Risk 3: Refusal order changes silently across refactors
- Risk: A future refactor reorders the five steps; AC-7's "schema wins over sidecar-missing" property regresses.
- Mitigation: AC-7 is a deterministic ordering test; any reorder fails it. The refusal order is part of the public contract documented in this task spec.
Runtime Completeness
- Named capability: D-C10-3 takeoff content-hash gate + D-C10-7 filename-schema enforcement (architecture / E-C7 / E-C10 / risk_mitigations.md R04).
- Production code that must exist: real EngineGate.validate calling real helpers.engine_filename_schema.parse, real helpers.sha256_sidecar.verify, real ManifestReader reading the deployed manifest.json from disk.
- Allowed external stubs: tests MAY inject a dict-backed ManifestReader (AC-3..AC-7); production wiring reads the on-disk manifest.
- Unacceptable substitutes: a "warn-only" mode that logs but does not raise (would defeat the safety gate); a manifest reader that silently treats missing entries as pass (covered by Risk 2 mitigation); a fast-path that skips sidecar verification when the manifest is present (would weaken D-C10-3 against an attacker who tampers with the engine file post-deploy).
Implementation Notes (2026-05-12, batch 25)
Three minor task-spec → as-built deltas:
- HostTuple lives in engine_gate.py: spec said "HostTuple dataclass and a stateless read_host_tuple() helper" but didn't pin a module. Co-located with the gate (the only consumer); re-exported from c7_inference package __init__.py. Future consumers can lift it out if needed.
- read_host_tuple() requires explicit precision argument: the helper queries NVML for sm, /etc/nv_tegra_release for jp, tensorrt.__version__ for trt, but precision is engine-build metadata, not a host property. Caller passes it. Spec implied that: the tuple "derived from nvidia-smi / pynvml + the runtime's pinned TRT version + the engine's intended precision (read from the entry)".
- AC-8 is Tier-2-only: marked @pytest.mark.tier2 + @pytest.mark.skipif(GPS_DENIED_TIER!=2). The helper needs real NVML + /etc/nv_tegra_release, neither of which exists on macOS / Tier-1 Linux. AC-1..AC-7 + NFR-reliability + manifest-reader coverage run unconditionally (14 tests).
As-built file map
- src/gps_denied_onboard/components/c7_inference/engine_gate.py: EngineGate.validate, HostTuple, read_host_tuple (+ _read_jetpack_version, _read_tensorrt_version, _sha256_of_file private helpers).
- src/gps_denied_onboard/components/c7_inference/manifest.py: DeploymentManifest, ManifestReader, ManifestReaderProtocol. Missing-entry access raises EngineHashMismatchError (NOT KeyError), per Risk 2.
- src/gps_denied_onboard/components/c7_inference/__init__.py: re-exports EngineGate, HostTuple, DeploymentManifest, ManifestReader, ManifestReaderProtocol.
- tests/unit/c7_inference/test_engine_gate.py: 15 tests (14 unconditional + AC-8 tier-2 skip).
Refusal-order discipline
The five steps execute in this exact order; AC-7 verifies the property by passing a fixture that is both schema-mismatched and missing-sidecar. The schema error wins because the schema checks (steps 1 and 2) run before the sidecar check (step 3). Any future refactor that reorders the steps fails AC-7.