gps-denied-onboard/_docs/02_tasks/done/AZ-301_c7_engine_gate.md
Oleksandr Bezdieniezhnykh 59f56c032f [AZ-301] Implement EngineGate — D-C10-3 + D-C10-7 takeoff validator
AZ-301 takeoff-side validator every InferenceRuntime strategy calls
before deserialize_engine. Five-step deterministic refusal pipeline,
in order:

  1. filename schema parse  -> EngineSchemaMismatchError(reason=...)
  2. schema tuple match     -> EngineSchemaMismatchError(expected,got)
  3. sidecar present        -> EngineSidecarMissingError
  4. sidecar trust          -> EngineHashMismatchError(stage=sidecar)
  5. manifest match         -> EngineHashMismatchError(stage=manifest)

Refusal order is part of the public contract (AC-7 verifies a
fixture that is BOTH schema-mismatched AND missing-sidecar refuses
at step 1).

Production code (new):
 - components/c7_inference/engine_gate.py  -- EngineGate, HostTuple,
   read_host_tuple (Jetson: pynvml + /etc/nv_tegra_release +
   tensorrt.__version__; raises RuntimeError on Tier-1)
 - components/c7_inference/manifest.py     -- DeploymentManifest,
   ManifestReader, ManifestReaderProtocol. Risk-2 enforced at the
   type level: __getitem__ raises EngineHashMismatchError on
   missing key, NEVER KeyError, so the gate cannot silently pass
 - components/c7_inference/__init__.py     -- re-exports the new
   public surface

Tests (new): tests/unit/c7_inference/test_engine_gate.py covers
AC-1..AC-7 + NFR-reliability-no-write + manifest reader + refusal
log emission. 14 tests unconditional + AC-8 Tier-2 skip (needs
real NVML + L4T release file + tensorrt binding).

Three task-spec -> as-built deltas documented in
_docs/02_tasks/done/AZ-301_c7_engine_gate.md Implementation Notes:
 1. HostTuple lives in engine_gate.py (the only consumer);
    re-exported from package __init__.py.
 2. read_host_tuple takes precision as a keyword argument — three
    of four fields come from the host, precision is engine-build
    metadata supplied by the caller.
 3. AC-8 is Tier-2-only; AC-1..AC-7 + NFR-reliability + extras
    run on every CI host.

Risk-2 (manifest reader silently treats missing entry as pass):
DeploymentManifest.__getitem__ raises EngineHashMismatchError with
"missing manifest entry for {path}" — covered by
test_manifest_missing_entry_raises_hash_mismatch.

NFR-perf-validate (p99 <= 50 ms): tier-2 only — a real 500 MB
engine streaming sha256 cannot be benchmarked on Tier-1 fixtures.

AZ-302 (ThermalStatePublisher) + AZ-304 (C6 Postgres schema)
deferred to batches 26 / 27 to keep the 1-task batch cadence and
isolate their respective env / testcontainer surface areas.

Suite: 1134 passed / 11 skipped. No regressions outside the new
files.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 10:20:21 +03:00


C7 EngineGate — D-C10-3 Content-Hash + D-C10-7 Filename Schema Enforcement

Task: AZ-301_c7_engine_gate
Name: C7 EngineGate
Description: Implement the takeoff-side EngineGate validator that every InferenceRuntime strategy invokes before deserialising a cached .engine file. Two refusal paths: (1) D-C10-7 filename-schema mismatch raises EngineSchemaMismatchError at parse time; (2) D-C10-3 manifest content-hash mismatch (or missing sidecar) raises EngineHashMismatchError / EngineSidecarMissingError. Pure validation: no GPU ops, and no I/O beyond reading the engine file (to hash it), its sidecar, and the deployed manifest. Isolated so all three runtime strategies (TRT / ONNX-RT / PyTorch) call the same validator.
Complexity: 3 points
Dependencies: AZ-297_c7_runtime_protocol, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-266_log_module
Component: c7_inference (epic AZ-249 / E-C7)
Tracker: AZ-301
Epic: AZ-249 (E-C7)

Document Dependencies

  • _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md — defines EngineCacheEntry (input) and the gate's error types.
  • _docs/02_document/contracts/shared_helpers/sha256_sidecar.md — sidecar verification contract.
  • _docs/02_document/contracts/shared_helpers/engine_filename_schema.md — filename-schema parser.

Problem

D-C10-3 (manifest content-hash takeoff gate) and D-C10-7 (engine filename schema) are two of the project's safety-critical gates. Both fire at takeoff (F2) when an .engine file is about to be deserialised. Without a centralised validator:

  • Each runtime strategy (TRT / ONNX-RT / PyTorch) would re-implement the gate logic — three copies, three drift surfaces, three places to fix bugs.
  • C7-IT-03 (D-C10-3 takeoff gate refuses mismatched engine) and C7-IT-04 (filename-schema enforcement) cannot share test fixtures.
  • A future runtime (a hypothetical CUDA EP path) could silently skip the gate by forgetting to invoke it.

This task delivers ONE validator, used by every strategy, with a single point of refusal.

Outcome

  • An EngineGate class at src/gps_denied_onboard/components/c7_inference/engine_gate.py with a single public method: validate(entry: EngineCacheEntry, host_tuple: HostTuple, manifest: DeploymentManifest) -> None that returns silently on success and raises one of EngineSchemaMismatchError, EngineHashMismatchError, EngineSidecarMissingError on refusal.
  • HostTuple is a small frozen dataclass (sm: int, jp: str, trt: str, precision: PrecisionMode) derived from nvidia-smi / pynvml + the runtime's pinned TRT version + the engine's intended precision (read from the entry).
  • DeploymentManifest is a typed wrapper over the deployed manifest.json (an ordered map of engine_relative_path → sha256_hex) that C10 produces during F1 and the deployment process delivers alongside the engines. The manifest schema is owned by E-C10; this task DEPENDS on its existence but does not write the manifest.
  • The two refusal paths expand into five steps, evaluated in this order:
    1. Schema parse: helpers.engine_filename_schema.parse(entry.engine_path.name) returns the (sm, jp, trt, precision) quadruple; if the parse fails, raise EngineSchemaMismatchError(reason="parse failure: ...").
    2. Schema match: if the parsed quadruple does not match host_tuple, raise EngineSchemaMismatchError(expected=host_tuple, got=parsed).
    3. Sidecar presence: if no .sha256 sidecar exists alongside entry.engine_path, raise EngineSidecarMissingError.
    4. Sidecar trust: helpers.sha256_sidecar.verify(entry.engine_path) — if the sidecar's recorded hash does not match the engine's actual sha256, raise EngineHashMismatchError(stage="sidecar").
    5. Manifest match: if manifest[entry.engine_path.relative_to(manifest.root)] does not equal the verified sha256, raise EngineHashMismatchError(stage="manifest").
  • Every refusal includes structured fields in the exception message (engine path, expected vs. got tuple / hash, manifest entry if any) suitable for the takeoff-abort log path that C10 / runtime_root wires up downstream.
  • Diagnostic INFO log on success (kind="c7.gate.pass", engine path, host tuple, manifest hash); ERROR log on each refusal (kind="c7.gate.refuse", refusal reason, engine path).
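
The five-step order above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the as-built EngineGate: the filename regex is inferred from the example name in AC-2, the sidecar is assumed to sit at `<engine>.sha256` with the hex digest as its first token, the manifest is a plain mapping, and `parse_name`, `sha256_of_file` and the free function `validate` are stand-ins for the project's helpers and the `EngineGate.validate` method.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


class EngineSchemaMismatchError(Exception):
    pass


class EngineSidecarMissingError(Exception):
    pass


class EngineHashMismatchError(Exception):
    def __init__(self, stage: str, detail: str) -> None:
        super().__init__(f"stage={stage}: {detail}")
        self.stage = stage


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: str  # the spec uses PrecisionMode; a plain str stands in here


# Assumed pattern, inferred from "ultravpr__sm86_jp6.2_trt10.3_fp16.engine".
_NAME = re.compile(r"^.+__sm(\d+)_jp([\d.]+)_trt([\d.]+)_([a-z0-9]+)\.engine$")


def parse_name(name: str) -> HostTuple:
    match = _NAME.match(name)
    if match is None:
        raise EngineSchemaMismatchError(f"parse failure: {name!r}")
    return HostTuple(int(match[1]), match[2], match[3], match[4])


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so a large engine is never loaded whole.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while block := fh.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()


def validate(engine: Path, host: HostTuple, manifest: Mapping[str, str], root: Path) -> None:
    parsed = parse_name(engine.name)                       # step 1: schema parse
    if parsed != host:                                     # step 2: schema match
        raise EngineSchemaMismatchError(f"expected={host} got={parsed}")
    sidecar = engine.with_name(engine.name + ".sha256")    # assumed sidecar naming
    if not sidecar.is_file():                              # step 3: sidecar presence
        raise EngineSidecarMissingError(str(engine))
    actual = sha256_of_file(engine)
    recorded = (sidecar.read_text().split() or [""])[0]
    if recorded != actual:                                 # step 4: sidecar trust
        raise EngineHashMismatchError("sidecar", f"expected={recorded} got={actual}")
    key = engine.relative_to(root).as_posix()
    if manifest.get(key) != actual:                        # step 5: manifest match
        raise EngineHashMismatchError("manifest", f"manifest={manifest.get(key)} file={actual}")
```

Because each step returns or raises before the next one starts, an engine that is both schema-mismatched and missing its sidecar is refused at the schema check, and a missing manifest key compares as `None` and is refused rather than passed.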

Scope

Included

  • EngineGate class with validate(entry, host_tuple, manifest) -> None.
  • HostTuple dataclass and a stateless read_host_tuple() -> HostTuple helper that calls nvidia-smi --query-gpu=compute_cap,driver_version,... (via pynvml where possible). The helper is in this task because the gate is the only consumer; future consumers can lift it out.
  • The five refusal paths above, in the documented order. The order is deterministic so test fixtures can target each step.
  • Parse-error vs. tuple-mismatch differentiation: the gate distinguishes "we could not parse the filename" from "we parsed it and it does not match" (via EngineSchemaMismatchError(reason=...) vs (expected=..., got=...)). Both are the same exception class; the kwargs differ.
  • Manifest reader: a thin typed wrapper at src/gps_denied_onboard/components/c7_inference/manifest.py that reads the deployed manifest.json and exposes __getitem__ and root. The actual manifest schema is owned by E-C10; this task implements only the reader sufficient for the gate's needs and references the canonical schema location.
  • INFO-on-pass and ERROR-on-refuse logs.
  • Constructor-injectable ManifestReader for tests (the production reader reads from disk; tests inject a dict-backed fake).
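
The INFO-on-pass and ERROR-on-refuse records could look roughly like the sketch below. It uses the standard-library `logging` module as a stand-in for the project's log module (AZ-266), whose API is not shown here; the logger name and the `log_pass` / `log_refuse` helpers are hypothetical, and only the `kind` values come from this spec.

```python
import logging

# Hypothetical logger name; the real gate would use the project's log module.
log = logging.getLogger("c7.gate")


def log_pass(engine_path: str, host_tuple: str, manifest_hash: str) -> None:
    # Fields passed via `extra` become attributes on the emitted LogRecord.
    log.info(
        "engine gate pass",
        extra={
            "kind": "c7.gate.pass",
            "engine_path": engine_path,
            "host_tuple": host_tuple,
            "manifest_hash": manifest_hash,
        },
    )


def log_refuse(engine_path: str, reason: str) -> None:
    log.error(
        "engine gate refuse",
        extra={"kind": "c7.gate.refuse", "engine_path": engine_path, "reason": reason},
    )
```

Keeping the structured fields in `extra` rather than in the message string lets a downstream handler (for example the takeoff-abort log path) filter on `kind` without parsing text.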

Excluded

  • AZ-298 / AZ-299 / AZ-300 strategy implementations — they CALL the gate.
  • AZ-302 ThermalState publisher — unrelated.
  • The deployment manifest's schema — owned by E-C10 (this task writes the reader, not the writer).
  • F2 takeoff abort orchestration (the gate raises an error; the runtime caller propagates; the takeoff path catches and aborts) — owned by runtime_root and E-C10.
  • C12 operator tooling diagnostics for refused engines — out of scope.
  • A "tolerant mode" that allows minor SM differences — explicitly out of scope this cycle (would defeat the safety gate).

Acceptance Criteria

AC-1: filename-schema parse failure refused at parse time
Given an engine file named bogus_name.engine (no schema)
When validate(entry, ...) is called
Then EngineSchemaMismatchError(reason="parse failure: ...") is raised; subsequent gate steps are NOT executed (no sidecar read, no manifest lookup); no GPU memory is allocated by the caller (verifiable via NVML diff = 0 around the call)

AC-2: filename-schema tuple mismatch refused at parse time
Given an engine ultravpr__sm86_jp6.2_trt10.3_fp16.engine and a host with sm=87
When validate is called
Then EngineSchemaMismatchError(expected=HostTuple(sm=87, ...), got=ParsedTuple(sm=86, ...)) is raised; no sidecar / manifest checks execute

AC-3: missing sidecar refused before manifest lookup
Given a schema-matched engine whose .sha256 sidecar does NOT exist on disk
When validate is called
Then EngineSidecarMissingError(engine_path=...) is raised; the manifest is NOT read

AC-4: sidecar trust failure
Given a schema-matched engine whose sidecar exists but records a hash that does NOT match the engine's actual sha256
When validate is called
Then EngineHashMismatchError(stage="sidecar", engine_path=..., expected=..., got=...) is raised; the manifest is NOT consulted

AC-5: manifest mismatch
Given a schema-matched engine whose sidecar verifies (sidecar hash == file hash) but the deployment manifest's entry for this engine path records a DIFFERENT hash
When validate is called
Then EngineHashMismatchError(stage="manifest", engine_path=..., manifest_hash=..., file_hash=...) is raised

AC-6: full-success path returns silently and logs INFO
Given an engine that passes all five steps
When validate is called
Then the call returns None silently; one kind="c7.gate.pass" INFO log record was emitted; the caller proceeds to deserialise

AC-7: refusal order is deterministic for fixture targeting
Given a fixture engine that is BOTH schema-mismatched AND has a missing sidecar
When validate is called
Then EngineSchemaMismatchError is raised (NOT EngineSidecarMissingError) because the schema check runs first; this property is documented and tested

AC-8: read_host_tuple matches the running Jetson
Given a Jetson Orin Nano Super running JetPack 6.2 + TRT 10.3
When read_host_tuple() is called
Then the returned tuple has sm=87, jp="6.2", trt="10.3"; on a workstation (Tier-1 Docker) the values reflect that environment instead

Non-Functional Requirements

Performance

  • validate p99 ≤ 50 ms (sidecar read + manifest dict lookup + sha256 streaming over the engine file). Sha256 over a 500 MB engine file dominates; that is the design budget.
  • read_host_tuple p99 ≤ 100 ms (one pynvml call).

Reliability

  • The gate is deterministic — same inputs always produce the same outcome. Test fixtures rely on this property.
  • The gate makes NO network calls and NO writes; it is read-only.
  • Errors carry structured fields suitable for the post-flight FDR record (the runtime caller forwards them).

Unit Tests

| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Bogus filename | EngineSchemaMismatchError(reason=...); no further gate steps |
| AC-2 | Mismatched SM in filename | EngineSchemaMismatchError(expected=..., got=...) |
| AC-3 | Missing sidecar | EngineSidecarMissingError; manifest not read |
| AC-4 | Sidecar hash != file hash | EngineHashMismatchError(stage="sidecar") |
| AC-5 | Manifest hash != sidecar hash | EngineHashMismatchError(stage="manifest") |
| AC-6 | Happy path | Returns None; INFO log emitted |
| AC-7 | Both schema-fail AND sidecar-missing | Schema error wins (deterministic order) |
| AC-8 | host tuple read on Jetson Orin Nano Super | (sm=87, jp="6.2", trt="10.3") |
| NFR-perf-validate | Microbench validate × 100 with a 500 MB engine | p99 ≤ 50 ms |
| NFR-reliability-no-write | Run validate against a read-only directory | No writes attempted (sidecar stays untouched) |

Constraints

  • The five refusal paths execute in the documented order; the order is part of the public contract (AC-7 verifies it).
  • The gate is read-only. NEVER writes to the engine file, sidecar, or manifest.
  • The ManifestReader is constructor-injectable; the production reader reads manifest.json from disk; tests inject a dict-backed fake.
  • The read_host_tuple helper uses pynvml first; falls back to parsing nvidia-smi output if pynvml is unavailable. NEVER returns a synthetic / default tuple — if the GPU cannot be queried, raises RuntimeError("cannot read host tuple") and the takeoff path aborts.
  • Sha256 is computed using stdlib hashlib.sha256 with chunked reads via helpers.sha256_sidecar; this task does NOT introduce a new sha256 library.
  • This task introduces no new third-party dependencies beyond pynvml (which is already a project dependency for jetson-stats / NVML telemetry per the C7 description.md).
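
The "pynvml first, nvidia-smi second, never a default" rule can be sketched as an ordered probe chain. The helper below is hypothetical (`first_available` and the probe callables are illustrative names, not part of the as-built module); in the real helper the probes would wrap pynvml and the nvidia-smi parser.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_available(probes: Sequence[Tuple[str, Callable[[], T]]]) -> T:
    """Return the first probe result; raise if every probe fails.

    There is deliberately no default value: a host that cannot be queried
    must abort takeoff rather than be described by a synthetic tuple.
    """
    failures = []
    for name, probe in probes:
        try:
            return probe()
        except Exception as exc:  # any probe failure falls through to the next probe
            failures.append(f"{name}: {exc}")
    raise RuntimeError("cannot read host tuple (" + "; ".join(failures) + ")")
```

A caller would pass the probes in priority order, for example `first_available([("pynvml", read_sm_via_pynvml), ("nvidia-smi", read_sm_via_cli)])`, where both callables are assumed to exist elsewhere.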

Risks & Mitigation

Risk 1: Sha256 over a 500 MB engine dominates takeoff latency

  • Risk: Per-engine 50 ms × 10 engines = 500 ms blocking takeoff.
  • Mitigation: The sidecar's recorded hash is the trust anchor; once the sidecar verifies, the manifest match is a dict lookup. The only file-streaming sha256 happens during sidecar verification (one streaming pass per engine). Per-engine 50 ms is the budget, and the test asserts it. If a future regression pushes past this budget, the gate can be fast-pathed by reusing a cached file hash computed at compile time (out of scope this cycle).
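
The figures in Risk 1 can be checked with a few lines of arithmetic. The sketch assumes MB means 10^6 bytes and that the full file is read and hashed inside the per-engine budget; the function names are illustrative only.

```python
def takeoff_blocking_ms(n_engines: int, per_engine_ms: int) -> int:
    # Engines are validated one after another, so the budgets add up.
    return n_engines * per_engine_ms


def required_throughput_gb_s(engine_bytes: int, budget_ms: int) -> float:
    # bytes per second needed to stream the whole file within the budget
    bytes_per_second = engine_bytes * 1000 / budget_ms
    return bytes_per_second / 1e9


blocking = takeoff_blocking_ms(n_engines=10, per_engine_ms=50)
throughput = required_throughput_gb_s(engine_bytes=500 * 10**6, budget_ms=50)
```

With the spec's numbers, `blocking` is 500 ms and `throughput` comes to 10 GB/s of combined read and hash rate, which is the figure a Tier-2 benchmark would have to sustain to meet NFR-perf-validate.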

Risk 2: Manifest reader silently treats missing entry as pass

  • Risk: A typo in the manifest produces KeyError swallowed somewhere; the gate "passes" without checking.
  • Mitigation: The manifest reader's __getitem__ raises EngineHashMismatchError(stage="manifest", reason="missing manifest entry for ...") on missing key — NEVER returns None or treats absence as pass. AC-5 covers the mismatch case; an additional negative test covers the missing case.
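
A minimal sketch of the missing-entry behaviour, assuming a dict-backed manifest keyed by POSIX-style relative paths. The class bodies are illustrative rather than the as-built ones; only the error type, the `stage="manifest"` field and the "missing manifest entry for ..." wording come from this spec.

```python
from __future__ import annotations

from pathlib import Path


class EngineHashMismatchError(Exception):
    def __init__(self, stage: str, reason: str) -> None:
        super().__init__(f"stage={stage}: {reason}")
        self.stage = stage
        self.reason = reason


class DeploymentManifest:
    """Typed wrapper over a mapping of engine_relative_path -> sha256_hex."""

    def __init__(self, root: Path, entries: dict[str, str]) -> None:
        self.root = root
        self._entries = dict(entries)  # defensive copy; the gate never writes

    def __getitem__(self, relative_path: str | Path) -> str:
        key = Path(relative_path).as_posix()
        try:
            return self._entries[key]
        except KeyError:
            # Absence is a refusal, never a pass: translate the KeyError so a
            # caller that swallows KeyError cannot skip the manifest check.
            raise EngineHashMismatchError(
                stage="manifest", reason=f"missing manifest entry for {key}"
            ) from None
```

Since `EngineHashMismatchError` does not derive from `KeyError` in this sketch, an `except KeyError` higher up the call stack will not intercept it.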

Risk 3: Refusal order changes silently across refactors

  • Risk: A future refactor reorders the five steps; AC-7's "schema wins over sidecar-missing" property regresses.
  • Mitigation: AC-7 is a deterministic ordering test; any reorder fails it. The refusal order is part of the public contract documented in this task spec.

Runtime Completeness

  • Named capability: D-C10-3 takeoff content-hash gate + D-C10-7 filename-schema enforcement (architecture / E-C7 / E-C10 / risk_mitigations.md R04).
  • Production code that must exist: real EngineGate.validate calling real helpers.engine_filename_schema.parse, real helpers.sha256_sidecar.verify, real ManifestReader reading the deployed manifest.json from disk.
  • Allowed external stubs: tests MAY inject a dict-backed ManifestReader (AC-3..AC-7); production wiring reads the on-disk manifest.
  • Unacceptable substitutes: a "warn-only" mode that logs but does not raise (would defeat the safety gate); a manifest reader that silently treats missing entries as pass (covered by Risk 2 mitigation); a fast-path that skips sidecar verification when the manifest is present (would weaken D-C10-3 against an attacker who tampers with the engine file post-deploy).

Implementation Notes (2026-05-12, batch 25)

Three minor task-spec → as-built deltas:

  1. HostTuple lives in engine_gate.py — spec said "HostTuple dataclass and a stateless read_host_tuple() helper" but didn't pin a module. Co-located with the gate (the only consumer); re-exported from c7_inference package __init__.py. Future consumers can lift it out if needed.

  2. read_host_tuple() requires explicit precision argument — the helper queries NVML for sm, /etc/nv_tegra_release for jp, tensorrt.__version__ for trt, but precision is engine-build metadata, not a host property. Caller passes it. Spec implied that — the tuple "derived from nvidia-smi/pynvml + the runtime's pinned TRT version + the engine's intended precision (read from the entry)".

  3. AC-8 is Tier-2-only — marked @pytest.mark.tier2 + @pytest.mark.skipif(GPS_DENIED_TIER!=2). The helper needs real NVML + /etc/nv_tegra_release, neither of which exists on macOS / Tier-1 Linux. AC-1..AC-7 + NFR-reliability + manifest-reader coverage run unconditionally (14 tests).

As-built file map

  • src/gps_denied_onboard/components/c7_inference/engine_gate.py: EngineGate.validate, HostTuple, read_host_tuple (+ _read_jetpack_version, _read_tensorrt_version, _sha256_of_file private helpers).
  • src/gps_denied_onboard/components/c7_inference/manifest.py: DeploymentManifest, ManifestReader, ManifestReaderProtocol. Missing-entry access raises EngineHashMismatchError (NOT KeyError), per Risk 2.
  • src/gps_denied_onboard/components/c7_inference/__init__.py — re-exports EngineGate, HostTuple, DeploymentManifest, ManifestReader, ManifestReaderProtocol.
  • tests/unit/c7_inference/test_engine_gate.py — 15 tests (14 unconditional + AC-8 tier-2 skip).

Refusal-order discipline

The five steps execute in this exact order; AC-7 verifies the property by passing a fixture that is both schema-mismatched and missing-sidecar — the schema error wins because step 1 runs first. Future refactors that reorder the steps regress AC-7.