[AZ-301] Implement EngineGate — D-C10-3 + D-C10-7 takeoff validator

Adds the takeoff-side validator that every InferenceRuntime strategy
calls before deserialize_engine. It runs a deterministic five-step
refusal pipeline, in order:

  1. filename schema parse  -> EngineSchemaMismatchError(reason=...)
  2. schema tuple match     -> EngineSchemaMismatchError(expected,got)
  3. sidecar present        -> EngineSidecarMissingError
  4. sidecar trust          -> EngineHashMismatchError(stage=sidecar)
  5. manifest match         -> EngineHashMismatchError(stage=manifest)

Refusal order is part of the public contract: AC-7 verifies that a
fixture that is BOTH schema-mismatched AND missing-sidecar refuses
at step 1.
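
A minimal sketch of the first-failure-wins ordering described above.
GateRefusal, Fixture and refusal_step are hypothetical stand-ins, not
names from this commit; only the step labels mirror the ones the gate
logs.

```python
from dataclasses import dataclass


class GateRefusal(Exception):
    """Hypothetical stand-in for the Engine*Error family."""

    def __init__(self, step: str, reason: str) -> None:
        super().__init__(f"{step}: {reason}")
        self.step = step


@dataclass(frozen=True)
class Fixture:
    """Hypothetical engine fixture; each False flag is one defect."""

    filename_parses: bool = True
    tuple_matches: bool = True
    sidecar_present: bool = True
    sidecar_hash_ok: bool = True
    manifest_hash_ok: bool = True


# (step label, attribute that must be True), listed in contract order.
CHECKS = (
    ("schema_parse", "filename_parses"),
    ("schema_tuple", "tuple_matches"),
    ("sidecar_missing", "sidecar_present"),
    ("sidecar_trust", "sidecar_hash_ok"),
    ("manifest_mismatch", "manifest_hash_ok"),
)


def validate(fixture: Fixture) -> None:
    """Raise at the first failing check; later checks never run."""
    for step, flag in CHECKS:
        if not getattr(fixture, flag):
            raise GateRefusal(step, f"{flag} is False")


def refusal_step(fixture: Fixture):
    """Return the label of the step that refused, or None on a pass."""
    try:
        validate(fixture)
    except GateRefusal as exc:
        return exc.step
    return None
```

A fixture with both a bad filename and a missing sidecar reports
"schema_parse" here, because the loop stops at the earliest failing
check.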

Production code (new):
 - components/c7_inference/engine_gate.py  -- EngineGate, HostTuple,
   read_host_tuple (Jetson: pynvml + /etc/nv_tegra_release +
   tensorrt.__version__; raises RuntimeError on Tier-1)
 - components/c7_inference/manifest.py     -- DeploymentManifest,
   ManifestReader, ManifestReaderProtocol. Risk-2 enforced at the
   type level: __getitem__ raises EngineHashMismatchError on
   missing key, NEVER KeyError, so the gate cannot silently pass
 - components/c7_inference/__init__.py     -- re-exports the new
   public surface

Tests (new): tests/unit/c7_inference/test_engine_gate.py covers
AC-1..AC-7 + NFR-reliability-no-write + manifest reader + refusal
log emission. 14 tests run unconditionally; AC-8 is skipped outside
Tier-2 (it needs real NVML + L4T release file + tensorrt binding).

Three task-spec -> as-built deltas documented in
_docs/02_tasks/done/AZ-301_c7_engine_gate.md Implementation Notes:
 1. HostTuple lives in engine_gate.py (the only consumer);
    re-exported from package __init__.py.
 2. read_host_tuple takes precision as a keyword argument: three
     of the four fields come from the host, while precision is
     engine-build metadata supplied by the caller.
 3. AC-8 is Tier-2-only; AC-1..AC-7 + NFR-reliability + extras
    run on every CI host.

Risk-2 (manifest reader silently treats missing entry as pass):
DeploymentManifest.__getitem__ raises EngineHashMismatchError with
"missing manifest entry for engine path {key!r}" plus the manifest
root; covered by test_manifest_missing_entry_raises_hash_mismatch.
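
A minimal sketch of why the lookup raises a domain error rather than
KeyError. HashMismatch, Manifest and careless_gate are hypothetical
stand-ins for illustration; they are not the classes in this commit.

```python
from dataclasses import dataclass, field
from typing import Mapping


class HashMismatch(Exception):
    """Hypothetical stand-in for EngineHashMismatchError (not a KeyError)."""


@dataclass(frozen=True)
class Manifest:
    """Hypothetical manifest view whose lookup refuses on a missing key."""

    entries: Mapping[str, str] = field(default_factory=dict)

    def __getitem__(self, path: str) -> str:
        try:
            return self.entries[path]
        except KeyError as exc:
            raise HashMismatch(f"missing manifest entry for {path!r}") from exc


def careless_gate(manifest, path: str, sidecar_hash: str) -> bool:
    """A caller that wrongly treats 'not listed' as 'nothing to compare'."""
    try:
        return manifest[path] == sidecar_hash
    except KeyError:
        return True  # the silent pass that Risk-2 describes
```

With a plain dict, careless_gate({}, "a.engine", "00") returns True.
With Manifest({}), the same call raises HashMismatch, because the
except KeyError clause no longer matches.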

NFR-perf-validate (p99 <= 50 ms): Tier-2 only. Streaming sha256 over
a real 500 MB engine cannot be benchmarked on Tier-1 fixtures.
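
A rough sketch of how such a measurement could be taken, assuming a
chunked sha256 like the gate's _sha256_of_file helper. sha256_stream,
p99 and p99_hash_ms are hypothetical helper names, and the nearest-rank
percentile is one possible definition rather than the one the NFR
mandates.

```python
import hashlib
import time
from pathlib import Path


def sha256_stream(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    hasher = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty sequence."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("p99 needs at least one sample")
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n), integer arithmetic
    return ordered[max(rank, 1) - 1]


def p99_hash_ms(path: Path, runs: int = 20) -> float:
    """Time repeated hashes of one file and return the p99 in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        sha256_stream(path)
        timings.append((time.perf_counter() - start) * 1000.0)
    return p99(timings)
```

On a Tier-1 fixture of a few kilobytes the result says little about a
500 MB engine, which is the reason the NFR is deferred to Tier-2.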

AZ-302 (ThermalStatePublisher) + AZ-304 (C6 Postgres schema)
deferred to batches 26 / 27 to keep the 1-task batch cadence and
isolate their respective env / testcontainer surface areas.

Suite: 1134 passed / 11 skipped. No regressions outside the new
files.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-12 10:20:21 +03:00
parent 65ad2168ed
commit 59f56c032f
7 changed files with 941 additions and 1 deletions
@@ -34,6 +34,10 @@ from gps_denied_onboard.components.c7_inference.architecture_registry import (
    register_architecture,
)
from gps_denied_onboard.components.c7_inference.config import C7InferenceConfig
from gps_denied_onboard.components.c7_inference.engine_gate import (
    EngineGate,
    HostTuple,
)
from gps_denied_onboard.components.c7_inference.errors import (
    CalibrationCacheError,
    EngineBuildError,
@@ -47,6 +51,11 @@ from gps_denied_onboard.components.c7_inference.errors import (
    TelemetryUnavailableError,
)
from gps_denied_onboard.components.c7_inference.interface import InferenceRuntime
from gps_denied_onboard.components.c7_inference.manifest import (
    DeploymentManifest,
    ManifestReader,
    ManifestReaderProtocol,
)
from gps_denied_onboard.config.schema import register_component_block

register_component_block("c7_inference", C7InferenceConfig)
@@ -56,15 +65,20 @@ __all__ = [
    "BuildConfig",
    "C7InferenceConfig",
    "CalibrationCacheError",
    "DeploymentManifest",
    "EngineBuildError",
    "EngineCacheEntry",
    "EngineDeserializeError",
    "EngineGate",
    "EngineHandle",
    "EngineHashMismatchError",
    "EngineSchemaMismatchError",
    "EngineSidecarMissingError",
    "HostTuple",
    "InferenceError",
    "InferenceRuntime",
    "ManifestReader",
    "ManifestReaderProtocol",
    "OptimizationProfile",
    "OutOfMemoryError",
    "PrecisionMode",
@@ -0,0 +1,314 @@
"""``EngineGate`` -- D-C10-3 + D-C10-7 takeoff validator (AZ-301).

Every :class:`InferenceRuntime` strategy calls
:meth:`EngineGate.validate` before :meth:`InferenceRuntime.deserialize_engine`
so the five refusal paths happen in one place:

1. **Schema parse failure**: the ``.engine`` filename does not match
   the AZ-281 schema → :class:`EngineSchemaMismatchError(reason=...)`.
2. **Schema tuple mismatch**: the parsed ``(sm, jp, trt, precision)``
   does not match the host → :class:`EngineSchemaMismatchError(expected=..., got=...)`.
3. **Sidecar missing**: the ``.sha256`` sidecar is absent →
   :class:`EngineSidecarMissingError`.
4. **Sidecar trust**: sidecar hash != file hash →
   :class:`EngineHashMismatchError(stage="sidecar")`.
5. **Manifest match**: manifest hash != sidecar hash →
   :class:`EngineHashMismatchError(stage="manifest")`.

Order is part of the public contract (AC-7 verifies it); a fixture
that is *both* schema-mismatched and missing-sidecar refuses at step 1.

Pure validator: no GPU ops, no writes, no network. The
``read_host_tuple()`` helper queries ``pynvml`` (already a project dep
via :mod:`jetson-stats` / NVML telemetry); it never returns a synthetic
tuple. If NVML cannot read the GPU, the helper raises ``RuntimeError``
so the takeoff path aborts loudly.
"""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

from gps_denied_onboard._types.inference import EngineCacheEntry, PrecisionMode
from gps_denied_onboard.components.c7_inference.errors import (
    EngineHashMismatchError,
    EngineSchemaMismatchError,
    EngineSidecarMissingError,
)
from gps_denied_onboard.components.c7_inference.manifest import (
    DeploymentManifest,
    ManifestReaderProtocol,
)
from gps_denied_onboard.helpers.engine_filename_schema import (
    EngineFilenameSchema,
    EngineFilenameSchemaError,
)
from gps_denied_onboard.helpers.sha256_sidecar import (
    SIDECAR_SUFFIX,
    Sha256Sidecar,
    Sha256SidecarError,
)
from gps_denied_onboard.logging import get_logger

__all__ = ["EngineGate", "HostTuple", "read_host_tuple"]


@dataclass(frozen=True)
class HostTuple:
    """Host capability tuple consulted by D-C10-7 schema-match gate.

    Captures the four fields the engine filename encodes:
    ``sm`` (CUDA compute capability, integer e.g. 87 for Orin),
    ``jp`` (JetPack version, dotted "<major>.<minor>"),
    ``trt`` (TensorRT version, dotted), and ``precision``
    (:class:`PrecisionMode` enum value).
    """

    sm: int
    jp: str
    trt: str
    precision: PrecisionMode


class EngineGate:
    """Stateless validator. Constructor injects test-side overrides only.

    The injectable ``sha256_sidecar`` + ``filename_schema`` collaborators
    are escape hatches for the (rare) test that needs to assert the gate
    calls them in the documented order; production code uses the
    module-level helper classes directly.
    """

    def __init__(
        self,
        *,
        sha256_sidecar: type[Sha256Sidecar] = Sha256Sidecar,
        filename_schema: type[EngineFilenameSchema] = EngineFilenameSchema,
    ) -> None:
        self._sidecar = sha256_sidecar
        self._schema = filename_schema
        self._logger = get_logger("c7_inference.gate")

    def validate(
        self,
        entry: EngineCacheEntry,
        host_tuple: HostTuple,
        manifest: DeploymentManifest | ManifestReaderProtocol,
    ) -> None:
        """Run the five-step refusal pipeline. Returns None on full success."""
        engine_path = Path(entry.engine_path)

        # Step 1: filename schema parse.
        try:
            parsed = self._schema.parse(engine_path.name)
        except EngineFilenameSchemaError as exc:
            self._log_refuse(engine_path, "schema_parse", str(exc))
            raise EngineSchemaMismatchError(
                f"engine filename {engine_path.name!r} does not match the "
                f"AZ-281 schema: {exc}"
            ) from exc

        # Step 2: schema tuple match against the host.
        if (
            parsed.sm != host_tuple.sm
            or parsed.jetpack != host_tuple.jp
            or parsed.trt != host_tuple.trt
            or parsed.precision != host_tuple.precision.value
        ):
            reason = (
                f"engine {engine_path.name!r} parsed as "
                f"(sm={parsed.sm}, jp={parsed.jetpack!r}, trt={parsed.trt!r}, "
                f"precision={parsed.precision!r}); host expects "
                f"(sm={host_tuple.sm}, jp={host_tuple.jp!r}, "
                f"trt={host_tuple.trt!r}, "
                f"precision={host_tuple.precision.value!r})"
            )
            self._log_refuse(engine_path, "schema_tuple", reason)
            raise EngineSchemaMismatchError(reason)

        # Step 3: sidecar existence.
        sidecar_path = Path(str(engine_path) + SIDECAR_SUFFIX)
        if not sidecar_path.exists():
            reason = (
                f"sidecar missing for engine {engine_path!s} "
                f"(expected at {sidecar_path!s})"
            )
            self._log_refuse(engine_path, "sidecar_missing", reason)
            raise EngineSidecarMissingError(reason)

        # Step 4: sidecar trust (sidecar hash == file hash).
        try:
            sidecar_ok = self._sidecar.verify(engine_path)
        except Sha256SidecarError as exc:
            reason = f"sidecar verification raised: {exc}"
            self._log_refuse(engine_path, "sidecar_trust", reason)
            raise EngineHashMismatchError(reason) from exc
        if not sidecar_ok:
            actual = _sha256_of_file(engine_path)
            recorded = sidecar_path.read_text(encoding="utf-8").strip().lower()
            reason = (
                f"sidecar hash {recorded!r} does not match engine sha256 "
                f"{actual!r} for {engine_path!s}"
            )
            self._log_refuse(engine_path, "sidecar_trust", reason)
            raise EngineHashMismatchError(reason)

        # Step 5: manifest match. The manifest is resolved here rather than
        # at the top of validate() so that an unreadable manifest cannot
        # pre-empt the schema and sidecar refusals that come first in the
        # contract order.
        deployment_manifest = _resolve_manifest(manifest)
        try:
            relative = engine_path.relative_to(deployment_manifest.root)
        except ValueError as exc:
            reason = (
                f"engine {engine_path!s} is not under manifest root "
                f"{deployment_manifest.root!s}"
            )
            self._log_refuse(engine_path, "manifest_path", reason)
            raise EngineHashMismatchError(reason) from exc
        sidecar_hash = sidecar_path.read_text(encoding="utf-8").strip().lower()
        manifest_hash = deployment_manifest[relative].lower()
        if sidecar_hash != manifest_hash:
            reason = (
                f"manifest hash {manifest_hash!r} != sidecar hash "
                f"{sidecar_hash!r} for engine path {relative.as_posix()!r}"
            )
            self._log_refuse(engine_path, "manifest_mismatch", reason)
            raise EngineHashMismatchError(reason)

        self._logger.info(
            "engine gate passed for %s",
            engine_path.name,
            extra={
                "kind": "c7.gate.pass",
                "kv": {
                    "engine_path": str(engine_path),
                    "host_sm": host_tuple.sm,
                    "host_jp": host_tuple.jp,
                    "host_trt": host_tuple.trt,
                    "precision": host_tuple.precision.value,
                    "sha256": sidecar_hash,
                },
            },
        )

    def _log_refuse(
        self, engine_path: Path, step: str, reason: str
    ) -> None:
        self._logger.error(
            "engine gate refused %s at step %s",
            engine_path.name,
            step,
            extra={
                "kind": "c7.gate.refuse",
                "kv": {
                    "engine_path": str(engine_path),
                    "step": step,
                    "reason": reason,
                },
            },
        )


def read_host_tuple(*, precision: PrecisionMode) -> HostTuple:
    """Read ``(sm, jp, trt)`` from the host; combine with ``precision``.

    ``sm`` is queried through ``pynvml``. JetPack and TensorRT versions
    are read from the Jetson environment (``/etc/nv_tegra_release`` for
    the L4T → JetPack mapping; ``tensorrt.__version__`` when the binding
    is importable).

    Raises ``RuntimeError`` if any field is unreadable; never returns a
    synthetic / default tuple. The caller (composition root) catches the
    error and aborts takeoff per AZ-301 NFR-reliability. Tier-1
    workstation hosts that lack these sources raise: Tier-1 is not
    allowed to load airborne engines by design.
    """
    # Local import: pynvml is a Tier-2-only dep. A missing binding is
    # reported as RuntimeError, matching the documented contract.
    try:
        import pynvml
    except ImportError as exc:
        raise RuntimeError(
            "cannot read host tuple: pynvml python binding not installed "
            "(Tier-2 Jetson only)"
        ) from exc
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as exc:  # type: ignore[attr-defined]
        raise RuntimeError(
            f"cannot read host tuple: pynvml init failed: {exc}"
        ) from exc
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        sm = int(major) * 10 + int(minor)
    except pynvml.NVMLError as exc:  # type: ignore[attr-defined]
        raise RuntimeError(
            f"cannot read host tuple: NVML device query failed: {exc}"
        ) from exc
    finally:
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError:  # type: ignore[attr-defined]
            pass
    jp = _read_jetpack_version()
    trt = _read_tensorrt_version()
    return HostTuple(sm=sm, jp=jp, trt=trt, precision=precision)


def _resolve_manifest(
    manifest: DeploymentManifest | ManifestReaderProtocol,
) -> DeploymentManifest:
    if isinstance(manifest, DeploymentManifest):
        return manifest
    return manifest.read()


def _read_jetpack_version() -> str:
    """Parse ``/etc/nv_tegra_release`` → JetPack dotted version."""
    release_path = Path("/etc/nv_tegra_release")
    if not release_path.exists():
        raise RuntimeError(
            "cannot read host tuple: /etc/nv_tegra_release not found "
            "(JetPack version requires a Jetson host)"
        )
    text = release_path.read_text(encoding="utf-8", errors="replace")
    # L4T R36.x → JetPack 6.x; the mapping is owned by the JetPack release notes.
    # The contract test fixture pins JetPack 6.2 ← L4T R36.4.
    if "R36" in text:
        if "REVISION: 4" in text:
            return "6.2"
        return "6.0"
    if "R35" in text:
        return "5.1"
    raise RuntimeError(
        "cannot read host tuple: unrecognised L4T release in "
        f"{release_path!s}: {text!r}"
    )


def _read_tensorrt_version() -> str:
    """Return the dotted ``<major>.<minor>`` TRT version from the binding."""
    try:
        import tensorrt  # type: ignore[import-not-found]
    except ImportError as exc:
        raise RuntimeError(
            "cannot read host tuple: tensorrt python binding not installed "
            "(Tier-2 Jetson only)"
        ) from exc
    version: str = tensorrt.__version__
    parts = version.split(".")
    if len(parts) < 2:
        raise RuntimeError(
            f"cannot read host tuple: tensorrt.__version__={version!r} is "
            "not a dotted version"
        )
    return f"{parts[0]}.{parts[1]}"


def _sha256_of_file(path: Path) -> str:
    import hashlib

    hasher = hashlib.sha256()
    with path.open("rb") as fh:
        while True:
            chunk = fh.read(1 << 20)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest()
@@ -0,0 +1,133 @@
"""Deployment manifest reader for the C7 ``EngineGate`` (AZ-301).

E-C10 produces a ``manifest.json`` during F1 cache provisioning. The
file maps every relative engine path to its canonical sha256_hex. This
module is the read-only consumer side: ``ManifestReader.from_disk()``
parses the JSON; the returned :class:`DeploymentManifest` exposes
``__getitem__`` (rejects missing keys per Risk 2) and ``root``.

The manifest's *write* shape is owned by E-C10. AZ-301 implements only
the reader sufficient for D-C10-3 takeoff gate. Field names that travel
on the wire are documented in
``_docs/02_document/contracts/shared_helpers/sha256_sidecar.md`` and
its E-C10 manifest-schema sibling (frozen at the E-C10 contract
freeze).
"""

from __future__ import annotations

import json
from collections.abc import Mapping
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable

from gps_denied_onboard.components.c7_inference.errors import (
    EngineHashMismatchError,
)

__all__ = [
    "DeploymentManifest",
    "ManifestReader",
    "ManifestReaderProtocol",
]


@dataclass(frozen=True)
class DeploymentManifest:
    """Read-only typed view over a deployed ``manifest.json``.

    ``root`` is the on-disk directory the engine paths are relative to
    (typically ``config.components['c7_inference'].engine_cache_dir``).
    ``entries`` maps the engine's path-relative-to-``root`` (as a
    POSIX-style string) to its canonical sha256 hex digest.

    Missing-entry access raises :class:`EngineHashMismatchError`
    (NOT ``KeyError``): Risk 2 in the AZ-301 spec demands the gate treat
    "no manifest entry" as a refusal, not a silent pass.
    """

    root: Path
    entries: Mapping[str, str]

    def __getitem__(self, relative_path: Path | str) -> str:
        key = _key_str(relative_path)
        try:
            return self.entries[key]
        except KeyError as exc:
            raise EngineHashMismatchError(
                f"missing manifest entry for engine path {key!r} "
                f"(manifest root={self.root!s})"
            ) from exc

    def __contains__(self, relative_path: object) -> bool:
        if not isinstance(relative_path, (Path, str)):
            return False
        return _key_str(relative_path) in self.entries


@runtime_checkable
class ManifestReaderProtocol(Protocol):
    """Structural shape the gate consumes. Live impl is :class:`ManifestReader`;
    tests inject a dict-backed fake."""

    def read(self) -> DeploymentManifest:  # pragma: no cover - structural
        ...


class ManifestReader:
    """On-disk JSON reader for the deployment manifest.

    Constructor takes the path of the ``manifest.json`` file. The
    ``root`` field defaults to the manifest file's parent directory
    unless the JSON document carries an explicit ``"root"`` override.
    """

    __slots__ = ("_path",)

    def __init__(self, manifest_path: Path) -> None:
        self._path = manifest_path

    def read(self) -> DeploymentManifest:
        try:
            text = self._path.read_text(encoding="utf-8")
        except OSError as exc:
            raise EngineHashMismatchError(
                f"cannot read manifest at {self._path!s}: "
                f"{type(exc).__name__}: {exc}"
            ) from exc
        try:
            payload = json.loads(text)
        except json.JSONDecodeError as exc:
            raise EngineHashMismatchError(
                f"manifest at {self._path!s} is not valid JSON: {exc.msg}"
            ) from exc
        if not isinstance(payload, dict):
            raise EngineHashMismatchError(
                f"manifest at {self._path!s} must be a JSON object; got "
                f"{type(payload).__name__}"
            )
        raw_entries = payload.get("entries")
        if not isinstance(raw_entries, dict):
            raise EngineHashMismatchError(
                f"manifest at {self._path!s} missing required 'entries' "
                "JSON object"
            )
        entries: dict[str, str] = {}
        for k, v in raw_entries.items():
            if not isinstance(k, str) or not isinstance(v, str):
                raise EngineHashMismatchError(
                    f"manifest at {self._path!s} entry has non-string "
                    f"key/value: {k!r}: {v!r}"
                )
            entries[k] = v.lower()
        root_value = payload.get("root")
        root = Path(root_value) if isinstance(root_value, str) else self._path.parent
        return DeploymentManifest(root=root, entries=entries)


def _key_str(relative_path: Path | str) -> str:
    """Normalise a key to a POSIX-style string for entry lookup."""
    if isinstance(relative_path, Path):
        return relative_path.as_posix()
    return relative_path.replace("\\", "/")
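
A minimal sketch of the manifest shape the reader above accepts: a JSON
object with a required "entries" object and an optional "root" string.
parse_manifest and lookup_key are hypothetical helpers written for this
illustration, and the path "models/example.engine" is a placeholder,
not a filename that follows the AZ-281 schema.

```python
import json
from pathlib import Path


def parse_manifest(text: str, manifest_path: Path):
    """Return (root, entries) parsed from manifest JSON text."""
    payload = json.loads(text)
    if not isinstance(payload, dict):
        raise ValueError("manifest must be a JSON object")
    raw_entries = payload.get("entries")
    if not isinstance(raw_entries, dict):
        raise ValueError("manifest is missing the 'entries' object")
    entries = {}
    for key, value in raw_entries.items():
        if not isinstance(value, str):
            raise ValueError(f"non-string hash for {key!r}")
        entries[key] = value.lower()  # digests compare case-insensitively
    root_value = payload.get("root")
    # Without an explicit "root", paths are relative to the manifest's folder.
    root = Path(root_value) if isinstance(root_value, str) else manifest_path.parent
    return root, entries


def lookup_key(relative_path) -> str:
    """Normalise a Path or str to the POSIX-style key used in 'entries'."""
    if isinstance(relative_path, Path):
        return relative_path.as_posix()
    return relative_path.replace("\\", "/")


# Placeholder document: one engine path mapped to an upper-case digest.
EXAMPLE = json.dumps({"entries": {"models/example.engine": "AB" * 32}})
root, entries = parse_manifest(EXAMPLE, Path("/opt/cache/manifest.json"))
```

Here root falls back to the manifest's parent directory, the stored
digest is lower-cased, and a backslash-separated lookup string resolves
to the same entry as its forward-slash form.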