gps-denied-onboard/_docs/02_tasks/todo/AZ-301_c7_engine_gate.md
Oleksandr Bezdieniezhnykh 880eabcb3f Decompose Step 6 snapshot: 140 task specs + contract docs
Closes out greenfield Step 6 (Decompose) for all 14 components
(C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446
plus the _dependencies_table.md and component contract documents.

State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 00:39:48 +03:00

C7 EngineGate — D-C10-3 Content-Hash + D-C10-7 Filename Schema Enforcement

Task: AZ-301_c7_engine_gate
Name: C7 EngineGate
Description: Implement the takeoff-side EngineGate validator that every InferenceRuntime strategy invokes before deserialising a cached .engine file. Two refusal paths: (1) D-C10-7 filename-schema mismatch raises EngineSchemaMismatchError at parse time; (2) D-C10-3 manifest content-hash mismatch (or missing sidecar) raises EngineHashMismatchError / EngineSidecarMissingError. Pure validation: no GPU ops, and no I/O beyond reading the engine file, its sidecar, and the deployed manifest. Isolated so all three runtime strategies (TRT / ONNX-RT / PyTorch) call the same validator.
Complexity: 3 points
Dependencies: AZ-297_c7_runtime_protocol, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-266_log_module
Component: c7_inference (epic AZ-249 / E-C7)
Tracker: AZ-301
Epic: AZ-249 (E-C7)

Document Dependencies

  • _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md — defines EngineCacheEntry (input) and the gate's error types.
  • _docs/02_document/contracts/shared_helpers/sha256_sidecar.md — sidecar verification contract.
  • _docs/02_document/contracts/shared_helpers/engine_filename_schema.md — filename-schema parser.

Problem

D-C10-3 (manifest content-hash takeoff gate) and D-C10-7 (engine filename schema) are two of the project's safety-critical gates. Both fire at takeoff (F2) when an .engine file is about to be deserialised. Without a centralised validator:

  • Each runtime strategy (TRT / ONNX-RT / PyTorch) would re-implement the gate logic — three copies, three drift surfaces, three places to fix bugs.
  • C7-IT-03 (D-C10-3 takeoff gate refuses mismatched engine) and C7-IT-04 (filename-schema enforcement) cannot share test fixtures.
  • A future runtime (a hypothetical CUDA EP path) could silently skip the gate by forgetting to invoke it.

This task delivers ONE validator, used by every strategy, with a single point of refusal.

Outcome

  • An EngineGate class at src/gps_denied_onboard/components/c7_inference/engine_gate.py with a single public method: validate(entry: EngineCacheEntry, host_tuple: HostTuple, manifest: DeploymentManifest) -> None that returns silently on success and raises one of EngineSchemaMismatchError, EngineHashMismatchError, EngineSidecarMissingError on refusal.
  • HostTuple is a small frozen dataclass (sm: int, jp: str, trt: str, precision: PrecisionMode) derived from nvidia-smi / pynvml + the runtime's pinned TRT version + the engine's intended precision (read from the entry).
  • DeploymentManifest is a typed wrapper over the deployed manifest.json (an ordered map of engine_relative_path → sha256_hex) that C10 produces during F1 and the deployment process delivers alongside the engines. The manifest schema is owned by E-C10; this task DEPENDS on its existence but does not write the manifest.
  • The two refusal paths break down into five checks, evaluated in this order:
    1. Schema parse: helpers.engine_filename_schema.parse(entry.engine_path.name) returns the (sm, jp, trt, precision) quadruple; if the parse fails, raise EngineSchemaMismatchError(reason="parse failure: ...").
    2. Schema match: if the parsed quadruple does not match host_tuple, raise EngineSchemaMismatchError(expected=host_tuple, got=parsed).
    3. Sidecar presence: if no .sha256 sidecar exists alongside entry.engine_path, raise EngineSidecarMissingError.
    4. Sidecar trust: helpers.sha256_sidecar.verify(entry.engine_path) — if the sidecar's recorded hash does not match the engine's actual sha256, raise EngineHashMismatchError(stage="sidecar").
    5. Manifest match: if manifest[entry.engine_path.relative_to(manifest.root)] does not equal the verified sha256, raise EngineHashMismatchError(stage="manifest").
  • Every refusal includes structured fields in the exception message (engine path, expected vs. got tuple / hash, manifest entry if any) suitable for the takeoff-abort log path that C10 / runtime_root wires up downstream.
  • Diagnostic INFO log on success (kind="c7.gate.pass", engine path, host tuple, manifest hash); ERROR log on each refusal (kind="c7.gate.refuse", refusal reason, engine path).
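The five-step order above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the task's implementation: the filename regex is inferred from the single example name in AC-2, the sidecar is assumed to be `<engine name>.sha256` with the hex digest as its first token, the manifest is a plain mapping keyed by POSIX relative path, and `parse_engine_name`, `file_sha256`, and the flattened `validate` signature are hypothetical stand-ins for the shared helpers and EngineCacheEntry. Logging is omitted.

```python
import hashlib
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


class EngineSchemaMismatchError(Exception):
    """Filename does not parse, or parses to a different host tuple."""


class EngineSidecarMissingError(Exception):
    """No .sha256 sidecar next to the engine file."""


class EngineHashMismatchError(Exception):
    """Sidecar or manifest hash disagrees with the engine's sha256."""

    def __init__(self, stage: str, detail: str) -> None:
        super().__init__(f"stage={stage} {detail}")
        self.stage = stage


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: str  # the spec uses a PrecisionMode type; a str stands in here


# Assumed filename layout, inferred from the one example in AC-2.
_NAME_RE = re.compile(
    r"^.+__sm(?P<sm>\d+)_jp(?P<jp>[\d.]+)_trt(?P<trt>[\d.]+)_(?P<prec>[a-z0-9]+)\.engine$"
)


def parse_engine_name(name: str) -> HostTuple:
    match = _NAME_RE.match(name)
    if match is None:
        raise EngineSchemaMismatchError(f"parse failure: {name!r}")
    return HostTuple(int(match["sm"]), match["jp"], match["trt"], match["prec"])


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through hashlib.sha256 in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def validate(engine_path: Path, host: HostTuple,
             manifest: Mapping[str, str], manifest_root: Path) -> None:
    parsed = parse_engine_name(engine_path.name)            # 1. schema parse
    if parsed != host:                                      # 2. schema match
        raise EngineSchemaMismatchError(f"expected={host} got={parsed}")
    sidecar = engine_path.with_name(engine_path.name + ".sha256")
    if not sidecar.is_file():                               # 3. sidecar presence
        raise EngineSidecarMissingError(str(engine_path))
    tokens = sidecar.read_text().split()
    recorded = tokens[0].lower() if tokens else ""
    actual = file_sha256(engine_path)
    if recorded != actual:                                  # 4. sidecar trust
        raise EngineHashMismatchError("sidecar", f"expected={recorded} got={actual}")
    key = engine_path.relative_to(manifest_root).as_posix()
    # .get() yields None for a missing entry, which never equals a digest,
    # so an absent entry is refused rather than treated as a pass.
    if manifest.get(key) != actual:                         # 5. manifest match
        raise EngineHashMismatchError(
            "manifest", f"manifest_hash={manifest.get(key)} file_hash={actual}")
```

Because each step returns or raises before the next one starts, a fixture that fails an early check never reaches the later ones, which is the property AC-7 relies on.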

Scope

Included

  • EngineGate class with validate(entry, host_tuple, manifest) -> None.
  • HostTuple dataclass and a stateless read_host_tuple() -> HostTuple helper that queries the GPU through pynvml where possible and otherwise falls back to parsing nvidia-smi --query-gpu=compute_cap,driver_version,... output. The helper is in this task because the gate is the only consumer; future consumers can lift it out.
  • The five refusal paths above, in the documented order. The order is deterministic so test fixtures can target each step.
  • Parse-error vs. tuple-mismatch differentiation: the gate distinguishes "we could not parse the filename" from "we parsed it and it does not match" (via EngineSchemaMismatchError(reason=...) vs (expected=..., got=...)). Both are the same exception class; the kwargs differ.
  • Manifest reader: a thin typed wrapper at src/gps_denied_onboard/components/c7_inference/manifest.py that reads the deployed manifest.json and exposes __getitem__ and root. The actual manifest schema is owned by E-C10; this task implements only the reader sufficient for the gate's needs and references the canonical schema location.
  • INFO-on-pass and ERROR-on-refuse logs.
  • Constructor-injectable ManifestReader for tests (the production reader reads from disk; tests inject a dict-backed fake).
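A possible shape for HostTuple and the host-tuple helper is sketched below. It is illustrative only: the PrecisionMode members, the `parse_compute_cap` helper, and the `probes` parameter are hypothetical (the spec's read_host_tuple() takes no arguments), and the JetPack and TensorRT versions are passed in because this document does not say how they are discovered. In production the first probe would presumably wrap pynvml and the second nvidia-smi; here the probes are plain callables so the fallback order can be exercised without a GPU.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Sequence


class PrecisionMode(Enum):
    """Hypothetical members; the real enum is defined elsewhere in the project."""
    FP16 = "fp16"
    INT8 = "int8"


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: PrecisionMode


def parse_compute_cap(text: str) -> int:
    """Turn a compute capability such as '8.7' into the SM number 87."""
    major, minor = text.strip().split(".")
    return int(major) * 10 + int(minor)


def read_host_tuple(probes: Sequence[Callable[[], str]], jp: str, trt: str,
                    precision: PrecisionMode) -> HostTuple:
    """Try each compute-capability probe in order; never invent a default."""
    errors: List[str] = []
    for probe in probes:
        try:
            return HostTuple(parse_compute_cap(probe()), jp, trt, precision)
        except Exception as exc:  # a failed probe falls through to the next one
            errors.append(repr(exc))
    # No probe succeeded: refuse instead of returning a synthetic tuple.
    raise RuntimeError("cannot read host tuple: " + "; ".join(errors))
```

Keeping the probes injectable mirrors the constructor-injectable ManifestReader: the fallback and the hard failure can both be unit-tested on a workstation.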

Excluded

  • AZ-298 / AZ-299 / AZ-300 strategy implementations — they CALL the gate.
  • AZ-302 ThermalState publisher — unrelated.
  • The deployment manifest's schema — owned by E-C10 (this task writes the reader, not the writer).
  • F2 takeoff abort orchestration (the gate raises an error; the runtime caller propagates; the takeoff path catches and aborts) — owned by runtime_root and E-C10.
  • C12 operator tooling diagnostics for refused engines — out of scope.
  • A "tolerant mode" that allows minor SM differences — explicitly out of scope this cycle (would defeat the safety gate).

Acceptance Criteria

AC-1: filename-schema parse failure refused at parse time
Given an engine file named bogus_name.engine (no schema)
When validate(entry, ...) is called
Then EngineSchemaMismatchError(reason="parse failure: ...") is raised; subsequent gate steps are NOT executed (no sidecar read, no manifest lookup); no GPU memory is allocated by the caller (verifiable via NVML diff = 0 around the call)

AC-2: filename-schema tuple mismatch refused at parse time
Given an engine ultravpr__sm86_jp6.2_trt10.3_fp16.engine and a host with sm=87
When validate is called
Then EngineSchemaMismatchError(expected=HostTuple(sm=87, ...), got=ParsedTuple(sm=86, ...)) is raised; no sidecar / manifest checks execute

AC-3: missing sidecar refused before manifest lookup
Given a schema-matched engine whose .sha256 sidecar does NOT exist on disk
When validate is called
Then EngineSidecarMissingError(engine_path=...) is raised; the manifest is NOT read

AC-4: sidecar trust failure
Given a schema-matched engine whose sidecar exists but records a hash that does NOT match the engine's actual sha256
When validate is called
Then EngineHashMismatchError(stage="sidecar", engine_path=..., expected=..., got=...) is raised; the manifest is NOT consulted

AC-5: manifest mismatch
Given a schema-matched engine whose sidecar verifies (sidecar hash == file hash) but the deployment manifest's entry for this engine path records a DIFFERENT hash
When validate is called
Then EngineHashMismatchError(stage="manifest", engine_path=..., manifest_hash=..., file_hash=...) is raised

AC-6: full-success path returns silently and logs INFO
Given an engine that passes all five steps
When validate is called
Then the call returns None silently; one kind="c7.gate.pass" INFO log record is emitted; the caller proceeds to deserialise

AC-7: refusal order is deterministic for fixture targeting
Given a fixture engine that is BOTH schema-mismatched AND has a missing sidecar
When validate is called
Then EngineSchemaMismatchError is raised (NOT EngineSidecarMissingError) because the schema check runs first; this property is documented and tested

AC-8: read_host_tuple matches the running Jetson
Given a Jetson Orin Nano Super running JetPack 6.2 + TRT 10.3
When read_host_tuple() is called
Then the returned tuple has sm=87, jp="6.2", trt="10.3"; on a workstation (Tier-1 Docker) the values reflect that environment instead
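Several criteria above assert that a later step did NOT run (for example, AC-3's "the manifest is NOT read"). One way to make that observable in a unit test is a dict-backed fake that records every lookup. The sketch below is a hypothetical illustration: `SpyManifest` and `check_sidecar_then_manifest` are invented names, the latter being a two-step stand-in for the real gate, and the `<engine name>.sha256` sidecar naming is an assumption.

```python
from pathlib import Path
from typing import Dict, List


class EngineSidecarMissingError(Exception):
    """Stand-in for the gate's real error type."""


class SpyManifest:
    """Dict-backed fake manifest that records each key it is asked for."""

    def __init__(self, entries: Dict[str, str]) -> None:
        self._entries = dict(entries)
        self.lookups: List[str] = []

    def __getitem__(self, key: str) -> str:
        self.lookups.append(key)
        return self._entries[key]


def check_sidecar_then_manifest(engine_path: Path, manifest: SpyManifest) -> str:
    """Two-step stand-in for the gate: sidecar presence, then manifest lookup."""
    sidecar = engine_path.with_name(engine_path.name + ".sha256")
    if not sidecar.is_file():
        raise EngineSidecarMissingError(str(engine_path))
    return manifest[engine_path.name]


def assert_missing_sidecar_skips_manifest(tmp_dir: Path) -> None:
    """AC-3 style check: the refusal fires and the manifest stays untouched."""
    engine = tmp_dir / "ultravpr__sm87_jp6.2_trt10.3_fp16.engine"
    engine.write_bytes(b"engine-bytes")
    spy = SpyManifest({engine.name: "00"})
    try:
        check_sidecar_then_manifest(engine, spy)
    except EngineSidecarMissingError:
        pass
    else:
        raise AssertionError("expected a sidecar-missing refusal")
    assert spy.lookups == [], "manifest must not be consulted"
```

The same spy can back the AC-1, AC-2 and AC-4 tests, since each of those also requires that the manifest is never consulted.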

Non-Functional Requirements

Performance

  • validate p99 ≤ 50 ms (sidecar read + manifest dict lookup + sha256 streaming over the engine file). Sha256 over a 500 MB engine file dominates; that is the design budget.
  • read_host_tuple p99 ≤ 100 ms (one pynvml call).
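The p99 budgets above could be checked with a small timing harness along these lines. This is a sketch, not the project's benchmark: `measure_ms` and `nearest_rank_percentile` are hypothetical helper names, and the nearest-rank definition of a percentile is one reasonable choice among several.

```python
import time
from typing import Callable, List


def nearest_rank_percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # ceil(pct * n / 100) in integer math
    return ordered[max(rank, 1) - 1]


def measure_ms(fn: Callable[[], None], runs: int = 100) -> List[float]:
    """Call fn `runs` times and return each wall-clock duration in milliseconds."""
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings
```

A test would then assert something like `nearest_rank_percentile(measure_ms(call_validate), 99) <= 50.0`, where `call_validate` is a zero-argument closure over the gate and a fixture engine.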

Reliability

  • The gate is deterministic — same inputs always produce the same outcome. Test fixtures rely on this property.
  • The gate makes NO network calls and NO writes; it is read-only.
  • Errors carry structured fields suitable for the post-flight FDR record (the runtime caller forwards them).

Unit Tests

AC Ref | What to Test | Required Outcome
AC-1 | Bogus filename | EngineSchemaMismatchError(reason=...); no further gate steps
AC-2 | Mismatched SM in filename | EngineSchemaMismatchError(expected=..., got=...)
AC-3 | Missing sidecar | EngineSidecarMissingError; manifest not read
AC-4 | Sidecar hash != file hash | EngineHashMismatchError(stage="sidecar")
AC-5 | Manifest hash != sidecar hash | EngineHashMismatchError(stage="manifest")
AC-6 | Happy path | Returns None; INFO log emitted
AC-7 | Both schema-fail AND sidecar-missing | Schema error wins (deterministic order)
AC-8 | Host tuple read on Jetson Orin Nano Super | (sm=87, jp="6.2", trt="10.3")
NFR-perf-validate | Microbench validate × 100 with a 500 MB engine | p99 ≤ 50 ms
NFR-reliability-no-write | Run validate against a read-only directory | No writes attempted (sidecar stays untouched)

Constraints

  • The five refusal paths execute in the documented order; the order is part of the public contract (AC-7 verifies it).
  • The gate is read-only. NEVER writes to the engine file, sidecar, or manifest.
  • The ManifestReader is constructor-injectable; the production reader reads manifest.json from disk; tests inject a dict-backed fake.
  • The read_host_tuple helper uses pynvml first; falls back to parsing nvidia-smi output if pynvml is unavailable. NEVER returns a synthetic / default tuple — if the GPU cannot be queried, raises RuntimeError("cannot read host tuple") and the takeoff path aborts.
  • Sha256 is computed using stdlib hashlib.sha256 with chunked reads via helpers.sha256_sidecar; this task does NOT introduce a new sha256 library.
  • This task introduces no new third-party dependencies beyond pynvml (which is already a project dependency for jetson-stats / NVML telemetry per the C7 description.md).

Risks & Mitigation

Risk 1: Sha256 over a 500 MB engine dominates takeoff latency

  • Risk: Per-engine 50 ms × 10 engines = 500 ms blocking takeoff.
  • Mitigation: The sidecar's recorded hash is the trust anchor; once the sidecar verifies, the manifest match is a dict lookup. The actual file-streaming sha256 happens during sidecar verification (one streaming pass per engine). Per-engine 50 ms is the budget, and the test asserts it. If a future regression pushes past this, the gate could be fast-pathed by reusing a cached file hash computed at compile time (out of scope this cycle).

Risk 2: Manifest reader silently treats missing entry as pass

  • Risk: A typo in the manifest produces KeyError swallowed somewhere; the gate "passes" without checking.
  • Mitigation: The manifest reader's __getitem__ raises EngineHashMismatchError(stage="manifest", reason="missing manifest entry for ...") on missing key — NEVER returns None or treats absence as pass. AC-5 covers the mismatch case; an additional negative test covers the missing case.
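The mitigation above amounts to a reader whose lookup can only return a hash or raise. A minimal sketch follows, assuming manifest.json is a flat JSON object mapping relative path to hex digest (the canonical schema is owned by E-C10 and may differ); the `from_file` constructor and the error class's `reason` attribute are hypothetical.

```python
import json
from pathlib import Path
from typing import Mapping, Union


class EngineHashMismatchError(Exception):
    """Stand-in for the gate's real error type."""

    def __init__(self, stage: str, reason: str) -> None:
        super().__init__(f"stage={stage} reason={reason}")
        self.stage = stage
        self.reason = reason


class DeploymentManifest:
    """Read-only view of engine_relative_path -> sha256_hex."""

    def __init__(self, root: Path, entries: Mapping[str, str]) -> None:
        self.root = root
        self._entries = dict(entries)

    @classmethod
    def from_file(cls, manifest_path: Path) -> "DeploymentManifest":
        # Assumption: the manifest sits at the root the relative paths refer to.
        with manifest_path.open("r", encoding="utf-8") as handle:
            return cls(manifest_path.parent, json.load(handle))

    def __getitem__(self, relative_path: Union[str, Path]) -> str:
        key = Path(relative_path).as_posix()
        try:
            return self._entries[key]
        except KeyError:
            # Absence is a refusal, never a pass and never a None return.
            raise EngineHashMismatchError(
                "manifest", f"missing manifest entry for {key}") from None
```

Tests can build the reader directly from a dict, while production wiring would go through the on-disk file; both paths share the same strict lookup.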

Risk 3: Refusal order changes silently across refactors

  • Risk: A future refactor reorders the five steps; AC-7's "schema wins over sidecar-missing" property regresses.
  • Mitigation: AC-7 is a deterministic ordering test; any reorder fails it. The refusal order is part of the public contract documented in this task spec.

Runtime Completeness

  • Named capability: D-C10-3 takeoff content-hash gate + D-C10-7 filename-schema enforcement (architecture / E-C7 / E-C10 / risk_mitigations.md R04).
  • Production code that must exist: real EngineGate.validate calling real helpers.engine_filename_schema.parse, real helpers.sha256_sidecar.verify, real ManifestReader reading the deployed manifest.json from disk.
  • Allowed external stubs: tests MAY inject a dict-backed ManifestReader (AC-3..AC-7); production wiring reads the on-disk manifest.
  • Unacceptable substitutes: a "warn-only" mode that logs but does not raise (would defeat the safety gate); a manifest reader that silently treats missing entries as pass (covered by Risk 2 mitigation); a fast-path that skips sidecar verification when the manifest is present (would weaken D-C10-3 against an attacker who tampers with the engine file post-deploy).