[AZ-301] Implement EngineGate — D-C10-3 + D-C10-7 takeoff validator

Adds the AZ-301 takeoff-side validator that every InferenceRuntime
strategy calls before deserialize_engine. It is a five-step
deterministic refusal pipeline, in order:

  1. filename schema parse  -> EngineSchemaMismatchError(reason=...)
  2. schema tuple match     -> EngineSchemaMismatchError(expected,got)
  3. sidecar present        -> EngineSidecarMissingError
  4. sidecar trust          -> EngineHashMismatchError(stage=sidecar)
  5. manifest match         -> EngineHashMismatchError(stage=manifest)

Refusal order is part of the public contract: AC-7 verifies that a
fixture which is BOTH schema-mismatched AND missing-sidecar is
refused at step 1.

Production code (new):
 - components/c7_inference/engine_gate.py  -- EngineGate, HostTuple,
   read_host_tuple (Jetson: pynvml + /etc/nv_tegra_release +
   tensorrt.__version__; raises RuntimeError on Tier-1)
 - components/c7_inference/manifest.py     -- DeploymentManifest,
   ManifestReader, ManifestReaderProtocol. Risk-2 enforced at the
   type level: __getitem__ raises EngineHashMismatchError on
   missing key, NEVER KeyError, so the gate cannot silently pass
 - components/c7_inference/__init__.py     -- re-exports the new
   public surface

Tests (new): tests/unit/c7_inference/test_engine_gate.py covers
AC-1..AC-7 + NFR-reliability-no-write + manifest reader + refusal
log emission. 14 tests unconditional + AC-8 Tier-2 skip (needs
real NVML + L4T release file + tensorrt binding).

Three task-spec -> as-built deltas documented in
_docs/02_tasks/done/AZ-301_c7_engine_gate.md Implementation Notes:
 1. HostTuple lives in engine_gate.py (the only consumer);
    re-exported from package __init__.py.
 2. read_host_tuple takes precision as a keyword argument — three
    of four fields come from the host, precision is engine-build
    metadata supplied by the caller.
 3. AC-8 is Tier-2-only; AC-1..AC-7 + NFR-reliability + extras
    run on every CI host.

Risk-2 (manifest reader silently treats missing entry as pass):
DeploymentManifest.__getitem__ raises EngineHashMismatchError with
"missing manifest entry for {path}" — covered by
test_manifest_missing_entry_raises_hash_mismatch.

NFR-perf-validate (p99 <= 50 ms): tier-2 only — a real 500 MB
engine streaming sha256 cannot be benchmarked on Tier-1 fixtures.

AZ-302 (ThermalStatePublisher) + AZ-304 (C6 Postgres schema)
deferred to batches 26 / 27 to keep the 1-task batch cadence and
isolate their respective env / testcontainer surface areas.

Suite: 1134 passed / 11 skipped. No regressions outside the new
files.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-12 10:20:21 +03:00
parent 65ad2168ed
commit 59f56c032f
7 changed files with 941 additions and 1 deletions
@@ -159,3 +159,24 @@ Then the returned tuple has `sm=87, jp="6.2", trt="10.3"`; on a workstation (Tie
- **Production code that must exist**: real `EngineGate.validate` calling real `helpers.engine_filename_schema.parse`, real `helpers.sha256_sidecar.verify`, real `ManifestReader` reading the deployed manifest.json from disk.
- **Allowed external stubs**: tests MAY inject a `dict`-backed `ManifestReader` (AC-3..AC-7); production wiring reads the on-disk manifest.
- **Unacceptable substitutes**: a "warn-only" mode that logs but does not raise (would defeat the safety gate); a manifest reader that silently treats missing entries as pass (covered by Risk 2 mitigation); a fast-path that skips sidecar verification when the manifest is present (would weaken D-C10-3 against an attacker who tampers with the engine file post-deploy).
## Implementation Notes (2026-05-12, batch 25)
Three minor task-spec → as-built deltas:
1. **`HostTuple` lives in `engine_gate.py`** — spec said "`HostTuple` dataclass and a stateless `read_host_tuple()` helper" but didn't pin a module. Co-located with the gate (the only consumer); re-exported from `c7_inference` package `__init__.py`. Future consumers can lift it out if needed.
2. **`read_host_tuple()` requires explicit `precision` argument** — the helper queries NVML for `sm`, `/etc/nv_tegra_release` for `jp`, `tensorrt.__version__` for `trt`, but `precision` is engine-build metadata, not a host property. Caller passes it. Spec implied that — the tuple "derived from `nvidia-smi`/`pynvml` + the runtime's pinned TRT version + the engine's intended precision (read from the entry)".
3. **AC-8 is Tier-2-only** — marked `@pytest.mark.tier2 + @pytest.mark.skipif(GPS_DENIED_TIER!=2)`. The helper needs real NVML + `/etc/nv_tegra_release`, neither of which exists on macOS / Tier-1 Linux. AC-1..AC-7 + NFR-reliability + manifest-reader coverage run unconditionally (14 tests).
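
Delta 2 can be illustrated with a minimal sketch. The injected probe callables (`read_sm`, `read_jp`, `read_trt`), the field types, and the `"fp16"` value are assumptions for illustration only; the as-built helper reads NVML, `/etc/nv_tegra_release`, and `tensorrt.__version__` itself. The point is only that `precision` is keyword-only and supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HostTuple:
    sm: int
    jp: str
    trt: str
    precision: str


def read_host_tuple(
    *,
    precision: str,
    read_sm: Callable[[], int],
    read_jp: Callable[[], str],
    read_trt: Callable[[], str],
) -> HostTuple:
    """Hypothetical signature: host probes are injected so the sketch runs anywhere."""
    # Three fields come from the host; precision is passed through unchanged.
    return HostTuple(sm=read_sm(), jp=read_jp(), trt=read_trt(), precision=precision)


# Fake probes standing in for a Jetson host (values taken from the AC-8 example).
host = read_host_tuple(
    precision="fp16",  # assumed value; engine-build metadata chosen by the caller
    read_sm=lambda: 87,
    read_jp=lambda: "6.2",
    read_trt=lambda: "10.3",
)
```

Because of the bare `*`, calling `read_host_tuple("fp16", ...)` positionally raises `TypeError`, which keeps the engine-build input visually distinct from the host-derived ones at every call site.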
### As-built file map
- `src/gps_denied_onboard/components/c7_inference/engine_gate.py` -- `EngineGate.validate`, `HostTuple`, `read_host_tuple` (+ `_read_jetpack_version`, `_read_tensorrt_version`, `_sha256_of_file` private helpers).
- `src/gps_denied_onboard/components/c7_inference/manifest.py` -- `DeploymentManifest`, `ManifestReader`, `ManifestReaderProtocol`. Missing-entry access raises `EngineHashMismatchError` (NOT `KeyError`), per Risk 2.
- `src/gps_denied_onboard/components/c7_inference/__init__.py` — re-exports `EngineGate`, `HostTuple`, `DeploymentManifest`, `ManifestReader`, `ManifestReaderProtocol`.
- `tests/unit/c7_inference/test_engine_gate.py` — 15 tests (14 unconditional + AC-8 tier-2 skip).
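
The `_sha256_of_file` helper listed above is not shown in this diff. A streaming hash of the kind it presumably performs could look like the following sketch; the public name `sha256_of_file` and the 1 MiB chunk size are assumptions, not the as-built values.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so a large engine is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # read() returns b"" at EOF, which ends the loop.
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory flat regardless of engine size, which matters for the 500 MB engines the NFR-perf-validate note refers to.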
### Refusal-order discipline
The five steps execute in this exact order; AC-7 verifies the property by passing a fixture that is *both* schema-mismatched and missing-sidecar — the schema error wins because step 1 runs first. Future refactors that reorder the steps regress AC-7.
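
The ordering property can be sketched independently of the real checks. Everything named here (`GateRefusal`, `run_gate`, the fake checks) is hypothetical and only illustrates first-failure-wins sequencing; the as-built gate raises the C7 error classes listed in the commit message.

```python
from typing import Callable, List, Optional, Sequence


class GateRefusal(Exception):
    """Stand-in for the C7 error family (hypothetical name)."""

    def __init__(self, step: int, reason: str) -> None:
        super().__init__(f"step {step}: {reason}")
        self.step = step
        self.reason = reason


# A check returns None on pass, or a refusal reason on failure.
Check = Callable[[], Optional[str]]


def run_gate(checks: Sequence[Check]) -> None:
    """Run checks in order and refuse at the first failure; later checks never run."""
    for step, check in enumerate(checks, start=1):
        reason = check()
        if reason is not None:
            raise GateRefusal(step, reason)


# AC-7-style fixture: both the first and a later check would fail.
calls: List[str] = []


def failing_parse() -> Optional[str]:
    calls.append("parse")
    return "filename does not match schema"


def failing_sidecar() -> Optional[str]:
    calls.append("sidecar")
    return "sidecar missing"


try:
    run_gate([failing_parse, failing_sidecar])
    refused_at = None
except GateRefusal as exc:
    refused_at = exc.step
```

In this sketch `refused_at` is 1 and the sidecar check is never invoked, which is the behaviour AC-7 pins.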
@@ -0,0 +1,143 @@
# Batch 25 / Cycle 1 — Implementation Report
**Date**: 2026-05-12
**Tasks**: AZ-301 (C7 EngineGate — D-C10-3 + D-C10-7 takeoff validator)
**Story points landed**: 3
**Status**: complete (AZ-301 → In Testing)
## Scope summary
Single-task batch, continuing the post-AZ-300 1-task cadence. The user
selected `AZ-301 + AZ-302 + (optionally) AZ-304` for batch 25, but AZ-302
(ThermalStatePublisher, 3pt, requires `jtop` / `pynvml` integration +
background thread) and AZ-304 (C6 Postgres schema, 2pt, requires
testcontainers + Alembic) each carry meaningful surface area. Bundling
them with AZ-301 would break the project's 1-task cadence and bloat the
commit beyond practical review, so each ships as its own batch
(26 / 27).
## Files added / modified
### New
- `src/gps_denied_onboard/components/c7_inference/engine_gate.py` --
`EngineGate.validate` five-step deterministic pipeline +
`HostTuple` frozen dataclass + `read_host_tuple()` Jetson helper
(lazy NVML / L4T / TRT-version reads).
- `src/gps_denied_onboard/components/c7_inference/manifest.py` --
`DeploymentManifest` (Risk-2-compliant `__getitem__` that raises
`EngineHashMismatchError` on missing key) + `ManifestReader`
on-disk JSON loader + `ManifestReaderProtocol` for test fakes.
- `tests/unit/c7_inference/test_engine_gate.py` — 15 tests covering
AC-1..AC-7 + NFR-reliability-no-write + manifest-reader coverage +
refusal-log emission. AC-8 (real Jetson NVML) is Tier-2-only and
skips via `GPS_DENIED_TIER` env gate.
### Modified
- `src/gps_denied_onboard/components/c7_inference/__init__.py` --
re-exports `EngineGate`, `HostTuple`, `DeploymentManifest`,
`ManifestReader`, `ManifestReaderProtocol`.
- `_docs/02_tasks/todo/AZ-301_c7_engine_gate.md` → moved to
`_docs/02_tasks/done/`; added `## Implementation Notes (2026-05-12,
batch 25)` documenting the three task-spec → as-built deltas
(HostTuple module location, explicit precision arg on
`read_host_tuple`, AC-8 Tier-2 skip).
## Design decisions
1. **HostTuple module location**: co-located with `engine_gate.py`
(the only consumer); re-exported from package `__init__.py`. Spec
left it unpinned. Risk: minor lift if a future component needs it
directly — acceptable since the public surface lives in the
package.
2. **`read_host_tuple(*, precision)` keyword argument**: the helper
reads three of the four tuple fields from the host (NVML →
`sm`; `/etc/nv_tegra_release` → `jp`; `tensorrt.__version__` → `trt`).
`precision` is engine-build metadata, not a host property —
passed by the caller. Matches the spec's "derived from
nvidia-smi/pynvml + the runtime's pinned TRT version + the
engine's intended precision" clause.
3. **Risk-2 enforcement in `DeploymentManifest.__getitem__`**:
missing-key access raises `EngineHashMismatchError` directly (not
`KeyError`). Eliminates the silent-pass class of bug at the type
level — any consumer using the manifest must handle the C7-family
error, never `KeyError`.
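
Decision 3 can be sketched as follows. The `entries` field name and the locally defined exception are assumptions so the example is self-contained; the error message format is the one quoted in the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict


class EngineHashMismatchError(Exception):
    """Local stand-in; the real class belongs to the C7 error family."""


@dataclass(frozen=True)
class DeploymentManifest:
    # Assumed shape: manifest-relative engine path -> expected sha256 hex digest.
    entries: Dict[str, str] = field(default_factory=dict)

    def __getitem__(self, path: str) -> str:
        try:
            return self.entries[path]
        except KeyError:
            # Translate the lookup failure so callers can never mistake
            # "no entry" for anything other than a gate refusal.
            raise EngineHashMismatchError(
                f"missing manifest entry for {path}"
            ) from None
```

A consumer that catches only `KeyError` therefore cannot swallow a missing entry, because `EngineHashMismatchError` here does not derive from `KeyError`.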
## AC coverage
| AC | Status | Notes |
|----|--------|-------|
| AC-1 schema parse failure | covered | `test_ac1_parse_failure_refused_at_parse_time` |
| AC-2 schema tuple mismatch | covered | `test_ac2_schema_tuple_mismatch` (sm 86 vs host 87) |
| AC-3 missing sidecar | covered | `test_ac3_missing_sidecar_refused_before_manifest` |
| AC-4 sidecar trust | covered | `test_ac4_sidecar_hash_mismatches_file` |
| AC-5 manifest mismatch | covered | `test_ac5_manifest_hash_mismatches_sidecar` |
| AC-6 happy path + INFO log | covered | `test_ac6_full_success_returns_silently_and_logs_pass` |
| AC-7 schema wins over sidecar | covered | `test_ac7_schema_error_wins_over_sidecar_missing` |
| AC-8 read_host_tuple on Jetson | tier2 | `test_ac8_read_host_tuple_on_jetson` -- `@pytest.mark.tier2` + `GPS_DENIED_TIER!=2` skip |
| NFR-perf-validate (≤ 50 ms) | tier2 | Real engine-size benchmarks belong on Jetson |
| NFR-reliability-no-write | covered | `test_nfr_reliability_no_writes` snapshots mtime + bytes + sidecar text pre/post-validate |
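
The no-write check in the last row can be approximated with a snapshot comparison. This is a sketch under assumptions: `snapshot` and `assert_untouched` are hypothetical helper names, and the real test may compare different attributes.

```python
from pathlib import Path
from typing import Callable, Iterable, Tuple


def snapshot(path: Path) -> Tuple[int, bytes]:
    """Capture modification time (ns) and full contents of a file."""
    return path.stat().st_mtime_ns, path.read_bytes()


def assert_untouched(paths: Iterable[Path], action: Callable[[], object]) -> None:
    """Run `action` and fail if any listed file changed in mtime or content."""
    tracked = list(paths)
    before = {p: snapshot(p) for p in tracked}
    action()
    after = {p: snapshot(p) for p in tracked}
    assert before == after, "validator modified a file it should only read"
```

In a test, `action` would be a call to the validator under test, with `paths` covering the engine file and its sidecar.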
Additional coverage beyond ACs:
- `test_manifest_reader_round_trip` — JSON ⇄ DTO round-trip.
- `test_manifest_missing_entry_raises_hash_mismatch` — Risk-2 (the
critical "no silent pass" property).
- `test_manifest_reader_rejects_malformed_json` /
`test_manifest_reader_rejects_missing_entries_key` — bad-input
refusal at parse time.
- `test_engine_outside_manifest_root_refused` -- `engine_path` not
under `manifest.root` raises `EngineHashMismatchError`.
- `test_refusal_emits_error_log` -- `c7.gate.refuse` ERROR log
emitted with `step` + `reason` fields.
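
The outside-root refusal can be sketched with `pathlib`. The function name `manifest_key`, the use of `resolve()`, and the message text are assumptions; only the error type comes from the report.

```python
from pathlib import Path


class EngineHashMismatchError(Exception):
    """Local stand-in; the real class belongs to the C7 error family."""


def manifest_key(engine_path: Path, root: Path) -> str:
    """Return the engine path relative to the manifest root, or refuse."""
    try:
        # relative_to raises ValueError when engine_path is not under root.
        return engine_path.resolve().relative_to(root.resolve()).as_posix()
    except ValueError:
        raise EngineHashMismatchError(
            f"{engine_path} is not under manifest root {root}"
        ) from None
```

Resolving both paths first means a path that escapes the root through `..` segments or a symlink is also refused, although whether the as-built gate resolves symlinks is not stated in the report.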
## Test run
```
.venv/bin/pytest tests/unit/c7_inference/ → 77 passed, 7 skipped
.venv/bin/pytest → 1134 passed, 11 skipped
```
Skips are environment-gated (CUDA for AZ-300, Tier-2 GPS_DENIED_TIER
for AZ-301 AC-8, cmake + actionlint absent on dev). No pre-existing
tests regressed.
## Self-review verdict
**Pass.** Pure validator, no GPU ops, no writes. Five refusal paths
in the documented order; AC-7 verifies the discipline. Risk-2 is enforced
at the type level via `DeploymentManifest.__getitem__`.
## Known gaps for the Product Implementation Completeness Gate
- **AC-8 / NFR-perf-validate not validated on dev**: needs Tier-2
Jetson. The CI matrix's `runs-on: ubuntu-22.04` cannot exercise
these — same gap pattern as AZ-300's CUDA-gated AC-3/4/5/8 and
AZ-332's tier-2 marker.
- **`read_host_tuple` failure modes are minimally tested**: NVML init
failure / unrecognised L4T release / missing tensorrt binding all
raise `RuntimeError`, but the unconditional test suite cannot
exercise the success path. Future Tier-2 integration test should
pin behaviour.
- **Manifest schema is owned by E-C10**: this task ships only the
reader. The writer (`CacheProvisioner`) is a separate task; until
it lands, integration testing of the gate uses dict-backed
manifest fixtures.
## Next batch
**Batch 26 candidates**:
- AZ-302 (ThermalStatePublisher, 3pt) — background thread + jtop /
pynvml + FDR transition records. Requires test-side fake sources
(jtop + pynvml may need to be added to pyproject extras for
Tier-1 CI; even with fakes, the import-attempt logic in the
publisher will fail without the modules — handle via lazy import).
- AZ-304 (C6 Postgres schema, 2pt) — Alembic migration +
testcontainers Postgres 16 + schema-shape fixture diff test.
- 17 tasks total ready in the queue (AZ-300 + AZ-301 removed).
Recommended batch 26 size: 1 task (continue post-AZ-300 cadence).
AZ-302's surface (8 ACs + threading + lazy-import gates) suggests it
ship alone.
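
The lazy-import approach suggested for AZ-302 could take roughly this shape. It is a sketch only: `try_import` is a hypothetical helper, and how the publisher would select a fake source when a module is absent is not specified in this report.

```python
import importlib
from types import ModuleType
from typing import Optional


def try_import(name: str) -> Optional[ModuleType]:
    """Import a module on first use; return None instead of failing if absent."""
    try:
        return importlib.import_module(name)
    except ImportError:  # also covers ModuleNotFoundError
        return None


# Resolved when called, not at module import time, so a Tier-1 host without
# the Jetson-only packages can still import the calling module.
jtop_module = try_import("jtop")
```

Callers would check for `None` and fall back to an injected fake source, which keeps the import attempt from failing on Tier-1 CI hosts.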
+1 -1
@@ -8,7 +8,7 @@ status: in_progress
sub_step:
phase: 13
name: archive-and-loop
detail: "batch 24/cycle1 complete: AZ-300 → In Testing, archived to done/. Installed [inference] extras (torch 2.11.0 + torchvision 0.26.0 + onnxruntime 1.23.2) into the dev venv. 17 ACs + NFRs covered (6 CUDA-skipped on macOS). Suite: 1120 passed / 10 skipped. Next: recompute batch 25 — candidates AZ-301 (EngineGate, 3pt) + AZ-302 (ThermalStatePublisher, 3pt) + AZ-304 (C6 Postgres schema, 2pt). 17 tasks total ready overall (AZ-300 removed; AZ-345 still gated)."
detail: "batch 25/cycle1 complete: AZ-301 → In Testing, archived to done/. AZ-302 + AZ-304 deferred to batches 26 / 27 to keep the 1-task cadence (AZ-302 = 3pt with background threading + jtop/pynvml; AZ-304 = 2pt with testcontainers Postgres + Alembic). 14 unconditional AC tests + 1 Tier-2 AC-8 skip. Suite: 1134 passed / 11 skipped. 17 tasks total ready overall (AZ-300 + AZ-301 removed)."
retry_count: 0
cycle: 1
tracker: jira