mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 15:31:13 +00:00
[AZ-301] Implement EngineGate — D-C10-3 + D-C10-7 takeoff validator

AZ-301 adds the takeoff-side validator that every InferenceRuntime
strategy calls before deserialize_engine. Five-step deterministic
refusal pipeline, in order:

  1. filename schema parse -> EngineSchemaMismatchError(reason=...)
  2. schema tuple match    -> EngineSchemaMismatchError(expected,got)
  3. sidecar present       -> EngineSidecarMissingError
  4. sidecar trust         -> EngineHashMismatchError(stage=sidecar)
  5. manifest match        -> EngineHashMismatchError(stage=manifest)

Refusal order is part of the public contract (AC-7 verifies that a
fixture that is BOTH schema-mismatched AND missing-sidecar refuses
at step 1).

Production code (new):
  - components/c7_inference/engine_gate.py -- EngineGate, HostTuple,
    read_host_tuple (Jetson: pynvml + /etc/nv_tegra_release +
    tensorrt.__version__; raises RuntimeError on Tier-1)
  - components/c7_inference/manifest.py -- DeploymentManifest,
    ManifestReader, ManifestReaderProtocol. Risk-2 enforced at the
    type level: __getitem__ raises EngineHashMismatchError on
    missing key, NEVER KeyError, so the gate cannot silently pass
  - components/c7_inference/__init__.py -- re-exports the new
    public surface

Tests (new): tests/unit/c7_inference/test_engine_gate.py covers
AC-1..AC-7 + NFR-reliability-no-write + manifest reader + refusal
log emission. 14 tests unconditional + AC-8 Tier-2 skip (needs
real NVML + L4T release file + tensorrt binding).

Three task-spec -> as-built deltas documented in
_docs/02_tasks/done/AZ-301_c7_engine_gate.md Implementation Notes:
  1. HostTuple lives in engine_gate.py (the only consumer);
     re-exported from package __init__.py.
  2. read_host_tuple takes precision as a keyword argument — three
     of four fields come from the host, precision is engine-build
     metadata supplied by the caller.
  3. AC-8 is Tier-2-only; AC-1..AC-7 + NFR-reliability + extras
     run on every CI host.

Risk-2 (manifest reader silently treats missing entry as pass):
DeploymentManifest.__getitem__ raises EngineHashMismatchError with
"missing manifest entry for {path}" — covered by
test_manifest_missing_entry_raises_hash_mismatch.

NFR-perf-validate (p99 <= 50 ms): Tier-2 only — a real 500 MB
engine streaming sha256 cannot be benchmarked on Tier-1 fixtures.

AZ-302 (ThermalStatePublisher) + AZ-304 (C6 Postgres schema)
deferred to batches 26 / 27 to keep the 1-task batch cadence and
isolate their respective env / testcontainer surface areas.

Suite: 1134 passed / 11 skipped. No regressions outside the new
files.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -159,3 +159,24 @@ Then the returned tuple has `sm=87, jp="6.2", trt="10.3"`; on a workstation (Tie
- **Production code that must exist**: real `EngineGate.validate` calling real `helpers.engine_filename_schema.parse`, real `helpers.sha256_sidecar.verify`, real `ManifestReader` reading the deployed manifest.json from disk.
- **Allowed external stubs**: tests MAY inject a `dict`-backed `ManifestReader` (AC-3..AC-7); production wiring reads the on-disk manifest.
- **Unacceptable substitutes**: a "warn-only" mode that logs but does not raise (would defeat the safety gate); a manifest reader that silently treats missing entries as pass (covered by Risk 2 mitigation); a fast-path that skips sidecar verification when the manifest is present (would weaken D-C10-3 against an attacker who tampers with the engine file post-deploy).

## Implementation Notes (2026-05-12, batch 25)

Three minor task-spec → as-built deltas:

1. **`HostTuple` lives in `engine_gate.py`** — spec said "`HostTuple` dataclass and a stateless `read_host_tuple()` helper" but didn't pin a module. Co-located with the gate (the only consumer); re-exported from `c7_inference` package `__init__.py`. Future consumers can lift it out if needed.

2. **`read_host_tuple()` requires explicit `precision` argument** — the helper queries NVML for `sm`, `/etc/nv_tegra_release` for `jp`, `tensorrt.__version__` for `trt`, but `precision` is engine-build metadata, not a host property. Caller passes it. Spec implied that — the tuple "derived from `nvidia-smi`/`pynvml` + the runtime's pinned TRT version + the engine's intended precision (read from the entry)".

3. **AC-8 is Tier-2-only** — marked `@pytest.mark.tier2 + @pytest.mark.skipif(GPS_DENIED_TIER!=2)`. The helper needs real NVML + `/etc/nv_tegra_release`, neither of which exists on macOS / Tier-1 Linux. AC-1..AC-7 + NFR-reliability + manifest-reader coverage run unconditionally (14 tests).

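Delta 2 can be illustrated with a minimal sketch. It is not the as-built code: the `probe` parameter and `_no_jetson_probe` are hypothetical stand-ins for the NVML, L4T, and TensorRT reads, and `"fp16"` in the usage note is an assumed precision label. Only the field names and the keyword-only `precision` argument come from the notes above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class HostTuple:
    sm: int         # GPU compute capability, e.g. 87
    jp: str         # JetPack version, e.g. "6.2"
    trt: str        # TensorRT version, e.g. "10.3"
    precision: str  # engine-build metadata, supplied by the caller


def _no_jetson_probe() -> Tuple[int, str, str]:
    # Hypothetical stand-in for the real host reads, which are not shown here.
    raise RuntimeError("host tuple is only readable on a Jetson (Tier-2) host")


def read_host_tuple(
    *,
    precision: str,
    probe: Callable[[], Tuple[int, str, str]] = _no_jetson_probe,
) -> HostTuple:
    """Combine three host-derived fields with caller-supplied precision."""
    sm, jp, trt = probe()
    return HostTuple(sm=sm, jp=jp, trt=trt, precision=precision)
```

With a fake probe, `read_host_tuple(precision="fp16", probe=lambda: (87, "6.2", "10.3"))` returns a frozen `HostTuple`; passing `precision` positionally raises `TypeError`, which is the point of the keyword-only signature.
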
### As-built file map

- `src/gps_denied_onboard/components/c7_inference/engine_gate.py` — `EngineGate.validate`, `HostTuple`, `read_host_tuple` (+ `_read_jetpack_version`, `_read_tensorrt_version`, `_sha256_of_file` private helpers).
- `src/gps_denied_onboard/components/c7_inference/manifest.py` — `DeploymentManifest`, `ManifestReader`, `ManifestReaderProtocol`. Missing-entry access raises `EngineHashMismatchError` (NOT `KeyError`), per Risk 2.
- `src/gps_denied_onboard/components/c7_inference/__init__.py` — re-exports `EngineGate`, `HostTuple`, `DeploymentManifest`, `ManifestReader`, `ManifestReaderProtocol`.
- `tests/unit/c7_inference/test_engine_gate.py` — 15 tests (14 unconditional + AC-8 tier-2 skip).

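The body of the `_sha256_of_file` helper is not part of this diff. As a rough illustration of how a large engine file can be hashed without loading it into memory, a chunked read might look like the sketch below; the public name `sha256_of_file` and the 1 MiB chunk size are assumptions, not the as-built values.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use stays bounded by `chunk_size` regardless of engine size, which matters for the multi-hundred-megabyte engines the performance NFR refers to.
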
### Refusal-order discipline

The five steps execute in this exact order; AC-7 verifies the property by passing a fixture that is *both* schema-mismatched and missing-sidecar — the schema error wins because step 1 runs first. Future refactors that reorder the steps regress AC-7.
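
The ordering property can be sketched in a few lines. This is an illustration under stated assumptions, not the as-built `EngineGate.validate`: the `Candidate` dataclass and its boolean fields are hypothetical, and the exception classes are redefined locally so the block is self-contained. Only the exception names and the step order come from this task.

```python
from dataclasses import dataclass


class EngineSchemaMismatchError(Exception):
    """Steps 1 and 2: filename schema parse / schema tuple match."""


class EngineSidecarMissingError(Exception):
    """Step 3: no sidecar next to the engine."""


class EngineHashMismatchError(Exception):
    """Steps 4 and 5: sidecar trust / manifest match."""


@dataclass(frozen=True)
class Candidate:
    # Hypothetical pre-computed facts about one engine file.
    filename_parses: bool
    tuple_matches_host: bool
    sidecar_exists: bool
    sidecar_matches_file: bool
    manifest_matches_sidecar: bool


def validate(c: Candidate) -> None:
    """Run the five checks in the documented order; the first failure raises."""
    if not c.filename_parses:
        raise EngineSchemaMismatchError("reason=unparseable filename")
    if not c.tuple_matches_host:
        raise EngineSchemaMismatchError("expected/got tuple differ")
    if not c.sidecar_exists:
        raise EngineSidecarMissingError("sidecar not found")
    if not c.sidecar_matches_file:
        raise EngineHashMismatchError("stage=sidecar")
    if not c.manifest_matches_sidecar:
        raise EngineHashMismatchError("stage=manifest")
```

A candidate that fails both the filename parse and the sidecar check raises `EngineSchemaMismatchError`, never `EngineSidecarMissingError`, because the checks are plain sequential early exits. Reordering the `if` blocks would change which error a doubly-broken engine reports.
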
@@ -0,0 +1,143 @@
# Batch 25 / Cycle 1 — Implementation Report

**Date**: 2026-05-12
**Tasks**: AZ-301 (C7 EngineGate — D-C10-3 + D-C10-7 takeoff validator)
**Story points landed**: 3
**Status**: complete (AZ-301 → In Testing)

## Scope summary

Single-task batch, continuing the post-AZ-300 1-task cadence. The user
selected `AZ-301 + AZ-302 + (optionally) AZ-304` for batch 25; AZ-302
(ThermalStatePublisher, 3pt, requires `jtop` / `pynvml` integration +
background thread) and AZ-304 (C6 Postgres schema, 2pt, requires
testcontainers + Alembic) each carry meaningful surface area; bundling
them with AZ-301 would push the batch past the project's 1-task cadence
and bloat the commit beyond practical review. Each ships as its own
batch (26 / 27).

## Files added / modified

### New

- `src/gps_denied_onboard/components/c7_inference/engine_gate.py` —
  `EngineGate.validate` five-step deterministic pipeline +
  `HostTuple` frozen dataclass + `read_host_tuple()` Jetson helper
  (lazy NVML / L4T / TRT-version reads).
- `src/gps_denied_onboard/components/c7_inference/manifest.py` —
  `DeploymentManifest` (Risk-2-compliant `__getitem__` that raises
  `EngineHashMismatchError` on missing key) + `ManifestReader`
  on-disk JSON loader + `ManifestReaderProtocol` for test fakes.
- `tests/unit/c7_inference/test_engine_gate.py` — 15 tests covering
  AC-1..AC-7 + NFR-reliability-no-write + manifest-reader coverage +
  refusal-log emission. AC-8 (real Jetson NVML) is Tier-2-only and
  skips via `GPS_DENIED_TIER` env gate.

### Modified

- `src/gps_denied_onboard/components/c7_inference/__init__.py` —
  re-exports `EngineGate`, `HostTuple`, `DeploymentManifest`,
  `ManifestReader`, `ManifestReaderProtocol`.
- `_docs/02_tasks/todo/AZ-301_c7_engine_gate.md` → moved to
  `_docs/02_tasks/done/`; added `## Implementation Notes (2026-05-12,
  batch 25)` documenting the three task-spec → as-built deltas
  (HostTuple module location, explicit precision arg on
  `read_host_tuple`, AC-8 Tier-2 skip).

## Design decisions

1. **HostTuple module location**: co-located with `engine_gate.py`
   (the only consumer); re-exported from package `__init__.py`. Spec
   left it unpinned. Risk: minor lift if future component needs it
   directly — acceptable since the public surface lives in the
   package.
2. **`read_host_tuple(*, precision)` keyword argument**: the helper
   reads three of the four tuple fields from the host (NVML →
   `sm`; `/etc/nv_tegra_release` → `jp`; `tensorrt.__version__` → `trt`).
   `precision` is engine-build metadata, not a host property —
   passed by the caller. Matches the spec's "derived from
   nvidia-smi/pynvml + the runtime's pinned TRT version + the
   engine's intended precision" clause.
3. **Risk-2 enforcement in `DeploymentManifest.__getitem__`**:
   missing-key access raises `EngineHashMismatchError` directly (not
   `KeyError`). Eliminates the silent-pass class of bug at the type
   level — any consumer using the manifest must handle the C7-family
   error, never `KeyError`.

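Decision 3 is easiest to see in code. The following is a minimal sketch rather than the as-built class: the `entries` field and the locally defined exception are assumptions made to keep the block self-contained, while the error message text follows the commit description.

```python
from dataclasses import dataclass, field
from typing import Dict


class EngineHashMismatchError(Exception):
    """Stand-in for the C7-family error; deliberately not a KeyError subclass."""


@dataclass(frozen=True)
class DeploymentManifest:
    # Hypothetical layout: engine path -> expected sha256 hex digest.
    entries: Dict[str, str] = field(default_factory=dict)

    def __getitem__(self, path: str) -> str:
        try:
            return self.entries[path]
        except KeyError:
            # Translate the lookup failure so callers cannot swallow a
            # generic KeyError and fall through to a silent pass.
            raise EngineHashMismatchError(
                f"missing manifest entry for {path}"
            ) from None
```

A caller written as `try: expected = manifest[path] except KeyError: return` would not catch the refusal, so a missing entry always surfaces as a gate failure.
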
## AC coverage

| AC | Status | Notes |
|----|--------|-------|
| AC-1 schema parse failure | covered | `test_ac1_parse_failure_refused_at_parse_time` |
| AC-2 schema tuple mismatch | covered | `test_ac2_schema_tuple_mismatch` (sm 86 vs host 87) |
| AC-3 missing sidecar | covered | `test_ac3_missing_sidecar_refused_before_manifest` |
| AC-4 sidecar trust | covered | `test_ac4_sidecar_hash_mismatches_file` |
| AC-5 manifest mismatch | covered | `test_ac5_manifest_hash_mismatches_sidecar` |
| AC-6 happy path + INFO log | covered | `test_ac6_full_success_returns_silently_and_logs_pass` |
| AC-7 schema wins over sidecar | covered | `test_ac7_schema_error_wins_over_sidecar_missing` |
| AC-8 read_host_tuple on Jetson | tier2 | `test_ac8_read_host_tuple_on_jetson` — `@pytest.mark.tier2` + `GPS_DENIED_TIER!=2` skip |
| NFR-perf-validate (≤ 50 ms) | tier2 | Real engine-size benchmarks belong on Jetson |
| NFR-reliability-no-write | covered | `test_nfr_reliability_no_writes` snapshots mtime + bytes + sidecar text pre/post-validate |

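The NFR-reliability-no-write row describes a snapshot-and-compare check. A hedged sketch of that idea follows; `snapshot` and `assert_no_writes` are hypothetical helper names, and the real test may compare different attributes.

```python
from pathlib import Path
from typing import Callable, Tuple


def snapshot(path: Path) -> Tuple[int, bytes]:
    """Capture the modification time (ns) and full contents of a file."""
    return (path.stat().st_mtime_ns, path.read_bytes())


def assert_no_writes(
    validate: Callable[[Path], object], engine: Path, sidecar: Path
) -> None:
    """Fail if running the validator changes the engine or its sidecar."""
    before = (snapshot(engine), snapshot(sidecar))
    validate(engine)
    after = (snapshot(engine), snapshot(sidecar))
    assert before == after, "validator modified the engine or its sidecar"
```

Comparing bytes as well as mtime guards against filesystems with coarse timestamp resolution, where a rewrite within the same tick would otherwise go unnoticed.
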
Additional coverage beyond ACs:

- `test_manifest_reader_round_trip` — JSON ⇄ DTO round-trip.
- `test_manifest_missing_entry_raises_hash_mismatch` — Risk-2 (the
  critical "no silent pass" property).
- `test_manifest_reader_rejects_malformed_json` /
  `test_manifest_reader_rejects_missing_entries_key` — bad-input
  refusal at parse time.
- `test_engine_outside_manifest_root_refused` — `engine_path` not
  under `manifest.root` raises `EngineHashMismatchError`.
- `test_refusal_emits_error_log` — `c7.gate.refuse` ERROR log
  emitted with `step` + `reason` fields.

## Test run

```
.venv/bin/pytest tests/unit/c7_inference/ → 77 passed, 7 skipped
.venv/bin/pytest → 1134 passed, 11 skipped
```

Skips are environment-gated (CUDA for AZ-300, Tier-2 GPS_DENIED_TIER
for AZ-301 AC-8, cmake + actionlint absent on dev). No pre-existing
tests regressed.

## Self-review verdict

**Pass.** Pure validator, no GPU ops, no writes. Five refusal paths
in the documented order; AC-7 verifies the discipline. Risk-2 raised
at the type level via `DeploymentManifest.__getitem__`.

## Known gaps for the Product Implementation Completeness Gate

- **AC-8 / NFR-perf-validate not validated on dev**: needs Tier-2
  Jetson. The CI matrix's `runs-on: ubuntu-22.04` cannot exercise
  these — same gap pattern as AZ-300's CUDA-gated AC-3/4/5/8 and
  AZ-332's tier-2 marker.
- **`read_host_tuple` failure modes are minimally tested**: NVML init
  failure / unrecognised L4T release / missing tensorrt binding all
  raise `RuntimeError`, but the unconditional test suite cannot
  exercise the success path. Future Tier-2 integration test should
  pin behaviour.
- **Manifest schema is owned by E-C10**: this task ships only the
  reader. The writer (`CacheProvisioner`) is a separate task; until
  it lands, integration testing of the gate uses dict-backed
  manifest fixtures.

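The second gap mentions that a missing binding surfaces as `RuntimeError`. One common way to get that behaviour is to defer the import until the helper is called and translate the failure; the sketch below assumes a hypothetical `lazy_import` helper and is not taken from the as-built module.

```python
import importlib
from types import ModuleType


def lazy_import(name: str) -> ModuleType:
    """Import a module on first use, mapping absence to RuntimeError."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so both are covered.
        raise RuntimeError(f"{name} binding is not available on this host") from exc
```

Because nothing is imported at module load, a Tier-1 host can import the package and run the unconditional tests; only a call that actually needs the binding fails.
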
## Next batch

**Batch 26 candidates**:

- AZ-302 (ThermalStatePublisher, 3pt) — background thread + jtop /
  pynvml + FDR transition records. Requires test-side fake sources
  (jtop + pynvml may need to be added to pyproject extras for
  Tier-1 CI; even with fakes, the import-attempt logic in the
  publisher will fail without the modules — handle via lazy import).
- AZ-304 (C6 Postgres schema, 2pt) — Alembic migration +
  testcontainers Postgres 16 + schema-shape fixture diff test.
- 17 tasks total ready in the queue (AZ-300 + AZ-301 removed).

Recommended batch 26 size: 1 task (continue post-AZ-300 cadence).
AZ-302's surface (8 ACs + threading + lazy-import gates) suggests it
ship alone.
@@ -8,7 +8,7 @@ status: in_progress
 sub_step:
 phase: 13
 name: archive-and-loop
-detail: "batch 24/cycle1 complete: AZ-300 → In Testing, archived to done/. Installed [inference] extras (torch 2.11.0 + torchvision 0.26.0 + onnxruntime 1.23.2) into the dev venv. 17 ACs + NFRs covered (6 CUDA-skipped on macOS). Suite: 1120 passed / 10 skipped. Next: recompute batch 25 — candidates AZ-301 (EngineGate, 3pt) + AZ-302 (ThermalStatePublisher, 3pt) + AZ-304 (C6 Postgres schema, 2pt). 17 tasks total ready overall (AZ-300 removed; AZ-345 still gated)."
+detail: "batch 25/cycle1 complete: AZ-301 → In Testing, archived to done/. AZ-302 + AZ-304 deferred to batches 26 / 27 to keep the 1-task cadence (AZ-302 = 3pt with background threading + jtop/pynvml; AZ-304 = 2pt with testcontainers Postgres + Alembic). 14 unconditional AC tests + 1 Tier-2 AC-8 skip. Suite: 1134 passed / 11 skipped. 17 tasks total ready overall (AZ-300 + AZ-301 removed)."
 retry_count: 0
 cycle: 1
 tracker: jira