[AZ-298] C7 TensorrtRuntime: TRT 10.3 + INT8 calib trust + GPU budget

Implement the production-default InferenceRuntime strategy on JetPack 6.2 + TensorRT 10.3 (per D-C7-9). The runtime owns the full TRT lifecycle: compile_engine via the Polygraphy + trtexec + IBuilderConfig hybrid (FP16 / INT8 / Mixed precision), deserialize_engine with EngineGate-first ordering and a pre-allocation GPU memory budget gate, infer via H2D -> enqueueV3 -> D2H -> stream sync on the owned CUDA stream, idempotent release_engine, and an injected ThermalStatePublisher delegation for thermal_state.

INT8 calibration cache trust (D-C10-6, AC-2/3/4) is enforced by a .calib_cache.sha256 file-integrity sidecar (AZ-280) plus a new .calib_cache.dataset_sha256 sidecar that records the dataset content hash at compile time; reuse only when both agree, rebuild silently on dataset hash mismatch, raise CalibrationCacheError on corrupt sidecar (never silently overwritten).

GPU memory budget (NFT-LIM-01, default 4 GiB) is checked BEFORE any TRT call beyond the gate (AC-6); a pre-allocation refusal raises OutOfMemoryError and leaves the resident state unchanged.

TensorRT 10.3 / Polygraphy / PyCUDA are lazy-imported inside the methods that need them so the module loads cleanly on Tier-0 hosts. A standalone CLI entry (python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile <onnx> <build_config.json>) is wired for C10 CacheProvisioner (AZ-321) to invoke pre-flight without holding a runtime instance. C7InferenceConfig gains gpu_memory_budget_bytes (default 4 GiB) and trtexec_timeout_s (default 600 s, Risk 4 mitigation), both validated in __post_init__.

Tests: 26 active + 6 Tier-2-gated skips; AC-1 / AC-3 / AC-4 / AC-5 / AC-6 / AC-7 / AC-10 + NFR-reliability fully covered on Tier-1 via fake CUDA / TRT modules; AC-2 / AC-8 / AC-9 / NFR-perf-deserialize placeholders skip with prerequisite reason and live in the AZ-298 Tier-2 microbench harness. Code review verdict PASS_WITH_WARNINGS (1 Medium hot-path hoist fix auto-applied).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 18:31:13 +00:00 · 2026-05-12 23:11:49 +03:00
parent 54942f3052
commit 18a69022b3
9 changed files with 2307 additions and 10 deletions
@@ -0,0 +1,196 @@
+# C7 TensorrtRuntime — Engine Compile + Deserialize + Infer + GPU Memory
+
+**Task**: AZ-298_c7_tensorrt_runtime
+**Name**: C7 TensorrtRuntime
+**Description**: Implement `TensorrtRuntime`, the production-default `InferenceRuntime` strategy on JetPack 6.2 + TensorRT 10.3 (per D-C7-9). Owns the full TRT lifecycle: engine compilation via the Polygraphy / trtexec / IBuilderConfig hybrid (FP16 + INT8 + Mixed precision; INT8 calibration cache trust enforcement); engine deserialization at F2 takeoff load (delegating manifest content-hash + filename schema validation to AZ-301 EngineGate); per-flight resident `EngineHandle` GPU memory management; sync per-call `infer` on the F3 hot path with per-model latency budgets from C7-PT-01; release on flight end; CUDA stream ownership.
+**Complexity**: 5 points
+**Dependencies**: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
+**Component**: c7_inference (epic AZ-249 / E-C7)
+**Tracker**: AZ-298
+**Epic**: AZ-249 (E-C7)
+
+### Document Dependencies
+
+- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297.
+- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — the schema parser used at deserialise time (delegated to EngineGate).
+- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — engine sidecar trust check (delegated to EngineGate).
+- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that provides engine cache dir, calibration dataset path, precision selection.
+
+## Problem
+
+Without a real TensorRT 10.3 strategy, the on-Jetson hot path cannot meet the C7-PT-01 per-model latency targets (UltraVPR ≤ 60 ms, LightGlue ≤ 30 ms, AdHoP ≤ 90 ms, DISK ≤ 50 ms p95) and the AC-4.1 system E2E budget < 400 ms p95 collapses. The Protocol from AZ-297 is just types; this task is what actually runs ML on the Jetson Orin Nano Super.
+
+Concretely, without this task:
+
+- F1 pre-flight engine compilation has no production producer; C10 CacheProvisioner cannot build the engine cache.
+- F2 takeoff load has no engine deserialiser; flights cannot start.
+- F3 hot path has no `infer`; every consumer (C2 / C2.5 / C3 / C3.5) has nothing to call.
+- INT8 calibration cache trust (D-C10-6) has no production gate at compile time.
+- Per-flight GPU memory budget (NFT-LIM-01: ≤ 4 GB resident across all engines, C7-PT-02) has no enforcement point.
+
+This task delivers the canonical production runtime — every other strategy (AZ-299 ONNX-RT, AZ-300 PyTorch) is a fallback against this one's numbers.
+
+## Outcome
+
+- A `TensorrtRuntime` class at `src/gps_denied_onboard/components/c7_inference/tensorrt_runtime.py` conforming to the `InferenceRuntime` Protocol from AZ-297; `current_runtime_label() == "tensorrt"`.
+- `compile_engine(model_path, build_config) -> EngineCacheEntry` produces an FP16 / INT8 / Mixed engine using the Polygraphy + trtexec + IBuilderConfig hybrid: Polygraphy for orchestration + ONNX import; trtexec for build profiling and binary outputs where appropriate; IBuilderConfig for fine-grained TRT controls (workspace size, optimization profiles, INT8 calibrator). The output `.engine` file lands at the cache path with the D-C10-7 filename schema and a sha256 sidecar from AZ-280.
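+- INT8 calibration cache is reused only if both checks pass: its `.calib_cache.sha256` sidecar matches the cache file, and the calibration-dataset hash recorded at compile time matches the dataset currently on disk. A dataset-hash mismatch forces a full rebuild without raising (AC-3); a sidecar mismatch raises `CalibrationCacheError` and leaves the cache untouched (AC-4). A stale or corrupt cache is never used silently.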
+- `deserialize_engine(EngineCacheEntry) -> EngineHandle` calls AZ-301 `EngineGate` first (raises `EngineHashMismatchError` / `EngineSchemaMismatchError` / `EngineSidecarMissingError` on refusal); on success, builds an `IExecutionContext`, allocates GPU buffers per the optimization profile's `opt_shape`, allocates one CUDA stream, and returns a `TrtEngineHandle` (`EngineHandle` subclass) that holds the runtime-resident state.
+- `infer(handle, inputs) -> outputs` does sync GPU stream execution: H2D copy of every named input → `enqueueV3` on the engine's exec context → D2H copy of every named output → stream sync → return. No pinned-memory pool reuse beyond what TRT itself allocates (keeps the simple-baseline path viable).
+- `release_engine(handle)` frees GPU buffers, destroys the exec context, releases the CUDA stream. Called once per `EngineHandle` at flight end (or on `OutOfMemoryError`-driven cleanup).
+- `infer` calls against different engines share the single CUDA stream owned by the Runtime instance and therefore execute one after another (matches the description.md "typically one stream because the F3 hot path is single-threaded"). Concurrent calls from multiple threads are NOT supported this cycle.
+- `EngineBuildError` on compile failure (e.g., ONNX op unsupported by TRT 10.3 plugins); `EngineDeserializeError` on engine corruption; `InferenceError` on transient CUDA fault; `OutOfMemoryError` on GPU OOM with the offending engine name in the message.
+- `current_runtime_label()` returns exactly `"tensorrt"`, so the FDR-stamped runtime label is consistent across logs, FDR records, and operator post-flight inspection.
+
+## Scope
+
+### Included
+
+- `TensorrtRuntime` class implementation conforming to the AZ-297 Protocol.
+- `compile_engine`: Polygraphy network creation + ONNX parser + IBuilderConfig setup (workspace MB from `BuildConfig`, optimization profiles, FP16 / INT8 / Mixed flags) + INT8 calibrator wiring with calibration-dataset path from `BuildConfig.calibration_dataset` + trtexec invocation for build (subprocess) when `BuildConfig.use_trtexec=True` (faster on FP16; IBuilderConfig direct path is the default for INT8). Output: `.engine` file written via `helpers.sha256_sidecar.atomic_write_with_sidecar`.
+- INT8 calibration cache trust: at compile time, write a `.calib_cache` file alongside the `.engine` and a sidecar `.calib_cache.sha256`. The calibration dataset's content hash is stamped into the cache header. On reuse, the calibrator reads the existing cache iff the dataset hash matches; otherwise rebuilds and overwrites — never silently uses a stale cache.
+- `deserialize_engine`: invokes AZ-301 `EngineGate.validate(EngineCacheEntry)` first (raising the gate's documented errors); then loads the engine via `IRuntime.deserialize_cuda_engine`, builds `IExecutionContext`, allocates GPU buffers from `opt_shape`, returns `TrtEngineHandle`.
+- `infer`: sync GPU stream execution per the description.md § 5; H2D and D2H via `cudaMemcpyAsync` on the owned stream; `enqueueV3`; stream sync; per-frame DEBUG log with backbone name and elapsed milliseconds.
+- `release_engine`: destroys the exec context, frees GPU buffers, releases the stream; idempotent on a handle that was already released (returns silently).
+- `thermal_state()` is delegated to AZ-302's publisher via a constructor-injected `ThermalStatePublisher` reference; this task wires the delegation, AZ-302 owns the polling loop.
+- GPU memory budget enforcement: at deserialize time, add the new engine's projected buffer allocations to the bytes already resident on this runtime instance and compare the total against `config.inference.gpu_memory_budget_bytes` (default 4 GB per C7-PT-02). If the budget would be exceeded, refuse to deserialize: raise `OutOfMemoryError` BEFORE allocating, with a message identifying which engine pushed it over.
+- Error envelope per the AZ-297 Protocol: `EngineBuildError`, `EngineDeserializeError`, `CalibrationCacheError`, `InferenceError`, `OutOfMemoryError`.
+- Diagnostic INFO log on `compile_engine` start/end with elapsed seconds and output path; INFO log on `deserialize_engine` with engine identity + warm-up confirmation; per-frame DEBUG log on `infer` (off by default; enabled by config).
+- A standalone CLI entry point `python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile <onnx_path> <build_config_path>` is exposed for C10 CacheProvisioner to invoke pre-flight without holding a runtime instance. The CLI is a thin wrapper around `compile_engine`; it is not a separate compile path.
+
+### Excluded
+
+- AZ-301 EngineGate validation logic (filename-schema parse, manifest content-hash check) — this task only INVOKES the gate.
+- AZ-302 ThermalState polling thread — this task only delegates to a publisher reference.
+- AZ-299 OnnxTrtEpRuntime fallback — separate task.
+- AZ-300 PytorchFp16Runtime baseline — separate task.
+- C10 CacheProvisioner orchestration of compile_engine — owned by E-C10. This task exposes the API; C10 calls it.
+- Engine warm-up beyond the deserialize-side buffer allocation (the warm-up that AC-NEW-1 measures end-to-end is owned by C10's pre-flight orchestration; the per-engine warm-up cost lives here).
+- Plugin work for ONNX ops unsupported by TRT 10.3: this task does not "fix" such ops via plugins; `compile_engine` raises `EngineBuildError` and the operator chooses the ONNX-RT fallback (AZ-299).
+- Multi-stream concurrent execution — out of scope this cycle (description.md notes the F3 hot path is single-threaded; future task if needed).
+
+## Acceptance Criteria
+
+**AC-1: compile_engine produces an FP16 .engine + sidecar at the canonical path**
+Given a valid ONNX model and `BuildConfig(precision=Fp16, ...)`
+When `compile_engine(model_path, build_config)` runs to completion
+Then a `.engine` file exists at the D-C10-7 filename schema path with a matching `.sha256` sidecar from AZ-280; the returned `EngineCacheEntry` carries that path, the sha256, and the `(SM, JP, TRT, precision)` tuple
+
+**AC-2: compile_engine produces an INT8 .engine with a calibration cache + sidecar**
+Given a valid ONNX model, a calibration dataset directory with at least 100 images, and `BuildConfig(precision=Int8, calibration_dataset=...)`
+When `compile_engine` runs
+Then a `.engine` file is produced AND a `.calib_cache` file is produced with a `.calib_cache.sha256` sidecar; the cache header records the calibration-dataset content hash; a second `compile_engine` call with the same dataset reuses the cache (verifiable: the second call is < 30 s, the first is minutes)
+
+**AC-3: stale calibration cache forces rebuild**
+Given an existing `.calib_cache` with dataset hash A, and the dataset on disk now hashes to B
+When `compile_engine` runs with the same calibration_dataset path
+Then the calibrator detects the mismatch, rebuilds the cache from scratch (cache header now records hash B), and writes a new sidecar; the prior cache is overwritten — no silent reuse, no `CalibrationCacheError`
+
+**AC-4: corrupted calibration cache raises CalibrationCacheError**
+Given an existing `.calib_cache` whose sidecar `.calib_cache.sha256` does not match the cache file's actual sha256
+When `compile_engine` runs
+Then `CalibrationCacheError` is raised; no engine is built; the cache file is NOT silently overwritten (operator must explicitly delete or replace)
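
The AC-2 / AC-3 / AC-4 flow can be read as a three-way decision. The sketch below is illustrative only, not the committed implementation. The names `decide_calibration_cache`, `CacheDecision` and `sha256_file` are hypothetical, the recorded dataset hash is passed in as a plain string (how it is read from the cache header is not shown), and treating a missing sidecar the same as a mismatched one is an assumption.

```python
import hashlib
from enum import Enum
from pathlib import Path
from typing import Optional


class CalibrationCacheError(Exception):
    """Local stand-in for the AZ-297 error type."""


class CacheDecision(Enum):
    REUSE = "reuse"
    REBUILD = "rebuild"


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def decide_calibration_cache(
    cache: Path,
    recorded_dataset_hash: Optional[str],
    current_dataset_hash: str,
) -> CacheDecision:
    """Return REUSE or REBUILD, or raise if the cache fails its integrity check."""
    if not cache.exists():
        return CacheDecision.REBUILD  # nothing to trust yet

    sidecar = cache.with_name(cache.name + ".sha256")
    # Assumption: the sidecar's first whitespace-separated token is the hex digest,
    # and a missing sidecar is treated like a mismatched one.
    tokens = sidecar.read_text().split() if sidecar.exists() else []
    if not tokens or tokens[0] != sha256_file(cache):
        # AC-4: corrupt cache or sidecar. Refuse and leave the files untouched.
        raise CalibrationCacheError(f"integrity check failed for {cache.name}")

    if recorded_dataset_hash != current_dataset_hash:
        return CacheDecision.REBUILD  # AC-3: dataset changed, rebuild without error
    return CacheDecision.REUSE
```

The integrity check runs before the dataset-hash comparison so that a corrupt cache is reported rather than quietly rebuilt over.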
+
+**AC-5: deserialize_engine invokes EngineGate before any GPU allocation**
+Given an `EngineCacheEntry` whose engine file's filename schema does not match the running Jetson tuple
+When `deserialize_engine(entry)` is called
+Then `EngineGate.validate(entry)` is invoked first, raises `EngineSchemaMismatchError`; no `IRuntime.deserialize_cuda_engine` call is made; no GPU memory is allocated (verifiable via NVML before/after delta)
+
+**AC-6: deserialize_engine refuses when GPU memory budget would be exceeded**
+Given a deserialize that would allocate 1.2 GB and a runtime instance currently holding 3.0 GB resident with budget = 4 GB
+When `deserialize_engine` is called
+Then `OutOfMemoryError` is raised with the engine name in the message; no partial allocation occurs; the prior 3.0 GB resident state is unchanged
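
AC-5 and AC-6 together fix an ordering: gate, then budget check, then allocation. A minimal sketch of that ordering follows, using plain callables in place of EngineGate and the TRT loader. `ResidentLedger`, `deserialize_with_gates` and the callable parameters are hypothetical names for illustration, and the budget is expressed in GiB as in the commit message.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

GIB = 1024 ** 3


class OutOfMemoryError(Exception):
    """Local stand-in for the AZ-297 error type."""


@dataclass
class ResidentLedger:
    """Tracks bytes held per engine against a fixed budget (hypothetical helper)."""
    budget_bytes: int = 4 * GIB
    resident: Dict[str, int] = field(default_factory=dict)

    def resident_bytes(self) -> int:
        return sum(self.resident.values())

    def check(self, engine_name: str, required_bytes: int) -> None:
        """Raise before any allocation if the projected total exceeds the budget."""
        projected = self.resident_bytes() + required_bytes
        if projected > self.budget_bytes:
            raise OutOfMemoryError(
                f"engine {engine_name!r} needs {required_bytes} B; projected "
                f"{projected} B exceeds budget {self.budget_bytes} B"
            )

    def record(self, engine_name: str, required_bytes: int) -> None:
        self.resident[engine_name] = required_bytes


def deserialize_with_gates(
    name: str,
    required_bytes: int,
    validate: Callable[[str], None],
    ledger: ResidentLedger,
    allocate: Callable[[str], object],
) -> object:
    validate(name)                       # 1. gate first; raises on refusal
    ledger.check(name, required_bytes)   # 2. budget check before touching the GPU
    handle = allocate(name)              # 3. only now deserialize and allocate
    ledger.record(name, required_bytes)  # 4. record only after allocation succeeded
    return handle
```

With the AC-6 numbers (3.0 resident, 1.2 requested, budget 4), `check` raises before `allocate` is ever called, so the ledger still reports the prior 3.0 afterwards.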
+
+**AC-7: infer round-trips H2D + enqueueV3 + D2H on the owned stream**
+Given a deserialised `TrtEngineHandle` and a fixed input dict
+When `infer(handle, inputs)` is called and a profiling tool counts CUDA API events
+Then the call sequence is: `cudaMemcpyAsync(H2D)` × N inputs, `enqueueV3`, `cudaMemcpyAsync(D2H)` × M outputs, `cudaStreamSynchronize`; output dict has the M expected named tensors
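
The AC-7 call order can be checked without a GPU by recording calls on a fake. The sketch below is an assumption about how such a test double might look. `RecordingCuda` and its method names are hypothetical and do not correspond to PyCUDA or TensorRT APIs (in the TensorRT Python bindings the `enqueueV3` step is exposed as `IExecutionContext.execute_async_v3`).

```python
from typing import Dict, List, Sequence


class RecordingCuda:
    """Fake device layer that records the order of calls instead of running them."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def memcpy_h2d_async(self, name: str) -> None:
        self.events.append(f"H2D:{name}")

    def enqueue_v3(self) -> None:
        self.events.append("enqueueV3")

    def memcpy_d2h_async(self, name: str) -> None:
        self.events.append(f"D2H:{name}")

    def stream_synchronize(self) -> None:
        self.events.append("streamSync")


def infer(cuda: RecordingCuda, inputs: Dict[str, bytes],
          output_names: Sequence[str]) -> Dict[str, bytes]:
    """Issue the AC-7 sequence: N H2D copies, enqueue, M D2H copies, one sync."""
    for name in inputs:
        cuda.memcpy_h2d_async(name)
    cuda.enqueue_v3()
    for name in output_names:
        cuda.memcpy_d2h_async(name)
    cuda.stream_synchronize()
    # Placeholder payloads; a real runtime would return the copied host buffers.
    return {name: b"" for name in output_names}


cuda = RecordingCuda()
outputs = infer(cuda, {"image": b"\x00", "mask": b"\x00"}, ["descriptor"])
```

After this call `cuda.events` is `["H2D:image", "H2D:mask", "enqueueV3", "D2H:descriptor", "streamSync"]`, which is the shape of assertion a Tier-1 test could make.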
+
+**AC-8: per-model latency p95 budget**
+Given the production-default UltraVPR / LightGlue / AdHoP / DISK FP16 engines deserialised on Tier-2 (Jetson Orin Nano Super)
+When the C7-PT-01 latency benchmark runs (scripted load matching production)
+Then per-model p95 latencies are: UltraVPR ≤ 60 ms, LightGlue ≤ 30 ms, AdHoP ≤ 90 ms, DISK ≤ 50 ms; failure thresholds (100 / 60 / 150 / 90 ms respectively) are NOT crossed
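
For reference, the AC-8 thresholds can be applied to a list of latency samples as sketched below. The nearest-rank percentile definition is an assumption (the spec does not say which p95 estimator C7-PT-01 uses), and `p95_nearest_rank` and `check_latency` are hypothetical names.

```python
from typing import Dict, Sequence, Tuple

# (target_ms, failure_ms) per model, taken from AC-8.
BUDGETS_MS: Dict[str, Tuple[float, float]] = {
    "UltraVPR": (60, 100),
    "LightGlue": (30, 60),
    "AdHoP": (90, 150),
    "DISK": (50, 90),
}


def p95_nearest_rank(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil, avoids float rounding
    return ordered[rank - 1]


def check_latency(model: str, samples_ms: Sequence[float]) -> Tuple[float, bool, bool]:
    """Return (p95, meets_target, below_failure_threshold) for one model."""
    target_ms, failure_ms = BUDGETS_MS[model]
    p95 = p95_nearest_rank(samples_ms)
    return p95, p95 <= target_ms, p95 <= failure_ms
```

For example, 100 LightGlue samples with 95 at 10 ms and 5 at 70 ms give a p95 of 10 ms under this definition; with 6 samples at 70 ms the p95 becomes 70 ms and both thresholds are crossed.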
+
+**AC-9: GPU memory budget compliance**
+Given all production-default engines resident concurrently
+When the C7-PT-02 memory benchmark runs
+Then GPU resident memory across engines ≤ 4 GB (failure threshold 5 GB) and process RAM ≤ 1.5 GB (failure threshold 2 GB)
+
+**AC-10: release_engine fully frees GPU state and is idempotent**
+Given a deserialised handle holding 1 GB GPU memory
+When `release_engine(handle)` is called once and then again
+Then after the first call NVML reports 1 GB freed; after the second call no error is raised and NVML state is unchanged
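
Idempotent release, as AC-10 describes it, usually comes down to a released flag on the handle. A minimal sketch under that assumption is shown below. `FakeHandle` and the `free` callable are hypothetical stand-ins for `TrtEngineHandle` and the real device free call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FakeHandle:
    """Stand-in for a resident engine handle holding device buffer pointers."""
    buffers: List[int]
    released: bool = False


def release_engine(handle: FakeHandle, free: Callable[[int], None]) -> None:
    """Free every buffer once; a second call on the same handle is a no-op."""
    if handle.released:
        return
    for ptr in handle.buffers:
        free(ptr)
    handle.buffers.clear()
    handle.released = True


freed: List[int] = []
handle = FakeHandle(buffers=[0x1000, 0x2000])
release_engine(handle, freed.append)
release_engine(handle, freed.append)  # second call returns silently
```

Each pointer appears in `freed` exactly once, which mirrors the "NVML state is unchanged" expectation for the second call.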
+
+## Non-Functional Requirements
+
+**Performance**
+- Per-model p95 latencies as in AC-8.
+- GPU memory ≤ 4 GB resident, RAM ≤ 1.5 GB process — AC-9 / NFT-LIM-01 / NFT-LIM-04.
+- `compile_engine` for FP16 sub-minute typical; INT8 minutes (calibration dominates) — bounded by hardware, not budget.
+- `deserialize_engine` p95 ≤ 5 s per engine on Tier-2 (engineering sanity bound; AC-NEW-1 cold-start budget is end-to-end and is asserted by C7-IT-01 in the test phase).
+
+**Compatibility**
+- TensorRT 10.3 only this cycle (per D-C7-9 JetPack 6.2 lock). No 9.x / 8.x compatibility shims.
+- Built on Polygraphy + trtexec + IBuilderConfig — three TRT-supported entry points; no custom TRT builder calls outside that surface.
+
+**Reliability**
+- All errors are caught and rewrapped into the AZ-297 family. No raw `RuntimeError` or TRT-specific exception leaks to consumers.
+- `infer` NEVER blocks on the producer side; sync stream execution is on the consumer's calling thread.
+- INT8 calibration cache trust is the lurking foot-gun (per description.md § 7); the AC-2 / AC-3 / AC-4 flow is the only protection. A new strategy or refactor MUST preserve these three.
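
One way to satisfy the rewrap requirement is a small context manager around every TRT / CUDA call site. The sketch below is an assumption about shape, not the committed code. The base class `InferenceRuntimeError` is hypothetical (the spec names the individual errors but not a common base), and the original message is preserved both in the new exception text and via exception chaining.

```python
from contextlib import contextmanager
from typing import Iterator, Type


class InferenceRuntimeError(Exception):
    """Hypothetical common base for the AZ-297 error family."""


class InferenceError(InferenceRuntimeError):
    """Local stand-in for the AZ-297 transient-fault error."""


@contextmanager
def rewrap(target: Type[InferenceRuntimeError]) -> Iterator[None]:
    """Convert any foreign exception raised in the block into `target`."""
    try:
        yield
    except InferenceRuntimeError:
        raise  # already in the family; do not double-wrap
    except Exception as exc:
        # Keep the original message and chain the original exception.
        raise target(str(exc)) from exc


def run_guarded() -> None:
    with rewrap(InferenceError):
        raise RuntimeError("CUDA launch failure")  # simulated backend fault
```

Calling `run_guarded()` raises `InferenceError("CUDA launch failure")` whose `__cause__` is the original `RuntimeError`, so consumers never see the raw exception type.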
+
+**Concurrency**
+- One CUDA stream per `TensorrtRuntime` instance. The F3 hot path is single-threaded (per description.md); concurrent `infer` calls from different threads are NOT supported this cycle (would corrupt the stream). Documented as a constraint.
+- Compile path (`compile_engine`) is allowed to run while a runtime instance is also serving `infer` calls only if the compile is in a separate process (the C10 CLI entry point); same-process concurrent compile + infer is NOT supported.
+
+## Unit Tests
+
+| AC Ref | What to Test | Required Outcome |
+|--------|-------------|-----------------|
+| AC-1 | compile FP16 with a tiny ONNX (sanity) | Engine + sidecar produced at canonical path; EngineCacheEntry shape correct |
+| AC-2 | compile INT8 with a 100-image calibration dataset; rerun | Cache + sidecar produced; rerun reuses cache (under 30 s); cache header has dataset hash |
+| AC-3 | mutate calibration dataset; rerun compile | Cache rebuilt; new dataset hash in header |
+| AC-4 | corrupt calibration sidecar; rerun compile | `CalibrationCacheError`; engine NOT built |
+| AC-5 | deserialise an engine with a mismatched filename SM | `EngineSchemaMismatchError` from gate; no GPU allocation (NVML diff = 0) |
+| AC-6 | deserialise that would push past budget | `OutOfMemoryError`; prior resident state unchanged |
+| AC-7 | infer with a profiler attached | Exact CUDA call sequence: N H2D, enqueueV3, M D2H, stream sync |
+| AC-8 | C7-PT-01 microbench against the four production engines | p95 within the per-model budgets listed in AC-8 |
+| AC-9 | C7-PT-02 memory test with all engines resident | ≤ 4 GB GPU, ≤ 1.5 GB RAM |
+| AC-10 | release_engine called twice | First call frees memory; second is a no-op |
+| NFR-perf-deserialize | Microbench deserialise per engine | p95 ≤ 5 s on Tier-2 |
+| NFR-reliability-error-rewrap | Inject a TRT C++ exception via mock | Rewrapped into the AZ-297 family; original message preserved |
+
+## Constraints
+
+- TensorRT 10.3 + JetPack 6.2 lock per D-C7-9.
+- Polygraphy + trtexec + IBuilderConfig only — no custom TRT builder API outside this surface.
+- `helpers.sha256_sidecar` (AZ-280) for atomic write + sidecar pattern.
+- `helpers.engine_filename_schema` (AZ-281) for the D-C10-7 filename schema (consumed at deserialise via the gate).
+- The `EngineHandle` subclass `TrtEngineHandle` is opaque to consumers; consumers MUST NOT introspect its fields. Implementation may add diagnostic fields (e.g., `last_infer_elapsed_ms`) but they are NOT part of the AZ-297 Protocol.
+- The CLI entry point `python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile ...` is for C10 to invoke; not part of any consumer's public API.
+- This task introduces no new third-party dependencies beyond TensorRT 10.3, Polygraphy, and the trtexec binary that ships with TRT.
+- Per-frame DEBUG logging defaults to OFF (would flood at 39 Hz aggregate per description.md); enabled only via `config.inference.per_frame_debug_log = true`.
+
+## Risks & Mitigation
+
+**Risk 1: ONNX op unsupported by TRT 10.3**
+- *Risk*: A backbone author exports an op TRT cannot lower; `compile_engine` raises `EngineBuildError` and the operator has nothing to fall back to.
+- *Mitigation*: `EngineBuildError` is surfaced to the operator pre-flight (per description.md error-handling spec); the operator switches the runtime config to `onnx_trt_ep` (AZ-299) which has wider op coverage. Documented in the operator playbook (out of scope here).
+
+**Risk 2: INT8 calibration silently uses a stale cache**
+- *Risk*: The AC-3 / AC-4 checks fail in a corner case (for example, the dataset is replaced atomically and the hash check races with compile start).
+- *Mitigation*: The calibration-dataset hash is computed at the START of `compile_engine` and compared to the cache header's hash; the dataset is treated as immutable for the duration of the call. AC-3 + AC-4 cover the obvious cases; the corner case requires a separate test where the dataset is replaced mid-compile, which is operator error and out of scope.
+
+**Risk 3: GPU memory budget is exceeded under sustained use due to TRT internal scratch**
+- *Risk*: `deserialize_engine` reports buffer size based on `opt_shape`, but TRT may allocate additional internal scratch beyond what is reported.
+- *Mitigation*: AC-9 measures actual NVML resident memory, not reported buffer size; a regression here fails the test. The budget includes a safety margin (4 GB target with 5 GB hard fail) that absorbs typical TRT scratch.
+
+**Risk 4: trtexec subprocess hangs on a malformed ONNX**
+- *Risk*: trtexec can hang silently on certain malformed inputs.
+- *Mitigation*: The trtexec invocation has a config-driven timeout (default 600 s = 10 minutes); on timeout, the subprocess is killed and `EngineBuildError("trtexec timeout after 600s")` is raised. The IBuilderConfig direct path is preferred for INT8 anyway; trtexec is mainly for FP16 build profiling.
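
The timeout mitigation maps directly onto `subprocess.run(..., timeout=...)`, which kills the child and raises `subprocess.TimeoutExpired` when the limit elapses. The sketch below shows that pattern under stated assumptions. `run_build_tool` is a hypothetical helper name, the argument vector is supplied by the caller, and mapping non-zero exit codes and launch failures to `EngineBuildError` is an assumption beyond what the mitigation text states.

```python
import subprocess
from typing import Sequence


class EngineBuildError(Exception):
    """Local stand-in for the AZ-297 error type."""


def run_build_tool(argv: Sequence[str], timeout_s: float) -> None:
    """Run an external build command, converting hangs and failures to EngineBuildError."""
    try:
        # On timeout, subprocess.run kills the child, waits for it, then raises.
        subprocess.run(list(argv), check=True, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired as exc:
        raise EngineBuildError(f"trtexec timeout after {timeout_s:g}s") from exc
    except subprocess.CalledProcessError as exc:
        raise EngineBuildError(f"trtexec exited with code {exc.returncode}") from exc
    except OSError as exc:
        # For example, the binary is missing from PATH.
        raise EngineBuildError(f"trtexec could not be started: {exc}") from exc
```

With `timeout_s=600` the message reads `trtexec timeout after 600s`, matching the wording in the mitigation.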
+
+## Runtime Completeness
+
+- **Named capability**: TensorRT 10.3 production runtime + Polygraphy / trtexec / IBuilderConfig hybrid + INT8 calibration cache trust + GPU memory budget enforcement (architecture / E-C7 / D-C7-9 / NFT-PERF-01 / NFT-LIM-01).
+- **Production code that must exist**: real `TensorrtRuntime` class implementing the AZ-297 Protocol; real Polygraphy + trtexec + IBuilderConfig compile path; real `IRuntime.deserialize_cuda_engine` + `IExecutionContext` + GPU buffer allocations; real `enqueueV3` + sync-stream execution; real INT8 calibrator with dataset-hash cache trust; real GPU memory budget check at deserialise.
+- **Allowed external stubs**: tests MAY substitute a `FakeCudaProfiler` to count CUDA events (AC-7); production wiring uses real TRT + real CUDA. C10 CacheProvisioner is the production caller of `compile_engine`; tests MAY drive the CLI directly.
+- **Unacceptable substitutes**: a Python-level fake "engine" that bypasses TRT (would defeat the whole point), a calibration cache "always trusted" path (would break D-C10-6), an `infer` that uses `torch.compile` instead of `enqueueV3` (that's AZ-300 PyTorch baseline territory), or an in-memory engine that skips the file-on-disk path (would break the engine cache lifecycle that C10 + F2 takeoff load depend on).