[AZ-298] C7 TensorrtRuntime: TRT 10.3 + INT8 calib trust + GPU budget

Implement the production-default InferenceRuntime strategy on JetPack
6.2 + TensorRT 10.3 (per D-C7-9). The runtime owns the full TRT
lifecycle: compile_engine via the Polygraphy + trtexec + IBuilderConfig
hybrid (FP16 / INT8 / Mixed precision), deserialize_engine with
EngineGate-first ordering and a pre-allocation GPU memory budget gate,
infer via H2D -> enqueueV3 -> D2H -> stream sync on the owned CUDA
stream, idempotent release_engine, and an injected
ThermalStatePublisher delegation for thermal_state.

INT8 calibration cache trust (D-C10-6, AC-2/3/4) is enforced by a
.calib_cache.sha256 file-integrity sidecar (AZ-280) plus a new
.calib_cache.dataset_sha256 sidecar that records the dataset content
hash at compile time; reuse only when both agree, rebuild silently on
dataset hash mismatch, raise CalibrationCacheError on corrupt sidecar
(never silently overwritten).

GPU memory budget (NFT-LIM-01, default 4 GiB) is checked BEFORE any
TRT call beyond the gate (AC-6); a pre-allocation refusal raises
OutOfMemoryError and leaves the resident state unchanged.

TensorRT 10.3 / Polygraphy / PyCUDA are lazy-imported inside the
methods that need them so the module loads cleanly on Tier-0 hosts.
A standalone CLI entry (python -m
gps_denied_onboard.components.c7_inference.tensorrt_runtime compile
<onnx> <build_config.json>) is wired for C10 CacheProvisioner
(AZ-321) to invoke pre-flight without holding a runtime instance.

C7InferenceConfig gains gpu_memory_budget_bytes (default 4 GiB) and
trtexec_timeout_s (default 600 s, Risk 4 mitigation), both validated
in __post_init__.

Tests: 26 active + 6 Tier-2-gated skips; AC-1 / AC-3 / AC-4 / AC-5
/ AC-6 / AC-7 / AC-10 + NFR-reliability fully covered on Tier-1
via fake CUDA / TRT modules; AC-2 / AC-8 / AC-9 / NFR-perf-deserialize
placeholders skip with prerequisite reason and live in the AZ-298
Tier-2 microbench harness. Code review verdict
PASS_WITH_WARNINGS (1 Medium hot-path hoist fix auto-applied).

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-12 23:11:49 +03:00
parent 54942f3052
commit 18a69022b3
9 changed files with 2307 additions and 10 deletions
@@ -0,0 +1,196 @@
# C7 TensorrtRuntime — Engine Compile + Deserialize + Infer + GPU Memory
**Task**: AZ-298_c7_tensorrt_runtime
**Name**: C7 TensorrtRuntime
**Description**: Implement `TensorrtRuntime`, the production-default `InferenceRuntime` strategy on JetPack 6.2 + TensorRT 10.3 (per D-C7-9). Owns the full TRT lifecycle: engine compilation via the Polygraphy / trtexec / IBuilderConfig hybrid (FP16 + INT8 + Mixed precision; INT8 calibration cache trust enforcement); engine deserialization at F2 takeoff load (delegating manifest content-hash + filename schema validation to AZ-301 EngineGate); per-flight resident `EngineHandle` GPU memory management; sync per-call `infer` on the F3 hot path with per-model latency budgets from C7-PT-01; release on flight end; CUDA stream ownership.
**Complexity**: 5 points
**Dependencies**: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
**Component**: c7_inference (epic AZ-249 / E-C7)
**Tracker**: AZ-298
**Epic**: AZ-249 (E-C7)
### Document Dependencies
- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297.
- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — the schema parser used at deserialise time (delegated to EngineGate).
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — engine sidecar trust check (delegated to EngineGate).
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that provides engine cache dir, calibration dataset path, precision selection.
## Problem
Without a real TensorRT 10.3 strategy, the on-Jetson hot path cannot meet the C7-PT-01 per-model p95 latency targets (UltraVPR ≤ 60 ms, LightGlue ≤ 30 ms, AdHoP ≤ 90 ms, DISK ≤ 50 ms), and the AC-4.1 system E2E budget (< 400 ms p95) collapses. The Protocol from AZ-297 is just types; this task is what actually runs ML on the Jetson Orin Nano Super.
Concretely, without this task:
- F1 pre-flight engine compilation has no production producer; C10 CacheProvisioner cannot build the engine cache.
- F2 takeoff load has no engine deserialiser; flights cannot start.
- F3 hot path has no `infer`; every consumer (C2 / C2.5 / C3 / C3.5) has nothing to call.
- INT8 calibration cache trust (D-C10-6) has no production gate at compile time.
- Per-flight GPU memory budget (NFT-LIM-01: ≤ 4 GB resident across all engines, C7-PT-02) has no enforcement point.
This task delivers the canonical production runtime — every other strategy (AZ-299 ONNX-RT, AZ-300 PyTorch) is a fallback against this one's numbers.
## Outcome
- A `TensorrtRuntime` class at `src/gps_denied_onboard/components/c7_inference/tensorrt_runtime.py` conforming to the `InferenceRuntime` Protocol from AZ-297; `current_runtime_label() == "tensorrt"`.
- `compile_engine(model_path, build_config) -> EngineCacheEntry` produces an FP16 / INT8 / Mixed engine using the Polygraphy + trtexec + IBuilderConfig hybrid: Polygraphy for orchestration + ONNX import; trtexec for build profiling and binary outputs where appropriate; IBuilderConfig for fine-grained TRT controls (workspace size, optimization profiles, INT8 calibrator). The output `.engine` file lands at the cache path with the D-C10-7 filename schema and a sha256 sidecar from AZ-280.
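- INT8 calibration cache is reused only if its `.calib_cache.sha256` sidecar matches the cache file and the dataset hash recorded at compile time matches the calibration dataset's current content hash. A dataset-hash mismatch forces a full rebuild without raising `CalibrationCacheError` (AC-3); a sidecar mismatch raises `CalibrationCacheError` and leaves the cache file untouched (AC-4). A stale cache is never silently reused.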
- `deserialize_engine(EngineCacheEntry) -> EngineHandle` calls AZ-301 `EngineGate` first (raises `EngineHashMismatchError` / `EngineSchemaMismatchError` / `EngineSidecarMissingError` on refusal); on success, builds an `IExecutionContext`, allocates GPU buffers per the optimization profile's `opt_shape`, allocates one CUDA stream, and returns a `TrtEngineHandle` (`EngineHandle` subclass) that holds the runtime-resident state.
- `infer(handle, inputs) -> outputs` does sync GPU stream execution: H2D copy of every named input → `enqueueV3` on the engine's exec context → D2H copy of every named output → stream sync → return. No pinned-memory pool reuse beyond what TRT itself allocates (keeps the simple-baseline path viable).
- `release_engine(handle)` frees GPU buffers, destroys the exec context, releases the CUDA stream. Called once per `EngineHandle` at flight end (or on `OutOfMemoryError`-driven cleanup).
- Concurrent `infer` calls against different engines are serialised on a single CUDA stream per Runtime instance (matches the description.md "typically one stream because the F3 hot path is single-threaded").
- `EngineBuildError` on compile failure (e.g., ONNX op unsupported by TRT 10.3 plugins); `EngineDeserializeError` on engine corruption; `InferenceError` on transient CUDA fault; `OutOfMemoryError` on GPU OOM with the offending engine name in the message.
- `current_runtime_label()` returns `"tensorrt"` exactly; the FDR-stamped runtime label is consistent across logs, FDR records, and operator post-flight inspection.
## Scope
### Included
- `TensorrtRuntime` class implementation conforming to the AZ-297 Protocol.
- `compile_engine`: Polygraphy network creation + ONNX parser + IBuilderConfig setup (workspace MB from `BuildConfig`, optimization profiles, FP16 / INT8 / Mixed flags) + INT8 calibrator wiring with calibration-dataset path from `BuildConfig.calibration_dataset` + trtexec invocation for build (subprocess) when `BuildConfig.use_trtexec=True` (faster on FP16; IBuilderConfig direct path is the default for INT8). Output: `.engine` file written via `helpers.sha256_sidecar.atomic_write_with_sidecar`.
- INT8 calibration cache trust: at compile time, write a `.calib_cache` file alongside the `.engine` and a sidecar `.calib_cache.sha256`. The calibration dataset's content hash is stamped into the cache header. On reuse, the calibrator reads the existing cache iff the dataset hash matches; otherwise rebuilds and overwrites — never silently uses a stale cache.
- `deserialize_engine`: invokes AZ-301 `EngineGate.validate(EngineCacheEntry)` first (raising the gate's documented errors); then loads the engine via `IRuntime.deserialize_cuda_engine`, builds `IExecutionContext`, allocates GPU buffers from `opt_shape`, returns `TrtEngineHandle`.
- `infer`: sync GPU stream execution per the description.md § 5; H2D and D2H via `cudaMemcpyAsync` on the owned stream; `enqueueV3`; stream sync; per-frame DEBUG log with backbone name and elapsed milliseconds.
- `release_engine`: destroys the exec context, frees GPU buffers, releases the stream; idempotent on a handle that was already released (returns silently).
- `thermal_state()` is delegated to AZ-302's publisher via a constructor-injected `ThermalStatePublisher` reference; this task wires the delegation, AZ-302 owns the polling loop.
- GPU memory budget enforcement: at deserialize time, add the new engine's buffer allocations to the bytes already resident and compare the total against `config.inference.gpu_memory_budget_bytes` (default 4 GB per C7-PT-02). If the budget would be exceeded, refuse to deserialize by raising `OutOfMemoryError` BEFORE allocating, with a message identifying which engine pushed it over.
- Error envelope per the AZ-297 Protocol: `EngineBuildError`, `EngineDeserializeError`, `CalibrationCacheError`, `InferenceError`, `OutOfMemoryError`.
- Diagnostic INFO log on `compile_engine` start/end with elapsed seconds and output path; INFO log on `deserialize_engine` with engine identity + warm-up confirmation; per-frame DEBUG log on `infer` (off by default; enabled by config).
- A standalone CLI entry point `python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile <onnx_path> <build_config_path>` is exposed for C10 CacheProvisioner to invoke pre-flight without holding a runtime instance. The CLI is a thin wrapper around `compile_engine`; it is not a separate compile path.
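The thin-wrapper CLI in the last bullet could be sketched as follows. This is only an illustration: `build_parser`, `main`, and the injected `compile_fn` callable are hypothetical names, and the build config is assumed to be a JSON file passed through as a plain dict rather than the real `BuildConfig` type.

```python
import argparse
import json
from pathlib import Path
from typing import Callable, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Parser for the single `compile` subcommand."""
    parser = argparse.ArgumentParser(prog="tensorrt_runtime")
    sub = parser.add_subparsers(dest="command", required=True)
    compile_cmd = sub.add_parser("compile", help="build an engine pre-flight")
    compile_cmd.add_argument("onnx_path", type=Path)
    compile_cmd.add_argument("build_config_path", type=Path)
    return parser


def main(argv: Sequence[str], compile_fn: Callable[[Path, dict], object]) -> int:
    """Parse arguments, load the build config, and delegate to compile_fn.

    compile_fn stands in for the runtime's compile_engine (assumption), so
    the CLI never becomes a second compile path.
    """
    args = build_parser().parse_args(argv)
    build_config = json.loads(args.build_config_path.read_text(encoding="utf-8"))
    compile_fn(args.onnx_path, build_config)
    return 0
```

Injecting the compile callable keeps the wrapper testable on hosts without TensorRT installed.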
### Excluded
- AZ-301 EngineGate validation logic (filename-schema parse, manifest content-hash check) — this task only INVOKES the gate.
- AZ-302 ThermalState polling thread — this task only delegates to a publisher reference.
- AZ-299 OnnxTrtEpRuntime fallback — separate task.
- AZ-300 PytorchFp16Runtime baseline — separate task.
- C10 CacheProvisioner orchestration of compile_engine — owned by E-C10. This task exposes the API; C10 calls it.
- Engine warm-up beyond the deserialize-side buffer allocation (the warm-up that AC-NEW-1 measures end-to-end is owned by C10's pre-flight orchestration; the per-engine warm-up cost lives here).
- ONNX op unsupported by TRT 10.3 — out of scope to "fix" via plugins; raises `EngineBuildError` and the operator chooses ONNX-RT fallback (AZ-299).
- Multi-stream concurrent execution — out of scope this cycle (description.md notes the F3 hot path is single-threaded; future task if needed).
## Acceptance Criteria
**AC-1: compile_engine produces an FP16 .engine + sidecar at the canonical path**
Given a valid ONNX model and `BuildConfig(precision=Fp16, ...)`
When `compile_engine(model_path, build_config)` runs to completion
Then a `.engine` file exists at the D-C10-7 filename schema path with a matching `.sha256` sidecar from AZ-280; the returned `EngineCacheEntry` carries that path, the sha256, and the `(SM, JP, TRT, precision)` tuple
**AC-2: compile_engine produces an INT8 .engine with a calibration cache + sidecar**
Given a valid ONNX model, a calibration dataset directory with at least 100 images, and `BuildConfig(precision=Int8, calibration_dataset=...)`
When `compile_engine` runs
Then a `.engine` file is produced AND a `.calib_cache` file is produced with a `.calib_cache.sha256` sidecar; the cache header records the calibration-dataset content hash; a second `compile_engine` call with the same dataset reuses the cache (verifiable: the second call is < 30 s, the first is minutes)
**AC-3: stale calibration cache forces rebuild**
Given an existing `.calib_cache` with dataset hash A, and the dataset on disk now hashes to B
When `compile_engine` runs with the same calibration_dataset path
Then the calibrator detects the mismatch, rebuilds the cache from scratch (cache header now records hash B), and writes a new sidecar; the prior cache is overwritten — no silent reuse, no `CalibrationCacheError`
**AC-4: corrupted calibration cache raises CalibrationCacheError**
Given an existing `.calib_cache` whose sidecar `.calib_cache.sha256` does not match the cache file's actual sha256
When `compile_engine` runs
Then `CalibrationCacheError` is raised; no engine is built; the cache file is NOT silently overwritten (operator must explicitly delete or replace)
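The decision flow across AC-2, AC-3, and AC-4 can be sketched as a pure function. This is a minimal sketch under stated assumptions: `decide_cache_action` and `CacheDecision` are hypothetical names, the sidecar is assumed to hold a hex digest as its first token (the real format is defined by AZ-280), a missing sidecar is assumed to count as corruption, and the recorded dataset hash is passed in by the caller rather than read from a specific location.

```python
import hashlib
from enum import Enum
from pathlib import Path
from typing import Optional


class CalibrationCacheError(Exception):
    """Stand-in for the AZ-297 error of the same name."""


class CacheDecision(Enum):
    BUILD = "build"      # no cache on disk yet
    REUSE = "reuse"      # cache file intact and dataset unchanged (AC-2)
    REBUILD = "rebuild"  # cache file intact but dataset changed (AC-3)


def decide_cache_action(cache: Path, dataset_hash: str,
                        recorded_dataset_hash: Optional[str]) -> CacheDecision:
    """Return what compile_engine should do with an existing cache."""
    if not cache.exists():
        return CacheDecision.BUILD
    sidecar = cache.with_name(cache.name + ".sha256")
    # Assumption: a missing sidecar is treated like a corrupt one.
    if not sidecar.exists():
        raise CalibrationCacheError(f"missing sidecar for {cache.name}")
    # Assumption: the sidecar's first whitespace-separated token is the digest.
    tokens = sidecar.read_text(encoding="utf-8").split()
    expected = tokens[0] if tokens else ""
    actual = hashlib.sha256(cache.read_bytes()).hexdigest()
    if expected != actual:
        # AC-4: refuse and leave the cache file untouched.
        raise CalibrationCacheError(f"sidecar mismatch for {cache.name}")
    if recorded_dataset_hash != dataset_hash:
        return CacheDecision.REBUILD
    return CacheDecision.REUSE
```

The integrity check runs before the dataset comparison so that a corrupt cache is never overwritten by the rebuild path.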
**AC-5: deserialize_engine invokes EngineGate before any GPU allocation**
Given an `EngineCacheEntry` whose engine file's filename schema does not match the running Jetson tuple
When `deserialize_engine(entry)` is called
Then `EngineGate.validate(entry)` is invoked first, raises `EngineSchemaMismatchError`; no `IRuntime.deserialize_cuda_engine` call is made; no GPU memory is allocated (verifiable via NVML before/after delta)
**AC-6: deserialize_engine refuses when GPU memory budget would be exceeded**
Given a deserialize that would allocate 1.2 GB and a runtime instance currently holding 3.0 GB resident with budget = 4 GB
When `deserialize_engine` is called
Then `OutOfMemoryError` is raised with the engine name in the message; no partial allocation occurs; the prior 3.0 GB resident state is unchanged
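The ordering that AC-5 and AC-6 require (gate, then budget check, then allocation) might look like the sketch below. `ResidentLedger` and `deserialize_with_gates` are hypothetical names, the `validate` and `allocate` callables stand in for `EngineGate.validate` and the real TRT and CUDA work, and the per-engine `required_bytes` estimate is assumed to be available before any allocation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

GIB = 1024 ** 3


class OutOfMemoryError(Exception):
    """Stand-in for the AZ-297 error of the same name."""


@dataclass
class ResidentLedger:
    """Tracks bytes held by each resident engine of one runtime instance."""
    budget_bytes: int = 4 * GIB
    resident: Dict[str, int] = field(default_factory=dict)

    def total(self) -> int:
        return sum(self.resident.values())


def deserialize_with_gates(ledger: ResidentLedger, engine_name: str,
                           required_bytes: int,
                           validate: Callable[[str], None],
                           allocate: Callable[[str], object]) -> object:
    # 1. Gate first: raises on refusal, before any GPU work.
    validate(engine_name)
    # 2. Budget check before allocation, so a refusal changes nothing.
    projected = ledger.total() + required_bytes
    if projected > ledger.budget_bytes:
        raise OutOfMemoryError(
            f"engine {engine_name!r} would raise resident GPU memory to "
            f"{projected} bytes (budget {ledger.budget_bytes} bytes)"
        )
    # 3. Only now allocate, then record the new resident state.
    handle = allocate(engine_name)
    ledger.resident[engine_name] = required_bytes
    return handle
```

Because the ledger is updated only after `allocate` returns, both a gate refusal and a budget refusal leave the prior resident state unchanged.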
**AC-7: infer round-trips H2D + enqueueV3 + D2H on the owned stream**
Given a deserialised `TrtEngineHandle` and a fixed input dict
When `infer(handle, inputs)` is called and a profiling tool counts CUDA API events
Then the call sequence is: `cudaMemcpyAsync(H2D)` × N inputs, `enqueueV3`, `cudaMemcpyAsync(D2H)` × M outputs, `cudaStreamSynchronize`; output dict has the M expected named tensors
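The call order in AC-7 can be illustrated with a recording stand-in for the CUDA stream. This is a sketch only: `StreamBackend`, `RecordingBackend`, and `infer_sync` are hypothetical names, and the four backend methods stand in for `cudaMemcpyAsync(H2D)`, `enqueueV3`, `cudaMemcpyAsync(D2H)`, and `cudaStreamSynchronize`; nothing here touches a real GPU.

```python
from typing import Dict, List, Protocol


class StreamBackend(Protocol):
    def copy_h2d(self, name: str, host: bytes) -> None: ...
    def enqueue(self) -> None: ...
    def copy_d2h(self, name: str, host_buf: bytearray) -> None: ...
    def synchronize(self) -> None: ...


def infer_sync(backend: StreamBackend, inputs: Dict[str, bytes],
               output_sizes: Dict[str, int]) -> Dict[str, bytes]:
    """H2D for every input, enqueue, D2H for every output, then one sync."""
    for name, host in inputs.items():
        backend.copy_h2d(name, host)
    backend.enqueue()
    host_out = {name: bytearray(size) for name, size in output_sizes.items()}
    for name, buf in host_out.items():
        backend.copy_d2h(name, buf)
    # Output buffers are read only after the stream has been synchronised.
    backend.synchronize()
    return {name: bytes(buf) for name, buf in host_out.items()}


class RecordingBackend:
    """Fake backend that logs call order and fills outputs with 0x01."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def copy_h2d(self, name: str, host: bytes) -> None:
        self.events.append(f"h2d:{name}")

    def enqueue(self) -> None:
        self.events.append("enqueue")

    def copy_d2h(self, name: str, host_buf: bytearray) -> None:
        self.events.append(f"d2h:{name}")
        host_buf[:] = b"\x01" * len(host_buf)

    def synchronize(self) -> None:
        self.events.append("sync")
```

A test in the spirit of the `FakeCudaProfiler` mentioned under Runtime Completeness would assert on `events` after one `infer_sync` call.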
**AC-8: per-model latency p95 budget**
Given the production-default UltraVPR / LightGlue / AdHoP / DISK FP16 engines deserialised on Tier-2 (Jetson Orin Nano Super)
When the C7-PT-01 latency benchmark runs (scripted load matching production)
Then per-model p95 latencies are: UltraVPR ≤ 60 ms, LightGlue ≤ 30 ms, AdHoP ≤ 90 ms, DISK ≤ 50 ms; failure thresholds (100 / 60 / 150 / 90 ms respectively) are NOT crossed
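A budget check over collected latency samples could be sketched like this. The budgets are the AC-8 targets; the nearest-rank percentile method and the names `p95` and `over_budget` are assumptions, since the C7-PT-01 harness may compute percentiles differently.

```python
from typing import Dict, Sequence

# p95 targets from AC-8, in milliseconds.
P95_BUDGET_MS = {"UltraVPR": 60.0, "LightGlue": 30.0, "AdHoP": 90.0, "DISK": 50.0}


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (assumed method)."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def over_budget(latencies_ms: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Return the p95 of every model whose p95 exceeds its AC-8 target."""
    failures = {}
    for model, samples in latencies_ms.items():
        value = p95(samples)
        if value > P95_BUDGET_MS[model]:
            failures[model] = value
    return failures
```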
**AC-9: GPU memory budget compliance**
Given all production-default engines resident concurrently
When the C7-PT-02 memory benchmark runs
Then GPU resident memory across engines ≤ 4 GB (failure threshold 5 GB) and process RAM ≤ 1.5 GB (failure threshold 2 GB)
**AC-10: release_engine fully frees GPU state and is idempotent**
Given a deserialised handle holding 1 GB GPU memory
When `release_engine(handle)` is called once and then again
Then after the first call NVML reports 1 GB freed; after the second call no error is raised and NVML state is unchanged
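Idempotent release as described in AC-10 can be sketched with a released flag on the handle. `SketchHandle` and its `cleanups` list are hypothetical; a real `TrtEngineHandle` would hold device buffers, an execution context, and a stream instead of plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SketchHandle:
    """Hypothetical handle: ordered teardown steps plus a released flag."""
    cleanups: List[Callable[[], None]] = field(default_factory=list)
    released: bool = False


def release_engine(handle: SketchHandle) -> None:
    """Run each teardown step once; later calls return silently."""
    if handle.released:
        return
    handle.released = True
    for cleanup in handle.cleanups:
        cleanup()
    handle.cleanups.clear()
```

Setting the flag before running the teardown steps means a failure part-way through is not retried on the next call; whether that is the desired behaviour is a design choice this sketch does not settle.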
## Non-Functional Requirements
**Performance**
- Per-model p95 latencies as in AC-8.
- GPU memory ≤ 4 GB resident, RAM ≤ 1.5 GB process — AC-9 / NFT-LIM-01 / NFT-LIM-04.
- `compile_engine` typically completes in under a minute for FP16 and takes minutes for INT8 (calibration dominates); both are bounded by hardware, not by a latency budget.
- `deserialize_engine` p95 ≤ 5 s per engine on Tier-2 (engineering sanity bound; AC-NEW-1 cold-start budget is end-to-end and is asserted by C7-IT-01 in the test phase).
**Compatibility**
- TensorRT 10.3 only this cycle (per D-C7-9 JetPack 6.2 lock). No 9.x / 8.x compatibility shims.
- Built on Polygraphy + trtexec + IBuilderConfig — three TRT-supported entry points; no custom TRT builder calls outside that surface.
**Reliability**
- All errors are caught and rewrapped into the AZ-297 family. No raw `RuntimeError` or TRT-specific exception leaks to consumers.
- `infer` NEVER blocks on the producer side; sync stream execution is on the consumer's calling thread.
- INT8 calibration cache trust is the lurking foot-gun (per description.md § 7); the AC-2 / AC-3 / AC-4 flow is the only protection. A new strategy or refactor MUST preserve these three.
**Concurrency**
- One CUDA stream per `TensorrtRuntime` instance. The F3 hot path is single-threaded (per description.md); concurrent `infer` calls from different threads are NOT supported this cycle (would corrupt the stream). Documented as a constraint.
- Compile path (`compile_engine`) is allowed to run while a runtime instance is also serving `infer` calls only if the compile is in a separate process (the C10 CLI entry point); same-process concurrent compile + infer is NOT supported.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | compile FP16 with a tiny ONNX (sanity) | Engine + sidecar produced at canonical path; EngineCacheEntry shape correct |
| AC-2 | compile INT8 with a 100-image calibration dataset; rerun | Cache + sidecar produced; rerun reuses cache (under 30 s); cache header has dataset hash |
| AC-3 | mutate calibration dataset; rerun compile | Cache rebuilt; new dataset hash in header |
| AC-4 | corrupt calibration sidecar; rerun compile | `CalibrationCacheError`; engine NOT built |
| AC-5 | deserialise an engine with a mismatched filename SM | `EngineSchemaMismatchError` from gate; no GPU allocation (NVML diff = 0) |
| AC-6 | deserialise that would push past budget | `OutOfMemoryError`; prior resident state unchanged |
| AC-7 | infer with a profiler attached | Exact CUDA call sequence: N H2D, enqueueV3, M D2H, stream sync |
| AC-8 | C7-PT-01 microbench against the four production engines | p95 within budgets per the table |
| AC-9 | C7-PT-02 memory test with all engines resident | ≤ 4 GB GPU, ≤ 1.5 GB RAM |
| AC-10 | release_engine called twice | First call frees memory; second is a no-op |
| NFR-perf-deserialize | Microbench deserialise per engine | p95 ≤ 5 s on Tier-2 |
| NFR-reliability-error-rewrap | Inject a TRT C++ exception via mock | Rewrapped into the AZ-297 family; original message preserved |
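The error-rewrap row could be exercised against a decorator along these lines. This is a sketch under assumptions: the common base class `InferenceRuntimeError` and the `rewrap` decorator are hypothetical, and the AZ-297 Protocol may structure its error family differently.

```python
import functools
from typing import Callable, Type


class InferenceRuntimeError(Exception):
    """Hypothetical common base for the AZ-297 error family."""


class InferenceError(InferenceRuntimeError):
    """Stand-in for the AZ-297 error of the same name."""


def rewrap(target: Type[InferenceRuntimeError]) -> Callable:
    """Convert any foreign exception into `target`, keeping the message."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except InferenceRuntimeError:
                raise  # already in the family; pass through unchanged
            except Exception as exc:
                # `from exc` keeps the original exception as __cause__.
                raise target(str(exc)) from exc
        return wrapper
    return decorator
```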
## Constraints
- TensorRT 10.3 + JetPack 6.2 lock per D-C7-9.
- Polygraphy + trtexec + IBuilderConfig only — no custom TRT builder API outside this surface.
- `helpers.sha256_sidecar` (AZ-280) for atomic write + sidecar pattern.
- `helpers.engine_filename_schema` (AZ-281) for the D-C10-7 filename schema (consumed at deserialise via the gate).
- The `EngineHandle` subclass `TrtEngineHandle` is opaque to consumers; consumers MUST NOT introspect its fields. Implementation may add diagnostic fields (e.g., `last_infer_elapsed_ms`) but they are NOT part of the AZ-297 Protocol.
- The CLI entry point `python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile ...` is for C10 to invoke; not part of any consumer's public API.
- This task introduces no new third-party dependencies beyond TensorRT 10.3, Polygraphy, and the trtexec binary that ships with TRT.
- Per-frame DEBUG logging defaults to OFF (would flood at 39 Hz aggregate per description.md); enabled only via `config.inference.per_frame_debug_log = true`.
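The inference config fields referenced in this document could be validated roughly as below. This is a hedged sketch: the class name `C7InferenceConfigSketch`, the field types, and the specific validation rules are assumptions; only the field names and defaults (4 GiB budget, 600 s timeout, per-frame DEBUG logging off) come from the commit message and the constraints above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class C7InferenceConfigSketch:
    """Hypothetical stand-in for the inference section of the config."""
    gpu_memory_budget_bytes: int = 4 * 1024 ** 3
    trtexec_timeout_s: float = 600.0
    per_frame_debug_log: bool = False

    def __post_init__(self) -> None:
        # Reject values that would make the budget gate or timeout meaningless.
        if self.gpu_memory_budget_bytes <= 0:
            raise ValueError("gpu_memory_budget_bytes must be positive")
        if self.trtexec_timeout_s <= 0:
            raise ValueError("trtexec_timeout_s must be positive")
```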
## Risks & Mitigation
**Risk 1: ONNX op unsupported by TRT 10.3**
- *Risk*: A backbone author exports an op TRT cannot lower; `compile_engine` raises `EngineBuildError` and the operator has nothing to fall back to.
- *Mitigation*: `EngineBuildError` is surfaced to the operator pre-flight (per description.md error-handling spec); the operator switches the runtime config to `onnx_trt_ep` (AZ-299) which has wider op coverage. Documented in the operator playbook (out of scope here).
**Risk 2: INT8 calibration silently uses a stale cache**
- *Risk*: AC-3 / AC-4 fail in a corner case (dataset mutated atomically, hash check races with compile start).
- *Mitigation*: The calibration-dataset hash is computed at the START of `compile_engine` and compared to the cache header's hash; the dataset is treated as immutable for the duration of the call. AC-3 + AC-4 cover the obvious cases; the corner case requires a separate test where the dataset is replaced mid-compile, which is operator error and out of scope.
**Risk 3: GPU memory budget is exceeded under sustained use due to TRT internal scratch**
- *Risk*: `deserialize_engine` reports buffer size based on `opt_shape`, but TRT may allocate additional internal scratch beyond what is reported.
- *Mitigation*: AC-9 measures actual NVML resident memory, not reported buffer size; a regression here fails the test. The budget includes a safety margin (4 GB target with 5 GB hard fail) that absorbs typical TRT scratch.
**Risk 4: trtexec subprocess hangs on a malformed ONNX**
- *Risk*: trtexec can hang silently on certain malformed inputs.
- *Mitigation*: The trtexec invocation has a config-driven timeout (default 600 s = 10 minutes); on timeout, the subprocess is killed and `EngineBuildError("trtexec timeout after 600s")` is raised. The IBuilderConfig direct path is preferred for INT8 anyway; trtexec is mainly for FP16 build profiling.
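The timeout mitigation can be sketched with `subprocess.run`, which kills the child and waits for it when the timeout expires. `run_build_command` is a hypothetical helper, and the command line is supplied by the caller, so no trtexec flags are assumed here.

```python
import subprocess
from typing import Sequence


class EngineBuildError(Exception):
    """Stand-in for the AZ-297 error of the same name."""


def run_build_command(cmd: Sequence[str],
                      timeout_s: float = 600.0) -> subprocess.CompletedProcess:
    """Run a build subprocess; map timeout or failure to EngineBuildError."""
    try:
        return subprocess.run(
            list(cmd), capture_output=True, text=True,
            timeout=timeout_s, check=True,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run has already killed and reaped the child here.
        raise EngineBuildError(f"trtexec timeout after {timeout_s:g}s") from exc
    except subprocess.CalledProcessError as exc:
        raise EngineBuildError(
            f"trtexec exited with code {exc.returncode}: {exc.stderr.strip()}"
        ) from exc
```

With the default of 600 s the timeout message reads `trtexec timeout after 600s`, matching the wording in the mitigation above.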
## Runtime Completeness
- **Named capability**: TensorRT 10.3 production runtime + Polygraphy / trtexec / IBuilderConfig hybrid + INT8 calibration cache trust + GPU memory budget enforcement (architecture / E-C7 / D-C7-9 / NFT-PERF-01 / NFT-LIM-01).
- **Production code that must exist**: real `TensorrtRuntime` class implementing the AZ-297 Protocol; real Polygraphy + trtexec + IBuilderConfig compile path; real `IRuntime.deserialize_cuda_engine` + `IExecutionContext` + GPU buffer allocations; real `enqueueV3` + sync-stream execution; real INT8 calibrator with dataset-hash cache trust; real GPU memory budget check at deserialise.
- **Allowed external stubs**: tests MAY substitute a `FakeCudaProfiler` to count CUDA events (AC-7); production wiring uses real TRT + real CUDA. C10 CacheProvisioner is the production caller of `compile_engine`; tests MAY drive the CLI directly.
- **Unacceptable substitutes**: a Python-level fake "engine" that bypasses TRT (would defeat the whole point), a calibration cache "always trusted" path (would break D-C10-6), an `infer` that uses `torch.compile` instead of `enqueueV3` (that's AZ-300 PyTorch baseline territory), or an in-memory engine that skips the file-on-disk path (would break the engine cache lifecycle that C10 + F2 takeoff load depend on).