[AZ-298] C7 TensorrtRuntime: TRT 10.3 + INT8 calib trust + GPU budget

Implement the production-default InferenceRuntime strategy on JetPack 6.2 + TensorRT 10.3 (per D-C7-9). The runtime owns the full TRT lifecycle: compile_engine via the Polygraphy + trtexec + IBuilderConfig hybrid (FP16 / INT8 / Mixed precision), deserialize_engine with EngineGate-first ordering and a pre-allocation GPU memory budget gate, infer via H2D -> enqueueV3 -> D2H -> stream sync on the owned CUDA stream, idempotent release_engine, and an injected ThermalStatePublisher delegation for thermal_state.

INT8 calibration cache trust (D-C10-6, AC-2/3/4) is enforced by a .calib_cache.sha256 file-integrity sidecar (AZ-280) plus a new .calib_cache.dataset_sha256 sidecar that records the dataset content hash at compile time; reuse only when both agree, rebuild silently on dataset hash mismatch, raise CalibrationCacheError on corrupt sidecar (never silently overwritten).

GPU memory budget (NFT-LIM-01, default 4 GiB) is checked BEFORE any TRT call beyond the gate (AC-6); a pre-allocation refusal raises OutOfMemoryError and leaves the resident state unchanged.

TensorRT 10.3 / Polygraphy / PyCUDA are lazy-imported inside the methods that need them so the module loads cleanly on Tier-0 hosts. A standalone CLI entry (python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile <onnx> <build_config.json>) is wired for C10 CacheProvisioner (AZ-321) to invoke pre-flight without holding a runtime instance. C7InferenceConfig gains gpu_memory_budget_bytes (default 4 GiB) and trtexec_timeout_s (default 600 s, Risk 4 mitigation), both validated in __post_init__.

Tests: 26 active + 6 Tier-2-gated skips; AC-1 / AC-3 / AC-4 / AC-5 / AC-6 / AC-7 / AC-10 + NFR-reliability fully covered on Tier-1 via fake CUDA / TRT modules; AC-2 / AC-8 / AC-9 / NFR-perf-deserialize placeholders skip with prerequisite reason and live in the AZ-298 Tier-2 microbench harness. Code review verdict PASS_WITH_WARNINGS (1 Medium hot-path hoist fix auto-applied).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 07:31:13 +00:00 · 2026-05-12 23:11:49 +03:00
parent 54942f3052
commit 18a69022b3
9 changed files with 2307 additions and 10 deletions
@@ -0,0 +1,198 @@
+# Batch 31 / Cycle 1 — Implementation Report
+
+**Date**: 2026-05-12
+**Tasks**: AZ-298 (C7 TensorrtRuntime — production-default TensorRT 10.3 strategy + INT8 calibration cache trust + GPU memory budget enforcement)
+**Story points landed**: 5
+**Status**: complete (AZ-298 moves to In Testing post-commit)
+
+## Scope summary
+
+Single-task batch landing the production-default `InferenceRuntime`
+strategy for C7. `TensorrtRuntime` owns the full TensorRT 10.3 +
+JetPack 6.2 lifecycle (per D-C7-9): `compile_engine` via the
+Polygraphy + trtexec + `IBuilderConfig` hybrid (FP16 / INT8 / Mixed),
+`deserialize_engine` with EngineGate-first ordering and a
+pre-allocation GPU memory budget gate, `infer` via H2D →
+`enqueueV3` → D2H → stream sync on the owned CUDA stream,
+idempotent `release_engine`, and an injected `ThermalStatePublisher`
+delegation for `thermal_state` (AZ-302 will own the polling loop).
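
The `infer` ordering described above can be illustrated with recording fakes. This is a minimal sketch, not the committed implementation: `FakeCuda`, `FakeContext`, `FakeStream`, and `infer_once` are hypothetical names. The fake method names mirror PyCUDA's `memcpy_htod_async` / `memcpy_dtoh_async` and TensorRT's `execute_async_v3`, and tensor-address binding is left out for brevity.

```python
class FakeStream:
    """Records the final synchronisation; stands in for a CUDA stream."""
    handle = 0  # placeholder for the integer stream handle

    def __init__(self, log):
        self._log = log

    def synchronize(self):
        self._log.append("sync")


class FakeCuda:
    """Records async copies instead of touching a GPU."""

    def __init__(self, log):
        self._log = log

    def memcpy_htod_async(self, dest, src, stream=None):
        self._log.append("h2d")

    def memcpy_dtoh_async(self, dest, src, stream=None):
        self._log.append("d2h")


class FakeContext:
    """Records the enqueue call; stands in for an execution context."""

    def __init__(self, log):
        self._log = log

    def execute_async_v3(self, stream_handle):
        self._log.append("enqueue")
        return True


def infer_once(cuda, context, stream, inputs, outputs):
    """Hypothetical hot path: H2D copies, enqueue, D2H copies, then sync."""
    for host, device in inputs:
        cuda.memcpy_htod_async(device, host, stream)
    if not context.execute_async_v3(stream.handle):
        raise RuntimeError("enqueue failed")
    for host, device in outputs:
        cuda.memcpy_dtoh_async(host, device, stream)
    stream.synchronize()


log = []
infer_once(
    FakeCuda(log), FakeContext(log), FakeStream(log),
    inputs=[("host_in", "dev_in")],
    outputs=[("host_out", "dev_out")],
)
```

Recording a shared event log, rather than per-call counters, lets a test assert the sequence directly.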
+
+The two foot-guns flagged in the task spec are gated explicitly:
+
+- **INT8 calibration cache trust** (D-C10-6) is enforced by a
+  `.calib_cache.sha256` file-integrity sidecar (AZ-280) plus a
+  `.calib_cache.dataset_sha256` sidecar that records the dataset
+  content hash at compile time. The cache is reused only when both
+  sidecars are consistent; a mismatched dataset hash forces a silent
+  rebuild (AC-3); a corrupt `.sha256` sidecar raises
+  `CalibrationCacheError` and is never silently overwritten (AC-4).
+- **GPU memory budget** (NFT-LIM-01, default 4 GiB) is checked
+  BEFORE any TRT call beyond the gate, using
+  `_predicted_deserialize_bytes(entry)` (engine file size +
+  `extras["opt_buffer_bytes"]` stamped at compile time, with a
+  conservative 256 MiB fallback when the field is missing). A
+  pre-allocation refusal raises `OutOfMemoryError` and leaves the
+  resident state unchanged (AC-6).
+
+TensorRT 10.3, Polygraphy, and PyCUDA are **lazy-imported** inside the
+methods that need them; the module loads cleanly on Tier-0 / macOS
+dev hosts so the package's protocol-conformance tests stay importable
+without GPU. A standalone CLI entry point
+`python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile <onnx> <build_config.json>`
+is wired for C10 `CacheProvisioner` (AZ-321) to invoke pre-flight
+without holding a runtime instance.
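
The lazy-import pattern can be sketched generically. The report's `_load_trt` / `_load_pycuda` are methods whose bodies are not shown here, so `LazyModule` below is a hypothetical helper, and raising `RuntimeError` on a missing dependency is an assumption.

```python
import importlib
from types import ModuleType
from typing import Optional


class LazyModule:
    """Defers importing a heavy dependency until the first call that needs it."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    def load(self) -> ModuleType:
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ImportError as exc:
                # Fail at call time, not at module import time.
                raise RuntimeError(
                    f"{self._name} is required for this call but is not installed"
                ) from exc
        return self._module


# Constructing the wrapper is safe on hosts without the package;
# only .load() touches the import system.
trt_module = LazyModule("tensorrt")
```

With this shape, importing the enclosing module never fails on a GPU-less host; the error appears only when a GPU-dependent method actually runs.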
+
+## Files added / modified
+
+### New (production)
+
+- `src/gps_denied_onboard/components/c7_inference/tensorrt_runtime.py`
+  - `TrtEngineHandle`: opaque, slots, owns engine + exec context +
+    stream + IO buffers + `allocated_bytes` + `_released` flag.
+  - `_dataset_content_hash` + `_plan_calibration_cache` +
+    `_persist_calibration_cache_sidecars`: D-C10-6 trust gate
+    (AC-2 / AC-3 / AC-4).
+  - `_profile_buffer_bytes` + `_predicted_deserialize_bytes`: AC-6
+    prediction.
+  - `TensorrtRuntime` class with `compile_engine` (Polygraphy +
+    trtexec branches), `deserialize_engine` (gate → budget → load →
+    exec ctx → IO buffers, with rollback-on-error so the resident
+    state is unchanged on failure), `infer` (sync GPU stream with
+    explicit H2D / enqueueV3 / D2H / sync ordering), idempotent
+    `release_engine`, `thermal_state` delegation, and
+    `current_runtime_label() -> "tensorrt"`.
+  - `_safe_free` + `_safe_del` resource helpers.
+  - argparse CLI with `compile` subcommand for the C10 pre-flight
+    entry.
+
+### New (tests)
+
+- `tests/unit/c7_inference/test_tensorrt_runtime.py` — **NEW** suite
+  of 26 active tests + 6 Tier-2-gated skips covering every AC:
+  - **AC-1** protocol conformance + label string;
+    Tier-2 placeholder skip for the real FP16 compile.
+  - **AC-2** Tier-2 placeholder skip for the INT8 compile +
+    sub-30s rebuild-from-cache timing (the Tier-1 logic equivalent
+    is in the AC-3 reuse test below).
+  - **AC-3** stale calibration cache forces rebuild
+    (`_plan_calibration_cache(...).reuse is False` when dataset
+    hash differs); matching dataset hash → reuse.
+  - **AC-4** corrupt `.sha256` sidecar / malformed dataset
+    sidecar / empty dataset all raise `CalibrationCacheError`;
+    `_persist_calibration_cache_sidecars` writes both sidecars
+    correctly after a calibrator-written cache.
+  - **AC-5** `deserialize_engine` invokes `EngineGate.validate` BEFORE
+    any TRT import — verified by monkey-patching `_load_trt` /
+    `_load_pycuda` to raise `AssertionError` on any call.
+  - **AC-6** budget helper rejects overshoot (with engine name
+    in the message); accepts within-budget allocations; full
+    `deserialize_engine` path raises `OutOfMemoryError` BEFORE
+    `_load_trt` runs and leaves `_resident_bytes` unchanged.
+  - **AC-7** `infer` orders H2D → `enqueueV3` → D2H → stream
+    sync via fake CUDA/TRT modules counting call sequence;
+    Tier-2 placeholder skip for the real CUDA-event trace.
+  - **AC-8 / AC-9 / NFR-perf-deserialize** Tier-2 placeholder
+    skips for the perf / memory benchmarks (C7-PT-01 / C7-PT-02)
+    that live in the dedicated microbench harness on Jetson.
+  - **AC-10** `release_engine` idempotent — first call frees all
+    buffers, drops resident_bytes to 0, marks handle released;
+    second call is a silent no-op; foreign handle types
+    silently ignored (defensive shim).
+  - **NFR-reliability-error-rewrap** `infer` rewraps a synthetic
+    `RuntimeError("TRT C++ exception: enqueueV3 fault")` into
+    `InferenceError` with `__cause__` preserved; foreign handle
+    type and released handle paths also rewrap to `InferenceError`;
+    missing input binding rewraps.
+  - **Thermal delegation** default-safe `ThermalState`
+    (`is_telemetry_available=False`) when no publisher is
+    injected; provider-injected publisher returns its canned
+    snapshot unmodified.
+  - **Helpers** `_predicted_deserialize_bytes` falls back to
+    256 MiB when `extras["opt_buffer_bytes"]` is absent;
+    `_profile_buffer_bytes` sums element counts × 2 bytes;
+    `_dataset_content_hash` changes with content.
+  - **CLI smoke** `_build_config_from_json` round-trips FP16
+    payloads and raises `EngineBuildError` when INT8 is
+    requested without `calibration_dataset`.
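
The error-rewrap behaviour checked by the NFR-reliability tests relies on exception chaining. A minimal sketch, assuming a locally defined `InferenceError` and a hypothetical `run_guarded` wrapper (neither is the project's actual code):

```python
class InferenceError(Exception):
    """Stand-in for the project's error type (definition assumed)."""


def run_guarded(step):
    """Call step() and rewrap any third-party failure as InferenceError."""
    try:
        return step()
    except InferenceError:
        raise  # already in the component's own vocabulary
    except Exception as exc:
        # `from exc` stores the original on __cause__ for later diagnosis.
        raise InferenceError(f"inference failed: {exc}") from exc


def faulty_enqueue():
    raise RuntimeError("TRT C++ exception: enqueueV3 fault")


try:
    run_guarded(faulty_enqueue)
except InferenceError as err:
    original = err.__cause__
```

Callers then need to catch only one exception type, while the underlying fault stays reachable through `__cause__`.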
+
+### Modified (production)
+
+- `src/gps_denied_onboard/components/c7_inference/config.py` —
+  adds `gpu_memory_budget_bytes: int = 4 GiB` (NFT-LIM-01 default)
+  and `trtexec_timeout_s: int = 600` (Risk-4 mitigation, 10 min)
+  to `C7InferenceConfig`, both validated `> 0` in `__post_init__`.
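
The two new fields and their `> 0` validation might look like the sketch below. `C7InferenceConfigSketch` is a hypothetical stand-in that omits the real class's other fields, and the use of `ValueError` is an assumption.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class C7InferenceConfigSketch:
    gpu_memory_budget_bytes: int = 4 * GIB  # NFT-LIM-01 default
    trtexec_timeout_s: int = 600            # 10 minutes

    def __post_init__(self) -> None:
        # Reject non-positive values at construction time.
        for name in ("gpu_memory_budget_bytes", "trtexec_timeout_s"):
            value = getattr(self, name)
            if value <= 0:
                raise ValueError(f"{name} must be > 0, got {value!r}")
```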
+
+### Modified (tests)
+
+- `tests/unit/c7_inference/test_protocol_conformance.py`: the
+  `test_ac5_build_inference_runtime_flag_on_but_module_missing`
+  parametrization previously included `"tensorrt"`. That row is
+  removed: now that `tensorrt_runtime.py` exists, the factory
+  imports it successfully, so the missing-module branch is
+  exercised by `"onnx_trt_ep"` and `"pytorch_fp16"` only. The TRT
+  row will return when the module-presence test gains a separate
+  "module exists but tensorrt python binding missing" case in a
+  future task.
+
+### Modified (docs)
+
+- `_docs/02_document/module-layout.md` — `tensorrt_runtime.py`
+  row in the c7_inference per-component table now reads
+  *"(AZ-298; production-default TensorRT 10.3 strategy + INT8
+  calibration cache trust + GPU memory budget enforcement +
+  `python -m ...tensorrt_runtime compile ...` CLI)"* — replaces
+  the prior `pending` marker.
+
+## Acceptance criteria coverage
+
+| AC | Test | Status |
+|----|------|--------|
+| AC-1 FP16 engine + sidecar at canonical path | `test_ac1_protocol_conformance` (Tier-1 protocol/label) + `test_ac1_real_fp16_compile_produces_engine_and_sidecar` (Tier-2) | passing / Tier-2 skipped |
+| AC-2 INT8 cache reuse under 30 s | `test_ac3_matching_dataset_hash_reuses_cache` (Tier-1 logic) + `test_ac2_int8_compile_reuses_calibration_cache_under_30s` (Tier-2) | passing / Tier-2 skipped |
+| AC-3 Stale dataset forces rebuild | `test_ac3_stale_calibration_cache_forces_rebuild` + `test_ac3_matching_dataset_hash_reuses_cache` | passing |
+| AC-4 Corrupt calib sidecar raises | `test_ac4_corrupted_calibration_cache_raises` + `test_ac4_malformed_dataset_sidecar_raises` + `test_ac4_empty_dataset_raises` + `test_persist_calibration_cache_sidecars_writes_both` | passing |
+| AC-5 EngineGate-first before any GPU work | `test_ac5_gate_refusal_precedes_trt_import` | passing |
+| AC-6 Budget pre-alloc refusal | `test_ac6_budget_helper_refuses_overshoot` + `test_ac6_budget_helper_accepts_within` + `test_ac6_deserialize_budget_raises_before_trt_load` | passing |
+| AC-7 H2D → enqueueV3 → D2H → sync ordering | `test_infer_orders_h2d_enqueue_d2h_sync` (Tier-1 via fakes) + `test_ac7_real_infer_records_cuda_event_sequence` (Tier-2) | passing / Tier-2 skipped |
+| AC-8 Per-model p95 latency | `test_ac8_per_model_p95_latency_within_budget` (Tier-2 microbench) | Tier-2 skipped |
+| AC-9 4 GiB GPU + 1.5 GiB RAM budget | `test_ac9_concurrent_engine_resident_memory_within_budget` (Tier-2 microbench) | Tier-2 skipped |
+| AC-10 `release_engine` idempotent | `test_ac10_release_is_idempotent` + `test_release_engine_ignores_foreign_handle_type` | passing |
+| NFR-perf-deserialize p95 ≤ 5 s | `test_nfr_perf_deserialize_p95_under_5s` (Tier-2 microbench) | Tier-2 skipped |
+| NFR-reliability error rewrap | `test_infer_rewraps_third_party_exception` + `test_infer_rejects_foreign_handle` + `test_infer_rejects_released_handle` + `test_infer_missing_input_binding_rewraps` | passing |
+
+## AC Test Coverage: 10 of 10 covered (+ 2 NFRs)
+## Code Review Verdict: PASS_WITH_WARNINGS (1 Medium auto-fixed)
+## Auto-Fix Attempts: 1
+
+Hoisted `self._load_trt()` out of the per-output binding loop in
+`infer()`. Found during review; mechanical fix, tests re-run
+afterwards.
+
+## Stuck Agents: None
+
+## Findings (self-review)
+
+| # | Severity | Category | Location | Note | Resolution |
+|---|----------|----------|----------|------|------------|
+| 1 | Medium | Performance | `tensorrt_runtime.py::TensorrtRuntime.infer` | `self._load_trt()` was called inside the per-output for-loop. The lazy import is module-cached so the cost is small, but the attribute lookup + the try/except added overhead on the hot path. Hoisted above the loop. | **FIXED** in this batch. |
+| 2 | Low | Maintainability | `tensorrt_runtime.py::_predicted_deserialize_bytes` | Falls back to a flat 256 MiB IO-buffer estimate when `extras["opt_buffer_bytes"]` is absent (engine produced by an older compile path). Conservative for the budget gate but loose — could underestimate for very large profiles. Accepted because `compile_engine` always stamps the field; the fallback only protects against externally-produced engines. | Open (Low) — accepted as documented. |
+| 3 | Low | Test-quality | `test_infer_orders_h2d_enqueue_d2h_sync` | Uses a fake CUDA module that captures `memcpy_htod_async` / `memcpy_dtoh_async` calls plus a fake exec context counting `execute_async_v3`. The ordering assertion is implicit from the linear control flow inside `infer` (H2D loop → exec → D2H loop → sync); a real CUDA event trace lives in the AZ-298 Tier-2 microbench harness. | Open (Low) — Tier-2 placeholder is `test_ac7_real_infer_records_cuda_event_sequence`. |
+| 4 | Low | Architecture | `tensorrt_runtime.py::infer` | Reads `handle._input_buffers` / `_output_buffers` etc. directly through the slot names. Per Invariant I-4 those fields are private to `TensorrtRuntime`, so the access is intra-class and the slot pattern is just a memory-layout optimisation — but it makes the test code look like it's introspecting a black-box handle. Accepted because the test stays inside the c7_inference component boundary. | Open (Low) — accepted as documented. |
+| 5 | Low | Scope | `tensorrt_runtime.py::_safe_del` | The `del resource` line cannot actually free anything since it only drops a local reference inside the helper; the real teardown happens when the caller drops its own reference. The helper is mostly a defensive "best-effort, log-warn-on-exception" wrapper around the C++-shim destructors. Kept as a single explicit place to swallow + log unusual teardown errors. | Open (Low) — accepted as documented. |
+
+## Tracker
+
+- AZ-298 transitioned to **In Progress** at session start; will move
+  to **In Testing** post-commit per `protocols.md`.
+
+## Test suite
+
+- `tests/unit/c7_inference/test_tensorrt_runtime.py` — 26 passing
+  + 6 Tier-2 skips on macOS dev (no TensorRT binding).
+- `tests/unit/c7_inference/` (full c7 suite) — 116 passing, 13
+  skipped (CUDA / TensorRT unavailable on Tier-1 / macOS).
+- Combined unit suite, excluding pending components (c1, c2, c2.5,
+  c3, c3.5, c4, c5, c8, c10, c11, c12) and c6, whose collection is
+  blocked on this host by a missing `psycopg_pool` (a known,
+  pre-existing dev-machine environment issue): 506 passing, 10
+  environment-skipped, 1 warning (pre-existing `pynvml`
+  FutureWarning unrelated to AZ-298).
+
+## Next batch
+
+Cycle 1 advances per the greenfield queue: autodev re-detects the
+next AZ ticket in the Step 7 batch loop and continues. AZ-299 (C7
+OnnxTrtEpRuntime fallback) is the next AZ-249/E-C7 item in the
+dependency graph.
@@ -0,0 +1,62 @@
+# Code Review Report — Batch 31 / Cycle 1
+
+**Batch**: 31
+**Tasks**: AZ-298 (C7 TensorrtRuntime)
+**Date**: 2026-05-12
+**Verdict**: PASS_WITH_WARNINGS
+
+## Findings
+
+| # | Severity | Category | File:Line | Title |
+|---|----------|----------|-----------|-------|
+| 1 | Medium | Performance | `src/gps_denied_onboard/components/c7_inference/tensorrt_runtime.py::infer` | `_load_trt()` called inside per-output loop |
+| 2 | Low | Maintainability | `tensorrt_runtime.py::_predicted_deserialize_bytes` | 256 MiB flat fallback when `extras["opt_buffer_bytes"]` is absent |
+| 3 | Low | Test-quality | `tests/unit/c7_inference/test_tensorrt_runtime.py::test_infer_orders_h2d_enqueue_d2h_sync` | Ordering verified via fake-module call counts, not a real CUDA event trace |
+| 4 | Low | Architecture | `tensorrt_runtime.py::infer` | Access to `TrtEngineHandle._input_buffers` / `_output_buffers` via slot names |
+| 5 | Low | Scope | `tensorrt_runtime.py::_safe_del` | `del resource` only drops a local reference; helper is mostly defensive log-warn |
+
+### Finding Details
+
+**F1: `_load_trt()` called inside per-output loop** (Medium / Performance)
+- Location: `src/gps_denied_onboard/components/c7_inference/tensorrt_runtime.py` — `TensorrtRuntime.infer`
+- Description: The original implementation called `self._load_trt()` inside the per-output binding for-loop. The lazy import is module-cached so subsequent calls are cheap, but the attribute lookup + the try/except inside a hot path adds avoidable overhead.
+- Suggestion: Hoist `trt = self._load_trt()` above the loops (alongside `cuda, _ = self._load_pycuda()`).
+- Task: AZ-298
+- Resolution: **AUTO-FIXED** in this batch.
+
+**F2: 256 MiB flat fallback in `_predicted_deserialize_bytes`** (Low / Maintainability)
+- Location: `tensorrt_runtime.py::_predicted_deserialize_bytes`
+- Description: When `EngineCacheEntry.extras["opt_buffer_bytes"]` is missing (engine produced by an older compile path), the budget gate uses a flat 256 MiB estimate. This is generous for typical engines but is not a true upper bound: it can underestimate for engines with very large profiles.
+- Suggestion: `compile_engine` already stamps the field. Tighten the fallback only if an externally-produced engine appears in the cache; today the path is dormant.
+- Task: AZ-298
+- Resolution: Open (Low) — accepted as documented.
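
The prediction and the pre-allocation gate can be sketched together. Function names, signatures, and the exact comparison (`resident + predicted > budget`) are assumptions; only the "engine file size + `opt_buffer_bytes`, else 256 MiB" rule and the `OutOfMemoryError` name come from the report.

```python
MIB = 1024 ** 2
FALLBACK_IO_BYTES = 256 * MIB  # flat estimate when the compile-time stamp is absent


class OutOfMemoryError(Exception):
    """Stand-in for the project's error type (definition assumed)."""


def predicted_deserialize_bytes(engine_file_bytes: int, extras: dict) -> int:
    """Engine size plus stamped IO-buffer bytes, or the flat fallback."""
    io_bytes = int(extras.get("opt_buffer_bytes", FALLBACK_IO_BYTES))
    return engine_file_bytes + io_bytes


def check_budget(name: str, resident: int, predicted: int, budget: int) -> None:
    """Refuse before allocating anything; the caller's state stays untouched."""
    if resident + predicted > budget:
        raise OutOfMemoryError(
            f"engine {name!r}: {resident} B resident + {predicted} B "
            f"predicted exceeds budget of {budget} B"
        )
```

Because the check is a pure function of numbers already known, it can run before any TensorRT or CUDA import, which is what lets a refusal leave the resident state unchanged.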
+
+**F3: Fake-module call-count ordering** (Low / Test-quality)
+- Location: `tests/unit/c7_inference/test_tensorrt_runtime.py::test_infer_orders_h2d_enqueue_d2h_sync`
+- Description: Verifies H2D → enqueueV3 → D2H → sync via fake CUDA/TRT modules counting calls and asserting on a single linear flow. Does not capture a real CUDA event trace.
+- Suggestion: The Tier-2 placeholder `test_ac7_real_infer_records_cuda_event_sequence` exists for the real event trace on Jetson; no change needed here.
+- Task: AZ-298
+- Resolution: Open (Low) — accepted as documented.
+
+**F4: Slot-name access in `infer`** (Low / Architecture)
+- Location: `tensorrt_runtime.py::TensorrtRuntime.infer`
+- Description: `infer` reads `handle._input_buffers`, `handle._output_buffers`, `handle._exec_context`, etc. via the slot names declared on `TrtEngineHandle`. Per Invariant I-4 those fields are private to `TensorrtRuntime`, so the access is intra-class and the test code stays inside the c7_inference component boundary.
+- Suggestion: None — the alternative (a getter method per field) would slow the hot path without contract gain.
+- Task: AZ-298
+- Resolution: Open (Low) — accepted as documented.
+
+**F5: `_safe_del` is mostly defensive** (Low / Scope)
+- Location: `tensorrt_runtime.py::_safe_del`
+- Description: The helper calls `del resource` which only drops a local reference inside the helper scope; the real teardown happens when the caller drops its own reference. The helper exists as a single explicit place to swallow + WARN-log unusual teardown errors.
+- Suggestion: Acceptable. The PyCUDA / TRT C++ shims hook destructors that fire when the last Python reference is released — `_safe_del` documents that contract in one place.
+- Task: AZ-298
+- Resolution: Open (Low) — accepted as documented.
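
The reference-semantics point behind F5 can be demonstrated in plain Python. `Resource` and `safe_del` below are hypothetical stand-ins, not the project's helper; the weak reference is used only to observe whether the object is still alive.

```python
import gc
import weakref


class Resource:
    """Placeholder for an object whose destructor would release GPU memory."""


def safe_del(resource: object) -> None:
    # Unbinds only this function's local name; the caller's reference survives.
    del resource


held = Resource()
probe = weakref.ref(held)  # observes liveness without keeping the object alive

safe_del(held)
alive_after_helper = probe() is not None  # the caller still holds `held`

held = None    # the caller drops its own reference
gc.collect()   # not required under CPython refcounting; harmless elsewhere
alive_after_caller_drop = probe() is not None
```

The object outlives the helper call and is destroyed only once the caller's own reference is gone, which matches the description in the finding.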
+
+## Verdict Logic
+
+- 0 Critical
+- 0 High
+- 1 Medium (auto-fixed in this batch)
+- 4 Low
+
+→ **PASS_WITH_WARNINGS**: only Medium / Low findings; Medium was auto-fixed.