Implement the production-default InferenceRuntime strategy on JetPack 6.2 + TensorRT 10.3 (per D-C7-9). The runtime owns the full TRT lifecycle: compile_engine via the Polygraphy + trtexec + IBuilderConfig hybrid (FP16 / INT8 / Mixed precision), deserialize_engine with EngineGate-first ordering and a pre-allocation GPU memory budget gate, infer via H2D -> enqueueV3 -> D2H -> stream sync on the owned CUDA stream, idempotent release_engine, and an injected ThermalStatePublisher delegation for thermal_state. INT8 calibration cache trust (D-C10-6, AC-2/3/4) is enforced by a .calib_cache.sha256 file-integrity sidecar (AZ-280) plus a new .calib_cache.dataset_sha256 sidecar that records the dataset content hash at compile time; reuse only when both agree, rebuild silently on dataset hash mismatch, raise CalibrationCacheError on corrupt sidecar (never silently overwritten). GPU memory budget (NFT-LIM-01, default 4 GiB) is checked BEFORE any TRT call beyond the gate (AC-6); a pre-allocation refusal raises OutOfMemoryError and leaves the resident state unchanged. TensorRT 10.3 / Polygraphy / PyCUDA are lazy-imported inside the methods that need them so the module loads cleanly on Tier-0 hosts. A standalone CLI entry (python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile <onnx> <build_config.json>) is wired for C10 CacheProvisioner (AZ-321) to invoke pre-flight without holding a runtime instance. C7InferenceConfig gains gpu_memory_budget_bytes (default 4 GiB) and trtexec_timeout_s (default 600 s, Risk 4 mitigation), both validated in __post_init__. Tests: 26 active + 6 Tier-2-gated skips; AC-1 / AC-3 / AC-4 / AC-5 / AC-6 / AC-7 / AC-10 + NFR-reliability fully covered on Tier-1 via fake CUDA / TRT modules; AC-2 / AC-8 / AC-9 / NFR-perf-deserialize placeholders skip with prerequisite reason and live in the AZ-298 Tier-2 microbench harness. Code review verdict PASS_WITH_WARNINGS (1 Medium hot-path hoist fix auto-applied). Co-authored-by: Cursor <cursoragent@cursor.com>
13 KiB
Batch 31 / Cycle 1 — Implementation Report
Date: 2026-05-12 Tasks: AZ-298 (C7 TensorrtRuntime — production-default TensorRT 10.3 strategy + INT8 calibration cache trust + GPU memory budget enforcement) Story points landed: 5 Status: complete (AZ-298 → In Testing)
Scope summary
Single-task batch landing the production-default InferenceRuntime
strategy for C7. TensorrtRuntime owns the full TensorRT 10.3 +
JetPack 6.2 lifecycle (per D-C7-9): compile_engine via the
Polygraphy + trtexec + IBuilderConfig hybrid (FP16 / INT8 / Mixed),
deserialize_engine with EngineGate-first ordering and a
pre-allocation GPU memory budget gate, infer via H2D →
enqueueV3 → D2H → stream sync on the owned CUDA stream,
idempotent release_engine, and an injected ThermalStatePublisher
delegation for thermal_state (AZ-302 will own the polling loop).
The two foot-guns flagged in the task spec are gated explicitly:
- INT8 calibration cache trust (D-C10-6) is enforced by a
.calib_cache.sha256file-integrity sidecar (AZ-280) plus a.calib_cache.dataset_sha256sidecar that records the dataset content hash at compile time. Reuse only when both sidecars are consistent; mismatched dataset hash forces a silent rebuild (AC-3); corrupt.sha256sidecar raisesCalibrationCacheError(AC-4 — never silently overwritten). - GPU memory budget (NFT-LIM-01, default 4 GiB) is checked
BEFORE any TRT call beyond the gate, using
_predicted_deserialize_bytes(entry)(engine file size +extras["opt_buffer_bytes"]stamped at compile time, with a conservative 256 MiB fallback when the field is missing). A pre-allocation refusal raisesOutOfMemoryErrorand leaves the resident state unchanged (AC-6).
TensorRT 10.3, Polygraphy, and PyCUDA are lazy-imported inside the
methods that need them; the module loads cleanly on Tier-0 / macOS
dev hosts so the package's protocol-conformance tests stay importable
without GPU. A standalone CLI entry point
python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile <onnx> <build_config.json>
is wired for C10 CacheProvisioner (AZ-321) to invoke pre-flight
without holding a runtime instance.
Files added / modified
New (production)
src/gps_denied_onboard/components/c7_inference/tensorrt_runtime.py—TrtEngineHandle(opaque, slots, owns engine + exec context + stream + IO buffers +allocated_bytes+_releasedflag);_dataset_content_hash+_plan_calibration_cache+_persist_calibration_cache_sidecars(D-C10-6 trust gate; AC-2 / AC-3 / AC-4);_profile_buffer_bytes+_predicted_deserialize_bytes(AC-6 prediction);TensorrtRuntimeclass withcompile_engine(Polygraphy + trtexec branches),deserialize_engine(gate → budget → load → exec ctx → IO buffers, with rollback-on-error so the resident state is unchanged on failure),infer(sync GPU stream with explicit H2D / enqueueV3 / D2H / sync ordering), idempotentrelease_engine,thermal_statedelegation,current_runtime_label() -> "tensorrt";_safe_free_safe_delresource helpers; argparse CLI withcompilesubcommand for the C10 pre-flight entry.
New (tests)
tests/unit/c7_inference/test_tensorrt_runtime.py— NEW suite of 26 active tests + 6 Tier-2-gated skips covering every AC:- AC-1 protocol conformance + label string; Tier-2 placeholder skip for the real FP16 compile.
- AC-2 Tier-2 placeholder skip for the INT8 compile + sub-30s rebuild-from-cache timing (the Tier-1 logic equivalent is in the AC-3 reuse test below).
- AC-3 stale calibration cache forces rebuild
(
_plan_calibration_cache(...).reuse is Falsewhen dataset hash differs); matching dataset hash → reuse. - AC-4 corrupt
.sha256sidecar / malformed dataset sidecar / empty dataset all raiseCalibrationCacheError;_persist_calibration_cache_sidecarswrites both sidecars correctly after a calibrator-written cache. - AC-5
deserialize_engineinvokesEngineGate.validateBEFORE any TRT import — verified by monkey-patching_load_trt/_load_pycudato raiseAssertionErroron any call. - AC-6 budget helper rejects overshoot (with engine name
in the message); accepts within-budget allocations; full
deserialize_enginepath raisesOutOfMemoryErrorBEFORE_load_trtruns and leaves_resident_bytesunchanged. - AC-7
inferorders H2D →enqueueV3→ D2H → stream sync via fake CUDA/TRT modules counting call sequence; Tier-2 placeholder skip for the real CUDA-event trace. - AC-8 / AC-9 / NFR-perf-deserialize Tier-2 placeholder skips for the perf / memory benchmarks (C7-PT-01 / C7-PT-02) that live in the dedicated microbench harness on Jetson.
- AC-10
release_engineidempotent — first call frees all buffers, drops resident_bytes to 0, marks handle released; second call is a silent no-op; foreign handle types silently ignored (defensive shim). - NFR-reliability-error-rewrap
inferrewraps a syntheticRuntimeError("TRT C++ exception: enqueueV3 fault")intoInferenceErrorwith__cause__preserved; foreign handle type and released handle paths also rewrap toInferenceError; missing input binding rewraps. - Thermal delegation default-safe
ThermalState(is_telemetry_available=False) when no publisher is injected; provider-injected publisher returns its canned snapshot unmodified. - Helpers
_predicted_deserialize_bytesfalls back to 256 MiB whenextras["opt_buffer_bytes"]is absent;_profile_buffer_bytessums element counts × 2 bytes;_dataset_content_hashchanges with content. - CLI smoke
_build_config_from_jsonround-trips FP16 payloads and raisesEngineBuildErrorwhen INT8 is requested withoutcalibration_dataset.
Modified (production)
src/gps_denied_onboard/components/c7_inference/config.py— addsgpu_memory_budget_bytes: int = 4 GiB(NFT-LIM-01 default) andtrtexec_timeout_s: int = 600(Risk-4 mitigation, 10 min) toC7InferenceConfig, both validated> 0in__post_init__.
Modified (tests)
tests/unit/c7_inference/test_protocol_conformance.py— thetest_ac5_build_inference_runtime_flag_on_but_module_missingparametrization previously included"tensorrt"; now thattensorrt_runtime.pyexists, the factory successfully imports it (the missing-module branch is exercised by"onnx_trt_ep"and"pytorch_fp16"only). The TRT row will return when the module-presence test gains a separate "module exists but tensorrt python binding missing" case in a future task.
Modified (docs)
_docs/02_document/module-layout.md—tensorrt_runtime.pyrow in the c7_inference per-component table now reads "(AZ-298; production-default TensorRT 10.3 strategy + INT8 calibration cache trust + GPU memory budget enforcement +python -m ...tensorrt_runtime compile ...CLI)" — replaces the priorpendingmarker.
Acceptance criteria coverage
| AC | Test | Status |
|---|---|---|
| AC-1 FP16 engine + sidecar at canonical path | test_ac1_protocol_conformance (Tier-1 protocol/label) + test_ac1_real_fp16_compile_produces_engine_and_sidecar (Tier-2) |
passing / Tier-2 skipped |
| AC-2 INT8 cache reuse under 30 s | test_ac3_matching_dataset_hash_reuses_cache (Tier-1 logic) + test_ac2_int8_compile_reuses_calibration_cache_under_30s (Tier-2) |
passing / Tier-2 skipped |
| AC-3 Stale dataset forces rebuild | test_ac3_stale_calibration_cache_forces_rebuild + test_ac3_matching_dataset_hash_reuses_cache |
passing |
| AC-4 Corrupt calib sidecar raises | test_ac4_corrupted_calibration_cache_raises + test_ac4_malformed_dataset_sidecar_raises + test_ac4_empty_dataset_raises + test_persist_calibration_cache_sidecars_writes_both |
passing |
| AC-5 EngineGate-first before any GPU work | test_ac5_gate_refusal_precedes_trt_import |
passing |
| AC-6 Budget pre-alloc refusal | test_ac6_budget_helper_refuses_overshoot + test_ac6_budget_helper_accepts_within + test_ac6_deserialize_budget_raises_before_trt_load |
passing |
| AC-7 H2D → enqueueV3 → D2H → sync ordering | test_infer_orders_h2d_enqueue_d2h_sync (Tier-1 via fakes) + test_ac7_real_infer_records_cuda_event_sequence (Tier-2) |
passing / Tier-2 skipped |
| AC-8 Per-model p95 latency | test_ac8_per_model_p95_latency_within_budget (Tier-2 microbench) |
Tier-2 skipped |
| AC-9 4 GiB GPU + 1.5 GiB RAM budget | test_ac9_concurrent_engine_resident_memory_within_budget (Tier-2 microbench) |
Tier-2 skipped |
AC-10 release_engine idempotent |
test_ac10_release_is_idempotent + test_release_engine_ignores_foreign_handle_type |
passing |
| NFR-perf-deserialize p95 ≤ 5 s | test_nfr_perf_deserialize_p95_under_5s (Tier-2 microbench) |
Tier-2 skipped |
| NFR-reliability error rewrap | test_infer_rewraps_third_party_exception + test_infer_rejects_foreign_handle + test_infer_rejects_released_handle + test_infer_missing_input_binding_rewraps |
passing |
AC Test Coverage: 10 of 10 covered (+ 2 NFRs)
Code Review Verdict: PASS_WITH_WARNINGS (1 Medium auto-fixed)
Auto-Fix Attempts: 1 (hoisted self._load_trt() out of the per-output
binding loop in infer() — saw it during review; mechanical fix,
re-ran tests after.)
Stuck Agents: None
Findings (self-review)
| # | Severity | Category | Location | Note | Resolution |
|---|---|---|---|---|---|
| 1 | Medium | Performance | tensorrt_runtime.py::TensorrtRuntime.infer |
self._load_trt() was called inside the per-output for-loop. The lazy import is module-cached so the cost is small, but the attribute lookup + the try/except added overhead on the hot path. Hoisted above the loop. |
FIXED in this batch. |
| 2 | Low | Maintainability | tensorrt_runtime.py::_predicted_deserialize_bytes |
Falls back to a flat 256 MiB IO-buffer estimate when extras["opt_buffer_bytes"] is absent (engine produced by an older compile path). Conservative for the budget gate but loose — could underestimate for very large profiles. Accepted because compile_engine always stamps the field; the fallback only protects against externally-produced engines. |
Open (Low) — accepted as documented. |
| 3 | Low | Test-quality | test_infer_orders_h2d_enqueue_d2h_sync |
Uses a fake CUDA module that captures memcpy_htod_async / memcpy_dtoh_async calls plus a fake exec context counting execute_async_v3. The ordering assertion is implicit from the linear control flow inside infer (H2D loop → exec → D2H loop → sync); a real CUDA event trace lives in the AZ-298 Tier-2 microbench harness. |
Open (Low) — Tier-2 placeholder is test_ac7_real_infer_records_cuda_event_sequence. |
| 4 | Low | Architecture | tensorrt_runtime.py::infer |
Reads handle._input_buffers / _output_buffers etc. directly through the slot names. Per Invariant I-4 those fields are private to TensorrtRuntime, so the access is intra-class and the slot pattern is just a memory-layout optimisation — but it makes the test code look like it's introspecting a black-box handle. Accepted because the test stays inside the c7_inference component boundary. |
Open (Low) — accepted as documented. |
| 5 | Low | Scope | tensorrt_runtime.py::_safe_del |
The del resource line cannot actually free anything since it only drops a local reference inside the helper; the real teardown happens when the caller drops its own reference. The helper is mostly a defensive "best-effort, log-warn-on-exception" wrapper around the C++-shim destructors. Kept as a single explicit place to swallow + log unusual teardown errors. |
Open (Low) — accepted as documented. |
Tracker
- AZ-298 transitioned to In Progress at session start; will move
to In Testing post-commit per
protocols.md.
Test suite
tests/unit/c7_inference/test_tensorrt_runtime.py— 26 passing- 6 Tier-2 skips on macOS dev (no TensorRT binding).
tests/unit/c7_inference/(full c7 suite) — 116 passing, 13 skipped (CUDA / TensorRT unavailable on Tier-1 / macOS).- Combined unit suite excluding pending components (c1, c2, c2.5, c3,
c3.5, c4, c5, c8, c10, c11, c12) and the c6 collection blocker on
this host (missing
psycopg_poolis a known dev-machine env issue, pre-existing) — 506 passing, 10 environment-skipped, 1 warning (pre-existingpynvmlFutureWarning unrelated to AZ-298).
Next batch
Cycle 1 advances per the greenfield queue — autodev re-detects the next AZ ticket in the Step 7 batch loop and continues. AZ-299 (C7 OnnxTrtEpRuntime fallback) is the next AZ-249/E-C7 item ahead in the dependency graph.