# C7: On-Jetson Inference Runtime

## 1. High-Level Overview

**Purpose**: provide a single inference-runtime abstraction that all GPU-using components (C1 selectively, C2, C2.5, C3, C3.5) consume. Owns engine compilation (Polygraphy / trtexec / IBuilderConfig hybrid), engine deserialization at takeoff load, GPU memory management, INT8 calibration cache trust, and the thermal-throttle telemetry feed that drives the D-CROSS-LATENCY-1 hybrid in C4.

**Architectural Pattern**: Strategy. The `InferenceRuntime` interface has three concrete implementations: `TensorrtRuntime` (production-default per D-C7-9 JetPack 6.2 + TensorRT 10.3 lock), `OnnxTrtEpRuntime` (fallback), `PytorchFp16Runtime` (mandatory simple-baseline). Selection at startup by config (ADR-001), build-time gating by `BUILD_*` flags (ADR-002), composition-root wired (ADR-009).

**Cycle-1 operational reality**: C7 is **infrastructure shared across consumers**: it does NOT have its own slot in the `_STRATEGY_REGISTRY` populated by `register_airborne_strategies()` (AZ-591). Instead the airborne binary builds the `InferenceRuntime` once via `runtime_root/airborne_bootstrap.py::_build_c7_inference` → `inference_factory.build_inference_runtime`, and seeds the single instance into `pre_constructed["c7_inference"]` (AZ-621 / Phase C). The same instance is reused as the engine source for the shared `LightGlueRuntime` load (AZ-622 / Phase D, `_build_c3_lightglue_runtime`), so the bootstrap never double-builds the runtime; downstream wrappers (c2_vpr / c3_matcher / c3_5_adhop, per `AIRBORNE_REQUIRED_PRE_CONSTRUCTED_KEYS`) then receive the identity-shared runtime via `compose_root`'s constructor injection.
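
The build-once, share-by-identity wiring described above can be sketched roughly as follows. This is a minimal illustration, not the project's bootstrap code: `StubRuntime`, `RuntimeNotAvailable`, `RUNTIMES`, `build_runtime`, `Consumer`, and `bootstrap` are hypothetical names invented here, and the real factory builds actual engines rather than labelled stubs.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class InferenceRuntime(Protocol):
    """Only the label method is modelled; the real interface is larger."""

    def current_runtime_label(self) -> str: ...


@dataclass
class StubRuntime:
    """Hypothetical stand-in for a concrete strategy."""

    label: str

    def current_runtime_label(self) -> str:
        return self.label


class RuntimeNotAvailable(RuntimeError):
    """Hypothetical error: the requested runtime was not built in."""


# Hypothetical registry: runtime name -> (build flag, constructor).
RUNTIMES: dict[str, tuple[str, Callable[[], InferenceRuntime]]] = {
    "tensorrt": ("BUILD_TENSORRT_RUNTIME", lambda: StubRuntime("tensorrt")),
    "pytorch_fp16": ("BUILD_PYTORCH_FP16_RUNTIME", lambda: StubRuntime("pytorch_fp16")),
}


def build_runtime(name: str, flags: dict[str, bool]) -> InferenceRuntime:
    """Select a strategy by config name; refuse runtimes whose flag is OFF."""
    if name not in RUNTIMES:
        raise RuntimeNotAvailable(f"unknown runtime {name!r}")
    flag, ctor = RUNTIMES[name]
    if not flags.get(flag, False):
        raise RuntimeNotAvailable(f"runtime {name!r} requires {flag}=ON")
    return ctor()


class Consumer:
    """Hypothetical wrapper that receives the runtime by constructor injection."""

    def __init__(self, slug: str, runtime: InferenceRuntime) -> None:
        self.slug = slug
        self.runtime = runtime


def bootstrap(name: str, flags: dict[str, bool], slugs: list[str]) -> list[Consumer]:
    runtime = build_runtime(name, flags)  # built exactly once
    pre_constructed = {"c7_inference": runtime}
    # Every consumer gets the same object, never a fresh build.
    return [Consumer(slug, pre_constructed["c7_inference"]) for slug in slugs]
```

With this shape, checking `consumer_a.runtime is consumer_b.runtime` is enough to confirm that no consumer triggered a second build.
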
Airborne-buildable runtimes are gated by `C7_AIRBORNE_BUILD_FLAGS = (("tensorrt", "BUILD_TENSORRT_RUNTIME"), ("pytorch_fp16", "BUILD_PYTORCH_FP16_RUNTIME"))`: `tensorrt` is the production-default, `pytorch_fp16` is the Tier-0 / workstation fallback (and is the conservative `C7InferenceConfig.runtime` default so unconfigured test environments resolve to the Tier-0 baseline). `onnx_trt_ep` is **deliberately omitted** from the airborne flag matrix even though `inference_factory._RUNTIME_TO_BUILD_FLAG` recognises it; see § 7 Tier-2 follow-up.

When no airborne runtime is buildable (both `BUILD_TENSORRT_RUNTIME` and `BUILD_PYTORCH_FP16_RUNTIME` OFF, or the configured runtime's flag is OFF) and any configured consumer still requires `c7_inference`, `_build_c7_inference` surfaces the upstream `RuntimeNotAvailableError` as an `AirborneBootstrapError` (AC-621.2) naming the missing key, BOTH airborne `BUILD_*` flags + their runtimes, and the consuming component slug(s), narrowed to the configured consumers when available.

AZ-687 replay-mode guard: when `config.mode == "replay"` and the minimal replay `Config` omits the `c7_inference` block, `build_pre_constructed` skips both `c7_inference` AND the cascading `c3_lightglue_runtime` seed (the LightGlue runtime depends on the inference runtime); the c2_vpr / c3_matcher / c2_5_rerank / c3_5_adhop wrappers that would have consumed the runtime are likewise absent from the replay `Config` and therefore never look at the skipped slot.

**Upstream dependencies**:

- C10 CacheProvisioner → during F1 (after C11 `TileDownloader` has populated C6) triggers engine compilation when no cached engine matches the `(SM, JP, TRT, precision)` tuple.
- F2 takeoff load → triggers `deserialize_cached_engine` for every model used by C1/C2/C2.5/C3/C3.5.
- jetson-stats / NVML → thermal-throttle telemetry source.

**Downstream consumers**:

- C2 VPR (backbone forward pass).
- C2.5 ReRanker (LightGlue forward pass).
- C3 CrossDomainMatcher (DISK / LightGlue / ALIKED / XFeat forward passes).
- C3.5 AdHoP (conditional refinement backbone).
- C1 (only the strategies that have a CUDA path; KltRansac is CPU-only).
- C4 (consumes `ThermalState` for the D-CROSS-LATENCY-1 covariance-mode decision).

## 2. Internal Interfaces

### Interface: `InferenceRuntime`

| Method | Input | Output | Async | Error Types |
|--------|-------|--------|-------|-------------|
| `compile_engine` | `model_path: Path, build_config: BuildConfig` | `EngineCacheEntry` | No (offline) | `EngineBuildError`, `CalibrationCacheError` |
| `deserialize_engine` | `EngineCacheEntry` | `EngineHandle` | No | `EngineDeserializeError` |
| `infer` | `EngineHandle, inputs: dict[str, Tensor]` | `dict[str, Tensor]` | No (sync GPU stream) | `InferenceError`, `OutOfMemoryError` |
| `release_engine` | `EngineHandle` | `None` | No | none |
| `thermal_state` | `()` | `ThermalState` | No | `TelemetryUnavailableError` |
| `current_runtime_label` | `()` | `string` | No | none |

**Input/Output DTOs**:

```
BuildConfig:
  precision: enum {fp16, int8, mixed}
  workspace_mb: int
  calibration_dataset: Path (required for int8)
  optimization_profiles: list[(input_name, min_shape, opt_shape, max_shape)]

EngineCacheEntry: see data_model.md

EngineHandle: opaque GPU-resident handle

ThermalState:
  cpu_temp_c: float
  gpu_temp_c: float
  thermal_throttle_active: bool
  measured_clock_mhz: int
  measured_at: monotonic_ns
```

## 3. External API Specification

Not applicable.

## 4. Data Access Patterns

### Queries

| Query | Frequency | Hot Path | Index Needed |
|-------|-----------|----------|--------------|
| `infer` for VPR backbone | 3 Hz | Yes | n/a |
| `infer` for LightGlue (×10 in C2.5, ×3 in C3) | 3 Hz × 13 = 39 Hz | Yes | n/a |
| `infer` for AdHoP (conditional) | <1 Hz typical | Yes (when invoked) | n/a |
| `thermal_state` poll | 1 Hz from C4 | No (sampled, not per-frame) | n/a |

### Caching Strategy

| Data | Cache Type | TTL | Invalidation |
|------|-----------|-----|-------------|
| Compiled `.engine` files | filesystem keyed by `(SM, JP, TRT, precision)` (D-C10-7) | bounded by JetPack/TRT version stability | manifest content-hash gate at takeoff (D-C10-3) |
| INT8 calibration cache | filesystem alongside `.engine` (D-C10-6) | bounded by calibration dataset version | rebuild when calibration dataset hash changes |
| Resident engine handles | GPU memory | flight lifetime | F8 reboot recovery re-deserialises |

### Storage Estimates

| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|-----------------|---------------------|----------|------------|-------------|
| `.engine` files | one per (model × precision × backbone) | 50 MB – 500 MB / engine | up to ~1.5 GB across all backbones for a deployment binary | bounded by AC-8.3 carve-out |
| INT8 calibration caches | one per engine | 1–10 MB | <50 MB | as above |

### Data Management

**Seed data**: pre-flight F1 provisioning compiles engines (or reuses cached). No mid-flight compilation.

**Rollback**: D-C10-7 self-describing filename schema (`__sm_jp_trt_.engine`) makes stale engines visually obvious; F2 takeoff load refuses to deserialize an engine whose metadata doesn't match the host's current `(SM, JP, TRT)` tuple.

## 5. Implementation Details

**Algorithmic Complexity**: per-model forward pass cost is the design driver. Engine builds are `O(complexity_of_optimizer_search)`: minutes for INT8 with calibration; sub-minute for FP16.
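
The takeoff-load refusal described under Rollback in § 4 might look roughly like the sketch below. It assumes the engine's metadata and the expected host/config values are both available as a `(SM, JP, TRT, precision)` record; `EngineKey`, `EngineMismatch`, and `check_engine_matches` are hypothetical names for illustration, and the field values are examples only.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EngineKey:
    """Hypothetical record mirroring the (SM, JP, TRT, precision) tuple."""

    sm: str
    jetpack: str
    tensorrt: str
    precision: str


class EngineMismatch(RuntimeError):
    """Hypothetical error raised instead of deserializing a stale engine."""


def check_engine_matches(engine: EngineKey, expected: EngineKey) -> None:
    """Raise with per-field detail if the cached engine does not match."""
    mismatched = [
        f"{f.name}: engine={getattr(engine, f.name)} expected={getattr(expected, f.name)}"
        for f in fields(EngineKey)
        if getattr(engine, f.name) != getattr(expected, f.name)
    ]
    if mismatched:
        raise EngineMismatch("refusing to deserialize; " + "; ".join(mismatched))


# Example values only: an engine built against an older TensorRT is rejected.
expected = EngineKey(sm="87", jetpack="6.2", tensorrt="10.3", precision="fp16")
stale = EngineKey(sm="87", jetpack="6.2", tensorrt="10.0", precision="fp16")
try:
    check_engine_matches(stale, expected)
except EngineMismatch as exc:
    print(exc)
```

Reporting every mismatched field at once, rather than failing on the first, gives the operator the explicit mismatch detail in a single message.
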
**State Management**:

- Owns the CUDA stream(s) for the runtime; one stream per concurrent consumer (typically one stream because the F3 hot path is single-threaded).
- Owns the resident engine handles for the duration of a flight.
- Owns the polling loop for thermal-throttle telemetry (1 Hz background thread).

**Key Dependencies**:

| Library | Version | Purpose |
|---------|---------|---------|
| TensorRT (C++ + Python) | 10.3 (JetPack 6.2 pin) per D-C7-9 | Primary engine compile + deserialize + infer |
| Polygraphy | matches TensorRT | Engine build orchestration |
| trtexec | bundled with TensorRT | Alternate engine build path |
| ONNX Runtime + TRT EP | per project pin | Fallback runtime |
| PyTorch | per simple-baseline pin | FP16 baseline (mandatory) |
| jetson-stats / pynvml | latest | Thermal-throttle telemetry source |

**Error Handling Strategy**:

- `EngineBuildError`: surface to C10/operator pre-flight; takeoff blocked. **Never silently fall back** between runtimes: if the configured runtime can't build, the operator must explicitly switch.
- `EngineDeserializeError` at takeoff: refuse takeoff with explicit `(SM, JP, TRT, precision)` mismatch detail.
- `InferenceError` mid-flight (rare; e.g., transient CUDA fault): emit no result for that frame; the consumer (C2/C3) reports its own degraded path.
- `OutOfMemoryError`: same as above; surface to C13 FDR and C12 operator-tooling for post-flight investigation.
- `TelemetryUnavailableError`: jetson-stats hung or unavailable. Default to "thermal_throttle_active = false" (D-CROSS-LATENCY-1 stays on the steady-state path); log WARN.
- `CalibrationCacheError`: per D-C10-6, calibration cache trust is critical; if the cache hash mismatches, refuse to use it and force a rebuild.

## 6. Extensions and Helpers

| Helper | Purpose | Used By |
|--------|---------|---------|
| `EngineFilenameSchema` | self-describing filename per D-C10-7 | C7, C10 |
| `Sha256Sidecar` | atomic write + content-hash sidecar pattern | C6, C7, C10 |

## 7. Caveats & Edge Cases

**Known limitations**:

- TensorRT engines are NOT portable across `(SM, JP, TRT, precision)` tuples; Tier-1 (workstation Docker) cannot reuse Tier-2 (Jetson) engines. CI emits both tiers' engines as artifacts.
- INT8 calibration cache trust is the lurking foot-gun; D-C10-6 manifest-hash gate is the only protection. Any deviation breaks NFT-PERF-01 / NFT-LIM-01.

**Potential race conditions**:

- The thermal-throttle polling thread MUST be thread-safe with respect to the F3 hot path's `infer` calls. Use a lock-free atomic snapshot for `thermal_state`.

**Performance bottlenecks**:

- Per-frame inference cost is the F3 hot path's largest contributor. NFT-PERF-01 partition is the source of truth.

**Cycle-1 Tier-2 follow-up dependencies**:

- `OnnxTrtEpRuntime`: the module + class are implemented and the lower-level `inference_factory._RUNTIME_TO_BUILD_FLAG` maps `"onnx_trt_ep" → "BUILD_ONNX_TRT_EP_RUNTIME"`, but the **airborne** `C7_AIRBORNE_BUILD_FLAGS` tuple in `runtime_root/airborne_bootstrap.py` deliberately omits it (research-only per the AZ-621 task spec). Setting `config.components['c7_inference'].runtime = "onnx_trt_ep"` on an airborne binary raises `AirborneBootstrapError` from `_build_c7_inference` whose message lists ONLY the two airborne flag options (tensorrt / pytorch_fp16), so operators see a clean recovery path instead of a research-build escape hatch. Tier-2 follow-up: extend `C7_AIRBORNE_BUILD_FLAGS` (and gate it on `BUILD_ONNX_TRT_EP_RUNTIME=ON`) only if a future deployment scenario justifies the ORT-TRT-EP path on a flight binary; until then the runtime is exercised via unit-test composition and ad-hoc workstation runs only.

## 8. Dependency Graph

**Must be implemented after**: nothing internal; C7 is foundational.

**Can be implemented in parallel with**: C6, C13.

**Blocks**: C1 (CUDA strategies), C2, C2.5, C3, C3.5, C4 (consumes `ThermalState`), C10, F1, F2, F3, F6, F8.

## 9. Logging Strategy

| Log Level | When | Example |
|-----------|------|---------|
| ERROR | `EngineBuildError`, `EngineDeserializeError`, `OutOfMemoryError`, `CalibrationCacheError` | `C7 OOM during infer; backbone=ultravpr; frame=12345` |
| WARN | thermal-throttle entered/exited; telemetry unavailable | `C7 thermal throttle active; gpu_temp=83C; clock=750mhz` |
| INFO | Strategy ready; engine deserialised; backbone resident | `C7 ready: runtime=tensorrt, engines=[ultravpr@fp16, lightglue@fp16, disk@fp16]` |
| DEBUG | per-frame infer timing per backbone | `C7 infer backbone=ultravpr frame=12345 took=37ms` |

**Log format**: structured JSON.

**Log storage**: stdout / journald / FDR via C13 (ERROR + WARN always; thermal-state transitions always to FDR).
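
One possible way to emit the structured JSON format named above with the Python standard library is sketched here. The field names (`level`, `component`, `event`) and the `fields` convention passed through `extra` are assumptions made for this illustration, not the project's actual log schema; `JsonFormatter` and `make_logger` are hypothetical helpers.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders one JSON object per log record."""

    # Python names the level WARNING; the table above uses WARN.
    LEVEL_ALIASES = {"WARNING": "WARN"}

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": self.LEVEL_ALIASES.get(record.levelname, record.levelname),
            "component": "C7",
            "event": record.getMessage(),
        }
        # Structured key/value pairs arrive via logging's `extra` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger() -> logging.Logger:
    logger = logging.getLogger("c7_sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler()  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


logger = make_logger()
logger.warning(
    "thermal throttle active",
    extra={"fields": {"gpu_temp_c": 83, "clock_mhz": 750}},
)
```

Keeping the numeric values as separate keys rather than embedding them in the message string lets downstream tooling filter on them without parsing text.
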