# C7 — On-Jetson Inference Runtime

## 1. High-Level Overview

**Purpose**: provide a single inference-runtime abstraction that all GPU-using components (C1 selectively, C2, C2.5, C3, C3.5) consume. Owns engine compilation (Polygraphy / trtexec / IBuilderConfig hybrid), engine deserialization at takeoff load, GPU memory management, INT8 calibration cache trust, and the thermal-throttle telemetry feed that drives the D-CROSS-LATENCY-1 hybrid in C4.

**Architectural Pattern**: Strategy — `InferenceRuntime` interface with three concrete implementations: `TensorrtRuntime` (production-default per D-C7-9 JetPack 6.2 + TensorRT 10.3 lock), `OnnxTrtEpRuntime` (fallback), `PytorchFp16Runtime` (mandatory simple-baseline). Selection at startup by config (ADR-001), build-time gating by `BUILD_*` flags (ADR-002), composition-root wired (ADR-009).

**Cycle-1 operational reality**: C7 is **infrastructure shared across consumers**; it does NOT have its own slot in the `_STRATEGY_REGISTRY` populated by `register_airborne_strategies()` (AZ-591). Instead the airborne binary builds the `InferenceRuntime` once via `runtime_root/airborne_bootstrap.py::_build_c7_inference` → `inference_factory.build_inference_runtime`, and seeds the single instance into `pre_constructed["c7_inference"]` (AZ-621 / Phase C). The same instance is reused as the engine source for the shared `LightGlueRuntime` load (AZ-622 / Phase D, `_build_c3_lightglue_runtime`), so the bootstrap never double-builds the runtime. Downstream wrappers (c2_vpr / c3_matcher / c3_5_adhop, per `AIRBORNE_REQUIRED_PRE_CONSTRUCTED_KEYS`) then receive the identity-shared runtime via `compose_root`'s constructor injection.

- **Build-flag gating**: airborne-buildable runtimes are gated by `C7_AIRBORNE_BUILD_FLAGS = (("tensorrt", "BUILD_TENSORRT_RUNTIME"), ("pytorch_fp16", "BUILD_PYTORCH_FP16_RUNTIME"))`. `tensorrt` is the production default; `pytorch_fp16` is the Tier-0 / workstation fallback and is also the conservative `C7InferenceConfig.runtime` default, so unconfigured test environments resolve to the Tier-0 baseline. `onnx_trt_ep` is **deliberately omitted** from the airborne flag matrix even though `inference_factory._RUNTIME_TO_BUILD_FLAG` recognises it (see § 7 Tier-2 follow-up).
- **No buildable runtime**: when no airborne runtime is buildable (both `BUILD_TENSORRT_RUNTIME` and `BUILD_PYTORCH_FP16_RUNTIME` OFF, or the configured runtime's flag is OFF) and any configured consumer still requires `c7_inference`, `_build_c7_inference` surfaces the upstream `RuntimeNotAvailableError` as an `AirborneBootstrapError` (AC-621.2) naming the missing key, BOTH airborne `BUILD_*` flags + their runtimes, and the consuming component slug(s), narrowed to the configured consumers when available.
- **AZ-687 replay-mode guard**: when `config.mode == "replay"` and the minimal replay `Config` omits the `c7_inference` block, `build_pre_constructed` skips both `c7_inference` AND the cascading `c3_lightglue_runtime` seed (the LightGlue runtime depends on the inference runtime); the c2_vpr / c3_matcher / c2_5_rerank / c3_5_adhop wrappers that would have consumed the runtime are likewise absent from the replay `Config` and therefore never look at the skipped slot.
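
The build-flag gating described above can be illustrated with a minimal sketch. It is not the project's `_build_c7_inference`; the helper `resolve_airborne_runtime`, its `enabled_flags` and `consumers` parameters, and the message wording are all hypothetical. Only the `C7_AIRBORNE_BUILD_FLAGS` tuple and the `AirborneBootstrapError` name come from the text.

```python
from __future__ import annotations

from typing import Set, Tuple

# Tuple copied from the section text: (runtime name, build flag).
C7_AIRBORNE_BUILD_FLAGS = (
    ("tensorrt", "BUILD_TENSORRT_RUNTIME"),
    ("pytorch_fp16", "BUILD_PYTORCH_FP16_RUNTIME"),
)


class AirborneBootstrapError(RuntimeError):
    """Stand-in for the project's bootstrap error type."""


def resolve_airborne_runtime(
    configured: str,
    enabled_flags: Set[str],
    consumers: Tuple[str, ...],
) -> str:
    """Hypothetical helper: return the runtime name if its flag is ON, else raise."""
    flag_by_runtime = dict(C7_AIRBORNE_BUILD_FLAGS)
    required_flag = flag_by_runtime.get(configured)
    if required_flag is not None and required_flag in enabled_flags:
        return configured
    # List only the airborne options, even if a research flag happens to be ON.
    options = ", ".join(
        f"{name} ({build_flag})" for name, build_flag in C7_AIRBORNE_BUILD_FLAGS
    )
    raise AirborneBootstrapError(
        f"missing pre_constructed key 'c7_inference': runtime {configured!r} "
        f"is not buildable; airborne options: {options}; "
        f"required by: {', '.join(consumers)}"
    )
```

In this sketch, requesting `onnx_trt_ep` fails even when `BUILD_ONNX_TRT_EP_RUNTIME` is in `enabled_flags`, because the lookup only consults the airborne tuple.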

**Upstream dependencies**:
- C10 CacheProvisioner → during F1 (after C11 `TileDownloader` has populated C6) triggers engine compilation when no cached engine matches the `(SM, JP, TRT, precision)` tuple.
- F2 takeoff load → triggers `deserialize_cached_engine` for every model used by C1/C2/C2.5/C3/C3.5.
- jetson-stats / NVML → thermal-throttle telemetry source.

**Downstream consumers**:
- C2 VPR (backbone forward pass).
- C2.5 ReRanker (LightGlue forward pass).
- C3 CrossDomainMatcher (DISK / LightGlue / ALIKED / XFeat forward passes).
- C3.5 AdHoP (conditional refinement backbone).
- C1 (only the strategies that have a CUDA path; KltRansac is CPU-only).
- C4 (consumes `ThermalState` for the D-CROSS-LATENCY-1 covariance-mode decision).

## 2. Internal Interfaces

### Interface: `InferenceRuntime`

| Method | Input | Output | Async | Error Types |
|--------|-------|--------|-------|-------------|
| `compile_engine` | `model_path: Path, build_config: BuildConfig` | `EngineCacheEntry` | No (offline) | `EngineBuildError`, `CalibrationCacheError` |
| `deserialize_engine` | `EngineCacheEntry` | `EngineHandle` | No | `EngineDeserializeError` |
| `infer` | `EngineHandle, inputs: dict[str, Tensor]` | `dict[str, Tensor]` | No (sync GPU stream) | `InferenceError`, `OutOfMemoryError` |
| `release_engine` | `EngineHandle` | `None` | No | — |
| `thermal_state` | `()` | `ThermalState` | No | `TelemetryUnavailableError` |
| `current_runtime_label` | `()` | `str` | No | — |

**Input/Output DTOs**:
```
BuildConfig:
  precision:                   enum {fp16, int8, mixed}
  workspace_mb:                int
  calibration_dataset:         Path (required for int8)
  optimization_profiles:       list[(input_name, min_shape, opt_shape, max_shape)]

EngineCacheEntry:              see data_model.md
EngineHandle:                  opaque GPU-resident handle

ThermalState:
  cpu_temp_c:                  float
  gpu_temp_c:                  float
  thermal_throttle_active:     bool
  measured_clock_mhz:          int
  measured_at:                 monotonic_ns
```
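
As a rough illustration, the interface table and DTOs above could map onto Python types as follows. This is a sketch under assumptions, not the project's definitions: `EngineCacheEntry`, `EngineHandle`, and `Tensor` are left as `Any` because their real types live elsewhere (`data_model.md`), and the `Precision` enum name and the `__post_init__` check are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Any, Dict, Optional, Protocol, Tuple

Shape = Tuple[int, ...]
# (input_name, min_shape, opt_shape, max_shape), as listed in the DTO.
Profile = Tuple[str, Shape, Shape, Shape]


class Precision(Enum):
    FP16 = "fp16"
    INT8 = "int8"
    MIXED = "mixed"


@dataclass(frozen=True)
class BuildConfig:
    precision: Precision
    workspace_mb: int
    calibration_dataset: Optional[Path] = None
    optimization_profiles: Tuple[Profile, ...] = ()

    def __post_init__(self) -> None:
        # The DTO marks calibration_dataset as required for int8.
        if self.precision is Precision.INT8 and self.calibration_dataset is None:
            raise ValueError("calibration_dataset is required for int8 builds")


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # monotonic nanoseconds, e.g. time.monotonic_ns()


class InferenceRuntime(Protocol):
    """Structural interface; Any stands in for types defined elsewhere."""

    def compile_engine(self, model_path: Path, build_config: BuildConfig) -> Any: ...
    def deserialize_engine(self, entry: Any) -> Any: ...
    def infer(self, handle: Any, inputs: Dict[str, Any]) -> Dict[str, Any]: ...
    def release_engine(self, handle: Any) -> None: ...
    def thermal_state(self) -> ThermalState: ...
    def current_runtime_label(self) -> str: ...
```

Frozen dataclasses keep the DTOs immutable, which also makes a `ThermalState` safe to share between the polling thread and the hot path.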

## 3. External API Specification

Not applicable.

## 4. Data Access Patterns

### Queries

| Query | Frequency | Hot Path | Index Needed |
|-------|-----------|----------|--------------|
| `infer` for VPR backbone | 3 Hz | Yes | n/a |
| `infer` for LightGlue (×10 in C2.5, ×3 in C3) | 3 Hz × 13 = 39 Hz | Yes | n/a |
| `infer` for AdHoP (conditional) | <1 Hz typical | Yes (when invoked) | n/a |
| `thermal_state` poll | 1 Hz from C4 | No (sampled, not per-frame) | n/a |

### Caching Strategy

| Data | Cache Type | TTL | Invalidation |
|------|-----------|-----|-------------|
| Compiled `.engine` files | filesystem keyed by `(SM, JP, TRT, precision)` (D-C10-7) | bounded by JetPack/TRT version stability | manifest content-hash gate at takeoff (D-C10-3) |
| INT8 calibration cache | filesystem alongside `.engine` (D-C10-6) | bounded by calibration dataset version | rebuild when calibration dataset hash changes |
| Resident engine handles | GPU memory | flight lifetime | F8 reboot recovery re-deserialises |

### Storage Estimates

| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|-----------------|---------------------|----------|------------|-------------|
| `.engine` files | one per (model × precision × backbone) | 50 MB – 500 MB / engine | up to ~1.5 GB across all backbones for a deployment binary | bounded by AC-8.3 carve-out |
| INT8 calibration caches | one per engine | 1–10 MB | <50 MB | as above |

### Data Management

**Seed data**: pre-flight F1 provisioning compiles engines (or reuses cached). No mid-flight compilation.

**Rollback**: D-C10-7 self-describing filename schema (`<model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine`) makes stale engines visually obvious; F2 takeoff load refuses to deserialize an engine whose metadata doesn't match the host's current `(SM, JP, TRT)` tuple.
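
The filename schema above lends itself to a small format/parse round trip. The sketch below is an assumption-laden illustration, not the project's `EngineFilenameSchema`: the `EngineKey` class, the helper names, and the rule that the SM / JP / TRT / precision fields contain no underscores are hypothetical, and the `87` SM value in the usage note is illustrative. The text says F2 checks engine *metadata*; here the parsed filename stands in for that metadata.

```python
import re
from dataclasses import dataclass

# <model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine
_ENGINE_NAME = re.compile(
    r"^(?P<model>.+)__sm(?P<sm>[^_]+)_jp(?P<jp>[^_]+)"
    r"_trt(?P<trt>[^_]+)_(?P<precision>[^_.]+)\.engine$"
)


@dataclass(frozen=True)
class EngineKey:
    model: str
    sm: str
    jp: str
    trt: str
    precision: str

    def filename(self) -> str:
        return (
            f"{self.model}__sm{self.sm}_jp{self.jp}"
            f"_trt{self.trt}_{self.precision}.engine"
        )


def parse_engine_filename(name: str) -> EngineKey:
    """Parse a schema-conforming filename or raise ValueError."""
    match = _ENGINE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a schema-conforming engine filename: {name!r}")
    return EngineKey(**match.groupdict())


def matches_host(key: EngineKey, host_sm: str, host_jp: str, host_trt: str) -> bool:
    """True only when the engine's (SM, JP, TRT) tuple equals the host's."""
    return (key.sm, key.jp, key.trt) == (host_sm, host_jp, host_trt)
```

For example, `parse_engine_filename("lightglue__sm87_jp6.2_trt10.3_fp16.engine")` yields a key whose `filename()` reproduces the input, and `matches_host(key, "87", "6.2", "10.3")` is the condition under which a takeoff load would proceed in this sketch.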

## 5. Implementation Details

**Algorithmic Complexity**: per-model forward-pass cost is the design driver. Engine build time has no useful closed-form bound; it is dominated by the builder's optimizer search and runs to minutes for INT8 with calibration and under a minute for FP16.

**State Management**:
- Owns the CUDA stream(s) for the runtime; one stream per concurrent consumer (typically one stream because the F3 hot path is single-threaded).
- Owns the resident engine handles for the duration of a flight.
- Owns the polling loop for thermal-throttle telemetry (1 Hz background thread).

**Key Dependencies**:

| Library | Version | Purpose |
|---------|---------|---------|
| TensorRT (C++ + Python) | 10.3 (JetPack 6.2 pin) per D-C7-9 | Primary engine compile + deserialize + infer |
| Polygraphy | matches TensorRT | Engine build orchestration |
| trtexec | bundled with TensorRT | Alternate engine build path |
| ONNX Runtime + TRT EP | per project pin | Fallback runtime |
| PyTorch | per simple-baseline pin | FP16 baseline (mandatory) |
| jetson-stats / pynvml | latest | Thermal-throttle telemetry source |

**Error Handling Strategy**:
- `EngineBuildError`: surface to C10/operator pre-flight; takeoff blocked. **Never silently fall back** between runtimes — if the configured runtime can't build, the operator must explicitly switch.
- `EngineDeserializeError` at takeoff: refuse takeoff with explicit `(SM, JP, TRT, precision)` mismatch detail.
- `InferenceError` mid-flight (rare; e.g., transient CUDA fault): emit no result for that frame; the consumer (C2/C3) reports its own degraded path.
- `OutOfMemoryError`: same as above; surface to C13 FDR and C12 operator-tooling for post-flight investigation.
- `TelemetryUnavailableError`: jetson-stats hung or unavailable. Default to `thermal_throttle_active = false` (D-CROSS-LATENCY-1 stays on the steady-state path); log WARN.
- `CalibrationCacheError`: per D-C10-6, calibration cache trust is critical; if the cache hash mismatches, refuse to use it and force a rebuild.
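
The `TelemetryUnavailableError` policy above (default to "not throttling", log WARN) can be sketched as a thin wrapper. The function name `throttle_active_or_default`, the `read_throttle` callable, and the logger name are hypothetical; only the error name and the default come from the list above.

```python
import logging
from typing import Callable

logger = logging.getLogger("c7")


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's telemetry error type."""


def throttle_active_or_default(read_throttle: Callable[[], bool]) -> bool:
    """Return the throttle flag, falling back to False when telemetry is down."""
    try:
        return read_throttle()
    except TelemetryUnavailableError:
        # Policy from the text: stay on the steady-state path and log WARN.
        logger.warning(
            "C7 thermal telemetry unavailable; assuming thermal_throttle_active=false"
        )
        return False
```

Only `TelemetryUnavailableError` is caught here, so any other failure still propagates instead of being masked by the default.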

## 6. Extensions and Helpers

| Helper | Purpose | Used By |
|--------|---------|---------|
| `EngineFilenameSchema` | self-describing filename per D-C10-7 | C7, C10 |
| `Sha256Sidecar` | atomic write + content-hash sidecar pattern | C6, C7, C10 |
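
One common way to realise an "atomic write + content-hash sidecar" pattern is a temp-file write followed by `os.replace`, with the digest stored next to the payload. The sketch below is an assumption about what `Sha256Sidecar` might look like, not its actual implementation; the function names and the `.sha256` suffix are hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def _atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def _sidecar_path(path: Path) -> Path:
    return path.with_name(path.name + ".sha256")


def write_with_sidecar(path: Path, payload: bytes) -> str:
    """Write the payload, then its SHA-256 hex digest as a sidecar file."""
    digest = hashlib.sha256(payload).hexdigest()
    _atomic_write(path, payload)
    _atomic_write(_sidecar_path(path), digest.encode("ascii"))
    return digest


def verify_sidecar(path: Path) -> bool:
    """True only when the sidecar exists and matches the payload's digest."""
    sidecar = _sidecar_path(path)
    if not path.exists() or not sidecar.exists():
        return False
    expected = sidecar.read_text(encoding="ascii").strip()
    return hashlib.sha256(path.read_bytes()).hexdigest() == expected
```

Writing the payload before the sidecar means a crash between the two writes leaves a missing or stale digest, so `verify_sidecar` fails closed. For engine-sized files a chunked hash would avoid reading the whole payload into memory.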

## 7. Caveats & Edge Cases

**Known limitations**:
- TensorRT engines are NOT portable across `(SM, JP, TRT, precision)` tuples; Tier-1 (workstation Docker) cannot reuse Tier-2 (Jetson) engines. CI emits both tiers' engines as artifacts.
- INT8 calibration cache trust is the lurking foot-gun; D-C10-6 manifest-hash gate is the only protection. Any deviation breaks NFT-PERF-01 / NFT-LIM-01.

**Potential race conditions**:
- The thermal-throttle polling thread runs concurrently with the F3 hot path's `infer` calls, so reads of `thermal_state` MUST be thread-safe and must never block the hot path. Use a lock-free atomic snapshot for `thermal_state`.
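
A lock-free snapshot of the kind mentioned above can be approximated in Python by publishing an immutable object through a single reference assignment. The sketch below assumes CPython, where rebinding an attribute is atomic with respect to other threads; the `ThermalPoller` class and its `read_sensor` callable are hypothetical, and sensor error handling is omitted.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # monotonic nanoseconds


class ThermalPoller:
    """Background poller that publishes immutable snapshots without a lock."""

    def __init__(self, read_sensor: Callable[[], ThermalState], period_s: float = 1.0):
        self._read_sensor = read_sensor
        self._period_s = period_s
        self._latest: Optional[ThermalState] = None
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="c7-thermal", daemon=True
        )

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            # Build the new state fully, then publish it with one rebind.
            self._latest = self._read_sensor()
            self._stop.wait(self._period_s)

    def snapshot(self) -> Optional[ThermalState]:
        """Non-blocking read: returns the old or the new state, never a mix."""
        return self._latest
```

Because `ThermalState` is frozen and replaced wholesale, a reader on the hot path never observes a half-updated record and never waits on the poller.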

**Performance bottlenecks**:
- Per-frame inference cost is the F3 hot path's largest contributor. NFT-PERF-01 partition is the source of truth.

**Cycle-1 Tier-2 follow-up dependencies**:
- `OnnxTrtEpRuntime` — the module + class are implemented and the lower-level `inference_factory._RUNTIME_TO_BUILD_FLAG` maps `"onnx_trt_ep" → "BUILD_ONNX_TRT_EP_RUNTIME"`, but the **airborne** `C7_AIRBORNE_BUILD_FLAGS` tuple in `runtime_root/airborne_bootstrap.py` deliberately omits it (research-only per the AZ-621 task spec). Setting `config.components['c7_inference'].runtime = "onnx_trt_ep"` on an airborne binary raises `AirborneBootstrapError` from `_build_c7_inference` whose message lists ONLY the two airborne flag options (tensorrt / pytorch_fp16) — operators see a clean recovery path instead of a research-build escape hatch. Tier-2 follow-up: extend `C7_AIRBORNE_BUILD_FLAGS` (and gate it on `BUILD_ONNX_TRT_EP_RUNTIME=ON`) only if a future deployment scenario justifies the ORT-TRT-EP path on a flight binary; until then the runtime is exercised via unit-test composition and ad-hoc workstation runs only.

## 8. Dependency Graph

**Must be implemented after**: nothing internal — C7 is foundational.

**Can be implemented in parallel with**: C6, C13.

**Blocks**: C1 (CUDA strategies), C2, C2.5, C3, C3.5, C4 (consumes `ThermalState`), C10, F1, F2, F3, F6, F8.

## 9. Logging Strategy

| Log Level | When | Example |
|-----------|------|---------|
| ERROR | `EngineBuildError`, `EngineDeserializeError`, `OutOfMemoryError`, `CalibrationCacheError` | `C7 OOM during infer; backbone=ultravpr; frame=12345` |
| WARN | thermal-throttle entered/exited; telemetry unavailable | `C7 thermal throttle active; gpu_temp=83C; clock=750mhz` |
| INFO | Strategy ready; engine deserialised; backbone resident | `C7 ready: runtime=tensorrt, engines=[ultravpr@fp16, lightglue@fp16, disk@fp16]` |
| DEBUG | per-frame infer timing per backbone | `C7 infer backbone=ultravpr frame=12345 took=37ms` |

**Log format**: structured JSON.

**Log storage**: stdout / journald / FDR via C13 (ERROR + WARN always; thermal-state transitions always to FDR).
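
A structured JSON log line of the kind described above could be produced with a small `logging.Formatter` subclass. This is a sketch only: the `JsonFormatter` class, the `fields` attribute convention, and the key names (`level`, `component`, `message`, `backbone`, `frame`, `took_ms`) are assumptions modelled on the examples in the table, not the project's schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        # Per-call structured fields arrive via logger.debug(..., extra={"fields": {...}}).
        payload = dict(getattr(record, "fields", {}))
        # Reserved keys are written last so extra fields cannot override them.
        payload.update(
            {
                "level": record.levelname,
                "component": "C7",
                "message": record.getMessage(),
            }
        )
        return json.dumps(payload, sort_keys=True)


def make_c7_logger(stream=None) -> logging.Logger:
    """Hypothetical helper: a logger named 'c7' emitting JSON lines."""
    logger = logging.getLogger("c7")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger
```

A per-frame timing entry would then be emitted as `logger.debug("infer", extra={"fields": {"backbone": "ultravpr", "frame": 12345, "took_ms": 37}})`.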