C7 — On-Jetson Inference Runtime
1. High-Level Overview
Purpose: provide a single inference-runtime abstraction that all GPU-using components (C1 selectively, C2, C2.5, C3, C3.5) consume. Owns engine compilation (Polygraphy / trtexec / IBuilderConfig hybrid), engine deserialization at takeoff load, GPU memory management, INT8 calibration cache trust, and the thermal-throttle telemetry feed that drives the D-CROSS-LATENCY-1 hybrid in C4.
Architectural Pattern: Strategy — InferenceRuntime interface with three concrete implementations: TensorrtRuntime (production-default per D-C7-9 JetPack 6.2 + TensorRT 10.3 lock), OnnxTrtEpRuntime (fallback), PytorchFp16Runtime (mandatory simple-baseline). Selection at startup by config (ADR-001), build-time gating by BUILD_* flags (ADR-002), composition-root wired (ADR-009).
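The selection rule described above can be sketched as follows. This is a minimal illustration, not the project's composition root: the flag names (`BUILD_TENSORRT` and so on), the runtime labels, and the `select_runtime` helper are all hypothetical, and only one interface method is shown.

```python
from typing import Callable, Protocol


class InferenceRuntime(Protocol):
    """Minimal slice of the interface; the full method set is in section 2."""

    def current_runtime_label(self) -> str: ...


class TensorrtRuntime:
    def current_runtime_label(self) -> str:
        return "tensorrt"


class OnnxTrtEpRuntime:
    def current_runtime_label(self) -> str:
        return "onnx_trt_ep"


class PytorchFp16Runtime:
    def current_runtime_label(self) -> str:
        return "pytorch_fp16"


# Hypothetical stand-in for the BUILD_* flags: a runtime is only selectable
# when its flag was enabled for this binary.
BUILD_FLAGS = {
    "BUILD_TENSORRT": True,
    "BUILD_ONNX_TRT_EP": False,
    "BUILD_PYTORCH_FP16": True,
}

# label -> (gating flag, factory). Labels and flag names are assumptions.
_CANDIDATES: dict[str, tuple[str, Callable[[], InferenceRuntime]]] = {
    "tensorrt": ("BUILD_TENSORRT", TensorrtRuntime),
    "onnx_trt_ep": ("BUILD_ONNX_TRT_EP", OnnxTrtEpRuntime),
    "pytorch_fp16": ("BUILD_PYTORCH_FP16", PytorchFp16Runtime),
}


def select_runtime(configured_label: str, build_flags: dict[str, bool]) -> InferenceRuntime:
    """Return the configured runtime, or fail loudly; never pick another one."""
    entry = _CANDIDATES.get(configured_label)
    if entry is None:
        raise ValueError(f"unknown runtime label: {configured_label!r}")
    flag, factory = entry
    if not build_flags.get(flag, False):
        raise ValueError(f"runtime {configured_label!r} was not built ({flag} is off)")
    return factory()
```

Raising instead of falling back mirrors the rule in section 5 that the operator must switch runtimes explicitly.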
Upstream dependencies:
- C10 CacheProvisioner → during F1 (after C11 `TileDownloader` has populated C6) triggers engine compilation when no cached engine matches the `(SM, JP, TRT, precision)` tuple.
- F2 takeoff load → triggers `deserialize_cached_engine` for every model used by C1/C2/C2.5/C3/C3.5.
- jetson-stats / NVML → thermal-throttle telemetry source.
Downstream consumers:
- C2 VPR (backbone forward pass).
- C2.5 ReRanker (LightGlue forward pass).
- C3 CrossDomainMatcher (DISK / LightGlue / ALIKED / XFeat forward passes).
- C3.5 AdHoP (conditional refinement backbone).
- C1 (only the strategies that have a CUDA path; KltRansac is CPU-only).
- C4 (consumes `ThermalState` for the D-CROSS-LATENCY-1 covariance-mode decision).
2. Internal Interfaces
Interface: InferenceRuntime
| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| `compile_engine` | `model_path: Path, build_config: BuildConfig` | `EngineCacheEntry` | No (offline) | `EngineBuildError`, `CalibrationCacheError` |
| `deserialize_engine` | `EngineCacheEntry` | `EngineHandle` | No | `EngineDeserializeError` |
| `infer` | `EngineHandle, inputs: dict[str, Tensor]` | `dict[str, Tensor]` | No (sync GPU stream) | `InferenceError`, `OutOfMemoryError` |
| `release_engine` | `EngineHandle` | `None` | No | none |
| `thermal_state` | `()` | `ThermalState` | No | `TelemetryUnavailableError` |
| `current_runtime_label` | `()` | `string` | No | none |
Input/Output DTOs:
BuildConfig:
  precision: enum {fp16, int8, mixed}
  workspace_mb: int
  calibration_dataset: Path (required for int8)
  optimization_profiles: list[(input_name, min_shape, opt_shape, max_shape)]
EngineCacheEntry: see data_model.md
EngineHandle: opaque GPU-resident handle
ThermalState:
  cpu_temp_c: float
  gpu_temp_c: float
  thermal_throttle_active: bool
  measured_clock_mhz: int
  measured_at: monotonic_ns
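One possible Python rendering of these DTOs is sketched below, assuming frozen dataclasses. The `Precision` and `OptimizationProfile` names and the validation in `__post_init__` are illustrative assumptions. Only the "required for int8" rule comes from the field list above; whether `mixed` also needs a calibration dataset is not stated, so it is not enforced here.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional

Shape = tuple[int, ...]


class Precision(Enum):
    FP16 = "fp16"
    INT8 = "int8"
    MIXED = "mixed"


@dataclass(frozen=True)
class OptimizationProfile:
    """One (input_name, min_shape, opt_shape, max_shape) entry."""
    input_name: str
    min_shape: Shape
    opt_shape: Shape
    max_shape: Shape


@dataclass(frozen=True)
class BuildConfig:
    precision: Precision
    workspace_mb: int
    calibration_dataset: Optional[Path] = None
    optimization_profiles: tuple[OptimizationProfile, ...] = ()

    def __post_init__(self) -> None:
        # The spec marks calibration_dataset as required for int8 builds.
        if self.precision is Precision.INT8 and self.calibration_dataset is None:
            raise ValueError("calibration_dataset is required for int8")
        # Assumption: a non-positive workspace is never meaningful.
        if self.workspace_mb <= 0:
            raise ValueError("workspace_mb must be positive")


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # nanoseconds from time.monotonic_ns()
```

Frozen instances are convenient here because a `ThermalState` can then be handed between threads as an immutable snapshot.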
3. External API Specification
Not applicable.
4. Data Access Patterns
Queries
| Query | Frequency | Hot Path | Index Needed |
|---|---|---|---|
| `infer` for VPR backbone | 3 Hz | Yes | n/a |
| `infer` for LightGlue (×10 in C2.5, ×3 in C3) | 3 Hz × 13 = 39 Hz | Yes | n/a |
| `infer` for AdHoP (conditional) | <1 Hz typical | Yes (when invoked) | n/a |
| `thermal_state` poll | 1 Hz from C4 | No (sampled, not per-frame) | n/a |
Caching Strategy
| Data | Cache Type | TTL | Invalidation |
|---|---|---|---|
| Compiled `.engine` files | filesystem keyed by `(SM, JP, TRT, precision)` (D-C10-7) | bounded by JetPack/TRT version stability | manifest content-hash gate at takeoff (D-C10-3) |
| INT8 calibration cache | filesystem alongside `.engine` (D-C10-6) | bounded by calibration dataset version | rebuild when calibration dataset hash changes |
| Resident engine handles | GPU memory | flight lifetime | F8 reboot recovery re-deserialises |
Storage Estimates
| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|---|---|---|---|---|
| `.engine` files | one per (model × precision × backbone) | 50 MB – 500 MB / engine | up to ~1.5 GB across all backbones for a deployment binary | bounded by AC-8.3 carve-out |
| INT8 calibration caches | one per engine | 1–10 MB | <50 MB | as above |
Data Management
Seed data: pre-flight F1 provisioning compiles engines (or reuses cached). No mid-flight compilation.
Rollback: D-C10-7 self-describing filename schema (`<model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine`) makes stale engines visually obvious; F2 takeoff load refuses to deserialize an engine whose metadata doesn't match the host's current `(SM, JP, TRT)` tuple.
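The filename schema above could be formatted and parsed roughly as follows. This is a sketch under assumptions: the `EngineKey` type and the three helper functions are hypothetical, the example values (`87`, `6.2`, `10.3`) are illustrative, and the version fields are assumed to contain no underscores. The real F2 gate checks engine metadata; the filename-derived check here only illustrates the comparison.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineKey:
    model: str
    sm: str
    jp: str
    trt: str
    precision: str


# <model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine
_PATTERN = re.compile(
    r"^(?P<model>.+?)__sm(?P<sm>[^_]+)_jp(?P<jp>[^_]+)"
    r"_trt(?P<trt>[^_]+)_(?P<precision>[^_.]+)\.engine$"
)


def format_engine_filename(key: EngineKey) -> str:
    return f"{key.model}__sm{key.sm}_jp{key.jp}_trt{key.trt}_{key.precision}.engine"


def parse_engine_filename(name: str) -> EngineKey:
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a schema-conformant engine filename: {name!r}")
    return EngineKey(**match.groupdict())


def matches_host(key: EngineKey, host_sm: str, host_jp: str, host_trt: str) -> bool:
    """True only when the engine was built for exactly this host tuple."""
    return (key.sm, key.jp, key.trt) == (host_sm, host_jp, host_trt)
```

For example, `parse_engine_filename("ultravpr__sm87_jp6.2_trt10.3_fp16.engine")` yields a key that `matches_host(key, "87", "6.2", "10.3")` accepts and any other host tuple rejects.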
5. Implementation Details
Algorithmic Complexity: per-model forward-pass cost is the design driver. Engine build time is dominated by the builder's optimizer search rather than by any input size: minutes for INT8 with calibration, under a minute for FP16.
State Management:
- Owns the CUDA stream(s) for the runtime; one stream per concurrent consumer (typically one stream because the F3 hot path is single-threaded).
- Owns the resident engine handles for the duration of a flight.
- Owns the polling loop for thermal-throttle telemetry (1 Hz background thread).
Key Dependencies:
| Library | Version | Purpose |
|---|---|---|
| TensorRT (C++ + Python) | 10.3 (JetPack 6.2 pin) per D-C7-9 | Primary engine compile + deserialize + infer |
| Polygraphy | matches TensorRT | Engine build orchestration |
| trtexec | bundled with TensorRT | Alternate engine build path |
| ONNX Runtime + TRT EP | per project pin | Fallback runtime |
| PyTorch | per simple-baseline pin | FP16 baseline (mandatory) |
| jetson-stats / pynvml | latest | Thermal-throttle telemetry source |
Error Handling Strategy:
- `EngineBuildError`: surface to C10/operator pre-flight; takeoff blocked. Never silently fall back between runtimes; if the configured runtime can't build, the operator must explicitly switch.
- `EngineDeserializeError` at takeoff: refuse takeoff with explicit `(SM, JP, TRT, precision)` mismatch detail.
- `InferenceError` mid-flight (rare; e.g., transient CUDA fault): emit no result for that frame; the consumer (C2/C3) reports its own degraded path.
- `OutOfMemoryError`: same as above; surface to C13 FDR and C12 operator-tooling for post-flight investigation.
- `TelemetryUnavailableError`: jetson-stats hung or unavailable. Default to `thermal_throttle_active = false` (D-CROSS-LATENCY-1 stays on the steady-state path); log WARN.
- `CalibrationCacheError`: per D-C10-6, calibration cache trust is critical; if the cache hash mismatches, refuse to use it and force a rebuild.
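Two of the mid-flight policies above (no result for a failed frame, and a non-throttled default when telemetry is unavailable) can be sketched as thin wrappers. The wrapper names and the callable-based signatures are hypothetical; the exception classes are local stand-ins for the error types named in section 2.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("C7")


class InferenceError(RuntimeError):
    """Stand-in for the InferenceError named in the interface table."""


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the TelemetryUnavailableError named in the interface table."""


def infer_or_none(run_infer: Callable[[], dict]) -> Optional[dict]:
    """Run one frame's inference; on InferenceError emit no result.

    Returning None leaves the degraded-path decision to the consumer (C2/C3).
    """
    try:
        return run_infer()
    except InferenceError:
        return None


def throttle_active_or_default(read_throttle: Callable[[], bool]) -> bool:
    """Read the throttle flag; default to False when telemetry is unavailable."""
    try:
        return read_throttle()
    except TelemetryUnavailableError:
        log.warning("C7 telemetry unavailable; assuming thermal_throttle_active=false")
        return False
```

Build-time and takeoff-time errors are deliberately not wrapped this way, since the policy for those is to block takeoff rather than continue.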
6. Extensions and Helpers
| Helper | Purpose | Used By |
|---|---|---|
| `EngineFilenameSchema` | self-describing filename per D-C10-7 | C7, C10 |
| `Sha256Sidecar` | atomic write + content-hash sidecar pattern | C6, C7, C10 |
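The atomic write plus content-hash sidecar pattern might look like the sketch below. The function names and the `.sha256` sidecar suffix are assumptions rather than the project's actual `Sha256Sidecar` API. The approach is the standard one: write to a temporary file in the same directory, `fsync`, then `os.replace` it into place.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def _sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large engine files are never fully in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _atomic_write_bytes(target: Path, data: bytes) -> None:
    # The temp file lives in the target directory so os.replace stays on one
    # filesystem, which is what makes the rename atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_with_sidecar(target: Path, data: bytes) -> Path:
    """Write the payload, then its hash sidecar; returns the sidecar path."""
    _atomic_write_bytes(target, data)
    sidecar = target.with_name(target.name + ".sha256")
    _atomic_write_bytes(sidecar, hashlib.sha256(data).hexdigest().encode("ascii"))
    return sidecar


def verify_sidecar(target: Path) -> bool:
    """True only if both files exist and the recorded hash matches the payload."""
    sidecar = target.with_name(target.name + ".sha256")
    if not (target.is_file() and sidecar.is_file()):
        return False
    return sidecar.read_text(encoding="ascii").strip() == _sha256_of(target)
```

Because the payload is written before the sidecar, a crash between the two writes leaves a stale or missing sidecar, which `verify_sidecar` reports as a failure rather than a false pass.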
7. Caveats & Edge Cases
Known limitations:
- TensorRT engines are NOT portable across `(SM, JP, TRT, precision)` tuples; Tier-1 (workstation Docker) cannot reuse Tier-2 (Jetson) engines. CI emits both tiers' engines as artifacts.
- INT8 calibration cache trust is the lurking foot-gun; D-C10-6 manifest-hash gate is the only protection. Any deviation breaks NFT-PERF-01 / NFT-LIM-01.
Potential race conditions:
- The thermal-throttle polling thread MUST be safe to run concurrently with the F3 hot path's `infer` calls and must never block them. Use a lock-free atomic snapshot for `thermal_state`.
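A lock-free snapshot can be approximated in Python by publishing an immutable object through a single attribute assignment. The sketch below assumes CPython, where rebinding a reference is atomic under the GIL; a C++ implementation would need an explicit atomic or seqlock instead. `ThermalMonitor` and its `read_sensor` callable are hypothetical names.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # time.monotonic_ns()


class ThermalMonitor:
    """Background poller that publishes immutable snapshots without a lock."""

    def __init__(self, read_sensor: Callable[[], ThermalState], period_s: float = 1.0):
        self._read_sensor = read_sensor
        self._period_s = period_s
        self._snapshot: Optional[ThermalState] = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            # Build the whole snapshot first, then publish it with one
            # reference store; readers never see a half-updated state.
            self._snapshot = self._read_sensor()
            self._stop.wait(self._period_s)

    def thermal_state(self) -> Optional[ThermalState]:
        # One reference load; never blocks the caller (the hot path).
        return self._snapshot
```

Readers get either the previous snapshot or the new one, never a mixture, because the dataclass is frozen and only the reference changes. A sensor read that raises would end the polling thread in this sketch; a real implementation would catch that and map it to the telemetry-unavailable path.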
Performance bottlenecks:
- Per-frame inference cost is the F3 hot path's largest contributor. NFT-PERF-01 partition is the source of truth.
8. Dependency Graph
Must be implemented after: nothing internal — C7 is foundational.
Can be implemented in parallel with: C6, C13.
Blocks: C1 (CUDA strategies), C2, C2.5, C3, C3.5, C4 (consumes ThermalState), C10, F1, F2, F3, F6, F8.
9. Logging Strategy
| Log Level | When | Example |
|---|---|---|
| ERROR | EngineBuildError, EngineDeserializeError, OutOfMemoryError, CalibrationCacheError | C7 OOM during infer; backbone=ultravpr; frame=12345 |
| WARN | thermal-throttle entered/exited; telemetry unavailable | C7 thermal throttle active; gpu_temp=83C; clock=750mhz |
| INFO | Strategy ready; engine deserialised; backbone resident | C7 ready: runtime=tensorrt, engines=[ultravpr@fp16, lightglue@fp16, disk@fp16] |
| DEBUG | per-frame infer timing per backbone | C7 infer backbone=ultravpr frame=12345 took=37ms |
Log format: structured JSON. Log storage: stdout / journald / FDR via C13 (ERROR + WARN always; thermal-state transitions always to FDR).
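As an illustration of the structured-JSON format, the sketch below emits one JSON object per record using the standard `logging` module. The key names (`level`, `component`, `message`) and the `fields` convention passed through `extra` are assumptions; the spec only fixes that logs are structured JSON.

```python
import json
import logging
import sys
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        # Python names the level WARNING; the table above uses WARN.
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        payload = {
            "level": level,
            "component": record.name,
            "message": record.getMessage(),
        }
        # Structured key/value pairs arrive via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream: Optional[TextIO] = None) -> logging.Logger:
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("C7")
    logger.handlers = [handler]
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger
```

A call such as `make_logger().warning("thermal throttle active", extra={"fields": {"gpu_temp_c": 83, "clock_mhz": 750}})` would then carry the same information as the WARN example in the table, as machine-readable keys.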