gps-denied-onboard/_docs/02_document/components/09_c7_inference/description.md
Oleksandr Bezdieniezhnykh 64542d32fc Update autodev state, architecture documentation, and glossary terms
Transitioned the autodev state to phase 21, reflecting the completion of Step 5 and the drafting of Step 6 epics. Revised the architecture documentation to clarify the roles of the Tile Manager and its components, ensuring accurate representation of the system's operational flow. Updated glossary entries for Flight State and Operator to incorporate recent changes and enhance clarity on component interactions and responsibilities.
2026-05-10 00:21:34 +03:00


C7 — On-Jetson Inference Runtime

1. High-Level Overview

Purpose: provide a single inference-runtime abstraction that all GPU-using components (C1 selectively, C2, C2.5, C3, C3.5) consume. Owns engine compilation (Polygraphy / trtexec / IBuilderConfig hybrid), engine deserialization at takeoff load, GPU memory management, INT8 calibration cache trust, and the thermal-throttle telemetry feed that drives the D-CROSS-LATENCY-1 hybrid in C4.

Architectural Pattern: Strategy — InferenceRuntime interface with three concrete implementations: TensorrtRuntime (production-default per D-C7-9 JetPack 6.2 + TensorRT 10.3 lock), OnnxTrtEpRuntime (fallback), PytorchFp16Runtime (mandatory simple-baseline). Selection at startup by config (ADR-001), build-time gating by BUILD_* flags (ADR-002), composition-root wired (ADR-009).
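The startup selection described above can be sketched roughly as follows. This is a minimal illustration, not the project's composition root: the config keys (`"tensorrt"`, `"onnx_trt_ep"`, `"pytorch_fp16"`), the `BUILD_*` flag names, `select_runtime`, and `RuntimeSelectionError` are all hypothetical names chosen for the example. The one behaviour it tries to show is that the configured runtime is either returned or rejected loudly, with no fallback chain.

```python
from typing import Mapping


class RuntimeSelectionError(RuntimeError):
    """Configured runtime is unknown or was not built into this binary."""


class TensorrtRuntime:
    label = "tensorrt"


class OnnxTrtEpRuntime:
    label = "onnx_trt_ep"


class PytorchFp16Runtime:
    label = "pytorch_fp16"


# Hypothetical config keys and BUILD_* flag names; the real ones are defined
# by ADR-001 / ADR-002 and are not reproduced here.
_REGISTRY = {
    "tensorrt": ("BUILD_TENSORRT", TensorrtRuntime),
    "onnx_trt_ep": ("BUILD_ONNX_TRT_EP", OnnxTrtEpRuntime),
    "pytorch_fp16": ("BUILD_PYTORCH_FP16", PytorchFp16Runtime),
}


def select_runtime(configured: str, build_flags: Mapping[str, bool]):
    """Return the one configured runtime, or fail loudly. No fallback chain."""
    if configured not in _REGISTRY:
        raise RuntimeSelectionError(f"unknown runtime {configured!r}")
    flag, runtime_cls = _REGISTRY[configured]
    if not build_flags.get(flag, False):
        raise RuntimeSelectionError(
            f"runtime {configured!r} requested but {flag} is not enabled"
        )
    return runtime_cls()
```

A caller would do something like `select_runtime("tensorrt", {"BUILD_TENSORRT": True})` once at startup and pass the result to every GPU-using component.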

Upstream dependencies:

  • C10 CacheProvisioner → during F1 (after C11 TileDownloader has populated C6) triggers engine compilation when no cached engine matches the (SM, JP, TRT, precision) tuple.
  • F2 takeoff load → triggers deserialize_engine for every cached engine used by C1/C2/C2.5/C3/C3.5.
  • jetson-stats / NVML → thermal-throttle telemetry source.

Downstream consumers:

  • C2 VPR (backbone forward pass).
  • C2.5 ReRanker (LightGlue forward pass).
  • C3 CrossDomainMatcher (DISK / LightGlue / ALIKED / XFeat forward passes).
  • C3.5 AdHoP (conditional refinement backbone).
  • C1 (only the strategies that have a CUDA path; KltRansac is CPU-only).
  • C4 (consumes ThermalState for the D-CROSS-LATENCY-1 covariance-mode decision).

2. Internal Interfaces

Interface: InferenceRuntime

| Method | Input | Output | Async | Error Types |
| --- | --- | --- | --- | --- |
| compile_engine | model_path: Path, build_config: BuildConfig | EngineCacheEntry | No (offline) | EngineBuildError, CalibrationCacheError |
| deserialize_engine | EngineCacheEntry | EngineHandle | No | EngineDeserializeError |
| infer | EngineHandle, inputs: dict[str, Tensor] | dict[str, Tensor] | No (sync GPU stream) | InferenceError, OutOfMemoryError |
| release_engine | EngineHandle | None | No | |
| thermal_state | () | ThermalState | No | TelemetryUnavailableError |
| current_runtime_label | () | string | No | |
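As a rough illustration, the interface above could be written as a Python abstract base class. The method names and signatures follow the table; `Tensor` is left as a placeholder alias, the exception base classes are assumptions, and `EchoRuntime` is a hypothetical GPU-free test double that is not one of the three real implementations.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict

Tensor = Any  # placeholder; the concrete tensor type is not specified here


class InferenceError(RuntimeError):
    pass


class TelemetryUnavailableError(RuntimeError):
    pass


class InferenceRuntime(ABC):
    @abstractmethod
    def compile_engine(self, model_path: Path, build_config: Any) -> Any: ...

    @abstractmethod
    def deserialize_engine(self, entry: Any) -> Any: ...

    @abstractmethod
    def infer(self, handle: Any, inputs: Dict[str, Tensor]) -> Dict[str, Tensor]: ...

    @abstractmethod
    def release_engine(self, handle: Any) -> None: ...

    @abstractmethod
    def thermal_state(self) -> Any: ...

    @abstractmethod
    def current_runtime_label(self) -> str: ...


class EchoRuntime(InferenceRuntime):
    """Hypothetical test double: returns inputs unchanged, touches no GPU."""

    def compile_engine(self, model_path, build_config):
        return {"model": Path(model_path).stem}

    def deserialize_engine(self, entry):
        return object()  # stands in for the opaque EngineHandle

    def infer(self, handle, inputs):
        return dict(inputs)

    def release_engine(self, handle):
        return None

    def thermal_state(self):
        raise TelemetryUnavailableError("no telemetry in the test double")

    def current_runtime_label(self):
        return "echo"
```

Because every method is abstract, a partially implemented runtime fails at construction time rather than at the first call on the hot path.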

Input/Output DTOs:

BuildConfig:
  precision:                   enum {fp16, int8, mixed}
  workspace_mb:                int
  calibration_dataset:         Path (required for int8)
  optimization_profiles:       list[(input_name, min_shape, opt_shape, max_shape)]

EngineCacheEntry:              see data_model.md
EngineHandle:                  opaque GPU-resident handle

ThermalState:
  cpu_temp_c:                  float
  gpu_temp_c:                  float
  thermal_throttle_active:     bool
  measured_clock_mhz:          int
  measured_at:                 monotonic_ns
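A possible Python rendering of the two DTOs is sketched below, using frozen dataclasses so instances can be shared across threads without copying. The field names come from the listing above. The validation rules are partly assumptions: "calibration_dataset required for int8" is stated above, `min <= opt <= max` per dimension is what TensorRT optimization profiles require, and the positive-workspace check is added only for illustration. Whether `mixed` also needs a calibration dataset is not stated, so the sketch does not enforce it.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple

Shape = Tuple[int, ...]
Profile = Tuple[str, Shape, Shape, Shape]  # (input_name, min, opt, max)


class Precision(Enum):
    FP16 = "fp16"
    INT8 = "int8"
    MIXED = "mixed"


@dataclass(frozen=True)
class BuildConfig:
    precision: Precision
    workspace_mb: int
    calibration_dataset: Optional[Path] = None
    optimization_profiles: Tuple[Profile, ...] = ()

    def __post_init__(self) -> None:
        if self.workspace_mb <= 0:  # assumption: zero workspace is a config bug
            raise ValueError("workspace_mb must be positive")
        if self.precision is Precision.INT8 and self.calibration_dataset is None:
            raise ValueError("calibration_dataset is required for int8")
        for name, lo, opt, hi in self.optimization_profiles:
            if not (len(lo) == len(opt) == len(hi)):
                raise ValueError(f"profile {name!r}: shapes differ in rank")
            if any(not (a <= b <= c) for a, b, c in zip(lo, opt, hi)):
                raise ValueError(f"profile {name!r}: need min <= opt <= max")


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # nanoseconds from time.monotonic_ns()
```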

3. External API Specification

Not applicable.

4. Data Access Patterns

Queries

| Query | Frequency | Hot Path | Index Needed |
| --- | --- | --- | --- |
| infer for VPR backbone | 3 Hz | Yes | n/a |
| infer for LightGlue (×10 in C2.5, ×3 in C3) | 3 Hz × 13 = 39 Hz | Yes | n/a |
| infer for AdHoP (conditional) | <1 Hz typical | Yes (when invoked) | n/a |
| thermal_state poll | 1 Hz from C4 | No (sampled, not per-frame) | n/a |

Caching Strategy

| Data | Cache Type | TTL | Invalidation |
| --- | --- | --- | --- |
| Compiled .engine files | filesystem keyed by (SM, JP, TRT, precision) (D-C10-7) | bounded by JetPack/TRT version stability | manifest content-hash gate at takeoff (D-C10-3) |
| INT8 calibration cache | filesystem alongside .engine (D-C10-6) | bounded by calibration dataset version | rebuild when calibration dataset hash changes |
| Resident engine handles | GPU memory | flight lifetime | F8 reboot recovery re-deserialises |

Storage Estimates

| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
| --- | --- | --- | --- | --- |
| .engine files | one per (model × precision × backbone) | 50-500 MB / engine | up to ~1.5 GB across all backbones for a deployment binary | bounded by AC-8.3 carve-out |
| INT8 calibration caches | one per engine | 1-10 MB | <50 MB | as above |

Data Management

Seed data: pre-flight F1 provisioning compiles engines (or reuses cached). No mid-flight compilation.

Rollback: D-C10-7 self-describing filename schema (<model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine) makes stale engines visually obvious; F2 takeoff load refuses to deserialize an engine whose metadata doesn't match the host's current (SM, JP, TRT) tuple.
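The filename schema and the host-tuple check can be sketched as below. This is an illustration only: the source says F2 compares the engine's metadata, and the sketch uses the parsed filename as a stand-in for that metadata. How SM, JP, and TRT are encoded as strings (for example `87`, `6.2`, `10.3`) is an assumption, as are the names `EngineKey`, `format_engine_filename`, `parse_engine_filename`, and `matches_host`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineKey:
    model: str
    sm: str
    jp: str
    trt: str
    precision: str


# <model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine
_PATTERN = re.compile(
    r"^(?P<model>.+?)__sm(?P<sm>[0-9]+)_jp(?P<jp>[0-9.]+)"
    r"_trt(?P<trt>[0-9.]+)_(?P<precision>fp16|int8|mixed)\.engine$"
)


def format_engine_filename(key: EngineKey) -> str:
    return f"{key.model}__sm{key.sm}_jp{key.jp}_trt{key.trt}_{key.precision}.engine"


def parse_engine_filename(name: str) -> EngineKey:
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a schema-conformant engine filename: {name!r}")
    return EngineKey(**match.groupdict())


def matches_host(key: EngineKey, host_sm: str, host_jp: str, host_trt: str) -> bool:
    """True only when the engine was built for exactly this (SM, JP, TRT)."""
    return (key.sm, key.jp, key.trt) == (host_sm, host_jp, host_trt)
```

Under these assumptions, an engine named `ultravpr__sm87_jp6.2_trt10.3_fp16.engine` would be accepted on a host reporting `("87", "6.2", "10.3")` and refused on any other tuple.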

5. Implementation Details

Algorithmic Complexity: per-model forward pass cost is the design driver. Engine build time is dominated by the TensorRT builder's tactic search and, for INT8, the calibration pass: minutes for INT8 with calibration; sub-minute for FP16.

State Management:

  • Owns the CUDA stream(s) for the runtime; one stream per concurrent consumer (typically one stream because the F3 hot path is single-threaded).
  • Owns the resident engine handles for the duration of a flight.
  • Owns the polling loop for thermal-throttle telemetry (1 Hz background thread).

Key Dependencies:

| Library | Version | Purpose |
| --- | --- | --- |
| TensorRT (C++ + Python) | 10.3 (JetPack 6.2 pin) per D-C7-9 | Primary engine compile + deserialize + infer |
| Polygraphy | matches TensorRT | Engine build orchestration |
| trtexec | bundled with TensorRT | Alternate engine build path |
| ONNX Runtime + TRT EP | per project pin | Fallback runtime |
| PyTorch | per simple-baseline pin | FP16 baseline (mandatory) |
| jetson-stats / pynvml | latest | Thermal-throttle telemetry source |

Error Handling Strategy:

  • EngineBuildError: surface to C10/operator pre-flight; takeoff blocked. Never silently fall back between runtimes — if the configured runtime can't build, the operator must explicitly switch.
  • EngineDeserializeError at takeoff: refuse takeoff with explicit (SM, JP, TRT, precision) mismatch detail.
  • InferenceError mid-flight (rare; e.g., transient CUDA fault): emit no result for that frame; the consumer (C2/C3) reports its own degraded path.
  • OutOfMemoryError: same as above; surface to C13 FDR and C12 operator-tooling for post-flight investigation.
  • TelemetryUnavailableError: jetson-stats hung or unavailable. Default to "thermal_throttle_active = false" (D-CROSS-LATENCY-1 stays on the steady-state path); log WARN.
  • CalibrationCacheError: per D-C10-6, calibration cache trust is critical; if the cache hash mismatches, refuse to use it and force a rebuild.
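Two of the policies above lend themselves to a short sketch: treating unavailable telemetry as "not throttled" with a WARN, and emitting no result for a frame when inference fails mid-flight. The wrapper names `throttle_active_or_default` and `infer_or_skip` are hypothetical, and the exception classes are redefined locally only to keep the example self-contained.

```python
import logging
from types import SimpleNamespace
from typing import Any, Callable, Dict, Optional


class TelemetryUnavailableError(RuntimeError):
    pass


class InferenceError(RuntimeError):
    pass


class OutOfMemoryError(RuntimeError):
    pass


def throttle_active_or_default(
    read_thermal_state: Callable[[], Any], log: logging.Logger
) -> bool:
    """Return the throttle flag; fall back to False when telemetry is down."""
    try:
        return bool(read_thermal_state().thermal_throttle_active)
    except TelemetryUnavailableError as exc:
        log.warning("C7 telemetry unavailable; assuming no throttle: %s", exc)
        return False


def infer_or_skip(
    infer: Callable[[Any, Dict[str, Any]], Dict[str, Any]],
    handle: Any,
    inputs: Dict[str, Any],
    log: logging.Logger,
) -> Optional[Dict[str, Any]]:
    """Run one forward pass; return None so the consumer takes its degraded path."""
    try:
        return infer(handle, inputs)
    except (InferenceError, OutOfMemoryError) as exc:
        log.error("C7 infer failed; no result for this frame: %s", exc)
        return None


if __name__ == "__main__":
    logger = logging.getLogger("c7")
    healthy = lambda: SimpleNamespace(thermal_throttle_active=True)
    print(throttle_active_or_default(healthy, logger))
```

Build-time and takeoff-time errors (EngineBuildError, EngineDeserializeError, CalibrationCacheError) are deliberately not wrapped here, since the text requires them to block takeoff rather than degrade.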

6. Extensions and Helpers

| Helper | Purpose | Used By |
| --- | --- | --- |
| EngineFilenameSchema | self-describing filename per D-C10-7 | C7, C10 |
| Sha256Sidecar | atomic write + content-hash sidecar pattern | C6, C7, C10 |
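The "atomic write + content-hash sidecar" pattern named for Sha256Sidecar might look roughly like this. It is a sketch under assumptions: the sidecar is taken to be `<file>.sha256` holding a hex digest, and the function names are hypothetical. The payload is written to a temporary file in the same directory, flushed to disk, and moved into place with `os.replace`, so a reader sees either the old file or the complete new one.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sidecar_path(target: Path) -> Path:
    return target.with_name(target.name + ".sha256")  # assumed naming


def _atomic_write(path: Path, data: bytes) -> None:
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp, path)      # atomic within one filesystem
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def _sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_with_sidecar(target: Path, payload: bytes) -> str:
    """Write payload, then its digest; return the hex digest."""
    target = Path(target)
    digest = hashlib.sha256(payload).hexdigest()
    _atomic_write(target, payload)
    _atomic_write(sidecar_path(target), (digest + "\n").encode("ascii"))
    return digest


def verify_sidecar(target: Path) -> bool:
    """True only when both files exist and the recorded digest matches."""
    target = Path(target)
    side = sidecar_path(target)
    if not (target.is_file() and side.is_file()):
        return False
    return side.read_text(encoding="ascii").strip() == _sha256_file(target)
```

Writing the payload before the sidecar means a crash between the two steps leaves a stale or missing digest, which fails verification and pushes the caller toward a rebuild rather than toward trusting a half-updated file.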

7. Caveats & Edge Cases

Known limitations:

  • TensorRT engines are NOT portable across (SM, JP, TRT, precision) tuples; Tier-1 (workstation Docker) cannot reuse Tier-2 (Jetson) engines. CI emits both tiers' engines as artifacts.
  • INT8 calibration cache trust is the lurking foot-gun; D-C10-6 manifest-hash gate is the only protection. Any deviation breaks NFT-PERF-01 / NFT-LIM-01.

Potential race conditions:

  • The thermal-throttle polling thread MUST be thread-safe with respect to the F3 hot path's infer calls and MUST NOT block them. Publish thermal_state as a lock-free atomic snapshot that readers fetch without taking a lock.
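One way to picture the lock-free snapshot is the sketch below. It assumes a Python implementation, where rebinding an attribute to a new immutable object is a single reference store, so a reader sees either the previous snapshot or the new one and never a half-written mix. A C++ implementation would need an explicit atomic or seqlock instead. `ThermalMonitor` and its methods are hypothetical names, and returning `None` to signal "telemetry unavailable" is an assumption of the sketch.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # time.monotonic_ns()


class ThermalMonitor:
    def __init__(self, read_sensor: Callable[[], ThermalState], period_s: float = 1.0):
        self._read_sensor = read_sensor
        self._period_s = period_s
        self._latest: Optional[ThermalState] = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="c7-thermal", daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                # Build the new immutable snapshot fully, then publish it
                # with one reference rebind; no lock is taken.
                self._latest = self._read_sensor()
            except Exception:
                self._latest = None  # readers treat None as "unavailable"
            self._stop.wait(self._period_s)

    def snapshot(self) -> Optional[ThermalState]:
        """Called from the hot path; never blocks on the polling thread."""
        return self._latest


def fake_sensor() -> ThermalState:
    """Stand-in for a jetson-stats / NVML read, for illustration only."""
    return ThermalState(45.0, 61.5, False, 1300, time.monotonic_ns())
```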

Performance bottlenecks:

  • Per-frame inference cost is the F3 hot path's largest contributor. NFT-PERF-01 partition is the source of truth.

8. Dependency Graph

Must be implemented after: nothing internal — C7 is foundational.

Can be implemented in parallel with: C6, C13.

Blocks: C1 (CUDA strategies), C2, C2.5, C3, C3.5, C4 (consumes ThermalState), C10, F1, F2, F3, F6, F8.

9. Logging Strategy

| Log Level | When | Example |
| --- | --- | --- |
| ERROR | EngineBuildError, EngineDeserializeError, OutOfMemoryError, CalibrationCacheError | C7 OOM during infer; backbone=ultravpr; frame=12345 |
| WARN | thermal-throttle entered/exited; telemetry unavailable | C7 thermal throttle active; gpu_temp=83C; clock=750mhz |
| INFO | Strategy ready; engine deserialised; backbone resident | C7 ready: runtime=tensorrt, engines=[ultravpr@fp16, lightglue@fp16, disk@fp16] |
| DEBUG | per-frame infer timing per backbone | C7 infer backbone=ultravpr frame=12345 took=37ms |

Log format: structured JSON. Log storage: stdout / journald / FDR via C13 (ERROR + WARN always; thermal-state transitions always to FDR).
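A structured-JSON formatter along these lines could be built on the standard `logging` module. The JSON key names (`level`, `component`, `msg`) and the convention of passing per-event fields through `extra={"fields": {...}}` are assumptions made for the sketch; the source only states that the format is structured JSON. Routing to journald or the FDR is left out.

```python
import json
import logging
import sys

# Python's logging module spells the level "WARNING"; the table above uses "WARN".
_LEVEL_NAMES = {"WARNING": "WARN"}


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname),
            "component": "C7",
            "msg": record.getMessage(),
        }
        # Per-event fields arrive via logger.xxx(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("c7")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger(sys.stdout)
    log.warning(
        "thermal throttle active",
        extra={"fields": {"gpu_temp_c": 83, "clock_mhz": 750}},
    )
```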