Batch 3 of the cycle-1 component-doc sync. For each of C6
(tile_cache), C7 (inference), C8 (fc_adapter):
- Append "Cycle-1 operational reality" paragraph to § 1
  documenting the actual cycle-1 wiring path:
  - C6: infrastructure seeded via build_pre_constructed's
    c6_descriptor_index (BUILD_FAISS_INDEX-gated) and
    c6_tile_store slots; no _STRATEGY_REGISTRY slot;
    AZ-687 replay-mode guard skips both seeds when the
    minimal replay Config omits the c6_tile_cache block.
  - C7: single InferenceRuntime built once via
    _build_c7_inference, identity-shared as the engine
    source for c3_lightglue_runtime (AZ-622 phase D);
    C7_AIRBORNE_BUILD_FLAGS lists tensorrt (production-
    default) + pytorch_fp16 (Tier-0 fallback);
    onnx_trt_ep deliberately omitted from airborne flags;
    AZ-687 replay-mode guard cascades to c3_lightglue_runtime.
  - C8: composed via a SEPARATE registry path
    (runtime_root/fc_factory.py) with its own _FC_REGISTRY
    + _GCS_REGISTRY; per-binary bootstrap modules register
    concrete strategies under BUILD_FC_* / BUILD_GCS_*
    flags; bind_outbound_emit_thread enforces the
    single-writer outbound invariant (AC-6).
- Add "Cycle-1 Tier-2 follow-up dependencies" subsection
  in § 7 of C7 only: onnx_trt_ep is implemented and the
  inference_factory recognises BUILD_ONNX_TRT_EP_RUNTIME,
  but airborne config selecting it raises a clean
  AirborneBootstrapError pointing only at the two airborne
  options. C6 and C8 have no parked Tier-2 strategies for
  cycle-1.
None of c6/c7/c8 import cv2 directly, so no OpenCV pin
row is added to § 5 (D-CROSS-CVE-1 leftover stays as it
is; the relaxed pin is recorded against c2.5/c3/c3.5/c4/c5
where the imports actually live).
Also refresh the D-CROSS-CVE-1 leftover replay timestamp
(condition still upstream-gated: gtsam wheels remain
numpy<2) and bump the autodev state's sub_step.detail to
record "batch 3/~5 done (c6/c7/c8); 4 components + 8
helpers + tests/ remain".
Co-authored-by: Cursor <cursoragent@cursor.com>
C7 — On-Jetson Inference Runtime
1. High-Level Overview
Purpose: provide a single inference-runtime abstraction that all GPU-using components (C1 selectively, C2, C2.5, C3, C3.5) consume. Owns engine compilation (Polygraphy / trtexec / IBuilderConfig hybrid), engine deserialization at takeoff load, GPU memory management, INT8 calibration cache trust, and the thermal-throttle telemetry feed that drives the D-CROSS-LATENCY-1 hybrid in C4.
Architectural Pattern: Strategy — InferenceRuntime interface with three concrete implementations: TensorrtRuntime (production-default per D-C7-9 JetPack 6.2 + TensorRT 10.3 lock), OnnxTrtEpRuntime (fallback), PytorchFp16Runtime (mandatory simple-baseline). Selection at startup by config (ADR-001), build-time gating by BUILD_* flags (ADR-002), composition-root wired (ADR-009).
Cycle-1 operational reality: C7 is infrastructure shared across consumers; it does NOT have its own slot in the _STRATEGY_REGISTRY populated by register_airborne_strategies() (AZ-591). Instead, the airborne binary builds the InferenceRuntime once via runtime_root/airborne_bootstrap.py::_build_c7_inference → inference_factory.build_inference_runtime and seeds the single instance into pre_constructed["c7_inference"] (AZ-621 / Phase C). The same instance is reused as the engine source for the shared LightGlueRuntime load (AZ-622 / Phase D, _build_c3_lightglue_runtime), so the bootstrap never double-builds the runtime. Downstream wrappers (c2_vpr / c3_matcher / c3_5_adhop, per AIRBORNE_REQUIRED_PRE_CONSTRUCTED_KEYS) then receive the identity-shared runtime via compose_root's constructor injection.
Airborne-buildable runtimes are gated by C7_AIRBORNE_BUILD_FLAGS = (("tensorrt", "BUILD_TENSORRT_RUNTIME"), ("pytorch_fp16", "BUILD_PYTORCH_FP16_RUNTIME")). tensorrt is the production default; pytorch_fp16 is the Tier-0 / workstation fallback and is also the conservative C7InferenceConfig.runtime default, so unconfigured test environments resolve to the Tier-0 baseline. onnx_trt_ep is deliberately omitted from the airborne flag matrix even though inference_factory._RUNTIME_TO_BUILD_FLAG recognises it (see § 7, Cycle-1 Tier-2 follow-up dependencies).
When no airborne runtime is buildable (both BUILD_TENSORRT_RUNTIME and BUILD_PYTORCH_FP16_RUNTIME OFF, or the configured runtime's flag OFF) and any configured consumer still requires c7_inference, _build_c7_inference surfaces the upstream RuntimeNotAvailableError as an AirborneBootstrapError (AC-621.2). The error names the missing key, BOTH airborne BUILD_* flags and their runtimes, and the consuming component slug(s), narrowed to the configured consumers when available.
AZ-687 replay-mode guard: when config.mode == "replay" and the minimal replay Config omits the c7_inference block, build_pre_constructed skips both c7_inference AND the cascading c3_lightglue_runtime seed (the LightGlue runtime depends on the inference runtime); the c2_vpr / c3_matcher / c2_5_rerank / c3_5_adhop wrappers that would have consumed the runtime are likewise absent from the replay Config and therefore never look at the skipped slot.
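A minimal sketch of the seeding order and the AZ-687 guard, assuming a plain-dict Config and injected builder callables. seed_c7_slots and the builders mapping are hypothetical: they model only the C7 slice of build_pre_constructed (the C6 seeds and the real signatures are not shown). The point is that the LightGlue seed is built from the same runtime object and is skipped together with it.

```python
from typing import Any, Callable, Dict

Builder = Callable[[Any], Any]


def seed_c7_slots(config: Dict[str, Any], builders: Dict[str, Builder]) -> Dict[str, Any]:
    """Hypothetical slice of build_pre_constructed covering only the C7-related seeds."""
    components: Dict[str, Any] = config.get("components", {})
    pre_constructed: Dict[str, Any] = {}
    if config.get("mode") == "replay" and "c7_inference" not in components:
        # AZ-687 guard: skip c7_inference AND the c3_lightglue_runtime seed that depends on it.
        return pre_constructed
    runtime = builders["c7_inference"](components.get("c7_inference", {}))
    pre_constructed["c7_inference"] = runtime
    # The LightGlue runtime is built from the same object, so C7 is never built twice.
    pre_constructed["c3_lightglue_runtime"] = builders["c3_lightglue_runtime"](runtime)
    return pre_constructed
```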
Upstream dependencies:
- C10 CacheProvisioner → during F1 (after C11 TileDownloader has populated C6), triggers engine compilation when no cached engine matches the (SM, JP, TRT, precision) tuple.
- F2 takeoff load → triggers deserialize_cached_engine for every model used by C1/C2/C2.5/C3/C3.5.
- jetson-stats / NVML → thermal-throttle telemetry source.
Downstream consumers:
- C2 VPR (backbone forward pass).
- C2.5 ReRanker (LightGlue forward pass).
- C3 CrossDomainMatcher (DISK / LightGlue / ALIKED / XFeat forward passes).
- C3.5 AdHoP (conditional refinement backbone).
- C1 (only the strategies that have a CUDA path; KltRansac is CPU-only).
- C4 (consumes ThermalState for the D-CROSS-LATENCY-1 covariance-mode decision).
2. Internal Interfaces
Interface: InferenceRuntime
| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| compile_engine | model_path: Path, build_config: BuildConfig | EngineCacheEntry | No (offline) | EngineBuildError, CalibrationCacheError |
| deserialize_engine | EngineCacheEntry | EngineHandle | No | EngineDeserializeError |
| infer | EngineHandle, inputs: dict[str, Tensor] | dict[str, Tensor] | No (sync GPU stream) | InferenceError, OutOfMemoryError |
| release_engine | EngineHandle | None | No | none |
| thermal_state | () | ThermalState | No | TelemetryUnavailableError |
| current_runtime_label | () | string | No | none |
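The interface table can be expressed as a typing.Protocol. This is a sketch only: the DTO names are quoted forward references to the types in the DTO list that follows, Tensor is a placeholder alias, and the real module may define the interface as an ABC rather than a Protocol.

```python
from pathlib import Path
from typing import Any, Dict, Protocol, runtime_checkable

Tensor = Any  # placeholder: the concrete tensor type depends on the runtime


@runtime_checkable
class InferenceRuntime(Protocol):
    """Sketch of the C7 interface table; errors are listed per method."""

    def compile_engine(self, model_path: Path, build_config: "BuildConfig") -> "EngineCacheEntry":
        """Offline only. Raises EngineBuildError or CalibrationCacheError."""
        ...

    def deserialize_engine(self, entry: "EngineCacheEntry") -> "EngineHandle":
        """Raises EngineDeserializeError."""
        ...

    def infer(self, handle: "EngineHandle", inputs: Dict[str, Tensor]) -> Dict[str, Tensor]:
        """Synchronous on the runtime's GPU stream. Raises InferenceError or OutOfMemoryError."""
        ...

    def release_engine(self, handle: "EngineHandle") -> None:
        ...

    def thermal_state(self) -> "ThermalState":
        """Raises TelemetryUnavailableError."""
        ...

    def current_runtime_label(self) -> str:
        ...
```

Because the check is structural, any of the three strategies (or a test fake) satisfies the interface by providing these six methods.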
Input/Output DTOs:
BuildConfig:
precision: enum {fp16, int8, mixed}
workspace_mb: int
calibration_dataset: Path (required for int8)
optimization_profiles: list[(input_name, min_shape, opt_shape, max_shape)]
EngineCacheEntry: see data_model.md
EngineHandle: opaque GPU-resident handle
ThermalState:
cpu_temp_c: float
gpu_temp_c: float
thermal_throttle_active: bool
measured_clock_mhz: int
measured_at: monotonic_ns
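A sketch of two of the DTOs as frozen dataclasses, assuming Python 3.9+. The Precision enum, the Shape alias and the workspace_mb > 0 check are illustrative additions; the rule that int8 requires calibration_dataset comes from the BuildConfig definition above.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import List, Optional, Tuple

Shape = Tuple[int, ...]


class Precision(str, Enum):
    FP16 = "fp16"
    INT8 = "int8"
    MIXED = "mixed"


@dataclass(frozen=True)
class BuildConfig:
    precision: Precision
    workspace_mb: int
    calibration_dataset: Optional[Path] = None
    # (input_name, min_shape, opt_shape, max_shape)
    optimization_profiles: List[Tuple[str, Shape, Shape, Shape]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.precision is Precision.INT8 and self.calibration_dataset is None:
            raise ValueError("calibration_dataset is required for int8 builds")
        if self.workspace_mb <= 0:
            raise ValueError("workspace_mb must be positive")


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # a time.monotonic_ns() value
```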
3. External API Specification
Not applicable.
4. Data Access Patterns
Queries
| Query | Frequency | Hot Path | Index Needed |
|---|---|---|---|
| infer for VPR backbone | 3 Hz | Yes | n/a |
| infer for LightGlue (×10 in C2.5, ×3 in C3) | 3 Hz × 13 = 39 Hz | Yes | n/a |
| infer for AdHoP (conditional) | <1 Hz typical | Yes (when invoked) | n/a |
| thermal_state poll | 1 Hz from C4 | No (sampled, not per-frame) | n/a |
Caching Strategy
| Data | Cache Type | TTL | Invalidation |
|---|---|---|---|
| Compiled .engine files | filesystem, keyed by (SM, JP, TRT, precision) (D-C10-7) | bounded by JetPack/TRT version stability | manifest content-hash gate at takeoff (D-C10-3) |
| INT8 calibration cache | filesystem, alongside .engine (D-C10-6) | bounded by calibration dataset version | rebuild when calibration dataset hash changes |
| Resident engine handles | GPU memory | flight lifetime | F8 reboot recovery re-deserialises |
Storage Estimates
| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|---|---|---|---|---|
| .engine files | one per (model × precision × backbone) | 50 MB – 500 MB / engine | up to ~1.5 GB across all backbones for a deployment binary | bounded by AC-8.3 carve-out |
| INT8 calibration caches | one per engine | 1–10 MB | <50 MB | as above |
Data Management
Seed data: pre-flight F1 provisioning compiles engines (or reuses cached). No mid-flight compilation.
Rollback: D-C10-7 self-describing filename schema (<model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine) makes stale engines visually obvious; F2 takeoff load refuses to deserialize an engine whose metadata doesn't match the host's current (SM, JP, TRT) tuple.
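A sketch of the D-C10-7 filename round trip and the F2 host-tuple check. EngineKey, check_host_match and the regex are hypothetical (not the EngineFilenameSchema helper's real API), EngineDeserializeError is a local stand-in, and the sm87 / jp6.2 / trt10.3 values below are illustrative.

```python
import re
from dataclasses import dataclass

# <model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine
_ENGINE_RE = re.compile(
    r"^(?P<model>.+?)__sm(?P<sm>\d+)_jp(?P<jp>[\d.]+)_trt(?P<trt>[\d.]+)"
    r"_(?P<precision>fp16|int8|mixed)\.engine$"
)


class EngineDeserializeError(RuntimeError):
    """Local stand-in for the error type named in the interface table."""


@dataclass(frozen=True)
class EngineKey:
    model: str
    sm: str
    jp: str
    trt: str
    precision: str

    def filename(self) -> str:
        return f"{self.model}__sm{self.sm}_jp{self.jp}_trt{self.trt}_{self.precision}.engine"

    @classmethod
    def parse(cls, name: str) -> "EngineKey":
        match = _ENGINE_RE.match(name)
        if match is None:
            raise ValueError(f"not a self-describing engine filename: {name!r}")
        return cls(**match.groupdict())


def check_host_match(key: EngineKey, host_sm: str, host_jp: str, host_trt: str) -> None:
    """Refuse to deserialize when the engine's (SM, JP, TRT) differs from the host's."""
    host = {"sm": host_sm, "jp": host_jp, "trt": host_trt}
    engine = {"sm": key.sm, "jp": key.jp, "trt": key.trt}
    mismatches = [f"{k}: engine={engine[k]} host={host[k]}" for k in host if engine[k] != host[k]]
    if mismatches:
        raise EngineDeserializeError(f"refusing {key.filename()}: " + "; ".join(mismatches))
```

For example, EngineKey("ultravpr", "87", "6.2", "10.3", "fp16").filename() yields ultravpr__sm87_jp6.2_trt10.3_fp16.engine, and parsing that name returns an equal key.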
5. Implementation Details
Algorithmic Complexity: per-model forward pass cost is the design driver. Engine builds are O(complexity_of_optimizer_search) — minutes for INT8 with calibration; sub-minute for FP16.
State Management:
- Owns the CUDA stream(s) for the runtime; one stream per concurrent consumer (typically one stream because the F3 hot path is single-threaded).
- Owns the resident engine handles for the duration of a flight.
- Owns the polling loop for thermal-throttle telemetry (1 Hz background thread).
Key Dependencies:
| Library | Version | Purpose |
|---|---|---|
| TensorRT (C++ + Python) | 10.3 (JetPack 6.2 pin) per D-C7-9 | Primary engine compile + deserialize + infer |
| Polygraphy | matches TensorRT | Engine build orchestration |
| trtexec | bundled with TensorRT | Alternate engine build path |
| ONNX Runtime + TRT EP | per project pin | Fallback runtime |
| PyTorch | per simple-baseline pin | FP16 baseline (mandatory) |
| jetson-stats / pynvml | latest | Thermal-throttle telemetry source |
Error Handling Strategy:
- EngineBuildError: surface to C10/operator pre-flight; takeoff blocked. Never silently fall back between runtimes: if the configured runtime can't build, the operator must explicitly switch.
- EngineDeserializeError at takeoff: refuse takeoff with explicit (SM, JP, TRT, precision) mismatch detail.
- InferenceError mid-flight (rare; e.g., a transient CUDA fault): emit no result for that frame; the consumer (C2/C3) reports its own degraded path.
- OutOfMemoryError: same as above; surface to C13 FDR and C12 operator-tooling for post-flight investigation.
- TelemetryUnavailableError: jetson-stats hung or unavailable. Default to thermal_throttle_active = false (D-CROSS-LATENCY-1 stays on the steady-state path); log WARN.
- CalibrationCacheError: per D-C10-6, calibration cache trust is critical; if the cache hash mismatches, refuse to use it and force a rebuild.
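A sketch of the TelemetryUnavailableError policy, assuming the raw telemetry read is injected as a callable. thermal_state_or_default is hypothetical, and filling the unknown temperatures with NaN and the clock with 0 is an assumption this document does not specify; only the thermal_throttle_active = false default and the WARN come from the policy above.

```python
import logging
import math
import time
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("c7")


class TelemetryUnavailableError(RuntimeError):
    """Local stand-in for the error type named in the interface table."""


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # a time.monotonic_ns() value


def thermal_state_or_default(read: Callable[[], ThermalState]) -> ThermalState:
    """Return live telemetry, or a 'not throttled' state so C4 stays on the steady-state path."""
    try:
        return read()
    except TelemetryUnavailableError:
        log.warning("C7 telemetry unavailable; assuming thermal_throttle_active=false")
        return ThermalState(
            cpu_temp_c=math.nan,  # unknown (assumed placeholder)
            gpu_temp_c=math.nan,  # unknown (assumed placeholder)
            thermal_throttle_active=False,
            measured_clock_mhz=0,  # unknown (assumed placeholder)
            measured_at=time.monotonic_ns(),
        )
```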
6. Extensions and Helpers
| Helper | Purpose | Used By |
|---|---|---|
| EngineFilenameSchema | self-describing filename per D-C10-7 | C7, C10 |
| Sha256Sidecar | atomic write + content-hash sidecar pattern | C6, C7, C10 |
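A minimal sketch of the Sha256Sidecar pattern, assuming a POSIX filesystem and a <file>.sha256 sidecar naming convention (an assumption; the real helper's layout is not documented here). The function names are hypothetical. Write order matters: the artifact lands before its hash, so an interrupted write shows up as a missing or mismatched sidecar and forces a rebuild instead of passing the gate.

```python
import hashlib
import os
import tempfile
from pathlib import Path

_CHUNK = 1 << 20  # hash in 1 MiB chunks so multi-hundred-MB engines are not loaded whole


def _sidecar_path(path: Path) -> Path:
    return path.with_name(path.name + ".sha256")


def _sha256_file(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(_CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, fsync, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic when source and target are on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def write_with_sidecar(path: Path, data: bytes) -> Path:
    """Write the artifact first, then its hash sidecar; returns the sidecar path."""
    _atomic_write(path, data)
    sidecar = _sidecar_path(path)
    _atomic_write(sidecar, (hashlib.sha256(data).hexdigest() + "\n").encode("ascii"))
    return sidecar


def verify_sidecar(path: Path) -> bool:
    """True only if both files exist and the artifact's hash matches the sidecar."""
    sidecar = _sidecar_path(path)
    if not path.exists() or not sidecar.exists():
        return False
    return sidecar.read_text(encoding="ascii").strip() == _sha256_file(path)
```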
7. Caveats & Edge Cases
Known limitations:
- TensorRT engines are NOT portable across (SM, JP, TRT, precision) tuples; Tier-1 (workstation Docker) cannot reuse Tier-2 (Jetson) engines. CI emits both tiers' engines as artifacts.
- INT8 calibration cache trust is the lurking foot-gun; the D-C10-6 manifest-hash gate is the only protection. Any deviation breaks NFT-PERF-01 / NFT-LIM-01.
Potential race conditions:
- The thermal-throttle polling thread MUST be thread-safe with respect to the F3 hot path's infer calls. Use a lock-free atomic snapshot for thermal_state.
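One way to meet the lock-free snapshot requirement in Python, sketched under the assumption of CPython, where storing a single attribute reference is atomic. ThermalPoller and its read callable are hypothetical; in practice the callable would wrap jetson-stats or pynvml. The poller publishes a new immutable ThermalState each period, and the hot path only reads the current reference.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # a time.monotonic_ns() value


class ThermalPoller:
    """Hypothetical background poller publishing immutable ThermalState snapshots."""

    def __init__(self, read: Callable[[], ThermalState], period_s: float = 1.0) -> None:
        self._read = read
        self._period_s = period_s
        self._snapshot: Optional[ThermalState] = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="c7-thermal", daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                # A single reference store of a frozen object: readers see either the
                # previous snapshot or the new one, never a half-updated state.
                self._snapshot = self._read()
            except Exception:
                pass  # keep the last good snapshot; real code would log WARN here
            self._stop.wait(self._period_s)

    def thermal_state(self) -> Optional[ThermalState]:
        """Hot-path accessor: no lock, just a reference read."""
        return self._snapshot
```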
Performance bottlenecks:
- Per-frame inference cost is the F3 hot path's largest contributor. NFT-PERF-01 partition is the source of truth.
Cycle-1 Tier-2 follow-up dependencies:
OnnxTrtEpRuntime: the module and class are implemented, and the lower-level inference_factory._RUNTIME_TO_BUILD_FLAG maps "onnx_trt_ep" → "BUILD_ONNX_TRT_EP_RUNTIME", but the airborne C7_AIRBORNE_BUILD_FLAGS tuple in runtime_root/airborne_bootstrap.py deliberately omits it (research-only per the AZ-621 task spec). Setting config.components['c7_inference'].runtime = "onnx_trt_ep" on an airborne binary raises AirborneBootstrapError from _build_c7_inference, whose message lists ONLY the two airborne flag options (tensorrt / pytorch_fp16), so operators see a clean recovery path instead of a research-build escape hatch. Tier-2 follow-up: extend C7_AIRBORNE_BUILD_FLAGS (and gate it on BUILD_ONNX_TRT_EP_RUNTIME=ON) only if a future deployment scenario justifies the ORT-TRT-EP path on a flight binary; until then the runtime is exercised only via unit-test composition and ad-hoc workstation runs.
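A sketch of the airborne gate described here and in § 1. C7_AIRBORNE_BUILD_FLAGS mirrors the tuple documented for runtime_root/airborne_bootstrap.py; resolve_airborne_runtime, the build_flags mapping and the message wording are hypothetical, and AirborneBootstrapError is a local stand-in for the real class. The property shown is that a rejected runtime (including onnx_trt_ep) only ever points the operator at the two airborne options.

```python
from typing import Mapping, Tuple

# Mirrors the tuple documented for runtime_root/airborne_bootstrap.py.
C7_AIRBORNE_BUILD_FLAGS = (
    ("tensorrt", "BUILD_TENSORRT_RUNTIME"),
    ("pytorch_fp16", "BUILD_PYTORCH_FP16_RUNTIME"),
)


class AirborneBootstrapError(RuntimeError):
    """Local stand-in for the bootstrap error type named in this document."""


def resolve_airborne_runtime(
    runtime: str,
    build_flags: Mapping[str, bool],
    consumers: Tuple[str, ...] = (),
) -> str:
    """Hypothetical check: accept the configured runtime only if it is airborne and its flag is ON."""
    flag = dict(C7_AIRBORNE_BUILD_FLAGS).get(runtime)
    if flag is not None and build_flags.get(flag, False):
        return runtime
    # List only the airborne options, so onnx_trt_ep is never offered as a recovery path.
    options = ", ".join(f"{name} ({flag_name})" for name, flag_name in C7_AIRBORNE_BUILD_FLAGS)
    who = ", ".join(consumers) or "unspecified consumers"
    raise AirborneBootstrapError(
        f"pre_constructed['c7_inference'] unavailable for runtime={runtime!r} "
        f"(required by {who}); enable one of: {options}"
    )
```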
8. Dependency Graph
Must be implemented after: nothing internal — C7 is foundational.
Can be implemented in parallel with: C6, C13.
Blocks: C1 (CUDA strategies), C2, C2.5, C3, C3.5, C4 (consumes ThermalState), C10, F1, F2, F3, F6, F8.
9. Logging Strategy
| Log Level | When | Example |
|---|---|---|
| ERROR | EngineBuildError, EngineDeserializeError, OutOfMemoryError, CalibrationCacheError | C7 OOM during infer; backbone=ultravpr; frame=12345 |
| WARN | thermal-throttle entered/exited; telemetry unavailable | C7 thermal throttle active; gpu_temp=83C; clock=750mhz |
| INFO | Strategy ready; engine deserialised; backbone resident | C7 ready: runtime=tensorrt, engines=[ultravpr@fp16, lightglue@fp16, disk@fp16] |
| DEBUG | per-frame infer timing per backbone | C7 infer backbone=ultravpr frame=12345 took=37ms |
Log format: structured JSON. Log storage: stdout / journald / FDR via C13 (ERROR + WARN always; thermal-state transitions always to FDR).
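A sketch of the structured-JSON log format using only the standard logging module. JsonFormatter, get_c7_logger and the "fields" convention for per-record key/values are assumptions; the journald and C13 FDR sinks are not shown.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object; extra key/values come from record.fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": record.created,  # seconds since the epoch, set by the logging module
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            payload.update(fields)
        return json.dumps(payload, sort_keys=True)


def get_c7_logger() -> logging.Logger:
    """Attach one stdout JSON handler to the 'c7' logger (idempotent)."""
    logger = logging.getLogger("c7")
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

A call such as get_c7_logger().warning("C7 thermal throttle active", extra={"fields": {"gpu_temp_c": 83, "clock_mhz": 750}}) would then emit a single JSON line carrying those fields.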