Oleksandr Bezdieniezhnykh 76f460c88a [autodev] Step 13 partial: c6/c7/c8 cycle-1 doc sync
Batch 3 of the cycle-1 component-doc sync. For each of C6
(tile_cache), C7 (inference), C8 (fc_adapter):

- Append "Cycle-1 operational reality" paragraph to § 1
  documenting the actual cycle-1 wiring path:
  - C6: infrastructure seeded via build_pre_constructed's
    c6_descriptor_index (BUILD_FAISS_INDEX-gated) and
    c6_tile_store slots; no _STRATEGY_REGISTRY slot;
    AZ-687 replay-mode guard skips both seeds when the
    minimal replay Config omits the c6_tile_cache block.
  - C7: single InferenceRuntime built once via
    _build_c7_inference, identity-shared as the engine
    source for c3_lightglue_runtime (AZ-622 phase D);
    C7_AIRBORNE_BUILD_FLAGS lists tensorrt (production-
    default) + pytorch_fp16 (Tier-0 fallback);
    onnx_trt_ep deliberately omitted from airborne flags;
    AZ-687 replay-mode guard cascades to c3_lightglue_runtime.
  - C8: composed via a SEPARATE registry path
    (runtime_root/fc_factory.py) with its own _FC_REGISTRY
    + _GCS_REGISTRY; per-binary bootstrap modules register
    concrete strategies under BUILD_FC_* / BUILD_GCS_*
    flags; bind_outbound_emit_thread enforces the
    single-writer outbound invariant (AC-6).

- Add "Cycle-1 Tier-2 follow-up dependencies" subsection
  in § 7 of C7 only: onnx_trt_ep is implemented and the
  inference_factory recognises BUILD_ONNX_TRT_EP_RUNTIME,
  but airborne config selecting it raises a clean
  AirborneBootstrapError pointing only at the two airborne
  options. C6 and C8 have no parked Tier-2 strategies for
  cycle-1.

None of c6/c7/c8 import cv2 directly, so no OpenCV pin
row is added to § 5 (D-CROSS-CVE-1 leftover stays as it
is; the relaxed pin is recorded against c2.5/c3/c3.5/c4/c5
where the imports actually live).

Also refresh the D-CROSS-CVE-1 leftover replay timestamp
(condition still upstream-gated: gtsam wheels remain
numpy<2) and bump the autodev state's sub_step.detail to
record "batch 3/~5 done (c6/c7/c8); 4 components + 8
helpers + tests/ remain".

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 17:17:33 +03:00


C7 — On-Jetson Inference Runtime

1. High-Level Overview

Purpose: provide a single inference-runtime abstraction that all GPU-using components (C1 selectively, C2, C2.5, C3, C3.5) consume. Owns engine compilation (Polygraphy / trtexec / IBuilderConfig hybrid), engine deserialization at takeoff load, GPU memory management, INT8 calibration cache trust, and the thermal-throttle telemetry feed that drives the D-CROSS-LATENCY-1 hybrid in C4.

Architectural Pattern: Strategy — InferenceRuntime interface with three concrete implementations: TensorrtRuntime (production-default per D-C7-9 JetPack 6.2 + TensorRT 10.3 lock), OnnxTrtEpRuntime (fallback), PytorchFp16Runtime (mandatory simple-baseline). Selection at startup by config (ADR-001), build-time gating by BUILD_* flags (ADR-002), composition-root wired (ADR-009).

Cycle-1 operational reality: C7 is infrastructure shared across consumers; it does NOT have its own slot in the _STRATEGY_REGISTRY populated by register_airborne_strategies() (AZ-591). Instead the airborne binary builds the InferenceRuntime once via runtime_root/airborne_bootstrap.py::_build_c7_inference → inference_factory.build_inference_runtime, and seeds the single instance into pre_constructed["c7_inference"] (AZ-621 / Phase C). The same instance is reused as the engine source for the shared LightGlueRuntime load (AZ-622 / Phase D, _build_c3_lightglue_runtime), so the bootstrap never double-builds the runtime; downstream wrappers (c2_vpr / c3_matcher / c3_5_adhop, per AIRBORNE_REQUIRED_PRE_CONSTRUCTED_KEYS) then receive the identity-shared runtime via compose_root's constructor injection.

Airborne-buildable runtimes are gated by C7_AIRBORNE_BUILD_FLAGS = (("tensorrt", "BUILD_TENSORRT_RUNTIME"), ("pytorch_fp16", "BUILD_PYTORCH_FP16_RUNTIME")): tensorrt is the production default, and pytorch_fp16 is the Tier-0 / workstation fallback (it is also the conservative C7InferenceConfig.runtime default, so unconfigured test environments resolve to the Tier-0 baseline). onnx_trt_ep is deliberately omitted from the airborne flag matrix even though inference_factory._RUNTIME_TO_BUILD_FLAG recognises it; see § 7 Tier-2 follow-up.

When no airborne runtime is buildable (both BUILD_TENSORRT_RUNTIME and BUILD_PYTORCH_FP16_RUNTIME OFF, or the configured runtime's flag is OFF) and any configured consumer still requires c7_inference, _build_c7_inference surfaces the upstream RuntimeNotAvailableError as an AirborneBootstrapError (AC-621.2). The error names the missing key, BOTH airborne BUILD_* flags with their runtimes, and the consuming component slug(s), narrowed to the configured consumers when available.

AZ-687 replay-mode guard: when config.mode == "replay" and the minimal replay Config omits the c7_inference block, build_pre_constructed skips both c7_inference AND the cascading c3_lightglue_runtime seed (the LightGlue runtime depends on the inference runtime). The c2_vpr / c3_matcher / c2_5_rerank / c3_5_adhop wrappers that would have consumed the runtime are likewise absent from the replay Config and therefore never look at the skipped slot.
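The gating rule described above can be sketched roughly as follows. This is an illustration only, not the code in airborne_bootstrap.py: the C7_AIRBORNE_BUILD_FLAGS tuple is copied from the text, while resolve_airborne_runtime, its parameters, the stand-in AirborneBootstrapError class, and the message wording are all hypothetical.

```python
from typing import Mapping, Sequence

# Copied from the paragraph above; everything else in this sketch is hypothetical.
C7_AIRBORNE_BUILD_FLAGS = (
    ("tensorrt", "BUILD_TENSORRT_RUNTIME"),
    ("pytorch_fp16", "BUILD_PYTORCH_FP16_RUNTIME"),
)


class AirborneBootstrapError(RuntimeError):
    """Stand-in for the real bootstrap error type."""


def resolve_airborne_runtime(
    configured: str,
    build_flags: Mapping[str, bool],
    consumers: Sequence[str],
) -> str:
    """Return the configured runtime label if it is airborne-buildable.

    A runtime qualifies only if it appears in C7_AIRBORNE_BUILD_FLAGS and
    its BUILD_* flag is ON. There is no silent fallback to another runtime.
    """
    allowed = dict(C7_AIRBORNE_BUILD_FLAGS)
    flag = allowed.get(configured)
    if flag is not None and build_flags.get(flag, False):
        return configured

    # List only the airborne options, never research-only runtimes.
    options = ", ".join(
        f"{name} ({flag_name})" for name, flag_name in C7_AIRBORNE_BUILD_FLAGS
    )
    raise AirborneBootstrapError(
        f"missing pre_constructed key 'c7_inference' (runtime={configured!r}); "
        f"airborne options: {options}; required by: {', '.join(consumers)}"
    )
```

Under these assumptions, selecting onnx_trt_ep fails even when BUILD_ONNX_TRT_EP_RUNTIME is ON, because the lookup never consults flags outside the airborne tuple.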

Upstream dependencies:

  • C10 CacheProvisioner → during F1 (after C11 TileDownloader has populated C6) triggers engine compilation when no cached engine matches the (SM, JP, TRT, precision) tuple.
  • F2 takeoff load → triggers deserialize_cached_engine for every model used by C1/C2/C2.5/C3/C3.5.
  • jetson-stats / NVML → thermal-throttle telemetry source.

Downstream consumers:

  • C2 VPR (backbone forward pass).
  • C2.5 ReRanker (LightGlue forward pass).
  • C3 CrossDomainMatcher (DISK / LightGlue / ALIKED / XFeat forward passes).
  • C3.5 AdHoP (conditional refinement backbone).
  • C1 (only the strategies that have a CUDA path; KltRansac is CPU-only).
  • C4 (consumes ThermalState for the D-CROSS-LATENCY-1 covariance-mode decision).

2. Internal Interfaces

Interface: InferenceRuntime

| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| compile_engine | model_path: Path, build_config: BuildConfig | EngineCacheEntry | No (offline) | EngineBuildError, CalibrationCacheError |
| deserialize_engine | EngineCacheEntry | EngineHandle | No | EngineDeserializeError |
| infer | EngineHandle, inputs: dict[str, Tensor] | dict[str, Tensor] | No (sync GPU stream) | InferenceError, OutOfMemoryError |
| release_engine | EngineHandle | None | No | |
| thermal_state | () | ThermalState | No | TelemetryUnavailableError |
| current_runtime_label | () | string | No | |
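As a rough illustration of the interface above, the six methods can be written as a structural Protocol. The method names and argument names come from the table; the use of typing.Protocol, the Any placeholders for the DTO and Tensor types, and the EchoRuntime test double are assumptions made for this sketch.

```python
from pathlib import Path
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class InferenceRuntime(Protocol):
    """Structural sketch of the six methods listed in the table."""

    def compile_engine(self, model_path: Path, build_config: Any) -> Any:
        """Offline build; returns an EngineCacheEntry."""
        ...

    def deserialize_engine(self, entry: Any) -> Any:
        """Returns an opaque EngineHandle."""
        ...

    def infer(self, handle: Any, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Synchronous forward pass on the runtime's GPU stream."""
        ...

    def release_engine(self, handle: Any) -> None: ...

    def thermal_state(self) -> Any: ...

    def current_runtime_label(self) -> str: ...


class EchoRuntime:
    """Hypothetical test double: echoes inputs, touches no GPU."""

    def compile_engine(self, model_path: Path, build_config: Any) -> Any:
        return {"model": str(model_path)}

    def deserialize_engine(self, entry: Any) -> Any:
        return object()

    def infer(self, handle: Any, inputs: Dict[str, Any]) -> Dict[str, Any]:
        return dict(inputs)

    def release_engine(self, handle: Any) -> None:
        return None

    def thermal_state(self) -> Any:
        return None

    def current_runtime_label(self) -> str:
        return "echo"
```

A double like this is one way consumers could be unit-tested against the interface without a TensorRT or PyTorch installation; isinstance(EchoRuntime(), InferenceRuntime) only checks that the method names exist, not their signatures.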

Input/Output DTOs:

BuildConfig:
  precision:                   enum {fp16, int8, mixed}
  workspace_mb:                int
  calibration_dataset:         Path (required for int8)
  optimization_profiles:       list[(input_name, min_shape, opt_shape, max_shape)]

EngineCacheEntry:              see data_model.md
EngineHandle:                  opaque GPU-resident handle

ThermalState:
  cpu_temp_c:                  float
  gpu_temp_c:                  float
  thermal_throttle_active:     bool
  measured_clock_mhz:          int
  measured_at:                 monotonic_ns
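The two fully specified DTOs above map naturally onto frozen dataclasses. The sketch below is one possible shape, not the project's data model: the field names follow the listing, whereas the Precision enum name, the tuple type chosen for optimization_profiles, and the validation in __post_init__ are assumptions. Only the "required for int8" rule is taken from the listing; whether mixed precision also needs a calibration dataset is not stated, so it is not enforced here.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple

# (input_name, min_shape, opt_shape, max_shape)
Profile = Tuple[str, Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]]


class Precision(Enum):
    FP16 = "fp16"
    INT8 = "int8"
    MIXED = "mixed"


@dataclass(frozen=True)
class BuildConfig:
    precision: Precision
    workspace_mb: int
    calibration_dataset: Optional[Path] = None
    optimization_profiles: Tuple[Profile, ...] = ()

    def __post_init__(self) -> None:
        # Enforces the "required for int8" note from the DTO listing.
        if self.precision is Precision.INT8 and self.calibration_dataset is None:
            raise ValueError("calibration_dataset is required for int8")
        if self.workspace_mb <= 0:
            raise ValueError("workspace_mb must be positive")


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # monotonic nanoseconds, e.g. time.monotonic_ns()
```

Freezing the dataclasses keeps a ThermalState value safe to hand between threads, since no reader can observe a half-updated instance.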

3. External API Specification

Not applicable.

4. Data Access Patterns

Queries

| Query | Frequency | Hot Path | Index Needed |
|---|---|---|---|
| infer for VPR backbone | 3 Hz | Yes | n/a |
| infer for LightGlue (×10 in C2.5, ×3 in C3) | 3 Hz × 13 = 39 Hz | Yes | n/a |
| infer for AdHoP (conditional) | <1 Hz typical | Yes (when invoked) | n/a |
| thermal_state poll | 1 Hz from C4 | No (sampled, not per-frame) | n/a |
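The LightGlue row's 39 Hz figure is the frame rate multiplied by the per-frame call count. The short calculation below reproduces it; the variable names are illustrative, and the per-call figure is only a loose upper bound that assumes the calls run back to back on one stream with nothing else using the GPU, which the hot path does not actually guarantee.

```python
FRAME_RATE_HZ = 3

# Per-frame LightGlue invocations, from the table row above.
LIGHTGLUE_CALLS_PER_FRAME = {"c2_5_rerank": 10, "c3_matcher": 3}

calls_per_frame = sum(LIGHTGLUE_CALLS_PER_FRAME.values())  # 10 + 3 = 13
lightglue_rate_hz = FRAME_RATE_HZ * calls_per_frame        # 3 * 13 = 39

# Loose ceiling on the mean time per call if LightGlue had the GPU to itself.
max_mean_call_ms = 1000.0 / lightglue_rate_hz              # about 25.6 ms
```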

Caching Strategy

| Data | Cache Type | TTL | Invalidation |
|---|---|---|---|
| Compiled .engine files | filesystem keyed by (SM, JP, TRT, precision) (D-C10-7) | bounded by JetPack/TRT version stability | manifest content-hash gate at takeoff (D-C10-3) |
| INT8 calibration cache | filesystem alongside .engine (D-C10-6) | bounded by calibration dataset version | rebuild when calibration dataset hash changes |
| Resident engine handles | GPU memory | flight lifetime | F8 reboot recovery re-deserialises |

Storage Estimates

| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|---|---|---|---|---|
| .engine files | one per (model × precision × backbone) | 50 MB to 500 MB / engine | up to ~1.5 GB across all backbones for a deployment binary | bounded by AC-8.3 carve-out |
| INT8 calibration caches | one per engine | 1 to 10 MB | <50 MB | as above |

Data Management

Seed data: pre-flight F1 provisioning compiles engines (or reuses cached). No mid-flight compilation.

Rollback: D-C10-7 self-describing filename schema (<model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine) makes stale engines visually obvious; F2 takeoff load refuses to deserialize an engine whose metadata doesn't match the host's current (SM, JP, TRT) tuple.
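A minimal sketch of how the filename schema above could be formatted, parsed, and checked against the host tuple. The schema string is taken from the text; the EngineKey type, the function names, the regular expression, and the assumption that the SM, JP, and TRT fields contain only digits and dots are illustrative guesses, as are the example values.

```python
import re
from typing import NamedTuple, Optional


class EngineKey(NamedTuple):
    model: str
    sm: str
    jp: str
    trt: str
    precision: str


# Assumes version fields are digits and dots, e.g. sm87, jp6.2, trt10.3.
_ENGINE_RE = re.compile(
    r"(?P<model>.+?)__sm(?P<sm>[0-9.]+)_jp(?P<jp>[0-9.]+)"
    r"_trt(?P<trt>[0-9.]+)_(?P<precision>fp16|int8|mixed)\.engine"
)


def format_engine_filename(key: EngineKey) -> str:
    """Build <model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine."""
    return f"{key.model}__sm{key.sm}_jp{key.jp}_trt{key.trt}_{key.precision}.engine"


def parse_engine_filename(name: str) -> Optional[EngineKey]:
    """Return the key encoded in the filename, or None if it does not match."""
    match = _ENGINE_RE.fullmatch(name)
    return EngineKey(**match.groupdict()) if match else None


def matches_host(key: EngineKey, host_sm: str, host_jp: str, host_trt: str) -> bool:
    """True only when the engine was built for this exact (SM, JP, TRT) tuple."""
    return (key.sm, key.jp, key.trt) == (host_sm, host_jp, host_trt)
```

With these assumptions, a takeoff-load check would parse each cached filename and refuse any engine for which matches_host returns False. The filename is only a convenience here; the text states that the refusal is based on the engine's metadata.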

5. Implementation Details

Algorithmic Complexity: per-model forward pass cost is the design driver. Engine builds are O(complexity_of_optimizer_search) — minutes for INT8 with calibration; sub-minute for FP16.

State Management:

  • Owns the CUDA stream(s) for the runtime; one stream per concurrent consumer (typically one stream because the F3 hot path is single-threaded).
  • Owns the resident engine handles for the duration of a flight.
  • Owns the polling loop for thermal-throttle telemetry (1 Hz background thread).

Key Dependencies:

| Library | Version | Purpose |
|---|---|---|
| TensorRT (C++ + Python) | 10.3 (JetPack 6.2 pin) per D-C7-9 | Primary engine compile + deserialize + infer |
| Polygraphy | matches TensorRT | Engine build orchestration |
| trtexec | bundled with TensorRT | Alternate engine build path |
| ONNX Runtime + TRT EP | per project pin | Fallback runtime |
| PyTorch | per simple-baseline pin | FP16 baseline (mandatory) |
| jetson-stats / pynvml | latest | Thermal-throttle telemetry source |

Error Handling Strategy:

  • EngineBuildError: surface to C10/operator pre-flight; takeoff blocked. Never silently fall back between runtimes — if the configured runtime can't build, the operator must explicitly switch.
  • EngineDeserializeError at takeoff: refuse takeoff with explicit (SM, JP, TRT, precision) mismatch detail.
  • InferenceError mid-flight (rare; e.g., transient CUDA fault): emit no result for that frame; the consumer (C2/C3) reports its own degraded path.
  • OutOfMemoryError: same as above; surface to C13 FDR and C12 operator-tooling for post-flight investigation.
  • TelemetryUnavailableError: jetson-stats hung or unavailable. Default to "thermal_throttle_active = false" (D-CROSS-LATENCY-1 stays on the steady-state path); log WARN.
  • CalibrationCacheError: per D-C10-6, calibration cache trust is critical; if the cache hash mismatches, refuse to use it and force a rebuild.
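The TelemetryUnavailableError policy in the list above (default to not throttled, log WARN) could look roughly like this on the consumer side. The helper name, its parameters, and the local exception class are hypothetical stand-ins; only the fallback value and the log level come from the text.

```python
import logging
from typing import Any, Callable


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the runtime's telemetry error type."""


def throttle_active_or_default(
    read_thermal_state: Callable[[], Any],
    logger: logging.Logger,
) -> bool:
    """Return thermal_throttle_active, or False when telemetry is unavailable.

    Falling back to False keeps the caller on the steady-state path.
    """
    try:
        return bool(read_thermal_state().thermal_throttle_active)
    except TelemetryUnavailableError as exc:
        logger.warning("C7 thermal telemetry unavailable: %s", exc)
        return False
```

Only TelemetryUnavailableError is caught here, so any other failure in the read still propagates to the caller instead of being masked as "not throttled".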

6. Extensions and Helpers

| Helper | Purpose | Used By |
|---|---|---|
| EngineFilenameSchema | self-describing filename per D-C10-7 | C7, C10 |
| Sha256Sidecar | atomic write + content-hash sidecar pattern | C6, C7, C10 |
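The "atomic write + content-hash sidecar" pattern named in the table can be sketched with the standard library alone. This is a generic illustration, not the Sha256Sidecar helper itself: the function names and the .sha256 sidecar suffix are assumptions.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def _atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def _sidecar_path(path: Path) -> Path:
    return path.with_name(path.name + ".sha256")  # suffix is an assumption


def write_with_sidecar(path: Path, payload: bytes) -> str:
    """Write the payload and its SHA-256 hex digest; return the digest."""
    digest = hashlib.sha256(payload).hexdigest()
    _atomic_write(path, payload)
    _atomic_write(_sidecar_path(path), digest.encode("ascii"))
    return digest


def verify_sidecar(path: Path) -> bool:
    """True when the sidecar exists and matches the file's current content."""
    sidecar = _sidecar_path(path)
    if not path.exists() or not sidecar.exists():
        return False
    expected = sidecar.read_text(encoding="ascii").strip()
    return hashlib.sha256(path.read_bytes()).hexdigest() == expected
```

The payload is written before the sidecar, so an interruption between the two writes leaves a missing or stale sidecar, which verify_sidecar reports as a mismatch rather than a false pass.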

7. Caveats & Edge Cases

Known limitations:

  • TensorRT engines are NOT portable across (SM, JP, TRT, precision) tuples; Tier-1 (workstation Docker) cannot reuse Tier-2 (Jetson) engines. CI emits both tiers' engines as artifacts.
  • INT8 calibration cache trust is the lurking foot-gun; D-C10-6 manifest-hash gate is the only protection. Any deviation breaks NFT-PERF-01 / NFT-LIM-01.

Potential race conditions:

  • The thermal-throttle polling thread runs concurrently with the F3 hot path's infer calls and MUST NOT block or race with them. Use a lock-free atomic snapshot for thermal_state.
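One way to get a lock-free snapshot in Python is to publish an immutable value by rebinding a single reference. The sketch below assumes CPython, where rebinding an attribute is atomic with respect to other threads, and a single writer (the polling thread). The ThermalSnapshot class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # monotonic nanoseconds


class ThermalSnapshot:
    """Single-writer, many-reader holder for the latest ThermalState."""

    def __init__(self) -> None:
        self._latest: Optional[ThermalState] = None

    def publish(self, state: ThermalState) -> None:
        # One reference store; readers see either the old or the new object,
        # never a mix, because ThermalState instances are immutable.
        self._latest = state

    def read(self) -> Optional[ThermalState]:
        # One reference load; no lock is taken on the hot path.
        return self._latest
```

Readers should call read() once and use the returned object for all fields, rather than calling read() per field, so that every value comes from the same sample.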

Performance bottlenecks:

  • Per-frame inference cost is the F3 hot path's largest contributor. NFT-PERF-01 partition is the source of truth.

Cycle-1 Tier-2 follow-up dependencies:

  • OnnxTrtEpRuntime — the module + class are implemented and the lower-level inference_factory._RUNTIME_TO_BUILD_FLAG maps "onnx_trt_ep" → "BUILD_ONNX_TRT_EP_RUNTIME", but the airborne C7_AIRBORNE_BUILD_FLAGS tuple in runtime_root/airborne_bootstrap.py deliberately omits it (research-only per the AZ-621 task spec). Setting config.components['c7_inference'].runtime = "onnx_trt_ep" on an airborne binary raises AirborneBootstrapError from _build_c7_inference whose message lists ONLY the two airborne flag options (tensorrt / pytorch_fp16) — operators see a clean recovery path instead of a research-build escape hatch. Tier-2 follow-up: extend C7_AIRBORNE_BUILD_FLAGS (and gate it on BUILD_ONNX_TRT_EP_RUNTIME=ON) only if a future deployment scenario justifies the ORT-TRT-EP path on a flight binary; until then the runtime is exercised via unit-test composition and ad-hoc workstation runs only.

8. Dependency Graph

Must be implemented after: nothing internal — C7 is foundational.

Can be implemented in parallel with: C6, C13.

Blocks: C1 (CUDA strategies), C2, C2.5, C3, C3.5, C4 (consumes ThermalState), C10, F1, F2, F3, F6, F8.

9. Logging Strategy

| Log Level | When | Example |
|---|---|---|
| ERROR | EngineBuildError, EngineDeserializeError, OutOfMemoryError, CalibrationCacheError | C7 OOM during infer; backbone=ultravpr; frame=12345 |
| WARN | thermal-throttle entered/exited; telemetry unavailable | C7 thermal throttle active; gpu_temp=83C; clock=750mhz |
| INFO | Strategy ready; engine deserialised; backbone resident | C7 ready: runtime=tensorrt, engines=[ultravpr@fp16, lightglue@fp16, disk@fp16] |
| DEBUG | per-frame infer timing per backbone | C7 infer backbone=ultravpr frame=12345 took=37ms |

Log format: structured JSON. Log storage: stdout / journald / FDR via C13 (ERROR + WARN always; thermal-state transitions always to FDR).
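A structured-JSON formatter along these lines would produce one JSON object per log line using the standard logging module. The key names (level, component, message), the "fields" convention for structured extras, and the WARNING to WARN mapping are assumptions made for this sketch, not the project's actual log schema.

```python
import json
import logging

# The stdlib level is called WARNING; the table above calls it WARN.
_LEVEL_NAMES = {"WARNING": "WARN"}


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname),
            "component": "C7",
            "message": record.getMessage(),
        }
        # Structured extras are passed as extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_c7_logger(stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to the stream (stderr if None)."""
    logger = logging.getLogger("c7_inference")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger
```

Under this sketch the WARN example from the table would be emitted as logger.warning("thermal throttle active", extra={"fields": {"gpu_temp_c": 83, "clock_mhz": 750}}).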