Update autodev state, architecture documentation, and glossary terms

Transitioned the autodev state to phase 21, reflecting the completion of Step 5 and the drafting of Step 6 epics. Revised the architecture documentation to clarify the roles of the Tile Manager and its components, ensuring accurate representation of the system's operational flow. Updated glossary entries for Flight State and Operator to incorporate recent changes and enhance clarity on component interactions and responsibilities.
Oleksandr Bezdieniezhnykh
2026-05-10 00:21:34 +03:00
parent 723f574b14
commit 64542d32fc
52 changed files with 8789 additions and 88 deletions
@@ -0,0 +1,155 @@
# C7 — On-Jetson Inference Runtime
## 1. High-Level Overview
**Purpose**: provide a single inference-runtime abstraction that all GPU-using components (C1 selectively, C2, C2.5, C3, C3.5) consume. Owns engine compilation (Polygraphy / trtexec / IBuilderConfig hybrid), engine deserialization at takeoff load, GPU memory management, INT8 calibration cache trust, and the thermal-throttle telemetry feed that drives the D-CROSS-LATENCY-1 hybrid in C4.
**Architectural Pattern**: Strategy — `InferenceRuntime` interface with three concrete implementations: `TensorrtRuntime` (production-default per D-C7-9 JetPack 6.2 + TensorRT 10.3 lock), `OnnxTrtEpRuntime` (fallback), `PytorchFp16Runtime` (mandatory simple-baseline). Selection at startup by config (ADR-001), build-time gating by `BUILD_*` flags (ADR-002), composition-root wired (ADR-009).
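The selection rule above (config picks the runtime, build flags gate what is available, no implicit fallback) can be sketched as follows. This is a minimal illustration, not the project's composition root: the registry keys `onnx_trt_ep` and `pytorch_fp16`, the `select_runtime` function, and `RuntimeNotBuiltError` are hypothetical names; only the three class names and the `tensorrt` label appear in this document.

```python
from typing import Callable, Dict, Mapping, Protocol


class InferenceRuntime(Protocol):
    def current_runtime_label(self) -> str: ...


class TensorrtRuntime:
    def current_runtime_label(self) -> str:
        return "tensorrt"


class OnnxTrtEpRuntime:
    def current_runtime_label(self) -> str:
        return "onnx_trt_ep"  # label is an assumption


class PytorchFp16Runtime:
    def current_runtime_label(self) -> str:
        return "pytorch_fp16"  # label is an assumption


class RuntimeNotBuiltError(RuntimeError):
    """Hypothetical error: the configured runtime was excluded at build time."""


# Config key -> factory. The keys are illustrative, not the project's schema.
_REGISTRY: Dict[str, Callable[[], InferenceRuntime]] = {
    "tensorrt": TensorrtRuntime,
    "onnx_trt_ep": OnnxTrtEpRuntime,
    "pytorch_fp16": PytorchFp16Runtime,
}


def select_runtime(configured: str, build_flags: Mapping[str, bool]) -> InferenceRuntime:
    """Return the configured runtime or raise; never substitute another one."""
    if configured not in _REGISTRY:
        raise ValueError(f"unknown runtime: {configured!r}")
    if not build_flags.get(configured, False):
        raise RuntimeNotBuiltError(f"runtime {configured!r} is not in this build")
    return _REGISTRY[configured]()
```

Raising instead of picking the next available implementation keeps the choice of runtime an explicit operator decision.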
**Upstream dependencies**:
- C10 CacheProvisioner → during F1 (after C11 `TileDownloader` has populated C6) triggers engine compilation when no cached engine matches the `(SM, JP, TRT, precision)` tuple.
- F2 takeoff load → triggers `deserialize_engine` for every model used by C1/C2/C2.5/C3/C3.5.
- jetson-stats / NVML → thermal-throttle telemetry source.
**Downstream consumers**:
- C2 VPR (backbone forward pass).
- C2.5 ReRanker (LightGlue forward pass).
- C3 CrossDomainMatcher (DISK / LightGlue / ALIKED / XFeat forward passes).
- C3.5 AdHoP (conditional refinement backbone).
- C1 (only the strategies that have a CUDA path; KltRansac is CPU-only).
- C4 (consumes `ThermalState` for the D-CROSS-LATENCY-1 covariance-mode decision).
## 2. Internal Interfaces
### Interface: `InferenceRuntime`
| Method | Input | Output | Async | Error Types |
|--------|-------|--------|-------|-------------|
| `compile_engine` | `model_path: Path, build_config: BuildConfig` | `EngineCacheEntry` | No (offline) | `EngineBuildError`, `CalibrationCacheError` |
| `deserialize_engine` | `EngineCacheEntry` | `EngineHandle` | No | `EngineDeserializeError` |
| `infer` | `EngineHandle, inputs: dict[str, Tensor]` | `dict[str, Tensor]` | No (sync GPU stream) | `InferenceError`, `OutOfMemoryError` |
| `release_engine` | `EngineHandle` | `None` | No | — |
| `thermal_state` | `()` | `ThermalState` | No | `TelemetryUnavailableError` |
| `current_runtime_label` | `()` | `string` | No | — |
**Input/Output DTOs**:
```
BuildConfig:
precision: enum {fp16, int8, mixed}
workspace_mb: int
calibration_dataset: Path (required for int8)
optimization_profiles: list[(input_name, min_shape, opt_shape, max_shape)]
EngineCacheEntry: see data_model.md
EngineHandle: opaque GPU-resident handle
ThermalState:
cpu_temp_c: float
gpu_temp_c: float
thermal_throttle_active: bool
measured_clock_mhz: int
measured_at: monotonic_ns
```
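One way to express the interface table and DTOs above in Python is a `typing.Protocol` plus frozen dataclasses. This is a sketch under assumptions: `Tensor`, `EngineCacheEntry`, and `EngineHandle` are stand-in aliases (the real `EngineCacheEntry` lives in data_model.md), and the `Precision` enum name and the `__post_init__` check are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Any, Dict, Optional, Protocol, Tuple

Tensor = Any            # stand-in for the project's tensor type
EngineCacheEntry = Any  # defined in data_model.md; opaque here
EngineHandle = Any      # opaque GPU-resident handle


class Precision(Enum):
    FP16 = "fp16"
    INT8 = "int8"
    MIXED = "mixed"


@dataclass(frozen=True)
class BuildConfig:
    precision: Precision
    workspace_mb: int
    calibration_dataset: Optional[Path] = None
    # Each profile: (input_name, min_shape, opt_shape, max_shape)
    optimization_profiles: Tuple[tuple, ...] = ()

    def __post_init__(self) -> None:
        # The DTO marks calibration_dataset as required for int8.
        if self.precision is Precision.INT8 and self.calibration_dataset is None:
            raise ValueError("calibration_dataset is required for int8")


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # monotonic nanoseconds


class InferenceRuntime(Protocol):
    def compile_engine(self, model_path: Path, build_config: BuildConfig) -> EngineCacheEntry: ...
    def deserialize_engine(self, entry: EngineCacheEntry) -> EngineHandle: ...
    def infer(self, handle: EngineHandle, inputs: Dict[str, Tensor]) -> Dict[str, Tensor]: ...
    def release_engine(self, handle: EngineHandle) -> None: ...
    def thermal_state(self) -> ThermalState: ...
    def current_runtime_label(self) -> str: ...
```

Validating `BuildConfig` at construction time surfaces a missing calibration dataset before any build starts.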
## 3. External API Specification
Not applicable.
## 4. Data Access Patterns
### Queries
| Query | Frequency | Hot Path | Index Needed |
|-------|-----------|----------|--------------|
| `infer` for VPR backbone | 3 Hz | Yes | n/a |
| `infer` for LightGlue (×10 in C2.5, ×3 in C3) | 3 Hz × 13 = 39 Hz | Yes | n/a |
| `infer` for AdHoP (conditional) | <1 Hz typical | Yes (when invoked) | n/a |
| `thermal_state` poll | 1 Hz from C4 | No (sampled, not per-frame) | n/a |
### Caching Strategy
| Data | Cache Type | TTL | Invalidation |
|------|-----------|-----|-------------|
| Compiled `.engine` files | filesystem keyed by `(SM, JP, TRT, precision)` (D-C10-7) | bounded by JetPack/TRT version stability | manifest content-hash gate at takeoff (D-C10-3) |
| INT8 calibration cache | filesystem alongside `.engine` (D-C10-6) | bounded by calibration dataset version | rebuild when calibration dataset hash changes |
| Resident engine handles | GPU memory | flight lifetime | F8 reboot recovery re-deserialises |
### Storage Estimates
| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|-----------------|---------------------|----------|------------|-------------|
| `.engine` files | one per (model × precision × backbone) | 50–500 MB / engine | up to ~1.5 GB across all backbones for a deployment binary | bounded by AC-8.3 carve-out |
| INT8 calibration caches | one per engine | 1–10 MB | <50 MB | as above |
### Data Management
**Seed data**: pre-flight F1 provisioning compiles engines (or reuses cached). No mid-flight compilation.
**Rollback**: D-C10-7 self-describing filename schema (`<model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine`) makes stale engines visually obvious; F2 takeoff load refuses to deserialize an engine whose metadata doesn't match the host's current `(SM, JP, TRT)` tuple.
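A minimal sketch of formatting and parsing the D-C10-7 filename pattern, and of the host-tuple check F2 performs before deserialising. How versions are rendered inside the name (for example `jp6.2` versus `jp62`) is not stated here, so the dotted form, the regular expression, and the names `EngineKey`, `parse_engine_filename`, and `matches_host` are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineKey:
    model: str
    sm: str
    jp: str
    trt: str
    precision: str


# Assumed rendering: <model>__sm87_jp6.2_trt10.3_fp16.engine
_PATTERN = re.compile(
    r"^(?P<model>.+?)__sm(?P<sm>[0-9]+)_jp(?P<jp>[0-9.]+)"
    r"_trt(?P<trt>[0-9.]+)_(?P<precision>fp16|int8|mixed)\.engine$"
)


def format_engine_filename(key: EngineKey) -> str:
    return f"{key.model}__sm{key.sm}_jp{key.jp}_trt{key.trt}_{key.precision}.engine"


def parse_engine_filename(name: str) -> EngineKey:
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a schema-conformant engine filename: {name!r}")
    return EngineKey(**match.groupdict())


def matches_host(key: EngineKey, sm: str, jp: str, trt: str) -> bool:
    """True only when the engine was built for this host's (SM, JP, TRT)."""
    return (key.sm, key.jp, key.trt) == (sm, jp, trt)
```

Because the check needs only the filename, a mismatched engine can be rejected before any file contents are read or GPU memory is allocated.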
## 5. Implementation Details
**Algorithmic Complexity**: per-model forward pass cost is the design driver. Engine build time is set by the optimizer's search (`O(complexity_of_optimizer_search)`): minutes for INT8 with calibration, under a minute for FP16.
**State Management**:
- Owns the CUDA stream(s) for the runtime; one stream per concurrent consumer (typically one stream because the F3 hot path is single-threaded).
- Owns the resident engine handles for the duration of a flight.
- Owns the polling loop for thermal-throttle telemetry (1 Hz background thread).
**Key Dependencies**:
| Library | Version | Purpose |
|---------|---------|---------|
| TensorRT (C++ + Python) | 10.3 (JetPack 6.2 pin) per D-C7-9 | Primary engine compile + deserialize + infer |
| Polygraphy | matches TensorRT | Engine build orchestration |
| trtexec | bundled with TensorRT | Alternate engine build path |
| ONNX Runtime + TRT EP | per project pin | Fallback runtime |
| PyTorch | per simple-baseline pin | FP16 baseline (mandatory) |
| jetson-stats / pynvml | latest | Thermal-throttle telemetry source |
**Error Handling Strategy**:
- `EngineBuildError`: surface to C10/operator pre-flight; takeoff blocked. **Never silently fall back** between runtimes — if the configured runtime can't build, the operator must explicitly switch.
- `EngineDeserializeError` at takeoff: refuse takeoff with explicit `(SM, JP, TRT, precision)` mismatch detail.
- `InferenceError` mid-flight (rare; e.g., transient CUDA fault): emit no result for that frame; the consumer (C2/C3) reports its own degraded path.
- `OutOfMemoryError`: same as above; surface to C13 FDR and C12 operator-tooling for post-flight investigation.
- `TelemetryUnavailableError`: jetson-stats hung or unavailable. Default to "thermal_throttle_active = false" (D-CROSS-LATENCY-1 stays on the steady-state path); log WARN.
- `CalibrationCacheError`: per D-C10-6, calibration cache trust is critical; if the cache hash mismatches, refuse to use it and force a rebuild.
## 6. Extensions and Helpers
| Helper | Purpose | Used By |
|--------|---------|---------|
| `EngineFilenameSchema` | self-describing filename per D-C10-7 | C7, C10 |
| `Sha256Sidecar` | atomic write + content-hash sidecar pattern | C6, C7, C10 |
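The "atomic write + content-hash sidecar" pattern named for `Sha256Sidecar` could look like the sketch below. It is an assumption-laden illustration rather than the helper itself: the function names and the `<file>.sha256` sidecar naming are hypothetical. It relies on `os.replace`, which swaps the file into place in one step when source and destination are on the same filesystem.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def _atomic_write(path: Path, data: bytes) -> None:
    # Write to a temp file in the same directory, then rename over the target,
    # so readers never observe a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_with_sidecar(path: Path, data: bytes) -> str:
    """Write `data` to `path` and its SHA-256 hex digest to `<path>.sha256`."""
    digest = hashlib.sha256(data).hexdigest()
    _atomic_write(path, data)
    # Sidecar goes last: a crash in between leaves a payload whose sidecar is
    # missing or stale, which a verifying reader can refuse.
    _atomic_write(path.with_name(path.name + ".sha256"), digest.encode("ascii"))
    return digest
```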
## 7. Caveats & Edge Cases
**Known limitations**:
- TensorRT engines are NOT portable across `(SM, JP, TRT, precision)` tuples; Tier-1 (workstation Docker) cannot reuse Tier-2 (Jetson) engines. CI emits both tiers' engines as artifacts.
- INT8 calibration cache trust is the lurking foot-gun; D-C10-6 manifest-hash gate is the only protection. Any deviation breaks NFT-PERF-01 / NFT-LIM-01.
**Potential race conditions**:
- The thermal-throttle polling thread runs concurrently with the F3 hot path's `infer` calls and MUST be thread-safe with respect to them without ever blocking them. Use a lock-free atomic snapshot for `thermal_state`.
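A sketch of the snapshot idea, assuming a CPython implementation: the poller builds a fresh immutable `ThermalState` and publishes it by rebinding one attribute, so a reader sees either the previous snapshot or the new one, never a half-updated one, and no lock is taken on the read path. `ThermalSnapshot` and `poll_loop` are hypothetical names; a C++ implementation would need an explicit atomic or a seqlock instead.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ThermalState:
    cpu_temp_c: float
    gpu_temp_c: float
    thermal_throttle_active: bool
    measured_clock_mhz: int
    measured_at: int  # monotonic nanoseconds


class ThermalSnapshot:
    """Single-writer, many-reader holder for the latest immutable state."""

    def __init__(self) -> None:
        self._latest: Optional[ThermalState] = None

    def publish(self, state: ThermalState) -> None:
        # One attribute rebind; the object itself is never mutated.
        self._latest = state

    def read(self) -> Optional[ThermalState]:
        # Readers take the reference once and use that object throughout.
        return self._latest


def poll_loop(
    sample: Callable[[], ThermalState],
    snapshot: ThermalSnapshot,
    stop: threading.Event,
    period_s: float = 1.0,
) -> None:
    """Background poller: sample, publish, sleep until the next period or stop."""
    while not stop.is_set():
        snapshot.publish(sample())
        stop.wait(period_s)
```

`read()` returning `None` before the first sample lets the caller decide how to treat missing telemetry instead of hiding it.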
**Performance bottlenecks**:
- Per-frame inference cost is the F3 hot path's largest contributor. NFT-PERF-01 partition is the source of truth.
## 8. Dependency Graph
**Must be implemented after**: nothing internal — C7 is foundational.
**Can be implemented in parallel with**: C6, C13.
**Blocks**: C1 (CUDA strategies), C2, C2.5, C3, C3.5, C4 (consumes `ThermalState`), C10, F1, F2, F3, F6, F8.
## 9. Logging Strategy
| Log Level | When | Example |
|-----------|------|---------|
| ERROR | `EngineBuildError`, `EngineDeserializeError`, `OutOfMemoryError`, `CalibrationCacheError` | `C7 OOM during infer; backbone=ultravpr; frame=12345` |
| WARN | thermal-throttle entered/exited; telemetry unavailable | `C7 thermal throttle active; gpu_temp=83C; clock=750mhz` |
| INFO | Strategy ready; engine deserialised; backbone resident | `C7 ready: runtime=tensorrt, engines=[ultravpr@fp16, lightglue@fp16, disk@fp16]` |
| DEBUG | per-frame infer timing per backbone | `C7 infer backbone=ultravpr frame=12345 took=37ms` |
**Log format**: structured JSON.
**Log storage**: stdout / journald / FDR via C13 (ERROR + WARN always; thermal-state transitions always to FDR).
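A minimal sketch of structured JSON logging with the standard `logging` module. The JSON key names (`level`, `component`, `event`) and the `fields` convention are assumptions; the document specifies only that logs are structured JSON.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "component": "C7",
            "event": record.getMessage(),
        }
        # Structured fields arrive via `extra={"fields": {...}}` on the log call.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str = "c7") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()  # stderr by default
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger
```

For example, `make_logger().debug("infer", extra={"fields": {"backbone": "ultravpr", "frame": 12345, "took_ms": 37}})` emits the DEBUG row of the table as a single JSON line.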
@@ -0,0 +1,170 @@
# Test Specification — C7 On-Jetson Inference Runtime
Component-scoped. Suite-level coverage in `_docs/02_document/tests/*.md`.
## Acceptance Criteria Traceability
| AC ID | Acceptance Criterion (one-line) | Test IDs | Coverage |
|-------|---------------------------------|----------|----------|
| AC-4.1 | E2E latency <400 ms p95 | NFT-PERF-01 (Tier-2), **C7-PT-01** | Covered |
| AC-4.2 | Memory <8 GB on Jetson | NFT-LIM-01, **C7-PT-02** | Covered |
| AC-NEW-1 | Cold-start TTFF <30 s p95 | NFT-PERF-03, **C7-IT-01** | Covered |
| AC-NEW-5 | Operating envelope; thermal telemetry feed | NFT-LIM-04, **C7-IT-02** | Covered (workstation portion) |
| D-C10-3 | Manifest content-hash takeoff gate (gate is C10-owned, but the engine deserialise call is C7) | **C7-IT-03** | Covered |
| D-C10-7 | Engine filename schema (SM/JP/TRT/precision) | Helper-doc cited; **C7-IT-04** | Covered |
---
## Component-Internal Tests
### C7-IT-01: cold-start engine load + warm-up budget
**Summary**: from a cold (zero-resident-engines) Jetson process, every required engine deserialises and warms up in under the AC-NEW-1 30 s p95 budget.
**Traces to**: AC-NEW-1
**Description**: kill the companion process; restart; measure wall-clock from process start to "all engines warm" event in the FDR record stream. Repeat 10 times; assert p95 ≤ 30 s.
**Input data**: pre-built engine cache for the Derkachi fixture profile.
**Expected result**: p95 ≤ 30 s; no engine fails to warm.
**Max execution time**: 6 min (10 × ~30 s + overhead).
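The p95 assertion in C7-IT-01 depends on which percentile definition is used. The sketch below assumes the nearest-rank definition, which the test spec does not specify; the function names are hypothetical. Under nearest-rank, the p95 of 10 samples is the slowest sample, so the assertion amounts to "no run exceeds 30 s".

```python
from typing import Sequence


def p95_nearest_rank(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 95 * n / 100 avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_budget(samples: Sequence[float], budget_s: float = 30.0) -> bool:
    return p95_nearest_rank(samples) <= budget_s
```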
---
### C7-IT-02: thermal telemetry feeds C4's hybrid
**Summary**: `ThermalState` from `jetson-stats` is published at ≥1 Hz and is observable to C4; under simulated throttle, `thermal_throttle_active == true` is reported within 1 s of the throttle event.
**Traces to**: AC-NEW-5 (workstation-baseline portion; chamber portion deferred per traceability matrix)
**Description**: simulate a thermal-throttle event by spoofing the `jetson-stats` sysfs reading; assert (a) `ThermalState` updates carry `thermal_throttle_active == true` within 1 s, (b) C4's `current_covariance_mode` flips to JACOBIAN within 1 frame after that.
**Input data**: scripted sysfs spoof.
**Expected result**: telemetry latency ≤ 1 s; C4 reacts within 1 frame.
**Max execution time**: 30 s.
---
### C7-IT-03: D-C10-3 takeoff gate refuses mismatched engine
**Summary**: when the manifest's content-hash for an engine does not match the on-disk engine's hash, C7 refuses to deserialise and the F2 takeoff aborts.
**Traces to**: D-C10-3
**Description**: corrupt one byte of a deployed engine after the manifest has been signed; trigger F2 takeoff load; assert (a) C7 raises `EngineHashMismatchError`, (b) the airborne process refuses to open the FC adapter, (c) the failure is logged at ERROR.
**Input data**: a deployed engine + its corrupted twin.
**Expected result**: takeoff aborts; ERROR logged.
**Max execution time**: 30 s.
---
### C7-IT-04: SM / JetPack / TRT / precision filename schema enforcement
**Summary**: an engine file whose `<sm>/<jp>/<trt>/<precision>` quadruple in the filename does not match the running Jetson's actual quadruple is refused at deserialise time.
**Traces to**: D-C10-7
**Description**: copy a valid engine file but rename it with a mismatched SM (e.g., `sm86` instead of `sm87`); call `deserialize_engine`; assert `EngineSchemaMismatchError` and no GPU memory allocated.
**Input data**: a valid engine + a renamed copy.
**Expected result**: engine refused at filename-parse time.
**Max execution time**: 5 s.
---
### C7-IT-05: ONNX-RT fallback when TRT engine unavailable
**Summary**: if the primary TRT engine is missing or unloadable, C7 falls back to ONNX-RT + TRT-EP and continues without dropping the request.
**Traces to**: defensive (engine-rule simple-baseline path)
**Description**: move the TRT engine for one model out of the cache so that deserialise fails; call `infer`; assert the call succeeds via the ONNX-RT path and a degraded-latency warning is logged.
**Input data**: TRT engine + ONNX model side-by-side.
**Expected result**: successful inference; degraded-latency warning.
**Max execution time**: 30 s.
---
## Performance Tests
### C7-PT-01: per-call inference latency p95 by model
**Traces to**: AC-4.1
**Load scenario**: scripted call rate matching production — UltraVPR @ 3 Hz, LightGlue @ 9 Hz (3 cands × 3 Hz), AdHoP conditional (~25%).
**Expected results**:
| Model | Mode | p95 latency target | Failure threshold |
|-------|------|--------------------|-------------------|
| UltraVPR | TRT FP16 | ≤ 60 ms | 100 ms |
| LightGlue | TRT FP16 | ≤ 30 ms | 60 ms |
| AdHoP | TRT FP16 | ≤ 90 ms | 150 ms |
| DISK | TRT FP16 | ≤ 50 ms | 90 ms |
---
### C7-PT-02: aggregate GPU memory budget
**Traces to**: AC-4.2
**Load scenario**: all production-default engines resident concurrently.
**Expected results**:
| Metric | Target | Failure Threshold |
|--------|--------|-------------------|
| GPU resident memory (all engines) | ≤ 4 GB | 5 GB |
| System RAM (process resident) | ≤ 1.5 GB | 2 GB |
The remainder of the 8 GB shared LPDDR5 budget belongs to the OS, ROS-equivalents, and scratch; it is tracked at the system level by NFT-LIM-01.
---
## Security Tests
### C7-ST-01: engine deserialise refuses files with no SHA-256 sidecar
**Summary**: per Helper `Sha256Sidecar`, every engine has a sidecar `.sha256` file; deserialising an engine without one is refused.
**Traces to**: D-C10-3 (defensive)
**Test procedure**:
1. Delete the sidecar for one valid engine.
2. Call `deserialize_engine` on it.
3. Assert refusal with `EngineSidecarMissingError`.
**Pass criteria**: refusal + no GPU memory allocated.
**Fail criteria**: load succeeds.
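The refusal path exercised by C7-ST-01 (and the hash mismatch in C7-IT-03) can be sketched as a pure file-level check that runs before any GPU work. The two exception names come from this test spec; `verify_engine_file`, the `<file>.sha256` sidecar naming, and the sidecar format (hex digest as the first token) are assumptions. The C10-owned manifest comparison is out of scope here.

```python
import hashlib
from pathlib import Path


class EngineSidecarMissingError(Exception):
    pass


class EngineHashMismatchError(Exception):
    pass


def verify_engine_file(engine_path: Path) -> str:
    """Return the verified SHA-256 hex digest, or raise before any load."""
    sidecar = engine_path.with_name(engine_path.name + ".sha256")
    if not sidecar.is_file():
        raise EngineSidecarMissingError(f"no sidecar for {engine_path.name}")

    tokens = sidecar.read_text(encoding="ascii").split()
    expected = tokens[0].lower() if tokens else ""

    # Hash in chunks so large engine files are not read into memory at once.
    hasher = hashlib.sha256()
    with engine_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    actual = hasher.hexdigest()

    if actual != expected:
        raise EngineHashMismatchError(
            f"{engine_path.name}: sidecar {expected!r} != actual {actual!r}"
        )
    return actual
```

Keeping the check free of GPU calls makes the "no GPU memory allocated" pass criterion straightforward to satisfy: deserialisation is attempted only after this function returns.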
---
## Acceptance Tests
C7 has no operator-facing behaviour; covered transitively via NFT-PERF-01 / NFT-PERF-03.
---
## Test Data Management
| Data Set | Source | Size |
|----------|--------|------|
| Pre-built engine cache for Derkachi profile | C10 build artifact | ~600 MB |
| Spoofed `jetson-stats` sysfs harness | scripted | <1 MB |
| Corrupted-engine fixture | scripted | varies |
**Setup**: C10 must have built engines for SM 87 / JP 6.2 / TRT 10.3 / FP16 once before C7 tests can run on Tier-2.
**Teardown**: read-only.
**Data isolation**: per-test temp dirs.