# C7 OnnxTrtEpRuntime — ONNX Runtime + TensorRT EP Fallback

**Task**: AZ-299_c7_onnxrt_fallback
**Name**: C7 OnnxTrtEpRuntime
**Description**: Implement `OnnxTrtEpRuntime`, the fallback `InferenceRuntime` strategy that uses ONNX Runtime with the TensorRT execution provider. Triggered by config selection or by operator escalation when the TRT-direct path (AZ-298) cannot deserialise the cached engine for a given model — produces correct results with a degraded-latency WARN log (covers C7-IT-05). Conforms to the same AZ-297 Protocol so the composition root can select either strategy at startup.
**Complexity**: 3 points
**Dependencies**: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
**Component**: c7_inference (epic AZ-249 / E-C7)
**Tracker**: AZ-299
**Epic**: AZ-249 (E-C7)

### Document Dependencies

- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297.
- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — used at deserialise time when reusing a cached `.engine` via ORT TRT EP.
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar trust check via the gate.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that provides ORT cache dir + ONNX model paths.

## Problem

Two scenarios force a fallback off `TensorrtRuntime`:

1. The cached `.engine` for a given model fails to deserialise on this Jetson (filename-schema mismatch outside the gate's tolerance, an op TRT 10.3 dropped between calibrator runs, or operator-driven engine rename) — the system must keep flying without dropping the request.
2. The operator deliberately selects ORT + TRT EP via config (e.g., during a debugging session, or on a Tier-1 workstation where TRT-direct hard-fails).

Without `OnnxTrtEpRuntime`:

- C7-IT-05 (ONNX-RT fallback when TRT engine unavailable) has nothing to verify.
- The "simple-baseline" engineering rule that every fancy strategy must have a working fallback is violated.
- The ADR-001 "runtime selectable at startup" promise has only the production strategy and the PyTorch baseline; ORT is the middle ground users expect.

## Outcome

- An `OnnxTrtEpRuntime` class at `src/gps_denied_onboard/components/c7_inference/onnx_trt_runtime.py` conforming to the AZ-297 Protocol; `current_runtime_label() == "onnx_trt_ep"`.
- `compile_engine` is a no-op for ORT — it returns an `EngineCacheEntry` whose `engine_path` is the underlying `.onnx` file path. ORT will lazy-compile a TRT subgraph in-session on first use; the EP cache directory holds those subgraph caches transparently.
- `deserialize_engine(EngineCacheEntry) -> EngineHandle`: if the entry's `engine_path` is a `.engine` file, invoke AZ-301 EngineGate first (cached engine reuse via ORT TRT EP cache); if it is a `.onnx`, skip the gate and load the ONNX directly. In either case, build an `InferenceSession` with the TRT EP provider list `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]` (provider fallback chain) and warm up by running one zero-input through the session.
- `infer(handle, inputs) -> outputs` calls `session.run(output_names, inputs)`; the call is sync; ORT manages its own CUDA stream internally.
- `release_engine(handle)` calls `session.end_profiling()` and drops the session reference; ORT releases EP resources on garbage collection.
- A degraded-latency WARN log (`kind="c7.fallback_to_onnx_trt_ep"`) is emitted ONCE on first `infer` call when the runtime was selected as a fallback (signalled by the composition root via a constructor flag); the operator post-flight FDR shows the fallback was used.
- `thermal_state()` is delegated to the same `ThermalStatePublisher` reference used by `TensorrtRuntime` (AZ-302 owns the publisher).
- ORT version pin matches the project's `requirements.txt` / equivalent; this task does NOT introduce a new ORT version.

## Scope

### Included

- `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol.
- `compile_engine`: returns an `EngineCacheEntry` pointing at the source `.onnx` file (no separate engine binary). The `.onnx`'s sha256 is computed via AZ-280 and stamped into the entry; the `(SM, JP, TRT, precision)` tuple is set to the host's running tuple (since ORT will lazy-compile per host).
- `deserialize_engine`: provider list construction (`["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`), provider options (`trt_engine_cache_enable=True`, `trt_engine_cache_path=config.inference.ort_trt_cache_dir`, `trt_max_workspace_size=config.inference.gpu_memory_budget_bytes // 4`), session creation, single-shot warm-up. Returns `OnnxTrtEpEngineHandle` (`EngineHandle` subclass) wrapping the session.
- Engine-cache reuse path: if `EngineCacheEntry.engine_path.suffix == ".engine"`, invoke `EngineGate.validate(entry)` first; the engine binary is then placed at the `trt_engine_cache_path` so ORT's TRT EP picks it up on session creation. If gate refuses, raise the gate's error (no fallback to ONNX direct — the operator must explicitly switch to a different cache or restart with a different config).
- ONNX direct path: if `engine_path.suffix == ".onnx"`, skip the gate (no engine to validate); ORT compiles the TRT subgraph in-session.
- `infer`: `session.run(output_names, {name: ndarray for name, ndarray in inputs.items()})`; output is a dict matching the Protocol shape. Per-frame DEBUG log gated by config.
- `release_engine`: idempotent session drop.
- Constructor flag `is_fallback: bool = False` set by the composition root when this strategy is wired as the fallback (vs. selected directly). On first `infer`, if `is_fallback`, emit the `kind="c7.fallback_to_onnx_trt_ep"` WARN log + an FDR record (via `gcs_alert` in the AZ-291 sense — the actual GCS wire is C8's job; this task calls a constructor-injected callback).
- Error envelope: `EngineBuildError` (ONNX validation failure), `EngineDeserializeError` (session creation failure), `InferenceError` (mid-flight ORT runtime exception), `OutOfMemoryError` (mid-session OOM), the gate's errors when applicable.

### Excluded

- AZ-298 TensorrtRuntime (production-default) — separate task.
- AZ-300 PytorchFp16Runtime — separate task.
- AZ-301 EngineGate validation logic — this task INVOKES the gate.
- AZ-302 ThermalState polling — this task delegates.
- ORT version upgrade or new ORT dependency — pinned to the project's existing version.
- Custom ORT EPs (e.g., TVM, OpenVINO) — out of scope this cycle.
- Multi-session pooling — one session per `EngineHandle` this cycle.
- Engine cache directory cleanup — operator-initiated; handled by C12 operator tooling.
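
The provider configuration and the `.engine` / `.onnx` dispatch described under Included could be sketched roughly as below. This is a minimal illustration, not the task's implementation: the helper names `build_trt_ep_options`, `build_providers`, and `uses_engine_gate` are hypothetical, and the plain arguments stand in for `config.inference.ort_trt_cache_dir` and `config.inference.gpu_memory_budget_bytes`. Only the three option keys named in this spec are used.

```python
from pathlib import Path
from typing import Any, Dict, List

# Fixed provider order from this spec: TRT first, CPU as the last resort.
PROVIDER_ORDER = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def build_trt_ep_options(cache_dir: Path, gpu_memory_budget_bytes: int) -> Dict[str, Any]:
    """Build the TRT EP option dict in one place (hypothetical helper).

    Keeping the option keys behind a single function means a unit test can
    pin the key names, as the Risk 3 mitigation suggests.
    """
    if gpu_memory_budget_bytes <= 0:
        raise ValueError("gpu_memory_budget_bytes must be positive")
    return {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": str(cache_dir),
        # A quarter of the GPU budget goes to TRT workspace (4 GB -> 1 GB).
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
    }


def build_providers(cache_dir: Path, gpu_memory_budget_bytes: int) -> List[Any]:
    """Return the provider list with options attached to the TRT EP only."""
    trt_options = build_trt_ep_options(cache_dir, gpu_memory_budget_bytes)
    return [(PROVIDER_ORDER[0], trt_options), PROVIDER_ORDER[1], PROVIDER_ORDER[2]]


def uses_engine_gate(engine_path: Path) -> bool:
    """Decide whether EngineGate validation applies to this cache entry path."""
    if engine_path.suffix == ".engine":
        return True  # cached engine reuse: validate before session creation
    if engine_path.suffix == ".onnx":
        return False  # ONNX direct path: nothing to validate
    raise ValueError(f"unsupported engine_path suffix: {engine_path.suffix!r}")
```

The list returned by `build_providers` has the shape accepted by the `providers` argument of `onnxruntime.InferenceSession`, which takes provider names or `(name, options)` tuples in decreasing precedence. Session creation and the warm-up run are left out here so the sketch stays importable without ORT installed.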
## Acceptance Criteria

**AC-1: Protocol conformance**
Given `runtime_checkable(InferenceRuntime)`
When `isinstance(OnnxTrtEpRuntime(...), InferenceRuntime)` is evaluated
Then result is `True`; `current_runtime_label() == "onnx_trt_ep"`

**AC-2: deserialize from `.onnx` skips the gate**
Given an `EngineCacheEntry(engine_path=)`
When `deserialize_engine(entry)` is called
Then `EngineGate.validate` is NOT called; an ORT `InferenceSession` is created with the TRT EP at the head of the provider list; a single warm-up `session.run` succeeds

**AC-3: deserialize from `.engine` invokes the gate**
Given an `EngineCacheEntry(engine_path=)` whose filename schema mismatches the host
When `deserialize_engine(entry)` is called
Then `EngineGate.validate` is invoked first and raises `EngineSchemaMismatchError`; no `InferenceSession` is created

**AC-4: infer round-trips through ORT and returns named outputs**
Given a deserialised handle for an UltraVPR-shaped model and a fixed input dict
When `infer(handle, inputs)` is called
Then `session.run` is invoked with the model's declared output names; the returned dict matches the Protocol shape; numerical outputs match the TRT-direct strategy within a documented tolerance (FP16 round-trip)

**AC-5: fallback WARN log fires once on first infer**
Given a runtime constructed with `is_fallback=True`
When `infer` is called for the first time
Then exactly one `kind="c7.fallback_to_onnx_trt_ep"` WARN log is emitted AND one `gcs_alert(...)` callback is invoked; subsequent `infer` calls do NOT emit the log again

**AC-6: provider fallback chain respects ORT order**
Given an environment where TRT EP refuses to load (e.g., TRT version mismatch outside ORT's tolerance) but CUDA EP is available
When `deserialize_engine` is called
Then session creation succeeds with `CUDAExecutionProvider` as the active provider; an INFO log records the actual provider in use; `current_runtime_label()` STILL returns `"onnx_trt_ep"` (the runtime label is the strategy, not the EP)

**AC-7: release_engine drops the session and is idempotent**
Given a deserialised handle
When `release_engine(handle)` is called once and then again
Then the first call drops the session reference and the GPU memory ORT held is released; the second call returns silently

**AC-8: workspace budget is respected**
Given `config.inference.gpu_memory_budget_bytes = 4 GB`
When `deserialize_engine` is called and ORT's TRT EP attempts to allocate workspace
Then the workspace size is capped at `gpu_memory_budget_bytes // 4 = 1 GB` per the provider option; an attempt to exceed this raises `OutOfMemoryError`

## Non-Functional Requirements

**Performance**

- Per-call latency budget is the same as TRT (C7-PT-01) but the test fixture allows up to 1.5× slack (the WARN log notes "degraded latency" — the operator has been informed). Failure thresholds match TRT's failure thresholds (100 / 60 / 150 / 90 ms).
- Session creation budget: ≤ 30 s p95 on Tier-2 for the first deserialise (ORT lazy-compile dominates the first call); subsequent deserialises that hit the EP cache should be ≤ 5 s p95.

**Compatibility**

- ORT version pinned to the project's `requirements.txt` (this task does NOT change it).
- TRT EP requires the same TRT 10.3 / CUDA stack as `TensorrtRuntime`; if absent, `CUDAExecutionProvider` carries the inference (degraded but functional) per AC-6.

**Reliability**

- Errors rewrapped into the AZ-297 family.
- ORT-internal exceptions (e.g., `onnxruntime.OrtInvalidArgument`) are caught and rewrapped as `InferenceError`.
- Session lifetime is bound to the `EngineHandle`; the runtime never holds a session beyond an explicit `release_engine` or process exit.
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Protocol conformance + label | `isinstance` True; label string match |
| AC-2 | Deserialise from `.onnx` | Gate NOT called; session created with TRT EP head |
| AC-3 | Deserialise from `.engine` with mismatched schema | Gate raises; no session |
| AC-4 | Numerical comparison against TRT-direct (UltraVPR sample input) | Outputs match within FP16 tolerance |
| AC-5 | First infer with `is_fallback=True` | Exactly one WARN log; one gcs_alert; second infer silent |
| AC-6 | Force TRT EP refusal, allow CUDA EP | Session creates with CUDA EP; label still `"onnx_trt_ep"` |
| AC-7 | release_engine called twice | First drops; second is no-op |
| AC-8 | Workspace cap | Provider option set to budget // 4 |
| NFR-perf-session-create | Microbench session creation × 5 | First p95 ≤ 30 s; subsequent ≤ 5 s |
| NFR-reliability-error-rewrap | Inject ORT internal error | Rewrapped to `InferenceError` |

## Constraints

- ORT version pinned at the project default; no upgrade in this task.
- Provider list order is fixed: `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`. CPU is the bottom-of-list fallback so a misconfigured Jetson still serves results (slowly) instead of hard-failing — the WARN log will scream, but the system stays alive.
- ORT EP's `trt_engine_cache_path` is a config field; defaults to a per-flight subdirectory of `config.inference.engine_cache_dir`.
- The `OnnxTrtEpEngineHandle` is opaque to consumers.
- This task introduces no new third-party dependencies beyond what ORT already requires.

## Risks & Mitigation

**Risk 1: ORT TRT EP cache poisons across precisions**

- *Risk*: The TRT EP cache is keyed by ORT-internal hashes, not by D-C10-7 schema. Reusing a stale cache could load a wrong-precision subgraph.
- *Mitigation*: The EP cache directory is per-flight (under `engine_cache_dir`) and cleaned on flight end by C12 operator tooling. The cache lives only across a single flight; cross-flight reuse is operator-initiated and explicit.

**Risk 2: Provider fallback to CPU EP is silent**

- *Risk*: If both TRT and CUDA EPs refuse to load, ORT falls back to CPU; latency budgets explode and the system "works" at 100× target latency.
- *Mitigation*: The active provider list is logged at INFO on session creation; if CPU is the active provider, an additional WARN log fires (`kind="c7.cpu_fallback"`). The composition root MAY install a hard refusal hook that raises `EngineDeserializeError` on CPU fallback (operator-configurable; default is "warn but allow").

**Risk 3: ORT version drift breaks TRT EP option keys**

- *Risk*: ORT 1.16 → 1.18 changed some TRT EP option key names; a future upgrade silently regresses.
- *Mitigation*: The TRT EP provider option dict is built behind a single helper function; the helper has a unit test that pins the option-key names. Upgrade-time changes fail the unit test before merge.

## Runtime Completeness

- **Named capability**: ONNX Runtime + TensorRT execution-provider fallback path (architecture / E-C7 / C7-IT-05).
- **Production code that must exist**: real `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol; real ORT `InferenceSession` creation with the TRT-EP-led provider list; real ORT `session.run` on the F3 hot path; real fallback WARN + GCS alert wiring.
- **Allowed external stubs**: tests MAY substitute a recording wrapper around `session.run` to verify call sequence (AC-4); production wiring uses real ORT.
- **Unacceptable substitutes**: a stub that just defers to `TensorrtRuntime` (would defeat the fallback's purpose), an ORT session without the TRT EP at the head of the provider list (would silently use CUDA EP and break C7-IT-05's design intent), or hardcoded CPU-EP-only mode (would silently meet the test outcome with absurd latency).
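
One way to keep a silent CUDA-only or CPU-only session from going unnoticed is to classify the provider list the session reports after creation, in the spirit of the Risk 2 mitigation. The sketch below is illustrative only: `classify_active_providers` and its `refuse_cpu` flag are hypothetical, `EngineDeserializeError` is a local stand-in for the AZ-297 error type, and it assumes the list passed in is what `InferenceSession.get_providers()` returned, in priority order.

```python
from typing import Sequence

TRT_EP = "TensorrtExecutionProvider"
CUDA_EP = "CUDAExecutionProvider"
CPU_EP = "CPUExecutionProvider"


class EngineDeserializeError(RuntimeError):
    """Local stand-in for the AZ-297 EngineDeserializeError."""


def classify_active_providers(active: Sequence[str], refuse_cpu: bool = False) -> str:
    """Map the session's provider list to a tier: "trt", "cuda", or "cpu".

    The caller is expected to log the tier at INFO, add a WARN with
    kind="c7.cpu_fallback" for the "cpu" tier, and keep the runtime label
    "onnx_trt_ep" regardless of the tier.
    """
    if not active:
        raise EngineDeserializeError("session reports no execution providers")
    head = active[0]
    if head == TRT_EP:
        return "trt"
    if head == CUDA_EP:
        return "cuda"
    if head == CPU_EP:
        if refuse_cpu:
            # Optional hard-refusal hook; the spec's default is warn-but-allow.
            raise EngineDeserializeError("CPU-only fallback refused by config")
        return "cpu"
    raise EngineDeserializeError(f"unexpected execution provider: {head}")
```

Keeping this check as a pure function lets a unit test exercise the TRT-refused and CUDA-refused cases without needing a Jetson or a GPU.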