# C7 OnnxTrtEpRuntime — ONNX Runtime + TensorRT EP Fallback

**Task**: AZ-299_c7_onnxrt_fallback
**Name**: C7 OnnxTrtEpRuntime
**Description**: Implement `OnnxTrtEpRuntime`, the fallback `InferenceRuntime` strategy that runs ONNX Runtime (ORT) with the TensorRT execution provider (TRT EP). It is selected either by config or by operator escalation when the TRT-direct path (AZ-298) cannot deserialise the cached engine for a given model. In the fallback case it produces correct results and emits a degraded-latency WARN log (covers C7-IT-05). It conforms to the same AZ-297 Protocol, so the composition root can select either strategy at startup.
**Complexity**: 3 points
**Dependencies**: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
**Component**: c7_inference (epic AZ-249 / E-C7)
**Tracker**: AZ-299
**Epic**: AZ-249 (E-C7)

### Document Dependencies

- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297.
- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — used at deserialise time when reusing a cached `.engine` via ORT TRT EP.
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar trust check via the gate.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that provides ORT cache dir + ONNX model paths.

## Problem

Two scenarios force a fallback off `TensorrtRuntime`:

1. The cached `.engine` for a given model fails to deserialise on this Jetson (a filename-schema mismatch outside the gate's tolerance, an op that TRT 10.3 dropped between calibrator runs, or an operator-driven engine rename). The system must keep flying without dropping the request.
2. The operator deliberately selects ORT + TRT EP via config (e.g., during a debugging session, or on a Tier-1 workstation where TRT-direct hard-fails).

Without `OnnxTrtEpRuntime`:

- C7-IT-05 (ONNX-RT fallback when TRT engine unavailable) has nothing to verify.
- The "simple-baseline" engineering rule that every fancy strategy must have a working fallback is violated.
- The ADR-001 "runtime selectable at startup" promise has only the production strategy and the PyTorch baseline; ORT is the middle ground users expect.

## Outcome

- An `OnnxTrtEpRuntime` class at `src/gps_denied_onboard/components/c7_inference/onnx_trt_runtime.py` conforming to the AZ-297 Protocol; `current_runtime_label() == "onnx_trt_ep"`.
- `compile_engine` builds no engine binary for ORT: it returns an `EngineCacheEntry` whose `engine_path` is the underlying `.onnx` file path. ORT lazy-compiles the TRT subgraph in-session on first use, and the EP cache directory holds those subgraph caches transparently.
- `deserialize_engine(EngineCacheEntry) -> EngineHandle`: if the entry's `engine_path` is a `.engine` file, invoke AZ-301 EngineGate first (cached engine reuse via ORT TRT EP cache); if it is a `.onnx`, skip the gate and load the ONNX directly. In either case, build an `InferenceSession` with the TRT EP provider list `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]` (provider fallback chain) and warm up by running one zero-filled input through the session.
- `infer(handle, inputs) -> outputs` calls `session.run(output_names, inputs)` and maps the returned list back onto `output_names`; the call is synchronous, and ORT manages its own CUDA stream internally.
- `release_engine(handle)` calls `session.end_profiling()` and drops the session reference; ORT releases EP resources on garbage collection.
- A degraded-latency WARN log (`kind="c7.fallback_to_onnx_trt_ep"`) is emitted ONCE on the first `infer` call when the runtime was selected as a fallback (signalled by the composition root via a constructor flag), so the operator's post-flight FDR shows that the fallback was used.
- `thermal_state()` is delegated to the same `ThermalStatePublisher` reference used by `TensorrtRuntime` (AZ-302 owns the publisher).
- ORT version pin matches the project's `requirements.txt` / equivalent; this task does NOT introduce a new ORT version.
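
The `infer` outcome above can be sketched as follows. This is a minimal illustration, not the project's implementation: `FakeSession` is a hypothetical stand-in so the sketch runs without ORT installed, the handle layout is an assumption, and the output names `descriptor` and `score` are invented for the demo. The one grounded fact is that ORT's `InferenceSession.run(output_names, input_feed)` returns a list ordered as `output_names`, so the runtime has to zip it back into a dict.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


class FakeSession:
    """Hypothetical stand-in that mimics the list return of InferenceSession.run."""

    def run(self, output_names: Sequence[str], input_feed: Dict[str, Any]) -> List[Any]:
        # Echo one value per requested output so the ordering is observable.
        return [f"{name}:{len(input_feed)}" for name in output_names]


@dataclass
class OnnxTrtEpEngineHandle:
    """Assumed handle layout: the session plus the model's declared output names."""

    session: Any
    output_names: List[str]


def infer(handle: OnnxTrtEpEngineHandle, inputs: Dict[str, Any]) -> Dict[str, Any]:
    # The session returns a list in the same order as output_names.
    raw = handle.session.run(handle.output_names, dict(inputs))
    # Zip the list back into the name -> value dict the Protocol expects.
    return dict(zip(handle.output_names, raw))


handle = OnnxTrtEpEngineHandle(FakeSession(), ["descriptor", "score"])
outputs = infer(handle, {"image": [0.0, 0.0]})
```

With a real session, the declared output names could be collected once at deserialise time from `session.get_outputs()` and stored on the handle.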

## Scope

### Included

- `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol.
- `compile_engine`: returns an `EngineCacheEntry` pointing at the source `.onnx` file (no separate engine binary). The `.onnx`'s sha256 is computed via AZ-280 and stamped into the entry; the `(SM, JP, TRT, precision)` tuple is set to the host's running tuple (since ORT will lazy-compile per host).
- `deserialize_engine`: provider list construction (`["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`), provider options (`trt_engine_cache_enable=True`, `trt_engine_cache_path=config.inference.ort_trt_cache_dir`, `trt_max_workspace_size=config.inference.gpu_memory_budget_bytes // 4`), session creation, single-shot warm-up. Returns `OnnxTrtEpEngineHandle` (`EngineHandle` subclass) wrapping the session.
- Engine-cache reuse path: if `EngineCacheEntry.engine_path.suffix == ".engine"`, invoke `EngineGate.validate(entry)` first; the engine binary is then placed at the `trt_engine_cache_path` so ORT's TRT EP picks it up on session creation. If the gate refuses, raise the gate's error. There is no fallback to ONNX direct: the operator must explicitly switch to a different cache or restart with a different config.
- ONNX direct path: if `engine_path.suffix == ".onnx"`, skip the gate (no engine to validate); ORT compiles the TRT subgraph in-session.
- `infer`: `session.run(output_names, {name: ndarray for name, ndarray in inputs.items()})`; ORT returns a list ordered as `output_names`, which the runtime zips into a dict matching the Protocol shape. Per-frame DEBUG log gated by config.
- `release_engine`: idempotent session drop.
- Constructor flag `is_fallback: bool = False` set by the composition root when this strategy is wired as the fallback (vs. selected directly). On first `infer`, if `is_fallback`, emit the `kind="c7.fallback_to_onnx_trt_ep"` WARN log + an FDR record (via `gcs_alert` in the AZ-291 sense — the actual GCS wire is C8's job; this task calls a constructor-injected callback).
- Error envelope: `EngineBuildError` (ONNX validation failure), `EngineDeserializeError` (session creation failure), `InferenceError` (mid-flight ORT runtime exception), `OutOfMemoryError` (mid-session OOM), the gate's errors when applicable.
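
The two deserialise paths above differ only in whether the gate runs before session creation. The sketch below shows that dispatch under simplifying assumptions: it takes a bare `Path` instead of an `EngineCacheEntry`, the gate and session factory are injected callables, copying the engine into `trt_engine_cache_path` is omitted, and the handling of any other suffix is an assumption because the spec does not define it.

```python
from pathlib import Path
from typing import Callable


class EngineDeserializeError(Exception):
    """Placeholder for the AZ-297 error of the same name."""


def deserialize_engine(
    engine_path: Path,
    validate: Callable[[Path], None],
    create_session: Callable[[Path], object],
) -> object:
    suffix = engine_path.suffix
    if suffix == ".engine":
        # Gate first: if it raises, the error propagates and no session is created.
        validate(engine_path)
    elif suffix != ".onnx":
        # Assumption: unknown suffixes are rejected rather than guessed at.
        raise EngineDeserializeError(f"unsupported engine_path suffix: {suffix!r}")
    # Both accepted paths end in the same session-creation step.
    return create_session(engine_path)
```

Injecting `validate` and `create_session` keeps the ordering testable without a GPU: a test can pass recording callables and assert the call sequence.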

### Excluded

- AZ-298 TensorrtRuntime (production-default) — separate task.
- AZ-300 PytorchFp16Runtime — separate task.
- AZ-301 EngineGate validation logic — this task INVOKES the gate.
- AZ-302 ThermalState polling — this task delegates.
- ORT version upgrade or new ORT dependency — pinned to the project's existing version.
- Custom ORT EPs (e.g., TVM, OpenVINO) — out of scope this cycle.
- Multi-session pooling — one session per `EngineHandle` this cycle.
- Engine cache directory cleanup — operator-initiated; handled by C12 operator tooling.

## Acceptance Criteria

**AC-1: Protocol conformance**
Given `InferenceRuntime` is decorated with `@runtime_checkable`
When `isinstance(OnnxTrtEpRuntime(...), InferenceRuntime)` is evaluated
Then result is `True`; `current_runtime_label() == "onnx_trt_ep"`
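
A hedged sketch of what AC-1 exercises. The Protocol's method set is inferred from the methods this task names; the authoritative definition is the AZ-297 contract, and the parameter names here are assumptions. Note that `isinstance` against a `runtime_checkable` Protocol only checks that the methods exist, not their signatures.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class InferenceRuntime(Protocol):
    """Assumed method set; the authoritative Protocol is the AZ-297 contract."""

    def compile_engine(self, model_path: Any) -> Any: ...
    def deserialize_engine(self, entry: Any) -> Any: ...
    def infer(self, handle: Any, inputs: Dict[str, Any]) -> Dict[str, Any]: ...
    def release_engine(self, handle: Any) -> None: ...
    def current_runtime_label(self) -> str: ...
    def thermal_state(self) -> Any: ...


class OnnxTrtEpRuntime:
    """Skeleton only: every method except the label is a stub."""

    def compile_engine(self, model_path: Any) -> Any:
        raise NotImplementedError

    def deserialize_engine(self, entry: Any) -> Any:
        raise NotImplementedError

    def infer(self, handle: Any, inputs: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError

    def release_engine(self, handle: Any) -> None:
        raise NotImplementedError

    def current_runtime_label(self) -> str:
        return "onnx_trt_ep"

    def thermal_state(self) -> Any:
        raise NotImplementedError


# Structural check: passes because every Protocol method is present.
conforms = isinstance(OnnxTrtEpRuntime(), InferenceRuntime)
```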

**AC-2: deserialize from `.onnx` skips the gate**
Given an `EngineCacheEntry(engine_path=<onnx_path>)`
When `deserialize_engine(entry)` is called
Then `EngineGate.validate` is NOT called; an ORT `InferenceSession` is created with the TRT EP at the head of the provider list; a single warm-up `session.run` succeeds

**AC-3: deserialize from `.engine` invokes the gate**
Given an `EngineCacheEntry(engine_path=<engine_path>)` whose filename schema mismatches the host
When `deserialize_engine(entry)` is called
Then `EngineGate.validate` is invoked first and raises `EngineSchemaMismatchError`; no `InferenceSession` is created

**AC-4: infer round-trips through ORT and returns named outputs**
Given a deserialised handle for an UltraVPR-shaped model and a fixed input dict
When `infer(handle, inputs)` is called
Then `session.run` is invoked with the model's declared output names; the returned dict matches the Protocol shape; numerical outputs match the TRT-direct strategy within a documented tolerance (FP16 round-trip)

**AC-5: fallback WARN log fires once on first infer**
Given a runtime constructed with `is_fallback=True`
When `infer` is called for the first time
Then exactly one `kind="c7.fallback_to_onnx_trt_ep"` WARN log is emitted AND one `gcs_alert(...)` callback is invoked; subsequent `infer` calls do NOT emit the log again
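
One way to get the "exactly once" behaviour is a small latch that `infer` calls before running the session. This is a sketch under assumptions: the logger name, the class name `FallbackNotifier`, the message text, and the single-string signature of the `gcs_alert` callback are all hypothetical; only the `kind` value comes from this spec.

```python
import logging
import threading
from typing import Callable

FALLBACK_KIND = "c7.fallback_to_onnx_trt_ep"
log = logging.getLogger("c7_inference")  # hypothetical logger name


class FallbackNotifier:
    """Fires the fallback WARN log and the alert callback at most once."""

    def __init__(self, is_fallback: bool, gcs_alert: Callable[[str], None]) -> None:
        self._pending = is_fallback
        self._gcs_alert = gcs_alert
        self._lock = threading.Lock()

    def notify_once(self) -> None:
        # Flip the flag under the lock so concurrent first calls cannot both fire.
        with self._lock:
            if not self._pending:
                return
            self._pending = False
        log.warning(
            "serving via ORT + TRT EP fallback; expect degraded latency",
            extra={"kind": FALLBACK_KIND},
        )
        self._gcs_alert(FALLBACK_KIND)
```

Calling `notify_once()` at the top of every `infer` keeps the hot path to one lock acquisition and one boolean check after the first frame.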

**AC-6: provider fallback chain respects ORT order**
Given an environment where TRT EP refuses to load (e.g., TRT version mismatch outside ORT's tolerance) but CUDA EP is available
When `deserialize_engine` is called
Then session creation succeeds with `CUDAExecutionProvider` as the active provider; an INFO log records the actual provider in use; `current_runtime_label()` STILL returns `"onnx_trt_ep"` (the runtime label is the strategy, not the EP)

**AC-7: release_engine drops the session and is idempotent**
Given a deserialised handle
When `release_engine(handle)` is called once and then again
Then the first call drops the session reference and the GPU memory ORT held is released; the second call returns silently
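
A minimal sketch of the idempotent drop, assuming the handle stores the session in a single attribute. Profiling teardown and any ORT-specific cleanup are left out; the sketch only shows why the second call is harmless.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class OnnxTrtEpEngineHandle:
    """Assumed handle layout: one strong reference to the session."""

    session: Optional[Any]


def release_engine(handle: OnnxTrtEpEngineHandle) -> None:
    if handle.session is None:
        return  # already released: the second call is a silent no-op
    # Drop the handle's reference; the session is freed once nothing else holds it.
    handle.session = None
```

Memory is only returned if the handle held the last reference, which is one reason to keep the handle opaque to consumers.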

**AC-8: workspace budget is respected**
Given `config.inference.gpu_memory_budget_bytes = 4 GB`
When `deserialize_engine` is called and ORT's TRT EP attempts to allocate workspace
Then the workspace size is capped at `gpu_memory_budget_bytes // 4 = 1 GB` per the provider option; an attempt to exceed this raises `OutOfMemoryError`
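
The arithmetic behind AC-8, as a small worked example. The helper name `trt_workspace_cap` is hypothetical. Whether "4 GB" means binary or decimal units is not stated in the criterion, so both readings are shown; the integer division gives a quarter of the budget either way.

```python
GIB = 1024 ** 3  # assumption: one reading of "GB" is binary gibibytes


def trt_workspace_cap(gpu_memory_budget_bytes: int) -> int:
    """Return the value passed as trt_max_workspace_size: a quarter of the budget."""
    return gpu_memory_budget_bytes // 4


cap_binary = trt_workspace_cap(4 * GIB)        # 1 GiB = 1073741824 bytes
cap_decimal = trt_workspace_cap(4 * 10 ** 9)   # 1 GB  = 1000000000 bytes
```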

## Non-Functional Requirements

**Performance**
- Per-call latency budget is the same as TRT (C7-PT-01), but the test fixture allows up to 1.5× slack because the WARN log has already informed the operator of degraded latency. Failure thresholds match TRT's failure thresholds (100 / 60 / 150 / 90 ms).
- Session creation budget: ≤ 30 s p95 on Tier-2 for the first deserialise (ORT lazy-compile dominates the first call); subsequent deserialises that hit the EP cache should be ≤ 5 s p95.

**Compatibility**
- ORT version pinned to the project's `requirements.txt` (this task does NOT change it).
- TRT EP requires the same TRT 10.3 / CUDA stack as `TensorrtRuntime`; if absent, `CUDAExecutionProvider` carries the inference (degraded but functional) per AC-6.

**Reliability**
- Errors rewrapped into the AZ-297 family.
- ORT-internal exceptions (e.g., `onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument`) are caught and rewrapped as `InferenceError`.
- Session lifetime is bound to the `EngineHandle`; the runtime never holds a session beyond an explicit `release_engine` or process exit.
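
The rewrapping rule can be sketched as a context manager placed around the `session.run` call. This is an illustration under assumptions: `FakeOrtError` is a hypothetical stand-in for an ORT-raised exception, and catching `Exception` broadly is a simplification. A real implementation would likely narrow the catch and let the other AZ-297 errors, such as `OutOfMemoryError`, pass through.

```python
from contextlib import contextmanager
from typing import Iterator


class InferenceError(Exception):
    """Placeholder for the AZ-297 error of the same name."""


class FakeOrtError(Exception):
    """Hypothetical stand-in for an exception raised inside ORT."""


@contextmanager
def rewrap_ort_errors() -> Iterator[None]:
    try:
        yield
    except InferenceError:
        raise  # already in the AZ-297 family, do not double-wrap
    except Exception as exc:
        # Chain the original so the ORT message and traceback stay in the logs.
        raise InferenceError(f"ORT run failed: {exc}") from exc
```

Using `raise ... from exc` keeps the original exception reachable as `__cause__`, which helps post-flight diagnosis.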

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Protocol conformance + label | `isinstance` True; label string match |
| AC-2 | Deserialise from `.onnx` | Gate NOT called; session created with TRT EP head |
| AC-3 | Deserialise from `.engine` with mismatched schema | Gate raises; no session |
| AC-4 | Numerical comparison against TRT-direct (UltraVPR sample input) | Outputs match within FP16 tolerance |
| AC-5 | First infer with `is_fallback=True` | Exactly one WARN log; one gcs_alert; second infer silent |
| AC-6 | Force TRT EP refusal, allow CUDA EP | Session creates with CUDA EP; label still `"onnx_trt_ep"` |
| AC-7 | release_engine called twice | First drops; second is no-op |
| AC-8 | Workspace cap | Provider option set to budget // 4 |
| NFR-perf-session-create | Microbench session creation × 5 | First p95 ≤ 30 s; subsequent ≤ 5 s |
| NFR-reliability-error-rewrap | Inject ORT internal error | Rewrapped to `InferenceError` |

## Constraints

- ORT version pinned at the project default; no upgrade in this task.
- Provider list order is fixed: `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`. CPU is the bottom-of-list fallback so a misconfigured Jetson still serves results (slowly) instead of hard-failing — the WARN log will scream, but the system stays alive.
- ORT EP's `trt_engine_cache_path` is a config field; defaults to a per-flight subdirectory of `config.inference.engine_cache_dir`.
- The `OnnxTrtEpEngineHandle` is opaque to consumers.
- This task introduces no new third-party dependencies beyond what ORT already requires.

## Risks & Mitigation

**Risk 1: ORT TRT EP cache poisoning across precisions**
- *Risk*: The TRT EP cache is keyed by ORT-internal hashes, not by the D-C10-7 schema. Reusing a stale cache could load a wrong-precision subgraph.
- *Mitigation*: The EP cache directory is per-flight (under `engine_cache_dir`) and cleaned on flight end by C12 operator tooling. The cache lives only across a single flight; cross-flight reuse is operator-initiated and explicit.

**Risk 2: Provider fallback to CPU EP is silent**
- *Risk*: If both TRT and CUDA EPs refuse to load, ORT falls back to CPU; latency budgets explode and the system "works" at 100× target latency.
- *Mitigation*: The active provider list is logged at INFO on session creation; if CPU is the active provider, an additional WARN log fires (`kind="c7.cpu_fallback"`). The composition root MAY install a hard refusal hook that raises `EngineDeserializeError` on CPU fallback (operator-configurable; default is "warn but allow").
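
A sketch of the mitigation, assuming the caller passes in the provider list that the session reports (with ORT this would come from `session.get_providers()`, which lists the registered providers in priority order). The function name, logger name, message text, the shape of the refusal hook, and the rejection of an empty list are hypothetical; the `kind` value is the one named above.

```python
import logging
from typing import Callable, Optional, Sequence

log = logging.getLogger("c7_inference")  # hypothetical logger name
CPU_EP = "CPUExecutionProvider"


def report_active_providers(
    active: Sequence[str],
    on_cpu_fallback: Optional[Callable[[], None]] = None,
) -> str:
    """Log the providers in use and flag a session whose head provider is the CPU EP."""
    if not active:
        raise ValueError("expected at least one active provider")  # assumption
    log.info("ORT active providers: %s", list(active))
    head = active[0]
    if head == CPU_EP:
        log.warning("ORT session is running on the CPU EP", extra={"kind": "c7.cpu_fallback"})
        if on_cpu_fallback is not None:
            on_cpu_fallback()  # the hook may raise to refuse the session
    return head
```

Leaving `on_cpu_fallback` as `None` gives the default "warn but allow" behaviour; a composition root that wants a hard refusal passes a hook that raises `EngineDeserializeError`.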

**Risk 3: ORT version drift breaks TRT EP option keys**
- *Risk*: ORT 1.16 → 1.18 changed some TRT EP option key names; a future upgrade silently regresses.
- *Mitigation*: The TRT EP provider option dict is built behind a single helper function; the helper has a unit test that pins the option-key names. Upgrade-time changes fail the unit test before merge.
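
A sketch of the single helper and its pinning test. The function names and the `/tmp/ort_trt_cache` path are hypothetical; the three option keys and the provider order are the ones listed in this task. ORT accepts a providers list that mixes plain names with `(name, options)` tuples, which is the shape built here.

```python
from typing import Any, Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, Any]]]

# The key names this task depends on; editing the helper must update this set too.
PINNED_TRT_EP_KEYS = frozenset(
    {"trt_engine_cache_enable", "trt_engine_cache_path", "trt_max_workspace_size"}
)


def build_trt_ep_options(cache_dir: str, gpu_memory_budget_bytes: int) -> Dict[str, Any]:
    """Single place where TRT EP option keys are spelled out."""
    return {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
    }


def build_providers(cache_dir: str, gpu_memory_budget_bytes: int) -> List[Provider]:
    """Fixed order: TRT EP first, then CUDA, then CPU."""
    return [
        ("TensorrtExecutionProvider", build_trt_ep_options(cache_dir, gpu_memory_budget_bytes)),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]


def test_trt_ep_option_keys_are_pinned() -> None:
    options = build_trt_ep_options("/tmp/ort_trt_cache", 4 * 1024 ** 3)
    assert set(options) == PINNED_TRT_EP_KEYS
```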

## Runtime Completeness

- **Named capability**: ONNX Runtime + TensorRT execution-provider fallback path (architecture / E-C7 / C7-IT-05).
- **Production code that must exist**: real `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol; real ORT `InferenceSession` creation with the TRT-EP-led provider list; real ORT `session.run` on the F3 hot path; real fallback WARN + GCS alert wiring.
- **Allowed external stubs**: tests MAY substitute a recording wrapper around `session.run` to verify call sequence (AC-4); production wiring uses real ORT.
- **Unacceptable substitutes**: a stub that just defers to `TensorrtRuntime` (would defeat the fallback's purpose), an ORT session without the TRT EP at the head of the provider list (would silently use CUDA EP and break C7-IT-05's design intent), or hardcoded CPU-EP-only mode (would silently meet the test outcome with absurd latency).