# Batch 32 / Cycle 1: Implementation Report

**Date**: 2026-05-12
**Tasks**: AZ-299 (C7 OnnxTrtEpRuntime: ONNX Runtime + TensorRT EP fallback strategy + per-flight ORT TRT subgraph cache + one-shot fallback alert + CPU-fallback gate)
**Story points landed**: 3
**Status**: complete (AZ-299 → In Testing)

## Scope summary

Single-task batch landing the fallback `InferenceRuntime` strategy for C7. `OnnxTrtEpRuntime` owns the ONNX Runtime + TensorRT EP path that satisfies C7-IT-05: when the TRT-direct strategy (AZ-298) cannot deserialise the cached engine for a given model, or when the operator explicitly selects ORT, the system stays in the air at degraded latency rather than dropping the request. The runtime conforms to the same AZ-297 Protocol as `TensorrtRuntime` and `PytorchFp16Runtime`, so the composition root can wire it as either the primary strategy or as the fallback target.

The fallback semantics required by AC-5 and Risk-2 are captured by two new FDR record kinds (extending AZ-272):

- `c7.fallback_to_onnx_trt_ep`: fired once per session when a runtime constructed with `is_fallback=True` serves its first `infer`. Carries `{model_name, reason, active_provider}`.
- `c7.cpu_fallback`: fired at deserialise time when ORT's provider fallback chain settled on `CPUExecutionProvider` (TRT EP refused AND CUDA EP refused or unavailable). Carries `{model_name, requested_providers, active_provider}`. The composition root can install a hard-refusal hook by setting `config.inference.ort_disallow_cpu_fallback = True`; default remains "warn but allow" so a misconfigured Jetson serves results (slowly) rather than hard-failing the flight.

ORT, NumPy, and `pycuda` (used only for `release_engine` cleanup hints) are **lazy-imported** inside the methods that need them; the module loads cleanly on Tier-0 / macOS dev hosts so the package's protocol-conformance tests stay importable without GPU.
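
The lazy-import arrangement can be pictured with a minimal sketch. It assumes a generic `lazy_import` helper and an `OptionalDependencyError` type, both hypothetical names introduced here for illustration; only `_load_ort` is a name that appears in this report, and its real body may differ.

```python
import importlib
from types import ModuleType


class OptionalDependencyError(ImportError):
    """Hypothetical error type: an optional GPU-side dependency is missing."""


def lazy_import(module_name: str) -> ModuleType:
    """Import a module at call time so the enclosing module loads without it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Preserve the original failure as __cause__ for diagnostics.
        raise OptionalDependencyError(
            f"optional dependency {module_name!r} is not installed"
        ) from exc


def _load_ort() -> ModuleType:
    """Assumed shape of the loader: only touches onnxruntime when called."""
    return lazy_import("onnxruntime")
```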
ORT version is pinned at the project default; this task does NOT introduce any new third-party dependency. The TRT EP cache directory comes from `config.inference.ort_trt_cache_dir` (new field, defaults to `/var/lib/gps-denied/engines/ort_trt_cache`) and is intentionally a sibling of the TRT-direct `engine_cache_dir` so C12 operator tooling can clean both on flight end via a single sweep.

## Files added / modified

### New (production)

- `src/gps_denied_onboard/components/c7_inference/onnx_trt_ep_runtime.py`:
  - `OnnxTrtEpEngineHandle`: opaque, slots, owns the ORT session + cached output names + `model_name` + `_released` flag.
  - `_iso_ts_now`: local helper for FDR timestamps (kept component-local rather than reaching across layering; see Findings #1).
  - `_ort_dtype_to_numpy`: single point that maps the ORT type strings back to NumPy dtypes, isolating the version-fragile mapping for Risk-3.
  - `_build_provider_args`: single place that constructs the TRT EP option dict; pins the option-key names for the Risk-3 unit test.
  - `_stage_engine_for_ort`: symlink-or-copy a cached `.engine` into the ORT TRT EP cache directory at the path ORT expects.
  - `OnnxTrtEpRuntime` class with:
    - `compile_engine`: no-op returning an `EngineCacheEntry` pointing at the source `.onnx`.
    - `deserialize_engine`: gate-first when the entry is a `.engine`, skip-gate when `.onnx`; provider list `[TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider]`; staging the cached engine for the EP; warm-up `session.run` after construction; one-shot `c7.cpu_fallback` alert when the active provider is CPU; honours `ort_disallow_cpu_fallback` by raising `EngineDeserializeError` before any work happens on the CPU path.
    - `infer`: sync `session.run` with named inputs / outputs; first call on `is_fallback=True` runtimes fires exactly one `c7.fallback_to_onnx_trt_ep` WARN log + `gcs_alert` callback; ORT-internal exceptions rewrapped to `InferenceError` with `__cause__` preserved.
    - `release_engine`: idempotent (drops the session reference and marks the handle released; second call is a silent no-op).
    - `thermal_state`: delegation to the injected `ThermalStatePublisher`.
    - `current_runtime_label() -> "onnx_trt_ep"`.

### New (tests)

- `tests/unit/c7_inference/test_onnx_trt_ep_runtime.py`: **NEW** suite covering every AC + the two risks:
  - **AC-1** protocol conformance + label string match.
  - **AC-2** deserialise from `.onnx` does NOT call `EngineGate.validate`; session is built with the TRT EP at the head of the provider list; warm-up `session.run` runs exactly once.
  - **AC-3** deserialise from `.engine` whose filename schema mismatches the host: `EngineGate.validate` raises before any ORT session creation, verified by monkey-patching `_load_ort` to raise `AssertionError` on any call.
  - **AC-4** `infer` round-trips through the fake ORT session with named inputs / outputs; the returned dict matches the Protocol shape. (The "numerical comparison against TRT-direct within FP16 tolerance" half of AC-4 lives in the Tier-2 microbench harness; placeholder skip in the same file.)
  - **AC-5** first `infer` with `is_fallback=True` emits exactly one `c7.fallback_to_onnx_trt_ep` WARN log AND invokes the `gcs_alert` callback once; second `infer` is silent on both channels; `is_fallback=False` never emits.
  - **AC-6** forcing TRT EP to refuse (the fake ORT reports only `CUDAExecutionProvider` and `CPUExecutionProvider` as successfully loaded) creates the session with CUDA EP as the active provider; an INFO log records the actual provider in use; `current_runtime_label()` still returns `"onnx_trt_ep"`.
  - **AC-7** `release_engine` called twice: first drops the session reference and marks released; second is a silent no-op; foreign handle types silently ignored (defensive shim consistent with `TensorrtRuntime`).
  - **AC-8** `_build_provider_args` sets `trt_max_workspace_size` to `gpu_memory_budget_bytes // 4`; the provider option dict contains exactly the keys `{trt_engine_cache_enable, trt_engine_cache_path, trt_max_workspace_size, trt_fp16_enable}` (Risk-3 pin).
  - **Risk-2** CPU fallback emits exactly one `c7.cpu_fallback` FDR record at deserialise time; with `ort_disallow_cpu_fallback=True` the runtime instead raises `EngineDeserializeError` before any session work.
  - **NFR-reliability** ORT-internal `RuntimeError` raised inside `session.run` is rewrapped as `InferenceError` with `__cause__` preserved; foreign handle types and released handles passed to `infer` are likewise surfaced as `InferenceError`.
  - **Tier-2 placeholders**: numerical FP16 comparison against TRT-direct (AC-4 tail), session-creation perf NFR (≤ 30 s p95 first / ≤ 5 s p95 with EP cache hot), and real-EP CPU-fallback under TRT-version-mismatch; all marked `@pytest.mark.tier2` and skipped on Tier-1 / macOS dev.

### Modified (production)

- `src/gps_denied_onboard/components/c7_inference/config.py`: adds `ort_trt_cache_dir: str = "/var/lib/gps-denied/engines/ort_trt_cache"` (validated non-empty in `__post_init__`) and `ort_disallow_cpu_fallback: bool = False` to `C7InferenceConfig`. The CPU-fallback gate intentionally defaults to "warn but allow" to honour the architecture's "keep flying" principle; the operator opts INTO hard-refusal when latency budgets matter more than service continuity.
- `src/gps_denied_onboard/fdr_client/records.py`: adds two new `FdrRecord` kinds (`c7.fallback_to_onnx_trt_ep` and `c7.cpu_fallback`) with their required field sets, following the existing pattern for `c6.write_failed` / `c6.freshness.*`.

### Modified (tests)

- `tests/unit/c7_inference/test_protocol_conformance.py`: the `test_ac5_build_inference_runtime_flag_on_but_module_missing` parametrization previously excluded only `{"pytorch_fp16", "tensorrt"}`; now that `onnx_trt_ep_runtime.py` exists the set is `{"pytorch_fp16", "tensorrt", "onnx_trt_ep"}`.
  The test body and parametrize structure are kept intact so the factory's missing-module branch stays under test for any future strategy whose `BUILD_*` flag is wired in `inference_factory._RUNTIME_TO_MODULE` ahead of its module landing.
- `tests/unit/test_az272_fdr_record_schema.py`: extends the per-kind fixture builder with deterministic payloads for `c7.fallback_to_onnx_trt_ep` and `c7.cpu_fallback` so the AZ-272 roundtrip / schema-version / unknown-kind tests cover the new kinds the same way they cover the C6 kinds.

### Modified (docs)

- `_docs/02_document/module-layout.md`: the `onnx_trt_runtime.py (ONNX Runtime + TensorRT EP, pending)` row in the c7_inference per-component table now reads `onnx_trt_ep_runtime.py (AZ-299; ONNX Runtime + TensorRT EP fallback strategy + per-flight ORT TRT subgraph cache + one-shot fallback WARN/FDR/GCS alert + CPU-fallback gate)`. The filename shift from `onnx_trt_runtime.py` (task spec body) to `onnx_trt_ep_runtime.py` (shipped) follows `inference_factory._RUNTIME_TO_MODULE`, which is the authoritative factory wiring; the task spec's "Outcome" body had a typo that contradicted its own "label" wording (`"onnx_trt_ep"`). The factory wins.
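
As an illustration of the provider-argument construction pinned by AC-8 and Risk-3 earlier in this section, the following is a minimal sketch, not the shipped `_build_provider_args`. The function signature, the module-level constants, and the `True` values for the two boolean options are assumptions; the four option keys, the provider order, and the `gpu_memory_budget_bytes // 4` workspace rule come from this report.

```python
from __future__ import annotations

from typing import Any

TRT_EP = "TensorrtExecutionProvider"
CUDA_EP = "CUDAExecutionProvider"
CPU_EP = "CPUExecutionProvider"


def build_provider_args(cache_dir: str, gpu_memory_budget_bytes: int) -> list[Any]:
    """Return a provider list with the TRT EP first, then CUDA, then CPU."""
    trt_options = {
        "trt_engine_cache_enable": True,  # assumed value; the report pins only the key
        "trt_engine_cache_path": cache_dir,
        # A quarter of the GPU memory budget, per AC-8.
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
        "trt_fp16_enable": True,  # assumed value; the report pins only the key
    }
    # Entries are either a provider name or a (name, options) tuple.
    return [(TRT_EP, trt_options), CUDA_EP, CPU_EP]
```

The returned list has the shape ONNX Runtime accepts for the `providers` argument of `InferenceSession`: provider names or `(name, options)` tuples in priority order.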
## Acceptance criteria coverage

| AC | Test | Status |
|----|------|--------|
| AC-1 Protocol conformance + label | `test_ac1_protocol_conformance` | passing |
| AC-2 Deserialise from `.onnx` skips the gate | `test_ac2_deserialize_from_onnx_skips_gate` | passing |
| AC-3 Deserialise from `.engine` invokes the gate | `test_ac3_deserialize_from_engine_invokes_gate_and_skips_session_on_refusal` | passing |
| AC-4 `infer` round-trips through ORT (named outputs) | `test_ac4_infer_round_trips_named_outputs` (Tier-1) + Tier-2 numerical FP16 comparison placeholder | passing / Tier-2 skipped |
| AC-5 Fallback WARN log fires once on first infer | `test_ac5_first_infer_with_is_fallback_emits_warn_and_alert_once` + `test_ac5_not_fallback_never_emits` | passing |
| AC-6 Provider fallback chain respects ORT order | `test_ac6_trt_ep_refused_falls_through_to_cuda_ep` | passing |
| AC-7 `release_engine` idempotent | `test_ac7_release_is_idempotent` + `test_release_engine_ignores_foreign_handle_type` | passing |
| AC-8 Workspace budget respected | `test_ac8_provider_options_pin_keys_and_budget_quarter` | passing |
| Risk-2 CPU fallback signalled | `test_risk2_cpu_fallback_emits_fdr_kind` + `test_risk2_cpu_fallback_with_disallow_raises` | passing |
| Risk-3 TRT EP option-key pin | `test_ac8_provider_options_pin_keys_and_budget_quarter` (shared) | passing |
| NFR-perf-session-create p95 ≤ 30 s / ≤ 5 s cache hot | `test_nfr_perf_session_create_first_under_30s_cache_hot_under_5s` (Tier-2 microbench) | Tier-2 skipped |
| NFR-reliability-error-rewrap | `test_nfr_reliability_infer_rewraps_runtime_error` + `test_infer_rejects_foreign_handle` + `test_infer_rejects_released_handle` | passing |

## AC Test Coverage: 8 of 8 covered (+ 2 risks + 2 NFRs)

## Code Review Verdict: PASS_WITH_WARNINGS (2 Low accepted; see Findings)

## Auto-Fix Attempts: 0

## Stuck Agents: None

## Findings (self-review)

| # | Severity | Category | Location | Note | Resolution |
|---|----------|----------|----------|------|------------|
| 1 | Low | Maintainability | `onnx_trt_ep_runtime.py::_iso_ts_now` | Duplicated from the equivalent helper in `tensorrt_runtime.py` / `fdr_client`. Consolidating into a shared helper would either inflate `fdr_client/records.py` (which is the lowest-layer module the c7 strategies depend on) or carve out a new shared utility module just for one one-liner. Kept component-local; a later hygiene pass can extract the helper alongside the existing shared `_types/` move when more components grow ISO-timestamp call sites. | Open (Low): accepted; the c7 layering rule wins. |
| 2 | Low | Test-quality | `test_ac4_infer_round_trips_named_outputs` | Uses a `_FakeOrtSession` whose `run(...)` returns canned arrays in the declared output order. The named-output mapping assertion is verified at the Protocol layer; the *numerical* FP16 comparison against TRT-direct lives in the Tier-2 microbench harness. | Open (Low): Tier-2 placeholder owns the numerical half. |
| 3 | Low | Architecture | `onnx_trt_ep_runtime.py::_stage_engine_for_ort` | Attempts symlink first, falls back to copy on `OSError` (e.g., crossing a filesystem boundary, or running on a host that disallows symlinks for the running user). The copy path leaves a stale binary in the EP cache directory if the staging fails partway; C12's per-flight cache cleanup handles this, since a torn copy on disk is no worse than a stale subgraph. | Open (Low): accepted as documented; C12 owns cleanup. |
| 4 | Low | Test-coverage | AC-3 schema-mismatch path | The test patches `EngineGate.validate` to raise `EngineSchemaMismatchError`; the real gate's filename-schema parser is exercised by AZ-281 / AZ-301 tests. Wiring this runtime to a real (live) gate would duplicate that coverage at the wrong layer. | Open (Low): accepted; AZ-301 owns the parser. |

## Tracker

- AZ-299 transitioned to **In Progress** at session start; will move to **In Testing** post-commit per `protocols.md`.
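
The symlink-then-copy behaviour noted in Finding 3 can be sketched as follows. This is an illustrative approximation, not the shipped `_stage_engine_for_ort`: the function name, its signature, and the choice to reuse the source filename as the staged name are assumptions (the report only says the file is staged at the path ORT expects).

```python
import os
import shutil
from pathlib import Path


def stage_engine_for_ort(engine_path: Path, ort_cache_dir: Path) -> Path:
    """Expose a cached engine inside the ORT TRT EP cache directory."""
    ort_cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the staged file keeps the source filename.
    target = ort_cache_dir / engine_path.name

    # Clear any earlier staging result so repeated calls behave the same.
    if target.is_symlink() or target.exists():
        target.unlink()

    try:
        # Cheap path: no bytes are duplicated.
        os.symlink(engine_path.resolve(), target)
    except OSError:
        # Fallback path: the host refused the symlink, so copy the file.
        # As the finding notes, an interrupted copy can leave a partial file.
        shutil.copy2(engine_path, target)
    return target
```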
## Test suite

- `tests/unit/c7_inference/test_onnx_trt_ep_runtime.py`: all active tests passing, Tier-2 placeholders skipped on macOS dev (no ORT/CUDA binding).
- `tests/unit/c7_inference/` (full c7 suite): 139 passing, 17 skipped (CUDA / TensorRT / ORT unavailable on Tier-1 / macOS).
- `tests/unit/test_az272_fdr_record_schema.py`: 34 passing (the two new C7 kinds now covered by every roundtrip / schema-version test).
- Combined unit suite excluding pending components (c1, c2, c2.5, c3, c3.5, c4, c5, c8, c10, c11, c12) and the c6 collection blocker on this host (missing `psycopg_pool` is a known dev-machine env issue, pre-existing): 529 passing, 19 environment-skipped, 1 warning (pre-existing `pynvml` FutureWarning unrelated to AZ-299).

## Next batch

Cycle 1 advances per the greenfield queue; autodev re-detects the next AZ ticket in the Step 7 batch loop. With C7's three concrete strategies now landed (AZ-298 / AZ-299 / AZ-300), the remaining C7 work is `AZ-301 c7_engine_gate` (already in `done/`) + `AZ-302 c7_thermal_publisher` (already in `done/`); the next ticket in dependency order is the first item in the queue that doesn't depend on a pending earlier task. Autodev will compute that during the next sub-step.