[AZ-299] C7 OnnxTrtEpRuntime: ORT + TRT EP fallback strategy

Land the fallback InferenceRuntime strategy that satisfies C7-IT-05: when the TRT-direct path (AZ-298) cannot deserialise a cached engine or when the operator explicitly selects ORT, the system stays in the air at degraded latency rather than dropping the request. Conforms to the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep". Production - onnx_trt_ep_runtime.py: compile_engine is a no-op returning an EngineCacheEntry pointing at the source .onnx; deserialize_engine is gate-first for .engine entries and gate-skip for .onnx, builds an ORT InferenceSession with the provider list [TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider], stages cached engines into the ORT TRT EP cache directory via symlink-or-copy, warms up with one session.run after construction, and honours config.inference.ort_disallow_cpu_ fallback by raising EngineDeserializeError when the active provider resolves to CPU; infer emits a one-shot c7.fallback_to_onnx_trt_ep WARN log plus gcs_alert callback on first call when is_fallback= True; release_engine is idempotent. _build_provider_args is the single point that pins TRT EP option-key names (Risk-3) and caps trt_max_workspace_size at gpu_memory_budget_bytes // 4 (AC-8). - config.py: adds ort_trt_cache_dir (validated non-empty) and ort_disallow_cpu_fallback to C7InferenceConfig. - fdr_client/records.py: adds c7.fallback_to_onnx_trt_ep and c7.cpu_fallback FDR record kinds. Tests - test_onnx_trt_ep_runtime.py: covers AC-1..AC-8 + Risk-2 CPU-fallback alert + Risk-3 option-key pin + NFR-reliability error rewrap; Tier-1 via fake ORT session; Tier-2 placeholders skip on macOS dev for numerical FP16 comparison and session-creation perf NFR. - test_protocol_conformance.py: drops onnx_trt_ep from the missing- module parametrize now that the module ships. - test_az272_fdr_record_schema.py: extends per-kind fixture builder to cover the two new C7 FDR kinds in the roundtrip / schema-version AC tests. Docs - module-layout.md: replaces the pending onnx_trt_runtime row with the shipped onnx_trt_ep_runtime row + capabilities list. - batch_32_cycle1_report.md + reviews/batch_32_review.md: full batch + self-review (PASS_WITH_WARNINGS, 4 Low findings accepted). Tests run: c7_inference 139 passing + 17 Tier-2 skips; combined unit suite (excluding pending components) 529 passing, 19 env-skipped. Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 16:01:14 +00:00 · 2026-05-12 23:55:50 +03:00
parent 18a69022b3
commit 0ad3278b12
11 changed files with 1849 additions and 8 deletions
@@ -1,167 +0,0 @@
-# C7 OnnxTrtEpRuntime — ONNX Runtime + TensorRT EP Fallback
-
-**Task**: AZ-299_c7_onnxrt_fallback
-**Name**: C7 OnnxTrtEpRuntime
-**Description**: Implement `OnnxTrtEpRuntime`, the fallback `InferenceRuntime` strategy that uses ONNX Runtime with the TensorRT execution provider. Triggered by config selection or by operator escalation when the TRT-direct path (AZ-298) cannot deserialise the cached engine for a given model — produces correct results with a degraded-latency WARN log (covers C7-IT-05). Conforms to the same AZ-297 Protocol so the composition root can select either strategy at startup.
-**Complexity**: 3 points
-**Dependencies**: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
-**Component**: c7_inference (epic AZ-249 / E-C7)
-**Tracker**: AZ-299
-**Epic**: AZ-249 (E-C7)
-
-### Document Dependencies
-
- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297.
- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — used at deserialise time when reusing a cached `.engine` via ORT TRT EP.
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar trust check via the gate.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that provides ORT cache dir + ONNX model paths.
-
-## Problem
-
-Two scenarios force a fallback off `TensorrtRuntime`:
-
-1. The cached `.engine` for a given model fails to deserialise on this Jetson (filename-schema mismatch outside the gate's tolerance, an op TRT 10.3 dropped between calibrator runs, or operator-driven engine rename) — the system must keep flying without dropping the request.
-2. The operator deliberately selects ORT + TRT EP via config (e.g., during a debugging session, or on a Tier-1 workstation where TRT-direct hard-fails).
-
-Without `OnnxTrtEpRuntime`:
-
- C7-IT-05 (ONNX-RT fallback when TRT engine unavailable) has nothing to verify.
- The "simple-baseline" engineering rule that every fancy strategy must have a working fallback is violated.
- The ADR-001 "runtime selectable at startup" promise has only the production strategy and the PyTorch baseline; ORT is the middle ground users expect.
-
-## Outcome
-
- An `OnnxTrtEpRuntime` class at `src/gps_denied_onboard/components/c7_inference/onnx_trt_runtime.py` conforming to the AZ-297 Protocol; `current_runtime_label() == "onnx_trt_ep"`.
- `compile_engine` is a no-op for ORT — it returns an `EngineCacheEntry` whose `engine_path` is the underlying `.onnx` file path. ORT will lazy-compile a TRT subgraph in-session on first use; the EP cache directory holds those subgraph caches transparently.
- `deserialize_engine(EngineCacheEntry) -> EngineHandle`: if the entry's `engine_path` is a `.engine` file, invoke AZ-301 EngineGate first (cached engine reuse via ORT TRT EP cache); if it is a `.onnx`, skip the gate and load the ONNX directly. In either case, build an `InferenceSession` with the TRT EP provider list `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]` (provider fallback chain) and warm up by running one zero-input through the session.
- `infer(handle, inputs) -> outputs` calls `session.run(output_names, inputs)`; the call is sync; ORT manages its own CUDA stream internally.
- `release_engine(handle)` calls `session.end_profiling()` and drops the session reference; ORT releases EP resources on garbage collection.
- A degraded-latency WARN log (`kind="c7.fallback_to_onnx_trt_ep"`) is emitted ONCE on first `infer` call when the runtime was selected as a fallback (signalled by the composition root via a constructor flag); the operator post-flight FDR shows the fallback was used.
- `thermal_state()` is delegated to the same `ThermalStatePublisher` reference used by `TensorrtRuntime` (AZ-302 owns the publisher).
- ORT version pin matches the project's `requirements.txt` / equivalent; this task does NOT introduce a new ORT version.
-
-## Scope
-
-### Included
-
- `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol.
- `compile_engine`: returns an `EngineCacheEntry` pointing at the source `.onnx` file (no separate engine binary). The `.onnx`'s sha256 is computed via AZ-280 and stamped into the entry; the `(SM, JP, TRT, precision)` tuple is set to the host's running tuple (since ORT will lazy-compile per host).
- `deserialize_engine`: provider list construction (`["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`), provider options (`trt_engine_cache_enable=True`, `trt_engine_cache_path=config.inference.ort_trt_cache_dir`, `trt_max_workspace_size=config.inference.gpu_memory_budget_bytes // 4`), session creation, single-shot warm-up. Returns `OnnxTrtEpEngineHandle` (`EngineHandle` subclass) wrapping the session.
- Engine-cache reuse path: if `EngineCacheEntry.engine_path.suffix == ".engine"`, invoke `EngineGate.validate(entry)` first; the engine binary is then placed at the `trt_engine_cache_path` so ORT's TRT EP picks it up on session creation. If gate refuses, raise the gate's error (no fallback to ONNX direct — the operator must explicitly switch to a different cache or restart with a different config).
- ONNX direct path: if `engine_path.suffix == ".onnx"`, skip the gate (no engine to validate); ORT compiles the TRT subgraph in-session.
- `infer`: `session.run(output_names, {name: ndarray for name, ndarray in inputs.items()})`; output is a dict matching the Protocol shape. Per-frame DEBUG log gated by config.
- `release_engine`: idempotent session drop.
- Constructor flag `is_fallback: bool = False` set by the composition root when this strategy is wired as the fallback (vs. selected directly). On first `infer`, if `is_fallback`, emit the `kind="c7.fallback_to_onnx_trt_ep"` WARN log + an FDR record (via `gcs_alert` in the AZ-291 sense — the actual GCS wire is C8's job; this task calls a constructor-injected callback).
- Error envelope: `EngineBuildError` (ONNX validation failure), `EngineDeserializeError` (session creation failure), `InferenceError` (mid-flight ORT runtime exception), `OutOfMemoryError` (mid-session OOM), the gate's errors when applicable.
-
-### Excluded
-
- AZ-298 TensorrtRuntime (production-default) — separate task.
- AZ-300 PytorchFp16Runtime — separate task.
- AZ-301 EngineGate validation logic — this task INVOKES the gate.
- AZ-302 ThermalState polling — this task delegates.
- ORT version upgrade or new ORT dependency — pinned to the project's existing version.
- Custom ORT EPs (e.g., TVM, OpenVINO) — out of scope this cycle.
- Multi-session pooling — one session per `EngineHandle` this cycle.
- Engine cache directory cleanup — operator-initiated; handled by C12 operator tooling.
-
-## Acceptance Criteria
-
-**AC-1: Protocol conformance**
-Given `runtime_checkable(InferenceRuntime)`
-When `isinstance(OnnxTrtEpRuntime(...), InferenceRuntime)` is evaluated
-Then result is `True`; `current_runtime_label() == "onnx_trt_ep"`
-
-**AC-2: deserialize from `.onnx` skips the gate**
-Given an `EngineCacheEntry(engine_path=<onnx_path>)`
-When `deserialize_engine(entry)` is called
-Then `EngineGate.validate` is NOT called; an ORT `InferenceSession` is created with the TRT EP at the head of the provider list; a single warm-up `session.run` succeeds
-
-**AC-3: deserialize from `.engine` invokes the gate**
-Given an `EngineCacheEntry(engine_path=<engine_path>)` whose filename schema mismatches the host
-When `deserialize_engine(entry)` is called
-Then `EngineGate.validate` is invoked first and raises `EngineSchemaMismatchError`; no `InferenceSession` is created
-
-**AC-4: infer round-trips through ORT and returns named outputs**
-Given a deserialised handle for a UltraVPR-shaped model and a fixed input dict
-When `infer(handle, inputs)` is called
-Then `session.run` is invoked with the model's declared output names; the returned dict matches the Protocol shape; numerical outputs match the TRT-direct strategy within a documented tolerance (FP16 round-trip)
-
-**AC-5: fallback WARN log fires once on first infer**
-Given a runtime constructed with `is_fallback=True`
-When `infer` is called for the first time
-Then exactly one `kind="c7.fallback_to_onnx_trt_ep"` WARN log is emitted AND one `gcs_alert(...)` callback is invoked; subsequent `infer` calls do NOT emit the log again
-
-**AC-6: provider fallback chain respects ORT order**
-Given an environment where TRT EP refuses to load (e.g., TRT version mismatch outside ORT's tolerance) but CUDA EP is available
-When `deserialize_engine` is called
-Then session creation succeeds with `CUDAExecutionProvider` as the active provider; an INFO log records the actual provider in use; `current_runtime_label()` STILL returns `"onnx_trt_ep"` (the runtime label is the strategy, not the EP)
-
-**AC-7: release_engine drops the session and is idempotent**
-Given a deserialised handle
-When `release_engine(handle)` is called once and then again
-Then the first call drops the session reference and the GPU memory ORT held is released; the second call returns silently
-
-**AC-8: workspace budget is respected**
-Given `config.inference.gpu_memory_budget_bytes = 4 GB`
-When `deserialize_engine` is called and ORT's TRT EP attempts to allocate workspace
-Then the workspace size is capped at `gpu_memory_budget_bytes // 4 = 1 GB` per the provider option; an attempt to exceed this raises `OutOfMemoryError`
-
-## Non-Functional Requirements
-
-**Performance**
- Per-call latency budget is the same as TRT (C7-PT-01) but the test fixture allows up to 1.5× slack (the WARN log notes "degraded latency" — the operator has been informed). Failure thresholds match TRT's failure thresholds (100 / 60 / 150 / 90 ms).
- Session creation budget: ≤ 30 s p95 on Tier-2 for the first deserialise (ORT lazy-compile dominates the first call); subsequent deserialises that hit the EP cache should be ≤ 5 s p95.
-
-**Compatibility**
- ORT version pinned to the project's `requirements.txt` (this task does NOT change it).
- TRT EP requires the same TRT 10.3 / CUDA stack as `TensorrtRuntime`; if absent, `CUDAExecutionProvider` carries the inference (degraded but functional) per AC-6.
-
-**Reliability**
- Errors rewrapped into the AZ-297 family.
- ORT-internal exceptions (e.g., `onnxruntime.OrtInvalidArgument`) are caught and rewrapped as `InferenceError`.
- Session lifetime is bound to the `EngineHandle`; the runtime never holds a session beyond an explicit `release_engine` or process exit.
-
-## Unit Tests
-
-| AC Ref | What to Test | Required Outcome |
-|--------|-------------|-----------------|
-| AC-1 | Protocol conformance + label | `isinstance` True; label string match |
-| AC-2 | Deserialise from `.onnx` | Gate NOT called; session created with TRT EP head |
-| AC-3 | Deserialise from `.engine` with mismatched schema | Gate raises; no session |
-| AC-4 | Numerical comparison against TRT-direct (UltraVPR sample input) | Outputs match within FP16 tolerance |
-| AC-5 | First infer with `is_fallback=True` | Exactly one WARN log; one gcs_alert; second infer silent |
-| AC-6 | Force TRT EP refusal, allow CUDA EP | Session creates with CUDA EP; label still `"onnx_trt_ep"` |
-| AC-7 | release_engine called twice | First drops; second is no-op |
-| AC-8 | Workspace cap | Provider option set to budget // 4 |
-| NFR-perf-session-create | Microbench session creation × 5 | First p95 ≤ 30 s; subsequent ≤ 5 s |
-| NFR-reliability-error-rewrap | Inject ORT internal error | Rewrapped to `InferenceError` |
-
-## Constraints
-
- ORT version pinned at the project default; no upgrade in this task.
- Provider list order is fixed: `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`. CPU is the bottom-of-list fallback so a misconfigured Jetson still serves results (slowly) instead of hard-failing — the WARN log will scream, but the system stays alive.
- ORT EP's `trt_engine_cache_path` is a config field; defaults to a per-flight subdirectory of `config.inference.engine_cache_dir`.
- The `OnnxTrtEpEngineHandle` is opaque to consumers.
- This task introduces no new third-party dependencies beyond what ORT already requires.
-
-## Risks & Mitigation
-
-**Risk 1: ORT TRT EP cache poisons across precisions**
- *Risk*: The TRT EP cache is keyed by ORT-internal hashes, not by D-C10-7 schema. Reusing a stale cache could load a wrong-precision subgraph.
- *Mitigation*: The EP cache directory is per-flight (under `engine_cache_dir`) and cleaned on flight end by C12 operator tooling. The cache lives only across a single flight; cross-flight reuse is operator-initiated and explicit.
-
-**Risk 2: Provider fallback to CPU EP is silent**
- *Risk*: If both TRT and CUDA EPs refuse to load, ORT falls back to CPU; latency budgets explode and the system "works" at 100× target latency.
- *Mitigation*: The active provider list is logged at INFO on session creation; if CPU is the active provider, an additional WARN log fires (`kind="c7.cpu_fallback"`). The composition root MAY install a hard refusal hook that raises `EngineDeserializeError` on CPU fallback (operator-configurable; default is "warn but allow").
-
-**Risk 3: ORT version drift breaks TRT EP option keys**
- *Risk*: ORT 1.16 → 1.18 changed some TRT EP option key names; a future upgrade silently regresses.
- *Mitigation*: The TRT EP provider option dict is built behind a single helper function; the helper has a unit test that pins the option-key names. Upgrade-time changes fail the unit test before merge.
-
-## Runtime Completeness
-
- **Named capability**: ONNX Runtime + TensorRT execution-provider fallback path (architecture / E-C7 / C7-IT-05).
- **Production code that must exist**: real `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol; real ORT `InferenceSession` creation with the TRT-EP-led provider list; real ORT `session.run` on the F3 hot path; real fallback WARN + GCS alert wiring.
- **Allowed external stubs**: tests MAY substitute a recording wrapper around `session.run` to verify call sequence (AC-4); production wiring uses real ORT.
- **Unacceptable substitutes**: a stub that just defers to `TensorrtRuntime` (would defeat the fallback's purpose), an ORT session without the TRT EP at the head of the provider list (would silently use CUDA EP and break C7-IT-05's design intent), or hardcoded CPU-EP-only mode (would silently meet the test outcome with absurd latency).