Land the fallback InferenceRuntime strategy that satisfies C7-IT-05: when the TRT-direct path (AZ-298) cannot deserialise a cached engine or when the operator explicitly selects ORT, the system stays in the air at degraded latency rather than dropping the request. Conforms to the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep".
Production
- onnx_trt_ep_runtime.py: compile_engine is a no-op returning an EngineCacheEntry pointing at the source .onnx; deserialize_engine is gate-first for .engine entries and gate-skip for .onnx, builds an ORT InferenceSession with the provider list [TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider], stages cached engines into the ORT TRT EP cache directory via symlink-or-copy, warms up with one session.run after construction, and honours config.inference.ort_disallow_cpu_fallback by raising EngineDeserializeError when the active provider resolves to CPU; infer emits a one-shot c7.fallback_to_onnx_trt_ep WARN log plus gcs_alert callback on first call when is_fallback=True; release_engine is idempotent. _build_provider_args is the single point that pins TRT EP option-key names (Risk-3) and caps trt_max_workspace_size at gpu_memory_budget_bytes // 4 (AC-8).
- config.py: adds ort_trt_cache_dir (validated non-empty) and ort_disallow_cpu_fallback to C7InferenceConfig.
- fdr_client/records.py: adds c7.fallback_to_onnx_trt_ep and c7.cpu_fallback FDR record kinds.
Tests
- test_onnx_trt_ep_runtime.py: covers AC-1..AC-8 + Risk-2 CPU-fallback alert + Risk-3 option-key pin + NFR-reliability error rewrap; Tier-1 via fake ORT session; Tier-2 placeholders skip on macOS dev for numerical FP16 comparison and session-creation perf NFR.
- test_protocol_conformance.py: drops onnx_trt_ep from the missing-module parametrize now that the module ships.
- test_az272_fdr_record_schema.py: extends per-kind fixture builder to cover the two new C7 FDR kinds in the roundtrip / schema-version AC tests.
Docs
- module-layout.md: replaces the pending onnx_trt_runtime row with the shipped onnx_trt_ep_runtime row + capabilities list.
- batch_32_cycle1_report.md + reviews/batch_32_review.md: full batch + self-review (PASS_WITH_WARNINGS, 4 Low findings accepted).
Tests run: c7_inference 139 passing + 17 Tier-2 skips; combined unit suite (excluding pending components) 529 passing, 19 env-skipped.
Co-authored-by: Cursor <cursoragent@cursor.com>
C7 OnnxTrtEpRuntime — ONNX Runtime + TensorRT EP Fallback
Task: AZ-299_c7_onnxrt_fallback
Name: C7 OnnxTrtEpRuntime
Description: Implement OnnxTrtEpRuntime, the fallback InferenceRuntime strategy that uses ONNX Runtime with the TensorRT execution provider. Triggered by config selection or by operator escalation when the TRT-direct path (AZ-298) cannot deserialise the cached engine for a given model — produces correct results with a degraded-latency WARN log (covers C7-IT-05). Conforms to the same AZ-297 Protocol so the composition root can select either strategy at startup.
Complexity: 3 points
Dependencies: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
Component: c7_inference (epic AZ-249 / E-C7)
Tracker: AZ-299
Epic: AZ-249 (E-C7)
Document Dependencies
- _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md: the Protocol this task implements; produced by AZ-297.
- _docs/02_document/contracts/shared_helpers/engine_filename_schema.md: used at deserialise time when reusing a cached .engine via ORT TRT EP.
- _docs/02_document/contracts/shared_helpers/sha256_sidecar.md: sidecar trust check via the gate.
- _docs/02_document/contracts/shared_config/composition_root_protocol.md: Config object that provides ORT cache dir + ONNX model paths.
Problem
Two scenarios force a fallback off TensorrtRuntime:
- The cached .engine for a given model fails to deserialise on this Jetson (filename-schema mismatch outside the gate's tolerance, an op TRT 10.3 dropped between calibrator runs, or operator-driven engine rename); the system must keep flying without dropping the request.
- The operator deliberately selects ORT + TRT EP via config (e.g., during a debugging session, or on a Tier-1 workstation where TRT-direct hard-fails).
Without OnnxTrtEpRuntime:
- C7-IT-05 (ONNX-RT fallback when TRT engine unavailable) has nothing to verify.
- The "simple-baseline" engineering rule that every fancy strategy must have a working fallback is violated.
- The ADR-001 "runtime selectable at startup" promise has only the production strategy and the PyTorch baseline; ORT is the middle ground users expect.
Outcome
- An OnnxTrtEpRuntime class at src/gps_denied_onboard/components/c7_inference/onnx_trt_runtime.py conforming to the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep".
- compile_engine is a no-op for ORT: it returns an EngineCacheEntry whose engine_path is the underlying .onnx file path. ORT will lazy-compile a TRT subgraph in-session on first use; the EP cache directory holds those subgraph caches transparently.
- deserialize_engine(EngineCacheEntry) -> EngineHandle: if the entry's engine_path is a .engine file, invoke AZ-301 EngineGate first (cached engine reuse via ORT TRT EP cache); if it is a .onnx, skip the gate and load the ONNX directly. In either case, build an InferenceSession with the TRT EP provider list ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"] (provider fallback chain) and warm up by running one zero-input through the session.
- infer(handle, inputs) -> outputs calls session.run(output_names, inputs); the call is sync; ORT manages its own CUDA stream internally.
- release_engine(handle) calls session.end_profiling() and drops the session reference; ORT releases EP resources on garbage collection.
- A degraded-latency WARN log (kind="c7.fallback_to_onnx_trt_ep") is emitted ONCE on first infer call when the runtime was selected as a fallback (signalled by the composition root via a constructor flag); the operator post-flight FDR shows the fallback was used.
- thermal_state() is delegated to the same ThermalStatePublisher reference used by TensorrtRuntime (AZ-302 owns the publisher).
- ORT version pin matches the project's requirements.txt / equivalent; this task does NOT introduce a new ORT version.
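The gate-first / gate-skip rule for deserialize_engine can be sketched as below. This is a minimal illustration, not the shipped code: EngineCacheEntry and EngineSchemaMismatchError are local stand-ins for the real contract types, gate_validate and create_session are hypothetical injected callables, and rejecting any other suffix with ValueError is an assumption the spec does not state.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable


@dataclass(frozen=True)
class EngineCacheEntry:
    """Stand-in for the AZ-297 entry type; only engine_path is modelled."""
    engine_path: Path


class EngineSchemaMismatchError(Exception):
    """Stand-in for the gate error named in AC-3."""


def deserialize_engine(
    entry: EngineCacheEntry,
    gate_validate: Callable[[EngineCacheEntry], None],
    create_session: Callable[[Path], Any],
) -> Any:
    suffix = entry.engine_path.suffix
    if suffix == ".engine":
        # Gate-first: a refusal propagates and no session is ever built.
        gate_validate(entry)
    elif suffix != ".onnx":
        # Assumption: unknown suffixes are rejected rather than guessed at.
        raise ValueError(f"unsupported engine_path suffix: {suffix!r}")
    # For .onnx there is no engine binary to validate, so the gate is skipped.
    return create_session(entry.engine_path)
```

Injecting the gate and the session factory keeps the ordering testable on a machine without ORT or a GPU.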
Scope
Included
- OnnxTrtEpRuntime class implementing the AZ-297 Protocol.
- compile_engine: returns an EngineCacheEntry pointing at the source .onnx file (no separate engine binary). The .onnx's sha256 is computed via AZ-280 and stamped into the entry; the (SM, JP, TRT, precision) tuple is set to the host's running tuple (since ORT will lazy-compile per host).
- deserialize_engine: provider list construction (["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]), provider options (trt_engine_cache_enable=True, trt_engine_cache_path=config.inference.ort_trt_cache_dir, trt_max_workspace_size=config.inference.gpu_memory_budget_bytes // 4), session creation, single-shot warm-up. Returns OnnxTrtEpEngineHandle (EngineHandle subclass) wrapping the session.
- Engine-cache reuse path: if EngineCacheEntry.engine_path.suffix == ".engine", invoke EngineGate.validate(entry) first; the engine binary is then placed at the trt_engine_cache_path so ORT's TRT EP picks it up on session creation. If gate refuses, raise the gate's error (no fallback to ONNX direct; the operator must explicitly switch to a different cache or restart with a different config).
- ONNX direct path: if engine_path.suffix == ".onnx", skip the gate (no engine to validate); ORT compiles the TRT subgraph in-session.
- infer: session.run(output_names, {name: ndarray for name, ndarray in inputs.items()}); output is a dict matching the Protocol shape. Per-frame DEBUG log gated by config.
- release_engine: idempotent session drop.
- Constructor flag is_fallback: bool = False set by the composition root when this strategy is wired as the fallback (vs. selected directly). On first infer, if is_fallback, emit the kind="c7.fallback_to_onnx_trt_ep" WARN log + an FDR record (via gcs_alert in the AZ-291 sense; the actual GCS wire is C8's job, and this task calls a constructor-injected callback).
- Error envelope: EngineBuildError (ONNX validation failure), EngineDeserializeError (session creation failure), InferenceError (mid-flight ORT runtime exception), OutOfMemoryError (mid-session OOM), the gate's errors when applicable.
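As a rough sketch of the provider list and option construction listed above: the function name build_provider_args and its two parameters are illustrative assumptions, while the provider names and the three TRT EP option keys are the ones this spec names. ORT accepts providers either as plain names or as (name, options) pairs, which is the form used here.

```python
from typing import Any, Dict, List, Tuple

ProviderArg = Tuple[str, Dict[str, Any]]


def build_provider_args(cache_dir: str, gpu_memory_budget_bytes: int) -> List[ProviderArg]:
    """Build the fixed-order provider list with TRT EP options attached."""
    trt_options: Dict[str, Any] = {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
        # Workspace is capped at a quarter of the GPU memory budget.
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
    }
    return [
        ("TensorrtExecutionProvider", trt_options),
        ("CUDAExecutionProvider", {}),
        ("CPUExecutionProvider", {}),
    ]
```

The result would be passed as the providers argument of onnxruntime.InferenceSession. With a budget of 4 * 1024**3 bytes (reading "4 GB" as binary gigabytes, which is an assumption), the cap works out to 1024**3 bytes.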
Excluded
- AZ-298 TensorrtRuntime (production-default) — separate task.
- AZ-300 PytorchFp16Runtime — separate task.
- AZ-301 EngineGate validation logic — this task INVOKES the gate.
- AZ-302 ThermalState polling — this task delegates.
- ORT version upgrade or new ORT dependency — pinned to the project's existing version.
- Custom ORT EPs (e.g., TVM, OpenVINO) — out of scope this cycle.
- Multi-session pooling: one session per EngineHandle this cycle.
- Engine cache directory cleanup: operator-initiated; handled by C12 operator tooling.
Acceptance Criteria
AC-1: Protocol conformance
Given runtime_checkable(InferenceRuntime)
When isinstance(OnnxTrtEpRuntime(...), InferenceRuntime) is evaluated
Then result is True; current_runtime_label() == "onnx_trt_ep"
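A hedged sketch of what the AC-1 check exercises. The method names are taken from this spec, but every signature below is an assumption, since the real Protocol lives in the AZ-297 contract. Note that isinstance against a runtime_checkable Protocol only verifies that the members exist; it does not check signatures or behaviour.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class InferenceRuntime(Protocol):
    """Illustrative stand-in for the AZ-297 Protocol."""

    def compile_engine(self, model_path: Any) -> Any: ...
    def deserialize_engine(self, entry: Any) -> Any: ...
    def infer(self, handle: Any, inputs: Dict[str, Any]) -> Dict[str, Any]: ...
    def release_engine(self, handle: Any) -> None: ...
    def current_runtime_label(self) -> str: ...
    def thermal_state(self) -> Any: ...


class OnnxTrtEpRuntime:
    """Skeleton only: bodies are placeholders, not the shipped implementation."""

    def compile_engine(self, model_path: Any) -> Any:
        raise NotImplementedError

    def deserialize_engine(self, entry: Any) -> Any:
        raise NotImplementedError

    def infer(self, handle: Any, inputs: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError

    def release_engine(self, handle: Any) -> None:
        raise NotImplementedError

    def current_runtime_label(self) -> str:
        # The label names the strategy, not the active execution provider.
        return "onnx_trt_ep"

    def thermal_state(self) -> Any:
        raise NotImplementedError
```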
AC-2: deserialize from .onnx skips the gate
Given an EngineCacheEntry(engine_path=<onnx_path>)
When deserialize_engine(entry) is called
Then EngineGate.validate is NOT called; an ORT InferenceSession is created with the TRT EP at the head of the provider list; a single warm-up session.run succeeds
AC-3: deserialize from .engine invokes the gate
Given an EngineCacheEntry(engine_path=<engine_path>) whose filename schema mismatches the host
When deserialize_engine(entry) is called
Then EngineGate.validate is invoked first and raises EngineSchemaMismatchError; no InferenceSession is created
AC-4: infer round-trips through ORT and returns named outputs
Given a deserialised handle for an UltraVPR-shaped model and a fixed input dict
When infer(handle, inputs) is called
Then session.run is invoked with the model's declared output names; the returned dict matches the Protocol shape; numerical outputs match the TRT-direct strategy within a documented tolerance (FP16 round-trip)
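One way the tolerance comparison in AC-4 could be expressed, as a sketch only. The helper name outputs_match is hypothetical, outputs are modelled as flat float sequences rather than arrays to keep the example dependency-free, and the tolerance values are left as required arguments because the documented tolerance is not given here.

```python
import math
from typing import Dict, Sequence


def outputs_match(
    reference: Dict[str, Sequence[float]],
    candidate: Dict[str, Sequence[float]],
    rel_tol: float,
    abs_tol: float,
) -> bool:
    """True when both dicts have the same output names and close values."""
    if reference.keys() != candidate.keys():
        return False
    for name, ref_values in reference.items():
        got_values = candidate[name]
        if len(ref_values) != len(got_values):
            return False
        for ref, got in zip(ref_values, got_values):
            if not math.isclose(ref, got, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True
```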
AC-5: fallback WARN log fires once on first infer
Given a runtime constructed with is_fallback=True
When infer is called for the first time
Then exactly one kind="c7.fallback_to_onnx_trt_ep" WARN log is emitted AND one gcs_alert(...) callback is invoked; subsequent infer calls do NOT emit the log again
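The once-only behaviour in AC-5 might look like the sketch below. FallbackNotifier, the logger name, the message text, and the single-string gcs_alert signature are all assumptions made for illustration; only the kind value and the fire-once rule come from this spec. The lock is there so that two concurrent first calls cannot both emit.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger("c7_inference")

FALLBACK_KIND = "c7.fallback_to_onnx_trt_ep"


class FallbackNotifier:
    """Emits the fallback WARN log and alert callback at most once."""

    def __init__(self, is_fallback: bool, gcs_alert: Callable[[str], None]) -> None:
        self._pending = is_fallback
        self._gcs_alert = gcs_alert
        self._lock = threading.Lock()

    def notify_once(self) -> bool:
        """Call at the top of infer; returns True only when it fired."""
        with self._lock:
            if not self._pending:
                return False
            self._pending = False
        logger.warning(
            "ORT + TRT EP fallback in use; expect degraded latency",
            extra={"kind": FALLBACK_KIND},
        )
        self._gcs_alert(FALLBACK_KIND)
        return True
```

A runtime constructed with is_fallback=False never fires, so the direct-selection path stays silent.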
AC-6: provider fallback chain respects ORT order
Given an environment where TRT EP refuses to load (e.g., TRT version mismatch outside ORT's tolerance) but CUDA EP is available
When deserialize_engine is called
Then session creation succeeds with CUDAExecutionProvider as the active provider; an INFO log records the actual provider in use; current_runtime_label() STILL returns "onnx_trt_ep" (the runtime label is the strategy, not the EP)
AC-7: release_engine drops the session and is idempotent
Given a deserialised handle
When release_engine(handle) is called once and then again
Then the first call drops the session reference and the GPU memory ORT held is released; the second call returns silently
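A minimal sketch of the idempotent drop in AC-7. The handle is reduced to a single session attribute, which is an assumption about its shape; the end_profiling call follows the Outcome section, and whether that call is appropriate when profiling was never enabled is not verified here. The boolean return value is added purely for illustration.

```python
from typing import Any, Optional


class OnnxTrtEpEngineHandle:
    """Stand-in handle: wraps exactly one session, or None once released."""

    def __init__(self, session: Any) -> None:
        self.session: Optional[Any] = session


def release_engine(handle: OnnxTrtEpEngineHandle) -> bool:
    """Drop the session; returns True if this call did the release."""
    session = handle.session
    if session is None:
        # Second and later calls return silently.
        return False
    session.end_profiling()
    # Dropping the last reference lets ORT free EP resources on collection.
    handle.session = None
    return True
```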
AC-8: workspace budget is respected
Given config.inference.gpu_memory_budget_bytes = 4 GB
When deserialize_engine is called and ORT's TRT EP attempts to allocate workspace
Then the workspace size is capped at gpu_memory_budget_bytes // 4 = 1 GB per the provider option; an attempt to exceed this raises OutOfMemoryError
Non-Functional Requirements
Performance
- Per-call latency budget is the same as TRT (C7-PT-01) but the test fixture allows up to 1.5× slack (the WARN log notes "degraded latency" — the operator has been informed). Failure thresholds match TRT's failure thresholds (100 / 60 / 150 / 90 ms).
- Session creation budget: ≤ 30 s p95 on Tier-2 for the first deserialise (ORT lazy-compile dominates the first call); subsequent deserialises that hit the EP cache should be ≤ 5 s p95.
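With only a handful of timing samples, the reported p95 depends on which percentile convention is used. The sketch below uses the nearest-rank convention, which is an assumption since no convention is named here; under it, the p95 of five samples is simply the slowest sample.

```python
import math
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], percentile: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(p/100 * n), 1-indexed."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]
```

For example, session-creation times of 12, 31, 9, 14 and 11 seconds give a nearest-rank p95 of 31 s, which would miss a 30 s budget even though four of the five runs were well inside it.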
Compatibility
- ORT version pinned to the project's requirements.txt (this task does NOT change it).
- TRT EP requires the same TRT 10.3 / CUDA stack as TensorrtRuntime; if absent, CUDAExecutionProvider carries the inference (degraded but functional) per AC-6.
Reliability
- Errors rewrapped into the AZ-297 family.
- ORT-internal exceptions (e.g., onnxruntime.OrtInvalidArgument) are caught and rewrapped as InferenceError.
- Session lifetime is bound to the EngineHandle; the runtime never holds a session beyond an explicit release_engine or process exit.
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Protocol conformance + label | isinstance True; label string match |
| AC-2 | Deserialise from .onnx | Gate NOT called; session created with TRT EP head |
| AC-3 | Deserialise from .engine with mismatched schema | Gate raises; no session |
| AC-4 | Numerical comparison against TRT-direct (UltraVPR sample input) | Outputs match within FP16 tolerance |
| AC-5 | First infer with is_fallback=True | Exactly one WARN log; one gcs_alert; second infer silent |
| AC-6 | Force TRT EP refusal, allow CUDA EP | Session creates with CUDA EP; label still "onnx_trt_ep" |
| AC-7 | release_engine called twice | First drops; second is no-op |
| AC-8 | Workspace cap | Provider option set to budget // 4 |
| NFR-perf-session-create | Microbench session creation × 5 | First p95 ≤ 30 s; subsequent ≤ 5 s |
| NFR-reliability-error-rewrap | Inject ORT internal error | Rewrapped to InferenceError |
Constraints
- ORT version pinned at the project default; no upgrade in this task.
- Provider list order is fixed: ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]. CPU is the bottom-of-list fallback so a misconfigured Jetson still serves results (slowly) instead of hard-failing; the WARN log will scream, but the system stays alive.
- ORT EP's trt_engine_cache_path is a config field; defaults to a per-flight subdirectory of config.inference.engine_cache_dir.
- The OnnxTrtEpEngineHandle is opaque to consumers.
- This task introduces no new third-party dependencies beyond what ORT already requires.
Risks & Mitigation
Risk 1: ORT TRT EP cache poisons across precisions
- Risk: The TRT EP cache is keyed by ORT-internal hashes, not by D-C10-7 schema. Reusing a stale cache could load a wrong-precision subgraph.
- Mitigation: The EP cache directory is per-flight (under engine_cache_dir) and cleaned on flight end by C12 operator tooling. The cache lives only across a single flight; cross-flight reuse is operator-initiated and explicit.
Risk 2: Provider fallback to CPU EP is silent
- Risk: If both TRT and CUDA EPs refuse to load, ORT falls back to CPU; latency budgets explode and the system "works" at 100× target latency.
- Mitigation: The active provider list is logged at INFO on session creation; if CPU is the active provider, an additional WARN log fires (kind="c7.cpu_fallback"). The composition root MAY install a hard refusal hook that raises EngineDeserializeError on CPU fallback (operator-configurable; default is "warn but allow").
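A sketch of the Risk 2 mitigation, under stated assumptions: the provider list is what session.get_providers() would return after session creation, "CPU is the active provider" is read as CPU being the first entry of that list, and the function name, logger name, and message texts are illustrative. EngineDeserializeError is a local stand-in for the AZ-297 error.

```python
import logging
from typing import Sequence

logger = logging.getLogger("c7_inference")


class EngineDeserializeError(RuntimeError):
    """Stand-in for the AZ-297 session-creation error."""


def check_active_provider(active_providers: Sequence[str], disallow_cpu_fallback: bool) -> str:
    """Log the providers in use and optionally refuse a CPU-only session.

    Expects a non-empty list in ORT priority order.
    """
    logger.info("active ORT providers: %s", list(active_providers))
    head = active_providers[0]
    if head != "CPUExecutionProvider":
        return head
    logger.warning(
        "ORT resolved to CPUExecutionProvider; latency budgets will not hold",
        extra={"kind": "c7.cpu_fallback"},
    )
    if disallow_cpu_fallback:
        # Hard-refusal hook: the operator opted out of "warn but allow".
        raise EngineDeserializeError("CPU fallback is disallowed by configuration")
    return head
```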
Risk 3: ORT version drift breaks TRT EP option keys
- Risk: ORT 1.16 → 1.18 changed some TRT EP option key names; a future upgrade silently regresses.
- Mitigation: The TRT EP provider option dict is built behind a single helper function; the helper has a unit test that pins the option-key names. Upgrade-time changes fail the unit test before merge.
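The pinning check itself can be very small. In this sketch the pinned set holds the three option keys this spec names, and unpinned_option_keys is a hypothetical helper that a unit test would call on whatever dict the option-building function returns; a non-empty result means a key was renamed, dropped, or added.

```python
from typing import Any, Dict, FrozenSet, Set

PINNED_TRT_EP_OPTION_KEYS: FrozenSet[str] = frozenset(
    {
        "trt_engine_cache_enable",
        "trt_engine_cache_path",
        "trt_max_workspace_size",
    }
)


def unpinned_option_keys(options: Dict[str, Any]) -> Set[str]:
    """Return keys that are missing from, or unexpected in, the options dict."""
    # Symmetric difference catches both renamed and newly added keys.
    return set(options) ^ PINNED_TRT_EP_OPTION_KEYS
```

A unit test would then assert that the result is empty, so a key rename during an ORT upgrade fails before merge rather than silently at session creation.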
Runtime Completeness
- Named capability: ONNX Runtime + TensorRT execution-provider fallback path (architecture / E-C7 / C7-IT-05).
- Production code that must exist: real OnnxTrtEpRuntime class implementing the AZ-297 Protocol; real ORT InferenceSession creation with the TRT-EP-led provider list; real ORT session.run on the F3 hot path; real fallback WARN + GCS alert wiring.
- Allowed external stubs: tests MAY substitute a recording wrapper around session.run to verify call sequence (AC-4); production wiring uses real ORT.
- Unacceptable substitutes: a stub that just defers to TensorrtRuntime (would defeat the fallback's purpose), an ORT session without the TRT EP at the head of the provider list (would silently use CUDA EP and break C7-IT-05's design intent), or hardcoded CPU-EP-only mode (would silently meet the test outcome with absurd latency).