gps-denied-onboard/_docs/02_tasks/todo/AZ-299_c7_onnxrt_fallback.md
Oleksandr Bezdieniezhnykh 880eabcb3f Decompose Step 6 snapshot: 140 task specs + contract docs
Closes out greenfield Step 6 (Decompose) for all 14 components
(C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446
plus the _dependencies_table.md and component contract documents.

State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 00:39:48 +03:00

C7 OnnxTrtEpRuntime — ONNX Runtime + TensorRT EP Fallback

Task: AZ-299_c7_onnxrt_fallback
Name: C7 OnnxTrtEpRuntime
Description: Implement OnnxTrtEpRuntime, the fallback InferenceRuntime strategy that uses ONNX Runtime with the TensorRT execution provider. Triggered by config selection or by operator escalation when the TRT-direct path (AZ-298) cannot deserialise the cached engine for a given model; it produces correct results with a degraded-latency WARN log (covers C7-IT-05). Conforms to the same AZ-297 Protocol so the composition root can select either strategy at startup.
Complexity: 3 points
Dependencies: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
Component: c7_inference (epic AZ-249 / E-C7)
Tracker: AZ-299
Epic: AZ-249 (E-C7)

Document Dependencies

  • _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md — the Protocol this task implements; produced by AZ-297.
  • _docs/02_document/contracts/shared_helpers/engine_filename_schema.md — used at deserialise time when reusing a cached .engine via ORT TRT EP.
  • _docs/02_document/contracts/shared_helpers/sha256_sidecar.md — sidecar trust check via the gate.
  • _docs/02_document/contracts/shared_config/composition_root_protocol.md — Config object that provides ORT cache dir + ONNX model paths.

Problem

Two scenarios force a fallback off TensorrtRuntime:

  1. The cached .engine for a given model fails to deserialise on this Jetson (filename-schema mismatch outside the gate's tolerance, an op TRT 10.3 dropped between calibrator runs, or operator-driven engine rename) — the system must keep flying without dropping the request.
  2. The operator deliberately selects ORT + TRT EP via config (e.g., during a debugging session, or on a Tier-1 workstation where TRT-direct hard-fails).

Without OnnxTrtEpRuntime:

  • C7-IT-05 (ONNX-RT fallback when TRT engine unavailable) has nothing to verify.
  • The "simple-baseline" engineering rule that every fancy strategy must have a working fallback is violated.
  • The ADR-001 "runtime selectable at startup" promise has only the production strategy and the PyTorch baseline; ORT is the middle ground users expect.

Outcome

  • An OnnxTrtEpRuntime class at src/gps_denied_onboard/components/c7_inference/onnx_trt_runtime.py conforming to the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep".
  • compile_engine is a no-op for ORT — it returns an EngineCacheEntry whose engine_path is the underlying .onnx file path. ORT will lazy-compile a TRT subgraph in-session on first use; the EP cache directory holds those subgraph caches transparently.
  • deserialize_engine(EngineCacheEntry) -> EngineHandle: if the entry's engine_path is a .engine file, invoke AZ-301 EngineGate first (cached engine reuse via ORT TRT EP cache); if it is a .onnx, skip the gate and load the ONNX directly. In either case, build an InferenceSession with the provider list ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"] (the provider fallback chain, TRT EP first) and warm up by running one zero-filled input through the session.
  • infer(handle, inputs) -> outputs calls session.run(output_names, inputs); the call is sync; ORT manages its own CUDA stream internally.
  • release_engine(handle) calls session.end_profiling() and drops the session reference; ORT releases EP resources on garbage collection.
  • A degraded-latency WARN log (kind="c7.fallback_to_onnx_trt_ep") is emitted ONCE, on the first infer call, when the runtime was selected as a fallback (signalled by the composition root via a constructor flag), so that the post-flight FDR shows the operator that the fallback was used.
  • thermal_state() is delegated to the same ThermalStatePublisher reference used by TensorrtRuntime (AZ-302 owns the publisher).
  • ORT version pin matches the project's requirements.txt / equivalent; this task does NOT introduce a new ORT version.
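The gate-or-skip decision in deserialize_engine can be sketched as below. This is a minimal illustration, not the project's implementation: EngineCacheEntry here is a hypothetical one-field stand-in for the AZ-297 type, and gate_validate and create_session are assumed injection points standing in for EngineGate.validate and the ORT session factory.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable


@dataclass(frozen=True)
class EngineCacheEntry:
    """Hypothetical minimal stand-in; the real type comes from the AZ-297 contract."""

    engine_path: Path


def deserialize_with_gate(
    entry: EngineCacheEntry,
    gate_validate: Callable[[EngineCacheEntry], None],
    create_session: Callable[[EngineCacheEntry], Any],
) -> Any:
    """Run the gate only for cached .engine files, then create the session."""
    suffix = entry.engine_path.suffix
    if suffix == ".engine":
        # The gate raises on refusal, so no session is created in that case.
        gate_validate(entry)
    elif suffix != ".onnx":
        # Assumption: any other suffix is treated as a caller error.
        raise ValueError(f"unsupported engine_path suffix: {suffix!r}")
    return create_session(entry)
```

Injecting the gate and the session factory keeps the branch testable without a GPU, which is roughly what AC-2 and AC-3 need.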

Scope

Included

  • OnnxTrtEpRuntime class implementing the AZ-297 Protocol.
  • compile_engine: returns an EngineCacheEntry pointing at the source .onnx file (no separate engine binary). The .onnx's sha256 is computed via AZ-280 and stamped into the entry; the (SM, JP, TRT, precision) tuple is set to the host's running tuple (since ORT will lazy-compile per host).
  • deserialize_engine: provider list construction (["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]), provider options (trt_engine_cache_enable=True, trt_engine_cache_path=config.inference.ort_trt_cache_dir, trt_max_workspace_size=config.inference.gpu_memory_budget_bytes // 4), session creation, single-shot warm-up. Returns OnnxTrtEpEngineHandle (EngineHandle subclass) wrapping the session.
  • Engine-cache reuse path: if EngineCacheEntry.engine_path.suffix == ".engine", invoke EngineGate.validate(entry) first; the engine binary is then placed at the trt_engine_cache_path so ORT's TRT EP picks it up on session creation. If gate refuses, raise the gate's error (no fallback to ONNX direct — the operator must explicitly switch to a different cache or restart with a different config).
  • ONNX direct path: if engine_path.suffix == ".onnx", skip the gate (no engine to validate); ORT compiles the TRT subgraph in-session.
  • infer: session.run(output_names, {name: ndarray for name, ndarray in inputs.items()}); output is a dict matching the Protocol shape. Per-frame DEBUG log gated by config.
  • release_engine: idempotent session drop.
  • Constructor flag is_fallback: bool = False set by the composition root when this strategy is wired as the fallback (vs. selected directly). On first infer, if is_fallback, emit the kind="c7.fallback_to_onnx_trt_ep" WARN log + an FDR record (via gcs_alert in the AZ-291 sense — the actual GCS wire is C8's job; this task calls a constructor-injected callback).
  • Error envelope: EngineBuildError (ONNX validation failure), EngineDeserializeError (session creation failure), InferenceError (mid-flight ORT runtime exception), OutOfMemoryError (mid-session OOM), the gate's errors when applicable.
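The once-only fallback notification driven by is_fallback might look like the following sketch. The class name FallbackNotifier, the log message text, and the single-string signature of the gcs_alert callback are assumptions made for illustration; only the kind value and the once-on-first-infer behaviour come from this spec.

```python
import logging
import threading
from typing import Callable

FALLBACK_KIND = "c7.fallback_to_onnx_trt_ep"


class FallbackNotifier:
    """Emits the fallback WARN log and GCS alert at most once (hypothetical helper)."""

    def __init__(
        self,
        is_fallback: bool,
        gcs_alert: Callable[[str], None],
        logger: logging.Logger,
    ) -> None:
        self._pending = is_fallback
        self._gcs_alert = gcs_alert
        self._logger = logger
        self._lock = threading.Lock()

    def notify_once(self) -> bool:
        """Return True only on the call that actually emitted the notification."""
        with self._lock:
            if not self._pending:
                return False
            self._pending = False
        self._logger.warning(
            "ONNX Runtime TRT EP fallback in use; expect degraded latency",
            extra={"kind": FALLBACK_KIND},
        )
        self._gcs_alert(FALLBACK_KIND)
        return True
```

infer would call notify_once() before session.run; the lock is there only so that two concurrent first calls cannot both emit.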

Excluded

  • AZ-298 TensorrtRuntime (production-default) — separate task.
  • AZ-300 PytorchFp16Runtime — separate task.
  • AZ-301 EngineGate validation logic — this task INVOKES the gate.
  • AZ-302 ThermalState polling — this task delegates.
  • ORT version upgrade or new ORT dependency — pinned to the project's existing version.
  • Custom ORT EPs (e.g., TVM, OpenVINO) — out of scope this cycle.
  • Multi-session pooling — one session per EngineHandle this cycle.
  • Engine cache directory cleanup — operator-initiated; handled by C12 operator tooling.

Acceptance Criteria

AC-1: Protocol conformance
Given runtime_checkable(InferenceRuntime)
When isinstance(OnnxTrtEpRuntime(...), InferenceRuntime) is evaluated
Then result is True; current_runtime_label() == "onnx_trt_ep"

AC-2: deserialize from .onnx skips the gate
Given an EngineCacheEntry(engine_path=<onnx_path>)
When deserialize_engine(entry) is called
Then EngineGate.validate is NOT called; an ORT InferenceSession is created with the TRT EP at the head of the provider list; a single warm-up session.run succeeds

AC-3: deserialize from .engine invokes the gate
Given an EngineCacheEntry(engine_path=<engine_path>) whose filename schema mismatches the host
When deserialize_engine(entry) is called
Then EngineGate.validate is invoked first and raises EngineSchemaMismatchError; no InferenceSession is created

AC-4: infer round-trips through ORT and returns named outputs
Given a deserialised handle for an UltraVPR-shaped model and a fixed input dict
When infer(handle, inputs) is called
Then session.run is invoked with the model's declared output names; the returned dict matches the Protocol shape; numerical outputs match the TRT-direct strategy within a documented tolerance (FP16 round-trip)

AC-5: fallback WARN log fires once on first infer
Given a runtime constructed with is_fallback=True
When infer is called for the first time
Then exactly one kind="c7.fallback_to_onnx_trt_ep" WARN log is emitted AND one gcs_alert(...) callback is invoked; subsequent infer calls do NOT emit the log again

AC-6: provider fallback chain respects ORT order
Given an environment where TRT EP refuses to load (e.g., TRT version mismatch outside ORT's tolerance) but CUDA EP is available
When deserialize_engine is called
Then session creation succeeds with CUDAExecutionProvider as the active provider; an INFO log records the actual provider in use; current_runtime_label() STILL returns "onnx_trt_ep" (the runtime label is the strategy, not the EP)

AC-7: release_engine drops the session and is idempotent
Given a deserialised handle
When release_engine(handle) is called once and then again
Then the first call drops the session reference and the GPU memory ORT held is released; the second call returns silently

AC-8: workspace budget is respected
Given config.inference.gpu_memory_budget_bytes = 4 GB
When deserialize_engine is called and ORT's TRT EP attempts to allocate workspace
Then the workspace size is capped at gpu_memory_budget_bytes // 4 = 1 GB per the provider option; an attempt to exceed this raises OutOfMemoryError
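For AC-4, the named-output round trip can be sketched as follows. The function name infer_named is hypothetical; the session argument is assumed to behave like an ORT InferenceSession, whose get_outputs() entries expose a name attribute and whose run(output_names, input_feed) returns values in the requested order.

```python
from typing import Any, Dict, Mapping


def infer_named(session: Any, inputs: Mapping[str, Any]) -> Dict[str, Any]:
    """Run the session and pair each returned value with its declared output name."""
    # Ask the session for the model's declared output names, in order.
    output_names = [node.name for node in session.get_outputs()]
    # run() returns a list aligned with output_names.
    values = session.run(output_names, dict(inputs))
    return dict(zip(output_names, values))
```

A recording fake with the same two methods is enough to check the call sequence in a unit test, in line with the allowed stubs under Runtime Completeness.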

Non-Functional Requirements

Performance

  • Per-call latency budget is the same as TRT (C7-PT-01) but the test fixture allows up to 1.5× slack (the WARN log notes "degraded latency" — the operator has been informed). Failure thresholds match TRT's failure thresholds (100 / 60 / 150 / 90 ms).
  • Session creation budget: ≤ 30 s p95 on Tier-2 for the first deserialise (ORT lazy-compile dominates the first call); subsequent deserialises that hit the EP cache should be ≤ 5 s p95.

Compatibility

  • ORT version pinned to the project's requirements.txt (this task does NOT change it).
  • TRT EP requires the same TRT 10.3 / CUDA stack as TensorrtRuntime; if absent, CUDAExecutionProvider carries the inference (degraded but functional) per AC-6.

Reliability

  • Errors rewrapped into the AZ-297 family.
  • ORT-internal exceptions (e.g., onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument) are caught and rewrapped as InferenceError.
  • Session lifetime is bound to the EngineHandle; the runtime never holds a session beyond an explicit release_engine or process exit.
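A minimal sketch of the rewrap rule, under stated assumptions: InferenceError below is a local stand-in for the AZ-297 error of the same name, run_rewrapped is a hypothetical helper, and catching the broad Exception base is an illustrative choice rather than a decision this spec makes.

```python
from typing import Any, List, Mapping, Sequence


class InferenceError(RuntimeError):
    """Hypothetical stand-in for the AZ-297 error of the same name."""


def run_rewrapped(
    session: Any, output_names: Sequence[str], inputs: Mapping[str, Any]
) -> List[Any]:
    """Call session.run and convert any runtime failure into InferenceError."""
    try:
        return session.run(list(output_names), dict(inputs))
    except InferenceError:
        # Already in the target family; pass it through unchanged.
        raise
    except Exception as exc:
        # Chain the original so the ORT error stays visible in tracebacks.
        raise InferenceError(
            f"ORT session.run failed: {type(exc).__name__}: {exc}"
        ) from exc
```

Using `raise ... from exc` preserves the underlying ORT exception as `__cause__`, which helps post-flight debugging without leaking ORT types through the Protocol.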

Unit Tests

| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Protocol conformance + label | isinstance True; label string match |
| AC-2 | Deserialise from .onnx | Gate NOT called; session created with TRT EP head |
| AC-3 | Deserialise from .engine with mismatched schema | Gate raises; no session |
| AC-4 | Numerical comparison against TRT-direct (UltraVPR sample input) | Outputs match within FP16 tolerance |
| AC-5 | First infer with is_fallback=True | Exactly one WARN log; one gcs_alert; second infer silent |
| AC-6 | Force TRT EP refusal, allow CUDA EP | Session creates with CUDA EP; label still "onnx_trt_ep" |
| AC-7 | release_engine called twice | First drops; second is no-op |
| AC-8 | Workspace cap | Provider option set to budget // 4 |
| NFR-perf-session-create | Microbench session creation × 5 | First p95 ≤ 30 s; subsequent ≤ 5 s |
| NFR-reliability-error-rewrap | Inject ORT internal error | Rewrapped to InferenceError |

Constraints

  • ORT version pinned at the project default; no upgrade in this task.
  • Provider list order is fixed: ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]. CPU is the bottom-of-list fallback so a misconfigured Jetson still serves results (slowly) instead of hard-failing — the WARN log will scream, but the system stays alive.
  • ORT EP's trt_engine_cache_path is a config field; defaults to a per-flight subdirectory of config.inference.engine_cache_dir.
  • The OnnxTrtEpEngineHandle is opaque to consumers.
  • This task introduces no new third-party dependencies beyond what ORT already requires.
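One way to keep the handle opaque while making release_engine idempotent (AC-7) is sketched below. The field name _session, the released property, and the boolean return value are illustrative assumptions; the real OnnxTrtEpEngineHandle subclasses the AZ-297 EngineHandle, which this stand-in omits.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class OnnxTrtEpEngineHandle:
    """Hypothetical stand-in: wraps the session and exposes nothing else."""

    _session: Optional[Any] = field(default=None, repr=False)

    @property
    def released(self) -> bool:
        return self._session is None


def release_engine(handle: OnnxTrtEpEngineHandle) -> bool:
    """Drop the session reference; a second call is a silent no-op.

    Returns True only when this call actually dropped a session.
    """
    if handle._session is None:
        return False
    # Dropping the last reference lets ORT free EP resources on collection.
    handle._session = None
    return True
```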

Risks & Mitigation

Risk 1: ORT TRT EP cache poisons across precisions

  • Risk: The TRT EP cache is keyed by ORT-internal hashes, not by D-C10-7 schema. Reusing a stale cache could load a wrong-precision subgraph.
  • Mitigation: The EP cache directory is per-flight (under engine_cache_dir) and cleaned on flight end by C12 operator tooling. The cache lives only across a single flight; cross-flight reuse is operator-initiated and explicit.

Risk 2: Provider fallback to CPU EP is silent

  • Risk: If both TRT and CUDA EPs refuse to load, ORT falls back to CPU; latency budgets explode and the system "works" at 100× target latency.
  • Mitigation: The active provider list is logged at INFO on session creation; if CPU is the active provider, an additional WARN log fires (kind="c7.cpu_fallback"). The composition root MAY install a hard refusal hook that raises EngineDeserializeError on CPU fallback (operator-configurable; default is "warn but allow").
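The mitigation can be sketched as a check over the provider list the session reports (in ORT, session.get_providers()). The function name check_active_provider, the refuse_cpu parameter standing in for the composition root's refusal hook, and the local EngineDeserializeError class are assumptions for illustration.

```python
import logging
from typing import Sequence


class EngineDeserializeError(RuntimeError):
    """Hypothetical stand-in for the AZ-297 error of the same name."""


def check_active_provider(
    active_providers: Sequence[str],
    logger: logging.Logger,
    refuse_cpu: bool = False,
) -> str:
    """Log the providers in use; warn, or refuse, when CPU heads the list."""
    if not active_providers:
        raise EngineDeserializeError("session reports no execution providers")
    head = active_providers[0]
    logger.info("active ORT providers: %s", list(active_providers))
    if head == "CPUExecutionProvider":
        logger.warning(
            "ORT fell back to CPU; latency budgets will not hold",
            extra={"kind": "c7.cpu_fallback"},
        )
        if refuse_cpu:
            raise EngineDeserializeError("CPU fallback refused by configuration")
    return head
```

With refuse_cpu left at False this matches the "warn but allow" default; setting it to True models the optional hard refusal.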

Risk 3: ORT version drift breaks TRT EP option keys

  • Risk: ORT 1.16 → 1.18 changed some TRT EP option key names; a future upgrade silently regresses.
  • Mitigation: The TRT EP provider option dict is built behind a single helper function; the helper has a unit test that pins the option-key names. Upgrade-time changes fail the unit test before merge.
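The single-helper approach could look like this sketch. The function and constant names are hypothetical; the three option keys and the budget // 4 rule are the ones listed under Scope, and the mixed list of names and (name, options) tuples is the form ORT's InferenceSession accepts for its providers argument.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple, Union

# Option keys pinned by the unit test; an ORT upgrade that renames one must
# change this set deliberately.
PINNED_TRT_EP_OPTION_KEYS = frozenset(
    {"trt_engine_cache_enable", "trt_engine_cache_path", "trt_max_workspace_size"}
)

ProviderSpec = Union[str, Tuple[str, Dict[str, Any]]]


def build_trt_ep_options(cache_dir: Path, gpu_memory_budget_bytes: int) -> Dict[str, Any]:
    """Build the TRT EP option dict in exactly one place."""
    return {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": str(cache_dir),
        # A quarter of the GPU budget, per the workspace rule in Scope.
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
    }


def build_providers(cache_dir: Path, gpu_memory_budget_bytes: int) -> List[ProviderSpec]:
    """Return the fixed provider chain with TRT EP at the head."""
    return [
        ("TensorrtExecutionProvider", build_trt_ep_options(cache_dir, gpu_memory_budget_bytes)),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
```

The pinning test then reduces to comparing the helper's keys against PINNED_TRT_EP_OPTION_KEYS, so a renamed key fails before merge rather than being silently ignored at runtime.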

Runtime Completeness

  • Named capability: ONNX Runtime + TensorRT execution-provider fallback path (architecture / E-C7 / C7-IT-05).
  • Production code that must exist: real OnnxTrtEpRuntime class implementing the AZ-297 Protocol; real ORT InferenceSession creation with the TRT-EP-led provider list; real ORT session.run on the F3 hot path; real fallback WARN + GCS alert wiring.
  • Allowed external stubs: tests MAY substitute a recording wrapper around session.run to verify call sequence (AC-4); production wiring uses real ORT.
  • Unacceptable substitutes: a stub that just defers to TensorrtRuntime (would defeat the fallback's purpose), an ORT session without the TRT EP at the head of the provider list (would silently use CUDA EP and break C7-IT-05's design intent), or hardcoded CPU-EP-only mode (would silently meet the test outcome with absurd latency).