mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-21 23:11:13 +00:00

Oleksandr Bezdieniezhnykh 0ad3278b12 [AZ-299] C7 OnnxTrtEpRuntime: ORT + TRT EP fallback strategy

Land the fallback InferenceRuntime strategy that satisfies C7-IT-05:
when the TRT-direct path (AZ-298) cannot deserialise a cached engine
or when the operator explicitly selects ORT, the system stays in the
air at degraded latency rather than dropping the request. Conforms to
the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep".

Production
- onnx_trt_ep_runtime.py: compile_engine is a no-op returning an
  EngineCacheEntry pointing at the source .onnx; deserialize_engine
  is gate-first for .engine entries and gate-skip for .onnx, builds
  an ORT InferenceSession with the provider list
  [TensorrtExecutionProvider, CUDAExecutionProvider,
  CPUExecutionProvider], stages cached engines into the ORT TRT EP
  cache directory via symlink-or-copy, warms up with one session.run
  after construction, and honours
  config.inference.ort_disallow_cpu_fallback by raising
  EngineDeserializeError when the active provider resolves to CPU;
  infer emits a one-shot c7.fallback_to_onnx_trt_ep WARN log plus
  gcs_alert callback on first call when is_fallback=True;
  release_engine is idempotent. _build_provider_args is the
  single point that pins TRT EP option-key names (Risk-3) and caps
  trt_max_workspace_size at gpu_memory_budget_bytes // 4 (AC-8).
- config.py: adds ort_trt_cache_dir (validated non-empty) and
  ort_disallow_cpu_fallback to C7InferenceConfig.
- fdr_client/records.py: adds c7.fallback_to_onnx_trt_ep and
  c7.cpu_fallback FDR record kinds.

Tests
- test_onnx_trt_ep_runtime.py: covers AC-1..AC-8 + Risk-2 CPU-fallback
  alert + Risk-3 option-key pin + NFR-reliability error rewrap; Tier-1
  via fake ORT session; Tier-2 placeholders skip on macOS dev for
  numerical FP16 comparison and session-creation perf NFR.
- test_protocol_conformance.py: drops onnx_trt_ep from the
  missing-module parametrize now that the module ships.
- test_az272_fdr_record_schema.py: extends per-kind fixture builder
  to cover the two new C7 FDR kinds in the roundtrip / schema-version
  AC tests.

Docs
- module-layout.md: replaces the pending onnx_trt_runtime row with
  the shipped onnx_trt_ep_runtime row + capabilities list.
- batch_32_cycle1_report.md + reviews/batch_32_review.md: full batch
  + self-review (PASS_WITH_WARNINGS, 4 Low findings accepted).

Tests run: c7_inference 139 passing + 17 Tier-2 skips; combined unit
suite (excluding pending components) 529 passing, 19 env-skipped.

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-12 23:55:50 +03:00

C7 OnnxTrtEpRuntime — ONNX Runtime + TensorRT EP Fallback

Task: AZ-299_c7_onnxrt_fallback
Name: C7 OnnxTrtEpRuntime
Description: Implement OnnxTrtEpRuntime, the fallback InferenceRuntime strategy that uses ONNX Runtime with the TensorRT execution provider. Triggered by config selection or by operator escalation when the TRT-direct path (AZ-298) cannot deserialise the cached engine for a given model; it produces correct results with a degraded-latency WARN log (covers C7-IT-05). Conforms to the same AZ-297 Protocol so the composition root can select either strategy at startup.
Complexity: 3 points
Dependencies: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
Component: c7_inference (epic AZ-249 / E-C7)
Tracker: AZ-299
Epic: AZ-249 (E-C7)

Document Dependencies

_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md — the Protocol this task implements; produced by AZ-297.
_docs/02_document/contracts/shared_helpers/engine_filename_schema.md — used at deserialise time when reusing a cached .engine via ORT TRT EP.
_docs/02_document/contracts/shared_helpers/sha256_sidecar.md — sidecar trust check via the gate.
_docs/02_document/contracts/shared_config/composition_root_protocol.md — Config object that provides ORT cache dir + ONNX model paths.

Problem

Two scenarios force a fallback off TensorrtRuntime:

The cached .engine for a given model fails to deserialise on this Jetson (filename-schema mismatch outside the gate's tolerance, an op TRT 10.3 dropped between calibrator runs, or operator-driven engine rename) — the system must keep flying without dropping the request.
The operator deliberately selects ORT + TRT EP via config (e.g., during a debugging session, or on a Tier-1 workstation where TRT-direct hard-fails).

Without OnnxTrtEpRuntime:

C7-IT-05 (ONNX-RT fallback when TRT engine unavailable) has nothing to verify.
The "simple-baseline" engineering rule that every fancy strategy must have a working fallback is violated.
The ADR-001 "runtime selectable at startup" promise has only the production strategy and the PyTorch baseline; ORT is the middle ground users expect.

Outcome

An OnnxTrtEpRuntime class at src/gps_denied_onboard/components/c7_inference/onnx_trt_ep_runtime.py conforming to the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep".
compile_engine is a no-op for ORT — it returns an EngineCacheEntry whose engine_path is the underlying .onnx file path. ORT will lazy-compile a TRT subgraph in-session on first use; the EP cache directory holds those subgraph caches transparently.
deserialize_engine(EngineCacheEntry) -> EngineHandle: if the entry's engine_path is a .engine file, invoke AZ-301 EngineGate first (cached engine reuse via ORT TRT EP cache); if it is a .onnx, skip the gate and load the ONNX directly. In either case, build an InferenceSession with the TRT EP provider list ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"] (provider fallback chain) and warm up by running one zero-input through the session.
infer(handle, inputs) -> outputs calls session.run(output_names, inputs); the call is sync; ORT manages its own CUDA stream internally.
release_engine(handle) calls session.end_profiling() and drops the session reference; ORT releases EP resources on garbage collection.
A degraded-latency WARN log (kind="c7.fallback_to_onnx_trt_ep") is emitted ONCE on first infer call when the runtime was selected as a fallback (signalled by the composition root via a constructor flag); the operator post-flight FDR shows the fallback was used.
thermal_state() is delegated to the same ThermalStatePublisher reference used by TensorrtRuntime (AZ-302 owns the publisher).
ORT version pin matches the project's requirements.txt / equivalent; this task does NOT introduce a new ORT version.
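The gate-first / gate-skip split described for deserialize_engine can be sketched as a small routing function. This is an illustration only: EngineCacheEntry is reduced to a single field, choose_load_path is a hypothetical name, and the validate callable stands in for EngineGate.validate, which is assumed to raise on refusal.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass(frozen=True)
class EngineCacheEntry:
    """Hypothetical stand-in; the real type is defined by the AZ-297 contract."""
    engine_path: Path


def choose_load_path(entry: EngineCacheEntry,
                     validate: Callable[[EngineCacheEntry], None]) -> str:
    """Return "engine" or "onnx" for `entry`, running the gate only for .engine files.

    `validate` stands in for EngineGate.validate and is assumed to raise on refusal.
    """
    suffix = entry.engine_path.suffix
    if suffix == ".engine":
        validate(entry)  # gate-first: a refusal propagates before any session exists
        return "engine"
    if suffix == ".onnx":
        return "onnx"    # gate-skip: there is no engine binary to validate
    raise ValueError(f"unsupported engine_path suffix: {suffix!r}")


gate_calls: List[str] = []


def recording_gate(entry: EngineCacheEntry) -> None:
    """Test double that records which entries reached the gate."""
    gate_calls.append(entry.engine_path.name)


onnx_route = choose_load_path(EngineCacheEntry(Path("model.onnx")), recording_gate)
engine_route = choose_load_path(EngineCacheEntry(Path("model.engine")), recording_gate)
```

Because the gate call sits before any session construction, a gate refusal leaves nothing to clean up.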

Scope

Included

OnnxTrtEpRuntime class implementing the AZ-297 Protocol.
compile_engine: returns an EngineCacheEntry pointing at the source .onnx file (no separate engine binary). The .onnx's sha256 is computed via AZ-280 and stamped into the entry; the (SM, JP, TRT, precision) tuple is set to the host's running tuple (since ORT will lazy-compile per host).
deserialize_engine: provider list construction (["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]), provider options (trt_engine_cache_enable=True, trt_engine_cache_path=config.inference.ort_trt_cache_dir, trt_max_workspace_size=config.inference.gpu_memory_budget_bytes // 4), session creation, single-shot warm-up. Returns OnnxTrtEpEngineHandle (EngineHandle subclass) wrapping the session.
Engine-cache reuse path: if EngineCacheEntry.engine_path.suffix == ".engine", invoke EngineGate.validate(entry) first; the engine binary is then placed at the trt_engine_cache_path so ORT's TRT EP picks it up on session creation. If gate refuses, raise the gate's error (no fallback to ONNX direct — the operator must explicitly switch to a different cache or restart with a different config).
ONNX direct path: if engine_path.suffix == ".onnx", skip the gate (no engine to validate); ORT compiles the TRT subgraph in-session.
infer: session.run(output_names, {name: ndarray for name, ndarray in inputs.items()}); output is a dict matching the Protocol shape. Per-frame DEBUG log gated by config.
release_engine: idempotent session drop.
Constructor flag is_fallback: bool = False set by the composition root when this strategy is wired as the fallback (vs. selected directly). On first infer, if is_fallback, emit the kind="c7.fallback_to_onnx_trt_ep" WARN log + an FDR record (via gcs_alert in the AZ-291 sense — the actual GCS wire is C8's job; this task calls a constructor-injected callback).
Error envelope: EngineBuildError (ONNX validation failure), EngineDeserializeError (session creation failure), InferenceError (mid-flight ORT runtime exception), OutOfMemoryError (mid-session OOM), the gate's errors when applicable.
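The infer item above maps a positional session.run result back to a named dict. A minimal sketch of that mapping follows, using a fake session so it runs without ORT installed. FakeSession and the free function infer are hypothetical; the only ORT behaviour assumed is that session.run(output_names, input_feed) returns values in the order of output_names.

```python
from typing import Any, Dict, List, Mapping, Sequence, Tuple


class FakeSession:
    """Test double that mimics the session.run call shape and records each call."""

    def __init__(self) -> None:
        self.calls: List[Tuple[List[str], Dict[str, Any]]] = []

    def run(self, output_names: Sequence[str], input_feed: Mapping[str, Any]) -> List[Any]:
        self.calls.append((list(output_names), dict(input_feed)))
        # One value per requested output, in the requested order.
        return [f"value-of-{name}" for name in output_names]


def infer(session: Any, output_names: Sequence[str],
          inputs: Mapping[str, Any]) -> Dict[str, Any]:
    """Run the session once and return outputs keyed by their declared names."""
    values = session.run(list(output_names), dict(inputs))
    if len(values) != len(output_names):
        raise RuntimeError("session returned an unexpected number of outputs")
    return dict(zip(output_names, values))


session = FakeSession()
outputs = infer(session, ["descriptor", "score"], {"image": [0.0, 0.0]})
```

The recording double is the kind of wrapper the Runtime Completeness section allows in tests; production wiring would pass a real session instead.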

Excluded

AZ-298 TensorrtRuntime (production-default) — separate task.
AZ-300 PytorchFp16Runtime — separate task.
AZ-301 EngineGate validation logic — this task INVOKES the gate.
AZ-302 ThermalState polling — this task delegates.
ORT version upgrade or new ORT dependency — pinned to the project's existing version.
Custom ORT EPs (e.g., TVM, OpenVINO) — out of scope this cycle.
Multi-session pooling — one session per EngineHandle this cycle.
Engine cache directory cleanup — operator-initiated; handled by C12 operator tooling.

Acceptance Criteria

AC-1: Protocol conformance
Given runtime_checkable(InferenceRuntime)
When isinstance(OnnxTrtEpRuntime(...), InferenceRuntime) is evaluated
Then result is True; current_runtime_label() == "onnx_trt_ep"
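A sketch of what the AC-1 check exercises, assuming the Protocol is reduced to the method names this document mentions. The signatures below are guesses for illustration, not the AZ-297 contract. Note that isinstance against a runtime_checkable Protocol only verifies that the methods exist, not that their signatures match.

```python
from typing import Any, Mapping, Protocol, runtime_checkable


@runtime_checkable
class InferenceRuntime(Protocol):
    """Hypothetical reduction of the AZ-297 Protocol; signatures are assumptions."""

    def current_runtime_label(self) -> str: ...
    def compile_engine(self, model_path: Any) -> Any: ...
    def deserialize_engine(self, entry: Any) -> Any: ...
    def infer(self, handle: Any, inputs: Mapping[str, Any]) -> Mapping[str, Any]: ...
    def release_engine(self, handle: Any) -> None: ...
    def thermal_state(self) -> Any: ...


class OnnxTrtEpRuntimeSketch:
    """Skeleton with the right method names; bodies are placeholders."""

    def current_runtime_label(self) -> str:
        return "onnx_trt_ep"

    def compile_engine(self, model_path: Any) -> Any:
        return model_path

    def deserialize_engine(self, entry: Any) -> Any:
        return entry

    def infer(self, handle: Any, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        return {}

    def release_engine(self, handle: Any) -> None:
        return None

    def thermal_state(self) -> Any:
        return None


class MissingRelease:
    """Lacks most Protocol methods, so the structural check fails."""

    def current_runtime_label(self) -> str:
        return "onnx_trt_ep"


conforms = isinstance(OnnxTrtEpRuntimeSketch(), InferenceRuntime)
rejects_partial = not isinstance(MissingRelease(), InferenceRuntime)
```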

AC-2: deserialize from .onnx skips the gate
Given an EngineCacheEntry(engine_path=<onnx_path>)
When deserialize_engine(entry) is called
Then EngineGate.validate is NOT called; an ORT InferenceSession is created with the TRT EP at the head of the provider list; a single warm-up session.run succeeds

AC-3: deserialize from .engine invokes the gate
Given an EngineCacheEntry(engine_path=<engine_path>) whose filename schema mismatches the host
When deserialize_engine(entry) is called
Then EngineGate.validate is invoked first and raises EngineSchemaMismatchError; no InferenceSession is created

AC-4: infer round-trips through ORT and returns named outputs
Given a deserialised handle for an UltraVPR-shaped model and a fixed input dict
When infer(handle, inputs) is called
Then session.run is invoked with the model's declared output names; the returned dict matches the Protocol shape; numerical outputs match the TRT-direct strategy within a documented tolerance (FP16 round-trip)

AC-5: fallback WARN log fires once on first infer
Given a runtime constructed with is_fallback=True
When infer is called for the first time
Then exactly one kind="c7.fallback_to_onnx_trt_ep" WARN log is emitted AND one gcs_alert(...) callback is invoked; subsequent infer calls do NOT emit the log again
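The once-only behaviour in AC-5 can be sketched with a latch that is cleared before the side effects run. FallbackNotifier and its on_infer hook are hypothetical names, the log message text is invented for illustration, and the sketch is not thread-safe; a real runtime with concurrent infer callers would need a lock around the latch.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("c7_inference.fallback_sketch")
logger.setLevel(logging.WARNING)
logger.propagate = False


class _ListHandler(logging.Handler):
    """Collects log records so the example can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


class FallbackNotifier:
    """Emits the fallback WARN log and the alert callback at most once."""

    KIND = "c7.fallback_to_onnx_trt_ep"

    def __init__(self, is_fallback: bool, gcs_alert: Callable[[str], None]) -> None:
        self._pending = is_fallback
        self._gcs_alert = gcs_alert

    def on_infer(self) -> None:
        if not self._pending:
            return
        # Clear the latch first so a raising callback cannot re-arm the alert.
        self._pending = False
        logger.warning("degraded latency: ORT + TRT EP fallback in use",
                       extra={"kind": self.KIND})
        self._gcs_alert(self.KIND)


captured = _ListHandler()
logger.addHandler(captured)

alerts: List[str] = []
notifier = FallbackNotifier(is_fallback=True, gcs_alert=alerts.append)
for _ in range(3):
    notifier.on_infer()  # only the first call logs and alerts
```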

AC-6: provider fallback chain respects ORT order
Given an environment where TRT EP refuses to load (e.g., TRT version mismatch outside ORT's tolerance) but CUDA EP is available
When deserialize_engine is called
Then session creation succeeds with CUDAExecutionProvider as the active provider; an INFO log records the actual provider in use; current_runtime_label() STILL returns "onnx_trt_ep" (the runtime label is the strategy, not the EP)

AC-7: release_engine drops the session and is idempotent
Given a deserialised handle
When release_engine(handle) is called once and then again
Then the first call drops the session reference and the GPU memory ORT held is released; the second call returns silently
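A minimal sketch of the idempotent drop in AC-7, assuming the handle is the only owner of the session. EngineHandleSketch and FakeSession are hypothetical; the weak reference is only there to show that nothing else keeps the session alive after release.

```python
import gc
import weakref
from typing import Any, Optional


class FakeSession:
    """Stand-in for an ORT InferenceSession; holds no GPU resources."""


class EngineHandleSketch:
    """Hypothetical handle that owns exactly one session reference."""

    def __init__(self, session: Any) -> None:
        self.session: Optional[Any] = session


def release_engine(handle: EngineHandleSketch) -> None:
    """Drop the handle's session reference; a second call is a silent no-op."""
    if handle.session is None:
        return
    handle.session = None


handle = EngineHandleSketch(FakeSession())
session_ref = weakref.ref(handle.session)

release_engine(handle)
release_engine(handle)  # second call must not raise

gc.collect()
session_collected = session_ref() is None
```

If any other object still holds the session (a profiler, a cache, a test fixture), the drop does not free it, which is why the Reliability section binds session lifetime to the handle alone.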

AC-8: workspace budget is respected
Given config.inference.gpu_memory_budget_bytes = 4 GB
When deserialize_engine is called and ORT's TRT EP attempts to allocate workspace
Then the workspace size is capped at gpu_memory_budget_bytes // 4 = 1 GB per the provider option; an attempt to exceed this raises OutOfMemoryError
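The cap in AC-8 is plain integer division applied when the provider options are built. The sketch below assembles the provider list in the fixed order with the three TRT EP options named in Scope; build_provider_args is a hypothetical free-function version of the helper, and the 4 GB budget is interpreted as binary gigabytes for the arithmetic.

```python
from typing import Any, Dict, List, Tuple, Union

ProviderSpec = Union[str, Tuple[str, Dict[str, Any]]]


def build_provider_args(cache_dir: str, gpu_memory_budget_bytes: int) -> List[ProviderSpec]:
    """Return the ordered provider list; only the TRT EP carries options."""
    trt_options: Dict[str, Any] = {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
        # Workspace is capped at a quarter of the configured GPU budget.
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
    }
    return [
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]


GIB = 1024 ** 3
providers = build_provider_args("/tmp/ort_trt_cache", 4 * GIB)
workspace_cap = providers[0][1]["trt_max_workspace_size"]
provider_names = [p if isinstance(p, str) else p[0] for p in providers]
```

A list in this shape, mixing bare provider names with (name, options) tuples, is what ORT's InferenceSession accepts for its providers argument.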

Non-Functional Requirements

Performance

Per-call latency budget is the same as TRT (C7-PT-01) but the test fixture allows up to 1.5× slack (the WARN log notes "degraded latency" — the operator has been informed). Failure thresholds match TRT's failure thresholds (100 / 60 / 150 / 90 ms).
Session creation budget: ≤ 30 s p95 on Tier-2 for the first deserialise (ORT lazy-compile dominates the first call); subsequent deserialises that hit the EP cache should be ≤ 5 s p95.

Compatibility

ORT version pinned to the project's requirements.txt (this task does NOT change it).
TRT EP requires the same TRT 10.3 / CUDA stack as TensorrtRuntime; if absent, CUDAExecutionProvider carries the inference (degraded but functional) per AC-6.

Reliability

Errors rewrapped into the AZ-297 family.
ORT-internal exceptions (e.g., onnxruntime.OrtInvalidArgument) are caught and rewrapped as InferenceError.
Session lifetime is bound to the EngineHandle; the runtime never holds a session beyond an explicit release_engine or process exit.
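One way to keep the rewrap rule in a single place is a small context manager, sketched here. InferenceError is a stand-in for the AZ-297 error of the same name, FakeOrtInvalidArgument stands in for an ORT-internal exception, and rewrap_as is a hypothetical helper. Chaining with `from exc` preserves the original exception for post-flight diagnosis.

```python
from contextlib import contextmanager
from typing import Iterator, Type


class InferenceError(Exception):
    """Hypothetical stand-in for the AZ-297 error of the same name."""


class FakeOrtInvalidArgument(Exception):
    """Stand-in for an ORT-internal exception type."""


@contextmanager
def rewrap_as(target: Type[Exception], *source_types: Type[BaseException]) -> Iterator[None]:
    """Re-raise any of `source_types` as `target`, keeping the original as __cause__."""
    try:
        yield
    except source_types as exc:
        raise target(f"{type(exc).__name__}: {exc}") from exc


caught = None
try:
    with rewrap_as(InferenceError, FakeOrtInvalidArgument):
        raise FakeOrtInvalidArgument("input shape mismatch")
except InferenceError as err:
    caught = err
```

Exceptions outside the listed source types pass through unchanged, so programming errors are not silently relabelled as inference failures.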

Unit Tests

| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Protocol conformance + label | `isinstance` True; label string match |
| AC-2 | Deserialise from `.onnx` | Gate NOT called; session created with TRT EP head |
| AC-3 | Deserialise from `.engine` with mismatched schema | Gate raises; no session |
| AC-4 | Numerical comparison against TRT-direct (UltraVPR sample input) | Outputs match within FP16 tolerance |
| AC-5 | First infer with `is_fallback=True` | Exactly one WARN log; one gcs_alert; second infer silent |
| AC-6 | Force TRT EP refusal, allow CUDA EP | Session creates with CUDA EP; label still `"onnx_trt_ep"` |
| AC-7 | release_engine called twice | First drops; second is no-op |
| AC-8 | Workspace cap | Provider option set to budget // 4 |
| NFR-perf-session-create | Microbench session creation × 5 | First p95 ≤ 30 s; subsequent ≤ 5 s |
| NFR-reliability-error-rewrap | Inject ORT internal error | Rewrapped to `InferenceError` |

Constraints

ORT version pinned at the project default; no upgrade in this task.
Provider list order is fixed: ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]. CPU is the bottom-of-list fallback so a misconfigured Jetson still serves results (slowly) instead of hard-failing — the WARN log will scream, but the system stays alive.
ORT EP's trt_engine_cache_path is a config field; defaults to a per-flight subdirectory of config.inference.engine_cache_dir.
The OnnxTrtEpEngineHandle is opaque to consumers.
This task introduces no new third-party dependencies beyond what ORT already requires.

Risks & Mitigation

Risk 1: ORT TRT EP cache poisons across precisions

Risk: The TRT EP cache is keyed by ORT-internal hashes, not by D-C10-7 schema. Reusing a stale cache could load a wrong-precision subgraph.
Mitigation: The EP cache directory is per-flight (under engine_cache_dir) and cleaned on flight end by C12 operator tooling. The cache lives only across a single flight; cross-flight reuse is operator-initiated and explicit.
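A sketch of the per-flight cache layout and of staging a cached engine into it by symlink with a copy fallback. The "ort_trt_ep" subdirectory name, the flight identifier format, and both function names are assumptions. The sketch does not model how the TRT EP names or looks up its own cache files.

```python
import os
import shutil
import tempfile
from pathlib import Path


def per_flight_cache_dir(engine_cache_dir: Path, flight_id: str) -> Path:
    """Create and return a cache directory scoped to one flight."""
    path = engine_cache_dir / "ort_trt_ep" / flight_id
    path.mkdir(parents=True, exist_ok=True)
    return path


def stage_engine(engine: Path, cache_dir: Path) -> Path:
    """Place `engine` in `cache_dir` via symlink, copying if symlinks are unavailable."""
    target = cache_dir / engine.name
    if target.exists() or target.is_symlink():
        target.unlink()  # replace a stale entry rather than reuse it
    try:
        os.symlink(engine.resolve(), target)
    except (OSError, NotImplementedError):
        shutil.copy2(engine, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "model.engine"
    source.write_bytes(b"engine-bytes")

    cache_dir = per_flight_cache_dir(root / "engine_cache", "flight-0001")
    staged = stage_engine(source, cache_dir)
    staged_bytes = staged.read_bytes()
    staged_again = stage_engine(source, cache_dir)  # re-staging replaces the entry
    staged_relative = staged.relative_to(root).as_posix()
```

Scoping the directory by flight means a stale subgraph cache from an earlier flight is never on the path the session is pointed at.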

Risk 2: Provider fallback to CPU EP is silent

Risk: If both TRT and CUDA EPs refuse to load, ORT falls back to CPU; latency budgets explode and the system "works" at 100× target latency.
Mitigation: The active provider list is logged at INFO on session creation; if CPU is the active provider, an additional WARN log fires (kind="c7.cpu_fallback"). The composition root MAY install a hard refusal hook that raises EngineDeserializeError on CPU fallback (operator-configurable; default is "warn but allow").
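The warn-or-refuse policy can be sketched as a check run right after session creation. It assumes the first entry of the session's reported provider list is the one carrying the model, which is a simplification, and check_active_provider with its disallow_cpu_fallback parameter is a hypothetical shape for the hook.

```python
import logging
from typing import List, Sequence

logger = logging.getLogger("c7_inference.provider_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False


class EngineDeserializeError(Exception):
    """Hypothetical stand-in for the AZ-297 error of the same name."""


class _ListHandler(logging.Handler):
    """Collects log records so the example can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def check_active_provider(active_providers: Sequence[str],
                          disallow_cpu_fallback: bool) -> str:
    """Log the providers in use and apply the CPU-fallback policy."""
    if not active_providers:
        raise EngineDeserializeError("session reports no providers")
    active = active_providers[0]  # assumption: head of the list is the active one
    logger.info("ORT providers in use: %s", list(active_providers))
    if active == "CPUExecutionProvider":
        if disallow_cpu_fallback:
            raise EngineDeserializeError("CPU fallback refused by configuration")
        logger.warning("inference is running on the CPU provider",
                       extra={"kind": "c7.cpu_fallback"})
    return active


captured = _ListHandler()
logger.addHandler(captured)

on_cuda = check_active_provider(["CUDAExecutionProvider", "CPUExecutionProvider"], False)
on_cpu = check_active_provider(["CPUExecutionProvider"], False)
```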

Risk 3: ORT version drift breaks TRT EP option keys

Risk: ORT 1.16 → 1.18 changed some TRT EP option key names; a future upgrade silently regresses.
Mitigation: The TRT EP provider option dict is built behind a single helper function; the helper has a unit test that pins the option-key names. Upgrade-time changes fail the unit test before merge.
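A pinning test of this kind can be as small as a set comparison against a frozen list of expected keys. The sketch below uses the three option keys named in Scope; option_key_drift is a hypothetical helper. It guards the helper's output against accidental edits; it does not by itself ask ORT which keys the installed version accepts.

```python
from typing import Any, Dict, List, Mapping

PINNED_TRT_EP_OPTION_KEYS = frozenset({
    "trt_engine_cache_enable",
    "trt_engine_cache_path",
    "trt_max_workspace_size",
})


def option_key_drift(options: Mapping[str, Any]) -> Dict[str, List[str]]:
    """Report keys that are missing from, or unexpected in, the option dict."""
    keys = set(options)
    return {
        "missing": sorted(PINNED_TRT_EP_OPTION_KEYS - keys),
        "unexpected": sorted(keys - PINNED_TRT_EP_OPTION_KEYS),
    }


current = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "/tmp/ort_trt_cache",
    "trt_max_workspace_size": 1024 ** 3,
}
no_drift = option_key_drift(current)

# Simulate a rename slipping into the helper: the pin reports both sides of it.
renamed = dict(current)
renamed["trt_workspace_size"] = renamed.pop("trt_max_workspace_size")
drift = option_key_drift(renamed)
```

A unit test would assert that both lists are empty for the helper's real output, so any rename has to be made deliberately in two places.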

Runtime Completeness

Named capability: ONNX Runtime + TensorRT execution-provider fallback path (architecture / E-C7 / C7-IT-05).
Production code that must exist: real OnnxTrtEpRuntime class implementing the AZ-297 Protocol; real ORT InferenceSession creation with the TRT-EP-led provider list; real ORT session.run on the F3 hot path; real fallback WARN + GCS alert wiring.
Allowed external stubs: tests MAY substitute a recording wrapper around session.run to verify call sequence (AC-4); production wiring uses real ORT.
Unacceptable substitutes: a stub that just defers to TensorrtRuntime (would defeat the fallback's purpose), an ORT session without the TRT EP at the head of the provider list (would silently use CUDA EP and break C7-IT-05's design intent), or hardcoded CPU-EP-only mode (would silently meet the test outcome with absurd latency).
