Land the fallback InferenceRuntime strategy that satisfies C7-IT-05: when the TRT-direct path (AZ-298) cannot deserialise a cached engine or when the operator explicitly selects ORT, the system stays in the air at degraded latency rather than dropping the request. Conforms to the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep".

Production
- onnx_trt_ep_runtime.py: compile_engine is a no-op returning an EngineCacheEntry pointing at the source .onnx; deserialize_engine is gate-first for .engine entries and gate-skip for .onnx, builds an ORT InferenceSession with the provider list [TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider], stages cached engines into the ORT TRT EP cache directory via symlink-or-copy, warms up with one session.run after construction, and honours config.inference.ort_disallow_cpu_fallback by raising EngineDeserializeError when the active provider resolves to CPU; infer emits a one-shot c7.fallback_to_onnx_trt_ep WARN log plus gcs_alert callback on first call when is_fallback=True; release_engine is idempotent. _build_provider_args is the single point that pins TRT EP option-key names (Risk-3) and caps trt_max_workspace_size at gpu_memory_budget_bytes // 4 (AC-8).
- config.py: adds ort_trt_cache_dir (validated non-empty) and ort_disallow_cpu_fallback to C7InferenceConfig.
- fdr_client/records.py: adds c7.fallback_to_onnx_trt_ep and c7.cpu_fallback FDR record kinds.

Tests
- test_onnx_trt_ep_runtime.py: covers AC-1..AC-8 + Risk-2 CPU-fallback alert + Risk-3 option-key pin + NFR-reliability error rewrap; Tier-1 via fake ORT session; Tier-2 placeholders skip on macOS dev for numerical FP16 comparison and session-creation perf NFR.
- test_protocol_conformance.py: drops onnx_trt_ep from the missing-module parametrize now that the module ships.
- test_az272_fdr_record_schema.py: extends per-kind fixture builder to cover the two new C7 FDR kinds in the roundtrip / schema-version AC tests.

Docs
- module-layout.md: replaces the pending onnx_trt_runtime row with the shipped onnx_trt_ep_runtime row + capabilities list.
- batch_32_cycle1_report.md + reviews/batch_32_review.md: full batch + self-review (PASS_WITH_WARNINGS, 4 Low findings accepted).

Tests run: c7_inference 139 passing + 17 Tier-2 skips; combined unit suite (excluding pending components) 529 passing, 19 env-skipped.

Co-authored-by: Cursor <cursoragent@cursor.com>
Batch 32 / Cycle 1 — Implementation Report
Date: 2026-05-12

Tasks: AZ-299 (C7 OnnxTrtEpRuntime: ONNX Runtime + TensorRT EP fallback strategy + per-flight ORT TRT subgraph cache + one-shot fallback alert + CPU-fallback gate)

Story points landed: 3

Status: complete (AZ-299 → In Testing)
Scope summary
Single-task batch landing the fallback InferenceRuntime strategy
for C7. OnnxTrtEpRuntime owns the ONNX Runtime + TensorRT EP path
that satisfies C7-IT-05: when the TRT-direct strategy (AZ-298) cannot
deserialise the cached engine for a given model, or when the operator
explicitly selects ORT, the system stays in the air at degraded
latency rather than dropping the request. The runtime conforms to the
same AZ-297 Protocol as TensorrtRuntime and PytorchFp16Runtime,
so the composition root can wire it as either the primary strategy
or as the fallback target.
The fallback semantics required by AC-5 and Risk-2 are captured by two new FDR record kinds (extending AZ-272):
- c7.fallback_to_onnx_trt_ep: fired once per session when a runtime constructed with is_fallback=True serves its first infer. Carries {model_name, reason, active_provider}.
- c7.cpu_fallback: fired at deserialise time when ORT's provider fallback chain settled on CPUExecutionProvider (TRT EP refused AND CUDA EP refused or unavailable). Carries {model_name, requested_providers, active_provider}.

The composition root can install a hard-refusal hook by setting config.inference.ort_disallow_cpu_fallback = True; the default remains "warn but allow" so a misconfigured Jetson serves results (slowly) rather than hard-failing the flight.
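The once-per-session behaviour of c7.fallback_to_onnx_trt_ep can be illustrated with a small latch. This is a minimal sketch, not the shipped code: the class name OneShotFallbackAlert, the method maybe_fire, and the logger name are hypothetical, and only the record kind and payload fields come from the report.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("c7_inference.sketch")  # hypothetical logger name


class OneShotFallbackAlert:
    """Emit the fallback WARN log and GCS alert at most once per runtime."""

    def __init__(self, is_fallback: bool,
                 gcs_alert: Optional[Callable[[dict], None]] = None) -> None:
        self._is_fallback = is_fallback
        self._gcs_alert = gcs_alert
        self._fired = False

    def maybe_fire(self, model_name: str, reason: str, active_provider: str) -> bool:
        """Return True only on the call that actually emitted the alert."""
        if not self._is_fallback or self._fired:
            return False
        # Latch before emitting so a raising callback cannot re-arm the alert.
        self._fired = True
        payload = {
            "model_name": model_name,
            "reason": reason,
            "active_provider": active_provider,
        }
        logger.warning("c7.fallback_to_onnx_trt_ep %s", payload)
        if self._gcs_alert is not None:
            self._gcs_alert(payload)
        return True
```

The latch is a plain boolean, so this sketch assumes infer is called from a single thread; a concurrent caller would need a lock around the check-and-set.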
ORT, NumPy, and pycuda (used only for release_engine cleanup hints)
are lazy-imported inside the methods that need them; the module
loads cleanly on Tier-0 / macOS dev hosts so the package's
protocol-conformance tests stay importable without GPU. ORT version
is pinned at the project default; this task does NOT introduce any
new third-party dependency. The TRT EP cache directory comes from
config.inference.ort_trt_cache_dir (new field, defaults to
/var/lib/gps-denied/engines/ort_trt_cache) and is intentionally a
sibling of the TRT-direct engine_cache_dir so C12 operator tooling
can clean both on flight end via a single sweep.
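The lazy-import arrangement described above might look roughly like the following. It is a sketch under assumptions: load_optional, OrtUnavailableError, and LazyRuntimeSketch are hypothetical names (the report only mentions a _load_ort hook), and the real runtime builds its session with provider arguments that are omitted here.

```python
import importlib
from types import ModuleType


class OrtUnavailableError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


def load_optional(module_name: str) -> ModuleType:
    """Import a heavy dependency at call time instead of at module import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise OrtUnavailableError(
            f"{module_name} is not installed on this host"
        ) from exc


class LazyRuntimeSketch:
    def current_runtime_label(self) -> str:
        # Touches no heavy dependency, so conformance tests can call it anywhere.
        return "onnx_trt_ep"

    def deserialize_engine(self, onnx_path: str):
        # onnxruntime is only imported when a session is actually requested.
        ort = load_optional("onnxruntime")
        return ort.InferenceSession(onnx_path)
```

Importing the module that defines LazyRuntimeSketch succeeds on a host without onnxruntime; the failure is deferred to the first deserialize_engine call.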
Files added / modified
New (production)
src/gps_denied_onboard/components/c7_inference/onnx_trt_ep_runtime.py:
- OnnxTrtEpEngineHandle: opaque, slots; owns the ORT session + cached output names + model_name + _released flag.
- _iso_ts_now: local helper for FDR timestamps (kept component-local rather than reaching across layering; see Findings #1).
- _ort_dtype_to_numpy: single point that maps the ORT type strings back to NumPy dtypes, isolating the version-fragile mapping for Risk-3.
- _build_provider_args: single place that constructs the TRT EP option dict; pins the option-key names for the Risk-3 unit test.
- _stage_engine_for_ort: symlinks or copies a cached .engine into the ORT TRT EP cache directory at the path ORT expects.
- OnnxTrtEpRuntime class with:
  - compile_engine: no-op returning an EngineCacheEntry pointing at the source .onnx.
  - deserialize_engine: gate-first when the entry is a .engine, skip-gate when .onnx; provider list [TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider]; stages the cached engine for the EP; warm-up session.run after construction; one-shot c7.cpu_fallback alert when the active provider is CPU; honours ort_disallow_cpu_fallback by raising EngineDeserializeError before any work happens on the CPU path.
  - infer: sync session.run with named inputs / outputs; the first call on is_fallback=True runtimes fires exactly one c7.fallback_to_onnx_trt_ep WARN log + gcs_alert callback; ORT-internal exceptions are rewrapped to InferenceError with __cause__ preserved.
  - release_engine: idempotent; drops the session reference and marks the handle released; a second call is a silent no-op.
  - thermal_state: delegates to the injected ThermalStatePublisher.
  - current_runtime_label() -> "onnx_trt_ep".
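To make the _build_provider_args contract concrete, here is a minimal sketch of a provider list with the four pinned option keys and the quarter-budget workspace cap. The function name build_provider_args, its signature, and the fp16 default are assumptions for illustration; the key names, provider order, and the // 4 cap are taken from this report.

```python
from typing import Any, Dict, List, Tuple, Union

# The four TRT EP option keys the report pins for Risk-3.
PINNED_TRT_EP_KEYS = frozenset({
    "trt_engine_cache_enable",
    "trt_engine_cache_path",
    "trt_max_workspace_size",
    "trt_fp16_enable",
})

Provider = Union[str, Tuple[str, Dict[str, Any]]]


def build_provider_args(cache_dir: str,
                        gpu_memory_budget_bytes: int,
                        fp16: bool = True) -> List[Provider]:
    """Return a providers list in the order TRT EP, CUDA EP, CPU EP."""
    if gpu_memory_budget_bytes <= 0:
        raise ValueError("gpu_memory_budget_bytes must be positive")
    trt_options: Dict[str, Any] = {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
        # AC-8: the workspace is capped at a quarter of the GPU budget.
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
        "trt_fp16_enable": fp16,  # default value is an assumption of this sketch
    }
    if set(trt_options) != PINNED_TRT_EP_KEYS:
        raise RuntimeError("TRT EP option keys drifted from the pinned set")
    return [
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
```

A list of this shape (plain names mixed with name/options tuples) is what ONNX Runtime's InferenceSession accepts for its providers argument, which is why keeping the dict construction in one function makes the key names easy to pin in a unit test.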
New (tests)
tests/unit/c7_inference/test_onnx_trt_ep_runtime.py: NEW suite covering every AC + the two risks:
- AC-1 protocol conformance + label string match.
- AC-2 deserialise from .onnx does NOT call EngineGate.validate; session is built with the TRT EP at the head of the provider list; warm-up session.run runs exactly once.
- AC-3 deserialise from .engine whose filename schema mismatches the host: EngineGate.validate raises before any ORT session creation, verified by monkey-patching _load_ort to raise AssertionError on any call.
- AC-4 infer round-trips through the fake ORT session with named inputs / outputs; the returned dict matches the Protocol shape. (The "numerical comparison against TRT-direct within FP16 tolerance" half of AC-4 lives in the Tier-2 microbench harness; placeholder skip in the same file.)
- AC-5 first infer with is_fallback=True emits exactly one c7.fallback_to_onnx_trt_ep WARN log AND invokes the gcs_alert callback once; second infer is silent on both channels; is_fallback=False never emits.
- AC-6 forcing TRT EP to refuse (the fake ORT reports only CUDAExecutionProvider and CPUExecutionProvider as successfully loaded) creates the session with CUDA EP as the active provider; an INFO log records the actual provider in use; current_runtime_label() still returns "onnx_trt_ep".
- AC-7 release_engine called twice: first drops the session reference and marks released; second is a silent no-op; foreign handle types silently ignored (defensive shim consistent with TensorrtRuntime).
- AC-8 _build_provider_args sets trt_max_workspace_size to gpu_memory_budget_bytes // 4; the provider option dict contains exactly the keys {trt_engine_cache_enable, trt_engine_cache_path, trt_max_workspace_size, trt_fp16_enable} (Risk-3 pin).
- Risk-2 CPU fallback emits exactly one c7.cpu_fallback FDR record at deserialise time; with ort_disallow_cpu_fallback=True the runtime instead raises EngineDeserializeError before any session work.
- NFR-reliability ORT-internal RuntimeError raised inside session.run is rewrapped as InferenceError with __cause__ preserved; infer also rejects foreign handle types and released handles.
- Tier-2 placeholders: numerical FP16 comparison against TRT-direct (AC-4 tail), session-creation perf NFR (≤ 30 s p95 first / ≤ 5 s p95 with EP cache hot), and real-EP CPU-fallback under TRT-version-mismatch; all marked @pytest.mark.tier2 and skipped on Tier-1 / macOS dev.
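The NFR-reliability rewrap and the fake-session testing style listed above can be sketched together. Everything here is illustrative: InferenceError is a stand-in for the project's own exception, and FakeSession and infer_named are hypothetical names rather than the suite's actual _FakeOrtSession or the runtime's infer.

```python
from typing import Any, Dict, List


class InferenceError(RuntimeError):
    """Stand-in for the project's InferenceError (assumption of this sketch)."""


class FakeSession:
    """Tiny fake with the same run(output_names, input_feed) shape as an ORT session."""

    def __init__(self, fail: bool = False) -> None:
        self._fail = fail

    def run(self, output_names: List[str], input_feed: Dict[str, Any]) -> List[Any]:
        if self._fail:
            raise RuntimeError("simulated ORT-internal failure")
        # Echo the single input back once per requested output name.
        return [input_feed["x"] for _ in output_names]


def infer_named(session: Any, output_names: List[str],
                feeds: Dict[str, Any]) -> Dict[str, Any]:
    """Run the session and map outputs back to their names."""
    try:
        raw = session.run(output_names, feeds)
    except Exception as exc:
        # "raise ... from exc" keeps the original error reachable via __cause__.
        raise InferenceError(f"session.run failed: {exc}") from exc
    return dict(zip(output_names, raw))
```

Preserving __cause__ means callers handle a single exception type while the original ORT traceback remains available in logs.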
Modified (production)
- src/gps_denied_onboard/components/c7_inference/config.py: adds ort_trt_cache_dir: str = "/var/lib/gps-denied/engines/ort_trt_cache" (validated non-empty in __post_init__) and ort_disallow_cpu_fallback: bool = False to C7InferenceConfig. The CPU-fallback gate intentionally defaults to "warn but allow" to honour the architecture's "keep flying" principle; the operator opts INTO hard-refusal when latency budgets matter more than service continuity.
- src/gps_denied_onboard/fdr_client/records.py: adds two new FdrRecord kinds (c7.fallback_to_onnx_trt_ep and c7.cpu_fallback) with their required field sets, following the existing pattern for c6.write_failed / c6.freshness.*.
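The two new config fields and the "warn but allow" versus hard-refusal choice can be sketched as below. This is an illustration under assumptions: C7InferenceConfigSketch carries only the two new fields, EngineDeserializeError is a stand-in class, should_emit_cpu_fallback is a hypothetical helper, and treating a whitespace-only path as empty is a choice of this sketch rather than something the report states.

```python
from dataclasses import dataclass


class EngineDeserializeError(RuntimeError):
    """Stand-in for the project's EngineDeserializeError (assumption)."""


@dataclass(frozen=True)
class C7InferenceConfigSketch:
    ort_trt_cache_dir: str = "/var/lib/gps-denied/engines/ort_trt_cache"
    ort_disallow_cpu_fallback: bool = False

    def __post_init__(self) -> None:
        # Whitespace-only counts as empty here; that is this sketch's choice.
        if not self.ort_trt_cache_dir.strip():
            raise ValueError("ort_trt_cache_dir must be non-empty")


def should_emit_cpu_fallback(config: C7InferenceConfigSketch,
                             active_provider: str) -> bool:
    """Return True when a c7.cpu_fallback record should be emitted.

    Raises EngineDeserializeError instead when the operator opted into
    hard refusal.
    """
    if active_provider != "CPUExecutionProvider":
        return False
    if config.ort_disallow_cpu_fallback:
        raise EngineDeserializeError(
            "ORT settled on CPUExecutionProvider and CPU fallback is disallowed"
        )
    return True
```

With the default config the helper only signals that an alert is due, which matches the "keep flying" default; flipping ort_disallow_cpu_fallback turns the same condition into a hard failure.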
Modified (tests)
- tests/unit/c7_inference/test_protocol_conformance.py: the test_ac5_build_inference_runtime_flag_on_but_module_missing parametrization previously excluded only {"pytorch_fp16", "tensorrt"}; now that onnx_trt_ep_runtime.py exists the set is {"pytorch_fp16", "tensorrt", "onnx_trt_ep"}. The test body and parametrize structure are kept intact so the factory's missing-module branch stays under test for any future strategy whose BUILD_* flag is wired in inference_factory._RUNTIME_TO_MODULE ahead of its module landing.
- tests/unit/test_az272_fdr_record_schema.py: extends the per-kind fixture builder with deterministic payloads for c7.fallback_to_onnx_trt_ep and c7.cpu_fallback so the AZ-272 roundtrip / schema-version / unknown-kind tests cover the new kinds the same way they cover the C6 kinds.
Modified (docs)
- _docs/02_document/module-layout.md: the onnx_trt_runtime.py (ONNX Runtime + TensorRT EP, pending) row in the c7_inference per-component table now reads onnx_trt_ep_runtime.py (AZ-299; ONNX Runtime + TensorRT EP fallback strategy + per-flight ORT TRT subgraph cache + one-shot fallback WARN/FDR/GCS alert + CPU-fallback gate). The filename shift from onnx_trt_runtime.py (task spec body) to onnx_trt_ep_runtime.py (shipped) follows inference_factory._RUNTIME_TO_MODULE, which is the authoritative factory wiring: the task spec's "Outcome" body had a typo that contradicted its own "label" wording ("onnx_trt_ep"). The factory wins.
Acceptance criteria coverage
| AC | Test | Status |
|---|---|---|
| AC-1 Protocol conformance + label | test_ac1_protocol_conformance | passing |
| AC-2 Deserialise from .onnx skips the gate | test_ac2_deserialize_from_onnx_skips_gate | passing |
| AC-3 Deserialise from .engine invokes the gate | test_ac3_deserialize_from_engine_invokes_gate_and_skips_session_on_refusal | passing |
| AC-4 infer round-trips through ORT (named outputs) | test_ac4_infer_round_trips_named_outputs (Tier-1) + Tier-2 numerical FP16 comparison placeholder | passing / Tier-2 skipped |
| AC-5 Fallback WARN log fires once on first infer | test_ac5_first_infer_with_is_fallback_emits_warn_and_alert_once + test_ac5_not_fallback_never_emits | passing |
| AC-6 Provider fallback chain respects ORT order | test_ac6_trt_ep_refused_falls_through_to_cuda_ep | passing |
| AC-7 release_engine idempotent | test_ac7_release_is_idempotent + test_release_engine_ignores_foreign_handle_type | passing |
| AC-8 Workspace budget respected | test_ac8_provider_options_pin_keys_and_budget_quarter | passing |
| Risk-2 CPU fallback signalled | test_risk2_cpu_fallback_emits_fdr_kind + test_risk2_cpu_fallback_with_disallow_raises | passing |
| Risk-3 TRT EP option-key pin | test_ac8_provider_options_pin_keys_and_budget_quarter (shared) | passing |
| NFR-perf-session-create p95 ≤ 30 s / ≤ 5 s cache hot | test_nfr_perf_session_create_first_under_30s_cache_hot_under_5s (Tier-2 microbench) | Tier-2 skipped |
| NFR-reliability-error-rewrap | test_nfr_reliability_infer_rewraps_runtime_error + test_infer_rejects_foreign_handle + test_infer_rejects_released_handle | passing |
AC Test Coverage: 8 of 8 covered (+ 2 risks + 2 NFRs)
Code Review Verdict: PASS_WITH_WARNINGS (4 Low accepted; see Findings)
Auto-Fix Attempts: 0
Stuck Agents: None
Findings (self-review)
| # | Severity | Category | Location | Note | Resolution |
|---|---|---|---|---|---|
| 1 | Low | Maintainability | onnx_trt_ep_runtime.py::_iso_ts_now | Duplicated from the equivalent helper in tensorrt_runtime.py / fdr_client. Consolidating into a shared helper would either inflate fdr_client/records.py (which is the lowest-layer module the c7 strategies depend on) or carve out a new shared utility module just for one one-liner. Kept component-local; a later hygiene pass can extract the helper alongside the existing shared _types/ move when more components grow ISO-timestamp call sites. | Open (Low): accepted; the c7 layering rule wins. |
| 2 | Low | Test-quality | test_ac4_infer_round_trips_named_outputs | Uses a _FakeOrtSession whose run(...) returns canned arrays in the declared output order. The named-output mapping assertion is verified at the Protocol layer; the numerical FP16 comparison against TRT-direct lives in the Tier-2 microbench harness. | Open (Low): Tier-2 placeholder owns the numerical half. |
| 3 | Low | Architecture | onnx_trt_ep_runtime.py::_stage_engine_for_ort | Attempts symlink first, falls back to copy on OSError (e.g., crossing a filesystem boundary, or running on a host that disallows symlinks for the running user). The copy path leaves a stale binary in the EP cache directory if the staging fails partway; C12's per-flight cache cleanup handles this, and a torn copy on disk is no worse than a stale subgraph. | Open (Low): accepted as documented; C12 owns cleanup. |
| 4 | Low | Test-coverage | AC-3 schema-mismatch path | The test patches EngineGate.validate to raise EngineSchemaMismatchError; the real gate's filename-schema parser is exercised by AZ-281 / AZ-301 tests. Wiring this runtime to a real (live) gate would duplicate that coverage at the wrong layer. | Open (Low): accepted; AZ-301 owns the parser. |
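The symlink-then-copy behaviour discussed in finding 3 can be sketched as follows. It is an illustration only: stage_engine is a hypothetical stand-in for _stage_engine_for_ort, and reusing the source file name as the target name is an assumption, since the report does not spell out the path ORT expects.

```python
import os
import shutil
from pathlib import Path


def stage_engine(cached_engine: Path, ep_cache_dir: Path) -> Path:
    """Make a cached engine visible inside the EP cache directory."""
    if not cached_engine.is_file():
        raise FileNotFoundError(cached_engine)
    ep_cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the staged file keeps the source file name.
    target = ep_cache_dir / cached_engine.name
    # Clear a previous staging result, including a dangling symlink.
    if target.exists() or target.is_symlink():
        target.unlink()
    try:
        # Preferred path: a symlink costs no extra disk space.
        os.symlink(cached_engine.resolve(), target)
    except OSError:
        # Fallback: a plain copy, which is not atomic and can be torn if it
        # is interrupted partway (the case finding 3 accepts).
        shutil.copy2(cached_engine, target)
    return target
```

Calling the function twice for the same engine replaces the earlier link or copy, so repeated deserialisation within a flight leaves a single entry per engine in the EP cache directory.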
Tracker
- AZ-299 transitioned to In Progress at session start; will move
to In Testing post-commit per
protocols.md.
Test suite
- tests/unit/c7_inference/test_onnx_trt_ep_runtime.py: all active tests passing, Tier-2 placeholders skipped on macOS dev (no ORT/CUDA binding).
- tests/unit/c7_inference/ (full c7 suite): 139 passing, 17 skipped (CUDA / TensorRT / ORT unavailable on Tier-1 / macOS).
- tests/unit/test_az272_fdr_record_schema.py: 34 passing (the two new C7 kinds now covered by every roundtrip / schema-version test).
- Combined unit suite excluding pending components (c1, c2, c2.5, c3, c3.5, c4, c5, c8, c10, c11, c12) and the c6 collection blocker on this host (missing psycopg_pool is a known dev-machine env issue, pre-existing): 529 passing, 19 environment-skipped, 1 warning (pre-existing pynvml FutureWarning unrelated to AZ-299).
Next batch
Cycle 1 advances per the greenfield queue — autodev re-detects the
next AZ ticket in the Step 7 batch loop. With C7's three concrete
strategies now landed (AZ-298 / AZ-299 / AZ-300), the remaining
C7 work is AZ-301 c7_engine_gate (already in done/) +
AZ-302 c7_thermal_publisher (already in done/); the next
ticket in dependency order is the first item in the queue that
doesn't depend on a pending earlier task — autodev will compute
that during the next sub-step.