[AZ-299] C7 OnnxTrtEpRuntime: ORT + TRT EP fallback strategy

Land the fallback InferenceRuntime strategy that satisfies C7-IT-05:
when the TRT-direct path (AZ-298) cannot deserialise a cached engine
or when the operator explicitly selects ORT, the system stays in the
air at degraded latency rather than dropping the request. Conforms to
the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep".
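
The selection behaviour described above can be sketched roughly as follows. This is an illustration only: `select_runtime`, `Selection`, the two fake runtimes, the `"trt_direct"` label, and the local `EngineDeserializeError` are hypothetical stand-ins, and treating an explicit operator choice as `is_fallback=True` is an assumption based on the `composition_root_explicit` reason tag.

```python
from dataclasses import dataclass


class EngineDeserializeError(RuntimeError):
    """Stand-in for the project's error type (hypothetical definition)."""


@dataclass
class Selection:
    runtime_label: str
    is_fallback: bool
    reason: str


class FakeTrtRuntime:
    """Hypothetical TRT-direct runtime that can be told to fail."""

    def __init__(self, fail: bool) -> None:
        self.fail = fail

    def deserialize_engine(self, path: str) -> None:
        if self.fail:
            raise EngineDeserializeError(f"cannot deserialise {path}")

    def current_runtime_label(self) -> str:
        return "trt_direct"  # hypothetical label, not from the commit


class FakeOrtRuntime:
    """Hypothetical ORT + TRT EP runtime that always loads."""

    def deserialize_engine(self, path: str) -> None:
        return None

    def current_runtime_label(self) -> str:
        return "onnx_trt_ep"


def select_runtime(primary, fallback, engine_path: str,
                   force_ort: bool = False) -> Selection:
    """Try the primary runtime; degrade to the fallback instead of failing."""
    if force_ort:
        fallback.deserialize_engine(engine_path)
        return Selection(fallback.current_runtime_label(), True,
                         "composition_root_explicit")
    try:
        primary.deserialize_engine(engine_path)
        return Selection(primary.current_runtime_label(), False, "")
    except EngineDeserializeError:
        # The cached engine could not be loaded: keep serving via ORT.
        fallback.deserialize_engine(engine_path)
        return Selection(fallback.current_runtime_label(), True,
                         "trt_runtime_deserialize_failed")
```

In this sketch the caller would pass `is_fallback` and `reason` on to the fallback runtime so that its first `infer` can report why it is active.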

Production
- onnx_trt_ep_runtime.py:
  - compile_engine is a no-op returning an EngineCacheEntry pointing
    at the source .onnx.
  - deserialize_engine is gate-first for .engine entries and gate-skip
    for .onnx. It builds an ORT InferenceSession with the provider
    list [TensorrtExecutionProvider, CUDAExecutionProvider,
    CPUExecutionProvider], stages cached engines into the ORT TRT EP
    cache directory via symlink-or-copy, warms up with one session.run
    after construction, and honours
    config.inference.ort_disallow_cpu_fallback by raising
    EngineDeserializeError when the active provider resolves to CPU.
  - infer emits a one-shot c7.fallback_to_onnx_trt_ep WARN log plus a
    gcs_alert callback on the first call when is_fallback=True.
  - release_engine is idempotent.
  - _build_provider_args is the single point that pins TRT EP
    option-key names (Risk-3) and caps trt_max_workspace_size at
    gpu_memory_budget_bytes // 4 (AC-8).
- config.py: adds ort_trt_cache_dir (validated non-empty) and
  ort_disallow_cpu_fallback to C7InferenceConfig.
- fdr_client/records.py: adds c7.fallback_to_onnx_trt_ep and
  c7.cpu_fallback FDR record kinds.
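
A minimal sketch of what a single provider-argument builder with the AC-8 cap could look like. The function name, its parameters, and the choice of `trt_engine_cache_enable` / `trt_engine_cache_path` are assumptions for illustration; only `trt_max_workspace_size`, the provider order, and the `// 4` cap come from this commit message. The option keys used are existing ONNX Runtime TensorRT EP provider options.

```python
from typing import Any

# Provider order named in the commit message.
PROVIDER_ORDER = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def build_provider_args(
    cache_dir: str,
    gpu_memory_budget_bytes: int,
    requested_workspace_bytes: int,
) -> list[tuple[str, dict[str, Any]]]:
    """Return (provider, options) pairs in the form ORT accepts for `providers`."""
    if not cache_dir:
        # Mirrors the "validated non-empty" rule for ort_trt_cache_dir.
        raise ValueError("ort_trt_cache_dir must be non-empty")

    # AC-8: never hand TensorRT more than a quarter of the GPU budget.
    workspace_cap = gpu_memory_budget_bytes // 4

    trt_options: dict[str, Any] = {
        # Keeping every option-key string in one place means a renamed
        # key only has to be fixed here (the Risk-3 concern).
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
        "trt_max_workspace_size": min(requested_workspace_bytes, workspace_cap),
    }
    return [
        (PROVIDER_ORDER[0], trt_options),
        (PROVIDER_ORDER[1], {}),
        (PROVIDER_ORDER[2], {}),
    ]
```

The returned list could then be passed as `providers=` when constructing an `onnxruntime.InferenceSession`; the sketch itself does not import ORT, so it runs on a machine without a GPU.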

Tests
- test_onnx_trt_ep_runtime.py: covers AC-1..AC-8 + Risk-2 CPU-fallback
  alert + Risk-3 option-key pin + NFR-reliability error rewrap. Tier-1
  tests run against a fake ORT session; the Tier-2 placeholders
  (numerical FP16 comparison and session-creation perf NFR) skip on
  macOS dev machines.
- test_protocol_conformance.py: drops onnx_trt_ep from the
  missing-module parametrization now that the module ships.
- test_az272_fdr_record_schema.py: extends per-kind fixture builder
  to cover the two new C7 FDR kinds in the roundtrip / schema-version
  AC tests.
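
For illustration, a Tier-1 style check of the CPU-fallback gate against a fake session might look like the sketch below. `FakeOrtSession`, `check_active_provider`, the callback, and the local `EngineDeserializeError` are hypothetical names; the only ORT behaviour relied on is that `InferenceSession.get_providers()` returns the active providers in priority order. Invoking the alert before raising is an assumption.

```python
class EngineDeserializeError(RuntimeError):
    """Stand-in for the project's error type (hypothetical definition)."""


class FakeOrtSession:
    """Fake that mimics only InferenceSession.get_providers()."""

    def __init__(self, active_providers):
        self._providers = list(active_providers)

    def get_providers(self):
        return list(self._providers)


def check_active_provider(session, disallow_cpu_fallback, on_cpu_fallback=None):
    """Return the active provider; alert, and optionally fail, when it is CPU."""
    active = session.get_providers()[0]
    if active == "CPUExecutionProvider":
        if on_cpu_fallback is not None:
            on_cpu_fallback(active)  # e.g. emit a c7.cpu_fallback record
        if disallow_cpu_fallback:
            raise EngineDeserializeError(
                "TRT and CUDA EPs unavailable; CPU fallback is disallowed"
            )
    return active


def test_cpu_fallback_rejected_when_disallowed():
    alerts = []
    session = FakeOrtSession(["CPUExecutionProvider"])
    try:
        check_active_provider(session, True, alerts.append)
    except EngineDeserializeError:
        pass
    else:
        raise AssertionError("expected EngineDeserializeError")
    assert alerts == ["CPUExecutionProvider"]


def test_gpu_provider_passes_silently():
    alerts = []
    session = FakeOrtSession(["CUDAExecutionProvider", "CPUExecutionProvider"])
    assert check_active_provider(session, True, alerts.append) == "CUDAExecutionProvider"
    assert alerts == []
```

Because the fake needs neither ORT nor a GPU, tests of this shape can run on any development machine, which is presumably why the Tier-1 split exists.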

Docs
- module-layout.md: replaces the pending onnx_trt_runtime row with
  the shipped onnx_trt_ep_runtime row + capabilities list.
- batch_32_cycle1_report.md + reviews/batch_32_review.md: full batch
  report + self-review (PASS_WITH_WARNINGS, 4 Low findings accepted).

Tests run: c7_inference 139 passing + 17 Tier-2 skips; combined unit
suite (excluding pending components) 529 passing, 19 env-skipped.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-12 23:55:50 +03:00
parent 18a69022b3
commit 0ad3278b12
11 changed files with 1849 additions and 8 deletions
@@ -158,6 +158,29 @@ KNOWN_PAYLOAD_KEYS: Final[dict[str, frozenset[str]]] = {
"c6.eviction_batch": frozenset(
{"trigger_tile_id", "freed_bytes", "evicted_count", "evicted_tile_ids"}
),
# AZ-299 / E-C7: emitted exactly once per OnnxTrtEpRuntime instance, on
# the FIRST ``infer`` call when the runtime was wired as a fallback
# (composition root sets ``is_fallback=True``). Subsequent infers are
# silent on this kind. ``model_name`` is the engine_path stem;
# ``reason`` is a short tag explaining why the runtime is acting as
# fallback (e.g., ``"composition_root_explicit"``,
# ``"trt_runtime_deserialize_failed"``); ``active_provider`` is the
# first entry of ``InferenceSession.get_providers()`` so the operator
# sees which EP is actually serving the request.
"c7.fallback_to_onnx_trt_ep": frozenset(
{"model_name", "reason", "active_provider"}
),
# AZ-299 / E-C7 Risk-2: emitted by OnnxTrtEpRuntime on
# ``deserialize_engine`` when ORT's active provider after session
# creation is ``CPUExecutionProvider`` (i.e., both TRT and CUDA EPs
# refused). ``model_name`` is the engine_path stem;
# ``requested_providers`` is the provider-list ORT was asked to use
# (always ``["TensorrtExecutionProvider", "CUDAExecutionProvider",
# "CPUExecutionProvider"]`` this cycle); ``active_provider`` is
# ``"CPUExecutionProvider"`` (the record only fires for that branch).
"c7.cpu_fallback": frozenset(
{"model_name", "requested_providers", "active_provider"}
),
}
KNOWN_KINDS: Final[frozenset[str]] = frozenset(KNOWN_PAYLOAD_KEYS.keys())
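
To illustrate how the two new kinds might be used, here is a rough sketch of a once-per-instance emitter that checks its payload against the key sets above. `validate_payload` and `OneShotFallbackEmitter` are hypothetical helpers, the exact-match key check is an assumption about how the schema is enforced, and the dictionary below repeats only the two C7 entries from this hunk.

```python
from typing import Callable, Final

KNOWN_PAYLOAD_KEYS: Final[dict[str, frozenset[str]]] = {
    "c7.fallback_to_onnx_trt_ep": frozenset(
        {"model_name", "reason", "active_provider"}
    ),
    "c7.cpu_fallback": frozenset(
        {"model_name", "requested_providers", "active_provider"}
    ),
}

FALLBACK_KIND: Final[str] = "c7.fallback_to_onnx_trt_ep"


def validate_payload(kind: str, payload: dict) -> None:
    """Raise ValueError unless the payload has exactly the known keys."""
    expected = KNOWN_PAYLOAD_KEYS[kind]
    if set(payload) != expected:
        raise ValueError(
            f"{kind}: expected keys {sorted(expected)}, got {sorted(payload)}"
        )


class OneShotFallbackEmitter:
    """Emits the fallback record on the first infer only (hypothetical helper)."""

    def __init__(self, emit: Callable[[str, dict], None],
                 model_name: str, reason: str) -> None:
        self._emit = emit
        self._model_name = model_name
        self._reason = reason
        self._fired = False

    def on_infer(self, active_provider: str) -> bool:
        """Return True if a record was emitted by this call."""
        if self._fired:
            return False  # later infers stay silent on this kind
        payload = {
            "model_name": self._model_name,
            "reason": self._reason,
            "active_provider": active_provider,
        }
        validate_payload(FALLBACK_KIND, payload)
        self._emit(FALLBACK_KIND, payload)
        self._fired = True
        return True
```

Setting the flag only after a successful emit means a failing sink would be retried on the next call; whether the real runtime behaves that way is not stated in the commit.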