Decompose Step 6 snapshot: 140 task specs + contract docs

Closes out greenfield Step 6 (Decompose) for all 14 components (C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446 plus the _dependencies_table.md and component contract documents. State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 20:21:13 +00:00 · 2026-05-11 00:39:48 +03:00
parent 8171fcb29e
commit 880eabcb3f
172 changed files with 22897 additions and 35 deletions
@@ -0,0 +1,167 @@
+# C7 OnnxTrtEpRuntime — ONNX Runtime + TensorRT EP Fallback
+
+**Task**: AZ-299_c7_onnxrt_fallback
+**Name**: C7 OnnxTrtEpRuntime
+**Description**: Implement `OnnxTrtEpRuntime`, the fallback `InferenceRuntime` strategy that uses ONNX Runtime with the TensorRT execution provider. It is triggered by config selection or by operator escalation when the TRT-direct path (AZ-298) cannot deserialise the cached engine for a given model; it still produces correct results and emits a degraded-latency WARN log (covers C7-IT-05). It conforms to the same AZ-297 Protocol so the composition root can select either strategy at startup.
+**Complexity**: 3 points
+**Dependencies**: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
+**Component**: c7_inference (epic AZ-249 / E-C7)
+**Tracker**: AZ-299
+**Epic**: AZ-249 (E-C7)
+
+### Document Dependencies
+
+- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297.
+- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — used at deserialise time when reusing a cached `.engine` via ORT TRT EP.
+- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar trust check via the gate.
+- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that provides ORT cache dir + ONNX model paths.
+
+## Problem
+
+Two scenarios force a fallback off `TensorrtRuntime`:
+
+1. The cached `.engine` for a given model fails to deserialise on this Jetson (filename-schema mismatch outside the gate's tolerance, an op TRT 10.3 dropped between calibrator runs, or operator-driven engine rename) — the system must keep flying without dropping the request.
+2. The operator deliberately selects ORT + TRT EP via config (e.g., during a debugging session, or on a Tier-1 workstation where TRT-direct hard-fails).
+
+Without `OnnxTrtEpRuntime`:
+
+- C7-IT-05 (ONNX-RT fallback when TRT engine unavailable) has nothing to verify.
+- The "simple-baseline" engineering rule that every fancy strategy must have a working fallback is violated.
+- The ADR-001 "runtime selectable at startup" promise has only the production strategy and the PyTorch baseline; ORT is the middle ground users expect.
+
+## Outcome
+
+- An `OnnxTrtEpRuntime` class at `src/gps_denied_onboard/components/c7_inference/onnx_trt_runtime.py` conforming to the AZ-297 Protocol; `current_runtime_label() == "onnx_trt_ep"`.
+- `compile_engine` is a no-op for ORT — it returns an `EngineCacheEntry` whose `engine_path` is the underlying `.onnx` file path. ORT will lazy-compile a TRT subgraph in-session on first use; the EP cache directory holds those subgraph caches transparently.
+- `deserialize_engine(EngineCacheEntry) -> EngineHandle`: if the entry's `engine_path` is a `.engine` file, invoke AZ-301 EngineGate first (cached engine reuse via ORT TRT EP cache); if it is a `.onnx`, skip the gate and load the ONNX directly. In either case, build an `InferenceSession` with the TRT EP provider list `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]` (provider fallback chain) and warm up by running one zero-filled input through the session.
+- `infer(handle, inputs) -> outputs` calls `session.run(output_names, inputs)`; the call is synchronous; ORT manages its own CUDA stream internally.
+- `release_engine(handle)` calls `session.end_profiling()` and drops the session reference; ORT releases EP resources on garbage collection.
+- A degraded-latency WARN log (`kind="c7.fallback_to_onnx_trt_ep"`) is emitted ONCE, on the first `infer` call, when the runtime was selected as a fallback (signalled by the composition root via a constructor flag), so that the operator's post-flight FDR shows the fallback was used.
+- `thermal_state()` is delegated to the same `ThermalStatePublisher` reference used by `TensorrtRuntime` (AZ-302 owns the publisher).
+- ORT version pin matches the project's `requirements.txt` / equivalent; this task does NOT introduce a new ORT version.
+
+## Scope
+
+### Included
+
+- `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol.
+- `compile_engine`: returns an `EngineCacheEntry` pointing at the source `.onnx` file (no separate engine binary). The `.onnx`'s sha256 is computed via AZ-280 and stamped into the entry; the `(SM, JP, TRT, precision)` tuple is set to the host's running tuple (since ORT will lazy-compile per host).
+- `deserialize_engine`: provider list construction (`["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`), provider options (`trt_engine_cache_enable=True`, `trt_engine_cache_path=config.inference.ort_trt_cache_dir`, `trt_max_workspace_size=config.inference.gpu_memory_budget_bytes // 4`), session creation, single-shot warm-up. Returns `OnnxTrtEpEngineHandle` (`EngineHandle` subclass) wrapping the session.
+- Engine-cache reuse path: if `EngineCacheEntry.engine_path.suffix == ".engine"`, invoke `EngineGate.validate(entry)` first; the engine binary is then placed at the `trt_engine_cache_path` so ORT's TRT EP picks it up on session creation. If the gate refuses, re-raise the gate's error (there is no fallback to ONNX direct; the operator must explicitly switch to a different cache or restart with a different config).
+- ONNX direct path: if `engine_path.suffix == ".onnx"`, skip the gate (no engine to validate); ORT compiles the TRT subgraph in-session.
+- `infer`: `session.run(output_names, {name: ndarray for name, ndarray in inputs.items()})`; output is a dict matching the Protocol shape. Per-frame DEBUG log gated by config.
+- `release_engine`: idempotent session drop.
+- Constructor flag `is_fallback: bool = False` set by the composition root when this strategy is wired as the fallback (vs. selected directly). On first `infer`, if `is_fallback`, emit the `kind="c7.fallback_to_onnx_trt_ep"` WARN log + an FDR record (via `gcs_alert` in the AZ-291 sense — the actual GCS wire is C8's job; this task calls a constructor-injected callback).
+- Error envelope: `EngineBuildError` (ONNX validation failure), `EngineDeserializeError` (session creation failure), `InferenceError` (mid-flight ORT runtime exception), `OutOfMemoryError` (mid-session OOM), the gate's errors when applicable.
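The provider list and TRT EP options described above could be assembled by one small helper. The following is a minimal sketch, not the project's implementation: the function name `build_providers` and the example cache path are assumptions, while the option keys and provider names are the ones listed in this spec. It deliberately avoids importing ORT so that the option dict can be unit-tested on its own.

```python
from pathlib import Path
from typing import Any

TRT_EP = "TensorrtExecutionProvider"
CUDA_EP = "CUDAExecutionProvider"
CPU_EP = "CPUExecutionProvider"


def build_providers(cache_dir: Path, gpu_memory_budget_bytes: int) -> list[Any]:
    """Return the fixed provider chain with TRT EP options attached.

    Hypothetical helper: the (name, options) tuple is the form accepted by
    the `providers` argument of `onnxruntime.InferenceSession`.
    """
    trt_options = {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": str(cache_dir),
        # Workspace is capped at a quarter of the GPU memory budget (AC-8).
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
    }
    # Order is fixed: TRT EP first, then CUDA EP, then CPU EP as the last resort.
    return [(TRT_EP, trt_options), CUDA_EP, CPU_EP]


# Example values only: a 4 GiB budget yields a 1 GiB workspace cap.
providers = build_providers(Path("/tmp/ort_trt_cache"), 4 * 1024**3)
```

The result would typically be passed as `onnxruntime.InferenceSession(model_path, providers=providers)`. Keeping the dict behind one function also gives the option-key pinning test from Risk 3 a single place to assert against.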
+
+### Excluded
+
+- AZ-298 TensorrtRuntime (production-default) — separate task.
+- AZ-300 PytorchFp16Runtime — separate task.
+- AZ-301 EngineGate validation logic — this task INVOKES the gate.
+- AZ-302 ThermalState polling — this task delegates.
+- ORT version upgrade or new ORT dependency — pinned to the project's existing version.
+- Custom ORT EPs (e.g., TVM, OpenVINO) — out of scope this cycle.
+- Multi-session pooling — one session per `EngineHandle` this cycle.
+- Engine cache directory cleanup — operator-initiated; handled by C12 operator tooling.
+
+## Acceptance Criteria
+
+**AC-1: Protocol conformance**
+Given `InferenceRuntime` is decorated with `@runtime_checkable`
+When `isinstance(OnnxTrtEpRuntime(...), InferenceRuntime)` is evaluated
+Then result is `True`; `current_runtime_label() == "onnx_trt_ep"`
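As an illustration of what AC-1 checks, here is a minimal sketch using `typing.Protocol`. The `InferenceRuntime` class below is a stand-in for the AZ-297 contract: the method names come from this spec, but the signatures and the placeholder bodies are assumptions.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class InferenceRuntime(Protocol):
    # Stand-in for the AZ-297 Protocol; the real signatures live in the contract doc.
    def compile_engine(self, model_path: Any) -> Any: ...
    def deserialize_engine(self, entry: Any) -> Any: ...
    def infer(self, handle: Any, inputs: dict) -> dict: ...
    def release_engine(self, handle: Any) -> None: ...
    def thermal_state(self) -> Any: ...
    def current_runtime_label(self) -> str: ...


class OnnxTrtEpRuntime:
    """Skeleton only: every body except the label is a placeholder."""

    def compile_engine(self, model_path: Any) -> Any:
        raise NotImplementedError

    def deserialize_engine(self, entry: Any) -> Any:
        raise NotImplementedError

    def infer(self, handle: Any, inputs: dict) -> dict:
        raise NotImplementedError

    def release_engine(self, handle: Any) -> None:
        raise NotImplementedError

    def thermal_state(self) -> Any:
        raise NotImplementedError

    def current_runtime_label(self) -> str:
        return "onnx_trt_ep"


conforms = isinstance(OnnxTrtEpRuntime(), InferenceRuntime)
```

Note that `isinstance` against a `runtime_checkable` Protocol only verifies that the named methods exist; it does not check signatures or return types, so the remaining ACs still need behavioural tests.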
+
+**AC-2: deserialize from `.onnx` skips the gate**
+Given an `EngineCacheEntry(engine_path=<onnx_path>)`
+When `deserialize_engine(entry)` is called
+Then `EngineGate.validate` is NOT called; an ORT `InferenceSession` is created with the TRT EP at the head of the provider list; a single warm-up `session.run` succeeds
+
+**AC-3: deserialize from `.engine` invokes the gate**
+Given an `EngineCacheEntry(engine_path=<engine_path>)` whose filename schema mismatches the host
+When `deserialize_engine(entry)` is called
+Then `EngineGate.validate` is invoked first and raises `EngineSchemaMismatchError`; no `InferenceSession` is created
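The suffix dispatch exercised by AC-2 and AC-3 can be sketched as below. This is an illustration under stated assumptions: `EngineCacheEntry` is reduced to the single field used here, `resolve_model_path` is a hypothetical helper name, the gate is passed in as a plain callable, and rejecting unknown suffixes with `ValueError` is an added assumption that the spec does not state.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


class EngineSchemaMismatchError(Exception):
    """Stand-in for the gate error named in AC-3."""


@dataclass(frozen=True)
class EngineCacheEntry:
    # Reduced stand-in: the real entry also carries the sha256 and host tuple.
    engine_path: Path


def resolve_model_path(
    entry: EngineCacheEntry,
    gate_validate: Callable[[EngineCacheEntry], None],
) -> Path:
    """Run the gate for `.engine` entries and skip it for `.onnx` entries."""
    suffix = entry.engine_path.suffix
    if suffix == ".engine":
        # The gate raises on refusal; the error propagates and no session is built.
        gate_validate(entry)
    elif suffix != ".onnx":
        # Assumption: anything else is treated as a caller error.
        raise ValueError(f"unsupported engine_path suffix: {suffix!r}")
    return entry.engine_path
```

A test can pass a recording callable as `gate_validate` to assert that the gate was or was not invoked, which is the observable behaviour the two criteria describe.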
+
+**AC-4: infer round-trips through ORT and returns named outputs**
+Given a deserialised handle for an UltraVPR-shaped model and a fixed input dict
+When `infer(handle, inputs)` is called
+Then `session.run` is invoked with the model's declared output names; the returned dict matches the Protocol shape; numerical outputs match the TRT-direct strategy within a documented tolerance (FP16 round-trip)
+
+**AC-5: fallback WARN log fires once on first infer**
+Given a runtime constructed with `is_fallback=True`
+When `infer` is called for the first time
+Then exactly one `kind="c7.fallback_to_onnx_trt_ep"` WARN log is emitted AND one `gcs_alert(...)` callback is invoked; subsequent `infer` calls do NOT emit the log again
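The emit-once behaviour in AC-5 amounts to a guarded flag. The sketch below assumes the stdlib `logging` module in place of the project's log module (AZ-266) and a one-argument `gcs_alert` callback; the class name `FallbackNotifier`, the logger name, and the lock are illustrative choices, not part of the spec.

```python
import logging
import threading
from typing import Callable

log = logging.getLogger("c7_inference")  # assumed logger name
FALLBACK_KIND = "c7.fallback_to_onnx_trt_ep"


class FallbackNotifier:
    """Emit the fallback WARN log and the GCS alert at most once."""

    def __init__(self, is_fallback: bool, gcs_alert: Callable[[str], None]) -> None:
        self._pending = is_fallback
        self._gcs_alert = gcs_alert
        self._lock = threading.Lock()

    def notify_once(self) -> bool:
        """Return True only on the call that actually emitted the notice."""
        with self._lock:
            if not self._pending:
                return False
            self._pending = False
        log.warning(
            "running on ONNX Runtime TRT EP fallback; expect degraded latency",
            extra={"kind": FALLBACK_KIND},
        )
        self._gcs_alert(FALLBACK_KIND)
        return True
```

In this sketch `infer` would call `notify_once()` before `session.run`; when `is_fallback` is `False` the call is a cheap no-op.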
+
+**AC-6: provider fallback chain respects ORT order**
+Given an environment where TRT EP refuses to load (e.g., TRT version mismatch outside ORT's tolerance) but CUDA EP is available
+When `deserialize_engine` is called
+Then session creation succeeds with `CUDAExecutionProvider` as the active provider; an INFO log records the actual provider in use; `current_runtime_label()` STILL returns `"onnx_trt_ep"` (the runtime label is the strategy, not the EP)
+
+**AC-7: release_engine drops the session and is idempotent**
+Given a deserialised handle
+When `release_engine(handle)` is called once and then again
+Then the first call drops the session reference and the GPU memory ORT held is released; the second call returns silently
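A minimal sketch of the idempotent release in AC-7 follows. The handle here is a standalone dataclass rather than the real `EngineHandle` subclass, the boolean return value is added only to make the behaviour observable in a test, and the `session.end_profiling()` call mentioned in the Outcome section is left out of the sketch.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class OnnxTrtEpEngineHandle:
    # Stand-in handle: holds the ORT session, or None once released.
    session: Optional[Any]


def release_engine(handle: OnnxTrtEpEngineHandle) -> bool:
    """Drop the session reference; return False if it was already released."""
    if handle.session is None:
        return False  # second and later calls are silent no-ops
    # Dropping the last reference lets ORT free EP resources on garbage collection.
    handle.session = None
    return True
```

GPU memory is only reclaimed if nothing else still references the session, so the runtime should not cache the session outside the handle.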
+
+**AC-8: workspace budget is respected**
+Given `config.inference.gpu_memory_budget_bytes = 4 GB`
+When `deserialize_engine` is called and ORT's TRT EP attempts to allocate workspace
+Then the workspace size is capped at `gpu_memory_budget_bytes // 4 = 1 GB` per the provider option; an attempt to exceed this raises `OutOfMemoryError`
+
+## Non-Functional Requirements
+
+**Performance**
+- Per-call latency budget is the same as TRT (C7-PT-01) but the test fixture allows up to 1.5× slack (the WARN log notes "degraded latency" — the operator has been informed). Failure thresholds match TRT's failure thresholds (100 / 60 / 150 / 90 ms).
+- Session creation budget: ≤ 30 s p95 on Tier-2 for the first deserialise (ORT lazy-compile dominates the first call); subsequent deserialises that hit the EP cache should be ≤ 5 s p95.
+
+**Compatibility**
+- ORT version pinned to the project's `requirements.txt` (this task does NOT change it).
+- TRT EP requires the same TRT 10.3 / CUDA stack as `TensorrtRuntime`; if absent, `CUDAExecutionProvider` carries the inference (degraded but functional) per AC-6.
+
+**Reliability**
+- Errors rewrapped into the AZ-297 family.
+- ORT-internal exceptions (e.g., `onnxruntime.OrtInvalidArgument`) are caught and rewrapped as `InferenceError`.
+- Session lifetime is bound to the `EngineHandle`; the runtime never holds a session beyond an explicit `release_engine` or process exit.
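One way to keep the rewrapping in a single place is a context manager around the ORT call. This is a sketch under assumptions: `InferenceError` is a local stand-in for the AZ-297 error, and the ORT exception classes to catch are passed in by the caller rather than named here, since their import path depends on the pinned ORT version.

```python
from contextlib import contextmanager
from typing import Iterator, Type


class InferenceError(Exception):
    """Stand-in for the AZ-297 family's mid-flight error."""


@contextmanager
def rewrap_as_inference_error(*ort_errors: Type[BaseException]) -> Iterator[None]:
    """Convert the listed exception types into InferenceError, keeping the cause."""
    try:
        yield
    except ort_errors as exc:
        # `from exc` preserves the original traceback for post-flight analysis.
        raise InferenceError(str(exc)) from exc
```

Inside `infer`, the `session.run(...)` call would sit in a `with rewrap_as_inference_error(...)` block; exception types not listed pass through unchanged.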
+
+## Unit Tests
+
+| AC Ref | What to Test | Required Outcome |
+|--------|-------------|-----------------|
+| AC-1 | Protocol conformance + label | `isinstance` True; label string match |
+| AC-2 | Deserialise from `.onnx` | Gate NOT called; session created with TRT EP head |
+| AC-3 | Deserialise from `.engine` with mismatched schema | Gate raises; no session |
+| AC-4 | Numerical comparison against TRT-direct (UltraVPR sample input) | Outputs match within FP16 tolerance |
+| AC-5 | First infer with `is_fallback=True` | Exactly one WARN log; one gcs_alert; second infer silent |
+| AC-6 | Force TRT EP refusal, allow CUDA EP | Session creates with CUDA EP; label still `"onnx_trt_ep"` |
+| AC-7 | release_engine called twice | First drops; second is no-op |
+| AC-8 | Workspace cap | Provider option set to budget // 4 |
+| NFR-perf-session-create | Microbench session creation × 5 | First p95 ≤ 30 s; subsequent p95 ≤ 5 s |
+| NFR-reliability-error-rewrap | Inject ORT internal error | Rewrapped to `InferenceError` |
+
+## Constraints
+
+- ORT version pinned at the project default; no upgrade in this task.
+- Provider list order is fixed: `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`. CPU is the bottom-of-list fallback so a misconfigured Jetson still serves results (slowly) instead of hard-failing — the WARN log will scream, but the system stays alive.
+- ORT EP's `trt_engine_cache_path` is a config field; defaults to a per-flight subdirectory of `config.inference.engine_cache_dir`.
+- The `OnnxTrtEpEngineHandle` is opaque to consumers.
+- This task introduces no new third-party dependencies beyond what ORT already requires.
+
+## Risks & Mitigation
+
+**Risk 1: ORT TRT EP cache poisons across precisions**
+- *Risk*: The TRT EP cache is keyed by ORT-internal hashes, not by D-C10-7 schema. Reusing a stale cache could load a wrong-precision subgraph.
+- *Mitigation*: The EP cache directory is per-flight (under `engine_cache_dir`) and cleaned on flight end by C12 operator tooling. The cache lives only across a single flight; cross-flight reuse is operator-initiated and explicit.
+
+**Risk 2: Provider fallback to CPU EP is silent**
+- *Risk*: If both TRT and CUDA EPs refuse to load, ORT falls back to CPU; latency budgets explode and the system "works" at 100× target latency.
+- *Mitigation*: The active provider list is logged at INFO on session creation; if CPU is the active provider, an additional WARN log fires (`kind="c7.cpu_fallback"`). The composition root MAY install a hard refusal hook that raises `EngineDeserializeError` on CPU fallback (operator-configurable; default is "warn but allow").
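The mitigation above can be sketched as a check on the provider list reported after session creation. It assumes that the list returned by `session.get_providers()` reflects the providers actually registered, in priority order; the function name `check_active_provider`, the `refuse_cpu` flag, and the logger name are hypothetical, and `EngineDeserializeError` is a local stand-in.

```python
import logging
from typing import Sequence

log = logging.getLogger("c7_inference")  # assumed logger name
CPU_EP = "CPUExecutionProvider"


class EngineDeserializeError(Exception):
    """Stand-in for the AZ-297 session-creation error."""


def check_active_provider(active: Sequence[str], refuse_cpu: bool = False) -> str:
    """Log the providers in use and flag (or refuse) a CPU-only session."""
    if not active:
        raise EngineDeserializeError("session reports no execution providers")
    head = active[0]
    log.info("ORT providers in use: %s", list(active))
    if head == CPU_EP:
        log.warning("inference is running on CPU EP", extra={"kind": "c7.cpu_fallback"})
        if refuse_cpu:
            # Optional hard-refusal hook; the default is "warn but allow".
            raise EngineDeserializeError("CPU EP fallback refused by configuration")
    return head
```

A caller would invoke it as `check_active_provider(session.get_providers(), refuse_cpu=...)` right after building the session.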
+
+**Risk 3: ORT version drift breaks TRT EP option keys**
+- *Risk*: ORT 1.16 → 1.18 changed some TRT EP option key names; a future upgrade silently regresses.
+- *Mitigation*: The TRT EP provider option dict is built behind a single helper function; the helper has a unit test that pins the option-key names. Upgrade-time changes fail the unit test before merge.
+
+## Runtime Completeness
+
+- **Named capability**: ONNX Runtime + TensorRT execution-provider fallback path (architecture / E-C7 / C7-IT-05).
+- **Production code that must exist**: real `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol; real ORT `InferenceSession` creation with the TRT-EP-led provider list; real ORT `session.run` on the F3 hot path; real fallback WARN + GCS alert wiring.
+- **Allowed external stubs**: tests MAY substitute a recording wrapper around `session.run` to verify call sequence (AC-4); production wiring uses real ORT.
+- **Unacceptable substitutes**: a stub that just defers to `TensorrtRuntime` (would defeat the fallback's purpose), an ORT session without the TRT EP at the head of the provider list (would silently use CUDA EP and break C7-IT-05's design intent), or hardcoded CPU-EP-only mode (would silently meet the test outcome with absurd latency).