[AZ-300] Implement PytorchFp16Runtime — C7 simple-baseline strategy

AZ-300 mandatory simple-baseline InferenceRuntime (eager FP16 PyTorch). Implements the AZ-297 Protocol; current_runtime_label returns "pytorch_fp16". Numerical reference every fancier C7 strategy (AZ-298 TRT, AZ-299 ORT) is measured against, and the only viable runtime for Tier-1 workstation Docker where TRT is non-trivial to install.

Production code (new):
- components/c7_inference/pytorch_fp16_runtime.py: runtime + PytorchEngineHandle + output-shape adapter
- components/c7_inference/architecture_registry.py: torch-free register_architecture / default_registry / ArchitectureFactory (Risk-1 mitigation: no L2->L3 back-edge from C7 into per-backbone code)
- components/c7_inference/__init__.py: re-exports the registry mechanism. Still does NOT import the concrete strategy module (Invariant I-5)
- components/c7_inference/config.py: adds per_frame_debug_log bool field (gates the DEBUG per-frame latency log)

Tests (new):
- tests/unit/c7_inference/test_pytorch_fp16_runtime.py covers AC-1..AC-8 + NFRs. AC-1/2/6/7 + thermal/release/registry guards run unconditionally (17 tests); AC-3/4/5/8 + NFR-perf-deserialize + NFR-reliability-eval-mode require CUDA and skip on Tier-1 CI / macOS dev.

Tests (modified):
- test_protocol_conformance.py: narrowed test_ac5_build_inference_runtime_flag_on_but_module_missing parametrisation to exclude pytorch_fp16 (now-built); TRT / ORT still covered until AZ-298 / AZ-299 ship.

CI: .github/workflows/ci.yml lint + unit jobs now install '-e .[dev,inference]' because mypy + pytest need torch + torchvision + onnxruntime on the runner.

Three task-spec -> as-built deltas documented in _docs/02_tasks/done/AZ-300_c7_pytorch_baseline.md Implementation Notes:
1. Constructor conforms to AZ-297 factory shape (config positional; thermal_publisher + registry + clock keyword-only optionals). AZ-302 will update the factory to thread thermal_publisher.
2. Architecture registry uses extras["model_name"] as lookup key (avoids touching the frozen BuildConfig / EngineCacheEntry DTOs).
3. Warm-up forward deferred to AZ-300 tier-2 follow-up: the zero-arg registry has no per-backbone input-shape metadata.

Suite: 1120 passed / 10 skipped (CUDA + Tier-2 + cmake / actionlint environment gates). No regressions in non-c7_inference areas.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 10:51:13 +00:00 · 2026-05-12 10:13:21 +03:00
parent fce80290bc
commit 65ad2168ed
10 changed files with 1079 additions and 9 deletions
@@ -0,0 +1,135 @@
+# Batch 24 / Cycle 1 — Implementation Report
+
+**Date**: 2026-05-12
+**Tasks**: AZ-300 (C7 PytorchFp16Runtime — mandatory simple-baseline)
+**Story points landed**: 2
+**Status**: complete (AZ-300 → In Testing)
+
+## Scope summary
+
+Single-task batch by design — narrowed from the initial post-AZ-332
+plan (`{AZ-300, AZ-301, AZ-302}`) to keep the post-OKVIS2 turn at a
+reviewable size. AZ-301 (EngineGate, 3pt) and AZ-302
+(ThermalStatePublisher, 3pt) move to batch 25.
+
+## Files added / modified
+
+### New
+
+- `src/gps_denied_onboard/components/c7_inference/pytorch_fp16_runtime.py`
+  — `PytorchFp16Runtime` + `PytorchEngineHandle` + `_to_numpy_dict`
+  output-shape adapter.
+- `src/gps_denied_onboard/components/c7_inference/architecture_registry.py`
+  — torch-free `register_architecture` / `default_registry` /
+  `ArchitectureFactory`. Risk-1 mitigation (no L2→L3 back-edge from C7
+  into per-backbone code).
+- `tests/unit/c7_inference/test_pytorch_fp16_runtime.py`: covers
+  AC-1..AC-8 + NFRs; the 17 CPU-runnable tests are green on macOS and
+  the CUDA-gated tests skip there.
+
+### Modified
+
+- `src/gps_denied_onboard/components/c7_inference/__init__.py`
+  — re-exports `ArchitectureFactory`, `default_registry`,
+  `register_architecture`. Still does NOT import the concrete strategy
+  module (Invariant I-5 / Risk-2).
+- `src/gps_denied_onboard/components/c7_inference/config.py`
+  — added `per_frame_debug_log: bool = False` to `C7InferenceConfig`
+  (gates the DEBUG per-frame latency log per spec § Scope).
+- `tests/unit/c7_inference/test_protocol_conformance.py`
+  — narrowed `test_ac5_build_inference_runtime_flag_on_but_module_missing`
+  parametrisation to exclude `pytorch_fp16` (now-built); TRT / ORT
+  remain covered (AZ-298 / AZ-299 still pending).
+- `_docs/02_tasks/todo/AZ-300_c7_pytorch_baseline.md` → moved to
+  `_docs/02_tasks/done/`; added an `## Implementation Notes (2026-05-12,
+  batch 24)` section documenting the three task-spec → as-built deltas.
+
+## Design decisions (resolved spec contradictions)
+
+1. **Constructor shape** — `__init__(config: Config, *, thermal_publisher=None,
+   architecture_registry=None, clock=None)`. AZ-297 factory passes
+   `config` only; thermal-publisher injection waits for AZ-302 to update
+   the factory. Same pattern as AZ-332 vs. AZ-331 (user-approved option A
+   from the prior batch).
+2. **Architecture registry key** — `EngineCacheEntry.extras["model_name"]`,
+   populated from the checkpoint's file stem inside `compile_engine`.
+   Avoids touching the frozen `BuildConfig` / `EngineCacheEntry` DTOs.
+3. **Warm-up forward** — deferred to AZ-300 tier-2 follow-up. The
+   registry has no input-shape metadata; a real warm-up needs
+   per-backbone shape info owned by each backbone's composition wiring.
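The registry behaviour described in decision 2 can be illustrated with a minimal, torch-free sketch. It is not taken from the commit: the `ArchitectureRegistry` class, its `register` / `lookup` methods, the `model_name_from_checkpoint` helper, and the choice of `ValueError` / `LookupError` are all assumptions made for illustration. Only the `extras["model_name"]` key, the file-stem rule, and the collision / idempotency behaviour come from the report.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical alias: a zero-arg callable that builds an uninitialised model.
ArchitectureFactory = Callable[[], object]


class ArchitectureRegistry:
    """Torch-free name -> factory map (illustrative, not the shipped class)."""

    def __init__(self) -> None:
        self._factories: Dict[str, ArchitectureFactory] = {}

    def register(self, model_name: str, factory: ArchitectureFactory) -> None:
        existing = self._factories.get(model_name)
        if existing is factory:
            return  # registering the same factory twice is a no-op
        if existing is not None:
            # A different factory under an existing name is a collision.
            raise ValueError(f"architecture {model_name!r} already registered")
        self._factories[model_name] = factory

    def lookup(self, model_name: str) -> ArchitectureFactory:
        try:
            return self._factories[model_name]
        except KeyError as exc:
            raise LookupError(f"no architecture registered for {model_name!r}") from exc


def model_name_from_checkpoint(checkpoint_path: str) -> str:
    # Assumption: the lookup key is the checkpoint's file stem.
    return Path(checkpoint_path).stem


def tiny_backbone() -> dict:
    # Stand-in for a real backbone constructor.
    return {"layers": 2}


registry = ArchitectureRegistry()
registry.register("tiny_backbone", tiny_backbone)
registry.register("tiny_backbone", tiny_backbone)  # idempotent

# The key travels in the cache entry's extras dict, so no DTO field is added.
extras = {"model_name": model_name_from_checkpoint("/models/tiny_backbone.pt")}
model = registry.lookup(extras["model_name"])()
```

Because the factories take no arguments, the registry carries no input-shape information, which is consistent with the reason decision 3 gives for deferring the warm-up forward.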
+
+## AC coverage
+
+| AC | Status | Notes |
+|----|--------|-------|
+| AC-1 protocol conformance | covered | `test_ac1_protocol_conformance` |
+| AC-2 compile_engine no-op | covered | `test_ac2_compile_engine_is_noop` |
+| AC-3 deserialize half-cast/GPU/eval | covered (CUDA-skip on Tier-1) | `test_ac3_deserialize_loads_half_casts_gpu_moves_eval` |
+| AC-4 infer numerical FP32 reference | covered (CUDA-skip on Tier-1) | `test_ac4_infer_numerical_close_to_fp32`; atol=5e-3, rtol=5e-3 on a tiny linear model in FP16 |
+| AC-5 release frees GPU memory | covered (CUDA-skip on Tier-1) | `test_ac5_release_frees_gpu_memory` + I-7 idempotent assertion |
+| AC-6 missing checkpoint | covered | `test_ac6_missing_checkpoint_raises` |
+| AC-7 mismatched state_dict | covered | `test_ac7_incompatible_state_dict_raises_with_cause` (validates `__cause__` chain) |
+| AC-8 CUDA OOM rewrap | covered (CUDA-skip on Tier-1) | `test_ac8_cuda_oom_during_infer_rewrapped` (synthetic OOM via stub model) |
+| NFR-perf-deserialize | tier2 | Jetson-only validation |
+| NFR-reliability-eval-mode | covered (CUDA-skip on Tier-1) | `test_nfr_reliability_eval_mode_unconditional` |
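To give a feel for the AC-4 tolerances, the sketch below compares a tiny linear map evaluated with half-precision rounding against a higher-precision reference. It is not the project's test, which runs PyTorch on CUDA. This version uses only the standard library, emulates FP16 through the `struct` `"e"` format, and uses Python doubles as a stand-in for the FP32 reference. The weights, inputs, and helper names are made up for illustration.

```python
import struct

ATOL = 5e-3  # tolerances quoted in the AC-4 row
RTOL = 5e-3


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def linear(weights, bias, x, cast=lambda v: v):
    """Compute y = W @ x + b, applying `cast` to inputs and intermediates."""
    out = []
    for row, b in zip(weights, bias):
        acc = cast(b)
        for w, xi in zip(row, x):
            acc = cast(acc + cast(cast(w) * cast(xi)))
        out.append(acc)
    return out


def all_close(actual, reference, atol=ATOL, rtol=RTOL):
    # Same criterion as numpy.allclose / torch.allclose:
    # |actual - reference| <= atol + rtol * |reference|
    return all(abs(a - r) <= atol + rtol * abs(r) for a, r in zip(actual, reference))


# Illustrative values only.
WEIGHTS = [[0.5, -0.25, 0.125, 1.0], [0.1, 0.2, -0.3, 0.4]]
BIAS = [0.01, -0.02]
X = [0.3, -0.7, 0.9, 0.2]

reference = linear(WEIGHTS, BIAS, X)           # doubles stand in for FP32
half = linear(WEIGHTS, BIAS, X, cast=to_fp16)  # FP16 rounding at every step
max_abs_err = max(abs(h - r) for h, r in zip(half, reference))
```

For outputs of order one, half precision carries about three significant decimal digits, so rounding error accumulated over a handful of operations stays within a combined `atol + rtol * |reference|` budget of this size.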
+
+Additional coverage beyond ACs:
+
+- `test_thermal_state_default_safe_when_no_publisher` — Invariant I-6
+  fallback when AZ-302 publisher absent.
+- `test_thermal_state_delegates_to_publisher` — duck-typed `.read()`
+  delegation, forward-compat with AZ-302.
+- `test_deserialize_missing_architecture_registration` — registry
+  lookup miss path.
+- `test_infer_rejects_foreign_handle` / `test_infer_rejects_released_handle`
+  — handle-lifecycle guards (consumers MUST pass back the same
+  runtime's handle).
+- `test_register_architecture_rejects_collision` /
+  `test_register_architecture_same_factory_is_idempotent` — composition-time
+  registry safety.
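The two thermal-state tests listed above can be pictured with a minimal sketch, assuming a keyword-only optional publisher as in the constructor shape from the design decisions. The `ThermalState` dataclass, the `RuntimeSketch` class, the `thermal_state` method name, and `FakePublisher` are hypothetical. Only the two field names, the default-safe values, and the duck-typed `.read()` call come from the report.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ThermalState:
    # Field names appear in the report; this dataclass is an assumption.
    is_telemetry_available: bool = False
    thermal_throttle_active: bool = False


class RuntimeSketch:
    """Illustrative stand-in for the runtime, not the shipped class."""

    def __init__(self, config: Any, *, thermal_publisher: Optional[Any] = None) -> None:
        self._config = config
        self._thermal_publisher = thermal_publisher

    def thermal_state(self) -> ThermalState:
        if self._thermal_publisher is None:
            # No publisher wired yet: report the default-safe state.
            return ThermalState()
        # Duck-typed delegation: any object with a read() method works.
        return self._thermal_publisher.read()


class FakePublisher:
    """Test double standing in for the future AZ-302 publisher."""

    def read(self) -> ThermalState:
        return ThermalState(is_telemetry_available=True, thermal_throttle_active=True)


fallback = RuntimeSketch(config={}).thermal_state()
delegated = RuntimeSketch(config={}, thermal_publisher=FakePublisher()).thermal_state()
```

Keeping the publisher keyword-only lets a factory that passes `config` alone keep working unchanged, and a later factory update can thread the publisher through without touching callers.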
+
+## Test run
+
+```
+.venv/bin/pytest tests/unit/c7_inference/ → 63 passed, 6 skipped
+.venv/bin/pytest                          → 1120 passed, 10 skipped
+```
+
+The 6 c7_inference skips are CUDA-gated. The 10 full-suite skips are
+all environment-gated (CUDA + Tier-2 + cmake/actionlint not on PATH).
+No pre-existing tests regressed.
+
+## Self-review verdict
+
+**Pass.** The implementation follows the AZ-297 contract (Protocol surface +
+factory shape + error envelope + Invariants I-1/2/4/5/6/7/8). The single
+`test_protocol_conformance.py` edit is narrowly scoped (a parametrisation
+filter, not a behaviour change). No source churn outside `c7_inference`;
+the only other changes are the CI install extras and the task-doc move.
+
+## Known gaps for the Product Implementation Completeness Gate
+
+- **Warm-up forward**: deferred to AZ-300 tier-2 (Jetson). Until then the
+  first real `infer` call acts as the implicit warm-up; AC-3 still passes
+  because it only checks dtype/device/training-mode, not warm-up artifacts.
+- **Thermal publisher wiring**: returns default-safe state until AZ-302
+  ships. Invariant I-6 holds; consumers see
+  `is_telemetry_available=False` and `thermal_throttle_active=False`.
+- **CUDA-gated NFR-perf**: Tier-1 CI cannot validate p95 ≤ 10 s on
+  deserialize; Tier-2 Jetson CI is the gate.
+- **Architecture registry population**: this task ships the *mechanism*;
+  per-backbone modules (E-C2 / E-C2.5 / E-C3 / E-C3.5) own actually
+  *populating* the registry from their composition wiring. Tracked by
+  those component epics.
+
+## Next batch
+
+**Batch 25 candidates** (18 tasks total ready in the queue):
+
+- AZ-301 (C7 EngineGate, 3pt) — no `torch` dependency; uses C7 error
+  types only.
+- AZ-302 (C7 ThermalStatePublisher, 3pt) — `jtop` / `pynvml`
+  deps (Tier-2 only; Tier-1 tests stub the source).
+- AZ-304 (C6 Postgres schema, 2pt) — no native deps; pure SQL +
+  alembic migration if pattern allows.
+
+Recommended batch 25 size: 2–3 tasks (AZ-301 + AZ-302, plus AZ-304 if
+turn budget allows).