[AZ-300] Implement PytorchFp16Runtime — C7 simple-baseline strategy

AZ-300 mandatory simple-baseline InferenceRuntime (eager FP16 PyTorch).
Implements the AZ-297 Protocol; current_runtime_label returns
"pytorch_fp16". Numerical reference every fancier C7 strategy
(AZ-298 TRT, AZ-299 ORT) is measured against, and the only viable
runtime for Tier-1 workstation Docker where TRT is non-trivial to
install.

Production code (new):
- components/c7_inference/pytorch_fp16_runtime.py: runtime +
  PytorchEngineHandle + output-shape adapter
- components/c7_inference/architecture_registry.py: torch-free
  register_architecture / default_registry / ArchitectureFactory
  (Risk-1 mitigation: no L2->L3 back-edge from C7 into per-backbone
  code)
- components/c7_inference/__init__.py: re-exports the registry
  mechanism. Still does NOT import the concrete strategy module
  (Invariant I-5)
- components/c7_inference/config.py: adds per_frame_debug_log bool
  field (gates the DEBUG per-frame latency log)

Tests (new):
tests/unit/c7_inference/test_pytorch_fp16_runtime.py covers AC-1..AC-8
+ NFRs. AC-1/2/6/7 + thermal/release/registry guards run
unconditionally (17 tests); AC-3/4/5/8 + NFR-perf-deserialize +
NFR-reliability-eval-mode require CUDA and skip on Tier-1 CI / macOS
dev.

Tests (modified):
- test_protocol_conformance.py: narrowed
  test_ac5_build_inference_runtime_flag_on_but_module_missing
  parametrisation to exclude pytorch_fp16 (now-built); TRT / ORT still
  covered until AZ-298 / AZ-299 ship.

CI:
.github/workflows/ci.yml lint + unit jobs now install
'-e .[dev,inference]' because mypy + pytest need torch + torchvision +
onnxruntime on the runner.

Three task-spec -> as-built deltas documented in
_docs/02_tasks/done/AZ-300_c7_pytorch_baseline.md Implementation Notes:
1. Constructor conforms to AZ-297 factory shape (config positional;
   thermal_publisher + registry + clock keyword-only optionals).
   AZ-302 will update the factory to thread thermal_publisher.
2. Architecture registry uses extras["model_name"] as lookup key
   (avoids touching the frozen BuildConfig / EngineCacheEntry DTOs).
3. Warm-up forward deferred to AZ-300 tier-2 follow-up: the zero-arg
   registry has no per-backbone input-shape metadata.

Suite: 1120 passed / 10 skipped (CUDA + Tier-2 + cmake / actionlint
environment gates). No regressions in non-c7_inference areas.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 16:41:13 +00:00 · 2026-05-12 10:13:21 +03:00
parent fce80290bc
commit 65ad2168ed
10 changed files with 1079 additions and 9 deletions
@@ -159,3 +159,26 @@ Then `OutOfMemoryError` is raised (rewrapped from `torch.cuda.OutOfMemoryError`)
 - **Production code that must exist**: real `PytorchFp16Runtime` class implementing the AZ-297 Protocol; real `torch.load` + `.half().cuda().eval()` + sync forward; real release path.
 - **Allowed external stubs**: tests MAY substitute a tiny `nn.Linear` checkpoint as the "model"; production wiring uses the actual backbones registered by the composition root.
 - **Unacceptable substitutes**: a CPU-only mode (would defeat the GPU-first invariant the AZ-297 Protocol implies via `EngineHandle`); `torch.compile` (would silently change the simple-baseline contract); autocast (would change the "FP16 only" guarantee that downstream comparisons rely on).
+
+## Implementation Notes (2026-05-12, batch 24)
+
+Three task-spec → as-built deltas, surfaced as DECISION-style rationale so AZ-301 / AZ-302 don't repeat the analysis:
+
+1. **Constructor signature**: the spec says "Constructor accepts a `ThermalStatePublisher` reference", but the AZ-297 factory (`runtime_root/inference_factory.py`) calls `strategy_cls(config)` positionally, so a required publisher argument would break construction. Same pattern as AZ-332 vs. AZ-331. Adopted: `__init__(self, config: Config, *, thermal_publisher=None, architecture_registry=None, clock=None)`; every keyword argument defaults to `None`. AZ-302 will update the factory to thread `thermal_publisher`; until then, `thermal_state()` returns the default-safe `ThermalState` (Invariant I-6). The factory is unchanged in this task.
+
+2. **Architecture registry**: the spec calls for "a single dict registered at composition time" but doesn't pick a field on `BuildConfig` / `EngineCacheEntry` to carry the lookup key. The DTO has neither a `model_name` nor a `model_arch` field, so we read `EngineCacheEntry.extras["model_name"]` (the documented `dict[str, str]` extension point on the DTO) and populate it from the checkpoint's file stem inside `compile_engine`. The registry lives in `c7_inference.architecture_registry` (torch-free, so the composition root may register before any GPU init) and is re-exported as `c7_inference.register_architecture` / `c7_inference.default_registry`.
+
+3. **Warm-up forward**: the spec mentions "Single warm-up forward with zero-shaped input to allocate buffers". The registry only carries a zero-arg factory and no input-shape metadata. A real warm-up needs per-backbone shape info, which is owned by each backbone module's composition wiring, not by C7. Deferred to the AZ-300 tier-2 follow-up (Jetson). The first real `infer` call does the implicit warm-up; no functional impact on AC-3 / AC-4 / AC-5.
+
+### CUDA test-skip policy
+
+AC-3, AC-4, AC-5, AC-8, NFR-perf-deserialize, and NFR-reliability-eval-mode require an actual CUDA device. On macOS / Tier-1 CI (no GPU) they are decorated with `@pytest.mark.skipif(not torch.cuda.is_available(), ...)` and are skipped. The Tier-2 Jetson CI runs the full sweep. AC-1, AC-2, AC-6, and AC-7 trip *before* any `.cuda()` call (factory construction, file existence, `load_state_dict(strict=True)` rejection), so they run unconditionally and currently pass on macOS arm64 + PyTorch 2.11 CPU build.
+
+### As-built file map
+
+- `src/gps_denied_onboard/components/c7_inference/pytorch_fp16_runtime.py` — `PytorchFp16Runtime`, `PytorchEngineHandle`, `_to_numpy_dict` helper.
+- `src/gps_denied_onboard/components/c7_inference/architecture_registry.py` — `register_architecture`, `default_registry`, `ArchitectureFactory` type alias.
+- `src/gps_denied_onboard/components/c7_inference/config.py` — added `per_frame_debug_log: bool = False` field (gates the DEBUG per-frame latency log).
+- `src/gps_denied_onboard/components/c7_inference/__init__.py` — re-exports `ArchitectureFactory`, `default_registry`, `register_architecture`. Still does NOT import `pytorch_fp16_runtime` (Invariant I-5).
+- `tests/unit/c7_inference/test_pytorch_fp16_runtime.py` — 17 tests, 6 CUDA-skipped on macOS.
+- `tests/unit/c7_inference/test_protocol_conformance.py` — narrowed `test_ac5_build_inference_runtime_flag_on_but_module_missing` parametrisation to exclude `pytorch_fp16` (now-built); TRT / ORT still covered.
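
Implementation Notes 1 and 2 above describe a keyword-only constructor and a torch-free architecture registry keyed by `extras["model_name"]`. The following is a minimal, hypothetical sketch of how those two pieces might fit together; it is not the code from this commit. `SketchRuntime`, `build_model`, the `current()` method on the publisher, the fields of `ThermalState`, and the duplicate-name check are all assumptions made for illustration. Only the names `register_architecture`, `default_registry`, `ArchitectureFactory`, the constructor signature, the `"pytorch_fp16"` label, and the `per_frame_debug_log` field come from the notes.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Zero-argument factory returning an un-initialised model (per Note 3).
# No torch import here, so registration can happen before any GPU init.
ArchitectureFactory = Callable[[], Any]

_DEFAULT_REGISTRY: dict[str, ArchitectureFactory] = {}


def default_registry() -> dict[str, ArchitectureFactory]:
    """Return the process-wide registry (assumed to be a plain dict)."""
    return _DEFAULT_REGISTRY


def register_architecture(
    name: str,
    factory: ArchitectureFactory,
    *,
    registry: Optional[dict[str, ArchitectureFactory]] = None,
) -> None:
    """Register a factory under `name`; rejecting duplicates is an assumption."""
    target = default_registry() if registry is None else registry
    if name in target:
        raise ValueError(f"architecture already registered: {name!r}")
    target[name] = factory


@dataclass(frozen=True)
class ThermalState:
    """Hypothetical stand-in; the real DTO's fields are not shown in the diff."""
    throttled: bool = False


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in carrying only the field named in the file map."""
    per_frame_debug_log: bool = False


@dataclass(frozen=True)
class EngineCacheEntry:
    """Hypothetical stand-in exposing the `extras` extension point."""
    extras: dict[str, str] = field(default_factory=dict)


class SketchRuntime:
    """Illustrates the constructor shape only; performs no inference."""

    def __init__(
        self,
        config: Config,
        *,
        thermal_publisher: Any = None,
        architecture_registry: Optional[dict[str, ArchitectureFactory]] = None,
        clock: Optional[Callable[[], float]] = None,
    ) -> None:
        self._config = config
        self._thermal_publisher = thermal_publisher
        # `is not None` matters: an empty injected registry is falsy.
        self._registry = (
            architecture_registry
            if architecture_registry is not None
            else default_registry()
        )
        self._clock = clock if clock is not None else time.monotonic

    def current_runtime_label(self) -> str:
        return "pytorch_fp16"

    def thermal_state(self) -> ThermalState:
        # Without a publisher, fall back to a default-safe state.
        if self._thermal_publisher is None:
            return ThermalState()
        return self._thermal_publisher.current()  # hypothetical method name

    def build_model(self, entry: EngineCacheEntry) -> Any:
        """Hypothetical helper: resolve the factory via extras["model_name"]."""
        name = entry.extras["model_name"]
        if name not in self._registry:
            raise LookupError(f"no architecture registered for {name!r}")
        return self._registry[name]()
```

Because every optional is keyword-only, a factory that calls `strategy_cls(config)` positionally can still construct the runtime, while tests can inject an isolated registry with `SketchRuntime(Config(), architecture_registry={...})` instead of mutating the process-wide one.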