[AZ-300] Implement PytorchFp16Runtime — C7 simple-baseline strategy

AZ-300 mandatory simple-baseline InferenceRuntime (eager FP16 PyTorch). Implements the AZ-297 Protocol; current_runtime_label returns "pytorch_fp16". Numerical reference every fancier C7 strategy (AZ-298 TRT, AZ-299 ORT) is measured against, and the only viable runtime for Tier-1 workstation Docker where TRT is non-trivial to install.

Production code (new):
- components/c7_inference/pytorch_fp16_runtime.py: runtime + PytorchEngineHandle + output-shape adapter
- components/c7_inference/architecture_registry.py: torch-free register_architecture / default_registry / ArchitectureFactory (Risk-1 mitigation: no L2->L3 back-edge from C7 into per-backbone code)
- components/c7_inference/__init__.py: re-exports the registry mechanism. Still does NOT import the concrete strategy module (Invariant I-5)
- components/c7_inference/config.py: adds per_frame_debug_log bool field (gates the DEBUG per-frame latency log)

Tests (new): tests/unit/c7_inference/test_pytorch_fp16_runtime.py covers AC-1..AC-8 + NFRs. AC-1/2/6/7 + thermal/release/registry guards run unconditionally (17 tests); AC-3/4/5/8 + NFR-perf-deserialize + NFR-reliability-eval-mode require CUDA and skip on Tier-1 CI / macOS dev.

Tests (modified):
- test_protocol_conformance.py: narrowed test_ac5_build_inference_runtime_flag_on_but_module_missing parametrisation to exclude pytorch_fp16 (now-built); TRT / ORT still covered until AZ-298 / AZ-299 ship.

CI: .github/workflows/ci.yml lint + unit jobs now install '-e .[dev,inference]' because mypy + pytest need torch + torchvision + onnxruntime on the runner.

Three task-spec -> as-built deltas documented in _docs/02_tasks/done/AZ-300_c7_pytorch_baseline.md Implementation Notes:
1. Constructor conforms to AZ-297 factory shape (config positional; thermal_publisher + registry + clock keyword-only optionals). AZ-302 will update the factory to thread thermal_publisher.
2. Architecture registry uses extras["model_name"] as lookup key (avoids touching the frozen BuildConfig / EngineCacheEntry DTOs).
3. Warm-up forward deferred to AZ-300 tier-2 follow-up: the zero-arg registry has no per-backbone input-shape metadata.

Suite: 1120 passed / 10 skipped (CUDA + Tier-2 + cmake / actionlint environment gates). No regressions in non-c7_inference areas.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 19:11:14 +00:00 · 2026-05-12 10:13:21 +03:00
parent fce80290bc
commit 65ad2168ed
10 changed files with 1079 additions and 9 deletions
@@ -1,161 +0,0 @@
-# C7 PytorchFp16Runtime — Mandatory Simple Baseline
-
-**Task**: AZ-300_c7_pytorch_baseline
-**Name**: C7 PytorchFp16Runtime
-**Description**: Implement `PytorchFp16Runtime`, the mandatory simple-baseline `InferenceRuntime` strategy. Loads each backbone's canonical PyTorch checkpoint, calls `.half().cuda()`, and conforms to the AZ-297 Protocol — no engine compile, no engine deserialize, no calibration cache. Used as the numerical reference every fancier strategy is measured against (engine simplicity rule), and as the only viable runtime for Tier-1 workstation Docker (where TRT installation is non-trivial).
-**Complexity**: 2 points
-**Dependencies**: AZ-297_c7_runtime_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
-**Component**: c7_inference (epic AZ-249 / E-C7)
-**Tracker**: AZ-300
-**Epic**: AZ-249 (E-C7)
-
-### Document Dependencies
-
- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config carries the PyTorch checkpoint paths and the runtime selection.
-
-## Problem
-
-A "simple baseline" is mandatory because:
-
- Without a numerical reference, the FP16 / INT8 outputs from `TensorrtRuntime` (AZ-298) and `OnnxTrtEpRuntime` (AZ-299) cannot be sanity-checked. Every strategy must produce results that agree with the PyTorch reference within a documented tolerance.
- Tier-1 workstation Docker runs research / debugging / training-vs-deployed comparison workloads where TRT is not installed. Without `PytorchFp16Runtime`, these workflows have no executable path through the C7 component.
 - The ENG-RULE (engine simplicity) demands that every complex strategy can be ablated to a simple one; PyTorch is that simple one.
-
-Without this task, the AZ-297 Protocol has only fancy implementations and no ground truth.
-
-## Outcome
-
- A `PytorchFp16Runtime` class at `src/gps_denied_onboard/components/c7_inference/pytorch_fp16_runtime.py` conforming to the AZ-297 Protocol; `current_runtime_label() == "pytorch_fp16"`.
- `compile_engine` is a no-op — returns an `EngineCacheEntry` with `engine_path` set to the source PyTorch checkpoint path (`.pt` / `.pth`). The `(SM, JP, TRT, precision)` tuple is set to `(None, None, None, "fp16")` since PyTorch is hardware-portable across SM levels.
 - `deserialize_engine(EngineCacheEntry) -> EngineHandle` loads the checkpoint with `torch.load(map_location="cuda")`, applies the state dict to the architecture looked up in the registry (see Constraints), calls `.half().cuda().eval()`, and returns a `PytorchEngineHandle` wrapping the model.
 - `infer(handle, inputs) -> outputs` does a synchronous GPU forward pass under `torch.no_grad() + torch.inference_mode()`: it converts the input numpy arrays to `torch.Tensor.half().cuda()`, runs the forward, and converts the outputs back to numpy. No torch.compile, no scripting, no tracing: straight eager FP16.
- `release_engine(handle)` deletes the model reference and calls `torch.cuda.empty_cache()` to free GPU memory.
- `thermal_state()` delegates to the constructor-injected `ThermalStatePublisher` (AZ-302).
- `BUILD_PYTORCH_RUNTIME=ON` is the default Tier-1 setting per ADR-002; airborne (Tier-2 default Jetson) is OFF; operator (`BUILD_C10_PROVISIONING=ON`) is OFF; replay (Tier-3) is OFF — but airborne can still load this strategy IF an operator explicitly switches the config.
-
-## Scope
-
-### Included
-
- `PytorchFp16Runtime` class implementation conforming to the AZ-297 Protocol.
- `compile_engine`: no-op; returns `EngineCacheEntry` whose `engine_path` is the checkpoint path; sha256 computed via AZ-280 (the helper, but invoked transitively — this task does not directly depend on AZ-280's API; the entry is built using stdlib hashing and the same algorithm).
- `deserialize_engine`: `torch.load(map_location="cuda") → .half().cuda().eval() → wrap → return`. Single warm-up forward with zero-shaped input to allocate buffers.
- `infer`: input dict → `{name: torch.from_numpy(arr).half().cuda() for name, arr in inputs.items()}` → forward pass under `torch.no_grad() + torch.inference_mode()` → output dict via `.cpu().numpy()`. Synchronous (the sync barrier is implicit in the `.cpu()` transfer).
- `release_engine`: drop the model reference, call `torch.cuda.empty_cache()`.
- Diagnostic INFO log on `deserialize_engine` with checkpoint path + parameter count + estimated GPU footprint (`sum(p.numel() * p.element_size() for p in model.parameters())`).
- Per-frame DEBUG log on `infer` (off by default, gated by config).
- Error envelope: `EngineDeserializeError` (checkpoint missing or incompatible state dict), `InferenceError` (forward-pass exception), `OutOfMemoryError` (CUDA OOM during forward).
- Constructor accepts a `ThermalStatePublisher` reference for the `thermal_state()` delegation.
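
As a reading aid for the `compile_engine` bullet above, here is a minimal sketch of what a no-op compile could look like. `EngineCacheEntrySketch` and its field names (`sm`, `jetpack`, `trt`) are hypothetical stand-ins; the real `EngineCacheEntry` DTO is defined elsewhere and may differ. Only stdlib hashing is used, as the bullet describes.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class EngineCacheEntrySketch:
    """Hypothetical stand-in for the real EngineCacheEntry DTO."""

    engine_path: Path
    sha256: str
    sm: Optional[str] = None       # no SM level: PyTorch is portable across them
    jetpack: Optional[str] = None  # no JetPack pin
    trt: Optional[str] = None      # no TensorRT version
    precision: str = "fp16"


def compile_engine_noop(checkpoint_path: Path) -> EngineCacheEntrySketch:
    """Return an entry that points at the checkpoint itself; nothing is compiled."""
    digest = hashlib.sha256()
    with checkpoint_path.open("rb") as fh:
        # Hash in 1 MiB chunks so large checkpoints are not read into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return EngineCacheEntrySketch(engine_path=checkpoint_path, sha256=digest.hexdigest())
```

Because the function only reads and hashes the file, its cost scales with checkpoint size rather than with any build step.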
-
-### Excluded
-
- AZ-298 TensorrtRuntime — separate task.
- AZ-299 OnnxTrtEpRuntime — separate task.
- AZ-301 EngineGate (no engine binaries to validate; PyTorch is checkpoint-based, not engine-based).
- AZ-302 ThermalState polling — delegated.
- `torch.compile` / `torch.jit.trace` / `torch.jit.script` — explicitly out of scope; this is the SIMPLE baseline.
- Mixed-precision autocast — explicitly FP16 only; no `torch.cuda.amp.autocast`.
- Multi-GPU support — single Jetson GPU only.
- Engine cache for PyTorch — there is no engine cache; the checkpoint IS the artifact.
-
-## Acceptance Criteria
-
-**AC-1: Protocol conformance**
-Given `runtime_checkable(InferenceRuntime)`
-When `isinstance(PytorchFp16Runtime(...), InferenceRuntime)` is evaluated
-Then result is `True`; `current_runtime_label() == "pytorch_fp16"`
-
-**AC-2: compile_engine is a no-op**
-Given a checkpoint path on disk
-When `compile_engine(path, build_config)` is called
-Then no `.engine` file is produced; the returned `EngineCacheEntry` has `engine_path == path` and the `(SM, JP, TRT)` tuple components are `None`; the call returns within 100 ms
-
-**AC-3: deserialize loads, half-casts, GPU-moves, eval-mode**
-Given a valid checkpoint
-When `deserialize_engine(entry)` is called
-Then the loaded model has `model.training == False`; every parameter has `dtype == torch.float16` and `device.type == "cuda"`; one warm-up forward has succeeded; the returned handle is a `PytorchEngineHandle`
-
-**AC-4: infer produces numpy output dict matching the Protocol**
-Given a deserialised handle for an UltraVPR-shaped model and a fixed input numpy dict
-When `infer(handle, inputs)` is called
-Then the returned value is a `dict[str, np.ndarray]`; every output is FP16-cast or FP32-cast per the model's actual output dtypes (no silent type coercion); the numerical output is within a documented tolerance of the FP32 reference (when running the same model in FP32 mode for the test's reference path)
-
-**AC-5: release frees GPU memory**
-Given a deserialised handle holding K MB GPU memory
-When `release_engine(handle)` is called
-Then NVML reports K MB freed within 1 s (the freed memory may not return to OS immediately, but `torch.cuda.memory_allocated()` decreases to zero for that handle's allocations)
-
-**AC-6: missing checkpoint raises EngineDeserializeError**
-Given a non-existent checkpoint path
-When `deserialize_engine(entry)` is called
-Then `EngineDeserializeError` is raised with the path in the message; no GPU memory is allocated
-
-**AC-7: incompatible state dict raises EngineDeserializeError**
-Given a checkpoint whose state-dict keys do not match the architecture the runtime expects
-When `deserialize_engine` is called
-Then `EngineDeserializeError` is raised; the original `RuntimeError` from `load_state_dict(strict=True)` is preserved as `__cause__`
-
-**AC-8: CUDA OOM during infer surfaces as OutOfMemoryError**
-Given a deserialised model and an input tensor large enough to OOM
-When `infer(handle, inputs)` is called
-Then `OutOfMemoryError` is raised (rewrapped from `torch.cuda.OutOfMemoryError`); the model is NOT silently moved to CPU
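
AC-6 through AC-8 all rely on rewrapping a lower-level exception into the AZ-297 error family, and AC-7 additionally requires the original to survive as `__cause__`. One way this could be sketched is shown below. The exception class is a local stand-in declared for illustration, `rewrap` and `load_checked` are hypothetical helpers that the spec does not name, and `strict_key_check` is a toy substitute for `load_state_dict(strict=True)` so the example runs without PyTorch.

```python
from contextlib import contextmanager
from typing import Iterator, Set, Type


class EngineDeserializeError(Exception):
    """Local stand-in for the AZ-297 error of the same name."""


@contextmanager
def rewrap(source: Type[BaseException], target: Type[Exception], context: str) -> Iterator[None]:
    """Convert `source` exceptions raised in the block into `target`, chaining the original."""
    try:
        yield
    except source as exc:
        # `from exc` sets __cause__ on the new exception.
        raise target(f"{context}: {exc}") from exc


def strict_key_check(expected: Set[str], actual: Set[str]) -> None:
    """Toy substitute for a strict state-dict load: RuntimeError on any key mismatch."""
    if expected != actual:
        raise RuntimeError(f"key mismatch: {sorted(expected ^ actual)}")


def load_checked(path: str, expected: Set[str], actual: Set[str]) -> None:
    """Run the strict check and surface failures as EngineDeserializeError."""
    with rewrap(RuntimeError, EngineDeserializeError, f"cannot load {path}"):
        strict_key_check(expected, actual)
```

The same pattern would apply to the OOM case in AC-8 by passing the CUDA out-of-memory exception type as `source`.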
-
-## Non-Functional Requirements
-
-**Performance**
- Per-call latency budget is the simple-baseline reference; no specific p95 target. PyTorch FP16 typically runs 3–5× slower than TRT FP16 on Jetson; that is acceptable because this strategy is not the production-default airborne choice.
- `deserialize_engine` p95 ≤ 10 s on Tier-2 (checkpoint load + half-cast + GPU move + warm-up).
-
-**Compatibility**
- PyTorch version pinned to the project default; this task does NOT change it.
- Torch checkpoint format only — `.pt` / `.pth` files saved via `torch.save`.
-
-**Reliability**
- Errors rewrapped into the AZ-297 family.
- `eval()` mode is set unconditionally; this is the simple baseline, not a training runtime. Even if a checkpoint accidentally ships in training mode, the runtime forces `eval()`.
- `torch.no_grad()` + `torch.inference_mode()` are applied inside `infer`; the forward never accumulates gradients.
-
-## Unit Tests
-
-| AC Ref | What to Test | Required Outcome |
-|--------|-------------|-----------------|
-| AC-1 | Protocol conformance + label | `isinstance` True; label match |
-| AC-2 | compile_engine returns quickly with checkpoint path | No `.engine` produced; entry shape correct; ≤ 100 ms |
-| AC-3 | deserialize a small test model | `model.training` is False; FP16 dtype on GPU; warm-up succeeded |
-| AC-4 | infer numerical comparison vs FP32 reference | Output within tolerance |
-| AC-5 | release after deserialise | NVML / `torch.cuda.memory_allocated` shows freed |
-| AC-6 | deserialise non-existent path | `EngineDeserializeError`; no GPU alloc |
-| AC-7 | deserialise mismatched state dict | `EngineDeserializeError`; `__cause__` preserved |
-| AC-8 | infer with deliberately oversized input | `OutOfMemoryError`; no CPU fallback |
-| NFR-perf-deserialize | Microbench deserialise × 5 | p95 ≤ 10 s on Tier-2 |
-| NFR-reliability-eval-mode | After deserialise, check `model.training` | False unconditionally |
-
-## Constraints
-
- PyTorch version pinned at project default.
- Eager FP16 only — no `torch.compile`, no JIT, no autocast. The "simple baseline" is in the name.
- The model architecture loader is per-backbone; this task wires in a registry mapping `model_name` (from `BuildConfig`) to the architecture class. The registry is populated by per-backbone modules (UltraVPR, LightGlue, etc.) — that registry is owned by E-C2 / E-C2.5 / E-C3 / E-C3.5 component code and outside this task's scope; this task only provides the mechanism (a single dict registered at composition time).
- The `PytorchEngineHandle` is opaque.
- This task introduces no new third-party dependencies beyond what PyTorch already requires.
-
-## Risks & Mitigation
-
-**Risk 1: Model architecture registry leaks across components**
 - *Risk*: To load an `UltraVPR` checkpoint, this runtime needs the `UltraVPR` class. Importing it from `c2_vpr.ultra_vpr` would create a back-edge from C7 (Layer 2) to C2 (Layer 3) violating module-layout layering.
- *Mitigation*: The composition root registers each backbone class into `PytorchFp16Runtime`'s registry at startup (dependency injection). The runtime never imports component code directly. The injection is wired in `runtime_root` (Layer 5), which is allowed to depend on every layer.
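
The injection described in this mitigation can be illustrated with a small torch-free registry. This is a sketch under assumptions: the class name `ArchitectureRegistrySketch` and its `register` / `build` methods are hypothetical, and the zero-argument factory shape is taken from the commit message's mention of a "zero-arg registry". The as-built `register_architecture` / `default_registry` API may look different.

```python
from typing import Callable, Dict

# A factory takes no arguments and returns a freshly constructed model object.
ArchitectureFactory = Callable[[], object]


class ArchitectureRegistrySketch:
    """Maps a model name to a factory; holds no imports of per-backbone code."""

    def __init__(self) -> None:
        self._factories: Dict[str, ArchitectureFactory] = {}

    def register(self, model_name: str, factory: ArchitectureFactory) -> None:
        """Called by the composition root, which may import every layer."""
        if model_name in self._factories:
            raise ValueError(f"architecture already registered: {model_name!r}")
        self._factories[model_name] = factory

    def build(self, model_name: str) -> object:
        """Called by the runtime; it only ever sees the name and the factory."""
        try:
            factory = self._factories[model_name]
        except KeyError as exc:
            raise LookupError(f"no architecture registered for {model_name!r}") from exc
        return factory()
```

With this shape the runtime module depends only on the registry type, while the composition root performs the imports of backbone classes and passes them in as factories.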
-
-**Risk 2: Checkpoint deserialization is a security risk**
- *Risk*: `torch.load` can execute arbitrary code via pickle when loading untrusted checkpoints.
 - *Mitigation*: Checkpoints are cosigned via the deployment manifest's signature (per `_docs/02_document/risk_mitigations.md`). This task calls `torch.load(weights_only=True)`, which restricts unpickling to known-safe types. `weights_only=True` is the default only from PyTorch 2.6 onward, so passing it explicitly keeps the behaviour the same on earlier 2.x releases. A non-weights-only checkpoint raises `EngineDeserializeError`.
-
-**Risk 3: FP16 numerical mismatch with FP32 reference outside tolerance**
- *Risk*: Some model architectures lose accuracy when half-cast (FP16) without autocast.
- *Mitigation*: AC-4 documents the tolerance per model (recorded in the implementation report). If a backbone exceeds tolerance, the runtime is unfit for that backbone and the operator switches to TRT (which uses calibration to recover accuracy). The accuracy-vs-runtime trade-off is a per-backbone property documented during integration testing — this task accepts the result, does not work around it.
-
-## Runtime Completeness
-
- **Named capability**: PyTorch FP16 simple-baseline runtime (architecture / E-C7 / ENG-RULE).
- **Production code that must exist**: real `PytorchFp16Runtime` class implementing the AZ-297 Protocol; real `torch.load` + `.half().cuda().eval()` + sync forward; real release path.
- **Allowed external stubs**: tests MAY substitute a tiny `nn.Linear` checkpoint as the "model"; production wiring uses the actual backbones registered by the composition root.
- **Unacceptable substitutes**: a CPU-only mode (would defeat the GPU-first invariant the AZ-297 Protocol implies via `EngineHandle`); `torch.compile` (would silently change the simple-baseline contract); autocast (would change the "FP16 only" guarantee that downstream comparisons rely on).