mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 19:11:14 +00:00
[AZ-300] Implement PytorchFp16Runtime — C7 simple-baseline strategy

AZ-300 mandatory simple-baseline InferenceRuntime (eager FP16 PyTorch).
Implements the AZ-297 Protocol; current_runtime_label returns
"pytorch_fp16". Numerical reference every fancier C7 strategy (AZ-298
TRT, AZ-299 ORT) is measured against, and the only viable runtime for
Tier-1 workstation Docker where TRT is non-trivial to install.

Production code (new):
- components/c7_inference/pytorch_fp16_runtime.py — runtime +
  PytorchEngineHandle + output-shape adapter
- components/c7_inference/architecture_registry.py — torch-free
  register_architecture / default_registry / ArchitectureFactory
  (Risk-1 mitigation: no L2->L3 back-edge from C7 into per-backbone
  code)
- components/c7_inference/__init__.py — re-exports the registry
  mechanism. Still does NOT import the concrete strategy module
  (Invariant I-5)
- components/c7_inference/config.py — adds per_frame_debug_log bool
  field (gates the DEBUG per-frame latency log)

Tests (new): tests/unit/c7_inference/test_pytorch_fp16_runtime.py
covers AC-1..AC-8 + NFRs. AC-1/2/6/7 + thermal/release/registry
guards run unconditionally (17 tests); AC-3/4/5/8 +
NFR-perf-deserialize + NFR-reliability-eval-mode require CUDA and
skip on Tier-1 CI / macOS dev.

Tests (modified):
- test_protocol_conformance.py — narrowed
  test_ac5_build_inference_runtime_flag_on_but_module_missing
  parametrisation to exclude pytorch_fp16 (now-built); TRT / ORT
  still covered until AZ-298 / AZ-299 ship.

CI: .github/workflows/ci.yml lint + unit jobs now install
'-e .[dev,inference]' because mypy + pytest need torch + torchvision +
onnxruntime on the runner.

Three task-spec -> as-built deltas documented in
_docs/02_tasks/done/AZ-300_c7_pytorch_baseline.md Implementation Notes:
1. Constructor conforms to AZ-297 factory shape (config positional;
   thermal_publisher + registry + clock keyword-only optionals).
   AZ-302 will update the factory to thread thermal_publisher.
2. Architecture registry uses extras["model_name"] as lookup key
   (avoids touching the frozen BuildConfig / EngineCacheEntry DTOs).
3. Warm-up forward deferred to AZ-300 tier-2 follow-up — the zero-arg
   registry has no per-backbone input-shape metadata.

Suite: 1120 passed / 10 skipped (CUDA + Tier-2 + cmake / actionlint
environment gates). No regressions in non-c7_inference areas.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -1,161 +0,0 @@
# C7 PytorchFp16Runtime — Mandatory Simple Baseline

**Task**: AZ-300_c7_pytorch_baseline
**Name**: C7 PytorchFp16Runtime
**Description**: Implement `PytorchFp16Runtime`, the mandatory simple-baseline `InferenceRuntime` strategy. Loads each backbone's canonical PyTorch checkpoint, calls `.half().cuda()`, and conforms to the AZ-297 Protocol — no engine compile, no engine deserialize, no calibration cache. Used as the numerical reference every fancier strategy is measured against (engine simplicity rule), and as the only viable runtime for Tier-1 workstation Docker (where TRT installation is non-trivial).
**Complexity**: 2 points
**Dependencies**: AZ-297_c7_runtime_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
**Component**: c7_inference (epic AZ-249 / E-C7)
**Tracker**: AZ-300
**Epic**: AZ-249 (E-C7)

### Document Dependencies

- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config carries the PyTorch checkpoint paths and the runtime selection.

## Problem

A "simple baseline" is mandatory because:

- Without a numerical reference, the FP16 / INT8 outputs from `TensorrtRuntime` (AZ-298) and `OnnxTrtEpRuntime` (AZ-299) cannot be sanity-checked. Every strategy must produce results that agree with the PyTorch reference within a documented tolerance.
- Tier-1 workstation Docker runs research / debugging / training-vs-deployed comparison workloads where TRT is not installed. Without `PytorchFp16Runtime`, these workflows have no executable path through the C7 component.
- The ENG-RULE (engine simplicity) requires that every complex strategy can be ablated to a simple one; PyTorch is that simple one.

Without this task, the AZ-297 Protocol has only fancy implementations and no ground truth.

## Outcome

- A `PytorchFp16Runtime` class at `src/gps_denied_onboard/components/c7_inference/pytorch_fp16_runtime.py` conforming to the AZ-297 Protocol; `current_runtime_label() == "pytorch_fp16"`.
- `compile_engine` is a no-op — returns an `EngineCacheEntry` with `engine_path` set to the source PyTorch checkpoint path (`.pt` / `.pth`). The `(SM, JP, TRT, precision)` tuple is set to `(None, None, None, "fp16")` since PyTorch is hardware-portable across SM levels.
- `deserialize_engine(EngineCacheEntry) -> EngineHandle` loads the checkpoint with `torch.load(map_location="cuda")`, calls `.half().cuda().eval()`, and returns a `PytorchEngineHandle` wrapping the model.
- `infer(handle, inputs) -> outputs` runs a synchronous GPU forward pass under both `torch.no_grad()` and `torch.inference_mode()`; it converts the input numpy arrays to `torch.Tensor.half().cuda()`, runs the forward, and converts the outputs back to numpy. No `torch.compile`, no scripting, no tracing — straight eager FP16.
- `release_engine(handle)` deletes the model reference and calls `torch.cuda.empty_cache()` to free GPU memory.
- `thermal_state()` delegates to the constructor-injected `ThermalStatePublisher` (AZ-302).
- `BUILD_PYTORCH_RUNTIME=ON` is the default Tier-1 setting per ADR-002; airborne (Tier-2 default Jetson) is OFF; operator (`BUILD_C10_PROVISIONING=ON`) is OFF; replay (Tier-3) is OFF — but airborne can still load this strategy IF an operator explicitly switches the config.

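
The no-op `compile_engine` described above could look roughly like the following sketch. The `EngineCacheEntry` dataclass here is a hypothetical stand-in for the real AZ-297 DTO (its field names `sm`, `jetpack`, `trt`, `sha256` are assumptions), and `build_config` is accepted only to mirror the call shape used in AC-2.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass(frozen=True)
class EngineCacheEntry:
    """Hypothetical stand-in for the AZ-297 DTO; real field names may differ."""
    engine_path: Path
    sha256: str
    sm: Optional[str] = None
    jetpack: Optional[str] = None
    trt: Optional[str] = None
    precision: str = "fp16"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compile_engine(checkpoint_path: Path, build_config: Any = None) -> EngineCacheEntry:
    """No-op 'compile': nothing is built, the checkpoint itself is the artifact."""
    # build_config is unused in this sketch; it only mirrors the Protocol call shape.
    return EngineCacheEntry(
        engine_path=checkpoint_path,
        sha256=sha256_of_file(checkpoint_path),
    )
```

Because the function only reads and hashes the checkpoint, its cost is dominated by file I/O, which is what makes a tight latency bound on the call plausible for typical checkpoint sizes.
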
## Scope

### Included

- `PytorchFp16Runtime` class implementation conforming to the AZ-297 Protocol.
- `compile_engine`: no-op; returns an `EngineCacheEntry` whose `engine_path` is the checkpoint path. The sha256 is computed with stdlib hashing using the same algorithm as the AZ-280 helper, so this task does not depend directly on AZ-280's API.
- `deserialize_engine`: `torch.load(map_location="cuda") → .half().cuda().eval() → wrap → return`. Single warm-up forward with zero-shaped input to allocate buffers.
- `infer`: input dict → `{name: torch.from_numpy(arr).half().cuda() for name, arr in inputs.items()}` → forward pass under both `torch.no_grad()` and `torch.inference_mode()` → output dict via `.cpu().numpy()`. Synchronous (the sync barrier is implicit in the `.cpu()` transfer).
- `release_engine`: drop the model reference, call `torch.cuda.empty_cache()`.
- Diagnostic INFO log on `deserialize_engine` with checkpoint path + parameter count + estimated GPU footprint (`sum(p.numel() * p.element_size() for p in model.parameters())`).
- Per-frame DEBUG log on `infer` (off by default, gated by config).
- Error envelope: `EngineDeserializeError` (checkpoint missing or incompatible state dict), `InferenceError` (forward-pass exception), `OutOfMemoryError` (CUDA OOM during forward).
- Constructor accepts a `ThermalStatePublisher` reference for the `thermal_state()` delegation.

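
As an illustration of the `infer` path listed above, here is a minimal sketch. It assumes the model is already in `.half().cuda().eval()` state, accepts keyword tensors, and returns a dict of tensors; those calling conventions and the function name `infer_eager_fp16` are assumptions for this example, not the runtime's actual interface. It needs PyTorch, NumPy and a CUDA device to run.

```python
from typing import Any, Dict


def infer_eager_fp16(model: Any, inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Run one synchronous eager FP16 forward pass on the GPU.

    Assumes `model` takes keyword tensors and returns a dict of tensors
    (an assumption made for this sketch).
    """
    # Imported lazily so the enclosing module stays importable without torch.
    import torch

    # numpy -> torch, cast to FP16, move to the GPU.
    tensors = {
        name: torch.from_numpy(arr).half().cuda() for name, arr in inputs.items()
    }
    # Both context managers, as the spec asks; no gradients are recorded.
    with torch.no_grad(), torch.inference_mode():
        outputs = model(**tensors)
    # .cpu() blocks until the GPU work has finished: the implicit sync barrier.
    return {name: out.cpu().numpy() for name, out in outputs.items()}
```

The output dtypes are whatever the model produced; the sketch does not cast them, which matches the "no silent type coercion" wording in AC-4.
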
### Excluded

- AZ-298 TensorrtRuntime — separate task.
- AZ-299 OnnxTrtEpRuntime — separate task.
- AZ-301 EngineGate (no engine binaries to validate; PyTorch is checkpoint-based, not engine-based).
- AZ-302 ThermalState polling — delegated.
- `torch.compile` / `torch.jit.trace` / `torch.jit.script` — explicitly out of scope; this is the SIMPLE baseline.
- Mixed-precision autocast — explicitly FP16 only; no `torch.cuda.amp.autocast`.
- Multi-GPU support — single Jetson GPU only.
- Engine cache for PyTorch — there is no engine cache; the checkpoint IS the artifact.

## Acceptance Criteria

**AC-1: Protocol conformance**
Given `runtime_checkable(InferenceRuntime)`
When `isinstance(PytorchFp16Runtime(...), InferenceRuntime)` is evaluated
Then result is `True`; `current_runtime_label() == "pytorch_fp16"`

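
The AC-1 check relies on `typing.runtime_checkable`. The sketch below shows the mechanism with a hypothetical reduction of the AZ-297 Protocol to the method names mentioned in this spec; the parameter types and the `StubRuntime` class are illustrative assumptions.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class InferenceRuntime(Protocol):
    """Hypothetical reduction of the AZ-297 Protocol to the names used in this spec."""

    def current_runtime_label(self) -> str: ...
    def compile_engine(self, path: Any, build_config: Any) -> Any: ...
    def deserialize_engine(self, entry: Any) -> Any: ...
    def infer(self, handle: Any, inputs: Dict[str, Any]) -> Dict[str, Any]: ...
    def release_engine(self, handle: Any) -> None: ...
    def thermal_state(self) -> Any: ...


class StubRuntime:
    """Illustrative class with the right method names; bodies are placeholders."""

    def current_runtime_label(self) -> str:
        return "pytorch_fp16"

    def compile_engine(self, path: Any, build_config: Any) -> Any:
        return None

    def deserialize_engine(self, entry: Any) -> Any:
        return None

    def infer(self, handle: Any, inputs: Dict[str, Any]) -> Dict[str, Any]:
        return {}

    def release_engine(self, handle: Any) -> None:
        return None

    def thermal_state(self) -> Any:
        return None


runtime = StubRuntime()
conforms = isinstance(runtime, InferenceRuntime)
```

Note that a `runtime_checkable` `isinstance` check only verifies that the named members exist; it does not validate signatures or return types, so the label assertion in AC-1 remains a separate check.
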
**AC-2: compile_engine is a no-op**
Given a checkpoint path on disk
When `compile_engine(path, build_config)` is called
Then no `.engine` file is produced; the returned `EngineCacheEntry` has `engine_path == path` and the `(SM, JP, TRT)` tuple components are `None`; the call returns within 100 ms

**AC-3: deserialize loads, half-casts, GPU-moves, eval-mode**
Given a valid checkpoint
When `deserialize_engine(entry)` is called
Then the loaded model has `model.training == False`; every parameter has `dtype == torch.float16` and `device.type == "cuda"`; one warm-up forward has succeeded; the returned handle is a `PytorchEngineHandle`

**AC-4: infer produces numpy output dict matching the Protocol**
Given a deserialised handle for an UltraVPR-shaped model and a fixed input numpy dict
When `infer(handle, inputs)` is called
Then the returned value is a `dict[str, np.ndarray]`; every output is FP16-cast or FP32-cast per the model's actual output dtypes (no silent type coercion); the numerical output is within a documented tolerance of the FP32 reference (when running the same model in FP32 mode for the test's reference path)

**AC-5: release frees GPU memory**
Given a deserialised handle holding K MB GPU memory
When `release_engine(handle)` is called
Then NVML reports K MB freed within 1 s (the freed memory may not return to the OS immediately, but `torch.cuda.memory_allocated()` drops by the amount that handle had allocated)

**AC-6: missing checkpoint raises EngineDeserializeError**
Given a non-existent checkpoint path
When `deserialize_engine(entry)` is called
Then `EngineDeserializeError` is raised with the path in the message; no GPU memory is allocated

**AC-7: incompatible state dict raises EngineDeserializeError**
Given a checkpoint whose state-dict keys do not match the architecture the runtime expects
When `deserialize_engine` is called
Then `EngineDeserializeError` is raised; the original `RuntimeError` from `load_state_dict(strict=True)` is preserved as `__cause__`

**AC-8: CUDA OOM during infer surfaces as OutOfMemoryError**
Given a deserialised model and an input tensor large enough to OOM
When `infer(handle, inputs)` is called
Then `OutOfMemoryError` is raised (rewrapped from `torch.cuda.OutOfMemoryError`); the model is NOT silently moved to CPU

## Non-Functional Requirements

**Performance**
- Per-call latency budget is the simple-baseline reference; no specific p95 target. PyTorch FP16 typically runs 3–5× slower than TRT FP16 on Jetson; that is acceptable because this strategy is not the production-default airborne choice.
- `deserialize_engine` p95 ≤ 10 s on Tier-2 (checkpoint load + half-cast + GPU move + warm-up).

**Compatibility**
- PyTorch version pinned to the project default; this task does NOT change it.
- Torch checkpoint format only — `.pt` / `.pth` files saved via `torch.save`.

**Reliability**
- Errors rewrapped into the AZ-297 family.
- `eval()` mode is set unconditionally; this is the simple baseline, not a training runtime. Even if a checkpoint accidentally ships in training mode, the runtime forces `eval()`.
- `torch.no_grad()` + `torch.inference_mode()` are applied inside `infer`; the forward never accumulates gradients.

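
The "errors rewrapped into the AZ-297 family" requirement, together with the `__cause__` preservation asked for in AC-7, can be sketched with plain exception chaining. The base class `InferenceRuntimeError` and the helper `deserialize_with_envelope` are hypothetical names for this example; only `EngineDeserializeError` comes from the spec.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class InferenceRuntimeError(Exception):
    """Hypothetical common base for the AZ-297 error family."""


class EngineDeserializeError(InferenceRuntimeError):
    """Raised when a checkpoint is missing or does not fit the architecture."""


def deserialize_with_envelope(path: str, loader: Callable[[str], T]) -> T:
    """Call `loader(path)` and rewrap low-level failures.

    `raise ... from exc` stores the original exception on `__cause__`,
    so the underlying RuntimeError / OSError stays inspectable.
    """
    try:
        return loader(path)
    except (OSError, RuntimeError) as exc:
        raise EngineDeserializeError(
            f"failed to deserialize checkpoint {path!r}"
        ) from exc
```

A caller can then catch the AZ-297 family only, while tests still assert on `err.__cause__` to confirm which low-level failure triggered it.
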
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Protocol conformance + label | `isinstance` True; label match |
| AC-2 | compile_engine returns quickly with checkpoint path | No `.engine` produced; entry shape correct; ≤ 100 ms |
| AC-3 | deserialize a small test model | `model.training` is False; FP16 dtype on GPU; warm-up succeeded |
| AC-4 | infer numerical comparison vs FP32 reference | Output within tolerance |
| AC-5 | release after deserialise | NVML / `torch.cuda.memory_allocated` shows freed |
| AC-6 | deserialise non-existent path | `EngineDeserializeError`; no GPU alloc |
| AC-7 | deserialise mismatched state dict | `EngineDeserializeError`; `__cause__` preserved |
| AC-8 | infer with deliberately oversized input | `OutOfMemoryError`; no CPU fallback |
| NFR-perf-deserialize | Microbench deserialise × 5 | p95 ≤ 10 s on Tier-2 |
| NFR-reliability-eval-mode | After deserialise, check `model.training` | False unconditionally |

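
For the `NFR-perf-deserialize` row, a microbenchmark might be structured as in the sketch below. It uses the nearest-rank percentile definition, which is an assumption; the spec does not say which p95 definition applies. The helper names are hypothetical.

```python
import time
from typing import Callable, List, Sequence


def time_calls(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Return the wall-clock duration in seconds of each of `repeats` calls."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def p95_nearest_rank(samples: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method: rank = ceil(0.95 * n)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer arithmetic avoids floating-point surprises in the ceiling.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

With only five repetitions, the nearest-rank p95 is simply the slowest run, so under this definition the `× 5` microbench effectively asserts that no single deserialise exceeds the budget.
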
## Constraints

- PyTorch version pinned at project default.
- Eager FP16 only — no `torch.compile`, no JIT, no autocast. The "simple baseline" is in the name.
- The model architecture loader is per-backbone; this task wires in a registry mapping `model_name` (from `BuildConfig`) to the architecture class. The registry is populated by per-backbone modules (UltraVPR, LightGlue, etc.) — that registry is owned by E-C2 / E-C2.5 / E-C3 / E-C3.5 component code and outside this task's scope; this task only provides the mechanism (a single dict registered at composition time).
- The `PytorchEngineHandle` is opaque.
- This task introduces no new third-party dependencies beyond what PyTorch already requires.

## Risks & Mitigation

**Risk 1: Model architecture registry leaks across components**
- *Risk*: To load an `UltraVPR` checkpoint, this runtime needs the `UltraVPR` class. Importing it from `c2_vpr.ultra_vpr` would create a back-edge from C7 (Layer 2) to C2 (Layer 3), violating module-layout layering.
- *Mitigation*: The composition root registers each backbone class into `PytorchFp16Runtime`'s registry at startup (dependency injection). The runtime never imports component code directly. The injection is wired in `runtime_root` (Layer 5), which is allowed to depend on every layer.

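
The dependency-injection mitigation for Risk 1 amounts to a small torch-free registry that the composition root fills in. The sketch below is one possible shape; the class name `ArchitectureRegistry` and its method names are assumptions for illustration and are not taken from the as-built module.

```python
from typing import Any, Callable, Dict

# A zero-argument callable that builds an un-loaded model instance.
ArchitectureFactory = Callable[[], Any]


class ArchitectureRegistry:
    """Maps a model name to a factory; holds no imports of component code."""

    def __init__(self) -> None:
        self._factories: Dict[str, ArchitectureFactory] = {}

    def register(self, model_name: str, factory: ArchitectureFactory) -> None:
        """Called by the composition root, which may import any layer."""
        if model_name in self._factories:
            raise ValueError(f"architecture already registered: {model_name!r}")
        self._factories[model_name] = factory

    def create(self, model_name: str) -> Any:
        """Called by the runtime; it only ever sees the name and the factory."""
        try:
            factory = self._factories[model_name]
        except KeyError:
            raise KeyError(f"no architecture registered for {model_name!r}") from None
        return factory()
```

In this arrangement the runtime module depends only on the registry type, while the layer that knows the concrete backbone classes performs the `register` calls, so no import edge runs from C7 into per-backbone code.
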
**Risk 2: Checkpoint deserialization is a security risk**
- *Risk*: `torch.load` can execute arbitrary code via pickle when loading untrusted checkpoints.
- *Mitigation*: Checkpoints are cosigned via the deployment manifest's signature (per `_docs/02_document/risk_mitigations.md`). This task uses `torch.load(weights_only=True)` (the default from PyTorch 2.6 onward), which restricts unpickling to an allowlist of known-safe types. A non-weights-only checkpoint raises `EngineDeserializeError`.

**Risk 3: FP16 numerical mismatch with FP32 reference outside tolerance**
- *Risk*: Some model architectures lose accuracy when half-cast (FP16) without autocast.
- *Mitigation*: AC-4 documents the tolerance per model (recorded in the implementation report). If a backbone exceeds tolerance, the runtime is unfit for that backbone and the operator switches to TRT (which uses calibration to recover accuracy). The accuracy-vs-runtime trade-off is a per-backbone property documented during integration testing — this task accepts the result, does not work around it.

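
To make the Risk 3 tolerance check concrete without a GPU, the sketch below rounds values to IEEE 754 half precision with the standard library's `struct` format `"e"` and compares them against the full-precision reference. The `rtol` / `atol` values used when calling it are placeholders; the per-model tolerances are whatever the implementation report records.

```python
import struct
from typing import Sequence


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def within_tolerance(
    reference: Sequence[float],
    candidate: Sequence[float],
    rtol: float,
    atol: float = 0.0,
) -> bool:
    """True if every |candidate - reference| <= atol + rtol * |reference|."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch between reference and candidate")
    return all(
        abs(c - r) <= atol + rtol * abs(r) for r, c in zip(reference, candidate)
    )


reference = [0.1, 1.0, 3.14159]
half_cast = [to_fp16(v) for v in reference]
loose_ok = within_tolerance(reference, half_cast, rtol=1e-3)
tight_ok = within_tolerance(reference, half_cast, rtol=1e-5)
```

This only models the rounding of individual values; a real backbone accumulates error across layers, which is why the tolerance has to be measured per model rather than derived from the format alone.
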
## Runtime Completeness

- **Named capability**: PyTorch FP16 simple-baseline runtime (architecture / E-C7 / ENG-RULE).
- **Production code that must exist**: real `PytorchFp16Runtime` class implementing the AZ-297 Protocol; real `torch.load` + `.half().cuda().eval()` + sync forward; real release path.
- **Allowed external stubs**: tests MAY substitute a tiny `nn.Linear` checkpoint as the "model"; production wiring uses the actual backbones registered by the composition root.
- **Unacceptable substitutes**: a CPU-only mode (would defeat the GPU-first invariant the AZ-297 Protocol implies via `EngineHandle`); `torch.compile` (would silently change the simple-baseline contract); autocast (would change the "FP16 only" guarantee that downstream comparisons rely on).