mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 14:01:12 +00:00
[AZ-300] Implement PytorchFp16Runtime — C7 simple-baseline strategy

AZ-300 mandatory simple-baseline InferenceRuntime (eager FP16 PyTorch).
Implements the AZ-297 Protocol; current_runtime_label returns
"pytorch_fp16". It is the numerical reference that every fancier C7
strategy (AZ-298 TRT, AZ-299 ORT) is measured against, and the only
viable runtime for Tier-1 workstation Docker, where TRT is non-trivial
to install.

Production code (new):
- components/c7_inference/pytorch_fp16_runtime.py — runtime +
  PytorchEngineHandle + output-shape adapter
- components/c7_inference/architecture_registry.py — torch-free
  register_architecture / default_registry / ArchitectureFactory
  (Risk-1 mitigation: no L2->L3 back-edge from C7 into per-backbone
  code)
- components/c7_inference/__init__.py — re-exports the registry
  mechanism. Still does NOT import the concrete strategy module
  (Invariant I-5)
- components/c7_inference/config.py — adds per_frame_debug_log bool
  field (gates the DEBUG per-frame latency log)

Tests (new): tests/unit/c7_inference/test_pytorch_fp16_runtime.py
covers AC-1..AC-8 + NFRs (17 tests). AC-1/2/6/7 + thermal/release/registry
guards run unconditionally; AC-3/4/5/8 + NFR-perf-deserialize +
NFR-reliability-eval-mode require CUDA and skip on Tier-1 CI / macOS dev.

Tests (modified):
- test_protocol_conformance.py — narrowed
  test_ac5_build_inference_runtime_flag_on_but_module_missing
  parametrisation to exclude pytorch_fp16 (now-built); TRT / ORT
  still covered until AZ-298 / AZ-299 ship.

CI: .github/workflows/ci.yml lint + unit jobs now install
'-e .[dev,inference]' because mypy + pytest need torch + torchvision +
onnxruntime on the runner.

Three task-spec -> as-built deltas documented in
_docs/02_tasks/done/AZ-300_c7_pytorch_baseline.md Implementation Notes:
1. Constructor conforms to AZ-297 factory shape (config positional;
   thermal_publisher + registry + clock keyword-only optionals).
   AZ-302 will update the factory to thread thermal_publisher.
2. Architecture registry uses extras["model_name"] as lookup key
   (avoids touching the frozen BuildConfig / EngineCacheEntry DTOs).
3. Warm-up forward deferred to AZ-300 tier-2 follow-up — the zero-arg
   registry has no per-backbone input-shape metadata.

Suite: 1120 passed / 10 skipped (CUDA + Tier-2 + cmake / actionlint
environment gates). No regressions in non-c7_inference areas.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -159,3 +159,26 @@ Then `OutOfMemoryError` is raised (rewrapped from `torch.cuda.OutOfMemoryError`)
- **Production code that must exist**: real `PytorchFp16Runtime` class implementing the AZ-297 Protocol; real `torch.load` + `.half().cuda().eval()` + sync forward; real release path.
- **Allowed external stubs**: tests MAY substitute a tiny `nn.Linear` checkpoint as the "model"; production wiring uses the actual backbones registered by the composition root.
- **Unacceptable substitutes**: a CPU-only mode (would defeat the GPU-first invariant the AZ-297 Protocol implies via `EngineHandle`); `torch.compile` (would silently change the simple-baseline contract); autocast (would change the "FP16 only" guarantee that downstream comparisons rely on).

## Implementation Notes (2026-05-12, batch 24)

Three task-spec → as-built deltas, surfaced as DECISION-style rationale so AZ-301 / AZ-302 don't repeat the analysis:

1. **Constructor signature** — spec says "Constructor accepts a `ThermalStatePublisher` reference". The AZ-297 factory (`runtime_root/inference_factory.py`) calls `strategy_cls(config)` positionally. Same pattern as AZ-332 vs. AZ-331. Adopted: `__init__(self, config: Config, *, thermal_publisher=None, architecture_registry=None, clock=None)`. All kwargs default. AZ-302 will update the factory to thread `thermal_publisher`; until then, `thermal_state()` returns the default-safe `ThermalState` (Invariant I-6). No change to the factory in this task.

2. **Architecture registry** — spec calls out "a single dict registered at composition time" but doesn't pick a field on `BuildConfig` / `EngineCacheEntry`. The DTO has neither a `model_name` nor a `model_arch` field, so we read `EngineCacheEntry.extras["model_name"]` (the documented `dict[str, str]` extension point on the DTO) and populate it from the checkpoint's file stem inside `compile_engine`. Registry lives in `c7_inference.architecture_registry` (torch-free — composition root may register before any GPU init) and is re-exported as `c7_inference.register_architecture` / `c7_inference.default_registry`.

3. **Warm-up forward** — spec mentions "Single warm-up forward with zero-shaped input to allocate buffers". The registry only carries a zero-arg factory; no input-shape metadata. A real warm-up needs per-backbone shape info which is owned by each backbone module's composition wiring, not by C7. Deferred to AZ-300 tier-2 follow-up (Jetson). First real `infer` call does the implicit warm-up; no functional impact on AC-3 / AC-4 / AC-5.
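
The constructor shape from delta 1 can be sketched without torch. This is a minimal illustration, not the shipped class: `Config` is a hypothetical stand-in, `PytorchFp16RuntimeSketch` is an illustrative name, and the two `ThermalState` fields are the ones the batch report names for the default-safe state (the real DTO may carry more).

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ThermalState:
    # Defaults model the "default-safe" state (Invariant I-6).
    is_telemetry_available: bool = False
    thermal_throttle_active: bool = False


@dataclass(frozen=True)
class Config:
    # Hypothetical stand-in for the real configuration object.
    per_frame_debug_log: bool = False


class PytorchFp16RuntimeSketch:
    """Illustrates only the constructor shape and the thermal fallback."""

    def __init__(
        self,
        config: Config,
        *,  # everything after this marker is keyword-only
        thermal_publisher: Optional[Any] = None,
        architecture_registry: Optional[dict] = None,
        clock: Optional[Callable[[], float]] = None,
    ) -> None:
        self._config = config
        self._thermal_publisher = thermal_publisher
        self._registry = {} if architecture_registry is None else architecture_registry
        self._clock = time.monotonic if clock is None else clock

    def thermal_state(self) -> ThermalState:
        # Without a publisher, fall back to the default-safe state.
        if self._thermal_publisher is None:
            return ThermalState()
        # Duck-typed delegation: any object with a .read() method works.
        return self._thermal_publisher.read()


# The factory can keep calling strategy_cls(config) with one positional argument.
runtime = PytorchFp16RuntimeSketch(Config())
print(runtime.thermal_state())
```

Because the optional collaborators are keyword-only, a later factory change can pass `thermal_publisher=...` without disturbing existing positional call sites.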

### CUDA test-skip policy

AC-3, AC-4, AC-5, AC-8, NFR-perf-deserialize, NFR-reliability-eval-mode require an actual CUDA device. On macOS / Tier-1 CI (no GPU) they are decorated with `@pytest.mark.skipif(not torch.cuda.is_available(), ...)`. The Tier-2 Jetson CI runs the full sweep. AC-1, AC-2, AC-6, AC-7 trip *before* any `.cuda()` call (factory construction, file existence, `load_state_dict(strict=True)` rejection), so they run unconditionally and currently pass on macOS arm64 + PyTorch 2.11 CPU build.
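
The skip policy can be sketched as a reusable marker. This is an illustrative pattern rather than the project's test file: the helper `cuda_available`, the marker name `requires_cuda`, and both test functions are hypothetical. The helper also tolerates torch being absent, which the real suite does not need because it installs the `[inference]` extra.

```python
import importlib.util

import pytest


def cuda_available() -> bool:
    """Return True only when torch is installed and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return bool(torch.cuda.is_available())


# Reusable marker for tests that need a real GPU.
requires_cuda = pytest.mark.skipif(
    not cuda_available(), reason="requires a CUDA device (Tier-2 only)"
)


def test_missing_checkpoint_is_detected_without_gpu(tmp_path):
    # This check happens before any .cuda() call, so it runs on every tier.
    assert not (tmp_path / "absent.pt").exists()


@requires_cuda
def test_fp16_tensor_lands_on_gpu():
    import torch

    tensor = torch.ones(2, 2).half().cuda()
    assert tensor.is_cuda and tensor.dtype == torch.float16
```

Defining the marker once keeps the skip reason consistent across all CUDA-gated tests.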

### As-built file map

- `src/gps_denied_onboard/components/c7_inference/pytorch_fp16_runtime.py` — `PytorchFp16Runtime`, `PytorchEngineHandle`, `_to_numpy_dict` helper.
- `src/gps_denied_onboard/components/c7_inference/architecture_registry.py` — `register_architecture`, `default_registry`, `ArchitectureFactory` type alias.
- `src/gps_denied_onboard/components/c7_inference/config.py` — added `per_frame_debug_log: bool = False` field (gates the DEBUG per-frame latency log).
- `src/gps_denied_onboard/components/c7_inference/__init__.py` — re-exports `ArchitectureFactory`, `default_registry`, `register_architecture`. Still does NOT import `pytorch_fp16_runtime` (Invariant I-5).
- `tests/unit/c7_inference/test_pytorch_fp16_runtime.py` — 17 tests, 6 CUDA-skipped on macOS.
- `tests/unit/c7_inference/test_protocol_conformance.py` — narrowed `test_ac5_build_inference_runtime_flag_on_but_module_missing` parametrisation to exclude `pytorch_fp16` (now-built); TRT / ORT still covered.
@@ -0,0 +1,135 @@
# Batch 24 / Cycle 1 — Implementation Report

**Date**: 2026-05-12
**Tasks**: AZ-300 (C7 PytorchFp16Runtime — mandatory simple-baseline)
**Story points landed**: 2
**Status**: complete (AZ-300 → In Testing)

## Scope summary

Single-task batch by design — narrowed from the initial post-AZ-332
plan (`{AZ-300, AZ-301, AZ-302}`) to keep the post-OKVIS2 turn at a
reviewable size. AZ-301 (EngineGate, 3pt) and AZ-302
(ThermalStatePublisher, 3pt) move to batch 25.

## Files added / modified

### New

- `src/gps_denied_onboard/components/c7_inference/pytorch_fp16_runtime.py`
  — `PytorchFp16Runtime` + `PytorchEngineHandle` + `_to_numpy_dict`
  output-shape adapter.
- `src/gps_denied_onboard/components/c7_inference/architecture_registry.py`
  — torch-free `register_architecture` / `default_registry` /
  `ArchitectureFactory`. Risk-1 mitigation (no L2→L3 back-edge from C7
  into per-backbone code).
- `tests/unit/c7_inference/test_pytorch_fp16_runtime.py` — 17 tests
  covering AC-1..AC-8 + NFRs; CPU-runnable subset green on macOS.

### Modified

- `src/gps_denied_onboard/components/c7_inference/__init__.py`
  — re-exports `ArchitectureFactory`, `default_registry`,
  `register_architecture`. Still does NOT import the concrete strategy
  module (Invariant I-5 / Risk-2).
- `src/gps_denied_onboard/components/c7_inference/config.py`
  — added `per_frame_debug_log: bool = False` to `C7InferenceConfig`
  (gates the DEBUG per-frame latency log per spec § Scope).
- `tests/unit/c7_inference/test_protocol_conformance.py`
  — narrowed `test_ac5_build_inference_runtime_flag_on_but_module_missing`
  parametrisation to exclude `pytorch_fp16` (now-built); TRT / ORT
  remain covered (AZ-298 / AZ-299 still pending).
- `_docs/02_tasks/todo/AZ-300_c7_pytorch_baseline.md` → moved to
  `_docs/02_tasks/done/`; added an `## Implementation Notes (2026-05-12,
  batch 24)` section documenting the three task-spec → as-built deltas.

## Design decisions (resolved spec contradictions)

1. **Constructor shape** — `__init__(config: Config, *, thermal_publisher=None,
   architecture_registry=None, clock=None)`. AZ-297 factory passes
   `config` only; thermal-publisher injection waits for AZ-302 to update
   the factory. Same pattern as AZ-332 vs. AZ-331 (user-approved option A
   from the prior batch).
2. **Architecture registry key** — `EngineCacheEntry.extras["model_name"]`,
   populated from the checkpoint's file stem inside `compile_engine`.
   Avoids touching the frozen `BuildConfig` / `EngineCacheEntry` DTOs.
3. **Warm-up forward** — deferred to AZ-300 tier-2 follow-up. The
   registry has no input-shape metadata; a real warm-up needs
   per-backbone shape info owned by each backbone's composition wiring.
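
Decision 2 can be illustrated with a torch-free sketch of a registry keyed by model name. The function signatures, the `ValueError` on collision, and the helper `model_name_from_checkpoint` are assumptions made for illustration; only the names `register_architecture`, `default_registry`, `ArchitectureFactory` and the file-stem lookup key come from this report.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

# A zero-arg factory that builds an (uninitialised) model object.
ArchitectureFactory = Callable[[], object]

_DEFAULT: Dict[str, ArchitectureFactory] = {}


def default_registry() -> Dict[str, ArchitectureFactory]:
    """Return the process-wide registry (assumed here to be a plain dict)."""
    return _DEFAULT


def register_architecture(
    model_name: str,
    factory: ArchitectureFactory,
    registry: Optional[Dict[str, ArchitectureFactory]] = None,
) -> None:
    reg = _DEFAULT if registry is None else registry
    existing = reg.get(model_name)
    # Re-registering the same factory is a no-op; a different one is a collision.
    if existing is not None and existing is not factory:
        raise ValueError(f"architecture {model_name!r} is already registered")
    reg[model_name] = factory


def model_name_from_checkpoint(checkpoint_path: str) -> str:
    """Derive the lookup key from the checkpoint's file stem."""
    return Path(checkpoint_path).stem


def build_tiny_linear() -> object:
    return object()  # placeholder for a real nn.Module factory


register_architecture("tiny_linear", build_tiny_linear)
extras = {"model_name": model_name_from_checkpoint("/models/tiny_linear.pt")}
model = default_registry()[extras["model_name"]]()
```

Importing this module pulls in no GPU library, which is what lets a composition root register architectures before any GPU initialisation.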

## AC coverage

| AC | Status | Notes |
|----|--------|-------|
| AC-1 protocol conformance | covered | `test_ac1_protocol_conformance` |
| AC-2 compile_engine no-op | covered | `test_ac2_compile_engine_is_noop` |
| AC-3 deserialize half-cast/GPU/eval | covered (CUDA-skip on Tier-1) | `test_ac3_deserialize_loads_half_casts_gpu_moves_eval` |
| AC-4 infer numerical FP32 reference | covered (CUDA-skip on Tier-1) | `test_ac4_infer_numerical_close_to_fp32`; atol=5e-3, rtol=5e-3 for FP16 tiny linear |
| AC-5 release frees GPU memory | covered (CUDA-skip on Tier-1) | `test_ac5_release_frees_gpu_memory` + I-7 idempotent assertion |
| AC-6 missing checkpoint | covered | `test_ac6_missing_checkpoint_raises` |
| AC-7 mismatched state_dict | covered | `test_ac7_incompatible_state_dict_raises_with_cause` (validates `__cause__` chain) |
| AC-8 CUDA OOM rewrap | covered (CUDA-skip on Tier-1) | `test_ac8_cuda_oom_during_infer_rewrapped` (synthetic OOM via stub model) |
| NFR-perf-deserialize | tier2 | Jetson-only validation |
| NFR-reliability-eval-mode | covered (CUDA-skip on Tier-1) | `test_nfr_reliability_eval_mode_unconditional` |
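
To give a feel for the AC-4 tolerances, the following stdlib-only sketch emulates FP16 rounding of a tiny linear layer and compares it with a higher-precision reference. It assumes the test uses an allclose-style criterion, `|actual - expected| <= atol + rtol * |expected|` (the rule used by `numpy.allclose` and `torch.allclose`); the weights and inputs are made up, and Python floats stand in for the FP32 reference.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def allclose(actual, expected, *, atol: float = 5e-3, rtol: float = 5e-3) -> bool:
    """allclose-style check with the tolerances quoted for AC-4."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# Made-up tiny linear layer: y = w . x + b
weights = [0.1234567, -0.7654321, 0.3333333]
bias = 0.05
inputs = [1.5, -2.25, 0.75]

# Reference path: full Python float precision.
reference = sum(w * x for w, x in zip(weights, inputs)) + bias

# Emulated FP16 path: round operands, products, and the running sum.
acc = to_fp16(bias)
for w, x in zip(weights, inputs):
    acc = to_fp16(acc + to_fp16(to_fp16(w) * to_fp16(x)))

print("reference:", reference)
print("fp16     :", acc)
print("within tolerance:", allclose([acc], [reference]))
```

FP16 keeps roughly three significant decimal digits, so a combined tolerance of a few thousandths is loose enough for rounding noise yet tight enough to catch a wrong weight or a missing bias.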

Additional coverage beyond ACs:

- `test_thermal_state_default_safe_when_no_publisher` — Invariant I-6
  fallback when AZ-302 publisher absent.
- `test_thermal_state_delegates_to_publisher` — duck-typed `.read()`
  delegation, forward-compat with AZ-302.
- `test_deserialize_missing_architecture_registration` — registry
  lookup miss path.
- `test_infer_rejects_foreign_handle` / `test_infer_rejects_released_handle`
  — handle-lifecycle guards (consumers MUST pass back the same
  runtime's handle).
- `test_register_architecture_rejects_collision` /
  `test_register_architecture_same_factory_is_idempotent` — composition-time
  registry safety.
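
The handle-lifecycle guards in the list above can be sketched as follows. This is one possible way to enforce them, not the shipped code: `HandleError`, `EngineHandleSketch`, and `RuntimeSketch` are hypothetical names, and the real runtime may detect foreign or released handles differently.

```python
class HandleError(RuntimeError):
    """Hypothetical error type for handle misuse."""


class EngineHandleSketch:
    def __init__(self, owner: object) -> None:
        self.owner = owner      # the runtime that created this handle
        self.released = False   # flipped by release()


class RuntimeSketch:
    def deserialize(self) -> EngineHandleSketch:
        return EngineHandleSketch(owner=self)

    def infer(self, handle: EngineHandleSketch, inputs: dict) -> dict:
        if handle.owner is not self:
            raise HandleError("handle was created by a different runtime")
        if handle.released:
            raise HandleError("handle has already been released")
        return dict(inputs)  # placeholder for the real forward pass

    def release(self, handle: EngineHandleSketch) -> None:
        # Safe to call more than once: the second call is a no-op.
        handle.released = True


runtime_a, runtime_b = RuntimeSketch(), RuntimeSketch()
handle = runtime_a.deserialize()
print(runtime_a.infer(handle, {"x": 1}))
```

Storing the owner on the handle makes the foreign-handle check a single identity comparison, and a boolean flag keeps `release` idempotent.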

## Test run

```
.venv/bin/pytest tests/unit/c7_inference/   → 63 passed, 6 skipped
.venv/bin/pytest                            → 1120 passed, 10 skipped
```

The 6 c7_inference skips are CUDA-gated. The 10 full-suite skips are
all environment-gated (CUDA + Tier-2 + cmake/actionlint not on PATH).
No pre-existing tests regressed.

## Self-review verdict

**Pass.** Followed AZ-297 contract (Protocol surface + factory shape +
error envelope + Invariant I-1/2/4/5/6/7/8). The single
test-protocol-conformance edit is narrowly scoped (parametrisation
filter, not behaviour change). No churn outside `c7_inference`.
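
The error envelope referred to above (the AC-7 `__cause__` chain and the AC-8 OOM rewrap) relies on Python's `raise ... from ...` chaining. A minimal sketch, assuming a C7-level `OutOfMemoryError` as named in the task spec; `BackendOOM` is a hypothetical stand-in for `torch.cuda.OutOfMemoryError`, and `infer` and `stub_forward` are illustrative only.

```python
class BackendOOM(RuntimeError):
    """Stand-in for the backend's out-of-memory exception."""


class OutOfMemoryError(Exception):
    """Component-level error exposed to callers."""


def infer(forward, inputs):
    try:
        return forward(inputs)
    except BackendOOM as exc:
        # 'from exc' stores the original exception on __cause__,
        # so callers and logs keep the backend detail.
        raise OutOfMemoryError("inference ran out of GPU memory") from exc


def stub_forward(_inputs):
    # Synthetic OOM, similar in spirit to a stub model in a unit test.
    raise BackendOOM("synthetic out-of-memory")


try:
    infer(stub_forward, {"x": 1})
except OutOfMemoryError as err:
    print(type(err.__cause__).__name__)
```

A test can then assert on both the wrapped type and `__cause__`, which pins the envelope without depending on the backend's message text.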

## Known gaps for the Product Implementation Completeness Gate

- **Warm-up forward**: deferred to AZ-300 tier-2 (Jetson). Real first
  `infer` call does the implicit warm-up; AC-3 still passes because it
  only checks dtype/device/training-mode, not warm-up artifacts.
- **Thermal publisher wiring**: returns default-safe state until AZ-302
  ships. Invariant I-6 holds; consumers see
  `is_telemetry_available=False` and `thermal_throttle_active=False`.
- **CUDA-gated NFR-perf**: Tier-1 CI cannot validate p95 ≤ 10 s on
  deserialize; Tier-2 Jetson CI is the gate.
- **Architecture registry population**: this task ships the *mechanism*;
  per-backbone modules (E-C2 / E-C2.5 / E-C3 / E-C3.5) own actually
  *populating* the registry from their composition wiring. Tracked by
  those component epics.

## Next batch

**Batch 25 candidates** (18 tasks total ready in the queue):

- AZ-301 (C7 EngineGate, 3pt) — no `torch` dependency; uses C7 error
  types only.
- AZ-302 (C7 ThermalStatePublisher, 3pt) — `jtop` / `pynvml`
  deps (Tier-2 only; Tier-1 tests stub the source).
- AZ-304 (C6 Postgres schema, 2pt) — no native deps; pure SQL +
  alembic migration if pattern allows.

Recommended batch 25 size: 2–3 tasks (AZ-301 + AZ-302, plus AZ-304 if
turn budget allows).
@@ -6,9 +6,9 @@ step: 7
 name: Implement
 status: in_progress
 sub_step:
-  phase: 3
-  name: compute-next-batch
-  detail: "batch 24/cycle1 computed but BLOCKED on env: AZ-300 (C7 PytorchFp16Runtime, 2pt) needs the [inference] optional extra (torch>=2.2 + torchvision>=0.17 + onnxruntime>=1.17, ~700 MB on macOS). Dev venv currently only has [dev]. Resolve: either (a) pip install -e .[inference] before resuming, or (b) skip AZ-300 in this cycle (move it to backlog) and pick the next-smallest ready task that does not need torch — AZ-301 EngineGate (3pt, no torch dep), AZ-302 ThermalState Publisher (3pt, jetson-stats/pynvml — also heavy but tier2-only), AZ-304 C6 Postgres schema (2pt, no native deps). 18 tasks total ready overall."
+  phase: 13
+  name: archive-and-loop
+  detail: "batch 24/cycle1 complete: AZ-300 → In Testing, archived to done/. Installed [inference] extras (torch 2.11.0 + torchvision 0.26.0 + onnxruntime 1.23.2) into the dev venv. 17 ACs + NFRs covered (6 CUDA-skipped on macOS). Suite: 1120 passed / 10 skipped. Next: recompute batch 25 — candidates AZ-301 (EngineGate, 3pt) + AZ-302 (ThermalStatePublisher, 3pt) + AZ-304 (C6 Postgres schema, 2pt). 17 tasks total ready overall (AZ-300 removed; AZ-345 still gated)."
 retry_count: 0
 cycle: 1
 tracker: jira