mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-21 23:01:13 +00:00
[AZ-321] C10 EngineCompiler: hardware-tied TRT compile + cache reuse
Land the C10 per-model engine compile + cache-reuse orchestrator.
`EngineCompiler.compile_engines_for_corpus(request)` walks the corpus,
computes the canonical engine filename via AZ-281
`EngineFilenameSchema.build`, and either reuses the cached binary
(cache hit, AZ-280 `Sha256Sidecar.verify` returns True) or delegates
to the AZ-297 `compile_engine` on the injected runtime (cache miss;
the runtime owns the write path). Returns one `EngineCompileResult`
per backbone carrying the canonical `EngineCacheEntry`, outcome
(BUILT / REUSED), and `compile_duration_s` (None on reuse).

Hardware-tied reuse (D-C10-6 / D-C10-7) falls out of the filename
schema: a host change rebuilds at the new path and leaves the old
files untouched (AC-4).

Design corrections vs. the task spec body:

- The spec proposed a c10-local `EngineCacheEntry` carrying outcome
  and duration; that name is already taken by the AZ-297 canonical
  DTO. The wrapper is renamed `EngineCompileResult`; the canonical
  shape wins.
- The spec called `InferenceRuntime.host_info()`, which is not in the
  AZ-297 Protocol. `HostCapabilities` is threaded through
  `EngineCompileRequest` instead so the composition root owns host
  probing and the compiler stays decoupled.
- The c10 layer cannot import `components.c7_inference` (arch rule
  `test_az270_compose_root.test_ac6`). `engine_compiler.py` defines
  `CompileEngineCallable`, a structural Protocol cut of
  `InferenceRuntime` exposing only `compile_engine`, and catches broad
  `Exception` (re-raising preserves the original type; `error_class`
  is recorded in the ERROR log payload).

Production

- engine_compiler.py: `CompileOutcome` enum, `BackboneSpec`,
  `EngineCompileRequest`, `EngineCompileResult`,
  `EngineCompileSummary` DTOs; `CompileEngineCallable` Protocol;
  `EngineCompiler` with the single public method.
- config.py: `BackboneConfig` + `C10ProvisioningConfig`
  (`workspace_mb` default 4 GiB to match C7 NFT-LIM-01); validate
  positive shape dims and duplicate model_name detection in
  `__post_init__`.
- runtime_root/c10_factory.py: `build_engine_compiler(config)` wires
  the existing `build_inference_runtime` factory through;
  `build_backbone_specs(config)` materialises the `BackboneSpec` tuple
  from the config block.
- components/c10_provisioning/__init__.py: re-exports the AZ-321
  surface and registers the new config block.

Tests

- test_engine_compiler.py: covers AC-1..AC-10 + missing-sidecar
  sibling case for AC-5. Tier-1 via fake runtime that writes through
  the REAL `Sha256Sidecar.write_atomic_and_sidecar`. Tier-2
  placeholders for the cache-hit p99 NFR (200 MB engine sweep) and
  kill-during-compile atomic-write NFR.

Docs

- module-layout.md: c10_provisioning Per-Component Mapping lists the
  new internal modules (engine_compiler.py, config.py), the
  composition-root c10_factory.py, the AZ-321 public re-export
  surface, and the registered config block.
- batch_33_cycle1_report.md + reviews/batch_33_review.md:
  PASS_WITH_WARNINGS (4 Low findings accepted).

Tests run: c10_provisioning 13 passing + 2 Tier-2 skips; combined unit
suite (excluding pending components) 543 passing, 21 env-skipped.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,218 @@
# Batch 33 / Cycle 1: Implementation Report

**Date**: 2026-05-12
**Tasks**: AZ-321 (C10 EngineCompiler: per-model TRT compile +
hardware-tied cache reuse + AZ-281 filename schema + AZ-280 sidecar
gate)
**Story points landed**: 5
**Status**: complete (AZ-321 → In Testing)

## Scope summary

Single-task batch landing the C10 per-model engine compile +
cache-reuse orchestrator. `EngineCompiler.compile_engines_for_corpus(req)`
walks the corpus, computes the canonical engine filename via AZ-281
`EngineFilenameSchema.build(...)`, and either reuses the cached
binary (cache hit; AZ-280 `Sha256Sidecar.verify` returns True) or
delegates to the AZ-297 `compile_engine` method on the injected
runtime (cache miss; the runtime owns the write path and the sidecar
emission). The orchestrator returns one `EngineCompileResult` per
backbone carrying the canonical `EngineCacheEntry`, the
`CompileOutcome.{BUILT,REUSED}` label, and the `compile_duration_s`
(None on reuse). Hardware-tied cache reuse (D-C10-6 / D-C10-7) falls
out naturally from the filename schema: an engine compiled on
`(sm=87, jp=6.2, trt=10.3, fp16)` lives at a different path than one
compiled on `(sm=89, jp=6.3, trt=10.5, fp16)`, so a hardware change
produces cache misses for the new device and leaves the old files
untouched (AC-4).
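
The hit-or-miss walk described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the landed `engine_compiler.py`: `CompileResultSketch`, `sidecar_ok`, `compile_corpus`, and `fake_compile` are hypothetical names, the host tag is passed in as a ready-made string instead of going through `EngineFilenameSchema.build`, and the sidecar check is a simplified stand-in for `Sha256Sidecar.verify`.

```python
import hashlib
import time
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Callable, List, Optional, Sequence


class CompileOutcome(Enum):
    BUILT = "built"    # enum values are assumptions
    REUSED = "reused"


@dataclass(frozen=True)
class CompileResultSketch:
    """Hypothetical, simplified stand-in for EngineCompileResult."""
    engine_path: Path
    outcome: CompileOutcome
    compile_duration_s: Optional[float]


def sidecar_ok(engine: Path) -> bool:
    """Simplified stand-in for Sha256Sidecar.verify (missing files count as a miss)."""
    sidecar = engine.with_name(engine.name + ".sha256")
    if not (engine.is_file() and sidecar.is_file()):
        return False
    digest = hashlib.sha256(engine.read_bytes()).hexdigest()
    return sidecar.read_text().strip() == digest


def compile_corpus(
    model_names: Sequence[str],
    host_tag: str,
    cache_root: Path,
    compile_engine: Callable[[str, Path], None],
) -> List[CompileResultSketch]:
    results = []
    for name in model_names:
        # The host tuple is part of the filename, so a new host means a new path.
        path = cache_root / f"{name}__{host_tag}.engine"
        if sidecar_ok(path):
            results.append(CompileResultSketch(path, CompileOutcome.REUSED, None))
            continue
        started = time.perf_counter()
        compile_engine(name, path)  # the runtime owns the write and the sidecar
        elapsed = time.perf_counter() - started
        results.append(CompileResultSketch(path, CompileOutcome.BUILT, elapsed))
    return results


def fake_compile(name: str, path: Path) -> None:
    """Toy runtime: writes deterministic bytes plus a matching sidecar."""
    payload = f"engine:{name}".encode()
    path.write_bytes(payload)
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(hashlib.sha256(payload).hexdigest())
```

Calling `compile_corpus` twice with the same host tag should report `BUILT` on the first pass and `REUSED` on the second; changing the tag rebuilds at new paths without touching the old files.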

Two design corrections vs. the task spec body:

- **`EngineCacheEntry` shape**: the task spec proposed a c10-local
  `EngineCacheEntry` with `outcome` and `compile_duration_s` fields.
  That clashes with the canonical AZ-297
  `_types.inference.EngineCacheEntry` already re-exported from
  `components.c10_provisioning`. The canonical shape wins; the AZ-321
  wrapper is renamed `EngineCompileResult` and carries
  `{entry, outcome, compile_duration_s}` cleanly.
- **`InferenceRuntime.host_info()`**: the task spec calls a
  hypothetical `host_info()` method on the runtime to retrieve
  `(sm, jp, trt)`. The AZ-297 Protocol does NOT expose host info.
  Rather than expand the frozen Protocol mid-cycle, we accept a
  `HostCapabilities` field on `EngineCompileRequest` so the
  composition root threads the host from its own probe (Tier-2
  device introspection or Tier-1 test fixture). The compiler stays
  decoupled from any runtime-side introspection surface.
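
To make the two corrections concrete, the DTO shapes might look roughly like the sketch below. Only the `{entry, outcome, compile_duration_s}` triple and the presence of `cache_root` and a `HostCapabilities` field on the request come from this report; every other field name (`sm`, `jetpack`, `tensorrt`, `host`, `engine_path`, `sha256`), the enum values, and the `__post_init__` invariant check are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple


@dataclass(frozen=True)
class HostCapabilities:
    # Field names are assumptions; the report only names the (sm, jp, trt) tuple.
    sm: int
    jetpack: str
    tensorrt: str


@dataclass(frozen=True)
class EngineCacheEntry:
    # Stand-in for the canonical AZ-297 DTO; its real fields are not shown here.
    engine_path: Path
    sha256: str


class CompileOutcome(Enum):
    BUILT = "built"
    REUSED = "reused"


@dataclass(frozen=True)
class EngineCompileResult:
    """Wrapper around the canonical entry instead of a competing c10-local DTO."""
    entry: EngineCacheEntry
    outcome: CompileOutcome
    compile_duration_s: Optional[float]

    def __post_init__(self) -> None:
        # Assumed guard: duration is None exactly when the engine was reused.
        reused = self.outcome is CompileOutcome.REUSED
        if reused != (self.compile_duration_s is None):
            raise ValueError("compile_duration_s must be None iff outcome is REUSED")


@dataclass(frozen=True)
class EngineCompileRequest:
    # Simplified: the real request carries BackboneSpec objects, not names.
    backbones: Tuple[str, ...]
    cache_root: Path
    host: HostCapabilities  # threaded in by the composition root, not the runtime
```

Passing the host on the request keeps the compiler testable with a plain fixture: a Tier-1 test can build `HostCapabilities(87, "6.2", "10.3")` directly without any device probe.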

The C10 layer is also forbidden by `test_az270_compose_root.test_ac6`
from importing from `components.c7_inference` directly. That rule
applies across all `components/*/*.py` files regardless of what the
prose-level `module-layout.md` declares in its "Imports from" list;
the lint test wins. To respect it, `engine_compiler.py` defines
`CompileEngineCallable`, a structural Protocol cut of
`InferenceRuntime` exposing only the single `compile_engine` method
the compiler actually uses, and catches the broad `Exception` class.
The AZ-297 C7 error family stays the runtime's contract: the compiler
records `type(exc).__name__` as `error_class` in its ERROR log payload
and re-raises, so the original exception type propagates to the
caller intact.

## Files added / modified

### New (production)

- `src/gps_denied_onboard/components/c10_provisioning/engine_compiler.py`:
  `CompileOutcome` enum (`BUILT` / `REUSED`), `BackboneSpec` DTO,
  `EngineCompileRequest` DTO, `EngineCompileResult` DTO,
  `EngineCompileSummary` DTO, `CompileEngineCallable` structural
  Protocol, and the `EngineCompiler` class with the single public
  `compile_engines_for_corpus` method. Helpers:
  `_build_config_for_backbone` (synthesises one
  `OptimizationProfile` with `min == opt == max ==
  expected_input_shape` from the backbone spec; richer dynamic-shape
  ranges are out of scope for AZ-321), `_summarise` (aggregate counts
  for the `c10.engine.compile.summary` log).
- `src/gps_denied_onboard/components/c10_provisioning/config.py`:
  `BackboneConfig` DTO (`model_name`, `onnx_path`,
  `expected_input_shape`, `input_name` with `"input"` default) +
  `C10ProvisioningConfig` (`backbones` tuple, `workspace_mb` default
  4096 to match C7 NFT-LIM-01). Both validate in `__post_init__`
  (non-empty strings, positive shape dims, duplicate model_name
  detection).
- `src/gps_denied_onboard/runtime_root/c10_factory.py`:
  `build_engine_compiler(config)` wires the existing
  `build_inference_runtime` factory through to a new `EngineCompiler`
  instance with a c10-scoped structured logger;
  `build_backbone_specs(config)` materialises the `BackboneSpec` tuple
  from `config.components['c10_provisioning'].backbones`.
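
The `__post_init__` validation listed for `config.py` could look roughly like this. The field names and defaults come from the list above; the field types, the use of `ValueError`, the rejection of an empty shape, and the exact messages are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BackboneConfig:
    model_name: str
    onnx_path: str
    expected_input_shape: Tuple[int, ...]
    input_name: str = "input"

    def __post_init__(self) -> None:
        for field_name in ("model_name", "onnx_path", "input_name"):
            if not getattr(self, field_name):
                raise ValueError(f"{field_name} must be a non-empty string")
        shape = self.expected_input_shape
        # Rejecting an empty shape is an assumption of this sketch.
        if not shape or any(dim <= 0 for dim in shape):
            raise ValueError("expected_input_shape dims must all be positive")


@dataclass(frozen=True)
class C10ProvisioningConfig:
    backbones: Tuple[BackboneConfig, ...] = ()
    workspace_mb: int = 4096  # 4 GiB

    def __post_init__(self) -> None:
        names = [b.model_name for b in self.backbones]
        duplicates = sorted({n for n in names if names.count(n) > 1})
        if duplicates:
            raise ValueError(f"duplicate model_name entries: {duplicates}")
```

Validating at construction time means a bad config block fails when it is loaded, before any compile is attempted.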

### Modified (production)

- `src/gps_denied_onboard/components/c10_provisioning/__init__.py`:
  re-exports the AZ-321 public surface (`EngineCompiler`,
  `BackboneSpec`, `EngineCompileRequest`, `EngineCompileResult`,
  `CompileOutcome`, `EngineCompileSummary`, `CompileEngineCallable`,
  `BackboneConfig`, `C10ProvisioningConfig`) and registers the new
  config block via `register_component_block("c10_provisioning",
  C10ProvisioningConfig)`. `CacheProvisioner` / `Manifest` /
  `EngineCacheEntry` re-exports unchanged.

### New (tests)

- `tests/unit/c10_provisioning/test_engine_compiler.py`: **NEW**
  Tier-1 suite covering every AC + the 2 Tier-2 NFR placeholders:
  - **AC-1** cold cache + 3 backbones → all `BUILT`; 3 `.engine` +
    3 `.sha256` files on disk; 3 `c10.engine.cache.miss` WARN logs;
    1 `c10.engine.compile.summary` INFO log with `engines_built=3`.
  - **AC-2** warm cache + identical request → all `REUSED`;
    `compile_duration_s is None` for every result; ZERO calls to the
    fake runtime; 3 `c10.engine.cache.hit` INFO logs.
  - **AC-3** mixed (1 hit + 2 miss): DINOv2 reused, LightGlue +
    ALIKED built; 2 calls to the fake runtime.
  - **AC-4** hardware change (sm 87→89, jp 6.2→6.3, trt 10.3→10.5):
    every backbone rebuilt at the new filename; old files at the old
    filename untouched on disk.
  - **AC-5** tampered sidecar (overwrite LightGlue's `.sha256` with
    `0`×64): LightGlue rebuilt; DINOv2 + ALIKED still reused; 1
    `c10.engine.sidecar.mismatch` WARN log with
    `model_name=lightglue` and `reason=digest_mismatch`. Plus a
    sibling case where the sidecar file is deleted entirely
    (`Sha256Sidecar.verify` raises): same WARN-then-rebuild outcome.
  - **AC-6** `EngineBuildError` mid-corpus (backbone 2 of 3 fails):
    error propagates; backbone 1 (pre-cached, reused) untouched on
    disk; backbone 2's would-be engine NOT on disk (atomic-write
    guarantee from the fake mirrors AZ-298's real behaviour);
    backbone 3 never attempted (single call recorded for backbone 2).
  - **AC-7** `CalibrationCacheError` propagates with the
    `c10.engine.compile.error` ERROR log carrying `model_name`,
    `calibration_path`, `error_class=CalibrationCacheError`.
  - **AC-8** filename is exactly
    `dinov2_vpr__sm87_jp6.2_trt10.3_fp16.engine` (per AZ-281
    canonical schema with the `__` separator between model and
    `sm`); sidecar at `*.engine.sha256` with 64-hex digest;
    `EngineFilenameSchema.parse` round-trip + `Sha256Sidecar.verify`
    both pass.
  - **AC-9** `compile_duration_s` is a positive float for every
    `BUILT` result, `None` for every `REUSED` result.
  - **AC-10** empty `backbones` tuple → empty result; ZERO runtime
    calls; ZERO files written; 1 summary log with all-zero counts.
  - **NFR-perf-cache-hit** Tier-2 placeholder skip (200 MB engine
    sweep belongs in the AZ-321 microbench harness on Jetson).
  - **NFR-reliability-atomic-write** Tier-2 placeholder skip
    (kill-during-compile scenario lives in the microbench harness;
    the atomicity contract itself is owned by AZ-280's tests).
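
The filename asserted in AC-8 and the path change exercised in AC-4 can be illustrated with a small build/parse pair. This is a sketch inferred from the single example filename in AC-8, not the AZ-281 `EngineFilenameSchema` itself: `EngineKey`, `build`, `parse`, and the regular expression are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineKey:
    model: str
    sm: int
    jetpack: str
    tensorrt: str
    precision: str


# Pattern inferred from 'dinov2_vpr__sm87_jp6.2_trt10.3_fp16.engine'.
_PATTERN = re.compile(
    r"^(?P<model>.+)__sm(?P<sm>\d+)"
    r"_jp(?P<jp>\d+(?:\.\d+)*)"
    r"_trt(?P<trt>\d+(?:\.\d+)*)"
    r"_(?P<precision>[a-z0-9]+)\.engine$"
)


def build(key: EngineKey) -> str:
    """Encode the model and the full host tuple into the filename."""
    return (
        f"{key.model}__sm{key.sm}_jp{key.jetpack}"
        f"_trt{key.tensorrt}_{key.precision}.engine"
    )


def parse(filename: str) -> EngineKey:
    """Inverse of build; raises ValueError on a non-matching name."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a canonical engine filename: {filename!r}")
    return EngineKey(
        model=match["model"],
        sm=int(match["sm"]),
        jetpack=match["jp"],
        tensorrt=match["trt"],
        precision=match["precision"],
    )
```

Since every host attribute is part of the name, two hosts can share one cache directory without overwriting each other's engines.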

The Tier-1 tests use a `_FakeRuntime` that satisfies
`CompileEngineCallable` and writes deterministic engine bytes via
the REAL `Sha256Sidecar.write_atomic_and_sidecar`, so the cache-hit
/ cache-miss / tampered-sidecar paths run against the same helper
the production wiring uses. Only the C7-runtime-specific compile
internals (TRT engine bytes, calibration cache, GPU memory) are
mocked.
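
For readers unfamiliar with the sidecar pattern, the sketch below shows one common way to write a file atomically next to a SHA-256 sidecar and to verify it later. It is not the AZ-280 `Sha256Sidecar` implementation; the function names are hypothetical, and the temp-file-plus-`os.replace` approach is an assumed technique. Only the `*.engine.sha256` suffix, the 64-hex digest, and the "verify raises when the sidecar is missing" behaviour are taken from this report.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sidecar_path(engine: Path) -> Path:
    return engine.with_name(engine.name + ".sha256")


def _atomic_write(target: Path, data: bytes) -> None:
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, target)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def write_engine_and_sidecar(engine: Path, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    _atomic_write(engine, data)
    _atomic_write(sidecar_path(engine), digest.encode("ascii"))
    return digest


def verify(engine: Path) -> bool:
    """Raises FileNotFoundError if the engine or its sidecar is missing."""
    expected = sidecar_path(engine).read_text().strip()
    actual = hashlib.sha256(engine.read_bytes()).hexdigest()
    return expected == actual


def is_cache_hit(engine: Path) -> bool:
    # A missing sidecar is treated like a mismatch: rebuild instead of trusting.
    try:
        return verify(engine)
    except FileNotFoundError:
        return False
```

Overwriting the sidecar with 64 zeros makes `verify` return `False`, and deleting it makes `is_cache_hit` return `False`; both lead to the rebuild path that AC-5 and its sibling case exercise.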

### Modified (docs)

- `_docs/02_document/module-layout.md`: c10_provisioning
  Per-Component Mapping now lists the new internal modules
  (`engine_compiler.py`, `config.py`) and the composition-root
  `c10_factory.py`; the Public API re-export list is extended with
  the AZ-321 surface; the `Config block` line is added (registered
  on import). `default_provisioner.py` row marked `pending` until
  the AZ-325 task lands.

## Acceptance criteria coverage

| AC | Test | Status |
|----|------|--------|
| AC-1 Cold cache → all built | `test_ac1_cold_cache_compiles_every_backbone` | passing |
| AC-2 Warm cache → all reused, zero compile calls | `test_ac2_warm_cache_reuses_every_backbone` | passing |
| AC-3 Mixed cache | `test_ac3_mixed_cache_hits_and_misses` | passing |
| AC-4 Hardware change invalidates filename | `test_ac4_hardware_change_invalidates_cache` | passing |
| AC-5 Tampered sidecar + missing sidecar paths | `test_ac5_tampered_sidecar_invalidates_that_engine` + `test_missing_sidecar_treated_as_cache_miss` | passing |
| AC-6 `EngineBuildError` propagates, partial state consistent | `test_ac6_engine_build_error_propagates_and_third_backbone_untouched` | passing |
| AC-7 `CalibrationCacheError` propagates with diagnostic log | `test_ac7_calibration_cache_error_propagates` | passing |
| AC-8 Filename + sidecar layout matches AZ-281 schema | `test_ac8_filename_and_sidecar_layout` | passing |
| AC-9 `compile_duration_s` recorded for built only | `test_ac9_compile_duration_recorded_for_built_only` | passing |
| AC-10 Empty backbones → empty result, no side effects | `test_ac10_empty_backbones_returns_empty` | passing |
| NFR-perf-cache-hit p99 ≤ 1.5 s for 200 MB engine | `test_nfr_perf_cache_hit_p99_under_1500ms_for_200mb_engine` (Tier-2) | Tier-2 skipped |
| NFR-reliability atomic-write no half-engine after kill | `test_nfr_reliability_atomic_write_no_half_engine_after_kill` (Tier-2) | Tier-2 skipped |

## AC Test Coverage: 10 of 10 covered (+ 2 NFRs)
## Code Review Verdict: PASS_WITH_WARNINGS (4 Low accepted; see Findings)
## Auto-Fix Attempts: 0
## Stuck Agents: None

## Findings (self-review)

| # | Severity | Category | Location | Note | Resolution |
|---|----------|----------|----------|------|------------|
| 1 | Low | Architecture | `engine_compiler.py::_compile_one` | Catches the broad `Exception` (not the specific AZ-297 `RuntimeError` family) because the c10 layer cannot import `components.c7_inference` (architecture rule `test_az270_compose_root.test_ac6`). The C7 contract scopes its runtime exceptions to its own family; ANY exception bubbling out of `compile_engine` is treated as a compile failure here. Re-raise preserves the original type. Inline comment documents the rule. | Open (Low): accepted; architecture rule wins. |
| 2 | Low | Maintainability | `engine_compiler.py::CompileEngineCallable` | Duplicates the `compile_engine` method shape from the C7 `InferenceRuntime` Protocol. Mirrors the LightGlue dual-Protocol pattern already in `_types/manifests.py` (consumer-side structural cut vs. producer-side opaque marker). | Open (Low): accepted; matches established pattern. |
| 3 | Low | Architecture | `engine_compiler.py` ↔ C7InferenceConfig | `EngineCompileRequest.cache_root` MUST equal the directory the C7 runtime writes to (`C7InferenceConfig.engine_cache_dir`). The composition root (`build_engine_compiler` + the C10 corpus driver T5 in AZ-325) is responsible for keeping the two in sync; the compiler itself trusts the request. If the two diverge, every cache lookup misses, because the compiler looks for engines in a directory the runtime never writes to. | Open (Low): flagged for AZ-325 to enforce. |
| 4 | Low | Scope | `engine_compiler.py::_build_config_for_backbone` | Synthesises exactly one `OptimizationProfile` with `min == opt == max == expected_input_shape`. Backbones requiring dynamic input ranges would need a richer `BackboneSpec` carrying explicit `OptimizationProfile` tuples. None of the AZ-321 corpus backbones (DINOv2-VPR, LightGlue, ALIKED) need dynamic shapes today, but the limitation is real. | Open (Low): accepted; future extension. |
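
Finding 4 is easiest to see with a toy model of a static profile. The sketch below is illustrative only: `OptimizationProfile` here is a local stand-in with assumed field names, `static_profile` and `accepts` are hypothetical helpers, and the `(1, 3, 224, 224)` shape in the usage note is an example value, not one taken from the AZ-321 corpus.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class OptimizationProfile:
    # Local stand-in; the AZ-297 DTO's real field names are not shown here.
    input_name: str
    min_shape: Shape
    opt_shape: Shape
    max_shape: Shape


def static_profile(input_name: str, expected_input_shape: Sequence[int]) -> OptimizationProfile:
    """Collapse the dynamic range to a single point: min == opt == max."""
    shape = tuple(expected_input_shape)
    return OptimizationProfile(input_name, shape, shape, shape)


def accepts(profile: OptimizationProfile, shape: Sequence[int]) -> bool:
    """True if every dimension lies within the profile's [min, max] range."""
    if len(shape) != len(profile.min_shape):
        return False
    return all(
        low <= dim <= high
        for low, dim, high in zip(profile.min_shape, shape, profile.max_shape)
    )
```

With `static_profile("input", (1, 3, 224, 224))`, only that exact shape is accepted; a batch of 2 falls outside the range, which is the limitation the finding records.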

## Tracker

- AZ-321 transitioned to **In Progress** at session start; will move
  to **In Testing** post-commit per `protocols.md`.

## Test suite

- `tests/unit/c10_provisioning/`: 13 passing, 2 Tier-2 skips
  (cache-hit p99 NFR + atomic-write kill scenario).
- Combined unit suite excluding pending components (c1, c2, c2.5,
  c3, c3.5, c4, c5, c8, c11, c12) and the c6 collection blocker on
  this host (missing `psycopg_pool` is a known dev-machine env issue,
  pre-existing): 543 passing, 21 environment-skipped, 1 warning
  (pre-existing `pynvml` FutureWarning unrelated to AZ-321).

## Next batch

Cycle 1 advances per the greenfield queue: autodev re-detects the
next AZ ticket in the Step 7 batch loop. AZ-321 unblocks AZ-322
(C10 Descriptor Batcher), AZ-337 (C2 UltraVPR), AZ-345 / AZ-346 /
AZ-347 (C3 matchers), and AZ-349 (C3.5 refiner) at the topological
level; the next ready batch is computed by `compute-next-batch`.

A cumulative review (batches 31–33) will fire at the next sub-skill
phase boundary per Step 14.5's K=3 trigger.