gps-denied-onboard

mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-21 22:01:13 +00:00

Author	SHA1	Message	Date
Oleksandr Bezdieniezhnykh	0ad3278b12	[AZ-299] C7 OnnxTrtEpRuntime: ORT + TRT EP fallback strategy Land the fallback InferenceRuntime strategy that satisfies C7-IT-05: when the TRT-direct path (AZ-298) cannot deserialise a cached engine or when the operator explicitly selects ORT, the system stays in the air at degraded latency rather than dropping the request. Conforms to the AZ-297 Protocol; current_runtime_label() == "onnx_trt_ep". Production - onnx_trt_ep_runtime.py: compile_engine is a no-op returning an EngineCacheEntry pointing at the source .onnx; deserialize_engine is gate-first for .engine entries and gate-skip for .onnx, builds an ORT InferenceSession with the provider list [TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider], stages cached engines into the ORT TRT EP cache directory via symlink-or-copy, warms up with one session.run after construction, and honours config.inference.ort_disallow_cpu_fallback by raising EngineDeserializeError when the active provider resolves to CPU; infer emits a one-shot c7.fallback_to_onnx_trt_ep WARN log plus gcs_alert callback on first call when is_fallback=True; release_engine is idempotent. _build_provider_args is the single point that pins TRT EP option-key names (Risk-3) and caps trt_max_workspace_size at gpu_memory_budget_bytes // 4 (AC-8). - config.py: adds ort_trt_cache_dir (validated non-empty) and ort_disallow_cpu_fallback to C7InferenceConfig. - fdr_client/records.py: adds c7.fallback_to_onnx_trt_ep and c7.cpu_fallback FDR record kinds. Tests - test_onnx_trt_ep_runtime.py: covers AC-1..AC-8 + Risk-2 CPU-fallback alert + Risk-3 option-key pin + NFR-reliability error rewrap; Tier-1 via fake ORT session; Tier-2 placeholders skip on macOS dev for numerical FP16 comparison and session-creation perf NFR. - test_protocol_conformance.py: drops onnx_trt_ep from the missing-module parametrize now that the module ships. 
- test_az272_fdr_record_schema.py: extends per-kind fixture builder to cover the two new C7 FDR kinds in the roundtrip / schema-version AC tests. Docs - module-layout.md: replaces the pending onnx_trt_runtime row with the shipped onnx_trt_ep_runtime row + capabilities list. - batch_32_cycle1_report.md + reviews/batch_32_review.md: full batch + self-review (PASS_WITH_WARNINGS, 4 Low findings accepted). Tests run: c7_inference 139 passing + 17 Tier-2 skips; combined unit suite (excluding pending components) 529 passing, 19 env-skipped. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-12 23:55:50 +03:00
Oleksandr Bezdieniezhnykh	18a69022b3	[AZ-298] C7 TensorrtRuntime: TRT 10.3 + INT8 calib trust + GPU budget Implement the production-default InferenceRuntime strategy on JetPack 6.2 + TensorRT 10.3 (per D-C7-9). The runtime owns the full TRT lifecycle: compile_engine via the Polygraphy + trtexec + IBuilderConfig hybrid (FP16 / INT8 / Mixed precision), deserialize_engine with EngineGate-first ordering and a pre-allocation GPU memory budget gate, infer via H2D -> enqueueV3 -> D2H -> stream sync on the owned CUDA stream, idempotent release_engine, and an injected ThermalStatePublisher delegation for thermal_state. INT8 calibration cache trust (D-C10-6, AC-2/3/4) is enforced by a .calib_cache.sha256 file-integrity sidecar (AZ-280) plus a new .calib_cache.dataset_sha256 sidecar that records the dataset content hash at compile time; reuse only when both agree, rebuild silently on dataset hash mismatch, raise CalibrationCacheError on corrupt sidecar (never silently overwritten). GPU memory budget (NFT-LIM-01, default 4 GiB) is checked BEFORE any TRT call beyond the gate (AC-6); a pre-allocation refusal raises OutOfMemoryError and leaves the resident state unchanged. TensorRT 10.3 / Polygraphy / PyCUDA are lazy-imported inside the methods that need them so the module loads cleanly on Tier-0 hosts. A standalone CLI entry (python -m gps_denied_onboard.components.c7_inference.tensorrt_runtime compile <onnx> <build_config.json>) is wired for C10 CacheProvisioner (AZ-321) to invoke pre-flight without holding a runtime instance. C7InferenceConfig gains gpu_memory_budget_bytes (default 4 GiB) and trtexec_timeout_s (default 600 s, Risk 4 mitigation), both validated in __post_init__. Tests: 26 active + 6 Tier-2-gated skips; AC-1 / AC-3 / AC-4 / AC-5 / AC-6 / AC-7 / AC-10 + NFR-reliability fully covered on Tier-1 via fake CUDA / TRT modules; AC-2 / AC-8 / AC-9 / NFR-perf-deserialize placeholders skip with prerequisite reason and live in the AZ-298 Tier-2 microbench harness. 
Code review verdict PASS_WITH_WARNINGS (1 Medium hot-path hoist fix auto-applied). Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-12 23:11:49 +03:00
Oleksandr Bezdieniezhnykh	49a67f770d	[AZ-302] C7 ThermalStatePublisher — jtop/NVML 1 Hz background poller Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton background-thread publisher that polls jtop (preferred) or pynvml (fallback) at config.thermal_poll_hz, stores an atomic ThermalState snapshot, and emits c7.thermal_transition FDR records on every throttle flip with a WARN log on entry and an INFO log on exit. Default-safe on TelemetryUnavailableError per Invariant I-6 with a 1-Hz rate-limited WARN. Sources return a raw ThermalReading; the publisher stamps measured_at_ns via its injected Clock so _JtopSource / _PynvmlSource stay clean of direct time.* calls (Invariant 2). _poll_once is the deterministic test seam — start() spawns the production thread. - c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS - [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject - 14 unit tests (AC-1..AC-6, AC-8, NFR-default-safe, structural) green; AC-7 / AC-1 microbench / NFR-perf-poll Tier-2 deferred - full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-12 10:33:37 +03:00
Oleksandr Bezdieniezhnykh	59f56c032f	[AZ-301] Implement EngineGate — D-C10-3 + D-C10-7 takeoff validator AZ-301 takeoff-side validator every InferenceRuntime strategy calls before deserialize_engine. Five-step deterministic refusal pipeline, in order: 1. filename schema parse -> EngineSchemaMismatchError(reason=...) 2. schema tuple match -> EngineSchemaMismatchError(expected,got) 3. sidecar present -> EngineSidecarMissingError 4. sidecar trust -> EngineHashMismatchError(stage=sidecar) 5. manifest match -> EngineHashMismatchError(stage=manifest) Refusal order is part of the public contract (AC-7 verifies a fixture that is BOTH schema-mismatched AND missing-sidecar refuses at step 1). Production code (new): - components/c7_inference/engine_gate.py -- EngineGate, HostTuple, read_host_tuple (Jetson: pynvml + /etc/nv_tegra_release + tensorrt.__version__; raises RuntimeError on Tier-1) - components/c7_inference/manifest.py -- DeploymentManifest, ManifestReader, ManifestReaderProtocol. Risk-2 enforced at the type level: __getitem__ raises EngineHashMismatchError on missing key, NEVER KeyError, so the gate cannot silently pass - components/c7_inference/__init__.py -- re-exports the new public surface Tests (new): tests/unit/c7_inference/test_engine_gate.py covers AC-1..AC-7 + NFR-reliability-no-write + manifest reader + refusal log emission. 14 tests unconditional + AC-8 Tier-2 skip (needs real NVML + L4T release file + tensorrt binding). Three task-spec -> as-built deltas documented in _docs/02_tasks/done/AZ-301_c7_engine_gate.md Implementation Notes: 1. HostTuple lives in engine_gate.py (the only consumer); re-exported from package __init__.py. 2. read_host_tuple takes precision as a keyword argument — three of four fields come from the host, precision is engine-build metadata supplied by the caller. 3. AC-8 is Tier-2-only; AC-1..AC-7 + NFR-reliability + extras run on every CI host. 
Risk-2 (manifest reader silently treats missing entry as pass): DeploymentManifest.__getitem__ raises EngineHashMismatchError with "missing manifest entry for {path}" — covered by test_manifest_missing_entry_raises_hash_mismatch. NFR-perf-validate (p99 <= 50 ms): tier-2 only — a real 500 MB engine streaming sha256 cannot be benchmarked on Tier-1 fixtures. AZ-302 (ThermalStatePublisher) + AZ-304 (C6 Postgres schema) deferred to batches 26 / 27 to keep the 1-task batch cadence and isolate their respective env / testcontainer surface areas. Suite: 1134 passed / 11 skipped. No regressions outside the new files. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-12 10:20:21 +03:00
Oleksandr Bezdieniezhnykh	65ad2168ed	[AZ-300] Implement PytorchFp16Runtime — C7 simple-baseline strategy AZ-300 mandatory simple-baseline InferenceRuntime (eager FP16 PyTorch). Implements the AZ-297 Protocol; current_runtime_label returns "pytorch_fp16". Numerical reference every fancier C7 strategy (AZ-298 TRT, AZ-299 ORT) is measured against, and the only viable runtime for Tier-1 workstation Docker where TRT is non-trivial to install. Production code (new): - components/c7_inference/pytorch_fp16_runtime.py — runtime + PytorchEngineHandle + output-shape adapter - components/c7_inference/architecture_registry.py — torch-free register_architecture / default_registry / ArchitectureFactory (Risk-1 mitigation: no L2->L3 back-edge from C7 into per-backbone code) - components/c7_inference/__init__.py — re-exports the registry mechanism. Still does NOT import the concrete strategy module (Invariant I-5) - components/c7_inference/config.py — adds per_frame_debug_log bool field (gates the DEBUG per-frame latency log) Tests (new): tests/unit/c7_inference/test_pytorch_fp16_runtime.py covers AC-1..AC-8 + NFRs. AC-1/2/6/7 + thermal/release/registry guards run unconditionally (17 tests); AC-3/4/5/8 + NFR-perf-deserialize + NFR-reliability-eval-mode require CUDA and skip on Tier-1 CI / macOS dev. Tests (modified): - test_protocol_conformance.py — narrowed test_ac5_build_inference_runtime_flag_on_but_module_missing parametrisation to exclude pytorch_fp16 (now-built); TRT / ORT still covered until AZ-298 / AZ-299 ship. CI: .github/workflows/ci.yml lint + unit jobs now install '-e .[dev,inference]' because mypy + pytest need torch + torchvision + onnxruntime on the runner. Three task-spec -> as-built deltas documented in _docs/02_tasks/done/AZ-300_c7_pytorch_baseline.md Implementation Notes: 1. Constructor conforms to AZ-297 factory shape (config positional; thermal_publisher + registry + clock keyword-only optionals). AZ-302 will update the factory to thread thermal_publisher. 2. 
Architecture registry uses extras["model_name"] as lookup key (avoids touching the frozen BuildConfig / EngineCacheEntry DTOs). 3. Warm-up forward deferred to AZ-300 tier-2 follow-up — the zero-arg registry has no per-backbone input-shape metadata. Suite: 1120 passed / 10 skipped (CUDA + Tier-2 + cmake / actionlint environment gates). No regressions in non-c7_inference areas. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-12 10:13:21 +03:00
Oleksandr Bezdieniezhnykh	daff5d4d1c	[AZ-297] C7 InferenceRuntime: Protocol + DTOs + factory Freezes the c7_inference Public API per _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md v1.0.0: - InferenceRuntime Protocol (6 methods: compile_engine, deserialize_engine, infer, release_engine, thermal_state, current_runtime_label) in components/c7_inference/interface.py. - DTOs (PrecisionMode enum, OptimizationProfile, BuildConfig, EngineCacheEntry, EngineHandle opaque marker) in _types/inference.py — placed at the L1 types layer so C10 can re-export EngineCacheEntry without crossing the components.* boundary (AZ-270 AC-6). - ThermalState DTO expanded in _types/thermal.py from the AZ-355 forward-declared stub to the AZ-297 contract shape (cpu/gpu temp, thermal_throttle_active, measured_clock_mhz, measured_at_ns, is_telemetry_available). Invariant I-6: when telemetry is unavailable, throttle is False. - Error family rooted at c7_inference.errors.RuntimeError (9 subtypes: EngineBuildError, EngineDeserializeError, EngineHashMismatchError, EngineSchemaMismatchError, EngineSidecarMissingError, CalibrationCacheError, InferenceError, OutOfMemoryError, TelemetryUnavailableError). RuntimeNotAvailableError stays in runtime_root/errors.py — composition-time, outside the family. - C7InferenceConfig per-component config block (runtime label, thermal_poll_hz, engine_cache_dir) with constructor-time validation rejecting unknown runtime labels. - Composition-root factory build_inference_runtime in runtime_root/inference_factory.py with three BUILD_* gates (BUILD_TENSORRT_RUNTIME, BUILD_ONNX_TRT_EP_RUNTIME, BUILD_PYTORCH_FP16_RUNTIME). Concrete strategy modules are imported lazily via __import__ AFTER the flag check, so a Tier-0 build with the flag OFF MUST NOT load the strategy module (AC-5 / I-5; verifiable via sys.modules). - 37 conformance tests cover all 8 ACs + NFR-perf-factory (p99 build under 200 ms × 1000 calls) + NFR-reliability-error-family. 
AC-8 introspects the contract file's Shape table and asserts method parity against the runtime Protocol; also asserts all 9 error subtypes are documented. Retired the AZ-263 scaffolding EngineCacheEntry from _types/manifests.py (replaced by the AZ-297 canonical shape in _types/inference.py); updated the LightGlue-flavoured EngineHandle Protocol docstring in _types/manifests.py to rationalize its intentional dual existence with the C7 opaque EngineHandle (same name, different consumer-side cut, mirroring the C4/C5 ISam2GraphHandle pattern). Stale ThermalState.throttle docstring references in c4_pose/config.py, c4_pose/interface.py, and _types/pose.py updated to thermal_throttle_active. Full unit-test sweep: 843 passed, 2 pre-existing environment skips (cmake, actionlint). Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-12 04:30:14 +03:00
Oleksandr Bezdieniezhnykh	b12db61444	[AZ-263] Bootstrap: repo skeleton + Docker + CI + Alembic + Tier-1 tests Implements the AZ-263 / E-BOOT initial structure task: - Python src/-layout package `gps_denied_onboard/` with per-component interface stubs (14 components), type-only DTOs under `_types/`, shared helpers under `helpers/` (R14 LightGlue ownership), structured JSON logging, runtime composition root with env-var fail-fast gate, healthcheck module shared by Docker and CI smoke. - CMake top-level + `cmake/{build_options,dependencies,strategies}.cmake` with the BUILD_* per-binary flags (ADR-002) and pinned external git refs for OKVIS2 / VINS-Mono / GTSAM / FAISS / OpenCV >=4.12.0. - Three Dockerfiles (companion-tier1, operator-tooling, mock-suite-sat-service) + two compose files (dev + Tier-1 test). - Four GitHub Actions workflows: ci.yml (lint/unit/integration/dual binary build/SBOM diff/security), ci-tier2.yml (self-hosted Jetson AC-bound NFTs), release.yml, cve-rescan.yml. - Two CI gate scripts: `ci/sbom_diff.py` (deployment SBOM subset + R02 exclusion), `ci/opencv_pin_gate.py` (>=4.12.0 enforcement, D-CROSS-CVE-1). - Alembic-driven Postgres 16 initial migration `0001_initial.py` mirroring satellite-provider tiles + flights + sector_classifications + manifests + engine_cache_entries (data_model.md s 2). - Tier-1 test scaffolding: 95 passing unit tests covering every AC, per-component smoke tests, structured logging JSON output check, env-var gate check, healthcheck import check. Two CI-gated tests (cmake configure, actionlint) skip locally with explicit reasons. - Batch report + code review report under `_docs/03_implementation/`. Verdict: PASS_WITH_WARNINGS (two Low findings, both informational). Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-11 01:00:28 +03:00
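
The five-step refusal order described in the AZ-301 commit above can be illustrated with a minimal sketch. This is not the repository's EngineGate: the `Candidate` dataclass, the `validate` function, and the schema-tuple values are hypothetical stand-ins, and only the error names, the step order, and the "missing manifest entry is a refusal" rule are taken from the commit message.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


class EngineSchemaMismatchError(Exception):
    """Stand-in for the C7 error of the same name."""


class EngineSidecarMissingError(Exception):
    """Stand-in for the C7 error of the same name."""


class EngineHashMismatchError(Exception):
    """Stand-in for the C7 error; `stage` records which check refused."""

    def __init__(self, message: str, stage: str) -> None:
        super().__init__(message)
        self.stage = stage


@dataclass(frozen=True)
class Candidate:
    """Hypothetical pre-digested view of one engine file (not the real DTO)."""

    path: str
    parsed_schema: Optional[Tuple[str, ...]]  # None: filename did not parse
    sidecar_hash: Optional[str]               # None: sidecar file is absent
    actual_hash: str                          # sha256 of the engine bytes


def validate(candidate: Candidate,
             host_schema: Tuple[str, ...],
             manifest: Mapping[str, str]) -> None:
    """Refuse at the FIRST failing step; return None when all five pass."""
    # Step 1: filename schema parse.
    if candidate.parsed_schema is None:
        raise EngineSchemaMismatchError("reason=unparseable filename")
    # Step 2: schema tuple must match the host tuple.
    if candidate.parsed_schema != host_schema:
        raise EngineSchemaMismatchError(
            f"expected={host_schema} got={candidate.parsed_schema}")
    # Step 3: sidecar must be present.
    if candidate.sidecar_hash is None:
        raise EngineSidecarMissingError(candidate.path)
    # Step 4: sidecar hash must match the engine bytes.
    if candidate.sidecar_hash != candidate.actual_hash:
        raise EngineHashMismatchError("sidecar hash differs", stage="sidecar")
    # Step 5: manifest must list the file; a missing entry refuses.
    expected = manifest.get(candidate.path)
    if expected is None:
        raise EngineHashMismatchError(
            f"missing manifest entry for {candidate.path}", stage="manifest")
    if expected != candidate.actual_hash:
        raise EngineHashMismatchError("manifest hash differs", stage="manifest")
```

Because each check raises before the next one runs, a file that is both schema-mismatched and missing its sidecar is refused with a schema error, which is the ordering property the commit attributes to AC-7.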

7 Commits
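
The AZ-299 commit says that provider options are pinned in one place, that `trt_max_workspace_size` is capped at a quarter of the GPU memory budget, and that a CPU-only session is refused when `ort_disallow_cpu_fallback` is set. The sketch below shows one way that could look. It is an assumption-laden illustration, not the repository's `_build_provider_args`: the function names `build_providers` and `refuse_cpu_fallback` are hypothetical, enabling the engine cache through `trt_engine_cache_enable` / `trt_engine_cache_path` is a guess at how `ort_trt_cache_dir` is used, and treating the first provider in the list as "the active provider" is a simplification.

```python
from typing import Any, Dict, List, Sequence, Tuple, Union

ProviderSpec = Union[str, Tuple[str, Dict[str, Any]]]


class EngineDeserializeError(Exception):
    """Stand-in for the C7 error named in the commit message."""


def build_providers(gpu_memory_budget_bytes: int,
                    ort_trt_cache_dir: str) -> List[ProviderSpec]:
    """Return a provider list in the form onnxruntime.InferenceSession accepts.

    The option keys are documented ONNX Runtime TensorRT EP options;
    which of them the real runtime sets is an assumption.
    """
    if gpu_memory_budget_bytes <= 0:
        raise ValueError("gpu_memory_budget_bytes must be positive")
    if not ort_trt_cache_dir:
        raise ValueError("ort_trt_cache_dir must be non-empty")
    trt_options: Dict[str, Any] = {
        # Workspace cap from the commit: one quarter of the GPU budget.
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
        # Assumed: the cache dir is wired in through the EP's engine cache.
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": ort_trt_cache_dir,
    }
    return [
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]


def refuse_cpu_fallback(active_providers: Sequence[str],
                        disallow_cpu_fallback: bool) -> None:
    """Raise when the highest-priority active provider is the CPU one.

    `active_providers` is expected to be what
    InferenceSession.get_providers() returns, in priority order.
    """
    if not active_providers:
        raise EngineDeserializeError("session reports no providers")
    if disallow_cpu_fallback and active_providers[0] == "CPUExecutionProvider":
        raise EngineDeserializeError("session resolved to CPU only")
```

With the 4 GiB default budget mentioned in the AZ-298 commit, the cap works out to 1 GiB. Keeping the option keys in a single function means a renamed key in a future ONNX Runtime release only has to be fixed in one place, which appears to be the intent behind the Risk-3 pin.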