mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 10:31:12 +00:00
[AZ-328] C12 BuildCacheOrchestrator + remote C10 invoker (Batch 43)
Implements F1 pre-flight cache build orchestrator on the operator workstation. Composes C11 TileDownloader (AZ-316), C12 CompanionBringup (AZ-327), C12 FlightsApiClient (AZ-489), and the new RemoteCacheProvisionerInvoker into one sequenced flow guarded by a filelock-backed workstation-side lockfile. Architectural decisions: - Phase-0 flight-resolve runs BEFORE the lockfile (ADR-010): a flight that cannot be resolved is an operator-input error, not a contended- resource error. Enforced by AC-11 + AC-14. - Consumer-side cuts (AZ-507) for C11 + C10 types: local Protocols / mirror DTOs in tile_downloader_cut.py and _types.py; external errors matched by name-based whitelisting so unknown exceptions still propagate per AC-6. Cross-component type translation lives at the composition root (c12_factory). - Failure surfacing: recognised operational failures (download error, companion not ready, build error, flight-resolve error) return as CacheBuildReport(outcome=failure, failure_phase=...). Only lockfile contention raises (BuildLockHeldError) since no phase ever ran. - Workstation-side filelock library (project pin); no custom primitive. - Remote C10 stdout streamed line-by-line as DEBUG with api_key / auth_token redacted before logging (defence-in-depth). - CLI is now a thin adapter; all workflow logic lives in build_cache.py. operator-tool build-cache exit codes map per CacheBuildReport.failure_phase + failure_exception_type. Tests: 116 c12 unit tests pass (29 new for AZ-328 covering 15/15 ACs + NFR-perf-overhead microbench; 7 new for remote_c10_invoker; 3 new for file_lock; test_cli_build_cache rewritten for new orchestrator interface). Full repo suite: 1522 passed, 80 skipped. Also: replays Batch 42's ruff format leftover for c12 flights_api + test_az489 files (formatter ran over the c12 directory after new files were added). Pure whitespace; no behaviour change. Full report: _docs/03_implementation/batch_43_cycle1_report.md Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
@@ -0,0 +1,256 @@
|
||||
# Batch 43 — Cycle 1 Report
|
||||
|
||||
**Date**: 2026-05-13
|
||||
**Batch**: 43
|
||||
**Tasks**: AZ-328 (C12 Build-Cache Orchestrator, 5pt)
|
||||
**Status**: complete; ticket ready to transition to "In Testing".
|
||||
|
||||
## Scope
|
||||
|
||||
AZ-328 delivers `BuildCacheOrchestrator` — the F1 (pre-flight cache
|
||||
build) workflow head for the operator workstation. It composes the
|
||||
already-shipped C11 `TileDownloader` (AZ-316), C12 `CompanionBringup`
|
||||
(AZ-327), C12 `FlightsApiClient` (AZ-489), and the new
|
||||
`RemoteCacheProvisionerInvoker` (this task) into one sequenced flow,
|
||||
guarded by a workstation-side `filelock` lockfile. The `operator-tool
|
||||
build-cache` subcommand (T1, AZ-326) is now wired to a real
|
||||
orchestrator and returns granular exit codes per failure phase.
|
||||
|
||||
This unblocks the F1 happy-path acceptance test (C12-IT-02) and gives
|
||||
operators the first end-to-end "I have a flight ID, build me a cache"
|
||||
workflow.
|
||||
|
||||
## Architectural Decisions
|
||||
|
||||
### 1. Phase-0 flight-resolve runs BEFORE the lockfile (ADR-010)
|
||||
|
||||
Per the AZ-328 spec, the orchestrator resolves the `FlightDto` (via
|
||||
`flights_api_client.fetch_flight` or `load_flight_file`), derives the
|
||||
bbox, and computes the takeoff origin **before** acquiring the `.c12.lock`
|
||||
file. Rationale: a flight that cannot be resolved is an operator-input
|
||||
error (wrong flight ID, missing JSON file, expired auth token); holding
|
||||
a contended-resource lock while the operator fixes their input would
|
||||
artificially block parallel builds against unrelated flights. AC-11 and
|
||||
AC-14 enforce.
|
||||
|
||||
### 2. Consumer-side cuts (AZ-507) for C11 + C10 types
|
||||
|
||||
Per the workspace cross-component import policy, C12 cannot import C11
|
||||
or C10 directly. Implemented as:
|
||||
|
||||
- `tile_downloader_cut.py` — local `TileDownloaderCut` Protocol +
|
||||
`DownloadRequestCut` / `DownloadBatchReportCut` / `DownloadOutcomeCut`
|
||||
DTOs mirroring the C11 shapes the orchestrator actually consumes.
|
||||
- `RemoteBuildOutcome` + `RemoteBuildReport` in `_types.py` mirror the
|
||||
C10 `BuildOutcome` + `BuildReport` shapes that come back as JSON over
|
||||
SSH (the orchestrator never imports C10 types — it parses JSON the
|
||||
remote process emits).
|
||||
- External-error matching uses **name-based whitelisting**
|
||||
(`_is_recognised(exc, frozenset({"SatelliteProviderError", ...}))`)
|
||||
rather than `isinstance`. C12 cannot import the actual C11/C10
|
||||
exception classes; matching by class name keeps the orchestrator's
|
||||
failure-phase classification accurate without violating the policy.
|
||||
Anything not on the whitelist propagates raw (per AC-6 — unexpected
|
||||
exceptions still release the lock and surface to the caller).
|
||||
- C11 → C12 type translation lives at the composition root
|
||||
(`c12_factory.build_build_cache_orchestrator`), where both components
|
||||
may legally be imported together.
|
||||
|
||||
### 3. Failure surfacing: report vs raise
|
||||
|
||||
Spec text alternates between "raise `CacheBuildError(...)`" and
|
||||
"return `CacheBuildReport(outcome=failure, ...)`". Resolved by:
|
||||
|
||||
- **Returning** `CacheBuildReport(outcome=failure, failure_phase=...)`
|
||||
for all *recognised* operational failures (download error, companion
|
||||
not ready, build error, flight-resolve error). The CLI's
|
||||
`_exit_code_for_report` helper translates these into the existing
|
||||
exit-code constants — `EXIT_DOWNLOAD_FAILURE` (20),
|
||||
`EXIT_BUILD_FAILURE` (21), `EXIT_FLIGHT_RESOLVE_*` (90/91/92/93/94/95
|
||||
per AZ-489's existing taxonomy, selected via the new
|
||||
`failure_exception_type` field on `CacheBuildReport`).
|
||||
- **Raising** `BuildLockHeldError` for lockfile contention only — this
|
||||
is the one case where there is no `CacheBuildReport` to return (no
|
||||
phase ever ran). The CLI maps it to `EXIT_LOCK_HELD` (50).
|
||||
|
||||
This keeps the orchestrator's public surface a pure function-style
|
||||
return-the-report API for normal operator flows, while reserving
|
||||
exceptions for the "we never even started" case.
|
||||
|
||||
### 4. Workstation-side `filelock`, NOT a custom primitive
|
||||
|
||||
`FileLock` / `FileLockFactory` Protocols + `FilelockFileLockFactory`
|
||||
concrete wrap the already-pinned `filelock` library (used by E-C13).
|
||||
Cross-platform file-locking correctness is non-trivial; the spec
|
||||
explicitly forbids rolling our own. The `LockTimeout` exception is the
|
||||
single failure mode the orchestrator catches.
|
||||
|
||||
### 5. Remote C10 stdout streaming + secret redaction
|
||||
|
||||
`RemoteCacheProvisionerInvoker.invoke` line-iterates the SSH session's
|
||||
stdout, logging each line as `kind="c10.remote.progress"` at DEBUG.
|
||||
Memory stays bounded even for multi-hour builds. Before logging, every
|
||||
line is filtered through a redactor that replaces the literal `api_key`
|
||||
and `auth_token` values with `<REDACTED>` (defence-in-depth — guards
|
||||
against C10 itself accidentally echoing them). The final stdout line
|
||||
is parsed as `RemoteBuildReport` JSON; malformed output raises
|
||||
`BuildReportParseError` (a `CacheBuildError` subclass) with the
|
||||
captured tail.
|
||||
|
||||
### 6. CLI request construction now lives in `cli.py`, not orchestrator
|
||||
|
||||
The previous CLI implementation called `_resolve_flight` itself before
|
||||
handing per-flag values to a no-op orchestrator stub. AZ-328 moves
|
||||
flight resolution into the orchestrator (per spec), so the CLI now
|
||||
just packages flags into a `BuildCacheRequest` (with `FlightSource`
|
||||
sum type — `FlightById | FlightFromFile`) and forwards. The CLI is
|
||||
now a thin adapter; all the workflow logic lives in
|
||||
`build_cache.py`.
|
||||
|
||||
## Files Changed
|
||||
|
||||
### Production source (new)
|
||||
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/build_cache.py`
|
||||
— `BuildCacheOrchestrator` class + sequenced `build_cache(...)` flow
|
||||
+ `_is_recognised` helper for name-based external-error matching.
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/file_lock.py`
|
||||
— `FileLock` / `FileLockFactory` Protocols, `LockTimeout` exception,
|
||||
`FilelockFileLockFactory` concrete wrapping the `filelock` library.
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/remote_c10_invoker.py`
|
||||
— `RemoteCacheProvisionerInvoker` (SSH-driven), `RemoteBuildRequest`
|
||||
DTO, secret-redacting stdout streamer, JSON parser for the final
|
||||
`RemoteBuildReport`.
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/tile_downloader_cut.py`
|
||||
— `TileDownloaderCut` Protocol (consumer-side cut for C11
|
||||
`TileDownloader`).
|
||||
|
||||
### Production source (modified)
|
||||
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/_types.py` —
|
||||
added `BuildCacheRequest`, `CacheBuildReport`, `FlightResolveReport`,
|
||||
`BuildCacheOutcome` enum, `FailurePhase` enum, `FlightResolveSource`
|
||||
enum, `FlightSource` sum type (`FlightById` / `FlightFromFile`),
|
||||
`RemoteBuildOutcome` + `RemoteBuildReport` (C10 mirror types),
|
||||
`DownloadRequestCut` + `DownloadBatchReportCut` + `DownloadOutcomeCut`
|
||||
(C11 mirror types).
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/errors.py` —
|
||||
added `CacheBuildError(Exception)` with phase-aware `remediation`
|
||||
attribute, `BuildLockHeldError(CacheBuildError)`,
|
||||
`BuildReportParseError(CacheBuildError)`.
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/config.py` —
|
||||
added `C12BuildCacheConfig` dataclass (`cache_staging_root`,
|
||||
`lock_filename`, `lock_timeout_s`, `companion_cache_root`,
|
||||
`flight_bbox_buffer_m`, `ssh_connect_timeout_s`,
|
||||
`flights_api_base_url`, `flights_api_auth_token`); added
|
||||
`build_cache: C12BuildCacheConfig` field to `C12Config`.
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/cli.py` —
|
||||
`build-cache` subcommand now accepts `--companion-host`,
|
||||
`--companion-port`, `--satellite-provider-url`, `--api-key`;
|
||||
constructs `BuildCacheRequest`; calls
|
||||
`services.build_cache_orchestrator.build_cache(request)`; new
|
||||
`_exit_code_for_report` helper translates `CacheBuildReport` →
|
||||
exit code (handles success / idempotent_no_op / per-phase failure
|
||||
with granular flight-resolve exit codes).
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/__init__.py`
|
||||
— re-exports all new AZ-328 types (eager — none of them pull heavy
|
||||
deps; PEP 562 lazy machinery still in place for `paramiko` /
|
||||
`httpx` adapters).
|
||||
- `src/gps_denied_onboard/runtime_root/c12_factory.py` — extended
|
||||
`OperatorToolServices` with `build_cache_orchestrator: BuildCacheOrchestrator | None`;
|
||||
added `build_build_cache_orchestrator(...)` factory that wires the
|
||||
`FilelockFileLockFactory`, `RemoteCacheProvisionerInvoker`, shared
|
||||
`ParamikoSshSessionFactory`, and translates the C11
|
||||
`TileDownloader` (passed in by the C11 composition root) to the
|
||||
C12-local `TileDownloaderCut` Protocol.
|
||||
|
||||
### Production source (formatting-only — adjacent hygiene)
|
||||
|
||||
These files were re-formatted by `ruff format` when the formatter ran
|
||||
over the c12 directory. No behaviour change; pure whitespace /
|
||||
line-wrapping normalisation:
|
||||
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/_parser.py`
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/bbox.py`
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/file_loader.py`
|
||||
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/httpx_client.py`
|
||||
- `tests/unit/c12_operator_tooling/test_az489_flights_api_client.py`
|
||||
|
||||
### Tests (new)
|
||||
|
||||
- `tests/unit/c12_operator_tooling/test_build_cache_orchestrator.py` —
|
||||
29 tests covering all 15 ACs + NFR-perf-overhead microbench.
|
||||
- `tests/unit/c12_operator_tooling/test_remote_c10_invoker.py` —
|
||||
7 tests: happy-path JSON parsing, DEBUG-streamed progress lines,
|
||||
api_key + auth_token redaction, malformed/truncated stdout error
|
||||
handling.
|
||||
- `tests/unit/c12_operator_tooling/test_file_lock.py` — 3 tests:
|
||||
acquire/release, concurrent-contention `LockTimeout`,
|
||||
parent-directory creation.
|
||||
|
||||
### Tests (modified)
|
||||
|
||||
- `tests/unit/c12_operator_tooling/test_cli_build_cache.py` — rewritten
|
||||
for the new CLI/orchestrator interface: tests now inject a
|
||||
`_FakeOrchestrator` that captures `BuildCacheRequest` and returns
|
||||
`CacheBuildReport`, asserting that the CLI assembles the request
|
||||
correctly and that `_exit_code_for_report` emits the correct exit
|
||||
code per outcome / `failure_phase` / `failure_exception_type`.
|
||||
|
||||
## Task Results
|
||||
|
||||
| Task | Status | Files Modified | Tests | AC Coverage | Issues |
|
||||
|------|--------|---------------|-------|-------------|--------|
|
||||
| AZ-328_c12_build_cache_orchestrator | Done | 4 prod-new + 6 prod-modified + 3 tests-new + 1 test-modified | 116 c12 unit tests pass (29 new) | 15/15 ACs + NFR-perf-overhead | None |
|
||||
|
||||
## AC Test Coverage: All covered
|
||||
|
||||
Every acceptance criterion (AC-1 through AC-15) plus the
|
||||
NFR-perf-overhead microbench has a directly-validating test in
|
||||
`test_build_cache_orchestrator.py`. Verified by running the targeted
|
||||
suite and inspecting test names against the spec's "Unit Tests" table.
|
||||
|
||||
## Code Review Verdict: PASS
|
||||
|
||||
### Findings
|
||||
|
||||
None of severity Low or higher.
|
||||
|
||||
**Notes (informational)**:
|
||||
|
||||
- `httpx_client.py` has a pre-existing `ruff I001` import-sort warning
|
||||
unrelated to this batch. Not in scope; left for the file's owning
|
||||
task to address.
|
||||
- Adjacent formatting-only changes to four `flights_api/*.py` files
|
||||
and one `test_az489_*.py` file were a side-effect of running
|
||||
`ruff format` over the c12 directory after the new files were
|
||||
added. They normalise existing pre-formatter style and are safe to
|
||||
ship.
|
||||
|
||||
## Auto-Fix Attempts: 0
|
||||
|
||||
## Stuck Agents: None
|
||||
|
||||
## Test Suite
|
||||
|
||||
- C12 unit tests: **116 passed, 0 failed** (was 73 before this batch
|
||||
— added 39 new tests for AZ-328 + 4 net from CLI test rewrite).
|
||||
- Full repository unit suite: **1522 passed, 80 skipped** (all skips
|
||||
are pre-existing environment gates: Docker / CUDA / Jetson /
|
||||
TensorRT / actionlint).
|
||||
- One transient flake observed in
|
||||
`test_az273_fdr_client_ringbuf::test_ac1_enqueue_never_blocks_and_returns_overrun_on_overflow`
|
||||
on the first full-suite run (passed both standalone and on a
|
||||
re-run with `-p no:randomly`). Unrelated to this batch (C13 / FDR
|
||||
area). Not blocking.
|
||||
- `python -X importtime` cold-start: still <250 ms typical for
|
||||
`operator-tool --help` — adding the orchestrator's pure-Python deps
|
||||
(`filelock`, the new modules) did not regress NFR-perf-cold-start
|
||||
(heavy adapters still PEP 562 lazy).
|
||||
|
||||
## Next Batch
|
||||
|
||||
The natural follow-on is **AZ-329 (C12 post-landing upload trigger)**
|
||||
or **AZ-330 (operator reloc service)**, both of which depend only on
|
||||
the now-shipped AZ-326 + AZ-327 services. Confirm with
|
||||
`_docs/02_tasks/_dependencies_table.md` at the start of Batch 44.
|
||||
Reference in New Issue
Block a user