[AZ-328] C12 BuildCacheOrchestrator + remote C10 invoker (Batch 43)

Implements F1 pre-flight cache build orchestrator on the operator
workstation. Composes C11 TileDownloader (AZ-316), C12 CompanionBringup
(AZ-327), C12 FlightsApiClient (AZ-489), and the new
RemoteCacheProvisionerInvoker into one sequenced flow guarded by a
filelock-backed workstation-side lockfile.

Architectural decisions:
- Phase-0 flight-resolve runs BEFORE the lockfile (ADR-010): a flight
  that cannot be resolved is an operator-input error, not a contended-
  resource error. Enforced by AC-11 + AC-14.
- Consumer-side cuts (AZ-507) for C11 + C10 types: local Protocols /
  mirror DTOs in tile_downloader_cut.py and _types.py; external errors
  matched by name-based whitelisting so unknown exceptions still
  propagate per AC-6. Cross-component type translation lives at the
  composition root (c12_factory).
- Failure surfacing: recognised operational failures (download error,
  companion not ready, build error, flight-resolve error) return as
  CacheBuildReport(outcome=failure, failure_phase=...). Only lockfile
  contention raises (BuildLockHeldError) since no phase ever ran.
- Workstation-side filelock library (project pin); no custom primitive.
- Remote C10 stdout streamed line-by-line as DEBUG with api_key /
  auth_token redacted before logging (defence-in-depth).
- CLI is now a thin adapter; all workflow logic lives in
  build_cache.py. operator-tool build-cache exit codes map per
  CacheBuildReport.failure_phase + failure_exception_type.

Tests: 116 c12 unit tests pass (29 new for AZ-328 covering 15/15 ACs +
NFR-perf-overhead microbench; 7 new for remote_c10_invoker; 3 new for
file_lock; test_cli_build_cache rewritten for new orchestrator
interface). Full repo suite: 1522 passed, 80 skipped.

Also: replays Batch 42's ruff format leftover for c12 flights_api +
test_az489 files (formatter ran over the c12 directory after new
files were added). Pure whitespace; no behaviour change.

Full report: _docs/03_implementation/batch_43_cycle1_report.md

Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-05-13 11:03:46 +03:00
parent 099c75c6f8
commit 7644b25e8c
23 changed files with 3585 additions and 256 deletions
@@ -0,0 +1,256 @@
# Batch 43 — Cycle 1 Report
**Date**: 2026-05-13
**Batch**: 43
**Tasks**: AZ-328 (C12 Build-Cache Orchestrator, 5pt)
**Status**: complete; ticket ready to transition to "In Testing".
## Scope
AZ-328 delivers `BuildCacheOrchestrator` — the F1 (pre-flight cache
build) workflow head for the operator workstation. It composes the
already-shipped C11 `TileDownloader` (AZ-316), C12 `CompanionBringup`
(AZ-327), C12 `FlightsApiClient` (AZ-489), and the new
`RemoteCacheProvisionerInvoker` (this task) into one sequenced flow,
guarded by a workstation-side `filelock` lockfile. The `operator-tool
build-cache` subcommand (T1, AZ-326) is now wired to a real
orchestrator and returns granular exit codes per failure phase.
This unblocks the F1 happy-path acceptance test (C12-IT-02) and gives
operators the first end-to-end "I have a flight ID, build me a cache"
workflow.
## Architectural Decisions
### 1. Phase-0 flight-resolve runs BEFORE the lockfile (ADR-010)
Per the AZ-328 spec, the orchestrator resolves the `FlightDto` (via
`flights_api_client.fetch_flight` or `load_flight_file`), derives the
bbox, and computes the takeoff origin **before** acquiring the `.c12.lock`
file. Rationale: a flight that cannot be resolved is an operator-input
error (wrong flight ID, missing JSON file, expired auth token); holding
a contended-resource lock while the operator fixes their input would
artificially block parallel builds against unrelated flights. AC-11 and
AC-14 enforce.
### 2. Consumer-side cuts (AZ-507) for C11 + C10 types
Per the workspace cross-component import policy, C12 cannot import C11
or C10 directly. Implemented as:
- `tile_downloader_cut.py` — local `TileDownloaderCut` Protocol +
`DownloadRequestCut` / `DownloadBatchReportCut` / `DownloadOutcomeCut`
DTOs mirroring the C11 shapes the orchestrator actually consumes.
- `RemoteBuildOutcome` + `RemoteBuildReport` in `_types.py` mirror the
C10 `BuildOutcome` + `BuildReport` shapes that come back as JSON over
SSH (the orchestrator never imports C10 types — it parses JSON the
remote process emits).
- External-error matching uses **name-based whitelisting**
(`_is_recognised(exc, frozenset({"SatelliteProviderError", ...}))`)
rather than `isinstance`. C12 cannot import the actual C11/C10
exception classes; matching by class name keeps the orchestrator's
failure-phase classification accurate without violating the policy.
Anything not on the whitelist propagates raw (per AC-6 — unexpected
exceptions still release the lock and surface to the caller).
- C11 → C12 type translation lives at the composition root
(`c12_factory.build_build_cache_orchestrator`), where both components
may legally be imported together.
### 3. Failure surfacing: report vs raise
Spec text alternates between "raise `CacheBuildError(...)`" and
"return `CacheBuildReport(outcome=failure, ...)`". Resolved by:
- **Returning** `CacheBuildReport(outcome=failure, failure_phase=...)`
for all *recognised* operational failures (download error, companion
not ready, build error, flight-resolve error). The CLI's
`_exit_code_for_report` helper translates these into the existing
exit-code constants — `EXIT_DOWNLOAD_FAILURE` (20),
`EXIT_BUILD_FAILURE` (21), `EXIT_FLIGHT_RESOLVE_*` (90/91/92/93/94/95
per AZ-489's existing taxonomy, selected via the new
`failure_exception_type` field on `CacheBuildReport`).
- **Raising** `BuildLockHeldError` for lockfile contention only — this
is the one case where there is no `CacheBuildReport` to return (no
phase ever ran). The CLI maps it to `EXIT_LOCK_HELD` (50).
This keeps the orchestrator's public surface a pure function-style
return-the-report API for normal operator flows, while reserving
exceptions for the "we never even started" case.
### 4. Workstation-side `filelock`, NOT a custom primitive
`FileLock` / `FileLockFactory` Protocols + `FilelockFileLockFactory`
concrete wrap the already-pinned `filelock` library (used by E-C13).
Cross-platform file-locking correctness is non-trivial; the spec
explicitly forbids rolling our own. The `LockTimeout` exception is the
single failure mode the orchestrator catches.
### 5. Remote C10 stdout streaming + secret redaction
`RemoteCacheProvisionerInvoker.invoke` line-iterates the SSH session's
stdout, logging each line as `kind="c10.remote.progress"` at DEBUG.
Memory stays bounded even for multi-hour builds. Before logging, every
line is filtered through a redactor that replaces the literal `api_key`
and `auth_token` values with `<REDACTED>` (defence-in-depth — guards
against C10 itself accidentally echoing them). The final stdout line
is parsed as `RemoteBuildReport` JSON; malformed output raises
`BuildReportParseError` (a `CacheBuildError` subclass) with the
captured tail.
### 6. CLI request construction now lives in `cli.py`, not orchestrator
The previous CLI implementation called `_resolve_flight` itself before
handing per-flag values to a no-op orchestrator stub. AZ-328 moves
flight resolution into the orchestrator (per spec), so the CLI now
just packages flags into a `BuildCacheRequest` (with `FlightSource`
sum type — `FlightById | FlightFromFile`) and forwards. The CLI is
now a thin adapter; all the workflow logic lives in
`build_cache.py`.
## Files Changed
### Production source (new)
- `src/gps_denied_onboard/components/c12_operator_tooling/build_cache.py`
`BuildCacheOrchestrator` class + sequenced `build_cache(...)` flow
+ `_is_recognised` helper for name-based external-error matching.
- `src/gps_denied_onboard/components/c12_operator_tooling/file_lock.py`
`FileLock` / `FileLockFactory` Protocols, `LockTimeout` exception,
`FilelockFileLockFactory` concrete wrapping the `filelock` library.
- `src/gps_denied_onboard/components/c12_operator_tooling/remote_c10_invoker.py`
`RemoteCacheProvisionerInvoker` (SSH-driven), `RemoteBuildRequest`
DTO, secret-redacting stdout streamer, JSON parser for the final
`RemoteBuildReport`.
- `src/gps_denied_onboard/components/c12_operator_tooling/tile_downloader_cut.py`
`TileDownloaderCut` Protocol (consumer-side cut for C11
`TileDownloader`).
### Production source (modified)
- `src/gps_denied_onboard/components/c12_operator_tooling/_types.py`
added `BuildCacheRequest`, `CacheBuildReport`, `FlightResolveReport`,
`BuildCacheOutcome` enum, `FailurePhase` enum, `FlightResolveSource`
enum, `FlightSource` sum type (`FlightById` / `FlightFromFile`),
`RemoteBuildOutcome` + `RemoteBuildReport` (C10 mirror types),
`DownloadRequestCut` + `DownloadBatchReportCut` + `DownloadOutcomeCut`
(C11 mirror types).
- `src/gps_denied_onboard/components/c12_operator_tooling/errors.py`
added `CacheBuildError(Exception)` with phase-aware `remediation`
attribute, `BuildLockHeldError(CacheBuildError)`,
`BuildReportParseError(CacheBuildError)`.
- `src/gps_denied_onboard/components/c12_operator_tooling/config.py`
added `C12BuildCacheConfig` dataclass (`cache_staging_root`,
`lock_filename`, `lock_timeout_s`, `companion_cache_root`,
`flight_bbox_buffer_m`, `ssh_connect_timeout_s`,
`flights_api_base_url`, `flights_api_auth_token`); added
`build_cache: C12BuildCacheConfig` field to `C12Config`.
- `src/gps_denied_onboard/components/c12_operator_tooling/cli.py`
`build-cache` subcommand now accepts `--companion-host`,
`--companion-port`, `--satellite-provider-url`, `--api-key`;
constructs `BuildCacheRequest`; calls
`services.build_cache_orchestrator.build_cache(request)`; new
`_exit_code_for_report` helper translates `CacheBuildReport`
exit code (handles success / idempotent_no_op / per-phase failure
with granular flight-resolve exit codes).
- `src/gps_denied_onboard/components/c12_operator_tooling/__init__.py`
— re-exports all new AZ-328 types (eager — none of them pull heavy
deps; PEP 562 lazy machinery still in place for `paramiko` /
`httpx` adapters).
- `src/gps_denied_onboard/runtime_root/c12_factory.py` — extended
`OperatorToolServices` with `build_cache_orchestrator: BuildCacheOrchestrator | None`;
added `build_build_cache_orchestrator(...)` factory that wires the
`FilelockFileLockFactory`, `RemoteCacheProvisionerInvoker`, shared
`ParamikoSshSessionFactory`, and translates the C11
`TileDownloader` (passed in by the C11 composition root) to the
C12-local `TileDownloaderCut` Protocol.
### Production source (formatting-only — adjacent hygiene)
These files were re-formatted by `ruff format` when the formatter ran
over the c12 directory. No behaviour change; pure whitespace /
line-wrapping normalisation:
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/_parser.py`
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/bbox.py`
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/file_loader.py`
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/httpx_client.py`
- `tests/unit/c12_operator_tooling/test_az489_flights_api_client.py`
### Tests (new)
- `tests/unit/c12_operator_tooling/test_build_cache_orchestrator.py`
29 tests covering all 15 ACs + NFR-perf-overhead microbench.
- `tests/unit/c12_operator_tooling/test_remote_c10_invoker.py`
7 tests: happy-path JSON parsing, DEBUG-streamed progress lines,
api_key + auth_token redaction, malformed/truncated stdout error
handling.
- `tests/unit/c12_operator_tooling/test_file_lock.py` — 3 tests:
acquire/release, concurrent-contention `LockTimeout`,
parent-directory creation.
### Tests (modified)
- `tests/unit/c12_operator_tooling/test_cli_build_cache.py` — rewritten
for the new CLI/orchestrator interface: tests now inject a
`_FakeOrchestrator` that captures `BuildCacheRequest` and returns
`CacheBuildReport`, asserting that the CLI assembles the request
correctly and that `_exit_code_for_report` emits the correct exit
code per outcome / `failure_phase` / `failure_exception_type`.
## Task Results
| Task | Status | Files Modified | Tests | AC Coverage | Issues |
|------|--------|---------------|-------|-------------|--------|
| AZ-328_c12_build_cache_orchestrator | Done | 4 prod-new + 6 prod-modified + 3 tests-new + 1 test-modified | 116 c12 unit tests pass (29 new) | 15/15 ACs + NFR-perf-overhead | None |
## AC Test Coverage: All covered
Every acceptance criterion (AC-1 through AC-15) plus the
NFR-perf-overhead microbench has a directly-validating test in
`test_build_cache_orchestrator.py`. Verified by running the targeted
suite and inspecting test names against the spec's "Unit Tests" table.
## Code Review Verdict: PASS
### Findings
None of severity Low or higher.
**Notes (informational)**:
- `httpx_client.py` has a pre-existing `ruff I001` import-sort warning
unrelated to this batch. Not in scope; left for the file's owning
task to address.
- Adjacent formatting-only changes to four `flights_api/*.py` files
and one `test_az489_*.py` file were a side-effect of running
`ruff format` over the c12 directory after the new files were
added. They normalise existing pre-formatter style and are safe to
ship.
## Auto-Fix Attempts: 0
## Stuck Agents: None
## Test Suite
- C12 unit tests: **116 passed, 0 failed** (was 73 before this batch
— added 39 new tests for AZ-328 + 4 net from CLI test rewrite).
- Full repository unit suite: **1522 passed, 80 skipped** (all skips
are pre-existing environment gates: Docker / CUDA / Jetson /
TensorRT / actionlint).
- One transient flake observed in
`test_az273_fdr_client_ringbuf::test_ac1_enqueue_never_blocks_and_returns_overrun_on_overflow`
on the first full-suite run (passed both standalone and on a
re-run with `-p no:randomly`). Unrelated to this batch (C13 / FDR
area). Not blocking.
- `python -X importtime` cold-start: still <250 ms typical for
`operator-tool --help` — adding the orchestrator's pure-Python deps
(`filelock`, the new modules) did not regress NFR-perf-cold-start
(heavy adapters still PEP 562 lazy).
## Next Batch
The natural follow-on is **AZ-329 (C12 post-landing upload trigger)**
or **AZ-330 (operator reloc service)**, both of which depend only on
the now-shipped AZ-326 + AZ-327 services. Confirm with
`_docs/02_tasks/_dependencies_table.md` at the start of Batch 44.
+3 -1
View File
@@ -12,5 +12,7 @@ sub_step:
retry_count: 0
cycle: 1
tracker: jira
last_completed_batch: 42
last_completed_batch: 43
last_cumulative_review: batches_40-42
in_flight_batch: null
in_flight_tasks: null
@@ -1,24 +0,0 @@
# Leftover — Ruff format pass on AZ-489 files deferred
**Timestamp**: 2026-05-13T08:57:00+03:00
**What was blocked**: Running `ruff format` across the c12_operator_tooling
package during Batch 42 modified five files that belong to AZ-489
(c12 flights_api) and one AZ-489 test file. These reformat-only changes
were reverted to keep Batch 42 strictly scoped to AZ-326 + AZ-327.
**Files awaiting a follow-up format-only commit**:
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/_parser.py`
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/bbox.py`
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/file_loader.py`
- `src/gps_denied_onboard/components/c12_operator_tooling/flights_api/httpx_client.py`
- `tests/unit/c12_operator_tooling/test_az489_flights_api_client.py`
**Replay**: at the start of any future c12-touching task — or as a
dedicated `chore: ruff format c12 flights_api` commit on `dev` — run
`.venv/bin/python -m ruff format src/gps_denied_onboard/components/c12_operator_tooling/flights_api tests/unit/c12_operator_tooling/test_az489_flights_api_client.py`,
verify the test suite still passes, and commit. ~23 line insertions /
42 deletions; no behavioural changes.
**Reason for deferral**: scope discipline (per `coderule.mdc`). Batch 42
must not silently expand its diff into another component's files.