Files
gps-denied-onboard/_docs/03_implementation/batch_43_cycle1_report.md
T
Oleksandr Bezdieniezhnykh 7644b25e8c [AZ-328] C12 BuildCacheOrchestrator + remote C10 invoker (Batch 43)
Implements F1 pre-flight cache build orchestrator on the operator
workstation. Composes C11 TileDownloader (AZ-316), C12 CompanionBringup
(AZ-327), C12 FlightsApiClient (AZ-489), and the new
RemoteCacheProvisionerInvoker into one sequenced flow guarded by a
filelock-backed workstation-side lockfile.

Architectural decisions:
- Phase-0 flight-resolve runs BEFORE the lockfile (ADR-010): a flight
  that cannot be resolved is an operator-input error, not a contended-
  resource error. Enforced by AC-11 + AC-14.
- Consumer-side cuts (AZ-507) for C11 + C10 types: local Protocols /
  mirror DTOs in tile_downloader_cut.py and _types.py; external errors
  matched by name-based whitelisting so unknown exceptions still
  propagate per AC-6. Cross-component type translation lives at the
  composition root (c12_factory).
- Failure surfacing: recognised operational failures (download error,
  companion not ready, build error, flight-resolve error) return as
  CacheBuildReport(outcome=failure, failure_phase=...). Only lockfile
  contention raises (BuildLockHeldError) since no phase ever ran.
- Workstation-side filelock library (project pin); no custom primitive.
- Remote C10 stdout streamed line-by-line as DEBUG with api_key /
  auth_token redacted before logging (defence-in-depth).
- CLI is now a thin adapter; all workflow logic lives in
  build_cache.py. operator-tool build-cache exit codes map per
  CacheBuildReport.failure_phase + failure_exception_type.

Tests: 116 c12 unit tests pass (29 new for AZ-328 covering 15/15 ACs +
NFR-perf-overhead microbench; 7 new for remote_c10_invoker; 3 new for
file_lock; test_cli_build_cache rewritten for new orchestrator
interface). Full repo suite: 1522 passed, 80 skipped.

Also: replays Batch 42's ruff format leftover for c12 flights_api +
test_az489 files (formatter ran over the c12 directory after new
files were added). Pure whitespace; no behaviour change.

Full report: _docs/03_implementation/batch_43_cycle1_report.md

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-13 11:03:46 +03:00

12 KiB

Batch 43 — Cycle 1 Report

Date: 2026-05-13 Batch: 43 Tasks: AZ-328 (C12 Build-Cache Orchestrator, 5pt) Status: complete; ticket ready to transition to "In Testing".

Scope

AZ-328 delivers BuildCacheOrchestrator — the F1 (pre-flight cache build) workflow head for the operator workstation. It composes the already-shipped C11 TileDownloader (AZ-316), C12 CompanionBringup (AZ-327), C12 FlightsApiClient (AZ-489), and the new RemoteCacheProvisionerInvoker (this task) into one sequenced flow, guarded by a workstation-side filelock lockfile. The operator-tool build-cache subcommand (T1, AZ-326) is now wired to a real orchestrator and returns granular exit codes per failure phase.

This unblocks the F1 happy-path acceptance test (C12-IT-02) and gives operators the first end-to-end "I have a flight ID, build me a cache" workflow.

Architectural Decisions

1. Phase-0 flight-resolve runs BEFORE the lockfile (ADR-010)

Per the AZ-328 spec, the orchestrator resolves the FlightDto (via flights_api_client.fetch_flight or load_flight_file), derives the bbox, and computes the takeoff origin before acquiring the .c12.lock file. Rationale: a flight that cannot be resolved is an operator-input error (wrong flight ID, missing JSON file, expired auth token); holding a contended-resource lock while the operator fixes their input would artificially block parallel builds against unrelated flights. AC-11 and AC-14 enforce.

2. Consumer-side cuts (AZ-507) for C11 + C10 types

Per the workspace cross-component import policy, C12 cannot import C11 or C10 directly. Implemented as:

  • tile_downloader_cut.py — local TileDownloaderCut Protocol + DownloadRequestCut / DownloadBatchReportCut / DownloadOutcomeCut DTOs mirroring the C11 shapes the orchestrator actually consumes.
  • RemoteBuildOutcome + RemoteBuildReport in _types.py mirror the C10 BuildOutcome + BuildReport shapes that come back as JSON over SSH (the orchestrator never imports C10 types — it parses JSON the remote process emits).
  • External-error matching uses name-based whitelisting (_is_recognised(exc, frozenset({"SatelliteProviderError", ...}))) rather than isinstance. C12 cannot import the actual C11/C10 exception classes; matching by class name keeps the orchestrator's failure-phase classification accurate without violating the policy. Anything not on the whitelist propagates raw (per AC-6 — unexpected exceptions still release the lock and surface to the caller).
  • C11 → C12 type translation lives at the composition root (c12_factory.build_build_cache_orchestrator), where both components may legally be imported together.

3. Failure surfacing: report vs raise

Spec text alternates between "raise CacheBuildError(...)" and "return CacheBuildReport(outcome=failure, ...)". Resolved by:

  • Returning CacheBuildReport(outcome=failure, failure_phase=...) for all recognised operational failures (download error, companion not ready, build error, flight-resolve error). The CLI's _exit_code_for_report helper translates these into the existing exit-code constants — EXIT_DOWNLOAD_FAILURE (20), EXIT_BUILD_FAILURE (21), EXIT_FLIGHT_RESOLVE_* (90/91/92/93/94/95 per AZ-489's existing taxonomy, selected via the new failure_exception_type field on CacheBuildReport).
  • Raising BuildLockHeldError for lockfile contention only — this is the one case where there is no CacheBuildReport to return (no phase ever ran). The CLI maps it to EXIT_LOCK_HELD (50).

This keeps the orchestrator's public surface a pure function-style return-the-report API for normal operator flows, while reserving exceptions for the "we never even started" case.

4. Workstation-side filelock, NOT a custom primitive

FileLock / FileLockFactory Protocols + FilelockFileLockFactory concrete wrap the already-pinned filelock library (used by E-C13). Cross-platform file-locking correctness is non-trivial; the spec explicitly forbids rolling our own. The LockTimeout exception is the single failure mode the orchestrator catches.

5. Remote C10 stdout streaming + secret redaction

RemoteCacheProvisionerInvoker.invoke line-iterates the SSH session's stdout, logging each line as kind="c10.remote.progress" at DEBUG. Memory stays bounded even for multi-hour builds. Before logging, every line is filtered through a redactor that replaces the literal api_key and auth_token values with <REDACTED> (defence-in-depth — guards against C10 itself accidentally echoing them). The final stdout line is parsed as RemoteBuildReport JSON; malformed output raises BuildReportParseError (a CacheBuildError subclass) with the captured tail.

6. CLI request construction now lives in cli.py, not orchestrator

The previous CLI implementation called _resolve_flight itself before handing per-flag values to a no-op orchestrator stub. AZ-328 moves flight resolution into the orchestrator (per spec), so the CLI now just packages flags into a BuildCacheRequest (with FlightSource sum type — FlightById | FlightFromFile) and forwards. The CLI is now a thin adapter; all the workflow logic lives in build_cache.py.

Files Changed

Production source (new)

  • src/gps_denied_onboard/components/c12_operator_tooling/build_cache.pyBuildCacheOrchestrator class + sequenced build_cache(...) flow
    • _is_recognised helper for name-based external-error matching.
  • src/gps_denied_onboard/components/c12_operator_tooling/file_lock.pyFileLock / FileLockFactory Protocols, LockTimeout exception, FilelockFileLockFactory concrete wrapping the filelock library.
  • src/gps_denied_onboard/components/c12_operator_tooling/remote_c10_invoker.pyRemoteCacheProvisionerInvoker (SSH-driven), RemoteBuildRequest DTO, secret-redacting stdout streamer, JSON parser for the final RemoteBuildReport.
  • src/gps_denied_onboard/components/c12_operator_tooling/tile_downloader_cut.pyTileDownloaderCut Protocol (consumer-side cut for C11 TileDownloader).

Production source (modified)

  • src/gps_denied_onboard/components/c12_operator_tooling/_types.py — added BuildCacheRequest, CacheBuildReport, FlightResolveReport, BuildCacheOutcome enum, FailurePhase enum, FlightResolveSource enum, FlightSource sum type (FlightById / FlightFromFile), RemoteBuildOutcome + RemoteBuildReport (C10 mirror types), DownloadRequestCut + DownloadBatchReportCut + DownloadOutcomeCut (C11 mirror types).
  • src/gps_denied_onboard/components/c12_operator_tooling/errors.py — added CacheBuildError(Exception) with phase-aware remediation attribute, BuildLockHeldError(CacheBuildError), BuildReportParseError(CacheBuildError).
  • src/gps_denied_onboard/components/c12_operator_tooling/config.py — added C12BuildCacheConfig dataclass (cache_staging_root, lock_filename, lock_timeout_s, companion_cache_root, flight_bbox_buffer_m, ssh_connect_timeout_s, flights_api_base_url, flights_api_auth_token); added build_cache: C12BuildCacheConfig field to C12Config.
  • src/gps_denied_onboard/components/c12_operator_tooling/cli.pybuild-cache subcommand now accepts --companion-host, --companion-port, --satellite-provider-url, --api-key; constructs BuildCacheRequest; calls services.build_cache_orchestrator.build_cache(request); new _exit_code_for_report helper translates CacheBuildReport → exit code (handles success / idempotent_no_op / per-phase failure with granular flight-resolve exit codes).
  • src/gps_denied_onboard/components/c12_operator_tooling/__init__.py — re-exports all new AZ-328 types (eager — none of them pull heavy deps; PEP 562 lazy machinery still in place for paramiko / httpx adapters).
  • src/gps_denied_onboard/runtime_root/c12_factory.py — extended OperatorToolServices with build_cache_orchestrator: BuildCacheOrchestrator | None; added build_build_cache_orchestrator(...) factory that wires the FilelockFileLockFactory, RemoteCacheProvisionerInvoker, shared ParamikoSshSessionFactory, and translates the C11 TileDownloader (passed in by the C11 composition root) to the C12-local TileDownloaderCut Protocol.

Production source (formatting-only — adjacent hygiene)

These files were re-formatted by ruff format when the formatter ran over the c12 directory. No behaviour change; pure whitespace / line-wrapping normalisation:

  • src/gps_denied_onboard/components/c12_operator_tooling/flights_api/_parser.py
  • src/gps_denied_onboard/components/c12_operator_tooling/flights_api/bbox.py
  • src/gps_denied_onboard/components/c12_operator_tooling/flights_api/file_loader.py
  • src/gps_denied_onboard/components/c12_operator_tooling/flights_api/httpx_client.py
  • tests/unit/c12_operator_tooling/test_az489_flights_api_client.py

Tests (new)

  • tests/unit/c12_operator_tooling/test_build_cache_orchestrator.py — 29 tests covering all 15 ACs + NFR-perf-overhead microbench.
  • tests/unit/c12_operator_tooling/test_remote_c10_invoker.py — 7 tests: happy-path JSON parsing, DEBUG-streamed progress lines, api_key + auth_token redaction, malformed/truncated stdout error handling.
  • tests/unit/c12_operator_tooling/test_file_lock.py — 3 tests: acquire/release, concurrent-contention LockTimeout, parent-directory creation.

Tests (modified)

  • tests/unit/c12_operator_tooling/test_cli_build_cache.py — rewritten for the new CLI/orchestrator interface: tests now inject a _FakeOrchestrator that captures BuildCacheRequest and returns CacheBuildReport, asserting that the CLI assembles the request correctly and that _exit_code_for_report emits the correct exit code per outcome / failure_phase / failure_exception_type.

Task Results

Task Status Files Modified Tests AC Coverage Issues
AZ-328_c12_build_cache_orchestrator Done 4 prod-new + 6 prod-modified + 3 tests-new + 1 test-modified 116 c12 unit tests pass (29 new) 15/15 ACs + NFR-perf-overhead None

AC Test Coverage: All covered

Every acceptance criterion (AC-1 through AC-15) plus the NFR-perf-overhead microbench has a directly-validating test in test_build_cache_orchestrator.py. Verified by running the targeted suite and inspecting test names against the spec's "Unit Tests" table.

Code Review Verdict: PASS

Findings

None of severity Low or higher.

Notes (informational):

  • httpx_client.py has a pre-existing ruff I001 import-sort warning unrelated to this batch. Not in scope; left for the file's owning task to address.
  • Adjacent formatting-only changes to four flights_api/*.py files and one test_az489_*.py file were a side-effect of running ruff format over the c12 directory after the new files were added. They normalise existing pre-formatter style and are safe to ship.

Auto-Fix Attempts: 0

Stuck Agents: None

Test Suite

  • C12 unit tests: 116 passed, 0 failed (was 73 before this batch — added 39 new tests for AZ-328 + 4 net from CLI test rewrite).
  • Full repository unit suite: 1522 passed, 80 skipped (all skips are pre-existing environment gates: Docker / CUDA / Jetson / TensorRT / actionlint).
  • One transient flake observed in test_az273_fdr_client_ringbuf::test_ac1_enqueue_never_blocks_and_returns_overrun_on_overflow on the first full-suite run (passed both standalone and on a re-run with -p no:randomly). Unrelated to this batch (C13 / FDR area). Not blocking.
  • python -X importtime cold-start: still <250 ms typical for operator-tool --help — adding the orchestrator's pure-Python deps (filelock, the new modules) did not regress NFR-perf-cold-start (heavy adapters still PEP 562 lazy).

Next Batch

The natural follow-on is AZ-329 (C12 post-landing upload trigger) or AZ-330 (operator reloc service), both of which depend only on the now-shipped AZ-326 + AZ-327 services. Confirm with _docs/02_tasks/_dependencies_table.md at the start of Batch 44.