gps-denied-onboard/_docs/02_tasks/done/AZ-327_c12_companion_bringup.md
Oleksandr Bezdieniezhnykh 5fe67023b2 [AZ-329] [AZ-330] [AZ-523] [AZ-524] Batch 44 atomic refactor
Implements two new C12 services and rebalances the C11/C12 boundary
in one atomic commit:

* AZ-329 PostLandingUploadOrchestrator — gates C11 upload on the
  `flight_footer` FDR record's `clean_shutdown` field; 4 refusal
  modes; new FdrFooterReader Protocol + LocalFdrFooterReader.
* AZ-330 OperatorReLocService — AC-3.4 visual-loss re-localization
  hint; reuses shared LatLonAlt; OperatorCommandTransport Protocol
  cut (E-C8 owns the future pymavlink concrete); new FDR record
  kind `c12.reloc.requested`; log redaction (lat/lon 5 decimals,
  reason 200 chars).
* AZ-523 C11 internal flight-state gate removed (SRP refactor):
  `confirm_flight_state` / `FlightStateSignal` use /
  `FlightStateNotOnGroundError` deleted from C11; TileUploader
  contract bumped to v2.0.0 (frozen) with migration note; AZ-317
  superseded.
* AZ-524 Package rename `c12_operator_tooling` →
  `c12_operator_orchestrator` across source, tests, pyproject,
  CMake, Dockerfile, compose, CI, runtime-root services class
  (`OperatorOrchestratorServices`) + factory function
  (`build_operator_orchestrator`), logger namespaces, config slug,
  docs, and the E-C12 epic title.

Tests: 1543 passed, 80 skipped (all environment gates). Targeted
AC suite (AZ-329 + AZ-330 + FdrFooterReader): 37 passed. Cold-start
NFR-perf still ≤ 500 ms p99.

Tracker: AZ-317 → Done (superseded); AZ-319 v2.0.0 contract bump
comment; AZ-329/AZ-330 → In Testing; AZ-253 epic renamed; AZ-523
+ AZ-524 created and closed as audit-trail tickets.

See `_docs/03_implementation/batch_44_cycle1_report.md`.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-13 19:42:46 +03:00


C12 Companion Bringup — SSH verify_companion_ready + ReadinessReport

Task: AZ-327_c12_companion_bringup
Name: C12 Companion Bringup
Description: Implement CompanionBringup, the C12-internal helper that opens an SSH session against the companion (paramiko per project pin), inspects the companion-side filesystem for the four required pre-flight artifacts (Manifest.json, .engine files + AZ-280 sidecars, calibration JSON), verifies each engine against its sidecar by running a remote sha256sum over the engine path and comparing the result with the sidecar's hex digest, and returns a ReadinessReport per description.md § 2 (manifest_present, content_hashes_pass, engines_present, calibration_present, outcome ∈ {ready, not_ready}, not_ready_reasons: list[str]). It owns two error families: CompanionUnreachableError (SSH session-open failure: TCP refused, auth failed, host key mismatch, socket timeout) and ContentHashMismatchError (sidecar verification fails on at least one engine; this is distinct from "engine missing", which is a not-ready signal, not an exception). The public surface is a single method, verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport. SSH user, key file, host-key policy, connect timeout, and the canonical companion-side cache root come from config (config.c12.companion_ssh_user, config.c12.companion_ssh_keyfile, config.c12.companion_host_key_policy, config.c12.companion_connect_timeout_s, config.c12.companion_cache_root) per AZ-269. The checks run inside a try/finally block, so the connection is always closed even if one of the four checks raises. Logging: INFO on every successful call (with the four boolean flags + outcome); WARN on degraded readiness (outcome=not_ready without an exception); ERROR on the two error families.
Complexity: 3 points
Dependencies: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
Component: c12_operator_orchestrator (epic AZ-253 / E-C12)
Tracker: AZ-327
Epic: AZ-253 (E-C12)

Document Dependencies

  • _docs/02_document/components/13_c12_operator_orchestrator/description.md — § 2 (verify_companion_ready interface + ReadinessReport DTO shape), § 5 (CompanionUnreachableError, ContentHashMismatchError), § 7 (filesystem lockfile note — relevant for orchestrator T3 not this task).
  • _docs/02_document/contracts/shared_helpers/sha256_sidecar.md — sidecar file format (this task verifies remotely; does not import the helper but reuses the schema).
  • _docs/02_document/contracts/shared_helpers/engine_filename_schema.md — engine filename layout used to enumerate the expected engines list.
  • _docs/02_document/contracts/shared_logging/log_record_schema.md — INFO/WARN/ERROR log shapes.

Problem

Without a real CompanionBringup:

  • build_cache (sibling T3) cannot run safely — the orchestrator would invoke C10 on the companion without any pre-flight visibility into the companion's state. A half-provisioned companion would either silently miscompile (manifest stale) or corrupt the cache.
  • The verify-ready CLI subcommand has no implementation — operators cannot diagnose "is my companion in a usable state?" without SSHing in manually.
  • Pre-flight content-hash verification per AC-NEW-1's takeoff gate (AZ-324 covers the airborne side) has no operator-side counterpart — sidecar mismatches that occur during the SSH transfer would only surface at takeoff, too late.
  • CompanionUnreachableError and ContentHashMismatchError exist as concept-only types in description.md § 5 with no producer.
  • Configuration knobs for SSH credentials, host-key policy, and the canonical cache root have no consumer; AZ-269's loader cannot validate them against a concrete usage.

This task delivers the bring-up + verification layer. It does NOT orchestrate the build_cache flow (sibling T3 does), does NOT invoke C10 (T3 does via SSH after this task confirms readiness), and does NOT perform the takeoff-time content-hash verification (AZ-324 owns the airborne side).

Outcome

  • A CompanionBringup class at src/operator_tool/companion_bringup.py:
    • Constructor: __init__(self, *, ssh_factory: SshSessionFactory, sidecar_verifier: RemoteSidecarVerifier, logger: Logger, config: C12CompanionConfig).
    • C12CompanionConfig (@dataclass(frozen=True)): ssh_user: str, ssh_keyfile: Path, host_key_policy: enum {strict, known_hosts, reject_new}, connect_timeout_s: float = 10.0, companion_cache_root: PurePosixPath = PurePosixPath("/var/lib/azaion/c10/cache"), manifest_filename: str = "Manifest.json", calibration_filename: str = "camera_calibration.json", expected_engines: tuple[str, ...] = () (the orchestrator passes the list per the request; default empty fails AC-2 cleanly).
    • Public method: verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport.
  • DTOs at src/operator_tool/_types.py:
    • CompanionAddress (@dataclass(frozen=True)): host: str, port: int = 22.
    • ReadinessReport (@dataclass(frozen=True)): manifest_present: bool, content_hashes_pass: bool, engines_present: bool, calibration_present: bool, outcome: enum {ready, not_ready}, not_ready_reasons: tuple[str, ...], companion_cache_root: str, engines_inspected_count: int.
  • Errors at src/operator_tool/errors.py:
    • CompanionUnreachableError(Exception): attributes host: str, port: int, reason: enum {connect_refused, auth_failed, host_key_mismatch, timeout, other}, underlying_exception_repr: str. remediation attribute returns a one-line operator-friendly hint per reason.
    • ContentHashMismatchError(Exception): attributes engine_path: str, expected_sha256_hex: str, actual_sha256_hex: str. remediation attribute returns "Re-run the cache build (operator-orchestrator build-cache --area ...) to repopulate the affected engine.".
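  • An SshSessionFactory Protocol (together with the SshSession Protocol it returns) at src/operator_tool/ssh_session.py:
    from pathlib import PurePosixPath
    from typing import Protocol, runtime_checkable

    # CompanionAddress and RemoteCommandResult are project types defined elsewhere in the package.
    @runtime_checkable
    class SshSession(Protocol):
        # One open SSH connection to the companion.
        def run(self, command: str, *, timeout_s: float) -> RemoteCommandResult: ...
        def file_exists(self, remote_path: PurePosixPath) -> bool: ...
        def list_dir(self, remote_path: PurePosixPath) -> list[str]: ...
        def close(self) -> None: ...
    
    @runtime_checkable
    class SshSessionFactory(Protocol):
        # Opens a session or raises the underlying paramiko/socket exception.
        def open(self, address: CompanionAddress, *, timeout_s: float) -> SshSession: ...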
    
    Concrete implementation ParamikoSshSessionFactory wraps paramiko.SSHClient with the documented host-key policy mapping (strict → RejectPolicy, known_hosts → AutoAddPolicy gated on ~/.ssh/known_hosts presence, reject_new → RejectPolicy with explicit allowlist).
  • A RemoteSidecarVerifier helper at src/operator_tool/remote_sidecar_verifier.py:
    • verify(session: SshSession, engine_path: PurePosixPath) -> RemoteSidecarResult — runs sha256sum <engine_path> over the SSH session, parses the first 64 hex chars, reads the sidecar file at <engine_path>.sha256 via session.run("cat ..."), parses its 64 hex chars, compares case-insensitively. Returns RemoteSidecarResult(matches: bool, expected_hex: str, actual_hex: str).
  • Method flow for verify_companion_ready:
    1. Open SSH session via ssh_factory.open(companion_address, timeout_s=config.connect_timeout_s). On any paramiko/socket exception → catch and raise CompanionUnreachableError mapping the underlying type to a reason enum value. Always wrap subsequent steps in try/finally that closes the session.
    2. Check 1 — manifest_present: session.file_exists(companion_cache_root / manifest_filename).
    3. Check 2 — engines_present: session.list_dir(companion_cache_root / "engines") → set of filenames; compare against config.expected_engines. If config.expected_engines is empty → engines_present = False, not_ready_reasons += ["expected_engines list empty in caller-supplied config"]. Else engines_present = expected_engines.issubset(listed_engines); if not, append "engines_missing: <comma-list>".
    4. Check 3 — content_hashes_pass: for each engine in the intersection of expected_engines and listed_engines, call sidecar_verifier.verify(session, companion_cache_root / "engines" / engine). If ANY result matches == False → raise ContentHashMismatchError with the first failing path. If all match → content_hashes_pass = True. Records engines_inspected_count regardless.
    5. Check 4 — calibration_present: session.file_exists(companion_cache_root / calibration_filename).
    6. Compute outcome: ready iff all four booleans are True; not_ready otherwise.
    7. Emit log: INFO kind="c12.companion.ready" with the four flags + outcome on success; WARN kind="c12.companion.degraded" if any check failed without raising (i.e. outcome=not_ready due to a missing artifact, not a hash mismatch).
    8. Return the ReadinessReport.
  • Composition-root factory at src/gps_denied_onboard/runtime_root/c12_factory.py extends T1's OperatorOrchestratorServices dataclass with a companion_bringup: CompanionBringup field. The factory build_companion_bringup(config) -> CompanionBringup constructs the paramiko-backed session factory + remote sidecar verifier + logger.
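The method flow above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it covers steps 2 to 8 only (the session is assumed to be already open), omits logging, and uses a hypothetical in-memory `FakeSession`. The function name `run_readiness_checks`, the `hash_matches` callable, and the reason strings `manifest_missing` / `calibration_missing` are assumptions made for the example; only the `engines_missing: ...` and empty-list reason texts come from the flow description.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import PurePosixPath
from typing import Callable


class Outcome(Enum):
    READY = "ready"
    NOT_READY = "not_ready"


@dataclass(frozen=True)
class ReadinessReport:
    manifest_present: bool
    content_hashes_pass: bool
    engines_present: bool
    calibration_present: bool
    outcome: Outcome
    not_ready_reasons: tuple[str, ...]
    engines_inspected_count: int


class ContentHashMismatchError(Exception):
    def __init__(self, engine_path: str) -> None:
        super().__init__(f"sidecar mismatch: {engine_path}")
        self.engine_path = engine_path


class FakeSession:
    """Hypothetical in-memory stand-in for SshSession; `files` holds remote paths."""

    def __init__(self, files: set[str]) -> None:
        self.files = files
        self.close_calls = 0

    def file_exists(self, remote_path: PurePosixPath) -> bool:
        return str(remote_path) in self.files

    def list_dir(self, remote_path: PurePosixPath) -> list[str]:
        prefix = f"{remote_path}/"
        return [f[len(prefix):] for f in self.files if f.startswith(prefix)]

    def close(self) -> None:
        self.close_calls += 1


def run_readiness_checks(
    session: FakeSession,
    cache_root: PurePosixPath,
    expected_engines: tuple[str, ...],
    hash_matches: Callable[[PurePosixPath], bool],
) -> ReadinessReport:
    """Steps 2-8 of the flow; the session is closed on every code path."""
    reasons: list[str] = []
    try:
        manifest_present = session.file_exists(cache_root / "Manifest.json")
        if not manifest_present:
            reasons.append("manifest_missing")  # reason text is an assumption

        listed = set(session.list_dir(cache_root / "engines"))
        expected = set(expected_engines)
        if not expected:
            engines_present = False
            reasons.append("expected_engines list empty in caller-supplied config")
        else:
            engines_present = expected <= listed
            if not engines_present:
                missing = ", ".join(sorted(expected - listed))
                reasons.append(f"engines_missing: {missing}")

        # Only engines that are both expected and present get hash-checked.
        to_inspect = sorted(expected & listed)
        for name in to_inspect:
            path = cache_root / "engines" / name
            if not hash_matches(path):
                raise ContentHashMismatchError(str(path))
        content_hashes_pass = True

        calibration_present = session.file_exists(cache_root / "camera_calibration.json")
        if not calibration_present:
            reasons.append("calibration_missing")  # reason text is an assumption

        ready = manifest_present and engines_present and content_hashes_pass and calibration_present
        return ReadinessReport(
            manifest_present, content_hashes_pass, engines_present, calibration_present,
            Outcome.READY if ready else Outcome.NOT_READY, tuple(reasons), len(to_inspect),
        )
    finally:
        session.close()
```

Passing the hash check in as a callable keeps the sketch independent of how the sidecar verifier is wired; in the design above that role belongs to RemoteSidecarVerifier.verify.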

Scope

Included

  • CompanionBringup class with the single public method.
  • The 2 DTOs (CompanionAddress, ReadinessReport) plus the outcome and reason enum types.
  • The 2 error types (CompanionUnreachableError, ContentHashMismatchError) with remediation attributes.
  • SshSessionFactory + SshSession Protocols.
  • ParamikoSshSessionFactory + ParamikoSshSession concrete implementations.
  • RemoteSidecarVerifier helper.
  • Composition-root factory.
  • Config schema extension on AZ-269's loader (config.c12.companion_* block).
  • verify-ready subcommand wiring delegated to T1's CLI shell — this task ships the service class; T1's cli.py resolves it from the composition root.
  • Conformance unit tests using a fake SshSessionFactory (no paramiko in unit tests) covering all 10 acceptance criteria.
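As an illustration of the config block listed above, the frozen C12CompanionConfig could be populated from the config.c12.companion_* keys along these lines. The loader function `companion_config_from_mapping` and the plain-mapping input are assumptions for the example; how AZ-269's loader actually exposes the values is not specified here.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import Path, PurePosixPath
from typing import Any, Mapping


class HostKeyPolicy(Enum):
    STRICT = "strict"
    KNOWN_HOSTS = "known_hosts"
    REJECT_NEW = "reject_new"


@dataclass(frozen=True)
class C12CompanionConfig:
    ssh_user: str
    ssh_keyfile: Path
    host_key_policy: HostKeyPolicy
    connect_timeout_s: float = 10.0
    companion_cache_root: PurePosixPath = PurePosixPath("/var/lib/azaion/c10/cache")
    manifest_filename: str = "Manifest.json"
    calibration_filename: str = "camera_calibration.json"
    expected_engines: tuple[str, ...] = ()


def companion_config_from_mapping(
    c12: Mapping[str, Any], expected_engines: tuple[str, ...] = ()
) -> C12CompanionConfig:
    """Hypothetical adapter from a raw `config.c12` mapping to the dataclass."""
    # Enum lookup raises ValueError for unsupported values such as "auto_add_unknown".
    policy = HostKeyPolicy(c12["companion_host_key_policy"])
    timeout = float(c12.get("companion_connect_timeout_s", 10.0))
    if timeout <= 0:
        raise ValueError("companion_connect_timeout_s must be positive")
    return C12CompanionConfig(
        ssh_user=c12["companion_ssh_user"],
        ssh_keyfile=Path(c12["companion_ssh_keyfile"]),
        host_key_policy=policy,
        connect_timeout_s=timeout,
        companion_cache_root=PurePosixPath(
            c12.get("companion_cache_root", "/var/lib/azaion/c10/cache")
        ),
        expected_engines=tuple(expected_engines),
    )
```

Modelling the policy as an enum means an unsupported value fails at load time rather than at the first SSH connection.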

Excluded

  • The build_cache orchestration that consumes verify_companion_ready (sibling T3).
  • The actual SSH-invocation of C10 on the companion (sibling T3).
  • The takeoff-time content-hash verification on the airborne side (AZ-324).
  • Engine compilation (AZ-321), descriptor generation (AZ-322), Manifest writing (AZ-323) — all C10 owns these and they ran prior to this task being invoked.
  • A SOCKS proxy or jump-host SSH path — direct SSH only this cycle.
  • Telemetry exfiltration of operator workstation key material — host key + private key never appear in log output (only fingerprint hash if at all).

Acceptance Criteria

AC-1: All four artifacts present + sidecars verify → outcome=ready
Given the companion's SSH is reachable, Manifest.json exists, all expected_engines exist, all sidecars verify, and the calibration file exists
When verify_companion_ready(address) is called
Then ReadinessReport(manifest_present=True, content_hashes_pass=True, engines_present=True, calibration_present=True, outcome=ready, not_ready_reasons=()) is returned; ONE INFO log kind="c12.companion.ready" is emitted

AC-2: Missing engine → outcome=not_ready
Given expected_engines=("dinov2_vpr_sm87_jp62_trt103_fp16.engine", "lightglue_sm87_jp62_trt103_fp16.engine") and only the first exists on the companion
When verify_companion_ready(address) is called
Then engines_present=False; not_ready_reasons contains "engines_missing: lightglue_sm87_jp62_trt103_fp16.engine"; outcome=not_ready; ONE WARN log kind="c12.companion.degraded"; NO ContentHashMismatchError is raised

AC-3: Sidecar mismatch → ContentHashMismatchError
Given an engine file is present but its sidecar's hex digest does not match the engine's actual SHA-256
When verify_companion_ready(address) is called
Then ContentHashMismatchError is raised with engine_path, expected_sha256_hex, actual_sha256_hex populated; the SSH session is closed (session.close() is called in finally); ONE ERROR log kind="c12.companion.hash.mismatch" is emitted

AC-4: SSH connection refused → CompanionUnreachableError(reason=connect_refused)
Given the companion address is unreachable (TCP RST or no listener)
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=connect_refused, underlying_exception_repr="...") is raised; the underlying paramiko/socket exception's repr is captured; ONE ERROR log kind="c12.companion.unreachable"; remediation attribute returns "Check companion power, USB/Ethernet cable, and config.c12.companion_address."

AC-5: SSH auth failure → CompanionUnreachableError(reason=auth_failed)
Given the companion is reachable but the SSH key is wrong or revoked
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=auth_failed, ...) is raised; ERROR log kind="c12.companion.unreachable" with reason="auth_failed"; remediation attribute returns "Verify config.c12.companion_ssh_keyfile matches the public key in ~/.ssh/authorized_keys on the companion."

AC-6: Host key mismatch with host_key_policy=strict → CompanionUnreachableError(reason=host_key_mismatch)
Given the companion's host key has changed and config.c12.companion_host_key_policy = strict
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=host_key_mismatch, ...) is raised; ERROR log; remediation returns "Inspect ~/.ssh/known_hosts; if the companion was reflashed, remove its old entry; otherwise treat as a security incident."

AC-7: SSH session is always closed
Given any of the four checks raises an unexpected exception (e.g. SFTP returns OSError)
When verify_companion_ready(address) is called
Then the exception propagates to the caller; session.close() was called exactly once before propagation (verifiable via spy on the fake SshSession); no socket descriptor leaks

AC-8: Connect timeout → CompanionUnreachableError(reason=timeout)
Given the companion address routes but never responds to TCP SYN within config.c12.companion_connect_timeout_s
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=timeout, ...) is raised within connect_timeout_s + 1.0 s (allowing test jitter); ERROR log includes the configured timeout value

AC-9: engines_inspected_count reflects what was actually checked
Given a mix of present + missing engines (2 of 3 expected exist)
When verify_companion_ready(address) is called
Then engines_inspected_count == 2; the missing engine appears in not_ready_reasons but does NOT trigger a sidecar verify call (verifiable via spy)

AC-10: host_key_policy=reject_new blocks first connection to a previously unseen host
Given config.c12.companion_host_key_policy = reject_new and the companion is not in ~/.ssh/known_hosts
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=host_key_mismatch, ...) is raised; ERROR log; remediation returns "Add the companion to ~/.ssh/known_hosts first via a manual ssh-keyscan, then retry."
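The session-open failures in AC-4, AC-5, AC-6 and AC-8 all reduce to mapping an underlying exception to a reason value. A minimal sketch of that mapping follows. To stay runnable without paramiko installed, it recognises paramiko.AuthenticationException and paramiko.BadHostKeyException by class name; that shortcut, the helper names `classify` and `to_unreachable_error`, and the `REMEDIATION` table are assumptions for the example (a real implementation would more likely use isinstance against the paramiko classes). The two remediation strings are the ones quoted in AC-4 and AC-5; the host-key hints are left out because they differ by policy (AC-6 vs AC-10).

```python
from __future__ import annotations

import socket
from enum import Enum


class UnreachableReason(Enum):
    CONNECT_REFUSED = "connect_refused"
    AUTH_FAILED = "auth_failed"
    HOST_KEY_MISMATCH = "host_key_mismatch"
    TIMEOUT = "timeout"
    OTHER = "other"


REMEDIATION = {
    UnreachableReason.CONNECT_REFUSED: (
        "Check companion power, USB/Ethernet cable, and config.c12.companion_address."
    ),
    UnreachableReason.AUTH_FAILED: (
        "Verify config.c12.companion_ssh_keyfile matches the public key in "
        "~/.ssh/authorized_keys on the companion."
    ),
}


class CompanionUnreachableError(Exception):
    def __init__(self, host: str, port: int, reason: UnreachableReason,
                 underlying_exception_repr: str) -> None:
        super().__init__(f"{host}:{port} unreachable ({reason.value})")
        self.host = host
        self.port = port
        self.reason = reason
        self.underlying_exception_repr = underlying_exception_repr

    @property
    def remediation(self) -> str:
        # Empty string where this sketch has no hint; the task defines more texts.
        return REMEDIATION.get(self.reason, "")


def classify(exc: BaseException) -> UnreachableReason:
    """Map a session-open exception to a reason value."""
    if isinstance(exc, ConnectionRefusedError):
        return UnreachableReason.CONNECT_REFUSED
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return UnreachableReason.TIMEOUT
    # Assumption: match paramiko exception types by name to avoid importing paramiko.
    names = {cls.__name__ for cls in type(exc).__mro__}
    if "AuthenticationException" in names:
        return UnreachableReason.AUTH_FAILED
    if "BadHostKeyException" in names:
        return UnreachableReason.HOST_KEY_MISMATCH
    return UnreachableReason.OTHER


def to_unreachable_error(host: str, port: int, exc: BaseException) -> CompanionUnreachableError:
    return CompanionUnreachableError(host, port, classify(exc), repr(exc))
```

A caller would wrap ssh_factory.open(...) in a try/except and raise `to_unreachable_error(address.host, address.port, exc) from exc`, which keeps the original traceback attached.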

Non-Functional Requirements

Performance

  • A successful verify_companion_ready call against a local-network companion (≤ 1 ms RTT) with 5 engines completes in ≤ 5 s wall-clock (dominated by 5 × sha256sum over engines totaling ~1 GB on the companion's NVMe).
  • Connection-open phase ≤ 2 s p99 in normal conditions; the connect_timeout_s config caps the worst case at the configured value.

Compatibility

  • paramiko per the project pin; no version override.
  • Host-key policies map to paramiko's MissingHostKeyPolicy subclasses; if paramiko changes the API in a future minor version, this task's policy mapping is the only place to update.

Reliability

  • The session is closed in finally on every code path (covered by AC-7).
  • sha256sum invocation has a per-engine timeout (default 60 s, config-overrideable) so a hung companion does not hold the operator's CLI indefinitely.
  • The four checks are sequential, not parallel, to keep the SSH session simple and ordering deterministic for log correlation.

Unit Tests

| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Fake SshSessionFactory returning a fake session where all four checks succeed | ReadinessReport(outcome=ready) + INFO log |
| AC-2 | Fake session with one missing engine | outcome=not_ready, not_ready_reasons lists the missing engine, no hash check on the missing one |
| AC-3 | Fake session where sidecar verifier returns matches=False | ContentHashMismatchError with populated attributes, session closed, ERROR log |
| AC-4 | SshSessionFactory.open raises ConnectionRefusedError | CompanionUnreachableError(reason=connect_refused), ERROR log |
| AC-5 | SshSessionFactory.open raises paramiko.AuthenticationException | CompanionUnreachableError(reason=auth_failed), ERROR log |
| AC-6 | SshSessionFactory.open raises paramiko.BadHostKeyException with policy=strict | CompanionUnreachableError(reason=host_key_mismatch), ERROR log |
| AC-7 | Fake session whose file_exists raises OSError mid-flow | OSError propagates; session.close() called exactly once |
| AC-8 | SshSessionFactory.open raises socket.timeout after connect_timeout_s | CompanionUnreachableError(reason=timeout), log includes timeout value |
| AC-9 | Fake session with mixed-presence engines, sidecar-verifier spy | engines_inspected_count == count_of_present_expected, sidecar verifier not called for missing engines |
| AC-10 | host_key_policy=reject_new + unknown host | CompanionUnreachableError(reason=host_key_mismatch) with reject_new-specific remediation text |
| NFR-perf-cold-call | Microbench against in-process fake session × 100 | p99 ≤ 50 ms for the orchestration overhead (excludes real SSH) |
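As an example of the spy-based style these tests rely on, the AC-7 row could look roughly like this. `SpySession` and `run_checks` are hypothetical: `run_checks` stands in for the body of verify_companion_ready and keeps only the try/finally shape under test.

```python
class SpySession:
    """Hypothetical fake SshSession that fails mid-flow and counts close() calls."""

    def __init__(self) -> None:
        self.close_calls = 0

    def file_exists(self, remote_path: str) -> bool:
        raise OSError("simulated SFTP failure")

    def close(self) -> None:
        self.close_calls += 1


def run_checks(session: SpySession) -> bool:
    """Stand-in for the checks inside verify_companion_ready."""
    try:
        return session.file_exists("/var/lib/azaion/c10/cache/Manifest.json")
    finally:
        # Runs whether the check returned or raised.
        session.close()


def test_session_closed_exactly_once_on_unexpected_error() -> None:
    spy = SpySession()
    try:
        run_checks(spy)
    except OSError:
        propagated = True
    else:
        propagated = False
    assert propagated, "OSError must reach the caller"
    assert spy.close_calls == 1
```

The test is written with plain try/except so it runs without pytest; under pytest the same check is usually expressed with `pytest.raises(OSError)`.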

Constraints

  • paramiko is the only allowed SSH library — no subprocess.run("ssh ...") shell-out (security: shell injection surface; reliability: no parsed output).
  • SshSessionFactory is a Protocol, NOT a class — the concrete ParamikoSshSessionFactory is one implementation, allowing tests to inject fakes without monkey-patching paramiko.
  • The RemoteSidecarVerifier does NOT pull the engine bytes back to the operator workstation — it runs sha256sum on the companion and parses the output. This avoids a multi-GB transfer per readiness check.
  • The error families (CompanionUnreachableError, ContentHashMismatchError) are the canonical types; sibling tasks (T3 build_cache) MUST consume these and not redefine them.
  • The host-key policy auto_add_unknown is intentionally NOT a supported value — silently accepting new host keys defeats the security model. The supported set is strict | known_hosts | reject_new; known_hosts requires the entry to already exist; reject_new is functionally identical to strict but with a clearer error message.
  • This task does NOT cache SSH sessions: every verify_companion_ready call opens and closes a fresh session. Caching would complicate the failure model for marginal performance gain (the bottleneck is the per-engine sha256sum runs, not session establishment).
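The remote-hash constraint above can be illustrated with a small sketch of the verifier: only two short text outputs cross the SSH link, never the engine bytes. Several details are assumptions for the example: that RemoteCommandResult exposes a `stdout` string, that the sidecar file begins with the 64-character hex digest (the authoritative format lives in sha256_sidecar.md), and the `ScriptedSession` fake. Quoting the path with `shlex.quote` is a defensive choice of this sketch, not something the task text mandates.

```python
from __future__ import annotations

import re
import shlex
from dataclasses import dataclass
from pathlib import PurePosixPath
from types import SimpleNamespace

# 64 hex characters at the start of the text, not followed by more hex.
_DIGEST = re.compile(r"\s*([0-9a-fA-F]{64})(?![0-9a-fA-F])")


@dataclass(frozen=True)
class RemoteSidecarResult:
    matches: bool
    expected_hex: str
    actual_hex: str


def parse_digest(text: str) -> str:
    match = _DIGEST.match(text)
    if match is None:
        raise ValueError("no 64-character hex digest at start of output")
    return match.group(1).lower()  # lower-casing makes the comparison case-insensitive


def verify(session, engine_path: PurePosixPath, *, timeout_s: float = 60.0) -> RemoteSidecarResult:
    engine = shlex.quote(str(engine_path))
    sidecar = shlex.quote(f"{engine_path}.sha256")
    actual = parse_digest(session.run(f"sha256sum {engine}", timeout_s=timeout_s).stdout)
    expected = parse_digest(session.run(f"cat {sidecar}", timeout_s=timeout_s).stdout)
    return RemoteSidecarResult(actual == expected, expected, actual)


class ScriptedSession:
    """Hypothetical fake: maps exact command strings to canned stdout."""

    def __init__(self, outputs: dict[str, str]) -> None:
        self.outputs = outputs
        self.commands: list[str] = []

    def run(self, command: str, *, timeout_s: float):
        self.commands.append(command)
        return SimpleNamespace(stdout=self.outputs[command])
```

A usage example with the fake: script `sha256sum /cache/engines/a.engine` to return `"<digest>  /cache/engines/a.engine\n"` and `cat /cache/engines/a.engine.sha256` to return the same digest in upper case; `verify` then reports `matches=True`.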

Risks & Mitigation

Risk 1: paramiko version drift breaks the host-key-policy mapping

  • Risk: A future paramiko minor release renames or removes MissingHostKeyPolicy subclasses; this task's mapping breaks silently in tests that don't exercise paramiko itself.
  • Mitigation: A single integration test (marked @pytest.mark.requires_paramiko) constructs ParamikoSshSessionFactory with each policy value and asserts the resulting paramiko policy class name. Catches version drift on dependency upgrades.
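One possible shape for that drift check is sketched below. The mapping table copies the policy-to-class pairing stated in the Outcome section; `resolve_policy_class` and `policy_class_names` are hypothetical helpers. The SSH module is passed in as a parameter so the sketch runs without paramiko; the real integration test would pass the imported `paramiko` module, whose top level exports `RejectPolicy` and `AutoAddPolicy`.

```python
from __future__ import annotations

# Policy value -> name of the paramiko MissingHostKeyPolicy subclass,
# as documented for ParamikoSshSessionFactory.
POLICY_CLASS_NAMES = {
    "strict": "RejectPolicy",
    "known_hosts": "AutoAddPolicy",
    "reject_new": "RejectPolicy",
}


def resolve_policy_class(policy: str, ssh_module):
    """Look the class up by name; AttributeError signals a renamed or removed class."""
    if policy not in POLICY_CLASS_NAMES:
        raise ValueError(f"unsupported host_key_policy: {policy!r}")
    return getattr(ssh_module, POLICY_CLASS_NAMES[policy])


def policy_class_names(ssh_module) -> dict[str, str]:
    """Resolve every supported policy and report the class names actually found."""
    return {
        policy: resolve_policy_class(policy, ssh_module).__name__
        for policy in POLICY_CLASS_NAMES
    }
```

In the integration test, `assert policy_class_names(paramiko) == POLICY_CLASS_NAMES` would fail loudly on a dependency upgrade that renames either class.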

Risk 2: sha256sum is missing or behaves differently on the companion image

  • Risk: The companion is JetPack-based; if it ships without coreutils's sha256sum, this task's verifier breaks at runtime.
  • Mitigation: A composition-root health check at startup runs sha256sum --version over the SSH session and surfaces a clear CompanionUnreachableError(reason=other, underlying_exception_repr="sha256sum not found") if absent. JetPack base images include coreutils per ADR-005.
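A rough sketch of such a health check, assuming RemoteCommandResult carries an `exit_code` field (an assumption; the result type's fields are not spelled out in this task) and using a trimmed-down error class and a hypothetical `OneCommandSession` fake:

```python
from types import SimpleNamespace


class CompanionUnreachableError(Exception):
    """Trimmed-down version carrying only the fields this sketch needs."""

    def __init__(self, reason: str, underlying_exception_repr: str) -> None:
        super().__init__(f"{reason}: {underlying_exception_repr}")
        self.reason = reason
        self.underlying_exception_repr = underlying_exception_repr


def ensure_sha256sum_available(session, *, timeout_s: float = 10.0) -> None:
    """Fail early if the companion image lacks sha256sum."""
    result = session.run("sha256sum --version", timeout_s=timeout_s)
    # Assumption: a non-zero exit code (e.g. 127, command not found) means "absent".
    if result.exit_code != 0:
        raise CompanionUnreachableError("other", "sha256sum not found")


class OneCommandSession:
    """Hypothetical fake that returns a fixed exit code for any command."""

    def __init__(self, exit_code: int) -> None:
        self.exit_code = exit_code

    def run(self, command: str, *, timeout_s: float):
        return SimpleNamespace(exit_code=self.exit_code, stdout="")
```

With `OneCommandSession(0)` the check returns silently; with `OneCommandSession(127)` it raises the error with reason "other".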

Risk 3: Operator's ~/.ssh/known_hosts has stale entries from prior bench runs

  • Risk: A reflashed companion exhibits AC-10 / AC-6 failures legitimately, but operators see the cryptic paramiko traceback if remediation hints are unclear.
  • Mitigation: AC-6 / AC-10 require the remediation attribute on CompanionUnreachableError to mention ~/.ssh/known_hosts explicitly. The CLI subcommand verify-ready (in T1) prints the remediation hint to stderr.

Risk 4: Long-running sha256sum hangs the operator's CLI

  • Risk: A degraded companion NVMe causes sha256sum on a 200 MB engine to take minutes; the operator sees a hung command.
  • Mitigation: RemoteSidecarVerifier enforces a per-engine timeout (default 60 s, config-overrideable). On timeout, the verifier raises ContentHashMismatchError(actual_sha256_hex="<timeout>") so the operator sees a clear failure and can investigate the disk.
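The timeout translation could be sketched as follows. It assumes `session.run` signals an expired `timeout_s` by raising `TimeoutError`, which this task does not specify; `remote_sha256` and `HangingSession` are hypothetical names for the example.

```python
import shlex


class ContentHashMismatchError(Exception):
    def __init__(self, engine_path: str, expected_sha256_hex: str,
                 actual_sha256_hex: str) -> None:
        super().__init__(f"hash mismatch for {engine_path}")
        self.engine_path = engine_path
        self.expected_sha256_hex = expected_sha256_hex
        self.actual_sha256_hex = actual_sha256_hex


def remote_sha256(session, engine_path: str, expected_hex: str,
                  *, timeout_s: float = 60.0) -> str:
    """Run sha256sum remotely; surface a timeout as a hash-mismatch error."""
    try:
        result = session.run(f"sha256sum {shlex.quote(engine_path)}", timeout_s=timeout_s)
    except TimeoutError as exc:
        # Placeholder digest "<timeout>" as described in the mitigation.
        raise ContentHashMismatchError(engine_path, expected_hex, "<timeout>") from exc
    # sha256sum prints "<hex>  <path>"; the digest is the first whitespace-separated token.
    return result.stdout.split()[0].lower()


class HangingSession:
    """Hypothetical fake whose run() always times out."""

    def run(self, command: str, *, timeout_s: float):
        raise TimeoutError(f"no output within {timeout_s} s")
```

Chaining with `from exc` keeps the original timeout visible in the traceback, which helps when the operator investigates the disk.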

Runtime Completeness

  • Named capability: pre-flight companion artifact verification per AC-NEW-1 + description.md § 2 verify_companion_ready.
  • Production code that must exist: real CompanionBringup orchestrating real ParamikoSshSessionFactory + real RemoteSidecarVerifier (with real sha256sum over SSH); real config-driven SSH credentials + host-key policy + cache root.
  • Allowed external stubs: tests MAY use a fake SshSessionFactory returning a fake SshSession whose run, file_exists, list_dir are scripted; production wiring uses paramiko + the real companion.
  • Unacceptable substitutes: shelling out to ssh ... via subprocess.run (security + reliability); reading sidecars by pulling engine bytes back to the workstation (multi-GB per readiness check); auto_add_unknown host-key policy (security defeat); a "skip-verify" config flag (defeats AC-NEW-1).