# C12 Companion Bringup — SSH `verify_companion_ready` + `ReadinessReport`

**Task**: AZ-327_c12_companion_bringup
**Name**: C12 Companion Bringup
**Description**: Implement `CompanionBringup`, the C12-internal helper that opens an SSH session against the companion (paramiko per project pin), inspects the companion-side filesystem for the four required pre-flight artifacts (Manifest.json, .engine files + AZ-280 sidecars, calibration JSON), runs sidecar verification on the engines via a remote `sha256sum` over the engine path (compared against the sidecar's hex digest), and returns a `ReadinessReport` per description.md § 2 (`manifest_present`, `content_hashes_pass`, `engines_present`, `calibration_present`, `outcome ∈ {ready, not_ready}`, `not_ready_reasons: list[str]`). Owns the two error families: `CompanionUnreachableError` (SSH session-open failure: TCP refused, auth failed, host key mismatch, socket timeout) and `ContentHashMismatchError` (sidecar verification fails on at least one engine — distinct from "engine missing", which is a not-ready signal not an exception). Public surface is one method `verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport`. SSH user, key file, host-key policy, connect-timeout, and the canonical companion-side cache root come from config (`config.c12.companion_ssh_user`, `config.c12.companion_ssh_keyfile`, `config.c12.companion_host_key_policy`, `config.c12.companion_connect_timeout_s`, `config.c12.companion_cache_root`) per AZ-269. The session is opened in a `try/finally` block; the connection is always closed even if the four checks raise. INFO log on every successful call (with the four boolean flags + outcome); WARN on degraded readiness (any 3-of-4); ERROR on the two error families.
**Complexity**: 3 points
**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
**Component**: c12_operator_tooling (epic AZ-253 / E-C12)
**Tracker**: AZ-327
**Epic**: AZ-253 (E-C12)

### Document Dependencies

- `_docs/02_document/components/13_c12_operator_tooling/description.md` — § 2 (`verify_companion_ready` interface + `ReadinessReport` DTO shape), § 5 (`CompanionUnreachableError`, `ContentHashMismatchError`), § 7 (filesystem lockfile note — relevant for orchestrator T3 not this task).
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar file format (this task verifies remotely; does not import the helper but reuses the schema).
- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — engine filename layout used to enumerate the expected engines list.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN/ERROR log shapes.

## Problem

Without a real `CompanionBringup`:

- `build_cache` (sibling T3) cannot run safely — the orchestrator would invoke C10 on the companion without any pre-flight visibility into the companion's state. A half-provisioned companion would either silently miscompile (manifest stale) or corrupt the cache.
- The `verify-ready` CLI subcommand has no implementation — operators cannot diagnose "is my companion in a usable state?" without SSHing in manually.
- Pre-flight content-hash verification per AC-NEW-1's takeoff gate (AZ-324 covers the airborne side) has no operator-side counterpart — sidecar mismatches that occur during the SSH transfer would only surface at takeoff, too late.
- `CompanionUnreachableError` and `ContentHashMismatchError` exist as concept-only types in description.md § 5 with no producer.
- Configuration knobs for SSH credentials, host-key policy, and the canonical cache root have no consumer; AZ-269's loader cannot validate them against a concrete usage.

This task delivers the bring-up + verification layer.
It does NOT orchestrate the `build_cache` flow (sibling T3 does), does NOT invoke C10 (T3 does via SSH after this task confirms readiness), and does NOT perform the takeoff-time content-hash verification (AZ-324 owns the airborne side).

## Outcome

- A `CompanionBringup` class at `src/operator_tool/companion_bringup.py`:
  - Constructor: `__init__(self, *, ssh_factory: SshSessionFactory, sidecar_verifier: RemoteSidecarVerifier, logger: Logger, config: C12CompanionConfig)`.
  - `C12CompanionConfig` (`@dataclass(frozen=True)`): `ssh_user: str`, `ssh_keyfile: Path`, `host_key_policy: enum {strict, known_hosts, reject_new}`, `connect_timeout_s: float = 10.0`, `companion_cache_root: PurePosixPath = PurePosixPath("/var/lib/azaion/c10/cache")`, `manifest_filename: str = "Manifest.json"`, `calibration_filename: str = "camera_calibration.json"`, `expected_engines: tuple[str, ...] = ()` (the orchestrator passes the list per the request; default empty fails AC-2 cleanly).
  - Public method: `verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport`.
- DTOs at `src/operator_tool/_types.py`:
  - `CompanionAddress` (`@dataclass(frozen=True)`): `host: str`, `port: int = 22`.
  - `ReadinessReport` (`@dataclass(frozen=True)`): `manifest_present: bool`, `content_hashes_pass: bool`, `engines_present: bool`, `calibration_present: bool`, `outcome: enum {ready, not_ready}`, `not_ready_reasons: tuple[str, ...]`, `companion_cache_root: str`, `engines_inspected_count: int`.
- Errors at `src/operator_tool/errors.py`:
  - `CompanionUnreachableError(Exception)`: attributes `host: str`, `port: int`, `reason: enum {connect_refused, auth_failed, host_key_mismatch, timeout, other}`, `underlying_exception_repr: str`. `remediation` attribute returns a one-line operator-friendly hint per `reason`.
  - `ContentHashMismatchError(Exception)`: attributes `engine_path: str`, `expected_sha256_hex: str`, `actual_sha256_hex: str`.
    `remediation` attribute returns "Re-run the cache build (`operator-tool build-cache --area ...`) to repopulate the affected engine.".
- A `SshSessionFactory` Protocol at `src/operator_tool/ssh_session.py`:

  ```python
  from pathlib import PurePosixPath
  from typing import Protocol, runtime_checkable


  @runtime_checkable
  class SshSession(Protocol):
      def run(self, command: str, *, timeout_s: float) -> RemoteCommandResult: ...
      def file_exists(self, remote_path: PurePosixPath) -> bool: ...
      def list_dir(self, remote_path: PurePosixPath) -> list[str]: ...
      def close(self) -> None: ...


  @runtime_checkable
  class SshSessionFactory(Protocol):
      def open(self, address: CompanionAddress, *, timeout_s: float) -> SshSession: ...
  ```

  Concrete implementation `ParamikoSshSessionFactory` wraps `paramiko.SSHClient` with the documented host-key policy mapping (`strict` → `RejectPolicy`; `known_hosts` → `AutoAddPolicy` gated on `~/.ssh/known_hosts` presence; `reject_new` → `RejectPolicy` with explicit allowlist).
- A `RemoteSidecarVerifier` helper at `src/operator_tool/remote_sidecar_verifier.py`:
  - `verify(session: SshSession, engine_path: PurePosixPath) -> RemoteSidecarResult` — runs `sha256sum` over the engine path via the SSH session, parses the first 64 hex chars, reads the sidecar file at `.sha256` via `session.run("cat ...")`, parses its 64 hex chars, compares case-insensitively. Returns `RemoteSidecarResult(matches: bool, expected_hex: str, actual_hex: str)`.
- Method flow for `verify_companion_ready`:
  1. Open SSH session via `ssh_factory.open(companion_address, timeout_s=config.connect_timeout_s)`. On any paramiko/socket exception → catch and raise `CompanionUnreachableError` mapping the underlying type to a `reason` enum value. Always wrap subsequent steps in `try/finally` that closes the session.
  2. Check 1 — `manifest_present`: `session.file_exists(companion_cache_root / manifest_filename)`.
  3. Check 2 — `engines_present`: `session.list_dir(companion_cache_root / "engines")` → set of filenames; compare against `config.expected_engines`.
     If `config.expected_engines` is empty → `engines_present = False`, `not_ready_reasons += ["expected_engines list empty in caller-supplied config"]`. Else `engines_present = set(expected_engines).issubset(listed_engines)`; if not, append `"engines_missing: "`.
  4. Check 3 — `content_hashes_pass`: for each engine in the intersection of `expected_engines` and `listed_engines`, call `sidecar_verifier.verify(session, companion_cache_root / "engines" / engine)`. If ANY result has `matches == False` → raise `ContentHashMismatchError` with the first failing path. If all match → `content_hashes_pass = True`. Record `engines_inspected_count` regardless.
  5. Check 4 — `calibration_present`: `session.file_exists(companion_cache_root / calibration_filename)`.
  6. Compute `outcome`: `ready` iff all four booleans are `True`; `not_ready` otherwise.
  7. Emit log: INFO `kind="c12.companion.ready"` with the four flags + outcome on success; WARN `kind="c12.companion.degraded"` if any check failed without raising (i.e. `outcome=not_ready` due to a missing artifact, not a hash mismatch).
  8. Return the `ReadinessReport`.
- Composition-root factory at `src/gps_denied_onboard/runtime_root/c12_factory.py` extends T1's `OperatorToolServices` dataclass with a `companion_bringup: CompanionBringup` field. The factory `build_companion_bringup(config) -> CompanionBringup` constructs the paramiko-backed session factory + remote sidecar verifier + logger.

## Scope

### Included

- `CompanionBringup` class with the single public method.
- The 2 DTOs (`CompanionAddress`, `ReadinessReport`) plus the `outcome` and `reason` enum types.
- The 2 error types (`CompanionUnreachableError`, `ContentHashMismatchError`) with `remediation` attributes.
- `SshSessionFactory` + `SshSession` Protocols.
- `ParamikoSshSessionFactory` + `ParamikoSshSession` concrete implementations.
- `RemoteSidecarVerifier` helper.
- Composition-root factory.
- Config schema extension on AZ-269's loader (`config.c12.companion_*` block).
- `verify-ready` subcommand wiring delegated to T1's CLI shell — this task ships the service class; T1's `cli.py` resolves it from the composition root.
- Conformance unit tests using a fake `SshSessionFactory` (no paramiko in unit tests) covering all 10 acceptance criteria.

### Excluded

- The `build_cache` orchestration that consumes `verify_companion_ready` (sibling T3).
- The actual SSH-invocation of C10 on the companion (sibling T3).
- The takeoff-time content-hash verification on the airborne side (AZ-324).
- Engine compilation (AZ-321), descriptor generation (AZ-322), Manifest writing (AZ-323) — C10 owns all of these, and they ran before this task is invoked.
- A SOCKS proxy or jump-host SSH path — direct SSH only this cycle.
- Telemetry exfiltration of operator workstation key material — host key + private key never appear in log output (only fingerprint hash if at all).

## Acceptance Criteria

**AC-1: All four artifacts present + sidecars verify → `outcome=ready`**
Given the companion's SSH is reachable, `Manifest.json` exists, all `expected_engines` exist, all sidecars verify, and the calibration file exists
When `verify_companion_ready(address)` is called
Then `ReadinessReport(manifest_present=True, content_hashes_pass=True, engines_present=True, calibration_present=True, outcome=ready, not_ready_reasons=())` is returned; ONE INFO log `kind="c12.companion.ready"` is emitted

**AC-2: Missing engine → `outcome=not_ready`**
Given `expected_engines=("dinov2_vpr_sm87_jp62_trt103_fp16.engine", "lightglue_sm87_jp62_trt103_fp16.engine")` and only the first exists on the companion
When `verify_companion_ready(address)` is called
Then `engines_present=False`; `not_ready_reasons` contains `"engines_missing: lightglue_sm87_jp62_trt103_fp16.engine"`; `outcome=not_ready`; ONE WARN log `kind="c12.companion.degraded"`; NO `ContentHashMismatchError` is raised

**AC-3: Sidecar mismatch → `ContentHashMismatchError`**
Given an engine file is present but its sidecar's hex digest
does not match the engine's actual SHA-256
When `verify_companion_ready(address)` is called
Then `ContentHashMismatchError` is raised with `engine_path`, `expected_sha256_hex`, `actual_sha256_hex` populated; the SSH session is closed (`session.close()` is called in `finally`); ONE ERROR log `kind="c12.companion.hash.mismatch"` is emitted

**AC-4: SSH connection refused → `CompanionUnreachableError(reason=connect_refused)`**
Given the companion address is unreachable (TCP RST or no listener)
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=connect_refused, underlying_exception_repr="...")` is raised; the underlying paramiko/socket exception's repr is captured; ONE ERROR log `kind="c12.companion.unreachable"`; `remediation` attribute returns "Check companion power, USB/Ethernet cable, and `config.c12.companion_address`."

**AC-5: SSH auth failure → `CompanionUnreachableError(reason=auth_failed)`**
Given the companion is reachable but the SSH key is wrong or revoked
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=auth_failed, ...)` is raised; ERROR log `kind="c12.companion.unreachable"` with `reason="auth_failed"`; `remediation` attribute returns "Verify `config.c12.companion_ssh_keyfile` matches the public key in `~/.ssh/authorized_keys` on the companion."

**AC-6: Host key mismatch with `host_key_policy=strict` → `CompanionUnreachableError(reason=host_key_mismatch)`**
Given the companion's host key has changed and `config.c12.companion_host_key_policy = strict`
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=host_key_mismatch, ...)` is raised; ERROR log; `remediation` returns "Inspect `~/.ssh/known_hosts`; if the companion was reflashed, remove its old entry; otherwise treat as a security incident."

**AC-7: SSH session is always closed**
Given any of the four checks raises an unexpected exception (e.g.
SFTP returns `OSError`)
When `verify_companion_ready(address)` is called
Then the exception propagates to the caller; `session.close()` was called exactly once before propagation (verifiable via spy on the fake `SshSession`); no socket descriptor leaks

**AC-8: Connect timeout → `CompanionUnreachableError(reason=timeout)`**
Given the companion address routes but never responds to TCP SYN within `config.c12.companion_connect_timeout_s`
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=timeout, ...)` is raised within `connect_timeout_s + 1.0 s` (allowing test jitter); ERROR log includes the configured timeout value

**AC-9: `engines_inspected_count` reflects what was actually checked**
Given a mix of present + missing engines (2 of 3 expected exist)
When `verify_companion_ready(address)` is called
Then `engines_inspected_count == 2`; the missing engine appears in `not_ready_reasons` but does NOT trigger a sidecar verify call (verifiable via spy)

**AC-10: `host_key_policy=reject_new` blocks first connection to a previously unseen host**
Given `config.c12.companion_host_key_policy = reject_new` and the companion is not in `~/.ssh/known_hosts`
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=host_key_mismatch, ...)` is raised; ERROR log; `remediation` returns "Add the companion to `~/.ssh/known_hosts` first via a manual `ssh-keyscan`, then retry."

## Non-Functional Requirements

**Performance**
- A successful `verify_companion_ready` call against a local-network companion (≤ 1 ms RTT) with 5 engines completes in ≤ 5 s wall-clock (dominated by 5 × `sha256sum` over engines totaling ~1 GB on the companion's NVMe).
- Connection-open phase ≤ 2 s p99 in normal conditions; the `connect_timeout_s` config caps the worst case at the configured value.

**Compatibility**
- paramiko per the project pin; no version override.
- Host-key policies map to paramiko's `MissingHostKeyPolicy` subclasses; if paramiko changes the API in a future minor version, this task's policy mapping is the only place to update.

**Reliability**
- The session is closed in `finally` on every code path (AC-7 covers).
- `sha256sum` invocation has a per-engine timeout (default 60 s, config-overrideable) so a hung companion does not hold the operator's CLI indefinitely.
- The four checks are sequential, not parallel, to keep the SSH session simple and ordering deterministic for log correlation.

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Fake `SshSessionFactory` returning a fake session where all four checks succeed | `ReadinessReport(outcome=ready)` + INFO log |
| AC-2 | Fake session with one missing engine | `outcome=not_ready`, `not_ready_reasons` lists the missing engine, no hash check on the missing one |
| AC-3 | Fake session where sidecar verifier returns `matches=False` | `ContentHashMismatchError` with populated attributes, session closed, ERROR log |
| AC-4 | `SshSessionFactory.open` raises `ConnectionRefusedError` | `CompanionUnreachableError(reason=connect_refused)`, ERROR log |
| AC-5 | `SshSessionFactory.open` raises `paramiko.AuthenticationException` | `CompanionUnreachableError(reason=auth_failed)`, ERROR log |
| AC-6 | `SshSessionFactory.open` raises `paramiko.BadHostKeyException` with `policy=strict` | `CompanionUnreachableError(reason=host_key_mismatch)`, ERROR log |
| AC-7 | Fake session whose `file_exists` raises `OSError` mid-flow | `OSError` propagates; `session.close()` called exactly once |
| AC-8 | `SshSessionFactory.open` raises `socket.timeout` after `connect_timeout_s` | `CompanionUnreachableError(reason=timeout)`, log includes timeout value |
| AC-9 | Fake session with mixed-presence engines, sidecar-verifier spy | `engines_inspected_count == count_of_present_expected`, sidecar verifier not called for missing engines |
| AC-10 | `host_key_policy=reject_new` + unknown host | `CompanionUnreachableError(reason=host_key_mismatch)` with `reject_new`-specific remediation text |
| NFR-perf-cold-call | Microbench against in-process fake session × 100 | p99 ≤ 50 ms for the orchestration overhead (excludes real SSH) |

## Constraints

- paramiko is the only allowed SSH library — no `subprocess.run("ssh ...")` shell-out (security: shell injection surface; reliability: no parsed output).
- `SshSessionFactory` is a Protocol, NOT a class — the concrete `ParamikoSshSessionFactory` is one implementation, allowing tests to inject fakes without monkey-patching paramiko.
- The `RemoteSidecarVerifier` does NOT pull the engine bytes back to the operator workstation — it runs `sha256sum` on the companion and parses the output. This avoids a multi-GB transfer per readiness check.
- The error families (`CompanionUnreachableError`, `ContentHashMismatchError`) are the canonical types; sibling tasks (T3 build_cache) MUST consume these and not redefine them.
- The host-key policy `auto_add_unknown` is intentionally NOT a supported value — silently accepting new host keys defeats the security model. The supported set is `strict | known_hosts | reject_new`; `known_hosts` requires the entry to already exist; `reject_new` is functionally identical to `strict` but with a clearer error message.
- This task does NOT cache SSH sessions — every `verify_companion_ready` call opens and closes a fresh session. Caching would complicate the failure model for marginal performance gain (the bottleneck is the per-engine `sha256sum` runs, not session establishment).

## Risks & Mitigation

**Risk 1: paramiko version drift breaks the host-key-policy mapping**
- *Risk*: A future paramiko minor release renames or removes `MissingHostKeyPolicy` subclasses; this task's mapping breaks silently in tests that don't exercise paramiko itself.
- *Mitigation*: A single integration test (marked `@pytest.mark.requires_paramiko`) constructs `ParamikoSshSessionFactory` with each policy value and asserts the resulting paramiko policy class name. Catches version drift on dependency upgrades.

**Risk 2: `sha256sum` is missing or behaves differently on the companion image**
- *Risk*: The companion is JetPack-based; if it ships without `coreutils`'s `sha256sum`, this task's verifier breaks at runtime.
- *Mitigation*: A composition-root health check at startup runs `sha256sum --version` over the SSH session and surfaces a clear `CompanionUnreachableError(reason=other, underlying_exception_repr="sha256sum not found")` if absent. JetPack base images include `coreutils` per ADR-005.

**Risk 3: Operator's `~/.ssh/known_hosts` has stale entries from prior bench runs**
- *Risk*: A reflashed companion exhibits AC-10 / AC-6 failures legitimately, but operators see the cryptic paramiko traceback if remediation hints are unclear.
- *Mitigation*: AC-6 / AC-10 require the `remediation` attribute on `CompanionUnreachableError` to mention `~/.ssh/known_hosts` explicitly. The CLI subcommand `verify-ready` (in T1) prints the remediation hint to stderr.

**Risk 4: Long-running `sha256sum` hangs the operator's CLI**
- *Risk*: A degraded companion NVMe causes `sha256sum` on a 200 MB engine to take minutes; the operator sees a hung command.
- *Mitigation*: `RemoteSidecarVerifier` enforces a per-engine timeout (default 60 s, config-overrideable). On timeout, the verifier raises `ContentHashMismatchError(actual_sha256_hex="")` so the operator sees a clear failure and can investigate the disk.

## Runtime Completeness

- **Named capability**: pre-flight companion artifact verification per AC-NEW-1 + description.md § 2 `verify_companion_ready`.
- **Production code that must exist**: real `CompanionBringup` orchestrating real `ParamikoSshSessionFactory` + real `RemoteSidecarVerifier` (with real `sha256sum` over SSH); real config-driven SSH credentials + host-key policy + cache root.
- **Allowed external stubs**: tests MAY use a fake `SshSessionFactory` returning a fake `SshSession` whose `run`, `file_exists`, `list_dir` are scripted; production wiring uses paramiko + the real companion.
- **Unacceptable substitutes**: shelling out to `ssh ...` via `subprocess.run` (security + reliability); reading sidecars by pulling engine bytes back to the workstation (multi-GB per readiness check); `auto_add_unknown` host-key policy (security defeat); a "skip-verify" config flag (defeats AC-NEW-1).
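
As an illustration of the `RemoteSidecarVerifier.verify` behaviour described under Outcome, the following is a minimal sketch, not the project's implementation. It assumes the sidecar sits next to the engine with `.sha256` appended to the engine path, and that `RemoteCommandResult` exposes `exit_code` and `stdout` fields; both are assumptions, since the spec does not state them. `FakeSession` is a hypothetical scripted stand-in of the kind the unit tests are allowed to use.

```python
import hashlib
import re
import shlex
from dataclasses import dataclass
from pathlib import PurePosixPath

_HEX64 = re.compile(r"^[0-9a-fA-F]{64}")


@dataclass(frozen=True)
class RemoteCommandResult:
    # Hypothetical shape: the spec names this type but not its fields.
    exit_code: int
    stdout: str


@dataclass(frozen=True)
class RemoteSidecarResult:
    matches: bool
    expected_hex: str
    actual_hex: str


def parse_digest(text: str) -> str:
    """Return the leading 64 hex characters of `text`, lower-cased."""
    found = _HEX64.match(text.strip())
    if found is None:
        raise ValueError(f"no SHA-256 digest in {text[:80]!r}")
    return found.group(0).lower()


def verify(session, engine_path: PurePosixPath, *, timeout_s: float = 60.0) -> RemoteSidecarResult:
    # Assumption: the sidecar is the engine path with ".sha256" appended.
    sidecar_path = PurePosixPath(str(engine_path) + ".sha256")
    # shlex.quote keeps unusual filenames from being interpreted by the remote shell.
    hashed = session.run(f"sha256sum {shlex.quote(str(engine_path))}", timeout_s=timeout_s)
    sidecar = session.run(f"cat {shlex.quote(str(sidecar_path))}", timeout_s=timeout_s)
    # Exit codes are ignored here for brevity; a missing digest raises ValueError.
    actual = parse_digest(hashed.stdout)
    expected = parse_digest(sidecar.stdout)
    # Both digests are lower-cased, so the comparison is case-insensitive.
    return RemoteSidecarResult(matches=(actual == expected), expected_hex=expected, actual_hex=actual)


class FakeSession:
    """Scripted stand-in for an SSH session: maps command strings to stdout."""

    def __init__(self, outputs: dict[str, str]) -> None:
        self._outputs = outputs

    def run(self, command: str, *, timeout_s: float) -> RemoteCommandResult:
        return RemoteCommandResult(exit_code=0, stdout=self._outputs[command])
```

Only the two digests cross the SSH channel, which is the point of the constraint that engine bytes are never pulled back to the workstation.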
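
The method flow for `verify_companion_ready` (steps 2 to 8 under Outcome) can be sketched roughly as follows. This is a simplified illustration under stated assumptions: session opening, error mapping, and logging are left out, `ReadinessReport` is trimmed and uses a plain string for `outcome`, the reason strings `manifest_missing` and `calibration_missing` are hypothetical, and `check_companion`, `FakeSession`, `FakeVerifier`, and `SidecarResult` are illustrative names that do not appear in the spec.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


class ContentHashMismatchError(Exception):
    def __init__(self, engine_path: str, expected_sha256_hex: str, actual_sha256_hex: str) -> None:
        super().__init__(f"sidecar mismatch for {engine_path}")
        self.engine_path = engine_path
        self.expected_sha256_hex = expected_sha256_hex
        self.actual_sha256_hex = actual_sha256_hex


@dataclass(frozen=True)
class ReadinessReport:
    manifest_present: bool
    content_hashes_pass: bool
    engines_present: bool
    calibration_present: bool
    outcome: str  # "ready" or "not_ready"; the spec uses an enum
    not_ready_reasons: tuple[str, ...]
    engines_inspected_count: int


def check_companion(session, verifier, root: PurePosixPath, expected_engines: tuple[str, ...]) -> ReadinessReport:
    """Run the four checks in order on an already-open session, then close it."""
    reasons: list[str] = []
    try:
        manifest = session.file_exists(root / "Manifest.json")
        if not manifest:
            reasons.append("manifest_missing")  # hypothetical reason text
        listed = set(session.list_dir(root / "engines"))
        expected = set(expected_engines)
        if not expected:
            engines = False
            reasons.append("expected_engines list empty in caller-supplied config")
        else:
            missing = sorted(expected - listed)
            engines = not missing
            if missing:
                reasons.append("engines_missing: " + ", ".join(missing))
        # Only engines that are both expected and present get hashed.
        inspected = sorted(expected & listed)
        for name in inspected:
            path = root / "engines" / name
            result = verifier.verify(session, path)
            if not result.matches:
                raise ContentHashMismatchError(str(path), result.expected_hex, result.actual_hex)
        hashes = True  # vacuously True if nothing was inspected; the spec leaves that case open
        calibration = session.file_exists(root / "camera_calibration.json")
        if not calibration:
            reasons.append("calibration_missing")  # hypothetical reason text
    finally:
        session.close()  # runs on success, on a hash mismatch, and on unexpected errors
    ready = manifest and engines and hashes and calibration
    return ReadinessReport(manifest, hashes, engines, calibration,
                           "ready" if ready else "not_ready", tuple(reasons), len(inspected))


@dataclass
class FakeSession:
    files: set[str]
    close_calls: int = 0

    def file_exists(self, remote_path: PurePosixPath) -> bool:
        return str(remote_path) in self.files

    def list_dir(self, remote_path: PurePosixPath) -> list[str]:
        prefix = str(remote_path) + "/"
        return [f[len(prefix):] for f in self.files if f.startswith(prefix)]

    def close(self) -> None:
        self.close_calls += 1


@dataclass(frozen=True)
class SidecarResult:
    matches: bool
    expected_hex: str = "ab" * 32
    actual_hex: str = "ab" * 32


@dataclass
class FakeVerifier:
    matches: bool = True

    def verify(self, session, engine_path: PurePosixPath) -> SidecarResult:
        return SidecarResult(self.matches)
```

Keeping the checks sequential inside one `try/finally` makes the close-exactly-once behaviour of AC-7 straightforward to assert with a spy such as `close_calls`.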