C12 Companion Bringup — SSH verify_companion_ready + ReadinessReport
Task: AZ-327_c12_companion_bringup
Name: C12 Companion Bringup
Description: Implement CompanionBringup, the C12-internal helper that opens an SSH session against the companion (paramiko per project pin), inspects the companion-side filesystem for the four required pre-flight artifacts (Manifest.json, .engine files + AZ-280 sidecars, calibration JSON), runs sidecar verification on the engines via a remote sha256sum over the engine path (compared against the sidecar's hex digest), and returns a ReadinessReport per description.md § 2 (manifest_present, content_hashes_pass, engines_present, calibration_present, outcome ∈ {ready, not_ready}, not_ready_reasons: list[str]). Owns the two error families: CompanionUnreachableError (SSH session-open failure: TCP refused, auth failed, host key mismatch, socket timeout) and ContentHashMismatchError (sidecar verification fails on at least one engine; distinct from "engine missing", which is a not-ready signal, not an exception). Public surface is one method: verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport. SSH user, key file, host-key policy, connect-timeout, and the canonical companion-side cache root come from config (config.c12.companion_ssh_user, config.c12.companion_ssh_keyfile, config.c12.companion_host_key_policy, config.c12.companion_connect_timeout_s, config.c12.companion_cache_root) per AZ-269. The session is opened in a try/finally block; the connection is always closed even if one of the four checks raises. INFO log on every successful call (with the four boolean flags + outcome); WARN on degraded readiness (at least one check failed without raising, so outcome=not_ready); ERROR on the two error families.
Complexity: 3 points
Dependencies: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
Component: c12_operator_orchestrator (epic AZ-253 / E-C12)
Tracker: AZ-327
Epic: AZ-253 (E-C12)
Document Dependencies
- `_docs/02_document/components/13_c12_operator_orchestrator/description.md`: § 2 (`verify_companion_ready` interface + `ReadinessReport` DTO shape), § 5 (`CompanionUnreachableError`, `ContentHashMismatchError`), § 7 (filesystem lockfile note; relevant for orchestrator T3, not this task).
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md`: sidecar file format (this task verifies remotely; it does not import the helper but reuses the schema).
- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md`: engine filename layout used to enumerate the expected engines list.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md`: INFO/WARN/ERROR log shapes.
Problem
Without a real CompanionBringup:
- `build_cache` (sibling T3) cannot run safely: the orchestrator would invoke C10 on the companion without any pre-flight visibility into the companion's state. A half-provisioned companion would either silently miscompile (manifest stale) or corrupt the cache.
- The `verify-ready` CLI subcommand has no implementation: operators cannot diagnose "is my companion in a usable state?" without SSHing in manually.
- Pre-flight content-hash verification per AC-NEW-1's takeoff gate (AZ-324 covers the airborne side) has no operator-side counterpart: sidecar mismatches that occur during the SSH transfer would only surface at takeoff, too late.
- `CompanionUnreachableError` and `ContentHashMismatchError` exist as concept-only types in description.md § 5 with no producer.
- Configuration knobs for SSH credentials, host-key policy, and the canonical cache root have no consumer; AZ-269's loader cannot validate them against a concrete usage.
This task delivers the bring-up + verification layer. It does NOT orchestrate the build_cache flow (sibling T3 does), does NOT invoke C10 (T3 does via SSH after this task confirms readiness), and does NOT perform the takeoff-time content-hash verification (AZ-324 owns the airborne side).
Outcome
- A `CompanionBringup` class at `src/operator_tool/companion_bringup.py`:
  - Constructor: `__init__(self, *, ssh_factory: SshSessionFactory, sidecar_verifier: RemoteSidecarVerifier, logger: Logger, config: C12CompanionConfig)`.
  - `C12CompanionConfig` (`@dataclass(frozen=True)`): `ssh_user: str`, `ssh_keyfile: Path`, `host_key_policy: enum {strict, known_hosts, reject_new}`, `connect_timeout_s: float = 10.0`, `companion_cache_root: PurePosixPath = PurePosixPath("/var/lib/azaion/c10/cache")`, `manifest_filename: str = "Manifest.json"`, `calibration_filename: str = "camera_calibration.json"`, `expected_engines: tuple[str, ...] = ()` (the orchestrator passes the list per the request; default empty fails AC-2 cleanly).
  - Public method: `verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport`.
- DTOs at `src/operator_tool/_types.py`:
  - `CompanionAddress` (`@dataclass(frozen=True)`): `host: str`, `port: int = 22`.
  - `ReadinessReport` (`@dataclass(frozen=True)`): `manifest_present: bool`, `content_hashes_pass: bool`, `engines_present: bool`, `calibration_present: bool`, `outcome: enum {ready, not_ready}`, `not_ready_reasons: tuple[str, ...]`, `companion_cache_root: str`, `engines_inspected_count: int`.
- Errors at `src/operator_tool/errors.py`:
  - `CompanionUnreachableError(Exception)`: attributes `host: str`, `port: int`, `reason: enum {connect_refused, auth_failed, host_key_mismatch, timeout, other}`, `underlying_exception_repr: str`. The `remediation` attribute returns a one-line operator-friendly hint per `reason`.
  - `ContentHashMismatchError(Exception)`: attributes `engine_path: str`, `expected_sha256_hex: str`, `actual_sha256_hex: str`. The `remediation` attribute returns "Re-run the cache build (operator-orchestrator build-cache --area ...) to repopulate the affected engine.".
- A `SshSessionFactory` Protocol at `src/operator_tool/ssh_session.py`:

  ```python
  # Deferred annotations: CompanionAddress and RemoteCommandResult are defined elsewhere in the package.
  from __future__ import annotations

  from pathlib import PurePosixPath
  from typing import Protocol, runtime_checkable


  @runtime_checkable
  class SshSession(Protocol):
      def run(self, command: str, *, timeout_s: float) -> RemoteCommandResult: ...
      def file_exists(self, remote_path: PurePosixPath) -> bool: ...
      def list_dir(self, remote_path: PurePosixPath) -> list[str]: ...
      def close(self) -> None: ...


  @runtime_checkable
  class SshSessionFactory(Protocol):
      def open(self, address: CompanionAddress, *, timeout_s: float) -> SshSession: ...
  ```

  Concrete implementation `ParamikoSshSessionFactory` wraps `paramiko.SSHClient` with the documented host-key policy mapping (`strict → RejectPolicy`, `known_hosts → AutoAddPolicy gated on ~/.ssh/known_hosts presence`, `reject_new → RejectPolicy with explicit allowlist`).
- A `RemoteSidecarVerifier` helper at `src/operator_tool/remote_sidecar_verifier.py`:
  - `verify(session: SshSession, engine_path: PurePosixPath) -> RemoteSidecarResult`: runs `sha256sum <engine_path>` over the SSH session, parses the first 64 hex chars, reads the sidecar file at `<engine_path>.sha256` via `session.run("cat ...")`, parses its 64 hex chars, and compares the two case-insensitively. Returns `RemoteSidecarResult(matches: bool, expected_hex: str, actual_hex: str)`.
- Method flow for `verify_companion_ready`:
  - Open SSH session via `ssh_factory.open(companion_address, timeout_s=config.connect_timeout_s)`. On any paramiko/socket exception → catch and raise `CompanionUnreachableError`, mapping the underlying type to a `reason` enum value. Always wrap subsequent steps in a `try/finally` that closes the session.
  - Check 1, `manifest_present`: `session.file_exists(companion_cache_root / manifest_filename)`.
  - Check 2, `engines_present`: `session.list_dir(companion_cache_root / "engines")` → set of filenames; compare against `config.expected_engines`. If `config.expected_engines` is empty → `engines_present = False`, `not_ready_reasons += ["expected_engines list empty in caller-supplied config"]`. Else `engines_present = set(expected_engines).issubset(listed_engines)`; if not, append `"engines_missing: <comma-list>"`.
  - Check 3, `content_hashes_pass`: for each engine in the intersection of `expected_engines` and `listed_engines`, call `sidecar_verifier.verify(session, companion_cache_root / "engines" / engine)`. If ANY result has `matches == False` → raise `ContentHashMismatchError` with the first failing path. If all match → `content_hashes_pass = True`. Records `engines_inspected_count` regardless.
  - Check 4, `calibration_present`: `session.file_exists(companion_cache_root / calibration_filename)`.
  - Compute `outcome`: `ready` iff all four booleans are `True`; `not_ready` otherwise.
  - Emit log: INFO `kind="c12.companion.ready"` with the four flags + outcome on success; WARN `kind="c12.companion.degraded"` if any check failed without raising (i.e. `outcome=not_ready` due to a missing artifact, not a hash mismatch).
  - Return the `ReadinessReport`.
- Composition-root factory at `src/gps_denied_onboard/runtime_root/c12_factory.py` extends T1's `OperatorOrchestratorServices` dataclass with a `companion_bringup: CompanionBringup` field. The factory `build_companion_bringup(config) -> CompanionBringup` constructs the paramiko-backed session factory + remote sidecar verifier + logger.
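The check sequence in the method flow can be sketched as below. This is a minimal illustration under stated assumptions, not the shipped implementation: it starts from an already-open session (the SSH-open step and all logging are left out), models `outcome` as a plain string, drops `companion_cache_root` from the report, and uses a simplified `ContentHashMismatchError`. The names `run_checks` and `FakeSession`, the `a.engine` / `b.engine` filenames, and the `manifest_missing` / `calibration_missing` reason strings are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ReadinessReport:
    manifest_present: bool
    content_hashes_pass: bool
    engines_present: bool
    calibration_present: bool
    outcome: str  # "ready" | "not_ready"; the spec models this as an enum
    not_ready_reasons: tuple
    engines_inspected_count: int


class ContentHashMismatchError(Exception):
    """Simplified stand-in: carries only the failing engine path."""


def run_checks(session, verify, root, expected_engines,
               manifest="Manifest.json", calibration="camera_calibration.json"):
    """Run the four checks on an open session and always close it."""
    reasons = []
    try:
        # Check 1: manifest.
        manifest_present = session.file_exists(root / manifest)
        if not manifest_present:
            reasons.append("manifest_missing")  # hypothetical reason text

        # Check 2: expected engines versus what the companion lists.
        listed = set(session.list_dir(root / "engines"))
        expected = set(expected_engines)
        missing = sorted(expected - listed)
        engines_present = bool(expected) and not missing
        if not expected:
            reasons.append("expected_engines list empty in caller-supplied config")
        elif missing:
            reasons.append("engines_missing: " + ", ".join(missing))

        # Check 3: sidecars, only for engines that are actually present.
        inspected = 0
        for name in sorted(expected & listed):
            inspected += 1
            engine_path = root / "engines" / name
            if not verify(session, engine_path):
                raise ContentHashMismatchError(str(engine_path))

        # Check 4: calibration.
        calibration_present = session.file_exists(root / calibration)
        if not calibration_present:
            reasons.append("calibration_missing")  # hypothetical reason text

        ready = manifest_present and engines_present and calibration_present
        return ReadinessReport(manifest_present, True, engines_present,
                               calibration_present,
                               "ready" if ready else "not_ready",
                               tuple(reasons), inspected)
    finally:
        session.close()  # runs on return and on every exception path


class FakeSession:
    """Scripted stand-in for SshSession, in the spirit of the unit tests."""

    def __init__(self, files, engines):
        self.files, self.engines, self.close_calls = set(files), list(engines), 0

    def file_exists(self, remote_path):
        return str(remote_path) in self.files

    def list_dir(self, remote_path):
        return list(self.engines)

    def close(self):
        self.close_calls += 1


ROOT = PurePosixPath("/var/lib/azaion/c10/cache")
session = FakeSession(
    files={str(ROOT / "Manifest.json"), str(ROOT / "camera_calibration.json")},
    engines=["a.engine"],
)
report = run_checks(session, lambda s, p: True, ROOT, ("a.engine", "b.engine"))
print(report.outcome, report.not_ready_reasons, report.engines_inspected_count)
```

Keeping the `return` inside the `try` means the `finally` clause closes the session on the success path as well as on a hash mismatch or an unexpected `OSError`, which is the property AC-7 checks.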
Scope
Included
- `CompanionBringup` class with the single public method.
- The 2 DTOs (`CompanionAddress`, `ReadinessReport`) plus the `outcome` and `reason` enum types.
- The 2 error types (`CompanionUnreachableError`, `ContentHashMismatchError`) with `remediation` attributes.
- `SshSessionFactory` + `SshSession` Protocols.
- `ParamikoSshSessionFactory` + `ParamikoSshSession` concrete implementations.
- `RemoteSidecarVerifier` helper.
- Composition-root factory.
- Config schema extension on AZ-269's loader (`config.c12.companion_*` block).
- `verify-ready` subcommand wiring delegated to T1's CLI shell: this task ships the service class; T1's `cli.py` resolves it from the composition root.
- Conformance unit tests using a fake `SshSessionFactory` (no paramiko in unit tests) covering all 10 acceptance criteria.
Excluded
- The `build_cache` orchestration that consumes `verify_companion_ready` (sibling T3).
- The actual SSH-invocation of C10 on the companion (sibling T3).
- The takeoff-time content-hash verification on the airborne side (AZ-324).
- Engine compilation (AZ-321), descriptor generation (AZ-322), Manifest writing (AZ-323) — all C10 owns these and they ran prior to this task being invoked.
- A SOCKS proxy or jump-host SSH path — direct SSH only this cycle.
- Telemetry exfiltration of operator workstation key material — host key + private key never appear in log output (only fingerprint hash if at all).
Acceptance Criteria
AC-1: All four artifacts present + sidecars verify → outcome=ready
Given the companion's SSH is reachable, Manifest.json exists, all expected_engines exist, all sidecars verify, and the calibration file exists
When verify_companion_ready(address) is called
Then ReadinessReport(manifest_present=True, content_hashes_pass=True, engines_present=True, calibration_present=True, outcome=ready, not_ready_reasons=()) is returned; ONE INFO log kind="c12.companion.ready" is emitted
AC-2: Missing engine → outcome=not_ready
Given expected_engines=("dinov2_vpr_sm87_jp62_trt103_fp16.engine", "lightglue_sm87_jp62_trt103_fp16.engine") and only the first exists on the companion
When verify_companion_ready(address) is called
Then engines_present=False; not_ready_reasons contains "engines_missing: lightglue_sm87_jp62_trt103_fp16.engine"; outcome=not_ready; ONE WARN log kind="c12.companion.degraded"; NO ContentHashMismatchError is raised
AC-3: Sidecar mismatch → ContentHashMismatchError
Given an engine file is present but its sidecar's hex digest does not match the engine's actual SHA-256
When verify_companion_ready(address) is called
Then ContentHashMismatchError is raised with engine_path, expected_sha256_hex, actual_sha256_hex populated; the SSH session is closed (session.close() is called in finally); ONE ERROR log kind="c12.companion.hash.mismatch" is emitted
AC-4: SSH connection refused → CompanionUnreachableError(reason=connect_refused)
Given the companion address is unreachable (TCP RST or no listener)
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=connect_refused, underlying_exception_repr="...") is raised; the underlying paramiko/socket exception's repr is captured; ONE ERROR log kind="c12.companion.unreachable"; remediation attribute returns "Check companion power, USB/Ethernet cable, and config.c12.companion_address."
AC-5: SSH auth failure → CompanionUnreachableError(reason=auth_failed)
Given the companion is reachable but the SSH key is wrong or revoked
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=auth_failed, ...) is raised; ERROR log kind="c12.companion.unreachable" with reason="auth_failed"; remediation attribute returns "Verify config.c12.companion_ssh_keyfile matches the public key in ~/.ssh/authorized_keys on the companion."
AC-6: Host key mismatch with host_key_policy=strict → CompanionUnreachableError(reason=host_key_mismatch)
Given the companion's host key has changed and config.c12.companion_host_key_policy = strict
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=host_key_mismatch, ...) is raised; ERROR log; remediation returns "Inspect ~/.ssh/known_hosts; if the companion was reflashed, remove its old entry; otherwise treat as a security incident."
AC-7: SSH session is always closed
Given any of the four checks raises an unexpected exception (e.g. SFTP returns OSError)
When verify_companion_ready(address) is called
Then the exception propagates to the caller; session.close() was called exactly once before propagation (verifiable via spy on the fake SshSession); no socket descriptor leaks
AC-8: Connect timeout → CompanionUnreachableError(reason=timeout)
Given the companion address routes but never responds to TCP SYN within config.c12.companion_connect_timeout_s
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=timeout, ...) is raised within connect_timeout_s + 1.0 s (allowing test jitter); ERROR log includes the configured timeout value
AC-9: engines_inspected_count reflects what was actually checked
Given a mix of present + missing engines (2 of 3 expected exist)
When verify_companion_ready(address) is called
Then engines_inspected_count == 2; the missing engine appears in not_ready_reasons but does NOT trigger a sidecar verify call (verifiable via spy)
AC-10: host_key_policy=reject_new blocks first connection to a previously unseen host
Given config.c12.companion_host_key_policy = reject_new and the companion is not in ~/.ssh/known_hosts
When verify_companion_ready(address) is called
Then CompanionUnreachableError(reason=host_key_mismatch, ...) is raised; ERROR log; remediation returns "Add the companion to ~/.ssh/known_hosts first via a manual ssh-keyscan, then retry."
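AC-4, AC-5, AC-6 and AC-8 each pin one underlying exception to one `reason` value. A minimal sketch of that mapping follows, assuming the paramiko exceptions are recognised by class name so the example runs without paramiko installed; `UnreachableReason` and `classify_open_failure` are hypothetical names. It does not cover the AC-10 case, which depends on how the concrete factory surfaces an unknown host.

```python
import socket
from enum import Enum


class UnreachableReason(Enum):
    CONNECT_REFUSED = "connect_refused"
    AUTH_FAILED = "auth_failed"
    HOST_KEY_MISMATCH = "host_key_mismatch"
    TIMEOUT = "timeout"
    OTHER = "other"


# paramiko exception class names, matched by name (an assumption of this
# sketch) so that no paramiko import is needed here.
_BY_CLASS_NAME = {
    "AuthenticationException": UnreachableReason.AUTH_FAILED,
    "BadHostKeyException": UnreachableReason.HOST_KEY_MISMATCH,
}


def classify_open_failure(exc: BaseException) -> UnreachableReason:
    """Map a session-open failure to the reason enum of the error family."""
    # Walk the MRO so subclasses of the named exceptions are also recognised.
    for cls in type(exc).__mro__:
        if cls.__name__ in _BY_CLASS_NAME:
            return _BY_CLASS_NAME[cls.__name__]
    if isinstance(exc, ConnectionRefusedError):
        return UnreachableReason.CONNECT_REFUSED
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return UnreachableReason.TIMEOUT
    return UnreachableReason.OTHER


print(classify_open_failure(ConnectionRefusedError()).value)
```

Anything unrecognised falls through to `other`, so an unexpected exception type still produces a `CompanionUnreachableError` instead of escaping unclassified.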
Non-Functional Requirements
Performance
- A successful `verify_companion_ready` call against a local-network companion (≤ 1 ms RTT) with 5 engines completes in ≤ 5 s wall-clock (dominated by 5 × `sha256sum` over engines totaling ~1 GB on the companion's NVMe).
- Connection-open phase ≤ 2 s p99 in normal conditions; the `connect_timeout_s` config caps the worst case at the configured value.
Compatibility
- paramiko per the project pin; no version override.
- Host-key policies map to paramiko's `MissingHostKeyPolicy` subclasses; if paramiko changes the API in a future minor version, this task's policy mapping is the only place to update.
Reliability
- The session is closed in `finally` on every code path (covered by AC-7).
- `sha256sum` invocation has a per-engine timeout (default 60 s, config-overrideable) so a hung companion does not hold the operator's CLI indefinitely.
- The four checks are sequential, not parallel, to keep the SSH session simple and ordering deterministic for log correlation.
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Fake `SshSessionFactory` returning a fake session where all four checks succeed | `ReadinessReport(outcome=ready)` + INFO log |
| AC-2 | Fake session with one missing engine | `outcome=not_ready`, `not_ready_reasons` lists the missing engine, no hash check on the missing one |
| AC-3 | Fake session where sidecar verifier returns `matches=False` | `ContentHashMismatchError` with populated attributes, session closed, ERROR log |
| AC-4 | `SshSessionFactory.open` raises `ConnectionRefusedError` | `CompanionUnreachableError(reason=connect_refused)`, ERROR log |
| AC-5 | `SshSessionFactory.open` raises `paramiko.AuthenticationException` | `CompanionUnreachableError(reason=auth_failed)`, ERROR log |
| AC-6 | `SshSessionFactory.open` raises `paramiko.BadHostKeyException` with `policy=strict` | `CompanionUnreachableError(reason=host_key_mismatch)`, ERROR log |
| AC-7 | Fake session whose `file_exists` raises `OSError` mid-flow | `OSError` propagates; `session.close()` called exactly once |
| AC-8 | `SshSessionFactory.open` raises `socket.timeout` after `connect_timeout_s` | `CompanionUnreachableError(reason=timeout)`, log includes timeout value |
| AC-9 | Fake session with mixed-presence engines, sidecar-verifier spy | `engines_inspected_count == count_of_present_expected`, sidecar verifier not called for missing engines |
| AC-10 | `host_key_policy=reject_new` + unknown host | `CompanionUnreachableError(reason=host_key_mismatch)` with `reject_new`-specific remediation text |
| NFR-perf-cold-call | Microbench against in-process fake session × 100 | p99 ≤ 50 ms for the orchestration overhead (excludes real SSH) |
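The NFR-perf-cold-call row asks for a p99 over 100 in-process runs. One way to compute it with the standard library is sketched below; `p99` and `measure_ms` are hypothetical helper names, and the interpolation is whatever `statistics.quantiles` applies by default, which the spec does not prescribe.

```python
import statistics
import time


def p99(samples):
    """99th percentile of the samples, using the stdlib default interpolation."""
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    return statistics.quantiles(samples, n=100)[-1]


def measure_ms(fn, runs=100):
    """Call fn `runs` times and return each wall-clock duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Stand-in workload; a real test would call verify_companion_ready on a fake session.
durations = measure_ms(lambda: sum(range(1000)))
print(f"p99 = {p99(durations):.3f} ms")
```

A test built on this would assert `p99(durations) <= 50.0` for the orchestration overhead.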
Constraints
- paramiko is the only allowed SSH library: no `subprocess.run("ssh ...")` shell-out (security: shell injection surface; reliability: no parsed output).
- `SshSessionFactory` is a Protocol, NOT a concrete class; the concrete `ParamikoSshSessionFactory` is one implementation, allowing tests to inject fakes without monkey-patching paramiko.
- The `RemoteSidecarVerifier` does NOT pull the engine bytes back to the operator workstation; it runs `sha256sum` on the companion and parses the output. This avoids a multi-GB transfer per readiness check.
- The error families (`CompanionUnreachableError`, `ContentHashMismatchError`) are the canonical types; sibling tasks (T3 build_cache) MUST consume these and not redefine them.
- The host-key policy `auto_add_unknown` is intentionally NOT a supported value; silently accepting new host keys defeats the security model. The supported set is `strict | known_hosts | reject_new`; `known_hosts` requires the entry to already exist; `reject_new` is functionally identical to `strict` but with a clearer error message.
- This task does NOT cache SSH sessions; every `verify_companion_ready` call opens and closes a fresh session. Caching would complicate the failure model for marginal performance gain (the bottleneck is the per-engine `sha256sum` runs, not session establishment).
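The closed set of host-key policies can be enforced at config-parse time. The sketch below assumes the value arrives as a plain string; `HostKeyPolicy` and `parse_host_key_policy` are hypothetical names, and how AZ-269's loader actually validates the field is not specified here.

```python
from enum import Enum


class HostKeyPolicy(Enum):
    STRICT = "strict"
    KNOWN_HOSTS = "known_hosts"
    REJECT_NEW = "reject_new"


def parse_host_key_policy(raw: str) -> HostKeyPolicy:
    """Return the matching policy or fail loudly for anything outside the set."""
    try:
        return HostKeyPolicy(raw)
    except ValueError:
        supported = " | ".join(policy.value for policy in HostKeyPolicy)
        raise ValueError(
            f"unsupported host_key_policy {raw!r}; supported: {supported}"
        ) from None


print(parse_host_key_policy("reject_new"))
```

Because `auto_add_unknown` is not an enum member, a config that tries to use it fails before any SSH session is attempted.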
Risks & Mitigation
Risk 1: paramiko version drift breaks the host-key-policy mapping
- Risk: A future paramiko minor release renames or removes `MissingHostKeyPolicy` subclasses; this task's mapping breaks silently in tests that don't exercise paramiko itself.
- Mitigation: A single integration test (marked `@pytest.mark.requires_paramiko`) constructs `ParamikoSshSessionFactory` with each policy value and asserts the resulting paramiko policy class name. Catches version drift on dependency upgrades.
Risk 2: sha256sum is missing or behaves differently on the companion image
- Risk: The companion is JetPack-based; if it ships without `coreutils`'s `sha256sum`, this task's verifier breaks at runtime.
- Mitigation: A composition-root health check at startup runs `sha256sum --version` over the SSH session and surfaces a clear `CompanionUnreachableError(reason=other, underlying_exception_repr="sha256sum not found")` if absent. JetPack base images include `coreutils` per ADR-005.
Risk 3: Operator's ~/.ssh/known_hosts has stale entries from prior bench runs
- Risk: A reflashed companion exhibits AC-10 / AC-6 failures legitimately, but operators see the cryptic paramiko traceback if remediation hints are unclear.
- Mitigation: AC-6 / AC-10 require the `remediation` attribute on `CompanionUnreachableError` to mention `~/.ssh/known_hosts` explicitly. The CLI subcommand `verify-ready` (in T1) prints the remediation hint to stderr.
Risk 4: Long-running sha256sum hangs the operator's CLI
- Risk: A degraded companion NVMe causes `sha256sum` on a 200 MB engine to take minutes; the operator sees a hung command.
- Mitigation: `RemoteSidecarVerifier` enforces a per-engine timeout (default 60 s, config-overrideable). On timeout, the verifier raises `ContentHashMismatchError(actual_sha256_hex="<timeout>")` so the operator sees a clear failure and can investigate the disk.
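The verifier's parse, compare and timeout behaviour might look roughly like this. Two things are assumed because the spec does not state them: that `RemoteCommandResult` exposes the command output as a `stdout` string, and that a timed-out `session.run` raises `TimeoutError`. The remote paths are shell-quoted with `shlex.quote` since `run` takes a command string.

```python
import re
import shlex
from dataclasses import dataclass

_DIGEST = re.compile(r"\s*([0-9a-fA-F]{64})\b")


@dataclass(frozen=True)
class RemoteSidecarResult:
    matches: bool
    expected_hex: str
    actual_hex: str


class ContentHashMismatchError(Exception):
    def __init__(self, engine_path, expected_sha256_hex, actual_sha256_hex):
        super().__init__(f"content hash mismatch for {engine_path}")
        self.engine_path = engine_path
        self.expected_sha256_hex = expected_sha256_hex
        self.actual_sha256_hex = actual_sha256_hex


def parse_digest(text):
    """Extract the leading 64-char hex digest, lower-cased for comparison."""
    match = _DIGEST.match(text)
    if match is None:
        raise ValueError("output does not start with a 64-character hex digest")
    return match.group(1).lower()


def verify(session, engine_path, timeout_s=60.0):
    """Compare the sidecar digest with sha256sum computed on the companion."""
    engine = shlex.quote(str(engine_path))
    sidecar = shlex.quote(str(engine_path) + ".sha256")
    expected = parse_digest(session.run(f"cat {sidecar}", timeout_s=timeout_s).stdout)
    try:
        # Assumption: a timed-out remote command surfaces as TimeoutError.
        output = session.run(f"sha256sum {engine}", timeout_s=timeout_s).stdout
    except TimeoutError:
        raise ContentHashMismatchError(str(engine_path), expected, "<timeout>") from None
    actual = parse_digest(output)
    return RemoteSidecarResult(actual == expected, expected, actual)
```

Lower-casing both digests in `parse_digest` gives the case-insensitive comparison, and only the digest text crosses the SSH channel, never the engine bytes.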
Runtime Completeness
- Named capability: pre-flight companion artifact verification per AC-NEW-1 + description.md § 2 `verify_companion_ready`.
- Production code that must exist: real `CompanionBringup` orchestrating real `ParamikoSshSessionFactory` + real `RemoteSidecarVerifier` (with real `sha256sum` over SSH); real config-driven SSH credentials + host-key policy + cache root.
- Allowed external stubs: tests MAY use a fake `SshSessionFactory` returning a fake `SshSession` whose `run`, `file_exists`, `list_dir` are scripted; production wiring uses paramiko + the real companion.
- Unacceptable substitutes: shelling out to `ssh ...` via `subprocess.run` (security + reliability); reading sidecars by pulling engine bytes back to the workstation (multi-GB per readiness check); `auto_add_unknown` host-key policy (security defeat); a "skip-verify" config flag (defeats AC-NEW-1).