Mirror of https://github.com/azaion/gps-denied-onboard.git (synced 2026-06-23 07:21:13 +00:00)

[AZ-326] [AZ-327] C12 operator-tool CLI + companion SSH bringup

AZ-326 (3pt): operator-tool Click CLI shell at src/gps_denied_onboard/components/c12_operator_tooling/cli.py with six subcommands (download, build-cache, upload-pending, reloc-confirm, verify-ready, set-sector); SectorClassificationStore (atomic-write JSON under ~/.azaion/onboard/sector-classifications.json); freshness-table lookup driving AC-NEW-6; EXIT_* constants; AZ-266 structured-JSON log wiring to a rotating ~/.azaion/onboard/c12-tooling.log handler; operator-tool console-script entry in pyproject.toml.

AZ-327 (3pt): CompanionBringup orchestrator at src/gps_denied_onboard/components/c12_operator_tooling/companion_bringup.py that opens an SSH session against the companion (paramiko per project pin), checks the four pre-flight artifacts (Manifest, expected engines, sha256 sidecars, calibration), and returns a ReadinessReport per description.md S2; CompanionUnreachableError + ContentHashMismatchError with operator-friendly remediation hints; ParamikoSshSessionFactory + RemoteSidecarVerifier (sha256sum + cat over SSH, no bytes pulled to the workstation); paramiko>=3.4,<4.0 dep added.

NFR-perf-cold-start fix: PEP 562 lazy `__getattr__` in `c12_operator_tooling/__init__.py` and `flights_api/__init__.py` defers HttpxFlightsApiClient (httpx), ParamikoSshSession[Factory] (paramiko + cryptography), bbox_from_waypoints / takeoff_origin_from_flight (numpy + pyproj). cli.py imports from leaf flights_api modules. operator-tool --help cold start: ~870ms -> <200ms typical, <500ms p99.

Includes 73 unit tests (incl. paramiko-version-drift smoke per AZ-327 Risk 1) + console-script integration test. All 1494 repo-wide unit tests pass; 80 skips are pre-existing environment gates. Batch report: _docs/03_implementation/batch_42_cycle1_report.md.

Co-authored-by: Cursor <cursoragent@cursor.com>

# C12 Companion Bringup — SSH `verify_companion_ready` + `ReadinessReport`

**Task**: AZ-327_c12_companion_bringup
**Name**: C12 Companion Bringup
**Description**: Implement `CompanionBringup`, the C12-internal helper that opens an SSH session against the companion (paramiko per project pin), inspects the companion-side filesystem for the four required pre-flight artifacts (Manifest.json, .engine files + AZ-280 sidecars, calibration JSON), runs sidecar verification on the engines via a remote `sha256sum` over the engine path (compared against the sidecar's hex digest), and returns a `ReadinessReport` per description.md § 2 (`manifest_present`, `content_hashes_pass`, `engines_present`, `calibration_present`, `outcome ∈ {ready, not_ready}`, `not_ready_reasons: list[str]`). Owns the two error families: `CompanionUnreachableError` (SSH session-open failure: TCP refused, auth failed, host key mismatch, socket timeout) and `ContentHashMismatchError` (sidecar verification fails on at least one engine — distinct from "engine missing", which is a not-ready signal not an exception). Public surface is one method `verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport`. SSH user, key file, host-key policy, connect-timeout, and the canonical companion-side cache root come from config (`config.c12.companion_ssh_user`, `config.c12.companion_ssh_keyfile`, `config.c12.companion_host_key_policy`, `config.c12.companion_connect_timeout_s`, `config.c12.companion_cache_root`) per AZ-269. The session is opened in a `try/finally` block; the connection is always closed even if the four checks raise. INFO log on every successful call (with the four boolean flags + outcome); WARN on degraded readiness (any 3-of-4); ERROR on the two error families.
**Complexity**: 3 points
**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
**Component**: c12_operator_tooling (epic AZ-253 / E-C12)
**Tracker**: AZ-327
**Epic**: AZ-253 (E-C12)

### Document Dependencies

- `_docs/02_document/components/13_c12_operator_tooling/description.md` — § 2 (`verify_companion_ready` interface + `ReadinessReport` DTO shape), § 5 (`CompanionUnreachableError`, `ContentHashMismatchError`), § 7 (filesystem lockfile note — relevant for orchestrator T3 not this task).
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar file format (this task verifies remotely; does not import the helper but reuses the schema).
- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — engine filename layout used to enumerate the expected engines list.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN/ERROR log shapes.

## Problem

Without a real `CompanionBringup`:

- `build_cache` (sibling T3) cannot run safely — the orchestrator would invoke C10 on the companion without any pre-flight visibility into the companion's state. A half-provisioned companion would either silently miscompile (manifest stale) or corrupt the cache.
- The `verify-ready` CLI subcommand has no implementation — operators cannot diagnose "is my companion in a usable state?" without SSHing in manually.
- Pre-flight content-hash verification per AC-NEW-1's takeoff gate (AZ-324 covers the airborne side) has no operator-side counterpart — sidecar mismatches that occur during the SSH transfer would only surface at takeoff, too late.
- `CompanionUnreachableError` and `ContentHashMismatchError` exist as concept-only types in description.md § 5 with no producer.
- Configuration knobs for SSH credentials, host-key policy, and the canonical cache root have no consumer; AZ-269's loader cannot validate them against a concrete usage.

This task delivers the bring-up + verification layer. It does NOT orchestrate the `build_cache` flow (sibling T3 does), does NOT invoke C10 (T3 does via SSH after this task confirms readiness), and does NOT perform the takeoff-time content-hash verification (AZ-324 owns the airborne side).

## Outcome

- A `CompanionBringup` class at `src/operator_tool/companion_bringup.py`:
  - Constructor: `__init__(self, *, ssh_factory: SshSessionFactory, sidecar_verifier: RemoteSidecarVerifier, logger: Logger, config: C12CompanionConfig)`.
  - `C12CompanionConfig` (`@dataclass(frozen=True)`): `ssh_user: str`, `ssh_keyfile: Path`, `host_key_policy: enum {strict, known_hosts, reject_new}`, `connect_timeout_s: float = 10.0`, `companion_cache_root: PurePosixPath = PurePosixPath("/var/lib/azaion/c10/cache")`, `manifest_filename: str = "Manifest.json"`, `calibration_filename: str = "camera_calibration.json"`, `expected_engines: tuple[str, ...] = ()` (the orchestrator passes the list per the request; default empty fails AC-2 cleanly).
  - Public method: `verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport`.
- DTOs at `src/operator_tool/_types.py`:
  - `CompanionAddress` (`@dataclass(frozen=True)`): `host: str`, `port: int = 22`.
  - `ReadinessReport` (`@dataclass(frozen=True)`): `manifest_present: bool`, `content_hashes_pass: bool`, `engines_present: bool`, `calibration_present: bool`, `outcome: enum {ready, not_ready}`, `not_ready_reasons: tuple[str, ...]`, `companion_cache_root: str`, `engines_inspected_count: int`.
- Errors at `src/operator_tool/errors.py`:
  - `CompanionUnreachableError(Exception)`: attributes `host: str`, `port: int`, `reason: enum {connect_refused, auth_failed, host_key_mismatch, timeout, other}`, `underlying_exception_repr: str`. `remediation` attribute returns a one-line operator-friendly hint per `reason`.
  - `ContentHashMismatchError(Exception)`: attributes `engine_path: str`, `expected_sha256_hex: str`, `actual_sha256_hex: str`. `remediation` attribute returns "Re-run the cache build (`operator-tool build-cache --area ...`) to repopulate the affected engine.".
- A `SshSessionFactory` Protocol at `src/operator_tool/ssh_session.py`:
  ```python
  from __future__ import annotations

  from pathlib import PurePosixPath
  from typing import Protocol, runtime_checkable

  # CompanionAddress comes from _types.py; RemoteCommandResult is the return type of
  # SshSession.run. Postponed annotations mean neither is evaluated at import time.


  @runtime_checkable
  class SshSession(Protocol):
      def run(self, command: str, *, timeout_s: float) -> RemoteCommandResult: ...
      def file_exists(self, remote_path: PurePosixPath) -> bool: ...
      def list_dir(self, remote_path: PurePosixPath) -> list[str]: ...
      def close(self) -> None: ...


  @runtime_checkable
  class SshSessionFactory(Protocol):
      def open(self, address: CompanionAddress, *, timeout_s: float) -> SshSession: ...
  ```
  Concrete implementation `ParamikoSshSessionFactory` wraps `paramiko.SSHClient` with the documented host-key policy mapping (`strict` → `RejectPolicy`, `known_hosts` → `AutoAddPolicy` gated on `~/.ssh/known_hosts` presence, `reject_new` → `RejectPolicy` with explicit allowlist).
- A `RemoteSidecarVerifier` helper at `src/operator_tool/remote_sidecar_verifier.py`:
  - `verify(session: SshSession, engine_path: PurePosixPath) -> RemoteSidecarResult` — runs `sha256sum <engine_path>` over the SSH session, parses the first 64 hex chars, reads the sidecar file at `<engine_path>.sha256` via `session.run("cat ...")`, parses its 64 hex chars, compares case-insensitively. Returns `RemoteSidecarResult(matches: bool, expected_hex: str, actual_hex: str)`.
- Method flow for `verify_companion_ready`:
  1. Open the SSH session via `ssh_factory.open(companion_address, timeout_s=config.connect_timeout_s)`. On any paramiko/socket exception → catch and raise `CompanionUnreachableError`, mapping the underlying type to a `reason` enum value. Wrap all subsequent steps in a `try/finally` that always closes the session.
  2. Check 1 — `manifest_present`: `session.file_exists(companion_cache_root / manifest_filename)`.
  3. Check 2 — `engines_present`: `session.list_dir(companion_cache_root / "engines")` → set of filenames; compare against `config.expected_engines`. If `config.expected_engines` is empty → `engines_present = False`, `not_ready_reasons += ["expected_engines list empty in caller-supplied config"]`. Else `engines_present = set(config.expected_engines).issubset(listed_engines)`; if not, append `"engines_missing: <comma-list>"`.
  4. Check 3 — `content_hashes_pass`: for each engine in the intersection of `expected_engines` and `listed_engines`, call `sidecar_verifier.verify(session, companion_cache_root / "engines" / engine)`. If ANY result `matches == False` → raise `ContentHashMismatchError` with the first failing path. If all match → `content_hashes_pass = True`. Records `engines_inspected_count` regardless.
  5. Check 4 — `calibration_present`: `session.file_exists(companion_cache_root / calibration_filename)`.
  6. Compute `outcome`: `ready` iff all four booleans are `True`; `not_ready` otherwise.
  7. Emit log: INFO `kind="c12.companion.ready"` with the four flags + outcome on success; WARN `kind="c12.companion.degraded"` if any check failed without raising (i.e. `outcome=not_ready` due to a missing artifact, not a hash mismatch).
  8. Return the `ReadinessReport`.
- Composition-root factory at `src/gps_denied_onboard/runtime_root/c12_factory.py` extends T1's `OperatorToolServices` dataclass with a `companion_bringup: CompanionBringup` field. The factory `build_companion_bringup(config) -> CompanionBringup` constructs the paramiko-backed session factory + remote sidecar verifier + logger.

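A condensed sketch of steps 2 to 6, assuming duck-typed session and verifier objects shaped like the Protocols above, could look like the following. It is illustrative rather than the production `CompanionBringup`: opening the session, error mapping, and logging are left out; `run_checks`, `FakeSession`, and `SpyVerifier` are hypothetical names; and the `manifest_missing:` / `calibration_missing:` reason strings plus the `", "` separator for missing engines are assumptions, since the spec only fixes the `engines_missing:` prefix and the empty-list wording.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import PurePosixPath
from types import SimpleNamespace


class Outcome(Enum):
    READY = "ready"
    NOT_READY = "not_ready"


@dataclass(frozen=True)
class ReadinessReport:
    manifest_present: bool
    content_hashes_pass: bool
    engines_present: bool
    calibration_present: bool
    outcome: Outcome
    not_ready_reasons: tuple[str, ...]
    companion_cache_root: str
    engines_inspected_count: int


class ContentHashMismatchError(Exception):
    def __init__(self, engine_path: str, expected: str, actual: str) -> None:
        super().__init__(f"sidecar mismatch for {engine_path}")
        self.engine_path = engine_path
        self.expected_sha256_hex = expected
        self.actual_sha256_hex = actual


def run_checks(session, verifier, root: PurePosixPath, expected_engines: tuple[str, ...],
               manifest: str = "Manifest.json",
               calibration: str = "camera_calibration.json") -> ReadinessReport:
    """Checks 1-4 plus the outcome; closes `session` on every exit path."""
    reasons: list[str] = []
    try:
        manifest_present = session.file_exists(root / manifest)  # check 1
        if not manifest_present:
            reasons.append(f"manifest_missing: {manifest}")  # assumed wording
        listed = set(session.list_dir(root / "engines"))  # check 2
        present = [e for e in expected_engines if e in listed]
        missing = [e for e in expected_engines if e not in listed]
        if not expected_engines:
            reasons.append("expected_engines list empty in caller-supplied config")
        elif missing:
            reasons.append("engines_missing: " + ", ".join(missing))
        engines_present = bool(expected_engines) and not missing
        for engine in present:  # check 3: only engines that exist are hashed
            path = root / "engines" / engine
            result = verifier.verify(session, path)
            if not result.matches:  # the first mismatch aborts the whole call
                raise ContentHashMismatchError(str(path), result.expected_hex, result.actual_hex)
        calibration_present = session.file_exists(root / calibration)  # check 4
        if not calibration_present:
            reasons.append(f"calibration_missing: {calibration}")  # assumed wording
        flags = (manifest_present, True, engines_present, calibration_present)
        return ReadinessReport(
            *flags,
            outcome=Outcome.READY if all(flags) else Outcome.NOT_READY,
            not_ready_reasons=tuple(reasons),
            companion_cache_root=str(root),
            engines_inspected_count=len(present),
        )
    finally:
        session.close()


class FakeSession:
    """Hypothetical scripted SshSession double that counts close() calls."""
    def __init__(self, files: set[str], engines: list[str]) -> None:
        self.files, self.engines, self.close_calls = files, engines, 0
    def file_exists(self, path: PurePosixPath) -> bool:
        return str(path) in self.files
    def list_dir(self, path: PurePosixPath) -> list[str]:
        return list(self.engines)
    def close(self) -> None:
        self.close_calls += 1


class SpyVerifier:
    """Hypothetical verifier spy; every engine matches unless `matches=False`."""
    def __init__(self, matches: bool = True) -> None:
        self.matches, self.calls = matches, []
    def verify(self, session, path: PurePosixPath) -> SimpleNamespace:
        self.calls.append(str(path))
        actual = "a" * 64 if self.matches else "b" * 64
        return SimpleNamespace(matches=self.matches, expected_hex="a" * 64, actual_hex=actual)
```

Because every check runs inside one `try` block, the `finally` clause closes the session whether the function returns, raises `ContentHashMismatchError`, or propagates an unexpected `OSError` (AC-3, AC-7). Only engines in the intersection of expected and listed names reach the verifier, which is what the AC-9 spy asserts.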
## Scope

### Included

- `CompanionBringup` class with the single public method.
- The 2 DTOs (`CompanionAddress`, `ReadinessReport`) plus the `outcome` and `reason` enum types.
- The 2 error types (`CompanionUnreachableError`, `ContentHashMismatchError`) with `remediation` attributes.
- `SshSessionFactory` + `SshSession` Protocols.
- `ParamikoSshSessionFactory` + `ParamikoSshSession` concrete implementations.
- `RemoteSidecarVerifier` helper.
- Composition-root factory.
- Config schema extension on AZ-269's loader (`config.c12.companion_*` block).
- `verify-ready` subcommand wiring delegated to T1's CLI shell — this task ships the service class; T1's `cli.py` resolves it from the composition root.
- Conformance unit tests using a fake `SshSessionFactory` (no paramiko in unit tests) covering all 10 acceptance criteria.

### Excluded

- The `build_cache` orchestration that consumes `verify_companion_ready` (sibling T3).
- The actual SSH invocation of C10 on the companion (sibling T3).
- The takeoff-time content-hash verification on the airborne side (AZ-324).
- Engine compilation (AZ-321), descriptor generation (AZ-322), and Manifest writing (AZ-323): C10 owns all of these, and they run before this task is invoked.
- A SOCKS proxy or jump-host SSH path — direct SSH only this cycle.
- Telemetry exfiltration of operator workstation key material: the host key and private key never appear in log output (at most a key fingerprint hash does).

## Acceptance Criteria

**AC-1: All four artifacts present + sidecars verify → `outcome=ready`**
Given the companion's SSH is reachable, `Manifest.json` exists, all `expected_engines` exist, all sidecars verify, and the calibration file exists
When `verify_companion_ready(address)` is called
Then `ReadinessReport(manifest_present=True, content_hashes_pass=True, engines_present=True, calibration_present=True, outcome=ready, not_ready_reasons=())` is returned; ONE INFO log `kind="c12.companion.ready"` is emitted

**AC-2: Missing engine → `outcome=not_ready`**
Given `expected_engines=("dinov2_vpr_sm87_jp62_trt103_fp16.engine", "lightglue_sm87_jp62_trt103_fp16.engine")` and only the first exists on the companion
When `verify_companion_ready(address)` is called
Then `engines_present=False`; `not_ready_reasons` contains `"engines_missing: lightglue_sm87_jp62_trt103_fp16.engine"`; `outcome=not_ready`; ONE WARN log `kind="c12.companion.degraded"`; NO `ContentHashMismatchError` is raised

**AC-3: Sidecar mismatch → `ContentHashMismatchError`**
Given an engine file is present but its sidecar's hex digest does not match the engine's actual SHA-256
When `verify_companion_ready(address)` is called
Then `ContentHashMismatchError` is raised with `engine_path`, `expected_sha256_hex`, `actual_sha256_hex` populated; the SSH session is closed (`session.close()` is called in `finally`); ONE ERROR log `kind="c12.companion.hash.mismatch"` is emitted

**AC-4: SSH connection refused → `CompanionUnreachableError(reason=connect_refused)`**
Given the companion address is unreachable (TCP RST or no listener)
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=connect_refused, underlying_exception_repr="...")` is raised; the underlying paramiko/socket exception's repr is captured; ONE ERROR log `kind="c12.companion.unreachable"`; `remediation` attribute returns "Check companion power, USB/Ethernet cable, and `config.c12.companion_address`."

**AC-5: SSH auth failure → `CompanionUnreachableError(reason=auth_failed)`**
Given the companion is reachable but the SSH key is wrong or revoked
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=auth_failed, ...)` is raised; ERROR log `kind="c12.companion.unreachable"` with `reason="auth_failed"`; `remediation` attribute returns "Verify `config.c12.companion_ssh_keyfile` matches the public key in `~/.ssh/authorized_keys` on the companion."

**AC-6: Host key mismatch with `host_key_policy=strict` → `CompanionUnreachableError(reason=host_key_mismatch)`**
Given the companion's host key has changed and `config.c12.companion_host_key_policy = strict`
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=host_key_mismatch, ...)` is raised; ERROR log; `remediation` returns "Inspect `~/.ssh/known_hosts`; if the companion was reflashed, remove its old entry; otherwise treat as a security incident."

**AC-7: SSH session is always closed**
Given any of the four checks raises an unexpected exception (e.g. SFTP returns `OSError`)
When `verify_companion_ready(address)` is called
Then the exception propagates to the caller; `session.close()` was called exactly once before propagation (verifiable via spy on the fake `SshSession`); no socket descriptor leaks

**AC-8: Connect timeout → `CompanionUnreachableError(reason=timeout)`**
Given the companion address routes but never responds to TCP SYN within `config.c12.companion_connect_timeout_s`
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=timeout, ...)` is raised within `connect_timeout_s + 1.0 s` (allowing test jitter); ERROR log includes the configured timeout value

**AC-9: `engines_inspected_count` reflects what was actually checked**
Given a mix of present + missing engines (2 of 3 expected exist)
When `verify_companion_ready(address)` is called
Then `engines_inspected_count == 2`; the missing engine appears in `not_ready_reasons` but does NOT trigger a sidecar verify call (verifiable via spy)

**AC-10: `host_key_policy=reject_new` blocks first connection to a previously unseen host**
Given `config.c12.companion_host_key_policy = reject_new` and the companion is not in `~/.ssh/known_hosts`
When `verify_companion_ready(address)` is called
Then `CompanionUnreachableError(reason=host_key_mismatch, ...)` is raised; ERROR log; `remediation` returns "Add the companion to `~/.ssh/known_hosts` first via a manual `ssh-keyscan`, then retry."

## Non-Functional Requirements

**Performance**
- A successful `verify_companion_ready` call against a local-network companion (≤ 1 ms RTT) with 5 engines completes in ≤ 5 s wall-clock (dominated by 5 × `sha256sum` over engines totaling ~1 GB on the companion's NVMe).
- Connection-open phase ≤ 2 s p99 in normal conditions; the `connect_timeout_s` config caps the worst case at the configured value.

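A back-of-envelope reading of the first bullet, treating ~1 GB as exactly 1 GB in decimal units: the companion must read and hash at the left-hand rate on average, and if the connection-open phase uses its full 2 s p99 out of the same 5 s wall-clock budget, the required rate rises to the right-hand value.

$$
\frac{1\ \text{GB}}{5\ \text{s}} = 200\ \text{MB/s}
\qquad\text{and}\qquad
\frac{1\ \text{GB}}{(5 - 2)\ \text{s}} \approx 333\ \text{MB/s}
$$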

**Compatibility**
- paramiko per the project pin; no version override.
- Host-key policies map to paramiko's `MissingHostKeyPolicy` subclasses; if paramiko changes the API in a future minor version, this task's policy mapping is the only place to update.

**Reliability**
- The session is closed in `finally` on every code path (AC-7 covers).
- `sha256sum` invocation has a per-engine timeout (default 60 s, config-overrideable) so a hung companion does not hold the operator's CLI indefinitely.
- The four checks are sequential, not parallel, to keep the SSH session simple and ordering deterministic for log correlation.

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Fake `SshSessionFactory` returning a fake session where all four checks succeed | `ReadinessReport(outcome=ready)` + INFO log |
| AC-2 | Fake session with one missing engine | `outcome=not_ready`, `not_ready_reasons` lists the missing engine, no hash check on the missing one |
| AC-3 | Fake session where sidecar verifier returns `matches=False` | `ContentHashMismatchError` with populated attributes, session closed, ERROR log |
| AC-4 | `SshSessionFactory.open` raises `ConnectionRefusedError` | `CompanionUnreachableError(reason=connect_refused)`, ERROR log |
| AC-5 | `SshSessionFactory.open` raises `paramiko.AuthenticationException` | `CompanionUnreachableError(reason=auth_failed)`, ERROR log |
| AC-6 | `SshSessionFactory.open` raises `paramiko.BadHostKeyException` with `policy=strict` | `CompanionUnreachableError(reason=host_key_mismatch)`, ERROR log |
| AC-7 | Fake session whose `file_exists` raises `OSError` mid-flow | `OSError` propagates; `session.close()` called exactly once |
| AC-8 | `SshSessionFactory.open` raises `socket.timeout` after `connect_timeout_s` | `CompanionUnreachableError(reason=timeout)`, log includes timeout value |
| AC-9 | Fake session with mixed-presence engines, sidecar-verifier spy | `engines_inspected_count == count_of_present_expected`, sidecar verifier not called for missing engines |
| AC-10 | `host_key_policy=reject_new` + unknown host | `CompanionUnreachableError(reason=host_key_mismatch)` with `reject_new`-specific remediation text |
| NFR-perf-cold-call | Microbench against in-process fake session × 100 | p99 ≤ 50 ms for the orchestration overhead (excludes real SSH) |

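For the NFR-perf-cold-call row, a p99 over 100 runs can be computed with the standard library alone. This is a sketch: `p99_ms` is a hypothetical helper, and in the real test `fn` would build a fresh fake session and call `verify_companion_ready` on it. The `inclusive` quantile method keeps the estimate within the range of the observed samples.

```python
import statistics
import time
from typing import Callable


def p99_ms(fn: Callable[[], object], runs: int = 100) -> float:
    """Run `fn` `runs` times and return the 99th-percentile latency in milliseconds."""
    if runs < 2:
        raise ValueError("need at least two runs to compute a quantile")
    samples: list[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points p1..p99; index 98 is p99.
    return statistics.quantiles(samples, n=100, method="inclusive")[98]
```

The NFR check then reduces to `assert p99_ms(run_once) <= 50.0`, where `run_once` is the (hypothetical) fake-session call.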
## Constraints

- paramiko is the only allowed SSH library — no `subprocess.run("ssh ...")` shell-out (security: shell-injection surface; reliability: no structured result, only CLI text output to parse).
- `SshSessionFactory` is a Protocol, NOT a class — the concrete `ParamikoSshSessionFactory` is one implementation, allowing tests to inject fakes without monkey-patching paramiko.
- The `RemoteSidecarVerifier` does NOT pull the engine bytes back to the operator workstation — it runs `sha256sum` on the companion and parses the output. This avoids a multi-GB transfer per readiness check.
- The error families (`CompanionUnreachableError`, `ContentHashMismatchError`) are the canonical types; sibling tasks (T3 build_cache) MUST consume these and not redefine them.
- The host-key policy `auto_add_unknown` is intentionally NOT a supported value — silently accepting new host keys defeats the security model. The supported set is `strict | known_hosts | reject_new`; `known_hosts` requires the entry to already exist; `reject_new` is functionally identical to `strict` but with a clearer error message.
- This task does NOT cache SSH sessions — every `verify_companion_ready` call opens and closes a fresh session. Caching would complicate the failure model for marginal performance gain (the bottleneck is the per-engine `sha256sum` runs, not session establishment).

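Because `auto_add_unknown` must be rejected rather than silently accepted, the config loader could validate the policy string up front. A minimal sketch, assuming the value arrives as a plain string; `HostKeyPolicy` and `parse_host_key_policy` are hypothetical names, not AZ-269's actual loader API.

```python
from enum import Enum


class HostKeyPolicy(Enum):
    STRICT = "strict"
    KNOWN_HOSTS = "known_hosts"
    REJECT_NEW = "reject_new"


def parse_host_key_policy(raw: str) -> HostKeyPolicy:
    """Validate `config.c12.companion_host_key_policy`; unknown values fail loudly."""
    try:
        return HostKeyPolicy(raw)
    except ValueError:
        supported = " | ".join(p.value for p in HostKeyPolicy)
        # auto_add_unknown gets an explicit note, since it is the value most likely to be tried.
        hint = " (silently trusting new host keys is not supported)" if raw == "auto_add_unknown" else ""
        raise ValueError(
            f"unsupported companion_host_key_policy {raw!r}; expected {supported}{hint}"
        ) from None
```

Failing at config-load time surfaces the problem before any SSH connection is attempted.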
## Risks & Mitigation

**Risk 1: paramiko version drift breaks the host-key-policy mapping**
- *Risk*: A future paramiko minor release renames or removes `MissingHostKeyPolicy` subclasses; this task's mapping breaks silently in tests that don't exercise paramiko itself.
- *Mitigation*: A single integration test (marked `@pytest.mark.requires_paramiko`) constructs `ParamikoSshSessionFactory` with each policy value and asserts the resulting paramiko policy class name. Catches version drift on dependency upgrades.

**Risk 2: `sha256sum` is missing or behaves differently on the companion image**
- *Risk*: The companion is JetPack-based; if it ships without `coreutils`'s `sha256sum`, this task's verifier breaks at runtime.
- *Mitigation*: A composition-root health check at startup runs `sha256sum --version` over the SSH session and surfaces a clear `CompanionUnreachableError(reason=other, underlying_exception_repr="sha256sum not found")` if absent. JetPack base images include `coreutils` per ADR-005.

**Risk 3: Operator's `~/.ssh/known_hosts` has stale entries from prior bench runs**
- *Risk*: A reflashed companion exhibits AC-10 / AC-6 failures legitimately, but operators see the cryptic paramiko traceback if remediation hints are unclear.
- *Mitigation*: AC-6 / AC-10 require the `remediation` attribute on `CompanionUnreachableError` to mention `~/.ssh/known_hosts` explicitly. The CLI subcommand `verify-ready` (in T1) prints the remediation hint to stderr.

**Risk 4: Long-running `sha256sum` hangs the operator's CLI**
- *Risk*: A degraded companion NVMe causes `sha256sum` on a 200 MB engine to take minutes; the operator sees a hung command.
- *Mitigation*: `RemoteSidecarVerifier` enforces a per-engine timeout (default 60 s, config-overrideable). On timeout, the verifier raises `ContentHashMismatchError(actual_sha256_hex="<timeout>")` so the operator sees a clear failure and can investigate the disk.

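The verifier behaviour described under Outcome and the Risk 4 timeout mitigation can be combined in one small sketch. It assumes `session.run` returns an object with a `stdout` string (the real `RemoteCommandResult` fields are not specified in this task) and raises `TimeoutError` when the per-command timeout elapses (`socket.timeout` is an alias of `TimeoutError` since Python 3.10). Unlike the order listed under Outcome, it reads the sidecar before hashing so that a timeout error can still carry the expected digest. Paths go through `shlex.quote`, because a command string sent over an SSH exec channel is interpreted by the companion's shell. `ScriptedSession` is a hypothetical test double.

```python
from __future__ import annotations

import re
import shlex
from dataclasses import dataclass
from pathlib import PurePosixPath
from types import SimpleNamespace

# A standalone run of exactly 64 hex digits: the digest column of `sha256sum`
# output, or the digest stored in a sidecar file.
_HEX64 = re.compile(r"\b[0-9a-fA-F]{64}\b")


@dataclass(frozen=True)
class RemoteSidecarResult:
    matches: bool
    expected_hex: str
    actual_hex: str


class ContentHashMismatchError(Exception):
    def __init__(self, engine_path: str, expected: str, actual: str) -> None:
        super().__init__(f"sidecar mismatch for {engine_path}")
        self.engine_path = engine_path
        self.expected_sha256_hex = expected
        self.actual_sha256_hex = actual


def _digest(text: str) -> str:
    match = _HEX64.search(text)
    if match is None:
        raise ValueError(f"no SHA-256 digest found in {text[:80]!r}")
    return match.group(0).lower()  # lowercase so the comparison is case-insensitive


class RemoteSidecarVerifier:
    def __init__(self, timeout_s: float = 60.0) -> None:
        self.timeout_s = timeout_s  # per-engine cap; 60 s is the documented default

    def verify(self, session, engine_path: PurePosixPath) -> RemoteSidecarResult:
        sidecar = shlex.quote(f"{engine_path}.sha256")
        expected = _digest(session.run(f"cat {sidecar}", timeout_s=self.timeout_s).stdout)
        try:
            out = session.run(f"sha256sum {shlex.quote(str(engine_path))}",
                              timeout_s=self.timeout_s).stdout
        except TimeoutError:
            raise ContentHashMismatchError(str(engine_path), expected, "<timeout>") from None
        actual = _digest(out)
        return RemoteSidecarResult(actual == expected, expected, actual)


class ScriptedSession:
    """Test double: maps a command prefix to stdout text, or to an exception to raise."""

    def __init__(self, script: dict[str, object]) -> None:
        self.script, self.commands = script, []

    def run(self, command: str, *, timeout_s: float) -> SimpleNamespace:
        self.commands.append(command)
        for prefix, reply in self.script.items():
            if command.startswith(prefix):
                if isinstance(reply, BaseException):
                    raise reply
                return SimpleNamespace(stdout=reply)
        raise KeyError(command)
```

Only the 64-character digests cross the SSH link, which is what keeps a readiness check from transferring engine bytes to the workstation.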
## Runtime Completeness

- **Named capability**: pre-flight companion artifact verification per AC-NEW-1 + description.md § 2 `verify_companion_ready`.
- **Production code that must exist**: real `CompanionBringup` orchestrating real `ParamikoSshSessionFactory` + real `RemoteSidecarVerifier` (with real `sha256sum` over SSH); real config-driven SSH credentials + host-key policy + cache root.
- **Allowed external stubs**: tests MAY use a fake `SshSessionFactory` returning a fake `SshSession` whose `run`, `file_exists`, `list_dir` are scripted; production wiring uses paramiko + the real companion.
- **Unacceptable substitutes**: shelling out to `ssh ...` via `subprocess.run` (security + reliability); reading sidecars by pulling engine bytes back to the workstation (multi-GB per readiness check); `auto_add_unknown` host-key policy (security defeat); a "skip-verify" config flag (defeats AC-NEW-1).