# C7 ThermalState Publisher: jetson-stats / NVML 1 Hz Background

**Task**: AZ-302_c7_thermal_publisher
**Name**: C7 ThermalState Publisher
**Description**: Implement the 1 Hz background polling loop that reads jetson-stats / pynvml (CPU/GPU temperature, throttle bit, measured clock MHz), produces a lock-free atomic `ThermalState` snapshot, and exposes it via `InferenceRuntime.thermal_state()` for C4's D-CROSS-LATENCY-1 hybrid covariance-mode decision. Emits an FDR record on every throttle-state transition; emits a WARN log on first throttle entry and on telemetry unavailability; defaults `thermal_throttle_active = false` on `TelemetryUnavailableError`. Throttle-detection latency must be ≤ 1 s end-to-end so C4 reacts within 1 frame (C7-IT-02).
**Complexity**: 3 points
**Dependencies**: AZ-297_c7_runtime_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf
**Component**: c7_inference (epic AZ-249 / E-C7)
**Tracker**: AZ-302
**Epic**: AZ-249 (E-C7)

### Document Dependencies

- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md`: defines `ThermalState` and `thermal_state()`; produced by AZ-297.
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`: used to emit thermal-transition FDR records via `FdrClient.publish`.
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`: `kind="c7.thermal_transition"` record envelope.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md`: WARN log shape on throttle entry / telemetry loss.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md`: Config carries `thermal_poll_hz` and a fallback flag for telemetry unavailability behaviour.

## Problem

C4 Pose's D-CROSS-LATENCY-1 hybrid switches between two covariance modes (steady-state vs. JACOBIAN) based on whether the Jetson is thermal-throttling.
Without a publisher:

- C4 has no canonical `ThermalState` source, so the AC-NEW-5 throttle-reaction-within-1-frame guarantee has nothing to read.
- The F3 hot path threads cannot poll `jetson-stats` themselves (would race with `infer` on the GIL-protected critical section per description.md § 7).
- Thermal transitions are not recorded in the FDR, so post-flight tooling cannot correlate degraded poses with thermal events.
- `TelemetryUnavailableError` (jetson-stats hung or absent) has no documented degraded-mode behaviour; without explicit handling, the system would either crash or silently lie.

This task is the SINGLE owner of thermal telemetry across the companion process.

## Outcome

- A `ThermalStatePublisher` class at `src/gps_denied_onboard/components/c7_inference/thermal_publisher.py` with `start() / stop()` lifecycle and `read() -> ThermalState` accessor.
- A background thread (NORMAL priority, daemonic) polls jetson-stats / pynvml at `config.inference.thermal_poll_hz` Hz (default 1.0); each successful poll updates a single `_atomic_snapshot: ThermalState` field via simple assignment (Python's GIL covers the atomic store of a single object reference; documented in the implementation report).
- `read()` is the lock-free reader: returns the current `_atomic_snapshot`. F3 hot-path callers MAY call `read()` from any thread; the call is wait-free and ≤ 1 µs.
- Throttle-state transitions (the boolean `thermal_throttle_active` flips between two consecutive polls) emit:
  - One `kind="c7.thermal_transition"` FDR record via the constructor-injected `FdrClient` (AZ-273) carrying `previous_state`, `new_state`, `gpu_temp_c`, `cpu_temp_c`, `measured_clock_mhz`, `measured_at_ns`.
  - One WARN log on entry-to-throttle (NOT on exit-from-throttle; the exit is INFO).
- `TelemetryUnavailableError` (raised by jetson-stats / pynvml internals): the publisher catches it, sets `thermal_throttle_active = false` and `gpu_temp_c = None` / `cpu_temp_c = None`, emits ONE WARN log per occurrence (rate-limited to ≤ 1/sec), and continues polling. The runtime never silently lies: `ThermalState.is_telemetry_available` is set to `False` so C4 can choose to ignore the throttle bit.
- The publisher is a singleton constructed by the composition root and passed by reference into every `InferenceRuntime` strategy that needs `thermal_state()`. The strategies do NOT each construct their own publisher.
- The publisher is started during composition (after `FdrClient` is registered, before any consumer calls `read()`); stopped by the composition root's process-exit hook.

## Scope

### Included

- `ThermalStatePublisher` class with `__init__(config, fdr_client, logger)`, `start()`, `stop()`, `read() -> ThermalState`.
- Background polling thread: while `_running`, sleep `1.0 / config.inference.thermal_poll_hz` seconds, read `jetson-stats` (`from jtop import jtop` context manager OR `pynvml.nvmlDeviceGetTemperature` / `nvmlDeviceGetCurrentClocksThrottleReasons` direct calls; the implementation chooses based on availability), build a fresh `ThermalState`, atomically replace `_atomic_snapshot`.
- Source-selection logic: at `start()`, attempt to import `jtop` (jetson-stats); if unavailable, attempt `pynvml`; if both are unavailable, raise `TelemetryUnavailableError` from `start()` itself (composition aborts cleanly; the operator chooses to disable thermal-aware paths).
- Throttle-transition detection: compare the previous `_atomic_snapshot.thermal_throttle_active` with the new value; on a flip, emit the FDR record and log per Outcome.
- WARN log on first telemetry-unavailable in a window (rate-limited to 1/sec): `kind="c7.thermal.unavailable"`.
- INFO log on `start()` and `stop()`; INFO log on throttle-exit transitions; WARN log on throttle-entry transitions.
- `ThermalState` extension: add `is_telemetry_available: bool` to the DTO defined in AZ-297. (NOTE: this is a Protocol-touching change; AZ-297's contract MUST list this field and AZ-302 SHOULD coordinate with AZ-297 at decompose time. Documented in the AZ-297 contract's `## Test Cases`.)
- Constructor-injected `Clock` for testability: tests inject a fake clock that advances on demand; production wires `time.monotonic_ns`.
- The `read()` accessor is wait-free and re-entrant; it can be called from any thread including the F3 hot path.
- A `ThermalStatePublisher.is_running() -> bool` introspection accessor for the composition root and tests.

### Excluded

- AZ-297 InferenceRuntime Protocol: this task adds the `is_telemetry_available` field to the existing `ThermalState` DTO; the Protocol method `thermal_state()` is owned by the strategies that delegate to this publisher.
- AZ-298 / AZ-299 / AZ-300 strategies: they delegate to this publisher; they do NOT poll themselves.
- AZ-301 EngineGate: unrelated.
- C4 Pose's covariance-mode switching logic: owned by E-C4. This task PUBLISHES `ThermalState`; C4 consumes.
- Cooling controller / fan curve adjustments: out of scope; the companion process is read-only on thermal telemetry.
- Cross-flight thermal trend analysis: operator post-flight tooling owns it.
- A SECOND telemetry source for redundancy (two pynvml clients, etc.): out of scope this cycle.
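
The poll-body behaviour described in Outcome and Scope (reference swap, default-safe fallback, rate-limited WARN, transition detection) can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation: the background thread and `start()` / `stop()` lifecycle are omitted, `source`, `fdr_publish`, and `log` are hypothetical callables standing in for the jtop / pynvml reader, `FdrClient.publish`, and the logger, and the transition log message strings are invented for the example. The `ThermalState` field set is inferred from the fields named in this document; the authoritative DTO lives in the AZ-297 contract.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, replace
from typing import Callable, Optional

WARN_WINDOW_NS = 1_000_000_000  # at most one "unavailable" WARN per second


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's real exception type (assumed)."""


@dataclass(frozen=True)
class ThermalState:
    # Field set inferred from this spec; defaults are the default-safe values.
    thermal_throttle_active: bool = False
    is_telemetry_available: bool = False
    gpu_temp_c: Optional[float] = None
    cpu_temp_c: Optional[float] = None
    measured_clock_mhz: Optional[int] = None
    measured_at_ns: int = 0


class PublisherSketch:
    """Poll body and reader only; thread and lifecycle are left out."""

    def __init__(
        self,
        source: Callable[[], ThermalState],        # hypothetical telemetry reader
        fdr_publish: Callable[[str, dict], None],  # hypothetical FdrClient.publish stand-in
        log: Callable[[str, str], None],           # hypothetical (level, message) logger
        clock_ns: Callable[[], int] = time.monotonic_ns,
    ) -> None:
        self._source = source
        self._fdr_publish = fdr_publish
        self._log = log
        self._clock_ns = clock_ns
        self._atomic_snapshot = ThermalState()
        self._last_warn_ns: Optional[int] = None

    def read(self) -> ThermalState:
        # Single reference load; no lock on the read path.
        return self._atomic_snapshot

    def poll_once(self) -> None:
        now = self._clock_ns()
        previous = self._atomic_snapshot
        try:
            new = replace(self._source(), is_telemetry_available=True, measured_at_ns=now)
        except TelemetryUnavailableError:
            new = ThermalState(measured_at_ns=now)  # default-safe snapshot
            if self._last_warn_ns is None or now - self._last_warn_ns >= WARN_WINDOW_NS:
                self._last_warn_ns = now
                self._log("WARN", "c7.thermal.unavailable")
        # Build the immutable object first, then swap the reference in one store.
        self._atomic_snapshot = new
        if new.thermal_throttle_active != previous.thermal_throttle_active:
            self._fdr_publish("c7.thermal_transition", {
                "previous_state": previous.thermal_throttle_active,
                "new_state": new.thermal_throttle_active,
                "gpu_temp_c": new.gpu_temp_c,
                "cpu_temp_c": new.cpu_temp_c,
                "measured_clock_mhz": new.measured_clock_mhz,
                "measured_at_ns": new.measured_at_ns,
            })
            entering = new.thermal_throttle_active
            self._log("WARN" if entering else "INFO",
                      "thermal throttle entered" if entering else "thermal throttle exited")
```

Because the snapshot is a frozen dataclass replaced wholesale, a reader never observes a half-updated state. One consequence of reading the transition rule literally: losing telemetry while throttled flips the bit from True to False and therefore emits a transition record. Whether that is the intended behaviour is a question for the contract; the sketch does not settle it.
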
## Acceptance Criteria

**AC-1: read() is wait-free and returns the latest snapshot**
Given a started publisher with two completed polls
When `read()` is called from the F3 hot path
Then the call returns within 1 µs (microbenched); the returned `ThermalState` matches the most recent poll's data

**AC-2: throttle entry within 1 s**
Given a running publisher and a simulated jetson-stats spoof flipping `throttle_active` from False to True at time T
When the test waits 1 s and calls `read()`
Then the returned `ThermalState.thermal_throttle_active == True` (latency ≤ 1 s end-to-end at 1 Hz poll rate); ONE FDR `kind="c7.thermal_transition"` record was emitted with `new_state=True`; ONE WARN log was emitted

**AC-3: throttle exit within 1 s and INFO log**
Given a publisher currently in throttle and a simulated flip to False
When 1 s elapses and the test calls `read()`
Then `thermal_throttle_active == False`; ONE FDR record with `new_state=False`; ONE INFO log (NOT WARN) records the exit

**AC-4: telemetry unavailability sets is_telemetry_available=False, defaults throttle to False**
Given a started publisher whose jetson-stats source raises `TelemetryUnavailableError` on every poll
When the test waits 2 s and calls `read()`
Then `ThermalState.is_telemetry_available == False`, `thermal_throttle_active == False` (default-safe), `gpu_temp_c == None`, `cpu_temp_c == None`; the WARN log was emitted at most twice in 2 s (rate-limited to 1/sec)

**AC-5: cold-start with no source raises**
Given an environment where neither `jtop` nor `pynvml` is importable
When the publisher's `start()` is called
Then `TelemetryUnavailableError` is raised from `start()` itself; the publisher is in `is_running() == False` state; the composition root catches and either aborts startup or proceeds with thermal-aware paths disabled (decision is composition-root's, not this task's)

**AC-6: start/stop lifecycle is idempotent**
Given a publisher
When `start()` is called twice in succession
Then the second
call is a no-op (returns silently); the polling thread is NOT duplicated; `is_running() == True`
When `stop()` is called twice
Then the second `stop()` is a no-op; resources are NOT double-freed

**AC-7: poll thread does not interfere with infer hot path**
Given a started publisher polling at 1 Hz and an F3 hot-path benchmark running `infer` at 39 Hz aggregate (per description.md)
When the benchmark runs for 60 s
Then the F3 hot-path latency p95 is unchanged compared to a baseline without the publisher (the polling thread does not contend on the CUDA stream or any infer-critical resource); the publisher's poll p99 is ≤ 100 ms

**AC-8: FDR record envelope matches contract**
Given a throttle transition
When the FDR record is written
Then the record matches `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` shape with `kind="c7.thermal_transition"`, `producer_id="c7_inference.thermal"`, `payload` containing `{previous_state, new_state, gpu_temp_c, cpu_temp_c, measured_clock_mhz, measured_at_ns}` (deep-equal verified against the schema's payload type)

## Non-Functional Requirements

**Performance**

- `read()` is wait-free; p99 ≤ 1 µs.
- Poll p99 ≤ 100 ms (the polling thread executes at NORMAL priority; jetson-stats takes 30–80 ms typically).
- F3 hot-path latency unchanged when publisher is running (AC-7).

**Reliability**

- The publisher NEVER blocks the F3 hot path. The polling thread runs at NORMAL priority on a separate Python thread; `read()` is a single object-reference load (atomic under GIL).
- Telemetry-unavailability defaults are documented and tested (AC-4); the system never lies about throttle state.
- The publisher is one of the FIRST things the composition root starts (after `FdrClient` registration) and one of the LAST things it stops (before `FdrClient.stop`). Documented as a startup-order constraint.

**Concurrency**

- One polling thread per process. The publisher is a singleton.
- `read()` is re-entrant and called from the F3 hot path threads (consumers); they hold no locks the publisher is sensitive to.

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Microbench read() × 100k from a worker thread | p99 ≤ 1 µs; returned state matches latest poll |
| AC-2 | Spoof flip False→True at T; wait 1 s; read | throttle_active=True; one FDR + one WARN |
| AC-3 | Spoof flip True→False; wait 1 s; read | throttle_active=False; one FDR + one INFO |
| AC-4 | Spoof unavailable on every poll for 2 s | is_telemetry_available=False; default-safe; ≤ 2 WARN |
| AC-5 | Disable jtop + pynvml; start | TelemetryUnavailableError raised; is_running=False |
| AC-6 | start twice; stop twice | Idempotent; no double-spawn / double-free |
| AC-7 | F3 hot-path bench with publisher running | p95 unchanged vs. baseline; poll p99 ≤ 100 ms |
| AC-8 | Throttle transition FDR record shape | Matches schema deep-equal |
| NFR-perf-poll | Microbench poll body × 100 | p99 ≤ 100 ms |
| NFR-reliability-default-safe | Telemetry unavailable on first poll → read() | is_telemetry_available=False; throttle_active=False |

## Constraints

- One polling thread per process; publisher is a singleton.
- `read()` is wait-free; the implementation MUST NOT introduce any lock or condition variable on the read path.
- Source selection at `start()` time: `jtop` first, `pynvml` second; if both fail, raise `TelemetryUnavailableError` from `start()` (do NOT silently default-safe; that hides a misconfigured deployment).
- Once selected, the source does NOT swap mid-flight (e.g., if jtop becomes unavailable mid-flight, the publisher hits AC-4 default-safe behaviour but does NOT switch to pynvml; the operator must restart).
- WARN log on telemetry-unavailable is rate-limited to ≤ 1/sec via a simple monotonic-clock check; rate limit window is documented.
- `ThermalState.is_telemetry_available` is added to the AZ-297 DTO; this is a coordinated change documented in the AZ-297 contract's change log.
- This task introduces no new third-party dependencies: `jtop` (jetson-stats) and `pynvml` are both already pinned by the description.md key dependencies table.

## Risks & Mitigation

**Risk 1: GIL atomicity assumption is wrong on a future Python**

- *Risk*: Python 3.13+ free-threaded mode (PEP 703) removes the GIL; simple assignment can no longer rely on the GIL for atomicity.
- *Mitigation*: Implementation report documents the GIL assumption; the project's Python version is pinned and the implementation is correct under the pinned version. If/when the project moves to free-threaded mode, this task is revisited (would add an `atomic` library wrapper or a `threading.Lock` around the snapshot).

**Risk 2: Polling thread starves under thermal-throttle (the very condition it is observing)**

- *Risk*: Under heavy throttle, NORMAL-priority threads may not get scheduled; the publisher's poll latency then exceeds 1 s and the throttle-detection latency contract is violated.
- *Mitigation*: The polling thread is daemonic but at NORMAL priority; jetson-stats internal calls take 30–80 ms typically, well under the 1 s budget. AC-2 is the canonical test; if it fails on real hardware, the operator may bump the priority via config (out of scope this cycle).

**Risk 3: jetson-stats internal threading interferes with our polling thread**

- *Risk*: `jtop` runs its own background thread; concurrent access from our polling thread + an operator tool could corrupt jtop's state.
- *Mitigation*: This publisher is the SINGLE owner of `jtop` access in the companion process. Operator tools (C12) read FDR records, not the live publisher. Documented in the implementation report.
**Risk 4: FDR record emission rate spikes during rapid throttle oscillation**

- *Risk*: A pathological thermal scenario could oscillate at the poll rate, emitting one record per second sustained, affecting AC-NEW-3 segment-size budgets.
- *Mitigation*: 1 record per second sustained is well within the C13 writer's throughput budget (200 Hz aggregate per AZ-291); even worst-case oscillation is benign. No additional rate limit is needed for FDR transition records.

## Runtime Completeness

- **Named capability**: ThermalState publisher + lock-free atomic snapshot + 1 Hz background polling (architecture / E-C7 / AC-NEW-5 / D-CROSS-LATENCY-1).
- **Production code that must exist**: real `ThermalStatePublisher` class with real background thread, real `jtop` / `pynvml` poll, real lock-free `_atomic_snapshot` reference swap, real FDR record emission via the injected `FdrClient`.
- **Allowed external stubs**: tests MAY substitute a `FakeJtopSource` and a `FakeFdrClient` (AC-2..AC-8); production wiring uses real `jtop` + real AZ-273 `FdrClient`.
- **Unacceptable substitutes**: a polling loop that uses `time.sleep` without a real `Clock` injection (would break test determinism); a snapshot field guarded by a `threading.Lock` on the read path (would violate the wait-free read requirement under F3 hot path); a "warn but always default-safe" mode that never raises `TelemetryUnavailableError` from `start()` (would hide misconfigured deployments, exactly the failure mode AC-5 prevents).
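
The fail-loud source selection that AC-5 and the Constraints require (`jtop` first, `pynvml` second, raise if neither is present) could look roughly like the sketch below. The function name `select_source` and the injectable `is_importable` hook are assumptions made for illustration and testability; they do not come from the contracts. Note that `importlib.util.find_spec` only checks whether a module can be located. It does not prove that the import, or the telemetry daemon behind it, will actually work.

```python
import importlib.util
from typing import Callable, Optional, Sequence


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's real exception type (assumed)."""


def select_source(
    candidates: Sequence[str] = ("jtop", "pynvml"),
    is_importable: Optional[Callable[[str], bool]] = None,
) -> str:
    """Return the first locatable telemetry module name, in priority order.

    Raises TelemetryUnavailableError when none is found, instead of silently
    falling back to a default-safe mode.
    """
    if is_importable is None:
        # Default probe: the module can be located on the import path.
        def is_importable(name: str) -> bool:
            return importlib.util.find_spec(name) is not None

    for name in candidates:
        if is_importable(name):
            return name
    raise TelemetryUnavailableError(
        "no thermal telemetry source importable: " + ", ".join(candidates)
    )
```

A `start()` built on this would call `select_source()` once, keep the result for the lifetime of the process (no mid-flight swap), and let the exception propagate to the composition root.
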