[AZ-302] C7 ThermalStatePublisher — jtop/NVML 1 Hz background poller

Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton
background-thread publisher that polls jtop (preferred) or pynvml
(fallback) at config.thermal_poll_hz, stores an atomic ThermalState
snapshot, and emits c7.thermal_transition FDR records on every throttle
flip with a WARN log on entry and an INFO log on exit. Default-safe on
TelemetryUnavailableError per Invariant I-6 with a 1-Hz rate-limited
WARN.

Sources return a raw ThermalReading; the publisher stamps measured_at_ns
via its injected Clock so _JtopSource / _PynvmlSource stay clean of
direct time.* calls (Invariant 2). _poll_once is the deterministic test
seam — start() spawns the production thread.

- c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS
- [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject
- 14 unit tests (AC-1..AC-6, AC-8, NFR-default-safe, structural)
  green; AC-7 / AC-1 microbench / NFR-perf-poll Tier-2 deferred
- full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-12 10:33:37 +03:00
parent 59f56c032f
commit 49a67f770d
9 changed files with 1175 additions and 2 deletions
@@ -1,177 +0,0 @@
# C7 ThermalState Publisher — jetson-stats / NVML 1 Hz Background
**Task**: AZ-302_c7_thermal_publisher
**Name**: C7 ThermalState Publisher
**Description**: Implement the 1 Hz background polling loop that reads jetson-stats / pynvml (CPU/GPU temperature, throttle bit, measured clock MHz), produces a lock-free atomic `ThermalState` snapshot, and exposes it via `InferenceRuntime.thermal_state()` for C4's D-CROSS-LATENCY-1 hybrid covariance-mode decision. The publisher emits an FDR record on every throttle-state transition, emits a WARN log on first throttle entry and on telemetry unavailability, and defaults `thermal_throttle_active = false` on `TelemetryUnavailableError`. Throttle-detection latency must be ≤ 1 s end-to-end so C4 reacts within 1 frame (C7-IT-02).
**Complexity**: 3 points
**Dependencies**: AZ-297_c7_runtime_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf
**Component**: c7_inference (epic AZ-249 / E-C7)
**Tracker**: AZ-302
**Epic**: AZ-249 (E-C7)
### Document Dependencies
- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — defines `ThermalState` and `thermal_state()`; produced by AZ-297.
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — used to emit thermal-transition FDR records via `FdrClient.publish`.
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — `kind="c7.thermal_transition"` record envelope.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — WARN log shape on throttle entry / telemetry loss.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config carries `thermal_poll_hz` and a fallback flag for telemetry unavailability behaviour.
## Problem
C4 Pose's D-CROSS-LATENCY-1 hybrid switches between two covariance modes (steady-state vs. JACOBIAN) based on whether the Jetson is thermal-throttling. Without a publisher:
- C4 has no canonical `ThermalState` source — the AC-NEW-5 throttle-reaction-within-1-frame guarantee has nothing to read.
- The F3 hot path threads cannot poll `jetson-stats` themselves (would race with `infer` on the GIL-protected critical section per description.md § 7).
- Thermal transitions are not recorded in the FDR — post-flight tooling cannot correlate degraded poses with thermal events.
- `TelemetryUnavailableError` (jetson-stats hung or absent) has no documented degraded-mode behaviour; without explicit handling, the system would either crash or silently lie.
This task is the SINGLE owner of thermal telemetry across the companion process.
## Outcome
- A `ThermalStatePublisher` class at `src/gps_denied_onboard/components/c7_inference/thermal_publisher.py` with `start() / stop()` lifecycle and `read() -> ThermalState` accessor.
- A background thread (NORMAL priority, daemonic) polls jetson-stats / pynvml at `config.inference.thermal_poll_hz` Hz (default 1.0); each successful poll updates a single `_atomic_snapshot: ThermalState` field via simple-assignment (Python's GIL covers atomic-store of a single object reference — documented in the implementation report).
- `read()` is the lock-free reader: returns the current `_atomic_snapshot`. F3 hot-path callers MAY call `read()` from any thread; the call is wait-free and ≤ 1 µs.
- Throttle-state transitions (the boolean `thermal_throttle_active` flips between two consecutive polls) emit:
  - One `kind="c7.thermal_transition"` FDR record via the constructor-injected `FdrClient` (AZ-273) carrying `previous_state`, `new_state`, `gpu_temp_c`, `cpu_temp_c`, `measured_clock_mhz`, `measured_at_ns`.
  - One WARN log on entry-to-throttle (NOT on exit-from-throttle — the exit is INFO).
- `TelemetryUnavailableError` (raised by jetson-stats / pynvml internals): the publisher catches, sets `thermal_throttle_active = false` and `gpu_temp_c = None / cpu_temp_c = None`, emits ONE WARN log per occurrence (rate-limited to ≤ 1/sec), and continues polling. The runtime never silently lies — `ThermalState.is_telemetry_available` is set to `False` so C4 can choose to ignore the throttle bit.
- The publisher is a singleton constructed by the composition root and passed by reference into every `InferenceRuntime` strategy that needs `thermal_state()`. The strategies do NOT each construct their own publisher.
- The publisher is started during composition (after `FdrClient` is registered, before any consumer calls `read()`); stopped by the composition root's process-exit hook.
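A minimal sketch of the reference-swap snapshot described in this section, assuming `ThermalState` is an immutable dataclass. The field types and the `SnapshotHolder` name are illustrative assumptions; the real DTO is defined by the AZ-297 contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThermalState:
    # Field names follow this task; the types are assumptions.
    thermal_throttle_active: bool
    is_telemetry_available: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[float]
    measured_at_ns: int


class SnapshotHolder:
    """Hypothetical holder: one writer (the poll thread), many readers."""

    def __init__(self, initial: ThermalState) -> None:
        self._atomic_snapshot = initial

    def publish(self, state: ThermalState) -> None:
        # Rebinding one attribute is a single reference store under the GIL.
        self._atomic_snapshot = state

    def read(self) -> ThermalState:
        # No lock: a reader gets either the old or the new immutable object.
        return self._atomic_snapshot
```

Because each snapshot is frozen, a reader can never observe a half-updated state; the only thing that changes is which object the attribute points at.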
## Scope
### Included
- `ThermalStatePublisher` class with `__init__(config, fdr_client, logger)`, `start()`, `stop()`, `read() -> ThermalState`.
- Background polling thread: while `_running`, sleep `1.0 / config.inference.thermal_poll_hz` seconds, read `jetson-stats` (`from jtop import jtop` context manager OR `pynvml.nvmlDeviceGetTemperature` / `nvmlDeviceGetCurrentClocksThrottleReasons` direct calls — implementation chooses based on availability), build a fresh `ThermalState`, atomically replace `_atomic_snapshot`.
- Source-selection logic: at `start()`, attempt to import `jtop` (jetson-stats); if unavailable, attempt `pynvml`; if both unavailable, raise `TelemetryUnavailableError` from `start()` itself (composition aborts cleanly — operator chooses to disable thermal-aware paths).
- Throttle-transition detection: compare the previous `_atomic_snapshot.thermal_throttle_active` against the new value; on a flip, emit the FDR record and log per Outcome.
- WARN log on first telemetry-unavailable in a window (rate-limited to 1/sec): `kind="c7.thermal.unavailable"`.
- INFO log on `start()` and `stop()`; INFO log on throttle-exit transitions; WARN log on throttle-entry transitions.
- `ThermalState` extension: add `is_telemetry_available: bool` to the DTO defined in AZ-297. (NOTE: this is a Protocol-touching change; AZ-297's contract MUST list this field and AZ-302 SHOULD coordinate with AZ-297 at decompose time. Documented in the AZ-297 contract's `## Test Cases`.)
- Constructor-injected `Clock` for testability — tests inject a fake clock that advances on demand; production wires `time.monotonic_ns`.
- The `read()` accessor is wait-free and re-entrant; can be called from any thread including the F3 hot path.
- A `ThermalStatePublisher.is_running() -> bool` introspection accessor for the composition root and tests.
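The poll body and transition detection listed above can be sketched as one deterministic step with no thread and no sleep. This is an illustration under stated assumptions, not the shipped implementation: `PollStep`, the `Reading` tuple, the log messages, and the `publish(kind, payload)` signature on the FDR client are all hypothetical, and the rate-limited unavailability WARN is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


class TelemetryUnavailableError(Exception):
    """Stand-in for the project's exception of the same name."""


@dataclass(frozen=True)
class ThermalState:
    # Field names follow this task; the types are assumptions.
    thermal_throttle_active: bool
    is_telemetry_available: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[float]
    measured_at_ns: int


# Hypothetical raw reading: (throttle_active, gpu_temp_c, cpu_temp_c, clock_mhz).
Reading = Tuple[bool, float, float, float]


class PollStep:
    """Hypothetical single poll iteration: no thread and no sleep."""

    def __init__(self, source: Callable[[], Reading], fdr_client, logger,
                 clock_ns: Callable[[], int]) -> None:
        self._source = source
        self._fdr = fdr_client      # assumed to expose publish(kind, payload)
        self._log = logger          # anything with warning() / info()
        self._clock_ns = clock_ns
        # Default-safe state until the first successful poll.
        self._atomic_snapshot = ThermalState(False, False, None, None, None, clock_ns())

    def read(self) -> ThermalState:
        return self._atomic_snapshot

    def poll_once(self) -> None:
        previous = self._atomic_snapshot
        now_ns = self._clock_ns()
        try:
            throttle, gpu, cpu, mhz = self._source()
            new = ThermalState(throttle, True, gpu, cpu, mhz, now_ns)
        except TelemetryUnavailableError:
            # Default-safe: throttle False, readings None, availability cleared.
            new = ThermalState(False, False, None, None, None, now_ns)
        self._atomic_snapshot = new  # single reference swap
        if new.thermal_throttle_active == previous.thermal_throttle_active:
            return
        self._fdr.publish("c7.thermal_transition", {
            "previous_state": previous.thermal_throttle_active,
            "new_state": new.thermal_throttle_active,
            "gpu_temp_c": new.gpu_temp_c,
            "cpu_temp_c": new.cpu_temp_c,
            "measured_clock_mhz": new.measured_clock_mhz,
            "measured_at_ns": new.measured_at_ns,
        })
        if new.thermal_throttle_active:
            self._log.warning("thermal throttle entered")
        else:
            self._log.info("thermal throttle exited")
```

Note that in this sketch a default-safe fallback while throttled also counts as a True to False flip and emits a transition record; whether that is desired is not stated in this task.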
### Excluded
- AZ-297 InferenceRuntime Protocol — this task adds the `is_telemetry_available` field to the existing `ThermalState` DTO; the Protocol method `thermal_state()` is owned by the strategies that delegate to this publisher.
- AZ-298 / AZ-299 / AZ-300 strategies — they delegate to this publisher; they do NOT poll themselves.
- AZ-301 EngineGate — unrelated.
- C4 Pose's covariance-mode switching logic — owned by E-C4. This task PUBLISHES `ThermalState`; C4 consumes.
- Cooling controller / fan curve adjustments — out of scope; the companion process is read-only on thermal telemetry.
- Cross-flight thermal trend analysis — operator post-flight tooling owns it.
- A SECOND telemetry source for redundancy (two pynvml clients, etc.) — out of scope this cycle.
## Acceptance Criteria
**AC-1: read() is wait-free and returns the latest snapshot**
Given a started publisher with two completed polls
When `read()` is called from the F3 hot path
Then the call returns within 1 µs (microbenched); the returned `ThermalState` matches the most recent poll's data
**AC-2: throttle entry within 1 s**
Given a running publisher and a simulated jetson-stats spoof flipping `throttle_active` from False to True at time T
When the test waits 1 s and calls `read()`
Then the returned `ThermalState.thermal_throttle_active == True` (latency ≤ 1 s end-to-end at 1 Hz poll rate); ONE FDR `kind="c7.thermal_transition"` record was emitted with `new_state=True`; ONE WARN log was emitted
**AC-3: throttle exit within 1 s and INFO log**
Given a publisher currently in throttle and a simulated flip to False
When 1 s elapses and the test calls `read()`
Then `thermal_throttle_active == False`; ONE FDR record with `new_state=False`; ONE INFO log (NOT WARN) records the exit
**AC-4: telemetry unavailability sets is_telemetry_available=False, defaults throttle to False**
Given a started publisher whose jetson-stats source raises `TelemetryUnavailableError` on every poll
When the test waits 2 s and calls `read()`
Then `ThermalState.is_telemetry_available == False`, `thermal_throttle_active == False` (default-safe), `gpu_temp_c == None`, `cpu_temp_c == None`; the WARN log was emitted at most twice in 2 s (rate-limited to 1/sec)
**AC-5: cold-start with no source raises**
Given an environment where neither `jtop` nor `pynvml` is importable
When the publisher's `start()` is called
Then `TelemetryUnavailableError` is raised from `start()` itself; the publisher is in `is_running() == False` state; the composition root catches and either aborts startup or proceeds with thermal-aware paths disabled (decision is composition-root's, not this task's)
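The cold-start rule in AC-5 can be illustrated with a small import probe. This is a sketch only: `select_source` and its injectable `importer` parameter are hypothetical names introduced so the behaviour can be exercised without either package installed.

```python
import importlib
from typing import Callable, Sequence


class TelemetryUnavailableError(Exception):
    """Stand-in for the project's exception of the same name."""


def select_source(
    candidates: Sequence[str] = ("jtop", "pynvml"),
    importer: Callable[[str], object] = importlib.import_module,
) -> str:
    """Return the first importable module name, in preference order."""
    for name in candidates:
        try:
            importer(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError.
            continue
        return name
    # Nothing importable: fail loudly instead of silently defaulting safe.
    raise TelemetryUnavailableError(
        f"none of {list(candidates)} is importable"
    )
```

Raising here rather than falling back keeps a misconfigured deployment visible to the composition root, which then decides whether to abort or disable thermal-aware paths.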
**AC-6: start/stop lifecycle is idempotent**
Given a publisher
When `start()` is called twice in succession
Then the second call is a no-op (returns silently); the polling thread is NOT duplicated; `is_running() == True`
When `stop()` is called twice
Then the second `stop()` is a no-op; resources are NOT double-freed
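One way to get the idempotent lifecycle in AC-6 is sketched below. The `Lifecycle` class, the thread name, and the use of `threading.Event.wait` as the interruptible inter-poll delay are assumptions for illustration; the sketch also assumes `start()` and `stop()` are only called from one thread (the composition root).

```python
import threading
from typing import Callable, Optional


class Lifecycle:
    """Hypothetical start/stop wrapper around a poll callback."""

    def __init__(self, poll_once: Callable[[], None], interval_s: float) -> None:
        self._poll_once = poll_once
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def is_running(self) -> bool:
        return self._thread is not None and self._thread.is_alive()

    def start(self) -> None:
        if self.is_running():
            return  # second start() is a no-op; no duplicate thread
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._run, name="c7-thermal-poll", daemon=True
        )
        self._thread.start()

    def stop(self) -> None:
        thread, self._thread = self._thread, None
        if thread is None:
            return  # second stop() is a no-op
        self._stop.set()
        thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            self._poll_once()
            # Event.wait returns early when stop() is called.
            self._stop.wait(self._interval_s)
```

Using an event for the delay means `stop()` does not have to wait out a full poll interval before the thread exits.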
**AC-7: poll thread does not interfere with infer hot path**
Given a started publisher polling at 1 Hz and an F3 hot-path benchmark running `infer` at 39 Hz aggregate (per description.md)
When the benchmark runs for 60 s
Then the F3 hot-path latency p95 is unchanged compared to a baseline without the publisher (the polling thread does not contend on the CUDA stream or any infer-critical resource); the publisher's poll p99 is ≤ 100 ms
**AC-8: FDR record envelope matches contract**
Given a throttle transition
When the FDR record is written
Then the record matches `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` shape with `kind="c7.thermal_transition"`, `producer_id="c7_inference.thermal"`, `payload` containing `{previous_state, new_state, gpu_temp_c, cpu_temp_c, measured_clock_mhz, measured_at_ns}` (deep-equal verified against the schema's payload type)
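As an illustration of the AC-8 shape, the record could be assembled as below. Only `kind`, `producer_id`, and the six payload keys come from this task; the `build_transition_record` helper is hypothetical, and any further envelope fields required by `fdr_record_schema.md` are not shown because they are not listed here.

```python
from typing import Any, Dict, Optional

# Payload keys as listed in AC-8.
PAYLOAD_KEYS = frozenset({
    "previous_state", "new_state", "gpu_temp_c",
    "cpu_temp_c", "measured_clock_mhz", "measured_at_ns",
})


def build_transition_record(
    previous_state: bool,
    new_state: bool,
    gpu_temp_c: Optional[float],
    cpu_temp_c: Optional[float],
    measured_clock_mhz: Optional[float],
    measured_at_ns: int,
) -> Dict[str, Any]:
    """Hypothetical helper returning the transition record as a plain dict."""
    payload = {
        "previous_state": previous_state,
        "new_state": new_state,
        "gpu_temp_c": gpu_temp_c,
        "cpu_temp_c": cpu_temp_c,
        "measured_clock_mhz": measured_clock_mhz,
        "measured_at_ns": measured_at_ns,
    }
    return {
        "kind": "c7.thermal_transition",
        "producer_id": "c7_inference.thermal",
        "payload": payload,
    }
```

A unit test can then compare `set(record["payload"])` against `PAYLOAD_KEYS` to catch a missing or extra key.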
## Non-Functional Requirements
**Performance**
- `read()` is wait-free; p99 ≤ 1 µs.
- Poll p99 ≤ 100 ms (the polling thread executes at NORMAL priority; jetson-stats takes 30–80 ms typically).
- F3 hot-path latency unchanged when publisher is running (AC-7).
**Reliability**
- The publisher NEVER blocks the F3 hot path. The polling thread runs at NORMAL priority on a separate Python thread; `read()` is a single object-reference load (atomic under GIL).
- Telemetry-unavailability defaults are documented and tested (AC-4); the system never lies about throttle state.
- The publisher is one of the FIRST things the composition root starts (after `FdrClient` registration) and one of the LAST things it stops (before `FdrClient.stop`). Documented as a startup-order constraint.
**Concurrency**
- One polling thread per process. The publisher is a singleton.
- `read()` is re-entrant and called from the F3 hot path threads (consumers); they hold no locks the publisher is sensitive to.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Microbench read() × 100k from a worker thread | p99 ≤ 1 µs; returned state matches latest poll |
| AC-2 | Spoof flip False→True at T; wait 1 s; read | throttle_active=True; one FDR + one WARN |
| AC-3 | Spoof flip True→False; wait 1 s; read | throttle_active=False; one FDR + one INFO |
| AC-4 | Spoof unavailable on every poll for 2 s | is_telemetry_available=False; default-safe; ≤ 2 WARN |
| AC-5 | Disable jtop + pynvml; start | TelemetryUnavailableError raised; is_running=False |
| AC-6 | start twice; stop twice | Idempotent; no double-spawn / double-free |
| AC-7 | F3 hot-path bench with publisher running | p95 unchanged vs. baseline; poll p99 ≤ 100 ms |
| AC-8 | Throttle transition FDR record shape | Matches schema deep-equal |
| NFR-perf-poll | Microbench poll body × 100 | p99 ≤ 100 ms |
| NFR-reliability-default-safe | Telemetry unavailable on first poll → read() | is_telemetry_available=False; throttle_active=False |
## Constraints
- One polling thread per process; publisher is a singleton.
- `read()` is wait-free; the implementation MUST NOT introduce any lock or condition variable on the read path.
- Source selection at `start()` time: `jtop` first, `pynvml` second; if both fail, raise `TelemetryUnavailableError` from `start()` (do NOT silently default-safe — that hides a misconfigured deployment).
- Once selected, the source does NOT swap mid-flight (e.g., if jtop becomes unavailable mid-flight, the publisher hits AC-4 default-safe behaviour but does NOT switch to pynvml — operator must restart).
- WARN log on telemetry-unavailable is rate-limited to ≤ 1/sec via a simple monotonic-clock check; rate limit window is documented.
- `ThermalState.is_telemetry_available` is added to the AZ-297 DTO; this is a coordinated change documented in the AZ-297 contract's change log.
- This task introduces no new third-party dependencies — `jtop` (jetson-stats) and `pynvml` are both already pinned by the description.md key dependencies table.
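The rate limit on the telemetry-unavailable WARN can be sketched as a simple monotonic-clock check. `WarnRateLimiter` and its parameters are hypothetical names; the clock is injected so a test can advance time on demand.

```python
from typing import Callable, Optional


class WarnRateLimiter:
    """Hypothetical limiter: allows at most one event per interval."""

    def __init__(self, clock_ns: Callable[[], int],
                 min_interval_ns: int = 1_000_000_000) -> None:
        self._clock_ns = clock_ns
        self._min_interval_ns = min_interval_ns
        self._last_ns: Optional[int] = None

    def allow(self) -> bool:
        """Return True if a WARN may be emitted now, and record the time."""
        now = self._clock_ns()
        if self._last_ns is not None and now - self._last_ns < self._min_interval_ns:
            return False  # still inside the window; suppress
        self._last_ns = now
        return True
```

In production the clock would be `time.monotonic_ns`; the poll loop would call `allow()` before each unavailability WARN and skip the log when it returns `False`.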
## Risks & Mitigation
**Risk 1: GIL atomicity assumption is wrong on a future Python**
- *Risk*: Python 3.13+ free-threaded mode (PEP 703) removes the GIL; simple-assignment is no longer atomic.
- *Mitigation*: Implementation report documents the GIL assumption; the project's Python version is pinned and the implementation is correct under the pinned version. If/when the project moves to free-threaded mode, this task is revisited (would add an `atomic` library wrapper or threading.Lock around the snapshot).
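For reference, the lock-based fallback mentioned in the mitigation might look like the sketch below. It is not part of this task: `LockedSnapshot` and `gil_enabled` are hypothetical names, `sys._is_gil_enabled` is assumed to be the CPython 3.13+ probe (older interpreters lack it and always have a GIL), and a lock on the read path would conflict with the wait-free constraint, so this only shows what a revisit would need to weigh.

```python
import sys
import threading
from typing import Generic, TypeVar

T = TypeVar("T")


def gil_enabled() -> bool:
    """Best-effort check; interpreters without the probe have a GIL."""
    probe = getattr(sys, "_is_gil_enabled", None)
    return True if probe is None else bool(probe())


class LockedSnapshot(Generic[T]):
    """Hypothetical fallback holder for builds where the GIL is disabled."""

    def __init__(self, initial: T) -> None:
        self._lock = threading.Lock()
        self._value = initial

    def publish(self, value: T) -> None:
        with self._lock:
            self._value = value

    def read(self) -> T:
        # Not wait-free: a reader may briefly block behind the writer.
        with self._lock:
            return self._value
```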
**Risk 2: Polling thread starves under thermal-throttle (the very condition it is observing)**
- *Risk*: Under heavy throttle, NORMAL-priority threads may not get scheduled; the publisher's poll latency then exceeds 1 s and the throttle-detection latency contract is violated.
- *Mitigation*: The polling thread is daemonic but at NORMAL priority; jetson-stats internal calls take 30–80 ms typically, well under the 1 s budget. AC-2 is the canonical test; if it fails on real hardware, the operator may bump the priority via config (out of scope this cycle).
**Risk 3: jetson-stats internal threading interferes with our polling thread**
- *Risk*: `jtop` runs its own background thread; concurrent access from our polling thread + an operator tool could corrupt jtop's state.
- *Mitigation*: This publisher is the SINGLE owner of `jtop` access in the companion process. Operator tools (C12) read FDR records, not the live publisher. Documented in the implementation report.
**Risk 4: FDR record emission rate spikes during rapid throttle oscillation**
- *Risk*: A pathological thermal scenario could oscillate at the poll rate, emitting one record per second sustained — affecting AC-NEW-3 segment-size budgets.
- *Mitigation*: 1 record per second sustained is well within the C13 writer's throughput budget (200 Hz aggregate per AZ-291); even worst-case oscillation is benign. No additional rate limit is needed for FDR transition records.
## Runtime Completeness
- **Named capability**: ThermalState publisher + lock-free atomic snapshot + 1 Hz background polling (architecture / E-C7 / AC-NEW-5 / D-CROSS-LATENCY-1).
- **Production code that must exist**: real `ThermalStatePublisher` class with real background thread, real `jtop` / `pynvml` poll, real lock-free `_atomic_snapshot` reference swap, real FDR record emission via the injected `FdrClient`.
- **Allowed external stubs**: tests MAY substitute a `FakeJtopSource` and a `FakeFdrClient` (AC-2..AC-8); production wiring uses real `jtop` + real AZ-273 `FdrClient`.
- **Unacceptable substitutes**: a polling loop that uses `time.sleep` without a real `Clock` injection (would break test determinism); a snapshot field guarded by a `threading.Lock` on the read path (would violate the wait-free read requirement under F3 hot path); a "warn but always default-safe" mode that never raises `TelemetryUnavailableError` from `start()` (would hide misconfigured deployments — exactly the failure mode AC-5 prevents).