[AZ-302] C7 ThermalStatePublisher — jtop/NVML 1 Hz background poller

Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton
background-thread publisher that polls jtop (preferred) or pynvml
(fallback) at config.thermal_poll_hz, stores an atomic ThermalState
snapshot, and emits c7.thermal_transition FDR records on every throttle
flip, with a WARN log on throttle entry and an INFO log on exit. On
TelemetryUnavailableError it is default-safe per Invariant I-6 and logs
a WARN rate-limited to 1 Hz.
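
A minimal sketch of the flip detection described above, assuming a
ThermalState shape whose fields mirror the c7.thermal_transition payload
keys. The `throttled` field name and the `transition_record` helper are
hypothetical and not taken from this commit; the default-safe path is
left out because the commit does not state what value it resolves to.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThermalState:
    # Assumed shape: field names follow the c7.thermal_transition payload keys.
    throttled: bool
    gpu_temp_c: float
    cpu_temp_c: float
    measured_clock_mhz: int
    measured_at_ns: int


def transition_record(
    prev: ThermalState, new: ThermalState
) -> Optional[tuple[int, dict]]:
    """Return (log_level, payload) on a throttle flip, or None in steady state."""
    if prev.throttled == new.throttled:
        return None  # steady-state polls emit nothing
    # WARN when entering throttle, INFO when leaving it.
    level = logging.WARNING if new.throttled else logging.INFO
    payload = {
        "previous_state": prev.throttled,
        "new_state": new.throttled,
        # Temperatures and clock come from the poll that observed the flip.
        "gpu_temp_c": new.gpu_temp_c,
        "cpu_temp_c": new.cpu_temp_c,
        "measured_clock_mhz": new.measured_clock_mhz,
        "measured_at_ns": new.measured_at_ns,
    }
    return level, payload
```

Keeping the decision in a pure function means the one-record-per-flip
rule can be tested without a thread or a real telemetry source.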

Sources return a raw ThermalReading; the publisher stamps measured_at_ns
via its injected Clock so _JtopSource / _PynvmlSource stay clean of
direct time.* calls (Invariant 2). _poll_once is the deterministic test
seam — start() spawns the production thread.
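
The clock-injection seam can be sketched roughly as follows. This is an
illustration under assumptions, not the code in this commit: the real
Clock interface is not shown here, so the sketch takes a plain `now_ns`
callable, and `PublisherSketch`, `ThermalSnapshot`, and the `read`
callable are hypothetical names.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ThermalReading:
    # Raw source output: carries no timestamp.
    throttled: bool
    gpu_temp_c: float


@dataclass(frozen=True)
class ThermalSnapshot:
    reading: ThermalReading
    measured_at_ns: int


class PublisherSketch:
    def __init__(
        self,
        read: Callable[[], ThermalReading],
        now_ns: Callable[[], int],  # stand-in for the injected Clock
        poll_hz: float = 1.0,
    ) -> None:
        self._read = read
        self._now_ns = now_ns
        self._period_s = 1.0 / poll_hz
        self._snapshot: Optional[ThermalSnapshot] = None
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def _poll_once(self) -> None:
        """One deterministic poll: read, stamp, publish."""
        reading = self._read()
        # A single reference assignment swaps in an immutable snapshot,
        # so readers see either the old or the new one, never a mix.
        self._snapshot = ThermalSnapshot(reading, self._now_ns())

    def thermal_state(self) -> Optional[ThermalSnapshot]:
        return self._snapshot

    def start(self) -> None:
        def loop() -> None:
            # Event.wait returns False on timeout, True once stop() is called.
            while not self._stop.wait(self._period_s):
                self._poll_once()

        self._thread = threading.Thread(target=loop, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

A test can drive `_poll_once()` directly with a fake clock, for example
`now_ns=iter([111, 222]).__next__`, and assert on the stamped snapshots
without starting the thread.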

- c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS
- [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject
- 14 unit tests green (AC-1..AC-6, AC-8, NFR-default-safe, structural);
  AC-7, the AC-1 microbench, and NFR-perf-poll are deferred to Tier-2
- full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-12 10:33:37 +03:00
parent 59f56c032f
commit 49a67f770d
9 changed files with 1175 additions and 2 deletions
@@ -101,6 +101,20 @@ KNOWN_PAYLOAD_KEYS: Final[dict[str, frozenset[str]]] = {
            "threshold_m",
        }
    ),
    # AZ-302 / E-C7: emitted on every thermal-throttle transition (False->True or
    # True->False). One record per flip; steady-state polls emit nothing on this
    # kind. `previous_state` and `new_state` are the throttle booleans pre/post
    # transition; temperatures and clock are taken from the same poll cycle.
    "c7.thermal_transition": frozenset(
        {
            "previous_state",
            "new_state",
            "gpu_temp_c",
            "cpu_temp_c",
            "measured_clock_mhz",
            "measured_at_ns",
        }
    ),
}
KNOWN_KINDS: Final[frozenset[str]] = frozenset(KNOWN_PAYLOAD_KEYS.keys())
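
As an illustration of how a registry like this might be used, the sketch
below checks a payload's keys against the registered set. `check_payload`
is a hypothetical helper, and the exact-match policy is an assumption;
this diff does not show how fdr_client actually enforces the registry.

```python
from typing import Final

# Only the entry added in this hunk is reproduced here.
KNOWN_PAYLOAD_KEYS: Final[dict[str, frozenset[str]]] = {
    "c7.thermal_transition": frozenset(
        {
            "previous_state",
            "new_state",
            "gpu_temp_c",
            "cpu_temp_c",
            "measured_clock_mhz",
            "measured_at_ns",
        }
    ),
}


def check_payload(kind: str, payload: dict) -> None:
    """Raise if `kind` is unregistered or the payload keys do not match exactly."""
    expected = KNOWN_PAYLOAD_KEYS.get(kind)
    if expected is None:
        raise KeyError(f"unknown FDR kind: {kind}")
    actual = frozenset(payload)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"{kind}: missing={missing} extra={extra}")
```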