Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton background-thread publisher that polls jtop (preferred) or pynvml (fallback) at config.thermal_poll_hz, stores an atomic ThermalState snapshot, and emits c7.thermal_transition FDR records on every throttle flip with a WARN log on entry and an INFO log on exit. Default-safe on TelemetryUnavailableError per Invariant I-6 with a 1-Hz rate-limited WARN. Sources return a raw ThermalReading; the publisher stamps measured_at_ns via its injected Clock so _JtopSource / _PynvmlSource stay clean of direct time.* calls (Invariant 2). _poll_once is the deterministic test seam; start() spawns the production thread.

- c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS
- [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject
- 14 unit tests (AC-1..AC-6, AC-8, NFR-default-safe, structural) green; AC-7 / AC-1 microbench / NFR-perf-poll Tier-2 deferred
- full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips

Co-authored-by: Cursor <cursoragent@cursor.com>
C7 ThermalState Publisher — jetson-stats / NVML 1 Hz Background
Task: AZ-302_c7_thermal_publisher
Name: C7 ThermalState Publisher
Description: Implement the 1 Hz background polling loop that reads jetson-stats / pynvml (CPU/GPU temperature, throttle bit, measured clock MHz), produces a lock-free atomic ThermalState snapshot, and exposes it via InferenceRuntime.thermal_state() for C4's D-CROSS-LATENCY-1 hybrid covariance-mode decision. The publisher emits an FDR record on every throttle-state transition, emits a WARN log on first throttle entry and on telemetry unavailability, and defaults thermal_throttle_active = false on TelemetryUnavailableError. Throttle-detection latency must be ≤ 1 s end-to-end so C4 reacts within 1 frame (C7-IT-02).
Complexity: 3 points
Dependencies: AZ-297_c7_runtime_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf
Component: c7_inference (epic AZ-249 / E-C7)
Tracker: AZ-302
Epic: AZ-249 (E-C7)
Document Dependencies
- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md`: defines `ThermalState` and `thermal_state()`; produced by AZ-297.
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`: used to emit thermal-transition FDR records via `FdrClient.publish`.
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`: `kind="c7.thermal_transition"` record envelope.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md`: WARN log shape on throttle entry / telemetry loss.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md`: Config carries `thermal_poll_hz` and a fallback flag for telemetry unavailability behaviour.
Problem
C4 Pose's D-CROSS-LATENCY-1 hybrid switches between two covariance modes (steady-state vs. JACOBIAN) based on whether the Jetson is thermal-throttling. Without a publisher:
- C4 has no canonical `ThermalState` source: the AC-NEW-5 throttle-reaction-within-1-frame guarantee has nothing to read.
- The F3 hot path threads cannot poll `jetson-stats` themselves (would race with `infer` on the GIL-protected critical section per description.md § 7).
- Thermal transitions are not recorded in the FDR: post-flight tooling cannot correlate degraded poses with thermal events.
- `TelemetryUnavailableError` (jetson-stats hung or absent) has no documented degraded-mode behaviour; without explicit handling, the system would either crash or silently lie.
This task is the SINGLE owner of thermal telemetry across the companion process.
Outcome
- A `ThermalStatePublisher` class at `src/gps_denied_onboard/components/c7_inference/thermal_publisher.py` with `start() / stop()` lifecycle and `read() -> ThermalState` accessor.
- A background thread (NORMAL priority, daemonic) polls jetson-stats / pynvml at `config.inference.thermal_poll_hz` Hz (default 1.0); each successful poll updates a single `_atomic_snapshot: ThermalState` field via simple-assignment (Python's GIL covers atomic-store of a single object reference; documented in the implementation report).
- `read()` is the lock-free reader: returns the current `_atomic_snapshot`. F3 hot-path callers MAY call `read()` from any thread; the call is wait-free and ≤ 1 µs.
- Throttle-state transitions (the boolean `thermal_throttle_active` flips between two consecutive polls) emit:
  - One `kind="c7.thermal_transition"` FDR record via the constructor-injected `FdrClient` (AZ-273) carrying `previous_state`, `new_state`, `gpu_temp_c`, `cpu_temp_c`, `measured_clock_mhz`, `measured_at_ns`.
  - One WARN log on entry-to-throttle (NOT on exit-from-throttle; the exit is INFO).
- `TelemetryUnavailableError` (raised by jetson-stats / pynvml internals): the publisher catches, sets `thermal_throttle_active = false` and `gpu_temp_c = None / cpu_temp_c = None`, emits ONE WARN log per occurrence (rate-limited to ≤ 1/sec), and continues polling. The runtime never silently lies: `ThermalState.is_telemetry_available` is set to `False` so C4 can choose to ignore the throttle bit.
- The publisher is a singleton constructed by the composition root and passed by reference into every `InferenceRuntime` strategy that needs `thermal_state()`. The strategies do NOT each construct their own publisher.
- The publisher is started during composition (after `FdrClient` is registered, before any consumer calls `read()`); stopped by the composition root's process-exit hook.
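The snapshot mechanics in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the as-built class: the `ThermalState` stand-in uses the field names from this spec with assumed types, `SnapshotHolder` is a hypothetical name, and the sample values are made up. It relies on the pinned CPython's GIL making a single attribute rebind atomic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThermalState:
    """Stand-in for the AZ-297 DTO; the field types are assumptions."""

    thermal_throttle_active: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[float]
    measured_at_ns: int
    is_telemetry_available: bool


class SnapshotHolder:
    """Hypothetical holder: one writer rebinds, any reader loads."""

    def __init__(self, initial: ThermalState) -> None:
        self._atomic_snapshot = initial

    def publish(self, state: ThermalState) -> None:
        # Single reference rebind: a reader sees either the old or the new
        # immutable object, never a half-updated one.
        self._atomic_snapshot = state

    def read(self) -> ThermalState:
        # No lock or condition variable on the read path.
        return self._atomic_snapshot


# Illustrative values only.
holder = SnapshotHolder(ThermalState(False, None, None, None, 0, False))
holder.publish(ThermalState(True, 71.5, 64.0, 918.0, 1_000_000_000, True))
latest = holder.read()
```

Making the DTO frozen means the poller has to build a fresh object for every poll, which is what keeps a plain reference swap sufficient.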
Scope
Included
- `ThermalStatePublisher` class with `__init__(config, fdr_client, logger)`, `start()`, `stop()`, `read() -> ThermalState`.
- Background polling thread: while `_running`, sleep `1.0 / config.inference.thermal_poll_hz` seconds, read `jetson-stats` (`from jtop import jtop` context manager OR `pynvml.nvmlDeviceGetTemperature` / `nvmlDeviceGetCurrentClocksThrottleReasons` direct calls; implementation chooses based on availability), build a fresh `ThermalState`, atomically replace `_atomic_snapshot`.
- Source-selection logic: at `start()`, attempt to import `jtop` (jetson-stats); if unavailable, attempt `pynvml`; if both unavailable, raise `TelemetryUnavailableError` from `start()` itself (composition aborts cleanly; operator chooses to disable thermal-aware paths).
- Throttle-transition detection: compare `previous._atomic_snapshot.thermal_throttle_active` vs. new value; on flip, emit FDR record + log per Outcome.
- WARN log on first telemetry-unavailable in a window (rate-limited to 1/sec): `kind="c7.thermal.unavailable"`.
- INFO log on `start()` and `stop()`; INFO log on throttle-exit transitions; WARN log on throttle-entry transitions.
- `ThermalState` extension: add `is_telemetry_available: bool` to the DTO defined in AZ-297. (NOTE: this is a Protocol-touching change; AZ-297's contract MUST list this field and AZ-302 SHOULD coordinate with AZ-297 at decompose time. Documented in the AZ-297 contract's `## Test Cases`.)
- Constructor-injected `Clock` for testability: tests inject a fake clock that advances on demand; production wires `time.monotonic_ns`.
- The `read()` accessor is wait-free and re-entrant; can be called from any thread including the F3 hot path.
- A `ThermalStatePublisher.is_running() -> bool` introspection accessor for the composition root and tests.
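The throttle-transition item in the list above can be sketched as a pure comparison step. This is an illustration only: `detect_transition`, the `publish_fdr(kind, payload)` callable and the `log(level, message)` callable are hypothetical stand-ins for the real `FdrClient.publish` and logger signatures, which are defined in their own contracts. The payload keys and the WARN/INFO split are the ones this spec names.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict


def detect_transition(
    previous: Any,
    new: Any,
    publish_fdr: Callable[[str, Dict[str, Any]], None],
    log: Callable[[str, str], None],
) -> Any:
    """Compare two consecutive snapshots; emit only when the throttle bit flips."""
    if new.thermal_throttle_active != previous.thermal_throttle_active:
        publish_fdr(
            "c7.thermal_transition",
            {
                "previous_state": previous.thermal_throttle_active,
                "new_state": new.thermal_throttle_active,
                "gpu_temp_c": new.gpu_temp_c,
                "cpu_temp_c": new.cpu_temp_c,
                "measured_clock_mhz": new.measured_clock_mhz,
                "measured_at_ns": new.measured_at_ns,
            },
        )
        # Entry is a WARN; exit is an INFO.
        if new.thermal_throttle_active:
            log("WARN", "thermal throttle entered")
        else:
            log("INFO", "thermal throttle exited")
    return new


def _snapshot(active: bool, at_ns: int) -> SimpleNamespace:
    # Duck-typed stand-in for ThermalState with made-up readings.
    return SimpleNamespace(
        thermal_throttle_active=active, gpu_temp_c=70.0, cpu_temp_c=60.0,
        measured_clock_mhz=900.0, measured_at_ns=at_ns,
    )


records, levels = [], []
state = _snapshot(False, 0)
for i, flag in enumerate((True, True, False), start=1):
    state = detect_transition(
        state, _snapshot(flag, i),
        lambda kind, payload: records.append((kind, payload)),
        lambda level, message: levels.append(level),
    )
```

Three polls with one steady sample in the middle produce two records, one per flip; the steady poll emits nothing.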
Excluded
- AZ-297 InferenceRuntime Protocol: this task adds the `is_telemetry_available` field to the existing `ThermalState` DTO; the Protocol method `thermal_state()` is owned by the strategies that delegate to this publisher.
- AZ-298 / AZ-299 / AZ-300 strategies: they delegate to this publisher; they do NOT poll themselves.
- AZ-301 EngineGate: unrelated.
- C4 Pose's covariance-mode switching logic: owned by E-C4. This task PUBLISHES `ThermalState`; C4 consumes.
- Cooling controller / fan curve adjustments: out of scope; the companion process is read-only on thermal telemetry.
- Cross-flight thermal trend analysis: operator post-flight tooling owns it.
- A SECOND telemetry source for redundancy (two pynvml clients, etc.): out of scope this cycle.
Acceptance Criteria
AC-1: read() is wait-free and returns the latest snapshot
Given a started publisher with two completed polls
When read() is called from the F3 hot path
Then the call returns within 1 µs (microbenched); the returned ThermalState matches the most recent poll's data
AC-2: throttle entry within 1 s
Given a running publisher and a simulated jetson-stats spoof flipping throttle_active from False to True at time T
When the test waits 1 s and calls read()
Then the returned ThermalState.thermal_throttle_active == True (latency ≤ 1 s end-to-end at 1 Hz poll rate); ONE FDR kind="c7.thermal_transition" record was emitted with new_state=True; ONE WARN log was emitted
AC-3: throttle exit within 1 s and INFO log
Given a publisher currently in throttle and a simulated flip to False
When 1 s elapses and the test calls read()
Then thermal_throttle_active == False; ONE FDR record with new_state=False; ONE INFO log (NOT WARN) records the exit
AC-4: telemetry unavailability sets is_telemetry_available=False, defaults throttle to False
Given a started publisher whose jetson-stats source raises TelemetryUnavailableError on every poll
When the test waits 2 s and calls read()
Then ThermalState.is_telemetry_available == False, thermal_throttle_active == False (default-safe), gpu_temp_c == None, cpu_temp_c == None; the WARN log was emitted at most twice in 2 s (rate-limited to 1/sec)
AC-5: cold-start with no source raises
Given an environment where neither jtop nor pynvml is importable
When the publisher's start() is called
Then TelemetryUnavailableError is raised from start() itself; the publisher is in is_running() == False state; the composition root catches and either aborts startup or proceeds with thermal-aware paths disabled (decision is composition-root's, not this task's)
AC-6: start/stop lifecycle is idempotent
Given a publisher
When start() is called twice in succession
Then the second call is a no-op (returns silently); the polling thread is NOT duplicated; is_running() == True
When stop() is called twice
Then the second stop() is a no-op; resources are NOT double-freed
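One way to get the idempotent lifecycle AC-6 describes is sketched below. `PollerLifecycle` is a hypothetical name and the sketch covers idempotency only: it paces the loop with `threading.Event.wait` for brevity and leaves out the `Clock` injection that this spec requires of the real polling loop. The lock guards `start()` / `stop()` only and is never touched by a reader.

```python
import threading
from typing import Callable, Optional


class PollerLifecycle:
    """Hypothetical start/stop wrapper around a periodic poll function."""

    def __init__(self, poll_fn: Callable[[], None], period_s: float) -> None:
        self._poll_fn = poll_fn
        self._period_s = period_s
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self._lifecycle_lock = threading.Lock()  # lifecycle only, not read()

    def start(self) -> None:
        with self._lifecycle_lock:
            if self._thread is not None:
                return  # second start() is a no-op
            self._stop_event.clear()
            self._thread = threading.Thread(
                target=self._run, name="thermal-poller", daemon=True
            )
            self._thread.start()

    def stop(self) -> None:
        with self._lifecycle_lock:
            if self._thread is None:
                return  # second stop() is a no-op
            self._stop_event.set()
            self._thread.join()
            self._thread = None

    def is_running(self) -> bool:
        return self._thread is not None

    def _run(self) -> None:
        while not self._stop_event.is_set():
            self._poll_fn()
            # Wakes early when stop() sets the event.
            self._stop_event.wait(self._period_s)
```

Waiting on the event instead of sleeping lets `stop()` return promptly even when the poll period is a full second.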
AC-7: poll thread does not interfere with infer hot path
Given a started publisher polling at 1 Hz and an F3 hot-path benchmark running infer at 39 Hz aggregate (per description.md)
When the benchmark runs for 60 s
Then the F3 hot-path latency p95 is unchanged compared to a baseline without the publisher (the polling thread does not contend on the CUDA stream or any infer-critical resource); the publisher's poll p99 is ≤ 100 ms
AC-8: FDR record envelope matches contract
Given a throttle transition
When the FDR record is written
Then the record matches _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md shape with kind="c7.thermal_transition", producer_id="c7_inference.thermal", payload containing {previous_state, new_state, gpu_temp_c, cpu_temp_c, measured_clock_mhz, measured_at_ns} (deep-equal verified against the schema's payload type)
Non-Functional Requirements
Performance
- `read()` is wait-free; p99 ≤ 1 µs.
- Poll p99 ≤ 100 ms (the polling thread executes at NORMAL priority; jetson-stats takes 30–80 ms typically).
- F3 hot-path latency unchanged when publisher is running (AC-7).
Reliability
- The publisher NEVER blocks the F3 hot path. The polling thread runs at NORMAL priority on a separate Python thread; `read()` is a single object-reference load (atomic under GIL).
- Telemetry-unavailability defaults are documented and tested (AC-4); the system never lies about throttle state.
- The publisher is one of the FIRST things the composition root starts (after `FdrClient` registration) and one of the LAST things it stops (before `FdrClient.stop`). Documented as a startup-order constraint.
Concurrency
- One polling thread per process. The publisher is a singleton.
- `read()` is re-entrant and called from the F3 hot path threads (consumers); they hold no locks the publisher is sensitive to.
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Microbench read() × 100k from a worker thread | p99 ≤ 1 µs; returned state matches latest poll |
| AC-2 | Spoof flip False→True at T; wait 1 s; read | throttle_active=True; one FDR + one WARN |
| AC-3 | Spoof flip True→False; wait 1 s; read | throttle_active=False; one FDR + one INFO |
| AC-4 | Spoof unavailable on every poll for 2 s | is_telemetry_available=False; default-safe; ≤ 2 WARN |
| AC-5 | Disable jtop + pynvml; start | TelemetryUnavailableError raised; is_running=False |
| AC-6 | start twice; stop twice | Idempotent; no double-spawn / double-free |
| AC-7 | F3 hot-path bench with publisher running | p95 unchanged vs. baseline; poll p99 ≤ 100 ms |
| AC-8 | Throttle transition FDR record shape | Matches schema deep-equal |
| NFR-perf-poll | Microbench poll body × 100 | p99 ≤ 100 ms |
| NFR-reliability-default-safe | Telemetry unavailable on first poll → read() | is_telemetry_available=False; throttle_active=False |
Constraints
- One polling thread per process; publisher is a singleton.
- `read()` is wait-free; the implementation MUST NOT introduce any lock or condition variable on the read path.
- Source selection at `start()` time: `jtop` first, `pynvml` second; if both fail, raise `TelemetryUnavailableError` from `start()` (do NOT silently default-safe; that hides a misconfigured deployment).
- Once selected, the source does NOT swap mid-flight (e.g., if jtop becomes unavailable mid-flight, the publisher hits AC-4 default-safe behaviour but does NOT switch to pynvml; the operator must restart).
- WARN log on telemetry-unavailable is rate-limited to ≤ 1/sec via a simple monotonic-clock check; rate limit window is documented.
- `ThermalState.is_telemetry_available` is added to the AZ-297 DTO; this is a coordinated change documented in the AZ-297 contract's change log.
- This task introduces no new third-party dependencies: `jtop` (jetson-stats) and `pynvml` are both already pinned by the description.md key dependencies table.
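The "simple monotonic-clock check" in the rate-limit constraint above could be as small as the following sketch. `RateLimitedWarn` and its constructor parameters are hypothetical; the 1 s window comes from this spec, while the choice to measure the window from the last emitted WARN (not from the last attempt) is an assumption.

```python
from typing import Callable, Optional


class RateLimitedWarn:
    """Hypothetical helper: lets at most one WARN through per interval."""

    def __init__(
        self,
        now_ns: Callable[[], int],
        emit_warn: Callable[[str], None],
        min_interval_ns: int = 1_000_000_000,  # 1 s window from the spec
    ) -> None:
        self._now_ns = now_ns
        self._emit_warn = emit_warn
        self._min_interval_ns = min_interval_ns
        self._last_emit_ns: Optional[int] = None

    def warn(self, message: str) -> bool:
        """Emit the WARN unless one was emitted inside the window."""
        now = self._now_ns()
        if (
            self._last_emit_ns is not None
            and now - self._last_emit_ns < self._min_interval_ns
        ):
            return False  # suppressed
        self._last_emit_ns = now
        self._emit_warn(message)
        return True


# Scripted clock readings in nanoseconds, as a test's fake clock might supply.
ticks = iter([0, 400_000_000, 999_999_999, 1_000_000_000, 1_500_000_000])
emitted = []
limiter = RateLimitedWarn(lambda: next(ticks), emitted.append)
results = [limiter.warn("c7.thermal.unavailable") for _ in range(5)]
```

Taking the clock as a callable keeps the helper free of direct `time.*` calls, so the same injected `Clock` can drive it in tests.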
Risks & Mitigation
Risk 1: GIL atomicity assumption is wrong on a future Python
- Risk: Python 3.13+ free-threaded mode (PEP 703) removes the GIL, so the GIL can no longer be relied on to make the snapshot's simple-assignment atomic.
- Mitigation: Implementation report documents the GIL assumption; the project's Python version is pinned and the implementation is correct under the pinned version. If/when the project moves to free-threaded mode, this task is revisited (would add an `atomic` library wrapper or `threading.Lock` around the snapshot).
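A cheap complement to documenting the GIL assumption is to check it at startup. The sketch below is not part of this task's scope and the function names are hypothetical; it uses `sys._is_gil_enabled()`, which CPython exposes from 3.13 onward, and treats interpreters without that function as GIL-enabled.

```python
import sys


def gil_enabled() -> bool:
    """Return True when the running interpreter has the GIL enabled."""
    probe = getattr(sys, "_is_gil_enabled", None)  # CPython 3.13+
    # Older CPython builds have no probe and always run with the GIL.
    return True if probe is None else bool(probe())


def assert_gil_assumption() -> None:
    """Fail fast if the snapshot swap's GIL assumption does not hold."""
    if not gil_enabled():
        raise RuntimeError(
            "free-threaded interpreter detected; revisit the snapshot swap"
        )
```

Calling such a guard from the composition root would turn a silent change of interpreter build into an explicit startup failure.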
Risk 2: Polling thread starves under thermal-throttle (the very condition it is observing)
- Risk: Under heavy throttle, NORMAL-priority threads may not get scheduled; the publisher's poll latency then exceeds 1 s and the throttle-detection latency contract is violated.
- Mitigation: The polling thread is daemonic but at NORMAL priority; jetson-stats internal calls take 30–80 ms typically — well under 1 s budget. AC-2 is the canonical test; if it fails on real hardware, the operator may bump the priority via config (out of scope this cycle).
Risk 3: jetson-stats internal threading interferes with our polling thread
- Risk: `jtop` runs its own background thread; concurrent access from our polling thread + an operator tool could corrupt jtop's state.
- Mitigation: This publisher is the SINGLE owner of `jtop` access in the companion process. Operator tools (C12) read FDR records, not the live publisher. Documented in the implementation report.
Risk 4: FDR record emission rate spikes during rapid throttle oscillation
- Risk: A pathological thermal scenario could oscillate at the poll rate, emitting one record per second sustained — affecting AC-NEW-3 segment-size budgets.
- Mitigation: 1 record per second sustained is well within the C13 writer's throughput budget (200 Hz aggregate per AZ-291); even worst-case oscillation is benign. No additional rate limit is needed for FDR transition records.
Runtime Completeness
- Named capability: ThermalState publisher + lock-free atomic snapshot + 1 Hz background polling (architecture / E-C7 / AC-NEW-5 / D-CROSS-LATENCY-1).
- Production code that must exist: real `ThermalStatePublisher` class with real background thread, real `jtop` / `pynvml` poll, real lock-free `_atomic_snapshot` reference swap, real FDR record emission via the injected `FdrClient`.
- Allowed external stubs: tests MAY substitute a `FakeJtopSource` and a `FakeFdrClient` (AC-2..AC-8); production wiring uses real `jtop` + real AZ-273 `FdrClient`.
- Unacceptable substitutes: a polling loop that uses `time.sleep` without a real `Clock` injection (would break test determinism); a snapshot field guarded by a `threading.Lock` on the read path (would violate the wait-free read requirement under F3 hot path); a "warn but always default-safe" mode that never raises `TelemetryUnavailableError` from `start()` (would hide misconfigured deployments, exactly the failure mode AC-5 prevents).
Implementation Notes (2026-05-12, batch 26)
Four task-spec → as-built deltas:
- `ThermalReading` intermediate DTO: spec said source `read()` returns a `ThermalState`. As-built, sources return a `ThermalReading` (no `measured_at_ns`, no `is_telemetry_available`) and the publisher stamps both fields from its injected `Clock`. This keeps `_JtopSource` / `_PynvmlSource` from calling `time.monotonic_ns()` directly, which `tests/_meta/test_no_direct_time_in_components.py` (Invariant 2) forbids in any `src/gps_denied_onboard/components/**` file. Spec already required a `Clock`-injectable publisher; this delta extends the contract one node deeper into the source classes.
- Telemetry backends are an optional pyproject group: spec said "this task introduces no new third-party dependencies: `jtop` and `pynvml` are both already pinned by the description.md key dependencies table" but neither was actually in `pyproject.toml`. Added a `[telemetry]` optional extra (`jetson-stats>=4.2`, `pynvml>=11.5`) so Tier-1 installs (`.[dev]`) stay slim and Tier-2 Jetson installs (`.[dev,inference,telemetry]`) pull them in. Both backends remain runtime-optional: the publisher discovers them at `start()` time and raises `TelemetryUnavailableError` only when both are absent.
- `_poll_once` is the deterministic test seam, NOT `start()`: the AC-1..AC-4 / AC-8 tests script the source + drive the clock under a `_ManualClock`; `start()` spawns the background polling thread which would race those deterministic calls (observed in first test iteration). Tests bypass `start()` via a `_prime_without_thread()` helper that runs the source-selection + initial-poll path without spawning the thread. `start()` + `stop()` are still exercised by AC-5 (cold-start fallback), AC-6 (idempotency), and one integration test that proves the background thread actually polls.
- `c7.thermal_transition` registered in the FDR schema: added the kind to `KNOWN_PAYLOAD_KEYS` in `src/gps_denied_onboard/fdr_client/records.py` with the six payload fields the spec names (`previous_state`, `new_state`, `gpu_temp_c`, `cpu_temp_c`, `measured_clock_mhz`, `measured_at_ns`). The AZ-272 fixture (`tests/unit/test_az272_fdr_record_schema.py::_kind_payload`) was extended with a sample payload so the every-known-kind roundtrip test passes for the new kind.
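The first delta above (sources return a raw reading, the publisher stamps it) can be illustrated with the sketch below. It is not the as-built code: the field types, the `monotonic_ns()` method on the `Clock` protocol, the `ManualClock` class and the `stamp` / `default_safe_snapshot` function names are assumptions made for the example, as is leaving `measured_clock_mhz` as `None` in the default-safe case.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class Clock(Protocol):
    def monotonic_ns(self) -> int: ...  # assumed method name


@dataclass(frozen=True)
class ThermalReading:
    """Raw source output: no timestamp, no availability flag."""

    thermal_throttle_active: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[float]


@dataclass(frozen=True)
class ThermalState:
    thermal_throttle_active: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[float]
    measured_at_ns: int
    is_telemetry_available: bool


class ManualClock:
    """Test clock that only moves when told to."""

    def __init__(self, start_ns: int = 0) -> None:
        self._now_ns = start_ns

    def advance(self, delta_ns: int) -> None:
        self._now_ns += delta_ns

    def monotonic_ns(self) -> int:
        return self._now_ns


def stamp(reading: ThermalReading, clock: Clock) -> ThermalState:
    """Publisher-side step: add the two fields the source must not set."""
    return ThermalState(
        thermal_throttle_active=reading.thermal_throttle_active,
        gpu_temp_c=reading.gpu_temp_c,
        cpu_temp_c=reading.cpu_temp_c,
        measured_clock_mhz=reading.measured_clock_mhz,
        measured_at_ns=clock.monotonic_ns(),
        is_telemetry_available=True,
    )


def default_safe_snapshot(clock: Clock) -> ThermalState:
    """Snapshot used when the source raises TelemetryUnavailableError."""
    # measured_clock_mhz=None is an assumption; the spec names only the temps.
    return ThermalState(False, None, None, None, clock.monotonic_ns(), False)
```

With the clock injected at this single point, a test can advance time by hand and assert exact `measured_at_ns` values.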
As-built file map
- `src/gps_denied_onboard/components/c7_inference/thermal_publisher.py`: `ThermalStatePublisher`, `ThermalSource` Protocol, `ThermalReading` intermediate DTO, `_JtopSource` (Tier-2 jtop binding), `_PynvmlSource` (Tier-2 NVML fallback), `_default_safe_snapshot` (Invariant I-6 enforcement).
- `src/gps_denied_onboard/components/c7_inference/__init__.py`: re-exports `ThermalStatePublisher`, `ThermalSource`, `ThermalReading`.
- `src/gps_denied_onboard/fdr_client/records.py`: `c7.thermal_transition` added to `KNOWN_PAYLOAD_KEYS`.
- `pyproject.toml`: new `[telemetry]` optional dep group (`jetson-stats`, `pynvml`).
- `tests/unit/c7_inference/test_thermal_publisher.py`: 14 tests (AC-1 x2, AC-2, AC-3, AC-4, AC-5 x2, AC-6 x2, background-thread integration, AC-8, NFR-default-safe, Protocol structural, Invariant I-6).
- `tests/unit/test_az272_fdr_record_schema.py`: `_kind_payload` extended with the new kind.
AC-7 (F3-hot-path non-interference) and the wait-free p99 microbench in AC-1 are Tier-2 perf concerns; the Tier-2 validation task will cover them on real Jetson hardware.