gps-denied-onboard/_docs/02_tasks/done/AZ-302_c7_thermal_publisher.md
Oleksandr Bezdieniezhnykh 49a67f770d [AZ-302] C7 ThermalStatePublisher — jtop/NVML 1 Hz background poller
Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton
background-thread publisher that polls jtop (preferred) or pynvml
(fallback) at config.thermal_poll_hz, stores an atomic ThermalState
snapshot, and emits c7.thermal_transition FDR records on every throttle
flip with a WARN log on entry and an INFO log on exit. Default-safe on
TelemetryUnavailableError per Invariant I-6 with a 1-Hz rate-limited
WARN.

Sources return a raw ThermalReading; the publisher stamps measured_at_ns
via its injected Clock so _JtopSource / _PynvmlSource stay clean of
direct time.* calls (Invariant 2). _poll_once is the deterministic test
seam — start() spawns the production thread.

- c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS
- [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject
- 14 unit tests (AC-1..AC-6, AC-8, NFR-default-safe, structural)
  green; AC-7 / AC-1 microbench / NFR-perf-poll Tier-2 deferred
- full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 10:33:37 +03:00


C7 ThermalState Publisher — jetson-stats / NVML 1 Hz Background

Task: AZ-302_c7_thermal_publisher
Name: C7 ThermalState Publisher
Description: Implement the 1 Hz background polling loop that reads jetson-stats / pynvml (CPU/GPU temperature, throttle bit, measured clock MHz), produces a lock-free atomic ThermalState snapshot, exposes it via InferenceRuntime.thermal_state() for C4's D-CROSS-LATENCY-1 hybrid covariance-mode decision. Emits an FDR record on every throttle-state transition; emits a WARN log on first throttle entry and on telemetry unavailability; defaults thermal_throttle_active = false on TelemetryUnavailableError. Throttle-detection latency must be ≤ 1 s end-to-end so C4 reacts within 1 frame (C7-IT-02).
Complexity: 3 points
Dependencies: AZ-297_c7_runtime_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf
Component: c7_inference (epic AZ-249 / E-C7)
Tracker: AZ-302
Epic: AZ-249 (E-C7)

Document Dependencies

  • _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md — defines ThermalState and thermal_state(); produced by AZ-297.
  • _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md — used to emit thermal-transition FDR records via FdrClient.publish.
  • _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md — kind="c7.thermal_transition" record envelope.
  • _docs/02_document/contracts/shared_logging/log_record_schema.md — WARN log shape on throttle entry / telemetry loss.
  • _docs/02_document/contracts/shared_config/composition_root_protocol.md — Config carries thermal_poll_hz and a fallback flag for telemetry unavailability behaviour.

Problem

C4 Pose's D-CROSS-LATENCY-1 hybrid switches between two covariance modes (steady-state vs. JACOBIAN) based on whether the Jetson is thermal-throttling. Without a publisher:

  • C4 has no canonical ThermalState source — the AC-NEW-5 throttle-reaction-within-1-frame guarantee has nothing to read.
  • The F3 hot path threads cannot poll jetson-stats themselves (would race with infer on the GIL-protected critical section per description.md § 7).
  • Thermal transitions are not recorded in the FDR — post-flight tooling cannot correlate degraded poses with thermal events.
  • TelemetryUnavailableError (jetson-stats hung or absent) has no documented degraded-mode behaviour; without explicit handling, the system would either crash or silently lie.

This task is the SINGLE owner of thermal telemetry across the companion process.

Outcome

  • A ThermalStatePublisher class at src/gps_denied_onboard/components/c7_inference/thermal_publisher.py with start() / stop() lifecycle and read() -> ThermalState accessor.
  • A background thread (NORMAL priority, daemonic) polls jetson-stats / pynvml at config.inference.thermal_poll_hz Hz (default 1.0); each successful poll updates a single _atomic_snapshot: ThermalState field via simple-assignment (Python's GIL covers atomic-store of a single object reference — documented in the implementation report).
  • read() is the lock-free reader: returns the current _atomic_snapshot. F3 hot-path callers MAY call read() from any thread; the call is wait-free and ≤ 1 µs.
  • Throttle-state transitions (the boolean thermal_throttle_active flips between two consecutive polls) emit:
    • One kind="c7.thermal_transition" FDR record via the constructor-injected FdrClient (AZ-273) carrying previous_state, new_state, gpu_temp_c, cpu_temp_c, measured_clock_mhz, measured_at_ns.
    • One WARN log on entry-to-throttle (NOT on exit-from-throttle — the exit is INFO).
  • TelemetryUnavailableError (raised by jetson-stats / pynvml internals): the publisher catches, sets thermal_throttle_active = false and gpu_temp_c = None / cpu_temp_c = None, emits ONE WARN log per occurrence (rate-limited to ≤ 1/sec), and continues polling. The runtime never silently lies — ThermalState.is_telemetry_available is set to False so C4 can choose to ignore the throttle bit.
  • The publisher is a singleton constructed by the composition root and passed by reference into every InferenceRuntime strategy that needs thermal_state(). The strategies do NOT each construct their own publisher.
  • The publisher is started during composition (after FdrClient is registered, before any consumer calls read()); stopped by the composition root's process-exit hook.
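
The snapshot swap and the lock-free read described above can be sketched roughly as follows. This is a minimal illustration, not the as-built class: SnapshotHolder and default_safe_snapshot are hypothetical names, the ThermalState field list is inferred from this task text (the real DTO is owned by the AZ-297 contract), and setting measured_clock_mhz to None in the default-safe case is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThermalState:
    # Field names follow the task text; the authoritative DTO lives in AZ-297.
    thermal_throttle_active: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[int]
    measured_at_ns: int
    is_telemetry_available: bool


def default_safe_snapshot(now_ns: int) -> ThermalState:
    """Snapshot for the telemetry-unavailable case: throttle defaults to False."""
    # measured_clock_mhz=None is an assumption; the spec only names the temperatures.
    return ThermalState(False, None, None, None, now_ns, False)


class SnapshotHolder:
    """Hypothetical helper: one writer (the poll thread), any number of readers."""

    def __init__(self, initial: ThermalState) -> None:
        self._atomic_snapshot = initial

    def publish(self, new: ThermalState) -> bool:
        """Swap in a new snapshot; return True if the throttle bit flipped."""
        previous = self._atomic_snapshot
        self._atomic_snapshot = new  # single object-reference store
        return previous.thermal_throttle_active != new.thermal_throttle_active

    def read(self) -> ThermalState:
        return self._atomic_snapshot  # single object-reference load, no lock
```

Because ThermalState is frozen and replaced wholesale, a reader always sees either the old or the new snapshot, never a partially updated one.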

Scope

Included

  • ThermalStatePublisher class with __init__(config, fdr_client, logger), start(), stop(), read() -> ThermalState.
  • Background polling thread: while _running, sleep 1.0 / config.inference.thermal_poll_hz seconds, read jetson-stats (from jtop import jtop context manager OR pynvml.nvmlDeviceGetTemperature / nvmlDeviceGetCurrentClocksThrottleReasons direct calls — implementation chooses based on availability), build a fresh ThermalState, atomically replace _atomic_snapshot.
  • Source-selection logic: at start(), attempt to import jtop (jetson-stats); if unavailable, attempt pynvml; if both unavailable, raise TelemetryUnavailableError from start() itself (composition aborts cleanly — operator chooses to disable thermal-aware paths).
  • Throttle-transition detection: compare the previous _atomic_snapshot.thermal_throttle_active against the new value; on flip, emit FDR record + log per Outcome.
  • WARN log on first telemetry-unavailable in a window (rate-limited to 1/sec): kind="c7.thermal.unavailable".
  • INFO log on start() and stop(); INFO log on throttle-exit transitions; WARN log on throttle-entry transitions.
  • ThermalState extension: add is_telemetry_available: bool to the DTO defined in AZ-297. (NOTE: this is a Protocol-touching change; AZ-297's contract MUST list this field and AZ-302 SHOULD coordinate with AZ-297 at decompose time. Documented in the AZ-297 contract's ## Test Cases.)
  • Constructor-injected Clock for testability — tests inject a fake clock that advances on demand; production wires time.monotonic_ns.
  • The read() accessor is wait-free and re-entrant; can be called from any thread including the F3 hot path.
  • A ThermalStatePublisher.is_running() -> bool introspection accessor for the composition root and tests.
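
The source-selection order (jtop first, pynvml second, raise if neither is present) could look something like the sketch below. It assumes TelemetryUnavailableError is an ordinary exception class and uses a hypothetical select_source function with an injectable importability probe so the order can be tested without either package installed; the as-built code may probe differently.

```python
import importlib.util
from typing import Callable


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's error type (assumed to be an exception class)."""


def _is_importable(module_name: str) -> bool:
    # find_spec returns None for a top-level module that is not installed.
    return importlib.util.find_spec(module_name) is not None


def select_source(is_importable: Callable[[str], bool] = _is_importable) -> str:
    """Return the name of the first available backend, preferring jtop."""
    for name in ("jtop", "pynvml"):
        if is_importable(name):
            return name
    # Do not default-safe here: a missing backend is a deployment problem.
    raise TelemetryUnavailableError("neither jtop nor pynvml is importable")
```

A test can pass `lambda name: False` as the probe to exercise the AC-5 cold-start path on any machine.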

Excluded

  • AZ-297 InferenceRuntime Protocol — this task adds the is_telemetry_available field to the existing ThermalState DTO; the Protocol method thermal_state() is owned by the strategies that delegate to this publisher.
  • AZ-298 / AZ-299 / AZ-300 strategies — they delegate to this publisher; they do NOT poll themselves.
  • AZ-301 EngineGate — unrelated.
  • C4 Pose's covariance-mode switching logic — owned by E-C4. This task PUBLISHES ThermalState; C4 consumes.
  • Cooling controller / fan curve adjustments — out of scope; the companion process is read-only on thermal telemetry.
  • Cross-flight thermal trend analysis — operator post-flight tooling owns it.
  • A SECOND telemetry source for redundancy (two pynvml clients, etc.) — out of scope this cycle.

Acceptance Criteria

AC-1: read() is wait-free and returns the latest snapshot
Given a started publisher with two completed polls
When read() is called from the F3 hot path
Then the call returns within 1 µs (microbenched); the returned ThermalState matches the most recent poll's data

AC-2: throttle entry within 1 s
Given a running publisher and a simulated jetson-stats spoof flipping throttle_active from False to True at time T
When the test waits 1 s and calls read()
Then the returned ThermalState.thermal_throttle_active == True (latency ≤ 1 s end-to-end at 1 Hz poll rate); ONE FDR kind="c7.thermal_transition" record was emitted with new_state=True; ONE WARN log was emitted

AC-3: throttle exit within 1 s and INFO log
Given a publisher currently in throttle and a simulated flip to False
When 1 s elapses and the test calls read()
Then thermal_throttle_active == False; ONE FDR record with new_state=False; ONE INFO log (NOT WARN) records the exit

AC-4: telemetry unavailability sets is_telemetry_available=False, defaults throttle to False
Given a started publisher whose jetson-stats source raises TelemetryUnavailableError on every poll
When the test waits 2 s and calls read()
Then ThermalState.is_telemetry_available == False, thermal_throttle_active == False (default-safe), gpu_temp_c == None, cpu_temp_c == None; the WARN log was emitted at most twice in 2 s (rate-limited to 1/sec)

AC-5: cold-start with no source raises
Given an environment where neither jtop nor pynvml is importable
When the publisher's start() is called
Then TelemetryUnavailableError is raised from start() itself; the publisher is in is_running() == False state; the composition root catches and either aborts startup or proceeds with thermal-aware paths disabled (decision is composition-root's, not this task's)

AC-6: start/stop lifecycle is idempotent
Given a publisher
When start() is called twice in succession
Then the second call is a no-op (returns silently); the polling thread is NOT duplicated; is_running() == True
When stop() is called twice
Then the second stop() is a no-op; resources are NOT double-freed

AC-7: poll thread does not interfere with infer hot path
Given a started publisher polling at 1 Hz and an F3 hot-path benchmark running infer at 39 Hz aggregate (per description.md)
When the benchmark runs for 60 s
Then the F3 hot-path latency p95 is unchanged compared to a baseline without the publisher (the polling thread does not contend on the CUDA stream or any infer-critical resource); the publisher's poll p99 is ≤ 100 ms

AC-8: FDR record envelope matches contract
Given a throttle transition
When the FDR record is written
Then the record matches _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md shape with kind="c7.thermal_transition", producer_id="c7_inference.thermal", payload containing {previous_state, new_state, gpu_temp_c, cpu_temp_c, measured_clock_mhz, measured_at_ns} (deep-equal verified against the schema's payload type)
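
For AC-8, a record with the six named payload fields might be assembled as in the sketch below. The kind, producer_id, and payload key names come from this task; the dict-shaped envelope and the build_transition_record helper are assumptions for illustration, since the real envelope is defined by fdr_record_schema.md and written through FdrClient.

```python
from typing import Any, Dict, Optional

# The six payload keys named by the task spec.
PAYLOAD_KEYS = frozenset(
    {
        "previous_state",
        "new_state",
        "gpu_temp_c",
        "cpu_temp_c",
        "measured_clock_mhz",
        "measured_at_ns",
    }
)


def build_transition_record(
    previous_state: bool,
    new_state: bool,
    gpu_temp_c: Optional[float],
    cpu_temp_c: Optional[float],
    measured_clock_mhz: Optional[int],
    measured_at_ns: int,
) -> Dict[str, Any]:
    """Hypothetical helper: build a thermal-transition record as a plain dict."""
    payload = {
        "previous_state": previous_state,
        "new_state": new_state,
        "gpu_temp_c": gpu_temp_c,
        "cpu_temp_c": cpu_temp_c,
        "measured_clock_mhz": measured_clock_mhz,
        "measured_at_ns": measured_at_ns,
    }
    # Envelope layout is assumed; only these three top-level names come from the spec.
    return {
        "kind": "c7.thermal_transition",
        "producer_id": "c7_inference.thermal",
        "payload": payload,
    }
```

A deep-equal check in the style of AC-8 then reduces to comparing `set(record["payload"])` against PAYLOAD_KEYS plus per-field type checks.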

Non-Functional Requirements

Performance

  • read() is wait-free; p99 ≤ 1 µs.
  • Poll p99 ≤ 100 ms (the polling thread executes at NORMAL priority; jetson-stats takes 30–80 ms typically).
  • F3 hot-path latency unchanged when publisher is running (AC-7).

Reliability

  • The publisher NEVER blocks the F3 hot path. The polling thread runs at NORMAL priority on a separate Python thread; read() is a single object-reference load (atomic under GIL).
  • Telemetry-unavailability defaults are documented and tested (AC-4); the system never lies about throttle state.
  • The publisher is one of the FIRST things the composition root starts (after FdrClient registration) and one of the LAST things it stops (before FdrClient.stop). Documented as a startup-order constraint.

Concurrency

  • One polling thread per process. The publisher is a singleton.
  • read() is re-entrant and called from the F3 hot path threads (consumers); they hold no locks the publisher is sensitive to.
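
The one-thread-per-process and idempotent start/stop requirements can be illustrated with the lifecycle sketch below. PollerLifecycle is a hypothetical name, and the loop paces itself with threading.Event.wait purely to keep the example short; that is a simplification, not the Clock-injected timing the spec requires. It also assumes start() and stop() are only called from the composition root's thread.

```python
import threading
from typing import Callable, Optional


class PollerLifecycle:
    """Hypothetical lifecycle wrapper around a single daemonic poll thread."""

    def __init__(self, poll_once: Callable[[], None], poll_hz: float = 1.0) -> None:
        self._poll_once = poll_once
        self._period_s = 1.0 / poll_hz
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        if self._thread is not None:
            return  # second start() is a no-op; no duplicate thread
        self._stop_event.clear()
        self._thread = threading.Thread(
            target=self._run, name="thermal-poller", daemon=True
        )
        self._thread.start()

    def stop(self) -> None:
        thread = self._thread
        if thread is None:
            return  # second stop() is a no-op
        self._stop_event.set()
        thread.join()
        self._thread = None

    def is_running(self) -> bool:
        return self._thread is not None

    def _run(self) -> None:
        while not self._stop_event.is_set():
            self._poll_once()
            # Event.wait returns early when stop() sets the event.
            self._stop_event.wait(self._period_s)
```

Waiting on the stop event rather than sleeping unconditionally lets stop() return promptly instead of blocking for up to one poll period.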

Unit Tests

| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Microbench read() × 100k from a worker thread | p99 ≤ 1 µs; returned state matches latest poll |
| AC-2 | Spoof flip False→True at T; wait 1 s; read | throttle_active=True; one FDR + one WARN |
| AC-3 | Spoof flip True→False; wait 1 s; read | throttle_active=False; one FDR + one INFO |
| AC-4 | Spoof unavailable on every poll for 2 s | is_telemetry_available=False; default-safe; ≤ 2 WARN |
| AC-5 | Disable jtop + pynvml; start | TelemetryUnavailableError raised; is_running=False |
| AC-6 | start twice; stop twice | Idempotent; no double-spawn / double-free |
| AC-7 | F3 hot-path bench with publisher running | p95 unchanged vs. baseline; poll p99 ≤ 100 ms |
| AC-8 | Throttle transition FDR record shape | Matches schema deep-equal |
| NFR-perf-poll | Microbench poll body × 100 | p99 ≤ 100 ms |
| NFR-reliability-default-safe | Telemetry unavailable on first poll → read() | is_telemetry_available=False; throttle_active=False |

Constraints

  • One polling thread per process; publisher is a singleton.
  • read() is wait-free; the implementation MUST NOT introduce any lock or condition variable on the read path.
  • Source selection at start() time: jtop first, pynvml second; if both fail, raise TelemetryUnavailableError from start() (do NOT silently default-safe — that hides a misconfigured deployment).
  • Once selected, the source does NOT swap mid-flight (e.g., if jtop becomes unavailable mid-flight, the publisher hits AC-4 default-safe behaviour but does NOT switch to pynvml — operator must restart).
  • WARN log on telemetry-unavailable is rate-limited to ≤ 1/sec via a simple monotonic-clock check; rate limit window is documented.
  • ThermalState.is_telemetry_available is added to the AZ-297 DTO; this is a coordinated change documented in the AZ-297 contract's change log.
  • This task introduces no new third-party dependencies — jtop (jetson-stats) and pynvml are both already pinned by the description.md key dependencies table.
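
The "simple monotonic-clock check" for the telemetry-unavailable WARN could be as small as the sketch below. RateLimitedWarn is a hypothetical helper; the clock and the emit callback are injected so the 1-second window can be tested without real time passing, and the fixed-window-from-last-emit policy is an assumption about how the limit is applied.

```python
from typing import Callable, Optional


class RateLimitedWarn:
    """Hypothetical helper: emit at most one message per min_interval_ns."""

    def __init__(
        self,
        clock_ns: Callable[[], int],
        emit: Callable[[str], None],
        min_interval_ns: int = 1_000_000_000,  # 1 s window
    ) -> None:
        self._clock_ns = clock_ns
        self._emit = emit
        self._min_interval_ns = min_interval_ns
        self._last_emit_ns: Optional[int] = None

    def warn(self, message: str) -> bool:
        """Emit the message unless one was emitted within the window."""
        now_ns = self._clock_ns()
        last = self._last_emit_ns
        if last is not None and now_ns - last < self._min_interval_ns:
            return False  # suppressed
        self._last_emit_ns = now_ns
        self._emit(message)
        return True
```

In production the clock argument would be the publisher's injected Clock; in tests a list-backed fake is enough.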

Risks & Mitigation

Risk 1: GIL atomicity assumption is wrong on a future Python

  • Risk: Python 3.13+ free-threaded mode (PEP 703) removes the GIL; simple-assignment is no longer atomic.
  • Mitigation: Implementation report documents the GIL assumption; the project's Python version is pinned and the implementation is correct under the pinned version. If/when the project moves to free-threaded mode, this task is revisited (would add an atomic library wrapper or threading.Lock around the snapshot).

Risk 2: Polling thread starves under thermal-throttle (the very condition it is observing)

  • Risk: Under heavy throttle, NORMAL-priority threads may not get scheduled; the publisher's poll latency exceeds 1 s and the throttle-detection latency contract is violated.
  • Mitigation: The polling thread is daemonic but at NORMAL priority; jetson-stats internal calls typically take 30–80 ms, well under the 1 s budget. AC-2 is the canonical test; if it fails on real hardware, the operator may bump the priority via config (out of scope this cycle).

Risk 3: jetson-stats internal threading interferes with our polling thread

  • Risk: jtop runs its own background thread; concurrent access from our polling thread + an operator tool could corrupt jtop's state.
  • Mitigation: This publisher is the SINGLE owner of jtop access in the companion process. Operator tools (C12) read FDR records, not the live publisher. Documented in the implementation report.

Risk 4: FDR record emission rate spikes during rapid throttle oscillation

  • Risk: A pathological thermal scenario could oscillate at the poll rate, emitting one record per second sustained — affecting AC-NEW-3 segment-size budgets.
  • Mitigation: 1 record per second sustained is well within the C13 writer's throughput budget (200 Hz aggregate per AZ-291); even worst-case oscillation is benign. No additional rate limit is needed for FDR transition records.

Runtime Completeness

  • Named capability: ThermalState publisher + lock-free atomic snapshot + 1 Hz background polling (architecture / E-C7 / AC-NEW-5 / D-CROSS-LATENCY-1).
  • Production code that must exist: real ThermalStatePublisher class with real background thread, real jtop / pynvml poll, real lock-free _atomic_snapshot reference swap, real FDR record emission via the injected FdrClient.
  • Allowed external stubs: tests MAY substitute a FakeJtopSource and a FakeFdrClient (AC-2..AC-8); production wiring uses real jtop + real AZ-273 FdrClient.
  • Unacceptable substitutes: a polling loop that uses time.sleep without a real Clock injection (would break test determinism); a snapshot field guarded by a threading.Lock on the read path (would violate the wait-free read requirement under F3 hot path); a "warn but always default-safe" mode that never raises TelemetryUnavailableError from start() (would hide misconfigured deployments — exactly the failure mode AC-5 prevents).

Implementation Notes (2026-05-12, batch 26)

Four task-spec → as-built deltas:

  1. ThermalReading intermediate DTO — spec said source read() returns a ThermalState. As-built, sources return a ThermalReading (no measured_at_ns, no is_telemetry_available) and the publisher stamps both fields from its injected Clock. This keeps _JtopSource / _PynvmlSource from calling time.monotonic_ns() directly, which tests/_meta/test_no_direct_time_in_components.py (Invariant 2) forbids in any src/gps_denied_onboard/components/** file. Spec already required a Clock-injectable publisher; this delta extends the contract one node deeper into the source classes.

  2. Telemetry backends are an optional pyproject group — spec said "this task introduces no new third-party dependencies — jtop and pynvml are both already pinned by the description.md key dependencies table" but neither was actually in pyproject.toml. Added a [telemetry] optional extra (jetson-stats>=4.2, pynvml>=11.5) so Tier-1 installs (.[dev]) stay slim and Tier-2 Jetson installs (.[dev,inference,telemetry]) pull them in. Both backends remain runtime-optional: the publisher discovers them at start() time and raises TelemetryUnavailableError only when both are absent.

  3. _poll_once is the deterministic test seam, NOT start(): the AC-1..AC-4 / AC-8 tests script the source and drive the clock under a _ManualClock, whereas start() spawns the background polling thread, which would race those deterministic calls (observed in the first test iteration). Tests bypass start() via a _prime_without_thread() helper that runs the source-selection + initial-poll path without spawning the thread. start() + stop() are still exercised by AC-5 (cold-start fallback), AC-6 (idempotency), and one integration test that proves the background thread actually polls.

  4. c7.thermal_transition registered in the FDR schema — added the kind to KNOWN_PAYLOAD_KEYS in src/gps_denied_onboard/fdr_client/records.py with the six payload fields the spec names (previous_state, new_state, gpu_temp_c, cpu_temp_c, measured_clock_mhz, measured_at_ns). The AZ-272 fixture (tests/unit/test_az272_fdr_record_schema.py::_kind_payload) was extended with a sample payload so the every-known-kind roundtrip test passes for the new kind.
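
Deltas 1 and 3 together suggest a shape like the sketch below: the source returns a ThermalReading with no timestamps, and a single poll step stamps it from the injected clock. All names here (ManualClock, ScriptedSource, poll_once as a free function) are illustrative stand-ins for the as-built _ManualClock, fake source, and _poll_once, whose real signatures are not shown in this document.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's error type."""


@dataclass(frozen=True)
class ThermalReading:
    # Raw source output: no measured_at_ns, no is_telemetry_available.
    thermal_throttle_active: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[int]


@dataclass(frozen=True)
class ThermalState:
    thermal_throttle_active: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[int]
    measured_at_ns: int
    is_telemetry_available: bool


class ManualClock:
    """Fake clock that only moves when the test advances it."""

    def __init__(self, start_ns: int = 0) -> None:
        self._now_ns = start_ns

    def monotonic_ns(self) -> int:
        return self._now_ns

    def advance(self, delta_ns: int) -> None:
        self._now_ns += delta_ns


class ScriptedSource:
    """Fake source that replays readings or raises scripted errors."""

    def __init__(self, script: List[Union[ThermalReading, Exception]]) -> None:
        self._script = list(script)

    def read(self) -> ThermalReading:
        item = self._script.pop(0)
        if isinstance(item, Exception):
            raise item
        return item


def poll_once(source: ScriptedSource, clock: ManualClock) -> ThermalState:
    """One deterministic poll: read the source, stamp with the injected clock."""
    now_ns = clock.monotonic_ns()
    try:
        r = source.read()
    except TelemetryUnavailableError:
        # Default-safe: throttle False, temperatures None, availability False.
        return ThermalState(False, None, None, None, now_ns, False)
    return ThermalState(
        r.thermal_throttle_active, r.gpu_temp_c, r.cpu_temp_c,
        r.measured_clock_mhz, now_ns, True,
    )
```

Because no thread is involved, a test can call poll_once, advance the clock by exactly one second, and call it again with fully predictable timestamps.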

As-built file map

  • src/gps_denied_onboard/components/c7_inference/thermal_publisher.py — ThermalStatePublisher, ThermalSource Protocol, ThermalReading intermediate DTO, _JtopSource (Tier-2 jtop binding), _PynvmlSource (Tier-2 NVML fallback), _default_safe_snapshot (Invariant I-6 enforcement).
  • src/gps_denied_onboard/components/c7_inference/__init__.py — re-exports ThermalStatePublisher, ThermalSource, ThermalReading.
  • src/gps_denied_onboard/fdr_client/records.py — c7.thermal_transition added to KNOWN_PAYLOAD_KEYS.
  • pyproject.toml — new [telemetry] optional dep group (jetson-stats, pynvml).
  • tests/unit/c7_inference/test_thermal_publisher.py — 14 tests (AC-1 x2, AC-2, AC-3, AC-4, AC-5 x2, AC-6 x2, background-thread integration, AC-8, NFR-default-safe, Protocol structural, Invariant I-6).
  • tests/unit/test_az272_fdr_record_schema.py — _kind_payload extended with the new kind.

AC-7 (F3-hot-path non-interference) and the wait-free p99 microbench in AC-1 are Tier-2 perf concerns; the Tier-2 validation task will cover them on real Jetson hardware.