Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton background-thread publisher that polls jtop (preferred) or pynvml (fallback) at config.thermal_poll_hz, stores an atomic ThermalState snapshot, and emits c7.thermal_transition FDR records on every throttle flip, with a WARN log on entry and an INFO log on exit. Default-safe on TelemetryUnavailableError per Invariant I-6, with a 1 Hz rate-limited WARN.

Sources return a raw ThermalReading; the publisher stamps measured_at_ns via its injected Clock so _JtopSource / _PynvmlSource stay clean of direct time.* calls (Invariant 2). _poll_once is the deterministic test seam; start() spawns the production thread.

- c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS
- [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject
- 14 unit tests (AC-1..AC-6, AC-8, NFR-default-safe, structural) green; AC-7 / AC-1 microbench / NFR-perf-poll Tier-2 deferred
- full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips

Co-authored-by: Cursor <cursoragent@cursor.com>
Batch 26 / Cycle 1 — Implementation Report
- Date: 2026-05-12
- Tasks: AZ-302 (C7 ThermalStatePublisher, jtop/NVML 1 Hz background)
- Story points landed: 3
- Status: complete (AZ-302 → In Testing)
Scope summary
Single-task batch, continuing the 1-task cadence the user reaffirmed when picking option A after batches 24 and 25. AZ-302 was the deferred 3-pointer surfaced by the batch-25 narrow-down (background thread, two telemetry backends, FDR record kind, log rate-limiting). AZ-304 (C6 Postgres schema, 2pt) remains queued for batch 27.
Files added / modified
New
- `src/gps_denied_onboard/components/c7_inference/thermal_publisher.py`: `ThermalStatePublisher` (`start` / `stop` / `read` / `is_running`, `_poll_once` test seam, atomic snapshot, source selection), `ThermalSource` runtime-checkable Protocol, `ThermalReading` intermediate DTO, `_JtopSource` (jtop binding, Tier-2 production path), `_PynvmlSource` (NVML fallback), `_default_safe_snapshot` (Invariant I-6 enforcement).
- `tests/unit/c7_inference/test_thermal_publisher.py`: 14 deterministic tests (AC-1..AC-6, AC-8, NFR-default-safe, Protocol structural, Invariant I-6) plus one real-thread integration test that proves the background poller actually polls and emits transitions.
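The split between a raw reading DTO and a runtime-checkable source Protocol can be pictured with a minimal sketch. The field names on `ThermalReading` and the `ScriptedSource` test double are assumptions for illustration; the report only names the types, not their members.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class ThermalReading:
    """Raw measurement only, no timestamp (field names are assumptions)."""

    gpu_temp_c: float
    cpu_temp_c: float
    measured_clock_mhz: int
    is_throttled: bool


@runtime_checkable
class ThermalSource(Protocol):
    """Anything with a read() method that yields a raw reading."""

    def read(self) -> ThermalReading: ...


class ScriptedSource:
    """Hypothetical test double that replays a fixed list of readings."""

    def __init__(self, readings: list[ThermalReading]) -> None:
        self._readings = list(readings)

    def read(self) -> ThermalReading:
        return self._readings.pop(0)


source = ScriptedSource([ThermalReading(61.5, 55.0, 1300, False)])

# isinstance() against a runtime_checkable Protocol only checks that the
# method exists; it does not verify the signature or the return type.
assert isinstance(source, ThermalSource)
assert not isinstance(object(), ThermalSource)
```

A structural test of this kind is roughly what a "scripted source satisfies the Protocol" check amounts to: the double never inherits from `ThermalSource`, it only has to look like one.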
Modified
- `src/gps_denied_onboard/components/c7_inference/__init__.py`: re-exports `ThermalStatePublisher`, `ThermalSource`, `ThermalReading`.
- `src/gps_denied_onboard/fdr_client/records.py`: `c7.thermal_transition` registered in `KNOWN_PAYLOAD_KEYS` with six documented payload keys (`previous_state`, `new_state`, `gpu_temp_c`, `cpu_temp_c`, `measured_clock_mhz`, `measured_at_ns`).
- `pyproject.toml`: new `[telemetry]` optional dep group (`jetson-stats>=4.2`, `pynvml>=11.5`). Tier-1 (`.[dev]`) install stays slim; Tier-2 Jetson install adds `.[telemetry]`.
- `tests/unit/test_az272_fdr_record_schema.py`: `_kind_payload` fixture extended with a sample `c7.thermal_transition` payload so the every-known-kind roundtrip test passes.
- `_docs/02_tasks/todo/AZ-302_c7_thermal_publisher.md` → moved to `_docs/02_tasks/done/`; appended `## Implementation Notes (2026-05-12, batch 26)` documenting the four task-spec → as-built deltas.
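As an illustration of what registering the six payload keys buys, here is a minimal sketch of a key check. The shape of `KNOWN_PAYLOAD_KEYS` (a dict from record kind to a frozenset of key names), the `validate_payload` helper, and the sample values are all assumptions; only the kind name and the six key names come from this report.

```python
# Assumed shape: record kind -> exact set of allowed payload keys.
KNOWN_PAYLOAD_KEYS: dict[str, frozenset[str]] = {
    "c7.thermal_transition": frozenset({
        "previous_state",
        "new_state",
        "gpu_temp_c",
        "cpu_temp_c",
        "measured_clock_mhz",
        "measured_at_ns",
    }),
}


def validate_payload(kind: str, payload: dict[str, object]) -> None:
    """Hypothetical helper: reject unknown kinds and mismatched key sets."""
    expected = KNOWN_PAYLOAD_KEYS.get(kind)
    if expected is None:
        raise KeyError(f"unknown FDR record kind: {kind}")
    actual = set(payload)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"{kind}: missing={missing} extra={extra}")


# Sample values are made up for the example.
sample = {
    "previous_state": "nominal",
    "new_state": "throttled",
    "gpu_temp_c": 92.0,
    "cpu_temp_c": 78.5,
    "measured_clock_mhz": 612,
    "measured_at_ns": 1_000_000_000,
}
validate_payload("c7.thermal_transition", sample)  # passes silently
```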
Design decisions
- `ThermalReading` intermediate DTO instead of source-stamped `ThermalState`. The spec implied source `read()` returns a fully stamped `ThermalState`. As-built, sources return a `ThermalReading` (raw measurement only) and the publisher stamps `measured_at_ns` + `is_telemetry_available` from its injected `Clock`. This keeps the `_JtopSource` / `_PynvmlSource` modules from calling `time.monotonic_ns()` directly, which `tests/_meta/test_no_direct_time_in_components.py` (Invariant 2) forbids in any `src/gps_denied_onboard/components/**` file. Documented in the spec's as-built notes section.
- `_poll_once` is the deterministic test seam, not `start()`. The AC-1..AC-4 / AC-8 tests script the source and drive a `_ManualClock` forward; the production `start()` would spawn a background thread that races those deterministic calls (observed in the first test iteration). Tests bypass `start()` via a `_prime_without_thread()` helper that exercises the source-selection + initial-poll path without the thread. `start()` / `stop()` remain covered by AC-5 (cold-start fallback), AC-6 (idempotency), and a real-thread integration test at 200 Hz that proves the background loop polls the source and emits transition records.
- Telemetry backends moved to a `[telemetry]` optional dep group. The spec claimed both `jtop` and `pynvml` were already pinned by description.md, but neither was in `pyproject.toml`. They are now a `[telemetry]` optional extra so Tier-1 installs (`.[dev]`) stay slim and Tier-2 Jetson installs (`.[dev,inference,telemetry]`) pull them in. Both backends remain runtime-optional: the publisher discovers them at `start()` time and only raises `TelemetryUnavailableError` when both are absent.
- WARN-on-unavailable rate-limiting uses the injected `Clock`, not `time.monotonic_ns()`. The 1-second rate-limit window is enforced by `now_ns - last_warn_ns >= 1_000_000_000` against `self._clock.monotonic_ns()`, so replay-deterministic tests can simulate rate-limit windows by advancing the manual clock.
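The decisions above fit together roughly as in the sketch below: the publisher stamps raw readings from an injected clock, a single poll step is callable without any thread, and the WARN window is measured against that same clock. This is an illustration under stated assumptions, not the as-built module. `PublisherSketch`, `Reading`, `Snapshot`, `ManualClock`, the boolean `throttled` field, and the abbreviated record dict are all hypothetical, and the default-safe snapshot is left out because the report does not spell out its field values.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional, Protocol

log = logging.getLogger("thermal_sketch")
WARN_WINDOW_NS = 1_000_000_000  # the 1-second rate-limit window


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's exception of the same name."""


class Clock(Protocol):
    def monotonic_ns(self) -> int: ...


class ManualClock:
    """Test clock that only moves when the test advances it."""

    def __init__(self) -> None:
        self.now_ns = 0

    def monotonic_ns(self) -> int:
        return self.now_ns


@dataclass(frozen=True)
class Reading:  # raw measurement, no timestamp
    gpu_temp_c: float
    throttled: bool


@dataclass(frozen=True)
class Snapshot:  # what the publisher stores after stamping
    gpu_temp_c: float
    throttled: bool
    is_telemetry_available: bool
    measured_at_ns: int


class PublisherSketch:
    def __init__(self, read_source: Callable[[], Reading], clock: Clock,
                 emit: Callable[[dict], None]) -> None:
        self._read_source = read_source
        self._clock = clock
        self._emit = emit
        self._snapshot: Optional[Snapshot] = None
        self._last_warn_ns: Optional[int] = None

    def read(self) -> Optional[Snapshot]:
        return self._snapshot  # one attribute load of an immutable object

    def poll_once(self) -> None:
        now_ns = self._clock.monotonic_ns()
        try:
            raw = self._read_source()
        except TelemetryUnavailableError:
            # Rate-limit against the injected clock, never time.monotonic_ns().
            if (self._last_warn_ns is None
                    or now_ns - self._last_warn_ns >= WARN_WINDOW_NS):
                log.warning("thermal telemetry unavailable")
                self._last_warn_ns = now_ns
            return  # default-safe snapshot omitted from this sketch
        previous = self._snapshot
        self._snapshot = Snapshot(raw.gpu_temp_c, raw.throttled, True, now_ns)
        if previous is not None and previous.throttled != raw.throttled:
            level = logging.WARNING if raw.throttled else logging.INFO
            log.log(level, "thermal throttle %s",
                    "entered" if raw.throttled else "exited")
            self._emit({"kind": "c7.thermal_transition",
                        "previous_state": previous.throttled,
                        "new_state": raw.throttled,
                        "measured_at_ns": now_ns})


# Deterministic drive: script the source, advance the clock by hand.
readings = iter([Reading(60.0, False), Reading(92.0, True)])
records: list[dict] = []
clock = ManualClock()
pub = PublisherSketch(lambda: next(readings), clock, records.append)
pub.poll_once()
clock.now_ns += 1_000_000_000
pub.poll_once()
assert len(records) == 1 and records[0]["new_state"] is True
```

Because nothing in the poll step reads wall time, a test can place two failures 0.5 s apart and assert exactly one WARN, then advance another 0.5 s and assert the second.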
AC coverage
| AC | Test name | Status |
|---|---|---|
| AC-1 | `test_ac1_read_returns_latest_snapshot` + `test_ac1_read_is_wait_free_object_identity` | passing |
| AC-2 | `test_ac2_throttle_entry_emits_fdr_and_warn` | passing |
| AC-3 | `test_ac3_throttle_exit_emits_fdr_and_info` | passing |
| AC-4 | `test_ac4_telemetry_unavailable_defaults_safe_and_rate_limits` | passing |
| AC-5 | `test_ac5_start_raises_when_no_source_available` + `test_ac5_jtop_falls_back_to_pynvml` | passing |
| AC-6 | `test_ac6_double_start_is_noop` + `test_ac6_double_stop_is_noop` | passing |
| AC-7 | F3 hot-path non-interference perf | Tier-2 deferred |
| AC-8 | `test_ac8_fdr_record_round_trips_through_wire_format` | passing |
| NFR-perf-poll | microbench poll p99 ≤ 100 ms | Tier-2 deferred |
| NFR-perf-read | wait-free read microbench, p99 ≤ 1 µs | Tier-2 deferred |
| NFR-default-safe | `test_nfr_default_safe_on_first_poll_failure` | passing |
| Structural | `test_scripted_source_satisfies_protocol` | passing |
| Invariant I-6 | `test_default_safe_snapshot_invariant_i6` | passing |
AC-7 and the two Tier-2 microbenches need real Jetson hardware to be
meaningful; they roll into the Tier-2 validation task with AZ-301's
AC-8 (read_host_tuple on real NVML/L4T).
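For orientation, the deferred p99 measurements could take roughly the following shape once Tier-2 hardware is available. This is a generic sketch, not the planned bench: `p99_latency_ns` is a hypothetical helper, and because per-call timer overhead is not negligible against a 1 µs budget, a real read bench would likely time batches of calls instead of single ones.

```python
import statistics
import time
from typing import Callable


def p99_latency_ns(fn: Callable[[], object], iterations: int = 10_000) -> float:
    """Time each call to fn and return the 99th-percentile latency in ns."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - t0)
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(samples, n=100)[98]


snapshot_holder = {"latest": object()}
p99 = p99_latency_ns(lambda: snapshot_holder["latest"])
print(f"p99 = {p99:.0f} ns (includes timer overhead)")
```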
Test run
`./.venv/bin/python -m pytest tests/unit tests/_meta -q` → 1140 passed, 11 skipped (Tier-2 / CUDA / cmake / actionlint), no failures.
Mypy strict on the three production-source files we touched: clean.
Ruff on the same set: clean.
Self-review
- Production code (`thermal_publisher.py`, `__init__.py`, `records.py`, `pyproject.toml`): no use of `time.*` in component modules (Invariant 2 meta-test passes); FDR kind documented in the schema-keys table; telemetry backends are optional; the publisher raises `TelemetryUnavailableError` on cold-start with no source per AC-5; idempotent start/stop per AC-6; rate-limited WARN per AC-4.
- Tests: every AC has at least one named assertion; the AC-1 wait-free microbench and AC-7 hot-path bench are Tier-2 deferred (documented in the report + the spec as-built notes).
- Lint / type: ruff + mypy strict clean on the modified set.
- Docs: spec moved to `done/` with the as-built delta notes.
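The cold-start behaviour reviewed above (prefer jtop, fall back to pynvml, raise only when both are absent) can be sketched as follows. `select_backend` and its injectable `probe` are hypothetical, and probing with `importlib.util.find_spec` is an assumption; the as-built publisher may instead probe by constructing each source.

```python
import importlib.util
from typing import Callable, Sequence


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's exception of the same name."""


def _installed(module_name: str) -> bool:
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


def select_backend(preferred: Sequence[str] = ("jtop", "pynvml"),
                   probe: Callable[[str], bool] = _installed) -> str:
    """Return the first available backend name, in preference order."""
    for name in preferred:
        if probe(name):
            return name
    raise TelemetryUnavailableError(
        f"none of {list(preferred)} is importable")


# With only pynvml present, the fallback is chosen.
assert select_backend(probe=lambda name: name == "pynvml") == "pynvml"
```

Injecting the probe keeps the fallback testable on a Tier-1 machine where neither optional package is installed.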
Known gaps
- AZ-302 AC-7 + AC-1 microbench + NFR-perf-poll → require real Jetson thermal source under load; rolled into the Tier-2 validation task.
- AZ-304 (C6 Postgres schema) deferred to batch 27 — testcontainers setup + Alembic baseline are independently meaningful.
- `_JtopSource.read()` catches a bare `Exception` because jtop's internal exception types are not stable across jtop versions; the exception is always rewrapped as `TelemetryUnavailableError` (never swallowed silently) so this does not violate the "no silent error suppression" coding rule.
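The catch-and-rewrap pattern in the last gap looks roughly like this. The `fetch_stats` callable and the `gpu_temp_c` key are hypothetical stand-ins for the jtop binding, which is deliberately not reproduced here.

```python
from typing import Callable, Mapping


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's exception of the same name."""


def read_gpu_temp_c(fetch_stats: Callable[[], Mapping[str, float]]) -> float:
    """Read one value, converting any backend failure into a single type."""
    try:
        return float(fetch_stats()["gpu_temp_c"])
    except Exception as exc:  # broad on purpose: backend types are unstable
        # "from exc" keeps the original exception on __cause__, so the
        # failure is translated, not suppressed.
        raise TelemetryUnavailableError(f"read failed: {exc!r}") from exc


def broken_backend() -> Mapping[str, float]:
    raise OSError("service socket closed")


try:
    read_gpu_temp_c(broken_backend)
except TelemetryUnavailableError as err:
    assert isinstance(err.__cause__, OSError)
```

Callers then handle one exception type regardless of which jtop version raised what, while the chained cause stays available for logs and debugging.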