gps-denied-onboard/_docs/03_implementation/batch_26_cycle1_report.md
Oleksandr Bezdieniezhnykh 49a67f770d [AZ-302] C7 ThermalStatePublisher — jtop/NVML 1 Hz background poller
Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton
background-thread publisher that polls jtop (preferred) or pynvml
(fallback) at config.thermal_poll_hz, stores an atomic ThermalState
snapshot, and emits c7.thermal_transition FDR records on every throttle
flip with a WARN log on entry and an INFO log on exit. Default-safe on
TelemetryUnavailableError per Invariant I-6 with a 1-Hz rate-limited
WARN.

Sources return a raw ThermalReading; the publisher stamps measured_at_ns
via its injected Clock so _JtopSource / _PynvmlSource stay clean of
direct time.* calls (Invariant 2). _poll_once is the deterministic test
seam — start() spawns the production thread.

- c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS
- [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject
- 14 unit tests (AC-1..AC-6, AC-8, NFR-default-safe, structural)
  green; AC-7 / AC-1 microbench / NFR-perf-poll Tier-2 deferred
- full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 10:33:37 +03:00


Batch 26 / Cycle 1 — Implementation Report

Date: 2026-05-12
Tasks: AZ-302 (C7 ThermalStatePublisher: jtop/NVML 1 Hz background)
Story points landed: 3
Status: complete (AZ-302 → In Testing)

Scope summary

Single-task batch, continuing the 1-task cadence the user reaffirmed when picking option A after batches 24 and 25. AZ-302 was the deferred 3-pointer surfaced by the batch-25 narrow-down (background thread, two telemetry backends, FDR record kind, log rate-limiting). AZ-304 (C6 Postgres schema, 2pt) remains queued for batch 27.

Files added / modified

New

  • src/gps_denied_onboard/components/c7_inference/thermal_publisher.py: ThermalStatePublisher (start/stop/read/is_running, _poll_once test seam, atomic snapshot, source selection), ThermalSource runtime-checkable Protocol, ThermalReading intermediate DTO, _JtopSource (jtop binding, Tier-2 production path), _PynvmlSource (NVML fallback), _default_safe_snapshot (Invariant I-6 enforcement).
  • tests/unit/c7_inference/test_thermal_publisher.py: 14 deterministic tests (AC-1..AC-6, AC-8, NFR-default-safe, Protocol structural, Invariant I-6) plus one real-thread integration test that proves the background poller actually polls and emits transitions.
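The split between the raw reading and the stamped snapshot can be illustrated with a minimal sketch. It is not the as-built module: `SketchPublisher`, `FixedClock`, `ScriptedSource`, the `throttled` field, and the exact field set of `ThermalState` are assumptions made for illustration. Only the names `ThermalReading`, `ThermalSource`, `_poll_once`, `measured_at_ns`, and `is_telemetry_available` come from this report.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


class Clock(Protocol):
    """Injected time source (assumed shape: a single monotonic_ns method)."""

    def monotonic_ns(self) -> int: ...


@dataclass(frozen=True)
class ThermalReading:
    """Raw measurement from a source; deliberately carries no timestamp."""

    gpu_temp_c: float
    cpu_temp_c: float
    measured_clock_mhz: int
    throttled: bool  # hypothetical field, not confirmed by the report


@dataclass(frozen=True)
class ThermalState:
    """Published snapshot: the reading plus publisher-owned metadata."""

    gpu_temp_c: float
    cpu_temp_c: float
    measured_clock_mhz: int
    throttled: bool
    measured_at_ns: int
    is_telemetry_available: bool


@runtime_checkable
class ThermalSource(Protocol):
    def read(self) -> ThermalReading: ...


class SketchPublisher:
    """Hypothetical stand-in for ThermalStatePublisher (no thread here)."""

    def __init__(self, source: ThermalSource, clock: Clock) -> None:
        self._source = source
        self._clock = clock
        self._snapshot: Optional[ThermalState] = None

    def _poll_once(self) -> None:
        reading = self._source.read()
        # Build an immutable snapshot, then publish it with one attribute
        # rebind so a reader sees either the old or the new object.
        self._snapshot = ThermalState(
            gpu_temp_c=reading.gpu_temp_c,
            cpu_temp_c=reading.cpu_temp_c,
            measured_clock_mhz=reading.measured_clock_mhz,
            throttled=reading.throttled,
            measured_at_ns=self._clock.monotonic_ns(),  # stamped here only
            is_telemetry_available=True,
        )

    def read(self) -> Optional[ThermalState]:
        return self._snapshot


class FixedClock:
    """Test double that always reports the same instant."""

    def __init__(self, now_ns: int) -> None:
        self._now_ns = now_ns

    def monotonic_ns(self) -> int:
        return self._now_ns


class ScriptedSource:
    """Test double; satisfies ThermalSource structurally, no inheritance."""

    def read(self) -> ThermalReading:
        return ThermalReading(61.5, 55.0, 918, throttled=False)
```

Because the sources never see a clock, a test can drive `_poll_once()` with `FixedClock(42)` and assert on `measured_at_ns` directly, which is the property the Invariant 2 meta-test is protecting.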

Modified

  • src/gps_denied_onboard/components/c7_inference/__init__.py: re-exports ThermalStatePublisher, ThermalSource, ThermalReading.
  • src/gps_denied_onboard/fdr_client/records.py: c7.thermal_transition registered in KNOWN_PAYLOAD_KEYS with six documented payload keys (previous_state, new_state, gpu_temp_c, cpu_temp_c, measured_clock_mhz, measured_at_ns).
  • pyproject.toml: new [telemetry] optional dep group (jetson-stats>=4.2, pynvml>=11.5). Tier-1 (.[dev]) install stays slim; Tier-2 Jetson install adds .[telemetry].
  • tests/unit/test_az272_fdr_record_schema.py: _kind_payload fixture extended with a sample c7.thermal_transition payload so the every-known-kind roundtrip test passes.
  • _docs/02_tasks/todo/AZ-302_c7_thermal_publisher.md → moved to _docs/02_tasks/done/; appended ## Implementation Notes (2026-05-12, batch 26) documenting the four task-spec → as-built deltas.
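For readers unfamiliar with the FDR schema check, the sketch below shows one way a key registry like this could validate a c7.thermal_transition payload. The six key names are taken from this report; the registry shape (record kind mapped to a frozenset of keys), the `validate_payload` helper, and the sample state values are assumptions and may not match fdr_client.records.

```python
from typing import Any, Mapping

# Assumed shape: record kind -> exact set of allowed payload keys.
KNOWN_PAYLOAD_KEYS: dict[str, frozenset[str]] = {
    "c7.thermal_transition": frozenset(
        {
            "previous_state",
            "new_state",
            "gpu_temp_c",
            "cpu_temp_c",
            "measured_clock_mhz",
            "measured_at_ns",
        }
    ),
}


def validate_payload(kind: str, payload: Mapping[str, Any]) -> None:
    """Hypothetical helper: reject unknown kinds and mismatched key sets."""
    expected = KNOWN_PAYLOAD_KEYS.get(kind)
    if expected is None:
        raise KeyError(f"unknown FDR record kind: {kind}")
    actual = set(payload)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"{kind}: missing={missing} extra={extra}")


# Sample payload; the state labels are placeholders, not the real enum.
sample_transition = {
    "previous_state": "nominal",
    "new_state": "throttled",
    "gpu_temp_c": 87.0,
    "cpu_temp_c": 79.5,
    "measured_clock_mhz": 612,
    "measured_at_ns": 1_500_000_000,
}
validate_payload("c7.thermal_transition", sample_transition)
```

An exact-match check (rather than a subset check) is what makes an every-known-kind roundtrip fixture useful: a key added to the emitter but not to the registry fails immediately.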

Design decisions

  1. ThermalReading intermediate DTO instead of source-stamped ThermalState. The spec implied source read() returns a fully stamped ThermalState. As-built, sources return a ThermalReading (raw measurement only) and the publisher stamps measured_at_ns + is_telemetry_available from its injected Clock. This keeps the _JtopSource / _PynvmlSource modules from calling time.monotonic_ns() directly, which tests/_meta/test_no_direct_time_in_components.py (Invariant 2) forbids in any src/gps_denied_onboard/components/** file. Documented in the spec's as-built notes section.

  2. _poll_once is the deterministic test seam — not start(). The AC-1..AC-4 / AC-8 tests script the source and drive a _ManualClock forward; the production start() would spawn a background thread that races those deterministic calls (observed in the first test iteration). Tests bypass start() via a _prime_without_thread() helper that exercises the source-selection + initial-poll path without the thread. start() / stop() remain covered by AC-5 (cold-start fallback), AC-6 (idempotency), and a real-thread integration test at 200 Hz that proves the background loop polls the source and emits transition records.

  3. Telemetry backends moved to a [telemetry] optional dep group. The spec claimed both jtop and pynvml were already pinned by description.md, but neither was in pyproject.toml. They are now a [telemetry] optional extra so Tier-1 installs (.[dev]) stay slim and Tier-2 Jetson installs (.[dev,inference,telemetry]) pull them in. Both backends remain runtime-optional: the publisher discovers them at start() time and only raises TelemetryUnavailableError when both are absent.

  4. WARN-on-unavailable rate-limiting uses the injected Clock, not time.monotonic_ns(). The 1-second rate-limit window is enforced by now_ns - last_warn_ns >= 1_000_000_000 against self._clock.monotonic_ns(), so replay-deterministic tests can simulate rate-limit windows by advancing the manual clock.
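Decisions 2 and 4 combine naturally in a test: a manual clock is advanced by hand and the rate-limit check reads only that clock. The sketch below is a minimal illustration under assumptions. `ManualClock`, `RateLimitedWarn`, and `advance_ns` are hypothetical names, not the as-built `_ManualClock` or publisher internals; only the comparison `now_ns - last_warn_ns >= 1_000_000_000` is taken from this report.

```python
import logging
from typing import Optional

ONE_SECOND_NS = 1_000_000_000


class ManualClock:
    """Deterministic clock: time moves only when the test says so."""

    def __init__(self, start_ns: int = 0) -> None:
        self._now_ns = start_ns

    def monotonic_ns(self) -> int:
        return self._now_ns

    def advance_ns(self, delta_ns: int) -> None:
        self._now_ns += delta_ns


class RateLimitedWarn:
    """Emit at most one WARN per window, measured on the injected clock."""

    def __init__(
        self,
        clock: ManualClock,
        logger: logging.Logger,
        window_ns: int = ONE_SECOND_NS,
    ) -> None:
        self._clock = clock
        self._logger = logger
        self._window_ns = window_ns
        self._last_warn_ns: Optional[int] = None  # None until first WARN

    def warn(self, message: str) -> bool:
        """Log and return True if the window has elapsed, else return False."""
        now_ns = self._clock.monotonic_ns()
        if (
            self._last_warn_ns is not None
            and now_ns - self._last_warn_ns < self._window_ns
        ):
            return False
        self._logger.warning(message)
        self._last_warn_ns = now_ns
        return True
```

A test can then call `warn()` twice at the same instant, advance the clock by 999_999_999 ns, and confirm suppression still holds until one more nanosecond elapses, all without sleeping.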

AC coverage

| AC | Test name | Status |
|---|---|---|
| AC-1 | test_ac1_read_returns_latest_snapshot + test_ac1_read_is_wait_free_object_identity | passing |
| AC-2 | test_ac2_throttle_entry_emits_fdr_and_warn | passing |
| AC-3 | test_ac3_throttle_exit_emits_fdr_and_info | passing |
| AC-4 | test_ac4_telemetry_unavailable_defaults_safe_and_rate_limits | passing |
| AC-5 | test_ac5_start_raises_when_no_source_available + test_ac5_jtop_falls_back_to_pynvml | passing |
| AC-6 | test_ac6_double_start_is_noop + test_ac6_double_stop_is_noop | passing |
| AC-7 | F3 hot-path non-interference perf | Tier-2 deferred |
| AC-8 | test_ac8_fdr_record_round_trips_through_wire_format | passing |
| NFR-perf-poll | microbench poll p99 ≤ 100 ms | Tier-2 deferred |
| NFR-perf-read | p99 ≤ 1 µs wait-free read microbench | Tier-2 deferred |
| NFR-default-safe | test_nfr_default_safe_on_first_poll_failure | passing |
| Structural | test_scripted_source_satisfies_protocol | passing |
| Invariant I-6 | test_default_safe_snapshot_invariant_i6 | passing |

AC-7 and the two Tier-2 microbenches need real Jetson hardware to be meaningful; they roll into the Tier-2 validation task with AZ-301's AC-8 (read_host_tuple on real NVML/L4T).

Test run

./.venv/bin/python -m pytest tests/unit tests/_meta -q: 1140 passed, 11 skipped (Tier-2 / CUDA / cmake / actionlint), no failures. Mypy strict on the three production-source files we touched: clean. Ruff on the same set: clean.

Self-review

  • Production code (thermal_publisher.py, __init__.py, records.py, pyproject.toml): no use of time.* in component modules (Invariant 2 meta-test passes); FDR kind documented in the schema-keys table; telemetry backends are optional; the publisher raises TelemetryUnavailableError on cold-start with no source per AC-5; idempotent start/stop per AC-6; rate-limited WARN per AC-4.
  • Tests: every AC except AC-7 has at least one named test; the AC-7 hot-path bench, the AC-1 wait-free read microbench (NFR-perf-read), and NFR-perf-poll are Tier-2 deferred (documented in the report + the spec as-built notes).
  • Lint / type: ruff + mypy strict clean on the modified set.
  • Docs: spec moved to done/ with the as-built delta notes.
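The cold-start behaviour checked above (prefer jtop, fall back to pynvml, raise only when both are absent) can be sketched as an ordered list of factories. This is an illustration under assumptions, not the as-built selection code: `select_source` and `lazy_import_factory` are hypothetical helpers, and the local `TelemetryUnavailableError` merely stands in for the project's exception of the same name.

```python
import importlib
from typing import Callable, Sequence


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's exception of the same name."""


def lazy_import_factory(module_name: str) -> Callable[[], object]:
    """Defer the import so a missing optional extra fails only when tried."""

    def factory() -> object:
        # Raises ImportError when the optional dependency is not installed.
        return importlib.import_module(module_name)

    return factory


def select_source(
    factories: Sequence[tuple[str, Callable[[], object]]],
) -> tuple[str, object]:
    """Return the first backend that constructs; raise if none does."""
    failures: list[str] = []
    for name, factory in factories:
        try:
            return name, factory()
        except (ImportError, TelemetryUnavailableError) as exc:
            failures.append(f"{name}: {exc}")
    raise TelemetryUnavailableError(
        "no thermal telemetry source available (" + "; ".join(failures) + ")"
    )


# Preference order from the report: jtop first, pynvml as the fallback.
DEFAULT_FACTORIES = [
    ("jtop", lazy_import_factory("jtop")),
    ("pynvml", lazy_import_factory("pynvml")),
]
```

Keeping the imports inside the factories is what lets a Tier-1 `.[dev]` install import the component package without either backend present.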

Known gaps

  • AZ-302 AC-7 + AC-1 microbench + NFR-perf-poll → require real Jetson thermal source under load; rolled into the Tier-2 validation task.
  • AZ-304 (C6 Postgres schema) deferred to batch 27 — testcontainers setup + Alembic baseline are independently meaningful.
  • _JtopSource.read() catches a bare Exception because jtop's internal exception types are not stable across jtop versions; the exception is always rewrapped as TelemetryUnavailableError (never swallowed silently) so this does not violate the "no silent error suppression" coding rule.
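The rewrap pattern in the last gap can be sketched as follows. This is a minimal illustration, assuming an injected `read_stats` callable in place of the real jtop binding; `JtopSourceSketch` and the local `TelemetryUnavailableError` are hypothetical stand-ins. Chaining with `raise ... from exc` keeps the original jtop error on `__cause__`, so the broad catch loses no diagnostic information.

```python
from typing import Callable, Mapping


class TelemetryUnavailableError(RuntimeError):
    """Stand-in for the project's exception of the same name."""


class JtopSourceSketch:
    """Hypothetical source; read_stats stands in for the jtop binding."""

    def __init__(self, read_stats: Callable[[], Mapping[str, float]]) -> None:
        self._read_stats = read_stats

    def read(self) -> Mapping[str, float]:
        try:
            return self._read_stats()
        except Exception as exc:
            # Broad catch is deliberate: the backend's exception types are
            # not assumed stable. The error is always re-raised, chained.
            raise TelemetryUnavailableError(
                f"jtop read failed: {exc!r}"
            ) from exc
```

A caller that needs the root cause can inspect `err.__cause__`, and the traceback printed by the logger shows both exceptions.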