mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 10:11:13 +00:00
[AZ-302] C7 ThermalStatePublisher — jtop/NVML 1 Hz background poller
Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton
background-thread publisher that polls jtop (preferred) or pynvml
(fallback) at config.thermal_poll_hz, stores an atomic ThermalState
snapshot, and emits c7.thermal_transition FDR records on every throttle
flip with a WARN log on entry and an INFO log on exit. Default-safe on
TelemetryUnavailableError per Invariant I-6 with a 1-Hz rate-limited WARN.

Sources return a raw ThermalReading; the publisher stamps measured_at_ns
via its injected Clock so _JtopSource / _PynvmlSource stay clean of direct
time.* calls (Invariant 2). _poll_once is the deterministic test seam;
start() spawns the production thread.

- c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS
- [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject
- 14 unit tests (AC-1..AC-6, AC-8, NFR-default-safe, structural) green;
  AC-7 / AC-1 microbench / NFR-perf-poll Tier-2 deferred
- full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -175,3 +175,26 @@ Then the record matches `_docs/02_document/contracts/shared_fdr_client/fdr_recor
- **Production code that must exist**: real `ThermalStatePublisher` class with real background thread, real `jtop` / `pynvml` poll, real lock-free `_atomic_snapshot` reference swap, real FDR record emission via the injected `FdrClient`.
- **Allowed external stubs**: tests MAY substitute a `FakeJtopSource` and a `FakeFdrClient` (AC-2..AC-8); production wiring uses real `jtop` + real AZ-273 `FdrClient`.
- **Unacceptable substitutes**: a polling loop that uses `time.sleep` without a real `Clock` injection (would break test determinism); a snapshot field guarded by a `threading.Lock` on the read path (would violate the wait-free read requirement under F3 hot path); a "warn but always default-safe" mode that never raises `TelemetryUnavailableError` from `start()` (would hide misconfigured deployments, which is exactly the failure mode AC-5 prevents).
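
The lock-free `_atomic_snapshot` reference swap named above can be sketched as follows. This is a minimal illustration, not the as-built class: `SnapshotHolder`, `publish`, and the `is_throttled` field are hypothetical names, and the sketch assumes CPython, where rebinding an attribute replaces one object reference with another in a single step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThermalState:
    """Immutable snapshot. The field set here is an illustrative assumption."""

    is_throttled: bool
    gpu_temp_c: float
    measured_at_ns: int
    is_telemetry_available: bool


class SnapshotHolder:
    """Single-writer, many-reader holder built on one attribute rebind."""

    def __init__(self, initial: ThermalState) -> None:
        self._atomic_snapshot = initial

    def publish(self, new_state: ThermalState) -> None:
        # Writer (poller thread): the new object is fully built before this
        # line, so rebinding the attribute swaps one reference for another.
        self._atomic_snapshot = new_state

    def read(self) -> ThermalState:
        # Reader (hot path): a single attribute load, no lock acquired.
        return self._atomic_snapshot


holder = SnapshotHolder(ThermalState(False, 55.0, 0, True))
before = holder.read()
holder.publish(ThermalState(True, 92.5, 1_000_000_000, True))
after = holder.read()
```

Because the snapshot is frozen, a reader holding `before` keeps a consistent view even after the writer publishes `after`; nothing on the read path can block.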

## Implementation Notes (2026-05-12, batch 26)

Four task-spec → as-built deltas:

1. **`ThermalReading` intermediate DTO**: spec said source `read()` returns a `ThermalState`. As-built, sources return a `ThermalReading` (no `measured_at_ns`, no `is_telemetry_available`) and the publisher stamps both fields from its injected `Clock`. This keeps `_JtopSource` / `_PynvmlSource` from calling `time.monotonic_ns()` directly, which `tests/_meta/test_no_direct_time_in_components.py` (Invariant 2) forbids in any `src/gps_denied_onboard/components/**` file. Spec already required a `Clock`-injectable publisher; this delta extends the contract one node deeper into the source classes.

2. **Telemetry backends are an optional pyproject group**: spec said this task introduces no new third-party dependencies because `jtop` and `pynvml` are "both already pinned by the description.md key dependencies table", but neither was actually in `pyproject.toml`. Added a `[telemetry]` optional extra (`jetson-stats>=4.2`, `pynvml>=11.5`) so Tier-1 installs (`.[dev]`) stay slim and Tier-2 Jetson installs (`.[dev,inference,telemetry]`) pull them in. Both backends remain runtime-optional: the publisher discovers them at `start()` time and raises `TelemetryUnavailableError` only when both are absent.

3. **`_poll_once` is the deterministic test seam, NOT `start()`**: the AC-1..AC-4 / AC-8 tests script the source + drive the clock under a `_ManualClock`; `start()` spawns the background polling thread, which would race those deterministic calls (observed in first test iteration). Tests bypass `start()` via a `_prime_without_thread()` helper that runs the source-selection + initial-poll path without spawning the thread. `start()` + `stop()` are still exercised by AC-5 (cold-start fallback), AC-6 (idempotency), and one integration test that proves the background thread actually polls.

4. **`c7.thermal_transition` registered in the FDR schema**: added the kind to `KNOWN_PAYLOAD_KEYS` in `src/gps_denied_onboard/fdr_client/records.py` with the six payload fields the spec names (`previous_state`, `new_state`, `gpu_temp_c`, `cpu_temp_c`, `measured_clock_mhz`, `measured_at_ns`). The AZ-272 fixture (`tests/unit/test_az272_fdr_record_schema.py::_kind_payload`) was extended with a sample payload so the every-known-kind roundtrip test passes for the new kind.
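
Deltas 1 and 3 can be pictured together in a small sketch: the source returns a raw `ThermalReading`, the publisher stamps it from an injected clock, and a test drives `_poll_once` directly under a manual clock. The class bodies below are assumptions made for illustration (`SketchPublisher`, `ScriptedSource`, `ManualClock`, and the `is_throttled` field are hypothetical), not the as-built code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ThermalReading:
    """Raw measurement from a source: no timestamp, no availability flag."""

    is_throttled: bool
    gpu_temp_c: float
    cpu_temp_c: float
    measured_clock_mhz: int


@dataclass(frozen=True)
class ThermalState:
    """Published snapshot: the reading plus publisher-stamped fields."""

    is_throttled: bool
    gpu_temp_c: float
    cpu_temp_c: float
    measured_clock_mhz: int
    measured_at_ns: int
    is_telemetry_available: bool


class ManualClock:
    """Test clock that only moves when advanced explicitly."""

    def __init__(self) -> None:
        self._now_ns = 0

    def advance_ns(self, delta_ns: int) -> None:
        self._now_ns += delta_ns

    def monotonic_ns(self) -> int:
        return self._now_ns


class ScriptedSource:
    """Hands out pre-scripted readings in order (hypothetical test double)."""

    def __init__(self, readings: List[ThermalReading]) -> None:
        self._readings = list(readings)

    def read(self) -> ThermalReading:
        return self._readings.pop(0)


class SketchPublisher:
    def __init__(self, source: ScriptedSource, clock: ManualClock) -> None:
        self._source = source
        self._clock = clock
        self.snapshot: Optional[ThermalState] = None

    def _poll_once(self) -> None:
        # The source never touches time.*; only the publisher reads the clock.
        reading = self._source.read()
        self.snapshot = ThermalState(
            is_throttled=reading.is_throttled,
            gpu_temp_c=reading.gpu_temp_c,
            cpu_temp_c=reading.cpu_temp_c,
            measured_clock_mhz=reading.measured_clock_mhz,
            measured_at_ns=self._clock.monotonic_ns(),
            is_telemetry_available=True,
        )


clock = ManualClock()
publisher = SketchPublisher(
    ScriptedSource([ThermalReading(False, 55.0, 48.0, 1300),
                    ThermalReading(True, 93.0, 80.0, 600)]),
    clock,
)
clock.advance_ns(1_000_000_000)
publisher._poll_once()
first = publisher.snapshot
clock.advance_ns(1_000_000_000)
publisher._poll_once()
second = publisher.snapshot
```

No thread is involved, so each `_poll_once()` call sees exactly the clock value the test set, which is the determinism the seam exists to provide.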

### As-built file map

- `src/gps_denied_onboard/components/c7_inference/thermal_publisher.py`: `ThermalStatePublisher`, `ThermalSource` Protocol, `ThermalReading` intermediate DTO, `_JtopSource` (Tier-2 jtop binding), `_PynvmlSource` (Tier-2 NVML fallback), `_default_safe_snapshot` (Invariant I-6 enforcement).
- `src/gps_denied_onboard/components/c7_inference/__init__.py`: re-exports `ThermalStatePublisher`, `ThermalSource`, `ThermalReading`.
- `src/gps_denied_onboard/fdr_client/records.py`: `c7.thermal_transition` added to `KNOWN_PAYLOAD_KEYS`.
- `pyproject.toml`: new `[telemetry]` optional dep group (`jetson-stats`, `pynvml`).
- `tests/unit/c7_inference/test_thermal_publisher.py`: 14 tests (AC-1 x2, AC-2, AC-3, AC-4, AC-5 x2, AC-6 x2, background-thread integration, AC-8, NFR-default-safe, Protocol structural, Invariant I-6).
- `tests/unit/test_az272_fdr_record_schema.py`: `_kind_payload` extended with the new kind.

AC-7 (F3-hot-path non-interference) and the wait-free p99 microbench in AC-1 are Tier-2 perf concerns; the Tier-2 validation task will cover them on real Jetson hardware.
@@ -0,0 +1,141 @@
# Batch 26 / Cycle 1: Implementation Report

**Date**: 2026-05-12
**Tasks**: AZ-302 (C7 ThermalStatePublisher: jtop/NVML 1 Hz background)
**Story points landed**: 3
**Status**: complete (AZ-302 → In Testing)

## Scope summary

Single-task batch, continuing the 1-task cadence the user reaffirmed
when picking option A after batches 24 and 25. AZ-302 was the deferred
3-pointer surfaced by the batch-25 narrow-down (background thread,
two telemetry backends, FDR record kind, log rate-limiting). AZ-304
(C6 Postgres schema, 2pt) remains queued for batch 27.

## Files added / modified

### New

- `src/gps_denied_onboard/components/c7_inference/thermal_publisher.py`:
  `ThermalStatePublisher` (start/stop/read/is_running, `_poll_once`
  test seam, atomic snapshot, source selection), `ThermalSource`
  runtime-checkable Protocol, `ThermalReading` intermediate DTO,
  `_JtopSource` (jtop binding, Tier-2 production path), `_PynvmlSource`
  (NVML fallback), `_default_safe_snapshot` (Invariant I-6 enforcement).
- `tests/unit/c7_inference/test_thermal_publisher.py`: 14 tests in total,
  13 deterministic (AC-1..AC-6, AC-8, NFR-default-safe, Protocol
  structural, Invariant I-6) plus one real-thread integration test that
  proves the background poller actually polls and emits transitions.

### Modified

- `src/gps_denied_onboard/components/c7_inference/__init__.py`:
  re-exports `ThermalStatePublisher`, `ThermalSource`, `ThermalReading`.
- `src/gps_denied_onboard/fdr_client/records.py`:
  `c7.thermal_transition` registered in `KNOWN_PAYLOAD_KEYS` with six
  documented payload keys (`previous_state`, `new_state`, `gpu_temp_c`,
  `cpu_temp_c`, `measured_clock_mhz`, `measured_at_ns`).
- `pyproject.toml`: new `[telemetry]` optional dep group
  (`jetson-stats>=4.2`, `pynvml>=11.5`). Tier-1 (`.[dev]`) install stays
  slim; Tier-2 Jetson install adds `.[telemetry]`.
- `tests/unit/test_az272_fdr_record_schema.py`: `_kind_payload`
  fixture extended with a sample `c7.thermal_transition` payload so
  the every-known-kind roundtrip test passes.
- `_docs/02_tasks/todo/AZ-302_c7_thermal_publisher.md` → moved to
  `_docs/02_tasks/done/`; appended `## Implementation Notes (2026-05-12,
  batch 26)` documenting the four task-spec → as-built deltas.
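
As an illustration of what registering the kind buys, the sketch below checks a payload against the six documented keys. The real shape of `KNOWN_PAYLOAD_KEYS` in `records.py` is not shown in this commit, so the mapping type, the `check_payload_keys` helper, and the state labels in the sample payload are all assumptions.

```python
from typing import Any, Dict, FrozenSet

# Assumed shape: record kind -> exact set of allowed payload keys.
KNOWN_PAYLOAD_KEYS: Dict[str, FrozenSet[str]] = {
    "c7.thermal_transition": frozenset({
        "previous_state",
        "new_state",
        "gpu_temp_c",
        "cpu_temp_c",
        "measured_clock_mhz",
        "measured_at_ns",
    }),
}


def check_payload_keys(kind: str, payload: Dict[str, Any]) -> None:
    """Raise ValueError if the kind is unknown or the key set differs."""
    try:
        expected = KNOWN_PAYLOAD_KEYS[kind]
    except KeyError:
        raise ValueError(f"unknown FDR record kind: {kind}") from None
    actual = frozenset(payload)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"{kind}: missing keys {missing}, unexpected keys {extra}")


# Sample payload; the state labels are hypothetical.
sample_payload = {
    "previous_state": "nominal",
    "new_state": "throttled",
    "gpu_temp_c": 93.0,
    "cpu_temp_c": 80.0,
    "measured_clock_mhz": 600,
    "measured_at_ns": 2_000_000_000,
}
check_payload_keys("c7.thermal_transition", sample_payload)
```

A roundtrip test over every known kind, like the AZ-272 fixture described above, would fail for a newly registered kind until a sample payload with exactly these keys exists.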

## Design decisions

1. **`ThermalReading` intermediate DTO instead of source-stamped
   `ThermalState`**. The spec implied source `read()` returns a fully
   stamped `ThermalState`. As-built, sources return a `ThermalReading`
   (raw measurement only) and the publisher stamps `measured_at_ns` +
   `is_telemetry_available` from its injected `Clock`. This keeps the
   `_JtopSource` / `_PynvmlSource` modules from calling
   `time.monotonic_ns()` directly, which
   `tests/_meta/test_no_direct_time_in_components.py` (Invariant 2)
   forbids in any `src/gps_denied_onboard/components/**` file.
   Documented in the spec's as-built notes section.

2. **`_poll_once` is the deterministic test seam, not `start()`**. The
   AC-1..AC-4 / AC-8 tests script the source and drive a `_ManualClock`
   forward; the production `start()` would spawn a background thread
   that races those deterministic calls (observed in the first test
   iteration). Tests bypass `start()` via a `_prime_without_thread()`
   helper that exercises the source-selection + initial-poll path
   without the thread. `start()` / `stop()` remain covered by AC-5
   (cold-start fallback), AC-6 (idempotency), and a real-thread
   integration test at 200 Hz that proves the background loop polls
   the source and emits transition records.

3. **Telemetry backends moved to a `[telemetry]` optional dep group**.
   The spec claimed both `jtop` and `pynvml` were already pinned by
   description.md, but neither was in `pyproject.toml`. They are now a
   `[telemetry]` optional extra so Tier-1 installs (`.[dev]`) stay slim
   and Tier-2 Jetson installs (`.[dev,inference,telemetry]`) pull them
   in. Both backends remain runtime-optional: the publisher discovers
   them at `start()` time and only raises `TelemetryUnavailableError`
   when both are absent.

4. **WARN-on-unavailable rate-limiting uses the injected `Clock`**, not
   `time.monotonic_ns()`. The 1-second rate-limit window is enforced by
   `now_ns - last_warn_ns >= 1_000_000_000` against
   `self._clock.monotonic_ns()`, so replay-deterministic tests can
   simulate rate limit windows by advancing the manual clock.
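
Decision 4 can be sketched as a small rate limiter driven by an injected clock. Only the comparison `now_ns - last_warn_ns >= 1_000_000_000` comes from the report; `RateLimitedWarn`, `ManualClock`, and the logger name are hypothetical, and the real publisher may structure this differently.

```python
import logging
from typing import List, Optional


class ManualClock:
    """Test clock that only moves when advanced explicitly."""

    def __init__(self) -> None:
        self._now_ns = 0

    def advance_ns(self, delta_ns: int) -> None:
        self._now_ns += delta_ns

    def monotonic_ns(self) -> int:
        return self._now_ns


class RateLimitedWarn:
    """Emits at most one WARN per window, measured on the injected clock."""

    def __init__(self, clock: ManualClock, logger: logging.Logger,
                 window_ns: int = 1_000_000_000) -> None:
        self._clock = clock
        self._logger = logger
        self._window_ns = window_ns
        self._last_warn_ns: Optional[int] = None

    def warn(self, message: str) -> bool:
        """Log the message if the window has elapsed; return whether it logged."""
        now_ns = self._clock.monotonic_ns()
        last = self._last_warn_ns
        if last is None or now_ns - last >= self._window_ns:
            self._last_warn_ns = now_ns
            self._logger.warning(message)
            return True
        return False


logger = logging.getLogger("c7.thermal.ratelimit.sketch")
logger.addHandler(logging.NullHandler())  # keep the example quiet

clock = ManualClock()
limiter = RateLimitedWarn(clock, logger)
emitted: List[bool] = [limiter.warn("telemetry unavailable")]  # first call logs
clock.advance_ns(999_999_999)
emitted.append(limiter.warn("telemetry unavailable"))  # inside the window
clock.advance_ns(1)
emitted.append(limiter.warn("telemetry unavailable"))  # window elapsed
```

Since the window is measured on the injected clock, a test can cross the one-second boundary by advancing a single nanosecond rather than sleeping.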

## AC coverage

| AC | Test name | Status |
|----|-----------|--------|
| AC-1 | `test_ac1_read_returns_latest_snapshot` + `test_ac1_read_is_wait_free_object_identity` | passing |
| AC-2 | `test_ac2_throttle_entry_emits_fdr_and_warn` | passing |
| AC-3 | `test_ac3_throttle_exit_emits_fdr_and_info` | passing |
| AC-4 | `test_ac4_telemetry_unavailable_defaults_safe_and_rate_limits` | passing |
| AC-5 | `test_ac5_start_raises_when_no_source_available` + `test_ac5_jtop_falls_back_to_pynvml` | passing |
| AC-6 | `test_ac6_double_start_is_noop` + `test_ac6_double_stop_is_noop` | passing |
| AC-7 | F3 hot-path non-interference perf | Tier-2 deferred |
| AC-8 | `test_ac8_fdr_record_round_trips_through_wire_format` | passing |
| NFR-perf-poll | microbench poll p99 ≤ 100 ms | Tier-2 deferred |
| NFR-perf-read | wait-free read microbench p99 ≤ 1 µs | Tier-2 deferred |
| NFR-default-safe | `test_nfr_default_safe_on_first_poll_failure` | passing |
| Structural | `test_scripted_source_satisfies_protocol` | passing |
| Invariant I-6 | `test_default_safe_snapshot_invariant_i6` | passing |
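
The behaviour that the AC-2 and AC-3 rows name (an FDR record on each throttle flip, WARN on entry, INFO on exit) might look roughly like the sketch below. `TransitionDetector`, `RecordingFdrClient`, the `emit` signature, the state labels, and the choice to treat the first sample as "no transition" are all assumptions, not the as-built publisher or the AZ-273 `FdrClient` API.

```python
import logging
from typing import Any, Dict, List, Optional


class RecordingFdrClient:
    """Hypothetical stand-in that just stores emitted records."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def emit(self, kind: str, payload: Dict[str, Any]) -> None:
        self.records.append({"kind": kind, "payload": payload})


class ListHandler(logging.Handler):
    """Captures log level names so the example can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.seen: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.seen.append(record.levelname)


def state_name(is_throttled: bool) -> str:
    return "throttled" if is_throttled else "nominal"


class TransitionDetector:
    def __init__(self, fdr: RecordingFdrClient, logger: logging.Logger) -> None:
        self._fdr = fdr
        self._logger = logger
        self._previous: Optional[bool] = None

    def observe(self, is_throttled: bool, gpu_temp_c: float, cpu_temp_c: float,
                measured_clock_mhz: int, measured_at_ns: int) -> None:
        previous = self._previous
        self._previous = is_throttled
        if previous is None or previous == is_throttled:
            return  # first sample (assumed silent) or no flip
        self._fdr.emit("c7.thermal_transition", {
            "previous_state": state_name(previous),
            "new_state": state_name(is_throttled),
            "gpu_temp_c": gpu_temp_c,
            "cpu_temp_c": cpu_temp_c,
            "measured_clock_mhz": measured_clock_mhz,
            "measured_at_ns": measured_at_ns,
        })
        if is_throttled:
            self._logger.warning("thermal throttle entered at %.1f C", gpu_temp_c)
        else:
            self._logger.info("thermal throttle exited at %.1f C", gpu_temp_c)


handler = ListHandler()
logger = logging.getLogger("c7.thermal.transition.sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

fdr = RecordingFdrClient()
detector = TransitionDetector(fdr, logger)
detector.observe(False, 55.0, 48.0, 1300, 1_000_000_000)  # first sample
detector.observe(True, 93.0, 80.0, 600, 2_000_000_000)    # throttle entry
detector.observe(True, 94.0, 81.0, 600, 3_000_000_000)    # still throttled
detector.observe(False, 70.0, 60.0, 1300, 4_000_000_000)  # throttle exit
```
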

AC-7 and the two Tier-2 microbenches need real Jetson hardware to be
meaningful; they roll into the Tier-2 validation task with AZ-301's
AC-8 (`read_host_tuple` on real NVML/L4T).

## Test run

`./.venv/bin/python -m pytest tests/unit tests/_meta -q` → **1140
passed, 11 skipped (Tier-2 / CUDA / cmake / actionlint)**, no failures.
Mypy strict on the three production-source files we touched: clean.
Ruff on the same set: clean.

## Self-review

- Production code (`thermal_publisher.py`, `__init__.py`,
  `records.py`, `pyproject.toml`): no use of `time.*` in component
  modules (Invariant 2 meta-test passes); FDR kind documented in the
  schema-keys table; telemetry backends are optional; the publisher
  raises `TelemetryUnavailableError` on cold-start with no source per
  AC-5; idempotent start/stop per AC-6; rate-limited WARN per AC-4.
- Tests: every AC has at least one named assertion; the AC-1 wait-free
  microbench and AC-7 hot-path bench are Tier-2 deferred (documented in
  the report + the spec as-built notes).
- Lint / type: ruff + mypy strict clean on the modified set.
- Docs: spec moved to `done/` with the as-built delta notes.
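
Two of the behaviours listed in the self-review, cold-start source selection (AC-5) and idempotent start/stop (AC-6), can be sketched under stated assumptions. `select_backend` and `SketchPoller` are hypothetical; in the real publisher the candidates would presumably be `jtop` then `pynvml`, while the example substitutes a stdlib module so it runs anywhere. Pacing the loop with `Event.wait` is a simplification; how the as-built loop paces itself is not described here.

```python
import importlib
import threading
from types import ModuleType
from typing import Callable, Optional, Sequence, Tuple


class TelemetryUnavailableError(RuntimeError):
    """Raised when no telemetry backend can be used."""


def select_backend(module_names: Sequence[str]) -> Tuple[str, ModuleType]:
    """Import the first available module, trying names in preference order."""
    for name in module_names:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue  # fall through to the next candidate
    raise TelemetryUnavailableError(
        f"no telemetry backend importable from {list(module_names)}"
    )


class SketchPoller:
    """Background loop whose start() and stop() are safe to call twice."""

    def __init__(self, poll_once: Callable[[], None], interval_s: float) -> None:
        self._poll_once = poll_once
        self._interval_s = interval_s
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def is_running(self) -> bool:
        return self._thread is not None

    def start(self) -> None:
        if self._thread is not None:
            return  # already running: a second start() is a no-op
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        thread = self._thread
        if thread is None:
            return  # never started or already stopped: no-op
        self._stop_event.set()
        thread.join()
        self._thread = None

    def _run(self) -> None:
        while not self._stop_event.is_set():
            self._poll_once()
            self._stop_event.wait(self._interval_s)


# The preferred name is missing on purpose; "json" stands in for the fallback.
chosen, _module = select_backend(["missing_backend_for_sketch", "json"])
```

The sketch assumes `start()` and `stop()` are called from one controlling thread; guarding them against concurrent callers would need extra synchronisation.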

## Known gaps

- AZ-302 AC-7 + AC-1 microbench + NFR-perf-poll → require real Jetson
  thermal source under load; rolled into the Tier-2 validation task.
- AZ-304 (C6 Postgres schema) deferred to batch 27; testcontainers
  setup + Alembic baseline are independently meaningful.
- `_JtopSource.read()` catches a bare `Exception` because jtop's
  internal exception types are not stable across jtop versions; the
  exception is always rewrapped as `TelemetryUnavailableError` (never
  swallowed silently) so this does not violate the "no silent error
  suppression" coding rule.
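
The rewrap pattern described in the last gap can be illustrated with a generic wrapper. This is a sketch only: `WrappingSource` and `failing_backend` are hypothetical, and no jtop call is shown because the real `_JtopSource.read()` body is not part of this report.

```python
from typing import Callable


class TelemetryUnavailableError(RuntimeError):
    """Raised when a thermal backend cannot produce a reading."""


class WrappingSource:
    """Wraps a backend read so every failure surfaces as one error type."""

    def __init__(self, backend_read: Callable[[], float]) -> None:
        self._backend_read = backend_read

    def read(self) -> float:
        try:
            return self._backend_read()
        except Exception as exc:
            # Broad catch on purpose: the backend's exception types are
            # treated as unstable. Chaining with "from exc" keeps the
            # original error attached instead of swallowing it.
            raise TelemetryUnavailableError(
                f"thermal backend read failed: {exc!r}"
            ) from exc


def failing_backend() -> float:
    raise OSError("backend socket closed")


try:
    WrappingSource(failing_backend).read()
except TelemetryUnavailableError as err:
    caught = err
```

The original exception stays reachable through `caught.__cause__`, so the failure is translated rather than hidden.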
@@ -6,9 +6,9 @@ step: 7
 name: Implement
 status: in_progress
 sub_step:
-phase: 13
+phase: 7
 name: archive-and-loop
-detail: "batch 25/cycle1 complete: AZ-301 → In Testing, archived to done/. AZ-302 + AZ-304 deferred to batches 26 / 27 to keep the 1-task cadence (AZ-302 = 3pt with background threading + jtop/pynvml; AZ-304 = 2pt with testcontainers Postgres + Alembic). 14 unconditional AC tests + 1 Tier-2 AC-8 skip. Suite: 1134 passed / 11 skipped. 17 tasks total ready overall (AZ-300 + AZ-301 removed)."
+detail: "batch 26/cycle1 complete: AZ-302 (ThermalStatePublisher, 3pt) implemented and tests green (1140 passed, 11 Tier-2/CUDA skipped). Added ThermalStatePublisher + ThermalSource Protocol + ThermalReading DTO + _JtopSource/_PynvmlSource backends; registered c7.thermal_transition in FDR schema; new [telemetry] optional dep group in pyproject.toml. Spec moved to done/ with as-built notes; cycle report at _docs/03_implementation/batch_26_cycle1_report.md. Ready to push commit + compute batch 27 (AZ-304 C6 Postgres schema next, 2pt) plus the 15 other ready tasks."
 retry_count: 0
 cycle: 1
 tracker: jira