mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 17:51:14 +00:00
[AZ-302] C7 ThermalStatePublisher — jtop/NVML 1 Hz background poller
Implements AZ-297 InferenceRuntime's thermal_state() side: a singleton background-thread publisher that polls jtop (preferred) or pynvml (fallback) at config.thermal_poll_hz, stores an atomic ThermalState snapshot, and emits c7.thermal_transition FDR records on every throttle flip, with a WARN log on entry and an INFO log on exit. Default-safe on TelemetryUnavailableError per Invariant I-6, with a WARN rate-limited to 1 Hz.

Sources return a raw ThermalReading; the publisher stamps measured_at_ns via its injected Clock so _JtopSource / _PynvmlSource stay clean of direct time.* calls (Invariant 2). _poll_once is the deterministic test seam; start() spawns the production thread.

- c7.thermal_transition registered in fdr_client.records KNOWN_PAYLOAD_KEYS
- [telemetry] optional dep group (jetson-stats, pynvml) added to pyproject
- 14 unit tests (AC-1..AC-6, AC-8, NFR-default-safe, structural) green; AC-7 / AC-1 microbench / NFR-perf-poll Tier-2 deferred
- full unit suite: 1140 passed, 11 expected Tier-2/CUDA skips

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -1,177 +0,0 @@
# C7 ThermalState Publisher: jetson-stats / NVML 1 Hz Background

**Task**: AZ-302_c7_thermal_publisher
**Name**: C7 ThermalState Publisher
**Description**: Implement the 1 Hz background polling loop that reads jetson-stats / pynvml (CPU/GPU temperature, throttle bit, measured clock MHz), produces a lock-free atomic `ThermalState` snapshot, and exposes it via `InferenceRuntime.thermal_state()` for C4's D-CROSS-LATENCY-1 hybrid covariance-mode decision. The publisher emits an FDR record on every throttle-state transition, emits a WARN log on first throttle entry and on telemetry unavailability, and defaults `thermal_throttle_active = false` on `TelemetryUnavailableError`. Throttle-detection latency must be ≤ 1 s end-to-end so that C4 reacts within 1 frame (C7-IT-02).
**Complexity**: 3 points
**Dependencies**: AZ-297_c7_runtime_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf
**Component**: c7_inference (epic AZ-249 / E-C7)
**Tracker**: AZ-302
**Epic**: AZ-249 (E-C7)

### Document Dependencies

- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md`: defines `ThermalState` and `thermal_state()`; produced by AZ-297.
- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`: used to emit thermal-transition FDR records via `FdrClient.publish`.
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`: `kind="c7.thermal_transition"` record envelope.
- `_docs/02_document/contracts/shared_logging/log_record_schema.md`: WARN log shape on throttle entry / telemetry loss.
- `_docs/02_document/contracts/shared_config/composition_root_protocol.md`: Config carries `thermal_poll_hz` and a fallback flag for telemetry unavailability behaviour.

## Problem

C4 Pose's D-CROSS-LATENCY-1 hybrid switches between two covariance modes (steady-state vs. JACOBIAN) based on whether the Jetson is thermal-throttling. Without a publisher:

- C4 has no canonical `ThermalState` source, so the AC-NEW-5 throttle-reaction-within-1-frame guarantee has nothing to read.
- The F3 hot-path threads cannot poll `jetson-stats` themselves (they would race with `infer` on the GIL-protected critical section per description.md § 7).
- Thermal transitions are not recorded in the FDR, so post-flight tooling cannot correlate degraded poses with thermal events.
- `TelemetryUnavailableError` (jetson-stats hung or absent) has no documented degraded-mode behaviour; without explicit handling, the system would either crash or silently lie.

This task is the SINGLE owner of thermal telemetry across the companion process.

## Outcome

- A `ThermalStatePublisher` class at `src/gps_denied_onboard/components/c7_inference/thermal_publisher.py` with `start() / stop()` lifecycle and `read() -> ThermalState` accessor.
- A background thread (NORMAL priority, daemonic) polls jetson-stats / pynvml at `config.inference.thermal_poll_hz` Hz (default 1.0); each successful poll updates a single `_atomic_snapshot: ThermalState` field via simple assignment (under Python's GIL, storing a single object reference is atomic; documented in the implementation report).
- `read()` is the lock-free reader: it returns the current `_atomic_snapshot`. F3 hot-path callers MAY call `read()` from any thread; the call is wait-free and takes ≤ 1 µs.
- Throttle-state transitions (the boolean `thermal_throttle_active` flips between two consecutive polls) emit:
  - One `kind="c7.thermal_transition"` FDR record via the constructor-injected `FdrClient` (AZ-273) carrying `previous_state`, `new_state`, `gpu_temp_c`, `cpu_temp_c`, `measured_clock_mhz`, `measured_at_ns`.
  - One WARN log on entry-to-throttle (NOT on exit-from-throttle; the exit is logged at INFO).
- `TelemetryUnavailableError` (raised by jetson-stats / pynvml internals): the publisher catches it, sets `thermal_throttle_active = false`, `gpu_temp_c = None`, and `cpu_temp_c = None`, emits ONE WARN log per occurrence (rate-limited to ≤ 1/sec), and continues polling. The runtime never silently lies: `ThermalState.is_telemetry_available` is set to `False` so C4 can choose to ignore the throttle bit.
- The publisher is a singleton constructed by the composition root and passed by reference into every `InferenceRuntime` strategy that needs `thermal_state()`. The strategies do NOT each construct their own publisher.
- The publisher is started during composition (after `FdrClient` is registered, before any consumer calls `read()`) and stopped by the composition root's process-exit hook.
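
The snapshot mechanism in the bullets above can be sketched as follows. This is a minimal illustration, not the project's implementation: the field names come from this task, but the exact `ThermalState` DTO is owned by AZ-297, and `SnapshotHolder` and `default_safe` are hypothetical names introduced here. It assumes a GIL build of CPython, where rebinding one attribute is a single reference store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThermalState:
    # Field names follow the task text; the real DTO lives in the AZ-297 contract.
    thermal_throttle_active: bool
    is_telemetry_available: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[int]
    measured_at_ns: int


def default_safe(measured_at_ns: int) -> ThermalState:
    """Hypothetical helper: the value published when telemetry is unavailable."""
    return ThermalState(False, False, None, None, None, measured_at_ns)


class SnapshotHolder:
    """Single-writer, many-reader holder for the latest ThermalState."""

    def __init__(self) -> None:
        self._atomic_snapshot = default_safe(0)

    def publish(self, state: ThermalState) -> None:
        # One reference store: readers observe either the old or the new object.
        self._atomic_snapshot = state

    def read(self) -> ThermalState:
        # One reference load: no lock or condition variable on the read path.
        return self._atomic_snapshot


holder = SnapshotHolder()
holder.publish(ThermalState(True, True, 71.5, 64.0, 918, 1_000_000_000))
latest = holder.read()
```

Making the DTO frozen means a reader can never see a half-updated state: the poller builds a complete new object and swaps the reference, rather than mutating fields in place.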

## Scope

### Included

- `ThermalStatePublisher` class with `__init__(config, fdr_client, logger)`, `start()`, `stop()`, `read() -> ThermalState`.
- Background polling thread: while `_running`, sleep `1.0 / config.inference.thermal_poll_hz` seconds, read `jetson-stats` (`from jtop import jtop` context manager OR direct `pynvml.nvmlDeviceGetTemperature` / `nvmlDeviceGetCurrentClocksThrottleReasons` calls; the implementation chooses based on availability), build a fresh `ThermalState`, and atomically replace `_atomic_snapshot`.
- Source-selection logic: at `start()`, attempt to import `jtop` (jetson-stats); if unavailable, attempt `pynvml`; if both are unavailable, raise `TelemetryUnavailableError` from `start()` itself (composition aborts cleanly and the operator chooses whether to disable thermal-aware paths).
- Throttle-transition detection: compare the previous `_atomic_snapshot.thermal_throttle_active` with the new value; on a flip, emit the FDR record and log per Outcome.
- WARN log on first telemetry-unavailable in a window (rate-limited to 1/sec): `kind="c7.thermal.unavailable"`.
- INFO log on `start()` and `stop()`; INFO log on throttle-exit transitions; WARN log on throttle-entry transitions.
- `ThermalState` extension: add `is_telemetry_available: bool` to the DTO defined in AZ-297. (NOTE: this is a Protocol-touching change; AZ-297's contract MUST list this field and AZ-302 SHOULD coordinate with AZ-297 at decompose time. Documented in the AZ-297 contract's `## Test Cases`.)
- Constructor-injected `Clock` for testability: tests inject a fake clock that advances on demand; production wires `time.monotonic_ns`.
- The `read()` accessor is wait-free and re-entrant; it can be called from any thread, including the F3 hot path.
- A `ThermalStatePublisher.is_running() -> bool` introspection accessor for the composition root and tests.
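
The lifecycle and polling cadence listed above could look roughly like this. It is a sketch under assumptions: `PollLoop` and the thread name are hypothetical, and the loop waits on a `threading.Event` instead of sleeping so that `stop()` interrupts the wait immediately. The real publisher would route timing through its injected `Clock`.

```python
import threading
from typing import Callable, Optional


class PollLoop:
    """Calls poll_once() every 1/poll_hz seconds on a daemon thread."""

    def __init__(self, poll_once: Callable[[], None], poll_hz: float = 1.0) -> None:
        self._poll_once = poll_once
        self._period_s = 1.0 / poll_hz
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def is_running(self) -> bool:
        return self._thread is not None and self._thread.is_alive()

    def start(self) -> None:
        if self.is_running():
            return  # a second start() is a no-op; no duplicate thread
        self._stop_event.clear()
        self._thread = threading.Thread(
            target=self._run, name="c7-thermal-poll", daemon=True
        )
        self._thread.start()

    def stop(self) -> None:
        thread = self._thread
        if thread is None:
            return  # a second stop() is a no-op
        self._stop_event.set()
        thread.join()
        self._thread = None

    def _run(self) -> None:
        # Event.wait() acts as an interruptible sleep: it returns True as soon
        # as stop() sets the event, which ends the loop.
        while not self._stop_event.wait(self._period_s):
            self._poll_once()
```

Keeping the poll body behind a plain callable means tests can drive `poll_once()` directly, without starting the thread at all.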

### Excluded

- AZ-297 InferenceRuntime Protocol: this task adds the `is_telemetry_available` field to the existing `ThermalState` DTO; the Protocol method `thermal_state()` is owned by the strategies that delegate to this publisher.
- AZ-298 / AZ-299 / AZ-300 strategies: they delegate to this publisher; they do NOT poll themselves.
- AZ-301 EngineGate: unrelated.
- C4 Pose's covariance-mode switching logic: owned by E-C4. This task PUBLISHES `ThermalState`; C4 consumes.
- Cooling controller / fan curve adjustments: out of scope; the companion process is read-only on thermal telemetry.
- Cross-flight thermal trend analysis: operator post-flight tooling owns it.
- A SECOND telemetry source for redundancy (two pynvml clients, etc.): out of scope this cycle.

## Acceptance Criteria

**AC-1: read() is wait-free and returns the latest snapshot**
Given a started publisher with two completed polls
When `read()` is called from the F3 hot path
Then the call returns within 1 µs (microbenched); the returned `ThermalState` matches the most recent poll's data
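
One way to take the AC-1 measurement is sketched below. `p99_ns` is a hypothetical helper, not part of the task. It times each call individually, so every sample includes the timer's own overhead, and it runs on the calling thread rather than a dedicated worker thread.

```python
import time
from typing import Callable, List


def p99_ns(fn: Callable[[], object], calls: int = 100_000) -> int:
    """Return the 99th-percentile wall time of fn() in ns (timer overhead included)."""
    samples: List[int] = []
    for _ in range(calls):
        t0 = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - t0)
    samples.sort()
    # Nearest-rank style index into the sorted samples.
    return samples[int(0.99 * (len(samples) - 1))]
```

Because two `perf_counter_ns()` calls bracket every sample, the result is an upper bound on the cost of `fn()` itself; timing batches of calls and dividing is an alternative when that overhead dominates.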

**AC-2: throttle entry within 1 s**
Given a running publisher and a simulated jetson-stats spoof flipping `throttle_active` from False to True at time T
When the test waits 1 s and calls `read()`
Then the returned `ThermalState.thermal_throttle_active == True` (latency ≤ 1 s end-to-end at 1 Hz poll rate); ONE FDR `kind="c7.thermal_transition"` record was emitted with `new_state=True`; ONE WARN log was emitted

**AC-3: throttle exit within 1 s and INFO log**
Given a publisher currently in throttle and a simulated flip to False
When 1 s elapses and the test calls `read()`
Then `thermal_throttle_active == False`; ONE FDR record with `new_state=False`; ONE INFO log (NOT WARN) records the exit
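
The entry/exit behaviour in AC-2 and AC-3 reduces to a comparison of two booleans per poll. The sketch below assumes a hypothetical `on_poll` helper and a `publish` callable standing in for `FdrClient.publish`, whose real signature is defined in the FDR client contract and is not reproduced here. Logging uses the standard `logging` module rather than the project's log module.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("c7_inference.thermal")


def on_poll(
    previous_active: bool,
    new_active: bool,
    publish: Callable[[Dict[str, Any]], None],
) -> bool:
    """Emit side effects for one poll; return True if a transition was recorded."""
    if previous_active == new_active:
        return False  # no flip: no FDR record, no log
    publish(
        {
            "kind": "c7.thermal_transition",
            "previous_state": previous_active,
            "new_state": new_active,
        }
    )
    if new_active:
        logger.warning("thermal throttle entered")  # entry is WARN
    else:
        logger.info("thermal throttle exited")  # exit is INFO
    return True
```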

**AC-4: telemetry unavailability sets is_telemetry_available=False, defaults throttle to False**
Given a started publisher whose jetson-stats source raises `TelemetryUnavailableError` on every poll
When the test waits 2 s and calls `read()`
Then `ThermalState.is_telemetry_available == False`, `thermal_throttle_active == False` (default-safe), `gpu_temp_c == None`, `cpu_temp_c == None`; the WARN log was emitted at most twice in 2 s (rate-limited to 1/sec)
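
The 1/sec limit on the telemetry-unavailable WARN can be expressed as a small gate driven by the injected clock. `RateLimiter` is a hypothetical name, and the one-second window is the value stated in this task.

```python
from typing import Callable, Optional


class RateLimiter:
    """Allows at most one event per min_interval_ns, measured on an injected clock."""

    def __init__(
        self, now_ns: Callable[[], int], min_interval_ns: int = 1_000_000_000
    ) -> None:
        self._now_ns = now_ns
        self._min_interval_ns = min_interval_ns
        self._last_ns: Optional[int] = None

    def allow(self) -> bool:
        now = self._now_ns()
        if self._last_ns is not None and now - self._last_ns < self._min_interval_ns:
            return False  # still inside the window: suppress this WARN
        self._last_ns = now
        return True


# Fake clock: four failed polls at 0.0 s, 0.5 s, 1.0 s and 1.5 s.
ticks = iter([0, 500_000_000, 1_000_000_000, 1_500_000_000])
limiter = RateLimiter(lambda: next(ticks))
decisions = [limiter.allow() for _ in range(4)]
```

With these timestamps only the polls at 0.0 s and 1.0 s pass the gate, which matches the "at most twice in 2 s" bound in AC-4.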

**AC-5: cold-start with no source raises**
Given an environment where neither `jtop` nor `pynvml` is importable
When the publisher's `start()` is called
Then `TelemetryUnavailableError` is raised from `start()` itself; the publisher is in `is_running() == False` state; the composition root catches and either aborts startup or proceeds with thermal-aware paths disabled (decision is composition-root's, not this task's)
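
A minimal sketch of the source selection behind AC-5 follows. `select_source` and its `is_importable` parameter are hypothetical, and `TelemetryUnavailableError` is redefined locally only to keep the example self-contained. The importability check uses the standard-library `importlib.util.find_spec`.

```python
import importlib.util
from typing import Callable, Sequence


class TelemetryUnavailableError(RuntimeError):
    """Local stand-in for the project's error type."""


def _importable(name: str) -> bool:
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(name) is not None


def select_source(
    candidates: Sequence[str] = ("jtop", "pynvml"),
    is_importable: Callable[[str], bool] = _importable,
) -> str:
    """Return the first importable telemetry module name, in priority order."""
    for name in candidates:
        if is_importable(name):
            return name
    raise TelemetryUnavailableError(f"none of {list(candidates)} is importable")
```

Injecting `is_importable` lets a unit test simulate "neither module installed" without touching the real environment.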

**AC-6: start/stop lifecycle is idempotent**
Given a publisher
When `start()` is called twice in succession
Then the second call is a no-op (returns silently); the polling thread is NOT duplicated; `is_running() == True`

When `stop()` is called twice
Then the second `stop()` is a no-op; resources are NOT double-freed

**AC-7: poll thread does not interfere with infer hot path**
Given a started publisher polling at 1 Hz and an F3 hot-path benchmark running `infer` at 39 Hz aggregate (per description.md)
When the benchmark runs for 60 s
Then the F3 hot-path latency p95 is unchanged compared to a baseline without the publisher (the polling thread does not contend on the CUDA stream or any infer-critical resource); the publisher's poll p99 is ≤ 100 ms

**AC-8: FDR record envelope matches contract**
Given a throttle transition
When the FDR record is written
Then the record matches `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` shape with `kind="c7.thermal_transition"`, `producer_id="c7_inference.thermal"`, `payload` containing `{previous_state, new_state, gpu_temp_c, cpu_temp_c, measured_clock_mhz, measured_at_ns}` (deep-equal verified against the schema's payload type)
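
For illustration, the record named in AC-8 could be assembled as below. The three top-level keys and six payload keys are the ones listed in this task; the plain-dict representation and the `build_transition_record` name are assumptions, since the authoritative envelope is defined in `fdr_record_schema.md`.

```python
from typing import Any, Dict, Optional

PAYLOAD_KEYS = (
    "previous_state",
    "new_state",
    "gpu_temp_c",
    "cpu_temp_c",
    "measured_clock_mhz",
    "measured_at_ns",
)


def build_transition_record(
    previous_state: bool,
    new_state: bool,
    gpu_temp_c: Optional[float],
    cpu_temp_c: Optional[float],
    measured_clock_mhz: Optional[int],
    measured_at_ns: int,
) -> Dict[str, Any]:
    """Build a c7.thermal_transition record as a plain dict (assumed shape)."""
    return {
        "kind": "c7.thermal_transition",
        "producer_id": "c7_inference.thermal",
        "payload": {
            "previous_state": previous_state,
            "new_state": new_state,
            "gpu_temp_c": gpu_temp_c,
            "cpu_temp_c": cpu_temp_c,
            "measured_clock_mhz": measured_clock_mhz,
            "measured_at_ns": measured_at_ns,
        },
    }


record = build_transition_record(False, True, 83.0, 76.5, 612, 42_000_000_000)
```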

## Non-Functional Requirements

**Performance**
- `read()` is wait-free; p99 ≤ 1 µs.
- Poll p99 ≤ 100 ms (the polling thread executes at NORMAL priority; jetson-stats takes 30–80 ms typically).
- F3 hot-path latency unchanged when publisher is running (AC-7).

**Reliability**
- The publisher NEVER blocks the F3 hot path. The polling thread runs at NORMAL priority on a separate Python thread; `read()` is a single object-reference load (atomic under GIL).
- Telemetry-unavailability defaults are documented and tested (AC-4); the system never lies about throttle state.
- The publisher is one of the FIRST things the composition root starts (after `FdrClient` registration) and one of the LAST things it stops (before `FdrClient.stop`). Documented as a startup-order constraint.

**Concurrency**
- One polling thread per process. The publisher is a singleton.
- `read()` is re-entrant and called from the F3 hot path threads (consumers); they hold no locks the publisher is sensitive to.

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Microbench read() × 100k from a worker thread | p99 ≤ 1 µs; returned state matches latest poll |
| AC-2 | Spoof flip False→True at T; wait 1 s; read | throttle_active=True; one FDR + one WARN |
| AC-3 | Spoof flip True→False; wait 1 s; read | throttle_active=False; one FDR + one INFO |
| AC-4 | Spoof unavailable on every poll for 2 s | is_telemetry_available=False; default-safe; ≤ 2 WARN |
| AC-5 | Disable jtop + pynvml; start | TelemetryUnavailableError raised; is_running=False |
| AC-6 | start twice; stop twice | Idempotent; no double-spawn / double-free |
| AC-7 | F3 hot-path bench with publisher running | p95 unchanged vs. baseline; poll p99 ≤ 100 ms |
| AC-8 | Throttle transition FDR record shape | Matches schema deep-equal |
| NFR-perf-poll | Microbench poll body × 100 | p99 ≤ 100 ms |
| NFR-reliability-default-safe | Telemetry unavailable on first poll → read() | is_telemetry_available=False; throttle_active=False |
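
The AC-2 / AC-3 rows can be exercised without real time or hardware by scripting the source and advancing a fake clock by hand. Everything in this sketch is a stand-in: `FakeSource`, `FakeClock`, and `MiniPublisher` are hypothetical, and the snapshot is reduced to a `(throttle_active, measured_at_ns)` tuple.

```python
from typing import Callable, Iterator, List, Tuple


class FakeSource:
    """Replays scripted throttle flags, one per poll."""

    def __init__(self, script: List[bool]) -> None:
        self._script: Iterator[bool] = iter(script)

    def read_throttle(self) -> bool:
        return next(self._script)


class FakeClock:
    """Monotonic clock that only moves when the test advances it."""

    def __init__(self) -> None:
        self.now_ns = 0

    def advance_s(self, seconds: float) -> None:
        self.now_ns += int(seconds * 1_000_000_000)

    def __call__(self) -> int:
        return self.now_ns


class MiniPublisher:
    """Reduced publisher: one poll step, one snapshot, a list of transitions."""

    def __init__(self, source: FakeSource, clock: Callable[[], int]) -> None:
        self._source = source
        self._clock = clock
        self._snapshot: Tuple[bool, int] = (False, 0)
        self.transitions: List[Tuple[bool, bool, int]] = []

    def poll_once(self) -> None:
        active = self._source.read_throttle()
        stamped_ns = self._clock()  # the publisher, not the source, stamps time
        previous = self._snapshot[0]
        if active != previous:
            self.transitions.append((previous, active, stamped_ns))
        self._snapshot = (active, stamped_ns)

    def read(self) -> Tuple[bool, int]:
        return self._snapshot


def test_throttle_entry_then_exit() -> None:
    clock = FakeClock()
    publisher = MiniPublisher(FakeSource([False, True, True, False]), clock)
    for _ in range(4):
        clock.advance_s(1.0)
        publisher.poll_once()
    assert publisher.read() == (False, 4_000_000_000)
    assert publisher.transitions == [
        (False, True, 2_000_000_000),
        (True, False, 4_000_000_000),
    ]
```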

## Constraints

- One polling thread per process; publisher is a singleton.
- `read()` is wait-free; the implementation MUST NOT introduce any lock or condition variable on the read path.
- Source selection at `start()` time: `jtop` first, `pynvml` second; if both fail, raise `TelemetryUnavailableError` from `start()` (do NOT silently default-safe, because that hides a misconfigured deployment).
- Once selected, the source does NOT swap mid-flight (e.g., if jtop becomes unavailable mid-flight, the publisher falls back to the AC-4 default-safe behaviour but does NOT switch to pynvml; the operator must restart).
- WARN log on telemetry-unavailable is rate-limited to ≤ 1/sec via a simple monotonic-clock check; rate limit window is documented.
- `ThermalState.is_telemetry_available` is added to the AZ-297 DTO; this is a coordinated change documented in the AZ-297 contract's change log.
- This task introduces no new third-party dependencies: `jtop` (jetson-stats) and `pynvml` are both already pinned by the description.md key dependencies table.

## Risks & Mitigation

**Risk 1: GIL atomicity assumption is wrong on a future Python**
- *Risk*: Python 3.13+ free-threaded mode (PEP 703) removes the GIL, so simple assignment is no longer guaranteed atomic by it.
- *Mitigation*: The implementation report documents the GIL assumption; the project's Python version is pinned and the implementation is correct under the pinned version. If/when the project moves to free-threaded mode, this task is revisited (it would add an `atomic` library wrapper or a `threading.Lock` around the snapshot).

**Risk 2: Polling thread starves under thermal-throttle (the very condition it is observing)**
- *Risk*: Under heavy throttle, NORMAL-priority threads may not get scheduled; the publisher's poll latency then exceeds 1 s and the throttle-detection latency contract is violated.
- *Mitigation*: The polling thread is daemonic but at NORMAL priority; jetson-stats internal calls take 30–80 ms typically, well under the 1 s budget. AC-2 is the canonical test; if it fails on real hardware, the operator may bump the priority via config (out of scope this cycle).

**Risk 3: jetson-stats internal threading interferes with our polling thread**
- *Risk*: `jtop` runs its own background thread; concurrent access from our polling thread + an operator tool could corrupt jtop's state.
- *Mitigation*: This publisher is the SINGLE owner of `jtop` access in the companion process. Operator tools (C12) read FDR records, not the live publisher. Documented in the implementation report.

**Risk 4: FDR record emission rate spikes during rapid throttle oscillation**
- *Risk*: A pathological thermal scenario could oscillate at the poll rate, emitting one record per second sustained and affecting AC-NEW-3 segment-size budgets.
- *Mitigation*: 1 record per second sustained is well within the C13 writer's throughput budget (200 Hz aggregate per AZ-291); even worst-case oscillation is benign. No additional rate limit is needed for FDR transition records.

## Runtime Completeness

- **Named capability**: ThermalState publisher + lock-free atomic snapshot + 1 Hz background polling (architecture / E-C7 / AC-NEW-5 / D-CROSS-LATENCY-1).
- **Production code that must exist**: real `ThermalStatePublisher` class with real background thread, real `jtop` / `pynvml` poll, real lock-free `_atomic_snapshot` reference swap, real FDR record emission via the injected `FdrClient`.
- **Allowed external stubs**: tests MAY substitute a `FakeJtopSource` and a `FakeFdrClient` (AC-2..AC-8); production wiring uses real `jtop` + real AZ-273 `FdrClient`.
- **Unacceptable substitutes**: a polling loop that uses `time.sleep` without a real `Clock` injection (would break test determinism); a snapshot field guarded by a `threading.Lock` on the read path (would violate the wait-free read requirement under F3 hot path); a "warn but always default-safe" mode that never raises `TelemetryUnavailableError` from `start()` (would hide misconfigured deployments, which is exactly the failure mode AC-5 prevents).