mirror of
https://github.com/azaion/autopilot.git
synced 2026-06-21 19:51:10 +00:00
[AZ-675] telemetry_stream Tonic gRPC server + per-client lossy queue
ci/woodpecker/push/build-arm Pipeline failed
Pins operator-link transport to gRPC server-streaming (closes architecture Q2 in favour of gRPC). Adds first-time tonic / prost / tonic-build infrastructure to the workspace; uses protoc-bin-vendored so neither dev machines nor CI need a system protoc installed.

Design: back-pressure lives in the per-topic tokio::sync::broadcast ring, drained directly by the tonic-streamed response via BroadcastStream + StreamMap. There is no intermediate mpsc buffer that could absorb back-pressure invisibly. Slow client overrun -> Lagged(n) event -> per-(client_id, topic) drop counter incremented; healthy clients on the same topic are unaffected.

Service surface: Subscribe(SubscribeRequest) -> stream TelemetryMessage; five topics (TelemetrySample, GimbalState, DetectionEvent, MovementCandidate, MapObjectsBundle); an empty topics list defaults to subscribe-all; an empty client_id is rejected; dropping the stream decrements subscribed_clients via StreamGuard.

TelemetrySink push_detections is now real; push_frame is still NotImplemented (AZ-676 video path).

Tests: 6 unit + 5 integration (AC-1..AC-3 via in-process gRPC client, plus subscribe-all default + empty-client_id rejection). Clippy on telemetry_stream is clean.

The pre-existing mission_executor ac3 test polling race surfaces more reliably under the new tonic build pressure; it is documented as _docs/_process_leftovers/2026-05-20_mission_executor_ac3_flake.md and is unchanged by this batch.

Co-authored-by: Cursor <cursoragent@cursor.com>
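To make the lossy-queue design concrete, here is a minimal std-only model of the per-topic ring and the per-(client_id, topic) drop counter. It is a sketch, not the code in this commit: the real implementation uses tokio::sync::broadcast and tonic, and the names `TopicRing`, `Recv`, `subscribe`, `recv`, and `dropped` are hypothetical. It only illustrates the rule that an overrun client receives a Lagged(n) event and is charged n drops while a client that keeps up loses nothing.

```rust
use std::collections::{HashMap, VecDeque};

/// Outcome of one receive attempt (hypothetical type, loosely modelled on
/// the lag notification a tokio broadcast receiver reports).
#[derive(Debug, PartialEq, Eq)]
pub enum Recv<T> {
    Item(T),
    Lagged(u64),
    Empty,
}

/// Fixed-capacity ring for a single topic. When full, the oldest message is
/// overwritten; publishers never block on slow clients.
pub struct TopicRing<T> {
    capacity: usize,
    buf: VecDeque<T>,
    next_seq: u64,                 // sequence number the next pushed message gets
    cursors: HashMap<String, u64>, // next sequence number each client expects
    drops: HashMap<String, u64>,   // drop counter per client for this topic
}

impl<T: Clone> TopicRing<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "ring capacity must be non-zero");
        Self {
            capacity,
            buf: VecDeque::with_capacity(capacity),
            next_seq: 0,
            cursors: HashMap::new(),
            drops: HashMap::new(),
        }
    }

    /// Registers a client; it will only see messages pushed from now on.
    pub fn subscribe(&mut self, client_id: &str) {
        self.cursors.insert(client_id.to_string(), self.next_seq);
    }

    /// Publishes a message, evicting the oldest one if the ring is full.
    pub fn push(&mut self, msg: T) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front();
        }
        self.buf.push_back(msg);
        self.next_seq += 1;
    }

    /// Returns the next message for `client_id`, or `Lagged(n)` if `n`
    /// messages were overwritten before this client read them.
    /// Unknown clients simply get `Empty` in this sketch.
    pub fn recv(&mut self, client_id: &str) -> Recv<T> {
        let oldest = self.next_seq - self.buf.len() as u64;
        let Some(cursor) = self.cursors.get_mut(client_id) else {
            return Recv::Empty;
        };
        if *cursor < oldest {
            let missed = oldest - *cursor;
            *cursor = oldest; // skip ahead to the oldest retained message
            *self.drops.entry(client_id.to_string()).or_insert(0) += missed;
            return Recv::Lagged(missed);
        }
        if *cursor == self.next_seq {
            return Recv::Empty;
        }
        let item = self.buf[(*cursor - oldest) as usize].clone();
        *cursor += 1;
        Recv::Item(item)
    }

    /// Total messages this client has lost on this topic.
    pub fn dropped(&self, client_id: &str) -> u64 {
        self.drops.get(client_id).copied().unwrap_or(0)
    }
}
```

With a capacity of 2, a client that drains after every push never lags, while a client that reads only after four pushes first gets `Lagged(2)` and then the two retained messages.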
@@ -0,0 +1,38 @@
# Leftover: `mission_executor::ac3_bounded_retry_then_success` polling race

**Timestamp**: 2026-05-20T08:30:00+02:00
**Origin**: Batch 8 (mission_executor state machine). Surfaced intermittently in batches 11, 12, and 13. Reproduces more reliably on the dev box under the batch 14 workspace test load (the new tonic stack increases build/runtime pressure).
**Severity**: Medium (test design, not production code)
**Not blocking**: pre-existing failure in an unrelated area; production `mission_executor` behaviour is correct, and the test simply has a polling race.

## Symptom

```
test ac3_bounded_retry_then_success ... FAILED
thread 'ac3_bounded_retry_then_success' panicked at
crates/mission_executor/tests/state_machine.rs:116:
FSM did not reach MissionUploaded; stuck at WaitAuto
```

`WaitAuto` is the FSM state *after* `MissionUploaded`. The FSM passed *through* `MissionUploaded` faster than the test's 5 ms polling cadence could observe it. The post-assertion (`matches!(state, WaitAuto | MissionUploaded)`) acknowledges either is fine, but `await_state(target=MissionUploaded)` panics before that assertion runs.

## Root cause

`crates/mission_executor/tests/state_machine.rs` lines 100-118: `await_state` polls every 5 ms and the FSM `tick_interval` is also 5 ms, so a successful retry+upload can complete in less than one polling interval.

## Recommended fix (out of scope for current batch)

Replace polling with an event latch:

- Add `MissionExecutorHandle::state_stream()` (or expose a `tokio::sync::watch::Receiver<MissionState>`) so tests can `await` the channel changing through the target state.
- Or: record a `Vec<MissionState>` history in `Inner` and assert that the target is *in* the history at the end, rather than checking only the current state.

Either approach is ~30 lines of test-only refactor. Production code does not need to change.
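
The history option can be sketched without any async runtime. The block below is a minimal std-only illustration, not the crate's code: `StateHistory`, `record`, `await_visited`, and the three-variant `MissionState` are hypothetical stand-ins, and the real `Inner` and state enum will differ. Recording every transition avoids the race because the check no longer depends on observing the state at the right moment. This matters for the channel option too, since a `watch` channel retains only the most recent value and a receiver can still miss a short-lived state.

```rust
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical subset of the FSM states, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MissionState {
    Idle,
    MissionUploaded,
    WaitAuto,
}

/// Append-only record of every state the FSM has entered.
#[derive(Default)]
pub struct StateHistory {
    visited: Mutex<Vec<MissionState>>,
    changed: Condvar,
}

impl StateHistory {
    /// Called by the FSM on every transition.
    pub fn record(&self, state: MissionState) {
        self.visited.lock().unwrap().push(state);
        self.changed.notify_all();
    }

    /// Blocks until `target` appears anywhere in the history, or until
    /// `timeout` elapses. Returns whether the target was ever visited.
    pub fn await_visited(&self, target: MissionState, timeout: Duration) -> bool {
        let deadline = Instant::now() + timeout;
        let mut visited = self.visited.lock().unwrap();
        loop {
            if visited.contains(&target) {
                return true;
            }
            let now = Instant::now();
            if now >= deadline {
                return false;
            }
            // Sleep until the next transition is recorded or the deadline passes.
            let (guard, _) = self.changed.wait_timeout(visited, deadline - now).unwrap();
            visited = guard;
        }
    }

    /// Most recently recorded state, if any.
    pub fn current(&self) -> Option<MissionState> {
        self.visited.lock().unwrap().last().copied()
    }
}
```

A test can then call `await_visited(MissionState::MissionUploaded, ...)` and still succeed after the FSM has already moved on to `WaitAuto`.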

## Replay instructions

When working on `mission_executor` next (e.g. a batch that touches the state machine or tick loop):

1. Pick one of the two fixes above.
2. Re-run `cargo test --workspace` to confirm the flake is gone.
3. Delete this leftover.