mirror of
https://github.com/azaion/autopilot.git
synced 2026-06-22 05:21:09 +00:00
[AZ-675] telemetry_stream Tonic gRPC server + per-client lossy queue
ci/woodpecker/push/build-arm Pipeline failed
Pins operator-link transport to gRPC server-streaming (closes architecture Q2 in favour of gRPC). Adds first-time tonic / prost / tonic-build infrastructure to the workspace; uses protoc-bin-vendored so neither dev machines nor CI need system protoc installed.

Design: back-pressure lives in the per-topic tokio::sync::broadcast ring, drained directly by the tonic-streamed response via BroadcastStream + StreamMap. No intermediate mpsc buffer that could absorb back-pressure invisibly. Slow client overrun -> Lagged(n) event -> per-(client_id, topic) drop counter incremented; healthy clients on the same topic are unaffected.

Service surface: Subscribe(SubscribeRequest) -> stream TelemetryMessage; five topics (TelemetrySample, GimbalState, DetectionEvent, MovementCandidate, MapObjectsBundle); empty topics list defaults to subscribe-all; empty client_id rejected; stream drop decrements subscribed_clients via StreamGuard.

TelemetrySink push_detections is now real; push_frame still NotImplemented (AZ-676 video path).

Tests: 6 unit + 5 integration (AC-1..AC-3 via in-process gRPC client, plus subscribe-all default + empty-client_id rejection). Clippy on telemetry_stream clean.

Pre-existing mission_executor ac3 test polling race surfaces more reliably under the new tonic build pressure; documented as _docs/_process_leftovers/2026-05-20_mission_executor_ac3_flake.md and unchanged by this batch.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,68 @@
# Telemetry gRPC Server + Per-Client Lossy Subscriber

**Task**: AZ-675_telemetry_stream_grpc_server
**Name**: Tonic gRPC server bind + per-client lossy subscriber bounded queue
**Description**: Bring up the operator-bound telemetry gRPC server (Tonic). Each client subscriber has a bounded queue. Slow clients drop their oldest messages and count the drops; they never block the producer.
**Complexity**: 3 points
**Dependencies**: AZ-640_initial_structure, AZ-649_mission_executor_telemetry_forwarding, AZ-657_frame_ingest_rtsp_session
**Component**: telemetry_stream
**Tracker**: AZ-675
**Epic**: AZ-637

## Problem

`telemetry_stream` is the operator-bound publisher for `TelemetrySample`, `GimbalState`, `DetectionEvent`, `MovementCandidate`, `MapObjectsBundle`. Throttling MUST be lossy and per-client so that a slow client never starves a healthy one. The server runs over the operator-link gRPC channel: the same physical transport as `operator_bridge`, but a separate logical service.
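
The commit message adds two request rules for these five topics: an empty topics list means subscribe-all, and an empty `client_id` is rejected. A minimal std-only sketch of that validation follows. The `Topic` enum, the `SubscribeRequest` struct shape, `SubscribeError`, and `resolve_topics` are hypothetical stand-ins for illustration, not the generated prost types or the actual handler.

```rust
use std::collections::BTreeSet;

/// The five topics named in this task.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Topic {
    TelemetrySample,
    GimbalState,
    DetectionEvent,
    MovementCandidate,
    MapObjectsBundle,
}

impl Topic {
    pub const ALL: [Topic; 5] = [
        Topic::TelemetrySample,
        Topic::GimbalState,
        Topic::DetectionEvent,
        Topic::MovementCandidate,
        Topic::MapObjectsBundle,
    ];
}

/// Hypothetical stand-in for the generated `SubscribeRequest` message.
#[derive(Debug, Clone)]
pub struct SubscribeRequest {
    pub client_id: String,
    pub topics: Vec<Topic>,
}

/// Hypothetical error type; a real handler would map it to a gRPC status.
#[derive(Debug, PartialEq, Eq)]
pub enum SubscribeError {
    EmptyClientId,
}

/// Validates the request and resolves the effective, de-duplicated topic set.
pub fn resolve_topics(req: &SubscribeRequest) -> Result<BTreeSet<Topic>, SubscribeError> {
    if req.client_id.is_empty() {
        return Err(SubscribeError::EmptyClientId);
    }
    if req.topics.is_empty() {
        // An empty list is treated as "subscribe to everything".
        return Ok(Topic::ALL.iter().copied().collect());
    }
    Ok(req.topics.iter().copied().collect())
}
```

Returning a set rather than a list is a choice made here so that a client repeating a topic does not end up with two streams for it; the source does not say how duplicates are handled.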

## Outcome

- Tonic gRPC server bound on `telemetry.listen_addr`, exposing either one subscribe-style streaming RPC per topic or a single multiplexed RPC.
- Each connected client has its own `(bounded_queue, drop_counter, last_sent_seq)` state.
- Producer fan-out places each message into every subscriber's queue, sharing it by refcount instead of copying where possible. Full queue → drop oldest, increment `drops_total{client_id, topic}`.
- A disconnect tears down the subscriber cleanly.
- Health surface: `subscribed_clients`, `drops_total{client_id, topic}`, `bytes_out_per_topic`.
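
The per-client state and refcounted fan-out listed above can be sketched with the standard library alone. This is an illustration of the spec as written (one bounded queue per client), not the shipped code: the commit message says the implementation keeps back-pressure in a `tokio::sync::broadcast` ring instead. `Subscriber`, `Fanout`, and all method names are hypothetical.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// Per-client state: the `(bounded_queue, drop_counter, last_sent_seq)` triple.
pub struct Subscriber<T> {
    queue: VecDeque<(u64, Arc<T>)>,
    max_queue: usize,
    pub drops_total: u64,
    pub last_sent_seq: Option<u64>,
}

impl<T> Subscriber<T> {
    pub fn new(max_queue: usize) -> Self {
        assert!(max_queue > 0, "max_queue must be at least 1");
        Self {
            queue: VecDeque::with_capacity(max_queue),
            max_queue,
            drops_total: 0,
            last_sent_seq: None,
        }
    }

    /// Never blocks: a full queue evicts its oldest entry to make room.
    fn push(&mut self, seq: u64, msg: Arc<T>) {
        if self.queue.len() == self.max_queue {
            self.queue.pop_front();
            self.drops_total += 1;
        }
        self.queue.push_back((seq, msg));
    }

    /// Takes the next message to send and records its sequence number.
    pub fn pop(&mut self) -> Option<Arc<T>> {
        let (seq, msg) = self.queue.pop_front()?;
        self.last_sent_seq = Some(seq);
        Some(msg)
    }
}

/// Fan-out for one topic: every subscriber receives a refcounted handle.
pub struct Fanout<T> {
    next_seq: u64,
    subscribers: HashMap<String, Subscriber<T>>,
}

impl<T> Fanout<T> {
    pub fn new() -> Self {
        Self { next_seq: 0, subscribers: HashMap::new() }
    }

    pub fn subscribe(&mut self, client_id: &str, max_queue: usize) {
        self.subscribers.insert(client_id.to_string(), Subscriber::new(max_queue));
    }

    pub fn unsubscribe(&mut self, client_id: &str) {
        self.subscribers.remove(client_id);
    }

    /// Allocates the message once, then hands out `Arc` clones (no payload copy).
    pub fn publish(&mut self, msg: T) {
        let shared = Arc::new(msg);
        let seq = self.next_seq;
        self.next_seq += 1;
        for sub in self.subscribers.values_mut() {
            sub.push(seq, Arc::clone(&shared));
        }
    }

    pub fn subscriber_mut(&mut self, client_id: &str) -> Option<&mut Subscriber<T>> {
        self.subscribers.get_mut(client_id)
    }

    pub fn subscribed_clients(&self) -> usize {
        self.subscribers.len()
    }
}
```

Because `publish` only pushes into in-memory queues and evicts on overflow, its cost depends on the number of subscribers, never on how fast any of them reads.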

## Scope

### Included

- Tonic server bind + cleanup.
- Per-client subscriber state.
- Drop-oldest back-pressure.
- Disconnect handling.

### Excluded

- The .proto schema (lives in `shared/contracts/telemetry-stream.proto`; if absent, add it as a side effect of this task).
- Diff-based snapshot emission for `MapObjectsBundle` (task 38).
- Operator commands (these live in the `operator_bridge` component).

## Acceptance Criteria

**AC-1: Multiple subscribers receive the same stream**
Given 3 clients subscribed to `TelemetrySample`
When 100 samples are published
Then each client receives all 100 (assuming no client is slow), in publication order.

**AC-2: Slow subscriber drops oldest, healthy unaffected**
Given client A reads slowly and client B reads at full speed
When the producer pushes 1000 samples while A is paused
Then client A's queue grows up to `max_queue` and then drops oldest (`drops_total{A} > 0`); client B receives all 1000.
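
The commit message satisfies AC-2 with a shared per-topic ring in which a slow reader is told it lagged by `n` messages and its drop counter is bumped. The sketch below is a std-only model of that idea, written to make the arithmetic visible; it is not `tokio::sync::broadcast`, and `Ring`, `Cursor`, and `Recv` are hypothetical names. It assumes a ring capacity of 64 for the scenario.

```rust
use std::collections::VecDeque;

/// Outcome of one read from the ring.
#[derive(Debug, PartialEq, Eq)]
pub enum Recv<T> {
    Item(T),
    /// The reader fell behind; `n` messages were overwritten before it read them.
    Lagged(u64),
    Empty,
}

/// Shared ring for one topic: keeps only the newest `capacity` messages.
pub struct Ring<T> {
    buf: VecDeque<T>,
    capacity: usize,
    /// Sequence number of the next message to be written.
    head: u64,
}

/// Per-client read position plus its drop counter.
pub struct Cursor {
    next: u64,
    pub drops_total: u64,
}

impl<T: Clone> Ring<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self { buf: VecDeque::with_capacity(capacity), capacity, head: 0 }
    }

    /// Producer side: constant work, regardless of how slow any reader is.
    pub fn send(&mut self, msg: T) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front();
        }
        self.buf.push_back(msg);
        self.head += 1;
    }

    /// A new reader starts at the current head and sees only later messages.
    pub fn subscribe(&self) -> Cursor {
        Cursor { next: self.head, drops_total: 0 }
    }

    pub fn recv(&self, cur: &mut Cursor) -> Recv<T> {
        let oldest = self.head - self.buf.len() as u64;
        if cur.next < oldest {
            // Overrun: skip to the oldest retained message and count the loss.
            let missed = oldest - cur.next;
            cur.next = oldest;
            cur.drops_total += missed;
            return Recv::Lagged(missed);
        }
        if cur.next == self.head {
            return Recv::Empty;
        }
        let idx = (cur.next - oldest) as usize;
        cur.next += 1;
        Recv::Item(self.buf[idx].clone())
    }
}
```

In the AC-2 scenario with this model, B reads after every send and sees all 1000 samples, while the paused A finds on its first read that 1000 - 64 = 936 samples were overwritten and resumes from sample 936.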

**AC-3: Disconnect cleanly removes subscriber**
Given a connected client
When the gRPC stream is canceled
Then `subscribed_clients` decrements by 1; producer fan-out no longer copies to that client.
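
The commit message says the decrement happens through a `StreamGuard` when the stream is dropped. A plausible shape for such a guard is an RAII type whose `Drop` impl undoes the registration, sketched below with the standard library only. The name `StreamGuard` comes from the commit message; its fields, the `Health` type, and `register` are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Shared health counter behind `subscribed_clients` (hypothetical type).
#[derive(Clone, Default)]
pub struct Health {
    subscribed_clients: Arc<AtomicUsize>,
}

impl Health {
    pub fn subscribed_clients(&self) -> usize {
        self.subscribed_clients.load(Ordering::SeqCst)
    }

    /// Registers one subscriber; it stays counted until the guard is dropped.
    pub fn register(&self) -> StreamGuard {
        self.subscribed_clients.fetch_add(1, Ordering::SeqCst);
        StreamGuard { counter: Arc::clone(&self.subscribed_clients) }
    }
}

/// RAII guard intended to be stored inside the per-client response stream.
pub struct StreamGuard {
    counter: Arc<AtomicUsize>,
}

impl Drop for StreamGuard {
    fn drop(&mut self) {
        // Runs on every exit path: client cancel, transport error, or shutdown.
        self.counter.fetch_sub(1, Ordering::SeqCst);
    }
}
```

Tying the decrement to `Drop` rather than to an explicit "on disconnect" callback means no teardown path can forget it, which is what AC-3 checks from the outside.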

## Non-Functional Requirements

**Performance**
- Per-message fan-out CPU: ≤2 ms p99 for ≤10 clients (per architecture NFR class).
- Tx tail latency end-to-end (producer → wire) ≤100 ms p95 over a healthy link.

**Reliability**
- No producer-side blocking on slow clients (hard rule).

## Runtime Completeness

- **Named capability**: Tonic gRPC operator telemetry stream with lossy per-client throttling.
- **Production code that must exist**: real gRPC server; real per-client subscriber state machine; real drop counters.
- **Allowed external stubs**: an in-process gRPC client in tests.
- **Unacceptable substitutes**: a single global queue shared by all clients (head-of-line blocking).