[AZ-675] telemetry_stream Tonic gRPC server + per-client lossy queue

Pins operator-link transport to gRPC server-streaming (closes
architecture Q2 in favour of gRPC). Adds first-time tonic / prost /
tonic-build infrastructure to the workspace; uses
protoc-bin-vendored so neither dev machines nor CI need system
protoc installed.

Design — back-pressure lives in the per-topic tokio::sync::broadcast
ring, drained directly by the tonic-streamed response via
BroadcastStream + StreamMap. No intermediate mpsc buffer that could
absorb back-pressure invisibly. Slow client overrun -> Lagged(n)
event -> per-(client_id, topic) drop counter incremented; healthy
clients on the same topic are unaffected.

Service surface — Subscribe(SubscribeRequest) -> stream
TelemetryMessage; five topics (TelemetrySample, GimbalState,
DetectionEvent, MovementCandidate, MapObjectsBundle); empty topics
list defaults to subscribe-all; empty client_id rejected; stream
drop decrements subscribed_clients via StreamGuard. TelemetrySink
push_detections is now real; push_frame still NotImplemented
(AZ-676 video path).

Tests — 6 unit + 5 integration (AC-1..AC-3 via in-process gRPC
client, plus subscribe-all default + empty-client_id rejection).
Clippy on telemetry_stream clean.

Pre-existing mission_executor ac3 test polling race surfaces more
reliably under the new tonic build pressure; documented as
_docs/_process_leftovers/2026-05-20_mission_executor_ac3_flake.md
and unchanged by this batch.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-20 12:44:08 +03:00
parent 9fe0bbeac9
commit ff790bd639
15 changed files with 1700 additions and 25 deletions
@@ -0,0 +1,68 @@
# Telemetry gRPC Server + Per-Client Lossy Subscriber
**Task**: AZ-675_telemetry_stream_grpc_server
**Name**: Tonic gRPC server bind + per-client lossy subscriber bounded queue
**Description**: Bring up the operator-bound telemetry gRPC server (Tonic). Each client's subscriber has a bounded queue: when a slow client's queue fills, the oldest message is dropped and the drop is counted. The producer is never blocked.
**Complexity**: 3 points
**Dependencies**: AZ-640_initial_structure, AZ-649_mission_executor_telemetry_forwarding, AZ-657_frame_ingest_rtsp_session
**Component**: telemetry_stream
**Tracker**: AZ-675
**Epic**: AZ-637
## Problem
`telemetry_stream` is the operator-bound publisher for `TelemetrySample`, `GimbalState`, `DetectionEvent`, `MovementCandidate`, `MapObjectsBundle`. Throttling MUST be lossy and per-client, so a slow client never starves a healthy one. The server runs over the operator-link gRPC channel: the same physical transport as `operator_bridge`, but a separate logical service.
## Outcome
- Tonic gRPC server bound on `telemetry.listen_addr`, exposing either one subscribe-style streaming RPC per topic or a single multiplexed RPC.
- Each connected client holds `(bounded_queue, drop_counter, last_sent_seq)` state.
- Producer fan-out places the message in each subscriber's queue, sharing it by reference count where possible instead of copying. Full queue → drop oldest, increment `drops_total{client_id, topic}`.
- Disconnects cleanly tear down the subscriber.
- Health surface: `subscribed_clients`, `drops_total{client_id, topic}`, `bytes_out_per_topic`.
## Scope
### Included
- Tonic server bind + cleanup.
- Per-client subscriber state.
- Drop-oldest back-pressure.
- Disconnect handling.
### Excluded
- Designing the .proto schema (it lives in `shared/contracts/telemetry-stream.proto`; if the file is absent, add it as a side effect of this task).
- Diff-based snapshot emission for `MapObjectsBundle` (task 38).
- Operator commands (live in the `operator_bridge` component).
## Acceptance Criteria
**AC-1: Multiple subscribers receive the same stream**
Given 3 clients subscribed to `TelemetrySample`
When 100 samples are published
Then each client receives all 100 (assuming no client is slow), with ordering preserved.
**AC-2: Slow subscriber drops oldest, healthy unaffected**
Given client A reads slowly and client B reads at full speed
When producer pushes 1000 samples while A is paused
Then client A's queue grows up to `max_queue` and then drops oldest (`drops_total{A} > 0`); client B receives all 1000.
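
The commit message describes meeting this criterion with a shared per-topic broadcast ring, where an overrun reader sees a `Lagged(n)` event and its drop counter is incremented. The sketch below is a std-only model of that idea, not the `tokio::sync::broadcast` API itself: `Ring`, `Cursor` and `Recv` are hypothetical names, and the model is single-threaded for clarity.

```rust
use std::collections::VecDeque;

/// Result of one read attempt in this model.
#[derive(Debug, PartialEq)]
pub enum Recv<T> {
    Item(T),
    /// The reader fell behind; `n` messages were overwritten before it read them.
    Lagged(u64),
    Empty,
}

/// Shared per-topic ring that retains only the newest `capacity` messages.
pub struct Ring<T> {
    buf: VecDeque<T>,
    capacity: usize,
    head_seq: u64, // sequence number of buf[0]
}

impl<T: Clone> Ring<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self { buf: VecDeque::with_capacity(capacity), capacity, head_seq: 0 }
    }

    /// Producer side: never blocks; a full ring overwrites its oldest slot.
    pub fn send(&mut self, value: T) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front();
            self.head_seq += 1;
        }
        self.buf.push_back(value);
    }

    fn end_seq(&self) -> u64 {
        self.head_seq + self.buf.len() as u64
    }
}

/// Per-client read position plus that client's drop counter.
pub struct Cursor {
    next_seq: u64,
    pub drops_total: u64,
}

impl Cursor {
    /// A new cursor starts at the current end: it sees only later messages.
    pub fn new<T: Clone>(ring: &Ring<T>) -> Self {
        Self { next_seq: ring.end_seq(), drops_total: 0 }
    }

    pub fn recv<T: Clone>(&mut self, ring: &Ring<T>) -> Recv<T> {
        if self.next_seq < ring.head_seq {
            // Overrun: skip to the oldest retained message and count the loss.
            let missed = ring.head_seq - self.next_seq;
            self.next_seq = ring.head_seq;
            self.drops_total += missed;
            return Recv::Lagged(missed);
        }
        let idx = (self.next_seq - ring.head_seq) as usize;
        match ring.buf.get(idx) {
            Some(value) => {
                self.next_seq += 1;
                Recv::Item(value.clone())
            }
            None => Recv::Empty,
        }
    }
}
```

In this model the producer only ever touches the ring, so a paused cursor costs it nothing; the lag is discovered and counted when the slow reader next calls `recv`.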
**AC-3: Disconnect cleanly removes subscriber**
Given a connected client
When the gRPC stream is canceled
Then `subscribed_clients` decrements by 1; producer fan-out no longer copies to that client.
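
The commit message says the decrement happens through a `StreamGuard` when the stream is dropped. One plausible shape for such a guard is an RAII type over a shared atomic gauge, sketched below. The field layout, the constructor and the choice of `Ordering::Relaxed` are assumptions for illustration, not the code from this commit.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical guard: holds a handle to the `subscribed_clients` gauge
/// for as long as the client's response stream is alive.
pub struct StreamGuard {
    subscribed_clients: Arc<AtomicUsize>,
}

impl StreamGuard {
    /// Register a new subscriber and return the guard that will unregister it.
    pub fn new(subscribed_clients: Arc<AtomicUsize>) -> Self {
        subscribed_clients.fetch_add(1, Ordering::Relaxed);
        Self { subscribed_clients }
    }
}

impl Drop for StreamGuard {
    /// Runs when the owning stream is dropped, whether the client
    /// cancelled, disconnected, or the stream ended normally.
    fn drop(&mut self) {
        self.subscribed_clients.fetch_sub(1, Ordering::Relaxed);
    }
}
```

Storing the guard inside the stream value ties the gauge to the stream's lifetime, so no separate disconnect callback is needed to keep the count accurate.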
## Non-Functional Requirements
**Performance**
- Per-message fan-out CPU: ≤2 ms p99 for ≤10 clients (per architecture NFR class).
- Tx tail latency end-to-end (producer → wire) ≤100 ms p95 over a healthy link.
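
To check budgets such as the p99 and p95 figures above, a benchmark or test needs a percentile definition. The helper below is a sketch that assumes the nearest-rank definition; the spec does not say which definition applies, and `percentile` and `within_budget` are hypothetical names.

```rust
use std::time::Duration;

/// Nearest-rank percentile: the smallest sample such that at least
/// `p` percent of all samples are less than or equal to it.
/// Returns `None` for an empty input or `p` above 100.
pub fn percentile(samples: &[Duration], p: u32) -> Option<Duration> {
    if samples.is_empty() || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // rank = ceil(p * n / 100), computed in integers to avoid float rounding.
    let rank = (p as usize * sorted.len() + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}

/// True when the `p`-th percentile of `samples` is within `budget`.
pub fn within_budget(samples: &[Duration], p: u32, budget: Duration) -> bool {
    percentile(samples, p).map_or(false, |value| value <= budget)
}
```

For example, `within_budget(&fan_out_times, 99, Duration::from_millis(2))` would express the fan-out CPU budget, given a hypothetical `fan_out_times` slice of measured durations.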
**Reliability**
- No producer-side blocking on slow clients (hard rule).
## Runtime Completeness
- **Named capability**: Tonic gRPC operator telemetry stream with lossy per-client throttling.
- **Production code that must exist**: real gRPC server; real per-client subscriber state machine; real drop counters.
- **Allowed external stubs**: an in-process gRPC client in tests.
- **Unacceptable substitutes**: a single global queue, because it causes head-of-line blocking.