LiveKit Stream Detection
Task: AZ-150_livekit_stream_detection
Name: LiveKit Stream Detection Integration
Description: Enable real-time object detection on 5-10 simultaneous LiveKit WebRTC streams. Two-app architecture: a Playwright companion app for authentication and stream discovery, plus LiveKit SDK integration in the detection service for frame capture and inference.
Complexity: 5 points
Dependencies: None (extends existing detection service)
Component: Feature
Jira: AZ-150
Epic: AZ-149
Problem
The platform streams live video via LiveKit WebRTC. The detection service currently only processes pre-recorded video files and static images via cv2.VideoCapture. There is no way to run real-time object detection on live streams. The user needs to detect objects on 5-10 out of 50+ simultaneous streams shown on the platform's web page.
Key constraints:
- No LiveKit API key/secret available (only browser-level access)
- LiveKit WebRTC streams cannot be consumed by cv2.VideoCapture
- Tokens are issued by the platform's backend and expire periodically
- Must handle 5-10 concurrent streams without overwhelming the GPU inference engine
Outcome
- User can open the platform's stream page in a Playwright-controlled browser, log in, and see all available streams
- The system automatically discovers stream IDs, LiveKit room names, tokens, and the server URL from network traffic and the DOM
- User selects which streams to run detection on via an injected UI overlay
- Detection runs continuously on selected streams with results flowing through the existing SSE endpoint
- Tokens are automatically refreshed as the page renews them
Architecture
Two-App Design
App 1: stream_discover.py (Playwright companion)
- Launches real Chromium browser (separate process)
- Python controls it via Chrome DevTools Protocol (CDP) over local WebSocket
- User interacts with browser normally (login, navigation)
- Python silently intercepts network traffic and reads the DOM
- Injects a floating selection UI onto the page
- Sends selected stream configs to the detection service API
App 2: Detection Service (existing FastAPI in main.py)
- New /detect/livekit/* endpoints receive stream configs from companion app
- livekit_source.py connects to LiveKit rooms via livekit.rtc SDK
- livekit_detector.py orchestrates multi-stream frame capture and batched inference
- inference.pyx gains a new detect_frames() method for raw numpy frame batches
- Results flow through existing SSE /detect/stream endpoint
How Playwright Works (NOT a Webview)
Playwright is a browser automation library by Microsoft. It does NOT embed a browser inside a Python window. Instead:
- playwright.chromium.launch(headless=False) starts a real standalone Chromium process -- identical to opening Chrome
- Python communicates with this browser via CDP (Chrome DevTools Protocol) over a local WebSocket
- The user sees a normal browser window and interacts with it normally (login, clicking, scrolling)
- Python silently observes all network traffic, reads the DOM, and can inject HTML/JavaScript
- There is no Python GUI -- the browser window IS the entire interface
Python Process                    Chromium Process (separate)
+--------------------------+      +---------------------------+
| stream_discover.py       |      | Normal browser window     |
|                          |      |                           |
| - Playwright library     | CDP  | - User logs in normally   |
| - Token interceptor      |<====>| - DevTools Protocol       |
| - DOM parser             |  WS  | - Full web app rendering  |
| - Selection UI injector  |      | - LiveKit video playback  |
+--------------------------+      +---------------------------+
Advantages over a webview:
- No GUI code to write -- browser IS the UI
- User sees the exact same web app they normally use
- Full access to network requests, cookies, localStorage
- Playwright handles CDP complexity
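A minimal sketch of the observe-only pattern described above, using Playwright's sync API. The helper names (`find_jwts`, `livekit_server_url`, `run`) are hypothetical, and two things are assumptions rather than facts about the platform: that tokens arrive as JWT strings somewhere inside JSON response bodies, and that the LiveKit signalling socket uses a path ending in `/rtc`.

```python
import re
from typing import Any, Iterator, Optional, Set
from urllib.parse import urlsplit

# Three base64url segments separated by dots; a JWT header starts with "eyJ" ('{"' encoded).
JWT_RE = re.compile(r"^eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$")


def find_jwts(obj: Any) -> Iterator[str]:
    """Recursively yield every JWT-looking string in a decoded JSON body."""
    if isinstance(obj, str):
        if JWT_RE.match(obj):
            yield obj
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from find_jwts(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_jwts(item)


def livekit_server_url(ws_url: str) -> Optional[str]:
    """Return scheme://host for a signalling socket, else None (assumes a /rtc path)."""
    parts = urlsplit(ws_url)
    if parts.scheme in ("ws", "wss") and parts.path.rstrip("/").endswith("/rtc"):
        return f"{parts.scheme}://{parts.netloc}"
    return None


def run(start_url: str) -> None:
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    tokens: Set[str] = set()
    server_urls: Set[str] = set()

    def on_response(response) -> None:
        if "application/json" not in response.headers.get("content-type", ""):
            return
        try:
            tokens.update(find_jwts(response.json()))
        except Exception:
            pass  # body unavailable (e.g. a redirect) or not valid JSON

    def on_websocket(ws) -> None:
        url = livekit_server_url(ws.url)
        if url:
            server_urls.add(url)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # visible, standalone Chromium
        page = browser.new_page()
        page.on("response", on_response)
        page.on("websocket", on_websocket)
        page.goto(start_url)
        page.wait_for_event("close", timeout=0)  # block until the user closes the tab
        browser.close()
    print(f"captured {len(tokens)} token(s), servers: {sorted(server_urls)}")
```

The user drives the browser as usual while the two listeners collect tokens and server URLs in the background; nothing in the page is modified by this sketch.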
Data Flow
1. User logs in via browser
2. User navigates to streams page
3. Python intercepts HTTP responses containing LiveKit JWT tokens
4. Python parses DOM for data-testid="mission-video-*" elements
5. Python decodes JWTs to extract room names
6. Python injects floating panel with stream checkboxes onto the page
7. User selects streams, clicks "Start Detection"
8. Python POSTs {livekit_url, rooms[{name, token, stream_id}]} to detection service
9. Detection service connects to LiveKit rooms via livekit.rtc
10. Frames are sampled, batched, and run through inference engine
11. DetectionEvents emitted via existing SSE /detect/stream
12. Python companion stays open, intercepts token refreshes, pushes to detection service
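Step 5 can be sketched with the standard library alone. LiveKit access tokens are JWTs whose `video` grant carries the room name; because no API secret is available, the sketch reads the claims without verifying the signature. The function names are illustrative.

```python
import base64
import json
from typing import Any, Dict, Optional


def decode_jwt_payload(token: str) -> Dict[str, Any]:
    """Return the claims of a JWT WITHOUT verifying its signature."""
    segments = token.split(".")
    if len(segments) != 3:
        raise ValueError("not a three-part JWT")
    payload_b64 = segments[1]
    # base64url segments omit padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def room_name(token: str) -> Optional[str]:
    """Read the room from the token's `video` grant, or None if absent."""
    return decode_jwt_payload(token).get("video", {}).get("room")
```

The same payload also exposes the standard `exp` claim, which the companion app could use to anticipate token expiry.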
Multi-Stream Frame Processing
Stream 1 (async task) ─── sample every Nth frame ──┐
Stream 2 (async task) ─── sample every Nth frame ──├─► Shared Frame Queue ─► Detection Worker Thread
Stream N (async task) ─── sample every Nth frame ──┘           │                        │
                                                               │                        ▼
                                                         backpressure:      inference.detect_frames()
                                                       keep only latest                 │
                                                       frame per stream                 ▼
                                                                              DetectionEvent → SSE
- At 30fps input with frame_period_recognition=4: ~7.5 fps per stream
- 10 streams = ~75 frames/sec into the queue
- Engine batch size determines how many frames are processed at once
- Backpressure: each stream keeps only its latest unprocessed frame; stale frames dropped
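One way to realise the "latest frame per stream" backpressure is a small keyed buffer instead of a plain FIFO queue. This is a sketch under assumptions, not the service's actual design: `LatestFrameBuffer` and `should_sample` are hypothetical names, and frames are treated as opaque objects.

```python
import threading
from collections import OrderedDict
from typing import Any, List, Optional, Tuple


def should_sample(frame_index: int, frame_period: int) -> bool:
    """Keep every Nth frame (indices 0, N, 2N, ...)."""
    return frame_index % frame_period == 0


class LatestFrameBuffer:
    """Holds at most one pending frame per stream; newer frames replace older ones."""

    def __init__(self) -> None:
        self._frames: "OrderedDict[str, Any]" = OrderedDict()
        self._cond = threading.Condition()
        self.dropped = 0  # count of stale frames that were overwritten

    def put(self, stream_id: str, frame: Any) -> None:
        with self._cond:
            if stream_id in self._frames:
                self.dropped += 1
            # Overwriting an existing key keeps its place in line, so a
            # stream that has waited longest is still served first.
            self._frames[stream_id] = frame
            self._cond.notify()

    def get_batch(self, max_size: int, timeout: Optional[float] = None) -> List[Tuple[str, Any]]:
        """Pop up to max_size (stream_id, frame) pairs, oldest-waiting stream first."""
        with self._cond:
            if not self._frames:
                self._cond.wait(timeout)
            batch: List[Tuple[str, Any]] = []
            while self._frames and len(batch) < max_size:
                batch.append(self._frames.popitem(last=False))
            return batch
```

Because a served stream re-enters at the back of the order, the worker visits streams in a round-robin-like fashion, and memory stays bounded at one frame per stream regardless of how far inference falls behind.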
Scope
Included
Companion App (stream_discover.py)
- Playwright browser launch and lifecycle management
- Network response interception for LiveKit JWT token capture
- WebSocket URL interception for LiveKit server URL discovery
- DOM parsing for stream ID and display name extraction
- JWT decoding to map stream_id -> room_name
- Injected floating UI panel with stream checkboxes and "Start Detection" button
- HTTP POST to detection service with selected stream configs
- Token refresh monitoring and forwarding
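The request body named in the data flow, {livekit_url, rooms[{name, token, stream_id}]}, could be assembled and sent as follows. This is a sketch using only the standard library; `RoomConfig`, `build_start_payload` and `post_json` are hypothetical names, and the service address in the usage note is an assumption.

```python
import json
import urllib.request
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class RoomConfig:
    name: str       # LiveKit room name decoded from the token
    token: str      # JWT captured from network traffic
    stream_id: str  # identifier read from the page's DOM


def build_start_payload(livekit_url: str, rooms: List[RoomConfig]) -> Dict[str, Any]:
    """Build the JSON body for the start request."""
    return {"livekit_url": livekit_url, "rooms": [asdict(room) for room in rooms]}


def post_json(url: str, payload: Dict[str, Any], timeout: float = 10.0) -> int:
    """POST a JSON body and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Assuming the detection service listens on localhost:8000, a call would look like `post_json("http://localhost:8000/detect/livekit/start", build_start_payload(url, rooms))`. Token refreshes could reuse the same helper against /detect/livekit/refresh-tokens.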
Detection Service
- livekit_source.py: LiveKit room connection, video track subscription, VideoFrame -> BGR numpy conversion
- livekit_detector.py: multi-stream task orchestration, frame sampling, shared queue, batched detection loop, SSE event emission
- inference.pyx/.pxd: new detect_frames(frames, config) cpdef method for raw numpy frame batches
- main.py: new endpoints POST /detect/livekit/start, POST /detect/livekit/refresh-tokens, DELETE /detect/livekit/stop, GET /detect/livekit/status
- requirements.txt: add livekit and playwright dependencies
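The four endpoints could be wired roughly as below. This is only a sketch: `LiveKitSessions` is a hypothetical bookkeeping class standing in for livekit_detector.py, the response shapes are invented for illustration, and in the real service the routes would be added to the existing app in main.py rather than to a new one.

```python
from typing import Any, Dict, Optional


class LiveKitSessions:
    """In-memory record of active streams, keyed by stream_id (hypothetical helper)."""

    def __init__(self) -> None:
        self.livekit_url: Optional[str] = None
        self.rooms: Dict[str, Dict[str, Any]] = {}

    def start(self, config: Dict[str, Any]) -> Dict[str, Any]:
        self.livekit_url = config["livekit_url"]
        for room in config["rooms"]:
            self.rooms[room["stream_id"]] = dict(room)
        return self.status()

    def refresh_tokens(self, config: Dict[str, Any]) -> Dict[str, Any]:
        updated = 0
        for room in config["rooms"]:
            current = self.rooms.get(room["stream_id"])
            if current is not None:  # ignore streams that are not running
                current["token"] = room["token"]
                updated += 1
        return {"updated": updated}

    def stop(self) -> Dict[str, Any]:
        stopped = len(self.rooms)
        self.rooms.clear()
        return {"stopped": stopped}

    def status(self) -> Dict[str, Any]:
        return {"livekit_url": self.livekit_url, "active_streams": sorted(self.rooms)}


def create_app(sessions: LiveKitSessions):
    # Imported lazily so the class above can be exercised without FastAPI installed.
    from fastapi import Body, FastAPI

    app = FastAPI()

    @app.post("/detect/livekit/start")
    def start(config: Dict[str, Any] = Body(...)):
        return sessions.start(config)

    @app.post("/detect/livekit/refresh-tokens")
    def refresh_tokens(config: Dict[str, Any] = Body(...)):
        return sessions.refresh_tokens(config)

    @app.delete("/detect/livekit/stop")
    def stop():
        return sessions.stop()

    @app.get("/detect/livekit/status")
    def status():
        return sessions.status()

    return app
```

Keeping the state in a plain class makes the start/refresh/stop logic testable without an HTTP server; a production version would validate the body with a Pydantic model rather than accepting a raw dict.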
Excluded
- LiveKit API key/secret based token generation (no access)
- Publishing video back to LiveKit
- Recording or saving stream frames to disk
- Modifying existing /detect or /detect/{media_id} endpoints
- UI beyond the injected browser overlay
Acceptance Criteria
AC-1: Stream Discovery
Given the user opens the platform's stream page in the Playwright browser
When the page loads and streams are rendered
Then the companion app discovers all stream IDs, display names, LiveKit tokens, room names, and server URL from network traffic and DOM
AC-2: Stream Selection UI
Given streams are discovered
When the companion app injects the selection panel
Then the user sees a floating panel listing all streams with checkboxes and a "Start Detection" button
AC-3: Start Detection
Given the user selects N streams and clicks "Start Detection"
When the companion app sends the config to the detection service
Then the detection service connects to N LiveKit rooms and begins receiving video frames
AC-4: Real-Time Inference
Given the detection service is receiving frames from LiveKit streams
When frames are sampled and batched through the inference engine
Then DetectionEvents with annotations are emitted via the existing SSE /detect/stream endpoint
AC-5: Multi-Stream Handling
Given 5-10 streams are active simultaneously
When inference runs continuously
Then all streams are processed fairly (round-robin or queue-based) without any stream being starved
AC-6: Token Refresh
Given the platform's frontend refreshes LiveKit tokens periodically
When the companion app detects a token renewal in network traffic
Then the new token is forwarded to the detection service and the LiveKit connection continues without interruption
AC-7: Stop Detection
Given detection is running on N streams
When the user calls DELETE /detect/livekit/stop
Then all LiveKit connections are cleanly closed and detection tasks cancelled
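The clean shutdown required by AC-7 maps naturally onto asyncio task cancellation. In this sketch `stream_worker` is a hypothetical stand-in for a per-room task; its `finally` block marks where a real implementation would disconnect from the LiveKit room.

```python
import asyncio
from typing import Dict, List


async def stream_worker(stream_id: str, closed: List[str]) -> None:
    """Stand-in for one LiveKit room task (hypothetical)."""
    try:
        while True:
            await asyncio.sleep(0.01)  # placeholder for "receive and sample a frame"
    finally:
        closed.append(stream_id)  # placeholder for disconnecting from the room


async def stop_all(tasks: Dict[str, "asyncio.Task[None]"]) -> int:
    """Cancel every stream task and wait until each has finished its cleanup."""
    for task in tasks.values():
        task.cancel()
    # return_exceptions=True collects each CancelledError instead of raising it.
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    tasks.clear()
    return sum(isinstance(result, asyncio.CancelledError) for result in results)


async def demo() -> List[str]:
    closed: List[str] = []
    tasks = {
        sid: asyncio.create_task(stream_worker(sid, closed))
        for sid in ("s1", "s2", "s3")
    }
    await asyncio.sleep(0.05)  # let the workers start before stopping them
    cancelled = await stop_all(tasks)
    assert cancelled == 3 and not tasks
    return sorted(closed)
```

Awaiting the cancelled tasks, rather than only calling `cancel()`, is what guarantees that every connection has run its cleanup before the stop request returns.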
File Changes
| File | Action | Description |
|---|---|---|
| stream_discover.py | New | Playwright companion app |
| livekit_source.py | New | LiveKit room connection and frame capture |
| livekit_detector.py | New | Multi-stream detection orchestration |
| inference.pyx | Modified | Add detect_frames cpdef method |
| inference.pxd | Modified | Declare detect_frames method |
| main.py | Modified | Add /detect/livekit/* endpoints |
| requirements.txt | Modified | Add livekit, playwright |
Non-Functional Requirements
Performance
- Frame-to-detection latency < 500ms per batch (excluding network latency)
- 10 concurrent streams without OOM or queue overflow
Reliability
- Graceful handling of LiveKit disconnections (auto-reconnect or clean stop)
- Token expiry handled without crash
Risks & Mitigation
Risk 1: LiveKit Python SDK frame format compatibility
- Risk: VideoFrame format (RGBA/I420/NV12) may vary by codec and platform
- Mitigation: Use frame.convert(VideoBufferType.RGBA) to normalize, then convert to BGR
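The normalize-then-reorder mitigation might look like this. The sketch assumes the converted RGBA buffer is tightly packed (width * 4 bytes per row, no padding); the function names are illustrative.

```python
import numpy as np


def rgba_bytes_to_bgr(data, width: int, height: int) -> np.ndarray:
    """Convert a packed RGBA buffer into an OpenCV-style BGR array (H x W x 3)."""
    rgba = np.frombuffer(data, dtype=np.uint8).reshape(height, width, 4)
    # Channels 2, 1, 0 are B, G, R; the alpha channel (3) is dropped.
    return np.ascontiguousarray(rgba[:, :, 2::-1])


def livekit_frame_to_bgr(frame) -> np.ndarray:
    """Normalize a livekit.rtc.VideoFrame to RGBA, then reorder to BGR."""
    from livekit import rtc  # lazy import: only needed when real frames are present

    rgba_frame = frame.convert(rtc.VideoBufferType.RGBA)
    return rgba_bytes_to_bgr(rgba_frame.data, rgba_frame.width, rgba_frame.height)
```

`np.ascontiguousarray` copies the reversed view into its own memory, so the result does not keep a reference to the SDK's frame buffer and can be queued safely for later inference.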
Risk 2: Token expiration before refresh is captured
- Risk: If the companion app misses a token refresh, the LiveKit connection drops
- Mitigation: Implement reconnection logic in livekit_source.py; companion app can re-request tokens
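Reconnection logic of this kind is often written as a retry loop with exponential backoff that re-reads the newest token before each attempt. The sketch below is one possible shape, not the planned implementation: `connect` and `get_token` are hypothetical callables that livekit_source.py would supply.

```python
import asyncio
from typing import Awaitable, Callable


async def connect_with_retry(
    connect: Callable[[str], Awaitable[None]],
    get_token: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> int:
    """Try to connect, fetching the latest token before every attempt.

    Returns the number of the attempt that succeeded and re-raises the
    last error once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            await connect(get_token())
            return attempt
        except Exception:
            if attempt >= max_attempts:
                raise
            # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
            await asyncio.sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

Reading the token inside the loop matters: if the companion app forwards a refreshed token while the service is backing off, the next attempt picks it up automatically.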
Risk 3: Inference engine bottleneck with 10 streams
- Risk: GPU/CPU inference cannot keep up with frame arrival rate
- Mitigation: Backpressure design (drop stale frames); configurable frame_period_recognition to reduce load
Risk 4: Playwright browser stability
- Risk: Long-running browser session may leak memory or crash
- Mitigation: Monitor browser process health; provide manual restart capability
Risk 5: LiveKit room structure unknown
- Risk: Rooms may be structured differently than expected (multi-track, SFU routing)
- Mitigation: Start with single-track subscription per room; adapt after initial testing