# LiveKit Stream Detection

**Task**: AZ-150_livekit_stream_detection
**Name**: LiveKit Stream Detection Integration
**Description**: Enable real-time object detection on 5-10 simultaneous LiveKit WebRTC streams. Two-app architecture: a Playwright companion app for authentication and stream discovery, plus LiveKit SDK integration in the detection service for frame capture and inference.
**Complexity**: 5 points
**Dependencies**: None (extends existing detection service)
**Component**: Feature
**Jira**: AZ-150
**Epic**: AZ-149

## Problem

The platform streams live video via LiveKit WebRTC. The detection service currently only processes pre-recorded video files and static images via `cv2.VideoCapture`. There is no way to run real-time object detection on live streams.

The user needs to detect objects on 5-10 out of 50+ simultaneous streams shown on the platform's web page.

Key constraints:
- No LiveKit API key/secret available (only browser-level access)
- LiveKit WebRTC streams cannot be consumed by `cv2.VideoCapture`
- Tokens are issued by the platform's backend and expire periodically
- Must handle 5-10 concurrent streams without overwhelming the GPU inference engine

## Outcome

- User can open the platform's stream page in a Playwright-controlled browser, log in, and see all available streams
- The system automatically discovers stream IDs, LiveKit room names, tokens, and server URL from network traffic
- User selects which streams to run detection on via an injected UI overlay
- Detection runs continuously on selected streams with results flowing through the existing SSE endpoint
- Tokens are automatically refreshed as the page renews them

## Architecture

### Two-App Design

```
App 1: stream_discover.py (Playwright companion)
  - Launches real Chromium browser (separate process)
  - Python controls it via Chrome DevTools Protocol (CDP) over local WebSocket
  - User interacts with browser normally (login, navigation)
  - Python silently intercepts network traffic and reads the DOM
  - Injects a floating selection UI onto the page
  - Sends selected stream configs to the detection service API

App 2: Detection Service (existing FastAPI in main.py)
  - New /detect/livekit/* endpoints receive stream configs from companion app
  - livekit_source.py connects to LiveKit rooms via livekit.rtc SDK
  - livekit_detector.py orchestrates multi-stream frame capture and batched inference
  - inference.pyx gains a new detect_frames() method for raw numpy frame batches
  - Results flow through existing SSE /detect/stream endpoint
```

### How Playwright Works (NOT a Webview)

Playwright is a browser automation library by Microsoft. It does NOT embed a browser inside a Python window. Instead:

1. `playwright.chromium.launch(headless=False)` starts a **real standalone Chromium process** -- identical to opening Chrome
2. Python communicates with this browser via CDP (Chrome DevTools Protocol) over a local WebSocket
3. The user sees a normal browser window and interacts with it normally (login, clicking, scrolling)
4. Python silently observes all network traffic, reads the DOM, and can inject HTML/JavaScript
5. There is no Python GUI -- the browser window IS the entire interface

```
Python Process                    Chromium Process (separate)
+--------------------------+      +---------------------------+
| stream_discover.py       |      | Normal browser window     |
|                          |      |                           |
| - Playwright library     | CDP  | - User logs in normally   |
| - Token interceptor      |<====>| - DevTools Protocol       |
| - DOM parser             |  WS  | - Full web app rendering  |
| - Selection UI injector  |      | - LiveKit video playback  |
+--------------------------+      +---------------------------+
```

Advantages over a webview:
- No GUI code to write -- browser IS the UI
- User sees the exact same web app they normally use
- Full access to network requests, cookies, localStorage
- Playwright handles CDP complexity

### Data Flow

```
1.  User logs in via browser
2.  User navigates to streams page
3.  Python intercepts HTTP responses containing LiveKit JWT tokens
4.  Python parses DOM for data-testid="mission-video-*" elements
5.  Python decodes JWTs to extract room names
6.  Python injects floating panel with stream checkboxes onto the page
7.  User selects streams, clicks "Start Detection"
8.  Python POSTs {livekit_url, rooms[{name, token, stream_id}]} to detection service
9.  Detection service connects to LiveKit rooms via livekit.rtc
10. Frames are sampled, batched, and run through inference engine
11. DetectionEvents emitted via existing SSE /detect/stream
12. Python companion stays open, intercepts token refreshes, pushes to detection service
```

### Multi-Stream Frame Processing

```
Stream 1 (async task) ─── sample every Nth frame ──┐
Stream 2 (async task) ─── sample every Nth frame ──├─► Shared Frame Queue ─► Detection Worker Thread
Stream N (async task) ─── sample every Nth frame ──┘           │                        │
                                                               │                        ▼
                                                         backpressure:      inference.detect_frames()
                                                         keep only latest               │
                                                         frame per stream               ▼
                                                                              DetectionEvent → SSE
```

- At 30fps input with frame_period_recognition=4: ~7.5 fps per stream
- 10 streams = ~75 frames/sec into the queue
- Engine batch size determines how many frames are processed at once
- Backpressure: each stream keeps only its latest unprocessed frame; stale frames dropped

## Scope

### Included

**Companion App (stream_discover.py)**
- Playwright browser launch and lifecycle management
- Network response interception for LiveKit JWT token capture
- WebSocket URL interception for LiveKit server URL discovery
- DOM parsing for stream ID and display name extraction
- JWT decoding to map stream_id -> room_name
- Injected floating UI panel with stream checkboxes and "Start Detection" button
- HTTP POST to detection service with selected stream configs
- Token refresh monitoring and forwarding

**Detection Service**
- `livekit_source.py`: LiveKit room connection, video track subscription, VideoFrame -> BGR numpy conversion
- `livekit_detector.py`: multi-stream task orchestration, frame sampling, shared queue, batched detection loop, SSE event emission
- `inference.pyx`/`.pxd`: new `detect_frames(frames, config)` cpdef method for raw numpy frame batches
- `main.py`: new endpoints POST /detect/livekit/start, POST /detect/livekit/refresh-tokens, DELETE /detect/livekit/stop, GET /detect/livekit/status
- `requirements.txt`: add `livekit` and `playwright` dependencies

### Excluded

- LiveKit API key/secret based token generation (no access)
- Publishing video back to LiveKit
- Recording or saving stream frames to disk
- Modifying existing /detect or /detect/{media_id} endpoints
- UI beyond the injected browser overlay

## Acceptance Criteria

**AC-1: Stream Discovery**
Given the user opens the platform's stream page in the Playwright browser
When the page loads and streams are rendered
Then the companion app discovers all stream IDs, display names, LiveKit tokens, room names, and server URL from network traffic and DOM

**AC-2: Stream Selection UI**
Given streams are discovered
When the companion app injects the selection panel
Then the user sees a floating panel listing all streams with checkboxes and a "Start Detection" button

**AC-3: Start Detection**
Given the user selects N streams and clicks "Start Detection"
When the companion app sends the config to the detection service
Then the detection service connects to N LiveKit rooms and begins receiving video frames

**AC-4: Real-Time Inference**
Given the detection service is receiving frames from LiveKit streams
When frames are sampled and batched through the inference engine
Then DetectionEvents with annotations are emitted via the existing SSE /detect/stream endpoint

**AC-5: Multi-Stream Handling**
Given 5-10 streams are active simultaneously
When inference runs continuously
Then all streams are processed fairly (round-robin or queue-based) without any stream being starved

**AC-6: Token Refresh**
Given the platform's frontend refreshes LiveKit tokens periodically
When the companion app detects a token renewal in network traffic
Then the new token is forwarded to the detection service and the LiveKit connection continues without interruption

**AC-7: Stop Detection**
Given detection is running on N streams
When the user calls DELETE /detect/livekit/stop
Then all LiveKit connections are cleanly closed and detection tasks cancelled

## File Changes

| File | Action | Description |
|------|--------|-------------|
| `stream_discover.py` | New | Playwright companion app |
| `livekit_source.py` | New | LiveKit room connection and frame capture |
| `livekit_detector.py` | New | Multi-stream detection orchestration |
| `inference.pyx` | Modified | Add `detect_frames` cpdef method |
| `inference.pxd` | Modified | Declare `detect_frames` method |
| `main.py` | Modified | Add /detect/livekit/* endpoints |
| `requirements.txt` | Modified | Add `livekit`, `playwright` |

## Non-Functional Requirements

**Performance**
- Frame-to-detection latency < 500ms per batch (excluding network latency)
- 10 concurrent streams without OOM or queue overflow

**Reliability**
- Graceful handling of LiveKit disconnections (auto-reconnect or clean stop)
- Token expiry handled without crash

## Risks & Mitigation

**Risk 1: LiveKit Python SDK frame format compatibility**
- *Risk*: VideoFrame format (RGBA/I420/NV12) may vary by codec and platform
- *Mitigation*: Use `frame.convert(VideoBufferType.RGBA)` to normalize, then convert to BGR

**Risk 2: Token expiration before refresh is captured**
- *Risk*: If the companion app misses a token refresh, the LiveKit connection drops
- *Mitigation*: Implement reconnection logic in livekit_source.py; companion app can re-request tokens

**Risk 3: Inference engine bottleneck with 10 streams**
- *Risk*: GPU/CPU inference cannot keep up with frame arrival rate
- *Mitigation*: Backpressure design (drop stale frames); configurable frame_period_recognition to reduce load

**Risk 4: Playwright browser stability**
- *Risk*: Long-running browser session may leak memory or crash
- *Mitigation*: Monitor browser process health; provide manual restart capability

**Risk 5: LiveKit room structure unknown**
- *Risk*: Rooms may be structured differently than expected (multi-track, SFU routing)
- *Mitigation*: Start with single-track subscription per room; adapt after initial testing
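
The Risk 1 mitigation (normalize to RGBA, then convert to BGR) can be sketched with NumPy alone. This is a minimal illustration, not the planned `livekit_source.py` code: the helper name `rgba_bytes_to_bgr` is hypothetical, and it assumes the converted frame exposes a tightly packed RGBA byte buffer plus its width and height.

```python
import numpy as np


def rgba_bytes_to_bgr(data: bytes, width: int, height: int) -> np.ndarray:
    """Convert a tightly packed RGBA buffer into an OpenCV-style BGR array.

    Hypothetical helper: assumes 4 bytes per pixel and no row padding.
    """
    expected = width * height * 4
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    # View the buffer as (rows, cols, channels) without copying.
    rgba = np.frombuffer(data, dtype=np.uint8).reshape(height, width, 4)
    # Channel indices 2, 1, 0 select B, G, R and drop alpha; copy so the
    # result is contiguous and independent of the source buffer.
    return np.ascontiguousarray(rgba[:, :, 2::-1])
```

In the detection service this would presumably be called right after `frame.convert(VideoBufferType.RGBA)`, passing the frame's buffer and dimensions; the exact attribute names should be checked against the installed `livekit` SDK version.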
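
For Risk 2, both apps can read a token's room name and expiry without the API secret, because a JWT payload is only base64url-encoded JSON. The sketch below uses the standard library and does not verify the signature, which is impossible here without the secret. It assumes the usual LiveKit token layout, with the room under the `video` grant and a standard `exp` claim; the function names are hypothetical.

```python
import base64
import json
import time


def decode_jwt_payload(token: str) -> dict:
    """Return the unverified payload claims of a JWT."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("token does not look like a JWT")
    payload = parts[1]
    # base64url payloads are usually unpadded; restore padding first.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def room_and_expiry(token: str):
    """Extract (room_name, exp) from a token; either may be None."""
    claims = decode_jwt_payload(token)
    # Assumption: the room name lives under the "video" grant.
    room = (claims.get("video") or {}).get("room")
    return room, claims.get("exp")


def seconds_until_expiry(token: str, now=None):
    """Seconds left before `exp`, or None if the token has no `exp`."""
    _, exp = room_and_expiry(token)
    if exp is None:
        return None
    return exp - (time.time() if now is None else now)
```

A source that sees `seconds_until_expiry` fall below some threshold without having received a refreshed token could report that through the status endpoint, so the companion app knows to re-request one.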
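
The backpressure rule behind Risk 3 (each stream keeps only its latest unprocessed frame) and the fairness requirement in AC-5 can be combined in one small structure. The class below is a hypothetical sketch of the shared queue, not the planned `livekit_detector.py` design: it holds one slot per stream and serves streams in the order they started waiting.

```python
import threading
from collections import OrderedDict


class LatestFrameBuffer:
    """One pending frame per stream; newer frames overwrite stale ones."""

    def __init__(self):
        self._frames = OrderedDict()  # stream_id -> latest frame
        self._lock = threading.Lock()
        self.dropped = 0  # count of stale frames that were overwritten

    def put(self, stream_id, frame):
        with self._lock:
            if stream_id in self._frames:
                self.dropped += 1
            # Assigning to an existing key keeps its position, so a stream
            # that has waited longest stays at the front of the order.
            self._frames[stream_id] = frame

    def take_batch(self, max_batch):
        """Pop up to max_batch (stream_id, frame) pairs, oldest waiter first."""
        batch = []
        with self._lock:
            while self._frames and len(batch) < max_batch:
                batch.append(self._frames.popitem(last=False))
        return batch
```

Stream tasks would call `put` for each sampled frame, and the detection worker would call `take_batch` with the engine batch size before passing the frames to `detect_frames`. Memory stays bounded at one frame per stream regardless of how far inference falls behind.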