mirror of https://github.com/azaion/detections.git synced 2026-04-22 22:36:32 +00:00

Oleksandr Bezdieniezhnykh 5be53739cd Refactor inference engine and task management: Remove obsolete inference engine and ONNX engine files, update inference processing to utilize batch handling, and enhance task management structure in documentation. Adjust paths for task specifications to align with new directory organization.

2026-03-28 01:04:28 +02:00

LiveKit Stream Detection

Task: AZ-150_livekit_stream_detection
Name: LiveKit Stream Detection Integration
Description: Enable real-time object detection on 5-10 simultaneous LiveKit WebRTC streams. Two-app architecture: a Playwright companion app for authentication and stream discovery, plus LiveKit SDK integration in the detection service for frame capture and inference.
Complexity: 5 points
Dependencies: None (extends existing detection service)
Component: Feature
Jira: AZ-150
Epic: AZ-149

Problem

The platform streams live video via LiveKit WebRTC. The detection service currently only processes pre-recorded video files and static images via cv2.VideoCapture. There is no way to run real-time object detection on live streams. The user needs to detect objects on 5-10 out of 50+ simultaneous streams shown on the platform's web page.

Key constraints:

No LiveKit API key/secret available (only browser-level access)
LiveKit WebRTC streams cannot be consumed by cv2.VideoCapture
Tokens are issued by the platform's backend and expire periodically
Must handle 5-10 concurrent streams without overwhelming the GPU inference engine

Outcome

User can open the platform's stream page in a Playwright-controlled browser, log in, and see all available streams
The system automatically discovers stream IDs, LiveKit room names, tokens, and server URL from network traffic
User selects which streams to run detection on via an injected UI overlay
Detection runs continuously on selected streams with results flowing through the existing SSE endpoint
Tokens are automatically refreshed as the page renews them

Architecture

Two-App Design

App 1: stream_discover.py (Playwright companion)
  - Launches real Chromium browser (separate process)
  - Python controls it through Playwright via Chrome DevTools Protocol (CDP) over a local connection
  - User interacts with browser normally (login, navigation)
  - Python silently intercepts network traffic and reads the DOM
  - Injects a floating selection UI onto the page
  - Sends selected stream configs to the detection service API

App 2: Detection Service (existing FastAPI in main.py)
  - New /detect/livekit/* endpoints receive stream configs from companion app
  - livekit_source.py connects to LiveKit rooms via livekit.rtc SDK
  - livekit_detector.py orchestrates multi-stream frame capture and batched inference
  - inference.pyx gains a new detect_frames() method for raw numpy frame batches
  - Results flow through existing SSE /detect/stream endpoint

How Playwright Works (NOT a Webview)

Playwright is a browser automation library by Microsoft. It does NOT embed a browser inside a Python window. Instead:

playwright.chromium.launch(headless=False) starts a real standalone Chromium process, much like opening Chrome yourself
Python drives this browser through Playwright, which speaks CDP (Chrome DevTools Protocol) to it over a local connection (a pipe by default; a WebSocket when attaching to a remote debugging port)
The user sees a normal browser window and interacts with it normally (login, clicking, scrolling)
Python silently observes all network traffic, reads the DOM, and can inject HTML/JavaScript
There is no Python GUI -- the browser window IS the entire interface

Python Process                    Chromium Process (separate)
+--------------------------+      +---------------------------+
| stream_discover.py       |      | Normal browser window     |
|                          |      |                           |
| - Playwright library     | CDP  | - User logs in normally   |
| - Token interceptor      |<====>| - DevTools Protocol       |
| - DOM parser             |      | - Full web app rendering  |
| - Selection UI injector  |      | - LiveKit video playback  |
+--------------------------+      +---------------------------+

Advantages over a webview:

No GUI code to write -- browser IS the UI
User sees the exact same web app they normally use
Full access to network requests, cookies, localStorage
Playwright handles CDP complexity
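
The interception described above can be sketched with Playwright's sync API. This is an illustrative outline, not the actual stream_discover.py: the function names, the streams_url parameter, and the JWT-matching heuristic are assumptions, and whether reading response bodies inside the event handler is reliable on the real platform would need to be checked.

```python
import re

# Three dot-separated base64url segments; the first two start with "eyJ"
# because they encode JSON objects. A heuristic, not a validator.
JWT_RE = re.compile(r"^eyJ[\w-]+\.eyJ[\w-]+\.[\w-]*$")


def find_jwts(payload):
    """Recursively collect JWT-looking strings from a decoded JSON body."""
    if isinstance(payload, str):
        return [payload] if JWT_RE.match(payload) else []
    if isinstance(payload, dict):
        payload = list(payload.values())
    if isinstance(payload, list):
        return [tok for item in payload for tok in find_jwts(item)]
    return []


def run(streams_url):
    """Open a visible browser, then record tokens, WebSocket URLs and stream IDs."""
    # Imported lazily so find_jwts stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    tokens, ws_urls = [], []

    def on_response(response):
        if "application/json" not in response.headers.get("content-type", ""):
            return
        try:
            tokens.extend(find_jwts(response.json()))
        except Exception:
            pass  # body unavailable or not valid JSON; skip it

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.on("response", on_response)
        page.on("websocket", lambda ws: ws_urls.append(ws.url))
        page.goto(streams_url)

        # Wait (without a timeout) until the user has logged in and the
        # stream tiles exist, then read their data-testid attributes.
        selector = '[data-testid^="mission-video-"]'
        page.wait_for_selector(selector, state="attached", timeout=0)
        stream_ids = page.locator(selector).evaluate_all(
            "els => els.map(e => e.getAttribute('data-testid'))"
        )

        # Keep observing traffic until the user closes the tab.
        page.wait_for_event("close", timeout=0)
        browser.close()
    return tokens, ws_urls, stream_ids
```

Calling run() with the platform's stream page URL blocks until the tab is closed, so a real companion app would forward tokens as they arrive rather than return them at the end.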

Data Flow

1. User logs in via browser
2. User navigates to streams page
3. Python intercepts HTTP responses containing LiveKit JWT tokens
4. Python parses DOM for data-testid="mission-video-*" elements
5. Python decodes JWTs to extract room names
6. Python injects floating panel with stream checkboxes onto the page
7. User selects streams, clicks "Start Detection"
8. Python POSTs {livekit_url, rooms[{name, token, stream_id}]} to detection service
9. Detection service connects to LiveKit rooms via livekit.rtc
10. Frames are sampled, batched, and run through inference engine
11. DetectionEvents emitted via existing SSE /detect/stream
12. Python companion stays open, intercepts token refreshes, pushes to detection service

Multi-Stream Frame Processing

Stream 1 (async task) ─── sample every Nth frame ──┐
Stream 2 (async task) ─── sample every Nth frame ──├─► Shared Frame Queue ─► Detection Worker Thread
Stream N (async task) ─── sample every Nth frame ──┘         │                    │
                                                              │                    ▼
                                                     backpressure:          inference.detect_frames()
                                                     keep only latest           │
                                                     frame per stream           ▼
                                                                          DetectionEvent → SSE

At 30fps input with frame_period_recognition=4: ~7.5 fps per stream
10 streams = ~75 frames/sec into the queue
Engine batch size determines how many frames are processed at once
Backpressure: each stream keeps only its latest unprocessed frame; stale frames dropped
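
One way to get the "latest frame per stream" behaviour is a small keyed buffer instead of a plain FIFO queue. The class below is a sketch under that assumption (the name LatestFrameBuffer and its methods are hypothetical, not part of the existing service). A stream that already has a pending frame keeps its place in line when a newer frame overwrites the old one, so batches are drawn from the longest-waiting streams first.

```python
import threading
from collections import OrderedDict


class LatestFrameBuffer:
    """Holds at most one pending frame per stream; newer frames replace older ones."""

    def __init__(self):
        self._frames = OrderedDict()  # stream_id -> frame, longest-waiting first
        self._cond = threading.Condition()
        self.dropped = 0  # count of stale frames that were overwritten

    def put(self, stream_id, frame):
        """Called by the per-stream capture tasks."""
        with self._cond:
            if stream_id in self._frames:
                self.dropped += 1  # overwrite in place; queue position is kept
            self._frames[stream_id] = frame
            self._cond.notify()

    def take_batch(self, max_batch, timeout=None):
        """Called by the detection worker: pop up to max_batch (stream_id, frame) pairs."""
        with self._cond:
            if not self._frames:
                self._cond.wait(timeout)
            batch = []
            while self._frames and len(batch) < max_batch:
                batch.append(self._frames.popitem(last=False))
            return batch
```

With this shape the buffer never holds more than one frame per active stream, so its memory use is bounded by the stream count regardless of how far inference falls behind the ~75 frames/sec arrival rate.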

Scope

Included

Companion App (stream_discover.py)

Playwright browser launch and lifecycle management
Network response interception for LiveKit JWT token capture
WebSocket URL interception for LiveKit server URL discovery
DOM parsing for stream ID and display name extraction
JWT decoding to map stream_id -> room_name
Injected floating UI panel with stream checkboxes and "Start Detection" button
HTTP POST to detection service with selected stream configs
Token refresh monitoring and forwarding

Detection Service

livekit_source.py: LiveKit room connection, video track subscription, VideoFrame -> BGR numpy conversion
livekit_detector.py: multi-stream task orchestration, frame sampling, shared queue, batched detection loop, SSE event emission
inference.pyx/.pxd: new detect_frames(frames, config) cpdef method for raw numpy frame batches
main.py: new endpoints POST /detect/livekit/start, POST /detect/livekit/refresh-tokens, DELETE /detect/livekit/stop, GET /detect/livekit/status
requirements.txt: add livekit and playwright dependencies
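
A rough sketch of what livekit_source.py might contain, using the livekit.rtc Room, VideoStream and VideoFrame.convert calls. The function names, the on_frame callback and the frame_period parameter are assumptions for illustration, and the conversion assumes the RGBA buffer is tightly packed (width * 4 bytes per row).

```python
import asyncio

import numpy as np


def rgba_to_bgr(data, width, height):
    """Convert a tightly packed RGBA byte buffer into an HxWx3 BGR array."""
    rgba = np.frombuffer(data, dtype=np.uint8).reshape(height, width, 4)
    # Channels 2, 1, 0 are B, G, R; alpha (channel 3) is dropped.
    return np.ascontiguousarray(rgba[:, :, 2::-1])


async def consume_track(track, on_frame, frame_period):
    """Read one video track and pass every frame_period-th frame to on_frame."""
    # Imported lazily so rgba_to_bgr stays usable without the LiveKit SDK.
    from livekit import rtc

    stream = rtc.VideoStream(track)
    index = 0
    async for event in stream:
        index += 1
        if index % frame_period:
            continue  # sample every Nth frame only
        frame = event.frame.convert(rtc.VideoBufferType.RGBA)
        on_frame(rgba_to_bgr(frame.data, frame.width, frame.height))
    await stream.aclose()


async def connect_and_read(url, token, on_frame, frame_period=4):
    """Join a room and start a reader task for each subscribed video track."""
    from livekit import rtc

    room = rtc.Room()
    tasks = set()

    @room.on("track_subscribed")
    def _on_track(track, publication, participant):
        if track.kind == rtc.TrackKind.KIND_VIDEO:
            task = asyncio.create_task(consume_track(track, on_frame, frame_period))
            tasks.add(task)  # keep a reference so the task is not collected
            task.add_done_callback(tasks.discard)

    await room.connect(url, token)
    return room, tasks
```

The returned room can later be closed with await room.disconnect() when detection is stopped.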

Excluded

LiveKit API key/secret based token generation (no access)
Publishing video back to LiveKit
Recording or saving stream frames to disk
Modifying existing /detect or /detect/{media_id} endpoints
UI beyond the injected browser overlay

Acceptance Criteria

AC-1: Stream Discovery
Given the user opens the platform's stream page in the Playwright browser
When the page loads and streams are rendered
Then the companion app discovers all stream IDs, display names, LiveKit tokens, room names, and server URL from network traffic and DOM

AC-2: Stream Selection UI
Given streams are discovered
When the companion app injects the selection panel
Then the user sees a floating panel listing all streams with checkboxes and a "Start Detection" button

AC-3: Start Detection
Given the user selects N streams and clicks "Start Detection"
When the companion app sends the config to the detection service
Then the detection service connects to N LiveKit rooms and begins receiving video frames

AC-4: Real-Time Inference
Given the detection service is receiving frames from LiveKit streams
When frames are sampled and batched through the inference engine
Then DetectionEvents with annotations are emitted via the existing SSE /detect/stream endpoint

AC-5: Multi-Stream Handling
Given 5-10 streams are active simultaneously
When inference runs continuously
Then all streams are processed fairly (round-robin or queue-based) without any stream being starved

AC-6: Token Refresh
Given the platform's frontend refreshes LiveKit tokens periodically
When the companion app detects a token renewal in network traffic
Then the new token is forwarded to the detection service and the LiveKit connection continues without interruption

AC-7: Stop Detection
Given detection is running on N streams
When the user calls DELETE /detect/livekit/stop
Then all LiveKit connections are cleanly closed and detection tasks cancelled

File Changes

File	Action	Description
`stream_discover.py`	New	Playwright companion app
`livekit_source.py`	New	LiveKit room connection and frame capture
`livekit_detector.py`	New	Multi-stream detection orchestration
`inference.pyx`	Modified	Add `detect_frames` cpdef method
`inference.pxd`	Modified	Declare `detect_frames` method
`main.py`	Modified	Add /detect/livekit/* endpoints
`requirements.txt`	Modified	Add `livekit`, `playwright`

Non-Functional Requirements

Performance

Frame-to-detection latency < 500ms per batch (excluding network latency)
10 concurrent streams without OOM or queue overflow

Reliability

Graceful handling of LiveKit disconnections (auto-reconnect or clean stop)
Token expiry handled without crash
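
The "auto-reconnect or clean stop" requirement could be met with a bounded retry loop like the sketch below. The helper name and its parameters are hypothetical, and the built-in ConnectionError stands in for whatever exception the LiveKit SDK actually raises on a dropped connection.

```python
import asyncio


async def run_with_reconnect(connect_once, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Await connect_once() until it returns normally or attempts run out.

    Delays grow exponentially (base_delay, 2x, 4x, ...) and are capped at max_delay.
    """
    for attempt in range(max_attempts):
        try:
            return await connect_once()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: the caller performs the clean stop
            await asyncio.sleep(min(max_delay, base_delay * 2 ** attempt))
```

When the final attempt fails, the exception propagates so the orchestrator can mark that stream as stopped instead of retrying forever.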

Risks & Mitigation

Risk 1: LiveKit Python SDK frame format compatibility

Risk: VideoFrame format (RGBA/I420/NV12) may vary by codec and platform
Mitigation: Use frame.convert(VideoBufferType.RGBA) to normalize, then convert to BGR

Risk 2: Token expiration before refresh is captured

Risk: If the companion app misses a token refresh, the LiveKit connection drops
Mitigation: Implement reconnection logic in livekit_source.py; companion app can re-request tokens
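
To notice a missed refresh before the connection drops, either app could read the token's standard "exp" claim (Unix seconds) and compare it against the clock. A minimal sketch, with hypothetical helper names and an arbitrary 60-second safety margin; the signature is not verified because only the expiry time is read.

```python
import base64
import json
import time


def seconds_until_expiry(token, now=None):
    """Return how many seconds remain before the JWT's "exp" time."""
    segment = token.split(".")[1]
    segment += "=" * (-len(segment) % 4)  # restore stripped base64url padding
    exp = json.loads(base64.urlsafe_b64decode(segment))["exp"]
    return exp - (time.time() if now is None else now)


def needs_refresh(token, margin_s=60, now=None):
    """True when the token expires within margin_s seconds (or already has)."""
    return seconds_until_expiry(token, now) <= margin_s
```

A periodic check with needs_refresh() lets the detection service report which rooms are about to lose access, so the companion app can re-request tokens for them.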

Risk 3: Inference engine bottleneck with 10 streams

Risk: GPU/CPU inference cannot keep up with frame arrival rate
Mitigation: Backpressure design (drop stale frames); configurable frame_period_recognition to reduce load

Risk 4: Playwright browser stability

Risk: Long-running browser session may leak memory or crash
Mitigation: Monitor browser process health; provide manual restart capability

Risk 5: LiveKit room structure unknown

Risk: Rooms may be structured differently than expected (multi-track, SFU routing)
Mitigation: Start with single-track subscription per room; adapt after initial testing
