Oleksandr Bezdieniezhnykh be4cab4fcb [AZ-178] Implement streaming video detection endpoint
- Added `/detect/video` endpoint for true streaming video detection, allowing inference to start as upload bytes arrive.
- Introduced `run_detect_video_stream` method in the inference module to handle video processing from a file-like object.
- Updated media hashing to include a new function for computing hashes directly from files with minimal I/O.
- Enhanced documentation to reflect changes in video processing and API behavior.

Made-with: Cursor
2026-04-01 03:11:43 +03:00

Module: media_hash

Purpose

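Content-based hashing for media files using XxHash64 with a deterministic sampling algorithm. Produces a stable ID for any media file from its size and sampled content. IDs are unique in practice, but two files of equal size that differ only outside the sampled regions produce the same hash.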

Public Interface

| Function | Signature | Description |
|---|---|---|
| compute_media_content_hash | (data: bytes, virtual: bool = False) -> str | Returns the hex XxHash64 digest of the sampled content. If virtual=True, the digest is prefixed with "V". |
| compute_media_content_hash_from_file | (path: str, virtual: bool = False) -> str | Same algorithm, but reads the sampling regions directly from a file on disk, so it reads at most 3 KB (3072 bytes) of content regardless of file size. Produces hashes identical to the bytes-based version. (AZ-178) |
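
The bounded-I/O behaviour of the file-based variant can be illustrated with seek-based reads. This is a minimal sketch, not the module's actual code: the name `sampling_payload_from_file` is hypothetical, and the middle window is assumed to be centred in the file, which the documentation does not specify.

```python
import os
import struct

WINDOW = 1024            # bytes per sampled window
THRESHOLD = 3 * WINDOW   # 3072: files below this size are read whole


def sampling_payload_from_file(path: str) -> bytes:
    """Hypothetical helper: build the hash input with at most 3072 bytes of reads."""
    size = os.path.getsize(path)
    prefix = struct.pack("<Q", size)  # 8-byte little-endian file size
    with open(path, "rb") as f:
        if size < THRESHOLD:
            # Small file: the entire content is used.
            return prefix + f.read()
        head = f.read(WINDOW)
        # Assumption: the middle window is centred in the file.
        f.seek((size - WINDOW) // 2)
        middle = f.read(WINDOW)
        f.seek(size - WINDOW)
        tail = f.read(WINDOW)
    return prefix + head + middle + tail
```

Because only three fixed-size windows are read, the cost stays constant for multi-gigabyte videos, which is what makes this variant suitable for the streaming endpoint.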

Internal Logic

Sampling Algorithm (_sampling_payload)

  • Small files (< 3072 bytes): uses entire content
  • Large files (≥ 3072 bytes): samples 3 × 1024-byte windows: first 1024, middle 1024, last 1024
  • All payloads are prefixed with the 8-byte little-endian file size for collision resistance

The sampling avoids reading the full file through the hash function while still providing high uniqueness — the head, middle, and tail capture format headers, content, and EOF markers.
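
The rules above can be sketched for the bytes-based case as follows. This is an illustration under stated assumptions rather than the real `_sampling_payload`: the function name `sampling_payload` is hypothetical, and the exact offset of the middle window (centred here) is an assumption.

```python
import struct

WINDOW = 1024            # bytes per sampled window
THRESHOLD = 3 * WINDOW   # 3072: payloads below this size are used whole


def sampling_payload(data: bytes) -> bytes:
    """Hypothetical sketch: build the byte string that is fed to the hash."""
    size = len(data)
    prefix = struct.pack("<Q", size)  # 8-byte little-endian file size
    if size < THRESHOLD:
        # Small file: hash the entire content.
        return prefix + data
    # Assumption: the middle window is centred in the data.
    mid = (size - WINDOW) // 2
    return prefix + data[:WINDOW] + data[mid:mid + WINDOW] + data[-WINDOW:]
```

The final ID would then presumably be the hex digest of this payload, for example `xxhash.xxh64(payload).hexdigest()` from the xxhash package, with "V" prepended when virtual=True. At exactly 3072 bytes the three windows tile the whole content, so the small-file and large-file branches agree at the boundary.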

Dependencies

  • External: xxhash (pinned at 3.5.0 in requirements.txt)
  • Internal: none (leaf module)

Consumers

  • main — computes content hash for uploaded media in POST /detect (bytes version) and POST /detect/video (file version) to use as the media record ID and storage filename

Data Models

None.

Configuration

None.

External Integrations

None.

Security

None. The hash is non-cryptographic (fast, not tamper-resistant), so it should not be relied on for integrity or authenticity checks.

Tests

  • tests/test_media_hash.py — covers small files, large files, and virtual prefix behavior
  • tests/test_az178_streaming_video.py::TestMediaContentHashFromFile — verifies file-based hash matches bytes-based hash for small, large, boundary, and virtual cases