Files
Oleksandr Bezdieniezhnykh be4cab4fcb [AZ-178] Implement streaming video detection endpoint
- Added `/detect/video` endpoint for true streaming video detection, allowing inference to start as upload bytes arrive.
- Introduced `run_detect_video_stream` method in the inference module to handle video processing from a file-like object.
- Updated media hashing to include a new function for computing hashes directly from files with minimal I/O.
- Enhanced documentation to reflect changes in video processing and API behavior.

Made-with: Cursor
2026-04-01 03:11:43 +03:00

53 lines
1.9 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Module: media_hash
## Purpose
Content-based hashing for media files using XxHash64 with a deterministic sampling algorithm. Produces a stable, unique ID for any media file based on its content.
## Public Interface
| Function | Signature | Description |
|----------|-----------|-------------|
| `compute_media_content_hash` | `(data: bytes, virtual: bool = False) -> str` | Returns hex XxHash64 digest of sampled content. If `virtual=True`, prefixes with "V". |
| `compute_media_content_hash_from_file` | `(path: str, virtual: bool = False) -> str` | Same algorithm but reads sampling regions directly from a file on disk — only 3 KB I/O regardless of file size. Produces identical hashes to the bytes-based version. (AZ-178) |
## Internal Logic
### Sampling Algorithm (`_sampling_payload`)
- **Small files** (< 3072 bytes): uses entire content
- **Large files** (≥ 3072 bytes): samples 3 × 1024-byte windows: first 1024, middle 1024, last 1024
- All payloads are prefixed with the 8-byte little-endian file size for collision resistance
The sampling avoids reading the full file through the hash function while still providing high uniqueness — the head, middle, and tail capture format headers, content, and EOF markers.
## Dependencies
- **External**: `xxhash` (pinned at 3.5.0 in requirements.txt)
- **Internal**: none (leaf module)
## Consumers
- `main` — computes content hash for uploaded media in `POST /detect` (bytes version) and `POST /detect/video` (file version) to use as the media record ID and storage filename
## Data Models
None.
## Configuration
None.
## External Integrations
None.
## Security
None. The hash is non-cryptographic (fast, not tamper-resistant).
## Tests
- `tests/test_media_hash.py` — covers small files, large files, and virtual prefix behavior
- `tests/test_az178_streaming_video.py::TestMediaContentHashFromFile` — verifies file-based hash matches bytes-based hash for small, large, boundary, and virtual cases