# Module: media_hash

## Purpose

Content-based hashing for media files using XxHash64 with a deterministic sampling algorithm. Produces a stable, effectively unique ID for any media file, derived from its size and sampled content.

## Public Interface

| Function | Signature | Description |
|----------|-----------|-------------|
| `compute_media_content_hash` | `(data: bytes, virtual: bool = False) -> str` | Returns the hex XxHash64 digest of the sampled content. If `virtual=True`, the digest is prefixed with "V". |
| `compute_media_content_hash_from_file` | `(path: str, virtual: bool = False) -> str` | Same algorithm, but reads the sampling regions directly from a file on disk: at most 3 KB of I/O regardless of file size. Produces hashes identical to the bytes-based version. (AZ-178) |

## Internal Logic

### Sampling Algorithm (`_sampling_payload`)

- **Small files** (< 3072 bytes): uses the entire content
- **Large files** (≥ 3072 bytes): samples 3 × 1024-byte windows: first 1024, middle 1024, last 1024
- All payloads are prefixed with the 8-byte little-endian file size, so files of different sizes never share a payload

Sampling avoids passing the full file through the hash function while still providing high uniqueness: the head, middle, and tail windows capture format headers, content, and EOF markers.

## Dependencies

- **External**: `xxhash` (pinned at 3.5.0 in requirements.txt)
- **Internal**: none (leaf module)

## Consumers

- `main`: computes the content hash for uploaded media in `POST /detect` (bytes version) and `POST /detect/video` (file version), and uses it as the media record ID and storage filename

## Data Models

None.

## Configuration

None.

## External Integrations

None.

## Security

None. The hash is non-cryptographic (fast, not tamper-resistant).

## Tests

- `tests/test_media_hash.py`: covers small files, large files, and virtual prefix behavior
- `tests/test_az178_streaming_video.py::TestMediaContentHashFromFile`: verifies that the file-based hash matches the bytes-based hash for small, large, boundary, and virtual cases
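
## Appendix: Sampling Sketch

The sampling algorithm described under Internal Logic can be illustrated with a minimal sketch. This is not the module's actual source: the function names `sampling_payload` and `sampling_payload_from_file`, the constants, and in particular the exact offset of the middle window (centred here) are assumptions made for illustration. The sketch builds only the payload, using the standard library, and shows why the bytes-based and file-based paths can agree.

```python
import os
import struct

WINDOW = 1024            # bytes per sampled window
THRESHOLD = 3 * WINDOW   # 3072: below this, the whole content is used


def sampling_payload(data: bytes) -> bytes:
    """Build the payload to hash from in-memory content (hypothetical helper)."""
    size = len(data)
    prefix = struct.pack("<Q", size)  # 8-byte little-endian file size
    if size < THRESHOLD:
        return prefix + data
    # Assumption: the middle window is centred in the file.
    mid = (size - WINDOW) // 2
    return prefix + data[:WINDOW] + data[mid:mid + WINDOW] + data[-WINDOW:]


def sampling_payload_from_file(path: str) -> bytes:
    """Build the same payload by seeking, reading at most 3 KB of a large file."""
    size = os.path.getsize(path)
    prefix = struct.pack("<Q", size)
    with open(path, "rb") as f:
        if size < THRESHOLD:
            return prefix + f.read()
        head = f.read(WINDOW)
        f.seek((size - WINDOW) // 2)  # same assumed middle offset as above
        middle = f.read(WINDOW)
        f.seek(size - WINDOW)
        tail = f.read(WINDOW)
    return prefix + head + middle + tail
```

Under these assumptions, the final ID would be obtained by hashing the payload, for example with `xxhash.xxh64(payload).hexdigest()`, and prepending "V" when `virtual=True`. Because both helpers produce byte-identical payloads, the resulting digests match regardless of which path is used.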