detections/_docs/02_document/modules/media_hash.md
Oleksandr Bezdieniezhnykh 1fe9425aa8 [AZ-172] Update documentation for distributed architecture, add Update Docs step to workflow
- Update module docs: main, inference, ai_config, loader_http_client
- Add new module doc: media_hash
- Update component docs: inference_pipeline, api
- Update system-flows (F2, F3) and data_parameters
- Add Task Mode to document skill for incremental doc updates
- Insert Step 11 (Update Docs) in existing-code flow, renumber 11-13 to 12-14

Made-with: Cursor
2026-03-31 17:25:58 +03:00

# Module: media_hash
## Purpose
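Content-based hashing for media files using XxHash64 with a deterministic sampling algorithm. Produces a stable ID derived from a file's size and sampled content, so the same bytes always yield the same ID. Uniqueness is practical rather than guaranteed: two files of the same size that differ only outside the sampled windows produce the same ID.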
## Public Interface
| Function | Signature | Description |
|----------|-----------|-------------|
| `compute_media_content_hash` | `(data: bytes, virtual: bool = False) -> str` | Returns hex XxHash64 digest of sampled content. If `virtual=True`, prefixes with "V". |
## Internal Logic
### Sampling Algorithm (`_sampling_payload`)
- **Small files** (< 3072 bytes): uses entire content
- **Large files** (≥ 3072 bytes): samples 3 × 1024-byte windows: first 1024, middle 1024, last 1024
- All payloads are prefixed with the 8-byte little-endian file size for collision resistance
Sampling caps the input to the hash function at 3080 bytes (the 8-byte size prefix plus at most 3072 bytes of content) regardless of file size, while still providing high uniqueness: the head, middle, and tail windows capture format headers, content, and EOF markers.
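
The rules above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `sampling_payload`, the constants, the concatenation order, and the exact offset of the middle window (centred in the file here) are assumptions, since this document only states the window sizes and the size prefix.

```python
import struct

WINDOW = 1024           # bytes per sampled window
THRESHOLD = 3 * WINDOW  # 3072: content shorter than this is used whole


def sampling_payload(data: bytes) -> bytes:
    """Return the bytes to hash: size prefix plus full or sampled content."""
    size = len(data)
    prefix = struct.pack("<Q", size)  # 8-byte little-endian file size
    if size < THRESHOLD:
        return prefix + data
    # Assumption: the middle window is centred in the file.
    mid = (size - WINDOW) // 2
    return prefix + data[:WINDOW] + data[mid:mid + WINDOW] + data[-WINDOW:]
```

Under these assumptions, the digest would presumably be obtained with something like `xxhash.xxh64(payload).hexdigest()`, with `"V"` prepended when `virtual=True`. The sketch also makes the trade-off visible: a change at a byte offset that falls outside all three windows leaves the payload, and therefore the hash, unchanged.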
## Dependencies
- **External**: `xxhash` (pinned at 3.5.0 in requirements.txt)
- **Internal**: none (leaf module)
## Consumers
- `main` — computes content hash for uploaded media in `POST /detect` to use as the media record ID and storage filename
## Data Models
None.
## Configuration
None.
## External Integrations
None.
## Security
None. The hash is non-cryptographic (fast, not tamper-resistant), so the resulting ID should not be relied on for integrity verification or authentication.
## Tests
- `tests/test_media_hash.py` — covers small files, large files, and virtual prefix behavior