mirror of
https://github.com/azaion/detections.git
synced 2026-04-22 21:56:33 +00:00
86d8e7e22d
Made-with: Cursor
175 lines
9.6 KiB
Markdown
# Azaion.Detections — Architecture

## 1. System Context

**Problem being solved**: Automated object detection on aerial imagery and video — identifying military and infrastructure objects (vehicles, artillery, trenches, personnel, etc.) from drone/satellite feeds and returning structured detection results with bounding boxes, class labels, and confidence scores.

**System boundaries**:

- **Inside**: FastAPI HTTP service, Cython-based inference pipeline, ONNX/TensorRT inference engines, image tiling, video frame processing, detection postprocessing
- **Outside**: Loader service (model storage), Annotations service (result persistence + auth), client applications

**External systems**:

| System | Integration Type | Direction | Purpose |
|--------|------------------|-----------|---------|
| Loader Service | REST (HTTP) | Both | Download AI models, upload converted TensorRT engines |
| Annotations Service | REST (HTTP) | Outbound | Post detection results, refresh auth tokens |
| Client Applications | REST + SSE | Inbound | Submit detection requests, receive streaming results |

## 2. Technology Stack

| Layer | Technology | Version | Rationale |
|-------|------------|---------|-----------|
| Language | Python 3 + Cython | 3.1.3 (Cython) | Python for API, Cython for performance-critical inference loops |
| Framework | FastAPI + Uvicorn | latest | Async HTTP + SSE support |
| ML Runtime (CPU) | ONNX Runtime | 1.22.0 | Portable model format, CPU/CUDA provider fallback |
| ML Runtime (GPU) | TensorRT + PyCUDA | 10.11.0 / 2025.1.1 | Maximum GPU inference performance |
| Image Processing | OpenCV | 4.10.0 | Frame decoding, preprocessing, tiling |
| Serialization | msgpack | 1.1.1 | Compact binary serialization for annotations and configs |
| HTTP Client | requests | 2.32.4 | Synchronous HTTP to Loader and Annotations services |
| Logging | loguru | 0.7.3 | Structured file + console logging |
| GPU Monitoring | pynvml | 12.0.0 | GPU detection, capability checks, memory queries |
| Numeric | NumPy | 2.3.0 | Tensor manipulation |

## 3. Deployment Model

**Infrastructure**: Containerized microservice, deployed alongside Loader and Annotations services (likely Docker Compose or Kubernetes given service discovery by hostname).

**Environment-specific configuration**:

| Config | Development | Production |
|--------|-------------|------------|
| LOADER_URL | `http://loader:8080` (default) | Environment variable |
| ANNOTATIONS_URL | `http://annotations:8080` (default) | Environment variable |
| GPU | Optional (falls back to ONNX CPU) | Required (TensorRT) |
| Logging | Console + file | File (`Logs/log_inference_YYYYMMDD.txt`, 30-day retention) |

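The development defaults above can be sketched as plain environment-variable lookups (the variable names and default URLs are from the table; the lookup code itself is an illustrative assumption, not the service's actual config module):

```python
import os

# Development defaults match the config table; production deployments
# override them via environment variables (Compose/Kubernetes).
LOADER_URL = os.environ.get("LOADER_URL", "http://loader:8080")
ANNOTATIONS_URL = os.environ.get("ANNOTATIONS_URL", "http://annotations:8080")
```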
## 4. Data Model Overview

**Core entities**:

| Entity | Description | Owned By Component |
|--------|-------------|--------------------|
| AnnotationClass | Detection class metadata (name, color, max physical size) | 01 Domain |
| Detection | Single bounding box with class + confidence | 01 Domain |
| Annotation | Collection of detections for one frame/tile + image | 01 Domain |
| AIRecognitionConfig | Runtime inference parameters | 01 Domain |
| AIAvailabilityStatus | Engine lifecycle state | 01 Domain |
| DetectionDto | API-facing detection response | 04 API |
| DetectionEvent | SSE event payload | 04 API |

**Key relationships**:

- Annotation → Detection: one-to-many (detections within a frame/tile)
- Detection → AnnotationClass: many-to-one (via class ID lookup in `annotations_dict`)
- Annotation → Media: many-to-one (multiple annotations per video/image)

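As a sketch, the entities and relationships above map onto plain Python dataclasses (field names follow the Annotations payload contract in §5; the real implementation uses Cython `cdef` classes, so this is illustrative only):

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationClass:
    # Detection class metadata: name, color, max physical size
    class_num: int
    name: str
    color: str
    max_size_m: float

@dataclass
class Detection:
    # One bounding box; class_num resolves to an AnnotationClass
    # via a class-ID lookup dict (many-to-one)
    center_x: float
    center_y: float
    width: float
    height: float
    class_num: int
    label: str
    confidence: float

@dataclass
class Annotation:
    # One frame/tile holds many detections (one-to-many)
    media_id: str
    video_time: float
    detections: list = field(default_factory=list)
```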
**Data flow summary**:

- Media bytes → Preprocessing → Engine → Raw output → Postprocessing → Detection/Annotation → DTO → HTTP/SSE response
- ONNX model bytes → Loader → Engine init (or TensorRT conversion → upload back to Loader)

## 5. Integration Points

### Internal Communication

| From | To | Protocol | Pattern | Notes |
|------|----|----------|---------|-------|
| API | Inference Pipeline | Direct Python call | Sync (via ThreadPoolExecutor) | Lazy initialization |
| Inference Pipeline | Inference Engines | Direct Cython call | Sync | Strategy pattern selection |
| Inference Pipeline | Loader | HTTP POST | Request-Response | Model download/upload |

### External Integrations

| External System | Protocol | Auth | Rate Limits | Failure Mode |
|-----------------|----------|------|-------------|--------------|
| Loader Service | HTTP POST | None | None observed | Exception → LoadResult(err) |
| Annotations Service | HTTP POST | Bearer JWT | None observed | Exception silently caught |
| Annotations Auth | HTTP POST | Refresh token | None observed | Exception silently caught |

#### Annotations Service Contract

Detections → Annotations is the primary outbound integration. During async media detection (`POST /detect/{media_id}`), each detection batch is posted to the Annotations service for persistence and downstream sync.

**Endpoint:** `POST {ANNOTATIONS_URL}/annotations`

**Trigger:** Each valid annotation batch during F3 (async media detection), only when the original client request included an Authorization header.

**Payload sent by Detections:** `mediaId`, `source` (AI=0), `videoTime`, list of Detection objects (`centerX`, `centerY`, `width`, `height`, `classNum`, `label`, `confidence`), and optional base64 `image`. `userId` is not included — resolved from the JWT by Annotations. The Annotations API contract also accepts `description`, `affiliation`, and `combatReadiness` on each Detection, but Detections does not populate these.

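A sketch of how such a payload could be assembled (field names are from the contract above; the helper function and dict layout are assumptions, not the service's verbatim code):

```python
def build_annotation_payload(media_id, video_time, detections, image_b64=None):
    # `userId` is deliberately absent: the Annotations service resolves it
    # from the bearer JWT. `description`, `affiliation`, `combatReadiness`
    # are accepted by the contract but never populated by Detections.
    payload = {
        "mediaId": media_id,
        "source": 0,  # AI = 0
        "videoTime": video_time,
        "detections": [
            {
                "centerX": d["centerX"],
                "centerY": d["centerY"],
                "width": d["width"],
                "height": d["height"],
                "classNum": d["classNum"],
                "label": d["label"],
                "confidence": d["confidence"],
            }
            for d in detections
        ],
    }
    if image_b64 is not None:
        payload["image"] = image_b64  # optional base64 frame/tile image
    return payload
```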
**Responses:** 201 Created, 400 Bad Request (missing image/mediaId), 404 Not Found (unknown mediaId).

**Auth:** Bearer JWT forwarded from the client. For long-running video, auto-refreshed via `POST {ANNOTATIONS_URL}/auth/refresh` (TokenManager, 60s pre-expiry window).

**Downstream effect (Annotations side):**

1. Annotation persisted to local PostgreSQL (image hashed to XxHash64 ID)
2. SSE event published to UI subscribers
3. Annotation ID enqueued to `annotations_queue_records` → FailsafeProducer → RabbitMQ Stream (`azaion-annotations`) for central DB sync and AI training

**Failure isolation:** All POST failures are silently caught. Detection processing and SSE streaming continue regardless of Annotations service availability.

See `_docs/02_document/modules/main.md` § "Annotations Service Integration" for field-level schema detail.

## 6. Non-Functional Requirements

| Requirement | Target | Measurement | Priority |
|-------------|--------|-------------|----------|
| Concurrent inference | 2 parallel jobs max | ThreadPoolExecutor workers | High |
| SSE queue depth | 100 events per client | asyncio.Queue maxsize | Medium |
| Log retention | 30 days | loguru rotation config | Medium |
| GPU compatibility | Compute capability ≥ 6.1 | pynvml check at startup | High |
| Model format | ONNX (portable) + TensorRT (GPU-specific) | Engine filename includes CC+SM | High |

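The first two targets can be expressed directly with stdlib constructs (the limits come from the table; the surrounding names are illustrative):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# At most two inference jobs run in parallel; further requests queue.
inference_executor = ThreadPoolExecutor(max_workers=2)

def sse_queue():
    # Each SSE client gets a bounded queue: at 100 pending events the
    # producer must drop or wait rather than grow without limit.
    return asyncio.Queue(maxsize=100)
```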
## 7. Security Architecture

**Authentication**: Pass-through Bearer JWT from client → forwarded to Annotations service. JWT exp decoded locally (base64, no signature verification) for token refresh timing.

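That local exp decode can be sketched with the stdlib alone (no signature verification, exactly as noted; the function names and the 60-second window constant mirror the TokenManager description in §5 but are assumptions):

```python
import base64
import json
import time

REFRESH_WINDOW_S = 60  # refresh this long before expiry

def jwt_exp(token):
    # JWT = header.payload.signature; only the payload is inspected and
    # the signature is NOT verified (verification is delegated to the
    # Annotations service).
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(claims["exp"])

def needs_refresh(token, now=None):
    # True once we are inside the pre-expiry window.
    now = time.time() if now is None else now
    return jwt_exp(token) - now <= REFRESH_WINDOW_S
```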
**Authorization**: None at the detection service level. Auth is delegated to the Annotations service.

**Authorization**: None at the detection service level. Auth is delegated to the Annotations service.

**Data protection**:

- At rest: not applicable (no local persistence of detection results)
- In transit: no TLS configured at application level (expected to be handled by infrastructure/reverse proxy)
- Secrets management: tokens received per-request, no stored credentials

**Audit logging**: Inference activity logged to daily rotated files. No auth audit logging.

## 8. Key Architectural Decisions

### ADR-001: Cython for Inference Pipeline

**Context**: Detection postprocessing involves tight loops over bounding box coordinates with floating-point math.

**Decision**: Implement the inference pipeline, data models, and engines as Cython `cdef` classes with typed variables.

**Alternatives considered**:

1. Pure Python — rejected due to loop-heavy postprocessing performance
2. C/C++ extension — rejected for development velocity; Cython offers near-C speed with Python-like syntax

**Consequences**: Build step required (setup.py + Cython compilation). IDE support and debugging are more complex.

### ADR-002: Dual Engine Strategy (TensorRT + ONNX Fallback)

**Context**: Need maximum GPU inference speed where available, but must also run on CPU-only machines.

**Decision**: Check GPU at module load time. If compatible NVIDIA GPU found, use TensorRT; otherwise fall back to ONNX Runtime. Background-convert ONNX→TensorRT and cache the engine.

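The decision can be sketched as a load-time check (the pynvml calls are the library's real API; the return labels, use of device index 0, and the overall wiring are assumptions, while the ≥ 6.1 threshold comes from §6):

```python
MIN_COMPUTE_CAPABILITY = (6, 1)  # from the NFR table in §6

def select_engine():
    # Performed once at module load: prefer TensorRT when a compatible
    # NVIDIA GPU is present, otherwise fall back to ONNX Runtime on CPU.
    try:
        import pynvml
        pynvml.nvmlInit()
        if pynvml.nvmlDeviceGetCount() == 0:
            return "onnx-cpu"
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        if (major, minor) >= MIN_COMPUTE_CAPABILITY:
            return "tensorrt"
        return "onnx-cpu"
    except Exception:
        # No pynvml, no driver, or NVML error: the CPU path keeps
        # CPU-only development and testing machines working.
        return "onnx-cpu"
```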
**Alternatives considered**:
|
||
1. TensorRT only — rejected; would break CPU-only development/testing
|
||
2. ONNX only — rejected; significantly slower on GPU vs TensorRT
|
||
|
||
**Consequences**: Two code paths to maintain. GPU-specific engine files cached per architecture.
|
||
|
||
### ADR-003: Lazy Inference Initialization

**Context**: Engine initialization is slow (model download, possible conversion). API should start accepting health checks immediately.

**Decision**: `Inference` is created on first actual detection request, not at app startup. Health endpoint works without engine.

**Consequences**: First detection request has higher latency. `AIAvailabilityStatus` reports state transitions during initialization.

### ADR-004: Large Image Tiling with GSD-Based Sizing

**Context**: Aerial images can be much larger than the model's fixed input size (1280×1280). A simple resize would lose small-object detail.

**Decision**: Split large images into tiles sized by ground sampling distance (`METERS_IN_TILE / GSD` pixels) with configurable overlap. Deduplicate detections across tile boundaries.

**Consequences**: More complex pipeline. Tile deduplication relies on a coordinate-proximity threshold.
