Azaion.Detections — Architecture
1. System Context
Problem being solved: Automated object detection on aerial imagery and video — identifying military and infrastructure objects (vehicles, artillery, trenches, personnel, etc.) from drone/satellite feeds and returning structured detection results with bounding boxes, class labels, and confidence scores.
System boundaries:
- Inside: FastAPI HTTP service, Cython-based inference pipeline, ONNX/TensorRT inference engines, image tiling, video frame processing, detection postprocessing
- Outside: Loader service (model storage), Annotations service (result persistence + auth), client applications
External systems:
| System | Integration Type | Direction | Purpose |
|---|---|---|---|
| Loader Service | REST (HTTP) | Both | Download AI models, upload converted TensorRT engines |
| Annotations Service | REST (HTTP) | Outbound | Post detection results, refresh auth tokens |
| Client Applications | REST + SSE | Inbound | Submit detection requests, receive streaming results |
2. Technology Stack
| Layer | Technology | Version | Rationale |
|---|---|---|---|
| Language | Python 3 + Cython | 3.1.3 (Cython) | Python for API, Cython for performance-critical inference loops |
| Framework | FastAPI + Uvicorn | latest | Async HTTP + SSE support |
| ML Runtime (CPU) | ONNX Runtime | 1.22.0 | Portable model format, CPU/CUDA provider fallback |
| ML Runtime (GPU) | TensorRT + PyCUDA | 10.11.0 / 2025.1.1 | Maximum GPU inference performance |
| Image Processing | OpenCV | 4.10.0 | Frame decoding, preprocessing, tiling |
| HTTP Client | requests | 2.32.4 | Synchronous HTTP to Loader and Annotations services |
| Logging | loguru | 0.7.3 | Structured file + console logging |
| GPU Monitoring | pynvml | 12.0.0 | GPU detection, capability checks, memory queries |
| Numeric | NumPy | 2.3.0 | Tensor manipulation |
3. Deployment Model
Infrastructure: Containerized microservice, deployed alongside Loader and Annotations services (likely Docker Compose or Kubernetes given service discovery by hostname).
Environment-specific configuration:
| Config | Development | Production |
|---|---|---|
| LOADER_URL | http://loader:8080 (default) | Environment variable |
| ANNOTATIONS_URL | http://annotations:8080 (default) | Environment variable |
| GPU | Optional (falls back to ONNX CPU) | Required (TensorRT) |
| Logging | Console + file | File (Logs/log_inference_YYYYMMDD.txt, 30-day retention) |
4. Data Model Overview
Core entities:
| Entity | Description | Owned By Component |
|---|---|---|
| AnnotationClass | Detection class metadata (name, color, max physical size) | 01 Domain |
| Detection | Single bounding box with class + confidence | 01 Domain |
| Annotation | Collection of detections for one frame/tile + image | 01 Domain |
| AIRecognitionConfig | Runtime inference parameters | 01 Domain |
| AIAvailabilityStatus | Engine lifecycle state | 01 Domain |
| DetectionDto | API-facing detection response | 04 API |
| DetectionEvent | SSE event payload | 04 API |
Key relationships:
- Annotation → Detection: one-to-many (detections within a frame/tile)
- Detection → AnnotationClass: many-to-one (via class ID lookup in annotations_dict)
- Annotation → Media: many-to-one (multiple annotations per video/image)
Data flow summary:
- Media bytes → Preprocessing → Engine → Raw output → Postprocessing → Detection/Annotation → DTO → HTTP/SSE response
- ONNX model bytes → Loader → Engine init (or TensorRT conversion → upload back to Loader)
5. Integration Points
Internal Communication
| From | To | Protocol | Pattern | Notes |
|---|---|---|---|---|
| API | Inference Pipeline | Direct Python call | Sync (via ThreadPoolExecutor) | Lazy initialization |
| Inference Pipeline | Inference Engines | Direct Cython call | Sync | Strategy pattern selection |
| Inference Pipeline | Loader | HTTP POST | Request-Response | Model download/upload |
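The "Sync (via ThreadPoolExecutor)" hand-off in the first row can be sketched as follows; `run_inference` is a hypothetical stand-in for the blocking Cython pipeline call, and the 2-worker bound matches the concurrency limit in § 6:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=2)  # 2 parallel inference jobs max

def run_inference(media_bytes: bytes) -> list[str]:
    # Stand-in for the synchronous Cython inference pipeline.
    return [f"detection for {len(media_bytes)} bytes"]

async def detect(media_bytes: bytes) -> list[str]:
    # Async handler dispatches the blocking call onto the bounded pool,
    # keeping the event loop free for SSE streaming and health checks.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, run_inference, media_bytes)
```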
External Integrations
| External System | Protocol | Auth | Rate Limits | Failure Mode |
|---|---|---|---|---|
| Loader Service | HTTP POST | None | None observed | Exception → LoadResult(err) |
| Annotations Service | HTTP POST | Bearer JWT | None observed | Exception silently caught |
| Annotations Auth | HTTP POST | Refresh token | None observed | Exception silently caught |
Annotations Service Contract
Detections → Annotations is the primary outbound integration. During async media detection (POST /detect/{media_id}), each detection batch is posted to the Annotations service for persistence and downstream sync.
Endpoint: POST {ANNOTATIONS_URL}/annotations
Trigger: Each valid annotation batch during F3 (async media detection), only when the original client request included an Authorization header.
Payload sent by Detections: mediaId, source (AI=0), videoTime, list of Detection objects (centerX, centerY, width, height, classNum, label, confidence), and optional base64 image. userId is not included — resolved from the JWT by Annotations. The Annotations API contract also accepts description, affiliation, and combatReadiness on each Detection, but Detections does not populate these.
Responses: 201 Created, 400 Bad Request (missing image/mediaId), 404 Not Found (unknown mediaId).
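An illustrative request for this endpoint is sketched below. All values are invented; field names follow the contract described above, and note that `userId` is intentionally absent (Annotations resolves it from the JWT):

```python
import requests

ANNOTATIONS_URL = "http://annotations:8080"  # default from the config table

payload = {
    "mediaId": "example-media-id",
    "source": 0,                  # AI = 0
    "videoTime": 12.48,
    "detections": [
        {
            "centerX": 0.41, "centerY": 0.37,
            "width": 0.08, "height": 0.05,
            "classNum": 2, "label": "vehicle", "confidence": 0.93,
        }
    ],
    "image": None,                # optional base64-encoded frame
}

def post_annotation(token: str) -> int:
    resp = requests.post(
        f"{ANNOTATIONS_URL}/annotations",
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    return resp.status_code       # 201 on success; 400/404 per contract
```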
Auth: Bearer JWT forwarded from the client. For long-running video, auto-refreshed via POST {ANNOTATIONS_URL}/auth/refresh (TokenManager, 60s pre-expiry window).
Downstream effect (Annotations side):
- Annotation persisted to local PostgreSQL (image hashed to XxHash64 ID)
- SSE event published to UI subscribers
- Annotation ID enqueued to `annotations_queue_records` → FailsafeProducer → RabbitMQ Stream (azaion-annotations) for central DB sync and AI training
Failure isolation: All POST failures are silently caught. Detection processing and SSE streaming continue regardless of Annotations service availability.
See _docs/02_document/modules/main.md § "Annotations Service Integration" for field-level schema detail.
6. Non-Functional Requirements
| Requirement | Target | Measurement | Priority |
|---|---|---|---|
| Concurrent inference | 2 parallel jobs max | ThreadPoolExecutor workers | High |
| SSE queue depth | 100 events per client | asyncio.Queue maxsize | Medium |
| Log retention | 30 days | loguru rotation config | Medium |
| GPU compatibility | Compute capability ≥ 6.1 | pynvml check at startup | High |
| Model format | ONNX (portable) + TensorRT (GPU-specific) | Engine filename includes CC+SM | High |
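The startup GPU compatibility check (compute capability ≥ 6.1) might look like this; the function name and fallback behaviour are assumptions, but the pynvml calls are the standard NVML bindings:

```python
# Require an NVIDIA GPU with CUDA compute capability >= 6.1;
# any failure (no driver, no pynvml) means the CPU/ONNX path.
MIN_CC = (6, 1)

def gpu_is_compatible() -> bool:
    try:
        import pynvml
        pynvml.nvmlInit()
        try:
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
                if (major, minor) >= MIN_CC:
                    return True
            return False
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        return False
```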
7. Security Architecture
Authentication: Pass-through Bearer JWT from client → forwarded to Annotations service. JWT exp decoded locally (base64, no signature verification) for token refresh timing.
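The local `exp` extraction and the 60-second pre-expiry window can be sketched as below; `TokenManager` is the real component named earlier, but these helper names are illustrative:

```python
import base64
import json
import time

def jwt_expiry(token: str) -> float:
    # Base64-decode the JWT payload WITHOUT verifying the signature;
    # only the exp claim is needed for refresh timing.
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)   # restore stripped padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return float(claims["exp"])

def needs_refresh(token: str, window_s: float = 60.0) -> bool:
    # Refresh once we are within 60 s of expiry.
    return time.time() >= jwt_expiry(token) - window_s
```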
Authorization: None at the detection service level. Auth is delegated to the Annotations service.
Data protection:
- At rest: not applicable (no local persistence of detection results)
- In transit: no TLS configured at application level (expected to be handled by infrastructure/reverse proxy)
- Secrets management: tokens received per-request, no stored credentials
Audit logging: Inference activity logged to daily rotated files. No auth audit logging.
8. Key Architectural Decisions
ADR-001: Cython for Inference Pipeline
Context: Detection postprocessing involves tight loops over bounding box coordinates with floating-point math.
Decision: Implement the inference pipeline, data models, and engines as Cython cdef classes with typed variables.
Alternatives considered:
- Pure Python — rejected due to loop-heavy postprocessing performance
- C/C++ extension — rejected for development velocity; Cython offers C-speed with Python-like syntax
Consequences: Build step required (setup.py + Cython compilation). IDE support and debugging more complex.
ADR-002: Dual Engine Strategy (TensorRT + ONNX Fallback)
Context: Need maximum GPU inference speed where available, but must also run on CPU-only machines.
Decision: Check GPU at module load time. If compatible NVIDIA GPU found, use TensorRT; otherwise fall back to ONNX Runtime. Background-convert ONNX→TensorRT and cache the engine.
Alternatives considered:
- TensorRT only — rejected; would break CPU-only development/testing
- ONNX only — rejected; significantly slower on GPU vs TensorRT
Consequences: Two code paths to maintain. GPU-specific engine files cached per architecture.
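The selection logic of ADR-002 reduces to a small sketch like the following. The engine classes are stand-ins for the real Cython engines, and the cache filename scheme (model name + compute capability + SM identifier) is an assumption based on the "CC+SM" note in § 6:

```python
class OnnxEngine:
    name = "onnxruntime"    # CPU (or CUDA provider) fallback

class TensorRTEngine:
    name = "tensorrt"       # preferred path on a compatible GPU

def select_engine(gpu_ok: bool):
    # Decided once at module load time, per ADR-002.
    return TensorRTEngine() if gpu_ok else OnnxEngine()

def engine_cache_name(model: str, cc: tuple[int, int], sm: int) -> str:
    # Converted TensorRT engines are cached per GPU architecture.
    return f"{model}_cc{cc[0]}{cc[1]}_sm{sm}.engine"
```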
ADR-003: Lazy Inference Initialization
Context: Engine initialization is slow (model download, possible conversion). API should start accepting health checks immediately.
Decision: Inference is created on first actual detection request, not at app startup. Health endpoint works without engine.
Consequences: First detection request has higher latency. AIAvailabilityStatus reports state transitions during initialization.
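A minimal sketch of the lazy-initialization pattern in ADR-003, assuming a thread-safe double-checked wrapper (names are illustrative; the real service tracks progress via AIAvailabilityStatus):

```python
import threading

class LazyInference:
    """Builds the expensive inference object on first use, not at startup."""

    def __init__(self, factory):
        self._factory = factory        # slow: model download, conversion, ...
        self._lock = threading.Lock()
        self._engine = None

    def get(self):
        if self._engine is None:           # fast path, no lock taken
            with self._lock:
                if self._engine is None:   # double-checked under the lock
                    self._engine = self._factory()
        return self._engine
```

Health checks simply never call `get()`, so the API can answer them before any engine exists.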
ADR-004: Large Image Tiling with GSD-Based Sizing
Context: Aerial images can be much larger than the model's fixed input size (1280×1280). Simple resize would lose small object detail.
Decision: Split large images into tiles sized by ground sampling distance (METERS_IN_TILE / GSD pixels) with configurable overlap. Deduplicate detections across tile boundaries.
Consequences: More complex pipeline. Tile deduplication relies on coordinate proximity threshold.
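The GSD-based tile sizing of ADR-004 can be sketched as follows. `METERS_IN_TILE` is the constant named above (its value here is invented), and the stride/origin layout is an assumption about how the real pipeline places overlapping tiles:

```python
METERS_IN_TILE = 128.0   # illustrative value; ground coverage per tile

def tile_origins(width: int, height: int, gsd_m_per_px: float,
                 overlap: float = 0.2) -> list[tuple[int, int, int]]:
    """Return (x, y, edge) for each tile of a width x height image."""
    edge = int(METERS_IN_TILE / gsd_m_per_px)   # tile edge in pixels from GSD
    edge = min(edge, width, height)             # clamp to the image itself
    stride = max(1, int(edge * (1.0 - overlap)))
    origins = []
    for y in range(0, max(height - edge, 0) + 1, stride):
        for x in range(0, max(width - edge, 0) + 1, stride):
            origins.append((x, y, edge))
    return origins
```

Detections from adjacent tiles are then deduplicated by coordinate proximity, as noted in the consequences above.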