Add E2E tests, fix bugs

Made-with: Cursor
Oleksandr Bezdieniezhnykh
2026-04-13 05:17:48 +03:00
parent 1f98b5e958
commit 8f7deb3fca
71 changed files with 4740 additions and 29 deletions
@@ -0,0 +1,98 @@
# Core Models
## 1. High-Level Overview
**Purpose**: Provides shared constants, data models (Credentials, User, UnlockState), and the application-wide logging facility used by all other components.
**Architectural Pattern**: Shared kernel — foundational types and utilities with no business logic.
**Upstream dependencies**: None (leaf component)
**Downstream consumers**: Security, Resource Management, HTTP API
## 2. Internal Interfaces
### Interface: Constants
| Symbol | Type | Value / Signature |
|-----------------------|------|----------------------------|
| `CONFIG_FILE` | str | `"config.yaml"` |
| `QUEUE_CONFIG_FILENAME`| str | `"secured-config.json"` |
| `AI_ONNX_MODEL_FILE` | str | `"azaion.onnx"` |
| `CDN_CONFIG` | str | `"cdn.yaml"` |
| `MODELS_FOLDER` | str | `"models"` |
| `SMALL_SIZE_KB` | int | `3` |
| `ALIGNMENT_WIDTH` | int | `32` |
| `log(str)` | cdef | INFO-level log via Loguru |
| `logerror(str)` | cdef | ERROR-level log via Loguru |
### Interface: Credentials
| Method | Input | Output | Async | Error Types |
|----------------|--------------------------|-------------|-------|-------------|
| `__init__` | `str email, str password`| Credentials | No | — |
**Fields**: `email: str (public)`, `password: str (public)`
### Interface: User
| Method | Input | Output | Async | Error Types |
|------------|-----------------------------------|--------|-------|-------------|
| `__init__` | `str id, str email, RoleEnum role`| User | No | — |
**Enum: RoleEnum** — NONE(0), Operator(10), Validator(20), CompanionPC(30), Admin(40), ResourceUploader(50), ApiAdmin(1000)
### Interface: UnlockState
Python `str` enum: idle, authenticating, downloading_key, decrypting, loading_images, ready, error.
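The two enumerations above can be sketched in plain Python as follows. This is an illustration only: the real `RoleEnum` lives in Cython and may be declared differently, so modelling it as an `IntEnum` is an assumption, as is giving each `UnlockState` member a value equal to its name.

```python
from enum import Enum, IntEnum


class RoleEnum(IntEnum):
    # Numeric values copied from the interface description above.
    NONE = 0
    Operator = 10
    Validator = 20
    CompanionPC = 30
    Admin = 40
    ResourceUploader = 50
    ApiAdmin = 1000


class UnlockState(str, Enum):
    # Assumption: each member's string value equals its name.
    idle = "idle"
    authenticating = "authenticating"
    downloading_key = "downloading_key"
    decrypting = "decrypting"
    loading_images = "loading_images"
    ready = "ready"
    error = "error"
```

Because `UnlockState` mixes in `str`, a member compares equal to its plain string value, which is convenient when the state is serialized into a JSON response.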
## 3. External API Specification
N/A — internal-only component.
## 4. Data Access Patterns
N/A — no persistent storage. All data is in-memory.
## 5. Implementation Details
**State Management**: Stateless — pure data definitions and a configured logger singleton.
**Key Dependencies**:
| Library | Version | Purpose |
|---------|---------|--------------------------------|
| loguru | 0.7.3 | Structured logging with rotation |
**Error Handling Strategy**: Logging functions never throw; they are the error-reporting mechanism.
## 6. Extensions and Helpers
None.
## 7. Caveats & Edge Cases
**Known limitations**:
- `QUEUE_MAXSIZE`, `COMMANDS_QUEUE`, `ANNOTATIONS_QUEUE` are declared in `constants.pxd` but never defined — dead declarations
- Log directory `Logs/` is hardcoded; not configurable via env var
- `psutil` is in `requirements.txt` but not used by any module
## 8. Dependency Graph
**Must be implemented after**: —
**Can be implemented in parallel with**: None; Security (02) and Resource Management (03) both depend on this component
**Blocks**: Security (02), Resource Management (03), HTTP API (04)
## 9. Logging Strategy
| Log Level | When | Example |
|-----------|------|---------|
| ERROR | `logerror()` calls | Forwarded from caller modules |
| INFO | `log()` calls | Forwarded from caller modules |
| DEBUG | Stdout filter includes DEBUG | Available for development |
**Log format**: `[HH:mm:ss LEVEL] message`
**Log storage**: File (`Logs/log_loader_{date}.txt`) + stdout (INFO/DEBUG) + stderr (WARNING+)
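The stream routing described above (INFO/DEBUG to stdout, WARNING and higher to stderr) can be illustrated with the standard library. Note that this sketch deliberately uses Python's built-in `logging` module as a stand-in, not Loguru, and it omits the file sink; the names `build_logger` and `MaxLevelFilter` are hypothetical.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Pass only records at or below a given level."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def build_logger(name="loader_sketch", out=sys.stdout, err=sys.stderr):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False

    # Mirrors the documented "[HH:mm:ss LEVEL] message" format.
    fmt = logging.Formatter("[%(asctime)s %(levelname)s] %(message)s",
                            datefmt="%H:%M:%S")

    low = logging.StreamHandler(out)        # DEBUG and INFO only
    low.setLevel(logging.DEBUG)
    low.addFilter(MaxLevelFilter(logging.INFO))
    low.setFormatter(fmt)

    high = logging.StreamHandler(err)       # WARNING and above
    high.setLevel(logging.WARNING)
    high.setFormatter(fmt)

    logger.addHandler(low)
    logger.addHandler(high)
    return logger
```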
@@ -0,0 +1,102 @@
# Security
## 1. High-Level Overview
**Purpose**: Provides AES-256-CBC encryption/decryption, multiple key derivation strategies, and OS-specific hardware fingerprinting for machine-bound access control.
**Architectural Pattern**: Utility / Strategy — stateless static methods for crypto operations; hardware fingerprinting with caching.
**Upstream dependencies**: Core Models (01) — uses `Credentials` type, `constants.log()`
**Downstream consumers**: Resource Management (03) — `ApiClient` uses all Security and HardwareService methods
## 2. Internal Interfaces
### Interface: Security
| Method | Input | Output | Async | Error Types |
|-----------------------------|----------------------------------------|--------|-------|-------------|
| `encrypt_to` | `bytes input_bytes, str key` | bytes | No | cryptography errors |
| `decrypt_to` | `bytes ciphertext_with_iv, str key` | bytes | No | cryptography errors |
| `get_hw_hash` | `str hardware` | str | No | — |
| `get_api_encryption_key` | `Credentials creds, str hardware_hash` | str | No | — |
| `get_resource_encryption_key`| — | str | No | — |
| `calc_hash` | `str key` | str | No | — |
All methods are `@staticmethod cdef` (Cython-only visibility).
### Interface: HardwareService
| Method | Input | Output | Async | Error Types |
|---------------------|-------|--------|-------|---------------------|
| `get_hardware_info` | — | str | No | subprocess errors |
`@staticmethod cdef` with module-level caching in `_CACHED_HW_INFO`.
## 3. External API Specification
N/A — internal-only component.
## 4. Data Access Patterns
### Caching Strategy
| Data | Cache Type | TTL | Invalidation |
|-----------------|-----------|----------|---------------|
| Hardware info | In-memory (module global) | Process lifetime | Never (static hardware) |
## 5. Implementation Details
**Algorithmic Complexity**: All crypto operations are O(n) in input size.
**State Management**: HardwareService has one cached string; Security is fully stateless.
**Key Dependencies**:
| Library | Version | Purpose |
|--------------|---------|--------------------------------------|
| cryptography | 44.0.2 | AES-256-CBC cipher, PKCS7 padding |
**Error Handling Strategy**:
- Crypto errors propagate to caller (no catch)
- `subprocess.check_output` in HardwareService raises `CalledProcessError` on failure
**Key Derivation Hierarchy**:
1. Hardware hash: `SHA-384("Azaion_{hw_string}_%$$$)0_")` → base64
2. API encryption key: `SHA-384("{email}-{password}-{hw_hash}-#%@AzaionKey@%#---")` → base64 (per-user, per-machine)
3. Resource encryption key: `SHA-384("-#%@AzaionKey@%#---234sdfklgvhjbnn")` → base64 (fixed, shared)
4. AES key expansion: `SHA-256(string_key)` → 32-byte AES key (inside encrypt/decrypt)
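The hierarchy above can be sketched with the standard library alone. Several details are assumptions not stated in the source: UTF-8 encoding of the input strings, the standard base64 alphabet, and passing `email`/`password` as separate arguments instead of a `Credentials` object. The helper names `_sha384_b64` and `expand_aes_key` are hypothetical.

```python
import base64
import hashlib


def _sha384_b64(text: str) -> str:
    # Assumption: input is UTF-8 encoded; output uses standard base64.
    digest = hashlib.sha384(text.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def get_hw_hash(hardware: str) -> str:
    # Step 1: machine-bound hash.
    return _sha384_b64(f"Azaion_{hardware}_%$$$)0_")


def get_api_encryption_key(email: str, password: str, hardware_hash: str) -> str:
    # Step 2: per-user, per-machine key.
    return _sha384_b64(f"{email}-{password}-{hardware_hash}-#%@AzaionKey@%#---")


def get_resource_encryption_key() -> str:
    # Step 3: fixed key shared by all users.
    return _sha384_b64("-#%@AzaionKey@%#---234sdfklgvhjbnn")


def expand_aes_key(string_key: str) -> bytes:
    # Step 4: SHA-256 yields the 32 bytes that AES-256 requires.
    return hashlib.sha256(string_key.encode("utf-8")).digest()
```

A SHA-384 digest is 48 bytes, so each base64 key string is 64 characters long with no padding.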
## 6. Extensions and Helpers
None.
## 7. Caveats & Edge Cases
**Known limitations**:
- `get_resource_encryption_key()` returns a fixed key — all users share the same resource encryption key
- Hardware detection uses `shell=True` subprocess — injection risk if inputs were user-controlled (they are not)
- Linux hardware detection may fail on systems without `lscpu`, `lspci`, or `/sys/block/sda`
- Multiple GPUs: only the first GPU line is captured
**Potential race conditions**:
- `_CACHED_HW_INFO` is a module global written without locking — concurrent first calls could race, but the result is idempotent
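Since the race is benign, no fix is strictly required. If the duplicate probing ever became a concern, a lock around the first write would remove it. The sketch below is a hypothetical variant, not the current implementation: `probe` stands in for the OS-specific subprocess calls and `_HW_LOCK` is an invented name.

```python
import threading
from typing import Callable, Optional

_CACHED_HW_INFO: Optional[str] = None
_HW_LOCK = threading.Lock()  # hypothetical; the source uses no lock


def get_hardware_info(probe: Callable[[], str]) -> str:
    """Return cached hardware info, running `probe` at most once."""
    global _CACHED_HW_INFO
    if _CACHED_HW_INFO is not None:       # fast path, no locking
        return _CACHED_HW_INFO
    with _HW_LOCK:
        if _CACHED_HW_INFO is None:       # re-check under the lock
            _CACHED_HW_INFO = probe()
        return _CACHED_HW_INFO
```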
## 8. Dependency Graph
**Must be implemented after**: Core Models (01)
**Can be implemented in parallel with**: None; Resource Management (03) depends on this component, so Security must be ready first
**Blocks**: Resource Management (03)
## 9. Logging Strategy
| Log Level | When | Example |
|-----------|------|---------|
| INFO | Hardware info gathered | `"Gathered hardware: CPU: ... GPU: ... Memory: ... DriveSerial: ..."` |
| INFO | Cached hardware reuse | `"Using cached hardware info"` |
**Log format**: Via `constants.log()`: `[HH:mm:ss INFO] message`
**Log storage**: Same as Core Models logging configuration
@@ -0,0 +1,131 @@
# Resource Management
## 1. High-Level Overview
**Purpose**: Orchestrates authenticated resource download/upload using a binary-split scheme (small encrypted part via API, large part via CDN), CDN storage operations, and Docker image archive decryption/loading.
**Architectural Pattern**: Facade — `ApiClient` coordinates CDN, Security, and API calls behind a unified interface.
**Upstream dependencies**: Core Models (01) — constants, Credentials, User, RoleEnum; Security (02) — encryption, key derivation, hardware fingerprinting
**Downstream consumers**: HTTP API (04) — `main.py` uses `ApiClient` for all resource operations and `binary_split` for Docker unlock
## 2. Internal Interfaces
### Interface: ApiClient
| Method | Input | Output | Async | Error Types |
|------------------------------|-----------------------------------------------------------|--------|-------|--------------------------------|
| `set_credentials_from_dict` | `str email, str password` | — | No | API errors, YAML parse errors |
| `login` | — | — | No | HTTPError, Exception |
| `load_big_small_resource` | `str resource_name, str folder` | bytes | No | Exception (API, CDN, decrypt) |
| `upload_big_small_resource` | `bytes resource, str resource_name, str folder` | — | No | Exception (API, CDN, encrypt) |
| `upload_to_cdn` | `str bucket, str filename, bytes file_bytes` | — | No | Exception |
| `download_from_cdn` | `str bucket, str filename` | bytes | No | Exception |
Cython-only methods (cdef): `set_credentials`, `set_token`, `get_user`, `request`, `list_files`, `check_resource`, `load_bytes`, `upload_file`, `load_big_file_cdn`
### Interface: CDNManager
| Method | Input | Output | Async | Error Types |
|------------|----------------------------------------------|--------|-------|------------------|
| `upload` | `str bucket, str filename, bytes file_bytes` | bool | No | boto3 exceptions |
| `download` | `str folder, str filename` | bool | No | boto3 exceptions |
### Interface: binary_split (module-level functions)
| Function | Input | Output | Async | Error Types |
|------------------------|-------------------------------------------------|--------|-------|-----------------------|
| `download_key_fragment`| `str resource_api_url, str token` | bytes | No | requests.HTTPError |
| `decrypt_archive` | `str encrypted_path, bytes key_fragment, str output_path` | — | No | crypto/IO errors |
| `docker_load` | `str tar_path` | — | No | subprocess.CalledProcessError |
| `check_images_loaded` | `str version` | bool | No | — |
## 3. External API Specification
N/A — this component is consumed by HTTP API (04), not directly exposed.
## 4. Data Access Patterns
### Caching Strategy
| Data | Cache Type | TTL | Invalidation |
|----------------------|---------------------|------------------|---------------------------------|
| CDN config (cdn.yaml)| In-memory (CDNManager) | Process lifetime | On re-authentication |
| JWT token | In-memory | Until 401/403 | Auto-refresh on auth error |
| Big file parts | Local filesystem | Until version mismatch | Overwritten on new upload |
### Storage Estimates
| Location | Description | Growth Rate |
|--------------------|------------------------------------|------------------------|
| `{folder}/{name}.big` | Cached large resource parts | Per resource upload |
| Logs/ | Loguru log files | ~daily rotation, 30-day retention |
## 5. Implementation Details
**State Management**: `ApiClient` is a stateful singleton (token, credentials, CDN manager). `binary_split` is stateless.
**Key Dependencies**:
| Library | Version | Purpose |
|--------------|---------|--------------------------------------|
| requests | 2.32.4 | HTTP client for API calls |
| pyjwt | 2.10.1 | JWT token decoding (no verification) |
| boto3 | 1.40.9 | S3-compatible CDN operations |
| pyyaml | 6.0.2 | CDN config parsing |
| cryptography | 44.0.2 | AES-256-CBC for archive decryption |
**Error Handling Strategy**:
- `request()` auto-retries on 401/403 (re-login then retry once)
- 500 errors raise `Exception` with response text
- 409 (Conflict) errors raise with parsed ErrorCode/Message
- CDN operations return bool (True/False) — swallow exceptions, log error
- `binary_split` functions propagate all errors to caller
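The retry and status-code rules above can be sketched as a small wrapper. This is a minimal illustration under assumptions: `send` and `login` are hypothetical callables replacing the real HTTP and login calls, the 409 body is assumed to be a JSON object with `ErrorCode` and `Message` keys, and returning any other status unchanged is a choice made for the sketch.

```python
import json
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status_code, body_text)


def request_with_relogin(send: Callable[[], Response],
                         login: Callable[[], None]) -> Response:
    status, body = send()
    if status in (401, 403):
        # Token missing or expired: log in again, then retry exactly once.
        login()
        status, body = send()
    if status == 500:
        raise Exception(body)
    if status == 409:
        payload = json.loads(body)  # assumption: JSON error body
        raise Exception(f"{payload.get('ErrorCode')}: {payload.get('Message')}")
    # Any other status is handed back to the caller unchanged.
    return status, body
```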
**Big/Small Resource Split Protocol**:
- **Download**: small part (encrypted per-user+hw key) from API + big part from local cache or CDN → concatenate → decrypt with shared resource key
- **Upload**: encrypt entire resource with shared key → split at `min(SMALL_SIZE_KB, 30% of the encrypted size)` (`SMALL_SIZE_KB` = 3) → small part to API, big part to CDN + local copy
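The split arithmetic can be illustrated as follows. The sketch assumes 1 KB = 1024 bytes, that the 30% figure is rounded down, and that the small part is the leading slice of the encrypted payload; none of these details is confirmed by the source, and the function names are hypothetical.

```python
from typing import Tuple

SMALL_SIZE_KB = 3  # from the Core Models constants


def split_point(total_len: int) -> int:
    # Assumptions: 1 KB = 1024 bytes, 30% rounded down.
    return min(SMALL_SIZE_KB * 1024, total_len * 3 // 10)


def split_resource(encrypted: bytes) -> Tuple[bytes, bytes]:
    # Assumption: the small part is the leading slice.
    cut = split_point(len(encrypted))
    return encrypted[:cut], encrypted[cut:]


def join_resource(small: bytes, big: bytes) -> bytes:
    # Download side: concatenate before decrypting.
    return small + big
```

Under these assumptions, a 10,000-byte payload splits at 3,000 bytes (the 30% bound wins), while a 1,000,000-byte payload splits at 3,072 bytes (the 3 KB bound wins).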
## 6. Extensions and Helpers
None.
## 7. Caveats & Edge Cases
**Known limitations**:
- JWT token decoded without signature verification — trusts the API server
- CDN manager initialization requires a successful encrypted download of the CDN config, so the credentials must already work for the login call that precedes that download (bootstrapping order: login, then CDN config download)
- `load_big_small_resource` attempts local cache first; on decrypt failure (version mismatch), silently falls through to CDN download — the error is logged but not surfaced to caller
- `API_SERVICES` list in `binary_split` is hardcoded — adding a new service requires code change
- `docker_load` and `check_images_loaded` shell out to Docker CLI — requires Docker CLI in the container
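The cache-first behaviour of `load_big_small_resource` noted above can be sketched with injected callables. All parameter names here are hypothetical stand-ins for the real file, CDN, crypto, and logging calls; the sketch shows only the control flow.

```python
from typing import Callable, Optional


def load_big_small(small: bytes,
                   read_local_big: Callable[[], Optional[bytes]],
                   fetch_cdn_big: Callable[[], bytes],
                   decrypt: Callable[[bytes], bytes],
                   log_error: Callable[[str], None]) -> bytes:
    local_big = read_local_big()
    if local_big is not None:
        try:
            return decrypt(small + local_big)
        except Exception as exc:
            # Logged, but not surfaced to the caller.
            log_error(f"Local big part failed to decrypt: {exc}")
    # No usable local copy: fall through to the CDN.
    return decrypt(small + fetch_cdn_big())
```

A failure of the CDN path still propagates, so the caller sees an error only when both sources are unusable.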
**Potential race conditions**:
- `api_client` singleton in `main.py` is initialized without locking; concurrent first requests could create multiple instances (only one is kept)
**Performance bottlenecks**:
- Large resource encryption/decryption is synchronous and in-memory
- CDN downloads are synchronous (blocking the thread)
## 8. Dependency Graph
**Must be implemented after**: Core Models (01), Security (02)
**Can be implemented in parallel with**: —
**Blocks**: HTTP API (04)
## 9. Logging Strategy
| Log Level | When | Example |
|-----------|------|---------|
| INFO | File downloaded | `"Downloaded file: cdn.yaml, 1234 bytes"` |
| INFO | File uploaded | `"Uploaded model.bin to api.azaion.com/models successfully: 200."` |
| INFO | CDN operation | `"downloaded model.big from the models"` |
| INFO | Big file check | `"checking on existence for models/model.big"` |
| ERROR | Upload failure | `"Upload fail: ConnectionError(...)"` |
| ERROR | API error | `"{'ErrorCode': 409, 'Message': '...'}"` |
**Log format**: Via `constants.log()` / `constants.logerror()`
**Log storage**: Same as Core Models logging configuration
@@ -0,0 +1,144 @@
# HTTP API
## 1. High-Level Overview
**Purpose**: FastAPI application that exposes HTTP endpoints for health monitoring, user authentication, encrypted resource loading/uploading, and a background Docker image unlock workflow.
**Architectural Pattern**: Thin controller — delegates all business logic to Resource Management (03) and binary_split.
**Upstream dependencies**: Core Models (01) — UnlockState enum; Resource Management (03) — ApiClient, binary_split functions
**Downstream consumers**: None — this is the system entry point, consumed by external HTTP clients.
## 2. Internal Interfaces
### Interface: Module-level Functions
| Function | Input | Output | Description |
|-------------------|---------------------------------|----------------|---------------------------------|
| `get_api_client` | — | ApiClient | Lazy singleton accessor |
| `_run_unlock` | `str email, str password` | — | Background task: full unlock flow |
## 3. External API Specification
| Endpoint | Method | Auth | Rate Limit | Description |
|--------------------|--------|----------|------------|------------------------------------------|
| `/health` | GET | Public | — | Liveness probe |
| `/status` | GET | Public | — | Auth status + model cache dir |
| `/login` | POST | Public | — | Set user credentials |
| `/load/{filename}` | POST | Implicit | — | Download + decrypt resource |
| `/upload/{filename}`| POST | Implicit | — | Encrypt + upload resource (big/small) |
| `/unlock` | POST | Public | — | Start background Docker unlock |
| `/unlock/status` | GET | Public | — | Poll unlock workflow progress |
"Implicit" auth = credentials must have been set via `/login` first; enforced by ApiClient's auto-login on token absence.
### Request/Response Schemas
**POST /login**
```json
// Request
{"email": "user@example.com", "password": "secret"}
// Response 200
{"status": "ok"}
// Response 401
{"detail": "error message"}
```
**POST /load/{filename}**
```json
// Request
{"filename": "model.bin", "folder": "models"}
// Response 200 — binary octet-stream
// Response 500
{"detail": "error message"}
```
**POST /upload/{filename}**
```
// Request — multipart/form-data
data: <file>
folder: "models" (form field, default "models")
// Response 200
{"status": "ok"}
```
**POST /unlock**
```json
// Request
{"email": "user@example.com", "password": "secret"}
// Response 200
{"state": "authenticating"}
// Response 404
{"detail": "Encrypted archive not found"}
```
**GET /unlock/status**
```json
// Response 200
{"state": "decrypting", "error": null}
```
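A client would typically call `POST /unlock` once and then poll `GET /unlock/status`. The helper below is a hedged sketch of that polling loop: `fetch_status` is a hypothetical callable that performs the GET and returns the parsed JSON, and treating `ready` and `error` as the only terminal states is an assumption based on the `UnlockState` values.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = {"ready", "error"}  # assumption: no other final states


def wait_for_unlock(fetch_status: Callable[[], Dict],
                    interval_s: float = 1.0,
                    max_polls: int = 60,
                    sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll until the unlock workflow reaches a terminal state."""
    for _ in range(max_polls):
        status = fetch_status()
        if status["state"] in TERMINAL_STATES:
            return status
        sleep(interval_s)
    raise TimeoutError(f"unlock did not finish after {max_polls} polls")
```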
## 4. Data Access Patterns
### Caching Strategy
| Data | Cache Type | TTL | Invalidation |
|---------------|---------------------|---------------|---------------------|
| ApiClient | In-memory singleton | Process life | Never |
| unlock_state | Module global | Until next unlock | State machine transition |
## 5. Implementation Details
**State Management**: Module-level globals (`api_client`, `unlock_state`, `unlock_error`). Only the unlock state mutations (`unlock_state`, `unlock_error`) are protected by a `threading.Lock`; `api_client` creation is not.
**Key Dependencies**:
| Library | Version | Purpose |
|----------------|---------|------------------------------|
| fastapi | latest | HTTP framework |
| uvicorn | latest | ASGI server |
| pydantic | (via fastapi) | Request/response models |
| python-multipart| latest | File upload support |
**Error Handling Strategy**:
- `/login` — catches all exceptions, returns 401
- `/load`, `/upload` — catches all exceptions, returns 500
- `/unlock` — checks preconditions (archive exists, not already in progress), then delegates to background task
- Background task (`_run_unlock`) catches all exceptions, sets `unlock_state = error` with error message
## 6. Extensions and Helpers
None.
## 7. Caveats & Edge Cases
**Known limitations**:
- No authentication middleware — endpoints rely on prior `/login` call having set credentials on the singleton
- `get_api_client()` uses a global without locking — race on first concurrent access
- `/load/{filename}` has a path parameter `filename` but also takes `req.filename` from the body — the path param is unused
- `_run_unlock` silently ignores `OSError` when removing tar file (acceptable cleanup behavior)
**Potential race conditions**:
- `unlock_state` mutations are lock-protected, but `api_client` singleton creation is not
- Concurrent `/unlock` calls: the lock check prevents duplicate starts, but there's a small TOCTOU window between the check and the `background_tasks.add_task` call
**Performance bottlenecks**:
- `/load` and `/upload` are synchronous — large files block the worker thread
- `_run_unlock` runs as a background task (single thread) — only one unlock can run at a time
## 8. Dependency Graph
**Must be implemented after**: Core Models (01), Resource Management (03)
**Can be implemented in parallel with**: —
**Blocks**: — (entry point)
## 9. Logging Strategy
No direct logging in this component — all logging is handled by downstream components via `constants.log()` / `constants.logerror()`.
**Log format**: N/A (delegates)
**Log storage**: N/A (delegates)