Initial commit

Made-with: Cursor
Oleksandr Bezdieniezhnykh
2026-03-26 00:20:30 +02:00
commit 8e2ecf50fd
144 changed files with 19781 additions and 0 deletions
@@ -0,0 +1,80 @@
# Observability
## Logging
### Detection Log
| Field | Type | Description |
|-------|------|-------------|
| ts | ISO 8601 | Detection timestamp |
| frame_id | uint64 | Source frame |
| gps_denied_lat | float64 | GPS-denied latitude |
| gps_denied_lon | float64 | GPS-denied longitude |
| tier | uint8 | Tier that produced detection |
| class | string | Detection class label |
| confidence | float32 | Detection confidence |
| bbox | float32[4] | centerX, centerY, width, height (normalized) |
| freshness | string | Freshness tag (footpaths only) |
| tier2_result | string | Tier 2 classification |
| tier2_confidence | float32 | Tier 2 confidence |
| tier3_used | bool | Whether VLM was invoked |
| thumbnail_path | string | Path to ROI thumbnail |
**Format**: JSON-lines, append-only
**Location**: `/data/output/detections.jsonl`
**Rotation**: None (circular buffer at filesystem level for L1 frames)
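A minimal sketch of how one detection could be appended to the log, using the field names from the table above. The helper names `make_detection_record` and `append_detection` are hypothetical, and writing `null` for fields that do not apply (for example `freshness` on non-footpath detections) is an assumption, since the source does not say whether such fields are omitted or nulled.

```python
import json
from datetime import datetime, timezone


def make_detection_record(frame_id, lat, lon, tier, label, confidence, bbox,
                          freshness=None, tier2_result=None,
                          tier2_confidence=None, tier3_used=False,
                          thumbnail_path=None, ts=None):
    """Build one record keyed by the field names in the Detection Log table."""
    if len(bbox) != 4 or not all(0.0 <= v <= 1.0 for v in bbox):
        raise ValueError("bbox must hold 4 normalized values: cx, cy, w, h")
    return {
        "ts": ts or datetime.now(timezone.utc).isoformat(),
        "frame_id": int(frame_id),
        "gps_denied_lat": float(lat),
        "gps_denied_lon": float(lon),
        "tier": int(tier),
        "class": label,
        "confidence": float(confidence),
        "bbox": [float(v) for v in bbox],
        # Assumption: fields that do not apply are written as null.
        "freshness": freshness,
        "tier2_result": tier2_result,
        "tier2_confidence": tier2_confidence,
        "tier3_used": bool(tier3_used),
        "thumbnail_path": thumbnail_path,
    }


def append_detection(path, record):
    """Append the record as a single JSON line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, separators=(",", ":")) + "\n")
```

Opening the file in append mode for each record keeps the log append-only and means a crash can truncate at most the final line.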
### Gimbal Command Log
**Format**: Text, one line per command (timestamp, command type, target angles, CRC status, retry count)
**Location**: `/data/output/gimbal.log`
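The source lists what each gimbal log line contains but not its exact layout. The sketch below assumes a space-separated `key=value` layout with yaw and pitch as the target angles. The layout, the key names, and both helper functions are hypothetical and shown only to illustrate a one-line-per-command text log that stays machine-parseable for later analysis.

```python
from datetime import datetime, timezone


def format_gimbal_line(command, yaw_deg, pitch_deg, crc_ok, retries, ts=None):
    """Render one gimbal command as a single text line (assumed layout)."""
    ts = ts or datetime.now(timezone.utc).isoformat()
    crc = "OK" if crc_ok else "FAIL"
    return (f"{ts} cmd={command} yaw={yaw_deg:.2f} pitch={pitch_deg:.2f} "
            f"crc={crc} retries={retries}")


def parse_gimbal_line(line):
    """Split a line produced by format_gimbal_line back into a dict."""
    # The command name must not contain whitespace for this split to work.
    ts, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {
        "ts": ts,
        "command": fields["cmd"],
        "yaw_deg": float(fields["yaw"]),
        "pitch_deg": float(fields["pitch"]),
        "crc_ok": fields["crc"] == "OK",
        "retries": int(fields["retries"]),
    }
```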
### System Health Log
**Format**: JSON-lines, 1 entry per second
**Fields**: timestamp, t_junction, power_watts, gpu_mem_mb, cpu_mem_mb, degradation_level, gimbal_alive, semantic_alive, vlm_alive, nvme_free_pct
**Location**: `/data/output/health.jsonl`
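One way to hold the one-entry-per-second cadence is to schedule each write against the start time rather than sleeping a fixed second after each sample, so slow samples do not accumulate drift. The sketch below assumes a hypothetical `sample` callable that returns a dict with the fields listed above. `run_health_logger` and its parameters are illustrative names, not part of the source.

```python
import json
import time

HEALTH_FIELDS = (
    "timestamp", "t_junction", "power_watts", "gpu_mem_mb", "cpu_mem_mb",
    "degradation_level", "gimbal_alive", "semantic_alive", "vlm_alive",
    "nvme_free_pct",
)


def run_health_logger(sample, out, ticks, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Write `ticks` health entries to `out`, one per `interval` seconds.

    `sample` is a hypothetical callable returning a dict with HEALTH_FIELDS.
    Deadlines are derived from the start time, so time spent sampling is
    subtracted from the following sleep.
    """
    start = clock()
    for i in range(ticks):
        entry = sample()
        missing = [k for k in HEALTH_FIELDS if k not in entry]
        if missing:
            raise KeyError(f"health sample is missing fields: {missing}")
        out.write(json.dumps({k: entry[k] for k in HEALTH_FIELDS}) + "\n")
        out.flush()
        delay = start + (i + 1) * interval - clock()
        if delay > 0:
            sleep(delay)
```

Passing `clock` and `sleep` as parameters lets the loop be exercised with a fake clock instead of real time.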
### Application Error Log
**Format**: Text with severity levels (ERROR, WARN, INFO)
**Location**: `/data/output/app.log`
**Content**: Exceptions, timeouts, CRC failures, frame skips, VLM errors
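If the application is written in Python (the source does not state this), the standard `logging` module can produce a text log with these severity labels. The sketch below is one possible setup. `build_app_logger` is a hypothetical helper, and the line layout `timestamp SEVERITY message` is an assumption.

```python
import logging


def build_app_logger(path, name="app"):
    """Return a logger that writes 'timestamp SEVERITY message' lines to `path`."""
    # Python names this level WARNING; relabel it so lines read WARN.
    # Note that addLevelName changes the label process-wide.
    logging.addLevelName(logging.WARNING, "WARN")
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

A caller would obtain the logger once with `build_app_logger("/data/output/app.log")` and then use `log.error(...)`, `log.warning(...)`, and `log.info(...)`. Messages below INFO are dropped.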
## Metrics (In-Memory)
There is no external metrics service because the system is air-gapped. Metrics are computed in memory and exposed through the health API endpoint:
| Metric | Type | Description |
|--------|------|-------------|
| frames_processed_total | Counter | Total frames through Tier 1 |
| frames_skipped_quality | Counter | Frames rejected by quality gate |
| detections_total | Counter | Total detections produced (all tiers) |
| tier1_latency_ms | Histogram | Tier 1 inference time |
| tier2_latency_ms | Histogram | Tier 2 processing time |
| tier3_latency_ms | Histogram | Tier 3 VLM time |
| poi_queue_depth | Gauge | Current POI queue size |
| degradation_level | Gauge | Current degradation level |
| t_junction_celsius | Gauge | Current junction temperature |
| power_draw_watts | Gauge | Current power draw |
| gpu_memory_used_mb | Gauge | Current GPU memory |
| gimbal_crc_failures | Counter | Total CRC failures on UART |
| vlm_crashes | Counter | VLM process crash count |
**Exposed via**: GET /api/v1/health (JSON response with all metrics)
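The three metric types in the table can be held in a small thread-safe registry whose snapshot is serialized as the health response. This is a minimal sketch, not the project's implementation: the `Metrics` class, its method names, the bucket bounds, and the shape of the snapshot are all assumptions.

```python
import bisect
import json
import threading


class Metrics:
    """Minimal in-memory registry for counters, gauges, and histograms."""

    # Hypothetical latency bucket upper bounds in milliseconds.
    BUCKETS_MS = (10, 25, 50, 100, 250, 500, 1000, 5000)

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}
        self._gauges = {}
        self._histograms = {}

    def inc(self, name, amount=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    def set_gauge(self, name, value):
        with self._lock:
            self._gauges[name] = value

    def observe(self, name, value_ms):
        with self._lock:
            h = self._histograms.setdefault(
                name, {"counts": [0] * (len(self.BUCKETS_MS) + 1),
                       "sum": 0.0, "count": 0})
            # bisect_left places a value equal to a bound in that bound's
            # bucket; the final slot collects values above the largest bound.
            h["counts"][bisect.bisect_left(self.BUCKETS_MS, value_ms)] += 1
            h["sum"] += value_ms
            h["count"] += 1

    def snapshot(self):
        """Return a JSON-serializable copy of all metrics."""
        with self._lock:
            return {
                "counters": dict(self._counters),
                "gauges": dict(self._gauges),
                "histograms": {
                    name: {"buckets_ms": list(self.BUCKETS_MS),
                           "counts": list(h["counts"]),
                           "sum": h["sum"], "count": h["count"]}
                    for name, h in self._histograms.items()
                },
            }
```

A handler for `GET /api/v1/health` could then return `json.dumps(metrics.snapshot())`. The web framework is not specified in the source, so the handler itself is left out.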
## Alerting
There is no external alerting system. Alerts surface in the following ways:
1. Degradation level changes → logged to the health log and the detection log
2. Critical events (VLM crash, gimbal loss, thermal critical) → logged with severity ERROR
3. Operator display shows current degradation level as status indicator
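The first alert path can be sketched as a small monitor that writes an event to every configured log only when the degradation level actually changes. `DegradationMonitor`, the event name, and the event fields are hypothetical. The source states only that level changes are logged to both the health log and the detection log.

```python
import json
from datetime import datetime, timezone


class DegradationMonitor:
    """Write an alert record whenever the degradation level changes."""

    def __init__(self, sinks, initial_level=0):
        # File-like objects, e.g. the open health and detection logs.
        self._sinks = sinks
        self._level = initial_level

    def update(self, level, ts=None):
        """Record the current level; return True if an alert was written."""
        if level == self._level:
            return False
        event = {
            "ts": ts or datetime.now(timezone.utc).isoformat(),
            "event": "degradation_level_change",  # hypothetical event name
            "from": self._level,
            "to": level,
        }
        line = json.dumps(event) + "\n"
        for sink in self._sinks:
            sink.write(line)
        self._level = level
        return True
```

Calling `update` on every health tick is cheap, because nothing is written while the level stays the same.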
## Post-Flight Analysis
After landing, NVMe data is extracted via USB for offline analysis:
- `detections.jsonl` → import into annotation tool for TP/FP labeling
- `frames/` → source material for training dataset expansion
- `health.jsonl` → thermal/power profile for hardware optimization
- `gimbal.log` → PID tuning analysis
- `app.log` → debugging and issue diagnosis
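As a first pass over an extracted `detections.jsonl`, a short script can tally detections per class and tier before the file goes into an annotation tool. The sketch below is illustrative. `summarize_detections` is a hypothetical helper, and skipping unparseable lines (for example a final line cut off by power loss) is an assumption about how such a file should be handled.

```python
import json
from collections import Counter


def summarize_detections(path):
    """Count detections per class and per tier, and tally Tier 3 usage."""
    by_class, by_tier = Counter(), Counter()
    tier3_used = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # assumption: skip lines that are not valid JSON
            by_class[rec["class"]] += 1
            by_tier[rec["tier"]] += 1
            tier3_used += bool(rec.get("tier3_used"))
    return {"by_class": dict(by_class), "by_tier": dict(by_tier),
            "tier3_used": tier3_used}
```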