Refactor autopilot workflows and documentation: Update .gitignore to include binary and media file types, enhance agent command references in documentation, and modify annotation class for improved accessibility. Adjust inference processing to handle batch sizes and streamline test specifications for clarity and consistency across the system.

2026-06-21 08:11:07 +00:00 · 2026-03-25 05:26:19 +02:00
parent a5fc4fe073
commit 4afa1a4eec
29 changed files with 447 additions and 362 deletions
@@ -0,0 +1 @@
+center_x,center_y,width,height,label,confidence_min
@@ -0,0 +1 @@
+center_x,center_y,width,height,label,confidence_min
@@ -0,0 +1 @@
+center_x,center_y,width,height,label,confidence_min
@@ -0,0 +1 @@
+center_x,center_y,width,height,label,confidence_min
@@ -0,0 +1 @@
+center_x,center_y,width,height,label,confidence_min
@@ -0,0 +1 @@
+center_x,center_y,width,height,label,confidence_min
@@ -0,0 +1,104 @@
+# Expected Results
+
+Maps every input data item to its quantifiable expected result.
+Tests use this mapping to compare actual system output against known-correct answers.
+
+## Coordinate System
+
+All bounding box coordinates are **normalized to 0.0–1.0** relative to the full image/frame dimensions, matching the API response format:
+
+| Field | Meaning |
+|-------|---------|
+| `center_x` | Horizontal center of bounding box (0.0 = left edge, 1.0 = right edge) |
+| `center_y` | Vertical center of bounding box (0.0 = top edge, 1.0 = bottom edge) |
+| `width` | Bounding box width as fraction of image width |
+| `height` | Bounding box height as fraction of image height |
+| `label` | Class name from `classes.json` (e.g., `ArmorVehicle`, `Car`, `Person`) |
+| `confidence_min` | Minimum acceptable confidence for this detection (threshold comparison, `≥`) |
+
+Expected-result files for videos add one field:
+
+| Field | Meaning |
+|-------|---------|
+| `time_sec` | Timestamp in seconds from video start when this detection is visible |
+
+## Global Tolerances
+
+| Parameter | Tolerance | Comparison Method |
+|-----------|-----------|-------------------|
+| Bounding box coordinates (center_x, center_y, width, height) | ± 0.05 | `numeric_tolerance` |
+| Detection count | ± 2 | `numeric_tolerance` |
+| Confidence | ≥ `confidence_min` value per row | `threshold_min` |
+| Label | exact match | `exact` |
+| Video time_sec | ± 1.0s | `numeric_tolerance` |
+
+## Input → Expected Result Mapping
+
+### Images
+
+| # | Input File | Description | Expected Result File | Expected Detection Count | Notes |
+|---|------------|-------------|---------------------|-------------------------|-------|
+| 1 | `image_small.jpg` | 1280×720 aerial, contains detectable objects | `image_small_expected.csv` | ? | Primary test image for single-frame detection |
+| 2 | `image_large.JPG` | 6252×4168 aerial, triggers GSD-based tiling | `image_large_expected.csv` | ? | Coordinates normalized to full image (not tile) |
+| 3 | `image_dense01.jpg` | 1280×720 dense scene, many clustered objects | `image_dense01_expected.csv` | ? | Used for dedup and max-detection-cap tests |
+| 4 | `image_dense02.jpg` | 1920×1080 dense scene variant | `image_dense02_expected.csv` | ? | Borderline tiling, dedup variant |
+| 5 | `image_different_types.jpg` | 900×1600, varied object classes | `image_different_types_expected.csv` | ? | Must contain multiple distinct class labels |
+| 6 | `image_empty_scene.jpg` | 1920×1080, no detectable objects | `image_empty_scene_expected.csv` | 0 | CSV has headers only — zero detections expected |
+
+### Videos
+
+| # | Input File | Description | Expected Result File | Notes |
+|---|------------|-------------|---------------------|-------|
+| 7 | `video_short01.mp4` | Standard test video | `video_short01_expected.csv` | Primary async/SSE/video test. List key-frame detections. |
+| 8 | `video_short02.mp4` | Video variant | `video_short02_expected.csv` | Used for resilience and concurrent tests |
+| 9 | `video_long03.mp4` | Long video (288MB), generates >100 SSE events | `video_long03_expected.csv` | SSE overflow test. Only key-frame samples needed. |
+
+## How to Fill
+
+### Images
+
+1. Run the model on each image (or use the detection service)
+2. Record every detection the model returns
+3. Fill one row per detection in the CSV:
+
+```
+center_x,center_y,width,height,label,confidence_min
+0.45,0.32,0.08,0.12,Car,0.25
+0.71,0.55,0.06,0.09,Person,0.25
+```
+
+4. For `image_empty_scene_expected.csv` — leave only the header row (0 detections)
+
+### Videos
+
+1. Run the model on the video (or use the detection service with `frame_period_recognition: 1`)
+2. For key frames where detections appear, record the timestamp and detections
+3. Fill one row per detection per timestamp:
+
+```
+time_sec,center_x,center_y,width,height,label,confidence_min
+2.0,0.45,0.32,0.08,0.12,Car,0.25
+2.0,0.71,0.55,0.06,0.09,Person,0.25
+4.0,0.46,0.33,0.08,0.12,Car,0.25
+```
+
+4. You don't need every single frame — sample at key moments (e.g., every 2–4 seconds) to validate detection presence and approximate positions
+
+## Non-Detection Expected Results
+
+The following test scenarios have expected results that are not per-file detections. These are defined inline in the test specs and do not need CSV files:
+
+| Scenario | Expected Result | Comparison | Defined In |
+|----------|----------------|------------|------------|
+| Empty image (FT-N-01) | HTTP 400, `"Image is empty"` | exact | `blackbox-tests.md` |
+| Corrupt image (FT-N-02) | HTTP 400 or 422 | exact | `blackbox-tests.md` |
+| Engine unavailable (FT-N-03) | HTTP 503 or 422, not 500 | exact | `blackbox-tests.md` |
+| Duplicate media_id (FT-N-04) | HTTP 409 | exact | `blackbox-tests.md` |
+| Missing classes.json (FT-N-05) | Service fails or empty detections | exact | `blackbox-tests.md` |
+| Health pre-init (FT-P-01) | `aiAvailability: "None"` | exact | `blackbox-tests.md` |
+| Health post-init (FT-P-02) | `aiAvailability` not "None"/"Downloading" | exact | `blackbox-tests.md` |
+| Async start (FT-P-08) | `{"status": "started"}`, response < 1s | exact + threshold_max | `blackbox-tests.md` |
+| SSE completion (FT-P-09) | Final event: `mediaStatus: "AIProcessed"`, `percent: 100` | exact | `blackbox-tests.md` |
+| Max detections (NFT-RES-LIM-03) | `len(detections) ≤ 300` | threshold_max | `resource-limit-tests.md` |
+| Single image latency (NFT-PERF-01) | p95 < 5000ms (ONNX CPU) | threshold_max | `performance-tests.md` |
+| Log file naming (NFT-RES-LIM-04) | `log_inference_YYYYMMDD.txt` exists | regex | `resource-limit-tests.md` |
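Reviewer note: the Global Tolerances table in the file above could be applied roughly as in the sketch below. This is a minimal illustration, not the project's test harness. The helper names (`load_expected`, `matches`, `check`) are hypothetical, and the `confidence` key on an actual detection is an assumption, since the diff only names the expected-side column `confidence_min`. Matching is greedy (first fit), which is simpler than an optimal assignment and may mis-pair boxes in dense scenes.

```python
import csv
import io

COORD_TOL = 0.05  # Global Tolerances: bounding box coordinates, +/- 0.05
COUNT_TOL = 2     # Global Tolerances: detection count, +/- 2
COORD_FIELDS = ("center_x", "center_y", "width", "height")


def load_expected(csv_text):
    """Parse an expected-results CSV; every column except `label` becomes a float."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({k: (v if k == "label" else float(v)) for k, v in row.items()})
    return rows


def matches(expected, actual):
    """True if one actual detection satisfies one expected row."""
    if actual["label"] != expected["label"]:                # exact
        return False
    if actual["confidence"] < expected["confidence_min"]:   # threshold_min (assumed key)
        return False
    # numeric_tolerance on each normalized coordinate
    return all(abs(actual[f] - expected[f]) <= COORD_TOL for f in COORD_FIELDS)


def check(expected_rows, detections):
    """Return (count_ok, unmatched_expected_rows) using greedy first-fit matching."""
    count_ok = abs(len(detections) - len(expected_rows)) <= COUNT_TOL
    unused = list(detections)
    unmatched = []
    for exp in expected_rows:
        hit = next((d for d in unused if matches(exp, d)), None)
        if hit is None:
            unmatched.append(exp)
        else:
            unused.remove(hit)  # each detection may satisfy only one expected row
    return count_ok, unmatched
```

A header-only file such as `image_empty_scene_expected.csv` parses to an empty list, so `check([], [])` reports a valid count and no unmatched rows.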
@@ -0,0 +1 @@
+time_sec,center_x,center_y,width,height,label,confidence_min
@@ -0,0 +1 @@
+time_sec,center_x,center_y,width,height,label,confidence_min
@@ -0,0 +1 @@
+time_sec,center_x,center_y,width,height,label,confidence_min
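Reviewer note: the video CSVs carry a leading `time_sec` column, compared with a tolerance of ± 1.0 s. One way to line up expected key frames with actual output is sketched below. The function names are hypothetical, and the sketch assumes each actual detection exposes a `time_sec` value, which the diff does not confirm.

```python
import csv
import io
from collections import defaultdict

TIME_TOL = 1.0  # Global Tolerances: video time_sec, +/- 1.0 s


def group_by_time(csv_text):
    """Group expected labels by their key-frame timestamp."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[float(row["time_sec"])].append(row["label"])
    return dict(groups)


def labels_near(detections, t):
    """Labels of actual detections whose timestamp lies within TIME_TOL of t."""
    return [d["label"] for d in detections if abs(d["time_sec"] - t) <= TIME_TOL]
```

For each expected timestamp, a test could then assert that every expected label appears in `labels_near(detections, t)` before applying the per-box coordinate and confidence checks.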
@@ -1,46 +1,29 @@
 # Autopilot State

 ## Current Step
-step: 2f
-name: Refactor
-status: not_started
-sub_step: —
+flow: existing-code
+step: 5
+name: Run Tests
+status: in_progress
+sub_step: 2 — Run Tests
 retry_count: 0

-## Step ↔ SubStep Reference
-| Step | Name                   | Sub-Skill                        | Internal SubSteps                        |
-|------|------------------------|----------------------------------|------------------------------------------|
-| 0    | Problem                | problem/SKILL.md                 | Phase 1–4                                |
-| 1    | Research               | research/SKILL.md                | Mode A: Phase 1–4 · Mode B: Step 0–8    |
-| 2    | Plan                   | plan/SKILL.md                    | Step 1–6                                 |
-| 2b   | Blackbox Test Spec     | blackbox-test-spec/SKILL.md      | Phase 1a–1b (existing code path only)    |
-| 2c   | Post-Test-Spec Decision| (autopilot decision gate)        | Refactor vs normal workflow              |
-| 2d   | Decompose Tests        | decompose/SKILL.md (tests-only)  | Step 1t + Step 3 + Step 4                |
-| 2e   | Implement Tests        | implement/SKILL.md               | (batch-driven, no fixed sub-steps)       |
-| 2f   | Refactor               | refactor/SKILL.md                | Phases 0–5 (6-phase method)              |
-| 2g   | New Task               | new-task/SKILL.md                | Steps 1–8 (loop)                         |
-| 2h   | Implement              | implement/SKILL.md               | (batch-driven, no fixed sub-steps)       |
-| 2i   | Run Tests              | (autopilot-managed)              | Unit + integration tests                 |
-| 2j   | Security Audit         | security/SKILL.md                | Phase 1–5 (optional)                     |
-| 2k   | Deploy                 | deploy/SKILL.md                  | Step 1–7                                 |
-
 ## Completed Steps

 | Step | Name | Completed | Key Outcome |
 |------|------|-----------|-------------|
-| — | Document (pre-step) | 2026-03-21 | 10 modules, 4 components, full _docs/ generated from existing codebase |
-| 2b | Blackbox Test Spec | 2026-03-21 | 39 test scenarios (16 positive, 8 negative, 11 non-functional), 85% total coverage, 5 artifacts produced |
-| 2c | Post-Test-Spec Decision | 2026-03-22 | User chose refactor path (A) |
-| 2d | Decompose Tests | 2026-03-23 | 11 tasks (AZ-138..AZ-148), 35 complexity points, 3 batches. Phase 3 test data gate PASSED: 39/39 scenarios validated, 12 data files provided. |
-| 2e | Implement Tests | 2026-03-23 | 11 tasks implemented across 4 batches, 38 tests (2 skipped), all code reviews PASS_WITH_WARNINGS. Commits: 5418bd7, a469579, 861d4f0, f0e3737. |
+| 1 | Document | 2026-03-21 | 10 modules, 4 components, full _docs/ generated from existing codebase |
+| 2 | Test Spec | 2026-03-21 | 39 test scenarios (16 positive, 8 negative, 11 non-functional), 85% total coverage, 5 artifacts produced |
+| 3 | Decompose Tests | 2026-03-23 | 11 tasks (AZ-138..AZ-148), 35 complexity points, 3 batches. Phase 3 test data gate PASSED: 39/39 scenarios validated, 12 data files provided. |
+| 4 | Implement Tests | 2026-03-23 | 11 tasks implemented across 4 batches, 38 tests (2 skipped), all code reviews PASS_WITH_WARNINGS. Commits: 5418bd7, a469579, 861d4f0, f0e3737. |

 ## Key Decisions
-- User chose B: Document existing codebase before proceeding
+- User chose to document existing codebase before proceeding
 - Component breakdown: 4 components (Domain, Inference Engines, Inference Pipeline, API)
 - Verification: 4 legacy issues found and documented (unused serialize/from_msgpack, orphaned queue declarations)
 - Input data coverage approved at ~90% (Phase 1a)
 - Test coverage approved at 85% (21/22 AC, 13/18 restrictions) with all gaps justified
-- User chose A: Refactor path (decompose tests → implement tests → refactor)
+- User chose refactor path (decompose tests → implement tests → refactor)
 - Integration Tests Epic: AZ-137
 - Test Infrastructure: AZ-138 (5 pts)
 - 10 integration test tasks decomposed: AZ-139 through AZ-148 (30 pts)
@@ -51,10 +34,10 @@ retry_count: 0
 - Jira MCP auth skipped — tickets not transitioned to In Testing

 ## Last Session
-date: 2026-03-23
-ended_at: Step 2e Implement Tests — COMPLETE. All 11 tasks, 38 tests, 4 batches.
+date: 2026-03-24
+ended_at: Step 4 Implement Tests — COMPLETE. All 11 tasks, 38 tests, 4 batches.
 reason: step completed, context limit approaching
-notes: All integration tests implemented and committed. Next step: 2f Refactor. The refactor skill runs a 6-phase method using the implemented tests as a safety net. Recommend fresh conversation for better context management.
+notes: All integration tests implemented and committed. Next step: 5 Run Tests — verify tests pass before proceeding to refactor. Recommend fresh conversation for better context management.

 ## Blockers
 - none