Greenfield Steps 1-6 baseline for the autopilot rewrite from legacy Qt/C++ to a Rust workspace. - Remove legacy Qt/C++ tree (ai_controller, drone_controller, misc/camera, python_scaffold, root Dockerfile, autopilot.pro, legacy main.py / requirements.txt). - Add _docs/00_problem (problem, restrictions, acceptance criteria, security approach, input data + fixtures). - Add _docs/01_solution/solution_draft01. - Add _docs/02_document (architecture, system-flows, data_model, glossary, decision-rationale, deployment, 13 component descriptions, tests/ specs, FINAL_report, module-layout). - Add _docs/02_tasks/todo with 47 task specs (AZ-640..AZ-686, one bootstrap + 46 component tasks) and _dependencies_table.md. - Add .cursor/rules/artifact-srp.mdc (single-responsibility rule for canonical _docs artifacts). - Track autodev state in _docs/_autodev_state.md (Step 6 completed, ready for Step 7 Implement). Jira: bootstrap AZ-626; component epics AZ-627..AZ-639; tasks AZ-640..AZ-686. Total complexity 173 points across 12 epics. Co-authored-by: Cursor <cursoragent@cursor.com>
817 lines
58 KiB
Markdown
# autopilot — Decision Rationale

This file is the load-bearing research evidence behind the design. It captures the per-dimension reasoning, the fact cards backing each decision, the component-fit matrix, the validation log, the source bibliography, the evolution from the early draft to the final solution, and the original seed problem narrative. It is **read-only** in the sense that decisions documented here have already shaped `architecture.md §7 Detailed Design` and `system-flows.md §F1–F7`; updates here should follow updates there, not lead them.

## Reasoning chain (per-dimension)

### Dimension 1: Tiered Perception Pipeline

**Fact confirmation.** Existing YOLO integration already emits normalized boxes through a FastAPI/Cython/TensorRT service (Fact #16). Ultralytics supports TensorRT FP16 export for YOLO26-style models (Fact #19). UAV small-object and camouflaged-object literature shows that small concealed targets need class-specific and attention/semantic support rather than assuming generic object-detection transfer (Fact #8, Fact #11).

**Reference comparison.** A single detector is simpler but cannot satisfy footpath tracing, endpoint reasoning, motion candidates, and VLM confirmation. A full VLM-first approach is too slow and memory-sensitive for the zoom-out / zoom-in fast paths (Fact #12, Fact #24).

**Conclusion.** Use a three-tier perception pipeline: Tier 1 fixed-class YOLO26 / YOLOE-26 TensorRT FP16 primitives, Tier 2 primitive-graph plus lightweight ROI confirmation, and Tier 3 NanoLLM VLM only for bounded zoom-in endpoint / POI questions.

**Confidence.** Medium-high. API fit is supported; runtime targets still require hardware benchmarks.

### Dimension 2: Movement Detection

**Fact confirmation.** Dynamic-camera motion detection needs ego-motion compensation because platform movement creates apparent motion in stable objects (Fact #6, Fact #7). OpenCV provides sparse optical flow, feature tracking, and global-motion estimation APIs (Fact #22). The user confirmed timestamped video, gimbal, zoom, and UAV telemetry are available for MVP.

**Reference comparison.** Naive frame differencing is simpler but directly conflicts with the stable-scene rejection requirement. Pure learned tracking without telemetry may work later, but it adds data requirements and hides failure modes.

**Conclusion.** Select telemetry-aided OpenCV ego-motion compensation as the MVP movement-detector baseline, with residual cluster extraction. Run movement detection at **both** zoom-out and zoom-in (per-zoom-band thresholds), benchmark-gating classical CV adequacy at zoom-in before MVP acceptance. The ≤5 POIs/minute cap is enforced by `scan_controller`'s POI scheduler, not by the detector itself, so the same detector can serve both zoom levels.

**Confidence.** High for mechanism fit; medium for runtime and false-positive performance until replay-tested.

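
The ≤5 POIs/minute cap that this dimension assigns to the POI scheduler could be enforced with a sliding-window counter. The following is a minimal sketch, not the project's scheduler: the class name `PoiRateCap`, the method `try_admit`, and the choice of a sliding 60 s window (rather than fixed calendar minutes) are assumptions made for illustration.

```python
from collections import deque


class PoiRateCap:
    """Sliding-window cap on admitted POIs (hypothetical helper, not project code)."""

    def __init__(self, max_pois: int = 5, window_s: float = 60.0) -> None:
        self.max_pois = max_pois
        self.window_s = window_s
        self._admitted = deque()  # timestamps of admitted POIs, oldest first

    def try_admit(self, now_s: float) -> bool:
        # Forget admissions that have aged out of the window.
        while self._admitted and now_s - self._admitted[0] >= self.window_s:
            self._admitted.popleft()
        if len(self._admitted) >= self.max_pois:
            return False  # cap reached; the caller decides whether to queue or drop
        self._admitted.append(now_s)
        return True
```

Because the cap lives outside the detector, the detector can keep emitting candidates at either zoom level and only the scheduler decides which ones become POIs.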
### Dimension 3: Scan and Gimbal Control

**Fact confirmation.** ViewPro A40 official specs support fast tracking output metadata and a 40× optical camera, but do not prove the project's full zoom traversal time (Fact #4, Fact #5). Behaviour trees help large UAV autonomy systems, but this project has a small deterministic scan lifecycle (Fact #26).

**Reference comparison.** Behaviour trees are more extensible, but a deterministic state machine gives simpler timing, queue, and timeout tests for `ZoomedOut`, `ZoomedIn`, and `TargetFollow` states.

**Conclusion.** Use a typed `scan_controller` state machine with explicit states, queue ageing, timeouts, and target-loss handling. Treat ViewPro zoom timing as a hardware-in-loop acceptance test.

**Confidence.** High for architecture fit; medium for physical zoom timing until measured.

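
A typed state machine of this kind can be sketched as an explicit transition table, which is what makes timing and timeout behaviour easy to test. Only the three state names come from this document; the event names and the specific transitions below are assumptions for illustration, and the canonical decomposition remains the one in `system-flows.md §F4`.

```python
from enum import Enum, auto


class ScanState(Enum):
    ZOOMED_OUT = auto()
    ZOOMED_IN = auto()
    TARGET_FOLLOW = auto()


class Event(Enum):  # hypothetical event set
    POI_DEQUEUED = auto()
    TARGET_CONFIRMED = auto()
    TARGET_LOST = auto()
    TIMEOUT = auto()


# Every legal (state, event) pair is listed; anything else is rejected.
TRANSITIONS = {
    (ScanState.ZOOMED_OUT, Event.POI_DEQUEUED): ScanState.ZOOMED_IN,
    (ScanState.ZOOMED_IN, Event.TARGET_CONFIRMED): ScanState.TARGET_FOLLOW,
    (ScanState.ZOOMED_IN, Event.TIMEOUT): ScanState.ZOOMED_OUT,
    (ScanState.TARGET_FOLLOW, Event.TARGET_LOST): ScanState.ZOOMED_OUT,
    (ScanState.TARGET_FOLLOW, Event.TIMEOUT): ScanState.ZOOMED_OUT,
}


def step(state: ScanState, event: Event) -> ScanState:
    """Return the next state, or raise if the transition is not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event.name}") from None
```

Listing the transitions as data keeps the set of reachable states small and lets a test enumerate every pair, which is harder to do with a behaviour tree.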
### Dimension 4: VLM Confirmation

**Fact confirmation.** NanoLLM documents local multimodal VILA1.5-3B image+text prompting with MLC and quantisation options (Fact #23). Orin Nano 8 GB VLM deployment is memory-sensitive and needs strict context / token limits (Fact #24). The user confirmed VLM is required for MVP only if the exact model / runtime passes ≤5 s/ROI and memory gates.

**Reference comparison.** Using VLM for every ROI would overload latency and memory. Skipping VLM entirely would miss the requirement. A separate local VLM IPC process preserves no-cloud and isolation constraints while allowing a scheduler to avoid concurrent GPU use.

**Conclusion.** Select NanoLLM + VILA1.5-3B MLC quantised as the lead VLM, run only on bounded zoom-in crops, and enforce hard benchmark gates before MVP acceptance.

**Confidence.** Medium. API capability is proven; runtime-quality fit is not proven without target hardware.

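
The benchmark gate for the VLM can be expressed as a small pass/fail check over measured values. This is a sketch under assumptions: the function name `vlm_gate_passes` is hypothetical, the 5 s and 6 GB limits are the figures quoted in this document, and judging latency by the worst measured ROI (rather than a percentile) is an assumption, since the document does not say which statistic the gate uses.

```python
def vlm_gate_passes(latencies_s, peak_mem_gb, max_latency_s=5.0, max_mem_gb=6.0):
    """Return True only if every measured ROI latency and the peak memory fit the budget."""
    if not latencies_s:
        return False  # no measurements means no evidence: fail closed
    worst_latency = max(latencies_s)  # assumption: gate on the worst case
    return worst_latency <= max_latency_s and peak_mem_gb <= max_mem_gb
```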
### Dimension 5: Data and Acceptance Risk

**Fact confirmation.** All-season MVP was confirmed by the user. UAV small-object and camouflaged-object detection is sensitive to background, scale, and season (Fact #8, Fact #11). Annotation effort is plausible only with assistance and careful prioritisation (Fact #14, Fact #15).

**Reference comparison.** Winter-first MVP would lower risk but conflicts with the confirmed requirement. All-season MVP demands stronger dataset gates and should not rely on aggregate metrics.

**Conclusion.** Keep all-season MVP, but make per-class, per-season, per-terrain validation mandatory. Use annotation assistance and hard-negative mining from false positives to control schedule risk.

**Confidence.** Medium. The requirement is clear; dataset availability is the main risk.

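
To illustrate why aggregate metrics are not enough, the sketch below reports every (class, season, terrain) slice that misses a precision or recall threshold. The function name `failing_slices`, the record layout, and the example labels are hypothetical; the 0.8 default follows the 80 % precision / recall figure mentioned in Fact #8.

```python
from collections import defaultdict


def failing_slices(records, min_precision=0.8, min_recall=0.8):
    """records: iterable of (cls, season, terrain, tp, fp, fn) counts.

    Returns the sorted slice keys whose precision or recall is below threshold.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for cls, season, terrain, tp, fp, fn in records:
        slot = totals[(cls, season, terrain)]
        slot[0] += tp
        slot[1] += fp
        slot[2] += fn
    failures = []
    for key, (tp, fp, fn) in sorted(totals.items()):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision < min_precision or recall < min_recall:
            failures.append(key)
    return failures
```

With a large well-performing summer slice and a small weak winter slice, the pooled recall can stay above 0.9 while the winter slice sits at 0.5, so only the per-slice report exposes the gap.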
## Fact cards

These are the load-bearing facts referenced from the reasoning chain and the fit matrix. Each card lists the source, confidence, related dimension, and fit impact. Source numbers refer to the bibliography in §References below.

### Fact #1

- **Statement**: Jetson Orin Nano Super is officially specified at 67 INT8 TOPS with 8 GB 128-bit LPDDR5 memory and 102 GB/s memory bandwidth.
- **Source**: Source #1
- **Confidence**: High
- **Related dimension**: Hardware feasibility
- **Fit impact**: Supports the hardware restriction, but does not prove FP16 multi-model latency.

### Fact #2

- **Statement**: NVIDIA's Super Mode performance gain depends on the JetPack / software configuration and power mode, so benchmark results must record the installed JetPack / L4T and power mode.
- **Source**: Source #2
- **Confidence**: High
- **Related dimension**: Runtime reproducibility
- **Fit impact**: Adds a missing restriction: lock and report JetPack / power mode for all latency tests.

### Fact #3

- **Statement**: Ultralytics provides Jetson / TensorRT deployment guidance, but the consulted documentation / search results do not prove a two-model YOLO26 + YOLOE-26 pipeline at 1280 px will stay below 100 ms/frame including preprocessing, tiling, and postprocessing.
- **Source**: Source #3
- **Confidence**: Medium
- **Related dimension**: Tier 1 latency
- **Fit impact**: Makes the ≤100 ms/frame criterion plausible but unproven until benchmarked with the exact exported engines.

### Fact #4

- **Statement**: ViewPro A40 Pro official specifications list 1080p output, 40× optical zoom with 4.25–170 mm focal range, 30 Hz tracking deviation update rate, less than 30 ms deviation output delay, and 5×5 pixel minimum AI target size for the built-in AI feature.
- **Source**: Source #4
- **Confidence**: High
- **Related dimension**: Camera / gimbal feasibility
- **Fit impact**: Supports control-loop feasibility but does not prove full wide-to-high optical zoom traversal in ≤2 s.

### Fact #5

- **Statement**: The official ViewPro A40 Pro page does not provide a direct full-range optical zoom traversal time; the project-specific 1–2 s zoom traversal claim must be measured on the target camera / interface.
- **Source**: Source #4
- **Confidence**: High
- **Related dimension**: zoom-out → zoom-in transition
- **Fit impact**: Adds a validation prerequisite for the ≤2 s transition criterion.

### Fact #6

- **Statement**: Recent dynamic-camera moving-object detection work uses optical flow plus additional mechanisms such as tracking-any-point, adaptive bounding-box filtering, segmentation priors, or focus-of-expansion reasoning, because camera motion alone produces apparent motion.
- **Source**: Source #5, Source #6
- **Confidence**: High
- **Related dimension**: Movement detection
- **Fit impact**: Supports the requirement to compensate UAV / gimbal motion and disqualifies naive frame differencing.

### Fact #7

- **Statement**: Moving-object detection from UAV footage is difficult because objects are small, camera motion is complex, and structured backgrounds can make optical-flow-only approaches unreliable.
- **Source**: Source #5, Source #6
- **Confidence**: Medium
- **Related dimension**: Movement detection reliability
- **Fit impact**: Adds a missing false-positive / false-negative acceptance criterion for zoom-out motion candidates (and, after the zoom-in benchmark gate, an analogous per-zoom-band criterion for zoom-in).

### Fact #8

- **Statement**: UAV small-object detection literature repeatedly identifies small pixel footprint, complex backgrounds, low contrast, and scale variation as major causes of missed detections and false alarms.
- **Source**: Source #7, Source #8
- **Confidence**: High
- **Related dimension**: YOLO and semantic detection quality
- **Fit impact**: Makes 80 % precision / recall for new primitive classes realistic only with class-specific validation, tiling, and seasonal coverage.

### Fact #9

- **Statement**: Recent UAV YOLO variants improve small-target results through attention, receptive-field, or feature-fusion changes, implying generic YOLO baseline performance should not be assumed to transfer unchanged to small concealed primitives.
- **Source**: Source #7, Source #8
- **Confidence**: High
- **Related dimension**: Model selection
- **Fit impact**: Supports keeping "existing class performance must not degrade" and adding per-class / season reporting.

### Fact #10

- **Statement**: Trail / path detection can be treated as a structured perception problem using neural detection plus path-continuity reasoning, not just independent bounding boxes.
- **Source**: Source #9
- **Confidence**: Medium
- **Related dimension**: Footpath detection
- **Fit impact**: Supports requiring path tracing, freshness scoring, endpoint reasoning, and branch-following behaviour.

### Fact #11

- **Statement**: Camouflaged-object detection papers use specialised attention, illumination, frequency / spatial, or super-resolution methods because camouflaged targets are intentionally similar to the background.
- **Source**: Source #14
- **Confidence**: Medium
- **Related dimension**: Concealed-position detection
- **Fit impact**: Supports the project's claim that visual similarity to known object classes is insufficient.

### Fact #12

- **Statement**: Small local VLMs can run on Jetson-class devices, but model choice, quantisation, context size, crop size, and runtime container determine whether memory and ≤5 s/ROI are realistic.
- **Source**: Source #1, Source #12, Source #13
- **Confidence**: Medium
- **Related dimension**: VLM feasibility
- **Fit impact**: Makes local VLM feasible only as a tightly bounded optional Tier 3 module with an exact-model benchmark.

### Fact #13

- **Statement**: The project has about 6 GB remaining RAM only because existing YOLO is assumed to use about 2 GB; unified-memory contention means VLM and YOLO scheduling must be sequential and benchmarked together, not in isolation.
- **Source**: Source #1, project restrictions
- **Confidence**: Medium
- **Related dimension**: Resource budget
- **Fit impact**: Supports the restriction against concurrent YOLO / VLM GPU inference and adds a whole-pipeline memory test.

### Fact #14

- **Statement**: Interactive or model-assisted segmentation can reduce mask annotation time compared with manual polygon annotation, but this benefit depends on tooling and object-boundary clarity.
- **Source**: Source #10
- **Confidence**: High
- **Related dimension**: Annotation effort
- **Fit impact**: Makes hundreds-to-thousands of labels plausible in 225 hours only if annotation assistance and prioritisation are used.

### Fact #15

- **Statement**: Label propagation can reduce annotation effort for related frames / sequences, which is relevant to movement-detection video data.
- **Source**: Source #11
- **Confidence**: Medium
- **Related dimension**: Movement dataset creation
- **Fit impact**: Supports using video / sequential annotation tools for movement candidates rather than frame-by-frame manual labelling only.

### Fact #16

- **Statement**: The existing FastAPI service has endpoints that emit normalized boxes and uses a global inference object around Cython / TensorRT inference.
- **Source**: `../detections/main.py` (existing detections service)
- **Confidence**: High
- **Related dimension**: Integration boundary
- **Fit impact**: Supports keeping normalized-box output but favours isolating VLM and scan control outside the Cython inference path.

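
Keeping the normalized-box output while running on tiles means each tile-local detection has to be mapped back to full-frame coordinates. The sketch below shows one way to do that. It assumes a corner-based `(x1, y1, x2, y2)` pixel box and a known tile origin; the existing service's actual DTO layout is not shown in this document, so the function name and box format are hypothetical.

```python
def tile_box_to_normalized(box_px, tile_origin_px, frame_size_px):
    """Map a tile-local pixel box (x1, y1, x2, y2) to full-frame coordinates in [0, 1]."""
    x1, y1, x2, y2 = box_px
    origin_x, origin_y = tile_origin_px  # top-left corner of the tile in the full frame
    width, height = frame_size_px
    return (
        (x1 + origin_x) / width,
        (y1 + origin_y) / height,
        (x2 + origin_x) / width,
        (y2 + origin_y) / height,
    )
```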
### Fact #17

- **Statement**: The input images show long thin paths, dark narrow entrances, branch / forest-edge concealment, and partial occlusion, so bounding boxes alone may be weak for footpaths and path-follow behaviour.
- **Source**: original problem-side data parameters (deleted on doc consolidation 2026-05-17; reference PNGs `semantic01..04.png` lived alongside)
- **Confidence**: Medium
- **Related dimension**: Annotation format
- **Fit impact**: Supports allowing segmentation masks or polylines for footpaths instead of boxes only.

### Fact #18

- **Statement**: The project provides no explicit acceptance criteria for false positives per route / time, operator-review workload, queue starvation, telemetry availability, power / thermal throttling, or evidence logging.
- **Source**: original problem-side acceptance criteria + restrictions (deleted on doc consolidation 2026-05-17; consolidated into `architecture.md §7.3 Restrictions` and `§7.4 Acceptance Criteria`)
- **Confidence**: High
- **Related dimension**: Missing criteria
- **Fit impact**: Requires adding or confirming these criteria before final architecture planning. (Most are now folded into `architecture.md §7.4 Acceptance Criteria > Frozen choices (2026-05-06)`.)

### Fact #19

- **Statement**: Ultralytics YOLO supports TensorRT engine export with FP16 through `half=True`; TensorRT export is GPU-only and supports arguments including `dynamic`, `half`, `int8`, and workspace configuration.
- **Source**: Source #15
- **Confidence**: High
- **Related dimension**: Tier 1 primitive detector
- **Fit impact**: Supports selecting custom-trained YOLO26 TensorRT FP16 as the primary primitive detector.

### Fact #20

- **Statement**: Jetson TensorRT export can run into workspace and dynamic-shape memory issues, so fixed input shapes, batch 1, and on-device export / benchmarking are safer for this project than dynamic batch export.
- **Source**: Source #18
- **Confidence**: Medium
- **Related dimension**: Tier 1 latency
- **Fit impact**: Adds a hard implementation constraint for the FP16 TensorRT engines.

### Fact #21

- **Statement**: YOLOE supports open-vocabulary detection / segmentation, but TensorRT runtime should not depend on Python open-vocabulary prompt-mutation APIs; MVP runtime should use fixed trained classes or pre-baked class embeddings only.
- **Source**: Source #19
- **Confidence**: Medium
- **Related dimension**: YOLOE exact-fit
- **Fit impact**: Selects YOLOE-26 only in fixed-class FP16 TensorRT mode, not runtime open-vocabulary mode.

### Fact #22

- **Statement**: OpenCV 4.x provides Lucas-Kanade sparse optical flow (`calcOpticalFlowPyrLK`), feature detection (`goodFeaturesToTrack`), and global-motion estimation APIs that can estimate frame-to-frame background motion before residual moving-object detection.
- **Source**: Source #16
- **Confidence**: High
- **Related dimension**: Movement detector
- **Fit impact**: Supports selecting telemetry-aided OpenCV ego-motion compensation as the movement baseline.

### Fact #23

- **Statement**: NanoLLM supports model loading through MLC / AWQ / HF APIs with quantisation settings such as `q4f16_ft`, and multimodal chat examples using VILA1.5-3B with image prompts.
- **Source**: Source #17
- **Confidence**: High
- **Related dimension**: VLM confirmation
- **Fit impact**: Supports selecting NanoLLM + VILA1.5-3B MLC as the lead local VLM candidate, subject to runtime-quality benchmark.

### Fact #24

- **Statement**: VILA1.5-3B on Orin Nano 8 GB is plausible but memory-sensitive; context length, max tokens, crop count, and container / storage footprint must be capped.
- **Source**: Source #21
- **Confidence**: Medium
- **Related dimension**: VLM feasibility
- **Fit impact**: Requires the VLM process to use bounded crops, short prompts, short answers, and a watchdog.

### Fact #25

- **Statement**: NanoSAM / MobileSAM-style segmentation is useful for ROI mask refinement and annotation assistance, but not as the zoom-out wide-area sweep lead because it still adds an image-encoder cost and prompt dependency.
- **Source**: Source #20
- **Confidence**: Medium
- **Related dimension**: Segmentation fallback
- **Fit impact**: Marks segmentation foundation models as fallback / annotation-assist, not primary runtime.

### Fact #26

- **Statement**: Behaviour trees improve modularity for large UAV autonomy systems, but this project's scan lifecycle has a small fixed set of states and strict timing, making a typed deterministic state machine simpler for MVP.
- **Source**: Source #22
- **Confidence**: Medium
- **Related dimension**: Scan control
- **Fit impact**: Selects a deterministic scan state machine with explicit queues / timeouts; behaviour tree remains a later extensibility option (the BT primer in `system-flows.md §F4` is the canonical decomposition the state machine must satisfy).

### Fact #27

- **Statement**: Multiple GPU inference contexts / processes can complicate TensorRT scheduling and memory behaviour on Jetson; the project should centralise GPU scheduling and preserve the restriction that YOLO and VLM do not run concurrently.
- **Source**: Source #23, project restrictions
- **Confidence**: Medium
- **Related dimension**: Integration boundary
- **Fit impact**: Selects a local IPC VLM process controlled by an integration scheduler, not unmanaged concurrent inference.

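
The "no concurrent YOLO / VLM inference" rule amounts to a single grant that only one job can hold at a time. The sketch below models that with an in-process lock. It is an illustration only: the class `GpuScheduler` is hypothetical, and because the VLM runs as a separate process, a real scheduler would have to carry the grant over IPC rather than rely on a `threading.Lock`.

```python
import threading
from contextlib import contextmanager


class GpuScheduler:
    """Grants the GPU to one inference job at a time (hypothetical sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.current = None  # name of the job holding the GPU, if any

    @contextmanager
    def run(self, job: str, timeout_s: float = 1.0):
        # Refuse to start instead of running concurrently with another job.
        if not self._lock.acquire(timeout=timeout_s):
            raise TimeoutError(f"GPU busy with {self.current!r}; {job!r} not started")
        self.current = job
        try:
            yield
        finally:
            self.current = None
            self._lock.release()
```

A caller would wrap each inference call in `with scheduler.run("yolo"):` or `with scheduler.run("vlm"):`, so a VLM request issued while Tier 1 inference is running either waits or times out.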
### Fact #28

- **Statement**: The first draft under-specified the proof gates that must happen before implementation: Tier 1 latency, VLM memory / latency, ViewPro zoom timing, movement false-positive replay, and all-season dataset readiness.
- **Source**: `solution_draft01.md` (superseded), `validation_log` (this file §Validation)
- **Confidence**: High
- **Related dimension**: Planning readiness
- **Fit impact**: Adds a required benchmark-gate stage before decomposition / implementation.

### Fact #29

- **Statement**: Secure FastAPI file / image handling should not trust client content-type headers alone and should enforce size limits, validation, authorisation, cleanup, and audit logging.
- **Source**: Source #24
- **Confidence**: Medium
- **Related dimension**: API security
- **Fit impact**: Adds explicit upload / payload validation requirements for image, ROI, and VLM IPC inputs.

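
One way to avoid trusting the client's content-type header is to check the payload size and its leading magic bytes against an allow-list. The sketch below is a minimal illustration: the function name `sniff_image`, the 8 MiB limit, and the choice of JPEG and PNG as the allowed formats are assumptions, not values from the project documents. Passing this check does not make a payload safe to decode; it only rejects obviously wrong inputs early.

```python
MAX_IMAGE_BYTES = 8 * 1024 * 1024  # assumed limit, not taken from the project docs

# Allow-listed formats, identified by file signature rather than Content-Type.
MAGIC = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
}


def sniff_image(payload: bytes, max_bytes: int = MAX_IMAGE_BYTES) -> str:
    """Return the detected format name, or raise ValueError for rejected payloads."""
    if len(payload) > max_bytes:
        raise ValueError("payload exceeds size limit")
    for magic, name in MAGIC.items():
        if payload.startswith(magic):
            return name
    raise ValueError("unsupported or unrecognised image format")
```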
### Fact #30

- **Statement**: Local IPC can use Unix-socket filesystem permissions and peer-credential checks such as `SO_PEERCRED` to restrict which local processes may call the VLM service.
- **Source**: Source #25
- **Confidence**: Medium
- **Related dimension**: IPC security
- **Fit impact**: Replaces vague "local IPC authorisation" with a concrete Unix-socket permission and peer-credential control.

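
On Linux, the peer-credential check can be done by reading `SO_PEERCRED` from an accepted Unix-socket connection. The sketch below shows the mechanism only; the helper names `peer_credentials` and `is_allowed` and the UID allow-list policy are assumptions, and the option is Linux-specific, so other platforms would need a different call.

```python
import socket
import struct


def peer_credentials(conn: socket.socket):
    """Return (pid, uid, gid) of the process at the other end of a Unix socket (Linux only)."""
    fmt = "3i"  # struct ucred holds pid, uid, gid as three 32-bit integers
    raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(fmt))
    return struct.unpack(fmt, raw)


def is_allowed(conn: socket.socket, allowed_uids) -> bool:
    """Accept the connection only if the peer's UID is on the allow-list."""
    _pid, uid, _gid = peer_credentials(conn)
    return uid in allowed_uids
```

The credential check complements, rather than replaces, restrictive permissions on the socket file itself.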
### Fact #31

- **Statement**: Production LLM / VLM integrations should validate or constrain outputs against a schema before downstream use, because free-form text is not a stable API contract.
- **Source**: Source #26
- **Confidence**: Medium
- **Related dimension**: VLM output reliability
- **Fit impact**: Adds a structured `VlmAssessment` schema and retry / fail-closed behaviour.

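
A schema check with retry and fail-closed behaviour might look like the sketch below. Only the name `VlmAssessment` comes from this document; its fields (`label`, `confidence`, `reason`), the allowed label set, the JSON encoding, and the `ask` callable are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

ALLOWED_LABELS = {"confirmed", "rejected", "uncertain"}  # hypothetical label set


@dataclass(frozen=True)
class VlmAssessment:
    label: str
    confidence: float
    reason: str


def parse_assessment(raw: str) -> Optional[VlmAssessment]:
    """Return a validated assessment, or None if the text does not match the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label, confidence, reason = data.get("label"), data.get("confidence"), data.get("reason")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    if not isinstance(reason, str):
        return None
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return VlmAssessment(label, float(confidence), reason)


def assess_with_retry(ask: Callable[[], str], attempts: int = 2) -> VlmAssessment:
    """Query the VLM up to `attempts` times; fail closed if no reply validates."""
    for _ in range(attempts):
        parsed = parse_assessment(ask())
        if parsed is not None:
            return parsed
    return VlmAssessment("uncertain", 0.0, "unparseable VLM output")
```

Failing closed here means an unparseable answer never turns into a confirmation; downstream code only ever sees a value that satisfies the schema.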
### Fact #32

- **Statement**: Sensor-fusion systems use correlated timestamps to align camera frames and telemetry; movement detection should define maximum tolerated skew between video frames, gimbal state, and UAV motion data.
- **Source**: Source #27
- **Confidence**: Medium
- **Related dimension**: Telemetry synchronisation
- **Fit impact**: Adds a telemetry-synchronisation contract before movement detection can claim compensation correctness.

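
A maximum-skew contract can be checked per frame by finding the nearest telemetry sample and rejecting the pairing if it is too far away in time. The sketch below assumes sorted timestamps in seconds on a shared clock; the function name `nearest_sample` and the skew values used with it are hypothetical, since the document does not state the tolerated skew.

```python
from bisect import bisect_left


def nearest_sample(timestamps_s, frame_ts_s, max_skew_s):
    """Index of the telemetry sample closest to the frame, or None if skew exceeds the limit.

    timestamps_s must be sorted in ascending order.
    """
    if not timestamps_s:
        return None
    i = bisect_left(timestamps_s, frame_ts_s)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps_s)]
    best = min(candidates, key=lambda j: abs(timestamps_s[j] - frame_ts_s))
    if abs(timestamps_s[best] - frame_ts_s) > max_skew_s:
        return None
    return best
```

A frame for which this returns `None` has no trustworthy gimbal or UAV state, so compensation for that frame cannot be claimed to be correct.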
### Fact #33

- **Statement**: TensorRT performance must be measured under the actual model configuration and scheduler; documentation-level export support does not prove end-to-end latency with multiple engines and preprocessing.
- **Source**: Source #28
- **Confidence**: High
- **Related dimension**: GPU scheduling
- **Fit impact**: Strengthens the central GPU scheduler and benchmark gate.

### Fact #34

- **Statement**: OpenCV image decoders have had critical crafted-image vulnerabilities in recent 4.x versions, including CVE-2025-53644 affecting 4.10.0 / 4.11.0 and patched in 4.12.0.
- **Source**: Source #29
- **Confidence**: High
- **Related dimension**: Image-processing security
- **Fit impact**: Requires patched OpenCV version and image-format allow-list for untrusted inputs.

### Fact #35

- **Statement**: The existing `main.py` swallows refresh / posting / detection exceptions in several paths and returns healthy status even when inference initialisation fails, which would hide critical runtime failures in the expanded system.
- **Source**: `../detections/main.py` (existing detections service)
- **Confidence**: High
- **Related dimension**: Observability and reliability
- **Fit impact**: Adds a reliability task to replace silent exception handling in touched service paths.

### MVE: Ultralytics YOLO26 / YOLOE-26 in fixed-class TensorRT FP16 mode

- **Source**: Source #15, Source #19
- **Pinned mode**: Custom-trained YOLO26 detector and YOLOE-26 segmentation / detection engines exported as TensorRT FP16 with fixed project classes, batch 1, fixed 1280 px input, no runtime open-vocabulary prompt mutation.
- **Inputs in the example**: Image input passed to `YOLO("yolo26n.pt")`, exported with `model.export(format="engine", half=True)`, then loaded as `.engine` for prediction.
- **Outputs in the example**: Detection / segmentation results from a TensorRT engine.
- **Project inputs**: 1080p UAV frames or tiles resized / split for 1280 px model input.
- **Project outputs required**: Normalized boxes for primitives and operator display; optional masks / polylines for path / branch reasoning.
- **Match assessment**: Exact API / deployment match for fixed-class TensorRT FP16 engines; runtime open-vocabulary YOLOE behaviour is rejected.

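
The pinned mode can be written down as the keyword arguments that would be passed to the Ultralytics `model.export(...)` call, together with a check that a given export configuration has not drifted from it. This sketch does not import Ultralytics or run an export; `PINNED` and `pinned_violations` are hypothetical names, and the argument keys (`format`, `half`, `imgsz`, `batch`, `dynamic`) are the export arguments as understood from the Ultralytics documentation, to be confirmed against the installed version.

```python
# Export arguments for the pinned Tier 1 mode (assumed mapping onto Ultralytics kwargs).
PINNED = {
    "format": "engine",  # TensorRT engine
    "half": True,        # FP16 precision
    "imgsz": 1280,       # fixed 1280 px input
    "batch": 1,          # batch 1 only
    "dynamic": False,    # no dynamic shapes
}


def pinned_violations(export_kwargs: dict) -> list:
    """Return the sorted names of arguments that differ from the pinned mode."""
    return sorted(key for key, value in PINNED.items() if export_kwargs.get(key) != value)


# Intended use on the target device (not executed here):
#   from ultralytics import YOLO
#   YOLO("yolo26n.pt").export(**PINNED)
```

Recording the arguments as data makes it straightforward to log them next to each benchmark result, alongside the JetPack version and power mode.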
#### Restrictions and AC binding — YOLO26 / YOLOE-26 fixed-class FP16

| Restriction / AC | Candidate-mode behaviour | Result | Evidence |
|---|---|---|---|
| FP16 precision | TensorRT export supports `half=True`. | Pass | Fact #19 |
| TensorRT primary / ONNX fallback | TensorRT engine export is documented; ONNX remains project fallback. | Pass | Fact #19 |
| 1280 px input | Export supports `imgsz`; exact latency requires benchmark. | Pass for API; runtime gate | Fact #19, Fact #20 |
| ≤100 ms/frame Tier 1 | API can run TensorRT FP16; runtime quality must be measured end-to-end. | Pass with runtime-quality gate | Fact #20 |
| Normalized boxes output | YOLO result conversion can preserve existing normalized-box DTO contract. | Pass | Fact #16 |
| No degradation of existing classes | Requires validation, not an API capability. | Pass with runtime-quality gate | Fact #9 |
| All seasons MVP | Requires dataset / training coverage, not an API capability. | Pass with data-quality gate | Fact #8 |

### MVE: OpenCV telemetry-aided ego-motion compensation

- **Source**: Source #16
- **Pinned mode**: OpenCV 4.x sparse optical flow + feature tracking + global-motion estimation, fused with timestamped gimbal angle / zoom and UAV motion telemetry before residual moving-candidate extraction.
- **Inputs in the example**: Consecutive video frames converted to grayscale; features from `goodFeaturesToTrack`; tracked points from `calcOpticalFlowPyrLK`.
- **Outputs in the example**: Matched point trajectories and estimated motion between frames.
- **Project inputs**: 1080p zoom-out frame sequences plus timestamped gimbal / UAV telemetry; zoom-in frame sequences for the per-zoom-band benchmark.
- **Project outputs required**: Small residual moving point / cluster candidate boxes queued within 1 s.
- **Match assessment**: Exact match for ego-motion compensation primitives; project-specific candidate thresholds require benchmark.

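
The core idea of residual moving-candidate extraction is to subtract the background (ego) motion from each tracked point's displacement and keep only the points that still move. The sketch below shows that step in plain Python on already-tracked points. It is a deliberate simplification: the global motion is approximated by the median translation, whereas the pinned mode would use OpenCV's global-motion estimate fused with telemetry. The function name `residual_movers` and the 2 px threshold are hypothetical.

```python
from statistics import median


def residual_movers(flows, threshold_px=2.0):
    """flows: list of ((x, y), (dx, dy)) tracked-point displacements between two frames.

    Returns the points whose displacement differs from the median (background)
    displacement by more than threshold_px.
    """
    if not flows:
        return []
    # Most tracked points lie on the static background, so the median
    # displacement is a robust stand-in for camera-induced motion.
    global_dx = median(d[0] for _, d in flows)
    global_dy = median(d[1] for _, d in flows)
    movers = []
    for point, (dx, dy) in flows:
        residual = ((dx - global_dx) ** 2 + (dy - global_dy) ** 2) ** 0.5
        if residual > threshold_px:
            movers.append(point)
    return movers
```

In a scene where every point shifts by the same amount because the platform moved, no point is returned, which is the stable-scene rejection that naive frame differencing cannot provide.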
#### Restrictions and AC binding — OpenCV movement detector

| Restriction / AC | Candidate-mode behaviour | Result | Evidence |
|---|---|---|---|
| Movement detection at zoom-out + zoom-in | OpenCV frame-to-frame processing applies at both zoom levels with per-zoom-band thresholds. Classical-CV adequacy at zoom-in is benchmark-gated; if the FP cap fails, fall back per Q14. | Pass with runtime-quality gate | Fact #22 |
| Compensate UAV / gimbal motion | Optical flow / global motion plus telemetry directly supports compensation. | Pass | Fact #6, Fact #22 |
| Enqueue within 1 s | CPU / GPU cost depends on implementation; API supports required operations. | Pass with runtime-quality gate | Fact #22 |
| Stable objects must not be reported as moving due to platform motion | Compensation design directly targets this failure mode. | Pass | Fact #6 |
| Timestamped telemetry available | User confirmed full telemetry is available for MVP. | Pass | User decision |

### MVE: NanoLLM VILA1.5-3B local VLM ROI confirmation

- **Source**: Source #17, Source #21
- **Pinned mode**: NanoLLM multimodal chat with MLC backend, `Efficient-Large-Model/VILA1.5-3b`, quantised mode such as `q4f16_ft`, one bounded ROI crop, short prompt, short answer.
- **Inputs in the example**: Image-path prompt plus text prompt, e.g. `--prompt '/data/images/lake.jpg' --prompt 'please describe the scene.'`.
- **Outputs in the example**: Natural-language generated answer from the VLM.
- **Project inputs**: zoom-in ROI crop around path endpoint, branch pile, dark entrance, dugout, person, or vehicle candidate.
- **Project outputs required**: Confirmation label / reason that can be converted to POI metadata and operator display-box status.
- **Match assessment**: Exact API capability match for image+text ROI reasoning; latency and memory are runtime-quality gates.

#### Restrictions and AC binding — NanoLLM VILA1.5-3B

| Restriction / AC | Candidate-mode behaviour | Result | Evidence |
|---|---|---|---|
| Local VLM, no cloud | NanoLLM runs local models. | Pass | Fact #23 |
| Separate IPC process | NanoLLM can run as a separate process / container invoked by local IPC. | Pass | Fact #23, Fact #27 |
| Sequential with YOLO | Scheduler can enforce no concurrent GPU execution. | Pass | Fact #27 |
| ≤5 s/ROI | API can process image prompts; exact latency must be benchmarked on Jetson. | Pass with runtime-quality gate | Fact #24 |
| ≤6 GB remaining RAM | Quantised mode is supported; exact memory must be benchmarked with YOLO container present. | Pass with runtime-quality gate | Fact #23, Fact #24 |
| MVP requires VLM if benchmark passes | User-confirmed policy. | Pass | User decision |

## Component fit matrix

### Top-level component fit matrix

| Component Area | Candidate | Pinned Mode/Config | Option Family | Intended Role | API Capability Evidence | Mismatches / Disqualifiers | Status | Decision Rationale |
|---|---|---|---|---|---|---|---|---|
| Tier 1 primitive detection | Ultralytics YOLO26 + YOLOE-26 | Fixed-class TensorRT FP16 engines, batch 1, fixed 1280 px input, no runtime open-vocabulary prompt mutation | Established/open-source + current SOTA | Fast zoom-out primitive boxes/masks for paths, roads, trees, branch piles, entrances | MVE above; docs: Source #15, #19 | Runtime open-vocabulary TensorRT APIs rejected; dynamic batch rejected | Selected with runtime-quality gate | Best fit with existing TensorRT/Cython service and FP16 restriction. |
| Tier 2 semantic analyzer | Primitive graph + lightweight custom CNN | ROI crop and Tier 1 primitives → POI score, path freshness, endpoint, concealment candidate | Simple baseline + custom model | Confirm and reason over primitives within ≤200 ms/ROI | Facts #10, #17, #25 | None at API level; data-quality gate remains | Selected | Keeps reasoning explainable and faster than VLM-first confirmation. |
| Movement detection | OpenCV 4.x optical flow / global motion + timestamped UAV / gimbal telemetry | Zoom-out and zoom-in frame pairs plus telemetry → residual moving-point / cluster boxes (per-zoom-band thresholds) | Established production baseline | Detect moving candidates while rejecting platform-induced motion at both zoom levels | MVE above; docs: Source #16 | Video-only mode is not selected for MVP. Zoom-in classical CV is benchmark-gated; learned fallback per Q14 if the FP cap fails. | Selected with runtime-quality gate | Directly matches user-confirmed telemetry and movement restrictions. |
| Tier 3 VLM confirmation | NanoLLM + VILA1.5-3B | MLC backend, quantised mode such as `q4f16_ft`, one bounded ROI crop, short prompt, short response | Open-source edge VLM | Local confirmation of endpoint / branch-pile / entrance / dugout ROI | MVE above; docs: Source #17, #21 | Must pass ≤5 s/ROI and memory gate; otherwise smaller-VLM fallback | Selected with runtime-quality gate | Satisfies local / no-cloud / VLM-required policy if benchmark passes. |
| Scan control | Typed deterministic state machine | `ZoomedOut`, `ZoomedIn`, `TargetFollow` states with POI queue, timeouts, target-loss, gimbal command adapters | Simple baseline | Camera sweep, zoom, POI servicing, target follow | Source #4, #22, Fact #26 | Behaviour tree deferred (canonical decomposition kept in `system-flows.md §F4`) | Selected | Small fixed lifecycle favours deterministic timing and testability. |
| Integration boundary | Existing FastAPI / Cython YOLO core + `scan_controller` scheduler + local IPC VLM process | Normalized-box contract + POI metadata; central GPU scheduler enforces sequential YOLO / VLM | Established production pattern | Integrate modules without compiling VLM into Cython | Fact #16, #27 | Unmanaged multiprocessing / concurrent GPU rejected | Selected | Preserves existing service and isolates memory-heavy VLM. |

### Sub-matrix: YOLO26 / YOLOE-26 fixed-class TensorRT FP16

| Restriction / AC | Candidate-mode behaviour | Result | Evidence |
|---|---|---|---|
| Hardware: Jetson Orin Nano Super 8 GB | TensorRT FP16 is an NVIDIA GPU deployment path; memory must be benchmarked. | Pass with runtime-quality gate | Fact #1, Fact #20 |
| FP16 precision | Uses `half=True` TensorRT export. | Pass | Fact #19 |
| 1280 px model input | Export supports image-size configuration; use fixed 1280 px / batch 1. | Pass | Fact #19 |
| Existing tile splitting | Candidate accepts image / tiles and returns detections per tile. | Pass | Fact #16, Fact #19 |
| YOLO and VLM sequential | Tier 1 runs before VLM; scheduler prevents concurrency. | Pass | Fact #27 |
| Output normalized boxes | Existing DTO contract can wrap candidate outputs. | Pass | Fact #16 |
| New primitive classes | Fixed custom classes support the required primitive set. | Pass | Fact #19, Fact #21 |
| P ≥80 %, R ≥80 % and no degradation | Model API supports training / validation; actual performance is data / runtime quality. | Pass with runtime-quality gate | Fact #8, Fact #9 |
| All-season MVP | Requires dataset coverage rather than API feature. | Pass with data-quality gate | Fact #8, user confirmation |

### Sub-matrix: Primitive graph + lightweight CNN

| Restriction / AC | Candidate-mode behaviour | Result | Evidence |
|---|---|---|---|
| Tier 2 ≤200 ms/ROI | Bounded ROI crop and lightweight CNN / rules keep workload limited. | Pass with runtime-quality gate | Fact #10, Fact #17 |
| Consumes YOLO primitives | Candidate uses primitive boxes / masks as primary input. | Pass | Fact #10 |
| Path freshness and endpoint tracing | Graph / path model represents path continuity and endpoint scoring. | Pass | Fact #10, Fact #17 |
| Branch choice at intersections | Queue / path scorer can select freshest / most promising branch by configured score. | Pass | Fact #10 |
| VLM sequentiality | Candidate can run before VLM and invoke VLM only after endpoint hold. | Pass | Fact #27 |

### Sub-matrix: OpenCV telemetry-aided movement detector

| Restriction / AC | Candidate-mode behaviour | Result | Evidence |
|---|---|---|---|
| Both zoom levels | Runs during both zoom-out and zoom-in scan states. | Pass with runtime-quality gate | Fact #22 |
| Wide / light / medium zoom | Candidate consumes only selected zoom-state frames. | Pass | User confirmation, Fact #22 |
| Timestamped video / gimbal / UAV telemetry | User confirmed full telemetry is available for MVP. | Pass | User decision |
| Compensate UAV / gimbal motion | Optical flow / global motion plus telemetry estimate ego-motion before residuals. | Pass | Fact #6, Fact #22 |
| Enqueue within 1 s | Candidate operations support streaming implementation; exact latency is runtime quality. | Pass with runtime-quality gate | Fact #22 |
| Stable objects not treated as moving | Ego-motion compensation directly addresses this failure mode. | Pass | Fact #6 |
| Output normalized movement boxes | Residual clusters can be converted to normalized candidate boxes. | Pass | Fact #16 |

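The residual step in the rows above can be illustrated with a minimal sketch. It assumes that per-point flow vectors have already been measured (for example with OpenCV optical flow) and that telemetry yields a predicted image-plane shift for the frame pair. Modelling ego-motion as a single translation is a deliberate simplification, and `FlowSample`, `residual_moving_points`, `normalized_box` and the threshold values are hypothetical names and numbers, not project code.

```python
from dataclasses import dataclass
from math import hypot

# Illustrative per-zoom-band residual thresholds in pixels (assumed values).
RESIDUAL_THRESHOLD_PX = {"zoom_out": 2.0, "zoom_in": 4.0}


@dataclass(frozen=True)
class FlowSample:
    """One tracked point: position and measured frame-to-frame displacement (px)."""
    x: float
    y: float
    dx: float
    dy: float


def residual_moving_points(samples, ego_dx, ego_dy, zoom_band):
    """Return samples whose motion is not explained by the predicted ego-motion.

    ego_dx / ego_dy stand in for the image-plane shift predicted from UAV and
    gimbal telemetry; a pure translation simplifies a full global-motion model.
    """
    threshold = RESIDUAL_THRESHOLD_PX[zoom_band]
    moving = []
    for s in samples:
        residual = hypot(s.dx - ego_dx, s.dy - ego_dy)
        if residual > threshold:
            moving.append(s)
    return moving


def normalized_box(points, frame_w, frame_h):
    """Bounding box of a residual cluster as normalized (x_min, y_min, x_max, y_max)."""
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return (min(xs) / frame_w, min(ys) / frame_h, max(xs) / frame_w, max(ys) / frame_h)
```

Background points that move with the platform produce near-zero residuals and are dropped, which is the "stable objects not treated as moving" behaviour; the surviving points are boxed in the same normalized coordinates as the existing detections.
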
### Sub-matrix: NanoLLM VILA1.5-3B

| Restriction / AC | Candidate-mode behaviour | Result | Evidence |
|---|---|---|---|
| Local VLM, no cloud | Runs local model through NanoLLM. | Pass | Fact #23 |
| Separate IPC process | Candidate can run as an isolated process / container behind local IPC. | Pass | Fact #23, Fact #27 |
| Sequential with YOLO | Scheduler grants VLM GPU slot only after YOLO / Tier 2 work. | Pass | Fact #27 |
| ≤5 s/ROI | API supports image+text prompt; exact latency is runtime quality. | Pass with runtime-quality gate | Fact #23, Fact #24 |
| ≤6 GB remaining RAM | Quantised mode supports smaller memory footprint; exact budget is runtime quality. | Pass with runtime-quality gate | Fact #23, Fact #24 |
| Required for MVP if benchmark passes | User-confirmed policy. | Pass | User decision |
| Output usable for operator display | Text confirmation can be converted into POI metadata while display box comes from Tier 1 / 2. | Pass | Fact #23 |

### Sub-matrix: scan controller state machine

| Restriction / AC | Candidate-mode behaviour | Result | Evidence |
|---|---|---|---|
| Zoom-out route sweep | State machine owns sweep pattern and POI queueing. | Pass | Fact #26 |
| Zoom-out → zoom-in ≤2 s | State machine can command transition; physical zoom timing must be measured. | Pass with runtime-quality gate | Fact #5 |
| Zoom-in lock, pan, hold, timeout | Explicit states encode lock, follow path, endpoint hold, VLM request, timeout, return. | Pass | Fact #26 |
| Target-follow centre 25 % | Target-follow state can enforce centre-window metric. | Pass | Source #4, Fact #26 |
| Decision-to-movement ≤500 ms | Controller can timestamp commands; physical / protocol latency is runtime quality. | Pass with runtime-quality gate | Fact #4 |
| Ordered POI queue with confidence / proximity | Queue can also include user-confirmed ≤5 POIs/minute cap and ageing. | Pass | User decision |

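A table-driven state machine is one way to keep the lifecycle above deterministic and easy to test. The sketch below uses the three state names from the matrix (`ZoomedOut`, `ZoomedIn`, `TargetFollow`, written in Python enum style); the event names and the exact transition set are assumptions made for illustration, and the canonical decomposition remains the one in `system-flows.md §F4`.

```python
from enum import Enum, auto


class ScanState(Enum):
    ZOOMED_OUT = auto()
    ZOOMED_IN = auto()
    TARGET_FOLLOW = auto()


class Event(Enum):
    # Hypothetical event names, chosen for this sketch only.
    POI_SELECTED = auto()        # next POI taken from the queue
    INSPECTION_DONE = auto()     # endpoint hold / VLM request finished
    TIMEOUT = auto()
    TARGET_LOST = auto()
    OPERATOR_CONFIRMED = auto()  # operator confirms a target to follow
    FOLLOW_RELEASED = auto()


# Every legal transition is listed explicitly; nothing is inferred at runtime.
TRANSITIONS = {
    (ScanState.ZOOMED_OUT, Event.POI_SELECTED): ScanState.ZOOMED_IN,
    (ScanState.ZOOMED_OUT, Event.OPERATOR_CONFIRMED): ScanState.TARGET_FOLLOW,
    (ScanState.ZOOMED_IN, Event.INSPECTION_DONE): ScanState.ZOOMED_OUT,
    (ScanState.ZOOMED_IN, Event.TIMEOUT): ScanState.ZOOMED_OUT,
    (ScanState.ZOOMED_IN, Event.TARGET_LOST): ScanState.ZOOMED_OUT,
    (ScanState.ZOOMED_IN, Event.OPERATOR_CONFIRMED): ScanState.TARGET_FOLLOW,
    (ScanState.TARGET_FOLLOW, Event.TARGET_LOST): ScanState.ZOOMED_OUT,
    (ScanState.TARGET_FOLLOW, Event.FOLLOW_RELEASED): ScanState.ZOOMED_OUT,
}


def step(state, event):
    """Return the next state; (state, event) pairs not in the table are ignored."""
    return TRANSITIONS.get((state, event), state)
```

Because the whole behaviour lives in one dictionary, a replay test can feed a recorded event sequence through `step` and compare the resulting state trace against the expected one.
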
### Sub-matrix: integration scheduler and IPC

| Restriction / AC | Candidate-mode behaviour | Result | Evidence |
|---|---|---|---|
| Extend existing FastAPI + Cython service | Keeps existing YOLO core and adds scheduler / adapters around it. | Pass | Fact #16 |
| VLM separate IPC | VLM remains outside Cython and communicates locally. | Pass | Fact #23, Fact #27 |
| No concurrent YOLO / VLM GPU inference | Central scheduler serializes GPU-heavy work. | Pass | Fact #27 |
| Same normalized-box output | Integration layer preserves current DTOs and adds POI metadata separately. | Pass | Fact #16 |
| GPS-denied coordinates out of scope | Scheduler stores external coordinate references but does not estimate them. | Pass | Project restrictions |
| Annotation / training separate repos | Integration consumes trained-model artefacts and label schema only. | Pass | Project restrictions |

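The "central scheduler serializes GPU-heavy work" row can be pictured as a single slot that every inference request must hold. The sketch below is an in-process illustration only: `GpuScheduler` and its `slot` method are hypothetical names, and it assumes that requests to the separate VLM process are brokered by the same scheduler before they are sent over IPC.

```python
import threading
from contextlib import contextmanager


class GpuScheduler:
    """Grants one GPU slot at a time so YOLO and VLM inference never overlap."""

    def __init__(self):
        self._slot = threading.Lock()
        self.holder = None  # name of the current slot owner, for logs and health

    @contextmanager
    def slot(self, owner, timeout_s=None):
        """Hold the GPU slot for the duration of the with-block.

        Raises TimeoutError if the slot is not free within timeout_s seconds;
        timeout_s=None waits without a deadline.
        """
        acquired = self._slot.acquire(timeout=-1 if timeout_s is None else timeout_s)
        if not acquired:
            raise TimeoutError(f"GPU slot busy, requested by {owner}")
        self.holder = owner
        try:
            yield
        finally:
            self.holder = None
            self._slot.release()
```

A caller would wrap each inference in `with scheduler.slot("yolo"):` or `with scheduler.slot("vlm", timeout_s=...):`, so a VLM request that cannot get the slot fails visibly instead of running concurrently with Tier 1.
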
### Mode B revised fit additions

These rows extend the top-level matrix with the cross-cutting concerns surfaced during the second draft of the solution.

| Component Area | Candidate | Pinned Mode/Config | Intended Role | API Capability Evidence | Mismatches / Disqualifiers | Status | Decision Rationale |
|---|---|---|---|---|---|---|---|
| Benchmark gate | Pre-implementation proof suite | Hardware-in-loop and replay benchmarks for Tier 1, Tier 2, VLM, A40 zoom, movement, all-season data readiness | Prevent implementation from depending on unproven runtime-quality assumptions | Facts #28, #33 | None | Selected | Converts draft caveats into explicit go / no-go gates. |
| Telemetry sync contract | Frame / telemetry alignment layer | Timestamped frame, gimbal angle, zoom, and UAV motion samples with maximum-skew validation | Make movement compensation testable and reproducible | Fact #32 | Telemetry-missing MVP rejected by user decision | Selected | Required for movement-detector exact fit. |
| VLM output contract | Structured `VlmAssessment` schema | Label enum, confidence, evidence spans, reason text, timeout / error status; validate before accepting | Prevent free-form VLM text from becoming an unstable API | Fact #31 | Raw free-text VLM output rejected | Selected | Needed for operator display and downstream logs. |
| IPC security | Unix-domain socket permissions + peer credentials | Local socket with filesystem permissions, peer-credential check, payload-size limits | Restrict local VLM callers and bound payload abuse | Fact #30 | Unauthenticated localhost HTTP rejected for VLM control plane | Selected | Local-only is not sufficient without local authorisation. |
| Input security | Image / ROI payload validation | MIME / format allow-list, size limits, patched OpenCV, decode sandbox where practical | Reduce crafted-input and resource-exhaustion risk | Fact #29, Fact #34 | Trusting headers / client filenames rejected | Selected | Existing service will process more image / ROI inputs. |
| Service reliability | Explicit errors and health semantics | No silent exception swallowing in touched paths; health reflects inference / scheduler / VLM availability | Make failures visible during scans and tests | Fact #35 | "Always healthy" failure masking rejected | Selected | Required before expanding mission-critical behaviour. |

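The VLM output contract row lists the fields of `VlmAssessment` without fixing their encoding. As a hedged illustration, the sketch below assumes the VLM is prompted to answer in JSON and validates that answer before anything downstream sees it. The JSON field names, the label values and the `parse_assessment` helper are assumptions for this example; only the field list (label enum, confidence, evidence spans, reason text, status) comes from the table.

```python
import json
from dataclasses import dataclass

# Assumed enum values; the real label set is defined by the project, not here.
ALLOWED_LABELS = frozenset({"branch_pile", "entrance", "dugout", "none"})
ALLOWED_STATUS = frozenset({"ok", "timeout", "error"})


@dataclass(frozen=True)
class VlmAssessment:
    label: str
    confidence: float
    evidence_spans: tuple
    reason: str
    status: str


def parse_assessment(raw_text, max_reason_chars=500):
    """Validate raw VLM text; raise ValueError rather than pass free text on."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError("VLM output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("VLM output must be a JSON object")

    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")

    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")

    spans = data.get("evidence_spans")
    if not isinstance(spans, list) or not all(isinstance(s, str) for s in spans):
        raise ValueError("evidence_spans must be a list of strings")

    reason = data.get("reason")
    if not isinstance(reason, str) or len(reason) > max_reason_chars:
        raise ValueError("reason must be a string within the length limit")

    status = data.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {status!r}")

    return VlmAssessment(label, float(confidence), tuple(spans), reason, status)
```

Rejecting the whole response on any violation matches the "raw free-text VLM output rejected" disqualifier: the caller either gets a fully typed assessment or an explicit error it can log and surface.
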
## Validation

### Validation scenario

A winged UAV flies a planned route at 600–1000 m over mixed winter forest and field terrain. In `ZoomedOut`, the camera sweeps left-right at wide / light zoom. The system detects a faint footpath and a small moving dot, queues both, zooms to the path endpoint within 2 s (entering `ZoomedIn`), keeps the endpoint centred while the UAV moves, asks the local VLM for a bounded confirmation, then returns to `ZoomedOut`. Later, an operator confirms a target and `TargetFollow` mode keeps it in the centre 25 % of frame.

### Expected behaviour (based on conclusions)

- Tier 1 emits primitive boxes / masks for path, branch pile, road / tree context, and fixed known object classes.
- Movement detector compensates gimbal / UAV ego-motion with telemetry and optical flow before residual moving-cluster extraction.
- `scan_controller` queues POIs by confidence / proximity plus ageing and enforces ≤5 POIs/minute operator-review budget.
- The zoom-in transition and endpoint hold run through a deterministic state machine with timeouts and target-loss handling.
- VLM runs only on bounded ROI crops through local IPC and only when the scheduler grants the GPU slot.

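The queueing behaviour in the list above (confidence / proximity ordering, ageing, and the ≤5 POIs/minute budget) can be sketched as follows. The additive score, the `ageing_per_s` weight and the class and field names are assumptions made for illustration; the source fixes only the ordering criteria and the per-minute cap.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Poi:
    poi_id: str
    confidence: float   # 0..1, from Tier 1 / Tier 2
    proximity: float    # 0..1, 1 = closest to the current camera aim (assumed scale)
    enqueued_at: float  # seconds on the controller clock


class PoiQueue:
    def __init__(self, max_per_minute=5, ageing_per_s=0.01):
        self._items = []
        self._dispatched = deque()  # dispatch times inside the last 60 s
        self.max_per_minute = max_per_minute
        self.ageing_per_s = ageing_per_s

    def push(self, poi):
        self._items.append(poi)

    def _score(self, poi, now):
        # Ageing slowly raises the score of waiting POIs so none starves.
        return poi.confidence + poi.proximity + self.ageing_per_s * (now - poi.enqueued_at)

    def pop_next(self, now):
        """Return the best POI, or None if the queue is empty or the budget is spent."""
        while self._dispatched and now - self._dispatched[0] >= 60.0:
            self._dispatched.popleft()
        if not self._items or len(self._dispatched) >= self.max_per_minute:
            return None
        best = max(self._items, key=lambda p: self._score(p, now))
        self._items.remove(best)
        self._dispatched.append(now)
        return best
```

Returning `None` when the budget is exhausted gives the controller a simple backpressure signal: it stays in the sweep until the sliding window frees a slot.
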
### Actual validation results

The architecture is internally consistent with the researched constraints and user confirmations. Runtime quality still requires hardware validation:

1. Tier 1 end-to-end frame latency for fixed-shape YOLO26 + YOLOE-26 FP16 engines at 1280 px.
2. ViewPro A40 medium-to-high zoom transition under the selected control protocol.
3. Movement false-positive rate with timestamped telemetry and representative zoom-out panning, plus zoom-in tracking. Both must satisfy per-zoom-band caps.
4. NanoLLM VILA1.5-3B ≤5 s/ROI and memory budget while the existing YOLO container is present.
5. All-season validation coverage and hard-negative mining.

### Counterexamples

- If YOLOE TensorRT requires runtime prompt mutation for the chosen classes, it is not a valid MVP runtime path; use fixed trained classes only.
- If VILA1.5-3B misses memory or latency gates, MVP cannot claim VLM-required acceptance until a smaller local VLM passes the same API and runtime gates. In that case, `scan_controller` operates with VLM disabled per the optionality model in `architecture.md §7.6 Local VLM confirmation`.
- If telemetry is unavailable or unsynchronised, the movement detector must degrade to stabilised video-only mode and should not claim the zoom-out movement criteria.

### Review checklist

- [x] Draft conclusions are consistent with fact cards.
- [x] Important dimensions include hardware, model runtime, movement compensation, scan control, data, security, and integration boundaries.
- [x] No selected runtime component depends on cloud services.
- [x] No selected TensorRT YOLOE path depends on runtime open-vocabulary prompt mutation.
- [x] Runtime-quality gates are separated from API capability gates.
- [x] All selected components match the project constraint matrix at the API / architecture level.

### Conclusions requiring revision

None at research-draft level. Hardware benchmark failures may revise the selected model variants during planning or implementation.

## References (source registry)

Access date for web sources: 2026-05-06.

### Source #1 — Jetson Orin Nano Super Developer Kit

- **Link**: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit
- **Tier**: L1 (vendor primary)
- **Version info**: 67 INT8 TOPS, 8 GB LPDDR5, 102 GB/s
- **Boundary match**: Full match
- **Summary**: Official NVIDIA page for Jetson Orin Nano Super Developer Kit confirms 67 INT8 TOPS, 8 GB 128-bit LPDDR5 at 102 GB/s, 7–25 W power, generative-AI edge positioning.
- **Used for**: Latency and memory feasibility (Facts #1, #12, #13).

### Source #2 — NVIDIA JetPack 6.2 Super Mode

- **Link**: https://developer.nvidia.com/blog/nvidia-jetpack-6-2-brings-super-mode-to-nvidia-jetson-orin-nano-and-jetson-orin-nx-modules/
- **Tier**: L2 (vendor blog)
- **Version info**: JetPack 6.2 / Super Mode
- **Boundary match**: Full match
- **Summary**: Describes the software-enabled Super Mode performance gain and applicable power modes for Jetson Orin Nano.
- **Used for**: Reproducibility constraint (Fact #2).

### Source #3 — Ultralytics Jetson / TensorRT deployment

- **Link**: https://docs.ultralytics.com/yolov5/jetson_nano/
- **Tier**: L1 (vendor docs)
- **Boundary match**: Partial overlap
- **Summary**: Official Ultralytics documentation for Jetson deployment and TensorRT export.
- **Used for**: Tier 1 latency / deployment feasibility (Fact #3).

### Source #4 — ViewPro A40 Pro official spec

- **Link**: https://www.viewprotech.com/index.php?ac=article&at=read&did=561
- **Tier**: L1 (vendor primary)
- **Version info**: A40 Pro
- **Boundary match**: Full match
- **Summary**: 1080p output, 40× optical zoom, 4.25–170 mm focal range, 30 Hz tracking deviation update, <30 ms deviation output delay, 5×5 px minimum AI target size for built-in AI.
- **Used for**: Camera and gimbal control feasibility (Facts #4, #5).

### Source #5 — MONA: Moving Object Detection from Videos Shot by Dynamic Camera

- **Link**: https://arxiv.org/html/2501.13183v1
- **Tier**: L1 (peer-reviewed venue / arXiv)
- **Boundary match**: Partial overlap
- **Summary**: Optical flow + tracking-any-point + adaptive bounding-box filtering + segmentation for moving-camera object detection.
- **Used for**: Movement detection under moving camera (Facts #6, #7).

### Source #6 — Moving Object Detection from Moving Camera using Focus of Expansion Likelihood and Segmentation

- **Link**: https://arxiv.org/html/2507.13628v1
- **Tier**: L1
- **Boundary match**: Partial overlap
- **Summary**: Optical flow, focus-of-expansion likelihood, segmentation priors for moving-camera object detection.
- **Used for**: Movement detection under moving camera (Facts #6, #7).

### Source #7 — RFAG-YOLO: receptive-field attention-guided YOLO for small-object detection in UAV images

- **Link**: https://pmc.ncbi.nlm.nih.gov/articles/PMC11991089/
- **Tier**: L1
- **Boundary match**: Partial overlap
- **Summary**: UAV small-object detection difficulties; improvements from receptive-field attention.
- **Used for**: Detection-quality realism (Facts #8, #9).

### Source #8 — YOLO-CAM: lightweight UAV detector with combined attention for small targets

- **Link**: https://www.mdpi.com/2072-4292/17/21/3575
- **Tier**: L1
- **Boundary match**: Partial overlap
- **Summary**: Lightweight UAV small-target detection using attention.
- **Used for**: Detection-quality realism (Facts #8, #9).

### Source #9 — Accurate Natural Trail Detection (DNN + Dynamic Programming)

- **Link**: https://mdpi-res.com/d_attachment/sensors/sensors-18-00178/article_deploy/sensors-18-00178.pdf?version=1515603128
- **Tier**: L1
- **Boundary match**: Reference only
- **Summary**: Trail detection using DNN + dynamic programming; supports path-as-structured-perception view.
- **Used for**: Footpath / trail detection (Fact #10).

### Source #10 — Large-Scale Interactive Object Segmentation with Human Annotators

- **Link**: https://arxiv.org/pdf/1903.10830
- **Tier**: L1
- **Boundary match**: Reference only
- **Summary**: Interactive segmentation faster than manual polygon annotation while maintaining quality.
- **Used for**: Annotation effort (Fact #14).

### Source #11 — Scaling up Instance Annotation via Label Propagation

- **Link**: http://scaling-anno.csail.mit.edu/
- **Tier**: L1
- **Boundary match**: Reference only
- **Summary**: Label propagation reducing annotation effort for video / object masks.
- **Used for**: Annotation effort for movement sequences (Fact #15).

### Source #12 — Getting Started with VLM on Jetson Nano

- **Link**: https://learnopencv.com/vlm-on-jetson-nano/
- **Tier**: L3 (third-party tutorial)
- **Boundary match**: Partial overlap
- **Summary**: Small VLMs can run on Jetson-class hardware with careful runtime / memory tuning.
- **Used for**: Local VLM feasibility (Fact #12).

### Source #13 — NVIDIA Jetson AI Lab

- **Link**: https://www.jetson-ai-lab.com/
- **Tier**: L2
- **Boundary match**: Partial overlap
- **Summary**: NVIDIA-linked Jetson AI Lab as the official-adjacent source for model-specific local VLM deployment evidence.
- **Used for**: Local VLM feasibility (Fact #12).

### Source #14 — CAMOUFLAGE-Net / Improved YOLOv7 Tiny / FSEL camouflaged-object detection

- **Link**: https://link.springer.com/article/10.1007/s40031-024-01152-6
- **Tier**: L1
- **Boundary match**: Reference only
- **Summary**: Camouflage detection requires specialised features / attention; cannot be assumed from generic object detection.
- **Used for**: Concealed-target detection realism (Fact #11).

### Source #15 — Ultralytics YOLO TensorRT integration

- **Link**: https://docs.ultralytics.com/integrations/tensorrt/
- **Tier**: L1
- **Boundary match**: Full match
- **Summary**: Ultralytics export to TensorRT engine format with FP16 through `half=True`; TensorRT as the recommended high-performance NVIDIA deployment path.
- **Used for**: Tier 1 primitive detector (Facts #19, #21).

### Source #16 — OpenCV 4.x optical-flow / videostab APIs

- **Link**: https://docs.opencv.org/4.x/d4/dee/tutorial_optical_flow.html
- **Tier**: L1
- **Boundary match**: Full match
- **Summary**: Lucas-Kanade optical flow, feature tracking, global-motion estimation APIs useful for ego-motion compensation.
- **Used for**: Movement detection (Fact #22).

### Source #17 — NanoLLM multimodal documentation

- **Link**: https://github.com/dusty-nv/nanollm/blob/main/docs/multimodal.md
- **Tier**: L1
- **Boundary match**: Full match
- **Summary**: NanoLLM multimodal chat with `Efficient-Large-Model/VILA1.5-3b`, image prompts, MLC quantisation options.
- **Used for**: Local VLM confirmation (Fact #23).

### Source #18 — TensorRT FP16 YOLO export on Jetson — limitations and issues

- **Link**: https://docs.ultralytics.com/integrations/tensorrt/
- **Tier**: L1 / L4 mixed (docs + community issues)
- **Boundary match**: Full match
- **Summary**: Official docs support FP16 TensorRT export; community results highlight Jetson workspace / dynamic-shape memory issues, mitigated with fixed shapes and careful workspace configuration.
- **Used for**: Tier 1 latency / export reliability (Fact #20).

### Source #19 — YOLOE open-vocabulary detection and TensorRT export notes

- **Link**: https://v8docs.ultralytics.com/models/yoloe/
- **Tier**: L1 / L4 mixed
- **Boundary match**: Partial overlap
- **Summary**: YOLOE supports open-vocabulary detection / segmentation, but TensorRT-engine use should not rely on runtime prompt APIs; fixed trained classes are safer for MVP runtime.
- **Used for**: Semantic primitive detector (Fact #21).

### Source #20 — NanoSAM / MobileSAM Jetson Orin Nano segmentation

- **Link**: https://github.com/NVIDIA-AI-IOT/nanosam/
- **Tier**: L1 / L2
- **Boundary match**: Partial overlap
- **Summary**: NanoSAM / MobileSAM as Jetson-optimised segmentation; useful as ROI mask refinement / annotation assist rather than primary sweep model.
- **Used for**: Segmentation fallback (Fact #25).

### Source #21 — VILA1.5-3B and VLM performance on Jetson Orin Nano

- **Link**: https://dusty-nv.github.io/NanoLLM/multimodal.html
- **Tier**: L1 / L4 mixed
- **Boundary match**: Full match
- **Summary**: VILA1.5-3B documented for NanoLLM multimodal usage; community results warn that Orin Nano 8 GB requires strict context / token / crop limits.
- **Used for**: VLM feasibility (Fact #24).

### Source #22 — Behaviour trees for UAV autonomy

- **Link**: https://www.sciencedirect.com/science/article/pii/S0921889022000513
- **Tier**: L1
- **Boundary match**: Reference only
- **Summary**: Behaviour-tree literature supports modular, reactive UAV behaviour; this project's zoom-out / zoom-in scan behaviour is small enough for a deterministic FSM first.
- **Used for**: Scan controller architecture (Fact #26).

### Source #23 — TensorRT concurrency / multiprocessing issue evidence

- **Link**: https://github.com/NVIDIA/TensorRT/issues/2474
- **Tier**: L4 (community issue tracker)
- **Boundary match**: Partial overlap
- **Summary**: Multiple TensorRT engines / processes on one GPU can cause context and performance problems; central GPU scheduler is safer for sequential-inference restriction.
- **Used for**: Integration boundary / GPU scheduling (Fact #27).

### Source #24 — FastAPI file-upload security references

- **Link**: https://fastapi.tiangolo.com/tutorial/request-files
- **Tier**: L1 / L3 mixed
- **Boundary match**: Partial overlap
- **Summary**: Secure upload handling needs content-type verification beyond headers, size limits, streaming behaviour, cleanup, authorisation, audit logging.
- **Used for**: Security weak points (Fact #29).

### Source #25 — Unix-domain socket authentication and peer credentials

- **Link**: https://linux.die.net/man/7/unix
- **Tier**: L1 / L3 mixed
- **Boundary match**: Partial overlap
- **Summary**: Local IPC can use filesystem permissions and peer-credential checks (`SO_PEERCRED`) to restrict which processes may connect.
- **Used for**: VLM IPC security (Fact #30).

### Source #26 — Structured output for LLM / VLM production use

- **Link**: https://docs.vllm.ai/en/v0.6.5/usage/structured_outputs.html
- **Tier**: L1 / L3 mixed
- **Boundary match**: Reference only
- **Summary**: Production systems should constrain or validate model output against schemas before using it in APIs / databases.
- **Used for**: VLM output reliability (Fact #31).

### Source #27 — NVIDIA / Isaac ROS timestamp synchronisation

- **Link**: https://nvidia-isaac-ros.github.io/v/release-3.2/repositories_and_packages/isaac_ros_nova/isaac_ros_correlated_timestamp_driver/index.html
- **Tier**: L1 / L2 mixed
- **Boundary match**: Reference only
- **Summary**: Jetson sensor-fusion uses hardware / correlated timestamps to reduce synchronisation jitter.
- **Used for**: Telemetry synchronisation (Fact #32).

### Source #28 — NVIDIA TensorRT performance optimisation

- **Link**: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/optimization.html
- **Tier**: L1
- **Boundary match**: Partial overlap
- **Summary**: TensorRT performance depends on batching, engine configuration, and runtime scheduling; project-specific latency must be measured under the actual scheduler.
- **Used for**: GPU scheduler / benchmark gate (Fact #33).

### Source #29 — OpenCV CVE-2025-53644

- **Link**: https://securitylab.github.com/advisories/GHSL-2025-057_OpenCV
- **Tier**: L1 / L2 mixed
- **Version info**: OpenCV 4.10.0 / 4.11.0 affected; 4.12.0 patched
- **Boundary match**: Partial overlap
- **Summary**: Crafted image inputs caused critical OpenCV decoder vulnerabilities; image-input validation and pinned patched OpenCV versions matter.
- **Used for**: Image-processing security (Fact #34).

## Solution evolution

The final solution architecture (now in `architecture.md §7.6 Solution Architecture`) evolved from an earlier draft that under-specified several gating concerns. Each row below pairs an old draft component with the weak point that made it insufficient, and the corresponding fix that landed in the final design.

| Old component | Weak point | New solution |
|---|---|---|
| Tiered Jetson pipeline with runtime gates | Gates were listed as caveats, not as a concrete pre-implementation stage. | Add a mandatory benchmark gate before implementation decomposition: Tier 1 latency, Tier 2 ROI latency, VLM latency / memory, A40 zoom timing, movement replay, and all-season dataset readiness. |
| YOLO26 / YOLOE-26 TensorRT FP16 | YOLOE runtime prompt / open-vocabulary behaviour could be accidentally assumed. | Runtime uses only fixed trained classes / pre-baked embeddings in FP16 TensorRT; runtime open-vocabulary mutation remains rejected. |
| Movement detector with telemetry | Telemetry availability was confirmed, but synchronisation tolerance was not specified. | Add a telemetry-synchronisation contract with frame / gimbal / zoom / UAV timestamps and a maximum tolerated skew before motion compensation. |
| NanoLLM VLM IPC | Free-form VLM output is not a stable interface for operator-facing decisions. | Add a structured `VlmAssessment` schema, validation, retry / timeout handling, and fail-closed behaviour. |
| Local VLM process | "Local IPC authorisation" was too vague. | Use Unix-domain socket permissions plus peer-credential checks where available; enforce payload size limits. |
| FastAPI / image processing surface | The draft did not address file / image payload validation or OpenCV decoder risk. | Add content validation, image-format allow-list, size limits, patched OpenCV requirement, and audit logs. |
| Existing service integration | Existing code swallows several exceptions and reports healthy status even when inference fails. | Add reliability tasks for touched paths: explicit error propagation, meaningful health, structured failure logs. |
| Scan controller | Queue cap was present, but not tied to benchmark evidence. | Include ≤5 POIs/minute in replay tests and queue backpressure behaviour. |

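The telemetry-synchronisation row asks for a maximum tolerated skew without giving a number. The sketch below shows one way such a check could look, pairing each frame with the nearest telemetry sample and rejecting the pair when the gap is too large. The 20 ms default, the `(timestamp_s, payload)` tuple layout and the function names are assumptions for illustration; the real tolerance should come from the benchmark gate.

```python
from bisect import bisect_left


def nearest_sample_index(timestamps, frame_ts):
    """Index of the timestamp closest to frame_ts (timestamps sorted ascending)."""
    if not timestamps:
        raise ValueError("no telemetry samples available")
    i = bisect_left(timestamps, frame_ts)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    return min(candidates, key=lambda j: abs(timestamps[j] - frame_ts))


def align_frame(frame_ts, telemetry, max_skew_s=0.02):
    """Pair a frame with its nearest telemetry sample, or reject the pair.

    telemetry is a list of (timestamp_s, payload) tuples sorted by timestamp.
    Returns (payload, skew_s). Raises ValueError when the skew exceeds
    max_skew_s, so motion compensation never runs on misaligned data.
    """
    timestamps = [ts for ts, _ in telemetry]
    idx = nearest_sample_index(timestamps, frame_ts)
    skew = abs(timestamps[idx] - frame_ts)
    if skew > max_skew_s:
        raise ValueError(f"telemetry skew {skew:.4f} s exceeds {max_skew_s:.4f} s")
    return telemetry[idx][1], skew
```

Raising on excessive skew keeps the failure explicit, which also fits the reliability row: a frame without trustworthy telemetry is reported rather than silently compensated with stale data.
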
## Historical seed

This is the original (March 2026) articulation of the semantic-detection problem that the system ultimately addresses. It is preserved here for traceability, as the seed of the entire `architecture.md §7.1 Problem` narrative. The reference images it mentions (`semantic01.png` … `semantic04.png`) lived in the original problem-side data parameters (deleted on doc consolidation 2026-05-17); they are not duplicated here.

Currently, the system consists of three main parts:

1. **AI object detection.** Automatically detects objects in video and images by class, using a pre-trained AI model. Detection is based on visual similarity. The idea is that the UAV can detect objects and act on them automatically. A reconnaissance UAV should deliver a short message with the detected image to the operator, who confirms the target. The detection process is described in the suite-level detections doc.
2. **GPS-Denied.** Estimates the current GPS coordinates from a downward-facing camera and IMU, AI models for optical flow, and pre-downloaded satellite imagery for the route of the plane. Implemented by the suite-level GPS-Denied service (`gps-denied-onboard`).
3. **Search algorithm.** Before the flight, the operator selects a region and a route. During the flight, the system uses the scanning strategy described in `architecture.md §7.2 Mission Regions and Reconnaissance Flow`.

But this whole workflow has a fundamental flaw, and it lies in AI object detection. Regular object detection cannot help with the current frontline situation: it picks up old and already-destroyed vehicles, military vehicles included, and these have zero value for the system.

Current targets are well hidden and well masked. They are mostly hidden positions of FPV operators, along with concealed artillery and other masked positions. Simple object detection is no longer enough, because the main object to search for is a small entrance to a hidden safe place: typically a black circular or square hole near a building, or a dugout masked by tree branches, where personnel or artillery are hidden.

The reference images (deleted on doc consolidation 2026-05-17) showed the typical pattern: a footpath through forest or snow leading to a mass of black colour (mostly tree branches concealing a hideout); footpaths leading to open clearings used as FPV launch points; footpaths terminating at square hideout structures; footpaths terminating at tree-branch concealment.

The main research question that motivated the design: which AI can handle these tasks? Is it possible to instruct an AI to recognise these patterns, follow footsteps (fresh only) and footpaths, analyse the potential hideouts, and signal them to the operator? First, it should pick up footpaths; then it should distinguish stale from fresh footpaths; then it should find potential hideouts at the freshest footpath endpoints; then it can signal potential targets.

This question is now answered by the three-tier perception pipeline (Tier 1 fixed-class YOLO primitives → Tier 2 primitive-graph + lightweight ROI CNN → optional Tier 3 local VLM confirmation), the deterministic `scan_controller` state machine, and the H3-indexed `mapobjects_store`, all documented in `architecture.md §7 Detailed Design`.