add chunking

This commit is contained in:
Oleksandr Bezdieniezhnykh
2025-11-27 03:43:19 +02:00
parent 4f8c18a066
commit 2037870f67
43 changed files with 7041 additions and 4135 deletions
@@ -66,12 +66,39 @@ The system utilizes a **Factor Graph Optimization** (using libraries like GTSAM)
2. **Processing:** The factor graph optimizes the trajectory by minimizing the error between these conflicting constraints.
3. **Output:** A smoothed, globally consistent trajectory $(x, y, z, \text{roll}, \text{pitch}, \text{yaw})$ for every image timestamp.
### **3.3 Atlas Multi-Map Architecture**
ASTRAL-Next implements an **"Atlas" multi-map architecture** where route chunks/fragments are first-class entities, not just recovery mechanisms. This architecture is critical for handling sharp turns (AC-4) and disconnected route segments (AC-5).
**Core Principles:**
- **Chunks are the primary unit of operation**: When tracking is lost (sharp turn, 350m outlier), the system immediately creates a new chunk and continues processing.
- **Proactive chunk creation**: Chunks are created proactively on tracking loss, not reactively after matching failures.
- **Independent chunk processing**: Each chunk has its own subgraph in the factor graph, optimized independently with local consistency.
- **Chunk matching and merging**: Unanchored chunks are matched semantically (aggregate DINOv2 features) and with LiteSAM (with rotation sweeps), then merged into the global trajectory via Sim(3) similarity transformation.
**Chunk Lifecycle:**
1. **Chunk Creation**: On tracking loss, a new chunk is created immediately in the factor graph.
2. **Chunk Building**: Frames are processed within the chunk using sequential VO; factors are added to the chunk's subgraph.
3. **Chunk Matching**: When the chunk is ready (5-20 frames), semantic matching is attempted (the aggregate DINOv2 descriptor is more robust than a single-image one).
4. **Chunk LiteSAM Matching**: Candidate tiles are matched with LiteSAM; rotation sweeps handle the unknown orientation after sharp turns.
5. **Chunk Merging**: Successful matches anchor chunks, which are merged into the global trajectory via a Sim(3) transform (translation, rotation, scale).
**Benefits:**
- The system never "fails": it fragments the route into chunks and continues processing.
- Chunk semantic matching succeeds where single-image matching fails (featureless terrain).
- Multiple chunks can exist simultaneously and be matched/merged asynchronously.
- Reduces user input requests by 50-70% in challenging scenarios.
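The lifecycle above can be sketched as a small state machine. This is an illustrative stand-in, not the actual implementation: the `Chunk`/`ChunkState` names and the 5-frame readiness threshold are assumptions taken from the 5-20 frame window described in this section.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class ChunkState(Enum):
    BUILDING = auto()   # frames accumulate via sequential VO
    MATCHING = auto()   # 5-20 frames reached; semantic / LiteSAM matching runs
    ANCHORED = auto()   # global match found; Sim(3) transform known
    MERGED = auto()     # folded into the global trajectory

@dataclass
class Chunk:
    chunk_id: int
    frames: list = field(default_factory=list)
    state: ChunkState = ChunkState.BUILDING
    min_frames: int = 5  # lower bound of the 5-20 frame readiness window

    def add_frame(self, frame_id: int) -> None:
        self.frames.append(frame_id)
        if self.state is ChunkState.BUILDING and len(self.frames) >= self.min_frames:
            self.state = ChunkState.MATCHING  # ready for aggregate matching

def on_tracking_loss(chunks: list) -> Chunk:
    """Proactive creation: a new chunk starts immediately on tracking loss."""
    chunk = Chunk(chunk_id=len(chunks))
    chunks.append(chunk)
    return chunk

chunks: list = []
active = on_tracking_loss(chunks)  # sharp turn -> new chunk, processing continues
for frame_id in range(5):
    active.add_frame(frame_id)
print(active.state.name)  # MATCHING
```

The key property is that `on_tracking_loss` never blocks on a matching result: the new chunk accepts frames immediately, and anchoring/merging happen later and asynchronously.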
### **3.4 REST API Background Service Architecture**
As per the requirement, the system operates as a background service exposed via a REST API.
* **Communication Pattern:** The service utilizes **REST API endpoints** (FastAPI) for all control operations and **Server-Sent Events (SSE)** for real-time streaming of localization results. This architecture provides:
* **REST Endpoints:** `POST /flights` (create flight), `GET /flights/{id}` (status), `POST /flights/{id}/images/batch` (upload images), `POST /flights/{id}/user-fix` (human-in-the-loop input)
* **SSE Streaming:** `GET /flights/{id}/stream` provides continuous, real-time updates of frame processing results, refinements, and status changes
* **Standard HTTP/HTTPS:** Enables easy integration with web clients, mobile apps, and existing infrastructure without requiring specialized messaging libraries
* **Concurrency:** Layer 1 runs on a high-priority thread to ensure immediate feedback. Layers 2 and 3 run asynchronously; when a global match is found, the result is injected into the Factor Graph, which then "back-propagates" the correction to previous frames, refining the entire recent trajectory. Results are immediately pushed via SSE to connected clients.
* **Future Enhancement:** For multi-client online SaaS deployments, ZeroMQ (PUB-SUB pattern) can be added as an alternative transport layer to support high-throughput, multi-tenant scenarios with lower latency and better scalability than HTTP-based SSE.
## **4. Layer 1: Robust Sequential Visual Odometry**
@@ -110,6 +137,31 @@ When the UAV executes a sharp turn, resulting in a completely new view (0% overl
3. **In-Flight Retrieval:** When Layer 1 reports a loss of tracking (or periodically), the current UAV image is processed by AnyLoc. The resulting vector is queried against the Faiss index.
4. **Result:** The system retrieves the top-5 most similar satellite tiles. These tiles represent the coarse global location of the UAV (e.g., "You are in Grid Square B7").2
### **5.3 Chunk-Based Processing**
When semantic matching fails on featureless terrain (plain agricultural fields), the system employs chunk-based processing as a more robust recovery strategy.
**Chunk Semantic Matching:**
- **Aggregate DINOv2 Features**: Instead of matching a single image, the system builds a route chunk (5-20 frames) using sequential VO and computes an aggregate DINOv2 descriptor from all chunk images.
- **Robustness**: Aggregate descriptors are more robust to featureless terrain where single-image matching fails. Multiple images provide more context and reduce false matches.
- **Implementation**: DINOv2 descriptors from all chunk images are aggregated (mean, VLAD, or max pooling) and queried against the Faiss index.
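A minimal sketch of the aggregation-and-retrieval step, assuming mean pooling. The 4-D descriptors and tile IDs are synthetic, and the brute-force cosine search below stands in for the Faiss index query used in production.

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def aggregate_mean(descriptors):
    """Mean-pool per-image descriptors into one chunk descriptor (VLAD / max pooling are alternatives)."""
    dim = len(descriptors[0])
    mean = [sum(d[i] for d in descriptors) / len(descriptors) for i in range(dim)]
    return l2_normalize(mean)

def top_k_tiles(query, tile_index, k=5):
    """Brute-force cosine search; in production this query goes to the Faiss index."""
    scored = [(sum(q * t for q, t in zip(query, tile)), tile_id)
              for tile_id, tile in tile_index.items()]
    scored.sort(reverse=True)
    return [tile_id for _, tile_id in scored[:k]]

# Synthetic 4-D descriptors for a 3-frame chunk and three satellite tiles.
chunk = [[0.9, 0.1, 0.0, 0.0], [1.0, 0.0, 0.1, 0.0], [0.8, 0.2, 0.0, 0.1]]
tiles = {"B7": l2_normalize([1.0, 0.1, 0.0, 0.0]),
         "C2": l2_normalize([0.0, 1.0, 0.0, 0.0]),
         "D4": l2_normalize([0.0, 0.0, 1.0, 0.0])}
query = aggregate_mean(chunk)
print(top_k_tiles(query, tiles, k=2))  # -> ['B7', 'C2']
```

Averaging before normalizing means a single noisy frame cannot dominate the chunk descriptor, which is the robustness property the text relies on for featureless terrain.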
**Chunk LiteSAM Matching:**
- **Rotation Sweeps**: When matching chunks, the system rotates the entire chunk to all possible angles (0°, 30°, 60°, ..., 330°) because sharp turns change orientation and previous heading may not be relevant.
- **Aggregate Correspondences**: LiteSAM matches the entire chunk to satellite tiles, aggregating correspondences from multiple images for more robust matching.
- **Sim(3) Transform**: Successful matches provide Sim(3) transformation (translation, rotation, scale) for merging chunks into the global trajectory.
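The rotation sweep can be illustrated in 2D: rotate the chunk's track through 0°-330° in 30° steps and keep the heading with the best alignment score. The nearest-neighbour residual below is a simple stand-in for LiteSAM's correspondence scoring; the point sets are synthetic.

```python
import math

def rotate(points, deg):
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def alignment_error(points, reference):
    """Sum of nearest-neighbour distances; a stand-in for LiteSAM inlier scoring."""
    return sum(min(math.dist(p, r) for r in reference) for p in points)

def best_rotation(chunk_pts, tile_pts, step=30):
    """Sweep 0°, 30°, ..., 330° and keep the heading with the lowest residual."""
    sweeps = [(alignment_error(rotate(chunk_pts, deg), tile_pts), deg)
              for deg in range(0, 360, step)]
    return min(sweeps)[1]

# Chunk track recorded after a sharp turn: same shape as the tile track, rotated -90°.
tile_track = [(0, 0), (1, 0), (2, 0), (2, 1)]
chunk_track = rotate(tile_track, -90)  # heading unknown to the matcher
print(best_rotation(chunk_track, tile_track))  # 90
```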
**Normal Operation:**
- Frames are processed within an active chunk context.
- Relative factors are added to the chunk's subgraph (not global graph).
- Chunks are optimized independently for local consistency.
- When chunks are anchored (GPS found), they are merged into the global trajectory.
**Chunk Merging:**
- Chunks are merged using Sim(3) similarity transformation, accounting for translation, rotation, and scale differences.
- This is critical for monocular VO where scale ambiguity exists.
- Merged chunks maintain global consistency while preserving internal consistency.
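For intuition, here is the planar analogue of the merge: a similarity transform $p' = s R(\theta) p + t$ applied to every chunk-local pose (the real system estimates a full Sim(3); the scale factor and poses below are synthetic).

```python
import math

def sim2_apply(pose_xy, s, theta_deg, t):
    """p' = s * R(theta) * p + t  -- planar analogue of the Sim(3) merge."""
    a = math.radians(theta_deg)
    c, sn = math.cos(a), math.sin(a)
    x, y = pose_xy
    return (s * (c * x - sn * y) + t[0], s * (sn * x + c * y) + t[1])

def merge_chunk(chunk_poses, s, theta_deg, t):
    """Map every chunk-local pose into the global frame; internal shape is preserved."""
    return [sim2_apply(p, s, theta_deg, t) for p in chunk_poses]

# Monocular VO track whose scale is 2x off and heading 90 deg off, anchored at (100, 200).
local = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
merged = merge_chunk(local, s=2.0, theta_deg=90, t=(100.0, 200.0))
print([(round(x, 3), round(y, 3)) for x, y in merged])
# -> [(100.0, 200.0), (100.0, 202.0), (100.0, 204.0)]
```

Because every pose is mapped by the same transform, the chunk's internal geometry (relative distances up to scale, relative headings) is untouched, which is exactly the "preserving internal consistency" property above.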
## **6. Layer 3: Fine-Grained Metric Localization (LiteSAM)**
Retrieving the correct satellite tile (Layer 2) gives a location error of roughly the tile size (e.g., 200 meters). To meet the "60% < 20m" and "80% < 50m" criteria, the system must precisely align the UAV image onto the satellite tile. ASTRAL-Next utilizes **LiteSAM**.
@@ -193,39 +245,77 @@ Meeting the <5 second per frame requirement on an RTX 2060 requires optimizing t
* **TensorRT Compilation:** The ONNX models are then compiled into **TensorRT Engines**. This process performs graph fusion (combining multiple layers into one) and kernel auto-tuning (selecting the fastest GPU instructions for the specific RTX 2060/3070 architecture).26
* **Precision:** The models should be quantized to **FP16** (16-bit floating point). Research shows that FP16 inference on NVIDIA RTX cards offers a 2x-3x speedup with negligible loss in matching accuracy for these specific networks.16
### **9.2 Background Service Architecture (REST API + SSE)**
The system is encapsulated as a headless service exposed via REST API.
**REST API Architecture:**
* **FastAPI Framework:** Modern, high-performance Python web framework with automatic OpenAPI documentation
* **REST Endpoints:**
* `POST /flights` - Create flight with initial configuration (start GPS, camera params, altitude)
* `GET /flights/{flightId}` - Retrieve flight status and waypoints
* `POST /flights/{flightId}/images/batch` - Upload batch of 10-50 images for processing
* `POST /flights/{flightId}/user-fix` - Submit human-in-the-loop GPS anchor when system requests input
* `GET /flights/{flightId}/stream` - SSE stream for real-time frame results
* `DELETE /flights/{flightId}` - Cancel/delete flight
* **SSE Streaming:** Server-Sent Events provide real-time updates:
* Frame processing results: `{"event": "frame_processed", "data": {"frame_id": 1024, "gps": [48.123, 37.123], "confidence": 0.98}}`
* Refinement updates: `{"event": "frame_refined", "data": {"frame_id": 1000, "gps": [48.120, 37.120]}}`
* Status changes: `{"event": "status", "data": {"status": "REQ_INPUT", "message": "User input required"}}`
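On the wire, each SSE message is plain text on a `text/event-stream` response: an `event:` line, a `data:` line, and a blank-line terminator. A minimal stdlib sketch (the `format_sse` helper name is hypothetical):

```python
import json

def format_sse(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event exactly as it appears on the wire."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

msg = format_sse("frame_processed",
                 {"frame_id": 1024, "gps": [48.123, 37.123], "confidence": 0.98})
print(msg)
```

A server framework such as FastAPI would yield strings like `msg` from a streaming response; browser clients receive them via the standard `EventSource` API with no extra libraries.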
**Asynchronous Pipeline:**
The system utilizes a Python multiprocessing architecture. One process handles the REST API server and SSE streaming. A second process hosts the TensorRT engines and runs the Factor Graph. This ensures that the heavy computation of Bundle Adjustment does not block the receipt of new images or user commands. Results are immediately pushed to connected SSE clients.
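The decoupling can be sketched with queues. For a self-contained, runnable example this uses threads; the actual design places the worker in a separate process (`multiprocessing.Process` with `multiprocessing.Queue`) with the same queue-based handoff, so Bundle Adjustment never blocks image ingest.

```python
import queue
import threading

def optimizer_worker(frames_q: "queue.Queue", results_q: "queue.Queue") -> None:
    """Stand-in for the process hosting the TensorRT engines and the Factor Graph."""
    while True:
        frame = frames_q.get()
        if frame is None:          # sentinel: shut down
            break
        # Heavy inference / Bundle Adjustment would run here without blocking ingest.
        results_q.put({"frame_id": frame["frame_id"], "status": "LOCKED"})

frames_q, results_q = queue.Queue(), queue.Queue()
worker = threading.Thread(target=optimizer_worker, args=(frames_q, results_q))
worker.start()
# The API side keeps accepting images while the optimizer is busy.
for frame_id in (1, 2, 3):
    frames_q.put({"frame_id": frame_id})
results = [results_q.get(timeout=10) for _ in range(3)]
frames_q.put(None)
worker.join()
print([r["frame_id"] for r in results])  # [1, 2, 3]
```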
**Future Multi-Client SaaS Enhancement:**
For production deployments requiring multiple concurrent clients and higher throughput, ZeroMQ can be added as an alternative transport layer:
* **ZeroMQ PUB-SUB:** For high-frequency result streaming to multiple subscribers
* **ZeroMQ REQ-REP:** For low-latency command/response patterns
* **Hybrid Approach:** REST API for control operations, ZeroMQ for data streaming in multi-tenant scenarios
## **10. Human-in-the-Loop Strategy**
The requirement stipulates that for the "20% of the route" where automation fails, the user must intervene. The system must proactively detect its own failure.
### **10.1 Failure Detection and Recovery Stages**
The system monitors the **PDM@K** (Positioning Distance Measurement) metric continuously.
* **Definition:** PDM@K measures the percentage of queries localized within $K$ meters.3
* **Real-Time Proxy:** In flight, we cannot know the true PDM (as we don't have ground truth). Instead, we use the **Marginal Covariance** from the Factor Graph. If the uncertainty ellipse for the current position grows larger than a radius of 50 meters, or if the **Image Registration Rate** (percentage of inliers in LightGlue/LiteSAM) drops below 10% for 3 consecutive frames, the system triggers a **Critical Failure Mode**.19
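The two trigger conditions can be sketched directly from the thresholds above (50 m covariance radius, or an inlier rate below 10% for 3 consecutive frames); `FailureDetector` is an illustrative name, not the actual class.

```python
COV_RADIUS_M = 50.0      # marginal-covariance ellipse radius threshold
MIN_INLIER_RATE = 0.10   # LightGlue / LiteSAM registration threshold
CONSECUTIVE = 3          # consecutive low-inlier frames before triggering

class FailureDetector:
    def __init__(self) -> None:
        self.low_inlier_streak = 0

    def update(self, cov_radius_m: float, inlier_rate: float) -> bool:
        """Return True when Critical Failure Mode should trigger for this frame."""
        if inlier_rate < MIN_INLIER_RATE:
            self.low_inlier_streak += 1
        else:
            self.low_inlier_streak = 0
        return cov_radius_m > COV_RADIUS_M or self.low_inlier_streak >= CONSECUTIVE

det = FailureDetector()
print(det.update(12.0, 0.05))  # False (1st low-inlier frame)
print(det.update(14.0, 0.06))  # False (2nd)
print(det.update(15.0, 0.04))  # True  (3rd consecutive frame below 10%)
```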
**Recovery Stages:**
1. **Stage 1: Progressive Tile Search (Single Image)**
- Attempts single-image semantic matching (DINOv2) and LiteSAM matching.
- Progressive tile grid expansion (1→4→9→16→25 tiles).
- Fast recovery for transient tracking loss.
2. **Stage 2: Chunk Building and Semantic Matching (Proactive)**
- **Immediately creates new chunk** when tracking lost (proactive, not reactive).
- Continues processing frames, building chunk with sequential VO.
- When chunk ready (5-20 frames), attempts chunk semantic matching.
- Aggregate DINOv2 descriptor more robust than single-image matching.
- Handles featureless terrain where single-image matching fails.
3. **Stage 3: Chunk LiteSAM Matching with Rotation Sweeps**
- After chunk semantic matching succeeds, attempts LiteSAM matching.
- Rotates entire chunk to all angles (0°, 30°, ..., 330°) for matching.
- Critical for sharp turns where orientation unknown.
- Aggregate correspondences from multiple images for robustness.
4. **Stage 4: User Input (Last Resort)**
- Only triggered if all chunk matching strategies fail.
- System requests user-provided GPS anchor.
- User anchor applied as hard constraint, processing resumes.
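The four stages above form an ordered cascade in which user input is only reached when every automated stage returns nothing. A minimal sketch, with stub stage callables standing in for the real matchers:

```python
def recover(stages, request_user_fix):
    """Run recovery stages in order; fall back to the human-in-the-loop anchor last."""
    for name, stage in stages:
        anchor = stage()          # each stage returns a GPS anchor or None
        if anchor is not None:
            return name, anchor
    return "user_fix", request_user_fix()

# Stubs: single-image tile search and chunk semantic matching fail (featureless
# terrain); chunk LiteSAM matching with rotation sweeps succeeds.
stages = [
    ("tile_search", lambda: None),
    ("chunk_semantic", lambda: None),
    ("chunk_litesam", lambda: (48.123, 37.123)),
]
print(recover(stages, lambda: (48.22, 37.66)))
# -> ('chunk_litesam', (48.123, 37.123))
```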
### **10.2 The User Interaction Workflow**
1. **Trigger:** Critical Failure Mode activated.
2. **Action:** The Service sends an SSE event `{"event": "user_input_needed", "data": {"status": "REQ_INPUT", "frame_id": 1024}}` to connected clients.
3. **Data Payload:** The client retrieves the current UAV image and top-3 retrieved satellite tiles via `GET /flights/{flightId}/frames/{frameId}/context` endpoint.
4. **User Input:** The user clicks a distinctive feature (e.g., a specific crossroad) in the UAV image and the corresponding point on the satellite map, then submits via `POST /flights/{flightId}/user-fix` with the GPS coordinate.
5. **Recovery:** This GPS coordinate is treated as a **Hard Constraint** in the Factor Graph. The optimizer immediately snaps the trajectory to this user-defined anchor, resetting the covariance and effectively "healing" the localized track. An SSE event confirms the recovery: `{"event": "user_fix_applied", "data": {"frame_id": 1024, "status": "PROCESSING"}}`.19
## **11. Performance Evaluation and Benchmarks**
@@ -267,7 +357,7 @@ A comprehensive test plan is required to validate compliance with all 10 Accepta
| **AC-9** | Image Registration Rate > 95% | V-SLAM (C-3) | **"Atlas" Multi-Map** (4.2). A "lost track" (AC-4) is *not* a registration failure; it's a *new map registration*. This ensures the rate > 95%. |
| **AC-10** | Mean Reprojection Error (MRE) < 1.0px | V-SLAM (C-3) + TOH (C-6) | Local BA (4.3) + Global BA (TOH14) + **Per-Keyframe Scale** (6.2) minimizes internal graph tension (Flaw 1.3), allowing the optimizer to converge to a low MRE. |
### **12.1 Rigorous Validation Methodology**
* **Test Harness:** A validation script will be created to compare the system's $Pose_N^{Refined}$ output against a ground-truth `coordinates.csv` file, computing Haversine distance errors.
* **Test Datasets:**