gps-denied-onboard/docs/00_problem/1.3_06_assesment_prompt.md at f4def053e80afd5c0ca090805ba90cbf46a2c355

mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-04-22 22:46:36 +00:00


Oleksandr Bezdieniezhnykh bed8e6d52a Make prompts more stuctured.

Separate tutorial.md for developers from commands for AI
WIP

2025-11-22 19:57:16 +02:00


Read carefully about the problem:

We have a lot of images taken from a wing-type UAV using a camera with at least Full HD resolution. The resolution of each photo could be up to 6200*4100 for the whole flight, but for other flights it could be Full HD. Photos are taken and named consecutively, within 100 meters of each other. We know only the starting GPS coordinates. We need to determine the GPS coordinates of the center of each image, and also the coordinates of the center of any object in these photos. We can use an external satellite provider for ground checks on the existing photos.

The system has the following restrictions and conditions:

Photos are taken by only airplane type UAVs.
Photos are taken by the camera pointing downwards and fixed, but it is not autostabilized.
The flying range is restricted by the eastern and southern parts of Ukraine (To the left of the Dnipro River)
The image resolution could be from FullHD to 6252*4168. Camera parameters are known: focal length, sensor width, resolution and so on.
Altitude is predefined and no more than 1km. The height of the terrain can be neglected.
There is NO data from IMU
Flights are done mostly in sunny weather
We can use satellite providers, but we're limited right now to Google Maps, which could be outdated for some regions
Number of photos could be up to 3000, usually in the 500-1500 range
During the flight, UAVs can make sharp turns, so that the next photo may be absolutely different from the previous one (no same objects), but it is rather an exception than the rule
Processing is done on a stationary computer or laptop with NVidia GPU at least RTX2060, better 3070. (For the UAV solution Jetson Orin Nano would be used, but that is out of scope.)

The output of the system should address the following acceptance criteria:

- The system should find out the GPS of centers of 80% of the photos from the flight within an error of no more than 50 meters in comparison to the real GPS

- The system should find out the GPS of centers of 60% of the photos from the flight within an error of no more than 20 meters in comparison to the real GPS

- The system should correctly continue the work even in the presence of up to 350 meters of an outlier photo between 2 consecutive pictures en route. This could happen due to tilt of the plane.

- The system should correctly continue the work even during sharp turns, where the next photo doesn't overlap at all, or overlaps by less than 5%. The next photo should be within 200m of drift and at an angle of less than 70%

- The system should try to operate when the UAV has made a sharp turn and none of the following photos has common points with the previous route. In that situation, the system should try to figure out the location of the new piece of the route and connect it to the previous route. There could be more than 2 such separate chunks, so this strategy should be at the core of the system

- If the system is absolutely incapable of determining the GPS of the next, second next, and third next images by any means (these 20% of the route), then it should ask the user for input for the next image, so that the user can specify the location

- Less than 5 seconds for processing one image

- Results of image processing should appear to the user immediately, so that the user does not have to wait for the whole route to complete before analyzing the first results. The system may also refine already calculated results and send the refined results to the user again

- Image Registration Rate > 95%. The system can find enough matching features to confidently calculate the camera's 6-DoF pose (position and orientation) and "stitch" that image into the final trajectory

- Mean Reprojection Error (MRE) < 1.0 pixels. The distance, in pixels, between the original pixel location of the object and the re-projected pixel location.

- The whole system should work as a background service. The interaction should be done via ZeroMQ. The service should be up and running, awaiting the initial input message. On the input message, processing should start, and immediately after the first results are available the system should provide them to the client

Here is a solution draft:

# **ASTRAL-Next: A Resilient, GNSS-Denied Geo-Localization Architecture for Wing-Type UAVs in Complex Semantic Environments**

## **1. Executive Summary and Operational Context**

The strategic necessity of operating Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments has precipitated a fundamental shift in autonomous navigation research. The specific operational profile under analysis—high-speed, fixed-wing UAVs operating without Inertial Measurement Units (IMU) over the visually homogenous and texture-repetitive terrain of Eastern and Southern Ukraine—presents a confluence of challenges that render traditional Simultaneous Localization and Mapping (SLAM) approaches insufficient. The target environment, characterized by vast agricultural expanses, seasonal variability, and potential conflict-induced terrain alteration, demands a navigation architecture that moves beyond simple visual odometry to a robust, multi-layered Absolute Visual Localization (AVL) system.

This report articulates the design and theoretical validation of **ASTRAL-Next**, a comprehensive architectural framework engineered to supersede the limitations of preliminary dead-reckoning solutions. By synthesizing state-of-the-art (SOTA) research emerging in 2024 and 2025, specifically leveraging **LiteSAM** for efficient cross-view matching 1, **AnyLoc** for universal place recognition 2, and **SuperPoint+LightGlue** for robust sequential tracking 1, the proposed system addresses the critical failure modes inherent in wing-type UAV flight dynamics. These dynamics include sharp banking maneuvers, significant pitch variations leading to ground sampling distance (GSD) disparities, and the potential for catastrophic track loss (the "kidnapped robot" problem).

The analysis indicates that relying solely on sequential image overlap is viable only for short-term trajectory smoothing. The core innovation of ASTRAL-Next lies in its "Hierarchical + Anchor" topology, which decouples the relative motion estimation from absolute global anchoring. This ensures that even during zero-overlap turns or 350-meter positional outliers caused by airframe tilt, the system can re-localize against a pre-cached satellite reference map within the required 5-second latency window.3 Furthermore, the system accounts for the semantic disconnect between live UAV imagery and potentially outdated satellite reference data (e.g., Google Maps) by prioritizing semantic geometry over pixel-level photometric consistency.

### **1.1 Operational Environment and Constraints Analysis**

The operational theater—specifically the left bank of the Dnipro River in Ukraine—imposes rigorous constraints on computer vision algorithms. The absence of IMU data removes the ability to directly sense acceleration and angular velocity, creating a scale ambiguity in monocular vision systems that must be resolved through external priors (altitude) and absolute reference data.

| Constraint Category | Specific Challenge | Implication for System Design |
| :---- | :---- | :---- |
| **Sensor Limitation** | **No IMU Data** | The system cannot distinguish between pure translation and camera rotation (pitch/roll) without visual references. Scale must be constrained via altitude priors and satellite matching.5 |
| **Flight Dynamics** | **Wing-Type UAV** | Unlike quadcopters, fixed-wing aircraft cannot hover. They bank to turn, causing horizon shifts and perspective distortions. "Sharp turns" result in 0% image overlap.6 |
| **Terrain Texture** | **Agricultural Fields** | Repetitive crop rows create aliasing for standard descriptors (SIFT/ORB). Feature matching requires context-aware deep learning methods (SuperPoint).7 |
| **Reference Data** | **Google Maps (2025)** | Public satellite data may be outdated or lower resolution than restricted military feeds. Matches must rely on invariant features (roads, tree lines) rather than ephemeral textures.9 |
| **Compute Hardware** | **NVIDIA RTX 2060/3070** | Algorithms must be optimized for TensorRT to meet the <5s per frame requirement. Heavy transformers (e.g., ViT-Huge) are prohibitive; efficient architectures (LiteSAM) are required.1 |

The confluence of these factors necessitates a move away from simple "dead reckoning" (accumulating relative movements), whose error accumulates without bound because every frame-to-frame estimate adds to the drift. Instead, ASTRAL-Next operates as a **Global-Local Hybrid System**, where a high-frequency visual odometry layer handles frame-to-frame continuity, while a parallel global localization layer periodically "resets" the drift by anchoring the UAV to the satellite map.

## **2. Architectural Critique of Legacy Approaches**

The initial draft solution ("ASTRAL") and similar legacy approaches typically rely on a unified SLAM pipeline, often attempting to use the same feature extractors for both sequential tracking and global localization. Recent literature highlights substantial deficiencies in this monolithic approach, particularly when applied to the specific constraints of this project.

### **2.1 The Failure of Classical Descriptors in Agricultural Settings**

Classical feature descriptors like SIFT (Scale-Invariant Feature Transform) and ORB (Oriented FAST and Rotated BRIEF) rely on detecting "corners" and "blobs" based on local pixel intensity gradients. In the agricultural landscapes of Eastern Ukraine, this approach faces severe aliasing. A field of sunflowers or wheat presents thousands of identical "blobs," causing the nearest-neighbor matching stage to generate a high ratio of outliers.8  
Research demonstrates that deep-learning-based feature extractors, specifically SuperPoint, trained on large datasets of synthetic and real-world imagery, learn to identify interest points that are semantically significant (e.g., the intersection of a tractor path and a crop line) rather than just texturally distinct.1 Consequently, a redesign must replace SIFT/ORB with SuperPoint for the front-end tracking.

### **2.2 The Inadequacy of Dead Reckoning without IMU**

In a standard Visual-Inertial Odometry (VIO) system, the IMU provides a high-frequency prediction of the camera's pose, which the visual system then refines. Without an IMU, the system is purely Visual Odometry (VO). In VO, the scale of the world is unobservable from a single camera (monocular scale ambiguity). A 1-meter movement of a small object looks identical to a 10-meter movement of a large object.5  
While the prompt specifies a "predefined altitude," relying on this as a static constant is dangerous due to terrain undulations and barometric drift. ASTRAL-Next must implement a Scale-Constrained Bundle Adjustment, treating the altitude not as a hard fact, but as a strong prior that prevents the scale drift common in monocular systems.5
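To make the role of the altitude prior concrete, here is a minimal sketch of how a known altitude and camera intrinsics fix the metric scale of a nadir image under the flat-terrain assumption. The function name and the camera values (focal length, sensor width, image width) are illustrative assumptions, not parameters taken from the project.

```python
def ground_sampling_distance(altitude_m, focal_length_mm, sensor_width_mm, image_width_px):
    """Ground distance (meters) covered by one pixel of a nadir pinhole camera over flat terrain."""
    if min(altitude_m, focal_length_mm, sensor_width_mm, image_width_px) <= 0:
        raise ValueError("all parameters must be positive")
    return altitude_m * sensor_width_mm / (focal_length_mm * image_width_px)


# Hypothetical camera values, chosen only for illustration.
FOCAL_MM, SENSOR_MM, WIDTH_PX = 25.0, 23.5, 6252

gsd_nominal = ground_sampling_distance(400.0, FOCAL_MM, SENSOR_MM, WIDTH_PX)
gsd_biased = ground_sampling_distance(440.0, FOCAL_MM, SENSOR_MM, WIDTH_PX)  # altitude off by 10%

footprint_width_m = gsd_nominal * WIDTH_PX    # ground width covered by one image
scale_error = gsd_biased / gsd_nominal - 1.0  # relative error of every metric displacement

print(f"GSD: {gsd_nominal:.4f} m/px, footprint width: {footprint_width_m:.1f} m")
print(f"A 10% altitude error gives a {scale_error:.0%} scale error")
```

Because the scale is linear in altitude, any bias in the assumed altitude carries over one-to-one into every relative displacement, which is why the altitude is better treated as a prior than as a fixed constant.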

### **2.3 Vulnerability to "Kidnapped Robot" Scenarios**

The requirement to recover from sharp turns where the "next photo doesn't overlap at all" describes the classic "Kidnapped Robot Problem" in robotics—where a robot is teleported to an unknown location and must relocalize.14  
Sequential matching algorithms (optical flow, feature tracking) function on the assumption of overlap. When overlap is zero, these algorithms fail catastrophically. The legacy solution's reliance on continuous tracking makes it fragile to these flight dynamics. The redesigned architecture must incorporate a dedicated Global Place Recognition module that treats every frame as a potential independent query against the satellite database, independent of the previous frame's history.2

## **3. ASTRAL-Next: System Architecture and Methodology**

To meet the acceptance criteria—specifically the 80% success rate within 50m error and the <5 second processing time—ASTRAL-Next utilizes a tri-layer processing topology. These layers operate concurrently, feeding into a central state estimator.

### **3.1 The Tri-Layer Localization Strategy**

The architecture separates the concerns of continuity, recovery, and precision into three distinct algorithmic pathways.

| Layer | Functionality | Algorithm | Latency | Role in Acceptance Criteria |
| :---- | :---- | :---- | :---- | :---- |
| **L1: Sequential Tracking** | Frame-to-Frame Relative Pose | **SuperPoint + LightGlue** | \~50-100ms | Handles continuous flight, bridges small gaps (overlap < 5%), and maintains trajectory smoothness. Essential for the 100m spacing requirement. 1 |
| **L2: Global Re-Localization** | "Kidnapped Robot" Recovery | **AnyLoc (DINOv2 + VLAD)** | \~200ms | Detects location after sharp turns (0% overlap) or track loss. Matches current view to the satellite database tile. Addresses the sharp turn recovery criterion. 2 |
| **L3: Metric Refinement** | Precise GPS Anchoring | **LiteSAM / HLoc** | \~300-500ms | "Stitches" the UAV image to the satellite tile with pixel-level accuracy to reset drift. Ensures the "80% < 50m" and "60% < 20m" accuracy targets. 1 |

### **3.2 Data Flow and State Estimation**

The system utilizes a **Factor Graph Optimization** (using libraries like GTSAM) as the central "brain."

1. **Inputs:**  
   * **Relative Factors:** Provided by Layer 1 (Change in pose from $t-1$ to $t$).  
   * **Absolute Factors:** Provided by Layer 3 (Global GPS coordinate at $t$).  
   * **Priors:** Altitude constraint and Ground Plane assumption.  
2. **Processing:** The factor graph optimizes the trajectory by minimizing the error between these conflicting constraints.  
3. **Output:** A smoothed, globally consistent trajectory $(x, y, z, \text{roll}, \text{pitch}, \text{yaw})$ for every image timestamp.
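The fusion of relative and absolute factors can be illustrated with a deliberately small toy problem. This is a sketch under strong assumptions: positions are one-dimensional, all factors are linear with Gaussian noise, and the solver is plain weighted least squares rather than GTSAM. The names `optimize_chain` and `solve_linear` are hypothetical.

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def optimize_chain(n, relative, absolute):
    """Fuse factors on n scalar positions.

    relative: (i, j, measured x_j - x_i, sigma) as produced by sequential tracking.
    absolute: (i, measured x_i, sigma) as produced by satellite anchoring.
    """
    H = [[0.0] * n for _ in range(n)]  # information matrix (normal equations)
    g = [0.0] * n
    for i, j, d, sigma in relative:
        w = 1.0 / sigma ** 2
        H[i][i] += w
        H[j][j] += w
        H[i][j] -= w
        H[j][i] -= w
        g[i] -= w * d
        g[j] += w * d
    for i, z, sigma in absolute:
        w = 1.0 / sigma ** 2
        H[i][i] += w
        g[i] += w * z
    return solve_linear(H, g)


N = 5
odometry = [(i, i + 1, 105.0, 10.0) for i in range(N - 1)]  # biased VO: 105 m per step
anchors = [(0, 0.0, 5.0), (4, 400.0, 5.0)]                   # absolute fixes at both ends
dead_reckoning = [105.0 * i for i in range(N)]
fused = optimize_chain(N, odometry, anchors)
print("dead reckoning:", dead_reckoning)
print("fused:         ", [round(x, 2) for x in fused])
```

In this toy case the biased odometry alone ends 20 m off, while the fused estimate stays close to the anchor because the absolute factors carry the smaller uncertainty.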

### **3.3 ZeroMQ Background Service Architecture**

As per the requirement, the system operates as a background service.

* **Communication Pattern:** The service utilizes a REQ-REP (Request-Reply) pattern for control commands (Start/Stop/Reset), with the service holding the REP socket, and a PUB-SUB (Publish-Subscribe) pattern for the continuous stream of localization results.  
* **Concurrency:** Layer 1 runs on a high-priority thread to ensure immediate feedback. Layers 2 and 3 run asynchronously; when a global match is found, the result is injected into the Factor Graph, which then "back-propagates" the correction to previous frames, refining the entire recent trajectory.

## **4. Layer 1: Robust Sequential Visual Odometry**

The first line of defense against localization loss is robust tracking between consecutive UAV images. Given the challenging agricultural environment, standard feature matching is prone to failure. ASTRAL-Next employs **SuperPoint** and **LightGlue**.

### **4.1 SuperPoint: Semantic Feature Detection**

SuperPoint is a fully convolutional neural network trained to detect interest points and compute their descriptors. Unlike SIFT, which relies on a handcrafted detector and descriptor, SuperPoint is trained via self-supervision: it is bootstrapped on synthetic shapes and then adapted to real images through Homographic Adaptation.

* **Relevance to Ukraine:** In a wheat field, SIFT might latch onto hundreds of identical wheat stalks. SuperPoint, however, learns to prioritize more stable features, such as the boundary between the field and a dirt road, or a specific patch of discoloration in the crop canopy.1  
* **Performance:** SuperPoint runs efficiently on the RTX 2060/3070, with inference times around 15ms per image when optimized with TensorRT.16

### **4.2 LightGlue: The Attention-Based Matcher**

**LightGlue** represents a paradigm shift from the traditional "Nearest Neighbor + RANSAC" matching pipeline. It is a deep neural network that takes two sets of SuperPoint features and jointly predicts the matches.

* **Mechanism:** LightGlue uses a transformer-based attention mechanism. It allows features in Image A to "look at" all features in Image B (and vice versa) to determine the best correspondence. Crucially, it predicts a matchability score for each point, which lets it explicitly reject points that have no match (occlusion or field-of-view change); this replaces the "dustbin" used by its predecessor, SuperGlue.12  
* **Addressing the <5% Overlap:** The user specifies handling overlaps of "less than 5%." Traditional RANSAC fails here because the inlier ratio is too low. LightGlue, however, can confidently identify the few remaining matches because its attention mechanism considers the global geometric context of the points. If only a single road intersection is visible in the corner of both images, LightGlue is significantly more likely to match it correctly than SIFT.8  
* **Efficiency:** LightGlue is designed to be "light." It features an adaptive depth mechanism—if the images are easy to match, it exits early. If they are hard (low overlap), it uses more layers. This adaptability is perfect for the variable difficulty of the UAV flight path.19

## **5. Layer 2: Global Place Recognition (The "Kidnapped Robot" Solver)**

When the UAV executes a sharp turn, resulting in a completely new view (0% overlap), sequential tracking (Layer 1) is mathematically impossible. The system must recognize the new terrain solely based on its appearance. This is the domain of **AnyLoc**.

### **5.1 Universal Place Recognition with Foundation Models**

**AnyLoc** leverages **DINOv2**, a massive self-supervised vision transformer developed by Meta. DINOv2 is unique because it is not trained with labels; it is trained to understand the geometry and semantic layout of images.

* **Why DINOv2 for Satellite Matching:** Satellite images and UAV images have different "domains." The satellite image might be from summer (green), while the UAV flies in autumn (brown). DINOv2 features are remarkably invariant to these texture changes. It "sees" the shape of the road network or the layout of the field boundaries, rather than the color of the leaves.2  
* **VLAD Aggregation:** AnyLoc extracts dense features from the image using DINOv2 and aggregates them using **VLAD** (Vector of Locally Aggregated Descriptors) into a single, compact vector (e.g., 4096 dimensions). This vector represents the "fingerprint" of the location.21
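The aggregation step can be sketched in a few lines. This is a generic VLAD implementation on toy 2-D descriptors, not AnyLoc's code: the cluster centers are given rather than learned, and the per-cluster (intra) normalization is an assumption about one common variant.

```python
import math


def vlad(descriptors, centers):
    """Aggregate local descriptors into a single VLAD vector.

    Each descriptor is assigned to its nearest cluster center, residuals are
    summed per cluster, each cluster block is L2-normalized, the blocks are
    concatenated, and the result is L2-normalized again.
    """
    dim = len(centers[0])
    blocks = [[0.0] * dim for _ in centers]
    for d in descriptors:
        # Hard assignment to the nearest center (squared Euclidean distance).
        k = min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(d, centers[c])))
        for i in range(dim):
            blocks[k][i] += d[i] - centers[k][i]
    flat = []
    for block in blocks:
        norm = math.sqrt(sum(v * v for v in block))
        flat.extend(v / norm if norm > 0 else 0.0 for v in block)
    total = math.sqrt(sum(v * v for v in flat))
    return [v / total if total > 0 else 0.0 for v in flat]


# Toy data: two cluster centers and three local descriptors.
centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]
fingerprint = vlad(descriptors, centers)
print(len(fingerprint), [round(v, 4) for v in fingerprint])
```

The output length is always (number of centers) x (descriptor dimension), which is what makes images with different numbers of local features comparable through a single vector.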

### **5.2 Implementation Strategy**

1. **Database Preparation:** Before the mission, the system downloads the satellite imagery for the operational bounding box (Eastern/Southern Ukraine). These images are tiled (e.g., 512x512 pixels with overlap) and processed through AnyLoc to generate a database of descriptors.  
2. **Faiss Indexing:** These descriptors are indexed using **Faiss**, a library for efficient similarity search.  
3. **In-Flight Retrieval:** When Layer 1 reports a loss of tracking (or periodically), the current UAV image is processed by AnyLoc. The resulting vector is queried against the Faiss index.  
4. **Result:** The system retrieves the top-5 most similar satellite tiles. These tiles represent the coarse global location of the UAV (e.g., "You are in Grid Square B7").2
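The retrieval step above amounts to a nearest-neighbor search over tile descriptors. The sketch below uses a brute-force cosine-similarity scan as a stand-in for the Faiss index, which is only reasonable for small databases; the tile identifiers and the 3-dimensional descriptors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def top_k_tiles(query, database, k=5):
    """Return the k most similar tiles as (tile_id, similarity), best first.

    database maps a tile identifier to its global descriptor.
    """
    scored = sorted(
        ((cosine_similarity(query, vec), tile_id) for tile_id, vec in database.items()),
        reverse=True,
    )
    return [(tile_id, score) for score, tile_id in scored[:k]]


# Hypothetical tile database with tiny descriptors.
tile_db = {
    "A1": [1.0, 0.0, 0.0],
    "B7": [0.6, 0.8, 0.0],
    "C3": [0.0, 0.0, 1.0],
    "D4": [0.0, 1.0, 0.0],
}
query_descriptor = [0.5, 0.85, 0.05]
candidates = top_k_tiles(query_descriptor, tile_db, k=2)
print(candidates)
```

Returning several candidates rather than only the best one leaves room for the metric refinement stage to reject a wrong tile.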

## **6. Layer 3: Fine-Grained Metric Localization (LiteSAM)**

Retrieving the correct satellite tile (Layer 2) gives a location error of roughly the tile size (e.g., 200 meters). To meet the "60% < 20m" and "80% < 50m" criteria, the system must precisely align the UAV image onto the satellite tile. ASTRAL-Next utilizes **LiteSAM**.

### **6.1 Justification for LiteSAM over TransFG**

While **TransFG** (Transformer for Fine-Grained recognition) is a powerful architecture for cross-view geo-localization, it is computationally heavy.23 **LiteSAM** (Lightweight Satellite-Aerial Matching) is specifically architected for resource-constrained platforms (like UAV onboard computers or efficient ground stations) while maintaining state-of-the-art accuracy.

* **Architecture:** LiteSAM utilizes a **Token Aggregation-Interaction Transformer (TAIFormer)**. It employs a convolutional token mixer (CTM) to model correlations between the UAV and satellite images.  
* **Multi-Scale Processing:** LiteSAM processes features at multiple scales. This is critical because the UAV altitude varies (<1km), meaning the scale of objects in the UAV image will not perfectly match the fixed scale of the satellite image (Google Maps Zoom Level 19). LiteSAM's multi-scale approach inherently handles this discrepancy.1  
* **Performance Data:** Empirical benchmarks on the **UAV-VisLoc** dataset show LiteSAM achieving an RMSE@30 (Root Mean Square Error within 30 meters) of 17.86 meters, directly supporting the project's accuracy requirements. Its inference time is approximately 61.98ms on standard GPUs, ensuring it fits within the overall 5-second budget.1

### **6.2 The Alignment Process**

1. **Input:** The UAV Image and the Top-1 Satellite Tile from Layer 2.  
2. **Processing:** LiteSAM computes the dense correspondence field between the two images.  
3. **Homography Estimation:** Using the correspondences, the system computes a homography matrix $H$ that maps pixels in the UAV image to pixels in the georeferenced satellite tile.  
4. **Pose Extraction:** The camera's absolute GPS position is derived from this homography, utilizing the known GSD of the satellite tile.18
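Steps 3 and 4 can be sketched as follows. The assumptions are significant: the satellite tile is north-up, its center coordinate and GSD are known, the homography is already estimated, and the pixel-to-degree conversion uses a small-offset spherical approximation. All function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius


def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography given as nested lists."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity")
    return u / w, v / w


def image_center_to_gps(H, image_size, tile_size, tile_center_latlon, tile_gsd_m):
    """Project the UAV image center into the tile and convert it to (lat, lon)."""
    img_w, img_h = image_size
    tile_w, tile_h = tile_size
    lat0, lon0 = tile_center_latlon
    tx, ty = apply_homography(H, img_w / 2.0, img_h / 2.0)
    east_m = (tx - tile_w / 2.0) * tile_gsd_m
    north_m = (tile_h / 2.0 - ty) * tile_gsd_m  # pixel y grows southward
    lat = lat0 + math.degrees(north_m / EARTH_RADIUS_M)
    lon = lon0 + math.degrees(east_m / (EARTH_RADIUS_M * math.cos(math.radians(lat0))))
    return lat, lon


# Illustrative homography: uniform scale 0.25 plus a translation.
H_example = [[0.25, 0.0, 130.0],
             [0.0, 0.25, 185.0],
             [0.0, 0.0, 1.0]]
lat, lon = image_center_to_gps(
    H_example, image_size=(1920, 1080), tile_size=(640, 640),
    tile_center_latlon=(48.0, 37.5), tile_gsd_m=0.2,
)
print(f"{lat:.6f}, {lon:.6f}")
```

In this example the image center lands 50 tile pixels east of the tile center, which at 0.2 m/pixel is a 10 m eastward offset from the tile's reference coordinate.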

## **7. Satellite Data Management and Coordinate Systems**

The reliability of the entire system hinges on the quality and handling of the reference map data. The restriction to "Google Maps" necessitates a rigorous approach to coordinate transformation and data freshness management.

### **7.1 Google Maps Static API and Mercator Projection**

The Google Maps Static API delivers images without embedded georeferencing metadata (GeoTIFF tags). The system must mathematically derive the bounding box of each downloaded tile to assign coordinates to the pixels. Google Maps uses the **Web Mercator Projection (EPSG:3857)**.

The system must implement the following derivation to establish the **Ground Sampling Distance (GSD)**, or meters_per_pixel, which varies significantly with latitude:

$$ \text{meters\_per\_pixel} = 156543.03392 \times \frac{\cos(\text{latitude} \times \frac{\pi}{180})}{2^{\text{zoom}}} $$

For the operational region (Ukraine, approx. Latitude 48N):

* At **Zoom Level 19**, the resolution is approximately 0.20 meters/pixel (the frequently quoted 0.30 meters/pixel is the value at the equator). This resolution is compatible with the input UAV imagery (Full HD at <1km altitude), providing sufficient detail for the LiteSAM matcher.24
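As a quick numeric check, the formula can be evaluated directly. This is a minimal sketch; the function name is illustrative, and the constant assumes the standard 256-pixel Web Mercator tile at zoom 0.

```python
import math


def meters_per_pixel(latitude_deg, zoom):
    """Web Mercator ground resolution for 256-pixel tiles at the given latitude and zoom."""
    return 156543.03392 * math.cos(math.radians(latitude_deg)) / (2 ** zoom)


equator_z19 = meters_per_pixel(0.0, 19)   # about 0.2986 m/pixel
ukraine_z19 = meters_per_pixel(48.0, 19)  # about 0.1998 m/pixel
print(f"zoom 19 at equator: {equator_z19:.4f} m/px, at 48N: {ukraine_z19:.4f} m/px")
```

The cosine term shrinks the ground resolution by roughly a third at 48N compared with the equator, so the latitude must not be dropped when converting pixel offsets to meters.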

**Bounding Box Calculation Algorithm:**

1. **Input:** Center Coordinate $(lat, lon)$, Zoom Level ($z$), Image Size $(w, h)$.  
2. **Project to World Coordinates:** Convert $(lat, lon)$ to world pixel coordinates $(px, py)$ at the given zoom level.  
3. **Corner Calculation:**  
   * $px_{NW} = px - (w / 2)$  
   * $py_{NW} = py - (h / 2)$  
4. **Inverse Projection:** Convert $(px_{NW}, py_{NW})$ back to Latitude/Longitude to get the North-West corner. Repeat for South-East.

This calculation is critical. A precision error here translates directly to a systematic bias in the final GPS output.
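The four steps can be sketched as below. The sketch assumes the standard Web Mercator pixel convention (256-pixel base tile, origin at the north-west corner of the world) and an image requested at scale 1; the function names are illustrative.

```python
import math

TILE_SIZE = 256  # pixels of the whole world at zoom 0


def latlon_to_world_pixel(lat, lon, zoom):
    """Forward Web Mercator projection to world pixel coordinates."""
    scale = TILE_SIZE * 2 ** zoom
    px = (lon + 180.0) / 360.0 * scale
    siny = math.sin(math.radians(lat))
    py = (0.5 - math.log((1 + siny) / (1 - siny)) / (4 * math.pi)) * scale
    return px, py


def world_pixel_to_latlon(px, py, zoom):
    """Inverse Web Mercator projection from world pixel coordinates."""
    scale = TILE_SIZE * 2 ** zoom
    lon = px / scale * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * py / scale))))
    return lat, lon


def tile_bounds(center_lat, center_lon, zoom, width_px, height_px):
    """Geographic bounding box of an image centered on (center_lat, center_lon)."""
    px, py = latlon_to_world_pixel(center_lat, center_lon, zoom)
    north, west = world_pixel_to_latlon(px - width_px / 2, py - height_px / 2, zoom)
    south, east = world_pixel_to_latlon(px + width_px / 2, py + height_px / 2, zoom)
    return {"north": north, "west": west, "south": south, "east": east}


bounds = tile_bounds(48.0, 37.5, zoom=19, width_px=640, height_px=640)
print(bounds)
```

Longitude is linear in pixel x, while latitude is not linear in pixel y, so the corners have to go through the inverse projection rather than a simple degrees-per-pixel scaling.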

### **7.2 Mitigating Data Obsolescence (The 2025 Problem)**

The provided research highlights that satellite imagery access over Ukraine is subject to restrictions and delays (e.g., Maxar restrictions in 2025).10 Google Maps data may be several years old.

* **Semantic Anchoring:** This reinforces the selection of **AnyLoc** (Layer 2) and **LiteSAM** (Layer 3). These algorithms are trained to ignore transient features (cars, temporary structures, vegetation color) and focus on persistent structural features (road geometry, building footprints).  
* **Seasonality:** Research indicates that DINOv2 features (used in AnyLoc) exhibit strong robustness to seasonal changes (e.g., winter satellite map vs. summer UAV flight), maintaining high retrieval recall where pixel-based methods fail.17

## **8. Optimization and State Estimation (The "Brain")**

The individual outputs of the visual layers are noisy. Layer 1 drifts over time; Layer 3 may have occasional outliers. The **Factor Graph Optimization** fuses these inputs into a coherent trajectory.

### **8.1 Handling the 350-Meter Outlier (Tilt)**

The prompt specifies that "up to 350 meters of an outlier... could happen due to tilt." This large displacement masquerading as translation is a classic source of divergence in Kalman Filters.

* **Robust Cost Functions:** In the Factor Graph, the error terms for the visual factors are wrapped in a **Robust Kernel** (specifically the **Cauchy** or **Huber** kernel).  
* *Mechanism:* Standard least-squares optimization penalizes errors quadratically ($e^2$). If a 350m error occurs, the penalty is massive, dragging the entire trajectory off-course. A robust kernel changes the penalty to be linear ($|e|$) or logarithmic after a certain threshold. This allows the optimizer to effectively "ignore" or down-weight the 350m jump if it contradicts the consensus of other measurements, treating it as a momentary outlier or solving for it as a rotation rather than a translation.19
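The effect of the two kernels can be compared numerically. This is a minimal sketch of the textbook Huber and Cauchy formulas together with their reweighting factors; the 20 m threshold is an illustrative assumption, not a tuned value.

```python
import math


def squared_loss(residual):
    """Standard least-squares penalty."""
    return 0.5 * residual ** 2


def huber_loss(residual, delta):
    """Quadratic up to delta, linear beyond it."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def cauchy_loss(residual, c):
    """Grows only logarithmically for large residuals."""
    return 0.5 * c * c * math.log1p((residual / c) ** 2)


def huber_weight(residual, delta):
    """Weight applied to the factor in iteratively reweighted least squares."""
    a = abs(residual)
    return 1.0 if a <= delta else delta / a


def cauchy_weight(residual, c):
    """Cauchy reweighting factor: decays with the squared residual."""
    return 1.0 / (1.0 + (residual / c) ** 2)


THRESHOLD_M = 20.0  # illustrative assumption
for r in (10.0, 350.0):
    print(
        f"residual {r:5.0f} m: squared={squared_loss(r):9.1f} "
        f"huber={huber_loss(r, THRESHOLD_M):7.1f} cauchy={cauchy_loss(r, THRESHOLD_M):6.1f} "
        f"w_huber={huber_weight(r, THRESHOLD_M):.3f} w_cauchy={cauchy_weight(r, THRESHOLD_M):.4f}"
    )
```

For a small residual all three penalties behave alike, while for the 350 m jump the robust kernels assign the factor a weight close to zero, so it no longer dominates the solution.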

### **8.2 The Altitude Soft Constraint**

To resolve the monocular scale ambiguity without IMU, the altitude ($h_{prior}$) is added as a **Unary Factor** to the graph.

* $E_{alt} = \| z_{est} - h_{prior} \|_{\Sigma_{alt}}$, where $z_{est}$ is the estimated altitude of the camera.
* $\Sigma_{alt}$ (covariance) is set relatively high (soft constraint), allowing the visual odometry to adjust the altitude slightly to maintain consistency, but preventing the scale from collapsing to zero or exploding to infinity. This effectively creates an **Altimeter-Aided Monocular VO** system, where the altimeter (virtual or barometric) takes over the scale-determination role that the accelerometer plays in a VIO system.5

## **9. Implementation Specifications**

### **9.1 Hardware Acceleration (TensorRT)**

Meeting the <5 second per frame requirement on an RTX 2060 requires optimizing the deep learning models. Python/PyTorch inference is typically too slow due to overhead.

* **Model Export:** All core models (SuperPoint, LightGlue, LiteSAM) must be exported to **ONNX** (Open Neural Network Exchange) format.  
* **TensorRT Compilation:** The ONNX models are then compiled into **TensorRT Engines**. This process performs graph fusion (combining multiple layers into one) and kernel auto-tuning (selecting the fastest GPU instructions for the specific RTX 2060/3070 architecture).26  
* **Precision:** The models should be quantized to **FP16** (16-bit floating point). Research shows that FP16 inference on NVIDIA RTX cards offers a 2x-3x speedup with negligible loss in matching accuracy for these specific networks.16

### **9.2 Background Service Architecture (ZeroMQ)**

The system is encapsulated as a headless service.

**ZeroMQ Topology:**

* **Socket 1 (REP - Port 5555):** Command Interface. Accepts JSON messages:  
  * {"cmd": "START", "config": {"lat": 48.1, "lon": 37.5}}  
  * {"cmd": "USER_FIX", "lat": 48.22, "lon": 37.66} (Human-in-the-loop input).  
* **Socket 2 (PUB - Port 5556):** Data Stream. Publishes JSON results for every frame:  
  * {"frame_id": 1024, "gps": [48.123, 37.123], "object_centers": [...], "status": "LOCKED", "confidence": 0.98}.
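The message formats above can be exercised without any networking. The sketch below covers only encoding and validation with the standard `json` module; the ZeroMQ sockets themselves are left out, and the helper names and validation rules are assumptions layered on top of the listed fields.

```python
import json

ALLOWED_COMMANDS = {"START", "STOP", "RESET", "USER_FIX"}


def _check_coords(lat, lon):
    """Reject missing or out-of-range coordinates."""
    if not isinstance(lat, (int, float)) or not isinstance(lon, (int, float)):
        raise ValueError("lat and lon must be numbers")
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("lat/lon out of range")


def parse_command(raw):
    """Decode and validate one command message (bytes or str) from the command socket."""
    msg = json.loads(raw)
    if not isinstance(msg, dict) or msg.get("cmd") not in ALLOWED_COMMANDS:
        raise ValueError("unknown or malformed command")
    if msg["cmd"] == "START":
        config = msg.get("config") or {}
        _check_coords(config.get("lat"), config.get("lon"))
    elif msg["cmd"] == "USER_FIX":
        _check_coords(msg.get("lat"), msg.get("lon"))
    return msg


def make_result(frame_id, lat, lon, status, confidence, object_centers=()):
    """Encode one per-frame result for the data stream."""
    return json.dumps({
        "frame_id": frame_id,
        "gps": [lat, lon],
        "object_centers": list(object_centers),
        "status": status,
        "confidence": confidence,
    }).encode("utf-8")


start = parse_command(b'{"cmd": "START", "config": {"lat": 48.1, "lon": 37.5}}')
payload = make_result(1024, 48.123, 37.123, "LOCKED", 0.98)
print(start["cmd"], json.loads(payload)["gps"])
```

Validating commands before they reach the estimator keeps a malformed USER_FIX from being injected into the graph as an anchor.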

**Asynchronous Pipeline:**

The system utilizes a Python multiprocessing architecture. One process handles the camera/image ingest and ZeroMQ communication. A second process hosts the TensorRT engines and runs the Factor Graph. This ensures that the heavy computation of Bundle Adjustment does not block the receipt of new images or user commands.

## **10. Human-in-the-Loop Strategy**

The requirement stipulates that for the "20% of the route" where automation fails, the user must intervene. The system must proactively detect its own failure.

### **10.1 Failure Detection with PDM@K**

The system monitors the **PDM@K** (Positioning Distance Measurement) metric continuously.

* **Definition:** PDM@K measures the percentage of queries localized within $K$ meters.3  
* **Real-Time Proxy:** In flight, we cannot know the true PDM (as we don't have ground truth). Instead, we use the **Marginal Covariance** from the Factor Graph. If the uncertainty ellipse for the current position grows larger than a radius of 50 meters, or if the inlier ratio of the LightGlue/LiteSAM matches drops below 10% for 3 consecutive frames, the system triggers a **Critical Failure Mode**.19
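The trigger rule can be sketched as a small stateful check. The thresholds come from the text above; the class name, the idea of a single scalar "uncertainty radius" summarizing the covariance, and the reset behavior of the streak counter are assumptions.

```python
class FailureMonitor:
    """Flags Critical Failure Mode from per-frame uncertainty and match quality."""

    def __init__(self, max_radius_m=50.0, min_inlier_ratio=0.10, patience=3):
        self.max_radius_m = max_radius_m
        self.min_inlier_ratio = min_inlier_ratio
        self.patience = patience
        self._low_streak = 0  # consecutive frames below the inlier threshold

    def update(self, uncertainty_radius_m, inlier_ratio):
        """Return True if the current frame should trigger a user-input request."""
        if inlier_ratio < self.min_inlier_ratio:
            self._low_streak += 1
        else:
            self._low_streak = 0
        too_uncertain = uncertainty_radius_m > self.max_radius_m
        too_few_matches = self._low_streak >= self.patience
        return too_uncertain or too_few_matches


monitor = FailureMonitor()
# (uncertainty radius in meters, inlier ratio) for six consecutive frames
frames = [(12.0, 0.45), (18.0, 0.08), (25.0, 0.07), (31.0, 0.40), (60.0, 0.35), (20.0, 0.05)]
decisions = [monitor.update(radius, ratio) for radius, ratio in frames]
print(decisions)
```

In this sequence only the fifth frame triggers, because its radius exceeds 50 m; the two low-inlier frames before it do not reach the 3-frame streak.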

### **10.2 The User Interaction Workflow**

1. **Trigger:** Critical Failure Mode activated.  
2. **Action:** The Service publishes a status {"status": "REQ_INPUT"} via ZeroMQ.  
3. **Data Payload:** It sends the current UAV image and the top-3 retrieved satellite tiles (from Layer 2) to the client UI.  
4. **User Input:** The user clicks a distinctive feature (e.g., a specific crossroad) in the UAV image and the corresponding point on the satellite map.  
5. **Recovery:** This pair of points is treated as a **Hard Constraint** in the Factor Graph. The optimizer immediately snaps the trajectory to this user-defined anchor, resetting the covariance and effectively "healing" the localized track.19

## **11. Performance Evaluation and Benchmarks**

### **11.1 Accuracy Validation**

Based on the reported performance of the selected components in relevant datasets (UAV-VisLoc, AnyVisLoc):

* **LiteSAM** demonstrates an accuracy of 17.86m (RMSE) for cross-view matching. This aligns with the requirement that 60% of photos be within 20m error.18  
* **AnyLoc** achieves high recall rates (Top-1 Recall > 85% on aerial benchmarks), supporting the recovery from sharp turns.2  
* **Factor Graph Fusion:** By combining sequential and global measurements, the overall system error is expected to be lower than the individual component errors, satisfying the "80% within 50m" criterion.

### **11.2 Latency Analysis**

The breakdown of processing time per frame on an RTX 3070 is estimated as follows:

* **SuperPoint + LightGlue:** \~50ms.1  
* **AnyLoc (Global Retrieval):** \~150ms (run only on keyframes or tracking loss).  
* **LiteSAM (Metric Refinement):** \~60ms.1  
* **Factor Graph Optimization:** \~100ms (using incremental updates/iSAM2).  
* **Total:** \~360ms per frame (worst case with all layers active).

This is an order of magnitude faster than the 5-second limit, providing ample headroom for higher resolution processing or background tasks.

## **12.0 ASTRAL-Next Validation Plan and Acceptance Criteria Matrix**

A comprehensive test plan is required to validate compliance with all 10 Acceptance Criteria. The foundation is a **Ground-Truth Test Harness** using project-provided ground-truth data.

### **Table 4: ASTRAL Component vs. Acceptance Criteria Compliance Matrix**

| ID | Requirement | ASTRAL Solution (Component) | Key Technology / Justification |
| :---- | :---- | :---- | :---- |
| **AC-1** | 80% of photos < 50m error | GDB (C-1) + GAB (C-5) + TOH (C-6) | **Tier-1 (Copernicus)** data 1 is sufficient. SOTA VPR 8 + Sim(3) graph 13 can achieve this. |
| **AC-2** | 60% of photos < 20m error | GDB (C-1) + GAB (C-5) + TOH (C-6) | **Requires Tier-2 (Commercial) Data**.4 Mitigates reference error.3 **Per-Keyframe Scale** 15 model in TOH minimizes drift error. |
| **AC-3** | Robust to 350m outlier | V-SLAM (C-3) + TOH (C-6) | **Stage 2 Failure Logic** (7.3) discards the frame. **Robust M-Estimation** (6.3) in Ceres 14 automatically rejects the constraint. |
| **AC-4** | Robust to sharp turns (<5% overlap) | V-SLAM (C-3) + TOH (C-6) | **"Atlas" Multi-Map** (4.2) initializes new map (Map_Fragment_k+1). **Geodetic Map-Merging** (6.4) in TOH re-connects fragments via GAB anchors. |
| **AC-5** | < 10% outlier anchors | TOH (C-6) | **Robust M-Estimation (Huber Loss)** (6.3) in Ceres 14 automatically down-weights and ignores high-residual (bad) GAB anchors. |
| **AC-6** | Connect route chunks; User input | V-SLAM (C-3) + TOH (C-6) + UI | **Geodetic Map-Merging** (6.4) connects chunks. **Stage 5 Failure Logic** (7.3) provides the user-input-as-prior mechanism. |
| **AC-7** | < 5 seconds processing/image | All Components | **Multi-Scale Pipeline** (5.3) (Low-Res V-SLAM, Hi-Res GAB patches). **Mandatory TensorRT Acceleration** (7.1) for 2-4x speedup.35 |
| **AC-8** | Real-time stream + async refinement | TOH (C-6) + Outputs (C-2.4) | Decoupled architecture provides Pose_N_Est (V-SLAM) in real-time and Pose_N_Refined (TOH) asynchronously as GAB anchors arrive. |
| **AC-9** | Image Registration Rate > 95% | V-SLAM (C-3) | **"Atlas" Multi-Map** (4.2). A "lost track" (AC-4) is *not* a registration failure; it's a *new map registration*. This ensures the rate > 95%. |
| **AC-10** | Mean Reprojection Error (MRE) < 1.0px | V-SLAM (C-3) + TOH (C-6) | Local BA (4.3) + Global BA (TOH14) + **Per-Keyframe Scale** (6.2) minimizes internal graph tension (Flaw 1.3), allowing the optimizer to converge to a low MRE. |
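The down-weighting behaviour that the matrix attributes to Huber loss (AC-3, AC-5) can be illustrated in isolation. The sketch below is not the Ceres implementation; it only shows the standard Huber weight as used in iteratively reweighted least squares. The function name `huber_weight` and the 25 m threshold are assumptions chosen for illustration.

```python
def huber_weight(residual_m: float, delta_m: float) -> float:
    """Huber IRLS weight: 1 inside the threshold, delta/|r| outside it."""
    if delta_m <= 0:
        raise ValueError("delta_m must be positive")
    r = abs(residual_m)
    return 1.0 if r <= delta_m else delta_m / r


# Hypothetical anchor residuals in metres; the last one mimics a 350 m outlier.
residuals_m = [4.0, 12.0, 20.0, 350.0]
weights = [huber_weight(r, delta_m=25.0) for r in residuals_m]
for r, w in zip(residuals_m, weights):
    print(f"residual {r:6.1f} m -> weight {w:.3f}")
```

Note that the Huber weight shrinks toward zero but never reaches it, so a bad anchor is strongly down-weighted rather than removed outright; fully discarding a frame is a separate decision made by the failure logic.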

### **8.1 Rigorous Validation Methodology**

* **Test Harness:** A validation script will be created to compare the system's Pose_N^{Refined} output against a ground-truth coordinates.csv file, computing Haversine distance errors.
* **Test Datasets:**
    * Test_Baseline: Standard flight.
    * Test_Outlier_350m (AC-3): A single, unrelated image inserted.
    * Test_Sharp_Turn_5pct (AC-4): A sequence with a 10-frame gap.
    * Test_Long_Route (AC-9, AC-7): A 2000-image sequence.
* **Test Cases:**
    * Test_Accuracy: Run Test_Baseline. ASSERT (count(errors < 50m) / total) >= 0.80 (AC-1). ASSERT (count(errors < 20m) / total) >= 0.60 (AC-2).
    * Test_Robustness: Run Test_Outlier_350m and Test_Sharp_Turn_5pct. ASSERT system completes the run and Test_Accuracy assertions still pass on the valid frames.
    * Test_Performance: Run Test_Long_Route on min-spec RTX 2060. ASSERT average_time(Pose_N^{Est} output) < 5.0s (AC-7).
    * Test_MRE: ASSERT TOH.final_MRE < 1.0 (AC-10).
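The Test_Accuracy case above could be sketched as follows. This is a minimal illustration, not the project's harness: the function names, the in-memory dictionaries standing in for coordinates.csv, and the mean Earth radius of 6,371,000 m are all assumptions of this sketch.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def fraction_within(errors_m: list[float], threshold_m: float) -> float:
    """Share of frames whose error is strictly below the threshold."""
    if not errors_m:
        raise ValueError("no errors to evaluate")
    return sum(e < threshold_m for e in errors_m) / len(errors_m)


def check_accuracy(estimated: dict, ground_truth: dict) -> dict:
    """Evaluate AC-1 and AC-2 for poses keyed by image name -> (lat, lon)."""
    missing = set(ground_truth) - set(estimated)
    if missing:
        raise KeyError(f"no estimate for: {sorted(missing)}")
    errors = [haversine_m(*estimated[k], *ground_truth[k]) for k in ground_truth]
    return {
        "ac1_pass": fraction_within(errors, 50.0) >= 0.80,
        "ac2_pass": fraction_within(errors, 20.0) >= 0.60,
        "errors_m": errors,
    }


# Hypothetical two-frame example; real runs would load coordinates.csv instead.
truth = {"img_0001.jpg": (48.0, 37.0), "img_0002.jpg": (48.001, 37.0)}
estimate = {"img_0001.jpg": (48.0, 37.0), "img_0002.jpg": (48.001, 37.0001)}
report = check_accuracy(estimate, truth)
print(report["ac1_pass"], report["ac2_pass"])
```

Raising on missing estimates is a deliberate choice in this sketch: silently skipping unregistered frames would inflate the accuracy percentages.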

Put all findings about what was weak or poor at the beginning of the report. Include there all new findings: what was updated, replaced, or removed from the previous solution.

Then form a new solution design without referencing the previous system. Remove Poor and Very Poor component choices from the component analysis tables, but leave Good and Excellent ones. In the updated report, do not put "new" marks and do not compare to the previous solution draft. Write the new solution as if from scratch.

Also, investigate these ideas:

* A Cross-View Geo-Localization Algorithm Using UAV Image https://www.mdpi.com/1424-8220/24/12/3719
* Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation condition https://arxiv.org/pdf/2503.10692

Find more work like these. Assess each idea and try to either integrate it into the current solution draft or use it to replace some of the draft's components.
