Went through 4 iterations of the solution draft. It is now more or less consistent and reliable.

This commit is contained in:
Oleksandr Bezdieniezhnykh
2025-11-10 20:26:40 +02:00
parent 044a90b96f
commit e87c33b0ee
21 changed files with 2323 additions and 2055 deletions
@@ -0,0 +1,64 @@
Research this problem:
We have a lot of images taken from a wing-type UAV using a camera with at least Full HD resolution. The resolution of each photo can be up to 6200*4100 for a whole flight; for other flights it may be Full HD.
Photos are taken and named consecutively within 100 meters of each other.
We know only the starting GPS coordinates. We need to determine the GPS of the centers of each image. And also the coordinates of the center of any object in these photos. We can use an external satellite provider for ground checks on the existing photos
The system should process data samples in the attached files (if any). They are for reference only.
- We have the following restrictions:
- Photos are taken by only airplane type UAVs.
- Photos are taken by the camera pointing downwards and fixed, but it is not autostabilized.
- The flying range is restricted by the eastern and southern parts of Ukraine (To the left of the Dnipro River)
- The image resolution could be from FullHD to 6252*4168
- Altitude is predefined and no more than 1km
- There is NO data from IMU
- Flights are done mostly in sunny weather
- We can use satellite providers, but we're limited right now to Google Maps, which could be outdated for some regions
- Number of photos could be up to 3000, usually in the 500-1500 range
- During the flight, UAVs can make sharp turns, so that the next photo may be absolutely different from the previous one (no same objects), but it is rather an exception than the rule
- Processing is done on a stationary computer or laptop with NVidia GPU at least RTX2060, better 3070. (For the UAV solution Jetson Orin Nano would be used, but that is out of scope.)
- Output of our system should meet these acceptance criteria:
- The system should find out the GPS of centers of 80% of the photos from the flight within an error of no more than 50 meters in comparison to the real GPS
- The system should find out the GPS of centers of 60% of the photos from the flight within an error of no more than 20 meters in comparison to the real GPS
- The system should correctly continue the work even in the presence of an outlier photo displaced by up to 350 meters between 2 consecutive pictures en route. This could happen due to tilt of the plane.
- The system should correctly continue the work even during sharp turns, where the next photo doesn't overlap at all, or overlaps by less than 5%. The next photo will be within a drift of less than 150 m and at an angle of less than 50°
- The number of outliers during the satellite provider images ground check should be less than 10%
- If the system is absolutely incapable of determining the GPS of the next, second-next, and third-next images by any means (these fall within the 20% of the route allowed to fail), it should ask the user for input so that the user can specify the location of the next image
- Less than 5 seconds for processing one image
- Results of image processing should appear to the user immediately, so that the user shouldn't have to wait for the whole route to complete in order to analyze first results. Also, the system could refine existing calculated results and send the refined results to the user again
- Image Registration Rate > 95%. The system can find enough matching features to confidently calculate the camera's 6-DoF pose (position and orientation) and "stitch" that image into the final trajectory
- Mean Reprojection Error (MRE) < 1.0 pixels. The distance, in pixels, between the original pixel location of the object and the re-projected pixel location.
- Find out all the state-of-the-art solutions for this problem and produce the resulting solution draft in the next format:
- Short Product solution description. Brief component interaction diagram.
- Architecture approach that meets restrictions and acceptance criteria. For each component, analyze the best possible approaches to solve it, and form a table comprising all approaches. Each new approach is a row, with the following columns:
- Tools (library, platform) to solve component tasks
- Advantages of this approach
- Limitations of this approach
- Requirements for this approach
- How it fits the problem component to be solved, and the whole solution
- Testing strategy. Research the best approaches to cover all the acceptance criteria. Form a list of integration functional tests and non-functional tests.
Be concise in formulating. The fewer words, the better, but do not miss any important details.
@@ -0,0 +1,329 @@
Read carefully about the problem:
We have a lot of images taken from a wing-type UAV using a camera with at least Full HD resolution. The resolution of each photo can be up to 6200*4100 for a whole flight; for other flights it may be Full HD.
Photos are taken and named consecutively within 100 meters of each other.
We know only the starting GPS coordinates. We need to determine the GPS of the centers of each image. And also the coordinates of the center of any object in these photos. We can use an external satellite provider for ground checks on the existing photos
The system has the following restrictions and conditions:
- Photos are taken by only airplane type UAVs.
- Photos are taken by the camera pointing downwards and fixed, but it is not autostabilized.
- The flying range is restricted by the eastern and southern parts of Ukraine (To the left of the Dnipro River)
- The image resolution could be from FullHD to 6252*4168. Camera parameters are known: focal length, sensor width, resolution and so on.
- Altitude is predefined and no more than 1km. The height of the terrain can be neglected.
- There is NO data from IMU
- Flights are done mostly in sunny weather
- We can use satellite providers, but we're limited right now to Google Maps, which could be outdated for some regions
- Number of photos could be up to 3000, usually in the 500-1500 range
- During the flight, UAVs can make sharp turns, so that the next photo may be absolutely different from the previous one (no same objects), but it is rather an exception than the rule
- Processing is done on a stationary computer or laptop with NVidia GPU at least RTX2060, better 3070. (For the UAV solution Jetson Orin Nano would be used, but that is out of scope.)
The output of the system should address the following acceptance criteria:
- The system should find out the GPS of centers of 80% of the photos from the flight within an error of no more than 50 meters in comparison to the real GPS
- The system should find out the GPS of centers of 60% of the photos from the flight within an error of no more than 20 meters in comparison to the real GPS
- The system should correctly continue the work even in the presence of an outlier photo displaced by up to 350 meters between 2 consecutive pictures en route. This could happen due to tilt of the plane.
- The system should correctly continue the work even during sharp turns, where the next photo doesn't overlap at all, or overlaps by less than 5%. The next photo will be within a drift of less than 150 m and at an angle of less than 50°
- The number of outliers during the satellite provider images ground check should be less than 10%
- If the system is absolutely incapable of determining the GPS of the next, second-next, and third-next images by any means (these fall within the 20% of the route allowed to fail), it should ask the user for input so that the user can specify the location of the next image
- Less than 5 seconds for processing one image
- Results of image processing should appear to the user immediately, so that the user shouldn't have to wait for the whole route to complete in order to analyze first results. Also, the system could refine existing calculated results and send the refined results to the user again
- Image Registration Rate > 95%. The system can find enough matching features to confidently calculate the camera's 6-DoF pose (position and orientation) and "stitch" that image into the final trajectory
- Mean Reprojection Error (MRE) < 1.0 pixels. The distance, in pixels, between the original pixel location of the object and the re-projected pixel location.
Here is a solution draft:
# **GEo-Referenced Trajectory and Object Localization System (GEORTOLS): A Hybrid SLAM Architecture**
## **1. Executive Summary**
This report outlines the technical design for a robust, real-time geolocalization system. The objective is to determine the precise GPS coordinates for a sequence of high-resolution images (up to 6252x4168) captured by a fixed-wing, non-stabilized Unmanned Aerial Vehicle (UAV) [User Query]. The system must operate under severe constraints, including the absence of any IMU data, a predefined altitude of no more than 1km, and knowledge of only the starting GPS coordinate [User Query]. The system is required to handle significant in-flight challenges, such as sharp turns with minimal image overlap (<5%), frame-to-frame outliers of up to 350 meters, and operation over low-texture terrain as seen in the provided sample images [User Query, Image 1, Image 7].
The proposed solution is a **Hybrid Visual-Geolocalization SLAM (VG-SLAM)** architecture. This system is designed to meet the demanding acceptance criteria, including a sub-5-second initial processing time per image, streaming output with asynchronous refinement, and high-accuracy GPS localization (60% of photos within 20m error, 80% within 50m error) [User Query].
This hybrid architecture is necessitated by the problem's core constraints. The lack of an IMU makes a purely monocular Visual Odometry (VO) system susceptible to catastrophic scale drift.1 Therefore, the system integrates two cooperative sub-systems:
1. A **Visual Odometry (VO) Front-End:** This component uses state-of-the-art deep-learning feature matchers (SuperPoint + SuperGlue/LightGlue) to provide fast, real-time *relative* pose estimates. This approach is selected for its proven robustness in low-texture environments where traditional features fail.4 This component delivers the initial, sub-5-second pose estimate.
2. A **Cross-View Geolocalization (CVGL) Module:** This component provides *absolute*, drift-free GPS pose estimates by matching UAV images against the available satellite provider (Google Maps).7 It functions as the system's "global loop closure" mechanism, correcting the VO's scale drift and, critically, relocalizing the UAV after tracking is lost during sharp turns or outlier frames [User Query].
These two systems run in parallel. A **Back-End Pose-Graph Optimizer** fuses their respective measurements—high-frequency relative poses from VO and high-confidence absolute poses from CVGL—into a single, globally consistent, and incrementally refined trajectory. This architecture directly satisfies the requirements for immediate, streaming results and subsequent asynchronous refinement [User Query].
## **2. Product Solution Description and Component Interaction**
### **Product Solution Description**
The proposed system, "GEo-Referenced Trajectory and Object Localization System (GEORTOLS)," is a real-time, streaming-capable software solution. It is designed for deployment on a stationary computer or laptop equipped with an NVIDIA GPU (RTX 2060 or better) [User Query].
* **Inputs:**
1. A sequence of consecutively named monocular images (FullHD to 6252x4168).
2. The absolute GPS coordinate (Latitude, Longitude) of the *first* image in the sequence.
3. A pre-calibrated camera intrinsic matrix.
4. Access to the Google Maps satellite imagery API.
* **Outputs:**
1. A real-time, streaming feed of estimated GPS coordinates (Latitude, Longitude, Altitude) and 6-DoF poses (including Roll, Pitch, Yaw) for the center of each image.
2. Asynchronous refinement messages for previously computed poses as the back-end optimizer improves the global trajectory.
3. A service to provide the absolute GPS coordinate for any user-selected pixel coordinate (u,v) within any geolocated image.
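The third output reduces to ground-sampling-distance (GSD) arithmetic once an image's center pose is known. A minimal sketch, assuming a nadir pinhole camera with known focal length and sensor width, and ignoring the aircraft's yaw (which the real system would rotate the offset by before conversion); all parameter values below are illustrative defaults, not the project's actual calibration:

```python
import math

def pixel_to_gps(u, v, center_lat, center_lon,
                 altitude_m=500.0, focal_mm=35.0,
                 sensor_width_mm=36.0, img_w=6252, img_h=4168):
    """Map a pixel (u, v) to GPS, assuming a nadir pinhole camera whose
    image centre sits at (center_lat, center_lon). Yaw is ignored here."""
    # Ground sampling distance (metres per pixel) from similar triangles.
    gsd = altitude_m * sensor_width_mm / (focal_mm * img_w)
    east = (u - img_w / 2.0) * gsd           # +u assumed to point east
    north = -(v - img_h / 2.0) * gsd         # +v points down in the image
    # Small-offset metres-to-degrees conversion (adequate for <1 km offsets).
    dlat = north / 111_320.0
    dlon = east / (111_320.0 * math.cos(math.radians(center_lat)))
    return center_lat + dlat, center_lon + dlon
```

In the full system the (east, north) offset would first be rotated by the optimized yaw from the pose graph, since the camera's pixel axes are not aligned with the cardinal directions.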
### **Component Interaction Diagram**
The system is architected as four asynchronous, parallel-processing components to meet the stringent real-time and refinement requirements.
1. **Image Ingestion & Pre-processing:** This module acts as the entry point. It receives the new, high-resolution image (Image N). It immediately creates scaled-down, lower-resolution (e.g., 1024x768) copies of the image for real-time processing by the VO and CVGL modules, while retaining the full-resolution original for object-level GPS lookups.
2. **Visual Odometry (VO) Front-End:** This module's sole task is high-speed, frame-to-frame relative pose estimation. It maintains a short-term "sliding window" of features, matching Image N to Image N-1. It uses GPU-accelerated deep-learning models (SuperPoint + SuperGlue) to find feature matches and calculates the 6-DoF relative transform. This result is immediately sent to the Back-End.
3. **Cross-View Geolocalization (CVGL) Module:** This is a heavier, slower, asynchronous module. It takes the pre-processed Image N and queries the Google Maps database to find an *absolute* GPS pose. This involves a two-stage retrieval-and-match process. When a high-confidence match is found, its absolute pose is sent to the Back-End as a "global-pose constraint."
4. **Trajectory Optimization Back-End:** This is the system's central "brain," managing the complete pose graph.10 It receives two types of data:
* *High-frequency, low-confidence relative poses* from the VO Front-End.
* Low-frequency, high-confidence absolute poses from the CVGL Module.
It continuously fuses these constraints in a pose-graph optimization framework (e.g., g2o or Ceres Solver). When the VO Front-End provides a new relative pose, it is quickly added to the graph to produce the "Initial Pose" (<5s). When the CVGL Module provides a new absolute pose, it triggers a more comprehensive re-optimization of the entire graph, correcting drift and broadcasting "Refined Poses" to the user.11
## **3. Core Architectural Framework: Hybrid Visual-Geolocalization SLAM (VG-SLAM)**
### **Rationale for the Hybrid Approach**
The core constraints of this problem—monocular, IMU-less flight over potentially long distances (up to 3000 images at ~100m intervals equates to a 300km flight) [User Query]—render simple solutions unviable.
A **VO-Only** system is guaranteed to fail. Monocular Visual Odometry (and SLAM) suffers from an inherent, unobservable ambiguity: the *scale* of the world.1 Because there is no IMU to provide an accelerometer-based scale reference or a gravity vector 12, the system has no way to know if it moved 1 meter or 10 meters. This leads to compounding scale drift, where the entire trajectory will grow or shrink over time.3 Over a 300km flight, the resulting positional error would be measured in kilometers, not the 20-50 meters required [User Query].
A **CVGL-Only** system is also unviable. Cross-View Geolocalization (CVGL) matches the UAV image to a satellite map to find an absolute pose.7 While this is drift-free, it is a large-scale image retrieval problem. Querying the entire map of Ukraine for a match for every single frame is computationally impossible within the <5 second time limit.13 Furthermore, this approach is brittle; if the Google Maps data is outdated (a specific user restriction) [User Query], the CVGL match will fail, and the system would have no pose estimate at all.
Therefore, the **Hybrid VG-SLAM** architecture is the only robust solution.
* The **VO Front-End** provides the fast, high-frequency relative motion. It works even if the satellite map is outdated, as it tracks features in the *real*, current world.
* The **CVGL Module** acts as the *only* mechanism for scale correction and absolute georeferencing. It provides periodic, drift-free "anchors" to the real-world GPS coordinates.
* The **Back-End Optimizer** fuses these two data streams. The CVGL poses function as "global loop closures" in the SLAM pose graph. They correct the scale drift accumulated by the VO and, critically, serve to relocalize the system after a "kidnapping" event, such as the specified sharp turns or 350m outliers [User Query].
### **Data Flow for Streaming and Refinement**
This architecture is explicitly designed to meet the <5s initial output and asynchronous refinement criteria [User Query]. The data flow for a single image (Image N) is as follows:
* **T = 0.0s:** Image N (6200x4100) is received by the **Ingestion Module**.
* **T = 0.2s:** Image N is pre-processed (scaled to 1024px) and passed to the VO and CVGL modules.
* **T = 1.0s:** The **VO Front-End** completes GPU-accelerated matching (SuperPoint+SuperGlue) of Image N -> Image N-1. It computes the Relative_Pose(N-1 -> N).
* **T = 1.1s:** The **Back-End Optimizer** receives this Relative_Pose. It appends this pose to the graph relative to the last known pose of N-1.
* **T = 1.2s:** The Back-End broadcasts the **Initial Pose_N_Est** to the user interface. (**<5s criterion met**).
* **(Parallel Thread) T = 1.5s:** The **CVGL Module** (on a separate thread) begins its two-stage search for Image N against the Google Maps database.
* **(Parallel Thread) T = 6.0s:** The CVGL Module successfully finds a high-confidence Absolute_Pose_N_Abs from the satellite match.
* **T = 6.1s:** The **Back-End Optimizer** receives this new, high-confidence absolute constraint for Image N.
* **T = 6.2s:** The Back-End triggers a graph re-optimization. This new "anchor" corrects any scale or positional drift for Image N and all surrounding poses in the graph.
* **T = 6.3s:** The Back-End broadcasts a **Pose_N_Refined** (and Pose_N-1_Refined, Pose_N-2_Refined, etc.) to the user interface. (**Refinement criterion met**).
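The fast-initial/slow-refined pattern above can be sketched with two worker threads feeding a shared result stream. A toy, stdlib-only illustration: poses are collapsed to 1-D scalars, `vo_worker` and `cvgl_worker` are stand-ins for the real modules, and the constant `offset` shift stands in for a true graph re-optimization:

```python
import queue
import threading
import time

results = queue.Queue()   # streamed to the user interface
graph = {}                # frame id -> current pose estimate
lock = threading.Lock()

def vo_worker(frames):
    """Fast path: one relative pose per frame, initial estimate within seconds."""
    pose = 0.0
    for n, motion in frames:
        pose += motion                      # stand-in for the VO relative pose
        with lock:
            graph[n] = pose
        results.put(("initial", n, pose))

def cvgl_worker(fixes):
    """Slow path: an absolute fix triggers refinement of all past poses."""
    for n, absolute in fixes:
        time.sleep(0.01)                    # stands for the slow satellite search
        with lock:
            offset = absolute - graph[n]    # correction from the absolute anchor
            for k in graph:
                graph[k] += offset          # toy 'graph re-optimization'
                results.put(("refined", k, graph[k]))

vo = threading.Thread(target=vo_worker, args=([(1, 100.0), (2, 100.0), (3, 100.0)],))
cv = threading.Thread(target=cvgl_worker, args=([(2, 195.0)],))
vo.start(); vo.join()   # sequential here for determinism; concurrent in practice
cv.start(); cv.join()
```

The consumer of `results` sees every frame's initial pose first, then refined poses for the same frames once the absolute anchor arrives, matching the streaming-then-refinement contract.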
## **4. Component Analysis: Front-End (Visual Odometry and Relocalization)**
The task of the VO Front-End is to rapidly and robustly estimate the 6-DoF relative motion between consecutive frames. This component's success is paramount for the high-frequency tracking required to meet the <5s criterion.
The primary challenge is the nature of the imagery. The specified operational area and sample images (e.g., Image 1, Image 7) show vast, low-texture agricultural fields [User Query]. These environments are a known failure case for traditional, gradient-based feature extractors like SIFT or ORB, which rely on high-gradient corners and cannot find stable features in "weak texture areas".5 Furthermore, the non-stabilized camera [User Query] will introduce significant rotational motion and viewpoint change, breaking the assumptions of many simple trackers.16
Deep-learning (DL) based feature extractors and matchers have been developed specifically to overcome these "challenging visual conditions".5 Models like SuperPoint, SuperGlue, and LoFTR are trained to find more robust and repeatable features, even in low-texture scenes.4
### **Table 1: Analysis of State-of-the-Art Feature Extraction and Matching Techniques**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **SIFT + BFMatcher/FLANN** (OpenCV) | - Scale and rotation invariant. - High-quality, robust matches. - Well-studied and mature.15 | - Computationally slow (CPU-based). - Poor performance in low-texture or weakly-textured areas.14 - Patented (though expired). | - High-contrast, well-defined features. | **Poor.** Too slow for the <5s target and will fail to find features in the low-texture agricultural landscapes shown in sample images. |
| **ORB + BFMatcher** (OpenCV) | - Extremely fast and lightweight. - Standard for real-time SLAM (e.g., ORB-SLAM).21 - Rotation invariant. | - *Not* scale invariant (uses a pyramid). - Performs very poorly in low-texture scenes.5 - Unstable in high-blur scenarios. | - CPU, lightweight. - High-gradient corners. | **Very Poor.** While fast, it fails on the *robustness* requirement. It is designed for textured, indoor/urban scenes, not sparse, natural terrain. |
| **SuperPoint + SuperGlue** (PyTorch, C++/TensorRT) | - SOTA robustness in low-texture, high-blur, and challenging conditions.4 - End-to-end learning for detection and matching.24 - Multiple open-source SLAM integrations exist (e.g., SuperSLAM).25 | - Requires a powerful GPU for real-time performance. - Sparse feature-based (not dense). | - NVIDIA GPU (RTX 2060+). - PyTorch (research) or TensorRT (deployment).26 | **Excellent.** This approach is *designed* for the exact "challenging conditions" of this problem. It provides SOTA robustness in low-texture scenes.4 The user's hardware (RTX 2060+) meets the requirements. |
| **LoFTR** (PyTorch) | - Detector-free dense matching.14 - Extremely robust to viewpoint and texture challenges.14 - Excellent performance on natural terrain and low-overlap images.19 | - High computational and VRAM cost. - Can cause CUDA Out-of-Memory (OOM) errors on very high-resolution images.30 - Slower than sparse-feature methods. | - High-end NVIDIA GPU. - PyTorch. | **Good, but Risky.** While its robustness is excellent, its dense, Transformer-based nature makes it vulnerable to OOM errors on the 6252x4168 images.30 The sparse SuperPoint approach is a safer, more-scalable choice for the VO front-end. |
### **Selected Approach (VO Front-End): SuperPoint + SuperGlue/LightGlue**
The selected approach is a VO front-end based on **SuperPoint** for feature extraction and **SuperGlue** (or its faster successor, **LightGlue**) for matching.18
* **Robustness:** This combination is proven to provide superior robustness and accuracy in sparse-texture scenes, extracting more and higher-quality matches than ORB.4
* **Performance:** It is designed for GPU acceleration and is used in SOTA real-time SLAM systems, demonstrating its feasibility within the <5s target on an RTX 2060.25
* **Scalability:** As a sparse-feature method, it avoids the memory-scaling issues of dense matchers like LoFTR when faced with the user's maximum 6252x4168 resolution.30 The image can be downscaled for real-time VO, and SuperPoint will still find stable features.
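Downstream of the matcher, and for near-nadir frames, the frame-to-frame motion can be approximated as a 2-D rotation plus translation in pixel space, estimated from the matched keypoints by a least-squares Procrustes/Kabsch fit. A numpy sketch under that planar assumption (the full system recovers a 6-DoF pose; this only illustrates the least-squares step on synthetic matches):

```python
import numpy as np

def rigid_transform_2d(src, dst):
    """Least-squares 2D rotation R and translation t with dst ≈ R @ src + t
    (Kabsch/Procrustes via SVD). src, dst: (N, 2) matched keypoint arrays."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)       # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

# Synthetic check: rotate 'matched keypoints' by 30 degrees and shift them.
rng = np.random.default_rng(0)
src = rng.uniform(0, 1000, (50, 2))
theta = np.radians(30)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
dst = src @ R_true.T + np.array([120.0, -40.0])
R, t = rigid_transform_2d(src, dst)
```

In practice this fit would sit inside a RANSAC loop over the SuperGlue correspondences, since real match sets contain outliers.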
## **5. Component Analysis: Back-End (Trajectory Optimization and Refinement)**
The task of the Back-End is to fuse all incoming measurements (high-frequency/low-accuracy relative VO poses, low-frequency/high-accuracy absolute CVGL poses) into a single, globally consistent trajectory. This component's design is dictated by the user's real-time streaming and refinement requirements [User Query].
A critical architectural choice must be made between a traditional, batch **Structure from Motion (SfM)** pipeline and a real-time **SLAM (Simultaneous Localization and Mapping)** pipeline.
* **Batch SfM:** (e.g., COLMAP).32 This approach is an offline process. It collects all 1500-3000 images, performs feature matching, and then runs a large, non-real-time "Bundle Adjustment" (BA) to solve for all camera poses and 3D points simultaneously.35 While this produces the most accurate possible result, it can take hours to compute. It *cannot* meet the <5s/image or "immediate results" criteria.
* **Real-time SLAM:** (e.g., ORB-SLAM3).28 This approach is *online* and *incremental*. It maintains a "pose graph" of the trajectory.10 It provides an immediate pose estimate based on the VO front-end. When a new, high-quality measurement arrives (like a loop closure 37, or in our case, a CVGL fix), it triggers a fast re-optimization of the graph, publishing a *refined* result.11
The user's requirements for "results...appear immediately" and "system could refine existing calculated results" [User Query] are a textbook description of a real-time SLAM back-end.
### **Table 2: Analysis of Trajectory Optimization Strategies**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **Incremental SLAM (Pose-Graph Optimization)** (g2o, Ceres Solver, GTSAM) | - **Real-time / Online:** Provides immediate pose estimates. - **Supports Refinement:** Explicitly designed to refine past poses when new "loop closure" (CVGL) data arrives.10 - Meets the <5s and streaming criteria. | - Initial estimate is less accurate than a full batch process. - Susceptible to drift *until* a loop closure (CVGL fix) is made. | - A graph optimization library (g2o, Ceres). - A robust cost function to reject outliers. | **Excellent.** This is the *only* architecture that satisfies the user's real-time streaming and asynchronous refinement constraints. |
| **Batch Structure from Motion (Global Bundle Adjustment)** (COLMAP, Agisoft Metashape) | - **Globally Optimal Accuracy:** Produces the most accurate possible 3D reconstruction and trajectory.35 - Can import custom DL matches.38 | - **Offline:** Cannot run in real-time or stream results. - High computational cost (minutes to hours). - Fails all timing and streaming criteria. | - All images must be available before processing starts. - High RAM and CPU. | **Unsuitable (for the *online* system).** This approach is ideal for an *optional, post-flight, high-accuracy* refinement, but it cannot be the primary system. |
### **Selected Approach (Back-End): Incremental Pose-Graph Optimization (g2o/Ceres)**
The system's back-end will be built as an **Incremental Pose-Graph Optimizer** using a library like **g2o** or **Ceres Solver**. This is the only way to meet the real-time streaming and refinement constraints [User Query].
The graph will contain:
* **Nodes:** The 6-DoF pose of each camera frame.
* **Edges (Constraints):**
1. **Odometry Edges:** Relative 6-DoF transforms from the VO Front-End (SuperPoint+SuperGlue). These are high-frequency but have accumulating drift/scale error.
2. **Georeferencing Edges:** Absolute 6-DoF poses from the CVGL Module. These are low-frequency but are drift-free and provide the absolute scale.
3. **Start-Point Edge:** A high-confidence absolute pose for Image 1, fixed to the user-provided start GPS.
This architecture allows the system to provide an immediate estimate (from odometry) and then drastically improve its accuracy (correcting scale and drift) whenever a new georeferencing edge is added.
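The interplay of low-weight odometry edges and high-weight georeferencing edges can be shown on a toy graph. The sketch below uses scipy's `least_squares` in place of g2o/Ceres and 2-D positions in place of full 6-DoF poses: VO reports a biased 90 m step where the true spacing is 100 m, and two CVGL anchors pull the trajectory back to the correct scale:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy graph: 5 poses along x, truly 100 m apart. VO odometry has a scale
# bias (reads 90 m), while two CVGL anchors give drift-free absolute fixes.
odom = [(i, i + 1, np.array([90.0, 0.0])) for i in range(4)]   # relative edges
anchors = [(0, np.array([0.0, 0.0])), (4, np.array([400.0, 0.0]))]

def residuals(x):
    p = x.reshape(-1, 2)
    res = []
    for i, j, d in odom:                    # odometry edges: low confidence
        res.append((p[j] - p[i] - d) * 1.0)
    for i, a in anchors:                    # georeferencing edges: high confidence
        res.append((p[i] - a) * 10.0)
    return np.concatenate(res)

x0 = np.zeros(10)                           # all poses start at the origin
sol = least_squares(residuals, x0)
poses = sol.x.reshape(-1, 2)
```

After optimization the recovered steps are close to 100 m even though every odometry edge claimed 90 m: the anchors dominate because of their weight, which is exactly the scale-correction effect described above. A production back-end would additionally wrap the odometry residuals in a robust (e.g. Huber) loss to reject outlier links.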
## **6. Component Analysis: Global-Pose Correction (Georeferencing Module)**
This module is the most critical component for meeting the accuracy requirements. Its task is to provide absolute GPS pose estimates by matching the UAV's nadir-pointing-but-non-stabilized images to the Google Maps satellite provider [User Query]. This is the only component that can correct the monocular scale drift.
This task is known as **Cross-View Geolocalization (CVGL)**.7 It is extremely challenging due to the "domain gap" 44 between the two image sources:
1. **Viewpoint:** The UAV is at low altitude (<1km) and non-nadir (due to fixed-wing tilt) 45, while the satellite is at a very high altitude and is perfectly nadir.
2. **Appearance:** The images come from different sensors, with different lighting (shadows), and at different times. The Google Maps data may be "outdated" [User Query], showing different seasons, vegetation, or man-made structures.47
A simple, brute-force feature match is computationally impossible. The solution is a **hierarchical, two-stage approach** that mimics SOTA research 7:
* **Stage 1: Coarse Retrieval.** We cannot run expensive matching against the entire map. Instead, we treat this as an image retrieval problem. We use a Deep Learning model (e.g., a Siamese or Dual CNN trained on this task 50) to generate a compact "embedding vector" (a digital signature) for the UAV image. In an offline step, we pre-compute embeddings for *all* satellite map tiles in the operational area. The UAV image's embedding is then used to perform a very fast (e.g., FAISS library) similarity search against the satellite database, returning the Top-K most likely-matching satellite tiles.
* **Stage 2: Fine-Grained Pose.** *Only* for these Top-K candidates do we perform the heavy-duty feature matching. We use our selected **SuperPoint+SuperGlue** matcher 53 to find precise correspondences between the UAV image and the K satellite tiles. If a high-confidence geometric match (e.g., >50 inliers) is found, we can compute the precise 6-DoF pose of the UAV relative to that tile, thus yielding an absolute GPS coordinate.
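Stage 1 is, at its core, a nearest-neighbour search over pre-computed tile embeddings. A brute-force numpy stand-in for the FAISS index (the embeddings here are random placeholders, not outputs of a trained Siamese model; dimensions and database size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
db = rng.normal(size=(10_000, 256)).astype(np.float32)   # satellite tile embeddings
db /= np.linalg.norm(db, axis=1, keepdims=True)          # L2-normalise once, offline

def top_k_tiles(query_emb, k=5):
    """Stage 1: cosine-similarity retrieval of the K best satellite tiles.
    FAISS would replace this brute-force dot product at real scale."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = db @ q                                      # cosine similarity
    idx = np.argpartition(-scores, k)[:k]                # unsorted top-K
    return idx[np.argsort(-scores[idx])]                 # sorted best-first

# A query embedding close to tile 1234 should retrieve that tile first.
query = db[1234] + 0.05 * rng.normal(size=256).astype(np.float32)
hits = top_k_tiles(query)
```

Only the returned Top-K tiles then proceed to the expensive Stage 2 geometric matching.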
### **Table 3: Analysis of State-of-the-Art Cross-View Geolocalization (CVGL) Techniques**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **Coarse Retrieval (Siamese/Dual CNNs)** (PyTorch, ResNet18) | - Extremely fast for retrieval (database lookup). - Learns features robust to seasonal and appearance changes.50 - Narrows search space from millions to a few. | - Does *not* provide a precise 6-DoF pose, only a "best match" tile. - Requires training on a dataset of matched UAV-satellite pairs. | - Pre-trained model (e.g., on ResNet18).52 - Pre-computed satellite embedding database. | **Essential (as Stage 1).** This is the only computationally feasible way to "find" the UAV on the map. |
| **Fine-Grained Feature Matching** (SuperPoint + SuperGlue) | - Provides a highly-accurate 6-DoF pose estimate.53 - Re-uses the same robust matcher from the VO Front-End.54 | - Too slow to run on the entire map. - *Requires* a good initial guess (from Stage 1) to be effective. | - NVIDIA GPU. - Top-K candidate tiles from Stage 1. | **Essential (as Stage 2).** This is the component that actually computes the precise GPS pose from the coarse candidates. |
| **End-to-End DL Models (Transformers)** (PFED, ReCOT, etc.) | - SOTA accuracy in recent benchmarks.13 - Can be highly efficient (e.g., PFED).13 - Can perform retrieval and pose estimation in one model. | - Often research-grade, not robustly open-sourced. - May be complex to train and deploy. - Less modular and harder to debug than the two-stage approach. | - Specific, complex model architectures.13 - Large-scale training datasets. | **Not Recommended (for initial build).** While powerful, these are less practical for a version 1 build. The two-stage approach is more modular, debuggable, and uses components already required by the VO system. |
### **Selected Approach (CVGL Module): Hierarchical Retrieval + Matching**
The CVGL module will be implemented as a two-stage hierarchical system:
1. **Stage 1 (Coarse):** A **Siamese CNN** 52 (or similar model) generates an embedding for the UAV image. This embedding is used to retrieve the Top-5 most similar satellite tiles from a pre-computed database.
2. **Stage 2 (Fine):** The **SuperPoint+SuperGlue** matcher 53 is run between the UAV image and these 5 tiles. The match with the highest inlier count and lowest reprojection error is used to calculate the absolute 6-DoF pose, which is then sent to the Back-End optimizer.
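Both building the pre-computed tile database and restricting the search around a GPS prior rest on standard Web Mercator (slippy-map) tile indexing, which Google Maps imagery follows. A minimal sketch (the zoom level and search radius are illustrative choices, not tuned values):

```python
import math

def latlon_to_tile(lat, lon, zoom):
    """Standard Web Mercator (slippy-map) tile indices for a GPS point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

def tiles_around(lat, lon, zoom=17, radius=1):
    """All tiles in a (2r+1)^2 window around a prior position -- the
    neighbourhood searched when a last-known GPS coordinate is available."""
    cx, cy = latlon_to_tile(lat, lon, zoom)
    return [(cx + dx, cy + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)]
```

Offline, every tile in the operational area gets one embedding keyed by its `(x, y, zoom)` index; online, a GPS prior shrinks the candidate set from the whole region to a handful of neighbouring tiles.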
## **7. Addressing Critical Acceptance Criteria and Failure Modes**
This hybrid architecture's logic is designed to handle the most difficult acceptance criteria [User Query] through a robust, multi-stage escalation process.
### **Stage 1: Initial State (Normal Operation)**
* **Condition:** VO(N-1 -> N) succeeds.
* **System Logic:** The **VO Front-End** provides the high-frequency relative pose. This is added to the graph, and the **Initial Pose** is sent to the user (<5s).
* **Resolution:** The **CVGL Module** runs asynchronously to provide a Refined Pose later, which corrects for scale drift.
### **Stage 2: Transient Failure / Outlier Handling (AC-3)**
* **Condition:** VO(N-1 -> N) fails (e.g., >350m jump, severe motion blur, low overlap) [User Query]. This triggers an immediate, high-priority CVGL(N) query.
* **System Logic:**
1. If CVGL(N) *succeeds*, the system has conflicting data: a failed VO link and a successful CVGL pose. The **Back-End Optimizer** uses a robust kernel to reject the high-error VO link as an outlier and accepts the CVGL pose.56 The trajectory "jumps" to the correct location, and VO resumes from Image N+1.
2. If CVGL(N) *also fails* (e.g., due to cloud cover or outdated map), the system assumes Image N is a single bad frame (an outlier).
* **Resolution (Frame Skipping):** The system buffers Image N and, upon receiving Image N+1, the **VO Front-End** attempts to "bridge the gap" by matching VO(N-1 -> N+1).
* **If successful,** a pose for N+1 is found. Image N is marked as a rejected outlier, and the system continues.
* **If VO(N-1 -> N+1) fails,** it repeats for VO(N-1 -> N+2).
* If this "bridging" fails for 3 consecutive frames, the system concludes it is not a transient outlier but a persistent tracking loss. This escalates to Stage 3.
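The frame-skipping and escalation logic above can be sketched as a small state machine. This is a minimal illustration with hypothetical names; the real VO Front-End would gate on inlier counts from the matcher rather than a boolean callback:

```python
class VOBridger:
    """Sketch of the Stage-2 frame-skipping logic: after a failed VO link,
    keep trying to bridge from the last good frame; after MAX_SKIPS
    consecutive bridging failures, escalate to the Stage-3 'Tracking Lost'
    (new chunk) state."""

    MAX_SKIPS = 3

    def __init__(self):
        self.last_good = None   # index of the last successfully tracked frame
        self.skipped = []       # frames rejected as outliers
        self.consec_fail = 0
        self.state = "TRACKING"

    def on_frame(self, idx, vo_match_fn):
        """vo_match_fn(a, b) -> True if VO(a -> b) finds enough inliers."""
        if self.state == "LOST":
            return self.state              # Stage 3 (chunking) takes over
        if self.last_good is None:
            self.last_good = idx           # bootstrap on the first frame
            return self.state
        if vo_match_fn(self.last_good, idx):
            self.last_good = idx           # normal tracking, or bridge succeeded
            self.consec_fail = 0
            self.state = "TRACKING"
            return self.state
        self.skipped.append(idx)           # buffer/reject this frame
        self.consec_fail += 1
        self.state = "LOST" if self.consec_fail >= self.MAX_SKIPS else "BRIDGING"
        return self.state
```

Feeding it a sequence where frames 5-7 cannot be bridged (a simulated sharp turn) produces TRACKING for frames 0-4, BRIDGING for 5-6, and LOST at the third consecutive failure, triggering Stage 3.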
### **Stage 3: Persistent Tracking Loss / Sharp Turn Handling (AC-4)**
* **Condition:** VO tracking is lost, and the "frame-skipping" in Stage 2 fails (e.g., a "sharp turn" with no overlap) [User Query].
* **System Logic (Multi-Map "Chunking"):** The **Back-End Optimizer** declares a "Tracking Lost" state and creates a *new, independent map* ("Chunk 2").
* The **VO Front-End** is re-initialized and begins populating this new chunk, tracking VO(N+3 -> N+4), VO(N+4 -> N+5), etc. This new chunk is internally consistent but has no absolute GPS position (it is "floating").
* **Resolution (Asynchronous Relocalization):**
1. The **CVGL Module** now runs asynchronously on all frames in this new "Chunk 2".
2. Crucially, it uses the last known GPS coordinate from "Chunk 1" as a *search prior*, narrowing the satellite map search area to the vicinity.
3. The system continues to build Chunk 2 until the CVGL module successfully finds a high-confidence Absolute_Pose for *any* frame in that chunk (e.g., for Image N+20).
4. Once this single GPS "anchor" is found, the **Back-End Optimizer** performs a full graph optimization. It calculates the 7-DoF transformation (3D position, 3D rotation, and **scale**) to align all of Chunk 2 and merge it with Chunk 1.
5. This "chunking" method robustly handles the "correctly continue the work" criterion by allowing the system to keep tracking locally even while globally lost, confident it can merge the maps later.
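Merging the "floating" chunk requires solving for that 7-DoF similarity transform. In the full system this falls out of the pose-graph optimization, but the closed-form Umeyama solution below sketches the alignment step, assuming at least three corresponding 3D positions between the chunk's local frame and the anchored world frame:

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    mapping src points onto dst: dst ~= s * R @ src + t (Umeyama, 1991).
    src, dst: (N, 3) arrays of corresponding 3D positions."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)          # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                          # guard against reflections
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src    # recovered absolute scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Given a handful of CVGL-anchored frames in Chunk 2, the recovered (s, R, t) transforms every pose in the chunk into the global frame before the final graph re-optimization.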
### **Stage 4: Catastrophic Failure / User Intervention (AC-6)**
* **Condition:** The system has entered Stage 3 and is building "Chunk 2," but the **CVGL Module** has *also* failed for a prolonged period (e.g., 20% of the route, or 50+ consecutive frames) [User Query]. This is a "worst-case" scenario where the UAV is in an area with no VO features (e.g., over a lake) *and* no CVGL features (e.g., heavy clouds or outdated maps).
* **System Logic:** The system is "absolutely incapable" of determining its pose.
* **Resolution (User Input):** The system triggers the "ask the user for input" event. A UI prompt will show the last known good image (from Chunk 1) on the map and the new, "lost" image (e.g., N+50). It will ask the user to "Click on the map to provide a coarse location." This user-provided GPS point is then fed to the CVGL module as a *strong prior*, drastically narrowing the search space and enabling it to re-acquire a lock.
## **8. Implementation and Output Generation**
### **Real-time Workflow (<5s Initial, Async Refinement)**
A concrete implementation plan for processing Image N:
1. **T=0.0s:** Image[N] (6200px) received.
2. **T=0.1s:** Image pre-processed: Scaled to 1024px for VO/CVGL. Full-res original stored.
3. **T=0.5s:** **VO Front-End** (GPU): SuperPoint features extracted for 1024px image.
4. **T=1.0s:** **VO Front-End** (GPU): SuperGlue matches 1024px Image[N] -> 1024px Image[N-1]. Relative_Pose (6-DoF) estimated via RANSAC/PnP.
5. **T=1.1s:** **Back-End:** Relative_Pose added to graph. Optimizer updates trajectory.
6. **T=1.2s:** **OUTPUT:** Initial Pose_N_Est (GPS) sent to user. **(<5s criterion met)**.
7. **T=1.3s:** **CVGL Module (Async Task)** (GPU): Siamese/Dual CNN generates embedding for 1024px Image[N].
8. **T=1.5s:** **CVGL Module (Async Task):** Coarse retrieval (FAISS lookup) returns Top-5 satellite tile candidates.
9. **T=4.0s:** **CVGL Module (Async Task)** (GPU): Fine-grained matching. SuperPoint+SuperGlue runs 5 times (Image[N] vs. 5 satellite tiles).
10. **T=4.5s:** **CVGL Module (Async Task):** A high-confidence match is found. Absolute_Pose_N_Abs (6-DoF) is computed.
11. **T=4.6s:** **Back-End:** High-confidence Absolute_Pose_N_Abs added to pose graph. Graph re-optimization is triggered.
12. **T=4.8s:** **OUTPUT:** Pose_N_Refined (GPS) sent to user. **(Refinement criterion met)**.
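The two-stream output pattern in this timeline can be sketched with a worker thread per refinement. This is a minimal illustration with placeholder poses; a production system would use a proper task queue and GPU batching:

```python
import queue
import threading
import time

def process_image(idx, out: queue.Queue):
    """Sketch of the per-image workflow: the initial pose is emitted as soon
    as VO finishes; the CVGL refinement runs asynchronously and emits later."""
    vo_pose = f"initial_pose_{idx}"          # stands in for VO + graph update
    out.put(("initial", idx, vo_pose))       # <5s output to the user

    def cvgl_refine():
        time.sleep(0.01)                     # stands in for retrieval + matching
        out.put(("refined", idx, f"refined_pose_{idx}"))

    threading.Thread(target=cvgl_refine, daemon=True).start()
```

The consumer (UI) simply drains the queue: it always receives the "initial" tuple first and the "refined" tuple once the asynchronous CVGL task completes, matching the streaming/refinement criterion.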
### **Determining Object-Level GPS (from Pixel Coordinate)**
The requirement to find the "coordinates of the center of any object in these photos" [User Query] is met by projecting a pixel to its 3D world coordinate. This requires the (u,v) pixel, the camera's 6-DoF pose, and the camera's intrinsic matrix (K).
Two methods will be implemented to support the streaming/refinement architecture:
1. **Method 1 (Immediate, <5s): Flat-Earth Projection.**
* When the user clicks pixel (u,v) on Image[N], the system uses the *Initial Pose_N_Est*.
* It assumes the ground is a flat plane, so the camera sits at the predefined height above ground (e.g., 900m above ground if flying at 1km ASL over terrain at 100m elevation) [User Query].
* It computes the 3D ray from the camera center through (u,v) using the intrinsic matrix (K).
* It calculates the 3D intersection point of this ray with the flat ground plane.
* This 3D world point is converted to a GPS coordinate and sent to the user. This is very fast but less accurate in non-flat terrain.
2. **Method 2 (Refined, Post-BA): Structure-from-Motion Projection.**
* The Back-End's pose-graph optimization, as a byproduct, will create a sparse 3D point cloud of the world (i.e., the "SfM" part of SLAM).35
* When the user clicks (u,v), the system uses the *Pose_N_Refined*.
* It raycasts from the camera center through (u,v) and finds the 3D intersection point with the *actual 3D point cloud* generated by the system.
* This 3D point's coordinate (X,Y,Z) is converted to GPS. This is far more accurate as it accounts for real-world topography (hills, ditches) captured in the 3D map.
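Method 1 can be sketched as a ray-plane intersection, assuming a local ENU frame with the ground at z = 0 and a small-offset ENU-to-GPS conversion (helper names are illustrative):

```python
import math
import numpy as np

def pixel_to_ground(u, v, K, R, cam_pos, ground_z=0.0):
    """Intersect the viewing ray through pixel (u, v) with the plane z = ground_z.
    K: intrinsic matrix; R: camera-to-world rotation; cam_pos: camera centre
    in a local ENU frame (metres)."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray in camera frame
    ray_w = R @ ray_cam                                   # ray in world frame
    if abs(ray_w[2]) < 1e-9:
        raise ValueError("ray is parallel to the ground plane")
    lam = (ground_z - cam_pos[2]) / ray_w[2]
    return cam_pos + lam * ray_w                          # 3D intersection point

def enu_to_gps(east, north, lat0, lon0):
    """Small-offset ENU -> GPS conversion around a reference coordinate
    (spherical-Earth approximation, adequate for sub-kilometre offsets)."""
    lat = lat0 + math.degrees(north / 6_378_137.0)
    lon = lon0 + math.degrees(east / (6_378_137.0 * math.cos(math.radians(lat0))))
    return lat, lon
```

For a perfectly nadir camera at 900m above ground, the image centre pixel projects straight down to the point below the UAV, and a pixel one focal length to the right lands 900m east, which matches the expected flat-earth geometry.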
## **9. Testing and Validation Strategy**
A rigorous testing strategy is required to validate all 10 acceptance criteria. The foundation of this strategy is the creation of a **Ground-Truth Test Dataset**. This will involve flying several test routes and manually creating a "checkpoint" (CP) file, similar to the provided coordinates.csv 58, using a high-precision RTK/PPK GPS. This provides the "real GPS" for validation.59
### **Accuracy Validation Methodology (AC-1, AC-2, AC-5, AC-8, AC-9)**
These tests validate the system's accuracy and completion metrics.59
1. A test flight of 1000 images with high-precision ground-truth CPs is prepared.
2. The system is run given only the first GPS coordinate.
3. A test script compares the system's *final refined GPS output* for each image against its *ground-truth CP*. The Haversine distance (error in meters) is calculated for all 1000 images.
4. This yields a list of 1000 error values.
5. **Test_Accuracy_50m (AC-1):** ASSERT (count(errors < 50m) / 1000) >= 0.80
6. **Test_Accuracy_20m (AC-2):** ASSERT (count(errors < 20m) / 1000) >= 0.60
7. **Test_Outlier_Rate (AC-5):** ASSERT (count(un-localized_images) / 1000) < 0.10
8. **Test_Image_Registration_Rate (AC-8):** ASSERT (count(localized_images) / 1000) > 0.95
9. **Test_Mean_Reprojection_Error (AC-9):** ASSERT (Back-End.final_MRE) < 1.0 pixels
10. **Test_RMSE:** The overall Root Mean Square Error (RMSE) of the entire trajectory will be calculated as a primary performance benchmark.59
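The error computation behind these assertions can be sketched with a standard Haversine implementation (helper names are illustrative):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS coordinates."""
    R = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def accuracy_report(errors_m):
    """AC-1/AC-2 style pass-rates from per-image errors (metres)."""
    n = len(errors_m)
    return {
        "frac_under_50m": sum(e < 50 for e in errors_m) / n,
        "frac_under_20m": sum(e < 20 for e in errors_m) / n,
    }
```

The test script computes `haversine_m` between each refined output and its ground-truth checkpoint, then asserts `frac_under_50m >= 0.80` and `frac_under_20m >= 0.60`.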
### **Integration and Functional Tests (AC-3, AC-4, AC-6)**
These tests validate the system's logic and robustness to failure modes.62
* Test_Low_Overlap_Relocalization (AC-4):
* **Setup:** Create a test sequence of 50 images. From this, manually delete images 20-24 (simulating 5 lost frames during a sharp turn).63
* **Test:** Run the system on this "broken" sequence.
* **Pass/Fail:** The system must report "Tracking Lost" at frame 20, initiate a new "chunk," and then "Tracking Re-acquired" and "Maps Merged" when the CVGL module successfully localizes frame 25 (or a subsequent frame). The final trajectory error for frame 25 must be < 50m.
* Test_350m_Outlier_Rejection (AC-3):
* **Setup:** Create a test sequence. At image 30, insert a "rogue" image (Image 30b) known to be 350m away.
* **Test:** Run the system on this sequence (..., 29, 30, 30b, 31,...).
* **Pass/Fail:** The system must correctly identify Image 30b as an outlier (RANSAC failure 56), reject it (or jump to its CVGL-verified pose), and "correctly continue the work" by successfully tracking Image 31 from Image 30 (using the frame-skipping logic). The trajectory must not be corrupted.
* Test_User_Intervention_Prompt (AC-6):
* **Setup:** Create a test sequence with 50 consecutive "bad" frames (e.g., pure sky, lens cap) to ensure the transient and chunking logics are bypassed.
* **Test:** Run the system.
* **Pass/Fail:** The system must enter a "LOST" state, attempt and fail to relocalize via CVGL for 50 frames, and then correctly trigger the "ask for user input" event.
### **Non-Functional Tests (AC-7, AC-8, Hardware)**
These tests validate performance and resource requirements.66
* Test_Performance_Per_Image (AC-7):
* **Setup:** Run the 1000-image test set on the minimum-spec RTX 2060.
* **Test:** Measure the time from "Image In" to "Initial Pose Out" for every frame.
* **Pass/Fail:** ASSERT average_time < 5.0s.
* Test_Streaming_Refinement (AC-8):
* **Setup:** Run the 1000-image test set.
* **Test:** A logger must verify that *two* poses are received for >80% of images: an "Initial" pose (T < 5s) and a "Refined" pose (T > 5s, after CVGL).
* **Pass/Fail:** The refinement mechanism is functioning correctly.
* Test_Scalability_Large_Route (Constraints):
* **Setup:** Run the system on a full 3000-image dataset.
* **Test:** Monitor system RAM, VRAM, and processing time per frame over the entire run.
* **Pass/Fail:** The system must complete the run without memory leaks, and the processing time per image must not degrade significantly as the pose graph grows.
Identify all potential weak points and problems. Address them and find out ways to solve them. Based on your findings, form a new solution draft in the same format.
If your finding requires a complete reorganization of the flow and different components, state it.
Put all the findings regarding what was weak and poor at the beginning of the report. Put here all new findings, what was updated, replaced, or removed from the previous solution.
Then form a new solution design without referencing the previous system. Remove Poor and Very Poor component choices from the component analysis tables, but leave Good and Excellent ones.
In the updated report, do not put "new" marks, do not compare to the previous solution draft, just make a new solution as if from scratch
@@ -0,0 +1,325 @@
Read carefully about the problem:
We have a lot of images taken from a wing-type UAV using a camera with at least Full HD resolution. Resolution of each photo could be up to 6200*4100 for the whole flight, but for other flights, it could be FullHD
Photos are taken and named consecutively within 100 meters of each other.
We know only the starting GPS coordinates. We need to determine the GPS of the centers of each image. And also the coordinates of the center of any object in these photos. We can use an external satellite provider for ground checks on the existing photos
The system has the following restrictions and conditions:
- Photos are taken by only airplane type UAVs.
- Photos are taken by the camera pointing downwards and fixed, but it is not autostabilized.
- The flying range is restricted by the eastern and southern parts of Ukraine (To the left of the Dnipro River)
- The image resolution could be from FullHD to 6252*4168. Camera parameters are known: focal length, sensor width, resolution and so on.
- Altitude is predefined and no more than 1km. The height of the terrain can be neglected.
- There is NO data from IMU
- Flights are done mostly in sunny weather
- We can use satellite providers, but we're limited right now to Google Maps, which could be outdated for some regions
- Number of photos could be up to 3000, usually in the 500-1500 range
- During the flight, UAVs can make sharp turns, so that the next photo may be absolutely different from the previous one (no same objects), but it is rather an exception than the rule
- Processing is done on a stationary computer or laptop with NVidia GPU at least RTX2060, better 3070. (For the UAV solution Jetson Orin Nano would be used, but that is out of scope.)
Output of the system should address the following acceptance criteria:
- The system should find out the GPS of centers of 80% of the photos from the flight within an error of no more than 50 meters in comparison to the real GPS
- The system should find out the GPS of centers of 60% of the photos from the flight within an error of no more than 20 meters in comparison to the real GPS
- The system should correctly continue the work even in the presence of up to 350 meters of an outlier photo between 2 consecutive pictures en route. This could happen due to tilt of the plane.
- System should correctly continue the work even during sharp turns, where the next photo doesn't overlap at all, or overlaps in less than 5%. The next photo should be in less than 150m drift and at an angle of less than 50%
- The number of outliers during the satellite provider images ground check should be less than 10%
- In case the system is absolutely incapable of determining the GPS of the next, second next, and third next images by any means (i.e., for 20% of the route), it should ask the user for input for the next image, so that the user can specify the location
- Less than 5 seconds for processing one image
- Results of image processing should appear immediately to user, so that user shouldn't wait for the whole route to complete in order to analyze first results. Also, system could refine existing calculated results and send refined results again to user
- Image Registration Rate > 95%. The system can find enough matching features to confidently calculate the camera's 6-DoF pose (position and orientation) and "stitch" that image into the final trajectory
- Mean Reprojection Error (MRE) < 1.0 pixels. The distance, in pixels, between the original pixel location of the object and the re-projected pixel location.
Here is a solution draft:
# **GEORTOLS-SA: UAV Image Geolocalization in IMU-Denied Environments**
The GEORTOLS-SA system is an asynchronous, four-component software solution designed for deployment on an NVIDIA RTX 2060+ GPU. It is architected from the ground up to handle the specific challenges of IMU-denied, scale-aware localization and real-time streaming output.
### **Product Solution Description**
* **Inputs:**
1. A sequence of consecutively named images (FullHD to 6252x4168).
2. The absolute GPS coordinate (Latitude, Longitude) for the first image (Image 0).
3. A pre-calibrated camera intrinsic matrix ($K$).
4. The predefined, absolute metric altitude of the UAV ($H$, e.g., 900 meters).
5. API access to the Google Maps satellite provider.
* **Outputs (Streaming):**
1. **Initial Pose (T \< 5s):** A high-confidence, *metric-scale* estimate ($Pose\_N\_Est$) of the image's 6-DoF pose and GPS coordinate. This is sent to the user immediately upon calculation (AC-7, AC-8).
2. **Refined Pose (T > 5s):** A globally-optimized pose ($Pose\_N\_Refined$) sent asynchronously as the back-end optimizer fuses data from the CVGL module (AC-8).
### **Component Interaction Diagram and Data Flow**
The system is architected as four parallel-processing components to meet the stringent real-time and refinement requirements.
1. **Image Ingestion & Pre-processing:** This module receives the new, high-resolution Image_N. It immediately creates two copies:
* Image_N_LR (Low-Resolution, e.g., 1536x1024): This copy is immediately dispatched to the SA-VO Front-End for real-time processing.
* Image_N_HR (High-Resolution, 6.2K): This copy is stored and made available to the CVGL Module for its asynchronous, high-accuracy matching pipeline.
2. **Scale-Aware VO (SA-VO) Front-End (High-Frequency Thread):** This component's sole task is high-speed, *metric-scale* relative pose estimation. It matches Image_N_LR to Image_N-1_LR, computes the 6-DoF relative transform, and critically, uses the "known altitude" ($H$) constraint to recover the absolute scale (detailed in Section 3.0). It sends this high-confidence Relative_Metric_Pose to the Back-End.
3. **Cross-View Geolocalization (CVGL) Module (Low-Frequency, Asynchronous Thread):** This is a heavier, slower module. It takes Image_N (both LR and HR) and queries the Google Maps database to find an *absolute GPS pose*. When a high-confidence match is found, its Absolute_GPS_Pose is sent to the Back-End as a global "anchor" constraint.
4. **Trajectory Optimization Back-End (Central Hub):** This component manages the complete flight trajectory as a pose graph.10 It continuously fuses two distinct, high-quality data streams:
* **On receiving Relative_Metric_Pose (T \< 5s):** It appends this pose to the graph, calculates the Pose_N_Est, and **sends this initial result to the user (AC-7, AC-8 met)**.
* **On receiving Absolute_GPS_Pose (T > 5s):** It adds this as a high-confidence "global anchor" constraint 12, triggers a full graph re-optimization to correct any minor biases, and **sends the Pose_N_Refined to the user (AC-8 refinement met)**.
### **VO "Trust Model" of GEORTOLS-SA**
In GEORTOLS-SA, the trust model is as follows:
* The **SA-VO Front-End** is now *highly trusted* for its local, frame-to-frame *metric* accuracy.
* The **CVGL Module** remains *highly trusted* for its *global* (GPS) accuracy.
Both components are operating in the same scale-aware, metric space. The Back-End's job is no longer to fix a broken, drifting VO. Instead, it performs a robust fusion of two independent, high-quality metric measurements.12
This model is self-correcting. If the user's predefined altitude $H$ is slightly incorrect (e.g., entered as 900m but is truly 880m), the SA-VO front-end will be *consistently* off by a small percentage. The periodic, high-confidence CVGL "anchors" will create a consistent, low-level "tension" in the pose graph. The graph optimizer (e.g., Ceres Solver) 3 will resolve this tension by slightly "pulling" the SA-VO poses to fit the global anchors, effectively *learning* and correcting for the altitude bias. This robust fusion is the key to meeting the 20-meter and 50-meter accuracy targets (AC-1, AC-2).
## **3.0 Core Component: The Scale-Aware Visual Odometry (SA-VO) Front-End**
This component is the new, critical engine of the system. Its sole task is to compute the *metric-scale* 6-DoF relative motion between consecutive frames, thereby eliminating scale drift at its source.
### **3.1 Rationale and Mechanism for Per-Frame Scale Recovery**
The SA-VO front-end implements a geometric algorithm to recover the absolute scale $s$ for *every* frame-to-frame transition. This algorithm directly leverages the query's "known altitude" ($H$) and "planar ground" constraints.5
The SA-VO algorithm for processing Image_N (relative to Image_N-1) is as follows:
1. **Feature Matching:** Extract and match robust features between Image_N and Image_N-1 using the selected feature matcher (see Section 3.2). This yields a set of corresponding 2D pixel coordinates.
2. **Essential Matrix:** Use RANSAC (Random Sample Consensus) and the camera intrinsic matrix $K$ to compute the Essential Matrix $E$ from the "inlier" correspondences.2
3. **Pose Decomposition:** Decompose $E$ to find the relative Rotation $R$ and the *unscaled* translation vector $t$, whose magnitude $||t||$ is normalized to unity.2
4. **Triangulation:** Triangulate the 3D-world points $X$ for all inlier features using the unscaled pose $[R \mid t]$.15 These 3D points ($X_i$) are now in a local, *unscaled* coordinate system (i.e., we know the *shape* of the point cloud, but not its *size*).
5. **Ground Plane Fitting:** The query states "terrain height can be neglected," meaning we assume a planar ground. A *second* RANSAC pass is performed, this time fitting a 3D plane to the set of triangulated 3D points $X$. The inliers to this RANSAC are identified as the ground points $X_g$.5 This method is highly robust as it does not rely on a single point, but on the consensus of all visible ground features.16
6. **Unscaled Height ($h$):** From the fitted plane equation $n^T X + d = 0$ (with $n$ a unit normal), $|d|$ is the perpendicular distance from the camera (at the coordinate system's origin) to the computed ground plane. This is our *unscaled* height $h$.
7. **Scale Computation:** We now have two values: the *real, metric* altitude $H$ (e.g., 900m) provided by the user, and our *computed, unscaled* height $h$. The absolute scale $s$ for this frame is the ratio of these two values: $s = H / h$.
8. **Metric Pose:** The final, metric-scale relative pose is $[R \mid T]$, where the metric translation is $T = s \cdot t$. This high-confidence, scale-aware pose is sent to the Back-End.
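Steps 5-7 of this algorithm can be sketched as follows, assuming the triangulated inlier points are given. A RANSAC loop around `plane_fit` would provide the outlier rejection described in step 5; the least-squares fit is shown here for clarity:

```python
import numpy as np

def plane_fit(points):
    """Least-squares plane through 3D points: returns unit normal n and
    offset d with n . X + d = 0 (SVD of the centred point cloud)."""
    centroid = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - centroid)
    n = Vt[-1]                      # direction of least variance = plane normal
    d = -n @ centroid
    return n, d

def recover_scale(triangulated_pts, metric_altitude_H):
    """SA-VO steps 5-7: fit the ground plane to the unscaled triangulated
    points and return s = H / h, where h is the unscaled camera-to-plane
    distance (the camera sits at the origin of the local frame)."""
    n, d = plane_fit(triangulated_pts)
    h_unscaled = abs(d)             # perpendicular distance from origin to plane
    return metric_altitude_H / h_unscaled
```

For points triangulated onto a plane at unscaled height 2 with a true altitude of 900m, the recovered scale is 450, which multiplies the unit-norm translation into metres.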
### **3.2 Feature Matching Sub-System Analysis**
The success of the SA-VO algorithm depends *entirely* on the quality of the initial feature matches, especially in the low-texture agricultural terrain specified in the query. The system requires a matcher that is both robust (for sparse textures) and extremely fast (for AC-7).
The initial draft's choice of SuperGlue 17 is a strong, proven baseline. However, its successor, LightGlue 18, offers a critical, non-obvious advantage: **adaptivity**.
The UAV flight is specified as *mostly* straight, with high overlap. Sharp turns (AC-4) are "rather an exception." This means \~95% of our image pairs are "easy" to match, while 5% are "hard."
* SuperGlue uses a fixed-depth Graph Neural Network (GNN), spending the *same* (large) amount of compute on an "easy" pair as a "hard" pair.19 This is inefficient.
* LightGlue is *adaptive*.19 For an easy, high-overlap pair, it can exit early (e.g., at layer 3/9), returning a high-confidence match in a fraction of the time. For a "hard" low-overlap pair, it will use its full depth to get the best possible result.19
By using LightGlue, the system saves *enormous* amounts of computational budget on the 95% of "easy" frames, ensuring it *always* meets the \<5s budget (AC-7) and reserving that compute for the harder CVGL tasks. LightGlue is a "plug-and-play replacement" 19 that is faster, more accurate, and easier to train.19
### **Table 1: Analysis of State-of-the-Art Feature Matchers (For SA-VO Front-End)**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **SuperPoint + SuperGlue** 17 | - SOTA robustness in low-texture, high-blur conditions. - GNN reasons about 3D scene context. - Proven in real-time SLAM systems.22 | - Computationally heavy (fixed-depth GNN). - Slower than LightGlue.19 - Training is complex.19 | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT.25 | **Good.** A solid, baseline choice. Meets robustness needs but will heavily tax the \<5s time budget (AC-7). |
| **SuperPoint + LightGlue** 18 | - **Adaptive Depth:** Faster on "easy" pairs, more accurate on "hard" pairs.19 - **Faster & Lighter:** Outperforms SuperGlue on speed and accuracy.19 - **Easier to Train:** Simpler architecture and loss.19 - Direct plug-and-play replacement for SuperGlue. | - Newer, less long-term-SLAM-proven than SuperGlue (though rapidly being adopted). | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT.28 | **Excellent (Selected).** The adaptive nature is *perfect* for this problem. It saves compute on the 95% of easy (straight) frames, preserving the budget for the 5% of hard (turn) frames, maximizing our ability to meet AC-7. |
### **3.3 Selected Approach (SA-VO): SuperPoint + LightGlue**
The SA-VO front-end will be built using:
* **Detector:** **SuperPoint** 24 to detect sparse, robust features on the Image_N_LR.
* **Matcher:** **LightGlue** 18 to match features from Image_N_LR to Image_N-1_LR.
This combination provides the SOTA robustness required for low-texture fields, while LightGlue's adaptive performance 19 is the key to meeting the \<5s (AC-7) real-time requirement.
## **4.0 Global Anchoring: The Cross-View Geolocalization (CVGL) Module**
With the SA-VO front-end handling metric scale, the CVGL module's task is refined. Its purpose is no longer to *correct scale*, but to provide *absolute global "anchor" poses*. This corrects for any accumulated bias (e.g., if the $h$ prior is off by 5m) and, critically, *relocalizes* the system after a persistent tracking loss (AC-4).
### **4.1 Hierarchical Retrieval-and-Match Pipeline**
This module runs asynchronously and is computationally heavy. A brute-force search against the entire Google Maps database is impossible. A two-stage hierarchical pipeline is required:
1. **Stage 1: Coarse Retrieval.** This is treated as an image retrieval problem.29
* A **Siamese CNN** 30 (or similar Dual-CNN architecture) is used to generate a compact "embedding vector" (a digital signature) for the Image_N_LR.
* An embedding database will be pre-computed for *all* Google Maps satellite tiles in the specified Eastern Ukraine operational area.
* The UAV image's embedding is then used to perform a very fast (e.g., FAISS library) similarity search against the satellite database, returning the *Top-K* (e.g., K=5) most likely-matching satellite tiles.
2. **Stage 2: Fine-Grained Pose.**
* *Only* for these Top-5 candidates, the system performs the heavy-duty **SuperPoint + LightGlue** matching.
* This match is *not* Image_N -> Image_N-1. It is Image_N -> Satellite_Tile_K.
* The match with the highest inlier count and lowest reprojection error (MRE \< 1.0, AC-10) is used to compute the precise 6-DoF pose of the UAV relative to that georeferenced satellite tile. This yields the final Absolute_GPS_Pose.
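Stage 1 can be sketched as a cosine-similarity Top-K search. NumPy stands in here for the FAISS index used in production, and the embedding shapes are illustrative:

```python
import numpy as np

def top_k_tiles(query_emb, tile_embs, k=5):
    """Stage-1 coarse retrieval: cosine similarity between the UAV image
    embedding and every pre-computed satellite-tile embedding, returning
    the indices of the Top-K candidate tiles."""
    q = query_emb / np.linalg.norm(query_emb)
    t = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    sims = t @ q                        # cosine similarity to every tile
    return np.argsort(-sims)[:k]        # best-first candidate indices
```

Only the returned K tiles are passed to the expensive Stage-2 SuperPoint + LightGlue matching, which is what keeps the asynchronous CVGL task inside its compute budget.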
### **4.2 Critical Insight: Solving the Oblique-to-Nadir "Domain Gap"**
A critical, unaddressed failure mode exists. The query states the camera is **"not autostabilized"** [User Query]. On a fixed-wing UAV, this guarantees that during a bank or sharp turn (AC-4), the camera will *not* be nadir (top-down). It will be *oblique*, capturing the ground from an angle. The Google Maps reference, however, is *perfectly nadir*.32
This creates a severe "domain gap".33 A CVGL system trained *only* to match nadir-to-nadir images will *fail* when presented with an oblique UAV image.34 This means the CVGL module will fail *precisely* when it is needed most: during the sharp turns (AC-4) when SA-VO tracking is also lost.
The solution is to *close this domain gap* during training. Since the real-world UAV images will be oblique, the network must be taught to match oblique views to nadir ones.
Solution: Synthetic Data Generation for Robust Training
The Stage 1 Siamese CNN 30 must be trained on a custom, synthetically-generated dataset.37 The process is as follows:
1. Acquire nadir satellite imagery and a corresponding Digital Elevation Model (DEM) for the operational area.
2. Use this data to *synthetically render* the nadir satellite imagery from a wide variety of *oblique* viewpoints, simulating the UAV's roll and pitch.38
3. Create thousands of training pairs, each consisting of (Nadir_Satellite_Tile, Synthetically_Oblique_Tile_Angle_30_Deg).
4. Train the Siamese network 29 to learn that these two images—despite their *vastly* different appearances—are a *match*.
This process teaches the retrieval network to be *viewpoint-invariant*.35 It learns to ignore perspective distortion and match the true underlying ground features (road intersections, field boundaries). This is the *only* way to ensure the CVGL module can robustly relocalize the UAV during a sharp turn (AC-4).
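For a roughly planar scene, the oblique rendering in step 2 can be approximated by the homography induced by rotating the camera about its centre, $H = K R K^{-1}$. The sketch below (illustrative names; a full pipeline would also use the DEM) computes that warp:

```python
import numpy as np

def oblique_homography(K, roll_deg, pitch_deg):
    """Homography warping a nadir view into the view of the same camera
    rotated by (roll, pitch) about its centre: H = K R K^-1. Applying H to
    a nadir satellite tile yields a synthetic oblique training sample."""
    r, p = np.radians([roll_deg, pitch_deg])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(r), 0, np.sin(r)],
                   [0, 1, 0],
                   [-np.sin(r), 0, np.cos(r)]])
    return K @ (Ry @ Rx) @ np.linalg.inv(K)

def warp_point(H, u, v):
    """Map one pixel through the homography (a perspective warp applies
    this mapping to every pixel)."""
    x = H @ np.array([u, v, 1.0])
    return x[0] / x[2], x[1] / x[2]
```

With zero roll and pitch the homography is the identity; a 10-degree pitch shifts the principal point by roughly f·tan(10°) pixels, mimicking the perspective distortion the retrieval network must learn to ignore.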
## **5.0 Trajectory Fusion: The Robust Optimization Back-End**
This component is the system's central "brain." It runs continuously, fusing all incoming measurements (high-frequency/metric-scale SA-VO poses, low-frequency/globally-absolute CVGL poses) into a single, globally consistent trajectory. This component's design is dictated by the requirements for streaming (AC-8), refinement (AC-8), and outlier-rejection (AC-3).
### **5.1 Selected Strategy: Incremental Pose-Graph Optimization**
The user's requirements for "results...appear immediately" and "system could refine existing calculated results" [User Query] are a textbook description of a real-time SLAM back-end.11 A batch Structure from Motion (SfM) process, which requires all images upfront and can take hours, is unsuitable for the primary system.
### **Table 2: Analysis of Trajectory Optimization Strategies**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **Incremental SLAM (Pose-Graph Optimization)** (g2o 13, Ceres Solver 10, GTSAM) | - **Real-time / Online:** Provides immediate pose estimates (AC-7). - **Supports Refinement:** Explicitly designed to refine past poses when new "loop closure" (CVGL) data arrives (AC-8).11 - **Robust:** Can handle outliers via robust kernels.39 | - Initial estimate is less accurate than a full batch process. - Can drift *if* not anchored (though our SA-VO minimizes this). | - A graph optimization library (g2o, Ceres). - A robust cost function.41 | **Excellent (Selected).** This is the *only* architecture that satisfies all user requirements for real-time streaming and asynchronous refinement. |
| **Batch Structure from Motion (Global Bundle Adjustment)** (COLMAP, Agisoft Metashape) | - **Globally Optimal Accuracy:** Produces the most accurate possible 3D reconstruction and trajectory. | - **Offline:** Cannot run in real-time or stream results. - High computational cost (minutes to hours). - Fails AC-7 and AC-8 completely. | - All images must be available before processing starts. - High RAM and CPU. | **Good (as an *Optional* Post-Processing Step).** Unsuitable as the primary online system, but could be offered as an optional, high-accuracy "Finalize Trajectory" batch process after the flight. |
The system's back-end will be built as an **Incremental Pose-Graph Optimizer** using **Ceres Solver**.10 Ceres is selected due to its large user community, robust documentation, excellent support for robust loss functions 10, and proven scalability for large-scale nonlinear least-squares problems.42
### **5.2 Mechanism for Automatic Outlier Rejection (AC-3, AC-5)**
The system must "correctly continue the work even in the presence of up to 350 meters of an outlier" (AC-3). A standard least-squares optimizer would be catastrophically corrupted by this event, as it would try to *average* this 350m error, pulling the *entire* 300km trajectory out of alignment.
A modern optimizer does not need to use brittle, hand-coded if-then logic to reject outliers. It can *mathematically* and *automatically* down-weight them using **Robust Loss Functions (Kernels)**.41
The mechanism is as follows:
1. The Ceres Back-End 10 maintains a graph of nodes (poses) and edges (constraints, or measurements).
2. A 350m outlier (AC-3) will create an edge with a *massive* error (residual).
3. A standard (quadratic) loss function $cost(error) = error^2$ would create a *catastrophic* cost, forcing the optimizer to ruin the entire graph to accommodate it.
4. Instead, the system will wrap its cost functions in a **Robust Loss Function**, such as **CauchyLoss** or **HuberLoss**.10
5. A robust loss function behaves quadratically for small errors (which it tries hard to fix) but becomes *sub-linear* for large errors. When it "sees" the 350m error, it mathematically *down-weights its influence*.43
6. The optimizer effectively *acknowledges* the 350m error but *refuses* to pull the entire graph to fix this one "insane" measurement. It automatically, and gracefully, treats the outlier as a "lost cause" and optimizes the 99.9% of "sane" measurements. This is the modern, robust solution to AC-3 and AC-5.
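The down-weighting behaviour is easy to verify numerically. The sketch below compares the quadratic cost against Huber and Cauchy losses evaluated at a scalar residual; the Cauchy form follows the shape of Ceres' CauchyLoss, $\rho(s) = c^2 \log(1 + s/c^2)$ with $s = r^2$:

```python
import math

def quadratic(r):
    """Standard least-squares cost: grows as r^2."""
    return r * r

def huber(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    return a * a if a <= delta else 2 * delta * a - delta * delta

def cauchy(r, c=1.0):
    """Logarithmic growth: large residuals are heavily down-weighted."""
    return c * c * math.log(1.0 + (r / c) ** 2)
```

For the 350m outlier (residual r = 350 with unit scale), the quadratic cost is 122,500, Huber reduces it to 699, and Cauchy to roughly 11.7, so the optimizer barely moves the graph to accommodate it, while small residuals keep their full quadratic influence.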
## **6.0 High-Resolution (6.2K) and Performance Optimization**
The system must simultaneously handle massive 6252x4168 (26-Megapixel) images and run on a modest RTX 2060 GPU [User Query] with a \<5s time limit (AC-7). These are opposing constraints.
### **6.1 The Multi-Scale Patch-Based Processing Pipeline**
Running *any* deep learning model (SuperPoint, LightGlue) on a full 6.2K image will be impossibly slow and will *immediately* cause a CUDA Out-of-Memory (OOM) error on a 6GB RTX 2060 [45].
The solution is not to process the full 6.2K image in real-time. Instead, a **multi-scale, patch-based pipeline** is required, where different components use the resolution best suited to their task [46].
1. **For SA-VO (Real-time, \<5s):** The SA-VO front-end is concerned with *motion*, not fine-grained detail. The 6.2K Image_N_HR is *immediately* downscaled to a manageable 1536x1024 (Image_N_LR). The entire SA-VO (SuperPoint + LightGlue) pipeline runs *only* on this low-resolution, fast-to-process image. This is how the \<5s (AC-7) budget is met.
2. **For CVGL (High-Accuracy, Async):** The CVGL module, which runs asynchronously, is where the 6.2K detail is *selectively* used to meet the 20m (AC-2) accuracy target. It uses a "coarse-to-fine" approach [48]:
* **Step A (Coarse):** The Siamese CNN [30] runs on the *downscaled* 1536px Image_N_LR to get a coarse [Lat, Lon] guess.
* **Step B (Fine):** The system uses this coarse guess to fetch the corresponding *high-resolution* satellite tile.
* **Step C (Patching):** The system runs the SuperPoint detector on the *full 6.2K* Image_N_HR to find the Top 100 *most confident* feature keypoints. It then extracts 100 small (e.g., 256x256) *patches* from the full-resolution image, centered on these keypoints [49].
* **Step D (Matching):** The system then matches *these small, full-resolution patches* against the high-res satellite tile.
This hybrid method provides the best of both worlds: the fine-grained matching accuracy [50] of the 6.2K image, but without the catastrophic OOM errors or performance penalties [45].
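The Step C patch extraction can be sketched as follows (illustrative NumPy only; the `extract_top_patches` helper and its parameters are hypothetical, with keypoints and confidences assumed to come from the SuperPoint detector):

```python
import numpy as np

def extract_top_patches(image_hr, keypoints, scores, k=100, size=256):
    """Extract k (size x size) patches centred on the k most confident keypoints.

    image_hr : (H, W) or (H, W, C) full-resolution image
    keypoints: (N, 2) array of (x, y) pixel coordinates
    scores   : (N,) detector confidences
    """
    half = size // 2
    # Reflect-pad so patches at the image border stay full-size.
    pad = [(half, half), (half, half)] + [(0, 0)] * (image_hr.ndim - 2)
    padded = np.pad(image_hr, pad, mode='reflect')
    order = np.argsort(scores)[::-1][:k]        # most confident first
    patches = []
    for x, y in np.round(keypoints[order]).astype(int):
        # (x, y) in the original image maps to (x + half, y + half) in the
        # padded image, so the patch slice starts directly at (y, x).
        patches.append(padded[y:y + size, x:x + size])
    return np.stack(patches), keypoints[order]

# Toy usage on a small synthetic image (stand-in for the 6.2K frame):
img = np.zeros((1024, 1024), dtype=np.uint8)
kps = np.array([[10.0, 10.0], [500.0, 500.0], [1020.0, 900.0]])
conf = np.array([0.2, 0.9, 0.5])
patches, kept = extract_top_patches(img, kps, conf, k=2, size=256)
print(patches.shape)   # two full-size 256x256 patches
```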
### **6.2 Real-Time Deployment with TensorRT**
PyTorch is a research and training framework. Its default inference speed, even on an RTX 2060, is often insufficient to meet a \<5s production requirement [23].
For the final production system, the key neural networks (SuperPoint, LightGlue, Siamese CNN) *must* be converted from their PyTorch-native format into a highly-optimized **NVIDIA TensorRT engine**.
* **Benefits:** TensorRT is an inference optimizer that applies graph optimizations, layer fusion, and precision reduction (e.g., to FP16) [52]. This can achieve a 2x-4x (or more) speedup over native PyTorch [28].
* **Deployment:** The resulting TensorRT engine can be deployed via a C++ API [25], which is far more suitable for a robust, high-performance production system.
This conversion is a *mandatory* deployment step. It is what makes a 2-second inference (well within the 5-second AC-7 budget) *achievable* on the specified RTX 2060 hardware.
## **7.0 System Robustness: Failure Mode and Logic Escalation**
The system's logic is designed as a multi-stage escalation process to handle the specific failure modes in the acceptance criteria (AC-3, AC-4, AC-6), ensuring the >95% registration rate (AC-9).
### **Stage 1: Normal Operation (Tracking)**
* **Condition:** SA-VO(N-1 -> N) succeeds. The LightGlue match is high-confidence, and the computed scale $s$ is reasonable.
* **Logic:**
1. The Relative_Metric_Pose is sent to the Back-End.
2. The Pose_N_Est is calculated and sent to the user (\<5s).
3. The CVGL module is queued to run asynchronously to provide a Pose_N_Refined at a later time.
### **Stage 2: Transient SA-VO Failure (AC-3 Outlier Handling)**
* **Condition:** SA-VO(N-1 -> N) fails. This could be a 350m outlier (AC-3), a severely blurred image, or an image with no features (e.g., over a cloud). The LightGlue match fails, or the computed scale $s$ is nonsensical.
* **Logic (Frame Skipping):**
1. The system *buffers* Image_N and marks it as "tentatively lost."
2. When Image_N+1 arrives, the SA-VO front-end attempts to "bridge the gap" by matching SA-VO(N-1 -> N+1).
3. **If successful:** A Relative_Metric_Pose for N+1 is found. Image_N is officially marked as a rejected outlier (AC-5). The system "correctly continues the work" (AC-3 met).
4. **If fails:** The system repeats for SA-VO(N-1 -> N+2).
5. If this "bridging" fails for 3 consecutive frames, the system concludes it is not a transient outlier but a persistent tracking loss, and escalates to Stage 3.
### **Stage 3: Persistent Tracking Loss (AC-4 Sharp Turn Handling)**
* **Condition:** The "frame-skipping" in Stage 2 fails. This is the "sharp turn" scenario [AC-4] where there is \<5% overlap between Image_N-1 and Image_N+k.
* **Logic (Multi-Map "Chunking"):**
1. The Back-End declares a "Tracking Lost" state at Image_N and creates a *new, independent map chunk* ("Chunk 2").
2. The SA-VO Front-End is re-initialized at Image_N and begins populating this new chunk, tracking SA-VO(N -> N+1), SA-VO(N+1 -> N+2), etc.
3. Because the front-end is **Scale-Aware**, this new "Chunk 2" is *already in metric scale*. It is a "floating island" of *known size and shape*; it just is not anchored to the global GPS map.
* **Resolution (Asynchronous Relocalization):**
1. The **CVGL Module** is now tasked, high-priority, to find a *single* Absolute_GPS_Pose for *any* frame in this new "Chunk 2".
2. Once the CVGL module (which is robust to oblique views, per Section 4.2) finds one (e.g., for Image_N+20), the Back-End has all the information it needs.
3. **Merging:** The Back-End calculates the simple 6-DoF transformation (3D translation and rotation, scale=1) to align all of "Chunk 2" and merge it with "Chunk 1". This robustly handles the "correctly continue the work" criterion (AC-4).
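Once a single absolute anchor is known, the Chunk-2 merge reduces to one matrix product per pose (illustrative NumPy sketch; `merge_chunk` is a hypothetical helper, and poses are simplified to 4x4 camera-to-world matrices):

```python
import numpy as np

def merge_chunk(chunk_poses, anchor_idx, anchor_abs):
    """Anchor a metric-scale "floating island" chunk to the global frame.

    chunk_poses : list of 4x4 camera-to-chunk-frame transforms
    anchor_idx  : index of the frame the CVGL module localized
    anchor_abs  : its 4x4 camera-to-world transform from the CVGL fix
    Because the chunk is already metric, a single rigid alignment
    (rotation + translation, scale = 1) maps every pose into the world.
    """
    T_align = anchor_abs @ np.linalg.inv(chunk_poses[anchor_idx])
    return [T_align @ T for T in chunk_poses]

# Toy chunk: three frames 10 m apart along the chunk's local x-axis.
local = [np.eye(4) for _ in range(3)]
for i in range(3):
    local[i][0, 3] = 10.0 * i
# CVGL fixes frame 1 at world (500, 200, 150), heading rotated 90 deg about z.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
anchor = np.eye(4)
anchor[:3, :3], anchor[:3, 3] = Rz, [500.0, 200.0, 150.0]
merged = merge_chunk(local, 1, anchor)
print(np.allclose(merged[1], anchor))   # the anchored frame lands exactly on its fix
```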
### **Stage 4: Catastrophic Failure (AC-6 User Intervention)**
* **Condition:** The system has entered Stage 3 and is building "Chunk 2," but the **CVGL Module** has *also* failed for a prolonged period (e.g., 20% of the route, or 50+ consecutive frames). This is the "worst-case" scenario (e.g., heavy clouds *and* over a large, featureless lake). The system is "absolutely incapable" [User Query].
* **Logic:**
1. The system has a metric-scale "Chunk 2" but zero idea where it is in the world.
2. The Back-End triggers the AC-6 flag.
* **Resolution (User Input):**
1. The UI prompts the user: "Tracking lost. Please provide a coarse location for the *current* image."
2. The UI displays the last known good image (from Chunk 1) and the new, "lost" image (e.g., Image_N+50).
3. The user clicks *one point* on the satellite map.
4. This user-provided [Lat, Lon] is *not* taken as ground truth. It is fed to the CVGL module as a *strong prior*, drastically narrowing its search area from "all of Ukraine" to "a 10km-radius circle."
5. This allows the CVGL module to re-acquire a lock, which triggers the Stage 3 merge, and the system continues.
## **8.0 Output Generation and Validation Strategy**
This section details how the final user-facing outputs are generated and how the system's compliance with all 10 acceptance criteria will be validated.
### **8.1 Generating Object-Level GPS (from Pixel Coordinate)**
This meets the requirement to find the "coordinates of the center of any object in these photos" [User Query]. The system provides this via a **Ray-Plane Intersection** method.
* **Inputs:**
1. The user clicks pixel coordinate $(u,v)$ on Image_N.
2. The system retrieves the refined, global 6-DoF pose $(R, T)$ for Image_N from the Back-End.
3. The system uses the known camera intrinsic matrix $K$.
4. The system uses the known *global ground-plane equation* (e.g., $Z=150m$, based on the predefined altitude and start coordinate).
* **Method:**
1. **Un-project Pixel:** The 2D pixel $(u,v)$ is un-projected into a 3D ray *direction* vector $d_{cam}$ in the camera's local coordinate system: $d_{cam} = K^{-1} \cdot [u, v, 1]^T$.
2. **Transform Ray:** This ray direction is transformed into the *global* coordinate system using the pose's rotation matrix: $d_{global} = R \cdot d_{cam}$.
3. **Define Ray:** A 3D ray is now defined, originating at the camera's global position $T$ (from the pose) and traveling in the direction $d_{global}$.
4. **Intersect:** The system solves the 3D line-plane intersection equation for this ray and the known global ground plane (e.g., find the intersection with $Z=150m$).
5. **Result:** The 3D intersection point $(X, Y, Z)$ is the *metric* world coordinate of the object on the ground.
6. **Convert:** This $(X, Y, Z)$ world coordinate is converted to a [Latitude, Longitude, Altitude] GPS coordinate. This process is immediate and can be performed for any pixel on any geolocated image.
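Steps 1-5 can be sketched in a few lines (illustrative NumPy; the nadir intrinsics and rotation values are made up, and the step-6 metric-to-WGS84 conversion, e.g. via pyproj, is omitted):

```python
import numpy as np

def pixel_to_ground(u, v, K, R, T, ground_z=0.0):
    """Un-project pixel (u, v) and intersect the ray with the ground plane.

    K    : 3x3 camera intrinsic matrix
    R, T : camera-to-world rotation (3x3) and camera position (3,)
    Returns the metric world point (X, Y, Z) on the plane Z = ground_z.
    """
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # step 1: un-project
    d_world = R @ d_cam                                 # step 2: rotate into world
    t = (ground_z - T[2]) / d_world[2]                  # step 4: solve T_z + t*d_z = ground_z
    return T + t * d_world                              # step 5: intersection point

# Toy nadir camera 150 m above the origin, f = 1000 px, principal point (960, 540):
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])     # camera looking straight down
T = np.array([0.0, 0.0, 150.0])
print(pixel_to_ground(960, 540, K, R, T))   # image centre: the point directly below
```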
### **8.2 Rigorous Validation Methodology**
A comprehensive test plan is required to validate all 10 acceptance criteria. The foundation of this is the creation of a **Ground-Truth Test Harness**.
* **Test Harness:**
1. **Ground-Truth Data:** Several test flights will be conducted in the operational area using a UAV equipped with a high-precision RTK/PPK GPS. This provides the "real GPS" (ground truth) for every image.
2. **Test Datasets:** Multiple test datasets will be curated from this ground-truth data:
* Test_Baseline_1000: A standard 1000-image flight.
* Test_Outlier_350m (AC-3): Test_Baseline_1000 with a single image from 350m away manually inserted at frame 30.
* Test_Sharp_Turn_5pct (AC-4): A sequence where frames 20-24 are manually deleted, simulating a \<5% overlap jump.
* Test_Catastrophic_Fail_20pct (AC-6): A sequence with 200 (20%) consecutive "bad" frames (e.g., pure sky, lens cap) inserted.
* Test_Full_3000: A full 3000-image sequence to test scalability and memory usage.
* **Test Cases:**
* **Test_Accuracy (AC-1, AC-2, AC-5, AC-9):**
* Run Test_Baseline_1000. A test script will compare the system's *final refined GPS output* for each image against its *ground-truth GPS*.
* ASSERT (count(errors < 50m) / 1000) ≥ 0.80 (AC-1)
* ASSERT (count(errors < 20m) / 1000) ≥ 0.60 (AC-2)
* ASSERT (count(un-localized_images) / 1000) < 0.10 (AC-5)
* ASSERT (count(localized_images) / 1000) > 0.95 (AC-9)
* **Test_MRE (AC-10):**
* ASSERT (BackEnd.final_MRE) < 1.0 (AC-10)
* **Test_Performance (AC-7, AC-8):**
* Run Test_Full_3000 on the minimum-spec RTX 2060.
* Log timestamps for "Image In" -> "Initial Pose Out". ASSERT average_time < 5.0s (AC-7).
* Log the output stream. ASSERT that >80% of images receive *two* poses: an "Initial" and a "Refined" (AC-8).
* **Test_Robustness (AC-3, AC-4, AC-6):**
* Run Test_Outlier_350m. ASSERT the system correctly continues and the final trajectory error for Image_31 is < 50m (AC-3).
* Run Test_Sharp_Turn_5pct. ASSERT the system logs "Tracking Lost" and "Maps Merged," and the final trajectory is complete and accurate (AC-4).
* Run Test_Catastrophic_Fail_20pct. ASSERT the system correctly triggers the "ask for user input" event (AC-6).
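The accuracy assertions can be collected into a single harness check (illustrative sketch; `check_accuracy_criteria` is a hypothetical helper, and the error distribution below is synthetic, not measured data):

```python
import numpy as np

def check_accuracy_criteria(errors_m, unlocalized):
    """Evaluate AC-1/AC-2/AC-5/AC-9 over one ground-truth test flight.

    errors_m    : per-image centre error (metres) vs RTK ground truth,
                  for every image the system managed to localize
    unlocalized : number of images the system could not register at all
    """
    e = np.asarray(errors_m, dtype=float)
    n = e.size + unlocalized            # total images in the flight
    return {
        "AC-1": np.sum(e < 50.0) / n >= 0.80,
        "AC-2": np.sum(e < 20.0) / n >= 0.60,
        "AC-5": unlocalized / n < 0.10,
        "AC-9": e.size / n > 0.95,
    }

# Synthetic error distribution for a 1000-image flight:
errors = np.concatenate([np.full(700, 5.0),     # well under 20 m
                         np.full(150, 30.0),    # between 20 m and 50 m
                         np.full(130, 120.0)])  # gross errors
print(check_accuracy_criteria(errors, unlocalized=20))
```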
Identify all potential weak points and problems. Address them and find out ways to solve them. Based on your findings, form a new solution draft in the same format.
If your finding requires a complete reorganization of the flow and different components, state it.
Put all the findings regarding what was weak and poor at the beginning of the report. Put here all new findings, what was updated, replaced, or removed from the previous solution.
Then form a new solution design without referencing the previous system. Remove Poor and Very Poor component choices from the component analysis tables, but leave Good and Excellent ones.
In the updated report, do not put "new" marks, do not compare to the previous solution draft, just make a new solution as if from scratch
---
Read carefully about the problem:
We have a lot of images taken from a wing-type UAV using a camera with at least Full HD resolution. Resolution of each photo could be up to 6200*4100 for the whole flight, but for other flights, it could be FullHD
Photos are taken and named consecutively within 100 meters of each other.
We know only the starting GPS coordinates. We need to determine the GPS of the centers of each image. And also the coordinates of the center of any object in these photos. We can use an external satellite provider for ground checks on the existing photos
System has next restrictions and conditions:
- Photos are taken by only airplane type UAVs.
- Photos are taken by the camera pointing downwards and fixed, but it is not autostabilized.
- The flying range is restricted by the eastern and southern parts of Ukraine (To the left of the Dnipro River)
- The image resolution could be from FullHD to 6252*4168. Camera parameters are known: focal length, sensor width, resolution and so on.
- Altitude is predefined and no more than 1km. The height of the terrain can be neglected.
- There is NO data from IMU
- Flights are done mostly in sunny weather
- We can use satellite providers, but we're limited right now to Google Maps, which could be outdated for some regions
- Number of photos could be up to 3000, usually in the 500-1500 range
- During the flight, UAVs can make sharp turns, so that the next photo may be absolutely different from the previous one (no same objects), but it is rather an exception than the rule
- Processing is done on a stationary computer or laptop with NVidia GPU at least RTX2060, better 3070. (For the UAV solution Jetson Orin Nano would be used, but that is out of scope.)
Output of the system should address next acceptance criteria:
- The system should find out the GPS of centers of 80% of the photos from the flight within an error of no more than 50 meters in comparison to the real GPS
- The system should find out the GPS of centers of 60% of the photos from the flight within an error of no more than 20 meters in comparison to the real GPS
- The system should correctly continue the work even in the presence of an outlier photo displaced by up to 350 meters between 2 consecutive pictures en route. This could happen due to tilt of the plane.
- The system should correctly continue the work even during sharp turns, where the next photo doesn't overlap at all, or overlaps by less than 5%. The next photo should be within less than 150m of drift and at an angle of less than 50%
- The number of outliers during the satellite provider images ground check should be less than 10%
- If the system is absolutely incapable of determining the GPS of the next, second next, and third next images by any means (these may be up to 20% of the route), it should ask the user for input on the next image, so that the user can specify the location
- Less than 5 seconds for processing one image
- Results of image processing should appear immediately to user, so that user shouldn't wait for the whole route to complete in order to analyze first results. Also, system could refine existing calculated results and send refined results again to user
- Image Registration Rate > 95%. The system can find enough matching features to confidently calculate the camera's 6-DoF pose (position and orientation) and "stitch" that image into the final trajectory
- Mean Reprojection Error (MRE) < 1.0 pixels. The distance, in pixels, between the original pixel location of the object and the re-projected pixel location.
Here is a solution draft:
**GEORTEX-R: A Geospatial-Temporal Robust Extraction System for IMU-Denied UAV Geolocalization**
## **1.0 GEORTEX-R: System Architecture and Data Flow**
The GEORTEX-R system is an asynchronous, three-component software solution designed for deployment on an NVIDIA RTX 2060+ GPU. It is architected from the ground up to handle the specific, demonstrated challenges of IMU-denied localization in *non-planar terrain* (as seen in Images 1-9) and *temporally-divergent* (outdated) reference maps (AC-5).
The system's core design principle is the *decoupling of unscaled relative motion from global metric scale*. The front-end estimates high-frequency, robust, but *unscaled* motion. The back-end asynchronously provides sparse, high-confidence *metric* and *geospatial* anchors. The central hub fuses these two data streams into a single, globally-optimized, metric-scale trajectory.
### **1.1 Inputs**
1. **Image Sequence:** Consecutively named images (FullHD to 6252x4168).
2. **Start Coordinate (Image 0):** A single, absolute GPS coordinate (Latitude, Longitude) for the first image.
3. **Camera Intrinsics ($K$):** A pre-calibrated camera intrinsic matrix.
4. **Altitude Prior ($H_{prior}$):** The *approximate* predefined metric altitude (e.g., 900 meters). This is used as a *prior* (a hint) for optimization, *not* a hard constraint.
5. **Geospatial API Access:** Credentials for an on-demand satellite and DEM provider (e.g., Copernicus, EOSDA).
### **1.2 Streaming Outputs**
1. **Initial Pose (Pose_N_Est):** An *unscaled* pose estimate. This is sent immediately to the UI for real-time visualization of the UAV's *path shape* (AC-7, AC-8).
2. **Refined Pose (Pose_N_Refined) [Asynchronous]:** A globally-optimized, *metric-scale* pose (X, Y, Z, Qx, Qy, Qz, Qw) and its corresponding [Lat, Lon, Alt] coordinate. This is sent to the user whenever the Trajectory Optimization Hub re-converges, updating all past poses (AC-1, AC-2, AC-8).
### **1.3 Component Interaction and Data Flow**
The system is architected as three parallel-processing components:
1. **Image Ingestion & Pre-processing:** This module receives the new Image_N (up to 6.2K). It creates two copies:
* Image_N_LR (Low-Resolution, e.g., 1536x1024): Dispatched *immediately* to the V-SLAM Front-End for real-time processing.
* Image_N_HR (High-Resolution, 6.2K): Stored for asynchronous use by the Geospatial Anchoring Back-End (GAB).
2. **V-SLAM Front-End (High-Frequency Thread):** This component's sole task is high-speed, *unscaled* relative pose estimation. It tracks Image_N_LR against a *local map of keyframes*. It performs local bundle adjustment to minimize drift [12] and maintains a co-visibility graph of all keyframes. It sends Relative_Unscaled_Pose estimates to the Trajectory Optimization Hub (TOH).
3. **Geospatial Anchoring Back-End (GAB) (Low-Frequency, Asynchronous Thread):** This is the system's "anchor." When triggered by the TOH, it fetches *on-demand* geospatial data (satellite imagery and DEMs) from an external API [3]. It then performs a robust *hybrid semantic-visual* search [5] to find an *absolute, metric, global pose* for a given keyframe, robust to outdated maps (AC-5) [5] and oblique views (AC-4) [14]. This Absolute_Metric_Anchor is sent to the TOH.
4. **Trajectory Optimization Hub (TOH) (Central Hub):** This component manages the complete flight trajectory as a **Sim(3) pose graph** (7-DoF). It continuously fuses two distinct data streams:
* **On receiving Relative_Unscaled_Pose (T \< 5s):** It appends this pose to the graph, calculates the Pose_N_Est, and sends this *unscaled* initial result to the user (AC-7, AC-8 met).
* **On receiving Absolute_Metric_Anchor (T > 5s):** This is the critical event. It adds this as a high-confidence *global metric constraint*. This anchor creates "tension" in the graph, which the optimizer (Ceres Solver [15]) resolves by finding the *single global scale factor* that best fits all V-SLAM and CVGL measurements. It then triggers a full graph re-optimization, "stretching" the entire trajectory to the correct metric scale, and sends the new Pose_N_Refined stream to the user for all affected poses (AC-1, AC-2, AC-8 refinement met).
## **2.0 Core Component: The High-Frequency V-SLAM Front-End**
This component's sole task is to robustly and accurately compute the *unscaled* 6-DoF relative motion of the UAV and build a geometrically-consistent map of keyframes. It is explicitly designed to be more robust to drift than simple frame-to-frame odometry.
### **2.1 Rationale: Keyframe-Based Monocular SLAM**
The choice of a keyframe-based V-SLAM front-end over a frame-to-frame VO is deliberate and critical for system robustness.
* **Drift Mitigation:** Frame-to-frame VO is "prone to drift accumulation due to errors introduced by each frame-to-frame motion estimation" [13]. A single poor match permanently corrupts all future poses.
* **Robustness:** A keyframe-based system tracks new images against a *local map* of *multiple* previous keyframes, not just Image_N-1. This provides resilience to transient failures (e.g., motion blur, occlusion).
* **Optimization:** This architecture enables "local bundle adjustment" [12], a process where a sliding window of recent keyframes is continuously re-optimized, actively minimizing error and drift *before* it can accumulate.
* **Relocalization:** This architecture possesses *innate relocalization capabilities* (see Section 6.3), which is the correct, robust solution to the "sharp turn" (AC-4) requirement.
### **2.2 Feature Matching Sub-System**
The success of the V-SLAM front-end depends entirely on high-quality feature matches, especially in the sparse, low-texture agricultural terrain seen in the provided images (e.g., Image 6, Image 7). The system requires a matcher that is robust (for sparse textures [17]) and extremely fast (for AC-7).
The selected approach is **SuperPoint + LightGlue**.
* **SuperPoint:** A SOTA (State-of-the-Art) feature detector proven to find robust, repeatable keypoints in challenging, low-texture conditions [17].
* **LightGlue:** A highly optimized GNN-based matcher that is the successor to SuperGlue [19].
The key advantage of selecting LightGlue [19] over SuperGlue [20] is its *adaptive nature*. The query states sharp turns (AC-4) are "rather an exception." This implies \~95% of image pairs are "easy" (high-overlap, straight flight) and 5% are "hard" (low-overlap, turns). SuperGlue uses a fixed-depth GNN, spending the *same* large amount of compute on an "easy" pair as on a "hard" one. LightGlue is *adaptive* [19]. For an "easy" pair, it can exit its GNN early, returning a high-confidence match in a fraction of the time. This saves an *enormous* computational budget on the 95% of "easy" frames, ensuring the system *always* meets the \<5s budget (AC-7) and reserving that compute for the GAB.
#### **Table 1: Analysis of State-of-the-Art Feature Matchers (For V-SLAM Front-End)**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **SuperPoint + SuperGlue** [20] | - SOTA robustness in low-texture, high-blur conditions. - GNN reasons about 3D scene context. - Proven in real-time SLAM systems. | - Computationally heavy (fixed-depth GNN). - Slower than LightGlue [19]. | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT [21]. | **Good.** A solid, baseline choice. Meets robustness needs but will heavily tax the \<5s time budget (AC-7). |
| **SuperPoint + LightGlue** [17] | - **Adaptive Depth:** Faster on "easy" pairs, more accurate on "hard" pairs [19]. - **Faster & Lighter:** Outperforms SuperGlue on speed and accuracy. - SOTA "in practice" choice for large-scale matching [17]. | - Newer, but rapidly being adopted and proven [21]. | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT [22]. | **Excellent (Selected).** The adaptive nature is *perfect* for this problem. It saves compute on the 95% of easy (straight) frames, maximizing our ability to meet AC-7. |
## **3.0 Core Component: The Geospatial Anchoring Back-End (GAB)**
This component is the system's "anchor to reality." It runs asynchronously to provide the *absolute, metric-scale* constraints needed to solve the trajectory. It is an *on-demand* system that solves three distinct "domain gaps": the hardware/scale gap, the temporal gap, and the viewpoint gap.
### **3.1 On-Demand Geospatial Data Retrieval**
A "pre-computed database" for all of Eastern Ukraine is operationally unfeasible on laptop-grade hardware [1]. This design is replaced by an on-demand, API-driven workflow.
* **Mechanism:** When the TOH requests a global anchor, the GAB receives a *coarse* [Lat, Lon] estimate. The GAB then performs API calls to a geospatial data provider (e.g., EOSDA [3], Copernicus [8]).
* **Dual-Retrieval:** The API query requests *two* distinct products for the specified Area of Interest (AOI):
1. **Visual Tile:** A high-resolution (e.g., 30-50cm) satellite ortho-image [26].
2. **Terrain Tile:** The corresponding **Digital Elevation Model (DEM)**, such as the Copernicus GLO-30 (30m resolution) or SRTM (30m) [7].
This "Dual-Retrieval" mechanism is the central, enabling synergy of the new architecture. The **Visual Tile** is used by the CVGL (Section 3.2) to find the *geospatial pose*. The **DEM Tile** is used by the *output module* (Section 7.1) to perform high-accuracy **Ray-DEM Intersection**, solving the final output accuracy problem.
### **3.2 Hybrid Semantic-Visual Localization**
The "temporal gap" (evidenced by burn scars in Images 1-9) and "outdated maps" (AC-5) make a purely visual CVGL system unreliable [5]. The GAB solves this using a robust, two-stage *hybrid* matching pipeline.
1. **Stage 1: Coarse Visual Retrieval (Siamese CNN).** A lightweight Siamese CNN [14] is used to find the *approximate* location of the Image_N_LR *within* the large, newly-fetched satellite tile. This acts as a "candidate generator."
2. **Stage 2: Fine-Grained Semantic-Visual Fusion.** For the top candidates, the GAB performs a *dual-channel alignment*.
* **Visual Channel (Unreliable):** It runs SuperPoint+LightGlue on high-resolution *patches* (from Image_N_HR) against the satellite tile. This match may be *weak* due to temporal gaps [5].
* **Semantic Channel (Reliable):** It extracts *temporally-invariant* semantic features (e.g., road-vectors, field-boundaries, tree-cluster-polygons, lake shorelines) from *both* the UAV image (using a segmentation model) and the satellite/OpenStreetMap data [5].
* **Fusion:** A RANSAC-based optimizer finds the 6-DoF pose that *best aligns* this *hybrid* set of features.
This hybrid approach is robust to the exact failure mode seen in the images. When matching Image 3 (burn scars), the *visual* LightGlue match will be poor. However, the *semantic* features (the dirt road, the tree line) are *unchanged*. The optimizer will find a high-confidence pose by *trusting the semantic alignment* over the poor visual alignment, thereby succeeding despite the "outdated map" (AC-5).
### **3.3 Solution to Viewpoint Gap: Synthetic Oblique View Training**
This component is critical for handling "sharp turns" (AC-4). The camera *will* be oblique, not nadir, during turns.
* **Problem:** The GAB's Stage 1 Siamese CNN [14] will be matching an *oblique* UAV view to a *nadir* satellite tile. This "viewpoint gap" will cause a match failure [14].
* **Mechanism (Synthetic Data Generation):** The network must be trained for *viewpoint invariance* [28].
1. Using the on-demand DEMs (fetched in 3.1) and satellite tiles, the system can *synthetically render* the satellite imagery from *any* roll, pitch, and altitude.
2. The Siamese network is trained on (Nadir_Tile, Synthetic_Oblique_Tile) pairs [14].
* **Result:** This process teaches the network to match the *underlying ground features*, not the *perspective distortion*. It ensures the GAB can relocalize the UAV *precisely* when it is needed most: during a sharp, banking turn (AC-4) when VO tracking has been lost.
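The core of the synthetic rendering step is a rotation-induced homography (a minimal sketch; this flat-ground approximation ignores the DEM parallax that the full renderer would add, and the intrinsics are illustrative):

```python
import numpy as np

def oblique_homography(K, roll_deg=0.0, pitch_deg=0.0):
    """Rotation-induced homography H = K R K^-1 for rendering a synthetic
    oblique view of a (locally flat) nadir satellite tile. DEM-aware
    rendering adds per-pixel parallax on top of this approximation.
    """
    r, p = np.radians([roll_deg, pitch_deg])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(r), -np.sin(r)],
                   [0, np.sin(r), np.cos(r)]])   # roll about the x-axis
    Ry = np.array([[np.cos(p), 0, np.sin(p)],
                   [0, 1, 0],
                   [-np.sin(p), 0, np.cos(p)]])  # pitch about the y-axis
    return K @ (Ry @ Rx) @ np.linalg.inv(K)

K = np.array([[1000.0, 0.0, 512.0], [0.0, 1000.0, 512.0], [0.0, 0.0, 1.0]])
H = oblique_homography(K, pitch_deg=30.0)   # e.g. a 30-degree banking turn
# Something like cv2.warpPerspective(nadir_tile, H, (w, h)) would then
# produce the Synthetic_Oblique_Tile training view.
print(np.allclose(oblique_homography(K), np.eye(3)))   # zero tilt: identity warp
```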
## **4.0 Core Component: The Trajectory Optimization Hub (TOH)**
This component is the system's central "brain." It runs continuously, fusing all measurements (high-frequency/unscaled V-SLAM, low-frequency/metric-scale GAB anchors) into a single, globally consistent trajectory.
### **4.1 Incremental Sim(3) Pose-Graph Optimization**
The "planar ground" SA-VO (Finding 1) is removed. This component is its replacement. The system must *discover* the global scale, not *assume* it.
* **Selected Strategy:** An incremental pose-graph optimizer using **Ceres Solver**.15
* **The Sim(3) Insight:** The V-SLAM front-end produces *unscaled* 6-DoF ($SE(3)$) relative poses. The GAB produces *metric-scale* 6-DoF ($SE(3)$) *absolute* poses. These cannot be directly combined. The graph must be optimized in **Sim(3) (7-DoF)**, which adds a *single global scale factor $s$* as an optimizable variable.
* **Mechanism (Ceres Solver):**
1. **Nodes:** Each keyframe pose (7-DoF: translation $X, Y, Z$; rotation quaternion $Qx, Qy, Qz, Qw$; scale $s$).
2. **Edge 1 (V-SLAM):** A relative pose constraint between Keyframe_i and Keyframe_j. The error is computed in Sim(3).
3. **Edge 2 (GAB):** An *absolute* pose constraint on Keyframe_k. This constraint *fixes* Keyframe_k's pose to the *metric* GPS coordinate and *fixes its scale $s$ to 1.0*.
* **Bootstrapping Scale:** The TOH graph "bootstraps" the scale [32]. The GAB's $s=1.0$ anchor creates "tension" in the graph. The Ceres optimizer [15] resolves this tension by finding the *one* global scale $s$ for all V-SLAM nodes that minimizes the total error, effectively "stretching" the entire unscaled trajectory to fit the metric anchors. This is robust to *any* terrain [34].
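With translations already aligned, the scale recovered by this "tension" has a simple closed form over anchor baselines (illustrative sketch; `bootstrap_scale` is a hypothetical helper, and the real Sim(3) solve in Ceres estimates rotation and translation jointly with $s$):

```python
import numpy as np

def bootstrap_scale(unscaled_pos, anchors):
    """Least-squares global scale from sparse metric anchors.

    unscaled_pos : (N, 3) V-SLAM positions in arbitrary scale
    anchors      : dict {frame_idx: (3,) metric position from the GAB}
    """
    idx = sorted(anchors)
    num = den = 0.0
    for i, j in zip(idx, idx[1:]):           # consecutive anchor pairs
        metric = np.linalg.norm(np.asarray(anchors[j]) - np.asarray(anchors[i]))
        raw = np.linalg.norm(unscaled_pos[j] - unscaled_pos[i])
        num += metric * raw                   # LS solution of s * raw = metric
        den += raw * raw
    return num / den

# Toy trajectory of 4 keyframes, one V-SLAM unit apart, with two GAB anchors:
traj = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]], dtype=float)
gab = {0: [0.0, 0.0, 0.0], 3: [285.0, 0.0, 0.0]}
print(bootstrap_scale(traj, gab))   # 95 metres per V-SLAM unit
```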
#### **Table 2: Analysis of Trajectory Optimization Strategies**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **Incremental SLAM (Pose-Graph Optimization)** (Ceres Solver [15], g2o [35], GTSAM) | - **Real-time / Online:** Provides immediate pose estimates (AC-7). - **Supports Refinement:** Explicitly designed to refine past poses when new "loop closure" (GAB) data arrives (AC-8) [13]. - **Robust:** Can handle outliers via robust kernels [15]. | - Initial estimate is *unscaled* until a GAB anchor arrives. - Can drift *if* not anchored (though V-SLAM minimizes this). | - A graph optimization library (Ceres). - A robust cost function. | **Excellent (Selected).** This is the *only* architecture that satisfies all user requirements for real-time streaming and asynchronous refinement. |
| **Batch Structure from Motion (Global Bundle Adjustment)** (COLMAP, Agisoft Metashape) | - **Globally Optimal Accuracy:** Produces the most accurate possible 3D reconstruction and trajectory. | - **Offline:** Cannot run in real-time or stream results. - High computational cost (minutes to hours). - Fails AC-7 and AC-8 completely. | - All images must be available before processing starts. - High RAM and CPU. | **Good (as an *Optional* Post-Processing Step).** Unsuitable as the primary online system, but could be offered as an optional, high-accuracy "Finalize Trajectory" batch process. |
### **4.2 Automatic Outlier Rejection (AC-3, AC-5)**
The system must handle 350m outliers (AC-3) and \<10% bad GAB matches (AC-5).
* **Mechanism (Robust Loss Functions):** A standard quadratic least-squares cost (the default in Ceres [15]) would be catastrophically corrupted by a 350m error. The solution is to wrap *all* constraints in a **Robust Loss Function (e.g., HuberLoss, CauchyLoss)** [15].
* **Result:** A robust loss function mathematically *down-weights* the influence of constraints with large errors. When it "sees" the 350m error (AC-3), it effectively acknowledges the measurement but *refuses* to pull the entire 3000-image trajectory to fit this one "insane" data point. It automatically and gracefully *ignores* the outlier, optimizing the 99.9% of "sane" measurements. This is the modern, robust solution to AC-3 and AC-5.
## **5.0 High-Performance Compute & Deployment**
The system must run on an RTX 2060 (AC-7) and process 6.2K images. These are opposing constraints.
### **5.1 Multi-Scale, Patch-Based Processing Pipeline**
Running deep learning models (SuperPoint, LightGlue) on a full 6.2K (26-Megapixel) image will cause a CUDA Out-of-Memory (OOM) error and be impossibly slow.
* **Mechanism (Coarse-to-Fine):**
1. **For V-SLAM (Real-time, \<5s):** The V-SLAM front-end (Section 2.0) runs *only* on the Image_N_LR (e.g., 1536x1024) copy. This is fast enough to meet the AC-7 budget.
2. **For GAB (High-Accuracy, Async):** The GAB (Section 3.0) uses the full-resolution Image_N_HR *selectively* to meet the 20m accuracy (AC-2).
* It first runs its coarse Siamese CNN 27 on the Image_N_LR.
* It then runs the SuperPoint detector on the *full 6.2K* image to find the *most confident* feature keypoints.
* It then extracts small, 256x256 *patches* from the *full-resolution* image, centered on these keypoints.
* It matches *these small, full-resolution patches* against the high-res satellite tile.
* **Result:** This hybrid method provides the fine-grained matching accuracy of the 6.2K image (needed for AC-2) without the catastrophic OOM errors or performance penalties.
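The patch-cropping step can be sketched with a hypothetical helper (not the production pipeline; the keypoint source and matcher are assumed): crop 256x256 windows from the full-resolution frame around detected keypoints so that only small tiles, never the full 26 MP image, go through full-resolution matching.

```python
# Sketch: extract fixed-size full-resolution patches around keypoints.
import numpy as np

def extract_patches(image: np.ndarray, keypoints, size: int = 256):
    """Return (patch, top_left) pairs; keypoints too close to the border are skipped."""
    h, w = image.shape[:2]
    half = size // 2
    patches = []
    for (x, y) in keypoints:
        x, y = int(round(x)), int(round(y))
        if half <= x < w - half and half <= y < h - half:
            patch = image[y - half:y + half, x - half:x + half]
            patches.append((patch, (x - half, y - half)))
    return patches

img = np.zeros((4168, 6252), dtype=np.uint8)      # full 6.2K frame
kps = [(3000.0, 2000.0), (10.0, 10.0)]            # second one is near the border
patches = extract_patches(img, kps)
print(len(patches), patches[0][0].shape)          # 1 (256, 256)
```

Each returned `top_left` offset lets the matcher map patch-local match coordinates back into full-frame pixel coordinates.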
### **5.2 Mandatory Deployment: NVIDIA TensorRT Acceleration**
PyTorch is a research framework; stock PyTorch inference is too slow to meet the \<5s production budget.
* **Requirement:** The key neural networks (SuperPoint, LightGlue, Siamese CNN) *must* be converted from PyTorch into a highly-optimized **NVIDIA TensorRT engine**.
* **Research Validation:** 23 demonstrates this process for LightGlue, achieving "2x-4x speed gains over compiled PyTorch." 22 and 21 provide open-source repositories for SuperPoint+LightGlue conversion to ONNX and TensorRT.
* **Result:** This is not an "optional" optimization. It is a *mandatory* deployment step. This conversion (which applies layer fusion, graph optimization, and FP16 precision) is what makes achieving the \<5s (AC-7) performance *possible* on the specified RTX 2060 hardware.36
## **6.0 System Robustness: Failure Mode Escalation Logic**
This logic defines the system's behavior during real-world failures, ensuring it meets criteria AC-3, AC-4, AC-6, and AC-9.
### **6.1 Stage 1: Normal Operation (Tracking)**
* **Condition:** V-SLAM front-end (Section 2.0) is healthy.
* **Logic:**
1. V-SLAM successfully tracks Image_N_LR against its local keyframe map.
2. A new Relative_Unscaled_Pose is sent to the TOH.
3. TOH sends Pose_N_Est (unscaled) to the user (\<5s).
4. If Image_N is selected as a new keyframe, the GAB (Section 3.0) is *queued* to find an Absolute_Metric_Anchor for it, which will trigger a Pose_N_Refined update later.
### **6.2 Stage 2: Transient VO Failure (Outlier Rejection)**
* **Condition:** Image_N is unusable (e.g., severe blur, sun-glare, 350m outlier per AC-3).
* **Logic (Frame Skipping):**
1. V-SLAM front-end fails to track Image_N_LR against the local map.
2. The system *discards* Image_N (marking it as a rejected outlier, AC-5).
3. When Image_N+1 arrives, the V-SLAM front-end attempts to track it against the *same* local keyframe map (from Image_N-1).
4. **If successful:** Tracking resumes. Image_N is officially an outlier. The system "correctly continues the work" (AC-3 met).
5. **If fails:** The system repeats for Image_N+2, N+3. If this fails for \~5 consecutive frames, it escalates to Stage 3.
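The Stage-2 frame-skipping loop above can be sketched as follows (the tracker callback and status names are assumptions, not the real V-SLAM API): keep matching new frames against the same local map and escalate to Stage 3 after a run of consecutive failures.

```python
# Sketch of the Stage-2 escalation logic with a pluggable tracking callback.
MAX_SKIPS = 5  # consecutive failures before declaring "Tracking Lost"

def process_frames(track_against_local_map, frames):
    """Yield (frame_id, status); status is 'tracked', 'outlier', or 'lost'."""
    skips = 0
    for frame_id, frame in frames:
        if track_against_local_map(frame):
            skips = 0
            yield frame_id, "tracked"
        else:
            skips += 1
            if skips >= MAX_SKIPS:
                yield frame_id, "lost"       # escalate to Stage 3
                skips = 0
            else:
                yield frame_id, "outlier"    # discard per AC-5

# Toy run: frame 2 is a 350m outlier, tracking resumes at frame 3 (AC-3).
ok = {1: True, 2: False, 3: True}
statuses = dict(process_frames(lambda f: ok[f], [(i, i) for i in ok]))
print(statuses)  # {1: 'tracked', 2: 'outlier', 3: 'tracked'}
```

Note that the skip counter resets on any success, so isolated outliers never accumulate toward the Stage-3 threshold.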
### **6.3 Stage 3: Persistent VO Failure (Relocalization)**
* **Condition:** Tracking is lost for multiple consecutive frames. This is the "sharp turn" / \<5% overlap scenario (AC-4).
* **Logic (Keyframe-Based Relocalization):**
1. The V-SLAM front-end declares "Tracking Lost."
2. **Critically:** It does *not* create a "new map chunk."
3. Instead, it enters **Relocalization Mode**. For every new Image_N+k, it extracts features (SuperPoint) and queries the *entire* existing database of past keyframes for a match.
* **Resolution:** The UAV completes its sharp turn. Image_N+5 now has high overlap with Image_N-10 (from *before* the turn).
1. The relocalization query finds a strong match.
2. The V-SLAM front-end computes the 6-DoF pose of Image_N+5 relative to the *existing map*.
3. Tracking is *resumed* seamlessly. The system "correctly continues the work" (AC-4 met). This is vastly more robust than the previous "map-merging" logic.
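The keyframe-database query at the heart of Relocalization Mode can be sketched as a global-descriptor nearest-neighbour search (the descriptor layout and threshold are assumptions; production systems typically use a learned image-retrieval descriptor on top of SuperPoint features):

```python
# Sketch: score the lost frame's global descriptor against every stored
# keyframe; accept the best match only above a confidence threshold.
import numpy as np

def relocalize(query_desc, keyframe_descs, min_score=0.8):
    """keyframe_descs: (N, D) L2-normalised rows. Returns best index or None."""
    scores = keyframe_descs @ query_desc      # cosine similarity per keyframe
    best = int(np.argmax(scores))
    return best if scores[best] >= min_score else None

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 256))              # 100 past keyframes
db /= np.linalg.norm(db, axis=1, keepdims=True)
query = db[42] + 0.02 * rng.normal(size=256)  # noisy view of keyframe 42
query /= np.linalg.norm(query)
print(relocalize(query, db))                  # best match is keyframe 42
```

The `min_score` gate is what prevents a false relocalization: an unrelated frame scores near zero against every keyframe and the query returns `None`, keeping the system in Relocalization Mode.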
### **6.4 Stage 4: Catastrophic Failure (User Intervention)**
* **Condition:** The system is in Stage 3 (Lost), but *also*, the **GAB (Section 3.0) has failed** to find *any* global anchors for a prolonged period (e.g., 20% of the route). This is the "absolutely incapable" scenario (AC-6), (e.g., heavy fog *and* featureless, uniform terrain).
* **Logic:**
1. The system has an *unscaled* trajectory, and *zero* idea where it is in the world.
2. The TOH triggers the AC-6 flag.
* **Resolution (User-Aided Prior):**
1. The UI prompts the user: "Tracking lost. Please provide a coarse location for the *current* image."
2. The user clicks *one point* on a map.
3. This [Lat, Lon] is *not* taken as ground truth. It is fed to the **GAB (Section 3.1)** as a *strong prior* for its on-demand API query.
4. This narrows the GAB's search area from "all of Ukraine" to "a 5km radius." This *guarantees* the GAB's Dual-Retrieval (Section 3.1) will fetch the *correct* satellite and DEM tiles, allowing the Hybrid Matcher (Section 3.2) to find a high-confidence Absolute_Metric_Anchor, which in turn re-scales (Section 4.1) and relocalizes the entire trajectory.
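Turning the user's single click into a tile-fetch region is simple geometry. A rough sketch (assuming a spherical Earth and a small-area approximation, which is adequate at a 5km scale):

```python
# Sketch: convert a clicked [lat, lon] plus a search radius into a
# bounding box for satellite/DEM tile fetching.
import math

EARTH_RADIUS_M = 6_371_000.0

def search_bbox(lat: float, lon: float, radius_m: float = 5_000.0):
    """Return (lat_min, lon_min, lat_max, lon_max) around the clicked point."""
    dlat = math.degrees(radius_m / EARTH_RADIUS_M)
    # Longitude degrees shrink with cos(latitude).
    dlon = math.degrees(radius_m / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

bbox = search_bbox(48.2753, 37.3852)   # user click near the test route
print(bbox)
```

The resulting box (~0.09 degrees of latitude across) bounds the GAB's Dual-Retrieval query instead of the whole operating region.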
## **7.0 Output Generation and Validation Strategy**
This section details how the final user-facing outputs are generated, specifically solving the "planar ground" output flaw, and how the system's compliance with all 10 ACs will be validated.
### **7.1 High-Accuracy Object Geolocalization via Ray-DEM Intersection**
The "Ray-Plane Intersection" method is inaccurate for non-planar terrain 37 and is replaced with a high-accuracy ray-tracing method. This is the correct method for geolocating an object on the *non-planar* terrain visible in Images 1-9.
* **Inputs:**
1. User clicks pixel coordinate $(u,v)$ on Image_N.
2. System retrieves the *final, refined, metric* 7-DoF pose $P = (R, T, s)$ for Image_N from the TOH.
3. The system uses the known camera intrinsic matrix $K$.
4. System retrieves the specific **30m DEM tile** 8 that was fetched by the GAB (Section 3.1) for this region of the map. This DEM raster is converted into a 3D terrain mesh.
* **Algorithm (Ray-DEM Intersection):**
1. **Un-project Pixel:** The 2D pixel $(u,v)$ is un-projected into a 3D ray *direction* vector $d_{cam}$ in the camera's local coordinate system: $d_{cam} = K^{-1} \cdot [u, v, 1]^T$.
2. **Transform Ray:** This ray direction $d_{cam}$ and origin $(0,0,0)$ are transformed into the *global, metric* coordinate system using the pose $P$. This yields a ray originating at $T$ and traveling in direction $R \cdot d_{cam}$.
3. **Intersect:** The system performs a numerical *ray-mesh intersection* 39 to find the 3D point $(X, Y, Z)$ where this global ray *intersects the 3D terrain mesh* of the DEM.
4. **Result:** This 3D intersection point $(X, Y, Z)$ is the *metric* world coordinate of the object *on the actual terrain*.
5. **Convert:** This $(X, Y, Z)$ world coordinate is converted to a [Latitude, Longitude, Altitude] GPS coordinate.
This method correctly accounts for terrain. A pixel aimed at the top of a hill will intersect the DEM at a high Z-value. A pixel aimed at the ravine (Image 1) will intersect at a low Z-value. This is the *only* method that can reliably meet the 20m accuracy (AC-2) for object localization.
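A simplified sketch of the algorithm (assuming a pinhole model, a toy flat height grid, and naive ray-marching instead of the production ray-mesh intersection library): un-project the pixel, transform the ray into world coordinates, and march along it until it drops below the DEM surface.

```python
# Sketch: pixel -> ray -> first DEM hit, by stepping along the ray.
import numpy as np

def ray_dem_intersect(u, v, K, R, T, dem, cell=30.0, step=1.0, max_range=3000.0):
    """dem[i, j] is terrain height (m) at x = j*cell, y = i*cell."""
    d = R @ np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray direction in world
    d /= np.linalg.norm(d)
    for t in np.arange(0.0, max_range, step):
        p = T + t * d                                  # point along the ray
        i, j = int(p[1] // cell), int(p[0] // cell)
        if 0 <= i < dem.shape[0] and 0 <= j < dem.shape[1] and p[2] <= dem[i, j]:
            return p                                   # first terrain hit
    return None

K = np.array([[4000.0, 0, 3126], [0, 4000.0, 2084], [0, 0, 1]])  # assumed intrinsics
R = np.diag([1.0, 1.0, -1.0])           # nadir camera: optical axis points down
T = np.array([1000.0, 1000.0, 800.0])   # 800 m altitude
dem = np.full((200, 200), 120.0)        # toy DEM: flat terrain at 120 m
hit = ray_dem_intersect(3126, 2084, K, R, T, dem)
print(hit)  # centre pixel hits the terrain directly below, roughly [1000, 1000, 120]
```

On real terrain the same loop terminates at the first cell whose height exceeds the ray, which is exactly the hilltop-versus-ravine behaviour described above; production code would use an accelerated ray-mesh intersection instead of fixed-step marching.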
### **7.2 Rigorous Validation Methodology**
A comprehensive test plan is required. The foundation is a **Ground-Truth Test Harness** using the provided coordinates.csv.42
* **Test Harness:**
1. **Ground-Truth Data:** The file coordinates.csv 42 provides ground-truth [Lat, Lon] for 60 images (e.g., AD000001.jpg...AD000060.jpg).
2. **Test Datasets:**
* Test_Baseline_60 42: The 60 images and their coordinates.
* Test_Outlier_350m (AC-3): Test_Baseline_60 with a single, unrelated image inserted at frame 30.
* Test_Sharp_Turn_5pct (AC-4): A sequence where frames 20-24 are manually deleted, simulating a \<5% overlap jump.
* **Test Cases:**
* **Test_Accuracy (AC-1, AC-2, AC-5, AC-9):**
* **Run:** Execute GEORTEX-R on Test_Baseline_60, providing AD000001.jpg's coordinate (48.275292, 37.385220) as the Start Coordinate.42
* **Script:** A validation script will compute the Haversine distance error between the *system's refined GPS output* for each image (2-60) and the *ground-truth GPS* from coordinates.csv.
* **ASSERT** (count(errors \< 50m) / 60) >= 0.80 **(AC-1 Met)**
* **ASSERT** (count(errors \< 20m) / 60) >= 0.60 **(AC-2 Met)**
* **ASSERT** (count(un-localized_images) / 60) \< 0.10 **(AC-5 Met)**
* **ASSERT** (count(localized_images) / 60) > 0.95 **(AC-9 Met)**
* **Test_MRE (AC-10):**
* **Run:** After Test_Baseline_60 completes.
* **ASSERT** TOH.final_Mean_Reprojection_Error \< 1.0 **(AC-10 Met)**
* **Test_Performance (AC-7, AC-8):**
* **Run:** Execute on a 1500-image sequence on the minimum-spec RTX 2060.
* **Log:** Log timestamps for "Image In" -> "Initial Pose Out".
* **ASSERT** average_time \< 5.0s **(AC-7 Met)**
* **Log:** Log the output stream.
* **ASSERT** >80% of images receive *two* poses: an "Initial" and a "Refined" **(AC-8 Met)**
* **Test_Robustness (AC-3, AC-4):**
* **Run:** Execute Test_Outlier_350m.
* **ASSERT** System logs "Stage 2: Discarding Outlier" and the final trajectory error for Image_31 is \< 50m **(AC-3 Met)**.
* **Run:** Execute Test_Sharp_Turn_5pct.
* **ASSERT** System logs "Stage 3: Tracking Lost" and "Relocalization Succeeded," and the final trajectory is complete and accurate **(AC-4 Met)**.
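The core of the validation script for the accuracy assertions can be sketched as follows (the CSV layout and dictionary shapes are assumptions; only the Haversine formula and the threshold logic are fixed by the plan):

```python
# Sketch: per-image Haversine error against ground truth, then the
# AC-1 (50 m) and AC-2 (20 m) fraction checks.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def check_accuracy(truth, estimates):
    """truth/estimates: {image_name: (lat, lon)}. Returns (frac<50m, frac<20m)."""
    errors = [haversine_m(*truth[k], *estimates[k]) for k in truth if k in estimates]
    n = len(truth)  # un-localized images count as failures for both thresholds
    return (sum(e < 50 for e in errors) / n, sum(e < 20 for e in errors) / n)

# Toy example: one estimate ~15 m off, one ~40 m off, one missing.
truth = {"AD000001.jpg": (48.275292, 37.385220),
         "AD000002.jpg": (48.276000, 37.386000),
         "AD000003.jpg": (48.277000, 37.387000)}
est = {"AD000001.jpg": (48.275292, 37.385420),   # ~15 m east
       "AD000002.jpg": (48.276360, 37.386000)}   # ~40 m north
frac50, frac20 = check_accuracy(truth, est)
print(frac50, frac20)  # ≈ 0.67 0.33
```

The real harness would load both dictionaries from coordinates.csv and the system's output log, then assert `frac50 >= 0.80` and `frac20 >= 0.60` per AC-1 and AC-2.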
Identify all potential weak points and problems. Address them and find out ways to solve them. Based on your findings, form a new solution draft in the same format.
If your finding requires a complete reorganization of the flow and different components, state it.
Put all findings about what was weak or poor at the beginning of the report. List there every new finding and everything that was updated, replaced, or removed from the previous solution.
Then form a new solution design without referencing the previous system. Remove Poor and Very Poor component choices from the component analysis tables, but leave Good and Excellent ones.
In the updated report, do not put "new" marks, do not compare to the previous solution draft, just make a new solution as if from scratch
+19 -9
@@ -1,9 +1,19 @@
- System should find out GPS of centers of 80% of the photos from the flight within error no more than 50 meters in comparison to the real GPS
- System should find out GPS of centers of 60% of the photos from the flight within error no more than 20 meters in comparison to the real GPS
- System should correctly continue the work even in a presence of up to 350 meters outlier photo between 2 consecutive photos en route. This could happen due to tilt of the plane.
- System should correctly continue the work even during sharp turns, where the next photo doesn't overlap at all, or overlaps in less than 5%. Next photo should be in less than 150m drift and angle less than 50%
- Number of outliers during the satellite provider images ground check should be less than 10%
- In case of absolute incapable of the system to determine next, second next, and third next images gps, by any methods, (these 20% of the route), then it should ask user for an input for the next image, so that user can specify location
- Less than 2 seconds for processing one image
- Image Registration Rate > 95%. System can find enough matching features to confidently calculate the camera's 6-DoF pose (position and orientation) and "stitch" that image into the final trajectory
- Mean Reprojection Error (MRE) < 1.0 pixels. The distance, in pixels, between the original pixel location of the object and the re-projected pixel location.
- The system should find out the GPS of centers of 80% of the photos from the flight within an error of no more than 50 meters in comparison to the real GPS
- The system should find out the GPS of centers of 60% of the photos from the flight within an error of no more than 20 meters in comparison to the real GPS
- The system should correctly continue the work even in the presence of up to 350 meters of an outlier photo between 2 consecutive pictures en route. This could happen due to tilt of the plane.
- System should correctly continue the work even during sharp turns, where the next photo doesn't overlap at all, or overlaps in less than 5%. The next photo should be in less than 200m drift and at an angle of less than 70%
- System should try to operate when UAV made a sharp turn, and all the next photos has no common points with previous route. In that situation system should try to figure out location of the new piece of the route and connect it to the previous route. Also this separate chunks could be more than 2, so this strategy should be in the core of the system
- In case of being absolutely incapable of determining the system to determine next, second next, and third next images GPS, by any means (these 20% of the route), then it should ask the user for input for the next image, so that the user can specify the location
- Less than 5 seconds for processing one image
- Results of image processing should appear immediately to user, so that user shouldn't wait for the whole route to complete in order to analyze first results. Also, system could refine existing calculated results and send refined results again to user
- Image Registration Rate > 95%. The system can find enough matching features to confidently calculate the camera's 6-DoF pose (position and orientation) and "stitch" that image into the final trajectory
- Mean Reprojection Error (MRE) < 1.0 pixels. The distance, in pixels, between the original pixel location of the object and the re-projected pixel location.
+3 -3
@@ -1,3 +1,3 @@
We have a lot of images taken from wing-type UAV by camera with at least FullHD resolution. Resolution of each photo could be up to 6200*4100 for the whole flight, but for other flight could be FullHD, f.e.
Photos are taken and named consecutively within 100 meters distance between each other.
We know only starting GPS coordinates. We need to determine coordinates of centers of each image. And also coordinates of the center of any object on these photos. We can use external satellite provider for ground check the existing photos
We have a lot of images taken from a wing-type UAV using a camera with at least Full HD resolution. Resolution of each photo could be up to 6200*4100 for the whole flight, but for other flights, it could be FullHD
Photos are taken and named consecutively within 100 meters of each other.
We know only the starting GPS coordinates. We need to determine the GPS of the centers of each image. And also the coordinates of the center of any object in these photos. We can use an external satellite provider for ground checks on the existing photos
+7 -6
@@ -1,10 +1,11 @@
- Photos are taken by only airplane type UAVs.
- Photos are taken by the camera pointing downwards and fixed, but it is not autostabilized.
- The flying range is restricted by eastern and southern part of Ukraine (To the left of Dnipro river)
- The image resolution could be from FullHd to 6252*4168
- Altitude is prefefined and no more than 1km
- The flying range is restricted by the eastern and southern parts of Ukraine (To the left of the Dnipro River)
- The image resolution could be from FullHD to 6252*4168. Camera parameters are known: focal length, sensor width, resolution and so on.
- Altitude is predefined and no more than 1km. The height of the terrain can be neglected.
- There is NO data from IMU
- Flights are done mostly in sunny weather
- We can use satellite providers, but we're limited right now to Google Maps, which could be possibly outdated for some regions
- Number of photos could be up to 3000, usually in 500-1500 range
- During the flight UAV can make sharp turns, so that it is possible that next photo is absolutely different from the previous one (no same objects), but it is rather exception than the rule
- We can use satellite providers, but we're limited right now to Google Maps, which could be outdated for some regions
- Number of photos could be up to 3000, usually in the 500-1500 range
- During the flight, UAVs can make sharp turns, so that the next photo may be absolutely different from the previous one (no same objects), but it is rather an exception than the rule
- Processing is done on a stationary computer or laptop with NVidia GPU at least RTX2060, better 3070. (For the UAV solution Jetson Orin Nano would be used, but that is out of scope.)