Read carefully about the problem:
We have a large number of images taken from a wing-type UAV using a camera with at least Full HD resolution. The resolution of each photo could be up to 6200*4100 for one whole flight, while for other flights it could be Full HD. Photos are taken and named consecutively, within 100 meters of each other. We know only the starting GPS coordinates. We need to determine the GPS coordinates of the center of each image, and also the coordinates of the center of any object in these photos. We can use an external satellite provider for ground checks on the existing photos.
The system has the following restrictions and conditions:
- Photos are taken only by airplane-type UAVs.
- Photos are taken by the camera pointing downwards and fixed, but it is not autostabilized.
- The flying range is restricted by the eastern and southern parts of Ukraine (To the left of the Dnipro River)
- The image resolution could be from FullHD to 6252*4168. Camera parameters are known: focal length, sensor width, resolution and so on.
- Altitude is predefined and no more than 1km. The height of the terrain can be neglected.
- There is NO data from IMU
- Flights are done mostly in sunny weather
- We can use satellite providers, but we're limited right now to Google Maps, which could be outdated for some regions
- Number of photos could be up to 3000, usually in the 500-1500 range
- During the flight, UAVs can make sharp turns, so that the next photo may be absolutely different from the previous one (no same objects), but it is rather an exception than the rule
- Processing is done on a stationary computer or laptop with NVidia GPU at least RTX2060, better 3070. (For the UAV solution Jetson Orin Nano would be used, but that is out of scope.)
The output of the system should satisfy the following acceptance criteria:
- The system should find out the GPS of centers of 80% of the photos from the flight within an error of no more than 50 meters in comparison to the real GPS
- The system should find out the GPS of centers of 60% of the photos from the flight within an error of no more than 20 meters in comparison to the real GPS
- The system should correctly continue the work even in the presence of up to 350 meters of an outlier photo between 2 consecutive pictures en route. This could happen due to tilt of the plane.
- System should correctly continue the work even during sharp turns, where the next photo doesn't overlap at all, or overlaps by less than 5%. The next photo should be within 150 m of drift and at an angle of less than 50°
- The number of outliers during the satellite provider images ground check should be less than 10%
- If the system is absolutely incapable of determining the GPS of the next, second-next, and third-next images by any means (up to 20% of the route), it should ask the user for input for the next image, so that the user can specify the location
- Less than 5 seconds for processing one image
- Results of image processing should appear to the user immediately, so that the user does not have to wait for the whole route to complete in order to analyze the first results. The system may also refine previously calculated results and send the refined results to the user again
- Image Registration Rate > 95%. The system can find enough matching features to confidently calculate the camera's 6-DoF pose (position and orientation) and "stitch" that image into the final trajectory
- Mean Reprojection Error (MRE) < 1.0 pixels. The distance, in pixels, between the original pixel location of the object and the re-projected pixel location.
Here is a solution draft:
# **GEORTOLS-SA: UAV Image Geolocalization in IMU-Denied Environments**
The GEORTOLS-SA system is an asynchronous, four-component software solution designed for deployment on an NVIDIA RTX 2060+ GPU. It is architected from the ground up to handle the specific challenges of IMU-denied, scale-aware localization and real-time streaming output.
### **Product Solution Description**
* **Inputs:**
1. A sequence of consecutively named images (FullHD to 6252x4168).
2. The absolute GPS coordinate (Latitude, Longitude) for the first image (Image 0).
3. A pre-calibrated camera intrinsic matrix ($K$).
4. The predefined, absolute metric altitude of the UAV ($H$, e.g., 900 meters).
5. API access to the Google Maps satellite provider.
* **Outputs (Streaming):**
1. **Initial Pose (T \< 5s):** A high-confidence, *metric-scale* estimate (Pose_N_Est) of the image's 6-DoF pose and GPS coordinate. This is sent to the user immediately upon calculation (AC-7, AC-8).
2. **Refined Pose (T > 5s):** A globally-optimized pose (Pose_N_Refined) sent asynchronously as the back-end optimizer fuses data from the CVGL module (AC-8).
### **Component Interaction Diagram and Data Flow**
The system is architected as four parallel-processing components to meet the stringent real-time and refinement requirements.
1. **Image Ingestion & Pre-processing:** This module receives the new, high-resolution Image_N. It immediately creates two copies:
* Image_N_LR (Low-Resolution, e.g., 1536x1024): This copy is immediately dispatched to the SA-VO Front-End for real-time processing.
* Image_N_HR (High-Resolution, 6.2K): This copy is stored and made available to the CVGL Module for its asynchronous, high-accuracy matching pipeline.
2. **Scale-Aware VO (SA-VO) Front-End (High-Frequency Thread):** This component's sole task is high-speed, *metric-scale* relative pose estimation. It matches Image_N_LR to Image_N-1_LR, computes the 6-DoF relative transform, and critically, uses the "known altitude" ($H$) constraint to recover the absolute scale (detailed in Section 3.0). It sends this high-confidence Relative_Metric_Pose to the Back-End.
3. **Cross-View Geolocalization (CVGL) Module (Low-Frequency, Asynchronous Thread):** This is a heavier, slower module. It takes Image_N (both LR and HR) and queries the Google Maps database to find an *absolute GPS pose*. When a high-confidence match is found, its Absolute_GPS_Pose is sent to the Back-End as a global "anchor" constraint.
4. **Trajectory Optimization Back-End (Central Hub):** This component manages the complete flight trajectory as a pose graph.10 It continuously fuses two distinct, high-quality data streams:
* **On receiving Relative_Metric_Pose (T \< 5s):** It appends this pose to the graph, calculates the Pose_N_Est, and **sends this initial result to the user (AC-7, AC-8 met)**.
* **On receiving Absolute_GPS_Pose (T > 5s):** It adds this as a high-confidence "global anchor" constraint 12, triggers a full graph re-optimization to correct any minor biases, and **sends the Pose_N_Refined to the user (AC-8 refinement met)**.
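The data flow above can be sketched as a producer-consumer pipeline. The skeleton below is purely illustrative (queue and component names are hypothetical, and all pose math is elided); it only demonstrates how initial and refined results stream to the user independently:

```python
import queue
import threading

# Minimal sketch of the four-component data flow (names are illustrative,
# not a prescribed API). The back-end consumes both the high-frequency VO
# stream and the asynchronous CVGL anchor stream.

vo_out = queue.Queue()      # SA-VO -> Back-End: relative metric poses
cvgl_out = queue.Queue()    # CVGL  -> Back-End: absolute GPS anchors
user_out = []               # Back-End -> user: streamed results

def sa_vo_front_end(frames):
    for n in frames:
        # ... SuperPoint + LightGlue matching would go here ...
        vo_out.put(("rel_pose", n))
    vo_out.put(None)  # end of stream

def back_end():
    while True:
        msg = vo_out.get()
        if msg is None:
            break
        _, n = msg
        user_out.append(("initial", n))       # Pose_N_Est, sent < 5 s
        # Anchors, when available, trigger re-optimization + refinement.
        try:
            _, m = cvgl_out.get_nowait()
            user_out.append(("refined", m))   # Pose_N_Refined
        except queue.Empty:
            pass

cvgl_out.put(("anchor", 0))  # pretend CVGL already anchored frame 0
t = threading.Thread(target=sa_vo_front_end, args=(range(3),))
t.start()
back_end()
t.join()
print(user_out)
```

The key property is that the user-facing stream never blocks on the slow CVGL path: initial poses arrive per frame, refinements arrive whenever an anchor lands.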
### **VO "Trust Model" of GEORTOLS-SA**
In GEORTOLS-SA, the trust model is as follows:
* The **SA-VO Front-End** is now *highly trusted* for its local, frame-to-frame *metric* accuracy.
* The **CVGL Module** remains *highly trusted* for its *global* (GPS) accuracy.
Both components are operating in the same scale-aware, metric space. The Back-End's job is no longer to fix a broken, drifting VO. Instead, it performs a robust fusion of two independent, high-quality metric measurements.12
This model is self-correcting. If the user's predefined altitude $H$ is slightly incorrect (e.g., entered as 900m but is truly 880m), the SA-VO front-end will be *consistently* off by a small percentage. The periodic, high-confidence CVGL "anchors" will create a consistent, low-level "tension" in the pose graph. The graph optimizer (e.g., Ceres Solver) 3 will resolve this tension by slightly "pulling" the SA-VO poses to fit the global anchors, effectively *learning* and correcting for the altitude bias. This robust fusion is the key to meeting the 20-meter and 50-meter accuracy targets (AC-1, AC-2).
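A toy numerical illustration of this self-correction, under strong simplifying assumptions (a 1D trajectory and a single unknown scale factor; the real system solves a full pose graph in Ceres):

```python
import numpy as np

# Toy 1D illustration of the "self-correcting" fusion: VO step lengths
# carry a consistent scale bias (altitude prior of 900 m vs. a true
# 880 m), and sparse absolute anchors let a least-squares fit recover
# the true scale.

true_step = 88.0                     # true metric step per frame
vo_step = 90.0                       # VO-reported step (biased prior)
n = 100
vo_positions = np.cumsum(np.full(n, vo_step))     # biased trajectory
true_positions = np.cumsum(np.full(n, true_step))

# Anchors: absolute positions observed at every 20th frame (CVGL).
anchor_idx = np.arange(19, n, 20)
anchors = true_positions[anchor_idx]

# Fit a single correction factor c minimizing ||c * vo - anchors||^2.
A = vo_positions[anchor_idx]
c = (A @ anchors) / (A @ A)
corrected = c * vo_positions

print(round(c, 4))                   # recovered bias, ~880/900
print(float(np.max(np.abs(corrected - true_positions))))
```

In the real system the "correction factor" is distributed across many graph edges rather than being a single scalar, but the principle is the same: the anchors' tension reveals the bias.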
## **3.0 Core Component: The Scale-Aware Visual Odometry (SA-VO) Front-End**
This component is the new, critical engine of the system. Its sole task is to compute the *metric-scale* 6-DoF relative motion between consecutive frames, thereby eliminating scale drift at its source.
### **3.1 Rationale and Mechanism for Per-Frame Scale Recovery**
The SA-VO front-end implements a geometric algorithm to recover the absolute scale $s$ for *every* frame-to-frame transition. This algorithm directly leverages the query's "known altitude" ($H$) and "planar ground" constraints.5
The SA-VO algorithm for processing Image_N (relative to Image_N-1) is as follows:
1. **Feature Matching:** Extract and match robust features between Image_N and Image_N-1 using the selected feature matcher (see Section 3.2). This yields a set of corresponding 2D pixel coordinates.
2. **Essential Matrix:** Use RANSAC (Random Sample Consensus) and the camera intrinsic matrix $K$ to compute the Essential Matrix $E$ from the "inlier" correspondences.2
3. **Pose Decomposition:** Decompose $E$ to find the relative Rotation $R$ and the *unscaled* translation vector $t$, where the magnitude $||t||$ is fixed to 1.2
4. **Triangulation:** Triangulate the 3D-world points $X$ for all inlier features using the unscaled pose $[R \mid t]$.15 These 3D points ($X_i$) are now in a local, *unscaled* coordinate system (i.e., we know the *shape* of the point cloud, but not its *size*).
5. **Ground Plane Fitting:** The query states "terrain height can be neglected," meaning we assume a planar ground. A *second* RANSAC pass is performed, this time fitting a 3D plane to the set of triangulated 3D points $X$. The inliers to this RANSAC are identified as the ground points $X_g$.5 This method is highly robust as it does not rely on a single point, but on the consensus of all visible ground features.16
6. **Unscaled Height ($h$):** From the fitted plane equation $n^T X + d = 0$ (with the normal $n$ normalized to unit length), the parameter $|d|$ is the perpendicular distance from the camera (at the coordinate system's origin) to the computed ground plane. This is our *unscaled* height $h$.
7. **Scale Computation:** We now have two values: the *real, metric* altitude $H$ (e.g., 900 m) provided by the user, and the *computed, unscaled* height $h$ from the plane fit. The absolute scale $s$ for this frame is the ratio of the two: $s = H / h$.
8. **Metric Pose:** The final, metric-scale relative pose is $[R \mid T]$, where the metric translation $T = s \cdot t$. This high-confidence, scale-aware pose is sent to the Back-End.
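Steps 5-8 can be sketched in a few lines of NumPy. This is a simplified illustration on synthetic data: the plane is fit with a plain SVD least-squares step rather than the RANSAC pass described above, and all values (altitude, point cloud) are made up:

```python
import numpy as np

# Sketch of SA-VO steps 5-8: fit a ground plane to the (unscaled)
# triangulated points, read off the unscaled camera height h, and
# recover the metric scale s = H / h.

rng = np.random.default_rng(0)
H_metric = 900.0                       # predefined flight altitude (m)

# Unscaled triangulated ground points: a near-planar cloud at unscaled
# depth 2.0 below the camera (camera at the origin).
X = rng.uniform(-1, 1, size=(200, 3))
X[:, 2] = 2.0 + rng.normal(0, 1e-3, 200)

# Least-squares plane fit n^T X + d = 0 with ||n|| = 1: the SVD of the
# centered points gives the normal as the smallest singular vector.
centroid = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - centroid)
n = Vt[-1]
d = -n @ centroid
h = abs(d)                             # unscaled camera-to-plane distance

s = H_metric / h                       # absolute scale for this frame
t_unscaled = np.array([0.01, 0.0, 0.0])  # unit-norm-style t from step 3
T_metric = s * t_unscaled              # metric translation
print(round(h, 3), round(s, 1))
```

A production implementation would replace the SVD fit with a RANSAC plane fit (to reject non-ground points such as trees or buildings) and validate that the recovered scale is stable across frames.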
### **3.2 Feature Matching Sub-System Analysis**
The success of the SA-VO algorithm depends *entirely* on the quality of the initial feature matches, especially in the low-texture agricultural terrain specified in the query. The system requires a matcher that is both robust (for sparse textures) and extremely fast (for AC-7).
The initial draft's choice of SuperGlue 17 is a strong, proven baseline. However, its successor, LightGlue 18, offers a critical, non-obvious advantage: **adaptivity**.
The UAV flight is specified as *mostly* straight, with high overlap. Sharp turns (AC-4) are "rather an exception." This means \~95% of our image pairs are "easy" to match, while 5% are "hard."
* SuperGlue uses a fixed-depth Graph Neural Network (GNN), spending the *same* (large) amount of compute on an "easy" pair as a "hard" pair.19 This is inefficient.
* LightGlue is *adaptive*.19 For an easy, high-overlap pair, it can exit early (e.g., at layer 3/9), returning a high-confidence match in a fraction of the time. For a "hard" low-overlap pair, it will use its full depth to get the best possible result.19
By using LightGlue, the system saves *enormous* amounts of computational budget on the 95% of "easy" frames, ensuring it *always* meets the \<5s budget (AC-7) and reserving that compute for the harder CVGL tasks. LightGlue is a "plug-and-play replacement" 19 that is faster, more accurate, and easier to train.19
### **Table 1: Analysis of State-of-the-Art Feature Matchers (For SA-VO Front-End)**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **SuperPoint + SuperGlue** 17 | - SOTA robustness in low-texture, high-blur conditions. - GNN reasons about 3D scene context. - Proven in real-time SLAM systems.22 | - Computationally heavy (fixed-depth GNN). - Slower than LightGlue.19 - Training is complex.19 | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT.25 | **Good.** A solid, baseline choice. Meets robustness needs but will heavily tax the \<5s time budget (AC-7). |
| **SuperPoint + LightGlue** 18 | - **Adaptive Depth:** Faster on "easy" pairs, more accurate on "hard" pairs.19 - **Faster & Lighter:** Outperforms SuperGlue on speed and accuracy.19 - **Easier to Train:** Simpler architecture and loss.19 - Direct plug-and-play replacement for SuperGlue. | - Newer, less long-term-SLAM-proven than SuperGlue (though rapidly being adopted). | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT.28 | **Excellent (Selected).** The adaptive nature is *perfect* for this problem. It saves compute on the 95% of easy (straight) frames, preserving the budget for the 5% of hard (turn) frames, maximizing our ability to meet AC-7. |
### **3.3 Selected Approach (SA-VO): SuperPoint + LightGlue**
The SA-VO front-end will be built using:
* **Detector:** **SuperPoint** 24 to detect sparse, robust features on the Image_N_LR.
* **Matcher:** **LightGlue** 18 to match features from Image_N_LR to Image_N-1_LR.
This combination provides the SOTA robustness required for low-texture fields, while LightGlue's adaptive performance 19 is the key to meeting the \<5s (AC-7) real-time requirement.
## **4.0 Global Anchoring: The Cross-View Geolocalization (CVGL) Module**
With the SA-VO front-end handling metric scale, the CVGL module's task is refined. Its purpose is no longer to *correct scale*, but to provide *absolute global "anchor" poses*. This corrects for any accumulated bias (e.g., if the $h$ prior is off by 5m) and, critically, *relocalizes* the system after a persistent tracking loss (AC-4).
### **4.1 Hierarchical Retrieval-and-Match Pipeline**
This module runs asynchronously and is computationally heavy. A brute-force search against the entire Google Maps database is impossible. A two-stage hierarchical pipeline is required:
1. **Stage 1: Coarse Retrieval.** This is treated as an image retrieval problem.29
* A **Siamese CNN** 30 (or similar Dual-CNN architecture) is used to generate a compact "embedding vector" (a digital signature) for the Image_N_LR.
* An embedding database will be pre-computed for *all* Google Maps satellite tiles in the specified Eastern Ukraine operational area.
* The UAV image's embedding is then used to perform a very fast (e.g., FAISS library) similarity search against the satellite database, returning the *Top-K* (e.g., K=5) most likely-matching satellite tiles.
2. **Stage 2: Fine-Grained Pose.**
* *Only* for these Top-5 candidates, the system performs the heavy-duty **SuperPoint + LightGlue** matching.
* This match is *not* Image_N -> Image_N-1. It is Image_N -> Satellite_Tile_K.
* The match with the highest inlier count and lowest reprojection error (MRE \< 1.0, AC-10) is used to compute the precise 6-DoF pose of the UAV relative to that georeferenced satellite tile. This yields the final Absolute_GPS_Pose.
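A minimal sketch of the Stage-1 retrieval, assuming embeddings are already computed. Brute-force NumPy cosine similarity stands in for the FAISS index used in production, and the dimensions and tile counts are illustrative:

```python
import numpy as np

# Stage-1 coarse retrieval sketch: cosine-similarity search of the UAV
# image embedding against a precomputed satellite-tile embedding bank.
# A production system would use FAISS; brute-force NumPy is shown here
# purely for illustration.

rng = np.random.default_rng(1)
D, n_tiles, K = 128, 10_000, 5

tile_emb = rng.normal(size=(n_tiles, D))
tile_emb /= np.linalg.norm(tile_emb, axis=1, keepdims=True)

# Make tile 4242 the "true" location: the query is a noisy copy of it.
query = tile_emb[4242] + 0.1 * rng.normal(size=D)
query /= np.linalg.norm(query)

sims = tile_emb @ query                      # cosine similarities
top_k = np.argsort(-sims)[:K]                # Top-K candidate tiles
print(top_k[0])
```

Only these Top-K tiles then proceed to the heavy Stage-2 SuperPoint + LightGlue matching, which keeps the per-frame cost bounded.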
### **4.2 Critical Insight: Solving the Oblique-to-Nadir "Domain Gap"**
A critical, unaddressed failure mode exists. The query states the camera is **"not autostabilized"** [User Query]. On a fixed-wing UAV, this guarantees that during a bank or sharp turn (AC-4), the camera will *not* be nadir (top-down). It will be *oblique*, capturing the ground from an angle. The Google Maps reference, however, is *perfectly nadir*.32
This creates a severe "domain gap".33 A CVGL system trained *only* to match nadir-to-nadir images will *fail* when presented with an oblique UAV image.34 This means the CVGL module will fail *precisely* when it is needed most: during the sharp turns (AC-4) when SA-VO tracking is also lost.
The solution is to *close this domain gap* during training. Since the real-world UAV images will be oblique, the network must be taught to match oblique views to nadir ones.
Solution: Synthetic Data Generation for Robust Training
The Stage 1 Siamese CNN 30 must be trained on a custom, synthetically-generated dataset.37 The process is as follows:
1. Acquire nadir satellite imagery and a corresponding Digital Elevation Model (DEM) for the operational area.
2. Use this data to *synthetically render* the nadir satellite imagery from a wide variety of *oblique* viewpoints, simulating the UAV's roll and pitch.38
3. Create thousands of training pairs, each consisting of (Nadir_Satellite_Tile, Synthetically_Oblique_Tile_Angle_30_Deg).
4. Train the Siamese network 29 to learn that these two images—despite their *vastly* different appearances—are a *match*.
This process teaches the retrieval network to be *viewpoint-invariant*.35 It learns to ignore perspective distortion and match the true underlying ground features (road intersections, field boundaries). This is the *only* way to ensure the CVGL module can robustly relocalize the UAV during a sharp turn (AC-4).
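As a concrete example of step 2, a pure camera rotation about its center induces the homography $H = K R K^{-1}$, which is already enough to warp a nadir tile into a simulated oblique view when the ground is flat (the full pipeline would add DEM-based parallax). The intrinsics and pitch angle below are illustrative:

```python
import numpy as np

# Sketch of synthetic oblique-view generation: rotating the camera about
# its center induces the homography H = K @ R @ inv(K), which can warp a
# nadir satellite tile into a simulated oblique UAV view (e.g., via
# cv2.warpPerspective). K and the 30-degree pitch are illustrative.

K = np.array([[1200.0, 0, 768],
              [0, 1200.0, 512],
              [0, 0, 1.0]])

pitch = np.deg2rad(30.0)               # simulated UAV pitch
R = np.array([[1, 0, 0],
              [0, np.cos(pitch), -np.sin(pitch)],
              [0, np.sin(pitch), np.cos(pitch)]])

H = K @ R @ np.linalg.inv(K)
H /= H[2, 2]                           # homographies are scale-free

# Warp the principal point: a pitch about the x-axis shifts it
# vertically while leaving its horizontal coordinate unchanged.
p = H @ np.array([768.0, 512.0, 1.0])
p = p[:2] / p[2]
print(np.round(p, 1))
```

Sampling pitch and roll over the expected flight envelope (e.g., up to the bank angles seen in sharp turns) yields the (Nadir_Tile, Oblique_Tile) training pairs described above.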
## **5.0 Trajectory Fusion: The Robust Optimization Back-End**
This component is the system's central "brain." It runs continuously, fusing all incoming measurements (high-frequency/metric-scale SA-VO poses, low-frequency/globally-absolute CVGL poses) into a single, globally consistent trajectory. This component's design is dictated by the requirements for streaming (AC-8), refinement (AC-8), and outlier-rejection (AC-3).
### **5.1 Selected Strategy: Incremental Pose-Graph Optimization**
The user's requirements for "results...appear immediately" and "system could refine existing calculated results" [User Query] are a textbook description of a real-time SLAM back-end.11 A batch Structure from Motion (SfM) process, which requires all images upfront and can take hours, is unsuitable for the primary system.
### **Table 2: Analysis of Trajectory Optimization Strategies**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **Incremental SLAM (Pose-Graph Optimization)** (g2o 13, Ceres Solver 10, GTSAM) | - **Real-time / Online:** Provides immediate pose estimates (AC-7). - **Supports Refinement:** Explicitly designed to refine past poses when new "loop closure" (CVGL) data arrives (AC-8).11 - **Robust:** Can handle outliers via robust kernels.39 | - Initial estimate is less accurate than a full batch process. - Can drift *if* not anchored (though our SA-VO minimizes this). | - A graph optimization library (g2o, Ceres). - A robust cost function.41 | **Excellent (Selected).** This is the *only* architecture that satisfies all user requirements for real-time streaming and asynchronous refinement. |
| **Batch Structure from Motion (Global Bundle Adjustment)** (COLMAP, Agisoft Metashape) | - **Globally Optimal Accuracy:** Produces the most accurate possible 3D reconstruction and trajectory. | - **Offline:** Cannot run in real-time or stream results. - High computational cost (minutes to hours). - Fails AC-7 and AC-8 completely. | - All images must be available before processing starts. - High RAM and CPU. | **Good (as an *Optional* Post-Processing Step).** Unsuitable as the primary online system, but could be offered as an optional, high-accuracy "Finalize Trajectory" batch process after the flight. |
The system's back-end will be built as an **Incremental Pose-Graph Optimizer** using **Ceres Solver**.10 Ceres is selected due to its large user community, robust documentation, excellent support for robust loss functions 10, and proven scalability for large-scale nonlinear least-squares problems.42
### **5.2 Mechanism for Automatic Outlier Rejection (AC-3, AC-5)**
The system must "correctly continue the work even in the presence of up to 350 meters of an outlier" (AC-3). A standard least-squares optimizer would be catastrophically corrupted by this event, as it would try to *average* this 350m error, pulling the *entire* 300km trajectory out of alignment.
A modern optimizer does not need to use brittle, hand-coded if-then logic to reject outliers. It can *mathematically* and *automatically* down-weight them using **Robust Loss Functions (Kernels)**.41
The mechanism is as follows:
1. The Ceres Back-End 10 maintains a graph of nodes (poses) and edges (constraints, or measurements).
2. A 350m outlier (AC-3) will create an edge with a *massive* error (residual).
3. A standard (quadratic) loss function $cost(error) = error^2$ would create a *catastrophic* cost, forcing the optimizer to ruin the entire graph to accommodate it.
4. Instead, the system will wrap its cost functions in a **Robust Loss Function**, such as **CauchyLoss** or **HuberLoss**.10
5. A robust loss function behaves quadratically for small errors (which it tries hard to fix) but becomes *sub-linear* for large errors. When it "sees" the 350m error, it mathematically *down-weights its influence*.43
6. The optimizer effectively *acknowledges* the 350m error but *refuses* to pull the entire graph to fix this one "insane" measurement. It automatically, and gracefully, treats the outlier as a "lost cause" and optimizes the 99.9% of "sane" measurements. This is the modern, robust solution to AC-3 and AC-5.
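The down-weighting effect can be shown numerically. The sketch below compares a quadratic cost with a Cauchy cost (the scale parameter c = 5 m is an illustrative choice, not a recommendation):

```python
import numpy as np

# Illustration of robust-kernel down-weighting: the quadratic cost of a
# 350 m outlier dwarfs everything else, while the Cauchy cost
# rho(r) = c^2 * log(1 + (r/c)^2) grows only logarithmically, so the
# optimizer is not forced to distort the graph to absorb it.

def quadratic_cost(r):
    return r ** 2

def cauchy_cost(r, c=5.0):
    return c ** 2 * np.log1p((r / c) ** 2)

small, outlier = 0.5, 350.0            # residuals in meters

print(quadratic_cost(outlier) / quadratic_cost(small))  # ratio explodes
print(cauchy_cost(outlier) / cauchy_cost(small))        # ratio stays tame
```

Note that for small residuals the Cauchy cost is nearly identical to the quadratic one, so inlier measurements are optimized exactly as before; only the extreme residuals are suppressed.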
## **6.0 High-Resolution (6.2K) and Performance Optimization**
The system must simultaneously handle massive 6252x4168 (26-Megapixel) images and run on a modest RTX 2060 GPU [User Query] with a \<5s time limit (AC-7). These are opposing constraints.
### **6.1 The Multi-Scale Patch-Based Processing Pipeline**
Running *any* deep learning model (SuperPoint, LightGlue) on a full 6.2K image will be impossibly slow and will *immediately* cause a CUDA Out-of-Memory (OOM) error on a 6GB RTX 2060.45
The solution is not to process the full 6.2K image in real-time. Instead, a **multi-scale, patch-based pipeline** is required, where different components use the resolution best suited to their task.46
1. **For SA-VO (Real-time, \<5s):** The SA-VO front-end is concerned with *motion*, not fine-grained detail. The 6.2K Image_N_HR is *immediately* downscaled to a manageable 1536x1024 (Image_N_LR). The entire SA-VO (SuperPoint + LightGlue) pipeline runs *only* on this low-resolution, fast-to-process image. This is how the \<5s (AC-7) budget is met.
2. **For CVGL (High-Accuracy, Async):** The CVGL module, which runs asynchronously, is where the 6.2K detail is *selectively* used to meet the 20m (AC-2) accuracy target. It uses a "coarse-to-fine" 48 approach:
* **Step A (Coarse):** The Siamese CNN 30 runs on the *downscaled* 1536px Image_N_LR to get a coarse [Lat, Lon] guess.
* **Step B (Fine):** The system uses this coarse guess to fetch the corresponding *high-resolution* satellite tile.
* **Step C (Patching):** The system runs the SuperPoint detector on the *full 6.2K* Image_N_HR to find the Top 100 *most confident* feature keypoints. It then extracts 100 small (e.g., 256x256) *patches* from the full-resolution image, centered on these keypoints.49
* **Step D (Matching):** The system then matches *these small, full-resolution patches* against the high-res satellite tile.
This hybrid method provides the best of both worlds: the fine-grained matching accuracy 50 of the 6.2K image, but without the catastrophic OOM errors or performance penalties.45
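Step C of the patching scheme can be sketched with synthetic stand-ins for the SuperPoint keypoints and confidence scores (patch size and Top-100 count as described above):

```python
import numpy as np

# Sketch of Step C: extract fixed-size patches from the full-resolution
# image around the most confident keypoints. Keypoints and scores are
# synthetic stand-ins for SuperPoint output.

def extract_patches(image, keypoints, scores, top_n=100, size=256):
    half = size // 2
    h, w = image.shape[:2]
    order = np.argsort(-scores)                  # most confident first
    patches, centers = [], []
    for idx in order:
        x, y = keypoints[idx]
        # Skip keypoints whose patch would fall off the image border.
        if half <= x < w - half and half <= y < h - half:
            patches.append(image[y - half:y + half, x - half:x + half])
            centers.append((x, y))
        if len(patches) == top_n:
            break
    return np.stack(patches), centers

rng = np.random.default_rng(2)
img = rng.integers(0, 255, size=(4168, 6252), dtype=np.uint8)  # 6.2K frame
kps = rng.integers(0, [6252, 4168], size=(500, 2))             # (x, y)
scores = rng.random(500)

patches, centers = extract_patches(img, kps, scores, top_n=100)
print(patches.shape)
```

Because only 100 small patches (rather than the 26-megapixel frame) ever enter the GPU pipeline, peak memory stays far below the 6 GB limit of the RTX 2060.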
### **6.2 Real-Time Deployment with TensorRT**
PyTorch is a research and training framework. Its default inference speed, even on an RTX 2060, is often insufficient to meet a \<5s production requirement.23
For the final production system, the key neural networks (SuperPoint, LightGlue, Siamese CNN) *must* be converted from their PyTorch-native format into a highly-optimized **NVIDIA TensorRT engine**.
* **Benefits:** TensorRT is an inference optimizer that applies graph optimizations, layer fusion, and precision reduction (e.g., to FP16).52 This can achieve a 2x-4x (or more) speedup over native PyTorch.28
* **Deployment:** The resulting TensorRT engine can be deployed via a C++ API 25, which is far more suitable for a robust, high-performance production system.
This conversion is a *mandatory* deployment step. It is what makes a 2-second inference (well within the 5-second AC-7 budget) *achievable* on the specified RTX 2060 hardware.
## **7.0 System Robustness: Failure Mode and Logic Escalation**
The system's logic is designed as a multi-stage escalation process to handle the specific failure modes in the acceptance criteria (AC-3, AC-4, AC-6), ensuring the >95% registration rate (AC-9).
### **Stage 1: Normal Operation (Tracking)**
* **Condition:** SA-VO(N-1 -> N) succeeds. The LightGlue match is high-confidence, and the computed scale $s$ is reasonable.
* **Logic:**
1. The Relative_Metric_Pose is sent to the Back-End.
2. The Pose_N_Est is calculated and sent to the user (\<5s).
3. The CVGL module is queued to run asynchronously to provide a Pose_N_Refined at a later time.
### **Stage 2: Transient SA-VO Failure (AC-3 Outlier Handling)**
* **Condition:** SA-VO(N-1 -> N) fails. This could be a 350m outlier (AC-3), a severely blurred image, or an image with no features (e.g., over a cloud). The LightGlue match fails, or the computed scale $s$ is nonsensical.
* **Logic (Frame Skipping):**
1. The system *buffers* Image_N and marks it as "tentatively lost."
2. When Image_N+1 arrives, the SA-VO front-end attempts to "bridge the gap" by matching SA-VO(N-1 -> N+1).
3. **If successful:** A Relative_Metric_Pose for N+1 is found. Image_N is officially marked as a rejected outlier (AC-5). The system "correctly continues the work" (AC-3 met).
4. **If fails:** The system repeats for SA-VO(N-1 -> N+2).
5. If this "bridging" fails for 3 consecutive frames, the system concludes it is not a transient outlier but a persistent tracking loss, and escalates to Stage 3.
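The frame-skipping logic above reduces to a small loop. In this sketch, `try_match` is a hypothetical callable standing in for the full SA-VO pipeline:

```python
# Sketch of the Stage-2 "frame skipping" logic. `try_match(a, b)` stands
# in for the full SA-VO pipeline and returns a relative pose or None.

MAX_BRIDGE = 3  # after 3 failed bridges, escalate to Stage 3

def handle_frame(last_good, candidates, try_match):
    """Try to bridge from last_good to each candidate in turn.

    Returns (anchored_frame, outliers) on success, or (None, outliers)
    to signal persistent tracking loss (escalate to Stage 3).
    """
    outliers = []
    for i, frame in enumerate(candidates[:MAX_BRIDGE]):
        if try_match(last_good, frame) is not None:
            # Frames skipped along the way are rejected outliers (AC-5).
            outliers.extend(candidates[:i])
            return frame, outliers
    return None, outliers

# Simulated flight: frame 30 is a 350 m outlier, frame 31 matches fine.
def try_match(a, b):
    return ("R", "t") if (a, b) != (29, 30) else None

anchor, rejected = handle_frame(29, [30, 31, 32], try_match)
print(anchor, rejected)
```

The outlier is only finalized as rejected once a bridge succeeds; until then the frame stays buffered as "tentatively lost."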
### **Stage 3: Persistent Tracking Loss (AC-4 Sharp Turn Handling)**
* **Condition:** The "frame-skipping" in Stage 2 fails. This is the "sharp turn" scenario [AC-4] where there is \<5% overlap between Image_N-1 and Image_N+k.
* **Logic (Multi-Map "Chunking"):**
1. The Back-End declares a "Tracking Lost" state at Image_N and creates a *new, independent map chunk* ("Chunk 2").
2. The SA-VO Front-End is re-initialized at Image_N and begins populating this new chunk, tracking SA-VO(N -> N+1), SA-VO(N+1 -> N+2), etc.
3. Because the front-end is **Scale-Aware**, this new "Chunk 2" is *already in metric scale*. It is a "floating island" of *known size and shape*; it just is not anchored to the global GPS map.
* **Resolution (Asynchronous Relocalization):**
1. The **CVGL Module** is now tasked, high-priority, to find a *single* Absolute_GPS_Pose for *any* frame in this new "Chunk 2".
2. Once the CVGL module (which is robust to oblique views, per Section 4.2) finds one (e.g., for Image_N+20), the Back-End has all the information it needs.
3. **Merging:** The Back-End calculates the simple 6-DoF transformation (3D translation and rotation, scale=1) to align all of "Chunk 2" and merge it with "Chunk 1". This robustly handles the "correctly continue the work" criterion (AC-4).
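The merge in step 3 is a single rigid-transform composition: one absolute anchor for any frame in Chunk 2 determines the transform that maps the whole chunk into the global frame. A NumPy sketch with synthetic 4x4 homogeneous poses:

```python
import numpy as np

# Sketch of the Stage-3 chunk merge. One absolute CVGL anchor for any
# frame inside Chunk 2 fixes the rigid transform
#   T_align = T_absolute @ inv(T_in_chunk)
# that maps the entire chunk into the global frame (scale = 1, because
# the chunk is already metric). All pose values are synthetic.

def se3(yaw_deg, tx, ty, tz):
    c, s = np.cos(np.deg2rad(yaw_deg)), np.sin(np.deg2rad(yaw_deg))
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = [tx, ty, tz]
    return T

# Chunk 2, expressed in its own floating (but metric) frame:
chunk_poses = [se3(0, 100 * i, 0, 900) for i in range(5)]

# CVGL anchors frame index 2 of the chunk at an absolute pose:
T_abs = se3(45, 5000, 3000, 900)
T_align = T_abs @ np.linalg.inv(chunk_poses[2])

# Applying T_align to every pose merges Chunk 2 into the global map.
merged = [T_align @ T for T in chunk_poses]
print(np.allclose(merged[2], T_abs))
```

Because the alignment is rigid (no scale fitting), a single reliable anchor suffices; additional anchors simply become redundant constraints in the pose graph.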
### **Stage 4: Catastrophic Failure (AC-6 User Intervention)**
* **Condition:** The system has entered Stage 3 and is building "Chunk 2," but the **CVGL Module** has *also* failed for a prolonged period (e.g., 20% of the route, or 50+ consecutive frames). This is the "worst-case" scenario (e.g., heavy clouds *and* over a large, featureless lake). The system is "absolutely incapable" [User Query].
* **Logic:**
1. The system has a metric-scale "Chunk 2" but zero idea where it is in the world.
2. The Back-End triggers the AC-6 flag.
* **Resolution (User Input):**
1. The UI prompts the user: "Tracking lost. Please provide a coarse location for the *current* image."
2. The UI displays the last known good image (from Chunk 1) and the new, "lost" image (e.g., Image_N+50).
3. The user clicks *one point* on the satellite map.
4. This user-provided [Lat, Lon] is *not* taken as ground truth. It is fed to the CVGL module as a *strong prior*, drastically narrowing its search area from "all of Ukraine" to "a 10km-radius circle."
5. This allows the CVGL module to re-acquire a lock, which triggers the Stage 3 merge, and the system continues.
## **8.0 Output Generation and Validation Strategy**
This section details how the final user-facing outputs are generated and how the system's compliance with all 10 acceptance criteria will be validated.
### **8.1 Generating Object-Level GPS (from Pixel Coordinate)**
This meets the requirement to find the "coordinates of the center of any object in these photos" [User Query]. The system provides this via a **Ray-Plane Intersection** method.
* **Inputs:**
1. The user clicks pixel coordinate $(u,v)$ on Image_N.
2. The system retrieves the refined, global 6-DoF pose $[R \mid T]$ for Image_N from the Back-End.
3. The system uses the known camera intrinsic matrix $K$.
4. The system uses the known *global ground-plane equation* (e.g., $Z=150m$, based on the predefined altitude and start coordinate).
* **Method:**
1. **Un-project Pixel:** The 2D pixel $(u,v)$ is un-projected into a 3D ray *direction* vector $d_{cam}$ in the camera's local coordinate system: $d_{cam} = K^{-1} \cdot [u, v, 1]^T$.
2. **Transform Ray:** This ray direction is transformed into the *global* coordinate system using the pose's rotation matrix: $d_{global} = R \cdot d_{cam}$.
3. **Define Ray:** A 3D ray is now defined, originating at the camera's global position $T$ (from the pose) and traveling in the direction $d_{global}$.
4. **Intersect:** The system solves the 3D line-plane intersection equation for this ray and the known global ground plane (e.g., find the intersection with $Z=150m$).
5. **Result:** The 3D intersection point $(X, Y, Z)$ is the *metric* world coordinate of the object on the ground.
6. **Convert:** This $(X, Y, Z)$ world coordinate is converted to a [Latitude, Longitude, Altitude] GPS coordinate. This process is immediate and can be performed for any pixel on any geolocated image.
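A NumPy sketch of the un-project/intersect steps, using an illustrative intrinsic matrix, a nadir pose 900 m above a $Z=0$ ground plane, and omitting the final conversion to latitude/longitude:

```python
import numpy as np

# Sketch of the ray-plane intersection of Section 8.1. K, the pose, and
# the ground plane z = 0 are synthetic; a real system would also convert
# the resulting world point to latitude/longitude.

K = np.array([[1200.0, 0, 768],
              [0, 1200.0, 512],
              [0, 0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])         # camera z-axis points straight down
T = np.array([0.0, 0.0, 900.0])        # camera 900 m above ground (z = 0)

def pixel_to_ground(u, v):
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # un-project pixel
    d_global = R @ d_cam                               # rotate to world
    # Ray: X = T + lam * d_global; intersect with the plane z = 0.
    lam = -T[2] / d_global[2]
    return T + lam * d_global

print(pixel_to_ground(768, 512))       # principal point: nadir point
print(pixel_to_ground(768 + 1200, 512))
```

With a focal length of 1200 px, a pixel one focal length right of the principal point maps to a ground point one flight altitude (900 m) east of nadir, which is a handy sanity check for the geometry.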
### **8.2 Rigorous Validation Methodology**
A comprehensive test plan is required to validate all 10 acceptance criteria. The foundation of this is the creation of a **Ground-Truth Test Harness**.
* **Test Harness:**
1. **Ground-Truth Data:** Several test flights will be conducted in the operational area using a UAV equipped with a high-precision RTK/PPK GPS. This provides the "real GPS" (ground truth) for every image.
2. **Test Datasets:** Multiple test datasets will be curated from this ground-truth data:
* Test_Baseline_1000: A standard 1000-image flight.
* Test_Outlier_350m (AC-3): Test_Baseline_1000 with a single image from 350m away manually inserted at frame 30.
* Test_Sharp_Turn_5pct (AC-4): A sequence where frames 20-24 are manually deleted, simulating a \<5% overlap jump.
* Test_Catastrophic_Fail_20pct (AC-6): A sequence with 200 (20%) consecutive "bad" frames (e.g., pure sky, lens cap) inserted.
* Test_Full_3000: A full 3000-image sequence to test scalability and memory usage.
* **Test Cases:**
* **Test_Accuracy (AC-1, AC-2, AC-5, AC-9):**
* Run Test_Baseline_1000. A test script will compare the system's *final refined GPS output* for each image against its *ground-truth GPS*.
* ASSERT (count(errors \< 50m) / 1000) ≥ 0.80 (AC-1)
* ASSERT (count(errors \< 20m) / 1000) ≥ 0.60 (AC-2)
* ASSERT (count(un-localized_images) / 1000) \< 0.10 (AC-5)
* ASSERT (count(localized_images) / 1000) > 0.95 (AC-9)
* **Test_MRE (AC-10):**
* ASSERT (BackEnd.final_MRE) \< 1.0 (AC-10)
* **Test_Performance (AC-7, AC-8):**
* Run Test_Full_3000 on the minimum-spec RTX 2060.
* Log timestamps for "Image In" -> "Initial Pose Out". ASSERT average_time \< 5.0s (AC-7).
* Log the output stream. ASSERT that >80% of images receive *two* poses: an "Initial" and a "Refined" (AC-8).
* **Test_Robustness (AC-3, AC-4, AC-6):**
* Run Test_Outlier_350m. ASSERT the system correctly continues and the final trajectory error for Image_31 is \< 50m (AC-3).
* Run Test_Sharp_Turn_5pct. ASSERT the system logs "Tracking Lost" and "Maps Merged," and the final trajectory is complete and accurate (AC-4).
* Run Test_Catastrophic_Fail_20pct. ASSERT the system correctly triggers the "ask for user input" event (AC-6).
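The accuracy assertions (AC-1, AC-2) reduce to a per-image haversine distance check against ground truth. A self-contained sketch with synthetic estimates and ground truth:

```python
import math

# Sketch of the AC-1/AC-2 accuracy assertions: the per-image error is
# the haversine distance between estimated and ground-truth coordinates.
# The flight data here is synthetic.

def haversine_m(lat1, lon1, lat2, lon2):
    R = 6_371_000.0  # mean Earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

# Synthetic flight: 70% of estimates are off by ~10 m, 30% by ~40 m.
truth = [(48.0 + 0.001 * i, 35.0) for i in range(1000)]
est = [(lat + (0.00009 if i % 10 < 7 else 0.00036), lon)
       for i, (lat, lon) in enumerate(truth)]

errors = [haversine_m(a, b, c, d) for (a, b), (c, d) in zip(truth, est)]
within_50 = sum(e < 50 for e in errors) / len(errors)
within_20 = sum(e < 20 for e in errors) / len(errors)
print(within_50 >= 0.80, within_20 >= 0.60)  # AC-1, AC-2
```

In the real harness, `truth` comes from the RTK/PPK logs and `est` from the system's final refined output; the same script can emit the error histogram used to diagnose borderline failures.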
Identify all potential weak points and problems. Address them and find out ways to solve them. Based on your findings, form a new solution draft in the same format.
If your finding requires a complete reorganization of the flow and different components, state it. Put all the findings regarding what was weak and poor at the beginning of the report. Put here all new findings, what was updated, replaced, or removed from the previous solution.
Then form a new solution design without referencing the previous system. Remove Poor and Very Poor component choices from the component analysis tables, but leave Good and Excellent ones. In the updated report, do not put "new" marks, do not compare to the previous solution draft, just make a new solution as if from scratch