mirror of
https://github.com/azaion/gps-denied-desktop.git
synced 2026-04-22 22:16:36 +00:00
# **GEORTOLS-SA: UAV Image Geolocalization in IMU-Denied Environments**
The GEORTOLS-SA system is an asynchronous, four-component software solution designed for deployment on an NVIDIA RTX 2060+ GPU. It is architected from the ground up to handle the specific challenges of IMU-denied, scale-aware localization and real-time streaming output.
### **Product Solution Description**

* **Inputs:**
  1. A sequence of consecutively named images (Full HD up to 6252x4168).
  2. The absolute GPS coordinate (Latitude, Longitude) for the first image (Image 0).
  3. A pre-calibrated camera intrinsic matrix ($K$).
  4. The predefined, absolute metric altitude of the UAV ($H$, e.g., 900 meters).
  5. API access to the Google Maps satellite provider.
* **Outputs (Streaming):**
  1. **Initial Pose (T < 5s):** A high-confidence, *metric-scale* estimate (Pose_N_Est) of the image's 6-DoF pose and GPS coordinate. This is sent to the user immediately upon calculation (AC-7, AC-8).
  2. **Refined Pose (T > 5s):** A globally-optimized pose (Pose_N_Refined) sent asynchronously as the back-end optimizer fuses data from the CVGL module (AC-8).
### **Component Interaction Diagram and Data Flow**

The system is architected as four parallel-processing components to meet the stringent real-time and refinement requirements.

1. **Image Ingestion & Pre-processing:** This module receives the new, high-resolution Image_N. It immediately creates two copies:
   * Image_N_LR (Low-Resolution, e.g., 1536x1024): This copy is immediately dispatched to the SA-VO Front-End for real-time processing.
   * Image_N_HR (High-Resolution, 6.2K): This copy is stored and made available to the CVGL Module for its asynchronous, high-accuracy matching pipeline.
2. **Scale-Aware VO (SA-VO) Front-End (High-Frequency Thread):** This component's sole task is high-speed, *metric-scale* relative pose estimation. It matches Image_N_LR to Image_N-1_LR, computes the 6-DoF relative transform, and, critically, uses the "known altitude" ($H$) constraint to recover the absolute scale (detailed in Section 3.0). It sends this high-confidence Relative_Metric_Pose to the Back-End.
3. **Cross-View Geolocalization (CVGL) Module (Low-Frequency, Asynchronous Thread):** This is a heavier, slower module. It takes Image_N (both LR and HR) and queries the Google Maps database to find an *absolute GPS pose*. When a high-confidence match is found, its Absolute_GPS_Pose is sent to the Back-End as a global "anchor" constraint.
4. **Trajectory Optimization Back-End (Central Hub):** This component manages the complete flight trajectory as a pose graph.10 It continuously fuses two distinct, high-quality data streams:
   * **On receiving Relative_Metric_Pose (T < 5s):** It appends this pose to the graph, calculates Pose_N_Est, and **sends this initial result to the user (AC-7, AC-8 met)**.
   * **On receiving Absolute_GPS_Pose (T > 5s):** It adds this as a high-confidence "global anchor" constraint,12 triggers a full graph re-optimization to correct any minor biases, and **sends Pose_N_Refined to the user (AC-8 refinement met)**.
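The four-component data flow above can be sketched as threads communicating over queues. This is a minimal illustration of the asynchronous hand-offs, not the production design; the message contents and the anchor cadence are invented placeholders:

```python
import queue
import threading

relative_poses = queue.Queue()   # SA-VO  -> Back-End (high frequency)
absolute_poses = queue.Queue()   # CVGL   -> Back-End (low frequency)
user_stream = []                 # Back-End -> user (initial, then refined, poses)

def sa_vo_frontend(frames):
    """High-frequency thread: one metric relative pose per frame pair."""
    for n in range(1, len(frames)):
        relative_poses.put(("rel", n))       # placeholder for Relative_Metric_Pose

def cvgl_module(frames):
    """Low-frequency thread: occasional absolute GPS anchors (every 5th frame here)."""
    for n in range(0, len(frames), 5):
        absolute_poses.put(("abs", n))       # placeholder for Absolute_GPS_Pose

def backend(n_frames):
    """Central hub: fuse both streams, emit initial poses first, refinements later."""
    done_rel = 0
    while done_rel < n_frames - 1:
        _, n = relative_poses.get()
        user_stream.append(("initial", n))   # Pose_N_Est, sent immediately (AC-7)
        done_rel += 1
    while not absolute_poses.empty():
        _, n = absolute_poses.get()
        user_stream.append(("refined", n))   # Pose_N_Refined after anchoring (AC-8)

frames = list(range(10))
t1 = threading.Thread(target=sa_vo_frontend, args=(frames,))
t2 = threading.Thread(target=cvgl_module, args=(frames,))
t1.start(); t2.start(); t1.join(); t2.join()
backend(len(frames))
```

In the real system the back-end would consume both queues concurrently; they are drained sequentially here only to keep the toy deterministic.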
### **VO "Trust Model" of GEORTOLS-SA**

In GEORTOLS-SA, the trust model is as follows:

* The **SA-VO Front-End** is now *highly trusted* for its local, frame-to-frame *metric* accuracy.
* The **CVGL Module** remains *highly trusted* for its *global* (GPS) accuracy.

Both components operate in the same scale-aware, metric space. The Back-End's job is no longer to fix a broken, drifting VO. Instead, it performs a robust fusion of two independent, high-quality metric measurements.12

This model is self-correcting. If the user's predefined altitude $H$ is slightly incorrect (e.g., entered as 900m but truly 880m), the SA-VO front-end will be *consistently* off by a small percentage. The periodic, high-confidence CVGL "anchors" will create a consistent, low-level "tension" in the pose graph. The graph optimizer (e.g., Ceres Solver) 3 will resolve this tension by slightly "pulling" the SA-VO poses to fit the global anchors, effectively *learning* and correcting for the altitude bias. This robust fusion is the key to meeting the 20-meter and 50-meter accuracy targets (AC-1, AC-2).
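A one-dimensional toy makes the self-correction concrete. All numbers are hypothetical; the graph optimizer is reduced to the closed-form scale that reconciles the biased odometry with two noiseless anchors:

```python
import numpy as np

H_entered, H_true = 900.0, 880.0                  # user altitude prior vs. reality
true_steps = np.full(20, 50.0)                    # true forward motion per frame (m)
odo_steps = true_steps * (H_entered / H_true)     # SA-VO steps, consistently biased

odo_traj = np.concatenate([[0.0], np.cumsum(odo_steps)])
true_traj = np.concatenate([[0.0], np.cumsum(true_steps)])

# CVGL anchors at the first and last frame (noiseless here for clarity).
a0, a1 = true_traj[0], true_traj[-1]

# The optimizer resolves the anchor/odometry "tension" by scaling the odometry;
# the recovered factor is exactly the inverse of the altitude bias.
c = (a1 - a0) / (odo_traj[-1] - odo_traj[0])
fused_traj = a0 + c * (odo_traj - odo_traj[0])

print(round(c * H_entered, 1))  # → 880.0 — the altitude the graph has "learned"
```

A real pose graph spreads this correction over many anchors via least squares; the net effect on a consistent bias is the same.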
## **3.0 Core Component: The Scale-Aware Visual Odometry (SA-VO) Front-End**

This component is the new, critical engine of the system. Its sole task is to compute the *metric-scale* 6-DoF relative motion between consecutive frames, thereby eliminating scale drift at its source.
### **3.1 Rationale and Mechanism for Per-Frame Scale Recovery**

The SA-VO front-end implements a geometric algorithm to recover the absolute scale $s$ for *every* frame-to-frame transition. This algorithm directly leverages the query's "known altitude" ($H$) and "planar ground" constraints.5

The SA-VO algorithm for processing Image_N (relative to Image_N-1) is as follows:
1. **Feature Matching:** Extract and match robust features between Image_N and Image_N-1 using the selected feature matcher (see Section 3.2). This yields a set of corresponding 2D pixel coordinates.
2. **Essential Matrix:** Use RANSAC (Random Sample Consensus) and the camera intrinsic matrix $K$ to compute the Essential Matrix $E$ from the "inlier" correspondences.2
3. **Pose Decomposition:** Decompose $E$ to find the relative Rotation $R$ and the *unscaled* translation vector $t$, where the magnitude $||t||$ is fixed to 1.2
4. **Triangulation:** Triangulate the 3D-world points $X$ for all inlier features using the unscaled pose $[R, t]$.15 These 3D points ($X_i$) are now in a local, *unscaled* coordinate system (i.e., we know the *shape* of the point cloud, but not its *size*).
5. **Ground Plane Fitting:** The query states "terrain height can be neglected," meaning we assume a planar ground. A *second* RANSAC pass is performed, this time fitting a 3D plane to the set of triangulated 3D points $X$. The inliers to this RANSAC are identified as the ground points $X_g$.5 This method is highly robust, as it does not rely on a single point but on the consensus of all visible ground features.16
6. **Unscaled Height ($h$):** From the fitted plane equation $n^T X + d = 0$ (with $||n|| = 1$), the value $|d|$ is the perpendicular distance from the camera (at the coordinate system's origin) to the computed ground plane. This is our *unscaled* height $h$.
7. **Scale Computation:** We now have two values: the *real, metric* altitude $H$ (e.g., 900m) provided by the user, and our *computed, unscaled* height $h$. The absolute scale $s$ for this frame is the ratio of these two values: $s = H / h$.
8. **Metric Pose:** The final, metric-scale relative pose is $[R, T]$, where the metric translation $T = s \cdot t$. This high-confidence, scale-aware pose is sent to the Back-End.
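Steps 5-7 can be sketched in a few lines of NumPy. This is a minimal illustration, not the production SA-VO code; the RANSAC threshold and the toy point cloud are assumptions:

```python
import numpy as np

def fit_ground_plane_ransac(X, iters=300, thresh=0.02, seed=0):
    """Fit a plane n.x + d = 0 (with ||n|| = 1) to Nx3 points via RANSAC (step 5).
    Returns (n, d, inlier_mask)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        p = X[rng.choice(len(X), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        if np.linalg.norm(n) < 1e-9:            # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        d = -n @ p[0]
        inliers = np.abs(X @ n + d) < thresh
        if best is None or inliers.sum() > best[2].sum():
            best = (n, d, inliers)
    return best

def recover_scale(X_unscaled, H_metric):
    """Steps 6-7: unscaled camera height h = |d| (camera at the origin),
    absolute scale s = H / h."""
    n, d, _ = fit_ground_plane_ransac(X_unscaled)
    h = abs(d)                                   # perpendicular camera-to-plane distance
    return H_metric / h

# Toy check: triangulated points on a plane 3 "units" below the camera,
# plus a few off-plane outliers (trees, noise).
gx, gy = np.meshgrid(np.linspace(-1, 1, 10), np.linspace(-1, 1, 10))
ground = np.stack([gx.ravel(), gy.ravel(), np.full(100, 3.0)], axis=1)
outliers = np.array([[0.1, 0.2, 1.0], [-0.3, 0.4, 2.0], [0.5, -0.5, 2.5]])
X = np.vstack([ground, outliers])

s = recover_scale(X, H_metric=900.0)
print(round(s, 1))  # → 300.0
```

The consensus-based plane fit is what makes step 6 robust: the three off-plane points are simply left out of the inlier set.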
### **3.2 Feature Matching Sub-System Analysis**

The success of the SA-VO algorithm depends *entirely* on the quality of the initial feature matches, especially in the low-texture agricultural terrain specified in the query. The system requires a matcher that is both robust (for sparse textures) and extremely fast (for AC-7).

The initial draft's choice of SuperGlue 17 is a strong, proven baseline. However, its successor, LightGlue 18, offers a critical, non-obvious advantage: **adaptivity**.

The UAV flight is specified as *mostly* straight, with high overlap. Sharp turns (AC-4) are "rather an exception." This means ~95% of our image pairs are "easy" to match, while 5% are "hard."

* SuperGlue uses a fixed-depth Graph Neural Network (GNN), spending the *same* (large) amount of compute on an "easy" pair as on a "hard" pair.19 This is inefficient.
* LightGlue is *adaptive*.19 For an easy, high-overlap pair, it can exit early (e.g., at layer 3/9), returning a high-confidence match in a fraction of the time. For a "hard" low-overlap pair, it will use its full depth to get the best possible result.19

By using LightGlue, the system saves an *enormous* amount of computational budget on the 95% of "easy" frames, ensuring it *always* meets the <5s budget (AC-7) and reserving that compute for the harder CVGL tasks. LightGlue is a "plug-and-play replacement" 19 that is faster, more accurate, and easier to train.19
### **Table 1: Analysis of State-of-the-Art Feature Matchers (For SA-VO Front-End)**

| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **SuperPoint + SuperGlue** 17 | - SOTA robustness in low-texture, high-blur conditions. - GNN reasons about 3D scene context. - Proven in real-time SLAM systems.22 | - Computationally heavy (fixed-depth GNN). - Slower than LightGlue.19 - Training is complex.19 | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT.25 | **Good.** A solid, baseline choice. Meets robustness needs but will heavily tax the <5s time budget (AC-7). |
| **SuperPoint + LightGlue** 18 | - **Adaptive Depth:** Faster on "easy" pairs, more accurate on "hard" pairs.19 - **Faster & Lighter:** Outperforms SuperGlue on speed and accuracy.19 - **Easier to Train:** Simpler architecture and loss.19 - Direct plug-and-play replacement for SuperGlue. | - Newer, less long-term-SLAM-proven than SuperGlue (though rapidly being adopted). | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT.28 | **Excellent (Selected).** The adaptive nature is *perfect* for this problem. It saves compute on the 95% of easy (straight) frames, preserving the budget for the 5% of hard (turn) frames, maximizing our ability to meet AC-7. |
### **3.3 Selected Approach (SA-VO): SuperPoint + LightGlue**

The SA-VO front-end will be built using:

* **Detector:** **SuperPoint** 24 to detect sparse, robust features on Image_N_LR.
* **Matcher:** **LightGlue** 18 to match features from Image_N_LR to Image_N-1_LR.

This combination provides the SOTA robustness required for low-texture fields, while LightGlue's adaptive performance 19 is the key to meeting the <5s (AC-7) real-time requirement.
## **4.0 Global Anchoring: The Cross-View Geolocalization (CVGL) Module**

With the SA-VO front-end handling metric scale, the CVGL module's task is refined. Its purpose is no longer to *correct scale*, but to provide *absolute global "anchor" poses*. This corrects for any accumulated bias (e.g., if the $H$ prior is off by 5m) and, critically, *relocalizes* the system after a persistent tracking loss (AC-4).
### **4.1 Hierarchical Retrieval-and-Match Pipeline**

This module runs asynchronously and is computationally heavy. A brute-force search against the entire Google Maps database is impossible. A two-stage hierarchical pipeline is required:

1. **Stage 1: Coarse Retrieval.** This is treated as an image retrieval problem.29
   * A **Siamese CNN** 30 (or similar Dual-CNN architecture) is used to generate a compact "embedding vector" (a digital signature) for Image_N_LR.
   * An embedding database will be pre-computed for *all* Google Maps satellite tiles in the specified Eastern Ukraine operational area.
   * The UAV image's embedding is then used to perform a very fast similarity search (e.g., with the FAISS library) against the satellite database, returning the *Top-K* (e.g., K=5) most likely-matching satellite tiles.
2. **Stage 2: Fine-Grained Pose.**
   * *Only* for these Top-5 candidates, the system performs the heavy-duty **SuperPoint + LightGlue** matching.
   * This match is *not* Image_N -> Image_N-1. It is Image_N -> Satellite_Tile_K.
   * The match with the highest inlier count and lowest reprojection error (MRE < 1.0, AC-10) is used to compute the precise 6-DoF pose of the UAV relative to that georeferenced satellite tile. This yields the final Absolute_GPS_Pose.
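Stage 1 reduces to a nearest-neighbour search in embedding space. The sketch below uses brute-force NumPy for clarity (production would use a FAISS index); the embeddings are random stand-ins for the Siamese CNN outputs:

```python
import numpy as np

def top_k_tiles(query_emb, tile_embs, k=5):
    """Cosine-similarity retrieval: compare the UAV image embedding against
    every pre-computed satellite-tile embedding, return the Top-K tile indices."""
    q = query_emb / np.linalg.norm(query_emb)
    T = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    sims = T @ q                         # one dot product per tile
    return np.argsort(-sims)[:k]         # indices of the K most similar tiles

rng = np.random.default_rng(42)
tiles = rng.normal(size=(10_000, 256))               # 10k tiles, 256-D embeddings
query = tiles[1234] + 0.05 * rng.normal(size=256)    # noisy view of tile 1234

print(top_k_tiles(query, tiles, k=5)[0])  # → 1234
```

With normalized vectors, cosine search is an inner-product search, which is exactly the operation FAISS accelerates at scale.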
### **4.2 Critical Insight: Solving the Oblique-to-Nadir "Domain Gap"**

A critical, unaddressed failure mode exists. The query states the camera is **"not autostabilized"** [User Query]. On a fixed-wing UAV, this guarantees that during a bank or sharp turn (AC-4), the camera will *not* be nadir (top-down). It will be *oblique*, capturing the ground from an angle. The Google Maps reference, however, is *perfectly nadir*.32

This creates a severe "domain gap".33 A CVGL system trained *only* to match nadir-to-nadir images will *fail* when presented with an oblique UAV image.34 This means the CVGL module will fail *precisely* when it is needed most: during the sharp turns (AC-4) when SA-VO tracking is also lost.

The solution is to *close this domain gap* during training. Since the real-world UAV images will be oblique, the network must be taught to match oblique views to nadir ones.

**Solution: Synthetic Data Generation for Robust Training**

The Stage 1 Siamese CNN 30 must be trained on a custom, synthetically-generated dataset.37 The process is as follows:

1. Acquire nadir satellite imagery and a corresponding Digital Elevation Model (DEM) for the operational area.
2. Use this data to *synthetically render* the nadir satellite imagery from a wide variety of *oblique* viewpoints, simulating the UAV's roll and pitch.38
3. Create thousands of training pairs, each consisting of (Nadir_Satellite_Tile, Synthetically_Oblique_Tile_Angle_30_Deg).
4. Train the Siamese network 29 to learn that these two images—despite their *vastly* different appearances—are a *match*.

This process teaches the retrieval network to be *viewpoint-invariant*.35 It learns to ignore perspective distortion and match the true underlying ground features (road intersections, field boundaries). This is the *only* way to ensure the CVGL module can robustly relocalize the UAV during a sharp turn (AC-4).
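For locally flat terrain, one cheap way to approximate step 2 is a plane-induced homography. The sketch below covers only the pure-rotation case (roll about the camera centre) with hypothetical intrinsics; the real pipeline would render with the DEM:

```python
import numpy as np

def oblique_homography(K, roll_deg):
    """Homography warping a nadir image to an oblique one for a camera rolled
    by `roll_deg` about its own centre (pure rotation, flat ground). With a
    camera translation t, the full plane-induced form K (R - t n^T / d) K^-1
    would be used instead."""
    r = np.deg2rad(roll_deg)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(r), -np.sin(r)],
                  [0.0, np.sin(r),  np.cos(r)]])
    return K @ R @ np.linalg.inv(K)

K = np.array([[1000.0, 0.0, 768.0],    # toy intrinsics for a 1536x1024 tile
              [0.0, 1000.0, 512.0],
              [0.0, 0.0, 1.0]])
H = oblique_homography(K, roll_deg=30.0)

# Warp the principal point: under pure roll it slides vertically in the image.
p = H @ np.array([768.0, 512.0, 1.0])
print(p[:2] / p[2])
```

Applying such warps (plus photometric jitter) to nadir tiles is a common way to mass-produce the (nadir, oblique) training pairs described above.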
## **5.0 Trajectory Fusion: The Robust Optimization Back-End**

This component is the system's central "brain." It runs continuously, fusing all incoming measurements (high-frequency/metric-scale SA-VO poses, low-frequency/globally-absolute CVGL poses) into a single, globally consistent trajectory. This component's design is dictated by the requirements for streaming (AC-8), refinement (AC-8), and outlier-rejection (AC-3).

### **5.1 Selected Strategy: Incremental Pose-Graph Optimization**

The user's requirements for "results...appear immediately" and "system could refine existing calculated results" [User Query] are a textbook description of a real-time SLAM back-end.11 A batch Structure from Motion (SfM) process, which requires all images upfront and can take hours, is unsuitable for the primary system.
### **Table 2: Analysis of Trajectory Optimization Strategies**

| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **Incremental SLAM (Pose-Graph Optimization)** (g2o 13, Ceres Solver 10, GTSAM) | - **Real-time / Online:** Provides immediate pose estimates (AC-7). - **Supports Refinement:** Explicitly designed to refine past poses when new "loop closure" (CVGL) data arrives (AC-8).11 - **Robust:** Can handle outliers via robust kernels.39 | - Initial estimate is less accurate than a full batch process. - Can drift *if* not anchored (though our SA-VO minimizes this). | - A graph optimization library (g2o, Ceres). - A robust cost function.41 | **Excellent (Selected).** This is the *only* architecture that satisfies all user requirements for real-time streaming and asynchronous refinement. |
| **Batch Structure from Motion (Global Bundle Adjustment)** (COLMAP, Agisoft Metashape) | - **Globally Optimal Accuracy:** Produces the most accurate possible 3D reconstruction and trajectory. | - **Offline:** Cannot run in real-time or stream results. - High computational cost (minutes to hours). - Fails AC-7 and AC-8 completely. | - All images must be available before processing starts. - High RAM and CPU. | **Good (as an *Optional* Post-Processing Step).** Unsuitable as the primary online system, but could be offered as an optional, high-accuracy "Finalize Trajectory" batch process after the flight. |

The system's back-end will be built as an **Incremental Pose-Graph Optimizer** using **Ceres Solver**.10 Ceres is selected due to its large user community, robust documentation, excellent support for robust loss functions 10, and proven scalability for large-scale nonlinear least-squares problems.42
### **5.2 Mechanism for Automatic Outlier Rejection (AC-3, AC-5)**

The system must "correctly continue the work even in the presence of up to 350 meters of an outlier" (AC-3). A standard least-squares optimizer would be catastrophically corrupted by this event, as it would try to *average* this 350m error, pulling the *entire* 300km trajectory out of alignment.

A modern optimizer does not need to use brittle, hand-coded if-then logic to reject outliers. It can *mathematically* and *automatically* down-weight them using **Robust Loss Functions (Kernels)**.41

The mechanism is as follows:

1. The Ceres Back-End 10 maintains a graph of nodes (poses) and edges (constraints, or measurements).
2. A 350m outlier (AC-3) will create an edge with a *massive* error (residual).
3. A standard (quadratic) loss function $\mathrm{cost}(e) = e^2$ would create a *catastrophic* cost, forcing the optimizer to ruin the entire graph to accommodate it.
4. Instead, the system will wrap its cost functions in a **Robust Loss Function**, such as **CauchyLoss** or **HuberLoss**.10
5. A robust loss function behaves quadratically for small errors (which it tries hard to fix) but becomes *sub-linear* for large errors. When it "sees" the 350m error, it mathematically *down-weights its influence*.43
6. The optimizer effectively *acknowledges* the 350m error but *refuses* to pull the entire graph to fix this one "insane" measurement. It automatically, and gracefully, treats the outlier as a "lost cause" and optimizes the 99.9% of "sane" measurements. This is the modern, robust solution to AC-3 and AC-5.
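The down-weighting effect can be reproduced in a few lines of NumPy using iteratively reweighted least squares (IRLS) with a Cauchy kernel. This is a toy 1-D position estimate, not the Ceres formulation, and the measurement values are invented:

```python
import numpy as np

def cauchy_weight(r, c=5.0):
    """IRLS weight for the Cauchy kernel rho(r) = (c^2 / 2) * log(1 + (r/c)^2):
    small residuals get weight ~1, large residuals ~ (c/r)^2, i.e. down-weighted."""
    return 1.0 / (1.0 + (r / c) ** 2)

# Nine sane measurements of a position near 0 m, plus one 350 m outlier (AC-3).
z = np.array([0.2, -0.1, 0.05, 0.3, -0.2, 0.1, 0.0, -0.3, 0.15, 350.0])

# Plain least squares averages the outlier in...
x_ls = z.mean()

# ...while IRLS with a Cauchy kernel converges to the sane consensus.
x = np.median(z)                       # robust initial guess
for _ in range(20):
    w = cauchy_weight(z - x)           # re-weight residuals each iteration
    x = np.sum(w * z) / np.sum(w)      # weighted least-squares update

print(round(x_ls, 2), round(x, 2))     # quadratic loss is ruined; Cauchy is not
```

The 350m residual receives a weight of roughly 1/4900, which is the "acknowledged but refused" behaviour described in step 6.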
## **6.0 High-Resolution (6.2K) and Performance Optimization**

The system must simultaneously handle massive 6252x4168 (26-Megapixel) images and run on a modest RTX 2060 GPU [User Query] with a <5s time limit (AC-7). These are opposing constraints.

### **6.1 The Multi-Scale Patch-Based Processing Pipeline**

Running *any* deep learning model (SuperPoint, LightGlue) on a full 6.2K image will be impossibly slow and will *immediately* cause a CUDA Out-of-Memory (OOM) error on a 6GB RTX 2060.45

The solution is not to process the full 6.2K image in real-time. Instead, a **multi-scale, patch-based pipeline** is required, where different components use the resolution best suited to their task.46

1. **For SA-VO (Real-time, <5s):** The SA-VO front-end is concerned with *motion*, not fine-grained detail. The 6.2K Image_N_HR is *immediately* downscaled to a manageable 1536x1024 (Image_N_LR). The entire SA-VO (SuperPoint + LightGlue) pipeline runs *only* on this low-resolution, fast-to-process image. This is how the <5s (AC-7) budget is met.
2. **For CVGL (High-Accuracy, Async):** The CVGL module, which runs asynchronously, is where the 6.2K detail is *selectively* used to meet the 20m (AC-2) accuracy target. It uses a "coarse-to-fine" 48 approach:
   * **Step A (Coarse):** The Siamese CNN 30 runs on the *downscaled* 1536px Image_N_LR to get a coarse [Lat, Lon] guess.
   * **Step B (Fine):** The system uses this coarse guess to fetch the corresponding *high-resolution* satellite tile.
   * **Step C (Patching):** The system runs the SuperPoint detector on the *full 6.2K* Image_N_HR to find the Top 100 *most confident* feature keypoints. It then extracts 100 small (e.g., 256x256) *patches* from the full-resolution image, centered on these keypoints.49
   * **Step D (Matching):** The system then matches *these small, full-resolution patches* against the high-res satellite tile.
This hybrid method provides the best of both worlds: the fine-grained matching accuracy 50 of the 6.2K image, but without the catastrophic OOM errors or performance penalties.45
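Step C reduces to simple array slicing. A minimal sketch with a blank frame and hypothetical keypoints; clamping patches to the image border is one possible policy, assumed here:

```python
import numpy as np

def extract_patches(image_hr, keypoints, size=256):
    """Cut fixed-size patches from the full-resolution image, centred on the
    most confident keypoints; centres are clamped so every patch fits the frame."""
    h, w = image_hr.shape[:2]
    half = size // 2
    patches = []
    for x, y in keypoints:
        x = int(np.clip(x, half, w - half))
        y = int(np.clip(y, half, h - half))
        patches.append(image_hr[y - half:y + half, x - half:x + half])
    return np.stack(patches)

img = np.zeros((4168, 6252), dtype=np.uint8)      # 6.2K frame (grayscale toy)
kps = [(100, 50), (3000, 2000), (6200, 4100)]     # hypothetical keypoint pixels
patches = extract_patches(img, kps)
print(patches.shape)  # → (3, 256, 256)
```

100 such patches occupy about 6.5 MB at 8 bits per pixel, versus roughly 26 MB for the full frame, and each fits a deep model's input comfortably.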
### **6.2 Real-Time Deployment with TensorRT**

PyTorch is a research and training framework. Its default inference speed, even on an RTX 2060, is often insufficient to meet a <5s production requirement.23

For the final production system, the key neural networks (SuperPoint, LightGlue, Siamese CNN) *must* be converted from their PyTorch-native format into a highly-optimized **NVIDIA TensorRT engine**.

* **Benefits:** TensorRT is an inference optimizer that applies graph optimizations, layer fusion, and precision reduction (e.g., to FP16).52 This can achieve a 2x-4x (or more) speedup over native PyTorch.28
* **Deployment:** The resulting TensorRT engine can be deployed via a C++ API 25, which is far more suitable for a robust, high-performance production system.

This conversion is a *mandatory* deployment step. It is what makes a 2-second inference (well within the 5-second AC-7 budget) *achievable* on the specified RTX 2060 hardware.
## **7.0 System Robustness: Failure Mode and Logic Escalation**

The system's logic is designed as a multi-stage escalation process to handle the specific failure modes in the acceptance criteria (AC-3, AC-4, AC-6), ensuring the >95% registration rate (AC-9).

### **Stage 1: Normal Operation (Tracking)**

* **Condition:** SA-VO(N-1 -> N) succeeds. The LightGlue match is high-confidence, and the computed scale $s$ is reasonable.
* **Logic:**
  1. The Relative_Metric_Pose is sent to the Back-End.
  2. Pose_N_Est is calculated and sent to the user (<5s).
  3. The CVGL module is queued to run asynchronously to provide a Pose_N_Refined at a later time.
### **Stage 2: Transient SA-VO Failure (AC-3 Outlier Handling)**

* **Condition:** SA-VO(N-1 -> N) fails. This could be a 350m outlier (AC-3), a severely blurred image, or an image with no features (e.g., over a cloud). The LightGlue match fails, or the computed scale $s$ is nonsensical.
* **Logic (Frame Skipping):**
  1. The system *buffers* Image_N and marks it as "tentatively lost."
  2. When Image_N+1 arrives, the SA-VO front-end attempts to "bridge the gap" by matching SA-VO(N-1 -> N+1).
  3. **If successful:** A Relative_Metric_Pose for N+1 is found. Image_N is officially marked as a rejected outlier (AC-5). The system "correctly continues the work" (AC-3 met).
  4. **If it fails:** The system repeats for SA-VO(N-1 -> N+2).
  5. If this "bridging" fails for 3 consecutive frames, the system concludes it is not a transient outlier but a persistent tracking loss, and escalates to Stage 3.
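The frame-skipping logic above can be sketched as a small loop; `try_match` is a hypothetical stand-in for the full SA-VO pipeline, and the frame numbers in the toy run are invented:

```python
def bridge_gap(anchor, candidates, try_match, max_skips=3):
    """After SA-VO(anchor -> anchor+1) fails, try matching the last good frame
    against the next frames in turn; give up after `max_skips` failures and
    escalate to Stage 3 (persistent tracking loss)."""
    rejected = []
    for frame in candidates[:max_skips]:
        pose = try_match(anchor, frame)
        if pose is not None:
            return pose, rejected        # bridged: skipped frames are outliers (AC-5)
        rejected.append(frame)
    return None, rejected                # persistent loss: caller escalates to Stage 3

# Toy run: frames 30 and 31 are unmatchable (e.g., a 350 m outlier), 32 works.
matchable = {32}
result, dropped = bridge_gap(29, [30, 31, 32],
                             lambda a, b: "pose" if b in matchable else None)
print(result, dropped)  # → pose [30, 31]
```

Returning `None` after three failed bridges is exactly the Stage 2 → Stage 3 escalation condition.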
### **Stage 3: Persistent Tracking Loss (AC-4 Sharp Turn Handling)**

* **Condition:** The "frame-skipping" in Stage 2 fails. This is the "sharp turn" scenario [AC-4] where there is <5% overlap between Image_N-1 and Image_N+k.
* **Logic (Multi-Map "Chunking"):**
  1. The Back-End declares a "Tracking Lost" state at Image_N and creates a *new, independent map chunk* ("Chunk 2").
  2. The SA-VO Front-End is re-initialized at Image_N and begins populating this new chunk, tracking SA-VO(N -> N+1), SA-VO(N+1 -> N+2), etc.
  3. Because the front-end is **Scale-Aware**, this new "Chunk 2" is *already in metric scale*. It is a "floating island" of *known size and shape*; it just is not anchored to the global GPS map.
* **Resolution (Asynchronous Relocalization):**
  1. The **CVGL Module** is now tasked, at high priority, with finding a *single* Absolute_GPS_Pose for *any* frame in this new "Chunk 2".
  2. Once the CVGL module (which is robust to oblique views, per Section 4.2) finds one (e.g., for Image_N+20), the Back-End has all the information it needs.
  3. **Merging:** The Back-End calculates the simple 6-DoF transformation (3D translation and rotation, scale=1) to align all of "Chunk 2" and merge it with "Chunk 1". This robustly handles the "correctly continue the work" criterion (AC-4).
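The merge in the final step is a single rigid transform computed from one anchored pose (scale = 1, since both maps are already metric). A NumPy sketch with toy 4x4 homogeneous poses:

```python
import numpy as np

def se3(R, t):
    """Assemble a 4x4 homogeneous pose from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def merge_chunk(chunk_poses, k, absolute_pose_k):
    """Compute the rigid transform that maps chunk-local pose k onto its CVGL
    absolute pose, then re-anchor every pose in the chunk with it."""
    A = absolute_pose_k @ np.linalg.inv(chunk_poses[k])
    return [A @ T for T in chunk_poses]

# Toy chunk: two poses 50 m apart along x, expressed in a local frame that is
# offset 1000 m from the global frame.
local = [se3(np.eye(3), np.array([0.0, 0.0, 0.0])),
         se3(np.eye(3), np.array([50.0, 0.0, 0.0]))]
absolute_0 = se3(np.eye(3), np.array([1000.0, 0.0, 0.0]))  # CVGL anchor, pose 0

merged = merge_chunk(local, 0, absolute_0)
print(merged[1][:3, 3])  # second pose lands 50 m past the anchor
```

One anchor suffices because the chunk's internal shape and scale are already known; the alignment only supplies the missing global translation and rotation.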
### **Stage 4: Catastrophic Failure (AC-6 User Intervention)**

* **Condition:** The system has entered Stage 3 and is building "Chunk 2," but the **CVGL Module** has *also* failed for a prolonged period (e.g., 20% of the route, or 50+ consecutive frames). This is the "worst-case" scenario (e.g., heavy clouds *and* over a large, featureless lake). The system is "absolutely incapable" [User Query].
* **Logic:**
  1. The system has a metric-scale "Chunk 2" but zero idea where it is in the world.
  2. The Back-End triggers the AC-6 flag.
* **Resolution (User Input):**
  1. The UI prompts the user: "Tracking lost. Please provide a coarse location for the *current* image."
  2. The UI displays the last known good image (from Chunk 1) and the new, "lost" image (e.g., Image_N+50).
  3. The user clicks *one point* on the satellite map.
  4. This user-provided [Lat, Lon] is *not* taken as ground truth. It is fed to the CVGL module as a *strong prior*, drastically narrowing its search area from "all of Ukraine" to "a 10km-radius circle."
  5. This allows the CVGL module to re-acquire a lock, which triggers the Stage 3 merge, and the system continues.
## **8.0 Output Generation and Validation Strategy**

This section details how the final user-facing outputs are generated and how the system's compliance with all 10 acceptance criteria will be validated.

### **8.1 Generating Object-Level GPS (from Pixel Coordinate)**
This meets the requirement to find the "coordinates of the center of any object in these photos" [User Query]. The system provides this via a **Ray-Plane Intersection** method.

* **Inputs:**
  1. The user clicks pixel coordinate $(u,v)$ on Image_N.
  2. The system retrieves the refined, global 6-DoF pose $[R, T]$ for Image_N from the Back-End.
  3. The system uses the known camera intrinsic matrix $K$.
  4. The system uses the known *global ground-plane equation* (e.g., $Z = 150m$, based on the predefined altitude and start coordinate).
* **Method:**
  1. **Un-project Pixel:** The 2D pixel $(u,v)$ is un-projected into a 3D ray *direction* vector $d_{cam}$ in the camera's local coordinate system: $d_{cam} = K^{-1} \cdot [u, v, 1]^T$.
  2. **Transform Ray:** This ray direction is transformed into the *global* coordinate system using the pose's rotation matrix: $d_{global} = R \cdot d_{cam}$.
  3. **Define Ray:** A 3D ray is now defined, originating at the camera's global position $T$ (from the pose) and traveling in the direction $d_{global}$.
  4. **Intersect:** The system solves the 3D line-plane intersection equation for this ray and the known global ground plane (e.g., find the intersection with $Z = 150m$).
  5. **Result:** The 3D intersection point $(X, Y, Z)$ is the *metric* world coordinate of the object on the ground.
  6. **Convert:** This $(X, Y, Z)$ world coordinate is converted to a [Latitude, Longitude, Altitude] GPS coordinate. This process is immediate and can be performed for any pixel on any geolocated image.
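Steps 1-5 map directly to a few lines of NumPy. The intrinsics, pose, and plane height below are toy values, and the final lat/lon conversion (step 6) is omitted:

```python
import numpy as np

def pixel_to_ground(u, v, K, R, T, ground_z=150.0):
    """Un-project pixel (u, v) and intersect the ray with the plane Z = ground_z.
    R is the camera-to-world rotation, T the camera's world position."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # step 1: ray direction (camera frame)
    d_world = R @ d_cam                                # step 2: rotate into the world frame
    lam = (ground_z - T[2]) / d_world[2]               # step 4: solve T + lam*d on Z = ground_z
    return T + lam * d_world                           # step 5: metric ground point

# Toy setup: camera 900 m above a ground plane at Z = 150 m, looking straight down
# (a 180-degree rotation about the x-axis points the optical axis along -Z).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.array([[1.0, 0.0, 0.0],
              [0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0]])
T = np.array([0.0, 0.0, 1050.0])       # camera altitude = 150 + 900

X = pixel_to_ground(960, 540, K, R, T)   # the principal point
print(X)  # the ground point directly beneath the camera: (0, 0, 150)
```

The division by `d_world[2]` assumes the ray is not parallel to the ground plane, which always holds for a downward-looking camera.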
|
### **8.2 Rigorous Validation Methodology**
|
|
|
|
A comprehensive test plan is required to validate all 10 acceptance criteria. The foundation of this is the creation of a **Ground-Truth Test Harness**.
|
|
|
|
* **Test Harness:**
|
|
1. **Ground-Truth Data:** Several test flights will be conducted in the operational area using a UAV equipped with a high-precision RTK/PPK GPS. This provides the "real GPS" (ground truth) for every image.
|
|
2. **Test Datasets:** Multiple test datasets will be curated from this ground-truth data:
|
|
* Test_Baseline_1000: A standard 1000-image flight.
|
|
* Test_Outlier_350m (AC-3): Test_Baseline_1000 with a single image from 350m away manually inserted at frame 30.
|
|
* Test_Sharp_Turn_5pct (AC-4): A sequence where frames 20-24 are manually deleted, simulating a \<5% overlap jump.
|
|
* Test_Catastrophic_Fail_20pct (AC-6): A sequence with 200 (20%) consecutive "bad" frames (e.g., pure sky, lens cap) inserted.
|
|
* Test_Full_3000: A full 3000-image sequence to test scalability and memory usage.
|
|
* **Test Cases:**
|
|
* **Test_Accuracy (AC-1, AC-2, AC-5, AC-9):**
|
|
* Run Test_Baseline_1000. A test script will compare the system's *final refined GPS output* for each image against its *ground-truth GPS*.
|
|
* ASSERT (count(errors \< 50m) / 1000) \geq 0.80 (AC-1)
|
|
* ASSERT (count(errors \< 20m) / 1000) \geq 0.60 (AC-2)
|
|
* ASSERT (count(un-localized_images) / 1000) \< 0.10 (AC-5)
|
|
* ASSERT (count(localized_images) / 1000) > 0.95 (AC-9)
|
|
* **Test_MRE (AC-10):**
|
|
* ASSERT (BackEnd.final_MRE) \< 1.0 (AC-10)
|
|
* **Test_Performance (AC-7, AC-8):**
|
|
* Run Test_Full_3000 on the minimum-spec RTX 2060.
|
|
* Log timestamps for "Image In" -> "Initial Pose Out". ASSERT average_time \< 5.0s (AC-7).
|
|
* Log the output stream. ASSERT that >80% of images receive *two* poses: an "Initial" and a "Refined" (AC-8).
|
|
* **Test_Robustness (AC-3, AC-4, AC-6):**
|
|
* Run Test_Outlier_350m. ASSERT the system correctly continues and the final trajectory error for Image_31 is \< 50m (AC-3).
|
|
* Run Test_Sharp_Turn_5pct. ASSERT the system logs "Tracking Lost" and "Maps Merged," and the final trajectory is complete and accurate (AC-4).
|
|
* Run Test_Catastrophic_Fail_20pct. ASSERT the system correctly triggers the "ask for user input" event (AC-6). |