**GEORTEX-R: A Geospatial-Temporal Robust Extraction System for GPS-Denied UAV Geolocalization**
## **1.0 GEORTEX-R: System Architecture and Data Flow**
The GEORTEX-R system is an asynchronous, three-component software solution designed for deployment on an NVIDIA RTX 2060+ GPU. It is architected from the ground up to handle the specific, demonstrated challenges of GPS-denied localization in *non-planar terrain* (as seen in Images 1-9) and *temporally-divergent* (outdated) reference maps (AC-5).
The system's core design principle is the *decoupling of unscaled relative motion from global metric scale*. The front-end estimates high-frequency, robust, but *unscaled* motion. The back-end asynchronously provides sparse, high-confidence *metric* and *geospatial* anchors. The central hub fuses these two data streams into a single, globally-optimized, metric-scale trajectory.
### **1.1 Inputs**
1. **Image Sequence:** Consecutively named images, ranging from Full HD (1920x1080) up to 6252x4168.
2. **Start Coordinate (Image 0):** A single, absolute GPS coordinate (Latitude, Longitude) for the first image.
3. **Camera Intrinsics ($K$):** A pre-calibrated camera intrinsic matrix.
4. **Altitude Prior ($H_{prior}$):** The *approximate* predefined metric altitude (e.g., 900 meters). This is used as a *prior* (a hint) for optimization, *not* a hard constraint.
5. **Geospatial API Access:** Credentials for an on-demand satellite and DEM provider (e.g., Copernicus, EOSDA).
### **1.2 Streaming Outputs**
1. **Initial Pose (Pose_N_Est):** An *unscaled* pose estimate. This is sent immediately to the UI for real-time visualization of the UAV's *path shape* (AC-7, AC-8).
2. **Refined Pose (Pose_N_Refined) [Asynchronous]:** A globally-optimized, *metric-scale* pose (X, Y, Z, Qx, Qy, Qz, Qw) and its corresponding [Lat, Lon, Alt] coordinate. This is sent to the user whenever the Trajectory Optimization Hub re-converges, updating all past poses (AC-1, AC-2, AC-8).
### **1.3 Component Interaction and Data Flow**
The system is architected as three parallel-processing components:
1. **Image Ingestion & Pre-processing:** This module receives the new Image_N (up to 6.2K). It creates two copies:
* Image_N_LR (Low-Resolution, e.g., 1536x1024): Dispatched *immediately* to the V-SLAM Front-End for real-time processing.
* Image_N_HR (High-Resolution, 6.2K): Stored for asynchronous use by the Geospatial Anchoring Back-End (GAB).
2. **V-SLAM Front-End (High-Frequency Thread):** This component's sole task is high-speed, *unscaled* relative pose estimation. It tracks Image_N_LR against a *local map of keyframes*. It performs local bundle adjustment to minimize drift [12] and maintains a co-visibility graph of all keyframes. It sends Relative_Unscaled_Pose estimates to the Trajectory Optimization Hub (TOH).
3. **Geospatial Anchoring Back-End (GAB) (Low-Frequency, Asynchronous Thread):** This is the system's "anchor." When triggered by the TOH, it fetches *on-demand* geospatial data (satellite imagery and DEMs) from an external API [3]. It then performs a robust *hybrid semantic-visual* search [5] to find an *absolute, metric, global pose* for a given keyframe, robust to outdated maps (AC-5) [5] and oblique views (AC-4) [14]. This Absolute_Metric_Anchor is sent to the TOH.
4. **Trajectory Optimization Hub (TOH) (Central Hub):** This component manages the complete flight trajectory as a **Sim(3) pose graph** (7-DoF). It continuously fuses two distinct data streams:
* **On receiving Relative_Unscaled_Pose (T \< 5s):** It appends this pose to the graph, calculates the Pose_N_Est, and sends this *unscaled* initial result to the user (AC-7, AC-8 met).
* **On receiving Absolute_Metric_Anchor (T > 5s):** This is the critical event. It adds this as a high-confidence *global metric constraint*. This anchor creates "tension" in the graph, which the optimizer (Ceres Solver [15]) resolves by finding the *single global scale factor* that best fits all V-SLAM and CVGL measurements. It then triggers a full graph re-optimization, "stretching" the entire trajectory to the correct metric scale, and sends the new Pose_N_Refined stream to the user for all affected poses (AC-1, AC-2, AC-8 refinement met).
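As a concrete picture of this decoupling, the following is a minimal, runnable sketch of the three-thread topology using Python's standard `queue` and `threading` modules. All domain logic (tracking, anchoring, graph optimization) is stubbed out; only the flow of Relative_Unscaled_Pose and Absolute_Metric_Anchor messages is illustrated.

```python
# Toy sketch of the GEORTEX-R thread topology. Domain logic is stubbed;
# only the decoupled message flow between the three components is shown.
import queue
import threading
import time

frames_lr = queue.Queue()        # Image_N_LR stream -> V-SLAM front-end
anchor_requests = queue.Queue()  # keyframe indices the TOH wants anchored
toh_inbox = queue.Queue()        # fused inbox: relative poses + metric anchors

def vslam_front_end():
    """High-frequency thread: unscaled relative pose per low-res frame."""
    while True:
        idx = frames_lr.get()
        rel_pose = {"idx": idx, "scaled": False}           # stub tracker output
        toh_inbox.put(("relative_unscaled", idx, rel_pose))

def geospatial_back_end():
    """Low-frequency thread: absolute metric anchors from on-demand tiles."""
    while True:
        kf = anchor_requests.get()
        time.sleep(0.2)                                    # simulate API + CVGL latency
        toh_inbox.put(("absolute_metric", kf, {"scale": 1.0}))

def trajectory_hub():
    """Central hub: streams Pose_N_Est fast, Pose_N_Refined on each anchor."""
    while True:
        kind, idx, _payload = toh_inbox.get()
        if kind == "relative_unscaled":
            print(f"Pose_{idx}_Est (unscaled) -> UI")      # the <5s AC-7 path
            if idx % 5 == 0:
                anchor_requests.put(idx)                   # queue GAB on keyframes
        else:
            print(f"anchor@{idx}: re-optimize graph -> stream Pose_N_Refined")

for worker in (vslam_front_end, geospatial_back_end, trajectory_hub):
    threading.Thread(target=worker, daemon=True).start()

for i in range(10):                # feed ten frames; refinements arrive later,
    frames_lr.put(i)               # asynchronously, exactly as described above
time.sleep(1.0)
```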
## **2.0 Core Component: The High-Frequency V-SLAM Front-End**
This component's sole task is to robustly and accurately compute the *unscaled* 6-DoF relative motion of the UAV and build a geometrically-consistent map of keyframes. It is explicitly designed to be more robust to drift than simple frame-to-frame odometry.
### **2.1 Rationale: Keyframe-Based Monocular SLAM**
The choice of a keyframe-based V-SLAM front-end over a frame-to-frame VO is deliberate and critical for system robustness.
* **Drift Mitigation:** Frame-to-frame VO is "prone to drift accumulation due to errors introduced by each frame-to-frame motion estimation" [13]. A single poor match permanently corrupts all future poses.
* **Robustness:** A keyframe-based system tracks new images against a *local map* of *multiple* previous keyframes, not just Image_N-1. This provides resilience to transient failures (e.g., motion blur, occlusion).
* **Optimization:** This architecture enables "local bundle adjustment" [12], a process where a sliding window of recent keyframes is continuously re-optimized, actively minimizing error and drift *before* it can accumulate.
* **Relocalization:** This architecture possesses *innate relocalization capabilities* (see Section 6.3), which is the correct, robust solution to the "sharp turn" (AC-4) requirement.
### **2.2 Feature Matching Sub-System**
The success of the V-SLAM front-end depends entirely on high-quality feature matches, especially in the sparse, low-texture agricultural terrain seen in the provided images (e.g., Image 6, Image 7). The system requires a matcher that is robust (for sparse textures [17]) and extremely fast (for AC-7).
The selected approach is **SuperPoint + LightGlue**.
* **SuperPoint:** A SOTA (State-of-the-Art) feature detector proven to find robust, repeatable keypoints in challenging, low-texture conditions [17].
* **LightGlue:** A highly optimized GNN-based matcher that is the successor to SuperGlue [19].
The key advantage of selecting LightGlue [19] over SuperGlue [20] is its *adaptive nature*. The query states sharp turns (AC-4) are "rather an exception." This implies \~95% of image pairs are "easy" (high-overlap, straight flight) and 5% are "hard" (low-overlap, turns). SuperGlue uses a fixed-depth GNN, spending the *same* large amount of compute on an "easy" pair as on a "hard" one. LightGlue is *adaptive* [19]: for an "easy" pair, it can exit its GNN early, returning a high-confidence match in a fraction of the time. This saves an enormous amount of compute on the 95% of "easy" frames, ensuring the system reliably meets the \<5s budget (AC-7) and reserving that compute for the GAB (a usage sketch follows Table 1).
#### **Table 1: Analysis of State-of-the-Art Feature Matchers (For V-SLAM Front-End)**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **SuperPoint + SuperGlue** [20] | - SOTA robustness in low-texture, high-blur conditions. - GNN reasons about 3D scene context. - Proven in real-time SLAM systems. | - Computationally heavy (fixed-depth GNN). - Slower than LightGlue [19]. | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT [21]. | **Good.** A solid baseline choice. Meets robustness needs but will heavily tax the \<5s time budget (AC-7). |
| **SuperPoint + LightGlue** [17] | - **Adaptive Depth:** Faster on "easy" pairs, more accurate on "hard" pairs [19]. - **Faster & Lighter:** Outperforms SuperGlue on speed and accuracy. - SOTA "in practice" choice for large-scale matching [17]. | - Newer, but rapidly being adopted and proven [21]. | - NVIDIA GPU (RTX 2060+). - PyTorch or TensorRT [22]. | **Excellent (Selected).** The adaptive nature is *perfect* for this problem. It saves compute on the 95% of easy (straight) frames, maximizing our ability to meet AC-7. |
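As a usage sketch, the open-source `lightglue` package (github.com/cvg/LightGlue) exposes both the extractor and the adaptive matcher. The snippet below follows that repository's README; the image paths are placeholders, and the `depth_confidence`/`width_confidence` arguments are the repository's names for the early-exit and point-pruning behavior, so exact names and defaults should be confirmed against the installed version.

```python
# Minimal SuperPoint + LightGlue matching, per the cvg/LightGlue README.
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

# Lower confidence thresholds let LightGlue exit its GNN earlier on
# "easy" (high-overlap, straight-flight) pairs.
extractor = SuperPoint(max_num_keypoints=2048).eval().cuda()
matcher = LightGlue(features="superpoint",
                    depth_confidence=0.95,     # early-exit threshold
                    width_confidence=0.99).eval().cuda()

image0 = load_image("frames/image_0001.jpg").cuda()   # placeholder paths
image1 = load_image("frames/image_0002.jpg").cuda()

feats0 = extractor.extract(image0)
feats1 = extractor.extract(image1)
matches01 = matcher({"image0": feats0, "image1": feats1})
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]

matches = matches01["matches"]                  # (K, 2) index pairs
points0 = feats0["keypoints"][matches[..., 0]]  # (K, 2) pixels in image 0
points1 = feats1["keypoints"][matches[..., 1]]  # (K, 2) pixels in image 1
```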
## **3.0 Core Component: The Geospatial Anchoring Back-End (GAB)**
This component is the system's "anchor to reality." It runs asynchronously to provide the *absolute, metric-scale* constraints needed to solve the trajectory. It is an *on-demand* system that solves three distinct "domain gaps": the hardware/scale gap, the temporal gap, and the viewpoint gap.
### **3.1 On-Demand Geospatial Data Retrieval**
A "pre-computed database" for all of Eastern Ukraine is operationally unfeasible on laptop-grade hardware.1 This design is replaced by an on-demand, API-driven workflow.
* **Mechanism:** When the TOH requests a global anchor, the GAB receives a *coarse* [Lat, Lon] estimate. The GAB then performs API calls to a geospatial data provider (e.g., EOSDA 3, Copernicus 8).
* **Dual-Retrieval:** The API query requests *two* distinct products for the specified Area of Interest (AOI):
1. **Visual Tile:** A high-resolution (e.g., 30-50cm) satellite ortho-image [26].
2. **Terrain Tile:** The corresponding **Digital Elevation Model (DEM)**, such as the Copernicus GLO-30 (30m resolution) or SRTM (30m) [7].
This "Dual-Retrieval" mechanism is the central, enabling synergy of the new architecture. The **Visual Tile** is used by the CVGL (Section 3.2) to find the *geospatial pose*. The **DEM Tile** is used by the *output module* (Section 7.1) to perform high-accuracy **Ray-DEM Intersection**, solving the final output accuracy problem.
### **3.2 Hybrid Semantic-Visual Localization**
The "temporal gap" (evidenced by burn scars in Images 1-9) and "outdated maps" (AC-5) makes a purely visual CVGL system unreliable.5 The GAB solves this using a robust, two-stage *hybrid* matching pipeline.
1. **Stage 1: Coarse Visual Retrieval (Siamese CNN).** A lightweight Siamese CNN 14 is used to find the *approximate* location of the Image_N_LR *within* the large, newly-fetched satellite tile. This acts as a "candidate generator."
2. **Stage 2: Fine-Grained Semantic-Visual Fusion.** For the top candidates, the GAB performs a *dual-channel alignment*.
* **Visual Channel (Unreliable):** It runs SuperPoint+LightGlue on high-resolution *patches* (from Image_N_HR) against the satellite tile. This match may be *weak* due to temporal gaps [5].
* **Semantic Channel (Reliable):** It extracts *temporally-invariant* semantic features (e.g., road-vectors, field-boundaries, tree-cluster-polygons, lake shorelines) from *both* the UAV image (using a segmentation model) and the satellite/OpenStreetMap data [5].
* **Fusion:** A RANSAC-based optimizer finds the 6-DoF pose that *best aligns* this *hybrid* set of features.
This hybrid approach is robust to the exact failure mode seen in the images. When matching Image 3 (burn scars), the *visual* LightGlue match will be poor. However, the *semantic* features (the dirt road, the tree line) are *unchanged*. The optimizer will find a high-confidence pose by *trusting the semantic alignment* over the poor visual alignment, thereby succeeding despite the "outdated map" (AC-5).
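One way to realize the Fusion step, assuming both channels have already been reduced to 2D (UAV pixel) / 3D (geo-referenced via tile + DEM) correspondences, is to pool them into a single RANSAC PnP problem: when temporal change corrupts the visual channel, the temporally-invariant semantic correspondences dominate the consensus set. The OpenCV sketch below is illustrative; the 50% semantic-support gate is an assumed threshold, not the committed design.

```python
# Pooled RANSAC PnP over visual + semantic correspondences.
import cv2
import numpy as np

def fuse_pose(visual_2d, visual_3d, semantic_2d, semantic_3d, K):
    pts_2d = np.vstack([visual_2d, semantic_2d]).astype(np.float64)
    pts_3d = np.vstack([visual_3d, semantic_3d]).astype(np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K, distCoeffs=None,
        reprojectionError=4.0, iterationsCount=2000, confidence=0.999)
    if not ok:
        return None
    # Confidence gate (assumed): accept the anchor only when enough of the
    # reliable semantic channel supports the consensus pose.
    sem_inliers = np.sum(inliers.ravel() >= len(visual_2d))
    if sem_inliers < 0.5 * len(semantic_2d):
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec           # camera pose in the tile's metric frame
```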
### **3.3 Solution to Viewpoint Gap: Synthetic Oblique View Training**
This component is critical for handling "sharp turns" (AC-4). The camera *will* be oblique, not nadir, during turns.
* **Problem:** The GAB's Stage 1 Siamese CNN [14] will be matching an *oblique* UAV view to a *nadir* satellite tile. This "viewpoint gap" will cause a match failure [14].
* **Mechanism (Synthetic Data Generation):** The network must be trained for *viewpoint invariance* [28].
1. Using the on-demand DEMs (fetched in 3.1) and satellite tiles, the system can *synthetically render* the satellite imagery from *any* roll, pitch, and altitude.
2. The Siamese network is trained on (Nadir_Tile, Synthetic_Oblique_Tile) pairs [14].
* **Result:** This process teaches the network to match the *underlying ground features*, not the *perspective distortion*. It ensures the GAB can relocalize the UAV *precisely* when it is needed most: during a sharp, banking turn (AC-4) when VO tracking has been lost.
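A minimal sketch of the pair-generation step: warping a nadir tile with a rotation-induced homography to mimic an off-nadir view. This flat-ground approximation ignores the DEM; the full mechanism described above would drape the tile over the terrain mesh and render it, but the homography version already yields usable (Nadir_Tile, Synthetic_Oblique_Tile) training pairs.

```python
# Synthesize an oblique view of a nadir ortho-tile via H = K R K^-1.
import cv2
import numpy as np

def synth_oblique(nadir_tile, K, pitch_deg):
    h, w = nadir_tile.shape[:2]
    p = np.deg2rad(pitch_deg)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(p), -np.sin(p)],
                  [0.0, np.sin(p),  np.cos(p)]])  # pitch about the x-axis
    H = K @ R @ np.linalg.inv(K)                  # rotation-induced homography
    return cv2.warpPerspective(nadir_tile, H, (w, h))

# Training pairs: (tile, synth_oblique(tile, K, pitch)) for sampled pitches.
```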
## **4.0 Core Component: The Trajectory Optimization Hub (TOH)**
This component is the system's central "brain." It runs continuously, fusing all measurements (high-frequency/unscaled V-SLAM, low-frequency/metric-scale GAB anchors) into a single, globally consistent trajectory.
### **4.1 Incremental Sim(3) Pose-Graph Optimization**
The "planar ground" SA-VO (Finding 1) is removed. This component is its replacement. The system must *discover* the global scale, not *assume* it.
* **Selected Strategy:** An incremental pose-graph optimizer using **Ceres Solver** [15].
* **The Sim(3) Insight:** The V-SLAM front-end produces *unscaled* 6-DoF ($SE(3)$) relative poses. The GAB produces *metric-scale* 6-DoF ($SE(3)$) *absolute* poses. These cannot be directly combined. The graph must be optimized in **Sim(3) (7-DoF)**, which adds a *single global scale factor $s$* as an optimizable variable.
* **Mechanism (Ceres Solver):**
1. **Nodes:** Each keyframe pose (7-DoF: $X, Y, Z, Qx, Qy, Qz, Qw, s$).
2. **Edge 1 (V-SLAM):** A relative pose constraint between Keyframe_i and Keyframe_j. The error is computed in Sim(3).
3. **Edge 2 (GAB):** An *absolute* pose constraint on Keyframe_k. This constraint *fixes* Keyframe_k's pose to the *metric* GPS coordinate and *fixes its scale $s$ to 1.0*.
* **Bootstrapping Scale:** The TOH graph "bootstraps" the scale [32]. The GAB's $s=1.0$ anchor creates "tension" in the graph. The Ceres optimizer [15] resolves this tension by finding the *one* global scale $s$ for all V-SLAM nodes that minimizes the total error, effectively "stretching" the entire unscaled trajectory to fit the metric anchors. This is robust to *any* terrain [34]. (A toy illustration follows Table 2.)
#### **Table 2: Analysis of Trajectory Optimization Strategies**
| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **Incremental SLAM (Pose-Graph Optimization)** (Ceres Solver [15], g2o [35], GTSAM) | - **Real-time / Online:** Provides immediate pose estimates (AC-7). - **Supports Refinement:** Explicitly designed to refine past poses when new "loop closure" (GAB) data arrives (AC-8) [13]. - **Robust:** Can handle outliers via robust kernels [15]. | - Initial estimate is *unscaled* until a GAB anchor arrives. - Can drift *if* not anchored (though V-SLAM minimizes this). | - A graph optimization library (Ceres). - A robust cost function. | **Excellent (Selected).** This is the *only* architecture that satisfies all user requirements for real-time streaming and asynchronous refinement. |
| **Batch Structure from Motion (Global Bundle Adjustment)** (COLMAP, Agisoft Metashape) | - **Globally Optimal Accuracy:** Produces the most accurate possible 3D reconstruction and trajectory. | - **Offline:** Cannot run in real-time or stream results. - High computational cost (minutes to hours). - Fails AC-7 and AC-8 completely. | - All images must be available before processing starts. - High RAM and CPU. | **Good (as an *Optional* Post-Processing Step).** Unsuitable as the primary online system, but could be offered as an optional, high-accuracy "Finalize Trajectory" batch process. |
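To make the scale-bootstrapping mechanism concrete, the toy problem below optimizes a 2D pose graph with SciPy standing in for Ceres: nodes are planar positions plus one shared log-scale, the V-SLAM edges are unscaled relative displacements, and a single metric anchor supplies the "tension" that resolves into the global scale $s$. All data is synthetic and the weights are illustrative.

```python
# Toy Sim-style scale bootstrapping: unscaled edges + one metric anchor.
import numpy as np
from scipy.optimize import least_squares

np.random.seed(0)
true_s = 3.7                                        # metric scale VO cannot see
true_pos = np.cumsum(np.random.randn(20, 2), axis=0) * true_s
vo_edges = (true_pos[1:] - true_pos[:-1]) / true_s  # unscaled VO measurements
anchor_idx, anchor_xy = 10, true_pos[10]            # one GAB metric anchor

def residuals(params):
    s = np.exp(params[0])                           # log-scale keeps s > 0
    pos = params[1:].reshape(-1, 2)
    r_vo = (pos[1:] - pos[:-1]) - s * vo_edges            # relative-pose edges
    r_anchor = 100.0 * (pos[anchor_idx] - anchor_xy)      # metric GAB anchor
    r_start = 100.0 * (pos[0] - true_pos[0])              # known Image 0 start
    return np.concatenate([r_vo.ravel(), r_anchor, r_start])

# Initial guess: s = 1 and plain dead-reckoning of the unscaled edges.
x0 = np.concatenate(
    [[0.0], np.cumsum(np.vstack([[0.0, 0.0], vo_edges]), axis=0).ravel()])
sol = least_squares(residuals, x0, loss="huber", f_scale=1.0)
print("recovered global scale:", np.exp(sol.x[0]))  # ~3.7
```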
### **4.2 Automatic Outlier Rejection (AC-3, AC-5)**
The system must handle 350m outliers (AC-3) and \<10% bad GAB matches (AC-5).
* **Mechanism (Robust Loss Functions):** A standard least-squares optimizer (like Ceres [15]) would be catastrophically corrupted by a 350m error. The solution is to wrap *all* constraints in a **Robust Loss Function (e.g., HuberLoss, CauchyLoss)** [15].
* **Result:** A robust loss function mathematically *down-weights* the influence of constraints with large errors. When it "sees" the 350m error (AC-3), it effectively acknowledges the measurement but *refuses* to pull the entire 3000-image trajectory to fit this one "insane" data point. It automatically and gracefully *ignores* the outlier, optimizing the 99.9% of "sane" measurements. This is the modern, robust solution to AC-3 and AC-5.
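The down-weighting effect is easy to see on a toy problem: estimating a constant from 100 measurements, one of which is wrong by 350 units. Plain least squares is dragged off by the outlier; a Huber kernel all but ignores it.

```python
# Plain vs. Huber-robust estimation of a constant with one 350-unit outlier.
import numpy as np
from scipy.optimize import least_squares

meas = np.full(100, 10.0)
meas[50] = 360.0                                  # the "insane" AC-3 outlier

res = lambda x: x - meas
plain = least_squares(res, x0=[0.0]).x[0]                              # ~13.5
robust = least_squares(res, x0=[0.0], loss="huber", f_scale=1.0).x[0]  # ~10.0
print(f"least-squares: {plain:.2f}   huber: {robust:.2f}")
```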
## **5.0 High-Performance Compute & Deployment**
The system must run on an RTX 2060 (AC-7) and process 6.2K images. These are opposing constraints.
### **5.1 Multi-Scale, Patch-Based Processing Pipeline**
Running deep learning models (SuperPoint, LightGlue) on a full 6.2K (26-Megapixel) image will cause a CUDA Out-of-Memory (OOM) error and be impossibly slow.
* **Mechanism (Coarse-to-Fine):**
1. **For V-SLAM (Real-time, \<5s):** The V-SLAM front-end (Section 2.0) runs *only* on the Image_N_LR (e.g., 1536x1024) copy. This is fast enough to meet the AC-7 budget.
2. **For GAB (High-Accuracy, Async):** The GAB (Section 3.0) uses the full-resolution Image_N_HR *selectively* to meet the 20m accuracy (AC-2).
* It first runs its coarse Siamese CNN [27] on the Image_N_LR.
* It then runs the SuperPoint detector over the *full 6.2K* image (processed in tiles to stay within GPU memory) to find the *most confident* feature keypoints.
* It then extracts small, 256x256 *patches* from the *full-resolution* image, centered on these keypoints.
* It matches *these small, full-resolution patches* against the high-res satellite tile.
* **Result:** This hybrid method provides the fine-grained matching accuracy of the 6.2K image (needed for AC-2) without the catastrophic OOM errors or performance penalties.
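A sketch of the patch-extraction step, assuming the full-resolution SuperPoint detections arrive as (x, y) keypoints with per-point confidence scores (the array layout is an assumption):

```python
# Cut 256x256 patches around the top-K most confident full-res keypoints,
# so only a few megapixels ever reach the fine matcher.
import numpy as np

def extract_patches(image_hr, keypoints, scores, k=32, size=256):
    """image_hr: HxWx3; keypoints: Nx2 (x, y) at full resolution."""
    h, w = image_hr.shape[:2]
    half = size // 2
    order = np.argsort(scores)[::-1][:k]       # top-K by detector confidence
    patches, centers = [], []
    for x, y in keypoints[order].astype(int):
        x = int(np.clip(x, half, w - half))    # keep the crop fully in-bounds
        y = int(np.clip(y, half, h - half))
        patches.append(image_hr[y - half:y + half, x - half:x + half])
        centers.append((x, y))
    return np.stack(patches), np.array(centers)
```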
### **5.2 Mandatory Deployment: NVIDIA TensorRT Acceleration**
PyTorch is a research framework. For production, its inference speed is insufficient.
* **Requirement:** The key neural networks (SuperPoint, LightGlue, Siamese CNN) *must* be converted from PyTorch into a highly-optimized **NVIDIA TensorRT engine**.
* **Research Validation:** [23] demonstrates this process for LightGlue, achieving "2x-4x speed gains over compiled PyTorch"; [22] and [21] provide open-source repositories for SuperPoint+LightGlue conversion to ONNX and TensorRT.
* **Result:** This is not an "optional" optimization. It is a *mandatory* deployment step. This conversion (which applies layer fusion, graph optimization, and FP16 precision) is what makes achieving the \<5s (AC-7) performance *possible* on the specified RTX 2060 hardware [36].
## **6.0 System Robustness: Failure Mode Escalation Logic**
This logic defines the system's behavior during real-world failures, ensuring it meets criteria AC-3, AC-4, AC-6, and AC-9.
### **6.1 Stage 1: Normal Operation (Tracking)**
* **Condition:** V-SLAM front-end (Section 2.0) is healthy.
* **Logic:**
1. V-SLAM successfully tracks Image_N_LR against its local keyframe map.
2. A new Relative_Unscaled_Pose is sent to the TOH.
3. TOH sends Pose_N_Est (unscaled) to the user (\<5s).
4. If Image_N is selected as a new keyframe, the GAB (Section 3.0) is *queued* to find an Absolute_Metric_Anchor for it, which will trigger a Pose_N_Refined update later.
### **6.2 Stage 2: Transient VO Failure (Outlier Rejection)**
* **Condition:** Image_N is unusable (e.g., severe blur, sun-glare, 350m outlier per AC-3).
* **Logic (Frame Skipping):**
1. V-SLAM front-end fails to track Image_N_LR against the local map.
2. The system *discards* Image_N (marking it as a rejected outlier, AC-5).
3. When Image_N+1 arrives, the V-SLAM front-end attempts to track it against the *same* local keyframe map (from Image_N-1).
4. **If successful:** Tracking resumes. Image_N is officially an outlier. The system "correctly continues the work" (AC-3 met).
5. **If fails:** The system repeats for Image_N+2, N+3. If this fails for \~5 consecutive frames, it escalates to Stage 3.
### **6.3 Stage 3: Persistent VO Failure (Relocalization)**
* **Condition:** Tracking is lost for multiple frames. This is the "sharp turn" (AC-4) or "low overlap" (AC-4) scenario.
* **Logic (Keyframe-Based Relocalization):**
1. The V-SLAM front-end declares "Tracking Lost."
2. **Critically:** It does *not* create a "new map chunk."
3. Instead, it enters **Relocalization Mode**. For every new Image_N+k, it extracts features (SuperPoint) and queries the *entire* existing database of past keyframes for a match.
* **Resolution:** The UAV completes its sharp turn. Image_N+5 now has high overlap with Image_N-10 (from *before* the turn).
1. The relocalization query finds a strong match.
2. The V-SLAM front-end computes the 6-DoF pose of Image_N+5 relative to the *existing map*.
3. Tracking is *resumed* seamlessly. The system "correctly continues the work" (AC-4 met). This is vastly more robust than the previous "map-merging" logic.
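The Stage 1-3 escalation compresses into a small state machine. The sketch below keeps only the control flow; the tracking and relocalization calls are stubs, and the 5-frame threshold mirrors the Stage 2 logic above.

```python
# Control-flow skeleton of the Stage 1-3 escalation logic.
from enum import Enum, auto

class State(Enum):
    TRACKING = auto()
    RELOCALIZING = auto()

MAX_SKIPS = 5   # consecutive failures before escalating to Stage 3

class FrontEnd:
    def __init__(self):
        self.state = State.TRACKING
        self.failures = 0

    def process(self, frame):
        if self.state is State.TRACKING:
            if self.track_against_local_map(frame):        # Stage 1
                self.failures = 0
                return "pose"
            self.failures += 1                             # Stage 2: skip frame
            if self.failures >= MAX_SKIPS:
                self.state = State.RELOCALIZING            # escalate to Stage 3
            return "outlier_discarded"
        # Stage 3: query the whole keyframe database, not just the local map
        if self.relocalize_against_all_keyframes(frame):
            self.state = State.TRACKING
            self.failures = 0
            return "relocalized"
        return "still_lost"            # prolonged loss is the TOH's Stage 4 call

    def track_against_local_map(self, frame):              # stub
        return frame.get("trackable", True)

    def relocalize_against_all_keyframes(self, frame):     # stub
        return frame.get("reloc_match", False)
```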
### **6.4 Stage 4: Catastrophic Failure (User Intervention)**
* **Condition:** The system is in Stage 3 (Lost), but *also*, the **GAB (Section 3.0) has failed** to find *any* global anchors for a prolonged period (e.g., 20% of the route). This is the "absolutely incapable" scenario (AC-6), (e.g., heavy fog *and* over a featureless ocean).
* **Logic:**
1. The system has an *unscaled* trajectory, and *zero* idea where it is in the world.
2. The TOH triggers the AC-6 flag.
* **Resolution (User-Aided Prior):**
1. The UI prompts the user: "Tracking lost. Please provide a coarse location for the *current* image."
2. The user clicks *one point* on a map.
3. This [Lat, Lon] is *not* taken as ground truth. It is fed to the **GAB (Section 3.1)** as a *strong prior* for its on-demand API query.
4. This narrows the GAB's search area from "all of Ukraine" to "a 5km radius." This *guarantees* the GAB's Dual-Retrieval (Section 3.1) will fetch the *correct* satellite and DEM tiles, allowing the Hybrid Matcher (Section 3.2) to find a high-confidence Absolute_Metric_Anchor, which in turn re-scales (Section 4.1) and relocalizes the entire trajectory.
## **7.0 Output Generation and Validation Strategy**
This section details how the final user-facing outputs are generated, specifically solving the "planar ground" output flaw, and how the system's compliance with all 10 ACs will be validated.
### **7.1 High-Accuracy Object Geolocalization via Ray-DEM Intersection**
The "Ray-Plane Intersection" method is inaccurate for non-planar terrain 37 and is replaced with a high-accuracy ray-tracing method. This is the correct method for geolocating an object on the *non-planar* terrain visible in Images 1-9.
* **Inputs:**
1. User clicks pixel coordinate $(u,v)$ on Image_N.
2. System retrieves the *final, refined, metric* 7-DoF pose $P = (R, T, s)$ for Image_N from the TOH.
3. The system uses the known camera intrinsic matrix $K$.
4. System retrieves the specific **30m DEM tile** [8] that was fetched by the GAB (Section 3.1) for this region of the map. This DEM height grid is triangulated into a 3D terrain mesh for intersection.
* **Algorithm (Ray-DEM Intersection):**
1. **Un-project Pixel:** The 2D pixel $(u,v)$ is un-projected into a 3D ray *direction* vector $d_{cam}$ in the camera's local coordinate system: $d_{cam} = K^{-1} \cdot [u, v, 1]^T$.
2. **Transform Ray:** This ray direction $d_{cam}$ and origin (0,0,0) are transformed into the *global, metric* coordinate system using the pose $P$. This yields a ray originating at $T$ and traveling in direction $R \\cdot d_{cam}$.
3. **Intersect:** The system performs a numerical *ray-mesh intersection* [39] to find the 3D point $(X, Y, Z)$ where this global ray *intersects the 3D terrain mesh* of the DEM.
4. **Result:** This 3D intersection point $(X, Y, Z)$ is the *metric* world coordinate of the object *on the actual terrain*.
5. **Convert:** This $(X, Y, Z)$ world coordinate is converted to a [Latitude, Longitude, Altitude] GPS coordinate.
This method correctly accounts for terrain. A pixel aimed at the top of a hill will intersect the DEM at a high Z-value. A pixel aimed at the ravine (Image 1) will intersect at a low Z-value. This is the *only* method that can reliably meet the 20m accuracy (AC-2) for object localization.
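A minimal numerical version of the algorithm is sketched below: it marches the global ray across a DEM height grid until the ray dips below the terrain. The grid sampler stands in for a true triangulated-mesh intersection (libraries such as trimesh offer exact ray-mesh queries); the East-North-Up convention, camera center $T$ as ray origin, and 5m step size are assumptions. The returned metric point is then converted to [Lat, Lon, Alt] via the tile's georeferencing.

```python
# Ray-DEM intersection by marching along the un-projected pixel ray.
import numpy as np

def ray_dem_intersect(u, v, K, R, T, dem, origin_xy, cell_m, step_m=5.0):
    """dem: 2D height grid in meters; origin_xy: world (X, Y) of dem[0, 0]."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # un-project the pixel
    d = R @ d_cam
    d /= np.linalg.norm(d)                             # global ray direction
    p = np.asarray(T, dtype=float).copy()              # origin = camera center
    for _ in range(int(20_000 / step_m)):              # march up to 20 km
        p = p + step_m * d
        i = int((p[1] - origin_xy[1]) / cell_m)        # row index (northing)
        j = int((p[0] - origin_xy[0]) / cell_m)        # col index (easting)
        if not (0 <= i < dem.shape[0] and 0 <= j < dem.shape[1]):
            return None                                # left the fetched tile
        if p[2] <= dem[i, j]:                          # ray passed below terrain
            return p                                   # metric world hit point
    return None
```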
### **7.2 Rigorous Validation Methodology**
A comprehensive test plan is required. The foundation is a **Ground-Truth Test Harness** using the provided coordinates.csv [42].
* **Test Harness:**
1. **Ground-Truth Data:** The file coordinates.csv [42] provides ground-truth [Lat, Lon] for 60 images (e.g., AD000001.jpg...AD000060.jpg).
2. **Test Datasets:**
* Test_Baseline_60 [42]: The 60 images and their coordinates.
* Test_Outlier_350m (AC-3): Test_Baseline_60 with a single, unrelated image inserted at frame 30.
* Test_Sharp_Turn_5pct (AC-4): A sequence where frames 20-24 are manually deleted, simulating a \<5% overlap jump.
* **Test Cases:**
* **Test_Accuracy (AC-1, AC-2, AC-5, AC-9):**
* **Run:** Execute GEORTEX-R on Test_Baseline_60, providing AD000001.jpg's coordinate (48.275292, 37.385220) as the Start Coordinate [42].
* **Script:** A validation script will compute the Haversine distance error between the *system's refined GPS output* for each image (2-60) and the *ground-truth GPS* from coordinates.csv (see the sketch after this test plan).
* **ASSERT** (count(errors \< 50m) / 60) >= 0.80 **(AC-1 Met)**
* **ASSERT** (count(errors \< 20m) / 60) >= 0.60 **(AC-2 Met)**
* **ASSERT** (count(un-localized_images) / 60) \< 0.10 **(AC-5 Met)**
* **ASSERT** (count(localized_images) / 60) > 0.95 **(AC-9 Met)**
* **Test_MRE (AC-10):**
* **Run:** After Test_Baseline_60 completes.
* **ASSERT** TOH.final_Mean_Reprojection_Error \< 1.0 **(AC-10 Met)**
* **Test_Performance (AC-7, AC-8):**
* **Run:** Execute on a 1500-image sequence on the minimum-spec RTX 2060.
* **Log:** Log timestamps for "Image In" -> "Initial Pose Out".
* **ASSERT** average_time \< 5.0s **(AC-7 Met)**
* **Log:** Log the output stream.
* **ASSERT** >80% of images receive *two* poses: an "Initial" and a "Refined" **(AC-8 Met)**
* **Test_Robustness (AC-3, AC-4):**
* **Run:** Execute Test_Outlier_350m.
* **ASSERT** System logs "Stage 2: Discarding Outlier" and the final trajectory error for Image_31 is \< 50m **(AC-3 Met)**.
* **Run:** Execute Test_Sharp_Turn_5pct.
* **ASSERT** System logs "Stage 3: Tracking Lost" and "Relocalization Succeeded," and the final trajectory is complete and accurate **(AC-4 Met)**.