Read carefully about the problem: We have a lot of images taken from a wing-type UAV using a camera with at least Full HD resolution. Resolution of each photo could be up to 6200×4100 for a whole flight; for other flights it could be Full HD. Photos are taken and named consecutively, within 100 meters of each other. We know only the starting GPS coordinates. We need to determine the GPS of the center of each image, and also the coordinates of the center of any object in these photos. We can use an external satellite provider for ground checks on the existing photos.

The system has the following restrictions and conditions:

- Photos are taken only by airplane-type UAVs.
- Photos are taken by a camera pointing downwards and fixed, but not auto-stabilized.
- The flying range is restricted to the eastern and southern parts of Ukraine (to the left of the Dnipro River).
- The image resolution could be from Full HD to 6252×4168. Camera parameters are known: focal length, sensor width, resolution, and so on.
- Altitude is predefined and no more than 1 km. The height of the terrain can be neglected.
- There is NO data from an IMU.
- Flights are done mostly in sunny weather.
- We can use satellite providers, but we are currently limited to Google Maps, which could be outdated for some regions.
- The number of photos could be up to 3000, usually in the 500-1500 range.
- During the flight, UAVs can make sharp turns, so the next photo may be absolutely different from the previous one (no shared objects), but this is the exception rather than the rule.
- Processing is done on a stationary computer or laptop with an NVIDIA GPU, at least an RTX 2060, preferably a 3070. (For the UAV solution a Jetson Orin Nano would be used, but that is out of scope.)
Output of the system should address the following acceptance criteria:

- AC-1: The system should find the GPS of the centers of 80% of the photos from the flight within an error of no more than 50 meters compared to the real GPS.
- AC-2: The system should find the GPS of the centers of 60% of the photos from the flight within an error of no more than 20 meters compared to the real GPS.
- AC-3: The system should correctly continue working even in the presence of an outlier photo up to 350 meters off between two consecutive pictures en route. This could happen due to tilt of the plane.
- AC-4: The system should correctly continue working even during sharp turns, where the next photo does not overlap at all, or overlaps by less than 5%. The next photo will be within 150 m of drift and at an angle of less than 50°.
- AC-5: The number of outliers during the satellite-provider ground check should be less than 10%.
- AC-6: If the system is absolutely incapable of determining the GPS of the next, second-next, and third-next images by any means (on up to 20% of the route), it should ask the user for input for the next image, so that the user can specify the location.
- AC-7: Less than 5 seconds for processing one image.
- AC-8: Results of image processing should appear to the user immediately, so that the user does not have to wait for the whole route to complete in order to analyze first results. The system may also refine already-calculated results and send the refined results to the user again.
- AC-9: Image Registration Rate > 95%: the system can find enough matching features to confidently calculate the camera's 6-DoF pose (position and orientation) and "stitch" that image into the final trajectory.
- AC-10: Mean Reprojection Error (MRE) < 1.0 pixels: the distance, in pixels, between the original pixel location of an object and its re-projected pixel location.
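The meter-level criteria AC-1 and AC-2 imply a geodesic error metric between estimated and ground-truth coordinates. A minimal sketch of the standard Haversine distance (pure Python, spherical-Earth approximation; the function name is illustrative) that a validation script could use:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two lat/lon points,
    using the spherical Haversine approximation (R = 6371 km)."""
    R = 6371000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))
```

An image then counts toward AC-1 if `haversine_m(est_lat, est_lon, gt_lat, gt_lon) < 50`, and toward AC-2 if the error is below 20.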
Here is a solution draft:

**GEORTEX-R: A Geospatial-Temporal Robust Extraction System for IMU-Denied UAV Geolocalization**

## **1.0 GEORTEX-R: System Architecture and Data Flow**

The GEORTEX-R system is an asynchronous, three-component software solution designed for deployment on an NVIDIA RTX 2060+ GPU. It is architected from the ground up to handle the specific, demonstrated challenges of IMU-denied localization in *non-planar terrain* (as seen in Images 1-9) and *temporally-divergent* (outdated) reference maps (AC-5). The system's core design principle is the *decoupling of unscaled relative motion from global metric scale*. The front-end estimates high-frequency, robust, but *unscaled* motion. The back-end asynchronously provides sparse, high-confidence *metric* and *geospatial* anchors. The central hub fuses these two data streams into a single, globally-optimized, metric-scale trajectory.

### **1.1 Inputs**

1. **Image Sequence:** Consecutively named images (Full HD to 6252x4168).
2. **Start Coordinate (Image 0):** A single, absolute GPS coordinate (Latitude, Longitude) for the first image.
3. **Camera Intrinsics ($K$):** A pre-calibrated camera intrinsic matrix.
4. **Altitude Prior ($H_{prior}$):** The *approximate* predefined metric altitude (e.g., 900 meters). This is used as a *prior* (a hint) for optimization, *not* a hard constraint.
5. **Geospatial API Access:** Credentials for an on-demand satellite and DEM provider (e.g., Copernicus, EOSDA).

### **1.2 Streaming Outputs**

1. **Initial Pose (`Pose_N_Est`):** An *unscaled* pose estimate, sent immediately to the UI for real-time visualization of the UAV's *path shape* (AC-7, AC-8).
2. **Refined Pose (`Pose_N_Refined`) [Asynchronous]:** A globally-optimized, *metric-scale* 7-DoF pose (X, Y, Z, Qx, Qy, Qz, Qw) and its corresponding [Lat, Lon, Alt] coordinate. This is sent to the user whenever the Trajectory Optimization Hub re-converges, updating all past poses (AC-1, AC-2, AC-8).
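The two streaming outputs can be sketched as message types; a minimal illustration (field names and types are assumptions, not a fixed wire format):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PoseEstimate:
    """Unscaled initial pose streamed to the UI as soon as V-SLAM tracks the frame."""
    image_id: int
    xyz: Tuple[float, float, float]          # position in the unscaled SLAM frame
    quat: Tuple[float, float, float, float]  # orientation quaternion (Qx, Qy, Qz, Qw)

@dataclass
class RefinedPose(PoseEstimate):
    """Metric-scale refinement emitted after the pose graph re-converges."""
    scale: float = 1.0                                        # global Sim(3) scale factor
    lat_lon_alt: Optional[Tuple[float, float, float]] = None  # geodetic coordinate, once known
```

A `RefinedPose` can be re-emitted for an already-streamed `image_id`, which is how past poses get updated.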
### **1.3 Component Interaction and Data Flow**

The system consists of an ingestion module feeding three parallel-processing components:

1. **Image Ingestion & Pre-processing:** This module receives the new `Image_N` (up to 6.2K) and creates two copies:
   * `Image_N_LR` (Low-Resolution, e.g., 1536x1024): dispatched *immediately* to the V-SLAM Front-End for real-time processing.
   * `Image_N_HR` (High-Resolution, 6.2K): stored for asynchronous use by the Geospatial Anchoring Back-End (GAB).
2. **V-SLAM Front-End (High-Frequency Thread):** This component's sole task is high-speed, *unscaled* relative pose estimation. It tracks `Image_N_LR` against a *local map of keyframes*, performs local bundle adjustment to minimize drift 12, and maintains a co-visibility graph of all keyframes. It sends `Relative_Unscaled_Pose` estimates to the Trajectory Optimization Hub (TOH).
3. **Geospatial Anchoring Back-End (GAB) (Low-Frequency, Asynchronous Thread):** This is the system's "anchor." When triggered by the TOH, it fetches *on-demand* geospatial data (satellite imagery and DEMs) from an external API.3 It then performs a robust *hybrid semantic-visual* search 5 to find an *absolute, metric, global pose* for a given keyframe, robust to outdated maps (AC-5) 5 and oblique views (AC-4).14 This `Absolute_Metric_Anchor` is sent to the TOH.
4. **Trajectory Optimization Hub (TOH) (Central Hub):** This component manages the complete flight trajectory as a **Sim(3) pose graph** (7-DoF). It continuously fuses two distinct data streams:
   * **On receiving `Relative_Unscaled_Pose` (T < 5s):** It appends this pose to the graph, calculates `Pose_N_Est`, and sends this *unscaled* initial result to the user (AC-7, AC-8 met).
   * **On receiving `Absolute_Metric_Anchor` (T > 5s):** This is the critical event. The anchor is added as a high-confidence *global metric constraint*. It creates "tension" in the graph, which the optimizer (Ceres Solver 15) resolves by finding the *single global scale factor* that best fits all V-SLAM and CVGL (cross-view geo-localization) measurements. The TOH then triggers a full graph re-optimization, "stretching" the entire trajectory to the correct metric scale, and sends the new `Pose_N_Refined` stream to the user for all affected poses (AC-1, AC-2, AC-8 refinement met).

## **2.0 Core Component: The High-Frequency V-SLAM Front-End**

This component's sole task is to robustly and accurately compute the *unscaled* 6-DoF relative motion of the UAV and build a geometrically-consistent map of keyframes. It is explicitly designed to be more robust to drift than simple frame-to-frame odometry.

### **2.1 Rationale: Keyframe-Based Monocular SLAM**

The choice of a keyframe-based V-SLAM front-end over frame-to-frame VO is deliberate and critical for system robustness.

* **Drift Mitigation:** Frame-to-frame VO is "prone to drift accumulation due to errors introduced by each frame-to-frame motion estimation".13 A single poor match permanently corrupts all future poses.
* **Robustness:** A keyframe-based system tracks new images against a *local map* of *multiple* previous keyframes, not just `Image_N-1`. This provides resilience to transient failures (e.g., motion blur, occlusion).
* **Optimization:** This architecture enables "local bundle adjustment" 12, a process where a sliding window of recent keyframes is continuously re-optimized, actively minimizing error and drift *before* it can accumulate.
* **Relocalization:** This architecture possesses *innate relocalization capabilities* (see Section 6.3), which is the correct, robust solution to the "sharp turn" (AC-4) requirement.

### **2.2 Feature Matching Sub-System**

The success of the V-SLAM front-end depends entirely on high-quality feature matches, especially in the sparse, low-texture agricultural terrain seen in the provided images (e.g., Image 6, Image 7).
The system requires a matcher that is robust (for sparse textures 17) and extremely fast (for AC-7). The selected approach is **SuperPoint + LightGlue**.

* **SuperPoint:** A state-of-the-art (SOTA) feature detector proven to find robust, repeatable keypoints in challenging, low-texture conditions.17
* **LightGlue:** A highly optimized GNN-based matcher, the successor to SuperGlue.19

The key advantage of LightGlue 19 over SuperGlue 20 is its *adaptive nature*. The query states that sharp turns (AC-4) are "rather an exception." This implies ~95% of image pairs are "easy" (high-overlap, straight flight) and ~5% are "hard" (low-overlap, turns). SuperGlue uses a fixed-depth GNN, spending the *same* large amount of compute on an "easy" pair as on a "hard" one. LightGlue is *adaptive*:19 for an "easy" pair it can exit its GNN early, returning a high-confidence match in a fraction of the time. This saves enormous computational budget on the 95% of "easy" frames, ensuring the system *always* meets the <5s budget (AC-7) and reserving that compute for the GAB.

#### **Table 1: Analysis of State-of-the-Art Feature Matchers (For V-SLAM Front-End)**

| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **SuperPoint + SuperGlue** 20 | SOTA robustness in low-texture, high-blur conditions. GNN reasons about 3D scene context. Proven in real-time SLAM systems. | Computationally heavy (fixed-depth GNN). Slower than LightGlue.19 | NVIDIA GPU (RTX 2060+). PyTorch or TensorRT.21 | **Good.** A solid baseline choice. Meets robustness needs but will heavily tax the <5s time budget (AC-7). |
| **SuperPoint + LightGlue** 17 | **Adaptive depth:** faster on "easy" pairs, more accurate on "hard" pairs.19 **Faster and lighter:** outperforms SuperGlue on speed and accuracy. The SOTA "in practice" choice for large-scale matching.17 | Newer, but rapidly being adopted and proven.21 | NVIDIA GPU (RTX 2060+). PyTorch or TensorRT.22 | **Excellent (Selected).** The adaptive nature is *perfect* for this problem: it saves compute on the ~95% of easy (straight) frames, maximizing the ability to meet AC-7. |

## **3.0 Core Component: The Geospatial Anchoring Back-End (GAB)**

This component is the system's "anchor to reality." It runs asynchronously to provide the *absolute, metric-scale* constraints needed to solve the trajectory. It is an *on-demand* system that solves three distinct "domain gaps": the hardware/scale gap, the temporal gap, and the viewpoint gap.

### **3.1 On-Demand Geospatial Data Retrieval**

A "pre-computed database" for all of Eastern Ukraine is operationally unfeasible on laptop-grade hardware.1 This design is replaced by an on-demand, API-driven workflow.

* **Mechanism:** When the TOH requests a global anchor, the GAB receives a *coarse* [Lat, Lon] estimate. The GAB then performs API calls to a geospatial data provider (e.g., EOSDA 3, Copernicus 8).
* **Dual-Retrieval:** The API query requests *two* distinct products for the specified Area of Interest (AOI):
  1. **Visual Tile:** A high-resolution (e.g., 30-50cm) satellite ortho-image.26
  2. **Terrain Tile:** The corresponding **Digital Elevation Model (DEM)**, such as the Copernicus GLO-30 (30m resolution) or SRTM (30m).7

This "Dual-Retrieval" mechanism is the central, enabling synergy of the architecture. The **Visual Tile** is used by the CVGL stage (Section 3.2) to find the *geospatial pose*. The **DEM Tile** is used by the output module (Section 7.1) to perform high-accuracy **Ray-DEM Intersection**, solving the final output-accuracy problem.
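The AOI for the Dual-Retrieval query can be derived from the coarse [Lat, Lon] estimate with a simple equirectangular approximation. A hedged sketch (the function name and the 5 km half-size are illustrative, not a provider API):

```python
from math import cos, radians
from typing import Tuple

M_PER_DEG_LAT = 111_320.0  # approximate meters per degree of latitude

def aoi_bbox(lat: float, lon: float, half_size_m: float = 5_000.0) -> Tuple[float, float, float, float]:
    """Return (min_lat, min_lon, max_lat, max_lon) of a square AOI centered
    on the coarse estimate, suitable for a tile/DEM API request."""
    dlat = half_size_m / M_PER_DEG_LAT
    # a degree of longitude shrinks with latitude, so widen the lon span accordingly
    dlon = half_size_m / (M_PER_DEG_LAT * cos(radians(lat)))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)
```

The returned bounding box is what the GAB would pass to the satellite and DEM endpoints; at ~48°N the longitude span comes out noticeably wider (in degrees) than the latitude span.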
### **3.2 Hybrid Semantic-Visual Localization**

The "temporal gap" (evidenced by burn scars in Images 1-9) and "outdated maps" (AC-5) make a purely visual CVGL system unreliable.5 The GAB solves this with a robust, two-stage *hybrid* matching pipeline.

1. **Stage 1: Coarse Visual Retrieval (Siamese CNN).** A lightweight Siamese CNN 14 finds the *approximate* location of `Image_N_LR` *within* the large, newly-fetched satellite tile. This acts as a "candidate generator."
2. **Stage 2: Fine-Grained Semantic-Visual Fusion.** For the top candidates, the GAB performs a *dual-channel alignment*:
   * **Visual Channel (Unreliable):** It runs SuperPoint+LightGlue on high-resolution *patches* (from `Image_N_HR`) against the satellite tile. This match may be *weak* due to temporal gaps.5
   * **Semantic Channel (Reliable):** It extracts *temporally-invariant* semantic features (e.g., road vectors, field boundaries, tree-cluster polygons, lake shorelines) from *both* the UAV image (using a segmentation model) and the satellite/OpenStreetMap data.5
   * **Fusion:** A RANSAC-based optimizer finds the 6-DoF pose that *best aligns* this *hybrid* set of features.

This hybrid approach is robust to the exact failure mode seen in the images. When matching Image 3 (burn scars), the *visual* LightGlue match will be poor. However, the *semantic* features (the dirt road, the tree line) are *unchanged*. The optimizer will find a high-confidence pose by *trusting the semantic alignment* over the poor visual alignment, thereby succeeding despite the "outdated map" (AC-5).

### **3.3 Solution to Viewpoint Gap: Synthetic Oblique View Training**

This component is critical for handling "sharp turns" (AC-4), during which the camera *will* be oblique, not nadir.

* **Problem:** The GAB's Stage 1 Siamese CNN 14 will be matching an *oblique* UAV view to a *nadir* satellite tile.
This "viewpoint gap" will cause a match failure.14

* **Mechanism (Synthetic Data Generation):** The network must be trained for *viewpoint invariance*.28
  1. Using the on-demand DEMs (fetched in 3.1) and satellite tiles, the system can *synthetically render* the satellite imagery from *any* roll, pitch, and altitude.
  2. The Siamese network is trained on (`Nadir_Tile`, `Synthetic_Oblique_Tile`) pairs.14
* **Result:** This process teaches the network to match the *underlying ground features*, not the *perspective distortion*. It ensures the GAB can relocalize the UAV *precisely* when it is needed most: during a sharp, banking turn (AC-4) when VO tracking has been lost.

## **4.0 Core Component: The Trajectory Optimization Hub (TOH)**

This component is the system's central "brain." It runs continuously, fusing all measurements (high-frequency, unscaled V-SLAM poses and low-frequency, metric-scale GAB anchors) into a single, globally consistent trajectory.

### **4.1 Incremental Sim(3) Pose-Graph Optimization**

A "planar ground," scale-from-altitude odometry is deliberately avoided: the system must *discover* the global scale, not *assume* it.

* **Selected Strategy:** An incremental pose-graph optimizer using **Ceres Solver**.15
* **The Sim(3) Insight:** The V-SLAM front-end produces *unscaled* relative poses in $SE(3)$. The GAB produces *metric-scale absolute* poses in $SE(3)$. These cannot be directly combined. The graph must be optimized in **Sim(3)** (7-DoF), which adds a *single global scale factor $s$* as an optimizable variable.
* **Mechanism (Ceres Solver):**
  1. **Nodes:** Each keyframe pose, parameterized by translation $(X, Y, Z)$, rotation quaternion $(Q_x, Q_y, Q_z, Q_w)$, and scale $s$ (7 degrees of freedom).
  2. **Edge 1 (V-SLAM):** A relative pose constraint between `Keyframe_i` and `Keyframe_j`. The error is computed in Sim(3).
  3. **Edge 2 (GAB):** An *absolute* pose constraint on `Keyframe_k`. This constraint *fixes* `Keyframe_k`'s pose to the *metric* GPS coordinate and *fixes its scale $s$ to 1.0*.
* **Bootstrapping Scale:** The TOH graph "bootstraps" the scale.32 The GAB's $s=1.0$ anchor creates "tension" in the graph. The Ceres optimizer 15 resolves this tension by finding the *one* global scale $s$ for all V-SLAM nodes that minimizes the total error, effectively "stretching" the entire unscaled trajectory to fit the metric anchors. This is robust to *any* terrain.34

#### **Table 2: Analysis of Trajectory Optimization Strategies**

| Approach (Tools/Library) | Advantages | Limitations | Requirements | Fitness for Problem Component |
| :---- | :---- | :---- | :---- | :---- |
| **Incremental SLAM (Pose-Graph Optimization)** (Ceres Solver 15, g2o 35, GTSAM) | **Real-time / online:** provides immediate pose estimates (AC-7). **Supports refinement:** explicitly designed to refine past poses when new "loop closure" (GAB) data arrives (AC-8).13 **Robust:** can handle outliers via robust kernels.15 | Initial estimate is *unscaled* until a GAB anchor arrives. Can drift *if* not anchored (though V-SLAM minimizes this). | A graph optimization library (Ceres). A robust cost function. | **Excellent (Selected).** The only architecture that satisfies all user requirements for real-time streaming and asynchronous refinement. |
| **Batch Structure from Motion (Global Bundle Adjustment)** (COLMAP, Agisoft Metashape) | **Globally optimal accuracy:** produces the most accurate possible 3D reconstruction and trajectory. | **Offline:** cannot run in real-time or stream results. High computational cost (minutes to hours). Fails AC-7 and AC-8 completely. | All images must be available before processing starts. High RAM and CPU. | **Good (as an *optional* post-processing step).** Unsuitable as the primary online system, but could be offered as an optional, high-accuracy "Finalize Trajectory" batch process. |

### **4.2 Automatic Outlier Rejection (AC-3, AC-5)**

The system must handle 350m outliers (AC-3) and <10% bad GAB matches (AC-5).
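Robust kernels achieve this tolerance by capping an outlier's influence on the optimizer. A minimal, self-contained sketch of the Huber IRLS weight (the 5 m inlier threshold is an arbitrary assumption for illustration):

```python
def huber_weight(residual_m: float, delta: float = 5.0) -> float:
    """IRLS weight implied by the Huber loss: 1 inside the inlier band,
    delta/|r| beyond it, so huge residuals lose almost all influence."""
    r = abs(residual_m)
    return 1.0 if r <= delta else delta / r

# A nominal 2 m odometry residual keeps full influence on the solution...
w_inlier = huber_weight(2.0)
# ...while a 350 m outlier (AC-3) is down-weighted by a factor of 70.
w_outlier = huber_weight(350.0)
```

This is the behavior that lets the optimizer acknowledge the 350 m measurement without letting it drag the trajectory.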
* **Mechanism (Robust Loss Functions):** A standard least-squares optimizer (like Ceres 15) would be catastrophically corrupted by a 350m error. The solution is to wrap *all* constraints in a **robust loss function (e.g., HuberLoss, CauchyLoss)**.15
* **Result:** A robust loss function mathematically *down-weights* the influence of constraints with large errors. When it "sees" the 350m error (AC-3), it acknowledges the measurement but *refuses* to pull the entire 3000-image trajectory to fit this one "insane" data point. It automatically and gracefully ignores the outlier, optimizing over the remaining "sane" measurements. This is the modern, robust solution to AC-3 and AC-5.

## **5.0 High-Performance Compute & Deployment**

The system must run on an RTX 2060 (AC-7) and process 6.2K images. These are opposing constraints.

### **5.1 Multi-Scale, Patch-Based Processing Pipeline**

Running deep-learning models (SuperPoint, LightGlue) on a full 6.2K (26-megapixel) image would cause a CUDA out-of-memory (OOM) error and be impossibly slow.

* **Mechanism (Coarse-to-Fine):**
  1. **For V-SLAM (real-time, <5s):** The V-SLAM front-end (Section 2.0) runs *only* on the `Image_N_LR` (e.g., 1536x1024) copy. This is fast enough to meet the AC-7 budget.
  2. **For GAB (high-accuracy, async):** The GAB (Section 3.0) uses the full-resolution `Image_N_HR` *selectively* to meet the 20m accuracy target (AC-2):
     * It first runs its coarse Siamese CNN 27 on the `Image_N_LR`.
     * It then runs the SuperPoint detector on the *full 6.2K* image to find the *most confident* feature keypoints.
     * It then extracts small 256x256 *patches* from the *full-resolution* image, centered on these keypoints.
     * It matches *these small, full-resolution patches* against the high-resolution satellite tile.
* **Result:** This hybrid method provides the fine-grained matching accuracy of the 6.2K image (needed for AC-2) without the catastrophic OOM errors or performance penalties.
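The patch-extraction step of this coarse-to-fine pipeline can be sketched in a few lines of NumPy (array shapes, the top-k selection, and the function name are illustrative assumptions):

```python
import numpy as np

def extract_patches(image: np.ndarray, keypoints: np.ndarray,
                    scores: np.ndarray, k: int = 8, size: int = 256) -> np.ndarray:
    """Crop size x size patches around the k most confident keypoints,
    clamping each window so it never leaves the image."""
    h, w = image.shape[:2]
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest-scoring keypoints
    half = size // 2
    patches = []
    for x, y in keypoints[top]:
        x0 = int(np.clip(x - half, 0, w - size))  # keep the crop inside the image
        y0 = int(np.clip(y - half, 0, h - size))
        patches.append(image[y0:y0 + size, x0:x0 + size])
    return np.stack(patches)
```

Each returned patch would then be matched independently against the corresponding region of the satellite tile, so only a few hundred kilobytes of the 26-megapixel image ever reach the GPU at once.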
### **5.2 Mandatory Deployment: NVIDIA TensorRT Acceleration**

PyTorch is a research framework; its out-of-the-box inference speed is insufficient for production.

* **Requirement:** The key neural networks (SuperPoint, LightGlue, Siamese CNN) *must* be converted from PyTorch into highly-optimized **NVIDIA TensorRT engines**.
* **Research Validation:** 23 demonstrates this process for LightGlue, achieving "2x-4x speed gains over compiled PyTorch." 22 and 21 provide open-source repositories for SuperPoint+LightGlue conversion to ONNX and TensorRT.
* **Result:** This is not an "optional" optimization; it is a *mandatory* deployment step. The conversion (which applies layer fusion, graph optimization, and FP16 precision) is what makes the <5s (AC-7) performance achievable on the specified RTX 2060 hardware.36

## **6.0 System Robustness: Failure Mode Escalation Logic**

This logic defines the system's behavior during real-world failures, ensuring it meets criteria AC-3, AC-4, AC-6, and AC-9.

### **6.1 Stage 1: Normal Operation (Tracking)**

* **Condition:** The V-SLAM front-end (Section 2.0) is healthy.
* **Logic:**
  1. V-SLAM successfully tracks `Image_N_LR` against its local keyframe map.
  2. A new `Relative_Unscaled_Pose` is sent to the TOH.
  3. The TOH sends `Pose_N_Est` (unscaled) to the user (<5s).
  4. If `Image_N` is selected as a new keyframe, the GAB (Section 3.0) is *queued* to find an `Absolute_Metric_Anchor` for it, which will trigger a `Pose_N_Refined` update later.

### **6.2 Stage 2: Transient VO Failure (Outlier Rejection)**

* **Condition:** `Image_N` is unusable (e.g., severe blur, sun glare, a 350m outlier per AC-3).
* **Logic (Frame Skipping):**
  1. The V-SLAM front-end fails to track `Image_N_LR` against the local map.
  2. The system *discards* `Image_N`, marking it as a rejected outlier (AC-5).
  3. When `Image_N+1` arrives, the V-SLAM front-end attempts to track it against the *same* local keyframe map (from `Image_N-1`).
  4. **If successful:** Tracking resumes, `Image_N` is officially an outlier, and the system "correctly continues the work" (AC-3 met).
  5. **If it fails:** The system repeats for `Image_N+2`, `Image_N+3`. If this fails for ~5 consecutive frames, it escalates to Stage 3.

### **6.3 Stage 3: Persistent VO Failure (Relocalization)**

* **Condition:** Tracking is lost for multiple frames. This is the "sharp turn" or "low overlap" scenario (AC-4).
* **Logic (Keyframe-Based Relocalization):**
  1. The V-SLAM front-end declares "Tracking Lost."
  2. **Critically:** It does *not* create a "new map chunk."
  3. Instead, it enters **Relocalization Mode**. For every new `Image_N+k`, it extracts features (SuperPoint) and queries the *entire* existing database of past keyframes for a match.
* **Resolution:** The UAV completes its sharp turn. `Image_N+5` now has high overlap with `Image_N-10` (from *before* the turn).
  1. The relocalization query finds a strong match.
  2. The V-SLAM front-end computes the 6-DoF pose of `Image_N+5` relative to the *existing map*.
  3. Tracking *resumes* seamlessly, and the system "correctly continues the work" (AC-4 met). This is vastly more robust than map-merging logic.

### **6.4 Stage 4: Catastrophic Failure (User Intervention)**

* **Condition:** The system is in Stage 3 (Lost), and *additionally* the **GAB (Section 3.0) has failed** to find *any* global anchors for a prolonged period (e.g., 20% of the route). This is the "absolutely incapable" scenario (AC-6), e.g., heavy fog over completely featureless terrain.
* **Logic:**
  1. The system has an *unscaled* trajectory and no idea where it is in the world.
  2. The TOH raises the AC-6 flag.
* **Resolution (User-Aided Prior):**
  1. The UI prompts the user: "Tracking lost. Please provide a coarse location for the *current* image."
  2. The user clicks *one point* on a map.
  3. This [Lat, Lon] is *not* taken as ground truth. It is fed to the **GAB (Section 3.1)** as a *strong prior* for its on-demand API query.
  4. This narrows the GAB's search area from "all of Ukraine" to "a 5km radius."
This *guarantees* that the GAB's Dual-Retrieval (Section 3.1) will fetch the *correct* satellite and DEM tiles, allowing the Hybrid Matcher (Section 3.2) to find a high-confidence `Absolute_Metric_Anchor`, which in turn re-scales (Section 4.1) and relocalizes the entire trajectory.

## **7.0 Output Generation and Validation Strategy**

This section details how the final user-facing outputs are generated on non-planar terrain, and how the system's compliance with all 10 ACs will be validated.

### **7.1 High-Accuracy Object Geolocalization via Ray-DEM Intersection**

A simple ray-plane intersection is inaccurate for non-planar terrain,37 so a high-accuracy ray-tracing method is used instead. This is the correct method for geolocating an object on the *non-planar* terrain visible in Images 1-9.

* **Inputs:**
  1. The user clicks pixel coordinate $(u, v)$ on `Image_N`.
  2. The system retrieves the *final, refined, metric* 7-DoF pose $P = (R, T, s)$ for `Image_N` from the TOH.
  3. The system uses the known camera intrinsic matrix $K$.
  4. The system retrieves the specific **30m DEM tile** 8 that the GAB (Section 3.1) fetched for this region of the map. This DEM is treated as a 3D terrain mesh.
* **Algorithm (Ray-DEM Intersection):**
  1. **Un-project Pixel:** The 2D pixel $(u, v)$ is un-projected into a 3D ray *direction* vector $d_{cam}$ in the camera's local coordinate system: $d_{cam} = K^{-1} \cdot [u, v, 1]^T$.
  2. **Transform Ray:** This ray direction $d_{cam}$ and origin $(0, 0, 0)$ are transformed into the *global, metric* coordinate system using the pose $P$. This yields a ray originating at $T$ and traveling in direction $R \cdot d_{cam}$.
  3. **Intersect:** The system performs a numerical *ray-mesh intersection* 39 to find the 3D point $(X, Y, Z)$ where this global ray *intersects the 3D terrain mesh* of the DEM.
  4. **Result:** This 3D intersection point $(X, Y, Z)$ is the *metric* world coordinate of the object *on the actual terrain*.
  5. **Convert:** This $(X, Y, Z)$ world coordinate is converted to a [Latitude, Longitude, Altitude] GPS coordinate.

This method correctly accounts for terrain: a pixel aimed at the top of a hill will intersect the DEM at a high Z-value, while a pixel aimed at the ravine (Image 1) will intersect at a low Z-value. This is the *only* method among those considered that can reliably meet the 20m accuracy (AC-2) for object localization.

### **7.2 Rigorous Validation Methodology**

A comprehensive test plan is required. Its foundation is a **Ground-Truth Test Harness** built on the provided coordinates.csv.42

**Test Harness:**

1. **Ground-Truth Data:** The file coordinates.csv 42 provides ground-truth [Lat, Lon] for 60 images (e.g., AD000001.jpg...AD000060.jpg).
2. **Test Datasets:**
   * `Test_Baseline_60` 42: the 60 images and their coordinates.
   * `Test_Outlier_350m` (AC-3): `Test_Baseline_60` with a single, unrelated image inserted at frame 30.
   * `Test_Sharp_Turn_5pct` (AC-4): a sequence where frames 20-24 are manually deleted, simulating a <5% overlap jump.

**Test Cases:**

* **Test_Accuracy (AC-1, AC-2, AC-5, AC-9):**
  * **Run:** Execute GEORTEX-R on `Test_Baseline_60`, providing AD000001.jpg's coordinate (48.275292, 37.385220) as the Start Coordinate.42
  * **Script:** A validation script computes the Haversine distance error between the *system's refined GPS output* for each image (2-60) and the *ground-truth GPS* from coordinates.csv.
  * **ASSERT** (count(errors < 50m) / 60) >= 0.80 **(AC-1 met)**
  * **ASSERT** (count(errors < 20m) / 60) >= 0.60 **(AC-2 met)**
  * **ASSERT** (count(un-localized_images) / 60) < 0.10 **(AC-5 met)**
  * **ASSERT** (count(localized_images) / 60) > 0.95 **(AC-9 met)**
* **Test_MRE (AC-10):**
  * **Run:** After `Test_Baseline_60` completes.
  * **ASSERT** TOH.final_Mean_Reprojection_Error < 1.0 **(AC-10 met)**
* **Test_Performance (AC-7, AC-8):**
  * **Run:** Execute on a 1500-image sequence on the minimum-spec RTX 2060.
  * **Log:** Log timestamps for "Image In" -> "Initial Pose Out".
  * **ASSERT** average_time < 5.0s **(AC-7 met)**
  * **Log:** Log the output stream.
  * **ASSERT** >80% of images receive *two* poses: an "Initial" and a "Refined" **(AC-8 met)**
* **Test_Robustness (AC-3, AC-4):**
  * **Run:** Execute `Test_Outlier_350m`.
  * **ASSERT** The system logs "Stage 2: Discarding Outlier" and the final trajectory error for Image_31 is < 50m **(AC-3 met)**.
  * **Run:** Execute `Test_Sharp_Turn_5pct`.
  * **ASSERT** The system logs "Stage 3: Tracking Lost" and "Relocalization Succeeded," and the final trajectory is complete and accurate **(AC-4 met)**.

Identify all potential weak points and problems. Address them and find ways to solve them. Based on your findings, form a new solution draft in the same format. If a finding requires a complete reorganization of the flow or different components, state that. Put all the findings regarding what was weak and poor at the beginning of the report; there, list everything that was updated, replaced, or removed from the previous solution. Then form a new solution design without referencing the previous system. Remove Poor and Very Poor component choices from the component analysis tables, but keep the Good and Excellent ones. In the updated report, do not add "new" marks and do not compare to the previous solution draft; write the new solution as if from scratch.