add clarification to research methodology by including a step for solution comparison and user consultation

2026-06-22 19:51:12 +00:00 · 2026-03-17 18:43:57 +02:00
parent d764250f9a
commit b419e2c04a
35 changed files with 6030 additions and 0 deletions
@@ -0,0 +1,141 @@
+# Reasoning Chain
+
+## WP-1: Lens Undistortion
+
+### Fact Confirmation
+According to Fact #13, lens distortion correction is crucial for UAV photogrammetry with non-metric cameras. Distortion at image edges can be 5-20px for wide-angle lenses. The camera parameters (K matrix + distortion coefficients) are known.
+
+### Current State
+Draft05 mentions "rectify" in preprocessing step 2 but does not explicitly include undistortion using camera intrinsics (K, distortion coefficients). Feature matching operates on distorted images, introducing position errors especially at image edges.
+
+### Conclusion
+Add an explicit cv2.undistort() step after image loading and before downscaling. This corrects radial and tangential distortion across the entire image. The camera calibration matrix K and the distortion coefficients are provided as camera_params in the job request. Cost: ~5-10ms per image, negligible against the 5s budget.
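
To illustrate why the edges are hit hardest, here is a minimal sketch of the radial and tangential distortion model that cv2.undistort() inverts. The function names are hypothetical, and the intrinsics and coefficient in the usage note are made-up values, not the project's camera_params.

```python
import math

def distort_normalized(x, y, k1=0.0, k2=0.0, p1=0.0, p2=0.0, k3=0.0):
    """Apply radial + tangential distortion to normalized image coordinates.

    Coefficient order follows the common OpenCV convention (k1, k2, p1, p2, k3).
    """
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d

def pixel_shift(u, v, fx, fy, cx, cy, dist):
    """Distance in pixels between an ideal pixel and where distortion puts it."""
    x, y = (u - cx) / fx, (v - cy) / fy          # pixel -> normalized coords
    x_d, y_d = distort_normalized(x, y, *dist)   # apply the lens model
    return math.hypot(fx * x_d + cx - u, fy * y_d + cy - v)

# Hypothetical 4000x3000 camera with mild barrel distortion (assumed values).
FX = FY = 3000.0
CX, CY = 2000.0, 1500.0
DIST = (-0.01, 0.0, 0.0, 0.0, 0.0)
```

With these assumed values, pixel_shift(CX, CY, FX, FY, CX, CY, DIST) is zero at the principal point and grows toward the corners, which is why matching on distorted images is worst at the image edges.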
+
+### Confidence
+✅ High — well-established photogrammetry practice
+
+---
+
+## WP-2: Camera Tilt GSD Compensation
+
+### Fact Confirmation
+According to Fact #1, camera tilt of 18° produces >5% GSD error. During turns (10-30° bank angle), error ranges 1.5-15.5%. According to Fact #2, homography decomposition (already in the VO pipeline) extracts rotation matrix R from which tilt can be derived.
+
+### Current State
+Draft05 computes GSD assuming a perfectly nadir (straight-down) camera. The restrictions state the camera is "not autostabilized." During turns, the UAV banks, causing significant camera tilt. A GSD error of 10-15% during turns propagates to the VO displacement estimates and then to the position estimates.
+
+### Conclusion
+After homography decomposition in VO step 6, extract the tilt angle θ (the angle between the camera's optical axis and nadir) from the rotation matrix R. Apply the correction GSD_corrected = GSD_nadir / cos(θ). For the first frame in a segment there is no homography yet, so the tilt is unknown: assume straight flight and use GSD_nadir. The additional computation cost is negligible because R is already computed.
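
The correction can be sketched as follows. This assumes R expresses the camera orientation relative to a nadir-pointing reference frame, so that R[2][2] = cos(θ). The function names are hypothetical, and roll_matrix exists only to build test inputs.

```python
import math

def tilt_from_rotation(R):
    """Angle in radians between the optical axis and the reference z axis.

    Assumes R (3x3, row-major nested lists) is relative to a nadir-pointing
    reference, so the bottom-right element equals cos(tilt).
    """
    cos_tilt = max(-1.0, min(1.0, R[2][2]))  # clamp against rounding noise
    return math.acos(cos_tilt)

def corrected_gsd(gsd_nadir, R=None):
    """Return GSD_nadir / cos(tilt); fall back to GSD_nadir when R is unknown."""
    if R is None:  # first frame in a segment: no homography yet
        return gsd_nadir
    return gsd_nadir / math.cos(tilt_from_rotation(R))

def roll_matrix(angle_deg):
    """Rotation about the x axis, used here only to simulate a bank angle."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
```

For example, corrected_gsd(0.10, roll_matrix(18)) exceeds the nadir value by a little over 5%, matching the figure quoted from Fact #1.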
+
+### Confidence
+✅ High — mathematical relationship, data already available in pipeline
+
+---
+
+## WP-3: DINOv2 Aggregation
+
+### Fact Confirmation
+According to Fact #3, SALAD aggregation improves DINOv2 retrieval by +12.4pp R@1 on MSLS Challenge over GeM. According to Fact #5, GeM pooling itself is +20pp over VLAD-style average pooling.
+
+### Current State
+Draft05 uses "spatial average pooling" of DINOv2 patch tokens — the simplest and weakest aggregation method.
+
+### Reasoning
+Coarse retrieval quality directly impacts satellite matching success rate. If the correct tile isn't in top-5 retrieval results, fine matching cannot succeed regardless of LiteSAM quality. A 20pp improvement in retrieval (via GeM) is substantial and costs nothing. SALAD adds another +12pp but requires a trained adapter layer — reasonable future enhancement.
+
+### Conclusion
+Replace average pooling with GeM pooling as the immediate upgrade (one-line change, zero overhead). Document SALAD as a future enhancement if retrieval recall proves insufficient.
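
A minimal sketch of GeM pooling in plain Python, to show how it differs from average pooling. It assumes patch tokens arrive as a list of equal-length float vectors. The name gem_pool, the default p = 3 and the clamp value eps are illustrative choices, not values taken from draft05.

```python
def gem_pool(tokens, p=3.0, eps=1e-6):
    """Generalized-mean pooling over patch tokens.

    tokens: list of N vectors (lists of floats), one per patch.
    p = 1 reduces to average pooling; larger p weights strong activations more.
    Values are clamped to eps because the p-th root needs a positive base.
    """
    count = len(tokens)
    dims = len(tokens[0])
    pooled = []
    for d in range(dims):
        total = sum(max(token[d], eps) ** p for token in tokens)
        pooled.append((total / count) ** (1.0 / p))
    return pooled
```

In a tensor framework the same operation collapses to a single expression (clamp, raise to p, mean over patches, take the p-th root), which is what makes it a small change. The clamp discards negative activations, so it is worth checking how that interacts with DINOv2 patch tokens before relying on it.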
+
+### Confidence
+✅ High for GeM improvement; ⚠️ Medium for SALAD on UAV-satellite cross-view (not directly benchmarked)
+
+---
+
+## WP-4: GPU Scheduling
+
+### Fact Confirmation
+According to Fact #6, compute-bound models cannot run truly concurrently on a single GPU via CUDA streams. According to Fact #7, the recommended pattern is sequential GPU execution with async Python.
+
+### Current State
+Draft05 states "satellite matching for frame N overlaps with VO processing for frame N+1" — this implies true GPU-level parallelism which is not achievable.
+
+### Conclusion
+Clarify the pipeline model: GPU executes VO and satellite matching sequentially for each frame. Total GPU time per frame: ~450ms (VO ~200ms + satellite ~250ms). Well within 5s budget. The async benefit is in Python-level logic: while GPU processes satellite matching for frame N, the CPU can prepare frame N+1 data (image loading, preprocessing, GTSAM update). Satellite results for frame N are added to the factor graph when ready. The critical path per frame is ~200ms (VO only for position estimate); satellite correction is asynchronous at the application level, not GPU level.
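
The scheduling model can be sketched with asyncio. This is a simulation under stated assumptions: an asyncio.Lock stands in for the single GPU, asyncio.sleep stands in for inference, and the names run_pipeline and gpu_job are hypothetical. Durations are scaled down from the ~200ms and ~250ms estimates.

```python
import asyncio

async def run_pipeline(num_frames, vo_s=0.002, sat_s=0.0025):
    """Simulate sequential GPU use with application-level async overlap."""
    gpu = asyncio.Lock()   # only one GPU job may run at a time
    log = []               # completion order of (kind, frame) jobs
    active = 0             # jobs currently holding the GPU
    peak = 0               # highest value of `active` ever observed

    async def gpu_job(kind, frame, seconds):
        nonlocal active, peak
        async with gpu:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(seconds)  # stand-in for model inference
            active -= 1
            log.append((kind, frame))

    sat_tasks = []
    for frame in range(num_frames):
        # Critical path: the position estimate only waits for VO.
        await gpu_job("vo", frame, vo_s)
        # Satellite matching is queued and its result is consumed when ready.
        sat_tasks.append(asyncio.create_task(gpu_job("sat", frame, sat_s)))
    await asyncio.gather(*sat_tasks)
    return log, peak
```

Running asyncio.run(run_pipeline(4)) returns a peak of 1, meaning the two models never share the GPU, while the loop itself never blocks on a satellite result. CPU-side work such as image loading for the next frame would run in the gaps where the loop is awaiting the GPU.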
+
+### Confidence
+✅ High — official CUDA/PyTorch documentation
+
+---
+
+## WP-5-9: Security Dependency Updates
+
+### Fact Confirmation
+Facts #8-12 establish concrete CVEs with specific affected versions and fixes.
+
+### Current State
+Draft05 pins PyTorch ≥2.10.0 and Pillow ≥11.3.0. It uses python-jose for JWT, aiohttp for HTTP, and ONNX Runtime without version pinning.
+
+### Conclusion
+1. Replace python-jose with PyJWT ≥2.10.0 (maintained, and a near drop-in replacement for JWT encode/decode; call sites and exception handling still need review)
+2. Upgrade Pillow pin to ≥12.1.1 (CVE-2026-25990)
+3. Pin aiohttp ≥3.13.3 (7 CVEs)
+4. Pin h11 ≥0.16.0 (CVE-2025-43859, CVSS 9.1)
+5. Pin ONNX Runtime ≥1.24.1 (path traversal)
+6. Monitor the safetensors metadata RCE report (see WP-12)
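
One way to enforce the pins at startup is a small guard built on importlib.metadata. This is a sketch: the distribution names are assumptions about how the packages are installed (for example, a GPU build of ONNX Runtime ships under a different name), and the version parser is deliberately simplistic and ignores pre-release suffixes.

```python
from importlib import metadata

# Minimum versions from the list above; distribution names are assumed.
MIN_VERSIONS = {
    "PyJWT": "2.10.0",
    "pillow": "12.1.1",
    "aiohttp": "3.13.3",
    "h11": "0.16.0",
    "onnxruntime": "1.24.1",
}

def version_tuple(text):
    """Parse leading numeric components, e.g. '3.13.3' -> (3, 13, 3)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Numeric comparison, padding with zeros so '2.10' equals '2.10.0'."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def find_violations(minimums=MIN_VERSIONS):
    """Return a human-readable problem for each missing or outdated package."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (need >= {minimum})")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{name}: {installed} < {minimum}")
    return problems
```

The comparison is numeric rather than lexical, so 3.9.5 is correctly treated as older than 3.13.3. A production check would more likely rely on the pins in the dependency file plus a vulnerability scanner; this guard only catches an environment that has drifted.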
+
+### Confidence
+✅ High — all from NVD/official advisories
+
+---
+
+## WP-10: ENU vs UTM for Long Flights
+
+### Fact Confirmation
+According to Fact #14, ENU approximation is accurate within 4km. Beyond 4km, errors become significant. At 10km: ~0.5m error; at 50km: ~12.5m.
+
+### Current State
+Draft05 uses ENU centered on starting GPS. UAV flights can cover 30-50km+ (3000 photos at 100m spacing = 300km theoretical max).
+
+### Reasoning
+300km is well beyond ENU's 4km accuracy range. Even typical flights (500-1500 photos at 100m = 50-150km) far exceed it. UTM projection is accurate to <1m within a 360km-wide zone, which covers any realistic flight. pyproj (already mentioned in draft05 for WGS84↔ENU) supports UTM natively.
+
+### Conclusion
+Replace ENU with UTM coordinates. Use pyproj to auto-select UTM zone from starting GPS. All internal positions in UTM meters. Convert to WGS84 for output. Factor graph operates in UTM — same math as ENU, just different projection. No re-centering needed.
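
Zone selection from the starting GPS fix can be sketched with the standard 6-degree zone formula. The function names are hypothetical, and the sketch ignores the special zone exceptions around Norway and Svalbard.

```python
def utm_zone(lon_deg):
    """UTM zone number (1-60) for a longitude in degrees, east positive."""
    zone = int((lon_deg + 180.0) // 6.0) + 1
    return min(max(zone, 1), 60)  # longitude 180 would otherwise give 61

def utm_epsg(lat_deg, lon_deg):
    """EPSG code of the WGS84 / UTM zone containing the given point."""
    base = 32600 if lat_deg >= 0 else 32700  # northern vs southern hemisphere
    return base + utm_zone(lon_deg)
```

With pyproj, the resulting code could then be passed to Transformer.from_crs("EPSG:4326", f"EPSG:{code}", always_xy=True) to convert between WGS84 and UTM meters. A flight that crosses a zone boundary would keep using the zone chosen at the start, at the cost of slightly larger projection distortion.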
+
+### Confidence
+✅ High — well-established geodesy
+
+---
+
+## WP-11: Memory Management
+
+### Fact Confirmation
+According to Fact #15, visual SLAM systems use rolling windows for feature descriptors, keeping only recent frames in active memory.
+
+### Current State
+Draft05 doesn't specify when SuperPoint features are freed. For 3000 images, keeping all features would use ~6GB RAM (2000 keypoints × 256 dims × 4 bytes × 3000 = 6.1GB).
+
+### Reasoning
+Only consecutive frame pairs need SuperPoint features for VO. After matching frame N with frame N-1, frame N-1's features are no longer needed. Factor graph stores only Pose2 variables (~24 bytes each), not features. Satellite matching uses DINOv2 + LiteSAM (separate features, not cached per frame).
+
+### Conclusion
+Explicitly specify: after VO matching between frame N and N-1, discard frame N-1's SuperPoint features. Keep only the current frame's features for the next iteration. Feature memory stays constant at ~2MB per retained frame (~4MB at peak, while a pair is being matched) regardless of flight length. Document the total memory budget per component.
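
The arithmetic and the rolling window can be sketched as follows. The class name FeatureWindow is hypothetical, and the feature payload is treated as an opaque object.

```python
from collections import deque

KEYPOINTS = 2000
DESCRIPTOR_DIMS = 256
BYTES_PER_FLOAT32 = 4

def descriptor_bytes(num_frames):
    """Bytes needed to hold SuperPoint descriptors for num_frames frames."""
    return KEYPOINTS * DESCRIPTOR_DIMS * BYTES_PER_FLOAT32 * num_frames

class FeatureWindow:
    """Holds features for at most two frames: the previous and the current."""

    def __init__(self):
        # maxlen=2 makes the deque drop the oldest entry automatically.
        self._frames = deque(maxlen=2)

    def push(self, frame_id, features):
        """Add a frame; return (previous, current) once a pair is available."""
        self._frames.append((frame_id, features))
        if len(self._frames) == 2:
            return self._frames[0], self._frames[1]
        return None

    def held_frame_ids(self):
        return [frame_id for frame_id, _ in self._frames]
```

descriptor_bytes(3000) gives 6,144,000,000 bytes, the ~6.1GB figure above, while the window never holds more than descriptor_bytes(2), about 4MB.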
+
+### Confidence
+✅ High — standard SLAM practice
+
+---
+
+## WP-12: safetensors Security
+
+### Fact Confirmation
+According to Fact #16, safetensors metadata RCE is under review. Polyglot and header-bomb attacks are known vectors.
+
+### Current State
+Draft05 recommends safetensors format for DINOv2 but doesn't validate header size.
+
+### Conclusion
+Add safetensors header size validation: reject files whose declared header exceeds 10MB (a normal DINOv2 header is orders of magnitude smaller). This mitigates header-bomb DoS and reduces the attack surface for metadata RCE.
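
A minimal sketch of the check, relying on the safetensors layout in which the first 8 bytes are a little-endian unsigned 64-bit header length followed by a JSON header. The function name and the extra consistency check against the file size are illustrative additions.

```python
import os
import struct

MAX_HEADER_BYTES = 10 * 1024 * 1024  # 10MB limit from the conclusion above

def read_header_size(path, max_header_bytes=MAX_HEADER_BYTES):
    """Return the declared header size, or raise ValueError if it looks unsafe.

    Only the 8-byte length prefix is read, so a hostile file cannot force a
    large allocation before the check runs.
    """
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        prefix = f.read(8)
    if len(prefix) != 8:
        raise ValueError("file too short to contain a header length")
    (header_size,) = struct.unpack("<Q", prefix)  # little-endian uint64
    if header_size > max_header_bytes:
        raise ValueError(f"header of {header_size} bytes exceeds limit")
    if 8 + header_size > file_size:
        raise ValueError("declared header is larger than the file")
    return header_size
```

The check would run before handing the path to the safetensors loader. It addresses oversized headers only and does not validate the metadata contents themselves.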
+
+### Confidence
+⚠️ Medium — vulnerability is under review, mitigation is precautionary