Reasoning Chain
WP-1: Lens Undistortion
Fact Confirmation
According to Fact #13, lens distortion correction is crucial for UAV photogrammetry with non-metric cameras. Distortion at image edges can be 5-20px for wide-angle lenses. The camera parameters (K matrix + distortion coefficients) are known.
Current State
Draft05 mentions "rectify" in preprocessing step 2 but does not explicitly include undistortion using camera intrinsics (K, distortion coefficients). Feature matching operates on distorted images, introducing position errors especially at image edges.
Conclusion
Add explicit cv2.undistort() step after image loading, before downscaling. This corrects radial and tangential distortion across the entire image. Camera calibration matrix K and distortion coefficients are provided as camera_params in the job request. Cost: ~5-10ms per image — negligible vs 5s budget.
Confidence
✅ High — well-established photogrammetry practice
WP-2: Camera Tilt GSD Compensation
Fact Confirmation
According to Fact #1, camera tilt of 18° produces >5% GSD error. During turns (10-30° bank angle), error ranges 1.5-15.5%. According to Fact #2, homography decomposition (already in the VO pipeline) extracts rotation matrix R from which tilt can be derived.
Current State
Draft05 computes GSD assuming perfectly nadir (straight-down) camera. The restrictions state the camera is "not autostabilized." During turns, the UAV banks causing significant camera tilt. GSD error of 10-15% during turns propagates to VO displacement estimates and then to position estimates.
Conclusion
After homography decomposition in VO step 6, extract tilt angle θ from rotation matrix R. Apply correction: GSD_corrected = GSD_nadir / cos(θ). For the first frame in a segment (no homography yet), use GSD_nadir (tilt unknown, assume straight flight). Zero additional computation cost — the rotation matrix R is already computed.
Confidence
✅ High — mathematical relationship, data already available in pipeline
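A minimal sketch of the correction, assuming `R` is the rotation matrix recovered from homography decomposition in VO step 6 and that tilt is measured as the angle between the rotated optical axis and nadir, i.e. θ = arccos(R[2,2]):

```python
# Sketch: tilt-compensated GSD from the rotation matrix already
# produced by homography decomposition. Zero extra GPU cost.
import numpy as np

def gsd_with_tilt(gsd_nadir: float, R: np.ndarray) -> float:
    """Scale nadir GSD by 1/cos(theta), theta = arccos(R[2,2])."""
    cos_theta = float(np.clip(R[2, 2], -1.0, 1.0))
    if cos_theta <= 0.5:
        # >60 deg apparent tilt: decomposition likely degenerate,
        # fall back to the nadir assumption rather than blow up GSD
        return gsd_nadir
    return gsd_nadir / cos_theta
```

For the first frame in a segment, where no homography exists yet, the caller simply passes `R = np.eye(3)`, which reproduces the nadir GSD.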
WP-3: DINOv2 Aggregation
Fact Confirmation
According to Fact #3, SALAD aggregation improves DINOv2 retrieval by +12.4pp R@1 on MSLS Challenge over GeM. According to Fact #5, GeM pooling itself is +20pp over VLAD-style average pooling.
Current State
Draft05 uses "spatial average pooling" of DINOv2 patch tokens — the simplest and weakest aggregation method.
Reasoning
Coarse retrieval quality directly impacts satellite matching success rate. If the correct tile isn't in top-5 retrieval results, fine matching cannot succeed regardless of LiteSAM quality. A 20pp improvement in retrieval (via GeM) is substantial and costs nothing. SALAD adds another +12pp but requires a trained adapter layer — reasonable future enhancement.
Conclusion
Replace average pooling with GeM pooling as the immediate upgrade (one-line change, zero overhead). Document SALAD as a future enhancement if retrieval recall proves insufficient.
Confidence
✅ High for GeM improvement; ⚠️ Medium for SALAD on UAV-satellite cross-view (not directly benchmarked)
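The "one-line change" amounts to replacing the mean over patch tokens with a generalized mean. A framework-agnostic sketch (NumPy here; the real pipeline would do the same on the DINOv2 token tensor), with the common default p=3:

```python
# Sketch of GeM pooling over patch tokens (num_patches x dim).
# p=1 recovers plain average pooling; p=3 is the usual default.
import numpy as np

def gem_pool(tokens: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized-mean pooling: (mean(x^p))^(1/p) per dimension."""
    # Clamp to a small positive floor before the power, as in the
    # original GeM formulation, so fractional exponents stay defined.
    x = np.clip(tokens, eps, None)
    return np.mean(x ** p, axis=0) ** (1.0 / p)
```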
WP-4: GPU Scheduling
Fact Confirmation
According to Fact #6, compute-bound models cannot run truly concurrently on a single GPU via CUDA streams. According to Fact #7, recommended pattern is sequential GPU execution with async Python.
Current State
Draft05 states "satellite matching for frame N overlaps with VO processing for frame N+1" — this implies true GPU-level parallelism which is not achievable.
Conclusion
Clarify the pipeline model: GPU executes VO and satellite matching sequentially for each frame. Total GPU time per frame: ~450ms (VO ~200ms + satellite ~250ms). Well within 5s budget. The async benefit is in Python-level logic: while GPU processes satellite matching for frame N, the CPU can prepare frame N+1 data (image loading, preprocessing, GTSAM update). Satellite results for frame N are added to the factor graph when ready. The critical path per frame is ~200ms (VO only for position estimate); satellite correction is asynchronous at the application level, not GPU level.
Confidence
✅ High — official CUDA/PyTorch documentation
WP-5-9: Security Dependency Updates
Fact Confirmation
Facts #8-12 establish concrete CVEs with specific affected versions and fixes.
Current State
Draft05 pins PyTorch ≥2.10.0 and Pillow ≥11.3.0. It uses python-jose for JWT, aiohttp for HTTP, and ONNX Runtime without version pinning.
Conclusion
- Replace python-jose with PyJWT ≥2.10.0 (actively maintained, secure; covers the same JWT encode/decode needs with minor API changes)
- Upgrade Pillow pin to ≥12.1.1 (CVE-2026-25990)
- Pin aiohttp ≥3.13.3 (7 CVEs)
- Pin h11 ≥0.16.0 (CVE-2025-43859, CVSS 9.1)
- Pin ONNX Runtime ≥1.24.1 (path traversal)
- Monitor safetensors metadata RCE
Confidence
✅ High — all from NVD/official advisories
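Collected as a requirements fragment (version floors reflect the advisories in Facts #8-12; inline comments are annotations, not part of the pin syntax):

```text
# security-driven version floors (Facts #8-12)
pyjwt>=2.10.0        # replaces python-jose
pillow>=12.1.1       # CVE-2026-25990
aiohttp>=3.13.3      # 7 CVEs
h11>=0.16.0          # CVE-2025-43859, CVSS 9.1
onnxruntime>=1.24.1  # path traversal
```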
WP-10: ENU vs UTM for Long Flights
Fact Confirmation
According to Fact #14, ENU approximation is accurate within 4km. Beyond 4km, errors become significant. At 10km: ~0.5m error; at 50km: ~12.5m.
Current State
Draft05 uses ENU centered on starting GPS. UAV flights can cover 30-50km+ (3000 photos at 100m spacing = 300km theoretical max).
Reasoning
300km is well beyond ENU's 4km accuracy range, and even typical flights (500-1500 photos at 100m = 50-150km) far exceed it. UTM projection is accurate to <1m within its ~360km-wide zone, which covers any realistic flight, and pyproj (already mentioned in draft05 for WGS84↔ENU) supports UTM natively.
Conclusion
Replace ENU with UTM coordinates. Use pyproj to auto-select UTM zone from starting GPS. All internal positions in UTM meters. Convert to WGS84 for output. Factor graph operates in UTM — same math as ENU, just different projection. No re-centering needed.
Confidence
✅ High — well-established geodesy
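Zone auto-selection from the starting GPS fix reduces to the standard longitude formula; the resulting EPSG code (326xx north, 327xx south) is then handed to pyproj for the WGS84↔UTM transform. A minimal sketch (pure Python, ignoring the Norway/Svalbard zone exceptions, which do not change accuracy for positioning):

```python
# Sketch: pick the UTM EPSG code for the starting GPS coordinate.
def utm_epsg(lat: float, lon: float) -> int:
    """Return the EPSG code of the UTM zone containing (lat, lon)."""
    zone = int((lon + 180.0) // 6.0) + 1
    zone = min(zone, 60)  # lon == 180 exactly falls into zone 60
    return (32600 if lat >= 0 else 32700) + zone
```

With pyproj, the transformer would then be built once per job, e.g. from `"EPSG:4326"` to the returned code, and reused for every frame.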
WP-11: Memory Management
Fact Confirmation
According to Fact #15, visual SLAM systems use rolling windows for feature descriptors, keeping only recent frames in active memory.
Current State
Draft05 doesn't specify when SuperPoint features are freed. For 3000 images, keeping all features would use ~6GB RAM (2000 keypoints × 256 dims × 4 bytes × 3000 = 6.1GB).
Reasoning
Only consecutive frame pairs need SuperPoint features for VO. After matching frame N with frame N-1, frame N-1's features are no longer needed. Factor graph stores only Pose2 variables (~24 bytes each), not features. Satellite matching uses DINOv2 + LiteSAM (separate features, not cached per frame).
Conclusion
Explicitly specify: after VO matching between frame N and N-1, discard frame N-1's SuperPoint features. Keep only current frame's features for next iteration. Memory: constant ~2MB regardless of flight length. Document total memory budget per component.
Confidence
✅ High — standard SLAM practice
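The rolling one-frame window can be made explicit in the VO loop; `extract` and `match` stand in for the real SuperPoint and matcher calls:

```python
# Sketch: only the previous frame's SuperPoint features survive
# between iterations, so memory stays constant (~2MB) regardless
# of flight length.
def run_vo_loop(frames, extract, match):
    prev_feats = None
    poses = []
    for frame in frames:
        feats = extract(frame)
        if prev_feats is not None:
            poses.append(match(prev_feats, feats))
        prev_feats = feats  # frame N-1 features now unreferenced -> freed
    return poses
```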
WP-12: safetensors Security
Fact Confirmation
According to Fact #16, safetensors metadata RCE is under review. Polyglot and header-bomb attacks are known vectors.
Current State
Draft05 recommends safetensors format for DINOv2 but doesn't validate header size.
Conclusion
Add safetensors header size validation: reject files with header > 10MB (normal header is <1KB for DINOv2). This mitigates header-bomb DoS and reduces attack surface for metadata RCE.
Confidence
⚠️ Medium — vulnerability is under review, mitigation is precautionary
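The check is cheap because the safetensors format starts with an unsigned 64-bit little-endian integer giving the JSON header length, so the file can be rejected before any parsing. A precautionary sketch with the 10MB limit proposed above:

```python
# Sketch: validate the safetensors header length before loading.
# First 8 bytes = u64 little-endian JSON header size.
import struct

MAX_HEADER = 10 * 1024 * 1024  # 10 MB; DINOv2 headers are <1 KB

def check_safetensors_header(path: str) -> int:
    """Return the declared header size, or raise on suspicious files."""
    with open(path, "rb") as f:
        raw = f.read(8)
    if len(raw) != 8:
        raise ValueError("truncated safetensors file")
    (header_len,) = struct.unpack("<Q", raw)
    if header_len > MAX_HEADER:
        raise ValueError(f"suspicious safetensors header: {header_len} bytes")
    return header_len
```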