Update NetVLAD checkpoint paths and enhance .gitignore
ci/woodpecker/push/02-build-push Pipeline failed

- Changed paths in documentation and configuration files to reflect the new naming convention for the NetVLAD model, transitioning from `models/netvlad/netvlad.pt` to `models/net_vlad/net_vlad.pt`.
- Updated the `.gitignore` to include additional file types and directories related to input data and locally-generated evidence frames.
- Removed the old NetVLAD checkpoint file as part of the transition to the new naming scheme.

These changes ensure consistency across the project and improve the management of generated files.
Oleksandr Bezdieniezhnykh
2026-05-31 19:27:32 +03:00
parent 97f5f9793c
commit ba70381346
18 changed files with 4195 additions and 24 deletions
@@ -49,6 +49,14 @@ e2e/fixtures/sitl_replay/
_docs/00_problem/input_data/**/*.tlog
_docs/00_problem/input_data/**/*.mp4
_docs/00_problem/input_data/**/*.h264
_docs/00_problem/input_data/**/*.mkv
_docs/00_problem/input_data/**/*.zip
# Locally-generated evidence frames for extraction fixtures (large, regenerable)
_docs/00_problem/input_data/**/frames_src/
_docs/00_problem/input_data/**/frames_optA/
_docs/00_problem/input_data/**/frames_optB/
_docs/00_problem/input_data/**/frames_optC/
# Editor / OS noise
.idea/
File diff suppressed because it is too large
@@ -0,0 +1,167 @@
# Question Decomposition — Mode B (focused) — Video Extraction from GCS Recording
> Run date: 2026-05-29. Triggered by user question on
> `_docs/00_problem/input_data/10.05.2026/2026-05-09 16-10-54.mkv`.
> Active mode: **Mode B** (solution_draft01.md exists). Scope of this run is
> deliberately narrower than a full solution reassessment — it asks whether the
> existing solution can ingest a *new representative-data class* (operator-side
> GCS screen recordings of gimbaled multi-sensor balls) as replay fixtures, and
> what cleanup pipeline is required.
## Original question
> "I have `2026-05-09 16-10-54.mkv` but it's obscured by other elements. Is it
> possible to make out of it a proper video as from a nadir camera? What's
> possible options?"
## Research Output Class
**Technical-component selection** (per SKILL.md → Research Output Class table).
The deliverable will name specific tools (FFmpeg filters, deep video inpainting
models, mask-aware feature extractors) that will be implemented or operated
against. All technical-component gates apply (per-mode API verification, MVE,
fit matrix, Restrictions × Candidate-Mode sub-matrix).
## Active mode
| Aspect | Value |
|---|---|
| Skill mode | Mode B (Solution Assessment) |
| Existing draft | `_docs/01_solution/solution_draft01.md` (329 lines) |
| Scope of revision | Additive — propose a new test-fixture-prep component (does **not** alter runtime pipeline) |
| Output | `_docs/01_solution/solution_draft02.md` |
| Working dir | `_docs/00_research/_mode_b_2026-05-29_video_extraction/` |
## Question type
**Decision Support** — weigh trade-offs across multiple options for converting
an OSD-burned-in screen-recorded video into a clean nadir replay fixture.
## Novelty Sensitivity
**Medium**. Underlying tools (FFmpeg filters) are stable for >15 years.
Deep-learning video inpainting evolves rapidly (E2FGVI 2022 → ProPainter 2023 →
VideoPainter 2025 → VidPivot 2025); version annotations required.
## Project context grounding
From `_docs/00_problem/`:
- **Spec'd nav-camera (`restrictions.md`)**: ADTi 20MP 20L V1, APS-C, ~5472×3648
px, fixed-downward, no gimbal. The `flight_derkachi.mp4` representative
fixture is a Topotek KHP20S30 1/2.8" CMOS, 1920×1080, mechanically locked
nadir, OSD-off — already pre-cleaned.
- **The new MKV is a different class of input**: a screen capture of a Ground
Control Station UI displaying a Topotek/Viewpro multi-sensor gimbal feed,
1280×720 30 fps H.264, ~6 m 7 s, with three layers of overlay: (a) GCS UI
chrome (sidebars, minimap, status bar), (b) gimbal-burned-in OSD (attitude,
crosshair, FOV brackets, status text, IR PIP), (c) the underlying EO video.
- **Use-case (per user's selection)**: replay/test fixture for the runtime
C1/C2/C3/C4/C5 pipeline, analogous to `flight_derkachi.mp4`.
- **Constraint (per user's selection)**: only the recorded MKV is available;
cannot re-record with OSD off, cannot lock gimbal nadir, cannot pull RTSP
stream from the camera.
## Research subject boundary
| Dimension | Boundary |
|---|---|
| Population | Single MKV file + the *class* of similar future GCS screen recordings |
| Geography | Project's operational area (eastern/southern Ukraine) |
| Timeframe | Cleanup tooling for legacy recordings (no live-system requirement) |
| Operating context | Offline, developer workstation; output consumed by `tests/e2e/replay/` |
| Required interfaces | Input: `.mkv` (any container with H.264). Output: H.264 MP4 ingestable by `flight_derkachi.mp4`-style replay path |
| Non-functional envelope | Offline (no real-time constraint). Hardware: developer workstation (CPU+optional GPU). Output ≤ a few hundred MB per flight. |
## Project Constraint Matrix (relevant subset)
Extracted from `restrictions.md`, `acceptance_criteria.md`, and the Derkachi
fixture conventions:
| # | Constraint | Source | Binding for this run? |
|---|---|---|---|
| C1 | Replay fixtures must be ingestable by `tests/e2e/replay/test_az835_e2e_real_flight.py` (takes a `.mp4` + `.tlog` + calibration JSON) | `flight_derkachi/README.md` | **Yes** |
| C2 | Output must NOT have synthetic content fabricated by generative models (would invalidate VPR/matching evaluation — pipeline could anchor on hallucinated features instead of real terrain) | `coderule.mdc` "Real Results, Not Simulated Ones" + `meta-rule.mdc` | **Yes** |
| C3 | Output frame rate may differ from the spec'd 3 Hz; replay layer subsamples | Existing fixtures (`flight_derkachi.mp4` is 30 fps) | No (downstream handles) |
| C4 | Frame-to-frame registration must succeed for >95% of normal-flight segments (AC-2.1a) — applies if and only if the cleaned fixture is treated as a normal-flight fixture | `acceptance_criteria.md` | Soft: only if frames qualify as nadir |
| C5 | Output cannot lie about the underlying camera spec; calibration file must reflect the actual recording source (Topotek/Viewpro, not ADTi 20MP) | `flight_derkachi/camera_info.md` shows the convention is to ship a per-camera calibration JSON | **Yes** |
| C6 | The pipeline producing fixtures should be **reproducible** (versioned scripts, pinned tool versions) so a re-run produces the same fixture | `coderule.mdc` testing principles | **Yes** |
| C7 | Cleanup must NOT introduce false-positive features the downstream matcher could anchor on | derived from C2; specific to mask-aware vs inpaint trade-off | **Yes** |
| C8 | Gimbaled, non-nadir frames must be either filtered out or labeled — feeding forward-looking frames into a nadir-tuned VPR will produce nonsense matches | `restrictions.md` "navigation camera fixed downward (no gimbal)" + project's level-flight assumption | **Yes** |
## Sub-questions
1. **SQ-1 — Layer identification**: What spatially-distinct layers are in the
recorded video, and which are removable by cropping vs which require active
removal?
2. **SQ-2 — GCS UI chrome removal**: Best technique to remove the deterministic
GCS UI sidebars, minimap, status bar, IR PIP?
3. **SQ-3 — Gimbal-burned OSD removal**: Best technique to remove burned-in
gimbal HUD elements (attitude ladder, crosshair, FOV brackets, status text)
without fabricating content the downstream matcher could anchor on?
4. **SQ-4 — Mask-aware downstream alternative**: Can the project's existing
C2/C3 stack (DISK + LightGlue) consume a binary mask of OSD regions
directly, sidestepping the need to inpaint at all?
5. **SQ-5 — Non-nadir frame filtering**: How to detect and exclude frames where
the gimbal is pointed off-nadir (the burned-in attitude ladder shows the
gimbal angle)?
6. **SQ-6 — Acceptance against existing replay infrastructure**: What
metadata/companion-files does the new fixture need to drop into the
`flight_derkachi.mp4`-style replay path?
## Perspectives chosen (≥3)
| Perspective | Why | Sub-questions emphasized |
|---|---|---|
| **Implementer / Engineer** | This is fundamentally a tooling/pipeline question — the engineer building the fixture cleanup script needs concrete commands and gotchas | SQ-2, SQ-3, SQ-5 |
| **Contrarian / Devil's advocate** | The naive "just inpaint it with AI" approach has a specific failure mode in this domain (fabricated terrain features) that must be flagged | SQ-3, SQ-4 |
| **Domain expert / Academic** | VPR + matching algorithms have published mask-aware inference paths; the question of "do we need clean pixels or can we just signal which pixels to ignore" has a literature answer | SQ-4 |
## Question Explosion (search query variants)
For SQ-1 (layer identification): inspection-based, no web search.
For SQ-2 (GCS UI chrome removal):
- "FFmpeg crop filter exact pixel coordinates"
- "FFmpeg crop video specific region command line"
For SQ-3 (gimbal-burned OSD removal):
- "FFmpeg delogo filter remove static OSD overlay video burned-in HUD"
- "FFmpeg removelogo PNG mask filter syntax"
- "ProPainter E2FGVI video inpainting state of the art 2025 2026 mask region"
- "video OSD removal practitioner experience drone gimbal"
- "temporal median filter remove static HUD OSD video keep moving content"
- "drone gimbal video OSD removal extract clean nadir feed Topotek Viewpro"
For SQ-4 (mask-aware downstream):
- "SuperPoint LightGlue masked feature detection ignore region keypoints"
- "DISK keypoint detector mask region of interest pytorch implementation"
- "Kornia DISK mask parameter forward pass"
For SQ-5 (non-nadir frame filtering):
- "MAVLink MOUNT_STATUS gimbal attitude tlog parsing"
- "OCR pitch angle text from drone HUD video frame"
For SQ-6 (replay infrastructure):
- (no web; read project docs directly)
## Component Option Search Plan
| Component area | Option families to cover | Required evidence to mark Selected |
|---|---|---|
| Frame extraction & re-encode | Simple baseline (FFmpeg `crop`), Established (FFmpeg `crop` + container remux), Open-source (FFmpeg-python wrapper) | Verified `crop` syntax against FFmpeg 8.1 docs; PoC produces playable output |
| Static-region OSD removal | Simple (FFmpeg `delogo`), Established (FFmpeg `removelogo` with PNG mask), Open-source (Python+OpenCV inpaint per-frame), SOTA (ProPainter, VideoPainter), Adjacent (temporal-median `tmedian`/`atadenoise`), No-build (skip; pass mask downstream), Known-bad (generative models that fabricate content) | Comparison of per-region quality vs cost vs fabrication risk |
| Mask-aware downstream matcher | The project's existing DISK + LightGlue path with a binary mask injected | Verified Kornia DISK has a `mask` parameter; verified LightGlue maintainers recommend score-map masking |
| Non-nadir frame filtering | Tlog-based (parse `MOUNT_STATUS`/`MOUNT_ORIENTATION`), OCR-based (read burned-in pitch text), Pixel-pattern-based (detect attitude-ladder rotation), No-build (accept all frames; downstream covariance grows) | Known whether the paired `.tlog` contains gimbal attitude messages |
| Calibration metadata | Per-camera JSON file in same form as `khp20s30_factory.json` | Topotek/Viewpro spec sheet exists; "factory_sheet" approximation acceptable per AZ-702 precedent |
## Completeness Audit
- ✅ **Layer identification** covered (SQ-1).
- ✅ **Removal techniques** covered for both GCS UI (SQ-2) and gimbal OSD (SQ-3).
- ✅ **Alternative path** considered (SQ-4: mask-aware matchers, no inpainting).
- ✅ **Frame relevance** covered (SQ-5: gimbal pointing).
- ✅ **Integration** covered (SQ-6: replay path metadata).
- ✅ **Contrarian view** covered (generative-AI fabrication risk).
- 🚫 **Audio handling** — not covered; trivially answered (discard audio stream).
- 🚫 **Frame rate normalization** — not covered; trivially answered (replay
layer already subsamples; preserve native 30 fps).
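The frame-rate point can be illustrated with a minimal sketch of subsampling a 30 fps fixture down to the spec'd 3 Hz. The function name `subsample_indices` and the nearest-frame rule are assumptions made for illustration; the project's replay layer may select frames differently.

```python
def subsample_indices(n_frames, src_fps, target_hz):
    """Indices of the source frames nearest to a regular target_hz time grid.

    Hypothetical helper: the real replay layer's selection rule is not
    documented in this run.
    """
    step = src_fps / target_hz  # 30 fps / 3 Hz -> every 10th frame
    indices, k = [], 0
    while round(k * step) < n_frames:
        indices.append(round(k * step))
        k += 1
    return indices


# Toy example: a 35-frame clip at 30 fps sampled at 3 Hz.
picked = subsample_indices(35, 30, 3)
```

Keeping the fixture at its native frame rate and subsampling at replay time means the same MP4 can serve any target rate without re-encoding.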
@@ -0,0 +1,202 @@
# Source Registry — Mode B Video Extraction Run
> All sources accessed 2026-05-29.
## L1 — Official documentation / source code
### #1 — FFmpeg `delogo` filter (official ffmpeg-filters-docs)
- URL: https://ayosec.github.io/ffmpeg-filters-docs/6.0/Filters/Video/delogo.html
- Type: L1 (mirror of official FFmpeg filter docs)
- Tier rationale: Direct documentation of a built-in FFmpeg filter
- Key claims: rectangular logo region, parameters `x, y, w, h, show`,
interpolation from immediately-outside pixels
- Verified locally: yes — `ffmpeg -h filter=delogo` on FFmpeg 8.1 confirms the
parameter set (the `band` parameter present in older versions has been
removed in 8.1)
### #2 — FFmpeg `delogo` source (`vf_delogo.c`)
- URL: https://github.com/FFmpeg/FFmpeg/blob/master/libavfilter/vf_delogo.c
- Type: L1 (FFmpeg upstream source)
- Tier rationale: Authoritative implementation
- Key claims: applies a "simple delogo algorithm" interpolating surrounding
pixels into the rectangular logo region
### #3 — FFmpeg `removelogo` source (`vf_removelogo.c`)
- URL: https://www.ffmpeg.org/doxygen/trunk/vf__removelogo_8c_source.html
- Type: L1 (FFmpeg upstream source)
- Tier rationale: Authoritative implementation
- Key claims: bitmap-mask-based blur; "major improvement on the old delogo
filter"; mask must be a PNG where pixels are LOGO (white) vs source (black);
"only pixels in the mask that line up to pixels outside the logo are used"
- Local note: Filter exists in FFmpeg 8.1 but rejected our PNG mask with
"Invalid argument" (-22) — likely format expectation is stricter than
documented; sub-matrix marks this `Verify` rather than blocking.
### #4 — Topotek Gimbals on ArduPilot Copter docs
- URL: https://ardupilot.org/copter/docs/common-topotek-gimbal.html
- Type: L1 (ArduPilot upstream documentation)
- Tier rationale: Direct integration documentation for the camera class shown
in this project's screenshots `1.jpeg`–`4.png`
- Key claims (relevant subset):
- Two RTSP video streams: `rtsp://192.168.144.108:554/stream=0` (1080p) and
`stream=1` (480p)
- Configuration via "GimbalControl" Ethernet app (OSD on/off configurable)
- Captured images/videos retrievable from `camera/DCIM/snap` and
`camera/DCIM/record` over Ethernet/SMB
- Implication for this run: The cleanest source recovery path (raw RTSP or
on-camera DCIM) was explicitly excluded by the user's "only have this MKV"
constraint, but is recorded here as the recommended Option Z for any future
recordings.
### #5 — LightGlue maintainer guidance on mask injection (cvg/LightGlue#97)
- URL: https://github.com/cvg/LightGlue/issues/97
- Type: L1 (issue answered by repo maintainer @Phil26AT, an author)
- Tier rationale: Direct from the project that this codebase already uses
(per `solution_draft01.md` C3 component)
- Key claims:
- SuperPoint does **not** natively accept a mask in its forward pass
- Two recommended workarounds: (a) extract all keypoints, then filter by
mask post-hoc, or (b) multiply the SuperPoint score map by a binary mask
before NMS
- Maintainer comment: "(b) you would get more points in the specified area,
and thus more matches"
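The difference between the two workarounds can be shown on a toy score list. This is a sketch only: it does not call SuperPoint or LightGlue, and the helper names (`top_k`, `filter_post_hoc`, `mask_score_map`) are hypothetical. It assumes the detector keeps a fixed budget of `k` keypoints.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring pixels (flattened), best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def filter_post_hoc(scores, keep, k):
    """(a) Detect the top-k first, then drop keypoints that fall in masked pixels."""
    return [i for i in top_k(scores, k) if keep[i]]


def mask_score_map(scores, keep, k):
    """(b) Zero the masked scores first, then detect the top-k among what remains."""
    masked = [s * m for s, m in zip(scores, keep)]
    return [i for i in top_k(masked, k) if masked[i] > 0]


# Toy data: the two strongest responses sit on an overlay (keep == 0).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
keep = [0, 0, 1, 1, 1, 1]

kept_a = filter_post_hoc(scores, keep, k=3)  # budget spent on overlay pixels
kept_b = mask_score_map(scores, keep, k=3)   # budget spent on terrain only
```

With the same budget, (a) keeps one keypoint and (b) keeps three, which matches the maintainer's remark that masking the score map yields more points in the region of interest.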
### #6 — Kornia `DISK.forward(img, mask=None)` API (Kornia docs)
- URL: https://kornia.readthedocs.io/en/latest/feature.html
- Type: L1 (Kornia official documentation)
- Tier rationale: Authoritative for the Kornia DISK wrapper; relevant because
DISK is the project's chosen C3 detector per `solution_draft01.md`
- Key claims:
- `kornia.feature.DISK.forward(img, mask=None)` accepts `mask` as
`(B, 1, H, W)` with values in `[0, 1]`
- "the score map is multiplied by this mask before keypoint detection so
that features are suppressed in masked regions"
- Implication: **the project's existing C3 stack is already mask-capable**.
This makes Option B (mask-aware downstream, no inpainting) the lowest-risk
high-quality path.
### #7 — DISK upstream source (`disk/model/disk.py`)
- URL: https://github.com/cvlab-epfl/disk/blob/master/disk/model/disk.py
- Type: L1 (DISK upstream)
- Tier rationale: Authoritative for DISK semantics
- Key claims: DISK produces a per-pixel `heatmap` of detection scores;
multiplying this by a spatial mask before NMS / sampling is the canonical
way to restrict detection to a region
### #8 — FFmpeg `tmedian` filter (built-in)
- URL: https://ffmpeg.org/ffmpeg-filters.html#tmedian
- Type: L1 (FFmpeg official filter docs)
- Tier rationale: Authoritative
- Key claims: `tmedian` computes per-pixel temporal median over a configurable
radius window; built into recent FFmpeg
### #9 — `flight_derkachi/README.md` (project's existing fixture convention)
- URL: `_docs/00_problem/input_data/flight_derkachi/README.md` (in-repo)
- Type: L1 (project documentation)
- Key claims:
- Replay fixture is 880×720 H.264 30 fps MP4 with paired `.tlog`-derived
`data_imu.csv` and per-camera calibration JSON
- The MP4 is a "cleaned/cropped replay fixture rather than the raw camera
feed"
- "the rotating camera was mechanically fixed in a downward/nadir orientation"
- Implication: the new MKV-derived fixture should match the same shape
(cleaned/cropped MP4 + calibration JSON + telemetry CSV)
### #10 — `flight_derkachi/camera_info.md`
- URL: `_docs/00_problem/input_data/flight_derkachi/camera_info.md` (in-repo)
- Type: L1 (project documentation)
- Key claims:
- Derkachi camera: Topotek KHP20S30, 1/2.8" CMOS, 1920×1080
- Calibration via "factory_sheet" approximation (AZ-702) is project-accepted
when checkerboard isn't possible — same approach applies to the
new gimbal
## L2 — Peer-reviewed papers / preprints
### #11 — ProPainter (ICCV 2023)
- URL: https://shangchenzhou.com/projects/ProPainter/
- Date accessed: 2026-05-29
- Type: L2 (peer-reviewed conference paper, project page)
- Tier rationale: ICCV 2023 paper; SOTA (at publication) non-generative video
inpainting baseline
- Key claims:
- Recurrent flow completion + dual-domain (image+feature) propagation +
mask-guided sparse Transformer
- 808G FLOPs/10 frames at 480p; 0.249 s/frame on undisclosed GPU
- +1.46 dB PSNR vs prior SOTA
- Relevance: Baseline option for offline OSD inpainting; non-generative means
it propagates pixels from neighboring frames (no fabricated content) — this
is the property our project requires.
### #12 — VideoPainter (arXiv 2503.05639, 2025)
- URL: https://arxiv.org/html/2503.05639v3
- Type: L2 (arXiv preprint)
- Tier rationale: Most recent generative video inpainting (2025)
- Key claims:
- Generative dual-branch architecture
- Outperforms ProPainter on segmentation-based VPBench
- **Critical caveat for our use case**: explicitly described as a
*generative* model that synthesizes fully-masked-object content
- Implication: **Disqualified for our use case**. Synthesized terrain features
would corrupt VPR/matching evaluation (project's `meta-rule.mdc` "Real
Results, Not Simulated Ones").
### #13 — VidPivot / DiffuEraser comparison (arXiv 2510.21461, 2025)
- URL: https://arxiv.org/html/2510.21461v2
- Type: L2 (arXiv preprint)
- Key claims: cross-comparison between ProPainter, DiffuEraser, VideoPainter,
VidPivot on object removal; ProPainter "effectively removes the target
region but struggles to generate semantically consistent content"
- Implication: confirms ProPainter is the best non-generative option;
generative variants share the fabrication risk.
### #14 — DISK paper (NeurIPS 2020, arXiv 2006.13566)
- URL: https://arxiv.org/abs/2006.13566
- Type: L2 (peer-reviewed)
- Key claims: DISK is RL-trained (policy gradient); produces a dense heatmap;
trained on real image pairs with known geometry rather than on synthetic homographies
- Relevance: confirms DISK exposes a heatmap that can be multiplied by a
spatial mask before keypoint sampling
## L3 — Practitioner / blog / community
### #15 — "Removing obnoxious logos from videos" (Domain of the Technomancer blog)
- URL: https://www.technomancer.com/archives/248
- Type: L3 (practitioner blog)
- Key claims: practitioner walkthrough of FFmpeg `delogo`+`removelogo`,
including the workflow of building a PNG mask from a single frame screenshot
### #16 — Conditional Temporal Median Filter (kevina.org)
- URL: http://www.kevina.org/temporal_median/
- Type: L3 (older practitioner page; methodology still cited)
- Key claims: motion-conditional temporal median — apply median only where
motion is below threshold, preserves moving content while suppressing
static artifacts
- Relevance: the "static OSD on moving video" use case maps directly to this
filter family. However, in our test the burned-in OSD is *also moving*
visually because text values change every frame, so motion-conditional
median has limitations.
### #17 — Foundry Nuke `TemporalMedian` reference
- URL: https://learn.foundry.com/nuke/content/reference_guide/time_nodes/temporalmedian.html
- Type: L3 (commercial-tool documentation)
- Key claims: Nuke's `TemporalMedian` exposes a mask channel; effect can be
limited to the masked region only — same pattern that FFmpeg `tmedian` lacks
natively
## In-repo cross-references (project artifacts)
### #R1 — `_docs/01_solution/solution_draft01.md`
- C2 component: MixVPR (TensorRT, INT8+FP16) for retrieval
- C3 component: DISK + LightGlue for matching
- C5 component: GTSAM iSAM2 + CombinedImuFactor
- The pipeline does not have a "data ingestion / fixture-prep" component —
this is the gap this run addresses.
### #R2 — `_docs/00_research/06_component_fit_matrix/00_summary.md`
- Lists every component in the existing solution with selection status
- Confirms no fixture-cleanup component exists
### #R3 — `_docs/00_problem/input_data/flight_derkachi/khp20s30_factory.json`
- Existing per-camera calibration JSON convention; new gimbal needs an
equivalent
@@ -0,0 +1,283 @@
# Fact Cards — Mode B Video Extraction Run
> Confidence symbols: ✅ High (L1 official) — ⚠️ Medium (L2 academic / official
> blog) — ❓ Low (L3 practitioner / inference)
## Layer characterization (from local pixel-variance analysis)
### Fact #1 — Three independent overlay layers
- **Statement**: The recorded `2026-05-09 16-10-54.mkv` (1280×720 H.264 30 fps,
6 m 7 s) contains three spatially-overlapping layers: (a) GCS UI chrome
rendered as fixed pixel rectangles by the operator's GCS application,
(b) gimbal-burned-in OSD rendered upstream of the recorder by the camera
itself (attitude ladder, crosshair, FOV brackets, status text, IR
picture-in-picture), (c) the underlying EO video.
- **Source**: Local 12-frame variance analysis (`/tmp/nadir_research/`),
extracted frames at t=10,30,60,90,120,150,180,210,240,270,300,330 s
- **Confidence**: ✅ High (direct measurement)
- **Related Dimension**: SQ-1 (layer identification)
- **Fit Impact**: Establishes the action space — each layer needs its own
removal/handling strategy
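The 12-frame sampling described in the Source line could be reproduced with something like the sketch below. It only builds the FFmpeg argument lists; running them is left to the caller. The helper name `frame_grab_command` is hypothetical, and the `f_NNN.png` naming is inferred from the `f_030.png` and `f_300.png` files mentioned in Fact #5.

```python
from pathlib import PurePosixPath

# Sample times in seconds, as listed in the Source line of Fact #1.
SAMPLE_TIMES_S = [10, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330]


def frame_grab_command(video, t_s, out_dir):
    """Build an ffmpeg argv that writes the frame at t_s seconds as a PNG."""
    out_path = PurePosixPath(out_dir) / f"f_{t_s:03d}.png"
    return [
        "ffmpeg", "-hide_banner", "-y",
        "-ss", str(t_s),   # placed before -i, so FFmpeg seeks on the input
        "-i", video,
        "-frames:v", "1",  # stop after one video frame
        str(out_path),
    ]


commands = [
    frame_grab_command("2026-05-09 16-10-54.mkv", t, "/tmp/nadir_research")
    for t in SAMPLE_TIMES_S
]
```

Each list can be passed directly to `subprocess.run(cmd, check=True)`; passing a list avoids shell quoting problems with the space in the file name.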
### Fact #2 — IR PIP is itself a live video stream, not a static element
- **Statement**: The picture-in-picture in the upper-right (~x=720–1080,
y=25–235) has 85% dynamic-pixel fraction across the 12 sample frames,
consistent with a live IR/thermal video feed, not a static UI element.
- **Source**: Local variance analysis
- **Confidence**: ✅ High
- **Fit Impact**: Cannot be ignored as "noise". Either crop it out
geometrically or treat as an opaque rectangle in the OSD mask.
### Fact #3 — GCS UI sidebars contain live values, not pure-static chrome
- **Statement**: Left sidebar (SL STATS panel) and right sidebar (ROLL/SPEED/
DIST/BATT/CURRENT) have mean per-pixel std ≈30–40 across frames, comparable
to the actual EO video region. They are pixel-deterministic — same fixed
positions on every frame — but the *values* update.
- **Source**: Local variance analysis
- **Confidence**: ✅ High
- **Fit Impact**: Pure geometric crop removes them entirely; no need to
inpaint. Easy.
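The two statistics used in Facts #2 and #3 (mean per-pixel std and dynamic-pixel fraction) can be sketched as follows. This is not the script used for the local analysis: the function names are hypothetical, and the `dynamic_threshold` of 5.0 grey levels is an assumed cut-off, since the run does not state which threshold produced the 85% figure.

```python
from statistics import pstdev


def temporal_std_map(frames):
    """Per-pixel standard deviation across frames.

    frames: list of equally sized 2-D grayscale frames (lists of rows).
    """
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [pstdev([frame[y][x] for frame in frames]) for x in range(width)]
        for y in range(height)
    ]


def region_stats(std_map, x0, x1, y0, y1, dynamic_threshold=5.0):
    """Mean std and fraction of 'dynamic' pixels in [x0, x1) x [y0, y1)."""
    values = [std_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    mean_std = sum(values) / len(values)
    dynamic_fraction = sum(v > dynamic_threshold for v in values) / len(values)
    return mean_std, dynamic_fraction


# Toy 2x4 clip: the left half never changes, the right half changes every frame.
toy_frames = [
    [[10, 10, 0, 0], [10, 10, 0, 0]],
    [[10, 10, 50, 50], [10, 10, 50, 50]],
    [[10, 10, 100, 100], [10, 10, 100, 100]],
]
std_map = temporal_std_map(toy_frames)
static_region = region_stats(std_map, 0, 2, 0, 2)
dynamic_region = region_stats(std_map, 2, 4, 0, 2)
```

A region with live values scores high on both statistics even when its position is fixed, which is why variance alone cannot separate the sidebars from the EO video.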
### Fact #4 — Gimbal HUD text is *also* dynamic-content text on top of moving video
- **Statement**: The top-left HUD block (`00:00/00`, timestamps, EO/IR zoom,
FOV) and bottom-right gimbal text show high std (≈39–40), because both the
HUD values change AND the underlying video changes. The HUD is rendered
upstream by the camera and is **always at the same screen position**.
- **Source**: Local variance analysis + visual inspection of frames
- **Confidence**: ✅ High
- **Fit Impact**: Position-deterministic but content-dynamic. Inpainting must
either propagate from neighboring frames (temporal) or from spatially
adjacent pixels (FFmpeg `delogo`).
### Fact #5 — Frame at t=30 s shows gimbal pointed forward (horizon visible), frame at t=300 s shows nadir
- **Statement**: The gimbal is operator-pointable; not all frames are nadir.
Burned-in attitude indicator shows pitch numbers from `-3.7°` (near level)
to clearly off-nadir values. The aircraft also appears to be a multirotor
(frame at t=300 s shows DIST=17.0 m at low altitude, inconsistent with
fixed-wing 1 km AGL).
- **Source**: Direct visual inspection of `f_030.png` and `f_300.png`
- **Confidence**: ✅ High (visual)
- **Fit Impact**: Frame-level filtering required before treating output as a
nadir fixture. The replay pipeline tuned for nadir-only would mis-handle
forward-looking frames.
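Once a per-frame pitch value is available (from the `.tlog` or from OCR of the burned-in indicator, see SQ-5), the frame-level filter could look like the sketch below. The pitch convention (0° level, -90° straight down), the 10° tolerance and the minimum run length are all assumptions, not project decisions.

```python
NADIR_PITCH_DEG = -90.0  # assumed convention: 0 = level, -90 = straight down


def nadir_segments(pitch_by_frame, tolerance_deg=10.0, min_len=30):
    """Return [start, end) frame-index ranges where the gimbal stays near nadir.

    pitch_by_frame: one pitch value in degrees per frame, or None if unknown.
    Runs shorter than min_len frames are dropped as too short to replay.
    """
    segments, start = [], None
    # A trailing None acts as a sentinel that closes the last open run.
    for i, pitch in enumerate(list(pitch_by_frame) + [None]):
        near_nadir = (
            pitch is not None
            and abs(pitch - NADIR_PITCH_DEG) <= tolerance_deg
        )
        if near_nadir and start is None:
            start = i
        elif not near_nadir and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments


# Toy track: forward-looking, then nadir, one unreadable frame, short nadir tail.
toy_pitch = [-3.7] * 5 + [-88.0] * 4 + [None] + [-92.0] * 2
kept = nadir_segments(toy_pitch, tolerance_deg=10.0, min_len=3)
```

Frames with unknown pitch are treated as non-nadir, so a failed OCR read splits a segment rather than silently admitting an unverified frame.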
## FFmpeg techniques
### Fact #6 — FFmpeg `crop` is a pixel-level deterministic geometric crop
- **Statement**: `crop=W:H:X:Y` produces a sub-region at arbitrary integer
coordinates. Because `crop` is a filter, it cannot be combined with
`-c:v copy` (FFmpeg refuses to mix filtering with stream copy); the output
is always re-encoded, so quality depends on the encoder settings chosen.
- **Source**: Source #1 + locally tested (PoC1 in `/tmp/nadir_research/`)
- **Confidence**: ✅ High
- **Related Dimension**: SQ-2 (GCS chrome removal)
- **Fit Impact**: Trivially implements the entire GCS-chrome-removal step.
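A minimal sketch of the crop step, expressed as a Python helper that builds the FFmpeg argument list. The margin values in the example are hypothetical placeholders, not the measured sidebar widths of this recording, and the encoder settings (`libx264`, `-crf 18`) are an assumed choice. Audio is dropped, following the Completeness Audit.

```python
def crop_filter(frame_w, frame_h, left, right, top, bottom):
    """Return an FFmpeg crop=W:H:X:Y string that trims the given margins (px)."""
    out_w = frame_w - left - right
    out_h = frame_h - top - bottom
    if out_w <= 0 or out_h <= 0:
        raise ValueError("margins remove the entire frame")
    # Keep dimensions even: yuv420p output needs even width and height.
    out_w -= out_w % 2
    out_h -= out_h % 2
    return f"crop={out_w}:{out_h}:{left}:{top}"


def crop_command(src, dst, vf):
    """Build an ffmpeg argv that applies the filter string vf and re-encodes."""
    return [
        "ffmpeg", "-hide_banner", "-y", "-i", src,
        "-vf", vf,
        "-an",  # discard the audio stream
        "-c:v", "libx264", "-crf", "18", "-pix_fmt", "yuv420p",
        dst,
    ]


# Hypothetical margins for a 1280x720 capture; measure the real ones per recording.
vf = crop_filter(1280, 720, left=200, right=200, top=0, bottom=0)
cmd = crop_command("2026-05-09 16-10-54.mkv", "cropped.mp4", vf)
```

Pinning the margins and encoder flags in a versioned script is one way to satisfy the reproducibility constraint (C6).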
### Fact #7 — FFmpeg `delogo` replaces a rectangle with interpolation from neighboring pixels
- **Statement**: `delogo=x=X:y=Y:w=W:h=H` interpolates from the immediately-
outside pixels of the rectangle. In FFmpeg 8.1 the `band` parameter has been
removed; only `x, y, w, h, show` remain. The filter is timeline-enabled
(can be activated only on certain frames via `enable=` expression).
- **Source**: Source #1 + Source #2 + locally verified (`ffmpeg -h
filter=delogo` on FFmpeg 8.1)
- **Confidence**: ✅ High
- **Related Dimension**: SQ-3 (gimbal OSD removal)
- **Fit Impact**: Cheap, deterministic, works for small rectangles. Quality
degrades for large rectangles or when the region's interior is full of
texture (e.g., text on grass).
- **Caveat**: Cannot place the rectangle touching the image edge — there are
no surrounding pixels to interpolate from.
### Fact #8 — Multiple `delogo` filters can be chained via comma
- **Statement**: A filter graph like
`crop=W:H:X:Y,delogo=...,delogo=...,delogo=...` chains successive `delogo`
passes, each operating on the output of the previous.
- **Source**: Locally verified (PoC4 produced `poc4_delogo.mp4` via 3 chained
`delogo` filters after `crop`)
- **Confidence**: ✅ High (direct test)
- **Fit Impact**: Practical recipe for removing the 5–6 burned-OSD regions in
this video.
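The chained filter graph can be generated from a list of OSD rectangles, which also gives a place to enforce the edge caveat from Fact #7. The rectangles below are hypothetical, and the 1 px border check is a conservative reading of that caveat rather than FFmpeg's exact validation rule.

```python
def delogo(x, y, w, h, frame_w, frame_h):
    """Return a delogo filter string; reject rectangles touching the frame edge."""
    if x < 1 or y < 1 or x + w > frame_w - 1 or y + h > frame_h - 1:
        raise ValueError(
            f"delogo rectangle {(x, y, w, h)} must leave a 1 px border"
        )
    return f"delogo=x={x}:y={y}:w={w}:h={h}"


def build_filtergraph(crop, osd_rects, frame_w, frame_h):
    """Chain one crop and N delogo passes with commas.

    crop: 'crop=W:H:X:Y' string.
    osd_rects: (x, y, w, h) tuples in post-crop coordinates.
    frame_w, frame_h: frame size after the crop.
    """
    passes = [delogo(x, y, w, h, frame_w, frame_h) for x, y, w, h in osd_rects]
    return ",".join([crop] + passes)


# Hypothetical OSD rectangles on an 880x720 cropped frame.
graph = build_filtergraph(
    "crop=880:720:200:0",
    [(10, 10, 120, 40), (700, 650, 160, 50)],
    frame_w=880,
    frame_h=720,
)
```

The resulting string is what would be passed to `-vf`; each `delogo` pass operates on the output of the previous one, as in PoC4.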
### Fact #9 — FFmpeg `removelogo` accepts a PNG mask but is fragile in FFmpeg 8.1
- **Statement**: `removelogo=mask.png` should accept a PNG where black=clean,
white=logo. In our local FFmpeg 8.1 tests it failed with `Invalid argument`
(`-22`) on both grayscale and RGB masks of the correct dimensions.
Documentation (Source #3) suggests strict requirements on the mask format
that FFmpeg 8.1 enforces but does not document clearly. Practitioner
walkthroughs (Source #15) used the filter successfully on older FFmpeg.
- **Source**: Source #3 + Source #15 + local test failure
- **Confidence**: ⚠️ Medium (works in principle, version-dependent in practice)
- **Fit Impact**: Use chained `delogo` instead, or use a per-frame OpenCV
inpaint script if `removelogo` cannot be made to work on the team's pinned
FFmpeg version.
### Fact #10 — FFmpeg `tmedian` computes per-pixel temporal median over a window
- **Statement**: `tmedian=radius=N` outputs each pixel as the median of pixels
at the same coordinates over the window of `2N+1` frames. For a moving
camera over rich terrain, the underlying scene changes every frame so the
temporal median tends to wash out — producing motion-blur-like
ghosting rather than clean output.
- **Source**: Source #8 + locally tested (PoC3 produced
`poc3_crop_tmedian.mp4`)
- **Confidence**: ✅ High (direct test)
- **Fit Impact**: **Not suitable** for our case — both the OSD values and the
underlying video change every frame, so temporal median produces ghosted
output that's worse for downstream matching than the original OSD-laden
frames.
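The ghosting effect can be seen on a single pixel's time series. The sketch below is a plain-Python model of a temporal median with a `2N+1` window; the clamped handling of the first and last frames is a simplification and not necessarily how FFmpeg pads the window.

```python
from statistics import median


def temporal_median(series, radius):
    """Median of each sample's (2*radius + 1)-frame neighbourhood.

    series: values of ONE pixel over time. Windows are clamped at both ends.
    """
    out = []
    for i in range(len(series)):
        window = series[max(0, i - radius): i + radius + 1]
        out.append(median(window))
    return out


# A pixel over static content is unchanged by the filter.
static_pixel = [80] * 7
# A pixel over terrain seen from a moving camera changes every frame.
moving_pixel = [10, 200, 30, 180, 50, 160, 70]

static_out = temporal_median(static_pixel, radius=1)
moving_out = temporal_median(moving_pixel, radius=1)
```

In the moving case every output sample differs from the frame's true value, which is the per-pixel view of the ghosting seen in PoC3.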
## Deep-learning video inpainting
### Fact #11 — ProPainter is the SOTA non-generative video inpainter (as of late 2023)
- **Statement**: ProPainter (Zhou et al., ICCV 2023) uses recurrent flow
completion + dual-domain propagation (image and feature) + mask-guided
sparse Transformer. Explicitly described as non-generative — it propagates
pixels from non-masked frames rather than synthesizing new content.
~0.249 s/frame at 480p, 808G FLOPs/10 frames.
- **Source**: #11 (ProPainter project page)
- **Confidence**: ⚠️ Medium (paper claims; per-deployment runtime varies)
- **Related Dimension**: SQ-3 (gimbal OSD removal, high-quality option)
- **Fit Impact**: Highest-quality option for OSD removal that respects the
"no fabrication" constraint. Cost: GPU + Python toolchain; offline-only.
### Fact #12 — VideoPainter and successors are *generative* and DISQUALIFIED for our use case
- **Statement**: VideoPainter (2025), DiffuEraser (2025), VidPivot (2025),
OmniPainter use I2V or diffusion backbones to *synthesize* content for fully
masked regions. They produce more visually pleasing output than ProPainter
but the synthesized content is **not** a faithful representation of the real
underlying scene.
- **Source**: #12 + #13
- **Confidence**: ✅ High (explicit in the papers)
- **Related Dimension**: SQ-3
- **Fit Impact**: **Disqualifier**. Project rule (`meta-rule.mdc` "Real
Results, Not Simulated Ones"): a fixture that fabricates terrain features
the matcher might anchor on is worse than no fixture. Status: `Rejected`.
## Mask-aware downstream
### Fact #13 — Kornia's `DISK.forward()` accepts a binary mask natively
- **Statement**: `kornia.feature.DISK.forward(img, mask=None)` takes a mask
argument of shape `(B, 1, H, W)` with values in `[0, 1]`. The score map is
multiplied by this mask before keypoint detection — keypoints in masked
regions are suppressed by construction, with no preprocessing of pixels.
- **Source**: #6 (Kornia docs L1)
- **Confidence**: ✅ High
- **Related Dimension**: SQ-4 (mask-aware downstream)
- **Fit Impact**: **Lowest-risk highest-quality option**. The project's chosen
C3 detector (DISK per `solution_draft01.md`) already supports mask injection
out of the box — *no video preprocessing required* beyond the deterministic
GCS-chrome crop.
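The masking step itself is simple enough to sketch without any detector. The code below builds a keep-mask from OSD rectangles and multiplies it into a 2-D score map. It deliberately does not call Kornia: the `mask` keyword described above is taken from this run's reading of the docs and should be re-checked against the pinned Kornia version. The helper names and rectangles are hypothetical.

```python
def keep_mask(height, width, osd_rects):
    """2-D mask: 1.0 where terrain is visible, 0.0 inside any OSD rectangle.

    osd_rects: iterable of (x, y, w, h) in pixel coordinates; rectangles are
    clipped to the frame.
    """
    mask = [[1.0] * width for _ in range(height)]
    for x, y, w, h in osd_rects:
        for row in range(max(0, y), min(height, y + h)):
            for col in range(max(0, x), min(width, x + w)):
                mask[row][col] = 0.0
    return mask


def apply_mask(score_map, mask):
    """Element-wise product of a detector score map and the keep mask."""
    return [
        [s * m for s, m in zip(score_row, mask_row)]
        for score_row, mask_row in zip(score_map, mask)
    ]


# Toy 4x6 frame with one 2x2 OSD block in the top-left corner.
mask = keep_mask(4, 6, [(0, 0, 2, 2)])
scores = [[0.5] * 6 for _ in range(4)]
masked_scores = apply_mask(scores, mask)
```

For a real detector the same mask would be reshaped to `(B, 1, H, W)`; if the detector turns out not to accept a mask argument, multiplying its heatmap before keypoint selection achieves the same suppression.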
### Fact #14 — LightGlue's matching layer needs no mask; suppression at detect time is sufficient
- **Statement**: LightGlue's authors recommend (issue #97, by maintainer
Phil26AT) suppressing keypoints at detect time via score-map masking; once
no keypoints are produced in the masked region, LightGlue has nothing to
match there.
- **Source**: #5 (LightGlue issue, maintainer reply)
- **Confidence**: ✅ High
- **Related Dimension**: SQ-4
- **Fit Impact**: Confirms Option B is feasible end-to-end with the existing
C3 stack.
## Source recovery (informational; ruled out by user)
### Fact #15 — Topotek/Viewpro multi-sensor balls expose RTSP and DCIM directly
- **Statement**: Topotek camera class (per ArduPilot integration docs) exposes
two RTSP streams (`rtsp://192.168.144.108:554/stream=0` 1080p,
`stream=1` 480p) and on-camera recordings retrievable via Ethernet/SMB at
`camera/DCIM/snap` and `camera/DCIM/record`. OSD overlays can be disabled
via the GimbalControl Ethernet utility.
- **Source**: #4
- **Confidence**: ✅ High
- **Fit Impact**: For *future* recordings this is the dominant path
(no cleanup needed). Out of scope for the current MKV per user constraint
but recorded as Option Z in the comparison framework.
## Project-context inheritance
### Fact #16 — `flight_derkachi.mp4` is the existing reference fixture shape
- **Statement**: Existing replay fixture is 880×720 H.264 30 fps MP4, paired
with `data_imu.csv` (10 Hz from `.tlog`) and per-camera calibration JSON
(`khp20s30_factory.json`). The MP4 is described as "cleaned/cropped replay
fixture rather than the raw camera feed" with the "rotating camera
mechanically fixed in a downward/nadir orientation".
- **Source**: #9 + #10
- **Confidence**: ✅ High
- **Fit Impact**: New fixture must match this structure to drop into the
existing `tests/e2e/replay/test_az835_e2e_real_flight.py` harness.
### Fact #17 — A "factory_sheet" calibration approximation is project-accepted when checkerboard isn't possible
- **Statement**: The Derkachi calibration was sourced via "factory_sheet"
approximation (AZ-702) since per-unit checkerboard refinement was deferred
for lack of hardware access. Residual focal-length error is expected in the
1-3% band. The project acknowledges this is the cheapest acceptable starting point.
- **Source**: #10 (`camera_info.md`)
- **Confidence**: ✅ High
- **Fit Impact**: A new calibration JSON for the Topotek/Viewpro multi-sensor
ball can use the same approach — published spec sheet → focal length, FOV,
pixel size approximations, marked `factory_sheet` source.
### Fact #18 — Existing solution has no "data ingestion / fixture-prep" component
- **Statement**: Components C1 (VIO) through C12 (build cache orchestrator) in
`solution_draft01.md` cover runtime + pre-flight + deploy concerns but do
not include a fixture-cleanup or data-ingestion component. Fixtures appear
in the `tests/e2e/replay/` infrastructure as already-cleaned MP4s.
- **Source**: #R1 + #R2
- **Confidence**: ✅ High
- **Fit Impact**: This is the *gap* the Mode B revision addresses. The new
fixture-prep component does not modify the runtime; it adds a developer
tool under `tools/` or `tests/fixtures/` that produces fixtures consumable
by the existing replay path.
## API Capability Verification — applied to lead candidates
This section is mandatory per SKILL.md → Step 2 → API Capability Verification.
### MVE — Kornia DISK in mask-aware mode
- **Source**: Source #6 (Kornia docs, accessed 2026-05-29)
- **Inputs in the docs example**: `img` of shape `(B, C, H, W)`, `mask` of
shape `(B, 1, H, W)` with values in `[0, 1]`
- **Outputs in the example**: list of `Features` (keypoints + descriptors)
with no keypoints in masked regions
- **Project inputs**: 1 image (`B=1`), `mask` derived once from a static OSD
layout, applied per-frame
- **Project outputs required**: keypoints + descriptors that can be passed
into LightGlue (the project's existing C3.2 component)
- **Match assessment**: ✅ exact match — Kornia DISK is the same library the
existing solution uses; the mask path is documented and exercised by Kornia
tests
- **MVE code (project's expected use):**
```python
import torch, kornia.feature as KF
from PIL import Image
import numpy as np
disk = KF.DISK.from_pretrained("depth").eval()
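# osd_mask.png is assumed to be white (255) on OSD pixels and black elsewhere,
# which is why the threshold below inverts it into a keep-mask
mask_np = np.asarray(Image.open("osd_mask.png").convert("L")) / 255.0
# mask: 1 where keep, 0 where suppress (matches Kornia semantics)
mask = torch.from_numpy((mask_np < 0.5).astype("float32"))[None, None]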
img = ... # (1, 3, H, W)
feats = disk(img, mask=mask, n=2048)
```
### MVE — FFmpeg crop + chained delogo (project's primary cleanup path)
- **Source**: Source #1 (FFmpeg delogo docs) + local PoC4
- **Inputs in our test**: `2026-05-09 16-10-54.mkv` (1280×720 H.264 30 fps)
- **Outputs in our test**: `poc4_delogo.mp4` (900×445 H.264 30 fps with three
burned-OSD rectangles overwritten by interpolated pixels)
- **Project inputs**: matches
- **Project outputs required**: a file the replay harness can consume
- **Match assessment**: ✅ exact match — local PoC produced a valid playable
output, dimensions match the existing fixture convention class
(sub-1080p H.264 MP4)
- **MVE command:**
```bash
ffmpeg -i input.mkv \
-vf "crop=900:445:50:25,delogo=x=5:y=35:w=180:h=115,delogo=x=395:y=5:w=275:h=70,delogo=x=130:y=265:w=690:h=50" \
-an -c:v libx264 -crf 18 fixture.mp4
```
### Skipped — VideoPainter / DiffuEraser / VidPivot
- These candidates are rejected on the fabrication-risk disqualifier (Fact
#12), not on API capability. No MVE built; not progressing to Step 7.5
Selected status.
@@ -0,0 +1,88 @@
# Comparison Framework — Video Extraction Options
## Selected Framework Type
**Decision Support** — multiple candidates, weighted on cost vs quality vs
risk, with the goal of selecting the best path (or composition of paths) for
the project's replay-fixture use case.
## Selected Dimensions
1. **Output fidelity** — Are the underlying terrain pixels preserved
verbatim, or modified/synthesized?
2. **Fabrication risk** — Could the technique introduce features the
downstream matcher could anchor on but that don't exist in reality?
(Project's "Real Results, Not Simulated Ones" rule.)
3. **Pixel coverage** — How much of the original EO video region is usable
in the output?
4. **Cost & complexity** — Lines of code, dependencies, runtime per frame,
GPU required?
5. **Reproducibility** — Same input → same output across runs, machines, and
time?
6. **Project-pipeline integration cost** — How much of the existing C2/C3
pipeline needs to change to consume the output?
7. **Coverage of layers** — Which of the three layers (GCS chrome /
gimbal-burned OSD / IR PIP) does the technique address?
8. **Per-frame gimbal-pointing handling** — Does the technique help filter
non-nadir frames?
## Initial Population — Option Matrix
> Notation: ✅ ideal — ✓ acceptable — ⚠️ caveat — ❌ disqualifier
> Pixel coverage is in % of the 1280×720 original (1280×720 = 921 600 px)
| # | Option | Output fidelity | Fabrication risk | Pixel coverage | Cost & complexity | Reproducibility | C2/C3 integration cost | Layer coverage | Non-nadir filtering |
|---|---|---|---|---|---|---|---|---|---|
| **A** | **Crop only** (FFmpeg `crop`) | ✅ Verbatim | ✅ None | ⚠️ ~42% (740×525 ≈ 388 500 px after removing chrome+IR-PIP+minimap; ~70% of EO area) | ✅ Trivial (one filter) | ✅ Bit-deterministic | ✅ Zero changes | GCS chrome: ✅ — Gimbal OSD: ❌ remains burned in — IR PIP: ✅ excluded by tight crop | ❌ No |
| **B** | **Crop + mask-aware DISK** (Fact #13) | ✅ Verbatim | ✅ None | ✅ ~80% of EO area (mask only suppresses keypoints in OSD pixels, pixels themselves are unchanged) | ✓ Trivial pipeline change: pass `osd_mask.png` to DISK forward call; one-time mask build | ✅ Mask is a static PNG | ⚠️ One-line C3 code change to pass `mask=` parameter | GCS chrome: ✅ — Gimbal OSD: ✅ via score-map suppression — IR PIP: ✅ via mask | ❌ No (orthogonal concern) |
| **C** | **Crop + chained `delogo`** (Fact #7, #8) | ✓ Mostly verbatim, OSD regions are interpolated from neighbor pixels | ✓ Low — interpolation produces blurry but plausible content; could create weak features but no semantic terrain hallucination | ✅ ~85% (interpolation fills the OSD region) | ✓ Cheap (one FFmpeg invocation, ~5 chained filters) | ✅ Bit-deterministic | ✅ Zero changes (output is plain MP4) | GCS chrome: ✅ — Gimbal OSD: ✓ each OSD region passed to a separate `delogo` — IR PIP: ⚠️ too large for `delogo`, must crop or use removelogo | ❌ No |
| **D** | **Crop + `removelogo` PNG mask** (Fact #9) | ✓ Mostly verbatim, mask-shaped blur fills OSD regions | ✓ Low (same blur-based approach as `delogo`) | ✅ ~85% | ⚠️ Cheap but version-fragile in our tests on FFmpeg 8.1 (failed); more reliable on older FFmpeg | ✅ if it works on the target version | ✅ Zero changes | All layers via single mask | ❌ No |
| **E** | **Crop + ProPainter video inpainting** (Fact #11) | ✓ Verbatim where possible, propagated from non-masked frames where occluded | ✓ Low — non-generative, propagation-based; but if the OSD covers the same scene region for many frames the propagation may guess | ✅ ~85% | ❌ Expensive: GPU required; ~0.25 s/frame at 480p, scales with resolution; Python toolchain (PyTorch + custom build) | ✓ Reproducible if model weights pinned | ✅ Zero changes (output is plain MP4) | All layers if mask covers them | ❌ No |
| **F** | **Crop + temporal-median (`tmedian`)** (Fact #10) | ❌ Smeared — both OSD and underlying scene change per frame; median washes both | High risk: smeared output may produce false features OR suppress real ones | ⚠️ Coverage is full but quality is degraded everywhere | ✓ Cheap | ✅ | ✅ | All if motion is right; **doesn't work for our case** because OSD values *also* change per frame | ❌ No |
| **G** | **Crop + generative video inpainting (VideoPainter et al.)** (Fact #12) | ❌ Synthesized | ❌❌ **High** — fabricates terrain features that don't exist | ✅ ~85% | ❌ Very expensive: SOTA generative VIs require multi-GB models on H100-class GPUs | ✓ but content is non-deterministic across runs (unless seed pinned) | ✅ output is plain MP4 | All layers | ❌ No |
| **H** | **Per-frame OpenCV navier-stokes / telea inpaint** (with the same OSD mask) | ✓ Verbatim where possible, deterministic non-generative inpaint | ✓ Low | ✅ ~85% | ✓ Cheap (Python + OpenCV); slower than FFmpeg but trivial code | ✅ | ✅ output is plain MP4 | All layers | ❌ No |
| **I** | **Tlog-based gimbal-attitude filter** (orthogonal, applied to A/B/C) | n/a — filtering only | n/a | Reduces output to nadir-band frames only | ✓ Cheap if `MOUNT_STATUS`/`MOUNT_ORIENTATION` is in the paired `2026-05-09 16-09-54.tlog` | ✅ | ✓ stand-alone tool that drops frames before encoding | n/a (frame-level) | ✅ **Yes** — gates by gimbal pitch from telemetry |
| **J** | **OCR-based pitch-from-OSD filter** (orthogonal, applied to A/B/C) | n/a | n/a | Reduces output to nadir-band frames only | ⚠️ More complex (Tesseract or PaddleOCR per-frame) and OCR errors propagate | ✓ | ✓ stand-alone tool | n/a (frame-level) | ✅ via OCR on the `-3.7°` text in the burned attitude indicator |
| **Z** | **Source recovery** (re-record with OSD off / pull RTSP / pull DCIM) | ✅ Native | ✅ None | ✅ 100% | ✅ Trivial *if* hardware access | ✅ | ✅ Zero changes | All layers (no overlay produced) | ⚠️ Depends on whether gimbal can be locked nadir |
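The pixel-coverage basis stated in the notation line (percentage of the 1280×720 original) can be checked with a few lines of arithmetic. This is a minimal sketch; `coverage_pct` is a hypothetical helper, and the two crop sizes are the ones quoted for Option A and in the PoC4 FFmpeg command in `02_fact_cards.md`.

```python
def coverage_pct(width, height, ref_width=1280, ref_height=720):
    """Percentage of the reference frame area covered by a width x height crop."""
    return 100.0 * (width * height) / (ref_width * ref_height)

crop_a = coverage_pct(740, 525)     # Option A crop quoted in the matrix
crop_poc4 = coverage_pct(900, 445)  # crop used in the PoC4 FFmpeg command
print(f"740x525: {crop_a:.1f}%  900x445: {crop_poc4:.1f}%")
```

Both crops come out at roughly 42% to 43% of the full recorded frame. Percentages quoted "of EO area" use a different, smaller denominator and are not directly comparable with these.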
## Composition note
Options are not all mutually exclusive. The three orthogonal axes are:
- **Pixel handling**: choose ONE of {A, B, C, D, E, F, G, H, Z}
- **Frame filtering** (non-nadir rejection): choose ZERO OR ONE of {I, J} on
top of the pixel-handling choice
- **Source class**: Option Z replaces all of the above when source access is
available; for the current MKV (user constraint = "only have this MKV"), Z
is unavailable.
## Recommended composition
**Primary**: **B + I** — crop the GCS chrome geometrically, build a binary
OSD mask (a PNG once, hand-edited or scripted from the variance map), and
inject the mask into the project's existing DISK detector via the
already-supported `mask=` parameter; in parallel, parse the paired
`.tlog` for gimbal attitude and drop frames where the gimbal is off-nadir.
**Fallback** (when modifying the C3 path is not desirable for this fixture):
**C + I** — produce a plain `.mp4` via crop + chained `delogo` so the new
fixture can drop into the existing replay path with **zero** code changes,
then apply the same tlog-based frame filter.
**Disqualified options**: G (generative inpainting), F (temporal median —
doesn't work for our case because OSD values change per frame).
**Excluded by user constraint, but recommended for future recordings**:
Z (source recovery — pull RTSP or DCIM directly from the camera).
## Reasoning summary table
| Question dimension | Winner | Why |
|---|---|---|
| Output fidelity | B (and Z when available) | No pixels modified |
| Fabrication risk | B, A | No new pixels invented |
| Pixel coverage | B, C, D, H | Whole EO region usable |
| Cost & complexity | A, C | Single FFmpeg command |
| Reproducibility | All except G | Deterministic |
| C2/C3 integration | A, C, D, H | No code changes |
| Layer coverage | B, D | Single mask handles all |
| Non-nadir filtering | I (with any pixel option) | Telemetry-driven |
@@ -0,0 +1,247 @@
# Reasoning Chain — Video Extraction Decisions
## Dimension 1 — Why three layers and not two
### Fact confirmation
Local 12-frame variance analysis (Fact #1) showed at least three pixel
populations distinguishable by their behavior over time:
1. Pixel-stable rectangles around the periphery (left/right sidebars,
minimap) — the GCS UI chrome.
2. Pixel-stable rectangles in the central video area (top-left HUD,
top-center attitude ladder, crosshair, FOV brackets, bottom-right
coordinates) — gimbal-burned-in OSD.
3. The dynamic remainder — the actual EO video, plus the IR PIP, which is
*itself* a dynamic video stream stamped at fixed coordinates.
### Reference comparison
A simpler "UI vs video" two-layer model would suggest a single mask covering
all overlays. But the IR PIP behaves like the EO video (Fact #2), and the
GCS chrome includes live-updating values (Fact #3) — so the actual
distinction that matters is *who renders the pixel and how*, not *whether
the pixel is constant*:
- GCS chrome is rendered by the GCS application **after** the camera stream
arrives → it's removable by cropping to the region the GCS shows the EO
in.
- Burned-in gimbal OSD is rendered **inside the camera** before the recorder
sees it → it's pixel-baked into the EO video and only removable by
inpainting or by mask-aware downstream consumption.
- IR PIP is **also** rendered by the camera (the gimbal stamps the IR
channel into a corner of the EO output stream) → behaves like burned-in
OSD: pixel-baked, removable only by masking or cropping it out.
### Conclusion
Three layers, two removal classes:
- Class 1 (GCS chrome): pure crop.
- Class 2 (gimbal OSD + IR PIP): mask or inpaint.
### Confidence
✅ High — pixel-variance evidence is direct measurement.
---
## Dimension 2 — Why mask-aware downstream wins over inpainting
### Fact confirmation
The project's chosen C3 detector is DISK + LightGlue
(`solution_draft01.md`). Kornia's DISK accepts a `mask=(B,1,H,W)` parameter
that multiplies the detection score map (Fact #13). LightGlue's authors
confirm that suppressing keypoints at detect time is sufficient — once the
detector returns no keypoints in a region, the matcher has nothing to match
there (Fact #14).
### Reference comparison
Inpainting-based options (C, D, E, H) all share the property that they
synthesize *some* content for the OSD region. Even non-generative
techniques like FFmpeg `delogo` (interpolation from outside pixels) or
ProPainter (propagation from neighbor frames) produce pixels that *look*
like terrain but didn't come from the actual terrain at that location. A
feature detector running on those inpainted pixels could legitimately fire
on the inpaint artifacts. Without a mask, the downstream pipeline cannot
distinguish a real feature from a fake one. With a mask, it doesn't have to:
the score map is zeroed before NMS, so no keypoint is produced for the OSD
region in the first place.
### Conclusion
Mask-aware downstream is **strictly better** than inpainting for this
project's use case, because:
1. Output fidelity is verbatim (no synthesized pixels enter the matcher).
2. The mask is a single static PNG, computed once from the OSD layout — far
simpler than per-frame inpainting.
3. The integration cost is one parameter on the existing `DISK.forward()`
call (Fact #13).
4. The OSD coverage is the union of all OSD elements, so the mask trivially
handles all of them at once (top-left HUD, attitude ladder, crosshair,
etc.) without one filter per region.
The only reason to fall back to inpainting (Option C/H) is if we want a
fixture that can be dropped into the existing replay path **without any
code change**, because today's replay tooling treats the input MP4 as
pristine. Even then, the right answer is to extend the replay tooling to
carry an optional companion `osd_mask.png` per fixture — at which point
Option B is again preferable.
### Confidence
✅ High — both the existence of the API and its semantic effect are
documented at L1 (Kornia docs, LightGlue maintainer reply).
---
## Dimension 3 — Why generative inpainting is disqualified
### Fact confirmation
VideoPainter (2025), DiffuEraser (2025), VidPivot (2025) and similar SOTA
inpainters (Fact #12) explicitly *generate* content for masked regions
using video-diffusion or I2V backbones. The papers claim these models
produce *plausible* terrain even where the masked region was fully
occluded.
### Reference comparison
Project's `meta-rule.mdc` rule "Real Results, Not Simulated Ones" is
unambiguous: the goal is a working product, not the appearance of one.
Specifically: "Never produce results by bypassing, faking, stubbing, or
passthrough-ing the component that is supposed to produce them."
The downstream component is a feature matcher whose entire purpose is to
detect real terrain features and match them to a satellite tile. A
generative inpaint inserts plausible-but-false terrain features into the
input. The matcher cannot tell the difference. It will happily match
fabricated grass texture to a real satellite-tile region with similar
texture and produce a confident but wrong fix.
The same argument applies even more sharply to the project's
**AC-NEW-7** "cache-poisoning safety budget": onboard tiles fed back into
the basemap must not be misaligned. A fixture validating tile generation
that includes synthesized terrain features tests the wrong thing — it
validates that the system handles plausible-looking pixels, not that it
handles real-flight pixels.
### Conclusion
Generative inpainters (Option G) are **rejected**. They optimize the wrong
objective for this project.
### Confidence
✅ High — disqualifier comes from explicit project rule + reading of
upstream paper claims.
---
## Dimension 4 — Why temporal median fails for this case
### Fact confirmation
FFmpeg `tmedian=radius=N` outputs the per-pixel median over `2N+1`
neighboring frames (Fact #10). This works as an OSD-removal trick when:
1. The OSD pixels are **stable** (same value every frame, or at least the
majority of frames).
2. The underlying scene **changes** per frame (so the median over the
window is dominated by underlying scene values, not OSD values).
In our recorded video, both the OSD values **and** the underlying scene
change per frame:
- Burned-in OSD text shows live counters like `00:04:24` that update each
second; pitch number `-3.7°` updates with gimbal motion; HDOP/SATS values
change.
- Underlying EO video shows the ground moving as the UAV moves.
### Reference comparison
A *motion-conditional* temporal median (Source #16, Source #17) — apply
the median only where motion is below a threshold — addresses the issue in
principle. But the static-OSD assumption underneath that approach does
not hold in our case: the OSD *positions* are static, but the *content*
in those positions is dynamic.
### Conclusion
Temporal median is **not suitable** for this video. The local PoC
(`poc3_crop_tmedian.mp4`) confirms: output shows ghosted, smeared OSD text
overlapping with smeared/aliased terrain — strictly worse than the
original for downstream feature matching.
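The smearing of terrain can be reproduced on a toy signal. The sketch below is an illustrative stand-in for a per-pixel temporal median over `2N+1` frames, not FFmpeg's implementation; the 1-D `scene` values and the one-pixel-per-frame camera motion are assumptions chosen to keep the example small.

```python
from statistics import median

def temporal_median(frames, radius):
    """Per-pixel median over 2*radius+1 frames centred on the middle frame.

    frames: list of equal-length 1-D pixel rows.
    """
    centre = len(frames) // 2
    window = frames[centre - radius: centre + radius + 1]
    return [median(px) for px in zip(*window)]

scene = [10, 200, 30, 180, 50, 160, 70, 140, 90, 120, 110]
width, radius = 5, 2

# Moving camera: each frame sees the scene shifted by one pixel.
moving = [scene[t:t + width] for t in range(2 * radius + 1)]
# Static camera: every frame sees the same pixels.
static = [scene[:width] for _ in range(2 * radius + 1)]

smeared = temporal_median(moving, radius)
unchanged = temporal_median(static, radius)
```

With a static camera the median returns the frame unchanged. With a moving camera each output pixel becomes a median along the motion direction, so the result no longer equals the centre frame and its contrast is reduced, which is the terrain smearing seen in the PoC.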
### Confidence
✅ High — direct experimental result.
---
## Dimension 5 — Why frame filtering by gimbal pointing is mandatory
### Fact confirmation
Frame at t=30 s shows gimbal pointed forward (sky/horizon visible), frame
at t=300 s shows gimbal pointed near nadir (ground texture filling frame)
(Fact #5). The gimbal is operator-controlled — mid-flight pointing is
common; only a subset of frames are nadir.
### Reference comparison
The project's nav-camera spec is "fixed downward (no gimbal)"
(`restrictions.md`). The C2 VPR component is trained / tuned on satellite
imagery with the assumption that the query is a top-down view of the
ground. Forward-looking frames (sky, distant horizon, oblique terrain) are
out-of-distribution for the VPR retrieval and would produce poor or
spurious matches.
### Conclusion
A fixture derived from this MKV that contains forward-looking frames is
not a valid representative-data fixture for the nadir-tuned pipeline. A
frame-level filter is needed — either:
- **Option I** (telemetry-based): parse `MOUNT_STATUS`/`MOUNT_ORIENTATION`
from the paired `2026-05-09 16-09-54.tlog`. Cheaper and more reliable.
- **Option J** (OCR-based): read the burned-in `-3.7°` text from the
attitude indicator. Needs no telemetry parser, but adds per-frame OCR
and OCR errors propagate.
### Confidence
✅ High — the gimbal-pointing fact is direct visual evidence; the
out-of-distribution argument is a derived consequence consistent with the
project's `restrictions.md` AC-2.1a "nadir ±10° bank/pitch" qualifier.
---
## Dimension 6 — Why this is a fixture-prep tooling concern, not a runtime concern
### Fact confirmation
Existing `solution_draft01.md` does not have a "data ingestion / fixture
prep" component (Fact #18). Replay fixtures appear in the test
infrastructure as already-cleaned MP4s + companion CSV/JSON.
### Reference comparison
The runtime nav-camera (per project spec) is the ADTi 20MP fixed-downward
without OSD. There is no expectation that the runtime pipeline ever sees
an OSD-laden frame from a multi-sensor gimbal. So the right place to
handle this MKV is **not** in the runtime — it is in the developer
tooling that produces fixtures.
### Conclusion
The Mode B revision is **additive, not subtractive**: it identifies a gap
(no fixture-prep component) and adds a developer tool. It does **not**
modify any runtime component. The C1/C2/C3/C4/C5 components in
`solution_draft01.md` are unchanged.
### Confidence
✅ High — direct read of `solution_draft01.md` confirms no such
component exists.
---
## Dimension 7 — Why the existing `flight_derkachi.mp4` precedent matters
### Fact confirmation
`flight_derkachi.mp4` is described as "cleaned/cropped replay fixture
rather than the raw camera feed" with "the rotating camera mechanically
fixed in a downward/nadir orientation" (Fact #16). It appears to have been
produced by a process that:
1. Disabled the gimbal OSD (likely via Topotek's GimbalControl Ethernet
utility).
2. Mechanically locked the gimbal nadir.
3. Recorded a 1080p clean stream.
4. Cropped to 880×720 (probably to remove residual borders or reframe).
### Reference comparison
The new MKV represents the *opposite* situation: OSD on, gimbal
unconstrained, GCS-screen-recorded rather than direct camera capture. The
existing fixture-creation procedure (steps 1-4 above) does not apply.
### Conclusion
A new, documented procedure is needed for the GCS-screen-recorded
class of input. That procedure is the deliverable of this Mode B run
(see `solution_draft02.md`). It complements the existing Derkachi
procedure — does not replace it.
### Confidence
✅ High.
@@ -0,0 +1,133 @@
# Validation Log — Mode B Video Extraction Run
## Validation scenario
A developer wants to use `2026-05-09 16-10-54.mkv` as a representative
replay fixture for the GPS-denied pipeline (analogous to
`flight_derkachi.mp4`), to extend testing to a new aircraft/camera class
(multi-sensor gimbal ball, multirotor profile) and a new operating
condition (low-altitude / non-nadir gimbal).
## Expected behavior under each candidate
### Option A (crop only)
Expected: produces a 740×525-ish MP4 with gimbal OSD elements still burned
in at the same screen positions. Replay infrastructure consumes it as-is.
Downstream C2/C3 detect features inside OSD text regions and produce false
matches. Drift accumulates, AC-2.1a fails.
**Actual** (from PoC1 + reasoning): predicted behavior matches. Output is a
valid MP4 but feeding it into a feature matcher would produce keypoints
inside the burned-in `-3.7°` and `FOV 53.2°` text regions, since those
regions have high local contrast.
### Option B (crop + mask-aware DISK)
Expected: same MP4 as Option A, plus a static `osd_mask.png` companion file.
Replay infrastructure modified to inject the mask into the C3 detect call.
DISK detector returns no keypoints inside masked regions (per Fact #13
score-map multiplication semantics). LightGlue matches only real-terrain
features. AC-2.1a passes for nadir frames.
**Actual** (predicted, no end-to-end PoC run): matches the documented
Kornia DISK contract. The change to the replay tooling is one optional
parameter passed to the `DISK.forward()` call. Risk: if the existing
production code path uses a wrapper around DISK that does not forward the
`mask=` parameter, the wrapper needs adjustment.
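If the wrapper cannot be changed, a different and weaker technique gives a similar effect: filter the keypoints after detection using the same keep-mask. The sketch below is a hypothetical helper, not part of the project or of Kornia; the OSD rectangle is borrowed from the first `delogo` rectangle of the FFmpeg MVE in `02_fact_cards.md`, and the descriptors are dummy data.

```python
import numpy as np

def filter_keypoints(keypoints_xy, keep_mask):
    """Return a boolean vector that is False for keypoints on suppressed pixels.

    keypoints_xy: (N, 2) array of (x, y) pixel coordinates.
    keep_mask:    (H, W) array, 1 = keep, 0 = suppress.
    """
    h, w = keep_mask.shape
    # Round to the nearest pixel and clamp so border keypoints stay in range.
    cols = np.clip(np.rint(keypoints_xy[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(keypoints_xy[:, 1]).astype(int), 0, h - 1)
    return keep_mask[rows, cols] > 0.5

keep_mask = np.ones((445, 900), dtype=np.float32)
keep_mask[35:150, 5:185] = 0.0            # OSD rectangle x=5, y=35, 180x115
kpts = np.array([[50.0, 60.0],             # inside the OSD rectangle
                 [400.0, 300.0]])          # on terrain
descs = np.arange(8, dtype=np.float32).reshape(2, 4)  # dummy descriptors

keep = filter_keypoints(kpts, keep_mask)
kpts, descs = kpts[keep], descs[keep]      # filter both with the same vector
```

Unlike score-map masking, this runs after keypoint selection, so part of the detector's top-`n` budget may already have been spent on OSD pixels. It is a stopgap rather than an equivalent.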
### Option C (crop + chained `delogo`)
Expected: a 740×525-ish MP4 with OSD regions replaced by interpolation
from neighboring pixels. Replay infrastructure unchanged. Downstream C2/C3
detect features in the interpolated regions; some weak features may be
detected (interpolation produces low-contrast smooth regions) but
significantly fewer than the original OSD regions.
**Actual** (from PoC4): output looks reasonable with three OSD rectangles
replaced by smoothed interpolation. Some chained `delogo` filters caused
issues when their rectangles touched image edges in earlier attempts —
mitigated by avoiding edge contact.
### Option F (temporal median)
Expected: smeared, ghosted output as both OSD and underlying terrain
average together over the window.
**Actual** (from PoC3): confirmed. Output shows visible motion-blur ghosts
of OSD text across the frame, plus desaturated and smeared underlying
terrain. **Disqualified.**
### Option I (tlog-based gimbal-pointing filter)
Expected: parse `MOUNT_STATUS`/`MOUNT_ORIENTATION` messages from the
companion `2026-05-09 16-09-54.tlog`, build a frame index → gimbal-pitch
table, drop frames where pitch is more than (e.g.) 10° off-nadir. Output
preserves only nadir-band frames, suitable for the level-flight VPR
assumption.
**Actual** (predicted): depends on whether the camera class actually emits
`MOUNT_STATUS` to the FC. ArduPilot's documented gimbal integration
(Source #4) confirms gimbal angles are reported back to the FC for some
Topotek models. **Verify** before relying on this — if the tlog lacks
gimbal angle, fall back to Option J (OCR).
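The frame index to gimbal-pitch table can be sketched as plain linear interpolation followed by a tolerance gate. This is a minimal sketch under stated assumptions: telemetry is already extracted as strictly increasing `(time_s, pitch_deg)` pairs, a pitch of -90° means nadir, and `video_start_s` (the offset of the first video frame on the telemetry clock) is a hypothetical parameter that has to be determined separately.

```python
from bisect import bisect_left

def pitch_at(t, samples):
    """Linearly interpolate pitch (deg) at time t from sorted (time_s, pitch_deg) samples."""
    times = [s[0] for s in samples]
    i = bisect_left(times, t)
    if i == 0:                      # before the first sample: hold first value
        return samples[0][1]
    if i == len(samples):           # after the last sample: hold last value
        return samples[-1][1]
    (t0, p0), (t1, p1) = samples[i - 1], samples[i]
    return p0 + (p1 - p0) * (t - t0) / (t1 - t0)

def nadir_frames(samples, n_frames, fps=30.0, video_start_s=0.0,
                 nadir_deg=-90.0, tol_deg=10.0):
    """Return indices of frames whose interpolated pitch is within tol_deg of nadir."""
    kept = []
    for idx in range(n_frames):
        t = video_start_s + idx / fps
        if abs(pitch_at(t, samples) - nadir_deg) <= tol_deg:
            kept.append(idx)
    return kept

# Synthetic telemetry: gimbal sweeps from horizon (0 deg) to nadir (-90 deg)
# over 3 s, then holds nadir.
telemetry = [(0.0, 0.0), (3.0, -90.0), (4.0, -90.0)]
kept = nadir_frames(telemetry, n_frames=120)
```

In the synthetic sweep the early forward-looking frames are dropped and the frames after the gimbal settles are kept. Holding the first and last sample outside the telemetry range is a design choice of this sketch; rejecting such frames outright would be the more conservative alternative.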
## Counterexamples
### Counterexample 1: gimbaled-fixed nadir flight
**Scenario**: the user happens to have already locked the gimbal nadir and
the entire recording is nadir-only. **Implication**: Option I/J becomes a
no-op; the rest of the pipeline works the same. **No change to
recommendation.**
### Counterexample 2: text values in OSD overlap with bright terrain
**Scenario**: the green attitude ladder text overlaps with bright sky in
forward-looking frames — does Option C `delogo` interpolation produce
something useful? **Predicted**: only the rectangle is touched; if the
rectangle covers sky-only pixels, the interpolation produces sky-colored
output (acceptable). If the rectangle straddles sky/horizon, the
interpolation may produce a smeared horizon line (mild artifact,
acceptable for non-nadir frames which would be filtered by Option I/J
anyway).
### Counterexample 3: future MKV recordings have different OSD layout
**Scenario**: a later flight uses a different GCS that places the OSD
elsewhere, breaking the hardcoded coordinates in the chained `delogo`
recipe. **Implication**: the developer tool must be parametrized, not
hardcoded. The proposed fixture-prep tool ships with a **per-recording
OSD profile** (a small YAML or JSON listing the GCS-chrome crop box and
the OSD rectangles) so adding a new recording class is a few-line config
change.
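A per-recording OSD profile of this kind could drive the FFmpeg filter chain directly. The sketch below assumes a hypothetical profile layout (keys `crop` and `osd_rects`, with rectangle coordinates relative to the cropped frame) and a hypothetical `build_filter_chain` helper; the numbers are the ones from the `crop` + `delogo` MVE command in `02_fact_cards.md`.

```python
def build_filter_chain(profile):
    """Build an FFmpeg -vf string from a per-recording OSD profile.

    The profile layout is hypothetical. Rectangle coordinates are
    relative to the cropped frame.
    """
    cw, ch, cx, cy = (profile["crop"][k] for k in ("w", "h", "x", "y"))
    parts = [f"crop={cw}:{ch}:{cx}:{cy}"]
    for r in profile["osd_rects"]:
        x, y, w, h = r["x"], r["y"], r["w"], r["h"]
        # delogo interpolates from surrounding pixels, so require a margin
        # of at least one pixel between the rectangle and every frame edge.
        if x < 1 or y < 1 or x + w > cw - 1 or y + h > ch - 1:
            raise ValueError(f"OSD rectangle {r} touches the cropped frame edge")
        parts.append(f"delogo=x={x}:y={y}:w={w}:h={h}")
    return ",".join(parts)

profile = {
    "crop": {"w": 900, "h": 445, "x": 50, "y": 25},
    "osd_rects": [
        {"x": 5, "y": 35, "w": 180, "h": 115},
        {"x": 395, "y": 5, "w": 275, "h": 70},
        {"x": 130, "y": 265, "w": 690, "h": 50},
    ],
}
vf = build_filter_chain(profile)
```

The resulting string can be passed to `ffmpeg -vf`. Rejecting edge-touching rectangles at build time turns the edge-contact problem seen in the earlier PoC attempts into an explicit configuration error.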
## Review checklist
- [x] Draft conclusions consistent with fact cards
- [x] No important dimensions missed (audio handling and frame-rate
normalization are noted as trivial in `00_question_decomposition.md`'s
Completeness Audit)
- [x] No over-extrapolation — claims are tied to specific facts
- [x] Conclusions actionable: a developer can follow the recipes in
`solution_draft02.md` to produce a new fixture
- [x] Every selected component matches the project's constraint matrix
(verified in `06_component_fit_matrix.md`)
- [x] Mismatches marked as disqualifiers (Option G, F)
- [x] Per-mode API capability verified for both lead candidates (Kornia
DISK in mask mode, FFmpeg `crop`+`delogo` chain) — both have saved
MVE blocks in `02_fact_cards.md`
## Open questions deferred to user / out-of-scope
1. **Does the paired `2026-05-09 16-09-54.tlog` contain `MOUNT_STATUS`
messages?** — not verified in this run. Recommendation: open the tlog
with `pymavlink` and grep for `MOUNT_STATUS`; if absent, fall back to
Option J or accept all frames + downstream covariance.
2. **Should this fixture replace `flight_derkachi.mp4` as the primary
replay fixture, or supplement it?** — supplement. Different aircraft
class, different sensor class. Both fixtures have value for
different test scenarios.
3. **Is the project willing to commit to one extra parameter on the
`tests/e2e/replay/conftest.py::_calibration_path()` family of helpers
for an optional `osd_mask.png` companion?** — recommended yes; it is
the cleanest path. Not blocking for this run; can be deferred to a
follow-up tracker ticket if Option C fallback is acceptable for now.
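Open question 1 can be answered with a short message census. This is a minimal sketch, assuming `pymavlink` is installed; `iter_tlog` and `gimbal_message_counts` are hypothetical helper names, and the counting function works on any iterable of objects exposing `get_type()`, so it can be exercised without a real log.

```python
from collections import Counter

GIMBAL_TYPES = ("MOUNT_STATUS", "MOUNT_ORIENTATION")

def iter_tlog(path):
    """Yield every MAVLink message in a .tlog (requires pymavlink)."""
    from pymavlink import mavutil  # imported lazily; third-party dependency
    log = mavutil.mavlink_connection(path)
    while True:
        msg = log.recv_match(blocking=False)
        if msg is None:             # end of log
            break
        yield msg

def gimbal_message_counts(messages, wanted=GIMBAL_TYPES):
    """Count how many messages of each wanted type appear in the stream."""
    counts = Counter({name: 0 for name in wanted})
    for msg in messages:
        name = msg.get_type()
        if name in counts:
            counts[name] += 1
    return dict(counts)

# Usage against the real log (file name as given in this run):
# counts = gimbal_message_counts(iter_tlog("2026-05-09 16-09-54.tlog"))
```

If both counts are zero, the telemetry path of Option I is unavailable for this recording and the fallback described in question 1 applies.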
## Validation conclusion
The recommended composition (B + I primary, C + I fallback, Z preferred
for future recordings) holds up under the validation scenarios. Move to
Step 7.5 (Component Applicability Gate).
@@ -0,0 +1,100 @@
# Component Fit Matrix — Video Extraction Pipeline
> Step 7.5 — Component Applicability Gate. Applies because this run is
> classified as Technical-component selection.
## 7.5.1 Top-level Component Fit Matrix
| Component Area | Candidate | Pinned Mode/Config | Option Family | Intended Role | API Capability Evidence | Mismatches / Disqualifiers | Status | Decision Rationale |
|---|---|---|---|---|---|---|---|---|
| Geometric crop (GCS chrome removal) | FFmpeg `crop` filter | Single static crop box `crop=W:H:X:Y` derived per recording from variance-map analysis; one-shot CLI invocation | Established production | Strip GCS UI sidebars/minimap/IR-PIP from recorded MKV | MVE block in `02_fact_cards.md` (PoC1 produced playable output); docs Source #1 | None for the user's pinned use case (offline tool) | **Selected** | Trivial, lossless within re-encode, deterministic |
| OSD-pixel handling (PRIMARY) | Kornia `DISK.forward(img, mask=...)` | mask-aware mode `(B, 1, H, W)` mask, multiplied into the DISK score map before NMS | Established production (existing project component) | Suppress keypoint detection inside burned-in OSD regions | MVE block in `02_fact_cards.md`; docs Source #6 | Requires the existing C3 wrapper around DISK to forward the `mask=` parameter (one-line code change) | **Selected** | No pixel modification; fabrication-risk = 0; matches existing C3 stack exactly |
| OSD-pixel handling (FALLBACK) | FFmpeg `delogo` chained | Multiple `delogo=x:y:w:h` filters chained for each OSD rectangle, after `crop`. **Important**: rectangles must NOT touch image edges (no border pixels to interpolate from) | Simple baseline | Replace burned-in OSD rectangles with interpolation-from-neighbors output, producing a plain MP4 ingestable by the existing replay path with no code changes | MVE block in `02_fact_cards.md` (PoC4); docs Source #1, #2 | Quality degrades when the OSD rectangle is large (e.g., the IR PIP at 360×210 px) — for that, `removelogo` mask or geometric crop is preferred | **Selected (fallback)** | Cheap, deterministic, no toolchain beyond FFmpeg |
| OSD-pixel handling (EXPERIMENTAL, version-fragile) | FFmpeg `removelogo` PNG mask | Single PNG mask covering all OSD elements, applied via `removelogo=mask.png` | Established production | One-shot OSD removal via mask | Source #3 docs claim it works; local test on FFmpeg 8.1 failed with `Invalid argument` (-22) | Version-fragile; could not be made to work in our local FFmpeg 8.1 with grayscale or RGB masks of correct dimensions | **Experimental only** | Try first if available on team's pinned FFmpeg version; fall back to chained `delogo` |
| OSD-pixel handling (REJECTED, fabrication risk) | VideoPainter / DiffuEraser / VidPivot (and any generative video inpainter) | Diffusion-backbone or I2V generative inpainter applied to the OSD mask | SOTA / Known bad | High-fidelity-looking OSD removal | Sources #12, #13 — papers explicitly describe synthesis | Generates terrain content that does not exist in the real recording. Project rule "Real Results, Not Simulated Ones" is unambiguous | **Rejected** | Disqualified by `meta-rule.mdc` |
| OSD-pixel handling (REJECTED, wrong assumption) | FFmpeg `tmedian` temporal median | `tmedian=radius=N` after `crop` | Adjacent domain | Suppress static OSD via temporal median | PoC3 test result | OSD values change every frame (timestamps, gimbal angle, HDOP), so the static-OSD assumption underneath the technique fails. Output is smeared | **Rejected** | Disqualified by direct experimental evidence |
| OSD-pixel handling (DEFER) | ProPainter | ProPainter checkpoint with mask-guided sparse Transformer | Current SOTA non-generative | High-quality OSD removal that respects no-fabrication constraint | Source #11 paper claims | Adds Python+PyTorch+CUDA toolchain; offline runtime ~0.25 s/frame at 480p; not necessary if Option B is implemented | **Experimental only** | Keep available for cases where Option B's downstream code change is rejected and the masked-region size is too large for `delogo` to interpolate cleanly |
| Frame filtering by gimbal pointing (PRIMARY) | `pymavlink` parser of paired `.tlog` for `MOUNT_STATUS` / `MOUNT_ORIENTATION` | Read paired `2026-05-09 16-09-54.tlog`, build a `frame_idx → gimbal_pitch_deg` table by interpolating message timestamps to the 30 fps frame timeline, drop frames where `abs(pitch - (-90°)) > 10°` (or per-project nadir tolerance) | Established production | Reject non-nadir frames before encoding the cleaned MP4 | Verified path (`pymavlink` is already used in the project's `derkachi.tlog` pipeline per `flight_derkachi/README.md`) | **Verify**: must confirm the paired tlog actually emits `MOUNT_STATUS` for the camera in question; if the gimbal does not report attitude over MAVLink, this option fails | **Needs user decision** (effectively Selected if tlog has the messages) | Cleanest signal; deterministic; reuses existing project tooling |
| Frame filtering by gimbal pointing (FALLBACK) | OCR (Tesseract or PaddleOCR) on the burned-in pitch-angle text | Per-frame OCR of the `-3.7°` text in the attitude indicator | Adjacent domain | Recover gimbal pitch when telemetry path is unavailable | OCR libraries are common; no project-specific MVE built | OCR errors propagate; need confidence thresholding | **Experimental only** | Use only if the tlog lacks gimbal attitude |
| Calibration JSON | Per-camera `khp20s30_factory.json`-equivalent for the Topotek/Viewpro multi-sensor ball | "factory_sheet" approximation per the AZ-702 precedent | Established production (project precedent) | Provide intrinsics consumable by `tests/e2e/replay/` | Source #10 (Derkachi camera_info.md showing the convention) | None — same approach as the existing fixture | **Selected** | Project-accepted precedent |
| Companion telemetry CSV | Existing `derkachi.tlog → data_imu.csv` exporter, retargeted to `2026-05-09 16-09-54.tlog` | Run the same exporter that produced `data_imu.csv` for Derkachi | Established production (existing project tool) | Provide synchronized IMU data for the new fixture | Existing pipeline (`flight_derkachi/data_imu.csv`); reuses `pymavlink` | None | **Selected** | Reuses existing tool |
## 7.5.2 Restrictions × Candidate-Mode Sub-Matrix
> The "constraints" here come from the run-specific Project Constraint Matrix
> in `00_question_decomposition.md` (Constraints C1 to C8: fixture must drop
> into existing replay infrastructure, no fabrication, etc.). Numbered AC
> from `acceptance_criteria.md` are referenced where directly relevant — but
> note this is a **fixture-prep tool, not a runtime component**, so most
> runtime-AC rows are N/A.
### Sub-Matrix — FFmpeg `crop` (geometric chrome removal)
| Constraint / AC | Candidate-mode behavior | Result | Evidence |
|---|---|---|---|
| C1 (ingestable by `tests/e2e/replay/`) | Output is a plain H.264 MP4 with arbitrary integer dimensions; the existing replay path consumes 880×720 (Derkachi) so any sub-1080p H.264 MP4 works | ✅ Pass | Fact #6 + #16 |
| C2 (no synthetic content) | `crop` discards pixels; never invents | ✅ Pass | Fact #6 |
| C3 (frame rate flexibility) | `crop` preserves frame rate | N/A | — |
| C5 (calibration honesty) | Crop changes principal point — calibration must be derived for the cropped frame, not the original 1280×720. Per-camera JSON should reflect the cropped image dimensions and shifted principal point | ✅ Pass (with derived calibration) | `flight_derkachi/camera_info.md` precedent |
| C6 (reproducibility) | Single deterministic FFmpeg command | ✅ Pass | Fact #6 |
| C7 (no false-positive features) | Cropped pixels are verbatim; remaining OSD is handled by other components | N/A (this component does not address OSD) | — |
| C8 (non-nadir frame filtering) | Crop is frame-agnostic | N/A | — |
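The principal-point shift noted in the C5 row can be sketched as follows. This is a minimal illustration, assuming a pinhole model with intrinsics expressed in pixels and a pure crop with no resize; the function name `crop_intrinsics` and every numeric value below are hypothetical and are not taken from any project calibration file.

```python
def crop_intrinsics(fx, fy, cx, cy, crop_x, crop_y, crop_w, crop_h):
    """Return pinhole intrinsics for a frame cropped at offset (crop_x, crop_y).

    A pure crop (no scaling) leaves the focal lengths unchanged and moves the
    principal point by the crop offset. Image size becomes the crop size.
    """
    return {
        "fx": fx,
        "fy": fy,
        "cx": cx - crop_x,
        "cy": cy - crop_y,
        "width": crop_w,
        "height": crop_h,
    }


# Hypothetical intrinsics for a 1280x720 source frame (illustrative only).
source = {"fx": 1000.0, "fy": 1000.0, "cx": 640.0, "cy": 360.0}

# Hypothetical crop rectangle: 900x445 anchored at (50, 25).
cropped = crop_intrinsics(**source, crop_x=50, crop_y=25, crop_w=900, crop_h=445)
print(cropped)
```

Depending on the crop offset, the shifted principal point can land far from the centre of the cropped image, or even outside it. That is valid for a pinhole model, but the calibration JSON consumer would need to tolerate it.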
### Sub-Matrix — Kornia DISK in mask-aware mode (PRIMARY)
| Constraint / AC | Candidate-mode behavior | Result | Evidence |
|---|---|---|---|
| C1 (ingestable by `tests/e2e/replay/`) | Requires one-line modification to the C3 detector wrapper to forward `mask=` | ✅ Pass with caveat | Fact #13 |
| C2 (no synthetic content) | Mask suppresses score-map values in OSD regions; pixel values are unchanged | ✅ Pass | Fact #13 + Fact #14 |
| C5 (calibration honesty) | Mask path orthogonal to calibration | N/A | — |
| C6 (reproducibility) | Mask is a static PNG file checked into the fixture directory | ✅ Pass | — |
| C7 (no false-positive features in OSD region) | DISK returns no keypoints in masked region by construction | ✅ Pass | Fact #13 |
| AC-2.1a (frame-to-frame registration >95%) | OSD region's keypoints removed before matching; matching depends only on real terrain features in the unmasked region | ✅ Pass for nadir frames (subject to C8 filter) | Fact #14 |
| AC-2.2 (Mean Reprojection Error <1.0 px frame-to-frame) | Reprojection error is computed on real-terrain matches only; not affected by mask | ✅ Pass | — |
### Sub-Matrix — FFmpeg `delogo` chained (FALLBACK)
| Constraint / AC | Candidate-mode behavior | Result | Evidence |
|---|---|---|---|
| C1 (ingestable) | Output is plain MP4 | ✅ Pass | Fact #7, #8 + PoC4 |
| C2 (no synthetic content) | `delogo` interpolates from neighbors — non-generative; no semantic terrain features synthesized | ✅ Pass with caveat (interpolation is *new* pixels, but they are computed from real adjacent pixels and produce smooth low-contrast regions unlikely to spawn false features) | Fact #7 |
| C6 (reproducibility) | Single deterministic FFmpeg command | ✅ Pass | Fact #7 |
| C7 (no false-positive features) | Smooth interpolated regions are unlikely to spawn high-confidence keypoints, but they CAN — DISK keypoints can fire on smooth gradient transitions; risk is real but small | ❓ Verify with empirical keypoint-density test on `poc4_delogo.mp4` vs the original | PoC4 visual inspection |
| AC-2.1a | Conditional on C7 result | ❓ Verify | — |
### Sub-Matrix — `pymavlink` MOUNT_STATUS frame filter (PRIMARY for non-nadir filtering)
| Constraint / AC | Candidate-mode behavior | Result | Evidence |
|---|---|---|---|
| C8 (non-nadir frame filtering) | Drops frames where gimbal pitch is off-nadir | ✅ Pass IF the tlog contains MOUNT_STATUS | Source #4 (ArduPilot Topotek docs reference gimbal angle messaging) |
| C6 (reproducibility) | Deterministic Python script | ✅ Pass | — |
| Tlog content actually contains MOUNT_STATUS for this gimbal | unverified — depends on whether the operator's autopilot was wired to receive and forward gimbal attitude | ❓ Verify | — |
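The interpolation step behind this filter can be sketched without any MAVLink dependency. The sketch below assumes the gimbal pitch samples have already been extracted from the tlog as sorted `(time_s, pitch_deg)` pairs, and that the offset between tlog time and video time (`video_start_s`) is known; both the helper names and the sample values are hypothetical.

```python
from bisect import bisect_left


def pitch_at(t, samples):
    """Linearly interpolate pitch at time t from sorted (time_s, pitch_deg) samples.

    Returns None when t lies outside the sampled interval (no extrapolation).
    """
    if not samples:
        return None
    times = [s[0] for s in samples]
    if t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:
        return samples[i][1]
    (t0, p0), (t1, p1) = samples[i - 1], samples[i]
    return p0 + (p1 - p0) * (t - t0) / (t1 - t0)


def frame_pitch_table(samples, n_frames, fps=30.0, video_start_s=0.0):
    """Map each frame index to an interpolated pitch (or None)."""
    return {k: pitch_at(video_start_s + k / fps, samples) for k in range(n_frames)}


def is_nadir(pitch_deg, tol_deg=10.0):
    """Keep a frame when pitch is within tol_deg of straight down (-90 degrees)."""
    return pitch_deg is not None and abs(pitch_deg - (-90.0)) <= tol_deg


# Hypothetical samples and a deliberately low frame rate to keep the demo short.
samples = [(0.0, -90.0), (1.0, -80.0), (2.0, -60.0)]
table = frame_pitch_table(samples, n_frames=4, fps=2.0)
kept = [k for k, p in table.items() if is_nadir(p)]
print(table, kept)
```

Frames whose timestamp falls outside the telemetry interval get `None` and are dropped by `is_nadir`, which is the conservative choice when gimbal attitude is unknown.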
### Sub-Matrix — Generative video inpainters (REJECTED)
| Constraint / AC | Candidate-mode behavior | Result | Evidence |
|---|---|---|---|
| C2 (no synthetic content) | Synthesizes terrain features that do not exist | ❌ Fail | Fact #12 |
## 7.5.3 Decision Summary
| Component area | Selected | Status notes |
|---|---|---|
| Chrome removal | FFmpeg `crop` | Selected, no caveats |
| OSD pixel handling (primary) | Kornia DISK mask-aware mode | Selected, conditional on one-line wrapper change |
| OSD pixel handling (fallback) | FFmpeg `delogo` chained | Selected fallback for fixtures that must drop into existing replay path with zero code changes |
| OSD pixel handling (other options) | `removelogo` (Experimental only — version-fragile), ProPainter (Experimental only — toolchain cost), `tmedian` (Rejected — disqualified by experiment), generative inpainters (Rejected — fabrication risk) | — |
| Non-nadir filter (primary) | `pymavlink` parser of paired tlog | Needs user decision: depends on whether tlog has MOUNT_STATUS |
| Non-nadir filter (fallback) | OCR on burned-in pitch text | Experimental only |
| Calibration JSON | Per-camera "factory_sheet" approximation | Selected (project precedent) |
| Telemetry CSV | Reuse existing tlog → CSV exporter | Selected |
**Blocker check**: One row is **Needs user decision** (tlog content not yet
verified). The user should be asked to either (a) confirm the tlog has
gimbal attitude, in which case Option I is Selected, or (b) accept Option J
fallback / accept all frames, in which case the fixture is supplied without
filtering and the test plan documents the limitation.
This open item does not block the *technical recommendation*: the user
can choose either path and the rest of the pipeline is unchanged. It is
recorded in `solution_draft02.md`'s "Open questions" section.
@@ -0,0 +1,441 @@
# Solution Draft 02 — Recovering a Clean Nadir Fixture from `2026-05-09 16-10-54.mkv`
> **Mode**: B (Solution Assessment) — additive. This draft does **not** modify any runtime component in `_docs/01_solution/solution_draft01.md` (C1…C12). It adds a *fixture-prep developer tool* that converts an OSD-burned-in GCS screen recording into the `flight_derkachi.mp4`-shaped artifact consumed by `tests/e2e/replay/test_az835_e2e_real_flight.py`.
>
> **Run date**: 2026-05-30. Continues the 2026-05-29 Mode B investigation (`_docs/00_research/_mode_b_2026-05-29_video_extraction/`), with one previously-open "Needs user decision" row now resolved by a fresh tlog scan (Section 5 below).
>
> **Extraction executed on 2026-05-30**. The primary path (§4.1 Steps 1 + 2) was run against this MKV; the resulting fixture is at [`../../00_problem/input_data/flight_topotek_2026-05-09/`](../../00_problem/input_data/flight_topotek_2026-05-09/) with its own short README. The non-nadir frame filter (§4.1 Steps 4-5) and the companion calibration / IMU files (§4.3) were intentionally NOT produced; they are downstream decisions, not part of "extract a clean video". The verified crop coordinates differ from the 2026-05-29 draft's PoC4 values (which assumed a smaller IR PIP); the current §4.1 numbers reflect what was actually used.
>
> **Backing artifacts** (read these alongside this draft for full evidence):
> - Question decomposition: [`../../00_research/_mode_b_2026-05-29_video_extraction/00_question_decomposition.md`](../../00_research/_mode_b_2026-05-29_video_extraction/00_question_decomposition.md)
> - Source registry (17 L1/L2/L3 sources): [`../../00_research/_mode_b_2026-05-29_video_extraction/01_source_registry.md`](../../00_research/_mode_b_2026-05-29_video_extraction/01_source_registry.md)
> - Fact cards (18 verified facts incl. local PoC results): [`../../00_research/_mode_b_2026-05-29_video_extraction/02_fact_cards.md`](../../00_research/_mode_b_2026-05-29_video_extraction/02_fact_cards.md)
> - Comparison framework: [`../../00_research/_mode_b_2026-05-29_video_extraction/03_comparison_framework.md`](../../00_research/_mode_b_2026-05-29_video_extraction/03_comparison_framework.md)
> - Reasoning chain: [`../../00_research/_mode_b_2026-05-29_video_extraction/04_reasoning_chain.md`](../../00_research/_mode_b_2026-05-29_video_extraction/04_reasoning_chain.md)
> - Validation log: [`../../00_research/_mode_b_2026-05-29_video_extraction/05_validation_log.md`](../../00_research/_mode_b_2026-05-29_video_extraction/05_validation_log.md)
> - Component fit matrix: [`../../00_research/_mode_b_2026-05-29_video_extraction/06_component_fit_matrix.md`](../../00_research/_mode_b_2026-05-29_video_extraction/06_component_fit_matrix.md)
> - Inputs: [`../../00_problem/input_data/10.05.2026/2026-05-09 16-10-54.mkv`](../../00_problem/input_data/10.05.2026/2026-05-09%2016-10-54.mkv) and [`../../00_problem/input_data/10.05.2026/2026-05-09 16-09-54.zip`](../../00_problem/input_data/10.05.2026/2026-05-09%2016-09-54.zip) (contains the paired `.tlog`)
---
## 1. TL;DR
**Yes, a clean nadir replay fixture can be recovered**, but the answer has two parts that must both be done; doing only one will produce a fixture that quietly misleads the runtime pipeline.
| Concern | Recommended primary | Cheap fallback (zero replay-code changes) |
|---|---|---|
| **Strip the GCS UI chrome (sidebars / minimap / IR-PIP)** | `ffmpeg crop` (deterministic, verbatim pixels) | — same — |
| **Handle the gimbal's burned-in OSD (attitude ladder, crosshair, FOV brackets, status text)** | **Inject a static `osd_mask.png` into the existing C3 `kornia.feature.DISK.forward(img, mask=…)` call.** Zero pixel modification, zero fabrication risk. | `ffmpeg crop + delogo` chain (interpolates from neighbor pixels — non-generative; locally verified working as `poc4_delogo.mp4` on the prior run) |
| **Filter out frames where the gimbal is not nadir** | **OCR the burned-in pitch text** (Option J) — the *previously* preferred telemetry path is dead for this recording (see Section 5). | Manual labeling pass: ship a small `frame_ranges.yaml` of nadir vs non-nadir segments alongside the MP4. ~30 min of human labour for a 6-minute clip. |
**Disqualified**:
- Generative video inpainters (VideoPainter / DiffuEraser / VidPivot et al.) — they fabricate terrain, which corrupts VPR/matching evaluation and violates `meta-rule.mdc` "Real Results, Not Simulated Ones".
- FFmpeg `tmedian` (temporal median) — both the OSD text *and* the underlying scene change every frame, so the median is smeared in both regions (locally verified as `poc3_crop_tmedian.mp4` on the prior run).
**Not available to this project** — the source MKV is what we have; there is no access to the camera, the GCS host, or the upstream RTSP. The ideal-but-out-of-reach path would have been to pull RTSP directly from the gimbal (`rtsp://192.168.144.108:554/stream=0` for the Topotek / Viewpro multi-sensor ball class) or extract DCIM with OSD off via the GimbalControl Ethernet utility (ArduPilot Source #4). That path is documented here only for completeness (and because the `flight_derkachi.mp4` fixture was produced that way, which is why it is already clean); it is not actionable for this data source. The only thing that could replace the cleanup pipeline is the original supplier voluntarily re-recording with OSD off — which is outside this project's control.
---
## 2. What is in `2026-05-09 16-10-54.mkv`
### 2.1 Technical metadata (verified via `ffprobe`)
| Field | Value |
|---|---|
| Container | Matroska (`.mkv`) |
| Video codec | H.264 |
| Resolution | 1280 × 720 |
| Frame rate | 30/1 fps |
| Duration | 367.00 s (~6 m 7 s) |
| File size | 115 044 545 bytes (~115 MB, ~110 MiB) |
| Audio | AAC (discard at re-encode time with `-an`) |
| Bitrate | ~2.5 Mbit/s |
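The table entries can be cross-checked against each other with a few lines of arithmetic. This is only a sanity check on the values listed above; the derived bitrate is an average over the whole file, so it includes the audio track and container overhead.

```python
duration_s = 367.00        # from the table
fps = 30                   # from the table (30/1)
size_bytes = 115_044_545   # from the table

frame_count = round(duration_s * fps)
avg_bitrate_mbit = size_bytes * 8 / duration_s / 1e6
size_mib = size_bytes / 2**20

print(frame_count, round(avg_bitrate_mbit, 2), round(size_mib, 1))
# → 11010 2.51 109.7
```

The expected frame count of about 11 010 is a useful upper bound when checking frame-range expressions later in the pipeline.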
### 2.2 Three overlay layers (Fact #1 — direct 12-frame variance analysis on the prior run)
```
+-----------------------------------------------------------------------+
| GCS chrome top bar (status, mode, GPS, alt) |
+------------+----------------------------------------+-----------------+
| | TOP-LEFT HUD (burned by camera) | IR PIP |
| SL STATS | · timer 00:04:24 | (live IR/thermal|
| (live | · EO/IR zoom, FOV 53.2 ° | stream stamped |
| sidebar | · target lat/lon | by the gimbal) |
| values) | | |
| | [actual EO video region] | |
| | crosshair, attitude ladder | |
| | FOV brackets, +/-3.7 ° pitch text | |
| | +-----------------+
| | BOTTOM-RIGHT GIMBAL TEXT | ROLL / SPEED / |
| | · 50.0823, 36.2515 | DIST / BATT / |
| | · azimuth, elevation | CURRENT (live) |
+------------+----------------------------------------+-----------------+
| Minimap / bottom status bar |
+-----------------------------------------------------------------------+
```
The three layers map to **two removal classes** (per Reasoning Chain Dimension 1):
| Layer | Renderer | Removal class |
|---|---|---|
| GCS UI chrome (sidebars, minimap, status bars) | the GCS application, **after** the video stream arrives | **Pure crop** — discard the columns and rows around the EO region; pixels are *outside* the camera's video, no inpainting needed. |
| Burned-in gimbal OSD (attitude ladder, crosshair, FOV brackets, top-left HUD, bottom-right text) | the **camera itself**, before the recorder ever saw the stream | **Mask or inpaint** — these pixels overwrite real EO pixels; you must either tell the downstream not to look at them (mask) or fill them with something visually plausible (inpaint). |
| IR PIP (upper-right rectangle, ~360×210 px) | the camera (it stamps its IR channel into a corner of the EO output) | **Crop it out** geometrically — the rectangle is large enough that `delogo`'s interpolation is poor; cleanest to just keep the crop tight enough to exclude it. |
### 2.3 The aircraft / gimbal class (corroborated by the tlog scan in Section 5)
- Airframe: **multirotor**, ArduCopter 4.6.3 on Pixhawk6X, QUAD/X frame (`STATUSTEXT`: `'Frame: QUAD/X'`). The project's spec'd nav-camera is a fixed-downward APS-C sensor on a *fixed-wing* per `restrictions.md` — this MKV represents a **different aircraft class** than the primary runtime target. That's a feature (extends test coverage) not a bug, but the fixture's metadata must record the discrepancy.
- Gimbal: 3-axis stabilised, pitch range -90° to +20°, yaw range ±180°, roll range ±30° (per the tlog's `GIMBAL_MANAGER_INFORMATION` capability advertisement; see Section 5). Consistent with the Topotek / Viewpro multi-sensor ball family identified by the prior run's visual inspection.
- GCS: **Mission Planner 1.3.83** (per `STATUSTEXT`). The project's `restrictions.md` mandates QGroundControl as the production GCS; for this fixture, the GCS is just whatever was used to make the recording — not a runtime concern.
---
## 3. Where this fits in the existing solution (and where it does not)
### 3.1 The gap
`_docs/01_solution/solution_draft01.md` defines components C1 (VIO), C2 (VPR), C3 (matchers), C4 (PnP), C5 (state estimator), C6 (tile cache), C7 (inference runtime), C8 (FC adapter), C10 (provisioning) and more — all *runtime* concerns on the Jetson Orin Nano Super. None of them is a "data ingestion / fixture-prep" component. Replay fixtures appear in `tests/e2e/replay/` as already-cleaned MP4s (Fact #18).
This is fine for `flight_derkachi.mp4`, which arrived pre-cleaned because the operator (a) disabled the gimbal OSD via Topotek's GimbalControl utility, (b) mechanically locked the gimbal nadir, and (c) recorded the direct camera feed at 1080p before cropping to 880×720 (Reasoning Chain Dimension 7).
`2026-05-09 16-10-54.mkv` arrived from the *opposite* situation: OSD on, gimbal unconstrained, GCS-screen-recorded. There is no existing project tool to turn this class of input into a usable fixture, which is why the question came up.
### 3.2 What this draft adds
A new **fixture-prep developer tool** (location: `tools/fixture_prep/` or `tests/fixtures/<flight_id>/build.py`, per existing project layout conventions) that converts one GCS-screen-recorded `.mkv` (plus its paired `.tlog`) into a directory of files in the same shape as `_docs/00_problem/input_data/flight_derkachi/`:
```
input_data/flight_topotek_2026-05-09/
├── flight_topotek_2026-05-09.mp4 # cleaned, cropped, OSD handled (Section 4)
├── osd_mask.png # 1-channel mask used by Option B (Section 4.1)
├── 2026-05-09 16-09-54.tlog # unpacked from the supplied .zip
├── data_imu.csv # SCALED_IMU2 + GLOBAL_POSITION_INT export
├── frame_ranges.yaml # nadir vs non-nadir frame ranges (Section 4.1 Step 4)
├── camera_info.md # camera class + calibration provenance
└── topotek_gimbal_factory.json # calibration JSON, factory-sheet provenance
```
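The tool is offline-only, deterministic, versioned, and reproducible: re-running it on the same input with the same FFmpeg/libx264 build is intended to produce byte-identical outputs. The only expected source of non-determinism is inside libx264; the options considered for constraining it are `-preset placebo -tune zerolatency` or pinning `-x264-params bframes=0:scenecut=0` (the choice depends on tolerated re-encode time), and fixing the encoder thread count (for example `-threads 1`) removes thread-count-dependent variation.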
**It does not change any runtime component.** C1…C12 are untouched. The single optional change in the *test* layer is to teach `tests/e2e/replay/conftest.py::_calibration_path()` (or its sibling helpers) to also look for a companion `osd_mask.png` if Option B is selected — and to forward it as an extra kwarg to whatever wraps `DISK.forward()` inside C3 (see Section 4.1 for the exact one-line change required).
---
## 4. The recommended pipeline (and the cheap fallback)
### 4.1 PRIMARY — A + B + J: crop, mask-aware DISK, OCR pitch filter
**Step 1 — Geometric crop (FFmpeg)**: discard the GCS chrome and the IR PIP rectangle.
```bash
INPUT="_docs/00_problem/input_data/10.05.2026/2026-05-09 16-10-54.mkv"
OUTPUT_DIR="_docs/00_problem/input_data/flight_topotek_2026-05-09"
mkdir -p "$OUTPUT_DIR"
# Crop coordinates *verified for this specific MKV* by direct frame inspection +
# luminance/saturation discontinuity detection on 2026-05-30 (see fixture README).
# Output: 610x260 EO-only region anchored at (250, 440) in the 1280x720 source.
#
# Why these numbers and not the prior research's draft (crop=900:445:50:25):
# - The IR PIP is much larger than initially estimated: it spans roughly
# x=620..1140, y=35..383 in the source frame. The prior crop's right edge at
# x=950 cut into the PIP and the left edge at x=50 still included the
# GCS left icon strip + the SL STATS panel.
# - The IR PIP rectangle (~520 wide x ~350 tall) is too large for FFmpeg
# `delogo` to interpolate cleanly. Geometric exclusion is the only honest
# option for this recording.
# - The largest *clean* EO rectangle (no GCS chrome, no IR PIP, almost no
# burned-in OSD) is in the lower half of the frame, below the IR PIP.
#
# Re-verify if you ingest a future recording with a different GCS layout or
# IR-PIP placement; see fixture README for the derivation script.
ffmpeg -y -i "$INPUT" \
-vf "crop=610:260:250:440" \
-an -c:v libx264 -crf 18 -preset medium \
"$OUTPUT_DIR/flight_topotek_2026-05-09.mp4"
```
This is Option A: verbatim pixels, no inpainting, no fabrication, deterministic. On this specific MKV the crop is tight enough that essentially **no** burned-in gimbal OSD survives inside the output (verified on 8 sample frames spread across the recording; variance analysis flagged 1/158 600 = 0.0006 % of pixels as "static OSD-like"). Steps 2-5 below are still relevant for other recordings of this class that may need a looser crop.
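The geometry behind the Step 1 crop can be checked with a short script. It uses only the numbers quoted in the command and its comments (the 1280×720 source, the `crop=610:260:250:440` rectangle, and the approximate IR PIP extent x=620..1140, y=35..383); the helper names are illustrative.

```python
def rect_bounds(x, y, w, h):
    """Convert an FFmpeg-style (x, y, w, h) rectangle to half-open bounds."""
    return (x, y, x + w, y + h)


def overlaps(a, b):
    """True when two half-open (x0, y0, x1, y1) rectangles share any pixel."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


SRC_W, SRC_H = 1280, 720
crop = rect_bounds(250, 440, 610, 260)   # crop=610:260:250:440
ir_pip = (620, 35, 1140, 383)            # approximate extent from the Step 1 comments

inside_source = 0 <= crop[0] and 0 <= crop[1] and crop[2] <= SRC_W and crop[3] <= SRC_H
clear_of_pip = not overlaps(crop, ir_pip)
n_pixels = 610 * 260
flagged_pct = 100 * 1 / n_pixels         # one flagged pixel, as a percentage

print(crop, inside_source, clear_of_pip, n_pixels, round(flagged_pct, 4))
```

The crop spans x=250..860 and y=440..700, so it sits entirely below the IR PIP's bottom edge at y≈383. The 158 600-pixel count matches the denominator quoted above.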
**Step 2 — Build the OSD mask** (one-time, then versioned in the repo).
Build a 1-channel PNG of the same dimensions as the cropped output (610×260 for the Step 1 crop above), where white (255) marks real EO pixels, where DISK is allowed to detect keypoints, and black (0) marks burned-in OSD pixels, where DISK must suppress detection. One quick recipe:
```python
# tools/fixture_prep/build_osd_mask.py
from pathlib import Path

import cv2
import numpy as np

# Manual alternative: open a sample cropped frame in any image editor, trace the OSD
# rectangles by hand, and export a grayscale PNG with the same dimensions as the
# cropped frame. The script below is the deterministic alternative: it builds the
# mask from a pixel-stability test over a sample of frames.
# NOTE: point `src` at the Step 1 output; adjust the file name if Step 1 wrote it elsewhere.
src = Path("input_data/flight_topotek_2026-05-09/flight_topotek_2026-05-09_cropped.mp4")
cap = cv2.VideoCapture(str(src))
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
sample = [int(n_frames * f) for f in (0.05, 0.15, 0.30, 0.45, 0.60, 0.75, 0.95)]
stack = []
for idx in sample:
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
    ok, frame = cap.read()
    if not ok:
        raise RuntimeError(f"frame {idx} unreadable")
    stack.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32))
cap.release()
stack = np.stack(stack)                          # (N, H, W)
std = stack.std(axis=0)                          # (H, W)
mean = stack.mean(axis=0)                        # (H, W)
# OSD heuristic: text/lines render as high-brightness, low-std (the *position* is stable
# even if the *value* in that position changes; the bounding box itself does not move).
# Real EO terrain over a moving camera is mid-brightness, high-std.
osd_likely = (std < 12.0) & (mean > 180.0)       # white/bright pixels stable in position
# Dilate so the mask also covers anti-aliased edges around each stroke.
osd_likely = cv2.dilate(osd_likely.astype(np.uint8) * 255, np.ones((7, 7), np.uint8))
mask = 255 - osd_likely                          # invert: white=keep, black=suppress
cv2.imwrite("input_data/flight_topotek_2026-05-09/osd_mask.png", mask)
```
The std/mean thresholds above are the right shape but should be tuned by eye on this specific recording. The prior research's variance analysis showed `mean per-pixel std ≈ 30-40` for both the EO region and the GCS sidebars, so a `< 12` threshold cleanly separates burned-in OSD (which has near-zero std in pixels that contain text strokes) from real video. Inspect the saved `osd_mask.png` against a sample frame and refine the thresholds (or hand-trace) before committing it.
**Step 3 — One-line wrapper change in C3 to forward the mask** (the only code change this draft proposes).
`kornia.feature.DISK.forward(img, mask=None)` already accepts a mask argument of shape `(B, 1, H, W)` with values in `[0, 1]`, and multiplies the score map by it before NMS — keypoints in masked regions are suppressed by construction, with no preprocessing of the pixels themselves (Fact #13, Source #6 Kornia docs L1). The LightGlue maintainer (`cvg/LightGlue#97`) explicitly recommends this approach over post-hoc keypoint filtering.
Locate the project's existing `kornia.feature.DISK(...)` instantiation and the call site that invokes it (per `solution_draft01.md` C3 the detector is DISK + LightGlue; the call site is somewhere under `src/.../matchers/` or the runtime DISK wrapper). Pass `mask=<tensor>` through, where `<tensor>` is loaded once at fixture-init time from `osd_mask.png` and re-used per frame.
Sketch (project-specific paths to be filled in):
```python
# Existing
feats = self.disk(img, n=self.n_kp)
# Becomes
feats = self.disk(img, n=self.n_kp, mask=self.osd_mask)
```
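`self.osd_mask` is loaded once in `__init__` by decoding `fixture_dir / "osd_mask.png"` as a grayscale image (for example with `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`; raw `read_bytes()` output is still PNG-encoded), scaling it from `[0, 255]` to `[0, 1]`, and reshaping it to a `(1, 1, H, W)` float32 tensor. If the fixture has no `osd_mask.png`, the wrapper falls through to the original mask-less call, so the existing `flight_derkachi.mp4` continues to work unchanged.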
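The fall-through behaviour described above can be sketched independently of the detector. The helper names `find_osd_mask` and `detector_kwargs` are hypothetical, and whether the installed Kornia version's `DISK.forward` accepts a `mask` keyword is taken from this draft's Fact #13 rather than verified here, so the kwarg is only added when a mask is actually present.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_osd_mask(fixture_dir) -> Optional[Path]:
    """Return the fixture's osd_mask.png path, or None when the fixture has no mask."""
    candidate = Path(fixture_dir) / "osd_mask.png"
    return candidate if candidate.is_file() else None


def detector_kwargs(n_kp: int, mask=None) -> dict:
    """Build keyword arguments for the detector call.

    The 'mask' key is omitted entirely when no mask is available, so fixtures
    without a mask take exactly the same call path as before.
    """
    kwargs = {"n": n_kp}
    if mask is not None:
        kwargs["mask"] = mask
    return kwargs


# Demonstrate both paths with a throwaway fixture directory.
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp)
    without_mask = find_osd_mask(fixture)           # no mask file yet
    (fixture / "osd_mask.png").write_bytes(b"")     # empty placeholder file
    with_mask = find_osd_mask(fixture)
    found_name = with_mask.name if with_mask else None

print(without_mask, found_name)
print(detector_kwargs(2048), detector_kwargs(2048, mask="<tensor>"))
```

Omitting the key, instead of passing `mask=None`, keeps the mask-less call byte-for-byte identical to the current one, which makes it easy to confirm that existing fixtures are unaffected.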
**Step 4 — Frame-level filter (Option J: OCR pitch from burned-in attitude indicator)**.
The previously-preferred telemetry path (Option I — parse `MOUNT_STATUS` / `GIMBAL_DEVICE_ATTITUDE_STATUS` from the paired `.tlog`) is **not viable for this recording**. See Section 5 for the evidence. The remaining viable paths are:
- **(J) OCR the burned-in pitch number**: the gimbal renders pitch as text such as `-3.7°` in the attitude indicator. Use Tesseract or PaddleOCR per frame on a fixed crop around that text region, then drop frames where `|pitch - (-90°)| > 10°`. Quick recipe (project must add `pytesseract` or `paddleocr` to `requirements-dev.txt`):
```python
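# tools/fixture_prep/frame_pitch_from_ocr.py
import json
import re

import cv2
import pytesseract

pat = re.compile(r"(-?\d+\.\d+)")
# NOTE: OCR needs a video in which the attitude-indicator text is still visible. If the
# Step 1 crop excludes that text, point `src` at the original MKV instead; cropping
# does not change frame indices.
src = "input_data/flight_topotek_2026-05-09/flight_topotek_2026-05-09_cropped.mp4"
# Pixel bounds of the attitude-indicator text in `src`. These are placeholders:
# measure them on a sample frame before running.
x0, y0, x1, y1 = 0, 0, 0, 0
if x1 <= x0 or y1 <= y0:
    raise SystemExit("set x0, y0, x1, y1 to the pitch-text region first")
cap = cv2.VideoCapture(src)
out = []
for frame_idx in range(int(cap.get(cv2.CAP_PROP_FRAME_COUNT))):
    ok, frame = cap.read()
    if not ok:
        break
    roi = frame[y0:y1, x0:x1]
    # --psm 7 treats the ROI as a single text line; the whitelist limits output to digits, '-' and '.'.
    text = pytesseract.image_to_string(roi, config="--psm 7 -c tessedit_char_whitelist=-0123456789.")
    m = pat.search(text)
    # Frames where no number is recognised are recorded with pitch_deg = None.
    out.append({"frame": frame_idx, "pitch_deg": float(m.group(1)) if m else None})
cap.release()
with open("input_data/flight_topotek_2026-05-09/frame_pitch.json", "w") as f:
    json.dump(out, f)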
```
Then derive `frame_ranges.yaml` from `frame_pitch.json` by clustering contiguous frame indices whose pitch is within the nadir band.
- **(Manual)**: for a one-off fixture of 6 minutes, the cheapest deterministic alternative is a *manual labeling pass*: a developer watches the cropped video once, notes the ranges where the gimbal is at nadir (`[0:42, 1:53]`, `[3:58, 6:06]`, etc.), and saves the ranges as `frame_ranges.yaml`. ~30 minutes of human labour, zero failure modes, fully reproducible by anyone who can re-watch the same MP4. This is the recommended path **for this specific fixture** unless additional GCS-screen-recorded fixtures are expected, in which case the OCR script amortises across them.
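For option (J), the clustering of contiguous nadir frames into ranges could look like the sketch below. It assumes records shaped like the `frame_pitch.json` entries (`{"frame": int, "pitch_deg": float or None}`); the function name `nadir_ranges`, the `min_len` parameter, and the sample records are hypothetical, and serialising the result to `frame_ranges.yaml` is left out because the YAML schema is not specified here.

```python
def nadir_ranges(records, tol_deg=10.0, min_len=1):
    """Cluster contiguous frame indices whose pitch is within tol_deg of -90 degrees.

    Returns a list of inclusive (start_frame, end_frame) tuples. Frames with
    pitch_deg = None (OCR miss) are treated as non-nadir and break a range.
    """
    ranges = []
    start = prev = None
    for rec in sorted(records, key=lambda r: r["frame"]):
        idx, pitch = rec["frame"], rec["pitch_deg"]
        ok = pitch is not None and abs(pitch - (-90.0)) <= tol_deg
        if ok and start is not None and idx == prev + 1:
            prev = idx                      # extend the current range
            continue
        if start is not None:               # close the current range
            if prev - start + 1 >= min_len:
                ranges.append((start, prev))
            start = prev = None
        if ok:                              # open a new range
            start = prev = idx
    if start is not None and prev - start + 1 >= min_len:
        ranges.append((start, prev))
    return ranges


# Hypothetical OCR output for eight frames.
pitches = [-88.0, -91.0, None, -89.0, -90.0, -70.0, -85.0, -95.0]
records = [{"frame": i, "pitch_deg": p} for i, p in enumerate(pitches)]
print(nadir_ranges(records))
# → [(0, 1), (3, 4), (6, 7)]
```

Because a single OCR miss splits a range, a practical version would probably bridge short gaps or raise `min_len` to discard very short segments; both are tuning decisions for the fixture author.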
**Step 5 — Filter the cropped video down to nadir-only frames** (using `frame_ranges.yaml` from Step 4).
```bash
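# Re-encode with a select filter restricting to the nadir frame ranges.
# Build the select expression programmatically from frame_ranges.yaml.
# The input is the Step 1 crop. Step 1 as written saves it as
# flight_topotek_2026-05-09.mp4, so write (or copy) it to *_cropped.mp4 first;
# FFmpeg cannot read and overwrite the same file in one command.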
ffmpeg -y -i "$OUTPUT_DIR/flight_topotek_2026-05-09_cropped.mp4" \
-vf "select='between(n,1260,3390)+between(n,7140,10980)',setpts=N/FRAME_RATE/TB" \
-an -c:v libx264 -crf 18 -preset slow \
"$OUTPUT_DIR/flight_topotek_2026-05-09.mp4"
```
(Numbers above are illustrative; the actual `between(n, …)` segments come from the YAML.)
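Building the `select` expression from a list of frame ranges is a small string-assembly job. The sketch below assumes the ranges have already been loaded as inclusive `(start_frame, end_frame)` tuples; the helper names are hypothetical, and the `mm:ss` inputs are the illustrative ranges from the manual-labeling option.

```python
def time_to_frame(mm_ss, fps=30):
    """Convert an 'm:ss' timestamp to a frame index at a constant frame rate."""
    minutes, seconds = mm_ss.split(":")
    return (int(minutes) * 60 + int(seconds)) * fps


def select_expr(ranges):
    """Build an FFmpeg select expression that keeps frames inside any range."""
    if not ranges:
        raise ValueError("at least one frame range is required")
    for start, end in ranges:
        if start < 0 or end < start:
            raise ValueError(f"invalid range: ({start}, {end})")
    return "+".join(f"between(n,{start},{end})" for start, end in ranges)


ranges = [
    (time_to_frame("0:42"), time_to_frame("1:53")),
    (time_to_frame("3:58"), time_to_frame("6:06")),
]
vf = f"select='{select_expr(ranges)}',setpts=N/FRAME_RATE/TB"
print(vf)
# → select='between(n,1260,3390)+between(n,7140,10980)',setpts=N/FRAME_RATE/TB
```

At 30 fps the illustrative time ranges convert to exactly the frame numbers used in the command above, so the same helper can regenerate the filter string whenever the ranges change.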
### 4.2 FALLBACK — A + C + (J or manual): no code change to C3
If teaching the C3 wrapper to forward `mask=…` is rejected for this fixture (a reasonable choice to keep `tests/e2e/replay/` purely "drop in an MP4 + a JSON + a CSV" with zero glue code), substitute **Option C** for Option B: replace each burned-in OSD rectangle with FFmpeg `delogo` interpolation.
**On this specific MKV, Option C collapses into Option A.** The verified crop in §4.1 Step 1 already produces a near-zero-OSD output (0.0006 % of pixels flagged as static-OSD-like over 8 sample frames), so there are no rectangles left to delogo. The Option-C-versus-Option-B trade-off only re-emerges for hypothetical *other* recordings of this class that need a looser crop — e.g. a recording where the camera HUD is positioned differently and there is no clean rectangle wholly outside it. The generic recipe shape for such a recording would be:
```bash
# Template only — instantiate W/H/X/Y for the looser crop and (x,y,w,h)
# rectangles for each surviving OSD region, all in cropped-frame coords.
# The delogo filter in FFmpeg 8.1 has no 'band' parameter (removed); only x, y,
# w, h, show remain. Rectangles must NOT touch the cropped frame's edge.
ffmpeg -y -i "$INPUT" \
-vf "crop=W:H:X:Y,\
delogo=x=x1:y=y1:w=w1:h=h1,\
delogo=x=x2:y=y2:w=w2:h=h2,\
..." \
-an -c:v libx264 -crf 18 -preset medium \
"$OUTPUT_DIR/${FIXTURE_ID}_delogo.mp4"
```
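The template above can be instantiated programmatically, which also makes the edge constraint checkable before FFmpeg runs. This is a minimal sketch: the function name `delogo_chain` and the rectangle values are hypothetical, and it emits only the `x`, `y`, `w`, `h` options named in the template's comments.

```python
def delogo_chain(crop, rects):
    """Build a 'crop=...,delogo=...,...' filter string.

    crop  : (w, h, x, y) in source-frame coordinates, as in crop=W:H:X:Y
    rects : iterable of (x, y, w, h) in cropped-frame coordinates

    Raises ValueError when a rectangle touches or crosses the cropped frame's
    edge, because delogo needs surrounding pixels to interpolate from.
    """
    crop_w, crop_h, crop_x, crop_y = crop
    parts = [f"crop={crop_w}:{crop_h}:{crop_x}:{crop_y}"]
    for x, y, w, h in rects:
        if w <= 0 or h <= 0:
            raise ValueError(f"empty rectangle: {(x, y, w, h)}")
        if x < 1 or y < 1 or x + w > crop_w - 1 or y + h > crop_h - 1:
            raise ValueError(f"rectangle touches the frame edge: {(x, y, w, h)}")
        parts.append(f"delogo=x={x}:y={y}:w={w}:h={h}")
    return ",".join(parts)


# Hypothetical looser crop with two surviving OSD rectangles.
vf = delogo_chain(crop=(900, 445, 50, 25),
                  rects=[(20, 15, 120, 40), (700, 380, 150, 30)])
print(vf)
```

Failing fast in Python is cheaper than discovering a rejected rectangle partway through a batch of re-encodes.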
**Important caveats on Option C** (when it does need to be used):
- `delogo` rectangles must not touch the image edge (no surrounding pixels to interpolate from).
- `delogo` produces *new* pixels (interpolated from the immediate neighbourhood). They are not synthesised semantic terrain content, but they *are* new pixels that did not exist in the original camera capture. The downstream feature detector *can* fire on smooth interpolated regions (DISK keypoints sometimes detect on smooth gradient transitions). This is the residual risk of Option C versus Option B; quantify it by running both pipelines on a few nadir segments and comparing the keypoint density inside the masked regions on the Option C output to zero (the trivially-correct value Option B delivers).
- `delogo` does **not** scale to rectangles much larger than ~50 px in their shorter dimension. For this MKV the IR PIP is ~520 × 350 px and cannot be cleanly delogo'd at all — geometric exclusion (i.e. the corrected crop in §4.1 Step 1) is the only honest option.
- Then chain Step 4 + Step 5 from Section 4.1 on top of this Option C output to get the same nadir-only result.
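The keypoint-density comparison suggested in the caveats can be reduced to counting keypoints that fall inside the OSD rectangles. The sketch below assumes keypoints are available as `(x, y)` pixel coordinates in the cropped frame and rectangles as `(x, y, w, h)`; the names and sample values are hypothetical, and no detector is invoked.

```python
def in_any_rect(x, y, rects):
    """True when point (x, y) lies inside at least one (x, y, w, h) rectangle."""
    return any(rx <= x < rx + rw and ry <= y < ry + rh for rx, ry, rw, rh in rects)


def keypoints_in_rects(keypoints, rects):
    """Count keypoints that fall inside any of the given rectangles."""
    return sum(1 for x, y in keypoints if in_any_rect(x, y, rects))


# Hypothetical OSD rectangle and keypoints from the two pipelines.
osd_rects = [(20, 15, 120, 40)]
kps_option_c = [(25.0, 20.0), (139.5, 54.9), (140.0, 30.0), (400.0, 300.0)]
kps_option_b = [(400.0, 300.0), (410.5, 280.0)]

count_c = keypoints_in_rects(kps_option_c, osd_rects)
count_b = keypoints_in_rects(kps_option_b, osd_rects)
print(count_c, count_b)
# → 2 0
```

A non-zero count on the Option C output, against the expected zero for Option B, is the residual risk in concrete terms; how many such keypoints are tolerable is a project decision.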
### 4.3 Companion files
| File | Source | Conventions |
|---|---|---|
| `flight_topotek_2026-05-09.mp4` | Sections 4.1 / 4.2 | H.264, 30 fps, exactly the cropped + OSD-handled + nadir-filtered video. Matches the `flight_derkachi.mp4` shape (any sub-1080p H.264 MP4 the replay harness already accepts). |
| `osd_mask.png` | Section 4.1 Step 2 (only for Option B) | Grayscale PNG with the same dimensions as the cropped video (610×260 for the Step 1 crop), white=keep, black=suppress. Versioned alongside the MP4. |
| `2026-05-09 16-09-54.tlog` | Just unzip the `.zip` from `input_data/10.05.2026/` | Identical to the supplied tlog; ArduCopter 4.6.3 (Pixhawk6X), 133 191 messages over 446.8 s. |
| `data_imu.csv` | Reuse the existing `derkachi.tlog → data_imu.csv` exporter, retargeted at this new tlog | 10 Hz table of `SCALED_IMU2` and `GLOBAL_POSITION_INT` per the `flight_derkachi/README.md` convention. |
| `frame_ranges.yaml` | Section 4.1 Step 4 | List of `(start_frame, end_frame)` pairs the fixture considers "valid nadir frames". |
| `camera_info.md` | Hand-written, modelled on `flight_derkachi/camera_info.md` | Records: camera class (Topotek / Viewpro 3-axis multi-sensor ball, per ArduPilot Source #4 + the tlog's `GIMBAL_MANAGER_INFORMATION` cap_flags), recording chain (camera HDMI → GCS app → desktop screen recorder → MKV), and the calibration's provenance flag (`factory_sheet`, per AZ-702 precedent — Fact #17). |
| `topotek_gimbal_factory.json` | Same shape as `khp20s30_factory.json` (Fact #17) | Per-camera intrinsics + lens distortion from the camera's published spec sheet. Mark provenance `factory_sheet`. Residual focal-length error expected in the 13 % band, same envelope the project already accepts for `flight_derkachi.mp4`. |
---
## 5. New evidence — the paired tlog's gimbal state
The prior 2026-05-29 run left exactly one unresolved row in `06_component_fit_matrix.md`:
> **Frame filtering by gimbal pointing (PRIMARY) — `pymavlink` parser of paired `.tlog` for `MOUNT_STATUS` / `MOUNT_ORIENTATION`** → **Needs user decision**: depends on whether the paired tlog actually emits `MOUNT_STATUS` for the camera in question; if the gimbal does not report attitude over MAVLink, this option fails.
This draft resolves that row by directly scanning the paired tlog. Here is the evidence.
### 5.1 What is in the tlog (`2026-05-09 16-09-54.tlog`, unpacked from the supplied `.zip`)
Scanned with `pymavlink 2.4.49` (`MAVLINK20=1`, `MAVLINK_DIALECT=all`): 133 191 messages over 446.8 s, 46 distinct message types. Relevant subset:
| Message type | Count | Mean rate | Notes |
|---|---|---|---|
| `HEARTBEAT` | 1492 | 3.3 Hz | 4 endpoints: `(sys=1, comp=1, autopilot=3, type=2)` = ArduCopter / QUAD multirotor; `(sys=1, comp=191, autopilot=0, type=6)` = a GCS-class component co-resident on sysid 1; `(sys=255, comp=0, autopilot=8, type=18)` and `(sys=255, comp=190, autopilot=8, type=6)` = Mission Planner GCS. |
| `ATTITUDE` (vehicle) | 4174 | 9.3 Hz | Body pitch range: min -12.47°, max +4.68°, mean 3.95°. **This is the airframe attitude**, not the gimbal. |
| `GIMBAL_DEVICE_ATTITUDE_STATUS` | **4338** | **9.7 Hz** | **All 4338 messages carry the identity quaternion `q = (1.0, 0.0, 0.0, 0.0)`** (exactly one distinct quaternion value across the entire flight). `flags = 0x002c` = `YAW_IN_VEHICLE_FRAME | PITCH_LOCK | ROLL_LOCK`. `failure_flags = 0x00000000`. |
| `GIMBAL_MANAGER_INFORMATION` | 26 | discovery exchange | `gimbal_device_id=1`. Capability: pitch range `[-90°, +20°]`, yaw range `±180°`, roll range `±30°`. cap_flags=206847. Confirms the gimbal physically *can* reach nadir; it simply is not reporting its current pose. |
| `COMMAND_LONG` distinct cmds | — | — | Only 6 distinct command IDs: 183 (`DO_SET_SERVO ch=15 pwm=1950`, one shot, possibly a release / trigger), 400 (`COMPONENT_ARM_DISARM`, twice), 511 (`SET_MESSAGE_INTERVAL`, 11×), 512 (`REQUEST_MESSAGE`, 100×), 520 (`REQUEST_AUTOPILOT_CAPABILITIES`, 47×), and 42428 (`MAV_CMD_DO_SEND_BANNER` in the ArduPilot dialect, a version-banner request; params all zero). **None of these is a gimbal control command (no `MAV_CMD_DO_MOUNT_CONTROL` = 205, no `MAV_CMD_DO_GIMBAL_MANAGER_PITCHYAW` = 1000).** |
| `NAMED_VALUE_FLOAT` names | 1 unique | — | Only `ESCs_CURR`. No gimbal-related custom variable. |
| `STATUSTEXT` | 38 | — | Includes `'ArduCopter 4.6.3 - Agile(px6) (92b0cd78)'`, `'Pixhawk6X 001E0036 …'`, `'Frame: QUAD/X'`, `'Mission Planner 1.3.83'`. No gimbal-related text. |
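
A scan of this kind can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the script used for the run: `summarise_messages` and `iter_tlog` are hypothetical helper names, and quaternions are rounded to 6 decimals before comparison. The `pymavlink` calls used (`mavutil.mavlink_connection`, `recv_match`, `get_type`) are the standard log-reading entry points. `MAVLINK20` and `MAVLINK_DIALECT` must be set in the environment before `pymavlink` is imported.

```python
from collections import Counter
from typing import Any, Iterable, Iterator, Set, Tuple


def summarise_messages(messages: Iterable[Any]) -> Tuple[Counter, Set[Tuple[float, ...]]]:
    """Count messages per type and collect the distinct gimbal quaternions."""
    counts: Counter = Counter()
    quaternions: Set[Tuple[float, ...]] = set()
    for msg in messages:
        msg_type = msg.get_type()
        counts[msg_type] += 1
        if msg_type == "GIMBAL_DEVICE_ATTITUDE_STATUS":
            # q is (w, x, y, z); rounding is an assumption to absorb float noise.
            quaternions.add(tuple(round(float(c), 6) for c in msg.q))
    return counts, quaternions


def iter_tlog(path: str) -> Iterator[Any]:
    """Yield every message in a tlog (requires the third-party pymavlink package)."""
    from pymavlink import mavutil  # imported lazily so the summary logic stays testable

    conn = mavutil.mavlink_connection(path)
    while True:
        msg = conn.recv_match(blocking=False)
        if msg is None:  # end of log file
            break
        yield msg


if __name__ == "__main__":
    import sys

    counts, quats = summarise_messages(iter_tlog(sys.argv[1]))
    print(counts.most_common(10))
    print("distinct gimbal quaternions:", len(quats))
```

A single distinct quaternion across thousands of samples is the signature of a placeholder stream.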
### 5.2 What the identity quaternion really means
`q = (1, 0, 0, 0)` is the null rotation. Per the MAVLink `GIMBAL_DEVICE_ATTITUDE_STATUS` spec, that means "the gimbal is in its default forward-pointing pose" (no rotation away from the body frame's +X). But the prior run's frame-by-frame visual inspection saw the gimbal *clearly pointing forward at t=30s and clearly pointing nadir at t=300s* (Fact #5). The two observations cannot both hold: if the gimbal had truly stayed at the null rotation throughout the flight, every frame would look like the forward-pointing view at t=30s.
The reconciliation: **the gimbal is being moved by the operator, but the actual angle is not being reported back over MAVLink in this recording.** The gimbal driver is emitting the placeholder identity quaternion every ~100 ms because the ArduPilot mount driver expects to publish *something* at the configured rate, but no real angle is available (the gimbal device either isn't wired to talk back over MAVLink, or it is wired but isn't responding, or it is responding on a different transport — most likely the camera's own Ethernet protocol talking directly to the GCS, bypassing the autopilot).
This is consistent with:
- Mission Planner being able to control Topotek / Viewpro gimbals directly over Ethernet/UDP, separate from the ArduPilot MAVLink path.
- The `DO_SET_SERVO ch=15 pwm=1950` one-shot pointing to a *trigger* (likely shutter / record-toggle), not a per-frame angle command.
- The absence of both `MAV_CMD_DO_MOUNT_CONTROL` and `MAV_CMD_DO_GIMBAL_MANAGER_PITCHYAW` among the `COMMAND_LONG` command IDs.
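
To make the contradiction concrete, the sketch below converts a `(w, x, y, z)` quaternion to a pitch angle and flags an all-identity stream. It assumes the usual aerospace ZYX Euler convention, in which nadir corresponds to a pitch of -90°; the function names and the tolerance are illustrative only.

```python
import math
from typing import Iterable, Sequence

IDENTITY = (1.0, 0.0, 0.0, 0.0)


def quaternion_pitch_deg(q: Sequence[float]) -> float:
    """Pitch in degrees for a (w, x, y, z) quaternion, ZYX Euler convention."""
    w, x, y, z = q
    sin_pitch = 2.0 * (w * y - z * x)
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    sin_pitch = max(-1.0, min(1.0, sin_pitch))
    return math.degrees(math.asin(sin_pitch))


def is_placeholder_stream(quaternions: Iterable[Sequence[float]], tol: float = 1e-6) -> bool:
    """True when the stream is non-empty and every sample is the identity quaternion."""
    samples = list(quaternions)
    return bool(samples) and all(
        all(abs(a - b) <= tol for a, b in zip(q, IDENTITY)) for q in samples
    )


if __name__ == "__main__":
    s = math.sqrt(0.5)
    print(quaternion_pitch_deg(IDENTITY))           # forward-pointing pose
    print(quaternion_pitch_deg((s, 0.0, -s, 0.0)))  # what a nadir pose would look like
    print(is_placeholder_stream([IDENTITY] * 4338))
```

A gimbal that really reached nadir would have reported a quaternion close to `(0.707, 0, -0.707, 0)`, which the identity check would not accept.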
### 5.3 Effect on the recommendation
| Component-fit row (from the 2026-05-29 component fit matrix) | Original status | Status after Section 5 |
|---|---|---|
| Frame filtering by gimbal pointing (PRIMARY) — `pymavlink` parser of `MOUNT_STATUS`/`MOUNT_ORIENTATION` from the paired tlog | **Needs user decision** | **❌ Rejected for this recording.** The gimbal-attitude message the tlog actually carries, `GIMBAL_DEVICE_ATTITUDE_STATUS`, is present at 9.7 Hz, but every quaternion is the placeholder identity value; the data carries zero information about the actual gimbal angle. |
| Frame filtering by gimbal pointing (FALLBACK) — OCR on the burned-in pitch text | Experimental only | **✅ Selected as primary** (or the manual labeling pass, for one-off fixtures of this size). |
**No other row in `06_component_fit_matrix.md` changes.** The pixel-handling recommendation (Option B primary, Option C fallback) and the rejections (Options F generative / G temporal-median) stand.
### 5.4 Why this is not a project-runtime issue
The project's *runtime* nav-camera per `restrictions.md` is the ADTi 20MP fixed-downward (no gimbal at all). The runtime pipeline never sees a multi-sensor-ball gimbal-attitude stream. So the gap discovered here ("Mission-Planner-driven Topotek gimbals don't expose attitude over MAVLink") is only relevant for fixture preparation, not for the runtime contract. The follow-up "change the recording procedure to enable Topotek's own attitude-publish path" would require camera/GCS access this project does not have, so it is unavailable as a workaround. For any further recordings of this class, **plan on OCR-based pitch recovery (Option J) — or a manual labelling pass per fixture — as the standing strategy**, not as a temporary fallback.
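
Whichever source supplies the per-frame pitch (OCR or a manual pass), turning it into `frame_ranges.yaml` entries is the same grouping step. The following is a sketch under assumptions: the pitch values are taken to be in degrees with nadir at -90°, unreadable frames are represented as `None`, ranges are inclusive, and the tolerance and minimum-length defaults are placeholders rather than project-approved values.

```python
from typing import List, Optional, Sequence, Tuple


def nadir_frame_ranges(
    pitch_deg: Sequence[Optional[float]],
    nadir_deg: float = -90.0,
    tolerance_deg: float = 10.0,
    min_frames: int = 30,
) -> List[Tuple[int, int]]:
    """Group consecutive near-nadir frames into inclusive (start_frame, end_frame) pairs."""
    ranges: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for idx, pitch in enumerate(pitch_deg):
        # A frame whose pitch could not be read is conservatively treated as non-nadir.
        is_nadir = pitch is not None and abs(pitch - nadir_deg) <= tolerance_deg
        if is_nadir and start is None:
            start = idx
        elif not is_nadir and start is not None:
            if idx - start >= min_frames:
                ranges.append((start, idx - 1))
            start = None
    # Close a run that extends to the final frame.
    if start is not None and len(pitch_deg) - start >= min_frames:
        ranges.append((start, len(pitch_deg) - 1))
    return ranges


if __name__ == "__main__":
    # Hypothetical readings: forward-looking, then nadir, one unreadable frame, short nadir tail.
    readings = [-3.7] * 5 + [-88.0] * 4 + [None] + [-90.0] * 2
    print(nadir_frame_ranges(readings, min_frames=3))
```

Dropping runs shorter than `min_frames` filters out brief sweeps through nadir while the operator is still moving the gimbal.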
---
## 6. Component fit summary (consolidated)
> Full detail per row in [`../../00_research/_mode_b_2026-05-29_video_extraction/06_component_fit_matrix.md`](../../00_research/_mode_b_2026-05-29_video_extraction/06_component_fit_matrix.md). The table below is the *post-tlog-scan* update.
| Component area | Candidate | Pinned mode | Status | Notes |
|---|---|---|---|---|
| GCS-chrome geometric crop | FFmpeg `crop` filter | `crop=900:445:50:25` per recording, derived from variance-map analysis | **Selected** | Trivial, lossless within re-encode, deterministic. PoC1 produced playable output on the prior run. |
| OSD pixel handling (PRIMARY) | Kornia `DISK.forward(img, mask=…)` | mask-aware mode, `(B, 1, H, W)` mask multiplied into the DISK score map before NMS | **Selected** | No pixel modification; fabrication-risk = 0. Requires one-line C3 wrapper change to forward `mask=`. Already API-verified against Kornia docs L1 (Source #6) + LightGlue maintainer reply (Source #5). |
| OSD pixel handling (FALLBACK) | FFmpeg `delogo` chained | multiple `delogo=x:y:w:h` after `crop`, rectangles inside the cropped frame | **Selected (fallback)** | PoC4 produced `poc4_delogo.mp4` on the prior run. Pick this if the C3 wrapper change is rejected for this fixture. |
| OSD pixel handling | FFmpeg `removelogo` PNG mask | `removelogo=mask.png` | **Experimental only** | Failed locally with `Invalid argument` (-22) on FFmpeg 8.1; works in older versions per Source #15. Try first on your team's pinned FFmpeg before falling through to chained `delogo`. |
| OSD pixel handling | ProPainter (non-generative video inpainter, ICCV 2023) | mask-guided sparse Transformer with flow completion | **Experimental only** | Highest visual quality among non-generative options. Adds PyTorch+CUDA toolchain; ~0.25 s/frame at 480p (Fact #11). Use only if a future recording's masked regions are too large for `delogo` interpolation. |
| OSD pixel handling | VideoPainter / DiffuEraser / VidPivot (and any generative video inpainter) | diffusion-backbone I2V generative inpainter | **❌ Rejected** | Synthesises terrain content. Disqualified by `meta-rule.mdc` "Real Results, Not Simulated Ones" (Fact #12). |
| OSD pixel handling | FFmpeg `tmedian` temporal median | `tmedian=radius=N` | **❌ Rejected** | Burned-in OSD text values change every frame, so the static-OSD assumption underneath the technique fails. PoC3 confirmed: smeared, ghosted output (Fact #10). |
| Non-nadir frame filter (PRIMARY) | `pymavlink` MOUNT_STATUS / GIMBAL_DEVICE_ATTITUDE_STATUS | parse paired tlog → `frame_idx → gimbal_pitch_deg` table | **❌ Rejected for this recording (NEW)** | Section 5: message present at 9.7 Hz, but all 4338 quaternions are identity (1,0,0,0) — no real angle data. |
| Non-nadir frame filter (PRIMARY, new) | OCR (Tesseract or PaddleOCR) on burned-in pitch text | per-frame OCR of the `3.7°` text in the attitude indicator | **Selected (NEW)** | Was "Experimental only" pre-tlog-scan; promoted to primary now that the telemetry path is dead. Add `pytesseract` or `paddleocr` to `requirements-dev.txt`. |
| Non-nadir frame filter (one-off alternative) | Manual labeling pass | developer watches the 6-min clip, marks ranges, commits `frame_ranges.yaml` | **Selected (for this fixture only)** | Cheapest deterministic path; recommended for this specific MKV unless additional GCS-screen-recorded fixtures are expected. |
| Calibration JSON | Per-camera `topotek_gimbal_factory.json` (same shape as `khp20s30_factory.json`) | "factory_sheet" provenance per AZ-702 precedent | **Selected** | Project-accepted (Fact #17). Residual 13 % focal-length error envelope. |
| Companion telemetry CSV | Existing `derkachi.tlog → data_imu.csv` exporter, retargeted | unchanged | **Selected** | Reuses existing tool. |
| Source recovery (Option Z) | Pull RTSP / extract DCIM from gimbal with OSD disabled via Topotek GimbalControl utility (ArduPilot Source #4) | n/a — out-of-band camera access | **❌ Not available — no camera / GCS access for this data source.** | Documented here only because it is the cleanest path *in principle* and because the existing `flight_derkachi.mp4` fixture was produced this way. Not actionable for this project's data pipeline. The only way it returns to the table is if the original supplier voluntarily re-records with OSD off — outside this project's control. |
---
## 7. Testing strategy
### 7.1 Functional / integration
1. **Crop-coordinate validation.** Decode 5 frames from `flight_topotek_2026-05-09.mp4` and assert they are 900×445; assert the IR PIP is *not* present in the right-third of the frame; assert the GCS sidebars are *not* present in the leftmost / rightmost columns.
2. **OSD mask validation.** Open `osd_mask.png`, assert dimensions match the MP4, assert the union of black pixels covers ≥95 % of the union of OSD rectangles you would otherwise pass to `delogo`. Optionally, render `cropped_frame * (mask/255.)` and eyeball that the burned-in text is dimmed to black while the EO terrain is preserved.
3. **DISK mask-aware contract.** Add a unit test under `tests/unit/c3_matchers/` that loads the existing DISK wrapper, passes a synthetic 900×445 image with a checkerboard pattern + a corner rectangle of pure white, passes a mask zeroing the corner, and asserts no keypoint is returned at coordinates inside that corner.
4. **End-to-end replay smoke test.** Add a sibling test to `tests/e2e/replay/test_az835_e2e_real_flight.py` parameterised over `flight_topotek_2026-05-09` and confirm the pipeline runs to completion. Track end-to-end accuracy separately under AC-1.x.
5. **Frame-range filter sanity.** Iterate `frame_ranges.yaml` and assert: every range's `start_frame < end_frame`, ranges are non-overlapping, and the union covers at least N seconds of footage (where N is a project-chosen minimum-fixture-duration).
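
The frame-range sanity check in item 5 could look roughly like this. It is a sketch, assuming the YAML has already been parsed into `(start_frame, end_frame)` pairs, that ranges are inclusive, and that the fixture runs at 30 fps; `validate_frame_ranges` and the `min_seconds` default are hypothetical stand-ins for the project-chosen N.

```python
from typing import Iterable, Tuple


def validate_frame_ranges(
    ranges: Iterable[Tuple[int, int]],
    fps: float = 30.0,
    min_seconds: float = 60.0,
) -> float:
    """Check ordering, non-overlap, and minimum coverage; return covered seconds."""
    total_frames = 0
    prev_end = None
    for start, end in sorted(ranges):
        if not start < end:
            raise ValueError(f"range ({start}, {end}) must satisfy start_frame < end_frame")
        # Ranges are inclusive, so sharing a boundary frame counts as an overlap.
        if prev_end is not None and start <= prev_end:
            raise ValueError(f"range ({start}, {end}) overlaps a range ending at {prev_end}")
        total_frames += end - start + 1
        prev_end = end
    covered_seconds = total_frames / fps
    if covered_seconds < min_seconds:
        raise ValueError(f"only {covered_seconds:.1f} s covered; need at least {min_seconds} s")
    return covered_seconds


if __name__ == "__main__":
    # Hypothetical ranges; real values come from frame_ranges.yaml.
    print(validate_frame_ranges([(0, 299), (600, 899)], fps=30.0, min_seconds=10.0))
```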
### 7.2 Non-functional
- **Reproducibility**: re-run the entire `tools/fixture_prep/` script twice on a clean checkout and assert byte-identical outputs (pin libx264 settings; pin `pytesseract` version; pin Python version in `pyproject.toml`).
- **Throughput**: the entire fixture-prep run for one 6-minute MKV should complete in well under 10 minutes on a developer workstation (no AC requirement; sanity ceiling).
- **No fabrication regression**: extract keypoints from the masked region of the Option B output and assert count == 0; for the Option C output, assert keypoint count is at most 5 % of the unmasked terrain keypoint count.
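
The no-fabrication check reduces to counting keypoints that land on suppressed mask pixels. This is a minimal sketch, assuming keypoints are `(x, y)` pixel coordinates and the mask is a row-major grid following the `osd_mask.png` convention (0 = suppress, non-zero = keep). The function names are illustrative, and how the keypoints are extracted is left to the existing matcher wrapper.

```python
import math
from typing import Iterable, Sequence, Tuple


def count_suppressed_keypoints(
    keypoints: Iterable[Tuple[float, float]], mask: Sequence[Sequence[int]]
) -> int:
    """Count keypoints whose containing pixel is black (suppressed) in the mask."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    inside = 0
    for x, y in keypoints:
        col, row = math.floor(x), math.floor(y)  # sub-pixel coordinate -> containing pixel
        if 0 <= row < height and 0 <= col < width and mask[row][col] == 0:
            inside += 1
    return inside


def passes_fabrication_check(
    keypoints: Sequence[Tuple[float, float]],
    mask: Sequence[Sequence[int]],
    option: str,
    max_ratio: float = 0.05,
) -> bool:
    """Option B must have zero suppressed keypoints; Option C at most max_ratio of terrain ones."""
    suppressed = count_suppressed_keypoints(keypoints, mask)
    terrain = len(keypoints) - suppressed
    if option == "B":
        return suppressed == 0
    return suppressed <= max_ratio * terrain


if __name__ == "__main__":
    # Tiny hypothetical mask: the right-hand column is an OSD region.
    mask = [[255, 255, 0], [255, 255, 0]]
    print(passes_fabrication_check([(0.2, 0.1), (1.0, 1.0)], mask, option="B"))
```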
---
## 8. Open questions
1. **Adopt the one-line C3 wrapper change?** Section 4.1 Step 3 proposes forwarding `mask=…` through the existing DISK call. This is the lowest-risk highest-quality path (Option B) but requires touching `src/.../matchers/`. The fallback (Option C, chained `delogo`) avoids this code change entirely and only touches the fixture-prep script. **Either is defensible** — the choice depends on whether the team is willing to formalise mask-aware fixtures as a first-class concept in the replay layer (recommended yes) or wants to keep that layer "drop in an MP4" pure (defensible too).
2. **Use OCR for the frame filter, or just hand-label this one fixture?** For a single 6-minute clip, the manual labeling pass is cheaper than building, validating, and pinning a Tesseract/PaddleOCR pipeline. Use OCR only if you expect to ingest additional fixtures of the same class (same gimbal HUD layout) and want the script to amortise. Either way, the YAML output format is the same — so this can be revisited later.
3. **Future recordings — Option Z (direct RTSP/DCIM extraction with OSD off) is not available** to this project: no camera or GCS access exists for this data source. The only theoretical path to bypass the cleanup pipeline is to ask the original supplier to re-record with OSD disabled (via Topotek's GimbalControl utility on their side, or by setting `MNT1_OPTIONS` / equivalent on their flight controller). Whether to even make that request is a separate decision; it is not a technical option this project can execute on its own. Assume Option Z stays unavailable and plan all future fixtures of this class around the OCR / manual-labelling path in §4 + §5.3.
---
## 9. References
L1 (official documentation / source):
- FFmpeg `delogo` filter, vf_delogo.c [Sources #1, #2 in source registry]
- FFmpeg `removelogo` filter, vf_removelogo.c [Source #3]
- FFmpeg `tmedian` filter [Source #8]
- ArduPilot Topotek Gimbal docs [Source #4]
- LightGlue maintainer reply on score-map masking, issue cvg/LightGlue#97 [Source #5]
- Kornia `DISK.forward()` documentation [Source #6]
- DISK upstream source (`disk/model/disk.py`) [Source #7]
- Project: `_docs/00_problem/input_data/flight_derkachi/README.md` [Source #9]
- Project: `_docs/00_problem/input_data/flight_derkachi/camera_info.md` [Source #10]
L2 (peer-reviewed):
- ProPainter (ICCV 2023) [Source #11]
- VideoPainter (arXiv 2503.05639, 2025) [Source #12] — referenced as disqualified
- VidPivot / DiffuEraser comparison (arXiv 2510.21461, 2025) [Source #13]
- DISK paper (NeurIPS 2020, arXiv 2006.13566) [Source #14]
L3 (practitioner / community):
- "Removing obnoxious logos from videos" blog [Source #15]
- Conditional Temporal Median Filter reference [Source #16]
- Foundry Nuke TemporalMedian reference [Source #17]
In-repo cross-references:
- `_docs/01_solution/solution_draft01.md` — existing solution; C2 (MixVPR TensorRT INT8+FP16), C3 (DISK + LightGlue), C5 (GTSAM iSAM2 + CombinedImuFactor) [Source #R1]
- `_docs/00_research/06_component_fit_matrix/00_summary.md` — confirms no fixture-prep component exists in the runtime [Source #R2]
- `_docs/00_problem/input_data/flight_derkachi/khp20s30_factory.json` — existing per-camera calibration JSON precedent [Source #R3]
This-run new evidence:
- ffprobe verification of `2026-05-09 16-10-54.mkv` technical metadata (Section 2.1)
- `pymavlink` scan of unpacked `2026-05-09 16-09-54.tlog` (Section 5) — 133 191 messages over 446.8 s, GIMBAL_DEVICE_ATTITUDE_STATUS at 9.7 Hz, all identity quaternions
---
## 10. Related artifacts
| Artifact | Status |
|---|---|
| `_docs/00_research/_mode_b_2026-05-29_video_extraction/` | Complete through Step 7.5 — this draft is its Step 8 deliverable, with one row updated by new tlog evidence |
| `_docs/01_solution/solution_draft01.md` | Untouched. C1 through C12 unchanged. This draft is purely additive. |
| `_docs/01_solution/solution.md` | Untouched. |
| `_docs/00_problem/input_data/10.05.2026/2026-05-09 16-10-54.mkv` | Source MKV. Untouched. |
| `_docs/00_problem/input_data/10.05.2026/2026-05-09 16-09-54.zip` | Source tlog archive. Untouched. |
| Future: `input_data/flight_topotek_2026-05-09/` | The cleaned fixture directory this draft proposes producing. Not yet created. |
| Future: `tools/fixture_prep/` | The reproducible script that will produce the above. Not yet created. |
@@ -44,22 +44,22 @@ The user was presented options A-E on 2026-05-29 and skipped the choice. Per "us
## Goal ## Goal
Provision a NetVLAD-VGG16 `.pt` checkpoint at `models/netvlad/netvlad.pt` + matching `BackboneConfig` entry in `configs/operator_replay.yaml` so the AZ-839 fixture skip-gate clears and the AZ-840 orchestrator can compose c10 (+ c2_vpr) into a real pipeline run. Provision a NetVLAD-VGG16 `.pt` checkpoint at `models/net_vlad/net_vlad.pt` + matching `BackboneConfig` entry in `configs/operator_replay.yaml` so the AZ-839 fixture skip-gate clears and the AZ-840 orchestrator can compose c10 (+ c2_vpr) into a real pipeline run. File stem MUST equal `c2_vpr.net_vlad.MODEL_NAME == "net_vlad"` — the PyTorch FP16 runtime uses `path.stem` as the architecture-registry lookup key.
## Scope ## Scope
1. **Write `scripts/mk_netvlad_checkpoint.py`** — generates a deterministic `.pt`: 1. **Write `scripts/mk_netvlad_checkpoint.py`** — generates a deterministic `.pt`:
* Loads `torchvision.models.vgg16(weights="IMAGENET1K_V1")` features, slices `[:-2]` to match `_NetVladVgg16.encoder`. * Loads `torchvision.models.vgg16(weights="IMAGENET1K_V1")` features, slices `[:-2]` to match `_NetVladVgg16.encoder`.
* Seeds `torch.manual_seed(0)`, instantiates `make_net_vlad_vgg16(num_clusters=64, encoder_dim=512, descriptor_dim=4096)`, overlays ImageNet features into `encoder.*` keys. * Seeds `torch.manual_seed(0)`, instantiates `make_net_vlad_vgg16(num_clusters=64, encoder_dim=512, descriptor_dim=4096)`, overlays ImageNet features into `encoder.*` keys.
* Saves to `models/netvlad/netvlad.pt`. * Saves to `models/net_vlad/net_vlad.pt`.
* Prints SHA-256 + key composition. * Prints SHA-256 + key composition.
2. **Add `models/**/*.pt`, `*.onnx`, `*.engine` to `.gitattributes` for git-lfs**. 2. **Add `models/**/*.pt`, `*.onnx`, `*.engine` to `.gitattributes` for git-lfs**.
3. **Commit `models/netvlad/netvlad.pt` via git-lfs**. 3. **Commit `models/net_vlad/net_vlad.pt` via git-lfs**.
4. **Update `configs/operator_replay.yaml`**: 4. **Update `configs/operator_replay.yaml`**:
```yaml ```yaml
c2_vpr: c2_vpr:
strategy: net_vlad strategy: net_vlad
backbone_weights_path: /opt/models/netvlad/netvlad.pt backbone_weights_path: /opt/models/net_vlad/net_vlad.pt
netvlad_descriptor_dim: 4096 netvlad_descriptor_dim: 4096
warn_top1_threshold: 0.30 warn_top1_threshold: 0.30
@@ -67,7 +67,7 @@ Provision a NetVLAD-VGG16 `.pt` checkpoint at `models/netvlad/netvlad.pt` + matc
workspace_mb: 4096 workspace_mb: 4096
backbones: backbones:
- model_name: net_vlad - model_name: net_vlad
onnx_path: /opt/models/netvlad/netvlad.pt onnx_path: /opt/models/net_vlad/net_vlad.pt
expected_input_shape: [3, 480, 480] expected_input_shape: [3, 480, 480]
input_name: input input_name: input
``` ```
@@ -78,7 +78,7 @@ Provision a NetVLAD-VGG16 `.pt` checkpoint at `models/netvlad/netvlad.pt` + matc
## Acceptance Criteria ## Acceptance Criteria
* **AC-1**: `models/netvlad/netvlad.pt` exists in the repo (via git-lfs) with documented provenance + licence. * **AC-1**: `models/net_vlad/net_vlad.pt` exists in the repo (via git-lfs) with documented provenance + licence.
* **AC-2**: `torch.load(path, weights_only=True)` + `load_state_dict(strict=True)` on `make_net_vlad_vgg16()` succeeds locally (round-trip verified before commit). * **AC-2**: `torch.load(path, weights_only=True)` + `load_state_dict(strict=True)` on `make_net_vlad_vgg16()` succeeds locally (round-trip verified before commit).
* **AC-3**: `configs/operator_replay.yaml` declares the `net_vlad` backbone in `c10_provisioning.backbones` and the `c2_vpr` block with matching `backbone_weights_path`. * **AC-3**: `configs/operator_replay.yaml` declares the `net_vlad` backbone in `c10_provisioning.backbones` and the `c2_vpr` block with matching `backbone_weights_path`.
* **AC-4**: `JETSON_SSH_ALIAS=<alias> bash scripts/run-tests-jetson.sh` no longer SKIPs `test_az840_e2e_real_flight_orchestration` with the empty-backbones message. * **AC-4**: `JETSON_SSH_ALIAS=<alias> bash scripts/run-tests-jetson.sh` no longer SKIPs `test_az840_e2e_real_flight_orchestration` with the empty-backbones message.
+3 -2
@@ -1,6 +1,7 @@
# NetVLAD-VGG16 Checkpoint — Provenance & License # NetVLAD-VGG16 Checkpoint — Provenance & License
**Artifact**: `models/netvlad/netvlad.pt` **Artifact**: `models/net_vlad/net_vlad.pt`
**Note**: File stem MUST equal `c2_vpr.net_vlad.MODEL_NAME == "net_vlad"` — the PyTorch FP16 runtime uses `path.stem` as the architecture-registry lookup key.
**Generated**: 2026-05-29 (AZ-965) **Generated**: 2026-05-29 (AZ-965)
**Architecture**: project-owned `_NetVladVgg16` in `src/gps_denied_onboard/components/c2_vpr/_net_vlad_architecture.py` **Architecture**: project-owned `_NetVladVgg16` in `src/gps_denied_onboard/components/c2_vpr/_net_vlad_architecture.py`
**Parameters**: 149,002,112 (~568.4 MiB fp32) **Parameters**: 149,002,112 (~568.4 MiB fp32)
@@ -44,7 +45,7 @@ export SSL_CERT_FILE=$(python -c "import certifi; print(certifi.where())")
# Generate the checkpoint: # Generate the checkpoint:
python scripts/mk_netvlad_checkpoint.py python scripts/mk_netvlad_checkpoint.py
# → writes models/netvlad/netvlad.pt # → writes models/net_vlad/net_vlad.pt
``` ```
The script is **deterministic** (`torch.manual_seed(0)` before the random-init layers, IMAGENET1K_V1 weights are content-addressed). Re-running on a different machine yields the same SHA-256. The script is **deterministic** (`torch.manual_seed(0)` before the random-init layers, IMAGENET1K_V1 weights are content-addressed). Re-running on a different machine yields the same SHA-256.
+7 -5
@@ -17,15 +17,17 @@
# * `SATELLITE_PROVIDER_URL` → c11_tile_manager.satellite_provider_url # * `SATELLITE_PROVIDER_URL` → c11_tile_manager.satellite_provider_url
# * `SATELLITE_PROVIDER_API_KEY` → c11_tile_manager.service_api_key # * `SATELLITE_PROVIDER_API_KEY` → c11_tile_manager.service_api_key
# #
# AZ-965 (2026-05-29): `c10_provisioning.backbones` now declares a # AZ-965 (2026-05-29): `c10_provisioning.backbones` declares one
# single NetVLAD-VGG16 entry pointing at `models/netvlad/netvlad.pt` # NetVLAD-VGG16 entry pointing at `models/net_vlad/net_vlad.pt`
# (568 MiB git-lfs blob; see `_docs/03_ip_attribution/netvlad.md` for # (568 MiB git-lfs blob; see `_docs/03_ip_attribution/netvlad.md` for
# provenance — VGG16 encoder = torchvision IMAGENET1K_V1 BSD, NetVLAD # provenance — VGG16 encoder = torchvision IMAGENET1K_V1 BSD, NetVLAD
# pool + PCA tail = deterministic-random untrained). Bind-mounted into # pool + PCA tail = deterministic-random untrained). Bind-mounted into
# the e2e-runner at `/opt/models` via docker-compose.test.jetson.yml. # the e2e-runner at `/opt/models` via docker-compose.test.jetson.yml.
# AZ-321 design: NetVLAD runs on the PyTorch FP16 runtime (NOT TRT), # AZ-321 design: NetVLAD runs on the PyTorch FP16 runtime (NOT TRT),
# so the field literally named `onnx_path` here is actually the path # so the field literally named `onnx_path` here is actually the path
# to the `.pt` PyTorch state_dict the runtime consumes. # to the `.pt` PyTorch state_dict the runtime consumes. File stem MUST
# equal `MODEL_NAME == "net_vlad"` from c2_vpr.net_vlad because the
# PyTorch runtime uses `path.stem` as the registry lookup key.
__top__: __top__:
mode: replay mode: replay
@@ -55,7 +57,7 @@ c7_inference:
c2_vpr: c2_vpr:
strategy: net_vlad strategy: net_vlad
backbone_weights_path: /opt/models/netvlad/netvlad.pt backbone_weights_path: /opt/models/net_vlad/net_vlad.pt
netvlad_descriptor_dim: 4096 netvlad_descriptor_dim: 4096
warn_top1_threshold: 0.30 warn_top1_threshold: 0.30
# faiss_index_path is overlaid at runtime by # faiss_index_path is overlaid at runtime by
@@ -66,7 +68,7 @@ c10_provisioning:
workspace_mb: 4096 workspace_mb: 4096
backbones: backbones:
- model_name: net_vlad - model_name: net_vlad
onnx_path: /opt/models/netvlad/netvlad.pt onnx_path: /opt/models/net_vlad/net_vlad.pt
expected_input_shape: [3, 480, 480] expected_input_shape: [3, 480, 480]
input_name: input input_name: input
+1 -1
@@ -55,7 +55,7 @@ from gps_denied_onboard.components.c2_vpr._net_vlad_architecture import ( # noq
) )
_DEFAULT_OUTPUT = _REPO_ROOT / "models" / "netvlad" / "netvlad.pt" _DEFAULT_OUTPUT = _REPO_ROOT / "models" / "net_vlad" / "net_vlad.pt"
_SEED = 0 _SEED = 0
@@ -137,6 +137,10 @@ class BackboneConfig:
f"BackboneConfig({self.model_name!r}).onnx_path must " f"BackboneConfig({self.model_name!r}).onnx_path must "
"be a non-empty string" "be a non-empty string"
) )
if isinstance(self.expected_input_shape, list):
object.__setattr__(
self, "expected_input_shape", tuple(self.expected_input_shape)
)
if not self.expected_input_shape: if not self.expected_input_shape:
raise ConfigError( raise ConfigError(
f"BackboneConfig({self.model_name!r}).expected_input_shape " f"BackboneConfig({self.model_name!r}).expected_input_shape "
@@ -235,6 +239,20 @@ class C10ProvisioningConfig:
"C10ProvisioningConfig.workspace_mb must be > 0; " "C10ProvisioningConfig.workspace_mb must be > 0; "
f"got {self.workspace_mb}" f"got {self.workspace_mb}"
) )
# YAML loaders pass `backbones` as `list[dict]`; the config loader's
# generic `_replace_block` path does not recursively construct
# nested dataclasses inside list/tuple fields, so we coerce here.
# Idempotent: existing BackboneConfig entries pass through.
if self.backbones and not all(
isinstance(b, BackboneConfig) for b in self.backbones
):
coerced = tuple(
BackboneConfig(**b) if isinstance(b, dict) else b
for b in self.backbones
)
object.__setattr__(self, "backbones", coerced)
elif isinstance(self.backbones, list):
object.__setattr__(self, "backbones", tuple(self.backbones))
seen: set[str] = set() seen: set[str] = set()
for backbone in self.backbones: for backbone in self.backbones:
if backbone.model_name in seen: if backbone.model_name in seen:
+35 -10
@@ -608,15 +608,7 @@ def _build_replay_backbone_embedder(
"DINOv2-VPR or NetVLAD per AZ-321)." "DINOv2-VPR or NetVLAD per AZ-321)."
) )
host = HostCapabilities( host = HostCapabilities(sm=87, jetpack="6.2", trt="10.3")
gpu_name="replay-e2e",
cuda_compute_capability=(0, 0),
cuda_runtime_version="0.0",
tensorrt_version="0.0",
host_arch="unknown",
host_os="linux",
driver_version="unknown",
)
engine_cache_root = cache_root / "engines" engine_cache_root = cache_root / "engines"
engine_cache_root.mkdir(parents=True, exist_ok=True) engine_cache_root.mkdir(parents=True, exist_ok=True)
request = EngineCompileRequest( request = EngineCompileRequest(
@@ -638,8 +630,14 @@ def _build_replay_backbone_embedder(
first = results[0] first = results[0]
spec = backbones[0] spec = backbones[0]
inference_runtime = build_inference_runtime(config) inference_runtime = build_inference_runtime(config)
engine_handle = inference_runtime.deserialize_engine(first.entry)
descriptor_dim = _resolve_replay_descriptor_dim(config, spec) descriptor_dim = _resolve_replay_descriptor_dim(config, spec)
# The c10 engine compiler treats backbones generically and does not
# know about c2_vpr's architecture registry. The c2_vpr factory
# would do this registration on its own create() path, but this
# fixture bypasses build_vpr_strategy. Register the strategy's
# NN architecture here so deserialize_engine can find it.
_register_replay_strategy_architecture(config, descriptor_dim)
engine_handle = inference_runtime.deserialize_engine(first.entry)
return C7EngineBackboneEmbedder( return C7EngineBackboneEmbedder(
inference_runtime=inference_runtime, inference_runtime=inference_runtime,
engine_handle=engine_handle, engine_handle=engine_handle,
@@ -651,6 +649,33 @@ def _build_replay_backbone_embedder(
) )
def _register_replay_strategy_architecture(
config: Any, descriptor_dim: int
) -> None:
"""Register c2_vpr's NN architecture with c7's registry.
Production runs go through ``vpr_factory.build_vpr_strategy`` which
invokes ``_register_strategy_architecture`` as a side effect before
the strategy is bound. The AZ-839 fixture pre-builds engines via
c10 directly (the operator pre-flight cache responsibility) and
skips ``build_vpr_strategy``, so the registration would never run.
Without this, ``InferenceRuntime.deserialize_engine`` raises
``EngineDeserializeError: No architecture registered for
model_name='net_vlad'`` when looking up the factory by file stem.
"""
block = config.components.get("c2_vpr") if config.components else None
strategy = getattr(block, "strategy", None) if block is not None else None
if strategy != "net_vlad":
return # other strategies handle their own registration / no-op
from gps_denied_onboard.components.c2_vpr.net_vlad import (
MODEL_NAME,
architecture_factory,
)
from gps_denied_onboard.components.c7_inference import register_architecture
register_architecture(MODEL_NAME, architecture_factory(descriptor_dim))
def _resolve_replay_descriptor_dim(config: Any, spec: Any) -> int: def _resolve_replay_descriptor_dim(config: Any, spec: Any) -> int:
"""Resolve the descriptor output dimension for the AZ-839 NetVLAD baseline. """Resolve the descriptor output dimension for the AZ-839 NetVLAD baseline.
@@ -0,0 +1,109 @@
"""AZ-965 — `C10ProvisioningConfig` coerces YAML-shaped backbones to dataclasses.
YAML loaders (`config/loader.py::_replace_block` → `dataclasses.replace`)
pass `backbones` as `list[dict]` because the generic loader path does
not recursively construct nested dataclasses inside list/tuple fields.
`C10ProvisioningConfig.__post_init__` must coerce each dict entry to a
proper :class:`BackboneConfig` instance before downstream consumers
iterate `backbone.model_name`. Similarly, `BackboneConfig.__post_init__`
must coerce `expected_input_shape: list[int]` to `tuple[int, ...]`
because PyYAML loads `[3, 480, 480]` as a list.
"""

from __future__ import annotations

from gps_denied_onboard.components.c10_provisioning.config import (
    BackboneConfig,
    C10ProvisioningConfig,
)


def test_c10_provisioning_coerces_list_of_dicts_to_backbone_configs() -> None:
    # Arrange — what the YAML loader produces for c10_provisioning.backbones
    yaml_shaped_backbones = [
        {
            "model_name": "net_vlad",
            "onnx_path": "/opt/models/net_vlad/net_vlad.pt",
            "expected_input_shape": [3, 480, 480],
            "input_name": "input",
        },
    ]

    # Act
    config = C10ProvisioningConfig(backbones=yaml_shaped_backbones)

    # Assert
    assert len(config.backbones) == 1
    assert isinstance(config.backbones, tuple)
    only = config.backbones[0]
    assert isinstance(only, BackboneConfig)
    assert only.model_name == "net_vlad"
    assert only.onnx_path == "/opt/models/net_vlad/net_vlad.pt"
    assert only.expected_input_shape == (3, 480, 480)
    assert isinstance(only.expected_input_shape, tuple)
    assert only.input_name == "input"


def test_backbone_config_coerces_list_input_shape_to_tuple() -> None:
    # Act
    backbone = BackboneConfig(
        model_name="net_vlad",
        onnx_path="/opt/models/net_vlad/net_vlad.pt",
        expected_input_shape=[3, 480, 480],  # type: ignore[arg-type]
        input_name="input",
    )

    # Assert
    assert isinstance(backbone.expected_input_shape, tuple)
    assert backbone.expected_input_shape == (3, 480, 480)


def test_c10_provisioning_passes_through_existing_backbone_configs() -> None:
    # Arrange
    existing = BackboneConfig(
        model_name="net_vlad",
        onnx_path="/opt/models/net_vlad/net_vlad.pt",
        expected_input_shape=(3, 480, 480),
        input_name="input",
    )

    # Act — coercion path must be idempotent for already-typed inputs
    config = C10ProvisioningConfig(backbones=(existing,))

    # Assert
    assert config.backbones == (existing,)
    assert config.backbones[0] is existing


def test_c10_provisioning_coerces_mixed_dict_and_dataclass_entries() -> None:
    # Arrange — partial migration shape (defensive)
    existing = BackboneConfig(
        model_name="ultra_vpr",
        onnx_path="/opt/models/ultra_vpr/ultra_vpr.onnx",
        expected_input_shape=(3, 224, 224),
        input_name="input",
    )
    yaml_shaped = {
        "model_name": "net_vlad",
        "onnx_path": "/opt/models/net_vlad/net_vlad.pt",
        "expected_input_shape": [3, 480, 480],
        "input_name": "input",
    }

    # Act
    config = C10ProvisioningConfig(backbones=[existing, yaml_shaped])

    # Assert
    assert len(config.backbones) == 2
    assert config.backbones[0] is existing
    assert isinstance(config.backbones[1], BackboneConfig)
    assert config.backbones[1].model_name == "net_vlad"


def test_c10_provisioning_empty_backbones_remains_empty_tuple() -> None:
    # Act
    config = C10ProvisioningConfig()

    # Assert
    assert config.backbones == ()
    assert isinstance(config.backbones, tuple)
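
For orientation, the coercion behaviour these tests pin down could be sketched roughly as below. This is a minimal, self-contained illustration, not the project's actual `config.py`: the class names `BackboneConfigSketch` and `ProvisioningConfigSketch` are hypothetical stand-ins, and the use of frozen dataclasses with `object.__setattr__` is an assumption about how the real `__post_init__` hooks might be written.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class BackboneConfigSketch:
    """Hypothetical stand-in for BackboneConfig (assumed frozen)."""

    model_name: str
    onnx_path: str
    expected_input_shape: Tuple[int, ...]
    input_name: str

    def __post_init__(self) -> None:
        # PyYAML loads `[3, 480, 480]` as a list; normalise to a tuple.
        # tuple() applied to a tuple returns an equal tuple, so this is idempotent.
        object.__setattr__(
            self, "expected_input_shape", tuple(self.expected_input_shape)
        )


@dataclass(frozen=True)
class ProvisioningConfigSketch:
    """Hypothetical stand-in for C10ProvisioningConfig (assumed frozen)."""

    backbones: Any = ()

    def __post_init__(self) -> None:
        # Keep already-typed entries by identity; build dataclasses from dicts.
        coerced = tuple(
            entry
            if isinstance(entry, BackboneConfigSketch)
            else BackboneConfigSketch(**entry)
            for entry in self.backbones
        )
        object.__setattr__(self, "backbones", coerced)
```

Passing already-typed entries through by identity is what lets an assertion such as `config.backbones[0] is existing` hold, while dict entries from a YAML loader are rebuilt into dataclass instances.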