mirror of
https://github.com/azaion/autopilot.git
synced 2026-06-22 04:41:10 +00:00
[AZ-662] [AZ-669] Close batch 19: green test gate via Jetson Docker
ci/woodpecker/push/build-arm Pipeline failed
Stand up a production-target test runner on jetson-e2e and run the deferred cargo test --workspace for batch 19.

Infra:
- Dockerfile.test: ubuntu:22.04 + libopencv-dev + libav*-dev + libclang-dev + protobuf-compiler + rust 1.82.0 (rustfmt, clippy). Sets LIBCLANG_PATH so clang-sys can dlopen libclang under the opencv-rust clang-runtime path.
- scripts/jetson-test.sh: rsync source to jetson-e2e, docker build, docker run cargo test --workspace --no-fail-fast.

Workspace fix exposed by the gate:
- Cargo.toml: enable opencv "clang-runtime" feature. Without it the workspace fails to build because clang-sys is shared between opencv-binding-generator and bindgen (via ffmpeg-sys-next) and the opencv generator panics with "a `libclang` shared library is not loaded on this thread" (opencv-rust GH issue #635).

Batch-19 code bugs exposed by the gate (6 compile errors + 1 algo bug):
- movement_detector::optical_flow: min_max_loc signature (opencv 0.98 expects Option<&mut f64> / Option<&mut Point>); data_mut() returns *mut u8 directly, not Result. RANSAC residual now filters by the inlier mask returned by find_homography (matches the docstring; was systematically over-reporting motion magnitude on synthetic pure-pan input).
- semantic_analyzer::scoring::freshness: same data_mut() fix; stddev_f32 now takes &impl core::ToInputArray so it accepts the BoxedRef<Mat> that Mat::roi returns in opencv 0.98.

Result: 391 tests passed across 58 binaries, 0 in-scope failures.

Two pre-existing failures in frame_ingest (batch 16-18 scope) are NOT addressed here and are recorded as leftovers:
- frame_ingest_cuvid_segv: HIGH severity production bug; libavcodec58 advertises h264_cuvid but libnvcuvid.so.1 is missing at runtime, the software fallback never fires, first send_packet SEGVs.
- frame_ingest_publisher_timing_flake: LOW severity; Jetson-specific timing budget too tight for ac1_three_consumers_at_rate_lose_no_frames.

Neither blocks batch 20 (movement_detector / semantic_analyzer next).

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -1,49 +0,0 @@
# Leftover: Batch 19 OpenCV test gate

- **Timestamp**: 2026-05-20T20:35:00+03:00
- **Source**: autodev batch-19 close-out session
- **Origin**: commit `db844db [AZ-662] [AZ-669] Implement ego-motion estimator and primitive graph`
- **Blocked operation**: `cargo test --workspace` (specifically the `movement_detector` and `semantic_analyzer` crates that newly depend on the `opencv = "0.98"` workspace dep)

## Why it is blocked

Both crates use the Rust `opencv` 0.98 binding, which pulls in the native OpenCV 4 system library at link time.

1. **macOS dev box**: no `libopencv*` installed. `brew install opencv pkg-config` failed with `ENOSPC`: the data partition has ≤ 1.1 GiB free, while opencv plus its transitive deps (proj, ffmpeg, qt, vtk, openblas, ceres-solver, ...) need ~3-5 GiB.
2. **Jetson (`jetson-e2e`)**: the state file recorded `ssh jetson-e2e && cargo test --workspace` as the authoritative test path, but the host is configured as the CI infra box (Gitea + Woodpecker via `~/ci/docker-compose.ci.yml`). It has neither the autopilot source checkout nor `cargo` at any standard path. The recorded plan is not directly executable.
3. **Dockerfile**: the only package install (`apt-get install -y --no-install-recommends ca-certificates libssl3`) is in the `runtime` stage; the `rust:1.82-bookworm` builder image does NOT install `libopencv-dev`, so a vanilla `docker build` will also fail.

## Test design (already in source, not yet executed)

| Crate | Test | Maps to AC |
|-------|------|------------|
| `movement_detector` | `internal::ego_motion::tests::ac1_pure_pan_residual_near_zero` | AZ-662 AC-1 |
| `movement_detector` | `internal::ego_motion::tests::ac2_skew_above_zoom_out_tolerance_dropped` | AZ-662 AC-2 |
| `movement_detector` | `internal::ego_motion::tests::ac3_degenerate_white_frame` | AZ-662 AC-3 |
| `movement_detector` | `internal::zoom_bands::tests::*` (3 tests) | tolerance-table coverage |
| `movement_detector` | `internal::telemetry_sync::tests::*` (3 tests) | skew-gate edge cases |
| `semantic_analyzer` | `internal::primitive_graph::builder::tests::ac1_node_counts_per_class` | AZ-669 AC-1 |
| `semantic_analyzer` | `internal::scoring::freshness::tests::ac2_freshness_score_bounded` | AZ-669 AC-2 |
| `semantic_analyzer` | `internal::primitive_graph::builder::tests::ac3_disconnected_path_graph_flagged` | AZ-669 AC-3 |

## Replay options (any one closes the gate)

1. **macOS local (preferred)**: free ≥ 5 GiB on the data partition (`df -h /System/Volumes/Data`), then `brew install opencv pkg-config && cargo test --workspace`. This matches the pattern used for `ffmpeg-next` in batches 17/18.
2. **Jetson via CI**: push the `dev` branch to Gitea, configure the Woodpecker pipeline to run `cargo test --workspace` inside a `rust:1.82-bookworm` container with `apt-get install -y libopencv-dev clang libclang-dev` in a prep step.
3. **Docker local**: extend the workspace `Dockerfile` (build stage) with `apt-get install -y libopencv-dev clang libclang-dev pkg-config` BEFORE the `cargo build` line, then `docker build -t autopilot-test --target build .` and `docker run --rm autopilot-test cargo test --workspace`.
4. **Jetson as dev box**: clone the repo to `~/autopilot` on `jetson-e2e`, install rustup + cargo, install `libopencv-dev`, then run tests there. (Most setup effort; only worth it if Jetson will keep being used as the dev sandbox.)

## Acceptance for closing this leftover

- All tests listed above run successfully.
- The full `cargo test --workspace` produces the same pre-existing flake summary as the batches-16-18 cumulative review (`mission_executor` `ac3_bounded_retry_then_success` / `ac1_multirotor_happy_path_reaches_done` may flake; tracked in `2026-05-20_mission_executor_ac3_flake.md`; not blocking).
- Append the run output to `batch_19_cycle1_report.md` under a "Test Run — DONE" section and remove the "Test Gate — DEFERRED" caveat.
- Delete this leftover file.

## Why no Jira write deferral

AZ-662 + AZ-669 have already been transitioned to `In Testing` per implement-skill Step 12 semantics ("dev work done, tests should now run"). The test gate itself is not a Jira write; it is a CI / local-build action. No tracker replay required when this leftover closes.

## Why this blocks batch 20

Batch 20 candidates (`AZ-663`, `AZ-664`, `AZ-670`, `AZ-671`, ...) depend on `movement_detector::ego_motion` and `semantic_analyzer::primitive_graph` per `_docs/02_tasks/_dependencies_table.md`. Building batch 20 on unverified `db844db` risks compounding bugs across two cycles before any test ever runs.
@@ -0,0 +1,65 @@
# Leftover: frame_ingest h264_cuvid SIGSEGV

- **Timestamp**: 2026-05-20T22:10:00+03:00
- **Source**: Batch-19 Jetson test-gate run (commit pending; closes batch 19)
- **Severity**: HIGH. Real production bug: it would crash the decoder process in any deployment where Ubuntu's libavcodec58 was built with cuvid headers but libnvcuvid.so.1 is missing (e.g., a Jetson reflash before the NVIDIA driver is installed, or any non-NVIDIA host with `libavcodec-extra` installed).
- **Origin component**: `frame_ingest` (AZ-657 / AZ-658, batches 16-18)
- **NOT in batch 19 scope**; recorded for the next batch that touches `frame_ingest`.

## Symptom

`cargo test -p frame_ingest --lib` and `cargo test -p frame_ingest --test decoder_pipeline` both SIGSEGV during construction of the production decoder:

```
[h264_cuvid @ 0xffff8c000d70] Cannot load libnvcuvid.so.1
[h264_cuvid @ 0xffff8c000d70] Failed loading nvcuvid.
error: test failed, to rerun pass `-p frame_ingest --lib`
Caused by:
  process didn't exit successfully: `.../frame_ingest-...` (signal: 11, SIGSEGV: invalid memory reference)
```

Reproduced in `Dockerfile.test` (ubuntu:22.04 + libopencv-dev + libav*-dev + no NVIDIA driver), i.e., the canonical "production-like minus NVDEC" environment.

## Root cause

`crates/frame_ingest/src/internal/decoder.rs::open_with_backend`:

```rust
if let Some(nv) = ffmpeg::codec::decoder::find_by_name(codec.nvdec_name()) {
    match try_open(nv) {
        Ok(d) => { return Ok((d, DecoderBackend::Nvdec)); }
        Err(e) => { /* fall through to software */ }
    }
}
```

and `try_open`:

```rust
fn try_open(codec: ffmpeg::Codec) -> Result<ffmpeg::decoder::Video, DecoderInitError> {
    let ctx = ffmpeg::codec::Context::new();
    let opened = ctx.decoder().open_as(codec).map_err(DecoderInitError::OpenFailed)?;
    opened.video().map_err(DecoderInitError::OpenFailed)
}
```

Ubuntu's `libavcodec58` package was built against the NVIDIA cuvid headers, so `find_by_name("h264_cuvid")` returns `Some(...)` **even when libnvcuvid.so.1 is absent at runtime**. `open_as(codec)` ALSO returns `Ok` because FFmpeg defers the libnvcuvid `dlopen` until the first `send_packet`. The fallback to software h264 therefore never fires; the first decode SEGVs because `libnvcuvid.so.1` couldn't be opened.

## Fix sketch

In `try_open` (or a new `probe_nvdec` helper), call `send_packet` with a minimal valid NAL unit (or just allocate a CUDA context via `avcodec_send_packet` + `avcodec_receive_frame` round-trip) so the libnvcuvid load is attempted at probe time. If it fails, return `Err(DecoderInitError::OpenFailed(...))` so the existing fallback kicks in.

Alternative (cheaper) probe: `dlopen("libnvcuvid.so.1")` directly via the `libloading` crate before declaring NVDEC opened. If dlopen fails, immediately fall back to software without ever touching the FFmpeg cuvid path.

Either approach restores the AZ-658 design intent ("real NVDEC binding when present, real software fallback always"); currently the fallback only fires when the cuvid codec is unregistered, not when it is registered-but-non-functional.

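The probe-before-open decision described above can be sketched roughly as follows. This is a minimal illustration, not the crate's code: `Backend`, `select_backend`, and the `runtime_probe` closure are hypothetical names, and the actual library load is abstracted behind the closure so the example needs nothing beyond the standard library.

```rust
/// Which decoder implementation ended up being selected (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Nvdec,
    Software,
}

/// Pick a backend. NVDEC is chosen only when the cuvid codec is registered
/// AND the runtime probe succeeds. The probe is not called at all when the
/// codec is unregistered, so the FFmpeg cuvid path is never touched there.
pub fn select_backend<P>(codec_registered: bool, runtime_probe: P) -> Backend
where
    P: FnOnce() -> Result<(), String>,
{
    if !codec_registered {
        return Backend::Software;
    }
    match runtime_probe() {
        Ok(()) => Backend::Nvdec,
        // Registered-but-non-functional: fall back instead of crashing later.
        Err(_reason) => Backend::Software,
    }
}
```

In `frame_ingest` the closure could plausibly wrap an `unsafe` call to `libloading::Library::new("libnvcuvid.so.1")` and map its error to a string; that wiring is an assumption and is not shown here. Keeping the probe injectable also makes the "registered but missing at runtime" case testable without an NVIDIA driver.
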
## Acceptance for closing this leftover

- `cargo test -p frame_ingest --lib` passes in `Dockerfile.test` on `jetson-e2e`.
- `cargo test -p frame_ingest --test decoder_pipeline` passes in the same env.
- `FfmpegDecoder::new(Codec::H264)` returns `Ok` with `backend() == Software` (not NVDEC) when libnvcuvid.so.1 is missing, regardless of whether `h264_cuvid` is registered.
- A new test (e.g., `decoder_falls_back_to_software_when_libnvcuvid_missing`) covers the regression and runs in `Dockerfile.test`.

## Suggested owner

Next batch that touches `frame_ingest` (likely a maintenance touch when AZ-678 / AZ-679 / AZ-680 land). Could also be packaged as a standalone Bug ticket in Jira; defer to whoever picks up the next `frame_ingest` work.
@@ -0,0 +1,38 @@
# Leftover: frame_ingest publisher timing flake on Jetson

- **Timestamp**: 2026-05-20T22:10:00+03:00
- **Source**: Batch-19 Jetson test-gate run (commit pending; closes batch 19)
- **Severity**: LOW. Flaky test, not a production bug; passed on the second run.
- **Origin component**: `frame_ingest` (AZ-657, batch 16)
- **NOT in batch 19 scope**; recorded for the next batch that touches `frame_ingest`.

## Symptom

`cargo test -p frame_ingest --test publisher ac1_three_consumers_at_rate_lose_no_frames` failed on the first run inside `Dockerfile.test` on `jetson-e2e`:

```
---- ac1_three_consumers_at_rate_lose_no_frames stdout ----
thread 'tokio-rt-worker' (1069) panicked at crates/frame_ingest/tests/publisher.rs:78:31:
telemetry stalled at 25/30
```

Passed on the second run with no code change. The test produces 30 frames at a fixed rate and expects all three consumers to keep up. The Jetson Orin Nano Super (6-core Cortex-A78AE at ~2 GHz) is significantly slower than the macOS dev box where the test was originally tuned, so the per-frame timing budget (the source of the 25/30 cutoff at line 78) is too tight for this hardware under load (e.g., during a cold `cargo build` of the next test binary).

## Fix sketch

Two options:

1. **Relax the timing budget** in `crates/frame_ingest/tests/publisher.rs:78` to allow longer per-frame deadlines, OR derive it from a measured baseline so a slow host gets proportionally more time. The test's INTENT ("all three consumers receive all 30 frames") is preserved; only the synthetic rate is adjusted.

2. **Mark the test `#[ignore]` on aarch64-linux with a comment pointing here**, then add a slower-rate variant that runs everywhere. This keeps the original test as an "ideal-hardware" check.

Option 1 is cleaner and matches the existing pattern in the same crate (`ac2_slow_consumer_drops_while_fast_consumers_unaffected` uses a fixed but generous rate).

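For the "derive it from a measured baseline" variant of option 1, one possible shape is sketched below. Everything here is an assumption for illustration: `measure_baseline`, `scaled_budget`, the calibration workload, and the clamp bound are hypothetical and do not come from `publisher.rs`.

```rust
use std::time::{Duration, Instant};

/// Time a small fixed CPU workload as a rough speed baseline for this host.
pub fn measure_baseline() -> Duration {
    let start = Instant::now();
    let mut acc: u64 = 0;
    for i in 0..2_000_000u64 {
        acc = acc.wrapping_mul(6364136223846793005).wrapping_add(i);
    }
    // Keep the optimizer from discarding the loop.
    std::hint::black_box(acc);
    start.elapsed()
}

/// Scale `nominal` by how much slower this host is than the reference host.
/// The factor is clamped to [1.0, max_factor]: a fast host never gets a
/// tighter budget, and a very slow host cannot stretch it without bound.
pub fn scaled_budget(
    nominal: Duration,
    reference: Duration,
    measured: Duration,
    max_factor: f64,
) -> Duration {
    let ratio = if reference.is_zero() {
        1.0
    } else {
        measured.as_secs_f64() / reference.as_secs_f64()
    };
    nominal.mul_f64(ratio.clamp(1.0, max_factor.max(1.0)))
}
```

A test could call `measure_baseline()` once at startup and pass the result as `measured`, with `reference` set to the value observed on the machine where the budget was originally tuned. A baseline taken while the host is busy will itself be inflated, which loosens the budget further; whether that is acceptable depends on how strict the rate check is meant to be.
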
## Acceptance for closing this leftover

- `cargo test -p frame_ingest --test publisher` passes three consecutive times in `Dockerfile.test` on `jetson-e2e`, starting with the first run.
- Test intent (zero-frame-loss across 3 consumers at the configured rate) is preserved.

## Suggested owner

Whichever batch next touches `frame_ingest`. Same batch as `2026-05-20_frame_ingest_cuvid_segv.md` if both can be addressed together.