gps-denied-onboard/_docs/LESSONS.md

# LESSONS

Append-only ledger of lessons learned during the project. New entries go at the **top**. Each entry is one short bullet + a one-sentence "what changed".

Ring buffer: trim to the last 15 entries. Categories: `estimation · architecture · testing · dependencies · tooling · process`.
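
A minimal sketch of the trim rule, assuming every entry starts with a level-2 `## ` heading and that anything before the first such heading is preamble to keep. `trim_ledger` and `MAX_ENTRIES` are illustrative names, not existing project tooling.

```python
import re

MAX_ENTRIES = 15  # ring-buffer size stated in this file's preamble


def trim_ledger(text: str, max_entries: int = MAX_ENTRIES) -> str:
    """Keep the preamble plus the first `max_entries` entries (newest are on top)."""
    # Split immediately before every line that opens a new "## " entry.
    parts = re.split(r"(?m)^(?=## )", text)
    preamble, entries = parts[0], parts[1:]
    return preamble + "".join(entries[:max_entries])
```

Because new entries go at the top, trimming keeps the head of the list and drops the tail.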

---

## 2026-05-26 — [testing] Removing `@pytest.mark.xfail` must be paired with a same-batch run on the actual hardware tier the test targets

**Trigger**: AZ-848 root cause re-diagnosis (2026-05-26). In cycle 2, commit `8de2716 [AZ-776] Open-loop ESKF composition profile via c4_pose.enabled` removed `@xfail` decorators from AC-1/AC-2/AC-5/AC-6 in `test_derkachi_1min.py` with AC-7 in the spec stating "tests run on Jetson after this task → All five pass". The Jetson run was never executed before AZ-776 closed. The latent C1 contract bug (`VioOutput.emitted_at_ns` uses `monotonic_ns` instead of FC-boot-relative timestamps) was therefore not detected until cycle-3 Step 11 — three weeks later. AZ-848 is 5 SP and now blocks all real airborne work in cycle 4.

**What changed**: `.cursor/skills/implement/SKILL.md` batch self-review should add a check: **if the batch removes any `@pytest.mark.xfail` decorator**, the same batch MUST include a green test execution against the test's target tier (or explicit `tier-2-only` skip documentation if the hardware is unavailable in the batch session). Block the PASS verdict without this evidence. This predates the 2026-05 `meta-rule.mdc` "Real Results, Not Simulated Ones" rule, but the implement skill's own gate should enforce it as well.

Source: `_docs/06_metrics/retro_2026-05-26.md`

## 2026-05-26 — [process] Autodev must block Step-N+1 entry if the previous cycle's retro file is missing

**Trigger**: cycle-2 retro was never filed. The autodev orchestrator silently auto-chained from cycle-2 Step 17 (if it ran at all) straight into cycle-3 Step 9 without producing `retro_<cycle2-date>.md`. As a result, cycle-1 retro's Top-3 Improvement Actions sat invisible across cycle 2 and were re-discovered, all three still undelivered, only at cycle-3 close — including `architecture_compliance_baseline.md` (action #3) which is now in its third cycle of being un-delivered.

**What changed**: `.cursor/skills/autodev/state.md` Re-Entry After Completion (or `flows/existing-code.md`) should verify that `_docs/06_metrics/retro_<YYYY-MM-DD>.md` exists for the previous cycle (`state.cycle`) before incrementing the cycle counter and entering Step 9 of cycle N+1. If absent, BLOCK and surface the gap with an A/B/C choice: (A) author the missing retro now, (B) stub a backfilled retro and proceed, (C) abort and ask the user.
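
A sketch of the existence check, assuming the orchestrator knows the previous cycle's close date. How that date is stored in `state.md` is not specified here, so it is passed in explicitly. `retro_path` and `check_retro_gate` are illustrative names.

```python
from datetime import date
from pathlib import Path


def retro_path(metrics_dir: Path, cycle_close: date) -> Path:
    """Build the expected retro filename, retro_<YYYY-MM-DD>.md."""
    return metrics_dir / f"retro_{cycle_close.isoformat()}.md"


def check_retro_gate(metrics_dir: Path, cycle_close: date) -> str:
    """Return "PROCEED" if the previous cycle's retro exists, else "BLOCK"."""
    if retro_path(metrics_dir, cycle_close).is_file():
        return "PROCEED"
    # Caller surfaces the A/B/C choice: author now, stub and proceed, or abort.
    return "BLOCK"
```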

Source: `_docs/06_metrics/retro_2026-05-26.md`

## 2026-05-26 — [tooling] When investigating bug X reveals a separate latent bug Y, file Y as a new ticket immediately — do not fold Y's scope into X

**Trigger**: AZ-848 evidence-based investigation (2026-05-26) used a pymavlink probe against the Derkachi tlog to verify the original "IMU-vs-IMU clock mismatch" hypothesis. The probe REFUTED the original hypothesis (both `RAW_IMU` and `SCALED_IMU2` share the FC-boot timebase) and SIMULTANEOUSLY surfaced a separate latent bug — `c8_fc_adapter._handle_imu` mis-reads `SCALED_IMU2.time_boot_ms` as `time_usec`, defaulting to 0 for ~half of all IMU samples. Both bugs are real and orthogonal in their fix paths. The decision was to split — AZ-883 (2 SP) gets its own ticket, AZ-848 (5 SP) keeps its tightly-scoped contract repair.

**What changed**: when a deep investigation surfaces a second latent issue that's orthogonal to the primary bug, file the second issue as its own ticket in the same session (with full evidence + reproduction protocol), then resume the primary investigation. Resist the temptation to fold the second issue into the primary ticket's scope "for convenience" — it inflates SP estimates and couples fix landings unnecessarily.
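
For reference, the field mix-up named in the trigger comes down to two MAVLink messages carrying differently named timestamp fields in different units: `RAW_IMU.time_usec` (microseconds) and `SCALED_IMU2.time_boot_ms` (milliseconds). The sketch below is not the AZ-883 fix; it works on plain dicts rather than pymavlink objects, and `imu_time_boot_ns` is a hypothetical helper. The point it illustrates is to fail loudly on a missing field instead of defaulting to 0.

```python
def imu_time_boot_ns(msg_type: str, fields: dict) -> int:
    """Convert an IMU message timestamp to nanoseconds.

    Raises KeyError if the expected field is absent, rather than
    silently falling back to 0.
    """
    if msg_type == "RAW_IMU":
        return int(fields["time_usec"]) * 1_000  # microseconds -> ns
    if msg_type in ("SCALED_IMU", "SCALED_IMU2", "SCALED_IMU3"):
        return int(fields["time_boot_ms"]) * 1_000_000  # milliseconds -> ns
    raise ValueError(f"unsupported IMU message type: {msg_type}")
```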

Source: `_docs/06_metrics/retro_2026-05-26.md`

## 2026-05-20 — [testing] Two-tier test policy retired — all tests run on Jetson only

**Trigger**: a `/test-run` invocation on the workstation Tier-1 Docker stack uncovered eight categorically distinct, sequential bugs in the supposedly-supported workstation path (Dockerfile `COPY` ordering before editable install, base-image pip too old for `gtsam` pre-release wheels, runtime stage missing the `python3` metapackage that `python3 -m venv` symlinks against, missing `libgl1` / `libglib2.0-0` for `cv2` import, missing `runtime_root/__main__.py` shim, lazy import that never registered the `c6_tile_cache` config block, and a `BUILD_FAISS_INDEX` env flag gap in `docker-compose.test.jetson.yml`). None of these had been hit before because no one had actually executed the workstation Docker stack end-to-end since it was authored; the colocated Jetson Woodpecker agent was the only test environment that ever ran. Maintaining the divergent x86 path was consuming engineering time and producing only false-negative signal, never honest test coverage.

**What changed**: the two-tier execution profile is retired in favour of a Jetson-only policy. Source of truth: `_docs/02_document/tests/environment.md` (active-policy banner at top + superseding "Decision (2026-05-20)" in § Test Execution). CI policy updated in `_docs/04_deploy/ci_cd_pipeline.md` and `_docs/02_document/deployment/ci_cd_pipeline.md`. Local-development entry point: `scripts/run-tests-jetson.sh` against the configured `jetson-e2e` SSH alias. The general rule: **if you have one environment that matches production and one that doesn't, don't maintain both — maintain the one that matches.**

## 2026-05-20 — [process] Before classifying a per-task FAIL, probe cross-cutting state the task depends on (registries, factories, baselines)

**Trigger**: cycle-1 Step 7 Product Implementation Completeness Gate originally classified AZ-332 + AZ-333 as FAIL and proposed two per-strategy remediation tasks (AZ-589 + AZ-590). Post-mortem found the actual gap was the empty central `_STRATEGY_REGISTRY` — a cross-cutting concern that should have produced **one** task (AZ-591), not two. AZ-589 + AZ-590 closed Won't Fix.

**What changed**: completeness gates should now run a workspace grep for cross-cutting registry / factory state the task depends on before classifying a per-task FAIL. If the actual root cause is cross-cutting, propose a single cross-cutting task instead of N per-task remediation tasks. Captured in `_docs/06_metrics/retro_2026-05-20.md` § Suggested Rule/Skill Updates.
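
A sketch of the consolidation step, assuming the gate has already recorded, per failing task, which shared dependencies it found broken. The input shape and both function names are assumptions for illustration.

```python
from collections import defaultdict


def group_by_root_cause(failing: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each broken dependency to the sorted list of tasks failing on it."""
    by_dep: dict[str, list[str]] = defaultdict(list)
    for task, deps in failing.items():
        for dep in deps:
            by_dep[dep].append(task)
    return {dep: sorted(tasks) for dep, tasks in by_dep.items()}


def cross_cutting(failing: dict[str, set[str]],
                  min_tasks: int = 2) -> dict[str, list[str]]:
    """Keep only dependencies that explain at least `min_tasks` failures."""
    grouped = group_by_root_cause(failing)
    return {dep: tasks for dep, tasks in grouped.items() if len(tasks) >= min_tasks}
```

Each key in the result is a candidate for one cross-cutting task in place of one remediation task per listed failure.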

Source: `_docs/06_metrics/retro_2026-05-20.md`

## 2026-05-20 — [testing] If N test specs share a single un-built fixture, schedule the fixture builder as a P0 prerequisite during decompose

**Trigger**: cycle-1 ended with 17 NFT scenarios `sitl_replay_ready`-skipping on the Tier-1 docker harness because AZ-595 (SITL observer + FDR replay fixture builder) was decomposed as a peer task and slipped to the end of the cycle. Cumulative review window 88-92 surfaced this as a 5 cp PBI that now blocks the cycle-2 Step 11 retry.

**What changed**: `decompose/SKILL.md` should identify the fixture-builder dependency surface explicitly during test-task decomposition. If N test tasks share one un-built fixture, the fixture builder is a P0 prerequisite and is scheduled ahead of the dependent tasks, not as a peer. Captured in `_docs/06_metrics/retro_2026-05-20.md` § Suggested Rule/Skill Updates.
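
The ordering rule can be sketched as below, assuming decompose has a map of test tasks to the fixtures they need and a map of fixtures to their builder tasks. `schedule` and the `NFT-01` / `NFT-02` IDs in the usage note are hypothetical.

```python
def schedule(test_tasks: dict[str, set[str]],
             builders: dict[str, str],
             built: set[str]) -> list[str]:
    """Order task IDs so builders of un-built fixtures precede all test tasks."""
    needed = sorted({f for fixtures in test_tasks.values() for f in fixtures} - built)
    # KeyError here means a needed fixture has no builder task at all.
    prereqs = list(dict.fromkeys(builders[f] for f in needed))  # dedupe, keep order
    return prereqs + sorted(test_tasks)
```

With `{"NFT-01": {"sitl_replay"}, "NFT-02": {"sitl_replay"}}` and `{"sitl_replay": "AZ-595"}`, the builder is placed first rather than alongside the tests that depend on it.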

Source: `_docs/06_metrics/retro_2026-05-20.md`

## 2026-05-20 — [architecture] Land `_docs/02_document/architecture_compliance_baseline.md` as a Step 6 (Decompose) prerequisite so cumulative reviews can emit Baseline Delta sections

**Trigger**: every cumulative review across cycle 1 logged "`_docs/02_document/architecture_compliance_baseline.md` does NOT exist → no Baseline Delta section emitted". Structural regressions (new cycles in the import graph, newly-introduced architecture violations) therefore could not be quantified across cycle 1 — only verified pairwise per batch.

**What changed**: cycle 2 Step 6 (Decompose) should create the baseline file with `0` violations seeded from the structural snapshot at `_docs/06_metrics/structure_2026-05-20.md`. From cycle 2 onward, `## Baseline Delta` rows quantify carried-over / resolved / newly-introduced violations per cycle. Captured in `_docs/06_metrics/retro_2026-05-20.md` § Top 3 Improvement Actions #3.
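
The three Baseline Delta buckets are plain set differences. A minimal sketch, assuming each violation is identified by a stable string key (the key format is an assumption):

```python
def baseline_delta(baseline: set[str], current: set[str]) -> dict[str, list[str]]:
    """Classify violations relative to the previous cycle's baseline."""
    return {
        "carried_over": sorted(baseline & current),      # in both snapshots
        "resolved": sorted(baseline - current),          # gone since baseline
        "newly_introduced": sorted(current - baseline),  # absent from baseline
    }
```

With the baseline seeded at `0` violations, everything found in the first comparison lands in `newly_introduced`.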

Source: `_docs/06_metrics/retro_2026-05-20.md`

## 2026-05-18 — When autodev rewinds N → 7 (or any earlier step) mid-session, treat the handoff as a session boundary

**Trigger**: In Step 11 (Run Tests) cycle 1, the Jetson e2e gate routed the flow back to Step 7 (Implement) for AZ-618 (cross-cutting 5pt task with 12 infrastructure deps). The user repeatedly chose to continue in the same conversation. I rewound state cleanly (task spec + autodev state) but, on attempting to enter the implement skill's batch loop in the SAME conversation, found that even just investigating the 12 builder signatures consumed enough context to reach the Caution zone — writing the implementation would have hit truncation mid-batch.

**What changed**: When the autodev rewinds the flow to an EARLIER step in the same conversation (Step 11 → Step 7, Step 11 → Step 9, etc.), treat the rewind itself as a session boundary, regardless of whether the flow file's Auto-Chain Rules table marks it as one. Save the bootstrap artifacts (task spec, state, dependencies-table refresh), commit them, then ask for a fresh conversation. The rewind already cost real tool calls; the destination step's batch loop deserves clean context. Document the rewind reason in `sub_step.detail` so re-entry is one-line clear.

## 2026-05-17 — Always call `getTransitionsForJiraIssue` before `transitionJiraIssue`

**Trigger**: In batch 87 (autodev step 10), I transitioned AZ-436..AZ-439 with `transition.id="31"`, assuming from stale memory that it meant "In Progress". Read-back showed all four moved to **Done** instead (id `31` in this workflow = Done; In Progress = `21`, In Testing = `32`, To Do = `11`). The mistake was caught by the tracker rule's mandatory read-back gate, fixed by re-transitioning to `21`, and confirmed via second read-back.

**What changed**: Treat the transition ID as workflow-specific, not memorizable across sessions. Always query `getTransitionsForJiraIssue` first on the actual target issue (or one in the same project/workflow) and select the transition by `name` ("In Progress" / "In Testing" / "Done" / "To Do") — never by hard-coded numeric id. This is true even when you "remember" the IDs from a prior batch this same day, because the agent has no guarantee the workflow definition is stable.
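
A sketch of name-based selection, assuming the `getTransitionsForJiraIssue` result can be reduced to a list of dicts with `id` and `name` keys (the shape Jira's REST transitions endpoint returns; the tool's exact wrapper is an assumption). `transition_id_by_name` is a hypothetical helper.

```python
def transition_id_by_name(transitions: list[dict], name: str) -> str:
    """Return the id of the single transition whose name matches, case-insensitively."""
    wanted = name.casefold()
    matches = [t["id"] for t in transitions if t.get("name", "").casefold() == wanted]
    if len(matches) != 1:
        # Zero or ambiguous matches: refuse to guess.
        raise LookupError(f"expected exactly one transition named {name!r}, "
                          f"found {len(matches)}")
    return matches[0]
```

The read-back gate still applies after the transition; selecting by name only removes the stale-id failure mode.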