gps-denied-onboard/_docs/LESSONS.md

# LESSONS

Append-only ledger of lessons learned during the project. New entries go at the **top**. Each entry is one short bullet + a one-sentence "what changed".

Ring buffer: trim to the last 15 entries. Categories: `estimation · architecture · testing · dependencies · tooling · process`.

---

## 2026-05-20 — [testing] Two-tier test policy retired — all tests run on Jetson only

**Trigger**: a `/test-run` invocation on the workstation Tier-1 Docker stack uncovered eight categorically distinct, sequential bugs in the supposedly-supported workstation path (Dockerfile `COPY` ordering before editable install, base-image pip too old for `gtsam` pre-release wheels, runtime stage missing the `python3` metapackage that `python3 -m venv` symlinks against, missing `libgl1` / `libglib2.0-0` for `cv2` import, missing `runtime_root/__main__.py` shim, lazy import that never registered the `c6_tile_cache` config block, and a `BUILD_FAISS_INDEX` env flag gap in `docker-compose.test.jetson.yml`). None of these had been hit before because no one had executed the workstation Docker stack end-to-end since it was authored; the colocated Jetson Woodpecker agent was the only test environment that ever ran. Maintaining the divergent x86 path produced only false-negative signal and consumed engineering time; it never delivered honest test coverage.

**What changed**: the two-tier execution profile is retired in favour of a Jetson-only policy. Source of truth: `_docs/02_document/tests/environment.md` (active-policy banner at top + superseding "Decision (2026-05-20)" in § Test Execution). CI policy updated in `_docs/04_deploy/ci_cd_pipeline.md` and `_docs/02_document/deployment/ci_cd_pipeline.md`. Local-development entry point: `scripts/run-tests-jetson.sh` against the configured `jetson-e2e` SSH alias. The general rule: **if you have one environment that matches production and one that doesn't, don't maintain both — maintain the one that matches.**

## 2026-05-20 — [process] Before classifying a per-task FAIL, probe cross-cutting state the task depends on (registries, factories, baselines)

**Trigger**: the cycle-1 Step 7 Product Implementation Completeness Gate originally classified AZ-332 + AZ-333 as FAIL and proposed two per-strategy remediation tasks (AZ-589 + AZ-590). The post-mortem found that the actual gap was the empty central `_STRATEGY_REGISTRY`, a cross-cutting concern that should have produced **one** task (AZ-591), not two. AZ-589 + AZ-590 were closed as Won't Fix.

**What changed**: completeness gates should now run a workspace grep for cross-cutting registry / factory state the task depends on before classifying a per-task FAIL. If the actual root cause is cross-cutting, propose a single cross-cutting task instead of N per-task remediation tasks. Captured in `_docs/06_metrics/retro_2026-05-20.md` § Suggested Rule/Skill Updates.

Source: `_docs/06_metrics/retro_2026-05-20.md`
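The workspace probe described in this entry could be sketched roughly as follows. This is a minimal illustration, not the gate's actual implementation: the regex, the `find_empty_registries*` helper names, and the assumption that registries are module-level names ending in `REGISTRY` are all hypothetical.

```python
import re
from pathlib import Path

# Hypothetical convention: a module-level name such as _STRATEGY_REGISTRY,
# optionally annotated, assigned an empty dict or list literal.
EMPTY_REGISTRY = re.compile(
    r"^(_[A-Z0-9_]*REGISTRY)[ \t]*(?::[^=\n]+)?=[ \t]*(\{\}|\[\]|dict\(\)|list\(\))[ \t]*$",
    re.MULTILINE,
)


def find_empty_registries_in_text(text):
    """Return the names of registries declared empty in one source text."""
    return [match.group(1) for match in EMPTY_REGISTRY.finditer(text)]


def find_empty_registries(root):
    """Scan every *.py file under root; map file path -> empty registry names."""
    hits = {}
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        names = find_empty_registries_in_text(text)
        if names:
            hits[str(path)] = names
    return hits
```

A hit is a prompt to look closer rather than proof of a gap: a registry that is declared empty and populated later by decorators or imports would also match, so the probe narrows where to look before a per-task FAIL is filed.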

## 2026-05-20 — [testing] If N test specs share a single un-built fixture, schedule the fixture builder as a P0 prerequisite during decompose

**Trigger**: cycle 1 ended with 17 NFT scenarios skipped via `sitl_replay_ready` on the Tier-1 docker harness because AZ-595 (SITL observer + FDR replay fixture builder) was decomposed as a peer task and slipped to the end of the cycle. Cumulative review window 88-92 surfaced this as a 5 cp PBI that now blocks the cycle-2 Step 11 retry.

**What changed**: `decompose/SKILL.md` should identify the fixture-builder dependency surface explicitly during test-task decomposition. If N test tasks share one un-built fixture, the fixture builder is a P0 prerequisite and is scheduled ahead of the dependent tasks, not as a peer. Captured in `_docs/06_metrics/retro_2026-05-20.md` § Suggested Rule/Skill Updates.

Source: `_docs/06_metrics/retro_2026-05-20.md`
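One way to picture the decomposition rule in this entry is a small check that groups test tasks by the fixture they need and flags any un-built fixture shared by several of them. The `SpecTask` type, the `fixtures_to_promote` helper, and the `min_dependents` threshold are hypothetical names introduced for illustration only; `decompose/SKILL.md` may express the rule quite differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpecTask:
    task_id: str
    fixture: Optional[str]  # name of the fixture the task needs, or None


def fixtures_to_promote(tasks, built_fixtures, min_dependents=2):
    """Return {fixture: [dependent task ids]} for un-built shared fixtures.

    Any fixture returned here should get its builder scheduled as a P0
    prerequisite, ahead of the listed tasks rather than as their peer.
    """
    dependents = defaultdict(list)
    for task in tasks:
        if task.fixture is not None and task.fixture not in built_fixtures:
            dependents[task.fixture].append(task.task_id)
    return {
        fixture: ids
        for fixture, ids in dependents.items()
        if len(ids) >= min_dependents
    }
```

Running the check during decomposition, before tasks are ordered, is the point: the builder is identified while the schedule can still be changed.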

## 2026-05-20 — [architecture] Land `_docs/02_document/architecture_compliance_baseline.md` as a Step 6 (Decompose) prerequisite so cumulative reviews can emit Baseline Delta sections

**Trigger**: every cumulative review across cycle 1 logged "`_docs/02_document/architecture_compliance_baseline.md` does NOT exist → no Baseline Delta section emitted". Structural regressions (new cycles in the import graph, newly-introduced architecture violations) therefore could not be quantified across cycle 1 — only verified pairwise per batch.

**What changed**: cycle 2 Step 6 (Decompose) should create the baseline file with `0` violations seeded from the structural snapshot at `_docs/06_metrics/structure_2026-05-20.md`. From cycle 2 onward, `## Baseline Delta` rows quantify carried-over / resolved / newly-introduced violations per cycle. Captured in `_docs/06_metrics/retro_2026-05-20.md` § Top 3 Improvement Actions #3.

Source: `_docs/06_metrics/retro_2026-05-20.md`
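The carried-over / resolved / newly-introduced split amounts to set arithmetic over violation identifiers. The sketch below assumes each violation can be reduced to a stable string key; the `baseline_delta` function and the key format are hypothetical and are not taken from the review tooling.

```python
def baseline_delta(baseline, current):
    """Classify violations relative to the baseline.

    baseline: iterable of violation keys recorded in the baseline file
    current:  iterable of violation keys found in the current cycle
    """
    baseline, current = set(baseline), set(current)
    return {
        "carried_over": sorted(baseline & current),      # in both
        "resolved": sorted(baseline - current),          # fixed since baseline
        "newly_introduced": sorted(current - baseline),  # regressions
    }
```

With a baseline seeded at `0` violations, everything found in the first comparison lands in `newly_introduced`, which is what makes regressions countable from cycle 2 onward.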

## 2026-05-18 — [process] When autodev rewinds N → 7 (or any earlier step) mid-session, treat the handoff as a session boundary

**Trigger**: In Step 11 (Run Tests) cycle 1, the Jetson e2e gate routed the flow back to Step 7 (Implement) for AZ-618 (cross-cutting 5pt task with 12 infrastructure deps). The user repeatedly chose to continue in the same conversation. I rewound state cleanly (task spec + autodev state) but, on attempting to enter the implement skill's batch loop in the SAME conversation, found that even just investigating the 12 builder signatures consumed enough context to reach the Caution zone — writing the implementation would have hit truncation mid-batch.

**What changed**: When the autodev rewinds the flow to an EARLIER step in the same conversation (Step 11 → Step 7, Step 11 → Step 9, etc.), treat the rewind itself as a session boundary, regardless of whether the flow file's Auto-Chain Rules table marks it as one. Save the bootstrap artifacts (task spec, state, dependencies-table refresh), commit them, then ask for a fresh conversation. The rewind already cost real tool calls; the destination step's batch loop deserves clean context. Document the rewind reason in `sub_step.detail` so re-entry is one-line clear.

## 2026-05-17 — [tooling] Always call `getTransitionsForJiraIssue` before `transitionJiraIssue`

**Trigger**: In batch 87 (autodev step 10), I transitioned AZ-436..AZ-439 with `transition.id="31"`, assuming from stale memory that it mapped to "In Progress". Read-back showed all four had moved to **Done** instead (id `31` in this workflow = Done; In Progress = `21`, In Testing = `32`, To Do = `11`). The mistake was caught by the tracker rule's mandatory read-back gate, fixed by re-transitioning to `21`, and confirmed via a second read-back.

**What changed**: Treat the transition ID as workflow-specific, not memorizable across sessions. Always query `getTransitionsForJiraIssue` first on the actual target issue (or one in the same project/workflow) and select the transition by `name` ("In Progress" / "In Testing" / "Done" / "To Do"), never by a hard-coded numeric id. This holds even when you "remember" the IDs from an earlier batch the same day, because the agent has no guarantee that the workflow definition is stable.
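Selecting by name can be sketched as a small lookup over the queried transitions. The sketch assumes the result of `getTransitionsForJiraIssue` can be reduced to a list of dicts with `id` and `name` keys; that shape and the `transition_id_by_name` helper are assumptions for illustration, not the tool's documented output.

```python
def transition_id_by_name(transitions, name):
    """Resolve a transition id from its display name, case-insensitively.

    Raises LookupError when the name is missing or ambiguous, so a stale
    or wrong assumption fails loudly instead of moving the issue.
    """
    wanted = name.casefold()
    matches = [t for t in transitions if t["name"].casefold() == wanted]
    if len(matches) != 1:
        available = sorted(t["name"] for t in transitions)
        raise LookupError(
            f"expected exactly one transition named {name!r}; available: {available}"
        )
    return matches[0]["id"]


# The ids recorded in this entry, as they were at the time of the incident.
workflow = [
    {"id": "11", "name": "To Do"},
    {"id": "21", "name": "In Progress"},
    {"id": "32", "name": "In Testing"},
    {"id": "31", "name": "Done"},
]
in_progress_id = transition_id_by_name(workflow, "In Progress")
```

The lookup does not replace the read-back gate; it only removes the memorized id as a source of error.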