Files
satellite-provider/_docs/LESSONS.md
T
Oleksandr Bezdieniezhnykh a0449f79d0
ci/woodpecker/push/01-test Pipeline failed
ci/woodpecker/push/02-build-push unknown status
[AZ-1123] Cycle 11 closeout: traceability, deploy, retro
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-26 11:08:24 +03:00

70 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Engineering Lessons
Recurring bugs, surprising library behaviors, and process insights extracted from completed cycles. Newest at the top. Keep entries short — this is for fast scanning at the start of new cycles, not exhaustive history.
This file has two layers:
- **Deep engineering lessons** (L-NNN): library bugs, architectural insights, multi-paragraph context. Persist forever.
- **Ring buffer** at the bottom: most recent 15 single-sentence lessons emitted by the retrospective skill, consumed by `new-task` / `plan` / `decompose` / `autodev` Step 0. Oldest entries drop off the top.
Categories: `estimation` · `architecture` · `testing` · `dependencies` · `tooling` · `process`
---
## L-001 — Dapper `TypeHandler<T>` is bypassed for enum types during read deserialization
**Cycle**: 1 (AZ-484)
**Discovered by**: integration test failure (`Error parsing column 12 (source=google_maps - String)`); root-caused via web search to long-standing Dapper issue [#259](https://github.com/DapperLib/Dapper/issues/259).
**Affects**: Dapper 2.1.35 (and most other versions until the proposed `Settings.PreferTypeHandlersForEnums` opt-in in PR #2200, not yet merged).
**What happens**
Registering `SqlMapper.AddTypeHandler(new MyEnumHandler())` for an `enum` type — even via `SqlMapper.TypeHandler<TEnum>` — works for **writes** (the handler's `SetValue` is invoked for parameter binding) but is **silently bypassed for reads**. Dapper's IL-emitted deserializer checks `IsEnum` first and falls back to `Enum.TryParse(string, ignoreCase: true)`.
**Why this is dangerous**
If the enum's wire string happens to match a member name case-insensitively (e.g., `RegionStatus.Failed``"failed"`), the bypass goes unnoticed and round-trip works *accidentally*. The bug only surfaces when the wire format diverges from the C# member name (e.g., `TileSource.GoogleMaps``"google_maps"``Enum.TryParse("google_maps")` does not match `GoogleMaps` because of the underscore).
**Recommended approach**
- Do **not** rely on `SqlMapper.TypeHandler<TEnum>` for read-side enum mapping unless the wire values match the enum member names case-insensitively.
- For enums whose wire format diverges (snake_case, kebab-case, custom IDs), store the entity field as `string` and provide an explicit converter (`*Converter.ToWireValue` / `FromWireValue`) for use at the service-layer boundary. This is what AZ-484 does for `TileEntity.Source``TileSourceConverter`.
- Unit-test the converter directly. Do not assume that round-tripping through Dapper proves anything for enums.
**Detection**
- Unit tests of the type handler in isolation will pass even when the handler is bypassed at runtime.
- Failure surfaces only at integration-test time when the actual SELECT runs.
- If you must keep an enum-typed field, write at minimum one integration test that reads the enum back through Dapper from a real database row.
---
- [2026-06-25] [process] Documentation-only autodev cycles should still run Step 12 traceability rows (doc-verified ACs) and Step 13 ripple logs, while Steps 1415 are appropriately skipped when no code surface changes — avoids empty cycle artifacts without running meaningless security/perf gates (cycle 11: AZ-1123).
Source: _docs/06_metrics/retro_2026-06-25_cycle11.md
## Ring buffer (last 15 entries — newest at top)
- [2026-06-25] [testing] PT-07 cold-vs-warm region latency is sensitive to outlier cold p95 on a warm compose volume — the perf gate should drain the region queue before the warm pass and accept warm p50 < cold p50 when p95 is within noise (cycle 10: two marginal PT-07 FAILs before harness fix; AZ-1113 did not touch region paths).
Source: _docs/06_metrics/retro_2026-06-25_cycle10.md
- [2026-06-25] [process] Retrospective security recommendations that name concrete finding IDs (F-AZ795-1/2, F-AZ810-1) and fit ≤2 SP can ship as the sole cycle theme and close multi-cycle carry-overs in one batch — cycle 9 Action #3 → cycle 10 AZ-1113 resolved all three Low A09 findings.
Source: _docs/06_metrics/retro_2026-06-25_cycle10.md
- [2026-06-25] [tooling] Creating `docker-compose.perf.yml` without documenting it in deployment docs leaves the next cycle rediscovering the same host-port conflict — ship the compose overlay and a one-paragraph playbook in `containerization.md` in the same cycle (cycle 10: file added, doc still pending).
Source: _docs/06_metrics/retro_2026-06-25_cycle10.md
- [2026-06-25] [tooling] When host port 5433 is occupied by a sibling Postgres container, integration and perf gates must not depend on publishing postgres to the host — use a self-contained test compose file (internal network only) or a documented `ports: !reset []` override on the dev stack so Step 11/15 can run without stopping sibling projects (cycle 9: `fleet-viewer-dev-db` blocked both integration tests and perf until compose was adjusted).
Source: _docs/06_metrics/retro_2026-06-25_cycle9.md
- [2026-06-25] [testing] Adding a new transport (gRPC) over shared orchestrator logic does not automatically extend the perf harness — REST PT-01..PT-08 can pass while the new RPC surface stays Unverified until an explicit PT-NN scenario and threshold land in `performance-tests.md` + `run-performance-tests.sh` (cycle 9: gRPC DeliverRouteTiles had no perf scenario; gate passed on REST-only evidence).
Source: _docs/06_metrics/retro_2026-06-25_cycle9.md
- [2026-06-25] [architecture] Shared wire contracts belong in a leaf `*.GrpcContracts` (or equivalent) project referenced by both server and test client — keeping proto in the API project forces test projects to reference the full API assembly and couples codegen to the delivery layer (cycle 9: `SatelliteProvider.GrpcContracts` extracted from Api).
Source: _docs/06_metrics/retro_2026-06-25_cycle9.md
- [2026-05-23] [process] Step-14 security-audit Medium findings whose remediation fits the small-fix threshold (≤2 SP, ≤2 files, ≤1 contract bump) should be resolved within the same autodev invocation rather than deferred to cycle N+1 — the fix lands with the same commit chain that introduced the surface, the contract version reflects the fix immediately, and the traceability matrix and blackbox-tests.md sub-cases are written while the finding is fresh; codify the option as a first-class A/B/C choice in `.cursor/skills/autodev/flows/existing-code.md` Step-14 action (cycle 8: F-AZ809-1 unbounded `geofences.polygons` DoS — discovered in commit `ac40a8b`, resolved in commit `8fca6e0` with `MaxPolygons = 50` cap + unit + integration test + `route-creation.md` v1.0.1 patch bump, ~30 minutes from finding to fix landed).
Source: _docs/06_metrics/retro_2026-05-23_cycle8.md
- [2026-05-23] [process] Retrospective recommendations ship end-to-end in the next cycle only when they (a) name concrete tracker tickets / files / endpoints in the action text, (b) are sized as a coherent cycle theme rather than scattered one-off fixes, and (c) the next cycle's planning phase pulls the recommendation directly into the task slate without re-deriving scope — phrase recommendations to satisfy all three or they become multi-cycle carry-overs (cycle 7 Action 3 named the 4 AZ-795 child endpoints + the SP sizing → cycle 8 shipped AZ-808 + AZ-809 + AZ-810 + AZ-811 + AZ-812 as the coherent strict-validation theme, first directly-traceable cross-cycle improvement-action end-to-end in project history).
Source: _docs/06_metrics/retro_2026-05-23_cycle8.md
- [2026-05-23] [tooling] When a contract introduces a new required field or renames a wire key, the test-spec sync step must ripgrep all consumer paths (`scripts/run-performance-tests.sh`, `scripts/probe_*.sh`, `README.md` example URLs, `_docs/04_deploy/*.md` example bodies, `_docs/02_document/tests/blackbox-tests.md` trigger excerpts, OpenAPI examples) and verify they reference the new shape — otherwise partial syncs surface at Step 15 Performance Test gate or worse in production (cycle 8: AZ-812 lat/lon rename updated the perf script but AZ-809's newly-required `requestMaps` + `createTilesZip` fields did not propagate to PT-06; caught at Step 15 with a 1-line script fix in commit `32bc5c1` — a static-check probe at batch closure would have caught it 2 days earlier).
Source: _docs/06_metrics/retro_2026-05-23_cycle8.md
- [2026-05-23] [process] When verifying a "no-regression" AC for an input-validation change ("AZ-NNN does not break existing tests"), the only sound evidence is a green integration-test run — tracing fixture variables back to their generators in source is insufficient because helpers can produce values outside the new bounds and previously slipped through silently when no validator existed; document the standard as "verified by reading source" → unconfirmed, "verified by full test run" → confirmed, and gate the batch report's AC table on the latter before the implement skill closes the batch (cycle 8: AZ-810 batch_04 AC-9 claimed "no AZ-488 regression" based on tracing `latitude = coord.Latitude` in test source, but `NextTestCoordinate` seeded by `(Ticks/TicksPerSecond) % 1_000_000` produced lat far above 90° at runtime; the false-PASS only surfaced at autodev Step 11 when the integration test run returned HTTP 400 from the new validator on the AZ-488 happy path).
Source: _docs/03_implementation/batch_04_cycle8_report.md (AC-9 row)
- [2026-05-22] [process] When the implement skill ships a cycle's batch commit without writing `_docs/03_implementation/implementation_report_*_cycle{N}.md`, downstream skills (test-spec cycle-update, document task mode, retrospective Step 1) must fall back to reading the cycle's task specs in `_docs/02_tasks/done/` plus the commit body via `git log --grep='[AZ-...]'` — codify the fallback in those skills' instructions instead of leaving it as per-cycle improvisation, because the implicit contract between Step 10 and Steps 11-17 broke silently this cycle and only succeeded because every downstream skill happened to be robust enough to substitute (cycle 7: AZ-794+AZ-795+AZ-796 shipped as commit `865dfdb` with no report artifact; doc-skill auto-walked the diff, test-spec read the task specs, retrospective wrote from the deploy + security + perf reports — all worked, but the contract was never formal). **Status**: closed by cycle 8 — the implement skill self-corrected and produced both per-batch reports AND a consolidated implementation report AND a completeness report AND a cumulative cross-batch review.
Source: _docs/06_metrics/retro_2026-05-22_cycle7.md
- [2026-05-22] [testing] When a strict-validation layer ships (`JsonSerializerOptions.UnmappedMemberHandling.Disallow`, FluentValidation rules, explicit DTO `[JsonRequired]`), expect the project's own integration tests to surface latent bugs the prior lenient defaults had been masking — silent PascalCase fallback property names, out-of-range fixture coordinates, wrong-cased JSON keys; correct them in the same PR or the test suite goes red and the strict layer looks like a regression instead of the bug-finder it is (cycle 7: `IdempotentPostTests.RoutePoint` had been posting `{"Lat":...}` against a `[JsonPropertyName("lat")]` DTO for months; the new strict deserializer caught it and the 2-line payload fix landed alongside the strict layer; **cycle 8 instance**: `UavUploadTests.NextTestCoordinate` produced lat > 90°, caught by AZ-810 validator, 2-file clamp fix in batch 4).
Source: _docs/06_metrics/retro_2026-05-22_cycle7.md
- [2026-05-22] [architecture] Contract MAJOR bumps for projects with ≤2 known consumers should ship without a wire-format adapter — the cost of maintaining a dual-accepting endpoint outweighs the benefit when the operator runbook can coordinate the consumer cut-over directly; only invest in an adapter when consumer count or release-cadence asymmetry makes coordinated cut-over impractical (cycle 7: `tile-inventory.md` 1.0.0 → 2.0.0 renamed `tileZoom/tileX/tileY → z/x/y` and shipped breaking; the only consumer is `gps-denied-onboard` and the AZ-777 follow-up loop handled the coordination via the operator runbook in `deploy_cycle7.md`).
Source: _docs/06_metrics/retro_2026-05-22_cycle7.md
- [2026-05-12] [tooling] Kestrel `HttpProtocols.Http1AndHttp2` silently serves only HTTP/1.1 over a plaintext listener — ALPN requires TLS, so any "enable HTTP/2" task without TLS in its definition-of-done will downgrade transparently and the only log line is at INFO; tasks that mention HTTP/2 / h2 / multiplexing / ALPN MUST resolve the TLS-vs-h2c choice at spec-write time and the test gate MUST assert `HttpVersion == 2.0` not just a 200 (cycle 6: AZ-505 AC-5 first landed on h2c plaintext, required a post-merge TLS+ALPN pivot with dev-cert plumbing across compose/tests/perf/docs).
Source: _docs/06_metrics/retro_2026-05-12_cycle6.md