mirror of
https://github.com/azaion/satellite-provider.git
synced 2026-06-21 19:31:13 +00:00
b763da3f24
The AZ-810 metadata validator rejects lat outside [-90, 90] and lon
outside [-180, 180]. Two NextTestCoordinate() helpers seeded their
counter from `(Ticks/TicksPerSecond) % 1_000_000` and returned
`60 + n*0.0005`, producing lat well above 90° for almost any seed
(e.g. n=200000 -> lat=160). Pre-AZ-810 there was no validator and no
DB constraint, so the out-of-range values were silently accepted; the
new validator (correctly) rejected them at HTTP 400.
Clamp both helpers to non-overlapping OSM-valid ranges:
- UavUploadTests.cs: lat in [50, 70), lon in [10, 40)
- UavUploadValidationTests.cs: lat in [-70, -50), lon in [-40, -10)
Non-overlap (not the prior +5_000_000 counter offset) is what now
guarantees AZ-488 and AZ-810 suites don't collide on the per-source
UNIQUE index when both run against the same DB.
No production code change; AZ-810 validator behaviour is unchanged.
Also:
- Correct AC-9 in batch_04_cycle8_report.md: the original claim
("verified by tracing source") was a false-PASS; the autodev
Step 11 test run surfaced the gap. Now confirmed by full-suite
green (scripts/run-tests.sh --full).
- Add ring-buffer lesson on AC-verification standards for input-
validation changes: tracing fixture variables to their generators
is insufficient; only a green integration-test run is sound
evidence for a "no-regression" AC.
Co-authored-by: Cursor <cursoragent@cursor.com>
72 lines
14 KiB
Markdown
72 lines
14 KiB
Markdown
# Engineering Lessons
|
|
|
|
Recurring bugs, surprising library behaviors, and process insights extracted from completed cycles. Newest at the top. Keep entries short — this is for fast scanning at the start of new cycles, not exhaustive history.
|
|
|
|
This file has two layers:
|
|
|
|
- **Deep engineering lessons** (L-NNN): library bugs, architectural insights, multi-paragraph context. Persist forever.
|
|
- **Ring buffer** at the bottom: most recent 15 single-sentence lessons emitted by the retrospective skill, consumed by `new-task` / `plan` / `decompose` / `autodev` Step 0. Oldest entries drop off the top.
|
|
|
|
Categories: `estimation` · `architecture` · `testing` · `dependencies` · `tooling` · `process`
|
|
|
|
---
|
|
|
|
## L-001 — Dapper `TypeHandler<T>` is bypassed for enum types during read deserialization
|
|
|
|
**Cycle**: 1 (AZ-484)
|
|
**Discovered by**: integration test failure (`Error parsing column 12 (source=google_maps - String)`); root-caused via web search to long-standing Dapper issue [#259](https://github.com/DapperLib/Dapper/issues/259).
|
|
**Affects**: Dapper 2.1.35 (and most other versions until the proposed `Settings.PreferTypeHandlersForEnums` opt-in in PR #2200, not yet merged).
|
|
|
|
**What happens**
|
|
Registering `SqlMapper.AddTypeHandler(new MyEnumHandler())` for an `enum` type — even via `SqlMapper.TypeHandler<TEnum>` — works for **writes** (the handler's `SetValue` is invoked for parameter binding) but is **silently bypassed for reads**. Dapper's IL-emitted deserializer checks `IsEnum` first and falls back to `Enum.TryParse(string, ignoreCase: true)`.
|
|
|
|
**Why this is dangerous**
|
|
If the enum's wire string happens to match a member name case-insensitively (e.g., `RegionStatus.Failed` ↔ `"failed"`), the bypass goes unnoticed and round-trip works *accidentally*. The bug only surfaces when the wire format diverges from the C# member name (e.g., `TileSource.GoogleMaps` ↔ `"google_maps"` — `Enum.TryParse("google_maps")` does not match `GoogleMaps` because of the underscore).
|
|
|
|
**Recommended approach**
|
|
- Do **not** rely on `SqlMapper.TypeHandler<TEnum>` for read-side enum mapping unless the wire values match the enum member names case-insensitively.
|
|
- For enums whose wire format diverges (snake_case, kebab-case, custom IDs), store the entity field as `string` and provide an explicit converter (`*Converter.ToWireValue` / `FromWireValue`) for use at the service-layer boundary. This is what AZ-484 does for `TileEntity.Source` ↔ `TileSourceConverter`.
|
|
- Unit-test the converter directly. Do not assume that round-tripping through Dapper proves anything for enums.
|
|
|
|
**Detection**
|
|
- Unit tests of the type handler in isolation will pass even when the handler is bypassed at runtime.
|
|
- Failure surfaces only at integration-test time when the actual SELECT runs.
|
|
- If you must keep an enum-typed field, write at minimum one integration test that reads the enum back through Dapper from a real database row.
|
|
|
|
---
|
|
|
|
## Ring buffer (last 15 entries — newest at top)
|
|
|
|
- [2026-05-23] [process] When verifying a "no-regression" AC for an input-validation change ("AZ-NNN does not break existing tests"), the only sound evidence is a green integration-test run — tracing fixture variables back to their generators in source is insufficient because helpers can produce values outside the new bounds and previously slipped through silently when no validator existed; document the standard as "verified by reading source" → unconfirmed, "verified by full test run" → confirmed, and gate the batch report's AC table on the latter before the implement skill closes the batch (cycle 8: AZ-810 batch_04 AC-9 claimed "no AZ-488 regression" based on tracing `latitude = coord.Latitude` in test source, but `NextTestCoordinate` seeded by `(Ticks/TicksPerSecond) % 1_000_000` produced lat far above 90° at runtime; the false-PASS only surfaced at autodev Step 11 when the integration test run returned HTTP 400 from the new validator on the AZ-488 happy path).
|
|
Source: _docs/03_implementation/batch_04_cycle8_report.md (AC-9 row)
|
|
- [2026-05-22] [process] When the implement skill ships a cycle's batch commit without writing `_docs/03_implementation/implementation_report_*_cycle{N}.md`, downstream skills (test-spec cycle-update, document task mode, retrospective Step 1) must fall back to reading the cycle's task specs in `_docs/02_tasks/done/` plus the commit body via `git log --grep='[AZ-...]'` — codify the fallback in those skills' instructions instead of leaving it as per-cycle improvisation, because the implicit contract between Step 10 and Steps 11-17 broke silently this cycle and only succeeded because every downstream skill happened to be robust enough to substitute (cycle 7: AZ-794+AZ-795+AZ-796 shipped as commit `865dfdb` with no report artifact; doc-skill auto-walked the diff, test-spec read the task specs, retrospective wrote from the deploy + security + perf reports — all worked, but the contract was never formal).
|
|
Source: _docs/06_metrics/retro_2026-05-22_cycle7.md
|
|
- [2026-05-22] [testing] When a strict-validation layer ships (`JsonSerializerOptions.UnmappedMemberHandling.Disallow`, FluentValidation rules, explicit DTO `[JsonRequired]`), expect the project's own integration tests to surface latent bugs the prior lenient defaults had been masking — silent PascalCase fallback property names, out-of-range fixture coordinates, wrong-cased JSON keys; correct them in the same PR or the test suite goes red and the strict layer looks like a regression instead of the bug-finder it is (cycle 7: `IdempotentPostTests.RoutePoint` had been posting `{"Lat":...}` against a `[JsonPropertyName("lat")]` DTO for months; the new strict deserializer caught it and the 2-line payload fix landed alongside the strict layer).
|
|
Source: _docs/06_metrics/retro_2026-05-22_cycle7.md
|
|
- [2026-05-22] [architecture] Contract MAJOR bumps for projects with ≤2 known consumers should ship without a wire-format adapter — the cost of maintaining a dual-accepting endpoint outweighs the benefit when the operator runbook can coordinate the consumer cut-over directly; only invest in an adapter when consumer count or release-cadence asymmetry makes coordinated cut-over impractical (cycle 7: `tile-inventory.md` 1.0.0 → 2.0.0 renamed `tileZoom/tileX/tileY → z/x/y` and shipped breaking; the only consumer is `gps-denied-onboard` and the AZ-777 follow-up loop handled the coordination via the operator runbook in `deploy_cycle7.md`).
|
|
Source: _docs/06_metrics/retro_2026-05-22_cycle7.md
|
|
- [2026-05-12] [tooling] Kestrel `HttpProtocols.Http1AndHttp2` silently serves only HTTP/1.1 over a plaintext listener — ALPN requires TLS, so any "enable HTTP/2" task without TLS in its definition-of-done will downgrade transparently and the only log line is at INFO; tasks that mention HTTP/2 / h2 / multiplexing / ALPN MUST resolve the TLS-vs-h2c choice at spec-write time and the test gate MUST assert `HttpVersion == 2.0` not just a 200 (cycle 6: AZ-505 AC-5 first landed on h2c plaintext, required a post-merge TLS+ALPN pivot with dev-cert plumbing across compose/tests/perf/docs).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle6.md
|
|
- [2026-05-12] [testing] When a test bypasses Dapper to gain access to a feature Dapper doesn't expose (e.g. `ANY($1::uuid[])` array params, raw `NpgsqlCommand` for performance fixtures), the test owns the Npgsql type-conversion contract that Dapper used to handle silently — `DateTime.Kind=Utc` must be converted to `Unspecified` before binding into a `timestamp without time zone` column (cycle 6: AZ-505 introduced two Dapper-bypassing paths and all three new test files hit the same `Cannot write DateTime with Kind=UTC` error until `DateTime.SpecifyKind(..., Unspecified)` was added at the bind sites).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle6.md
|
|
- [2026-05-12] [testing] Tests that assert specific schema artifact names (`idx_<name>` / `pk_<name>` / `fk_<name>`) need cross-migration awareness — phrase assertions at the capability abstraction level ("any index whose first column is `location_hash`") rather than the artifact-name level when possible, otherwise drop/rename migrations require fixture co-updates in the same PR (cycle 6: `MigrationTests.Az503NewUniqueIndexCoversIntegerKeyAndFlightId` hardcoded `idx_tiles_location_hash` from migration 014; migration 015 dropped it, broke the assertion until broadened to accept either index name).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle6.md
|
|
- [2026-05-12] [architecture] Cross-repo cryptographic invariants (UUID namespaces, deterministic-key formulas, base32/64 alphabets, tile-zoom conventions) MUST live as code-level constants in BOTH repos with reference-vector tests on BOTH sides — documentation alone is insufficient because constant drift surfaces only as 100% lookup misses in production, harder to detect than a unit-test failure (cycle 5: AZ-503 introduced `TileNamespace = 5b8d0c2e-7f1a-4d3b-9c5e-1f3a8e7d2b6c` which must byte-match the same constant in `gps-denied-onboard/components/c6_tile_cache/_uuid.py`; the satellite-provider side has the constant + 10 Python-generated reference vectors in `Uuidv5Tests.cs` and the sibling repo will mirror).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle5.md
|
|
- [2026-05-12] [tooling] Local Docker/colima DNS cold-start is a recurring class of failure that contaminates the Step-15 perf gate — when the perf-mode "one re-run" rule fires twice across consecutive cycles with the same root-cause class (DNS / NTP / resolver), the harness must escalate from "re-run" to a deterministic fix at the harness layer (DNS pre-warm in script, OR move gate to CI), not just another re-run (cycle 5: PT-01 failed Run #1 on `tile.googleapis.com` cold-start, then Run #2 on `mt0.google.com` cold-start; the warmup probe between runs only touched the hostnames it explicitly named).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle5.md
|
|
- [2026-05-12] [process] When a /autodev cycle's task spec contradicts the live codebase by ≥2 missing prerequisites, the implement skill should preferentially split into foundation + follow-up via A/B/C (option C) rather than (A) silently expand the SP budget or (B) defer the entire task — both halves remain individually shippable and individually testable, the cross-PBI dependency is captured as a blocked-link in the tracker (cycle 5: AZ-503 → AZ-503-foundation + AZ-505 split when `flight_id` / `FlightId` / `voting_status` all missing from live code; AZ-503-foundation shipped this cycle, AZ-505 blocks-on-it for cycle 6).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle5.md
|
|
- [2026-05-12] [dependencies] Major-version bumps of direct deps cascade through transitives; the task spec must list the transitive packages whose major version changes as a result OR explicitly note "transitive major-version drift not analyzed in spec" — verify with `dotnet restore --dry-run` against a scratch branch before writing the spec (cycle 4: AZ-500 surprise-bumped `Microsoft.OpenApi` 1.x → 2.x via the `Microsoft.AspNetCore.OpenApi` 8.0.25 → 10.0.7 path; forced an unscheduled Swashbuckle bump + Program.cs refactor mid-implementation).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle4.md
|
|
- [2026-05-12] [process] When a scope-protected task newly *exposes* a pre-existing bug elsewhere in the codebase (vs. introducing a new one), surface it as a recommended follow-up PBI in the batch report AND list it as a "newly exposed bug" separate from "newly introduced findings" in the deploy report — bugs that already existed don't count as cycle-introduced regressions, but they must not be silently re-buried (cycle 4: AZ-500's bootstrap fix unmasked the pre-existing `scripts/run-performance-tests.sh:417` `grep -o | wc -l` + `pipefail` bug).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle4.md
|
|
- [2026-05-12] [process] When a cycle has a single non-functional task (migration / refactor / dependency hygiene), the retro must reframe the metric set around continuity (0 regressions), forward-resolution (prior findings closed by the bump itself), and unblocking (capabilities now exercisable end-to-end) — task count + complexity points read as misleading flatlines that look like under-productivity (cycle 4: AZ-500 alone delivered 5 SP vs cycle 3's 18 SP, but the cycle's value was forward-resolving 2 cycle-3 advisories and finally executing PT-01..PT-08 end-to-end against the migrated build).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle4.md
|
|
- [2026-05-12] [process] For cross-team blockers (admin team must supply config values, etc.), prefer an Option-B forcing function (ship the validation/scaffolding with prod-empty config that fails-fast at deploy) over deferring the entire task — the fail-fast contract makes the cross-team conversation impossible to skip and ships the in-workspace work in the current cycle (cycle 3: AZ-494 shipped iss/aud validation with empty prod appsettings so deploy must supply real values).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle3.md
|
|
- [2026-05-12] [process] ACs that prescribe a specific measurement or sentinel mechanism (e.g. "per-item latency < 50ms", "guard fires when DB name contains _test") should also prescribe — or explicitly defer — the path for collecting / enforcing it, or implementations will substitute proxies / equivalents that look like spec drift in review (cycle 3: AZ-492 PT-08 per-item gate cost became a derived proxy; AZ-493 DB-name guard became Host-allowlist).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle3.md
|
|
- [2026-05-12] [process] ACs that require cross-repo writes should be tagged with the target workspace and rendered separately in the traceability matrix — mixing them with in-workspace ACs makes "correctly deferred" indistinguishable from "incomplete work" (cycle 3: AZ-494 AC-7 deferred for the suite-repo write; matrix renders as `◐ deferred` which is ambiguous).
|
|
Source: _docs/06_metrics/retro_2026-05-12_cycle3.md
|