From 8e509b550c44112a76e4b43bd12d6c08b60543d1 Mon Sep 17 00:00:00 2001 From: Oleksandr Bezdieniezhnykh Date: Tue, 12 May 2026 16:27:40 +0300 Subject: [PATCH] [AZ-503] [AZ-504] cycle 5 new-task: tile identity + perf-script-fix MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - AZ-503 (3 SP, epic AZ-483) — Tile identity → UUIDv5 deterministic id; integer-only UPSERT with COALESCE(flight_id) per-flight separation; content_sha256 column; POST /api/satellite/tiles/inventory bulk-list endpoint; HTTP/2 at Kestrel edge. Cross-workspace handoff from gps-denied-onboard (AZ-304 / AZ-316 counterpart). Supersedes the AZ-484 UPSERT-conflict-key portion. - AZ-504 (1 SP, epic AZ-483) — Fix scripts/run-performance-tests.sh lines 416-417: grep -o | wc -l + set -o pipefail kills PT-08 when rejected=0. Closes the replay obligation for the cycle-3 perf-harness leftover (leftover deletion gated on green full perf run, AC-4). Updates _dependencies_table.md with cycle 5 entries and records replay attempt #4 against the perf-cycle3 leftover (PBI opened — leftover still stays until AZ-504 lands and full perf run is green). State advanced to Step 10 (Implement). Co-authored-by: Cursor --- _docs/02_tasks/_dependencies_table.md | 17 ++ .../AZ-503_tile_identity_uuidv5_bulk_list.md | 206 ++++++++++++++++++ .../AZ-504_perf_script_grep_pipefail_fix.md | 94 ++++++++ _docs/_autodev_state.md | 6 +- ...026-05-12_perf-cycle3-harness-execution.md | 8 + 5 files changed, 328 insertions(+), 3 deletions(-) create mode 100644 _docs/02_tasks/todo/AZ-503_tile_identity_uuidv5_bulk_list.md create mode 100644 _docs/02_tasks/todo/AZ-504_perf_script_grep_pipefail_fix.md diff --git a/_docs/02_tasks/_dependencies_table.md b/_docs/02_tasks/_dependencies_table.md index eeb95d7..e863d2b 100644 --- a/_docs/02_tasks/_dependencies_table.md +++ b/_docs/02_tasks/_dependencies_table.md @@ -92,6 +92,15 @@ Source: cycle-3 perf-harness leftover replay surfaced the host SDK / project SDK |------|-------|-----------|--------|--------| | AZ-500 | .NET 8 → .NET 10 migration (TFM + SDK pin + Docker images + CI images + Microsoft.AspNetCore.* + Microsoft.Extensions.* + Serilog.AspNetCore) | — | 5 | Done (In Testing) | +### Step 9 cycle 5 — New Task: Tile identity foundation + perf-harness fix (AZ-483 epic) + +Source: cross-workspace handoff from `gps-denied-onboard` (tile-schema scenario analysis) for AZ-503; cycle-3 perf-harness leftover replay-obligation closure for AZ-504. Both attach to epic AZ-483 (Multi-source tile storage + UAV upload, Layer 2) — AZ-503 supersedes the AZ-484 UPSERT-conflict-key portion, AZ-504 unblocks PT-08 measurement. + +| Task | Title | Depends On | Points | Status | +|------|-------|-----------|--------|--------| +| AZ-503 | Tile identity → UUIDv5 + integer UPSERT + bulk-list endpoint | AZ-484 (supersedes UPSERT-conflict-key portion of AZ-484 selection rule) | 3 | To Do | +| AZ-504 | Perf script: fix grep \| wc -l pipefail crash in PT-08 | — (independent; references AZ-488 PT-08 threshold) | 1 | To Do | + ## Execution Order ### Step 6 @@ -135,6 +144,13 @@ Single task; coordinated cross-cutting bump. 1. AZ-500 (5 SP) — .NET 8 → .NET 10 migration. Self-contained but touches every csproj, both Dockerfiles, run-tests.sh, .woodpecker/01-test.yml, global.json, and two docs files. +### Step 9 cycle 5 + +Independent tracks — both can run in parallel; no ordering constraint between them. AZ-504 is a prerequisite for the cycle's Step 15 Performance Test to deliver a green PT-08 reading (and therefore for deleting the perf-cycle3 leftover); AZ-503 is the cycle's main feature. + +1. AZ-504 (1 SP) — cheapest unblocker; lands first to clear PT-08 reporting for the cycle. +2. AZ-503 (3 SP) — main feature; data-model + API; cross-workspace alignment with `gps-denied-onboard` AZ-304 / AZ-316. + ## Total Effort Step 6: 6 tasks, 17 story points @@ -144,6 +160,7 @@ Step 9 cycle 1: 1 task created (AZ-484, 5 pts) Step 9 cycle 2: 2 tasks created (AZ-487 = 2 pts, AZ-488 = 8 pts over-cap user-accepted) — total 10 pts Step 9 cycle 3: 6 tasks created (AZ-491 = 3 pts, AZ-492 = 3 pts, AZ-493 = 2 pts, AZ-494 = 2 pts, AZ-495 = 1 pt, AZ-496 = 2 pts) — total 13 pts Step 9 cycle 4: 1 task created (AZ-500 = 5 pts) +Step 9 cycle 5: 2 tasks created (AZ-503 = 3 pts, AZ-504 = 1 pt) — total 4 pts ## Coverage Verification diff --git a/_docs/02_tasks/todo/AZ-503_tile_identity_uuidv5_bulk_list.md b/_docs/02_tasks/todo/AZ-503_tile_identity_uuidv5_bulk_list.md new file mode 100644 index 0000000..8aa0b17 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-503_tile_identity_uuidv5_bulk_list.md @@ -0,0 +1,206 @@ +# Tile identity → UUIDv5 + integer UPSERT + bulk-list endpoint + +**Task**: AZ-503_tile_identity_uuidv5_bulk_list +**Name**: Tile identity → UUIDv5 + integer UPSERT + bulk-list endpoint +**Description**: Tile identity in the `tiles` table is currently random (`Guid.NewGuid()`), and the UPSERT conflict key uses `double precision` `latitude`/`longitude` and omits `flight_id`, which (a) makes idempotent re-insert fragile against float rounding and (b) destroys per-flight evidence required by the D-PROJ-2 multi-flight voting layer when two UAVs upload the same `(z, x, y)` cell. This task migrates tile identity to deterministic UUIDv5 (`id = uuidv5(NAMESPACE, "{z}/{x}/{y}/{source}/{flight_id or 'none'}")`), adds a `location_hash` UUIDv5 (`uuidv5(..., "{z}/{x}/{y}")`) for efficient cell-bag queries (UI Leaflet path + future voting), switches the UPSERT conflict key to integer-only `(zoom_level, tile_x, tile_y, tile_size_meters, source, COALESCE(flight_id, '00000000-0000-0000-0000-000000000000'::uuid))`, adds a `content_sha256 bytea NOT NULL` column for content-addressable dedup, and adds the `POST /api/satellite/tiles/inventory` endpoint that the onboard `TileDownloader` (`gps-denied-onboard` AZ-316) needs for bbox→tile enumeration during pre-flight provisioning. +**Complexity**: 3 points +**Dependencies**: AZ-484 (UPSERT-per-source + AZ-484 selection rule — done; this task supersedes the UPSERT conflict-key portion) +**Component**: SatelliteProvider.DataAccess + SatelliteProvider.Services.TileDownloader + SatelliteProvider.Api +**Tracker**: AZ-503 +**Epic**: AZ-483 — Multi-source tile storage + UAV upload (Layer 2) + +## Origin + +Cross-workspace surface from `gps-denied-onboard` `_docs/_process_leftovers/2026-05-12_tile-schema-scenario-analysis.md`. The onboard repo's `AZ-304` C6 Postgres schema is being designed with `location_hash` + `content_sha256` columns and a deterministic `id`; this satellite-provider task is the parent-suite counterpart so both sides of the wire agree on tile identity semantics. + +Related: `gps-denied-onboard` `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md` Design Task #2 (multi-flight trust / voting layer) is the downstream consumer of this task's `flight_id`-aware UPSERT and `content_sha256`. + +## Problem + +Three concrete issues in the current code: + +1. **Random tile id** — `SatelliteProvider.Services.TileDownloader/TileService.cs:149` and `UavTileUploadHandler.cs:160` use `Id = Guid.NewGuid()`. The `id` is opaque, non-deterministic, and useless as a content/location handle. Onboard cannot pre-compute or compare ids before round-tripping to the DB. + +2. **Float-based UPSERT collapses multi-flight evidence** — `TileRepository.InsertAsync` line 124: `ON CONFLICT (latitude, longitude, tile_zoom, tile_size_meters, source) DO UPDATE SET file_path = EXCLUDED.file_path, ...`. Problems: + - `latitude` / `longitude` are `double precision`; Postgres conflict detection requires bit-identical floats. Re-uploads of the same tile computed from independently-rounded center coords can miss the conflict and create duplicate rows. AZ-484's `DISTINCT ON` read-side fix papers over the duplicate but does not prevent it. + - The conflict key omits `flight_id`. When two flights upload `source='uav'` for the same cell, `DO UPDATE` overwrites Flight A with Flight B. **D-PROJ-2 voting (Design Task #2 from the 2026-05-09 leftover) needs both rows alive.** + +3. **No bulk-list endpoint for pre-flight provisioning** — onboard `TileDownloader` (`gps-denied-onboard` AZ-316) calls `GET /api/satellite/tiles?bbox=...&zoom=...&list-only=true` to size and enumerate a pre-flight cache build. **This endpoint does not exist** in `SatelliteProvider.Api/Program.cs`. The closest is `GetTilesByRegionAsync` (private, lat/lon-meters input, no HTTP surface) and `GET /api/satellite/tiles/latlon` (single tile). Operators today cannot pre-size a cache build over the bbox the mission planner produces. + +4. **No content digest** — there is no `content_sha256` column. Same JPEG re-uploaded under a different `source` or `flight_id` is indistinguishable from a re-encode. + +## Outcome + +- Tile id becomes deterministic: `id = uuidv5(TILE_NAMESPACE, "{z}/{x}/{y}/{source}/{flight_id or '00000000-0000-0000-0000-000000000000'}")`. Same inputs always produce the same id; idempotent inserts no longer require a "did this row exist?" pre-check. +- `location_hash uuid NOT NULL` column: `uuidv5(TILE_NAMESPACE, "{z}/{x}/{y}")`. Drives Scenario 1 (UI Leaflet `/tiles/{z}/{x}/{y}`) lookup as a single hash-index probe, and Scenario 6 (voting) as a single cell-bag fetch. +- `content_sha256 bytea NOT NULL` column: SHA-256 of the JPEG body, computed at insert time. Enables dedup detection ("Flight B uploaded a byte-identical tile to Flight A — flag for inspection") and integrity checks on the read path. +- UPSERT conflict key becomes: + ``` + ON CONFLICT (zoom_level, tile_x, tile_y, tile_size_meters, source, + COALESCE(flight_id, '00000000-0000-0000-0000-000000000000'::uuid)) + DO UPDATE SET file_path = EXCLUDED.file_path, captured_at = EXCLUDED.captured_at, updated_at = EXCLUDED.updated_at + ``` + Integer-only equality, per-flight separation. (Note: requires `zoom_level` column — the current schema uses `tile_zoom`; either rename or use `tile_zoom` consistently — this task keeps the existing column name `tile_zoom` and adjusts the onboard-side spec to match.) +- **New endpoint: `POST /api/satellite/tiles/inventory`** (preferred over the originally-proposed `GET /api/satellite/tiles?bbox=...`). Body is a list of `(z, x, y)` coords (or pre-computed `location_hash` UUIDs); response is one entry per input. Justification: the onboard side already has the deterministic slippy-tile math (`helpers/wgs_converter.py.latlon_to_tile_xy`, identical to C# `GeoUtils.WorldToTilePos`); making the server re-enumerate the bbox is wasted work. POST inventory is also the natural batched-existence-check shape — single round-trip, indexed lookup per row. + ``` + POST /api/satellite/tiles/inventory + Body: { "tiles": [ { "z": 18, "x": 12345, "y": 23456 }, ... ] } + // OR { "location_hashes": [ "uuid-v5", ... ] } + Response: { + "results": [ + { + "tile_x": 12345, "tile_y": 23456, "tile_zoom": 18, "location_hash": "uuid-v5...", + "present": true, + "id": "uuid-v5...", "captured_at": "...", "resolution_m_per_px": 0.3, + "estimated_bytes": 42017, "source": "google_maps", "flight_id": null + }, + { "tile_x": 12346, "tile_y": 23456, "tile_zoom": 18, "location_hash": "uuid-v5...", "present": false } + ] + } + ``` + Server-side query (single round-trip): + ```sql + SELECT DISTINCT ON (location_hash) + location_hash, id, captured_at, resolution_m_per_px, estimated_bytes, source, flight_id + FROM tiles + WHERE location_hash = ANY($1::uuid[]) + AND voting_status IN ('trusted', NULL) + ORDER BY location_hash, captured_at DESC, updated_at DESC, id DESC; + ``` +- Required covering index for Leaflet hot path: + ```sql + CREATE INDEX tiles_leaflet_path + ON tiles (location_hash, captured_at DESC, updated_at DESC, id DESC) + INCLUDE (file_path, content_type, etag, voting_status); + ``` + Leaflet `/tiles/{z}/{x}/{y}` becomes an index-only scan (no heap fetch when `voting_status='trusted'`): + ```sql + SELECT file_path FROM tiles + WHERE location_hash = $1 AND voting_status IN ('trusted', NULL) + ORDER BY captured_at DESC, updated_at DESC, id DESC LIMIT 1; + ``` +- Migration: additive — add `location_hash`, `content_sha256` as nullable, backfill in a single `UPDATE`, then set `NOT NULL`. Drop the old `UNIQUE (latitude, longitude, tile_zoom, tile_size_meters, source)` constraint and add the new integer-keyed one. Backfill `id` IS NOT trivial — see Risk 1. +- The UUIDv5 generator is implemented inline in `SatelliteProvider.Common/Utils/Uuidv5.cs` (RFC 9562 algorithm — 60 lines of C#, MD5 dropped, SHA-1 only). .NET 10 has no native `Guid.CreateVersion5` (only `CreateVersion7`); we do NOT add a 3rd-party dep for this. + +## Scope + +### Included +- `SatelliteProvider.Common/Utils/Uuidv5.cs` — pure-C# RFC 9562 UUIDv5 implementation, unit-tested against the Python `uuid.uuid5` reference vectors (the onboard side uses Python `uuid.uuid5`; both must produce byte-identical output for the same name + namespace). +- `SatelliteProvider.DataAccess` — Dapper SQL changes: new columns, new UPSERT, new SELECT shapes. `TileRepository.GetByLocationHashAsync` and `TileRepository.InventoryAsync(uuid[])` added; `GetByTileCoordinatesAsync` rewritten to use `location_hash`. Existing `tiles_leaflet_path` covering index added. +- `SatelliteProvider.Services.TileDownloader` — `BuildTileEntity` no longer calls `Guid.NewGuid()`; it computes the UUIDv5 and the `location_hash` from the deterministic inputs. Same change in `UavTileUploadHandler`. +- `SatelliteProvider.Api/Program.cs` — new MapPost route `/api/satellite/tiles/inventory`; existing `/tiles/{z}/{x}/{y}` Leaflet path migrated to use `location_hash`-keyed query against the covering index. +- Migration script in the existing migrations tool (whichever the repo uses — Flyway/EFCore/handwritten SQL; this task uses whatever is already established). +- **On-disk layout migration**: UAV tiles move from `./tiles/uav/{zoom}/{x}/{y}.jpg` to `./tiles/uav/{flight_id}/{zoom}/{x}/{y}.jpg`. Google Maps tiles stay at `./tiles/{zoom}/{x}/{y}/...jpg` (or normalise to `./tiles/google_maps/{zoom}/{x}/{y}.jpg` if the cleanup is cheap). The DB `file_path` column is rewritten in the same backfill that populates `location_hash`/`content_sha256`. Test `SatelliteProvider.Tests/UavTileFilePathTests.cs:23` is updated to assert the new path shape. +- OpenAPI annotations for the new endpoint. +- Unit tests for `Uuidv5` against Python reference vectors. +- Integration tests for the new POST `/api/satellite/tiles/inventory` surface (use existing `docker-compose.tests.yml` fixture). +- Integration test for multi-flight upload — confirms two `source='uav'` rows for the same `(z, x, y)` from different `flight_id`s coexist on disk (different paths) and in DB (different rows, same `location_hash`). +- **Enable HTTP/2 (and HTTP/3 over TLS where feasible)** at the Kestrel endpoint boundary: `EndpointDefaults.Protocols = HttpProtocols.Http1AndHttp2AndHttp3`. Verify the dev `docker-compose` nginx reverse proxy also has `http2 on;` in the relevant `listen` directive. This is the bulk-retrieval mechanism for BOTH Leaflet (browser opens one TCP connection, multiplexes 30+ tile streams, HPACK compresses repeated headers) and UAV provisioning (`httpx.Client(http2=True)` on the onboard side). No application-level batching is added. +- **No materialised `tile_current` pointer table** — deferred until production profiling demands it. Pre-optimisation rejected. +- **No content-addressable / blob storage layout** — `content_sha256` is for dedup *detection* (and integrity), not dedup *storage*. CAS adds complexity without measurable benefit at our scale. +- **No multipart / tar / zip bundle endpoint** for UAV provisioning — rejected in favour of inventory POST + per-tile GET over HTTP/2 multiplex. The bundle approach collapses resume granularity, loses per-tile cacheability, and gives no throughput win over HTTP/2 multistream. PMTiles archive is excellent for STATIC tile sets (Cloudflare/Protomaps) but our DB is dynamic — UAV uploads invalidate any pre-built archive. Defer PMTiles until profiling demands it. + +### Excluded +- The voting / trust-promotion layer (Design Task #2 from 2026-05-09 leftover) — separate task. This task makes voting POSSIBLE by keeping per-flight rows; it does NOT implement voting. +- Onboard companion auth (mTLS / signed payloads) — already covered by D-PROJ-2 Design Task #1. +- Renaming the `tile_zoom` column to `zoom_level` (rule: never rename columns without explicit confirmation — see `coderule.mdc`). +- Per-flight key management (already covered by gps-denied-onboard AZ-318). +- Removing the existing `latitude`/`longitude` columns. They stay as advisory center-of-tile data. + +## Acceptance Criteria + +**AC-1: UUIDv5 reference vectors match Python** +Given the test vector `namespace = TILE_NAMESPACE` and `name = "18/12345/23456/google_maps/00000000-0000-0000-0000-000000000000"` +When `Uuidv5.Create(TILE_NAMESPACE, name)` runs +Then the resulting `Guid` is byte-identical to Python `uuid.uuid5(TILE_NAMESPACE, "18/12345/23456/google_maps/00000000-0000-0000-0000-000000000000")` for ≥10 randomly-generated test cases. + +**AC-2: Insert is idempotent on identical inputs** +Given a tile is inserted with `(tile_zoom=18, tile_x=A, tile_y=B, tile_size_meters=S, source='google_maps', flight_id=NULL)` returning `id=X` +When the same insert is repeated +Then exactly ONE row exists in `tiles`; the returned `id == X`; the `id` column is not regenerated; `updated_at` IS refreshed but `created_at` is NOT. + +**AC-3: Multi-flight UAV uploads coexist** +Given two `source='uav'` inserts for the same `(tile_zoom, tile_x, tile_y, tile_size_meters)` with `flight_id=F1` and `flight_id=F2` (F1 ≠ F2) +When both inserts complete +Then TWO rows exist in `tiles`; each has its own `id`; both rows share the same `location_hash`. + +**AC-4: Float rounding does not break idempotence** +Given an insert with `latitude=47.123456789012345` and another insert recomputed from `tile_center = TileToWorldPos(x, y, z)` (slightly different float representation) +When both inserts target the same `(tile_zoom, tile_x, tile_y, tile_size_meters, source, flight_id)` +Then exactly ONE row results; the conflict triggers despite float differences (because the new UPSERT key does not include `latitude`/`longitude`). + +**AC-5: Inventory endpoint returns one entry per requested coord** +Given a POST body of 25 `(z, x, y)` coords at zoom 18, with 12 already in the DB and 13 absent +When `POST /api/satellite/tiles/inventory` is called +Then `results` contains 25 entries in the SAME ORDER as the input; 12 entries have `present=true` with `id`/`location_hash`/`captured_at` populated, 13 entries have `present=false` with `location_hash` populated (computed via UUIDv5) and `id=null`; per-tile `estimated_bytes` is `null|int`. + +**AC-6: Leaflet path returns most-recent variant via location_hash** +Given multiple rows for `(z, x, y)` from different sources/flights +When `GET /tiles/{z}/{x}/{y}` is called +Then ONE tile body is returned, selected by `WHERE location_hash = $1 ORDER BY captured_at DESC, updated_at DESC, id DESC LIMIT 1` (semantically identical to AZ-484's prior rule, now using `location_hash`). + +**AC-7: content_sha256 is computed and persisted** +Given a UAV upload of a JPEG with known SHA-256 +When the insert lands +Then `content_sha256` matches the externally-computed digest; a follow-up insert of a byte-identical body produces the SAME `content_sha256` value. + +**AC-8: Migration is reversible (best-effort)** +Given the migration runs forward on a populated `tiles` table +When the back-migration runs +Then the table is restored to the pre-migration shape; data loss is limited to the new columns (`location_hash`, `content_sha256`). (Best-effort because UPSERT key changes are awkward to reverse cleanly.) + +**AC-9: Performance — inventory endpoint ≤ 500 ms for 2500 tiles** +Given a POST body listing 2500 `(z, x, y)` coords at zoom 18 against a populated DB (average ~3 versions per cell across `google_maps` + `uav` sources) +When `POST /api/satellite/tiles/inventory` is called +Then the response arrives within 500 ms (95th percentile over 20 calls). Index-only scan via `tiles_leaflet_path` is the expected plan. + +**AC-10: Leaflet hot path is index-only** +Given the `tiles_leaflet_path` covering index exists and the table has ≥ 100k rows +When `EXPLAIN (ANALYZE, BUFFERS) SELECT file_path FROM tiles WHERE location_hash = $1 AND voting_status IN ('trusted', NULL) ORDER BY captured_at DESC LIMIT 1` is run +Then the plan is `Index Only Scan using tiles_leaflet_path`; `Heap Fetches = 0` (visibility map fully built); total time < 0.5 ms. + +**AC-12: HTTP/2 multiplexed responses** +Given Kestrel is configured with `Http1AndHttp2AndHttp3` (or `Http1AndHttp2` over plain TLS without QUIC support) +When a single `httpx.Client(http2=True)` issues 20 concurrent `GET /tiles/{z}/{x}/{y}` requests +Then the responses arrive over ONE TCP connection (verifiable via packet capture / `httpx.Response.http_version == 'HTTP/2'`); all 20 responses interleave on the wire; total wall-clock time < 2× single-tile latency (vs. 20× for HTTP/1.1 without pipelining); per-tile ETags + `Cache-Control` headers are preserved unchanged. + +**AC-11: Per-flight on-disk separation** +Given two UAV uploads of the same `(z, x, y)` from `flight_id=F1` and `flight_id=F2` +When both inserts complete and the backing JPEGs are persisted +Then two distinct files exist at `./tiles/uav/{F1}/{z}/{x}/{y}.jpg` and `./tiles/uav/{F2}/{z}/{x}/{y}.jpg`; `rm -rf ./tiles/uav/{F1}/` removes ONLY Flight F1's evidence (Flight F2's file is untouched); the DB `file_path` columns reflect the per-flight paths. + +## Constraints + +- **No column renames**: keep `tile_zoom`, `tile_x`, `tile_y`, `latitude`, `longitude` exactly as named today. The onboard side (`AZ-304`) is responsible for matching column names on its own table. +- **UUID namespace MUST be agreed cross-repo**: pick a fixed UUID (e.g. `urn:uuid:5b8d0c2e-...`) and pin it in BOTH `SatelliteProvider.Common/Utils/Uuidv5.cs` and the onboard `gps_denied_onboard/components/c6_tile_cache/_uuid.py`. Document the chosen value in this task's review. +- **No third-party UUIDv5 dependency**: pure-C# RFC 9562 implementation, ≤80 LoC. +- **Migration must run online**: the `tiles` table is the busiest table in the service. Adding the new columns must be a non-blocking `ALTER TABLE` followed by a backfill in batches. +- **Existing AZ-484 selection rule is preserved**: the read-side `ORDER BY captured_at DESC, updated_at DESC, id DESC` tie-break stays; the only change is the WHERE clause uses `location_hash` instead of `(tile_zoom, tile_x, tile_y)`. + +## Risks & Mitigation + +**Risk 1: Backfilling `id` on existing rows is irreversible** +- *Risk*: existing rows have random `Guid` ids. If we overwrite them with computed UUIDv5 values, any cached external reference to the old id (e.g. operator UI bookmarks, audit log entries) becomes stale. +- *Mitigation*: add a `legacy_id uuid NULL` column populated from the existing `id` before the backfill. The old id is preserved for diagnostic queries. The new UUIDv5 `id` takes over. After a deprecation window (one cycle), drop `legacy_id`. + +**Risk 2: UUIDv5 namespace divergence between Python and C#** +- *Risk*: subtle bug in the C# SHA-1 impl produces ids that differ from Python's. Cross-repo lookups fail silently. +- *Mitigation*: AC-1 requires byte-identical output across ≥10 vectors. The vectors are generated by Python and pasted into the C# test fixture as fixed-string expectations. + +**Risk 3: Migration deadlock under load** +- *Risk*: `ALTER TABLE ADD COLUMN ... NOT NULL DEFAULT '...'` on a 10M+ row table locks the table for minutes. +- *Mitigation*: 3-step migration — (a) ADD COLUMN nullable; (b) UPDATE in 1000-row batches with `pg_sleep(0.1)` between batches; (c) SET NOT NULL after backfill. Documented in the migration file. + +**Risk 4: Onboard `TileDownloader` (AZ-316) writes against this endpoint before it exists** +- *Risk*: ordering — onboard AZ-316 might be implemented before this task lands. Production calls hit 404. +- *Mitigation*: the onboard `TileDownloader` already has a fallback path (per-tile GET via `/tiles/{z}/{x}/{y}`); document that fallback in AZ-316's caveats and gate the `list-only=true` path behind a feature flag `c11.use_bulk_list_endpoint` (default `false` until this satellite-provider task is in production). + +## References + +- `gps-denied-onboard/_docs/_process_leftovers/2026-05-12_tile-schema-scenario-analysis.md` — origin + scenario analysis. +- `gps-denied-onboard/_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md` — sibling design tasks (inbound ingest, voting layer). +- `gps-denied-onboard/_docs/02_tasks/todo/AZ-304_c6_postgres_schema.md` — onboard counterpart schema. +- `gps-denied-onboard/_docs/02_tasks/todo/AZ-316_c11_tile_downloader.md` — onboard consumer of the bulk-list endpoint. +- `SatelliteProvider.DataAccess/Repositories/TileRepository.cs` — current `Guid.NewGuid()` + float UPSERT. +- `SatelliteProvider.Services.TileDownloader/TileService.cs` — current `BuildTileEntity`. +- `SatelliteProvider.Api/Program.cs` — endpoint surface. diff --git a/_docs/02_tasks/todo/AZ-504_perf_script_grep_pipefail_fix.md b/_docs/02_tasks/todo/AZ-504_perf_script_grep_pipefail_fix.md new file mode 100644 index 0000000..5de0794 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-504_perf_script_grep_pipefail_fix.md @@ -0,0 +1,94 @@ +# Perf script: fix grep | wc -l pipefail crash in PT-08 + +**Task**: AZ-504_perf_script_grep_pipefail_fix +**Name**: Perf script: fix grep | wc -l pipefail crash in PT-08 +**Description**: `scripts/run-performance-tests.sh:416-417` uses `grep -o ... | wc -l` to count `"status":"rejected"` and `"status":"accepted"` markers in the PT-08 batch response. On the happy path (`rejected=0`), `grep -o` exits 1 (no matches), and because the script has `set -o pipefail` + `set -e`, the assignment kills the script silently right after `rejected=0`. The cycle-3 perf-harness leftover (`_docs/_process_leftovers/2026-05-12_perf-cycle3-harness-execution.md`) has reproduced this crash twice (replay #2 + #3, post-AZ-500); PT-01..PT-07 stay green. Until this one-line fix lands, PT-08 cannot be measured against its 2000 ms p95 threshold and the leftover stays open. +**Complexity**: 1 points +**Dependencies**: None +**Component**: scripts/run-performance-tests.sh +**Tracker**: AZ-504 +**Epic**: AZ-483 — Multi-source tile storage + UAV upload (Layer 2) + +## Problem + +`scripts/run-performance-tests.sh` lines 416-417 currently read: + +```bash +accepted=$(grep -o '"status":"accepted"' "$PERF_TMP_DIR/pt08_resp.json" | wc -l | tr -d ' ') +rejected=$(grep -o '"status":"rejected"' "$PERF_TMP_DIR/pt08_resp.json" | wc -l | tr -d ' ') +``` + +With `set -o pipefail` (line 16) and `set -e`, when `grep -o` matches zero times it exits 1; the pipeline returns 1; the assignment kills the script. This is masked when both counts are ≥1 — the bug only surfaces on the happy path where `rejected=0`. Two consecutive full-harness replays (post-AZ-500 .NET 10 migration) reproduced the crash at the exact same line; PT-01..PT-07 are unaffected because they don't pipe potentially-empty `grep` output through `wc -l`. + +The leftover documents that the actual perf-relevant data PT-08 captured before crashing was healthy: HTTP 200, batch latency 99 ms (well under the 2000 ms threshold), accepted=2, rejected=0. The harness is buggy, not the production code path. + +## Outcome + +- `scripts/run-performance-tests.sh` runs PT-08 to completion when `rejected=0` (and when `accepted=0`, defensively). +- A full default-parameter perf run (`PERF_REPEAT_COUNT=20 PERF_UAV_BATCH_SIZE=10`) prints the PT-08 summary line with batch p50/p95, accepted total, rejected total, and exits 0 against an api built from `dev`. +- The cycle-3 perf-harness leftover (`_docs/_process_leftovers/2026-05-12_perf-cycle3-harness-execution.md`) is deleted once a full perf run is green (per AZ-500 Constraint). + +## Scope + +### Included +- `scripts/run-performance-tests.sh` lines 416-417 replaced with pipefail-tolerant counting (`grep -c ... || true`). +- A short shellcheck pass on the surrounding `pt08_summary` block so no similar empty-match-counting bug remains in the same scope. +- Delete `_docs/_process_leftovers/2026-05-12_perf-cycle3-harness-execution.md` after the green full perf run completes (post-implementation). + +### Excluded +- Changes to PT-01..PT-07 scenarios or any production code path — bug is harness-only. +- Adding new perf scenarios — out of scope. +- Renaming or restructuring `scripts/run-performance-tests.sh`. +- Server-side per-call `UavTileQualityGate.Validate` timing instrumentation (already deferred in cycle 3 AZ-492 — out of scope here too). + +## Acceptance Criteria + +**AC-1: PT-08 completes on zero-rejected response** +Given the api is up against `docker-compose up -d --build` and a UAV batch upload returns `accepted ≥ 1, rejected = 0` +When the harness runs PT-08 batch counting +Then the assignment to `rejected` succeeds (no script exit); `rejected` is `0`; the loop continues to the next iteration. + +**AC-2: PT-08 completes on zero-accepted response (defensive)** +Given a UAV batch upload returns `accepted = 0, rejected ≥ 1` (e.g., quality gate rejects every item) +When the harness runs PT-08 batch counting +Then the assignment to `accepted` succeeds (no script exit); `accepted` is `0`; the loop continues. + +**AC-3: PT-08 summary line prints in full run** +Given a full default-parameter perf run (`PERF_REPEAT_COUNT=20 PERF_UAV_BATCH_SIZE=10`) against a healthy api +When the harness reaches the PT-08 reporting block +Then a summary line `PT-08 UAV batch upload: PASS p95=Xms / 2000ms (accepted=A, rejected=R, N=20)` is printed; the script exits 0. + +**AC-4: Leftover deletion on green full run** +Given AC-1 + AC-2 + AC-3 hold and the full perf run is green +When the implementer verifies the run +Then `_docs/_process_leftovers/2026-05-12_perf-cycle3-harness-execution.md` is deleted in the same commit (per AZ-500 Constraint "leftover file is deleted ONLY when the full perf script runs cleanly"). + +## Non-Functional Requirements + +**Compatibility** +- The fix must work under `bash` (script's `#!/bin/bash`). No POSIX-sh constraint. +- The fix must not silently swallow other errors in the surrounding block — only the empty-match case (`grep` exit 1) is tolerated; any other failure (file unreadable, pipe broken) must still propagate (`coderule.mdc` "never suppress errors silently"). + +## Constraints + +- One-file edit. No new scripts, no helper extraction. +- Preserve `set -o pipefail` and `set -e` semantics globally — only neutralise grep's exit-1-on-no-match locally. +- Do not change PT-08's threshold (2000 ms p95, established by AZ-488). +- Do not rename, move, or restructure `scripts/run-performance-tests.sh`. + +## Risks & Mitigation + +**Risk 1: `grep -c` counts lines containing the pattern, not occurrences** +- *Risk*: `grep -c` counts matching LINES, not total occurrences. If the PT-08 response packs multiple `"status":"..."` markers on one line (compact JSON), the count diverges from `grep -o | wc -l`. +- *Mitigation*: inspect a sample `pt08_resp.json` from a recent run; if the JSON is one-per-line the `grep -c` swap is exact, otherwise use `grep -o ... | wc -l` wrapped in `|| echo 0` to preserve occurrence-count semantics while neutralising the empty-match exit. + +**Risk 2: Pipefail bug exists elsewhere in the same script** +- *Risk*: similar `grep -o ... | wc -l` patterns elsewhere will crash on different happy paths. +- *Mitigation*: in-scope shellcheck pass over the file flags every `grep -o ... | wc -l` / `grep ... | wc -l` site for review; same defensive treatment applied as needed. + +## References + +- `_docs/_process_leftovers/2026-05-12_perf-cycle3-harness-execution.md` — replay #2 + #3 evidence; one-line fix proposed at lines 56-67 of the leftover. +- `scripts/run-performance-tests.sh:416-417` — exact lines to fix. +- `AZ-488` — PT-08 scenario owner; threshold context. +- `AZ-500` Constraint — "leftover file is deleted ONLY when the full perf script runs cleanly". diff --git a/_docs/_autodev_state.md b/_docs/_autodev_state.md index 4a063f0..5f711ad 100644 --- a/_docs/_autodev_state.md +++ b/_docs/_autodev_state.md @@ -2,9 +2,9 @@ ## Current Step flow: existing-code -step: 9 -name: New Task -status: not_started +step: 10 +name: Implement +status: in_progress sub_step: phase: 0 name: awaiting-invocation diff --git a/_docs/_process_leftovers/2026-05-12_perf-cycle3-harness-execution.md b/_docs/_process_leftovers/2026-05-12_perf-cycle3-harness-execution.md index 44a82e6..6a4e1d9 100644 --- a/_docs/_process_leftovers/2026-05-12_perf-cycle3-harness-execution.md +++ b/_docs/_process_leftovers/2026-05-12_perf-cycle3-harness-execution.md @@ -122,3 +122,11 @@ User picked A at the Step 15 (Performance Test) gate of cycle 4. Full default-pa ## Replay obligation Open a new follow-up PBI for the `scripts/run-performance-tests.sh:416-417` grep fix (estimated 1 SP). Once that lands and a full perf run is green, delete this file. Until then, this leftover stays. + +## Replay attempt #4 — 2026-05-12T13:00:00Z (cycle 5 /autodev Step 9 New Task) + +PBI opened: **AZ-504 — "Perf script: fix grep | wc -l pipefail crash in PT-08"** (1 SP, parent epic AZ-483 — same as PT-08 scenario owner AZ-488). Spec landed at `_docs/02_tasks/todo/AZ-504_perf_script_grep_pipefail_fix.md`. The spec captures the AC-4 obligation that THIS leftover file is deleted in the same commit as the green full perf run. + +The "open the PBI" half of the Replay obligation is now done. The "full perf run is green" half remains outstanding — this leftover stays open until AZ-504 lands AND a default-parameter `./scripts/run-performance-tests.sh` (`PERF_REPEAT_COUNT=20 PERF_UAV_BATCH_SIZE=10`) exits 0 against an api built from `dev`. + +Next-cycle /autodev should NOT attempt replay #5 (open another PBI) — AZ-504 is the canonical replay vehicle. The next replay action is implementing AZ-504 itself (cycle 5 Step 10).