# Batch 41 — Code Review **Tasks**: AZ-320 (C11 IdempotentRetryTileUploader) **Cycle**: 1 **Reviewer**: autodev **Verdict**: **PASS_WITH_WARNINGS** ## Scope reviewed Production code: - `src/gps_denied_onboard/components/c6_tile_cache/_types.py` (added `VotingStatus.UPLOAD_GIVEUP`) - `src/gps_denied_onboard/components/c6_tile_cache/interface.py` (added `TileMetadataStore.increment_upload_attempts(tile_id) -> int` with a `NotImplementedError` default impl) - `src/gps_denied_onboard/components/c6_tile_cache/postgres_filesystem_store.py` (added `increment_upload_attempts` SQL impl, extended `_ALLOWED_VOTING_TRANSITIONS`, tightened `pending_uploads` SQL) - `db/migrations/versions/0003_c11_upload_attempts.py` (additive: column + widened CHECK constraint, reversible) - `src/gps_denied_onboard/components/c11_tile_manager/config.py` (added `C11RetryConfig` frozen dataclass + `disable_retry_decorator` + nested `retry: C11RetryConfig` field) - `src/gps_denied_onboard/components/c11_tile_manager/idempotent_retry.py` (new — `IdempotentRetryTileUploader`, `_RetryMetadataStoreLike`) - `src/gps_denied_onboard/components/c11_tile_manager/__init__.py` (re-exports) - `src/gps_denied_onboard/runtime_root/c11_factory.py` (`build_tile_uploader` wraps in decorator by default; bypass via `disable_retry_decorator=true`; new `clock` keyword parameter) - `src/gps_denied_onboard/fdr_client/records.py` (registered `c11.upload.giveup`) Contracts: - `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` (v1.3.0 — added `increment_upload_attempts`, updated I-8) Tests: - `tests/unit/c11_tile_manager/test_idempotent_retry.py` (new — 13 tests) - `tests/unit/c11_tile_manager/test_protocol_conformance.py` (added AC-9) - `tests/unit/c6_tile_cache/test_protocol_conformance.py` (extended AC-10 enum surface; added `increment_upload_attempts` to two metadata-store fakes) - `tests/unit/test_ac5_alembic.py` (updated head-revision assertion) - `tests/unit/test_az272_fdr_record_schema.py` (added mock payload) ## Phase 1 — Architecture ### AZ-507 cross-component rule `idempotent_retry.py` does NOT import from `components.c6_tile_cache`. The two c6 write surfaces it consumes (`increment_upload_attempts`, `update_voting_status`) are reached via the locally-declared `_RetryMetadataStoreLike` `Protocol` cut. The `UPLOAD_GIVEUP` enum value is reached via a locally-scoped string constant (`_VOTING_STATUS_UPLOAD_GIVEUP = "upload_giveup"`) — c6's `update_voting_status` impl coerces the string back into the enum via `VotingStatus(status)`, so the decorator never imports the enum. This is the same pattern Batch 40's `HttpTileDownloader` uses for the freshness-label string surface. The AZ-270 lint passes — composition root remains the only layer that may bind concrete c6 implementations. ### Composition root wiring `build_tile_uploader` now returns a `TileUploader` Protocol (was the concrete `HttpTileUploader`). Default path wraps the inner in `IdempotentRetryTileUploader`; the `disable_retry_decorator=true` config flag returns the bare inner with a single INFO log (`kind="c11.upload.retry.decorator.bypassed"`). The `clock` keyword defaults to a fresh `WallClock()` for production wiring; tests inject a fake clock to keep timing deterministic. The new factory tests (test_idempotent_retry.py — `test_ac10_*`) exercise both branches by toggling the config flag. ### Backwards-compat for `TileMetadataStore` Protocol The new `increment_upload_attempts` method is added with a `NotImplementedError` default impl on the Protocol surface, so existing AZ-303 conformance tests + existing duck-typed fakes that don't yet implement it continue to satisfy `isinstance` checks (the `runtime_checkable` mechanism only checks for attribute presence — and a default impl IS an attribute). The two c6 conformance fakes that didn't have the method have been updated to declare it (raising `NotImplementedError`) so that `runtime_checkable` continues to behave consistently. ### Migration discipline `0003_c11_upload_attempts.py` is purely additive on top of `0002_c6_tile_identity_and_lru`: - `tiles.upload_attempts INTEGER NOT NULL DEFAULT 0` (existing rows get 0 by default; no data migration required) - `ck_tiles_voting_status` widened to admit `'upload_giveup'` while preserving the `voting_status IS NULL` clause from 0001 (legacy rows on dev DBs that never set a voting status would otherwise fail the new CHECK) - `downgrade()` is symmetric — drops the column and restores the AZ-304 CHECK predicate exactly Per the spec's "Unacceptable substitutes" clause and `coderule.mdc`, AZ-304's 0001 + 0002 migrations are unchanged. ## Phase 2 — Code quality ### Single Responsibility `IdempotentRetryTileUploader` has a single responsibility: bound the retry behaviour around an inner `TileUploader`. Internal helpers are split cleanly: - `_handle_rejected_tiles` — fan-out increment + threshold check - `_mark_giveup` — single-tile transition + FDR + ERROR log - `_backoff_for` — pure exponent computation - `_sleep` — Clock-routed sleep (honours AZ-398) - `_next_retry_at_s` — operator hint derivation - `_with_retry_count` — pure dataclass.replace wrapper No method handles two distinct concerns; every method name matches the work it does. ### Error suppression The decorator does NOT swallow exceptions. `_handle_rejected_tiles` explicitly re-raises any failure from `increment_upload_attempts` (the alternative would be silently retrying without budget enforcement — exactly the failure mode the spec calls out as "unbounded behaviour"). FDR `enqueue` failures inside `_mark_giveup` will propagate; this is consistent with `tile_uploader.py`'s treatment of the same call. ### Comments Production comments are limited to non-obvious intent: - `_VOTING_STATUS_UPLOAD_GIVEUP` — explains why the constant is declared locally (AZ-507 boundary) rather than imported. - The `retries_used += 1` move-up — explains the off-by-one fix and references the spec's worked example. - `_RetryMetadataStoreLike` docstring — documents the `Any`-typed status parameter rationale. No narration comments; the AAA test pattern is honoured throughout the test file (with `# Arrange / # Act / # Assert` headers, omitting sections that are empty). ## Phase 3 — Test coverage vs. spec | AC | Test | Coverage | |----|------|----------| | AC-1 | `test_ac1_success_on_first_attempt_zero_side_effects` | Pass-through; zero retries | | AC-2 | `test_ac2_partial_then_success_increments_attempts_and_sleeps_once` | Sleep[2.0]; retry_count=1; 3 increments | | AC-3 | `test_ac3_per_tile_budget_exhausted_moves_to_giveup` | Threshold trips; FDR + ERROR log + transition | | AC-4 | `test_ac4_in_call_budget_exhausted_yields_partial_with_hint` | Sleeps[2.0,4.0,8.0]; retry_count=3; next_retry_at_s set | | AC-5 | `test_ac5_backoff_cap_honoured_at_high_attempt_number` | Cap at 10s, never 64s | | AC-6 | `test_ac10_voting_status_has_documented_states_only` (in c6 conformance) | `UPLOAD_GIVEUP` present | | AC-7 | (deferred — Postgres+Docker gated) | SQL impl reviewed against spec verbatim | | AC-8 | (deferred — Postgres+Docker gated) | Migration reviewed; head assertion updated | | AC-9 | `test_ac9_idempotent_retry_decorator_satisfies_uploader_protocol` | `isinstance(decorator, TileUploader)` | | AC-10 | `test_ac10_factory_returns_decorated_uploader_by_default` + `test_ac10_factory_bypasses_decorator_when_flag_set` | Both branches | | AC-11 | `test_ac11_enumerate_pending_passes_through` + `test_ac11_confirm_flight_state_passes_through` | Both methods delegate | | AC-12 | `test_ac12_flight_state_not_on_ground_propagates_without_retry` + `test_ac12_satellite_provider_error_propagates_without_retry` | Re-raised; zero sleep | | AC-13 | (implicit — relies on c6 SQL `pending_uploads` filter, exercised in AZ-319 batch suite) | Tile filter SQL reviewed | | NFR-perf-overhead | `test_nfr_overhead_under_5ms_on_success_first_attempt` | Generous bound (50ms dev-host) catches O(n²) regressions | The deferred tests (AC-7, AC-8) require Postgres + Alembic apply and are only exercised in CI's Docker-compose phase. The schema test under `tests/unit/c6_tile_cache/test_postgres_schema.py` is docker-gated (skipped on this dev host); a follow-up tweak there to assert the new column will land when the Docker suite is re-run against 0003 (see Findings F4 + F5). ## Phase 4 — Lints `ReadLints` clean across all touched files. ## Phase 5 — Test results `pytest tests/unit/c11_tile_manager tests/unit/c6_tile_cache tests/unit/test_az272_fdr_record_schema.py -q --deselect tests/unit/c11_tile_manager/test_signing_key.py::test_nfr_perf_sign_microbench_p99_under_one_ms`: - **238 passed, 57 skipped, 1 deselected** - Skips are Docker-gated (Postgres) — none AZ-320 related `pytest tests/unit -q --deselect tests/unit/c11_tile_manager/test_signing_key.py::test_nfr_perf_sign_microbench_p99_under_one_ms`: - **1429 passed, 80 skipped, 3 failed** — all 3 failures are pre-existing perf microbenches unrelated to AZ-320 scope: - `tests/unit/c10_provisioning/test_descriptor_batcher.py::test_nfr_perf_overhead_below_5_percent` - `tests/unit/c8_fc_adapter/test_az392_covariance_projector.py::test_nfr_perf_projector_under_100us_per_call` - (signing-key NFR was deselected; same failure mode flagged in Batch 40's sweep) ## Findings ### F1 — Recurring Clock-injection deviation: CLOSED for C11 (Informational) **Severity**: Informational **Status**: closed by this batch The cumulative review reports for batches 37-39 flagged that C11's sub-skills consistently injected a bare `Callable[[float], None]` sleep instead of the full `Clock` Protocol. AZ-320 needed both `monotonic_ns` (backoff arithmetic) AND `time_ns` (operator hint), so the decorator accepts the full `Clock` Protocol — matching AZ-307 / AZ-308 / project hygiene. No action; documented for the cumulative-batch readers. ### F2 — `_iso_now()` derives ts from datetime.now (Low) **Severity**: Low **Recommendation**: track in the existing project-wide `_iso_now` audit PBI The decorator's FDR records use `_iso_now()` (which calls `datetime.now(timezone.utc).strftime(...)`) for the `ts` field rather than deriving it from the injected `Clock.time_ns()`. This matches the existing pattern in `tile_uploader.py` (which does the same), so introducing a deviation in this batch alone would cause inconsistency across C11. The right fix is a project-wide sweep (also covers `signing_key.py` etc.) that the cumulative review window can carry. ### F3 — Spec wording: `pending` tile check vs. budget check (Low) **Severity**: Low **Recommendation**: add a one-line comment OR update the spec The spec § Outcome bullet 3 says: "If `retries_used < config.max_in_call_retries` AND there are still tiles with `voting_status == pending`, sleep + recurse". The implementation only checks the budget. The two are equivalent in practice — if no tiles are still `pending`, the inner uploader's next call queries `pending_uploads`, finds nothing, and returns `outcome=SUCCESS` — but the implementation does NOT explicitly query c6 to short-circuit before the sleep. This is an O(1) speedup at most (skips one `time.sleep`) and adds a c6 round-trip per retry, so the implementation choice is reasonable; a one-line code comment referencing the spec equivalence would close the gap. ### F4 — AC-7 + AC-8 not exercised on dev host (Low) **Severity**: Low **Recommendation**: defer to next Docker-compose CI run AC-7 (concurrent `increment_upload_attempts`) and AC-8 (migration applied to live DB) require Postgres + Alembic apply. The implementation follows the spec verbatim: - `increment_upload_attempts`: `UPDATE tiles SET upload_attempts = upload_attempts + 1 WHERE tile_id = $1 RETURNING upload_attempts` - Migration: `op.add_column(...)` + `CREATE … CHECK …`, with a symmetric `downgrade()` Both will be exercised by CI's Docker-compose phase. Code review of the SQL + migration is complete; runtime validation is the gating step. ### F5 — Postgres schema test still references AZ-304 head (Low) **Severity**: Low **Recommendation**: update in the same PR that lands the Docker-compose CI run for 0003 `tests/unit/c6_tile_cache/test_postgres_schema.py` and `tests/unit/c6_tile_cache/fixtures/c6_postgres_schema_v2.sql` still reference AZ-304's head `0002_c6_tile_identity_and_lru`. These tests are docker-gated and skipped on this dev host, so they did NOT fail in the Batch 41 sweep. They will fail when the Docker CI runs against 0003 — at which point the fixture file should be extended (or replaced with `c6_postgres_schema_v3.sql`) to include the new column + widened constraint, and `_AZ304_REV` constants should be supplemented with `_AZ320_REV`. This is intentionally NOT done in this batch (the dev sweep can't verify the fix), but documented here as a known follow-up. ## Verdict **PASS_WITH_WARNINGS**. Five Low/Informational findings, none blocking; F3 is the only one that implies a code change (a single comment), and it is genuinely optional.