Files
gps-denied-onboard/_docs/03_implementation/reviews/batch_41_review.md
T
Oleksandr Bezdieniezhnykh a06b107fc3 [AZ-320] Add C11 IdempotentRetryTileUploader decorator
Wraps HttpTileUploader (AZ-319) with two bounded retry budgets:

- In-call (per-batch) — re-invokes inner on PARTIAL outcome up to
  `max_in_call_retries` times with capped exponential backoff
  (`min(base ** attempt_number, cap)`). On exhaustion: surfaces an
  operator hint via `next_retry_at_s = now + backoff_cap_s`.
- Per-tile (cross-call) — atomically increments c6's
  `tiles.upload_attempts` counter for every rejection; once a tile
  hits `max_per_tile_attempts` it is forward-only transitioned to
  `voting_status = upload_giveup` (excluded from `pending_uploads`).
  Each transition emits FDR `kind="c11.upload.giveup"` plus an
  ERROR log.

C6 contract changes (AZ-303 v1.3.0):
- VotingStatus.UPLOAD_GIVEUP added (forward-only from PENDING/TRUSTED).
- TileMetadataStore.increment_upload_attempts(tile_id) -> int added
  with NotImplementedError default for backwards-compat.
- Migration 0003_c11_upload_attempts: additive column +
  widened ck_tiles_voting_status (preserves IS NULL clause).

C11 wiring:
- C11RetryConfig + disable_retry_decorator on C11Config.
- build_tile_uploader wraps in decorator by default; bypass flag
  returns the bare HttpTileUploader. New `clock` keyword.

Cross-component isolation honoured (AZ-507): the decorator declares
`_RetryMetadataStoreLike` Protocol cut over c6's TileMetadataStore
and references `UPLOAD_GIVEUP` via a local string constant — no c6
imports.

Tests: 13 decorator + 1 conformance + 2 factory bypass + AC-6 enum
update + alembic head bump + AZ-272 schema fixture. 238 passed across
c11/c6/fdr suites; pre-existing perf microbenches unrelated.

Code review: PASS_WITH_WARNINGS (5 Low/Informational findings,
docs-level or downstream-CI-blocked). See
_docs/03_implementation/reviews/batch_41_review.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-13 08:48:53 +03:00

13 KiB

Batch 41 — Code Review

Tasks: AZ-320 (C11 IdempotentRetryTileUploader) Cycle: 1 Reviewer: autodev Verdict: PASS_WITH_WARNINGS

Scope reviewed

Production code:

  • src/gps_denied_onboard/components/c6_tile_cache/_types.py (added VotingStatus.UPLOAD_GIVEUP)
  • src/gps_denied_onboard/components/c6_tile_cache/interface.py (added TileMetadataStore.increment_upload_attempts(tile_id) -> int with a NotImplementedError default impl)
  • src/gps_denied_onboard/components/c6_tile_cache/postgres_filesystem_store.py (added increment_upload_attempts SQL impl, extended _ALLOWED_VOTING_TRANSITIONS, tightened pending_uploads SQL)
  • db/migrations/versions/0003_c11_upload_attempts.py (additive: column + widened CHECK constraint, reversible)
  • src/gps_denied_onboard/components/c11_tile_manager/config.py (added C11RetryConfig frozen dataclass + disable_retry_decorator
    • nested retry: C11RetryConfig field)
  • src/gps_denied_onboard/components/c11_tile_manager/idempotent_retry.py (new — IdempotentRetryTileUploader, _RetryMetadataStoreLike)
  • src/gps_denied_onboard/components/c11_tile_manager/__init__.py (re-exports)
  • src/gps_denied_onboard/runtime_root/c11_factory.py (build_tile_uploader wraps in decorator by default; bypass via disable_retry_decorator=true; new clock keyword parameter)
  • src/gps_denied_onboard/fdr_client/records.py (registered c11.upload.giveup)

Contracts:

  • _docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md (v1.3.0 — added increment_upload_attempts, updated I-8)

Tests:

  • tests/unit/c11_tile_manager/test_idempotent_retry.py (new — 13 tests)
  • tests/unit/c11_tile_manager/test_protocol_conformance.py (added AC-9)
  • tests/unit/c6_tile_cache/test_protocol_conformance.py (extended AC-10 enum surface; added increment_upload_attempts to two metadata-store fakes)
  • tests/unit/test_ac5_alembic.py (updated head-revision assertion)
  • tests/unit/test_az272_fdr_record_schema.py (added mock payload)

Phase 1 — Architecture

AZ-507 cross-component rule

idempotent_retry.py does NOT import from components.c6_tile_cache. The two c6 write surfaces it consumes (increment_upload_attempts, update_voting_status) are reached via the locally-declared _RetryMetadataStoreLike Protocol cut. The UPLOAD_GIVEUP enum value is reached via a locally-scoped string constant (_VOTING_STATUS_UPLOAD_GIVEUP = "upload_giveup") — c6's update_voting_status impl coerces the string back into the enum via VotingStatus(status), so the decorator never imports the enum. This is the same pattern Batch 40's HttpTileDownloader uses for the freshness-label string surface.

The AZ-270 lint passes — composition root remains the only layer that may bind concrete c6 implementations.

Composition root wiring

build_tile_uploader now returns a TileUploader Protocol (was the concrete HttpTileUploader). Default path wraps the inner in IdempotentRetryTileUploader; the disable_retry_decorator=true config flag returns the bare inner with a single INFO log (kind="c11.upload.retry.decorator.bypassed"). The clock keyword defaults to a fresh WallClock() for production wiring; tests inject a fake clock to keep timing deterministic.

The new factory tests (test_idempotent_retry.py — test_ac10_*) exercise both branches by toggling the config flag.

Backwards-compat for TileMetadataStore Protocol

The new increment_upload_attempts method is added with a NotImplementedError default impl on the Protocol surface, so existing AZ-303 conformance tests + existing duck-typed fakes that don't yet implement it continue to satisfy isinstance checks (the runtime_checkable mechanism only checks for attribute presence — and a default impl IS an attribute). The two c6 conformance fakes that didn't have the method have been updated to declare it (raising NotImplementedError) so that runtime_checkable continues to behave consistently.

Migration discipline

0003_c11_upload_attempts.py is purely additive on top of 0002_c6_tile_identity_and_lru:

  • tiles.upload_attempts INTEGER NOT NULL DEFAULT 0 (existing rows get 0 by default; no data migration required)
  • ck_tiles_voting_status widened to admit 'upload_giveup' while preserving the voting_status IS NULL clause from 0001 (legacy rows on dev DBs that never set a voting status would otherwise fail the new CHECK)
  • downgrade() is symmetric — drops the column and restores the AZ-304 CHECK predicate exactly

Per the spec's "Unacceptable substitutes" clause and coderule.mdc, AZ-304's 0001 + 0002 migrations are unchanged.

Phase 2 — Code quality

Single Responsibility

IdempotentRetryTileUploader has a single responsibility: bound the retry behaviour around an inner TileUploader. Internal helpers are split cleanly:

  • _handle_rejected_tiles — fan-out increment + threshold check
  • _mark_giveup — single-tile transition + FDR + ERROR log
  • _backoff_for — pure exponent computation
  • _sleep — Clock-routed sleep (honours AZ-398)
  • _next_retry_at_s — operator hint derivation
  • _with_retry_count — pure dataclass.replace wrapper

No method handles two distinct concerns; every method name matches the work it does.

Error suppression

The decorator does NOT swallow exceptions. _handle_rejected_tiles explicitly re-raises any failure from increment_upload_attempts (the alternative would be silently retrying without budget enforcement — exactly the failure mode the spec calls out as "unbounded behaviour"). FDR enqueue failures inside _mark_giveup will propagate; this is consistent with tile_uploader.py's treatment of the same call.

Comments

Production comments are limited to non-obvious intent:

  • _VOTING_STATUS_UPLOAD_GIVEUP — explains why the constant is declared locally (AZ-507 boundary) rather than imported.
  • The retries_used += 1 move-up — explains the off-by-one fix and references the spec's worked example.
  • _RetryMetadataStoreLike docstring — documents the Any-typed status parameter rationale.

No narration comments; the AAA test pattern is honoured throughout the test file (with # Arrange / # Act / # Assert headers, omitting sections that are empty).

Phase 3 — Test coverage vs. spec

AC Test Coverage
AC-1 test_ac1_success_on_first_attempt_zero_side_effects Pass-through; zero retries
AC-2 test_ac2_partial_then_success_increments_attempts_and_sleeps_once Sleep[2.0]; retry_count=1; 3 increments
AC-3 test_ac3_per_tile_budget_exhausted_moves_to_giveup Threshold trips; FDR + ERROR log + transition
AC-4 test_ac4_in_call_budget_exhausted_yields_partial_with_hint Sleeps[2.0,4.0,8.0]; retry_count=3; next_retry_at_s set
AC-5 test_ac5_backoff_cap_honoured_at_high_attempt_number Cap at 10s, never 64s
AC-6 test_ac10_voting_status_has_documented_states_only (in c6 conformance) UPLOAD_GIVEUP present
AC-7 (deferred — Postgres+Docker gated) SQL impl reviewed against spec verbatim
AC-8 (deferred — Postgres+Docker gated) Migration reviewed; head assertion updated
AC-9 test_ac9_idempotent_retry_decorator_satisfies_uploader_protocol isinstance(decorator, TileUploader)
AC-10 test_ac10_factory_returns_decorated_uploader_by_default + test_ac10_factory_bypasses_decorator_when_flag_set Both branches
AC-11 test_ac11_enumerate_pending_passes_through + test_ac11_confirm_flight_state_passes_through Both methods delegate
AC-12 test_ac12_flight_state_not_on_ground_propagates_without_retry + test_ac12_satellite_provider_error_propagates_without_retry Re-raised; zero sleep
AC-13 (implicit — relies on c6 SQL pending_uploads filter, exercised in AZ-319 batch suite) Tile filter SQL reviewed
NFR-perf-overhead test_nfr_overhead_under_5ms_on_success_first_attempt Generous bound (50ms dev-host) catches O(n²) regressions

The deferred tests (AC-7, AC-8) require Postgres + Alembic apply and are only exercised in CI's Docker-compose phase. The schema test under tests/unit/c6_tile_cache/test_postgres_schema.py is docker-gated (skipped on this dev host); a follow-up tweak there to assert the new column will land when the Docker suite is re-run against 0003 (see Findings F4 + F5).

Phase 4 — Lints

ReadLints clean across all touched files.

Phase 5 — Test results

pytest tests/unit/c11_tile_manager tests/unit/c6_tile_cache tests/unit/test_az272_fdr_record_schema.py -q --deselect tests/unit/c11_tile_manager/test_signing_key.py::test_nfr_perf_sign_microbench_p99_under_one_ms:

  • 238 passed, 57 skipped, 1 deselected
  • Skips are Docker-gated (Postgres) — none AZ-320 related

pytest tests/unit -q --deselect tests/unit/c11_tile_manager/test_signing_key.py::test_nfr_perf_sign_microbench_p99_under_one_ms:

  • 1429 passed, 80 skipped, 3 failed — all 3 failures are pre-existing perf microbenches unrelated to AZ-320 scope:
    • tests/unit/c10_provisioning/test_descriptor_batcher.py::test_nfr_perf_overhead_below_5_percent
    • tests/unit/c8_fc_adapter/test_az392_covariance_projector.py::test_nfr_perf_projector_under_100us_per_call
    • (signing-key NFR was deselected; same failure mode flagged in Batch 40's sweep)

Findings

F1 — Recurring Clock-injection deviation: CLOSED for C11 (Informational)

Severity: Informational Status: closed by this batch

The cumulative review reports for batches 37-39 flagged that C11's sub-skills consistently injected a bare Callable[[float], None] sleep instead of the full Clock Protocol. AZ-320 needed both monotonic_ns (backoff arithmetic) AND time_ns (operator hint), so the decorator accepts the full Clock Protocol — matching AZ-307 / AZ-308 / project hygiene. No action; documented for the cumulative-batch readers.

F2 — _iso_now() derives ts from datetime.now (Low)

Severity: Low Recommendation: track in the existing project-wide _iso_now audit PBI

The decorator's FDR records use _iso_now() (which calls datetime.now(timezone.utc).strftime(...)) for the ts field rather than deriving it from the injected Clock.time_ns(). This matches the existing pattern in tile_uploader.py (which does the same), so introducing a deviation in this batch alone would cause inconsistency across C11. The right fix is a project-wide sweep (also covers signing_key.py etc.) that the cumulative review window can carry.

F3 — Spec wording: pending tile check vs. budget check (Low)

Severity: Low Recommendation: add a one-line comment OR update the spec

The spec § Outcome bullet 3 says: "If retries_used < config.max_in_call_retries AND there are still tiles with voting_status == pending, sleep + recurse". The implementation only checks the budget. The two are equivalent in practice — if no tiles are still pending, the inner uploader's next call queries pending_uploads, finds nothing, and returns outcome=SUCCESS — but the implementation does NOT explicitly query c6 to short-circuit before the sleep. This is an O(1) speedup at most (skips one time.sleep) and adds a c6 round-trip per retry, so the implementation choice is reasonable; a one-line code comment referencing the spec equivalence would close the gap.

F4 — AC-7 + AC-8 not exercised on dev host (Low)

Severity: Low Recommendation: defer to next Docker-compose CI run

AC-7 (concurrent increment_upload_attempts) and AC-8 (migration applied to live DB) require Postgres + Alembic apply. The implementation follows the spec verbatim:

  • increment_upload_attempts: UPDATE tiles SET upload_attempts = upload_attempts + 1 WHERE tile_id = $1 RETURNING upload_attempts
  • Migration: op.add_column(...) + CREATE … CHECK …, with a symmetric downgrade()

Both will be exercised by CI's Docker-compose phase. Code review of the SQL + migration is complete; runtime validation is the gating step.

F5 — Postgres schema test still references AZ-304 head (Low)

Severity: Low Recommendation: update in the same PR that lands the Docker-compose CI run for 0003

tests/unit/c6_tile_cache/test_postgres_schema.py and tests/unit/c6_tile_cache/fixtures/c6_postgres_schema_v2.sql still reference AZ-304's head 0002_c6_tile_identity_and_lru. These tests are docker-gated and skipped on this dev host, so they did NOT fail in the Batch 41 sweep. They will fail when the Docker CI runs against 0003 — at which point the fixture file should be extended (or replaced with c6_postgres_schema_v3.sql) to include the new column + widened constraint, and _AZ304_REV constants should be supplemented with _AZ320_REV.

This is intentionally NOT done in this batch (the dev sweep can't verify the fix), but documented here as a known follow-up.

Verdict

PASS_WITH_WARNINGS. Five Low/Informational findings, none blocking; F3 is the only one that implies a code change (a single comment), and it is genuinely optional.