Files
gps-denied-onboard/_docs/03_implementation/batch_41_cycle1_report.md
T
Oleksandr Bezdieniezhnykh a06b107fc3 [AZ-320] Add C11 IdempotentRetryTileUploader decorator
Wraps HttpTileUploader (AZ-319) with two bounded retry budgets:

- In-call (per-batch) — re-invokes inner on PARTIAL outcome up to
  `max_in_call_retries` times with capped exponential backoff
  (`min(base ** attempt_number, cap)`). On exhaustion: surfaces an
  operator hint via `next_retry_at_s = now + backoff_cap_s`.
- Per-tile (cross-call) — atomically increments c6's
  `tiles.upload_attempts` counter for every rejection; once a tile
  hits `max_per_tile_attempts` it is forward-only transitioned to
  `voting_status = upload_giveup` (excluded from `pending_uploads`).
  Each transition emits FDR `kind="c11.upload.giveup"` plus an
  ERROR log.

C6 contract changes (AZ-303 v1.3.0):
- VotingStatus.UPLOAD_GIVEUP added (forward-only from PENDING/TRUSTED).
- TileMetadataStore.increment_upload_attempts(tile_id) -> int added
  with NotImplementedError default for backwards-compat.
- Migration 0003_c11_upload_attempts: additive column +
  widened ck_tiles_voting_status (preserves IS NULL clause).

C11 wiring:
- C11RetryConfig + disable_retry_decorator on C11Config.
- build_tile_uploader wraps in decorator by default; bypass flag
  returns the bare HttpTileUploader. New `clock` keyword.

Cross-component isolation honoured (AZ-507): the decorator declares
`_RetryMetadataStoreLike` Protocol cut over c6's TileMetadataStore
and references `UPLOAD_GIVEUP` via a local string constant — no c6
imports.

Tests: 13 decorator + 1 conformance + 2 factory bypass + AC-6 enum
update + alembic head bump + AZ-272 schema fixture. 238 passed across
c11/c6/fdr suites; pre-existing perf microbenches unrelated.

Code review: PASS_WITH_WARNINGS (5 Low/Informational findings,
docs-level or downstream-CI-blocked). See
_docs/03_implementation/reviews/batch_41_review.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-13 08:48:53 +03:00

11 KiB
Raw Blame History

Batch 41 — Cycle 1 Report

Date: 2026-05-13 Batch: 41 (single-task batch — C11 idempotent retry decorator) Tasks:

  • AZ-320 (C11 IdempotentRetryTileUploader, 3pt)

Total complexity: 3pt Status: complete; pending transition to "In Testing".

Scope

Batch 41 lands the AZ-320 retry decorator that wraps the AZ-319 HttpTileUploader and gives the operator-side upload path two bounded retry budgets:

  1. In-call (per-batch) budget — re-invokes the inner uploader at most config.c11.retry.max_in_call_retries times when the inner returns outcome=PARTIAL. Backoff between rounds is min(base ** attempt_number, cap); the spec's worked example (max=3, base=2.0 → sleeps 2.0, 4.0, 8.0) drove the "attempt-number is 1-indexed" off-by-one fix in the loop body.
  2. Per-tile (cross-call) budget — for every rejection the inner surfaces, the decorator atomically increments c6's tiles.upload_attempts counter; once the counter hits config.c11.retry.max_per_tile_attempts the tile is forward-only transitioned to voting_status = upload_giveup. The c6 pending_uploads SQL excludes that status so subsequent operator re-runs naturally skip those tiles. Recovery is documented as an out-of-band SQL UPDATE (per the spec's "human decision boundary" constraint).

Each UPLOAD_GIVEUP transition emits one FDR record (kind="c11.upload.giveup") plus an ERROR log; budget exhaustion on the in-call side emits a WARN log and surfaces an operator hint via the existing UploadBatchReport.next_retry_at_s field (now + backoff_cap_s). Pass-through methods (enumerate_pending_tiles, confirm_flight_state) delegate to the inner unchanged so the decorator is a true TileUploader Protocol drop-in.

Architectural decisions

AZ-507 — consumer-side cuts for c6 (no enum imports either)

The decorator only needs two write surfaces on c6's TileMetadataStore: increment_upload_attempts and update_voting_status. A direct from gps_denied_onboard.components.c6_tile_cache import … would violate AZ-507 / trip the AZ-270 lint, so idempotent_retry.py declares a local _RetryMetadataStoreLike Protocol cut over those two methods and binds the concrete PostgresFilesystemStore only at the composition root.

The c6 VotingStatus.UPLOAD_GIVEUP enum value is reached via a locally-scoped _VOTING_STATUS_UPLOAD_GIVEUP = "upload_giveup" string constant. The update_voting_status impl coerces either a c6 enum or the bare string via VotingStatus(status), so the decorator never imports c6's enum. This matches the same pattern HttpTileDownloader uses for the freshness-label string surface (Batch 40, _TileWriterLike).

Forward-only voting transitions — list update

VotingStatus.UPLOAD_GIVEUP is added as a fourth enum value; PostgresFilesystemStore._ALLOWED_VOTING_TRANSITIONS was extended with (PENDING → UPLOAD_GIVEUP) and (TRUSTED → UPLOAD_GIVEUP). The contract file's Invariant I-8 was updated in lockstep (v1.3.0 Change Log entry). REJECTED → UPLOAD_GIVEUP is intentionally NOT permitted — once the parent suite has rejected a tile, the local retry budget is irrelevant.

Migration is append-only

Per coderule.mdc (migrations are append-only) and the spec's "Unacceptable substitutes" clause ("modifying AZ-304's 0001 migration in place"), the new 0003_c11_upload_attempts.py is a fresh additive migration:

  • Adds tiles.upload_attempts INTEGER NOT NULL DEFAULT 0.
  • Widens the ck_tiles_voting_status CHECK constraint to admit 'upload_giveup'. The widened predicate explicitly preserves voting_status IS NULL (the original 0001 migration permits NULL) — without this, legacy rows would fail the CHECK on re-creation.
  • Reversible: rollback drops both the column and the widened constraint, restoring the AZ-304 head exactly.

The Alembic head-revision assertion in tests/unit/test_ac5_alembic.py was updated from 0002_c6_tile_identity_and_lru to 0003_c11_upload_attempts in lockstep (the test docstring already calls out "Future migrations update this assertion in lockstep").

Clock injection (full Protocol, not just sleep)

This is the third batch in a row to touch the "Clock vs. sleep injection" deviation flagged in cumulative review batches 37-39 (F2). For AZ-320 the decorator needs BOTH monotonic_ns (backoff arithmetic) AND time_ns (the operator-facing next_retry_at_s hint), so it accepts the full Clock Protocol — matching the pattern AZ-307 / AZ-308 already use. This is the first C11 batch to honour the no-deviation path; documented in the batch review as F1 (informational, no action).

FDR ts derivation — datetime.now, not Clock

The decorator emits c11.upload.giveup records with a ts=datetime.now(timezone.utc).strftime(...) ISO string, matching the existing pattern in tile_uploader.py (_iso_now). Switching to Clock.time_ns() for ts derivation would break consistency across the C11 component and would require a project-wide audit of every _iso_now() call site. Documented as F2 (Low) for the follow-up sweep PBI.

Off-by-one in the backoff exponent — fix during the test pass

Initial implementation used base ** retries_used with retries_used starting at 0, yielding sleeps of 1.0, 2.0, 4.0 for max=3, base=2.0. The spec's worked example (AC-4) requires 2.0, 4.0, 8.0. Fixed by incrementing retries_used BEFORE computing the backoff, and renamed the helper parameter to attempt_number (1-indexed) with a clarifying docstring. Caught by the test pass — re-confirms the value of writing the AC-4 fixture verbatim from the spec rather than from the implementation.

Files touched

Production:

  • src/gps_denied_onboard/components/c6_tile_cache/_types.py (added VotingStatus.UPLOAD_GIVEUP + updated forward-transition docstring)
  • src/gps_denied_onboard/components/c6_tile_cache/interface.py (added TileMetadataStore.increment_upload_attempts(tile_id) -> int with a NotImplementedError default impl per the spec's Compatibility NFR)
  • src/gps_denied_onboard/components/c6_tile_cache/postgres_filesystem_store.py (added increment_upload_attempts SQL + extended _ALLOWED_VOTING_TRANSITIONS + tightened pending_uploads SQL to exclude voting_status='upload_giveup')
  • db/migrations/versions/0003_c11_upload_attempts.py (new — additive column + widened CHECK constraint)
  • src/gps_denied_onboard/components/c11_tile_manager/config.py (added C11RetryConfig frozen dataclass, disable_retry_decorator bypass flag, nested retry: C11RetryConfig field on C11Config)
  • src/gps_denied_onboard/components/c11_tile_manager/idempotent_retry.py (new — IdempotentRetryTileUploader, _RetryMetadataStoreLike, _iso_now)
  • src/gps_denied_onboard/components/c11_tile_manager/__init__.py (re-exports for C11RetryConfig, IdempotentRetryTileUploader)
  • src/gps_denied_onboard/runtime_root/c11_factory.py (build_tile_uploader now wraps HttpTileUploader in the decorator by default; disable_retry_decorator=true returns the bare uploader; new clock keyword parameter with WallClock default for production wiring; return type widened to the TileUploader Protocol)
  • src/gps_denied_onboard/fdr_client/records.py (registered c11.upload.giveup in KNOWN_PAYLOAD_KEYS)

Contracts:

  • _docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md (v1.3.0 — added increment_upload_attempts to method table, updated Invariant I-8 forward-transition list)

Tests:

  • tests/unit/c11_tile_manager/test_idempotent_retry.py (new — 13 tests: AC-1, AC-2, AC-3, AC-4, AC-5, AC-10 ×2, AC-11 ×2, AC-12 ×2, FAILURE pass-through, NFR overhead microbench)
  • tests/unit/c11_tile_manager/test_protocol_conformance.py (added AC-9 — isinstance(IdempotentRetryTileUploader, TileUploader))
  • tests/unit/c6_tile_cache/test_protocol_conformance.py (extended AC-10 enum-surface test for UPLOAD_GIVEUP; updated the two metadata-store fakes to include increment_upload_attempts)
  • tests/unit/test_ac5_alembic.py (updated head-revision assertion to 0003_c11_upload_attempts)
  • tests/unit/test_az272_fdr_record_schema.py (added mock payload for c11.upload.giveup)

Test results

pytest tests/unit -q --deselect tests/unit/c11_tile_manager/test_signing_key.py::test_nfr_perf_sign_microbench_p99_under_one_ms:

  • 1429 passed, 80 skipped, 1 deselected, 3 failed (all 3 failures are pre-existing perf microbenches unrelated to AZ-320: C10 batcher overhead, C8 covariance projector latency, and the C11 signing-key sign-p99 microbench that is the same flaky test the deselect targets). The deselected one is the signing-key bench; the C10 and C8 perf benches were also flaky on Batch 40's sweep (same dev-host noise).
  • +9 net tests vs. Batch 40's sweep (the 13 decorator tests + 1 conformance test + 2 factory bypass tests, minus the 7 fakes that were already counted under c6 conformance and now include the new increment_upload_attempts method).

pytest tests/unit/c11_tile_manager tests/unit/c6_tile_cache tests/unit/test_az272_fdr_record_schema.py:

  • 238 passed, 57 skipped (Postgres+Docker gates), 1 deselected. Zero failures across all in-scope unit suites.

ReadLints: clean across every touched file.

Code review verdict

PASS_WITH_WARNINGS — see _docs/03_implementation/reviews/batch_41_review.md. Findings:

  • F1 (Informational) — Clock injection deviation from prior batches is now CLOSED for C11 (decorator uses the full Clock Protocol). No action.
  • F2 (Low) — _iso_now() still pulls wall-clock directly via datetime.now; aligns with existing tile_uploader._iso_now but the project-wide hygiene PBI to derive ts from Clock remains open.
  • F3 (Low) — Spec says "If retries_used < max_in_call_retries AND there are still tiles with voting_status == pending"; the decorator only checks the budget. Equivalent in practice (the inner's next call queries pending_uploads and returns SUCCESS immediately if empty), but worth a one-line comment.
  • F4 (Low) — AC-7 (concurrent SQL increment) and AC-8 (migration applied to live DB) are gated behind Docker-compose and were not exercised in this dev sweep. The SQL implementation follows the spec verbatim (UPDATE … RETURNING …); Docker CI run will validate.
  • F5 (Low) — Postgres tests under c6_tile_cache/test_postgres_schema.py and c6_tile_cache/fixtures/c6_postgres_schema_v2.sql still reference the AZ-304 head and will need a follow-up tweak when the Docker-gated suite is run against 0003. No code change in this batch since those tests are skipped on the dev host.

No blocking findings; no code change required for batch close-out.

Cumulative review

Batch 41 is single-task; the next cumulative review window covers batches 40-42 and will land before Batch 43 starts. The recurring Clock-vs-sleep deviation flagged in cumulative reports for batches 37-39 is now CLOSED for C11 (this batch landed the full Clock injection); the project-wide audit-PBI for _iso_now / datetime.now callers remains open.