C11 Idempotent Retry — In-Call Retry Loop on Partial-Success Batches
Task: AZ-320_c11_idempotent_retry
Name: C11 Idempotent Retry Decorator
Description: Implement IdempotentRetryTileUploader, a decorator that wraps the AZ-319 TileUploader Protocol impl and adds bounded in-call retry on partial-success batches. After the underlying uploader returns outcome=partial, the decorator re-queries C6's pending_uploads (tiles that were already acknowledged have been marked via mark_uploaded, so the second pass naturally targets only the unacknowledged subset), waits an exponential-backoff delay, and re-invokes the underlying upload. The loop caps at config.c11.max_in_call_retries (default 3); on budget exhaustion, the final report's outcome stays partial and next_retry_at_s carries an operator hint for when to retry later. A per-tile rejection counter in C6 metadata (upload_attempts) bounds the per-tile retry budget: after config.c11.max_per_tile_attempts (default 5), the tile is moved to voting_status = upload_giveup (a new enum value added by this task) and surfaced via FDR for human review.
Complexity: 3 points
Dependencies: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf, AZ-303_c6_storage_interfaces, AZ-319_c11_tile_uploader
Component: c11_tilemanager (epic AZ-251 / E-C11)
Tracker: AZ-320
Epic: AZ-251 (E-C11)
Document Dependencies
- _docs/02_document/contracts/c11_tilemanager/tile_uploader.md: the underlying Protocol this decorator wraps; the decorator itself implements the same Protocol (drop-in replacement).
- _docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md: consumed methods are pending_uploads, mark_uploaded, update_voting_status. This task adds an upload_attempts integer field and an upload_giveup value to VotingStatus, a contract change that bumps tile_metadata_store.md to v1.1.0 (non-breaking minor).
- _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md: kind="c11.upload.giveup" envelope.
- _docs/02_document/components/12_c11_tilemanager/tests.md: C11-IT-05 test scenario.
Problem
Without bounded in-call retry:
- C11-IT-05 ("idempotent uploads on retry: re-running upload_pending after a partial-success batch only POSTs the tiles that weren't acknowledged before") relies on the operator manually re-invoking upload_pending. Operators tolerate one re-invocation but resent doing it 3-4 times after transient satellite-provider flakiness.
- A single tile that ALWAYS fails (e.g., truncated tile_blob in C6 fails ingest validation forever) becomes a poison pill: every retry attempt re-uploads it AND every other unacked tile, wasting bandwidth and signing cycles. Without a per-tile budget, the operator cannot distinguish transient failures from terminal ones.
- The next_retry_at_s field of UploadBatchReport (per AZ-319 contract) has no producer: without backoff calculation, the field is always None and the operator gets no hint on retry timing.
- The parent suite's voting layer assumes uploaded tiles are eventually-consistent; an unbounded retry loop with no per-tile state would create lockstep retry storms.
This task delivers the retry decorator. It changes NO underlying logic in AZ-319; it composes.
Outcome
- An IdempotentRetryTileUploader class at src/gps_denied_onboard/components/c11_tilemanager/idempotent_retry.py:
  - Implements the TileUploader Protocol (drop-in for HttpTileUploader).
  - Constructor: __init__(self, *, inner: TileUploader, tile_metadata_store: TileMetadataStore, fdr_client: FdrClient, logger: Logger, clock: Clock, config: C11RetryConfig). C11RetryConfig is a frozen dataclass with max_in_call_retries: int = 3, max_per_tile_attempts: int = 5, backoff_base_s: float = 2.0, backoff_cap_s: float = 60.0.
  - Implements the upload_pending_tiles(request) flow:
    - Calls inner.upload_pending_tiles(request) once.
    - If the inner returns outcome in (success, failure) → return as-is.
    - If outcome == partial:
      - For each PerTileStatus.status == rejected, increments the tile's upload_attempts in C6 via a new tile_metadata_store.increment_upload_attempts(tile_id) method.
      - For each tile whose upload_attempts >= config.max_per_tile_attempts, calls tile_metadata_store.update_voting_status(tile_id, VotingStatus.UPLOAD_GIVEUP); emits FDR kind="c11.upload.giveup" with {tile_id, attempts, last_rejection_reason}; emits ERROR log.
      - If retries_used < config.max_in_call_retries AND there are still tiles with voting_status == pending:
        - Sleeps min(config.backoff_base_s ** (retries_used + 1), config.backoff_cap_s) seconds via injected Clock.sleep. The exponent is the 1-based retry number, so the first retry waits 2.0 s, as AC-2 and AC-4 require.
        - Recurses with retries_used += 1 (via internal helper, NOT actual recursion; a bounded loop).
      - Else (budget exhausted):
        - Aggregates the final UploadBatchReport: outcome = partial; retry_count = retries_used; next_retry_at_s = clock.now() + config.backoff_cap_s (operator hint).
        - Returns the aggregated report.
  - enumerate_pending_tiles(flight_id) and confirm_flight_state() pass through to the inner unchanged.
- A new VotingStatus.UPLOAD_GIVEUP enum value is added to AZ-303's VotingStatus (in C6's _types.py); this is a non-breaking minor bump of tile_metadata_store.md to v1.1.0. The producer (AZ-303) stays in v1, but C6's contract file's Change Log is appended by this task with a note pointing to the bump.
- A new tile_metadata_store.increment_upload_attempts(tile_id) -> int method is added to AZ-303's TileMetadataStore Protocol; returns the new attempt count post-increment. This is a Protocol surface addition (minor bump). The implementation lives in AZ-305's PostgresFilesystemStore. This task adds:
  - The Protocol method declaration in c6_tile_cache/interface.py.
  - The impl in c6_tile_cache/postgres_filesystem_store.py (a single SQL UPDATE ... SET upload_attempts = upload_attempts + 1 WHERE tile_id = $1 RETURNING upload_attempts).
  - The Postgres column upload_attempts INTEGER NOT NULL DEFAULT 0 via a NEW alembic migration _alembic/0002_upload_attempts.sql (NOT modifying AZ-304's 0001 migration; per coderule.mdc, migrations are append-only).
- The composition root wraps HttpTileUploader with IdempotentRetryTileUploader by default. A config.c11.disable_retry_decorator: bool = false lets operators bypass the decorator for debugging.
- INFO log on session start with retry config; INFO log per retry attempt with attempt_number, sleep_s, remaining_pending_count; ERROR log on per-tile giveup; FDR kind="c11.upload.giveup" per tile.
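The upload_pending_tiles flow above can be sketched as a bounded loop. This is a minimal illustration under stated assumptions, not the shipped decorator: Report, FakeClock, FakeStore, upload_with_retry and the on_giveup callback are hypothetical stand-ins for UploadBatchReport, Clock, the C6 store, the decorator method and the FDR/ERROR emission. The inner uploader is reduced to a zero-argument callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class C11RetryConfig:
    max_in_call_retries: int = 3
    max_per_tile_attempts: int = 5
    backoff_base_s: float = 2.0
    backoff_cap_s: float = 60.0


@dataclass
class Report:  # hypothetical stand-in for UploadBatchReport
    outcome: str  # "success" | "partial" | "failure"
    rejected: List[str] = field(default_factory=list)
    retry_count: int = 0
    next_retry_at_s: Optional[float] = None


class FakeClock:  # hypothetical test double for the injected Clock
    def __init__(self) -> None:
        self.t = 1000.0
        self.sleeps: List[float] = []

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.sleeps.append(seconds)
        self.t += seconds


class FakeStore:  # hypothetical in-memory stand-in for the C6 store
    def __init__(self, tile_ids: List[str]) -> None:
        self.attempts: Dict[str, int] = {t: 0 for t in tile_ids}
        self.status: Dict[str, str] = {t: "pending" for t in tile_ids}

    def increment_upload_attempts(self, tile_id: str) -> int:
        self.attempts[tile_id] += 1
        return self.attempts[tile_id]

    def update_voting_status(self, tile_id: str, status: str) -> None:
        self.status[tile_id] = status

    def pending_uploads(self) -> List[str]:
        return [t for t, s in self.status.items() if s == "pending"]


def upload_with_retry(inner: Callable[[], Report], store, clock, cfg, on_giveup):
    """Bounded loop (no recursion) around a scripted inner uploader."""
    retries_used = 0
    while True:
        report = inner()
        report.retry_count = retries_used
        if report.outcome != "partial":
            return report  # success or failure: nothing to retry
        for tile_id in report.rejected:
            attempts = store.increment_upload_attempts(tile_id)
            if attempts >= cfg.max_per_tile_attempts:
                store.update_voting_status(tile_id, "upload_giveup")
                on_giveup(tile_id, attempts)
        if retries_used >= cfg.max_in_call_retries or not store.pending_uploads():
            # Budget exhausted or nothing left to retry: leave an operator hint.
            report.next_retry_at_s = clock.now() + cfg.backoff_cap_s
            return report
        retries_used += 1
        clock.sleep(min(cfg.backoff_base_s ** retries_used, cfg.backoff_cap_s))


clock, store = FakeClock(), FakeStore(["t1"])
final = upload_with_retry(
    lambda: Report("partial", ["t1"]), store, clock, C11RetryConfig(), lambda *_: None
)
print(clock.sleeps, final.outcome, final.retry_count, final.next_retry_at_s)
```

With an inner that always rejects the same tile, the run reproduces the AC-4 shape: three sleeps, four inner calls, and a final partial report carrying the hint.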
Scope
Included
- IdempotentRetryTileUploader decorator class.
- C11RetryConfig frozen dataclass.
- VotingStatus.UPLOAD_GIVEUP enum value addition (in C6's _types.py).
- tile_metadata_store.increment_upload_attempts(tile_id) -> int Protocol method addition + AZ-305 SQL impl.
- _alembic/0002_upload_attempts.sql migration: adds upload_attempts INTEGER NOT NULL DEFAULT 0 column to the tiles table.
- Composition-root wiring (decorate HttpTileUploader by default; config.c11.disable_retry_decorator lets operators opt out).
- Bumping tile_metadata_store.md to v1.1.0 with a Change Log entry.
- INFO/ERROR logs and FDR c11.upload.giveup emission.
- Conformance test: isinstance(IdempotentRetryTileUploader(...), TileUploader).
Excluded
- The underlying HttpTileUploader impl: owned by AZ-319.
- The decision rule for what counts as a transient vs. terminal rejection: the decorator treats EVERY rejection as transient until the per-tile attempt budget is hit; the operator may manually move UPLOAD_GIVEUP tiles back to pending after investigation (out-of-band SQL UPDATE; no API surface).
- A separate background-retry daemon: the retry happens within upload_pending_tiles; the operator decides when to invoke it.
- Cross-process retry coordination: the C12 lockfile already prevents concurrent C11 invocations.
- Surfacing UPLOAD_GIVEUP in the operator-tooling CLI: owned by E-C12.
- Auto-promotion of UPLOAD_GIVEUP back to pending after manual fixes: operator concern; out of scope.
Acceptance Criteria
AC-1: Success on first attempt — no retry
Given the inner uploader returns outcome = success on the first call
When upload_pending_tiles(request) is called
Then the decorator returns immediately; ZERO calls to Clock.sleep; ZERO calls to increment_upload_attempts; report passes through unchanged
AC-2: Partial-success with retry budget available
Given inner returns outcome=partial with 3 of 10 tiles rejected (per_tile_status), and max_in_call_retries=3
When the decorator processes the partial
Then increment_upload_attempts is called 3 times (one per rejected tile); Clock.sleep(2.0) is called once; inner is re-invoked; if the second attempt is success, the final aggregated report shows outcome = success and retry_count = 1
AC-3: Per-tile budget exhausted moves tile to UPLOAD_GIVEUP
Given a tile whose upload_attempts reaches max_per_tile_attempts=5
When the decorator increments the counter
Then update_voting_status(tile_id, UPLOAD_GIVEUP) is called; ONE FDR kind="c11.upload.giveup" is emitted with {tile_id, attempts=5, last_rejection_reason}; ONE ERROR log; the tile is NOT re-uploaded in subsequent retries (since pending_uploads excludes UPLOAD_GIVEUP)
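A minimal sketch of the AC-3 bookkeeping for one rejected tile, assuming an in-memory store and a plain list in place of the FDR client. InMemoryStore, record_rejection and the logger name are hypothetical; only the record kind and payload keys come from this spec.

```python
import logging
from typing import Dict, List

UPLOAD_GIVEUP = "upload_giveup"
logger = logging.getLogger("c11.retry")  # hypothetical logger name


class InMemoryStore:
    """Hypothetical stand-in for the C6 metadata store."""

    def __init__(self) -> None:
        self.attempts: Dict[str, int] = {}
        self.status: Dict[str, str] = {}

    def increment_upload_attempts(self, tile_id: str) -> int:
        self.attempts[tile_id] = self.attempts.get(tile_id, 0) + 1
        return self.attempts[tile_id]

    def update_voting_status(self, tile_id: str, status: str) -> None:
        self.status[tile_id] = status

    def pending_uploads(self) -> List[str]:
        return [t for t in self.attempts if self.status.get(t, "pending") == "pending"]


def record_rejection(store, fdr_records: List[dict], tile_id: str,
                     reason: str, max_per_tile_attempts: int = 5) -> bool:
    """Count one rejection; return True if the tile was moved to UPLOAD_GIVEUP."""
    attempts = store.increment_upload_attempts(tile_id)
    if attempts < max_per_tile_attempts:
        return False
    store.update_voting_status(tile_id, UPLOAD_GIVEUP)
    fdr_records.append({
        "kind": "c11.upload.giveup",
        "tile_id": tile_id,
        "attempts": attempts,
        "last_rejection_reason": reason,
    })
    logger.error("tile %s gave up after %d attempts: %s", tile_id, attempts, reason)
    return True


store, fdr = InMemoryStore(), []
outcomes = [record_rejection(store, fdr, "t9", "truncated tile_blob") for _ in range(5)]
print(outcomes, len(fdr), store.pending_uploads())
```

Because the tile leaves pending_uploads once it is marked, later retries never see it again, which is what keeps the FDR emission to a single record per tile.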
AC-4: In-call retry budget exhausted
Given inner consistently returns outcome=partial with the same rejected tile, and max_in_call_retries=3
When the decorator runs out of in-call retries
Then 3 retries are attempted (4 total inner calls including the first); Clock.sleep is called 3 times with backoffs 2.0, 4.0, 8.0; the final report has outcome=partial, retry_count=3, next_retry_at_s = clock.now() + backoff_cap_s
AC-5: Backoff cap honoured
Given max_in_call_retries=10 and backoff_cap_s=10
When the decorator computes the 6th retry delay
Then Clock.sleep(10.0) is called (capped at 10s, not 2^6 = 64s)
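The backoff schedule in AC-4 and AC-5 can be checked with a few lines of arithmetic. The helper name backoff_delay_s is hypothetical; the formula is the capped exponential from this spec, with the 1-based retry number as the exponent.

```python
def backoff_delay_s(attempt_number: int, base_s: float = 2.0, cap_s: float = 60.0) -> float:
    """Capped exponential backoff: min(base ** attempt_number, cap)."""
    return min(base_s ** attempt_number, cap_s)


# AC-4: three retries with the default 60 s cap.
default_schedule = [backoff_delay_s(n) for n in range(1, 4)]

# AC-5: six retries with the cap lowered to 10 s.
capped_schedule = [backoff_delay_s(n, cap_s=10.0) for n in range(1, 7)]

print(default_schedule)  # [2.0, 4.0, 8.0]
print(capped_schedule)   # [2.0, 4.0, 8.0, 10.0, 10.0, 10.0]
```

With a 10 s cap, the cap already takes effect at the 4th retry (2^4 = 16 s), so the 6th retry in AC-5 is well inside the capped region.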
AC-6: VotingStatus.UPLOAD_GIVEUP enum exposed
Given the AZ-303 VotingStatus enum (post-this-task)
When a consumer imports it
Then VotingStatus.UPLOAD_GIVEUP is present alongside PENDING, TRUSTED, REJECTED; the contract file's Change Log shows v1.1.0
AC-7: increment_upload_attempts returns new count
Given a tile with upload_attempts = 2
When increment_upload_attempts(tile_id) is called
Then the SQL row's upload_attempts is now 3; the method returns 3; concurrent invocations on different tiles produce no contention (per-row lock)
AC-8: Migration 0002 adds the column
Given a fresh DB at AZ-304's 0001 migration
When 0002 is applied
Then the tiles table has an upload_attempts INTEGER NOT NULL DEFAULT 0 column; existing rows have upload_attempts = 0; the migration is reversible (drops the column on rollback)
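AC-7 and AC-8 can be exercised together in a small sketch. It assumes an in-memory SQLite database as a stand-in for Postgres and a reduced, hypothetical tiles schema. The real implementation is specified as a single UPDATE ... RETURNING statement; the sketch uses UPDATE followed by SELECT inside one transaction, because older SQLite builds lack RETURNING. Rollback (dropping the column) is not shown.

```python
import sqlite3


def apply_upload_attempts_migration(conn: sqlite3.Connection) -> None:
    """Additive change: existing rows pick up the default value 0."""
    conn.execute(
        "ALTER TABLE tiles ADD COLUMN upload_attempts INTEGER NOT NULL DEFAULT 0"
    )


def increment_upload_attempts(conn: sqlite3.Connection, tile_id: str) -> int:
    """Increment the counter and return the new value."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE tiles SET upload_attempts = upload_attempts + 1 WHERE tile_id = ?",
            (tile_id,),
        )
        row = conn.execute(
            "SELECT upload_attempts FROM tiles WHERE tile_id = ?", (tile_id,)
        ).fetchone()
    if row is None:
        raise KeyError(tile_id)
    return row[0]


conn = sqlite3.connect(":memory:")
# Hypothetical pre-migration schema with one existing row.
conn.execute("CREATE TABLE tiles (tile_id TEXT PRIMARY KEY, voting_status TEXT)")
conn.execute("INSERT INTO tiles (tile_id, voting_status) VALUES ('t1', 'pending')")

apply_upload_attempts_migration(conn)
before = conn.execute(
    "SELECT upload_attempts FROM tiles WHERE tile_id = 't1'"
).fetchone()[0]
after = [increment_upload_attempts(conn, "t1") for _ in range(3)]
print(before, after)
```

The per-row locking behaviour named in AC-7 is a Postgres property and cannot be demonstrated with SQLite; the sketch only covers the default value and the returned count.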
AC-9: Decorator is a drop-in for the Protocol
Given an IdempotentRetryTileUploader instance
When isinstance(impl, TileUploader) is checked under runtime_checkable
Then the result is True; consumers that depend on the Protocol see no shape difference
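A minimal sketch of the AC-9 conformance check. The Protocol below is a hypothetical three-method reduction of TileUploader, and the decorator is reduced to pure delegation. Note that isinstance against a runtime_checkable Protocol only checks that the members exist, not that their signatures match.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TileUploader(Protocol):  # hypothetical reduced surface
    def upload_pending_tiles(self, request: Any) -> Any: ...
    def enumerate_pending_tiles(self, flight_id: str) -> Any: ...
    def confirm_flight_state(self) -> Any: ...


class IdempotentRetryTileUploader:
    """Structural implementation: no inheritance from the Protocol is needed."""

    def __init__(self, *, inner: TileUploader) -> None:
        self._inner = inner

    def upload_pending_tiles(self, request: Any) -> Any:
        return self._inner.upload_pending_tiles(request)  # retry logic omitted

    def enumerate_pending_tiles(self, flight_id: str) -> Any:
        return self._inner.enumerate_pending_tiles(flight_id)

    def confirm_flight_state(self) -> Any:
        return self._inner.confirm_flight_state()


class IncompleteUploader:
    """Missing confirm_flight_state, so it does not satisfy the Protocol."""

    def upload_pending_tiles(self, request: Any) -> Any:
        return None

    def enumerate_pending_tiles(self, flight_id: str) -> Any:
        return []


decorated = IdempotentRetryTileUploader(inner=IncompleteUploader())  # type: ignore[arg-type]
print(isinstance(decorated, TileUploader))             # True
print(isinstance(IncompleteUploader(), TileUploader))  # False
```

Since the runtime check is shape-only, a static type checker is still needed to catch signature drift between the decorator and the Protocol.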
AC-10: disable_retry_decorator config bypass
Given config.c11.disable_retry_decorator = true
When the composition root constructs the uploader
Then build_tile_uploader(config) returns the bare HttpTileUploader (no decorator); a debug INFO log records the bypass
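The bypass in AC-10 could look roughly like the factory below. This is a sketch under assumptions: C11Config is reduced to the single flag, both uploader classes are empty stand-ins, and the logger parameter and log message are hypothetical. The real constructor takes more keyword arguments than shown.

```python
import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class C11Config:  # hypothetical subset of the real config
    disable_retry_decorator: bool = False


class HttpTileUploader:  # stand-in for the AZ-319 implementation
    pass


class IdempotentRetryTileUploader:  # stand-in for the decorator
    def __init__(self, *, inner: HttpTileUploader) -> None:
        self.inner = inner


def build_tile_uploader(config: C11Config,
                        logger: logging.Logger = logging.getLogger("c11")):
    """Wrap the bare uploader by default; return it undecorated when bypassed."""
    bare = HttpTileUploader()
    if config.disable_retry_decorator:
        logger.info("retry decorator bypassed (disable_retry_decorator=true)")
        return bare
    return IdempotentRetryTileUploader(inner=bare)


default_uploader = build_tile_uploader(C11Config())
bypassed_uploader = build_tile_uploader(C11Config(disable_retry_decorator=True))
print(type(default_uploader).__name__, type(bypassed_uploader).__name__)
```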
AC-11: Pass-through methods
Given the decorator
When enumerate_pending_tiles(flight_id) and confirm_flight_state() are called
Then both delegate to inner directly with no added logic
AC-12: Inner exception propagates without retry
Given inner raises FlightStateNotOnGroundError or SatelliteProviderError
When the decorator catches the exception
Then it re-raises immediately; no retry is attempted (these are not partial-success cases); ZERO Clock.sleep calls
AC-13: Idempotent across re-invocations (the C11-IT-05 scenario)
Given a 50-tile batch where 30 succeed and 20 are rejected on first call (no in-call retry to avoid mixing); operator re-invokes after 5 minutes
When the second call runs
Then pending_uploads returns only the 20 tiles (the 30 are already voting_status = uploaded); inner.upload_pending_tiles is called with the request; only those 20 are POSTed; the 30 are NOT re-sent
Non-Functional Requirements
Performance
- Decorator overhead per upload_pending_tiles call ≤ 5 ms (plus Clock.sleep time, which is intentional).
- increment_upload_attempts SQL call ≤ 5 ms p99 against the local Postgres.
Compatibility
- The new migration is append-only (NOT a modification of AZ-304's 0001 migration).
- The new VotingStatus.UPLOAD_GIVEUP value is additive (non-breaking).
- increment_upload_attempts is a Protocol method addition; existing AZ-303 conformance tests pass (the method has a default impl that raises NotImplementedError if a future implementation forgets it, but the AZ-305 impl provides the SQL version).
Reliability
- The retry loop is bounded by BOTH max_in_call_retries AND max_per_tile_attempts; neither alone can produce unbounded behaviour.
- The decorator does NOT swallow exceptions from inner; only outcome=partial results are eligible for retry.
- The injected Clock.sleep makes retry timing deterministic in tests.
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Inner success | Pass-through; zero retries |
| AC-2 | Partial → retry → success | One Clock.sleep(2.0); retry_count=1; final outcome=success |
| AC-3 | Per-tile attempts hits 5 | update_voting_status called; FDR + ERROR log emitted |
| AC-4 | Persistent partial across 4 attempts | Clock.sleep(2.0), (4.0), (8.0); retry_count=3; final partial |
| AC-5 | Cap=10 with high attempt number | Clock.sleep(10.0) not Clock.sleep(64.0) |
| AC-6 | Import VotingStatus.UPLOAD_GIVEUP | Present; tile_metadata_store.md v1.1.0 |
| AC-7 | Concurrent increment_upload_attempts on different tiles | No deadlock; correct counts |
| AC-8 | Apply 0002 migration | Column added; default 0; rollback drops |
| AC-9 | isinstance check | True |
| AC-10 | Config bypass | Bare impl; debug log |
| AC-11 | Pass-through methods | Delegated unchanged |
| AC-12 | Inner raises | Re-raised; zero retries |
| AC-13 | Two-call scenario across operator re-invocations | First call: 30 acked / 20 rejected; second call: only 20 POSTed |
| NFR-perf-overhead | Microbench decorator with success-on-first | ≤ 5 ms overhead |
Constraints
- The decorator MUST be a drop-in for TileUploader; the composition root selects via config.c11.disable_retry_decorator only.
- The retry budget is per-call (in-call) and per-tile (across calls); neither budget alone fully bounds; both are required.
- increment_upload_attempts is the ONLY method that mutates upload_attempts; consumers do NOT directly UPDATE the column. This is a contract invariant; code-review treats direct UPDATEs as Architecture finding (High).
- The UPLOAD_GIVEUP voting status is a HUMAN-decision boundary: automated promotion back to pending is forbidden in this task. An out-of-band SQL UPDATE by the operator is the documented recovery path.
- The migration 0002 is APPEND-ONLY relative to 0001; it does NOT alter existing column types.
- This task introduces no new third-party dependencies.
Risks & Mitigation
Risk 1: AZ-303 contract bump cascades to other consumers
- Risk: Adding increment_upload_attempts to the Protocol forces every existing C6 consumer (C2 VPR, C2.5 ReRanker, C3 CrossDomainMatcher, C10 CacheProvisioner, C12 OperatorTooling) to re-confirm conformance.
- Mitigation: The new method is OPTIONAL via a Protocol default impl that raises NotImplementedError; consumers that don't call it are unaffected. The conformance test verifies only that AZ-305's impl provides it.
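The "Protocol default impl" mitigation can be sketched as follows. The Protocol here is a hypothetical two-method reduction of TileMetadataStore, and LegacyStore / CountingStore are illustrative names. One caveat worth keeping in mind: a default method body on a Protocol is only inherited by classes that subclass the Protocol explicitly; purely structural implementers get nothing and would fail a runtime_checkable isinstance check if the method is missing.

```python
from typing import Dict, Protocol


class TileMetadataStore(Protocol):  # hypothetical reduced surface
    def mark_uploaded(self, tile_id: str) -> None: ...

    def increment_upload_attempts(self, tile_id: str) -> int:
        # Default for backwards compatibility: older stores keep working
        # until a caller actually needs the counter.
        raise NotImplementedError("increment_upload_attempts is not implemented")


class LegacyStore(TileMetadataStore):
    """Explicit subclass that predates the new method."""

    def mark_uploaded(self, tile_id: str) -> None:
        pass


class CountingStore(TileMetadataStore):
    """Explicit subclass that overrides the default."""

    def __init__(self) -> None:
        self._attempts: Dict[str, int] = {}

    def mark_uploaded(self, tile_id: str) -> None:
        pass

    def increment_upload_attempts(self, tile_id: str) -> int:
        self._attempts[tile_id] = self._attempts.get(tile_id, 0) + 1
        return self._attempts[tile_id]


counting = CountingStore()
print(counting.increment_upload_attempts("t1"), counting.increment_upload_attempts("t1"))

try:
    LegacyStore().increment_upload_attempts("t1")
except NotImplementedError as exc:
    print("legacy store:", exc)
```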
Risk 2: Backoff cap interacts badly with operator workflows
- Risk: A 60-second cap means the operator may walk away during retries; the visible CLI hangs.
- Mitigation: The decorator emits an INFO log per retry attempt with attempt_number, sleep_s, remaining_pending_count; C12's CLI surfaces this so the operator sees progress. Cap is configurable.
Risk 3: UPLOAD_GIVEUP tiles accumulating without operator visibility
- Risk: A subtle data corruption in C6 causes 100% of tiles to hit UPLOAD_GIVEUP; the operator notices only when they manually inspect C6.
- Mitigation: Each UPLOAD_GIVEUP event emits FDR kind="c11.upload.giveup" AND ERROR log; C12's CLI summary surfaces the count post-upload-run. This task adds NO direct UI; C12's task list will include surfacing.
Risk 4: Clock.sleep blocking on KeyboardInterrupt
- Risk: A long backoff (60s) blocks the process; Ctrl+C aborts mid-sleep but might leave state inconsistent.
- Mitigation: The decorator uses the injected Clock which is the same singleton as AZ-307/AZ-308; KeyboardInterrupt propagates upward and AZ-319's try/finally still runs key_manager.end_session(); the decorator's own state is just the retry counter (in-memory; no on-disk side effects between retries).
Runtime Completeness
- Named capability: bounded in-call retry on partial-success uploads, per-tile retry budget with UPLOAD_GIVEUP terminal state, operator-friendly next_retry_at_s hint (description.md § 5, C11-IT-05).
- Production code that must exist: real IdempotentRetryTileUploader decorator, real increment_upload_attempts SQL, real migration 0002, real VotingStatus.UPLOAD_GIVEUP enum value, real composition-root wiring with the bypass flag.
- Allowed external stubs: tests MAY use a fake inner (mock TileUploader implementing the Protocol with scripted responses), fake Clock, fake tile_metadata_store (already provided by AZ-303 conformance fakes); production wiring uses real all the way down.
- Unacceptable substitutes: a recursive Python implementation of the retry loop (stack-explosion risk; bounded iteration is required); skipping the per-tile budget (lets one bad tile poison every retry); silently moving tiles to UPLOAD_GIVEUP without FDR (loses safety officer surface); modifying AZ-304's 0001 migration in place (breaks deployment idempotence; migrations are append-only).