# C11 Idempotent Retry: In-Call Retry Loop on Partial-Success Batches

**Task**: AZ-320_c11_idempotent_retry
**Name**: C11 Idempotent Retry Decorator
**Description**: Implement `IdempotentRetryTileUploader`, a decorator that wraps the AZ-319 `TileUploader` Protocol impl and adds bounded in-call retry on partial-success batches. After the underlying uploader returns `outcome=partial`, the decorator re-queries C6's `pending_uploads` (already-acked tiles were `mark_uploaded`'d, so the second pass naturally targets only the unacked subset), waits an exponential-backoff delay, and re-invokes the underlying upload. The loop is capped at `config.c11.max_in_call_retries` (default 3); on budget exhaustion, the final report's `outcome` stays `partial` and `next_retry_at_s` carries an operator hint for when to retry later. A per-tile rejection counter in C6 metadata (`upload_attempts`) bounds the per-tile retry budget: after `config.c11.max_per_tile_attempts` (default 5), the tile is moved to `voting_status = upload_giveup` (a new enum value added by this task) and surfaced via FDR for human review.
**Complexity**: 3 points
**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf, AZ-303_c6_storage_interfaces, AZ-319_c11_tile_uploader
**Component**: c11_tilemanager (epic AZ-251 / E-C11)
**Tracker**: AZ-320
**Epic**: AZ-251 (E-C11)

### Document Dependencies

- `_docs/02_document/contracts/c11_tilemanager/tile_uploader.md`: the underlying Protocol this decorator wraps; the decorator itself implements the same Protocol (drop-in replacement).
- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md`: consumed: `pending_uploads`, `mark_uploaded`, `update_voting_status`. This task adds an `upload_attempts` integer field and an `upload_giveup` value to `VotingStatus`, a contract change that bumps `tile_metadata_store.md` to v1.1.0 (non-breaking minor).
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`: `kind="c11.upload.giveup"` envelope.
- `_docs/02_document/components/12_c11_tilemanager/tests.md`: C11-IT-05 test scenario.

## Problem

Without bounded in-call retry:

- C11-IT-05 ("idempotent uploads on retry: re-running `upload_pending` after a partial-success batch only POSTs the tiles that weren't acknowledged before") relies on the operator manually re-invoking `upload_pending`. Operators tolerate one re-invocation but resent doing it 3-4 times after transient `satellite-provider` flakiness.
- A single tile that ALWAYS fails (e.g., truncated tile_blob in C6 fails ingest validation forever) becomes a poison pill: every retry attempt re-uploads it AND every other unacked tile, wasting bandwidth and signing cycles. Without a per-tile budget, the operator cannot distinguish transient failures from terminal ones.
- The `next_retry_at_s` field of `UploadBatchReport` (per AZ-319 contract) has no producer: without backoff calculation, the field is always None and the operator gets no hint on retry timing.
- The parent suite's voting layer assumes uploaded tiles are eventually-consistent; an unbounded retry loop with no per-tile state would create lockstep retry storms.

This task delivers the retry decorator. It changes NO underlying logic in AZ-319; it composes.

## Outcome

- An `IdempotentRetryTileUploader` class at `src/gps_denied_onboard/components/c11_tilemanager/idempotent_retry.py`:
  - Implements the `TileUploader` Protocol (drop-in for `HttpTileUploader`).
  - Constructor: `__init__(self, *, inner: TileUploader, tile_metadata_store: TileMetadataStore, fdr_client: FdrClient, logger: Logger, clock: Clock, config: C11RetryConfig)`.
  - `C11RetryConfig` is a frozen dataclass with `max_in_call_retries: int = 3`, `max_per_tile_attempts: int = 5`, `backoff_base_s: float = 2.0`, `backoff_cap_s: float = 60.0`.
  - `upload_pending_tiles(request)` flow:
    1. Calls `inner.upload_pending_tiles(request)` once.
    2. If the inner returns `outcome in (success, failure)` → return as-is.
    3. If `outcome == partial`:
       - For each `PerTileStatus.status == rejected`, increments the tile's `upload_attempts` in C6 via a new `tile_metadata_store.increment_upload_attempts(tile_id)` method.
       - For each tile whose `upload_attempts >= config.max_per_tile_attempts`, calls `tile_metadata_store.update_voting_status(tile_id, VotingStatus.UPLOAD_GIVEUP)`; emits FDR `kind="c11.upload.giveup"` with `{tile_id, attempts, last_rejection_reason}`; emits ERROR log.
       - If `retries_used < config.max_in_call_retries` AND there are still tiles with `voting_status == pending`:
         - Sleeps `min(config.backoff_base_s ** (retries_used + 1), config.backoff_cap_s)` seconds via injected `Clock.sleep`, so the first retry waits 2.0 s with the defaults.
         - Repeats with `retries_used += 1` (via an internal helper running a bounded loop, NOT actual recursion).
       - Else (budget exhausted):
         - Aggregates the final `UploadBatchReport`: `outcome = partial`; `retry_count = retries_used`; `next_retry_at_s = clock.now() + config.backoff_cap_s` (operator hint).
         - Returns the aggregated report.
  - `enumerate_pending_tiles(flight_id)` and `confirm_flight_state()` pass through to the inner unchanged.
- A new `VotingStatus.UPLOAD_GIVEUP` enum value is added to AZ-303's `VotingStatus` (in C6's `_types.py`); this is a non-breaking minor bump of `tile_metadata_store.md` to v1.1.0. The producer (AZ-303) stays in v1, but C6's contract file's `Change Log` is appended by this task with a note pointing to the bump.
- A new `tile_metadata_store.increment_upload_attempts(tile_id) -> int` method is added to AZ-303's `TileMetadataStore` Protocol; returns the new attempt count post-increment. This is a Protocol surface addition (minor bump). The implementation lives in AZ-305's `PostgresFilesystemStore`. This task adds:
  - The Protocol method declaration in `c6_tile_cache/interface.py`.
  - The impl in `c6_tile_cache/postgres_filesystem_store.py` (a single SQL `UPDATE ... SET upload_attempts = upload_attempts + 1 WHERE tile_id = $1 RETURNING upload_attempts`).
  - The Postgres column `upload_attempts INTEGER NOT NULL DEFAULT 0` via a NEW alembic migration `_alembic/0002_upload_attempts.sql` (NOT modifying AZ-304's 0001 migration; per `coderule.mdc` migrations are append-only).
- The composition root wraps `HttpTileUploader` with `IdempotentRetryTileUploader` by default. A `config.c11.disable_retry_decorator: bool = false` lets operators bypass the decorator for debugging.
- INFO log on session start with retry config; INFO log per retry attempt with `attempt_number, sleep_s, remaining_pending_count`; ERROR log on per-tile giveup; FDR `kind="c11.upload.giveup"` per tile.

## Scope

### Included

- `IdempotentRetryTileUploader` decorator class.
- `C11RetryConfig` frozen dataclass.
- `VotingStatus.UPLOAD_GIVEUP` enum value addition (in C6's `_types.py`).
- `tile_metadata_store.increment_upload_attempts(tile_id) -> int` Protocol method addition + AZ-305 SQL impl.
- `_alembic/0002_upload_attempts.sql` migration: adds `upload_attempts INTEGER NOT NULL DEFAULT 0` column to the tiles table.
- Composition-root wiring (decorate `HttpTileUploader` by default; `config.c11.disable_retry_decorator` lets operators opt out).
- Bumping `tile_metadata_store.md` to v1.1.0 with a Change Log entry.
- INFO/ERROR logs and FDR `c11.upload.giveup` emission.
- Conformance test: `isinstance(IdempotentRetryTileUploader(...), TileUploader)`.

### Excluded

- The underlying `HttpTileUploader` impl: owned by AZ-319.
- The decision rule for what counts as a transient vs. terminal rejection: the decorator treats EVERY rejection as transient until the per-tile attempt budget is hit; the operator may manually move `UPLOAD_GIVEUP` tiles back to `pending` after investigation (out-of-band SQL UPDATE; no API surface).
- A separate background-retry daemon: the retry happens within `upload_pending_tiles`; the operator decides when to invoke it.
- Cross-process retry coordination: the C12 lockfile already prevents concurrent C11 invocations.
- Surfacing `UPLOAD_GIVEUP` in the operator-tooling CLI: owned by E-C12.
- Auto-promotion of `UPLOAD_GIVEUP` back to `pending` after manual fixes: operator concern; out of scope.

## Acceptance Criteria

**AC-1: Success on first attempt, no retry**
Given the inner uploader returns `outcome = success` on the first call
When `upload_pending_tiles(request)` is called
Then the decorator returns immediately; ZERO calls to `Clock.sleep`; ZERO calls to `increment_upload_attempts`; report passes through unchanged

**AC-2: Partial-success with retry budget available**
Given inner returns `outcome=partial` with 3 of 10 tiles rejected (per_tile_status), and `max_in_call_retries=3`
When the decorator processes the partial
Then `increment_upload_attempts` is called 3 times (one per rejected tile); `Clock.sleep(2.0)` is called once; inner is re-invoked; if the second attempt is `success`, the final aggregated report shows `outcome = success` and `retry_count = 1`

**AC-3: Per-tile budget exhausted moves tile to UPLOAD_GIVEUP**
Given a tile whose `upload_attempts` reaches `max_per_tile_attempts=5`
When the decorator increments the counter
Then `update_voting_status(tile_id, UPLOAD_GIVEUP)` is called; ONE FDR `kind="c11.upload.giveup"` is emitted with `{tile_id, attempts=5, last_rejection_reason}`; ONE ERROR log; the tile is NOT re-uploaded in subsequent retries (since `pending_uploads` excludes UPLOAD_GIVEUP)

**AC-4: In-call retry budget exhausted**
Given inner consistently returns `outcome=partial` with the same rejected tile, and `max_in_call_retries=3`
When the decorator runs out of in-call retries
Then 3 retries are attempted (4 total inner calls including the first); `Clock.sleep` is called 3 times with backoffs `2.0, 4.0, 8.0`; the final report has `outcome=partial`, `retry_count=3`, `next_retry_at_s = clock.now() + backoff_cap_s`

**AC-5: Backoff cap honoured**
Given `max_in_call_retries=10` and `backoff_cap_s=10`
When the decorator computes the 6th retry delay
Then `Clock.sleep(10.0)` is called (capped at 10s, not `2^6 = 64s`)

**AC-6: VotingStatus.UPLOAD_GIVEUP enum exposed**
Given the AZ-303 `VotingStatus` enum (post-this-task)
When a consumer imports it
Then `VotingStatus.UPLOAD_GIVEUP` is present alongside `PENDING`, `TRUSTED`, `REJECTED`; the contract file's Change Log shows v1.1.0

**AC-7: `increment_upload_attempts` returns new count**
Given a tile with `upload_attempts = 2`
When `increment_upload_attempts(tile_id)` is called
Then the SQL row's `upload_attempts` is now 3; the method returns `3`; concurrent invocations on different tiles produce no contention (per-row lock)

**AC-8: Migration 0002 adds the column**
Given a fresh DB at AZ-304's 0001 migration
When 0002 is applied
Then the `tiles` table has an `upload_attempts INTEGER NOT NULL DEFAULT 0` column; existing rows have `upload_attempts = 0`; the migration is reversible (drops the column on rollback)

**AC-9: Decorator is a drop-in for the Protocol**
Given an `IdempotentRetryTileUploader` instance
When `isinstance(impl, TileUploader)` is checked under `runtime_checkable`
Then the result is `True`; consumers that depend on the Protocol see no shape difference

**AC-10: `disable_retry_decorator` config bypass**
Given `config.c11.disable_retry_decorator = true`
When the composition root constructs the uploader
Then `build_tile_uploader(config)` returns the bare `HttpTileUploader` (no decorator); a debug INFO log records the bypass

**AC-11: Pass-through methods**
Given the decorator
When `enumerate_pending_tiles(flight_id)` and `confirm_flight_state()` are called
Then both delegate to `inner` directly with no added logic

**AC-12: Inner exception propagates without retry**
Given inner raises `FlightStateNotOnGroundError` or `SatelliteProviderError`
When the decorator catches the exception
Then it re-raises immediately; no retry is attempted (these are not partial-success cases); ZERO `Clock.sleep` calls

**AC-13: Idempotent across re-invocations (the C11-IT-05 scenario)**
Given a 50-tile batch where 30 succeed and 20 are rejected on first call (no in-call retry to avoid mixing); operator re-invokes after 5 minutes
When the second call runs
Then `pending_uploads` returns only the 20 tiles (the 30 are already `voting_status = uploaded`); `inner.upload_pending_tiles` is called with the request; only those 20 are POSTed; the 30 are NOT re-sent

## Non-Functional Requirements

**Performance**

- Decorator overhead per `upload_pending_tiles` call ≤ 5 ms (plus `Clock.sleep` time, which is intentional).
- `increment_upload_attempts` SQL call ≤ 5 ms p99 against the local Postgres.

**Compatibility**

- The new migration is append-only (NOT a modification of AZ-304's 0001 migration).
- The new `VotingStatus.UPLOAD_GIVEUP` value is additive (non-breaking).
- `increment_upload_attempts` is a Protocol method addition; existing AZ-303 conformance tests pass (the method has a default impl that raises `NotImplementedError` if a future implementation forgets it, but the AZ-305 impl provides the SQL version).

**Reliability**

- The retry loop is bounded by BOTH `max_in_call_retries` AND `max_per_tile_attempts`; neither alone can produce unbounded behaviour.
- The decorator does NOT swallow exceptions from `inner`; only `outcome=partial` results are eligible for retry.
- The injected `Clock.sleep` makes retry timing deterministic in tests.
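
The bounded loop and the deterministic timing described above can be sketched as follows. This is a minimal illustration, not the AZ-320 implementation: `FakeClock`, `ScriptedInner`, `Report` and `upload_with_retry` are hypothetical names, `Report` is a trimmed stand-in for `UploadBatchReport`, and the per-tile bookkeeping and the remaining-pending check are left out. The delay exponent follows the AC-4 schedule (2.0, 4.0, 8.0).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class C11RetryConfig:
    # Only the fields this sketch needs; defaults as stated in the task.
    max_in_call_retries: int = 3
    backoff_base_s: float = 2.0
    backoff_cap_s: float = 60.0


@dataclass
class FakeClock:
    """Hypothetical test double: records sleeps instead of blocking."""
    now_s: float = 0.0
    sleeps: list = field(default_factory=list)

    def now(self) -> float:
        return self.now_s

    def sleep(self, seconds: float) -> None:
        self.sleeps.append(seconds)
        self.now_s += seconds


@dataclass
class Report:
    """Simplified stand-in for UploadBatchReport (assumption)."""
    outcome: str
    retry_count: int = 0
    next_retry_at_s: Optional[float] = None


class ScriptedInner:
    """Hypothetical fake inner uploader that replays scripted outcomes."""

    def __init__(self, outcomes):
        self._outcomes = list(outcomes)
        self.calls = 0

    def upload_pending_tiles(self, request):
        self.calls += 1
        return Report(outcome=self._outcomes.pop(0))


def upload_with_retry(inner, clock, config, request):
    """Bounded iteration (no recursion); exceptions from inner propagate."""
    retries_used = 0
    report = inner.upload_pending_tiles(request)
    while report.outcome == "partial" and retries_used < config.max_in_call_retries:
        retries_used += 1
        # Retry n waits base ** n seconds, capped at backoff_cap_s.
        clock.sleep(min(config.backoff_base_s ** retries_used, config.backoff_cap_s))
        report = inner.upload_pending_tiles(request)
    report.retry_count = retries_used
    if report.outcome == "partial":
        # Operator hint once the in-call budget is spent.
        report.next_retry_at_s = clock.now() + config.backoff_cap_s
    return report


clock = FakeClock()
inner = ScriptedInner(["partial"] * 4)
final = upload_with_retry(inner, clock, C11RetryConfig(), request=None)
print(clock.sleeps, final.outcome, final.retry_count)  # [2.0, 4.0, 8.0] partial 3
```

Because the clock is injected, a test can assert on the recorded delays without waiting for them, and a persistent `partial` result ends after four inner calls.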

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Inner success | Pass-through; zero retries |
| AC-2 | Partial → retry → success | One `Clock.sleep(2.0)`; retry_count=1; final outcome=success |
| AC-3 | Per-tile attempts hits 5 | `update_voting_status` called; FDR + ERROR log emitted |
| AC-4 | Persistent partial across 4 attempts | `Clock.sleep(2.0)`, `(4.0)`, `(8.0)`; retry_count=3; final partial |
| AC-5 | Cap=10 with high attempt number | `Clock.sleep(10.0)` not `Clock.sleep(64.0)` |
| AC-6 | Import `VotingStatus.UPLOAD_GIVEUP` | Present; tile_metadata_store.md v1.1.0 |
| AC-7 | Concurrent `increment_upload_attempts` on different tiles | No deadlock; correct counts |
| AC-8 | Apply 0002 migration | Column added; default 0; rollback drops |
| AC-9 | `isinstance` check | True |
| AC-10 | Config bypass | Bare impl; debug log |
| AC-11 | Pass-through methods | Delegated unchanged |
| AC-12 | Inner raises | Re-raised; zero retries |
| AC-13 | Two-call scenario across operator re-invocations | First call: 30 acked / 20 rejected; second call: only 20 POSTed |
| NFR-perf-overhead | Microbench decorator with success-on-first | ≤ 5 ms overhead |

## Constraints

- The decorator MUST be a drop-in for `TileUploader`; the composition root selects via `config.c11.disable_retry_decorator` only.
- The retry budget is per-call (in-call) and per-tile (across calls); neither budget alone fully bounds the behaviour, so both are required.
- `increment_upload_attempts` is the ONLY method that mutates `upload_attempts`; consumers do NOT directly UPDATE the column. This is a contract invariant; code-review treats direct UPDATEs as `Architecture` finding (High).
- The `UPLOAD_GIVEUP` voting status is a HUMAN-decision boundary: automated promotion back to `pending` is forbidden in this task. An out-of-band SQL UPDATE by the operator is the documented recovery path.
- The migration 0002 is APPEND-ONLY relative to 0001; it does NOT alter existing column types.
- This task introduces no new third-party dependencies.

## Risks & Mitigation

**Risk 1: AZ-303 contract bump cascades to other consumers**

- *Risk*: Adding `increment_upload_attempts` to the Protocol forces every existing C6 consumer (C2 VPR, C2.5 ReRanker, C3 CrossDomainMatcher, C10 CacheProvisioner, C12 OperatorTooling) to re-confirm conformance.
- *Mitigation*: The new method is OPTIONAL via a Protocol default impl that raises `NotImplementedError`; consumers that don't call it are unaffected. The conformance test verifies only that AZ-305's impl provides it.

**Risk 2: Backoff cap interacts badly with operator workflows**

- *Risk*: A 60-second cap means the operator may walk away during retries; the visible CLI hangs.
- *Mitigation*: The decorator emits an INFO log per retry attempt with `attempt_number, sleep_s, remaining_pending_count`; C12's CLI surfaces this so the operator sees progress. Cap is configurable.

**Risk 3: `UPLOAD_GIVEUP` tiles accumulating without operator visibility**

- *Risk*: A subtle data corruption in C6 causes 100% of tiles to hit `UPLOAD_GIVEUP`; the operator notices only when they manually inspect C6.
- *Mitigation*: Each `UPLOAD_GIVEUP` event emits FDR `kind="c11.upload.giveup"` AND ERROR log; C12's CLI summary surfaces the count post-upload-run. This task adds NO direct UI; C12's task list will include surfacing.

**Risk 4: Clock.sleep blocking on KeyboardInterrupt**

- *Risk*: A long backoff (60s) blocks the process; Ctrl+C aborts mid-sleep but might leave state inconsistent.
- *Mitigation*: The decorator uses the injected `Clock` which is the same singleton as AZ-307/AZ-308; KeyboardInterrupt propagates upward and AZ-319's try/finally still runs `key_manager.end_session()`; the decorator's own state is just the retry counter (in-memory; no on-disk side effects between retries).
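
The Risk 1 mitigation (a Protocol default that raises `NotImplementedError`) might look like the sketch below. The Protocol here is a trimmed, hypothetical view with one existing method; `LegacyStore`, `CountingStore` and `StructuralOnlyStore` are illustrative names, and `CountingStore` is an in-memory stand-in for the SQL-backed impl. The sketch assumes implementations subclass the Protocol explicitly, because a default method body is only inherited that way.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TileMetadataStore(Protocol):
    """Trimmed, hypothetical view of the C6 Protocol."""

    def mark_uploaded(self, tile_id: str) -> None: ...

    def increment_upload_attempts(self, tile_id: str) -> int:
        # Default body: stores that predate this method fail only if it is called.
        raise NotImplementedError("increment_upload_attempts is not implemented")


class LegacyStore(TileMetadataStore):
    """Explicit subclass without the new method; it inherits the default."""

    def mark_uploaded(self, tile_id: str) -> None:
        pass


class CountingStore(TileMetadataStore):
    """In-memory stand-in for the SQL impl; returns the post-increment count."""

    def __init__(self) -> None:
        self._attempts = {}

    def mark_uploaded(self, tile_id: str) -> None:
        pass

    def increment_upload_attempts(self, tile_id: str) -> int:
        self._attempts[tile_id] = self._attempts.get(tile_id, 0) + 1
        return self._attempts[tile_id]


class StructuralOnlyStore:
    """Duck-typed class that does NOT subclass the Protocol."""

    def mark_uploaded(self, tile_id: str) -> None:
        pass


store = CountingStore()
store.increment_upload_attempts("tile-a")
print(store.increment_upload_attempts("tile-a"))             # 2
print(isinstance(LegacyStore(), TileMetadataStore))          # True
print(isinstance(StructuralOnlyStore(), TileMetadataStore))  # False
```

One point worth checking against the real conformance tests: a class that matches the Protocol only structurally does not inherit the default body, and a `runtime_checkable` `isinstance` check reports it as non-conforming once the new method is part of the Protocol.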

## Runtime Completeness

- **Named capability**: bounded in-call retry on partial-success uploads, per-tile retry budget with `UPLOAD_GIVEUP` terminal state, operator-friendly `next_retry_at_s` hint (description.md § 5, C11-IT-05).
- **Production code that must exist**: real `IdempotentRetryTileUploader` decorator, real `increment_upload_attempts` SQL, real migration 0002, real `VotingStatus.UPLOAD_GIVEUP` enum value, real composition-root wiring with the bypass flag.
- **Allowed external stubs**: tests MAY use a fake `inner` (mock TileUploader implementing the Protocol with scripted responses), fake `Clock`, fake `tile_metadata_store` (already provided by AZ-303 conformance fakes); production wiring uses real implementations all the way down.
- **Unacceptable substitutes**: a recursive Python implementation of the retry loop (stack-explosion risk; bounded iteration is required); skipping the per-tile budget (lets one bad tile poison every retry); silently moving tiles to `UPLOAD_GIVEUP` without FDR (loses the safety-officer surface); modifying AZ-304's 0001 migration in place (breaks deployment idempotence, since migrations are append-only).
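
The per-tile budget named above can be illustrated with a short sketch. Everything here is an assumption made for illustration: `InMemoryStore` and `RecordingFdr` are hypothetical fakes, `emit` is an assumed method name on the FDR client, the status is a plain string rather than the `VotingStatus` enum, and `apply_per_tile_budget` is not a name from the task.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical fake for the two TileMetadataStore methods used here."""
    attempts: dict = field(default_factory=dict)
    status: dict = field(default_factory=dict)

    def increment_upload_attempts(self, tile_id):
        self.attempts[tile_id] = self.attempts.get(tile_id, 0) + 1
        return self.attempts[tile_id]

    def update_voting_status(self, tile_id, new_status):
        self.status[tile_id] = new_status


@dataclass
class RecordingFdr:
    """Hypothetical FDR client double; `emit` is an assumed method name."""
    records: list = field(default_factory=list)

    def emit(self, kind, payload):
        self.records.append((kind, payload))


def apply_per_tile_budget(rejected, store, fdr, max_per_tile_attempts=5):
    """Count one attempt per rejected tile; give up on tiles at the budget.

    `rejected` is an iterable of (tile_id, rejection_reason) pairs.
    Returns the ids of tiles moved to the give-up state in this pass.
    """
    given_up = []
    for tile_id, reason in rejected:
        attempts = store.increment_upload_attempts(tile_id)
        if attempts >= max_per_tile_attempts:
            store.update_voting_status(tile_id, "upload_giveup")
            # Never give up silently: every give-up produces an FDR record.
            fdr.emit(
                "c11.upload.giveup",
                {"tile_id": tile_id, "attempts": attempts, "last_rejection_reason": reason},
            )
            given_up.append(tile_id)
    return given_up


store = InMemoryStore(attempts={"tile-1": 4})
fdr = RecordingFdr()
gone = apply_per_tile_budget([("tile-1", "bad blob"), ("tile-2", "timeout")], store, fdr)
print(gone)  # ['tile-1']
```

In this run `tile-1` reaches its fifth attempt and is given up with one FDR record, while `tile-2` is counted once and stays eligible for the next pass.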