[AZ-320] Add C11 IdempotentRetryTileUploader decorator

Wraps HttpTileUploader (AZ-319) with two bounded retry budgets:

- In-call (per-batch) — re-invokes inner on PARTIAL outcome up to
  `max_in_call_retries` times with capped exponential backoff
  (`min(base ** attempt_number, cap)`). On exhaustion: surfaces an
  operator hint via `next_retry_at_s = now + backoff_cap_s`.
- Per-tile (cross-call) — atomically increments c6's
  `tiles.upload_attempts` counter for every rejection; once a tile
  hits `max_per_tile_attempts` it is forward-only transitioned to
  `voting_status = upload_giveup` (excluded from `pending_uploads`).
  Each transition emits FDR `kind="c11.upload.giveup"` plus an
  ERROR log.

C6 contract changes (AZ-303 v1.3.0):
- VotingStatus.UPLOAD_GIVEUP added (forward-only from PENDING/TRUSTED).
- TileMetadataStore.increment_upload_attempts(tile_id) -> int added
  with NotImplementedError default for backwards-compat.
- Migration 0003_c11_upload_attempts: additive column +
  widened ck_tiles_voting_status (preserves IS NULL clause).

C11 wiring:
- C11RetryConfig + disable_retry_decorator on C11Config.
- build_tile_uploader wraps in decorator by default; bypass flag
  returns the bare HttpTileUploader. New `clock` keyword.

Cross-component isolation honoured (AZ-507): the decorator declares
`_RetryMetadataStoreLike` Protocol cut over c6's TileMetadataStore
and references `UPLOAD_GIVEUP` via a local string constant — no c6
imports.

Tests: 13 decorator + 1 conformance + 2 factory bypass + AC-6 enum
update + alembic head bump + AZ-272 schema fixture. 238 passed across
c11/c6/fdr suites; pre-existing perf microbenches unrelated.

Code review: PASS_WITH_WARNINGS (5 Low/Informational findings,
docs-level or downstream-CI-blocked). See
_docs/03_implementation/reviews/batch_41_review.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-13 08:48:53 +03:00
parent 90f4ac78f4
commit a06b107fc3
19 changed files with 1788 additions and 21 deletions
@@ -0,0 +1,214 @@
# C11 Idempotent Retry — In-Call Retry Loop on Partial-Success Batches
**Task**: AZ-320_c11_idempotent_retry
**Name**: C11 Idempotent Retry Decorator
**Description**: Implement `IdempotentRetryTileUploader`, a decorator that wraps the AZ-319 `TileUploader` Protocol impl and adds bounded in-call retry on partial-success batches. After the underlying uploader returns `outcome=partial`, the decorator re-queries C6's `pending_uploads` (already-acked tiles were `mark_uploaded`'d, so the second pass naturally targets only the unacked subset), waits an exponential-backoff delay, and re-invokes the underlying upload. Caps at `config.c11.max_in_call_retries` (default 3); on budget exhaustion, the final report's `outcome` stays `partial` and `next_retry_at_s` carries an operator hint for when to retry later. A per-tile rejection counter in C6 metadata (`upload_attempts`) bounds the per-tile retry budget — after `config.c11.max_per_tile_attempts` (default 5), the tile is moved to `voting_status = upload_giveup` (a new enum value added by this task) and surfaced via FDR for human review.
**Complexity**: 3 points
**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf, AZ-303_c6_storage_interfaces, AZ-319_c11_tile_uploader
**Component**: c11_tilemanager (epic AZ-251 / E-C11)
**Tracker**: AZ-320
**Epic**: AZ-251 (E-C11)
### Document Dependencies
- `_docs/02_document/contracts/c11_tilemanager/tile_uploader.md` — the underlying Protocol this decorator wraps; the decorator itself implements the same Protocol (drop-in replacement).
- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — consumed: `pending_uploads`, `mark_uploaded`, `update_voting_status`. This task adds an `upload_attempts` integer field and an `upload_giveup` value to `VotingStatus` — a contract change that bumps `tile_metadata_store.md` to v1.1.0 (non-breaking minor).
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — `kind="c11.upload.giveup"` envelope.
- `_docs/02_document/components/12_c11_tilemanager/tests.md` — C11-IT-05 test scenario.
## Problem
Without bounded in-call retry:
- C11-IT-05 ("idempotent uploads on retry — re-running `upload_pending` after a partial-success batch only POSTs the tiles that weren't acknowledged before") relies on the operator manually re-invoking `upload_pending`. Operators tolerate one re-invocation but resent doing it 3-4 times after transient `satellite-provider` flakiness.
- A single tile that ALWAYS fails (e.g., truncated tile_blob in C6 fails ingest validation forever) becomes a poison pill — every retry attempt re-uploads it AND every other unacked tile, wasting bandwidth and signing cycles. Without a per-tile budget, the operator cannot distinguish transient failures from terminal ones.
- The `next_retry_at_s` field of `UploadBatchReport` (per AZ-319 contract) has no producer — without backoff calculation, the field is always None and the operator gets no hint on retry timing.
- The parent suite's voting layer assumes uploaded tiles are eventually-consistent; an unbounded retry loop with no per-tile state would create lockstep retry storms.
This task delivers the retry decorator. It changes NO underlying logic in AZ-319; it composes.
## Outcome
- An `IdempotentRetryTileUploader` class at `src/gps_denied_onboard/components/c11_tilemanager/idempotent_retry.py`:
  - Implements the `TileUploader` Protocol (drop-in for `HttpTileUploader`).
  - Constructor: `__init__(self, *, inner: TileUploader, tile_metadata_store: TileMetadataStore, fdr_client: FdrClient, logger: Logger, clock: Clock, config: C11RetryConfig)`.
  - `C11RetryConfig` is a frozen dataclass with `max_in_call_retries: int = 3`, `max_per_tile_attempts: int = 5`, `backoff_base_s: float = 2.0`, `backoff_cap_s: float = 60.0`.
  - `upload_pending_tiles(request)` flow:
    1. Calls `inner.upload_pending_tiles(request)` once.
    2. If the inner returns `outcome in (success, failure)` → return as-is.
    3. If `outcome == partial`:
       - For each `PerTileStatus.status == rejected`, increments the tile's `upload_attempts` in C6 via a new `tile_metadata_store.increment_upload_attempts(tile_id)` method.
       - For each tile whose `upload_attempts >= config.max_per_tile_attempts`, calls `tile_metadata_store.update_voting_status(tile_id, VotingStatus.UPLOAD_GIVEUP)`; emits FDR `kind="c11.upload.giveup"` with `{tile_id, attempts, last_rejection_reason}`; emits ERROR log.
       - If `retries_used < config.max_in_call_retries` AND there are still tiles with `voting_status == pending`:
         - Sleeps `min(config.backoff_base_s ** (retries_used + 1), config.backoff_cap_s)` seconds via injected `Clock.sleep` (the exponent is the 1-based attempt number, so the first delay is 2.0 s with the default base, matching AC-2 and AC-4).
         - Repeats from step 1 with `retries_used += 1` (a bounded loop inside an internal helper, NOT actual recursion).
       - Else (budget exhausted):
         - Aggregates the final `UploadBatchReport`: `outcome = partial`; `retry_count = retries_used`; `next_retry_at_s = clock.now() + config.backoff_cap_s` (operator hint).
         - Returns the aggregated report.
  - `enumerate_pending_tiles(flight_id)` and `confirm_flight_state()` pass through to the inner unchanged.
- A new `VotingStatus.UPLOAD_GIVEUP` enum value is added to AZ-303's `VotingStatus` (in C6's `_types.py`); this is a non-breaking minor bump of `tile_metadata_store.md` to v1.1.0 — the producer (AZ-303) stays in v1, but C6's contract file's `Change Log` is appended by this task with a note pointing to the bump.
- A new `tile_metadata_store.increment_upload_attempts(tile_id) -> int` method is added to AZ-303's `TileMetadataStore` Protocol; returns the new attempt count post-increment. This is a Protocol surface addition (minor bump). The implementation lives in AZ-305's `PostgresFilesystemStore`. This task adds:
  - The Protocol method declaration in `c6_tile_cache/interface.py`.
  - The impl in `c6_tile_cache/postgres_filesystem_store.py` (a single SQL `UPDATE ... SET upload_attempts = upload_attempts + 1 WHERE tile_id = $1 RETURNING upload_attempts`).
  - The Postgres column `upload_attempts INTEGER NOT NULL DEFAULT 0` via a NEW alembic migration `_alembic/0002_upload_attempts.sql` (NOT modifying AZ-304's 0001 migration; per `coderule.mdc` migrations are append-only).
- The composition root wraps `HttpTileUploader` with `IdempotentRetryTileUploader` by default. A `config.c11.disable_retry_decorator: bool = false` lets operators bypass the decorator for debugging.
- INFO log on session start with retry config; INFO log per retry attempt with `attempt_number, sleep_s, remaining_pending_count`; ERROR log on per-tile giveup; FDR `kind="c11.upload.giveup"` per tile.
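The `upload_pending_tiles` flow described in this section can be sketched as a bounded loop. This is a minimal illustration under stated assumptions, not the committed implementation: `Report`, `FakeClock`, and `retry_loop` are hypothetical names, the per-tile accounting is reduced to a comment, and the exponent is the 1-based attempt number so that the first delay is 2.0 s, as AC-2 and AC-4 expect.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Report:  # hypothetical stand-in for UploadBatchReport
    outcome: str  # "success" | "partial" | "failure"
    retry_count: int = 0
    next_retry_at_s: Optional[float] = None


class FakeClock:  # hypothetical test double for the injected Clock
    def __init__(self) -> None:
        self.t = 1000.0
        self.sleeps: List[float] = []

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        # Record the delay and advance virtual time instead of blocking.
        self.sleeps.append(seconds)
        self.t += seconds


def retry_loop(
    inner: Callable[[], Report],
    has_pending: Callable[[], bool],
    clock: FakeClock,
    max_in_call_retries: int = 3,
    backoff_base_s: float = 2.0,
    backoff_cap_s: float = 60.0,
) -> Report:
    retries_used = 0
    while True:  # each pass either returns or increments retries_used
        report = inner()  # exceptions from inner propagate; nothing is caught
        if report.outcome != "partial":
            # First-call results pass through untouched; later ones carry the count.
            return report if retries_used == 0 else replace(report, retry_count=retries_used)
        # Per-tile attempt accounting and giveup transitions would run here.
        if retries_used >= max_in_call_retries or not has_pending():
            return replace(
                report,
                retry_count=retries_used,
                next_retry_at_s=clock.now() + backoff_cap_s,  # operator hint
            )
        retries_used += 1
        clock.sleep(min(backoff_base_s ** retries_used, backoff_cap_s))


clock = FakeClock()
scripted = iter([Report("partial"), Report("partial"), Report("success")])
final = retry_loop(lambda: next(scripted), lambda: True, clock)
print(final.outcome, final.retry_count, clock.sleeps)  # success 2 [2.0, 4.0]
```

Writing the helper as a `while` loop rather than a recursive call keeps stack depth constant regardless of `max_in_call_retries`.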
## Scope
### Included
- `IdempotentRetryTileUploader` decorator class.
- `C11RetryConfig` frozen dataclass.
- `VotingStatus.UPLOAD_GIVEUP` enum value addition (in C6's `_types.py`).
- `tile_metadata_store.increment_upload_attempts(tile_id) -> int` Protocol method addition + AZ-305 SQL impl.
- `_alembic/0002_upload_attempts.sql` migration — adds `upload_attempts INTEGER NOT NULL DEFAULT 0` column to the tiles table.
- Composition-root wiring (decorate `HttpTileUploader` by default; `config.c11.disable_retry_decorator` lets operators opt out).
- Bumping `tile_metadata_store.md` to v1.1.0 with a Change Log entry.
- INFO/ERROR logs and FDR `c11.upload.giveup` emission.
- Conformance test: `isinstance(IdempotentRetryTileUploader(...), TileUploader)`.
### Excluded
- The underlying `HttpTileUploader` impl — owned by AZ-319.
- The decision rule for what counts as a transient vs. terminal rejection — the decorator treats EVERY rejection as transient until the per-tile attempt budget is hit; the operator may manually move `UPLOAD_GIVEUP` tiles back to `pending` after investigation (out-of-band SQL UPDATE; no API surface).
- A separate background-retry daemon — the retry happens within `upload_pending_tiles`; the operator decides when to invoke it.
- Cross-process retry coordination — the C12 lockfile already prevents concurrent C11 invocations.
- Surfacing `UPLOAD_GIVEUP` in the operator-tooling CLI — owned by E-C12.
- Auto-promotion of `UPLOAD_GIVEUP` back to `pending` after manual fixes — operator concern; out of scope.
## Acceptance Criteria
**AC-1: Success on first attempt — no retry**
Given the inner uploader returns `outcome = success` on the first call
When `upload_pending_tiles(request)` is called
Then the decorator returns immediately; ZERO calls to `Clock.sleep`; ZERO calls to `increment_upload_attempts`; report passes through unchanged
**AC-2: Partial-success with retry budget available**
Given inner returns `outcome=partial` with 3 of 10 tiles rejected (per_tile_status), and `max_in_call_retries=3`
When the decorator processes the partial
Then `increment_upload_attempts` is called 3 times (one per rejected tile); `Clock.sleep(2.0)` is called once; inner is re-invoked; if the second attempt is `success`, the final aggregated report shows `outcome = success` and `retry_count = 1`
**AC-3: Per-tile budget exhausted moves tile to UPLOAD_GIVEUP**
Given a tile whose `upload_attempts` reaches `max_per_tile_attempts=5`
When the decorator increments the counter
Then `update_voting_status(tile_id, UPLOAD_GIVEUP)` is called; ONE FDR `kind="c11.upload.giveup"` is emitted with `{tile_id, attempts=5, last_rejection_reason}`; ONE ERROR log; the tile is NOT re-uploaded in subsequent retries (since `pending_uploads` excludes UPLOAD_GIVEUP)
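The per-tile accounting in AC-3 might look roughly like the sketch below. `InMemoryStore` and `record_rejections` are hypothetical names, the FDR event is modelled as a plain dict appended to a list, and the exact envelope layout is an assumption; only the `kind` value and the three payload fields come from this spec.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger("c11.retry")  # logger name is illustrative
UPLOAD_GIVEUP = "upload_giveup"  # local string constant for the status value


class InMemoryStore:  # hypothetical fake of the C6 metadata store
    def __init__(self) -> None:
        self.attempts: Dict[str, int] = {}
        self.status: Dict[str, str] = {}

    def increment_upload_attempts(self, tile_id: str) -> int:
        self.attempts[tile_id] = self.attempts.get(tile_id, 0) + 1
        return self.attempts[tile_id]

    def update_voting_status(self, tile_id: str, status: str) -> None:
        self.status[tile_id] = status


def record_rejections(
    store: InMemoryStore,
    rejected: List[Tuple[str, str]],  # (tile_id, rejection_reason)
    max_per_tile_attempts: int,
    fdr_events: List[dict],
) -> List[str]:
    """Count one attempt per rejected tile; return the ids moved to giveup."""
    given_up: List[str] = []
    for tile_id, reason in rejected:
        attempts = store.increment_upload_attempts(tile_id)
        if attempts >= max_per_tile_attempts:
            store.update_voting_status(tile_id, UPLOAD_GIVEUP)
            fdr_events.append(
                {
                    "kind": "c11.upload.giveup",
                    "tile_id": tile_id,
                    "attempts": attempts,
                    "last_rejection_reason": reason,
                }
            )
            logger.error("tile %s given up after %d attempts", tile_id, attempts)
            given_up.append(tile_id)
    return given_up
```

Because the threshold check uses the value returned by the increment, a tile crosses into giveup exactly once, on the call that raises its counter to the limit.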
**AC-4: In-call retry budget exhausted**
Given inner consistently returns `outcome=partial` with the same rejected tile, and `max_in_call_retries=3`
When the decorator runs out of in-call retries
Then 3 retries are attempted (4 total inner calls including the first); `Clock.sleep` is called 3 times with backoffs `2.0, 4.0, 8.0`; the final report has `outcome=partial`, `retry_count=3`, `next_retry_at_s = clock.now() + backoff_cap_s`
**AC-5: Backoff cap honoured**
Given `max_in_call_retries=10` and `backoff_cap_s=10`
When the decorator computes the 6th retry delay
Then `Clock.sleep(10.0)` is called (capped at 10s, not `2^6 = 64s`)
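The delays in AC-4 and AC-5 follow from a capped power of the base. A small worked sketch, assuming a 1-based attempt number (the function name `backoff_delay_s` is hypothetical):

```python
def backoff_delay_s(attempt_number: int, base_s: float = 2.0, cap_s: float = 60.0) -> float:
    """Capped exponential backoff; attempt_number is 1-based."""
    if attempt_number < 1:
        raise ValueError("attempt_number is 1-based")
    return min(base_s ** attempt_number, cap_s)


# Default config: the 6th delay would be 64 s uncapped, so the 60 s cap applies.
schedule = [backoff_delay_s(n) for n in range(1, 8)]
print(schedule)  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]

# AC-5: with a 10 s cap the 6th retry sleeps 10 s, not 64 s.
print(backoff_delay_s(6, cap_s=10.0))  # 10.0
```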
**AC-6: VotingStatus.UPLOAD_GIVEUP enum exposed**
Given the AZ-303 `VotingStatus` enum (post-this-task)
When a consumer imports it
Then `VotingStatus.UPLOAD_GIVEUP` is present alongside `PENDING`, `TRUSTED`, `REJECTED`; the contract file's Change Log shows v1.1.0
**AC-7: `increment_upload_attempts` returns new count**
Given a tile with `upload_attempts = 2`
When `increment_upload_attempts(tile_id)` is called
Then the SQL row's `upload_attempts` is now 3; the method returns `3`; concurrent invocations on different tiles produce no contention (per-row lock)
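The increment-and-read behaviour in AC-7 can be tried locally with the standard-library `sqlite3` module as a stand-in for Postgres. This is only a sketch: the table is reduced to two columns, and the single `UPDATE ... RETURNING` statement from the spec is replaced by an `UPDATE` followed by a `SELECT` inside one transaction, because `RETURNING` needs SQLite 3.35 or newer.

```python
import sqlite3


def increment_upload_attempts(conn: sqlite3.Connection, tile_id: str) -> int:
    """Bump the counter for one tile and return the post-increment value."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE tiles SET upload_attempts = upload_attempts + 1 WHERE tile_id = ?",
            (tile_id,),
        )
        if cur.rowcount != 1:
            raise KeyError(tile_id)  # unknown tile: nothing was updated
        row = conn.execute(
            "SELECT upload_attempts FROM tiles WHERE tile_id = ?", (tile_id,)
        ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tiles (tile_id TEXT PRIMARY KEY, upload_attempts INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO tiles (tile_id, upload_attempts) VALUES (?, ?)", ("t1", 2))
conn.commit()

print(increment_upload_attempts(conn, "t1"))  # 3
```

Doing the arithmetic in SQL (`upload_attempts + 1`) rather than reading the value into Python and writing it back is what keeps the increment safe when several callers touch the same row.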
**AC-8: Migration 0002 adds the column**
Given a fresh DB at AZ-304's 0001 migration
When 0002 is applied
Then the `tiles` table has an `upload_attempts INTEGER NOT NULL DEFAULT 0` column; existing rows have `upload_attempts = 0`; the migration is reversible (drops the column on rollback)
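The additive-column behaviour in AC-8 can be checked against an in-memory SQLite database. This is a sketch under assumptions: the real migration targets Postgres through alembic, the `tiles` table here has a single illustrative column, and the downgrade statement is shown as a constant only (SQLite supports `DROP COLUMN` from 3.35).

```python
import sqlite3

UPGRADE_SQL = "ALTER TABLE tiles ADD COLUMN upload_attempts INTEGER NOT NULL DEFAULT 0"
DOWNGRADE_SQL = "ALTER TABLE tiles DROP COLUMN upload_attempts"  # not executed here

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tiles (tile_id TEXT PRIMARY KEY)")  # state before the migration
conn.executemany("INSERT INTO tiles (tile_id) VALUES (?)", [("t1",), ("t2",)])

conn.execute(UPGRADE_SQL)

# Rows that existed before the migration pick up the default value.
rows = conn.execute(
    "SELECT tile_id, upload_attempts FROM tiles ORDER BY tile_id"
).fetchall()
print(rows)  # [('t1', 0), ('t2', 0)]
```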
**AC-9: Decorator is a drop-in for the Protocol**
Given an `IdempotentRetryTileUploader` instance
When `isinstance(impl, TileUploader)` is checked under `runtime_checkable`
Then the result is `True`; consumers that depend on the Protocol see no shape difference
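The conformance check in AC-9 relies on `typing.runtime_checkable`. In the sketch below the three method names come from this spec, while the signatures and the `RetryDecorator` and `Incomplete` classes are assumptions for illustration. Note that a runtime `isinstance` check against a Protocol only verifies that the members exist, not that their signatures match.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TileUploader(Protocol):  # signatures are assumed, names are from the spec
    def upload_pending_tiles(self, request): ...
    def enumerate_pending_tiles(self, flight_id): ...
    def confirm_flight_state(self): ...


class RetryDecorator:  # hypothetical; deliberately NOT a subclass of TileUploader
    def __init__(self, inner):
        self._inner = inner

    def upload_pending_tiles(self, request):
        return self._inner.upload_pending_tiles(request)

    def enumerate_pending_tiles(self, flight_id):
        return self._inner.enumerate_pending_tiles(flight_id)

    def confirm_flight_state(self):
        return self._inner.confirm_flight_state()


class Incomplete:  # missing two of the three Protocol methods
    def upload_pending_tiles(self, request):
        return None


print(isinstance(RetryDecorator(inner=None), TileUploader))  # True
print(isinstance(Incomplete(), TileUploader))  # False
```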
**AC-10: `disable_retry_decorator` config bypass**
Given `config.c11.disable_retry_decorator = true`
When the composition root constructs the uploader
Then `build_tile_uploader(config)` returns the bare `HttpTileUploader` (no decorator); a debug INFO log records the bypass
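A minimal sketch of the bypass in AC-10, assuming a cut-down `C11Config` and empty stand-in classes; the real `build_tile_uploader` takes more collaborators (store, FDR client, clock) that are omitted here, and the log message text is illustrative.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("c11.factory")  # logger name is illustrative


@dataclass(frozen=True)
class C11Config:  # hypothetical subset of the real config
    disable_retry_decorator: bool = False


class HttpTileUploader:  # stand-in for the AZ-319 implementation
    pass


class IdempotentRetryTileUploader:  # stand-in: only records what it wraps
    def __init__(self, *, inner):
        self.inner = inner


def build_tile_uploader(config: C11Config):
    bare = HttpTileUploader()
    if config.disable_retry_decorator:
        logger.info("retry decorator bypassed; returning bare HttpTileUploader")
        return bare
    return IdempotentRetryTileUploader(inner=bare)


default_uploader = build_tile_uploader(C11Config())
bypassed_uploader = build_tile_uploader(C11Config(disable_retry_decorator=True))
print(type(default_uploader).__name__, type(bypassed_uploader).__name__)
# IdempotentRetryTileUploader HttpTileUploader
```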
**AC-11: Pass-through methods**
Given the decorator
When `enumerate_pending_tiles(flight_id)` and `confirm_flight_state()` are called
Then both delegate to `inner` directly with no added logic
**AC-12: Inner exception propagates without retry**
Given inner raises `FlightStateNotOnGroundError` or `SatelliteProviderError`
When the decorator catches the exception
Then it re-raises immediately; no retry is attempted (these are not partial-success cases); ZERO `Clock.sleep` calls
**AC-13: Idempotent across re-invocations (the C11-IT-05 scenario)**
Given a 50-tile batch where 30 succeed and 20 are rejected on first call (in-call retry disabled so the scenario isolates cross-call behaviour); operator re-invokes after 5 minutes
When the second call runs
Then `pending_uploads` returns only the 20 tiles (the 30 are already `voting_status = uploaded`); `inner.upload_pending_tiles` is called with the request; only those 20 are POSTed; the 30 are NOT re-sent
## Non-Functional Requirements
**Performance**
- Decorator overhead per `upload_pending_tiles` call ≤ 5 ms (plus `Clock.sleep` time, which is intentional).
- `increment_upload_attempts` SQL call ≤ 5 ms p99 against the local Postgres.
**Compatibility**
- The new migration is append-only (NOT a modification of AZ-304's 0001 migration).
- The new `VotingStatus.UPLOAD_GIVEUP` value is additive (non-breaking).
- `increment_upload_attempts` is a Protocol method addition; existing AZ-303 conformance tests pass (the method has a default impl that raises `NotImplementedError` if a future implementation forgets it — but the AZ-305 impl provides the SQL version).
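One way the `NotImplementedError` default could work is a method body on the Protocol class itself. The sketch below uses hypothetical `LegacyStore` and `SqlBackedStore` classes and a trimmed Protocol. A caveat worth keeping in mind: in Python the default body is inherited only by classes that explicitly subclass the Protocol; a class that merely matches it structurally gets no default.

```python
from typing import Dict, Protocol


class TileMetadataStore(Protocol):  # trimmed to two methods for illustration
    def mark_uploaded(self, tile_id: str) -> None: ...

    def increment_upload_attempts(self, tile_id: str) -> int:
        # Default for stores that predate the method.
        raise NotImplementedError("increment_upload_attempts is not provided by this store")


class LegacyStore(TileMetadataStore):  # explicit subclass: inherits the default
    def mark_uploaded(self, tile_id: str) -> None:
        pass


class SqlBackedStore(TileMetadataStore):  # overrides the default with a real counter
    def __init__(self) -> None:
        self._attempts: Dict[str, int] = {}

    def mark_uploaded(self, tile_id: str) -> None:
        pass

    def increment_upload_attempts(self, tile_id: str) -> int:
        self._attempts[tile_id] = self._attempts.get(tile_id, 0) + 1
        return self._attempts[tile_id]


LegacyStore().mark_uploaded("t1")  # existing callers keep working
print(SqlBackedStore().increment_upload_attempts("t1"))  # 1
```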
**Reliability**
- The retry loop is bounded by BOTH `max_in_call_retries` (within one call) AND `max_per_tile_attempts` (across calls); together they rule out unbounded retry behaviour.
- The decorator does NOT swallow exceptions from `inner`; only `outcome=partial` results are eligible for retry.
- The injected `Clock.sleep` makes retry timing deterministic in tests.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Inner success | Pass-through; zero retries |
| AC-2 | Partial → retry → success | One `Clock.sleep(2.0)`; retry_count=1; final outcome=success |
| AC-3 | Per-tile attempts hits 5 | `update_voting_status` called; FDR + ERROR log emitted |
| AC-4 | Persistent partial across 4 attempts | `Clock.sleep(2.0)`, `(4.0)`, `(8.0)`; retry_count=3; final partial |
| AC-5 | Cap=10 with high attempt number | `Clock.sleep(10.0)` not `Clock.sleep(64.0)` |
| AC-6 | Import `VotingStatus.UPLOAD_GIVEUP` | Present; tile_metadata_store.md v1.1.0 |
| AC-7 | Concurrent `increment_upload_attempts` on different tiles | No deadlock; correct counts |
| AC-8 | Apply 0002 migration | Column added; default 0; rollback drops |
| AC-9 | `isinstance` check | True |
| AC-10 | Config bypass | Bare impl; debug log |
| AC-11 | Pass-through methods | Delegated unchanged |
| AC-12 | Inner raises | Re-raised; zero retries |
| AC-13 | Two-call scenario across operator re-invocations | First call: 30 acked / 20 rejected; second call: only 20 POSTed |
| NFR-perf-overhead | Microbench decorator with success-on-first | ≤ 5 ms overhead |
## Constraints
- The decorator MUST be a drop-in for `TileUploader`; the composition root selects via `config.c11.disable_retry_decorator` only.
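- The retry budget is per-call (in-call) and per-tile (across calls). Neither alone fully bounds retries: the in-call budget limits attempts within one invocation, the per-tile budget limits attempts across invocations. Both are required.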
- `increment_upload_attempts` is the ONLY method that mutates `upload_attempts`; consumers do NOT directly UPDATE the column. This is a contract invariant; code review treats direct UPDATEs as an `Architecture` finding (High).
- The `UPLOAD_GIVEUP` voting status is a HUMAN-decision boundary — automated promotion back to `pending` is forbidden in this task. An out-of-band SQL UPDATE by the operator is the documented recovery path.
- The migration 0002 is APPEND-ONLY relative to 0001; it does NOT alter existing column types.
- This task introduces no new third-party dependencies.
## Risks & Mitigation
**Risk 1: AZ-303 contract bump cascades to other consumers**
- *Risk*: Adding `increment_upload_attempts` to the Protocol forces every existing C6 consumer (C2 VPR, C2.5 ReRanker, C3 CrossDomainMatcher, C10 CacheProvisioner, C12 OperatorTooling) to re-confirm conformance.
- *Mitigation*: The new method is OPTIONAL via a Protocol default impl that raises `NotImplementedError`; consumers that don't call it are unaffected. The conformance test verifies only that AZ-305's impl provides it.
**Risk 2: Backoff cap interacts badly with operator workflows**
- *Risk*: A 60-second cap means the operator may walk away during retries; the visible CLI hangs.
- *Mitigation*: The decorator emits an INFO log per retry attempt with `attempt_number, sleep_s, remaining_pending_count`; C12's CLI surfaces this so the operator sees progress. Cap is configurable.
**Risk 3: `UPLOAD_GIVEUP` tiles accumulating without operator visibility**
- *Risk*: A subtle data corruption in C6 causes 100% of tiles to hit `UPLOAD_GIVEUP`; the operator notices only when they manually inspect C6.
- *Mitigation*: Each `UPLOAD_GIVEUP` event emits FDR `kind="c11.upload.giveup"` AND ERROR log; C12's CLI summary surfaces the count post-upload-run. This task adds NO direct UI; C12's task list will include surfacing.
**Risk 4: Clock.sleep blocking on KeyboardInterrupt**
- *Risk*: A long backoff (60s) blocks the process; Ctrl+C aborts mid-sleep but might leave state inconsistent.
- *Mitigation*: The decorator uses the injected `Clock` which is the same singleton as AZ-307/AZ-308; KeyboardInterrupt propagates upward and AZ-319's try/finally still runs `key_manager.end_session()`; the decorator's own state is just the retry counter (in-memory; no on-disk side effects between retries).
## Runtime Completeness
- **Named capability**: bounded in-call retry on partial-success uploads, per-tile retry budget with `UPLOAD_GIVEUP` terminal state, operator-friendly `next_retry_at_s` hint (description.md § 5, C11-IT-05).
- **Production code that must exist**: real `IdempotentRetryTileUploader` decorator, real `increment_upload_attempts` SQL, real migration 0002, real `VotingStatus.UPLOAD_GIVEUP` enum value, real composition-root wiring with the bypass flag.
- **Allowed external stubs**: tests MAY use a fake `inner` (mock TileUploader implementing the Protocol with scripted responses), fake `Clock`, fake `tile_metadata_store` (already provided by AZ-303 conformance fakes); production wiring uses real implementations all the way down.
- **Unacceptable substitutes**: a recursive Python implementation of the retry loop (stack-explosion risk; bounded iteration is required); skipping the per-tile budget (lets one bad tile poison every retry); silently moving tiles to `UPLOAD_GIVEUP` without FDR (loses safety officer surface); modifying AZ-304's 0001 migration in place (breaks deployment idempotence — migrations are append-only).