gps-denied-onboard/_docs/02_tasks/done/AZ-320_c11_idempotent_retry.md
Oleksandr Bezdieniezhnykh a06b107fc3 [AZ-320] Add C11 IdempotentRetryTileUploader decorator
Wraps HttpTileUploader (AZ-319) with two bounded retry budgets:

- In-call (per-batch) — re-invokes inner on PARTIAL outcome up to
  `max_in_call_retries` times with capped exponential backoff
  (`min(base ** attempt_number, cap)`). On exhaustion: surfaces an
  operator hint via `next_retry_at_s = now + backoff_cap_s`.
- Per-tile (cross-call) — atomically increments c6's
  `tiles.upload_attempts` counter for every rejection; once a tile
  hits `max_per_tile_attempts` it is forward-only transitioned to
  `voting_status = upload_giveup` (excluded from `pending_uploads`).
  Each transition emits FDR `kind="c11.upload.giveup"` plus an
  ERROR log.

C6 contract changes (AZ-303 v1.3.0):
- VotingStatus.UPLOAD_GIVEUP added (forward-only from PENDING/TRUSTED).
- TileMetadataStore.increment_upload_attempts(tile_id) -> int added
  with NotImplementedError default for backwards-compat.
- Migration 0003_c11_upload_attempts: additive column +
  widened ck_tiles_voting_status (preserves IS NULL clause).

C11 wiring:
- C11RetryConfig + disable_retry_decorator on C11Config.
- build_tile_uploader wraps in decorator by default; bypass flag
  returns the bare HttpTileUploader. New `clock` keyword.

Cross-component isolation honoured (AZ-507): the decorator declares
`_RetryMetadataStoreLike` Protocol cut over c6's TileMetadataStore
and references `UPLOAD_GIVEUP` via a local string constant — no c6
imports.

Tests: 13 decorator + 1 conformance + 2 factory bypass + AC-6 enum
update + alembic head bump + AZ-272 schema fixture. 238 passed across
c11/c6/fdr suites; pre-existing perf microbenches unrelated.

Code review: PASS_WITH_WARNINGS (5 Low/Informational findings,
docs-level or downstream-CI-blocked). See
_docs/03_implementation/reviews/batch_41_review.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-13 08:48:53 +03:00


C11 Idempotent Retry — In-Call Retry Loop on Partial-Success Batches

Task: AZ-320_c11_idempotent_retry
Name: C11 Idempotent Retry Decorator
Description: Implement IdempotentRetryTileUploader, a decorator that wraps the AZ-319 TileUploader Protocol impl and adds bounded in-call retry on partial-success batches. After the underlying uploader returns outcome=partial, the decorator re-queries C6's pending_uploads (already-acked tiles have been marked via mark_uploaded, so the second pass naturally targets only the unacked subset), waits an exponential-backoff delay, and re-invokes the underlying upload. Retries are capped at config.c11.max_in_call_retries (default 3); on budget exhaustion, the final report's outcome stays partial and next_retry_at_s carries an operator hint for when to retry later. A per-tile rejection counter in C6 metadata (upload_attempts) bounds the per-tile retry budget: after config.c11.max_per_tile_attempts (default 5), the tile is moved to voting_status = upload_giveup (a new enum value added by this task) and surfaced via FDR for human review.
Complexity: 3 points
Dependencies: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf, AZ-303_c6_storage_interfaces, AZ-319_c11_tile_uploader
Component: c11_tilemanager (epic AZ-251 / E-C11)
Tracker: AZ-320
Epic: AZ-251 (E-C11)

Document Dependencies

  • _docs/02_document/contracts/c11_tilemanager/tile_uploader.md — the underlying Protocol this decorator wraps; the decorator itself implements the same Protocol (drop-in replacement).
  • _docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md — consumed: pending_uploads, mark_uploaded, update_voting_status. This task adds an upload_attempts integer field and an upload_giveup value to VotingStatus — a contract change that bumps tile_metadata_store.md to v1.1.0 (non-breaking minor).
  • _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md: the kind="c11.upload.giveup" envelope.
  • _docs/02_document/components/12_c11_tilemanager/tests.md — C11-IT-05 test scenario.

Problem

Without bounded in-call retry:

  • C11-IT-05 ("idempotent uploads on retry — re-running upload_pending after a partial-success batch only POSTs the tiles that weren't acknowledged before") relies on the operator manually re-invoking upload_pending. Operators tolerate one re-invocation but resent doing it 3-4 times after transient satellite-provider flakiness.
  • A single tile that ALWAYS fails (e.g., truncated tile_blob in C6 fails ingest validation forever) becomes a poison pill — every retry attempt re-uploads it AND every other unacked tile, wasting bandwidth and signing cycles. Without a per-tile budget, the operator cannot distinguish transient failures from terminal ones.
  • The next_retry_at_s field of UploadBatchReport (per AZ-319 contract) has no producer — without backoff calculation, the field is always None and the operator gets no hint on retry timing.
  • The parent suite's voting layer assumes uploaded tiles are eventually-consistent; an unbounded retry loop with no per-tile state would create lockstep retry storms.

This task delivers the retry decorator. It changes NO underlying logic in AZ-319; it composes.

Outcome

  • An IdempotentRetryTileUploader class at src/gps_denied_onboard/components/c11_tilemanager/idempotent_retry.py:
    • Implements the TileUploader Protocol (drop-in for HttpTileUploader).
    • Constructor: __init__(self, *, inner: TileUploader, tile_metadata_store: TileMetadataStore, fdr_client: FdrClient, logger: Logger, clock: Clock, config: C11RetryConfig).
    • C11RetryConfig is a frozen dataclass with max_in_call_retries: int = 3, max_per_tile_attempts: int = 5, backoff_base_s: float = 2.0, backoff_cap_s: float = 60.0.
  • upload_pending_tiles(request) flow:
    1. Calls inner.upload_pending_tiles(request) once.
    2. If the inner returns outcome in (success, failure) → return as-is.
    3. If outcome == partial:
      • For each PerTileStatus.status == rejected, increments the tile's upload_attempts in C6 via a new tile_metadata_store.increment_upload_attempts(tile_id) method.
      • For each tile whose upload_attempts >= config.max_per_tile_attempts, calls tile_metadata_store.update_voting_status(tile_id, VotingStatus.UPLOAD_GIVEUP); emits FDR kind="c11.upload.giveup" with {tile_id, attempts, last_rejection_reason}; emits ERROR log.
      • If retries_used < config.max_in_call_retries AND there are still tiles with voting_status == pending:
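        • Sleeps min(config.backoff_base_s ** attempt_number, config.backoff_cap_s) seconds via injected Clock.sleep, where attempt_number = retries_used + 1 (2.0 s before the first retry with the defaults, matching AC-2 and AC-4).
        • Repeats from step 1 with retries_used += 1 (a bounded loop in an internal helper, NOT actual recursion).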
      • Else (budget exhausted):
        • Aggregates the final UploadBatchReport: outcome = partial; retry_count = retries_used; next_retry_at_s = clock.now() + config.backoff_cap_s (operator hint).
        • Returns the aggregated report.
  • enumerate_pending_tiles(flight_id) and confirm_flight_state() pass through to the inner unchanged.
  • A new VotingStatus.UPLOAD_GIVEUP enum value is added to AZ-303's VotingStatus (in C6's _types.py); this is a non-breaking minor bump of tile_metadata_store.md to v1.1.0 — the producer (AZ-303) stays in v1, but C6's contract file's Change Log is appended by this task with a note pointing to the bump.
  • A new tile_metadata_store.increment_upload_attempts(tile_id) -> int method is added to AZ-303's TileMetadataStore Protocol; returns the new attempt count post-increment. This is a Protocol surface addition (minor bump). The implementation lives in AZ-305's PostgresFilesystemStore. This task adds:
    • The Protocol method declaration in c6_tile_cache/interface.py.
    • The impl in c6_tile_cache/postgres_filesystem_store.py (a single SQL UPDATE ... SET upload_attempts = upload_attempts + 1 WHERE tile_id = $1 RETURNING upload_attempts).
    • The Postgres column upload_attempts INTEGER NOT NULL DEFAULT 0 via a NEW alembic migration _alembic/0002_upload_attempts.sql (NOT modifying AZ-304's 0001 migration; per coderule.mdc migrations are append-only).
  • The composition root wraps HttpTileUploader with IdempotentRetryTileUploader by default. A config.c11.disable_retry_decorator: bool = false lets operators bypass the decorator for debugging.
  • INFO log on session start with retry config; INFO log per retry attempt with attempt_number, sleep_s, remaining_pending_count; ERROR log on per-tile giveup; FDR kind="c11.upload.giveup" per tile.
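The upload_pending_tiles flow above can be sketched as a bounded loop. This is a minimal sketch under stated assumptions, not the shipped implementation: Report and FakeClock are hypothetical stand-ins for UploadBatchReport and the injected Clock, inner is reduced to a zero-argument callable, and the per-tile bookkeeping and the check for remaining pending tiles are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class C11RetryConfig:
    max_in_call_retries: int = 3
    max_per_tile_attempts: int = 5
    backoff_base_s: float = 2.0
    backoff_cap_s: float = 60.0


@dataclass
class Report:  # hypothetical stand-in for UploadBatchReport
    outcome: str  # "success" | "partial" | "failure"
    retry_count: int = 0
    next_retry_at_s: Optional[float] = None


class FakeClock:  # hypothetical Clock: records sleeps instead of blocking
    def __init__(self) -> None:
        self.t = 0.0
        self.sleeps: List[float] = []

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.sleeps.append(seconds)
        self.t += seconds


def backoff_delay(attempt_number: int, cfg: C11RetryConfig) -> float:
    """Capped exponential backoff; attempt_number is 1 for the first retry."""
    return min(cfg.backoff_base_s ** attempt_number, cfg.backoff_cap_s)


def upload_with_retry(inner: Callable[[], Report], clock: FakeClock,
                      cfg: C11RetryConfig) -> Report:
    retries_used = 0
    while True:  # bounded: returns once the in-call budget is spent
        report = inner()  # exceptions propagate; they are never retried
        if report.outcome != "partial":
            report.retry_count = retries_used
            return report
        if retries_used >= cfg.max_in_call_retries:
            # Budget exhausted: stay partial and leave an operator hint.
            return Report("partial", retries_used,
                          clock.now() + cfg.backoff_cap_s)
        retries_used += 1
        clock.sleep(backoff_delay(retries_used, cfg))
```

With the default config and an inner that always returns partial, this loop makes four inner calls and sleeps 2.0, 4.0 and 8.0 seconds, which is the schedule AC-4 expects.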

Scope

Included

  • IdempotentRetryTileUploader decorator class.
  • C11RetryConfig frozen dataclass.
  • VotingStatus.UPLOAD_GIVEUP enum value addition (in C6's _types.py).
  • tile_metadata_store.increment_upload_attempts(tile_id) -> int Protocol method addition + AZ-305 SQL impl.
  • _alembic/0002_upload_attempts.sql migration — adds upload_attempts INTEGER NOT NULL DEFAULT 0 column to the tiles table.
  • Composition-root wiring (decorate HttpTileUploader by default; config.c11.disable_retry_decorator lets operators opt out).
  • Bumping tile_metadata_store.md to v1.1.0 with a Change Log entry.
  • INFO/ERROR logs and FDR c11.upload.giveup emission.
  • Conformance test: isinstance(IdempotentRetryTileUploader(...), TileUploader).
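The conformance check in the last bullet might look like the sketch below. The three method names come from this task; their signatures, the trimmed-down constructor, and BareUploader are assumptions made for illustration. Note that runtime_checkable isinstance checks only verify that the members exist, not that their signatures match.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TileUploader(Protocol):  # signatures are assumed, not taken from AZ-319
    def upload_pending_tiles(self, request: Any) -> Any: ...
    def enumerate_pending_tiles(self, flight_id: str) -> Any: ...
    def confirm_flight_state(self) -> Any: ...


class IdempotentRetryTileUploader:
    """Structural implementer: no inheritance from TileUploader is needed."""

    def __init__(self, *, inner: TileUploader) -> None:
        self._inner = inner

    def upload_pending_tiles(self, request: Any) -> Any:
        return self._inner.upload_pending_tiles(request)  # retry loop elided

    def enumerate_pending_tiles(self, flight_id: str) -> Any:
        return self._inner.enumerate_pending_tiles(flight_id)  # pass-through

    def confirm_flight_state(self) -> Any:
        return self._inner.confirm_flight_state()  # pass-through


class BareUploader:  # hypothetical stand-in for HttpTileUploader
    def upload_pending_tiles(self, request: Any) -> Any:
        return "report"

    def enumerate_pending_tiles(self, flight_id: str) -> Any:
        return []

    def confirm_flight_state(self) -> Any:
        return "on_ground"


def test_decorator_conforms() -> None:
    impl = IdempotentRetryTileUploader(inner=BareUploader())
    assert isinstance(impl, TileUploader)
```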

Excluded

  • The underlying HttpTileUploader impl — owned by AZ-319.
  • The decision rule for what counts as a transient vs. terminal rejection — the decorator treats EVERY rejection as transient until the per-tile attempt budget is hit; the operator may manually move UPLOAD_GIVEUP tiles back to pending after investigation (out-of-band SQL UPDATE; no API surface).
  • A separate background-retry daemon — the retry happens within upload_pending_tiles; the operator decides when to invoke it.
  • Cross-process retry coordination — the C12 lockfile already prevents concurrent C11 invocations.
  • Surfacing UPLOAD_GIVEUP in the operator-tooling CLI — owned by E-C12.
  • Auto-promotion of UPLOAD_GIVEUP back to pending after manual fixes — operator concern; out of scope.

Acceptance Criteria

AC-1: Success on first attempt, no retry
Given the inner uploader returns outcome = success on the first call
When upload_pending_tiles(request) is called
Then the decorator returns immediately; ZERO calls to Clock.sleep; ZERO calls to increment_upload_attempts; report passes through unchanged

AC-2: Partial-success with retry budget available
Given inner returns outcome=partial with 3 of 10 tiles rejected (per_tile_status), and max_in_call_retries=3
When the decorator processes the partial
Then increment_upload_attempts is called 3 times (one per rejected tile); Clock.sleep(2.0) is called once; inner is re-invoked; if the second attempt is success, the final aggregated report shows outcome = success and retry_count = 1

AC-3: Per-tile budget exhausted moves tile to UPLOAD_GIVEUP
Given a tile whose upload_attempts reaches max_per_tile_attempts=5
When the decorator increments the counter
Then update_voting_status(tile_id, UPLOAD_GIVEUP) is called; ONE FDR kind="c11.upload.giveup" is emitted with {tile_id, attempts=5, last_rejection_reason}; ONE ERROR log; the tile is NOT re-uploaded in subsequent retries (since pending_uploads excludes UPLOAD_GIVEUP)
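The per-tile bookkeeping behind AC-3 can be sketched as follows. FakeStore, record_rejections, and the dict shape of the FDR record are hypothetical; only the method names, the kind string, and the three payload fields come from this task.

```python
import logging
from typing import Dict, List

log = logging.getLogger("c11.retry")

UPLOAD_GIVEUP = "upload_giveup"  # local string constant mirroring the C6 value


class FakeStore:
    """Hypothetical in-memory stand-in for TileMetadataStore."""

    def __init__(self) -> None:
        self.attempts: Dict[str, int] = {}
        self.status: Dict[str, str] = {}

    def increment_upload_attempts(self, tile_id: str) -> int:
        self.attempts[tile_id] = self.attempts.get(tile_id, 0) + 1
        return self.attempts[tile_id]  # count after the increment

    def update_voting_status(self, tile_id: str, status: str) -> None:
        self.status[tile_id] = status


def record_rejections(rejected: Dict[str, str], store: FakeStore,
                      fdr_events: List[dict],
                      max_per_tile_attempts: int = 5) -> List[str]:
    """Charge one attempt per rejected tile; return the tiles that gave up."""
    gave_up: List[str] = []
    for tile_id, reason in rejected.items():
        attempts = store.increment_upload_attempts(tile_id)
        if attempts >= max_per_tile_attempts:
            store.update_voting_status(tile_id, UPLOAD_GIVEUP)
            fdr_events.append({
                "kind": "c11.upload.giveup",
                "tile_id": tile_id,
                "attempts": attempts,
                "last_rejection_reason": reason,
            })
            log.error("tile %s gave up after %d attempts", tile_id, attempts)
            gave_up.append(tile_id)
    return gave_up
```

Because pending_uploads excludes UPLOAD_GIVEUP tiles, a tile that has given up is not expected to reach this helper again, so each tile produces at most one FDR record in this sketch.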

AC-4: In-call retry budget exhausted
Given inner consistently returns outcome=partial with the same rejected tile, and max_in_call_retries=3
When the decorator runs out of in-call retries
Then 3 retries are attempted (4 total inner calls including the first); Clock.sleep is called 3 times with backoffs 2.0, 4.0, 8.0; the final report has outcome=partial, retry_count=3, next_retry_at_s = clock.now() + backoff_cap_s

AC-5: Backoff cap honoured
Given max_in_call_retries=10 and backoff_cap_s=10
When the decorator computes the 6th retry delay
Then Clock.sleep(10.0) is called (capped at 10s, not 2^6 = 64s)

AC-6: VotingStatus.UPLOAD_GIVEUP enum exposed
Given the AZ-303 VotingStatus enum (post-this-task)
When a consumer imports it
Then VotingStatus.UPLOAD_GIVEUP is present alongside PENDING, TRUSTED, REJECTED; the contract file's Change Log shows v1.1.0

AC-7: increment_upload_attempts returns new count
Given a tile with upload_attempts = 2
When increment_upload_attempts(tile_id) is called
Then the SQL row's upload_attempts is now 3; the method returns 3; concurrent invocations on different tiles produce no contention (per-row lock)

AC-8: Migration 0002 adds the column
Given a fresh DB at AZ-304's 0001 migration
When 0002 is applied
Then the tiles table has an upload_attempts INTEGER NOT NULL DEFAULT 0 column; existing rows have upload_attempts = 0; the migration is reversible (drops the column on rollback)

AC-9: Decorator is a drop-in for the Protocol
Given an IdempotentRetryTileUploader instance
When isinstance(impl, TileUploader) is checked under runtime_checkable
Then the result is True; consumers that depend on the Protocol see no shape difference

AC-10: disable_retry_decorator config bypass
Given config.c11.disable_retry_decorator = true
When the composition root constructs the uploader
Then build_tile_uploader(config) returns the bare HttpTileUploader (no decorator); a debug INFO log records the bypass
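The bypass in AC-10 could be wired roughly as below. This is a hedged sketch: build_tile_uploader and disable_retry_decorator are named by this task, but the placeholder classes, the reduced constructor arguments, and the log message are assumptions.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("c11")


@dataclass(frozen=True)
class C11Config:  # reduced to the one flag this sketch needs
    disable_retry_decorator: bool = False


class HttpTileUploader:  # placeholder for the AZ-319 implementation
    pass


class IdempotentRetryTileUploader:  # placeholder; real constructor takes more
    def __init__(self, *, inner: HttpTileUploader) -> None:
        self.inner = inner


def build_tile_uploader(config: C11Config):
    bare = HttpTileUploader()
    if config.disable_retry_decorator:
        # Debugging escape hatch: hand back the undecorated uploader.
        log.info("c11 retry decorator bypassed (disable_retry_decorator=true)")
        return bare
    return IdempotentRetryTileUploader(inner=bare)
```

Keeping the decision in the factory means callers only ever see a TileUploader and never branch on the flag themselves.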

AC-11: Pass-through methods
Given the decorator
When enumerate_pending_tiles(flight_id) and confirm_flight_state() are called
Then both delegate to inner directly with no added logic

AC-12: Inner exception propagates without retry
Given inner raises FlightStateNotOnGroundError or SatelliteProviderError
When the decorator catches the exception
Then it re-raises immediately; no retry is attempted (these are not partial-success cases); ZERO Clock.sleep calls

AC-13: Idempotent across re-invocations (the C11-IT-05 scenario)
Given a 50-tile batch where 30 succeed and 20 are rejected on first call (no in-call retry to avoid mixing); operator re-invokes after 5 minutes
When the second call runs
Then pending_uploads returns only the 20 tiles (the 30 are already voting_status = uploaded); inner.upload_pending_tiles is called with the request; only those 20 are POSTed; the 30 are NOT re-sent
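The AC-13 scenario relies on the store, not the uploader, remembering which tiles were acknowledged. A minimal sketch of that property, assuming a hypothetical in-memory FakeStore and a FakeInner that POSTs whatever pending_uploads returns (the request argument is omitted for brevity):

```python
from typing import Callable, Dict, List


class FakeStore:
    """Hypothetical in-memory stand-in for C6's TileMetadataStore."""

    def __init__(self, tile_ids: List[str]) -> None:
        self.status: Dict[str, str] = {t: "pending" for t in tile_ids}

    def pending_uploads(self) -> List[str]:
        return [t for t, s in self.status.items() if s == "pending"]

    def mark_uploaded(self, tile_id: str) -> None:
        self.status[tile_id] = "uploaded"


class FakeInner:
    """Hypothetical inner uploader: POSTs the pending set, acks some of it."""

    def __init__(self, store: FakeStore, accepts: Callable[[str], bool]) -> None:
        self.store = store
        self.accepts = accepts
        self.posted: List[List[str]] = []  # one batch per call

    def upload_pending_tiles(self) -> str:
        batch = self.store.pending_uploads()
        self.posted.append(batch)
        for tile_id in batch:
            if self.accepts(tile_id):
                self.store.mark_uploaded(tile_id)
        return "partial" if self.store.pending_uploads() else "success"


tiles = [f"tile-{i:02d}" for i in range(50)]
store = FakeStore(tiles)
inner = FakeInner(store, accepts=lambda t: int(t[-2:]) < 30)

first = inner.upload_pending_tiles()   # 30 acked, 20 rejected
inner.accepts = lambda t: True         # provider recovers before re-invocation
second = inner.upload_pending_tiles()  # only the 20 unacked tiles are POSTed
```

The second batch contains exactly the 20 tiles that were never acknowledged, because the 30 acknowledged ones no longer appear in pending_uploads.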

Non-Functional Requirements

Performance

  • Decorator overhead per upload_pending_tiles call ≤ 5 ms (plus Clock.sleep time, which is intentional).
  • increment_upload_attempts SQL call ≤ 5 ms p99 against the local Postgres.

Compatibility

  • The new migration is append-only (NOT a modification of AZ-304's 0001 migration).
  • The new VotingStatus.UPLOAD_GIVEUP value is additive (non-breaking).
  • increment_upload_attempts is a Protocol method addition; existing AZ-303 conformance tests still pass. The method has a default impl that raises NotImplementedError, so an implementation that omits it fails loudly on first call; the AZ-305 impl provides the SQL version.

Reliability

  • The retry loop is bounded by BOTH max_in_call_retries AND max_per_tile_attempts; with both in force, no failure pattern can produce unbounded retries.
  • The decorator does NOT swallow exceptions from inner; only outcome=partial results are eligible for retry.
  • The injected Clock.sleep makes retry timing deterministic in tests.

Unit Tests

| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Inner success | Pass-through; zero retries |
| AC-2 | Partial → retry → success | One Clock.sleep(2.0); retry_count=1; final outcome=success |
| AC-3 | Per-tile attempts hits 5 | update_voting_status called; FDR + ERROR log emitted |
| AC-4 | Persistent partial across 4 attempts | Clock.sleep(2.0), (4.0), (8.0); retry_count=3; final partial |
| AC-5 | Cap=10 with high attempt number | Clock.sleep(10.0) not Clock.sleep(64.0) |
| AC-6 | Import VotingStatus.UPLOAD_GIVEUP | Present; tile_metadata_store.md v1.1.0 |
| AC-7 | Concurrent increment_upload_attempts on different tiles | No deadlock; correct counts |
| AC-8 | Apply 0002 migration | Column added; default 0; rollback drops |
| AC-9 | isinstance check | True |
| AC-10 | Config bypass | Bare impl; debug log |
| AC-11 | Pass-through methods | Delegated unchanged |
| AC-12 | Inner raises | Re-raised; zero retries |
| AC-13 | Two-call scenario across operator re-invocations | First call: 30 acked / 20 rejected; second call: only 20 POSTed |
| NFR-perf-overhead | Microbench decorator with success-on-first | ≤ 5 ms overhead |

Constraints

  • The decorator MUST be a drop-in for TileUploader; the composition root selects via config.c11.disable_retry_decorator only.
  • The retry budget is per-call (in-call) and per-tile (across calls); neither budget alone fully bounds retry behaviour, so both are required.
  • increment_upload_attempts is the ONLY method that mutates upload_attempts; consumers do NOT directly UPDATE the column. This is a contract invariant; code review treats direct UPDATEs as an Architecture finding (High).
  • The UPLOAD_GIVEUP voting status is a HUMAN-decision boundary — automated promotion back to pending is forbidden in this task. An out-of-band SQL UPDATE by the operator is the documented recovery path.
  • The migration 0002 is APPEND-ONLY relative to 0001; it does NOT alter existing column types.
  • This task introduces no new third-party dependencies.

Risks & Mitigation

Risk 1: AZ-303 contract bump cascades to other consumers

  • Risk: Adding increment_upload_attempts to the Protocol forces every existing C6 consumer (C2 VPR, C2.5 ReRanker, C3 CrossDomainMatcher, C10 CacheProvisioner, C12 OperatorTooling) to re-confirm conformance.
  • Mitigation: The new method is OPTIONAL via a Protocol default impl that raises NotImplementedError; consumers that don't call it are unaffected. The conformance test verifies only that AZ-305's impl provides it.
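The mitigation above can be illustrated with a Protocol that carries a default method body. This is a sketch under assumptions: LegacyStore and CountingStore are hypothetical, and the Protocol is reduced to two methods. One caveat worth stating: in Python the default body is only inherited by classes that explicitly subclass the Protocol; a purely structural implementer does not pick it up.

```python
from typing import Dict, List, Protocol


class TileMetadataStore(Protocol):  # reduced to two methods for illustration
    def pending_uploads(self) -> List[str]: ...

    def increment_upload_attempts(self, tile_id: str) -> int:
        # Default for implementations that predate the new method.
        raise NotImplementedError(
            "increment_upload_attempts is not provided by this store")


class LegacyStore(TileMetadataStore):
    """Hypothetical older implementation that never calls the new method."""

    def pending_uploads(self) -> List[str]:
        return []


class CountingStore(TileMetadataStore):
    """Hypothetical implementation that provides the new method."""

    def __init__(self) -> None:
        self._attempts: Dict[str, int] = {}

    def pending_uploads(self) -> List[str]:
        return []

    def increment_upload_attempts(self, tile_id: str) -> int:
        self._attempts[tile_id] = self._attempts.get(tile_id, 0) + 1
        return self._attempts[tile_id]
```

LegacyStore keeps working for consumers that only read pending_uploads, and fails loudly only if something calls the method it never implemented.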

Risk 2: Backoff cap interacts badly with operator workflows

  • Risk: A 60-second cap means the operator may walk away during retries; the visible CLI hangs.
  • Mitigation: The decorator emits an INFO log per retry attempt with attempt_number, sleep_s, remaining_pending_count; C12's CLI surfaces this so the operator sees progress. Cap is configurable.

Risk 3: UPLOAD_GIVEUP tiles accumulating without operator visibility

  • Risk: A subtle data corruption in C6 causes 100% of tiles to hit UPLOAD_GIVEUP; the operator notices only when they manually inspect C6.
  • Mitigation: Each UPLOAD_GIVEUP event emits FDR kind="c11.upload.giveup" AND ERROR log; C12's CLI summary surfaces the count post-upload-run. This task adds NO direct UI; C12's task list will include surfacing.

Risk 4: Clock.sleep blocking on KeyboardInterrupt

  • Risk: A long backoff (60s) blocks the process; Ctrl+C aborts mid-sleep but might leave state inconsistent.
  • Mitigation: The decorator uses the injected Clock which is the same singleton as AZ-307/AZ-308; KeyboardInterrupt propagates upward and AZ-319's try/finally still runs key_manager.end_session(); the decorator's own state is just the retry counter (in-memory; no on-disk side effects between retries).

Runtime Completeness

  • Named capability: bounded in-call retry on partial-success uploads, per-tile retry budget with UPLOAD_GIVEUP terminal state, operator-friendly next_retry_at_s hint (description.md § 5, C11-IT-05).
  • Production code that must exist: real IdempotentRetryTileUploader decorator, real increment_upload_attempts SQL, real migration 0002, real VotingStatus.UPLOAD_GIVEUP enum value, real composition-root wiring with the bypass flag.
  • Allowed external stubs: tests MAY use a fake inner (mock TileUploader implementing the Protocol with scripted responses), fake Clock, fake tile_metadata_store (already provided by AZ-303 conformance fakes); production wiring uses real implementations all the way down.
  • Unacceptable substitutes: a recursive Python implementation of the retry loop (stack-explosion risk; bounded iteration is required); skipping the per-tile budget (lets one bad tile poison every retry); silently moving tiles to UPLOAD_GIVEUP without FDR (the safety officer loses visibility of giveups); modifying AZ-304's 0001 migration in place (breaks deployment idempotence; migrations are append-only).