CacheBudgetEnforcer.reserve_headroom(needed_bytes) returns immediately when total_disk_bytes() + needed_bytes <= budget, otherwise iterates lru_candidates in eviction_batch_size batches, deletes via delete_tile, emits one INFO log per evicted tile (c6.evicted) and one FDR record per eviction batch (c6.eviction_batch, evicted_tile_ids capped to 5). Raises CacheBudgetExhaustedError AFTER a full sweep if the budget cannot be met. BudgetEnforcedTileStore decorates a TileStore so the policy stays separable from PostgresFilesystemStore. Composition root in storage_factory.build_tile_store wires the wrapper unconditionally. PostgresFilesystemStore now accepts lru_clock: Clock | None = None; when set, read_tile_pixels calls record_lru_access(tile_id, now) so eviction picks the right LRU candidates. Production wiring injects WallClock(); AZ-305 unit tests still construct without the clock and keep their pass-through semantics. Contract tile_store.md bumped to v1.1.0 to add CacheBudgetExhaustedError to the TileCacheError family; shared FDR schema bumped to v1.3.0 for the new c6.eviction_batch kind. Co-authored-by: Cursor <cursoragent@cursor.com>
Batch 30 / Cycle 1 — Implementation Report
Date: 2026-05-12 Tasks: AZ-308 (C6 Cache Budget Eviction — 10 GB hard cap with LRU sweep) Story points landed: 3 Status: complete (AZ-308 → In Testing)
Scope summary
Single-task batch landing the production CacheBudgetEnforcer — the
policy layer that converts AZ-303's total_disk_bytes / lru_candidates
/ delete_tile / record_lru_access primitives into RESTRICT-SAT-2's
10 GB hard cap. The enforcer runs synchronously inside
write_tile via the new BudgetEnforcedTileStore decorator: every
write first asks reserve_headroom(len(tile_blob)); if head-room is
sufficient the call is a single total_disk_bytes() SELECT and
returns immediately, otherwise the enforcer iterates
lru_candidates(max_count=eviction_batch_size) in 32-row batches,
deletes the oldest tiles via delete_tile, and stops as soon as the
freed bytes meet the shortfall. If the candidate list is exhausted
without meeting the budget, CacheBudgetExhaustedError is raised
after the full sweep (per AC-5 — partial eviction beats no
eviction so the operator's recovery has maximum head-room).
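The fast-path and sweep flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped `CacheBudgetEnforcer`: `FakeStore`, the `(tile_id, disk_bytes)` candidate shape, the `EvictionResult` field names and the free-function form of `reserve_headroom` are all hypothetical stand-ins for the AZ-303 primitives.

```python
from dataclasses import dataclass


class CacheBudgetExhaustedError(Exception):
    """Stand-in for the real error; raised only after the sweep is exhausted."""


@dataclass(frozen=True)
class EvictionResult:
    # Field names are assumptions made for this sketch.
    evicted_tile_ids: tuple = ()
    total_freed_bytes: int = 0


class FakeStore:
    """In-memory stand-in; tiles are (tile_id, disk_bytes), oldest first."""

    def __init__(self, tiles):
        self.tiles = list(tiles)

    def total_disk_bytes(self):
        return sum(size for _, size in self.tiles)

    def lru_candidates(self, max_count):
        return self.tiles[:max_count]

    def delete_tile(self, tile_id):
        remaining = [t for t in self.tiles if t[0] != tile_id]
        deleted = len(remaining) < len(self.tiles)
        self.tiles = remaining
        return deleted


def reserve_headroom(store, budget_bytes, needed_bytes, batch_size=32):
    shortfall = store.total_disk_bytes() + needed_bytes - budget_bytes
    if shortfall <= 0:
        return EvictionResult()  # fast path: one size query, no eviction

    evicted, freed = [], 0
    while freed < shortfall:
        batch = store.lru_candidates(max_count=batch_size)
        if not batch:
            break  # candidate list exhausted
        for tile_id, disk_bytes in batch:
            if store.delete_tile(tile_id):  # False: tile was already gone
                evicted.append(tile_id)
                freed += disk_bytes
            if freed >= shortfall:
                break  # stop as soon as the shortfall is covered

    if freed < shortfall:
        # Raised only after the sweep, so the partial eviction is kept.
        raise CacheBudgetExhaustedError(
            f"freed {freed} of {shortfall} bytes after evicting {len(evicted)} tiles"
        )
    return EvictionResult(tuple(evicted), freed)
```

With three 40-byte tiles, a 100-byte budget and a 30-byte write, the sketch evicts the two oldest tiles and stops; a second identical call then takes the fast path.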
Eviction is observable end-to-end: one INFO log per evicted tile
(kind="c6.evicted", payload {tile_id, disk_bytes, accessed_at,
evicted_at}), one FDR record per eviction batch
(kind="c6.eviction_batch", payload {trigger_tile_id, freed_bytes,
evicted_count, evicted_tile_ids[:5]}, capped to 5 ids to keep the
record bounded), and one construction-time INFO log
(kind="c6.budget.loaded") so the operator sees (budget_bytes,
current_disk_bytes, headroom_bytes) at process start (with a WARN if
the prior flight ended over-budget).
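As an illustration of the two aggregate payloads, the sketch below builds a `c6.eviction_batch` payload with the 5-id cap and the fields of the `c6.budget.loaded` startup report. The key names come from the text above; the function names, the `(tile_id, disk_bytes)` input shape and the returned `over_budget` flag are assumptions made for the example, not the real FDR client API.

```python
def eviction_batch_payload(trigger_tile_id, evicted, id_cap=5):
    """Build the batch payload; `evicted` is a list of (tile_id, disk_bytes)."""
    return {
        "trigger_tile_id": trigger_tile_id,
        "freed_bytes": sum(size for _, size in evicted),
        "evicted_count": len(evicted),  # the full count is always reported
        "evicted_tile_ids": [tile_id for tile_id, _ in evicted[:id_cap]],
    }


def budget_loaded_report(budget_bytes, current_disk_bytes):
    """Return the startup fields plus a flag saying whether a WARN is due."""
    fields = {
        "budget_bytes": budget_bytes,
        "current_disk_bytes": current_disk_bytes,
        "headroom_bytes": budget_bytes - current_disk_bytes,
    }
    over_budget = current_disk_bytes > budget_bytes
    return fields, over_budget
```

Capping only the id list keeps the record size bounded while `evicted_count` and `freed_bytes` still describe the whole batch.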
The AZ-305 LRU-clock hook is now wired: PostgresFilesystemStore
accepts an optional lru_clock: Clock | None = None ctor argument, and
when set, every read_tile_pixels call invokes
record_lru_access(tile_id, now) after the row/file existence check. The unit-test path
(AZ-305's existing fixtures) can still construct the store with
lru_clock=None, preserving the AZ-305 contract. Production wiring
in storage_factory.build_tile_store always injects WallClock()
into the inner store and wraps the result in BudgetEnforcedTileStore.
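A minimal sketch of the optional-clock hook follows, assuming a `Clock` protocol that exposes `time_ns()`. `StoreSketch`, `FixedClock` and the in-memory dictionaries are hypothetical stand-ins for `PostgresFilesystemStore` and its database rows; only the guard and the nanosecond-to-UTC conversion mirror the report.

```python
import time
from datetime import datetime, timezone
from typing import Optional, Protocol

UTC = timezone.utc


class Clock(Protocol):
    def time_ns(self) -> int: ...


class WallClock:
    def time_ns(self) -> int:
        return time.time_ns()


class FixedClock:
    """Test clock that always returns the same instant."""

    def __init__(self, ns: int) -> None:
        self._ns = ns

    def time_ns(self) -> int:
        return self._ns


class StoreSketch:
    def __init__(self, lru_clock: Optional[Clock] = None) -> None:
        self._lru_clock = lru_clock
        self._pixels = {}
        self.accessed_at = {}

    def put(self, tile_id, pixels):
        self._pixels[tile_id] = pixels

    def record_lru_access(self, tile_id, now_dt):
        self.accessed_at[tile_id] = now_dt

    def read_tile_pixels(self, tile_id):
        pixels = self._pixels[tile_id]  # KeyError stands in for the existence check
        if self._lru_clock is not None:  # no clock: pure pass-through
            now_dt = datetime.fromtimestamp(self._lru_clock.time_ns() / 1e9, tz=UTC)
            self.record_lru_access(tile_id, now_dt)
        return pixels
```

Constructing `StoreSketch()` without a clock leaves `accessed_at` untouched on reads, which is the pass-through behaviour the AZ-305 fixtures rely on.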
The decorator pattern is mandatory per the spec § Constraints — making
budget enforcement a wrapper keeps the policy layer separable from the
store impl (single-responsibility), and a future voting-tier-aware
policy can replace the enforcer without changing
PostgresFilesystemStore.
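The decorator shape can be sketched as below. The `write_tile(tile_id, tile_blob)` signature and the recording fakes are assumptions for illustration; the report only states that `reserve_headroom(len(tile_blob))` runs before the delegated write and that the other three methods pass straight through.

```python
class BudgetEnforcedTileStoreSketch:
    """Wraps any TileStore-like object; policy lives in the enforcer."""

    def __init__(self, wrapped, enforcer):
        self._wrapped = wrapped
        self._enforcer = enforcer

    def write_tile(self, tile_id, tile_blob):
        # Headroom is reserved first; if this raises, nothing is written.
        self._enforcer.reserve_headroom(len(tile_blob))
        return self._wrapped.write_tile(tile_id, tile_blob)

    def read_tile_pixels(self, tile_id):
        return self._wrapped.read_tile_pixels(tile_id)

    def tile_exists(self, tile_id):
        return self._wrapped.tile_exists(tile_id)

    def delete_tile(self, tile_id):
        return self._wrapped.delete_tile(tile_id)


class RecordingEnforcer:
    """Hypothetical fake that logs each reservation request."""

    def __init__(self, calls):
        self._calls = calls

    def reserve_headroom(self, needed_bytes):
        self._calls.append(("reserve", needed_bytes))


class RecordingStore:
    """Hypothetical fake inner store backed by a dict."""

    def __init__(self, calls):
        self._calls = calls
        self._blobs = {}

    def write_tile(self, tile_id, tile_blob):
        self._calls.append(("write", tile_id))
        self._blobs[tile_id] = tile_blob

    def read_tile_pixels(self, tile_id):
        return self._blobs[tile_id]

    def tile_exists(self, tile_id):
        return tile_id in self._blobs

    def delete_tile(self, tile_id):
        return self._blobs.pop(tile_id, None) is not None
```

Because the wrapper only depends on the four `TileStore` methods and one enforcer method, a different policy object could be swapped in without touching the inner store.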
Files added / modified
New (production)
- `src/gps_denied_onboard/components/c6_tile_cache/cache_budget_enforcer.py`:
  - `EvictionResult` frozen dataclass and the `_iso_ts_now` UTC helper.
  - `CacheBudgetEnforcer` class with one public method, `reserve_headroom(needed_bytes) -> EvictionResult`, which runs the no-evict fast path and escalates to the LRU sweep, emitting one INFO log per eviction and one FDR record per batch, plus the AC-12 construction-time `c6.budget.loaded` INFO log (with an optional WARN on over-budget startup).
  - `BudgetEnforcedTileStore` decorator implementing the `TileStore` Protocol: it delegates `read_tile_pixels` / `tile_exists` / `delete_tile` straight through and calls `enforcer.reserve_headroom(len(tile_blob))` before delegating `write_tile`.
  - An operator CLI (`python -m gps_denied_onboard.components.c6_tile_cache.cache_budget_enforcer dry-run --pretend-needed-bytes N`) that loads config via `load_config(os.environ)` and prints what WOULD be evicted without performing the eviction (no `delete_tile` call, no FDR write, no INFO log).
Modified (production)
- `src/gps_denied_onboard/components/c6_tile_cache/errors.py`: adds `CacheBudgetExhaustedError` to the `TileCacheError` family with diagnostic fields `needed_bytes`, `available_bytes`, `evicted_count` (all keyword-only, all default to `None` so the parameter set is forward-compatible with future tightening).
- `src/gps_denied_onboard/components/c6_tile_cache/config.py`: adds the `eviction_batch_size: int = 32` config knob (default per spec § Constraints, validated `> 0` in `__post_init__`); the existing `lru_eviction_threshold_bytes` already provides the budget.
- `src/gps_denied_onboard/components/c6_tile_cache/postgres_filesystem_store.py`: adds the optional `lru_clock: Clock | None = None` ctor arg; when present, `read_tile_pixels` calls `self.record_lru_access(tile_id, now_dt)` after the row/file existence checks succeed, where `now_dt = datetime.fromtimestamp(self._lru_clock.time_ns() / 1e9, tz=UTC)`. `from_config` now injects `WallClock()` so the production path always updates the LRU clock; AZ-305's unit tests that construct the store directly with no clock keep the pass-through behaviour (the LRU UPDATE is guarded by `if self._lru_clock is not None`).
- `src/gps_denied_onboard/fdr_client/records.py`: adds `c6.eviction_batch` (payload `{trigger_tile_id, freed_bytes, evicted_count, evicted_tile_ids}`, capped to 5 ids per AC-11) to `KNOWN_PAYLOAD_KEYS`. The per-tile `c6.evicted` event is INFO log only (it is high-frequency under load and would dilute the FDR ring-buffer; aggregated batch counts go to FDR).
- `src/gps_denied_onboard/runtime_root/storage_factory.py`: `build_tile_store` now constructs a `PostgresFilesystemStore` and a `CacheBudgetEnforcer` wired to a producer-local `FdrClient` (`producer_id="c6_tile_cache.budget"`) and the C6 logger, with `budget_bytes = config.tile_cache.lru_eviction_threshold_bytes` and `eviction_batch_size = config.tile_cache.eviction_batch_size`, then wraps the store in a `BudgetEnforcedTileStore` and returns the decorator. `build_tile_metadata_store` is unchanged (the decorator only intercepts `TileStore`, never the metadata store).
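The config validation and the keyword-only error fields could look something like the sketch below. `TileCacheConfigSketch` is a hypothetical reduced config with only the two fields named in this report, and the positional `message` parameter on the error is an assumption; the real classes may carry more fields or a different constructor.

```python
from dataclasses import dataclass


class TileCacheError(Exception):
    """Stand-in for the base of the tile-cache error family."""


class CacheBudgetExhaustedError(TileCacheError):
    def __init__(self, message="", *, needed_bytes=None,
                 available_bytes=None, evicted_count=None):
        super().__init__(message)
        # Diagnostic fields are keyword-only and optional.
        self.needed_bytes = needed_bytes
        self.available_bytes = available_bytes
        self.evicted_count = evicted_count


@dataclass(frozen=True)
class TileCacheConfigSketch:
    lru_eviction_threshold_bytes: int  # the budget; no default assumed here
    eviction_batch_size: int = 32

    def __post_init__(self):
        # Reading attributes is fine on a frozen dataclass; only assignment is blocked.
        if self.eviction_batch_size <= 0:
            raise ValueError("eviction_batch_size must be > 0")
```

Defaulting every diagnostic field to `None` means callers that only know part of the picture can still raise the error without a signature change later.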
Modified (tests)
- `tests/unit/c6_tile_cache/test_cache_budget_enforcer.py`: NEW suite of 18 tests:
  - 4 non-docker unit tests for `CacheBudgetEnforcer` against an in-memory `_FakeStore` covering AC-1 (no-eviction fast path), AC-2 (single-tile sweep), AC-3 (multi-tile until shortfall met), AC-4 (batch-size-respecting `lru_candidates` calls).
  - 3 non-docker tests for the error-handling envelope: AC-5 (sweep exhausted → `CacheBudgetExhaustedError` AFTER all candidates deleted), AC-7 (decorator does NOT rewrap a `ContentHashMismatchError` from the inner store), AC-9 (SELECT-count tally for no-evict vs evict paths).
  - 4 non-docker tests for FDR + log payloads: AC-11 (`evicted_tile_ids` truncated to 5 even when 100 evictions occurred), AC-12 (construction-time `c6.budget.loaded` INFO log + WARN-on-over-budget), and the NFR-reliability "candidate gone mid-sweep" case where `delete_tile` returns False.
  - 1 non-docker NFR test (`reserve_headroom × 10000` no-evict path with a strict p99 ≤ 5 ms ceiling).
  - 3 `@pytest.mark.docker` Tier-2 tests against a real Postgres (composition-root smoke): AC-6 (decorator + `write_tile` end-to-end with near-cap state), AC-8 (real `read_tile_pixels` bumps the LRU clock and changes `lru_candidates` ordering), and AC-10 (synthetic-fill test: 50 MB of writes under a deliberately tight 50 MB pre-eviction headroom; verifies eviction kicks in and disk usage never exceeds the cap).
  - 3 protocol-shape sanity tests (`EvictionResult` is frozen and `total_freed_bytes` derives correctly, the wrapper exposes the underlying store as `_wrapped`, and the decorator passes `tile_exists` / `delete_tile` straight through).
- `tests/unit/c6_tile_cache/test_protocol_conformance.py`: adjusted `_install_fake_postgres_store_module` to provide a working `total_disk_bytes() -> 0` (the prior `NotImplementedError` stub would break `CacheBudgetEnforcer.__init__`, which reads the value for AC-12), and rewrote `test_ac4_build_tile_store_returns_protocol_impl` to recognise the AZ-308 wrapper (`isinstance(store, BudgetEnforcedTileStore)`, `isinstance(store, TileStore)`, `isinstance(store._wrapped, fake_cls)`). No new fakes; the change is local to one shared helper + one test.
- `tests/unit/test_az272_fdr_record_schema.py`: adds a fixture payload for the new `c6.eviction_batch` kind so the AZ-272 per-kind round-trip test covers it.
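For readers unfamiliar with how a p99 ceiling such as the NFR test's 5 ms bound can be measured, here is one possible shape. It is a sketch only: the nearest-rank percentile definition, the helper names and the use of `time.perf_counter_ns` are assumptions, and the real test may compute its percentile differently.

```python
import time


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def measure_p99_ms(fn, iterations=10_000):
    """Call fn repeatedly and return the p99 latency in milliseconds."""
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples_ms.append((time.perf_counter_ns() - start) / 1e6)
    return nearest_rank_percentile(samples_ms, 99)
```

A test would then assert something like `measure_p99_ms(lambda: enforcer.reserve_headroom(1)) <= 5.0` on the no-evict path; the same helper with `pct=95` matches the p95 style of ceiling used elsewhere in the suite.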
Modified (docs)
- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`: bumped to v1.3.0; added a row for `c6.eviction_batch` (producer `c6_tile_cache.budget`, payload shape, cap-to-5 note) in the v1.0.0 closed-enum table and a change-log entry.
- `_docs/02_document/contracts/c6_tile_cache/tile_store.md`: bumped to v1.1.0 (additive); `CacheBudgetExhaustedError` joins the `TileCacheError` family diagram + change-log entry per the Versioning Rules § "new error variant added to `TileCacheError`".
Acceptance criteria coverage
| AC | Test | Status |
|---|---|---|
| AC-1 No-eviction fast path | `test_ac1_no_eviction_fast_path` | passing |
| AC-2 Single-tile eviction frees enough | `test_ac2_single_tile_eviction_frees_enough` | passing |
| AC-3 Multi-tile eviction iterates LRU candidates | `test_ac3_multi_tile_eviction_iterates_until_target` | passing |
| AC-4 Eviction batches respect `eviction_batch_size` | `test_ac4_eviction_batches_respect_batch_size` | passing |
| AC-5 Insufficient candidates raise `CacheBudgetExhaustedError` | `test_ac5_insufficient_candidates_raise_after_full_sweep` | passing |
| AC-6 `BudgetEnforcedTileStore` decorator integrates with `write_tile` | `test_ac6_decorator_write_tile_triggers_eviction` (Docker) | passing |
| AC-7 Decorator propagates `TileCacheError` unchanged | `test_ac7_decorator_propagates_tilecacheerror_unchanged` | passing |
| AC-8 `read_tile_pixels` updates the LRU clock | `test_ac8_read_tile_pixels_updates_lru_clock` (Docker) | passing |
| AC-9 No-evict path = 1 SELECT; evict path = 1 + N + N | `test_ac9_no_evict_path_uses_single_select` | passing |
| AC-10 10 GB budget enforcement under synthetic load | `test_ac10_synthetic_load_stays_under_cap` (Docker) | passing |
| AC-11 FDR `evicted_tile_ids` capped to 5 | `test_ac11_fdr_evicted_tile_ids_capped_at_five` | passing |
| AC-12 Construction-time disk-bytes report | `test_ac12_construction_emits_budget_loaded_info` + `test_ac12_construction_warns_when_over_budget` | passing |
| NFR-perf no-evict p99 ≤ 5 ms | `test_nfr_perf_no_evict_path_p99_under_5ms` | passing |
| NFR-reliability candidate-gone mid-sweep | `test_nfr_reliability_delete_returns_false_no_op` | passing |
AC Test Coverage: 12 of 12 covered (+ 2 NFRs + 1 frozen-dataclass shape test)
Code Review Verdict: PASS
Auto-Fix Attempts: 1 (ruff format + check — 8 cosmetic findings auto-resolved: 4 ambiguous × characters in comments, 3 unused noqa: ARG002 directives, 1 unescaped-metacharacter regex in pytest.raises(match=...))
Stuck Agents: None
Findings (self-review)
| # | Severity | Category | Location | Note | Resolution |
|---|---|---|---|---|---|
| 1 | Low | Maintainability | `CacheBudgetEnforcer.__init__` | The ctor runs `self._store.total_disk_bytes()` synchronously to emit the AC-12 startup INFO log. If the metadata store's pool is contended at process start, this blocks the composition-root path. Accepted because the enforcer is constructed once per process and the cost is one indexed SELECT. | Open (Low), accepted as-is. |
| 2 | Low | Test-quality | `test_ac10_synthetic_load_stays_under_cap` | Uses a 50 MB synthetic budget (not the 10 GB production cap) to keep the test reasonable on a dev laptop. The cap-enforcement logic is the same shape; the test verifies the loop terminates correctly and disk usage never exceeds the cap. | Open (Low), accepted as-is. |
| 3 | Low | Test-quality | `test_ac8_read_tile_pixels_updates_lru_clock` | Wall-clock parity between the host (Python) and Postgres container is sub-second-skew on macOS/Colima, so a real `record_lru_access` UPDATE with the host wall clock can lose to `GREATEST(accessed_at, %s)` against the DB's `DEFAULT now()`. Test pins the LRU clock to a far-future timestamp (2099-01-01) via a fixture-local `_FakeClock`; production wiring (`storage_factory`) still injects `WallClock()`. | Open (Low), accepted as-is. |
| 4 | Low | Adjacent-Hygiene | `tests/unit/c6_tile_cache/test_protocol_conformance.py::_FakePostgresFilesystemStore` | The AZ-303 protocol-conformance fake inherits `total_disk_bytes` from `_FullTileMetadataStore`, which raises `NotImplementedError`. Once `build_tile_store` started constructing a `CacheBudgetEnforcer` (which calls `total_disk_bytes` at construction), this stub broke the test. Overrode `total_disk_bytes` on the AZ-308 path to return 0: minimal change, no other test using the shared helper changed semantically. | FIXED in this batch. |
| 5 | Low | Maintainability | `BudgetEnforcedTileStore._wrapped` | The wrapper exposes the inner store via a private `_wrapped` attribute so tests + future debugging can introspect it. This is documented in the AC-4 protocol-conformance test comment; not part of the public Protocol contract (the Protocol only requires the four `TileStore` methods, which the wrapper provides). | Open (Low), accepted as documented. |
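Finding 3 is easier to see with concrete timestamps. The sketch below models the `GREATEST(accessed_at, %s)` update as Python's `max`, which is an approximation of the SQL semantics for non-NULL values; the function name and the specific timestamps are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def apply_lru_update(accessed_at, candidate):
    """Model of SET accessed_at = GREATEST(accessed_at, candidate)."""
    return max(accessed_at, candidate)


# Row inserted with the database's own now().
db_now = datetime(2026, 5, 12, 12, 0, 0, 500_000, tzinfo=timezone.utc)

# Host clock lags the container by 200 ms, so the update changes nothing.
host_now = db_now - timedelta(milliseconds=200)
after_wall_clock = apply_lru_update(db_now, host_now)

# A far-future pinned clock always wins, which keeps the test deterministic.
pinned = datetime(2099, 1, 1, tzinfo=timezone.utc)
after_pinned = apply_lru_update(db_now, pinned)
```

Here `after_wall_clock` stays equal to `db_now`, while `after_pinned` becomes the 2099 timestamp, so only the pinned clock reliably reorders the LRU candidates in the test.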
Tracker
- AZ-308 transitioned to In Progress on session start; will be moved to In Testing post-commit per `protocols.md`.
Test suite
- `tests/unit/c6_tile_cache/test_cache_budget_enforcer.py` (18 tests): passing standalone (Tier-2 + Docker Postgres) and as part of the combined c6 suite (193 / 194 passed in the combined run; see below).
- `tests/unit/c6_tile_cache/` (194 tests): 193 passing; the same `test_ac13_read_tile_pixels_warm_latency_p95` flake noted in the AZ-307 batch 29 report (Finding 3 of the AZ-305 batch 28 report) surfaces under combined load. Verified non-regression by `git stash -u` round-trip: with my AZ-308 changes stashed, the same test still fails (p95 ≈ 8 ms vs the 5 ms ceiling) in the combined run, and passes 3-of-3 standalone. Not a blocker for AZ-308.
- `tests/unit/test_az272_fdr_record_schema.py`: passing with the new `c6.eviction_batch` kind fixtured.
- Full unit suite (excluding `tests/integration/` and the unrelated c7 `test_ac8_read_host_tuple_on_jetson` that requires `pynvml`, pre-existing): 1267 passed, 8 environment-skipped (CUDA-only, cmake, actionlint), 1 deselected (pynvml).
Next batch
Cycle 1 advances per the greenfield queue — autodev re-detects the next AZ ticket in the Step 7 batch loop and continues.