Oleksandr Bezdieniezhnykh d571ca25f9 [AZ-308] c6 CacheBudgetEnforcer: 10 GB hard cap + LRU sweep
CacheBudgetEnforcer.reserve_headroom(needed_bytes) returns immediately
when total_disk_bytes() + needed_bytes <= budget, otherwise iterates
lru_candidates in eviction_batch_size batches, deletes via delete_tile,
emits one INFO log per evicted tile (c6.evicted) and one FDR record per
eviction batch (c6.eviction_batch, evicted_tile_ids capped to 5).
Raises CacheBudgetExhaustedError AFTER a full sweep if the budget
cannot be met. BudgetEnforcedTileStore decorates a TileStore so the
policy stays separable from PostgresFilesystemStore. Composition root
in storage_factory.build_tile_store wires the wrapper unconditionally.

PostgresFilesystemStore now accepts lru_clock: Clock | None = None;
when set, read_tile_pixels calls record_lru_access(tile_id, now) so
eviction picks the right LRU candidates. Production wiring injects
WallClock(); AZ-305 unit tests still construct without the clock and
keep their pass-through semantics. Contract tile_store.md bumped to
v1.1.0 to add CacheBudgetExhaustedError to the TileCacheError family;
shared FDR schema bumped to v1.3.0 for the new c6.eviction_batch kind.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 20:37:41 +03:00


Batch 30 / Cycle 1 — Implementation Report

Date: 2026-05-12
Tasks: AZ-308 (C6 Cache Budget Eviction: 10 GB hard cap with LRU sweep)
Story points landed: 3
Status: complete (AZ-308 → In Testing)

Scope summary

Single-task batch landing the production CacheBudgetEnforcer, the policy layer that converts AZ-303's total_disk_bytes / lru_candidates / delete_tile / record_lru_access primitives into RESTRICT-SAT-2's 10 GB hard cap. The enforcer runs synchronously inside write_tile via the new BudgetEnforcedTileStore decorator: every write first calls reserve_headroom(len(tile_blob)). If headroom is sufficient, the call costs a single total_disk_bytes() SELECT and returns immediately. Otherwise the enforcer iterates lru_candidates(max_count=eviction_batch_size) in batches (32 rows by default), deletes the oldest (least recently used) tiles via delete_tile, and stops as soon as the freed bytes cover the shortfall. If the candidate list is exhausted without meeting the budget, CacheBudgetExhaustedError is raised only after the full sweep (per AC-5: partial eviction beats no eviction, so the operator's recovery has maximum headroom).
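
The fast-path and sweep flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the shipped code: FakeStore is a hypothetical in-memory stand-in whose dict order represents LRU order, and the free-function signature and the (tile_id, disk_bytes) candidate shape are invented for the example.

```python
from dataclasses import dataclass, field


class CacheBudgetExhaustedError(Exception):
    """Raised only after the whole LRU sweep failed to free enough bytes."""


@dataclass
class FakeStore:
    """Hypothetical in-memory store; dict insertion order stands in for LRU order."""

    tiles: dict = field(default_factory=dict)  # tile_id -> disk_bytes

    def total_disk_bytes(self) -> int:
        return sum(self.tiles.values())

    def lru_candidates(self, max_count: int) -> list:
        return list(self.tiles.items())[:max_count]

    def delete_tile(self, tile_id: str) -> bool:
        return self.tiles.pop(tile_id, None) is not None


def reserve_headroom(store, budget_bytes: int, needed_bytes: int, batch_size: int = 32) -> list:
    """Evict least recently used tiles until needed_bytes fits; return evicted ids."""
    shortfall = store.total_disk_bytes() + needed_bytes - budget_bytes
    evicted = []
    if shortfall <= 0:
        return evicted  # fast path: a single size query, nothing deleted
    freed = 0
    while freed < shortfall:
        evicted_before = len(evicted)
        for tile_id, disk_bytes in store.lru_candidates(max_count=batch_size):
            if store.delete_tile(tile_id):  # False means the tile was already gone
                freed += disk_bytes
                evicted.append(tile_id)
            if freed >= shortfall:
                break
        if len(evicted) == evicted_before:
            # Nothing left to evict: fail, but keep the space already freed.
            raise CacheBudgetExhaustedError(
                f"needed {needed_bytes} bytes, freed {freed}, evicted {len(evicted)} tiles"
            )
    return evicted
```

Raising only once a batch makes no progress keeps whatever space was already freed, which matches the "partial eviction beats no eviction" intent of AC-5.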

Eviction is observable end-to-end: one INFO log per evicted tile (kind="c6.evicted", payload {tile_id, disk_bytes, accessed_at, evicted_at}), one FDR record per eviction batch (kind="c6.eviction_batch", payload {trigger_tile_id, freed_bytes, evicted_count, evicted_tile_ids[:5]}, capped to 5 ids to keep the record bounded), and one construction-time INFO log (kind="c6.budget.loaded") so the operator sees (budget_bytes, current_disk_bytes, headroom_bytes) at process start (with a WARN if the prior flight ended over-budget).
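
As an illustration of the bounded batch record, the payload could be assembled along these lines. The helper name and the (tile_id, disk_bytes) input shape are assumptions made for this sketch; only the four payload keys and the cap of 5 come from the report.

```python
MAX_REPORTED_TILE_IDS = 5  # cap stated in the report (AC-11)


def build_eviction_batch_payload(trigger_tile_id, evicted):
    """Build a c6.eviction_batch-style payload.

    `evicted` is a list of (tile_id, disk_bytes) pairs in eviction order
    (hypothetical input shape for this sketch).
    """
    return {
        "trigger_tile_id": trigger_tile_id,
        "freed_bytes": sum(disk_bytes for _, disk_bytes in evicted),
        "evicted_count": len(evicted),  # full count, not the truncated one
        "evicted_tile_ids": [tile_id for tile_id, _ in evicted[:MAX_REPORTED_TILE_IDS]],
    }
```

The count and byte total cover the whole batch, while the id list is truncated, so the record stays small even when a batch evicts many tiles.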

The AZ-305 LRU-clock hook is now wired: PostgresFilesystemStore accepts an optional lru_clock: Clock | None = None ctor argument, and when set, every read_tile_pixels call invokes record_lru_access(tile_id, now) after the row/file existence check. The unit-test path (AZ-305's existing fixtures) can still construct the store with lru_clock=None, preserving the AZ-305 contract. Production wiring in storage_factory.build_tile_store always injects WallClock() into the inner store and wraps the result in BudgetEnforcedTileStore.
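
A minimal sketch of the optional clock hook, assuming a clock object that exposes time_ns(). FixedClock and SketchStore are hypothetical stand-ins for this example, and timezone.utc is used in place of the UTC alias so the snippet also runs on Python versions before 3.11.

```python
from datetime import datetime, timezone


class FixedClock:
    """Hypothetical test clock returning a fixed nanosecond timestamp."""

    def __init__(self, ns: int):
        self._ns = ns

    def time_ns(self) -> int:
        return self._ns


class SketchStore:
    """In-memory stand-in showing where the LRU update sits in the read path."""

    def __init__(self, tiles: dict, lru_clock=None):
        self._tiles = tiles
        self._lru_clock = lru_clock
        self.accessed_at = {}  # tile_id -> datetime of last recorded access

    def record_lru_access(self, tile_id: str, now: datetime) -> None:
        self.accessed_at[tile_id] = now

    def read_tile_pixels(self, tile_id: str) -> bytes:
        pixels = self._tiles[tile_id]  # existence check: KeyError if missing
        if self._lru_clock is not None:  # no clock means pass-through behaviour
            now = datetime.fromtimestamp(self._lru_clock.time_ns() / 1e9, tz=timezone.utc)
            self.record_lru_access(tile_id, now)
        return pixels
```

Because the update is guarded by the None check, stores built without a clock behave exactly as they did before the hook existed.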

The decorator pattern is mandatory per the spec § Constraints — making budget enforcement a wrapper keeps the policy layer separable from the store impl (single-responsibility), and a future voting-tier-aware policy can replace the enforcer without changing PostgresFilesystemStore.
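
The separation argued for above can be pictured with a small sketch. The method signatures are assumptions (the report names the four TileStore methods but not their parameters), and RecordingEnforcer and DictStore are hypothetical test doubles.

```python
class BudgetEnforcedTileStore:
    """Wraps any TileStore-like object and reserves headroom before each write."""

    def __init__(self, wrapped, enforcer):
        self._wrapped = wrapped
        self._enforcer = enforcer

    def write_tile(self, tile_id, tile_blob):
        # Policy first: this may evict tiles, or raise before anything is written.
        self._enforcer.reserve_headroom(len(tile_blob))
        return self._wrapped.write_tile(tile_id, tile_blob)

    # The remaining methods are plain pass-throughs.
    def read_tile_pixels(self, tile_id):
        return self._wrapped.read_tile_pixels(tile_id)

    def tile_exists(self, tile_id):
        return self._wrapped.tile_exists(tile_id)

    def delete_tile(self, tile_id):
        return self._wrapped.delete_tile(tile_id)


class RecordingEnforcer:
    """Hypothetical enforcer double that only records what it was asked for."""

    def __init__(self, calls):
        self.calls = calls

    def reserve_headroom(self, needed_bytes):
        self.calls.append(("reserve_headroom", needed_bytes))


class DictStore:
    """Hypothetical inner store backed by a dict."""

    def __init__(self, calls):
        self.calls = calls
        self.tiles = {}

    def write_tile(self, tile_id, tile_blob):
        self.calls.append(("write_tile", tile_id))
        self.tiles[tile_id] = tile_blob

    def read_tile_pixels(self, tile_id):
        return self.tiles[tile_id]

    def tile_exists(self, tile_id):
        return tile_id in self.tiles

    def delete_tile(self, tile_id):
        return self.tiles.pop(tile_id, None) is not None
```

Swapping in a different policy then only means passing a different enforcer object; the inner store never learns that a budget exists.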

Files added / modified

New (production)

  • src/gps_denied_onboard/components/c6_tile_cache/cache_budget_enforcer.py, containing:
    • EvictionResult frozen dataclass and the _iso_ts_now UTC helper.
    • CacheBudgetEnforcer class with one public method, reserve_headroom(needed_bytes) -> EvictionResult, which runs the no-evict fast path and escalates to the LRU sweep when needed, emitting one INFO log per eviction and one FDR record per batch, plus the AC-12 construction-time c6.budget.loaded INFO log (with optional WARN on over-budget startup).
    • BudgetEnforcedTileStore decorator implementing the TileStore Protocol by delegating read_tile_pixels / tile_exists / delete_tile straight through and calling enforcer.reserve_headroom(len(tile_blob)) before delegating write_tile.
    • An operator CLI (python -m gps_denied_onboard.components.c6_tile_cache.cache_budget_enforcer dry-run --pretend-needed-bytes N) that loads config via load_config(os.environ) and prints what WOULD be evicted without performing the eviction (no delete_tile call, no FDR write, no INFO log).
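
One way the dry-run entry point could be shaped is sketched below. Only the dry-run subcommand and the --pretend-needed-bytes flag come from the report; build_parser, plan_evictions, and the (tile_id, disk_bytes) candidate shape are hypothetical names chosen for this example, and the real CLI may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with a single dry-run subcommand (sketch only)."""
    parser = argparse.ArgumentParser(prog="cache_budget_enforcer")
    sub = parser.add_subparsers(dest="command", required=True)
    dry = sub.add_parser("dry-run", help="print what would be evicted, delete nothing")
    dry.add_argument("--pretend-needed-bytes", type=int, required=True)
    return parser


def plan_evictions(candidates, current_bytes, budget_bytes, needed_bytes):
    """Return the tile ids that WOULD be evicted, oldest first, without side effects.

    `candidates` is a list of (tile_id, disk_bytes) pairs in LRU order.
    """
    shortfall = current_bytes + needed_bytes - budget_bytes
    plan, freed = [], 0
    for tile_id, disk_bytes in candidates:
        if freed >= shortfall:
            break
        plan.append(tile_id)
        freed += disk_bytes
    return plan
```

Keeping the planner a pure function makes the "no delete_tile call, no FDR write, no INFO log" guarantee easy to see: the function has no access to anything it could mutate.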

Modified (production)

  • src/gps_denied_onboard/components/c6_tile_cache/errors.py: adds CacheBudgetExhaustedError to the TileCacheError family with diagnostic fields needed_bytes, available_bytes, evicted_count (all keyword-only, all defaulting to None so the parameter set is forward-compatible with future tightening).
  • src/gps_denied_onboard/components/c6_tile_cache/config.py: adds the eviction_batch_size: int = 32 config knob (default per spec § Constraints, validated > 0 in __post_init__); the existing lru_eviction_threshold_bytes already provides the budget.
  • src/gps_denied_onboard/components/c6_tile_cache/postgres_filesystem_store.py: adds optional lru_clock: Clock | None = None ctor arg; when present, read_tile_pixels calls self.record_lru_access(tile_id, now_dt) after row/file existence checks succeed, where now_dt = datetime.fromtimestamp(self._lru_clock.time_ns() / 1e9, tz=UTC). from_config now injects WallClock() so the production path always updates the LRU clock; AZ-305's unit tests that construct the store directly with no clock keep the pass-through behaviour (the LRU UPDATE is guarded by if self._lru_clock is not None).
  • src/gps_denied_onboard/fdr_client/records.py: adds c6.eviction_batch (payload {trigger_tile_id, freed_bytes, evicted_count, evicted_tile_ids}, with evicted_tile_ids capped to 5 ids per AC-11) to KNOWN_PAYLOAD_KEYS. The per-tile c6.evicted event is INFO log only: it is high-frequency under load and would dilute the FDR ring buffer, so only aggregated batch counts go to FDR.
  • src/gps_denied_onboard/runtime_root/storage_factory.py: build_tile_store now constructs a PostgresFilesystemStore and a CacheBudgetEnforcer wired to a producer-local FdrClient (producer_id="c6_tile_cache.budget") and the C6 logger, with budget_bytes = config.tile_cache.lru_eviction_threshold_bytes and eviction_batch_size = config.tile_cache.eviction_batch_size. It then wraps the store in a BudgetEnforcedTileStore and returns the decorator. build_tile_metadata_store is unchanged (the decorator only intercepts TileStore, never the metadata store).
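
The error and config changes listed above might look roughly like this. It is a sketch under assumptions: the TileCacheConfig class name, the message parameter, and the use of ValueError for a non-positive batch size are guesses made for illustration, and the real classes carry more fields.

```python
from dataclasses import dataclass


class TileCacheError(Exception):
    """Stand-in for the existing error-family base class."""


class CacheBudgetExhaustedError(TileCacheError):
    """Diagnostics are keyword-only and optional, so callers can add them later."""

    def __init__(self, message="cache budget exhausted", *, needed_bytes=None,
                 available_bytes=None, evicted_count=None):
        super().__init__(message)
        self.needed_bytes = needed_bytes
        self.available_bytes = available_bytes
        self.evicted_count = evicted_count


@dataclass
class TileCacheConfig:
    """Hypothetical subset of the tile-cache config."""

    lru_eviction_threshold_bytes: int  # the budget (existing field)
    eviction_batch_size: int = 32      # new knob, default from the report

    def __post_init__(self):
        if self.eviction_batch_size <= 0:
            raise ValueError("eviction_batch_size must be > 0")
```

Validating in __post_init__ means a bad batch size fails at config load rather than on the first write that needs an eviction.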

Modified (tests)

  • tests/unit/c6_tile_cache/test_cache_budget_enforcer.py: NEW suite of 18 tests:
    • 4 non-docker unit tests for CacheBudgetEnforcer against an in-memory _FakeStore covering AC-1 (no-eviction fast path), AC-2 (single-tile sweep), AC-3 (multi-tile until shortfall met), AC-4 (batch-size-respecting lru_candidates calls).
    • 3 non-docker tests for the error-handling envelope: AC-5 (sweep exhausted → CacheBudgetExhaustedError AFTER all candidates deleted), AC-7 (decorator does NOT rewrap a ContentHashMismatchError from the inner store), AC-9 (SELECT-count tally for no-evict vs evict paths).
    • 4 non-docker tests for FDR + log payloads: AC-11 (evicted_tile_ids truncated to 5 even when 100 evictions occurred), AC-12 (construction-time c6.budget.loaded INFO log + WARN-on-over-budget), and the NFR-reliability "candidate gone mid-sweep" case where delete_tile returns False.
    • 1 non-docker NFR test (reserve_headroom × 10000 no-evict path with a strict p99 ≤ 5 ms ceiling).
    • 3 @pytest.mark.docker Tier-2 tests against a real Postgres (composition-root smoke): AC-6 (decorator + write_tile end-to-end with near-cap state), AC-8 (real read_tile_pixels bumps the LRU clock and changes lru_candidates ordering), and AC-10 (synthetic-fill test: 50 MB of writes under a deliberately tight 50 MB pre-eviction headroom; verifies eviction kicks in and disk usage never exceeds the cap).
    • 3 protocol-shape sanity tests (EvictionResult is frozen and total_freed_bytes derives correctly, the wrapper exposes the underlying store as _wrapped, and the decorator passes tile_exists / delete_tile straight through).
  • tests/unit/c6_tile_cache/test_protocol_conformance.py: adjusted _install_fake_postgres_store_module to provide a working total_disk_bytes() -> 0 (the prior NotImplementedError stub would break CacheBudgetEnforcer.__init__, which reads the value for AC-12), and rewrote test_ac4_build_tile_store_returns_protocol_impl to recognise the AZ-308 wrapper (isinstance(store, BudgetEnforcedTileStore), isinstance(store, TileStore), isinstance(store._wrapped, fake_cls)). No new fakes; the change is local to one shared helper and one test.
  • tests/unit/test_az272_fdr_record_schema.py: adds a fixture payload for the new c6.eviction_batch kind so the AZ-272 per-kind round-trip test covers it.
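
For the NFR latency check, a p99 ceiling over 10000 calls could be measured along these lines. The helper names are hypothetical and the nearest-rank percentile definition is an assumption; the actual test may compute its percentile differently.

```python
import time


def nearest_rank_percentile(samples, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def measure_p99_ms(fn, iterations: int = 10_000) -> float:
    """Call fn repeatedly and return the p99 latency in milliseconds."""
    durations_ns = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        durations_ns.append(time.perf_counter_ns() - start)
    return nearest_rank_percentile(durations_ns, 99) / 1e6
```

A test would then assert something like measure_p99_ms(lambda: enforcer.reserve_headroom(0)) <= 5.0, where enforcer is whatever fixture the suite provides.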

Modified (docs)

  • _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md — bumped to v1.3.0; added a row for c6.eviction_batch (producer c6_tile_cache.budget, payload shape, cap-to-5 note) in the v1.0.0 closed-enum table and a change-log entry.
  • _docs/02_document/contracts/c6_tile_cache/tile_store.md — bumped to v1.1.0 (additive); CacheBudgetExhaustedError joins the TileCacheError family diagram + change-log entry per the Versioning Rules § "new error variant added to TileCacheError".

Acceptance criteria coverage

| AC | Test | Status |
| --- | --- | --- |
| AC-1 No-eviction fast path | test_ac1_no_eviction_fast_path | passing |
| AC-2 Single-tile eviction frees enough | test_ac2_single_tile_eviction_frees_enough | passing |
| AC-3 Multi-tile eviction iterates LRU candidates | test_ac3_multi_tile_eviction_iterates_until_target | passing |
| AC-4 Eviction batches respect eviction_batch_size | test_ac4_eviction_batches_respect_batch_size | passing |
| AC-5 Insufficient candidates raise CacheBudgetExhaustedError | test_ac5_insufficient_candidates_raise_after_full_sweep | passing |
| AC-6 BudgetEnforcedTileStore decorator integrates with write_tile | test_ac6_decorator_write_tile_triggers_eviction (Docker) | passing |
| AC-7 Decorator propagates TileCacheError unchanged | test_ac7_decorator_propagates_tilecacheerror_unchanged | passing |
| AC-8 read_tile_pixels updates the LRU clock | test_ac8_read_tile_pixels_updates_lru_clock (Docker) | passing |
| AC-9 No-evict path = 1 SELECT; evict path = 1 + N + N | test_ac9_no_evict_path_uses_single_select | passing |
| AC-10 10 GB budget enforcement under synthetic load | test_ac10_synthetic_load_stays_under_cap (Docker) | passing |
| AC-11 FDR evicted_tile_ids capped to 5 | test_ac11_fdr_evicted_tile_ids_capped_at_five | passing |
| AC-12 Construction-time disk-bytes report | test_ac12_construction_emits_budget_loaded_info + test_ac12_construction_warns_when_over_budget | passing |
| NFR-perf no-evict p99 ≤ 5 ms | test_nfr_perf_no_evict_path_p99_under_5ms | passing |
| NFR-reliability candidate-gone mid-sweep | test_nfr_reliability_delete_returns_false_no_op | passing |

AC Test Coverage: 12 of 12 covered (+ 2 NFRs + 1 frozen-dataclass shape test)

Code Review Verdict: PASS

Auto-Fix Attempts: 1 (ruff format + check — 8 cosmetic findings auto-resolved: 4 ambiguous × characters in comments, 3 unused noqa: ARG002 directives, 1 unescaped-metacharacter regex in pytest.raises(match=...))

Stuck Agents: None

Findings (self-review)

| # | Severity | Category | Location | Note | Resolution |
| --- | --- | --- | --- | --- | --- |
| 1 | Low | Maintainability | CacheBudgetEnforcer.__init__ | The ctor runs self._store.total_disk_bytes() synchronously to emit the AC-12 startup INFO log. If the metadata store's pool is contended at process start, this blocks the composition-root path. Accepted because the enforcer is constructed once per process and the cost is one indexed SELECT. | Open (Low), accepted as-is. |
| 2 | Low | Test-quality | test_ac10_synthetic_load_stays_under_cap | Uses a 50 MB synthetic budget (not the 10 GB production cap) to keep the test reasonable on a dev laptop. The cap-enforcement logic is the same shape; the test verifies the loop terminates correctly and disk usage never exceeds the cap. | Open (Low), accepted as-is. |
| 3 | Low | Test-quality | test_ac8_read_tile_pixels_updates_lru_clock | The host (Python) and Postgres container wall clocks can differ by sub-second skew on macOS/Colima, so a real record_lru_access UPDATE using the host wall clock can lose to GREATEST(accessed_at, %s) against the DB's DEFAULT now(). The test pins the LRU clock to a far-future timestamp (2099-01-01) via a fixture-local _FakeClock; production wiring (storage_factory) still injects WallClock(). | Open (Low), accepted as-is. |
| 4 | Low | Adjacent-Hygiene | tests/unit/c6_tile_cache/test_protocol_conformance.py::_FakePostgresFilesystemStore | The AZ-303 protocol-conformance fake inherits total_disk_bytes from _FullTileMetadataStore, which raises NotImplementedError. Once build_tile_store started constructing a CacheBudgetEnforcer (which calls total_disk_bytes at construction), this stub broke the test. Overrode total_disk_bytes on the AZ-308 path to return 0: a minimal change, and no other test using the shared helper changed semantically. | FIXED in this batch. |
| 5 | Low | Maintainability | BudgetEnforcedTileStore._wrapped | The wrapper exposes the inner store via a private _wrapped attribute so tests and future debugging can introspect it. This is documented in the AC-4 protocol-conformance test comment; it is not part of the public Protocol contract (the Protocol only requires the four TileStore methods, which the wrapper provides). | Open (Low), accepted as documented. |
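
The clock-skew effect in Finding 3 can be illustrated with a Python analogue of GREATEST(accessed_at, %s). This is only a model of the comparison, with a hypothetical helper name and invented timestamps; it does not reproduce the actual UPDATE statement.

```python
from datetime import datetime, timedelta, timezone


def merged_accessed_at(stored: datetime, candidate: datetime) -> datetime:
    """Analogue of GREATEST(accessed_at, %s): the timestamp never moves backwards."""
    return max(stored, candidate)


# Row inserted with the database clock (illustrative value).
db_now = datetime(2026, 5, 12, 17, 37, 41, 500_000, tzinfo=timezone.utc)

# A host clock lagging by 200 ms: the read is recorded, but the value does not change.
host_now = db_now - timedelta(milliseconds=200)
skewed_result = merged_accessed_at(db_now, host_now)

# A far-future test clock always wins, which makes the ordering change observable.
pinned = datetime(2099, 1, 1, tzinfo=timezone.utc)
pinned_result = merged_accessed_at(db_now, pinned)
```

With the skewed host clock the stored value is unchanged, so lru_candidates ordering would not move; pinning the clock far in the future removes that ambiguity from the test.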

Tracker

  • AZ-308 transitioned to In Progress on session start; will be moved to In Testing post-commit per protocols.md.

Test suite

  • tests/unit/c6_tile_cache/test_cache_budget_enforcer.py (18 tests) — passing standalone (Tier-2 + Docker Postgres) and as part of the combined c6 suite (193 / 194 passed in the combined run; see below).
  • tests/unit/c6_tile_cache/ (194 tests) — 193 passing; the same test_ac13_read_tile_pixels_warm_latency_p95 flake noted in the AZ-307 batch 29 report (Finding 3 of the AZ-305 batch 28 report) surfaces under combined load. Verified non-regression by git stash -u round-trip: with my AZ-308 changes stashed, the same test still fails (p95 ≈ 8 ms vs the 5 ms ceiling) in the combined run, and passes 3-of-3 standalone. Not a blocker for AZ-308.
  • tests/unit/test_az272_fdr_record_schema.py — passing with the new c6.eviction_batch kind fixtured.
  • Full unit suite (excluding tests/integration/ and the unrelated c7 test_ac8_read_host_tuple_on_jetson that requires pynvml, pre-existing) — 1267 passed, 8 environment-skipped (CUDA-only, cmake, actionlint), 1 deselected (pynvml).

Next batch

Cycle 1 advances per the greenfield queue — autodev re-detects the next AZ ticket in the Step 7 batch loop and continues.