mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-21 23:01:13 +00:00
[AZ-308] c6 CacheBudgetEnforcer: 10 GB hard cap + LRU sweep

CacheBudgetEnforcer.reserve_headroom(needed_bytes) returns immediately when total_disk_bytes() + needed_bytes <= budget, otherwise iterates lru_candidates in eviction_batch_size batches, deletes via delete_tile, emits one INFO log per evicted tile (c6.evicted) and one FDR record per eviction batch (c6.eviction_batch, evicted_tile_ids capped to 5). Raises CacheBudgetExhaustedError AFTER a full sweep if the budget cannot be met.

BudgetEnforcedTileStore decorates a TileStore so the policy stays separable from PostgresFilesystemStore. Composition root in storage_factory.build_tile_store wires the wrapper unconditionally.

PostgresFilesystemStore now accepts lru_clock: Clock | None = None; when set, read_tile_pixels calls record_lru_access(tile_id, now) so eviction picks the right LRU candidates. Production wiring injects WallClock(); AZ-305 unit tests still construct without the clock and keep their pass-through semantics.

Contract tile_store.md bumped to v1.1.0 to add CacheBudgetExhaustedError to the TileCacheError family; shared FDR schema bumped to v1.3.0 for the new c6.eviction_batch kind.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,224 @@
# Batch 30 / Cycle 1: Implementation Report

**Date**: 2026-05-12
**Tasks**: AZ-308 (C6 Cache Budget Eviction: 10 GB hard cap with LRU sweep)
**Story points landed**: 3
**Status**: complete (AZ-308 → In Testing)

## Scope summary

Single-task batch landing the production `CacheBudgetEnforcer`, the
policy layer that converts AZ-303's `total_disk_bytes` / `lru_candidates`
/ `delete_tile` / `record_lru_access` primitives into RESTRICT-SAT-2's
**10 GB hard cap**. The enforcer runs **synchronously inside
`write_tile`** via the new `BudgetEnforcedTileStore` decorator: every
write first asks `reserve_headroom(len(tile_blob))`; if head-room is
sufficient the call is a single `total_disk_bytes()` SELECT and
returns immediately, otherwise the enforcer iterates
`lru_candidates(max_count=eviction_batch_size)` in 32-row batches,
deletes the oldest tiles via `delete_tile`, and stops as soon as the
freed bytes meet the shortfall. If the candidate list is exhausted
without meeting the budget, `CacheBudgetExhaustedError` is raised
**after** the full sweep (per AC-5: partial eviction beats no
eviction so the operator's recovery has maximum head-room).
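
The fast-path and sweep flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in `cache_budget_enforcer.py`: `FakeStore`, `Candidate`, and the free-function form of `reserve_headroom` are hypothetical stand-ins, and the real method returns an `EvictionResult` rather than a plain list.

```python
from dataclasses import dataclass


class CacheBudgetExhaustedError(Exception):
    """Stand-in for the real error class (assumption for this sketch)."""


@dataclass(frozen=True)
class Candidate:
    tile_id: str
    disk_bytes: int


class FakeStore:
    """Hypothetical in-memory store; tiles are kept oldest-first."""

    def __init__(self, tiles):
        self.tiles = list(tiles)

    def total_disk_bytes(self):
        return sum(t.disk_bytes for t in self.tiles)

    def lru_candidates(self, max_count):
        return self.tiles[:max_count]

    def delete_tile(self, tile_id):
        before = len(self.tiles)
        self.tiles = [t for t in self.tiles if t.tile_id != tile_id]
        return len(self.tiles) < before


def reserve_headroom(store, budget_bytes, needed_bytes, batch_size=32):
    """Return the tile ids evicted so needed_bytes fits under budget_bytes."""
    shortfall = store.total_disk_bytes() + needed_bytes - budget_bytes
    if shortfall <= 0:
        return []  # fast path: one total_disk_bytes() call, nothing evicted
    evicted, freed = [], 0
    while freed < shortfall:
        batch = store.lru_candidates(max_count=batch_size)
        deleted_in_batch = 0
        for cand in batch:
            if store.delete_tile(cand.tile_id):
                evicted.append(cand.tile_id)
                freed += cand.disk_bytes
                deleted_in_batch += 1
            if freed >= shortfall:
                break  # stop as soon as the shortfall is covered
        if freed < shortfall and deleted_in_batch == 0:
            # Raised only after everything evictable has been deleted.
            raise CacheBudgetExhaustedError(
                f"still {shortfall - freed} bytes short after {len(evicted)} evictions"
            )
    return evicted
```

With three tiles of 40, 30, and 50 bytes under a 100-byte budget, a 10-byte write has a 30-byte shortfall, so only the oldest tile is evicted.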

Eviction is observable end-to-end: one INFO log per evicted tile
(`kind="c6.evicted"`, payload `{tile_id, disk_bytes, accessed_at,
evicted_at}`), one FDR record per eviction batch (`kind=
"c6.eviction_batch"`, payload `{trigger_tile_id, freed_bytes,
evicted_count, evicted_tile_ids[:5]}`, capped to 5 ids to keep the
record bounded), and one construction-time INFO log
(`kind="c6.budget.loaded"`) so the operator sees `(budget_bytes,
current_disk_bytes, headroom_bytes)` at process start (with a WARN if
the prior flight ended over-budget).
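
As an illustration of the bounded batch record, the payload could be assembled along these lines. The helper name `build_eviction_batch_payload` and the `(tile_id, disk_bytes)` input shape are assumptions made for this sketch; only the four payload keys and the cap of 5 come from the report.

```python
MAX_FDR_TILE_IDS = 5  # cap stated in the report (AC-11)


def build_eviction_batch_payload(trigger_tile_id, evicted):
    """Build a c6.eviction_batch payload.

    evicted is a list of (tile_id, disk_bytes) pairs in eviction order
    (a hypothetical input shape chosen for this sketch).
    """
    return {
        "trigger_tile_id": trigger_tile_id,
        "freed_bytes": sum(size for _, size in evicted),
        # The count stays exact even though the id list is truncated.
        "evicted_count": len(evicted),
        "evicted_tile_ids": [tile_id for tile_id, _ in evicted[:MAX_FDR_TILE_IDS]],
    }
```

Keeping `evicted_count` exact while truncating only the id list means the record size stays bounded without losing the aggregate figures.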

The AZ-305 LRU-clock hook is now wired: `PostgresFilesystemStore`
accepts an optional `lru_clock: Clock | None = None` ctor argument, and
when set, every `read_tile_pixels` call invokes `record_lru_access(
tile_id, now)` after the row/file existence check. The unit-test path
(AZ-305's existing fixtures) can still construct the store with
`lru_clock=None`, preserving the AZ-305 contract. Production wiring
in `storage_factory.build_tile_store` always injects `WallClock()`
into the inner store and wraps the result in `BudgetEnforcedTileStore`.
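
The optional-clock guard can be sketched as below. `SketchStore` and `FixedClock` are hypothetical stand-ins (the real store talks to Postgres and the filesystem); the `time_ns()` to UTC `datetime` conversion follows the expression quoted later in this report, written here with `timezone.utc` so it also runs on Python versions without `datetime.UTC`.

```python
from datetime import datetime, timezone


class FixedClock:
    """Hypothetical Clock stand-in exposing time_ns()."""

    def __init__(self, ns):
        self._ns = ns

    def time_ns(self):
        return self._ns


class SketchStore:
    """In-memory stand-in for the real store (assumption for this sketch)."""

    def __init__(self, pixels, lru_clock=None):
        self._pixels = dict(pixels)
        self._lru_clock = lru_clock
        self.accessed_at = {}

    def record_lru_access(self, tile_id, now):
        self.accessed_at[tile_id] = now

    def read_tile_pixels(self, tile_id):
        data = self._pixels[tile_id]  # existence check: KeyError if missing
        if self._lru_clock is not None:
            # Only clock-equipped stores touch the LRU timestamp.
            now_dt = datetime.fromtimestamp(
                self._lru_clock.time_ns() / 1e9, tz=timezone.utc
            )
            self.record_lru_access(tile_id, now_dt)
        return data
```

A store built without a clock behaves as a pure pass-through read, which is the behaviour the AZ-305 fixtures rely on.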

The decorator pattern is mandatory per the spec § Constraints: making
budget enforcement a wrapper keeps the policy layer separable from the
store impl (single-responsibility), and a future voting-tier-aware
policy can replace the enforcer without changing
`PostgresFilesystemStore`.
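
A minimal sketch of the decorator shape follows. The method signatures on `TileStore`, the `DictStore` inner store, and `RecordingEnforcer` are assumptions for illustration; the report only states that the Protocol has four methods and that `write_tile` calls `reserve_headroom(len(tile_blob))` before delegating.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TileStore(Protocol):
    """Hypothetical shape of the four-method Protocol; real signatures may differ."""

    def read_tile_pixels(self, tile_id: str) -> bytes: ...
    def tile_exists(self, tile_id: str) -> bool: ...
    def delete_tile(self, tile_id: str) -> bool: ...
    def write_tile(self, tile_id: str, tile_blob: bytes) -> None: ...


class DictStore:
    """In-memory stand-in for the inner store."""

    def __init__(self):
        self.blobs = {}

    def read_tile_pixels(self, tile_id):
        return self.blobs[tile_id]

    def tile_exists(self, tile_id):
        return tile_id in self.blobs

    def delete_tile(self, tile_id):
        return self.blobs.pop(tile_id, None) is not None

    def write_tile(self, tile_id, tile_blob):
        self.blobs[tile_id] = tile_blob


class RecordingEnforcer:
    """Stand-in policy object that only records what it was asked for."""

    def __init__(self):
        self.calls = []

    def reserve_headroom(self, needed_bytes):
        self.calls.append(needed_bytes)


class BudgetEnforcedTileStore:
    def __init__(self, wrapped, enforcer):
        self._wrapped = wrapped
        self._enforcer = enforcer

    # Reads, existence checks, and deletes pass straight through.
    def read_tile_pixels(self, tile_id):
        return self._wrapped.read_tile_pixels(tile_id)

    def tile_exists(self, tile_id):
        return self._wrapped.tile_exists(tile_id)

    def delete_tile(self, tile_id):
        return self._wrapped.delete_tile(tile_id)

    def write_tile(self, tile_id, tile_blob):
        # Policy runs first; an exception here aborts the write.
        self._enforcer.reserve_headroom(len(tile_blob))
        self._wrapped.write_tile(tile_id, tile_blob)
```

Because the wrapper only depends on an object with `reserve_headroom`, a different policy can be swapped in without touching the inner store.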

## Files added / modified

### New (production)

- `src/gps_denied_onboard/components/c6_tile_cache/cache_budget_enforcer.py`:
  `EvictionResult` frozen dataclass; `_iso_ts_now` UTC helper;
  `CacheBudgetEnforcer` class with one public method
  `reserve_headroom(needed_bytes) -> EvictionResult` doing the
  no-evict fast-path → LRU-sweep escalation flow, emitting one INFO
  log per eviction and one FDR record per batch, plus the AC-12
  construction-time `c6.budget.loaded` INFO log (with optional WARN
  on over-budget startup); `BudgetEnforcedTileStore` decorator
  implementing the `TileStore` Protocol by delegating
  `read_tile_pixels` / `tile_exists` / `delete_tile` straight through
  and calling `enforcer.reserve_headroom(len(tile_blob))` before
  delegating `write_tile`; and an operator CLI
  (`python -m gps_denied_onboard.components.c6_tile_cache.cache_budget_enforcer dry-run --pretend-needed-bytes N`)
  that loads config via `load_config(os.environ)` and prints what
  WOULD be evicted without performing the eviction (no `delete_tile`
  call, no FDR write, no INFO log).

### Modified (production)

- `src/gps_denied_onboard/components/c6_tile_cache/errors.py`: adds
  `CacheBudgetExhaustedError` to the `TileCacheError` family with
  diagnostic fields `needed_bytes`, `available_bytes`,
  `evicted_count` (all keyword-only, all default to `None` so the
  parameter set is forward-compatible with future tightening).
- `src/gps_denied_onboard/components/c6_tile_cache/config.py`: adds
  the `eviction_batch_size: int = 32` config knob (default per spec
  § Constraints, validated `> 0` in `__post_init__`); the existing
  `lru_eviction_threshold_bytes` already provides the budget.
- `src/gps_denied_onboard/components/c6_tile_cache/postgres_filesystem_store.py`:
  adds optional `lru_clock: Clock | None = None` ctor arg; when
  present, `read_tile_pixels` calls
  `self.record_lru_access(tile_id, now_dt)` after row/file existence
  checks succeed, where `now_dt = datetime.fromtimestamp(
  self._lru_clock.time_ns() / 1e9, tz=UTC)`. `from_config` now
  injects `WallClock()` so the production path always updates the
  LRU clock; AZ-305's unit tests that construct the store directly
  with no clock keep the pass-through behaviour (the LRU UPDATE is
  guarded by `if self._lru_clock is not None`).
- `src/gps_denied_onboard/fdr_client/records.py`: adds
  `c6.eviction_batch` (payload `{trigger_tile_id, freed_bytes,
  evicted_count, evicted_tile_ids}` capped to 5 ids per AC-11) to
  `KNOWN_PAYLOAD_KEYS`. The per-tile `c6.evicted` event is INFO log
  only (it is high-frequency under load and would dilute the FDR
  ring-buffer; aggregated batch counts go to FDR).
- `src/gps_denied_onboard/runtime_root/storage_factory.py`:
  `build_tile_store` now constructs a `PostgresFilesystemStore`, a
  `CacheBudgetEnforcer` wired to a producer-local `FdrClient`
  (`producer_id="c6_tile_cache.budget"`) and the C6 logger, with
  `budget_bytes = config.tile_cache.lru_eviction_threshold_bytes`
  and `eviction_batch_size = config.tile_cache.eviction_batch_size`,
  then wraps the store in a `BudgetEnforcedTileStore` and returns
  the decorator. `build_tile_metadata_store` is unchanged (the
  decorator only intercepts `TileStore`, never the metadata store).
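
The `errors.py` and `config.py` changes listed above might look roughly like this. It is a sketch under stated assumptions: the positional `message` argument, the `TileCacheConfig` class name, and the `10 * 1024**3` default for the budget are illustrative and not confirmed by the report; only the keyword-only `None`-defaulted fields, the default of 32, and the `> 0` check in `__post_init__` are.

```python
from dataclasses import dataclass


class TileCacheError(Exception):
    """Stand-in for the existing error family root."""


class CacheBudgetExhaustedError(TileCacheError):
    def __init__(self, message, *, needed_bytes=None, available_bytes=None,
                 evicted_count=None):
        super().__init__(message)
        # Keyword-only and optional, so callers can supply what they know.
        self.needed_bytes = needed_bytes
        self.available_bytes = available_bytes
        self.evicted_count = evicted_count


@dataclass
class TileCacheConfig:
    """Hypothetical config holder; only the two fields relevant here."""

    lru_eviction_threshold_bytes: int = 10 * 1024**3  # assumed reading of "10 GB"
    eviction_batch_size: int = 32

    def __post_init__(self):
        if self.eviction_batch_size <= 0:
            raise ValueError("eviction_batch_size must be > 0")
```

Validating in `__post_init__` rejects a zero or negative batch size at construction time, before any sweep could loop without making progress.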

### Modified (tests)

- `tests/unit/c6_tile_cache/test_cache_budget_enforcer.py`: **NEW**
  suite of 18 tests:
  - 4 non-docker unit tests for `CacheBudgetEnforcer` against an
    in-memory `_FakeStore` covering AC-1 (no-eviction fast path),
    AC-2 (single-tile sweep), AC-3 (multi-tile until shortfall met),
    AC-4 (batch-size-respecting `lru_candidates` calls).
  - 3 non-docker tests for the error-handling envelope: AC-5 (sweep
    exhausted → `CacheBudgetExhaustedError` AFTER all candidates
    deleted), AC-7 (decorator does NOT rewrap a
    `ContentHashMismatchError` from the inner store), AC-9
    (SELECT-count tally for no-evict vs evict paths).
  - 4 non-docker tests for FDR + log payloads: AC-11 (evicted_tile_ids
    truncated to 5 even when 100 evictions occurred), AC-12
    (construction-time `c6.budget.loaded` INFO log +
    WARN-on-over-budget), and the NFR-reliability "candidate gone
    mid-sweep" case where `delete_tile` returns False.
  - 1 non-docker NFR test (`reserve_headroom × 10000` no-evict path
    with a strict p99 ≤ 5 ms ceiling).
  - 3 `@pytest.mark.docker` Tier-2 tests against a real Postgres
    (composition-root smoke): AC-6 (decorator + `write_tile`
    end-to-end with near-cap state), AC-8 (real `read_tile_pixels`
    bumps the LRU clock and changes `lru_candidates` ordering), and
    AC-10 (synthetic-fill test: 50 MB of writes under a deliberately
    tight 50 MB pre-eviction headroom; verifies eviction kicks in
    and disk usage never exceeds the cap).
  - 3 protocol-shape sanity tests (`EvictionResult` is frozen and
    `total_freed_bytes` derives correctly, the wrapper exposes the
    underlying store as `_wrapped`, and the decorator passes
    `tile_exists` / `delete_tile` straight through).
- `tests/unit/c6_tile_cache/test_protocol_conformance.py`: adjusted
  `_install_fake_postgres_store_module` to provide a working
  `total_disk_bytes() -> 0` (the prior `NotImplementedError` stub
  would break `CacheBudgetEnforcer.__init__` which reads the value
  for AC-12); and rewrote
  `test_ac4_build_tile_store_returns_protocol_impl` to recognise the
  AZ-308 wrapper (`isinstance(store, BudgetEnforcedTileStore)`,
  `isinstance(store, TileStore)`, `isinstance(store._wrapped,
  fake_cls)`). No new fakes; the change is local to one shared
  helper + one test.
- `tests/unit/test_az272_fdr_record_schema.py`: adds a fixture
  payload for the new `c6.eviction_batch` kind so the AZ-272 per-kind
  round-trip test covers it.
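
For readers unfamiliar with percentile ceilings like the p99 ≤ 5 ms check above, one common way to compute them is the nearest-rank method sketched here. The helper names are hypothetical and the actual test may compute its percentile differently.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ms(fn, iterations):
    """Call fn() repeatedly and return per-call wall time in milliseconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        latencies.append((time.perf_counter_ns() - start) / 1e6)
    return latencies
```

A test in this style would collect `measure_latencies_ms(call, 10000)` and assert that `percentile(latencies, 99)` stays at or below the ceiling.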

### Modified (docs)

- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`:
  bumped to v1.3.0; added a row for `c6.eviction_batch`
  (producer `c6_tile_cache.budget`, payload shape, cap-to-5 note) in
  the v1.0.0 closed-enum table and a change-log entry.
- `_docs/02_document/contracts/c6_tile_cache/tile_store.md`: bumped
  to v1.1.0 (additive); `CacheBudgetExhaustedError` joins the
  `TileCacheError` family diagram + change-log entry per the
  Versioning Rules § "new error variant added to `TileCacheError`".

## Acceptance criteria coverage

| AC | Test | Status |
|----|------|--------|
| AC-1 No-eviction fast path | `test_ac1_no_eviction_fast_path` | passing |
| AC-2 Single-tile eviction frees enough | `test_ac2_single_tile_eviction_frees_enough` | passing |
| AC-3 Multi-tile eviction iterates LRU candidates | `test_ac3_multi_tile_eviction_iterates_until_target` | passing |
| AC-4 Eviction batches respect `eviction_batch_size` | `test_ac4_eviction_batches_respect_batch_size` | passing |
| AC-5 Insufficient candidates raise `CacheBudgetExhaustedError` | `test_ac5_insufficient_candidates_raise_after_full_sweep` | passing |
| AC-6 `BudgetEnforcedTileStore` decorator integrates with `write_tile` | `test_ac6_decorator_write_tile_triggers_eviction` (Docker) | passing |
| AC-7 Decorator propagates `TileCacheError` unchanged | `test_ac7_decorator_propagates_tilecacheerror_unchanged` | passing |
| AC-8 `read_tile_pixels` updates the LRU clock | `test_ac8_read_tile_pixels_updates_lru_clock` (Docker) | passing |
| AC-9 No-evict path = 1 SELECT; evict path = 1 + N + N | `test_ac9_no_evict_path_uses_single_select` | passing |
| AC-10 10 GB budget enforcement under synthetic load | `test_ac10_synthetic_load_stays_under_cap` (Docker) | passing |
| AC-11 FDR `evicted_tile_ids` capped to 5 | `test_ac11_fdr_evicted_tile_ids_capped_at_five` | passing |
| AC-12 Construction-time disk-bytes report | `test_ac12_construction_emits_budget_loaded_info` + `test_ac12_construction_warns_when_over_budget` | passing |
| NFR-perf no-evict p99 ≤ 5 ms | `test_nfr_perf_no_evict_path_p99_under_5ms` | passing |
| NFR-reliability candidate-gone mid-sweep | `test_nfr_reliability_delete_returns_false_no_op` | passing |

## AC Test Coverage: 12 of 12 covered (+ 2 NFRs + 1 frozen-dataclass shape test)
## Code Review Verdict: PASS
## Auto-Fix Attempts: 1 (ruff `format` + `check`; 8 cosmetic findings auto-resolved: 4 ambiguous `×` characters in comments, 3 unused `noqa: ARG002` directives, 1 unescaped-metacharacter regex in `pytest.raises(match=...)`)
## Stuck Agents: None

## Findings (self-review)

| # | Severity | Category | Location | Note | Resolution |
|---|----------|----------|----------|------|------------|
| 1 | Low | Maintainability | `CacheBudgetEnforcer.__init__` | The ctor runs `self._store.total_disk_bytes()` synchronously to emit the AC-12 startup INFO log. If the metadata store's pool is contended at process start, this blocks the composition-root path. Accepted because the enforcer is constructed once per process and the cost is one indexed SELECT. | Open (Low), accepted as-is. |
| 2 | Low | Test-quality | `test_ac10_synthetic_load_stays_under_cap` | Uses a 50 MB synthetic budget (not the 10 GB production cap) to keep the test reasonable on a dev laptop. The cap-enforcement logic is the same shape; the test verifies the loop terminates correctly and disk usage never exceeds the cap. | Open (Low), accepted as-is. |
| 3 | Low | Test-quality | `test_ac8_read_tile_pixels_updates_lru_clock` | The host (Python) and Postgres-container wall clocks differ by sub-second skew on macOS/Colima, so a real `record_lru_access` UPDATE carrying the host wall clock can lose to `GREATEST(accessed_at, %s)` when `accessed_at` was set by the DB's `DEFAULT now()`. The test pins the LRU clock to a far-future timestamp (`2099-01-01`) via a fixture-local `_FakeClock`; production wiring (`storage_factory`) still injects `WallClock()`. | Open (Low), accepted as-is. |
| 4 | Low | Adjacent-Hygiene | `tests/unit/c6_tile_cache/test_protocol_conformance.py::_FakePostgresFilesystemStore` | The AZ-303 protocol-conformance fake inherits `total_disk_bytes` from `_FullTileMetadataStore`, which raises `NotImplementedError`. Once `build_tile_store` started constructing a `CacheBudgetEnforcer` (which calls `total_disk_bytes` at construction), this stub broke the test. Overrode `total_disk_bytes` on the AZ-308 path to return 0: a minimal change, and no other test using the shared helper changed semantically. | **FIXED** in this batch. |
| 5 | Low | Maintainability | `BudgetEnforcedTileStore._wrapped` | The wrapper exposes the inner store via a private `_wrapped` attribute so tests + future debugging can introspect it. This is documented in the AC-4 protocol-conformance test comment; not part of the public Protocol contract (the Protocol only requires the four `TileStore` methods, which the wrapper provides). | Open (Low), accepted as documented. |

## Tracker

- AZ-308 transitioned to **In Progress** on session start; will be moved to **In Testing** post-commit per `protocols.md`.

## Test suite

- `tests/unit/c6_tile_cache/test_cache_budget_enforcer.py` (18 tests):
  passing standalone (Tier-2 + Docker Postgres) and as part of the
  combined c6 suite (193 / 194 passed in the combined run; see below).
- `tests/unit/c6_tile_cache/` (194 tests): 193 passing; the same
  `test_ac13_read_tile_pixels_warm_latency_p95` flake noted in the
  AZ-307 batch 29 report (Finding 3 of the AZ-305 batch 28 report)
  surfaces under combined load. Verified non-regression by `git stash
  -u` round-trip: with the AZ-308 changes stashed, the same test still
  fails (`p95 ≈ 8 ms` vs the 5 ms ceiling) in the combined run, and
  passes 3-of-3 standalone. Not a blocker for AZ-308.
- `tests/unit/test_az272_fdr_record_schema.py`: passing with the new
  `c6.eviction_batch` kind fixtured.
- Full unit suite (excluding `tests/integration/` and the unrelated,
  pre-existing c7 `test_ac8_read_host_tuple_on_jetson` that requires
  `pynvml`): 1267 passed, 8 environment-skipped (CUDA-only, cmake,
  actionlint), 1 deselected (pynvml).

## Next batch

Cycle 1 advances per the greenfield queue: autodev re-detects the
next AZ ticket in the Step 7 batch loop and continues.