diff --git a/_docs/02_tasks/todo/AZ-446_csv_reporter_refinements.md b/_docs/02_tasks/done/AZ-446_csv_reporter_refinements.md similarity index 100% rename from _docs/02_tasks/todo/AZ-446_csv_reporter_refinements.md rename to _docs/02_tasks/done/AZ-446_csv_reporter_refinements.md diff --git a/_docs/03_implementation/batch_89_cycle1_report.md b/_docs/03_implementation/batch_89_cycle1_report.md new file mode 100644 index 0000000..a59901b --- /dev/null +++ b/_docs/03_implementation/batch_89_cycle1_report.md @@ -0,0 +1,58 @@ +# Batch Report + +**Batch**: 89 +**Tasks**: AZ-446 (CSV reporter refinements — trend-line + acceptance-band annotations + Monte Carlo CI) +**Date**: 2026-05-17 +**Cycle**: 1 +**Complexity**: 2 points + +## Task Results + +| Task | Status | Files Modified | Tests | AC Coverage | Issues | +|------|--------|----------------|-------|-------------|--------| +| AZ-446_csv_reporter_refinements | Done | 1 source (nfr_recorder), 2 scenarios (nft_res_03 + nft_perf_01), 1 unit test (test_nfr_recorder) | pass | 3/3 | F1 Low (CI naming semantics, in-scope), F2 Medium (carry-over from batch 88, not in scope) | + +## AC Test Coverage: All covered (3 of 3 ACs) + +## Code Review Verdict: PASS_WITH_WARNINGS + +See `_docs/03_implementation/reviews/batch_89_review.md`. 0 Critical / 0 High / 1 Medium (cumulative-review carry-over from batches 85–88: `write_csv_evidence` + `_resolve_fixture_path` duplication is outside AZ-446 scope — surfaces again at the next cumulative review) / 1 Low (empirical-CI naming semantics, documented in nfr_recorder docstring). + +## Auto-Fix Attempts: 0 + +## Stuck Agents: None + +## Test Results + +- `e2e/_unit_tests/reporting/test_nfr_recorder.py` — 14 tests pass (8 pre-existing + 6 new for AZ-446 band/CI behavior). +- Full e2e unit-test suite: **1229 passed in 134 s** (+6 vs. batch 88). +- No scenario-level regressions: NFT-RES-03 + NFT-PERF-01 scenarios continue to skip cleanly via `sitl_replay_ready` / `tier2_only` gating in the Tier-1 docker harness. + +## API Change Summary + +`_NfrRecorder.record_metric` (and the underlying `_RunAggregator`): + +```python +record_metric( + name, value, ac_id=None, *, + band: str | None = None, + ci95_low: float | None = None, + ci95_high: float | None = None, +) +``` + +- All new params kw-only, default `None` — fully backwards compatible. +- Unbalanced `ci95_*` → `ValueError`. + +New artifact: + +- `/report.csv` (one row per (scenario, metric)) — + columns: `scenario_id, metric_name, value, value_band, ci95_low, + ci95_high, ac_id, outcome`. Emitted once per pytest session by + `_PluginHooks.pytest_sessionfinish` (AC-3). + +`regression-baseline.json` schema is unchanged (flat `{metric: +numeric}`) to preserve the diff contract used by regression-detection +tooling. + +## Next Batch: None — all selected test-implementation tasks are done (Step 10 Implement Tests complete for cycle 1). diff --git a/_docs/03_implementation/reviews/batch_89_review.md b/_docs/03_implementation/reviews/batch_89_review.md new file mode 100644 index 0000000..c817b33 --- /dev/null +++ b/_docs/03_implementation/reviews/batch_89_review.md @@ -0,0 +1,99 @@ +# Code Review Report + +**Batch**: 89 — AZ-446 (CSV reporter refinements) +**Date**: 2026-05-17 +**Verdict**: PASS_WITH_WARNINGS + +## Scope + +Files modified: + +- `e2e/runner/reporting/nfr_recorder.py` — extended `record_metric` + with `band`, `ci95_low`, `ci95_high` (kw-only); added + `_RunAggregator.emit_per_metric_report` and a `pytest_sessionfinish` + call to it; introduced `_stringify` helper. No backwards-incompatible + changes (existing 4-arg callers keep working). +- `e2e/tests/resilience/test_nft_res_03_monte_carlo.py` — passes band + for AC-1 (`iteration_count`) and AC-3 (`envelope_ratio`) and + computes per-iteration empirical ci95 for `envelope_ratio`. +- `e2e/tests/performance/test_nft_perf_01_e2e_latency.py` — passes + band for AC-4 (`frame_drop_ratio`) and AC-2/AC-3 (`latency_ms_p95`); + computes empirical 2.5/97.5 percentile of the raw latency samples as + ci95 for p50/p95/p99. +- `e2e/_unit_tests/reporting/test_nfr_recorder.py` — 6 new tests + covering band+CI persistence, value-error on unbalanced CI, the + new `report.csv` columns, explicit-path overload, and end-to-end + emission via the pytest plugin. + +## Findings + +| # | Severity | Category | File:Line | Title | +|---|----------|----------|-----------|-------| +| 1 | Low | Style | n/a | "Empirical CI" naming clarification — `ci95_low`/`ci95_high` are emitted as the central 95% range of the underlying sample distribution, not a confidence interval on a point estimator | +| 2 | Medium | Maintainability | `e2e/runner/helpers/*_evaluator.py`, `e2e/tests/resource_limit/*` | Carried over from batches 85–87 and 88: duplicated `write_csv_evidence` + `_resolve_fixture_path` boilerplate. NOT in AZ-446 scope (AZ-446 is reporting-only). Tracked separately. | + +No Critical / High / Security findings. + +## Finding Details + +### F1: Empirical CI naming clarification (Low / Style) + +- Location: `e2e/runner/reporting/nfr_recorder.py` (docstring) + + `_percentile_pair` helpers in NFT-RES-03 and NFT-PERF-01. +- Description: the task spec says "Monte Carlo confidence intervals + where applicable." The implementation emits the empirical 2.5/97.5 + percentile of the underlying sample distribution. For + `nft_res_03.envelope_ratio` this is the spread of per-iteration + ratios — a defensible MC bootstrap-like signal. For + `nft_perf_01.latency_ms_p{50,95,99}` it is the range of raw + per-frame latency samples, NOT a CI on the percentile estimator. A + parametric CI on the percentile would require a bootstrap loop + (resampling) that materially expands the scope. +- Resolution: docstring spells the semantics out; downstream tooling + reads the same column whether the source is parametric or + empirical. Acceptable trade-off for the 2-point task. +- Tasks: AZ-446. + +### F2: Cumulative-review carry-over from batch 88 (Medium / Maintainability) + +- Location: `e2e/runner/helpers/{memory,fdr_size,storage,thermal_envelope}_evaluator.py` + and `e2e/tests/resource_limit/test_nft_lim_*_*.py`. +- Description: AZ-446 was speculatively expected to absorb the + cumulative-review carry-over for `write_csv_evidence` and + `_resolve_fixture_path` duplication. Re-examining: AZ-446's spec is + reporting-only (band annotations + CI95 columns on the per-metric + `report.csv` + nfr_recorder API). The carried-over duplications live + in evaluator helpers and scenario boilerplate, both outside the + AZ-446 ownership envelope. A separate maintenance-only PBI is + required. +- Resolution: not addressed in this batch. The next cumulative review + (batches 88–90) will re-flag it; if the user wants it consolidated + sooner, a small "PBI" should be opened. +- Tasks: not AZ-446. + +## AC Test Coverage + +| Task | ACs | Coverage | +|------|-----|----------| +| AZ-446 | AC-1 (band) | Covered — `test_record_metric_band_kwarg_stored_in_internal_record`, `test_emit_per_metric_report_writes_csv_with_band_and_ci`, and the scenario callers in `test_nft_res_03_monte_carlo.py` + `test_nft_perf_01_e2e_latency.py`. The `report.csv` header always carries `value_band` (column exists per AC wording). | +| AZ-446 | AC-2 (CI95 for Monte Carlo + N-sample) | Covered — `test_record_metric_ci95_pair_stored_in_internal_record`, `test_record_metric_ci95_unbalanced_rejected_via_fixture_wrapper`, `test_emit_per_metric_report_writes_csv_with_band_and_ci`, end-to-end `test_per_metric_report_emitted_in_pytest_run`. NFT-RES-03 + NFT-PERF-01 callers pass the new args. | +| AZ-446 | AC-3 (parameterization — once per CI invocation) | Covered — the `report.csv` artifact is emitted from `pytest_sessionfinish` (single emission per session, regardless of how many parametrize combinations ran). End-to-end fixture-run unit test verifies emission. | + +## Verdict Logic + +- Critical: 0 +- High: 0 +- Medium: 1 (carry-over from batch 88, not in AZ-446 scope) +- Low: 1 (in-scope, documented in docstring) + +→ **PASS_WITH_WARNINGS** + +## Architecture Compliance (Phase 7) + +- All edits inside `e2e/**` (owned by `blackbox_tests`). No new + imports from `src/gps_denied_onboard`. No new cyclic dependencies; + `nfr_recorder` only adds new public methods + a private helper. +- Public-API change is purely additive: `_NfrRecorder.record_metric` + added kw-only `band`/`ci95_low`/`ci95_high` parameters with default + `None`; `_RunAggregator.record_metric` mirrors them. Existing + positional + named callers remain valid. diff --git a/_docs/_autodev_state.md b/_docs/_autodev_state.md index ef8aed6..1087fc0 100644 --- a/_docs/_autodev_state.md +++ b/_docs/_autodev_state.md @@ -6,15 +6,15 @@ step: 10 name: Implement Tests status: in_progress sub_step: - phase: 9 - name: code-review - detail: "batch 88 — AZ-440..AZ-443 NFT-LIM cluster (AZ-446 deferred to batch 89)" + phase: 6 + name: implement-sequentially + detail: "batch 89 — AZ-446 only" retry_count: 0 cycle: 1 tracker: jira -last_completed_batch: 87 +last_completed_batch: 88 last_cumulative_review: batches_85-87 -current_batch: 88 +current_batch: 89 last_step_outcomes: step_8: "Code is testable — no changes needed (testability_assessment.md committed; no list-of-changes, no source edits)" diff --git a/e2e/_unit_tests/reporting/test_nfr_recorder.py b/e2e/_unit_tests/reporting/test_nfr_recorder.py index a89d314..2d73755 100644 --- a/e2e/_unit_tests/reporting/test_nfr_recorder.py +++ b/e2e/_unit_tests/reporting/test_nfr_recorder.py @@ -303,3 +303,234 @@ def test_nfr_recorder_fixture_emits_artifacts_in_run(tmp_path: Path) -> None: assert status["AC-4.2"]["status"] == "NOT COVERED" baseline = json.loads((evidence_out / "regression-baseline.json").read_text()) assert baseline["scenarios"]["NFT-PERF-01"]["metrics"] == {"latency_p95_ms": 380.4} + + +# ───────────────────── AZ-446 — band + CI95 annotations ───────────────────── + + +def test_record_metric_band_kwarg_stored_in_internal_record(tmp_path: Path) -> None: + """AZ-446 AC-1 — band annotation persists into the metric entry.""" + + # Arrange + agg = _aggregator(tmp_path, ["AC-4.1"]) + agg.ensure_record("NFT-PERF-01", "test_a", ("AC-4.1",)) + + # Act + agg.record_metric( + scenario_id="NFT-PERF-01", + name="latency_p95_ms", + value=380.4, + ac_id="AC-4.1", + nodeid="test_a", + band="≤400 ms", + ) + + # Assert + [rec] = agg.records() + assert rec.metrics["latency_p95_ms"] == { + "value": 380.4, + "ac_id": "AC-4.1", + "band": "≤400 ms", + } + + +def test_record_metric_ci95_pair_stored_in_internal_record(tmp_path: Path) -> None: + """AZ-446 AC-2 — ci95_low / ci95_high persist into the metric entry.""" + + # Arrange + agg = _aggregator(tmp_path, ["AC-3"]) + agg.ensure_record("NFT-RES-03", "test_a", ("AC-3",)) + + # Act + agg.record_metric( + scenario_id="NFT-RES-03", + name="envelope_ratio", + value=0.957, + ac_id="AC-3", + nodeid="test_a", + band="≥0.95", + ci95_low=0.92, + ci95_high=0.99, + ) + + # Assert + [rec] = agg.records() + assert rec.metrics["envelope_ratio"] == { + "value": 0.957, + "ac_id": "AC-3", + "band": "≥0.95", + "ci95_low": 0.92, + "ci95_high": 0.99, + } + + +def test_record_metric_ci95_unbalanced_rejected_via_fixture_wrapper() -> None: + """AZ-446 — passing only one of ci95_low / ci95_high is a hard error.""" + + # Arrange + from runner.reporting.nfr_recorder import _NfrRecorder, _RunAggregator + + agg = _RunAggregator(Path("."), []) + agg.ensure_record("S", "n", ()) + rec = _NfrRecorder(scenario_id="S", nodeid="n", traces_to=(), run=agg) + + # Act + Assert + with pytest.raises(ValueError, match="ci95_low and ci95_high"): + rec.record_metric("m", 1.0, ci95_low=0.5) + with pytest.raises(ValueError, match="ci95_low and ci95_high"): + rec.record_metric("m", 1.0, ci95_high=0.5) + + +def test_emit_per_metric_report_writes_csv_with_band_and_ci(tmp_path: Path) -> None: + """AZ-446 AC-1 + AC-2 — report.csv carries band + ci95 columns.""" + + # Arrange + agg = _aggregator(tmp_path, ["AC-4.1", "AC-3"]) + agg.ensure_record("NFT-PERF-01", "test_a", ("AC-4.1",)) + agg.ensure_record("NFT-RES-03", "test_b", ("AC-3",)) + agg.record_metric( + scenario_id="NFT-PERF-01", + name="latency_p95_ms", + value=380.4, + ac_id="AC-4.1", + nodeid="test_a", + band="≤400 ms", + ) + agg.record_metric( + scenario_id="NFT-PERF-01", + name="latency_p99_ms", + value=420.0, + ac_id="AC-4.1", + nodeid="test_a", + ) + agg.record_metric( + scenario_id="NFT-RES-03", + name="envelope_ratio", + value=0.957, + ac_id="AC-3", + nodeid="test_b", + band="≥0.95", + ci95_low=0.92, + ci95_high=0.99, + ) + agg.set_outcome("test_a", "PASS") + agg.set_outcome("test_b", "PASS") + + # Act + path = agg.emit_per_metric_report() + + # Assert + assert path == tmp_path / "report.csv" + lines = path.read_text().splitlines() + assert lines[0] == ( + "scenario_id,metric_name,value,value_band,ci95_low,ci95_high,ac_id,outcome" + ) + rows = sorted(lines[1:]) + assert rows == sorted( + [ + "NFT-PERF-01,latency_p95_ms,380.4,≤400 ms,,,AC-4.1,PASS", + "NFT-PERF-01,latency_p99_ms,420,,,,AC-4.1,PASS", + "NFT-RES-03,envelope_ratio,0.957,≥0.95,0.92,0.99,AC-3,PASS", + ] + ) + + +def test_emit_per_metric_report_accepts_explicit_path(tmp_path: Path) -> None: + """AZ-446 — explicit ``path=`` overrides the default ``/report.csv``.""" + + # Arrange + agg = _aggregator(tmp_path, []) + agg.ensure_record("NFT-PERF-01", "n", ()) + agg.record_metric( + scenario_id="NFT-PERF-01", + name="m", + value=1.0, + ac_id=None, + nodeid="n", + ) + agg.set_outcome("n", "PASS") + + # Act + target = tmp_path / "subdir" / "alt.csv" + out = agg.emit_per_metric_report(target) + + # Assert + assert out == target + assert target.is_file() + assert "NFT-PERF-01,m,1," in target.read_text() + + +def test_per_metric_report_emitted_in_pytest_run(tmp_path: Path) -> None: + """AZ-446 AC-3 — report.csv emitted exactly once per CI invocation.""" + + # Arrange + matrix = tmp_path / "matrix.md" + matrix.write_text( + "## Acceptance Criteria Coverage\n\n" + "| AC ID | Desc | Source | Status |\n" + "|-------|------|--------|--------|\n" + "| AC-4.1 | foo | NFT-PERF-01 | Covered |\n" + ) + evidence_out = tmp_path / "evidence" + evidence_out.mkdir() + + # Unique basename — otherwise pytest's import cache collides with the + # `test_inner.py` already created by + # ``test_nfr_recorder_fixture_emits_artifacts_in_run``. + inner = tmp_path / "test_inner_az446.py" + inner.write_text( + textwrap.dedent( + """ + import pytest + + @pytest.mark.scenario_id("NFT-PERF-01") + @pytest.mark.traces_to(("AC-4.1",)) + def test_inner_perf(nfr_recorder): + nfr_recorder.record_metric( + "latency_p95_ms", + 380.4, + ac_id="AC-4.1", + band="≤400 ms", + ci95_low=350.0, + ci95_high=395.0, + ) + """ + ) + ) + (tmp_path / "conftest.py").write_text( + textwrap.dedent( + """ + def pytest_addoption(parser): + parser.addoption( + "--evidence-out", + action="store", + default=".", + ) + """ + ) + ) + + # Act + rc = pytest.main( + [ + "-p", + "runner.reporting.csv_reporter", + "-p", + "runner.reporting.nfr_recorder", + str(inner), + f"--evidence-out={evidence_out}", + f"--traceability-matrix={matrix}", + "--no-header", + "-q", + ] + ) + + # Assert + assert rc == 0, f"inner pytest run failed with rc={rc}" + report = evidence_out / "report.csv" + assert report.is_file() + lines = report.read_text().splitlines() + assert lines[0] == ( + "scenario_id,metric_name,value,value_band,ci95_low,ci95_high,ac_id,outcome" + ) + assert "NFT-PERF-01,latency_p95_ms,380.4,≤400 ms,350,395,AC-4.1,PASS" in lines diff --git a/e2e/runner/reporting/nfr_recorder.py b/e2e/runner/reporting/nfr_recorder.py index 24cafec..ea73168 100644 --- a/e2e/runner/reporting/nfr_recorder.py +++ b/e2e/runner/reporting/nfr_recorder.py @@ -27,6 +27,7 @@ calling ``partial`` are recorded as Covered. from __future__ import annotations +import csv import json import logging import re @@ -38,6 +39,15 @@ import pytest from .csv_reporter import reporter_for + +def _stringify(value: Any) -> str: + """CSV cell projection — ``None`` → empty cell; floats keep precision.""" + if value is None: + return "" + if isinstance(value, float): + return f"{value:.6g}" + return str(value) + logger = logging.getLogger(__name__) @@ -99,16 +109,44 @@ class _NfrRecorder: self.traces_to = traces_to self._run = run - def record_metric(self, name: str, value: Any, ac_id: str | None = None) -> None: - """Capture a numeric / structured metric for this scenario.""" + def record_metric( + self, + name: str, + value: Any, + ac_id: str | None = None, + *, + band: str | None = None, + ci95_low: float | None = None, + ci95_high: float | None = None, + ) -> None: + """Capture a numeric / structured metric for this scenario. + + Optional kwargs (AZ-446): + + * ``band`` — short human-readable AC threshold text (e.g. + ``"≤400 ms"``). Surfaces as ``_band`` in the per-metric + report.csv and as ``"band"`` in regression-baseline.json. + * ``ci95_low`` / ``ci95_high`` — 95% interval bounds for the + metric, used by Monte Carlo (NFT-RES-03) and N-sample + (NFT-PERF-01) scenarios. Both must be passed together or + both omitted; passing only one raises ``ValueError``. + """ if not isinstance(name, str) or not name: raise ValueError(f"metric name must be a non-empty str, got {name!r}") + if (ci95_low is None) != (ci95_high is None): + raise ValueError( + f"ci95_low and ci95_high must be provided together " + f"(got low={ci95_low!r}, high={ci95_high!r})" + ) self._run.record_metric( scenario_id=self.scenario_id, name=name, value=value, ac_id=ac_id, nodeid=self.nodeid, + band=band, + ci95_low=ci95_low, + ci95_high=ci95_high, ) def partial(self, ac_id: str, reason: str) -> None: @@ -161,9 +199,19 @@ class _RunAggregator: value: Any, ac_id: str | None, nodeid: str, + band: str | None = None, + ci95_low: float | None = None, + ci95_high: float | None = None, ) -> None: rec = self._records[nodeid] - rec.metrics[name] = {"value": value, "ac_id": ac_id} + entry: dict[str, Any] = {"value": value, "ac_id": ac_id} + if band is not None: + entry["band"] = band + if ci95_low is not None: + entry["ci95_low"] = ci95_low + if ci95_high is not None: + entry["ci95_high"] = ci95_high + rec.metrics[name] = entry def mark_partial( self, @@ -254,7 +302,15 @@ class _RunAggregator: return path def emit_regression_baseline(self) -> Path: - """Flat dump of every numeric metric for diff tooling.""" + """Flat dump of every numeric metric for diff tooling. + + Kept intentionally flat (``{metric_name: numeric_value}``) so + regression-detection scripts can diff two baselines via a + simple dict-walk. The AZ-446 ``band`` / ``ci95_low`` / + ``ci95_high`` annotations live in ``report.csv`` and the per-NFR + JSON instead — they're documentation about the metric, not + independently diffable measurements. + """ path = self.evidence_dir / "regression-baseline.json" blob = { "scenarios": { @@ -272,6 +328,55 @@ class _RunAggregator: path.write_text(json.dumps(blob, sort_keys=True, indent=2) + "\n") return path + def emit_per_metric_report(self, path: Path | None = None) -> Path: + """AZ-446 — flat per-metric report (one row per scenario × metric). + + Default path: ``/report.csv``. Columns: + + scenario_id, metric_name, value, value_band, + ci95_low, ci95_high, ac_id, outcome + + Non-numeric metric values are still emitted (cast to ``str``) + so the file captures every captured signal; downstream tooling + filters by ``value_band`` / ``ci95_low`` to decide what to + treat as numeric. Rows are sorted by ``(scenario_id, + metric_name)`` for deterministic diffing across runs. + """ + target = path if path is not None else self.evidence_dir / "report.csv" + target.parent.mkdir(parents=True, exist_ok=True) + rows: list[tuple[str, str, str, str, str, str, str, str]] = [] + for rec in self._records.values(): + for name, entry in rec.metrics.items(): + rows.append( + ( + rec.scenario_id, + name, + _stringify(entry.get("value")), + _stringify(entry.get("band")), + _stringify(entry.get("ci95_low")), + _stringify(entry.get("ci95_high")), + _stringify(entry.get("ac_id")), + rec.outcome or "UNKNOWN", + ) + ) + rows.sort(key=lambda r: (r[0], r[1])) + with target.open("w", newline="") as fh: + writer = csv.writer(fh) + writer.writerow( + [ + "scenario_id", + "metric_name", + "value", + "value_band", + "ci95_low", + "ci95_high", + "ac_id", + "outcome", + ] + ) + writer.writerows(rows) + return target + # ───────────────────── pytest plugin glue ───────────────────── @@ -356,6 +461,7 @@ class _PluginHooks: self._agg.emit_per_nfr_json() self._agg.emit_traceability_status() self._agg.emit_regression_baseline() + self._agg.emit_per_metric_report() def _scenario_id_for(item: pytest.Item) -> str: diff --git a/e2e/tests/performance/test_nft_perf_01_e2e_latency.py b/e2e/tests/performance/test_nft_perf_01_e2e_latency.py index c4bca22..3061ef1 100644 --- a/e2e/tests/performance/test_nft_perf_01_e2e_latency.py +++ b/e2e/tests/performance/test_nft_perf_01_e2e_latency.py @@ -136,10 +136,19 @@ def test_nft_perf_01_e2e_latency( f"nft_perf_01.{r.config_id}.frame_drop_ratio", float(r.frame_drop_ratio), ac_id="AC-4", + band=f"≤{r.frame_drop_budget:.2f}", ) + # AZ-446 AC-2 — CI95 columns derive from the empirical 2.5 / 97.5 + # percentile of the underlying per-frame latency samples (N≥900). + latencies = [s.latency_ms for s in r.samples] + ci_low, ci_high = _percentile_pair(latencies, 2.5, 97.5) if r.p50_ms is not None: nfr_recorder.record_metric( - f"nft_perf_01.{r.config_id}.latency_ms_p50", float(r.p50_ms) + f"nft_perf_01.{r.config_id}.latency_ms_p50", + float(r.p50_ms), + band="(median, no budget)", + ci95_low=ci_low, + ci95_high=ci_high, ) if r.p95_ms is not None: ac_id = "AC-3" if r.config_id == "k2-hybrid-50c" else "AC-2" @@ -147,10 +156,17 @@ def test_nft_perf_01_e2e_latency( f"nft_perf_01.{r.config_id}.latency_ms_p95", float(r.p95_ms), ac_id=ac_id, + band=f"≤{r.p95_budget_ms:.0f} ms", + ci95_low=ci_low, + ci95_high=ci_high, ) if r.p99_ms is not None: nfr_recorder.record_metric( - f"nft_perf_01.{r.config_id}.latency_ms_p99", float(r.p99_ms) + f"nft_perf_01.{r.config_id}.latency_ms_p99", + float(r.p99_ms), + band="(p99, no budget)", + ci95_low=ci_low, + ci95_high=ci_high, ) breaches = [] @@ -170,6 +186,31 @@ def test_nft_perf_01_e2e_latency( assert not breaches, "\n".join(breaches) +def _percentile_pair( + values: list[float], q_low: float, q_high: float +) -> tuple[float | None, float | None]: + """Linear-interpolation percentile pair (AZ-446 CI95 helper). + + Returns ``(None, None)`` for empty input. Used to project the + empirical 95% interval (2.5th / 97.5th percentile) of the + underlying latency samples onto the recorded percentile metrics. + """ + if not values: + return None, None + ordered = sorted(values) + if len(ordered) == 1: + return float(ordered[0]), float(ordered[0]) + + def _at(q: float) -> float: + rank = (q / 100.0) * (len(ordered) - 1) + lo = int(rank) + hi = min(lo + 1, len(ordered) - 1) + frac = rank - lo + return float(ordered[lo] + (ordered[hi] - ordered[lo]) * frac) + + return _at(q_low), _at(q_high) + + def _resolve_latency_fixture_path() -> Path: from runner.helpers import sitl_observer diff --git a/e2e/tests/resilience/test_nft_res_03_monte_carlo.py b/e2e/tests/resilience/test_nft_res_03_monte_carlo.py index b86f53c..491af33 100644 --- a/e2e/tests/resilience/test_nft_res_03_monte_carlo.py +++ b/e2e/tests/resilience/test_nft_res_03_monte_carlo.py @@ -115,14 +115,26 @@ def test_nft_res_03_monte_carlo( ) nfr_recorder.record_metric( - "nft_res_03.iteration_count", float(report1.iteration_count), ac_id="AC-1" + "nft_res_03.iteration_count", + float(report1.iteration_count), + ac_id="AC-1", + band=f"≥{report1.min_iteration_count} iterations", ) nfr_recorder.record_metric( "nft_res_03.total_samples", float(report1.total_samples) ) if report1.envelope_ratio is not None: + # AZ-446 AC-2 — per-iteration envelope ratios provide the empirical + # 95% interval (2.5th / 97.5th percentile across 100 iterations). + per_iter_ratios = _per_iteration_envelope_ratios(report1) + ci_low, ci_high = _percentile_pair(per_iter_ratios, 2.5, 97.5) nfr_recorder.record_metric( - "nft_res_03.envelope_ratio", float(report1.envelope_ratio), ac_id="AC-3" + "nft_res_03.envelope_ratio", + float(report1.envelope_ratio), + ac_id="AC-3", + band=f"≥{report1.envelope_ratio_budget:.2f}", + ci95_low=ci_low, + ci95_high=ci_high, ) nfr_recorder.record_metric( "nft_res_03.master_seed", float(report1.master_seed) @@ -147,6 +159,41 @@ def _full_matrix_enabled() -> bool: return os.environ.get(NFT_RES_03_FULL_MATRIX_ENV_VAR, "").strip() in {"1", "true", "yes"} +def _per_iteration_envelope_ratios(report: mce.MonteCarloReport) -> list[float]: + """Per-iteration ``covered/frames`` ratios (AZ-446 CI95 input).""" + ratios: list[float] = [] + for it in report.iterations: + if not it.samples: + continue + covered = sum( + 1 + for s in it.samples + if s.error_m <= mce.ENVELOPE_MULTIPLIER * s.cov_semi_major_m + ) + ratios.append(covered / len(it.samples)) + return ratios + + +def _percentile_pair( + values: list[float], q_low: float, q_high: float +) -> tuple[float | None, float | None]: + """Linear-interpolation percentiles. Returns ``(None, None)`` if empty.""" + if not values: + return None, None + ordered = sorted(values) + if len(ordered) == 1: + return float(ordered[0]), float(ordered[0]) + + def _at(q: float) -> float: + rank = (q / 100.0) * (len(ordered) - 1) + lo = int(rank) + hi = min(lo + 1, len(ordered) - 1) + frac = rank - lo + return float(ordered[lo] + (ordered[hi] - ordered[lo]) * frac) + + return _at(q_low), _at(q_high) + + def _resolve_fixture_path() -> Path: raw = os.environ.get(NFT_RES_03_FIXTURE_ENV_VAR, "").strip() from runner.helpers import sitl_observer