refactor: remove deploy.cmd and update Dockerfile for health checks

- Deleted the deploy.cmd script as it was no longer needed.
- Updated Dockerfile to include curl for health checks and added a non-root user for improved security.
- Modified health check command to use curl for better reliability.
- Adjusted docker-compose.test.yml to reflect changes in health check configuration.
- Cleaned up appsettings.json and removed unused configuration properties.
- Removed Resource entity and related requests from the codebase as part of the architectural shift.
- Updated documentation to reflect the removal of hardware binding and related endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-21 11:31:10 +00:00 · 2026-05-13 08:47:21 +03:00
parent 43fe38e67d
commit c7b297de83
76 changed files with 4034 additions and 832 deletions
@@ -0,0 +1,99 @@
+# Performance Test Report — Cycle 1
+
+**Date**: 2026-05-13
+**Cycle**: 1
+**Verdict**: **PASS** — all thresholds met, 0% error rate.
+**Runner**: k6 v2.0.0 (local) against `docker-compose.test.yml` (Postgres 16-alpine + .NET admin API), seeded with 500 perf users.
+**Artifacts**: `scripts/perf-scenarios.js`, `scripts/run-performance-tests.sh`, raw JSON at `e2e/test-results/perf-summary.json`.
+
+## Scenarios run
+
+| ID | Scenario | Threshold | Observed (p95) | Verdict |
+|----|----------|-----------|---------------:|---------|
+| NFT-PERF-01 | Login (10 VUs, 30s) | p95 < 500ms · err < 1% | **33.4ms · 0%** | Pass (15× headroom) |
+| NFT-PERF-04 | User list (10 VUs, 30s, 500 users seeded) | p95 < 1000ms · err < 1% | **152.5ms · 0%** | Pass (6.5× headroom) |
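The headroom multipliers in the Verdict column appear to be the threshold divided by the observed p95. A minimal sketch of that arithmetic, using only the numbers from the table above (the function name `headroom` is illustrative and not part of the repo):

```python
def headroom(threshold_ms: float, observed_p95_ms: float) -> float:
    """Return how many times the observed p95 fits into the threshold."""
    return threshold_ms / observed_p95_ms


# Values copied from the "Scenarios run" table.
login = headroom(500, 33.4)        # NFT-PERF-01
user_list = headroom(1000, 152.5)  # NFT-PERF-04

print(f"login: {login:.2f}x, user list: {user_list:.2f}x")
```

This gives roughly 14.97 and 6.56, which the report states as 15× and 6.5×.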
+
+## Scenarios skipped
+
+| ID | Scenario | Reason |
+|----|----------|--------|
+| NFT-PERF-02 | Encrypted resource download (small) | Endpoint deleted (AZ-183 OTA revert + AZ-197 hardware removal). Pruned from `_docs/02_document/tests/performance-tests.md`. |
+| NFT-PERF-03 | Encrypted resource download (large) | Same — the OTA / hardware-bound download path no longer exists. |
+
+## Detailed metrics (full distribution)
+
+### NFT-PERF-01 — Login
+
+| Metric | Value |
+|--------|------:|
+| Iterations | 2 617 |
+| min | 1.3 ms |
+| median | 6.3 ms |
+| avg | 13.7 ms |
+| p90 | 18.6 ms |
+| **p95** | **33.4 ms** |
+| max | 630.0 ms (single outlier: first request, before JIT and connection-pool warmup) |
+| Error rate | 0.00% |
+| Checks | 2 617 / 2 617 (status 200, token returned) |
+
+### NFT-PERF-04 — User list (500 users)
+
+| Metric | Value |
+|--------|------:|
+| Iterations | 1 944 |
+| min | 3.1 ms |
+| median | 12.0 ms |
+| avg | 43.8 ms |
+| p90 | 86.9 ms |
+| **p95** | **152.5 ms** |
+| max | 1 974.6 ms (cold-cache outlier) |
+| Error rate | 0.00% |
+| Checks | 1 944 / 1 944 (status 200, ≥ 500 users returned) |
+
+### Aggregate
+
+| Metric | Value |
+|--------|------:|
+| Total iterations | 4 561 |
+| Total HTTP requests | 4 562 |
+| Aggregate throughput | 65.1 req/s |
+| Max VUs | 20 allocated (10 per scenario; scenarios run sequentially, so at most 10 are active at once) |
+| Run duration | ~70 s (incl. 5 s gap between scenarios) |
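As a cross-check, the aggregate figures can be re-derived from the per-scenario tables. This is a sketch only: the 70 s duration is the approximate value from the table, so the computed throughput can differ from the reported one in the last digit.

```python
# Numbers copied from the report tables.
login_iterations = 2617
user_list_iterations = 1944
total_http_requests = 4562
run_duration_s = 70  # approximate ("~70 s" in the report)

total_iterations = login_iterations + user_list_iterations
throughput_rps = total_http_requests / run_duration_s

print(f"iterations: {total_iterations}, throughput: {throughput_rps:.1f} req/s")
```

The iteration sum matches the reported 4 561, and the throughput lands at about 65 req/s, in line with the reported 65.1 req/s.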
+
+## Threshold table
+
+All four `options.thresholds` entries returned `ok: true`:
+
+```
+http_req_duration{scenario:nft_perf_01_login}    p(95)<500   →  ok
+http_req_duration{scenario:nft_perf_04_user_list} p(95)<1000  →  ok
+http_req_failed{scenario:nft_perf_01_login}      rate<0.01   →  ok (rate=0)
+http_req_failed{scenario:nft_perf_04_user_list}  rate<0.01   →  ok (rate=0)
+```
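The same check can be scripted against the raw summary JSON. The sketch below assumes `e2e/test-results/perf-summary.json` follows the shape k6 passes to `handleSummary` (a `metrics` map whose entries carry a `thresholds` map with an `ok` flag). The sample dictionary is hand-written for illustration and is not the real artifact; `failed_thresholds` and `load_summary` are hypothetical helper names.

```python
import json

# Hand-written sample in the assumed summary shape; values mirror the report.
SAMPLE_SUMMARY = {
    "metrics": {
        "http_req_duration{scenario:nft_perf_01_login}": {
            "values": {"p(95)": 33.4},
            "thresholds": {"p(95)<500": {"ok": True}},
        },
        "http_req_duration{scenario:nft_perf_04_user_list}": {
            "values": {"p(95)": 152.5},
            "thresholds": {"p(95)<1000": {"ok": True}},
        },
        "http_req_failed{scenario:nft_perf_01_login}": {
            "values": {"rate": 0},
            "thresholds": {"rate<0.01": {"ok": True}},
        },
        "http_req_failed{scenario:nft_perf_04_user_list}": {
            "values": {"rate": 0},
            "thresholds": {"rate<0.01": {"ok": True}},
        },
    }
}


def failed_thresholds(summary: dict) -> list[str]:
    """List every 'metric expression' pair whose threshold is not ok."""
    failures = []
    for metric, body in summary.get("metrics", {}).items():
        for expression, result in body.get("thresholds", {}).items():
            if not result.get("ok", False):
                failures.append(f"{metric} {expression}")
    return failures


def load_summary(path: str) -> dict:
    """Read a summary file from disk (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


print(failed_thresholds(SAMPLE_SUMMARY))
```

An empty list corresponds to the "all four entries ok" result above; any non-empty list names the breached thresholds.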
+
+## Environment
+
+- Host: macOS (Darwin kernel 25.4.0, Apple Silicon)
+- Docker Desktop, single host, no resource throttling
+- SUT: `system-under-test` container built from repo `Dockerfile`, running on `http://localhost:8080`
+- DB: `test-db` (postgres:16-alpine), running as a separate container on the same Docker host
+- Seed: functional fixtures from `e2e/db-init/00_run_all.sh` + `99_test_seed.sql`, plus 500 dummy `perf-user-NNNNN@perf.azaion.com` rows inserted by `run-performance-tests.sh` after SUT readiness check
+- k6 v2.0.0 (Homebrew bottle, arm64)
+
+## Caveats / coverage gaps
+
+1. **Single-host run**: k6 and the SUT ran on the same machine, so the measurements include no network RTT and no inter-AZ latency. Production latencies will be higher; the 15×/6.5× headroom should absorb that comfortably for an internal admin API.
+2. **No DB warmup phase**: the max values of both scenarios include cold-cache outliers (login max 630 ms, user-list max ~2 s). The p95 is not affected by these isolated outliers, but a future iteration could add a 5–10 s warmup ramp.
+3. **No realistic load on the user-list filter path**: only the unfiltered `GET /users` is exercised. Adding a `?searchEmail=` variant would catch the case where the LinqToDB `WhereIf` fails to fold into the generated SQL.
+4. **No `/classes` CRUD perf coverage**: cycle 1 added these endpoints (AZ-513) but the perf spec was not extended. Recommend adding an NFT-PERF-05 in the next test-spec sync.
+5. **`acceptance_criteria.md` is stale post-cycle-1**: AC-13/14/15/16 (hardware binding) and AC-17–24 (resource management) reference deleted features. Step 12 (Test-Spec Sync) of cycle 1 missed this. Surface it in the Step 17 retro and clean it up in cycle 2.
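To illustrate caveat 2, the sketch below shows why a single cold-start outlier moves the max and the mean but leaves the p95 alone. The latency list is synthetic (99 fast requests plus one 630 ms request) and is not taken from the run; `percentile` is an illustrative nearest-rank helper, which may differ from the interpolation k6 uses.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic data: 99 warm requests at 6 ms and one 630 ms cold-start request.
latencies_ms = [6.0] * 99 + [630.0]

p95 = percentile(latencies_ms, 95)
worst = max(latencies_ms)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"p95={p95} ms, max={worst} ms, mean={mean:.2f} ms")
```

With this data the p95 stays at the warm-request latency while the mean roughly doubles, which matches the pattern in the detailed metrics (avg well above median, p95 far below max).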
+
+## Recommendations for next cycle
+
+- **Cycle 2 test-spec sync must prune AC-13..24** and add an AC for `/classes` CRUD.
+- **Add NFT-PERF-05** for `POST /classes` and `PATCH /classes/{id}` to cover the new write paths.
+- **CI gate**: wire `scripts/run-performance-tests.sh` into the deploy pipeline so threshold breaches block release. Today it is run-on-demand only.
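One possible shape for the CI gate is a small wrapper that blocks the release whenever the perf command exits non-zero. This is a sketch under assumptions: `perf_gate` is a hypothetical helper, and it only works if `scripts/run-performance-tests.sh` propagates k6's exit status (k6 documents exit code 99 for failed thresholds). The demo uses stand-in commands instead of the real script so it runs anywhere.

```python
import subprocess
import sys


def perf_gate(command: list[str]) -> bool:
    """Run the perf command; return True only if it exits with status 0."""
    result = subprocess.run(command, check=False)
    return result.returncode == 0


# Stand-in commands that mimic a passing run and a threshold breach (exit 99).
passing = perf_gate([sys.executable, "-c", "raise SystemExit(0)"])
failing = perf_gate([sys.executable, "-c", "raise SystemExit(99)"])

print(f"passing run allowed: {passing}, breaching run allowed: {failing}")
```

In a pipeline, the call would presumably be `perf_gate(["bash", "scripts/run-performance-tests.sh"])`, with the deploy step skipped when it returns `False`.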
+
+## Verdict logic
+
+PASS — all thresholds met, no failed checks, no errors, no warn-band scenarios. Auto-chain to Step 16 (Deploy).
@@ -0,0 +1,98 @@
+# Production Incident Post-Mortem — Template
+
+**Save as**: `_docs/06_metrics/postmortem_<YYYY-MM-DD>_<short-slug>.md`
+
+**Required**: every production rollback (per `_docs/04_deploy/deployment_procedures.md` §5).
+**Recommended**: any user-impacting incident even if no rollback was needed.
+**Owner**: the on-call engineer at the time of the incident.
+**Deadline**: within 24 hours of the incident.
+
+---
+
+## Header
+
+| Field | Value |
+|-------|-------|
+| Incident date | YYYY-MM-DD |
+| Detection time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
+| Mitigation time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
+| Duration (user-impacting) | mm:ss |
+| Affected environment | staging / production |
+| Detected by | alert / smoke test / user report / operator |
+| Severity | Critical / High / Medium |
+| Deploy SHA at incident start | `<full sha>` |
+| Rollback SHA (if rolled back) | `<full sha>` |
+
+## Timeline (UTC)
+
+```
+HH:MM  <event>            (source: alert / Slack / log file)
+HH:MM  <event>
+…
+```
+
+Be liberal with entries: log every page, every Slack message, and every action taken. The point is to make the incident reconstructible without re-asking the operator.
+
+## Detection
+
+How was the issue first noticed?
+
+- Alert: which one? Was the threshold appropriate? Did it fire in time?
+- User report: how did the user reach us? How long after the incident started?
+- Smoke test: which step? (1–6 from `scripts/smoke.sh`)
+
+## Impact
+
+- User impact (number of failed requests, revenue, data loss — be specific)
+- Internal impact (engineering time, lost productivity)
+- Regulatory / compliance impact (if any)
+
+## Root cause
+
+One paragraph. Include the specific commit / config change / external event. Link to the failing test / log line that proves the cause.
+
+> Avoid "human error" as a root cause — it's almost never a useful answer. Focus on the system gap that allowed the human action to cause harm.
+
+## Repair
+
+- What action mitigated the user impact? (Rollback, config change, restart, etc.)
+- What action fully resolved the issue? (Code fix, infrastructure change, etc.)
+- Were there any side-effects of the repair? (Data loss, missed messages, etc.)
+
+## Detection gaps
+
+What would we want the system to have done instead?
+
+- New alert(s) needed? With what threshold?
+- New health check needed? At what level?
+- Better dashboard panel?
+- New smoke-test step?
+
+## Prevention
+
+| Owner | Action | Target date |
+|-------|--------|-------------|
+| @… | <concrete action — write a test, add an alert, change a procedure> | YYYY-MM-DD |
+| @… | … | YYYY-MM-DD |
+
+Each row MUST be tracked as a Jira ticket (per `.cursor/rules/tracker.mdc`). Reference the ticket here.
+
+## What went well
+
+(Resist the urge to skip this; it reinforces good habits.)
+
+- …
+
+## What was lucky
+
+(Not the same as "what went well". Things that worked but only because of fortunate timing or configuration that we didn't choose deliberately.)
+
+- …
+
+## Appendix: evidence links
+
+- Container logs: `/var/log/azaion/rollback-<timestamp>.log`
+- Container inspect: `/var/log/azaion/rollback-<timestamp>.inspect.json`
+- Grafana dashboard snapshot: <url>
+- Slack thread: <url>
+- Deploy ticket: <Jira link>