mirror of
https://github.com/azaion/admin.git
synced 2026-06-21 11:31:10 +00:00
refactor: remove deploy.cmd and update Dockerfile for health checks
- Deleted the deploy.cmd script as it was no longer needed.
- Updated Dockerfile to include curl for health checks and added a non-root user for improved security.
- Modified health check command to use curl for better reliability.
- Adjusted docker-compose.test.yml to reflect changes in health check configuration.
- Cleaned up appsettings.json and removed unused configuration properties.
- Removed Resource entity and related requests from the codebase as part of the architectural shift.
- Updated documentation to reflect the removal of hardware binding and related endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,99 @@
# Performance Test Report — Cycle 1

**Date**: 2026-05-13
**Cycle**: 1
**Verdict**: **PASS** — all thresholds met, 0% error rate.
**Runner**: k6 v2.0.0 (local) against `docker-compose.test.yml` (Postgres 16-alpine + .NET admin API), seeded with 500 perf users.
**Artifacts**: `scripts/perf-scenarios.js`, `scripts/run-performance-tests.sh`, raw JSON at `e2e/test-results/perf-summary.json`.

## Scenarios run

| ID | Scenario | Threshold | Observed (p95) | Verdict |
|----|----------|-----------|---------------:|---------|
| NFT-PERF-01 | Login (10 VUs, 30s) | p95 < 500ms · err < 1% | **33.4ms · 0%** | Pass (15× headroom) |
| NFT-PERF-04 | User list (10 VUs, 30s, 500 users seeded) | p95 < 1000ms · err < 1% | **152.5ms · 0%** | Pass (6.5× headroom) |
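
The headroom figures in the Verdict column appear to be the ratio of each threshold to the observed p95. A minimal sketch of that arithmetic follows; the `headroom` helper and the `scenarios` array are illustrative names, not part of `scripts/perf-scenarios.js`.

```javascript
// Headroom = threshold / observed p95: how many times slower the endpoint
// could get before it breaches its threshold.
// `headroom` and `scenarios` are hypothetical names used for illustration.
function headroom(thresholdMs, observedP95Ms) {
  if (observedP95Ms <= 0) {
    throw new RangeError("observed p95 must be positive");
  }
  return thresholdMs / observedP95Ms;
}

// Values copied from the "Scenarios run" table.
const scenarios = [
  { id: "NFT-PERF-01", thresholdMs: 500, p95Ms: 33.4 },
  { id: "NFT-PERF-04", thresholdMs: 1000, p95Ms: 152.5 },
];

for (const s of scenarios) {
  console.log(`${s.id}: ${headroom(s.thresholdMs, s.p95Ms).toFixed(2)}x`);
}
```

The table quotes these ratios as approximate figures (15× and 6.5×).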

## Scenarios skipped

| ID | Scenario | Reason |
|----|----------|--------|
| NFT-PERF-02 | Encrypted resource download (small) | Endpoint deleted (AZ-183 OTA revert + AZ-197 hardware removal). Pruned from `_docs/02_document/tests/performance-tests.md`. |
| NFT-PERF-03 | Encrypted resource download (large) | Same — the OTA / hardware-bound download path no longer exists. |

## Detailed metrics (full distribution)

### NFT-PERF-01 — Login

| Metric | Value |
|--------|------:|
| Iterations | 2 617 |
| min | 1.3 ms |
| median | 6.3 ms |
| avg | 13.7 ms |
| p90 | 18.6 ms |
| **p95** | **33.4 ms** |
| max | 630.0 ms (single outlier: the first request, which pays for JIT compilation and connection-pool warmup) |
| Error rate | 0.00% |
| Checks | 2 617 / 2 617 (status 200, token returned) |

### NFT-PERF-04 — User list (500 users)

| Metric | Value |
|--------|------:|
| Iterations | 1 944 |
| min | 3.1 ms |
| median | 12.0 ms |
| avg | 43.8 ms |
| p90 | 86.9 ms |
| **p95** | **152.5 ms** |
| max | 1 974.6 ms (cold-cache outlier) |
| Error rate | 0.00% |
| Checks | 1 944 / 1 944 (status 200, ≥ 500 users returned) |

### Aggregate

| Metric | Value |
|--------|------:|
| Total iterations | 4 561 |
| Total HTTP requests | 4 562 |
| Aggregate throughput | 65.1 req/s |
| Max VUs | 20 (10 per scenario, sequential) |
| Run duration | ~70 s (incl. 5 s gap between scenarios) |

## Threshold table

All four `options.thresholds` entries returned `ok: true`:

```
http_req_duration{scenario:nft_perf_01_login} p(95)<500 → ok
http_req_duration{scenario:nft_perf_04_user_list} p(95)<1000 → ok
http_req_failed{scenario:nft_perf_01_login} rate<0.01 → ok (rate=0)
http_req_failed{scenario:nft_perf_04_user_list} rate<0.01 → ok (rate=0)
```
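
For orientation, the four entries above could come from a k6 `options` object shaped roughly like the sketch below. The scenario names, VU counts, durations and threshold expressions are taken from this report; the `constant-vus` executor, the `exec` function names and the `35s` start offset are assumptions, and the actual `scripts/perf-scenarios.js` may differ.

```javascript
// Sketch of a k6 options object matching the thresholds listed above.
// In a real k6 script this would be declared as `export const options`.
const options = {
  scenarios: {
    nft_perf_01_login: {
      executor: "constant-vus", // assumption: fixed 10 VUs for 30s
      vus: 10,
      duration: "30s",
      exec: "login", // hypothetical exported function name
    },
    nft_perf_04_user_list: {
      executor: "constant-vus",
      vus: 10,
      duration: "30s",
      startTime: "35s", // assumption: 30s login run plus the 5s gap
      exec: "userList", // hypothetical exported function name
    },
  },
  thresholds: {
    // k6 tags every sample with its scenario name, so each threshold
    // can be scoped to a single scenario.
    "http_req_duration{scenario:nft_perf_01_login}": ["p(95)<500"],
    "http_req_duration{scenario:nft_perf_04_user_list}": ["p(95)<1000"],
    "http_req_failed{scenario:nft_perf_01_login}": ["rate<0.01"],
    "http_req_failed{scenario:nft_perf_04_user_list}": ["rate<0.01"],
  },
};

console.log(Object.keys(options.thresholds).length, "thresholds defined");
```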

## Environment

- Host: macOS 25.4.0 (Apple Silicon)
- Docker Desktop, single host, no resource throttling
- SUT: `system-under-test` container built from repo `Dockerfile`, running on `http://localhost:8080`
- DB: `test-db` (postgres:16-alpine), running on the same Docker host
- Seed: functional fixtures from `e2e/db-init/00_run_all.sh` + `99_test_seed.sql`, plus 500 dummy `perf-user-NNNNN@perf.azaion.com` rows inserted by `run-performance-tests.sh` after the SUT readiness check
- k6 v2.0.0 (Homebrew bottle, arm64)
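
The `perf-user-NNNNN@perf.azaion.com` pattern in the seed step can be illustrated with a small generator. This is only a sketch of the naming scheme: the `perfUserEmails` helper is hypothetical, and starting the counter at 1 is an assumption, since the report does not say how `run-performance-tests.sh` numbers the rows.

```javascript
// Generates seed e-mail addresses in the perf-user-NNNNN pattern.
// Assumption: NNNNN is a 1-based counter zero-padded to five digits.
function perfUserEmails(count) {
  return Array.from(
    { length: count },
    (_, i) => `perf-user-${String(i + 1).padStart(5, "0")}@perf.azaion.com`
  );
}

const emails = perfUserEmails(500);
console.log(emails[0], emails[emails.length - 1]);
```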

## Caveats / coverage gaps

1. **Single-host run** — k6 and the SUT ran on the same machine, so the measurements include no network RTT and no inter-AZ latency. Production numbers will be higher; the 15×/6.5× headroom should absorb that comfortably for an internal admin API.
2. **No DB warmup phase** — the p99/max values of both scenarios include cold-cache outliers (login max 630ms, user-list max ~2s). The p95 already excludes those, but a future iteration could add a 5–10s warmup ramp.
3. **No realistic load on the user-list filter path** — only the unfiltered `GET /users` is exercised. Adding a `?searchEmail=` variant would catch the case where the LinqToDB `WhereIf` fails to fold into the SQL.
4. **No `/classes` CRUD perf coverage** — cycle 1 added these endpoints (AZ-513) but the perf spec was not extended. Recommend adding an NFT-PERF-05 in the next test-spec sync.
5. **`acceptance_criteria.md` is stale post-cycle-1** — AC-13/14/15/16 (hardware binding) and AC-17–24 (resource management) reference deleted features. Step 12 (Test-Spec Sync) of cycle 1 missed this. Surface in Step 17 retro and clean up in cycle 2.
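
Caveat 2 relies on the p95 being insensitive to a single cold-start outlier. The sketch below shows that effect on synthetic data with a nearest-rank percentile. It is illustrative only: k6 computes percentiles with its own method, and the sample values here are invented, not taken from the run.

```javascript
// Nearest-rank percentile over a list of durations (milliseconds).
// Illustrative only; this is not k6's percentile implementation.
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new RangeError("no samples");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Synthetic data: 99 fast requests plus one cold-start outlier.
const durations = [...Array(99).fill(10), 630];

console.log("p95:", percentile(durations, 95));
console.log("max:", Math.max(...durations));
```

The outlier dominates the max but leaves the p95 untouched, which is why the thresholds in this report are expressed on p95 rather than max.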

## Recommendations for next cycle

- **Cycle 2 test-spec sync must prune AC-13..24** and add an AC for `/classes` CRUD.
- **Add NFT-PERF-05** for `POST /classes` and `PATCH /classes/{id}` to cover the new write paths.
- **CI gate**: wire `scripts/run-performance-tests.sh` into the deploy pipeline so threshold breaches block release. Today it is run-on-demand only.
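
One way the CI gate could decide whether to block a release is to scan the summary JSON for thresholds that did not pass. The sketch below assumes the summary has the shape k6 passes to `handleSummary()` (`metrics[<name>].thresholds[<expression>].ok`); whether `e2e/test-results/perf-summary.json` uses that shape is not confirmed by this report, and `failedThresholds` is a hypothetical helper.

```javascript
// Returns the threshold expressions that did not pass.
// Assumed input shape: summary.metrics[<metric>].thresholds[<expr>].ok
function failedThresholds(summary) {
  const failures = [];
  for (const [metric, data] of Object.entries(summary.metrics ?? {})) {
    for (const [expr, result] of Object.entries(data.thresholds ?? {})) {
      if (!result.ok) {
        failures.push(`${metric} ${expr}`);
      }
    }
  }
  return failures;
}

// Inline sample standing in for the parsed summary file.
const sampleSummary = {
  metrics: {
    "http_req_duration{scenario:nft_perf_01_login}": {
      thresholds: { "p(95)<500": { ok: true } },
    },
    "http_req_failed{scenario:nft_perf_01_login}": {
      thresholds: { "rate<0.01": { ok: true } },
    },
  },
};

const failures = failedThresholds(sampleSummary);
console.log(failures.length === 0 ? "gate: pass" : `gate: fail (${failures.join(", ")})`);
```

A pipeline step would exit non-zero when the returned list is non-empty. k6 itself also exits with a non-zero status when a threshold fails, so propagating the exit code of `scripts/run-performance-tests.sh` may be enough on its own.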

## Verdict logic

PASS — all thresholds met, no failed checks, no errors, no warn-band scenarios. Auto-chain to Step 16 (Deploy).
@@ -0,0 +1,98 @@
# Production Incident Post-Mortem — Template

**Save as**: `_docs/06_metrics/postmortem_<YYYY-MM-DD>_<short-slug>.md`

**Required**: every production rollback (per `_docs/04_deploy/deployment_procedures.md` §5).
**Recommended**: any user-impacting incident even if no rollback was needed.
**Owner**: the on-call engineer at the time of the incident.
**Deadline**: within 24 hours of the incident.
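
The "Save as" pattern above can be produced mechanically from the incident date and a short title. The helper below is a sketch: `postmortemPath` is a hypothetical name, and the lowercase, hyphen-separated slug format is an assumption, since the template only says `<short-slug>`.

```javascript
// Builds the post-mortem file path from an incident date and a short title.
// Assumption: the slug is lowercase ASCII with hyphens between words.
function postmortemPath(incidentDate, title) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(incidentDate)) {
    throw new RangeError("incidentDate must be YYYY-MM-DD");
  }
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric
    .replace(/^-+|-+$/g, ""); // trim leading/trailing hyphens
  if (slug === "") {
    throw new RangeError("title must contain at least one letter or digit");
  }
  return `_docs/06_metrics/postmortem_${incidentDate}_${slug}.md`;
}

console.log(postmortemPath("2026-05-13", "Login 5xx spike"));
```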

---

## Header

| Field | Value |
|-------|-------|
| Incident date | YYYY-MM-DD |
| Detection time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
| Mitigation time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
| Duration (user-impacting) | mm:ss |
| Affected environment | staging / production |
| Detected by | alert / smoke test / user report / operator |
| Severity | Critical / High / Medium |
| Deploy SHA at incident start | `<full sha>` |
| Rollback SHA (if rolled back) | `<full sha>` |
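
The `mm:ss` duration can be derived from two of the UTC timestamps in the table. The sketch below measures from a start timestamp to the mitigation time; using the detection time as the start is an assumption and gives a lower bound, because user impact may have begun before detection. `formatImpactDuration` is a hypothetical helper.

```javascript
// Formats the interval between two ISO 8601 UTC timestamps as mm:ss.
// Minutes are not wrapped at 60, so a 75-minute incident yields "75:00".
function formatImpactDuration(startIso, endIso) {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  if (Number.isNaN(start) || Number.isNaN(end)) {
    throw new RangeError("timestamps must be valid ISO 8601 strings");
  }
  if (end < start) {
    throw new RangeError("end precedes start");
  }
  const totalSeconds = Math.round((end - start) / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${String(minutes).padStart(2, "0")}:${String(seconds).padStart(2, "0")}`;
}

// Example timestamps are invented for illustration.
console.log(formatImpactDuration("2026-05-13T10:00:00Z", "2026-05-13T10:12:34Z"));
```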

## Timeline (UTC)

```
HH:MM <event> (source: alert / Slack / log file)
HH:MM <event>
…
```

Be liberal with entries — every page, every Slack message, every action taken. The point is to make the post-mortem reproducible without re-asking the operator.

## Detection

How was the issue first noticed?

- Alert: which one? Was the threshold appropriate? Did it fire in time?
- User report: how did the user reach us? How long after the incident started?
- Smoke test: which step? (1–6 from `scripts/smoke.sh`)

## Impact

- User impact (number of failed requests, revenue, data loss — be specific)
- Internal impact (engineering time, lost productivity)
- Regulatory / compliance impact (if any)

## Root cause

One paragraph. Include the specific commit / config change / external event. Link to the failing test / log line that proves the cause.

> Avoid "human error" as a root cause — it's almost never a useful answer. Focus on the system gap that allowed the human action to cause harm.

## Repair

- What action mitigated the user impact? (Rollback, config change, restart, etc.)
- What action fully resolved the issue? (Code fix, infrastructure change, etc.)
- Were there any side-effects of the repair? (Data loss, missed messages, etc.)

## Detection gaps

What would we want the system to have done instead?

- New alert(s) needed? With what threshold?
- New health check needed? At what level?
- Better dashboard panel?
- New smoke-test step?

## Prevention

| Owner | Action | Target date |
|-------|--------|-------------|
| @… | <concrete action — write a test, add an alert, change a procedure> | YYYY-MM-DD |
| @… | … | YYYY-MM-DD |

Each row MUST be tracked as a Jira ticket (per `.cursor/rules/tracker.mdc`). Reference the ticket here.

## What went well

(Resist the urge to skip this. It reinforces good habits.)

- …

## What was lucky

(Not the same as "what went well". Things that worked but only because of fortunate timing or configuration that we didn't choose deliberately.)

- …

## Appendix: evidence links

- Container logs: `/var/log/azaion/rollback-<timestamp>.log`
- Container inspect: `/var/log/azaion/rollback-<timestamp>.inspect.json`
- Grafana dashboard snapshot: <url>
- Slack thread: <url>
- Deploy ticket: <Jira link>