refactor: remove deploy.cmd and update Dockerfile for health checks
ci/woodpecker/push/01-test Pipeline failed
ci/woodpecker/push/02-build-push unknown status

- Deleted the deploy.cmd script as it was no longer needed.
- Updated Dockerfile to include curl for health checks and added a non-root user for improved security.
- Modified health check command to use curl for better reliability.
- Adjusted docker-compose.test.yml to reflect changes in health check configuration.
- Cleaned up appsettings.json and removed unused configuration properties.
- Removed Resource entity and related requests from the codebase as part of the architectural shift.
- Updated documentation to reflect the removal of hardware binding and related endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-05-13 08:47:21 +03:00
parent 43fe38e67d
commit c7b297de83
76 changed files with 4034 additions and 832 deletions
@@ -0,0 +1,99 @@
# Performance Test Report — Cycle 1
**Date**: 2026-05-13
**Cycle**: 1
**Verdict**: **PASS** — all thresholds met, 0% error rate.
**Runner**: k6 v2.0.0 (local) against `docker-compose.test.yml` (Postgres 16-alpine + .NET admin API), seeded with 500 perf users.
**Artifacts**: `scripts/perf-scenarios.js`, `scripts/run-performance-tests.sh`, raw JSON at `e2e/test-results/perf-summary.json`.
## Scenarios run
| ID | Scenario | Threshold | Observed (p95) | Verdict |
|----|----------|-----------|---------------:|---------|
| NFT-PERF-01 | Login (10 VUs, 30s) | p95 < 500ms · err < 1% | **33.4ms · 0%** | Pass (15× headroom) |
| NFT-PERF-04 | User list (10 VUs, 30s, 500 users seeded) | p95 < 1000ms · err < 1% | **152.5ms · 0%** | Pass (6.5× headroom) |
## Scenarios skipped
| ID | Scenario | Reason |
|----|----------|--------|
| NFT-PERF-02 | Encrypted resource download (small) | Endpoint deleted (AZ-183 OTA revert + AZ-197 hardware removal). Pruned from `_docs/02_document/tests/performance-tests.md`. |
| NFT-PERF-03 | Encrypted resource download (large) | Same — the OTA / hardware-bound download path no longer exists. |
## Detailed metrics (full distribution)
### NFT-PERF-01 — Login
| Metric | Value |
|--------|------:|
| Iterations | 2 617 |
| min | 1.3 ms |
| median | 6.3 ms |
| avg | 13.7 ms |
| p90 | 18.6 ms |
| **p95** | **33.4 ms** |
| max | 630.0 ms (single outlier — first request after JIT/connection-pool warmup) |
| Error rate | 0.00% |
| Checks | 2 617 / 2 617 (status 200, token returned) |
### NFT-PERF-04 — User list (500 users)
| Metric | Value |
|--------|------:|
| Iterations | 1 944 |
| min | 3.1 ms |
| median | 12.0 ms |
| avg | 43.8 ms |
| p90 | 86.9 ms |
| **p95** | **152.5 ms** |
| max | 1 974.6 ms (cold-cache outlier) |
| Error rate | 0.00% |
| Checks | 1 944 / 1 944 (status 200, ≥ 500 users returned) |
### Aggregate
| Metric | Value |
|--------|------:|
| Total iterations | 4 561 |
| Total HTTP requests | 4 562 |
| Aggregate throughput | 65.1 req/s |
| Max VUs | 20 (10 per scenario, sequential) |
| Run duration | ~70 s (incl. 5 s gap between scenarios) |
## Threshold table
All four `options.thresholds` entries returned `ok: true`:
```
http_req_duration{scenario:nft_perf_01_login} p(95)<500 → ok
http_req_duration{scenario:nft_perf_04_user_list} p(95)<1000 → ok
http_req_failed{scenario:nft_perf_01_login} rate<0.01 → ok (rate=0)
http_req_failed{scenario:nft_perf_04_user_list} rate<0.01 → ok (rate=0)
```
## Environment
- Host: macOS 25.4.0 (Apple Silicon)
- Docker Desktop, single host, no resource throttling
- SUT: `system-under-test` container built from repo `Dockerfile`, running on `http://localhost:8080`
- DB: `test-db` (postgres:16-alpine), in-process to the same Docker host
- Seed: functional fixtures from `e2e/db-init/00_run_all.sh` + `99_test_seed.sql`, plus 500 dummy `perf-user-NNNNN@perf.azaion.com` rows inserted by `run-performance-tests.sh` after SUT readiness check
- k6 v2.0.0 (Homebrew bottle, arm64)
## Caveats / coverage gaps
1. **Single-host run** — perf was measured with k6 and the SUT on the same machine, no network RTT, no inter-AZ latency. Production numbers will be higher; the 15×/6.5× headroom should absorb that comfortably for an internal admin API.
2. **No DB warmup phase** — both p99/max values include cold-cache outliers (login max 630ms, user-list max ~2s). The p95 already excludes those, but a future iteration could add a 510s warmup ramp.
3. **No realistic load on the user-list filter path** — only the unfiltered `GET /users` is exercised. Adding a `?searchEmail=` variant would catch the case where the LinqToDB `WhereIf` fails to fold into the SQL.
4. **No `/classes` CRUD perf coverage** — cycle 1 added these endpoints (AZ-513) but the perf spec was not extended. Recommend adding a NFT-PERF-05 in the next test-spec sync.
5. **`acceptance_criteria.md` is stale post-cycle-1** — AC-13/14/15/16 (hardware binding) and AC-1724 (resource management) reference deleted features. Step 12 (Test-Spec Sync) of cycle 1 missed this. Surface in Step 17 retro and clean up in cycle 2.
## Recommendations for next cycle
- **Cycle 2 test-spec sync must prune AC-13..24** and add an AC for `/classes` CRUD.
- **Add NFT-PERF-05** for `POST /classes` and `PATCH /classes/{id}` to cover the new write paths.
- **CI gate**: wire `scripts/run-performance-tests.sh` into the deploy pipeline so threshold breaches block release. Today it is run-on-demand only.
## Verdict logic
PASS — all thresholds met, no failed checks, no errors, no warn-band scenarios. Auto-chain to Step 16 (Deploy).
+98
View File
@@ -0,0 +1,98 @@
# Production Incident Post-Mortem — Template
**Save as**: `_docs/06_metrics/postmortem_<YYYY-MM-DD>_<short-slug>.md`
**Required**: every production rollback (per `_docs/04_deploy/deployment_procedures.md` §5).
**Recommended**: any user-impacting incident even if no rollback was needed.
**Owner**: the on-call engineer at the time of the incident.
**Deadline**: within 24 hours of the incident.
---
## Header
| Field | Value |
|-------|-------|
| Incident date | YYYY-MM-DD |
| Detection time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
| Mitigation time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
| Duration (user-impacting) | mm:ss |
| Affected environment | staging / production |
| Detected by | alert / smoke test / user report / operator |
| Severity | Critical / High / Medium |
| Deploy SHA at incident start | `<full sha>` |
| Rollback SHA (if rolled back) | `<full sha>` |
## Timeline (UTC)
```
HH:MM <event> (source: alert / Slack / log file)
HH:MM <event>
```
Be liberal with entries — every paging, every Slack message, every action taken. The point is to make the post-mortem reproducible without re-asking the operator.
## Detection
How was the issue first noticed?
- Alert: which one? Was the threshold appropriate? Did it fire in time?
- User report: how did the user reach us? How long after the incident started?
- Smoke test: which step? (16 from `scripts/smoke.sh`)
## Impact
- User impact (number of failed requests, revenue, data loss — be specific)
- Internal impact (engineering time, lost productivity)
- Regulatory / compliance impact (if any)
## Root cause
One paragraph. Include the specific commit / config change / external event. Link to the failing test / log line that proves the cause.
> Avoid "human error" as a root cause — it's almost never a useful answer. Focus on the system gap that allowed the human action to cause harm.
## Repair
- What action mitigated the user impact? (Rollback, config change, restart, etc.)
- What action fully resolved the issue? (Code fix, infrastructure change, etc.)
- Were there any side-effects of the repair? (Data loss, missed messages, etc.)
## Detection gaps
What would we want the system to have done instead?
- New alert(s) needed? With what threshold?
- New health check needed? At what level?
- Better dashboard panel?
- New smoke-test step?
## Prevention
| Owner | Action | Target date |
|-------|--------|-------------|
| @… | <concrete action — write a test, add an alert, change a procedure> | YYYY-MM-DD |
| @… | … | YYYY-MM-DD |
Each row MUST be tracked as a Jira ticket (per `.cursor/rules/tracker.mdc`). Reference the ticket here.
## What went well
(Resist the urge to skip this. Reinforces good habits.)
-
## What was lucky
(Not the same as "what went well". Things that worked but only because of fortunate timing or configuration that we didn't choose deliberately.)
-
## Appendix: evidence links
- Container logs: `/var/log/azaion/rollback-<timestamp>.log`
- Container inspect: `/var/log/azaion/rollback-<timestamp>.inspect.json`
- Grafana dashboard snapshot: <url>
- Slack thread: <url>
- Deploy ticket: <Jira link>