refactor: remove deploy.cmd and update Dockerfile for health checks

- Deleted the deploy.cmd script as it was no longer needed.
- Updated Dockerfile to include curl for health checks and added a non-root user for improved security.
- Modified health check command to use curl for better reliability.
- Adjusted docker-compose.test.yml to reflect changes in health check configuration.
- Cleaned up appsettings.json and removed unused configuration properties.
- Removed Resource entity and related requests from the codebase as part of the architectural shift.
- Updated documentation to reflect the removal of hardware binding and related endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-21 11:31:10 +00:00 · 2026-05-13 08:47:21 +03:00
parent 43fe38e67d
commit c7b297de83
76 changed files with 4034 additions and 832 deletions
@@ -0,0 +1,98 @@
+# Production Incident Post-Mortem — Template
+
+**Save as**: `_docs/06_metrics/postmortem_<YYYY-MM-DD>_<short-slug>.md`
+
+**Required**: every production rollback (per `_docs/04_deploy/deployment_procedures.md` §5).
+**Recommended**: any user-impacting incident even if no rollback was needed.
+**Owner**: the on-call engineer at the time of the incident.
+**Deadline**: within 24 hours of the incident.
+
+---
+
+## Header
+
+| Field | Value |
+|-------|-------|
+| Incident date | YYYY-MM-DD |
+| Detection time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
+| Mitigation time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
+| Duration (user-impacting) | mm:ss |
+| Affected environment | staging / production |
+| Detected by | alert / smoke test / user report / operator |
+| Severity | Critical / High / Medium |
+| Deploy SHA at incident start | `<full sha>` |
+| Rollback SHA (if rolled back) | `<full sha>` |
+
+## Timeline (UTC)
+
+```
+HH:MM  <event>            (source: alert / Slack / log file)
+HH:MM  <event>
+…
+```
+
+Be liberal with entries: log every page, every Slack message, and every action taken. The goal is a timeline complete enough that a reader can reconstruct the incident without re-asking the operator.
+
+## Detection
+
+How was the issue first noticed?
+
+- Alert: which one? Was the threshold appropriate? Did it fire in time?
+- User report: how did the user reach us? How long after the incident started?
+- Smoke test: which step? (1–6 from `scripts/smoke.sh`)
+
+## Impact
+
+- User impact (number of failed requests, revenue, data loss — be specific)
+- Internal impact (engineering time, lost productivity)
+- Regulatory / compliance impact (if any)
+
+## Root cause
+
+One paragraph. Include the specific commit / config change / external event. Link to the failing test / log line that proves the cause.
+
+> Avoid "human error" as a root cause — it's almost never a useful answer. Focus on the system gap that allowed the human action to cause harm.
+
+## Repair
+
+- What action mitigated the user impact? (Rollback, config change, restart, etc.)
+- What action fully resolved the issue? (Code fix, infrastructure change, etc.)
+- Were there any side-effects of the repair? (Data loss, missed messages, etc.)
+
+## Detection gaps
+
+What would we want the system to have done instead?
+
+- New alert(s) needed? With what threshold?
+- New health check needed? At what level?
+- Better dashboard panel?
+- New smoke-test step?
+
+## Prevention
+
+| Owner | Action | Ticket | Target date |
+|-------|--------|--------|-------------|
+| @… | <concrete action: write a test, add an alert, change a procedure> | <Jira link> | YYYY-MM-DD |
+| @… | … | … | YYYY-MM-DD |
+
+Each row MUST be tracked as a Jira ticket (per `.cursor/rules/tracker.mdc`). Reference the ticket here.
+
+## What went well
+
+(Resist the urge to skip this section. Recording what worked reinforces good habits.)
+
+- …
+
+## What was lucky
+
+(Not the same as "what went well". List things that worked only because of fortunate timing or a configuration we did not choose deliberately.)
+
+- …
+
+## Appendix: evidence links
+
+- Container logs: `/var/log/azaion/rollback-<timestamp>.log`
+- Container inspect: `/var/log/azaion/rollback-<timestamp>.inspect.json`
+- Grafana dashboard snapshot: <url>
+- Slack thread: <url>
+- Deploy ticket: <Jira link>
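
The Header table in the template asks for a duration in `mm:ss` and the file must be saved under a dated name. A minimal sketch of how both could be derived is shown below. It is not part of this commit: the helper names `impact_duration` and `postmortem_path`, the sample timestamps, and the slug are all hypothetical. It also assumes the start of user impact is known; the template only records detection time, and using that as the start would understate the impact.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse a timestamp written in the template's YYYY-MM-DDTHH:MM:SSZ form."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def impact_duration(start: str, mitigation: str) -> str:
    """Return the elapsed time as mm:ss (the minutes field may exceed 59)."""
    seconds = int((parse_utc(mitigation) - parse_utc(start)).total_seconds())
    if seconds < 0:
        raise ValueError("mitigation time precedes start time")
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def postmortem_path(incident_date: str, slug: str) -> str:
    """Build the file name following the template's 'Save as' convention."""
    return f"_docs/06_metrics/postmortem_{incident_date}_{slug}.md"


# Hypothetical sample values, for illustration only.
print(impact_duration("2026-01-01T10:00:00Z", "2026-01-01T10:42:30Z"))
print(postmortem_path("2026-01-01", "db-timeout"))
```

Keeping minutes unbounded rather than rolling over into hours matches the `mm:ss` placeholder in the table; a team that prefers `hh:mm:ss` would only need to change the final formatting step.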