- Deleted the deploy.cmd script as it was no longer needed. - Updated Dockerfile to include curl for health checks and added a non-root user for improved security. - Modified health check command to use curl for better reliability. - Adjusted docker-compose.test.yml to reflect changes in health check configuration. - Cleaned up appsettings.json and removed unused configuration properties. - Removed Resource entity and related requests from the codebase as part of the architectural shift. - Updated documentation to reflect the removal of hardware binding and related endpoints. Co-authored-by: Cursor <cursoragent@cursor.com>
3.1 KiB
Production Incident Post-Mortem — Template
Save as: _docs/06_metrics/postmortem_<YYYY-MM-DD>_<short-slug>.md
Required: every production rollback (per _docs/04_deploy/deployment_procedures.md §5).
Recommended: any user-impacting incident even if no rollback was needed.
Owner: the on-call engineer at the time of the incident.
Deadline: within 24 hours of the incident.
Header
| Field | Value |
|---|---|
| Incident date | YYYY-MM-DD |
| Detection time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
| Mitigation time (UTC) | YYYY-MM-DDTHH:MM:SSZ |
| Duration (user-impacting) | mm:ss |
| Affected environment | staging / production |
| Detected by | alert / smoke test / user report / operator |
| Severity | Critical / High / Medium |
| Deploy SHA at incident start | <full sha> |
| Rollback SHA (if rolled back) | <full sha> |
Timeline (UTC)
HH:MM <event> (source: alert / Slack / log file)
HH:MM <event>
…
Be liberal with entries — every paging, every Slack message, every action taken. The point is to make the post-mortem reproducible without re-asking the operator.
Detection
How was the issue first noticed?
- Alert: which one? Was the threshold appropriate? Did it fire in time?
- User report: how did the user reach us? How long after the incident started?
- Smoke test: which step? (1–6 from
scripts/smoke.sh)
Impact
- User impact (number of failed requests, revenue, data loss — be specific)
- Internal impact (engineering time, lost productivity)
- Regulatory / compliance impact (if any)
Root cause
One paragraph. Include the specific commit / config change / external event. Link to the failing test / log line that proves the cause.
Avoid "human error" as a root cause — it's almost never a useful answer. Focus on the system gap that allowed the human action to cause harm.
Repair
- What action mitigated the user impact? (Rollback, config change, restart, etc.)
- What action fully resolved the issue? (Code fix, infrastructure change, etc.)
- Were there any side-effects of the repair? (Data loss, missed messages, etc.)
Detection gaps
What would we want the system to have done instead?
- New alert(s) needed? With what threshold?
- New health check needed? At what level?
- Better dashboard panel?
- New smoke-test step?
Prevention
| Owner | Action | Target date |
|---|---|---|
| @… | <concrete action — write a test, add an alert, change a procedure> | YYYY-MM-DD |
| @… | … | YYYY-MM-DD |
Each row MUST be tracked as a Jira ticket (per .cursor/rules/tracker.mdc). Reference the ticket here.
What went well
(Resist the urge to skip this. Reinforces good habits.)
- …
What was lucky
(Not the same as "what went well". Things that worked but only because of fortunate timing or configuration that we didn't choose deliberately.)
- …
Appendix: evidence links
- Container logs:
/var/log/azaion/rollback-<timestamp>.log - Container inspect:
/var/log/azaion/rollback-<timestamp>.inspect.json - Grafana dashboard snapshot:
- Slack thread:
- Deploy ticket: