# Production Incident Post-Mortem — Template **Save as**: `_docs/06_metrics/postmortem__.md` **Required**: every production rollback (per `_docs/04_deploy/deployment_procedures.md` §5). **Recommended**: any user-impacting incident even if no rollback was needed. **Owner**: the on-call engineer at the time of the incident. **Deadline**: within 24 hours of the incident. --- ## Header | Field | Value | |-------|-------| | Incident date | YYYY-MM-DD | | Detection time (UTC) | YYYY-MM-DDTHH:MM:SSZ | | Mitigation time (UTC) | YYYY-MM-DDTHH:MM:SSZ | | Duration (user-impacting) | mm:ss | | Affected environment | staging / production | | Detected by | alert / smoke test / user report / operator | | Severity | Critical / High / Medium | | Deploy SHA at incident start | `` | | Rollback SHA (if rolled back) | `` | ## Timeline (UTC) ``` HH:MM (source: alert / Slack / log file) HH:MM … ``` Be liberal with entries — every paging, every Slack message, every action taken. The point is to make the post-mortem reproducible without re-asking the operator. ## Detection How was the issue first noticed? - Alert: which one? Was the threshold appropriate? Did it fire in time? - User report: how did the user reach us? How long after the incident started? - Smoke test: which step? (1–6 from `scripts/smoke.sh`) ## Impact - User impact (number of failed requests, revenue, data loss — be specific) - Internal impact (engineering time, lost productivity) - Regulatory / compliance impact (if any) ## Root cause One paragraph. Include the specific commit / config change / external event. Link to the failing test / log line that proves the cause. > Avoid "human error" as a root cause — it's almost never a useful answer. Focus on the system gap that allowed the human action to cause harm. ## Repair - What action mitigated the user impact? (Rollback, config change, restart, etc.) - What action fully resolved the issue? (Code fix, infrastructure change, etc.) - Were there any side-effects of the repair? (Data loss, missed messages, etc.) ## Detection gaps What would we want the system to have done instead? - New alert(s) needed? With what threshold? - New health check needed? At what level? - Better dashboard panel? - New smoke-test step? ## Prevention | Owner | Action | Target date | |-------|--------|-------------| | @… | | YYYY-MM-DD | | @… | … | YYYY-MM-DD | Each row MUST be tracked as a Jira ticket (per `.cursor/rules/tracker.mdc`). Reference the ticket here. ## What went well (Resist the urge to skip this. Reinforces good habits.) - … ## What was lucky (Not the same as "what went well". Things that worked but only because of fortunate timing or configuration that we didn't choose deliberately.) - … ## Appendix: evidence links - Container logs: `/var/log/azaion/rollback-.log` - Container inspect: `/var/log/azaion/rollback-.inspect.json` - Grafana dashboard snapshot: - Slack thread: - Deploy ticket: