Files
admin/_docs/06_metrics/postmortem_template.md
Oleksandr Bezdieniezhnykh c7b297de83
ci/woodpecker/push/01-test Pipeline failed
ci/woodpecker/push/02-build-push unknown status
refactor: remove deploy.cmd and update Dockerfile for health checks
- Deleted the deploy.cmd script as it was no longer needed.
- Updated Dockerfile to include curl for health checks and added a non-root user for improved security.
- Modified health check command to use curl for better reliability.
- Adjusted docker-compose.test.yml to reflect changes in health check configuration.
- Cleaned up appsettings.json and removed unused configuration properties.
- Removed Resource entity and related requests from the codebase as part of the architectural shift.
- Updated documentation to reflect the removal of hardware binding and related endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-13 08:47:21 +03:00

3.1 KiB
Raw Permalink Blame History

Production Incident Post-Mortem — Template

Save as: _docs/06_metrics/postmortem_<YYYY-MM-DD>_<short-slug>.md

Required: every production rollback (per _docs/04_deploy/deployment_procedures.md §5). Recommended: any user-impacting incident even if no rollback was needed. Owner: the on-call engineer at the time of the incident. Deadline: within 24 hours of the incident.


Header

Field Value
Incident date YYYY-MM-DD
Detection time (UTC) YYYY-MM-DDTHH:MM:SSZ
Mitigation time (UTC) YYYY-MM-DDTHH:MM:SSZ
Duration (user-impacting) mm:ss
Affected environment staging / production
Detected by alert / smoke test / user report / operator
Severity Critical / High / Medium
Deploy SHA at incident start <full sha>
Rollback SHA (if rolled back) <full sha>

Timeline (UTC)

HH:MM  <event>            (source: alert / Slack / log file)
HH:MM  <event>
…

Be liberal with entries — every paging, every Slack message, every action taken. The point is to make the post-mortem reproducible without re-asking the operator.

Detection

How was the issue first noticed?

  • Alert: which one? Was the threshold appropriate? Did it fire in time?
  • User report: how did the user reach us? How long after the incident started?
  • Smoke test: which step? (16 from scripts/smoke.sh)

Impact

  • User impact (number of failed requests, revenue, data loss — be specific)
  • Internal impact (engineering time, lost productivity)
  • Regulatory / compliance impact (if any)

Root cause

One paragraph. Include the specific commit / config change / external event. Link to the failing test / log line that proves the cause.

Avoid "human error" as a root cause — it's almost never a useful answer. Focus on the system gap that allowed the human action to cause harm.

Repair

  • What action mitigated the user impact? (Rollback, config change, restart, etc.)
  • What action fully resolved the issue? (Code fix, infrastructure change, etc.)
  • Were there any side-effects of the repair? (Data loss, missed messages, etc.)

Detection gaps

What would we want the system to have done instead?

  • New alert(s) needed? With what threshold?
  • New health check needed? At what level?
  • Better dashboard panel?
  • New smoke-test step?

Prevention

Owner Action Target date
@… <concrete action — write a test, add an alert, change a procedure> YYYY-MM-DD
@… YYYY-MM-DD

Each row MUST be tracked as a Jira ticket (per .cursor/rules/tracker.mdc). Reference the ticket here.

What went well

(Resist the urge to skip this. Reinforces good habits.)

What was lucky

(Not the same as "what went well". Things that worked but only because of fortunate timing or configuration that we didn't choose deliberately.)

  • Container logs: /var/log/azaion/rollback-<timestamp>.log
  • Container inspect: /var/log/azaion/rollback-<timestamp>.inspect.json
  • Grafana dashboard snapshot:
  • Slack thread:
  • Deploy ticket: