Oleksandr Bezdieniezhnykh c7b297de83

2026-05-13 08:47:21 +03:00

Production Incident Post-Mortem — Template

Save as: _docs/06_metrics/postmortem_<YYYY-MM-DD>_<short-slug>.md
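
The naming convention above can be generated rather than typed by hand. The following is a minimal sketch, not part of the repository: the helper name `postmortem_path` and the slug rule (lowercase, non-alphanumeric runs collapsed to a hyphen) are assumptions for illustration, and the date and title in the example are hypothetical.

```python
import re
from datetime import date


def postmortem_path(incident_date: date, title: str) -> str:
    """Build the save path for a post-mortem from its date and a short title."""
    # Assumed slug rule: lowercase, with each run of characters that is not
    # a-z or 0-9 collapsed into a single hyphen, and edge hyphens trimmed.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    # date.isoformat() yields YYYY-MM-DD, matching the template's date format.
    return f"_docs/06_metrics/postmortem_{incident_date.isoformat()}_{slug}.md"


# Hypothetical incident, used only to show the shape of the result.
print(postmortem_path(date(2025, 1, 15), "Login API 5xx spike"))
# prints: _docs/06_metrics/postmortem_2025-01-15_login-api-5xx-spike.md
```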

Required: every production rollback (per _docs/04_deploy/deployment_procedures.md §5).
Recommended: any user-impacting incident, even if no rollback was needed.
Owner: the on-call engineer at the time of the incident.
Deadline: within 24 hours of the incident.

Header

Field	Value
Incident date	YYYY-MM-DD
Detection time (UTC)	YYYY-MM-DDTHH:MM:SSZ
Mitigation time (UTC)	YYYY-MM-DDTHH:MM:SSZ
Duration (user-impacting)	mm:ss
Affected environment	staging / production
Detected by	alert / smoke test / user report / operator
Severity	Critical / High / Medium
Deploy SHA at incident start	`<full sha>`
Rollback SHA (if rolled back)	`<full sha>`
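
The Duration field can be derived from two timestamps in the header's `YYYY-MM-DDTHH:MM:SSZ` format instead of being worked out by hand. The sketch below assumes the function names `parse_utc` and `duration_mm_ss`, which are illustrative and not part of the repository, and uses hypothetical timestamps. User impact may begin before detection, so pass the impact start time if it is known. Passing the detection time instead gives only a lower bound.

```python
from datetime import datetime


def parse_utc(stamp: str) -> datetime:
    """Parse a timestamp written as YYYY-MM-DDTHH:MM:SSZ (UTC)."""
    # The trailing Z is matched literally; both inputs are UTC, so naive
    # datetimes are sufficient for taking a difference.
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def duration_mm_ss(impact_start: str, mitigated: str) -> str:
    """Return the elapsed time between two UTC timestamps as mm:ss."""
    seconds = int((parse_utc(mitigated) - parse_utc(impact_start)).total_seconds())
    if seconds < 0:
        raise ValueError("mitigation time precedes impact start")
    # Minutes are not wrapped at 60, so a 75-minute incident reads 75:15.
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


# Hypothetical timestamps, used only to show the calculation.
print(duration_mm_ss("2025-01-15T10:02:30Z", "2025-01-15T11:17:45Z"))
# prints: 75:15
```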

Timeline (UTC)

HH:MM  <event>            (source: alert / Slack / log file)
HH:MM  <event>
…

Be liberal with entries: record every page, every Slack message, and every action taken. The point is to make the post-mortem reproducible without re-asking the operator.
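
When entries are gathered from several sources (alert history, Slack, logs), they often arrive out of order and in different time zones. The sketch below shows one way to normalise them into the `HH:MM  <event>  (source: …)` layout above. The `Entry` type and `render_timeline` function are hypothetical names introduced for illustration, the events are made up, and the code assumes every timestamp is timezone-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Entry:
    at: datetime      # must be timezone-aware
    event: str
    source: str = ""  # e.g. "alert", "Slack", or a log file name


def render_timeline(entries: list[Entry]) -> str:
    """Sort entries chronologically and render them as UTC timeline lines."""
    lines = []
    # Aware datetimes compare by absolute time, so mixed offsets sort correctly.
    for entry in sorted(entries, key=lambda e: e.at):
        stamp = entry.at.astimezone(timezone.utc).strftime("%H:%M")
        suffix = f"  (source: {entry.source})" if entry.source else ""
        lines.append(f"{stamp}  {entry.event}{suffix}")
    return "\n".join(lines)


# Hypothetical events: one logged in local time (UTC+3), one already in UTC.
local = timezone(timedelta(hours=3))
print(render_timeline([
    Entry(datetime(2025, 1, 15, 13, 5, tzinfo=local), "Rollback started", "Slack"),
    Entry(datetime(2025, 1, 15, 10, 2, tzinfo=timezone.utc), "Alert fired", "alert"),
]))
# prints:
# 10:02  Alert fired  (source: alert)
# 10:05  Rollback started  (source: Slack)
```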

Detection

How was the issue first noticed?

Alert: which one? Was the threshold appropriate? Did it fire in time?
User report: how did the user reach us? How long after the incident started?
Smoke test: which step? (1–6 from scripts/smoke.sh)

Impact

User impact (number of failed requests, revenue, data loss — be specific)
Internal impact (engineering time, lost productivity)
Regulatory / compliance impact (if any)

Root cause

One paragraph. Include the specific commit / config change / external event. Link to the failing test / log line that proves the cause.

Avoid "human error" as a root cause — it's almost never a useful answer. Focus on the system gap that allowed the human action to cause harm.

Repair

What action mitigated the user impact? (Rollback, config change, restart, etc.)
What action fully resolved the issue? (Code fix, infrastructure change, etc.)
Were there any side-effects of the repair? (Data loss, missed messages, etc.)

Detection gaps

What would we want the system to have done instead?

New alert(s) needed? With what threshold?
New health check needed? At what level?
Better dashboard panel?
New smoke-test step?

Prevention

Owner	Action	Target date
@…	<concrete action — write a test, add an alert, change a procedure>	YYYY-MM-DD
@…	…	YYYY-MM-DD

Each row MUST be tracked as a Jira ticket (per .cursor/rules/tracker.mdc). Reference the ticket here.

What went well

(Resist the urge to skip this. Recording what worked reinforces good habits.)

What was lucky

(Not the same as "what went well". List things that worked only because of fortunate timing or a configuration that we didn't choose deliberately.)

Appendix: evidence links

Container logs: /var/log/azaion/rollback-<timestamp>.log
Container inspect: /var/log/azaion/rollback-<timestamp>.inspect.json
Grafana dashboard snapshot:
Slack thread:
Deploy ticket:
