Incident Playbook Template
Incident Overview
| Field | Value |
|---|---|
| Playbook Name | [Name] |
| Severity | Critical / High / Medium / Low |
| Last Updated | [YYYY-MM-DD] |
| Owner | [Team/Person] |
Detection
Symptoms
- [How will you know this incident is occurring?]
- Alert: [Alert name that triggers]
- User reports: [Expected user complaints]
Monitoring
- Dashboard: [Link to relevant dashboard]
- Logs: [Log query to investigate]
- Metrics: [Key metrics to watch]
Assessment
Impact Analysis
- Users affected: [All / Subset / Internal only]
- Data at risk: [Yes / No]
- Revenue impact: [High / Medium / Low / None]
Severity Determination
| Condition | Severity |
|---|---|
| Service completely down | Critical |
| Partial degradation | High |
| Intermittent issues | Medium |
| Minor impact | Low |
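If several conditions hold at once, the table above can be applied by taking the most severe match. The following is a minimal sketch of that lookup, not part of the template itself; the condition keys (`service_down`, `partial_degradation`, and so on) are hypothetical names chosen for illustration.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Hypothetical condition keys; the mapping mirrors the severity table.
SEVERITY_BY_CONDITION = {
    "service_down": Severity.CRITICAL,
    "partial_degradation": Severity.HIGH,
    "intermittent_issues": Severity.MEDIUM,
    "minor_impact": Severity.LOW,
}

# Ordered from most to least severe.
_ORDER = [Severity.CRITICAL, Severity.HIGH, Severity.MEDIUM, Severity.LOW]


def determine_severity(conditions):
    """Return the most severe level among the observed conditions."""
    levels = [SEVERITY_BY_CONDITION[c] for c in conditions]
    if not levels:
        raise ValueError("at least one condition is required")
    return min(levels, key=_ORDER.index)


if __name__ == "__main__":
    observed = ["minor_impact", "partial_degradation"]
    print(determine_severity(observed).value)
```

An unknown condition key raises `KeyError`, which is deliberate here: an unclassified condition should be triaged by a person rather than silently defaulted.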
Response
Immediate Actions (First 5 minutes)
- Acknowledge alert
- Verify incident is real (not false positive)
- Notify on-call team
- Start incident channel/call
Investigation Steps
- Check recent deployments
- Review error logs
- Check infrastructure metrics
- Identify affected components
Communication
| Audience | Channel | Frequency |
|---|---|---|
| Engineering | Slack #incidents | Continuous |
| Stakeholders | [Channel] | Every 30 min |
| Users | Status page | Major updates |
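The 30-minute stakeholder cadence is easy to lose track of during an incident. As a rough sketch, the due times can be computed up front; the function name `stakeholder_update_times` and its parameters are illustrative assumptions, not part of any existing tooling.

```python
from datetime import datetime, timedelta


def stakeholder_update_times(start, end, interval_minutes=30):
    """List the times at which a stakeholder update is due between start and end."""
    step = timedelta(minutes=interval_minutes)
    times = []
    due = start + step  # first update is one interval after the incident starts
    while due <= end:
        times.append(due)
        due += step
    return times


if __name__ == "__main__":
    incident_start = datetime(2026, 1, 1, 10, 0)
    expected_end = datetime(2026, 1, 1, 11, 15)
    for due in stakeholder_update_times(incident_start, expected_end):
        print(due.strftime("%H:%M"))
```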
Resolution
Common Fixes
Fix 1: [Common issue]
# Commands to fix
Expected outcome: [What should happen]
Fix 2: [Another common issue]
# Commands to fix
Expected outcome: [What should happen]
Rollback Procedure
- Identify last known good version
- Execute rollback
# Rollback commands
- Verify service restored
- Monitor for 15 minutes
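The final two rollback steps (verify, then monitor for 15 minutes) can be automated with a simple polling loop. This is a minimal sketch under stated assumptions: `check_health` is a hypothetical callable you supply (for example, one that probes a health endpoint), and the 30-second polling interval is an arbitrary choice, not something the playbook prescribes.

```python
import time


def monitor_after_rollback(check_health, duration_s=15 * 60, interval_s=30,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll check_health until duration_s elapses.

    Returns True if every check passed, False on the first failed check.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + duration_s
    while clock() < deadline:
        if not check_health():
            return False
        sleep(interval_s)
    return True


if __name__ == "__main__":
    # Hypothetical health check that always succeeds; short window for the demo.
    stable = monitor_after_rollback(lambda: True, duration_s=2, interval_s=1)
    print("stable" if stable else "unstable, consider escalating")
```

A `False` result means the rollback did not hold and the incident should stay open.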
Escalation Path
| Time Elapsed | Escalate To |
|---|---|
| 0-15 min | On-call engineer |
| 15-30 min | Team lead |
| 30-60 min | Engineering manager |
| 60+ min | Director/VP |
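The escalation table can be expressed as a threshold lookup on minutes elapsed since the incident started. A sketch follows, assuming that a boundary value (exactly 15, 30, or 60 minutes) belongs to the next tier up, since the table's ranges overlap at their edges; the function name `escalation_contact` is hypothetical.

```python
import bisect

# Lower bounds in minutes since incident start, and the role to engage.
_THRESHOLDS = [0, 15, 30, 60]
_ROLES = ["On-call engineer", "Team lead", "Engineering manager", "Director/VP"]


def escalation_contact(elapsed_minutes):
    """Return the role that should be engaged after elapsed_minutes."""
    if elapsed_minutes < 0:
        raise ValueError("elapsed_minutes must be non-negative")
    # bisect_right finds how many thresholds have been reached.
    tier = bisect.bisect_right(_THRESHOLDS, elapsed_minutes) - 1
    return _ROLES[tier]


if __name__ == "__main__":
    for minutes in (5, 15, 45, 90):
        print(minutes, "min:", escalation_contact(minutes))
```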
Post-Incident
Verification
- Service fully restored
- All alerts cleared
- User-facing functionality verified
- Monitoring back to normal
Documentation
- Timeline documented
- Root cause identified
- Action items created
- Post-mortem scheduled
Post-Mortem Template
## Incident Summary
- Date/Time:
- Duration:
- Impact:
- Root Cause:
## Timeline
- [Time] - Event
## What Went Well
-
## What Went Wrong
-
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| | | |
Contacts
| Role | Name | Contact |
|---|---|---|
| On-call | | |
| Team Lead | | |
| Manager | | |
Revision History
| Date | Author | Changes |
|---|---|---|