# Incident Playbook Template

## Incident Overview

| Field | Value |
|-------|-------|
| Playbook Name | [Name] |
| Severity | Critical / High / Medium / Low |
| Last Updated | [YYYY-MM-DD] |
| Owner | [Team/Person] |

---

## Detection

### Symptoms

- [How will you know this incident is occurring?]
- Alert: [Alert name that triggers]
- User reports: [Expected user complaints]

### Monitoring

- Dashboard: [Link to relevant dashboard]
- Logs: [Log query to investigate]
- Metrics: [Key metrics to watch]

---

## Assessment

### Impact Analysis

- Users affected: [All / Subset / Internal only]
- Data at risk: [Yes / No]
- Revenue impact: [High / Medium / Low / None]

### Severity Determination

| Condition | Severity |
|-----------|----------|
| Service completely down | Critical |
| Partial degradation | High |
| Intermittent issues | Medium |
| Minor impact | Low |

---

## Response

### Immediate Actions (First 5 minutes)

1. [ ] Acknowledge alert
2. [ ] Verify incident is real (not false positive)
3. [ ] Notify on-call team
4. [ ] Start incident channel/call

### Investigation Steps

1. [ ] Check recent deployments
2. [ ] Review error logs
3. [ ] Check infrastructure metrics
4. [ ] Identify affected components

### Communication

| Audience | Channel | Frequency |
|----------|---------|-----------|
| Engineering | Slack #incidents | Continuous |
| Stakeholders | Email | Every 30 min |
| Users | Status page | Major updates |

---

## Resolution

### Common Fixes

#### Fix 1: [Common issue]

```bash
# Commands to fix
```

Expected outcome: [What should happen]

#### Fix 2: [Another common issue]

```bash
# Commands to fix
```

Expected outcome: [What should happen]

### Rollback Procedure

1. [ ] Identify last known good version
2. [ ] Execute rollback
   ```bash
   # Rollback commands
   ```
3. [ ] Verify service restored
4. [ ] Monitor for 15 minutes

### Escalation Path

| Time | Action |
|------|--------|
| 0-15 min | On-call engineer |
| 15-30 min | Team lead |
| 30-60 min | Engineering manager |
| 60+ min | Director/VP |

---

## Post-Incident

### Verification

- [ ] Service fully restored
- [ ] All alerts cleared
- [ ] User-facing functionality verified
- [ ] Monitoring back to normal

### Documentation

- [ ] Timeline documented
- [ ] Root cause identified
- [ ] Action items created
- [ ] Post-mortem scheduled

### Post-Mortem Template

```markdown
## Incident Summary
- Date/Time:
- Duration:
- Impact:
- Root Cause:

## Timeline
- [Time] - Event

## What Went Well
-

## What Went Wrong
-

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| | | |
```

---

## Contacts

| Role | Name | Contact |
|------|------|---------|
| On-call | | |
| Team Lead | | |
| Manager | | |

---

## Revision History

| Date | Author | Changes |
|------|--------|---------|
| | | |
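
---

The Severity Determination and Escalation Path tables earlier in this template can also be kept as data, so tooling (for example a paging bot) reads the same values the playbook shows. The sketch below is one possible encoding, not part of the template itself. The names `SEVERITY_BY_CONDITION`, `ESCALATION_PATH`, and `escalation_owner` are hypothetical. Because the table's ranges share their boundary minutes (15, 30, 60), the sketch assumes a boundary minute belongs to the later tier.

```python
# Condition-to-severity mapping, copied from the Severity Determination table.
SEVERITY_BY_CONDITION = {
    "Service completely down": "Critical",
    "Partial degradation": "High",
    "Intermittent issues": "Medium",
    "Minor impact": "Low",
}

# Escalation tiers, copied from the Escalation Path table.
# Each entry is (start_minute, end_minute, responder); end_minute is
# exclusive (an assumption, see above) and None means "no upper bound".
ESCALATION_PATH = [
    (0, 15, "On-call engineer"),
    (15, 30, "Team lead"),
    (30, 60, "Engineering manager"),
    (60, None, "Director/VP"),
]


def escalation_owner(minutes_elapsed: float) -> str:
    """Return who should own the incident after the given elapsed minutes."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    for start, end, responder in ESCALATION_PATH:
        if minutes_elapsed >= start and (end is None or minutes_elapsed < end):
            return responder
    # Not reachable: the last tier has no upper bound.
    raise RuntimeError("no escalation tier matched")


if __name__ == "__main__":
    for minutes in (5, 20, 45, 90):
        print(f"{minutes} min -> {escalation_owner(minutes)}")
```

If a team's real thresholds differ, only the two data structures need to change; the lookup function stays the same.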