# Incident Playbook Template

## Incident Overview

| Field | Value |
|-------|-------|
| Playbook Name | [Name] |
| Severity | Critical / High / Medium / Low |
| Last Updated | [YYYY-MM-DD] |
| Owner | [Team/Person] |

---

## Detection

### Symptoms
- [How will you know this incident is occurring?]
- Alert: [Alert name that triggers]
- User reports: [Expected user complaints]

### Monitoring
- Dashboard: [Link to relevant dashboard]
- Logs: [Log query to investigate]
- Metrics: [Key metrics to watch]

---

## Assessment

### Impact Analysis
- Users affected: [All / Subset / Internal only]
- Data at risk: [Yes / No]
- Revenue impact: [High / Medium / Low / None]

### Severity Determination
| Condition | Severity |
|-----------|----------|
| Service completely down | Critical |
| Partial degradation | High |
| Intermittent issues | Medium |
| Minor impact | Low |
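
If severity is assigned by tooling rather than by hand, the table above can be encoded as a simple lookup so that every alert gets a consistent rating. The following is a minimal sketch, not part of any particular alerting tool: the condition keys and the `determine_severity` function are hypothetical names chosen for illustration.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Hypothetical condition keys, one per row of the severity table.
SEVERITY_BY_CONDITION = {
    "service_down": Severity.CRITICAL,
    "partial_degradation": Severity.HIGH,
    "intermittent_issues": Severity.MEDIUM,
    "minor_impact": Severity.LOW,
}


def determine_severity(condition: str) -> Severity:
    """Return the severity for a known condition; raise on unknown input."""
    try:
        return SEVERITY_BY_CONDITION[condition]
    except KeyError:
        raise ValueError(f"Unknown condition: {condition!r}") from None
```

Raising on an unknown condition, rather than defaulting to a low severity, is a deliberate choice in this sketch: an unclassified incident should be triaged by a person.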

---

## Response

### Immediate Actions (First 5 minutes)
1. [ ] Acknowledge alert
2. [ ] Verify incident is real (not false positive)
3. [ ] Notify on-call team
4. [ ] Start incident channel/call

### Investigation Steps
1. [ ] Check recent deployments
2. [ ] Review error logs
3. [ ] Check infrastructure metrics
4. [ ] Identify affected components

### Communication
| Audience | Channel | Frequency |
|----------|---------|-----------|
| Engineering | Slack #incidents | Continuous |
| Stakeholders | Email | Every 30 min |
| Users | Status page | On major updates |
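
The fixed cadence for stakeholder updates in the table above can be computed rather than tracked by hand. The sketch below assumes updates are scheduled at multiples of the interval measured from the incident start; `next_update_due` is a hypothetical helper, and the 30-minute default mirrors the Stakeholders row.

```python
from datetime import datetime, timedelta


def next_update_due(
    incident_start: datetime,
    now: datetime,
    interval: timedelta = timedelta(minutes=30),
) -> datetime:
    """Return the next scheduled update time strictly after `now`."""
    if now < incident_start:
        raise ValueError("now must not be earlier than incident_start")
    # Number of whole intervals already elapsed since the incident began.
    completed = (now - incident_start) // interval
    return incident_start + (completed + 1) * interval
```

For example, for an incident that started at 10:00, a call at 10:45 returns 11:00, the next 30-minute mark.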

---

## Resolution

### Common Fixes

#### Fix 1: [Common issue]
```bash
# Commands to fix
```
Expected outcome: [What should happen]

#### Fix 2: [Another common issue]
```bash
# Commands to fix
```
Expected outcome: [What should happen]

### Rollback Procedure
1. [ ] Identify last known good version
2. [ ] Execute rollback
   ```bash
   # Rollback commands
   ```
3. [ ] Verify service restored
4. [ ] Monitor for 15 minutes
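
The final monitoring step can be automated with a polling loop. This is a sketch under stated assumptions: `is_healthy` stands in for whatever health check the service exposes (a hypothetical callable supplied by the caller), the 30-second polling interval is an arbitrary choice, and the `sleep` and `clock` parameters exist only so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def monitor_after_rollback(
    is_healthy: Callable[[], bool],
    duration_s: float = 15 * 60,
    interval_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `is_healthy` until `duration_s` has elapsed.

    Returns False at the first failed check, True if every check passed.
    """
    deadline = clock() + duration_s
    while clock() < deadline:
        if not is_healthy():
            return False
        sleep(interval_s)
    return True
```

A `False` result means the rollback did not hold for the full window, and the incident should stay open.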

### Escalation Path
| Elapsed Time | Escalate To |
|------|--------|
| 0-15 min | On-call engineer |
| 15-30 min | Team lead |
| 30-60 min | Engineering manager |
| 60+ min | Director/VP |
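
The escalation table above can be turned into a lookup on elapsed time. Because the ranges share their endpoints (15, 30, 60), this sketch assumes each boundary belongs to the later tier, so exactly 15 minutes escalates to the team lead. `escalation_contact` is a hypothetical helper name.

```python
import bisect

# Lower bound of each tier in minutes, with the matching contact.
TIER_START_MINUTES = [0, 15, 30, 60]
TIER_CONTACTS = [
    "On-call engineer",
    "Team lead",
    "Engineering manager",
    "Director/VP",
]


def escalation_contact(elapsed_minutes: float) -> str:
    """Return who should own the incident after `elapsed_minutes`."""
    if elapsed_minutes < 0:
        raise ValueError("elapsed_minutes must be non-negative")
    # bisect_right finds the first tier that starts after the elapsed time;
    # the tier immediately before it is the current one.
    index = bisect.bisect_right(TIER_START_MINUTES, elapsed_minutes) - 1
    return TIER_CONTACTS[index]
```

Keeping the thresholds and contacts in two parallel lists makes it easy to adjust the tiers without touching the lookup logic.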

---

## Post-Incident

### Verification
- [ ] Service fully restored
- [ ] All alerts cleared
- [ ] User-facing functionality verified
- [ ] Metrics and dashboards back to baseline

### Documentation
- [ ] Timeline documented
- [ ] Root cause identified
- [ ] Action items created
- [ ] Post-mortem scheduled

### Post-Mortem Template
```markdown
## Incident Summary
- Date/Time:
- Duration:
- Impact:
- Root Cause:

## Timeline
- [Time] - Event

## What Went Well
-

## What Went Wrong
-

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| | | |
```

---

## Contacts

| Role | Name | Contact |
|------|------|---------|
| On-call | | |
| Team Lead | | |
| Manager | | |

---

## Revision History

| Date | Author | Changes |
|------|--------|---------|
| | | |