Oleksandr Bezdieniezhnykh fd75243a84 more detailed SDLC plan
2025-12-10 19:05:17 +02:00


# Incident Playbook Template
## Incident Overview
| Field | Value |
|-------|-------|
| Playbook Name | [Name] |
| Severity | Critical / High / Medium / Low |
| Last Updated | [YYYY-MM-DD] |
| Owner | [Team/Person] |
---
## Detection
### Symptoms
- [How will you know this incident is occurring?]
- Alert: [Alert name that triggers]
- User reports: [Expected user complaints]
### Monitoring
- Dashboard: [Link to relevant dashboard]
- Logs: [Log query to investigate]
- Metrics: [Key metrics to watch]
---
## Assessment
### Impact Analysis
- Users affected: [All / Subset / Internal only]
- Data at risk: [Yes / No]
- Revenue impact: [High / Medium / Low / None]
### Severity Determination
| Condition | Severity |
|-----------|----------|
| Service completely down | Critical |
| Partial degradation | High |
| Intermittent issues | Medium |
| Minor impact | Low |
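
The mapping in the table above can be encoded so that scripts and runbooks assign severity consistently. The following is a minimal sketch, not a prescribed implementation: the function name `severity_for` and the condition keywords (`down`, `degraded`, `intermittent`, `minor`) are hypothetical and should be replaced with whatever your alerting setup actually emits.

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps a condition keyword to the severity listed in the
# Severity Determination table. The keywords are illustrative assumptions.
severity_for() {
  case "$1" in
    down)         echo "Critical" ;;  # Service completely down
    degraded)     echo "High" ;;      # Partial degradation
    intermittent) echo "Medium" ;;    # Intermittent issues
    minor)        echo "Low" ;;       # Minor impact
    *)            echo "Unknown"; return 1 ;;  # unrecognized keyword
  esac
}
```

For example, `severity_for degraded` prints `High`. An unrecognized keyword prints `Unknown` and returns a non-zero status so callers can detect it.
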
---
## Response
### Immediate Actions (First 5 minutes)
1. [ ] Acknowledge alert
2. [ ] Verify the incident is real (not a false positive)
3. [ ] Notify on-call team
4. [ ] Start incident channel/call
### Investigation Steps
1. [ ] Check recent deployments
2. [ ] Review error logs
3. [ ] Check infrastructure metrics
4. [ ] Identify affected components
### Communication
| Audience | Channel | Frequency |
|----------|---------|-----------|
| Engineering | Slack #incidents | Continuous |
| Stakeholders | Email | Every 30 min |
| Users | Status page | Major updates |
---
## Resolution
### Common Fixes
#### Fix 1: [Common issue]
```bash
# Commands to fix
```
Expected outcome: [What should happen]
#### Fix 2: [Another common issue]
```bash
# Commands to fix
```
Expected outcome: [What should happen]
### Rollback Procedure
1. [ ] Identify last known good version
2. [ ] Execute rollback
   ```bash
   # Rollback commands
   ```
3. [ ] Verify service restored
4. [ ] Monitor for 15 minutes
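
The final monitoring step can be scripted rather than watched by hand. The sketch below assumes a hypothetical helper named `monitor_for` that re-runs a health check command of your choosing at a fixed interval and stops at the first failure. The check command itself is an assumption, since this template does not define one.

```bash
#!/usr/bin/env bash
# Hypothetical helper: monitor_for DURATION_SECONDS INTERVAL_SECONDS CHECK_COMMAND...
# Re-runs CHECK_COMMAND every INTERVAL_SECONDS until DURATION_SECONDS have elapsed.
# Returns 1 as soon as the check fails, 0 if every check passes.
monitor_for() {
  local duration="$1" interval="$2"
  shift 2
  # SECONDS is a bash built-in counting seconds since the shell started.
  local deadline=$(( SECONDS + duration ))
  while (( SECONDS < deadline )); do
    if ! "$@"; then
      echo "health check failed: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
  echo "stable for ${duration}s"
}

# Example (placeholder URL): check every 30 seconds for 15 minutes.
# monitor_for 900 30 curl --fail --silent --output /dev/null https://example.com/health
```

Passing the check as a command keeps the helper independent of any particular service. A failing check is a signal to return to the investigation steps rather than close the incident.
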
### Escalation Path
| Elapsed time | Escalate to |
|--------------|-------------|
| 0-15 min | On-call engineer |
| 15-30 min | Team lead |
| 30-60 min | Engineering manager |
| 60+ min | Director/VP |
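
The escalation table above can be turned into a small lookup for use in paging scripts or chat bots. This is a sketch under stated assumptions: the function name `escalation_owner` is hypothetical, and because the table's ranges share their boundary values (15, 30, 60), the sketch assigns each boundary to the later tier.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints who should own the incident after a given number
# of elapsed minutes, following the escalation table.
# Assumption: boundary values (15, 30, 60) belong to the later tier.
escalation_owner() {
  # Accept only a whole number of minutes.
  [[ "$1" =~ ^[0-9]+$ ]] || { echo "usage: escalation_owner MINUTES" >&2; return 2; }
  local minutes=$(( 10#$1 ))  # force base 10 so input like "08" is not read as octal
  if   (( minutes < 15 )); then echo "On-call engineer"
  elif (( minutes < 30 )); then echo "Team lead"
  elif (( minutes < 60 )); then echo "Engineering manager"
  else                          echo "Director/VP"
  fi
}
```

For example, `escalation_owner 20` prints `Team lead`.
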
---
## Post-Incident
### Verification
- [ ] Service fully restored
- [ ] All alerts cleared
- [ ] User-facing functionality verified
- [ ] Monitoring back to normal
### Documentation
- [ ] Timeline documented
- [ ] Root cause identified
- [ ] Action items created
- [ ] Post-mortem scheduled
### Post-Mortem Template
```markdown
## Incident Summary
- Date/Time:
- Duration:
- Impact:
- Root Cause:
## Timeline
- [Time] - Event
## What Went Well
-
## What Went Wrong
-
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| | | |
```
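
Timeline entries are easier to reconstruct if they are captured while the incident is in progress. The sketch below assumes a hypothetical helper `log_event` and a hypothetical `TIMELINE_FILE` variable, neither of which is defined by this template. It appends UTC-timestamped lines in the `- [Time] - Event` format used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: appends a timestamped entry to a timeline file in the
# "- [Time] - Event" format of the post-mortem template.
# TIMELINE_FILE is an assumed variable name; it defaults to ./timeline.md.
log_event() {
  local file="${TIMELINE_FILE:-timeline.md}"
  # "--" stops printf from treating the leading "-" of the format as an option.
  printf -- '- [%s] - %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$file"
}

# Example:
# log_event "Alert acknowledged"
```

The resulting file can be pasted directly under the Timeline heading of the post-mortem.
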
---
## Contacts
| Role | Name | Contact |
|------|------|---------|
| On-call | | |
| Team Lead | | |
| Manager | | |
---
## Revision History
| Date | Author | Changes |
|------|--------|---------|
| | | |