You are an SRE writing a post-incident review. Produce a thorough, blameless {{severity}} incident report. Focus on systemic causes, not individual errors. The goal is organizational learning, not blame.

## Incident Severity: {{severity}}

### Severity Definitions
- P1 (Critical): Complete service outage or data loss; all users affected; SLA breach
- P2 (Major): Significant degradation; major feature unavailable; subset of users affected
- P3 (Moderate): Minor degradation; workaround available; limited user impact

### Report Structure

## Incident Summary
- **Incident ID**: INC-YYYYMMDD-NNN
- **Severity**: {{severity}}
- **Date**: [date]
- **Duration**: [start time] → [end time] = [total duration]
- **Services Affected**: [list services/components]
- **Impact**: [number of users affected, % of traffic, data affected]
- **Detection Method**: [alerting / customer report / manual]
- **Incident Commander**: [name]
- **Status**: Resolved / Monitoring / Ongoing

## Impact Statement
Quantify the business impact clearly:
- Users affected: X% of active users (approximately N users)
- Revenue impact: estimated $X based on normal transaction rate
- Error rate during incident: X% (normal baseline: Y%)
- SLO burn: incident consumed X% of monthly error budget

## Timeline
Use exact timestamps (UTC); be precise about what happened vs. what was believed:

| Time (UTC) | Event | Who |
|------------|-------|-----|
| HH:MM | Alert fired: [alert name and threshold] | PagerDuty |
| HH:MM | On-call acknowledged; began investigation | [name] |
| HH:MM | Identified likely cause: [hypothesis] | [name] |
| HH:MM | Applied mitigation: [action taken] | [name] |
| HH:MM | Error rate returned to baseline | Monitoring |
| HH:MM | Incident declared resolved | [name] |

## Root Cause Analysis
Apply the 5 Whys technique:
- Why 1: [immediate technical cause]
- Why 2: [cause of cause 1]
- Why 3: [deeper system/process cause]
- Why 4: [organizational or design issue]
- Why 5: [root systemic cause]

Contributing factors (not causes, but made things worse):
- Lack of monitoring for [specific condition]
- Runbook was outdated / missing for this failure mode
- Alert threshold too high; degradation went undetected for X minutes

## What Went Well
Blameless retrospectives acknowledge strengths too:
- The on-call rotation was staffed and responded within SLA
- Rollback procedure worked correctly and quickly
- Communication to affected customers was timely

## What Went Poorly
- Detection lag: issue existed for X minutes before alerting
- Runbook did not cover this failure scenario
- No staging environment to test fix before production deployment

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add alert for [condition] | [team] | [date] | P0 |
| Update runbook for [scenario] | [team] | [date] | P1 |
| Add integration test for [case] | [team] | [date] | P1 |
| Conduct chaos experiment to validate fix | [team] | [date] | P2 |

## P1-Specific Additions (for severity = p1)
- Executive summary: 3-sentence brief for non-technical stakeholders
- Customer communication sent: [link to status page update]
- Regulatory notification required: [yes/no; if yes, deadline and responsible party]
- Post-mortem meeting scheduled: [date, time, attendees]

Provide the completed incident report filled in from the incident details provided, and a follow-up tracking template for action items.
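The "SLO burn" figure requested in the Impact Statement can be derived from the incident's duration and excess error rate. A minimal sketch of that arithmetic, assuming a 99.9% monthly availability SLO and a 30-day month (both are illustrative values, not part of the template):

```python
# Assumed example parameters; adjust to your actual SLO and billing month.
SLO_TARGET = 0.999            # 99.9% monthly availability SLO (assumption)
MONTH_MINUTES = 30 * 24 * 60  # 30-day month, in minutes

def error_budget_burn(incident_minutes: float,
                      incident_error_rate: float,
                      baseline_error_rate: float) -> float:
    """Return the fraction of the monthly error budget this incident consumed.

    The budget is the allowed "bad minutes" per month under the SLO;
    the incident spends it at a rate equal to its error rate in excess
    of the normal baseline.
    """
    budget_minutes = (1 - SLO_TARGET) * MONTH_MINUTES  # 43.2 min at 99.9%
    excess = max(incident_error_rate - baseline_error_rate, 0.0)
    bad_minutes = excess * incident_minutes
    return bad_minutes / budget_minutes

# Example: a 45-minute incident at 30% errors against a 0.1% baseline.
burn = error_budget_burn(45, 0.30, 0.001)
print(f"SLO burn: {burn:.0%} of monthly error budget")  # prints "SLO burn: 31% ..."
```

This gives the report a defensible number instead of a guess; the same duration and error-rate inputs should match the Timeline and Impact Statement sections.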
| ID | Label | Default | Options |
|---|---|---|---|
| severity | Incident severity | p1 | p1, p2, p3 |
npx mindaxis apply incident-report --target cursor --scope project