Incident Management

Incidents are inevitable. No matter how carefully you build systems, something will eventually break in production. The difference between a minor hiccup and a major disaster often comes down to how well you respond. Good incident management turns chaos into coordinated action.

Incident Roles

During an incident, clear roles prevent confusion and duplicated effort. Not every incident needs every role filled, but everyone involved should know who is doing what.

The Incident Commander coordinates the overall response. They don't fix the problem directly — they ensure the right people are working on it, make decisions about escalation, and keep the response organized.

The Technical Lead directs the actual investigation and fix. They're hands-on-keyboard, coordinating with other engineers to diagnose and resolve the issue.

Communications handles updates to stakeholders — customers, executives, support teams. This keeps the technical team focused on fixing rather than answering questions.

The Scribe documents everything: what was tried, what worked, what didn't, and when. This timeline proves invaluable for the post-incident review.
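As a rough illustration of what the Scribe's record might look like, here is a minimal Python sketch of a timestamped timeline. The class and field names (`IncidentTimeline`, `TimelineEntry`, `outcome`) are hypothetical and not taken from any particular incident tool; in practice a shared document or chat channel often plays this role.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime    # when it happened
    author: str     # who did or observed it
    action: str     # what was tried or observed
    outcome: str    # e.g. "worked", "failed", "pending" (assumed labels)


@dataclass
class IncidentTimeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, action: str, outcome: str = "pending",
               at: Optional[datetime] = None) -> TimelineEntry:
        # Default to the current UTC time so entries from different
        # time zones can be ordered consistently.
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, action, outcome)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # Sort by timestamp so late-recorded entries still appear in order.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M:%S} UTC  {e.author}: {e.action} [{e.outcome}]"
            for e in ordered
        )
```

Recording the outcome alongside each action keeps "what worked" and "what didn't" in the same place as "when", which is what the post-incident review needs.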

The Incident Process

A structured process helps even when adrenaline is high:

  1. Detect and alert: Monitoring systems identify the problem and notify the on-call engineer.
  2. Assemble the team: Page additional help based on the problem's scope and affected systems.
  3. Assess severity: Determine how bad it is and who needs to know.
  4. Investigate and mitigate: Stop the bleeding and find the cause. Mitigation (stopping the impact) often comes before root-cause identification.
  5. Communicate status: Keep stakeholders informed with regular updates.
  6. Resolve and verify: Fix the underlying issue and confirm the system is healthy.
  7. Post-incident review: Learn from what happened.
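The steps above can be sketched as a simple checklist. This is only an illustration: `IncidentChecklist` is a hypothetical name, and the sketch assumes steps may be completed out of order (status updates, for example, usually start before the investigation finishes), so it tracks what is done and reports the earliest outstanding step instead of enforcing a strict sequence.

```python
# Step names mirror the process list; the class itself is hypothetical.
STEPS = [
    "Detect and alert",
    "Assemble the team",
    "Assess severity",
    "Investigate and mitigate",
    "Communicate status",
    "Resolve and verify",
    "Post-incident review",
]


class IncidentChecklist:
    def __init__(self):
        self.done = set()

    def complete(self, step):
        # Reject unknown step names so typos don't silently pass.
        if step not in STEPS:
            raise ValueError(f"unknown step: {step!r}")
        self.done.add(step)

    def next_step(self):
        # Return the earliest step not yet completed, or None when finished.
        for step in STEPS:
            if step not in self.done:
                return step
        return None
```

A checklist like this gives the Incident Commander a quick answer to "what have we not done yet?" when attention is split.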

Severity Levels

Not every incident deserves the same response. Severity levels help calibrate:

SEV1: Critical impact, all hands on deck
      - Major outage affecting most users
      - Data loss or security breach

SEV2: Significant impact, dedicated response
      - Partial outage or degraded service
      - Affects substantial user segment

SEV3: Minor impact, normal priority
      - Limited functionality affected
      - Workarounds available

SEV4: Low impact, address when able
      - Cosmetic issues
      - Affects very few users

Severity determines who gets paged, how quickly you need to respond, and how much you communicate externally.
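To make the calibration concrete, here is a minimal sketch of a severity classifier in Python. The numeric cut-offs are assumptions: the levels above only say "most", "substantial", and "very few" users, so the thresholds below are placeholders each team would set for itself. The `Impact` fields and the `classify` function are likewise hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Impact:
    users_affected: float              # fraction of users, 0.0 to 1.0
    data_loss_or_breach: bool = False
    cosmetic_only: bool = False


# Assumed cut-offs, not from the text: tune these to your own user base.
MOST_USERS = 0.5
SUBSTANTIAL_SEGMENT = 0.1
VERY_FEW_USERS = 0.01

# Response modes as described for each level in the text.
RESPONSE = {
    Severity.SEV1: "all hands on deck",
    Severity.SEV2: "dedicated response",
    Severity.SEV3: "normal priority",
    Severity.SEV4: "address when able",
}


def classify(impact: Impact) -> Severity:
    # Data loss or a security breach is SEV1 regardless of user count.
    if impact.data_loss_or_breach or impact.users_affected >= MOST_USERS:
        return Severity.SEV1
    if impact.users_affected >= SUBSTANTIAL_SEGMENT:
        return Severity.SEV2
    if impact.cosmetic_only or impact.users_affected <= VERY_FEW_USERS:
        return Severity.SEV4
    return Severity.SEV3
```

For example, `classify(Impact(users_affected=0.2))` returns `Severity.SEV2` under these assumed thresholds, and `RESPONSE[Severity.SEV2]` gives the matching response mode. Checking the most severe conditions first means an ambiguous case is rated high, which is usually the safer mistake.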

Communication During Incidents

Good communication reduces panic and builds trust. Update stakeholders regularly — even if the update is "still investigating." Silence creates anxiety.

Be honest about what you know and don't know. "We've identified the affected component and are working on a fix" is better than vague reassurances. Provide estimated resolution times when possible, but update them if circumstances change.
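One way to keep updates honest and consistent is a fixed template that always states what is known, what is not, and when the next update will arrive. The sketch below is a hypothetical example: the `StatusUpdate` class, its field names, and the 30-minute default cadence are assumptions, not a standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StatusUpdate:
    summary: str
    known: List[str] = field(default_factory=list)
    unknown: List[str] = field(default_factory=list)
    eta: Optional[str] = None            # leave as None when there is no estimate
    next_update_in_minutes: int = 30     # assumed cadence; adjust per severity

    def render(self) -> str:
        lines = [f"Status: {self.summary}"]
        if self.known:
            lines.append("What we know: " + "; ".join(self.known))
        if self.unknown:
            lines.append("What we don't know yet: " + "; ".join(self.unknown))
        # Say so explicitly when there is no estimate, so readers don't guess.
        lines.append(f"Estimated resolution: {self.eta or 'not yet known'}")
        lines.append(f"Next update in {self.next_update_in_minutes} minutes.")
        return "\n".join(lines)
```

Even a bare `StatusUpdate("Still investigating").render()` produces a usable message, which makes it easy to post something on schedule instead of going silent.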
