What is an Incident Response process
- Timing is a surprise, typically little to no warning
- Time matters, need to respond quickly
- Situation is rarely perfectly understood at the start
- Requires coordination and mobilization, often cross-functional
Size-up -> Stabilize -> Update -> Verify
"Hello, this is Camden. I'm the Incident Commander."
The on-call incident commander does not take over automatically when they join a call; you remain the IC until you perform a handover.
> "Is there an IC on the call?"
< ...
> "Hearing nothing. My name is Camden, I'm the incident commander."
"What actions can we take?" (ask the SMEs what they want to do)
"What are the risks involved?" (understand the impact, that may change the decision)
> "I propose xxx. Are there any strong objections?"
< ...
> "Hearing none. Let's proceed."
"Hayley, I'd like you to investigate the increased latency, try to find the cause. I'll come back to you in 5 minutes. Understood?"
"Hayley, it's been 5 minutes. Do you have any information on the latency issue?"
> "How much time do you need?"
< "20 minutes should be enough"
> "OK, I'll come back to you in 20."
< "Ignore the IC, do what I say!"
> "Do you wish to take command?"
< "..."
> "We understand your concerns. We are working to resolve the incident quickly. Your instructions are slowing down the response. Please hold your comments for discussion after the incident has been resolved."
< "Let's try and resolve this in 10 minutes please!"
> "We're in the middle of an incident, please keep your comments until the end."
< "Can I get a spreadsheet of all affected customers?"
> "This will take time away from the incident. We need that time to solve the problem. After that, we can look at the list."
> "We can either get you that list, or fix the incident. Not both. The incident takes priority."
< "Is this really a SEV-1?"
> "We do not discuss incident severity during the call. We're treating this as a SEV-1."
-> Get the right people at the right time
-> If you don't need this person anymore, let them go.
-> Keep the bigger picture in mind (as an SME and IC)
-> IC can be team agnostic. The IC is the expert in coordinating the response, not in solving the technical issues. That's what SMEs are for.
"Hey, you're being obstructive to the team on the call. If you continue, I will have to remove you."
> "Everyone on the call, be advised I'm handing over command to Tatiana."
< "This is Tatiana, I'm now the Incident Commander."
Institutionalize the culture of continuous improvement.
Completing a postmortem should be prioritized over planned work.
Create an SLA for completing the postmortem:
- 3 business days for SEV-1
- 5 business days for SEV-2
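A minimal sketch of how the SLA above could be turned into a due date. The function name `postmortem_due_date` and the counting rules are assumptions, not part of the original notes: counting starts the day after the incident is resolved, only Monday to Friday count, and public holidays are ignored.

```python
from datetime import date, timedelta

# SLA from the notes: business days allowed to complete the postmortem.
POSTMORTEM_SLA_BUSINESS_DAYS = {"SEV-1": 3, "SEV-2": 5}


def postmortem_due_date(resolved_on: date, severity: str) -> date:
    """Return the postmortem due date, counting only Monday to Friday.

    Assumptions (hypothetical, adjust to your own policy): counting
    starts the day after resolution and public holidays are ignored.
    """
    remaining = POSTMORTEM_SLA_BUSINESS_DAYS[severity]
    current = resolved_on
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current
```

For example, under these assumptions a SEV-1 resolved on Thursday 2024-01-04 would be due on Tuesday 2024-01-09, and a SEV-2 resolved the same day would be due on Thursday 2024-01-11.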
IC will select and directly notify one responder to own completing the postmortem.
Postmortem owner is not the only person responsible for completing the postmortem itself. It is a collaborative effort and should include everyone involved in the incident response.
Postmortems are not a punishment. Effective postmortems are blameless.
We don't call postmortems RCAs, because in complex systems multiple root causes lead to failure.
The owner is the accountable individual who performs the administrative tasks and follows up on the information needed to drive the postmortem to completion. Writing it is a collaborative effort, but the single owner is the person orchestrating the entire effort.
Finger-pointing, as in the old view of human error, increases the time to acknowledge the incident and the MTTR, and exacerbates the impact of the incident.
By becoming aware of our biases, we can identify when they occur and work to move past them.
Fundamental attribution error.
- Tendency to believe that what people do reflects their character rather than their circumstances.
- To combat: Intentionally focus the analysis on the situational causes rather than discrete actions that people took.
Confirmation bias.
- Tendency to favor information that reinforces our existing beliefs.
- When presented with ambiguous information, the human mind often interprets it in a way that supports existing assumptions.
- To combat: Appoint someone to play devil's advocate. Their job is to take a contrarian viewpoint during the investigation. Be cautious of introducing negativity or combativeness with that devil's advocate.
- Alternatively, invite someone from another team to ask any and all questions that come to mind. This helps surface the things the team takes for granted.
Hindsight bias.
- Memory distortion where we recall events to form a judgement.
- If we know the outcome, it's easy to see the event as having been predictable, despite there having been little to no objective basis for predicting it.
- People often recall events in a way that makes them look better, believing they knew what was going to happen as the event was unfolding. Acting on this bias can lead to defensiveness in the team.
- To combat: explain events in terms of foresight. Work the timeline forward instead of starting from the resolution and working backwards.
Negativity bias.
- The notion that things of a more negative nature have a greater effect on one's mental state than those of a neutral or positive nature.
- Research on social judgement shows that negative information disproportionately impacts a person's impression of others. We tend to focus on and magnify negative events, and this can lead to demoralization, burnout, and chaos.
- Culture change is hard. Change does not have to be driven by management; bottom-up changes are often more successful than top-down mandates.
- Get buy-in from individual contributors first, then go to your leadership team.
- Need commitment from leadership that no individual will be reprimanded after an incident.
- Explain why blame is harmful to trust and collaboration.
- Agree to work together to become blame-aware, and hold each other accountable by kindly calling it out when blame is observed.
- Avoid blaming management for blaming others. Ask leadership if they would be receptive to receiving feedback if and when they accidentally suggest blame after an incident.
- How long was the impact visible? The length of time users/customers/partners were affected. Often they were impacted before the incident was triggered.
- How many customers were affected, and what percentage of all customers? Support may need a list of the affected customers so they can reach out individually.
- How many customers wrote to or called support about the incident?
- What functionality was impacted, and how severely?
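The impact questions above can be captured in a small record so the numbers are computed the same way in every postmortem. This is only an illustrative sketch: the class name `ImpactSummary` and all of its field names are hypothetical, not taken from the notes or from any tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ImpactSummary:
    """Hypothetical container for the impact section of a postmortem."""

    impact_start: datetime  # when users first saw impact (may precede the trigger)
    impact_end: datetime    # when users stopped seeing impact
    customers_affected: int
    customers_total: int
    support_contacts: int   # customers who wrote to or called support

    @property
    def duration(self) -> timedelta:
        # Length of time the impact was visible to users.
        return self.impact_end - self.impact_start

    @property
    def percent_affected(self) -> float:
        # Share of all customers that were affected, as a percentage.
        if self.customers_total == 0:
            return 0.0
        return 100.0 * self.customers_affected / self.customers_total
```

For instance, an impact window from 10:00 to 11:30 with 250 of 10,000 customers affected would give a duration of 90 minutes and 2.5 percent affected.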
- Actionable: each action item is a sentence that should start with a verb.
- Specific: the action should result in a useful outcome.
- Bounded: it should be clear when the item is actually finished, as opposed to continually ongoing.
- Encourage participants to speak up and keep the discussion on track.
- Helpful to designate a facilitator who is not also trying to participate in the discussion.