Small Problem - Large Cost
On Wednesday, August 1st, 2012, an 18-year-old financial services firm began a test run of its new trading software.
The software was carefully configured to interact with a very small number of stocks, with Buy and Sell points set well outside where the markets were trading that day. A reasonable amount of caution was taken to reduce the blast radius in the unlikely event that anything went wrong.
45 minutes later, the updated software had sent millions of orders, generating 4 million executions in 154 stocks for more than 397 million shares. - Source: Wikipedia.org
The result was a $440 million cash loss that nearly bankrupted the company overnight and forced its sale less than a year later.
All because of a bug in the software.
“Computers do what they’re told.”
“If they’re told to do the wrong thing, they’re going to do it and they’re going to do it really, really well.” - Lawrence Pingree (Gartner) Source: CNN (Money)
Most of us have seen bugs in the wild. Fortunately, most of us haven’t experienced one as catastrophic as this … yet.
As engineers of modern systems, we accept that they are indeed complex. Regardless of the tech stack, we have services talking to other services. APIs calling other APIs. Containers running apps en masse. Serverless functions using resources only when executed. There are a lot of “unknown unknowns” in our systems. Even as early architects and technical decision makers, we could never possibly know the exact state of our systems at any given moment.
Even when they appear simple, we’re dealing with complex systems here, folks.
Failure Is Unavoidable
In 1998 Richard I. Cook, MD published a white paper titled “How Complex Systems Fail”. In less than 4 pages of text, Dr. Cook instantly challenged a widely accepted assumption that nearly every “IT expert” held, especially those deeply devoted to ITSM and ITIL “best practices”…
That conventional wisdom is: Through control and process, incidents can be prevented.
It turns out that’s wildly (and, depending on the industry, dangerously) inaccurate.
A property of complex systems is constant change. Despite carefully crafted delivery pipelines, change control procedures, or any type of automated remediation that may be in place, we can’t prevent incidents and service disruptions from happening.
Failures and service disruptions are a normal part of a complex system. Thus the subsequent response to them is a normal part of engineering work.
“Incidents aren’t deviations from some idyllic norm: they are the norm.” - Rob England (itskeptic) - Source: ITSkeptic.org
In fact, they happen so frequently that entire sciences (Human Factors and System Safety at Lund University) are dedicated to better understanding incidents and accidents, not only from a technical perspective but, more importantly, from the human side of things.
For example, no matter how hard we try, there are always fires in the world. The better we prepare for them, the less of a problem they seem to be. We even know the common causes and scenarios that contribute to devastating fires. But there will always be fires. There will always be unplanned events. Things that, despite our best efforts to prevent them, we just couldn’t.
In fact, there’s a lot we can learn from industries such as Fire & Rescue and Aviation about how to respond to the “unknown unknowns” we face in our attempts to understand and operate our own systems.
One of the easiest things we can take from these industries is that, because failure is a normal property of any complex system (which is what we are building and operating), incidents are no longer a surprise.
We have divorced ourselves from the notion that with enough process they can be eliminated.
They simply can’t be.
Therefore, when something goes wrong our posture should be one of a response rather than a reaction.
There’s enough of a difference that we should examine them.
React or Respond
When we are not prepared for an event, what comes next is a reaction to the inputs we are receiving.
Inputs that often contain no context, intended to alert individuals who aren’t prepared. Surprised in fact! Possibly asleep. Definitely forced to context shift.
Not only are they surprised, but they may not have the knowledge, tooling, access, or skillset to even do ANYTHING.
Don’t React to incidents … Respond.
If we accept that incidents will happen we have to plan for them. We have to be prepared. Rehearsed and ready when things go sideways. Because they will.
Incidents should be anticipated. Responding to them should be an understood process.
Robustness or Resiliency
Now that we’ve begun our initial thinking on how we can begin to plan for incidents, our teams form a robustness property. We are now working within the world of “known unknowns”.
However, we need to go one step further. Our systems still contain a vast and always growing number of “unknown unknown” characteristics we have to account for if we want to build a resilient system.
Resiliency requires that engineers be able to respond to unanticipated problems. Things never seen before, where there are no checklists or known solutions.
“While software can be robust, only humans can be truly resilient.” - Ryn Daniels - Source: InfoQ
Regardless of the phase your company is in… one easy way to begin building a resilient team (and system) is to put in place a basic response plan for incidents - a Minimum Viable (incident response) Plan.
To do so, let’s first examine what we mean by an incident.
Incidents come in many forms. Some higher impacting than others. Some self-resolving. Some that go on longer than expected. Some that can be (as we saw in the Knight Capital story) VERY expensive.
Regardless of severity or impact, all incidents can be broken down into much smaller buckets. There are in fact 5 phases of an incident.
Lifecycle of an incident
In the first phase of an incident, an issue is detected through various monitoring methods and a notification is triggered. Early detection is key to building resilient systems, but knowing about a problem is just the beginning. We’ll touch more on monitoring and detection shortly.
The notification is received by an engineer (or engineers) explicitly assigned the role of “on-call” during that period. Troubleshooting, querying, diagnosing, and triaging take place in an attempt to understand what is happening as well as what (and who) is impacted, in order to formulate theories around remediation steps.
Informed and armed with tooling, access to relevant systems, and a working hypothesis of how to make things better, engineers begin to take action to remediate the problem through various countermeasures and tasks. Service is restored.
Once the problem has been solved, a retrospective discussion of the timeline of events and all relevant data points provides not only an opportunity for actionable improvement but a chance for deeper learning on “How the system actually works”.
Holistic mental models of the system’s reality are improved among a broader audience. We have a good understanding of how the incident unfolded through an accurate and transparent account of what took place from diverse points of view by all involved.
We can begin to look at the shape of the incident as it relates to each of the first three phases (detection, response, remediation). By analyzing each phase individually, opportunities for improvement can be easily isolated and planned for in the following and final phase.
Unnecessary “unknown unknowns” are reduced across the team and even an entire organization simply by openly discussing and amplifying the important things learned from the incident. We can’t avoid incidents, but we can learn something extremely valuable from each and every one of them.
Now that we understand our systems a little better (i.e. fewer unknown unknowns) and we explicitly look for ways to learn about and improve our people, process, and technology, we need to put something into action.
How can we shorten the time of each phase?
If we modified our on-call response in some way or instrumented additional monitoring, could we reduce the time it took to even know about the problem in the first place (detection)?
The answer is almost always yes.
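To ask how each phase can be shortened, it helps to know how long each one took. The following is a minimal sketch, not a prescribed method: it assumes an incident timeline records four timestamps, and the names `IncidentTimeline`, `remediation_started`, and `phase_durations` are hypothetical. Where exactly one phase ends and the next begins is a judgment call each team would make for itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the key timestamps for a single incident."""
    started: datetime              # when the problem actually began
    detected: datetime             # when monitoring triggered a notification
    remediation_started: datetime  # when engineers began applying countermeasures
    resolved: datetime             # when service was restored


def phase_durations(t: IncidentTimeline) -> dict[str, timedelta]:
    """Split an incident into its first three phases and time each one."""
    return {
        "detection": t.detected - t.started,
        "response": t.remediation_started - t.detected,
        "remediation": t.resolved - t.remediation_started,
    }


# Example values are invented for illustration only.
timeline = IncidentTimeline(
    started=datetime(2024, 1, 10, 2, 0),
    detected=datetime(2024, 1, 10, 2, 12),
    remediation_started=datetime(2024, 1, 10, 2, 40),
    resolved=datetime(2024, 1, 10, 3, 5),
)

for phase, duration in phase_durations(timeline).items():
    print(f"{phase}: {duration}")
```

Comparing these numbers across incidents is one way to see whether a change to monitoring or on-call response actually reduced the time to detect.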
To build a resilient system you must first have a solid foundation of monitoring in place.
How else would you know there is a problem?
NOTE: Insert Mikey Dickerson’s “Hierarchy of Resilient Systems” diagram
A response plan is useless if our system(s) can’t tell us about a problem in the first place. The foundation of building a resilient system is monitoring, followed by incident response. Engineering teams should spend a considerable amount of time discussing what areas of the system are currently black boxes. What questions about the system can you NOT answer today? Fill those blind spots as best as possible - continuously.
As discussed already, we accept failure in our systems. We know we have to deal with incidents. We know we should be prepared, rehearsed, and have everything we can think of (i.e. countermeasures for known knowns and known unknowns) ready and set up.
We should know EXACTLY who is on-call at any given time. We should have a fair amount of confidence that engineers are armed with everything they need (including access to relevant systems) to make an immediate positive impact to the recovery of a service disruption.
The bare essentials of any response plan should include the following:
There are several roles that are part of an established on-call rotation. We’ll cover those soon. But in terms of rotations, they are typically arranged seasonally based on the number of participants in the rotation.
Rotations can last anywhere from a few hours to several days, although it’s never recommended to allow engineers to be on one continuous rotation for longer than a few days, especially in noisy or problematic systems. This is a fast track to burnout.
Rotations may occur once a week, a month, a quarter or even longer. In some large organizations that share the responsibility of being on-call across many individuals, it’s common for some to only have a handful of rotations per year.
Types of rotations:
Standard: Gives you the ability to schedule a 24×7 rotation. Most teams start here, as they can establish, for example, a rotation of 24 hours per day for 3 days (or whatever is decided amongst the team).
Follow the Sun: This approach allows teams to create multiple shifts within a single rotation. This provides coverage for specific time ranges. Great for distributed teams. Allows engineers to be on-call during their normal hours of operation and hand off to someone else as they begin their work day in a different timezone.
Custom: Custom shifts give you the ability to set a custom day/time range. This type of rotation comes in handy for scheduling weekend coverage.
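As an illustration of the Standard type, here is a small sketch that lays out back-to-back shifts and answers the question "who is on-call right now?". The engineer names, start date, and the functions `standard_rotation` and `on_call_at` are all hypothetical, and the 3-day shift length simply mirrors the example above.

```python
from datetime import datetime, timedelta
from itertools import cycle


def standard_rotation(engineers, start, shift_length, shifts):
    """Yield (engineer, shift_start, shift_end) for back-to-back 24x7 shifts."""
    current = start
    for engineer, _ in zip(cycle(engineers), range(shifts)):
        yield engineer, current, current + shift_length
        current += shift_length


def on_call_at(schedule, moment):
    """Return who is on-call at a given moment, or None if nobody is scheduled."""
    for engineer, shift_start, shift_end in schedule:
        if shift_start <= moment < shift_end:
            return engineer
    return None


schedule = list(standard_rotation(
    engineers=["alice", "bob", "carol"],  # hypothetical team members
    start=datetime(2024, 1, 1, 9, 0),     # hypothetical start of the first shift
    shift_length=timedelta(days=3),
    shifts=6,
))

print(on_call_at(schedule, datetime(2024, 1, 5, 3, 0)))
```

A Follow the Sun rotation could reuse the same idea with shorter shifts (for example 8 hours) and a list of engineers ordered by timezone.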
At the bare minimum, every engineering team should have a working knowledge of the following roles and concepts when building out an on-call rotation:
First Responder(s): The specific person (or persons) who is on-call. They are aware of their responsibility to receive all initial alerts and acknowledge their receipt and awareness.
Escalation Paths: First responders won’t always know how to remediate issues without a little help from additional teammates and subject matter experts. Who to escalate an issue to (and when) should be clearly defined.
Secondary Responder(s): Often the first stop in an established escalation path: a clearly identified engineer who can assist in the response and remediation effort. The secondary may also be the first to acknowledge an incident should the first responder (for some reason) be unable to.
Incident Commander: In high severity incidents that are long running or involve many different teams and subject matter experts in the response, it is common for an engineer to assume the role of coordinating the response and remediation effort as it grows. Often, this is the first or second responder once additional members join the effort and the danger of repeating steps or causing more noise becomes a concern. This role is known as the Incident Commander.
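One way to make "who and when" explicit is to write the escalation path down as data. The sketch below is only an assumption about how a team might model it: the `EscalationStep` type, the contacts, and the timings are placeholders, not values taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    role: str         # e.g. "first responder"
    contact: str      # hypothetical engineer or team name
    after: timedelta  # how long an alert may go unacknowledged before this step is paged


# Hypothetical policy; the timings are placeholders for the team to agree on.
POLICY = [
    EscalationStep("first responder", "alice", timedelta(minutes=0)),
    EscalationStep("secondary responder", "bob", timedelta(minutes=10)),
    EscalationStep("subject matter expert", "database-team", timedelta(minutes=25)),
]


def who_to_page(policy, unacknowledged_for):
    """Return every step that should have been paged by now, in escalation order."""
    ordered = sorted(policy, key=lambda step: step.after)
    return [step for step in ordered if step.after <= unacknowledged_for]


for step in who_to_page(POLICY, timedelta(minutes=12)):
    print(f"page {step.contact} ({step.role})")
```

Keeping the policy in one reviewable place means nobody has to work out the escalation path in the middle of an incident.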
Now we know who is responsible when something goes wrong, but how do they know what to do once that dreadful alert is delivered?
Most would agree that emails aren’t going to wake us up in the middle of the night to alert us to a production outage. It’s even less likely that one would catch our attention during the workday, when our inboxes are flooded with a barrage of correspondence and memos. So how do we sort out the signal from the noise?
Ideally each individual can choose their own preference for being alerted to problems that need attention. Some may want a push notification or SMS while others may prefer a phone call. I know for me, it’s much more likely I’ll notice my phone ringing (day or night) than just about any other method of delivery. But maybe that’s because I HATE PHONE CALLS! :D
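A rough sketch of what per-person delivery preferences might look like in code follows. Everything here is hypothetical (the `Channel` options, the `Alert` fields, the engineer names), and the alert carries three pieces of context a responder is likely to need: what is happening, who is impacted, and where to start.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    PUSH = "push notification"
    SMS = "sms"
    PHONE_CALL = "phone call"


@dataclass
class Alert:
    summary: str      # what is happening
    impact: str       # who or what is affected
    first_steps: str  # where the responder should start looking


# Hypothetical per-engineer delivery preferences.
PREFERENCES = {"alice": Channel.PHONE_CALL, "bob": Channel.SMS}


def format_notification(alert, engineer, preferences, default=Channel.PHONE_CALL):
    """Pick the engineer's preferred channel and build a message that carries context."""
    channel = preferences.get(engineer, default)
    body = f"{alert.summary} | Impact: {alert.impact} | Start here: {alert.first_steps}"
    return channel, body


alert = Alert(
    summary="Checkout API error rate above 5%",            # invented example values
    impact="Customers cannot complete purchases",
    first_steps="Check the most recent checkout deploy",
)
channel, body = format_notification(alert, "bob", PREFERENCES)
print(f"[{channel.value}] {body}")
```

Falling back to a phone call for anyone without a stated preference is an assumption made here; a team might reasonably choose a different default.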
Capture & Provide Context
Context is important in the moment as first responders begin to dive in, but it’s just as important during the analysis phase: clearly identifying what took place, who was doing what, and what the results of specific troubleshooting steps were.
These are all important data points that should be explored during a post-incident review. One surefire way of providing that contextual timeline is by capturing as much as possible in a persistent group chat tool such as Slack or Teams.
A verbose account of exactly what took place helps to isolate each phase in an effort to shorten them. It also gives engineers an opportunity to discuss what they believed were the right judgements and actions to take given the circumstances and information available to them at the time.
Hindsight is a path to a lot of useless counterfactual statements such as “We should have known” or “If only I would have tried the other thing first.”
These statements hold no value in our discussion of how to improve and in fact are detrimental to our effort of building resilient systems because they inherently assume a human is at fault.
One of the key points about failure within complex systems from Dr. Cook’s paper is that:
Hindsight biases post-accident assessments of human performance
It’s natural for us to allow a number of biases (especially hindsight) to cloud our analysis of what actually took place and, more importantly, how to improve.
We accept that there is always a discretionary space where humans can decide to make actions or not, and that the judgement of those decisions lie in hindsight. - John Allspaw (former CTO Etsy) - Source CodeAsCraft.com
Be mindful of bias while analyzing an incident in retrospect.
Rotation Handoffs and Debriefings
Once an engineer has completed their on-call rotation, it’s important to debrief the next person. Whether this is in a formal write-up or a face-to-face discussion, apprising the next person of the problems that have surfaced during the previous shift goes a long way in setting context and preparing them for their front-line responsibility of system resiliency.
Be sure to detail how many incidents there were. What aspects of the system they were tied to. What the impact to users was (or still is). What actions and countermeasures have been deployed to respond to and remediate these problems. Are there runbooks that should be reviewed?
Sharing more about the reality of the system with the next on-call engineer sets them up for success and helps to transfer more operational knowledge about what’s going on within our systems.
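For teams that prefer a written handoff, the questions above can be turned into a simple template. This is just a sketch under the assumption that each incident is summarized as a small record; `IncidentNote`, its fields, and the example values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentNote:
    component: str    # aspect of the system the incident was tied to
    user_impact: str  # what users experienced
    ongoing: bool     # True if users are still impacted
    countermeasures: list[str] = field(default_factory=list)
    runbooks: list[str] = field(default_factory=list)


def handoff_summary(notes):
    """Render a plain-text debrief for the next on-call engineer."""
    lines = [f"Incidents this rotation: {len(notes)}"]
    for number, note in enumerate(notes, start=1):
        status = "ONGOING" if note.ongoing else "resolved"
        lines.append(f"{number}. [{status}] {note.component}: {note.user_impact}")
        for action in note.countermeasures:
            lines.append(f"   - action taken: {action}")
        for runbook in note.runbooks:
            lines.append(f"   - review runbook: {runbook}")
    return "\n".join(lines)


notes = [
    IncidentNote(
        component="payments queue",                    # invented example values
        user_impact="delayed order confirmations",
        ongoing=False,
        countermeasures=["restarted stuck workers"],
        runbooks=["queue-backlog"],
    ),
]
print(handoff_summary(notes))
```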
Game Day Exercises
Nobody is immediately good at anything. Sure, we all possess natural talent and inclinations towards certain areas and skills, but it’s only through repetitive practice that we actually improve. Rarely has a championship basketball team pulled off a winning season without an immense amount of practice and study of their actions. Tediously rehearsing over and over, looking for minor improvements and opportunities to learn along the way.
The same applies to building resilient systems. As pointed out before, a robust team may have the skillset and knowledge to deal with problems that have been seen before (or something similar). But it’s the stuff that we’ve never seen before that really trips us up. The stuff that we’ve never had to deal with before. That’s where a service disruption can have a very large negative impact to a business.
With this in mind, many teams regularly conduct Game Day Exercises. Scheduled and prioritized, specific failure scenarios are planned out and on-call engineers are expected to respond as though there is a real incident.
Building muscle memory and challenging teams to coordinate and share more about what really happens during an incident exposes a treasure trove of opportunities for improvement.
It’s through these regular exercises that the responsibility of being on-call becomes less of a burden and clearly a “part of the job”. Anxiety is reduced, knowledge is transferred, and new documentation is generated to fill blind spots.
No game day exercise goes by without providing a shocking amount of information engineering teams can use to better improve their response plans.
- Rehearse, Experiment, Measure
- Prioritize resources towards continuous improvement
Now that we have covered the essentials for a basic response plan, discuss and form a plan for addressing and establishing the following:
Failure Is Unavoidable - Get used to it
React or Respond - We want to be prepared, which means we can respond to problems, not react. There is a difference.
Monitor & Measure As Much as Possible - Without a way of knowing about problems, engineers can’t respond and remediate. Discuss and look for existing gaps in your monitoring systems.
Who’s On-Call - When everybody is on-call - nobody is on-call. Establish a well understood and communicated rotation so engineers know exactly who is responsible for the initial response to problems.
Rotations - Being on-call for systems is tough work. It’s taxing on the body, the mind, and can quickly lead to burnout if engineers remain on-call for too long. Establish a schedule where people can prioritize responding to incidents as normal work. Don’t allow engineers to have a rotation lasting longer than a few days, especially if they are responsible around the clock.
Alerting - You can have the best in breed monitoring in place but if alerts are difficult to parse, contain no real meaningful context about what is happening, who is impacted, and what initial first steps should be, minor service disruptions can quickly turn into expensive outages. Put yourself in the shoes of the recipient of the alert. What would you need to know to begin making a positive impact to the recovery of service?
Capture & Provide Context - Context in the alert is critical, but so is high-fidelity information about what is taking place during an active incident. Capturing as much as possible about what really took place during the problem means engineers can go back in a retrospective and look for opportunities for improvement. Use persistent group chat tools rather than a phone bridge or conference call.
Rotation Handoffs and Debriefings - Once engineers are no longer on the hook as the first responder, pass on as much relevant and helpful information as possible to the next person.
Game Day Exercises - Practice, practice, practice. Sadly, teams rarely set aside sufficient time for this. Ensure that engineers are building and maintaining “Incident Response” muscles by simulating a problem, observing and measuring what takes place to restore service, and then discussing with a broad audience how the elapsed time of the first three phases of an incident (Detection, Response, Remediation) can be reduced. This leads to an overall shorter and less expensive outage the next time.
By establishing these basic concepts and practices, even the smallest engineering team can be well prepared for the next incident - large or small. Focus on continuous improvement through data. Actively look for ways to improve each area on your way to shortening the first three phases of an incident and you’ll find that having a minimum viable response plan in place will make the responsibility of on-call part of the normal responsibilities of software engineering.
Unlike the Knight Capital story, a minor bug that makes it into production can be quickly caught, addressed, and resolved with just a few basic things put in place.
Good luck and don’t forget - reliability is all of our jobs.
Knight Capital Story
“How Complex Systems Fail”