Cloud Service Reliability (Part II): Houston, we have an… outage!

July 3, 2023

Of all the phrases from movies that have become a part of pop culture – from ‘Luke, I am your Father’ to ‘Play it again, Sam’ – none has made it into daily use more than ‘Houston, we have a problem!’ from the movie Apollo 13. Reminding my son ‘Saransh, I am their father’ does not get the same smiles from him it used to get when he first saw The Empire Strikes Back. But giving a deadpan ‘Houston, we have a problem’ gets everyone’s attention fast.

Incident Response – Apollo 13 to Google SRE

In full disclosure, after writing part one of this blog series, I watched the movie Apollo 13 again. This time with different eyes though. I gave much of my attention to what started happening at Houston mission control from the point they got the ‘we have a problem’ message from the Odyssey spacecraft, on its way to the Moon.

Before I dissect the actions they took to respond to the incident on the spacecraft, let’s first look at the response guidance for Incident Management prescribed by the Google SRE book. Let’s see how much correlation there is between the Google SRE approach from this decade and NASA space mission incident response approach from the 1970s. Here is what the SRE book prescribes as its guidance on incident response and management:

Recursive Separation of Responsibilities: Establish who owns which area of responsibility and ensure they have the autonomy to act. This ensures that no one steps into someone else’s area of responsibility, that everyone’s time is used efficiently, and that sub-incidents/sub-systems can be delegated to other teams.
A Recognized Command Post: Is there a way for all stakeholders to communicate in real time, and to find the right stakeholder when they need them? ‘War rooms’ and dedicated Slack channels are good examples of this.
Live Incident State Document: This ensures visibility into the status of the work being done, and who is doing it. It should be a live document (Wiki) that is owned by the various stakeholders, who are also responsible for updating it in as close to real time as possible.
Clear, Live Handoff: As team members hand off their part of the incident response to other team members, they need to ensure a proper and complete handoff, with documented acknowledgement that open issues and tasks are understood.

Going back to Apollo 13, here are the steps mission control in Houston took once they were alerted to the incident on the spacecraft, mapped to the SRE guidance:

Recursive Separation of Responsibilities: The engineers at Mission Control broke up into response teams by system ownership and determined what was working and what was not. Was it an instrumentation error or a real problem? If a real problem, what could be the root cause of the failure? Which problem was most critical and needed to be addressed first? This allowed mission control to move the astronauts to the Lunar Module immediately after the incident, saving their lives. It also allowed the team to re-prioritize the response actions as they discovered new or additional problems. The CO₂ level issue was discovered by one such separate team that owned that particular sub-system, leading them to prioritize it above anything else, and to build a square CO₂ scrubber to fit into a round device with the limited components available on the spacecraft (brilliant engineering in action!)
A Recognized Command Post: They met in a ‘War Room’ for discussions, for ‘situation awareness’ and planning. They began by listing what was and was not working, system by system. When not in the war room, the command post was in mission control and led by the mission chief, to whom every team reported.
Live Incident State Document: The teams kept up-to-date documentation of the response steps they recommended. Whether it was the actions to save the astronauts by shutting down the damaged Odyssey spacecraft and moving them to the Lunar Excursion Module (LEM) before oxygen ran out, or building the CO₂ scrubber, they had detailed, validated steps for the astronauts to execute. As there was no Wiki or Slack channel to leverage, everything was documented on paper.
Clear, Live Handoff: The critical handoff of tasks was to the astronauts in the spacecraft, which had to be done by voice instructions. The steps being handed off were validated and walked through multiple times by the ground teams. When the spacecraft needed to transfer navigation from the Odyssey to the LEM, the gimbal angles needed to be manually validated and entered. The astronauts read out their calculations to mission control, where multiple engineers validated them (with slide rules, no less) and relayed them back.

Not much has changed in how we respond to incidents from Apollo 13 to Google SRE, beyond, of course, the technologies being used. Slack saves lives! Even if we look at the best practices on Incident Management from the SRE book, almost all were followed by the Apollo 13 mission control team, and are pretty well documented in the movie. Here is a subset:

Prioritize – Saving the astronauts’ lives was priority #1. Managing the batteries became a high priority for that reason.
Trust – Every engineer trusted the expertise of everyone else on the team. They were all allowed to find solutions without focusing on fixing blame or deflecting responsibility.
Introspect – The astronauts were hit hard by stress, the lack of heat, and exhaustion. They needed to be walked through every step by the engineers on the ground.
Consider Alternatives – Every option was on the table. When they found they did not have enough battery power to turn the spacecraft back on for re-entry, they came up with the idea of reversing the power connector between the two spacecraft.

I am pretty sure the other best practices – Prepare, Practice and Change it around – were also followed by the mission engineers. The response to this unique incident succeeded because of that. Now, there is no denying that reality was surely much messier, and involved hours of tedious work going down dead-end paths that are not included in the dramatic rendering in the movie Apollo 13. But that is reality. A response to an incident goes well when the response processes are well prepared, validated, practiced and well understood. Above all, it needs a response team that is well trained and works well as a team, with excellent leadership. These are all areas that cannot be fixed while responding to an incident. Downloading the SRE book to start reading it during an outage is too late…

Let’s now step away from lunar missions and spacecraft and come back (to Earth) to the challenges we mere mortals are more likely to deal with, given the type of incidents we will need to respond to. This post is about Cloud Service management after all, not a promotional piece for Apollo 13 the movie, or a PSA for working for NASA (#DreamJob).

The Response Time Calculus

In the Cloud Service management world, there are several definitions of what qualifies as an ‘incident’. One definition that every Cloud service provider and consumer understands though is quite simply – a Service Level Objective (SLO) is not being met. This missing of an SLO may be because of an outage with a particular service, or an out of bounds response time from a service. Either way, it results in an SLO not being met. The work now becomes detecting the incident, triaging the root cause, and restoring the service, within the acceptable Mean Time to Repair (MTTR) built into the SLO. I referred to the concept of an Error Budget in part 1 of this post. Let’s look at the calculus here for the incident response to an outage of a service, which is resulting in the service provider missing its availability SLO.

The availability of a service is 1 minus the fraction of time it is down, or 100 minus downtime when talking in percentage terms. So, let’s say we want to achieve a very realistic 99.95% availability SLO, or an acceptable downtime of 0.05%. As we established in part one of this blog series, this is a downtime of roughly 43 seconds per day, or about 21 minutes and 30 seconds a month. This 21:30 is your monthly Error Budget. It is the maximum cumulative time to repair you have for ALL incidents in a single calendar month.
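
The arithmetic behind these numbers can be sketched in a few lines of Python. This is only an illustration: the function name and the 30-day month are my assumptions, not part of any SLO standard. Computed without rounding the daily figure first, a 30-day month comes to about 21 minutes 36 seconds; the 21:30 quoted above is what you get by multiplying the rounded 43 seconds by 30.

```python
# Sketch: convert an availability SLO into an allowed-downtime budget.
# error_budget_seconds and DAYS_PER_MONTH are illustrative assumptions.

def error_budget_seconds(slo_percent: float, window_seconds: float) -> float:
    """Allowed downtime, in seconds, for an availability SLO over a window."""
    allowed_downtime_fraction = 1 - slo_percent / 100
    return allowed_downtime_fraction * window_seconds


SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month

daily_budget = error_budget_seconds(99.95, SECONDS_PER_DAY)
monthly_budget = error_budget_seconds(99.95, DAYS_PER_MONTH * SECONDS_PER_DAY)

minutes, seconds = divmod(round(monthly_budget), 60)
print(f"Daily budget:   {daily_budget:.1f} seconds")   # 43.2 seconds
print(f"Monthly budget: {minutes} min {seconds} s")    # 21 min 36 s
```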

Hence, we can break down MTTR as such:

Mean Time to Repair = Mean Time to Detect + Mean Time to Triage + Mean Time to Restore.

(If you work for one of the several dysfunctional organizations I have encountered, you will also need to add ‘Mean Time to pass/deflect Blame’ to the equation).
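
To make the breakdown concrete, here is a minimal sketch that adds up the three components for each incident and checks the total against the monthly budget of 21:30. The Incident class and the incident durations are hypothetical values invented purely for illustration.

```python
# Sketch: time to repair = detect + triage + restore, summed per incident
# and compared against the monthly error budget.
from dataclasses import dataclass


@dataclass
class Incident:
    detect_s: float   # seconds until the incident was detected
    triage_s: float   # seconds spent finding the root cause
    restore_s: float  # seconds spent restoring the service

    @property
    def time_to_repair_s(self) -> float:
        return self.detect_s + self.triage_s + self.restore_s


MONTHLY_BUDGET_S = 21 * 60 + 30  # the 21:30 budget from the text

# Hypothetical incidents for one calendar month.
incidents = [Incident(120, 300, 180), Incident(60, 240, 90)]

spent_s = sum(i.time_to_repair_s for i in incidents)
mttr_s = spent_s / len(incidents)
remaining_s = MONTHLY_BUDGET_S - spent_s

print(f"Budget spent: {spent_s:.0f} s, MTTR: {mttr_s:.0f} s, remaining: {remaining_s:.0f} s")
```

With these made-up numbers, two incidents already consume most of the month’s budget, which is why shaving minutes off each of the three components matters.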

The Mean Time to Detect is surprisingly a major issue in many organizations. If a customer’s tweet tells you that your service is down before you detect it – Houston, do you have problems?! Rapid detection requires a robust monitoring regime. For true Cloud Native applications it is neither sufficient nor efficient to monitor the availability of individual services consumed by the applications. Such monitoring would also not provide an accurate picture of the application’s composite response time. Monitoring in the Cloud Native world requires constant running of Synthetic Transactions to measure availability and response time – your key Service Level Indicators (SLI). I will discuss synthetic transactions in detail in a later post, but for our purposes here, they are a set of key business transactions that are executed over and over again, using dummy (synthetic) data and identities. The response time is measured at each run of the transaction, and any divergence from acceptable time limits results in an alert notification.
Next comes Mean Time to Triage – what happens when an issue (outage or slow response time) is detected? What is the process to triage the incident and do a root cause analysis (RCA)? This is no easy task when there are multiple services from multiple service providers involved. It takes several steps and analysis of data to determine whether the outage is caused by an application issue or by a service being consumed by the application. If it is an application issue, is it due to a bug in the application, or because the application is unable to handle the behavior of a Cloud Service it consumes? Inability to handle unexpected behaviors of a Cloud Service, by the way, is also an application issue, not a Cloud Service issue (I will come back to this soon). If it is a Cloud Service issue, one needs to determine which Cloud Service is having an outage, and why. Next, what is the incident response process to assign work when one does find the root cause? Is there a well-documented RACI for the various services, including all the service providers, for each incident type?
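
Returning to detection for a moment, the synthetic transaction idea can be sketched as follows. This is a simplified illustration, not a real monitoring tool: run_synthetic_transaction, fake_checkout and broken_checkout are hypothetical names, and a real monitor would run on a schedule against the live application and page someone instead of returning a dictionary.

```python
# Sketch: run one synthetic transaction, measure its response time, and
# raise an alert flag if it fails or exceeds the acceptable limit.
import time
from typing import Callable


def run_synthetic_transaction(transaction: Callable[[], None],
                              max_seconds: float) -> dict:
    start = time.perf_counter()
    try:
        transaction()
        available = True
    except Exception:
        # Any failure of the transaction counts as unavailability.
        available = False
    elapsed = time.perf_counter() - start
    return {
        "available": available,
        "response_seconds": elapsed,
        "alert": (not available) or elapsed > max_seconds,
    }


def fake_checkout() -> None:
    """Hypothetical business transaction using synthetic data."""
    time.sleep(0.01)  # stands in for real calls to the application


def broken_checkout() -> None:
    """Hypothetical transaction that hits an outage."""
    raise ConnectionError("payment service unreachable")


healthy = run_synthetic_transaction(fake_checkout, max_seconds=0.5)
outage = run_synthetic_transaction(broken_checkout, max_seconds=0.5)
print(healthy["alert"], outage["alert"])
```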

Above all, it is important that this triage process be blameless. It should be done to determine how to restore the application or service, not to figure out who to blame. The analysis of how to prevent the incident from recurring should happen after the application or service has been restored, via a blameless post-mortem.

Last comes Mean Time to Restore. How does one respond to restore the service – is the right action to start a new instance of the same service, to roll back to a prior version of the service, or to roll forward with a patch? Is there a well-documented architecture of the various services and their dependencies, including all third-party delivered services, showing which services need to be restored and in what order?
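
One way to answer the ‘in what order’ question, assuming the dependencies are documented as a simple map, is a topological sort. The sketch below uses Python’s standard graphlib module (Python 3.9 and later); the service names and their dependencies are hypothetical.

```python
# Sketch: derive a restore order from a documented dependency map.
# Each key is a service; its value is the set of services it depends on.
from graphlib import TopologicalSorter

# Hypothetical services and dependencies, for illustration only.
dependencies = {
    "web-frontend": {"orders-api", "auth-service"},
    "orders-api": {"database", "auth-service"},
    "auth-service": {"database"},
    "database": set(),
}

# static_order() yields each service only after everything it depends on.
restore_order = list(TopologicalSorter(dependencies).static_order())
print(restore_order)
# ['database', 'auth-service', 'orders-api', 'web-frontend']
```

If the documented dependencies contain a cycle, TopologicalSorter raises graphlib.CycleError, which is itself a useful finding to have before an outage rather than during one.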

SRE Adoption – Delivering Antifragile systems in the Enterprise

As we look at adopting SRE, manual processes – whether they are manual system level actions that are taken, or manual approvals that need to be acted upon – end up being the Achilles heel of traditional enterprises. They are the single biggest bottleneck to achieving the desired availability times and hence the service SLOs. Trust in automation needs to be established to eliminate manual actions, from Ops practitioners manually restoring VM/container instances, to management approvals of steps to restore services. This requires well documented, validated and tested automation. Automation that is tested continuously for all possible scenarios. It also requires the joint operation of all the stakeholders – internal and external. Saying that it is not easy is an understatement, especially for an organization that is new to the world of Cloud.

The term Antifragile was coined by Nassim Nicholas Taleb, an options trader who has written a series of books on randomness, probability and their impacts on the markets, and on life. He introduced the term Antifragile for the first time in his book The Black Swan, where he discusses rare events, which he contends are not as random and rare as people think—like stock market crashes. He then wrote a book named Antifragile, where he expanded on the concept of antifragility to describe things that are neither fragile nor robust, but that, in fact, benefit from chaos.

What is needed is the development of Antifragile systems. As I discussed extensively in my book The DevOps Adoption Playbook, Antifragile systems are those that thrive in chaos. These are systems that are designed from the ground up to exist in environments that have outages, that have non-responsive services, and they still self-heal and get restored efficiently, with maximum automation and well refined service incident management processes.

Coming back to my earlier point about applications handling unexpected responses from cloud services: an Antifragile cloud native application has error handling built into the application itself, so that it can continue running even if a cloud service it consumes is unavailable. It may not be able to deliver the business functions it is designed for, but it should handle the incident/outage elegantly – by notifying the user that it is unable to provide the function asked for, and sending an alert notification on the service outage. An Antifragile application would not hang or crash when a cloud service it consumes has an outage. To be continued in Part 3 of this blog series.
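
As a rough sketch of what that error handling might look like, the example below wraps a call to a cloud service so that an outage produces a message for the user and an alert, not a crash. Everything here is hypothetical: the ServiceUnavailable exception, the recommendations service and the function names are invented for illustration, and the alert is just a log line where a real system would page the on-call team.

```python
# Sketch: degrade gracefully when a consumed cloud service is unavailable.
import logging

logger = logging.getLogger("storefront")


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a consumed cloud service is down."""


def fetch_recommendations(user_id: str) -> list:
    """Stand-in for a call to a cloud service that is having an outage."""
    raise ServiceUnavailable("recommendation service timed out")


def recommendations_for(user_id: str, fetch=fetch_recommendations) -> dict:
    try:
        return {"status": "ok", "items": fetch(user_id)}
    except ServiceUnavailable as exc:
        # Send an alert on the outage (here, only a log entry).
        logger.error("ALERT: recommendations unavailable: %s", exc)
        # Tell the user the function is unavailable; do not hang or crash.
        return {
            "status": "degraded",
            "items": [],
            "message": "Recommendations are temporarily unavailable.",
        }


during_outage = recommendations_for("user-123")
when_healthy = recommendations_for("user-123", fetch=lambda uid: ["book", "lamp"])
print(during_outage["status"], when_healthy["status"])
```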

Last but not least, Antifragile systems need the organizations supporting them to be Learning Organizations. Such organizations learn from each incident to ensure they have processes and automation in place to handle similar incidents better the next time they occur. They strive for Continuous Improvement and, in fact, go one step further: they come up with creative ways an incident could happen, so that they are ready to handle it before it even occurs. We will discuss how Netflix does this with their Simian Army in Part 3 of the series.

Author: Sanjeev Sharma