Apollo 13, in my humble opinion, is the best movie ever made on engineering, and especially on reliability engineering in action. It is certainly one of my personal favourite movies of all time. But how I view the movie differs by a full 180 degrees from how the rest of my family views it. To them it is a movie about three heroic astronauts stranded in space trying to make it back to Earth, and it has Tom Hanks as one of those brave astronauts. Without taking anything away from the heroism and bravery of the astronauts, to me it is a movie about a group of heroic and ingenious engineers back on Earth solving real life-and-death engineering problems within tremendous constraints. It is the story of a group of engineers working against time to build a reliable system to address defects in what was already a highly reliable system, but one that had failed. (We can go into how that failure was caused by a lack of good integration requirements and a lack of integration testing, but I will leave that for another post.)

A system like the Apollo mission spacecraft is designed for 100% reliability. Human lives are at stake, and 250,000 miles from Earth at that. President Kennedy, in his famous address, announced the Moon mission goal of ‘… landing a man on the Moon and returning him safely to Earth’ (emphasis mine). The reliability requirements came from the ‘returning him safely to Earth’ part[i]. This requires 100% reliability at the system-of-systems level, something impossible to achieve, as the mission showed us. The engineers on the ground had to solve some of the toughest engineering problems ever solved, given the hostile environmental conditions and the time and resource constraints they had. We aren’t talking of just rebooting servers and routers here. My heroes!

Understanding Reliability

As an electrical engineer by training, the reliability Service Level Objectives (SLOs) I was trained on were always defined as 5-nines: 99.999% reliability. Engineering firms, especially those in the electrical engineering space, have had those levels of reliability objectives for decades. In today’s world of web applications, such an objective is unreasonable. We will discuss why shortly, but first, let’s understand what that level of reliability means. Let’s step away from space mission reliability requirements and look at reliability in terms we may understand and relate to more: service availability. How much downtime does 99.999% availability translate to?
  • Daily:0.9s
  • Weekly:6.0s
  • Monthly:26.3s
  • Yearly:5m 15.6s
Yes, less than a second per day, only 6 seconds of downtime a week, or just 26.3 seconds of downtime in an entire month! Now look at a more common availability objective, say 4-nines or 99.99%. It translates to downtime of:
  • Daily:8.6s
  • Weekly:1m 0.5s
  • Monthly:4m 23.0s
  • Yearly:52m 35.7s
That’s a full order of magnitude lower than the 5-nines reliability we used to talk about in engineering school, and it still translates to just 8.6 seconds a day, or one minute (and 0.5 sec) of downtime a week! Even the more common 99.95% availability SLO allows a mere 43 seconds a day, or about 5 minutes (5m 2.4s) a week.
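For anyone who wants to check the arithmetic, here is a minimal Python sketch. The function name and the period table are my own illustrative choices, and the month and year lengths are averages (365.25 days per year), an assumption that reproduces the figures above.

```python
# Period lengths in seconds. The month and year are average lengths
# (365.25 days per year); this is an assumption, not a standard.
SECONDS_PER_DAY = 86_400
PERIODS = {
    "Daily": SECONDS_PER_DAY,
    "Weekly": 7 * SECONDS_PER_DAY,
    "Monthly": 365.25 * SECONDS_PER_DAY / 12,
    "Yearly": 365.25 * SECONDS_PER_DAY,
}


def allowed_downtime(availability_pct: float) -> dict:
    """Return the permitted downtime, in seconds, for each period."""
    unavailable_fraction = 1 - availability_pct / 100
    return {name: secs * unavailable_fraction for name, secs in PERIODS.items()}


# Print the downtime allowance for the three objectives discussed above.
for slo in (99.999, 99.99, 99.95):
    budget = allowed_downtime(slo)
    print(slo, {name: round(secs, 1) for name, secs in budget.items()})
```

Each figure is simply the length of the period multiplied by the fraction of time the service is allowed to be down.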

Systems of systems

Reliability Engineering by definition requires systems thinking and, in reality, systems-of-systems thinking. A final reliability number is a composite number for all the systems in the system of systems, which in turn are made up of multiple services themselves. You are only as reliable as the least reliable system (or really, the least reliable service). If I am trying to achieve such a level of reliability, the first thing I need is visibility and control over all the systems in the system of systems. Unfortunately, in the internet-connected world today, that is not a luxury we have. If I am delivering a copper-line based phone service, I have full control over every component of that system of systems, from the exchange all the way to your home. The only part I do not own is the wiring and device inside your home (or workplace). Hence, I can guarantee reliability. I own it, I measure it, I deliver it. In a web-delivered system, I own very little. In a cloud-hosted system, I own even less. I am dependent on the reliability of multiple services, from potentially myriad providers, that I have no control over. Outages are common and expected somewhere in the set of systems and services. What do I do if my cloud service provider has an outage in their storage service? What do I do if someone turns off the UPS power backup in a data center? My focus has to shift from preventing outages to ensuring rapid recovery without data loss. I have to shift my thinking from Mean Time Between Failures (MTBF) to Mean Time To Repair (MTTR). I have to build what Nassim Taleb calls Antifragile Systems.
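To put rough numbers on both points, here is a small Python sketch. It rests on two textbook simplifications that are my assumptions, not measurements of any real system: that a request needs every dependency in the chain to be up and that those dependencies fail independently, and that steady-state availability can be approximated as MTBF / (MTBF + MTTR). The function names are illustrative.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every dependency must be up.

    Assumes failures are independent, which real systems only approximate.
    """
    return prod(availabilities)


def availability_from_mtbf_mttr(mtbf_hours, mttr_hours):
    """Steady-state availability approximated as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Three dependencies, each 99.9% available, chained together.
print(serial_availability([0.999, 0.999, 0.999]))

# Same failure rate (one failure per 1,000 hours), two different repair times.
print(availability_from_mtbf_mttr(1000, 1.0))   # one hour to repair
print(availability_from_mtbf_mttr(1000, 0.1))   # six minutes to repair
```

Under these assumptions the chain comes out slightly less available than its weakest link, and cutting repair time from an hour to six minutes moves availability from roughly three nines to roughly four without preventing a single failure.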

Site Reliability Engineering

The complexity of managing and delivering the high level of reliability expected of web-based, cloud-hosted systems today (ever seen Facebook or the Google search engine have even a scheduled outage?), together with the expectation of Continuous Delivery of new features and bug fixes (my mobile phone always has apps that need to be updated – always), has led to the evolution of a totally new field of Reliability Engineering tailored to such systems. Google, a pioneer in this field, calls it Site Reliability Engineering (SRE). While it would be more aptly named Service Reliability Engineering (and still keep the acronym SRE), the name has caught on. The seminal work documenting Google’s approach and practices, the book of the same name (commonly referred to as the ‘SRE book’[ii]), has become the de facto standard on how to adopt SRE in an organization. ‘SRE Engineer’ has suddenly become almost as common a title on LinkedIn profiles as ‘DevOps Engineer’ (don’t get me started…). Going back to the name Site Reliability Engineering, Google does define SRE as ‘Google’s approach to Service Management’. I guess they use the term ‘Site’ given the nature of their core business, but at the end of the day, it is all about Service Reliability Management, which is the focus of this blog post (and a few more following posts, given I have already crossed 1,000 words at this point in the post and am probably losing a few of my fellow short-attention-span readers). In the SRE book, Google defines the following eight core tenets of SRE:
  • Ensuring a Durable Focus on Engineering: SRE dictates that the team delivering the Services operate under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
  • Pursuing Maximum Change Velocity Without Violating a Service’s SLO: This addresses the structural conflict between pace of innovation and product stability. SRE practices bring this conflict to the fore, and then work to resolve it with the introduction of what Google calls an error budget: the permitted unavailability (downtime) of a service over a finite time period (say a month or a year).
  • Monitoring: The SRE approach to monitoring is focused on automation. Monitoring is set up such that it does not require a human to interpret any part of the alerting domain. Instead, software does the interpreting, and humans are notified only when they need to take an action that software cannot take automatically.
  • Emergency Response: According to SRE, the most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health—that is, the aforementioned MTTR.
  • Change Management: As most outages are caused by changes to a live system, SRE prescribes three practices to minimize errors in change management: progressive rollouts of changes; quick and accurate problem detection; and rollbacks when problems occur.
  • Demand Forecasting and Capacity Planning: Capacity is critical to availability. SRE provides guidance on proper capacity planning based on demand forecasts, and proper load testing to validate the correlation between the infrastructure’s raw capacity (compute, storage and network capacity) and service capacity (capacity for a service to run and scale).
  • Provisioning: Provisioning is needed both for Change Management and Capacity Management. As capacity is expensive, provisioning needs to be judicious and accurate to prevent both over- and under-provisioning. Provisioning and de-provisioning need to be fast, accurate and efficient.
  • Efficiency and Performance: SRE provides guidance on how to provision the right capacity to meet a specific response-speed target, keeping both service performance and capacity utilization optimal as loads on the services vary over time. This goes hand in hand with both Capacity Planning and Provisioning.
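Of these tenets, the error budget is the easiest to make concrete. The Python sketch below is illustrative only: the function names, the 30-day window and the simple ‘hold releases once the budget is spent’ rule are my assumptions, not a policy taken from the SRE book.

```python
def error_budget_minutes(slo_pct: float, window_days: float) -> float:
    """Permitted downtime, in minutes, for an SLO over a time window."""
    return window_days * 24 * 60 * (1 - slo_pct / 100)


def budget_remaining(slo_pct: float, window_days: float,
                     downtime_minutes: float) -> float:
    """Budget left after the downtime already incurred in the window."""
    return error_budget_minutes(slo_pct, window_days) - downtime_minutes


def can_release(slo_pct: float, window_days: float,
                downtime_minutes: float) -> bool:
    """Hypothetical gate: allow risky changes only while budget remains."""
    return budget_remaining(slo_pct, window_days, downtime_minutes) > 0


# A 99.9% SLO over a 30-day window, with 10 minutes of downtime so far.
print(round(error_budget_minutes(99.9, 30), 1))
print(can_release(99.9, 30, downtime_minutes=10))
```

The point of the budget is that the development and operations sides argue about one shared number instead of about whether a given release feels risky.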

SRE in traditional Enterprises

Why is adopting SRE tough? Why can we not just introduce the relevant practices, and the people skilled in them, and go to town successfully? It is because, just like any other transformation, from adopting Agile to adopting DevOps to adopting Cloud, the cultural and organizational changes required to enable the practices are both tough to make. True transformation needs overcoming cultural inertia through both top-down and bottom-up buy-in, sponsorship and an enterprise-wide willingness to change. Organizational change requires breaking down traditional team structures, functional silos and power fiefdoms which people, from the practitioner to the executive, are unwilling to let go of. Job security suddenly comes to the forefront of most minds when people are introduced to new capabilities and methods they are not skilled in. ‘How will my job change if all provisioning is automated?’ ‘Will I have a job if all these tasks I do manually are managed by software?’ ‘I do not know how to code. How can I contribute in this new software-driven IT world?’

Let’s, however, assume that all this ‘soft’ adoption effort is already in play: your organization has a transformational leader who is making the necessary changes and investment. SRE is still hard. For a traditional enterprise, achieving the level of service reliability commonly seen in ‘born of the web’ companies (5-Nines level of availability) is not easy. It requires DNA-level change. The engineers responsible for Cloud Service Reliability need to work hard to design, develop and deliver reliable systems that can meet the required SLOs for availability and response time. And just like the engineers on the ground in Apollo 13, they need to have the skills, the freedom, the resources and the ability to work together as a team to restore the services needed, minimizing MTTR, when the reliability of the system fails. Because it will fail.

Author: Sanjeev Sharma