Apollo 13, in my humble opinion, is the best movie ever made about engineering, and especially about reliability engineering in action. It is certainly one of my personal favourite movies of all time. But how I view the movie differs by a full 180 degrees from how the rest of my family views it. To them it is a movie about three heroic astronauts stranded in space trying to make it back to Earth, and it has Tom Hanks as one of those brave astronauts. Without taking anything away from the heroism and bravery of the astronauts, to me it is a movie about a group of heroic and ingenious engineers back on Earth solving real life-and-death engineering problems within tremendous constraints. It is the story of a group of engineers working against time to build a reliable system to address defects in what was already a highly reliable system, but one that had failed. (We can go into how that failure was caused by a lack of good integration requirements and a lack of integration testing, but I will leave that for another post.)
A system like the Apollo mission spacecraft is designed for 100% reliability. Human lives are at stake, and that too 250,000 miles from Earth. In his famous address announcing the Moon mission, President Kennedy set the goal of ‘… landing a man on the Moon and returning him safely to Earth’ (emphasis mine). The reliability requirements came from the ‘returning him safely to Earth’ part[i]. This requires 100% reliability at a system-of-systems level, something impossible to achieve, as the mission showed us. The engineers on the ground had to solve some of the toughest engineering problems ever solved, given the hostile environmental conditions and the time and resource constraints they had. We weren’t talking about just rebooting servers and routers here. My heroes!
Understanding Reliability
I am an electrical engineer by training, and the reliability Service Level Objectives (SLOs) I was trained on were always defined in 5-nines terms: 99.999% reliability. Engineering firms, especially those in the electrical engineering space, have had those levels of reliability objectives for decades. In today’s world of web applications, such an objective is unreasonable. We will soon discuss why that is so, but first, let’s understand what that level of reliability means. Let’s step away from space mission reliability requirements and look at reliability in terms we may understand and relate to more: service availability. How much downtime does 99.999% availability translate to?
- Daily: 0.9s
- Weekly: 6.0s
- Monthly: 26.3s
- Yearly: 5m 15.6s
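
The arithmetic behind such downtime figures is simple enough to sketch in a few lines of Python. This is a minimal illustration, not a definitive calculator: the function names `allowed_downtime` and `fmt` are my own, and I am assuming a year of 365.25 days and a month of one twelfth of a year, so other calculators may differ slightly in the last digit.

```python
# Period lengths in seconds. Assumption: a year is 365.25 days and a
# month is one twelfth of a year.
PERIODS = {
    "Daily": 24 * 3600,
    "Weekly": 7 * 24 * 3600,
    "Monthly": 365.25 * 24 * 3600 / 12,
    "Yearly": 365.25 * 24 * 3600,
}


def allowed_downtime(availability_pct):
    """Return the permitted downtime, in seconds, for each period."""
    unavailable_fraction = 1 - availability_pct / 100
    return {name: length * unavailable_fraction for name, length in PERIODS.items()}


def fmt(seconds):
    """Format a duration in seconds as 'Xm Y.Ys' (or 'Y.Ys' under a minute)."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)}m {secs:.1f}s" if minutes else f"{secs:.1f}s"


# Compare four nines with five nines.
for pct in (99.99, 99.999):
    for period, seconds in allowed_downtime(pct).items():
        print(f"{pct}% {period}: {fmt(seconds)}")
```

Each extra nine cuts the permitted downtime by a factor of ten: at 99.999% the yearly allowance works out to roughly 5m 15.6s, while at 99.99% it is ten times that.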
Systems of systems
Reliability Engineering by definition requires systems thinking and, in reality, systems of systems thinking. A final reliability number is a composite number for all the systems in the system of systems, which in turn are made up of multiple services themselves. You are at best as reliable as the least reliable system (or really, the least reliable service). If I am trying to achieve such a level of reliability, the first thing I need is visibility and control over all the systems in the system of systems. Unfortunately, in the internet-connected world today, that is not a luxury we have. If I am delivering a copper-line based phone service, I have full control over every component of that system of systems, from the exchange all the way to your home. The only part I do not own is the wiring and device inside your home (or workplace). Hence, I can guarantee reliability. I own it, I measure it, I deliver it. In a web-delivered system, I own very little. In a cloud-hosted system, I own even less. I am dependent on the reliability of multiple services, from potentially myriad providers, that I have no control over. Outages are common and expected somewhere in the set of systems and services. What do I do if my cloud service provider has an outage in its storage service? What do I do if someone turns off the UPS power backup in a data center? My focus has to shift from preventing outages to ensuring rapid recovery without data loss. I have to shift my thinking from Mean Time Between Failures (MTBF) to Mean Time To Repair (MTTR). I have to build what Nassim Taleb calls Antifragile Systems.
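
To make the composite-number point concrete, here is a rough sketch in Python. It rests on two textbook simplifications that are my assumptions, not something measured on a real system: every dependency must be up for the service to be up, and failures are independent, so availabilities multiply; and steady-state availability is MTBF / (MTBF + MTTR). The function names and the sample figures are hypothetical.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from MTBF and MTTR (both in hours)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def composite_availability(availabilities):
    """Availability of a chain in which every dependency must be up.

    Assumes failures are independent, so the individual availabilities
    simply multiply.
    """
    return math.prod(availabilities)


# Hypothetical dependencies: storage, network and my own application.
chain = [0.999, 0.9995, 0.9999]
print(f"composite: {composite_availability(chain):.5f}")
print(f"weakest link: {min(chain):.5f}")

# Hypothetical figures: repairing ten times faster buys the same
# availability as failing ten times less often.
print(availability(1000, 1.0))    # baseline
print(availability(1000, 0.1))    # MTTR cut tenfold
print(availability(10000, 1.0))   # MTBF raised tenfold
```

Under these assumptions the composite comes out below even the weakest link, and availability depends only on the ratio of MTBF to MTTR. That is why, when I cannot control how often my dependencies fail, shortening recovery time is the lever I still have.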
Site Reliability Engineering
The complexity of managing and delivering the high level of reliability expected of web-based, cloud-hosted systems today (ever seen Facebook or the Google search engine have even a scheduled outage?), together with the expectation of Continuous Delivery of new features and bug fixes (my mobile phone always has apps that need to be updated, always), has led to the evolution of a totally new field of Reliability Engineering tailored to such systems. Google, which has been a pioneer in this field, calls it Site Reliability Engineering (SRE). While it would be more aptly named Service Reliability Engineering (and still keep the acronym SRE), the name has caught on. The seminal work documenting Google’s approach and practices, the book by the same name (commonly referred to as the ‘SRE book’[ii]), has become the de facto standard on how to adopt SRE in an organization. ‘SRE Engineer’ has suddenly become almost as common a title on LinkedIn profiles as ‘DevOps Engineer’ (don’t get me started…). Going back to the name Site Reliability Engineering, Google does define SRE as ‘Google’s approach to Service Management’. I guess they use the term ‘Site’ given the nature of their core business, but at the end of the day, it is all about Service Reliability Management, which is the focus of this blog post (and a few more following posts, given I have already crossed 1,000 words at this point in the post and am probably losing a few of my fellow short-attention-span readers). In the SRE book, Google defines the following eight core tenets of SRE:
- Ensuring a Durable Focus on Engineering: SRE dictates that the team delivering the Services operate under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
- Pursuing Maximum Change Velocity Without Violating a Service’s SLO: This addresses the structural conflict between pace of innovation and product stability. SRE practices bring this conflict to the fore, and then work to resolve it with the introduction of what Google calls an error budget: the permitted unavailability (downtime) of a service over a finite time period (say, a month or a year).
- Monitoring: The SRE approach to monitoring is focused on automation. Monitoring is set up such that it does not require a human to interpret any part of the alerting domain. Instead, software does the interpreting, and humans are notified only when they need to take an action that software cannot take automatically.
- Emergency Response: According to SRE, the most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health—that is, the aforementioned MTTR.
- Change Management: As most outages are caused by changes to a live system, SRE prescribes the implementation of three practices to ensure error-free change management: progressive rollouts of changes; quick and accurate problem detection; and rollbacks when problems occur.
- Demand Forecasting and Capacity Planning: Capacity is critical to availability. SRE provides guidance on proper capacity planning based on demand forecasts, and proper load testing to validate the correlation between the infrastructure’s raw capacity (compute, storage and network capacity) and service capacity (capacity for a service to run and scale).
- Provisioning: Provisioning is needed both for Change Management and Capacity Management. As capacity is expensive, provisioning needs to be judicious and accurate to prevent both over- and under-provisioning. Provisioning and de-provisioning need to be fast, accurate and efficient.
- Efficiency and Performance: SRE provides guidance on how to provision the right capacity to meet a target response speed, keeping both service performance and capacity utilization optimal as loads on the services vary over time. This goes hand in hand with both Capacity Planning and Provisioning.
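
The error budget from the second tenet lends itself to a quick worked example. The Python below is only a sketch under my own assumptions: the function names are hypothetical, the SLO is expressed purely as availability over a fixed window, and the policy of pausing releases once the budget is spent is one common way to use the budget, not the only one.

```python
def error_budget_seconds(slo_pct, window_days):
    """Permitted downtime, in seconds, for an availability SLO over a window."""
    window_seconds = window_days * 24 * 3600
    return window_seconds * (1 - slo_pct / 100)


def budget_remaining(slo_pct, window_days, downtime_seconds):
    """Error budget left after the downtime already incurred in the window."""
    return error_budget_seconds(slo_pct, window_days) - downtime_seconds


def can_release(slo_pct, window_days, downtime_seconds):
    """Assumed policy: keep shipping changes only while budget remains."""
    return budget_remaining(slo_pct, window_days, downtime_seconds) > 0


# Hypothetical service: 99.9% SLO over 30 days, 1,000 seconds of downtime so far.
print(error_budget_seconds(99.9, 30))
print(budget_remaining(99.9, 30, 1000.0))
print(can_release(99.9, 30, 1000.0))
```

A 99.9% SLO over 30 days leaves a budget of about 2,592 seconds (43.2 minutes). Framed this way, the budget is something the team can spend on change velocity, which is how it resolves the conflict between innovation and stability.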