Site Reliability Engineering: How Google Runs Production Systems – Book Review

Essential Read for anyone managing highly available distributed systems at scale

First off – It’s worth let you know that Google lets you read this “entire” book online for free on their website. Yes you read it right, you don’t need to buy the book, just click on below link – https://landing.google.com/sre/sre-book/toc/index.html and start reading!

The book starts with a story about a time Margaret Hamilton brought her young daughter with her to NASA, back in the days of the Apollo program. During a simulation mission, her daughter caused the mission to crash by pressing some keys accidentally. Hamilton noticed this defect and proactively submitted a change to add error checking code to prevent this from happening again, however the change was rejected because program leadership believed that error should never happen. On the next mission, Apollo 8, that exact error condition occurred and a potentially fatal problem that could have been prevented with a trivial check took NASA’s engineers 9 hours to resolve. Hence early learning from book

“Embrace the idea that systems failures are inevitable, and therefore teams should work to optimize to recover quickly through using SRE principles.”

The book is divided into four parts, each comprised of several sections. Each section is authored by a Google engineer.

In Part I, Introduction, the authors introduce Google’s Site Reliability Engineering (SRE) approach to managing global-scale IT services running in datacenters spread across the entire world. (Google approach is truly extraordinary) After a discussion about how SRE is different from DevOps (another hot term of the day), this part introduces the core elements and requirements of SRE, which include the traditional Service Level Objectives (SLOs) and Service Level Agreements (SLAs), management of changing services and requirements, demand forecasting and capacity, provisioning and allocation, etc. Through a sample service, Shakespeare, the authors introduce the core concepts of running a workflow, which is essentially a collection of IT tasks that have inter-dependencies, in the datacenter.

In Part II, Principles, the book focuses on operational and reliability risks, SLO and SLA management, the notion of toil (mundane work that scales linearly, and can be automated) and the need to eliminate it (through automation), how to monitor the complex system that is a datacenter, a process for automation as seen at Google, the notion of engineering releases, and, last, an essay on the need for simplicity . This rather disparate collection of notions is very useful, explained for the laymen but still with enough technical content to be interesting even for the expert (practitioner or academic).

In Parts III and IV, Practices and Management, respectively, the book discusses a variety of topics, from time-series analysis for anomaly detection, to the practice and management of people on-call, to various ways to prevent and address incidents occurring in the datacenter, to postmortems and root-cause analysis that could help prevent future disasters, to testing for reliability (a notoriously difficult issue), to software engineering the SRE team, to load-balancing and overload management (resource management and scheduling 101), communication between SRE engineers, etc. etc. etc., until the predictable call for everyone to use SRE as early as possible and as often as possible. This is where I started getting a much better sense of practical SRE (a.ha!)

Overall it’s a great read, however it isn’t perfect. The two big downsides for me are 1.) this is one of those books that’s a collection of chapters by different people, so there’s a fair amount of redundancy and 2.) the book takes a sided approach on “Build Vs Buy” dilemma of engineering. I mean at Google scale, it will always be better to build, however that is rarely true in the real world. But even including the downsides, I’d say that this is the most valuable technical book I’ve read in the year. If you really like these notes, you’ll probably want to read the full book.

Author: Varun Dhawan

I’m Varun. I used to be a Software Engineer building data applications for large corporations like McKinsey and Target. Now, I’m a Product Manager at Microsoft, making Azure PostgreSQL the go-to platform for running mission-critical workloads (and no, I’m not obsessing over every little detail… I swear). When I’m not working, you can find me blogging at data-nerd.blog, where I help fellow data enthusiasts master PostgreSQL, sharpen their coding skills, and navigate their careers with confidence. And if there’s one thing you absolutely need to know about me, it’s that…I'm apparently a great cook—just don’t ask why I’m the only one who eats my food. View all posts by Varun Dhawan

Essential Read for anyone managing highly available distributed systems at scale

Share this:

Related

Author: Varun Dhawan

Leave a comment Cancel reply