Why Averages Suck and What Makes Percentiles Great

Average (mean), median, and mode are core statistical concepts that are often applied in software engineering. Whether you are new to programming or have many years of experience, you've likely used these statistical functions at some point, say to calculate system resource utilization, network traffic, or website latency. In my current role, my team is responsible for running a telemetry platform that helps dev teams measure application performance. We do this by collecting point-in-time data points referred to as metrics.

A common use case for metrics is to measure application latency (i.e. the amount of time between a user action and the web app's response to that action). For example, the time between you tapping a Twitter photo and it finally showing up on your device screen. So if you have this metric collected at regular intervals (say every 1s), you can simply average it over a period of time, like an hour or a day, to calculate latency. Simple, right?
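To make that concrete, here is a minimal Python sketch of what that averaging might look like; the `samples` list is made-up data standing in for latency readings collected once per second:

```python
# Hypothetical latency samples (in seconds), one reading per second.
# In a real setup these would come from your telemetry platform.
samples = [0.9, 1.2, 0.8, 3.4, 1.1]

average_latency = sum(samples) / len(samples)
print(f"average latency over the window: {average_latency:.2f}s")
```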

Well, it might not be that simple. Averages are bad in this case. Let me explain why.

Disclaimer: By no means do I claim to be an expert on statistics, so please correct me if I’m wrong. 😀

Why do averages suck?

Consider this:

  • You have code that measures how long a user waits for an image to render.
  • You collected 5 data points over a period of time: 3s, 5s, 7s, 4s, and 2s.
  • If you average them, you get 4.2 (i.e. (3+5+7+4+2) / 5 = 4.2), as the quick sketch after this list confirms.
  • However, 4.2 seconds is NOT representative of your actual users’ experience. From this data, it’s clear that some of your users are having a fast experience (2–3 seconds ✅) and some are having a slow experience (7 seconds ❌), but none of them actually waited 4.2 seconds. The average describes an experience nobody had, so it isn’t helping.
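Here is that arithmetic as a quick Python sketch, using the five data points from the list above:

```python
# The five observed render times, in seconds.
samples = [3, 5, 7, 4, 2]

average = sum(samples) / len(samples)
print(f"average: {average:.1f}s")   # 4.2s
print(average in samples)           # False -- no user actually waited 4.2s
```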

Are Percentiles better in this case? Yes!

A percentile indicates the percentage of a distribution that is equal to or below a given value. For example, the 95th percentile is the value that is greater than or equal to 95% of all observed values. Coming back to our app latency scenario, instead of calculating the average of all observed data points, we calculate the 50th or 90th percentile (P50 or P90).

P50 – 50th Percentile

  • Sort the data points in ascending order: 2s, 3s, 4s, 5s, 7s.
  • You get P50 by throwing out the bottom 50% of the points and looking at the first point that remains: 4s.

P90 – 90th Percentile

  • Sort the data points in ascending order: 2s, 3s, 4s, 5s, 7s.
  • You get P90 by throwing out the bottom 90% of the points and looking at the first point that remains: 7s (a code sketch follows below).
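There is more than one convention for computing percentiles (nearest-rank vs. interpolation); the following is a minimal nearest-rank sketch that reproduces the walk-through above for this particular data set:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest observed value with at
    least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)  # 1-based rank
    return ordered[rank - 1]

samples = [3, 5, 7, 4, 2]
print(percentile(samples, 50))  # 4 -> P50
print(percentile(samples, 90))  # 7 -> P90
```

Libraries such as NumPy expose the same idea (e.g. numpy.percentile), though their default interpolation can give slightly different values on small samples like this one.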

Using percentiles has these advantages:

  1. Percentiles aren’t skewed by outliers like averages are.
  2. Every percentile data point is an actual user experience, unlike averages.

You can plot percentiles on a time series graph just like averages, and you can also set up threshold alerts on them. So if P90 goes above 5 seconds (i.e. more than 10% of observed requests took longer than 5s), you can be alerted. Below is a spreadsheet example to illustrate percentiles.
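As a rough illustration of such a threshold check (reusing the hypothetical `percentile` helper from the sketch above, with `send_alert` standing in for whatever notification hook your monitoring system provides):

```python
P90_THRESHOLD_SECONDS = 5

def check_p90(samples, send_alert):
    """Fire an alert if the 90th-percentile latency crosses the threshold."""
    p90 = percentile(samples, 90)
    if p90 > P90_THRESHOLD_SECONDS:
        send_alert(f"P90 latency is {p90:.1f}s "
                   f"(threshold: {P90_THRESHOLD_SECONDS}s)")

# Example: prints the alert message for the sample data above (P90 = 7s).
check_p90([3, 5, 7, 4, 2], send_alert=print)
```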

As you might have noticed, when you use percentile-based metrics, you get a much better sense of reality.

Some interesting facts about percentiles

  • Percentiles are commonly referred to in shorthand: p99 (or P99, or P₉₉) means “99th percentile”, p50 means “50th percentile”…you get the drift.
  • P50 is the same as the median (the mid-point of a distribution); see the quick check after this list.
  • And percentiles are NOT percentages!
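You can sanity-check that second point with Python’s standard library (again reusing the hypothetical `percentile` helper from earlier). Note that for an even number of samples the exact convention matters, since `statistics.median` interpolates between the two middle values while nearest-rank does not:

```python
import statistics

samples = [3, 5, 7, 4, 2]
print(statistics.median(samples))   # 4
print(percentile(samples, 50))      # 4 -- same value for this odd-sized sample
```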

Conclusion

Now armed with some basic knowledge about percentiles, hopefully you’ll start seeing your metrics in a whole different way.

Site Reliability Engineering: How Google Runs Production Systems – Book Review

Essential Read for anyone managing highly available distributed systems at scale

First off – it’s worth letting you know that Google lets you read this “entire” book online for free on their website. Yes, you read that right: you don’t need to buy the book. Just click the link below – https://landing.google.com/sre/sre-book/toc/index.html – and start reading!

The book starts with a story about a time Margaret Hamilton brought her young daughter with her to NASA, back in the days of the Apollo program. During a simulation mission, her daughter caused the mission to crash by accidentally pressing some keys. Hamilton noticed this defect and proactively submitted a change to add error-checking code to prevent it from happening again; however, the change was rejected because program leadership believed that the error could never happen. On the next mission, Apollo 8, that exact error condition occurred, and a potentially fatal problem that could have been prevented with a trivial check took NASA’s engineers 9 hours to resolve. Hence an early lesson from the book:

“Embrace the idea that system failures are inevitable, and therefore teams should work to optimize to recover quickly by using SRE principles.”

The book is divided into four parts, each comprised of several sections. Each section is authored by a Google engineer.

In Part I, Introduction, the authors introduce Google’s Site Reliability Engineering (SRE) approach to managing global-scale IT services running in datacenters spread across the entire world. (Google’s approach is truly extraordinary.) After a discussion about how SRE is different from DevOps (another hot term of the day), this part introduces the core elements and requirements of SRE, which include the traditional Service Level Objectives (SLOs) and Service Level Agreements (SLAs), management of changing services and requirements, demand forecasting and capacity planning, provisioning and allocation, etc. Through a sample service, Shakespeare, the authors introduce the core concepts of running a workflow, which is essentially a collection of IT tasks with inter-dependencies, in the datacenter.

In Part II, Principles, the book focuses on operational and reliability risks, SLO and SLA management, the notion of toil (mundane work that scales linearly and can be automated) and the need to eliminate it (through automation), how to monitor the complex system that is a datacenter, a process for automation as seen at Google, the notion of engineering releases, and, last, an essay on the need for simplicity. This rather disparate collection of notions is very useful, explained for the layman but still with enough technical content to be interesting even for the expert (practitioner or academic).

In Parts III and IV, Practices and Management, respectively, the book discusses a variety of topics: time-series analysis for anomaly detection, the practice and management of being on-call, various ways to prevent and address incidents occurring in the datacenter, postmortems and root-cause analysis that could help prevent future disasters, testing for reliability (a notoriously difficult issue), software engineering within the SRE team, load balancing and overload management (resource management and scheduling 101), communication between SRE engineers, and more, ending with the predictable call for everyone to adopt SRE as early and as often as possible. This is where I started getting a much better sense of practical SRE (aha!).

Overall it’s a great read; however, it isn’t perfect. The two big downsides for me are: 1) this is one of those books that is a collection of chapters by different people, so there’s a fair amount of redundancy, and 2) the book takes a one-sided approach to the “Build vs. Buy” dilemma of engineering. At Google’s scale it will almost always be better to build, but that is rarely true in the real world. Even with those downsides, I’d say this is the most valuable technical book I’ve read this year. If you really like these notes, you’ll probably want to read the full book.