Corruption, Lies and Statistics: The Hunt for the Bigger Picture
When dealing with big numbers and data, which are everywhere today, looking for a central tendency can be the first mistake, one that leads to many missed opportunities.
In this age of big data, it’s easy to drown in a sea of numbers. To make it to the other side, one needs to understand numbers and measurements, or at least not feel intimidated by them. There’s no other way. Statistics are everywhere. Statistical literacy is an important tool, and it will only become more crucial in the future.
You can’t ignore statistics, but to deal with the sheer volume of data and numbers, people usually settle for less information. People ‘compress’ the distribution. The amount of data can get big – too big to handle, and the main idea of compressing it is to be able to deduce something with as little information as possible. The problem is that this is only a good tactic in theory.
When the Numbers Misbehave
Most of us are familiar with averages as a way to compress data. Instead of using all the data, we look at a single number, which is much easier to understand and compare across many distributions. It’s useful in many models.
If I told you that I spent $42, $42, $40, $10, $89, $0, $0, $17, $12, $42, $42, $42 and $42 on lunch this month, you would probably feel lost. But what if I told you I spent about $32 on average on my midday meal?
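As a quick check, the average can be computed directly from the amounts listed above. This is a minimal Python sketch using only the standard library; the variable name is illustrative.

```python
from statistics import mean

# Lunch spending in dollars, copied from the list above.
lunches = [42, 42, 40, 10, 89, 0, 0, 17, 12, 42, 42, 42, 42]

average = mean(lunches)

print(f"lunches: {len(lunches)}")            # lunches: 13
print(f"average: ${average:.2f}")            # average: $32.31
print(f"cheapest: ${min(lunches)}, priciest: ${max(lunches)}")
```

The single number is easy to compare, but it hides that no lunch actually cost that amount and that two days cost nothing at all.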
This is great when the numbers behave “nicely.” In many cases, however, we have non-symmetric, noisy data; in these cases, compressing it will cause us to lose information. Another important thing to consider is statistical corruption, which occurs when an estimator is too sensitive to a change in a single data point.
Let’s take a group of company employees sitting in a room together. It would be easy to calculate their average salary if they were willing to share it. But what happens to this number if the CEO walks in? And what happens to it if Jeff Bezos follows?
With the average, if we change a single observation in the sample to infinity, the mean becomes infinity too. In the same example, the median barely moves: at most it shifts to a neighboring value. So the median, the value separating the data into two halves, is less prone to corruption. Maybe the median is the way to go? What about the mode, the value that occurs most often? It all lies in the bigger picture.
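The room example can be sketched in a few lines of Python. All salary figures below are made up for illustration; only the contrast between the two estimators matters.

```python
from statistics import mean, median

# Hypothetical annual salaries (USD) of the employees already in the room.
salaries = [48_000, 52_000, 55_000, 61_000, 67_000]

# Hypothetical pay figures for the two people who walk in.
ceo_pay = 2_000_000
billionaire_income = 1_000_000_000


def summarize(label, values):
    """Print the mean and median of a list of salaries on one line."""
    print(f"{label:<22} mean = {mean(values):>15,.0f}   median = {median(values):>8,.0f}")


with_ceo = salaries + [ceo_pay]
with_both = salaries + [ceo_pay, billionaire_income]

summarize("employees only", salaries)
summarize("+ CEO", with_ceo)
summarize("+ CEO + billionaire", with_both)
```

In this sketch the mean jumps from roughly $57,000 to over $143 million, while the median only moves from $55,000 to $61,000.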
The Three M’s
Central tendency metrics can be corrupted rather easily. Consider this example: the AC in a room is set to a daily mean (average) temperature of 24°C. Are you OK with this? What if we set the AC to a daily median temperature of 24°C? What if I promise a mode temperature of 24°C?
Obviously, if you care about your comfort, you will dismiss all of these options and run to grab the AC remote. A central tendency on its own tells you nothing about the spread of the distribution, so you can freeze, or take the heat, while thinking you made the right choice, the right observation.
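To see why none of the three promises is reassuring, here is a small Python sketch with two invented days of temperature readings. Both days share a mean, median and mode of 24°C, yet only one of them is comfortable. The readings are assumptions made for this illustration.

```python
from statistics import mean, median, mode

# Two hypothetical days of readings in °C (made-up values).
steady_day = [24, 24, 23, 24, 25, 24, 24, 24, 25, 23, 24, 24]
wild_day = [10, 12, 24, 24, 38, 24, 36, 24, 11, 37, 24, 24]

for name, readings in (("steady", steady_day), ("wild", wild_day)):
    print(
        f"{name:<7} mean={mean(readings):.1f}  median={median(readings):.1f}  "
        f"mode={mode(readings)}  min={min(readings)}  max={max(readings)}"
    )
```

All three central tendencies agree on 24°C for both days; only the minimum and maximum reveal that the second day swings between 10°C and 38°C.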
Let’s take another, more complex example. Players of a social gaming company pay $159 per month on average. Looking at the whole distribution, I see that the median is $3. What happened here? Are there any outliers? Not really. The answer is that roughly half of the players paid only the minimum amount ($1.99), just enough to be able to participate.
You can look at this data and tell yourself that $159 is a nice average. But then you are missing out on a great opportunity to better communicate with your clients by segmenting them into two types: the ones who paid the minimum, and the rest. Looking at the median in this case sheds new light on all the data we gathered. And looking at the bigger picture will lead to even better results, with more groups of players we can segment and reach.
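The gap between the mean and the median, and the segmentation it suggests, can be sketched in Python. The payment amounts below are invented so that the summary figures land near the ones quoted above; they are not the company's data, and the names are illustrative.

```python
from statistics import mean, median

MINIMUM_PRICE = 1.99  # minimum monthly payment, as quoted in the text

# Hypothetical monthly payments (USD) for eleven players.
payments = [1.99, 1.99, 1.99, 1.99, 1.99, 3.00, 250, 300, 350, 400, 436]

print(f"mean:   ${mean(payments):.2f}")
print(f"median: ${median(payments):.2f}")

# Split the players into two segments: minimum payers and everyone else.
minimum_payers = [p for p in payments if p <= MINIMUM_PRICE]
other_payers = [p for p in payments if p > MINIMUM_PRICE]

for name, segment in (("minimum payers", minimum_payers), ("other payers", other_payers)):
    share = len(segment) / len(payments)
    print(f"{name:<15} {len(segment):>2} players ({share:.0%}), mean ${mean(segment):.2f}")
```

No single player here is an extreme outlier; the mean is simply a blend of two very different groups, and it describes neither of them.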
This mistake is easy to avoid. All we need to do is look at the entire distribution, never trust a central tendency on its own, always ask for more descriptive metrics, and look for the probability of rare events. A box plot, which presents the important percentiles, is a good place to start. We have to understand that numbers tell us a story. We can’t open a book on page 37 and figure out everything that’s happening, nor can we settle for reading the synopsis on the back cover. There’s probably a different story out there that we’re missing.
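The percentiles a box plot would draw can also be computed directly. This is a minimal sketch using Python's standard `statistics.quantiles` on the lunch amounts from earlier; the choice of the "inclusive" method and the $80 threshold for a rare event are assumptions made for illustration.

```python
from statistics import quantiles

# Lunch spending in dollars, as listed earlier in the article.
lunches = [42, 42, 40, 10, 89, 0, 0, 17, 12, 42, 42, 42, 42]

# Quartiles: the three cut points that split the sorted data into four parts.
# method="inclusive" assumes the data already contains its own extremes.
q1, q2, q3 = quantiles(lunches, n=4, method="inclusive")

five_number_summary = (min(lunches), q1, q2, q3, max(lunches))
print("min, Q1, median, Q3, max:", five_number_summary)

# Empirical share of "rare" expensive days; the threshold is arbitrary.
THRESHOLD = 80
rare_share = sum(x >= THRESHOLD for x in lunches) / len(lunches)
print(f"share of lunches costing ${THRESHOLD} or more: {rare_share:.1%}")
```

Here the median and the upper quartile are both $42, which shows how tightly the typical lunch clusters at that price, something the average alone would never reveal.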
Nitay is a researcher at Optimove’s Data Lab. His fields of expertise include Bayesian inference, advanced stochastic processes and machine learning. At Optimove, Nitay has developed recommendation systems, anomaly detection mechanisms and other state-of-the-art solutions. Nitay holds an M.Sc in applied statistics and machine learning from Tel Aviv University.