Outliers
A definition and how to think about what it means to “remove outliers”
What is an outlier? The operative definition, “My analysis looks funky unless I delete these observations,” leaves something to be desired.
Here’s my stab at a punchy definition:
Outlier (n.) A highly informative observation.
Suppose we have a model:
Y = f(X) + U, E[U|X] = 0
And we’re trying to estimate f.
Suppose observation i has a very unusual value of Y given its X compared to the rest of the data. If we ignore just this one observation, our estimate of f changes substantially. I think this is a fair summary of what we mean when we say “outlier.” It’s what Cook’s D and DFBETA-type measures are getting at.
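For concreteness, here’s a small sketch of how those measures flag such a point. It uses statsmodels on simulated data; the setup and variable names are my own, not anything from a real analysis.

```python
import numpy as np
import statsmodels.api as sm

# Simulate a simple linear relationship, then plant one outlier.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
y[0] += 15.0  # observation 0 has a very unusual Y given its X

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

infl = fit.get_influence()
cooks_d, _ = infl.cooks_distance  # how much each point moves the whole fit
dfbetas = infl.dfbetas            # per-coefficient change from dropping each point

print("Most influential observation:", np.argmax(cooks_d))  # prints 0
print("Cook's D for obs 0:", cooks_d[0])
print("DFBETAS for obs 0:", dfbetas[0])
```

Dropping observation 0 moves the fitted coefficients far more than dropping any other point, which is exactly what these diagnostics measure.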
So, observation i contains a lot of information. It tells us that the rest of the data is missing something important about the relationship between X and Y. An observation that merely confirms what the rest of the data says about the relationship between X and Y carries very little information, so our estimate of f doesn’t change much if we drop it.
When we delete outliers, we delete highly informative data points, which doesn’t sound great.
What should we do?
I like to think about outliers as being a part of a mixture model like this:
D = 1(Outlier)
f(X) = g(X) (1 - Pr(D = 1 | X)) + h(X) Pr(D = 1 | X)
where g(X) = E[Y | X, D = 0] is the regression among non-outliers and h(X) = E[Y | X, D = 1] is the regression among outliers.
When we delete outliers, we claim that we don’t have enough data to pin down h(X), so we’re focusing on estimating g(X) by conditioning on non-outliers. We’re removing information about h(X) to focus on g(X). We’re choosing more bias (because we’re ignoring a relevant part of the population) to reduce variance.
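To make the decomposition concrete, here’s a tiny simulation sketch. The choices of g, h, and the outlier rate Pr(D = 1 | X) are toy assumptions of mine; the point is just to check numerically that f really is this mixture of the two regressions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0, 1, n)

p = 0.05 * (1 + x)              # Pr(D=1|X=x), an assumed outlier rate
d = rng.random(n) < p           # D = 1(Outlier)
g = 1.0 + 2.0 * x               # g(x) = E[Y | X=x, D=0]  (assumed)
h = 20.0 - 5.0 * x              # h(x) = E[Y | X=x, D=1]  (assumed)
y = np.where(d, h, g) + rng.normal(0, 1, n)

# f(x) = E[Y | X=x] should equal g(x)(1 - p(x)) + h(x)p(x).
band = (x > 0.45) & (x < 0.55)  # thin slice around x = 0.5
x0, p0 = 0.5, 0.05 * 1.5
print("E[Y | X ~ 0.5] from data:", y[band].mean())
print("g(1-p) + h*p at x = 0.5 :", (1 + 2*x0)*(1 - p0) + (20 - 5*x0)*p0)
```

If we condition on d == 0 before estimating, we recover g, not f, which is exactly the trade described above.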
When we cast the problem this way, it’s easier to think about what we’re doing. If we find ourselves throwing away a decent percentage of the data, we should stop and realize that the real problem is that our model is too wrong. It can’t account for an important feature of the data. We need to rethink things. If we’re dropping only a few observations, maybe the variance reduction is worth it.
An example:
Suppose we want to measure the average wealth of people who started but didn’t finish college. One of those people is Bill Gates, and he will be an “outlier.” But Bill Gates is very important information. The average wealth of people who started but didn’t finish college has to include people like Bill Gates to be right. So deleting Bill Gates introduces bias.
Suppose we have a randomly sampled dataset of 1,000 people who started but didn’t finish college, and, by chance, Bill Gates ends up in the dataset. Our estimate will look wild because people like Bill Gates make up far less than 0.1% of those who don’t finish college, yet he gets a full 1/1000 of the weight in our sample mean. So the estimator will have a lot of variability unless we commit to ignoring the Bill Gateses of the world. But it’s also clear that if we, say, always dropped observations with net worth above $10 million, the estimated mean would be asymptotically biased.
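Here’s a rough Monte Carlo sketch of that claim. The log-normal wealth distribution and the $10 million cutoff are stand-ins I made up, not real data; the point is the pattern, not the numbers.

```python
import numpy as np

rng = np.random.default_rng(42)
reps, n = 5000, 1000

def wealth(n):
    # Toy heavy-tailed wealth distribution (log-normal); purely illustrative.
    return np.exp(rng.normal(11.5, 1.5, n))    # median ~ $100k, long right tail

true_mean = np.exp(11.5 + 1.5**2 / 2)          # population mean of the log-normal

raw, trimmed = np.empty(reps), np.empty(reps)
for r in range(reps):
    w = wealth(n)
    raw[r] = w.mean()                          # unbiased, high variance
    trimmed[r] = w[w <= 10_000_000].mean()     # drop the "Bill Gateses"

print(f"true mean       : {true_mean:12,.0f}")
print(f"raw     mean/sd : {raw.mean():12,.0f} {raw.std():12,.0f}")
print(f"trimmed mean/sd : {trimmed.mean():12,.0f} {trimmed.std():12,.0f}")
```

The trimmed estimator is far more stable across samples, but it settles below the true population mean: variance down, bias up.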
The point of this post is to clear up a common mistake: “This estimate looks weird, so we dropped outliers so that these weird observations don’t bias the results.”
We do not remove outliers to reduce bias. We remove outliers to increase bias and reduce variance.
Thanks for reading!
Zach
Connect at: https://linkedin.com/in/zlflynn
Take my Udemy course: Identifying Causal Effects for Data Scientists
If you want my help with any Experimentation, Analytics, etc. problem, click here.

