Why are so many quantities we measure in nature approximately normally distributed? The central limit theorem (CLT), a key tenet of probability theory, says that the average or sum of a large number of independent and identically distributed random variables with finite variance will be approximately normally distributed, whether or not the underlying variables are themselves normal. Many outcomes we measure (someone’s height, their math aptitude, the temperature in New Orleans on a summer day) represent the sum of many independent factors; the CLT is a mathematical explanation of why these quantities follow a roughly bell-shaped distribution.
The CLT also provides justification for null hypothesis testing of mean and mean difference values. It tells us that no matter what the underlying distribution of the quantity we’re measuring, the distribution of means will look normal, so long as we take a large enough sample.
Understanding the central limit theorem
Here’s an easy-to-understand definition of the central limit theorem:
The distribution of an average tends to be normal, even when the distribution from which the average is computed is decidedly non-normal.
Let’s dig into this definition a little.
The CLT is usually presented as applying to the sampling distribution of means, as here. Applied to a sampling distribution of means, it justifies null hypothesis testing of means and differences of means using the normal distribution. But it applies to sums as well, and that’s the form to keep in mind when thinking about the assumption of normally distributed errors in linear regression: the error terms can be viewed as sums of many independent shocks to the hypothesized underlying linear relationship.
This definition notes that even when your underlying random variables are extremely non-normal, the distribution of the mean (and sum) of those variables tends toward normality as the sample size gets large. What if the quantities you’re averaging or summing actually are normal? If they are exactly normal, the distribution of the average or sum will be exactly normal as well (Ott & Longnecker, 2001).
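To make the tendency toward normality concrete, here is a minimal simulation sketch. It assumes an exponential distribution as the "decidedly non-normal" starting point and uses skewness (0 for a symmetric bell curve) as a rough measure of non-normality; the helper names `skewness` and `sample_means` are illustrative, not taken from any library.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

def sample_means(n, reps, rng):
    """Draw `reps` samples of size n from an exponential distribution; return their means."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
skew_by_n = {n: skewness(sample_means(n, 5000, rng)) for n in (1, 5, 50)}
for n, skew in skew_by_n.items():
    print(f"n={n:>2}  skewness of sample means = {skew:.2f}")
```

The skewness of the sample means should shrink toward 0 as n grows, which is the bell shape emerging from a strongly right-skewed starting distribution.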
But the definition leaves out two important requirements of the CLT. First, you need to average a sufficiently large number of sample points. Second, the underlying random variables need to be “i.i.d.”: independent and identically distributed.
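How many independent variables do you need before the CLT kicks in? Statisticians typically use n=30 as a rule of thumb, though the sample size actually needed depends on how far the underlying distribution is from normal: strongly skewed or heavy-tailed data can need many more. Below that, other distributions, such as Student’s t, are commonly used to model the sampling distribution of the mean when the data are roughly normal and the variance is estimated from the sample.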
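One way to see how the quality of the normal approximation depends on n is to check a tail probability. The sketch below assumes exponential data (mean 1, standard deviation 1) and estimates how often the standardized sample mean exceeds 1.645, which happens 5% of the time under an exact normal distribution. The function name `upper_tail_rate` and the chosen sample sizes are illustrative assumptions.

```python
import math
import random

def upper_tail_rate(n, reps, rng, z_cut=1.645):
    """Fraction of standardized exponential sample means that exceed z_cut."""
    hits = 0
    for _ in range(reps):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        z = (mean - 1.0) * math.sqrt(n)  # exponential(1) has mean 1 and sd 1
        if z > z_cut:
            hits += 1
    return hits / reps

rng = random.Random(1)  # fixed seed so the run is repeatable
tail = {n: upper_tail_rate(n, 10000, rng) for n in (2, 30, 200)}
for n, rate in tail.items():
    print(f"n={n:>3}  estimated P(z > 1.645) = {rate:.3f}  (normal value: 0.050)")
```

For skewed data like this, the estimated tail rate should sit above 0.05 at small n and drift toward it as n increases, which suggests that any fixed cutoff is a convenience rather than a sharp threshold.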
What about the requirement that the underlying random variables be independent and identically distributed? The Lindeberg and Lyapunov versions of the CLT relax the “identically distributed” part: the variables need only be independent, with finite means and variances, provided an additional condition holds that keeps any single variable from dominating the sum. The independence requirement is important, however. You can’t do without that.
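The role of independence can be illustrated with a small sketch. It assumes a hypothetical setup in which every observation in a sample contains the same shared random shock, so the observations are dependent, and compares it with a setup where each observation gets its own shock. The function names are illustrative only.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

def means_with_shared_shock(n, reps, rng):
    """Each sample's n observations all include the SAME exponential shock (dependent)."""
    means = []
    for _ in range(reps):
        shock = rng.expovariate(1.0)  # one draw shared by every observation in the sample
        means.append(shock + statistics.fmean(rng.expovariate(1.0) for _ in range(n)))
    return means

def means_with_own_shocks(n, reps, rng):
    """Each observation gets its own exponential shock (independent)."""
    return [
        statistics.fmean(rng.expovariate(1.0) + rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]

rng = random.Random(3)  # fixed seed so the run is repeatable
skew_dependent = skewness(means_with_shared_shock(50, 5000, rng))
skew_independent = skewness(means_with_own_shocks(50, 5000, rng))
print(f"shared shock (dependent): skewness = {skew_dependent:.2f}")
print(f"own shocks (independent): skewness = {skew_independent:.2f}")
```

With a shared shock, averaging cannot wash out the common component, so the sample means stay about as skewed as the shock itself no matter how large n is; with independent shocks the skewness shrinks as usual.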
The mathematical version
For the symbolically inclined, here’s a mathematical definition of the CLT:
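Let $X_1, X_2, \ldots$ be independent, identically distributed random variables having mean $\mu$ and finite nonzero variance $\sigma^2$.
Let $S_n = X_1 + \cdots + X_n$. Then
$$\lim_{n \to \infty} P\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le x\right) = \Phi(x),$$
where $\Phi(x)$ is the standard normal cumulative distribution function.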
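The statement can be checked numerically by comparing the empirical distribution of the standardized sum, (S_n - nμ)/(σ√n), with Φ(x). This is a rough sketch that assumes Uniform(0, 1) variables (μ = 1/2, σ² = 1/12) and n = 30; both choices and the name `standardized_sum` are illustrative. `NormalDist().cdf` from Python's standard library plays the role of Φ.

```python
import math
import random
from statistics import NormalDist

def standardized_sum(n, rng):
    """Return (S_n - n*mu) / (sigma * sqrt(n)) for n Uniform(0, 1) draws."""
    mu, sigma = 0.5, math.sqrt(1 / 12)
    s_n = sum(rng.random() for _ in range(n))
    return (s_n - n * mu) / (sigma * math.sqrt(n))

rng = random.Random(2)  # fixed seed so the run is repeatable
draws = [standardized_sum(30, rng) for _ in range(20000)]
phi = NormalDist().cdf  # standard normal CDF

gaps = {}
for x in (-1.0, 0.0, 1.0):
    empirical = sum(d <= x for d in draws) / len(draws)
    gaps[x] = abs(empirical - phi(x))
    print(f"x={x:+.1f}  empirical={empirical:.3f}  Phi(x)={phi(x):.3f}")
```

The empirical proportions should land close to Φ(x) at each x, with the remaining gap reflecting both simulation noise and the finite value of n.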
Ott, L., & Longnecker, M. (2001). An Introduction to Statistical Methods and Data Analysis (5th ed.). Australia: Duxbury.