
The normal distribution


The normal distribution, or bell curve, is probably the most important probability distribution in statistics. Many quantities we observe are roughly normally distributed; the central limit theorem provides a mathematical explanation for this.

The probability density function is given by: $f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.
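As a rough illustration, the density formula above can be transcribed almost term by term into Python. The function name `normal_pdf` and its default arguments are assumptions made for this sketch; the standard library's `statistics.NormalDist` is used only as a cross-check.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution with mean mu and SD sigma, evaluated at x."""
    variance = sigma ** 2
    exponent = -((x - mu) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


# Height of the standard normal curve at its center.
peak = normal_pdf(0.0)

# The same value from the standard library, as a sanity check.
reference = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)

print(peak, reference)
```

The density is highest at the mean and falls off symmetrically on either side, which is what gives the curve its bell shape.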

About 68% (a little more than two-thirds) of the probability mass of a normal distribution lies within one standard deviation of the mean. Within two standard deviations you get about 95% of the values, and within three about 99.7%. This fact about the normal distribution is sometimes known as the empirical rule, the 68-95-99.7 rule, or the three-sigma rule.

Figure source: Petter Strandmark via Wikimedia Commons
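The percentages in the empirical rule can be checked numerically. For any normal distribution, the probability of falling within k standard deviations of the mean equals erf(k/√2). The helper name `coverage_within` is an assumption made for this sketch.

```python
import math


def coverage_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


coverage = {k: coverage_within(k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} sd: {p:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```

The result does not depend on the particular mean or standard deviation, because the rule is stated in units of standard deviations.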

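The normal distribution may be used to approximate a variety of other distributions, including the binomial, Poisson, chi-squared, and Student’s t, under certain circumstances (typically a large number of trials, a large mean, or many degrees of freedom). For the binomial, Poisson, and chi-squared this is a consequence of the central limit theorem, because each can be written as a sum of independent terms; Student’s t converges to the normal as its degrees of freedom increase.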
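To make the binomial case concrete, here is a small sketch comparing an exact binomial probability with its normal approximation. The values n = 100, p = 0.5, and k = 55 are arbitrary choices for illustration, and the half-unit continuity correction is a common refinement that the text above does not mention.

```python
import math
from statistics import NormalDist


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k: int, n: int, p: float) -> float:
    """Normal approximation to P(X <= k), with a continuity correction of 0.5."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)


n, p, k = 100, 0.5, 55  # illustrative values only
exact = binomial_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

With small n, or with p close to 0 or 1, the two numbers drift apart, which is the sense of "under certain circumstances".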

Skewness and kurtosis

The normal distribution has skewness of zero, meaning that it is exactly symmetric around its center, which is where the mean, median, and mode coincide. Its excess kurtosis (kurtosis minus 3, the form many statistical packages report) is also zero: it is neither overly peaked with heavy tails (positive kurtosis, leptokurtic) nor overly flat with light tails (negative kurtosis, platykurtic). The t-distribution has positive kurtosis: it has more density in the tails and less at the center, representing greater uncertainty.
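For reference, here is a sketch of the standard moment definitions behind these two terms, with $\mu$ the mean and $\sigma$ the standard deviation. The symbols $\gamma_1$, $\gamma_2$, and $\nu$ (degrees of freedom) are notation introduced here rather than taken from the text. The kurtosis shown is the excess form, which is the convention under which the normal distribution scores zero:

$\gamma_1=\frac{E[(X-\mu)^3]}{\sigma^3}$

$\gamma_2=\frac{E[(X-\mu)^4]}{\sigma^4}-3$

For a normal distribution $E[(X-\mu)^4]=3\sigma^4$, so $\gamma_2=0$. For a t-distribution with $\nu>4$ degrees of freedom, $\gamma_2=\frac{6}{\nu-4}$, which is positive and shrinks toward zero as $\nu$ grows.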

The normality assumption

Normality is an assumption of many statistical methods such as the t-test, which is used for testing means or differences between means when you have samples so small that the central limit theorem hasn’t kicked in. Gliner & Morgan (2000) suggest that a violation of the normality assumption is not reason to choose a nonparametric method unless the violation is very severe.

Linear regression assumes normally distributed errors, but Gelman & Hill (2007) say this is generally the least important assumption of regression modeling and they do not recommend checking residuals for normality. You would, of course, want to check residuals for other reasons; for example, to check the assumption of a linear relationship between the dependent and independent variables.

Checking univariate normality

If you do want to check univariate normality, there are a variety of ways to do it. You can check histograms or box plots for evidence of skewness. Q-Q plots and P-P plots graph the quantiles or cumulative probabilities of a theoretical distribution (the normal, in this case) against those of the observed distribution. If the points fall roughly on a straight line, your observed distribution is roughly normal.
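The idea behind a Q-Q plot can be sketched without any plotting library: pair each sorted observation with the matching quantile of the standard normal, then see how close the pairs come to a straight line. The function names, the simulated samples, and the plotting positions (i − 0.5)/n are assumptions made for this illustration; statistical packages use several slightly different conventions for the positions.

```python
import math
import random
from statistics import NormalDist, mean


def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(sample)
    std_normal = NormalDist()
    observed = sorted(sample)
    # Plotting positions (i - 0.5) / n stay strictly between 0 and 1.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


def pearson_r(pairs):
    """Pearson correlation of the pairs, used here as a rough measure of straightness."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
normal_sample = [rng.gauss(10, 2) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = pearson_r(qq_points(normal_sample))
r_skewed = pearson_r(qq_points(skewed_sample))

print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

A correlation very close to 1 corresponds to points hugging a straight line. The skewed sample bends away from the line and gives a visibly lower value. In practice you would still look at the plot itself, since the shape of the departure (curved, S-shaped, a few stray points) tells you what kind of non-normality is present.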

Most statistical programs will output the skewness and kurtosis for each variable along with a standard error. If the skewness or kurtosis is more than two standard errors from zero, you can conclude that there is significant skew or kurtosis, and hence a departure from normality.
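The two-standard-error check can be sketched as follows. This version uses simple moment estimates and the large-sample approximations √(6/n) and √(24/n) for the standard errors of skewness and excess kurtosis under normality. Both choices are assumptions made for illustration; packages typically apply finite-sample formulas, so their output will differ slightly, especially for small n.

```python
import math
import random


def skew_kurtosis_check(sample):
    """Moment-based skewness and excess kurtosis, each divided by an approximate SE."""
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((x - m) ** 2 for x in sample) / n
    m3 = sum((x - m) ** 3 for x in sample) / n
    m4 = sum((x - m) ** 4 for x in sample) / n

    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3

    # Large-sample approximations, valid when the data really are normal.
    se_skew = math.sqrt(6 / n)
    se_kurtosis = math.sqrt(24 / n)

    return {
        "skew": skew,
        "excess_kurtosis": excess_kurtosis,
        "skew_z": skew / se_skew,
        "kurtosis_z": excess_kurtosis / se_kurtosis,
    }


rng = random.Random(1)  # fixed seed so the sketch is reproducible
skewed_sample = [rng.expovariate(1.0) for _ in range(1000)]
result = skew_kurtosis_check(skewed_sample)

for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

A `skew_z` or `kurtosis_z` beyond about ±2 corresponds to the "more than two standard errors from zero" rule of thumb.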

There are also statistical tests of univariate normality: the Kolmogorov-Smirnov and Shapiro-Wilk are two that can be output by SPSS and SAS. With large samples these tests are sensitive to even trivial deviations from normality, so most statisticians recommend using them only alongside other methods of checking univariate normality, such as inspection of histograms.
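To show what the Kolmogorov-Smirnov test measures, here is a minimal sketch of its statistic: the largest vertical gap between the sample's empirical distribution function and the hypothesized normal curve. It assumes the hypothesized mean and standard deviation are fixed in advance. If they are estimated from the same data, the critical value used below (the usual large-sample 5% value of about 1.36/√n) is too lenient and a correction such as Lilliefors' is needed. The function name and the simulated samples are assumptions made for this example, not output from SPSS or SAS.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(sample, dist):
    """Largest gap between the empirical CDF of the sample and dist's CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 200
hypothesized = NormalDist(0, 1)

normal_sample = [rng.gauss(0, 1) for _ in range(n)]
# Shifted exponential: mean 0 and SD 1 like the hypothesis, but strongly skewed.
skewed_sample = [rng.expovariate(1.0) - 1.0 for _ in range(n)]

critical_5pct = 1.36 / math.sqrt(n)  # approximate, fully specified null only
d_normal = ks_statistic(normal_sample, hypothesized)
d_skewed = ks_statistic(skewed_sample, hypothesized)

print(f"critical value = {critical_5pct:.3f}")
print(f"D for normal sample = {d_normal:.3f}")
print(f"D for skewed sample = {d_skewed:.3f}")
```

Because the critical value shrinks as n grows, very large samples will reject normality for gaps that are far too small to matter in practice, which is one reason to pair the test with a histogram or Q-Q plot.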

For more advanced statistical methods such as structural equation modeling, you need to check multivariate normality. But that’s a topic for multivariate methods, not intro stats.

References

Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical methods for social research. Cambridge: Cambridge University Press.

Gliner, J. A., & Morgan, G. A. (2000). Research Methods in Applied Settings: An Integrated Approach to Design and Analysis. Mahwah, NJ: Lawrence Erlbaum.