In statistics, the Gaussian, or normal, distribution is used to characterize complex systems with many factors. As described in Stephen Stigler’s The History of Statistics, Abraham de Moivre discovered the distribution that now bears Carl Friedrich Gauss’s name. Gauss’s contribution lay in applying the distribution to the method of least squares, which minimizes error when fitting a line of best fit to data. He thus made it the most important error distribution in statistics.
What is the distribution of a sample of data? What if you don’t know the data’s underlying distribution? Is there any way to test hypotheses about the data without knowing the underlying distribution? Thanks to the Central Limit Theorem, the answer is yes.
Statement of the Theorem
The Central Limit Theorem states that the mean of a random sample of independent observations from an infinite population with finite variance is approximately normal, or Gaussian, with the same mean as the underlying population and with variance equal to the population variance divided by the sample size. The approximation improves as the sample size grows.
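The approximation statement is sometimes misstated as a conclusion about the sample mean converging to a normal distribution. Because the approximating normal distribution itself changes as the sample size increases (its variance shrinks), such a statement is misleading. It is the standardized sample mean that converges to a fixed distribution, the standard normal.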
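The statement can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the exponential population (mean 1, variance 1), the sample sizes, the repetition count and the helper name `sample_means` are all choices made here, not part of the theorem.

```python
import random
import statistics

def sample_means(n, reps, rng):
    """Return `reps` sample means, each computed from n exponential(1) draws."""
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

rng = random.Random(0)          # fixed seed so the run is repeatable
POP_MEAN, POP_VAR = 1.0, 1.0    # exponential(1) has mean 1 and variance 1

results = {}
for n in (5, 50, 200):          # assumed sample sizes, chosen for illustration
    means = sample_means(n, reps=4000, rng=rng)
    results[n] = (statistics.fmean(means), statistics.variance(means))
    # Compare the observed variance of the sample means with POP_VAR / n.
    print(n, round(results[n][0], 3), round(results[n][1], 5), POP_VAR / n)
```

The average of the sample means should sit near the population mean for every sample size, while their variance should track the population variance divided by the sample size. That shrinking variance is why the approximating normal distribution is different for each n.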
The theorem was developed by Pierre-Simon Laplace.
Why It's Everywhere
Normal distributions are omnipresent, and the reason comes from the Central Limit Theorem. Often, a measured value is the combined effect of many independent variables, so the measurement itself behaves like a sample mean. For example, a distribution of athletes’ performances may be bell-shaped as a result of differences in diet, training, genetics, coaching and psychology. Even men’s heights are approximately normally distributed, being a function of many biological factors.
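As a rough illustration of this idea, the sketch below builds each "measurement" as the sum of many small independent effects and checks how much of the data falls within one and two standard deviations of the mean. The number of factors, the uniform range of each effect and the number of individuals are hypothetical values picked for the example.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
N_FACTORS = 30           # hypothetical number of independent influences
N_PEOPLE = 20000         # hypothetical number of measured individuals

# Each measured value is the sum of many small, independent uniform effects.
values = [sum(rng.uniform(-1.0, 1.0) for _ in range(N_FACTORS))
          for _ in range(N_PEOPLE)]

mu = statistics.fmean(values)
sd = statistics.pstdev(values)
within_1sd = sum(abs(v - mu) <= sd for v in values) / N_PEOPLE
within_2sd = sum(abs(v - mu) <= 2 * sd for v in values) / N_PEOPLE

# For an exact normal curve these fractions are about 0.683 and 0.954.
print(round(within_1sd, 3), round(within_2sd, 3))
```

None of the individual effects is normal, yet the fractions should land close to the values a normal curve would give.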
A so-called “copula function” based on the Gaussian distribution was in the news in 2009 because of its use in assessing the risk of investing in collateralized debt obligations. The misuse of the function was instrumental in the financial crisis of 2008-2009. Although the crisis had many causes, in hindsight Gaussian distributions likely should not have been used. A distribution with thicker tails would have assigned greater probability to adverse events.
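To give a sense of how much the choice of tail can matter, the sketch below compares the probability of an extreme outcome under a standard normal distribution with the same probability under a Student t distribution with 3 degrees of freedom, rescaled to variance 1. This is a one-dimensional illustration only, not a copula model. The t distribution, the thresholds and the function names are assumptions made for the example.

```python
import math
from statistics import NormalDist

def normal_tail(x):
    """P(Z > x) for a standard normal variable (by symmetry, cdf(-x))."""
    return NormalDist().cdf(-x)

def heavy_tail(x):
    """P(X > x) where X = T / sqrt(3) and T is Student t with 3 degrees of
    freedom, so X has variance 1. Uses the closed-form t3 CDF."""
    return 0.5 - (x / (1.0 + x * x) + math.atan(x)) / math.pi

for x in (3, 4, 5):  # thresholds in standard deviations, chosen for illustration
    ratio = heavy_tail(x) / normal_tail(x)
    print(x, normal_tail(x), heavy_tail(x), round(ratio))
```

Both distributions have the same variance, but far from the center the heavy-tailed one gives extreme outcomes a much larger probability, and the gap widens as the threshold moves out.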
The Central Limit Theorem can be proven in many lines by analyzing the moment generating function (mgf) of (sample mean - population mean)/√(population variance / sample size) as a function of the mgf of the underlying population. The approximation part of the theorem is introduced by expanding the underlying population’s mgf as a power series, then showing most terms are insignificant as the sample size gets large.
It can be proven in far fewer lines by applying a Taylor expansion to the characteristic function of the same standardized quantity and letting the sample size grow large.
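The core of the mgf argument can be sketched in a few lines. This is an outline rather than a full proof. It assumes the observations X_1, ..., X_n are independent with mean \mu and variance \sigma^2, and that the population's mgf exists near zero. The symbols Z_n, Y_i and M are notation introduced here.

```latex
% Assumptions: X_1, ..., X_n independent, mean \mu, variance \sigma^2,
% and an mgf that exists in a neighborhood of 0.
\begin{align*}
Z_n &= \frac{\bar{X}_n - \mu}{\sqrt{\sigma^2 / n}}
     = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i,
     \qquad Y_i = \frac{X_i - \mu}{\sigma}, \\
M_{Z_n}(t) &= \left[ M_Y\!\left( \frac{t}{\sqrt{n}} \right) \right]^{n}, \\
M_Y(s) &= 1 + \frac{s^2}{2} + o(s^2)
     \quad \text{since } E[Y] = 0,\ E[Y^2] = 1, \\
M_{Z_n}(t) &= \left[ 1 + \frac{t^2}{2n} + o\!\left( \frac{1}{n} \right) \right]^{n}
     \longrightarrow e^{t^2/2} \quad (n \to \infty).
\end{align*}
```

The limit e^{t^2/2} is the mgf of the standard normal distribution. The characteristic-function version follows the same steps with e^{-t^2/2} as the limit, and it needs only a finite variance rather than an mgf.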
Some statistical models presume the errors to be Gaussian. This enables distributions of functions of normal variables, such as the chi-square and F distributions, to be used in hypothesis testing. Specifically, in the F-test, the F statistic is a ratio of two chi-square variables, each divided by its degrees of freedom, and each depending on the variance parameter of the normal errors. Taking the ratio causes the variance to cancel, enabling hypothesis testing without knowing the value of the variance, provided the errors are normal and share a constant variance.
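The cancellation can be seen numerically. The sketch below forms the ratio of two sample variances from normal data, once with error scale 1 and once with error scale 10, using the same underlying standard normal draws. The sample sizes, the two scales, the repetition count and the helper name `f_ratio` are assumptions made for illustration.

```python
import random
import statistics

def f_ratio(z1, z2, sigma):
    """Ratio of sample variances for two samples that share the scale sigma."""
    s1 = statistics.variance([sigma * z for z in z1])
    s2 = statistics.variance([sigma * z for z in z2])
    return s1 / s2

rng = random.Random(2)          # fixed seed so the run is repeatable
N1, N2, REPS = 11, 11, 5000     # assumed sample sizes and repetition count

ratios_small, ratios_large = [], []
for _ in range(REPS):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(N1)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(N2)]
    ratios_small.append(f_ratio(z1, z2, sigma=1.0))
    ratios_large.append(f_ratio(z1, z2, sigma=10.0))

# Largest relative difference between the two sets of ratios.
max_gap = max(abs(a - b) / a for a, b in zip(ratios_small, ratios_large))
d2 = N2 - 1                     # denominator degrees of freedom
print(max_gap)
# Simulated mean of the ratio next to the F-distribution mean d2 / (d2 - 2).
print(statistics.fmean(ratios_small), d2 / (d2 - 2))
```

The ratios agree up to floating-point rounding whatever the common scale, which is why the F statistic can be compared with a fixed reference distribution even when the variance itself is unknown.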
- The Formula That Killed Wall Street (on Gaussian Copulas)
- John Freund; Mathematical Statistics; 1992 (Proof of CLT)