Q&A: Normality tests
Why the term "normality"?
Because Gaussian distributions are also called Normal distributions.
Which normality test is best?
Prism (versions 4.01 and 4.0b and later) offers three normality tests as part of the Column Statistics analysis:
- D'Agostino-Pearson omnibus test. We recommend this test. It first computes the skewness and kurtosis to quantify how far the distribution is from Gaussian in terms of asymmetry and shape. It then calculates how far each of these values differs from the value expected with a Gaussian distribution, and computes a single P value from the sum of these discrepancies. It is a versatile and powerful normality test. Note that D'Agostino developed several normality tests; the one used by Prism is the "omnibus K2" test. (A code sketch of all three tests appears after this list.)
- Shapiro-Wilk test. This test works very well if every value is unique, but does not work well when there are ties. The basis of the test is hard for nonmathematicians to understand. For these reasons, we prefer the D'Agostino-Pearson test, even though the Shapiro-Wilk test works well in most cases.
- Kolmogorov-Smirnov test, with the Dallal-Wilkinson-Lilliefor corrected P value. It compares the cumulative distribution of the data with the expected cumulative Gaussian distribution, and bases its P value simply on the largest discrepancy. This is a very crude way of quantifying deviations from the Gaussian ideal, and it does not do a good job of discriminating whether or not your data were sampled from a Gaussian distribution. RB D'Agostino (1) says, "The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used." We agree. Note that versions of Prism prior to 4.01 and 4.0b inappropriately reported the Kolmogorov-Smirnov P value directly (without the Dallal-Wilkinson-Lilliefor correction).
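For readers who want to reproduce these calculations outside Prism, here is a minimal sketch using Python's scipy and statsmodels libraries (an assumption for illustration; Prism does not use these libraries, but they implement comparable tests):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=50)  # hypothetical sample

# D'Agostino-Pearson omnibus K2 test (combines skewness and kurtosis).
k2, p_dagostino = stats.normaltest(data)

# Shapiro-Wilk test.
w, p_shapiro = stats.shapiro(data)

# Kolmogorov-Smirnov test with the Lilliefors correction for an estimated
# mean and SD (analogous to the Dallal-Wilkinson-Lilliefor corrected P value).
d, p_lilliefors = lilliefors(data, dist='norm')

print(f"D'Agostino-Pearson  P = {p_dagostino:.3f}")
print(f"Shapiro-Wilk        P = {p_shapiro:.3f}")
print(f"Lilliefors (KS)     P = {p_lilliefors:.3f}")
```

Run on the same data set, the three tests will generally report different P values, which is the subject of the next question.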
Why do the different normality tests give different results?
All three tests ask how far a distribution deviates from the Gaussian ideal. Since the tests quantify those deviations in different ways, it isn't surprising that they give different results. The fundamental problem is that these tests do not ask which of two defined distributions (say, Gaussian vs. lognormal) better fits the data. Instead, they compare Gaussian vs. not Gaussian, which is a pretty vague comparison, and each test approaches it differently.
How many values are needed to compute a normality test?
The Kolmogorov-Smirnov test requires 5 or more values. The Shapiro-Wilk test requires 3 or more values. The D'Agostino test requires 8 or more values.
What question does the normality test answer?
The normality tests all report a P value. To understand any P value, you need to know the null hypothesis. In this case, the null hypothesis is that all the values were sampled from a Gaussian distribution. The P value answers the question:
If that null hypothesis were true, what is the chance that a random sample of data would deviate from the Gaussian ideal as much as these data do?
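A small simulation makes this concrete (a hypothetical sketch in Python, assuming scipy is available): when the null hypothesis is true, roughly 5% of random Gaussian samples will give a normality-test P value below 0.05 simply by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n = 10_000, 30
hits = 0
for _ in range(n_sims):
    sample = rng.normal(size=n)       # the null hypothesis is true here
    _, p = stats.normaltest(sample)   # D'Agostino-Pearson K2
    hits += p < 0.05
print(f"Fraction of Gaussian samples with P < 0.05: {hits / n_sims:.3f}")  # ~0.05
```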
So it tells me whether a data set is Gaussian?
No. A population has a distribution that may be Gaussian or not. A sample of data cannot itself be Gaussian or not Gaussian. That term can only apply to the entire population of values from which the data were sampled.
Are any data sets sampled from ideal Gaussian distributions?
Probably not. In almost all cases, we can be sure that the data were not sampled from an ideal Gaussian distribution. That is because an ideal Gaussian distribution includes some very low negative values and some extremely high positive values. Those values comprise a tiny fraction of the Gaussian population, but they are part of the distribution. When collecting data, there are constraints on the possible values. Pressures, concentrations, weights, enzyme activities, and many other variables cannot have negative values, so they cannot be sampled from perfect Gaussian distributions. Other variables can be negative, but have physical or physiological limits that rule out extremely large positive values or extremely low negative ones.
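For example, a hypothetical calculation (assuming scipy) shows that an ideal Gaussian distribution with mean 100 and SD 15 assigns a tiny, but nonzero, probability to negative values, so a variable that can never be negative cannot follow it exactly:

```python
from scipy import stats

# Hypothetical population: mean 100, SD 15 (e.g. an enzyme activity that
# cannot really be negative).
p_negative = stats.norm.cdf(0, loc=100, scale=15)
print(f"P(value < 0) = {p_negative:.1e}")  # tiny, but not zero
```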
But don't t tests, ANOVA, and regression assume Gaussian distributions?
Yes, but plenty of simulations have shown that these tests work well even when the population is only approximately Gaussian (especially when the sample sizes are equal, or nearly so).
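Here is a small simulation sketch of that robustness (assuming Python with scipy; this is an illustration, not one of the published studies): drawing both groups from the same skewed, non-Gaussian population still gives the two-sample t test a Type I error rate close to 5% when the sample sizes are equal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n = 10_000, 20
false_positives = 0
for _ in range(n_sims):
    a = rng.exponential(scale=1.0, size=n)  # both groups drawn from the
    b = rng.exponential(scale=1.0, size=n)  # same skewed population
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05
print(f"Type I error rate: {false_positives / n_sims:.3f}")  # close to 0.05
```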
So do the normality tests figure out whether the data are close enough to Gaussian to use one of those tests?
Not really. It is hard to define what "close enough" means, and the normality tests were not designed with this in mind.
What should I conclude if the P value from the normality test is high?
All you can say is that the data are not inconsistent with a Gaussian distribution. A normality test cannot prove the data were sampled from a Gaussian distribution. All the normality test can do is demonstrate that the deviation from the Gaussian ideal is not more than you'd expect to see by chance alone. With large data sets, this is reassuring. With smaller data sets, the normality tests don't have much power to detect modest deviations from the Gaussian ideal.
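The lack of power with small samples is easy to demonstrate with a simulation sketch (assuming scipy): small samples drawn from a clearly non-Gaussian lognormal population frequently "pass" a normality test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n = 10_000, 10
passed = 0
for _ in range(n_sims):
    sample = rng.lognormal(mean=0.0, sigma=0.8, size=n)  # clearly not Gaussian
    _, p = stats.shapiro(sample)
    passed += p > 0.05
print(f"Fraction of lognormal samples with P > 0.05: {passed / n_sims:.2f}")
```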
What should I conclude if the P value from the normality test is tiny?
The null hypothesis is that the data are sampled from a Gaussian distribution. If the P value is small enough, you reject that null hypothesis and so accept the alternative hypothesis that the data are not sampled from a Gaussian population. The distribution could be close to Gaussian (with large data sets) or very far from it. The normality test tells you nothing about the alternative distributions.
If your P value is small enough to declare the deviations from the Gaussian ideal to be "statistically significant", you then have four choices:
- The data may come from another identifiable distribution. If so, you may be able to transform your values to create a Gaussian distribution. For example, if the data come from a lognormal distribution, transform all values to their logarithms (see the sketch after this list).
- The presence of one or a few outliers might be causing the normality test to fail. Run an outlier test.
- If the departure from normality is small, you may choose to do nothing. Statistical tests tend to be quite robust to mild violations of the Gaussian assumption.
- Switch to nonparametric tests that don’t assume a Gaussian distribution.
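For the first choice, here is a minimal sketch (assuming scipy and numpy) of transforming lognormal data to logarithms and re-running the normality test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
raw = rng.lognormal(mean=2.0, sigma=0.5, size=100)  # hypothetical lognormal data

# D'Agostino-Pearson test on the raw values and on their logarithms.
_, p_raw = stats.normaltest(raw)
_, p_log = stats.normaltest(np.log(raw))

print(f"P value, raw values:             {p_raw:.4f}")  # usually small
print(f"P value, log-transformed values: {p_log:.4f}")  # usually large
```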
What if my values can take on just a few possible values? Is a normality test helpful?
No. If your data can take on only a few possible values, then you know for sure that the data are not sampled from a Gaussian population. So no normality test will be useful.
Isn't the whole point of a normality test to decide when to use nonparametric tests?
No. Deciding whether to use a parametric or nonparametric test is a hard decision that should not be automated based on a normality test.
So how useful are normality tests?
Not very. Normality tests are less useful than some people guess. With small samples, the normality tests don't have much power to detect non-Gaussian distributions. With large samples, it doesn't matter so much if the data are non-Gaussian, since the t tests and ANOVA are fairly robust to violations of this assumption.
What you would want is a test that tells you whether the deviations from the Gaussian ideal are severe enough to invalidate statistical methods that assume a Gaussian distribution. But normality tests don't do this.
References
1. RB D'Agostino, "Tests for Normal Distribution," in Goodness-of-Fit Techniques, edited by RB D'Agostino and MA Stephens, Marcel Dekker, 1986.
Parts of this page are excerpted from Chapter 24 of Intuitive Biostatistics.