Interpreting a "statistically significant" result. Seven possible explanations.
Let’s consider a simple scenario. You compare cells incubated with a new drug to control cells and measure the activity of an enzyme. Your scientific hypothesis is that the drug will increase the activity of that enzyme. You run the analysis and indeed the enzyme activity is higher in the cells treated with the new drug. You run a t test to compute a two-tailed P value, and find that the P value is small enough for you to conclude that the result is statistically significant. Below are seven possible explanations for why this happened.
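As a concrete illustration, here is a minimal sketch of that comparison in Python using scipy.stats.ttest_ind. The enzyme-activity values and sample sizes are made up for illustration only:

```python
# A minimal sketch of the scenario above. The enzyme-activity values below are
# hypothetical and serve only to show how the two-tailed P value is computed.
from scipy import stats

control = [42.1, 39.8, 41.5, 40.2, 38.9, 41.0]   # untreated cells (made-up values)
treated = [47.3, 45.1, 48.0, 44.6, 46.2, 47.8]   # drug-treated cells (made-up values)

# Unpaired two-sample t test; SciPy reports a two-tailed P value by default.
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, two-tailed P = {p_value:.4f}")
```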
- Explanation 1: The drug worked. The drug induced the enzyme you are studying, so the enzyme’s activity was much higher in the treated cells than in the control cells. This is, of course, the conclusion everyone jumps to when they see the phrase “statistically significant”.
- Explanation 2: Trivial effect. If you used a large sample size and/or had little experimental error, you would obtain a small P value and correctly conclude that the difference is statistically significant even though the size of that difference is tiny and irrelevant in a scientific, clinical, or practical sense. A small P value does not imply there is a large treatment effect (a simulation sketch of this situation appears after this list).
- Explanation 3: Type I error or false discovery. The drug really did not affect enzyme activity, but random sampling just happened to give you higher values in the cells treated with the drug and lower values in the control cells. Accordingly, the P value was small. This is called making a Type I error. If the drug really has no effect on the enzyme activity (the null hypothesis is true), you choose the traditional 0.05 significance level (alpha = 5%), and you run many experiments, you’ll make a Type I error in 5% of them. It is tempting to conclude, therefore, that 5% of your “statistically significant” conclusions are wrong. This is incorrect. The False Discovery Rate (FDR) is the fraction of “statistically significant” conclusions that are wrong. The FDR does not equal alpha (usually 5%), and depends on the context of the research. In realistic research situations, when a P value is only a tiny bit less than 0.05, the false discovery rate is quite likely to be greater than 30%, and can even be greater than 80%. (A simulation sketch after this list shows how the FDR can far exceed alpha.)
- Explanation 4: Type S error. In this scenario, the cells treated with drug had higher enzyme activity than control cells, and this difference was large enough and consistent enough to be statistically significant. In fact, however, the drug on average decreases enzyme activity. You'd only know this if you repeated the experiment many times. Even though the drug actually decreases enzyme activity on average, in this particular experiment random chance gave you high enzyme activity in the drug-treated cells and low activity in the control cells. This would happen rarely, but is not impossible. You've correctly concluded that the null hypothesis is false and that the drug influences enzyme activity, but your conclusion is backwards: you concluded that the drug increases enzyme activity, when in fact on average the drug decreases enzyme activity. This is called making a Type S error, because the sign (plus or minus, increase or decrease) of the actual population effect is opposite to the result you happened to observe in this one experiment. It is also referred to as making a Type III error. (A simulation sketch of a Type S error appears after this list.)
- Explanation 5: Artifact due to poor experimental design. The enzyme activity really did go up in the tubes that you added drug to. But perhaps the reason for the increase in enzyme activity had nothing to do with the drug, but rather with the fact that the drug was dissolved in an acid (and the cells were poorly buffered) and the controls did not receive the acid (due to bad experimental design). In this scenario, the increase in enzyme activity was actually due to acidifying the cells, and had nothing to do with the drug itself. The statistical conclusion was correct – adding the drug did increase the enzyme activity – but the scientific conclusion was completely wrong. Statistical analyses are only a small part of good science. That is why it is so important to design experiments well, to randomize and blind when possible, to include necessary positive and negative controls, and to validate all methods.
- Explanation 6: Uninterpretable due to dynamic sample size. Say you first ran the experiment three times, and the result (n=3) was not statistically significant. Then you ran it three more times, and the pooled results (n=6) were still not statistically significant. Then you ran it four more times, and finally the results (with n=10) were statistically significant. The P value you obtain from this approach simply cannot be interpreted. P values can only be interpreted at face value when the sample size, the experimental protocol, and all data manipulations and analyses were planned in advance. Otherwise you are P-hacking, and the results cannot be interpreted. (A simulation after this list shows how much this stopping rule inflates the false positive rate.)
- Explanation 7: Uninterpretable due to ad hoc multiple comparisons. Say you actually ran this experiment many times, each time testing a different drug. The results were not statistically significant for the first 24 drugs tested, but the 25th drug achieved statistical significance. These results would be impossible to interpret, as you’d expect some small P values just by chance when you make many comparisons. If you know how many comparisons were made (or planned), you can correct for multiple comparisons. But here the design was ad hoc, so a rigorous interpretation is impossible. It is another form of P-hacking. (A quick calculation after this list shows why a stray small P value among 25 comparisons is unsurprising.)
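For Explanation 2, the sketch below uses made-up parameters (a true difference of only 0.5% of the control mean, with 100,000 measurements per group) to show how a huge sample size turns a scientifically trivial difference into a tiny P value:

```python
# Explanation 2 sketch: with an enormous sample size, a trivially small true
# difference (0.5% of the control mean, an assumed value) still gives a tiny P value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000
control = rng.normal(loc=100.0, scale=10.0, size=n)
treated = rng.normal(loc=100.5, scale=10.0, size=n)   # only 0.5% higher on average

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"Mean difference = {treated.mean() - control.mean():.2f} (trivial)")
print(f"Two-tailed P = {p_value:.1e} (tiny)")
```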
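For Explanation 3, here is a simulation sketch of the false discovery rate. All the parameters (the drug truly works in 10% of experiments, a 2 SD effect when it does, n = 3 per group) are assumptions chosen for illustration:

```python
# Explanation 3 sketch: the false discovery rate among "significant" results
# is not the same as alpha. Assume the drug truly works in only 10% of experiments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 20_000
n_per_group = 3
prior_real = 0.10          # fraction of experiments where the drug really works (assumed)
true_effect = 2.0          # effect size in SD units when the drug works (assumed)

false_discoveries = 0
true_discoveries = 0
for _ in range(n_experiments):
    drug_works = rng.random() < prior_real
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect if drug_works else 0.0, 1.0, n_per_group)
    if stats.ttest_ind(treated, control).pvalue < 0.05:
        if drug_works:
            true_discoveries += 1
        else:
            false_discoveries += 1

fdr = false_discoveries / (false_discoveries + true_discoveries)
print(f"False discovery rate among significant results: {fdr:.0%}")  # well above 5%
```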
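For Explanation 4, this sketch assumes the drug truly decreases activity by a small amount (0.2 SD) and that only 3 replicates are run per group; a noticeable fraction of the "significant" results then point in the wrong direction:

```python
# Explanation 4 sketch: the drug truly *decreases* activity slightly (assumed
# effect of -0.2 SD), but with n = 3 per group some experiments reach
# significance in the wrong direction (a Type S error).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments = 50_000
n_per_group = 3
true_effect = -0.2     # the drug really lowers activity slightly (assumed)

sig_increase = 0       # significant AND observed increase (sign error)
sig_decrease = 0       # significant AND observed decrease (correct sign)
for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    if stats.ttest_ind(treated, control).pvalue < 0.05:
        if treated.mean() > control.mean():
            sig_increase += 1
        else:
            sig_decrease += 1

print(f"Significant results with the wrong sign: "
      f"{sig_increase / (sig_increase + sig_decrease):.0%}")
```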
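For Explanation 6, this sketch simulates the n = 3, then n = 6, then n = 10 stopping rule under a true null hypothesis; the fraction of experiments declared "significant" ends up well above the nominal 5%:

```python
# Explanation 6 sketch: under the null hypothesis (no drug effect), testing at
# n = 3, again at n = 6, and again at n = 10, stopping as soon as P < 0.05,
# yields "significance" in more than 5% of experiments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_experiments = 20_000
looks = [3, 6, 10]          # cumulative sample size per group at each look

significant = 0
for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, max(looks))
    treated = rng.normal(0.0, 1.0, max(looks))   # the null is true: no effect
    for n in looks:
        if stats.ttest_ind(treated[:n], control[:n]).pvalue < 0.05:
            significant += 1
            break

print(f"Fraction declared 'significant' under the null: {significant / n_experiments:.1%}")
```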
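For Explanation 7, a quick calculation shows why 25 independent comparisons at alpha = 0.05 make a stray "significant" result unsurprising, and what a Bonferroni correction (one common fix, which assumes the comparisons were planned) would require:

```python
# Explanation 7 sketch: with 25 independent comparisons, each tested at
# alpha = 0.05, the chance of at least one false positive is about 72%.
alpha = 0.05
n_comparisons = 25

p_at_least_one = 1 - (1 - alpha) ** n_comparisons
bonferroni_threshold = alpha / n_comparisons

print(f"P(at least one false positive) = {p_at_least_one:.2f}")             # ~0.72
print(f"Bonferroni per-comparison threshold = {bonferroni_threshold:.4f}")  # 0.002
```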
Be cautious when interpreting small P values. Don't make the mistake of instantly believing explanation 1 above without also considering whether the true explanation is one of the other six listed above.
(Edited March 26, 2014 to properly explain a Type S error.)