Statistician: "Oh, so you have already calculated the P value?"
Surgeon: "Yes, I used multinomial logistic regression."
Statistician: "Really? How did you come up with that?"
Surgeon: "Well, I tried each analysis on the SPSS drop-down menus, and that was the one that gave the smallest P value".
For statistical analyses to be interpretable at face value, it is essential that these three statements be true:
•All analyses were planned.
•All planned analyses were conducted exactly as planned and then reported.
•All analyses are taken into account when interpreting the results.
These simple and sensible rules are commonly violated in many ways as explained below.
Before the data are analyzed, several decisions must be made: Which values should be deleted because they are so high or so low that they must be mistakes? Should the data be normalized, and if so, how? Should the data be transformed, and if so, how? Ideally, these decisions are part of the planned protocol, made before anyone sees how they affect the results.
To properly interpret a P value, the experimental protocol has to be set in advance. Usually this means choosing a sample size, collecting data, and then analyzing it.
But what if the results aren’t quite statistically significant? It is tempting to run the experiment a few more times (or add a few more subjects), and then analyze the data again, with the larger sample size. If the results still aren’t “significant”, then do the experiment a few more times (or add more subjects) and reanalyze once again.
When data are analyzed in this way, it is impossible to interpret the results. This informal sequential approach should not be used.
If the null hypothesis of no difference is in fact true, the chance of obtaining a “statistically significant” result using that informal sequential approach is far higher than 5%. In fact, if you carry on that approach long enough, then every single experiment will eventually reach a “significant” conclusion, even if the null hypothesis is true. Of course, “long enough” might be very long indeed and exceed your budget or even your lifespan.
The problem is that the experiment continues when the result is not “significant”, but stops when the result is “significant”. If the experiment was continued after reaching “significance”, adding more data might then result in a “not significant” conclusion. But you’d never know this, because the experiment would have been terminated once “significance” was reached. If you keep running the experiment when you don’t like the results, but stop the experiment when you like the results, the results are impossible to interpret.
Statisticians have developed rigorous methods to handle sequential data analysis. These methods use much more stringent criteria for declaring “significance” to account for the repeated analyses. Without these special methods, you can’t interpret the results unless the sample size was set in advance.
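To see how large the inflation can be, here is a minimal simulation (an illustration added here, using NumPy and SciPy; the sample sizes and stopping rule are arbitrary assumptions). Both groups are sampled from identical populations, so every “significant” result is a false positive, yet the informal stop-when-significant rule produces far more than 5% of them.
```python
# Informal sequential testing: test, and if the result is not "significant",
# add more subjects and test again. Both groups come from the same population,
# so the null hypothesis is true and any "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sequential_experiment(start_n=10, add_n=10, max_n=200, alpha=0.05):
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            return True       # stopped as soon as "significance" was reached
        if len(a) >= max_n:
            return False      # gave up (budget exhausted)
        a.extend(rng.normal(size=add_n))
        b.extend(rng.normal(size=add_n))

n_sim = 2000
false_positives = sum(sequential_experiment() for _ in range(n_sim))
print(f"False-positive rate: {false_positives / n_sim:.1%}")  # well above 5%
```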
Analyzing multiple subgroups of data is a form of multiple comparisons. When you ask whether a treatment works in each of several subgroups, you are making many comparisons, and it is easy to be fooled.
A simulated study by Lee and coworkers illustrates the problem. They pretended to compare survival following two “treatments” for coronary artery disease. They studied a group of real patients with coronary artery disease whom they randomly divided into two groups. In a real study, they would have given the two groups different treatments and compared survival. In this simulated study, they treated the subjects identically but analyzed the data as if the two random groups had actually received two distinct treatments. As expected, the survival of the two groups was indistinguishable (2).
They then divided the patients into six groups depending on whether they had disease in one, two, or three coronary arteries, and depending on whether the heart ventricle contracted normally or not. Since these are variables that are expected to affect survival of the patients, it made sense to evaluate the response to “treatment” separately in each of the six subgroups. Whereas they found no substantial difference in five of the subgroups, they found a striking result among the sickest patients. The patients with three-vessel disease who also had impaired ventricular contraction had much better survival under treatment B than treatment A. The difference between the two survival curves was statistically significant with a P value less than 0.025.
If this were an actual study, it would be tempting to conclude that treatment B is superior for the sickest patients, and to recommend treatment B to those patients in the future. But this was not a real study, and the two “treatments” reflected only random assignment of patients. The two treatments were identical, so the observed difference was absolutely positively due to chance.
It is not surprising that the authors found one low P value out of six comparisons. There is a 26% chance that at least one of six independent comparisons will have a P value less than 0.05, even if all six null hypotheses are true.
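The 26% figure is simply the chance that at least one of six independent comparisons reaches P < 0.05 when every null hypothesis is true, as this short calculation (added here for illustration) shows:
```python
# Chance that at least one of six independent comparisons gives P < 0.05
# when all six null hypotheses are true.
alpha = 0.05
k = 6
print(f"{1 - (1 - alpha) ** k:.1%}")  # 26.5%
```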
If all the subgroup comparisons are defined in advance, it is possible to correct for the multiple comparisons – either as part of the analysis or informally while interpreting the results. But when this kind of subgroup analysis is not defined in advance, it becomes a form of “data torture”.
In 2000, the Intergovernmental Panel on Climate Change made predictions about future climate. Pielke asked what seemed like a straightforward question: How accurate were those predictions over the next seven years? Seven years is not long enough to seriously assess predictions of global warming, but checking them is a necessary first step. Answering the question proved to be impossible. The problem is that the report contained numerous predictions and did not specify which sources of climate data should be used to test them. Did the predictions come true? The answer depends on which prediction you choose to test and which data set you test it against -- “a feast for cherry pickers” (3).
You can only evaluate the accuracy of a prediction or diagnosis when it unambiguously states what is being predicted and when it will happen.
When comparing two groups, the groups must be defined as part of the study design. If the groups are defined by the data, many comparisons are being made implicitly, and the results cannot be interpreted at face value.
Austin and Goldwasser demonstrated this problem (4). They looked at the incidence of hospitalization for heart failure in Ontario (Canada) in twelve groups of patients defined by their astrological sign (based on their birth date). People born under the sign of Pisces happened to have the highest incidence of heart failure. The investigators then ran a simple statistical test comparing the incidence of heart failure among people born under Pisces with the incidence among everyone else (the other eleven signs combined into one group). Taken at face value, this comparison showed that the difference in incidence rates was very unlikely to be due to chance (the P value was 0.026). Pisces have a “statistically significant” higher incidence of heart failure than do people born under the other eleven signs.
The problem is that the investigators didn’t really test one hypothesis; they tested twelve. They focused on Pisces only after looking at the incidence of heart failure for people born under all twelve astrological signs. So it isn’t fair to compare that one group against the others without considering the other eleven implicit comparisons. After correcting for those multiple comparisons, there was no significant association between astrological sign and heart failure.
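To see why the correction changes the conclusion, here is one common adjustment applied to the reported P value (an illustrative sketch; the correction used in the paper itself may have been computed differently):
```python
# Adjusting the observed P value of 0.026 for the twelve implicit comparisons
# (one per astrological sign). Both common adjustments leave it far above 0.05.
p_observed = 0.026
k = 12

bonferroni_p = min(1.0, k * p_observed)   # Bonferroni adjustment
sidak_p = 1 - (1 - p_observed) ** k       # Sidak adjustment

print(f"Bonferroni-adjusted P: {bonferroni_p:.2f}")  # about 0.31
print(f"Sidak-adjusted P:      {sidak_p:.2f}")       # about 0.27
```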
Fitting a multiple regression model provides even more opportunities to try multiple analyses:
•Try including or excluding possible confounding variables.
•Try including or excluding interactions.
•Try different definitions of the outcome variable.
•Try transforming the outcome or any of the independent variables to logarithms, reciprocals, or something else.
Unless these decisions were made in advance, the results of multiple regression (or multiple logistic or proportional hazards regression) cannot be interpreted at face value.
Chapter 38 of Intuitive Biostatistics (8) explains this problem of overfitting, as does Babyak (5).
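The number of candidate models multiplies with each of these choices. The sketch below uses made-up counts of confounders, interactions, outcome definitions, and transformations (none of them from the text) just to show how quickly a few “reasonable” post hoc decisions generate hundreds of possible analyses; reporting only the one with the smallest P value makes that P value meaningless.
```python
# A rough count (purely illustrative; the numbers are assumptions) of how many
# different regression models arise from a few choices made after seeing the data.
n_candidate_confounders = 5      # each can be included or excluded
n_candidate_interactions = 3     # each can be included or excluded
n_outcome_definitions = 2        # e.g., two ways to define the outcome
n_transformations = 3            # e.g., none, log, reciprocal

n_models = (2 ** n_candidate_confounders
            * 2 ** n_candidate_interactions
            * n_outcome_definitions
            * n_transformations)
print(n_models)  # 1536 distinct analyses from a handful of innocent choices
```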
In some cases, you first look at the data (and perhaps run a preliminary analysis) and then decide which test to run next based on those results. Gelman and Loken call this “the garden of forking paths” and point out that it is a form of multiple comparisons (10).
Editors prefer to publish papers that report results that are statistically significant. Interpreting published results becomes problematic when studies with “not significant” conclusions are abandoned, while the ones with “statistically significant” results get published. This means that the chance of observing a ‘significant’ result in a published study can be much greater than 5% even if the null hypotheses are all true.
Turner and colleagues demonstrated this kind of selectivity -- called publication bias -- in industry-sponsored investigations of the efficacy of antidepressant drugs (7). Between 1987 and 2004, the Food and Drug Administration (FDA) reviewed 74 such studies and categorized them as “positive”, “negative”, or “questionable”. The FDA reviewers found that 38 studies showed a positive result (the antidepressant worked). All but one of these studies was published. The FDA reviewers found that the remaining 36 studies had negative or questionable results. Of these, 22 were not published, 11 were published with a ‘spin’ that made the results seem somewhat positive, and only 3 were published with clearly negative findings.
The problem is a form of multiple comparisons. Many studies are done, but only some are published, and these are selected because they show "desired" results.
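A small simulation (a sketch under stated assumptions, not a model of the Turner data) makes the point: if every null hypothesis is true but “significant” studies are always published while “not significant” studies rarely are, the published literature contains a far higher proportion of “significant” findings than the 5% expected by chance.
```python
# Publication bias sketch: no drug works (null always true), "significant"
# studies are always published, "not significant" studies are published 10%
# of the time. (Those publication rates are assumptions for illustration.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_group = 1000, 30

significant_all = published_total = published_significant = 0
for _ in range(n_studies):
    drug = rng.normal(size=n_per_group)      # drug group: no real effect
    placebo = rng.normal(size=n_per_group)   # placebo group
    significant = stats.ttest_ind(drug, placebo).pvalue < 0.05
    published = significant or rng.random() < 0.10
    significant_all += significant
    published_total += published
    published_significant += significant and published

print(f"Significant among all studies:       {significant_all / n_studies:.0%}")
print(f"Significant among published studies: {published_significant / published_total:.0%}")
```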
Statistical analyses can be interpreted at face value only when all steps are planned, all planned analyses are published, and all the results are considered when reaching conclusions. These simple rules are violated in many ways in common statistical practice.
If you try hard enough, ‘statistically significant’ findings will eventually emerge from any reasonably complicated data set. This is called data torture (6) or P-hacking (9). When reviewing results, you often cannot even correct for the number of ways the data were analyzed, because the number of possible comparisons was not defined in advance and is almost unlimited. When data were analyzed many ways without a plan, the results simply cannot be interpreted. At best, you can treat the findings as a hypothesis to be tested in future studies with new data.
1. Vickers, A. 2009. What is a p value anyway? ISBN: 978-0321629302.
2. Lee, K. L., J. F. McNeer, C. F. Starmer, P. J. Harris, and R. A. Rosati. 1980. Clinical judgment and statistics. Lessons from a simulated randomized trial in coronary artery disease. Circulation 61, (3) (Mar): 508-15
3. Pielke, R. Prometheus: Forecast verification for climate science, part 3. Retrieved April 20, 2008.
4. Austin, P. C., and M. A. Goldwasser. 2008. Pisces did not have increased heart failure: Data-driven comparisons of binary proportions between levels of a categorical variable can result in incorrect statistical significance levels. Journal of Clinical Epidemiology 61, (3) (Mar): 295-300.
5. Babyak, M. A. 2004. What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine 66, (3): 411.
6. Mills, J. L. 1993. Data torturing. New England Journal of Medicine 329, (16): 1196.
7. Turner, E. H., A. M. Matthews, E. Linardatos, R. A. Tell, and R. Rosenthal. 2008. Selective publication of antidepressant trials and its influence on apparent efficacy. The New England Journal of Medicine 358, (3) (Jan 17): 252-60.
8. Motulsky, H. J. 2010. Intuitive Biostatistics, 2nd edition. Oxford University Press. ISBN: 978-0-19-973006-3.
9. Simmons, J. P., L. D. Nelson, and U. Simonsohn. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22: 1359-1366.
10. Gelman, A., and E. Loken. 2013. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or ‘p-hacking’ and the research hypothesis was posited ahead of time. Downloaded January 30, 2014.