November 17, 2008
Data torture and multiple comparisons
It is hard enough to interpret one statistical result. Interpreting multiple comparisons at once is harder, but necessary. Picking and choosing among many results makes all conclusions invalid. For statistical analyses to be interpretable, it is essential that all analyses be planned, and that all planned analyses be conducted and reported. These simple and sensible rules are violated in many situations.
Data Torture
“Data torture” occurs when investigators, without a clear plan, analyze their data in many ways, desperately seeking “statistical significance” (1). Vickers told this story:
Statistician: "Oh, so you have already calculated the P value?"
Surgeon: "Yes, I used multinomial logistic regression."
Statistician: "Really? How did you come up with that?"
Surgeon: "Well, I tried each analysis on the SPSS drop-down menus, and that was the one that gave the smallest P value."
Investigators have found many ways to torture data. Change the definition of the outcome. Use a different time scale. Try different criteria for including or excluding a subject. Arbitrarily decide which points to remove as outliers. Try different ways to clump or separate subgroups. Try different algorithms for computing statistical tests. Try different statistical tests.
Fitting a multiple regression model provides even more opportunities for data torture. Include or exclude possible confounding variables. Include or exclude interactions. Change the definition of the outcome variable.
If you try hard enough, “statistically significant” findings will eventually emerge from any reasonably complicated data set. Since the number of possible comparisons is not defined in advance, and is almost unlimited, results from data torture cannot be interpreted, except perhaps as a way to generate hypotheses to be tested in future studies.
Torture by Editors -- Publication Bias
Editors prefer to publish papers that report statistically significant results. Interpreting published results becomes problematic when studies with “not significant” conclusions are abandoned, while the ones with “statistically significant” results get published. This means that the chance of observing a “significant” result in a published study can be much greater than 5% even if the null hypotheses are all true.
Turner demonstrated this kind of selectivity, called publication bias, in industry-sponsored investigations of the efficacy of antidepressant drugs (2). Between 1987 and 2004, the Food and Drug Administration (FDA) reviewed 74 such studies, and categorized them as “positive”, “negative” or “questionable”. The FDA reviewers found that 38 studies showed a positive result (the antidepressant worked). All but one of these studies were published. The FDA reviewers found that the remaining 36 studies had negative or questionable results. Of these, 22 were not published, 11 were published with a “spin” that made the results seem somewhat positive, and only 3 were published with clear negative findings.
Studies that show “positive” results are far more likely to be published than ones that reach negative or ambiguous conclusions. Selective publication makes it impossible to properly interpret the published literature.
Multiple Time Points -- Sequential Analyses
To properly interpret a P value, the experimental protocol has to be set in advance. Usually this means choosing a sample size, collecting data, and then analyzing it. But what if the results aren’t quite statistically significant? It is tempting to run the experiment a few more times (or add a few more subjects), and then analyze the data again with the larger sample size. If the results still aren’t “significant”, then do the experiment a few more times (or add more subjects) and reanalyze once again.
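The add-more-data-and-retest procedure just described can be simulated directly. The sketch below is only an illustration, not a method taken from any of the cited papers. It assumes two groups drawn from the same normal distribution (so the null hypothesis is true), a z test with a known standard deviation of 1, and arbitrary, hypothetical settings: start with 10 subjects per group, add 5 per group after each "not significant" test, and give up at 100 per group.

```python
import math
import random

Z_CRIT = 1.96  # two-sided critical value for alpha = 0.05 (normal distribution)


def z_statistic(a, b):
    """Two-sample z statistic, assuming both groups have a known SD of 1."""
    n = len(a)
    diff = sum(a) / n - sum(b) / n
    return diff / math.sqrt(2.0 / n)


def run_experiment(rng, start_n=10, step=5, max_n=100, peek=True):
    """Return True if the experiment ends with a 'significant' result.

    Both groups come from the same N(0, 1) distribution, so the null
    hypothesis is true and every 'significant' result is a false positive.
    With peek=False the data are tested only once, at the final sample size.
    """
    a = [rng.gauss(0, 1) for _ in range(start_n)]
    b = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        at_final_size = len(a) >= max_n
        if (peek or at_final_size) and abs(z_statistic(a, b)) > Z_CRIT:
            return True  # stop as soon as the test is 'significant'
        if at_final_size:
            return False
        # Not significant yet: add more subjects to each group and retest.
        a.extend(rng.gauss(0, 1) for _ in range(step))
        b.extend(rng.gauss(0, 1) for _ in range(step))


def false_positive_rate(peek, n_sim=2000, seed=1):
    """Fraction of simulated null experiments that end up 'significant'."""
    rng = random.Random(seed)
    hits = sum(run_experiment(rng, peek=peek) for _ in range(n_sim))
    return hits / n_sim


if __name__ == "__main__":
    print("fixed sample size    :", false_positive_rate(peek=False))
    print("test after each batch:", false_positive_rate(peek=True))
```

With a sample size fixed in advance, the false-positive rate should come out near 5%. When the data are tested after every batch and the experiment stops at the first "significant" result, the rate should be clearly higher, and it keeps growing as the maximum sample size is raised.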
When data are analyzed in this way, it is impossible to interpret the results. This informal sequential approach should not be used. If the null hypothesis of no difference is in fact true, the chance of obtaining a “statistically significant” result using that informal sequential approach is far higher than 5%. In fact, if you carry on that approach long enough, then every single experiment will eventually reach a “significant” conclusion, even if the null hypothesis is true. Of course, “long enough” might be very long indeed and exceed your budget or even your lifespan.
The problem is that the experiment continues when the result is not “significant”, but stops when the result is “significant”. If the experiment were continued after reaching “significance”, adding more data might then result in a “not significant” conclusion. But you’d never know this, because the experiment would have been terminated once “significance” was reached. If you keep running the experiment when you don’t like the results, but stop the experiment when you like the results, the results are impossible to interpret.
Statisticians have developed rigorous ways to handle sequential data analysis. These methods use much more stringent criteria to define “significance” to make up for the multiple comparisons. Without these special methods, you can’t interpret the results unless the sample size is set in advance.
Multiple Subgroups
Analyzing multiple subgroups of data is a form of multiple comparisons. When a treatment appears to work in some subgroups but not in others, it is easy to be fooled.
A simulated study by Lee and coworkers points out the problem. They pretended to compare survival following two “treatments” for coronary artery disease. They studied a group of real patients with coronary artery disease, whom they randomly divided into two groups. In a real study, they would give the two groups different treatments, and compare survival.
In this simulated study, they treated the subjects identically but analyzed the data as if the two random groups actually represented two distinct treatments. As expected, the survival of the two groups was indistinguishable (3).
They then divided the patients into six groups depending on whether they had disease in one, two, or three coronary arteries, and depending on whether the heart ventricle contracted normally or not. Since these are variables that are expected to affect survival of the patients, it made sense to evaluate the response to “treatment” separately in each of the six subgroups. Whereas they found no substantial difference in five of the subgroups, they found a striking result among the sickest patients. The patients with three-vessel disease who also had impaired ventricular contraction had much better survival under treatment B than treatment A. The difference between the two survival curves was statistically significant, with a P value less than 0.025.
If this were an actual study, it would be tempting to conclude that treatment B is superior for the sickest patients, and to recommend treatment B to those patients in the future. But this was not a real study, and the two “treatments” reflected only random assignment of patients. The two treatments were identical, so the observed difference was absolutely, positively due to chance.
It is not surprising that the authors found one low P value out of six comparisons. There is a 26% chance that at least one of six independent comparisons will have a P value less than 0.05, even if all null hypotheses are true.
If all the subgroup comparisons are defined in advance, it is possible to correct for many comparisons, either as part of the analysis or informally while interpreting the results. But when this kind of subgroup analysis is not defined in advance, it becomes a form of “data torture”.
Multiple Predictions
In 2000, the Intergovernmental Panel on Climate Change made predictions about future climate.
Pielke asked what seemed like a straightforward question: How accurate were those predictions over the next seven years? That’s not long enough to seriously assess predictions of global warming, but it is a necessary first step. Answering this question proved to be impossible. The problems are that the report contained numerous predictions, and didn’t specify which sources of climate data should be used. Did the predictions come true? The answer depends on which prediction you choose to test and which data set you test it against: “a feast for cherry pickers”.
You can only evaluate the accuracy of predictions or diagnoses when the prediction, and the method or data source to compare it with, are unambiguous.
Combining Groups
When comparing two groups, the groups must be defined as part of the study design. If the groups are defined by the data, many comparisons are being made implicitly and the results cannot be interpreted.
Austin and Goldwasser demonstrated this problem (4). They looked at the incidence of hospitalization for heart failure in Ontario (Canada) in twelve groups of patients defined by their astrological sign (based on their birthday). People born under the sign of Pisces happened to have the highest incidence of heart failure. They then did a simple statistical test to compare the incidence of heart failure among people born under Pisces with the incidence among all others (born under the other eleven signs, combined into one group). Taken at face value, this comparison showed that the difference in incidence rates is very unlikely to be due to chance (the P value was 0.026). Pisces have a “statistically significant” higher incidence of heart failure than do people born under the other eleven signs.
The problem is that the investigators didn’t really test one hypothesis; they tested twelve. They only focused on Pisces after looking at the incidence of heart failure for people born under all twelve astrological signs.
So it isn’t fair to compare that one group against the others, without considering the other eleven implicit comparisons. After correcting for those multiple comparisons, there was no significant association between astrological sign and heart failure.
Summary
Multiple comparisons can be interpreted correctly only when all comparisons are planned, and all planned comparisons are published. These simple ideas are violated in many ways in common statistical practice.
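The 26% figure quoted for six subgroup comparisons follows from a one-line formula, sketched below. The function names are illustrative. The calculation assumes the comparisons are independent and that every null hypothesis is true. The Bonferroni threshold is shown only as one common correction; the text does not say which correction Austin and Goldwasser used, and comparisons of each sign against all the others are not strictly independent.

```python
def chance_of_any_false_positive(n_comparisons, alpha=0.05):
    """Chance that at least one of n independent comparisons has P < alpha
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


def bonferroni_threshold(n_comparisons, alpha=0.05):
    """Per-comparison P value cutoff that keeps the overall chance of any
    false positive at or below alpha (Bonferroni correction)."""
    return alpha / n_comparisons


if __name__ == "__main__":
    # 6 = the coronary subgroups, 12 = the astrological signs
    for k in (1, 6, 12):
        print(
            k,
            round(chance_of_any_false_positive(k), 3),
            round(bonferroni_threshold(k), 4),
        )
```

For six comparisons the first function gives about 0.26. For twelve comparisons, the Bonferroni cutoff is far below 0.026, so under this particular correction the Pisces result would not count as significant.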
References:
1. Mills, J. L. 1993. Data torturing. New England Journal of Medicine 329, (16): 1196.
2. Turner, E. H., A. M. Matthews, E. Linardatos, R. A. Tell, and R. Rosenthal. 2008. Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine 358, (3) (Jan 17): 252-60.
3. Lee, K. L., J. F. McNeer, C. F. Starmer, P. J. Harris, and R. A. Rosati. 1980. Clinical judgment and statistics. Lessons from a simulated randomized trial in coronary artery disease. Circulation 61, (3) (Mar): 508-15.
4. Austin, P. C., and M. A. Goldwasser. 2008. Pisces did not have increased heart failure: Data-driven comparisons of binary proportions between levels of a categorical variable can result in incorrect statistical significance levels. Journal of Clinical Epidemiology 61, (3) (Mar): 295-300.
November 14, 2008
The joy of EPS
When you submit Prism 5 graphs or layouts to a journal, you can choose among many export formats. One format that is often overlooked, but should be your first choice, is EPS. Compared to TIF files, EPS files are compact and crisp.
EPS files contain the same PostScript information as PDF files, but with some headers that make them more compatible with the systems journals use to lay out pages. EPS files are based on vectors and fonts (not bitmaps), so they scale to any size. Of course, different journals use different systems and have different requirements. Many biological journals are produced by Cadmus, and we have heard that they accept EPS files from Prism.
After creating an EPS file, and before sending it to a journal, you probably want to preview it. With a Mac, that is no problem. The Mac Preview program will let you view the EPS file (actually, it converts it to PDF and lets you preview that). With Windows, however, you won't be able to preview EPS files with standard software. Since EPS and PDF files are very similar (and are created by the same software module), the solution is to also export in PDF format, and preview those files.
The ability to export in EPS format is new to Prism 5.
November 13, 2008
When to not correct for multiple comparisons
Multiple comparisons can be accounted for with Bonferroni and other corrections, or by calculating the False Discovery Rate. But these approaches are not always needed. Here are three situations where special calculations are not needed.
Account for multiple comparisons when interpreting the results rather than in the calculations
Some statisticians recommend never correcting for multiple comparisons while analyzing data (1). Instead, report all of the individual P values and confidence intervals, and make it clear that no mathematical correction was made for multiple comparisons. This approach requires that all comparisons be reported. When you interpret these results, you need to informally account for multiple comparisons. If all the null hypotheses are true, you’d expect 5% of the comparisons to have uncorrected P values less than 0.05. Compare this number to the actual number of small P values.
Corrections for multiple comparisons may not be needed if you make only a few planned comparisons
Other statisticians recommend not doing any formal corrections for multiple comparisons when the study focuses on only a few scientifically sensible comparisons, rather than every possible comparison. The term planned comparison is used to describe this situation. These comparisons must be designed into the experiment, and cannot be decided upon after inspecting the data.
Corrections for multiple comparisons are not needed when the comparisons are complementary
Ridker and colleagues (2) asked whether lowering LDL cholesterol would prevent heart disease in patients who did not have high LDL concentrations and did not have a prior history of heart disease (but did have an abnormal blood test suggesting the presence of some inflammatory disease). The study included almost 18,000 people. Half received a statin drug to lower LDL cholesterol and half received placebo.
The investigators’ primary goal (planned as part of the protocol) was to compare the number of “end points” that occurred in the two groups, including deaths from a heart attack or stroke, nonfatal heart attacks or strokes, and hospitalization for chest pain. These events happened about half as often to people treated with the drug as to people taking placebo. The drug worked.
The investigators also analyzed each of the end points separately. Those taking the drug (compared to those taking placebo) had fewer deaths, fewer heart attacks, fewer strokes, and fewer hospitalizations for chest pain.
The data from various demographic groups were then analyzed separately. Separate analyses were done for men and women, old and young, smokers and nonsmokers, people with hypertension and without, and people with a family history of heart disease and those without. In each of 25 subgroups, patients receiving the drug experienced fewer primary end points than those taking placebo, and all these effects were statistically significant.
The investigators made no correction for multiple comparisons for all these separate analyses of outcomes and subgroups. No corrections were needed, because the results are so consistent. The multiple comparisons each ask the same basic question a different way, and all the comparisons point to the same conclusion: people taking the drug had less cardiovascular disease than those taking placebo.
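The informal accounting described in the first situation above (compare the number of small P values you observed with the number expected by chance) can be sketched as follows. This is only an illustration: the function names and the example counts (20 comparisons, 4 small P values) are hypothetical, and the binomial calculation assumes the comparisons are independent, which overlapping subgroups such as those in the statin study are not.

```python
from math import comb


def expected_small_p(n_comparisons, alpha=0.05):
    """Expected number of P values below alpha if all null hypotheses are true."""
    return n_comparisons * alpha


def prob_at_least(n_small, n_comparisons, alpha=0.05):
    """Chance of seeing at least n_small P values below alpha among
    n_comparisons independent comparisons when every null is true
    (upper tail of a binomial distribution)."""
    return sum(
        comb(n_comparisons, k) * alpha**k * (1 - alpha) ** (n_comparisons - k)
        for k in range(n_small, n_comparisons + 1)
    )


if __name__ == "__main__":
    n, observed = 20, 4  # hypothetical counts, for illustration only
    print("expected by chance :", expected_small_p(n))
    print("chance of >=", observed, ":", prob_at_least(observed, n))
```

If the observed count is close to the expected count, the small P values are about what chance alone would produce. A count far above it, as when every one of many comparisons points the same way, is much harder to attribute to chance.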
References:
1. Rothman, K.J. (1990). No adjustments are needed for multiple comparisons. Epidemiology, 1: 43-46.
2. Ridker, P.M., et al. (2008). Rosuvastatin to prevent vascular events in men and women with elevated C-reactive protein. New England Journal of Medicine, 359: 2195-2207.
November 12, 2008
Using InStat's help with Windows Vista
The Windows versions of InStat and StatMate were written before Vista, and use the older .HLP style of online help. Viewing this help requires the Windows Help Viewer, which is no longer a standard part of Windows Vista. If you try to access help, Windows presents an error message with a link to a page on microsoft.com. Follow the instructions on that page to download and install the Windows Help program (WinHlp32.exe) for Windows Vista. Once that program is installed, the Help for InStat and StatMate will work just fine. Note that you must be logged into Windows as an administrator to install that program.
If you don't want to fuss with installing Windows components, all the information in the Help system is also in this free InStat PDF manual.