How can I make my experiment replicable?
While you can’t guarantee that your experiment will be replicable, there are many steps you can take to put your research on a solid foundation. Many people think science gives cut-and-dried answers, but any experimental result raises the question of whether it would happen again or whether it happened by chance. “By chance” includes all sorts of obvious possibilities (perhaps your samples were contaminated and you never knew) as well as lesser-known ones rooted in math and probability. In this article, we provide suggestions to increase the chances of your results being replicable.
What is replicability?
A scientific experiment is replicable if it can be repeated with the same analytical results. Due to all sorts of factors, including random variability, this is not as common as some might think.
What is reproducibility?
Often used interchangeably with replicability, reproducibility has to do with clearly laying out steps to reproduce the original experiment. A reproducible experiment, then, would detail the process of sampling, data collection and experimentation sufficiently for another skilled researcher to be able to conduct the same experiment.
Background on the replication crisis and p value controversy
You’ve probably heard of the replication crisis, in which (valid) concerns have been raised about how many published experiments with statistically significant results could be replicated if they were run again. Scientists attempted to recreate experiments published in top-tier psychology journals with soberingly poor success rates, which has led to some alarming claims about science. In response, some journals blamed, and even banned, p values and classical null hypothesis testing.
This “crisis” is a good thing
From the perspective of most statisticians and many scientists, this “crisis” sparked a positive and much-needed debate around some of the systemic issues in science that can lead to misleading or unreplicable results. As the American Statistical Association wrote in a statement on the use and misuse of p values, “Nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail.”
Is the issue that statistics can’t be trusted?
Not at all. Statistics is well suited to analyzing data from well-designed experiments. However, statistical results can be manipulated (often unwittingly), leading to inflated estimates of significance. Hence the recommendations that follow.
What do I do to make my results replicable?
Before statistical analysis
The main issues affecting replicability in quantitative scientific research center on the integrity of data collection and analysis, much of which takes place before statistical modeling. The early steps of research are guided by the scientific method and involve:
- Drilling down to a specific question of interest with a corresponding, measurable quantity
- Collecting a representative sample of the population of interest by randomizing appropriately (see the short sketch after this list)
- Identifying the appropriate statistical model for your data and experimental design
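To make the randomization step concrete, here is a minimal sketch of randomly assigning subjects to two groups in Python (our own illustration; the subject IDs, group labels and seed are hypothetical placeholders, not part of any particular protocol).

```python
# A minimal sketch of randomized assignment: each subject has the same chance
# of landing in either group, which guards against systematic selection bias.
# Subject IDs and group labels are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(seed=42)            # fixed seed so the assignment is documented
subjects = [f"subject_{i:02d}" for i in range(1, 21)]

shuffled = rng.permutation(subjects)            # random order of all subjects
groups = {
    "treatment": sorted(shuffled[:10].tolist()),
    "control":   sorted(shuffled[10:].tolist()),
}
for name, members in groups.items():
    print(name, members)
```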
We cannot emphasize enough the trickle-down effect of these choices on the quality of the resulting research. A plethora of articles, textbooks and courses exist on each of these topics for most scientific disciplines. Our focus for the remainder of this article is on the steps you can take with statistics to put your research on solid footing, starting with one of the most pernicious pitfalls: p-hacking.
P-hacking
P-hacking (also called p value hacking, data dredging and data snooping) refers to a variety of analytical errors in which a researcher digs around in their data to find a significant p value. This can greatly affect the replicability of your research.
The motivation to p-hack can largely be attributed to misunderstanding of p values and to scientific journals publishing “statistically significant results” based on a p value below a .05 cutoff. This creates an incentive for scientists to find statistical significance in their data, but p values aren’t robust to repeated examination, so it’s easy to misuse them. For more information on p values, see advice on interpreting small and large p values and using two-tailed tests.
Right now you might be asking yourself: what kind of researcher would p-hack? At first, this sounds like a problem plaguing only the least ethical studies, since the practice runs directly counter to the scientific method. If that were the case, the issues would be fewer and easier to spot; there would probably be no replicability crisis at all!
The truth is that p-hacking most commonly occurs in subtler ways, many of which come from a place that is inquisitive rather than malicious. Here are several examples of how p-hacking can become a temptation in almost any study, and how to guard yourself against it.
P-hacking #1: Dealing with an almost significant result
Let’s say your experiment produced a “nearly significant” outcome. Perhaps you were aiming for a p value of .050 or less, and your p value was .054. Technically, this is not a significant result, but just a little rounding would make it so. Do you round to .05 and report it as significant?
If that seems like p-hacking to you, you’re correct. Most recognize it as such and instead propose an idea that seems more straightforward: just add a few more replicates and see whether the result is officially significant once they are in your dataset. To most, this seems like a good compromise. It is common across many disciplines and sounds sensible, and who could possibly argue with obtaining a more robust dataset?
The answer? Statisticians. We say that if you look for anything enough times, you’ll eventually find it. P values are designed to analyze a single experiment, analyzed one time; even something as innocent as expanding your dataset and re-running the analysis counts as another look. So if you get a “nearly” significant result on your first analysis, tempting as it may be, adding a few more data points without accounting for multiple testing (more on that later) violates statistical principles.
Expanding your dataset once when you don’t get quite the result you hoped for also raises more questions: what will you do if the result still falls short of significance? Add more data again? For these reasons, even though it seems harmless at the outset, playing with a nearly significant result contributes to the replication crisis.
This means that you’ll want to calculate your sample size ahead of time and resist the temptation to add experimental replicates until you reach significance, unless you clearly document this and use an analysis method that adjusts for sequential sampling. Failing to do so results in a much higher probability of a false positive, as the small simulation below illustrates.
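Here is a minimal simulation in Python (our own sketch, not a procedure from the article): both groups are drawn from the same population, so the null hypothesis is true and every “significant” result is a false positive. The sample sizes and the “nearly significant” window are arbitrary choices for illustration.

```python
# A minimal simulation of "peek and add more data": both groups come from the
# same population, so every "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_initial, n_extra = 10_000, 30, 10
planned_hits = peeking_hits = 0

for _ in range(n_sims):
    a = rng.normal(size=n_initial)
    b = rng.normal(size=n_initial)              # same population: the null is true
    p = stats.ttest_ind(a, b).pvalue

    planned_hits += p < 0.05                    # single, pre-planned analysis

    # "Peeking" rule: if nearly significant, add a few replicates and re-test
    if 0.05 <= p < 0.10:
        a = np.concatenate([a, rng.normal(size=n_extra)])
        b = np.concatenate([b, rng.normal(size=n_extra)])
        p = stats.ttest_ind(a, b).pvalue
    peeking_hits += p < 0.05

print(f"False positive rate, pre-planned test: {planned_hits / n_sims:.3f}")  # close to .05
print(f"False positive rate, with peeking:     {peeking_hits / n_sims:.3f}")  # above .05
```

With these settings, the pre-planned test stays near the nominal 5% rate, while the peek-and-extend rule pushes the false positive rate above it.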
P-hacking #2: Data-dredging
In another case, imagine the situation where you have gathered an in-depth dataset on your subjects, but when you look at the results relating to your hypothesis, they are not statistically significant. You might remember that you have a rich dataset, and so, rather than admit defeat over your original hypothesis, you can investigate other relationships. A few (or more than a few) investigations later, you probably will have found a significant relationship in your dataset; it just wasn’t the one you were looking for at the outset. Good thing you didn’t give up hope early, right?
Actually, this contributes to replicability issues, too. On closer inspection, it has the same problem as the first case: if you look for anything enough times, you’ll eventually find it. Even if it only takes looking at the data a couple of different ways to find something that seems compelling, the laws of probability treat it the same as testing all sorts of comparisons until you find a p value below .05.
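As a rough illustration of that math, the short sketch below (our own, in Python) correlates one outcome with 20 unrelated random variables; with a .05 cutoff, the chance that at least one comes out “significant” by luck alone is roughly 1 - 0.95^20, or about 64%.

```python
# A minimal sketch of data dredging: one outcome tested against 20 unrelated
# random variables. Everything here is pure noise, so any p value below .05 is
# a false discovery; the chance of at least one is about 1 - 0.95**20 (~64%).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_variables = 50, 20
outcome = rng.normal(size=n_subjects)

p_values = []
for _ in range(n_variables):
    unrelated = rng.normal(size=n_subjects)     # no true relationship to the outcome
    r, p = stats.pearsonr(outcome, unrelated)
    p_values.append(p)

print(f"Smallest of {n_variables} p values: {min(p_values):.3f}")
print("At least one 'significant' finding:", min(p_values) < 0.05)
```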
In a paper from 2000, Peter Sleight shares some great examples of data dredging. In one, he used data from a study comparing aspirin with placebo for the prevention of myocardial infarction. By partitioning the patients into subgroups based on their astrological signs, he found that for two of the signs (Gemini and Libra), aspirin showed no significant effect.
Does this mean you should never look into these additional questions at all? Not necessarily: after all, exploratory studies in a new field cannot always start with clear hypotheses. What you will notice, though, is a difference in how their conclusions are presented: an exploratory study cannot make the same level of claims that a randomized experimental study can.
If your study began with a main hypothesis (as most should), you should make addressing that hypothesis the main goal of your article, even if that means reporting a non-significant result. However, you can always include exploratory analyses in addition. There you can acknowledge unexpected findings, mention them within the limitations of your study, and suggest that future research look into them in more depth.
P-hacking #3: Multiple Testing
A related p-hacking mistake is failing to adjust properly when you test many p values within a single experiment, known as multiple testing. Fortunately, this is generally easier to catch in the peer-review process. The idea is that you want to control your overall experiment’s probability of a false positive, which means that if you take a scattershot approach to finding significant results, you must penalize your p value cutoff. We’ve broken advice for multiple testing into three different approaches:
- Not correcting for multiple testing, when the experiment was designed intentionally for just a few specific comparisons.
- Controlling the family-wise error rate (FWER), which includes methods such as the Bonferroni correction, the Holm-Sidak test, Tukey’s method and Dunnett’s method.
- Controlling the false discovery rate (FDR), which includes the methods of (1) Benjamini and Hochberg, (2) Benjamini, Krieger and Yekutieli, and (3) Benjamini and Yekutieli.
There are many options for controlling the family-wise error rate and FDR in Prism. Start your free trial of Prism today.
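If you want to see how these corrections behave on a common set of p values outside of a dedicated statistics package, here is a small sketch using Python’s statsmodels (our own illustration; the ten p values are made up for demonstration).

```python
# A sketch comparing no correction, a family-wise error rate correction
# (Bonferroni), and a false discovery rate correction (Benjamini-Hochberg)
# on a made-up set of p values from ten hypothetical comparisons.
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.008, 0.012, 0.030, 0.041,
                     0.049, 0.060, 0.220, 0.450, 0.780])

uncorrected = p_values < 0.05
bonferroni, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
fdr_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Significant without correction:", uncorrected.sum())   # 6
print("Significant under Bonferroni:  ", bonferroni.sum())    # 1 (strict FWER control)
print("Significant under FDR (BH):    ", fdr_bh.sum())        # 3 (in between)
```

The pattern reflects the trade-off above: FWER methods are the most conservative, while FDR methods recover more discoveries in exchange for tolerating a controlled proportion of false ones.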
What if I can’t follow all of your recommendations?
At the very least, address the limitations when you publish. As much as we try to make science purely objective, that’s not always possible. The greatest researchers in our history have made mistakes, and that’s just part of the scientific process. If you have doubts about the way you’ve conducted your experiment or statistical analysis, ask for help. In addition to mentors in your field, most universities and many private companies offer statistical consulting services.
In this article, we’ve given tips for using statistics appropriately in your research. Just as important, if not more so, is honest, open dialogue between researchers. Embrace that, and we’ll be one step closer to scientific research built on a solid foundation.