Let's consider what would happen if you did many comparisons and determined whether each result is 'significant' or not. Also assume that we are 'mother nature', so we know whether a difference truly exists or not in the populations from which the data were sampled.
In the table below, the top row represents the results of comparisons where the null hypothesis is true -- the treatment really doesn't work. Nonetheless, some of these comparisons will mistakenly yield a 'significant' conclusion. The second row shows the results of comparisons where a difference truly exists. Even so, you won't get a 'significant' result in every experiment.
A, B, C and D represent numbers of comparisons, so the sum of A+B+C+D equals the total number of comparisons you are making. You can't make this table from experimental data because this table is an overview of many experiments.
"Significant" |
"Not significant" |
Total |
|
No difference. Null hypothesis true |
A |
B |
A+B |
A difference truly exists |
C |
D |
C+D |
Total |
A+C |
B+D |
A+B+C+D |
In the table above, alpha is the expected value of A/(A+B). If you set alpha to the usual value of 0.05, this means you expect 5% of all comparisons done when the null hypothesis is true (A+B) to be statistically significant (in the first column). So you expect A/(A+B) to equal 0.05.
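The simulation below is a minimal sketch of this table (the sample size, effect size and number of comparisons are arbitrary choices, not taken from the text above). It runs many two-sample t tests, half with the null hypothesis true and half with a real difference, tabulates A, B, C and D, and checks that A/(A+B) comes out close to alpha.

```python
# Illustrative sketch only: arbitrary sample size, effect size and number of comparisons.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_comparisons = 10_000   # comparisons of each kind
n_per_group = 20         # observations per group
true_effect = 1.0        # shift in the populations where a difference truly exists

A = B = C = D = 0
for _ in range(n_comparisons):
    # Null hypothesis true: both groups drawn from the same population.
    g1 = rng.normal(0, 1, n_per_group)
    g2 = rng.normal(0, 1, n_per_group)
    if stats.ttest_ind(g1, g2).pvalue < alpha:
        A += 1          # false conclusion of 'significance'
    else:
        B += 1

    # A difference truly exists: the second group is shifted by true_effect.
    g1 = rng.normal(0, 1, n_per_group)
    g2 = rng.normal(true_effect, 1, n_per_group)
    if stats.ttest_ind(g1, g2).pvalue < alpha:
        C += 1          # correct conclusion of 'significance'
    else:
        D += 1

print(f"A/(A+B) = {A / (A + B):.3f}  (expected to be close to alpha = {alpha})")
print(f"C/(C+D) = {C / (C + D):.3f}  (the power of each individual comparison)")
```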
The usual approach to correcting for multiple comparisons is to set a stricter threshold for defining statistical significance. The goal is to set the definition of significance so strict that -- if all null hypotheses are true -- there is only a 5% chance of obtaining one or more 'significant' results by chance alone, and thus a 95% chance that none of the comparisons leads to a 'significant' conclusion. The 5% applies to the entire family of comparisons, so it is sometimes called an experimentwise error rate or a familywise error rate (the two terms are synonyms).
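A short sketch of that arithmetic, assuming the comparisons are independent (an assumption made here for illustration): it shows how the chance of at least one false 'significant' result grows if every comparison is tested at 0.05, and the stricter per-comparison thresholds (Bonferroni and Sidak) that hold the familywise error rate at 5%.

```python
# Assumes independent comparisons, each tested at the same per-comparison threshold.
alpha_family = 0.05

for k in (1, 3, 10, 20, 100):
    # Familywise error rate if each of k comparisons is tested at 0.05.
    fwer_uncorrected = 1 - (1 - 0.05) ** k
    # Stricter per-comparison thresholds that keep the familywise rate at 5%.
    bonferroni = alpha_family / k
    sidak = 1 - (1 - alpha_family) ** (1 / k)
    print(f"k={k:>3}  FWER testing each at 0.05: {fwer_uncorrected:.3f}  "
          f"Bonferroni threshold: {bonferroni:.5f}  Sidak threshold: {sidak:.5f}")
```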
Setting a stricter threshold for declaring statistical significance ensures that you are far less likely to be misled by false conclusions of 'statistical significance'. But this advantage comes at a cost: your experiment will have less power to detect true differences.
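As a rough illustration of that cost (all numbers here are made up for the example), the simulation below estimates the power of a two-sample t test for one particular effect size and sample size, first at the usual 0.05 threshold and then at a Bonferroni-corrected threshold for 20 comparisons.

```python
# Illustrative sketch only: arbitrary effect size, sample size and number of comparisons.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, true_effect, n_sim = 20, 1.0, 10_000

pvalues = np.array([
    stats.ttest_ind(rng.normal(0, 1, n_per_group),
                    rng.normal(true_effect, 1, n_per_group)).pvalue
    for _ in range(n_sim)
])

print(f"Power at alpha = 0.05      : {np.mean(pvalues < 0.05):.2f}")
print(f"Power at alpha = 0.05 / 20 : {np.mean(pvalues < 0.05 / 20):.2f}")
```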
The methods of Bonferroni, Tukey, Dunnett, Dunn, Holm (and more) all use this approach.
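As a concrete sketch of two of these corrections (the p-values and helper functions below are illustrative, not from this guide): Bonferroni simply divides the significance threshold by the number of comparisons, while Holm steps down through the sorted p-values with progressively less strict thresholds, which still controls the familywise error rate but declares at least as many comparisons significant.

```python
# Illustrative helper functions; the p-values are made up for the example.
def bonferroni_significant(pvalues, alpha=0.05):
    """Return a list of booleans: True where a comparison is declared significant."""
    k = len(pvalues)
    return [p < alpha / k for p in pvalues]

def holm_significant(pvalues, alpha=0.05):
    """Holm's step-down procedure: compare sorted p-values to alpha/k, alpha/(k-1), ..."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    significant = [False] * k
    for rank, i in enumerate(order):
        if pvalues[i] < alpha / (k - rank):
            significant[i] = True
        else:
            break    # stop at the first non-significant p-value
    return significant

pvalues = [0.001, 0.012, 0.020, 0.040, 0.300]
print(bonferroni_significant(pvalues))   # [True, False, False, False, False]
print(holm_significant(pvalues))         # [True, True, False, False, False]
```

With these example p-values, Holm declares one more comparison significant than Bonferroni does, which is why step-down methods are often preferred when a simple correction is all that is needed.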