The SD computed from tiny samples underestimates the population SD (but not by much)
The standard deviation (SD) quantifies scatter. The equation used to compute the sample SD (the one with n-1 in the denominator) underestimates the true population SD by a small amount.
The following simulation demonstrates this. The graph shows the results of 400 simulations. Each simulation randomly sampled from a Gaussian population with mean=100 and SD=15. One hundred samples had only two values (n=2, left panel of the graph); another 100 samples each had n=3, another 100 had n=10, and another 100 had n=50. Each dot on the graph shows the SD of one randomly generated sample. The long horizontal line shows the true population SD, which is 15.0. The shorter horizontal lines show the mean of the SDs from the 100 simulated samples for each sample size.
You can see that the mean SD is a bit too low for the n=2 and n=3 samples. This is not just a glitch due to random sampling, but rather is a consistent finding. The SD is the square root of the variance. The equation that computes the variance (with n-1 in the denominator) is correct: the average of the variances of these simulated samples is indeed very close to the true population variance. But taking the square root to compute the SD shrinks the large variances proportionally more than the small ones, so the mean of the SDs underestimates the true population SD.
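This is easy to check numerically. The sketch below (a minimal stand-in for the simulation behind the graph, using only Python's standard library) draws many n=2 samples from a Gaussian population with mean 100 and SD 15, then compares the average sample variance and the average sample SD to the true population values:

```python
import random
import statistics

random.seed(1)
true_sd = 15.0
n, trials = 2, 100_000

variances = []
sds = []
for _ in range(trials):
    sample = [random.gauss(100, true_sd) for _ in range(n)]
    variances.append(statistics.variance(sample))  # n-1 denominator
    sds.append(statistics.stdev(sample))           # square root of the above

# The mean variance is essentially unbiased (true variance = 225),
# but the mean SD falls well below the true SD of 15.
print(statistics.mean(variances))
print(statistics.mean(sds))
```

With n=2 the mean of the SDs comes out near 12, about 20% below 15, matching the c4 value for n=2 discussed below, while the mean variance stays close to 225.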
An unbiased estimate of the population SD equals the computed sample SD divided by a quantity known as c4 (the c, I think, stands for control chart; I don't know why it is called c4). The value of c4, of course, depends on sample size. It is computed with this Excel formula:
=EXP(GAMMALN(N/2)+LN(SQRT(2/(N-1)))-GAMMALN((N-1)/2))
With n=2, the computed SD is too low by about 20%. With n=10, the discrepancy is only about 3%. Other values are tabulated below:
n | c4
--- | ---
2 | 0.79788
3 | 0.88623
4 | 0.92132
5 | 0.93999
6 | 0.95153
7 | 0.95937
8 | 0.96503
9 | 0.96931
10 | 0.97266
15 | 0.98232
20 | 0.98693
30 | 0.99142
50 | 0.99491
100 | 0.99748
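The same calculation translates directly from the Excel formula above into Python; this sketch uses the log-gamma function (as the Excel formula does) so it stays numerically stable for large n:

```python
import math

def c4(n):
    """Bias-correction factor for the sample SD:
    c4 = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2),
    computed via log-gamma to avoid overflow at large n."""
    return math.exp(math.lgamma(n / 2)
                    + 0.5 * math.log(2 / (n - 1))
                    - math.lgamma((n - 1) / 2))

for n in (2, 3, 10, 100):
    print(n, round(c4(n), 5))
```

The printed values reproduce the table: 0.79788 for n=2, 0.97266 for n=10, and so on.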
Prism and InStat compute the sample standard deviation without the correction detailed above. They don't even offer the option of including the c4 correction. Few programs do. Why is this correction commonly ignored?
- If you really care about differences between means, then what matters is the variance. The t test and one-way ANOVA use variances in their internal calculations. The square of the sample SD (without the c4 correction) is the best estimate of the population variance.
- Inferences based on the confidence interval of the mean are also correct when the sample SD is used (the theory is based on the variance, not the standard deviation).
- The correction is tiny unless the samples are really small. Even then, as the graph above shows, the systematic deviation of the sample SD from the population SD is tiny compared to the random variation.
- Tradition. A new definition of SD would be confusing. It is confusing enough to have two definitions (n vs. n-1).
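If you do want an unbiased SD estimate despite all of the above, the correction is a one-liner: divide the sample SD by c4. A minimal sketch (the `c4` helper here just re-implements the Excel formula from earlier) verifies by simulation that the corrected SDs average out to the true population SD:

```python
import math
import random
import statistics

def c4(n):
    # c4 = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), via log-gamma
    return math.exp(math.lgamma(n / 2)
                    + 0.5 * math.log(2 / (n - 1))
                    - math.lgamma((n - 1) / 2))

random.seed(2)
true_sd = 15.0
n, trials = 3, 100_000

corrected = [statistics.stdev([random.gauss(100, true_sd) for _ in range(n)]) / c4(n)
             for _ in range(trials)]

# The mean of the corrected SDs lands close to the true SD of 15,
# even for samples as small as n=3.
print(statistics.mean(corrected))
```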