When analyzing data, you'll sometimes find that one value is far from the others. Such a value is called an outlier, a term that is usually not defined rigorously.
When you encounter an outlier, you may be tempted to delete it from the analyses. Before you do, ask yourself these questions:
•Was the value entered into the computer correctly? If there was an error in data entry, fix it.
•Were there any experimental problems with that value? For example, if you noted that one tube looked funny, you can use that as justification to exclude the value resulting from that tube without needing to perform any calculations.
•Could the outlier be caused by biological diversity? If each value comes from a different person or animal, the outlier may be a correct value. It is an outlier not because of an experimental mistake, but rather because that individual may be different from the others. This may be the most exciting finding in your data!
If you answered “no” to all three questions, you are left with two possibilities.
•The outlier was due to chance. In this case, you should keep the value in your analyses. The value came from the same distribution as the other values, so it should be included.
•The outlier was due to a mistake: bad pipetting, voltage spike, holes in filters, etc. Since including an erroneous value in your analyses will give invalid results, you should remove it. In other words, the value comes from a different population than the other values, and is misleading.
The problem, of course, is that you can never be sure which of these possibilities is correct.
Some statistical tests are designed so that the results are not altered much by the presence of one or a few outliers. Such tests are said to be robust. When you use a robust method, there is less reason to want to exclude outliers.
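As a rough illustration of what "robust" means here, the sketch below compares how far the mean and the median move when a single extreme value is added. The data values are made up for the example, and the mean and median stand in for non-robust and robust methods in general; they are not tests themselves.

```python
from statistics import mean, median

# Hypothetical measurements, invented for this illustration.
values = [4.8, 4.9, 5.0, 5.1, 5.2]
with_outlier = values + [50.0]  # the same data plus one extreme value

# How much does each summary change when the outlier is included?
mean_shift = mean(with_outlier) - mean(values)
median_shift = median(with_outlier) - median(values)

print(f"mean shifts by   {mean_shift:.2f}")
print(f"median shifts by {median_shift:.2f}")
```

The mean is pulled strongly toward the extreme value, while the median barely moves, so an analysis built on the median gives much the same answer whether or not the outlier is kept.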
Most nonparametric tests compare the distributions of ranks rather than the values themselves. This makes these tests robust: the largest value gets the largest rank, no matter how large that value is.
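The point about ranks can be checked directly. The sketch below ranks two made-up groups together and sums the ranks of one group, which is the kind of quantity a rank-based test works with. The helper names `ranks` and `treated_rank_sum` and all data values are hypothetical, chosen only for this illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of values tied with the one at pos.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def treated_rank_sum(group_a, group_b):
    """Rank both groups together and sum the ranks belonging to group_b."""
    combined_ranks = ranks(group_a + group_b)
    return sum(combined_ranks[len(group_a):])


# Hypothetical data: the second treated list makes the largest value far larger.
control = [3.1, 3.4, 2.9, 3.3]
treated = [3.8, 4.1, 3.6, 4.4]
treated_extreme = [3.8, 4.1, 3.6, 440.0]

rank_sum_original = treated_rank_sum(control, treated)
rank_sum_extreme = treated_rank_sum(control, treated_extreme)

print(rank_sum_original, rank_sum_extreme)
```

Both rank sums are identical, because 4.4 and 440.0 each receive the top rank. Any test computed only from the ranks therefore gives the same result for both data sets.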
Other tests are robust to outliers because, rather than assuming a Gaussian distribution, they assume a distribution with much wider tails, where outliers are more common (and so have less impact).
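To see why assuming a wider distribution reduces the impact of an outlier, the sketch below estimates the center of some made-up data in two ways. Under a Gaussian assumption the maximum-likelihood estimate of the center is the mean. As one example of a wider distribution, the sketch uses a Cauchy (Lorentzian) distribution; that choice, the fixed `scale` of 1.0, the function name `cauchy_location`, and the data are all assumptions made for illustration, not a method described in the text above.

```python
from statistics import mean, median


def cauchy_location(values, scale=1.0, iterations=100):
    """Estimate the center of a Cauchy distribution with a known, fixed scale.

    Uses iteratively reweighted averaging: points far from the current
    estimate get weights close to zero, so they barely affect the result.
    """
    mu = median(values)  # robust starting point
    for _ in range(iterations):
        weights = [1.0 / (1.0 + ((x - mu) / scale) ** 2) for x in values]
        mu = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return mu


# Hypothetical measurements, invented for this illustration.
clean = [4.8, 4.9, 5.0, 5.1, 5.2]
with_outlier = clean + [50.0]

# Gaussian assumption: the estimate of the center is the mean.
gaussian_change = mean(with_outlier) - mean(clean)

# Cauchy assumption: the outlier is down-weighted almost to nothing.
cauchy_change = cauchy_location(with_outlier) - cauchy_location(clean)

print(f"change in Gaussian estimate: {gaussian_change:.3f}")
print(f"change in Cauchy estimate:   {cauchy_change:.3f}")
```

Under the wide-tailed assumption a value far from the rest is not very surprising, so the fit gives it little weight and the estimate of the center hardly changes.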