May 25, 2008

P values: one-tail or two-tail?

When comparing two groups, you must distinguish between one- and two-tail P values. Some books refer to one-sided and two-sided P values, which mean the same thing.

What does one-sided mean?

It is easiest to understand the distinction in context. So let’s imagine that you are comparing the means of two groups (with an unpaired t test). Both one- and two-tail P values are based on the same null hypothesis, that two populations really are the same and that an observed discrepancy between sample means is due to chance.

A two-tailed P value answers this question:

Assuming the null hypothesis is true, what is the chance that randomly selected samples would have means as far apart as (or further than) you observed in this experiment with either group having the larger mean?


To interpret a one-tail P value, you must predict which group will have the larger mean before collecting any data. The one-tail P value answers this question:

Assuming the null hypothesis is true, what is the chance that randomly selected samples would have means as far apart as (or further than) observed in this experiment with the specified group having the larger mean?


If the observed difference went in the direction predicted by the experimental hypothesis, the one-tailed P value is half the two-tailed P value (with most, but not quite all, statistical tests).
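The two questions above can be answered by brute force. What follows is a minimal sketch, not the unpaired t test itself: it uses an exact permutation test, which reshuffles the pooled values into every possible pair of groups and counts how often the difference between means is as large as the one observed. The data and the function name `permutation_p_values` are invented for illustration. Because the two groups here have equal sizes, the reshuffled differences are symmetric around zero, so the one-tail P value comes out at exactly half the two-tail P value.

```python
from itertools import combinations
from statistics import mean


def permutation_p_values(group_a, group_b):
    """Return (one_tail, two_tail) P values for mean(group_a) - mean(group_b).

    The one-tail value assumes you predicted, before collecting data,
    that group_a would have the larger mean.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = mean(group_a) - mean(group_b)
    eps = 1e-12  # tolerance so floating-point noise does not break ties

    n_splits = n_one_tail = n_two_tail = 0
    # Every way of relabeling n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        n_splits += 1
        if diff >= observed - eps:            # as large, in the predicted direction
            n_one_tail += 1
        if abs(diff) >= abs(observed) - eps:  # as large, in either direction
            n_two_tail += 1
    return n_one_tail / n_splits, n_two_tail / n_splits


# Hypothetical measurements, invented for illustration only.
treated = [5.1, 6.0, 6.4, 7.2, 7.9]
control = [4.0, 4.8, 5.5, 5.9, 6.1]

one_tail, two_tail = permutation_p_values(treated, control)
print(f"one-tail P = {one_tail:.4f}, two-tail P = {two_tail:.4f}")
```

With unequal group sizes the reshuffled differences need not be symmetric, which is one reason the halving rule holds for most, but not all, tests.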

When is it appropriate to use a one-sided P value?

A one-tailed test is appropriate when previous data, physical limitations, or common sense tell you that the difference, if any, can only go in one direction. You should only choose a one-tail P value when both of the following are true.

  • You predicted which group will have the larger mean (or proportion) before you collected any data.
  • If the other group had ended up with the larger mean – even if it is quite a bit larger – you would have attributed that difference to chance and called the difference 'not statistically significant'.

Here is an example in which you might appropriately choose a one-tailed P value: You are testing whether a new antibiotic impairs renal function, as measured by serum creatinine. Many antibiotics poison kidney cells, resulting in reduced glomerular filtration and increased serum creatinine. As far as I know, no antibiotic is known to decrease serum creatinine, and it is hard to imagine a mechanism by which an antibiotic would increase the glomerular filtration rate. Before collecting any data, you can state that there are two possibilities: Either the drug will not change the mean serum creatinine of the population, or it will increase the mean serum creatinine in the population. You consider it impossible that the drug will truly decrease mean serum creatinine of the population and plan to attribute any observed decrease to random sampling. Accordingly, it makes sense to calculate a one-tailed P value. In this example, a two-tailed P value tests the null hypothesis that the drug does not alter the creatinine level; a one-tailed P value tests the null hypothesis that the drug does not increase the creatinine level.

The issue in choosing between one- and two-tailed P values is not whether or not you expect a difference to exist. If you already knew whether or not there was a difference, there would be no reason to collect the data. Rather, the issue is whether the direction of a difference (if there is one) can only go one way. You should only use a one-tailed P value when you can state with certainty (and before collecting any data) that in the overall populations there either is no difference or there is a difference in a specified direction. If your data end up showing a difference in the “wrong” direction, you should be willing to attribute that difference to random sampling without even considering the notion that the measured difference might reflect a true difference in the overall populations. If a difference in the “wrong” direction would intrigue you (even a little), you should calculate a two-tailed P value.

Recommendation
I recommend using only two-tailed P values for the following reasons:

  • The relationship between P values and confidence intervals is more straightforward with two-tailed P values.
  • Two-tailed P values are larger (more conservative). Since many experiments do not completely comply with all the assumptions on which the statistical calculations are based, many P values are smaller than they ought to be. Using the larger two-tailed P value partially corrects for this.
  • Some tests compare three or more groups, which makes the concept of tails inappropriate (more precisely, the P value has more than two tails). A two-tailed P value is more consistent with P values reported by these tests.
  • Choosing one-tailed P values can put you in awkward situations. If you decided to calculate a one-tailed P value, what would you do if you observed a large difference in the opposite direction to the experimental hypothesis? To be honest, you should state that the P value is large and you found “no significant difference.” But most people would find this hard. Instead, they’d be tempted to switch to a two-tailed P value, or stick with a one-tailed P value, but change the direction of the hypothesis. You avoid this temptation by choosing two-tailed P values in the first place.
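The first reason in the list above can be illustrated with a small sketch. It assumes a normal (z) approximation with a known standard error rather than the t distribution, and the function name `z_summary` and the numbers are hypothetical. Under that assumption, a 95% confidence interval for the difference excludes zero exactly when the two-tailed P value is less than 0.05.

```python
from statistics import NormalDist


def z_summary(diff, se, confidence=0.95):
    """Return (two_tail_p, ci_low, ci_high) under a normal approximation.

    diff is the observed difference between means and se its standard error.
    """
    std_normal = NormalDist()
    z = diff / se
    two_tail_p = 2 * (1 - std_normal.cdf(abs(z)))
    # Critical value, about 1.96 for a 95% interval.
    z_crit = std_normal.inv_cdf(1 - (1 - confidence) / 2)
    return two_tail_p, diff - z_crit * se, diff + z_crit * se


# Hypothetical differences, each with a standard error of 1.0.
for diff in (1.0, 2.5):
    p, low, high = z_summary(diff, se=1.0)
    ci_excludes_zero = low > 0 or high < 0
    print(f"diff={diff}: P={p:.3f}, 95% CI=({low:.2f}, {high:.2f}), "
          f"CI excludes zero: {ci_excludes_zero}, P<0.05: {p < 0.05}")
```

A one-tailed P value has no such simple match with the usual two-sided confidence interval.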


When interpreting published P values, note whether they are calculated for one or two tails. If the author didn’t say, the result is somewhat ambiguous.

Common misunderstandings about P values.

Kline (see book listing below) lists commonly believed fallacies about P values, which I summarize here:
 
Fallacy: The P value is the probability that the result was due to sampling error
The P value is computed assuming the null hypothesis is true. In other words, the P value is computed based on the assumption that the difference was due to sampling error. Therefore the P value cannot tell you the probability that the result is due to sampling error.

Fallacy: The P value is the probability that the null hypothesis is true
Nope. The P value is computed assuming that the null hypothesis is true, so cannot be the probability that it is true.

Fallacy: 1-P is the probability that the alternative hypothesis is true
If the P value is 0.03, it is very tempting to think: If there is only a 3% probability that my difference would have been caused by random chance, then there must be a 97% probability that it was caused by a real difference. But this is wrong!

What you can say is that if the null hypothesis were true, then 97% of experiments would lead to a difference smaller than the one you observed, and 3% of experiments would lead to a difference as large or larger than the one you observed.

Calculation of a P value is predicated on the assumption that the null hypothesis is correct. P values cannot tell you whether this assumption is correct. The P value tells you how rarely you would observe a difference as large or larger than the one you observed if the null hypothesis were true.

The question that the scientist must answer is whether the result is so unlikely that the null hypothesis should be discarded.
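What a P value of 0.03 does mean can be checked by simulation. The sketch below makes several assumptions that are not in the text above: both samples are drawn from the same normal population with a known standard deviation of 1, and the P value comes from a z test rather than a t test. The seed, sample sizes, and function name `two_tail_p` are arbitrary choices for illustration. Since the null hypothesis is true in every simulated experiment, the fraction of experiments with P of 0.03 or less should come out close to 0.03.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_tail_p(sample_a, sample_b, sigma):
    """Two-tail P value from a z test with known population SD sigma."""
    mean_a = sum(sample_a) / len(sample_a)
    mean_b = sum(sample_b) / len(sample_b)
    se = sigma * (1 / len(sample_a) + 1 / len(sample_b)) ** 0.5
    z = (mean_a - mean_b) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


rng = random.Random(2008)  # fixed seed so the run is repeatable
n_experiments = 20_000
n_per_group = 10

hits = 0
for _ in range(n_experiments):
    # Both samples come from the same population: the null hypothesis is true.
    a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    if two_tail_p(a, b, sigma=1.0) <= 0.03:
        hits += 1

fraction = hits / n_experiments
print(f"Fraction of null experiments with P <= 0.03: {fraction:.3f}")
```

Note what the simulation does not tell you: it says nothing about how often the null hypothesis is true in real experiments, which is why 1-P cannot be read as the probability of a real difference.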

Fallacy: 1-P is the probability that the results will hold up when the experiment is repeated
If the P value is 0.03, it is tempting to think that this means there is a 97% chance of getting ‘similar’ results on a repeated experiment. Not so.

Fallacy: A high P value proves that the null hypothesis is true.
No. A high P value means that if the null hypothesis were true, it would not be surprising to observe the treatment effect seen in this experiment. But that does not prove the null hypothesis is true.
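A short calculation shows why a high P value proves so little. This is a sketch under stated assumptions, not a general power analysis: it uses a two-tail z test with a known standard deviation, and the effect size, sample size, and function name `chance_of_high_p` are hypothetical. It computes how often an experiment would yield P greater than 0.05, first when the null hypothesis is true and then when there is a real but modest difference studied with small samples.

```python
from statistics import NormalDist


def chance_of_high_p(true_diff, sigma, n_per_group, alpha=0.05):
    """Probability that a two-tail z test gives P > alpha.

    true_diff is the real difference between population means,
    sigma the known SD of each population.
    """
    std_normal = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = true_diff / se  # where the z statistic is centered
    # P > alpha exactly when the z statistic lands between -z_crit and +z_crit.
    return std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)


# Hypothetical scenario: SD of 1, five subjects per group.
null_true = chance_of_high_p(true_diff=0.0, sigma=1.0, n_per_group=5)
real_effect = chance_of_high_p(true_diff=0.5, sigma=1.0, n_per_group=5)

print(f"Chance of P > 0.05 when the null is true:     {null_true:.3f}")
print(f"Chance of P > 0.05 with a true difference 0.5: {real_effect:.3f}")
```

In this scenario a high P value is the expected outcome whether or not the null hypothesis is true, so observing one cannot distinguish between the two cases.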

Reference: RB Kline, Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research, 2004.

The Mann-Whitney test doesn't really compare medians.

You'll sometimes read that the Mann-Whitney test compares the medians of two groups. But this is not precisely correct. 

Consider this example:

 

The graph shows each value obtained from control and treated subjects. The two-tail P value from the Mann-Whitney test is 0.0288, so you conclude that there is a statistically significant difference between the groups. But the two medians, shown by the horizontal lines, are identical. The Mann-Whitney test compared the distributions of ranks, which is quite different in the two groups even though the medians are the same.

It is not correct, however, to say that the Mann-Whitney test asks whether the two groups come from populations with different distributions. The two groups in the graph below clearly come from different distributions, but the P value from the Mann-Whitney test is high (0.46).

The Mann-Whitney test compares sums of ranks -- it does not compare medians and does not compare distributions.

The Mann-Whitney test is a comparison of medians only when you assume that the distributions of the two populations have the same shape, even if they are shifted (have different medians). If you accept this assumption, then a small P value from a Mann-Whitney test leads you to conclude that the difference between medians is statistically significant.

More generally, the Mann-Whitney test addresses this question: What is the chance that a randomly selected value from the population with the larger median is greater than a randomly selected value from the other population? A small P value is evidence that this chance differs from 50%.
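Here is a minimal sketch of what the test actually works with. The data are hypothetical (they are not the values plotted in the graphs above), and the helper names `rank_sums` and `prob_a_greater` are my own. The two groups have identical medians, yet their rank sums differ. The sketch stops short of computing a P value; it only shows the rank sums, the U statistic derived from them, and the link between U and the chance that a value from one group exceeds a value from the other.

```python
from statistics import median


def rank_sums(group_a, group_b):
    """Return (rank_sum_a, rank_sum_b), giving tied values their average rank."""
    pooled = sorted(list(group_a) + list(group_b))
    avg_rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 hold the same value and occupy ranks i+1..j.
        avg_rank[pooled[i]] = (i + 1 + j) / 2
        i = j
    return (sum(avg_rank[v] for v in group_a),
            sum(avg_rank[v] for v in group_b))


def prob_a_greater(group_a, group_b):
    """Fraction of (a, b) pairs with a > b, counting each tie as one half."""
    wins = sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)
    return wins / (len(group_a) * len(group_b))


# Hypothetical data chosen so that both medians equal 10.
control = [1, 2, 3, 10, 11, 12, 13]
treated = [7, 8, 9, 10, 20, 21, 22]

sum_treated, sum_control = rank_sums(treated, control)
n_t, n_c = len(treated), len(control)
u_treated = sum_treated - n_t * (n_t + 1) / 2  # Mann-Whitney U for treated

print("medians:", median(control), median(treated))
print("rank sums (treated, control):", sum_treated, sum_control)
print("U / (n_t * n_c):", u_treated / (n_t * n_c))
print("P(treated > control):", prob_a_greater(treated, control))
```

The last two printed values agree, which is the sense in which the rank sums measure how often a value from one group exceeds a value from the other.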

 

May 20, 2008

Before-after graphs with different colors for different subjects.

When you enter data on a column table and choose a before-after graph, Prism plots all the symbols the same way. You can choose different colors or shapes for "before" than for "after" (which is not helpful). And you can right click on each symbol and change its color (and that of the connecting line). But this approach would be very tedious.

Prism 5 can, in fact, create before-after graphs with multiple colors for different subjects. The trick is to enter the data on a Grouped table. Follow these steps or examine this Prism file.

1. Create a Grouped table.

 

 

Choose the appropriate number of "replicates" (subjects) for your data. Be sure to choose to plot each replicate, and to connect each replicate.

2. Enter the data.

Note that the arrangement of data is different than with a column table. The before-after pairs are stacked into subcolumns.

This table has two rows, because it plots just before and after. If you have more time points, add more rows.

This table has two data sets, male and female, because we want symbols with two different appearances. Use as many data sets as you want. If you want each subject to have its own appearance, create a table with no subcolumns, and enter each subject into its own data set.

3. Polish the graph.

 


Also see this related example for creating column scatter graphs with multiple symbol colors.

How to turn off automatic snapping.

When you move a text object in Prism 5, it will automatically snap into alignment with bars on graphs, groups of bars, the center of the page, and other text objects. This is almost always a great feature, as it lets you quickly move text to an appropriate spot. But sometimes, you may find that the automatic snapping prevents you from fine-tuning a graph.

There are two ways to work around the automatic snapping.

If you just want to nudge the text object a tiny bit, select it, and then use the arrow keys on the keyboard. Each key press will move the object one pixel.

If you want to move the object with the mouse without snapping, hold the ALT key while dragging.

On the Mac, the Alt key is also called the Option key. Be sure to first start dragging and then hold down the ALT key. If you press the ALT key first, and then start dragging with Prism Mac, you will create a duplicate copy of the object.

The Standard Addition Method for determining concentrations.

Prism can easily interpolate from a linear or nonlinear standard curve. You perform the assay at a number of known concentrations, fit a line or curve, and interpolate the unknown values.

But there is a problem with interpolating from a standard curve. The results can be incorrect when the unknown samples are contaminated with other substances that alter the assay. This is known as the 'matrix effect problem'.

The Standard Addition Method is a way to bypass this problem. You don't need to perform the assay with known concentrations of substance. Instead you add various known concentrations (including zero) of known substance to a constant amount of the unknown. This ensures that all the samples have the same amount of unknown, including any substances that interfere with the assay.

Fit the data with linear regression. The value you want to know is how much of the known substance has to be added to double the signal. There is an easier, though less obvious, way to find it: Extrapolate the line down to Y=0. One of the parameters that Prism reports is the X intercept, which will be negative. Take the absolute value, and that is the concentration of the unknown substance. The confidence interval for the X intercept gives you the confidence interval for the concentration of the unknown. Simply multiply both confidence limits by -1 (which also swaps the lower and upper limits).
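The arithmetic is easy to check outside Prism. The sketch below fits an ordinary least-squares line by hand and extrapolates to Y=0. The data are made up so that they fall exactly on the line Y = 2X + 6, and the function name `fit_line` is my own. The sketch does not compute the confidence interval of the X intercept, which Prism reports for you.

```python
def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, y_intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical assay: known amounts added to a constant amount of unknown.
added = [0.0, 1.0, 2.0, 4.0, 8.0]       # concentration of added standard
signal = [6.0, 8.0, 10.0, 14.0, 22.0]   # made-up readings, exactly Y = 2X + 6

slope, y_intercept = fit_line(added, signal)
x_intercept = -y_intercept / slope      # where the extrapolated line hits Y=0
unknown_conc = abs(x_intercept)

print(f"X intercept = {x_intercept:.2f}")
print(f"Concentration of unknown = {unknown_conc:.2f}")

# Check the "double the signal" reasoning: adding an amount equal to the
# unknown concentration should give twice the signal seen with nothing added.
signal_at_unknown = slope * unknown_conc + y_intercept
print(f"Signal with nothing added = {y_intercept:.2f}, "
      f"with {unknown_conc:.2f} added = {signal_at_unknown:.2f}")
```

With these made-up numbers the X intercept is -3, so the unknown concentration is 3, and adding 3 units of standard raises the signal from 6 to 12.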

To plot the data in Prism, you'll want to extend the linear regression line to start at an X value equal to the X intercept (a choice in the Linear regression parameters dialog). You may also want to move the origin to the lower left, a choice on the first tab of the Format Axis dialog. Here is a graph created with this Prism file.