Viewing By Month : December 2008 / Main
December 18, 2008

How a Lurking Variable can Confuse Data Analysis

 "When the data don’t make sense, it’s usually because you have an erroneous preconception about how the system works."

Ernest Beutler

When you are unaware of the presence of a confounding variable, that variable is said to be lurking. This example illustrates the problem of lurking variables and the quotation above. I got the idea for this from the text by Freedman (reference below), but have extended it far beyond his example. The example uses synthetic data so seems a bit silly, but it makes an important point. 

Everyone knows what determines the area of a rectangle. But let’s pretend we don’t know. Furthermore, let’s pretend that we don’t know the height and width of each rectangle, but only know each rectangle’s perimeter and area. Our goal is to find a model that predicts the area of a rectangle from its perimeter. This graph shows that generally rectangles with a larger perimeters also have a larger area.  

 

 Clearly, it seems, two outliers get in the way of seeing a clear relationship between perimeter and area. So let’s remove those two “outliers” and fit the remaining points to possible models. The straight line model (left panel) might be adequate, but the sigmoid shaped model (right panel) fit the data better. 

 If these were real data, you might think you were on the right track. After removing two outliers, we found a clear relationship and fit some models that seem useful. Now let’s collect data from more rectangles so we can refine the model.

Now it seems that those two “outliers” were really not so unusual. Instead it seems that there might be two distinct categories of rectangles. The right side of that figure tentatively identifies the two types of rectangles with open and closed circles and fits each to a different model.  Definite progress, it seems. 

This process sort of feels like real science, and it seems as though we are moving forward. In fact, of course, it is all nonsense. Two rectangles with the same perimeter can have vastly different areas, depending on their shape. Predicting the area of a rectangle from its perimeter is simply impossible. We didn’t need better statistics or a better model. Hiring a statistical consultant wouldn’t have helped. The data only made sense once we understood the problem better, and realized that an important variable was missing. 

Statistics, 4th Edition Statistics, 4th Edition
by David Freedman
IBSN:0393929728. List price:$106.95
Buy from amazon.com for $69.99

December 16, 2008

The Pros and Cons of Using Excel for Statistical Calculations

Microsoft Excel is widely used, and is a great program for managing and wrangling data sets. Excel has some statistical capabilities, and many also use it to do some statistical calculations. Because these statistics functions are poorly documented, this chapter clarifies their use. The excellent book by Pace (2008) gives many more details (it can be purchased as a printed book, or as a pdf download).

Use of Excel for statistics is somewhat controversial, and some recommend that Excel not be used for statistics. One problem is that Excel is far from a complete statistics program. It lacks nonparametric tests, post tests following ANOVA, and many others tests. Another problem is that Excel reports statistical results without all the supporting details other programs provide.

 More seriously, Excel uses some poor algorithms for computing statistics which can lead to incorrect results (McCullough,, 2005; Knusel, 2005). Microsoft responded to these criticisms, and fixed many issues in Excel 2003. There really is no point in using earlier versions of Excel for statistical work.

Unfortunately, some errors remain in Excel 2007 for Windows and Excel 2008 for Mac. McCullough (2008) pointed out many erroneous results produced by Excel 2007 (especially its Solver) and concludes, "Microsoft has repeatedly proved itself incapable of providing reliable statistical functionality.” Yalta (2008) reached a similar conclusion, “the accuracy of various statistical functions in Excel 2007 range from unacceptably bad to acceptable but inferior.” In contrast, Pace (2008) concludes that the statistical errors produced by Excel 2007 are all trivial or obscure. He concludes that Excel 2007 is a reasonable choice for analyzing the kinds of data most academics and professionals collect.

 Given these problems, you should use another program to check important calculations, especially if your data seem unusual or include missing values.

 

References:

 

Knusel, L. 2005. On the accuracy of statistical distributions in microsoft excel 2003. Computational Statistics & Data Analysis 48, (3): 445.

McCullough, B. D., and D. A. Hellser. 2008. On the accuracy of statistical procedures in microsoft excel 2007. Computational Statistics & Data Analysis 52, (10): 4570.

McCullough, B. D., and B. Wilson. 2005. On the accuracy of statistical procedures in microsoft excel 2003. Computational Statistics & Data Analysis 49, (4): 1244.

Pace, L. A. 2008. The excel 2007 data & statistics cookbook. 2nd ed.TwoPaces LLC.

Yalta, A. T. 2008. The accuracy of statistical distributions in Microsoft® Excel 2007. Computational Statistics & Data Analysis 52, (10): 4579.

A great paper on the problem of multiple comparisons
I just discovered a paper by Donald Berry (1) that does a great job of explaining how pervasive the problem of multiple comparisons is. It goes way beyond post tests in ANOVA.

His first paragraph:

Most scientists are oblivious to the problems of multiplicities. Yet they are everywhere. In one or more of its forms, multiplicities are present in every statistical application. They may be out in the open or hidden. And even if they are out in the open, recognizing them is but the first step in a difficult process of inference. Problems of multiplicities are the most difficult that we statisticians face. They threaten the validity of every statistical conclusion.

1. Berry. The difficult and ubiquitous problems of multiplicities. Pharmaceutical Statistics, 6: 155–160 (2007).

December 8, 2008
Scientific Solutions web forum
While at the Neurosciences meeting in Washington DC, I learned about interesting free web discussion board for scientists. Check out Scientist Solutions.

The site contains forty discussion forums (by scientific discipline), where you may may post questions, answers, comments, ideas, and protocols, and where you may develop collaborative relationships.

December 5, 2008

A great statistical encylopedia

Sheskin has written a  comprehensive (1736 pages) statistical encyclopedia. It discusses every variation on every test, with detailed examples and tables. It has plenty of equations for those who want to do calculations themselves. But it is does not derive any equations or prove any theorems. It explains concepts in words (with examples), not equations. This makes it quite understandable by scientists (as well as statisticians). It is extremely well written. Although it purports to be comprehensive, no book really can be. It makes no mention of nonlinear regression, or model comparisons. Its coverage of survival curves is a bit weak (compared to the rest of the book), as is its coverage of modern (computer intensive) statistical methods. If you analyze data, you should have access to this book as a reference. No other book is so comprehensive and yet readable. Per page, it is a bargain. 

Handbook of Parametric and Nonparametric Statistical Procedures Handbook of Parametric and Nonparametric Statistical Procedures
by David J. Sheskin
IBSN:1584888148. List price:$139.95
Buy from amazon.com for $111.96