KNOWLEDGEBASE - ARTICLE #1407

How a Lurking Variable can Confuse Data Analysis

"When the data don’t make sense, it’s usually because you have an erroneous preconception about how the system works."

Ernest Beutler

When you are unaware of the presence of a confounding variable, that variable is said to be lurking. This example illustrates the problem of lurking variables and the quotation above. I got the idea for this from the text by Freedman (reference below), but have extended it far beyond his example. The example uses synthetic data, so it may seem a bit silly, but it makes an important point.

Everyone knows what determines the area of a rectangle. But let’s pretend we don’t know. Furthermore, let’s pretend that we don’t know the height and width of each rectangle, but only know each rectangle’s perimeter and area. Our goal is to find a model that predicts the area of a rectangle from its perimeter. This graph shows that, in general, rectangles with a larger perimeter also have a larger area.
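The setup above can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the side lengths (uniform between 1 and 10), the sample size, the random seed, and the helper name `pearson_r` are all illustrative choices, not the data behind the graphs in this article. It simply shows that perimeter and area tend to rise together when both sides vary at random.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Assumed synthetic rectangles: width and height drawn independently.
rng = random.Random(0)
rectangles = [(rng.uniform(1, 10), rng.uniform(1, 10)) for _ in range(200)]

# Pretend we only get to see these two derived quantities.
perimeters = [2 * (w + h) for w, h in rectangles]
areas = [w * h for w, h in rectangles]

r = pearson_r(perimeters, areas)
print(f"correlation between perimeter and area: {r:.2f}")
```

A strong positive correlation here is exactly what makes the search for a perimeter-to-area model look promising, even though width and height are the variables that actually determine the area.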

Clearly, it seems, two outliers get in the way of seeing a clear relationship between perimeter and area. So let’s remove those two “outliers” and fit the remaining points to possible models. The straight-line model (left panel) might be adequate, but the sigmoid-shaped model (right panel) fits the data better.

If these were real data, you might think you were on the right track. After removing two outliers, we found a clear relationship and fit some models that seem useful. Now let’s collect data from more rectangles so we can refine the model.

Now it seems that those two “outliers” were really not so unusual. Instead, it seems that there might be two distinct categories of rectangles. The right side of that figure tentatively identifies the two types of rectangles with open and closed circles and fits each to a different model. Definite progress, it seems.

This process sort of feels like real science, and it seems as though we are moving forward. In fact, of course, it is all nonsense. Two rectangles with the same perimeter can have vastly different areas, depending on their shape. Predicting the area of a rectangle from its perimeter is simply impossible. We didn’t need better statistics or a better model. Hiring a statistical consultant wouldn’t have helped. The data only made sense once we understood the problem better, and realized that an important variable was missing.
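The point that perimeter alone cannot determine area is easy to check numerically. The following is a minimal sketch; the function names and the perimeter of 20 are assumptions chosen for illustration. It builds several rectangles that share one perimeter and compares their areas.

```python
def rectangle_perimeter(width, height):
    """Perimeter of a rectangle."""
    return 2 * (width + height)


def rectangle_area(width, height):
    """Area of a rectangle."""
    return width * height


def rectangles_with_perimeter(perimeter, widths):
    """Return (width, height) pairs that all share the given perimeter."""
    half = perimeter / 2
    # Height is fixed once width is chosen: width + height = perimeter / 2.
    return [(w, half - w) for w in widths if 0 < w < half]


# Three shapes, from long and thin to square, all with perimeter 20.
shapes = rectangles_with_perimeter(20, [1, 2.5, 5])
areas = [rectangle_area(w, h) for w, h in shapes]

for (w, h), area in zip(shapes, areas):
    print(f"width={w}, height={h}, "
          f"perimeter={rectangle_perimeter(w, h)}, area={area}")
```

All three rectangles have the same perimeter, yet the areas differ by almost a factor of three. The missing variable is the shape (the ratio of width to height), which is why no model of area against perimeter alone could ever succeed.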
