
# Comparing linear regression to correlation

Linear regression is distinct from correlation.

## What is the goal?

Linear regression finds the best line that predicts Y from X.

Correlation quantifies the degree to which two variables are related. It does not fit a line through the data points. You simply compute a correlation coefficient (r) that tells you how much one variable tends to change when the other one does. When r is 0.0, there is no linear relationship. When r is positive, one variable tends to go up as the other goes up. When r is negative, one variable tends to go up as the other goes down.
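As a rough illustration of how the sign of r tracks the direction of the trend, here is a minimal sketch in plain Python. The function name `pearson_r` and the small data sets are made up for this example and are not taken from any particular program.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of the deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative (invented) data.
x = [1, 2, 3, 4, 5]
rising = [2.1, 3.9, 6.2, 8.1, 9.8]   # goes up as x goes up
falling = [9.8, 8.1, 6.2, 3.9, 2.1]  # goes down as x goes up
curved = [4, 1, 0, 1, 4]             # symmetric U shape around x = 3

r_up = pearson_r(x, rising)      # close to +1
r_down = pearson_r(x, falling)   # close to -1
r_curved = pearson_r(x, curved)  # 0: no linear trend
```

The U-shaped data give r = 0 even though Y clearly depends on X, which is why r is best read as a measure of linear association.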

## What kind of data?

Linear regression is usually used when X is a variable you manipulate (time, concentration, etc.).

Correlation is almost always used when you measure both variables. It is rarely appropriate when one variable is something you experimentally manipulate.

## Does it matter which variable is X and which is Y?

The decision of which variable you call "X" and which you call "Y" matters in regression, because you'll get a different best-fit line if you swap the two. The line that best predicts Y from X is not the same as the line that best predicts X from Y (although both lines have the same value of R²).
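The effect of swapping X and Y can be sketched with ordinary least squares in plain Python. The helper names `fit_line` and `r_squared` and the data are assumptions made for this illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y):
    """Fraction of the variance in y explained by the line fit from x."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Illustrative (invented) data.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.5, 4.0, 7.5, 7.0, 10.5]

slope_yx, intercept_yx = fit_line(x, y)  # line that predicts Y from X
slope_xy, intercept_xy = fit_line(y, x)  # line that predicts X from Y

# Drawn on the same axes (Y vertical), the second line has slope 1/slope_xy,
# which differs from slope_yx unless the points fall exactly on a line.
slope_xy_on_same_axes = 1 / slope_xy

r2_yx = r_squared(x, y)
r2_xy = r_squared(y, x)  # same value as r2_yx, up to rounding
```

The two fitted lines differ because each one minimizes the scatter in a different direction: vertical distances when predicting Y, horizontal distances when predicting X.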

With correlation, you don't have to think about cause and effect. It doesn't matter which of the two variables you call "X" and which you call "Y". You'll get the same correlation coefficient if you swap the two.

## Assumptions

With linear regression, the X values can be measured or can be a variable controlled by the experimenter. The X values are not assumed to be sampled from a Gaussian distribution. The distances of the points from the best-fit line are assumed to follow a Gaussian distribution, with the SD of the scatter not related to the X or Y values.

The correlation coefficient itself is simply a way to describe how two variables vary together, so it can be computed and interpreted for any two variables. Further inferences, however, require an additional assumption: that both X and Y are measured (are interval or ratio variables), and that both are sampled from Gaussian distributions. This is called a bivariate Gaussian distribution. If those assumptions are true, then you can interpret the confidence interval of r and the P value testing the null hypothesis that there really is no correlation between the two variables (and any correlation you observed is a consequence of random sampling).
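One common way to approximate a confidence interval for r under the bivariate Gaussian assumption is the Fisher z transformation. The sketch below uses only the Python standard library. The function name `r_confidence_interval` and the example values are invented for illustration, and this is not necessarily the method any given program uses.

```python
import math
from statistics import NormalDist

def r_confidence_interval(r, n, level=0.95):
    """Approximate confidence interval for a Pearson r from n pairs.

    Uses the Fisher z transformation, which assumes the pairs were
    sampled from a bivariate Gaussian distribution.
    """
    if n <= 3:
        raise ValueError("need more than 3 pairs")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    z = math.atanh(r)          # Fisher z transform of r
    se = 1 / math.sqrt(n - 3)  # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    # Transform the limits back to the r scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# Hypothetical example: r = 0.60 observed in 30 pairs.
low, high = r_confidence_interval(0.60, 30)
```

The interval is not symmetric around r because r is bounded by -1 and +1. If the interval excludes 0, the corresponding P value will be small.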

## Relationship between results

Linear regression quantifies goodness of fit with r², sometimes shown in uppercase as R². If you put the same data into correlation (which is rarely appropriate; see above), the square of r from correlation will equal r² from regression.
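The equality can be checked with a short derivation. The symbols here are introduced for this sketch only: $S_{xx}$, $S_{yy}$ and $S_{xy}$ are the sums of squares and cross-products of the deviations from the means, and $b = S_{xy}/S_{xx}$ is the least-squares slope for predicting Y from X.

$$
\begin{aligned}
SS_{\text{res}} &= S_{yy} - b\,S_{xy} = S_{yy} - \frac{S_{xy}^{2}}{S_{xx}} \\
R^{2} &= 1 - \frac{SS_{\text{res}}}{S_{yy}} = \frac{S_{xy}^{2}}{S_{xx}\,S_{yy}} = \left(\frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}\right)^{2} = r^{2}
\end{aligned}
$$

The final expression is symmetric in X and Y, which is also why regressing X on Y gives the same value as regressing Y on X.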

Correlation computes the value of the Pearson correlation coefficient, r. Its value ranges from -1 to +1.