Multicollinearity in multiple regression
What is multiple regression?
Multiple regression is a statistical analysis offered by GraphPad InStat, but not GraphPad Prism. Multiple regression fits a model to predict a dependent (Y) variable from two or more independent (X) variables:

Y = β0 + β1·X1 + β2·X2 + … + βk·Xk
If the model fits the data well, the overall R2 value will be high, and the corresponding P value will be low (the great fit is unlikely to be a coincidence). In addition to the overall P value, multiple regression also reports an individual P value for each independent variable. A low P value here means that this particular independent variable significantly improves the fit of the model. It is calculated by comparing the goodness-of-fit of the entire model to the goodness-of-fit when that independent variable is omitted. If the fit is much worse when that variable is omitted from the model, the P value will be low, telling you that the variable has a significant impact on the model.
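For example, here is a minimal sketch of such a fit in Python with statsmodels (not InStat; the data set and column names are invented purely for illustration), showing where the overall R2, the overall P value, and the per-variable P values come from:

```python
# Hypothetical example: predict blood pressure from age and weight.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "age":    [42, 51, 38, 60, 55, 47, 63, 35, 58, 44],
    "weight": [70, 82, 65, 90, 88, 75, 95, 62, 85, 72],
    "bp":     [120, 135, 115, 150, 142, 128, 155, 110, 148, 125],
})

X = sm.add_constant(df[["age", "weight"]])   # add the intercept term
fit = sm.OLS(df["bp"], X).fit()

print(fit.rsquared)    # overall R2 of the model
print(fit.f_pvalue)    # overall P value (F test of the whole model)
print(fit.pvalues)     # individual P value for each independent variable
```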
What is multicollinearity?
In some cases, multiple regression results may seem paradoxical. Even though the overall P value is very low, all of the individual P values are high. This means that the model fits the data well, even though none of the X variables has a statistically significant impact on predicting Y. How is this possible? When two X variables are highly correlated, they both convey essentially the same information. In this case, neither may contribute significantly to the model after the other one is included. But together they contribute a lot. If you removed both variables from the model, the fit would be much worse. So the overall model fits the data well, but neither X variable makes a significant contribution when it is added to your model last. When this happens, the X variables are collinear and the results show multicollinearity.
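A small simulation makes the paradox concrete. The sketch below (again statsmodels, not InStat, with made-up data) builds two nearly identical predictors; the overall P value comes out very low while the individual P values are typically high:

```python
# Two highly correlated predictors: each individual P value tends to be high,
# yet the overall P value is very low because together they explain Y well.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # x2 is almost a copy of x1
y = 3 * x1 + 3 * x2 + rng.normal(size=n)      # Y truly depends on both

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.f_pvalue)      # overall P value: very low -- the model clearly fits
print(fit.pvalues[1:])   # individual P values: typically both high
```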
To help you assess multicollinearity, InStat tells you how well each independent (X) variable is predicted from the other X variables. The results are shown both as an individual R2 value (distinct from the overall R2 of the model) and a Variance Inflation Factor (VIF). When those R2 and VIF values are high for any of the X variables, your fit is affected by multicollinearity.
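If you want to reproduce these diagnostics outside InStat, a sketch along the following lines works: regress each X variable on the remaining X variables, and compute VIF = 1 / (1 − R2). The column names here are hypothetical, and the calculation uses statsmodels' variance_inflation_factor:

```python
# For each X variable, report the R2 from predicting it with the other
# X variables, and the corresponding Variance Inflation Factor (VIF).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "height": [170, 182, 165, 190, 175, 168, 185, 160],
    "weight": [68, 85, 62, 95, 74, 66, 88, 58],
    "age":    [34, 45, 29, 52, 40, 31, 48, 27],
})

X = sm.add_constant(df)                        # include the intercept
for i, name in enumerate(df.columns, start=1):
    vif = variance_inflation_factor(X.values, i)
    r2 = 1 - 1 / vif                           # R2 of this X predicted from the others
    print(f"{name}: R2 = {r2:.3f}, VIF = {vif:.1f}")
```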
Why is multicollinearity a problem?
If your goal is simply to predict Y from a set of X variables, then multicollinearity is not a problem. The predictions will still be accurate, and the overall R2 (or adjusted R2) quantifies how well the model predicts the Y values.
If your goal is to understand how the various X variables impact Y, then multicollinearity is a big problem. One problem is that the individual P values can be misleading (a P value can be high, even though the variable is important). The second problem is that the confidence intervals on the regression coefficients will be very wide. The confidence intervals may even include zero, which means you can’t even be confident whether an increase in the X value is associated with an increase, or a decrease, in Y. Because the confidence intervals are so wide, excluding a subject (or adding a new one) can change the coefficients dramatically – and may even change their signs.
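You can see this numerically by inspecting the confidence intervals of a fit with collinear predictors (a simulated sketch, not InStat output); the slope intervals are typically very wide and often straddle zero:

```python
# With nearly redundant predictors, the 95% confidence intervals on the
# slopes are very wide and commonly include zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=40)
x2 = x1 + rng.normal(scale=0.05, size=40)     # nearly redundant predictor
y = 3 * x1 + 3 * x2 + rng.normal(size=40)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.conf_int())   # rows for x1 and x2 are typically wide, often spanning zero
```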
What can I do about multicollinearity?
The best solution is to understand the cause of multicollinearity and remove it. Multicollinearity occurs because two (or more) variables are related – they measure essentially the same thing. If one of the variables doesn’t seem logically essential to your model, removing it may reduce or eliminate multicollinearity. Or perhaps you can find a way to combine the variables. For example, if height and weight are collinear independent variables, perhaps it would make scientific sense to remove height and weight from the model, and use surface area (calculated from height and weight) instead.
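For instance, a rough sketch of the height/weight example might look like this (the Du Bois surface-area formula is used purely as an illustration; any scientifically sensible combination of the collinear variables would serve):

```python
# Replace two collinear variables (height, weight) with one combined
# variable (body surface area) before fitting the regression.
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 182, 165, 190],
    "weight_kg": [68, 85, 62, 95],
})

# Du Bois formula: BSA (m^2) = 0.007184 * weight^0.425 * height^0.725
df["bsa_m2"] = 0.007184 * df["weight_kg"] ** 0.425 * df["height_cm"] ** 0.725
df = df.drop(columns=["height_cm", "weight_kg"])   # use bsa_m2 in the model instead
print(df)
```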
You can also reduce the impact of multicollinearity without removing any variables. One approach is to increase the sample size: with more data, the confidence intervals narrow despite the multicollinearity. Even better, collect samples over a wider range of some of the X variables. If you include an interaction term (the product of two independent variables), you can also reduce multicollinearity by "centering" the variables. To do this, compute the mean of each independent variable, and then replace each value with the difference between it and the mean. For example, if the variable is weight and the mean is 72, enter "6" for a weight of 78 and "-3" for a weight of 69.
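Here is a minimal sketch of that centering step (with hypothetical variables), forming the interaction term from the centered values:

```python
# Center each independent variable by subtracting its mean, then build the
# interaction term from the centered values.
import pandas as pd

df = pd.DataFrame({
    "weight": [78, 69, 72, 75, 66],   # mean weight is 72
    "dose":   [10, 20, 15, 25, 12],
})

# e.g. a weight of 78 becomes 6, and a weight of 69 becomes -3
df["weight_c"] = df["weight"] - df["weight"].mean()
df["dose_c"] = df["dose"] - df["dose"].mean()

# Interaction term: the product of the centered variables. Centering keeps it
# from being highly correlated with weight and dose themselves.
df["weight_x_dose"] = df["weight_c"] * df["dose_c"]
print(df)
```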