Modeling refresher, consequences of correlation?

I could remember this at one point, but don't now. What are the consequences of including correlated variables in your model? Say I have two correlated variables in my model, and I want to keep them both because their indicated relativities make sense, the direction of the relativities makes sense from a business perspective, the p-values are low, the standard deviations/confidence intervals of the indicated factors are tight, and the results are consistent across training and test data.

This is for a GLM. Thanks!

Depending on the modeling software/program, a lot of unintended consequences can result.

Many programs will simply “exclude” one of them when fitting the model; that is, one will be given a fixed parameter value of 1.000 for all values/levels while the other is fitted normally.

But which one is excluded may change between fittings (often the order in which the variables are presented to the software determines this).

If neither is excluded, then you’ll get some rather “meaningless” results for the parameters, because the model-fitting process has to “assign” how much “weight” to give to the influence of each variable. For example, after one fitting, Variable_X has a parameter value of 0.805 and Variable_Y has 1.553. You make a few tweaks to some other, completely unrelated variables in the model, and after you refit, Variable_X is 3.558 and Variable_Y is 0.129.
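A minimal sketch of that instability, assuming a Poisson GLM in Python with statsmodels and two invented, nearly collinear variables (all names and numbers here are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
var_x = x                                    # first rating variable
var_y = x + rng.normal(scale=0.01, size=n)   # near-copy of var_x

# The true signal depends only on the shared component x
obs = rng.poisson(np.exp(0.5 * x))

for trial in range(2):
    # Dropping a random ~1% of rows stands in for "tweaks to other,
    # completely unrelated" parts of the model
    keep = rng.random(n) > 0.01
    X = sm.add_constant(np.column_stack([var_x[keep], var_y[keep]]))
    fit = sm.GLM(obs[keep], X, family=sm.families.Poisson()).fit()
    print(f"trial {trial}: beta_x={fit.params[1]:+.3f}, beta_y={fit.params[2]:+.3f}")

# The two betas can swing wildly between trials; only their sum
# (the effect of the shared component, ~0.5) is pinned down by the data.
```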

As a tangent, this latter result is also similar to what happens if you select a level/value for the base that has sparse data.

The software output should tell you this is happening, though, I think?

Yes; but it’s not always “front-and-center” to a new user.

But it should be obvious if you’re looking at outputs and parameters/factors.

If you can, you want to transform the variables so that they are no longer strongly correlated.

If you cannot, then in practice you almost want to treat them as a single variable. For example, p-values should be calculated by removing both of them together, as in the sketch below.
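As a hedged illustration of that “remove both together” idea, here is a likelihood-ratio comparison of the full model against one with both variables dropped at once, reusing the toy var_x, var_y, and obs arrays from the earlier sketch:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

X_full = sm.add_constant(np.column_stack([var_x, var_y]))
X_reduced = np.ones((len(obs), 1))   # intercept only: both variables removed

fit_full = sm.GLM(obs, X_full, family=sm.families.Poisson()).fit()
fit_reduced = sm.GLM(obs, X_reduced, family=sm.families.Poisson()).fit()

# LR statistic is twice the log-likelihood gain; 2 parameters were dropped
lr = 2 * (fit_full.llf - fit_reduced.llf)
print(f"joint p-value for the pair: {stats.chi2.sf(lr, df=2):.3g}")
```

The pair can be jointly highly significant even when each individual p-value looks terrible, which is exactly the trap with correlated variables.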

You also need to keep in mind that the model cannot necessarily separate out the effects of one from the other.

Suppose you are modeling math skills for schoolchildren and both “grade” and “age” are included in the fit. You should not expect the model to tell you whether age or grade is really more important for math performance, because they will almost always appear together. There may be a few data points where this is not true, but you don’t want to rely on those for inference. In that case, you would probably want to include only one or the other in the fit.

In this case, you’d probably want to calculate “difference from average age of grade.” That would create a variable like “age” that is not correlated with grade, or at least not much.
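A minimal sketch of that transformation, assuming a pandas DataFrame with hypothetical grade and age columns:

```python
import pandas as pd

df = pd.DataFrame({
    "grade": [3, 3, 3, 4, 4, 4],
    "age":   [8.2, 8.9, 9.4, 9.1, 9.8, 10.5],
})

# Each child's age relative to the average age of their grade;
# this is (roughly) uncorrelated with grade itself
df["age_vs_grade_avg"] = df["age"] - df.groupby("grade")["age"].transform("mean")
print(df)
```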

Moderate correlation is fine; in fact, moderate correlation is the reason for fitting a GLM in the first place. If all the variables were perfectly independent, you could get away with univariate analyses.

Very high correlation causes the fitted parameters to be unstable and the estimated errors to inflate. Small numerical or data perturbations in the fitting procedure can then produce very different parameter estimates on essentially the same data. If you are not encountering this problem, you are fine.

Near or perfect aliasing will cause convergence issues. If you are not encountering this problem either, then you are fine.
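One common way to check where you sit on that spectrum is the variance inflation factor (VIF) of each predictor. A sketch using statsmodels’ VIF helper on the toy design matrix from the earlier snippets:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(np.column_stack([var_x, var_y]))
for j in range(1, X.shape[1]):   # skip the intercept column
    print(f"VIF for column {j}: {variance_inflation_factor(X, j):,.0f}")

# Rules of thumb vary, but single-digit VIFs are usually fine, very large
# VIFs signal unstable parameters, and an effectively infinite VIF signals
# (near-)aliasing and the convergence problems described above.
```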

Certainly less correlation from about grade 4 or 5 onward (or whenever social promotion to the next grade isn’t considered).
