January 12, 2012

# Multicollinearity and Ridge Regression

In marketing mix modelling you have to be very lucky not to run into problems with multicollinearity. It’s in the nature of marketing campaigns that everything tends to happen at once: the TV is supported by radio, and both are timed to coincide with the relaunch of the website. One of the techniques often touted as a solution is ridge regression. However, there is quite a bit of disagreement over whether it works. So I thought we’d just try it out with the simulated sales data I created in the last post.

In fact we’ll need to modify that data a little, as we need a case of serious multicollinearity. I’ve adjusted the TV campaigns so that they always occur in the same winter months (not uncommon in marketing mix data) and I’ve added radio campaigns alongside the TV campaigns. Here is the modified code.

```
#TV now coincides with winter. Carry over is dec, theta is dim, beta is ad_p,
tv_grps<-rep(0,5*52)
tv_grps[40:45]<-c(390,250,100,80,120,60)
tv_grps[92:97]<-c(390,250,100,80,120,60)
tv_grps[144:149]<-c(390,250,100,80,120,60)
tv_grps[196:201]<-c(390,250,100,80,120,60)
tv_grps[248:253]<-c(390,250,100,80,120,60)
```

The sales data now looks like this:

The correlation matrix of the explanatory variables shows that we have serious multicollinearity issues even when only two variables are taken at a time.

```
> cor(test[,c(2,4:6)])
temp         1.0000000 -0.41545174 -0.15593463 -0.47491671
week        -0.1559346  0.09096521  1.00000000  0.08048096
```
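Pairwise correlations only show part of the picture, since a variable can be well explained by a combination of the others. A common complementary check, which is not part of the original analysis, is the variance inflation factor (VIF). The sketch below uses made-up toy data (the variables `tv`, `radio` and `temp` here are hypothetical stand-ins, not the simulated sales data) and relies on the fact that the VIFs are the diagonal of the inverse correlation matrix.

```r
# Toy data, assumed for illustration only: radio closely tracks tv,
# temp is only loosely related to tv.
set.seed(1)
n <- 260                                  # five years of weekly data
tv    <- rnorm(n)
radio <- tv + rnorm(n, sd = 0.3)
temp  <- -0.5 * tv + rnorm(n)

X <- cbind(tv = tv, radio = radio, temp = temp)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal element
# of the inverse of the correlation matrix of the regressors.
vif <- diag(solve(cor(X)))
print(round(vif, 2))
```

A VIF near 1 means a variable is nearly independent of the rest; large values flag coefficients whose variance will be heavily inflated.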

What is this going to mean for the chances of recovering the parameters in our simulated data set? Well we know that even with heavy multicollinearity our estimates using linear regression are going to be unbiased; the problem is going to be their high variance.
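To make the variance problem concrete, here is a standard result for least squares with an intercept, stated as background. The notation is mine rather than from the simulation code: $\sigma^2$ is the error variance, $x_{ij}$ is the value of regressor $j$ in week $i$, and $R_j^2$ is the R-squared from regressing regressor $j$ on all the other regressors.

$$
\operatorname{E}[\hat\beta_j] = \beta_j,
\qquad
\operatorname{Var}(\hat\beta_j) = \frac{\sigma^2}{(1 - R_j^2)\,\sum_{i}(x_{ij} - \bar{x}_j)^2}
$$

As $R_j^2$ approaches 1 (one regressor is almost a linear combination of the others) the denominator shrinks and the variance grows without bound, even though the estimate stays unbiased.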

We can show this quite nicely by generating lots of examples of our sales data (always with the same parameters but allowing a different random draw each time from the normally distributed error term) and plotting the distribution of the estimates arrived at using linear regression. (See Monte Carlo Simulation for more details about this kind of technique.)


You can see that on average the estimates for tv and radio are close to correct but the distributions are wide. So for any one instance of the data (which in real life is all we have) chances are that our estimate is quite wide of the mark. The data and plots are created using the following code:

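```
library(reshape)   # melt(); assumed to be reshape (not reshape2), since the plot facets on X2
library(ggplot2)
coefs<-NA
for (i in 1:10000){
sim<-create_test_sets(base_p=1000,
trend_p=0.8,
season_p=4,
dim=100,
dec=0.3,
error_std=5)
# lm_std is the linear model fitted to sim; the fitting call is not shown in this listing
coefs<-rbind(coefs,coef(lm_std))
}
# drop the NA placeholder row, then express each estimate relative to its column mean
col_means<-colMeans(coefs[-1,])
for_div<-matrix(rep(col_means,10000), nrow=10000, byrow=TRUE)
mean_div<-coefs[-1,]/for_div
m_coefs<-melt(mean_div)
ggplot(data=m_coefs, aes(x=value))+geom_density()+facet_wrap(~X2, scales="free_y") + scale_x_continuous('Scaled as % of Mean')
```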

What does ridge regression do to fix this? Ridge regression is best explained using a concept more familiar in machine learning and data mining: the bias-variance trade off. The idea is that you will often achieve better predictions (or estimates) if you are prepared to swap a bit of unbiasedness for much less variance. In other words the average of your predictions will no longer converge on the right answer but any one prediction is likely to be much closer.
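The trade off can be written down with the usual decomposition of mean squared error. This is a textbook identity rather than anything specific to the simulation, with $\hat\theta$ standing for any estimator of a true parameter $\theta$:

$$
\operatorname{MSE}(\hat\theta)
= \operatorname{E}\big[(\hat\theta - \theta)^2\big]
= \big(\operatorname{E}[\hat\theta] - \theta\big)^2 + \operatorname{Var}(\hat\theta)
$$

An unbiased estimator sets the first term to zero but may leave the second term large. A biased estimator can come out ahead overall if the bias it introduces, once squared, is smaller than the variance it removes.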

In ridge regression we have a parameter lambda that controls the bias-variance trade off. As lambda increases our estimates get more biased but their variance decreases. Cross-validation (another machine learning technique) is used to estimate the best possible setting of lambda.
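A minimal sketch of how lambda acts, using toy data rather than the simulated sales data. Everything here is an assumption made for illustration: the two nearly identical regressors, the true coefficients `c(2, 1)`, the helper `ridge_fit()` and the three lambda values. It uses the textbook closed form for the ridge estimate on centred and scaled data, not the MASS routine used later in the post.

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)      # x2 is almost a copy of x1
X  <- scale(cbind(x1, x2))         # centre and scale the regressors
beta <- c(2, 1)                    # true coefficients on the scaled regressors

# Closed-form ridge estimate: solve (X'X + lambda * I) b = X'y.
# lambda = 0 gives ordinary least squares.
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

lambdas <- c(0, 1, 10)
n_sims  <- 500

# For each simulated data set keep the estimate of the x1 coefficient
# under every lambda (same error draw for all three).
est_x1 <- t(replicate(n_sims, {
  y <- drop(X %*% beta) + rnorm(n, sd = 5)
  y <- y - mean(y)                 # centre y so no intercept is needed
  sapply(lambdas, function(l) ridge_fit(X, y, l)[1])
}))
colnames(est_x1) <- paste0("lambda_", lambdas)

# Mean and spread of the estimates across simulations, by lambda
summary_tab <- rbind(mean = colMeans(est_x1),
                     sd   = apply(est_x1, 2, sd))
print(round(summary_tab, 2))
```

The `sd` row shows the spread of the estimates across simulations for each lambda, and the `mean` row shows how far their average has moved from the true value of 2.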

So let’s see if ridge regression can help us with the multicollinearity in our marketing mix data. What we hope to see is a decent reduction in variance but not at too high a price in bias. The code below simulates the distribution of the ridge regression estimates of the parameters for increasing values of lambda.

```
library(MASS)
library(reshape)   # melt()
library(ggplot2)
for (i in 1:1000){
sim<-create_test_sets(base_p=1000,
trend_p=0.8,
season_p=4,
dim=100,
dec=0.3,
error_std=5)
# lm_rg is the ridge fit on sim over a range of lambda values (presumably from
# MASS's lm.ridge); the fitting call is not shown in this listing
if (i==1){coefs_rg<-coef(lm_rg)}
else {coefs_rg<-rbind(coefs_rg,coef(lm_rg))}
}
# the first column of the coefficient matrix is unnamed, so label it
colnames(coefs_rg)[1]<-"intercept"
m_coefs_rg<-melt(coefs_rg)
names(m_coefs_rg)<-c("lambda", "variable", "value")
ggplot(data=m_coefs_rg, aes(x=value, y=lambda))+geom_density2d()+facet_wrap(~variable, scales="free")
```

The results are not encouraging. Variance decreases slightly for tv and radio; however, the cost in bias is far too high.

I’m aware that this by no means proves that ridge regression is never a solution for marketing mix data, but it does at least show that it is not always the solution, and I’m inclined to think that if it doesn’t work in a simple situation like this then it doesn’t work very often.

However I might try varying the parameters for the simulated data set to see if there are some settings where it looks more promising.

Still, for now, I won’t be recommending it as a solution to multicollinearity in marketing mix models.

A good explanation of ridge regression can be found in this post

Simon Raper I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia, an analytics startup that uses agile methods to bring machine learning and other cutting edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, Spark, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts - a blog on computational statistics, machine learning, data visualisation, R, Python and cloud computing. It has had over 310 K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

### Comment (1)

1. Absolutely fantastic Simon, I am glad I found your blog on MMX. However, could it be possible that you can explain the MMX in more simpler way (considering that there are lots of people who are at beginners stage, including me). By the way,
1) What programming (Software) is this?
2) If there is multicollinearity with two variables is it always good to remove one variable? or is it ok to continue with both ? If the later is fine then how to justify the inclusion of both variables?
3) How to know which regression to use, as there are few variants of regression available and what are the justification to use a particular type of regression?
4) how convenient is to use SAS for this kind of exercise?

thanks so much for you blog and to the answers in advance!!