## Visualising The Correlation Matrix

Following on from the previous post here is an R function for visualising correlations between the explanatory variables in your data set.

```# *--------------------------------------------------------------------
# | FUNCTION: visCorrel
# | Creates an MDS plot where the distance between variables represents
# | correlation between the variables (closer=more correlated)
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |05/01/2012|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  dataset        A dataframe containing only the explanatory
# |                         variables. It should not contain missing
# |                         values or variables where there is no
# |                         variation
# |          abr            The number of characters used when
# |                         abbreviating the variables. If set to zero
# |                         there is no abbreviation
# |          method         The options are 'metric' or 'ordinal'. The
# |                         default is metric
# *--------------------------------------------------------------------
# | OUTPUTS: graph          An MDS plot where the similarity measure is
# |                         correlation between the variables.
# |
# *--------------------------------------------------------------------
# | USAGE:         vis_correl(dataset,
# |                           abr)
# |
# *--------------------------------------------------------------------
# | DEPENDS: ggplot2, directlabels, MASS
# |
# *--------------------------------------------------------------------
# |          http://en.wikipedia.org/wiki/Multidimensional_scaling
# |
# *--------------------------------------------------------------------
visCorrel<-function(dataset, abr, method="metric"){
#Create correlation matrix
cor_ts<-cor(dataset)
n<-dim(cor_ts)[2]
# Create dissimilarities
ones<-matrix(rep(1,n^2), nrow=n)
dis_ts<-ones-abs(cor_ts)
# Do MDS
if ( method=="ordinal"){
fit <- isoMDS(dis_ts, k=2)\$points
} else {
cmd.res <- cmdscale(dis_ts, k=2, eig=TRUE)
eig<-cmd.res\$eig
fit<-cmd.res\$points
prop<-sum(abs(eig[1:2]))/sum(abs(eig))
print(paste("Proportion of squared distances represented:", round(prop*100)))
if(prop<0.5){print("Less than 50% of squared distance is represented. Consider using ordinal scaling instead")}
}
x <- fit[,1]
y <- fit[,2]
labels<-row.names(cor_ts)
if (abr>0){labels<-substr(labels,1,abr)}
mds_plot<-data.frame(labels, x, y)
#Plot the results
ggplot(mds_plot, aes(x=x, y=y, colour=labels, main="MDS Plot of Correlations", label=labels))+geom_text() + coord_fixed()+ ggtitle("MDS Plot of Correlations") + theme(legend.position="none")
}
# *--------------------------------------------------------------------
# * Examples
# *--------------------------------------------------------------------
# library(MASS)
# library(ggplot2)
# library(plm)
# data(midwest)
# data(Crime)
# visCorrel(midwest[,4:27],10, method="classical")
# visCorrel(midwest[,4:27],10, method="ordinal")
# visCorrel(Crime[,-c(1,2,11,12)],10)
```

An interesting example is the North Carolina Crime data set that comes with the plm package. This has the following continuous variables:

 crmrte crimes committed per person prbarr probability of arrest prbarr probability of arrest prbconv probability of conviction prbpris probability of prison sentence avgsen average sentence, days polpc police per capita density people per square mile taxpc tax revenue per capita pctmin percentage minority in 1980 wcon weekly wage in construction wtuc weekly wage in trns, util, commun wtrd weekly wage in whole sales and retail trade wfir weekly wage in finance, insurance and real estate wser weekly wage in service industry wmfg weekly wage in manufacturing wfed weekly wage of federal emplyees wsta weekly wage of state employees wloc weekly wage of local governments employees mix offence mix: face-to-face/other pctymle percentage of young males

Which is then visualised (using the ordinal option – see below) as this:

The closer the variables are on the plot the more highly correlated (positively or negatively) they are. Here we can see some interesting relationships. Unsurprisingly the wage variables form a correlated group. Towards the top of the chart the method correctly identifies three variables, probability of prison sentence, police per capita and offence mix that are all correlated with one another.

The plot is useful in dealing with multicollinearity as it allows us to spot clusters of correlated variables. For example a set of economic variables might be highly correlated with one another but have a low level of correlation with TV advertising levels. Why is this good to know? Because multicollinearity only affects those variables between which the relationship of collinearity holds. So if we are only interested in obtaining an accurate estimate for the TV advertising level then we need not be concerned about collinearity among the economic variables. Of course this plot deals just with correlation between two variables rather than with full blown collinearity but it’s a good place to start.

The plot works by performing a multi-dimensional scaling on the absolute value of the correlations. In other words if we think of correlation as measure of distance (highly correlated variables being closer together) it finds the best possible two dimensional representation of those distances. Note if these distances could be represented in a euclidean space then this would be equivalent to a plot of the first two dimensions in a principal components analysis. However correlations cannot be represented in this way.

Beware though, the best representation isn’t always a good representation. I’ve included some output which will tell you what proportion on the squared distances is represented in the plot. If it is low I would recommend trying ordinal scaling (the other method option) instead.

I’ve a feeling that applying multi-dimensional scaling to correlations is equivalent to something else that is probably far simpler but I haven’t had time to give it attention. If anyone knows I’d be very interested.

## Multicollinearity and Ridge Regression

In marketing mix modelling you have to be very lucky not to run into problems with multicollinearity. It’s in the nature of marketing campaigns that everything tends to happen at once: the TV is supported by radio, both are timed to coincide with the relaunch of the website. One of the techniques that is often touted as a solution is ridge regression. However there is quite a bit of disagreement over whether this works. So I thought we’d just try it out with the simulated sales data I created in the last post.

In fact we’ll need to modify that data a little as we need a case of serious multicollinearity. I’ve adjusted the Tv campaigns to ensure that they always occur in the same winter months (not uncommon in marketing mix data) and I’ve added radio campaigns alongside the TV campaigns. Here is the modified code.

```  #TV now coincides with winter. Carry over is dec, theta is dim, beta is ad_p,
tv_grps&lt;-rep(0,5*52)
tv_grps[40:45]&lt;-c(390,250,100,80,120,60)
tv_grps[92:97]&lt;-c(390,250,100,80,120,60)
tv_grps[144:149]&lt;-c(390,250,100,80,120,60)
tv_grps[196:201]&lt;-c(390,250,100,80,120,60)
tv_grps[248:253]&lt;-c(390,250,100,80,120,60)
```

The sales data now looks like this:

The correlation matrix of the explanatory variables shows that we have serious multicollinearity issues even when only two variables are taken at a time.

``` &gt; cor(test[,c(2,4:6)])
temp         1.0000000 -0.41545174 -0.15593463 -0.47491671
week        -0.1559346  0.09096521  1.00000000  0.08048096
```

What is this going to mean for the chances of recovering the parameters in our simulated data set? Well we know that even with heavy multicollinearity our estimates using linear regression are going to be unbiased; the problem is going to be their high variance.

We can show this quite nicely by generating lots of examples of our sales data (always with the same parameters but allowing a different random draw each from the normally distributed error term) and plotting the distribution of the estimates arrived at using linear regression. (See Monte Carlo Simulation for more details about this kind of technique.)

}

You can see that on average the estimates for tv and radio are close to correct but the distributions are wide. So for any one instance of the data (which in real life is all we have) chances are that our estimate is quite wide of the mark. The data and plots are created using the following code:

```coefs&lt;-NA
for (i in 1:10000){
sim&lt;-create_test_sets(base_p=1000,
trend_p=0.8,
season_p=4,
dim=100,
dec=0.3,
error_std=5)
coefs&lt;-rbind(coefs,coef(lm_std))
}
col_means&lt;-colMeans(coefs[-1,])
for_div&lt;-matrix(rep(col_means,10000), nrow=10000, byrow=TRUE)
mean_div&lt;-coefs[-1,]/for_div
m_coefs&lt;-melt(mean_div)
ggplot(data=m_coefs, aes(x=value))+geom_density()+facet_wrap(~X2, scales=&quot;free_y&quot;) + scale_x_continuous('Scaled as % of Mean')
```

What does ridge regression do to fix this? Ridge regression is best explained using a concept more familiar in machine learning and data mining: the bias-variance trade off. The idea is that you will often achieve better predictions (or estimates) if you are prepared to swap a bit of unbiasedness for much less variance. In other words the average of your predictions will no longer converge on the right answer but any one prediction is likely to be much closer.

In ridge regression we have a parameter lambda that controls the bias-variance trade off. As lambda increases our estimates get more biased but their variance increases. Cross-validation (another machine learning technique) is used to estimate the best possible setting of lambda.

So let’s see if ridge regression can help us with the multicolinearity in our marketing mix data. What we hope to see is a decent reduction in variance but not at too high a price in bias. The code below simulates the distribution of the ridge regression estimates of the parameters for increasing values of lambda.

```library(MASS)
for (i in 1:1000){
sim&lt;-create_test_sets(base_p=1000,
trend_p=0.8,
season_p=4,
dim=100,
dec=0.3,
error_std=5)
if (i==1){coefs_rg&lt;-coef(lm_rg)}
else {coefs_rg&lt;-rbind(coefs_rg,coef(lm_rg))}
}
colnames(coefs_rg)[1]&lt;-&quot;intercept&quot;
m_coefs_rg&lt;-melt(coefs_rg)
names(m_coefs_rg)&lt;-c(&quot;lambda&quot;, &quot;variable&quot;, &quot;value&quot;)
ggplot(data=m_coefs_rg, aes(x=value, y=lambda))+geom_density2d()+facet_wrap(~variable, scales=&quot;free&quot;)
```

The results are not encouraging. Variance decreases slightly for tv and radio however the cost in bias is far too high.

I’m aware that this by no means proves that ridge regression is never a solution for marketing mix data but it does at least show that it is not always the solution and I’m inclined to think that if it doesn’t work in a simple situation like this then it doesn’t work very often.

However I might try varying the parameters for the simulated data set to see if there are some settings where it looks more promising.

Still, for now, I won’t be recommending it as a solution to multicollinearity in marketing mix models.

A good explanation of ridge regression can be found in this post