Visualising The Correlation Matrix

Following on from the previous post, here is an R function for visualising correlations between the explanatory variables in your data set.

```r
# *--------------------------------------------------------------------
# | FUNCTION: visCorrel
# | Creates an MDS plot where the distance between variables represents
# | correlation between the variables (closer=more correlated)
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |05/01/2012|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  dataset        A dataframe containing only the explanatory
# |                         variables. It should not contain missing
# |                         values or variables where there is no
# |                         variation
# |          abr            The number of characters used when
# |                         abbreviating the variables. If set to zero
# |                         there is no abbreviation
# |          method         The options are 'metric' or 'ordinal'. The
# |                         default is metric
# *--------------------------------------------------------------------
# | OUTPUTS: graph          An MDS plot where the similarity measure is
# |                         correlation between the variables.
# |
# *--------------------------------------------------------------------
# | USAGE:         visCorrel(dataset,
# |                          abr,
# |                          method)
# |
# *--------------------------------------------------------------------
# | DEPENDS: ggplot2, directlabels, MASS
# |
# *--------------------------------------------------------------------
# |          http://en.wikipedia.org/wiki/Multidimensional_scaling
# |
# *--------------------------------------------------------------------
visCorrel<-function(dataset, abr, method="metric"){
  #Create correlation matrix
  cor_ts<-cor(dataset)
  n<-dim(cor_ts)[2]
  #Create dissimilarities
  ones<-matrix(rep(1,n^2), nrow=n)
  dis_ts<-ones-abs(cor_ts)
  #Do MDS
  if (method=="ordinal"){
    fit <- isoMDS(dis_ts, k=2)$points
  } else {
    cmd.res <- cmdscale(dis_ts, k=2, eig=TRUE)
    eig<-cmd.res$eig
    fit<-cmd.res$points
    prop<-sum(abs(eig[1:2]))/sum(abs(eig))
    print(paste("Proportion of squared distances represented:", round(prop*100)))
    if (prop<0.5){print("Less than 50% of squared distance is represented. Consider using ordinal scaling instead")}
  }
  x <- fit[,1]
  y <- fit[,2]
  labels<-row.names(cor_ts)
  if (abr>0){labels<-substr(labels,1,abr)}
  mds_plot<-data.frame(labels, x, y)
  #Plot the results
  ggplot(mds_plot, aes(x=x, y=y, colour=labels, label=labels)) +
    geom_text() + coord_fixed() +
    ggtitle("MDS Plot of Correlations") +
    theme(legend.position="none")
}
# *--------------------------------------------------------------------
# * Examples
# *--------------------------------------------------------------------
# library(MASS)
# library(ggplot2)
# library(plm)
# data(midwest)
# data(Crime)
# visCorrel(midwest[,4:27], 10, method="metric")
# visCorrel(midwest[,4:27], 10, method="ordinal")
# visCorrel(Crime[,-c(1,2,11,12)], 10)
```

An interesting example is the North Carolina Crime data set that comes with the plm package. This has the following continuous variables:

- crmrte: crimes committed per person
- prbarr: probability of arrest
- prbconv: probability of conviction
- prbpris: probability of prison sentence
- avgsen: average sentence, days
- polpc: police per capita
- density: people per square mile
- taxpc: tax revenue per capita
- pctmin: percentage minority in 1980
- wcon: weekly wage in construction
- wtuc: weekly wage in trns, util, commun
- wtrd: weekly wage in wholesale and retail trade
- wfir: weekly wage in finance, insurance and real estate
- wser: weekly wage in service industry
- wmfg: weekly wage in manufacturing
- wfed: weekly wage of federal employees
- wsta: weekly wage of state employees
- wloc: weekly wage of local government employees
- mix: offence mix: face-to-face/other
- pctymle: percentage of young males

Which is then visualised (using the ordinal option – see below) as this:

The closer the variables are on the plot, the more highly correlated (positively or negatively) they are. Here we can see some interesting relationships. Unsurprisingly the wage variables form a correlated group. Towards the top of the chart the method correctly identifies three variables (probability of prison sentence, police per capita and offence mix) that are all correlated with one another.

The plot is useful in dealing with multicollinearity as it allows us to spot clusters of correlated variables. For example, a set of economic variables might be highly correlated with one another but have a low level of correlation with TV advertising levels. Why is this good to know? Because multicollinearity only affects those variables between which the relationship of collinearity holds. So if we are only interested in obtaining an accurate estimate for the TV advertising level then we need not be concerned about collinearity among the economic variables. Of course this plot deals only with correlation between pairs of variables rather than with full-blown collinearity, but it's a good place to start.
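The point that collinearity is local to a group of variables can also be checked directly with variance inflation factors. A minimal base-R sketch (the helper name `vif_manual` is mine, not part of the function above; the mtcars columns are just an illustration):

```r
#Hedged sketch: variance inflation factors in base R.
#VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing variable j
#on all of the other explanatory variables.
vif_manual <- function(df){
  sapply(names(df), function(v){
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1/(1 - r2)
  })
}

vif_manual(mtcars[, c("disp", "hp", "wt", "qsec")])
```

A variable with a VIF close to 1 is largely untouched by collinearity among the others, which is exactly the situation described for TV advertising above.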

The plot works by performing multi-dimensional scaling on the absolute values of the correlations. In other words, if we think of correlation as a measure of distance (highly correlated variables being closer together), it finds the best possible two-dimensional representation of those distances. Note that if these distances could be represented in a Euclidean space then this would be equivalent to a plot of the first two dimensions of a principal components analysis. However correlations cannot in general be represented in this way.

Beware though: the best representation isn't always a good representation. I've included some output which tells you what proportion of the squared distances is represented in the plot. If it is low I would recommend trying ordinal scaling (the other method option) instead.
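The core of the procedure boils down to a few lines. A minimal sketch using the built-in mtcars data and base R only (no ggplot2), following the same steps as visCorrel:

```r
#Dissimilarity: 0 means perfectly correlated (positively or negatively),
#1 means uncorrelated
d <- 1 - abs(cor(mtcars))

#Classical (metric) MDS into two dimensions
fit <- cmdscale(d, k = 2, eig = TRUE)

#Proportion of squared distances represented by the first two dimensions
prop <- sum(abs(fit$eig[1:2])) / sum(abs(fit$eig))

#Plot variable names at their fitted coordinates
plot(fit$points, type = "n", xlab = "", ylab = "")
text(fit$points, labels = rownames(d))
```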

I’ve a feeling that applying multi-dimensional scaling to correlations is equivalent to something else that is probably far simpler but I haven’t had time to give it attention. If anyone knows I’d be very interested.

Multicollinearity and Ridge Regression

In marketing mix modelling you have to be very lucky not to run into problems with multicollinearity. It’s in the nature of marketing campaigns that everything tends to happen at once: the TV is supported by radio, both are timed to coincide with the relaunch of the website. One of the techniques that is often touted as a solution is ridge regression. However there is quite a bit of disagreement over whether this works. So I thought we’d just try it out with the simulated sales data I created in the last post.

In fact we’ll need to modify that data a little as we need a case of serious multicollinearity. I’ve adjusted the Tv campaigns to ensure that they always occur in the same winter months (not uncommon in marketing mix data) and I’ve added radio campaigns alongside the TV campaigns. Here is the modified code.

```r
#TV now coincides with winter. Carry over is dec, theta is dim, beta is ad_p
tv_grps<-rep(0,5*52)
tv_grps[40:45]<-c(390,250,100,80,120,60)
tv_grps[92:97]<-c(390,250,100,80,120,60)
tv_grps[144:149]<-c(390,250,100,80,120,60)
tv_grps[196:201]<-c(390,250,100,80,120,60)
tv_grps[248:253]<-c(390,250,100,80,120,60)
```

The sales data now looks like this:

The correlation matrix of the explanatory variables shows that we have serious multicollinearity issues even when only two variables are taken at a time.

```r
> cor(test[,c(2,4:6)])
temp         1.0000000 -0.41545174 -0.15593463 -0.47491671
week        -0.1559346  0.09096521  1.00000000  0.08048096
```
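A quick way to list the worst offending pairs in any correlation matrix is to scan its lower triangle. A sketch on mtcars (the `test` data frame is only generated later in the post, so a built-in dataset is used here):

```r
#Keep each pair once by blanking the upper triangle and the diagonal
cm <- abs(cor(mtcars))
cm[upper.tri(cm, diag = TRUE)] <- NA

#Pairs with absolute correlation above 0.8
high <- which(cm > 0.8, arr.ind = TRUE)
data.frame(var1    = rownames(cm)[high[, 1]],
           var2    = colnames(cm)[high[, 2]],
           abs_cor = cm[high])
```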

What is this going to mean for the chances of recovering the parameters in our simulated data set? Well we know that even with heavy multicollinearity our estimates using linear regression are going to be unbiased; the problem is going to be their high variance.

We can show this quite nicely by generating lots of examples of our sales data (always with the same parameters but allowing a different random draw each from the normally distributed error term) and plotting the distribution of the estimates arrived at using linear regression. (See Monte Carlo Simulation for more details about this kind of technique.)


You can see that on average the estimates for tv and radio are close to correct but the distributions are wide. So for any one instance of the data (which in real life is all we have) chances are that our estimate is quite wide of the mark. The data and plots are created using the following code:

```r
library(ggplot2)
library(reshape)   #for melt; the faceting below uses reshape's X2 column name

coefs<-NA
for (i in 1:10000){
  sim<-create_test_sets(base_p=1000,
                        trend_p=0.8,
                        season_p=4,
                        dim=100,
                        dec=0.3,
                        error_std=5)
  #The fit was missing from the original listing; a plain linear
  #regression on all of the explanatory variables is assumed here
  lm_std<-lm(sales~., data=sim)
  coefs<-rbind(coefs,coef(lm_std))
}
col_means<-colMeans(coefs[-1,])
for_div<-matrix(rep(col_means,10000), nrow=10000, byrow=TRUE)
mean_div<-coefs[-1,]/for_div
m_coefs<-melt(mean_div)
ggplot(data=m_coefs, aes(x=value))+geom_density()+facet_wrap(~X2, scales="free_y") + scale_x_continuous('Scaled as % of Mean')
```

What does ridge regression do to fix this? Ridge regression is best explained using a concept more familiar in machine learning and data mining: the bias-variance trade off. The idea is that you will often achieve better predictions (or estimates) if you are prepared to swap a bit of unbiasedness for much less variance. In other words the average of your predictions will no longer converge on the right answer but any one prediction is likely to be much closer.

In ridge regression we have a parameter lambda that controls the bias-variance trade off. As lambda increases our estimates become more biased but their variance decreases. Cross-validation (another machine learning technique) is used to find the best setting of lambda.
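As a sketch of what that selection looks like with lm.ridge from MASS, here generalised cross-validation (GCV, a fast approximation to leave-one-out cross-validation) picks lambda on some toy collinear data; the variable names are made up for illustration:

```r
library(MASS)

set.seed(1)
#Two heavily collinear predictors
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
y  <- 2*x1 + 2*x2 + rnorm(100)

#Fit the ridge regression over a grid of lambda values
fit <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 20, 0.1))

#Pick the lambda that minimises generalised cross-validation
lambda_best <- fit$lambda[which.min(fit$GCV)]
```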

So let’s see if ridge regression can help us with the multicolinearity in our marketing mix data. What we hope to see is a decent reduction in variance but not at too high a price in bias. The code below simulates the distribution of the ridge regression estimates of the parameters for increasing values of lambda.

```r
library(MASS)
library(reshape)
library(ggplot2)

for (i in 1:1000){
  sim<-create_test_sets(base_p=1000,
                        trend_p=0.8,
                        season_p=4,
                        dim=100,
                        dec=0.3,
                        error_std=5)
  #The ridge fit was missing from the original listing; lm.ridge over a
  #grid of lambda values is assumed here
  lm_rg<-lm.ridge(sales~., data=sim, lambda=seq(0, 200, 10))
  if (i==1){coefs_rg<-coef(lm_rg)}
  else {coefs_rg<-rbind(coefs_rg,coef(lm_rg))}
}
colnames(coefs_rg)[1]<-"intercept"
m_coefs_rg<-melt(coefs_rg)
names(m_coefs_rg)<-c("lambda", "variable", "value")
ggplot(data=m_coefs_rg, aes(x=value, y=lambda))+geom_density2d()+facet_wrap(~variable, scales="free")
```

The results are not encouraging. Variance decreases slightly for tv and radio; however, the cost in bias is far too high.

I’m aware that this by no means proves that ridge regression is never a solution for marketing mix data, but it does at least show that it is not always the solution, and I’m inclined to think that if it doesn’t work in a simple situation like this then it doesn’t work very often.

However I might try varying the parameters for the simulated data set to see if there are some settings where it looks more promising.

Still, for now, I won’t be recommending it as a solution to multicollinearity in marketing mix models.

A good explanation of ridge regression can be found in this post.

Generating Artificial Sales Data

Our statistics lecturers would often end each session with a demonstration of the power of the statistical model under discussion. This would usually mean generating some artificial data and showing how good the tool was at recovering the parameters or correctly classifying the observations. It was highly artificial but had a very useful feature: you knew the true mechanism behind the data so you could see how good your model was at getting at the truth.

We work with marketing data, building models to understand the effect of marketing activity on sales. Of course here, as in any real world situation, we don’t know which mechanism generated the data (that’s what we are trying to find out). But we can get an idea of how good our tools are by testing them out on artificial data in the way we described above. If they don’t work here in these highly idealised situations then we ought to be concerned.

In this series I’m going to take some very simple simulated data sets and look at how well some of the best known marketing mix modelling techniques do at getting back to the true values. I will start by looking at LSDV (Least Squares Dummy Variables) models and then move on to mixed effects and Bayesian modelling.

There’s one other thing worth mentioning before we get started. With our simulated data sets we are able to turn the usual situation on its head and vary the data set rather than the modelling approach. This means we can ask questions like: under what conditions do our models work best?

Building an artificial data set

Our world will be very simple. Weekly sales will follow an overall linear trend to which we will add an annual seasonal cycle which we imagine to be a function of temperature (simulated using a sine wave). On top of that we need some marketing activity which we will add as TV adstock. Finally we will add some noise by simulating from a normal distribution. The final data generating equation looks like this:

$sales_t = \alpha + \theta_1 week_t + \theta_2 temp_t + \theta_3 adstock_t + \epsilon_t$

where $\epsilon_t \sim N(0, \sigma^2)$

and adstock is defined recursively as

$adstock_t = 1 - e^{-\frac{GRPs_t}{\phi}} + \lambda \, adstock_{t-1}$
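As a standalone sketch of this recursion (the helper name `adstock_transform` is mine; `dim` plays the role of $\phi$ and `dec` the role of $\lambda$, matching the parameter names used later):

```r
#Sketch of the adstock recursion: diminishing returns come from the
#exponential term, carry-over from the dec * previous-value term
adstock_transform <- function(grps, dim, dec){
  out <- numeric(length(grps))
  out[1] <- 1 - exp(-grps[1]/dim)
  for (t in 2:length(grps)){
    out[t] <- 1 - exp(-grps[t]/dim) + dec * out[t-1]
  }
  out
}

#A single burst of 100 GRPs decays away in the following weeks
adstock_transform(c(100, 0, 0, 0), dim = 100, dec = 0.3)
```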

I have generated this data set in R (we will use R throughout – if you are unfamiliar with this language please see the R homepage).

It would also be nice if we could vary the parameters to generate different sets of data so I have created the whole thing as an R function with the parameters as arguments.

```r
# *--------------------------------------------------------------------
# | FUNCTION: create_test_sets
# | Creates simple artificial marketing mix data for testing code and
# | techniques
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |29/11/2011|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  base_p         Number of base sales
# |          trend_p        Increase in sales for every unit increase
# |                         in time
# |          season_p       The seasonality effect will be
# |                         season_p*temp where -10<temp<10
# |          dim            The dim parameter in adstock (see below)
# |          dec            The dec parameter in adstock (see below)
# |          error_std      Standard deviation of the noise
# *--------------------------------------------------------------------
# | OUTPUTS: dataframe      Consists of sales, temp, tv_grps, week
# |
# *--------------------------------------------------------------------
# | USAGE:   create_test_sets(base_p,
# |                           trend_p,
# |                           season_p,
# |                           dim,
# |                           dec,
# |                           error_std)
# |
# *--------------------------------------------------------------------
# | DEPENDS: None
# |
# *--------------------------------------------------------------------
# | NOTES:   Usually the test will consist of trying to predict sales
# |          using temp, tv_grps, week and recover the parameters.
# |
# *--------------------------------------------------------------------
#Function for creating test sets
#National level model
create_test_sets<-function(base_p, trend_p, season_p, dim, dec, error_std){
  #Five years of weekly data
  week<-1:(5*52)
  #Base sales of base_p units
  base<-rep(base_p,5*52)
  #Trend of trend_p extra units per week
  trend<-trend_p*week
  #Winter is season_p*10 units below, summer is season_p*10 units above
  temp<-10*sin(week*3.14/26)
  seasonality<-season_p*temp
  #7 TV campaigns. Carry over is dec, theta is dim
  tv_grps<-rep(0,5*52)
  tv_grps[20:25]<-c(390,250,100,80,120,60)
  tv_grps[60:65]<-c(250,220,100,100,120,120)
  tv_grps[100:103]<-c(100,80,60,100)
  tv_grps[150:155]<-c(500,200,200,100,120,120)
  tv_grps[200:205]<-c(250,120,200,100,120,120)
  tv_grps[220:223]<-c(100,100,80,60)
  tv_grps[240:245]<-c(350,290,100,100,120,120)
  #Adstock transformation. This section was garbled in the original
  #listing; it is reconstructed here from the adstock formula above
  TV<-rep(0,5*52)
  TV[1]<-1-exp(-tv_grps[1]/dim)
  for (i in 2:(5*52)){
    TV[i]<-1-exp(-tv_grps[i]/dim)+dec*TV[i-1]
  }
  #Error has a std of error_std
  error<-rnorm(5*52, mean=0, sd=error_std)
  #Full series
  sales<-base+trend+seasonality+TV+error
  #Plot
  #plot(sales, type='l', ylim=c(0,1200))
  output<-data.frame(sales, temp, tv_grps, week)
  output
}
```

Here is a line graph showing a simulated sales series generated with the following parameters:

```r
#Example
test<-create_test_sets(base_p=1000,
                       trend_p=0.8,
                       season_p=4,
                       dim=100,
                       dec=0.3,
                       error_std=5)
```