January 23, 2012

# Visualising The Correlation Matrix

Following on from the previous post, here is an R function for visualising correlations between the explanatory variables in your data set.

```r
# *--------------------------------------------------------------------
# | FUNCTION: visCorrel
# | Creates an MDS plot where the distance between variables represents
# | correlation between the variables (closer=more correlated)
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |05/01/2012|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  dataset        A dataframe containing only the explanatory
# |                         variables. It should not contain missing
# |                         values or variables where there is no
# |                         variation
# |          abr            The number of characters used when
# |                         abbreviating the variables. If set to zero
# |                         there is no abbreviation
# |          method         The options are 'metric' or 'ordinal'. The
# |                         default is metric
# *--------------------------------------------------------------------
# | OUTPUTS: graph          An MDS plot where the similarity measure is
# |                         correlation between the variables.
# |
# *--------------------------------------------------------------------
# | USAGE:         visCorrel(dataset,
# |                          abr)
# |
# *--------------------------------------------------------------------
# | DEPENDS: ggplot2, directlabels, MASS
# |
# *--------------------------------------------------------------------
# |          http://en.wikipedia.org/wiki/Multidimensional_scaling
# |
# *--------------------------------------------------------------------
visCorrel<-function(dataset, abr, method="metric"){
#Create correlation matrix
cor_ts<-cor(dataset)
n<-dim(cor_ts)[2]
# Create dissimilarities
ones<-matrix(rep(1,n^2), nrow=n)
dis_ts<-ones-abs(cor_ts)
# Do MDS
if ( method=="ordinal"){
fit <- isoMDS(dis_ts, k=2)$points
} else {
cmd.res <- cmdscale(dis_ts, k=2, eig=TRUE)
eig<-cmd.res$eig
fit<-cmd.res$points
prop<-sum(abs(eig[1:2]))/sum(abs(eig))
print(paste("Proportion of squared distances represented:", round(prop*100)))
if(prop<0.5){print("Less than 50% of squared distance is represented. Consider using ordinal scaling instead")}
}
x <- fit[,1]
y <- fit[,2]
labels<-row.names(cor_ts)
if (abr>0){labels<-substr(labels,1,abr)}
mds_plot<-data.frame(labels, x, y)
#Plot the results
ggplot(mds_plot, aes(x=x, y=y, colour=labels, label=labels)) + geom_text() + coord_fixed() + ggtitle("MDS Plot of Correlations") + theme(legend.position="none")
}
# *--------------------------------------------------------------------
# * Examples
# *--------------------------------------------------------------------
# library(MASS)
# library(ggplot2)
# library(plm)
# data(midwest)
# data(Crime)
# visCorrel(midwest[,4:27],10, method="metric")
# visCorrel(midwest[,4:27],10, method="ordinal")
# visCorrel(Crime[,-c(1,2,11,12)],10)
```

An interesting example is the North Carolina Crime data set that comes with the plm package. This has the following continuous variables:

- `crmrte`: crimes committed per person
- `prbarr`: probability of arrest
- `prbconv`: probability of conviction
- `prbpris`: probability of prison sentence
- `avgsen`: average sentence, days
- `polpc`: police per capita
- `density`: people per square mile
- `taxpc`: tax revenue per capita
- `pctmin`: percentage minority in 1980
- `wcon`: weekly wage in construction
- `wtuc`: weekly wage in transport, utilities and communications
- `wtrd`: weekly wage in wholesale and retail trade
- `wfir`: weekly wage in finance, insurance and real estate
- `wser`: weekly wage in the service industry
- `wmfg`: weekly wage in manufacturing
- `wfed`: weekly wage of federal employees
- `wsta`: weekly wage of state employees
- `wloc`: weekly wage of local government employees
- `mix`: offence mix: face-to-face/other
- `pctymle`: percentage of young males

These are then visualised (using the ordinal option; see below) like this:

The closer the variables are on the plot, the more highly correlated (positively or negatively) they are. Here we can see some interesting relationships. Unsurprisingly, the wage variables form a correlated group. Towards the top of the chart the method correctly identifies three variables (probability of prison sentence, police per capita and offence mix) that are all correlated with one another.

The plot is useful in dealing with multicollinearity as it allows us to spot clusters of correlated variables. For example, a set of economic variables might be highly correlated with one another but have a low level of correlation with TV advertising levels. Why is this good to know? Because multicollinearity only affects those variables between which the relationship of collinearity holds, so if we are only interested in obtaining an accurate estimate for the TV advertising level then we need not be concerned about collinearity among the economic variables. Of course this plot deals just with correlation between pairs of variables rather than with full-blown collinearity, but it's a good place to start.
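To make the idea concrete, here is a small sketch using simulated data (the variable names `gdp`, `income` and `tv` are invented for illustration, not taken from the post): two strongly correlated "economic" series and an unrelated advertising series produce a correlation matrix with an obvious cluster.

```r
# Illustrative sketch (simulated data): a correlated "economic" pair
# and an independent TV variable.
set.seed(42)
n <- 200
gdp <- rnorm(n)
income <- gdp + rnorm(n, sd = 0.2)  # strongly correlated with gdp
tv <- rnorm(n)                      # unrelated to the economic pair
round(cor(data.frame(gdp, income, tv)), 2)
```

In an MDS plot of this matrix, `gdp` and `income` would sit close together with `tv` far away, which is exactly the clustering the plot is designed to reveal.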

The plot works by performing multi-dimensional scaling on dissimilarities derived from the correlations (one minus the absolute value of the correlation). In other words, if we think of correlation as a measure of distance (highly correlated variables being closer together), it finds the best possible two-dimensional representation of those distances. Note that if these distances could be represented in a Euclidean space then this would be equivalent to a plot of the first two dimensions in a principal components analysis. However, correlation-based distances generally cannot be represented in this way.

Beware though, the best representation isn't always a good representation. I've included some output which will tell you what proportion of the squared distances is represented in the plot. If it is low I would recommend trying ordinal scaling (the other method option) instead.
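The diagnostic itself is only a couple of lines; here is a standalone sketch using the built-in `mtcars` data as a stand-in (an assumption for illustration, not a dataset from the post). Negative eigenvalues from `cmdscale` are the tell-tale sign that the dissimilarities are not Euclidean.

```r
# Sketch of the proportion diagnostic on a stand-in dataset.
d <- 1 - abs(cor(mtcars))                      # dissimilarity = 1 - |correlation|
res <- cmdscale(as.dist(d), k = 2, eig = TRUE)
prop <- sum(abs(res$eig[1:2])) / sum(abs(res$eig))
prop                # low values suggest trying isoMDS (ordinal scaling) instead
any(res$eig < 0)    # TRUE would indicate the distances are not Euclidean
```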

I’ve a feeling that applying multi-dimensional scaling to correlations is equivalent to something else that is probably far simpler but I haven’t had time to give it attention. If anyone knows I’d be very interested.

Simon Raper: I am an RSS accredited statistician with over 15 years' experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia, an analytics startup that uses agile methods to bring machine learning and other cutting-edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, Spark, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author, with Mark Bulling, of Drunks and Lampposts, a blog on computational statistics, machine learning, data visualisation, R, Python and cloud computing. It has had over 310K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

1. I wonder how it would compare to a dendrogram of the same data.

• Thanks Robert. What I'd like to do if I get time is overlay the minimum spanning tree on the chart. That will show the structure of the dendrogram (for single linkage clustering) and if I label the links we can see how much the two-dimensional plot distorts the true distances.

• Nice piece of code!
Here’s a modified version of your function: I’ve added the links between nodes defined by the correlation. More correlation is indicated by a more pronounced link. Check it out here: http://t.co/ygxrAbVw.

And the code:
vis_correl2 <- function(dataset, abr=0, minCorr=0){

#Create correlation matrix
cor_ts<-cor(dataset)
n<-dim(cor_ts)[2]
# Create dissimilarities
ones<-matrix(rep(1,n^2), nrow=n)
dis_ts<-ones-abs(cor_ts)
# Do MDS
fit <- cmdscale(dis_ts, k=2)
x <- fit[,1]
y <- fit[,2]
labels<-row.names(cor_ts)
if (abr>0){labels<-substr(labels,1,abr)}

mds_plot<-data.frame(labels, x, y)

### define the edges
mds_plot <- cbind(id = 1:nrow(mds_plot), mds_plot)
### building the appropriate edge representation
edge_weight <- as.data.frame.table(cor_ts)
edge_weight <- merge(edge_weight, mds_plot[,1:2], by.x="Var2", by.y="labels")
edge_weight <- merge(edge_weight, mds_plot[,1:2], by.x="Var1", by.y="labels")
### build the edges in coordinate form
edge_weight_coord <- data.frame(mds_plot[edge_weight[,4],], mds_plot[edge_weight[,5],], edge_weight$Freq)
rownames(edge_weight_coord) <- NULL
colnames(edge_weight_coord) <- c("id1", "labels1", "X1", "Y1", "id2", "labels2", "X2", "Y2", "Corr")
### keep each edge only once
edge_weight_coord <- edge_weight_coord[edge_weight_coord$id1 < edge_weight_coord$id2, ]

### restrict to correlations above the threshold
edge_weight_coord <- edge_weight_coord[edge_weight_coord$Corr > minCorr, ]

#Plot the results
ggplot() +
geom_segment(aes(x=X1, y=Y1, xend=X2, yend=Y2, alpha=Corr), data=edge_weight_coord, colour="gray", size=1) +
geom_text(aes(x=x, y=y, label=labels), data=mds_plot, size=4, angle=0) + coord_fixed() + ggtitle("MDS Plot of Correlations") +
geom_point(aes(x=x, y=y), data=mds_plot, size=5, colour="red", alpha=0.3) +
scale_x_continuous(breaks=NULL) + scale_y_continuous(breaks=NULL) +
# discard the default grid, axis titles and legend
theme(panel.background=element_rect(fill="white", colour=NA),
      legend.position="none",
      axis.title.x=element_blank(), axis.title.y=element_blank(),
      legend.background=element_rect(colour=NA),
      panel.grid.minor=element_blank(), panel.grid.major=element_blank())
}

• That’s fantastic. Thank you so much!

2. Having used this a few times I've now realised quite how non-Euclidean some correlation matrices can be. This means that sometimes metric scaling does a poor job of representing the distances. I've done two things to allow for this: 1) provided output that shows what proportion of the distances is represented in the plot; 2) provided the option to do ordinal scaling instead, as it often does a much better job when the distances are a long way from Euclidean. I've updated the crime data plot as it is better represented with the ordinal scaling. One other thing: I've used the directlabels package to add the labels as it does a better job and looks nicer.

3. Ranaivoson

Hello!
I'm a beginner in R. How can I adapt this script to my own dataset? What line(s) should I replace?

• Hi there,

all you need to do is load the function into memory by running it. Then you can run the line visCorrel(data, x) where data is your dataframe of explanatory variables and x is the number of places you want the column names truncated to. You’ll need to install the packages ggplot2 and directlabels for it to work. Also see my note above about trying both metric and ordinal scaling.

Simon

4. Ranaivoson

OK! Thank you!