January 23, 2012 Simon Raper

Visualising The Correlation Matrix


Following on from the previous post, here is an R function for visualising correlations between the explanatory variables in your data set.

# *--------------------------------------------------------------------
# | FUNCTION: visCorrel
# | Creates an MDS plot where the distance between variables represents
# | correlation between the variables (closer=more correlated)
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |05/01/2012|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  dataset        A dataframe containing only the explanatory
# |                         variables. It should not contain missing
# |                         values or variables where there is no
# |                         variation
# |          abr            The number of characters used when
# |                         abbreviating the variables. If set to zero
# |                         there is no abbreviation
# |          method         The options are 'metric' or 'ordinal'. The
# |                         default is metric
# *--------------------------------------------------------------------
# | OUTPUTS: graph          An MDS plot where the similarity measure is
# |                         correlation between the variables.
# |
# *--------------------------------------------------------------------
# | USAGE:         visCorrel(dataset,
# |                          abr)
# |
# *--------------------------------------------------------------------
# | DEPENDS: ggplot2, directlabels, MASS
# |
# *--------------------------------------------------------------------
# | NOTES:   For more information about MDS please see
# |          http://en.wikipedia.org/wiki/Multidimensional_scaling
# |
# *--------------------------------------------------------------------
visCorrel <- function(dataset, abr, method="metric"){
  # Create correlation matrix
  cor_ts <- cor(dataset)
  n <- dim(cor_ts)[1]
  # Create dissimilarities: 1 - |correlation|, so highly correlated
  # variables (positively or negatively) are close together
  ones <- matrix(rep(1, n^2), nrow=n)
  dis_ts <- ones - abs(cor_ts)
  # Do MDS
  if (method=="ordinal"){
    fit <- isoMDS(dis_ts, k=2)$points
  } else {
    cmd.res <- cmdscale(dis_ts, k=2, eig=TRUE)
    fit <- cmd.res$points
    prop <- sum(abs(cmd.res$eig[1:2])) / sum(abs(cmd.res$eig))
    print(paste("Proportion of squared distances represented:", round(prop*100)))
    if (prop < 0.5){print("Less than 50% of squared distance is represented. Consider using ordinal scaling instead")}
  }
  x <- fit[,1]
  y <- fit[,2]
  labels <- colnames(dataset)
  if (abr > 0){labels <- substr(labels, 1, abr)}
  mds_plot <- data.frame(labels, x, y)
  # Plot the results
  ggplot(mds_plot, aes(x=x, y=y, colour=labels, label=labels)) + geom_text() +
    coord_fixed() + ggtitle("MDS Plot of Correlations") + theme(legend.position="none")
}
# *--------------------------------------------------------------------
# * Examples
# *--------------------------------------------------------------------
# library(MASS)
# library(ggplot2)
# library(plm)
# data(midwest)
# data(Crime)
# visCorrel(midwest[,4:27],10, method="metric")
# visCorrel(midwest[,4:27],10, method="ordinal")
# visCorrel(Crime[,-c(1,2,11,12)],10)

An interesting example is the North Carolina Crime data set that comes with the plm package. This has the following continuous variables:

crmrte crimes committed per person
prbarr probability of arrest
prbconv probability of conviction
prbpris probability of prison sentence
avgsen average sentence, days
polpc police per capita
density people per square mile
taxpc tax revenue per capita
pctmin percentage minority in 1980
wcon weekly wage in construction
wtuc weekly wage in transport, utilities and communications
wtrd weekly wage in wholesale and retail trade
wfir weekly wage in finance, insurance and real estate
wser weekly wage in the service industry
wmfg weekly wage in manufacturing
wfed weekly wage of federal employees
wsta weekly wage of state employees
wloc weekly wage of local government employees
mix offence mix: face-to-face/other
pctymle percentage of young males

Which is then visualised (using the ordinal option – see below) as this:

[MDS plot of the Crime data variables]

The closer the variables are on the plot, the more highly correlated (positively or negatively) they are. Here we can see some interesting relationships. Unsurprisingly, the wage variables form a correlated group. Towards the top of the chart the method correctly identifies three variables – probability of prison sentence, police per capita and offence mix – that are all correlated with one another.

The plot is useful in dealing with multicollinearity as it allows us to spot clusters of correlated variables. For example, a set of economic variables might be highly correlated with one another but have a low level of correlation with TV advertising levels. Why is this good to know? Because multicollinearity only affects those variables between which the relationship of collinearity holds. So if we are only interested in obtaining an accurate estimate for the TV advertising level then we need not be concerned about collinearity among the economic variables. Of course this plot deals just with correlation between two variables rather than with full-blown collinearity, but it’s a good place to start.
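The claim that collinearity is local can be checked with a short simulation (my own sketch, not from the post; x1, x2 and tv are made-up stand-ins for two correlated economic variables and an independent TV advertising variable):

```r
# Sketch: collinearity inflates standard errors only for the variables
# actually involved in the near-linear relationship.
set.seed(1)
n  <- 200
x1 <- rnorm(n)                  # first "economic" variable
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
tv <- rnorm(n)                  # independent "TV advertising" variable
y  <- x1 + x2 + tv + rnorm(n)

fit <- lm(y ~ x1 + x2 + tv)
se  <- summary(fit)$coefficients[, "Std. Error"]
se   # x1 and x2 have much larger standard errors than tv
```

Because tv is uncorrelated with the collinear pair, its standard error stays close to what it would be in a clean design, while the estimates for x1 and x2 are badly inflated.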

The plot works by performing a multi-dimensional scaling on dissimilarities derived from the absolute values of the correlations (one minus the absolute correlation). In other words, if we think of correlation as a measure of distance (highly correlated variables being closer together), it finds the best possible two-dimensional representation of those distances. Note that if these distances could be represented in a Euclidean space then this would be equivalent to a plot of the first two dimensions in a principal components analysis. However, correlations in general cannot be represented in this way.
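The mechanics above can be sketched in a few lines of R (using the built-in mtcars data for illustration rather than the Crime set):

```r
# Turn 1 - |correlation| into a distance matrix and let classical MDS
# find the best two-dimensional layout of the variables.
data(mtcars)

cor_m <- cor(mtcars)
dis   <- 1 - abs(cor_m)    # highly correlated (+ or -) means close together

mds <- cmdscale(dis, k = 2, eig = TRUE)
mds$points                 # one row of plotting coordinates per variable
```

Negative values in mds$eig are the telltale sign that the distances cannot be embedded in a Euclidean space.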

Beware though, the best representation isn’t always a good representation. I’ve included some output which will tell you what proportion of the squared distances is represented in the plot. If it is low I would recommend trying ordinal scaling (the other method option) instead.
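For reference, that proportion can be recovered from the eigenvalues cmdscale returns; taking absolute values guards against the negative eigenvalues a non-Euclidean matrix produces:

```r
# Goodness of fit of the 2-D configuration, computed from the eigenvalues
data(mtcars)   # any dataframe of numeric variables will do
dis <- 1 - abs(cor(mtcars))
cmd <- cmdscale(dis, k = 2, eig = TRUE)

# Share of the total (absolute) eigenvalue mass captured by two dimensions
prop <- sum(abs(cmd$eig[1:2])) / sum(abs(cmd$eig))
round(prop * 100)
```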

I’ve a feeling that applying multi-dimensional scaling to correlations is equivalent to something else that is probably far simpler but I haven’t had time to give it attention. If anyone knows I’d be very interested.


About the Author

Simon Raper I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia, an analytics startup that uses agile methods to bring machine learning and other cutting edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, Spark, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts – a blog on computational statistics, machine learning, data visualisation, R, python and cloud computing. It has had over 310K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

Comments (8)

    • Thanks Robert. What I’d like to do if I get time is overlay the minimum spanning tree on the chart. That will show the structure of the dendrogram (for single linkage clustering) and if I label the links we can see how much the 2 dimensional plot distorts the true distances.
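For what it’s worth, that minimum spanning tree overlay could be sketched along these lines (my assumption, not part of the original comment; it uses the mst function from the ape package and base graphics rather than ggplot2):

```r
library(ape)   # provides mst() for minimum spanning trees

dis  <- 1 - abs(cor(mtcars))       # same correlation distances as the post
fit  <- cmdscale(dis, k = 2)       # MDS coordinates for the variables
tree <- mst(as.dist(dis))          # 0/1 adjacency matrix of the MST

# Draw the variables at their MDS positions and overlay the MST edges
plot(fit, type = "n", xlab = "", ylab = "", asp = 1)
edges <- which(tree == 1, arr.ind = TRUE)
segments(fit[edges[, 1], 1], fit[edges[, 1], 2],
         fit[edges[, 2], 1], fit[edges[, 2], 2], col = "grey")
text(fit, labels = rownames(fit))
```

Labelling each segment with its true dissimilarity would then show how much the 2-D plot distorts the distances.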

      • Nice piece of code!
        Here’s a modified version of your function: I’ve added the links between nodes defined by the correlation. More correlation is indicated by a more pronounced link. Check it out here: http://t.co/ygxrAbVw.

        And the code:
        vis_correl2 <- function(dataset, abr=0, minCorr=0){

        # Create correlation matrix
        cor_ts <- cor(dataset)
        n <- dim(cor_ts)[1]
        # Create dissimilarities
        ones <- matrix(rep(1, n^2), nrow=n)
        dis_ts <- ones - abs(cor_ts)
        # Do MDS
        fit <- cmdscale(dis_ts, k=2)
        x <- fit[,1]
        y <- fit[,2]

        labels <- colnames(dataset)
        if (abr > 0){labels <- substr(labels, 1, abr)}
        dimnames(cor_ts) <- list(labels, labels)  # keep names in step for the merges below
        mds_plot <- data.frame(labels, x, y)

        ### define the edges
        mds_plot <- cbind(id = 1:nrow(mds_plot), mds_plot)
        ### building the appropriate edge representation
        edge_weight <- as.data.frame.table(cor_ts)
        edge_weight <- merge(edge_weight, mds_plot[,1:2], by.x="Var2", by.y="labels")
        edge_weight <- merge(edge_weight, mds_plot[,1:2], by.x="Var1", by.y="labels")
        ### build the edges in coordinate form
        edge_weight_coord <- data.frame(mds_plot[edge_weight[,4],], mds_plot[edge_weight[,5],], edge_weight$Freq)
        rownames(edge_weight_coord) <- NULL
        colnames(edge_weight_coord) <- c("id1", "labels1", "X1","Y1","id2", "labels2", "X2","Y2", "Corr")
        ### keep each pair of variables only once
        edge_weight_coord <- edge_weight_coord[edge_weight_coord$id1 > edge_weight_coord$id2,]

        ### restrict to correlations above minCorr
        edge_weight_coord <- edge_weight_coord[edge_weight_coord$Corr > minCorr, ]

        #Plot the results
        ggplot() +
        geom_segment(aes(x=X1, y=Y1, xend = X2, yend = Y2, alpha = Corr), data=edge_weight_coord, colour="gray", size=1) +
        geom_text(aes(x=x, y=y, label=labels), data= mds_plot, size=4, angle=0) + coord_fixed() + ggtitle("MDS Plot of Correlations") +
        geom_point(aes(x=x, y=y), data= mds_plot, size = 5, colour="red", alpha=0.3) +
        scale_x_continuous(breaks = NULL) + scale_y_continuous(breaks = NULL) +
        # discard default grid + titles in ggplot2
        theme(panel.background = element_rect(fill = "white", colour = NA),
              legend.position = "none",
              legend.background = element_rect(colour = NA),
              axis.title.x = element_blank(), axis.title.y = element_blank(),
              panel.grid.minor = element_blank(), panel.grid.major = element_blank())
        }

  1. Having used this a few times I’ve now realised quite how non-Euclidean some correlation matrices can be. This means that sometimes metric scaling does a poor job of representing the distances. I’ve done two things to allow for this. 1) provided output that shows what proportion of the distances are represented in the plot. 2) provided the option to do ordinal scaling instead as it often does a much better job if the distances are a long way from Euclidean. I’ve updated the crime data plot as it is better represented with the ordinal scaling. One other thing: I’ve used the directlabels package to add the labels as it does a better job and looks nicer.

  2. Ranaivoson

    I’m a beginner in R. How can I adapt this script to my own dataset? What line(s) should I replace?

    • Hi there,

      All you need to do is load the function into memory by running it. Then you can run the line visCorrel(data, x) where data is your dataframe of explanatory variables and x is the number of places you want the column names truncated to. You’ll need to install the packages ggplot2 and directlabels for it to work. Also see my note above about trying both metric and ordinal scaling.


