Following on from the previous post, here is an R function for visualising correlations between the explanatory variables in your data set.

# *--------------------------------------------------------------------
# | FUNCTION: visCorrel
# | Creates an MDS plot where the distance between variables represents
# | correlation between the variables (closer=more correlated)
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# | 01      |05/01/2012|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  dataset   A dataframe containing only the explanatory
# |                    variables. It should not contain missing
# |                    values or variables where there is no
# |                    variation
# |          abr       The number of characters used when
# |                    abbreviating the variables. If set to zero
# |                    there is no abbreviation
# |          method    The options are 'metric' or 'ordinal'. The
# |                    default is metric
# *--------------------------------------------------------------------
# | OUTPUTS: graph     An MDS plot where the similarity measure is
# |                    correlation between the variables.
# |
# *--------------------------------------------------------------------
# | USAGE:   visCorrel(dataset,
# |                    abr)
# |
# *--------------------------------------------------------------------
# | DEPENDS: ggplot2, directlabels, MASS
# |
# *--------------------------------------------------------------------
# | NOTES:   For more information about MDS please see
# |          http://en.wikipedia.org/wiki/Multidimensional_scaling
# |
# *--------------------------------------------------------------------

visCorrel<-function(dataset, abr, method="metric"){

  #Create correlation matrix
  cor_ts<-cor(dataset)
  n<-dim(cor_ts)[2]

  # Create dissimilarities
  ones<-matrix(rep(1,n^2), nrow=n)
  dis_ts<-ones-abs(cor_ts)

  # Do MDS
  if (method=="ordinal"){
    fit <- isoMDS(dis_ts, k=2)$points
  } else {
    cmd.res <- cmdscale(dis_ts, k=2, eig=TRUE)
    eig<-cmd.res$eig
    fit<-cmd.res$points
    prop<-sum(abs(eig[1:2]))/sum(abs(eig))
    print(paste("Proportion of squared distances represented:", round(prop*100)))
    if(prop<0.5){print("Less than 50% of squared distance is represented. Consider using ordinal scaling instead")}
  }
  x <- fit[,1]
  y <- fit[,2]
  labels<-row.names(cor_ts)
  if (abr>0){labels<-substr(labels,1,abr)}
  mds_plot<-data.frame(labels, x, y)

  #Plot the results
  ggplot(mds_plot, aes(x=x, y=y, colour=labels, label=labels)) +
    geom_text() +
    coord_fixed() +
    ggtitle("MDS Plot of Correlations") +
    theme(legend.position="none")
}

# *--------------------------------------------------------------------
# * Examples
# *--------------------------------------------------------------------

# library(MASS)
# library(ggplot2)
# library(plm)
# data(midwest)
# data(Crime)
# visCorrel(midwest[,4:27], 10, method="metric")
# visCorrel(midwest[,4:27], 10, method="ordinal")
# visCorrel(Crime[,-c(1,2,11,12)], 10)

An interesting example is the North Carolina Crime data set that comes with the plm package. This has the following continuous variables:

crmrte | crimes committed per person
prbarr | probability of arrest
prbconv | probability of conviction
prbpris | probability of prison sentence
avgsen | average sentence, days
polpc | police per capita
density | people per square mile
taxpc | tax revenue per capita
pctmin | percentage minority in 1980
wcon | weekly wage in construction
wtuc | weekly wage in trns, util, commun
wtrd | weekly wage in wholesale and retail trade
wfir | weekly wage in finance, insurance and real estate
wser | weekly wage in service industry
wmfg | weekly wage in manufacturing
wfed | weekly wage of federal employees
wsta | weekly wage of state employees
wloc | weekly wage of local government employees
mix | offence mix: face-to-face/other
pctymle | percentage of young males

This is then visualised (using the ordinal option – see below) as this:

The closer the variables are on the plot, the more highly correlated (positively or negatively) they are. Here we can see some interesting relationships. Unsurprisingly, the wage variables form a correlated group. Towards the top of the chart the method correctly identifies three variables (probability of prison sentence, police per capita and offence mix) that are all correlated with one another.

The plot is useful in dealing with multicollinearity as it allows us to spot clusters of correlated variables. For example, a set of economic variables might be highly correlated with one another but have a low level of correlation with TV advertising levels. Why is this good to know? Because multicollinearity only affects those variables between which the relationship of collinearity holds. So if we are only interested in obtaining an accurate estimate for the TV advertising level then we need not be concerned about collinearity among the economic variables. Of course this plot deals just with correlation between pairs of variables rather than with full-blown collinearity, but it's a good place to start.
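A toy illustration of that last caveat (my own example, not from the original post): pairwise correlations can look unremarkable even when the variables are perfectly collinear as a group.

```r
# x3 is exactly x1 + x2, so the three variables are perfectly
# collinear, yet no single pairwise correlation equals 1.
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
round(cor(cbind(x1, x2, x3)), 2)
```

With independent x1 and x2, cor(x1, x3) sits around 0.7, so nothing in the pairwise correlation matrix (or an MDS plot of it) flags the exact linear dependence.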

The plot works by performing a multi-dimensional scaling on the absolute values of the correlations. In other words, if we think of correlation as a measure of distance (highly correlated variables being closer together), it finds the best possible two-dimensional representation of those distances. Note that if these distances could be represented in a Euclidean space then this would be equivalent to a plot of the first two dimensions in a principal components analysis. However, correlations generally cannot be represented in this way.
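A quick sketch of that PCA remark (my own check, using the built-in mtcars data): for genuinely Euclidean distances on centred data, classical MDS recovers the principal component scores up to a sign flip per axis.

```r
# Classical MDS on Euclidean distances of centred data reproduces
# the first two principal component scores (up to sign).
X <- scale(mtcars, center = TRUE, scale = FALSE)
mds <- cmdscale(dist(X), k = 2)
pca <- prcomp(X)$x[, 1:2]
max(abs(abs(mds) - abs(pca)))  # effectively zero
```

The equivalence breaks down for the 1 - |correlation| dissimilarities precisely because they are usually not Euclidean.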

Beware though: the best representation isn't always a good representation. I've included some output which will tell you what proportion of the squared distances is represented in the plot. If it is low I would recommend trying ordinal scaling (the other method option) instead.
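To see what that output is picking up, here is a small check (using the built-in mtcars data rather than the Crime set): cmdscale reports the eigenvalues behind the embedding, negative eigenvalues signal that the dissimilarities are not Euclidean, and the printed proportion is computed from the same eigenvalues.

```r
# Inspect the eigenvalues behind the metric MDS of 1 - |correlation|.
d <- 1 - abs(cor(mtcars))
res <- cmdscale(d, k = 2, eig = TRUE)
prop <- sum(abs(res$eig[1:2])) / sum(abs(res$eig))
prop                # proportion of squared distance in the 2-D plot
any(res$eig < 0)    # negative eigenvalues indicate non-Euclidean distances
```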

I’ve a feeling that applying multi-dimensional scaling to correlations is equivalent to something else that is probably far simpler but I haven’t had time to give it attention. If anyone knows I’d be very interested.

I wonder how it would compare to a dendrogram of the same data.

Thanks Robert. What I'd like to do if I get time is overlay the minimum spanning tree on the chart. That will show the structure of the dendrogram (for single linkage clustering) and if I label the links we can see how much the two-dimensional plot distorts the true distances.
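For anyone wanting to try the dendrogram comparison in the meantime, a minimal base-R sketch (using mtcars as a stand-in dataset):

```r
# Single-linkage clustering on the same 1 - |correlation| dissimilarities;
# single linkage is the clustering whose dendrogram encodes the minimum
# spanning tree structure mentioned above.
d <- as.dist(1 - abs(cor(mtcars)))
hc <- hclust(d, method = "single")
plot(hc, main = "Single linkage on 1 - |cor|")
```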

Nice piece of code!

Here’s a modified version of your function: I’ve added the links between nodes defined by the correlation. More correlation is indicated by a more pronounced link. Check it out here: http://t.co/ygxrAbVw.

And the code:

vis_correl2 <- function(dataset, abr=0, minCorr=0){

  #Create correlation matrix
  cor_ts<-cor(dataset)
  n<-dim(cor_ts)[2]

  # Create dissimilarities
  ones<-matrix(rep(1,n^2), nrow=n)
  dis_ts<-ones-abs(cor_ts)

  # Do MDS
  fit <- cmdscale(dis_ts, k=2)
  x <- fit[,1]
  y <- fit[,2]
  labels<-row.names(cor_ts)
  if (abr>0){labels<-substr(labels,1,abr)}
  mds_plot<-data.frame(labels, x, y)

  ### define the edges
  mds_plot <- cbind(id = 1:nrow(mds_plot), mds_plot)

  ### building the appropriate edge representation
  edge_weight <- as.data.frame.table(cor_ts)
  edge_weight <- merge(edge_weight, mds_plot[,1:2], by.x="Var2", by.y="labels")
  edge_weight <- merge(edge_weight, mds_plot[,1:2], by.x="Var1", by.y="labels")

  ### build the edges in coordinate form
  edge_weight_coord <- data.frame(mds_plot[edge_weight[,4],], mds_plot[edge_weight[,5],], edge_weight$Freq)
  rownames(edge_weight_coord) <- NULL
  colnames(edge_weight_coord) <- c("id1", "labels1", "X1","Y1","id2", "labels2", "X2","Y2", "Corr")

  ### keep each edge once (drop duplicates and self-links)
  edge_weight_coord <- edge_weight_coord[edge_weight_coord$id1 > edge_weight_coord$id2,]

  ### restrict to correlations above the threshold
  edge_weight_coord <- edge_weight_coord[edge_weight_coord$Corr > minCorr,]

  #Plot the results
  ggplot() +
    geom_segment(aes(x=X1, y=Y1, xend=X2, yend=Y2, alpha=Corr),
                 data=edge_weight_coord, colour="gray", size=1) +
    geom_text(aes(x=x, y=y, label=labels), data=mds_plot, size=4, angle=0) +
    geom_point(aes(x=x, y=y), data=mds_plot, size=5, colour="red", alpha=0.3) +
    coord_fixed() +
    ggtitle("MDS Plot of Correlations") +
    scale_x_continuous(breaks = NULL) + scale_y_continuous(breaks = NULL) +
    # discard default grid + titles in ggplot2
    theme(panel.background = element_rect(fill = "white", colour = NA),
          legend.position = "none",
          legend.background = element_rect(colour = NA),
          axis.title.x = element_blank(), axis.title.y = element_blank(),
          panel.grid.minor = element_blank(), panel.grid.major = element_blank())
}

That’s fantastic. Thank you so much!

Having used this a few times I've now realised quite how non-Euclidean some correlation matrices can be. This means that sometimes metric scaling does a poor job of representing the distances. I've done two things to allow for this: 1) provided output that shows what proportion of the distances is represented in the plot; 2) provided the option to do ordinal scaling instead, as it often does a much better job if the distances are a long way from Euclidean. I've updated the crime data plot as it is better represented with the ordinal scaling. One other thing: I've used the directlabels package to add the labels as it does a better job and looks nicer.

Hello!

I'm a beginner in R. How can I adapt this script to my own dataset? What line(s) should I replace?

Hi there,

All you need to do is load the function into memory by running it. Then you can run the line visCorrel(data, x) where data is your dataframe of explanatory variables and x is the number of characters you want the column names truncated to. You'll need to install the packages ggplot2, directlabels and MASS for it to work. Also see my note above about trying both metric and ordinal scaling.

Simon

OK! Thank you!