Dendrograms in R2D3

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

Hi, I’m Andrew and this is my first post for Coppelia! If you like the look of this feel free to visit my blog dinner with data (and see what happens when a data scientist hits the kitchen!)

I was excited by James’s last post on the new package R2D3, and I thought I would try to help further develop the package. This is a great new package, built by James Thomson (and in collaboration with myself and Simon Raper at Coppelia) that utilises D3 visualisations inside R. You can quickly create very striking visualisations with a just a few lines of code. This has recently been shared with a recent post, but since then a couple of updates have been made to increase the functionality.

In particular to the function D3Dendro, which creates dendrograms based on a hclust object in R. I had been working on a number of alternatives to the usual static dendrogram found in the package so far, so I thought I would add these in and describe them below.

I have created two new distinct functionalities:

  • Collapsible nodes
  • Radial output (rather than the more traditional ‘linear’ dendrogram)

You can clone the package from James’s github repository or run the following in R:


I will include the example in the original post, so you can easily compare the differences.

Original dendrogram:

hc < - hclust(dist(USArrests), "ave") JSON<-jsonHC(hc) D3Dendro(JSON, file_out="USArrests_Dendo.html")

Read more

Visualising The Correlation Matrix

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

Following on from the previous post here is an R function for visualising correlations between the explanatory variables in your data set.

# *--------------------------------------------------------------------
# | FUNCTION: visCorrel
# | Creates an MDS plot where the distance between variables represents
# | correlation between the variables (closer=more correlated)
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |05/01/2012|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  dataset        A dataframe containing only the explanatory
# |                         variables. It should not contain missing
# |                         values or variables where there is no
# |                         variation
# |          abr            The number of characters used when
# |                         abbreviating the variables. If set to zero
# |                         there is no abbreviation
# |          method         The options are 'metric' or 'ordinal'. The
# |                         default is metric
# *--------------------------------------------------------------------
# | OUTPUTS: graph          An MDS plot where the similarity measure is
# |                         correlation between the variables.
# |
# *--------------------------------------------------------------------
# | USAGE:         vis_correl(dataset,
# |                           abr)
# |
# *--------------------------------------------------------------------
# | DEPENDS: ggplot2, directlabels, MASS
# |
# *--------------------------------------------------------------------
# | NOTES:   For more information about MDS please see
# |
# |
# *--------------------------------------------------------------------
visCorrel<-function(dataset, abr, method="metric"){
  #Create correlation matrix
  # Create dissimilarities
  ones<-matrix(rep(1,n^2), nrow=n)
  # Do MDS
  if ( method=="ordinal"){
    fit <- isoMDS(dis_ts, k=2)$points
  } else {
    cmd.res <- cmdscale(dis_ts, k=2, eig=TRUE)
    print(paste("Proportion of squared distances represented:", round(prop*100)))
    if(prop<0.5){print("Less than 50% of squared distance is represented. Consider using ordinal scaling instead")}
  x <- fit[,1]
  y <- fit[,2]
  if (abr>0){labels<-substr(labels,1,abr)}
  mds_plot<-data.frame(labels, x, y)
  #Plot the results
  ggplot(mds_plot, aes(x=x, y=y, colour=labels, main="MDS Plot of Correlations", label=labels))+geom_text() + coord_fixed()+ ggtitle("MDS Plot of Correlations") + theme(legend.position="none")
# *--------------------------------------------------------------------
# * Examples
# *--------------------------------------------------------------------
# library(MASS)
# library(ggplot2)
# library(plm)
# data(midwest)
# data(Crime)
# visCorrel(midwest[,4:27],10, method="classical")
# visCorrel(midwest[,4:27],10, method="ordinal")
# visCorrel(Crime[,-c(1,2,11,12)],10)

An interesting example is the North Carolina Crime data set that comes with the plm package. This has the following continuous variables:

crmrte crimes committed per person
prbarr probability of arrest
prbarr probability of arrest
prbconv probability of conviction
prbpris probability of prison sentence
avgsen average sentence, days
polpc police per capita
density people per square mile
taxpc tax revenue per capita
pctmin percentage minority in 1980
wcon weekly wage in construction
wtuc weekly wage in trns, util, commun
wtrd weekly wage in whole sales and retail trade
wfir weekly wage in finance, insurance and real estate
wser weekly wage in service industry
wmfg weekly wage in manufacturing
wfed weekly wage of federal emplyees
wsta weekly wage of state employees
wloc weekly wage of local governments employees
mix offence mix: face-to-face/other
pctymle percentage of young males

Which is then visualised (using the ordinal option – see below) as this:

The closer the variables are on the plot the more highly correlated (positively or negatively) they are. Here we can see some interesting relationships. Unsurprisingly the wage variables form a correlated group. Towards the top of the chart the method correctly identifies three variables, probability of prison sentence, police per capita and offence mix that are all correlated with one another.

The plot is useful in dealing with multicollinearity as it allows us to spot clusters of correlated variables. For example a set of economic variables might be highly correlated with one another but have a low level of correlation with TV advertising levels. Why is this good to know? Because multicollinearity only affects those variables between which the relationship of collinearity holds. So if we are only interested in obtaining an accurate estimate for the TV advertising level then we need not be concerned about collinearity among the economic variables. Of course this plot deals just with correlation between two variables rather than with full blown collinearity but it’s a good place to start.

The plot works by performing a multi-dimensional scaling on the absolute value of the correlations. In other words if we think of correlation as measure of distance (highly correlated variables being closer together) it finds the best possible two dimensional representation of those distances. Note if these distances could be represented in a euclidean space then this would be equivalent to a plot of the first two dimensions in a principal components analysis. However correlations cannot be represented in this way.

Beware though, the best representation isn’t always a good representation. I’ve included some output which will tell you what proportion on the squared distances is represented in the plot. If it is low I would recommend trying ordinal scaling (the other method option) instead.

I’ve a feeling that applying multi-dimensional scaling to correlations is equivalent to something else that is probably far simpler but I haven’t had time to give it attention. If anyone knows I’d be very interested.

Machine Learning and Analytics based in London, UK