July 8, 2014 Simon Raper

Buster – a new R package for bagging hierarchical clustering


I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, not for hierarchical clustering.

It’s not a typical everyday human dilemma. It needs a bit more explanation.

Some background

Clustering (I'll assume everyone is familiar with the technique) typically works on a matrix of distances between data points. We sometimes refer to the distances as dissimilarities: the greater the distance, the more dissimilar the data points. So in a simple case the data points might be customers and the distances reflect how different the customers are in terms of demographics, purchasing behaviour and so on.

A simple way to achieve this would be to plot your data points against a set of axes representing the attributes you would like to include in the dissimilarity measure and then just measure the distance between the points. Scaling the data ensures that no attribute is given more weight than the others. For example, plot your customers by age and income (both standardised) and then measure the distance between them to determine how similar they are in age and income.
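In R that might look something like this (the customer data frame here is made up purely for illustration):

# Hypothetical customer data
customers <- data.frame(age = c(25, 40, 62), income = c(21000, 48000, 30000))

# Standardise each attribute, then take pairwise Euclidean distances
d <- dist(scale(customers))
d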

This is Euclidean distance, and it is implicit in many popular and powerful clustering algorithms, for example k-means and its variants.

But how do you know if your measure of dissimilarity is representable as a Euclidean distance and therefore amenable to k-means? It's simple enough if you started with some variables and derived a Euclidean distance from them, but it doesn't always work this way. For example, suppose I am measuring similarity by correlation. Is the absolute value of correlation representable as a Euclidean distance? Is there some n-dimensional space in which we could plot our data points such that the distance between them represented the absolute value of their correlation?
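To make this concrete, here is one common way of turning absolute correlation into a dissimilarity, using the four iris measurements as stand-in variables, together with one way of probing the question: classical multidimensional scaling (cmdscale) returns negative eigenvalues whenever a dissimilarity cannot be embedded exactly in Euclidean space.

# One common construction: dissimilarity = 1 - |correlation|
d.cor <- as.dist(1 - abs(cor(iris[, 1:4])))

# Classical MDS: any negative eigenvalues mean the dissimilarity
# cannot be reproduced exactly by Euclidean distances
cmdscale(d.cor, k = 2, eig = TRUE)$eig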

A quick check to see if your measure is ever going to be representable as a Euclidean distance is: does it satisfy the triangle inequality?

d(x,z) \le d(x,y) + d(y,z), where d(x,y) is the distance between x and y.

For example, love and hate (assuming they are quantifiable) do not satisfy the triangle inequality. If x loves y and y loves z, this places no constraint on the degree to which x loves z. It could easily be more than the sum of the love of x for y and of y for z!

The same is true of the absolute value of correlation. Here, if you are interested, is a proof.
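For small dissimilarity matrices you can also just brute-force the check over all triples; a throwaway sketch, not part of the package:

# Check every triple in a dissimilarity matrix against the triangle inequality
satisfies.triangle <- function(d) {
  m <- as.matrix(d)
  n <- nrow(m)
  for (x in 1:n) for (y in 1:n) for (z in 1:n) {
    if (m[x, z] > m[x, y] + m[y, z] + 1e-12) return(FALSE)
  }
  TRUE
}

satisfies.triangle(dist(iris[, 1:4]))   # TRUE, as expected for a genuine Euclidean distance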

Note

The triangle inequality is actually part of the defining criteria for the more general concept of a metric space (thanks to Richard Brooker for pointing this out to me). In other words, it's a necessary, not a sufficient, condition for Euclidean distance. However, it works well as a way of ruling out candidates.

If your measure isn't representable in Euclidean space then you have hierarchical clustering to fall back on: it works directly from the distance matrix, successively merging the closest points and clusters, and so can handle almost any kind of distance measure.
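For example, hclust will work straight from the correlation-based dissimilarity built above, no coordinates required (a sketch):

# A non-Euclidean dissimilarity between the four iris measurements
d.cor <- as.dist(1 - abs(cor(iris[, 1:4])))

hc <- hclust(d.cor, method = "average")   # hclust needs only the dissimilarities
plot(hc)
cutree(hc, k = 2)                         # cut the dendrogram into two clusters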

Bagging

However, we still have the problem of stability. The concern is that the clustering solution you arrive at might be, to a large degree, an artifact of the particular data set you have at the time: when you collect more data or take another sample and rerun the process, a very different solution appears. What we'd like is some measure of how certain we are that a data point belongs to a cluster. Techniques like fuzzy c-means clustering provide this for data that is representable in Euclidean space, but I can't find anything that does this for hierarchical clustering. I thought a way around this might be to use some kind of bagging.

Bagging (bootstrap aggregating) is usually used with supervised methods to improve their stability and accuracy. The idea is to bootstrap the sample, build a predictive model on each bootstrapped sample and then combine the results: a vote on the predicted class for classification, or an average prediction in the continuous case. If we bootstrap sample our data and build a separate hierarchical clustering solution on each sample, can we then combine the results to produce a more stable clustering solution? This is what I've tried to do in the buster package.

The buster R package

The buster function takes as arguments a distance matrix and several other parameters, including the number of bootstraps and the number of clusters you wish to have in your solution. It then does the following (a simplified sketch in plain R appears just after the list):

  • Takes a bootstrap sample of the data points (actually more of a subsample at the moment – will convert it to proper resampling shortly).
  • For each sample builds a hierarchical clustering solution using hclust.
  • For each clustering solution cuts the dendrogram at the right height to produce the desired number of clusters.
  • Works out, for each pair of data points a and b, the percentage of times that they both appeared in the same cluster. (Note that the denominator has to be the number of times they both appeared in the same sample.) Call this their rate of co-occurrence.
  • Builds another hierarchical clustering solution but this time using the rate of co-occurrence as the distance measure. The higher the rate of co-occurrence the shorter the distance. This final clustering is our stable solution.
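Here is that simplified sketch. To be clear, this is not the buster code itself: the function name and arguments (cooccurrence.cluster, n.boot, k, size) are invented for illustration, and the linkage method is left at hclust's default.

# A stripped-down sketch of the idea above, not the buster implementation itself
cooccurrence.cluster <- function(d, n.boot = 100, k = 3, size = 0.66) {
  d <- as.matrix(d)
  n <- nrow(d)
  together <- matrix(0, n, n)   # times a pair fell in the same cluster
  sampled  <- matrix(0, n, n)   # times a pair appeared in the same subsample
  for (b in 1:n.boot) {
    idx  <- sample(n, size = floor(size * n))              # subsample the data points
    cl   <- cutree(hclust(as.dist(d[idx, idx])), k = k)    # cluster and cut the dendrogram
    same <- outer(cl, cl, "==") * 1
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  co.occur <- together / pmax(sampled, 1)                  # rate of co-occurrence
  # cluster again, treating 1 - co-occurrence as the distance
  list(hclust = hclust(as.dist(1 - co.occur)), co.occur = co.occur)
}

# For example, on the iris measurements
res <- cooccurrence.cluster(dist(iris[, 1:4]), n.boot = 250, k = 3)
plot(res$hclust)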

Why should this help? With this approach, groups of data points that appear together in the same cluster regardless of sampling variation will form the cores of the clusters, while unstable data points will sit furthest from the other points and be recognisable as outliers – they will tend to join the dendrogram further up the tree.

We can also highlight the unstable data points by identifying those whose co-occurrences with all other data points have the lowest variance. Stable data points will tend to have co-occurrence rates close to one or zero, giving them a high variance relative to the unstable data points. There are practical applications where I might want to exclude such points. For example, I might be clustering some variables in order to do some dimension reduction; I would discard any variables that are on the periphery and keep the stable cores.
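Carrying on with the sketch above, flagging the least stable points by the variance of their co-occurrence rates might look like this (the 7% cut-off is just an illustration, echoing the pct.exc setting used later in the post):

# res$co.occur is the co-occurrence matrix from the sketch above.
# Stable points have rates close to 0 or 1 and hence a high variance.
stability <- apply(res$co.occur, 1, var)

# Flag, say, the bottom 7% as unstable
unstable <- stability <= quantile(stability, 0.07)
table(unstable)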

Applying buster to the iris data set

In order to visually check that it works, we will have to apply it to some data that is representable in Euclidean space. Let's have a quick look at the iris data set.
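Assuming the package is installed (it lives on GitHub rather than CRAN), the call I used looks like this (roughly: n is the number of bootstraps, k the number of clusters, size the subsample proportion and pct.exc the proportion of points to treat as unstable):

library(buster)

iris.dist <- dist(iris[, 1:4])
bhc <- buster(iris.dist, n = 250, k = 3, size = 0.66,
              method = 'ward', pct.exc = 0.07)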

The new dendrogram, based on co-occurrence, looks like this with the unstable data points faded out.

[Figure: co-occurrence dendrogram with the unstable data points faded out]

To see what this means for the iris data, here's a lattice plot with the unstable data points in pink. You'll see that these are often the boundary cases.

[Figure: lattice plot of the iris data with the unstable data points highlighted in pink]

Testing on an artificial data set

Just to help make the point about highlighting unstable data points using the variance of co-occurrence, I've created an artificial data set with three overlapping clusters.
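I won't reproduce the exact data here, but something along these lines has the same flavour (the centres and spreads below are made up):

set.seed(42)

# Three Gaussian clusters, close enough together to overlap
centres <- matrix(c(0, 0,  2, 0,  1, 2), ncol = 2, byrow = TRUE)
fake <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centres[i, 1], 0.8), rnorm(50, centres[i, 2], 0.8))
}))

bhc.fake <- buster(dist(fake), n = 250, k = 3, size = 0.66,
                   method = 'ward', pct.exc = 0.07)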

The overlap creates instability, as we can see in the interactive plot below (I put it into D3 just to make it look nicer!). Mousing over a data point shows its variance (red = high variance).


It's a work in progress. It needs more testing, and as I said I've no idea whether something similar already exists – I couldn't find anything. Feel free to fork it on GitHub.


About the Author

Simon Raper I am an RSS-accredited statistician with over 15 years' experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia, an analytics startup that uses agile methods to bring machine learning and other cutting-edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, Spark, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts - a blog on computational statistics, machine learning, data visualisation, R, python and cloud computing. It has had over 310,000 visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

Comments (14)

  1. Very cool.
    Any chance you’ll use a tanglegram plot (from the dendextend R package) to compare your solution with the one from one of the original trees?

    I would love to “showcase” such an example for a good use of the function.

    Yours,
    Tal

    • simonraper

      Hi Tal,

      That sounds like a great idea. I love your package but I've not yet fully explored it. I'll have a go in the next few days.

      Cheers

      Simon

    • simonraper

      Hi Andrej,

      Yes, I checked out the pvclust package while I was looking for a solution. The problem is that it doesn't (as far as I can see) take a distance matrix as an input. It does have an option to do correlations on the input data frame; however, I have some other distance metrics that are non-Euclidean and not covered in the options. Otherwise I agree it's similar.

      Thanks

      Simon

      • neverfox

        It is possible to use a custom distance matrix with pvclust if you take advantage of cmdscale to get coordinates that will return the same distance matrix:

        # d is the custom distance matrix (a 'dist' object)
        x   <- cmdscale(d, k = attr(d, "Size") - 1)   # coordinates whose Euclidean distances reproduce d
        pvc <- pvclust::pvclust(t(x), method.dist = "euclidean")

  2. If you want some help,
    Please write a simple code that at the end of it we will have two objects – one with some tree at the “beginning”, and one with a “final” tree.
    Once done, I can help show how to use it…

    Cheers,
    Tal

    • simonraper

      Hi Tal,

      That's a very nice tool. I did something similar with a Sankey diagram (see this post), but I like the way your solution allows us to compare the whole tree structure.

      I did the following to give a comparison of the clustering on the Iris data using hclust and buster


      iris.dist <- dist(iris[, 1:4])
      bhc <- buster(iris.dist, n = 250, k = 3, size = 0.66, method = 'ward', pct.exc = 0.07)
      tanglegram(bhc$hclust, bhc$bhclust)

      Here's the output. It is interesting that the observations that do switch groups from one solution to the other are those that I identified as unstable (check out, for example, 112 and 135). The thought is that the allocation based on buster is more stable as it is based on high co-occurrence with members of that group across many resamples.

      [Figure: tanglegram comparing the original hclust dendrogram with the buster dendrogram]

    • simonraper

      Thanks, Ty. I've just had a look at the documentation. It seems to be almost the same idea, which I find reassuring in a way!

  3. mathuna

    Hi… I saw your post. I would like to know how to compare clustering algorithms in R for text mining.

