I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, not for hierarchical clustering.
It’s not a typical everyday human dilemma. It needs a bit more explanation.
Clustering (assuming everyone is happy with this technique, but if not click here) typically works on a matrix of distances between data points. We sometimes refer to the distances as dissimilarities – the greater the distance, the more dissimilar the data points. So in a simple case the data points might be customers and the distances reflect how different the customers are in terms of demographics, purchasing behaviour, etc.
A simple way to achieve this would be to plot your data points against a set of axes representing the things you would like to include in the dissimilarity measure and then just measure the distance between the points. Scaling the data would ensure that none of the attributes are prioritised over the others. For example plot your customers by age and income (both standardised) and then measure the distance between them to determine how similar they are in age and income.
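As a concrete sketch of that calculation (in Python rather than R, and with made-up customer numbers):

```python
import numpy as np

# Hypothetical customers: columns are age and income
customers = np.array([
    [25, 30000.0],
    [40, 52000.0],
    [62, 48000.0],
])

# Standardise each column (z-scores) so neither attribute dominates
z = (customers - customers.mean(axis=0)) / customers.std(axis=0)

# Euclidean distance between the first two customers
d01 = np.sqrt(((z[0] - z[1]) ** 2).sum())
print(d01)
```

Standardising first matters: without it the income axis, measured in tens of thousands, would swamp the age axis entirely.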
This is Euclidean distance, and it is implicit in many popular and powerful clustering algorithms, for example k-means and its variants.
But how do you know if your measure of dissimilarity is representable as a Euclidean distance and therefore amenable to k-means? It’s simple enough if you started with some variables and derived your Euclidean distance from them, but it doesn’t always work this way. For example, suppose I am measuring similarity by correlation. Is the absolute value of correlation representable as a Euclidean distance? Is there some n-dimensional space where we could plot our data points such that the distance between them represented the absolute value of their correlation?
A quick check to see if your measure is ever going to be representable as a Euclidean distance is: does it satisfy the triangle inequality?

d(x, z) ≤ d(x, y) + d(y, z)

where d(x, y) is the distance between x and y.
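The check is easy to run mechanically. Here is a brute-force Python sketch (the function name is my own) that tests every triple of points in a distance matrix:

```python
import numpy as np

def satisfies_triangle_inequality(D):
    """Check d(x, z) <= d(x, y) + d(y, z) for every triple in a distance matrix."""
    n = len(D)
    for x in range(n):
        for y in range(n):
            for z in range(n):
                if D[x][z] > D[x][y] + D[y][z] + 1e-12:
                    return False
    return True

# A genuine metric (points on a line) passes...
line = np.abs(np.subtract.outer([0.0, 1.0, 3.0], [0.0, 1.0, 3.0]))
print(satisfies_triangle_inequality(line))  # True

# ...while an arbitrary dissimilarity matrix need not
bad = np.array([[0.0, 1.0, 5.0],
                [1.0, 0.0, 1.0],
                [5.0, 1.0, 0.0]])
print(satisfies_triangle_inequality(bad))  # False: 5 > 1 + 1
```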
For example, love and hate (assuming they are quantifiable) do not satisfy the triangle inequality. If x loves y and y loves z, this places no constraint on the degree to which x loves z. It could easily be more than the sum of the love of x for y and y for z!
The same is true of the absolute value of correlation. Here, if you are interested, is a proof.
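You can also see the failure numerically. In this Python sketch (a toy construction of my own, not the linked proof), x is uncorrelated with y and y with z, yet x and z are perfectly correlated – the same pattern as the love example:

```python
import numpy as np

# Three mean-zero variables: x and z are identical, y is orthogonal to both
x = np.array([1.0, -1.0, 1.0, -1.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
z = x.copy()

def d(a, b):
    # Treat absolute correlation as the "distance", as in the text
    return abs(np.corrcoef(a, b)[0, 1])

# d(x, y) and d(y, z) are 0, yet d(x, z) is 1 (up to floating point)
print(d(x, y), d(y, z), d(x, z))

# Triangle inequality fails: d(x, z) > d(x, y) + d(y, z)
print(d(x, z) > d(x, y) + d(y, z))
```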
The triangle inequality is actually part of the defining criteria for the more general concept of a metric space (thanks to Richard Brooker for pointing this out to me). In other words it’s a necessary but not a sufficient condition for representability as a Euclidean distance. However it works well as a way of ruling out candidates.
If your measure isn’t representable in Euclidean space then you have hierarchical clustering to fall back on: it builds up clusters point by point, assigning the next point to its closest cluster, and can handle almost any kind of dissimilarity measure.
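For instance, here is a Python sketch (scipy’s linkage standing in for R’s hclust, with an invented dissimilarity matrix) of hierarchical clustering on a precomputed, non-Euclidean dissimilarity:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# An arbitrary dissimilarity matrix for five data points (made-up numbers)
D = np.array([
    [0.0, 0.2, 0.9, 0.8, 0.7],
    [0.2, 0.0, 0.8, 0.9, 0.5],
    [0.9, 0.8, 0.0, 0.1, 0.7],
    [0.8, 0.9, 0.1, 0.0, 0.6],
    [0.7, 0.5, 0.7, 0.6, 0.0],
])

# linkage expects the condensed (upper-triangle) form of the matrix
Z = linkage(squareform(D), method="average")

# Cut the dendrogram to give two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The algorithm never needs coordinates for the points, only the pairwise dissimilarities.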
However we still have the problem of stability. The concern is that the clustering solution you arrive at might be, to a high degree, an artefact of the particular data set you have at the time. When you collect more data or take another sample and rerun the process, a very different solution may appear. What we’d like is some measure of the degree of certainty we have about whether a data point is a member of a cluster. Techniques like fuzzy c-means clustering provide this for data that is representable in Euclidean space, but I can’t find anything that does this for hierarchical clustering. I thought a way around this might be to use some kind of bagging.
Bagging (bootstrap aggregating) is usually used with supervised methods to improve their stability and accuracy. The idea is to bootstrap the sample, build a predictive model on each bootstrapped sample and then combine the results to produce, for classification, a vote on the predicted class and, for the continuous case, an average prediction. If we bootstrap sample our data and build a separate hierarchical clustering solution on each sample, can we then combine the results to produce a more stable clustering solution? This is what I’ve tried to do in the buster package.
The buster R package
The buster function takes as arguments a distance matrix and several other parameters, including the number of bootstraps and the number of clusters you wish to have in your solution. It then does the following:
- Takes a bootstrap sample of the data points (actually more of a subsample at the moment – will convert it to proper resampling shortly).
- For each sample builds a hierarchical clustering solution using hclust.
- For each clustering solution cuts the dendrogram at the right height to produce the desired number of clusters.
- Works out, for each pair of data points a and b, the percentage of times that they both appeared in the same cluster. (Note that the denominator has to be the number of times they both appeared in the same sample.) Call this their rate of co-occurrence.
- Builds another hierarchical clustering solution but this time using the rate of co-occurrence as the distance measure. The higher the rate of co-occurrence the shorter the distance. This final clustering is our stable solution.
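The steps above can be sketched as follows (in Python, with scipy standing in for hclust; this is just the idea, not buster’s actual code, and the function and parameter names are mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform, pdist

def bootstrap_hclust(X, n_clusters=3, n_boot=100, frac=0.8, seed=0):
    """Stable hierarchical clustering via co-occurrence rates across subsamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))  # times i and j fell in the same cluster
    sampled = np.zeros((n, n))   # times i and j were in the same subsample
    for _ in range(n_boot):
        # 1. subsample the data points
        idx = rng.choice(n, size=int(frac * n), replace=False)
        # 2. hierarchical clustering on the subsample
        Z = linkage(pdist(X[idx]), method="average")
        # 3. cut the dendrogram to the desired number of clusters
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        # 4. tally co-occurrences, with the co-sampling count as denominator
        for a in range(len(idx)):
            for b in range(len(idx)):
                sampled[idx[a], idx[b]] += 1
                together[idx[a], idx[b]] += labels[a] == labels[b]
    co = np.divide(together, sampled,
                   out=np.zeros_like(together), where=sampled > 0)
    # 5. final clustering: high co-occurrence = short distance
    D = 1.0 - co
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust"), co

# Two tight, well-separated blobs should come out as two stable clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, size=(10, 2)),
               rng.normal(5, 0.1, size=(10, 2))])
labels, co = bootstrap_hclust(X, n_clusters=2, n_boot=30)
print(labels)
```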
Why should this help? Well, using this approach, groups of data points that appear together in the same cluster regardless of sampling variation will form the cores of the clusters, and data points which are unstable will be furthest from other points and recognisable as outliers – they will tend to join the dendrogram further up the tree.
We can also highlight the unstable data points by identifying those whose co-occurrences with all other data points have the lowest variance. Stable data points will tend to have co-occurrence rates close to one or zero, giving them a high variance relative to the unstable data points. There are practical applications where I might want to exclude such points. For example, I might be clustering some variables in order to do some dimension reduction, and would discard any variables that are on the periphery and keep the stable cores.
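As a toy illustration of the variance screen (Python again; the co-occurrence matrix is invented, with point 2 playing the boundary case):

```python
import numpy as np

# Toy co-occurrence matrix: points 0-1 and 3-4 form stable pairs, while
# point 2 sits on the boundary and co-occurs with everyone about half the time
co = np.array([
    [1.0, 1.0, 0.5, 0.0, 0.0],
    [1.0, 1.0, 0.4, 0.0, 0.0],
    [0.5, 0.4, 1.0, 0.6, 0.5],
    [0.0, 0.0, 0.6, 1.0, 1.0],
    [0.0, 0.0, 0.5, 1.0, 1.0],
])

# Variance of each point's co-occurrences with the *other* points
mask = ~np.eye(len(co), dtype=bool)
variances = np.array([co[i][mask[i]].var() for i in range(len(co))])
print(variances.argmin())  # point 2 has the lowest variance: the least stable
```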
Applying buster to the iris data set
In order to visually check that it works, we will have to apply it to some data that is representable in Euclidean space. Let’s have a quick look at the iris data set.
The new dendrogram based on co-occurrence looks like this with unstable data points faded out. Click in to get a better look.
To see what this means for the Iris data here’s a lattice plot with the unstable data points in pink. You’ll see that these are often the boundary cases.
Testing on an artificial data set
Just to help make the point about highlighting unstable data points using the variance of co-occurrence I’ve created an artificial data set with three clusters overlapping.
The overlapping creates instability as we can see in this plot (I put it into D3 just to make it look nicer!) Mouse over to show the variance of the data points (red = high variance).
It’s a work in progress. It needs more testing, and as I said, I’ve no idea whether something similar already exists – I couldn’t find anything. Feel free to fork it on GitHub.