I wanted a way of understanding how a clustering solution will change as more data points are added to the dataset on which it is built.
To explain this a bit more, let’s say you’ve built a segmentation on customers, or products, or tweets (something that is likely to grow) using one clustering method or another, say hierarchical clustering. Sooner or later you’ll want to rebuild this segmentation to incorporate the new data, and it would be nice to know how much the segmentation will change as a result.
One way of assessing this would be to take the data you have now, roll it back to a previous point in time and then add new chunks of data sequentially, each time rebuilding the clustering solution and comparing it to the one before.
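The roll-back-and-rebuild loop can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R code behind this post: `complete_linkage_clusters` and `incremental_solutions` are hypothetical names, the clustering is a naive agglomerative routine using complete linkage (the default method in R's hclust), and cutting every rebuild at a fixed number of clusters `k` is an assumption.

```python
from itertools import combinations
import math


def complete_linkage_clusters(points, k):
    """Naive agglomerative clustering with complete linkage, cut at k clusters.

    Returns one label per point. Clusters are numbered by the smallest point
    index they contain, so labels are deterministic but otherwise arbitrary.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: distance between the two farthest members.
            d = max(math.dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # a < b always holds, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(sorted(clusters, key=min), start=1):
        for i in members:
            labels[i] = label
    return labels


def incremental_solutions(points, start, step, k,
                          cluster_fn=complete_linkage_clusters):
    """Rebuild the clustering on points[:start], points[:start + step], ..."""
    sizes = range(start, len(points) + 1, step)
    return [cluster_fn(points[:n], k) for n in sizes]


# Toy data ordered by arrival time: two tight groups in the plane.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0),
          (0.2, 0.1), (5.2, 5.1)]
solutions = incremental_solutions(points, start=4, step=1, k=2)
for labels in solutions:
    print(len(labels), labels)
```

Each entry of `solutions` is one rebuild, and comparing neighbouring entries shows where points moved. Cluster numbers are not guaranteed to mean the same thing from one rebuild to the next, which is why the comparison is better done on the points themselves.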
Seeing what’s going on
Having recorded the clusters that result from incrementally adding data, the next problem is to understand what is going on. I thought a good option would be a Sankey diagram. I’ve tested this out on the US crime data that comes with R. I built seven different clustering solutions using hclust, starting from the original 20-point data set and adding five new data points each time. I used the Google Charts Sankey layout, which itself is derived from the D3 layout. Here’s the result.
A Sankey diagram showing changes to clusters as more data is added
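The links in a diagram like this are counts of points moving from a cluster in one rebuild to a cluster in the next. Here is a rough sketch of that counting step in Python, producing source, target and weight rows of the kind a Sankey layout takes as input. The function name `sankey_links`, the stage-prefixed node names and the separate "new" source node for incoming points are all assumptions made for illustration, not a description of how the diagram above was built.

```python
from collections import Counter


def sankey_links(solutions, new_label="new"):
    """Turn successive label lists into (source, target, weight) rows.

    solutions[s][i] is the cluster label of point i in rebuild s. Each later
    list is assumed to extend the earlier one with the newly added points.
    """
    rows = []
    for s in range(len(solutions) - 1):
        before, after = solutions[s], solutions[s + 1]
        flows = Counter()
        for i, target in enumerate(after):
            # Points absent from the earlier rebuild arrive from a "new" node.
            source = before[i] if i < len(before) else new_label
            # Prefix names with the stage number so that cluster 1 in rebuild 0
            # and cluster 1 in rebuild 1 are distinct nodes and no link loops.
            flows[(f"{s}:{source}", f"{s + 1}:{target}")] += 1
        rows.extend((src, dst, n) for (src, dst), n in sorted(flows.items()))
    return rows


# Three rebuilds on 4, 5 and 6 points (made-up labels).
solutions = [
    [1, 1, 2, 2],
    [1, 1, 2, 2, 1],
    [1, 1, 1, 2, 1, 2],
]
links = sankey_links(solutions)
for row in links:
    print(row)
```

Because the rows count individual points, it does not matter if the clustering routine numbers its clusters differently between rebuilds: a renamed cluster simply shows up as one thick link.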
What’s happening in the diagram
On the left-hand side you see five clusters built on the original 20-point data set. Adding the next five points doesn’t make a lot of difference (you can see the new data coming in at the bottom of each node).
With the next five data points it gets more interesting, as a piece of cluster four splits off. After that, cluster two breaks in half, with the majority of it joining cluster one. And so on.
Assessing the original solution, I’d say that the third, fourth and fifth clusters remained reasonably intact. However, the original distinction between the first and second clusters didn’t stand up as more data was added.
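A judgement like "reasonably intact" can also be given a rough number. One possible measure, sketched below in Python, is the largest share of an original cluster's points that still sit together in a single cluster of the final solution. The name `cluster_retention` and the measure itself are my own illustration, and the labels in the example are made up rather than taken from the crime data.

```python
from collections import Counter, defaultdict


def cluster_retention(first, last):
    """For each cluster in `first`, the largest fraction of its points that
    share a single cluster in `last` (1.0 means the cluster stayed together).

    `last` may be longer than `first`; the extra points are ignored.
    """
    destinations = defaultdict(Counter)
    for i, label in enumerate(first):
        destinations[label][last[i]] += 1
    return {
        label: max(counts.values()) / sum(counts.values())
        for label, counts in sorted(destinations.items())
    }


# Made-up labels: cluster 2 splits in half, clusters 1 and 3 hold together.
first = [1, 1, 2, 2, 2, 2, 3, 3]
last = [1, 1, 1, 1, 2, 2, 3, 3, 3, 1]
retention = cluster_retention(first, last)
print(retention)
```

This only captures splitting. A cluster that absorbs half of its neighbour, as cluster one does here, still scores 1.0, so a low score for one cluster is a prompt to look at where its points went.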
As always, the code is available. Here is the gist.
And here it is on JSFiddle.