June 20, 2014 Simon Raper

Visualising cluster stability using Sankey diagrams

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone
The problem

I wanted a way of understanding how a clustering solution will change as more data points are added to the dataset on which it is built.

To explain this a bit more, let’s say you’ve built a segmentation on customers, or products, or tweets (something that is likely to increase) using one or other clustering solution, say hierarchical clustering. Sooner or later you’ll want to rebuild this segmentation to incorporate the new data and it would be nice to know how much the segmentation will change as a result.

One way of assessing this would be to take the data you have now, roll it back to a previous point in time and then add new chunks of data sequentially each time rebuilding the clustering solution and comparing it to the one before.

Seeing what’s going on

Having recorded the different clusters that result from incrementally adding data, the next problem is to understand what is going on. I thought a good option would be a Sankey diagram. I’ve tested this out on the US crime data that comes with R. I built seven different clustering solutions using hclust, each time adding five new data points to the original 20 point data set. I used the google charts Sankey layout which itself is derived from the D3 layout. Here’s the result.

A Sankey diagram showing changes to clusters as more data is added

What’s happening in the diagram

On the left hand side you see five clusters built on the original 20 point data set. Adding the next five points doesn’t make a lot of difference (you can see the new data coming in at the bottom of each node.)

With next five data points it gets more interesting as a piece of cluster fours splits off. After that cluster two breaks in half with the majority of it joining cluster one. And so on.

Assessing the original solution I’d say that the third, fourth and fifth clusters remained reasonably intact. However the original distinction between the first and second clusters didn’t stand up as more data was added.

The code

As always the code is available. Here is the gist

And here it is on JSFiddle

Tagged: , , , ,

About the Author

Simon Raper I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia an analytics startup that uses agile methods to bring machine learning and other cutting edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, spark, Hadoop), interactive visualisatons (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts - a blog on computational statistics, machine learning, data visualisation, R, python and cloud computing. It has had over 310 K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

Machine Learning and Analytics based in London, UK