I’ve just added (pseudo) random number generation for Gamma and Dirichlet distributions to the glasseye package (a necessary new part of the toolkit, as I’ll be adding some Bayesian features shortly).
To celebrate I’ve created a wholly pointless new chart in the library: the animated density plot. I admit it adds nothing (apart from showing how the distribution is slowly built up from the randomly generated data), but sometimes it’s good to be pointless.
Here’s the gamma(5,1) distribution.
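For a feel of what the animation is showing, here’s a minimal numpy sketch (this is illustration only, not glasseye’s own API) of drawing gamma(5, 1) samples and watching the empirical picture settle as points accumulate:

```python
import numpy as np

# A minimal numpy sketch (not glasseye's own API): draw gamma(5, 1) samples
rng = np.random.default_rng(42)
samples = rng.gamma(shape=5.0, scale=1.0, size=10_000)

# As more and more points arrive, the empirical mean settles on shape * scale = 5,
# which is the cumulative build-up the animated density plot is showing
running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)

# Dirichlet draws are probability vectors: non-negative entries summing to one
d = rng.dirichlet(alpha=[1.0, 2.0, 3.0])
print(round(float(running_mean[-1]), 2), round(float(d.sum()), 2))
```

The same idea drives the animated plot: redraw the density estimate after each new batch of samples and watch it converge on the true curve.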
You can find the latest version of glasseye on GitHub.
R2D3 is a new package for R I’ve been working on. As the name suggests this package uses R to produce D3 visualisations. It builds on some work I previously blogged about here.
There are some similar packages already on CRAN, notably rjson and d3Network. However, I found that these packages cover parts of the process (creating a JSON or creating a D3) but not the whole process, and they don’t ensure the JSON is in the right format for the D3. That was the thinking behind this package: I was aiming to create an end-to-end process for converting R objects into D3 visualisations. When I mentioned it to [email protected] he was keen to contribute, so we’ve been collaborating on it over the last few weeks. It’s by no means finished, but I think it contains enough that it’s worth sharing.
Hi all, I’m James. This is my first blog post for Coppelia. Thanks to Simon for encouraging me to do this.
I’ve been doing a lot of hierarchical clustering in R and have started to find the standard dendrogram plot fairly unreadable once you have more than a couple of hundred records. I’ve recently been introduced to the D3.js gallery and wondered if I could hack something better together. I found this dendrogram I liked and started to play. I soon realised that in order to get my data into it I needed a nested JSON.
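The original work here is in R, but the shape of the conversion is easy to sketch in Python. R’s hclust returns a merge matrix where negative ids are original observations and positive ids refer to earlier merges; walking that structure recursively yields exactly the nested JSON a D3 dendrogram expects. The toy merge history and the `node` helper below are mine, purely for illustration:

```python
import json

# Toy merge history in hclust style: each row merges two earlier clusters.
# Negative ids are original points; positive ids refer to earlier merge rows
# (1-based), mirroring R's hclust $merge matrix.
labels = ["A", "B", "C", "D"]
merges = [(-1, -2), (-3, -4), (1, 2)]  # A+B, C+D, then the two pairs

def node(ref):
    """Turn an hclust-style reference into a nested dict for a D3 dendrogram."""
    if ref < 0:                        # a leaf: an original observation
        return {"name": labels[-ref - 1]}
    left, right = merges[ref - 1]      # an internal node: a previous merge
    return {"name": "", "children": [node(left), node(right)]}

root = node(len(merges))               # the final merge is the tree's root
print(json.dumps(root))
```

Dumping `root` with `json.dumps` gives the `{"name": ..., "children": [...]}` nesting that D3 tree layouts consume directly.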
I wanted a way of understanding how a clustering solution will change as more data points are added to the dataset on which it is built.
To explain this a bit more, let’s say you’ve built a segmentation of customers, or products, or tweets (something that is likely to grow), using one clustering method or another, say hierarchical clustering. Sooner or later you’ll want to rebuild this segmentation to incorporate the new data, and it would be nice to know how much the segmentation will change as a result.
One way of assessing this would be to take the data you have now, roll it back to a previous point in time, and then add new chunks of data sequentially, each time rebuilding the clustering solution and comparing it to the one before.
Seeing what’s going on
Having recorded the different clusters that result from incrementally adding data, the next problem is understanding what is going on. I thought a good option would be a Sankey diagram. I’ve tested this out on the US crime data that comes with R. I built seven different clustering solutions using hclust, each time adding five new data points to the original 20-point data set. I used the Google Charts Sankey layout, which is itself derived from the D3 layout. Here’s the result.
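The data a Sankey layout needs is just a weighted list of source–target links, which falls straight out of counting label pairs between two successive solutions. A stdlib-only sketch with made-up labels (the counts and cluster ids here are illustrative, not the US crime results):

```python
from collections import Counter

# Cluster labels for the same observations under two successive solutions
before = [1, 1, 2, 2, 2, 3]
after  = [1, 2, 2, 2, 3, 3]

# Each (source, target) pair becomes a Sankey link; the count is its weight
links = Counter(zip(before, after))
for (src, dst), weight in sorted(links.items()):
    print(f"cluster {src} -> cluster {dst}: {weight}")
```

Feeding those `(source, target, weight)` triples to the Sankey layout draws one band per link, so stable clusters show up as thick straight bands and churn shows up as crossings.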
I’ve adapted the code from this Stack Overflow answer (from D3’s creator Mike Bostock) so that you can highlight neighbours in a force-directed layout. Just double-click on a node to fade out any non-immediate neighbours, and double-click again to bring them back.
A confusion matrix is a confusing thing. There are a surprising number of useful statistics that can be built out of just four numbers, and the links between them are not always obvious. The terminology doesn’t help (is a true negative an observation that is truly in the class but classified negative, or one that is negative and has truly been classified as such?), and neither does the fact that many of the statistics have more than one name (recall = sensitivity = power!).
The interactive example
I have trained my classifier to separate wolves from sheep. Let’s say sheep is a positive result and wolf is a negative result (sorry, wolves). I now need to test it on my test set, which consists of a number of wolves and a number of sheep.
Say my classifier correctly identifies some of the sheep as sheep (the true positives) and some of the wolves as wolves (the true negatives); the rest are misclassified (false negatives and false positives respectively).
This gives us the confusion matrix below:
Now some statistics that need to be untangled!
Precision (aka positive predictive value) is TP/(TP+FP).
Recall (aka sensitivity, power) is TP/(TP+FN).
Specificity is TN/(TN+FP).
Negative predictive value is TN/(TN+FN).
Accuracy is (TP+TN)/(TP+TN+FP+FN).
False discovery rate is FP/(FP+TP).
False positive rate (aka false alarm rate, fall-out) is FP/(FP+TN).
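To make the arithmetic concrete, here is a short Python sketch with hypothetical counts (the numbers are made up purely for illustration; in the interactive version you set them yourself):

```python
# Hypothetical counts for the sheep (positive) vs wolf (negative) test set
tp, fn = 35, 5    # sheep classified as sheep / sheep classified as wolves
tn, fp = 45, 15   # wolves classified as wolves / wolves classified as sheep

precision   = tp / (tp + fp)              # positive predictive value
recall      = tp / (tp + fn)              # aka sensitivity, power
specificity = tn / (tn + fp)
npv         = tn / (tn + fn)              # negative predictive value
accuracy    = (tp + tn) / (tp + tn + fp + fn)
fdr         = fp / (fp + tp)              # false discovery rate
fpr         = fp / (fp + tn)              # false positive rate, fall-out

print(precision, recall, specificity, npv, accuracy, fdr, fpr)
```

Note the pairings: precision and the false discovery rate share a denominator and sum to one, as do recall and the false negative rate, and specificity and the false positive rate.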