November 19, 2013

# A confused tangle

A confusion matrix is a confusing thing. There’s a surprising number of useful statistics that can be built out of just four numbers and the links between them are not always obvious. The terminology doesn’t help (is a true negative an observation that is truly in the class but classified negative or one that is negative and has truly been classified as such!) and neither does the fact that many of the statistics have more than one name (Recall=sensitivity=power!).

To unravel it a little i’ve used the tangle js library to create an interactive document that shows how these values are related. The code can be found here.

###### The interactive example

I have trained my classifier to separate wolves from sheep. Let’s say sheep is a positive result and wolf is a negative result (sorry wolves). I now need to test it on my test set. This consists of wolves and sheep That&#8217s test subjects altogether.

Say my classifier correctly identifies sheep as sheep (true positives) and wolves as wolves (true negatives)

This gives us the confusion matrix below:

Your browser does not support the HTML5 canvas tag.

Now some statistics that need to be untangled!

Precision (aka positive predictive value) is TP/(TP+FP). In our case /( + ) =

Recall (aka sensitivity, power) is TP/(TP+FN). In our case /( + ) =

Specificity is TN/(TN+FP). In our case /( + ) =

Negative predictive value is TN/(TN+FN). In our case /( + ) =

Accuracy is (TP+TN)/(TP+TN+FP+FN). In our case ( + )/( + + + ) =

False discovery rate is FP/(FP+TP). In our case /( + ) =

False positive rate (aka false alarm rate, fall-out) is FP/(FP+TN). In our case /( + ) =

Simon Raper I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia an analytics startup that uses agile methods to bring machine learning and other cutting edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, spark, Hadoop), interactive visualisatons (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts - a blog on computational statistics, machine learning, data visualisation, R, python and cloud computing. It has had over 310 K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

### Comment (1)

1. It\’s spooky how clever some ppl are. Thsnka!