A confused tangle


A confusion matrix is a confusing thing. There’s a surprising number of useful statistics that can be built out of just four numbers, and the links between them are not always obvious. The terminology doesn’t help (is a true negative an observation that is truly in the class but classified negative, or one that is negative and has truly been classified as such?) and neither does the fact that many of the statistics have more than one name (recall = sensitivity = power!).

To unravel it a little, I’ve used the Tangle JavaScript library to create an interactive document that shows how these values are related. The code can be found here.

The interactive example

I have trained my classifier to separate wolves from sheep. Let’s say sheep is a positive result and wolf is a negative result (sorry, wolves). I now need to test it on my test set, which consists of a number of wolves and a number of sheep. Together they make up the test subjects.

Say my classifier correctly identifies some of the sheep as sheep (true positives) and some of the wolves as wolves (true negatives). The sheep it mistakes for wolves are false negatives (FN), and the wolves it mistakes for sheep are false positives (FP).

This gives us the confusion matrix below:

[Interactive confusion matrix showing the TP, FN, FP and TN counts]

Now some statistics that need to be untangled!

Precision (aka positive predictive value) is TP/(TP+FP).

Recall (aka sensitivity, power) is TP/(TP+FN).

Specificity is TN/(TN+FP).

Negative predictive value is TN/(TN+FN).

Accuracy is (TP+TN)/(TP+TN+FP+FN).

False discovery rate is FP/(FP+TP).

False positive rate (aka false alarm rate, fall-out) is FP/(FP+TN).
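
For anyone who would rather poke at the numbers in R than in the browser, here is a minimal sketch of the same formulas. The function name `confusionStats` and the counts passed to it are made up for illustration; they are not the values from the interactive example.

```r
# Sketch only: computes the statistics listed above from the four cells
# of a confusion matrix. The name confusionStats is hypothetical.
confusionStats <- function(tp, fp, fn, tn) {
  c(
    precision   = tp / (tp + fp),                   # positive predictive value
    recall      = tp / (tp + fn),                   # sensitivity, power
    specificity = tn / (tn + fp),
    npv         = tn / (tn + fn),                   # negative predictive value
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    fdr         = fp / (fp + tp),                   # false discovery rate
    fpr         = fp / (fp + tn)                    # false alarm rate, fall-out
  )
}

# Assumed counts: 45 sheep (40 spotted, 5 missed) and
# 55 wolves (45 spotted, 10 mistaken for sheep).
stats <- confusionStats(tp = 40, fp = 10, fn = 5, tn = 45)
print(round(stats, 3))
```

Two of the links between the statistics fall straight out of the formulas: the false discovery rate is 1 minus precision, and the false positive rate is 1 minus specificity.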

Box Me


Here’s a short R function I wrote to turn a long data set into a wide one for viewing. It’s not the most exciting function ever, but I find it quite useful when my screen is wide and short. It simply cuts the data set horizontally into equal-sized pieces and puts them side by side. Lazy, I know!

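#' Turns an overly long data frame into something easier to look at
#' @param d A dataframe or matrix
#' @param nrow The number of rows you would like to see in the new dataframe
#' @examples
#' test.set<-data.frame(x=rnorm(100), y=rnorm(100))
#' boxMe(test.set, 18)
#' library(ggplot2)
#' boxMe(diamonds, 10)
boxMe <- function(d, nrow) {
  # Number of rows and columns (dim() avoids confusion with the nrow argument)
  r <- dim(d)[1]
  c <- dim(d)[2]
  rem <- r %% nrow                       # Rows left over after the last full fold
  reps <- floor(r / nrow)                # Number of full folds
  s <- seq(1, reps * nrow, by = nrow)    # Starting row of each fold
  box <- d[1:nrow, ]                     # First block of columns
  for (i in s[-1]) {
    ap <- d[i:(i + nrow - 1), ]          # Next fold of nrow rows
    box <- cbind(box, ap)
  }
  # Append remainder, padded with NA rows so it is nrow rows tall
  if (rem > 0) {
    rem.rows <- d[(reps * nrow + 1):r, ]
    n.null.rows <- nrow - rem
    null.block <- as.data.frame(matrix(rep(NA, (n.null.rows * c)), nrow = n.null.rows))
    names(null.block) <- names(d)        # Column names must match for rbind
    last.block <- rbind(rem.rows, null.block)
    box <- cbind(box, last.block)
  }
  box
}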

Excuses and Opportunities


Regular readers of the blog will have noticed that I haven’t been a regular contributor to the blog over the last year. There are some good reasons/excuses for that, predominantly around buying and renovating our new home, getting married and starting a new job at Facebook. Having an Arsenal season ticket also doesn’t help.

Now that most of the renovations are done, and I’m five months into married life and 13 months into the new job, I’m hoping to find time for some more writing, so watch this space.

In the meantime, there is an awesome role for a senior researcher available in my team, working to understand and quantify advertising effectiveness. The full description is available here and it’s based in either New York or California.

Needless to say, Facebook is an incredible place to work, with more data than you could ever hope for and lots of interesting, yet unanswered, questions to ask of the data.

Within the team, you’ll be working with other researchers who have a strong background in data science and research more widely, including Eurry Kim and Dan Chapsky who both recently joined the team.

If you’re interested in the role and think you’d be a good fit, then please submit your CV/Resume or alternatively, if you’ve got any questions, please reach out using the form below!


Machine Learning and Analytics based in London, UK