Scoring a Neural Net using R on AWS

[Figure: nnet scoring plot]

One of the drawbacks of R has been its limitations with big datasets. It stores everything in RAM, so once you have more than 100K records your PC really starts to slow down. However, since AWS allows you to use a machine of any size, you can now consider using R for scoring out your models on larger datasets. Just fire up a meaty EC2 instance with the RStudio Amazon Machine Image (AMI) and off you go.

With this in mind I wondered how long it would take to score up a neural net, depending on how many variables were involved and how many records needed to be scored. There was only one way to find out.

Read more

Freehand Diagrams with Adobe Ideas

Freehand diagrams have two big virtues: they are quick and they are unconstrained.

I used to use a notebook (see What are degrees of freedom?) but recently I got an iPad, and then I found Adobe Ideas. It's completely free and has just the right level of complexity for getting ideas down fast.

A diagram for teaching machine learning

It takes a bit of perseverance to get into a new habit but it pays off. Here are some examples and some tips on how to make it work as an alternative to paper.

First, don't bother using your fingers: you won't have nearly enough control. Get a stylus. I use this one, which is fairly cheap but works fine. I've noticed that it moves more freely when the iPad is in the landscape position, so I always draw with it that way round. I don't know why this is, but I guess it's something to do with the glass.

The confusion matrix

Neighbourhood of C major

When you first open Adobe Ideas it's tempting to see the page in front of you as something that needs to be filled, as you would a piece of paper. That's the wrong way to approach it, as you'll get overly distracted by the task of orienting your work on the page. Instead, treat the page as an enormous workboard: start by zooming in somewhere in the middle and work outwards.

A diagram for teaching machine learning

It's vector-graphics based, so in principle infinitely deep: you're only limited by how thin the pens get. Draw your diagram without any thought for the borders, then right at the end scale it up to fit the page.

The next thing that might strike you is that there's no copy and paste. Ideas is part of Adobe's creative suite, along with Photoshop and Illustrator, and as such inherits their layer-based approach to image editing. If you want to be able to adjust the positioning of parts of your diagram (say the key with respect to the rest of it), you would be wise to put these on separate layers. You'd think this would be deeply frustrating; however, it's one of those constraints that somehow makes you work better, as you think more about your layout.

You are allowed 10 layers in total with one photo layer. I sometimes place an image of a grid on this layer if I need grid lines to work to.

If you are working quickly you don't want to spend too much time fretting about the colour palette. Again, Ideas forces you into simplicity by limiting your colours to five at a time. Even better, you can connect to Kuler, provided its app is also on your iPad. This gives you access to a huge range of palettes. The Kuler tool, which sucks the palette out of an image, is also there by default in Ideas if you click on themes. This allows you to pull the colour scheme out of any image on the photo layer.

Some Diagram Elements

When it comes to using the pen tool I tend to stick to the marker and constantly zoom in and out as I draw: right in for writing text and out again for drawing shapes. You soon get used to flipping between using your fingers to zoom and pan and your stylus to draw. It's worth noting down the width of the pens you are using, as it's easy to lose track of where you are, and a diagram made up of many different line widths looks bad. I tend to use two widths: 6.5 and 3.

The toughest thing to pull off is anything that requires large-scale swipes across the surface of the iPad, for example long straight lines or wide circles. It is easier to do these on a small scale with a thin pen and then enlarge the image.

One thing I didn't realise for a while is that holding the pen tool for a few seconds in any completely circumscribed area floods that area with colour. This is very useful for filling shapes and creating colourful backgrounds.

I try to vary styles and colours, although it takes an effort to keep this up if you are taking notes or in a rush. I've listed in the panel on the left some of the diagram elements I've experimented with.

Adobe Ideas looks at first sight too limited to be particularly powerful but take a look at How to Use Adobe Ideas by Michael Startzman to see what is achievable by a professional!

A confused tangle

A confusion matrix is a confusing thing. There's a surprising number of useful statistics that can be built out of just four numbers, and the links between them are not always obvious. The terminology doesn't help (is a true negative an observation that is truly in the class but classified negative, or one that is negative and has truly been classified as such?), and neither does the fact that many of the statistics have more than one name (recall = sensitivity = power!).

To unravel it a little I've used the Tangle JS library to create an interactive document that shows how these values are related. The code can be found here.

The interactive example

I have trained my classifier to separate wolves from sheep. Let's say sheep is a positive result and wolf is a negative result (sorry, wolves). I now need to test it on my test set, which consists of a number of wolves and a number of sheep (in the interactive version you set these counts, and the total number of test subjects is worked out for you).

Say my classifier correctly identifies a certain number of the sheep as sheep (the true positives) and a certain number of the wolves as wolves (the true negatives).

This gives us the confusion matrix below:

[Figure: interactive confusion matrix]

Now some statistics that need to be untangled!

Precision (aka positive predictive value) is TP/(TP+FP).

Recall (aka sensitivity, power) is TP/(TP+FN).

Specificity is TN/(TN+FP).

Negative predictive value is TN/(TN+FN).

Accuracy is (TP+TN)/(TP+TN+FP+FN).

False discovery rate is FP/(FP+TP).

False positive rate (aka false alarm rate, fall-out) is FP/(FP+TN).

In the interactive version each of these is evaluated on the counts you chose above.
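Stripped of the interactive widgets, the definitions above are easy to check in code. Here is a minimal Python sketch (Python is used purely for illustration, and the sheep/wolf counts are invented for the example):

```python
def confusion_stats(tp, fp, fn, tn):
    """Compute the standard statistics from the four confusion matrix counts."""
    return {
        "precision": tp / (tp + fp),               # aka positive predictive value
        "recall": tp / (tp + fn),                  # aka sensitivity, power
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),                     # negative predictive value
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "fdr": fp / (fp + tp),                     # false discovery rate
        "fpr": fp / (fp + tn),                     # aka false alarm rate, fall-out
    }

# Hypothetical test set: 100 sheep (positives) and 100 wolves (negatives);
# 90 sheep and 80 wolves are classified correctly.
stats = confusion_stats(tp=90, fp=20, fn=10, tn=80)
print(stats["precision"])  # 90/110 ≈ 0.818
print(stats["recall"])     # 90/100 = 0.9
```

Note how precision and recall share a numerator (TP) but differ in which error they penalise: precision suffers from wolves let in (FP), recall from sheep turned away (FN).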

Book Recommendations from Beyond the Grave: A Mahout Example

In H. P. Lovecraft's The Case of Charles Dexter Ward the villainous Curwen, having taken possession of the body of Charles Dexter Ward, uses a combination of chemistry and black magic to bring back from the dead the wisest people who have ever lived. He then tortures them for their secrets. Resurrection of the dead would be a bit of an overclaim for machine learning, but we can get some of the way there: we can bring them back for book recommendations!

The Case of Charles Dexter Ward

It’s a grandiose claim and you can judge for yourself how successful it is. Really I just wanted an opportunity to experiment with Apache Mahout and to brush up on my Java. I thought I’d share the results.

Apache Mahout is a scalable machine learning library written in Java (see previous post). The recommender part of the library takes lists of users and items (for example, all the users on a website, the books they purchased and perhaps even their ratings of those books). Then, for any given user, say Smith, it works out the users that are most similar to him and compiles a list of recommendations based on their purchases (ignoring, of course, anything Smith already has). It can get more complicated but that's the gist of it.
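To make that gist concrete, here is a toy sketch of user-based recommendation in Python. This is not Mahout's code: it uses simple Jaccard similarity over boolean purchases, and all the user names and purchase data are invented.

```python
def jaccard(a, b):
    """Similarity between two sets of items: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user, purchases, n_neighbours=2):
    """Recommend items owned by the most similar users but not by `user`."""
    others = [u for u in purchases if u != user]
    # Rank the other users by similarity to `user` -> the "user neighbourhood"
    neighbours = sorted(others,
                        key=lambda u: jaccard(purchases[user], purchases[u]),
                        reverse=True)[:n_neighbours]
    # Score candidate items by how many neighbours own them
    scores = {}
    for u in neighbours:
        for item in purchases[u] - purchases[user]:
            scores[item] = scores.get(item, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical purchase data
purchases = {
    "smith": {"dune", "neuromancer"},
    "jones": {"dune", "neuromancer", "foundation"},
    "brown": {"middlemarch", "emma"},
}
print(recommend("smith", purchases, n_neighbours=1))  # ['foundation']
```

Mahout's real recommenders follow the same shape (a data model, a similarity measure, a neighbourhood, a recommender) but with interchangeable similarity metrics and the machinery to scale.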

So all we need to supply the algorithm with is a list of users and items.

Casting about for a test data set, I thought it might be interesting to take the influencers data set that I extracted from Wikipedia for a previous post. This data set is a list of who influenced whom across all the famous people covered on Wikipedia. For example, the list of influences for T. S. Eliot is: Dante Alighieri, Matthew Arnold, William Shakespeare, Jules Laforgue, John Donne, T. E. Hulme, Ezra Pound, Charles Maurras and Charles Baudelaire.

Why not use 'x influenced y' as a proxy for 'y has read x'? It's probably true in a lot of cases. We can imagine that the famous dead have been shopping on Amazon for their favourite authors and that we now have the database. Even better, we can tap into their likes and dislikes to give us some personalised recommendations.

It works (sort of) and was very interesting to put together. Here’s a demo:

Say you liked English Romantic poets. Then you might submit the names: keats, william wordsworth, samuel taylor coleridge

Click on the link to see what the web app returns. You'll find that you need to hit refresh in your browser whenever you submit something; I'm not sure why at the moment.

Or you might try some painters you like: Picasso, Henri Matisse, Paul Gauguin

Or even just … Dylan

Or you could upset things a bit by adding in someone incongruous! keats, william wordsworth, samuel taylor coleridge, arthur c clarke

Have a go. All you need to do is add some names, separated by commas, after the names= part of the HTTP request. Note that although my code does some basic things to facilitate matching, the names will have to roughly match what is in Wikipedia, so if you're getting an error just look the person up there.

What are we getting back?

The first list shows, in Mahout terminology, your user neighborhood: individuals in Wikipedia who are judged to have similar tastes to you. I have limited the output to 20, though in fact the recommender is set to use 80; I will explain this in more detail in later posts. They are not in rank order (I haven't figured out how to do this yet).

The second list is the recommendations derived from the preferences of your 'neighbours'. Again, I'll explain this in more detail later (or you could look at Mahout in Action, where it is explained nicely).

The code is available on GitHub.

Sometimes the app goes dead for some reason and I have to restart it. If that happens just drop me a line and I’ll restart it.

I plan to do several more posts on how this was done. At the moment I'm thinking:

  • A walk though of the code
  • An explanation of how the recommender was tuned
  • How it was deployed as a web app on OpenShift

But if anyone out there is interested in other things or has questions just let me know.


Mahout for R Users

I have a few posts coming up on Apache Mahout, so I thought it might be useful to share some notes. I came at it primarily as an R coder with some very rusty Java and C++ somewhere in the back of my head, so that will be my point of reference. I've also included at the bottom some notes for setting up Mahout on Ubuntu.

What is Mahout?

A machine learning library written in Java that is designed to be scalable, i.e. to run over very large data sets. It achieves this by ensuring that most of its algorithms are parallelizable: they fit the map-reduce paradigm and can therefore run on Hadoop. Using Mahout you can do clustering, recommendation, prediction etc. on huge datasets by increasing the number of CPUs it runs over. Any job that you can split up into little jobs that can be done at the same time is going to see vast improvements in performance when parallelized.

Like R it’s open source and free!

So why use it?

It should be obvious from the last point. The parallelization trick brings data and tasks that were once beyond the reach of machine learning suddenly into view. But there are other virtues. Java's strictly object-oriented approach is a catalyst to clear thinking (once you get used to it!). And then there is a much shorter path to integration with web technologies. If you are thinking of a product, rather than just a one-off piece of analysis, then this is a good way to go.

How is it different from doing machine learning in R or SAS?

Unless you are highly proficient in Java, the coding itself is a big overhead. There's no way around it: if you don't know Java already you are going to need to learn it, and it's not a language that flows! For R users who are used to seeing their thoughts realised immediately, the endless declaration and initialisation of objects is going to seem like a drag. For that reason I would recommend sticking with R for any kind of data exploration or prototyping, and switching to Mahout as you get closer to production.

What do you need to do to get started?

You'll need to install the JDK (Java Development Kit) and some kind of Java IDE (I like NetBeans). You'll also need Maven (see below) to organise your code and its dependencies. A book is always useful; the only one around seems to be Mahout in Action, but it's good and all the code for its examples is available for download. If you plan to run Mahout on Hadoop (which is recommended) then of course you need that too, and if you're going to be using Hadoop in earnest you'll need an AWS account (assuming you haven't your own grid). Finally you'll need the Mahout package itself. I found all of this a lot easier on Linux, with its natural affinity with other open source projects. You are welcome to follow my notes below on how to get it all up and running on an AWS Ubuntu instance.

Object Oriented

R is a nice gentle introduction to object-oriented programming. If you've declared your own classes and methods using S3 you're on your way; better still if you've used S4 (I must admit I haven't). Even so, there's a big jump to the OO world of Java. Here are a few tips:

  • To get something that executes, include a method inside your class that begins public static void main(String[] args). An IDE like NetBeans will pick this up and allow you to run that file. See here for a Hello World example
  • Remember every variable needs to be both declared and initialised, and for everything that is not a Java literal this means creating a new instance of an object (I keep forgetting to include new when initialising)
  • The easy R world of a script and a few functions is not an option: everything should be an object or something pertaining to one. I find the easiest way to make this jump is to imagine I'm making the parts of a machine, and to keep this picture in my head. Everything is now like a little robot with data on the inside and predefined actions and responses

Some useful terms

Maven: a piece of software used by Mahout for managing project builds. It is similar to the package-writing tools in R but more flexible.

JDK and JRE: the first is the Java Development Kit, the software needed to write code in Java; the second is the Java Runtime Environment, the software that executes Java code. The JRE will be on the machine of anyone who runs anything that uses Java (i.e. most people).

AWS: Amazon Web Services, a cloud computing platform. We've quite a few posts on this subject. Here it is significant because it's what you'll need to run Hadoop if you've not got your own grid.

Hadoop and map-reduce: there are a million online tutorials on these, but very quickly: map-reduce is a powerful paradigm for parallelizing a very large class of tasks, and Hadoop is an open source software framework that implements it. If you've used the parallel library in R then that does something similar on a much smaller scale (although I'm not sure whether it is formally map-reduce).

NetBeans: a good IDE for Java (there are many others). If you use RStudio for R it's the same kind of thing but less stripped down; if you use Eclipse (which can also be used for Java) then you are already familiar with the set-up.
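The map-reduce idea mentioned above is easiest to see on word count, the canonical example. Here it is sketched in plain Python (no Hadoop involved; in a real cluster the framework distributes the map and reduce steps across machines and does the grouping between them):

```python
from collections import defaultdict

def map_step(line):
    """Map: emit a (key, value) pair for every word in a line of input."""
    return [(word, 1) for word in line.split()]

def reduce_step(word, counts):
    """Reduce: combine all the values emitted for one key."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group the mapped pairs by key (Hadoop does this between the phases)
grouped = defaultdict(list)
for line in lines:
    for word, count in map_step(line):
        grouped[word].append(count)

result = dict(reduce_step(w, c) for w, c in grouped.items())
print(result["the"])  # 3
```

Because each map call touches only its own line, and each reduce call only its own key, both phases can run on as many machines as you like, which is exactly the property Mahout's algorithms exploit.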

Some general tips

  • When Mahout installs it runs a lot of checks. I found it kept failing certain ones, and this prevented the whole thing from installing. I disabled the checks with the option -DskipTests and so far I've had no issues
  • I found it very useful when running the examples in Mahout in Action to explore the objects using the NetBeans debugger. This allows you to inspect the objects, giving you a good sense of how it all hangs together
  • Here's a nice post explaining the map-reduce algorithm
  • Don't forget to install the Maven plug-in in NetBeans, otherwise you'll struggle when executing the Mahout examples
  • Do a bit of Java programming to get your head into it (it might not be your thing but I downloaded and adapted this space invaders example)

My notes for setting up Mahout and running a simple example

This worked for me as of April 2013 on an AWS Ubuntu image (see earlier posts for setting this up). Obviously I'm referring to my particular directory set-up: you'll need to change it appropriately here and there, and in particular change the versions of Hadoop, Maven and Mahout to the latest. Thanks to the following post for the example.

Apologies, it’s a bit raw but it gets you from the beginning to the end.

Install Java JDK 7

java -version [check whether Java is installed]
sudo apt-get update
sudo apt-get install openjdk-7-jdk

Download and install hadoop

cd /home/ubuntu
sudo cp hadoop-1.0.4.tar.gz /usr/local [move the file to /usr/local]
cd /usr/local
sudo tar -zxvf hadoop-1.0.4.tar.gz [unzip the package]
sudo rm hadoop-1.0.4.tar.gz

Set up environment variables

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop-1.0.4

Set up variable permanently

sudo vi /etc/environment

Append ":HADOOP_HOME/bin" to the PATH line

Test hadoop is working

$HADOOP_HOME/bin/hadoop [displays help files]

Run the standalone example

cd /usr/local/hadoop-1.0.4
sudo mkdir input
sudo cp conf/*.xml input
sudo bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
sudo cat output/*

Install maven

sudo apt-get update
sudo apt-get install maven
mvn -version [to check it installed ok]

Install mahout

cd /home/ubuntu
sudo tar -zxvf mahout-distribution-0.7-src.tar.gz
sudo cp -r /home/ubuntu/mahout-distribution-0.7 /usr/local
cd /usr/local
sudo mv mahout-distribution-0.7 mahout
cd mahout/core
sudo mvn -DskipTests install
cd mahout/examples
sudo mvn install

Create a maven project

cd /usr/local/mahout
sudo mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.unresyst -DartifactId=mahoutrec
cd mahoutrec
sudo mvn compile
sudo mvn exec:java -Dexec.mainClass="com.unresyst.App" [to print hello world]
sudo vi pom.xml

Then insert the dependency snippet into pom.xml, and also add the corresponding section to the parent clause (the exact XML is given in the post linked above).

Create recommender

Create a datasets directory in the mahoutrec folder

Add the CSV file to it

Create the Java file listed in the post above in src/main/java/com/unresyst/

Go back to the project directory and run:

sudo mvn compile
sudo mvn exec:java -Dexec.mainClass="com.unresyst.UnresystBoolRecommend"

Follow the steps in the D&L post to set up the NX server

Set up netbeans

Download and install NetBeans using the Ubuntu Software Centre

Enable all update centres
Install the Maven plug-in

Install Git

sudo apt-get install git

Download the repository for Analysing Data with Hadoop

cd /home/ubuntu
mkdir repositories
cd repositories
git clone

Download the repository for Mahout in Action

git clone

Running the Hadoop MaxTemperature example

Set up a new directory and copy across the example files:

cp /home/ubuntu/repositories/hadoop-book/ch02/src/main/java/* /home/ubuntu/hadoopProjects/maxTemp

Make a build/classes directory within maxTemp, then:

javac -verbose -classpath /usr/local/hadoop-1.0.4/hadoop-core-1.0.4.jar MaxTemperature*.java -d build/classes
export HADOOP_CLASSPATH=/home/ubuntu/hadoopProjects/maxTemp/build/classes
hadoop MaxTemperature /home/ubuntu/repositories/hadoop-book/input/ncdc/sample.txt output

To run the Mahout example through NetBeans just go to the mahoutrec Maven directory and execute

Machine Learning and Analytics based in London, UK