Book Recommendations from Beyond the Grave: A Mahout Example


In H. P. Lovecraft’s The Case of Charles Dexter Ward the villainous Curwen, having taken possession of the body of Charles Dexter Ward, uses a combination of chemistry and black magic to bring back from the dead the wisest people who have ever lived. He then tortures them for their secrets. Resurrecting the dead would be a bit of an overclaim for machine learning, but we can get some of the way: we can bring them back for book recommendations!

The Case of Charles Dexter Ward

It’s a grandiose claim and you can judge for yourself how successful it is. Really I just wanted an opportunity to experiment with Apache Mahout and to brush up on my Java. I thought I’d share the results.

Apache Mahout is a scalable machine learning library written in Java (see previous post). The recommender part of the library takes lists of users and items (for example, it could be all the users on a website, the books they purchased and perhaps even their ratings of those books). Then, for any given user, say Smith, it works out the users that are most similar to Smith and compiles a list of recommendations based on their purchases (ignoring, of course, anything Smith already has). It can get more complicated but that’s the gist of it.

So all we need to supply the algorithm with is a list of users and items.
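
Mahout handles this at scale in Java; the principle, though, can be sketched in a few lines of Python. This is a toy illustration with made-up data and a simple overlap similarity, not the code behind the app:

# A toy user-item table: each user maps to the set of items they 'like'.
prefs = {
    "smith": {"hamlet", "the waste land", "ulysses"},
    "jones": {"hamlet", "ulysses", "dubliners"},
    "brown": {"the waste land", "four quartets"},
}

def similarity(a, b):
    # Jaccard overlap between two users' item sets; Mahout offers several
    # similarity measures, this is just the simplest to write down.
    return len(prefs[a] & prefs[b]) / float(len(prefs[a] | prefs[b]))

def recommend(user, k=2):
    # The k most similar users form the 'neighbourhood' ...
    neighbours = sorted((u for u in prefs if u != user),
                        key=lambda u: similarity(user, u), reverse=True)[:k]
    # ... and anything they have that the user doesn't becomes a recommendation.
    suggestions = set().union(*(prefs[u] for u in neighbours)) - prefs[user]
    return neighbours, suggestions

print(recommend("smith"))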

Casting about for a test data set, I thought it might be interesting to take the influencers data set that I extracted from Wikipedia for a previous post. This data set is a list of who influenced whom across all the famous people covered on Wikipedia. For example, the list of influences for T. S. Eliot is: Dante Alighieri, Matthew Arnold, William Shakespeare, Jules Laforgue, John Donne, T. E. Hulme, Ezra Pound, Charles Maurras and Charles Baudelaire.

Why not use ‘x influenced y’ as a proxy for ‘y has read x’? It’s probably true in a lot of cases. We can imagine that the famous dead have been shopping on Amazon for their favourite authors and that we now have the database. Even better, we can tap into their likes and dislikes to give us some personalised recommendations.
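
In practice that means turning the influence pairs into a preference file a recommender can read. As I understand it, Mahout’s FileDataModel expects numeric userID,itemID lines, so the conversion might look roughly like this (influences.csv and prefs.csv are hypothetical file names):

import csv

# Map each distinct name onto a stable integer ID, since the recommender
# wants numeric user and item IDs rather than names.
ids = {}

def to_id(name):
    return ids.setdefault(name.strip().lower(), len(ids))

# influences.csv is a hypothetical two-column file: reader_name, author_name
with open('influences.csv') as f_in, open('prefs.csv', 'w') as f_out:
    for reader_name, author_name in csv.reader(f_in):
        f_out.write('%d,%d\n' % (to_id(reader_name), to_id(author_name)))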

It works (sort of) and was very interesting to put together. Here’s a demo:

Say you liked English Romantic poets. Then you might submit the names:

http://simpleweb-deadrecommender.rhcloud.com/?names=john keats, william wordsworth, samuel taylor coleridge

Click on the link to see what the web app returns. You’ll find that you need to hit refresh on your browser whenever you submit something. I’m not sure why at the moment.

Or you might try some painters you like:

http://simpleweb-deadrecommender.rhcloud.com/?names=Pablo Picasso, Henri Matisse, Paul Gauguin

Or even just …

http://simpleweb-deadrecommender.rhcloud.com/?names=Bob Dylan

Or you could upset things a bit by adding in someone incongruous!

http://simpleweb-deadrecommender.rhcloud.com/?names=john keats, william wordsworth, samuel taylor coleridge, arthur c clarke

Have a go. All you need to do is add some names, separated by commas, after the names= part of the HTTP request. Note that although my code does some basic things to facilitate matching, the names will have to roughly match what is in Wikipedia, so if you’re getting an error just look the person up there.
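
If you would rather call the app from a script than from the browser address bar, something like this should work; a sketch using the Python requests library, which handles the URL encoding of the spaces and commas (and assumes the app is up):

import requests

# Query the recommender app from a script; requests URL-encodes the
# spaces and commas in the names parameter for us.
names = "john keats, william wordsworth, samuel taylor coleridge"
response = requests.get("http://simpleweb-deadrecommender.rhcloud.com/",
                        params={"names": names})
print(response.text)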

What are we getting back?

The first list shows, in Mahout terminology, your user neighborhood, that is, the individuals in Wikipedia who are judged to have similar tastes to you. I have limited the output to 20, though in fact the recommender is set to use 80. I will explain this in more detail in later posts. They are not in rank order (I haven’t figured out how to do this yet).

The second list is the recommendations derived from the preferences of your ‘neighbours’. Again, I’ll explain this in more detail later (or you could look at Mahout in Action, where it is explained nicely).

The code is available on GitHub.

Sometimes the app goes dead for some reason and I have to restart it. If that happens just drop me a line and I’ll restart it.

I plan to do several more posts on how this was done. At the moment I’m thinking:

  • A walk-through of the code
  • An explanation of how the recommender was tuned
  • How it was deployed as a web app on OpenShift

But if anyone out there is interested in other things or has questions just let me know.

Simon

Lazy D3 on some astronomical data


I can’t claim to be anything near an expert on D3 (a JavaScript library for data visualisation) but being both greedy and lazy I wondered if I could get some nice results with minimum effort. In any case the hardest thing about D3 for a novice to the world of web design seems to be getting started at all so perhaps this post will be useful for getting people up and running.

A Reingold–Tilford Tree of the IVOA Astronomical Object Ontology using D3


The images above and below are visualisations using D3 of a classification hierarchy for astronomical objects provided by the IVOA (International Virtual Observatory Alliance). I take no credit for the layout. The designs are taken straight from the D3 examples gallery but I will show you how I got the environment set up and my data into the graphs. The process should be replicable for any hierarchical dataset stored in a similar fashion.

A dendrogram of the same data


Even better than the static images are various interactive versions such as the rotating Reingold–Tilford Tree, the collapsible dendrogram and the collapsible indented tree. These were all created fairly easily by substituting the astronomical object data for the data in the original examples. (I say fairly easily as you need to get the hierarchy into the right format, but more on that later.)

First the environment.

If you are an R user with little experience of building web pages then you’ll probably find yourself squinting at the D3 documentation wondering how you get from the code to the output. With just a browser to read the JavaScript and an editor (Notepad if you like, but preferably a specialist HTML editor) you can get through most of the tutorials referred to in Mark’s previous post. Doing this locally on your own PC works OK because the data visualised in the tutorials is mostly hard-coded into the JavaScript. However, once you want to start referring to data contained in external files, some sensible security restrictions on which files your browser can access will block your attempts. The options are to turn these off (not advised) or to switch to working on a web server.

You can either opt for one of the many free hosting services or, if you are feeling more adventurous, you can follow Mark’s posts on setting up a Linux instance on Amazon Web Services and then follow Libby Hemphill’s instructions for starting up Apache and opening the port. Having gone with the latter, I use FileZilla to transfer any data I need over to my Linux instance. See this post for getting it to work with your authentication key.

This should leave you in the following situation:

  • You have a working web server and an IP address that will take you to an HTML index page from which you can provide links to your D3 documents.
  • You have a way of transferring work that you do locally over to your web server.
  • You have a whole gallery of examples to play around with.

Rather than creating my own D3 scripts I’m just going to substitute my own data into the examples. The issue here is that the hierarchical examples take as input a nested JSON object. This means something like:

{
 "name": "Astronomical Object",
 "children": [
  {
   "name": "Planet",
   "children": [
    {
     "name": "Type",
     "children": [
      {"name": "Terrestrial"},
      {"name": "Gas Giant"}
     ]
    }
   ]
  }
 ]
}

The issue is that our data looks like this:

1. Planet
1.1. [Type]
1.1.1. Terrestrial
1.1.2. Gas Giant
1.2.  [Feature]
1.2.1. Surface
1.2.1.1. Mountain
1.2.1.2. Canyon
1.2.1.3. Volcanic
1.2.1.4. Impact
1.2.1.5. Erosion
1.2.1.6. Liquid
1.2.1.7. Ice
1.2.2. Atmosphere
1.2.2.1. Cloud
1.2.2.2. Storm
1.2.2.3. Belt
1.2.2.4. Aurora

To put this into the right format I’ve used Python to read in the file as CSV (with the dot as a delimiter) and construct the nested JSON object. Here’s the full Python code:

import csv, re

def readLev(row):
    # Return the index of the last column that contains any letters;
    # with a dot delimiter this corresponds to the depth of the entry.
    level = 0
    pattern = re.compile("[a-zA-Z]")
    for i, col in enumerate(row):
        if pattern.search(col) is not None:
            level = i
    return level

# Open the file and read it as dot-delimited (Python 2; adjust the path
# to point at your own dot-delimited CSV).
with open(r'B:\PythonFiles\PythonInOut\AstroObject\Astro.csv', 'rb') as csvfile:
    astroReader = csv.reader(csvfile, delimiter='.')
    astroJson = '{"name": "Astronomical Object"'
    prevLev = 0
    for row in astroReader:
        # Identify the depth of the current row
        currentLev = readLev(row)
        if currentLev > prevLev:
            # One level deeper: open a new "children" list
            astroJson = astroJson + ', "children": [{"name": "' + row[currentLev] + '"'
        elif currentLev < prevLev:
            # Back up one or more levels: close the finished branches
            jump = prevLev - currentLev
            astroJson = astroJson + ', "size": 1}' + ']}'*jump + ', {"name": "' + row[currentLev] + '"'
        else:
            # Same level: close the previous leaf and start a sibling
            astroJson = astroJson + ', "size": 1},{"name": "' + row[currentLev] + '"'
        prevLev = currentLev
    # Close the string off; the number of brackets needed here depends on
    # the depth of the last row in the file.
    astroJson = astroJson + '}]}]}'
    print astroJson
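
Continuing from the script above, you could also parse the finished string to check it is valid JSON and write it out to the file that D3 will load; a minimal sketch (astro.json is just an example name):

import json

# Parse the string to make sure it is valid JSON, then write it out to the
# file that the D3 example will load.
parsed = json.loads(astroJson)
with open('astro.json', 'w') as outfile:
    outfile.write(json.dumps(parsed, indent=1))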

Since only the level of indentation matters to this process, it could be repeated on any data that has the form

* Planet
** [Type]
*** Terrestrial
*** Gas Giant
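
Only the level detection needs to change for data in that form; a minimal sketch, assuming one leading asterisk per level and that the file is read line by line rather than through csv.reader:

def readLev(line):
    # Depth = number of leading asterisks; the name itself would then be
    # taken with line.lstrip('* ') rather than row[currentLev].
    return len(line) - len(line.lstrip('*'))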

So that’s it. You’ll see that in the source code for the hierarchical examples there is a reference to flare.json. Substitute in a reference to your own file containing the outputted JSON object and be sure to include that file in the directory on your web server.

Of course it’s a poor substitute for learning the language itself, as that enables you to construct your own innovative visualisations, but it gets you started.

Graphing the history of philosophy


A close up of ancient and medieval philosophy ending at Descartes and Leibniz

If you are interested in this data set you might like my latest post where I use it to make book recommendations.

This one came about because I was searching for a data set on horror films (don’t ask) and ended up with one describing the links between philosophers.

To cut a long story very short, I’ve extracted the information in the ‘influenced by’ section for every philosopher on Wikipedia and used it to construct a network, which I’ve then visualised using Gephi.

It’s an easy process to repeat. It could be done for any area within Wikipedia where the information forms a network. I chose philosophy because firstly the influences section is very well maintained and secondly I know a little bit about it. At the bottom of this post I’ve described how I got there.

First I’ll show why I think it’s worked as a visualisation. Here’s the whole graph.

Each philosopher is a node in the network and the lines between them (or edges, in the terminology of graph theory) represent lines of influence. The node and text are sized according to the number of connections (both in and out). The algorithm that visualises the graph also tends to put the better connected nodes in the centre of the diagram, so we see the most influential philosophers, in large text, clustered in the centre. It all seems about right, with the major figures in the western philosophical tradition taking centre stage. (I also need to add the direction of influence with an arrowhead – something I’ve not got round to yet.) A shortcoming, however, is that this evaluation only takes into account direct lines of influence. Indirect influence via another person in the network does not enter into it. This probably explains why Descartes is smaller than you’d think. It would also be better if the nodes were sized only by the number of outward connections, although I think overall the differences would be slight. I’ll get round to that.

It gets more interesting when we use Gephi to identify communities (or modules) within the network. Roughly speaking it identifies groups of nodes which are more connected with each other than with nodes in other groups. Philosophy has many traditions and schools so a good test would be whether the algorithm picks them out.

It has been fairly successful. Below we can see the so-called continental tradition picked out in green, stemming from Hegel and Nietzsche, leading into Heidegger and Sartre and ending in the isms of the twentieth century. It’s interesting that there is a separate subgroup, in purple, influenced mainly by Schopenhauer (out of shot) and Freud.

The Continental Tradition

And this close up is of the analytical tradition emerging out of Frege, Russell and Wittgenstein. At the top and to the left you can see the British empirical school and the American pragmatists.

British Empiricism, American Pragmatism and the Analytical Tradition

It would be interesting to play with the number of groups picked out by the algorithm. It would hopefully identify subgroups within these overarching traditions.

The graph is probably most insightful when you zoom in close. Gephi produces a vector graphic output so if you’re interested you can download it here and explore it yourself.

Now for how you do it.

The first stop is DBpedia. This is a fantastic resource which stores structured information extracted from Wikipedia in a database that is accessible through the web. Among other things it stores all of the information you see in an infobox on a Wikipedia page. For example, I was after the ‘influenced’ and ‘influenced by’ fields that you find in the infobox on the page for Plato.

The next step is to extract this information. For this we need two things: a SPARQL endpoint (try snorql), which is an online interface for submitting our queries, and a little knowledge of SPARQL, a specialist language for querying the semantic web. This is a big (and exciting) area that has to do with querying information that is structured in triples (subject-relationship-object). I assume it has its roots in predicate logic, so the analytical philosophers would have been pleased. However, the downside is that the language itself is a lot more difficult to learn than, say, SQL, and to complicate things still further you need to know the ontological structure of the resource you are querying. I probably wouldn’t have got anywhere at all were it not for a great blog post by Bob DuCharme which is a simple guide to getting the information out of Wikipedia.

In the end the query I needed was very simple. You can test it by submitting it in snorql.

SELECT *
WHERE {
?p a <http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced .
}

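If you prefer, the same query can be submitted from a script rather than through snorql; here is a sketch using the Python requests library against the public DBpedia endpoint (the JSON result structure shown is the standard SPARQL one, but treat the details as a starting point):

import requests

# Submit the same query to the public DBpedia endpoint and ask for the
# results as JSON instead of reading them off the snorql page.
query = """
SELECT * WHERE {
  ?p a <http://dbpedia.org/ontology/Philosopher> .
  ?p <http://dbpedia.org/ontology/influenced> ?influenced .
}
"""
response = requests.get("http://dbpedia.org/sparql",
                        params={"query": query,
                                "format": "application/sparql-results+json"})
for row in response.json()["results"]["bindings"][:5]:
    print(row["p"]["value"], row["influenced"]["value"])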

It then needed a bit of cleaning, as the punctuation was encoded for URLs. For this I used an online URL decoder.
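
The same decoding is also a one-liner in Python, if you prefer (Python 3 shown; the example string is just an illustration):

from urllib.parse import unquote  # Python 3

# Turn a percent-encoded resource name back into readable text.
print(unquote("S%C3%B8ren_Kierkegaard").replace("_", " "))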

After a bit more simple manipulation in Excel I had a finished CSV file that consisted of three columns:

Philosopher A
Philosopher B
Weight

Each row in the data set represented a line of influence from philosopher A to philosopher B. The weight column contained a dummy entry of 1 because in our graph we do not want any one link to matter more than any other.
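
If you wanted to skip the Excel step, the cleaned results could be written straight into that three-column file; a rough sketch, where the edges list stands in for the real query output:

import csv

# edges is a stand-in for the (influencer, influenced) pairs from the query.
edges = [("Plato", "Aristotle"), ("Aristotle", "Thomas Aquinas")]

with open("influence_edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Philosopher A", "Philosopher B", "Weight"])
    for a, b in edges:
        writer.writerow([a, b, 1])  # dummy weight of 1 so no link counts more than another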

Gephi is the tool I used to create the visualisation of the graph. It’s both fantastic and open source. You can download it and set it up in minutes. For a quick tutorial see this link. There are many settings you can use to change the way your graph looks. I used a combination of the Force Atlas and Fruchterman-Reingold layout algorithms. I then scaled text and node size by node degree (number of connections) and suppressed all nodes with fewer than four connections (it was overwhelming otherwise). The partition tool is used to create the communities. Full instructions are in the tutorial. I also found this blog entry very useful as a guide.

I hope that helps anyone who is trying to do something similar. If anyone does have a data set on horror films tagged with keywords please let me know!

If you liked this post and would like to see more like it then please subscribe by email (see the link in the sidebar) or sign up to our RSS feed.

Simon

Update: Griff at Griff’s Graphs has used the instructions above to create a fantastic visualisation of the influence network of everyone on Wikipedia. It’s well worth a look.

Creative Commons Licence
Graphing the history of philosophy by Simon Raper is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Based on a work at drunksandlampposts.files.wordpress.com.
Permissions beyond the scope of this license may be available at http://drunks-and-lampposts.com/.

The changing face of “Analysis”


Here’s something I started to write a few months ago but never got around to finishing off.

Neil Charles over at Wallpapering Fog has just written an excellent post about the growing importance of R and Tableau to the modern-day analyst. Although I’m not as long in the tooth as Neil (sorry Neil), even in the last five years there has been a definite movement towards a much wider skill set for everyday analysis, at least within marketing.

The days of only using the likes of SPSS, SAS and Excel are long gone, as the need grows to make work more repeatable, scalable and downright flexible. Today’s analyst needs to be comfortable getting hold of new data sources that don’t necessarily sit in an Excel file or in tabular form, manipulating them, using a statistical technique that they didn’t necessarily learn at university and then visualising the results (maybe on a map or a small multiples plot).

Drawing on Neil’s post, I thought I would add my tuppence worth on some of the themes he mentions.

New Languages to learn

I’ve been a fan of R for the last three years, enjoying its power and flexibility throughout the analysis process and the much greater repeatability it allows. The sheer number of packages that exist for it has massively levelled the playing field between academia and industry.

To put it bluntly, there are datasets that I wouldn’t be able to access and visualise if I were just using Excel, a similar sentiment to that expressed by the guys behind Processing. SAS is a bit better, from what I understand, but still very limited (this might well be why it’s dropped out of the top 50 programming languages).

James Cheshire over at SpatialAnalysis has shown what is possible with visualising data using ggplot2. Even five years ago this would have been the preserve of expensive GIS software which would have been available to a small group of people.

Tools like Tableau take this one step further, providing software that makes it very easy to produce excellent-quality graphics and visualisations, and to tell stories with data in ways that, again, would have been very expensive to produce not long ago and would have required a very strong programming background.

Python is my other tool of choice, mainly due to the number of wrappers that have been written for it to access the various APIs that exist. For instance, up until recently, there was no implementation of OAuth (a way of gaining authorisation to some APIs) within R, whereas Python had this and also has wrappers written for the likes of Google Analytics, Facebook and Twitter.

I tend to be a lazy analyst – I’ll find the easiest and quickest way for the computer to do something so that I don’t have to. What that’s meant is that I’ve tended to use R, Python and Tableau as parts of my toolkit. The future is one where learning just one language is likely to leave you in some painful situations when something else can accomplish the task far more easily.

The growing importance of databases

As datasets get larger and more complicated, traditional methods for storing smaller datasets (e.g. in CSVs and Excel files) are becoming increasingly unfit for purpose. Often, datasets are now constantly being added to, and the need to query them in different ways makes databases the way to go.

MySQL is a fantastic and freely available solution which can sit locally on your laptop, on a company server or even in the cloud on an EC2 instance (or similar). It’s easy to set up and easy to manage for small to medium size pieces.

My other thing of the moment is MongoDB, one of the growing number of NoSQL (which stands for Not only SQL) database solutions out there. Essentially, it’s a way of storing data in a non-tabular format, so each record can have a different number of fields. For anyone familiar with XML or JSON, it’s very similar in thinking. I’ve become a big fan due to this flexibility, combined with the speed and the relatively easy-to-understand query syntax.
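
To give a flavour of that flexibility, here is a minimal sketch using the pymongo driver; it assumes a MongoDB server running locally, and the films collection and its fields are invented:

from pymongo import MongoClient

client = MongoClient()         # connects to mongodb://localhost:27017 by default
films = client.test_db.films  # database and collection spring into existence on first insert

# Records in the same collection don't need to share the same fields.
films.insert_one({"title": "Casino Royale", "year": 2006, "stars": ["Daniel Craig"]})
films.insert_one({"title": "Ocean's Eleven", "year": 2001, "budget_usd": 85000000})

# A simple query: films released after 2001, whatever fields they happen to have.
for film in films.find({"year": {"$gt": 2001}}):
    print(film["title"])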

The one thing to note with databases is that bigger ones need to be designed right. There is an important role in database architecture and administration that it’s very easy to overlook when you’ve got an analyst who roughly understands how these things work. Of course, there are also lots of times where the smaller data sizes mean that performance delays due to poor architecture don’t result in any tangible delay in getting the data.

The result of this trend is that there’s an important interplay between analysts and those involved in more traditional database administration, and a question of how the needs of the analyst can be accommodated in an agile way.

Collaboration

I’ve been using BitBucket for a couple of years now as the default place that I store my code. Rather than having it on one machine, I store it in the cloud and can access it from anywhere. I can choose whether I want to share the code with no one, everyone or somewhere in between.

It’s a bit of a struggle in places (e.g. those damned two heads), but as a device for sharing and managing version control, it’s a no brainer.

Keeping up to speed with the latest developments

Probably the most important and hardest to define piece is how to keep up to speed with all this – how to hear about Mongo or Haskell or how to scrape Facebook data. I tend to use a few things:

  1. Netvibes to follow blogs.
  2. Twitter to follow people.
  3. Pinboard to tag the stuff I might need at some point in the future.
  4. StackOverflow and the R Mailing List for help with answering questions (which someone has normally come by before).
  5. Online Courses like those run by Stanford to learn new techniques.

That’s it for now, but I’ll doubtless pick up on this thread again in the future. I’d also be really keen to see what other analysts’ experience has been over the last few years as new software languages and technologies have become available.

Google Refine: One of The Best Tools You’ve Probably Never Heard About


Lots of data that’s available online tends not to be the cleanest thing in the world, particularly if you’ve had to scrape it in the first place. At the same time, lots of internal data sets can be just as messy, with columns having different names in what should be identical spreadsheet templates, or with what should be identical values (e.g. “United Kingdom” vs “UK”) not being identical. There are lots more examples that I could mention, but you get the idea.

Google Refine is a great tool for dealing with messy data and turning it quickly and easily into a much better dataset, which then allows for the fun to begin with the analysis and visualisation.

Three handy Google Refine tricks.

My reason for using Refine was to clean up and append additional information to the Information is Beautiful Awards. I’m still learning how to use the software, but found these three things to be very handy and things that I’ve not encountered elsewhere (without writing some code).

Add together multiple different Excel worksheets. This can be an absolute pain when there are either a lot of them or when each one has some additional or missing columns. In the past, I’ve dealt with this using a combination of Python and VBA. However, Refine makes it incredibly easy to join together multiple Excel worksheets (particularly if they have similarly named columns). NB: this functionality doesn’t seem to exist when plugging into Google spreadsheets.
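
For comparison, the Python route can now be boiled down to a few lines of pandas; a sketch, where workbook.xlsx is a hypothetical file whose sheets may have slightly different columns:

import pandas as pd

# Read every worksheet into a dict of DataFrames, then stack them;
# columns missing from a sheet simply come through as NaN.
sheets = pd.read_excel("workbook.xlsx", sheet_name=None)
combined = pd.concat(list(sheets.values()), ignore_index=True)
combined.to_csv("combined.csv", index=False)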

Reconciliation. My new favourite word. The data set that was provided, whilst exhaustive in some respects, still lacked a lot of potentially interesting information, such as when in the year each film was released, the stars who featured, and so on. My first approach was to use the URLs provided and see if I could scrape the data from each link, as most of them were from a particular website which had a consistent format.

Then I discovered Reconciliation.

This screencast does it far more justice than I can, but essentially the idea is that by reconciling my database with another one (in this case Freebase), there’s a whole lot more information that I can easily add on to my dataset. Think of reconciling a bit like primary keys that are universally defined, plus the ability to left join. Also, think of remakes of films (Casino Royale, Ocean’s Eleven, etc.): which version of the film am I referring to?

I simply asked Refine to start reconciling and, by looking at the data in my film name column, it was able to identify that it was a list of films. It then did its best at fuzzy matching the names and, where it wasn’t sure, gave me a list of options from which to choose. I could then choose a confidence level and it would leave me to manually match the remaining records that fell below this level.
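
The fuzzy-matching part of the idea is easy to illustrate in plain Python. This is just the general technique, not what Refine or Freebase actually do under the hood, and the film lists are made up:

from difflib import get_close_matches

# A messy column of film names versus a hypothetical reference list.
my_films = ["Casino Royal", "Oceans Eleven", "The Dark Knight"]
reference = ["Casino Royale", "Ocean's Eleven", "The Dark Knight", "Casablanca"]

for name in my_films:
    # The cutoff plays the role of the confidence level: below it, no match is offered.
    print(name, "->", get_close_matches(name, reference, n=3, cutoff=0.8))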

Facets. Without Data Validation built into data entry, lots of alternative spellings can crop up for what should be the same value (e.g. “United Kingdom” vs “UK”). There are normally a few ways to deal with these, and tools like Tableau make it easy to group such values together.

Refine takes a nice approach to these (and other data validation issues) using what it calls “Facets”, which are essentially a summary of the data combined with data manipulation. What this means is that (a) I can see what mistakes there are in the data and (b) I can then easily correct them.
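
The same see-then-fix pattern can be mimicked in pandas if you don’t have Refine to hand; a sketch, with an invented country column and spellings:

import pandas as pd

df = pd.DataFrame({"country": ["United Kingdom", "UK", "U.K.", "France", "france"]})

# (a) a facet-style summary showing every variant and how often it occurs
print(df["country"].value_counts())

# (b) collapse the variants onto a single canonical value
df["country"] = df["country"].replace({"UK": "United Kingdom",
                                       "U.K.": "United Kingdom",
                                       "france": "France"})
print(df["country"].value_counts())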
