Do politicians sound the same? An investigation using Latent Semantic Analysis

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

By Richard Brooker

Mountain View

“Politicians all sound the same and everyone is moving to the centre…”

You must have heard this a lot over the last few months? However is it fair? What evidence is there to show this?


There is a theory in economics called “The Median Voter Theorem” that provides some justification for the above statement.

The Median Voter Theorem: a majority rule voting system will select the outcome most preferred by the median voter” [1]

It hypothesises that the frequency of inclinations, on a given political spectrum, is likely to follow a bell curve. Thus parties will eventually move to the median in order to capture the most votes.

Mountain View

So for this election, I decided to have a look at what effect this phenomenon is having on British politics and the speeches given in the House of Commons.

The Data

Hansard is the name for the Transcripts of Parliamentary Debates in Britain. Records going back to 1935 can be downloaded from in .xml format. They are published daily and record every speech along with who gave it and their party.

Mountain View

In this post I look at how easy it is to distinguish a party by what they are saying.

More specifically, I built a statistical model to work out the probability that a speech came from a particular party based on the words and phrases used in the speech.

Feature  Construction

The input variables are contained in a sparse TF-IDF matrix constructed using Scikit Learn TfidfVectorizer in python. The rows represent speeches and there is column for every n-gram (up to length 3) in the transcripts.

An n-gram (/’phrase’/’term’) is any sequence of n words used consecutively.

The values in the matrix are the the term frequency–inverse document frequency (TF-IDF). For an n-gram i and speech j,Mountain View

The TF-IDF is a numerical statistic that is intended to reflect how important a n-gram (word/’phrase’) is to a document (in our case speech).

Feature  Selection

I carried out some light feature selection. I calculated the average uplift in probability by year. The uplift is simply how much more likely a term is to appear given that you know the party it belongs to. I then filtered out features within 2 decimal points of 1.

Averaging it by year does a couple of things. Like boot strapping, it provides confidence in the statistic . Secondly it also helps to filter out terms whose affinity flips across time, (e.g. job titles).

Here are some of the most predictive features for each of the parties. Hover to see the n-gram. The y-axis shows the percentage increase in the probability of seeing the n-gram for a given party. The x-axis is ordered by this value.

The graph shows Conservatives by default, but you can change the y and x labels to look at affinities for other parties, and the party that you order by.

Sparse Matrix Representation

The backbone of my approach is the sparse matrix representation (scipy.sparse is one such type). This is a data model that takes advantage of the large number of zeros in our feature matrix.

Usually a table/matrix is stored as a two dimensional array e.g.

Mountain View

This means the memory requirements are proportional to MxN (where M and N are the matrix dimensions).  A sparse representation on the other hand can cut the memory requirements drastically by only storing non zero values. For the example the same matrix above can be stored the following 3 arrays.
Mountain View

This meant that my feature matrix could be stored in memory at a modest 4.5GB (rather than 58GB).


Now we have our sparse representation we are able to leverage some clever techniques to efficiently fit a classifier.
Mountain View
We use Scikit Learns’ SVMClassifier to fit a linear classifier.

Mountain ViewThe model is fit by minimising the loss function
Mountain ViewWhere L is the loss function (Hinge, Logistic, Least-Squares, or Epsilon-Insensitive…etc). R is the regulation to prevent overfitting (L1, L2 or Elastic Net).

The classifier then iterates through each speech in our sparse matrix and updates the parameters using a lazy approximation of the gradient of E. It then approximates the true gradient by considering each training point at a time. Mountain Viewthe intercept is updated similarly. We can get near optimal results using very few passes of the data.

I use a grid search to find the best loss function, regularisation learning rate and number of iterations. Usually at this point you evaluate each of the models on a probe set, select the one model with the best parameters, then asses your success on a test set.

However, why throw away a bunch of otherwise good models?

There are lots of reasons you might (time, etc), however I wanted to do something a little more interesting.

Instead of finding the best linear model I ensemble them using random forest. I built an smaller dense feature matrix consisting of the scores from each of the models. This takes me from some 100,000 tfidf features to 200 dense ones. I can now train a random forest on my probe set. I got a 10% improvement on my best single model (when evaluated on the test set).

The Results

I now have a probability score for each of the speeches for each of the parties.  Finally I averaged the results by year and party to give the chart below.

The bubbles are the actual parties and the axes show the average probability of their speeches belonging to whichever parties are selected as the axes. It makes sense to start off with the two major parties on the axes but you can change these if you want. When the political speeches are very different the bubbles will of course be further apart. You’ll see for example that before 2010 the Labour speeches were much more predictive of being Labour and likewise for the Conservatives but post 2010 this relationship weakens and the bubbles move to the centre. Move the slider to see these changes over time.

As you move into 2010 the parties move to the centre and become harder to distinguish.



Scoring a Neural Net using R on AWS

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

nnet scoring plot

One of the drawbacks with R has been its limitation with big datasets. It stores everything in RAM so once you have more than 100K records your PC really starts to slow down. However, since AWS allows you to use any size machine, you could now consider using R for scoring out your models on larger datasets. Just fire up a meaty EC2 with the RStudio amazon machine image (AMI) and off you go.

With this in mind I wondered how long it would take to score up a Neural Net depending on how many variables were involved and how many records you need to score out. There was only one way to find out.

Read more

Buster – a new R package for bagging hierarchical clustering

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, not for hierarchical clustering.

It’s not a typical everyday human dilemma. It needs a bit more explanation. Read more

Visualising cluster stability using Sankey diagrams

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone
The problem

I wanted a way of understanding how a clustering solution will change as more data points are added to the dataset on which it is built.

To explain this a bit more, let’s say you’ve built a segmentation on customers, or products, or tweets (something that is likely to increase) using one or other clustering solution, say hierarchical clustering. Sooner or later you’ll want to rebuild this segmentation to incorporate the new data and it would be nice to know how much the segmentation will change as a result.

One way of assessing this would be to take the data you have now, roll it back to a previous point in time and then add new chunks of data sequentially each time rebuilding the clustering solution and comparing it to the one before.

Seeing what’s going on

Having recorded the different clusters that result from incrementally adding data, the next problem is to understand what is going on. I thought a good option would be a Sankey diagram. I’ve tested this out on the US crime data that comes with R. I built seven different clustering solutions using hclust, each time adding five new data points to the original 20 point data set. I used the google charts Sankey layout which itself is derived from the D3 layout. Here’s the result. Read more

Picturing the output of a neural net

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

Some time ago during a training session a colleague asked me what a surface plot of a two input neural net would look like. That is, if you have two inputs x_1 and x_2 and plot the output y as a surface what do you get? We thought it might look like a set of waterfalls stacked on each other.


For this post I’m going to use, wolfram alpha and some javascript. Check other tools in the toolbox.

Since neural nets are often considered a black box solution I thought it would be interesting to check this. If you can picture the output for simple case it makes things less mysterious in the more general case.

Let’s look at a neural net with a single hidden layer of two nodes and a sigmoid activation function. In other words something that looks like this. (If you need to catch up on the theory please see this brilliant book by David MacKay that’s free to view online)

Neural net with a single hidden layer

Neural net with a single hidden layer

Drawn using the lovely


Output for a single node

We can break the task down a little by doing a surface plot of a single node, say node 1. We are using the sigmoid activation function so what we are plotting is the application of the sigmoid function to what is essentially a function describing a plane. Read more

Freehand Diagrams with Adobe Ideas

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

Freehand diagrams have two big virtues: they are quick and they are unconstrained.

I used to use a notebook (see What are degrees of freedom) but recently I got an ipad and then I found Adobe Ideas. It’s completely free and has just the right level of complexity for getting ideas down fast.

A diagram for teaching machine learning

A diagram for teaching machine learning

It takes a bit of perseverance to get into a new habit but it pays off. Here are some examples and some tips on how to make it work as an alternative to paper.

First don’t bother using your fingers. You won’t have nearly enough control. Get a stylus. I use this one which is fairly cheap but works fine. I’ve noticed that it moves more freely when the ipad is in landscape position so I always draw with it that way round. I don’t know why this is but I guess it’s something to do with the glass.

The confusion matrix

The confusion matrix

Neighbourhood of C major

Neighbourhood of C major

When you first open Adobe Ideas it’s tempting to see the page in front of view as something that needs to be filled as you would a piece of paper. That’s the wrong way to approach it as you’ll get overly distracted by the task of orientating your work on the page. Instead treat the page as an enormous workboard. Start by zooming in somewhere in the middle and work outwards.

A diagram for teaching machine learning

A diagram for teaching machine learning

It’s vector graphic based so in principle infinitely deep. You’re only limited by how thin the pens get. Draw your diagram without any thought for the borders then right at the end scale it up to fit the page.

The next thing that might strike you is that there’s no copy and paste. Ideas is part of Adobe’s creative suite along with Photoshop and Illustrator and as such inherits their layer based approach to image editing. If you want to be able to adjust the positioning of parts of your diagram (say the key with respect to the rest of it) then you would be wise to put these on separate layers. You’d think this would be deeply frustrating however it’s one of those constraints that somehow makes you work better as you think more about your layout.

You are allowed 10 layers in total with one photo layer. I sometimes place an image of a grid on this layer if I need grid lines to work to.

If you are working quickly you don’t want to spending too much fretting about the colour palette. Again Ideas forces you into simplicity by limiting your colours to five at a time. Even better you can connect up to Kuler provided its app is also on your ipad. This gives you access to a huge range of palletes. The Kuler tool, which sucks the pallete out of an image, is also there by default in Ideas if you click on themes. This allows you to pull the colour scheme out any image on the photo layer.

Some Diagram Elements

Some Diagram Elements

When it comes to using the pen tool I tend to stick to the marker and constantly zoom in and out as I draw. I zoom right in for writing text and out again for drawing shapes. You soon get used to flipping between using your fingers to zoom and pan and your stylus to draw. It’s worth noting down the width of the pens you are using as it’s easy to get lose track of where you are and a diagram made up of many different line widths looks bad. I tend to use two widths: 6.5 and 3.

The toughest thing to pull off is anything that requires large scales swipes across the surface of the ipad, for example long straight lines or wide circles. It is easier to do this on a small scale with a thin pen and then enlarge the image.

One thing I didn’t realise for a while was that holding the pen tool for a few seconds in any completely circumscribed areas floods that areas with colour. This is very useful for filling shapes and creating colorful backgrounds.

I try to vary styles and colours although it takes an effort to keep this up if you are taking notes or in rush. I’ve listed in the panel on the left some of the diagram elements I’ve experimented with.

Adobe Ideas looks at first sight too limited to be particularly powerful but take a look at How to Use Adobe Ideas by Michael Startzman to see what is achievable by a professional!

A confused tangle

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

A confusion matrix is a confusing thing. There’s a surprising number of useful statistics that can be built out of just four numbers and the links between them are not always obvious. The terminology doesn’t help (is a true negative an observation that is truly in the class but classified negative or one that is negative and has truly been classified as such!) and neither does the fact that many of the statistics have more than one name (Recall=sensitivity=power!).

To unravel it a little i’ve used the tangle js library to create an interactive document that shows how these values are related. The code can be found here.

The interactive example

I have trained my classifier to separate wolves from sheep. Let’s say sheep is a positive result and wolf is a negative result (sorry wolves). I now need to test it on my test set. This consists of wolves and sheep That&#8217s test subjects altogether.

Say my classifier correctly identifies sheep as sheep (true positives) and wolves as wolves (true negatives)

This gives us the confusion matrix below:

Your browser does not support the HTML5 canvas tag.

Now some statistics that need to be untangled!

Precision (aka positive predictive value) is TP/(TP+FP). In our case /( + ) =

Recall (aka sensitivity, power) is TP/(TP+FN). In our case /( + ) =

Specificity is TN/(TN+FP). In our case /( + ) =

Negative predictive value is TN/(TN+FN). In our case /( + ) =

Accuracy is (TP+TN)/(TP+TN+FP+FN). In our case ( + )/( + + + ) =

False discovery rate is FP/(FP+TP). In our case /( + ) =

False positive rate (aka false alarm rate, fall-out) is FP/(FP+TN). In our case /( + ) =

Deploying your Mahout application as a webapp on Openshift

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

Openshift (a cloud computing platform from Redhat) does not at present support Hadoop so this is not a route to go down if you have the kind of data that requires map reduce. However it’s not a bad option if you’re just playing with Mahout (see the previous post) and would like to share what you have done in the form of a web app. It’s also free unless you require anything fancy.

Here are some notes I made on how to do it. Assumes you are developing your application in Eclipse.

  1. Set up an openshift account
  2. Follow the steps in this post. It shows you how to push your repository to Openshift through the eclipese IDE.
  3. Because I’m a novice in web app creastion I borrowed the web.xml and servlet code from this post on stack overflow and customised it to call my mahout function. You can see what I mean here Note the server code is placed in the src/main/java folder and the web.xml file is placed in in WEB-INF.
  4. When compiling the project in eclipse use run Maven install. This generates the WAR file. I found this post invaluable for understanding what is going on.
  5. I found I then had to copy the WAR file over from the target folder to the deployments folder.
  6. Switch to the git aspect in Eclipse and commit and push to openshift.

You should then be able to access your web app at the url provided by open shift and use the usual http requests to interact with it. Again see the previous post for an example.

Visualising Shrinkage

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

A useful property of mixed effects and Bayesian hierarchical models is that lower level estimates are shrunk towards the more stable estimates further up the hierarchy.

To use a time honoured example you might be modelling the effect of a new teaching method on performance at the classroom level. Classes of 30 or so students are probably too small a sample to get useful results. In a hierarchical model the data are pooled so that all classes in a school are modelled together as a hierarchy and even all schools in a district.

At each level in this hierarchy an estimate for the efficiency of the teaching method is obtained. You will get an estimate for the school as a whole and for the district. You will even get estimates for the individual classes. These estimates will be weighted averages of the estimates for the class and the estimate for the school (which in turn is a weighted average of the estimate for the school and the district.) The clever part is that this weighting is itself determined by the data. Where a class is an outlier, and therefore the overall school average is less relevant, the estimate will be weighted towards the class. Where it is typical it will be weighted towards the school. This property is known as shrinkage.

I’m often interested in how much shrinkage is affecting my estimates and I want to see it. I’ve created this plot which I find useful. It’s done in R using ggplot and is very simple to code.

The idea is that the non shrunk estimates bi (i.e. the estimates that would be obtained by modelling classes individually) are plotted on along the line x=y at the points (bi, bi). The estimates they are being shrunk towards ai are plotted at the points (bi, ai). Finally we plot the shrunk estimates si at (bi, si) and connect the points with an arrow to illustrate the direction of the shrinkage.

Here is an example. You can see the extent of the shrinkage by the the distance covered by the arrow towards the higher level estimate.


Note the arrows do sometimes point away from the higher level estimate. This is because this data is for a single coefficient in a hierarchical regression model with multiple coefficients. Where other coefficients have been stabilized by shrinkage this causes this particular coefficient to be revised.

The R code is as follows:

# *--------------------------------------------------------------------
# | FUNCTION: shrinkPlot
# | Function for visualising shrinkage in hierarchical models
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |31/08/2013|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  orig      Estimates obtained from individual level
# |                    modelling
# |          shrink    Estimates obtained from hierarchical modelling
# |          prior     Priors in Bayesian model or fixed effects in
# |                    mixed effects model (i.e. what it is shrinking
# |                    towards.
# |          window    Limits for the plot (as a vector)
# |
# *--------------------------------------------------------------------
# | OUTPUTS: A ggplot object
# *--------------------------------------------------------------------
# | DEPENDS: grid, ggplot2
# |
# *--------------------------------------------------------------------
shrinkPlot<-function(orig, shrink, prior, window=NULL){
	data<-data.frame(orig, shrink, prior, group)
  g<-ggplot(data=data, aes(x=orig, xend=orig, y=orig, yend=shrink, col=group))
	g2<-g+geom_segment(arrow = arrow(length = unit(0.3, "cm"))) +geom_point(, aes(x=coef, y=mean))
  g3<-g2+xlab("Estimate")+ylab("Shrinkage")+ ggtitle("Shrinkage Plot")
	if (is.null(window)==FALSE){

Book Recommendations from Beyond the Grave: A Mahout Example

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

In H P Lovecraft’s The Case of Charles Dexter Ward the villainous Curwen, having taken possession of the body of Charles Dexter Ward, uses a combination of chemistry and black magic to bring back from the dead the wisest people who have ever lived. He then tortures them for their secrets. Resurrection of the dead would be a bit of an over claim for machine learning but we can get some of the way: we can bring them back for book recommendations!

The Case of Charles Dexter Ward

It’s a grandiose claim and you can judge for yourself how successful it is. Really I just wanted an opportunity to experiment with Apache Mahout and to brush up on my Java. I thought I’d share the results.

Apache Mahout is a scalable machine learning library written in Java (see previous post). The recommender part of the library takes lists of users and items (for example it could be all users on a website, the books they purchased and perhaps even their ratings of those books). Then for any given user, say Smith, it works out the users that are most similar to that user and compiles a list of recommendations based on their purchases (ignoring of course anything Smith already has). It can get more complicated but that’s the gist of it.

So all we need to supply the algorithm with is a list of users and items.

Casting about for a test data set I thought it might be interesting to take the influencers data set that I extracted from Wikipedia for a previous post. This data set is a list of who influenced who across all the famous people covered on Wikipedia. For example the list of influences for T. S. Elliot is: Dante Alighieri, Matthew Arnold, William Shakespeare, Jules Laforgue, John Donne, T. E. Hulme, Ezra Pound, Charles Maurras and Charles Baudelaire.

Why not use x is influencing y as a proxy for y has read x? Probably true in a lot of cases. We can imagine that the famous dead have been shopping on Amazon for their favourite authors and we now have the database. Even better we can tap into their likes and dislikes to give us some personalised recommendations.

It works (sort of) and was very interesting to put together. Here’s a demo:

Say you liked English Romantic poets. Then you might submit the names: keats, william wordsworth, samuel taylor coleridge

Click on the link to see what the web app returns. You’ll find that you need to hit refresh on your browser whenever you submit something. Not sure why at the moment.

Or you might try some painters you like: Picasso, Henri Matisse, Paul Gauguin

Or even just … Dylan

Or you could upset things a bit by adding in someone incongruous! keats, william wordsworth, samuel taylor coleridge, arthur c clarke

Have a go. All you need to do is add some names separated by commas after the names= part in the http request. Note although my code does some basic things to facilitate matching, the names will have to roughly match what is in wikipedia so if you’re getting an error just look the person up there.

What are we getting back?

The first list shows, in Mahout terminology, your user neighborhood, that is individuals in wikipedia who are judged to have similar tastes to you. I have limited the output to 20 though in fact the recommender is set to use 80. I will explain this in more detail in later posts. They are not in rank order (I haven’t figured out how to do this yet.)

The second list is the recommendations derived from the preferences of your ‘neighbours’. Again I’ll explain this in more detail later (or you could look at Mahout in Action where it is explained nicely)

The code is available on github

Sometimes the app goes dead for some reason and I have to restart it. If that happens just drop me a line and I’ll restart it.

I plan to do several more posts on how this was done. At the moment I’m thinking

  • A walk though of the code
  • An explanation of how the recommender was tuned
  • How it was deployed as a web app on Open Shift

But if anyone out there is interested in other things or has questions just let me know.


Machine Learning and Analytics based in London, UK