By Richard Brooker
“Politicians all sound the same and everyone is moving to the centre…”
You must have heard this a lot over the last few months. But is it fair? What evidence is there to support it?
There is a theory in economics called “The Median Voter Theorem” that provides some justification for the above statement.
It hypothesises that the frequency of inclinations, on a given political spectrum, is likely to follow a bell curve. Thus parties will eventually move to the median in order to capture the most votes.
So for this election, I decided to have a look at what effect this phenomenon is having on British politics and the speeches given in the House of Commons.
Hansard is the name for the transcripts of parliamentary debates in Britain. Records going back to 1935 can be downloaded from theyworkforyou.com in .xml format. They are published daily and record every speech, along with who gave it and their party.
In this post I look at how easy it is to distinguish a party by what they are saying.
More specifically, I built a statistical model to work out the probability that a speech came from a particular party based on the words and phrases used in the speech.
The input variables are contained in a sparse TF-IDF matrix constructed using scikit-learn's TfidfVectorizer in Python. The rows represent speeches and there is a column for every n-gram (up to length 3) in the transcripts.
An n-gram (also referred to below as a phrase or term) is any sequence of n words used consecutively.
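As a rough sketch of how such a matrix can be built (not the exact code behind this post), the snippet below runs TfidfVectorizer over three made-up sentences. The toy speeches and variable names are assumptions for illustration; the only setting taken from the text is the n-gram length of up to 3.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for Hansard speeches (hypothetical text, not real data).
speeches = [
    "the honourable member will give way",
    "we will cut taxes for working people",
    "the national health service needs more funding",
]

# ngram_range=(1, 3) creates one column per 1-, 2- and 3-word phrase.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(speeches)  # sparse matrix, one row per speech

print(X.shape)                           # (speeches, distinct n-grams)
print(sorted(vectorizer.vocabulary_)[:5])  # a few of the n-gram columns
```

Each key of `vectorizer.vocabulary_` is an n-gram and each value is the index of its column in `X`.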
The values in the matrix are the term frequency–inverse document frequency (TF-IDF) of each n-gram i in each speech j.
TF-IDF is a numerical statistic intended to reflect how important an n-gram (a word or phrase) is to a document (in our case, a speech).
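For reference, the textbook form of the statistic is shown below. The symbols are assumptions introduced here: tf is the number of times n-gram i occurs in speech j, df is the number of speeches containing n-gram i, and N is the total number of speeches.

```latex
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \log\frac{N}{\mathrm{df}_i}
```

Note that scikit-learn's TfidfVectorizer, with its default settings, uses a smoothed variant of the inverse document frequency and rescales each row to unit length, so the values it produces will not match this formula exactly.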
I carried out some light feature selection. I calculated the average uplift in probability by year. The uplift is simply how much more likely a term is to appear given that you know the party it belongs to. I then filtered out features whose uplift was within two decimal places of 1, since these terms are about equally likely whichever party is speaking.
Averaging by year does a couple of things. First, like bootstrapping, it provides confidence in the statistic. Second, it helps to filter out terms whose affinity flips over time (e.g. job titles).
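One way to read the uplift is as the ratio P(term appears | party) / P(term appears), computed per year and then averaged. The sketch below follows that reading on a tiny made-up matrix. The function name, the toy data, the treatment of terms unseen in a year, and the interpretation of "within two decimal places of 1" as rounding to 1.00 are all assumptions.

```python
import numpy as np
from scipy import sparse


def yearly_uplift(X, parties, years, party):
    """Average, over years, of P(term in speech | party) / P(term in speech)."""
    present = (X > 0).astype(float)  # 1.0 where a term occurs in a speech
    parties, years = np.asarray(parties), np.asarray(years)
    ratios = []
    for year in np.unique(years):
        in_year = np.flatnonzero(years == year)
        in_party = np.flatnonzero((years == year) & (parties == party))
        if len(in_party) == 0:
            continue  # the party has no speeches in this year
        p_term = np.asarray(present[in_year].mean(axis=0)).ravel()
        p_term_party = np.asarray(present[in_party].mean(axis=0)).ravel()
        with np.errstate(divide="ignore", invalid="ignore"):
            # Assumption: a term unseen in a year counts as neutral (uplift 1).
            ratios.append(np.where(p_term > 0, p_term_party / p_term, 1.0))
    return np.mean(ratios, axis=0)


# Hypothetical data: 4 speeches, 3 terms, 2 parties, 2 years.
X = sparse.csr_matrix(np.array([[1, 1, 0],
                                [0, 1, 1],
                                [1, 1, 0],
                                [0, 1, 1]], dtype=float))
parties = ["Con", "Lab", "Con", "Lab"]
years = [2009, 2009, 2010, 2010]

uplift = yearly_uplift(X, parties, years, "Con")
keep = np.round(uplift, 2) != 1.0  # drop terms with no party affinity
print(uplift, keep)
```

Here the first term is used only by one party, the second by both equally (so it is dropped), and the third only by the other party.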
Here are some of the most predictive features for each of the parties. Hover to see the n-gram. The y-axis shows the percentage increase in the probability of seeing the n-gram for a given party. The x-axis is ordered by this value.
The graph shows Conservatives by default, but you can change the y and x labels to look at affinities for other parties, and the party that you order by.
Sparse Matrix Representation
The backbone of my approach is the sparse matrix representation (scipy.sparse provides several such types). This is a data model that takes advantage of the large number of zeros in our feature matrix.
Usually a table/matrix is stored as a two-dimensional array.
This means the memory requirements are proportional to M×N (where M and N are the matrix dimensions). A sparse representation, on the other hand, can cut the memory requirements drastically by storing only the non-zero values. For example, a matrix can be stored as just three arrays that record its non-zero values and where they sit.
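To make this concrete, here is a small sketch using scipy's coordinate (COO) layout, which keeps one array of non-zero values plus one array each for their row and column indices. The matrix is made up for illustration, and COO is only one of several sparse layouts; the CSR format that TfidfVectorizer returns also uses three arrays but arranges them differently.

```python
import numpy as np
from scipy import sparse

# A small, mostly-zero matrix stored densely: 3 x 3 = 9 numbers.
dense = np.array([
    [0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0],
])

# The same matrix in coordinate form: only the 2 non-zero entries are kept.
coo = sparse.coo_matrix(dense)
print(coo.data)  # non-zero values: 0.3 and 0.7
print(coo.row)   # their row indices: 0 and 2
print(coo.col)   # their column indices: 2 and 0

# Converting back recovers the original matrix.
assert np.array_equal(coo.toarray(), dense)
```

The saving grows with the share of zeros, which is very high in an n-gram matrix because each speech uses only a tiny fraction of all possible phrases.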
This meant that my feature matrix could be stored in memory at a modest 4.5GB (rather than 58GB).
Now that we have our sparse representation, we can leverage some clever techniques to fit a classifier efficiently.
We use scikit-learn's SGDClassifier to fit a linear classifier.
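The model is fit by minimising a regularised training error E: the average loss L over the training speeches plus a penalty R on the model weights.
Here L is the loss function (hinge, logistic, least-squares, epsilon-insensitive, etc.) and R is the regularisation term that prevents overfitting (L1, L2 or Elastic Net).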
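Written out, an objective of this kind usually takes the form below. This follows the notation of scikit-learn's stochastic gradient descent documentation rather than anything recoverable from this page, so the symbols w (weights), b (intercept), α (regularisation strength), n (number of training speeches) and the pairs (x_i, y_i) are assumptions.

```latex
E(w, b) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) + \alpha R(w),
\qquad f(x) = w^{\top} x + b
```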
The classifier iterates through the speeches in our sparse matrix one at a time. Rather than computing the true gradient of E over the whole data set, it approximates it using a single training point and updates the parameters accordingly. The intercept is updated similarly. We can get near-optimal results using very few passes over the data.
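The snippet below is a minimal sketch of this step, assuming the classifier is scikit-learn's SGDClassifier (its options match the losses and penalties listed above). The data is synthetic: two hypothetical "parties" that each favour a different half of 40 made-up terms. The hyperparameter values are placeholders, not the ones used for the results in this post.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the TF-IDF matrix (an assumption, not Hansard data).
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=300)                  # party label per speech
dense = rng.binomial(1, 0.2, size=(300, 40)).astype(float)
dense[y == 0, 20:] = 0.0                         # party 0 uses terms 0-19
dense[y == 1, :20] = 0.0                         # party 1 uses terms 20-39
X = sparse.csr_matrix(dense)

# Hinge loss with an L2 penalty; max_iter is the number of passes over
# the data, and tol=None makes the classifier run exactly that many.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                    max_iter=5, tol=None, random_state=0)
clf.fit(X, y)                                    # accepts sparse input directly

print(clf.score(X, y))                           # training accuracy
print(clf.coef_.shape, clf.intercept_.shape)     # one weight per term
```

Because the updates touch one speech at a time, the sparse matrix never has to be converted to a dense array.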
I use a grid search to find the best loss function, regularisation, learning rate and number of iterations. Usually at this point you evaluate each of the models on a probe set, select the one with the best parameters, then assess your success on a test set.
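A grid search of this kind can be sketched as follows, again assuming SGDClassifier and synthetic data. The grid values and the names `models` and `best_model` are illustrative assumptions. The logistic loss is left out only because its name differs between scikit-learn versions ("log" in older releases, "log_loss" in newer ones).

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the TF-IDF matrix (an assumption, not Hansard data).
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=300)
dense = rng.binomial(1, 0.2, size=(300, 40)).astype(float)
dense[y == 0, 20:] = 0.0
dense[y == 1, :20] = 0.0
X = sparse.csr_matrix(dense)

# Hold out a probe set for comparing candidate models.
X_train, X_probe, y_train, y_probe = train_test_split(
    X, y, test_size=0.3, random_state=0)

grid = ParameterGrid({
    "loss": ["hinge", "modified_huber"],
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": [1e-5, 1e-4],
    "max_iter": [5, 20],
})

# Fit every combination and keep each fitted model with its probe score.
models = []
for params in grid:
    clf = SGDClassifier(tol=None, random_state=0, **params)
    clf.fit(X_train, y_train)
    models.append((clf.score(X_probe, y_probe), params, clf))

best_score, best_params, best_model = max(models, key=lambda m: m[0])
print(len(models), best_params)
```

Keeping the whole `models` list, rather than only `best_model`, is what makes it possible to reuse the other fitted models later.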
However, why throw away a bunch of otherwise good models?
There are lots of reasons you might (time, etc.), but I wanted to do something a little more interesting.
Instead of finding the best linear model, I ensemble them using a random forest. I built a smaller, dense feature matrix consisting of the scores from each of the models. This takes me from some 100,000 TF-IDF features to 200 dense ones. I can then train a random forest on my probe set. I got a 10% improvement over my best single model (when evaluated on the test set).
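The stacking idea can be sketched like this: fit several linear models on the training set, turn their scores on the probe set into a dense matrix, and train a random forest on that matrix. The data, the small grid, the split sizes and the helper `score_matrix` are assumptions for illustration, and the linear models are again assumed to be SGDClassifier instances.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the TF-IDF matrix (an assumption, not Hansard data).
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=600)
dense = rng.binomial(1, 0.2, size=(600, 40)).astype(float)
dense[y == 0, 20:] = 0.0
dense[y == 1, :20] = 0.0
X = sparse.csr_matrix(dense)

# Three-way split: train the linear models, train the forest, then test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0)
X_probe, X_test, y_probe, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

grid = ParameterGrid({"loss": ["hinge", "modified_huber"],
                      "alpha": [1e-5, 1e-4, 1e-3]})
linear_models = [
    SGDClassifier(max_iter=10, tol=None, random_state=0, **p).fit(X_train, y_train)
    for p in grid
]


def score_matrix(models, X):
    """Dense matrix with one column of decision scores per linear model."""
    return np.column_stack([m.decision_function(X) for m in models])


# The forest learns from the linear models' scores on the probe set.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(score_matrix(linear_models, X_probe), y_probe)

test_scores = score_matrix(linear_models, X_test)
party_probabilities = forest.predict_proba(test_scores)  # one column per party
print(forest.score(test_scores, y_test))
```

Training the forest on the probe set rather than the training set matters: the linear models have already seen the training speeches, so their scores there would look better than they really are.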
I now have a probability score for each of the speeches for each of the parties. Finally I averaged the results by year and party to give the chart below.
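The averaging step is a plain group-by. The sketch below uses pandas with made-up probabilities; the column names and values are hypothetical and only show the shape of the calculation.

```python
import pandas as pd

# Hypothetical per-speech scores: the speaker's actual party plus the
# model's probability that the speech belongs to each party.
scores = pd.DataFrame({
    "year": [2009, 2009, 2009, 2010, 2010, 2010],
    "party": ["Conservative", "Labour", "Labour",
              "Conservative", "Conservative", "Labour"],
    "p_conservative": [0.9, 0.2, 0.1, 0.6, 0.7, 0.4],
    "p_labour": [0.1, 0.8, 0.9, 0.4, 0.3, 0.6],
})

# One row per (year, party): the average probability of that party's
# speeches being attributed to each party.
chart_data = scores.groupby(["year", "party"])[["p_conservative", "p_labour"]].mean()
print(chart_data)
```

Each row of `chart_data` corresponds to one bubble in one year, and the columns give its position on whichever axes are selected.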
The bubbles are the actual parties, and the axes show the average probability of their speeches belonging to whichever parties are selected as the axes. It makes sense to start with the two major parties on the axes, but you can change these if you want. When the parties' speeches are very different, the bubbles will be further apart. You'll see, for example, that before 2010 Labour speeches were much more predictive of being Labour, and likewise for the Conservatives, but after 2010 this relationship weakens and the bubbles move to the centre. Move the slider to see these changes over time.
As you move into 2010 the parties move to the centre and become harder to distinguish.