A decision process for selecting statistical techniques

In this chart (detail above, full version below) I’ve tried to capture the decision process I go through to select the most promising statistical or machine learning technique given the problem and the data.

It’s a heuristic in the sense given in Wikipedia:

A heuristic technique often called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. (Wikipedia)

It certainly isn’t perfect but it is practical! In particular it’s worth bearing in mind that

• It does not cover the tests you’d need to go through to establish whether a technique is being applied correctly. Also where a technique is sophisticated I’d probably start with something simpler and then work towards the more complex technique.
• There are of course many other available techniques but these are ones I use a lot.
• Some personal preferences are also built in. For example I tend to go for a Bayesian model whenever the problem does not call for a model using a linear combination of explanatory variables as I find it easier to think about the more unusual cases in this way.

This diagram was made with fantastic draw.io. Click into it for the full version.

Glasseye: bringing together markdown, d3 and the Tufte layout

Glasseye is a package I’m developing to present the results of statistical analysis in an attractive and hopefully interesting way. It brings together three great things that I use a lot:

1. The markdown markup language.
2. The Tufte wide margin layout
3. Visualisation using d3.js

See a full demo of what it can do here and the you can visit the github repository here

Here is what it looks like when transformed into html.

The idea is to be able to write up work in markdown and have the results transformed into something like a Tufte layoutof which more below. For the Tufte layout I took the excellent tufte.css style sheet developed by Dave Liepmann and co and made a few changes to suit my purposes. Finally I’ve added some d3 charts (just a small selection at the moment but this will grow) that can easily invoked from within the markdown.

It’s all very very beta at the moment. I’m not claiming it’s ready to go. I would like to add lots more charts, redesign the d3 code and improve it’s overall usability (in particular replace the tags approach with something more in the spirit of markdown) however I thought I’d share it as it is. Hope you find it interesting

A post by Duncan Stoddard. It first appeared among lots of other great posts at www.sparepiece.com

Bayesian analysis is perfect in situations when data are scarce and experience and intuition are plentiful. We can combine available data with the collective intelligence of a team in the form of parameter priors in a Bayesian model to test hypotheses and inform strategic decision making.

Let’s imagine we want to build a model that explains the drivers of a binary dependent variable, e.g. ‘prospect buys or doesn’t buy product x’, in order to test hypotheses about what makes a good prospective client and to score prospective client lists. We’ve got some descriptive data on past acquisitions, e.g. prospect geography, age, industry and some behavioural data, e.g. ‘exposed to ad A’, ‘responded to email campaign B’ etc. We’ve also got the collective knowledge of a client services team. We’re going to combine these in a Bayesian logistic regression model using the MCMCpack package in R.

1. Generating coefficient priors

First we need to get each member of the team to (independently) place weights on a list of candidate explanatory variables. To keep this simple we’ll allow integer values on [-10,10] and then convert these to usable priors.

The coefficient in a logistic regression can be interpreted as the log of the % increase in the odds of y=1 over a base value, which we can take to be the overall response rate. We can map our weights to a comparable value with the following logic:

Weight w to probability p:

$p = \frac{s+10}{20}$

The odds ratio:

$\frac{p}{1-p} = \frac{2+10}{10-s}$

The base probability:

$p_{0} = \frac{1}{n}\sum_{i}^{n}y_{i}$

Weight to prior coefficient:

$\beta = ln \left ( \frac{\frac{p}{1-p}}{\frac{p_{0}}{1-p_{0}}} \right ) = ln\left (\frac{(\omega+10)(1-p_{0})}{p_{0}(10-\omega)} \right)$

Once we’ve applied this formula we can calculate prior mean and precision values. These will then be read into R and used as the b0 and B0 arguments of the MCMClogit function. The b0 argument is a vector of length k+1, where k is the number of variables in the model, and B0 is a diagonal square matrix (assuming independence between coefficient distributions).

2. Normalising the data

The weighting process implicitly assumes variable values are on the same scale. We therefore need to normalise the data to ensure this is so. We can do that simply  with:

${x_i}'=\frac{x_i - min(x)}{max(x) - min(x)}$

3. Running the model

The MCMClogit(formula, data, b0, B0) function gives posterior distributions for each beta, assuming a multivariate normal prior. We can use summary(posterior) to get the posterior mean and standard deviation values, and plot(posterior) to get distribution plots.

The data updates our prior knowledge and pulls our estimates towards a ‘truer’ representation of reality.

Our posterior means will serve two purposes:

a) parameterise a scoring equation for predicting response rates conditional on prospect characteristics.

b) provide insights on the characteristics of good prospective clients and test decision makers’ hypotheses.

4. Building a scoring equation

Summing the product of our posterior means and normalised data will give the log of the odds ratio. We can convert this into response probabilities like this:

$ln\frac{s}{1-s} = \sum_{i}^{k}\hat{\beta }_ix_i$

$p = \frac{e^{\sum_{i}^{k}\hat{\beta _i}x_i}}{1+e^{\sum_{i}^{k}\hat{\beta _i}x_i}}$

s represents the expected response probability given a vector of characteristics x. We can apply this to our prospect lists and rank to find the best targets for marketing.

5. Converting posterior means back to scores

Lastly, we want to convert our posterior mean estimates back to score integers on [-10,10] so we can report back results in a form they’re now familiar with. We can do this easily with the following:

$\omega = \frac{10e^{\hat{\beta }}+10}{1+e^{\hat{\beta }}}$

Do politicians sound the same? An investigation using Latent Semantic Analysis

By Richard Brooker

“Politicians all sound the same and everyone is moving to the centre…”

You must have heard this a lot over the last few months? However is it fair? What evidence is there to show this?

Introduction

There is a theory in economics called “The Median Voter Theorem” that provides some justification for the above statement.

The Median Voter Theorem: a majority rule voting system will select the outcome most preferred by the median voter” [1]

It hypothesises that the frequency of inclinations, on a given political spectrum, is likely to follow a bell curve. Thus parties will eventually move to the median in order to capture the most votes.

So for this election, I decided to have a look at what effect this phenomenon is having on British politics and the speeches given in the House of Commons.

The Data

Hansard is the name for the Transcripts of Parliamentary Debates in Britain. Records going back to 1935 can be downloaded from theyworkforyou.com in .xml format. They are published daily and record every speech along with who gave it and their party.

In this post I look at how easy it is to distinguish a party by what they are saying.

More specifically, I built a statistical model to work out the probability that a speech came from a particular party based on the words and phrases used in the speech.

Feature  Construction

The input variables are contained in a sparse TF-IDF matrix constructed using Scikit Learn TfidfVectorizer in python. The rows represent speeches and there is column for every n-gram (up to length 3) in the transcripts.

An n-gram (/’phrase’/’term’) is any sequence of n words used consecutively.

The values in the matrix are the the term frequency–inverse document frequency (TF-IDF). For an n-gram i and speech j,

The TF-IDF is a numerical statistic that is intended to reflect how important a n-gram (word/’phrase’) is to a document (in our case speech).

Feature  Selection

I carried out some light feature selection. I calculated the average uplift in probability by year. The uplift is simply how much more likely a term is to appear given that you know the party it belongs to. I then filtered out features within 2 decimal points of 1.

Averaging it by year does a couple of things. Like boot strapping, it provides confidence in the statistic . Secondly it also helps to filter out terms whose affinity flips across time, (e.g. job titles).

Here are some of the most predictive features for each of the parties. Hover to see the n-gram. The y-axis shows the percentage increase in the probability of seeing the n-gram for a given party. The x-axis is ordered by this value.  

 

The graph shows Conservatives by default, but you can change the y and x labels to look at affinities for other parties, and the party that you order by.

Sparse Matrix Representation

The backbone of my approach is the sparse matrix representation (scipy.sparse is one such type). This is a data model that takes advantage of the large number of zeros in our feature matrix.

Usually a table/matrix is stored as a two dimensional array e.g.

This means the memory requirements are proportional to MxN (where M and N are the matrix dimensions).  A sparse representation on the other hand can cut the memory requirements drastically by only storing non zero values. For the example the same matrix above can be stored the following 3 arrays.

This meant that my feature matrix could be stored in memory at a modest 4.5GB (rather than 58GB).

Modeling

Now we have our sparse representation we are able to leverage some clever techniques to efficiently fit a classifier.

We use Scikit Learns’ SVMClassifier to fit a linear classifier.

The model is fit by minimising the loss function
Where L is the loss function (Hinge, Logistic, Least-Squares, or Epsilon-Insensitive…etc). R is the regulation to prevent overfitting (L1, L2 or Elastic Net).

The classifier then iterates through each speech in our sparse matrix and updates the parameters using a lazy approximation of the gradient of E. It then approximates the true gradient by considering each training point at a time. the intercept is updated similarly. We can get near optimal results using very few passes of the data.

I use a grid search to find the best loss function, regularisation learning rate and number of iterations. Usually at this point you evaluate each of the models on a probe set, select the one model with the best parameters, then asses your success on a test set.

However, why throw away a bunch of otherwise good models?

There are lots of reasons you might (time, etc), however I wanted to do something a little more interesting.

Instead of finding the best linear model I ensemble them using random forest. I built an smaller dense feature matrix consisting of the scores from each of the models. This takes me from some 100,000 tfidf features to 200 dense ones. I can now train a random forest on my probe set. I got a 10% improvement on my best single model (when evaluated on the test set).

The Results

I now have a probability score for each of the speeches for each of the parties.  Finally I averaged the results by year and party to give the chart below.

The bubbles are the actual parties and the axes show the average probability of their speeches belonging to whichever parties are selected as the axes. It makes sense to start off with the two major parties on the axes but you can change these if you want. When the political speeches are very different the bubbles will of course be further apart. You’ll see for example that before 2010 the Labour speeches were much more predictive of being Labour and likewise for the Conservatives but post 2010 this relationship weakens and the bubbles move to the centre. Move the slider to see these changes over time.

 

 

As you move into 2010 the parties move to the centre and become harder to distinguish.

Modelling for decisions

Here’s the deck I presented today at the Predictive Analytics Innovation conference in London.

The idea was to look at ways in which we might use modern statistical methods to help build models that are grounded firmly on common sense intuitions and to do it all in a very short time.

If you are interested in more details just let me know on twitter or drop me an email

Using D3 to show cost, revenue and ROI

Return on investment is often measured as the gain from an investment divided by the cost of the investment, sometimes expressed as a percentage. For example if a marketing campaign cost £10K but brought in £20K of revenue on top of the usual sales then the ROI is 200%.

(Note ROI is arguably more properly defined as (gain – cost)/cost but I’ve found that most of the people and industries that I’ve worked with slip naturally into the first definition: gain/cost. In any case both definition capture the same idea. Thanks to Eduardo Salazar for pointing this out.)

Now if you are just given the ROI you’ll find you are missing any of idea of scale. The same ROI could be achieved with a revenue gain of £200 and with one of £200 million. So it would be nice to see cost, revenue and ROI visualised all in one go. There are a few ways to do this but after playing around I came up with the following representation which personally I like the best. It’s a simple scatterplot of cost against revenue but since all points on straight lines radiating from the origin have the same ROI it’s easy to overlay that information. If r is the ROI the the angle of the corresponding spoke is arctan(r).

Note you can drag about the labels. That’s my preferred solution for messy scatterplot labelling.

Hopefully it’s obvious that the idea is to read off the ROI from the position of the point relative to the spokes. The further out toward the circumference the greater the scale of the success or the disaster, depending on the ROI.

To modify the graph for your own purposes just take the code from here and substitute in your data where var dataset is defined. You can change which ROIs are represented by altering the values in the roi array. If you save the code as html and open in a browser you should see the graph. Because d3 is amazing the graph should adapt to fit your data.

You can also find the code here as a JSFiddle.

Thanks to Paul McAvoy for posing the problem and for all the other interesting things he’s shown me!

Four weeks to launch!

O nly four weeks to go until our official launch date of 28th October. It feels like it’s been a long build up but we
believe it will be worth the wait! In the meantime here’s a bit more information about the kind of things we do, why we are different
and what motivates us. If you’re interested please do get in touch.

What do we do?

T here’s a huge interest in data science at the moment. Businesses understandably want to be a part of it. Very often they assemble the ingredients (the software, the hardware, the team) but then find that progress is slow.

Coppelia is a catalyst in these situations. Rather than endless planning we get things going straight away with the build, learn and build again approach of agile design. Agile and analytics are a natural fit!

Projects might be anything from using machine learning to spot valuable patterns in purchase behaviour to building decision making tools loaded with artificial intelligence.

The point is that good solutions tend to be bespoke solutions.

While we build we make sure that in-house teams are heavily involved – trained on the job. We get them excited about the incredible tools that are out there and new ways of doing things. This solves the problem of finding people with the data science skill set. It’s easier to grow your technologists in-house.

The tools are also important. We give our clients a full view of what’s out there, focusing on open source and cloud based solutions. If a client wishes to move from SAS to R we train their analysts not just in R but in the fundamentals of software design so that they build solid, reliable models and tools.

We teach the shared conventions that link technologies together so that soon their team will be coding in python and building models on parallelised platforms. It’s an investment for the long term.

Finally we know how important it is for the rest of the business to understand and get involved with these projects. Visualisation is a powerful tool for this and we emphasize two aspects that are often forgotten: interactivity (even if it’s just the eye exploring detail) and aesthetics: a single beautiful chart telling a compelling story can be more influential than a hundred stakeholder meetings.

Why are we different?

O ne thing is that we prioritise skills over tools. There are a lot of people out there building tools but they tend to be about either preprocessing data or prediction and pattern detection for a handful of well defined cases. We love the tools but they don’t address the most difficult problem of how you turn the data into information that can be used in decision making. For that you need skilled analysts wielding tools. Creating the skills is a much harder problem.

Coppelia offers a wide range of courses, workshops and hackathons to kickstart your data science team. See our solutions section for a full description of what we offer.

Another difference is that we are statisticians who have been inspired by software design. We apply agile methods and modular design not just to the tools we build ourselves but also to traditional analytical tasks like building models.

Collaboration using tools like git and trello has revolutionised the way we work. Analysis is no longer a solitary task, it’s a group thing and that means we can take on bigger and more ambitious projects.

But what is most exciting for us is our zero overhead operating model and what it enables us to do. Ten years ago if we’d wanted to run big projects using the latest technology we’d have had to work for a large organisation. Now we can run entirely on open source.

For statistical analysis we have R, to source and wrangle data we have python, we can rent hardware by the hour from AWS and use it to parallelise jobs using hadoop.

Even non-technical tasks benefit in this way: marketing using social media, admin through google drive, training on MOOCs, design using inkscape and pixlr, accounting on quick file.

Without these extra costs hanging over us we are free to experiment, innovate, cross disciplines and work on topics that interest us, causes we like. Above all it gives us time to give back to the sources which have allowed us to work in this way: publishing our code, sharing insights through blogging, helping friends and running local projects

A nything where disciplines are crossed. We like to look at how statistics and machine learning can be combined with AI, music, graphic design, economics, physics, philosophy.
We are currently looking at how the problem solving frameworks in AI might be applied to decision making in marketing.

Bayesian statistics and simulation for problem

solving always seem to be rich sources of ideas. We’re also interested in how browser technology allows greater scope for communication. We blog in a potent mixture of text, html, markdown and javascript.

What technology are we into?

I t’s a long list but most prominent are R, python, distributed machine learning
(currently looking at Spark), and d3.js. Some current projects include a package to convert
R output into d3 and AI enhanced statistical modelling.

Quick start regex for analysts: Part II

In my previous post (Part I) I went over the basic metacharacters and special signs in regex. In this second part I will be showing you how to simplify regular expressions.

Repetition metacharacters

So far, we looked at expressions that always match the pattern to a single position in the text. Repetition metacharacters make the expression much more flexible by expanding the pattern to a specified number of characters.

‘* (proceeding item zero or more times)
+ (proceeding item one or more times)
? (proceeding item zero or one time)

For example (click on the code to see how it works):

apples*

will match the word with no “s” as well as one or many “s”. But in:

apples+

the ‘s’ must be there so it will match words with at least one “s”.

Whereas:

apples?

The ‘s’ doesn’t have to be at the end of the string but it can’t be repeated.

Quantified repetition

This works similarly to the repetition metacharacters. The difference is that we can specify exact number of repetitions of the sign.

{ - start of quantified repetition
} – end of the repetition

Syntax:

{min,max} (min and max are positive numbers)

Tip

min must always be there even if it is 0. Max and the coma are optional.

In our previous post we had an example where we wanted to match only the year. We can now do the following:

\d{4}

This represents a digit repeated exactly 4 times and can be used instead of typing \d\d\d\d.

Similarly we find:

\w{5,10} - minimum 5 letter and maximum 10 letter word
\w{5,} - minimum 5 letters word
\w{5} - exactly 5 letters word

Let’s say we are trying to pick out IP addresses from the text. An IP address is a sequence of 4 sets of 1 – 3 digits separated by dots. It can be easily expressed by:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The .(dot) had to be escaped to be interpreted literally and each set of digits is repeated a minimum of one and a maximum of three times. Shortly we will learn how to simplify it even further.

Tip

There are usually many ways to write a regex expression matching your needs and no one perfect way! As long as it does the job and you are sure it matches exactly what you want don’t worry too much about what it looks like!

Grouping metacharacters

Using () around the groups of characters will enable repetation of that group.

Tip

Don’t group in the character sets (within []) as () will have the literal meaning.

(what)+

will match “what” one or more times.

And to match words with or without the prefix we use:

(in)?dependent

Coming back to the IP address example. We can group each 1-3 digits and an optional dot and then repeat it 4 times.

(\d{1,3}\.?){4}

This will fully match any IP address.

Alternation metacharacters

A common way of dealing with an incorrect spelling is by using the OR character and then grouping the two (or more) alternatives. This way we don’t need to repeat the [] sets.

| - (previous OR next expression)

Take for example the commonly misspelled word:

w(ei|ie)rd

Anchors

Anchors signify the position of the pattern in the text. Note, that it is a second meaning of carat (it is also a negation, check it out here).

^ (start of string/line)
$(end of string or line) \A (start of string, never end of line) \Z (end of string, never end of line) For example: ^apple will match only ‘apple’ at the beginning of the line. Lookaround assertions Bear in mind that these expressions differ significantly in the different variants of regex. ?= (Assertion of what ought to be ahead) ?! (negative lookahead) ?<= (positive look behind assertion, what ought to be behind) ?<!-- (negative look behind) For example: (?=seashore)sea Will match “sea” only if it is followed by “shore”. Tip Look behind can’t be used with repetitions or optional expressions. It also doesn’t work in JavaScript (hence can’t be tested in regexpal). Tip It tends not to work very well in text editors. Differences between programming languages Here is a quick summary of the major differences between regex in different programming languages. It is not exhaustive, but will give you the idea of the scope of differences. Again, it is always good to let Google know what language you are working in while searching for regex solutions. Regex Ruby Java Perl Python/R Unix JavaScript PHP .NET Character Classes :(e.g. \d; \w) Yes No Yes Yes No Yes Yes Yes POSIX bracket expressions Yes No Yes No Yes No Yes No Quantifiers: * Yes Yes Yes Yes Yes Yes Yes Yes Quantifiers: + and ? Yes Yes Yes Yes No Yes Yes Yes Anchors: \A and \Z Yes Yes Yes Yes No No Yes Yes Line break: /m Yes No Yes No Yes Yes Yes No Special command for line break No Yes No Yes No No No Yes Lookaround assertions only 1.9 and above Yes Yes Yes No No Yes Yes My next post is all about using regex in a real life example! The local neighbourhood of C Major Here’s a chart I drew for myself to understand the relationships between chords in music theory. Doesn’t seem to have much to do with machine learning and statistics but in a way it does since I found it a lot easier to picture the chords existing in a sort of network space linked by similarity. Similarity here is defined as the removal or addition of a note, or the sliding of a note one semitone up or down. What’s wrong with me! The neighbourhood of C-Major Distribution for the difference between two binomially distributed random variables I was doing some simulation and I needed a distribution for the difference between two proportions. It’s not quite as straightforward as the difference between two normally distributed variables and since there wasn’t much online on the subject I thought it might be useful to share. So we start with $X \sim Bin(n_1, p_1)$ $Y \sim Bin(n_2, p_2)$ We are looking for the probability mass function of $Z=X-Y$ First note that the min and max of the support of Z must be $(-n_2, n_1)$ since that covers the most extreme cases ($X=0$ and $Y=n_2$) and ($X=n_1$ and $Y=0$). Then we need a modification of the binomial pmf so that it can cope with values outside of its support. $m(k, n, p) = \binom {n} {k} p^k (1-p)^{n-k}$ when $k \leq n$ and 0 otherwise. Then we need to define two cases 1. $Z \geq 0$ 2. $Z < 0$ In the first case $p(z) = \sum_{i=0}^{n_1} m(i+z, n_1, p_1) m(i, n_2, p_2)$ since this covers all the ways in which X-Y could equal z. For example when z=1 this is reached when X=1 and Y=0 and X=2 and Y=1 and X=3 and Y=4 and so on. It also deals with cases that could not happen because of the values of $n_1$ and $n_2$. For example if $n_2 = 4$ then we cannot get Z=1 as a combination of X=4 and Y=5. In this case thanks to our modified binomial pmf the probablity is zero. For the second case we just reverse the roles. For example if z=-1 then this is reached when X=0 and Y=1, X=1 and Y=2 etc.$latex p(z) = \sum_{i=0}^{n_2} m(i, n_1, p_1) m(i+z, n_2, p_2)[l\atex]

Put them together and that’s your pmf.

Here’s the function in R and a simulation to check it’s right (and it does work.)