In this chart (detail above, full version below) I’ve tried to capture the decision process I go through to select the most promising statistical or machine learning technique given the problem and the data.
It’s a heuristic in the sense given in Wikipedia:
A heuristic technique, often called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. (Wikipedia)
It certainly isn’t perfect but it is practical! In particular it’s worth bearing in mind that:
It does not cover the tests you’d need to go through to establish whether a technique is being applied correctly. Also, where a technique is sophisticated I’d probably start with something simpler and then work towards the more complex technique.
There are of course many other available techniques but these are ones I use a lot.
Some personal preferences are also built in. For example I tend to go for a Bayesian model whenever the problem does not call for a model using a linear combination of explanatory variables as I find it easier to think about the more unusual cases in this way.
This diagram was made with fantastic draw.io. Click into it for the full version.
Glasseye is a package I’m developing to present the results of statistical analysis in an attractive and hopefully interesting way. It brings together three great things that I use a lot:
The markdown markup language.
The Tufte wide margin layout
Visualisation using d3.js
See a full demo of what it can do here, and you can visit the github repository here.
Here is what it looks like when transformed into html.
The idea is to be able to write up work in markdown and have the results transformed into something like a Tufte layout, of which more below. For the Tufte layout I took the excellent tufte.css style sheet developed by Dave Liepmann and co and made a few changes to suit my purposes. Finally I’ve added some d3 charts (just a small selection at the moment but this will grow) that can easily be invoked from within the markdown.
It’s all very, very beta at the moment. I’m not claiming it’s ready to go. I would like to add lots more charts, redesign the d3 code and improve its overall usability (in particular replace the tags approach with something more in the spirit of markdown), however I thought I’d share it as it is. Hope you find it interesting.
A post by Duncan Stoddard. It first appeared among lots of other great posts at www.sparepiece.com
Bayesian analysis is perfect in situations when data are scarce and experience and intuition are plentiful. We can combine available data with the collective intelligence of a team in the form of parameter priors in a Bayesian model to test hypotheses and inform strategic decision making.
Let’s imagine we want to build a model that explains the drivers of a binary dependent variable, e.g. ‘prospect buys or doesn’t buy product x’, in order to test hypotheses about what makes a good prospective client and to score prospective client lists. We’ve got some descriptive data on past acquisitions, e.g. prospect geography, age, industry and some behavioural data, e.g. ‘exposed to ad A’, ‘responded to email campaign B’ etc. We’ve also got the collective knowledge of a client services team. We’re going to combine these in a Bayesian logistic regression model using the MCMCpack package in R.
1. Generating coefficient priors
First we need to get each member of the team to (independently) place weights on a list of candidate explanatory variables. To keep this simple we’ll allow integer values on [-10,10] and then convert these to usable priors.
The coefficient on a variable in a logistic regression can be interpreted as the log of the ratio of the odds of y=1 to some base odds, which we can take to come from the overall response rate. We can map our weights to comparable values with the following logic:
Weight w to probability p: map each integer weight on [-10, 10] to a probability of response.
The odds ratio: convert that probability to odds, p/(1-p).
The base probability: take the overall response rate as the base and form the base odds in the same way.
Weight to prior coefficient: the prior coefficient is the log of the ratio of the two odds.
Once we’ve applied this formula we can calculate prior mean and precision values. These will then be read into R and used as the b0 and B0 arguments of the MCMClogit function. The b0 argument is a vector of length k+1, where k is the number of variables in the model, and B0 is a diagonal square matrix (assuming independence between coefficient distributions).
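A small sketch of this step in Python may help. The linear weight-to-probability mapping, the base rate, the weights and the prior precision below are all assumptions made for the illustration, not values or formulas from the original analysis; the resulting vectors would be passed to MCMClogit in R as b0 and B0.

```python
import numpy as np

# Illustrative sketch: turn averaged team weights into b0/B0-style priors.
base_rate = 0.10                        # assumed overall response rate
weights = np.array([3.0, -2.0, 7.0])    # assumed averaged team weights, one per variable, on [-10, 10]

p = np.clip(base_rate * (1 + weights / 10), 0.001, 0.999)   # assumed weight -> probability mapping
base_odds = base_rate / (1 - base_rate)
odds = p / (1 - p)

prior_mean = np.log(odds / base_odds)         # prior coefficient for each variable
b0 = np.concatenate(([0.0], prior_mean))      # prior means, intercept first (length k + 1)
B0 = np.diag(np.full(len(b0), 1.0))           # diagonal prior precision matrix
print(b0)
```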
2. Normalising the data
The weighting process implicitly assumes variable values are on the same scale. We therefore need to normalise the data to ensure this is so. We can do that simply with:
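A minimal sketch of the normalisation step (z-scoring each column); the column names and values are made up for illustration:

```python
import pandas as pd

# Made-up descriptive and behavioural variables for a handful of prospects
df = pd.DataFrame({
    "age": [34, 51, 29, 45],
    "exposed_to_ad_a": [1, 0, 1, 1],
    "responded_to_email_b": [0, 1, 0, 1],
})

# Subtract the mean and divide by the standard deviation so all columns share a scale
df_norm = (df - df.mean()) / df.std()
print(df_norm)
```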
3. Running the model
The MCMClogit(formula, data, b0, B0) function gives posterior distributions for each beta, assuming a multivariate normal prior. We can use summary(posterior) to get the posterior mean and standard deviation values, and plot(posterior) to get distribution plots.
The data updates our prior knowledge and pulls our estimates towards a ‘truer’ representation of reality.
Our posterior means will serve two purposes:
a) parameterise a scoring equation for predicting response rates conditional on prospect characteristics.
b) provide insights on the characteristics of good prospective clients and test decision makers’ hypotheses.
4. Building a scoring equation
Summing the product of our posterior means and the normalised data gives the log odds. We can convert this into response probabilities with the standard logistic transform:

$$s = \frac{e^{x\beta}}{1 + e^{x\beta}}$$
s represents the expected response probability given a vector of characteristics x. We can apply this to our prospect lists and rank to find the best targets for marketing.
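A short sketch of the scoring function; the names are placeholders, with beta holding the posterior means (intercept first) and X_norm the normalised characteristics of the prospects to score:

```python
import numpy as np

def score(X_norm, beta):
    """Expected response probability for each row of normalised characteristics."""
    X = np.column_stack([np.ones(len(X_norm)), X_norm])   # add the intercept column
    log_odds = X @ beta
    return 1.0 / (1.0 + np.exp(-log_odds))                # logistic transform of the log odds

# Example with made-up numbers
beta = np.array([-2.2, 0.4, 0.9])
print(score(np.array([[0.5, -1.2], [1.3, 0.7]]), beta))
```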
5. Converting posterior means back to scores
Lastly, we want to convert our posterior mean estimates back to score integers on [-10,10] so we can report back results in a form they’re now familiar with. We can do this easily with the following:
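Continuing the illustrative mapping assumed in the earlier prior-construction sketch (so again an assumption rather than the post's own formula), the reverse transformation might look like this:

```python
import numpy as np

def coef_to_score(beta, base_rate=0.10):
    """Map a posterior mean coefficient back to an integer score on [-10, 10]."""
    base_odds = base_rate / (1 - base_rate)
    odds = base_odds * np.exp(beta)            # odds implied by the coefficient
    p = odds / (1 + odds)                      # back to a probability
    return np.clip(10 * (p / base_rate - 1), -10, 10).round()

print(coef_to_score(np.array([0.3, -0.2, 0.9])))
```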
“Politicians all sound the same and everyone is moving to the centre…”
You must have heard this a lot over the last few months. But is it fair? What evidence is there for it?
There is a theory in economics called “The Median Voter Theorem” that provides some justification for the above statement.
The Median Voter Theorem: “a majority-rule voting system will select the outcome most preferred by the median voter.”
It hypothesises that the frequency of inclinations, on a given political spectrum, is likely to follow a bell curve. Thus parties will eventually move to the median in order to capture the most votes.
So for this election, I decided to have a look at what effect this phenomenon is having on British politics and the speeches given in the House of Commons.
Hansard is the name for the Transcripts of Parliamentary Debates in Britain. Records going back to 1935 can be downloaded from theyworkforyou.com in .xml format. They are published daily and record every speech along with who gave it and their party.
In this post I look at how easy it is to distinguish a party by what they are saying.
More specifically, I built a statistical model to work out the probability that a speech came from a particular party based on the words and phrases used in the speech.
The input variables are contained in a sparse TF-IDF matrix constructed using Scikit-Learn’s TfidfVectorizer in Python. The rows represent speeches and there is a column for every n-gram (up to length 3) in the transcripts.
An n-gram (a ‘phrase’ or ‘term’) is any sequence of n consecutive words.
The values in the matrix are the term frequency–inverse document frequency (TF-IDF). For an n-gram i and speech j,

$$\mathrm{tfidf}(i, j) = \mathrm{tf}(i, j) \times \log\frac{N}{\mathrm{df}(i)}$$

where tf(i, j) is the number of times i appears in j, df(i) is the number of speeches containing i, and N is the total number of speeches. The TF-IDF is a numerical statistic intended to reflect how important an n-gram (word or phrase) is to a document (in our case a speech).
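A minimal sketch of building this matrix; the speeches list stands in for the texts pulled from Hansard:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder speech texts
speeches = [
    "the right honourable member will know that we have cut the deficit",
    "we will scrap the bedroom tax and raise the minimum wage",
    "the long term economic plan is working",
]

vectoriser = TfidfVectorizer(ngram_range=(1, 3))   # unigrams, bigrams and trigrams
X = vectoriser.fit_transform(speeches)             # sparse (speeches x n-grams) matrix
print(X.shape)
```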
I carried out some light feature selection. I calculated the average uplift in probability by year; the uplift is simply how much more likely a term is to appear given that you know the party of the speaker. I then filtered out features whose average uplift was within two decimal places of 1.
Averaging by year does a couple of things. Like bootstrapping, it provides some confidence in the statistic, and it also helps to filter out terms whose affinity flips over time (e.g. job titles).
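A hedged sketch of this uplift calculation, under assumptions about the data layout: X is a binary (speech x term) occurrence array, and party and year are arrays with one entry per speech. None of these names or the filtering threshold come from the original post.

```python
import numpy as np

def average_uplift(X, party, year, target_party):
    """Average over years of P(term | party) / P(term)."""
    uplifts = []
    for y in np.unique(year):
        in_year = year == y
        p_term = X[in_year].mean(axis=0) + 1e-9                        # P(term appears)
        p_given = X[in_year & (party == target_party)].mean(axis=0)    # P(term | party)
        uplifts.append(p_given / p_term)
    return np.mean(uplifts, axis=0)

# Drop terms whose average uplift is essentially 1 (illustrative threshold):
# keep = np.abs(average_uplift(X, party, year, "Conservative") - 1) >= 0.005
```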
Here are some of the most predictive features for each of the parties. Hover to see the n-gram. The y-axis shows the percentage increase in the probability of seeing the n-gram for a given party. The x-axis is ordered by this value.
The graph shows Conservatives by default, but you can change the y and x labels to look at affinities for other parties, and the party that you order by.
Sparse Matrix Representation
The backbone of my approach is the sparse matrix representation (scipy.sparse is one such type). This is a data model that takes advantage of the large number of zeros in our feature matrix.
Usually a table or matrix is stored as a two-dimensional array, which means the memory requirements are proportional to M×N (where M and N are the matrix dimensions). A sparse representation, on the other hand, can cut the memory requirements drastically by storing only the non-zero values: the same matrix can be held in just three short arrays.
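A small illustration of the idea, using scipy's compressed sparse row (CSR) format, which keeps only the non-zero values plus two index arrays:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 3, 0],
                  [0, 0, 0, 0],
                  [1, 0, 0, 2]])

sparse = csr_matrix(dense)
print(sparse.data)     # [3 1 2]      the non-zero values
print(sparse.indices)  # [2 0 3]      their column positions
print(sparse.indptr)   # [0 1 1 3]    where each row starts in the arrays above
```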
This meant that my feature matrix could be stored in memory at a modest 4.5GB (rather than 58GB).
Now we have our sparse representation we are able to leverage some clever techniques to efficiently fit a classifier.
We use Scikit-Learn’s SGDClassifier to fit a linear classifier.
The model is fit by minimising the regularised training error

$$E(w, b) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \alpha R(w)$$

where L is the loss function (hinge, logistic, least-squares, epsilon-insensitive, etc.) and R is the regularisation term that prevents overfitting (L1, L2 or Elastic Net).
The classifier iterates through the speeches in our sparse matrix, approximating the true gradient of E by considering one training point at a time and updating the parameters after each one (the intercept is updated in the same way). We can get near-optimal results with very few passes over the data.
I use a grid search to find the best loss function, regularisation, learning rate and number of iterations. Usually at this point you evaluate each of the models on a probe set, select the one with the best parameters, then assess your success on a test set.
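A sketch of the fitting step: X_train here stands for the sparse TF-IDF matrix and y_train for the party labels, and the grid is illustrative rather than the one used for the post.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "loss": ["hinge", "log_loss", "modified_huber"],   # "log_loss" is called "log" in older scikit-learn
    "penalty": ["l1", "l2", "elasticnet"],
    "alpha": [1e-6, 1e-5, 1e-4],
    "max_iter": [5, 10, 20],
}

search = GridSearchCV(SGDClassifier(), param_grid, cv=3, n_jobs=-1)
# search.fit(X_train, y_train)
# best_model = search.best_estimator_
```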
However, why throw away a bunch of otherwise good models?
There are lots of reasons you might (time, etc), however I wanted to do something a little more interesting.
Instead of picking the single best linear model I ensemble them using a random forest. I built a smaller, dense feature matrix consisting of the scores from each of the models. This takes me from some 100,000 TF-IDF features to 200 dense ones. I can now train a random forest on my probe set. I got a 10% improvement on my best single model (when evaluated on the test set).
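A sketch of this stacking step; models, X_probe, y_probe and X_test are placeholders for the fitted linear classifiers from the grid search and the held-out data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stack_features(models, X):
    # Each linear model contributes its decision scores as dense features for the forest
    return np.column_stack([m.decision_function(X) for m in models])

# forest = RandomForestClassifier(n_estimators=500)
# forest.fit(stack_features(models, X_probe), y_probe)
# test_scores = forest.predict_proba(stack_features(models, X_test))
```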
I now have a probability score for each of the speeches for each of the parties. Finally I averaged the results by year and party to give the chart below.
The bubbles are the actual parties and the axes show the average probability of their speeches belonging to whichever parties are selected as the axes. It makes sense to start off with the two major parties on the axes but you can change these if you want. When the political speeches are very different the bubbles will of course be further apart. You’ll see for example that before 2010 the Labour speeches were much more predictive of being Labour and likewise for the Conservatives but post 2010 this relationship weakens and the bubbles move to the centre. Move the slider to see these changes over time.
As you move into 2010 the parties move to the centre and become harder to distinguish.
Return on investment is often measured as the gain from an investment divided by the cost of the investment, sometimes expressed as a percentage. For example if a marketing campaign cost £10K but brought in £20K of revenue on top of the usual sales then the ROI is 200%.
(Note ROI is arguably more properly defined as (gain – cost)/cost but I’ve found that most of the people and industries I’ve worked with slip naturally into the first definition: gain/cost. In any case both definitions capture the same idea. Thanks to Eduardo Salazar for pointing this out.)
Now if you are just given the ROI you’ll find you are missing any idea of scale. The same ROI could be achieved with a revenue gain of £200 and with one of £200 million. So it would be nice to see cost, revenue and ROI visualised all in one go. There are a few ways to do this but after playing around I came up with the following representation, which personally I like the best. It’s a simple scatterplot of cost against revenue, but since all points on straight lines radiating from the origin have the same ROI it’s easy to overlay that information. If r is the ROI then the angle of the corresponding spoke is arctan(r).
Note you can drag about the labels. That’s my preferred solution for messy scatterplot labelling.
Hopefully it’s obvious that the idea is to read off the ROI from the position of the point relative to the spokes. The further out toward the circumference the greater the scale of the success or the disaster, depending on the ROI.
To modify the graph for your own purposes just take the code from here and substitute in your data where var dataset is defined. You can change which ROIs are represented by altering the values in the roi array. If you save the code as html and open in a browser you should see the graph. Because d3 is amazing the graph should adapt to fit your data.
Only four weeks to go until our official launch date of 28th October. It feels like it’s been a long build up but we believe it will be worth the wait! In the meantime here’s a bit more information about the kind of things we do, why we are different and what motivates us. If you’re interested please do get in touch.
What do we do?
There’s a huge interest in data science at the moment. Businesses understandably want to be a part of it. Very often they assemble the ingredients (the software, the hardware, the team) but then find that progress is slow.
Coppelia is a catalyst in these situations. Rather than endless planning we get things going straight away with the build, learn and build again approach of agile design. Agile and analytics are a natural fit!
Projects might be anything from using machine learning to spot valuable patterns in purchase behaviour to building decision making tools loaded with artificial intelligence.
The point is that good solutions tend to be bespoke solutions.
While we build we make sure that in-house teams are heavily involved – trained on the job. We get them excited about the incredible tools that are out there and new ways of doing things. This solves the problem of finding people with the data science skill set. It’s easier to grow your technologists in-house.
The tools are also important. We give our clients a full view of what’s out there, focusing on open source and cloud based solutions. If a client wishes to move from SAS to R we train their analysts not just in R but in the fundamentals of software design so that they build solid, reliable models and tools.
We teach the shared conventions that link technologies together so that soon their team will be coding in python and building models on parallelised platforms. It’s an investment for the long term.
Finally we know how important it is for the rest of the business to understand and get involved with these projects. Visualisation is a powerful tool for this and we emphasize two aspects that are often forgotten: interactivity (even if it’s just the eye exploring detail) and aesthetics: a single beautiful chart telling a compelling story can be more influential than a hundred stakeholder meetings.
Why are we different?
One thing is that we prioritise skills over tools. There are a lot of people out there building tools but they tend to be about either preprocessing data or prediction and pattern detection for a handful of well defined cases. We love the tools but they don’t address the most difficult problem of how you turn the data into information that can be used in decision making. For that you need skilled analysts wielding tools. Creating the skills is a much harder problem.
Coppelia offers a wide range of courses, workshops and hackathons to kickstart your data science team. See our solutions section for a full description of what we offer.
Another difference is that we are statisticians who have been inspired by software design. We apply agile methods and modular design not just to the tools we build ourselves but also to traditional analytical tasks like building models.
Collaboration using tools like git and trello has revolutionised the way we work. Analysis is no longer a solitary task, it’s a group thing and that means we can take on bigger and more ambitious projects.
But what is most exciting for us is our zero overhead operating model and what it enables us to do. Ten years ago if we’d wanted to run big projects using the latest technology we’d have had to work for a large organisation. Now we can run entirely on open source.
For statistical analysis we have R, to source and wrangle data we have python, we can rent hardware by the hour from AWS and use it to parallelise jobs using hadoop.
Even non-technical tasks benefit in this way: marketing using social media, admin through google drive, training on MOOCs, design using inkscape and pixlr, accounting on quick file.
Without these extra costs hanging over us we are free to experiment, innovate, cross disciplines and work on topics that interest us, causes we like. Above all it gives us time to give back to the sources which have allowed us to work in this way: publishing our code, sharing insights through blogging, helping friends and running local projects
What are we excited about?
Anything where disciplines are crossed. We like to look at how statistics and machine learning can be combined with AI, music, graphic design, economics, physics, philosophy.
We are currently looking at how the problem solving frameworks in AI might be applied to decision making in marketing.
Bayesian statistics and simulation for problem solving.
What technology are we into?
It’s a long list but most prominent are R, Python, distributed machine learning (currently looking at Spark), and d3.js. Some current projects include a package to convert R output into d3 and AI-enhanced statistical modelling.
In my previous post (Part I) I went over the basic metacharacters and special signs in regex. In this second part I will be showing you how to simplify regular expressions.
Let’s get started
So far we have looked at expressions that always match the pattern to a single position in the text. Repetition metacharacters make an expression much more flexible by letting part of the pattern repeat a specified number of times:
* (preceding item zero or more times)
+ (preceding item one or more times)
? (preceding item zero or one time)
For example, suppose we want to match IP-address-like strings: groups of one to three digits separated by dots.
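A minimal Python sketch of such a pattern, built with the ? metacharacter; the exact expression and the sample text are illustrative assumptions:

```python
import re

# Each group is one required digit plus up to two optional ones, separated by literal dots
ip_like = re.compile(r"\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?")

print(bool(ip_like.search("Server at 192.168.0.1 responded")))   # True
```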
The . (dot) has to be escaped to be interpreted literally, and each set of digits is repeated a minimum of one and a maximum of three times. Shortly we will learn how to simplify this even further.
There are usually many ways to write a regex expression matching your needs and no one perfect way! As long as it does the job and you are sure it matches exactly what you want don’t worry too much about what it looks like!
Using ( ) around a group of characters enables repetition of that whole group; for example (la)+ matches “la”, “lala”, “lalala” and so on.
Don’t group inside character sets though: within [ ], the parentheses ( ) keep their literal meaning.
For example sea(?=shore) will match “sea” only if it is followed by “shore”.
Lookahead tends not to work very well in text editors.
Differences between programming languages
Here is a quick summary of the major differences between regex in different programming languages. It is not exhaustive, but it will give you an idea of the scope of the differences. Again, it is always good to let Google know what language you are working in while searching for regex solutions.
The features compared include: character classes (e.g. \d, \w); POSIX bracket expressions; the quantifiers + and ?; the anchors \A and \Z; the multiline modifier /m and the special commands for line breaks; and some features supported only in version 1.9 and above.
My next post is all about using regex in a real life example!
Here’s a chart I drew for myself to understand the relationships between chords in music theory. Doesn’t seem to have much to do with machine learning and statistics but in a way it does since I found it a lot easier to picture the chords existing in a sort of network space linked by similarity. Similarity here is defined as the removal or addition of a note, or the sliding of a note one semitone up or down. What’s wrong with me!
I was doing some simulation and I needed a distribution for the difference between two proportions. It’s not quite as straightforward as the difference between two normally distributed variables and since there wasn’t much online on the subject I thought it might be useful to share.
So we start with two independent binomial random variables,

$$X \sim \mathrm{Bin}(n_X, p_X) \quad \text{and} \quad Y \sim \mathrm{Bin}(n_Y, p_Y).$$

We are looking for the probability mass function of $Z = X - Y$.

First note that the support of $Z$ runs from $-n_Y$ to $n_X$, since that covers the most extreme cases ($X = 0$ and $Y = n_Y$) and ($X = n_X$ and $Y = 0$).

Then we need a modification of the binomial pmf so that it can cope with values outside of its support:

$$f(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k}$$

when $0 \le k \le n$, and $0$ otherwise.

Then we need to define two cases: $z \ge 0$ and $z < 0$.

In the first case

$$P(Z = z) = \sum_{k=0}^{n_Y} f(z + k;\, n_X, p_X)\, f(k;\, n_Y, p_Y)$$

since this covers all the ways in which $X - Y$ could equal $z$. For example $z = 1$ is reached when $X = 1$ and $Y = 0$, when $X = 2$ and $Y = 1$, when $X = 3$ and $Y = 2$, and so on. It also deals with combinations that could not happen because of the values of $n_X$ and $n_Y$: if, say, $n_X = 3$ then we cannot reach $Z = 1$ through $X = 4$ and $Y = 3$, and thanks to our modified binomial pmf that term's probability is zero.

For the second case we just reverse the roles:

$$P(Z = z) = \sum_{k=0}^{n_X} f(k;\, n_X, p_X)\, f(k - z;\, n_Y, p_Y).$$

For example $z = -1$ is reached when $X = 0$ and $Y = 1$, when $X = 1$ and $Y = 2$, and so on.
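A short sketch of this calculation in Python. Because scipy's binom.pmf already returns zero outside the support (playing the role of the modified pmf above), a single sum handles both cases of z.

```python
from scipy.stats import binom

def pmf_difference(z, n_x, p_x, n_y, p_y):
    """P(X - Y = z) for independent X ~ Bin(n_x, p_x) and Y ~ Bin(n_y, p_y)."""
    return sum(binom.pmf(z + k, n_x, p_x) * binom.pmf(k, n_y, p_y)
               for k in range(n_y + 1))

# The probabilities over the full support of Z sum to one:
print(sum(pmf_difference(z, 5, 0.4, 3, 0.6) for z in range(-3, 6)))   # ~1.0
```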