Bayesian Business Intelligence


A post by Duncan Stoddard. It first appeared, among lots of other great posts, at www.sparepiece.com.

Bayesian analysis is perfect in situations when data are scarce and experience and intuition are plentiful. We can combine available data with the collective intelligence of a team in the form of parameter priors in a Bayesian model to test hypotheses and inform strategic decision making.

Let’s imagine we want to build a model that explains the drivers of a binary dependent variable, e.g. ‘prospect buys or doesn’t buy product x’, in order to test hypotheses about what makes a good prospective client and to score prospective client lists. We’ve got some descriptive data on past acquisitions, e.g. prospect geography, age, industry and some behavioural data, e.g. ‘exposed to ad A’, ‘responded to email campaign B’ etc. We’ve also got the collective knowledge of a client services team. We’re going to combine these in a Bayesian logistic regression model using the MCMCpack package in R.

1. Generating coefficient priors

First we need to get each member of the team to (independently) place weights on a list of candidate explanatory variables. To keep this simple we’ll allow integer values on [-10,10] and then convert these to usable priors.

The coefficient in a logistic regression can be interpreted as the log of the ratio of the odds of y=1 to some base odds, which we can take to be the odds implied by the overall response rate. We can map our weights to a comparable value with the following logic:

Weight ω to probability p:

p = \frac{\omega+10}{20}

The odds ratio:

\frac{p}{1-p} = \frac{\omega+10}{10-\omega}

The base probability:

p_{0} = \frac{1}{n}\sum_{i}^{n}y_{i}

Weight to prior coefficient:

\beta = ln \left ( \frac{\frac{p}{1-p}}{\frac{p_{0}}{1-p_{0}}} \right ) = ln\left (\frac{(\omega+10)(1-p_{0})}{p_{0}(10-\omega)} \right)

Once we’ve applied this formula we can calculate prior mean and precision values. These will then be read into R and used as the b0 and B0 arguments of the MCMClogit function. The b0 argument is a vector of length k+1, where k is the number of variables in the model, and B0 is a diagonal square matrix (assuming independence between coefficient distributions).
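To make the mechanics concrete, here is a minimal R sketch (my own illustration, not code from the post) of turning the team's weights into the b0 and B0 arguments. The weights matrix, the variable names, and the aggregation choices (mean of the team's implied coefficients for the prior mean, the reciprocal of their variance for the precision, and a vague intercept prior) are all assumptions.

# Hypothetical weights: one row per team member, one column per candidate variable
weights <- matrix(sample(-9:9, 5 * 4, replace = TRUE), nrow = 5,
                  dimnames = list(NULL, c("geo", "age", "industry", "saw_ad_A")))
p0 <- 0.1   # overall response rate, i.e. mean(y) on the historical data

# Weight -> prior coefficient, following beta = ln((w + 10)(1 - p0) / (p0 (10 - w)))
weight_to_beta <- function(w, p0) log((w + 10) * (1 - p0) / (p0 * (10 - w)))

betas      <- apply(weights, 2, weight_to_beta, p0 = p0)
prior_mean <- c(intercept = log(p0 / (1 - p0)), colMeans(betas))  # b0, length k + 1
prior_prec <- diag(c(0.01, 1 / apply(betas, 2, var)))             # B0, diagonal precision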

2. Normalising the data

The weighting process implicitly assumes variable values are on the same scale. We therefore need to normalise the data to ensure this is so. We can do that simply with:

{x_i}'=\frac{x_i - min(x)}{max(x) - min(x)}

 

3. Running the model

The MCMClogit(formula, data, b0, B0) function gives posterior distributions for each beta, assuming a multivariate normal prior. We can use summary(posterior) to get the posterior mean and standard deviation values, and plot(posterior) to get distribution plots.
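For concreteness, here is a hedged sketch of steps 2 and 3 together. It assumes a data frame prospects with a binary outcome bought and the four hypothetical predictors used above, reuses the prior_mean and prior_prec objects from the earlier sketch, and the MCMC settings are arbitrary.

library(MCMCpack)

normalise <- function(x) (x - min(x)) / (max(x) - min(x))
vars <- c("geo", "age", "industry", "saw_ad_A")
prospects[vars] <- lapply(prospects[vars], normalise)   # step 2: min-max scaling

posterior <- MCMClogit(bought ~ geo + age + industry + saw_ad_A,
                       data   = prospects,
                       b0     = prior_mean,   # prior means, intercept first
                       B0     = prior_prec,   # prior precision matrix
                       burnin = 2000, mcmc = 20000)

summary(posterior)   # posterior means and standard deviations
plot(posterior)      # trace and density plots for each coefficient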

The data updates our prior knowledge and pulls our estimates towards a ‘truer’ representation of reality.

Our posterior means will serve two purposes:

a) parameterise a scoring equation for predicting response rates conditional on prospect characteristics.

b) provide insights on the characteristics of good prospective clients and test decision makers’ hypotheses.

 

4. Building a scoring equation

Summing the product of our posterior means and the normalised data gives the log-odds of a response. We can convert this into response probabilities like this:

ln\frac{s}{1-s} = \sum_{i}^{k}\hat{\beta}_i x_i

s = \frac{e^{\sum_{i}^{k}\hat{\beta}_i x_i}}{1+e^{\sum_{i}^{k}\hat{\beta}_i x_i}}

s represents the expected response probability given a vector of characteristics x. We can apply this to our prospect lists and rank to find the best targets for marketing.
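As a rough illustration (again a sketch of mine, assuming a hypothetical data frame new_prospects of normalised characteristics for unconverted leads):

beta_hat <- colMeans(posterior)   # posterior means, intercept first
X <- cbind(1, as.matrix(new_prospects[, c("geo", "age", "industry", "saw_ad_A")]))

log_odds <- as.vector(X %*% beta_hat)
score    <- exp(log_odds) / (1 + exp(log_odds))   # expected response probability s

best_targets <- new_prospects[order(score, decreasing = TRUE), ]   # rank for marketing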

 

5. Converting posterior means back to scores

Lastly, we want to convert our posterior mean estimates back to score integers on [-10,10] so we can report results back in a form the team is already familiar with. We can do this with the following:

\omega = \frac{10(q-1)}{q+1}, \quad q = e^{\hat{\beta}}\frac{p_{0}}{1-p_{0}}
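A small helper for this conversion might look like the following (a sketch, reusing the beta_hat and p0 objects assumed above):

beta_to_weight <- function(beta, p0) {
  q <- exp(beta) * p0 / (1 - p0)   # the implied odds p / (1 - p)
  10 * (q - 1) / (q + 1)           # map the odds back onto [-10, 10]
}
round(beta_to_weight(beta_hat[-1], p0), 1)   # drop the intercept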

Do politicians sound the same? An investigation using Latent Semantic Analysis


By Richard Brooker


“Politicians all sound the same and everyone is moving to the centre…”

You must have heard this a lot over the last few months. But is it fair? What evidence is there to support it?

Introduction

There is a theory in economics called “The Median Voter Theorem” that provides some justification for the above statement.

The Median Voter Theorem: “a majority rule voting system will select the outcome most preferred by the median voter” [1]

It hypothesises that the frequency of inclinations, on a given political spectrum, is likely to follow a bell curve. Thus parties will eventually move to the median in order to capture the most votes.


So for this election, I decided to have a look at what effect this phenomenon is having on British politics and the speeches given in the House of Commons.

The Data

Hansard is the name for the Transcripts of Parliamentary Debates in Britain. Records going back to 1935 can be downloaded from theyworkforyou.com in .xml format. They are published daily and record every speech along with who gave it and their party.


In this post I look at how easy it is to distinguish a party by what they are saying.

More specifically, I built a statistical model to work out the probability that a speech came from a particular party based on the words and phrases used in the speech.

Feature Construction

The input variables are contained in a sparse TF-IDF matrix constructed using Scikit-Learn's TfidfVectorizer in python. The rows represent speeches and there is a column for every n-gram (up to length 3) in the transcripts.

An n-gram ('phrase'/'term') is any sequence of n consecutive words.

The values in the matrix are the term frequency–inverse document frequency (TF-IDF) weights. For an n-gram i and speech j,

tfidf_{i,j} = tf_{i,j} \times \log \frac{N}{df_{i}}

where tf_{i,j} is the number of times n-gram i appears in speech j, df_{i} is the number of speeches containing i, and N is the total number of speeches.

The TF-IDF is a numerical statistic intended to reflect how important an n-gram (word/'phrase') is to a document (in our case a speech).
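To make the statistic concrete, here is a toy illustration in R. It is not the scikit-learn pipeline used for the analysis, it uses the plain log(N/df) weighting rather than TfidfVectorizer's smoothed and normalised variant, and the three 'speeches' are invented.

speeches <- c(lab = "we will protect the national health service",
              con = "we will cut the deficit and cut taxes",
              lib = "we will protect civil liberties")

tokens <- strsplit(speeches, " ")
vocab  <- sort(unique(unlist(tokens)))

tf  <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))  # term counts
df  <- colSums(tf > 0)              # number of speeches containing each term
idf <- log(length(speeches) / df)   # inverse document frequency
tfidf <- sweep(tf, 2, idf, "*")     # one row per speech, one column per term

round(tfidf[, c("cut", "protect", "we")], 2)  # 'cut' is distinctive; 'we' scores zero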

Feature Selection

I carried out some light feature selection. I calculated the average uplift in probability by year. The uplift is simply how much more likely a term is to appear given that you know the party the speech belongs to. I then filtered out features whose uplift was within two decimal places of 1.

Averaging by year does a couple of things. Like bootstrapping, it provides confidence in the statistic. It also helps to filter out terms whose affinity flips across time (e.g. job titles).

Here are some of the most predictive features for each of the parties. Hover to see the n-gram. The y-axis shows the percentage increase in the probability of seeing the n-gram for a given party. The x-axis is ordered by this value.

The graph shows Conservatives by default, but you can change the y and x labels to look at affinities for other parties, and the party that you order by.

Sparse Matrix Representation

The backbone of my approach is the sparse matrix representation (scipy.sparse is one such type). This is a data model that takes advantage of the large number of zeros in our feature matrix.

Usually a table/matrix is stored as a two-dimensional array, which means the memory requirements are proportional to M×N (where M and N are the matrix dimensions). A sparse representation, on the other hand, can cut the memory requirements drastically by storing only the non-zero values, typically as three arrays recording the values and their row and column positions.

This meant that my feature matrix could be stored in memory at a modest 4.5GB (rather than 58GB).
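The saving is easy to see with a small experiment. The sketch below uses R's Matrix package rather than scipy.sparse, but the idea is identical, and the dimensions and sparsity here are made up.

library(Matrix)

set.seed(1)
dense <- matrix(0, nrow = 10000, ncol = 5000)
nz    <- sample(length(dense), 50000)   # roughly 0.1% of entries are non-zero
dense[nz] <- runif(length(nz))

sparse <- Matrix(dense, sparse = TRUE)  # compressed sparse column format

object.size(dense)    # about 10000 * 5000 * 8 bytes, i.e. roughly 400 MB
object.size(sparse)   # only the non-zero values plus their indices, well under 1 MB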

Modeling

Now that we have our sparse representation we can leverage some clever techniques to fit a classifier efficiently.

We use Scikit-Learn's SGDClassifier to fit a linear classifier.

The model is fit by minimising the objective

E(w,b) = \frac{1}{n}\sum_{i=1}^{n} L\left(y_i, w^{T}x_i + b\right) + \alpha R(w)

where L is the loss function (hinge, logistic, least-squares, epsilon-insensitive, etc.) and R is the regularisation term used to prevent overfitting (L1, L2 or elastic net).

The classifier iterates through each speech in our sparse matrix and updates the parameters using an approximation of the gradient of E, considering one training point at a time:

w \leftarrow w - \eta\left(\alpha\frac{\partial R(w)}{\partial w} + \frac{\partial L(y_i, w^{T}x_i + b)}{\partial w}\right)

where \eta is the learning rate; the intercept is updated similarly. We can get near-optimal results using very few passes over the data.
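The update is easier to see in a stripped-down form. The sketch below is generic stochastic gradient descent for logistic loss with L2 regularisation written from scratch in R; it illustrates the idea rather than reproducing scikit-learn's implementation, and all parameter values are arbitrary.

sgd_logistic <- function(X, y, alpha = 1e-4, eta = 0.1, passes = 5) {
  w <- rep(0, ncol(X)); b <- 0
  for (pass in seq_len(passes)) {
    for (i in sample(nrow(X))) {          # one training point at a time, random order
      p    <- 1 / (1 + exp(-(sum(X[i, ] * w) + b)))
      grad <- (p - y[i]) * X[i, ]         # gradient of the logistic loss
      w <- w - eta * (alpha * w + grad)   # regularisation term plus loss term
      b <- b - eta * (p - y[i])           # the intercept is updated similarly
    }
  }
  list(w = w, b = b)
}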

I use a grid search to find the best loss function, regularisation, learning rate and number of iterations. Usually at this point you would evaluate each of the models on a probe set, select the model with the best parameters, then assess your success on a test set.

However, why throw away a bunch of otherwise good models?

There are lots of reasons you might (time, etc), however I wanted to do something a little more interesting.

Instead of finding the best linear model I ensemble them using a random forest. I built a smaller, dense feature matrix consisting of the scores from each of the models. This takes me from some 100,000 TF-IDF features to 200 dense ones. I can now train a random forest on my probe set. I got a 10% improvement on my best single model (when evaluated on the test set).

The Results

I now have a probability score for each of the speeches for each of the parties.  Finally I averaged the results by year and party to give the chart below.

The bubbles are the actual parties and the axes show the average probability of their speeches belonging to whichever parties are selected as the axes. It makes sense to start off with the two major parties on the axes but you can change these if you want. When the political speeches are very different the bubbles will of course be further apart. You’ll see for example that before 2010 the Labour speeches were much more predictive of being Labour and likewise for the Conservatives but post 2010 this relationship weakens and the bubbles move to the centre. Move the slider to see these changes over time.


As you move into 2010 the parties move to the centre and become harder to distinguish.

 

 

Modelling for decisions


Here’s the deck I presented today at the Predictive Analytics Innovation conference in London.

The idea was to look at ways in which we might use modern statistical methods to help build models that are grounded firmly on common sense intuitions and to do it all in a very short time.

If you are interested in more details just let me know on Twitter or drop me an email.

Or you can download the presentation from here.

Using D3 to show cost, revenue and ROI


Return on investment is often measured as the gain from an investment divided by the cost of the investment, sometimes expressed as a percentage. For example if a marketing campaign cost £10K but brought in £20K of revenue on top of the usual sales then the ROI is 200%.

(Note ROI is arguably more properly defined as (gain – cost)/cost but I’ve found that most of the people and industries that I’ve worked with slip naturally into the first definition: gain/cost. In any case both definitions capture the same idea. Thanks to Eduardo Salazar for pointing this out.)

Now if you are just given the ROI you’ll find you are missing any idea of scale. The same ROI could be achieved with a revenue gain of £200 or with one of £200 million. So it would be nice to see cost, revenue and ROI visualised all in one go. There are a few ways to do this, but after playing around I came up with the following representation, which I personally like best. It’s a simple scatterplot of cost against revenue, but since all points on a straight line radiating from the origin have the same ROI it’s easy to overlay that information. If r is the ROI then the angle of the corresponding spoke is arctan(r).
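The interactive chart below is built in D3, but the idea is simple enough to sketch in a few lines of base R (the campaign figures here are invented):

campaigns <- data.frame(name = c("A", "B", "C"),
                        cost = c(10, 40, 25), revenue = c(20, 30, 75))  # in £K

plot(campaigns$cost, campaigns$revenue, pch = 19, asp = 1,
     xlim = c(0, 80), ylim = c(0, 80), xlab = "Cost (£K)", ylab = "Revenue (£K)")
text(campaigns$cost, campaigns$revenue, campaigns$name, pos = 4)

for (roi in c(0.5, 1, 2, 3)) {
  abline(0, roi, col = "grey")            # slope = ROI, so the spoke's angle is arctan(ROI)
  x_lab <- min(75, 75 / roi)
  text(x_lab, x_lab * roi, paste0(roi * 100, "%"), col = "grey")
}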

Note you can drag about the labels. That’s my preferred solution for messy scatterplot labelling.


Hopefully it’s obvious that the idea is to read off the ROI from the position of the point relative to the spokes. The further out toward the circumference the greater the scale of the success or the disaster, depending on the ROI.

To modify the graph for your own purposes just take the code from here and substitute in your data where var dataset is defined. You can change which ROIs are represented by altering the values in the roi array. If you save the code as html and open in a browser you should see the graph. Because d3 is amazing the graph should adapt to fit your data.

You can also find the code here as a JSFiddle.

Thanks to Paul McAvoy for posing the problem and for all the other interesting things he’s shown me!

Four weeks to launch!

Only four weeks to go until our official launch date of 28th October. It feels like it’s been a long build-up but we believe it will be worth the wait! In the meantime here’s a bit more information about the kind of things we do, why we are different and what motivates us. If you’re interested please do get in touch.

 

What do we do?

 

There’s a huge interest in data science at the moment. Businesses understandably want to be a part of it. Very often they assemble the ingredients (the software, the hardware, the team) but then find that progress is slow.

Coppelia is a catalyst in these situations. Rather than endless planning we get things going straight away with the build, learn and build again approach of agile design. Agile and analytics are a natural fit!

Projects might be anything from using machine learning to spot valuable patterns in purchase behaviour to building decision making tools loaded with artificial intelligence.

The point is that good solutions tend to be bespoke solutions.

While we build we make sure that in-house teams are heavily involved – trained on the job. We get them excited about the incredible tools that are out there and new ways of doing things. This solves the problem of finding people with the data science skill set. It’s easier to grow your technologists in-house.

The tools are also important. We give our clients a full view of what’s out there, focusing on open source and cloud based solutions. If a client wishes to move from SAS to R we train their analysts not just in R but in the fundamentals of software design so that they build solid, reliable models and tools.

We teach the shared conventions that link technologies together so that soon their team will be coding in python and building models on parallelised platforms. It’s an investment for the long term.

Finally we know how important it is for the rest of the business to understand and get involved with these projects. Visualisation is a powerful tool for this and we emphasize two aspects that are often forgotten: interactivity (even if it’s just the eye exploring detail) and aesthetics: a single beautiful chart telling a compelling story can be more influential than a hundred stakeholder meetings.

 

Why are we different?

 

One thing is that we prioritise skills over tools. There are a lot of people out there building tools but they tend to be about either preprocessing data or prediction and pattern detection for a handful of well defined cases. We love the tools but they don’t address the most difficult problem of how you turn the data into information that can be used in decision making. For that you need skilled analysts wielding tools. Creating the skills is a much harder problem.

Coppelia offers a wide range of courses, workshops and hackathons to kickstart your data science team. See our solutions section for a full description of what we offer.

Another difference is that we are statisticians who have been inspired by software design. We apply agile methods and modular design not just to the tools we build ourselves but also to traditional analytical tasks like building models.

Collaboration using tools like git and trello has revolutionised the way we work. Analysis is no longer a solitary task, it’s a group thing and that means we can take on bigger and more ambitious projects.

But what is most exciting for us is our zero overhead operating model and what it enables us to do. Ten years ago if we’d wanted to run big projects using the latest technology we’d have had to work for a large organisation. Now we can run entirely on open source.

For statistical analysis we have R, to source and wrangle data we have python, we can rent hardware by the hour from AWS and use it to parallelise jobs using hadoop.

Even non-technical tasks benefit in this way: marketing using social media, admin through google drive, training on MOOCs, design using inkscape and pixlr, accounting on quick file.

Without these extra costs hanging over us we are free to experiment, innovate, cross disciplines and work on topics that interest us and causes we like. Above all it gives us time to give back to the sources which have allowed us to work in this way: publishing our code, sharing insights through blogging, helping friends and running local projects.

 

What are we excited about?

 

Anything where disciplines are crossed. We like to look at how statistics and machine learning can be combined with AI, music, graphic design, economics, physics and philosophy. We are currently looking at how the problem-solving frameworks in AI might be applied to decision making in marketing.

Bayesian statistics and simulation for problem solving always seem to be rich sources of ideas. We’re also interested in how browser technology allows greater scope for communication. We blog in a potent mixture of text, html, markdown and javascript.

 

What technology are we into?

 

It’s a long list but most prominent are R, python, distributed machine learning (currently looking at Spark), and d3.js. Some current projects include a package to convert R output into d3 and AI-enhanced statistical modelling.

Quick start regex for analysts: Part II


In my previous post (Part I) I went over the basic metacharacters and special signs in regex. In this second part I will be showing you how to simplify regular expressions.

Let’s get started

Repetition metacharacters

So far, we looked at expressions that always match the pattern to a single position in the text. Repetition metacharacters make the expression much more flexible by expanding the pattern to a specified number of characters.

* (preceding item zero or more times)
+ (preceding item one or more times)
? (preceding item zero or one time)

For example (click on the code to see how it works):

apples*

will match the word with no “s” as well as one or many “s”. But in:

apples+

the ‘s’ must be there so it will match words with at least one “s”.

Whereas:

apples?

the ‘s’ doesn’t have to be there, but it can’t be repeated.

Quantified repetition

This works similarly to the repetition metacharacters. The difference is that we can specify the exact number of repetitions of the preceding item.

{ - start of quantified repetition
} – end of the repetition

Syntax:

{min,max} (min and max are positive numbers)

Tip

min must always be there, even if it is 0. max and the comma are optional.

In our previous post we had an example where we wanted to match only the year. We can now do the following:

\d{4}

This represents a digit repeated exactly 4 times and can be used instead of typing \d\d\d\d.

Similarly we find:

\w{5,10} - a word of minimum 5 and maximum 10 letters
\w{5,} - a word of minimum 5 letters
\w{5} - a word of exactly 5 letters

Let’s say we are trying to pick out IP addresses from the text. An IP address is a sequence of 4 sets of 1 – 3 digits separated by dots. It can be easily expressed by:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The . (dot) had to be escaped to be interpreted literally, and each set of digits is repeated a minimum of one and a maximum of three times. Shortly we will learn how to simplify it even further.

Tip

There are usually many ways to write a regex expression matching your needs and no one perfect way! As long as it does the job and you are sure it matches exactly what you want don’t worry too much about what it looks like!

Grouping metacharacters

Using () around a group of characters enables repetition of that group.

Tip

Don’t group inside character sets (within []), as () there has its literal meaning.

(what)+

will match “what” one or more times.

And to match words with or without the prefix we use:

(in)?dependent

Coming back to the IP address example, we can group each set of 1-3 digits together with an optional dot and then repeat the group 4 times.

(\d{1,3}\.?){4}

This will fully match any IP address.
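As a quick sanity check, here is how the two patterns above behave in R (an illustration of my own; note the doubled backslashes that R string literals require):

text <- "Requests came from 192.168.0.1 and 10.0.42.255 during the outage."

regmatches(text, gregexpr("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}", text))
# "192.168.0.1" "10.0.42.255"

grepl("(\\d{1,3}\\.?){4}", text)   # the grouped version of the pattern
# TRUE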

Alternation metacharacters

A common way of dealing with incorrect spellings is to use the OR character and group the two (or more) alternatives. This way we don’t need to repeat the [] sets.

| - (previous OR next expression)

Take for example the commonly misspelled word:

w(ei|ie)rd

Anchors

Anchors signify the position of the pattern in the text. Note that this is a second meaning of the caret (it is also a negation, check it out here).

^ (start of string/line)
$ (end of string or line)
\A (start of string only, never the start of a line)
\Z (end of string only, never the end of a line)

For example:

^apple

will match only ‘apple’ at the beginning of the line.

Lookaround assertions

Bear in mind that these expressions differ significantly in the different variants of regex.

?= (positive lookahead: assertion of what ought to be ahead)
?! (negative lookahead)
?<= (positive lookbehind: what ought to be behind)
?<! (negative lookbehind)

For example:

(?=seashore)sea

Will match “sea” only if it is followed by “shore”.

Tip

Look behind can’t be used with repetitions or optional expressions. It also doesn’t work in JavaScript (hence can’t be tested in regexpal).

Tip

It tends not to work very well in text editors.

Differences between programming languages

Here is a quick summary of the major differences between regex in different programming languages. It is not exhaustive, but will give you the idea of the scope of differences. Again, it is always good to let Google know what language you are working in while searching for regex solutions.

Regex feature                   | Ruby            | Java | Perl | Python/R | Unix | JavaScript | PHP | .NET
Character classes (e.g. \d, \w) | Yes             | No   | Yes  | Yes      | No   | Yes        | Yes | Yes
POSIX bracket expressions       | Yes             | No   | Yes  | No       | Yes  | No         | Yes | No
Quantifier: *                   | Yes             | Yes  | Yes  | Yes      | Yes  | Yes        | Yes | Yes
Quantifiers: + and ?            | Yes             | Yes  | Yes  | Yes      | No   | Yes        | Yes | Yes
Anchors: \A and \Z              | Yes             | Yes  | Yes  | Yes      | No   | No         | Yes | Yes
Line break: /m                  | Yes             | No   | Yes  | No       | Yes  | Yes        | Yes | No
Special command for line break  | No              | Yes  | No   | Yes      | No   | No         | No  | Yes
Lookaround assertions           | Only 1.9 and up | Yes  | Yes  | Yes      | No   | No         | Yes | Yes

My next post is all about using regex in a real life example!

The local neighbourhood of C Major


Here’s a chart I drew for myself to understand the relationships between chords in music theory. Doesn’t seem to have much to do with machine learning and statistics but in a way it does since I found it a lot easier to picture the chords existing in a sort of network space linked by similarity. Similarity here is defined as the removal or addition of a note, or the sliding of a note one semitone up or down. What’s wrong with me!

The neighbourhood of C-Major


Distribution for the difference between two binomially distributed random variables


I was doing some simulation and I needed a distribution for the difference between two proportions. It’s not quite as straightforward as the difference between two normally distributed variables and since there wasn’t much online on the subject I thought it might be useful to share.

So we start with

X \sim Bin(n_1, p_1)

Y \sim Bin(n_2, p_2)

We are looking for the probability mass function of Z=X-Y

First note that the min and max of the support of Z must be (-n_2, n_1) since that covers the most extreme cases (X=0 and Y=n_2 ) and (X=n_1 and Y=0 ).

Then we need a modification of the binomial pmf so that it can cope with values outside of its support.

m(k, n, p) = \binom {n} {k} p^k (1-p)^{n-k} when 0 \leq k \leq n and 0 otherwise.

Then we need to define two cases

1. Z \geq 0
2. Z < 0

In the first case

p(z) = \sum_{i=0}^{n_1} m(i+z, n_1, p_1) m(i, n_2, p_2)

since this covers all the ways in which X-Y could equal z. For example, when z=1 this is reached when X=1 and Y=0, X=2 and Y=1, X=3 and Y=2, and so on. It also deals with cases that could not happen because of the values of n_1 and n_2 . For example, if n_2 = 4 then we cannot get Z=1 as the combination of X=6 and Y=5. In this case, thanks to our modified binomial pmf, the probability is zero.

For the second case we just reverse the roles. For example if z=-1 then this is reached when X=0 and Y=1, X=1 and Y=2 etc.

p(z) = \sum_{i=0}^{n_2} m(i, n_1, p_1) \, m(i-z, n_2, p_2)

Put them together and that’s your pmf.

p(z) = \begin{cases} \sum_{i=0}^{n_1} m(i+z, n_1, p_1)\, m(i, n_2, p_2) & z \geq 0 \\ \sum_{i=0}^{n_2} m(i, n_1, p_1)\, m(i-z, n_2, p_2) & z < 0 \end{cases}

Here’s the function in R and a simulation to check it’s right (and it does work.)
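A reconstruction from the formulas above might look like this (my own sketch; the parameter choices in the check are arbitrary):

m <- function(k, n, p) ifelse(k >= 0 & k <= n, dbinom(k, n, p), 0)  # modified binomial pmf

diff_pmf <- function(z, n1, p1, n2, p2) {
  if (z >= 0) sum(m((0:n1) + z, n1, p1) * m(0:n1, n2, p2))
  else        sum(m(0:n2, n1, p1) * m((0:n2) - z, n2, p2))
}

# Compare the exact pmf with a quick simulation
n1 <- 10; p1 <- 0.3; n2 <- 4; p2 <- 0.6
z_vals <- -n2:n1
exact <- sapply(z_vals, diff_pmf, n1, p1, n2, p2)
sim   <- table(factor(rbinom(1e5, n1, p1) - rbinom(1e5, n2, p2), levels = z_vals)) / 1e5
round(cbind(z = z_vals, exact = exact, simulated = as.numeric(sim)), 3)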

Excuses and Opportunities


Regular readers of the blog will have noticed that I haven’t been a regular contributor to the blog over the last year. There are some good reasons/excuses for that, predominantly around buying and renovating our new home, getting married and starting a new job at Facebook. Having an Arsenal season ticket also doesn’t help.

I am hoping that, now that most of the renovations are done, I’m five months into married life and 13 months into the new job, I’ll have time for some more writing, so watch this space.

In the meantime, there is an awesome role for a senior researcher available in my team, working to understand and quantify advertising effectiveness. The full description is available here and it’s based in either New York or California.

Needless to say that Facebook is an incredible place to work, with more data than you could ever hope for and lots of interesting, yet unanswered, questions to ask of the data.

Within the team, you’ll be working with other researchers who have a strong background in data science and research more widely, including Eurry Kim and Dan Chapsky who both recently joined the team.

If you’re interested in the role and think you’d be a good fit, then please submit your CV/Resume or alternatively, if you’ve got any questions, please reach out using the form below!


Thorstein Veblen and Hard Coding


It is still quite common to hear the career progression of an analyst described as one upwards from the hard graft of coding and “getting your hands dirty” towards the enviable heights of people management and strategic thinking. Whenever I hear this it reminds me of the book Conspicuous Consumption by the American economist Thorstein Veblen. It examines, in a very hypothetical way, the roots of economic behaviour in some of our basic social needs: to impress others, to dominate and to demonstrate status. It’s not a happy book.

His principal concept is the distinction between exploit and drudgery.

 

The institution of a leisure class is the outgrowth of an early discrimination between employments, according to which some employments are worthy and others unworthy. Under this ancient distinction the worthy employments are those which may be classed as exploit; unworthy are those necessary employments into which no appreciable element of exploit enters.

He sees this division arising early in history as honour and status are assigned to those who are successful in making others do what they want and “at the same time, employment in industry becomes correspondingly odious, and, in the common-sense apprehension, the handling of the tools and implements of industry falls beneath the dignity of able-bodied men. Labour becomes irksome.”

We’re no longer hairy barbarians using the vanquished for foot-stools but, as Veblen points out, the distinction persists no matter how illogical and unprofitable. Coding is still somehow viewed as necessarily inferior to the boardroom meeting even if it is the genius piece of code that makes or breaks a business.

Attitudes are changing, but still so slowly that there is a desperate shortage of people with both the experience and the hands-on skills to get things done.

So if you are a good analyst, and you love what you do, then this is my advice to you: when they come to lure you from your lovely pristine scripts, resist. Stay where you are. The 21st century is going to need you!
