Split down the middle – Using the polls to forecast the likelihood of Brexit


Pollsters have come in for a lot of stick since the last election. How did they get it so wrong? Was it that their sample sizes were too small, that their interviewing methods were biased, or their re-weighting wonky? Maybe, more simply, people just don’t tell them the truth about how they will vote – or don’t even know themselves. Of course, it’s not just the pollsters that can confuse people. Through their scientific fog seeps the spin – the latest poll, the trend, the momentum, the editor’s desire to influence.

But, if we are trying to figure out whether the UK will ‘Brexit’ this week, the trouble is that all we have is the polls. So how do we squeeze as much useful information out of them as possible?

Well one way is to use Bayesian methods.

The headline polls have shown “Brexit” steadily erode the “Remain” lead, particularly since the campaign began in earnest. The chart shows the “Remain” lead over the last two and a half years with larger dots representing more precise estimates.


Source: Financial Times

But is this a trend or momentum? How can we aggregate polls from different companies, using different methods and sample sizes, to make a coherent forecast for 23rd June?

One way is to use a Bayesian state space model (see Simon Jackman’s papers at http://jackman.stanford.edu/papers/). The idea is that there is an underlying, latent public opinion that is sampled by the published polls. These polls differ not only in what they tell us about this latent preference, but also in their quality (e.g. their sample size and methodology). So the model is weighted to take greater account of the more reliable polls. The underlying preference evolves over time.

More specifically:

  • \(\mu_{t} \sim N(\mu_{t-1},\sigma)\) The latent public opinion, expressed as the lead for ‘Remain’, follows a normally distributed random walk, where \(\sigma\) is estimated from the data. The underlying preference is estimated for each day since 1st Jan 2014.
  • \(lead_{t,i} \sim N(\mu_{t} + pollster_{i}, \sigma_{lead,i,t})\) The model is fitted against the polled lead for ‘Remain’ at time t for pollster i, so the bias \(pollster_{i}\) for each polling company is estimated alongside the latent opinion.
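To make the setup concrete, here is a sketch of the data-generating process the model assumes – in Python rather than Stan, and with the volatility, house effects and poll noise made up for illustration rather than taken from the fitted model:

```python
import random

random.seed(1)

# Illustrative values only -- in the real model these are estimated from the polls.
sigma_walk = 0.5                          # daily volatility of the latent lead (assumed)
pollster_bias = {"A": 1.5, "B": -2.0}     # hypothetical house effects (pct pts)
sigma_poll = {"A": 3.0, "B": 2.5}         # hypothetical poll noise, driven by sample size

# The latent 'Remain' lead evolves as a Gaussian random walk, one step per day.
mu = [5.0]                                # assumed starting lead, in percentage points
for t in range(1, 100):
    mu.append(random.gauss(mu[t - 1], sigma_walk))

# A published poll is the latent lead plus the pollster's bias plus sampling noise.
def simulate_poll(t, pollster):
    return random.gauss(mu[t] + pollster_bias[pollster], sigma_poll[pollster])

# Two hypothetical pollsters, each publishing weekly
polls = [(t, p, simulate_poll(t, p)) for t in range(0, 100, 7) for p in ("A", "B")]
```

Estimation runs this generative story in reverse: given only the noisy, biased polls, Stan infers the daily path of the latent lead and the pollster bias terms jointly.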

The model is estimated using Stan and the code is available on GitHub here.

The chart shows how the model can unpick the underlying lead, day by day, that is most consistent with the polls. There is still a lot of daily volatility, reflecting how close the race is, how different the polls have been, and that this is a model of the ‘lead’ (which is twice as volatile as the share of the vote).

Estimate of the underlying trend


There are systematic ‘biases’ from each pollster’s methodology toward either ‘Brexit’ or ‘Remain’. For example, IPSOS MORI is biased toward ‘Remain’ and YouGov toward ‘Brexit’. This might be down to the sampling method, the re-weighting calculations etc. It is one of the reasons it is hard to just eyeball the raw poll averages: a trend can appear simply because of a sequence of pollsters.

Estimate of ‘Pollster Bias’


Putting this together we can forecast the last few days of the campaign. Of course, this requires the assumption that ‘everything else remains the same’ – which is the most improbable thing of all!

Prediction of Remain Lead


So this simple model reckons that, as of Sunday 19th June, there is a 66% chance of a ‘Remain’ victory on 23rd June, with an average lead of 3 percentage points.
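That win probability drops straight out of the random walk: simulate the remaining days forward from the latest estimated lead many times and count the share of paths that finish above zero. A minimal Monte Carlo sketch, where the current lead, the daily volatility and the days remaining are stand-in values rather than the fitted ones:

```python
import random

random.seed(42)

last_lead = 3.0     # assumed posterior mean lead for 'Remain', in pct pts
sigma_walk = 1.0    # assumed daily volatility of the latent lead
days_left = 4       # e.g. Sunday 19th June to Thursday 23rd June
n_sims = 100_000

remain_wins = 0
for _ in range(n_sims):
    lead = last_lead
    for _ in range(days_left):
        lead = random.gauss(lead, sigma_walk)   # one more step of the walk
    if lead > 0:
        remain_wins += 1

p_remain = remain_wins / n_sims   # estimated probability of a 'Remain' victory
```

With these stand-in numbers the probability comes out rather higher than 66%; the point is the mechanism, not the value.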

Prediction of Remain Lead


But don’t rush to the conclusion that this model is secretly receiving EU funding. The same model a week ago was predicting the exact opposite result: a 60% chance of ‘Brexit’.

When the country is on a knife-edge, small shifts in information make a big difference. Half a week is a long time in politics.

Bayesian Business Intelligence


A post by Duncan Stoddard. It first appeared among lots of other great posts at www.sparepiece.com

Bayesian analysis is perfect for situations where data are scarce but experience and intuition are plentiful. We can combine the available data with the collective intelligence of a team, encoded as parameter priors in a Bayesian model, to test hypotheses and inform strategic decision making.

Let’s imagine we want to build a model that explains the drivers of a binary dependent variable, e.g. ‘prospect buys or doesn’t buy product x’, in order to test hypotheses about what makes a good prospective client and to score prospective client lists. We’ve got some descriptive data on past acquisitions, e.g. prospect geography, age, industry and some behavioural data, e.g. ‘exposed to ad A’, ‘responded to email campaign B’ etc. We’ve also got the collective knowledge of a client services team. We’re going to combine these in a Bayesian logistic regression model using the MCMCpack package in R.

1. Generating coefficient priors

First we need to get each member of the team to (independently) place weights on a list of candidate explanatory variables. To keep this simple we’ll allow integer values on [-10,10] and then convert these to usable priors.

The coefficient in a logistic regression can be interpreted as the log of the ratio of the odds of y=1 to some base odds, which we can take from the overall response rate. We can map our weights to a comparable value with the following logic:

Weight \(\omega\) to probability p:

p = \frac{\omega+10}{20}

The odds ratio:

\frac{p}{1-p} = \frac{\omega+10}{10-\omega}

The base probability:

p_{0} = \frac{1}{n}\sum_{i}^{n}y_{i}

Weight to prior coefficient:

\beta = ln \left ( \frac{\frac{p}{1-p}}{\frac{p_{0}}{1-p_{0}}} \right ) = ln\left (\frac{(\omega+10)(1-p_{0})}{p_{0}(10-\omega)} \right)

Once we’ve applied this formula we can calculate prior mean and precision values. These will then be read into R and used as the b0 and B0 arguments of the MCMClogit function. The b0 argument is a vector of length k+1, where k is the number of variables in the model, and B0 is a diagonal square matrix (assuming independence between coefficient distributions).
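As a sketch of that conversion step (the helper names are mine, the formula is the one above; Python is used here for the pre-processing even though the model itself runs in R):

```python
import math

def weight_to_beta(w, p0):
    """Map a scorer's weight w in [-10, 10] to a prior coefficient:
    the log of the implied odds relative to the base odds p0/(1-p0)."""
    return math.log((w + 10) * (1 - p0) / (p0 * (10 - w)))

def priors_from_weights(weights, p0):
    """Prior mean and precision for one variable from several scorers' weights."""
    betas = [weight_to_beta(w, p0) for w in weights]
    mean = sum(betas) / len(betas)
    var = sum((b - mean) ** 2 for b in betas) / (len(betas) - 1)
    return mean, 1.0 / var   # MCMClogit's B0 expects a precision, not a variance

# e.g. three scorers rate 'exposed to ad A' at 4, 6 and 5, with a 10% base rate
b0_entry, B0_entry = priors_from_weights([4, 6, 5], p0=0.10)
```

Scorers who agree closely produce a tight prior (high precision), so the data has to work harder to move the posterior away from the team's view.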

2. Normalising the data

The weighting process implicitly assumes variable values are on the same scale. We therefore need to normalise the data to ensure this is so. We can do that simply with:

{x_i}'=\frac{x_i - min(x)}{max(x) - min(x)}
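In code this is a one-liner per column; a throwaway helper of my own, applied column by column:

```python
def min_max_normalise(xs):
    """Scale a list of values onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]
```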


3. Running the model

The MCMClogit(formula, data, b0, B0) function gives posterior distributions for each beta, assuming a multivariate normal prior. We can use summary(posterior) to get the posterior mean and standard deviation values, and plot(posterior) to get distribution plots.

The data updates our prior knowledge and pulls our estimates towards a ‘truer’ representation of reality.

Our posterior means will serve two purposes:

a) parameterise a scoring equation for predicting response rates conditional on prospect characteristics.

b) provide insights on the characteristics of good prospective clients and test decision makers’ hypotheses.


4. Building a scoring equation

Summing the product of our posterior means and normalised data will give the log of the odds ratio. We can convert this into response probabilities like this:

ln\frac{s}{1-s} = \sum_{i}^{k}\hat{\beta }_ix_i

s = \frac{e^{\sum_{i}^{k}\hat{\beta}_ix_i}}{1+e^{\sum_{i}^{k}\hat{\beta}_ix_i}}

s represents the expected response probability given a vector of characteristics x. We can apply this to our prospect lists and rank to find the best targets for marketing.
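A sketch of the scoring step; the coefficient values and prospect data here are invented for illustration:

```python
import math

def score(beta_hat, x):
    """Expected response probability for a prospect with normalised
    characteristics x, given posterior mean coefficients beta_hat."""
    log_odds = sum(b * xi for b, xi in zip(beta_hat, x))
    return math.exp(log_odds) / (1 + math.exp(log_odds))

beta_hat = [0.8, -0.3, 1.2]   # illustrative posterior means
prospects = {"p1": [1.0, 0.2, 0.0], "p2": [0.5, 0.9, 1.0]}

# Rank prospects by predicted response probability, best targets first
ranked = sorted(prospects, key=lambda p: score(beta_hat, prospects[p]), reverse=True)
```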


5. Converting posterior means back to scores

Lastly, we want to convert our posterior mean estimates back to score integers on [-10,10] so we can report the results back to the team in a form they’re already familiar with. Inverting the weight-to-coefficient mapping gives:

\omega = \frac{10\left(e^{\hat{\beta}}\frac{p_{0}}{1-p_{0}}-1\right)}{e^{\hat{\beta}}\frac{p_{0}}{1-p_{0}}+1}
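A quick round trip checks the algebra: converting a weight to a coefficient and back (including the base-odds term) should recover the original score exactly. In Python, with hypothetical function names:

```python
import math

def weight_to_beta(w, p0):
    """Weight in [-10, 10] -> prior coefficient (the forward mapping)."""
    return math.log((w + 10) * (1 - p0) / (p0 * (10 - w)))

def beta_to_weight(beta, p0):
    """Posterior mean coefficient -> score on [-10, 10] (the inverse mapping)."""
    q = math.exp(beta) * p0 / (1 - p0)   # odds implied by the coefficient
    return 10 * (q - 1) / (q + 1)

# A weight of 7 should survive the conversion both ways
recovered = beta_to_weight(weight_to_beta(7.0, p0=0.10), p0=0.10)
```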

Visualising Shrinkage


A useful property of mixed effects and Bayesian hierarchical models is that lower level estimates are shrunk towards the more stable estimates further up the hierarchy.

To use a time-honoured example, you might be modelling the effect of a new teaching method on performance at the classroom level. Classes of 30 or so students are probably too small a sample to give useful results on their own. In a hierarchical model the data are pooled so that all the classes in a school are modelled together as a hierarchy, and perhaps even all the schools in a district.

At each level in this hierarchy an estimate of the effect of the teaching method is obtained: for the district, for each school, and even for the individual classes. The class-level estimates are weighted averages of the estimate for the class and the estimate for the school (which in turn is a weighted average of the estimates for the school and the district). The clever part is that this weighting is itself determined by the data. Where a class is an outlier, and the overall school average is therefore less relevant, the estimate is weighted towards the class; where it is typical, it is weighted towards the school. This property is known as shrinkage.
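In the simplest normal-normal case this weighting can be written down directly: the shrunk estimate is a precision-weighted average of the class's own mean and the school mean. A sketch of that calculation (standard partial pooling, not tied to any particular package):

```python
def shrink(class_mean, class_var, n, school_mean, between_var):
    """Posterior mean for one class in a normal-normal hierarchy: a
    precision-weighted average of the class mean and the school mean."""
    prec_data = n / class_var        # precision of the class's own estimate
    prec_prior = 1.0 / between_var   # precision of the school-level prior
    w = prec_data / (prec_data + prec_prior)
    return w * class_mean + (1 - w) * school_mean

# A class of 5 is pulled much further towards the school mean than a class of 100
small_class = shrink(8.0, 16.0, 5, 5.0, 1.0)
large_class = shrink(8.0, 16.0, 100, 5.0, 1.0)
```

The weight is large when the class estimate is precise (big n, low variance) and small when it is noisy, which is exactly the data-driven weighting described above.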

I’m often interested in how much shrinkage is affecting my estimates and I want to see it. I’ve created this plot which I find useful. It’s done in R using ggplot and is very simple to code.

The idea is that the non-shrunk estimates bi (i.e. the estimates that would be obtained by modelling classes individually) are plotted along the line x=y at the points (bi, bi). The estimates they are being shrunk towards, ai, are plotted at the points (bi, ai). Finally we plot the shrunk estimates si at (bi, si) and connect the points with an arrow to illustrate the direction of the shrinkage.

Here is an example. You can see the extent of the shrinkage from the distance covered by the arrow towards the higher level estimate.


Note that the arrows do sometimes point away from the higher level estimate. This is because the data shown are for a single coefficient in a hierarchical regression model with multiple coefficients: where other coefficients have been stabilised by shrinkage, this particular coefficient is revised to compensate.

The R code is as follows:

# *--------------------------------------------------------------------
# | FUNCTION: shrinkPlot
# | Function for visualising shrinkage in hierarchical models
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |31/08/2013|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  orig      Estimates obtained from individual level
# |                    modelling
# |          shrink    Estimates obtained from hierarchical modelling
# |          prior     Priors in Bayesian model or fixed effects in
# |                    mixed effects model (i.e. what it is shrinking
# |                    towards)
# |          window    Limits for the plot (as a vector)
# |
# *--------------------------------------------------------------------
# | OUTPUTS: A ggplot object
# *--------------------------------------------------------------------
# | DEPENDS: grid, ggplot2
# |
# *--------------------------------------------------------------------
library(ggplot2)
library(grid)  # for unit() and arrow()

shrinkPlot <- function(orig, shrink, prior, window=NULL){
  data <- data.frame(orig, shrink, prior)
  g <- ggplot(data=data, aes(x=orig, xend=orig, y=orig, yend=shrink))
  g2 <- g + geom_segment(arrow = arrow(length = unit(0.3, "cm"))) +
    geom_point(aes(x=orig, y=prior))
  g3 <- g2 + xlab("Estimate") + ylab("Shrinkage") + ggtitle("Shrinkage Plot")
  if (!is.null(window)) {
    g3 <- g3 + xlim(window) + ylim(window)
  }
  g3
}
