Coppelia Machine Learning and Analytics

The Workshop

Our network diagram of philosophers appeared in the online editions of
The New York Times and The New Yorker.
Read it here

Concept map for Spark and Hadoop

Here is a concept map I use in some of our Spark workshops. I find these diagrams very useful when a topic is particularly crowded with different tools, techniques and ideas. They give a zoomed-out view that you can refer back to when you start to get lost.

To read the diagram, pick a concept, read off the description underneath, and then continue the sentence using one of the arrows. So, for example: “EMR is a web-based service that allows you to efficiently process large data sets by … running on a cluster of computers built with … EC2”.

Click into the image to get a zoomable version, otherwise you won’t be able to read the text!

[Image: concept map for Spark and Hadoop]

Spark and R

Spark

Apache Spark is, as the official project page puts it, a fast and general engine for large-scale data processing. Many consider it the successor to the popular Hadoop MapReduce engine.

Spark has grown considerably in popularity in recent years because:

  • It integrates easily with existing technologies. It can run on:
    • Hadoop
    • Mesos
    • Standalone
    • In the cloud
  • It can also access diverse data sources including:
    • HDFS
    • Cassandra
    • HBase
    • S3
  • It runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  • It offers APIs in a number of languages:
    • Java
    • Scala
    • Python
    • R

R

R is a statistical programming language (also now regarded, after recent developments, as a general-purpose language) and one of the languages that can be used to run statistical and data analysis on Spark.
In this course we will explore the two main R packages used to run data analysis on Spark (see the sketch after this list for a taste of the workflow):

  • SparkR – natively included in Spark after version 1.6.2
  • sparklyr – developed by RStudio
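
To give a flavour, here is a minimal sketch of the sparklyr workflow, assuming Spark and the package are installed locally. The data set and aggregation are purely illustrative:

    # Minimal sparklyr sketch: connect to a local Spark instance,
    # copy a data frame in and run a simple aggregation.
    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # Ship the built-in mtcars data set into Spark
    mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

    # dplyr verbs are translated into Spark SQL behind the scenes
    mtcars_tbl %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg)) %>%
      collect()

    spark_disconnect(sc)

The same analysis could equally be written against SparkR; sparklyr’s appeal is that the familiar dplyr verbs carry over unchanged.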

Read more

Split down the middle – Using the polls to forecast the likelihood of Brexit

Pollsters have come in for a lot of stick since the last election. How did they get it so wrong? Was it that their sample sizes were too small, that their interviewing methods were biased, or that their re-weighting was wonky? Maybe, more simply, people just don’t tell them the truth about how they will vote – or even know themselves. Of course, it’s not just the pollsters that can confuse people. Through their scientific fog seeps the spin – the latest poll, the trend, the momentum, the editors’ desire to influence.

But, if we are trying to figure out whether the UK will ‘Brexit’ this week, the trouble is that all we have is the polls. So how do we screw as much useful information out of them as possible?

Well one way is to use Bayesian methods.

The headline polls have shown ‘Brexit’ steadily eroding the ‘Remain’ lead, particularly since the campaign began in earnest. The chart shows the ‘Remain’ lead over the last two and a half years, with larger dots representing more precise estimates.

[Chart: the ‘Remain’ lead over the last two and a half years]

Source: Financial Times

But is this a trend or momentum? How can we aggregate polls from different companies, using different methods and sample sizes, to make a coherent forecast for 23rd June?

One way is to use a Bayesian state space model (see Simon Jackman’s papers at http://jackman.stanford.edu/papers/). The idea is that there is an underlying, latent public opinion that is sampled by the published polls. These polls differ not only in what they tell us about this underlying latent preference, but also in their quality (e.g. the sample size and the methodology). So the model is weighted to take greater account of the more reliable polls. The underlying preference evolves over time.

More specifically

  • \(\mu_{t} \sim N(\mu_{t-1},\sigma)\) – the latent public opinion, expressed as the lead for ‘Remain’, follows a normally distributed random walk, where \(\sigma\) is estimated from the data. The underlying preference is estimated for each day since 1st Jan 2014.
  • \(lead_{t,i} \sim N(\mu_{t} + pollster_{i}, \sigma_{lead,i,t})\) – the model is fitted to the polled lead for ‘Remain’ at time \(t\) for pollster \(i\), which lets it measure the bias of each pollster.

The model is estimated using Stan and the code is available on github here.
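
For a flavour of the structure described above, here is a stripped-down sketch of such a state space model in Stan, called from R via rstan. This is an illustration only, not the code from the github repository, and all of the variable names are my own:

    # Illustrative sketch: a simplified poll-aggregation model in Stan,
    # fitted from R with rstan. Not the actual code from the repository.
    library(rstan)

    model_code <- "
    data {
      int<lower=1> T;                  // number of days
      int<lower=1> N;                  // number of polls
      int<lower=1> P;                  // number of pollsters
      int<lower=1, upper=T> day[N];    // day on which each poll was taken
      int<lower=1, upper=P> house[N];  // pollster behind each poll
      vector[N] lead;                  // observed lead for remain
      vector<lower=0>[N] se;           // standard error of each poll's lead
    }
    parameters {
      vector[T] mu;                    // latent daily lead for remain
      vector[P] pollster;              // house effect (bias) per pollster
      real<lower=0> sigma;             // day-to-day volatility of the walk
    }
    model {
      mu[1] ~ normal(0, 20);                        // vague prior on the start
      mu[2:T] ~ normal(mu[1:(T - 1)], sigma);       // the random walk
      lead ~ normal(mu[day] + pollster[house], se); // polls measure mu + bias
    }
    "

    # poll_data would be a list matching the names in the data block
    # fit <- stan(model_code = model_code, data = poll_data)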

The chart shows how the model unpicks, day by day, the underlying lead that is most consistent with the polls. There is still a lot of daily volatility, reflecting how close the race is, how much the polls have differed, and the fact that this is a model of the ‘lead’ (which is twice as volatile as the share of the vote).

[Chart: estimate of the underlying trend]

There are systematic ‘biases’ from each pollster’s methodology towards either ‘Brexit’ or ‘Remain’. For example, IPSOS MORI is biased towards ‘Remain’ and YouGov towards ‘Brexit’. This might be down to the sampling method, the re-weighting calculations, etc. It is one of the reasons it is hard to just eyeball the raw poll averages: a trend can appear simply because of the sequence of pollsters.

[Chart: estimate of pollster bias]

Putting this together we can forecast the last few days of the campaign. Of course, this requires the assumption that ‘everything else remains the same’ – which is the most improbable thing of all!

[Chart: prediction of the ‘Remain’ lead]

So this simple model reckons that, as of Sunday 19th June, there is a 66% chance of a ‘Remain’ victory on 23rd June, with an average lead of 3%pts.
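
That headline probability is simply the share of posterior simulations in which the ‘Remain’ lead on polling day is positive, along the lines of the sketch below. The draws here are stand-ins chosen only to make the snippet runnable; the real ones come out of the fitted model:

    # sim_lead stands in for posterior draws of the remain lead on 23rd June;
    # in practice these would be extracted from the fitted Stan model.
    sim_lead <- rnorm(4000, mean = 3, sd = 7)

    mean(sim_lead > 0)  # probability of a 'Remain' victory
    mean(sim_lead)      # expected lead in percentage points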


But don’t rush to the conclusion that this model is secretly receiving EU funding. The same model a week ago was predicting the exact opposite result: a 60% chance of ‘Brexit’.

When the country is on a knife-edge, small shifts in information make a big difference. Half a week is a long time in politics.

Animated Densities

I’ve just added (pseudo) random number generation for Gamma and Dirichlet distributions to the glasseye package (a necessary new part of the toolkit as I’ll be adding some Bayesian things shortly).

To celebrate I’ve created a wholly pointless new chart in the library: the animated density plot. I admit it adds nothing (apart from showing you how the distribution can slowly be built up by the randomly generated data) but sometimes it’s good to be pointless.

Here’s the gamma(5,1) distribution.
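
The effect is easy to mimic: as the number of random draws grows, their empirical distribution settles onto the true density. A minimal illustration in base R (not using glasseye itself):

    # As more gamma(5, 1) draws accumulate, the histogram approaches the density
    set.seed(42)
    draws <- rgamma(5000, shape = 5, rate = 1)

    for (n in c(50, 500, 5000)) {
      hist(draws[1:n], breaks = 30, freq = FALSE,
           main = paste(n, "draws"), xlab = "x")
      curve(dgamma(x, shape = 5, rate = 1), add = TRUE, lwd = 2)
    }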

You can find the latest version of glasseye on github.

From Zero to D3

Some time ago I was asked to run a short course to take analysts from javascript newbies to being capable of putting together interactive charts using d3. That’s a big ask, of them and of me! However, it did work, or at the very least it set people off in the right direction while giving them some immediate and satisfying results.

I’m sharing the materials here as they may be of use to anyone who wants to get up and running quickly.

It’s a two-pronged attack. In the first session we go bottom-up and run through how HTML, JavaScript, CSS and SVGs all fit together. We then use that knowledge to build a simple scatterplot. In the second session we go top-down, using what we learnt from our basic example to unpick some more complex code and substitute in our own data. This gets us quite a long way.

I’m sharing the videos I made, the presentation that I talk through and the exercises that we went through. This is the material from session one; session two will follow shortly.

Details of more courses run by Coppelia can be found here. If you are interested in onsite training please get in touch (info@coppelia.io).

The videos

Please watch the following short videos before the workshop class.

The javascript revolution

 

How it all fits together

 

Read more

Interactive Mapping

Animated World Map

In the past I have used expensive GIS software, or ggmap in R, to produce maps and visualise geographic data. I wanted to see what javascript offered for making more interactive maps. After a little research I found a couple of options: Leaflet and good old D3.js. I challenged myself to make an animated map that plots all the places I’ve been to over time.

Read more

A decision process for selecting statistical techniques

[Image: detail of the decision chart]

In this chart (detail above, full version below) I’ve tried to capture the decision process I go through to select the most promising statistical or machine learning technique given the problem and the data.

It’s a heuristic in the sense given in Wikipedia:

A heuristic technique, often called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. (Wikipedia)

It certainly isn’t perfect but it is practical! In particular it’s worth bearing in mind that:

  • It does not cover the tests you’d need to go through to establish whether a technique is being applied correctly. Also, where a technique is sophisticated, I’d probably start with something simpler and then work towards the more complex technique.
  • There are of course many other available techniques, but these are the ones I use a lot.
  • Some personal preferences are also built in. For example, I tend to go for a Bayesian model whenever the problem does not call for a model using a linear combination of explanatory variables, as I find it easier to think about the more unusual cases that way.

This diagram was made with the fantastic draw.io. Click into it for the full version.

[Image: the full decision chart]
