September 16, 2015 Simon Raper

A decision process for selecting statistical techniques


[Image: detail of the decision chart]

In this chart (detail above, full version below) I’ve tried to capture the decision process I go through to select the most promising statistical or machine learning technique given the problem and the data.

It’s a heuristic in the sense given in Wikipedia:

A heuristic technique, often called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. (Wikipedia)

It certainly isn’t perfect, but it is practical! In particular, it’s worth bearing in mind that:

  • It does not cover the tests you’d need to go through to establish whether a technique is being applied correctly. Also, where a technique is sophisticated, I’d probably start with something simpler and then work towards the more complex technique.
  • There are, of course, many other available techniques, but these are the ones I use a lot.
  • Some personal preferences are also built in. For example, I tend to go for a Bayesian model whenever the problem does not call for a model using a linear combination of explanatory variables, as I find it easier to think about the more unusual cases in this way.
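
To make the last point concrete, here is a minimal sketch in Python of how that single preference could be written down as a rule. It is only an illustration: the names Problem, linear_in_explanatory_variables and suggest_starting_point are hypothetical, and the chart itself has many more branches than this one.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    # Hypothetical description of a modelling problem (illustrative only).
    # True when the problem calls for a model built from a linear
    # combination of explanatory variables.
    linear_in_explanatory_variables: bool


def suggest_starting_point(problem: Problem) -> str:
    """Return a family of techniques to consider first.

    Encodes only the one personal preference stated in the list above,
    not the full decision chart.
    """
    if problem.linear_in_explanatory_variables:
        return "linear-combination model"
    return "Bayesian model"


if __name__ == "__main__":
    print(suggest_starting_point(Problem(linear_in_explanatory_variables=False)))
```

Even with a rule like this, the earlier caveat still applies: the suggestion is a starting point, and the checks that the technique is being applied correctly are a separate step.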

This diagram was made with the fantastic draw.io. Click into it for the full version.

[Image: BlueprintTechniques, the full version of the chart]

About the Author

Simon Raper: I am an RSS-accredited statistician with over 15 years’ experience working in data mining and analytics, and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia, an analytics startup that uses agile methods to bring machine learning and other cutting-edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, Spark, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts - a blog on computational statistics, machine learning, data visualisation, R, Python and cloud computing. It has had over 310K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.
