## A decision process for selecting statistical techniques

In this chart (detail above, full version below) I’ve tried to capture the decision process I go through to select the most promising statistical or machine learning technique given the problem and the data.

It’s a heuristic in the sense given in Wikipedia:

A heuristic technique, often called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. (Wikipedia)

It certainly isn’t perfect but it is practical! In particular it’s worth bearing in mind that

• It does not cover the tests you’d need to go through to establish whether a technique is being applied correctly. Also, where a technique is sophisticated, I’d probably start with something simpler and then work towards the more complex technique.
• There are of course many other available techniques but these are ones I use a lot.
• Some personal preferences are also built in. For example I tend to go for a Bayesian model whenever the problem does not call for a model using a linear combination of explanatory variables as I find it easier to think about the more unusual cases in this way.

This diagram was made with the fantastic draw.io. Click into it for the full version.

## Glasseye: bringing together markdown, d3 and the Tufte layout

Glasseye is a package I’m developing to present the results of statistical analysis in an attractive and hopefully interesting way. It brings together three great things that I use a lot:

1. The markdown markup language.
2. The Tufte wide margin layout
3. Visualisation using d3.js

See a full demo of what it can do here, and you can visit the github repository here.

Here is what it looks like when transformed into html.

The idea is to be able to write up work in markdown and have the results transformed into something like a Tufte layout, of which more below. For the Tufte layout I took the excellent tufte.css style sheet developed by Dave Liepmann and co and made a few changes to suit my purposes. Finally I’ve added some d3 charts (just a small selection at the moment but this will grow) that can easily be invoked from within the markdown.

It’s all very, very beta at the moment. I’m not claiming it’s ready to go. I would like to add lots more charts, redesign the d3 code and improve its overall usability (in particular, replace the tags approach with something more in the spirit of markdown), however I thought I’d share it as it is. Hope you find it interesting!

## Modelling for decisions

Here’s the deck I presented today at the Predictive Analytics Innovation conference in London.

The idea was to look at ways in which we might use modern statistical methods to help build models that are grounded firmly on common sense intuitions and to do it all in a very short time.

If you are interested in more details just let me know on twitter or drop me an email.

## From Redshift to Hadoop and back again

If you are using AWS Redshift as your data warehouse and you have a data processing or analytical job that would benefit from a bit of hadoop then it’s fairly straightforward to get your data into EMR and then back into Redshift. It’s just a matter of using the copy and unload commands to read from and write to an S3 bucket.

Here’s a simple example that might be helpful as a template. Read more
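As a rough sketch of the round trip, the Redshift side boils down to an unload before the EMR job and a copy after it. The bucket paths, table names and credentials placeholders below are all hypothetical, not taken from the actual template:

```
-- Run in Redshift: export a table to S3 for EMR to read
-- (all names and paths here are hypothetical placeholders)
UNLOAD ('select * from events')
TO 's3://my.bucket/emr-input/events_'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|' GZIP;

-- ...run the hadoop/EMR job, writing its output to s3://my.bucket/emr-output/ ...

-- Back in Redshift: load the processed files into a target table
COPY events_processed
FROM 's3://my.bucket/emr-output/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|' GZIP;
```

The delimiter and compression options just need to agree on both sides so that what EMR writes is what COPY expects.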

## Beautiful Hue

Hue is a godsend for when you want to do something on hadoop (hive query, pig script, run spark etc) and you are not feeling very command line.

As with all things on AWS there seem to be many routes in, but here’s my recipe for getting it up and running. (I’m assuming you have an AWS account; if not, get set up by following the instructions here.)

I use the AWS command line interface (easy instructions for install are here) to get a cluster with Hue on it up and running.

At the command line I then launch a cluster:

```
aws emr create-cluster --name "My Cluster Name" \
  --ami-version 3.3 \
  --log-uri s3://my.bucket/logs/ \
  --applications Name=Hive Name=Hue Name=Pig \
  --ec2-attributes KeyName=mykeypair \
  --instance-type m3.xlarge \
  --instance-count 3
```

Next log in to the AWS console and go to EMR > Click on your cluster > view cluster details > enable web connection and follow the instructions there.

Note that the instructions for configuring the proxy management tool are incomplete. Go here for the complete instructions.

You should then, as it says, be able to put master-public-dns-name:8888 into your browser and log in to Hue. Don’t be a fool like me and forget to actually substitute in your master-public-dns-name, which can be found on your cluster details!

Here’s a video tutorial I created that shows how to get Hue up and running on AWS

## Using D3 to show cost, revenue and ROI

Return on investment is often measured as the gain from an investment divided by the cost of the investment, sometimes expressed as a percentage. For example if a marketing campaign cost £10K but brought in £20K of revenue on top of the usual sales then the ROI is 200%.

(Note ROI is arguably more properly defined as (gain – cost)/cost but I’ve found that most of the people and industries that I’ve worked with slip naturally into the first definition: gain/cost. In any case both definitions capture the same idea. Thanks to Eduardo Salazar for pointing this out.)

Now if you are just given the ROI you’ll find you are missing any idea of scale. The same ROI could be achieved with a revenue gain of £200 and with one of £200 million. So it would be nice to see cost, revenue and ROI visualised all in one go. There are a few ways to do this, but after playing around I came up with the following representation, which personally I like the best. It’s a simple scatterplot of cost against revenue, but since all points on straight lines radiating from the origin have the same ROI it’s easy to overlay that information. If r is the ROI then the angle of the corresponding spoke is arctan(r).
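As a quick check on the geometry, here’s a small Python sketch (mine, not part of the original d3 code) computing the ROI under the gain/cost definition and the angle of the corresponding spoke:

```
import math

def roi(cost, revenue):
    # ROI under the gain/cost definition used above
    return revenue / cost

def spoke_angle(r):
    # a point with revenue = r * cost lies on the line y = r * x,
    # so the spoke for ROI r makes an angle arctan(r) with the cost axis
    return math.atan(r)

campaign_roi = roi(10_000, 20_000)                   # 2.0, i.e. 200%
angle_deg = math.degrees(spoke_angle(campaign_roi))  # roughly 63.4 degrees
```

So the £10K/£20K campaign from the example sits on the 200% spoke, about 63 degrees up from the cost axis.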

Note you can drag about the labels. That’s my preferred solution for messy scatterplot labelling.

Hopefully it’s obvious that the idea is to read off the ROI from the position of the point relative to the spokes. The further out toward the circumference the greater the scale of the success or the disaster, depending on the ROI.

To modify the graph for your own purposes just take the code from here and substitute in your data where var dataset is defined. You can change which ROIs are represented by altering the values in the roi array. If you save the code as html and open in a browser you should see the graph. Because d3 is amazing the graph should adapt to fit your data.

You can also find the code here as a JSFiddle.

Thanks to Paul McAvoy for posing the problem and for all the other interesting things he’s shown me!

## Four weeks to launch!

Only four weeks to go until our official launch date of 28th October. It feels like it’s been a long build up but we believe it will be worth the wait! In the meantime here’s a bit more information about the kind of things we do, why we are different and what motivates us. If you’re interested please do get in touch.

# What do we do?

There’s a huge interest in data science at the moment. Businesses understandably want to be a part of it. Very often they assemble the ingredients (the software, the hardware, the team) but then find that progress is slow.

Coppelia is a catalyst in these situations. Rather than endless planning we get things going straight away with the build, learn and build again approach of agile design. Agile and analytics are a natural fit!

Projects might be anything from using machine learning to spot valuable patterns in purchase behaviour to building decision making tools loaded with artificial intelligence.

The point is that good solutions tend to be bespoke solutions.

While we build we make sure that in-house teams are heavily involved – trained on the job. We get them excited about the incredible tools that are out there and new ways of doing things. This solves the problem of finding people with the data science skill set. It’s easier to grow your technologists in-house.

The tools are also important. We give our clients a full view of what’s out there, focusing on open source and cloud based solutions. If a client wishes to move from SAS to R we train their analysts not just in R but in the fundamentals of software design so that they build solid, reliable models and tools.

We teach the shared conventions that link technologies together so that soon their team will be coding in python and building models on parallelised platforms. It’s an investment for the long term.

Finally we know how important it is for the rest of the business to understand and get involved with these projects. Visualisation is a powerful tool for this and we emphasize two aspects that are often forgotten: interactivity (even if it’s just the eye exploring detail) and aesthetics: a single beautiful chart telling a compelling story can be more influential than a hundred stakeholder meetings.

# Why are we different?

One thing is that we prioritise skills over tools. There are a lot of people out there building tools but they tend to be about either preprocessing data or prediction and pattern detection for a handful of well defined cases. We love the tools but they don’t address the most difficult problem of how you turn the data into information that can be used in decision making. For that you need skilled analysts wielding tools. Creating the skills is a much harder problem.

Coppelia offers a wide range of courses, workshops and hackathons to kickstart your data science team. See our solutions section for a full description of what we offer.

Another difference is that we are statisticians who have been inspired by software design. We apply agile methods and modular design not just to the tools we build ourselves but also to traditional analytical tasks like building models.

Collaboration using tools like git and trello has revolutionised the way we work. Analysis is no longer a solitary task, it’s a group thing and that means we can take on bigger and more ambitious projects.

But what is most exciting for us is our zero overhead operating model and what it enables us to do. Ten years ago if we’d wanted to run big projects using the latest technology we’d have had to work for a large organisation. Now we can run entirely on open source.

For statistical analysis we have R, to source and wrangle data we have python, we can rent hardware by the hour from AWS and use it to parallelise jobs using hadoop.

Even non-technical tasks benefit in this way: marketing using social media, admin through google drive, training on MOOCs, design using inkscape and pixlr, accounting on quick file.

Without these extra costs hanging over us we are free to experiment, innovate, cross disciplines and work on topics that interest us, causes we like. Above all it gives us time to give back to the sources which have allowed us to work in this way: publishing our code, sharing insights through blogging, helping friends and running local projects.

# What are we excited about?

Anything where disciplines are crossed. We like to look at how statistics and machine learning can be combined with AI, music, graphic design, economics, physics, philosophy. We are currently looking at how the problem solving frameworks in AI might be applied to decision making in marketing.

Bayesian statistics and simulation for problem solving always seem to be rich sources of ideas. We’re also interested in how browser technology allows greater scope for communication. We blog in a potent mixture of text, html, markdown and javascript.

# What technology are we into?

It’s a long list but most prominent are R, python, distributed machine learning (currently looking at Spark), and d3.js. Some current projects include a package to convert R output into d3 and AI enhanced statistical modelling.

## A new home for pifreak

pifreak is my twitterbot. It started tweeting the digits of pi in April 2012 and has tweeted the next 140 digits at 3:14 pm GMT every day since. Not especially useful or popular (only 48 followers) but I’ve grown fond of she/he/it.

I was housing her on an AWS ec2 micro instance, however my one year of free hire ran out and it has become a little too expensive to keep that box running.

So I’ve been looking at alternatives. I’ve settled on the google app engine which I’m hoping is going to come out as pretty close to free hosting.

So here’s a few notes for anyone else who might be thinking of using the google app engine for automated posting on twitter.

It was reasonably simple to set up:

1. Download the GAE python SDK. This provides a GUI for both testing your code locally and then deploying it to the cloud when you are happy with it.
2. Create a new folder for your app and within that place your python modules together with an app.yaml file and a cron.yaml file which will configure the application and schedule your task respectively. It’s all very well documented here and for the cron scheduling here.
3. Open the App Engine Launcher (which is effectively the SDK), add your folder, then either hit run to test locally or deploy to push to the cloud (you’ll be taken to some forms to register your app if you’ve not already done so)
4. Finally if you click on dashboard from the launcher you’ll get lots of useful information about your web deployed app including error logs and the schedule for your tasks.
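For step 2, the two yaml files might look roughly like this. The application name, module name and schedule below are hypothetical illustrations, not taken from the actual pifreak code:

```
# app.yaml -- hypothetical values throughout
application: my-twitter-bot
version: 1
runtime: python27
api_version: 1

handlers:
- url: /
  script: tweet_pi.app

# cron.yaml -- schedules the daily task (GAE cron times are UTC)
cron:
- description: tweet the next batch of pi digits
  url: /
  schedule: every day 15:14
```

The url in cron.yaml has to match a url handled in app.yaml, which is why a bare forward slash in both works for a single-module app.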

The things that caught me out were:

1. Make sure that the application name in your app.yaml file is the same as the one you register with Google (when it takes you through to the form the first time you deploy.)
2. There wasn’t a lot in the documentation about the use of the url field in both the cron and app yaml files. I ended up just putting a forward slash in both since in my very simple app the python module is in the root.
3. Don’t forget module names are case sensitive so when you add your python module in the script section of the app file you’ll need to get this right.
4. Yaml files follow an indentation protocol that is similar to python. You’ll need to ensure it’s all lined up correctly.
5. Any third party libraries you need that are not included in this list will need to be included in your app folder. For example I had to include tweepy and some of its dependencies.
6. Where a third party library that you need is included in the GAE runtime environment, you need to add it to the app file using the following syntax:

```
libraries:
- name: ssl
  version: "latest"
```

And here finally is a link to the code.

## An A to Z of extra features for the D3 force layout

Since d3 can be a little inaccessible at times I thought I’d make things easier by starting with a basic skeleton force directed layout (Mike Bostock’s original example) and then giving you some blocks of code that can be plugged in to add various features that I have found useful.

The idea is that you can pick the features you want and slot in the code. In other words I’ve tried to make things sort of modular. The code I’ve taken from various places and adapted so thank you to everyone who has shared. I will try to provide the credits as far as I remember them!

## Basic Skeleton

Here’s the basic skeleton based on Mike Bostock’s original example but with a bit of commentary to remind me exactly what is going on

## A is for arrows

If you are dealing with a directed graph and you wish to use arrows to indicate direction, you can just append this bit of code. Read more

## Buster – a new R package for bagging hierarchical clustering

I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, not for hierarchical clustering.

It’s not a typical everyday human dilemma. It needs a bit more explanation. Read more