Concept map for Spark and Hadoop

Here is a concept map I use in some of our Spark workshops. I find these diagrams very useful when a topic is particularly crowded with different tools, techniques and ideas. A diagram like this gives a zoomed-out view which you can refer back to when you start to get lost.

To read the diagram pick a concept, read off the description underneath and then continue the sentence using one of the arrows. So for example “EMR is a web-based service that allows you to efficiently process large data sets by … running on a cluster of computers built with … EC2”

Click into the image to get a zoomable version, else you won’t be able to read the text!

[Image: Spark and Hadoop concept map (MapReduceConceptMap-2)]

Animated Densities

I’ve just added (pseudo) random number generation for Gamma and Dirichlet distributions to the glasseye package (a necessary new part of the toolkit as I’ll be adding some Bayesian things shortly).

To celebrate I’ve created a wholly pointless new chart in the library: the animated density plot. I admit it adds nothing (apart from showing you how the distribution can slowly be built up by the randomly generated data) but sometimes it’s good to be pointless.

Here’s the gamma(5,1) distribution.
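If you want to reproduce the kind of samples that drive the animated plot, here’s a minimal sketch using numpy as a stand-in generator (an assumption on my part — glasseye has its own random number generation, whose API is not shown here):

```python
import numpy as np

# Draw pseudo-random samples from a gamma(shape=5, scale=1) distribution,
# the same distribution shown in the animated density plot above.
rng = np.random.default_rng(42)
samples = rng.gamma(shape=5.0, scale=1.0, size=100_000)

# For gamma(k, theta): mean = k * theta = 5, variance = k * theta**2 = 5.
# The sample moments should land close to these as the sample grows,
# which is exactly what the animation lets you watch happening.
print(samples.mean())
print(samples.var())
```

An animated version would simply reveal these samples a few at a time, recomputing the density estimate at each step.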

You can find the latest version of glasseye on github.

From Zero to D3

Some time ago I was asked to run a short course to take analysts from javascript newbies to being capable of putting together interactive charts using d3. That’s a big ask, from them and from me! However, it did work, or at the very least it set people off in the right direction while giving them some immediate and satisfying results.

I’m sharing the materials here as they may be of use to anyone who wants to get up and running quickly.

It’s a two-pronged attack. In the first session we go bottom up and run through how html, javascript, css and svgs all fit together. We then use that knowledge to build a simple scatterplot. In the second session we go top down, using what we learnt from our basic example to unpick some more complex code and substitute in our own data. This gets us quite a long way.

I’m sharing the videos I made, the presentation that I talk through and the exercises that we went through. This is the material from session one. Session two will follow shortly.

Details of more courses run by Coppelia can be found here. If you are interested in onsite training please get in touch ([email protected]).

The videos

Please watch the following short videos before the workshop class.

The javascript revolution

[Video]

How it all fits together

[Video]


A decision process for selecting statistical techniques

[Image: detail of the technique-selection decision chart]

In this chart (detail above, full version below) I’ve tried to capture the decision process I go through to select the most promising statistical or machine learning technique given the problem and the data.

It’s a heuristic in the sense given in Wikipedia:

A heuristic technique, often called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution.

It certainly isn’t perfect, but it is practical! In particular, it’s worth bearing in mind that:

  • It does not cover the tests you’d need to go through to establish whether a technique is being applied correctly. Also, where a technique is sophisticated, I’d probably start with something simpler and then work towards the more complex technique.
  • There are of course many other available techniques but these are ones I use a lot.
  • Some personal preferences are also built in. For example, I tend to go for a Bayesian model whenever the problem does not call for a model using a linear combination of explanatory variables, as I find it easier to think about the more unusual cases in this way.

This diagram was made with the fantastic draw.io. Click into it for the full version.

[Image: full technique-selection decision chart (BlueprintTechniques)]

Glasseye: bringing together markdown, d3 and the Tufte layout

Glasseye is a package I’m developing to present the results of statistical analysis in an attractive and hopefully interesting way. It brings together three great things that I use a lot:

  1. The markdown markup language.
  2. The Tufte wide-margin layout.
  3. Visualisation using d3.js.

See a full demo of what it can do here, and you can visit the github repository here.

Here is what it looks like when transformed into html:

[Image: screenshot of a glasseye-generated page]

The idea is to be able to write up work in markdown and have the results transformed into something like a Tufte layout, of which more below. For the Tufte layout I took the excellent tufte.css style sheet developed by Dave Liepmann and co and made a few changes to suit my purposes. Finally I’ve added some d3 charts (just a small selection at the moment, but this will grow) that can easily be invoked from within the markdown.

It’s all very, very beta at the moment. I’m not claiming it’s ready to go. I would like to add lots more charts, redesign the d3 code and improve its overall usability (in particular, replace the tags approach with something more in the spirit of markdown), but I thought I’d share it as it is. Hope you find it interesting.

Modelling for decisions

Here’s the deck I presented today at the Predictive Analytics Innovation conference in London.

The idea was to look at ways in which we might use modern statistical methods to help build models that are grounded firmly on common sense intuitions and to do it all in a very short time.

If you are interested in more details, just let me know on twitter or drop me an email.

Or you can download the presentation from here.

From Redshift to Hadoop and back again

If you are using AWS Redshift as your data warehouse and you have a data processing or analytical job that would benefit from a bit of hadoop, then it’s fairly straightforward to get your data into EMR and then back into Redshift. It’s just a matter of using the copy and unload commands to read from and write to an S3 bucket.

Here’s a simple example that might be helpful as a template.
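As a rough illustration of the round trip, the two statements look something like the sketch below. All names here (bucket, tables, IAM role) are placeholders of my own, and the exact options you need will depend on your data; you would run the generated SQL against Redshift with your usual client.

```python
# Sketch of the Redshift <-> S3 round trip for an EMR/hadoop job.
# Bucket, table and role names are illustrative placeholders only.
BUCKET = "s3://my-bucket/staging/"
IAM_ROLE = "arn:aws:iam::123456789012:role/MyRedshiftRole"

def unload_sql(table: str) -> str:
    """UNLOAD writes a Redshift table out to S3, where EMR can read it."""
    return (
        f"UNLOAD ('select * from {table}') "
        f"TO '{BUCKET}{table}_' "
        f"IAM_ROLE '{IAM_ROLE}' DELIMITER '|';"
    )

def copy_sql(table: str) -> str:
    """COPY loads the processed files from S3 back into a Redshift table."""
    return (
        f"COPY {table} FROM '{BUCKET}{table}_processed' "
        f"IAM_ROLE '{IAM_ROLE}' DELIMITER '|';"
    )

print(unload_sql("events"))
print(copy_sql("events_summary"))
```

The hadoop step in the middle simply reads the unloaded files from the bucket and writes its output back to S3 for the COPY to pick up.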

Beautiful Hue

Hue is a godsend for when you want to do something on hadoop (a hive query, a pig script, running spark, etc.) and you are not feeling very command line.

As with all things on AWS there seem to be many routes in, but here’s my recipe for getting it up and running. (I’m assuming you have an AWS account; if not, get set up by following the instructions here.)

I use the AWS command line interface (easy install instructions are here) to get a cluster with hue on it up and running.

At the command line I then launch a cluster:

aws emr create-cluster --name "My Cluster Name" --ami-version 3.3 --log-uri s3://my.bucket/logs/  --applications Name=Hive Name=Hue Name=Pig --ec2-attributes KeyName=mykeypair --instance-type m3.xlarge --instance-count 3

Next log in to the AWS console and go to EMR > Click on your cluster > view cluster details > enable web connection and follow the instructions there.

Note that the instructions for configuring the proxy management tool are incomplete. Go here for the complete instructions.

You should then, as it says, be able to put master-public-dns-name:8888 into your browser and log in to hue. Don’t be a fool like me and forget to actually substitute in your master-public-dns-name, which can be found on your cluster details!
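To guard against exactly that mistake, here’s a tiny, purely illustrative helper that builds the Hue URL and refuses the unsubstituted placeholder:

```python
def hue_url(master_dns: str, port: int = 8888) -> str:
    """Build the Hue login URL from the cluster's master public DNS name,
    refusing the literal placeholder 'master-public-dns-name' in case it
    was left in by mistake."""
    if "master-public-dns-name" in master_dns:
        raise ValueError("Substitute your cluster's actual master DNS name")
    return f"http://{master_dns}:{port}"

# Example with a made-up DNS name:
print(hue_url("ec2-203-0-113-10.compute-1.amazonaws.com"))
# -> http://ec2-203-0-113-10.compute-1.amazonaws.com:8888
```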

Here’s a video tutorial I created that shows how to get Hue up and running on AWS.

Machine Learning and Analytics based in London, UK