Quick start Hadoop and Hive for analysts


The problem

You have a huge data set to crunch in a small amount of time. You’ve heard about the advantages of MapReduce for processing data in parallel and you’d quite like to have a go with Hadoop to see what all the fuss is about. You know SQL and you’d like something SQL-like that runs on top of Hadoop.

What you’ll need

  • An Amazon Web Services account (see this tutorial)
  • The Firefox browser
  • An understanding of SQL


We will keep it simple by just joining two tables. Let’s take some recommendation data from my Mahout post on drunks and lampposts. These aren’t particularly large data sets as we’re after something quick for a demo. You can download the data sets from here.

We would like to join RECOMMENDER_SET_NUM to PERSON_IDS_FULLNAME by ID (we are just looking up a person’s name).
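In plain Python terms (with a few made-up rows, invented purely for illustration), the join we’re after is just a name look-up:

```python
# Hypothetical sample rows, invented for illustration only
recommender_set_num = [(101, 5), (101, 9), (102, 5)]            # (userID, itemID)
person_ids_full_names = [(101, "asmith", "Alice Smith"),
                         (102, "bjones", "Bob Jones")]          # (userID, nameKey, displayName)

# Build a look-up from ID to name, then attach the name to each recommendation
names = {uid: (key, name) for uid, key, name in person_ids_full_names}
joined = [(uid, item) + names[uid] for uid, item in recommender_set_num if uid in names]
```

Hive will do exactly this kind of look-up for us, only spread across a cluster.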

Note that Hive (the data warehouse platform that sits on top of Hadoop, allowing you to run a kind of SQL) doesn’t seem to like column headers. Fortunately our data does not have them. If you do need to remove column headers from lots of files, I’ve written some very simple Python to do the job and you can find it here.

Do it

This post uses screen scripts to give instructions about what to click on, input etc. Here’s an explanation of the notation.

Step 1: Creating the S3 buckets

First we create a new S3 bucket. Log onto the console and then

@ AWS Console > S3 > create Bucket
~ name: crunchdata [Note name must be lower case]

As of 21 Feb 2014

Step 2: Setting up the keys

To get your AWS access key and your secret access key

@ AWS Console > {your name} > my account > security credentials

Step 3: Open Firefox, download S3Fox and open it up

@ http://www.s3fox.net/ > download
@ Firefox > tools > S3 organiser > manage accounts
~ account name: {you choose}
~ access key: {your access key from step 2}
~ secret key: {your secret key from step 2}

save > close

Step 4: Upload your data to S3

In the right-hand panel

right click > create directory
~ folder name: input

Repeat to create folders for output, scripts and logs and then create two folders in the input folder: recommendersetnum and personsidsfullname.

Finally pull your data from your local directory in the left-hand panel over to the corresponding folder in the input folder.


Step 5: Write your HiveQL

Open up a text editor and write your hive script.

drop table if exists recommender_set_num; --In case you need to rerun the script
drop table if exists person_ids_full_names;
drop table if exists recom_names;
-- Set up a table to load the recommendations data into
create external table if not exists recommender_set_num (
userID bigint,
itemID bigint
) row format delimited fields terminated by ','
stored as textfile
location 's3n://crunchdata/input/recommendersetnum';
-- Set up a table to load the names look up into
create external table if not exists person_ids_full_names (
userID bigint,
nameKey string,
displayName string
) row format delimited fields terminated by ','
stored as textfile
location 's3n://crunchdata/input/personsidsfullname';
-- Set up a table to add the joined data to
create external table if not exists recom_names (
userID bigint,
itemID bigint,
nameKey string,
displayName string
) row format delimited fields terminated by ','
stored as textfile
location 's3n://crunchdata/output/recomnames.csv';
-- Join the tables
insert overwrite table recom_names
select A.userID,
A.itemID,
B.nameKey,
B.displayName
from recommender_set_num A join
person_ids_full_names B
on A.userID = B.userID;

If you are happy with SQL this should look fairly intelligible to you. The main difference is the association of each of the tables with a folder (in the case of the inputs) or a file (in the case of the output) on your S3 bucket.

Use the Firefox plugin to upload your Hive script to the scripts folder in the crunchdata S3 bucket.


Step 6: Start up an Elastic MapReduce cluster and run the job

@ AWS Console > Services > Elastic MapReduce

create cluster
~ cluster name: {you choose}
~ log folder S3 location: s3n://crunchdata/logs/
~ add step: hive program
configure and add
~ script s3 location: s3n://crunchdata/scripts/{your hiveql file}
~ input s3 location: s3n://crunchdata/input/
~ output s3 location: s3n://crunchdata/output/
add > create cluster

To monitor the progress of your job, go to the cluster list. It should take about 5 minutes.


Step 7: Collect the results

When the job is complete, go back to your Firefox plugin and download the text file that has been placed in your output folder.

If all has gone well it should be a CSV file containing the joined tables.

See it

I’ve created this concept diagram to help make clear the relationship between the tools used in this task.


Explain it

Chances are you’ll be asked to explain why this couldn’t be done on a standard database without the additional rigmarole of transferring the data and learning a variant of SQL. That’s a fair objection if the data set isn’t too big or you’ve got as long as you like to wait for the process to run through. But if the answer to either of these is no then you’ll want to take advantage of Hive’s ability (via Hadoop) to parallelise the job. That is, it breaks the job into pieces that can be run simultaneously and then puts the results back together again. That’s essentially what it’s doing.
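As a toy illustration of that divide-and-recombine idea (plain Python, nothing to do with Hadoop’s actual internals), you can hash-partition both tables by their join key so that each partition can be joined independently:

```python
# Split rows into n buckets by hashing the join key; rows with the same
# key always land in the same bucket, so buckets can be joined separately.
def partition(rows, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)
    return parts

def join_partition(facts, lookup):
    names = dict(lookup)
    return [(uid, item, names[uid]) for uid, item in facts if uid in names]

def parallel_join(facts, lookup, n=4):
    fact_parts = partition(facts, n)
    lookup_parts = partition(lookup, n)
    out = []
    for fp, lp in zip(fact_parts, lookup_parts):   # each pair could run on a separate machine
        out.extend(join_partition(fp, lp))
    return out
```

Hadoop handles the partitioning, shuffling and recombining for you; that is what you are paying for with the extra rigmarole.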

Fork it

You can download the code from here

Include uncertainty in a financial model


On-line resources for analysts are often either too general to be of practical use or too specialised to be accessible. The aim of ragscripts.com is to remedy this by providing start to finish directions for complex analytical tasks. The site is under construction at the moment but any feedback is most welcome!

The problem

You’ve been asked to calculate some figure or other (e.g. end of year revenue, average customer lifetime value) based on numbers supplied from various parts of the business. You know how to make the calculation but what bothers you is that some of the figures going in are clearly approximations or at worst complete guesses. Therefore the final calculation will inherit this uncertainty. A decision will be made based on your final figure and more likely than not the decision makers will treat this number as a certainty. They won’t have any way of knowing how uncertain it is and may well draw the wrong conclusions. How do you get an understanding of the uncertainty around the final calculation and how do you explain it to the decision makers?

The standard approach in finance is to produce a set of best and worst case scenarios. These are tedious to produce and are often disregarded as unlikely outcomes. I think our approach is better as it gives a full picture of the uncertainty around an estimate.

What you’ll need

  • Basic R programming skills (including writing functions)
  • An understanding of probability distributions
  • About 1-3 hours (depending on the complexity of the model)


Let’s say you are asked to project the revenue of a business in one year’s time based on the following numbers:

  • Customer base at the beginning of the year: 50,000
  • Average revenue per customer: £7
  • Monthly acquisition rate: 5%
  • Monthly churn rate: 2%

You might have thought that these figures could be known with some certainty but believe me it’s actually very rare, especially in large businesses where integrating systems to get an overall picture is a tough job. Add to this the fact that we are guessing at how well any current figures will hold up over the year and we should agree that there’s a fair bit of uncertainty in the inputs.

With this in mind you ask the suppliers of the data for a lower and upper bound for each figure, i.e. the lowest and highest numbers that they would expect to see without being very surprised. Where they are unable or unwilling to do this you use your common sense. This gives you the following table:

Statistic         Estimate   Lower Bound   Upper Bound
ARPC              £7         £4            £10
Acquisition Rate  5%         3%            7%
Churn Rate        2%         1%            3%

Do it

Step 1: Setting up the distributions

There are lots of tools you could do this with but R seems most appropriate, so start up a session. We are going to model our uncertainty about the input values in each case as a normal distribution. We will say that the mean of the distribution is the value the business has estimated and that the 5% and 95% quantiles are the lower and upper bounds respectively.


Yes this is subjective (we could have chosen other values for the bounds) but the point of this exercise is not to make incredibly accurate statistical inferences; rather it is to show the consequences that different levels of uncertainty in the inputs have on the output. You can play with these parameters if you wish and say things like “if we were this accurate with our inputs then we would be this accurate with our output”.

Let’s start by calculating the parameters (mean, standard deviation) for the three normal distributions we’ll need.

Let  q_{95}  be the 95th quantile. Then

 q_{95} = \mu + 1.645 \sigma , and therefore  \sigma = \frac{q_{95} - \mu}{1.645} .
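As a quick sanity check on that formula, here is the same arithmetic in Python (using the estimates and bounds from the table above; the variable names are mine):

```python
from statistics import NormalDist

# sigma = (q95 - mu) / 1.645 for each input, from the table of bounds
arpu_sd = (10 - 7) / 1.645
acq_sd = (0.07 - 0.05) / 1.645
ch_sd = (0.03 - 0.02) / 1.645

# check: the 95th percentile of N(7, arpu_sd) should land back on the upper bound of 10
q95 = NormalDist(7, arpu_sd).inv_cdf(0.95)
```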

#Work out the standard deviation for arpu
arpu.sd<-(10-7)/1.645
#Plot the resulting distribution
x<-seq(0, 15, by=0.5)
d<-dnorm(x, 7, arpu.sd)
plot(x, d, type='l')

This should give a plot that meets our expectations and it does:


We’ll complete this step by doing the same for acquisition and churn.

#Do the same for acquisition and churn
acq.sd<-(0.07-0.05)/1.645
x<-seq(0, 0.2, by=0.001)
d<-dnorm(x, 0.05, acq.sd)
plot(x, d, type='l')
ch.sd<-(0.03-0.02)/1.645
x<-seq(0, 0.2, by=0.001)
d<-dnorm(x, 0.02, ch.sd)
plot(x, d, type='l')

Step 2: A function for calculating revenue

The next piece you need is a function for calculating the output (the statistic of interest) from the inputs. This is what would usually be captured in the formulae of an Excel spreadsheet.

For our example it looks like this:

revenue<-function(arpu, acq, ch){
  #Start from the customer base at the beginning of the year
  customers<-50000
  for (m in 1:12){
    customers<-customers*(1+acq-ch)
  }
  customers*arpu
}
Step 3: Making random draws from our belief distributions

We will now generate 10 thousand values from each of our normally distributed input variables to create 10 thousand simulations. Each of these simulations represents a possible set of inputs to our model. For example one set might be

  • Average revenue per customer: £7.50
  • Monthly acquisition rate: 3%
  • Monthly churn rate: 1%

However, the way we’ve set things up, unlikely values for each input are unlikely to be drawn from our distributions. In fact, since 10 thousand is a pretty large number, the frequency with which each of the possible values is drawn should fairly well reflect our initial beliefs.

Here is our code for simulating the input values:

#Now let's simulate 10k values from each of our distributions
sim.arpu<-rnorm(10000, 7, arpu.sd)
sim.acq<-rnorm(10000, 0.05, acq.sd)
sim.ch<-rnorm(10000, 0.02, ch.sd)

Note a couple of things we haven’t accounted for: we’ve assumed that our beliefs about the input variables are independent of one another, but that might not be the case. We might think for some reason that if churn is higher then acquisition will be lower. If so we want to draw our values from a multivariate distribution incorporating correlation between the variables. However assuming independence is an easy and useful place to start. Also we’ve assumed that our model itself is correct – i.e. that constant rates of churn and acquisition apply and that there are no other effects (for example seasonal or economic).
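For the curious, here is one way correlated draws could be sketched (in Python, using the standard trick of mixing two independent normals; the correlation of -0.5 is purely illustrative, not a figure from the business):

```python
import random

random.seed(1)
rho = -0.5                                # illustrative: higher churn goes with lower acquisition
acq_mu, acq_sd = 0.05, 0.02 / 1.645
ch_mu, ch_sd = 0.02, 0.01 / 1.645

sim_acq, sim_ch = [], []
for _ in range(10000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    sim_acq.append(acq_mu + acq_sd * z1)
    # mix in z1 with weight rho so the two draws are correlated
    sim_ch.append(ch_mu + ch_sd * (rho * z1 + (1 - rho**2) ** 0.5 * z2))
```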

Step 4: Running them through the function

Now everything is in place and it’s a simple matter to apply our function to each of the 10 thousand simulated sets of input variables. We use R’s mapply function.

sim.rev<-mapply(revenue, sim.arpu, sim.acq, sim.ch)
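For readers without R to hand, the whole simulation can be sketched in a few lines of Python (same model as above: 50,000 starting customers, twelve months of compound net growth, end-of-year revenue = customers × ARPC):

```python
import random

random.seed(42)

def revenue(arpu, acq, ch, base=50000):
    customers = base
    for _ in range(12):                   # twelve months of compound net growth
        customers *= 1 + acq - ch
    return customers * arpu

# straight calculation with the original inputs: roughly 499k
point = revenue(7, 0.05, 0.02)

# 10,000 simulated input sets drawn from the belief distributions
sims = [revenue(random.gauss(7, 3 / 1.645),
                random.gauss(0.05, 0.02 / 1.645),
                random.gauss(0.02, 0.01 / 1.645))
        for _ in range(10000)]
```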
Step 5: Examining the results

Since our simulated input variables are distributed in a way that reflects our beliefs about their likely values, our final revenue calculations, based on these simulated inputs, will also have a distribution that reflects this uncertainty. A quick hist(sim.rev) shows us this distribution.


The histogram of the possible values for annual revenue makes the point of doing all of this quite clear. A straight calculation based on the original input values gives a revenue value of 499k. However a quick glance at the histogram shows that revenues as low as 400k are not at all improbable. Any planning should account for this!


See it

Here’s a quick JSFiddle to see how the inputs affect the outputs in the example case.

Explain it

So how do you explain what you have done and why you have done it?

Chances are you may run into the analyst’s catch 22 (see skit). You’ve done a much better job because you’ve incorporated uncertainty into your forecast, allowing the decision maker to make a more informed decision. However people like certainty, and the very fact you have done this may be perceived as incompetence: “What do you mean you don’t know what the revenue figure is going to be – go back and do it again!”

So either you compromise your own integrity by glossing over the uncertainty or you’re branded an idiot or at least a boffin who overcomplicates things!

But there are some ways around this. Try to find out what decisions are being made using the data. Some of the frustration will come from being handed something that doesn’t solve the problem and the decision maker may have a point: if the range of accuracy is not enough to make a decision the model may need to be improved in some way.

Also talking about the decision and the problem from the outset is a good way of driving home the value of what you are doing. The very purpose of showing the uncertainty in the estimate is to prevent misinformed decisions being made.

The decision in our example might be whether or not we can invest in a new product now given that we still need to be in profit by the end of the year. Our model as it stands shows that if all other outgoings come to 150k then we’d be taking a risk spending over 50k on the new product (since there’s a reasonable chance that revenue could be as low as 200k). If 50k was what we were planning then we’ve done the job. If however a new product can’t be done for less than 100k then it’s worth spending some time improving our model.

Fork it

Take my code and adapt it for your own models. You can find the gist here
