April 12, 2014 Simon Raper

Quick start Hadoop and Hive for analysts

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

The problem

You have a huge data set to crunch in a small amount of time. You’ve heard about the advantages of map reduce for processing data in parallel and you’d quite like to have a go with Hadoop to see what all the fuss is about. You know SQL and you want something that runs something SQL-like on top of Hadoop.

What you’ll need

  • An Amazon Web Services account (see this tutorial)
  • A firefox browser
  • An understanding of SQL


We will keep it simple by just joining two tables. Let’s take some recommendation data from my Mahout post on drunks and lampposts. These aren’t particularly large data sets as we’re after something quick for a demo. You can download the data sets from here.

We would like to join RECOMMENDER_SET_NUM to PERSON_IDS_FULLNAME by ID (we are just looking up a person’s name).

Note HIVE (the data warehouse platform that sits on top of Hadoop allowing you to run a kind of SQL) doesn’t seem to like column headers. Fortunately our data does not have them. If you do need to remove column headers from lots of files I’ve written some very simple python to do the job and you can find it here.

Do it

This post uses screen scripts to give instructions about what to click on, input etc. Here’s an explanation of the notation.

Step 1: Creating the S3 buckets

First we create a new S3 bucket. Log onto the console and then

@ AWS Console > S3 > create Bucket
~ name: crunchdata [Note name must be lower case]

As of 21 Feb 2014

Step 2: Setting up the keys

To get your AWS access key and your secret access key

@ AWS Console > {your name} > my account > security credentials
As of 21 Feb 2014

Step 3: Open Mozilla Firefox, download S3fox and open it up

@ http://www.s3fox.net/ > download
@ Firefox > tools > S3 organiser > manage accounts
~ account name: {you choose}
~ access key {your access key from step 2}
~ secret key: {your secret key from step 2}

save > close
As of 21 Feb 2014

Step 4: Upload your data to S3

In the righthand panel

right click > create directory
~ folder name: input

Repeat to create folders for output, scripts and logs and then create two folders in the input folder: recommendersetnum and personsidsfullname.

Finally pull your data from the your local directory in the left panel over to the corresponding folder in the input folder.

As of 21 Feb 2014

Step 5: Write your HiveQL

Open up a text editor and write your hive script.

drop table if exists recommender_set_num; --In case you need to rerun the script
drop table if exists person_ids_full_names;
drop table if exists recom_names;
-- Set up a table to load the recommendations data into
create external table if not exists recommender_set_num
userID bigint,
itemID bigint
) row format delimited fields terminated by ','
stored as textfile
location 's3n://crunchdata/input/recommendersetnum';
-- Set up a table to load the names look up into
create external table if not exists person_ids_full_names
userID bigint,
nameKey string,
displayName string
) row format delimited fields terminated by ','
stored as textfile
location 's3n://crunchdata/input/personsidsfullname';
-- Set up a table to add the joined data to
create external table if not exists recom_names
userID bigint,
itemID bigint,
nameKey string,
displayName string
) row format delimited fields terminated by ','
stored as textfile
location 's3n://crunchdata/output/recomnames.csv';
-- Join the tables
insert overwrite table recom_names
select A.userID,
from recommender_set_num A join
person_ids_full_names B
on A.userID = B.userID;

If you are happy with SQL this should look fairly intelligible to you. The main difference is the association of each of the tables with a folder (in the case of the inputs) or a file (in the case of the output) on your S3 bucket.

Use the firefox plugin to upload your hive script to the scripts folder in the crunchdata S3 bucket

As of 21 Feb 2014

Step 6: Start up an elastic map reduce cluster and run the job.

@ AWSConsole > Services > Elastic Map Reduce

create cluster
~ cluster name: {you choose}
~ log folder S3 location: s3n://crunchdata/logs/
~ add step: hive program
configure and add
~ script s3 location: s3n://crunchdata/scripts/{your hiveql file}
~ input s3 location: s3n://crunchdata/input/
~ output s3 location: s3n://crunchdata/output/
add > create cluster

To monitor the progress of your job > cluster list. It should take about 5 minutes.

As of 21 Feb 2014

Step 7: Collect the results

When the job is complete go back to your firefox plug in and download the text file that has been placed in your output file.

If all has gone well it should be a csv file containing the joined tables.

See it

I’ve created this concept diagram to help make clear the relationship between the tools used in this task.


Explain it

Chances are you’ll be asked to explain why this couldn’t be done on a standard database without the additional rigmarole of transferring the data and learning a variant of SQL. That’s a fair objection if the dataset isn’t too big or you’ve got as long as you want to wait for the process to run through. But if it’s no to either of these then you’ll want to take advantage of Hive’s ability (via Hadoop) to parallelise the job. That is, break it into pieces that can be run simultaneously and then put the results back together again. That’s essentially what it’s doing.

Fork it

You can download the code from here

Tagged: ,

About the Author

Simon Raper I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia an analytics startup that uses agile methods to bring machine learning and other cutting edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, spark, Hadoop), interactive visualisatons (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts - a blog on computational statistics, machine learning, data visualisation, R, python and cloud computing. It has had over 310 K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

Leave a Reply

Your email address will not be published. Required fields are marked *

Machine Learning and Analytics based in London, UK