Concept map for Spark and Hadoop

Here is a concept map I use in some of our Spark workshops. I find these diagrams very useful when a topic is particularly crowded with different tools, techniques and ideas. It gives a zoomed out view which you can refer back to when you start to get lost.

To read the diagram pick a concept, read off the description underneath and then continue the sentence using one of the arrows. So for example “EMR is a web-based service that allows you to efficiently process large data sets by … running on a cluster of computers built with … EC2”

Click into the image to get a zoomable version else you won’t be able to read the text!


From Redshift to Hadoop and back again

If you are using AWS Redshift as your data warehouse and you have a data processing or analytical job that would benefit from a bit of Hadoop, then it’s fairly straightforward to get your data into EMR and then back into Redshift. It’s just a matter of using the COPY and UNLOAD commands to read from and write to an S3 bucket.

Here’s a simple example that might be helpful as a template. Read more
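As a rough illustration of the round trip, here is a minimal Python sketch that just builds the two Redshift statements as strings. The bucket, table names and credentials are made-up placeholders, the helper names are my own, and the query is assumed to contain no single quotes. You would run the resulting SQL through whatever Redshift client you normally use.

```python
# Hypothetical helpers: they only build SQL text, they do not connect to anything.
# Bucket names, table names and credentials are placeholders.

def unload_sql(query, s3_prefix, credentials, delimiter=","):
    """Statement that writes the result of `query` out of Redshift to S3.

    Assumes `query` contains no single quotes (they would need escaping).
    """
    return "UNLOAD ('{q}') TO '{p}' CREDENTIALS '{c}' DELIMITER '{d}';".format(
        q=query, p=s3_prefix, c=credentials, d=delimiter
    )


def copy_sql(table, s3_prefix, credentials, delimiter=","):
    """Statement that loads delimited files under `s3_prefix` back into `table`."""
    return "COPY {t} FROM '{p}' CREDENTIALS '{c}' DELIMITER '{d}';".format(
        t=table, p=s3_prefix, c=credentials, d=delimiter
    )


if __name__ == "__main__":
    creds = "aws_access_key_id=YOUR_KEY_ID;aws_secret_access_key=YOUR_SECRET"
    # Step 1: Redshift -> S3, ready for EMR to read
    print(unload_sql("select * from sales", "s3://my-bucket/unload/sales_", creds))
    # Step 2: S3 -> Redshift, once the Hadoop job has written its output
    print(copy_sql("sales_scored", "s3://my-bucket/emr-output/", creds))
```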

Beautiful Hue

Hue is a godsend when you want to do something on Hadoop (run a Hive query, a Pig script, Spark etc.) and you are not feeling very command line.

As with all things on AWS there seem to be many routes in, but here’s my recipe for getting it up and running. (I’m assuming you have an AWS account. If not, get set up by following the instructions here.)

I use the AWS command line interface (easy instructions for install are here) to get a cluster with Hue on it up and running.

At the command line, launch a cluster:

aws emr create-cluster --name "My Cluster Name" --ami-version 3.3 --log-uri s3://my.bucket/logs/  --applications Name=Hive Name=Hue Name=Pig --ec2-attributes KeyName=mykeypair --instance-type m3.xlarge --instance-count 3
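If you launch clusters often it can help to keep the options in a script. The sketch below only assembles the same command as an argument list so you can tweak one value at a time. The function name is my own, and the defaults simply mirror the command above.

```python
import shlex


def create_cluster_argv(name, log_uri, key_name,
                        applications=("Hive", "Hue", "Pig"),
                        ami_version="3.3",
                        instance_type="m3.xlarge",
                        instance_count=3):
    """Build the argument list for `aws emr create-cluster` (hypothetical helper)."""
    argv = ["aws", "emr", "create-cluster",
            "--name", name,
            "--ami-version", ami_version,
            "--log-uri", log_uri,
            "--applications"]
    argv += ["Name={}".format(app) for app in applications]
    argv += ["--ec2-attributes", "KeyName={}".format(key_name),
             "--instance-type", instance_type,
             "--instance-count", str(instance_count)]
    return argv


if __name__ == "__main__":
    argv = create_cluster_argv("My Cluster Name", "s3://my.bucket/logs/", "mykeypair")
    # Print the command for inspection; to actually launch the cluster you
    # could pass `argv` to subprocess.check_call instead.
    print(" ".join(shlex.quote(arg) for arg in argv))
```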

Next log in to the AWS console and go to EMR > Click on your cluster > view cluster details > enable web connection and follow the instructions there.

Note the instructions for configuring the proxy management tool are incomplete. Go here for the complete instructions.

You should then, as it says, be able to put master-public-dns-name:8888 into your browser and log in to Hue. Don’t be a fool like me and forget to actually substitute in your master-public-dns-name, which can be found on your cluster details!

Here’s a video tutorial I created that shows how to get Hue up and running on AWS

Scoring a Neural Net using R on AWS

nnet scoring plot

One of the drawbacks with R has been its limitation with big datasets. It stores everything in RAM so once you have more than 100K records your PC really starts to slow down. However, since AWS allows you to use any size of machine, you could now consider using R for scoring out your models on larger datasets. Just fire up a meaty EC2 instance with the RStudio Amazon Machine Image (AMI) and off you go.

With this in mind I wondered how long it would take to score up a Neural Net depending on how many variables were involved and how many records you need to score out. There was only one way to find out.

Read more

A new home for pifreak

pifreak is my twitterbot. It started tweeting the digits of pi in April 2012 and has tweeted the next 140 digits at 3:14 pm GMT every day since. Not especially useful or popular (only 48 followers) but I’ve grown fond of she/he/it.
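For the curious, here is a minimal sketch of how a day’s chunk of digits could be produced. It is not pifreak’s actual code: it uses Machin’s formula with integer arithmetic, and the function names and the idea of indexing chunks by day number are my own assumptions for illustration.

```python
def arctan_inv(x, unity):
    """arctan(1/x) scaled up by `unity`, summed with integer arithmetic only."""
    power = unity // x           # unity / x**(2k+1), starting at k = 0
    total = power
    x_squared = x * x
    divisor = 3
    sign = -1
    while power:
        power //= x_squared
        total += sign * (power // divisor)
        sign = -sign
        divisor += 2
    return total


def pi_digit_string(n_digits):
    """First n_digits digits of pi as a string, starting with the leading 3."""
    guard = 10                   # extra digits to absorb rounding error
    unity = 10 ** (n_digits - 1 + guard)
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    pi_scaled = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return str(pi_scaled // 10 ** guard)


def chunk_for_day(day, size=140):
    """Digits for day 0, 1, 2, ... in consecutive blocks of `size`."""
    digits = pi_digit_string((day + 1) * size)
    return digits[day * size:]
```

Recomputing from the start every day is wasteful but keeps the bot stateless, which suits a scheduled task that only runs once a day.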

I was housing her on an AWS EC2 micro instance, but my one year of free hire ran out and it has become a little too expensive to keep that box running.

So I’ve been looking at alternatives. I’ve settled on the Google App Engine (GAE), which I’m hoping is going to come out as pretty close to free hosting.

So here are a few notes for anyone else who might be thinking of using the Google App Engine for automated posting on Twitter.

It was reasonably simple to set up

  1. Download the GAE python SDK. This provides a GUI for both testing your code locally and then deploying it to the cloud when you are happy with it.
  2. Create a new folder for your app and within that place your python modules together with an app.yaml file and a cron.yaml file which will configure the application and schedule your task respectively. It’s all very well documented here and for the cron scheduling here.
  3. Open the App Engine Launcher (which is effectively the SDK), add your folder, then either hit run to test locally or deploy to push to the cloud (you’ll be taken to some forms to register your app if you’ve not already done so)
  4. Finally if you click on dashboard from the launcher you’ll get lots of useful information about your web deployed app including error logs and the schedule for your tasks.

The things that caught me out were:

  1. Make sure that the application name in your app.yaml file is the same as the one you register with Google (when it takes you through to the form the first time you deploy.)
  2. There wasn’t a lot in the documentation about the use of the url field in both the cron and app yaml files. I ended up just putting a forward slash in both since in my very simple app the python module is in the root.
  3. Don’t forget module names are case sensitive so when you add your python module in the script section of the app file you’ll need to get this right.
  4. Yaml files follow an indentation protocol that is similar to python. You’ll need to ensure it’s all lined up correctly.
  5. Any third party libraries you need that are not included in this list will need to be included in your app folder. For example I had to include tweepy and some of its dependencies
  6. Where the third party library that you need is included in the GAE runtime environment you need to add it to the app file using the following syntax

    libraries:
    - name: ssl
      version: "latest"
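To make points 1, 2, 4 and 6 concrete, here is a small Python sketch that renders a matching pair of app.yaml and cron.yaml files as text. The field names are as I recall them from the Python 2.7 runtime, so check them against the documentation. The app id, module name and schedule are placeholders, not the real pifreak settings.

```python
# Templates for the two config files. Indentation matters in YAML, so the
# nesting below is deliberate. All values are filled in by render_configs.
APP_YAML = """\
application: {app_id}
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: {url}
  script: {module}.app

libraries:
- name: ssl
  version: "latest"
"""

CRON_YAML = """\
cron:
- description: {description}
  url: {url}
  schedule: {schedule}
"""


def render_configs(app_id, module, url="/",
                   schedule="every day 15:14",
                   description="daily tweet"):
    """Return (app_yaml_text, cron_yaml_text) sharing the same url.

    `app_id` must match the name registered with Google, and `module`
    must match the Python module's file name, including case.
    """
    app_yaml = APP_YAML.format(app_id=app_id, url=url, module=module)
    cron_yaml = CRON_YAML.format(description=description, url=url,
                                 schedule=schedule)
    return app_yaml, cron_yaml


if __name__ == "__main__":
    app_text, cron_text = render_configs("my-app-id", "main")
    print(app_text)
    print(cron_text)
```

Generating both files from one place is just a way of guaranteeing that the url in the cron entry is one the app actually handles.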

And here finally is a link to the code.

Quick start Hadoop and Hive for analysts

The problem

You have a huge data set to crunch in a small amount of time. You’ve heard about the advantages of map reduce for processing data in parallel and you’d quite like to have a go with Hadoop to see what all the fuss is about. You know SQL and you want something that runs something SQL-like on top of Hadoop.

What you’ll need

  • An Amazon Web Services account (see this tutorial)
  • A firefox browser
  • An understanding of SQL


We will keep it simple by just joining two tables. Let’s take some recommendation data from my Mahout post on drunks and lampposts. These aren’t particularly large data sets as we’re after something quick for a demo. You can download the data sets from here.

We would like to join RECOMMENDER_SET_NUM to PERSON_IDS_FULLNAME by ID (we are just looking up a person’s name).

Note that Hive (the data warehouse platform that sits on top of Hadoop, allowing you to run a kind of SQL) doesn’t seem to like column headers. Fortunately our data does not have them. If you do need to remove column headers from lots of files I’ve written some very simple Python to do the job and you can find it here.
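In case the link is not to hand, here is a minimal sketch of the same idea (not necessarily the linked script). It assumes every matching file has exactly one header line and is small enough to read into memory. The function names and the *.csv pattern are my own choices.

```python
import glob
import os


def strip_header(path):
    """Rewrite the file at `path` without its first line."""
    with open(path) as handle:
        lines = handle.readlines()
    with open(path, "w") as handle:
        handle.writelines(lines[1:])


def strip_headers(folder, pattern="*.csv"):
    """Strip the first line from every file in `folder` matching `pattern`.

    Returns the list of files that were rewritten. Run it once only:
    a second run would remove the first data row.
    """
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    for path in paths:
        strip_header(path)
    return paths
```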

Do it

This post uses screen scripts to give instructions about what to click on, input etc. Here’s an explanation of the notation.

Step 1: Creating the S3 buckets

First we create a new S3 bucket. Log onto the console and then

@ AWS Console > S3 > create Bucket
~ name: crunchdata [Note name must be lower case]

As of 21 Feb 2014

Step 2: Setting up the keys

To get your AWS access key and your secret access key

@ AWS Console > {your name} > my account > security credentials
As of 21 Feb 2014

Step 3: Open Mozilla Firefox, download S3fox and open it up

@ > download
@ Firefox > tools > S3 organiser > manage accounts
~ account name: {you choose}
~ access key {your access key from step 2}
~ secret key: {your secret key from step 2}

save > close
As of 21 Feb 2014

Step 4: Upload your data to S3

In the righthand panel

right click > create directory
~ folder name: input

Repeat to create folders for output, scripts and logs and then create two folders in the input folder: recommendersetnum and personsidsfullname.

Finally drag your data from your local directory in the left panel over to the corresponding folder in the input folder.

As of 21 Feb 2014

Step 5: Write your HiveQL

Open up a text editor and write your hive script.

-- Drop the tables in case you need to rerun the script
drop table if exists recommender_set_num;
drop table if exists person_ids_full_names;
drop table if exists recom_names;
-- Set up a table to load the recommendations data into
create external table if not exists recommender_set_num (
userID bigint,
itemID bigint
) row format delimited fields terminated by ','
stored as textfile
location 's3n://crunchdata/input/recommendersetnum';
-- Set up a table to load the names look up into
create external table if not exists person_ids_full_names (
userID bigint,
nameKey string,
displayName string
) row format delimited fields terminated by ','
stored as textfile
location 's3n://crunchdata/input/personsidsfullname';
-- Set up a table to add the joined data to
create external table if not exists recom_names (
userID bigint,
itemID bigint,
nameKey string,
displayName string
) row format delimited fields terminated by ','
stored as textfile
location 's3n://crunchdata/output/recomnames.csv';
-- Join the tables
insert overwrite table recom_names
select A.userID,
A.itemID,
B.nameKey,
B.displayName
from recommender_set_num A join
person_ids_full_names B
on A.userID = B.userID;

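If you are happy with SQL this should look fairly intelligible to you. The main difference is that each table is associated with a location on your S3 bucket. Hive treats every location as a folder: it reads all the files in the input folders and writes the joined rows as one or more files inside the output location (which here is a folder named recomnames.csv).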

Use the firefox plugin to upload your hive script to the scripts folder in the crunchdata S3 bucket

As of 21 Feb 2014

Step 6: Start up an elastic map reduce cluster and run the job.

@ AWSConsole > Services > Elastic Map Reduce

create cluster
~ cluster name: {you choose}
~ log folder S3 location: s3n://crunchdata/logs/
~ add step: hive program
configure and add
~ script s3 location: s3n://crunchdata/scripts/{your hiveql file}
~ input s3 location: s3n://crunchdata/input/
~ output s3 location: s3n://crunchdata/output/
add > create cluster

To monitor the progress of your job > cluster list. It should take about 5 minutes.

As of 21 Feb 2014

Step 7: Collect the results

When the job is complete go back to your Firefox plugin and download the text file that has been placed in your output folder.

If all has gone well it should be a csv file containing the joined tables.

See it

I’ve created this concept diagram to help make clear the relationship between the tools used in this task.


Explain it

Chances are you’ll be asked to explain why this couldn’t be done on a standard database without the additional rigmarole of transferring the data and learning a variant of SQL. That’s a fair objection if the dataset isn’t too big or you’ve got as long as you want to wait for the process to run through. But if it’s no to either of these then you’ll want to take advantage of Hive’s ability (via Hadoop) to parallelise the job. That is, break it into pieces that can be run simultaneously and then put the results back together again. That’s essentially what it’s doing.
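If it helps the explanation, here is a toy Python sketch of that idea applied to our join. It is only an illustration of the shape of the work, not what Hive actually runs: rows are split into pieces by userID, each piece is joined on its own, and the results are stuck back together. The function names are my own, and threads are used purely to show the pieces being handed out independently.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts lists by their first field (an integer userID)."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[row[0] % n_parts].append(row)
    return parts


def join_partition(recs, names):
    """Inner join of (userID, itemID) rows to (userID, nameKey, displayName) rows."""
    lookup = {name[0]: name[1:] for name in names}
    return [rec + lookup[rec[0]] for rec in recs if rec[0] in lookup]


def parallel_join(recs, names, n_parts=4):
    """Join each pair of matching pieces separately, then combine the results."""
    rec_parts = partition(recs, n_parts)
    name_parts = partition(names, n_parts)   # same key, so matches share a piece
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(join_partition, rec_parts, name_parts))
    return [row for part in results for row in part]


if __name__ == "__main__":
    recs = [(1, 10), (2, 20), (5, 50), (7, 70)]
    names = [(1, "ann", "Ann A"), (2, "bob", "Bob B"), (5, "cat", "Cat C")]
    for row in sorted(parallel_join(recs, names, n_parts=3)):
        print(row)
```

Because both tables are split on the same key, no piece ever needs rows from another piece, which is what lets the pieces run on different machines.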

Fork it

You can download the code from here

Deploying your Mahout application as a webapp on OpenShift

OpenShift (a cloud computing platform from Red Hat) does not at present support Hadoop, so this is not a route to go down if you have the kind of data that requires map reduce. However it’s not a bad option if you’re just playing with Mahout (see the previous post) and would like to share what you have done in the form of a web app. It’s also free unless you require anything fancy.

Here are some notes I made on how to do it. They assume you are developing your application in Eclipse.

  1. Set up an openshift account
  2. Follow the steps in this post. It shows you how to push your repository to OpenShift through the Eclipse IDE.
  3. Because I’m a novice in web app creation I borrowed the web.xml and servlet code from this post on Stack Overflow and customised it to call my Mahout function. You can see what I mean here. Note the servlet code is placed in the src/main/java folder and the web.xml file is placed in WEB-INF.
  4. When compiling the project in eclipse use run Maven install. This generates the WAR file. I found this post invaluable for understanding what is going on.
  5. I found I then had to copy the WAR file over from the target folder to the deployments folder.
  6. Switch to the git aspect in Eclipse and commit and push to openshift.

You should then be able to access your web app at the URL provided by OpenShift and use the usual HTTP requests to interact with it. Again see the previous post for an example.

Two Quick Recipes: Ubuntu and Hadoop

There are so many flavours of everything and things are changing so quickly that I find every task researched online ends up being a set of instructions stitched together from several blogs and forums. Here are a couple of recent ones.

Ubuntu on AWS (50 mins)

I was going to buy a new laptop but it made more sense to set up a Linux instance on AWS and remote in (a quarter of the price and more interesting). Here’s my recipe:

    1. As in Mark’s earlier post set yourself up with an AWS account and a key pair by following this tutorial
    2. Launch an Ubuntu instance using the EC2 management console and select memory and processing power to suit.
    3. Start up the instance then connect to it by using Mindterm (very useful alternative to SSHing in with PuTTY). To do this just select the instance in the terminal. Select Actions and then Connect. (You’ll need to provide the path to your saved key)
    4. Now you probably want to remote into your machine. Do this by setting up NoMachineNX following steps 2 to 4 in the following post
    5. However when you execute the last line of step 2 you’ll find that nxsetup is not found. To fix this switch to this post and follow steps 6-7 (life’s so complicated)
    6. Change password authentication to yes in /etc/ssh/sshd_config
    7. Add the GNOME fallback session:

sudo apt-get install gnome-session-fallback

    8. Restart the instance and log in
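For the sshd_config change, the setting in question is the PasswordAuthentication directive. If you would rather script the edit than open the file by hand, something along these lines should work. It is a sketch that operates on the file’s text only, and the function name is my own. You would still need root to write the result back to /etc/ssh/sshd_config.

```python
import re

# Matches an active or commented-out "PasswordAuthentication <value>" line.
_PASSWORD_AUTH = re.compile(
    r"^[ \t]*#?[ \t]*PasswordAuthentication[ \t]+\S+[ \t]*$", re.MULTILINE
)


def enable_password_auth(config_text):
    """Return sshd_config text with PasswordAuthentication set to yes."""
    if _PASSWORD_AUTH.search(config_text):
        return _PASSWORD_AUTH.sub("PasswordAuthentication yes", config_text)
    # No such line at all: append one.
    return config_text.rstrip("\n") + "\nPasswordAuthentication yes\n"


if __name__ == "__main__":
    sample = "Port 22\nPasswordAuthentication no\nUsePAM yes\n"
    print(enable_password_auth(sample))
```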

Just remember to keep an eye on the charges!

Single Cluster Hadoop on Ubuntu (20 mins)

Of course you can run Hadoop directly on Amazon’s EMR platform but if you want to get more of a feel for how it works in a familiar environment you can set it up on a single instance.

  1. Follow the instructions in this post substituting in the latest hadoop stable release
  2. Install the JDK: sudo apt-get install openjdk-7-jdk
  3. Set the JAVA_HOME path variable, substituting in the path to your own JDK install: export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
  4. From the Hadoop quick start guide follow the instructions in the “Prepare to start the Hadoop Cluster” and “Stand Alone Operations” sections. If this all works you should be ready to go.
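A wrong JAVA_HOME is the most likely thing to trip you up here, so a quick sanity check before starting Hadoop can save some head scratching. This is a small sketch of such a check. The function name is my own, and it only looks for a bin/java file under the given path.

```python
import os


def java_home_problem(java_home):
    """Return a description of what is wrong with java_home, or None if it looks fine."""
    if not java_home:
        return "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return "JAVA_HOME is not a directory: {}".format(java_home)
    if not os.path.isfile(os.path.join(java_home, "bin", "java")):
        return "no bin/java found under {}".format(java_home)
    return None


if __name__ == "__main__":
    problem = java_home_problem(os.environ.get("JAVA_HOME"))
    print(problem or "JAVA_HOME looks OK")
```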


Machine Learning and Analytics based in London, UK