January 13, 2015 Simon Raper

From Redshift to Hadoop and back again


If you are using AWS Redshift as your data warehouse and you have a data processing or analytical job that would benefit from a bit of Hadoop, then it’s fairly straightforward to get your data into EMR and then back into Redshift. It’s just a matter of using the copy and unload commands to read from and write to an S3 bucket.

Here’s a simple example that might be helpful as a template.

  1. First I set up a test database in Redshift. To do this I follow the AWS Redshift Getting Started guide, downloading and installing SQL Workbench/J and the JDBC driver so that I can connect to the database and write some SQL.
  2. Next I follow this set of instructions to launch a Redshift cluster and connect to it.
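     For the connection itself, SQL Workbench/J just needs a JDBC URL built from the cluster endpoint, the port (5439 by default) and the database name. With the Redshift JDBC driver it looks something like this (the endpoint below is a made-up example):

     jdbc:redshift://examplecluster.abc123xyz789.eu-west-1.redshift.amazonaws.com:5439/dev
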
  3. The next step is to create some tables. Executing the following in SQL Workbench/J will do the trick. (I’ve borrowed some data from one of the AWS examples.)

     

    create table users(
    	userid integer not null distkey sortkey,
    	username char(8),
    	firstname varchar(30),
    	lastname varchar(30),
    	city varchar(30),
    	state char(2),
    	email varchar(100),
    	phone char(14),
    	likesports boolean,
    	liketheatre boolean,
    	likeconcerts boolean,
    	likejazz boolean,
    	likeclassical boolean,
    	likeopera boolean,
    	likerock boolean,
    	likevegas boolean,
    	likebroadway boolean,
    	likemusicals boolean);
    create table sales(
    	salesid integer not null,
    	listid integer not null distkey,
    	sellerid integer not null,
    	buyerid integer not null,
    	eventid integer not null,
    	dateid smallint not null sortkey,
    	qtysold smallint not null,
    	pricepaid decimal(8,2),
    	commission decimal(8,2),
    	saletime timestamp);
    copy users from 's3://awssampledbeuwest1/tickit/allusers_pipe.txt'
    credentials 'aws_access_key_id=;aws_secret_access_key=';
    copy sales from 's3://awssampledbeuwest1/tickit/sales_tab.txt'
    credentials 'aws_access_key_id=;aws_secret_access_key='
    delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS';
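
    A quick sanity check that both loads worked:

    select count(*) from users;
    select count(*) from sales;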
    
  4. So now that we’ve set things up in Redshift, we are ready to walk through an example of using it in conjunction with EMR. First dump the data into an S3 bucket with unload.
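
    The unload below reads from a user_price table that isn’t one of the tables created in step 3; it is just users joined to sales with pricepaid carried along. A minimal sketch (the join key and exact column list are assumed here, inferred from the Hive table in step 5):

    create table user_price as
    select u.userid, u.username, u.firstname, u.lastname, u.city, u.state,
           u.email, u.phone, u.likesports, u.liketheatre, u.likeconcerts,
           u.likejazz, u.likeclassical, u.likeopera, u.likerock, u.likevegas,
           u.likebroadway, u.likemusicals, s.pricepaid
    from users u
    join sales s on s.buyerid = u.userid;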

     

    unload ('select * from user_price')
    to 's3://redshiftemrdemo/unload/' credentials
    'aws_access_key_id=;aws_secret_access_key=';
    

    The nice thing about this is that unload defaults to a pipe delimiter in the output files, which makes reading them into Hive a lot easier (CSV can be a little fiddly).

  5. Then we run some kind of Hive job. I’m going to do a simple group by, but of course it could be much more involved. I’ll use this Hive script:

     

    create external table if not exists user_price
           (userid bigint,
    	username string,
    	firstname string,
    	lastname string,
    	city string,
    	state string,
    	email string,
    	phone string,
    	likesports string,
    	liketheatre string,
    	likeconcerts string,
    	likejazz string,
    	likeclassical string,
    	likeopera string,
    	likerock string,
    	likevegas string,
    	likebroadway string,
    	likemusicals string,
            pricepaid float
    ) row format delimited fields terminated by '|'
    stored as textfile
    location 's3://redshiftemrdemo/unload';
    create external table if not exists user_spend
           (userid bigint,
            spend float
    ) row format delimited fields terminated by '|'
    stored as textfile
    location 's3://redshiftemrdemo/output/user_spend.csv';
    insert overwrite table user_spend
    select userid,
           sum(pricepaid)
    from user_price
    group by userid;
    

    and execute it from the AWS CLI:

    aws emr create-cluster --name "Redshift and EMR" --ami-version 3.3 --log-uri s3://redshiftemrdemo/logs/ --applications Name=Hive --ec2-attributes KeyName=examplejobs --instance-type m3.xlarge --instance-count 3 --steps Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,Args=[-f,s3n://redshiftemrdemo/scripts/redshiftemr.q]
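
    create-cluster prints a cluster id, and if you want to keep an eye on the job you can poll it with something like the following (the cluster id here is a placeholder):

    aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX
    aws emr list-steps --cluster-id j-XXXXXXXXXXXXX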
    
  6. Finally let’s load the results back into Redshift:
    create table user_spend(
    	userid integer not null,
    	spend decimal(12,2));
    copy user_spend from 's3://redshiftemrdemo/output/user_spend.csv'
    credentials 'aws_access_key_id=;aws_secret_access_key=';
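
    A quick check that the round trip worked:

    select * from user_spend order by spend desc limit 10;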
    

And that’s it. Of course, in reality you’d be doing this on a much bigger data set, otherwise it wouldn’t be worth your while, but this at least shows you how it works.

About the Author

Simon Raper: I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia, an analytics startup that uses agile methods to bring machine learning and other cutting-edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, Spark, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts, a blog on computational statistics, machine learning, data visualisation, R, Python and cloud computing. It has had over 310K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.
