If you are using AWS Redshift as your data warehouse and you have a data processing or analytical job that would benefit from a bit of Hadoop, it's fairly straightforward to get your data into EMR and then back into Redshift. It's just a matter of using the unload and copy commands to write to and read from an S3 bucket.
Here’s a simple example that might be helpful as a template.
- First I set up a test database in Redshift. To do this I follow the AWS Redshift Getting Started guide, downloading and installing SQL Workbench/J and the JDBC driver so that I can connect to the database and write some SQL.
- Next I follow this set of instructions to launch a Redshift cluster and connect to it.
- The next step is to create some tables and load some data. Executing the following in SQL Workbench/J will do the trick. (I've borrowed the data from one of the AWS sample datasets; substitute your own AWS credentials in the copy commands.)
create table users(
  userid integer not null distkey sortkey,
  username char(8),
  firstname varchar(30),
  lastname varchar(30),
  city varchar(30),
  state char(2),
  email varchar(100),
  phone char(14),
  likesports boolean,
  liketheatre boolean,
  likeconcerts boolean,
  likejazz boolean,
  likeclassical boolean,
  likeopera boolean,
  likerock boolean,
  likevegas boolean,
  likebroadway boolean,
  likemusicals boolean);

create table sales(
  salesid integer not null,
  listid integer not null distkey,
  sellerid integer not null,
  buyerid integer not null,
  eventid integer not null,
  dateid smallint not null sortkey,
  qtysold smallint not null,
  pricepaid decimal(8,2),
  commission decimal(8,2),
  saletime timestamp);

copy users from 's3://awssampledbeuwest1/tickit/allusers_pipe.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';

copy sales from 's3://awssampledbeuwest1/tickit/sales_tab.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS';
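If you want to check that the copies worked, a couple of quick row counts will do it, and stl_load_errors is the Redshift system table to look in if a load fails:

-- quick check that the copy commands loaded something
select count(*) from users;
select count(*) from sales;

-- if a copy fails, the reason usually shows up here
select * from stl_load_errors order by starttime desc limit 10;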
- So now that we've set things up on the Redshift side, we are ready to walk through an example of using it in conjunction with EMR. First, dump the data into an S3 bucket:
unload ('select * from user_price')
to 's3://redshiftemrdemo/unload/'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
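One thing to note: user_price isn't one of the tables created above. It's just users joined to sales so that each row carries a pricepaid value, ready for the Hive job to sum spend per user. Something along these lines will produce it (I'm joining on buyerid here, i.e. treating pricepaid as what the user spent as a buyer; adjust the join to suit):

-- one row per sale: the buyer's user attributes plus the pricepaid for that sale
create table user_price as
select u.*, s.pricepaid
from users u
join sales s on s.buyerid = u.userid;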
The nice thing about unload is that it defaults to a pipe delimiter in the output files, which makes reading the data into Hive a lot easier (CSV can be a little fiddly).
- Then we run some kind of Hive job. I'm going to do a simple group by, but of course it can be much more involved. I'll use this Hive script:
create external table if not exists user_price (
  userid bigint, username string, firstname string, lastname string,
  city string, state string, email string, phone string,
  likesports string, liketheatre string, likeconcerts string, likejazz string,
  likeclassical string, likeopera string, likerock string, likevegas string,
  likebroadway string, likemusicals string, pricepaid float )
row format delimited fields terminated by '|'
stored as textfile
location 's3://redshiftemrdemo/unload';

create external table if not exists user_spend (
  userid bigint, spend float )
row format delimited fields terminated by '|'
stored as textfile
location 's3://redshiftemrdemo/output/user_spend.csv';

insert overwrite table user_spend
select userid, sum(pricepaid)
from user_price
group by userid;
and execute it from the AWS CLI:
aws emr create-cluster --name "Redshift and EMR" --ami-version 3.3 \
  --log-uri s3://redshiftemrdemo/logs/ \
  --applications Name=Hive \
  --ec2-attributes KeyName=examplejobs \
  --instance-type m3.xlarge --instance-count 3 \
  --steps Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,Args=[-f,s3n://redshiftemrdemo/scripts/redshiftemr.q]
- Finally, let's load the results back into Redshift:
create table user_spend(
  userid integer not null,
  spend decimal);

copy user_spend from 's3://redshiftemrdemo/output/user_spend.csv'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
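A quick query confirms the results made it back, for example eyeballing the top spenders:

-- check the round trip worked by looking at the biggest spenders
select userid, spend
from user_spend
order by spend desc
limit 10;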
And that's it. Of course, in reality you'd be doing this on a pretty big data set, otherwise it wouldn't be worth your while, but this at least shows how the pieces fit together.