If you are using AWS Redshift as your data warehouse and you have a data processing or analytical job that would benefit from a bit of Hadoop, then it's fairly straightforward to get your data into EMR and then back into Redshift. It's just a matter of using Redshift's COPY and UNLOAD commands to read from and write to an S3 bucket.
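To make that concrete, here's a minimal sketch of the two statements involved. The table names, S3 prefixes and credentials are all placeholders — in practice you'd execute these strings with whatever Postgres-compatible client you use against Redshift.

```python
# Sketch: building the Redshift UNLOAD and COPY statements that move data
# to and from S3 around an EMR job. Everything here is a placeholder; run
# the resulting SQL with your usual Redshift client.

def unload_sql(query, s3_prefix, creds):
    """UNLOAD writes the results of a query to files under an S3 prefix."""
    return (f"UNLOAD ('{query}') TO '{s3_prefix}' "
            f"CREDENTIALS '{creds}' DELIMITER '|' GZIP;")

def copy_sql(table, s3_prefix, creds):
    """COPY bulk-loads files under an S3 prefix back into a Redshift table."""
    return (f"COPY {table} FROM '{s3_prefix}' "
            f"CREDENTIALS '{creds}' DELIMITER '|' GZIP;")

creds = "aws_access_key_id=...;aws_secret_access_key=..."
print(unload_sql("select * from sales", "s3://my.bucket/emr-input/sales_", creds))
print(copy_sql("sales_summary", "s3://my.bucket/emr-output/", creds))
```

The EMR job then reads its input from the first prefix and writes its output to the second.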
Hue is a godsend for when you want to do something on hadoop (hive query, pig script, run spark etc) and you are not feeling very command line.
As with all things on AWS there seem to be many routes in, but here's my recipe for getting it up and running. (I'm assuming you have an AWS account; if not, get set up by following the instructions here.)
I use the AWS command line interface (easy instructions for install are here) to get a cluster with Hue on it up and running.
At the command line I then launched a cluster:
aws emr create-cluster --name "My Cluster Name" --ami-version 3.3 --log-uri s3://my.bucket/logs/ --applications Name=Hive Name=Hue Name=Pig --ec2-attributes KeyName=mykeypair --instance-type m3.xlarge --instance-count 3
Next, log in to the AWS console, go to EMR, click on your cluster, view the cluster details, then enable web connection and follow the instructions there.
Note the instructions for configuring the proxy management tool are incomplete. Go here for the complete instructions.
You should then, as it says, be able to put master-public-dns-name:8888 into your browser and log in to Hue. Don't be a fool like me and forget to actually substitute in your master-public-dns-name, which can be found in your cluster details!
Here’s a video tutorial I created that shows how to get Hue up and running on AWS
Return on investment is often measured as the gain from an investment divided by the cost of the investment, sometimes expressed as a percentage. For example if a marketing campaign cost £10K but brought in £20K of revenue on top of the usual sales then the ROI is 200%.
(Note ROI is arguably more properly defined as (gain – cost)/cost, but I've found that most of the people and industries that I've worked with slip naturally into the first definition: gain/cost. In any case both definitions capture the same idea. Thanks to Eduardo Salazar for pointing this out.)
Now if you are just given the ROI you'll find you are missing any idea of scale. The same ROI could be achieved with a revenue gain of £200 and with one of £200 million. So it would be nice to see cost, revenue and ROI visualised all in one go. There are a few ways to do this but after playing around I came up with the following representation, which personally I like the best. It's a simple scatterplot of cost against revenue, but since all points on straight lines radiating from the origin have the same ROI it's easy to overlay that information. If r is the ROI then the angle of the corresponding spoke is arctan(r).
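The geometry behind the spokes is just this: a point (cost, revenue) lies on the line revenue = r × cost, so the spoke for ROI r makes an angle of arctan(r) with the cost axis. A quick sketch:

```python
# Each ROI spoke is the line revenue = r * cost, which makes an angle
# of arctan(r) with the cost (horizontal) axis.
import math

def spoke_angle_degrees(roi):
    """Angle from the cost axis of the spoke for a given ROI (gain/cost)."""
    return math.degrees(math.atan(roi))

for roi in [0.5, 1.0, 2.0, 4.0]:
    print(f"ROI {roi:.0%}: spoke at {spoke_angle_degrees(roi):.1f} degrees")
```

So an ROI of 100% sits on the 45-degree line, and the £10K-cost, £20K-revenue campaign above sits on the arctan(2) spoke.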
Note you can drag about the labels. That’s my preferred solution for messy scatterplot labelling.
Hopefully it’s obvious that the idea is to read off the ROI from the position of the point relative to the spokes. The further out toward the circumference the greater the scale of the success or the disaster, depending on the ROI.
To modify the graph for your own purposes just take the code from here and substitute in your data where var dataset is defined. You can change which ROIs are represented by altering the values in the roi array. If you save the code as html and open in a browser you should see the graph. Because d3 is amazing the graph should adapt to fit your data.
You can also find the code here as a JSFiddle.
Thanks to Paul McAvoy for posing the problem and for all the other interesting things he’s shown me!
What do we do?
There’s a huge interest in data science at the moment. Businesses understandably want to be a part of it. Very often they assemble the ingredients (the software, the hardware, the team) but then find that progress is slow.
Coppelia is a catalyst in these situations. Rather than endless planning we get things going straight away with the build, learn and build again approach of agile design. Agile and analytics are a natural fit!
Projects might be anything from using machine learning to spot valuable patterns in purchase behaviour to building decision making tools loaded with artificial intelligence.
While we build we make sure that in-house teams are heavily involved – trained on the job. We get them excited about the incredible tools that are out there and new ways of doing things. This solves the problem of finding people with the data science skill set. It’s easier to grow your technologists in-house.
The tools are also important. We give our clients a full view of what’s out there, focusing on open source and cloud based solutions. If a client wishes to move from SAS to R we train their analysts not just in R but in the fundamentals of software design so that they build solid, reliable models and tools.
Finally we know how important it is for the rest of the business to understand and get involved with these projects. Visualisation is a powerful tool for this and we emphasize two aspects that are often forgotten: interactivity (even if it’s just the eye exploring detail) and aesthetics: a single beautiful chart telling a compelling story can be more influential than a hundred stakeholder meetings.
Why are we different?
One thing is that we prioritise skills over tools. There are a lot of people out there building tools but they tend to be about either preprocessing data or prediction and pattern detection for a handful of well defined cases. We love the tools but they don’t address the most difficult problem of how you turn the data into information that can be used in decision making. For that you need skilled analysts wielding tools. Creating the skills is a much harder problem.
Coppelia offers a wide range of courses, workshops and hackathons to kickstart your data science team. See our solutions section for a full description of what we offer.
Another difference is that we are statisticians who have been inspired by software design. We apply agile methods and modular design not just to the tools we build ourselves but also to traditional analytical tasks like building models.
Collaboration using tools like git and trello has revolutionised the way we work. Analysis is no longer a solitary task, it’s a group thing and that means we can take on bigger and more ambitious projects.
But what is most exciting for us is our zero overhead operating model and what it enables us to do. Ten years ago if we’d wanted to run big projects using the latest technology we’d have had to work for a large organisation. Now we can run entirely on open source.
For statistical analysis we have R; to source and wrangle data we have Python; and we can rent hardware by the hour from AWS and use it to parallelise jobs using Hadoop.
Even non-technical tasks benefit in this way: marketing using social media, admin through google drive, training on MOOCs, design using inkscape and pixlr, accounting on quick file.
Without these extra costs hanging over us we are free to experiment, innovate, cross disciplines and work on topics that interest us and causes we like. Above all it gives us time to give back to the sources which have allowed us to work in this way: publishing our code, sharing insights through blogging, helping friends and running local projects.
What are we excited about?
Bayesian statistics and simulation for problem
What technology are we into?
pifreak is my twitterbot. It started tweeting the digits of pi in April 2012 and has tweeted the next 140 digits at 3:14 pm GMT every day since. Not especially useful or popular (only 48 followers) but I’ve grown fond of she/he/it.
I was housing her on an AWS EC2 micro instance; however, my one year of free tier usage ran out and it has become a little too expensive to keep that box running.
So I’ve been looking at alternatives. I’ve settled on the google app engine which I’m hoping is going to come out as pretty close to free hosting.
So here’s a few notes for anyone else who might be thinking of using the google app engine for automated posting on twitter.
It was reasonably simple to set up:
- Download the GAE python SDK. This provides a GUI for both testing your code locally and then deploying it to the cloud when you are happy with it.
- Create a new folder for your app and within that place your python modules together with an app.yaml file and a cron.yaml file which will configure the application and schedule your task respectively. It’s all very well documented here and for the cron scheduling here.
- Open the App Engine Launcher (which is effectively the SDK), add your folder, then either hit run to test locally or deploy to push to the cloud (you’ll be taken to some forms to register your app if you’ve not already done so)
- Finally if you click on dashboard from the launcher you’ll get lots of useful information about your web deployed app including error logs and the schedule for your tasks.
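For orientation, here's a hedged sketch of what the two yaml files might look like for a bot like this — the application name, module name and schedule below are all placeholders, not the real app's values:

```yaml
# app.yaml -- registers the app and routes requests to the python module
application: my-twitter-bot     # must match the name registered with Google
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: main.app              # module names are case sensitive

libraries:
- name: ssl
  version: latest

# cron.yaml -- schedule the tweeting task (times are UTC)
cron:
- description: tweet the next 140 digits of pi
  url: /
  schedule: every day 15:14
```

Note how both files carry a url field; for a simple app with the module in the root, a forward slash does the job.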
The things that caught me out were:
- Make sure that the application name in your app.yaml file is the same as the one you register with Google (when it takes you through to the form the first time you deploy.)
- There wasn’t a lot in the documentation about the use of the url field in both the cron and app yaml files. I ended up just putting a forward slash in both since in my very simple app the python module is in the root.
- Don’t forget module names are case sensitive so when you add your python module in the script section of the app file you’ll need to get this right.
- Yaml files follow an indentation protocol that is similar to python. You’ll need to ensure it’s all lined up correctly.
- Any third party libraries you need that are not included in this list will need to be included in your app folder. For example I had to include tweepy and some of its dependencies.
- Where the third party library that you need is included in the GAE runtime environment, you instead add it to the app file using the following syntax:
- name: ssl
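Putting it together, the daily task itself might look something like this. This is only a sketch: the keys are placeholders, the digits would really come from a long stored file rather than a string literal, and tweepy has to be bundled in the app folder as mentioned above.

```python
# Sketch of the daily cron task: tweet the next 140 digits of pi.
# PI_DIGITS stands in for a long stored file of digits; the keys below
# are placeholders, and tweepy must be bundled with the app.

PI_DIGITS = "14159265358979323846..."  # the digits after "3."

def chunk_for_day(day_number, size=140):
    """The digits to tweet on the nth day since the bot started."""
    start = day_number * size
    return PI_DIGITS[start:start + size]

def tweet_todays_digits(day_number):
    import tweepy  # bundled third-party library, not part of GAE
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    tweepy.API(auth).update_status(chunk_for_day(day_number))
```

The cron entry then just needs to hit a handler that calls tweet_todays_digits once a day.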
And here finally is a link to the code.
Since d3 can be a little inaccessible at times I thought I’d make things easier by starting with a basic skeleton force directed layout (Mike Bostock’s original example) and then giving you some blocks of code that can be plugged in to add various features that I have found useful.
The idea is that you can pick the features you want and slot in the code. In other words I’ve tried to make things sort of modular. The code I’ve taken from various places and adapted so thank you to everyone who has shared. I will try to provide the credits as far as I remember them!
Here’s the basic skeleton based on Mike Bostock’s original example but with a bit of commentary to remind me exactly what is going on
A is for arrows
I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, not for hierarchical clustering.
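To show what the fuzzy part buys you: instead of a hard label, each point gets a degree of membership in every cluster. Here's a small sketch of the standard fuzzy c-means membership formula (m is the fuzzifier), which is exactly the bit that assumes Euclidean distances to cluster centres — and hence doesn't carry over to hierarchical clustering:

```python
# Fuzzy c-means memberships: a point's membership in each cluster falls
# off with its (Euclidean) distance to that cluster's centre. The centres
# here are made up for illustration.
import numpy as np

def memberships(point, centres, m=2.0):
    """Degrees of membership of one point in each cluster (they sum to 1)."""
    d = np.linalg.norm(centres - point, axis=1)
    inv = d ** (-2.0 / (m - 1))   # closer centres get more weight
    return inv / inv.sum()

centres = np.array([[0.0, 0.0], [10.0, 0.0]])
print(memberships(np.array([2.0, 0.0]), centres))  # mostly cluster 0
print(memberships(np.array([5.0, 0.0]), centres))  # an even split
```

Notice the formula needs cluster centres, which is precisely what a hierarchical solution in a non-Euclidean space doesn't give you.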
Here’s a chart I drew for myself to understand the relationships between chords in music theory. Doesn’t seem to have much to do with machine learning and statistics but in a way it does since I found it a lot easier to picture the chords existing in a sort of network space linked by similarity. Similarity here is defined as the removal or addition of a note, or the sliding of a note one semitone up or down. What’s wrong with me!
I wrote this code for a project that didn’t work out, but I thought I’d share. It takes the dendrogram produced by hclust in R and converts it into json to be used in a D3 force directed graph (slicing the dendrogram near the top to create a couple of clusters). The dendrogram in R looks like this:
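The actual code is R, but the same idea can be sketched in Python with scipy standing in for hclust: cluster hierarchically, cut the tree near the top, and emit the nodes/links json that a D3 force directed graph expects. The toy data below is made up.

```python
# Rough Python analogue of the R-to-D3 conversion: hierarchical clustering,
# cut into two clusters, then nodes/links json for a force directed graph.
# (The post's real code uses hclust in R; scipy is a stand-in here.)
import json
from scipy.cluster.hierarchy import linkage, fcluster

points = [[1, 2], [1, 3], [8, 8], [9, 8], [9, 9]]
Z = linkage(points, method="complete")           # build the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # slice it near the top

nodes = [{"name": i, "group": int(g)} for i, g in enumerate(labels)]
links = [{"source": i, "target": j}
         for i in range(len(points)) for j in range(i + 1, len(points))
         if labels[i] == labels[j]]              # link points sharing a cluster
print(json.dumps({"nodes": nodes, "links": links}))
```

The json then plugs straight into the skeleton force layout from the earlier post.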
I wanted a way of understanding how a clustering solution will change as more data points are added to the dataset on which it is built.
To explain this a bit more, let’s say you’ve built a segmentation on customers, or products, or tweets (something that is likely to increase) using one or other clustering solution, say hierarchical clustering. Sooner or later you’ll want to rebuild this segmentation to incorporate the new data and it would be nice to know how much the segmentation will change as a result.
One way of assessing this would be to take the data you have now, roll it back to a previous point in time and then add new chunks of data sequentially each time rebuilding the clustering solution and comparing it to the one before.
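That replay loop is easy to sketch. Here I use made-up two-cluster data rather than the US crime data, scipy's hierarchical clustering in place of hclust, and the adjusted Rand index to score how much successive solutions agree on the points they share:

```python
# Sketch of the rollback-and-replay idea: start from a 20 point subset,
# add five points at a time, recluster, and score each solution against
# the previous one (on the points both have) with the adjusted Rand index.
# Synthetic data here; the post itself uses the US crime data and hclust.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(6, 1, (25, 2))])
rng.shuffle(data)

def cluster(points, k=2):
    return fcluster(linkage(points, method="complete"), t=k, criterion="maxclust")

prev = cluster(data[:20])
for n in range(25, len(data) + 1, 5):   # add five new points each time
    curr = cluster(data[:n])
    shared = len(prev)                  # points present in both solutions
    print(f"n={n}: ARI vs previous = {adjusted_rand_score(prev, curr[:shared]):.2f}")
    prev = curr
```

An ARI of 1 means the new solution assigns the shared points to the same groups as before (label names aside); values near 0 mean the segmentation has effectively been reshuffled.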
Seeing what’s going on
Having recorded the different clusters that result from incrementally adding data, the next problem is to understand what is going on. I thought a good option would be a Sankey diagram. I’ve tested this out on the US crime data that comes with R. I built seven different clustering solutions using hclust, each time adding five new data points to the original 20 point data set. I used the Google Charts Sankey layout, which is itself derived from the D3 layout. Here’s the result.