Lots of data that’s available online tends to not be the cleanest thing in the world, particularly if you’ve had to scrape it in the first place. At the same time, lots of internal data sets can be just as messy, with columns having different names in what should be identical spreadsheet templates, or with what should be identical values (e.g. “United Kingdom” vs “UK”) not being identical. There are lots more examples that I could mention, but
Google Refine is a great tool for dealing with messy data and turning it quickly and easily into a much better dataset, which then allows for the fun to begin with the analysis and visualisation.
Three handy Google Refine tricks.
My reason for using Refine was to clean up and append additional information to the Information is Beautiful Awards. I’m still learning how to use the software, but found these three things to be very handy and things that I’ve not encountered elsewhere (without writing some code).
Add together multiple different Excel worksheet. This can be an absolute pain when there are either a lot of them, or where each one has some additional or missing columns. In the past, I’ve dealt with this using a combination of Python and VBA. However, Refine makes it incredibly easy to join together multiple Excel worksheets (particularly if they have similarly named columns). NB: this functionality doesn’t seem to exist when plugging into Google spreadsheets.
Reconciliation. My new favourite word. The data set that was provided, whilst exhaustive in some respects, still lacked a lot of potentially interesting information, such as when in the year it was released, the stars who featured, and so on. My first approach to this was to use the URLs provided and see if I could scrape the data from the link, as most of them were from a particular website, which had a consistent format.
Then I discovered Reconciliation.
This screencast does it far more justice than I can, but essentially the idea is that my reconciling my database with another one (in this case Freebase), there’s a whole lot more information that I can easily add on to my dataset. Think of reconciling a bit like Primary Keys that are universally defined and being able to left join. Also, think of remakes of films (Casino Royale, Ocean’s Eleven, etc.) – which version of the film am I referring to?
I simply asked Refine to start Reconciling, and by looking at the data in my film name column, was able to identify that it was a list of films. It then did its best at fuzzy matching the names, and where it wasn’t sure, gave me a list of options from which to choose. I could then choose a confidence level and it would leave me to manually choose the remaining records which were lower than this level.
Facets. Without Data Validation built into data entry, lots of alternative spellings can crop up for what should be the same value (e.g. “United Kingdom” vs “UK”). There are normally a few ways to deal with these, and tools like Tableau make it easy to group such values together.
Refine takes a nice approach to these (and other data validation issues) using what it calls “Facets”, which are essentially a combination of a Summary of the data, combined with Data Manipulation. What this means, is that (a) I can see what mistakes there are in the data and (b) I can then easily correct them.