Here’s something I started to write a few months ago but never got around to finishing.
Neil Charles over at Wallpapering Fog has just written an excellent post about the growing importance of R and Tableau to the modern-day analyst. Although not as long in the tooth as Neil (sorry Neil), even in the last 5 years there has been a definite movement towards a much wider skill set for everyday analysis, at least within marketing.
The days of only using the likes of SPSS, SAS and Excel are long gone, as the need to make work more repeatable, scalable and downright flexible has grown. Today’s analyst needs to be comfortable getting hold of new data sources that don’t necessarily sit in an Excel file or in tabular form, manipulating them, applying a statistical technique that they didn’t necessarily learn at university, and then visualising the results (maybe on a map or a small-multiples plot).
Drawing on Neil’s post, I thought I would add my tuppence worth on some of the themes he mentions.
New languages to learn
I’ve been a fan of R for the last 3 years, enjoying its power and flexibility throughout the analysis process and the much greater repeatability it allows. The sheer number of packages that exist for it has massively levelled the playing field between academia and industry.
To put it bluntly, there are datasets that I wouldn’t be able to access and visualise if I were just using Excel, a sentiment similar to the one expressed by the team behind Processing. SAS is a bit better from what I understand, but still very limited (which might well be why it has dropped out of the top 50 programming languages).
James Cheshire over at SpatialAnalysis has shown what is possible with visualising data using ggplot2. Even five years ago this would have been the preserve of expensive GIS software which would have been available to a small group of people.
Tools like Tableau take this one step further. They make it very easy to produce excellent-quality graphics and visualisations, and to tell stories with data, in a way that until recently would have been very expensive and would have required a very strong programming background.
Python is my other tool of choice, mainly due to the number of wrappers that have been written for it to access the various APIs that exist. For instance, up until recently there was no implementation of OAuth (a way of gaining authorisation to some APIs) within R, whereas Python had this and also has wrappers written for the likes of Google Analytics, Facebook and Twitter.
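Once you’re authorised against one of these APIs, the responses usually come back as JSON, and Python makes it painless to flatten them into something R or a database can use. A minimal sketch, using a made-up payload (the field names here are invented for illustration, not from any real API):

```python
import json

# A hypothetical JSON payload, shaped loosely like the responses many
# social APIs return. This sample is entirely made up.
payload = """
{
  "user": "analyst_bob",
  "tweets": [
    {"id": 1, "text": "Learning R", "retweets": 3},
    {"id": 2, "text": "Trying MongoDB", "retweets": 7}
  ]
}
"""

data = json.loads(payload)

# Flatten the nested records into rows, ready for loading into R,
# a CSV or a database table.
rows = [(t["id"], t["text"], t["retweets"]) for t in data["tweets"]]
print(rows)  # [(1, 'Learning R', 3), (2, 'Trying MongoDB', 7)]
```

The real work in practice is the authorisation step; once that’s done, almost every wrapper hands you structures like the one above.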
I tend to be a lazy analyst: I’ll find the easiest and quickest way for the computer to do something so that I don’t have to. That’s meant I’ve tended to use R, Python and Tableau as complementary parts of my toolkit. In future, knowing just one language is likely to leave you in some painful situations when another could accomplish the task far more easily.
The growing importance of databases
As datasets get larger and more complicated, traditional methods for storing smaller datasets (e.g. in CSVs and Excel files) are becoming increasingly unfit for purpose. Datasets are now often being added to constantly, and the need to query them in different ways makes databases the way to go.
MySQL is a fantastic and freely available solution which can sit locally on your laptop, on a company server or even in the cloud on an EC2 instance (or similar). It’s easy to set up and easy to manage for small- to medium-sized projects.
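The payoff is that aggregations which grind in Excel become one-line queries. A quick sketch from Python, using the built-in sqlite3 module as a lightweight stand-in for MySQL (the SQL itself would be near-identical against a MySQL server; the table and data are invented):

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE visits (channel TEXT, sessions INTEGER)")
cur.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("search", 120), ("social", 45), ("search", 80)],
)

# Aggregate sessions by channel - the kind of query that gets painful
# once the raw data no longer fits comfortably in a spreadsheet.
cur.execute(
    "SELECT channel, SUM(sessions) FROM visits "
    "GROUP BY channel ORDER BY channel"
)
totals = cur.fetchall()
print(totals)  # [('search', 200), ('social', 45)]
conn.close()
```

Swap the connection line for a MySQL driver and the same query runs against a shared server rather than your laptop.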
My other tool of the moment is MongoDB, one of the growing number of NoSQL (which stands for “Not only SQL”) database solutions out there. Essentially, it’s a way of storing data in a non-tabular format, so each record can have a different set of fields. For anyone familiar with XML or JSON, the thinking is very similar. I’ve become a big fan due to this flexibility, combined with its speed and a relatively easy-to-understand query syntax.
The one thing to note with databases is that bigger ones need to be designed properly. Database architecture and administration is an important role in its own right, and one that’s easy to overlook when you’ve got an analyst who only roughly understands how these things work. Of course, there are also plenty of cases where the data is small enough that poor architecture doesn’t cause any tangible delay in getting it out.
The result of this trend is an important interplay between analysts and those involved in more traditional database administration, and a question of how the needs of the analyst can be accommodated in an agile way.
Version control
I’ve been using BitBucket for a couple of years now as the default place to store my code. Rather than having it on one machine, I keep it in the cloud and can access it from anywhere. I can choose whether I want to share the code with no one, everyone or somewhere in between.
It’s a bit of a struggle in places (e.g. those damned two heads), but as a tool for sharing code and managing version control, it’s a no-brainer.
Keeping up to speed with the latest developments
Probably the most important and hardest to define piece is how to keep up to speed with all this – how to hear about Mongo or Haskell or how to scrape Facebook data. I tend to use a few things:
- Netvibes to follow blogs.
- Twitter to follow people.
- Pinboard to tag the stuff I might need at some point in the future.
- StackOverflow and the R mailing list for help with questions (which someone has usually run into before).
- Online Courses like those run by Stanford to learn new techniques.
That’s it for now, but I’ll doubtless pick up on this thread again in the future. I’d also be really keen to hear what other analysts’ experiences have been over the last few years as new languages and technologies have become available.