D3 – another acronym to learn


D3 (which stands for Data-Driven Documents) has been getting a lot of traction over the last few months, with more and more interactive and animated visualisations using this JavaScript library. The author of the library, Mike Bostock, is very active both in developing it and in providing a constant stream of examples of it in use.

For anyone coming from a ggplot2 background, there are a lot of similarities between parts of D3 and the attributes that can be given to objects. There is a fair amount more code to write, and a bit of a learning curve, but there are lots of great tutorials out there which make it easy to get started and to create good-looking visualisations, for example, here, here and here.

The general idea behind the library is to encode objects, for example a circle, with various attributes like colour, size and location. These can either be hard-coded, based on a function (e.g. the sine function) or driven by a dataset. The other nice feature is the built-in ability to transition smoothly between states (e.g. moving a circle from location A to location B), something which would normally be fiddly to do by hand and would probably look quite jerky.
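D3 itself is JavaScript, so I can't embed it here, but the core idea of mapping data values onto visual attributes will be familiar from ggplot2, and a rough Python/matplotlib analogue looks something like this (the dataset is made up purely for illustration):

```python
import matplotlib.pyplot as plt

# Made-up dataset: each row will become one circle on the chart
data = [
    {"x": 1, "y": 2, "value": 10},
    {"x": 2, "y": 5, "value": 40},
    {"x": 3, "y": 3, "value": 25},
]

xs = [d["x"] for d in data]
ys = [d["y"] for d in data]
sizes = [d["value"] * 20 for d in data]   # circle size driven by the data
colours = [d["value"] for d in data]      # circle colour driven by the data

# In D3 you would bind the dataset to SVG circles and set cx, cy, r and fill
# from functions of each datum; here the same mapping is done in one call
plt.scatter(xs, ys, s=sizes, c=colours)
plt.show()
```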

For anyone coming from a Tableau background, D3 provides a lot more flexibility in what can be achieved, but it is very much a code-based solution rather than a BI tool.

From my experience so far, a specialist HTML editor like Aptana Studio is a good place to start, for a few reasons: colour coding for the HTML, and a built-in local web server so that you can quickly preview the results of your code.

As an aside, you’ll need a web server of some sort for any D3 code to work (it took me a frustrating hour to figure this out). The alternative to using the built-in Aptana server is something like Python’s simple HTTP server, but I find this a lot more cumbersome.
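For what it's worth, the Python route only needs the standard library. At the time this meant `python -m SimpleHTTPServer`; on Python 3 the module is `http.server`, and a minimal sketch of the same thing looks like this:

```python
# Serve the current directory over HTTP so the browser can load your D3 page
# and any local data files. Equivalent to running `python -m http.server 8000`
# (or `python -m SimpleHTTPServer 8000` on Python 2) from that directory.
import http.server
import socketserver

PORT = 8000

with socketserver.TCPServer(("", PORT), http.server.SimpleHTTPRequestHandler) as httpd:
    print("Serving on http://localhost:%d" % PORT)
    httpd.serve_forever()
```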

Oh, and sorry for not having any examples built into the post – WordPress.com doesn’t like having custom JavaScript in posts.

The changing face of “Analysis”


Here’s something I started to write a few months ago but never got around to finishing off.

Neil Charles over at Wallpapering Fog has just written an excellent post about the growing importance of R and Tableau to the modern-day analyst. Although I’m not as long in the tooth as Neil (sorry Neil), even in the last five years there has been a definite movement towards a much wider skill set for everyday analysis, at least within marketing.

The days of only using the likes of SPSS, SAS and Excel are long gone, as the need to make work more repeatable, scalable and downright flexible has grown. Today’s analyst needs to be comfortable getting hold of new data sources that don’t necessarily sit in an Excel file or in tabular form, manipulating them, applying a statistical technique that they didn’t necessarily learn at university, and then visualising the results (maybe on a map or a small-multiples plot).

Drawing on Neil’s post, I thought I would add my tuppence worth on some of the themes he mentions.

New languages to learn

I’ve been a fan of R for the last three years, enjoying the power and flexibility it gives throughout the analysis process and the much greater repeatability it allows. The sheer number of packages that exist for it has massively levelled the playing field between academia and industry.

To put it bluntly, there are datasets that I wouldn’t be able to access and visualise if I were just using Excel, a sentiment similar to that expressed by the guys behind Processing. SAS is a bit better from what I understand, but still very limited (which might well be why it’s dropped out of the top 50 programming languages).

James Cheshire over at SpatialAnalysis has shown what is possible when visualising data using ggplot2. Even five years ago this would have been the preserve of expensive GIS software available only to a small group of people.

Tools like Tableau take this one step further, making it very easy to produce excellent-quality graphics and visualisations for telling stories with data, which again would have been very expensive to produce not long ago and would have required a very strong programming background.

Python is my other tool of choice, mainly because of the number of wrappers that have been written for it to access the various APIs out there. For instance, until recently there was no implementation of OAuth (a way of gaining authorisation to use some APIs) in R, whereas Python has had this for a while, along with wrappers for the likes of Google Analytics, Facebook and Twitter.
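As a hedged sketch of what those wrappers look like in practice, here is roughly how you would pull a user's recent tweets with the tweepy wrapper for Twitter; the keys and the screen name are placeholders you'd replace after registering your own app:

```python
import tweepy  # one of the Python wrappers around the Twitter API

# Placeholder credentials, obtained by registering an app with Twitter
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

api = tweepy.API(auth)

# The wrapper hides the OAuth signing and HTTP calls behind simple methods
for tweet in api.user_timeline(screen_name="some_user", count=10):
    print(tweet.text)
```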

I tend to be a lazy analyst – I’ll find the easiest and quickest way for the computer to do something so that I don’t have to. What that’s meant is that I’ve ended up using R, Python and Tableau as complementary parts of my toolkit. The future is one where knowing just one language is likely to leave you in some painful situations when something else could accomplish the task far more easily.

The growing importance of databases

As datasets get larger and more complicated, traditional methods for storing smaller datasets (e.g. CSVs and Excel files) are becoming increasingly unfit for purpose. Datasets are now often being added to constantly, and the need to query them in different ways makes databases the way to go.

MySQL is a fantastic and freely available solution which can sit locally on your laptop, on a company server or even in the cloud on an EC2 instance (or similar). It’s easy to set up and easy to manage for small to medium-sized pieces of work.
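To give a flavour of how that looks from the analyst's end, here is a hedged sketch of querying MySQL from Python using the MySQLdb driver (pymysql offers much the same interface); the connection details and the campaign_results table are invented for illustration:

```python
import MySQLdb  # the MySQLdb/mysqlclient driver

# Hypothetical connection details and table, purely for illustration
conn = MySQLdb.connect(host="localhost", user="analyst",
                       passwd="secret", db="marketing")
cur = conn.cursor()

# Pull just the aggregated slice needed for today's question,
# rather than dragging a whole CSV or Excel file into memory
cur.execute("""
    SELECT channel, SUM(spend), SUM(conversions)
    FROM campaign_results
    WHERE week >= %s
    GROUP BY channel
""", ("2012-01-01",))

for channel, spend, conversions in cur.fetchall():
    print(channel, spend, conversions)

conn.close()
```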

My other thing of the moment is MongoDB, one of the growing number of NoSQL (which stands for “not only SQL”) database solutions out there. Essentially, it’s a way of storing data in a non-tabular format, so each record can have a different number of fields. For anyone familiar with XML or JSON, it’s very similar in thinking to these. I’ve become a big fan because of this flexibility, combined with the speed and the relatively easy-to-understand query syntax.
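Again as a hedged sketch, this is roughly what working with MongoDB looks like from Python via the pymongo driver; the database, collection and documents here are made up for illustration:

```python
from pymongo import MongoClient

client = MongoClient()            # assumes MongoDB running locally on the default port
db = client["analytics"]          # hypothetical database
visits = db["visits"]             # hypothetical collection

# Documents in the same collection don't need the same fields
visits.insert_one({"user": "alice", "pages": ["/home", "/pricing"], "referrer": "twitter"})
visits.insert_one({"user": "bob", "pages": ["/home"]})

# Queries are themselves JSON-like documents
for doc in visits.find({"referrer": "twitter"}):
    print(doc)
```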

The one thing to note with databases is that bigger ones need to be designed properly. There is an important role for database architecture and administration that it’s very easy to overlook when all you’ve got is an analyst who roughly understands how these things work. Of course, there are also plenty of times when the data is small enough that poor architecture doesn’t result in any tangible delay in getting the data out.

The result of this trend is that there’s an important interplay between analysts and those involved in more traditional database administration, and a question of how the needs of the analyst can be accommodated in an agile way.

Collaboration

I’ve been using BitBucket for a couple of years now as the default place that I store my code. Rather than having it on one machine, I store it in the cloud and can access it from anywhere. I can choose whether I want to share the code with no one, everyone or somewhere in between.

It’s a bit of a struggle in places (e.g. those damned two heads), but as a tool for sharing code and managing version control, it’s a no-brainer.

Keeping up to speed with the latest developments

Probably the most important and hardest to define piece is how to keep up to speed with all this – how to hear about Mongo or Haskell or how to scrape Facebook data. I tend to use a few things:

  1. Netvibes to follow blogs.
  2. Twitter to follow people.
  3. Pinboard to tag the stuff I might need at some point in the future.
  4. StackOverflow and the R Mailing List for help with answering questions (which someone has normally come across before).
  5. Online Courses like those run by Stanford to learn new techniques.

That’s it for now, but I’ll doubtless pick up on this thread again in the future. I’d also be really keen to hear what other analysts’ experience has been over the last few years as new software languages and technologies have become available.
