Clegg vs Pleb: An XKCD-esque chart


I saw an interesting “challenge” on StackOverflow last night to create an XKCD-style chart in R. A couple of hours later, and going in a very similar direction to a couple of the answers on SO, I got to something that looked pretty good, using sin and cos curves as a simple, reproducible example.
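If you want to reproduce the toy chart, the data behind it is nothing more than jittered sin and cos curves. A minimal sketch is below (the amount of jitter is arbitrary; the theme itself is in the code further down):

### Toy data for the XKCD-style chart: jittered sin and cos curves
xkcd.data <- data.frame(x = seq(0, 10, 0.1))
xkcd.data$sin <- sin(xkcd.data$x) + rnorm(nrow(xkcd.data), sd = 0.05)
xkcd.data$cos <- cos(xkcd.data$x) + rnorm(nrow(xkcd.data), sd = 0.05)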

Tonight, I thought I’d try to apply the theme and styling to some real-world and slightly XKCD-esque content: UK politics. Two of the biggest stories of the month in the UK have been Nick Clegg’s apology over reneging on the Liberal Democrats’ tuition fees pledge and Andrew Mitchell’s “incident” trying to cycle out of Downing Street, the so-called GateGate.

Using the newly redesigned Google Insights for Search, I looked at searches for clegg and pleb over the last 30 days. A quick manipulation into a CSV, then applying the XKCD theme and a bit of geom_smooth, gives this:

Looks like Andrew Mitchell might be Nick Clegg’s new best friend in terms of deflecting some of the attention away from the sorry song…

And here’s the code (note that you need to have imported the Humor Sans font via the extrafont package first):
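If you haven’t imported the font yet, a one-off setup along these lines should work, assuming the Humor Sans .ttf file is already installed on your system (the pattern argument just limits the import to that font):

### One-off font setup for extrafont (assumes Humor Sans is installed on the system)
library(extrafont)
font_import(pattern = "Humor")   # import the Humor Sans .ttf into extrafont's database
loadfonts()                      # register the fonts with R's graphics devices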

library(ggplot2)
library(extrafont)
### Already have read in fonts (see previous answer on how to do this)
loadfonts()
### Clegg and Pleb data
pleb.clegg <- read.csv("pleb and clegg.csv")
pleb.clegg$Date <- as.Date(pleb.clegg$Date, format="%d/%m/%Y")
pleb.clegg$xaxis <- -4
### XKCD theme
theme_xkcd <- theme(
	panel.background = element_rect(fill="white"),
	axis.ticks = element_line(colour=NA),
	panel.grid = element_line(colour="white"),
	axis.text.y = element_text(colour=NA),
	axis.text.x = element_text(colour="black"),
	text = element_text(size=16, family="Humor Sans")
	)
### Plot the chart
p <- ggplot(data=pleb.clegg, aes(x=Date, y=Pleb))+
geom_smooth(aes(y=Clegg), colour="gold", size=1, position="jitter", fill=NA)+
geom_smooth(colour="white", size=3, position="jitter", fill=NA)+
geom_smooth(colour="dark blue", size=1, position="jitter", fill=NA)+
geom_text(data=pleb.clegg[10, ], family="Humor Sans", aes(x=Date), colour="gold", y=20, label="Searches for clegg")+
geom_text(data=pleb.clegg[22, ], family="Humor Sans", aes(x=Date), colour="dark blue", y=4, label="Searches for pleb")+
geom_line(aes(y=xaxis), position = position_jitter(h = 0.1), colour="black")+
coord_cartesian(ylim=c(-5, 40))+
labs(x="", y="", title="Pleb vs Clegg: Google Keyword Volumes")+
theme_xkcd
ggsave("xkcd_cleggpleb.jpg", plot=p, width=8, height=5)

Why are pirates called pirates?


In homage to International Talk Like a Pirate Day…

I recently stumbled across a series of blog posts from the folks at IDV that visualised the archive of recorded pirate attacks which has been collected by the US National Geospatial-Intelligence Agency. It’s a dataset of 6000+ pirate attacks which have been recorded over the last 30 or so years.

This first map shows where the attacks have been recorded, with four clear areas standing out when the data is aggregated into hexagon bins:

Map showing areas where pirate attacks have been recorded

Zooming in on the area around Yemen, there’s a clear ramp-up in the number of attacks since 2008, which saw a 570% increase on the previous year. As noted in the IDV analysis, most attacks take place on a Wednesday and during the spring and autumn.

Number of attacks recorded in the Aden region by year

The reaction to the massive increase in attacks in 2008 seems to have been ships keeping further from the shore in 2009, so that more attacks happened further out to sea. This can be seen clearly by looking only at the attacks in 2008 and 2009:

Number of pirate attacks in the Aden area in 2008 and 2009 (distance is in degrees)

As well as the location of each attack, the dataset also contains a description of it, which lends itself well to some text analysis to understand how the nature of the attacks has changed over and above their distance from shore.

Some analysis of the descriptions reveals that the nature of the attacks also changed in 2010, with more featuring terms such as security and speedboats (full details of how these topic groups were created are below). The analysis was used to identify five different types of attack.

From the chart below, Topics 4 and 5 came to prominence in 2008, with Topic 5 maintaining its share in 2010 before Topic 2 increased in number in 2011 and 2012. This is just scratching the surface of what can be done with topic analysis, and given that all the documents relate to pirate attacks, there isn’t the variation you would see in, say, news articles covering many subjects. There’s a good walkthrough of using the topicmodels package here.

Attacks in the Yemen region classified into one of five topics based on description

And what are these topics? The table below shows the top 10 terms for each of the 5 topics. They’re not as clear-cut as you’d hope (mainly because there are a fair few verbs and numbers in there at the moment), but they give an idea of some differences: skiffs versus speedboats, Topic 2 featuring “security”, and the numbers involved and the months of the year all hint at different aspects of the attacks that have been picked out.

     Topic 1  Topic 2  Topic 3    Topic 4    Topic 5
1  attempted     were hijacked       boat      boats
2        six     fire      are      white       four
3       took security attacked       port        men
4  increased   skiffs    miles      about    persons
5      board     team  vessels      small      three
6     skiffs     when     this        sep        may
7     alarm,      jan  advised        apr speedboats
8        for    seven merchant    general       five
9   chemical      had  boarded reportedly    reports
10      guns    which exercise  speedboat       each

And the reason why pirates are called pirates? Because they Argghhhhhhhhhh.

NB. I haven’t had a chance to check on the copyright, etc. for hosting the pirate dataset, so please download it from here.

Reading the data into R.

library(maps)
library(sp)
library(maptools)
library(ggplot2)
library(spatstat)
gpclibPermit()
library(topicmodels)
pirates.data <- readShapePoints("C:/ASAM 05 SEP 12")   # use forward slashes (or doubled backslashes) in the path
pirates.data.2 <- as.data.frame(pirates.data)

How far to the shore?

Next, the data is turned into a Planar Point Pattern (ppp) object so that the nearest coastal point can be calculated for each attack. The same technique is used to create a similar object for the coastline. The nncross function then finds the distance from each attack to the nearest point on the coast (and identifies which point that is).

bb <- c(40, 56, 7, 17)
pirates.ppp <- as.ppp(pirates.data.2[,13:14], bb)
worldmap <- map_data("world")
land.ppp <- as.ppp(worldmap[, 1:2], bb)
land.df <- as.data.frame(cbind(land.ppp$x, land.ppp$y))
reg <- as.data.frame(map("world", xlim = c(40, 56), ylim = c(7, 17), plot = FALSE)[1:2])   # keep just the x/y outline coordinates
nearest.land <- nncross(pirates.ppp, land.ppp)
pirates.nearest.land <- as.data.frame(cbind(as.numeric(pirates.ppp$x), as.numeric(pirates.ppp$y), as.numeric(nearest.land$dist)))
pirates.data.aden <- merge(pirates.nearest.land, pirates.data.2, by.x=c("V1", "V2"), by.y = c("coords.x1", "coords.x2"))

Calculate various extra columns, such as the year of each attack and the number of attacks by year.

library(plyr)      # for ddply
library(quantmod)  # for Delt
pirates.data.aden$year <- 1900 + as.POSIXlt(pirates.data.aden$DateOfOcc)$year
pirates.data.aden$month <- 1 + as.POSIXlt(pirates.data.aden$DateOfOcc)$mon
year.stats <- ddply(pirates.data.aden, .(year), summarise, attacks = length(year))
year.stats$Delt <- Delt(year.stats$attacks)   # year-on-year change in the number of attacks
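To check the day-of-week pattern mentioned earlier (most attacks reportedly falling on a Wednesday), a quick tabulation along these lines would do it. This is an illustrative addition rather than part of the original script, and assumes DateOfOcc parses cleanly with as.Date:

### Illustrative: count attacks by day of the week
pirates.data.aden$weekday <- weekdays(as.Date(pirates.data.aden$DateOfOcc))
table(pirates.data.aden$weekday)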

Plot the world map showing attacks binned into hexagons

ggplot()+
	stat_summary_hex(fun="length", data=pirates.data.2, aes(x=coords.x1, y=coords.x2, z=coords.x2)) +
	scale_fill_gradient("Pirate attacks recorded", low="white", high="red") +
	geom_path(aes(x=long, y=lat, group=group), data = worldmap) +
	mb.theme +	# mb.theme is a custom ggplot theme defined elsewhere
	labs(x="", y="") +
	theme(panel.background = element_rect(fill="white"),
		axis.ticks = element_line(colour="white"),
		axis.text = element_text(colour="white"),
		axis.line = element_line(colour="white"),
		panel.grid = element_line(colour=NA)) +
		scale_x_continuous(breaks=NA)+
		scale_y_continuous(breaks=NA)

Number of attacks by year near Aden

ggplot(data=pirates.data.aden, aes(x=year))+
		geom_histogram(binwidth=1, colour="white", fill="dark blue")+
		mb.theme +
		labs(x="Year", y="Number of attacks recorded")

Attacks by distance from shore as a histogram

### V3 is the distance to the nearest coastal point (in degrees)
ggplot(data=subset(pirates.data.aden, year %in% c(2008, 2009)), aes(x=V3))+
		geom_histogram(fill="dark blue", colour="white")+
		mb.theme+
		facet_wrap(~year, ncol=1) +
		labs(x="Distance from shore", y="Number of attacks")

Topic Models analysis of attack descriptions near Aden

library(tm)    # Corpus, VectorSource, DocumentTermMatrix
library(slam)  # row_sums, col_sums
corpus <- Corpus(VectorSource(pirates.data.aden$Desc1))
dtm <- DocumentTermMatrix(corpus)
### Keep only reasonably informative terms (mean tf-idf of at least 0.1) and drop empty documents
term_tfidf <- tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) * log2(nDocs(dtm)/col_sums(dtm > 0))
dtm <- dtm[, term_tfidf >= 0.1]
dtm <- dtm[row_sums(dtm) > 0,]
k=5
SEED=2012
TM <- list(VEM = LDA(dtm, k = k, control = list(seed = SEED)),
		VEM_fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
		Gibbs = LDA(dtm, k = k, method = "Gibbs", control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
		CTM = CTM(dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))))
pirates.data.aden$Topic <- topics(TM[["Gibbs"]], 1)
ggplot(data=pirates.data.aden, aes(x=year, fill=as.factor(Topic), group=as.factor(Topic)))+
		geom_histogram(binwidth=1, colour="white")+
		scale_fill_brewer(palette="Set3", "Topic Group") +
		mb.theme +
		labs(x="", y="")
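For reference, the table of top terms shown earlier can be pulled straight from the fitted model. Assuming the Gibbs fit above, something like this produces it:

### Top 10 terms for each of the five topics (used for the table above)
terms(TM[["Gibbs"]], 10)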

The changing face of “Analysis”


Here’s something I started to write a few months ago but never got around to finishing.

Neil Charles over at Wallpapering Fog has just written an excellent post about the growing importance of R and Tableau to the modern-day analyst. Although not as long in the tooth as Neil (sorry Neil), even in the last 5 years there has been a definite movement towards a much wider skill set for everyday analysis, at least within marketing.

The days of only using the likes of SPSS, SAS and Excel are long gone, as the need to make work more repeatable, scalable and downright flexible has grown. Today’s analyst needs to be comfortable getting hold of new data sources that don’t necessarily sit in an Excel file or in tabular form, manipulating them, applying a statistical technique that they didn’t necessarily learn at university, and then visualising the results (maybe on a map or a small-multiples plot).

Drawing on Neil’s post, I thought I would add my tuppence worth on some of the themes he mentions.

New Languages to learn

I’ve been a fan of R for the last 3 years, enjoying its power and flexibility throughout the analysis process and the much greater repeatability it allows. The sheer number of packages that exist for it has massively levelled the playing field between academia and industry.

To put it bluntly, there are datasets that I wouldn’t be able to access and visualise if I were just using Excel, a similar sentiment to that expressed by the guys behind Processing. SAS is a bit better from what I understand, but still very limited (which might well be why it’s dropped out of the top 50 programming languages).

James Cheshire over at SpatialAnalysis has shown what is possible when visualising data using ggplot2. Even five years ago this would have been the preserve of expensive GIS software available to only a small group of people.
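As a flavour of how little code a basic map now takes, here’s a minimal sketch (just a base world map from the maps package, with no data layered on top):

### A minimal base world map in ggplot2 (sketch only, no data layered on top)
library(ggplot2)
world.df <- map_data("world")   # requires the maps package to be installed
ggplot(world.df, aes(x=long, y=lat, group=group)) +
	geom_polygon(fill="grey90", colour="white") +
	coord_equal() +
	labs(x="", y="")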

Tools like Tableau take this one step further, providing software that makes it very easy to produce excellent-quality graphics and visualisations, and the ability to tell stories with data that, again, would have been very expensive to produce not long ago and would have required a very strong programming background.

Python is my other tool of choice, mainly due to the number of wrappers that have been written for it to access the various APIs that exist. For instance, until recently there was no implementation of OAuth (a way of gaining authorisation to some APIs) in R, whereas Python had this and also has wrappers for the likes of Google Analytics, Facebook and Twitter.

I tend to be a lazy analyst: I’ll find the easiest and quickest way for the computer to do something so that I don’t have to. What that means is that I’ve ended up using R, Python and Tableau as parts of my toolkit. The future is one where learning just one language is likely to leave you in some painful situations when something else could accomplish the task far more easily.

The growing importance of databases

As datasets get larger and more complicated, traditional methods for storing smaller datasets (e.g. in CSVs and Excel files) are becoming increasingly unfit for purpose. Often, datasets are now constantly being added to, and the need to query them in different ways makes databases the way to go.

MySQL is a fantastic and freely available solution which can sit locally on your laptop, on a company server or even in the cloud on an EC2 instance (or similar). It’s easy to set up and easy to manage for small to medium-sized pieces of work.
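Querying it from R is also straightforward. Here’s a minimal sketch using the RMySQL package; the connection details and table name are made up purely for illustration:

### Querying MySQL from R (illustrative only; connection details and table are made up)
library(RMySQL)
con <- dbConnect(MySQL(), dbname = "analysis", host = "localhost",
		user = "analyst", password = "secret")
attacks.by.year <- dbGetQuery(con, "SELECT year, COUNT(*) AS attacks FROM pirate_attacks GROUP BY year")
dbDisconnect(con)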

My other thing of the moment is MongoDB, one of the growing number of NoSQL (which stands for Not only SQL) database solutions out there. Essentially, it’s a way of storing data in a non-tabular format, so each record can have a different set of fields. For anyone familiar with XML or JSON, it’s very similar in thinking to these. I’ve become a big fan due to this flexibility, combined with the speed and the relatively easy-to-understand syntax for writing queries.
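To make the document model concrete, here’s a small sketch from R using the mongolite package (one of several R clients, not necessarily the one I use; the collection and field names are made up), showing two records with different fields sitting happily in the same collection:

### Two documents with different fields in the same collection (illustrative, via mongolite)
library(mongolite)
attacks <- mongo(collection = "attacks", db = "pirates", url = "mongodb://localhost")
attacks$insert('{"year": 2008, "vessel": "tanker", "skiffs": 2}')
attacks$insert('{"year": 2009, "vessel": "cargo", "security_team": true, "outcome": "repelled"}')
attacks$find('{"year": 2009}')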

The one thing to note with databases is that bigger ones need to be designed right. There is an important role for database architecture and administration that is very easy to overlook when you’ve got an analyst who only roughly understands how these things work. Of course, there are also plenty of cases where the data is small enough that poor architecture doesn’t cause any tangible delay in getting the data out.

The result of this trend is that there’s an important interplay between analysts and those involved in more traditional database administration, and in how the needs of the analyst can be accommodated in an agile way.

Collaboration

I’ve been using BitBucket for a couple of years now as the default place that I store my code. Rather than having it on one machine, I store it in the cloud and can access it from anywhere. I can choose whether I want to share the code with no one, everyone or somewhere in between.

It’s a bit of a struggle in places (e.g. those damned two heads), but as a device for sharing and managing version control, it’s a no-brainer.

Keeping up to speed with the latest developments

Probably the most important and hardest to define piece is how to keep up to speed with all this – how to hear about Mongo or Haskell or how to scrape Facebook data. I tend to use a few things:

  1. Netvibes to follow blogs.
  2. Twitter to follow people.
  3. Pinboard to tag the stuff I might need at some point in the future.
  4. StackOverflow and the R mailing list for help with answering questions (which someone has normally come across before).
  5. Online Courses like those run by Stanford to learn new techniques.

That’s it for now, but I’ll doubtless pick up on this thread again in the future. I’d also be really keen to hear what other analysts’ experiences have been over the last few years as new software languages and technologies have become available.

Voter Relationship Management


Customer Relationship Management (CRM) seems to be coming into the mainstream, with the New York Times recently reporting how Target has used such analytics to identify expectant mothers based on their shopping habits and was then able to target them appropriately with special offers and vouchers.

As the 2012 US election approaches, it seems that data analysis is coming of age, being used to target voters on a scale not seen before. It was credited in part for Obama’s win in 2008, where voters were profiled and segmented, just as advertisers tend to segment and cluster their customers based on behaviours, demographics and attitudes.

The growth of Facebook, Twitter and the like since 2008 has added a new dimension to what was a fairly static dataset that shied away from the behavioural dimension. Adding this newly available dimension brings massive new opportunities for market research and targeting: the reaction to a new ad can be evaluated in real time, and A/B testing can help to pick out the messages that work.

Obama’s Data Crunchers

There have been a few interesting pieces about how Obama’s re-election campaign are using methods more commonly associated with consumer marketing to target supporters and voters. This piece in the New York Times gives an overview of some of the team behind the analysis, which includes Rayid Ghani who was previously at Accenture Technology Lab and who has written extensively on Data and Text Mining.

Social CRM seems to be one of the growing areas of buzz, promising a “holistic view of the customer”, with several players claiming the ability to join a user’s various online accounts (e.g. Facebook, Twitter, LinkedIn) together in one place to give a single view, so-called “Social Identity Mapping”. How well this works is still up for debate; Infochimps offers an API with this capability and the results seem to be biased towards the more socially savvy.

A recent set of fundraising emails shows how the data and the analysts are starting to put this to use, with emails tailored to the individual; doubtless there is also a large test-and-learn component, where the emails that yield the highest response are then used more widely. At the same time, Google and Facebook are coining it in by serving ads based on what people say and do online.

Facebook and Politico

Facebook has made its first foray into publishing insight from its data-collecting machine, aggregating individuals’ wall posts and status updates to report back on the Republican primaries and how the various candidates are performing, with the results being published by Politico. This seemed to generate a lot more buzz when it was announced than the ongoing analysis of the campaigns has since; I suspect this is a dry run ahead of November.

And just to be clear, Facebook isn’t handing over users’ information to the Republicans! 

#sherlock & the power of the retweet


Much has been made over the last few days of Sherlock writer Steven Moffat’s views on people who tweet whilst watching TV. Whilst watching it last night, I kept an eye on the tweets during the show, and there was clearly a lot of volume going through the Twittersphere.

Interested to find out a bit more about the volumes, I used this excellent (and well-used) Python script to pull the set of tweets from the beginning of the show through to 9.30am GMT this morning.

First off, a few headline figures:

  • Between 8pm and midnight there were more than 93,000 tweets and retweets.
  • Tweets per minute peaked at 2,608 at 10.30pm (excluding “RT” retweets). That’s more than 43 tweets per second on average.
  • There were more than 10 retweets per second at 10.45pm of @steven_moffat’s “#sherlock Yes of course there’s going to be a third series – it was commissioned at the same time as the second. Gotcha!” tweet.

The data, along with a little Tableau time series visualisation are below.

Update: Tableau Public doesn’t seem to play nicely with a WordPress-hosted blog, so click here to open in a new tab.

