Another Visualisation of 118 Years of US Weather Data


I posted yesterday about weather data sourced from NOAA, looking at how hot this March was compared to previous years. I used a couple of heat maps in R to compare temperatures based on the rank of each year for each state (so if, say, this March in Florida was the hottest since 1895, it would score 118).

To make the visualisation a bit more interactive, I’ve put the data into Tableau Public, making it possible to interact with the map and look at the average temperature across all 48 contiguous states in any year going back to 1895. Clicking on an individual state updates the line chart at the bottom, so you can see how that state’s temperature has moved over the last 117 years (for annual data there are only 117 data points, as we’re only halfway through 2012).

You can access it here: http://public.tableausoftware.com/views/USTemperatureDashboard/Dashboard

(Tableau Public doesn’t play nicely with embedding into WordPress :( )

Visualising the Path of a Genetic Algorithm


The paths of eight runs of a genetic algorithm optimising a multivariate normal mixture function. Three of the runs break out of a local maximum (presumably through mutation) to find the global solution. The circles are the starting points.

We quite regularly use genetic algorithms to optimise the ad-hoc functions we develop when solving problems in applied mathematics. However, it’s a bit disconcerting to have your algorithm roam through a high-dimensional solution space while not being able to picture what it’s doing or how close one solution is to another. With this in mind, and also just out of curiosity, I’ve tried to visualise the path of a genetic algorithm using principal components analysis (PCA).

If you are not familiar with this extremely useful technique, there’s plenty of information online. To put it very briefly: PCA rotates the axes of whatever dimensional space you are working in so that the first axis points in the direction of greatest variation in your data, the second axis in the direction that captures the greatest part of the remaining variation, and so on. The upshot is that if we take the first two axes, we get the best two-dimensional view of the shape of our data.
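As a tiny illustration of the projection step (on simulated data rather than the algorithm’s output), base R’s prcomp does all the work:

### Project 200 simulated 5-dimensional points onto their first two principal components
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)   # 200 points in 5 dimensions
pca <- prcomp(x)                        # rotate the axes to the directions of greatest variance
projected <- predict(pca, x)[, 1:2]     # the best two-dimensional view of the data
summary(pca)                            # proportion of variance captured by each component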

So what I’ve done is run a genetic algorithm several times on the same function, record the paths of each run and then apply PCA to visualise them. To run the genetic algorithm I’ve used the excellent rgenoud package. The code below implements a function that I hope is flexible enough to apply this process to any function that rgenoud can handle: you just need to specify the function, the number of parameters, the number of runs you want and a few other optional parameters.

I’ve tried it on a multivariate normal mixture function (above) and Colville’s function (below). I’d like to try some others when I’ve time and would love to see the results if anyone else would like to make use of it. I would recommend using the parallel option if you can as otherwise plotting many runs can take some time.

Eight runs eventually converge on the global minimum of the Colville function. The circles are the starting points of the algorithm.

If you enjoyed this then please check out more visualisations in Mark’s latest post.

Now here’s the code for the general function.
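What follows is a minimal sketch of such a function rather than a definitive implementation: the name gaPCAPlot and its defaults are placeholders of mine, and the parallel option mentioned above is omitted. It records the path by wrapping the objective so that every candidate rgenoud evaluates is logged, then fits PCA over all the recorded points to give the runs a common two-dimensional projection.

library(rgenoud)
library(ggplot2)

gaPCAPlot <- function(fn, n.params, n.runs = 8, pop.size = 100,
                      max.generations = 30, max = TRUE, ...) {
	runs <- vector("list", n.runs)
	for (i in seq_len(n.runs)) {
		visited <- list()
		# Record every candidate solution the algorithm evaluates
		wrapped <- function(x) {
			visited[[length(visited) + 1]] <<- x
			fn(x)
		}
		genoud(wrapped, nvars = n.params, max = max, pop.size = pop.size,
		       max.generations = max.generations, print.level = 0, ...)
		runs[[i]] <- do.call(rbind, visited)
	}
	# Fit PCA on the points from all runs so they share one projection
	all.points <- do.call(rbind, runs)
	pca <- prcomp(all.points)
	plot.data <- data.frame(predict(pca, all.points)[, 1:2],
	                        run = factor(rep(seq_len(n.runs), sapply(runs, nrow))))
	# Open circles mark the first recorded point of each run
	starts <- do.call(rbind, lapply(split(plot.data, plot.data$run), head, 1))
	ggplot(plot.data, aes(x = PC1, y = PC2, colour = run)) +
		geom_path(alpha = 0.5) +
		geom_point(data = starts, size = 4, shape = 1)
}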

… and here is the code for the examples I’ve included.
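Again as a sketch: the exact mixture optimised above isn’t given, so the component means, weights and covariances below are illustrative, with the mvtnorm package assumed for the normal densities; Colville’s is the standard four-parameter test function, with its global minimum of 0 at (1, 1, 1, 1).

library(mvtnorm)

# A two-component multivariate normal mixture: a smaller local peak for the
# algorithm to get stuck on, and a larger, sharper global peak
mixture <- function(x) {
	0.4 * dmvnorm(x, mean = c(-2, -2), sigma = diag(2)) +
	0.6 * dmvnorm(x, mean = c(2, 2), sigma = diag(2) * 0.5)
}

# Colville's function (four parameters)
colville <- function(x) {
	100 * (x[1]^2 - x[2])^2 + (x[1] - 1)^2 + (x[3] - 1)^2 +
		90 * (x[3]^2 - x[4])^2 +
		10.1 * ((x[2] - 1)^2 + (x[4] - 1)^2) +
		19.8 * (x[2] - 1) * (x[4] - 1)
}

gaPCAPlot(mixture, n.params = 2, n.runs = 8, max = TRUE)    # maximise the mixture
gaPCAPlot(colville, n.params = 4, n.runs = 8, max = FALSE)  # minimise Colville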

118 Years of US State Weather Data


A recent post on the Junkcharts blog looked at US weather data and the importance of explaining scales (which in this case went up to 118). It turns out that 118 is the rank of the data compared to the previous 117 years of data (in ascending order, so that 118 is the highest). At the end of the post, the author wrote:

“I always like to explore doing away with the unofficial rule that says spatial data must be plotted on maps. Conceptually I’d like to see the following heatmap, where a concentration of red cells at the top of the chart would indicate extraordinarily hot temperatures across the states. I couldn’t make this chart because the NOAA website has this insane interface where I can only grab the rank for one state for one year at a time. But you get the gist of the concept.”

In this spirit, then, I wrote a little R script to scrape the data and produced a couple of charts based on it (click on them to get full-size versions). I used Charles Web Proxy to figure out what needs to be sent to the website to return the data I was looking for.

[Chart: a heatmap for March 2012, showing the rank for each state in the latest month]

[Chart: a heatmap for each March going back to 1895]

The code to reproduce and tweak these charts is below:

### Packages needed for the work
library(RCurl)
library(XML)      # readHTMLTable() comes from the XML package, not RCurl
library(ggplot2)
### Get list of US states to tie onto dataset, remove Alaska and Hawaii
us.list.of.states <- readHTMLTable("http://www.worldatlas.com/aatlas/populations/usapoptable.htm")[[1]]
us.list.of.states <- us.list.of.states[ c(-2, -11), ]
### Functions to pull monthly and annual data from the NOAA website
getNOAAdataMonth <- function(state.no, month){
	# NOAA expects the state number zero-padded to three digits
	zeroes <- ifelse(state.no > 9, "0", "00")
	state.string <- paste(zeroes, state.no, sep="")
	# Form fields as reverse-engineered with Charles Web Proxy ("period" is the month number)
	data.in <- postForm("http://climvis.ncdc.noaa.gov/cgi-bin/cag3/hr-display3.pl",
			data_set = "01",
			byear = "1895",
			period = month,
			lyear = "2012",
			strgn = state.string,
			bbeg = "1901",
			bend = "2000",
			trend = "0",
			type = "3",
			rank = "0",
			send.x = "60",
			send.y = "8",
			spec = "")
	data.out <- readHTMLTable(data.in)[[2]]
	data.out$state <- us.list.of.states[state.no, 3]
	data.out}
getNOAAdataAnnual <- function(state.no){
	# NOAA expects the state number zero-padded to three digits
	zeroes <- ifelse(state.no > 9, "0", "00")
	state.string <- paste(zeroes, state.no, sep="")
	# Same form fields as above, but period "17" requests annual rather than monthly data
	data.in <- postForm("http://climvis.ncdc.noaa.gov/cgi-bin/cag3/hr-display3.pl",
			data_set = "01",
			byear = "1895",
			period = "17",
			lyear = "2012",
			strgn = state.string,
			bbeg = "1901",
			bend = "2000",
			trend = "0",
			type = "3",
			rank = "0",
			send.x = "60",
			send.y = "8",
			spec = "")
	data.out <- readHTMLTable(data.in)[[2]]
	data.out$state <- us.list.of.states[state.no, 3]
	data.out}
### Run the functions over the 48 contiguous states
weather.data.annual <- lapply(1:48, getNOAAdataAnnual)
weather.data.march <- lapply(1:48, function(x) getNOAAdataMonth(x, "3"))
### Join lists together into dataframe
weather.data.2 <- do.call("rbind", weather.data.march)
weather.data.annual.2 <- do.call("rbind", weather.data.annual)
### Rename columns for easier use
colnames(weather.data.2) <- c("year", "temp", "rank1", "rank2", "state")
### Subset 2012 data for first chart
weather.data.march2012 <- subset(weather.data.2, year==2012)
weather.data.march2012$fill <- ifelse(as.numeric(as.character(weather.data.march2012$rank1))==118, 1, 0)
### Use the fill column computed above so record-setting states (rank 118) show red
ggplot(weather.data.march2012, aes(x=state, y=as.numeric(as.character(rank1)), fill=fill, label = state))+
		geom_tile()+
		geom_text(size=3)+
		ylab("March 2012 Rank")+
		scale_fill_continuous("", low="white", high="red")+
		ggtitle("All the red at the top means record temperatures across many states")
### Plot all years data (year is a factor in the dataset, so need to convert to numeric)
ggplot(weather.data.2, aes(x=state, y=as.numeric(as.character(year)), fill=as.numeric(as.character(rank1))))+
		geom_tile()+
		coord_flip()+
		scale_fill_continuous("", low="white", high="darkred")+
		ylab("Year")+
		ggtitle("All the red at the right means record temperatures across many states")
