EC2 Tutorial: NumPy and SciPy


Another quick note for getting set up on your EC2 instance. To install SciPy, you first need to install ATLAS and LAPACK. The following few lines of code, run as root (sudo bash), should sort you out:

yum -y install atlas-devel
yum install lapack
pip install scipy

Some new functions I’ve discovered in R


I’ve been writing a fair amount of R recently and have been going through a good learning period. Here are some functions that I’ve discovered (mainly plyr and reshape related) and thought I would share:

merge_all is a good way to merge multiple data frames in one go, rather than chaining several merge commands. The key thing is to put the data frames to be merged within a list – e.g. merge_all(list(df1, df2, df3), by="key").
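For instance, a minimal sketch (assuming merge_all from the reshape package; the data frames and the "key" column here are made up):

library(reshape)   # provides merge_all
df1 <- data.frame(key = 1:3, sales = c(10, 20, 30))
df2 <- data.frame(key = 1:3, spend = c(5, 6, 7))
df3 <- data.frame(key = 1:3, region = c("N", "S", "E"))
merged <- merge_all(list(df1, df2, df3), by = "key")   # one data frame keyed on "key"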

mutate is a good data manipulation function which is similar to transform (both make for much cleaner code when creating a number of variables within a data frame). The key difference is the iterative nature of mutate – variables created earlier in the call can be used when creating later ones.

So, whilst transform(data.frame, variablex = 5, variabley = variablex + 1) won’t work, mutate(data.frame, variablex = 5, variabley = variablex + 1) will work.
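To see the difference on a toy data frame (column names are just illustrative, using plyr’s mutate):

library(plyr)
df <- data.frame(id = 1:3)
# transform() evaluates every argument against the original data frame, so
# variablex is not yet a column and (unless an object of that name already
# exists elsewhere) this errors:
# transform(df, variablex = 5, variabley = variablex + 1)
# mutate() builds the columns one after another, so this works:
mutate(df, variablex = 5, variabley = variablex + 1)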

colwise is a good function for data aggregation when working with wide files. For example, colwise(mean)(data.frame) will return the average of each column in data.frame (there are other ways of doing this, but this makes for quite nice syntax). This example only works if all columns in the data frame are numeric. To get around this, there are two options – use either numcolwise(mean)(data.frame) or colwise(mean, is.numeric)(data.frame) – both accomplish exactly the same thing by subsetting to the numeric columns before applying the function.
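A quick sketch on a made-up mixed data frame shows the three forms side by side:

library(plyr)
df <- data.frame(group = c("a", "b", "a"), x = c(1, 2, 3), y = c(10, 20, 30))
colwise(mean)(df[, c("x", "y")])     # fine once only numeric columns are left
numcolwise(mean)(df)                 # drops the non-numeric column for you
colwise(mean, is.numeric)(df)        # equivalent to numcolwise(mean)(df)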

I’m still getting my head around Higher Order Functions in R (John Myles White has a very good intro to these here) and how to use them, but they seem to be a nice way of writing elegant, easy-to-understand code:

small.even.numbers <- Filter(function (x) {x %% 2 == 0}, 1:10)
my.sum <- function (x) {Reduce(`+`, x)}
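For what it’s worth, running these gives:

small.even.numbers   # 2 4 6 8 10
my.sum(1:5)          # 15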

EC2 Tutorials: Scheduling tasks on EC2 using Crontab


One of my main reasons for wanting an EC2 instance was to be able to automatically run scripts at certain times, normally to collect data and save it to a database. As my EC2 instance is always running, I can forget about it for a month and have a month’s worth of data ready and waiting.

Want to schedule scripts so that they automatically run at set times or at set intervals (e.g. every day at 1pm or every 10 minutes)? No problem: Crontab is the easy-to-use Linux scheduling program that allows such execution in a no-frills, straightforward way.

The syntax works along the lines of:

<minute> <hour> <day of month> <month> <weekday> <task>

For example:

0 2 * * * python /location/of/script.py

will run script.py at 2am every day. Simples. (* is essentially a placeholder to say “do it for all integer values”)
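The other cases mentioned above work in just the same way. Using the same placeholder script path, the first line below runs every day at 1pm and the second every 10 minutes (a */10 step value in the minute field):

0 13 * * * python /location/of/script.py
*/10 * * * * python /location/of/script.py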

Lots more detail can be found here, here and here.

To create or access your crontab, type crontab -e on the EC2 command line.

To see what’s scheduled by crontab, type crontab -l on the EC2 command line.

Google Refine: One of The Best Tools You’ve Probably Never Heard About


Lots of data that’s available online tends to not be the cleanest thing in the world, particularly if you’ve had to scrape it in the first place. At the same time, lots of internal data sets can be just as messy, with columns having different names in what should be identical spreadsheet templates, or with what should be identical values (e.g. “United Kingdom” vs “UK”) not being identical. There are lots more examples that I could mention, but the point is the same:

Google Refine is a great tool for dealing with messy data and turning it quickly and easily into a much better dataset, which then allows for the fun to begin with the analysis and visualisation.

Three handy Google Refine tricks.

My reason for using Refine was to clean up and append additional information to the Information is Beautiful Awards dataset. I’m still learning how to use the software, but found these three things to be very handy and things that I’ve not encountered elsewhere (without writing some code).

Add together multiple different Excel worksheets. This can be an absolute pain when there are either a lot of them, or where each one has some additional or missing columns. In the past, I’ve dealt with this using a combination of Python and VBA. However, Refine makes it incredibly easy to join together multiple Excel worksheets (particularly if they have similarly named columns). NB: this functionality doesn’t seem to exist when plugging into Google spreadsheets.

Reconciliation. My new favourite word. The data set that was provided, whilst exhaustive in some respects, still lacked a lot of potentially interesting information, such as when in the year each film was released, the stars who featured, and so on. My first approach was to use the URLs provided and see if I could scrape the data from each link, as most of them were from a particular website which had a consistent format.

Then I discovered Reconciliation.

This screencast does it far more justice than I can, but essentially the idea is that by reconciling my database with another one (in this case Freebase), there’s a whole lot more information that I can easily add on to my dataset. Think of reconciliation a bit like having primary keys that are universally defined and being able to left join against them. Also, think of remakes of films (Casino Royale, Ocean’s Eleven, etc.) – which version of the film am I referring to?

I simply asked Refine to start reconciling and, by looking at the data in my film name column, it was able to identify that it was a list of films. It then did its best at fuzzy matching the names and, where it wasn’t sure, gave me a list of options from which to choose. I could then choose a confidence level and it would leave me to manually resolve the remaining records that fell below this level.

Facets. Without Data Validation built into data entry, lots of alternative spellings can crop up for what should be the same value (e.g. “United Kingdom” vs “UK”). There are normally a few ways to deal with these, and tools like Tableau make it easy to group such values together.

Refine takes a nice approach to these (and other data validation issues) using what it calls “Facets”, which are essentially a summary of the data combined with tools for manipulating it. What this means is that (a) I can see what mistakes there are in the data and (b) I can then easily correct them.

 

 

Visualising The Correlation Matrix


Following on from the previous post, here is an R function for visualising the correlations between the explanatory variables in your data set.

# *--------------------------------------------------------------------
# | FUNCTION: visCorrel
# | Creates an MDS plot where the distance between variables represents
# | correlation between the variables (closer=more correlated)
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |05/01/2012|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  dataset        A dataframe containing only the explanatory
# |                         variables. It should not contain missing
# |                         values or variables where there is no
# |                         variation
# |          abr            The number of characters used when
# |                         abbreviating the variables. If set to zero
# |                         there is no abbreviation
# |          method         The options are 'metric' or 'ordinal'. The
# |                         default is metric
# *--------------------------------------------------------------------
# | OUTPUTS: graph          An MDS plot where the similarity measure is
# |                         correlation between the variables.
# |
# *--------------------------------------------------------------------
# | USAGE:         visCorrel(dataset,
# |                           abr)
# |
# *--------------------------------------------------------------------
# | DEPENDS: ggplot2, directlabels, MASS (for isoMDS)
# |
# *--------------------------------------------------------------------
# | NOTES:   For more information about MDS please see
# |          http://en.wikipedia.org/wiki/Multidimensional_scaling
# |
# *--------------------------------------------------------------------
library(ggplot2)
library(directlabels)
library(MASS)    # needed for isoMDS when method="ordinal"
visCorrel<-function(dataset, abr, method="metric"){
  #Create correlation matrix
  cor_ts<-cor(dataset)
  n<-dim(cor_ts)[2]
  # Create dissimilarities (1 - |correlation|)
  ones<-matrix(rep(1,n^2), nrow=n)
  dis_ts<-ones-abs(cor_ts)
  # Do MDS
  if (method=="ordinal"){
    fit <- isoMDS(dis_ts, k=2)$points
  } else {
    cmd.res <- cmdscale(dis_ts, k=2, eig=TRUE)
    eig<-cmd.res$eig
    fit<-cmd.res$points
    prop<-sum(abs(eig[1:2]))/sum(abs(eig))
    print(paste("Proportion of squared distances represented:", round(prop*100)))
    if(prop<0.5){print("Less than 50% of squared distance is represented. Consider using ordinal scaling instead")}
  }
  x <- fit[,1]
  y <- fit[,2]
  labels<-row.names(cor_ts)
  if (abr>0){labels<-substr(labels,1,abr)}
  mds_plot<-data.frame(labels, x, y)
  #Plot the results
  g<-ggplot(mds_plot, aes(x=x, y=y, colour=labels))+geom_point() + coord_fixed() + ggtitle("MDS Plot of Correlations")
  direct.label(g, first.qp)
}
# *--------------------------------------------------------------------
# * Examples
# *--------------------------------------------------------------------
# visCorrel(midwest[,4:27],10, method="metric")
# visCorrel(midwest[,4:27],10, method="ordinal")
# visCorrel(Crime[,-c(1,2,11,12)],10)

An interesting example is the North Carolina Crime data set that comes with the plm package. This has the following continuous variables:

crmrte crimes committed per person
prbarr probability of arrest
prbconv probability of conviction
prbpris probability of prison sentence
avgsen average sentence, days
polpc police per capita
density people per square mile
taxpc tax revenue per capita
pctmin percentage minority in 1980
wcon weekly wage in construction
wtuc weekly wage in trns, util, commun
wtrd weekly wage in wholesale and retail trade
wfir weekly wage in finance, insurance and real estate
wser weekly wage in service industry
wmfg weekly wage in manufacturing
wfed weekly wage of federal employees
wsta weekly wage of state employees
wloc weekly wage of local government employees
mix offence mix: face-to-face/other
pctymle percentage of young males

Which is then visualised (using the ordinal option – see below) as this:

The closer the variables are on the plot, the more highly correlated (positively or negatively) they are. Here we can see some interesting relationships. Unsurprisingly, the wage variables form a correlated group. Towards the top of the chart the method correctly identifies three variables – probability of prison sentence, police per capita and offence mix – that are all correlated with one another.

The plot is useful in dealing with multicollinearity as it allows us to spot clusters of correlated variables. For example a set of economic variables might be highly correlated with one another but have a low level of correlation with TV advertising levels. Why is this good to know? Because multicollinearity only affects those variables between which the relationship of collinearity holds. So if we are only interested in obtaining an accurate estimate for the TV advertising level then we need not be concerned about collinearity among the economic variables. Of course this plot deals just with correlation between two variables rather than with full blown collinearity but it’s a good place to start.

The plot works by performing multi-dimensional scaling on dissimilarities derived from the correlations (one minus the absolute correlation). In other words, if we think of correlation as a measure of distance (highly correlated variables being closer together), it finds the best possible two-dimensional representation of those distances. Note that if these distances could be represented in a Euclidean space then this would be equivalent to a plot of the first two dimensions of a principal components analysis. However, correlations cannot in general be represented in this way.

Beware though, the best representation isn’t always a good representation. I’ve included some output which will tell you what proportion of the squared distances is represented in the plot. If it is low I would recommend trying ordinal scaling (the other method option) instead.

I’ve a feeling that applying multi-dimensional scaling to correlations is equivalent to something else that is probably far simpler but I haven’t had time to give it attention. If anyone knows I’d be very interested.

Subsetting a list in R

Quite a handy couple of lines of code to subset a list in R to just those elements which meet a certain condition. Here’s an example to return only those elements of a list which are of a certain class.

Thanks to this StackOverflow answer.

list.condition <- sapply(input.list, function(x) class(x)=="desired.class")
output.list  <- input.list[list.condition]
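As a quick made-up usage example (the element names and target class are arbitrary), keeping only the data frames in a mixed list:

input.list <- list(a = data.frame(x = 1:3), b = "some text", c = data.frame(y = 4:6))
list.condition <- sapply(input.list, function(x) class(x) == "data.frame")
output.list <- input.list[list.condition]   # keeps elements a and c

The same thing can also be written with a higher order function: Filter(function(x) inherits(x, "data.frame"), input.list).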

EC2 Tutorials: Installing new software; yum, pip, easy_install, apt-get


For anyone familiar with Python and easy_install: Amazon Linux uses yum as its package installation system, and it is possible to install pip and easy_install on top of it to manage Python packages.

As I’ve tried to install new software on my box, I’ve found lots and lots of references to sudo apt-get as the standard way to install software when you’re using the Ubuntu flavour of Linux. As an FYI, Amazon Linux is based on Red Hat Linux, so apt-get doesn’t work on an Amazon Linux EC2 box.

Here’s a list of handy stuff to install (thanks to this post for a good starting point):

yum install unixODBC-devel
yum install gcc-gfortran
yum install gcc-c++
yum install readline-devel
yum install libpng-devel libX11-devel libXt-devel
yum install texinfo-tex
yum install tetex-dvips
yum install docbook-utils-pdf
yum install cairo-devel
yum install java-1.6.0-openjdk-devel
yum install libxml2-devel
yum install mysql 
yum install mysql-server 
yum install mysql-devel

For Python packages, there are a couple of options available: easy_install and pip. As pip notes on its PyPI page:

pip is a replacement for easy_install. It mostly uses the same techniques for finding packages, so packages that are easy_installable should be pip-installable as well. This means that you can use pip install SomePackage instead of easy_install SomePackage.

I’ve found that pip tends to be easier to use than easy_install, and more reliable (in terms of where it installs packages to), so I would recommend pip as the best way to install packages for Python. This link shows you how to install it. I installed it from source onto EC2.

#sherlock & the power of the retweet


Much has been made over the last few days of Sherlock writer Steven Moffat‘s views on people who tweet whilst watching TV. Whilst watching it last night, I kept an eye on the tweets during the show and there was clearly a lot of volume going through the Twittersphere.

Interested to find out a bit more about the volumes, I used this excellent (and well used) Python script to pull the set of tweets from the beginning of the show through to 9.30am GMT this morning.

First off, a few headline figures:

  • Between 8pm and midnight there were more than 93,000 tweets and retweets.
  • Tweets per minute peaked at 2,608 at 10.30pm (excluding “RT” retweets). That’s more than 43 tweets per second on average.
  • There were more than 10 retweets per second at 10.45pm of @steven_moffat’s “#sherlock Yes of course there’s going to be a third series – it was commissioned at the same time as the second. Gotcha!” tweet.

The data, along with a little Tableau time series visualisation, are below.

Update: Tableau Public doesn’t seem to play nicely with a WordPress-hosted blog, so click here to open in a new tab.


Multicollinearity and Ridge Regression


In marketing mix modelling you have to be very lucky not to run into problems with multicollinearity. It’s in the nature of marketing campaigns that everything tends to happen at once: the TV is supported by radio, both are timed to coincide with the relaunch of the website. One of the techniques that is often touted as a solution is ridge regression. However there is quite a bit of disagreement over whether this works. So I thought we’d just try it out with the simulated sales data I created in the last post.

In fact we’ll need to modify that data a little as we need a case of serious multicollinearity. I’ve adjusted the TV campaigns to ensure that they always occur in the same winter months (not uncommon in marketing mix data) and I’ve added radio campaigns alongside the TV campaigns. Here is the modified code.

  #TV now coincides with winter. Carry over is dec, theta is dim, beta is ad_p
  tv_grps<-rep(0,5*52)
  tv_grps[40:45]<-c(390,250,100,80,120,60)
  tv_grps[92:97]<-c(390,250,100,80,120,60)
  tv_grps[144:149]<-c(390,250,100,80,120,60)
  tv_grps[196:201]<-c(390,250,100,80,120,60)
  tv_grps[248:253]<-c(390,250,100,80,120,60)
  if (adstock_form==2){adstock<-adstock_calc_2(tv_grps, dec, dim)}
  else {adstock<-adstock_calc_1(tv_grps, dec, dim)}
  TV<-ad_p*adstock
  # Accompanying radio campaigns (index ranges matched to the number of spend values)
  radio_spend<-rep(0,5*52)
  radio_spend[40:43]<-c(100, 100, 80, 80)
  radio_spend[92:95]<-c(100, 100, 80, 80)
  radio_spend[144:146]<-c(100, 100, 80)
  radio_spend[196:199]<-c(100, 100, 80, 80)
  radio_spend[248:252]<-c(100, 100, 80, 80, 80)
  radio<-radio_p*radio_spend

The sales data now looks like this:

The correlation matrix of the explanatory variables shows that we have serious multicollinearity issues even when only two variables are taken at a time.

> cor(test[,c(2,4:6)])
                  temp radio_spend        week     adstock
temp         1.0000000 -0.41545174 -0.15593463 -0.47491671
radio_spend -0.4154517  1.00000000  0.09096521  0.90415219
week        -0.1559346  0.09096521  1.00000000  0.08048096
adstock     -0.4749167  0.90415219  0.08048096  1.00000000

What is this going to mean for the chances of recovering the parameters in our simulated data set? Well we know that even with heavy multicollinearity our estimates using linear regression are going to be unbiased; the problem is going to be their high variance.

We can show this quite nicely by generating lots of examples of our sales data (always with the same parameters but allowing a different random draw each from the normally distributed error term) and plotting the distribution of the estimates arrived at using linear regression. (See Monte Carlo Simulation for more details about this kind of technique.)


You can see that on average the estimates for tv and radio are close to correct but the distributions are wide. So for any one instance of the data (which in real life is all we have) chances are that our estimate is quite wide of the mark. The data and plots are created using the following code:

library(ggplot2)
library(reshape)   # melt() from the original reshape package gives the X2 column used below
coefs<-NA
for (i in 1:10000){
   sim<-create_test_sets(base_p=1000,
                       trend_p=0.8,
                       season_p=4,
                       ad_p=30,
                       dim=100,
                       dec=0.3,
                       adstock_form=1,
                       radio_p=0.1,
                       error_std=5)
  lm_std<-lm(sales~week+temp+adstock+radio_spend, data=sim)
  coefs<-rbind(coefs,coef(lm_std))
}
col_means<-colMeans(coefs[-1,])
for_div<-matrix(rep(col_means,10000), nrow=10000, byrow=TRUE)
mean_div<-coefs[-1,]/for_div
m_coefs<-melt(mean_div)
ggplot(data=m_coefs, aes(x=value))+geom_density()+facet_wrap(~X2, scales="free_y") + scale_x_continuous('Scaled as % of Mean')

What does ridge regression do to fix this? Ridge regression is best explained using a concept more familiar in machine learning and data mining: the bias-variance trade off. The idea is that you will often achieve better predictions (or estimates) if you are prepared to swap a bit of unbiasedness for much less variance. In other words the average of your predictions will no longer converge on the right answer but any one prediction is likely to be much closer.

In ridge regression we have a parameter lambda that controls the bias-variance trade off. As lambda increases our estimates become more biased but their variance decreases. Cross-validation (another machine learning technique) is used to estimate the best possible setting of lambda.
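As a rough sketch of that step (create_test_sets is the simulation function from the earlier post, and the formula matches the model used below), MASS provides a generalised cross-validation shortcut for lm.ridge fitted over a grid of lambdas:

library(MASS)
sim <- create_test_sets(base_p=1000, trend_p=0.8, season_p=4, ad_p=30,
                        dim=100, dec=0.3, adstock_form=1, radio_p=0.1, error_std=5)
ridge_grid <- lm.ridge(sales ~ week + temp + adstock + radio_spend,
                       data = sim, lambda = seq(0, 20, 0.1))
select(ridge_grid)   # reports the HKB and L-W estimates and the lambda with smallest GCV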

So let’s see if ridge regression can help us with the multicollinearity in our marketing mix data. What we hope to see is a decent reduction in variance but not at too high a price in bias. The code below simulates the distribution of the ridge regression estimates of the parameters for increasing values of lambda.

library(MASS)
for (i in 1:1000){
   sim<-create_test_sets(base_p=1000,
                       trend_p=0.8,
                       season_p=4,
                       ad_p=30,
                       dim=100,
                       dec=0.3,
                       adstock_form=1,
                       radio_p=0.1,
                       error_std=5)
  lm_rg<-lm.ridge(sales~week+temp+adstock+radio_spend, data=sim, lambda = seq(0,20,0.5))
  if (i==1){coefs_rg<-coef(lm_rg)}
  else {coefs_rg<-rbind(coefs_rg,coef(lm_rg))}
}
colnames(coefs_rg)[1]<-"intercept"
m_coefs_rg<-melt(coefs_rg)
names(m_coefs_rg)<-c("lambda", "variable", "value")
ggplot(data=m_coefs_rg, aes(x=value, y=lambda))+geom_density2d()+facet_wrap(~variable, scales="free")

The results are not encouraging. Variance decreases slightly for tv and radio; however, the cost in bias is far too high.

I’m aware that this by no means proves that ridge regression is never a solution for marketing mix data but it does at least show that it is not always the solution and I’m inclined to think that if it doesn’t work in a simple situation like this then it doesn’t work very often.

However I might try varying the parameters for the simulated data set to see if there are some settings where it looks more promising.

Still, for now, I won’t be recommending it as a solution to multicollinearity in marketing mix models.

A good explanation of ridge regression can be found in this post.

Making an R Package: Not as hard as you think


I’ve been writing functions in R for a while to do various things like talking to APIs, web scraping, model testing and data visualisation (basically things which can get a bit repetitive!), but have always been slightly intimidated about turning those functions into a package, which I could then call using library(package-name). Note that I’m writing this from a Windows perspective, but it should work just as well on other platforms (you just won’t need to install Rtools).

I finally took the leap today and created my first package… and it turned out to not be too hard to do. This guide by Rob Hyndman is a good place to start, as it covers the key steps that you need to go through.

There are three key steps to building a package (assuming that you have already written the functions that are to be included in the package!).

  1. Create a skeleton package.
  2. Fill in the Documentation and Help Files.
  3. Use Rcmd to build and install the package.

1. Create a skeleton package

 package.skeleton("package-name") 

will get you started by building a base package which contains the objects that are currently loaded into your R workspace (note that this will build the package into the current working directory, so set this prior to creating the skeleton).
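If you would rather not sweep up everything in your workspace, package.skeleton can also be pointed at named objects (the package and function names here are hypothetical):

my.sum <- function(x) Reduce(`+`, x)
package.skeleton(name = "mytools", list = c("my.sum"))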

2. Fill in the Documentation and Help Files.

After this has run successfully, you then need to fill in the DESCRIPTION file (things like the author name, plus a Depends: line for any other packages that the package you’re creating depends on). You also need to write the help files for the functions, which are a set of .Rd files within the package’s man folder.
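For reference, DESCRIPTION is just a set of field: value lines; something along these lines (every value below is a placeholder) is enough to get going:

Package: mytools
Title: Assorted Helper Functions
Version: 0.1
Author: Your Name
Maintainer: Your Name <you@example.com>
Description: Utility functions for web scraping, model testing and data visualisation.
Depends: plyr, ggplot2
License: GPL-2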

3. Use Rcmd to build and install the package.

The final step is to actually build your package, and if you’re doing this on Windows then you need to have installed Rtools which is available from here.

Once this is installed, you need to use the command line to run the following commands:

CD /to/file/path/package
Rcmd INSTALL --build packagename

This will install the package and also create a ZIP file in the same folder as where the skeleton package was created, so that you can then distribute this to others (assuming that they’re on the same OS as you).

It’s then relatively easy to update the package with new functions or data:
1. Add the function to the r-package/R directory (where the other functions already sit)
2. The file should be named with .R as the extension
3. Create a new manual page (ending with the extension .Rd) in r-package/man directory (use one of the existing help files as a template).
4. If there are any new dependencies, add them to the “Depends:” line in DESCRIPTION.

There’s a lot more detail that can be gone into (e.g. S3 and S4 classes), but to get through the basics, the above and the guide from Rob Hyndman should sort you out.
