## Scoring a Neural Net using R on AWS

One of the drawbacks with R has been its limitation with big datasets. It stores everything in RAM, so once you have more than 100K records your PC really starts to slow down. However, since AWS allows you to use any size of machine, you could now consider using R for scoring out your models on larger datasets. Just fire up a meaty EC2 instance with the RStudio Amazon Machine Image (AMI) and off you go.

With this in mind I wondered how long it would take to score up a Neural Net depending on how many variables were involved and how many records you need to score out. There was only one way to find out.

## Dendrograms in R2D3

Hi, I’m Andrew and this is my first post for Coppelia! If you like the look of this feel free to visit my blog dinner with data (and see what happens when a data scientist hits the kitchen!)

I was excited by James’s last post on the new package R2D3, and I thought I would try to help develop the package further. This is a great new package, built by James Thomson (in collaboration with myself and Simon Raper at Coppelia), that utilises D3 visualisations inside R. You can quickly create very striking visualisations with just a few lines of code. The package was shared in a recent post, but since then a couple of updates have been made to increase the functionality.

In particular, the updates are to the function D3Dendro, which creates dendrograms based on an hclust object in R. I had been working on a number of alternatives to the usual static dendrogram found in the package so far, so I thought I would add these in and describe them below.

I have created two new distinct functionalities:

• Collapsible nodes

You can clone the package from James’s github repository or run the following in R:

    install.packages("devtools")
    library(devtools)
    install_github("jamesthomson/R2D3")
    library(R2D3)

I will include the example in the original post, so you can easily compare the differences.

Original dendrogram:

    hc <- hclust(dist(USArrests), "ave")
    JSON <- jsonHC(hc)
    D3Dendro(JSON, file_out="USArrests_Dendo.html")

## Introducing R2D3

R2D3 is a new package for R I’ve been working on. As the name suggests this package uses R to produce D3 visualisations. It builds on some work I previously blogged about here.

There are some similar packages out there on CRAN already, notably rjson and d3Network. However, I found that these packages covered parts of the process (creating a json or creating a D3) but not the whole process, and they did not ensure the json was in the right format for the D3. So that was the thinking with this package. I was aiming to create an end to end process for converting R objects into D3 visualisations. When I mentioned it to Simon@Coppelia he was keen to contribute, so we’ve been collaborating on it over the last few weeks. It’s by no means finished, but I think it contains enough that it’s worth sharing.

## Converting an R HClust object into a D3.js Dendrogram

Hi all I’m James. This is my first blog for Coppelia. Thanks to Simon for encouraging me to do this.

I’ve been doing a lot of hierarchical clustering in R and have started to find the standard dendrogram plot fairly unreadable once you have over a couple of hundred records. I’ve recently been introduced to the D3.js gallery and I wondered if I could hack something better together. I found this dendrogram I liked and started to play. I soon realised that in order to get my data into it I needed a nested json.
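To give a flavour of what “nested” means here, this is a minimal base R sketch that walks the `merge` matrix of an hclust object and builds a nested list. The helper names `to_nested` and `count_leaves` are made up for illustration, and the `name`/`children` keys are an assumption about what the D3 layout expects. Turning the list into JSON text would still need a JSON package.

```r
hc <- hclust(dist(USArrests), "ave")

# Recursively turn the hclust merge matrix into a nested list.
# In hc$merge, negative entries are single observations and positive
# entries point to an earlier merge step (a row of the matrix).
to_nested <- function(hc, node = nrow(hc$merge)) {
  children <- lapply(hc$merge[node, ], function(i) {
    if (i < 0) list(name = hc$labels[-i]) else to_nested(hc, i)
  })
  # "name" and "children" are assumed key names, not confirmed by the post
  list(name = paste0("node", node), children = children)
}

tree <- to_nested(hc)

# Count the leaves to check that every observation made it into the tree
count_leaves <- function(x) {
  if (is.null(x$children)) 1 else sum(sapply(x$children, count_leaves))
}
n.leaves <- count_leaves(tree)
n.leaves
```

The last row of the merge matrix is always the final merge, which is why the recursion starts there.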

## Buster – a new R package for bagging hierarchical clustering

I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, not for hierarchical clustering.

It’s not a typical everyday human dilemma. It needs a bit more explanation.

## Converting a dendrogram into a graph for a D3 force directed layout

I wrote this code for a project that didn’t work out but I thought I’d share. It takes the dendrogram produced by hclust in R and converts it into json to be used in a D3 force directed graph (slicing the dendrogram near the top to create a couple of clusters). The dendrogram in R looks like this

Dendrogram of clustering on US arrests

And the end result in D3 is this

## Visualising cluster stability using Sankey diagrams

###### The problem

I wanted a way of understanding how a clustering solution will change as more data points are added to the dataset on which it is built.

To explain this a bit more, let’s say you’ve built a segmentation on customers, or products, or tweets (something that is likely to increase) using one or other clustering solution, say hierarchical clustering. Sooner or later you’ll want to rebuild this segmentation to incorporate the new data and it would be nice to know how much the segmentation will change as a result.

One way of assessing this would be to take the data you have now, roll it back to a previous point in time and then add new chunks of data sequentially each time rebuilding the clustering solution and comparing it to the one before.
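The roll back and rebuild idea can be sketched in a few lines of base R. This is only an illustration: the starting size of 20 rows, the step of 5 rows and the choice of 3 clusters are assumptions made for the example, and USArrests simply stands in for your own data.

```r
# Roll the data back to 20 rows, then add 5 rows at a time (assumed sizes)
sizes <- seq(20, nrow(USArrests), by = 5)
k <- 3  # assumed number of clusters

# One clustering solution per snapshot of the data
solutions <- lapply(sizes, function(n) {
  hc <- hclust(dist(USArrests[1:n, ]), "ave")
  cutree(hc, k = k)
})

# Compare each solution with the one before, on the points they share
transitions <- lapply(seq_along(solutions)[-1], function(i) {
  old <- solutions[[i - 1]]
  new <- solutions[[i]][names(old)]
  table(before = old, after = new)
})

# How the first 20 points moved between clusters after adding 5 more
transitions[[1]]
```

Each table counts how many points moved from one cluster to another between two consecutive rebuilds, so a stable solution shows up as a table with a single non-zero cell in each row.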

###### Seeing what’s going on

Having recorded the different clusters that result from incrementally adding data, the next problem is to understand what is going on. I thought a good option would be a Sankey diagram. I’ve tested this out on the US crime data that comes with R. I built seven different clustering solutions using hclust, each time adding five new data points to the original 20 point data set. I used the google charts Sankey layout which itself is derived from the D3 layout. Here’s the result.

## gist: dfToJSON

I’ve been using a lot of JavaScript charting and visualisation libraries recently (e.g. D3, Highcharts) and found that it is quite painful to get my data into the JSON structure required by each library. Since I’m doing most of the data manipulation in R anyway, it makes sense to arrange the data as a nested list in R and then transform it to JSON using rjson. In this function I’ve catered for the structures needed for most of the Highcharts data, but many other structures could easily be added.

## Include uncertainty in a financial model

On-line resources for analysts are often either too general to be of practical use or too specialised to be accessible. The aim of ragscripts.com is to remedy this by providing start to finish directions for complex analytical tasks. The site is under construction at the moment but any feedback is most welcome!

### The problem

You’ve been asked to calculate some figure or other (e.g. end of year revenue, average customer lifetime value) based on numbers supplied from various parts of the business. You know how to make the calculation but what bothers you is that some of the figures going in are clearly approximations or at worst complete guesses. Therefore the final calculation will inherit this uncertainty. A decision will be made based on your final figure and more likely than not the decision makers will treat this number as a certainty. They won’t have any way of knowing how uncertain it is and may well draw the wrong conclusions. How do you get an understanding of the uncertainty around the final calculation and how do you explain it to the decision makers?

The standard approach in finance is to produce a set of best and worst case scenarios. These are tedious to produce and are often disregarded as unlikely outcomes. I think our approach is better as it gives a full picture of the uncertainty around an estimate.

### What you’ll need

• Basic R programming skills (including writing functions)
• An understanding of probability distributions
• About 1-3 hours (depending on the complexity of the model)

### Example

Let’s say you are asked to project the revenue of a business in one year’s time based on the following numbers:

• Customer base at the beginning of the year: 50,000
• Average revenue per customer: £7
• Monthly acquisition rate: 5%
• Monthly churn rate: 2%

You might have thought that these figures could be known with some certainty, but believe me it’s actually very rare, especially in large businesses where integrating systems to get an overall picture is a tough job. Add to this the fact that we are guessing at how well any current figures are going to hold out over the year and we should agree that there’s a fair bit of uncertainty in the inputs.

With this in mind you ask the suppliers of the data for a lower and upper bound for the figures, i.e. the lowest and highest numbers that they would expect to see without being very surprised. Where they are unable or unwilling to do this you use your common sense. This gives you the following table:

| Statistic | Estimate | Lower Bound | Upper Bound |
| --- | --- | --- | --- |
| ARPC | £7 | £4 | £10 |
| Acquisition Rate | 5% | 3% | 7% |
| Churn Rate | 2% | 1% | 3% |

### Do it

##### Step 1: Setting up the distributions

There are lots of tools you could do this with but R seems most appropriate, so start up a session. We are going to model our uncertainty about the input values in each case as a normal distribution. We will say that the mean of the distribution is the value the business has estimated and that the 5% and 95% quantiles are the lower and upper bounds respectively.

Yes, this is subjective (we could have chosen other values for the bounds), but the point of this exercise is not to make incredibly accurate statistical inferences; rather it is to show the consequences that different levels of uncertainty in the inputs have on the output. You can play with these parameters if you wish and say things like “if we were this accurate with our inputs then we would be this accurate with our output”.

Let’s start by calculating the parameters (mean, standard deviation) for the three normal distributions we’ll need.

Let $q_{95}$ be the 95% quantile (our upper bound). The 95% quantile of a standard normal distribution is roughly 1.645, so

$q_{95} = \mu + 1.645 \sigma$. Therefore $\sigma = \frac{q_{95}-\mu}{1.645}$

#Work out the standard deviation for arpu
arpu.sd<-3/1.645
#Plot the resulting distribution
x<-seq(0, 15,by=0.5)
d<-dnorm(x, 7, arpu.sd)
plot(x, d, type='l')


This should give a plot that meets our expectations and it does:

We’ll complete this step by doing the same for acquisition and churn.

#Do the same for acquisition and churn
acq.sd<-0.02/1.645
x<-seq(0, 0.2,by=0.001)
d<-dnorm(x, 0.05, acq.sd)
plot(x, d, type='l')
ch.sd<-0.01/1.645
x<-seq(0, 0.2,by=0.001)
d<-dnorm(x, 0.02, ch.sd)
plot(x, d, type='l')
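The three standard deviations above all come from the same formula, so a small helper can make that explicit. This is only a sketch: `sd_from_upper` is a name made up for this example, and it assumes the bounds are symmetric around the estimate (as they are in the table).

```r
# Derive the standard deviation of a normal belief distribution from
# the central estimate and the upper bound (taken as the 95% quantile)
sd_from_upper <- function(estimate, upper, p = 0.95) {
  (upper - estimate) / qnorm(p)
}

arpu.sd <- sd_from_upper(7, 10)
acq.sd  <- sd_from_upper(0.05, 0.07)
ch.sd   <- sd_from_upper(0.02, 0.03)

# Check: the 5% and 95% quantiles should recover the stated bounds
arpu.bounds <- qnorm(c(0.05, 0.95), mean = 7, sd = arpu.sd)
acq.bounds  <- qnorm(c(0.05, 0.95), mean = 0.05, sd = acq.sd)
ch.bounds   <- qnorm(c(0.05, 0.95), mean = 0.02, sd = ch.sd)
rbind(arpu.bounds, acq.bounds, ch.bounds)
```

Using `qnorm(0.95)` rather than the rounded 1.645 makes almost no difference to the numbers, but it does make it easy to try other quantiles for the bounds.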

##### Step 2: A function for calculating revenue

The next piece you need is a function for calculating the output (the statistic of interest) from the inputs. This is what would usually be captured in the formulae of an excel spreadsheet.

For our example it looks like this:

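revenue<-function(arpu, acq, ch){
  #Customer base at the beginning of the year
  num.cust<-50000
  #Apply monthly acquisition and churn for 12 months
  for (m in 1:12){
    num.cust<-num.cust+acq*num.cust-ch*num.cust
  }
  #Revenue from the customer base at the end of the year
  return(num.cust*arpu)
}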
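As a quick sanity check on the function above, the loop is just compounding the net monthly growth rate (acquisition minus churn) over 12 months, so it should agree with a closed form. The name `revenue.closed` is made up for this sketch.

```r
revenue <- function(arpu, acq, ch) {
  num.cust <- 50000
  for (m in 1:12) {
    num.cust <- num.cust + acq * num.cust - ch * num.cust
  }
  return(num.cust * arpu)
}

# Closed form: compound the net growth rate (acq - ch) over 12 months
revenue.closed <- function(arpu, acq, ch) {
  50000 * (1 + acq - ch)^12 * arpu
}

# Both should give the same figure for the original estimates
loop.value   <- revenue(7, 0.05, 0.02)
closed.value <- revenue.closed(7, 0.05, 0.02)
c(loop.value, closed.value)
```

With the original estimates both versions come out at about 499k.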

##### Step 3: Making random draws from our belief distributions

We will now generate 10 thousand values from each of our normally distributed input variables to create 10 thousand simulations. Each of these simulations represents a possible set of inputs to our model. For example one set might be

• Average revenue per customer: £7.50
• Monthly acquisition rate: 3%
• Monthly churn rate: 1%

However, the way we’ve set things up, unlikely values for each input are unlikely to be drawn from our distributions. In fact, since 10 thousand is a pretty large number, the frequency at which each of the possible values is drawn should reflect our initial beliefs fairly well.

Here is our code for simulating the input values:

#Now let's simulate 10k values from each of our distributions
sim.arpu<-rnorm(10000, 7, arpu.sd)
sim.acq<-rnorm(10000, 0.05, acq.sd)
sim.ch<-rnorm(10000, 0.02, ch.sd)


Note a couple of things we haven’t accounted for. We’ve assumed that our beliefs about our input variables are independent of one another, but that might not be the case. We might think for some reason that if churn is higher then acquisition will be lower. If so, we want to draw our values from a multivariate distribution incorporating correlation between the variables. However, assuming independence is an easy and useful place to start. Also, we’ve assumed that our model itself is correct, i.e. that constant rates of churn and acquisition apply and that there are no other effects (for example seasonal or economic).
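If you did want correlated inputs, one way to sketch it in base R is to draw independent standard normals and mix them using the Cholesky factor of the covariance matrix. The correlation of -0.5 between acquisition and churn below is purely an assumed value for illustration, not something estimated from data.

```r
set.seed(42)
n <- 10000
acq.sd <- 0.02/1.645
ch.sd  <- 0.01/1.645
rho    <- -0.5  # assumed correlation between acquisition and churn

# Covariance matrix for (acquisition, churn)
Sigma <- matrix(c(acq.sd^2,              rho * acq.sd * ch.sd,
                  rho * acq.sd * ch.sd,  ch.sd^2), nrow = 2)

# chol() returns an upper triangular R with t(R) %*% R equal to Sigma,
# so multiplying independent draws by R gives draws with covariance Sigma
Z <- matrix(rnorm(2 * n), ncol = 2)
draws <- Z %*% chol(Sigma)

sim.acq <- 0.05 + draws[, 1]
sim.ch  <- 0.02 + draws[, 2]

# The sample correlation should be close to rho
cor(sim.acq, sim.ch)
```

These `sim.acq` and `sim.ch` vectors can then be used in place of the independent draws in the rest of the steps.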

##### Step 4: Running them through the function

Now everything is in place and it’s a simple matter to apply our function to each of the 10 thousand simulated sets of input variables. We use R’s mapply function.

sim.rev<-mapply(revenue, sim.arpu, sim.acq, sim.ch)

##### Step 5: Examining the results

Since our simulated input variables were distributed in a way that reflects our beliefs about their likely values, our final revenue calculations, based on these simulated inputs, will also have a distribution that reflects these uncertainties. The next few lines of code show us this distribution.

summary(sim.rev)
hist(sim.rev)
plot(density(sim.rev))


The histogram of the possible values for annual revenue makes the point of doing all of this quite clear. A straight calculation based on the original input values gives a revenue value of 499k. However a quick glance at the histogram shows that revenues as low as 400k are not at all improbable. Any planning should account for this!
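To put numbers on what the histogram shows, you can summarise the simulated revenues with quantiles and with the share of simulations that fall below a threshold. This sketch reruns the whole example so that it stands on its own. The seed and the 400k threshold are just choices made for the illustration, and the exact figures will vary a little with the seed.

```r
set.seed(1)

revenue <- function(arpu, acq, ch) {
  num.cust <- 50000
  for (m in 1:12) {
    num.cust <- num.cust + acq * num.cust - ch * num.cust
  }
  return(num.cust * arpu)
}

# Simulate the inputs and run them through the model
n <- 10000
sim.arpu <- rnorm(n, 7, 3/1.645)
sim.acq  <- rnorm(n, 0.05, 0.02/1.645)
sim.ch   <- rnorm(n, 0.02, 0.01/1.645)
sim.rev  <- mapply(revenue, sim.arpu, sim.acq, sim.ch)

# Median and a 90% interval for revenue
rev.q <- quantile(sim.rev, c(0.05, 0.5, 0.95))
rev.q

# Share of simulations where revenue comes in below 400k
p.below <- mean(sim.rev < 400000)
p.below
```

A statement like “there is roughly an x% chance that revenue comes in below 400k” is often easier for decision makers to act on than the histogram itself.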

### See it

Here’s a quick JSfiddle to see how inputs affect outputs in the example case.

### Explain it

So how do you explain what you have done and why you have done it?

Chances are you may run into the analyst’s catch 22 (see skit). You’ve done a much better job because you’ve incorporated uncertainty into your forecast, allowing the decision maker to make a more informed decision. However, people like certainty and the very fact you have done this may be perceived as incompetence: “What do you mean you don’t know what the revenue figure is going to be? Go back and do it again!”

So either you compromise your own integrity by glossing over the uncertainty or you’re branded an idiot or at least a boffin who overcomplicates things!

But there are some ways around this. Try to find out what decisions are being made using the data. Some of the frustration will come from being handed something that doesn’t solve the problem and the decision maker may have a point: if the range of accuracy is not enough to make a decision the model may need to be improved in some way.

Also talking about the decision and the problem from the outset is a good way of driving home the value of what you are doing. The very purpose of showing the uncertainty in the estimate is to prevent misinformed decisions being made.

The decision in our example might be whether or not we can invest in a new product now, given that we still need to be in profit by the end of the year. Our model as it is shows that if all other outgoings come to 150k then we’d be taking a risk spending over 50k on the new product (since there’s a reasonable chance that revenue could be as low as 200k). If 50k was what we were planning then we’ve done the job. If however a new product can’t be done for less than 100k then it’s worth spending some time improving our model.

### Fork it

Take my code and adapt it for your own models. You can find the gist here

## Box Me

Here’s a short R function I wrote to turn a long data set into a wide one for viewing. It’s not the most exciting function ever but I find it quite useful when my screen is wide and short. It simply cuts the data set horizontally into equal size pieces and puts them side by side. Lazy I know!

#'boxMe
#'
#'Turns an overly long data frame into something easier to look at
#'
#' @param d A dataframe or matrix
#' @param nrow The number of rows you would like to see in the new dataframe
#' @examples
#' test.set<-data.frame(x=rnorm(100), y=rnorm(100))
#' boxMe(test.set, 18)
#'
#' library(ggplot2)
#' boxMe(diamonds, 10)
boxMe<-function(d, nrow){
  # Number of rows and columns
  r<-dim(d)[1]
  c<-dim(d)[2]
  rem<-r %% nrow # Number of blank rows
  reps<-floor(r/nrow) # Number of folds
  s<-seq(1, reps*nrow, by=nrow) # Breaks
  box<-d[1:nrow,] # First col
  for (i in s[-1]){
    ap<-d[i:(i+nrow-1),]
    box<-cbind(box, ap)
  }
  #Append remainder
  if (rem>0){
    n.null.rows<-nrow-rem
    rem.rows<-d[(reps*nrow+1):r,]
    null.block<-as.data.frame(matrix(rep(NA, (n.null.rows*c)), nrow=n.null.rows))
    names(null.block)<-names(rem.rows)
    last.block<-rbind(rem.rows, null.block)
    box<-cbind(box, last.block)
  }
  return(box)
}