## A decision process for selecting statistical techniques

In this chart (detail above, full version below) I’ve tried to capture the decision process I go through to select the most promising statistical or machine learning technique given the problem and the data.

It’s a heuristic in the sense given in Wikipedia:

A heuristic technique often called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. (Wikipedia)

It certainly isn’t perfect but it is practical! In particular it’s worth bearing in mind that

• It does not cover the tests you’d need to go through to establish whether a technique is being applied correctly. Also where a technique is sophisticated I’d probably start with something simpler and then work towards the more complex technique.
• There are of course many other available techniques but these are ones I use a lot.
• Some personal preferences are also built in. For example I tend to go for a Bayesian model whenever the problem does not call for a model using a linear combination of explanatory variables as I find it easier to think about the more unusual cases in this way.

This diagram was made with fantastic draw.io. Click into it for the full version.

## Modelling for decisions

Here’s the deck I presented today at the Predictive Analytics Innovation conference in London.

The idea was to look at ways in which we might use modern statistical methods to help build models that are grounded firmly on common sense intuitions and to do it all in a very short time.

If you are interested in more details just let me know on twitter or drop me an email

## Scoring a Neural Net using R on AWS

One of the drawbacks with R has been its limitation with big datasets. It stores everything in RAM so once you have more than 100K records your PC really starts to slow down. However, since AWS allows you to use any size machine, you could now consider using R for scoring out your models on larger datasets. Just fire up a meaty EC2 with the RStudio amazon machine image (AMI) and off you go.

With this in mind I wondered how long it would take to score up a Neural Net depending on how many variables were involved and how many records you need to score out. There was only one way to find out.

## Buster – a new R package for bagging hierarchical clustering

I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, not for hierarchical clustering.

It’s not a typical everyday human dilemma. It needs a bit more explanation. Read more

## Distribution for the difference between two binomially distributed random variables

I was doing some simulation and I needed a distribution for the difference between two proportions. It’s not quite as straightforward as the difference between two normally distributed variables and since there wasn’t much online on the subject I thought it might be useful to share.

$X \sim Bin(n_1, p_1)$

$Y \sim Bin(n_2, p_2)$

We are looking for the probability mass function of $Z=X-Y$

First note that the min and max of the support of Z must be $(-n_2, n_1)$ since that covers the most extreme cases ($X=0$ and $Y=n_2$) and ($X=n_1$ and $Y=0$).

Then we need a modification of the binomial pmf so that it can cope with values outside of its support.

$m(k, n, p) = \binom {n} {k} p^k (1-p)^{n-k}$ when $k \leq n$ and 0 otherwise.

Then we need to define two cases

1. $Z \geq 0$
2. $latex Z < 0$ In the first case $latex p(z) = \sum_{i=0}^{n_1} m(i+z, n_1, p_1) m(i, n_2, p_2)$ since this covers all the ways in which X-Y could equal z. For example when z=1 this is reached when X=1 and Y=0 and X=2 and Y=1 and X=3 and Y=4 and so on. It also deals with cases that could not happen because of the values of $latex n_1$ and $latex n_2$. For example if $latex n_2 = 4$ then we cannot get Z=1 as a combination of X=4 and Y=5. In this case thanks to our modified binomial pmf the probablity is zero. For the second case we just reverse the roles. For example if z=-1 then this is reached when X=0 and Y=1, X=1 and Y=2 etc. $latex p(z) = \sum_{i=0}^{n_2} m(i, n_1, p_1) m(i+z, n_2, p_2)[l\atex] Put them together and that's your pmf. Here’s the function in R and a simulation to check it’s right (and it does work.) ## Visualising Shrinkage A useful property of mixed effects and Bayesian hierarchical models is that lower level estimates are shrunk towards the more stable estimates further up the hierarchy. To use a time honoured example you might be modelling the effect of a new teaching method on performance at the classroom level. Classes of 30 or so students are probably too small a sample to get useful results. In a hierarchical model the data are pooled so that all classes in a school are modelled together as a hierarchy and even all schools in a district. At each level in this hierarchy an estimate for the efficiency of the teaching method is obtained. You will get an estimate for the school as a whole and for the district. You will even get estimates for the individual classes. These estimates will be weighted averages of the estimates for the class and the estimate for the school (which in turn is a weighted average of the estimate for the school and the district.) The clever part is that this weighting is itself determined by the data. Where a class is an outlier, and therefore the overall school average is less relevant, the estimate will be weighted towards the class. Where it is typical it will be weighted towards the school. This property is known as shrinkage. I’m often interested in how much shrinkage is affecting my estimates and I want to see it. I’ve created this plot which I find useful. It’s done in R using ggplot and is very simple to code. The idea is that the non shrunk estimates bi (i.e. the estimates that would be obtained by modelling classes individually) are plotted on along the line x=y at the points (bi, bi). The estimates they are being shrunk towards ai are plotted at the points (bi, ai). Finally we plot the shrunk estimates si at (bi, si) and connect the points with an arrow to illustrate the direction of the shrinkage. Here is an example. You can see the extent of the shrinkage by the the distance covered by the arrow towards the higher level estimate. Note the arrows do sometimes point away from the higher level estimate. This is because this data is for a single coefficient in a hierarchical regression model with multiple coefficients. Where other coefficients have been stabilized by shrinkage this causes this particular coefficient to be revised. The R code is as follows: # *-------------------------------------------------------------------- # | FUNCTION: shrinkPlot # | Function for visualising shrinkage in hierarchical models # *-------------------------------------------------------------------- # | Version |Date |Programmer |Details of Change # | 01 |31/08/2013|Simon Raper |first version. # *-------------------------------------------------------------------- # | INPUTS: orig Estimates obtained from individual level # | modelling # | shrink Estimates obtained from hierarchical modelling # | prior Priors in Bayesian model or fixed effects in # | mixed effects model (i.e. what it is shrinking # | towards. # | window Limits for the plot (as a vector) # | # *-------------------------------------------------------------------- # | OUTPUTS: A ggplot object # *-------------------------------------------------------------------- # | DEPENDS: grid, ggplot2 # | # *-------------------------------------------------------------------- library(ggplot) library(grid) shrinkPlot&lt;-function(orig, shrink, prior, window=NULL){ group&lt;-factor(signif(prior,3)) data&lt;-data.frame(orig, shrink, prior, group) g&lt;-ggplot(data=data, aes(x=orig, xend=orig, y=orig, yend=shrink, col=group)) g2&lt;-g+geom_segment(arrow = arrow(length = unit(0.3, &quot;cm&quot;))) +geom_point(data=comp.in, aes(x=coef, y=mean)) g3&lt;-g2+xlab(&quot;Estimate&quot;)+ylab(&quot;Shrinkage&quot;)+ ggtitle(&quot;Shrinkage Plot&quot;) if (is.null(window)==FALSE){ g3&lt;-g3+ylim(window)+xlim(window) } print(g3) } ## What are degrees of freedom? I remember getting frustrated as an undergraduate trying to find straight answer to this question. The standard text book answer is something like this: "In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary" That’s from Wikipedia but it’s fairly typical. I could just about make sense of this for something like a chi-squared statistic but why, could someone explain to me, are the degrees of freedom for linear regression n-k-1? I realise this doesn’t keep many people awake but it did me so I was pleased to find the following quote: "The person who is unfamiliar with N-dimensional geometry or who knows the contributions to modern sampling theory only from secondhand sources such as textbooks, this concept often seems almost mystical, with no practical meaning." Walker, 1940 Many years later I’m nasty enough to use it as an interview question. In a kinder frame of mind I thought I’d post my slightly XKCD inspired notes for explaining, as simply as I can, the concept in terms of N-dimensional geometry. I hope you can read my writing and apologies to mathematicians if the language is a bit loose! ## Expected switching for the Dirichlet distribution A valuable tool in choice modelling is the Dirichlet-multinomial distribution. It’s a compound of the multinomial and Dirichlet distributions and it works like this: • A choice between N options is modelled as a multinomial distribution with parameters θ1, θ2, θ3 … θN, where the thetas also represent the probabilities of each option being chosen. For example we might model votes cast in an election as draws from a multinomial distribution with parameters θ1=0.7, θ2=0.2, θ3=0.1. • However the multinomial distribution by itself is likely to be a poor model of the choices made within a population as it assumes all individuals select options with the same probabilities. It would be more realistic to say that the thetas themselves vary over the population. This gives us what is known as an over-dispersed distribution: the parameters for one distribution are modelled by another distribution. In this case we use a Dirichlet distribution, which is the multivariate version of a Beta distribution, to model the distribution of the thetas. As we’ll be using it a lot here’s the probability density function for the Dirichlet distribution. $f(\theta_1,\dots, \theta_N; \alpha_1,\dots, \alpha_N) = \frac{1}{\mathrm{B}(\alpha)} \prod_{i=1}^N \theta_i^{\alpha_i - 1}$ Where the normalising constant is: $\mathrm{B}(\alpha) = \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{\Gamma\bigl(\sum_{i=1}^N \alpha_i\bigr)},\qquad\alpha=(\alpha_1,\dots,\alpha_N)$ One of the powerful things about the Dirichlet distribution as a modelling tool is that it allows us to capture not only the proportions of the population that opt for each choice but also the amount of switching between the choices from one draw to the next. To take our election example again, an election result of 70%, 20%, 10% for three parties could be modelled by Dirichlet distribution with alphas of 0.1, 0.029 and 0.014 or with alphas of 20, 5.71 and 2.86. In fact there are infinitely many possible settings of the alpha parameters that will produce this result. The difference between them is stability. If two successive elections produce the same result then this could be because the same people are voting for the same parties or, less likely but equally possible, people are switching their votes but in such a way that the net result is the same. Different settings of the alpha parameters produce different levels of switching. A natural question is then: given a particular parametrisation of the Dirchlet distribution, what is the expected percentage of individuals that will switch category from one draw of the multinomial distribution to another? I’m sure this has been worked out before somewhere but after a quick and fruitless trawl through the online literature I decided to do it myself, helped a lot by a great post from Leo Alekseyev who demonstrates a clever way of integrating over Dirichlet distributions. All I’ve done is adapt his technique. (By the way to convert the latex in this post into a form easily used in wordpress I used an excellent python package from Luca Trevisan) So let’s say we have N choices. For individual i the probability of picking choice j is θij. What then is the probability that a randomly selected individual will make the same choice in two successive draws from a multinomial distribution? The individual could either select the first option twice or the second option twice or the third option twice etc. In other words the probability we are interested in is $L=\theta_{i1}^2 + \theta_{i2}^2 +\theta_{i3}^2 ...\theta_{iN}^2$ We will call L the loyalty and work out expected switching as 1-E[L]. Leo Alekseyev has created an excellent video on you tube talking through his technique. I would recommend watching it if you would like to follow the arguments below. If you’re just interested in the end result then scroll to the end of the post. The quantity we are interested in is ${L=\theta_1^2+\theta_2^2+ \cdots \theta_N^2}$ where the ${\theta_i}$ come from a Dirichlet distribution with parameters ${ \alpha_1, \ \alpha_2, \ \alpha_3, \ldots \alpha_N }$ To get the expected value of L we can use a generalised version of the Law of the Unconscious Statistician to adapt the proof given by Leo Alekseyev. As a reminder, the Law of the Unconscious Statistician is: $\displaystyle E[g(X)] = \int_{-\infty}^\infty g(x) f(x) \ dx \ \ \ \ \ (1)$ where ${f(x)}$ is the probability distribution of the random variable X. This will give us the following integral $\displaystyle E[L]= \frac{1}{B(\alpha)}\int \cdots \int_\mathbf{D}\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} \ d\theta_1 \!\cdots d\theta_N \ \ \ \ \ (2)$ So how do we evaluate this integral? It’s domain D is not straightforward as it is constrained by ${ \left\vert \bf{\theta} \right\vert = 1 }$ (i.e. the probabilities must sum to zero). Leo Alekseyev shows us a trick using the Dirac delta function: $\displaystyle \delta(x)=\frac{1}{2\pi} \int_{-\infty}^\infty e^{ikx} dk \ \ \ \ \ (3)$ This (generalised) function, the limit of increasingly concentrated distributions, has an area of one beneath the curve (if you can call it that) at x=0 and an area of zero everywhere else – a slightly odd concept sometimes thought of as infinitely tall spike above the origin. The helpful thing for us is if we set ${x=1-\theta_1-\theta_2- \cdots \theta_N}$ and multiply the contents of the integral by the delta function then this is equivalent to evaluating the integral over D. $\displaystyle \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty \cdots \int_{\theta_N =0}^\infty\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} \delta (1-\theta_1-\theta_2 \cdots \theta_N) \ d\theta_1 \!\cdots d\theta_N \ \ \ \ \ (4)$ Since… $\displaystyle \delta (1-\theta_1-\theta_2 \cdots \theta_N) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{-ik(1-\theta_1-\theta_2 \cdots \theta_N)} dk$ $\displaystyle = \frac{1}{2\pi}\int_{-\infty}^\infty e^{ik} e^{ik\theta_1} e^{ik\theta_2} \cdots e^{ik\theta_N}dk \ \ \ \ \ (5)$ … the integral can be rewritten as: $\displaystyle \frac{1}{2\pi} \int_{-\infty}^\infty e^{-ik} \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty \cdots \int_{\theta_N =0}^\infty\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} e^{ik\theta_1} e^{ik\theta_2} \cdots e^{ik\theta_N} \ d\theta_1 \!\cdots d\theta_N\ dk\ \ \ \ \ (6)$ If we group together like terms we reach: $\displaystyle \frac{1}{2\pi} \int_{-\infty}^\infty e^{-ik} \Big[$ $\displaystyle \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1+1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N$ $\displaystyle + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2+1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N$ $\displaystyle + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3+1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N$ $\displaystyle \cdots + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N+1} e^{-ik\theta_N} d\theta_N$ $\displaystyle \Big] dk \ \ \ \ \ (7)$ Note the pattern in the exponents! Continuing along the lines described by Leo Alekseyev we use the substitutions ${ik=\kappa}$ and ${k=i\kappa}$ to set us up for the Laplace transform and evalute the integral at ${t=1}$ to set us up for the inverse Laplace transform. $\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[$ $\displaystyle \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1+1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N$ $\displaystyle + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2+1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N$ $\displaystyle + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3+1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N$ $\displaystyle \cdots + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N+1} e^{-\kappa\theta_N} d\theta_N$ $\displaystyle \Big] \left.d\kappa\right|_{t=1} \ \ \ \ \ (8)$ As a reminder the Laplace transform is: $\displaystyle \displaystyle\mathcal{L} \left\{f(t)\right\}=\int_0^\infty f(t)e^{-st}ds \ \ \ \ \ (9)$ So we can substitute it in giving us: $\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1+1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}$ $\displaystyle +\displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2+1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}$ $\displaystyle +\displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3+1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}$ $\displaystyle + \cdots \displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N+1}\right\}$ $\displaystyle \Big] \left.d\kappa\right|_{t=1} \ \ \ \ \ (10)$ The Laplace transformation evaluates as: $\displaystyle \int_0^\infty \theta^\alpha e^{-s\theta} d\theta =\frac{\Gamma(\alpha+1)}{s^{\alpha+1}} \ \ \ \ \ (11)$ Which we can substitute back into our integral: $\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{\Gamma(\alpha_1+2)}{\kappa^{\alpha_1+2}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}$ $\displaystyle +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2+2)}{\kappa^{\alpha_2+2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}$ $\displaystyle +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3+2)}{\kappa^{\alpha_3+2}} \cdots \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}$ $\displaystyle + \cdots$ $\displaystyle +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots \frac{\Gamma(\alpha_N+2)}{\kappa^{\alpha_N+2}}$ $\displaystyle \Big]$ $\displaystyle \left.d\kappa\right|_{t=1} \ \ \ \ \ (12)$ Since ${\Gamma(x+2)=x(x+1)\Gamma(x)}$ we get: $\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{\prod_{i=1}^N \Gamma(\alpha_i)\alpha_1(\alpha_1+1)}{\kappa^{\alpha_1+\alpha_2+ \cdots \alpha_N + 2}}$ $\displaystyle +\frac{\prod_{i=1}^N \Gamma(\alpha_i)\alpha_2(\alpha_2+1)}{\kappa^{\alpha_1+\alpha_2+ \cdots \alpha_N + 2}}$ $\displaystyle +\frac{\prod_{i=1}^N \Gamma(\alpha_i)\alpha_3(\alpha_3+1)}{\kappa^{\alpha_1+\alpha_2+ \cdots \alpha_N + 2}}$ $\displaystyle + \cdots$ $\displaystyle \frac{\prod_{i=1}^N \Gamma(\alpha_i)\alpha_N(\alpha_N+1)}{\kappa^{\alpha_1+\alpha_2+ \cdots \alpha_N + 2}}$ $\displaystyle \Big]$ $\displaystyle \left.d\kappa\right|_{t=1} \ \ \ \ \ (13)$ Some rearranging gives us: $\displaystyle \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{2\pi i} \sum_{i=1}^N(\alpha_i(\alpha_i+1)) \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{1}{\kappa^{\alpha_1+\alpha_2 \cdots \alpha_N + 2}} \Big] \left.d\kappa\right|_{t=1} \ \ \ \ \ (14)$ Now we use the inverse Laplace transform to evaluate ${\frac{1}{2\pi} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{1}{\kappa^{\alpha_1+\alpha_2 \cdots \alpha_N + 2}} \Big] \left.d\kappa\right|_{t=1}}$ as ${\frac{1}{\Gamma(2+\sum_{i=1}^N \alpha_i)}}$ giving us: $\displaystyle \frac{\prod_{i=1}^N \Gamma(\alpha_i)\sum_{i=1}^N(\alpha_i(\alpha_i+1))}{\Gamma(2+\sum_{i=1}^N \alpha_i)}$ $\displaystyle =\frac{\prod_{i=1}^N \Gamma(\alpha_i)\sum_{i=1}^N(\alpha_i(\alpha_i+1))}{\Gamma(\sum_{i=1}^N \alpha_i)\sum_{i=1}^N \alpha_i (\sum_{i=1}^N \alpha_i+1)} \ \ \ \ \ (15)$ Bringing the normalisation constant back in and cancelling out we get $\displaystyle \frac{\sum_{i=1}^N\alpha_i(\alpha_i+1)}{\sum_{i=1}^N \alpha_i (1+\sum_{i=1}^N \alpha_i)} \ \ \ \ \ (16)$ which is our expected loyalty ${E[L]}$. All we need to do now is check it’s right by comparing with a simulated result in R comp&lt;-NULL for (i in 1:100){ alphas&lt;-runif(3, 0.01, 10) #Simulate sample.d1&lt;-rdirichlet(10000, alphas) purchases&lt;-t(mapply(function(x, y, z) rmultinom(1, size = 2, prob=c(x,y,z)), sample.d1[,1], sample.d1[,2], sample.d1[,3])) loyal&lt;-sum(purchases[,1]==2)+sum(purchases[,2]==2)+sum(purchases[,3]==2) switching.sim&lt;-(10000-loyal)/10000 #Derived switching.dir&lt;-1-(sum(alphas*(alphas+1))/(sum(alphas)*(1+sum(alphas)))) comp&lt;-rbind(comp, c(switching.sim, switching.dir)) } colnames(comp)&lt;-c(&quot;Simulated&quot;, &quot;Derived&quot;) plot(comp) x&lt;-seq(0,1,0.1) lines(x,x) A plot of the simulated as against the derived results shows that, as we would hope, they are approximately equal (the line is x=y) ## Visualising The Correlation Matrix Following on from the previous post here is an R function for visualising correlations between the explanatory variables in your data set. # *-------------------------------------------------------------------- # | FUNCTION: visCorrel # | Creates an MDS plot where the distance between variables represents # | correlation between the variables (closer=more correlated) # *-------------------------------------------------------------------- # | Version |Date |Programmer |Details of Change # | 01 |05/01/2012|Simon Raper |first version. # *-------------------------------------------------------------------- # | INPUTS: dataset A dataframe containing only the explanatory # | variables. It should not contain missing # | values or variables where there is no # | variation # | abr The number of characters used when # | abbreviating the variables. If set to zero # | there is no abbreviation # | method The options are 'metric' or 'ordinal'. The # | default is metric # *-------------------------------------------------------------------- # | OUTPUTS: graph An MDS plot where the similarity measure is # | correlation between the variables. # | # *-------------------------------------------------------------------- # | USAGE: vis_correl(dataset, # | abr) # | # *-------------------------------------------------------------------- # | DEPENDS: ggplot2, directlabels, MASS # | # *-------------------------------------------------------------------- # | NOTES: For more information about MDS please see # | http://en.wikipedia.org/wiki/Multidimensional_scaling # | # *-------------------------------------------------------------------- visCorrel<-function(dataset, abr, method="metric"){ #Create correlation matrix cor_ts<-cor(dataset) n<-dim(cor_ts)[2] # Create dissimilarities ones<-matrix(rep(1,n^2), nrow=n) dis_ts<-ones-abs(cor_ts) # Do MDS if ( method=="ordinal"){ fit <- isoMDS(dis_ts, k=2)$points
} else {
cmd.res <- cmdscale(dis_ts, k=2, eig=TRUE)
eig<-cmd.res$eig fit<-cmd.res$points
prop<-sum(abs(eig[1:2]))/sum(abs(eig))
print(paste("Proportion of squared distances represented:", round(prop*100)))
if(prop<0.5){print("Less than 50% of squared distance is represented. Consider using ordinal scaling instead")}
}
x <- fit[,1]
y <- fit[,2]
labels<-row.names(cor_ts)
if (abr>0){labels<-substr(labels,1,abr)}
mds_plot<-data.frame(labels, x, y)
#Plot the results
ggplot(mds_plot, aes(x=x, y=y, colour=labels, main="MDS Plot of Correlations", label=labels))+geom_text() + coord_fixed()+ ggtitle("MDS Plot of Correlations") + theme(legend.position="none")
}
# *--------------------------------------------------------------------
# * Examples
# *--------------------------------------------------------------------
# library(MASS)
# library(ggplot2)
# library(plm)
# data(midwest)
# data(Crime)
# visCorrel(midwest[,4:27],10, method="classical")
# visCorrel(midwest[,4:27],10, method="ordinal")
# visCorrel(Crime[,-c(1,2,11,12)],10)

An interesting example is the North Carolina Crime data set that comes with the plm package. This has the following continuous variables:

 crmrte crimes committed per person prbarr probability of arrest prbarr probability of arrest prbconv probability of conviction prbpris probability of prison sentence avgsen average sentence, days polpc police per capita density people per square mile taxpc tax revenue per capita pctmin percentage minority in 1980 wcon weekly wage in construction wtuc weekly wage in trns, util, commun wtrd weekly wage in whole sales and retail trade wfir weekly wage in finance, insurance and real estate wser weekly wage in service industry wmfg weekly wage in manufacturing wfed weekly wage of federal emplyees wsta weekly wage of state employees wloc weekly wage of local governments employees mix offence mix: face-to-face/other pctymle percentage of young males

Which is then visualised (using the ordinal option – see below) as this:

The closer the variables are on the plot the more highly correlated (positively or negatively) they are. Here we can see some interesting relationships. Unsurprisingly the wage variables form a correlated group. Towards the top of the chart the method correctly identifies three variables, probability of prison sentence, police per capita and offence mix that are all correlated with one another.

The plot is useful in dealing with multicollinearity as it allows us to spot clusters of correlated variables. For example a set of economic variables might be highly correlated with one another but have a low level of correlation with TV advertising levels. Why is this good to know? Because multicollinearity only affects those variables between which the relationship of collinearity holds. So if we are only interested in obtaining an accurate estimate for the TV advertising level then we need not be concerned about collinearity among the economic variables. Of course this plot deals just with correlation between two variables rather than with full blown collinearity but it’s a good place to start.

The plot works by performing a multi-dimensional scaling on the absolute value of the correlations. In other words if we think of correlation as measure of distance (highly correlated variables being closer together) it finds the best possible two dimensional representation of those distances. Note if these distances could be represented in a euclidean space then this would be equivalent to a plot of the first two dimensions in a principal components analysis. However correlations cannot be represented in this way.

Beware though, the best representation isn’t always a good representation. I’ve included some output which will tell you what proportion on the squared distances is represented in the plot. If it is low I would recommend trying ordinal scaling (the other method option) instead.

I’ve a feeling that applying multi-dimensional scaling to correlations is equivalent to something else that is probably far simpler but I haven’t had time to give it attention. If anyone knows I’d be very interested.

## Multicollinearity and Ridge Regression

In marketing mix modelling you have to be very lucky not to run into problems with multicollinearity. It’s in the nature of marketing campaigns that everything tends to happen at once: the TV is supported by radio, both are timed to coincide with the relaunch of the website. One of the techniques that is often touted as a solution is ridge regression. However there is quite a bit of disagreement over whether this works. So I thought we’d just try it out with the simulated sales data I created in the last post.

In fact we’ll need to modify that data a little as we need a case of serious multicollinearity. I’ve adjusted the Tv campaigns to ensure that they always occur in the same winter months (not uncommon in marketing mix data) and I’ve added radio campaigns alongside the TV campaigns. Here is the modified code.

#TV now coincides with winter. Carry over is dec, theta is dim, beta is ad_p,
tv_grps&lt;-rep(0,5*52)
tv_grps[40:45]&lt;-c(390,250,100,80,120,60)
tv_grps[92:97]&lt;-c(390,250,100,80,120,60)
tv_grps[144:149]&lt;-c(390,250,100,80,120,60)
tv_grps[196:201]&lt;-c(390,250,100,80,120,60)
tv_grps[248:253]&lt;-c(390,250,100,80,120,60)

The sales data now looks like this:

The correlation matrix of the explanatory variables shows that we have serious multicollinearity issues even when only two variables are taken at a time.

&gt; cor(test[,c(2,4:6)])
temp         1.0000000 -0.41545174 -0.15593463 -0.47491671
week        -0.1559346  0.09096521  1.00000000  0.08048096

What is this going to mean for the chances of recovering the parameters in our simulated data set? Well we know that even with heavy multicollinearity our estimates using linear regression are going to be unbiased; the problem is going to be their high variance.

We can show this quite nicely by generating lots of examples of our sales data (always with the same parameters but allowing a different random draw each from the normally distributed error term) and plotting the distribution of the estimates arrived at using linear regression. (See Monte Carlo Simulation for more details about this kind of technique.)

}

You can see that on average the estimates for tv and radio are close to correct but the distributions are wide. So for any one instance of the data (which in real life is all we have) chances are that our estimate is quite wide of the mark. The data and plots are created using the following code:

coefs&lt;-NA
for (i in 1:10000){
sim&lt;-create_test_sets(base_p=1000,
trend_p=0.8,
season_p=4,
dim=100,
dec=0.3,
error_std=5)
coefs&lt;-rbind(coefs,coef(lm_std))
}
col_means&lt;-colMeans(coefs[-1,])
for_div&lt;-matrix(rep(col_means,10000), nrow=10000, byrow=TRUE)
mean_div&lt;-coefs[-1,]/for_div
m_coefs&lt;-melt(mean_div)
ggplot(data=m_coefs, aes(x=value))+geom_density()+facet_wrap(~X2, scales=&quot;free_y&quot;) + scale_x_continuous('Scaled as % of Mean')

What does ridge regression do to fix this? Ridge regression is best explained using a concept more familiar in machine learning and data mining: the bias-variance trade off. The idea is that you will often achieve better predictions (or estimates) if you are prepared to swap a bit of unbiasedness for much less variance. In other words the average of your predictions will no longer converge on the right answer but any one prediction is likely to be much closer.

In ridge regression we have a parameter lambda that controls the bias-variance trade off. As lambda increases our estimates get more biased but their variance increases. Cross-validation (another machine learning technique) is used to estimate the best possible setting of lambda.

So let’s see if ridge regression can help us with the multicolinearity in our marketing mix data. What we hope to see is a decent reduction in variance but not at too high a price in bias. The code below simulates the distribution of the ridge regression estimates of the parameters for increasing values of lambda.

library(MASS)
for (i in 1:1000){
sim&lt;-create_test_sets(base_p=1000,
trend_p=0.8,
season_p=4,
dim=100,
dec=0.3,
error_std=5)
if (i==1){coefs_rg&lt;-coef(lm_rg)}
else {coefs_rg&lt;-rbind(coefs_rg,coef(lm_rg))}
}
colnames(coefs_rg)[1]&lt;-&quot;intercept&quot;
m_coefs_rg&lt;-melt(coefs_rg)
names(m_coefs_rg)&lt;-c(&quot;lambda&quot;, &quot;variable&quot;, &quot;value&quot;)
ggplot(data=m_coefs_rg, aes(x=value, y=lambda))+geom_density2d()+facet_wrap(~variable, scales=&quot;free&quot;)

The results are not encouraging. Variance decreases slightly for tv and radio however the cost in bias is far too high.

I’m aware that this by no means proves that ridge regression is never a solution for marketing mix data but it does at least show that it is not always the solution and I’m inclined to think that if it doesn’t work in a simple situation like this then it doesn’t work very often.

However I might try varying the parameters for the simulated data set to see if there are some settings where it looks more promising.

Still, for now, I won’t be recommending it as a solution to multicollinearity in marketing mix models.

A good explanation of ridge regression can be found in this post