Expected switching for the Dirichlet distribution


A valuable tool in choice modelling is the Dirichlet-multinomial distribution. It’s a compound of the multinomial and Dirichlet distributions and it works like this:

  • A choice between N options is modelled as a multinomial distribution with parameters θ1, θ2, θ3 … θN, where the thetas also represent the probabilities of each option being chosen. For example, we might model votes cast in an election as draws from a multinomial distribution with parameters θ1=0.7, θ2=0.2, θ3=0.1.
  • However, the multinomial distribution by itself is likely to be a poor model of the choices made within a population, as it assumes all individuals select options with the same probabilities. It would be more realistic to say that the thetas themselves vary over the population. This gives us what is known as an over-dispersed distribution: the parameters for one distribution are modelled by another distribution. In this case we use a Dirichlet distribution, which is the multivariate version of a Beta distribution, to model the distribution of the thetas.
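
As a concrete illustration of the two-stage process (a minimal sketch of my own, with hypothetical alphas; rdirichlet comes from the gtools package, and MCMCpack has an equivalent), here is how the compound works in R: each voter gets their own thetas from the Dirichlet, and then casts a vote from the corresponding multinomial.

library(gtools)   # provides rdirichlet() (MCMCpack has an equivalent)

alphas <- c(7, 2, 1)                    # hypothetical alphas giving a 70/20/10 split on average
n.voters <- 10000
thetas <- rdirichlet(n.voters, alphas)  # each row: one voter's personal choice probabilities
votes <- apply(thetas, 1, function(p) which(rmultinom(1, size = 1, prob = p) == 1))
table(votes) / n.voters                 # proportions come out close to 0.7, 0.2, 0.1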

As we’ll be using it a lot, here’s the probability density function for the Dirichlet distribution:

f(\theta_1,\dots, \theta_N; \alpha_1,\dots, \alpha_N) = \frac{1}{\mathrm{B}(\alpha)} \prod_{i=1}^N \theta_i^{\alpha_i - 1}

Where the normalising constant is:

\mathrm{B}(\alpha) = \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{\Gamma\bigl(\sum_{i=1}^N \alpha_i\bigr)},\qquad\alpha=(\alpha_1,\dots,\alpha_N)

One of the powerful things about the Dirichlet distribution as a modelling tool is that it allows us to capture not only the proportions of the population that opt for each choice but also the amount of switching between the choices from one draw to the next. To take our election example again, an election result of 70%, 20%, 10% for three parties could be modelled by a Dirichlet distribution with alphas of 0.1, 0.029 and 0.014, or with alphas of 20, 5.71 and 2.86. In fact there are infinitely many possible settings of the alpha parameters that will produce this result. The difference between them is stability. If two successive elections produce the same result then this could be because the same people are voting for the same parties or, less plausibly but perfectly possibly, because people are switching their votes in such a way that the net result is the same. Different settings of the alpha parameters produce different levels of switching.
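
To see this difference concretely, here is a small sketch of my own (again using rdirichlet from gtools) comparing the two parametrisations above: the average proportions match, but the spread of the individual-level probabilities is very different.

library(gtools)   # for rdirichlet()

low.alphas  <- c(0.1, 0.029, 0.014)  # one parametrisation of the 70/20/10 result
high.alphas <- c(20, 5.71, 2.86)     # another parametrisation of the same result

draws.low  <- rdirichlet(10000, low.alphas)
draws.high <- rdirichlet(10000, high.alphas)

colMeans(draws.low)        # both sets of column means are close to 0.70, 0.20, 0.10
colMeans(draws.high)
apply(draws.low, 2, sd)    # large spread: individual probabilities sit near 0 or 1
apply(draws.high, 2, sd)   # small spread: individuals look much like the aggregate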

A natural question is then: given a particular parametrisation of the Dirichlet distribution, what is the expected percentage of individuals that will switch category from one draw of the multinomial distribution to another?

I’m sure this has been worked out before somewhere but after a quick and fruitless trawl through the online literature I decided to do it myself, helped a lot by a great post from Leo Alekseyev who demonstrates a clever way of integrating over Dirichlet distributions. All I’ve done is adapt his technique.

(By the way, to convert the LaTeX in this post into a form easily used in WordPress I used an excellent Python package from Luca Trevisan.)

So let’s say we have N choices. For individual i the probability of picking choice j is θij. What then is the probability that a randomly selected individual will make the same choice in two successive draws from a multinomial distribution? The individual could select the first option twice, or the second option twice, or the third option twice, and so on. In other words, the probability we are interested in is

L=\theta_{i1}^2 + \theta_{i2}^2 + \theta_{i3}^2 + \cdots + \theta_{iN}^2

We will call L the loyalty and work out expected switching as 1-E[L].
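For example, an individual with choice probabilities {\theta = (0.7, 0.2, 0.1)} has loyalty {L = 0.7^2 + 0.2^2 + 0.1^2 = 0.54}, so the probability that they switch between two successive draws is {1 - 0.54 = 0.46}.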

Leo Alekseyev has created an excellent video on YouTube talking through his technique. I would recommend watching it if you would like to follow the arguments below. If you’re just interested in the end result then scroll to the end of the post.

The quantity we are interested in is {L=\theta_1^2+\theta_2^2+ \cdots +\theta_N^2} (dropping the individual index i from now on), where the {\theta_i} come from a Dirichlet distribution with parameters { \alpha_1, \ \alpha_2, \ \alpha_3, \ldots, \alpha_N }. To get the expected value of L we can use a generalised version of the Law of the Unconscious Statistician to adapt the proof given by Leo Alekseyev. As a reminder, the Law of the Unconscious Statistician is:

\displaystyle  E[g(X)] = \int_{-\infty}^\infty g(x) f(x) \ dx \ \ \ \ \ (1)

where {f(x)} is the probability density function of the random variable X.

This will give us the following integral

\displaystyle  E[L]= \frac{1}{B(\alpha)}\int \cdots \int_\mathbf{D}\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} \ d\theta_1 \!\cdots d\theta_N \ \ \ \ \ (2)

So how do we evaluate this integral? Its domain D is not straightforward as it is constrained by { \left\vert \bf{\theta} \right\vert = 1 } (i.e. the probabilities must sum to one).

Leo Alekseyev shows us a trick using the Dirac delta function:

\displaystyle  \delta(x)=\frac{1}{2\pi} \int_{-\infty}^\infty e^{ikx} dk \ \ \ \ \ (3)

This (generalised) function, the limit of increasingly concentrated distributions, has an area of one beneath the curve (if you can call it that) at x=0 and an area of zero everywhere else – a slightly odd concept sometimes thought of as an infinitely tall spike above the origin. The helpful thing for us is that if we set {x=1-\theta_1-\theta_2- \cdots -\theta_N} and multiply the contents of the integral by the delta function then this is equivalent to evaluating the integral over D.
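
To see why this works, consider the simplest case N=2 (a small check of my own, not part of the original argument): integrating any function against this delta function over the positive quadrant simply pins {\theta_2} to {1-\theta_1}, which is exactly the constrained integral we want.

\displaystyle  \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty g(\theta_1,\theta_2)\, \delta(1-\theta_1-\theta_2)\, d\theta_2\, d\theta_1 = \int_{\theta_1 =0}^1 g(\theta_1, 1-\theta_1)\, d\theta_1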

\displaystyle  \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty \cdots \int_{\theta_N =0}^\infty\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} \delta (1-\theta_1-\theta_2- \cdots -\theta_N) \ d\theta_1 \!\cdots d\theta_N \ \ \ \ \ (4)

Since…

\displaystyle \delta (1-\theta_1-\theta_2- \cdots -\theta_N) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{ik(1-\theta_1-\theta_2- \cdots -\theta_N)} dk

\displaystyle   = \frac{1}{2\pi}\int_{-\infty}^\infty e^{ik} e^{-ik\theta_1} e^{-ik\theta_2} \cdots e^{-ik\theta_N}dk    \ \ \ \ \ (5)

… the integral can be rewritten as:

\displaystyle  \frac{1}{2\pi} \int_{-\infty}^\infty e^{ik} \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty \cdots \int_{\theta_N =0}^\infty\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} e^{-ik\theta_1} e^{-ik\theta_2} \cdots e^{-ik\theta_N} \ d\theta_1 \!\cdots d\theta_N\ dk\ \ \ \ \ (6)

If we group together like terms we reach:

\displaystyle   \frac{1}{2\pi} \int_{-\infty}^\infty e^{ik} \Big[

\displaystyle   \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1+1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2+1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3+1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N

\displaystyle  \cdots + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N+1} e^{-ik\theta_N} d\theta_N

\displaystyle   \Big] dk \ \ \ \ \ (7)

Note the pattern in the exponents!

Continuing along the lines described by Leo Alekseyev, we use the substitution {ik=\kappa} (so that {k=-i\kappa} and {dk=-i\,d\kappa}) to set us up for the Laplace transform, and introduce a dummy variable {t}, evaluated at {t=1}, to set us up for the inverse Laplace transform.

\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[

\displaystyle   \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1+1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2+1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3+1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N

\displaystyle  \cdots + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N+1} e^{-\kappa\theta_N} d\theta_N

\displaystyle \Big] \left.d\kappa\right|_{t=1} \ \ \ \ \ (8)

As a reminder the Laplace transform is:

\displaystyle  \displaystyle\mathcal{L} \left\{f(t)\right\}=\int_0^\infty f(t)e^{-st}dt \ \ \ \ \ (9)

So we can substitute it in giving us:

\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1+1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}

\displaystyle    +\displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2+1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}

\displaystyle    +\displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3+1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}

\displaystyle    + \cdots  \displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N+1}\right\}

\displaystyle \Big] \left.d\kappa\right|_{t=1}    \ \ \ \ \ (10)

Each of these Laplace transforms evaluates as:

\displaystyle  \int_0^\infty \theta^\alpha e^{-s\theta} d\theta =\frac{\Gamma(\alpha+1)}{s^{\alpha+1}} \ \ \ \ \ (11)
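
(This is just the Gamma function integral in disguise: substituting {u=s\theta} turns the left hand side into {\frac{1}{s^{\alpha+1}}\int_0^\infty u^\alpha e^{-u} du = \frac{\Gamma(\alpha+1)}{s^{\alpha+1}}}.)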

Substituting this back into our integral gives:

\displaystyle    \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{\Gamma(\alpha_1+2)}{\kappa^{\alpha_1+2}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}

\displaystyle  +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2+2)}{\kappa^{\alpha_2+2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots  \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}

\displaystyle  +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3+2)}{\kappa^{\alpha_3+2}} \cdots  \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}

\displaystyle  + \cdots

\displaystyle  +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots  \frac{\Gamma(\alpha_N+2)}{\kappa^{\alpha_N+2}}

\displaystyle \Big]

\displaystyle  \left.d\kappa\right|_{t=1} \ \ \ \ \ (12)

Since {\Gamma(x+2)=x(x+1)\Gamma(x)} we get:

\displaystyle  \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{\alpha_1(\alpha_1+1)\prod_{i=1}^N \Gamma(\alpha_i)}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}}

\displaystyle +\frac{\alpha_2(\alpha_2+1)\prod_{i=1}^N \Gamma(\alpha_i)}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}}

\displaystyle +\frac{\alpha_3(\alpha_3+1)\prod_{i=1}^N \Gamma(\alpha_i)}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}}

\displaystyle + \cdots

\displaystyle +\frac{\alpha_N(\alpha_N+1)\prod_{i=1}^N \Gamma(\alpha_i)}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}}

\displaystyle \Big]

\displaystyle \left.d\kappa\right|_{t=1}   \ \ \ \ \ (13)

Some rearranging gives us:

\displaystyle  \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{2\pi i} \sum_{i=1}^N(\alpha_i(\alpha_i+1)) \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{1}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}} \Big] \left.d\kappa\right|_{t=1} \ \ \ \ \ (14)

Now we use the inverse Laplace transform. Since the inverse Laplace transform of {1/\kappa^{n}} is {t^{n-1}/\Gamma(n)}, the quantity {\frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{1}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}} \Big] \left.d\kappa\right|_{t=1}} evaluates to {\frac{1}{\Gamma(2+\sum_{i=1}^N \alpha_i)}}, giving us:

\displaystyle \frac{\prod_{i=1}^N \Gamma(\alpha_i)\sum_{i=1}^N(\alpha_i(\alpha_i+1))}{\Gamma(2+\sum_{i=1}^N \alpha_i)}

\displaystyle =\frac{\prod_{i=1}^N \Gamma(\alpha_i)\sum_{i=1}^N(\alpha_i(\alpha_i+1))}{\Gamma(\sum_{i=1}^N \alpha_i)\sum_{i=1}^N \alpha_i (\sum_{i=1}^N \alpha_i+1)} \ \ \ \ \ (15)

Bringing the normalisation constant back in and cancelling out we get

\displaystyle  \frac{\sum_{i=1}^N\alpha_i(\alpha_i+1)}{\sum_{i=1}^N \alpha_i (1+\sum_{i=1}^N \alpha_i)} \ \ \ \ \ (16)

which is our expected loyalty {E[L]}; the expected switching is then {1-E[L]}.
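
To make the result easy to use, here is a small helper function (my own wrapper, not from the original derivation or any package) that computes the expected loyalty and switching for a given alpha vector; applying it to the two election parametrisations from earlier shows how different their implied switching rates are.

# Expected loyalty and switching for a Dirichlet with parameter vector alphas
expected.loyalty <- function(alphas) {
  A <- sum(alphas)
  sum(alphas * (alphas + 1)) / (A * (A + 1))
}
expected.switching <- function(alphas) 1 - expected.loyalty(alphas)

expected.switching(c(0.1, 0.029, 0.014))  # roughly 0.06: near-deterministic individuals, little switching
expected.switching(c(20, 5.71, 2.86))     # roughly 0.44: same aggregate result, much more switching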

All we need to do now is check it’s right by comparing it with a simulated result in R:

library(gtools)   # provides rdirichlet() (MCMCpack has an equivalent)

comp <- NULL
for (i in 1:100){
  # Pick a random parametrisation of a three-category Dirichlet
  alphas <- runif(3, 0.01, 10)
  # Simulate: draw each individual's thetas, then two choices per individual
  sample.d1 <- rdirichlet(10000, alphas)
  purchases <- t(mapply(function(x, y, z) rmultinom(1, size = 2, prob = c(x, y, z)),
                        sample.d1[,1], sample.d1[,2], sample.d1[,3]))
  # An individual is loyal if both choices land in the same category (a count of 2)
  loyal <- sum(purchases[,1]==2) + sum(purchases[,2]==2) + sum(purchases[,3]==2)
  switching.sim <- (10000 - loyal)/10000
  # Derived: expected switching from the formula above
  switching.dir <- 1 - (sum(alphas*(alphas+1))/(sum(alphas)*(1+sum(alphas))))
  comp <- rbind(comp, c(switching.sim, switching.dir))
}
colnames(comp) <- c("Simulated", "Derived")
plot(comp)
x <- seq(0, 1, 0.1)
lines(x, x)

A plot of the simulated against the derived results shows that, as we would hope, they are approximately equal (the line is x=y).

[Figure: simulated vs. derived switching rates]

Thorstein Veblen and Hard Coding


It is still quite common to hear the career progression of an analyst described as one upwards from the hard graft of coding and “getting your hands dirty” towards the enviable heights of people management and strategic thinking. Whenever I hear this it reminds me of the book Conspicuous Consumption by the American economist Thorstein Veblen. It examines, in a very hypothetical way, the roots of economic behaviour in some of our basic social needs: to impress others, to dominate and to demonstrate status. It’s not a happy book.

His principal concept is the distinction between exploit and drudgery.

 

The institution of a leisure class is the outgrowth of an early discrimination between employments, according to which some employments are worthy and others unworthy. Under this ancient distinction the worthy employments are those which may be classed as exploit; unworthy are those necessary employments into which no appreciable element of exploit enters.

He sees this division arising early in history as honour and status are assigned to those who are successful in making others do what they want and “at the same time, employment in industry becomes correspondingly odious, and, in the common-sense apprehension, the handling of the tools and implements of industry falls beneath the dignity of able-bodied men. Labour becomes irksome.”

We’re no longer hairy barbarians using the vanquished for foot-stools but, as Veblen points out, the distinction persists no matter how illogical and unprofitable. Coding is still somehow viewed as necessarily inferior to the boardroom meeting even if it is the genius piece of code that makes or breaks a business.

Attitudes are changing, but still so slowly that there is a desperate shortage of people with both the experience and the hands-on skills to get things done.

So if you are a good analyst, and you love what you do, then this is my advice to you: when they come to lure you from your lovely pristine scripts, resist. Stay where you are. The 21st century is going to need you!

Two Quick Recipes: Ubuntu and Hadoop


There are so many flavours of everything, and things are changing so quickly, that I find every task researched online ends up being a set of instructions stitched together from several blogs and forums. Here are a couple of recent ones.

Ubuntu on AWS (50 mins)

I was going to buy a new laptop, but it made more sense to set up a Linux instance on AWS and remote in (a quarter of the price and more interesting). Here’s my recipe:

    1. As in Mark’s earlier post, set yourself up with an AWS account and a key pair by following this tutorial.
    2. Launch an Ubuntu instance using the EC2 management console and select memory and processing power to suit.
    3. Start up the instance, then connect to it using MindTerm (a very useful alternative to SSHing in with PuTTY). To do this, just select the instance in the console, then select Actions and then Connect. (You’ll need to provide the path to your saved key.)
    4. Now you probably want to remote into your machine. Do this by setting up NoMachine NX following steps 2 to 4 in the following post.
    5. However, when you execute the last line of step 2 you’ll find that nxsetup is not found. To fix this, switch to this post and follow steps 6-7 (life’s so complicated).
    6. Change password authentication to yes in /etc/ssh/sshd_config.
    7. Add the GNOME fallback session:

sudo apt-get install gnome-session-fallback

    8. Restart the instance and log in.

Just remember to keep an eye on the charges!

Single Cluster Hadoop on Ubuntu (20 mins)

Of course you can run Hadoop directly on Amazon’s EMR platform, but if you want to get more of a feel for how it works in a familiar environment you can set it up on a single instance.

  1. Follow the instructions in this post, substituting in the latest stable Hadoop release.
  2. Install the latest JDK: sudo apt-get install openjdk-7-jdk
  3. Set the JAVA_HOME environment variable: export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64, substituting in the path to your JDK binaries.
  4. From the Hadoop quick start guide, follow the instructions in the “Prepare to start the Hadoop Cluster” and “Stand Alone Operations” sections. If this all works, you should be ready to go.

 
