February 26, 2013 Simon Raper

Expected switching for the Dirichlet distribution


A valuable tool in choice modelling is the Dirichlet-multinomial distribution. It’s a compound of the multinomial and Dirichlet distributions and it works like this:

  • A choice between N options is modelled as a multinomial distribution with parameters θ1, θ2, θ3 … θN, where the thetas also represent the probabilities of each option being chosen. For example, we might model votes cast in an election as draws from a multinomial distribution with parameters θ1=0.7, θ2=0.2, θ3=0.1.
  • However, the multinomial distribution by itself is likely to be a poor model of the choices made within a population, as it assumes all individuals select options with the same probabilities. It would be more realistic to say that the thetas themselves vary over the population. This gives us what is known as an over-dispersed distribution: the parameters for one distribution are modelled by another distribution. In this case we use a Dirichlet distribution, which is the multivariate version of a Beta distribution, to model the distribution of the thetas.
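To make the two-stage process concrete, here is a minimal R sketch of it (the alphas are illustrative, and it assumes the gtools package for rdirichlet; MCMCpack provides an equivalent function): each individual gets their own vector of thetas drawn from the Dirichlet, and their choice is then a single multinomial draw using those thetas.

library(gtools)   # assumed source of rdirichlet
set.seed(1)
alphas <- c(7, 2, 1)                # illustrative Dirichlet parameters
thetas <- rdirichlet(5, alphas)     # one probability vector per individual
choices <- apply(thetas, 1, function(p) which(rmultinom(1, size = 1, prob = p) == 1))
thetas                              # individual-level choice probabilities
choices                             # the option each individual picked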

As we’ll be using it a lot, here’s the probability density function for the Dirichlet distribution:

f(\theta_1,\dots, \theta_N; \alpha_1,\dots, \alpha_N) = \frac{1}{\mathrm{B}(\alpha)} \prod_{i=1}^N \theta_i^{\alpha_i - 1}

Where the normalising constant is:

\mathrm{B}(\alpha) = \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{\Gamma\bigl(\sum_{i=1}^N \alpha_i\bigr)},\qquad\alpha=(\alpha_1,\dots,\alpha_N)

One of the powerful things about the Dirichlet distribution as a modelling tool is that it allows us to capture not only the proportions of the population that opt for each choice but also the amount of switching between the choices from one draw to the next. To take our election example again, an election result of 70%, 20%, 10% for three parties could be modelled by a Dirichlet distribution with alphas of 0.1, 0.029 and 0.014, or with alphas of 20, 5.71 and 2.86. In fact there are infinitely many possible settings of the alpha parameters that will produce this result. The difference between them is stability. If two successive elections produce the same result then this could be because the same people are voting for the same parties or, less likely but equally possible, because people are switching their votes in such a way that the net result is the same. Different settings of the alpha parameters produce different levels of switching.
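As a rough check of this in R (again assuming gtools for rdirichlet), both sets of alphas imply the same expected shares, because the expected value of each theta is its alpha divided by the sum of the alphas, but the concentrations differ, and a crude two-draw simulation of the kind used at the end of this post shows very different amounts of switching:

library(gtools)   # assumed source of rdirichlet
set.seed(2)
loose <- c(0.1, 0.029, 0.014)   # low concentration: individual thetas sit near the corners
tight <- c(20, 5.71, 2.86)      # high concentration: individual thetas sit near (0.7, 0.2, 0.1)
loose / sum(loose)              # roughly 0.70, 0.20, 0.10
tight / sum(tight)              # roughly 0.70, 0.20, 0.10

sim.switch <- function(alphas, n = 10000) {
  thetas <- rdirichlet(n, alphas)
  draws <- apply(thetas, 1, function(p) rmultinom(1, size = 2, prob = p))
  mean(apply(draws, 2, max) < 2)   # proportion whose two draws land on different options
}
sim.switch(loose)   # close to zero: the same people keep making the same choice
sim.switch(tight)   # much higher: individuals themselves are genuinely uncertain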

A natural question is then: given a particular parametrisation of the Dirichlet distribution, what is the expected percentage of individuals that will switch category from one draw of the multinomial distribution to another?

I’m sure this has been worked out before somewhere but after a quick and fruitless trawl through the online literature I decided to do it myself, helped a lot by a great post from Leo Alekseyev who demonstrates a clever way of integrating over Dirichlet distributions. All I’ve done is adapt his technique.

(By the way, to convert the LaTeX in this post into a form easily used in WordPress I used an excellent Python package from Luca Trevisan.)

So let’s say we have N choices. For individual i the probability of picking choice j is θij. What then is the probability that a randomly selected individual will make the same choice in two successive draws from a multinomial distribution? The individual could select the first option twice, or the second option twice, or the third option twice, and so on. In other words the probability we are interested in is

L=\theta_{i1}^2 + \theta_{i2}^2 +\theta_{i3}^2 + \cdots + \theta_{iN}^2

We will call L the loyalty and work out expected switching as 1-E[L].
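For a single individual with known thetas this is just arithmetic. For example, a voter with the average probabilities from the election example has a loyalty of 0.54 and so a 46% chance of switching:

theta <- c(0.7, 0.2, 0.1)   # one individual's choice probabilities
loyalty <- sum(theta^2)     # probability of making the same choice in two draws
loyalty                     # 0.54
1 - loyalty                 # probability of switching: 0.46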

Leo Alekseyev has created an excellent video on YouTube talking through his technique. I would recommend watching it if you would like to follow the arguments below. If you’re just interested in the end result then scroll to the end of the post.

The quantity we are interested in is {L=\theta_1^2+\theta_2^2+ \cdots +\theta_N^2} where the {\theta_i} come from a Dirichlet distribution with parameters { \alpha_1, \ \alpha_2, \ \alpha_3, \ldots, \alpha_N }. To get the expected value of L we can use a generalised version of the Law of the Unconscious Statistician to adapt the proof given by Leo Alekseyev. As a reminder, the Law of the Unconscious Statistician is:

\displaystyle  E[g(X)] = \int_{-\infty}^\infty g(x) f(x) \ dx \ \ \ \ \ (1)

where {f(x)} is the probability density function of the random variable X.

This will give us the following integral

\displaystyle  E[L]= \frac{1}{B(\alpha)}\int \cdots \int_\mathbf{D}\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} \ d\theta_1 \!\cdots d\theta_N \ \ \ \ \ (2)

So how do we evaluate this integral? Its domain D is not straightforward, as it is constrained by { \left\vert \bf{\theta} \right\vert = 1 } (i.e. the probabilities must sum to one).

Leo Alekseyev shows us a trick using the Dirac delta function:

\displaystyle  \delta(x)=\frac{1}{2\pi} \int_{-\infty}^\infty e^{ikx} dk \ \ \ \ \ (3)

This (generalised) function, the limit of increasingly concentrated distributions, has an area of one beneath the curve (if you can call it that) at x=0 and an area of zero everywhere else – a slightly odd concept sometimes thought of as an infinitely tall spike above the origin. The helpful thing for us is that if we set {x=1-\theta_1-\theta_2- \cdots -\theta_N} and multiply the integrand by the delta function, then integrating over all non-negative thetas is equivalent to evaluating the integral over D.
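To see how this works, here is a small illustrative case that is not part of the original derivation: with N=2, integrating {\theta_2} against the delta function pins it at {1-\theta_1} and forces {\theta_1} to lie between 0 and 1, which is exactly the constrained domain.

\displaystyle \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty g(\theta_1,\theta_2) \, \delta(1-\theta_1-\theta_2) \ d\theta_2 \, d\theta_1 = \int_{\theta_1 =0}^1 g(\theta_1, 1-\theta_1) \ d\theta_1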

\displaystyle  \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty \cdots \int_{\theta_N =0}^\infty\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} \delta (1-\theta_1-\theta_2- \cdots -\theta_N) \ d\theta_1 \!\cdots d\theta_N \ \ \ \ \ (4)

Since…

\displaystyle \delta (1-\theta_1-\theta_2- \cdots -\theta_N) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{ik(1-\theta_1-\theta_2- \cdots -\theta_N)} dk

\displaystyle   = \frac{1}{2\pi}\int_{-\infty}^\infty e^{ik} e^{-ik\theta_1} e^{-ik\theta_2} \cdots e^{-ik\theta_N}dk    \ \ \ \ \ (5)

… the integral can be rewritten as:

\displaystyle  \frac{1}{2\pi} \int_{-\infty}^\infty e^{ik} \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty \cdots \int_{\theta_N =0}^\infty\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} e^{-ik\theta_1} e^{-ik\theta_2} \cdots e^{-ik\theta_N} \ d\theta_1 \!\cdots d\theta_N\ dk\ \ \ \ \ (6)

If we group together like terms we reach:

\displaystyle   \frac{1}{2\pi} \int_{-\infty}^\infty e^{ik} \Big[

\displaystyle   \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1+1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2+1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3+1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N

\displaystyle  \cdots + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N+1} e^{-ik\theta_N} d\theta_N

\displaystyle   \Big] dk \ \ \ \ \ (7)

Note the pattern in the exponents!

Continuing along the lines described by Leo Alekseyev, we substitute {\kappa=ik} (so that {dk=d\kappa/i} and the contour runs from {-i\infty} to {i\infty}) to set us up for the Laplace transform, and we write the resulting factor {e^{\kappa}} as {e^{\kappa t}} evaluated at {t=1} to set us up for the inverse Laplace transform.

\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[

\displaystyle   \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1+1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2+1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3+1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N

\displaystyle  \cdots + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N+1} e^{-\kappa\theta_N} d\theta_N

\displaystyle \Big] \left.d\kappa\right|_{t=1} \ \ \ \ \ (8)

As a reminder, the Laplace transform is:

\displaystyle  \displaystyle\mathcal{L} \left\{f(t)\right\}(s)=\int_0^\infty f(t)e^{-st}dt \ \ \ \ \ (9)

So we can substitute it in giving us:

\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1+1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}

\displaystyle    +\displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2+1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}

\displaystyle    +\displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3+1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}

\displaystyle    + \cdots  \displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N+1}\right\}

\displaystyle \Big] \left.d\kappa\right|_{t=1}    \ \ \ \ \ (10)

Each of these Laplace transforms evaluates as:

\displaystyle  \int_0^\infty \theta^\alpha e^{-s\theta} d\theta =\frac{\Gamma(\alpha+1)}{s^{\alpha+1}} \ \ \ \ \ (11)
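As a quick numerical sanity check of this identity in base R (the values of alpha and s here are purely illustrative):

alpha <- 2.5
s <- 3
integrate(function(theta) theta^alpha * exp(-s * theta), 0, Inf)$value
gamma(alpha + 1) / s^(alpha + 1)   # both are roughly 0.071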

Substituting this back into our integral gives:

\displaystyle    \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{\Gamma(\alpha_1+2)}{\kappa^{\alpha_1+2}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}

\displaystyle  +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2+2)}{\kappa^{\alpha_2+2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots  \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}

\displaystyle  +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3+2)}{\kappa^{\alpha_3+2}} \cdots  \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}

\displaystyle  + \cdots

\displaystyle  +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots  \frac{\Gamma(\alpha_N+2)}{\kappa^{\alpha_N+2}}

\displaystyle \Big]

\displaystyle  \left.d\kappa\right|_{t=1} \ \ \ \ \ (12)

Since {\Gamma(x+2)=x(x+1)\Gamma(x)} we get:

\displaystyle  \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{\alpha_1(\alpha_1+1)\prod_{i=1}^N \Gamma(\alpha_i)}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}}

\displaystyle +\frac{\alpha_2(\alpha_2+1)\prod_{i=1}^N \Gamma(\alpha_i)}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}}

\displaystyle +\frac{\alpha_3(\alpha_3+1)\prod_{i=1}^N \Gamma(\alpha_i)}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}}

\displaystyle + \cdots

\displaystyle +\frac{\alpha_N(\alpha_N+1)\prod_{i=1}^N \Gamma(\alpha_i)}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}}

\displaystyle \Big]

\displaystyle \left.d\kappa\right|_{t=1}   \ \ \ \ \ (13)

Some rearranging gives us:

\displaystyle  \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{2\pi i} \sum_{i=1}^N(\alpha_i(\alpha_i+1)) \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{1}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}} \Big] \left.d\kappa\right|_{t=1} \ \ \ \ \ (14)

Now we use the inverse Laplace transform to evaluate {\frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{1}{\kappa^{\alpha_1+\alpha_2+ \cdots +\alpha_N + 2}} \Big] \left.d\kappa\right|_{t=1}}. Since the inverse Laplace transform of {1/\kappa^{n}} is {t^{n-1}/\Gamma(n)}, setting {n=2+\sum_{i=1}^N \alpha_i} and {t=1} gives {\frac{1}{\Gamma(2+\sum_{i=1}^N \alpha_i)}}, leaving us with:

\displaystyle \frac{\prod_{i=1}^N \Gamma(\alpha_i)\sum_{i=1}^N(\alpha_i(\alpha_i+1))}{\Gamma(2+\sum_{i=1}^N \alpha_i)}

\displaystyle =\frac{\prod_{i=1}^N \Gamma(\alpha_i)\sum_{i=1}^N(\alpha_i(\alpha_i+1))}{\Gamma(\sum_{i=1}^N \alpha_i)\sum_{i=1}^N \alpha_i (\sum_{i=1}^N \alpha_i+1)} \ \ \ \ \ (15)

Bringing the normalisation constant back in and cancelling, we get

\displaystyle  \frac{\sum_{i=1}^N\alpha_i(\alpha_i+1)}{\sum_{i=1}^N \alpha_i (1+\sum_{i=1}^N \alpha_i)} \ \ \ \ \ (16)

which is our expected loyalty {E[L]}.
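Wrapped up as a small R helper (the function names are mine, not from the original derivation), the result looks like this; applying it to the two election parametrisations from earlier confirms that small alphas mean very little switching and large alphas mean a lot:

expected.loyalty <- function(alphas) {
  sum(alphas * (alphas + 1)) / (sum(alphas) * (1 + sum(alphas)))
}
expected.switching <- function(alphas) 1 - expected.loyalty(alphas)

expected.switching(c(0.1, 0.029, 0.014))   # roughly 0.06: a very stable electorate
expected.switching(c(20, 5.71, 2.86))      # roughly 0.44: a lot of churn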

All we need to do now is check it’s right by comparing it with a simulated result in R:

library(gtools)   # provides rdirichlet (MCMCpack has an equivalent)

comp <- NULL
for (i in 1:100) {
  alphas <- runif(3, 0.01, 10)
  # Simulated: draw individual thetas, then two choices per individual
  sample.d1 <- rdirichlet(10000, alphas)
  purchases <- t(mapply(function(x, y, z) rmultinom(1, size = 2, prob = c(x, y, z)),
                        sample.d1[, 1], sample.d1[, 2], sample.d1[, 3]))
  loyal <- sum(purchases[, 1] == 2) + sum(purchases[, 2] == 2) + sum(purchases[, 3] == 2)
  switching.sim <- (10000 - loyal) / 10000
  # Derived: 1 - E[L] from the formula above
  switching.dir <- 1 - (sum(alphas * (alphas + 1)) / (sum(alphas) * (1 + sum(alphas))))
  comp <- rbind(comp, c(switching.sim, switching.dir))
}
colnames(comp) <- c("Simulated", "Derived")
plot(comp)
x <- seq(0, 1, 0.1)
lines(x, x)

A plot of the simulated against the derived results shows that, as we would hope, they are approximately equal (the line is x=y).

[Figure: simulated vs derived switching probabilities, with the line x=y]


