Include uncertainty in a financial model


On-line resources for analysts are often either too general to be of practical use or too specialised to be accessible. The aim of ragscripts.com is to remedy this by providing start-to-finish directions for complex analytical tasks. The site is under construction at the moment but any feedback is most welcome!

The problem

You’ve been asked to calculate some figure or other (e.g. end of year revenue, average customer lifetime value) based on numbers supplied from various parts of the business. You know how to make the calculation but what bothers you is that some of the figures going in are clearly approximations or at worst complete guesses. Therefore the final calculation will inherit this uncertainty. A decision will be made based on your final figure and more likely than not the decision makers will treat this number as a certainty. They won’t have any way of knowing how uncertain it is and may well draw the wrong conclusions. How do you get an understanding of the uncertainty around the final calculation and how do you explain it to the decision makers?

The standard approach in finance is to produce a set of best and worst case scenarios. These are tedious to produce and are often disregarded as unlikely outcomes. I think our approach is better as it gives a full picture of the uncertainty around an estimate.

What you’ll need

  • Basic R programming skills (including writing functions)
  • An understanding of probability distributions
  • About 1-3 hours (depending on the complexity of the model)

Example

Let’s say you are asked to project the revenue of a business in one year’s time based on the following numbers:

  • Customer base at the beginning of the year: 50,000
  • Average revenue per customer: £7
  • Monthly acquisition rate: 5%
  • Monthly churn rate: 2%

You might have thought that these figures could be known with some certainty but believe me, that's rarely the case, especially in large businesses where integrating systems to get an overall picture is a tough job. Add to this the fact that we are guessing at how well any current figures will hold up over the year and we should agree that there's a fair bit of uncertainty in the inputs.

With this in mind you ask the suppliers of the data for a lower and upper bound for the figures, i.e. the lowest and highest numbers that they would expect to see without being very surprised. Where they are unable or unwilling to do this you use your common sense. This gives you the following table:

Statistic          Estimate   Lower Bound   Upper Bound
ARPC               £7         £4            £10
Acquisition Rate   5%         3%            7%
Churn Rate         2%         1%            3%

Do it

Step 1: Setting up the distributions

There are lots of tools you could do this with but R seems most appropriate, so start up a session. We are going to model our uncertainty about the input values in each case as a normal distribution. We will say that the mean of the distribution is the value the business has estimated and that the 5% and 95% quantiles are the lower and upper bounds respectively.


Yes this is subjective (we could have chosen other values for the bounds) but the point of this exercise is not to make incredibly accurate statistical inferences; rather it is to show the consequences that different levels of uncertainty in the inputs have on the output. You can play with these parameters if you wish and say things like "if we were this accurate with our inputs then we would be this accurate with our output".

Let’s start by calculating the parameters (mean, standard deviation) for the three normal distributions we’ll need.

Let  q_{95} be the 95th quantile. Then

 q_{95} = \mu + 1.645 \sigma . Therefore   \sigma = \frac{q_{95}-\mu}{1.645}
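As a worked example of this formula, here is a small helper function (a sketch of my own; it assumes the estimate is the mean and the upper bound is the 95% quantile):

#A sketch: convert an estimate (taken as the mean) and an upper bound
#(taken as the 95% quantile) into the standard deviation of a normal distribution
bound.to.sd<-function(estimate, upper) (upper-estimate)/1.645

bound.to.sd(7, 10)      #ARPC: (10-7)/1.645, roughly 1.82
bound.to.sd(0.05, 0.07) #acquisition rate
bound.to.sd(0.02, 0.03) #churn rate

The calculations below do exactly this for each input in turn.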

#Work out the standard deviation for arpu
#(the upper bound of 10 is 1.645 standard deviations above the mean of 7, so sd = 3/1.645)
arpu.sd<-3/1.645
#Plot the resulting distribution
x<-seq(0, 15,by=0.5)
d<-dnorm(x, 7, arpu.sd)
plot(x, d, type='l')

This should give a plot that meets our expectations and it does:

[Plot: density of the normal belief distribution for average revenue per customer]

We’ll complete this step by doing the same for acquisition and churn.

#Do the same for acquisition and churn
acq.sd<-0.02/1.645
x<-seq(0, 0.2,by=0.001)
d<-dnorm(x, 0.05, acq.sd)
plot(x, d, type='l')
ch.sd<-0.01/1.645
x<-seq(0, 0.2,by=0.001)
d<-dnorm(x, 0.02, ch.sd)
plot(x, d, type='l')

Step 2: A function for calculating revenue

The next piece you need is a function for calculating the output (the statistic of interest) from the inputs. This is what would usually be captured in the formulae of an Excel spreadsheet.

For our example it looks like this:

revenue<-function(arpu, acq, ch){
  #Start with the customer base at the beginning of the year
  num.cust<-50000
  #Apply the monthly acquisition and churn rates for twelve months
  for (m in 1:12){
    num.cust<-num.cust+acq*num.cust-ch*num.cust
  }
  #Annual revenue is the year-end customer base times average revenue per customer
  return(num.cust*arpu)
}
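As a quick sanity check, running the point estimates through the function should reproduce the straight calculation mentioned later on (roughly 499k):

#Revenue using the point estimates only: £7 ARPC, 5% acquisition, 2% churn
revenue(7, 0.05, 0.02)
#approximately 499,000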

Step 3: Making random draws from our belief distributions

We will now generate 10 thousand values from each of our normally distributed input variables to create 10 thousand simulations. Each of these simulations represents a possible set of inputs to our model. For example one set might be:

  • Average revenue per customer: £7.50
  • Monthly acquisition rate: 3%
  • Monthly churn rate: 1%

However, the way we've set things up, unlikely values for each input are unlikely to be drawn from our distributions. In fact, since 10 thousand is a pretty large number, the frequency with which each possible value is drawn should fairly well reflect our initial beliefs.

Here is our code for simulating the input values:

#Now let's simulate 10k values from each of our distributions
sim.arpu<-rnorm(10000, 7, arpu.sd)
sim.acq<-rnorm(10000, 0.05, acq.sd)
sim.ch<-rnorm(10000, 0.02, ch.sd)

Note a couple of things we haven't accounted for. We've assumed that our beliefs about the input variables are independent of one another, but that might not be the case: we might think for some reason that if churn is higher then acquisition will be lower. If so we would want to draw our values from a multivariate distribution incorporating correlation between the variables (see the sketch below). However assuming independence is an easy and useful place to start. We've also assumed that the model itself is correct, i.e. that constant rates of churn and acquisition apply and that there are no other effects (for example seasonal or economic).
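If you did want correlated beliefs, here is a minimal sketch of how the draws above could be generated instead, assuming (purely for illustration) a correlation of -0.5 between acquisition and churn and using mvrnorm from the MASS package:

#A sketch of correlated belief distributions (the -0.5 correlation is illustrative only)
library(MASS) #for mvrnorm
sds<-c(arpu.sd, acq.sd, ch.sd)
corr<-matrix(c(1,  0,    0,
               0,  1,   -0.5,
               0, -0.5,  1), nrow=3, byrow=TRUE)
Sigma<-diag(sds)%*%corr%*%diag(sds) #convert correlations to covariances
sims<-mvrnorm(10000, mu=c(7, 0.05, 0.02), Sigma=Sigma)
sim.arpu.c<-sims[,1]; sim.acq.c<-sims[,2]; sim.ch.c<-sims[,3]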

Step 4: Running them through the function

Now everything is in place and it's a simple matter to apply our function to each of the 10 thousand simulated sets of input variables. We use R's mapply function.

sim.rev<-mapply(revenue, sim.arpu, sim.acq, sim.ch)

Step 5: Examining the results

Since our simulated input variables were distributed in a way that reflects our beliefs about their likely values, our final revenue calculations, based on these simulated inputs, will also have a distribution that reflects this uncertainty. The next few lines of code show us this distribution.

summary(sim.rev)
hist(sim.rev)
plot(density(sim.rev))

The histogram of the possible values for annual revenue makes the point of doing all of this quite clear. A straight calculation based on the original input values gives a revenue value of 499k. However a quick glance at the histogram shows that revenues as low as 400k are not at all improbable. Any planning should account for this!

[Histogram of the simulated annual revenue values]
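If you want a number to report alongside the histogram, quantiles of the simulated revenue give a convenient interval (the choice of the 5% and 95% points here is mine):

#The median and a 90% interval for annual revenue
quantile(sim.rev, probs=c(0.05, 0.5, 0.95))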

See it

Here’s a quick JSfiddle to see how inputs affect outputs in the example case.

Explain it

So how do you explain what you have done and why you have done it?

Chances are you may run into the analyst's catch-22 (see skit). You've done a much better job because you've incorporated uncertainty into your forecast, allowing the decision maker to make a more informed decision. However people like certainty, and the very fact you have done this may be perceived as incompetence: "What do you mean you don't know what the revenue figure is going to be – go back and do it again!"

So either you compromise your own integrity by glossing over the uncertainty or you’re branded an idiot or at least a boffin who overcomplicates things!

But there are some ways around this. Try to find out what decisions are being made using the data. Some of the frustration will come from being handed something that doesn’t solve the problem and the decision maker may have a point: if the range of accuracy is not enough to make a decision the model may need to be improved in some way.

Also talking about the decision and the problem from the outset is a good way of driving home the value of what you are doing. The very purpose of showing the uncertainty in the estimate is to prevent misinformed decisions being made.

The decision in our example might be whether or not we can invest in a new product now, given that we still need to be in profit by the end of the year. Our model as it stands shows that if all other outgoings come to 150k then we'd be taking a risk spending over 50k on the new product (since there's a reasonable chance that revenue could be as low as 200k). If 50k was what we were planning then we've done the job. If however a new product can't be done for less than 100k then it's worth spending some time improving our model.
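Under those assumptions (150k of other outgoings plus the proposed product spend) the simulation gives a direct, if rough, estimate of the risk, for example:

#Estimated probability that revenue fails to cover 150k of outgoings plus a 50k product
mean(sim.rev < 150000 + 50000)
#Or: the product budget that keeps roughly a 95% chance of staying in profit
quantile(sim.rev, probs=0.05) - 150000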

Fork it

Take my code and adapt it for your own models. You can find the gist here.

Freehand Diagrams with Adobe Ideas


Freehand diagrams have two big virtues: they are quick and they are unconstrained.

I used to use a notebook (see What are degrees of freedom?) but recently I got an iPad and then I found Adobe Ideas. It's completely free and has just the right level of complexity for getting ideas down fast.

A diagram for teaching machine learning

It takes a bit of perseverance to get into a new habit but it pays off. Here are some examples and some tips on how to make it work as an alternative to paper.

First, don't bother using your fingers: you won't have nearly enough control. Get a stylus. I use this one, which is fairly cheap but works fine. I've noticed that it moves more freely when the iPad is in landscape position so I always draw with it that way round. I don't know why this is but I guess it's something to do with the glass.

The confusion matrix

Neighbourhood of C major

When you first open Adobe Ideas it's tempting to see the page in front of you as something that needs to be filled, as you would a piece of paper. That's the wrong way to approach it as you'll get overly distracted by the task of orientating your work on the page. Instead treat the page as an enormous workboard. Start by zooming in somewhere in the middle and work outwards.

A diagram for teaching machine learning

It's vector-graphics based so in principle infinitely deep: you're only limited by how thin the pens get. Draw your diagram without any thought for the borders, then right at the end scale it up to fit the page.

The next thing that might strike you is that there's no copy and paste. Ideas is part of Adobe's creative suite, along with Photoshop and Illustrator, and as such inherits their layer-based approach to image editing. If you want to be able to adjust the positioning of parts of your diagram (say the key with respect to the rest of it) then you would be wise to put these on separate layers. You'd think this would be deeply frustrating; however it's one of those constraints that somehow makes you work better as you think more about your layout.

You are allowed 10 layers in total with one photo layer. I sometimes place an image of a grid on this layer if I need grid lines to work to.

If you are working quickly you don't want to spend too much time fretting about the colour palette. Again Ideas forces you into simplicity by limiting your colours to five at a time. Even better, you can connect up to Kuler provided its app is also on your iPad. This gives you access to a huge range of palettes. The Kuler tool, which sucks the palette out of an image, is also there by default in Ideas if you click on themes. This allows you to pull the colour scheme out of any image on the photo layer.

Some Diagram Elements

When it comes to using the pen tool I tend to stick to the marker and constantly zoom in and out as I draw. I zoom right in for writing text and out again for drawing shapes. You soon get used to flipping between using your fingers to zoom and pan and your stylus to draw. It's worth noting down the width of the pens you are using as it's easy to lose track of where you are, and a diagram made up of many different line widths looks bad. I tend to use two widths: 6.5 and 3.

The toughest thing to pull off is anything that requires large-scale swipes across the surface of the iPad, for example long straight lines or wide circles. It is easier to do this on a small scale with a thin pen and then enlarge the image.

One thing I didn't realise for a while was that holding the pen tool for a few seconds in any completely circumscribed area floods that area with colour. This is very useful for filling shapes and creating colourful backgrounds.

I try to vary styles and colours although it takes an effort to keep this up if you are taking notes or in a rush. I've listed in the panel on the left some of the diagram elements I've experimented with.

Adobe Ideas looks at first sight too limited to be particularly powerful but take a look at How to Use Adobe Ideas by Michael Startzman to see what is achievable by a professional!

Book Recommendations from Beyond the Grave: A Mahout Example


In H. P. Lovecraft's The Case of Charles Dexter Ward the villainous Curwen, having taken possession of the body of Charles Dexter Ward, uses a combination of chemistry and black magic to bring back from the dead the wisest people who have ever lived. He then tortures them for their secrets. Resurrection of the dead would be a bit of an overclaim for machine learning but we can get some of the way: we can bring them back for book recommendations!

The Case of Charles Dexter Ward

It’s a grandiose claim and you can judge for yourself how successful it is. Really I just wanted an opportunity to experiment with Apache Mahout and to brush up on my Java. I thought I’d share the results.

Apache Mahout is a scalable machine learning library written in Java (see previous post). The recommender part of the library takes lists of users and items (for example it could be all users on a website, the books they purchased and perhaps even their ratings of those books). Then for any given user, say Smith, it works out the users that are most similar to that user and compiles a list of recommendations based on their purchases (ignoring of course anything Smith already has). It can get more complicated but that’s the gist of it.

So all we need to supply the algorithm with is a list of users and items.

Casting about for a test data set I thought it might be interesting to take the influencers data set that I extracted from Wikipedia for a previous post. This data set is a list of who influenced whom across all the famous people covered on Wikipedia. For example the list of influences for T. S. Eliot is: Dante Alighieri, Matthew Arnold, William Shakespeare, Jules Laforgue, John Donne, T. E. Hulme, Ezra Pound, Charles Maurras and Charles Baudelaire.

Why not use "x influenced y" as a proxy for "y has read x"? It's probably true in a lot of cases. We can imagine that the famous dead have been shopping on Amazon for their favourite authors and we now have the database. Even better, we can tap into their likes and dislikes to give us some personalised recommendations.

It works (sort of) and was very interesting to put together. Here’s a demo:

Say you liked English Romantic poets. Then you might submit the names:

http://simpleweb-deadrecommender.rhcloud.com/?names=john keats, william wordsworth, samuel taylor coleridge

Click on the link to see what the web app returns. You’ll find that you need to hit refresh on your browser whenever you submit something. Not sure why at the moment.

Or you might try some painters you like:

http://simpleweb-deadrecommender.rhcloud.com/?names=Pablo Picasso, Henri Matisse, Paul Gauguin

Or even just …

http://simpleweb-deadrecommender.rhcloud.com/?names=Bob Dylan

Or you could upset things a bit by adding in someone incongruous!

http://simpleweb-deadrecommender.rhcloud.com/?names=john keats, william wordsworth, samuel taylor coleridge, arthur c clarke

Have a go. All you need to do is add some names separated by commas after the names= part of the HTTP request. Note that although my code does some basic things to facilitate matching, the names will have to roughly match what is in Wikipedia, so if you're getting an error just look the person up there.

What are we getting back?

The first list shows, in Mahout terminology, your user neighborhood, that is, the individuals in Wikipedia who are judged to have similar tastes to you. I have limited the output to 20, though in fact the recommender is set to use 80. I will explain this in more detail in later posts. They are not in rank order (I haven't figured out how to do this yet).

The second list is the recommendations derived from the preferences of your 'neighbours'. Again I'll explain this in more detail later (or you could look at Mahout in Action, where it is explained nicely).

The code is available on GitHub.

Sometimes the app goes dead for some reason and I have to restart it. If that happens just drop me a line and I’ll restart it.

I plan to do several more posts on how this was done. At the moment I’m thinking

  • A walk though of the code
  • An explanation of how the recommender was tuned
  • How it was deployed as a web app on Open Shift

But if anyone out there is interested in other things or has questions just let me know.

Simon

What are degrees of freedom?


I remember getting frustrated as an undergraduate trying to find a straight answer to this question.

The standard text book answer is something like this:

"In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary"

That’s from Wikipedia but it’s fairly typical.

I could just about make sense of this for something like a chi-squared statistic but why, could someone explain to me, are the degrees of freedom for linear regression n-k-1?

I realise this doesn’t keep many people awake but it did me so I was pleased to find the following quote:

"The person who is unfamiliar with N-dimensional geometry or who knows the contributions to modern sampling theory
only from secondhand sources such as textbooks, this concept often seems almost mystical, with no practical meaning."
Walker, 1940

Many years later I’m nasty enough to use it as an interview question. In a kinder frame of mind I thought I’d post my slightly XKCD inspired notes for explaining, as simply as I can, the concept in terms of N-dimensional geometry.

I hope you can read my writing and apologies to mathematicians if the language is a bit loose!

[Five pages of handwritten notes explaining degrees of freedom in terms of N-dimensional geometry]

Expected switching for the Dirichlet distribution


A valuable tool in choice modelling is the Dirichlet-multinomial distribution. It’s a compound of the multinomial and Dirichlet distributions and it works like this:

  • A choice between N options is modelled as a multinomial distribution with parameters θ1, θ2, θ3 … θN, where the thetas also represent the probabilities of each option being chosen. For example we might model votes cast in an election as draws from a multinomial distribution with parameters θ1=0.7, θ2=0.2, θ3=0.1.
  • However the multinomial distribution by itself is likely to be a poor model of the choices made within a population as it assumes all individuals select options with the same probabilities. It would be more realistic to say that the thetas themselves vary over the population. This gives us what is known as an over-dispersed distribution: the parameters for one distribution are modelled by another distribution. In this case we use a Dirichlet distribution, which is the multivariate version of a Beta distribution, to model the distribution of the thetas.

As we’ll be using it a lot here’s the probability density function for the Dirichlet distribution.

f(\theta_1,\dots, \theta_N; \alpha_1,\dots, \alpha_N) = \frac{1}{\mathrm{B}(\alpha)} \prod_{i=1}^N \theta_i^{\alpha_i - 1}

Where the normalising constant is:

\mathrm{B}(\alpha) = \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{\Gamma\bigl(\sum_{i=1}^N \alpha_i\bigr)},\qquad\alpha=(\alpha_1,\dots,\alpha_N)

One of the powerful things about the Dirichlet distribution as a modelling tool is that it allows us to capture not only the proportions of the population that opt for each choice but also the amount of switching between the choices from one draw to the next. To take our election example again, an election result of 70%, 20%, 10% for three parties could be modelled by a Dirichlet distribution with alphas of 0.1, 0.029 and 0.014 or with alphas of 20, 5.71 and 2.86. In fact there are infinitely many possible settings of the alpha parameters that will produce this result. The difference between them is stability. If two successive elections produce the same result then this could be because the same people are voting for the same parties or, less likely but equally possible, people are switching their votes but in such a way that the net result is the same. Different settings of the alpha parameters produce different levels of switching.
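A quick way to see this is to draw from both parametrisations in R (this assumes the rdirichlet function from the gtools package):

library(gtools) #for rdirichlet (also available in MCMCpack)
set.seed(1)
#Both settings average out to roughly 70% / 20% / 10%...
colMeans(rdirichlet(10000, c(0.1, 0.029, 0.014)))
colMeans(rdirichlet(10000, c(20, 5.71, 2.86)))
#...but with small alphas each individual's probabilities sit near 0 or 1 (loyal voters),
#while with large alphas they sit near the overall shares (voters who switch readily)
round(rdirichlet(5, c(0.1, 0.029, 0.014)), 2)
round(rdirichlet(5, c(20, 5.71, 2.86)), 2)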

A natural question is then: given a particular parametrisation of the Dirichlet distribution, what is the expected percentage of individuals that will switch category from one draw of the multinomial distribution to another?

I’m sure this has been worked out before somewhere but after a quick and fruitless trawl through the online literature I decided to do it myself, helped a lot by a great post from Leo Alekseyev who demonstrates a clever way of integrating over Dirichlet distributions. All I’ve done is adapt his technique.

(By the way, to convert the LaTeX in this post into a form easily used in WordPress I used an excellent Python package from Luca Trevisan.)

So let’s say we have N choices. For individual i the probability of picking choice j is θij. What then is the probability that a randomly selected individual will make the same choice in two successive draws from a multinomial distribution? The individual could either select the first option twice or the second option twice or the third option twice etc. In other words the probability we are interested in is

L=\theta_{i1}^2 + \theta_{i2}^2 + \theta_{i3}^2 + \cdots + \theta_{iN}^2

We will call L the loyalty and work out expected switching as 1-E[L].

Leo Alekseyev has created an excellent video on YouTube talking through his technique. I would recommend watching it if you would like to follow the arguments below. If you're just interested in the end result then scroll to the end of the post.

The quantity we are interested in is {L=\theta_1^2+\theta_2^2+ \cdots +\theta_N^2}, where the {\theta_i} come from a Dirichlet distribution with parameters { \alpha_1, \ \alpha_2, \ \alpha_3, \ldots \alpha_N }. To get the expected value of L we can use a generalised version of the Law of the Unconscious Statistician to adapt the proof given by Leo Alekseyev. As a reminder, the Law of the Unconscious Statistician is:

\displaystyle  E[g(X)] = \int_{-\infty}^\infty g(x) f(x) \ dx \ \ \ \ \ (1)

where {f(x)} is the probability distribution of the random variable X.

This will give us the following integral

\displaystyle  E[L]= \frac{1}{B(\alpha)}\int \cdots \int_\mathbf{D}\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} \ d\theta_1 \!\cdots d\theta_N \ \ \ \ \ (2)

So how do we evaluate this integral? Its domain D is not straightforward as it is constrained by { \left\vert \bf{\theta} \right\vert = 1 } (i.e. the probabilities must sum to one).

Leo Alekseyev shows us a trick using the Dirac delta function:

\displaystyle  \delta(x)=\frac{1}{2\pi} \int_{-\infty}^\infty e^{ikx} dk \ \ \ \ \ (3)

This (generalised) function, the limit of increasingly concentrated distributions, has an area of one beneath the curve (if you can call it that) at x=0 and an area of zero everywhere else – a slightly odd concept, sometimes thought of as an infinitely tall spike above the origin. The helpful thing for us is that if we set {x=1-\theta_1-\theta_2- \cdots -\theta_N} and multiply the contents of the integral by the delta function then this is equivalent to evaluating the integral over D.

\displaystyle  \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty \cdots \int_{\theta_N =0}^\infty\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} \delta (1-\theta_1-\theta_2 \cdots \theta_N) \ d\theta_1 \!\cdots d\theta_N \ \ \ \ \ (4)

Since…

\displaystyle \delta (1-\theta_1-\theta_2- \cdots -\theta_N) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{ik(1-\theta_1-\theta_2- \cdots -\theta_N)} dk

\displaystyle   = \frac{1}{2\pi}\int_{-\infty}^\infty e^{ik} e^{-ik\theta_1} e^{-ik\theta_2} \cdots e^{-ik\theta_N}dk    \ \ \ \ \ (5)

… the integral can be rewritten as:

\displaystyle  \frac{1}{2\pi} \int_{-\infty}^\infty e^{ik} \int_{\theta_1 =0}^\infty \int_{\theta_2 =0}^\infty \cdots \int_{\theta_N =0}^\infty\ \sum_{j=1}^N \theta_{j}^2 \ \prod_{j=1}^N \theta_j^{\alpha_j-1} e^{-ik\theta_1} e^{-ik\theta_2} \cdots e^{-ik\theta_N} \ d\theta_1 \!\cdots d\theta_N\ dk\ \ \ \ \ (6)

If we group together like terms we reach:

\displaystyle   \frac{1}{2\pi} \int_{-\infty}^\infty e^{ik} \Big[

\displaystyle   \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1+1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2+1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3+1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-ik\theta_N} d\theta_N

\displaystyle  \cdots + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-ik\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-ik\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-ik\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N+1} e^{-ik\theta_N} d\theta_N

\displaystyle   \Big] dk \ \ \ \ \ (7)

Note the pattern in the exponents!

Continuing along the lines described by Leo Alekseyev we use the substitution {\kappa=ik} (so that {dk=d\kappa/i}) to set us up for the Laplace transform, and evaluate the integral at {t=1} to set us up for the inverse Laplace transform.

\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[

\displaystyle   \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1+1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2+1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N

\displaystyle   + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3+1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N-1} e^{-\kappa\theta_N} d\theta_N

\displaystyle  \cdots + \int_{\theta_1 =0}^\infty \theta_1^{\alpha_1-1} e^{-\kappa\theta_1} d\theta_1 \int_{\theta_2 =0}^\infty \theta_2^{\alpha_2-1} e^{-\kappa\theta_2} d\theta_2 \int_{\theta_3 =0}^\infty \theta_3^{\alpha_3-1} e^{-\kappa\theta_3} d\theta_3 \cdots \int_{\theta_N =0}^\infty \theta_N^{\alpha_N+1} e^{-\kappa\theta_N} d\theta_N

\displaystyle \Big] \left.d\kappa\right|_{t=1} \ \ \ \ \ (8)

As a reminder the Laplace transform is:

\displaystyle  \displaystyle\mathcal{L} \left\{f(t)\right\}=\int_0^\infty f(t)e^{-st}dt \ \ \ \ \ (9)

So we can substitute it in giving us:

\displaystyle \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1+1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}

\displaystyle    +\displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2+1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}

\displaystyle    +\displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3+1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N-1}\right\}

\displaystyle    + \cdots  \displaystyle\mathcal{L} \left\{\theta_1^{\alpha_1-1}\right\} \displaystyle\mathcal{L} \left\{\theta_2^{\alpha_2-1}\right\} \displaystyle\mathcal{L} \left\{\theta_3^{\alpha_3-1}\right\} \cdots \displaystyle\mathcal{L} \left\{\theta_N^{\alpha_N+1}\right\}

\displaystyle \Big] \left.d\kappa\right|_{t=1}    \ \ \ \ \ (10)

Each of these Laplace transforms evaluates as:

\displaystyle  \int_0^\infty \theta^\alpha e^{-s\theta} d\theta =\frac{\Gamma(\alpha+1)}{s^{\alpha+1}} \ \ \ \ \ (11)

Which we can substitute back into our integral:

\displaystyle    \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{\Gamma(\alpha_1+2)}{\kappa^{\alpha_1+2}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}

\displaystyle  +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2+2)}{\kappa^{\alpha_2+2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots  \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}

\displaystyle  +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3+2)}{\kappa^{\alpha_3+2}} \cdots  \frac{\Gamma(\alpha_N)}{\kappa^{\alpha_N}}

\displaystyle  + \cdots

\displaystyle  +\frac{\Gamma(\alpha_1)}{\kappa^{\alpha_1}} \frac{\Gamma(\alpha_2)}{\kappa^{\alpha_2}} \frac{\Gamma(\alpha_3)}{\kappa^{\alpha_3}} \cdots  \frac{\Gamma(\alpha_N+2)}{\kappa^{\alpha_N+2}}

\displaystyle \Big]

\displaystyle  \left.d\kappa\right|_{t=1} \ \ \ \ \ (12)

Since {\Gamma(x+2)=x(x+1)\Gamma(x)} we get:

\displaystyle  \frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{\prod_{i=1}^N \Gamma(\alpha_i)\alpha_1(\alpha_1+1)}{\kappa^{\alpha_1+\alpha_2+ \cdots \alpha_N + 2}}

\displaystyle +\frac{\prod_{i=1}^N \Gamma(\alpha_i)\alpha_2(\alpha_2+1)}{\kappa^{\alpha_1+\alpha_2+ \cdots \alpha_N + 2}}

\displaystyle +\frac{\prod_{i=1}^N \Gamma(\alpha_i)\alpha_3(\alpha_3+1)}{\kappa^{\alpha_1+\alpha_2+ \cdots \alpha_N + 2}}

\displaystyle + \cdots

\displaystyle +\frac{\prod_{i=1}^N \Gamma(\alpha_i)\alpha_N(\alpha_N+1)}{\kappa^{\alpha_1+\alpha_2+ \cdots \alpha_N + 2}}

\displaystyle \Big]

\displaystyle \left.d\kappa\right|_{t=1}   \ \ \ \ \ (13)

Some rearranging gives us:

\displaystyle  \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{2\pi i} \sum_{i=1}^N(\alpha_i(\alpha_i+1)) \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{1}{\kappa^{\alpha_1+\alpha_2 \cdots \alpha_N + 2}} \Big] \left.d\kappa\right|_{t=1} \ \ \ \ \ (14)

Now we use the inverse Laplace transform to evaluate {\frac{1}{2\pi i} \int_{- i \infty}^{i \infty} e^{\kappa t} \Big[ \frac{1}{\kappa^{\alpha_1+\alpha_2 \cdots \alpha_N + 2}} \Big] \left.d\kappa\right|_{t=1}} as {\frac{1}{\Gamma(2+\sum_{i=1}^N \alpha_i)}} giving us:

\displaystyle \frac{\prod_{i=1}^N \Gamma(\alpha_i)\sum_{i=1}^N(\alpha_i(\alpha_i+1))}{\Gamma(2+\sum_{i=1}^N \alpha_i)}

\displaystyle =\frac{\prod_{i=1}^N \Gamma(\alpha_i)\sum_{i=1}^N(\alpha_i(\alpha_i+1))}{\Gamma(\sum_{i=1}^N \alpha_i)\sum_{i=1}^N \alpha_i (\sum_{i=1}^N \alpha_i+1)} \ \ \ \ \ (15)

Bringing the normalisation constant back in and cancelling out we get

\displaystyle  \frac{\sum_{i=1}^N\alpha_i(\alpha_i+1)}{\sum_{i=1}^N \alpha_i (1+\sum_{i=1}^N \alpha_i)} \ \ \ \ \ (16)

which is our expected loyalty {E[L]}.
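Wrapped up as an R function (the name is mine) and applied to the two election parametrisations from earlier, the result makes the stability point concrete:

#Expected switching for a Dirichlet-multinomial with parameter vector alpha
expected.switching<-function(alpha) 1-sum(alpha*(alpha+1))/(sum(alpha)*(1+sum(alpha)))
expected.switching(c(0.1, 0.029, 0.014)) #around 0.06: very little switching
expected.switching(c(20, 5.71, 2.86))    #around 0.44: lots of switching, same overall shares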

All we need to do now is check it's right by comparing it with a simulated result in R:

library(gtools) #for rdirichlet (also available in MCMCpack)

comp<-NULL
for (i in 1:100){
  alphas<-runif(3, 0.01, 10)
  #Simulate
  sample.d1<-rdirichlet(10000, alphas)
  purchases<-t(mapply(function(x, y, z) rmultinom(1, size = 2, prob=c(x,y,z)), sample.d1[,1], sample.d1[,2], sample.d1[,3]))
  loyal<-sum(purchases[,1]==2)+sum(purchases[,2]==2)+sum(purchases[,3]==2)
  switching.sim<-(10000-loyal)/10000
  #Derived
  switching.dir<-1-(sum(alphas*(alphas+1))/(sum(alphas)*(1+sum(alphas))))
  comp<-rbind(comp, c(switching.sim, switching.dir))
}
colnames(comp)<-c("Simulated", "Derived")
plot(comp)
x<-seq(0,1,0.1)
lines(x,x)

A plot of the simulated results against the derived results shows that, as we would hope, they are approximately equal (the line is x=y).

[Plot: simulated switching rates plotted against the derived values]
