June 16, 2014

# Distribution for the difference between two binomially distributed random variables

I was doing some simulation and I needed a distribution for the difference between two proportions. It’s not quite as straightforward as the difference between two normally distributed variables and since there wasn’t much online on the subject I thought it might be useful to share.

$X \sim Bin(n_1, p_1)$

$Y \sim Bin(n_2, p_2)$

We are looking for the probability mass function of $Z=X-Y$

First note that the min and max of the support of Z must be $(-n_2, n_1)$ since that covers the most extreme cases ($X=0$ and $Y=n_2$) and ($X=n_1$ and $Y=0$).

Then we need a modification of the binomial pmf so that it can cope with values outside of its support.

$m(k, n, p) = \binom {n} {k} p^k (1-p)^{n-k}$ when $k \leq n$ and 0 otherwise.

Then we need to define two cases

1. $Z \geq 0$
2. $latex Z < 0$ In the first case $latex p(z) = \sum_{i=0}^{n_1} m(i+z, n_1, p_1) m(i, n_2, p_2)$ since this covers all the ways in which X-Y could equal z. For example when z=1 this is reached when X=1 and Y=0 and X=2 and Y=1 and X=3 and Y=4 and so on. It also deals with cases that could not happen because of the values of $latex n_1$ and $latex n_2$. For example if $latex n_2 = 4$ then we cannot get Z=1 as a combination of X=4 and Y=5. In this case thanks to our modified binomial pmf the probablity is zero. For the second case we just reverse the roles. For example if z=-1 then this is reached when X=0 and Y=1, X=1 and Y=2 etc. \$latex p(z) = \sum_{i=0}^{n_2} m(i, n_1, p_1) m(i+z, n_2, p_2)[l\atex] Put them together and that's your pmf.

Here’s the function in R and a simulation to check it’s right (and it does work.)

### About the Author

Simon Raper I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia an analytics startup that uses agile methods to bring machine learning and other cutting edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, spark, Hadoop), interactive visualisatons (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts - a blog on computational statistics, machine learning, data visualisation, R, python and cloud computing. It has had over 310 K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.