In this post, I’ll be describing how I implemented a zero-truncated poisson distribution in PyMC3, as well as why I did so.
What is truncation?
Truncated distributions arise when some parts of a distribution are impossible to observe. For example, consider how often someone visits a store in a week. It’s natural to count people from one, and it’s hard to imagine the people who didn’t visit. For example, do we assume that every single person in the world didn’t visit? All 7.6 billion of them?
It’s natural to use a Poisson distribution for counts like these, but unfortunately the poisson distribution can produce zero counts. Because there are no zero-counts, the poisson’s estimate of the average visiting rate is far higher than expected. This also causes something called overdispersion, where the estimated variance of the lambda value is higher than it needs to be, making it hard to compare different rates.
See the accompanying notebook for the code.
In this problem, we’ll be working with a really simple dataset. In it, we have two groups of people, one with an average of 0.79 visits per week, the other with 0.95 visits per week:
Just to start off, we make sure that we an fit the non-truncated version:
This works perfectly (that’s good), and here is the distribution of the two lambda values, as well as the true values (in blue and orange), and histograms for post-predictive checks:
Generally, the 95% confidence intervals will not include each other (depending on the particulars of the random dataset). And also, the 95% CI contains the two lambda values we expect!
Here, we just use the poisson distribution and ignore any issues of truncation:
This distribution is less promising:
The lambdas recovered are far from where we’d expect them, and the HPD intervals sit over each other, suggesting that the two groups do not have different visit rates (even though we know they do). Lastly, the PPC checks are just way off - the two distributions are wildly different.
I also had some warnings while fitting this - PyMC3 is trying to tell me I messed up 😊.
Solution using a truncated distribution
I’ll go into the implementation of the zero-truncated poisson in the next section, but here’s an example of it in action;
Notice here that we’ve successfully used the truncated data to retrieve the correct lambda values! Additionally, the two distribution’s HPD do not overlap, meaning that we’ve (correctly) inferred that there are different values for the two group’s visitation rates.
Things we’ve done:
With our zero-truncated poisson we’ve:
- ✔ Accounted for the truncated zero data
- ✔ correctly recovered underlying lambda values even when large parts of the dataset are unobserved
- ✔ managed to prove that there’s a significant difference between the two groups - impossible without accounting for the truncation.
How’s it done? Two steps:
- Use the truncated distribution formula to work out the log-pdf of the distribution.
- Use the same formulas to work out the inverse-CDF of the new distribution
- Implement the distribution in pymc3
The poisson distribution’s PDF is .
The new PDF is given by where is the poisson CDF and is the modified PDF which is zero at .
What we’re doing here is cutting out part of the PDF (to make ), and scaling the PDF to account for the missing probability. The poisson CDF is 1 at inifinity (remember that the CDF here represents the probability of seeing a count less than infinity - so its probability is one), so . is simply found by substituting it into the formula on wikipedia for the CDF of the poisson distribution. I used the summation approach - because it’s very simple:
Which of course simplifies to . The final formula is (click here for a derivation):
So now we simply take the log of this, and we get:
PyMC3 has a few useful functions here to help us out:
- logpow: for taking the log of powers, with special cases for zeros
- factln: a differential log-factorial
And the final expression is:
p = logpow(mu, value) - (factln(value) + pm.math.log(pm.math.exp(mu)-1))
For this, we need a few ingredients:
- a way to map random numbers in interval [0, 1) to the truncated interval
- The ppf function, which is what scipy calls the inverse CDF
Then it’s simply a matter of taking the mapped numbers and then passing them through the ppf function of the existing poisson distribution.
I implemented this as a “scipy-like” function that I could pass in to PyMC3’s
Hopefully at this point you’ve seen the importance of accoutning for truncated data - if you want to the the right results, you should be checking for truncation effects. And hopefully I’ve given you enough of a starting point to write your own truncated distributions, should the need arise.