I spend so much of my free time on reddit that I decided what I really needed was a way to spend even more time browsing it.
So, naturally, I decided to download some reddit data and try to find some new subs. In doing so, I encountered a common problem with these sorts of recommendation situations. Often, recommendations are for what everybody does, rather than being personalized for me and my situation. Relevancy scoring is a simple way to avoid this problem.
Let’s start with a really simple approach to the problem of finding new subreddits:
What other subreddits do people who comment on /r/programming most often comment on?
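A minimal sketch of this first approach, using made-up data. It assumes the data has been reduced to (user, subreddit) pairs and that we count distinct users rather than individual comments; both are assumptions, as are the names `comments` and `co_commented`.

```python
from collections import Counter

# Hypothetical toy data: one (user, subreddit) pair per comment.
comments = [
    ("alice", "programming"), ("alice", "funny"), ("alice", "python"),
    ("bob", "programming"), ("bob", "funny"), ("bob", "pics"),
    ("carol", "funny"), ("carol", "pics"),
    ("dave", "programming"), ("dave", "funny"),
]

def co_commented(comments, target):
    """Count, per subreddit, how many commenters on `target` also comment there."""
    subs_by_user = {}
    for user, sub in comments:
        subs_by_user.setdefault(user, set()).add(sub)

    counts = Counter()
    for subs in subs_by_user.values():
        if target in subs:
            # Count every other sub this user comments on, once per user.
            counts.update(subs - {target})
    return counts

counts = co_commented(comments, "programming")
top = counts.most_common(1)  # the single most shared subreddit
```

Even in this toy data the winner is the sub that nearly everyone comments on, not the one that is specific to programmers.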
The answer is disappointing: they comment on the subs that everyone comments on. The top 5 subreddits most commented on are:
Scoring with probability
How about something a bit more refined?
For those who comment on /r/programming, which subreddits do they comment on unusually often?
This is called the relevancy score, and is simply
P(comments on sub|comments on programming) / P(comments on sub).
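As a rough sketch of this ratio, assume each probability is estimated as a fraction of distinct users (the post does not say whether users or comments are counted). The `subs_by_user` mapping and the function name are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: the set of subreddits each user comments on.
subs_by_user = {
    "alice": {"programming", "funny", "obscure"},
    "bob": {"programming", "funny", "python"},
    "carol": {"funny"},
    "dave": {"funny", "pics"},
    "erin": {"programming", "python"},
    "frank": {"pics"},
}

def relevancy_scores(subs_by_user, target):
    """P(sub | comments on target) / P(sub), estimated from user fractions."""
    n_users = len(subs_by_user)
    target_users = [subs for subs in subs_by_user.values() if target in subs]

    overall = Counter()  # users per sub, across everyone
    for subs in subs_by_user.values():
        overall.update(subs)

    within = Counter()  # users per sub, among commenters on `target`
    for subs in target_users:
        within.update(subs)

    scores = {}
    for sub, k in within.items():
        if sub == target:
            continue
        p_given_target = k / len(target_users)
        p_baseline = overall[sub] / n_users
        scores[sub] = p_given_target / p_baseline
    return scores

scores = relevancy_scores(subs_by_user, "programming")
```

In this toy data `funny` scores about 1 (no more popular with programmers than with anyone else), while `obscure`, with a single commenter, ties `python` for the top score. That hints at how fragile the ratio is for small counts.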
This doesn’t quite work either. The top 5 are:
Now, I’ve never heard of any of these subs. The problem here is that we’re looking at subs with only one or two comments in total. Their baseline probability is tiny, so a single comment from an /r/programming user makes their score massive. The commonplace solution would be to add some “smoothing” to the denominator, making less-likely subs appear less often, but it’s not quite right.
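To make the smoothing idea concrete, here is a minimal sketch. The constant `alpha` and all the counts below are hypothetical; nothing here comes from the real data.

```python
def relevancy(within, n_target, overall, n_users, alpha=0.0):
    """Relevancy ratio with `alpha` pseudo-users added to the baseline count.

    alpha=0 gives the plain ratio P(sub | target) / P(sub).
    """
    p_given_target = within / n_target
    p_baseline = (overall + alpha) / n_users
    return p_given_target / p_baseline

# Hypothetical totals: 1,000 /r/programming commenters out of 100,000 users.
N_TARGET, N_USERS = 1_000, 100_000

# A tiny sub: 2 commenters overall, 1 of whom also comments on /r/programming.
tiny_raw = relevancy(1, N_TARGET, 2, N_USERS)
tiny_smooth = relevancy(1, N_TARGET, 2, N_USERS, alpha=50)

# A plausibly related sub: 2,000 commenters overall, 200 in the target group.
related_raw = relevancy(200, N_TARGET, 2_000, N_USERS)
related_smooth = relevancy(200, N_TARGET, 2_000, N_USERS, alpha=50)
```

With `alpha=50` the tiny sub drops below the related one, but with a smaller constant such as 5 it would still rank first, so the outcome hinges on a hand-picked value.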
Thinking about probability
I find that in my work, many, many questions in machine learning or data analysis can be solved by thinking of the problem probabilistically. In this case, we asked for some measure of how much more relevant a subreddit is to a set of users, compared to the general baseline.
The mistake here is that we don’t have accurate probabilities. For these smaller subreddits, we estimate the probability of a comment on them from the number of comments we observed, but even small amounts of noise in commenting will change the results massively.
Instead, we need to ask:
What is the probability of observing this level of commenting by /r/programming commenters, compared to the baseline?
By rephrasing the question as a probability, and including the possibility of randomness in our measurements (comments), we’re able to solve the problem nicely!
I calculate how many standard deviations the observed commenting level is from the expected commenting level. In code:
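One way to sketch this calculation is to assume a binomial model: each user in the group comments on a given sub independently, with the baseline probability. That model, the function name `z_score`, and every count below are assumptions for illustration, not necessarily the exact calculation used for the results in this post.

```python
import math

def z_score(observed, n_target, p_baseline):
    """Standard deviations between observed and expected commenter counts.

    Assumes each of the `n_target` users independently comments on the sub
    with probability `p_baseline` (a binomial model, which is an assumption).
    """
    expected = n_target * p_baseline
    std = math.sqrt(n_target * p_baseline * (1.0 - p_baseline))
    if std == 0.0:
        return 0.0  # no variation under the baseline, so nothing to score
    return (observed - expected) / std

# Hypothetical totals: 1,000 /r/programming commenters out of 100,000 users.
N_TARGET, N_USERS = 1_000, 100_000

# Tiny sub: 2 commenters overall, 1 in the target group.
z_tiny = z_score(1, N_TARGET, 2 / N_USERS)
# Related sub: 2,000 commenters overall, 200 in the target group.
z_related = z_score(200, N_TARGET, 2_000 / N_USERS)
# Sub everyone comments on: 50,000 overall, 520 in the target group.
z_popular = z_score(520, N_TARGET, 50_000 / N_USERS)
```

In this toy example the related sub scores far above both the tiny sub and the sub that everyone comments on. For very rare subs the binomial counts are far from bell-shaped, so the scores are best read as a ranking heuristic rather than exact probabilities.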
This works perfectly. The top 5 results are:
- politics: technology, worldnews, news, science, atheism
- funny: pics, AdviceAnimals, gifs, todayilearned, WTF
- books: television, explainlikeimfive, booksuggestions, news, literature
This sort of relevancy scoring can be great when you want simple recommendations based on single items. It helps work around the problem of popularity, where a recommendation tends to simply be what’s popular with everyone, and zooms in on the things that are particular to a group. At the same time, it isn’t overwhelmed by noise.