Detecting Reddit Voting Rings Using This Weird Little Data Trick
A Wolf in Sheep’s Clothing
Wow, so apparently the Reddit community has outed (and as of 15 minutes ago, ousted) colleague Doug Turnbull as the vile spammer that he is! Apparently he has been shamelessly promoting his own blog posts! And among his tactics, the use of voting rings. Yes, and even some of my own colleagues succumbed to his pressure to upvote his sales pitches – including this one masked as a post about test driven search relevancy. (Though I must admit… it is an interesting read… and Doug is an eloquent writer)
But not I. No, I’m as white as the wind driven snow, as pure as a mountain stream, as innocent as a newborn babe. And as a matter of fact, this has all got me thinking: How can I help prevent things like this from happening in the future? In particular, how can I help Reddit prevent voting rings from forming in the future? And then it occurred to me: the HyperLogLog counter.
What is the HyperLogLog Counter?
The HyperLogLog counter is a really neat probabilistic data structure that estimates the count of unique elements in a list. For instance, let’s say you have a really popular web page and you want to perfectly count the number of unique visitors. To keep a count of uniques, you have to store every IP address that you ever see. And upon receiving a new IP address, you have to first check that the new IP address has not been run across before, and only then do you increment the site counter. Under the best of situations, the storage and the computation probably scale as O(log(n)). (What would you use? Tries? Skip-lists?)
But How Can the HyperLogLog Counter Help Beat Voting Rings?
Let’s turn back to the sad example of disgraced colleague Doug Turnbull. Usually when Doug coerces others to upvote his posts, he uses forms of extortion or even threats of physical harm. Naturally I’ve never succumbed to this, but those that do go to Doug’s most recent post (many of which can be found here) and upvote.
But consider what would happen if we created a HyperLogLog counter
for every user on Reddit, and any time that a user receives an upvote,
we update the corresponding HyperLogLog counter with the id of the user
who did the upvoting. Given this setup, here’s how we detect the voting
ring. First, for a given user, retrieve the estimated count of unique
upvotes from that user’s HyperLogLog counter, and let’s label this
estimated_unique_upvotes. Next, tally up the total number of actual upvotes …
across all of that user’s posts. Let’s call this quantity as
global_total_upvotes. The probability that a given user is involved in a voting ring will correlate highly with the ratio of these numbers. In other words, we can define:
voting_honesty = estimated_unique_upvotes / global_total_upvotes
Think about it. If a user is not in a voting ring, then
their upvotes are going to be organically generated from anyone one the
world-wide-web that thinks their links are interesting. Because the
internet is a big place, this user’s
global_total_upvotes will tend toward the same number and their
“probability” will tend toward 1. On the other hand, a user wrapped up
in the seedy underworld of Reddit voting rings will have many more
global_total_upvotes across all posts than they have
estimated_unique_upvotes. For this user, the
voting_honesty will tend toward 0.
Now we’re talking 1′s and 0′s. Aren’t computers supposed to be good
with those? Yes! So, the method I’ve outlined above should be a scalable
way to automatically detect and expel those voting charlatans! Are there
issues with this strategy? Sure. But I do think that the
global_total_upvotes ratio is interesting. Help me think; where could this be used outside of Reddit voting rings?
So, what now? Well, it’s obvious, right? Upvote this article!
Update Idea: Throwing Away 90% of Reddit’s Infrastructure and Using HLL Approximations Instead.
Hey, I got another idea! I wonder what reddit’s infrastructure looks like? It’s gotta be hell to scale everything up. The most obviously complex thing is tracking all those up and down votes. Furthermore, the most important function of the infrastructure is in making sure that no one votes for a post more than once – otherwise colleague, Doug Turnbull, would cramp to death setting at the computer and click-upvoting his own posts! Since it’s so hard to scale, why not just rip all that stuff out and replace it with HLL approximations instead?
Here’s how: Each post gets 2 HLL counters, one positive votes and one negative. When a user upvotes or downvotes a post, their id is submitted to the corresponding HLL counter. There is never a chance that a vote can be counted twice, and there’s no need for any of the infrastructure that ensures that no one votes twice.
But there are a couple of obvious drawbacks to this approach. One – the vote is now just an estimate – a guess, and while on average this would work out fine, there would certainly be crappy posts getting promoted and great posts getting penalized. Secondly, a side effect of all this click tracking infrastructure is that you lose the ability to trace your own history! And if you again built infrastructure so that you could see your history, you might as well return back to the way Reddit currently does it. But at the very least this idea is worth remembering when building and scaling a website that has, but doesn’t feature, votes.