Over a million developers have joined DZone.

Detecting Reddit Voting Rings Using This Weird Little Data Trick

DZone's Guide to

Detecting Reddit Voting Rings Using This Weird Little Data Trick

· Big Data Zone
Free Resource

See how the beta release of Kubernetes on DC/OS 1.10 delivers the most robust platform for building & operating data-intensive, containerized apps. Register now for tech preview.

A Wolf in Sheep’s Clothing

Wow, so apparently the Reddit community has outed (and as of 15 minutes ago, ousted) colleague Doug Turnbull as the vile spammer that he is! Apparently he has been shamelessly promoting his own blog posts! And among his tactics, the use of voting rings. Yes, and even some of my own colleagues succumbed to his pressure to upvote his sales pitches – including this one masked as a post about test driven search relevancy. (Though I must admit… it is an interesting read… and Doug is an eloquent writer)

But not I. No, I’m as white as the wind driven snow, as pure as a mountain stream, as innocent as a newborn babe. And as a matter of fact, this has all got me thinking: How can I help prevent things like this from happening in the future? In particular, how can I help Reddit prevent voting rings from forming in the future? And then it occurred to me: the HyperLogLog counter.

What is the HyperLogLog Counter?

The HyperLogLog counter is a really neat probabilistic data structure that estimates the count of unique elements in a list. For instance, let’s say you have a really popular web page and you want to perfectly count the number of unique visitors. To keep a count of uniques, you have to store every IP address that you ever see. And upon receiving a new IP address, you have to first check that the new IP address has not been run across before, and only then do you increment the site counter. Under the best of situations, the storage and the computation probably scale as O(log(n)). (What would you use? Tries? Skip-lists?)

With the HyperLogLog counter it’s all different. The size of the data structure is determined a priori (thus O(1)) and is a miniscule fraction of whatever data structure you would use from in the example of the previous paragraph. Data insertion and count tallying are super fast and also scales as O(1). But here’s the catch: You can’t actually know the exact count of unique views. Rather, you may only have an estimate of the current count. I encourage you to read more. This article has been the best resource I’ve found. And make sure you try their cool JavaScript demo, which really helps to understand how it all works.

But How Can the HyperLogLog Counter Help Beat Voting Rings?

Let’s turn back to the sad example of disgraced colleague Doug Turnbull. Usually when Doug coerces others to upvote his posts, he uses forms of extortion or even threats of physical harm. Naturally I’ve never succumbed to this, but those that do go to Doug’s most recent post (many of which can be found here) and upvote.

But consider what would happen if we created a HyperLogLog counter for every user on Reddit, and any time that a user receives an upvote, we update the corresponding HyperLogLog counter with the id of the user who did the upvoting. Given this setup, here’s how we detect the voting ring. First, for a given user, retrieve the estimated count of unique upvotes from that user’s HyperLogLog counter, and let’s label this quantity as estimated_unique_upvotes. Next, tally up the total number of actual upvotes …

these things… Screen Shot 2013-10-15 at 12.23.05 AM

across all of that user’s posts. Let’s call this quantity as global_total_upvotes. The probability that a given user is involved in a voting ring will correlate highly with the ratio of these numbers. In other words, we can define:

voting_honesty  =  estimated_unique_upvotes / global_total_upvotes

Think about it. If a user is not in a voting ring, then their upvotes are going to be organically generated from anyone one the world-wide-web that thinks their links are interesting. Because the internet is a big place, this user’s estimated_unique_upvotes and global_total_upvotes will tend toward the same number and their voting_honesty “probability” will tend toward 1. On the other hand, a user wrapped up in the seedy underworld of Reddit voting rings will have many more global_total_upvotes across all posts than they have estimated_unique_upvotes. For this user, the voting_honesty will tend toward 0.


Now we’re talking 1′s and 0′s. Aren’t computers supposed to be good with those? Yes! So, the method I’ve outlined above should be a scalable way to automatically detect and expel those voting charlatans! Are there issues with this strategy? Sure. But I do think that the estimated_unique_upvotes to global_total_upvotes ratio is interesting. Help me think; where could this be used outside of Reddit voting rings?

So, what now? Well, it’s obvious, right? Upvote this article!

Update Idea: Throwing Away 90% of Reddit’s Infrastructure and Using HLL Approximations Instead.

Hey, I got another idea! I wonder what reddit’s infrastructure looks like? It’s gotta be hell to scale everything up. The most obviously complex thing is tracking all those up and down votes. Furthermore, the most important function of the infrastructure is in making sure that no one votes for a post more than once – otherwise colleague, Doug Turnbull, would cramp to death setting at the computer and click-upvoting his own posts! Since it’s so hard to scale, why not just rip all that stuff out and replace it with HLL approximations instead?

Here’s how: Each post gets 2 HLL counters, one positive votes and one negative. When a user upvotes or downvotes a post, their id is submitted to the corresponding HLL counter. There is never a chance that a vote can be counted twice, and there’s no need for any of the infrastructure that ensures that no one votes twice.

But there are a couple of obvious drawbacks to this approach. One – the vote is now just an estimate – a guess, and while on average this would work out fine, there would certainly be crappy posts getting promoted and great posts getting penalized. Secondly, a side effect of all this click tracking infrastructure is that you lose the ability to trace your own history! And if you again built infrastructure so that you could see your history, you might as well return back to the way Reddit currently does it. But at the very least this idea is worth remembering when building and scaling a website that has, but doesn’t feature, votes.

New Mesosphere DC/OS 1.10: Production-proven reliability, security & scalability for fast-data, modern apps. Register now for a live demo.


Published at DZone with permission of John Berryman, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}