Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Detecting Reddit Voting Rings Using This Weird Little Data Trick

DZone 's Guide to

Detecting Reddit Voting Rings Using This Weird Little Data Trick

· Big Data Zone ·
Free Resource

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

A Wolf in Sheep’s Clothing

Wow, so apparently the Reddit community has outed (and as of 15 minutes ago, ousted) colleague Doug Turnbull as the vile spammer that he is! Apparently he has been shamelessly promoting his own blog posts! And among his tactics, the use of voting rings. Yes, and even some of my own colleagues succumbed to his pressure to upvote his sales pitches – including this one masked as a post about test driven search relevancy. (Though I must admit… it is an interesting read… and Doug is an eloquent writer)

But not I. No, I’m as white as the wind driven snow, as pure as a mountain stream, as innocent as a newborn babe. And as a matter of fact, this has all got me thinking: How can I help prevent things like this from happening in the future? In particular, how can I help Reddit prevent voting rings from forming in the future? And then it occurred to me: the HyperLogLog counter.

What is the HyperLogLog Counter?

The HyperLogLog counter is a really neat probabilistic data structure that estimates the count of unique elements in a list. For instance, let’s say you have a really popular web page and you want to perfectly count the number of unique visitors. To keep a count of uniques, you have to store every IP address that you ever see. And upon receiving a new IP address, you have to first check that the new IP address has not been run across before, and only then do you increment the site counter. Under the best of situations, the storage and the computation probably scale as O(log(n)). (What would you use? Tries? Skip-lists?)

With the HyperLogLog counter it’s all different. The size of the data structure is determined a priori (thus O(1)) and is a miniscule fraction of whatever data structure you would use from in the example of the previous paragraph. Data insertion and count tallying are super fast and also scales as O(1). But here’s the catch: You can’t actually know the exact count of unique views. Rather, you may only have an estimate of the current count. I encourage you to read more. This article has been the best resource I’ve found. And make sure you try their cool JavaScript demo, which really helps to understand how it all works.

But How Can the HyperLogLog Counter Help Beat Voting Rings?

Let’s turn back to the sad example of disgraced colleague Doug Turnbull. Usually when Doug coerces others to upvote his posts, he uses forms of extortion or even threats of physical harm. Naturally I’ve never succumbed to this, but those that do go to Doug’s most recent post (many of which can be found here) and upvote.

But consider what would happen if we created a HyperLogLog counter for every user on Reddit, and any time that a user receives an upvote, we update the corresponding HyperLogLog counter with the id of the user who did the upvoting. Given this setup, here’s how we detect the voting ring. First, for a given user, retrieve the estimated count of unique upvotes from that user’s HyperLogLog counter, and let’s label this quantity as `estimated_unique_upvotes`. Next, tally up the total number of actual upvotes …

these things…

across all of that user’s posts. Let’s call this quantity as ` global_total_upvotes`. The probability that a given user is involved in a voting ring will correlate highly with the ratio of these numbers. In other words, we can define:

`voting_honesty  =  estimated_unique_upvotes / global_total_upvotes`

Think about it. If a user is not in a voting ring, then their upvotes are going to be organically generated from anyone one the world-wide-web that thinks their links are interesting. Because the internet is a big place, this user’s `estimated_unique_upvotes` and `global_total_upvotes` will tend toward the same number and their `voting_honesty` “probability” will tend toward 1. On the other hand, a user wrapped up in the seedy underworld of Reddit voting rings will have many more `global_total_upvotes` across all posts than they have `estimated_unique_upvotes`. For this user, the `voting_honesty` will tend toward 0.

Conclusion

Now we’re talking 1′s and 0′s. Aren’t computers supposed to be good with those? Yes! So, the method I’ve outlined above should be a scalable way to automatically detect and expel those voting charlatans! Are there issues with this strategy? Sure. But I do think that the `estimated_unique_upvotes` to `global_total_upvotes` ratio is interesting. Help me think; where could this be used outside of Reddit voting rings?

Update Idea: Throwing Away 90% of Reddit’s Infrastructure and Using HLL Approximations Instead.

Hey, I got another idea! I wonder what reddit’s infrastructure looks like? It’s gotta be hell to scale everything up. The most obviously complex thing is tracking all those up and down votes. Furthermore, the most important function of the infrastructure is in making sure that no one votes for a post more than once – otherwise colleague, Doug Turnbull, would cramp to death setting at the computer and click-upvoting his own posts! Since it’s so hard to scale, why not just rip all that stuff out and replace it with HLL approximations instead?

Here’s how: Each post gets 2 HLL counters, one positive votes and one negative. When a user upvotes or downvotes a post, their id is submitted to the corresponding HLL counter. There is never a chance that a vote can be counted twice, and there’s no need for any of the infrastructure that ensures that no one votes twice.

But there are a couple of obvious drawbacks to this approach. One – the vote is now just an estimate – a guess, and while on average this would work out fine, there would certainly be crappy posts getting promoted and great posts getting penalized. Secondly, a side effect of all this click tracking infrastructure is that you lose the ability to trace your own history! And if you again built infrastructure so that you could see your history, you might as well return back to the way Reddit currently does it. But at the very least this idea is worth remembering when building and scaling a website that has, but doesn’t feature, votes.

Topics:

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.