Wikipedia Uncovered: What Analytics Told Us About One of the World's Most Popular Sites
Did you know that bots are responsible for most activity on Wikipedia? And what do you think is the most popular type of Wiki article? Peek into the data to learn more.
To make the Interana demo more useful to our prospective users, we stocked it with more than a year's worth of data from one of the Internet's most popular websites: Wikipedia.
And that got us thinking. What can we learn about Wikipedia from this data? How many insights can we uncover about one of the world's most popular websites? And what might be most useful to know from Wikipedia's perspective?
So we took our own demo for a spin to see what we could learn about Wikipedia from its publicly available recent changes data. We found some pretty interesting things from spending just a little time with the Interana demo. So without further ado, let's get into our findings.
Bots Are Responsible for Most Activity on Wikipedia
One of the first things we did was jump straight into user activity. We used "Explorer" to look at who the most active users on Wikipedia were in the last two weeks. To do this, we counted the total number of events in that time frame and grouped them by users.
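The count-and-group step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Explorer works internally: the `sample_events` list, the `user` field name, and the `most_active_users` helper are all hypothetical, and the events are assumed to be already limited to the two-week window.

```python
from collections import Counter

# Hypothetical recent-changes events, already filtered to the time window.
# The "user" field name is an assumption, not the real schema.
sample_events = [
    {"user": "Emijrpbot"},
    {"user": "Emijrpbot"},
    {"user": "Emijrpbot"},
    {"user": "Panoramio upload bot"},
    {"user": "Panoramio upload bot"},
    {"user": "ExampleHuman"},
]


def most_active_users(events, top_n=10):
    """Count events per user and return the top_n (user, count) pairs."""
    counts = Counter(event["user"] for event in events)
    return counts.most_common(top_n)


if __name__ == "__main__":
    for user, total in most_active_users(sample_events, top_n=3):
        print(f"{user}: {total}")
```

Sorting by raw event count is what surfaces the bots so quickly: a handful of accounts dominate the totals.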
Our comparison showed that the most active editors on Wikipedia are bots, with Emijrpbot blazing through the online encyclopedia with almost two million events in the first two weeks of June, far too many for any human to complete.
For an app or a website that isn't being scrupulously edited by bots, user activity can tell you a lot about how your most active users engage. For Wikipedia, it reveals something that might surprise laypeople: the website has outsourced tedious tasks to bots.
Take the second most active user on this list, Panoramio upload bot. This bot automatically uploads freely licensed images to Wikimedia Commons from Panoramio, a photo-sharing site similar to Flickr.
This was a task that used to be done by human editors but was simply too large and too tedious a job for human volunteers to make any headway.
Further exploration reveals that Wikipedia relies on a network of bots to do everything from checking that all code has appropriate closing brackets to detecting and reversing vandalism faster than humans ever could. Although the split is about 50/50 between human and bot editors, it's clear from our data that a few bots are doing bulk edits and, presumably, many human editors are making small changes.
So, from this basic exploration, we learned something interesting about the way that Wikipedia is operated. And that value goes beyond just a couple of fun facts — if we were to explore more user data from actual human editors, we would know to exclude bots from our considerations, possibly by putting a limit on the number of edits made by a user within a given period.
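One rough way to apply that kind of limit is sketched below. The threshold, the window length, the field names (`user`, `timestamp`), and the `likely_human_events` helper are assumptions made for illustration. A cutoff like this is only a heuristic: it would also drop unusually prolific humans and keep low-volume bots.

```python
from collections import Counter
from datetime import datetime, timedelta


def likely_human_events(events, window_start, window_days=14, max_events=1000):
    """Keep events in the window from users at or below max_events.

    The threshold is an assumed heuristic for separating bulk-editing
    bots from human editors, not a rule taken from Wikipedia.
    """
    window_end = window_start + timedelta(days=window_days)
    in_window = [e for e in events if window_start <= e["timestamp"] < window_end]
    per_user = Counter(e["user"] for e in in_window)
    return [e for e in in_window if per_user[e["user"]] <= max_events]


# Hypothetical sample: one bulk editor, one occasional editor.
start = datetime(2017, 6, 1)
sample_events = (
    [{"user": "BulkBot", "timestamp": start + timedelta(minutes=i)} for i in range(5)]
    + [{"user": "ExampleHuman", "timestamp": start + timedelta(days=d)} for d in (1, 3)]
    + [{"user": "ExampleHuman", "timestamp": start + timedelta(days=30)}]  # outside window
)

if __name__ == "__main__":
    kept = likely_human_events(sample_events, start, max_events=3)
    print(sorted({e["user"] for e in kept}))
```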
Abused Articles Take Time to Be Deleted
As everyone on the internet knows, trolls and misinformation are everywhere. And Wikipedia is certainly no exception. We went into the "Funnels" tool to look at how Wikipedia was dealing with deletion of articles. We split up the events in the life cycle of an article as follows:
- New article (article created)
- New edit #1
- New edit #2
- New edit #3
- Article hits abuse filter
- Article is deleted
This gave us the following chart, which shows the drop-off between each step (bars in grey) as well as the time it takes between each step (bars in blue).
We found that it takes articles a median time of 8 hours and 4 minutes to go from their last edit to hitting the abuse filter. It then takes a median of 2 hours and 37 minutes to enter deletion.
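For readers who want to see the mechanics, here is a minimal sketch of a funnel over the six steps listed above. It assumes a strict step order, ignores unrelated events in between, and uses hypothetical event names (`new`, `edit`, `abuse_filter`, `delete`) and sample timestamps; it is not a description of how the Funnels tool computes its numbers.

```python
from datetime import datetime, timedelta
from statistics import median

# Assumed event names for the six funnel steps.
FUNNEL_STEPS = ["new", "edit", "edit", "edit", "abuse_filter", "delete"]


def run_funnel(events_by_article, steps=FUNNEL_STEPS):
    """Return (reached, median_gaps) for an ordered funnel.

    reached[i] is the number of articles that completed step i.
    median_gaps[i] is the median time between step i and step i + 1,
    or None if no article made that transition.
    """
    reached = [0] * len(steps)
    gaps = [[] for _ in steps[1:]]
    for events in events_by_article.values():
        step = 0
        previous_time = None
        for timestamp, kind in sorted(events):
            if step < len(steps) and kind == steps[step]:
                reached[step] += 1
                if previous_time is not None:
                    gaps[step - 1].append((timestamp - previous_time).total_seconds())
                previous_time = timestamp
                step += 1
    median_gaps = [timedelta(seconds=median(g)) if g else None for g in gaps]
    return reached, median_gaps


# Hypothetical articles: A completes the funnel, B stops at the abuse
# filter, C stops after its first edit.
t0 = datetime(2017, 6, 1)
h = lambda hours: t0 + timedelta(hours=hours)
sample = {
    "A": [(h(0), "new"), (h(1), "edit"), (h(2), "edit"), (h(3), "edit"),
          (h(11), "abuse_filter"), (h(13), "delete")],
    "B": [(h(0), "new"), (h(1), "edit"), (h(2), "edit"), (h(3), "edit"),
          (h(7), "abuse_filter")],
    "C": [(h(0), "new"), (h(1), "edit")],
}

if __name__ == "__main__":
    reached, median_gaps = run_funnel(sample)
    print(reached)
    print(median_gaps)
```

The `reached` list corresponds to the drop-off bars and `median_gaps` to the time-between-steps bars in a chart like the one described.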
A quick poke around Wikipedia's deletion process shows that there are processes in place for tagging and discussing articles that may need to be deleted, which can be speedy or can take some time. But there are also systems in place that automatically tag article abuse, like flagging possible link spam.
In these cases, offensive articles may be flagged and deleted much more swiftly. Overall, we found that Wikipedia has extremely robust documentation on what constitutes article abuse and which categories of article (e.g., nonsense) can be deleted without much discussion. This probably also contributes to the swiftness of finding and deleting articles.
New Accounts Create; Older Accounts Maintain
Wikipedia is famously created and maintained by volunteer users from all over the world. So it would make sense that those users would create a lot of interesting data to go through. But one unexpected difference cropped up when we looked at new and old users: newer accounts pushed creation, while older accounts spent much more time editing existing articles.
The two pie graphs above compare the behaviors of two cohorts of Wikipedia users:
- User accounts older than 28 days
- User accounts younger than 28 days
The difference between the two cohorts is striking. New users (those on the left) are disproportionately likely to create new Wikipedia articles and to hit the abuse filter during their first 28 days. Users whose accounts are older than 28 days, on the other hand, are far more likely to upload materials and to overwrite, move, and delete content.
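A cohort split like this can be approximated by comparing each event's timestamp with the account's creation date and then computing the share of each action type per cohort. The sketch below assumes hypothetical field names (`timestamp`, `account_created`, `type`) and places accounts that are exactly 28 days old in the older cohort, which is an arbitrary choice.

```python
from collections import Counter
from datetime import datetime, timedelta

COHORT_CUTOFF = timedelta(days=28)


def action_shares_by_cohort(events):
    """Return {"new": {...}, "old": {...}} mapping action type to its share."""
    counts = {"new": Counter(), "old": Counter()}
    for event in events:
        account_age = event["timestamp"] - event["account_created"]
        cohort = "new" if account_age < COHORT_CUTOFF else "old"
        counts[cohort][event["type"]] += 1
    shares = {}
    for cohort, counter in counts.items():
        total = sum(counter.values())
        shares[cohort] = {kind: n / total for kind, n in counter.items()} if total else {}
    return shares


# Hypothetical events: one 5-day-old account and one 200-day-old account.
now = datetime(2017, 6, 15)
young, old = now - timedelta(days=5), now - timedelta(days=200)
sample_events = (
    [{"timestamp": now, "account_created": young, "type": "new"}] * 3
    + [{"timestamp": now, "account_created": young, "type": "abuse_filter"}]
    + [{"timestamp": now, "account_created": old, "type": "move"}] * 2
    + [{"timestamp": now, "account_created": old, "type": "upload"}]
    + [{"timestamp": now, "account_created": old, "type": "delete"}]
)

if __name__ == "__main__":
    for cohort, shares in action_shares_by_cohort(sample_events).items():
        print(cohort, shares)
```

Each inner dictionary holds the proportions one pie chart would display for that cohort.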
If we were to talk to Wikipedia users, we might want to ask why this is. Are older users editing the articles they once wrote? Are they given more responsibility by the community, like admin access, that requires them to do more maintenance? Do new users feel intimidated by uploading content and tweaking existing material and instead prefer to write their own content? How many of these new user accounts are used to create spam articles or stubs?
As Wikipedia outsiders, we don't have the answers to these questions, but they would be a great jumping-off point for further exploration of this data.
Politics and Religion Are the Most Talked-About Articles
Over the course of the last month, the articles with the most activity on Talk pages have been those relating to politics or religion — a conclusion consistent with common sense. After all, the most contentious topics in real life are only magnified online.
Unsurprisingly, newsworthy topics made their way to the top of the most event-heavy articles within the last two weeks.
The most talked about articles at the time of writing, in order, are:
- Russian interference in the 2016 United States elections
- Breitbart News
- Jewish diaspora
- Jacob Barnett
Most of these make sense — Russian interference with the election, for example, is something you couldn't seem to go a day without hearing about a little while ago.
But the last result is a curious outlier. Jacob Barnett, who has had his Wikipedia page deleted (in a 21-7 vote of Wikipedia editors), is aligned with neither politics nor religion. He's an 18-year-old student studying physics at the Perimeter Institute in Waterloo, Canada.
Anomaly aside, is it safe to assume that politics and religion are always the most active articles?
We can use the time bar beneath our results and scrub back to encompass more than a year of data, producing similar results:
So yes, it seems that people are just as drawn to politics and religion on Wikipedia as they are in real life. That makes sense, if for no other reason than that current events like fake news and Russian collusion would have articles created and maintained in real time.
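The ranking in this section boils down to counting events on Talk pages and grouping them by article. A minimal sketch follows, assuming events carry a `title` field and that Talk pages are identified by the `Talk:` title prefix; the sample data and the `most_discussed_articles` helper are hypothetical, and the events are assumed to be already limited to the time range of interest.

```python
from collections import Counter

TALK_PREFIX = "Talk:"


def most_discussed_articles(events, top_n=5):
    """Rank articles by the number of events on their Talk pages."""
    counts = Counter(
        event["title"][len(TALK_PREFIX):]
        for event in events
        if event["title"].startswith(TALK_PREFIX)
    )
    return counts.most_common(top_n)


# Hypothetical events; only titles starting with "Talk:" are counted.
sample_events = [
    {"title": "Talk:Breitbart News"},
    {"title": "Talk:Breitbart News"},
    {"title": "Talk:Breitbart News"},
    {"title": "Talk:Jewish diaspora"},
    {"title": "Talk:Jewish diaspora"},
    {"title": "Breitbart News"},  # article edit, not a Talk page event
]

if __name__ == "__main__":
    for title, total in most_discussed_articles(sample_events):
        print(f"{title}: {total}")
```

Widening the time range, as with the time bar, just means passing in a larger set of events.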
Wikipedia Is Only the Start
This is just the tip of the iceberg of what you can discover about Wikipedia with the data in our demo. A little time to fool around with the data and a quick Google search or two can uncover a whole bunch of new information. It's a peek into the way that data can help you shape and ask questions about your products and your users.
If you want to play around with our demo, it's free and our Wikipedia data is up to date. You can mimic what we've done here, or branch out on your own to find new theories about users, articles, bots, you name it.
If you've already played around with the demo, let us know what you were surprised to find. And if there are any hardcore Wikipedia editors out there, what do you think? Did our findings match your experience? We can't wait to find out.
Published at DZone with permission of Archana Madhavan, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.