Analyzing Stack Overflow Data Directly With PowerBI
Analyzing Stack Overflow Data Directly With PowerBI
It's easy (and interesting) to explore Stack Overflow data. They make their databases public. You don't need to download them; you can query the databases directly.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Last week, Stack Overflow acknowledged their culture issues with this post. I'm glad to see them talking about these issues publicly, and they are actively looking to make things better. Admitting you have a problem is a good first step.
That post reminded me that I've been meaning to explore some Stack Overflow data. They make their databases public. You don't need to download them; you can query the databases directly. Just head over to here and get started. In the upper right, you will see an icon that is also a drop-down menu:
I'm going to switch sites to the DBA Stack Exchange because that's the one I've used the most and I thought it might be interesting to find myself in the results.
Stack Overflow Users
First, let's look at the users. I want to know just how many people are using the forum. A simple count will tell me how many users are in the system:
And the result:
That's a good number of users. The site is over seven years old, so now I'm wondering about the rate at which new users have been signing up. We can get that info — here's one way:
And the result:
It seems that 2017 had a dip in new users to the site, but 2017 was still the 2nd highest year. This dip in new users may have been something that Stack Overflow has noticed across all their websites. If fewer people are signing up or being engaged, then it's time to talk about the reasons why people are staying away. A quick query against the main Stack Overflow database shows me that the number of users is increasing. So, the decline isn't being seen by every site. Or maybe they are attracting more of the brogrammers that are contributing to the negative culture. I don't have enough data at this point to say one way or another.
OK, enough about the users — let's look at the posts.
Stack Overflow Posts
Next, we will look at the Posts tables. This is the table that contains the actual questions and answers.
A simple count tells us that there are 154,806 posts. With only 122k users, that means a majority of users must be posting only once. Let's look at the posts over time, same as we did for the users:
Wait, we have posts from 2008, but users only go back to 2011? Yep, because questions can be migrated from one Stack Overflow site to another, it's possible for us to have questions with dates from before the DBA Stack Exchange site even existed.
(Welcome to the wonderful world of data analytics, where you spend 95% of your time as a Data Janitor, helping to clean up what needs cleaning and explaining weird stains on the carpet to visitors.)
So, if questions have been migrated, we don't really know how many of the posts are from DBA SE users or not. Of course, it's possible that a user of DBA SE had a question migrated from another site. But, hey, it's close enough for me. Because Data Janitor.
Stack Overflow Posts From Unique Users
Next, we look at the number of posts grouped by the OwnerUserId:
I did the ORDER BY here so we could see the number of posts without an OwnerUserId. These are the posts migrated from other sites:
So, 2692 questions have been migrated. That's a small fraction of the 154k total number. But what I found more interesting here was the total number of rows returned:
We have 122k users, but only 40k of them have written a post. What are the other 82k people doing?
Well, it's possible that the other 82k have written a post and deleted it. Or, they haven't written a post and have only left a comment. I'll leave that analysis as an exercise for the reader. I have other questions I want to answer before I worry about those.
Here is a question I've thought about: How many users of DBA SE sign up, ask one question, and leave? We can filter the previous query:
And the resulting rowcount:
We have 26,801 users that have one post (or fewer). This means that we have just under 14k users asking two questions or more.
Suddenly, this website seems a whole lot smaller. I know that the Stack Overflow websites get a lot of views, there's no question. But there are millions of database professionals. The answers at DBA SE are coming from a very small subset of database professionals.
Stack Overflow Posts With Quick Answers
If you are like me, you've noticed questions posted at DBA SE by someone who is a new user, and there is a detailed answer in a matter of minutes. I've often wondered about this. I'm certain two users could collaborate the timing of the question and the answer. There are no rules against doing so.
I get a lot of questions emailed to me weekly from people all over the world. It would not be difficult to search DBA SE to see if the question is listed. If not, I could ask the person to ask the question and email me when it is posted, and I could then quickly post my answer for them to mark as correct.
Let's see what the data has to say about this. First, how many users have been created and asked questions within the first 15 minutes:
There's nothing odd about signing up and asking a question — that's the whole point of the site. The number of users returned is 25,264. How many of these quick-questions users do we have over time:
It seems we hit a peak in 2015 and have had a decline in the number of users being created and asking a question within 15 minutes.
Now, I want to know how many of these posts had a reply within 15 minutes. I'm going to create a CTE to get this done and join back to the Posts table. I want to find all posts with a ParentId that is a post asked within 15 minutes of being created:
That query will return 25,602 rows, but the result set contains all replies, not just answers. I'm curious to know how many questions are marked answered within 15 minutes. So, let's make an adjustment. We will focus on the posts marked as an answer, and group by user ID:
And the results:
That's a lot of questions to have marked as an answer within 15 minutes of being posted. Of course, this is over seven years. We could break it down a bit further if desired, and group into years to get a result that looks like this:
Looks to me like this user was simply active (very active) for a handful of years and has since tapered off.
Stack Overflow Length of Answers
Asking a question and getting a quick reply seems to be common enough for the majority of users. However, there is one place where we might find an anomaly. If a person was to produce an answer of considerable length in a short amount of time, that might indicate that they were trying to game the system for points. The length of the posts is stored in characters with NVARCHAR(MAX). And the bulk of those characters are the result of code snippets, which are often cut and pasted. So, length by itself isn't an indication of something wrong. But the number of words might give a hint. This would be true if we were to find a pattern for one user providing many long-form answers to question posted just a few minutes prior.
So, let's look for questions that have an accepted answer within five minutes and examine the length of the replies to see if there is anything unusual. I can't view the whole text in the results window, but there is an option to download to CSV. So, I will download the results and use Excel to view the longest answer given within five minutes:
I could have written some T-SQL to parse our words by looking for spaces, and skipping over <code> blocks. But I used Excel here because (1) it's easier, and (2) I wanted you to be aware that the website allows for you to export to CSV.
The second result is an answer with some code and more than 2,300 words. That's a lot of typing in less than five minutes. I believe this answer was prepared in advance.
Not that there's anything wrong with that. I was just curious to know if such things were happening. To me, this shows that the user(s) involved are more interested in points, and strutting their knowledge, than in helping others.
That's their right. I'm not complaining, just observing what I see in the data.
Importing Stack Overflow Data in PowerBI
I've spent this whole post showing you some queries that you can run against the Stack Overflow websites for yourself. There's no need for any additional tools. If you want to do additional analysis, you will need to export the data. The issue you will find is that the database query interface provided by Stack Overflow limits you to only 50,000 rows of results.
I started by downloading this file. When I open it up it contains XML files, one for each table:
I could try to import those into a database, but I won't bother. My end goal is to get insights into the data. So, I'm going to load these files into PowerBI, because PowerBI desktop lets me import XML files.
Just go to Get Data > Other, and you'll find the XML option:
The downside to this is that I must load the files one at a time. The upside is that PowerBI will detect relationships for me.
Once the data is loaded, I'm only a few clicks away from generating some graphs that provide some visualization to the results we've discovered above. For example, let's look at user creation over time. Here, I can see the number of new users each year along with a running total of the overall number of users:
I can use PowerBI to analyze the entire dataset, looking for relationships that I didn't know existed. Here's another example — I can quickly see if there is a relationship between the length of the text and the score of a post:
PowerBI makes exploring the Stack Overflow data easy. Use what works best for you. That's what I did for this post. I started it all by using PowerBI to explore the data. I recognize that not everyone uses PowerBI and I wanted to make sure you could get started on doing analysis with whatever tool you want. So I spent the first part here showing you how to query the database directly. I'm hoping to do some additional analysis over time on this data using PowerBI and share with you what interesting things I may find.
Stack Overflow has publicly discussed that their websites may not have been the most welcoming. They make public their data on users and activity. I wanted to analyze the data to see if I could find examples of behavior that may be more focused on earning points than helping others. I think this gamification across the Stack Overflow sites is part of the issue they have currently. With gamification comes competition. And with competition, we get a culture that is not as welcoming as what Stack Overflow is stating they want to achieve.
If the gamification in Stack Overflow exists, there will always be people that want to earn as many points as possible. There's nothing wrong with that. It's what they need. It's how they measure their self-worth. And for some, it is an opportunity to get some free marketing by posting links to products and blogs. But if the culture created by these few users are keeping others away, then we are missing out on better answers to our questions.
Future analysis should include reviewing the comments, perhaps some sentiment analysis. I will try to think of ways the Stack Overflow data may provide some insight into the user activities that are not welcoming to others.
Queries Used in This Post
--Total number of users SELECT COUNT(*) FROM Users --Total number of users, grouped by year SELECT COUNT(Id) as [Total], DATEPART(yy,CreationDate) as [Year] FROM Users GROUP BY DATEPART(yy,CreationDate) --Total number of posts SELECT COUNT(*) FROM Posts --Total number of posts, grouped by year SELECT COUNT(Id) as [Total], DATEPART(yy,CreationDate) as [Year] FROM Posts GROUP BY DATEPART(yy,CreationDate) --Total number of posts, grouped by userid SELECT COUNT(p.Id), p.OwnerUserId FROM Posts p GROUP BY p.OwnerUserId ORDER BY p.OwnerUserId --Total number of users with more than one post SELECT COUNT(p.Id), p.OwnerUserId FROM Posts p GROUP BY p.OwnerUserId HAVING COUNT(p.Id) <= 1 --user created and asks question in 15 minutes SELECT COUNT(*) FROM Users u INNER JOIN Posts p ON u.Id = p.OwnerUserId WHERE DATEDIFF(mi, u.CreationDate, p.CreationDate) < 15 -- Total number of users created and posting within 15 minutes SELECT COUNT(p.id), DATEPART(yy, p.CreationDate) FROM Users u INNER JOIN Posts p ON u.Id = p.OwnerUserId WHERE DATEDIFF(mi, u.CreationDate, p.CreationDate) < 15 GROUP BY DATEPART(yy, p.CreationDate) --Posts with replies within 15 minutes WITH Posts_CTE (UserCreate, PostId, PostCreate, ParentId, AcceptedAnswerId, OwnerUserId, DisplayName) AS ( SELECT u.CreationDate AS [UserCreate] , p.id, p.CreationDate AS [PostCreate] , p.ParentId, p.AcceptedAnswerId , p.OwnerUserId, u.DisplayName FROM Users u INNER JOIN Posts p ON u.Id = p.OwnerUserId ) SELECT * FROM Posts_CTE pcte INNER JOIN Posts p ON pcte.PostId = p.ParentId WHERE DATEDIFF(mi, pcte.UserCreate, pcte.PostCreate) < 15
Published at DZone with permission of Thomas LaRock , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.