Over a million developers have joined DZone.

Big Dataset: Analyzing All Reddit Comments With ClickHouse

DZone's Guide to

Big Dataset: Analyzing All Reddit Comments With ClickHouse

It is hard to come across interesting datasets, especially a big one. However, I recently struck gold when I found Reddit's comments and submissions dataset!

· Big Data Zone ·
Free Resource

Learn how to operationalize machine learning and data science projects to monetize your AI initiatives. Download the Gartner report now.

In this blog, I'll use ClickHouse and Tabix to look at a new very large dataset for research.

It is hard to come across interesting datasets, especially a big one (and by big, I mean one billion rows or more). Before, I've used on-time airline performance available from Bureau of Transporation Statistics. Another recent example is NYC taxi and Uber trips data, with over one billion records.

However, today, I wanted to mention an interesting dataset I found recently that has been available since 2015. This is Reddit's comments and submissions dataset, made possible thanks to Reddit's generous API. The dataset was first mentioned at "I have every publicly available Reddit comment for research" and currently, you can find it at pushshift.io. However, there is no guarantee that pushshift.io will provide this dataset in the future. I think it would be valuable if Amazon or another cloud provider made this dataset available for researchers, just as Amazon provides public datasets.

The dataset contains 2.86 billion records to the end of 2016 and is 709GB in size, uncompressed. This dataset is valuable for a variety of research scenarios, from simple stats to natural language processing and machine learning.

Now let's see what simple info we can collect from this dataset using ClickHouse and https://tabix.io/, a GUI tool for ClickHouse. In this first round, we'll figure some basic stats, like number of comments per month, number of authors per month, and number of subreddits. I also added how many comments in average are left for a post.

Queries to achieve this:

SELECT toYYYYMM(created_date) dt,count(*) comments FROM commententry1 GROUP BY dt ORDER BY dt
SELECT toYYYYMM(created_date) dt,count(DISTINCT author) authors FROM commententry1 GROUP BY dt ORDER BY dt
SELECT toYYYYMM(created_date) dt,count(DISTINCT subreddit) subreddits FROM commententry1 GROUP BY dt ORDER BY dt
SELECT toYYYYMM(created_date) dt,count(*)/count(distinct link_id) comments_per_post FROM commententry1 GROUP BY dt ORDER BY dt

And the graphical result:

It impressive to see the constant growth in comments (to 70mln per month by the end of 2016) and authors (to 3.5mln for the same time period). There is something interesting happening with subreddits, which jump up and down. It's interesting to see that the average count of comments per post stays stable, with a slight decline to 13 comments/post by the end of 2016.

Now let's check most popular subreddits:

SELECT subreddit,count(*) cnt FROM commententry1 GROUP BY subreddit ORDER BY cnt DESC limit 100

And using a treemap (available in Tabix.io):

We can measure subreddits that get the biggest increase in comments in 2016 compared to 2015:

SELECT subreddit,cntnew-cntold diff FROM (SELECT subreddit,count(*) cntnew FROM commententry1 WHERE toYear(created_date)=2016 GROUP BY subreddit) ALL INNER JOIN (SELECT subreddit,count(*) cntold FROM commententry1 WHERE toYear(created_date)=2015 GROUP BY subreddit) USING (subreddit) ORDER BY diff DESC LIMIT 50

Obviously, Reddit was affected by the United States Presidential Election 2016, but not just that. The gaming community saw an increase in Overwatch, PokemonGO, and Dark Souls 3.

Now we can try to run our own DB-Ranking, but only based on Reddit comments. This is how I can do this for MySQL, PostgreSQL, and MongoDB:

SELECT toStartOfQuarter(created_date) Quarter,
sum(if(positionCaseInsensitive(body,'mysql')>0,1,0)) mysql,
sum(if(positionCaseInsensitive(body,'postgres')>0,1,0)) postgres,
sum(if(positionCaseInsensitive(body,'mongodb')>0,1,0)) mongodb
FROM commententry1
GROUP BY Quarter ORDER BY Quarter; 

Image title

I would say the result is aligned with this, where MySQL is the most popular among the three, followed by PostgreSQL and then MongoDB. There is an interesting spike for PostgreSQL in the second quarter in 2015, caused by a bot in "leagueoflegend" tournaments. The bot was actively announcing that it is powered by PostgreSQL in the comments, like this.

To highlight more ClickHouse features: along with standard SQL functions, it provides a variety of statistical functions (for example, Quantile calculations). We can try to see the distribution of the number of comments left by authors:

        count(*) AS cnt
    FROM commententry1
    WHERE author != '[deleted]'
    GROUP BY author

The result is:

quantileExact(0.1)(cnt) - 1
quantileExact(0.2)(cnt) - 1
quantileExact(0.3)(cnt) - 1
quantileExact(0.4)(cnt) - 2
quantileExact(0.5)(cnt) - 4
quantileExact(0.6)(cnt) - 7
quantileExact(0.7)(cnt) - 16
quantileExact(0.8)(cnt) - 42
quantileExact(0.9)(cnt) - 160
quantileExact(0.99)(cnt) - 2271

Which means that 30% of authors left only one comment, and 50% of authors left four comments or less.

In general, ClickHouse was a pleasure to use when running analytical queries. However, I should note the missing support of WINDOW functions is a huge limitation. Even MySQL 8.0, which recently was released as RC, provides support for WINDOW functions. I hope ClickHouse will implement this as well.

Bias comes in a variety of forms, all of them potentially damaging to the efficacy of your ML algorithm. Our Chief Data Scientist discusses the source of most headlines about AI failures here.

big data ,tutorial ,data analytics ,data visualization ,reddit ,clickhouse

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}