
Reddit comments dataset

This dataset contains publicly available comments on Reddit from December 2005 through March 2023 and comprises over 14 billion rows. The raw data is in JSON format in compressed files, and the rows look like the following:

A shoutout to Percona for the motivation behind ingesting this dataset, which we have downloaded and stored in an S3 bucket.

Note

The following commands were executed on a Production instance of ClickHouse Cloud with the minimum memory set to 720GB. To run this on your own cluster, replace default in the s3Cluster function call with the name of your cluster. If you do not have a cluster, then replace the s3Cluster function with the s3 function.

  1. Let's create a table for the Reddit data:
Note

The names of the files in S3 start with RC_YYYY-MM where YYYY-MM goes from 2005-12 to 2023-02. The compression changes a couple of times though, so the file extensions are not consistent. For example:

  • the file names are initially RC_2005-12.bz2 to RC_2017-11.bz2
  • then they look like RC_2017-12.xz to RC_2018-09.xz
  • and finally RC_2018-10.zst to RC_2023-02.zst
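The original DDL was not preserved in this copy of the page. As a minimal sketch, the table might look something like the following, where the table name `reddit`, the column list (an assumed subset of the JSON fields), and the `ORDER BY` key are all assumptions:

```sql
CREATE TABLE reddit
(
    subreddit LowCardinality(String),
    author LowCardinality(String),
    body String,
    id String,
    parent_id String,
    score Int32,
    created_utc DateTime,
    created_date Date DEFAULT toDate(created_utc)
)
ENGINE = MergeTree
ORDER BY (subreddit, created_date, author);
```

`LowCardinality(String)` is a natural choice for columns like `subreddit` and `author` that repeat heavily across billions of rows.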
  2. We are going to start with one month of data, but if you want to simply insert every row, skip ahead to step 8 below. The following file has 86M records from December, 2017:
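The `INSERT` itself was not preserved here. A sketch of what it might look like, assuming the table from step 1 is named `reddit`; the S3 path is a placeholder, not the real bucket URL:

```sql
-- the S3 path below is a placeholder for the actual bucket
INSERT INTO reddit
SELECT *
FROM s3Cluster(
    'default',
    'https://<bucket>/reddit/RC_2017-12.xz',
    'JSONEachRow'
);
```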
  3. It will take a while depending on your resources, but when it's done, verify it worked:
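A simple row count is enough to verify the load; for example (assuming the table name `reddit`):

```sql
SELECT formatReadableQuantity(count())
FROM reddit;
```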
  4. Let's see how many unique subreddits there were in December of 2017:
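Since only December 2017 has been inserted at this point, a plain exact-distinct count over the whole table would answer the question. A sketch:

```sql
SELECT uniqExact(subreddit)
FROM reddit;
```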
  5. This query returns the top 10 subreddits (in terms of number of comments):
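The query was stripped from this copy; a sketch of the usual group-count-sort pattern:

```sql
SELECT
    subreddit,
    count() AS count
FROM reddit
GROUP BY subreddit
ORDER BY count DESC
LIMIT 10;
```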
  6. Here are the top 10 authors in December of 2017, in terms of number of comments posted:
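Same shape as the previous query, grouped by `author` instead of `subreddit`. A sketch:

```sql
SELECT
    author,
    count() AS count
FROM reddit
GROUP BY author
ORDER BY count DESC
LIMIT 10;
```

Note that in practice placeholder authors such as `[deleted]` tend to dominate a query like this.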
  7. We already inserted some data, but we will start over:
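Starting over presumably means emptying the table first, for example:

```sql
TRUNCATE TABLE reddit;
```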
  8. This is a fun dataset and it looks like we can find some great information, so let's go ahead and insert the entire dataset from 2005 to 2023. For practical reasons, it works well to insert the data by years starting with...

...and ending with:

If you do not have a cluster, use s3 instead of s3Cluster:
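A sketch of what one of these per-year inserts might look like, using a glob to match all months of a year; the bucket path is a placeholder, and the `zstd_window_log_max` setting (which allows reading zstd files compressed with a large window) is an assumption here:

```sql
-- placeholder bucket path; the glob matches every monthly file of 2005
INSERT INTO reddit
SELECT *
FROM s3Cluster('default', 'https://<bucket>/reddit/RC_2005*', 'JSONEachRow')
SETTINGS zstd_window_log_max = 31;
```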

  9. To verify it worked, here are the number of rows per year (as of February, 2023):
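A sketch of a per-year breakdown, assuming a `created_utc` timestamp column:

```sql
SELECT
    toYear(created_utc) AS year,
    formatReadableQuantity(count()) AS count
FROM reddit
GROUP BY year
ORDER BY year;
```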
  10. Let's see how many rows were inserted and how much disk space the table is using:

Notice that the compressed size on disk is about one-third of the uncompressed size:
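Row count and storage figures like these typically come from the `system.parts` table. A sketch, assuming the table is named `reddit`:

```sql
SELECT
    formatReadableQuantity(sum(rows)) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS disk_size,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size
FROM system.parts
WHERE table = 'reddit' AND active;
```

Filtering on `active` is important; otherwise outdated parts awaiting cleanup would be counted twice.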

  11. The following query shows how many comments, authors and subreddits we have for each month:

This is a substantial query that has to process all 14.69 billion rows, but we still get an impressive response time (about 48 seconds):
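The monthly rollup might look like the following sketch, using the approximate `uniq` aggregate to keep the full-table scan tractable:

```sql
SELECT
    toStartOfMonth(created_utc) AS firstOfMonth,
    count() AS comments,
    uniq(author) AS authors,
    uniq(subreddit) AS subreddits
FROM reddit
GROUP BY firstOfMonth
ORDER BY firstOfMonth;
```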

  12. Here are the top 10 subreddits of 2022:
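Same pattern as step 5, restricted to one year. A sketch:

```sql
SELECT
    subreddit,
    count() AS count
FROM reddit
WHERE toYear(created_utc) = 2022
GROUP BY subreddit
ORDER BY count DESC
LIMIT 10;
```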
  13. Let's see which subreddits had the biggest increase in comments from 2018 to 2019:

It looks like memes and teenagers were busy on Reddit in 2019:
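One way to express this is a join of two per-year counts; a sketch:

```sql
SELECT
    subreddit,
    newcount - oldcount AS diff
FROM
(
    SELECT subreddit, count() AS newcount
    FROM reddit
    WHERE toYear(created_utc) = 2019
    GROUP BY subreddit
)
INNER JOIN
(
    SELECT subreddit, count() AS oldcount
    FROM reddit
    WHERE toYear(created_utc) = 2018
    GROUP BY subreddit
) USING (subreddit)
ORDER BY diff DESC
LIMIT 10;
```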

  14. One more query: let's compare ClickHouse mentions to other technologies like Snowflake and Postgres. This query is a big one because it has to search all 14.69 billion comments three times for a substring, but the performance is actually quite impressive. (Unfortunately ClickHouse users are not very active on Reddit yet):
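A sketch of the three-way substring comparison, bucketed by quarter; `positionCaseInsensitive` returns 0 when the substring is absent, so comparing it to 0 yields a per-row 0/1 that can be summed:

```sql
SELECT
    toStartOfQuarter(created_utc) AS quarter,
    sum(positionCaseInsensitive(body, 'clickhouse') > 0) AS clickhouse,
    sum(positionCaseInsensitive(body, 'snowflake') > 0) AS snowflake,
    sum(positionCaseInsensitive(body, 'postgres') > 0) AS postgres
FROM reddit
GROUP BY quarter
ORDER BY quarter;
```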