Amazon customer reviews dataset

This dataset contains over 150M customer reviews of Amazon products. The data is in snappy-compressed Parquet files in AWS S3 that total 49GB in size (compressed). Let's walk through the steps to insert it into ClickHouse.

Note

The queries below were executed on a Production instance of ClickHouse Cloud.

Without inserting the data into ClickHouse, we can query it in place. Let's grab some rows, so we can see what they look like:
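
A minimal sketch of such an in-place query, assuming the s3 table function can infer the Parquet format and schema from the file extension. It reuses the bucket path and wildcard given later in this guide; the row limit is arbitrary:

```sql
-- Read a few rows straight from S3 without creating a table.
-- The format (Parquet) is inferred from the file extension.
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet')
LIMIT 3
FORMAT Vertical;
```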

The rows look like:

Let's define a new MergeTree table named amazon_reviews to store this data in ClickHouse:
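
One possible definition is sketched below. The column names follow the publicly documented schema of the Amazon customer reviews dataset; the types and the sorting key are assumptions chosen for illustration (this guide only confirms that product_category is part of the primary key), so compare them with the rows returned from S3 before relying on it:

```sql
-- Hypothetical schema: column types and the ORDER BY key are assumptions.
CREATE TABLE amazon_reviews
(
    review_date       Date,
    marketplace       LowCardinality(String),
    customer_id       UInt64,
    review_id         String,
    product_id        String,
    product_parent    UInt64,
    product_title     String,
    product_category  LowCardinality(String),
    star_rating       UInt8,
    helpful_votes     UInt32,
    total_votes       UInt32,
    -- Kept as strings here because the source encoding of these flags is not shown.
    vine              LowCardinality(String),
    verified_purchase LowCardinality(String),
    review_headline   String,
    review_body       String
)
ENGINE = MergeTree
-- With no explicit PRIMARY KEY, the sorting key also serves as the primary key.
ORDER BY (review_date, product_category);
```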

The following INSERT command uses the s3Cluster table function, which processes multiple S3 files in parallel using all the nodes of your cluster. We also use a wildcard to insert every file whose name matches the pattern https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet:

Tip

In ClickHouse Cloud, the name of the cluster is default. Change default to the name of your cluster, or use the s3 table function (instead of s3Cluster) if you do not have a cluster.
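
A sketch of that INSERT, assuming the cluster is named default. The column names are assumptions based on the public dataset schema; listing them explicitly avoids depending on the column order inside the Parquet files, because INSERT ... SELECT matches columns by position:

```sql
-- 'default' is the cluster name; replace it with your own cluster name.
INSERT INTO amazon_reviews
    (review_date, marketplace, customer_id, review_id, product_id,
     product_parent, product_title, product_category, star_rating,
     helpful_votes, total_votes, vine, verified_purchase,
     review_headline, review_body)
SELECT
    review_date, marketplace, customer_id, review_id, product_id,
    product_parent, product_title, product_category, star_rating,
    helpful_votes, total_votes, vine, verified_purchase,
    review_headline, review_body
FROM s3Cluster(
    'default',
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet'
);
```

Without a cluster, the same statement should work with s3(...) in place of s3Cluster('default', ...).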

That query doesn't take long, averaging about 300,000 rows per second. Within 5 minutes or so you should see all the rows inserted:
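
One way to check progress is to count the rows, sketched here with the built-in formatReadableQuantity function for a friendlier number:

```sql
-- Row count so far; rerun while the INSERT is in progress.
SELECT formatReadableQuantity(count()) AS total_rows
FROM amazon_reviews;
```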

Let's see how much space our data is using:
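
A sketch of one way to measure this, using the system.parts table and assuming amazon_reviews lives in the current database:

```sql
-- Sum compressed and uncompressed sizes over the table's active parts.
SELECT
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
  AND database = currentDatabase()
  AND table = 'amazon_reviews';
```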

The original data was about 70G, but compressed in ClickHouse it takes up about 30G:

Let's run some queries. Here are the top 10 most helpful reviews in the dataset:
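
A sketch of such a query, assuming helpful_votes, product_title, and review_headline columns as in the public dataset schema:

```sql
-- Rank every review by its helpful-vote count and keep the top 10.
SELECT
    product_title,
    review_headline,
    helpful_votes
FROM amazon_reviews
ORDER BY helpful_votes DESC
LIMIT 10;
```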

Notice the query has to process all 151M rows, but takes less than one second!

Here are the top 10 products in Amazon with the most reviews:
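
This could be written roughly as follows, assuming product_id identifies a product and product_title holds its display name:

```sql
-- Count reviews per product; any() picks one title per product_id.
SELECT
    any(product_title) AS title,
    count() AS reviews
FROM amazon_reviews
GROUP BY product_id
ORDER BY reviews DESC
LIMIT 10;
```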

Here are the average review ratings per month for each product (an actual Amazon job interview question!):
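
A sketch under the assumption that review_date is a Date column and star_rating is numeric; the ordering is an arbitrary choice made for illustration:

```sql
-- Average star rating per product per calendar month.
SELECT
    toStartOfMonth(review_date) AS month,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars
FROM amazon_reviews
GROUP BY
    month,
    product_id
ORDER BY
    month DESC,
    product_id ASC
LIMIT 20;
```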

The query calculates the monthly averages for every product, but we return only 20 rows:

Here is the total number of votes per product category. This query is fast because product_category is in the primary key:
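
A sketch of the aggregation, assuming a total_votes column as in the public dataset schema:

```sql
-- One output row per category, largest vote totals first.
SELECT
    product_category,
    sum(total_votes) AS votes
FROM amazon_reviews
GROUP BY product_category
ORDER BY votes DESC;
```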

Let's find the products whose reviews contain the word "awful" most frequently. This is a big task: over 151M strings have to be scanned for a single word:
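
One way to sketch this search, assuming the review text is stored in a review_body column. The position function is case-sensitive (positionCaseInsensitive is the case-insensitive variant), and the limit of 50 is arbitrary:

```sql
-- Count, per product, the reviews whose body contains 'awful'.
SELECT
    product_id,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars,
    count() AS mentions
FROM amazon_reviews
WHERE position(review_body, 'awful') > 0
GROUP BY product_id
ORDER BY mentions DESC
LIMIT 50;
```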

The query takes only 4 seconds, which is impressive, and the results are a fun read:

We can run the same query again, except this time we search for "awesome" in the reviews: