Amazon customer reviews dataset
This dataset contains over 150M customer reviews of Amazon products. The data is in snappy-compressed Parquet files in AWS S3 that total 49GB in size (compressed). Let's walk through the steps to insert it into ClickHouse.
The queries below were executed on a Production instance of ClickHouse Cloud.
- Without inserting the data into ClickHouse, we can query it in place. Let's grab some rows, so we can see what they look like:
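A minimal sketch of such an in-place query, assuming the bucket is publicly readable and using the s3 table function with the wildcard path that appears later on this page. The LIMIT of 10 is an arbitrary choice for illustration:

```sql
-- Read a handful of rows straight from the Parquet files in S3.
-- Nothing is stored in ClickHouse; the format is inferred from the file extension.
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet')
LIMIT 10;
```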
The rows look like:
- Let's define a new MergeTree table named amazon_reviews to store this data in ClickHouse:
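The table definition might look like the sketch below. Apart from product_category, which this page names as part of the primary key, every column name, type, and the rest of the sorting key are assumptions based on the commonly published schema of this dataset. Verify them against the Parquet files before relying on them:

```sql
-- Hypothetical schema: column names and types are assumptions, not confirmed by this page.
CREATE TABLE amazon_reviews
(
    review_date       Date,
    marketplace       LowCardinality(String),
    customer_id       UInt64,
    review_id         String,
    product_id        String,
    product_parent    UInt64,
    product_title     String,
    product_category  LowCardinality(String),
    star_rating       UInt8,
    helpful_votes     UInt32,
    total_votes       UInt32,
    vine              Bool,
    verified_purchase Bool,
    review_headline   String,
    review_body       String
)
ENGINE = MergeTree
-- product_category is in the sorting key, which also serves as the primary key here.
ORDER BY (marketplace, review_date, product_category);
```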
- The following INSERT command uses the s3Cluster table function, which allows the processing of multiple S3 files in parallel using all the nodes of your cluster. We also use a wildcard to insert every file whose name matches https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet:
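A sketch of what that INSERT could look like. It assumes the cluster is named default and that the columns of amazon_reviews line up with the Parquet schema in order and in compatible types. If they do not, list the columns explicitly or add conversions in the SELECT:

```sql
-- 'default' is the cluster name; the glob matches every yearly Parquet file.
-- Assumes the target table's columns match the Parquet columns by position.
INSERT INTO amazon_reviews
SELECT *
FROM s3Cluster(
    'default',
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet'
);
```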
In ClickHouse Cloud, the name of the cluster is default. Change default to the name of your cluster, or use the s3 table function (instead of s3Cluster) if you do not have a cluster.
- That query doesn't take long, averaging about 300,000 rows per second. Within 5 minutes or so you should see all the rows inserted:
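One simple way to check progress is to count the rows in the table, as sketched here. You can rerun it while the insert is still in flight:

```sql
-- Count the rows loaded so far; formatReadableQuantity prints e.g. millions in a compact form.
SELECT formatReadableQuantity(count()) AS rows_loaded
FROM amazon_reviews;
```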
- Let's see how much space our data is using:
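One way to measure this is to aggregate the active parts of the table in system.parts. This is a sketch. The output aliases are arbitrary, and it assumes the table lives in the current server under the name amazon_reviews:

```sql
-- Sum compressed and uncompressed bytes over the table's active parts.
SELECT
    disk_name,
    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_rate,
    sum(rows) AS rows,
    count() AS part_count
FROM system.parts
WHERE active AND table = 'amazon_reviews'
GROUP BY disk_name
ORDER BY size DESC;
```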
The original data was about 70 GB, but compressed in ClickHouse it takes up about 30 GB:
- Let's run some queries. Here are the top 10 most helpful reviews in the dataset:
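A sketch of such a query, assuming the table has helpful_votes, product_title, and review_headline columns (these names are assumptions, not confirmed by this page):

```sql
-- Sort every review by its helpful-vote count and keep the top 10.
-- Column names are assumed; adjust them to your table definition.
SELECT
    product_title,
    review_headline
FROM amazon_reviews
ORDER BY helpful_votes DESC
LIMIT 10;
```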
Notice the query has to process all 151M rows, but takes less than one second!
- Here are the top 10 products in Amazon with the most reviews:
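This could be written as a simple GROUP BY, assuming product_id identifies a product and product_title holds its name (both column names are assumptions):

```sql
-- Count reviews per product and keep the 10 most-reviewed.
-- any() picks one title per product_id, since titles may vary slightly between rows.
SELECT
    any(product_title) AS title,
    count() AS reviews
FROM amazon_reviews
GROUP BY product_id
ORDER BY reviews DESC
LIMIT 10;
```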
- Here are the average review ratings per month for each product (an actual Amazon job interview question!):
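A sketch of the monthly average, assuming review_date, star_rating, product_id, and product_title columns exist under those names. The LIMIT of 20 mirrors the row count mentioned on this page, while the sort order is an arbitrary choice:

```sql
-- Bucket reviews by calendar month, then average the star rating per product and month.
-- Column names are assumed; adjust them to your table definition.
SELECT
    toStartOfMonth(review_date) AS month,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars
FROM amazon_reviews
GROUP BY
    month,
    product_id
ORDER BY
    month DESC,
    product_id ASC
LIMIT 20;
```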
It calculates all the monthly averages for each product, but we only returned 20 rows:
- Here is the total number of votes per product category. This query is fast because product_category is in the primary key:
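A sketch of the aggregation, assuming the vote count is stored in a column named total_votes (an assumption; product_category is the name used on this page):

```sql
-- Sum the votes for each category; total_votes is an assumed column name.
SELECT
    product_category,
    sum(total_votes) AS votes
FROM amazon_reviews
GROUP BY product_category
ORDER BY votes DESC;
```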
- Let's find the products with the word "awful" occurring most frequently in the review. This is a big task - over 151M strings have to be parsed looking for a single word:
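One way to express this is a substring search with position, as sketched below. It assumes the review text is in a column named review_body and that products are keyed by product_id (both assumptions). The LIMIT of 50 is arbitrary. Note that position is case-sensitive and also matches "awful" inside longer words:

```sql
-- Keep only reviews whose body contains the substring 'awful', then count them per product.
-- Column names are assumed; adjust them to your table definition.
SELECT
    product_id,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars,
    count() AS matches
FROM amazon_reviews
WHERE position(review_body, 'awful') > 0
GROUP BY product_id
ORDER BY matches DESC
LIMIT 50;
```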
The query only takes 4 seconds - which is impressive - and the results are a fun read:
- We can run the same query again, except this time we search for "awesome" in the reviews:
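A sketch of that variant, under the same assumptions about column names (review_body, product_id, product_title, and star_rating are not confirmed by this page), with only the search term changed:

```sql
-- Same shape as the 'awful' search, with the search term swapped for 'awesome'.
SELECT
    product_id,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars,
    count() AS matches
FROM amazon_reviews
WHERE position(review_body, 'awesome') > 0
GROUP BY product_id
ORDER BY matches DESC
LIMIT 50;
```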