TPC-H (1999)
A popular benchmark which models the internal data warehouse of a wholesale supplier. The data is stored into a 3rd normal form representation, requiring lots of joins at query runtime. Despite its age and its unrealistic assumption that the data is uniformly and independently distributed, TPC-H remains the most popular OLAP benchmark to date.
References
- TPC-H
- New TPC Benchmarks for Decision Support and Web Commerce (Poess et. al., 2000)
- TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark (Boncz et. al.), 2013
- Quantifying TPC-H Choke Points and Their Optimizations (Dresseler et. al.), 2020
Data Generation and Import
First, checkout the TPC-H repository and compile the data generator:
Then, generate the data. Parameter -s
specifies the scale factor. For example, with -s 100
, 600 million rows are generated for table 'lineitem'.
Detailed table sizes with scale factor 100:
Table | size (in rows) | size (compressed in ClickHouse) |
---|---|---|
nation | 25 | 2 kB |
region | 5 | 1 kB |
part | 20.000.000 | 895 MB |
supplier | 1.000.000 | 75 MB |
partsupp | 80.000.000 | 4.37 GB |
customer | 15.000.000 | 1.19 GB |
orders | 150.000.000 | 6.15 GB |
lineitem | 600.00.00 | 26.69 GB |
(Compressed sizes in ClickHouse are taken from system.tables.total_bytes
and based on below table definitions.)
Now create tables in ClickHouse.
We stick as closely as possible to the rules of the TPC-H specification:
- Primary keys are created only for the columns mentioned in section 1.4.2.2 of the specification.
- Substitution parameters were replaced by the values for query validation in sections 2.1.x.4 of the specification.
- As per section 1.4.2.1, the table definitions do not use the optional
NOT NULL
constraints, even ifdbgen
generates them by default. The performance ofSELECT
queries in ClickHouse is not affected by the presence or absence ofNOT NULL
constraints. - As per section 1.3.1, we use ClickHouse's native datatypes (e.g.
Int32
,String
) to implement the abstract datatypes mentioned in the specification (e.g.Identifier
,Variable text, size N
). The only effect of this is better readability, the SQL-92 datatypes generated bydbgen
(e.g.INTEGER
,VARCHAR(40)
) would also work in ClickHouse.
The data can be imported as follows:
Instead of using tpch-kit and generating the tables by yourself, you can alternatively import the data from a public S3 bucket. Make sure
to create empty tables first using above CREATE
statements.
Queries
Setting join_use_nulls
should be enabled to produce correct results according to SQL standard.
The queries are generated by ./qgen -s <scaling_factor>
. Example queries for s = 100
:
Correctness
The result of the queries agrees with the official results unless mentioned otherwise. To verify, generate a TPC-H database with scale
factor = 1 (dbgen
, see above) and compare with the expected results in tpch-kit.
Q1
Q2
As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697
This alternative formulation works and was verified to return the reference results.
Q3
Q4
As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697
This alternative formulation works and was verified to return the reference results.
Q5
Q6
As of October 2024, the query does not work out-of-the box due to a bug with Decimal addition. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/70136
This alternative formulation works and was verified to return the reference results.
Q7
Q8
Q9
Q10
Q11
Q12
Q13
This alternative formulation works and was verified to return the reference results.
Q14
Q15
Q16
Q17
As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697
This alternative formulation works and was verified to return the reference results.
Q18
Q19
Q20
As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697
Q21
As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697
Q22