Run performance tests on ApsaraDB for ClickHouse using the OnTime and Star Schema datasets. This guide walks you through importing each dataset and executing benchmark queries.
Prerequisites
An ApsaraDB for ClickHouse cluster is created. For more information, see Create a cluster.
A database account is created. For more information, see Create a cluster.
A Linux machine is prepared and its IP address is added to the whitelist of the ApsaraDB for ClickHouse cluster. For more information, see Configure a whitelist.
The version of clickhouse-client that matches your ApsaraDB for ClickHouse cluster is installed. For more information, see clickhouse-client.
Usage notes
All clickhouse-client commands in this guide are run from the directory where clickhouse-client is installed.
The OnTime and Star Schema datasets are sourced from the open source ClickHouse official website. For more information, see Example Datasets.
Use the OnTime dataset to test performance
1. Connect to the cluster
Connect to your ApsaraDB for ClickHouse cluster from clickhouse-client in a Linux environment:
./clickhouse-client --host=<Cluster endpoint> --port=<TCP port number> --user=<Database account username> --password=<Database account password>
2. Create a table
Download and run the table creation script for your cluster edition:
Script for creating a table in an ApsaraDB for ClickHouse cluster of the Single-replica Edition
Script for creating a table in an ApsaraDB for ClickHouse cluster of the Double-replica Edition
3. Download the OnTime dataset
The OnTime dataset spans a large time range and takes a long time to download. You may also see data loss errors during the download, which is expected.
echo https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip | xargs -P10 wget --no-check-certificate --continue
4. Import the data to ApsaraDB for ClickHouse
Set the connection environment variables and import the data:
export ck_url=<Cluster endpoint>
export ck_user=<Database account username>
export ck_pass=<Database account password>
export ck_port=<TCP port number>
for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | ./clickhouse-client -h $ck_url -u $ck_user --password $ck_pass --port $ck_port --query="INSERT INTO ontime FORMAT CSVWithNames"; done
5. Run the benchmark queries
The test queries are sourced from the open source ClickHouse official website. For more information, see OnTime.
Q0 — Average monthly flight count per year
SELECT avg(c1)
FROM
(
SELECT Year, Month, count(*) AS c1
FROM ontime
GROUP BY Year, Month
);
Q1 — Flight count by day of week (2000–2008)
SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;
Q2 — Delayed flights by day of week (2000–2008, delay > 10 min)
SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;
Q3 — Top 10 origins with most delayed departures (2000–2008)
SELECT Origin, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY Origin
ORDER BY c DESC
LIMIT 10;
Q4 — Delayed flights by carrier (2007)
SELECT Carrier, count(*)
FROM ontime
WHERE DepDelay>10 AND Year=2007
GROUP BY Carrier
ORDER BY count(*) DESC;
Q5 — Delay percentage by carrier (2007)
set joined_subquery_requires_alias=0;
set any_join_distinct_right_table_keys=1;
SELECT Carrier, c, c2, c*100/c2 as c3
FROM
(
SELECT
Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year=2007
GROUP BY Carrier
)
ANY INNER JOIN
(
SELECT
Carrier,
count(*) AS c2
FROM ontime
WHERE Year=2007
GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;
Q6 — Delay percentage by carrier (2000–2008)
set joined_subquery_requires_alias=0;
SELECT Carrier, c, c2, c*100/c2 as c3
FROM
(
SELECT
Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year>=2000 AND Year<=2008
GROUP BY Carrier
)
ANY INNER JOIN
(
SELECT
Carrier,
count(*) AS c2
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;
Q7 — Annual delay rate trend (all years)
set joined_subquery_requires_alias=0;
SELECT Year, c1/c2
FROM
(
select
Year,
count(*)*100 as c1
from ontime
WHERE DepDelay>10
GROUP BY Year
)
ANY INNER JOIN
(
select
Year,
count(*) as c2
from ontime
GROUP BY Year
) USING (Year)
ORDER BY Year;
Q8 — Top 10 destinations with the most unique origin cities served (2000–2010)
SELECT DestCityName, uniqExact(OriginCityName) AS u
FROM ontime
WHERE Year>=2000 and Year<=2010
GROUP BY DestCityName
ORDER BY u DESC
LIMIT 10;
Q9 — Total flights per year
SELECT Year, count(*) AS c1
FROM ontime
GROUP BY Year;
Q10 — Carriers ranked by arrival delay rate, excluding remote states (before 2010)
SELECT
min(Year), max(Year), Carrier, count(*) AS cnt,
sum(ArrDelayMinutes>30) AS flights_delayed,
round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
FROM ontime
WHERE
DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
AND FlightDate < '2010-01-01'
GROUP by Carrier
HAVING cnt>100000 and max(Year)>1990
ORDER by rate DESC
LIMIT 1000;
Use the Star Schema dataset to test performance
1. Connect to the cluster
Connect to your ApsaraDB for ClickHouse cluster from clickhouse-client in a Linux environment:
./clickhouse-client --host=<Cluster endpoint> --port=<TCP port number> --user=<Database account username> --password=<Database account password>
2. Create a table
Download and run the table creation script for your cluster edition:
Script for creating a table in an ApsaraDB for ClickHouse cluster of the Single-replica Edition
Script for creating a table in an ApsaraDB for ClickHouse cluster of the Double-replica Edition
3. Generate data files
Clone the ssb-dbgen project, compile it, and generate the data files. The following example generates four files: customer.tbl, lineorder.tbl, part.tbl, and supplier.tbl.
The -s 100 parameter generates approximately 600 million rows of data (67 GB). Increase or decrease this value based on your test goals: use a lower value for quick validation, or a higher value for realistic load testing.
git clone https://github.com/vadimtk/ssb-dbgen.git
cd ssb-dbgen
make
./dbgen -s 100 -T c
./dbgen -s 100 -T l
./dbgen -s 100 -T p
./dbgen -s 100 -T s
4. Import the data to ApsaraDB for ClickHouse
Set the connection environment variables and import the data:
export ck_url=<Cluster endpoint>
export ck_user=<Database account username>
export ck_pass=<Database account password>
export ck_port=<TCP port number>
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO customer FORMAT CSV" < ssb-dbgen/customer.tbl
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO part FORMAT CSV" < ssb-dbgen/part.tbl
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO supplier FORMAT CSV" < ssb-dbgen/supplier.tbl
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO lineorder FORMAT CSV" < ssb-dbgen/lineorder.tbl
5. Run the benchmark queries
The test queries are sourced from the open source ClickHouse official website. For more information, see Star Schema Benchmark.
Q0 — Flatten the normalized tables into a single denormalized table for faster aggregation queries
INSERT INTO lineorder_flat
SELECT
l.LO_ORDERKEY AS LO_ORDERKEY,
l.LO_LINENUMBER AS LO_LINENUMBER,
l.LO_CUSTKEY AS LO_CUSTKEY,
l.LO_PARTKEY AS LO_PARTKEY,
l.LO_SUPPKEY AS LO_SUPPKEY,
l.LO_ORDERDATE AS LO_ORDERDATE,
l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
l.LO_QUANTITY AS LO_QUANTITY,
l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
l.LO_DISCOUNT AS LO_DISCOUNT,
l.LO_REVENUE AS LO_REVENUE,
l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
l.LO_TAX AS LO_TAX,
l.LO_COMMITDATE AS LO_COMMITDATE,
l.LO_SHIPMODE AS LO_SHIPMODE,
c.C_NAME AS C_NAME,
c.C_ADDRESS AS C_ADDRESS,
c.C_CITY AS C_CITY,
c.C_NATION AS C_NATION,
c.C_REGION AS C_REGION,
c.C_PHONE AS C_PHONE,
c.C_MKTSEGMENT AS C_MKTSEGMENT,
s.S_NAME AS S_NAME,
s.S_ADDRESS AS S_ADDRESS,
s.S_CITY AS S_CITY,
s.S_NATION AS S_NATION,
s.S_REGION AS S_REGION,
s.S_PHONE AS S_PHONE,
p.P_NAME AS P_NAME,
p.P_MFGR AS P_MFGR,
p.P_CATEGORY AS P_CATEGORY,
p.P_BRAND AS P_BRAND,
p.P_COLOR AS P_COLOR,
p.P_TYPE AS P_TYPE,
p.P_SIZE AS P_SIZE,
p.P_CONTAINER AS P_CONTAINER
FROM lineorder AS l
INNER JOIN customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN part AS p ON p.P_PARTKEY = l.LO_PARTKEY;
Q1 — Total revenue from discounted items in 1993 (quantity < 25, discount 1–3%)
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toYear(LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;
Q2 — Total revenue from discounted items in January 1994 (quantity 26–35, discount 4–6%)
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toYYYYMM(LO_ORDERDATE) = 199401 AND LO_DISCOUNT BETWEEN 4 AND 6 AND LO_QUANTITY BETWEEN 26 AND 35;
Q3 — Total revenue from discounted items in week 6 of 1994 (quantity 26–35, discount 5–7%)
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toISOWeek(LO_ORDERDATE) = 6 AND toYear(LO_ORDERDATE) = 1994
AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;
Q4 — Annual revenue by brand for category MFGR#12 in the Americas
SELECT
sum(LO_REVENUE),
toYear(LO_ORDERDATE) AS year,
P_BRAND
FROM lineorder_flat
WHERE P_CATEGORY = 'MFGR#12' AND S_REGION = 'AMERICA'
GROUP BY
year,
P_BRAND
ORDER BY
year,
P_BRAND;
Q5 — Annual revenue by brand MFGR#2221–MFGR#2228 in Asia
SELECT
sum(LO_REVENUE),
toYear(LO_ORDERDATE) AS year,
P_BRAND
FROM lineorder_flat
WHERE P_BRAND >= 'MFGR#2221' AND P_BRAND <= 'MFGR#2228' AND S_REGION = 'ASIA'
GROUP BY
year,
P_BRAND
ORDER BY
year,
P_BRAND;
Q6 — Annual revenue for brand MFGR#2239 in Europe
SELECT
sum(LO_REVENUE),
toYear(LO_ORDERDATE) AS year,
P_BRAND
FROM lineorder_flat
WHERE P_BRAND = 'MFGR#2239' AND S_REGION = 'EUROPE'
GROUP BY
year,
P_BRAND
ORDER BY
year,
P_BRAND;
Q7 — Revenue by customer and supplier nation within Asia (1992–1997)
SELECT
C_NATION,
S_NATION,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE C_REGION = 'ASIA' AND S_REGION = 'ASIA' AND year >= 1992 AND year <= 1997
GROUP BY
C_NATION,
S_NATION,
year
ORDER BY
year ASC,
revenue DESC;
Q8 — Revenue by city for United States customers and suppliers (1992–1997)
SELECT
C_CITY,
S_CITY,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE C_NATION = 'UNITED STATES' AND S_NATION = 'UNITED STATES' AND year >= 1992 AND year <= 1997
GROUP BY
C_CITY,
S_CITY,
year
ORDER BY
year ASC,
revenue DESC;
Q9 — Revenue by city for specific UK cities (1992–1997)
SELECT
C_CITY,
S_CITY,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND year >= 1992 AND year <= 1997
GROUP BY
C_CITY,
S_CITY,
year
ORDER BY
year ASC,
revenue DESC;
Q10 — Revenue by city for specific UK cities in December 1997
SELECT
C_CITY,
S_CITY,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND toYYYYMM(LO_ORDERDATE) = 199712
GROUP BY
C_CITY,
S_CITY,
year
ORDER BY
year ASC,
revenue DESC;
Q11 — Annual profit by customer nation in the Americas (manufacturers MFGR#1 and MFGR#2)
SELECT
toYear(LO_ORDERDATE) AS year,
C_NATION,
sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
year,
C_NATION
ORDER BY
year ASC,
C_NATION ASC;
Q12 — Profit by supplier nation and product category in the Americas (1997–1998, manufacturers MFGR#1 and MFGR#2)
SELECT
toYear(LO_ORDERDATE) AS year,
S_NATION,
P_CATEGORY,
sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (year = 1997 OR year = 1998) AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
year,
S_NATION,
P_CATEGORY
ORDER BY
year ASC,
S_NATION ASC,
P_CATEGORY ASC;
Q13 — Profit by city and brand in the United States (1997–1998, category MFGR#14)
SELECT
toYear(LO_ORDERDATE) AS year,
S_CITY,
P_BRAND,
sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE S_NATION = 'UNITED STATES' AND (year = 1997 OR year = 1998) AND P_CATEGORY = 'MFGR#14'
GROUP BY
year,
S_CITY,
P_BRAND
ORDER BY
year ASC,
S_CITY ASC,
P_BRAND ASC;