Performance testing guide

更新时间:
复制 MD 格式

Run performance tests on ApsaraDB for ClickHouse using the OnTime and Star Schema datasets. This guide walks you through importing each dataset and executing benchmark queries.

Prerequisites

  • An ApsaraDB for ClickHouse cluster is created. For more information, see Create a cluster.

  • A database account is created. For more information, see Create a cluster.

  • A Linux machine is prepared and its IP address is added to the whitelist of the ApsaraDB for ClickHouse cluster. For more information, see Configure a whitelist.

  • The version of clickhouse-client that matches your ApsaraDB for ClickHouse cluster is installed. For more information, see clickhouse-client.

Usage notes

  • All clickhouse-client commands in this guide are run from the directory where clickhouse-client is installed.

  • The OnTime and Star Schema datasets are sourced from the open source ClickHouse official website. For more information, see Example Datasets.

Use the OnTime dataset to test performance

1. Connect to the cluster

Connect to your ApsaraDB for ClickHouse cluster from clickhouse-client in a Linux environment:

./clickhouse-client --host=<Cluster endpoint> --port=<TCP port number> --user=<Database account username> --password=<Database account password>

2. Create a table

Download and run the table creation script for your cluster edition:

3. Download the OnTime dataset

Note

The OnTime dataset spans a large time range and takes a long time to download. You may also see data loss errors during the download, which is expected.

echo https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip | xargs -P10 wget --no-check-certificate --continue

4. Import the data to ApsaraDB for ClickHouse

Set the connection environment variables and import the data:

export ck_url=<Cluster endpoint>
export ck_user=<Database account username> 
export ck_pass=<Database account password>
export ck_port=<TCP port number> 
for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | ./clickhouse-client -h $ck_url -u $ck_user --password $ck_pass --port $ck_port --query="INSERT INTO ontime FORMAT CSVWithNames"; done

5. Run the benchmark queries

Note

The test queries are sourced from the open source ClickHouse official website. For more information, see OnTime.

Q0 — Average monthly flight count per year

SELECT avg(c1)
FROM
(
    SELECT Year, Month, count(*) AS c1
    FROM ontime
    GROUP BY Year, Month
);

Q1 — Flight count by day of week (2000–2008)

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;

Q2 — Delayed flights by day of week (2000–2008, delay > 10 min)

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;

Q3 — Top 10 origins with most delayed departures (2000–2008)

SELECT Origin, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY Origin
ORDER BY c DESC
LIMIT 10;

Q4 — Delayed flights by carrier (2007)

SELECT Carrier, count(*)
FROM ontime
WHERE DepDelay>10 AND Year=2007
GROUP BY Carrier
ORDER BY count(*) DESC;

Q5 — Delay percentage by carrier (2007)

set joined_subquery_requires_alias=0;
set any_join_distinct_right_table_keys=1;
SELECT Carrier, c, c2, c*100/c2 as c3
FROM
(
   SELECT
       Carrier,
       count(*) AS c
   FROM ontime
   WHERE DepDelay>10
       AND Year=2007
   GROUP BY Carrier
)
ANY INNER JOIN
(
   SELECT
       Carrier,
       count(*) AS c2
   FROM ontime
   WHERE Year=2007
   GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;

Q6 — Delay percentage by carrier (2000–2008)

set joined_subquery_requires_alias=0;
SELECT Carrier, c, c2, c*100/c2 as c3
FROM
(
   SELECT
       Carrier,
       count(*) AS c
   FROM ontime
   WHERE DepDelay>10
       AND Year>=2000 AND Year<=2008
   GROUP BY Carrier
)
ANY INNER JOIN
(
   SELECT
       Carrier,
       count(*) AS c2
   FROM ontime
   WHERE Year>=2000 AND Year<=2008
   GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;

Q7 — Annual delay rate trend (all years)

set joined_subquery_requires_alias=0;
SELECT Year, c1/c2
FROM
(
   select
       Year,
       count(*)*100 as c1
   from ontime
   WHERE DepDelay>10
   GROUP BY Year
)
ANY INNER JOIN
(
   select
       Year,
       count(*) as c2
   from ontime
   GROUP BY Year
) USING (Year)
ORDER BY Year;

Q8 — Top 10 destinations with the most unique origin cities served (2000–2010)

SELECT DestCityName, uniqExact(OriginCityName) AS u
FROM ontime
WHERE Year>=2000 and Year<=2010
GROUP BY DestCityName
ORDER BY u DESC
LIMIT 10;

Q9 — Total flights per year

SELECT Year, count(*) AS c1
FROM ontime
GROUP BY Year;

Q10 — Carriers ranked by arrival delay rate, excluding remote states (before 2010)

SELECT
  min(Year), max(Year), Carrier, count(*) AS cnt,
  sum(ArrDelayMinutes>30) AS flights_delayed,
  round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
FROM ontime
WHERE
  DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
  AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
  AND FlightDate < '2010-01-01'
GROUP by Carrier
HAVING cnt>100000 and max(Year)>1990
ORDER by rate DESC
LIMIT 1000;

Use the Star Schema dataset to test performance

1. Connect to the cluster

Connect to your ApsaraDB for ClickHouse cluster from clickhouse-client in a Linux environment:

./clickhouse-client --host=<Cluster endpoint> --port=<TCP port number> --user=<Database account username> --password=<Database account password>

2. Create a table

Download and run the table creation script for your cluster edition:

3. Generate data files

Clone the ssb-dbgen project, compile it, and generate the data files. The following example generates four files: customer.tbl, lineorder.tbl, part.tbl, and supplier.tbl.

Note

The -s 100 parameter generates approximately 600 million rows of data (67 GB). Increase or decrease this value based on your test goals: use a lower value for quick validation, or a higher value for realistic load testing.

git clone https://github.com/vadimtk/ssb-dbgen.git
cd ssb-dbgen
make
./dbgen -s 100 -T c
./dbgen -s 100 -T l
./dbgen -s 100 -T p
./dbgen -s 100 -T s

4. Import the data to ApsaraDB for ClickHouse

Set the connection environment variables and import the data:

export ck_url=<Cluster endpoint>
export ck_user=<Database account username>
export ck_pass=<Database account password>
export ck_port=<TCP port number> 
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO customer FORMAT CSV" < ssb-dbgen/customer.tbl
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO part FORMAT CSV" < ssb-dbgen/part.tbl
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO supplier FORMAT CSV" < ssb-dbgen/supplier.tbl
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO lineorder FORMAT CSV" < ssb-dbgen/lineorder.tbl 

5. Run the benchmark queries

Note

The test queries are sourced from the open source ClickHouse official website. For more information, see Star Schema Benchmark.

Q0 — Flatten the normalized tables into a single denormalized table for faster aggregation queries

INSERT INTO lineorder_flat
SELECT
    l.LO_ORDERKEY AS LO_ORDERKEY,
    l.LO_LINENUMBER AS LO_LINENUMBER,
    l.LO_CUSTKEY AS LO_CUSTKEY,
    l.LO_PARTKEY AS LO_PARTKEY,
    l.LO_SUPPKEY AS LO_SUPPKEY,
    l.LO_ORDERDATE AS LO_ORDERDATE,
    l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
    l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
    l.LO_QUANTITY AS LO_QUANTITY,
    l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
    l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
    l.LO_DISCOUNT AS LO_DISCOUNT,
    l.LO_REVENUE AS LO_REVENUE,
    l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
    l.LO_TAX AS LO_TAX,
    l.LO_COMMITDATE AS LO_COMMITDATE,
    l.LO_SHIPMODE AS LO_SHIPMODE,
    c.C_NAME AS C_NAME,
    c.C_ADDRESS AS C_ADDRESS,
    c.C_CITY AS C_CITY,
    c.C_NATION AS C_NATION,
    c.C_REGION AS C_REGION,
    c.C_PHONE AS C_PHONE,
    c.C_MKTSEGMENT AS C_MKTSEGMENT,
    s.S_NAME AS S_NAME,
    s.S_ADDRESS AS S_ADDRESS,
    s.S_CITY AS S_CITY,
    s.S_NATION AS S_NATION,
    s.S_REGION AS S_REGION,
    s.S_PHONE AS S_PHONE,
    p.P_NAME AS P_NAME,
    p.P_MFGR AS P_MFGR,
    p.P_CATEGORY AS P_CATEGORY,
    p.P_BRAND AS P_BRAND,
    p.P_COLOR AS P_COLOR,
    p.P_TYPE AS P_TYPE,
    p.P_SIZE AS P_SIZE,
    p.P_CONTAINER AS P_CONTAINER
FROM lineorder AS l
INNER JOIN customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN part AS p ON p.P_PARTKEY = l.LO_PARTKEY;

Q1 — Total revenue from discounted items in 1993 (quantity < 25, discount 1–3%)

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toYear(LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;

Q2 — Total revenue from discounted items in January 1994 (quantity 26–35, discount 4–6%)

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toYYYYMM(LO_ORDERDATE) = 199401 AND LO_DISCOUNT BETWEEN 4 AND 6 AND LO_QUANTITY BETWEEN 26 AND 35;

Q3 — Total revenue from discounted items in week 6 of 1994 (quantity 26–35, discount 5–7%)

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toISOWeek(LO_ORDERDATE) = 6 AND toYear(LO_ORDERDATE) = 1994 
  AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;

Q4 — Annual revenue by brand for category MFGR#12 in the Americas

SELECT
    sum(LO_REVENUE),
    toYear(LO_ORDERDATE) AS year,
    P_BRAND
FROM lineorder_flat
WHERE P_CATEGORY = 'MFGR#12' AND S_REGION = 'AMERICA'
GROUP BY
    year,
    P_BRAND
ORDER BY
    year,
    P_BRAND;

Q5 — Annual revenue by brand MFGR#2221–MFGR#2228 in Asia

SELECT
    sum(LO_REVENUE),
    toYear(LO_ORDERDATE) AS year,
    P_BRAND
FROM lineorder_flat
WHERE P_BRAND >= 'MFGR#2221' AND P_BRAND <= 'MFGR#2228' AND S_REGION = 'ASIA'
GROUP BY
    year,
    P_BRAND
ORDER BY
    year,
    P_BRAND;

Q6 — Annual revenue for brand MFGR#2239 in Europe

SELECT
    sum(LO_REVENUE),
    toYear(LO_ORDERDATE) AS year,
    P_BRAND
FROM lineorder_flat
WHERE P_BRAND = 'MFGR#2239' AND S_REGION = 'EUROPE'
GROUP BY
    year,
    P_BRAND
ORDER BY
    year,
    P_BRAND;

Q7 — Revenue by customer and supplier nation within Asia (1992–1997)

SELECT
    C_NATION,
    S_NATION,
    toYear(LO_ORDERDATE) AS year,
    sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE C_REGION = 'ASIA' AND S_REGION = 'ASIA' AND year >= 1992 AND year <= 1997
GROUP BY
    C_NATION,
    S_NATION,
    year
ORDER BY
    year ASC,
    revenue DESC;

Q8 — Revenue by city for United States customers and suppliers (1992–1997)

SELECT
    C_CITY,
    S_CITY,
    toYear(LO_ORDERDATE) AS year,
    sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE C_NATION = 'UNITED STATES' AND S_NATION = 'UNITED STATES' AND year >= 1992 AND year <= 1997
GROUP BY
    C_CITY,
    S_CITY,
    year
ORDER BY
    year ASC,
    revenue DESC;

Q9 — Revenue by city for specific UK cities (1992–1997)

SELECT
    C_CITY,
    S_CITY,
    toYear(LO_ORDERDATE) AS year,
    sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND year >= 1992 AND year <= 1997
GROUP BY
    C_CITY,
    S_CITY,
    year
ORDER BY
    year ASC,
    revenue DESC;

Q10 — Revenue by city for specific UK cities in December 1997

SELECT
    C_CITY,
    S_CITY,
    toYear(LO_ORDERDATE) AS year,
    sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND toYYYYMM(LO_ORDERDATE) = 199712
GROUP BY
    C_CITY,
    S_CITY,
    year
ORDER BY
    year ASC,
    revenue DESC;

Q11 — Annual profit by customer nation in the Americas (manufacturers MFGR#1 and MFGR#2)

SELECT
    toYear(LO_ORDERDATE) AS year,
    C_NATION,
    sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
    year,
    C_NATION
ORDER BY
    year ASC,
    C_NATION ASC;

Q12 — Profit by supplier nation and product category in the Americas (1997–1998, manufacturers MFGR#1 and MFGR#2)

SELECT
    toYear(LO_ORDERDATE) AS year,
    S_NATION,
    P_CATEGORY,
    sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (year = 1997 OR year = 1998) AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
    year,
    S_NATION,
    P_CATEGORY
ORDER BY
    year ASC,
    S_NATION ASC,
    P_CATEGORY ASC;

Q13 — Profit by city and brand in the United States (1997–1998, category MFGR#14)

SELECT
    toYear(LO_ORDERDATE) AS year,
    S_CITY,
    P_BRAND,
    sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE S_NATION = 'UNITED STATES' AND (year = 1997 OR year = 1998) AND P_CATEGORY = 'MFGR#14'
GROUP BY
    year,
    S_CITY,
    P_BRAND
ORDER BY
    year ASC,
    S_CITY ASC,
    P_BRAND ASC;