导入数据
本文介绍如何构建TPC-H测试数据,并将TPC-H 1000GB测试数据分别导入AnalyticDB for MySQL中。
本文的TPC-H的实现基于TPC-H的基准测试,并不能与已发布的TPC-H基准测试结果相比较,本文中的测试并不符合TPC-H基准测试的所有要求。
下表列出了TPC-H测试数据集中的表数据条数。
表名 | 数据条数 |
customer | 1.5亿 |
lineitem | 60亿 |
nation | 25 |
orders | 15亿 |
part | 2亿 |
partsupp | 8亿 |
region | 5 |
supplier | 1000万 |
OSS外表导入(推荐)
以下流程仅适用于企业版、基础版及湖仓版集群。
创建外部数据库。
CREATE EXTERNAL DATABASE IF NOT EXISTS external_tpch;
创建八张外表。
说明AnalyticDB for MySQL提供了TPC-H测试数据存放的OSS路径。您需要根据集群所在地域,替换
LOCATION
参数指定的OSS路径。CREATE EXTERNAl TABLE external_tpch.nation ( n_nationkey int NOT NULL COMMENT '', n_name varchar NOT NULL COMMENT '', n_regionkey int NOT NULL COMMENT '', n_comment varchar COMMENT '', `dummy` varchar ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dataset-cn-beijing-external/TPC-H/1TB/nation/'; CREATE EXTERNAl TABLE external_tpch.customer ( c_custkey int NOT NULL COMMENT '', c_name varchar NOT NULL COMMENT '', c_address varchar NOT NULL COMMENT '', c_nationkey int NOT NULL COMMENT '', c_phone varchar(15) NOT NULL COMMENT '', c_acctbal double NOT NULL COMMENT '', c_mktsegment varchar(10) NOT NULL COMMENT '', c_comment varchar NOT NULL COMMENT '', `dummy` varchar ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dataset-cn-beijing-external/TPC-H/1TB/customer/'; CREATE EXTERNAl TABLE external_tpch.lineitem ( l_orderkey bigint NOT NULL COMMENT '', l_partkey int NOT NULL COMMENT '', l_suppkey int NOT NULL COMMENT '', l_linenumber int NOT NULL COMMENT '', l_quantity double NOT NULL COMMENT '', l_extendedprice double NOT NULL COMMENT '', l_discount double NOT NULL COMMENT '', l_tax double NOT NULL COMMENT '', l_returnflag varchar(1) NOT NULL COMMENT '', l_linestatus varchar(1) NOT NULL COMMENT '', l_shipdate date NOT NULL COMMENT '', l_commitdate date NOT NULL COMMENT '', l_receiptdate date NOT NULL COMMENT '', l_shipinstruct varchar(25) NOT NULL COMMENT '', l_shipmode varchar(10) NOT NULL COMMENT '', l_comment varchar NOT NULL COMMENT '', dummy varchar ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dataset-cn-beijing-external/TPC-H/1TB/lineitem/'; CREATE EXTERNAl TABLE external_tpch.orders ( o_orderkey bigint NOT NULL COMMENT '', o_custkey int NOT NULL COMMENT '', o_orderstatus varchar(1) NOT NULL COMMENT '', o_totalprice double NOT NULL COMMENT '', o_orderdate date NOT NULL COMMENT '', o_orderpriority varchar(15) NOT NULL COMMENT '', o_clerk varchar(15) NOT NULL COMMENT '', o_shippriority int NOT NULL COMMENT '', o_comment varchar NOT NULL COMMENT '', dummy varchar ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dataset-cn-beijing-external/TPC-H/1TB/orders/'; CREATE EXTERNAl TABLE external_tpch.part ( p_partkey int NOT NULL COMMENT '', p_name varchar NOT NULL COMMENT '', p_mfgr varchar NOT NULL COMMENT '', p_brand varchar(10) NOT NULL COMMENT '', p_type varchar NOT NULL COMMENT '', p_size int NOT NULL COMMENT '', p_container varchar(10) NOT NULL COMMENT '', p_retailprice double NOT NULL COMMENT '', p_comment varchar NOT NULL COMMENT '', dummy varchar ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dataset-cn-beijing-external/TPC-H/1TB/part/'; CREATE EXTERNAl TABLE external_tpch.partsupp ( ps_partkey int NOT NULL COMMENT '', ps_suppkey int NOT NULL COMMENT '', ps_availqty int NOT NULL COMMENT '', ps_supplycost double NOT NULL COMMENT '', ps_comment varchar NOT NULL COMMENT '', dummy varchar ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dataset-cn-beijing-external/TPC-H/1TB/partsupp/'; CREATE EXTERNAl TABLE external_tpch.region ( r_regionkey int NOT NULL COMMENT '', r_name varchar NOT NULL COMMENT '', r_comment varchar COMMENT '', dummy varchar ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dataset-cn-beijing-external/TPC-H/1TB/region/'; CREATE EXTERNAl TABLE external_tpch.supplier ( s_suppkey int NOT NULL COMMENT '', s_name varchar NOT NULL COMMENT '', s_address varchar NOT NULL COMMENT '', s_nationkey int NOT NULL COMMENT '', s_phone varchar(15) NOT NULL COMMENT '', s_acctbal double NOT NULL COMMENT '', s_comment varchar NOT NULL COMMENT '', dummy varchar ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dataset-cn-beijing-external/TPC-H/1TB/supplier/';
将外表中的数据导入构建测试表文档所创建的内表中。
INSERT OVERWRITE INTO customer SELECT * FROM external_tpch.customer; INSERT OVERWRITE INTO lineitem SELECT * FROM external_tpch.lineitem; INSERT INTO nation SELECT * FROM external_tpch.nation; INSERT OVERWRITE INTO orders SELECT * FROM external_tpch.orders; INSERT OVERWRITE INTO part SELECT * FROM external_tpch.part; INSERT OVERWRITE INTO partsupp SELECT * FROM external_tpch.partsupp; INSERT INTO region SELECT * FROM external_tpch.region; INSERT OVERWRITE INTO supplier SELECT * FROM external_tpch.supplier;
收集统计信息。
ANALYZE TABLE customer UPDATE HISTOGRAM; ANALYZE TABLE lineitem UPDATE HISTOGRAM; ANALYZE TABLE nation UPDATE HISTOGRAM; ANALYZE TABLE orders UPDATE HISTOGRAM; ANALYZE TABLE part UPDATE HISTOGRAM; ANALYZE TABLE partsupp UPDATE HISTOGRAM; ANALYZE TABLE region UPDATE HISTOGRAM; ANALYZE TABLE supplier UPDATE HISTOGRAM;
说明查询优化器(Query Optimizer)将查询转换为执行计划,并交给执行引擎执行。执行计划的质量会影响查询的性能。统计信息作为查询优化器的输入,可以帮助查询优化器生成高质量的执行计划。因此,在导入数据后,您需收集所有表的直方图信息,以获得最佳性能。若您想了解统计信息的更多内容,请参见统计信息。
LOAD DATA导入
从TPC官网下载TPC-H标准的数据生成工具DBGEN,编译后生成二进制可执行文件dbgen。
./dbgen -s $scale -C $chunks -S $i -f
-s
:指定scale值,例如100GB时scale值为100,1TB时scale值为1000。-C
:一共分成几个chunk。-S
::当前命令生成第几个 chunk。说明一条语句只能生成一个 chunk。
更多dbgen使用方法请参见tpch-dbgen。
使用LOAD DATA将dbgen生成的文件导入至AnalyticDB for MySQL中。
本示例以Windows操作系统为例,换行符为
\r\n
。LOAD DATA LOCAL INFILE 'customer.tbl' INTO TABLE CUSTOMER FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n'; LOAD DATA LOCAL INFILE 'orders.tbl' INTO TABLE ORDERS FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n'; LOAD DATA LOCAL INFILE 'lineitem.tbl' INTO TABLE LINEITEM FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n'; LOAD DATA LOCAL INFILE 'nation.tbl' INTO TABLE NATION FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n'; LOAD DATA LOCAL INFILE 'partsupp.tbl' INTO TABLE PARTSUPP FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n'; LOAD DATA LOCAL INFILE 'part.tbl' INTO TABLE PART FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n'; LOAD DATA LOCAL INFILE 'region.tbl' INTO TABLE REGION FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n'; LOAD DATA LOCAL INFILE 'supplier.tbl' INTO TABLE SUPPLIER FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n';
收集统计信息。
ANALYZE TABLE customer UPDATE HISTOGRAM; ANALYZE TABLE lineitem UPDATE HISTOGRAM; ANALYZE TABLE nation UPDATE HISTOGRAM; ANALYZE TABLE orders UPDATE HISTOGRAM; ANALYZE TABLE part UPDATE HISTOGRAM; ANALYZE TABLE partsupp UPDATE HISTOGRAM; ANALYZE TABLE region UPDATE HISTOGRAM; ANALYZE TABLE supplier UPDATE HISTOGRAM;
说明查询优化器(Query Optimizer)将查询转换为执行计划,并交给执行引擎执行。执行计划的质量会影响查询的性能。统计信息作为查询优化器的输入,可以帮助查询优化器生成高质量的执行计划。因此,在导入数据后,您需收集所有表的直方图信息,以获得最佳性能。若您想了解统计信息的更多内容,请参见统计信息。