为了在您使用MySQL处理庞大数据集时优化写入查询性能,本文详细介绍了DataWorks数据集成将数据写入MySQL分区表的最佳实践。
前提条件
已具备MySQL数据库,并在DataWorks中添加为数据源。具体操作,请参见MySQL数据源。
准备工作
创建源表,用于DataWorks读取源数据。
CREATE TABLE `user_tb1` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id', `name` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '名字', `update_time` datetime DEFAULT NULL, PRIMARY KEY (`id`) ); INSERT INTO user_tb1 (name, update_time) VALUES ('u1', '2023-12-28 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u2', '2023-12-25 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u3', '2023-12-26 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u3', '2023-12-27 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u4', '2023-12-28 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u5', '2023-12-28 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u6', '2023-12-28 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u7', '2023-12-28 09:52:05');
创建目标表,用于将数据写入分区表。
本文以RANGE分区表和KEY分区表为例,需要分别创建如下表:
RANGE分区表
CREATE TABLE `pt_write_test_tb1` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id', `name` varchar(100) COMMENT '名字', `update_time` datetime COMMENT '更新时间', PRIMARY KEY (`id`,`update_time`), KEY `idx_update_time` (`update_time`) ) PARTITION BY RANGE COLUMNS(update_time) ( PARTITION pt20231216 VALUES LESS THAN ('20231216'), PARTITION pt20231217 VALUES LESS THAN ('20231217'), PARTITION pt20231218 VALUES LESS THAN ('20231218'), PARTITION pt20231219 VALUES LESS THAN ('20231219'), PARTITION pt20231220 VALUES LESS THAN ('20231220'), PARTITION pt20231221 VALUES LESS THAN ('20231221'), PARTITION pt20231222 VALUES LESS THAN ('20231222'), PARTITION pt20231223 VALUES LESS THAN ('20231223'), PARTITION pt20231224 VALUES LESS THAN ('20231224'), PARTITION pt20231225 VALUES LESS THAN ('20231225'), PARTITION pt20231226 VALUES LESS THAN ('20231226'), PARTITION pt20231227 VALUES LESS THAN ('20231227'), PARTITION pt20231228 VALUES LESS THAN ('20231228'), PARTITION pt20231229 VALUES LESS THAN ('20231229') );
KEY分区表
CREATE TABLE `pt_write_test_tb2` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id', `name` varchar(100) COMMENT '名字', `update_time` datetime COMMENT '更新时间', PRIMARY KEY (`id`) ) PARTITION BY KEY(`id`) PARTITIONS 16;
分区表写入
RANGE分区表写入
本示例通过日期update_time
字段作为RANGE分区条件,演示如何把数据写入MySQL分区表。
配置离线同步任务。
配置要点:
数据来源的表配置为已创建的源表
user_tb1
。数据去向的表配置为已创建的目标RANGE分区表
pt_write_test_tb1
。
说明您也可以通过数据去向的导入前准备语句配置项,让同步任务在写入数据之前,先创建当天的分区:
准备一个调度参数,例如
pt_date=$[yyyymmdd]
。在导入前准备语句中配置如下SQL:
ALTER TABLE `pt_write_test_tb1` ADD PARTITION (PARTITION pt${pt_date} VALUES LESS THAN ('${pt_date}'));
查询目标表中被写入分区表的数据。
查询整表数据:
SELECT * FROM pt_write_test_tb1;
返回示例:
+----+------+---------------------+ | id | name | update_time | +----+------+---------------------+ | 2 | u2 | 2023-12-25 09:52:05 | | 3 | u3 | 2023-12-26 09:52:05 | | 4 | u3 | 2023-12-27 09:52:05 | | 1 | u1 | 2023-12-28 09:52:05 | | 5 | u4 | 2023-12-28 09:52:05 | | 6 | u5 | 2023-12-28 09:52:05 | | 7 | u6 | 2023-12-28 09:52:05 | | 8 | u7 | 2023-12-28 09:52:05 | +----+------+---------------------+ 8 rows in set (0.01 sec)
查询单分区:
SELECT * FROM pt_write_test_tb1 partition (pt20231229);
返回示例:
+----+------+---------------------+ | id | name | update_time | +----+------+---------------------+ | 1 | u1 | 2023-12-28 09:52:05 | | 5 | u4 | 2023-12-28 09:52:05 | | 6 | u5 | 2023-12-28 09:52:05 | | 7 | u6 | 2023-12-28 09:52:05 | | 8 | u7 | 2023-12-28 09:52:05 | +----+------+---------------------+ 5 rows in set (0.00 sec)
KEY分区表写入
本示例通过主键id
字段作为KEY分区条件,演示如何把数据写入MySQL分区表。
配置离线同步任务。
配置要点:
数据来源的表配置为已创建的源表
user_tb1
。数据去向的表配置为已创建的目标KEY分区表
pt_write_test_tb2
。
查询目标表中被写入分区表的数据。
查询整表数据:
select * from pt_write_test_tb2 order by id;
返回示例:
+----+------+---------------------+ | id | name | update_time | +----+------+---------------------+ | 1 | u1 | 2023-12-28 09:52:05 | | 2 | u2 | 2023-12-25 09:52:05 | | 3 | u3 | 2023-12-26 09:52:05 | | 4 | u3 | 2023-12-27 09:52:05 | | 5 | u4 | 2023-12-28 09:52:05 | | 6 | u5 | 2023-12-28 09:52:05 | | 7 | u6 | 2023-12-28 09:52:05 | | 8 | u7 | 2023-12-28 09:52:05 | +----+------+---------------------+ 8 rows in set (0.00 sec)
查看各个分区的数据:
SELECT TABLE_NAME , PARTITION_NAME , TABLE_ROWS, UPDATE_TIME FROM information_schema.partitions WHERE TABLE_NAME = 'pt_write_test_tb2';
返回示例:
+-------------------+----------------+------------+---------------------+ | TABLE_NAME | PARTITION_NAME | TABLE_ROWS | UPDATE_TIME | +-------------------+----------------+------------+---------------------+ | pt_write_test_tb2 | p0 | 0 | NULL | | pt_write_test_tb2 | p1 | 0 | NULL | | pt_write_test_tb2 | p10 | 0 | NULL | | pt_write_test_tb2 | p11 | 1 | 2024-02-29 14:49:14 | | pt_write_test_tb2 | p12 | 0 | NULL | | pt_write_test_tb2 | p13 | 0 | NULL | | pt_write_test_tb2 | p14 | 1 | 2024-02-29 14:49:15 | | pt_write_test_tb2 | p15 | 1 | 2024-02-29 14:49:16 | | pt_write_test_tb2 | p2 | 1 | 2024-02-29 14:49:17 | | pt_write_test_tb2 | p3 | 0 | NULL | | pt_write_test_tb2 | p4 | 1 | 2024-02-29 14:49:14 | | pt_write_test_tb2 | p5 | 1 | 2024-02-29 14:49:15 | | pt_write_test_tb2 | p6 | 0 | NULL | | pt_write_test_tb2 | p7 | 0 | NULL | | pt_write_test_tb2 | p8 | 1 | 2024-02-29 14:49:16 | | pt_write_test_tb2 | p9 | 1 | 2024-02-29 14:49:17 | +-------------------+----------------+------------+---------------------+ 16 rows in set (0.01 sec)
文档内容是否对您有帮助?