文档

MySQL分区表写入最佳实践

更新时间:

为了在您使用MySQL处理庞大数据集时优化写入查询性能,本文详细介绍了DataWorks数据集成将数据写入MySQL分区表的最佳实践。

前提条件

已具备MySQL数据库,并在DataWorks中添加为数据源。具体操作,请参见MySQL数据源

准备工作

  1. 创建源表,用于DataWorks读取源数据。

    CREATE TABLE `user_tb1` (
      `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id',
      `name` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '名字',
      `update_time` datetime DEFAULT NULL,
      PRIMARY KEY (`id`)
    );
    INSERT INTO user_tb1 (name, update_time) VALUES ('u1', '2023-12-28 09:52:05');
    INSERT INTO user_tb1 (name, update_time) VALUES ('u2', '2023-12-25 09:52:05');
    INSERT INTO user_tb1 (name, update_time) VALUES ('u3', '2023-12-26 09:52:05');
    INSERT INTO user_tb1 (name, update_time) VALUES ('u3', '2023-12-27 09:52:05');
    INSERT INTO user_tb1 (name, update_time) VALUES ('u4', '2023-12-28 09:52:05');
    INSERT INTO user_tb1 (name, update_time) VALUES ('u5', '2023-12-28 09:52:05');
    INSERT INTO user_tb1 (name, update_time) VALUES ('u6', '2023-12-28 09:52:05');
    INSERT INTO user_tb1 (name, update_time) VALUES ('u7', '2023-12-28 09:52:05');
  2. 创建目标表,用于将数据写入分区表。

    本文以RANGE分区表和KEY分区表为例,需要分别创建如下表:

    • RANGE分区表

      CREATE TABLE `pt_write_test_tb1` (
        `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id',
        `name` varchar(100)  COMMENT '名字',
        `update_time` datetime  COMMENT '更新时间',
        PRIMARY KEY (`id`,`update_time`),
        KEY `idx_update_time` (`update_time`)
      )
      PARTITION BY RANGE COLUMNS(update_time) (
       PARTITION pt20231216 VALUES LESS THAN ('20231216'),
       PARTITION pt20231217 VALUES LESS THAN ('20231217'),
       PARTITION pt20231218 VALUES LESS THAN ('20231218'),
       PARTITION pt20231219 VALUES LESS THAN ('20231219'),
       PARTITION pt20231220 VALUES LESS THAN ('20231220'),
       PARTITION pt20231221 VALUES LESS THAN ('20231221'),
       PARTITION pt20231222 VALUES LESS THAN ('20231222'),
       PARTITION pt20231223 VALUES LESS THAN ('20231223'),
       PARTITION pt20231224 VALUES LESS THAN ('20231224'),
       PARTITION pt20231225 VALUES LESS THAN ('20231225'),
       PARTITION pt20231226 VALUES LESS THAN ('20231226'),
       PARTITION pt20231227 VALUES LESS THAN ('20231227'),
       PARTITION pt20231228 VALUES LESS THAN ('20231228'),
       PARTITION pt20231229 VALUES LESS THAN ('20231229')
      );
    • KEY分区表

      CREATE TABLE `pt_write_test_tb2` (
        `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id',
        `name` varchar(100)  COMMENT '名字',
        `update_time` datetime  COMMENT '更新时间',
        PRIMARY KEY (`id`)
      )
      PARTITION BY KEY(`id`)
      PARTITIONS 16;

分区表写入

RANGE分区表写入

本示例通过日期update_time字段作为RANGE分区条件,演示如何把数据写入MySQL分区表。

  1. 配置离线同步任务。

    配置要点:

    • 数据来源配置为已创建的源表user_tb1

    • 数据去向配置为已创建的目标RANGE分区表pt_write_test_tb1

    image

    说明

    您也可以通过数据去向导入前准备语句配置项,让同步任务在写入数据之前,先创建当天的分区:

    1. 准备一个调度参数,例如pt_date=$[yyyymmdd]

    2. 导入前准备语句中配置如下SQL:

      ALTER TABLE `pt_write_test_tb1` 
      ADD PARTITION (PARTITION pt${pt_date} VALUES LESS THAN ('${pt_date}'));
  2. 查询目标表中被写入分区表的数据。

    • 查询整表数据:

      SELECT * FROM pt_write_test_tb1;

      返回示例:

      +----+------+---------------------+
      | id | name | update_time         |
      +----+------+---------------------+
      |  2 | u2   | 2023-12-25 09:52:05 |
      |  3 | u3   | 2023-12-26 09:52:05 |
      |  4 | u3   | 2023-12-27 09:52:05 |
      |  1 | u1   | 2023-12-28 09:52:05 |
      |  5 | u4   | 2023-12-28 09:52:05 |
      |  6 | u5   | 2023-12-28 09:52:05 |
      |  7 | u6   | 2023-12-28 09:52:05 |
      |  8 | u7   | 2023-12-28 09:52:05 |
      +----+------+---------------------+
      8 rows in set (0.01 sec)
    • 查询单分区:

      SELECT * FROM pt_write_test_tb1 partition (pt20231229);

      返回示例:

      +----+------+---------------------+
      | id | name | update_time         |
      +----+------+---------------------+
      |  1 | u1   | 2023-12-28 09:52:05 |
      |  5 | u4   | 2023-12-28 09:52:05 |
      |  6 | u5   | 2023-12-28 09:52:05 |
      |  7 | u6   | 2023-12-28 09:52:05 |
      |  8 | u7   | 2023-12-28 09:52:05 |
      +----+------+---------------------+
      5 rows in set (0.00 sec)

KEY分区表写入

本示例通过主键id字段作为KEY分区条件,演示如何把数据写入MySQL分区表。

  1. 配置离线同步任务。

    配置要点:

    • 数据来源配置为已创建的源表user_tb1

    • 数据去向配置为已创建的目标KEY分区表pt_write_test_tb2

    image

  2. 查询目标表中被写入分区表的数据。

    • 查询整表数据:

      select * from pt_write_test_tb2 order by id;

      返回示例:

      +----+------+---------------------+
      | id | name | update_time         |
      +----+------+---------------------+
      |  1 | u1   | 2023-12-28 09:52:05 |
      |  2 | u2   | 2023-12-25 09:52:05 |
      |  3 | u3   | 2023-12-26 09:52:05 |
      |  4 | u3   | 2023-12-27 09:52:05 |
      |  5 | u4   | 2023-12-28 09:52:05 |
      |  6 | u5   | 2023-12-28 09:52:05 |
      |  7 | u6   | 2023-12-28 09:52:05 |
      |  8 | u7   | 2023-12-28 09:52:05 |
      +----+------+---------------------+
      8 rows in set (0.00 sec)
    • 查看各个分区的数据:

      SELECT
      	TABLE_NAME ,
      	PARTITION_NAME ,
      	TABLE_ROWS,
      	UPDATE_TIME
      FROM
      	information_schema.partitions
      WHERE
      	TABLE_NAME = 'pt_write_test_tb2';

      返回示例:

      +-------------------+----------------+------------+---------------------+
      | TABLE_NAME        | PARTITION_NAME | TABLE_ROWS | UPDATE_TIME         |
      +-------------------+----------------+------------+---------------------+
      | pt_write_test_tb2 | p0             |          0 | NULL                |
      | pt_write_test_tb2 | p1             |          0 | NULL                |
      | pt_write_test_tb2 | p10            |          0 | NULL                |
      | pt_write_test_tb2 | p11            |          1 | 2024-02-29 14:49:14 |
      | pt_write_test_tb2 | p12            |          0 | NULL                |
      | pt_write_test_tb2 | p13            |          0 | NULL                |
      | pt_write_test_tb2 | p14            |          1 | 2024-02-29 14:49:15 |
      | pt_write_test_tb2 | p15            |          1 | 2024-02-29 14:49:16 |
      | pt_write_test_tb2 | p2             |          1 | 2024-02-29 14:49:17 |
      | pt_write_test_tb2 | p3             |          0 | NULL                |
      | pt_write_test_tb2 | p4             |          1 | 2024-02-29 14:49:14 |
      | pt_write_test_tb2 | p5             |          1 | 2024-02-29 14:49:15 |
      | pt_write_test_tb2 | p6             |          0 | NULL                |
      | pt_write_test_tb2 | p7             |          0 | NULL                |
      | pt_write_test_tb2 | p8             |          1 | 2024-02-29 14:49:16 |
      | pt_write_test_tb2 | p9             |          1 | 2024-02-29 14:49:17 |
      +-------------------+----------------+------------+---------------------+
      16 rows in set (0.01 sec)