Synchronize MySQL data to Tablestore with DataX

更新时间:
复制 MD 格式

This topic describes how to use DataX to synchronize data from a MySQL database to Tablestore.

Background

DataX is an offline data synchronization tool from Alibaba Cloud. It connects to a MySQL database over JDBC to retrieve data by using SQL statements. DataX then caches the data in the local JVM and uses writer threads to write it to a Tablestore data table. For more information about DataX, see DataX.

Prerequisites

  • Prepare the required MySQL information, including the username, password, and JDBC connection information.

  • Activate Tablestore, and create an instance and a data table to store the synchronized data. For details, see Activate Tablestore and create an instance and Create a data table.

    Important

    Use the original primary key or a unique index from MySQL as the primary key for the Tablestore data table. For the sample MySQL table and Tablestore data table used in this topic, see Appendix: Sample data.

  • Obtain the endpoint and name of your Tablestore instance.

    1. Log on to the Tablestore console.

    2. In the top navigation bar, select a resource group and a region.

    3. On the Overview page, click the instance alias or click Instance Management in the Actions column.

    4. On the Instance Details tab, view the instance name and endpoint.

  • Obtain an AccessKey pair. You can use the AccessKey pair of an Alibaba Cloud account or a RAM user for configuration. For details on how to obtain an AccessKey pair, see Obtain an AccessKey pair.

    Important

    For security reasons, use a RAM user to access Tablestore. You can create a RAM user, grant the AliyunOTSFullAccess permission to the user, and create an AccessKey pair for the RAM user. For details, see Use the AccessKey pair of a RAM user to access Tablestore.

Procedure

The procedures in this topic are performed on an ECS instance that runs Alibaba Cloud Linux 3.2104 LTS 64-bit or Ubuntu 22.04 64-bit.

Step 1: Install dependencies

  1. Install Python (Python 2 or Python 3 is supported).

    Alibaba Cloud Linux and Ubuntu on ECS come with Python 3 pre-installed. If you use a different server, you must install Python yourself.

  2. Install JDK (1.8 or later). JDK 1.8 is recommended.

    The following commands show how to install JDK 1.8 on the specified systems. If you use a different server, you must install the JDK yourself.

    Alibaba Cloud Linux

    yum -y install java-1.8.0-openjdk-devel.x86_64

    Ubuntu

    apt update && apt upgrade
    apt install openjdk-8-jdk

Step 2: Install DataX

  1. Download the DataX package.

    wget https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202309/datax.tar.gz
  2. Decompress the package.

    tar -zxvf datax.tar.gz

If you need to build DataX from source, see the DataX Installation Guide.

Step 3: Create a configuration file

  1. Go to the bin directory of DataX.

    cd datax/bin
  2. Create a configuration file named mysql_to_ots.json.

    vi mysql_to_ots.json

    Refer to the following configuration sample. The MySQL reader supports two data reading modes. Select a mode based on your business requirements.

    • QuerySQL mode: Retrieves data by running an SQL query. This mode supports table joins.

    • Table mode: Defines the data to be exported by specifying a table name, column names, and a WHERE clause. DataX automatically assembles an SQL statement to extract the data. This mode also supports parallel synchronization through data sharding.

    QuerySQL mode

    Replace the placeholder values in the configuration file with your specific information.

    {
      "job": {
        "setting": {
          "speed": {
            "channel": 1
          },
          "errorLimit": {
            "record": 0,
            "percentage": 0
          }
        },
        "content": [
          {
            "reader": {
              "name": "mysqlreader",
              "parameter": {
                "username": "mysql_username",
                "password": "mysql_password",
                "connection": [
                  {
                    "querySql": [
                      "select * from table_name"
                    ],
                    "jdbcUrl": [
                      "jdbc:mysql://server_ip:3306/database_name?useSSL=false"
                    ]
                  }
                ]
              }
            },
            "writer": {
              "name": "otswriter",
              "parameter": {
                "endpoint":"endpoint",
                "accessId":"accesskey_id",
                "accessKey":"accesskey_secret",
                "instanceName":"instance_name",
                "table":"table_name",
                "primaryKey":[
                  {"name":"order_id", "type":"string"}
                ],
                "column":[
                  {"name":"user_id","type":"string"},
                  {"name":"sku_id","type":"string"},
                  {"name":"price","type":"double"},
                  {"name":"num","type":"int"},
                  {"name":"total_price","type":"double"},
                  {"name":"order_status","type":"string"},
                  {"name":"create_time","type":"string"},
                  {"name":"modified_time","type":"string"}
                ],
                "writeMode":"UpdateRow"
              }
            }
          }
        ]
      }
    }

    The following table describes the MySQL Reader parameters that you must replace in the configuration.

    Parameter

    Description

    username

    The username for the MySQL database connection.

    password

    The password for the MySQL user.

    querySql

    The SQL query that defines the scope of data to synchronize.

    jdbcUrl

    The JDBC connection information.

    For more information about MySQL Reader, see MySQL Reader.

    The following table describes the Tablestore Writer parameters that you must replace in the configuration.

    Parameter

    Description

    endpoint

    The endpoint of the instance.

    accessId

    The AccessKey ID of your Alibaba Cloud account or RAM user.

    accessKey

    The AccessKey secret of your Alibaba Cloud account or RAM user.

    instanceName

    The name of the instance.

    table

    The name of the target table.

    primaryKey

    The list of primary key columns for the target table.

    column

    The attribute columns to be written.

    For more information about Tablestore Writer, see Tablestore Writer.

    Table mode

    Replace the placeholder values in the configuration file with your specific information.

    {
      "job": {
        "setting": {
          "speed": {
            "channel": 1
          },
          "errorLimit": {
            "record": 0,
            "percentage": 0.02
          }
        },
        "content": [
          {
            "reader": {
              "name": "mysqlreader",
              "parameter": {
                "username": "mysql_username",
                "password": "mysql_password",
                "column": [
                  "order_id",
                  "user_id" ,
                  "sku_id" ,
                  "price",
                  "num",
                  "total_price",
                  "order_status",
                  "create_time",
                  "modified_time"
                ],
                "splitPk": "num",
                "connection": [
                  {
                    "table": [
                      "table_name"
                    ],
                    "jdbcUrl": [
                      "jdbc:mysql://server_ip:3306/database_name?useSSL=false"
                    ]
                  }
                ]
              }
            },
            "writer": {
              "name": "otswriter",
              "parameter": {
                "endpoint":"endpoint",
                "accessId":"accesskey_id",
                "accessKey":"accesskey_secret",
                "instanceName":"instance_name",
                "table":"table_name",
                "primaryKey":[
                  {"name":"order_id", "type":"string"}
                ],
                "column":[
                  {"name":"user_id","type":"string"},
                  {"name":"sku_id","type":"string"},
                  {"name":"price","type":"double"},
                  {"name":"num","type":"int"},
                  {"name":"total_price","type":"double"},
                  {"name":"order_status","type":"string"},
                  {"name":"create_time","type":"string"},
                  {"name":"modified_time","type":"string"}
                ],
                "writeMode":"UpdateRow"
              }
            }
          }
        ]
      }
    }

    The following table describes the MySQL Reader parameters that you must replace in the configuration.

    Parameter

    Description

    username

    The username for the MySQL database connection.

    password

    The password for the MySQL user.

    column

    The columns of the source table to synchronize.

    splitPk

    The column used for data sharding. DataX uses this column to split the dataset for parallel processing only when the channel value is greater than 1.

    Note

    For best performance, shard data on an integer-based column.

    table

    The name of the MySQL source table.

    jdbcUrl

    The JDBC connection information.

    For more information about MySQL Reader, see MySQL Reader.

    The following table describes the Tablestore Writer parameters that you must replace in the configuration.

    Parameter

    Description

    endpoint

    The endpoint of the instance.

    accessId

    The AccessKey ID of your Alibaba Cloud account or RAM user.

    accessKey

    The AccessKey secret of your Alibaba Cloud account or RAM user.

    instanceName

    The name of the instance.

    table

    The name of the target table.

    primaryKey

    The list of primary key columns for the target table.

    column

    The attribute columns to be written.

    For more information about Tablestore Writer, see Tablestore Writer.

Step 4: Run the synchronization command

  1. Run the following command to start data synchronization.

    python3 datax.py mysql_to_ots.json

    After the synchronization job is complete, DataX prints a summary of the job status.

    2025-02-10 18:02:17.355 [job-0] INFO  JobContainer -
    Job start time                : 2025-02-10 18:02:06
    Job end time                  : 2025-02-10 18:02:17
    Total time elapsed            :                 11s
    Average throughput            :          871.56KB/s
    Record write speed            :          10000rec/s
    Total records read            :              100000
    Total read/write failures     :                   0
  2. Verify the result.

    View the imported data in the Tablestore console.

Related documents

  • For the data types supported by Tablestore attribute columns, see Data types.

Appendix: Sample data

MySQL source table

The following CREATE TABLE statement is for the sample MySQL source table:

create table orders (
  order_id varchar(50) primary key comment 'Order ID',
  user_id varchar(10) not null comment 'User ID',
  sku_id varchar(10) not null comment 'SKU ID',
  price decimal(12, 2) not null comment 'Unit price',
  num int not null comment 'Quantity',
  total_price decimal(12, 2) not null comment 'Total order price',
  order_status varchar(2) not null comment 'Order status',
  create_time timestamp not null default current_timestamp comment 'Order creation time',
  modified_time timestamp not null default current_timestamp on update current_timestamp comment 'Last modified time'
);

Tablestore target table

Because Tablestore is schema-free, you only need to specify the primary key order_id when you create the target table.

For this example, the order_id primary key is a STRING that also serves as the partition key. The table is configured with Max Versions set to 1, a data lifecycle of Never Expire, no predefined attribute columns, and a disabled search index.