Tapdata Cloud: Import MySQL to Table Store
Tapdata Cloud's visual interface lets you synchronize data from self-hosted MySQL, Alibaba Cloud RDS MySQL, and third-party cloud databases to Tablestore in real time.
Prerequisites
Before you use Tapdata Cloud to sync data from MySQL to Table Store, ensure you have the following:
-
A Table Store instance and a data table. For more information, see Create an instance and Create a data table.
-
A RAM user with the
AliyunOTSFullAccesspermission to manage Table Store. For more information, see Use an access key of a RAM user to access Table Store.
Background
Tapdata Cloud is a real-time data service that integrates data replication and data development. It provides millisecond-level real-time data synchronization and data fusion across clouds, regions, and data sources. For more information, see What is Tapdata.
Notes
Data synchronization supports MySQL versions 5.0, 5.1, 5.5, 5.6, 5.7, and 8.x.
Procedure
This section outlines how to synchronize data from MySQL to Table Store using Tapdata Cloud.
-
Configure a MySQL database and an authorized account. For more information, see Step 1: Configure a MySQL database and an authorized account.
-
Deploy Tapdata Agent on the machine that hosts the MySQL database or on a machine in the same network. For more information, see Step 2: Deploy Tapdata Agent.
The Tapdata Agent (Agent) is a core component for data synchronization, data development, and processing heterogeneous data. It uses streaming technology to retrieve data from the source, process and transform it, and send it to the target.
ImportantYour data does not pass through Tapdata Cloud, and Tapdata Cloud does not upload or store your data.
-
Use Tapdata Cloud to connect the source and target databases. For more information, see Step 3: Connect the source and target databases.
-
Create a data replication task. For more information, see Step 4: Configure a data replication task.
Step 1: Configure MySQL and an authorized account
Before you synchronize data from your MySQL database, create a database account for the data replication task. To enable incremental data synchronization, you must also enable Binlog on the database.
-
Log on to your MySQL database.
-
Run the appropriate command for your database version to create a database account.
-
For MySQL versions 5.0, 5.1, 5.5, 5.6, and 5.7, run the following command:
CREATE USER 'username'@'host' IDENTIFIED BY 'password'; -
For MySQL 8.x, run the following command:
CREATE USER 'username'@'host' IDENTIFIED WITH mysql_native_password BY 'password';
The following table describes these parameters.
Parameter
Example
Description
username
tapdatamysql
The database username.
password
Tap**********
The database password.
host
%
The host from which the user can log on. The percent sign (
%) indicates any host. -
-
Grant privileges to the database account you just created.
For production environments, grant more granular privileges based on your business requirements. The following examples show how to grant the
SELECTprivilege on a specific database or grant global privileges.-
Grant the
SELECTprivilege on a specific databaseGRANT SELECT, SHOW VIEW, CREATE ROUTINE, LOCK TABLES ON database_name.table_name TO 'username' IDENTIFIED BY 'password'; -
Grant global privileges
GRANT RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'username' IDENTIFIED BY 'password';
The following table describes these parameters.
Parameter
Example
Description
database_name.table_name
datademo.userinfo
The database and table to grant privileges on.
username
tapdatamysql
The database username.
password
Tap**********
The database password.
-
-
To read incremental data, enable Binlog.
-
Run the vim command to modify the configuration in
$MYSQL_HOME/mysql.cnf. An example configuration is as follows:server_id = 112233 log_bin = mysql-binlog expire_logs_days = 1 binlog_format = row binlog_row_image = fullThe following table describes these parameters.
Parameter
Example
Description
server_id
112233
A unique ID that identifies a server or replication client in MySQL. The
server_idmust be unique for each server and replication client. The value must be an integer greater than 0.log_bin
mysql-binlog
The base name for the sequence of binary log files.
expire_logs_days
1
The number of days to retain binary log files before they are automatically deleted.
binlog_format
row
The format of the binary log. Set this parameter to
row.binlog_row_image
full
Specifies how row images are written to the binary log for row-based replication. Set this parameter to
fullto log all columns for both the before-image and after-image of changed rows. -
After modifying the configuration, run the following command to restart the MySQL process.
/etc/init.d/mysqld restart -
After you log on to the MySQL database, execute the following command to confirm that the Binlog configuration has taken effect, where the value of the format parameter is ROW.
SHOW VARIABLES LIKE 'binlog_format';The following is a sample output:
+---------------+-------+ | Variable_name | Value | +---------------+-------+ | binlog_format | ROW | +---------------+-------+ 1 row in set (0.00 sec)
-
Step 2: Deploy the Tapdata Agent
The machine where you deploy the Agent must have internet access.
Data flow is often time-sensitive. To minimize the impact of network latency, we recommend deploying the Tapdata Agent on the machine that hosts the MySQL database or on a machine within the same network.
This section uses Linux (64-bit) as an example. To deploy on other platforms (such as Windows or Alibaba Cloud Compute Nest), see Install an Agent.
-
Log in to the Tapdata Cloud platform.
-
In the left-side navigation pane, click Agent Management.
-
On the Agent Management page, click Create Agent.
-
In the Agent Download and Installation dialog box, select Linux (64 bit), and then follow the on-screen instructions to install the Agent.
ImportantBefore you begin, ensure that Java 1.8 is installed in your deployment environment and that the required environment variables are configured correctly.
-
Log in to the machine where you plan to deploy the Agent.
-
Run the following command to create a tapdata directory for the Tapdata Agent installation and deployment.
mkdir tapdata -
Click Copy for the command in step 3 of the dialog box, and then run it in the tapdata directory.
NoteRunning this command automatically downloads, deploys, and starts the Tapdata Agent.
Wait for the command to complete. The Tapdata Agent is then installed and started.
The following output indicates a successful Tapdata Agent startup.
WORK DIR:/root/tapdata Init tapdata... Download components... Downloading 100.00% 225.48 MB, Total size: 225.48 MB, avg speed: 12.3 MB/s, remaining time: 0m 0s Downloading 100.00% 0.00 KB .Total size: 1.07 KB Downloading 100.00% 0.00 KB .Total size: 0.14 KB organize files... WORK DIR:/root/tapdata Testing JDK... java version:1.8 Java environment OK. The max memory will be used by Tapdata is 1119M Waiting for the flow engine to start \ FlowEngine is startup at : 2022-12-22 14:54 -
Step 3: Connect to source and destination databases
Connect to the source and destination databases in Tapdata Cloud.
-
Log in to the Tapdata Cloud platform.
-
In the left-side navigation pane, click Connection Management.
-
Connect to the MySQL database as the source.
-
On the Connection Management page, click Create Connection.
-
In the Create Connection dialog box, on the Verified Data Sources tab, click MySQL.
-
On the Create Connection page, configure the data source parameters as follows.
Parameter
Description
Connection Name
The name of the data connection. This name must be unique.
Connection Type
MySQL can be used as a source or destination. Valid values are Source and Target and Source.
Address
The database connection address.
Port
The database service port.
Database
The database name. Each connection corresponds to one database.
NoteTo connect to multiple databases, create a separate connection for each.
Account
The database account.
Password
The database password.
Advanced Settings
Connection parameters
Additional connection parameters. This parameter is empty by default.
Time Zone
Defaults to the database's time zone. You can also specify a time zone based on your business requirements.
For example, if the source database uses the default time zone (+8:00) and the destination database uses a specified time zone (+0:00), a time value of
2020-01-01 16:00:00in the source is stored as2020-01-01 08:00:00in the destination.Include Tables
The tables to include. Valid values:
-
All (default): Contains all tables in the database.
-
Custom: Specify the tables to include. Use a comma (,) to separate multiple tables or expressions. In expressions, you can use an asterisk (*) to represent any string of characters.
Exclude Tables
Excludes specified tables from those matched by the Include Tables rule.
Turn on the Exclude Tables switch and specify the tables to exclude. Separate multiple tables or expressions with a comma (,). The asterisk (*) can be used as a wildcard.
Agent Settings
The agent to use for the connection. Valid values:
-
Platform automatic allocation (default): The platform allocates a node for connection access.
-
User-specified: After setting this parameter to User-specified, select the Agent to use.
Model Loading Time
If a data source contains more than 10,000 models, Tapdata Cloud periodically refreshes them at the specified time.
-
-
Click Connection Test. If the test is successful, click Save.
NoteIf the connection test fails, follow the on-screen instructions to resolve the issue.
-
-
Connect to the destination Tablestore instance.
-
On the Connection Management page, click Create Connection.
-
In the Create Connection dialog box, on the Beta Data Sources tab, click Tablestore.
-
On the Create Connection page, configure the data source parameters as follows.
Parameter
Description
Connection Name
The name of the data connection. This name must be unique.
Connection Type
The value is fixed to Target. This data connection can only be used as a destination in Tapdata Cloud, not as a source.
Endpoint
The endpoint of the Tablestore instance. For more information, see Endpoints.
Instance Name
The name of the Tablestore instance. For more information, see Instances.
AccessKey ID
The AccessKey ID and AccessKey Secret of the RAM user.
NoteWhen using a temporary access credential from STS, use the AccessKey ID and AccessKey Secret from that credential.
AccessKey Secret
AccessKey Token
The token for the temporary STS credential. This parameter is required only when using an STS credential.
Agent Settings
The agent to use for the connection. Valid values:
-
Automatic allocation by the platform (Default): The platform allocates nodes to provide access.
-
Manual specification: After you set this parameter to Manual specification, select the Agent to use.
Model Loading Frequency
If a data source contains more than 10,000 models, Tapdata Cloud periodically refreshes them at the specified frequency.
-
-
Click Connection Test. If the test is successful, click Save.
NoteIf the connection test fails, follow the on-screen instructions to resolve the issue.
-
Step 4: Configure a data replication task
Configure a data replication task to synchronize data in real time between homogeneous or heterogeneous data sources. This feature is ideal for various business scenarios, including data migration, data synchronization, disaster recovery, and read performance scaling.
-
Log in to the Tapdata Cloud platform.
-
In the left-side navigation pane, click Data Replication.
-
On the Data Replication page, click Create.
-
From the left side of the page, drag the MySQL and Tablestore databases to the canvas on the right, and then connect them.
-
Click the MySQL database. In the right-side panel, configure the parameters as described in the following table.
Parameter
Description
Node name
By default, this is the connection name. You can change it as needed.
DDL event collection
Since Tablestore does not support DDL statement synchronization, you can skip this parameter.
Select tables
Select the tables to replicate. The following options are available:
-
By table name: Select the tables to replicate.
In the Tables to Replicate area, select the desired tables, and then click the
icon.You can also click Bulk Select to enter multiple table names for bulk selection. Separate the table names with commas (,).
-
By regular expression: Use a regular expression to select tables for replication.
When you use this method, any new tables created in the source database that match the expression are automatically synchronized to the target database.
Batch read size
The number of records read per batch during full synchronization. The default is 100.
-
-
Click the Tablestore database to preview the data schema and configure advanced settings.
Parameter
Description
Node name
By default, this is the connection name. You can change it as needed.
Inferred schema
View the table schema (such as field names and data types) and modify the field types as needed.
Note-
To modify a field type, click the
icon next to the target field type and complete the configuration in the dialog box that appears. -
A single Tablestore table can have a maximum of 32 columns. If the source MySQL table exceeds this limit, drag a field processor node from the left-side panel and connect it between the MySQL and Tablestore nodes. Then, in the field processor node, you can filter out unnecessary columns to meet this limit.
Advanced settings
Duplicate handling policy
Specifies what to do if the target table already exists. The default is Keep existing schema and data in the target. If the target table has no data and its schema differs from the source table, you can select Clear existing schema and data from the target.
Data write mode
Specifies how data is written to Tablestore. The following options are available:
-
Process by event type: Sets separate data write policies for insert, update, and delete events.
-
Append-only: Processes only insert events and discards update and delete events.
Data write policy
The policy for writing data based on event type. This parameter applies only when Data write mode is set to Process by event type.
The policy for insert events is Update if target exists, and the policy for update and delete events is Discard if target does not exist.
Full multi-threaded write
The number of concurrent threads for full data writes. The default is 8. You can adjust this value based on the target database's write performance.
Incremental write threads
The number of concurrent threads for incremental data writes. This is disabled by default. Turn on the Incremental write threads switch and adjust the value based on the target's write performance.
-
-
Optional: Click the
icon at the top of the canvas and configure the task properties as described in the following table.Parameter
Description
Task name
A custom name for the synchronization task.
Synchronization type
Specifies the data synchronization method. The following options are available:
-
Full + incremental: After completing a full data synchronization, the task synchronizes incremental data.
-
Full: Copies all existing data from the source to the target.
-
Incremental: Copies new data and data changes from the source to the target in real time.
Task description
An optional description for the task.
Advanced settings
Scheduled start time
Specifies when the task should start.
To specify a start time, turn on the Scheduled start time switch and select a date and time.
Data validation
If you turn on the Data validation switch, the task automatically performs full and incremental validation on the synchronized data to ensure consistency. The number of inconsistent rows is displayed on the monitoring page.
ImportantIf a task meets any of the following conditions, the task will not be validated, even if you enable the Data Validation switch.
-
The task includes an intermediate processing node.
-
The source connection does not support validation.
-
The target connection does not support validation.
Incremental concurrent write
If you turn on the Incremental concurrent write switch, the task writes source data to the target concurrently during incremental synchronization.
Incremental lag threshold
Helps you quickly determine if the replication task is lagging.
Turn on the Incremental lag threshold switch and set a threshold. During incremental synchronization, the task is considered lagging if its latency exceeds this threshold.
The default value is 1 second.
Processor threads
Specifies the number of processor threads.
Incremental data processing mode
The processing mode for incremental data during synchronization. Valid values are Batch and Row by row.
Agent settings
Specifies which agent to use. The following options are available:
-
Platform-assigned (default): The platform automatically assigns an agent for the connection.
-
Manually specify: If you set this parameter to Manually specify, select the agent to use.
-
-
After verifying the settings, click Start.
After the task starts, you can monitor its status, including metrics such as QPS, latency, and task duration.