Synchronize data from an OceanBase database to DataHub

更新时间:
复制 MD 格式

This topic describes how to synchronize data from an OceanBase database (MySQL and Oracle tenants) to DataHub.

Prerequisites

  • Data Transmission Service (DTS) requires the necessary cloud resource access permissions. For more information, see Grant permissions to a DTS role.

  • A database user for the data synchronization task has been created in the source OceanBase database and granted the required permissions. For more information, see Create a database user.

Limits

  • DTS supports full synchronization only for tables that have unique keys.

  • DDL synchronization is supported only for topics of the BLOB type.

  • During data synchronization, DTS lets you run DROP TABLE followed by CREATE TABLE on a synchronized table. However, DTS does not support renaming tables, such as with the RENAME TABLE a TO a_tmp command.

  • DTS supports data synchronization for the UTF-8 and GBK character sets.

  • The names of the tables and columns to be synchronized cannot contain Chinese characters.

  • DTS only supports objects whose database, table, and column names consist of ASCII characters. These names cannot contain special characters, such as line breaks, spaces, or any of the following: .|"'`()=;/&\.

  • DTS does not support using a secondary OceanBase database as the source.

DataHub has the following limits:

  • The maximum size of a single message in DataHub is typically 1 MB.

  • Because messages are sent in batches, DataHub limits the size of each batch to 4 MB.

  • For more information about the limits and naming conventions of DataHub, see Limits.

Notes

  • If you perform a RENAME operation on a source table in a data synchronization task with DDL synchronization enabled, we recommend restarting the task to prevent data loss during incremental synchronization.

  • If you select incremental synchronization for an OceanBase database version from V4.0.0 to V4.3.x (excluding V4.2.5 BP1), you must configure the STORED property for generated columns. Otherwise, information about generated columns will not be saved in incremental logs, which can cause data exceptions during incremental synchronization.

  • If an updated row contains LOB columns:

    • If a LOB column is updated, its value before the UPDATE or DELETE operation is unreliable.

      Data types that use LOB columns for storage include JSON, GIS, XML, User Defined Type (UDT), and various TEXT types, such as LONGTEXT and MEDIUMTEXT.

    • If a LOB column is not updated, its value is NULL both before and after the UPDATE or DELETE operation.

  • If you configure only Incremental Synchronization, the incremental logs in the source database must be retained for more than 48 hours.

    If you configure Full Synchronization and Incremental Synchronization, the incremental logs in the source database must be retained for at least 7 days. Otherwise, DTS may be unable to retrieve the incremental logs, which can cause the sync task to fail and result in data inconsistency between the source and destination.

  • When you synchronize incremental data from an OceanBase database to DataHub, the initial table schema is synchronized to the DataHub schema. The following table describes the supported DataHub data types. This applies only to topics of the Tuple type.

    Type

    Description

    Range

    BIGINT

    8-byte signed integer

    -9223372036854775807 to 9223372036854775807

    DOUBLE

    8-byte double-precision floating-point number

    -1.0_10^308 to 1.0_10^308

    BOOLEAN

    Boolean type

    • True/False

    • true/false

    • 0/1

    TIMESTAMP

    Timestamp type

    A timestamp that is accurate to the microsecond

    STRING

    String that supports only UTF-8 encoding

    The maximum size of a single STRING column is 2 MB

    INTEGER

    4-byte integer

    -2147483648 to 2147483647

    FLOAT

    4-byte single-precision floating-point number

    -3.40292347_10^38 to 3.40292347_10^38

    DECIMAL

    Numeric type

    -10^38 +1 to 10^38 - 1

Supported source and destination instance types

In the following table, an OceanBase database with a MySQL tenant is referred to as OB_MySQL, and an OceanBase database with an Oracle tenant is referred to as OB_Oracle.

Source

Destination

OB_MySQL (OceanBase cluster instance)

DataHub (Alibaba Cloud DataHub instance)

OB_MySQL (OceanBase cluster instance)

DataHub (Public DataHub instance)

OB_MySQL (Serverless instance)

DataHub (Alibaba Cloud DataHub instance)

OB_MySQL (Serverless instance)

DataHub (Public DataHub instance)

OB_Oracle (OceanBase cluster instance)

DataHub (Alibaba Cloud DataHub instance)

OB_Oracle (OceanBase cluster instance)

DataHub (Public DataHub instance)

Supported DDL operations

Important
  • If the source is an OceanBase database with a MySQL tenant and you select a Tuple topic, DDL synchronization only supports ALTER ADD COLUMN and ignores the NOT NULL constraint.

  • If the source is an OceanBase database with an Oracle tenant, DDL synchronization is supported only for BLOB topics.

  • Modify table ALTER TABLE

    • Add column ADD COLUMN

    • Modify column length MODIFY COLUMN

    • Drop column DROP COLUMN

  • Create index CREATE INDEX

  • Drop index DROP INDEX

  • Truncate table TRUNCATE

    Note

    In a delayed deletion scenario, a transaction may contain two identical TRUNCATE TABLE DDL statements. In this case, the downstream consumer must process the statements in an idempotent manner.

Data type mapping

DataHub supports only the INTEGER, BIGINT, TIMESTAMP, FLOAT, DOUBLE, DECIMAL, STRING, and BOOLEAN data types for synchronization.

  • If you create a new topic of an unsupported type, the data synchronization task will fail.

  • The default mapping rules in the tables are recommended. Errors may occur if you change the mapping.

Data type mapping between an OceanBase database with a MySQL tenant and DataHub

When you synchronize data from a MySQL tenant of an OceanBase database to DataHub, you can select BLOB or Tuple as the topic type. The following tables describe the data type mapping when the topic type is Tuple and the serialization format is Default or DTSCompatible. For more information about the data type mapping when the topic type is BLOB, see Data formats.

  • When you synchronize data from a MySQL tenant of an OceanBase database to DataHub (Tuple type) and set the serialization format to Default, the data types are mapped as follows.

    OceanBase database with a MySQL tenant

    Default DataHub type

    BIT

    STRING (Base64 encoded)

    CHAR

    STRING

    BINARY

    STRING (Base64 encoded)

    VARBINARY

    STRING (Base64 encoded)

    INT

    BIGINT

    TINYTEXT

    STRING

    SMALLINT

    BIGINT

    MEDIUMINT

    BIGINT

    BIGINT

    DECIMAL (Because the unsigned value exceeds the range of the LONG type in Java)

    FLOAT

    DECIMAL

    DOUBLE

    DECIMAL

    DECIMAL

    DECIMAL

    DATE

    STRING

    TIME

    STRING

    YEAR

    BIGINT

    DATETIME

    STRING

    TIMESTAMP

    TIMESTAMP (Displayed as a millisecond-level timestamp)

    VARCHAR

    STRING

    TINYBLOB

    STRING (Base64 encoded)

    TINYTEXT

    STRING

    BLOB

    STRING (Base64 encoded)

    TEXT

    STRING

    MEDIUMBLOB

    STRING (Base64 encoded)

    MEDIUMTEXT

    STRING

    LONGBLOB

    STRING (Base64 encoded)

    LONGTEXT

    STRING

  • When you synchronize data from a MySQL tenant in an OceanBase database to DataHub (Tuple type) and set the serialization format to DTSCompatible, the data types are mapped as follows.

    OceanBase database with a MySQL tenant

    Default DataHub type

    CHAR

    STRING

    VARCHAR

    STRING

    BINARY

    STRING

    VARBINARY

    STRING

    BIT(1)

    The value of BIT can be 0 or 1.

    BOOLEAN

    • If the value of the BIT column is 0, the BOOLEAN value in DataHub is false.

    • If the value of the BIT column is 1, the BOOLEAN value in DataHub is true.

    BIT(n)

    STRING (HEX encoded)

    The number of bits displayed for the `BIT` data type may not match the number of bits defined in the database. For example, a `bit(10)` value of `0` may be displayed as `000` by DTS, but as `00` in the database.

    TINYINT

    BIGINT

    SMALLINT

    BIGINT

    MEDIUMINT

    BIGINT

    INT

    BIGINT

    BIGINT

    BIGINT

    The BIGINT type in DataHub is signed and has a range of -9223372036854775807 to 9223372036854775807. If the source data is an unsigned BIGINT and its value exceeds 9223372036854775807, a data overflow occurs in DataHub, which leads to data inconsistency.

    FLOAT

    DOUBLE

    The precision is 7 digits. If the precision exceeds 7 digits, inconsistencies may occur.

    DOUBLE

    DOUBLE

    The precision is 16 digits. If the precision exceeds 16 digits, inconsistencies may occur.

    DECIMAL

    DECIMAL

    DATE

    TIMESTAMP (converted using the +08:00 time zone, accurate to the microsecond)

    TIME

    STRING

    YEAR

    STRING

    DATETIME

    TIMESTAMP (converted using the +08:00 time zone, accurate to the microsecond)

    TIMESTAMP

    TIMESTAMP (accurate to the microsecond)

    TINYTEXT

    STRING

    MEDIUMTEXT

    STRING

    TEXT

    STRING

    LONGTEXT

    STRING

    TINYBLOB

    STRING (HEX encoded)

    MEDIUMBLOB

    STRING (HEX encoded)

    BLOB

    STRING (HEX encoded)

    LONGBLOB

    STRING (HEX encoded)

Data type mapping between an OceanBase database with an Oracle tenant and DataHub

OceanBase database with an Oracle tenant

Default DataHub type

CHAR

STRING

NCHAR

STRING

VARCHAR2

STRING

NVARCHAR2

STRING

CLOB

STRING

BLOB

STRING (Base64 encoded)

NUMBER

DECIMAL

BINARY_FLOAT

DECIMAL

BINARY_DOUBLE

DECIMAL

DATE

STRING

TIMESTAMP

STRING

TIMESTAMP WITH TIME ZONE

STRING

TIMESTAMP WITH LOCAL TIME ZONE

STRING

INTERVAL YEAR TO MONTH

STRING

INTERVAL DAY TO SECOND

STRING

RAW

STRING (Base64 encoded)

Additional properties

If you create a topic manually, you must add the following properties to the DataHub schema before you start the sync task. If you allow DTS to automatically create the topic and synchronize the schema, DTS adds these properties for you.

Important

This section applies to Tuple-type topics when the Serialization Method is set to Default on the Sync Options > Advanced Options page.

Name

Type

Description

oms_timestamp

STRING

The time when the change occurred.

oms_table_name

STRING

The name of the changed table if the source is a table.

oms_database_name

STRING

If the source is a database, change its name.

oms_sequence

STRING

The timestamp when the data enters the memory of the sync process. It consists of a time value and a 5-digit incremental number. If a clock backoff occurs, data inconsistency may result.

oms_record_type

STRING

The type of change, which can be UPDATE, INSERT, or DELETE.

oms_is_before

STRING

If the change type is UPDATE, Y indicates that this record is the data before the change.

oms_is_after

STRING

If the change type is UPDATE, Y indicates that this record is the data after the change.

If the source is an OceanBase database with a MySQL tenant, setting the topic type to Tuple on the Choose Synchronization Objects page and the Serialization Format to DTSCompatible on the Advanced Options tab of the Sync Options page adds the dts_ suffix to all column names. For example, if the columnName in the source OceanBase database with a MySQL tenant is c1, it becomes dts_c1 in the destination DataHub instance. The field conversions are described as follows.

Field name

Data type

Description

dts_record_id

STRING

The record ID in the incremental log. This ID is the unique identifier of the log.

The record ID usually increments. However, during disaster recovery and migration, the clocks on different machines may not be synchronized, which can cause inconsistencies.

dts_operation_flag

STRING

The operation type. Valid values:

  • I: INSERT operation

  • D: DELETE operation

  • U: UPDATE operation

  • F: Full data synchronization

dts_db_name

STRING

The name of the database. For an OceanBase database, the format is Tenant name.Database name.

dts_table_name

STRING

The name of the table.

dts_utc_timestamp

STRING

The timestamp of the operation, which is the timestamp of the binary log (UTC time).

dts_before_flag

STRING

Indicates whether the values of all columns are the values before the update. Valid values are Y and N.

dts_after_flag

STRING

Indicates whether the values of all columns are the values after the update. Valid values are Y and N.

Procedure

  1. Log on to the OceanBase Management Console and purchase a data sync task.

    For more information, see Purchase a data sync task.

  2. In the Data Transmission Service console, navigate to the Data Synchronization page, find the new sync task, and click Configure.

    image.png

    If you want to reuse the configurations of an existing task, click Reuse Configuration. For more information, see Reuse and clear the configurations of a data sync task.

  3. On the Choose Source and Destination page, you can configure the parameters.

    Parameter

    Description

    Sync Task Name

    We recommend that you use a combination of Chinese characters, digits, and letters. The name cannot contain spaces and must be no more than 64 characters in length.

    Source

    If you have created an OceanBase data source, select it from the drop-down list. If not, click Create Data Source in the drop-down list to create one in the dialog box that appears. For more information about the parameters, see Create an OceanBase data source.

    Destination

    If you have created a DataHub data source, select it from the drop-down list. If not, click Create Data Source in the drop-down list to create one in the dialog box that appears. For more information about the parameters, see Create a DataHub data source.

    Tags (Optional)

    Click the text box and select a tag from the drop-down list. You can also click Manage Tags to create, modify, and delete tags. For more information, see Manage data sync tasks using tags.

  4. Click Next. On the Choose Synchronization Types page, select the required synchronization types.

    image

    The synchronization types are Schema Synchronization, Full Synchronization, and Incremental Synchronization. Schema Synchronization creates a topic, and Incremental Synchronization includes DML Synchronization (INSERT, DELETE, and UPDATE) and DDL Synchronization. You can configure these types as needed. For more information, see Customize DDL and DML synchronization.

    Note

    If you select DDL Synchronization when the source is an OceanBase database with an Oracle tenant, you can only select BLOB as the topic type on the Choose Synchronization Objects page.

    image.png

  5. Click Next. On the Choose Synchronization Objects page, select the topic type and the objects that you want to synchronize.

    The topic types include Tuple and BLOB. If the source is an OceanBase database with an Oracle tenant, topics of the Tuple type do not support DDL synchronization but do support data similar to database records, where each record contains multiple columns. Topics of the BLOB type support writing a single block of binary data as a record, which is transmitted using Base64 encoding. For more information, see the official DataHub documentation.

    After you select a topic type, you can select synchronization objects using Specify Objects or Matching Rules. This topic describes how to use Specify Objects. For more information about how to configure matching rules, see Configure and modify matching rules.

    Note

    If you select the DDL Synchronization check box in the Choose Synchronization Types step, we recommend that you select synchronization objects using matching rules to ensure that all new objects that meet the rules are synchronized. If you select specific synchronization objects, new or renamed objects are not synchronized.

    1. In the Choose Synchronization Objects section, select Specify Objects.

    2. In the selection area on the left, select the objects to synchronize.

    3. Click >.

    4. Select a mapping method as needed.

      • To synchronize a single Tuple table or a BLOB table, select a mapping method in the Map Object to Topic dialog box and click OK.

        image.png

        When the Schema Synchronization check box is cleared, you can only select Existing Topic. When the Schema Synchronization check box is selected, you can only use one mapping method to create or select a topic.

        For example, if you select schema synchronization and then use both the Create Topic and Select Existing Topic mapping methods, or if you rename the topic, a precheck error will occur because of conflicting options.

        Parameter

        Description

        Create Topic

        Enter a name for the new topic in the text box. The name can contain up to 128 letters, digits, or underscores (_), and must start with a letter.

        Select Topic

        DTS lets you query DataHub topics. You can click Select Topic, and then search for and select the topic to synchronize from the Existing Topic drop-down list.

        Batch Create Topics

        The rule for batch creating topics is Topic_${Database Name}_${Table Name}.

        If you select Create Topic or Batch Create Topics, you can query the new topic in DataHub after the schema synchronization is complete. By default, the new topic has 2 shards and a time-to-live (TTL) of 7 days. These values cannot be changed.

      • To synchronize multiple tables of the Tuple type, click OK in the dialog box.

        image.png

        If you select multiple tables and a topic of the Tuple type without selecting the Schema Synchronization check box, you must select an existing topic in the Map Object to Topic dialog box and click OK.

        image.png

        In this case, multiple tables are displayed for one topic on the right, but only one table can be synchronized. If you click Next, a message is displayed indicating that Tuple-type topics support only a one-to-one mapping with tables.

        image.png

    DTS lets you import objects from a text file and supports operations such as renaming destination objects, setting row filters, and removing single or all objects. The structure of a destination object is Topic > Database > Table.

    Note

    When you select synchronization objects using Matching Rules, the renaming function is part of the matching rule syntax. You can only set filter conditions and select the shard column and the columns to synchronize. For more information, see Configure and modify matching rules.

    image.png

    Operation

    Description

    Import Objects

    1. In the list on the right of the selection area, click Import Objects in the upper-right corner.

    2. In the dialog box, click OK.

      Important

      The import operation overwrites previous selections. Proceed with caution.

    3. In the Import Synchronization Objects dialog box, import the objects to synchronize. You can import a CSV file to rename database tables, set row filtering conditions, and perform other operations. For more information, see Download and import synchronization object configurations.

    4. Click Validate.

    5. After the validation is successful, click OK.

    Change Topic

    When you select BLOB as the topic type, you can change the topic for the destination object. For more information, see Change a topic.

    Settings

    DTS supports row filtering using a WHERE condition, along with selecting a shard column and the columns to synchronize.

    1. In the list on the right of the selection area, move the mouse pointer over the destination table object.

    2. Click the Settings icon that appears.

    3. In the Settings dialog box, you can perform the following operations.

      • In the text box in the Row Filtering Condition section, enter a standard WHERE clause of an SQL statement to configure row filtering. For more information, see Filter data using SQL conditions.

      • From the Shard Column drop-down list, select the destination shard column. You can select multiple fields as shard columns. This parameter is optional.

        When you select a shard column, you can select the primary key by default. If a load imbalance exists on the primary key, select a field that has a unique identifier and a relatively balanced load as the shard column. This avoids potential performance issues. The main functions of a shard column are as follows:

        • Load balancing: If concurrent writes are allowed at the destination, the shard column is used to determine the specific thread for sending messages.

        • Ordering: To address the disorder that may be caused by concurrent writes, DTS ensures that messages are received in order when the values of the shard column are the same. Ordering here refers to the order of changes (the execution order of DML operations on a column).

      • In the Select Columns section, select the columns to synchronize. For more information, see Filter columns.

    4. Click OK.

    Remove/Remove All

    During data mapping, DTS lets you remove one or more objects that are temporarily selected for the destination.

    • Remove a single synchronization object

      In the list on the right of the selection area, move the mouse pointer over the destination object and click the Remove icon that appears to remove the synchronization object.

    • Remove all synchronization objects

      In the list on the right of the selection area, click Remove All in the upper-right corner. In the dialog box that appears, click OK to remove all synchronization objects.

  6. Click Next and configure the parameters on the Sync Options page.

    • Full synchronization

      The following parameters are displayed only if you select Full Synchronization on the Choose Synchronization Types page.

      image

      Parameter

      Description

      Read Concurrency

      This parameter specifies the number of concurrent threads for reading data from the source during full synchronization. The maximum value is 512. A high concurrency may put excessive pressure on the source and affect your business.

      Write Concurrency

      This parameter specifies the number of concurrent threads for writing data to the destination during full synchronization. The maximum value is 512. A high concurrency may put excessive pressure on the destination and affect your business.

      Full Synchronization Rate Limit

      You can decide whether to enable the full synchronization rate limit as needed. If you enable it, set RPS (the maximum number of data rows that can be synchronized to the destination per second during full synchronization) and BPS (the maximum data volume that can be synchronized to the destination per second during full synchronization).

      Note

      The RPS and BPS values set here are only for rate limiting. The actual performance of full synchronization is affected by factors such as the source, destination, and instance specifications.

    • Incremental synchronization

      The following parameters are displayed only if you select Incremental Synchronization on the Choose Synchronization Types page.

      image

      Parameter

      Description

      Write Concurrency

      This parameter specifies the number of concurrent threads for writing data to the destination during incremental synchronization. The maximum value is 512. A high concurrency may put excessive pressure on the destination and affect your business.

      Incremental Synchronization Rate Limit

      You can decide whether to enable the incremental synchronization rate limit as needed. If you enable it, set RPS (the maximum number of data rows that can be synchronized to the destination per second during incremental synchronization) and BPS (the maximum data volume that can be synchronized to the destination per second during incremental synchronization).

      Note

      The RPS and BPS values set here are only for rate limiting. The actual performance of incremental synchronization is affected by factors such as the source, destination, and instance specifications.

      Incremental Synchronization Start Offset

      • If you select Full Synchronization when you select the synchronization type, you cannot change this parameter.

      • If you do not select Full Synchronization but select Incremental Synchronization when you select the synchronization type, specify a point in time after which you want to synchronize data. The default value is the current system time. For more information, see Set the start offset for incremental synchronization.

    • Advanced options

      image

      Parameter

      Description

      Serialization Format

      Controls the message format for data synchronization to DataHub. The supported formats are Default, Canal, DataWorks (V2.0 is supported), SharePlex, DefaultExtendColumnType, Debezium, DebeziumFlatten, DebeziumSmt, and DTSCompatible. For more information, see Serialization format descriptions.

      Important
      • This parameter is displayed only when you select BLOB as the topic type on the Choose Synchronization Objects page, or when the source is an OceanBase database with a MySQL tenant and you select Tuple as the topic type.

      • Currently, only OceanBase databases with a MySQL tenant support Debezium, DebeziumFlatten, DebeziumSmt, and DTSCompatible.

      • When the source is an OceanBase database with a MySQL tenant and you select Tuple as the topic type on the Choose Synchronization Objects page, the supported serialization formats are Default and DTSCompatible.

      Partitioning Rule

      The rule for synchronizing source data to a DataHub topic. The supported rules are Hash and Table. We recommend that you select Table to ensure that DDL and DML consumption is consistent for the downstream consumer.

      • Hash indicates that DTS uses a hash algorithm to select a DataHub shard based on the primary key value or shard column value.

      • Table indicates that DTS ships all data from one table to the same partition, using the table name as the hash key.

      Note

      If you select DDL Synchronization on the Choose Synchronization Types page, only the Table partitioning rule is supported.

      Business System ID (Optional)

      This parameter is displayed only when you set Serialization Format to DataWorks. It is used to identify the source business system of the data for custom processing. The business system ID can be 1 to 20 characters in length.

  7. Click Precheck.

    In the Precheck step, DTS checks only the column names, column types, and whether the columns are empty. It does not check the length or default values. If a precheck error occurs:

    • You can troubleshoot the issue and run the precheck again until all items pass.

    • You can also click Skip in the Actions column for a failed precheck item. A dialog box appears that explains the impact of skipping this operation. If you want to proceed, click OK in the dialog box.

  8. After the precheck passes, click Start Task.

    If you do not need to start the task right away, click Save. You can then start the task manually on the Sync Task List page or include it in a batch operation. For more information about batch operations, see Perform batch operations on data sync tasks.

    DTS lets you modify the synchronization objects while a data sync task is running. For more information, see View and modify synchronization objects and their filter conditions. After the data sync task starts, it runs based on the selected synchronization types. For more information, see View synchronization details.

If a data sync task fails, typically due to network connection issues or a slow process startup, you can click Resume on the task list or the product page.

References