Customize an RDBMS database and synchronize data

更新时间:
复制 MD 格式

Dataphin lets you create custom components for relational database management system (RDBMS) types to meet your data integration requirements. You can then use these components to synchronize data. To create a custom component, you only need the database driver. This topic uses MySQL as an example to demonstrate how to customize an RDBMS database and synchronize data.

Prerequisites

  • You have created an ApsaraDB RDS for MySQL instance. The network type of the instance must be virtual private cloud (VPC). For more information, see Quickly create an ApsaraDB RDS for MySQL instance.

  • You have created a database and an account for the RDS for MySQL instance. Take note of the database name, username, and password. For more information, see Create a database and an account.

  • The driver for the MySQL database has been downloaded.

Background information

An RDBMS is a relational database management system. Examples include MySQL, Oracle, SQL Server, PostgreSQL, Vertica, DRDS, DB2, OceanBase, PolarDB, SAP HANA, and TeraData. This topic uses MySQL as an example to demonstrate how to customize an RDBMS database and synchronize data.

Procedure

The procedure to customize and use a MySQL database component is as follows:

Step

Description

Step 1: Configure the network and create data tables

Before you customize and use the RDBMS component, you must configure the network between the RDS for MySQL instance and Dataphin. Then, create the source and destination tables for data synchronization.

Step 2: Create a custom data source

Create a custom data source of the test_rdbms_mysql type. After it is created, the custom data source component appears in the development module of the component library.

Step 3: Create a data source instance

Create a data source instance of the custom component type (test_rdbms_mysql) to import business data from the RDS for MySQL instance into Dataphin.

Step 4: Create an offline pipeline task

Create an offline pipeline task based on the test_rdbms_mysql component type and the test_rdbms_mysql data source instance. After the task is created, run it to integrate (synchronize) the data.

Step 1: Configure the network and create data tables

  • Connect the RDS for MySQL instance and the Dataphin instance.

    • Add the public endpoint and port of the RDS for MySQL instance to the sandbox whitelist of the Dataphin project:

      1. Obtain the public endpoint and port of the RDS for MySQL instance.

        Go to the Database Connection page to obtain the public endpoint and port of the RDS for MySQL instance.faga

      2. Add the public endpoint and port of the RDS for MySQL instance to the sandbox whitelist of the Dataphin project. For more information, see Sandbox whitelist.

    • Add the IP address of Dataphin to the whitelist of the RDS for MySQL instance. For more information, see Connect to an ApsaraDB RDS for MySQL instance using a client or the CLI.

      Region

      IP address whitelist

      China (Shanghai)

      100.104.228.128/26, 100.104.115.192/26

      China (Shenzhen)

      100.104.48.128/26

      China (Beijing)

      100.104.238.64/26

      China (Shanghai), China (Shenzhen), China (Beijing)

      100.104.0.0/16

  • Create source and destination data tables for data synchronization.

    Connect to the MySQL instance from the command line. Then, create the source and destination data tables. For more information, see Connect to an ApsaraDB RDS instance using a client or the CLI.

    • The following code is an example of how to create a source data table.

      create table xin_test_scr2
      (
      id string,
      name string
      );
      
      insert into xin_test_scr2 values('1001','huayu1'),('1002','huayuyu2'),
    • The following code is an example of how to create a destination data table.

      create table xin_test_dst_1
      (    
      id string,    
      name string
      );

Step 2: Create a custom data source

  1. Log on to the Dataphin console.

  2. On the Dataphin console page, select a region for your workspace and click Enter Dataphin>>.

  3. Go to the custom component page.

    1. On the Dataphin homepage, click Management Center.

    2. On the Management Center page, click Datasource Management in the top menu bar.

    3. On the Custom Source Type page, click +New Offline Custom Source Type.

  4. On the New Offline Custom Source Type page, configure the parameters.

    Parameter

    Description

    Basic Configuration

    Type

    Set Type to RDBMS database.

    Name

    Set Name to test_rdbms_mysql.

    Important

    The name defines the type of the custom component. The system does not allow you to create custom components with the same name.

    Type Code

    A unique code that identifies the data source type for backend use. Enter rdbms_mysql.

    Resource Configuration

    Driver Name

    Set Driver Name to com.mysql.jdbc.Driver.

    Driver Upload

    Upload the downloaded driver file (mysql-connector-java-5.1.47). For more information, see Prerequisites.

    Description Information

    Description

    Enter mysql-5.1.47 database.

  5. Click Create.

Step 3: Create a data source instance

  1. On the Data Source Management page, click +New Data Source in the upper-right corner.

  2. In the New Data Source dialog box, select the test_rdbms_mysql data source under Custom Datasource.

  3. In the New test_rdbms_mysql Data Source dialog box, configure the parameters.

    Parameter

    Description

    Datasource Name

    Set Datasource Name to test_rdbms_mysql.

    Datasource Description

    Enter a brief description of the data source.

    Datasource Config

    Configure the data source:

    • If the developer mode is Basic, select Production Datasource.

    • If the developer mode is Dev-Prod, you can configure the data source in one of the following ways:

      • Click Production + Development Datasource to configure data sources for the production and development environments.

      • Click Production Datasource to configure the production data source. After you create the production data source, click Development Datasource to configure the development data source.

    Note

    You can use the same data source for both the Production Datasource and Development Datasource, or you can use different data sources.

    Tag

    Optional. Configure a tag for data source categorization.

    Production Datasource or Production + Development Datasource

    Link

    Enter the connection string for the data source.

    The format is: jdbc:mysql://{Public Endpoint}:3306/{DatabaseName}.

    • {Public Endpoint}: The public endpoint.

    • {DatabaseName}: The database name.

    Username

    The username to log on to the database.

    Password

    The password to log on to the database.

  4. Click Test Connection to test the connectivity of the data source.

    Important

    Connection tests are not supported for non-RDBMS data sources. Make sure that the data source connection information is accurate.

  5. Click OK.

Step 4: Create and develop an offline pipeline task

  1. On the data source page, move the mouse over the image.png icon in the upper-left corner and click Integration.

  2. On the Data Integration page, open the Create Pipeline Development Script dialog box.

    1. In the navigation pane on the left, click Offline Management to go to the offline pipeline development page.

    2. On the offline pipeline development page, click the image.png icon, select Offline Single Pipeline, and open the Create Pipeline Development Script dialog box.

  3. In the Create Pipeline Development Script dialog box, configure the parameters.

    Parameter

    Description

    Pipeline Name

    Enter test.

    Schedule Type

    Select Manual Node.

    Description

    Enter a brief description. For example, Test custom component.

    Select Directory

    The default value is Offline Pipeline.

  4. Click OK.

  5. On the test offline pipeline page, click the image.png icon in the upper-right corner and then click Custom.

  6. Drag the custom_test_rdbms_mysql_Input Configuration and custom_test_rdbms_mysql_Output Configuration components to the pipeline canvas.

  7. Connect the input and output components and click the image.png icon to configure the components. The following figure shows an example:

    image.png

    • custom_test_rdbms_mysql_Input component configuration.

      Parameter

      Description

      Step Name

      Keep the default value for this tutorial.

      You can also change the name. The naming convention is as follows: The name can contain only letters, digits, and hyphens (-). The name can be up to 64 characters long.

      Data Source

      Select test_rdbms_mysql.

      Table

      Set the source table to xin_test_src2.

      Input Filter

      No configuration is required for this tutorial.

      The input filter specifies the filter conditions for the input fields. For example, ds=${bizdate}.

      The input filter is applicable to the following scenarios:

      • A fixed part of the data.

      • Parameter-based filtering.

      Output Fields

      The output fields are the fields whose data you want to synchronize. In this tutorial, add the id and name fields from the source table xin_test_scr2 as the output fields of the input component:

      1. In the Output Fields section, click New Output Field.

      2. Set Output Field to id, and set Type to String.

      3. Click New Output Field.

      4. Set Output Field to name, and set Type to String.

    • custom_test_rdbms_mysql_Output component configuration.

      Parameter

      Description

      Step Name

      Keep the default value for this tutorial.

      You can also change the name. The naming convention is as follows: The name can contain only letters, digits, and hyphens (-). The name can be up to 64 characters long.

      Data Source

      Select test_rdbms_mysql.

      Table

      Enter xin_test_dst_1 as the destination table.

      Parsing Scheme

      You do not need to make a selection for this tutorial.

      The parsing scheme is optional. It specifies special processing methods before and after data output. The parsing scheme includes a preparation statement and a completion statement: Preparation statement: The SQL script to execute before import. Completion statement: The SQL script to execute after import.

      Input Fields

      The output fields configured in the input component are displayed by default.

      Output Fields

      Output fields are the fields whose data you want to sync. In this tutorial, add the id and name fields from the source table xin_test_scr2 as the output fields for the output component:

      1. In the Output Fields section, click New Output Field.

      2. Set Output Field to id, and set Type to String.

      3. Click New Output Field.

      4. Set Output Field to name, and set Type to String.

      Mapping

      Click Quick Mapping and select Map by Name.

      The mapping specifies the relationship between the output fields of the input component and the output fields of the output component. Mappings include Map by Name and Map by Row:

      • Map by Name: Maps fields that have the same name.

      • Map by Row: Maps fields in the same row. After you map by row, the input fields become the final output fields.

  8. Click OK.

  9. In the canvas toolbar, click Run to run the offline pipeline task and check whether the task runs as expected.

    image.png