Data integration

Data Integration is a simple and efficient data synchronization platform built on Dataphin. It provides data pre-processing capabilities and fast, stable data synchronization between heterogeneous data sources.

Get started in 5 minutes

Background information

As big data applications expand across industries, the demands on data integration grow. Users need to efficiently configure sync tasks for large numbers of tables, integrate multiple heterogeneous data sources, perform light pre-processing on data, and tune sync tasks with features such as fault tolerance, rate limiting, and concurrency control.

Function overview

Note

If you purchased Dataphin after April 2020, the data synchronization feature has been upgraded to Data Integration.

Dataphin has upgraded its data integration capabilities to help you build a simple, efficient, secure, and reliable data synchronization platform:

  • You can improve data integration efficiency using full database migration to quickly generate batch sync tasks and create destination tables with one click. When you sync data to MaxCompute, you do not need to manually create tables. For more information, see Configure an integration task by migrating an entire database.

  • You can use the Flow and Transform components to pre-process data from a data source. Pre-processing includes data cleansing, transformation, field masking, calculation, merging, distribution, and filtering. For more information, see Create an integration task from a single pipeline.

  • You can use the Dev-Prod or Basic development pattern, depending on your needs.

  • You can quickly sync logical tables created in Dataphin to a destination database.

  • You can create custom components for data source types that the system does not support, to meet data synchronization needs in different scenarios. Custom Relational Database Management System (RDBMS) components connect through Java Database Connectivity (JDBC). For non-RDBMS components, you must upload a JAR package.
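
The Flow and Transform pre-processing described in the list above can be pictured as a chain of steps between an input and an output. The following is a minimal Python sketch of that idea under stated assumptions. It is not Dataphin's implementation or API: the row format (one dict per row) and the names `filter_rows`, `mask_field`, `add_field`, and `run_pipeline` are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Row = dict  # hypothetical intermediate record format: one dict per row
Step = Callable[[Iterable[Row]], Iterator[Row]]


def filter_rows(predicate: Callable[[Row], bool]) -> Step:
    """Build a step that keeps only the rows matching the predicate."""
    def step(rows: Iterable[Row]) -> Iterator[Row]:
        return (row for row in rows if predicate(row))
    return step


def mask_field(field: str, keep: int = 3) -> Step:
    """Build a step that masks a field, keeping only the first `keep` characters."""
    def step(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            value = str(row[field])
            masked = value[:keep] + "*" * max(len(value) - keep, 0)
            yield {**row, field: masked}  # copy the row, do not mutate the source
    return step


def add_field(name: str, compute: Callable[[Row], object]) -> Step:
    """Build a step that adds a calculated field to every row."""
    def step(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            yield {**row, name: compute(row)}
    return step


def run_pipeline(source: Iterable[Row], steps: List[Step]) -> List[Row]:
    """Read rows from the source, apply each step in order, and collect the output."""
    rows: Iterable[Row] = iter(source)
    for step in steps:
        rows = step(rows)
    return list(rows)


# Hypothetical sample data standing in for an input component.
source_rows = [
    {"id": 1, "phone": "13800001111", "amount": 120},
    {"id": 2, "phone": "13900002222", "amount": 0},
]

result = run_pipeline(
    source_rows,
    [
        filter_rows(lambda r: r["amount"] > 0),                  # filtering
        mask_field("phone", keep=3),                             # field masking
        add_field("amount_cents", lambda r: r["amount"] * 100),  # calculation
    ],
)
print(result)
```

Each step takes an iterable of rows and returns a new one, so steps can be reordered or removed independently, which loosely mirrors dragging and assembling components in a pipeline.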

Data Integration supports various component types. You can drag, configure, and assemble these components to build an offline single pipeline. Data Integration also lets you quickly generate batch sync tasks through full database migration, for which the source can be MySQL, SQL Server, or Oracle, and the destination must be MaxCompute.
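
To illustrate what generating batch sync tasks for a full database migration involves, the sketch below expands a list of source tables into one task description per table. It is only a conceptual model: `SyncTask`, `plan_full_database_migration`, the `ods_` prefix, and the `concurrency` and `rate_limit_mb_per_s` fields are hypothetical names chosen for this example and do not correspond to Dataphin settings. Only the supported source types and the MaxCompute destination come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Source types listed for full database migration in this document.
SUPPORTED_SOURCES = {"MySQL", "SQL Server", "Oracle"}
DESTINATION_TYPE = "MaxCompute"  # the only destination named in this document


@dataclass(frozen=True)
class SyncTask:
    """Hypothetical description of one table-level sync task."""
    source_type: str
    source_table: str
    destination_type: str
    destination_table: str
    create_destination_table: bool      # destination tables are created automatically
    concurrency: int                    # hypothetical tuning knob
    rate_limit_mb_per_s: Optional[int]  # hypothetical tuning knob; None means unlimited


def plan_full_database_migration(
    source_type: str,
    tables: List[str],
    prefix: str = "ods_",
    concurrency: int = 2,
    rate_limit_mb_per_s: Optional[int] = None,
) -> List[SyncTask]:
    """Generate one sync task per source table, all targeting MaxCompute."""
    if source_type not in SUPPORTED_SOURCES:
        raise ValueError(f"Unsupported source type: {source_type}")
    return [
        SyncTask(
            source_type=source_type,
            source_table=table,
            destination_type=DESTINATION_TYPE,
            destination_table=f"{prefix}{table}",
            create_destination_table=True,
            concurrency=concurrency,
            rate_limit_mb_per_s=rate_limit_mb_per_s,
        )
        for table in tables
    ]


# Hypothetical usage: two tables from a MySQL source.
tasks = plan_full_database_migration("MySQL", ["orders", "users"])
for task in tasks:
    print(task.source_table, "->", task.destination_table)
```

The point of the sketch is that one configuration is shared across all tables, so adding a table to the migration does not require configuring a new task by hand.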

Access Data Integration

Quick access (recommended)

On the Dataphin home page, click Data Import in the product path to quickly access Data Integration.

Standard access

On the Dataphin home page, choose Develop > Data Integration from the top menu bar to go to the Data Integration page.

Connect the data source to the Dataphin network

To synchronize data, you must establish a network connection between your data source and your Dataphin project. For more information, see Network connectivity solutions.

Scenarios

Each scenario below consists of the scenario name, a description, and instructions.

Build a sync task using a pipeline script

Develop a pipeline node based on an existing pipeline script to synchronize data.

  1. Download the developed pipeline script. For more information, see Create an integration task from a single pipeline.

  2. Upload the script to create a pipeline development script. For more information, see Create an integration task from a single pipeline.

  3. Develop a pipeline node based on the pipeline script and configure its scheduling properties. For more information, see Configure offline pipeline node properties.

  4. Submit or publish the pipeline node to the production environment. For more information, see Manage published nodes.

    Note

    If the data development pattern is Basic, you do not need to publish the pipeline node.

  5. Perform O&M and scheduling. For more information, see Operation Center.

Build a sync task using an offline single pipeline

An offline data pipeline defines the source and destination data sources and datasets. It provides an abstract set of Input, Output, Flow, and Transform components. This framework uses a simplified intermediate data format to transfer data between data sources.

  1. Configure the data source. For more information, see Data sources supported by Dataphin.

  2. Assemble and configure the offline single pipeline script. For more information, see Create an integration task from a single pipeline. To configure a batch sync task, see Configure an integration task by migrating an entire database.

  3. Submit or publish the pipeline node to the production environment. For more information, see Manage published nodes.

    Note

    If the data development pattern is Basic, you do not need to publish the pipeline node.

  4. Perform O&M and scheduling. For more information, see Operation Center.

Build a sync task using offline full database migration

Full database migration is a tool that improves efficiency and reduces costs. It lets you quickly sync all tables from a MySQL, Oracle, or SQL Server database to MaxCompute. This greatly reduces the configuration and migration costs of an initial move to the cloud.

  1. Configure the data source. For more information, see Data sources supported by Dataphin.

  2. Assemble and configure the offline single pipeline script. For more information, see Create an integration task from a single pipeline. To configure a batch sync task, see Configure an integration task by migrating an entire database.

  3. Submit or publish the pipeline node to the production environment. For more information, see Manage published nodes.

    Note

    If the data development pattern is Basic, you do not need to publish the pipeline node.

  4. Perform O&M and scheduling. For more information, see Operation Center.

Build a sync task using a custom component

Data Integration lets you create custom components for data source types that the system does not support, to meet data synchronization needs in various business scenarios.

  1. Create a custom component. For more information, see Create a custom offline source type.

  2. Create a data source based on the custom component. For more information, see Example of developing a custom component.

  3. Create an offline single pipeline. For more information, see Create an integration task from a single pipeline.

  4. Submit or publish the pipeline node to the production environment. For more information, see Manage published nodes.

    Note

    If the data development pattern is Basic, you do not need to publish the pipeline node.

  5. Perform O&M on the pipeline node. For more information, see Operation Center.