feature support for StarRocks data synchronization-DataWorks(DataWorks)-阿里云帮助中心

Supported versions

Supports all versions of EMR Serverless StarRocks.
Supports EMR on ECS: StarRocks version 2.1.
Supports StarRocks Community Edition.
Note
- DataWorks connects to StarRocks only over a VPC. Therefore, you must deploy StarRocks Community Edition on an EMR on ECS cluster.
- StarRocks Community Edition is an open platform. If you encounter compatibility issues, submit a ticket for support.

Limitations

For full database real-time synchronization from MySQL to StarRocks, the destination StarRocks table must use a primary key model.
Full database real-time synchronization from MySQL to StarRocks does not support Data Definition Language (DDL) operations other than TRUNCATE. For other DDL operations, you can choose to either ignore them or configure the task to report an error.

Supported data types

Only fields of numeric, string, and date types are supported.

Network connectivity

EMR Serverless StarRocks

To ensure network connectivity, add the IP addresses of your DataWorks resource group to the internal IP address whitelist of the EMR Serverless StarRocks instance.

To obtain the IP address of the DataWorks resource group, see Common configurations: Add a whitelist.
To configure whitelists for the EMR Serverless StarRocks instance:

On the instance details page, in the Basic Information section, click the internal IP address whitelist link next to security group ID to configure the internal IP address whitelist. In the FE Details section, click the public whitelist link next to the Public Endpoint to configure the public whitelist.

Self-managed StarRocks

Ensure that the DataWorks resource group can access the query port, FE port, and BE port of your StarRocks instance. These are typically ports 9030, 8030, and 8040.

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.

Select a connection mode for StarRocks based on your network environment:

Scenario 1: VPC connection

A VPC connection is low-latency and secure, and does not require public network access.

Use case: Your StarRocks instance and the serverless resource group are in the same VPC.
Supported modes: Alibaba Cloud instance mode and connection string mode:
- Select ApsaraDB for RDS: Directly choose the StarRocks instance in the same VPC. The system automatically retrieves connection information, so no manual configuration is required.
- Select User-created Data Store with Public IP Addresses: Manually enter the intranet endpoint, IP address, port, and Load URL of the instance.

Scenario 2: Internet connection

Data transmission over the internet poses security risks. Use security controls such as whitelists and IP-based access control.

Use case: You need to access a StarRocks instance over the internet, for example, across regions or from a local environment.
Supported mode: connection string mode (ensure that public network access is enabled for the StarRocks instance):
- Select User-created Data Store with Public IP Addresses: Manually enter the public endpoint, IP address, port, and Load URL of the instance.

Note

By default, serverless resource groups cannot access the internet. To connect to a StarRocks instance over a public endpoint, configure a NAT gateway and an EIP for the associated VPC. You must also ensure that the resource group can access the query port, FE port, and BE port of the StarRocks instance, which are typically ports 9030, 8030, and 8040.

If you are using EMR Serverless StarRocks, set Host Address/IP Address to the Internal Endpoint or Public network address, and use the query port for the port.

FE: You can find this information on the instance details page.

In the FE Details section, find the public endpoint and query port (the default is 9030).
Database: After you connect to the instance by using the EMR StarRocks Manager, you can find the database in the SQL Editor or Metadata Management view.

Note
To create a database, you can run SQL statements directly in the SQL editor.

Data synchronization tasks

For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.

Single-table batch synchronization

Supported sources: All data sources supported by Data Integration.
For more information, see Codeless UI and Code Editor.
For a complete list of parameters and script samples for the Code Editor, see Appendix: Script samples and parameter descriptions.

Single-table real-time synchronization

Supported source: Kafka
Configuration guide: Configure a real-time data synchronization task

Full database batch synchronization

Supported source: MySQL
Configuration guide: Configure a batch data synchronization task

Full database real-time synchronization

Supported sources: MySQL, Oracle, and PolarDB
Configuration guide: Configure a real-time data synchronization task

Script samples and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Script mode configuration. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script sample

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "id>100",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}

Reader script parameters

Parameter	Description	Required	Default
datasource	The name of the StarRocks data source.	Yes	None
selectedDatabase	The name of the StarRocks database.	No	The database name that is configured for the StarRocks data source.
column	The source table columns to synchronize. To add a SET_VAR hint when reading data from StarRocks, add the hint before the first column name in the column array. For example, to synchronize the `id` column with the `SET_VAR(enable_spill = true)` hint, set the column parameter to `[ "/+ SET_VAR(enable_spill = true)/ id"]`.	Yes	None
where	The filter condition. For example, to synchronize only data generated on the current day, set the where condition to `gmt_create>${bizdate}`. The where condition enables incremental data synchronization. If you do not provide a where clause or leave the where parameter empty, a full data synchronization is performed.	No	None
table	The source table from which you want to synchronize data.	Yes	None
splitPk	The column used for parallel data sharding during synchronization. Specifying this parameter improves performance. For best results, use the table's primary key, as an evenly distributed key helps prevent data hot spots.	No	None

Writer script sample

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {
            "row_delimiter": "\\x02",
            "column_separator": "\\x01"
        },
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "1.1.X.X:8030"
        ],
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": [
        ],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880,
        "strategyOnError": "exit"
    },
    "name": "Writer",
    "category": "writer"
}

Writer script parameters

Parameter	Description	Required	Default
datasource	The name of the StarRocks data source.	Yes	None
selectedDatabase	The name of the StarRocks database.	No	The database name that is configured for the StarRocks data source.
loadProps	Note When writing to StarRocks with Stream Load, the write policy (Upsert or Append) is determined by the destination table's model and requires no configuration. Tables with a primary key model use an Upsert policy, while other models default to an Append policy. The request parameters for StarRocks Stream Load. When importing CSV data, you can configure the following parameters. If no special configuration is required, use {}: column_separator: the column delimiter for CSV import. Default: \t. row_delimiter: the row delimiter for CSV import. Default: \n. If your data contains \t or \n, you must specify other characters as delimiters. Example: `{"column_separator":"\\x01","row_delimiter":"\\x02"}` Stream Load also supports importing JSON data. To do so, configure the following parameter: `{ "format": "json" }` The following parameters are available for the JSON format: strip_outer_array: specifies whether to trim the outermost array structure. Valid values: `true` and `false`. Default value: `false`. JSON data to import may be enclosed in outer brackets (`[]`) that represent an array. In this case, set this parameter to `true`. StarRocks then trims the outer brackets (`[]`) and imports each inner array as a separate row. If set to `false`, StarRocks parses the entire JSON data file as a single array and imports it as one row. `[{"category":1,"author":2},{"category":3,"author":4}]` If set to `true`, StarRocks parses `{"category":1,"author":2}` and `{"category":3,"author":4}` into two rows and imports them into the destination StarRocks table. If set to `false`, StarRocks parses the entire JSON array as a single row and imports it into the destination StarRocks table. ignore_json_size: specifies whether to check the size of the JSON body in an HTTP request. Note By default, the size of the JSON body in an HTTP request cannot exceed `100 MB`. If the JSON body size exceeds `100 MB`, the following error message is returned: `The size of this batch exceed the max size [104857600] of json type data data [8617627793].Set ignore_json_size to skip check,although it may lead huge memory consuming.` To prevent this error, you can add `ignore_json_size: true` to the `HTTP` request header to skip the size check for the JSON body. compression: specifies the compression algorithm to use during Stream Load data transmission. Supported algorithms: `GZIP`, `BZIP2`, `LZ4_FRAME`, and `ZSTD`. strict_mode: specifies whether to enable strict mode. Valid values: `true`: Enables strict mode. StarRocks filters out invalid rows, imports only valid rows, and returns details about the invalid data. `false`: Disables strict mode. StarRocks converts fields that fail the conversion to `NULL` values and imports the rows that contain these `NULL` values along with valid data rows. Default value: `false`.	Yes	None
column	The destination columns to which you want to synchronize data.	Yes	None
loadUrl	The IP address and HTTP port of the StarRocks frontend (FE) node. The port defaults to `8030`. For multiple FE nodes, specify all of them separated by commas (,).	Yes	None
table	The destination table to which you want to synchronize data.	Yes	None
preSql	SQL statements to run before the synchronization task starts. For example, run `TRUNCATE TABLE tablename` to clear existing data from the destination table.	No	None
postSql	SQL statements to run after the synchronization task is complete.	No	None
maxBatchRows	The maximum number of rows to write per batch.	No	500000
maxBatchSize	The maximum size of data to write per batch, in bytes.	No	5242880
strategyOnError	The policy for handling errors that occur during batch writes. Valid values: `exit`: If a write error occurs, the task fails and exits. `batchDirtyData`: If a write error occurs, the task records the current data batch as dirty data. Default value: `exit`.	No	exit