StarRocks data source

更新时间:
复制 MD 格式

The StarRocks data source allows you to read from and write to StarRocks. This topic describes how to synchronize data with StarRocks by using DataWorks.

Supported versions

  • Supports all versions of EMR Serverless StarRocks.

  • Supports EMR on ECS: StarRocks version 2.1.

  • Supports StarRocks Community Edition.

    Note
    • DataWorks connects to StarRocks only over a VPC. Therefore, you must deploy StarRocks Community Edition on an EMR on ECS cluster.

    • The StarRocks Community Edition is an open platform. If you encounter compatibility issues when using this data source, submit a ticket for support.

Limitations

  • For full database real-time synchronization from MySQL to StarRocks, the destination StarRocks table must use a primary key model.

  • Full database real-time synchronization from MySQL to StarRocks does not support Data Definition Language (DDL) operations other than TRUNCATE. For other DDL operations, you can choose to either ignore them or configure the task to report an error.

Supported data types

Only fields of numeric, string, and date types are supported.

Network connectivity

EMR Serverless StarRocks

To ensure network connectivity, add the IP addresses of the DataWorks resource group you will use to the internal IP address whitelist of your EMR Serverless StarRocks instance.

  • To obtain the IP address of the DataWorks resource group, see Common configurations: Add a whitelist.

  • To configure whitelists for your EMR Serverless StarRocks instance, follow these steps:

    On the instance details page, in the Basic Information section, click the internal IP address whitelist link next to security group ID to configure the internal IP address whitelist. In the FE Details section, click the public whitelist link next to the Public Endpoint to configure the public whitelist.

Self-managed StarRocks

Ensure that the DataWorks resource group can access the query port, FE port, and BE port of your StarRocks instance. These are typically ports 9030, 8030, and 8040.

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.

Select a connection mode for StarRocks based on your network environment:

Scenario 1: VPC connection

A VPC connection is low-latency and secure, and does not require public network access.

  • Use case: Your StarRocks instance and the serverless resource group are in the same VPC.

  • Supported modes: Alibaba Cloud instance mode and connection string mode:

    • Select ApsaraDB for RDS: Directly choose the StarRocks instance in the same VPC. The system automatically retrieves connection information, so no manual configuration is required.

    • Select User-created Data Store with Public IP Addresses: Manually enter the intranet endpoint, IP address, port, and Load URL of the instance.

Scenario 2: Internet connection

Data transmission over the internet poses security risks. Use security controls such as whitelists and IP-based access control.

  • Use case: You need to access a StarRocks instance over the internet, for example, across regions or from a local environment.

  • Supported mode: connection string mode (ensure that public network access is enabled for the StarRocks instance):

    • Select User-created Data Store with Public IP Addresses: Manually enter the public endpoint, IP address, port, and Load URL of the instance.

Note

By default, serverless resource groups cannot access the internet. To connect to a StarRocks instance by using a public endpoint, you must configure a NAT gateway and an EIP for the associated VPC to enable internet access. You must also ensure that the resource group can access the query port, FE port, and BE port of the StarRocks instance, which are typically ports 9030, 8030, and 8040.

If you are using EMR Serverless StarRocks, set Host Address/IP Address to the Internal Endpoint or Public network address, and use the query port for the port.

  • FE: You can find this information on the instance details page.

    In the FE Details section, find the public endpoint and query port (the default is 9030).

  • Database: After you connect to the instance by using the EMR StarRocks Manager, you can find the database in the SQL Editor or Metadata Management view.

    Note

    To create a database, you can run SQL statements directly in the SQL editor.

Data synchronization tasks

For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.

Single-table batch synchronization

Single-table real-time synchronization

Full database batch synchronization

Full database real-time synchronization

Script samples and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Use the Code Editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script sample

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "id>100",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}

Reader script parameters

Parameter

Description

Required

Default

datasource

The name of the StarRocks data source.

Yes

None

selectedDatabase

The name of the StarRocks database.

No

The database name that is configured for the StarRocks data source.

column

The source table columns to synchronize. To add a SET_VAR hint when you read data from StarRocks, add the hint before the first column name in the column array. For example, if you want to synchronize the id column and add the SET_VAR(enable_spill = true) hint, set the column parameter to [ "/*+ SET_VAR(enable_spill = true)*/ id"].

Yes

None

where

The filter condition. In a typical business scenario where you synchronize data generated on the current day, you can set the where condition to gmt_create>${bizdate}.

  • The where condition enables efficient incremental data synchronization.

  • If you do not provide a where clause, or if you do not provide a key or value for the where parameter, the operation is treated as a full data synchronization.

No

None

table

The source table from which you want to synchronize data.

Yes

None

splitPk

Specifying the splitPk parameter enables parallel data synchronization by performing data sharding based on the specified column. This improves performance. For best results, use the table's primary key for the splitPk parameter, as an evenly distributed key helps prevent data hot spots.

No

None

Writer script sample

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {
            "row_delimiter": "\\x02",
            "column_separator": "\\x01"
        },
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "1.1.X.X:8030"
        ],
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": [
        ],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880,
        "strategyOnError": "exit"
    },
    "name": "Writer",
    "category": "writer"
}

Writer script parameters

Parameter

Description

Required

Default

datasource

The name of the StarRocks data source.

Yes

None

selectedDatabase

The name of the StarRocks database.

No

The database name that is configured for the StarRocks data source.

loadProps

Note

When writing to StarRocks with Stream Load, the write policy (Upsert or Append) is determined by the destination table's model and requires no configuration. Tables with a primary key model use an Upsert policy, while other models default to an Append policy.

The request parameters for StarRocks Stream Load. When you use Stream Load to import CSV data, you can optionally configure import parameters. If no special configuration is required, use {}. The configurable parameters include:

  • column_separator: the column delimiter for CSV import. Default: \t.

  • row_delimiter: the row delimiter for CSV import. Default: \n.

If your data contains \t or \n, you must specify other characters as delimiters. Example:

{"column_separator":"\\x01","row_delimiter":"\\x02"}

Stream Load also supports importing JSON data. To do so, configure the following parameter:

{
  "format": "json"
}

The following parameters are available for the JSON format:

  • strip_outer_array: specifies whether to trim the outermost array structure. Valid values: true and false. Default value: false.

    In real-world business scenarios, the JSON data to be imported may be enclosed in an outermost pair of brackets ([]) that represent an array. In this case, we recommend that you set this parameter to true. StarRocks then trims the outer brackets ([]) and imports each inner array as a separate row. If you set this parameter to false, StarRocks parses the entire JSON data file as a single array and imports it as one row.

    [{"category":1,"author":2},{"category":3,"author":4}]
    
    • If you set this parameter to true, StarRocks parses {"category":1,"author":2} and {"category":3,"author":4} into two rows and imports them into the corresponding rows in the destination StarRocks table.

    • If you set this parameter to false, StarRocks parses the entire JSON array as a single row and imports it into the destination StarRocks table.

  • ignore_json_size: specifies whether to check the size of the JSON body in an HTTP request.

    Note

    By default, the size of the JSON body in an HTTP request cannot exceed 100 MB. If the JSON body size exceeds 100 MB, the following error message is returned: The size of this batch exceed the max size [104857600] of json type data data [8617627793].Set ignore_json_size to skip check,although it may lead huge memory consuming. To prevent this error, you can add ignore_json_size: true to the HTTP request header to skip the size check for the JSON body.

  • compression: specifies the compression algorithm to use during Stream Load data transmission. Supported algorithms: GZIP, BZIP2, LZ4_FRAME, and ZSTD.

  • strict_mode: specifies whether to enable strict mode.

    Valid values:

    • true: Enables strict mode. StarRocks filters out invalid rows, imports only valid rows, and returns details about the invalid data.

    • false: Disables strict mode. StarRocks converts fields that fail the conversion to NULL values and imports the rows that contain these NULL values along with valid data rows.

    Default value: false.

Yes

None

column

The destination columns to which you want to synchronize data.

Yes

None

loadUrl

The IP address and HTTP port of the StarRocks frontend (FE) node. The port typically defaults to 8030. If you have multiple FE nodes, you can specify all of them, separated by commas (,).

Yes

None

table

The destination table to which you want to synchronize data.

Yes

None

preSql

The SQL statements to run before the data synchronization task starts. For example, you can run TRUNCATE TABLE tablename to clear existing data from the destination table.

No

None

postSql

The SQL statements to run after the data synchronization task is complete.

No

None

maxBatchRows

The maximum number of rows to write per batch.

No

500000

maxBatchSize

The maximum size of data to write per batch, in bytes.

No

5242880

strategyOnError

The policy for handling errors that occur during batch writes.

Valid values:

  • exit: If a write error occurs, the task fails and exits.

  • batchDirtyData: If a write error occurs, the task records the current data batch as dirty data.

Default value: exit.

No

exit