The StarRocks data source allows you to read from and write to StarRocks. This topic describes how to synchronize data with StarRocks by using DataWorks.
Supported versions
-
Supports all versions of EMR Serverless StarRocks.
-
Supports EMR on ECS: StarRocks version 2.1.
-
Supports StarRocks Community Edition.
Note-
DataWorks connects to StarRocks only over a VPC. Therefore, you must deploy StarRocks Community Edition on an EMR on ECS cluster.
-
The StarRocks Community Edition is an open platform. If you encounter compatibility issues when using this data source, submit a ticket for support.
-
Limitations
-
For full database real-time synchronization from MySQL to StarRocks, the destination StarRocks table must use a primary key model.
-
Full database real-time synchronization from MySQL to StarRocks does not support Data Definition Language (DDL) operations other than TRUNCATE. For other DDL operations, you can choose to either ignore them or configure the task to report an error.
Supported data types
Only fields of numeric, string, and date types are supported.
Network connectivity
EMR Serverless StarRocks
To ensure network connectivity, add the IP addresses of the DataWorks resource group you will use to the internal IP address whitelist of your EMR Serverless StarRocks instance.
-
To obtain the IP address of the DataWorks resource group, see Common configurations: Add a whitelist.
-
To configure whitelists for your EMR Serverless StarRocks instance, follow these steps:
On the instance details page, in the Basic Information section, click the internal IP address whitelist link next to security group ID to configure the internal IP address whitelist. In the FE Details section, click the public whitelist link next to the Public Endpoint to configure the public whitelist.
Self-managed StarRocks
Ensure that the DataWorks resource group can access the query port, FE port, and BE port of your StarRocks instance. These are typically ports 9030, 8030, and 8040.
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.
Select a connection mode for StarRocks based on your network environment:
Scenario 1: VPC connection
A VPC connection is low-latency and secure, and does not require public network access.
-
Use case: Your StarRocks instance and the serverless resource group are in the same VPC.
-
Supported modes: Alibaba Cloud instance mode and connection string mode:
-
Select ApsaraDB for RDS: Directly choose the StarRocks instance in the same VPC. The system automatically retrieves connection information, so no manual configuration is required.
-
Select User-created Data Store with Public IP Addresses: Manually enter the intranet endpoint, IP address, port, and Load URL of the instance.
-
Scenario 2: Internet connection
Data transmission over the internet poses security risks. Use security controls such as whitelists and IP-based access control.
-
Use case: You need to access a StarRocks instance over the internet, for example, across regions or from a local environment.
-
Supported mode: connection string mode (ensure that public network access is enabled for the StarRocks instance):
-
Select User-created Data Store with Public IP Addresses: Manually enter the public endpoint, IP address, port, and Load URL of the instance.
-
By default, serverless resource groups cannot access the internet. To connect to a StarRocks instance by using a public endpoint, you must configure a NAT gateway and an EIP for the associated VPC to enable internet access. You must also ensure that the resource group can access the query port, FE port, and BE port of the StarRocks instance, which are typically ports 9030, 8030, and 8040.
If you are using EMR Serverless StarRocks, set Host Address/IP Address to the Internal Endpoint or Public network address, and use the query port for the port.
-
FE: You can find this information on the instance details page.
In the FE Details section, find the public endpoint and query port (the default is
9030). -
Database: After you connect to the instance by using the EMR StarRocks Manager, you can find the database in the SQL Editor or Metadata Management view.
NoteTo create a database, you can run SQL statements directly in the SQL editor.
Data synchronization tasks
For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.
Single-table batch synchronization
-
Supported sources: All data sources supported by Data Integration.
-
For more information, see Codeless UI and Code Editor.
-
For a complete list of parameters and script samples for the Code Editor, see Appendix: Script samples and parameter descriptions.
Single-table real-time synchronization
-
Supported source: Kafka
-
Configuration guide: Configure a real-time data synchronization task
Full database batch synchronization
-
Supported source: MySQL
-
Configuration guide: Configure a batch data synchronization task
Full database real-time synchronization
-
Supported sources: MySQL, Oracle, and PolarDB
-
Configuration guide: Configure a real-time data synchronization task
Script samples and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Use the Code Editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script sample
{
"stepType": "starrocks",
"parameter": {
"selectedDatabase": "didb1",
"datasource": "starrocks_datasource",
"column": [
"id",
"name"
],
"where": "id>100",
"table": "table1",
"splitPk": "id"
},
"name": "Reader",
"category": "reader"
}
Reader script parameters
|
Parameter |
Description |
Required |
Default |
|
datasource |
The name of the StarRocks data source. |
Yes |
None |
|
selectedDatabase |
The name of the StarRocks database. |
No |
The database name that is configured for the StarRocks data source. |
|
column |
The source table columns to synchronize. To add a SET_VAR hint when you read data from StarRocks, add the hint before the first column name in the column array. For example, if you want to synchronize the |
Yes |
None |
|
where |
The filter condition. In a typical business scenario where you synchronize data generated on the current day, you can set the where condition to
|
No |
None |
|
table |
The source table from which you want to synchronize data. |
Yes |
None |
|
splitPk |
Specifying the splitPk parameter enables parallel data synchronization by performing data sharding based on the specified column. This improves performance. For best results, use the table's primary key for the splitPk parameter, as an evenly distributed key helps prevent data hot spots. |
No |
None |
Writer script sample
{
"stepType": "starrocks",
"parameter": {
"selectedDatabase": "didb1",
"loadProps": {
"row_delimiter": "\\x02",
"column_separator": "\\x01"
},
"datasource": "starrocks_public",
"column": [
"id",
"name"
],
"loadUrl": [
"1.1.X.X:8030"
],
"table": "table1",
"preSql": [
"truncate table table1"
],
"postSql": [
],
"maxBatchRows": 500000,
"maxBatchSize": 5242880,
"strategyOnError": "exit"
},
"name": "Writer",
"category": "writer"
}
Writer script parameters
|
Parameter |
Description |
Required |
Default |
|
datasource |
The name of the StarRocks data source. |
Yes |
None |
|
selectedDatabase |
The name of the StarRocks database. |
No |
The database name that is configured for the StarRocks data source. |
|
loadProps |
Note
When writing to StarRocks with Stream Load, the write policy (Upsert or Append) is determined by the destination table's model and requires no configuration. Tables with a primary key model use an Upsert policy, while other models default to an Append policy. The request parameters for StarRocks Stream Load. When you use Stream Load to import CSV data, you can optionally configure import parameters. If no special configuration is required, use {}. The configurable parameters include:
If your data contains \t or \n, you must specify other characters as delimiters. Example:
Stream Load also supports importing JSON data. To do so, configure the following parameter:
The following parameters are available for the JSON format:
|
Yes |
None |
|
column |
The destination columns to which you want to synchronize data. |
Yes |
None |
|
loadUrl |
The IP address and HTTP port of the StarRocks frontend (FE) node. The port typically defaults to |
Yes |
None |
|
table |
The destination table to which you want to synchronize data. |
Yes |
None |
|
preSql |
The SQL statements to run before the data synchronization task starts. For example, you can run |
No |
None |
|
postSql |
The SQL statements to run after the data synchronization task is complete. |
No |
None |
|
maxBatchRows |
The maximum number of rows to write per batch. |
No |
500000 |
|
maxBatchSize |
The maximum size of data to write per batch, in bytes. |
No |
5242880 |
|
strategyOnError |
The policy for handling errors that occur during batch writes. Valid values:
Default value: |
No |
exit |