This tutorial uses a user profiling project as an example to demonstrate a full workflow for data synchronization, data processing, and quality monitoring using DataWorks in the China (Shanghai) region. This tutorial requires an EMR Serverless StarRocks cluster, a DataWorks workspace, and the necessary environment configurations.
Prepare the OSS environment
This tutorial uses custom functions, which require you to upload resources to OSS. Before you begin, make sure that you have activated OSS and created an OSS bucket.
Prepare the EMR Serverless StarRocks environment
This tutorial uses EMR Serverless StarRocks for data processing. Ensure you have a StarRocks instance. If you do not have one, you can check your eligibility on the Alibaba Cloud Free Tier page or purchase an instance directly. For purchasing information, see the E-MapReduce Serverless StarRocks product page.
This tutorial uses an instance with the following configuration:
- Instance Type: Compute-storage integrated.
- Region: China (Shanghai).
- Instance Edition: Basic Edition.
  Important: The Basic Edition is intended for trial use and feature testing only and is not covered by a Service Level Agreement (SLA). You can select the Standard Edition based on your business needs.
- Version: 3.1.
The operations in this tutorial are performed in the user_behavior_analysis database. Therefore, after you create an EMR Serverless StarRocks instance, you must create the user_behavior_analysis database. To do so, you can log on to the SQL Editor of the EMR Serverless StarRocks instance and run the following SQL statement.
CREATE DATABASE user_behavior_analysis;
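If you would rather script this step than use the SQL Editor, the same statement can be sent through any Python DB-API cursor that is connected to the instance over the MySQL protocol. The following is a minimal sketch under stated assumptions, not the tutorial's own method: `ensure_database` is a hypothetical helper, `IF NOT EXISTS` is added so that the step can be rerun safely, and `RecordingCursor` is a stand-in that lets the example run without a live instance.

```python
import re


def ensure_database(cursor, name):
    """Create the database if it is missing and return the statement that ran."""
    # Identifiers cannot be bound as query parameters, so validate the name
    # before interpolating it into the statement.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsafe database name: {name!r}")
    statement = f"CREATE DATABASE IF NOT EXISTS {name}"
    cursor.execute(statement)
    return statement


class RecordingCursor:
    """Stand-in for a real DB-API cursor; it only records statements."""

    def __init__(self):
        self.statements = []

    def execute(self, statement):
        self.statements.append(statement)


# In practice, pass a cursor from a MySQL-compatible client instead.
cursor = RecordingCursor()
ensure_database(cursor, "user_behavior_analysis")
print(cursor.statements)
```

Validating the name first keeps the helper safe to call with values that come from configuration files or user input.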
Prepare the DataWorks environment
Before developing in DataWorks, you must activate the DataWorks service. For more information, see Preparations.
1. Create a workspace
- Log on to the DataWorks console. In the top navigation bar, switch to a region where the DataWorks service is available.
- Click Workspaces in the left navigation bar. On the workspace list page, click Create Workspace. For more information, see Create Workspace.
- If you already have a workspace, you can skip this step and use your existing one.
- Because the MySQL and HttpFile data sources for this tutorial are in the China (Shanghai) region, you must also use a workspace in that region.
2. Create a resource group
- Purchase a resource group: DataWorks requires a resource group to run StarRocks tasks. For information about purchasing a resource group, see Use serverless resource groups.
- Establish network connectivity: Ensure network connectivity between the resource group and the StarRocks instance. For more information about network solutions, see Network connectivity solutions.
  - Confirm the network environment of your StarRocks instance: Go to the StarRocks instance details page to find its network configuration information, such as the VPC and vSwitch.
  - Bind DataWorks to the same VPC. On the VPC Binding tab of your DataWorks serverless resource group, select the same VPC and vSwitch as your StarRocks instance to complete the network binding.
  - Add the IP address of the DataWorks serverless resource group to the StarRocks whitelist to allow access.
    - Obtain the outbound IP address of the DataWorks serverless resource group. On the VPC Binding tab of the resource group, find the table in the Data Scheduling & Data Integration section and record the corresponding vSwitch CIDR block value.
    - Click the StarRocks instance name. On the Basic information page of the Instance Details page, click Internal Whitelist to add the vSwitch CIDR block of the DataWorks serverless resource group. In the left navigation bar of the ECS console, choose Network & Security > Security Groups. Click the target security group to go to its details page. In the list of Inbound rules, click Manually Add and add the egress IP address of the DataWorks serverless resource group to the security group rule.
- Configure public network access for the resource group.
  - Log on to the NAT Gateway console, and in the top menu bar, switch to the China (Shanghai) region.
  - Click Create Internet NAT Gateway. Configure the parameters.

    | Parameter | Value |
    | --- | --- |
    | Region | China (Shanghai). |
    | VPC, Associate vSwitch | Select the VPC and vSwitch associated with the resource group. To find them, go to the DataWorks console and switch the region. In the left navigation bar, click Resource Groups. Find the resource group that you created and then click Network Settings in the Operation column. In the Data Scheduling & Data Integration area, you can view the bound VPC and vSwitch. For more information about VPCs and vSwitches, see What is a VPC?. |
    | Access mode | VPC Full-NAT Mode (SNAT). |
    | EIP | Create and associate an EIP. |
    | Service-Linked Role Creation | When you create a NAT Gateway for the first time, you must create a service-linked role. Click Create Service-Linked Role. |

    Note: For parameters not specified in the table, you can retain the default values.
  - Click Buy Now, select the service agreement, and click Confirm to complete the purchase.
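The whitelist step above depends on the recorded vSwitch CIDR block covering the addresses that the resource group uses. One way to sanity-check a recorded value is Python's standard `ipaddress` module. The following is an illustrative sketch only: `is_covered` is a hypothetical helper, and the CIDR block `192.168.0.0/24` and the sample addresses are placeholders, not values from this tutorial.

```python
import ipaddress


def is_covered(address, cidr_block):
    """Return True if the IP address falls inside the given CIDR block."""
    # strict=True rejects blocks with host bits set (for example
    # 192.168.0.5/24), which usually indicates a copy-and-paste mistake.
    network = ipaddress.ip_network(cidr_block, strict=True)
    return ipaddress.ip_address(address) in network


# Placeholder values: replace them with the vSwitch CIDR block recorded on
# the VPC Binding tab and the address that you want to verify.
vswitch_cidr = "192.168.0.0/24"
print(is_covered("192.168.0.17", vswitch_cidr))
print(is_covered("192.168.1.17", vswitch_cidr))
```

Whitelisting the whole CIDR block rather than a single address means the rule keeps working if the resource group is assigned a different address within the same vSwitch.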
3. Create a StarRocks data source
In the DataWorks console, click Management Center in the left navigation bar, and then select the target workspace from the drop-down list to go to the Management Center. In the Management Center, go to the Data Source page, click Add Connection, and select StarRocks. Add the StarRocks instance to the current DataWorks workspace as a data source in Alibaba Cloud Instance Mode.
You must configure both the Development Environment and Production Environment. For Alibaba Cloud Account, select Current Alibaba Cloud Main Account. After selecting your instance, click Get Latest Address to retrieve the internal endpoint of the Frontend node. Set FrontEnd Http Port to 8030 and FrontEnd Query Port to 9030.
- Configure the basic information for the StarRocks data source.
  In the EMR console, open the Instance Details page of your StarRocks instance and use that information to fill in the Basic Information section of the StarRocks data source in DataWorks. This step is required for the synchronization and processing tasks later in this tutorial. The parameters are described below.

  | Parameter | Description |
  | --- | --- |
  | Data Source Name | Enter a name for the data source. For this tutorial, use Doc_StarRocks_Storage_Compute_Tightly_01. |
  | Description | Enter an optional description for the data source. |
  | Configuration Mode | Alibaba Cloud Instance Mode |
  | Region | China (Shanghai) |
  | Instance | Select your Serverless instance. |
  | Database Name | Enter the StarRocks database name. For this tutorial, use user_behavior_analysis. All data operations will be performed in this database. |
  | Username | Your StarRocks database username. |
  | Password | Your StarRocks database password. |

- Test resource connectivity: After the connectivity test passes, click Complete Modification to create the StarRocks data source.
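If the connectivity test fails, it can help to confirm that the two FrontEnd ports used above (8030 for HTTP and 9030 for queries) accept TCP connections from a machine inside the same VPC. The following is a minimal sketch that uses only the standard library. `port_is_open` and `check_frontend` are hypothetical helpers, and `fe.example.internal` in the comment is a placeholder for the internal endpoint returned by Get Latest Address. A successful TCP connection confirms only the network path, not the credentials.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_frontend(host, ports=(8030, 9030)):
    """Map each FrontEnd port to its reachability from this machine."""
    return {port: port_is_open(host, port) for port in ports}


# Example call (placeholder host name, run it from inside the bound VPC):
#   check_frontend("fe.example.internal")
```

A result in which one port is reachable and the other is not usually points to a whitelist or security group rule rather than to VPC binding.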
4. Create a MySQL data source
- On the Management Center page, go to the Data Source page and click Add Connection.
- In the Add Data Source dialog box, search for and select MySQL.
- On the Create MySQL data source page, configure the parameters using the example values for both the development and production environments.

  | Parameter | Description |
  | --- | --- |
  | Data Source Name | Enter a name for the data source. For this tutorial, enter user_behavior_analysis_mysql. |
  | Description | A dedicated, read-only data source for this tutorial. It provides test data from the platform for batch synchronization tasks and cannot be used by other modules. |
  | Configuration Mode | Select User-created Data Store with Public IP Addresses. |
  | Connection Address | Hostname: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com. Port Number: 3306. |
  | Database Name | For this tutorial, enter workshop. |
  | Username | For this tutorial, enter workshop. |
  | Password | For this tutorial, enter workshop#2017. |
  | Authentication Method | None. |

- For the specified resource group, click Test Connectivity under both Connection Status (Development Environment) and Connection Status (Production Environment) until the status changes to Connectable.
- Click Complete Creation.
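The hostname, port, and database name above correspond to a MySQL JDBC-style connection string of the form jdbc:mysql://host:port/database. The following sketch assembles that string from the tutorial's example values, which can help when you double-check what you typed into the form. `MySQLSource` and `jdbc_url` are hypothetical names used only for illustration, and it is an assumption that the console shows the address in exactly this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MySQLSource:
    """Connection fields of a MySQL data source (the password is left out)."""

    host: str
    port: int
    database: str
    username: str

    def jdbc_url(self):
        """Return the JDBC-style connection string for these fields."""
        if not 1 <= self.port <= 65535:
            raise ValueError(f"port out of range: {self.port}")
        return f"jdbc:mysql://{self.host}:{self.port}/{self.database}"


# Example values taken from the parameter table of this tutorial.
source = MySQLSource(
    host="rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com",
    port=3306,
    database="workshop",
    username="workshop",
)
print(source.jdbc_url())
```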
5. Create an HttpFile data source
Go to the Data Source page, click Add Connection, select HttpFile, and then click Create HttpFile data source to add the HttpFile data source to the current DataWorks workspace.
During configuration, enable both the Development Environment and Production Environment and enter the same information for both. Keep the default value {} for Default Header.
- Configure the basic information for the HttpFile data source.
  When you create an HttpFile data source, the parameters for Basic information are described as follows.

  | Parameter | Description |
  | --- | --- |
  | Data Source Name | Enter a display name for the public HttpFile data source in your workspace. For this tutorial, use user_behavior_analysis_httpfile. |
  | Description | Enter an optional description for the data source. This data source is provided for the DataWorks tutorial. It provides access to test data for batch synchronization tasks and is intended for read operations in data integration scenarios only. Other modules are not supported. |
  | URL Domain Name | Enter https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com. |

- Test resource connectivity: After the connectivity test is successful, click Complete Modification to create the HttpFile data source.
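The HttpFile data source stores only a domain. Assuming that file paths are supplied later and resolved against that domain, the following sketch shows how the two parts combine into a full URL using the standard library. `file_url` is a hypothetical helper, and the path `example/data.txt` is a placeholder, not a file named in this tutorial.

```python
from urllib.parse import urljoin, urlsplit

URL_DOMAIN = "https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com"


def file_url(domain, path):
    """Join a data source domain and a relative file path into a full URL."""
    parts = urlsplit(domain)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an HTTP(S) domain: {domain!r}")
    # Normalize slashes so that exactly one separates the domain and the path.
    return urljoin(domain.rstrip("/") + "/", path.lstrip("/"))


# Placeholder path used only to demonstrate the joining behavior.
print(file_url(URL_DOMAIN, "example/data.txt"))
```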
Next steps
Now that you have prepared the environment, you can proceed to the next tutorial, where you will learn to synchronize user basic information and website access log data to StarRocks. For more information, see Synchronize data.