This tutorial uses a user profiling scenario to show how DataWorks handles an end-to-end workflow (data synchronization, data processing, and quality monitoring) in the China (Shanghai) region. Before you start, prepare an EMR Serverless Spark workspace and a DataWorks workspace, and configure the environment as described in this topic.
DataWorks preparation
Make sure DataWorks is activated for your account. If DataWorks is not activated, go to the DataWorks page to activate it. For more information, see Purchase.
EMR Serverless Spark workspace preparation
This tutorial uses EMR Serverless Spark as the compute resource. Make sure you have a Spark workspace. If you do not have one, go to the E-MapReduce console, select Spark, and create a workspace with the following settings:
-
Region: China (Shanghai).
-
Payment Type: pay-as-you-go.
-
Workspace Name: Enter a custom name.
-
DLF for Metadata Storage: Select a DLF data catalog to bind to the workspace. To isolate metadata between different EMR clusters, use a different catalog for each.
-
Workspace Directory: Select an OSS bucket path to store task logs.
-
Workspace Type: For this tutorial, select Professional Edition.
Note
-
Professional Edition: Extends the Basic Edition with advanced features and performance improvements, making it ideal for large-scale ETL jobs.
-
Basic Edition: Includes core features and a powerful compute engine.
Private OSS bucket preparation
Later in the tutorial, you synchronize user information and website access logs to a private OSS bucket for data modeling and analysis. Create the bucket as follows:
-
Log on to the OSS console.
-
In the left-side navigation pane, click Buckets. On the Buckets page, click Create Bucket.
-
In the Create Bucket dialog box, configure the parameters and click Complete Creation.
-
Bucket Name: Enter a custom name.
-
Region: Select China (Shanghai).
-
HDFS service: Enable the HDFS service as prompted on the page.
For more information about the parameters, see Create a bucket.
-
On the Buckets page, click the Bucket Name to go to the File Management page.
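Because Bucket Name is a custom value, it can help to check a candidate name before you open the console. The following minimal Python sketch encodes the OSS naming rules (3 to 63 characters; lowercase letters, digits, and hyphens only; starting and ending with a lowercase letter or digit). The function name is_valid_bucket_name is an illustrative helper, not part of any Alibaba Cloud SDK, and the console remains the authority on whether a name is accepted or already taken.

```python
import re

# OSS bucket naming rules: 3-63 characters; lowercase letters, digits, and
# hyphens only; must start and end with a lowercase letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` satisfies the OSS bucket naming rules.

    Hypothetical helper for illustration; it does not check whether the
    name is already in use.
    """
    return _BUCKET_NAME.fullmatch(name) is not None
```

For example, is_valid_bucket_name("dw-emr-demo") accepts the sample name used later in this tutorial, while a name that contains uppercase letters or underscores is rejected.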
DataWorks environment preparation
After you prepare DataWorks, EMR Serverless Spark, and OSS, you can create a workspace in DataWorks, register a Spark cluster, and create data sources. This prepares the environment for data synchronization and processing.
DataWorks workspace creation
-
Log on to the DataWorks console.
-
In the left-side navigation pane, click Workspaces to go to the workspace list page.
-
Click Create Workspace. In the panel that appears on the left, create a Standard Mode workspace and select the Isolate Development and Production Environments option.
We recommend creating the workspace in China (Shanghai), where the tutorial's data resources are located. This prevents network connectivity issues when you add data sources from other regions. For a simpler setup, you can select No for the Isolate Development and Production Environments parameter.
Resource group creation
Before you use DataWorks, you must create a resource group. The resource group provides the runtime resources for data synchronization and scheduling. Ensure that the resource group can connect to Serverless Spark.
-
Purchase a serverless resource group.
-
Log on to the DataWorks console, switch to the target region, and then click Resource Group in the left-side navigation pane to go to the resource group list page.
-
Click Create Resource Group. On the resource group purchase page, set Region to China (Shanghai) and specify a Resource Group Name. Configure other parameters as prompted and complete the payment. For information about the billing of serverless resource groups, see Billing of serverless resource groups.
Note
This tutorial uses a serverless resource group in the China (Shanghai) region as an example. Serverless resource groups do not support cross-region operations.
-
Configure the serverless resource group.
-
Log on to the DataWorks console, switch to the target region, and then click Resource Group in the left-side navigation pane to go to the resource group list page.
-
Find the serverless resource group you purchased, and in the Operation column, click Associate Workspace to bind it to the DataWorks workspace you created.
-
Configure internet access for the resource group.
-
Log on to the NAT Gateway console. In the top navigation bar, switch to the China (Shanghai) region.
-
Click Create Internet NAT Gateway and configure the parameters.
-
Region: China (Shanghai).
-
Network and Zone: Select the VPC and vSwitch that are bound to the resource group. To find them, go to the DataWorks console, switch to the target region, and click Resource Groups in the left-side navigation pane. Find the resource group that you created and click Network Settings in the Operation column. The bound VPC and vSwitch are shown in the Data Scheduling & Data Integration section. For more information about VPCs and vSwitches, see What is VPC?.
-
Network Type: Public.
-
EIP: Select Create EIP.
-
Service-linked role: If you are creating a NAT gateway for the first time, click Create Service-Linked Role to create the required service-linked role.
Note
You can use the default values for any parameters not listed here.
-
Click Buy Now, select the terms of service, and then click Confirm to complete the purchase.
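Every resource in this section is created in China (Shanghai), and serverless resource groups do not support cross-region operations. One way to audit this is to write down each resource with its region ID and flag any mismatch. The sketch below assumes you record the regions by hand; the resource names are hypothetical, and cn-shanghai is the region ID that appears in the OSS endpoints used in this tutorial.

```python
from typing import Dict, List

TUTORIAL_REGION = "cn-shanghai"  # region ID for China (Shanghai)


def find_region_mismatches(resources: Dict[str, str],
                           expected: str = TUTORIAL_REGION) -> List[str]:
    """Return the names of resources whose region differs from `expected`."""
    return sorted(name for name, region in resources.items()
                  if region != expected)


# Hypothetical inventory, filled in by hand from the consoles.
inventory = {
    "dataworks-workspace": "cn-shanghai",
    "serverless-resource-group": "cn-shanghai",
    "nat-gateway": "cn-hangzhou",  # deliberately wrong, for illustration
}
mismatches = find_region_mismatches(inventory)
```

An empty result means that everything you listed is in the tutorial region. In this illustration, mismatches contains only "nat-gateway".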
EMR Serverless Spark cluster registration
Data storage and processing for the user profiling workflow run on an EMR Serverless Spark cluster, which you must register in advance.
-
Log on to the DataWorks console. In the target region, use the left-side navigation pane to select a workspace from the drop-down list, and then click Go to Management Center.
-
In the left-side navigation pane, click Clusters to go to the cluster management page. Then, click Register Cluster. In the dialog box that appears, select E-MapReduce to configure the EMR Serverless Spark cluster.
-
Register an E-MapReduce cluster.
-
Display Name of Cluster: Enter a custom name.
-
Alibaba Cloud Account to Which Cluster Belongs: Select the current Alibaba Cloud Account.
-
Cluster Type: EMR Serverless Spark.
-
Workspace Created in EMR Serverless Spark: Select the workspace that you prepared in the EMR Serverless Spark workspace preparation section.
-
Default Engine Version: When you create an EMR Spark node in DataStudio, this engine version is used by default. To set different engine versions for different nodes, configure the versions on the Advanced Settings tab of the node editor.
-
Default Resource Queue: When you create an EMR Spark node in DataStudio, this resource queue is used by default. To set different resource queues for different nodes, configure the queues on the Advanced Settings tab of the node editor.
-
Default SQL Compute: When you create an EMR Spark SQL node in DataStudio, this SQL Compute instance is used by default. To set different SQL Compute instances for different nodes, configure the instances on the Advanced Settings tab of the node editor.
-
Default Access Identity: In the development environment, the default identity is Executor. In the production environment, you can select Alibaba Cloud primary account, Alibaba Cloud RAM sub-account, or Node Owner.
Note
The configurations in this tutorial are for demonstration purposes. If your scenario is different, see DataStudio (legacy): Bind an EMR compute engine.
Data source creation
This tutorial provides a MySQL database that stores user information and an OSS bucket that stores user log data. You must add them as data sources in DataWorks to enable data synchronization.
-
To access the test data provided by the platform for this tutorial, you must add the corresponding data sources to your workspace.
-
The data provided in this tutorial is for hands-on practice with DataWorks only. All data is mock data and is read-only through the Data Integration feature.
-
The bucket that you created in the Private OSS bucket preparation section is used to store user information from the MySQL data source and log data from the HttpFile data source.
MySQL data source
In this tutorial, the database for the MySQL data source is provided by the platform. It serves as the data source for Data Integration tasks and provides user information.
-
On the Management Center page, click Data Source. Then, click Add Connection.
-
In the Add Data Source dialog box, search for and select MySQL.
-
On the Add MySQL Data Source page, configure the parameters. For this tutorial, use the sample values for both the development and production environments.
-
Data Source Name: Enter a name for the data source. For this tutorial, enter user_behavior_analysis_mysql.
-
Description: A dedicated data source for the DataWorks tutorial. Read data from this source in a synchronization task to access the provided test data. This data source supports reading only in data integration scenarios and is unavailable for other modules.
-
Configuration Mode: Select User-created Data Store with Public IP Addresses.
-
Connection Address: Set Host IP to rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com and Port to 3306.
-
Database Name: Enter the database name. For this tutorial, enter workshop.
-
Username: Enter the username. For this tutorial, enter workshop.
-
Password: Enter the password. For this tutorial, enter workshop#2017.
-
Authentication Method: None.
-
For the specified resource group, click Test Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Wait until the status changes to Connectable.
-
Click Complete Creation.
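If Test Connectivity fails, a plain TCP probe of the host and port can help separate network problems from credential problems. The following is a minimal sketch using only the Python standard library; can_reach is a hypothetical helper name. It runs from wherever you execute it, not from the DataWorks resource group, so its result can differ from the status shown in the console.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This checks network reachability only; it does not log on to MySQL.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

To probe the tutorial database, call can_reach("rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com", 3306) with the host and port from the parameter list above.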
HttpFile data source
In this tutorial, the HttpFile data source is an OSS bucket provided by the platform. It serves as the data source for Data Integration tasks and provides log data.
-
Log on to the DataWorks console. In the target region, use the left-side navigation pane to select a workspace from the drop-down list, and then click Go to Management Center.
-
On the Workspace Management page, click Data Sources in the left-side navigation pane to open the data source page.
-
Click Add Connection. In the Add Connection dialog box, search for and select HttpFile.
-
On the Add HttpFile Data Source page, configure the parameters. For this tutorial, use the sample values for both the development and production environments.
-
Data Source Name: Enter a name for the data source. For this tutorial, enter user_behavior_analysis_httpfile.
-
Description: A dedicated data source for the DataWorks tutorial. Read data from this source in a synchronization task to access the provided test data. This data source supports reading only in data integration scenarios and is unavailable for other modules.
-
URL domain: For both the development and production environments, enter https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com.
-
For the specified resource group, click Test Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Wait until the status changes to Connectable.
Important
At least one resource group must have a status of Connectable. Otherwise, you cannot use the wizard mode to create a synchronization task for this data source.
-
Click Complete Creation.
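An HttpFile data source stores only the URL domain; the file to read is identified by a path relative to that domain when a synchronization task is configured. As a rough sketch of how the two parts combine into a full URL, assuming the path is simply appended to the domain, the helper below joins them with exactly one slash. The function name build_file_url and the sample path are hypothetical; the actual file paths are given in the synchronization tutorial.

```python
from urllib.parse import quote

# URL domain from the parameter list above.
URL_DOMAIN = "https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com"


def build_file_url(url_domain: str, file_path: str) -> str:
    """Join a URL domain and a file path with exactly one slash between them.

    The path is percent-encoded, with "/" kept as the separator.
    """
    return url_domain.rstrip("/") + "/" + quote(file_path.lstrip("/"))


# Hypothetical path, for illustration only.
example_url = build_file_url(URL_DOMAIN, "/example/path/data.txt")
```

Normalizing the slash on both sides avoids the doubled or missing separators that commonly cause 404 responses when a domain and a path are concatenated by hand.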
Private OSS data source
In this tutorial, you use the OSS bucket that you created earlier as a private OSS data source, that is, an OSS data source backed by your own bucket. It is the destination in Data Integration for user information from the tutorial's MySQL data source and log data from the HttpFile data source.
-
In Management Center, go to the Data Sources page and click Add Connection.
-
In the Add Connection dialog box, search for and select OSS.
-
In the Add OSS data source dialog box, configure the parameters.
-
Data Source Name: Enter a name for the data source. Example: test_g.
-
Description: Enter a brief description of the data source.
-
Endpoint: Enter http://oss-cn-shanghai-internal.aliyuncs.com.
-
oss-bucket: The name of the OSS bucket that you created during environment preparation. Example: dw-emr-demo.
-
Access mode: The authentication method used to access the bucket.
-
RAM role authorization mode: This method uses Security Token Service (STS) to authorize the service account of a cloud service to assume a role and access the data source. This provides enhanced security. For more information, see Configure a data source by using a RAM role.
-
Access Key mode: Requires the following credentials.
-
AccessKey ID: The AccessKey ID of the current account. You can copy the AccessKey ID from the Security Settings page.
-
AccessKey Secret: Enter the AccessKey secret of the current account.
Important
The AccessKey secret is displayed only when you create it and cannot be viewed later. Keep it confidential. If your AccessKey is leaked or lost, delete it and create a new one.
Note
For the authentication method, select either RAM role authorization mode or Access Key mode.
-
In the Connected state column for the specified resource group, click Test Connectivity. Wait for the test to complete and the status to change to Connectable.
Important
Make sure that at least one resource group has a status of Connectable. Otherwise, you cannot use the wizard to create synchronization tasks for this data source.
-
Click Complete.
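The Endpoint value above follows the standard OSS pattern of a region ID with an optional -internal suffix, and data in the bucket is typically addressed from Spark with an oss:// URI. The sketch below shows both conventions, assuming the standard aliyuncs.com endpoint naming. The helper names oss_endpoint and oss_uri and the sample object path are hypothetical.

```python
def oss_endpoint(region_id: str, internal: bool = True, scheme: str = "http") -> str:
    """Build an OSS endpoint URL for a region.

    internal=True yields the internal endpoint used in this tutorial;
    internal=False yields the public endpoint.
    """
    suffix = "-internal" if internal else ""
    return f"{scheme}://oss-{region_id}{suffix}.aliyuncs.com"


def oss_uri(bucket: str, key: str = "") -> str:
    """Build an oss://bucket/key URI, without a doubled slash after the bucket."""
    return f"oss://{bucket}/{key.lstrip('/')}"


endpoint = oss_endpoint("cn-shanghai")
# Hypothetical object path inside the sample bucket name from this tutorial.
location = oss_uri("dw-emr-demo", "/ods/user_info/")
```

Here endpoint evaluates to http://oss-cn-shanghai-internal.aliyuncs.com, which matches the Endpoint value entered above. The internal endpoint is reachable only from within Alibaba Cloud in the same region, which is one more reason to keep the bucket and the resource group in China (Shanghai).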
Next steps
You have completed the environment setup. You can now proceed to the next tutorial, where you will learn how to synchronize user profile data and website access logs to OSS, and then use Spark SQL to create an external table for the data in your private OSS. For more information, see Synchronize data.