EMR Spark Streaming nodes provide high-throughput, fault-tolerant processing for real-time data streams, which helps you quickly recover from data stream errors. This topic describes how to create an EMR Spark Streaming node and develop a streaming task.
Prerequisites
An Alibaba Cloud EMR cluster is created and registered to DataWorks. For more information, see DataStudio (legacy): Register an EMR compute resource.
(Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add members to a workspace.
A serverless resource group is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use a serverless resource group.
A workflow is created in DataStudio.
Development operations in different types of compute engines are performed based on workflows in DataStudio. Therefore, before you create a node, you must create a workflow. For more information, see Create a workflow.
Limits
-
This type of task can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling.
-
You cannot use EMR Spark Streaming nodes to develop tasks for Spark clusters that run on EMR on ACK.
Step 1: Create an EMR Spark Streaming node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
-
Create an EMR Spark Streaming node.
-
In the Business Flow pane, right-click the target workflow and choose .
NoteAlternatively, you can hover over the Create icon and choose .
-
In the Create Node dialog box, enter a Name and select the Engine Instance, Node Type, and Path. Click Confirm to open the editor tab for the node.
NoteNode names can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
-
Step 2: Develop the EMR Spark Streaming task
Double-click the node you created to open its editor tab.
Create and reference an EMR JAR resource
If you are using a DataLake cluster, follow these steps to reference an EMR JAR resource.
If your EMR Spark Streaming node depends on a resource that is too large to upload from the DataWorks page, you must store the resource in HDFS and reference it in your code. Example:
spark-submit --master yarn
--deploy-mode cluster
--name SparkPi
--driver-memory 4G
--driver-cores 1
--num-executors 5
--executor-memory 4G
--executor-cores 1
--class org.apache.spark.examples.JavaSparkPi
hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
-
Create an EMR JAR resource. For more information, see Create and use EMR resources. The first time you perform this operation, you must click Authorize.
-
Reference the EMR JAR resource.
-
Open the EMR Spark Streaming node that you created.
-
In the navigation tree, expand , find the resource you want to reference, then right-click it, and select Insert Resource Path.
-
A reference statement like
##@resource_reference{""}appears in the editor. The command is shown in the example below. It includes placeholder values that you must replace with your actual information.##@resource_reference{"examples-1.2.0-shaded.jar"} --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
-
Develop the task code
In the editor for the EMR Spark Streaming node, enter the code for the job that you want to run. The following code provides an example.
spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
-
In this example, the resource uploaded to DataWorks is
examples-1.2.0-shaded.jar. -
Replace
access-key-idandaccess-key-secretwith the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain them, log on to the DataWorks console, hover over your profile picture in the upper-right corner, and go to the AccessKey Management page. -
Comments are not supported in the code editor for EMR Spark Streaming nodes.
-
If your workspace is bound to multiple EMR compute resources, select the appropriate one for your task.
(Optional) Configure advanced parameters
You can configure specific properties on the Advanced Settings tab of the node. For more information about available properties, see Spark Configuration. The following table describes the advanced parameters that you can configure for the cluster.
Datalake: EMR on ECS
|
Parameter |
Description |
|
queue |
The YARN queue for the job. The default value is default. For more information about YARN in EMR, see Basic queue configurations. |
|
priority |
The priority of the job. The default value is 1. |
|
Others |
You can add custom SparkConf parameters in the advanced settings. DataWorks automatically adds the new parameters to the command when the code is submitted. Example: Note
To enable Ranger permission control, add the For more information about parameter configurations, see Set global Spark parameters. |
Run the task
-
In the toolbar, click the
icon. In the Parameter dialog box, select the scheduling resource group that you created and click Running.Note-
To access compute resources in a public network or VPC, you must use a scheduling resource group that has passed a connectivity test with the compute resource. For more information, see Network connectivity solutions.
-
If you need to change the resource group for subsequent runs, click the Run with Parameters
icon and select the resource group that you want to use.
-
-
Click the
icon to save the code. -
(Optional) Perform smoke testing.
To perform smoke testing in the development environment, run the test before or after you commit the node. For more information, see Perform smoke testing.
Step 3: Configure scheduling properties
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
Step 4: Deploy the task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
-
Click the
icon in the top toolbar to save the task. -
Click the
icon in the top toolbar to commit the task. In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.
Note-
You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
-
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
-
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploying tasks.
More operations
After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see Manage auto triggered tasks.