EMR Spark Streaming node - DataWorks - Alibaba Cloud Help Center

Prerequisites

Create an Alibaba Cloud EMR cluster and register it with DataWorks. For more information, see New Data Studio: Attach an EMR compute resource.
(Optional, for RAM users) The Resource Access Management (RAM) user that develops tasks must be added to the workspace and assigned the Development or Workspace Administrator role. The Workspace Administrator role includes extensive permissions, so grant it with caution. For more information, see Add workspace members.

If you are using a root account, skip this step.

Limitations

This type of task can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling.
You cannot use EMR Spark Streaming nodes for task development on Spark clusters that run on EMR on ACK.
This node cannot be used in a workflow. You can develop and run it only as a standalone node.

Procedure

On the EMR Spark Streaming node editor page, perform the following steps.

Create and reference an EMR JAR resource

If you use a DataLake cluster, follow these steps to reference an EMR JAR resource.

Note

If a required resource is too large to upload from the DataWorks page, store the resource in HDFS and reference it in your code. For example:

spark-submit --master yarn \
  --deploy-mode cluster \
  --name SparkPi \
  --driver-memory 4G \
  --driver-cores 1 \
  --num-executors 5 \
  --executor-memory 4G \
  --executor-cores 1 \
  --class org.apache.spark.examples.JavaSparkPi \
  hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
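Staging the JAR in HDFS before running a command like the one above might look like the following sketch. It assumes that you run it on a cluster node where the `hdfs` client is configured. The local file path and the `DRY_RUN` switch are illustrative and are not part of DataWorks. With `DRY_RUN=1` (the default here), the script only prints the commands so that you can review them first.

```shell
#!/usr/bin/env bash
# Sketch: copy a large JAR to HDFS so a job can reference it by hdfs:// path.
# Assumptions (not from the DataWorks documentation): the local path below,
# and that the `hdfs` client is available on the machine running this script.
JAR_LOCAL="${JAR_LOCAL:-./spark-examples_2.11-2.4.8.jar}"
HDFS_DIR="${HDFS_DIR:-/tmp/jars}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

stage_jar() {
  run hdfs dfs -mkdir -p "$HDFS_DIR"              # create the target directory
  run hdfs dfs -put -f "$JAR_LOCAL" "$HDFS_DIR/"  # upload, overwriting any old copy
  run hdfs dfs -ls "$HDFS_DIR"                    # confirm the file is present
}

stage_jar
```

Set `DRY_RUN=0` to execute the commands once the printed paths look right.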

Create an EMR JAR resource.
1. Store the generated JAR package in the emr/jars directory, and then click the Click Upload button to upload the JAR resource. For more information, see Resource Management.
2. Select a Storage Path, Data Sources, and Resource Group.
3. Click Save.
Reference the EMR JAR resource.
1. Open the created EMR Spark Streaming node and stay on the code editor page.
2. In the left-side navigation pane, find the resource that you want to reference under Resource Management. Right-click the resource and select Insert Resource Path.
3. After the reference is added, a success message appears on the code editor page for the EMR Spark Streaming node. You can then run the following command. The resource package, bucket name, and path information in the following command are examples. Replace them with your values.
```
##@resource_reference{"examples-1.2.0-shaded.jar"}
--master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
```
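Because the command still contains placeholders such as `<endpoint>`, it can help to check a draft for leftovers before you paste it into the node editor. The following is a minimal sketch of such a check, run on your own machine. The function name `find_placeholders` and the sample values `my-project` and `my-store` are hypothetical, and the pattern only catches tokens written as `<name>`.

```shell
#!/usr/bin/env bash
# Sketch: list unreplaced <placeholder> tokens in a draft command.
# find_placeholders is a hypothetical helper, not a DataWorks feature.
find_placeholders() {
  # -o prints each match on its own line; sort -u removes duplicates.
  grep -o '<[A-Za-z][A-Za-z0-9_-]*>' | sort -u
}

# Hypothetical draft in which only some placeholders have been replaced.
draft='--class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar my-project my-store <group> <endpoint>'

leftovers="$(printf '%s\n' "$draft" | find_placeholders)"
if [ -n "$leftovers" ]; then
  printf 'Unreplaced placeholders:\n%s\n' "$leftovers"
else
  printf 'No placeholders left.\n'
fi
```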

Develop code

Enter your job code in the EMR Spark Streaming node code editor. Example:

spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>

Note

In this example, the resource uploaded to DataWorks is examples-1.2.0-shaded.jar.
Replace access-key-id and access-key-secret with the AccessKey ID and AccessKey Secret of your Alibaba Cloud account. To obtain an AccessKey ID and AccessKey Secret, log on to the DataWorks console, hover over your profile picture in the upper-right corner of the top navigation bar, and select AccessKey Management.
Comments are not supported in the code editor for EMR Spark Streaming nodes.
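Because the editor does not accept comments, one option is to keep an annotated copy of the command elsewhere and strip the comments before pasting. The sketch below assumes that lines beginning with `#` are your own comments and that lines beginning with `##@` (such as the `##@resource_reference` line shown earlier) are directives to keep. `strip_comments` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: drop comment lines and blank lines from an annotated draft,
# keeping ##@ directive lines. strip_comments is a hypothetical helper.
strip_comments() {
  awk '
    /^[ \t]*##@/ { print; next }   # keep resource-reference directives
    /^[ \t]*#/   { next }          # drop comment lines
    /^[ \t]*$/   { next }          # drop blank lines
                 { print }         # keep everything else
  '
}

strip_comments <<'EOF'
# Word count over a Log Service store (personal note, not for the editor)
##@resource_reference{"examples-1.2.0-shaded.jar"}

--master yarn-cluster --executor-cores 2 --executor-memory 2g
EOF
```

Running the script prints only the directive line and the option line, which is the text you would paste into the editor.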

(Optional) Configure advanced parameters

In the Scheduling Settings pane on the right side of the node page, configure the following parameters under EMR Node Parameters > DataWorks parameters.

Note

The available advanced parameters vary by EMR cluster type, as shown in the following table.
You can configure more open-source Spark properties in the Scheduling Settings pane under EMR Node Parameters > Spark parameter.

DataLake: EMR on ECS

Parameter	Description
FLOW_SKIP_SQL_ANALYZE	The execution mode for SQL statements. Valid values: `true` executes multiple SQL statements at a time; `false` (default) executes one SQL statement at a time. Note: This parameter is supported only for test runs in the data development environment.
queue	The job submission queue. The default queue is default. For more information about EMR YARN, see Basic queue configuration.
priority	The job priority. The default is 1.
Other	Custom SparkConf parameters that you add in the advanced configuration. DataWorks automatically appends them to the command on submission. Example: `"spark.driver.memory" : "2g"`. Note: To enable Ranger access control, add the `spark.hadoop.fs.oss.authorization.method=ranger` configuration as described in Set global Spark parameters, which also covers other parameter configurations.
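
To picture what appending custom SparkConf parameters to the command could look like, the sketch below turns `key=value` pairs into `--conf` options, which is how `spark-submit` accepts arbitrary Spark properties. That DataWorks uses exactly this form is an assumption, and `conf_flags` is a hypothetical helper for illustration only.

```shell
#!/usr/bin/env bash
# Sketch: render Spark properties as spark-submit --conf options.
# conf_flags is a hypothetical helper; the exact form that DataWorks
# appends on submission is assumed, not confirmed by the documentation.
conf_flags() {
  flags=""
  for kv in "$@"; do
    # Add a separating space only when flags already has content.
    flags="${flags:+$flags }--conf $kv"
  done
  printf '%s\n' "$flags"
}

conf_flags "spark.driver.memory=2g" \
           "spark.hadoop.fs.oss.authorization.method=ranger"
```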

Run the task

In Run Configuration, under Compute Resource, select the Compute Resource and DataWorks Resource Group.
Note
- You can also configure the CUs for Scheduling based on the resource requirements of the task. The default value is 0.25.
- To access a data source over the public internet or in a VPC, you must use a scheduling resource group that can connect to the data source. For more information, see Network connectivity solutions.
In the parameter dialog box on the toolbar, select the data source that you created and click Run to run the task.

To run the node as a scheduled task, configure its scheduling properties as needed. For more information, see Configure node scheduling.
After configuring the node task, you must publish it. For more information, see Publish a node or workflow.
After the task is published, you can view the status of scheduled tasks in Operation Center. For more information, see Get started with Operation Center.