EMR Spark Streaming node

更新时间:
复制 MD 格式

EMR Spark Streaming nodes process high-throughput, real-time streaming data and provide a fault-tolerant mechanism to recover failed data streams. This topic explains how to create an EMR Spark Streaming node for data development.

Prerequisites

  • You have created an Alibaba Cloud EMR cluster and registered it with DataWorks. For more information, see Data Studio: Bind an EMR computing resource.

  • (Optional, for RAM users) The RAM user for task development has been added to the corresponding workspace and granted the Development or Workspace Administrator (this role has extensive permissions, grant with caution) role. For more information about how to add members, see Add members to a workspace.

    If you use an Alibaba Cloud account, you can skip this step.

Limitations

  • This type of task can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling.

  • You cannot use EMR Spark Streaming nodes for task development on Spark clusters that run on EMR on ACK.

  • This node cannot be used in a workflow. You can develop and run it only as a standalone node.

Procedure

  1. On the EMR Spark Streaming node editor page, perform the following steps.

    Create and reference an EMR JAR resource

    If you use a DataLake cluster, follow these steps to reference an EMR JAR resource.

    Note

    If a required resource is too large to upload from the DataWorks page, you must store the resource in HDFS and reference it in your code. The following code is an example.

    spark-submit --master yarn
    --deploy-mode cluster
    --name SparkPi
    --driver-memory 4G
    --driver-cores 1
    --num-executors 5
    --executor-memory 4G
    --executor-cores 1
    --class org.apache.spark.examples.JavaSparkPi
    hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
    1. Create an EMR JAR resource.

      1. For more information, see Resource Management. Store the generated JAR package in the emr/jars directory. Click Click Upload to upload the JAR resource.

      2. Select a Storage Path, Data Sources, and Resource Group.

      3. Click Save.

    2. Reference the EMR JAR resource.

      1. Open the created EMR Spark Streaming node and stay on the code editor page.

      2. In the left-side navigation pane, find the resource that you want to reference under Resource Management. Right-click the resource and select Insert Resource Path.

      3. After the reference is added, a success message appears on the code editor page for the EMR Spark Streaming node. You can then run the following command. The resource package, bucket name, and path information in the following command are examples. Replace them with your values.

        ##@resource_reference{"examples-1.2.0-shaded.jar"}
        --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>

    Develop code

    In the code editor for the EMR Spark Streaming node, enter the job code that you want to run. Example:

    spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
    Note
    • In this example, the resource uploaded to DataWorks is examples-1.2.0-shaded.jar.

    • Replace access-key-id and access-key-secret with the AccessKey ID and AccessKey Secret of your Alibaba Cloud account. To obtain an AccessKey ID and AccessKey Secret, log on to the DataWorks console, hover over your profile picture in the upper-right corner of the top navigation bar, and select AccessKey Management.

    • Comments are not supported in the code editor for EMR Spark Streaming nodes.

    (Optional) Configure advanced parameters

    In the Scheduling Settings pane on the right side of the node page, you can configure the specific parameters in the following table under EMR Node Parameters > DataWorks parameters.

    Note
    • The available advanced parameters vary by EMR cluster type, as shown in the following table.

    • You can configure more open-source Spark properties in the Scheduling Settings pane under EMR Node Parameters > Spark parameter.

    DataLake: EMR on ECS

    Parameter

    Description

    FLOW_SKIP_SQL_ANALYZE

    The execution mode for SQL statements. Valid values:

    • true: Executes multiple SQL statements at a time.

    • false (default): Executes one SQL statement at a time.

    Note

    This parameter is supported only for test runs in the data development environment.

    queue

    The job submission queue. The default queue is default. For more information about EMR YARN, see Basic queue configuration.

    priority

    The job priority. The default is 1.

    Other

    You can add custom SparkConf parameters in the advanced configuration. When you submit the code, DataWorks automatically adds the new parameters to the command. Example: "spark.driver.memory" : "2g".

    Note

    To enable Ranger access control, add the spark.hadoop.fs.oss.authorization.method=ranger configuration in Set global Spark parameters.

    For more information about parameter configurations, see Set global Spark parameters.

    Run the task

    1. In Run Configuration, under Compute Resource, select the Compute Resource and DataWorks Resource Group.

      Note
      • You can also configure the CUs for Scheduling based on the resource requirements of the task. The default value is 0.25.

      • To access a data source over the public internet or in a VPC, you must use a scheduling resource group that can connect to the data source. For more information, see Network connectivity solutions.

    2. In the parameter dialog box on the toolbar, select the data source that you created and click Run to run the task.

  2. To run the node as a scheduled task, configure its scheduling properties as needed. For more information, see Configure node scheduling.

  3. After configuring the node task, you must publish it. For more information, see Publish a node or workflow.

  4. After the task is published, you can view the status of scheduled tasks in Operation Center. For more information, see Get started with Operation Center.