Associate EMR Serverless Spark computing resource

To develop and manage EMR Serverless Spark tasks in DataWorks, you must associate your EMR Serverless Spark workspace with DataWorks as a Serverless Spark computing resource. You can then use this computing resource for data development in DataWorks.

Prerequisites

Limits

  • Region limits: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), US (Silicon Valley), and US (Virginia).

  • Permission limits:

    Operator: Alibaba Cloud account

    Required permissions: No additional permissions are required.

    Operator: RAM user/RAM role

    Required permissions:

    • DataWorks management permissions: Only workspace members with the O&M and Workspace Administrator roles, or with the AliyunDataWorksFullAccess policy, can create computing resources. For more information, see Grant a user the permissions of a Workspace Administrator.

    • EMR Serverless Spark service permissions:

      • The AliyunEMRServerlessSparkFullAccess policy.

      • The Owner permission for the EMR Serverless Spark workspace. For more information, see Manage users and roles.

Go to the computing resource list page

  1. Log on to the DataWorks console. In the navigation pane on the left, switch to the target region and click More > Management Center. Select your workspace from the drop-down list and click Go to Management Center.

  2. In the navigation pane on the left, click Computing Resources to open the computing resource list page.

Associate Serverless Spark computing resource

On the computing resource list page, configure and bind a Serverless Spark computing resource.

  1. Select the type of computing resource to associate.

    1. Click Associate Computing Resources to go to the Associate Computing Resources page.

    2. On the Associate Computing Resources page, select Serverless Spark as the computing resource type to open the Associate EMR Serverless Spark Computing Resource configuration page.

  2. Configure the Serverless Spark computing resource.

    On the Associate EMR Serverless Spark Computing Resource configuration page, configure the parameters that are described in the following table.

    Parameter

    Description

    EMR Serverless Spark Workspace

    Select the Spark workspace that you want to associate. You can also click Create in the drop-down list to create a Spark workspace.

    Default Engine Version

    Select the engine version to use.

    • When you create an EMR Spark task in Data Studio, this engine version is used by default.

    • To set different engine versions for different tasks, specify the versions in the advanced settings of the Spark task editing window.

    Default Resource Queue

    Select the resource queue to use. You can also click Create in the drop-down list to add a queue.

    • When you create an EMR Spark task in Data Studio, this resource queue is used by default.

    • To set different resource queues for different tasks, specify the queues in the advanced settings of the Spark task editing window.

    Default Kyuubi Gateway

    Optional. The Kyuubi Gateway configuration determines how the following tasks run:

    • If a Kyuubi Gateway is configured:

      • All related tasks (EMR Spark SQL/Kyuubi and Serverless Spark SQL/Kyuubi) run via the Kyuubi Gateway.

    • If no Kyuubi Gateway is configured:

      • EMR Spark SQL and Serverless Spark SQL tasks run by using spark-submit.

      • EMR Kyuubi and Serverless Kyuubi tasks fail to execute.

    To configure a gateway, create a Kyuubi Gateway and a token in the EMR Serverless Spark console under Operation Center > Gateway > Kyuubi Gateway.

    • If Kerberos is not enabled: Click the name of the Kyuubi Gateway to obtain the JDBC URL and token, and then combine them to form the complete connection string.

    • If Kerberos is enabled: Obtain the Beeline connection string based on your Kerberos configuration. For more information, see Use Kerberos with Kyuubi Gateway.

      # Example of a standard connection string
      jdbc:hive2://kyuubi-cn-hangzhou-internal.spark.emr.aliyuncs.com:80/;transportMode=http;httpPath=cliservice/token/<token>
      # Example of a connection string for a Kerberos-enabled cluster. Make sure to include the principal of the Kyuubi service.
      jdbc:hive2://ep-xxxxxxxxxxx.epsrv-xxxxxxxxxxx.cn-hangzhou.privatelink.aliyuncs.com:10009/;principal=kyuubi/_HOST@EMR.C-DFD43*****7C204.COM

    Default Access Identity

    The identity used to access this Spark workspace from DataWorks.

    • Development environment: Only the Executor identity is supported.

    • Production environment: The Alibaba Cloud Account, Alibaba Cloud RAM Sub-account, and Task Owner identities are supported.

    Computing Resource Instance Name

    The identifier for the computing resource. At runtime, tasks use this name to select the resource.

  3. Click Confirm to complete the configuration of the Serverless Spark computing resource.
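
The Default Kyuubi Gateway behavior described in the table above can be sketched in Python. This is an illustrative sketch only, not DataWorks code: the function names `build_connection_string` and `execution_path` are hypothetical, and appending the token as a `/token/<token>` suffix is an assumption inferred from the standard connection string example.

```python
from typing import Optional

# Node types named in the Default Kyuubi Gateway description.
SPARK_SQL_NODES = {"EMR Spark SQL", "Serverless Spark SQL"}
KYUUBI_NODES = {"EMR Kyuubi", "Serverless Kyuubi"}


def build_connection_string(jdbc_url: str, token: str) -> str:
    """Combine the gateway JDBC URL and token (Kerberos not enabled).

    Assumption: the token is appended as a "/token/<token>" suffix,
    mirroring the standard connection string example.
    """
    return f"{jdbc_url.rstrip('/')}/token/{token}"


def execution_path(node_type: str, gateway: Optional[str]) -> str:
    """Return how a node runs for a given Default Kyuubi Gateway setting."""
    if node_type not in SPARK_SQL_NODES | KYUUBI_NODES:
        raise ValueError(f"node type not covered by this setting: {node_type}")
    if gateway:
        # With a gateway configured, all related tasks go through it.
        return "kyuubi-gateway"
    if node_type in SPARK_SQL_NODES:
        # Without a gateway, Spark SQL tasks fall back to spark-submit.
        return "spark-submit"
    # Without a gateway, Kyuubi tasks fail to execute.
    raise RuntimeError(f"{node_type} requires a Kyuubi Gateway")
```

For example, `execution_path("Serverless Spark SQL", None)` returns `"spark-submit"`, while `execution_path("Serverless Kyuubi", None)` raises an error, which matches the two "no Kyuubi Gateway" cases in the table.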

Configure global Spark parameters

In DataWorks, you can specify Spark parameters for each module at the workspace level and set whether these global parameters take precedence over local parameters within specific modules such as Data Studio. After the configuration is complete, tasks use the specified Spark parameters by default. The following table describes the configuration methods.

Scope

Configuration method

Global configuration

At the workspace level, you can configure global Spark parameters for modules that run EMR tasks and set their priority over module-specific parameters. For more information, see Configure global SPARK parameters.

Single node

In Data Studio, you can set specific Spark properties for an individual node task on the node editing page. Other product modules do not support setting Spark properties within the modules.

Permission management

Only the following identities can configure global Spark parameters:

  • Alibaba Cloud account

  • A RAM user or RAM role with the AliyunDataWorksFullAccess policy

  • A RAM user with the Workspace Administrator role

Configure global Spark parameters

Perform the following steps to configure global Spark parameters. For more information about how to configure Spark parameters for a Serverless Spark computing resource, see Job configuration overview.

  1. Go to the computing resource list page and find the Serverless Spark computing resource that you have associated.

  2. Click Spark Parameters to open the configuration pane and view the global Spark parameter settings.

  3. Set global Spark parameters.

    In the upper-right corner of the Spark Parameters page, click Edit Spark Parameters to configure global Spark parameters and priorities for each module.

    Note

    This is a workspace-level configuration. Before you proceed, make sure that you have selected the correct workspace.

    Parameter

    Description

    Spark property

    Configure the Spark properties that are used to run Serverless Spark tasks.

    Global Settings Take Precedence

    If you select this option, the global configuration overrides the configurations in product modules. In this case, tasks are run based on the global Spark properties.

    • Global configuration: refers to the Spark properties configured on the Spark Parameters page of the Serverless Spark computing resource in Management Center > Computing Resources.

      Currently, you can set global Spark parameters only for Data Studio, Operation Center, and Data Analysis.

    • Configurations in product modules:

      • Data Studio: For EMR Spark, EMR Kyuubi, EMR Spark SQL, EMR Spark Streaming, Serverless Spark Batch, Serverless Spark SQL, and Serverless Kyuubi nodes, you can set Spark properties for an individual node task in the Spark Parameters section on the Run Configuration or Scheduling Settings tab of the node editing page.

      • Other product modules: You cannot set Spark properties within the modules.

  4. Click Confirm to save the configured global Spark parameters.
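
The effect of Global Settings Take Precedence can be pictured as a merge of two property maps. The following Python sketch is a simplified model, not the DataWorks implementation: the function name `effective_spark_properties` is hypothetical, and it assumes that precedence is resolved per property key and that node-level values win when the option is not selected.

```python
from typing import Dict


def effective_spark_properties(
    global_props: Dict[str, str],
    module_props: Dict[str, str],
    global_takes_precedence: bool,
) -> Dict[str, str]:
    """Merge workspace-level and node-level Spark properties."""
    if global_takes_precedence:
        # Global values override node-level values on conflicting keys.
        return {**module_props, **global_props}
    # Assumption: otherwise node-level values override global values.
    return {**global_props, **module_props}


global_props = {"spark.executor.memory": "4g", "spark.executor.cores": "2"}
node_props = {"spark.executor.memory": "8g", "spark.driver.memory": "2g"}

merged = effective_spark_properties(global_props, node_props, True)
print(merged["spark.executor.memory"])  # 4g
```

In this model, a property that is set only at one level is always kept, and the option only decides which value is used when both levels set the same property.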

Configure cluster account mappings

Configure mappings between the Alibaba Cloud accounts of DataWorks members and the identity accounts in the EMR cluster. This allows members to run EMR Serverless Spark tasks by using the mapped cluster identities.

Important

This feature is available only in a serverless resource group. If you purchased a serverless resource group before August 15, 2025 and want to use this feature, you must submit a ticket to upgrade the resource group.

  1. Go to the computing resource list page and find the Serverless Spark computing resource that you have associated.

  2. Click Account Mappings to go to the Account Mappings configuration pane.

  3. Click Edit Account Mapping to configure the cluster account mappings. The parameters that you must configure depend on the selected Account Mapping Type.

    Account Mapping Type

    Task execution

    Configuration

    System account mapping

    Runs EMR Spark, EMR Spark SQL, EMR Kyuubi, and notebook tasks developed in a personal development environment by using a cluster account that has the same name as the Default Access Identity specified in the basic information of the computing resource.

    By default, same-name mapping is used. To use other account mappings, you can manually configure different accounts.

    OpenLDAP account mapping

    The Default Access Identity specified in the basic information of the computing resource is used to run EMR Spark and EMR Spark SQL tasks.

    The OpenLDAP account that is mapped to the default access identity in the basic information of the computing resource is used to run EMR Kyuubi and notebook tasks developed in a personal development environment.

    If you have configured and enabled LDAP authentication for Kyuubi Gateway, you must map each Alibaba Cloud account (Account) to an OpenLDAP account (LDAP Account and LDAP Password) to run the related tasks.

    Important

    If a required Alibaba Cloud account is not in the mapping list, the task may fail.

    Kerberos account mapping

    The Default Access Identity specified in the basic information of the computing resource is used to run EMR Spark and EMR Spark SQL tasks.

    The Kerberos account that is mapped to the default access identity in the basic information of the computing resource is used to run EMR Kyuubi node tasks.

    1. You must upload the krb5.conf file of the Kerberos service that is configured for the EMR Serverless Spark cluster.

    2. For the Alibaba Cloud account specified as the default access identity, configure the principal and keytab that are required for Kerberos authentication.

  4. Click Confirm to complete the configuration of cluster account mappings.
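
The mapping table above can be read as a lookup from mapping type and task type to the identity that runs the task. The Python sketch below models that lookup under stated assumptions: the function name `run_as`, the short type keys (`"system"`, `"openldap"`, `"kerberos"`), and the `mappings` dictionary are hypothetical, and a missing mapping is modeled as an error even though the source only says the task may fail.

```python
from typing import Dict

# Task types that run as the Default Access Identity under
# OpenLDAP or Kerberos account mapping.
DEFAULT_IDENTITY_TASKS = {"EMR Spark", "EMR Spark SQL"}

# Task types that run as the mapped cluster account, per mapping type.
MAPPED_TASKS = {
    "system": {"EMR Spark", "EMR Spark SQL", "EMR Kyuubi", "notebook"},
    "openldap": {"EMR Kyuubi", "notebook"},
    "kerberos": {"EMR Kyuubi"},
}


def run_as(
    mapping_type: str,
    task_type: str,
    default_identity: str,
    mappings: Dict[str, str],
) -> str:
    """Return the account a task runs as for the given mapping type."""
    if mapping_type not in MAPPED_TASKS:
        raise ValueError(f"unknown mapping type: {mapping_type}")
    if task_type in MAPPED_TASKS[mapping_type]:
        if mapping_type == "system":
            # Same-name mapping by default, unless manually overridden.
            return mappings.get(default_identity, default_identity)
        if default_identity not in mappings:
            # Modeled as an error; the source says the task may fail.
            raise KeyError(f"no {mapping_type} mapping for {default_identity}")
        return mappings[default_identity]
    if task_type in DEFAULT_IDENTITY_TASKS:
        return default_identity
    raise ValueError(f"{task_type} is not covered by {mapping_type} mapping")
```

For example, with OpenLDAP mapping and `mappings = {"alice": "ldap_alice"}`, an EMR Spark SQL task runs as `alice`, while an EMR Kyuubi task runs as `ldap_alice`.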

Next steps

After configuring the Serverless Spark computing resource, you can use it to develop node tasks in Data Studio. For more information, see EMR Spark node, EMR Spark SQL node, EMR Spark Streaming node, EMR Kyuubi node, Serverless Spark Batch node, Serverless Spark SQL node, and Serverless Kyuubi node.