Register an EMR cluster in DataWorks

DataWorks lets you create nodes such as Hive, MapReduce (MR), Presto, and Spark SQL on an E-MapReduce (EMR) cluster to configure task workflows, schedule jobs, and manage metadata. You can register an EMR cluster that belongs to the same or a different Alibaba Cloud account.

Background

E-MapReduce (EMR) is an open source big data processing solution that runs on Alibaba Cloud.

Built on Apache Hadoop and Apache Spark, EMR lets you analyze and process data with their ecosystems. EMR can also exchange data with other Alibaba Cloud services, such as Object Storage Service (OSS) and ApsaraDB RDS. EMR is available on ECS, on ACK, and as a serverless offering.

The optimal cluster configuration varies by component. Before you configure an EMR cluster, see EMR cluster configuration recommendations and select a configuration that meets your requirements.

Supported cluster types

Before you can run EMR tasks, you must create an EMR cluster and register it with DataWorks. The following cluster types are supported:
- DataLake clusters (new data lake): EMR on ECS
- Custom clusters: EMR on ECS
- Hadoop clusters (legacy data lake): EMR on ECS
- Spark clusters: EMR on ACK
- EMR Serverless Spark clusters

Important

The following EMR versions of Hadoop clusters (legacy data lake) are supported in DataWorks:

EMR-3.38.2, EMR-3.38.3, EMR-4.9.0, EMR-5.6.0, EMR-3.26.3, EMR-3.27.2, EMR-3.29.0, EMR-3.32.0, EMR-3.35.0, EMR-4.3.0, EMR-4.4.1, EMR-4.5.0, EMR-4.5.1, EMR-4.6.0, EMR-4.8.0, EMR-5.2.1, EMR-5.4.3
Hadoop clusters (legacy data lake) are no longer recommended. We recommend that you migrate your Hadoop clusters to DataLake clusters as soon as possible. For more information, see Migrate a Hadoop cluster to a DataLake cluster.
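
As a quick pre-check, the version list above can be encoded as a lookup. The following is a minimal sketch, not part of DataWorks: the constant and function names are hypothetical, and the set contains only the versions listed in this topic.

```python
# Versions copied from the list in this topic; both names below are hypothetical.
SUPPORTED_LEGACY_HADOOP_VERSIONS = frozenset({
    "EMR-3.26.3", "EMR-3.27.2", "EMR-3.29.0", "EMR-3.32.0", "EMR-3.35.0",
    "EMR-3.38.2", "EMR-3.38.3",
    "EMR-4.3.0", "EMR-4.4.1", "EMR-4.5.0", "EMR-4.5.1", "EMR-4.6.0",
    "EMR-4.8.0", "EMR-4.9.0",
    "EMR-5.2.1", "EMR-5.4.3", "EMR-5.6.0",
})


def is_supported_legacy_hadoop_version(version: str) -> bool:
    """Return True if a Hadoop (legacy data lake) cluster version is in the list."""
    return version.strip().upper() in SUPPORTED_LEGACY_HADOOP_VERSIONS
```

For example, `is_supported_legacy_hadoop_version("EMR-3.38.2")` returns `True`, while a version that is not in the list returns `False`.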

Limitations

Permissions: Only the following identities can register an EMR cluster. To grant these permissions, see Grant permissions to a RAM user.
- An Alibaba Cloud account.
- A RAM user or RAM role with the DataWorks Workspace Administrator role and the AliyunEMRFullAccess policy.
- A RAM user or RAM role with the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies.
Region: EMR Serverless Spark is available only in China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), and US (Virginia).
Task type: DataWorks does not support EMR Flink tasks.
Task execution: DataWorks supports running EMR tasks by using serverless resource groups (recommended) or legacy exclusive resource groups for scheduling.
Data governance:
- Only SQL tasks on EMR Hive, EMR Spark, and EMR Spark SQL nodes support data lineage generation. For clusters of version 5.9.1 or 3.43.1 or later, these nodes support both table-level and column-level data lineage.
  
  Note
  For Spark nodes, table-level and column-level data lineage are supported on EMR clusters of version 5.8.0 or 3.42.0 or later. On earlier versions, only Spark 2.x supports table-level data lineage.
- To manage metadata for a DataLake or custom cluster in DataWorks, you must first configure EMR-HOOK on the cluster. If EMR-HOOK is not configured, DataWorks cannot display metadata in real time, generate audit logs, or show data lineage. Currently, EMR-HOOK is supported only for the EMR Hive and EMR Spark SQL services. For more information, see Configure EMR-HOOK for Hive and Configure EMR-HOOK for Spark SQL.
For EMR clusters with Kerberos authentication enabled, you must add an inbound rule to the security group to allow UDP access from the CIDR block of the vSwitch that is associated with the resource group.

Note
Click the icon for Cluster Security Group on the Basic information tab of the EMR cluster to go to the Security Group Details page. On the Access Rules tab, click Inbound, and then click Added Manually. Set Protocol Type to Custom UDP. For Port Range, enter the KDC port specified in the /etc/krb5.conf file of the EMR cluster. Set Authorized object to the CIDR block of the vSwitch that is associated with the resource group.
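
To find the value for Port Range, you can read the `kdc` entries in `/etc/krb5.conf`. The following is a minimal sketch that assumes the standard krb5.conf layout (`kdc = host[:port]` lines inside the `[realms]` section) and falls back to port 88, the standard Kerberos port, when an entry has no explicit port. The function name is hypothetical.

```python
from __future__ import annotations

import re

DEFAULT_KDC_PORT = 88  # standard Kerberos port, used when "kdc" has no explicit port

# Matches lines such as "  kdc = kdc-host.example:88"; commented-out lines are skipped.
_KDC_LINE = re.compile(r"^\s*kdc\s*=\s*(\S+)\s*$", re.MULTILINE)


def kdc_ports(krb5_conf_text: str) -> set[int]:
    """Collect the KDC ports referenced in the text of a krb5.conf file."""
    ports = set()
    for match in _KDC_LINE.finditer(krb5_conf_text):
        _host, sep, port = match.group(1).rpartition(":")
        ports.add(int(port) if sep and port.isdigit() else DEFAULT_KDC_PORT)
    return ports
```

On a cluster node, this could be called as `kdc_ports(open("/etc/krb5.conf").read())`. Each returned port is a candidate for the Custom UDP inbound rule.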

Usage notes

To isolate the development and production environments for a DataWorks workspace in standard mode, you must register two separate EMR clusters: one for the development environment and one for the production environment. The metadata of these clusters must be stored by using one of the following methods:
- Method 1 (recommended for data lake solutions): Store the metadata in two different catalogs in Data Lake Formation (DLF). For more information, see Switch the metadata storage type.
- Method 2: Store the metadata in two different databases in RDS. For more information, see Configure a self-managed RDS instance.
An EMR cluster can be registered with multiple workspaces within the same Alibaba Cloud account, but not across different accounts.
A DataWorks resource group must be able to access the EMR cluster. If the resource group still cannot connect after you bind it to the same VPC and vSwitch as the EMR cluster, check the cluster security group rules and add inbound rules for the vSwitch CIDR block and the ports of common open source components. For more information, see Manage security groups of an EMR cluster.
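
The security group check in the last note can be sketched offline with the standard `ipaddress` module. This is an illustration only: the `InboundRule` structure is a hypothetical stand-in for rules that you copy from the Security Group Details page, and the required ports are whatever your cluster components actually listen on.

```python
from __future__ import annotations

import ipaddress
from typing import Iterable, NamedTuple


class InboundRule(NamedTuple):
    """Hypothetical, simplified view of one inbound security group rule."""
    cidr: str        # authorized object
    port_start: int  # first port of the range
    port_end: int    # last port of the range


def _covers(rule: InboundRule, vswitch_cidr: str, port: int) -> bool:
    rule_net = ipaddress.ip_network(rule.cidr, strict=False)
    vswitch_net = ipaddress.ip_network(vswitch_cidr, strict=False)
    if rule_net.version != vswitch_net.version:
        return False  # an IPv4 rule never covers an IPv6 block, and vice versa
    return vswitch_net.subnet_of(rule_net) and rule.port_start <= port <= rule.port_end


def uncovered_ports(rules: Iterable[InboundRule], vswitch_cidr: str,
                    required_ports: Iterable[int]) -> set[int]:
    """Return the required ports that no inbound rule opens for the whole vSwitch CIDR block."""
    rules = list(rules)
    return {
        port for port in required_ports
        if not any(_covers(rule, vswitch_cidr, port) for rule in rules)
    }
```

An empty result means that every required port is reachable from the vSwitch CIDR block according to the rules you supplied. It does not replace an actual connectivity test.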

Step 1: Go to the cluster registration page

Log on to the DataWorks console. In the target region, click More > Management Center in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane, click Clusters. On the Clusters page, click Register Cluster and set Select Cluster Type to E-MapReduce.

Step 2: Register an EMR cluster

On the Register EMR Cluster page, configure the cluster information.

Note

For workspaces in standard mode, you must configure cluster information separately for the development and production environments. For more information about workspace modes, see Differences between workspace modes.

Display Name of Cluster: Enter a unique display name for the cluster.
Alibaba Cloud Account To Which Cluster Belongs: Select the account that owns the EMR cluster.

Note
EMR Serverless Spark clusters cannot be registered across accounts.

Configure the parameters based on the selected account type.
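
The account-related restrictions in this step can be summarized as a small check. The following is a sketch under stated assumptions: the three cross-account labels are taken from the EMR Cluster Type parameter in this topic, the other two labels are illustrative names for the remaining supported types, and the function name is hypothetical.

```python
# Labels listed for cross-account registration in this topic.
CROSS_ACCOUNT_CLUSTER_TYPES = frozenset({
    "EMR on ECS: DataLake cluster",
    "EMR on ECS: Hadoop cluster",
    "EMR on ECS: Custom cluster",
})

# Assumed labels for the remaining supported types (same-account registration only).
SAME_ACCOUNT_ONLY_CLUSTER_TYPES = frozenset({
    "EMR on ACK: Spark cluster",
    "EMR Serverless Spark",
})


def can_register(cluster_type: str, same_account: bool) -> bool:
    """Return True if the cluster type can be registered for the given account choice."""
    if cluster_type in CROSS_ACCOUNT_CLUSTER_TYPES:
        return True  # supported for both the current account and another account
    if cluster_type in SAME_ACCOUNT_ONLY_CLUSTER_TYPES:
        return same_account
    return False  # not a supported cluster type
```

For example, `can_register("EMR Serverless Spark", same_account=False)` returns `False`, which matches the note above.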

Current account

If the cluster belongs to the Current Alibaba Cloud Account, you must configure the following parameters:

Parameter	Description
Cluster Type	Select the type of EMR cluster to register. For a list of supported cluster types, see Supported cluster types.
Cluster	Select the EMR cluster in the current account that you want to register with DataWorks. Note If you select EMR Serverless Spark, follow the on-screen instructions to select a workspace created in EMR Serverless Spark (the cluster to register), a default engine version, and a default resource queue.
Default Access Identity	Specify the identity used to access the EMR cluster in the current workspace. Development environment: You can use the cluster account `hadoop` or the cluster account that is mapped to the task executor. Production environment: You can use the cluster account `hadoop` or the cluster account that is mapped to the task owner, Alibaba Cloud account, or RAM user. Note If you select an identity mapped to a cluster account (such as task owner, Alibaba Cloud account, or RAM user), you can manually configure the mappings between DataWorks members and specified EMR cluster accounts. For more information, see Configure cluster identity mappings. If an access identity that requires a mapping is used to run an EMR task in DataWorks but no specific mapping is configured, DataWorks applies the following policies: If a RAM user runs the task, DataWorks runs the task by using an EMR system account that has the same name as the current operator by default. If the cluster has LDAP or Kerberos authentication enabled, the task fails. If an Alibaba Cloud account runs the task, the DataWorks task reports an error.
Pass Proxy User Information	Specify whether to pass proxy user information. Note When an authentication method such as LDAP or Kerberos is enabled, the cluster issues a credential to each user. To centralize permission management, you can use a superuser (real user) to act as a proxy for a regular user (proxy user) for authentication. When a proxy user accesses the cluster, the identity credentials of the superuser are used. You only need to add the user as a proxy user. Pass: When tasks are run in the EMR cluster, data access permissions are verified and controlled based on the proxy user. Data Studio and Data Analysis: The Alibaba Cloud account name of the task executor is dynamically passed as the proxy user information. Operation Center: The Alibaba Cloud account name of the default access identity that is configured during cluster registration is passed as the proxy user information. Do not pass: When tasks are run in the EMR cluster, data access permissions are verified and controlled based on the authentication method configured during cluster registration. Proxy user information is passed in the following ways for different types of EMR tasks: EMR Kyuubi tasks: passed by using the `hive.server2.proxy.user` parameter. EMR Spark tasks and EMR Spark SQL tasks in non-JDBC mode: passed by using the `--proxy-user` parameter.
Configuration File	If you select HADOOP for Cluster Type, you can obtain the configuration files from the EMR console. On the Basic Information tab of the cluster details page, click All Operations in the upper-right corner. In the Cluster Services section of the drop-down menu, select Export Service Configuration. After you export the files, rename them as required on the UI. For more information, see Export and import service configurations. You can also log on to the EMR cluster and obtain the configuration files from the following paths: `/etc/ecm/hadoop-conf/core-site.xml`, `/etc/ecm/hadoop-conf/hdfs-site.xml`, `/etc/ecm/hadoop-conf/mapred-site.xml`, `/etc/ecm/hadoop-conf/yarn-site.xml`, `/etc/ecm/hive-conf/hive-site.xml`, `/etc/ecm/spark-conf/spark-defaults.conf`, and `/etc/ecm/spark-conf/spark-env.sh`.
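
If you collect the configuration files directly from a cluster node, a short script can confirm that every expected file is present before you upload anything. The following is a minimal sketch: the paths are the ones listed in the Configuration File parameter, while the function name and the `root` parameter (useful when the files were copied to a local directory) are assumptions for illustration.

```python
from __future__ import annotations

from pathlib import Path

# Paths as listed in the Configuration File parameter description.
CONFIG_FILE_PATHS = (
    "/etc/ecm/hadoop-conf/core-site.xml",
    "/etc/ecm/hadoop-conf/hdfs-site.xml",
    "/etc/ecm/hadoop-conf/mapred-site.xml",
    "/etc/ecm/hadoop-conf/yarn-site.xml",
    "/etc/ecm/hive-conf/hive-site.xml",
    "/etc/ecm/spark-conf/spark-defaults.conf",
    "/etc/ecm/spark-conf/spark-env.sh",
)


def missing_config_files(root: Path = Path("/")) -> list[str]:
    """Return the listed paths that do not exist as regular files under root."""
    return [
        path for path in CONFIG_FILE_PATHS
        if not (root / path.lstrip("/")).is_file()
    ]
```

Run on a cluster node, `missing_config_files()` returns an empty list when all seven files exist. A cluster that does not run a given service may legitimately lack the corresponding files.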

Another account

If the cluster belongs to Another Alibaba Cloud Account, you must configure the following parameters:

Parameter	Description
UID of Alibaba Cloud Account	Enter the UID of the Alibaba Cloud account that owns the EMR cluster.
RAM Role	The RAM role used to access the EMR cluster. The role must meet the following requirements: The RAM role is created in the other Alibaba Cloud account. The RAM role in the other Alibaba Cloud account is granted permissions to access the DataWorks service in the current account. Note For more information about how to register an EMR cluster that belongs to a different account, see Scenario: Register an EMR cluster that belongs to a different account.
EMR Cluster Type	Select the type of EMR cluster to register. For cross-account registration, only `EMR on ECS: DataLake cluster`, `EMR on ECS: Hadoop cluster`, and `EMR on ECS: Custom cluster` are supported.
EMR Cluster	Select the EMR cluster in that account that you want to register with DataWorks.
Configuration File	Configure each configuration file as prompted on the UI. To obtain the configuration files from the EMR console, go to the Basic Information tab of the cluster details page and click All Operations in the upper-right corner. In the Cluster Services section of the drop-down menu, select Export Service Configuration. After you export the files, rename them as required on the UI. For more information, see Export and import service configurations. You can also log on to the EMR cluster and obtain the configuration files from the following paths: `/etc/ecm/hadoop-conf/core-site.xml`, `/etc/ecm/hadoop-conf/hdfs-site.xml`, `/etc/ecm/hadoop-conf/mapred-site.xml`, `/etc/ecm/hadoop-conf/yarn-site.xml`, `/etc/ecm/hive-conf/hive-site.xml`, `/etc/ecm/spark-conf/spark-defaults.conf`, and `/etc/ecm/spark-conf/spark-env.sh`.
Default Access Identity	Specify the identity used to access the EMR cluster in the current workspace. Development environment: You can use the cluster account hadoop or the cluster account that is mapped to the task owner. Production environment: You can use the cluster account hadoop or the cluster account that is mapped to the task owner, Alibaba Cloud account, or RAM user. Note If you select an identity mapped to a cluster account (such as task owner, Alibaba Cloud account, or RAM user), you can manually configure the mappings between DataWorks members and specified EMR cluster accounts. For more information, see Configure cluster identity mappings. If an access identity that requires a mapping is used to run an EMR task in DataWorks but no specific mapping is configured, DataWorks applies the following policies: If a RAM user runs the task, DataWorks runs the task by using an EMR system account that has the same name as the current operator by default. If the cluster has LDAP or Kerberos authentication enabled, the task fails. If an Alibaba Cloud account runs the task, the DataWorks task reports an error.
Pass Proxy User Information	Specify whether to pass proxy user information. Note When an authentication method such as LDAP or Kerberos is enabled, the cluster issues a credential to each user. To centralize permission management, you can use a superuser (real user) to act as a proxy for a regular user (proxy user) for authentication. When a proxy user accesses the cluster, the identity credentials of the superuser are used. You only need to add the user as a proxy user. Pass: When tasks are run in the EMR cluster, data access permissions are verified and controlled based on the proxy user. Data Studio and Data Analysis: The Alibaba Cloud account name of the task executor is dynamically passed as the proxy user information. Operation Center: The Alibaba Cloud account name of the default access identity that is configured during cluster registration is passed as the proxy user information. Do not pass: When tasks are run in the EMR cluster, data access permissions are verified and controlled based on the authentication method configured during cluster registration. Proxy user information is passed in the following ways for different types of EMR tasks: EMR Kyuubi tasks: passed by using the `hive.server2.proxy.user` parameter. EMR Spark tasks and EMR Spark SQL tasks in non-JDBC mode: passed by using the `--proxy-user` parameter.
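
The fallback policy described for Default Access Identity, which applies when an identity requires a mapping but none is configured, can be read as a small decision function. This is only a model of the documented behavior, not DataWorks code. The enum, the function name, and the error messages are hypothetical.

```python
from enum import Enum


class Runner(Enum):
    """Who runs the EMR task in DataWorks (hypothetical enum for illustration)."""
    RAM_USER = "ram_user"
    ALIBABA_CLOUD_ACCOUNT = "alibaba_cloud_account"


def resolve_unmapped_identity(runner: Runner, operator_name: str,
                              auth_enabled: bool) -> str:
    """Return the EMR system account used when no identity mapping is configured.

    auth_enabled is True if the cluster has LDAP or Kerberos authentication enabled.
    Raises RuntimeError for the cases in which the task fails or reports an error.
    """
    if runner is Runner.ALIBABA_CLOUD_ACCOUNT:
        raise RuntimeError("no mapping configured for the Alibaba Cloud account")
    if auth_enabled:
        raise RuntimeError("LDAP or Kerberos is enabled and no mapping is configured")
    # RAM user without authentication: the same-name EMR system account is used.
    return operator_name
```

Only one of the three paths succeeds, so configuring explicit mappings is usually the safer choice on clusters with authentication enabled.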

Step 3: Initialize a resource group

Initialize the resource group after you register a cluster for the first time, change cluster service configurations (for example, by modifying core-site.xml), or upgrade a component version. This ensures that the resource group can access the EMR cluster and that tasks run with the current configuration.

On the Open Source Clusters page, select the tab for the registered EMR cluster and click Initialize Resource Group.
Locate the required resource group and click Initialize in the Actions column.

You can initialize a serverless resource group or a legacy exclusive resource group for scheduling.
Wait 1 to 2 minutes for the initialization to complete, and then click OK.

Important

If the initialization fails, use the connectivity diagnosis tool to troubleshoot the cause.
Initialization can cause running tasks to fail. Unless immediate re-initialization is necessary, for example, to prevent widespread task failures after a configuration change, we recommend initializing the resource group during off-peak hours.

Next steps

Data development: Follow the data development workflow guide to configure component environments.
Configure cluster identity mappings: If the default access identity for the EMR cluster is not the hadoop account, you must configure identity mappings to control which resources each RAM user can access from DataWorks.
Set global YARN resource queues: Use YARN resource queue mappings to specify which YARN queue each module uses. You can also configure these settings to override individual module configurations.
Set global Spark parameters: Customize global Spark parameters based on the official Spark documentation. You can also specify whether workspace-level settings override module-specific configurations for parameters with the same name.
Set Kyuubi connection information: To use a custom account and password for Kyuubi tasks, customize the connection information as described in this topic.