Data Studio (legacy): Register an EMR cluster

DataWorks lets you create nodes such as Hive, MapReduce (MR), Presto, and Spark SQL on an E-MapReduce (EMR) cluster to configure task workflows, schedule jobs, and manage metadata. This topic describes how to register an EMR cluster that belongs to the same or a different Alibaba Cloud account.

Background

E-MapReduce (EMR) is an open source big data platform that provides a big data processing solution on Alibaba Cloud.

Built on open source Apache Hadoop and Apache Spark, EMR lets you use other systems in their ecosystems to analyze and process data. EMR can also transfer data to and from other Alibaba Cloud services, such as Object Storage Service (OSS) and RDS. Alibaba Cloud EMR is available in various forms, including on ECS, on ACK, and serverless, to meet different user needs.

When you run EMR tasks in DataWorks, you can choose from various EMR components, and the optimal configuration varies by component. When you configure an EMR cluster, refer to EMR cluster configuration recommendations and select the configuration that fits your requirements.

Supported cluster types

Before you can run related tasks, you must create an EMR cluster and register it with DataWorks. DataWorks supports registering the following cluster types:

  • DataLake clusters (new data lake): EMR on ECS

  • Custom clusters: EMR on ECS

  • Hadoop clusters (legacy data lake): EMR on ECS

  • Spark clusters: EMR on ACK

  • EMR Serverless Spark clusters

Important
  • The following EMR versions of Hadoop clusters (legacy data lake) are supported in DataWorks:

    EMR-3.26.3, EMR-3.27.2, EMR-3.29.0, EMR-3.32.0, EMR-3.35.0, EMR-3.38.2, EMR-3.38.3, EMR-4.3.0, EMR-4.4.1, EMR-4.5.0, EMR-4.5.1, EMR-4.6.0, EMR-4.8.0, EMR-4.9.0, EMR-5.2.1, EMR-5.4.3, EMR-5.6.0

  • Hadoop clusters (legacy data lake) are no longer recommended. We recommend that you migrate your Hadoop clusters to DataLake clusters as soon as possible. For more information, see Migrate a Hadoop cluster to a DataLake cluster.
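
As a quick pre-check before registration, the supported versions listed above can be encoded as a simple lookup. The following is a minimal sketch only: the helper name `is_supported_hadoop_version` is hypothetical and is not part of any DataWorks or EMR SDK, and the set merely mirrors the versions listed in this topic.

```python
# Versions copied from the list in this topic for Hadoop clusters (legacy data lake).
SUPPORTED_HADOOP_VERSIONS = {
    "EMR-3.26.3", "EMR-3.27.2", "EMR-3.29.0", "EMR-3.32.0", "EMR-3.35.0",
    "EMR-3.38.2", "EMR-3.38.3",
    "EMR-4.3.0", "EMR-4.4.1", "EMR-4.5.0", "EMR-4.5.1", "EMR-4.6.0",
    "EMR-4.8.0", "EMR-4.9.0",
    "EMR-5.2.1", "EMR-5.4.3", "EMR-5.6.0",
}


def is_supported_hadoop_version(version: str) -> bool:
    """Return True if the given EMR version string is in the supported list.

    Hypothetical helper: accepts "EMR-4.9.0", "emr-4.9.0", or "4.9.0".
    """
    normalized = version.strip().upper()
    if not normalized.startswith("EMR-"):
        normalized = "EMR-" + normalized
    return normalized in SUPPORTED_HADOOP_VERSIONS


if __name__ == "__main__":
    for candidate in ("EMR-3.38.2", "4.9.0", "EMR-5.9.1"):
        print(candidate, is_supported_hadoop_version(candidate))
```

An exact-match lookup is used instead of a version-range comparison because the list contains specific releases rather than a contiguous range.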

Limitations

  • Permissions: Only the following identities can register an EMR cluster. To grant the required permissions, see Grant permissions to a RAM user.

    • An Alibaba Cloud account.

    • A RAM user or RAM role with the DataWorks Workspace Administrator role and the AliyunEMRFullAccess policy.

    • A RAM user or RAM role with the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies.

  • Region: EMR Serverless Spark is available only in China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), and US (Virginia).

  • Task type: DataWorks does not support EMR Flink tasks.

  • Task execution: DataWorks supports running EMR tasks by using serverless resource groups (recommended) or legacy exclusive resource groups for scheduling.

  • Data governance:

    • Only SQL tasks on EMR Hive, EMR Spark, and EMR Spark SQL nodes support data lineage generation. On clusters of version 5.9.1 or later, or version 3.43.1 or later, these nodes support both table-level and column-level data lineage.

      Note

      For Spark nodes, table-level and column-level data lineage are supported on EMR clusters of version 5.8.0 or later, or version 3.42.0 or later. On earlier versions, only Spark 2.x supports table-level data lineage.

    • To manage metadata for a DataLake or custom cluster in DataWorks, you must first configure EMR-HOOK on the cluster. If EMR-HOOK is not configured, DataWorks cannot display metadata in real time, generate audit logs, or show data lineage. Currently, EMR-HOOK is supported only for the EMR Hive and EMR Spark SQL services. For more information, see Configure EMR-HOOK for Hive and Configure EMR-HOOK for Spark SQL.

  • For EMR clusters with Kerberos authentication enabled, you must add an inbound rule to the security group to allow UDP access from the CIDR block of the vSwitch that is associated with the resource group.

    Note

    On the Basic Information tab of the EMR cluster, click the icon next to Cluster Security Group to go to the Security Group Details page. On the Access Rules tab, click Inbound, and then click Added Manually. Set Protocol Type to Custom UDP. For Port Range, enter the KDC port specified in the /etc/krb5.conf file of the EMR cluster. Set Authorized object to the CIDR block of the vSwitch that is associated with the resource group.
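
To find the KDC port for the inbound rule, you can read the `kdc` entries in /etc/krb5.conf. The following is a minimal sketch, assuming the common `kdc = host:port` syntax of krb5.conf and the standard Kerberos port 88 when no port is given. The function name `kdc_ports` and the sample host names are illustrative, not values from an actual EMR cluster.

```python
import re

DEFAULT_KDC_PORT = 88  # standard Kerberos port, used when an entry has no explicit port


def kdc_ports(krb5_conf_text: str) -> set:
    """Collect the ports of all `kdc = host[:port]` entries in krb5.conf text."""
    ports = set()
    for line in krb5_conf_text.splitlines():
        match = re.match(r"\s*kdc\s*=\s*(\S+)", line)
        if not match:
            continue  # not a kdc entry (comment, section header, other key)
        _host, sep, port = match.group(1).rpartition(":")
        ports.add(int(port) if sep and port.isdigit() else DEFAULT_KDC_PORT)
    return ports


if __name__ == "__main__":
    # Illustrative content; on a cluster node, read /etc/krb5.conf instead.
    sample = """
    [realms]
     EXAMPLE.COM = {
      kdc = master-1-1.example.internal:88
      admin_server = master-1-1.example.internal:749
     }
    """
    print(sorted(kdc_ports(sample)))
```

Enter the resulting port (or ports) as the Port Range of the Custom UDP rule.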

Usage notes

  • To isolate the development and production environments for a DataWorks workspace in standard mode, you must register two separate EMR clusters: one for the development environment and one for the production environment. The metadata of these clusters must be stored by using one of the following methods:

  • An EMR cluster can be registered with multiple workspaces within the same Alibaba Cloud account, but not across different accounts.

  • If a DataWorks resource group still cannot connect to an EMR cluster after you bind the resource group to the same VPC and vSwitch as the cluster, check the cluster security group rules. Add inbound rules that allow the vSwitch CIDR block to access the ports of common open source components. For more information, see Manage security groups of an EMR cluster.

Step 1: Go to the cluster registration page

  1. Log on to the DataWorks console. In the target region, click More > Management Center in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Management Center.

  2. In the left-side navigation pane, click Clusters to open the Clusters page. Click Register Cluster and set Select Cluster Type to E-MapReduce. The Register EMR Cluster page appears.

Step 2: Register an EMR cluster

On the Register EMR Cluster page, configure the cluster information.

Note

For workspaces in standard mode, you must configure cluster information separately for the development and production environments. For more information about workspace modes, see Differences between workspace modes.

  • Display Name of Cluster: Enter a unique display name for the cluster.

  • Alibaba Cloud Account To Which Cluster Belongs: Select the account that owns the EMR cluster.

    Note

    EMR Serverless Spark clusters cannot be registered across accounts.

Configure the parameters based on the selected account type.

Current account

If the cluster belongs to the Current Alibaba Cloud Account, you must configure the following parameters:

Parameter

Description

Cluster Type

Select the type of EMR cluster to register. For a list of supported cluster types, see Supported cluster types.

Cluster

Select the EMR cluster in the current account that you want to register with DataWorks.

Note

If you select EMR Serverless Spark, follow the on-screen instructions to select a Workspace Created in EMR Serverless Spark (the cluster to register), a default engine version, and a default resource queue.

Default Access Identity

Specify the identity used to access the EMR cluster in the current workspace.

  • Development environment: You can use the cluster account hadoop or the cluster account that is mapped to the task executor.

  • Production environment: You can use the cluster account hadoop or the cluster account that is mapped to the task owner, Alibaba Cloud account, or RAM user.

Note

If you select an identity mapped to a cluster account (such as task owner, Alibaba Cloud account, or RAM user), you can manually configure the mappings between DataWorks members and specified EMR cluster accounts. For more information, see Configure cluster identity mappings. If an access identity that requires a mapping is used to run an EMR task in DataWorks but no specific mapping is configured, DataWorks applies the following policies:

  • If a RAM user runs the task, DataWorks runs the task by using an EMR system account that has the same name as the current operator by default. If the cluster has LDAP or Kerberos authentication enabled, the task fails.

  • If an Alibaba Cloud account runs the task, the DataWorks task reports an error.

Pass Proxy User Information

Specify whether to pass proxy user information.

Note

When an authentication method such as LDAP or Kerberos is enabled, the cluster issues a credential to each user. To centralize permission management, you can use a superuser (real user) to act as a proxy for a regular user (proxy user) for authentication. When a proxy user accesses the cluster, the identity credentials of the superuser are used. You only need to add the user as a proxy user.

  • Pass: When tasks are run in the EMR cluster, data access permissions are verified and controlled based on the proxy user.

    • Data Studio and Data Analysis: The Alibaba Cloud account name of the task executor is dynamically passed as the proxy user information.

    • Operation Center: The Alibaba Cloud account name of the default access identity that is configured during cluster registration is passed as the proxy user information.

  • Do not pass: When tasks are run in the EMR cluster, data access permissions are verified and controlled based on the authentication method configured during cluster registration.

Proxy user information is passed in the following ways for different types of EMR tasks:

  • EMR Kyuubi tasks: passed by using the hive.server2.proxy.user parameter.

  • EMR Spark tasks and EMR Spark SQL tasks in non-JDBC mode: passed by using the --proxy-user parameter.

Configuration File

If you select HADOOP for Cluster Type, you can obtain the configuration files from the EMR console. For more information, see Export and import service configurations. After you export the files, rename them as required on the UI.

On the Basic Information tab of the cluster details page, click All Operations in the upper-right corner. In the Cluster Services section of the drop-down menu, select Export Service Configuration.

You can also log on to the EMR cluster and obtain the configuration files from the following paths:

/etc/ecm/hadoop-conf/core-site.xml
/etc/ecm/hadoop-conf/hdfs-site.xml
/etc/ecm/hadoop-conf/mapred-site.xml
/etc/ecm/hadoop-conf/yarn-site.xml
/etc/ecm/hive-conf/hive-site.xml
/etc/ecm/spark-conf/spark-defaults.conf
/etc/ecm/spark-conf/spark-env.sh
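
If you collect the files on a cluster node instead of exporting them from the console, a small script can gather them into one directory and report any that are missing. The following is a minimal sketch under that assumption: the function name `collect_configs` and its `root` parameter (which exists only so that the sketch can be tried against a test directory) are hypothetical. The file paths are the ones listed above.

```python
import shutil
from pathlib import Path

# Paths as listed in this topic.
CONFIG_PATHS = [
    "/etc/ecm/hadoop-conf/core-site.xml",
    "/etc/ecm/hadoop-conf/hdfs-site.xml",
    "/etc/ecm/hadoop-conf/mapred-site.xml",
    "/etc/ecm/hadoop-conf/yarn-site.xml",
    "/etc/ecm/hive-conf/hive-site.xml",
    "/etc/ecm/spark-conf/spark-defaults.conf",
    "/etc/ecm/spark-conf/spark-env.sh",
]


def collect_configs(dest_dir, root="/"):
    """Copy the configuration files that exist into dest_dir.

    Returns (copied_file_names, missing_paths).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for path in CONFIG_PATHS:
        src = Path(root) / path.lstrip("/")
        if src.is_file():
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
        else:
            missing.append(path)
    return copied, missing


if __name__ == "__main__":
    copied, missing = collect_configs("emr-configs")
    print("copied:", copied)
    print("missing:", missing)
```

After the files are collected, rename them as required on the registration UI before you upload them.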

Another account

If the cluster belongs to Another Alibaba Cloud Account, you must configure the following parameters:

Parameter

Description

UID of Alibaba Cloud Account

Enter the UID of the Alibaba Cloud account that owns the EMR cluster.

RAM Role

The RAM role used to access the EMR cluster. The role must meet the following requirements:

  • The RAM role is created in the other Alibaba Cloud account.

  • The RAM role in the other Alibaba Cloud account is granted permissions to access the DataWorks service in the current account.

Note

For more information about how to register an EMR cluster that belongs to a different account, see Scenario: Register an EMR cluster that belongs to a different account.

EMR Cluster Type

Select the type of EMR cluster to register. For cross-account registration, only EMR on ECS: DataLake cluster, EMR on ECS: Hadoop cluster, and EMR on ECS: Custom cluster are supported.

EMR Cluster

Select the EMR cluster in that account that you want to register with DataWorks.

Configuration File

Configure each configuration file as prompted on the UI. For details about how to obtain configuration files, see Export and import service configurations. After you export the files, rename them as required on the UI.

On the Basic Information tab of the cluster details page, click All Operations in the upper-right corner. In the Cluster Services section of the drop-down menu, select Export Service Configuration.

You can also log on to the EMR cluster and obtain the configuration files from the following paths:

/etc/ecm/hadoop-conf/core-site.xml
/etc/ecm/hadoop-conf/hdfs-site.xml
/etc/ecm/hadoop-conf/mapred-site.xml
/etc/ecm/hadoop-conf/yarn-site.xml
/etc/ecm/hive-conf/hive-site.xml
/etc/ecm/spark-conf/spark-defaults.conf
/etc/ecm/spark-conf/spark-env.sh

Default Access Identity

Specify the identity used to access the EMR cluster in the current workspace.

  • Development environment: You can use the cluster account hadoop or the cluster account that is mapped to the task owner.

  • Production environment: You can use the cluster account hadoop or the cluster account that is mapped to the task owner, Alibaba Cloud account, or RAM user.

Note

If you select an identity mapped to a cluster account (such as task owner, Alibaba Cloud account, or RAM user), you can manually configure the mappings between DataWorks members and specified EMR cluster accounts. For more information, see Configure cluster identity mappings. If an access identity that requires a mapping is used to run an EMR task in DataWorks but no specific mapping is configured, DataWorks applies the following policies:

  • If a RAM user runs the task, DataWorks runs the task by using an EMR system account that has the same name as the current operator by default. If the cluster has LDAP or Kerberos authentication enabled, the task fails.

  • If an Alibaba Cloud account runs the task, the DataWorks task reports an error.

Pass Proxy User Information

Specify whether to pass proxy user information.

Note

When an authentication method such as LDAP or Kerberos is enabled, the cluster issues a credential to each user. To centralize permission management, you can use a superuser (real user) to act as a proxy for a regular user (proxy user) for authentication. When a proxy user accesses the cluster, the identity credentials of the superuser are used. You only need to add the user as a proxy user.

  • Pass: When tasks are run in the EMR cluster, data access permissions are verified and controlled based on the proxy user.

    • Data Studio and Data Analysis: The Alibaba Cloud account name of the task executor is dynamically passed as the proxy user information.

    • Operation Center: The Alibaba Cloud account name of the default access identity that is configured during cluster registration is passed as the proxy user information.

  • Do not pass: When tasks are run in the EMR cluster, data access permissions are verified and controlled based on the authentication method configured during cluster registration.

Proxy user information is passed in the following ways for different types of EMR tasks:

  • EMR Kyuubi tasks: passed by using the hive.server2.proxy.user parameter.

  • EMR Spark tasks and EMR Spark SQL tasks in non-JDBC mode: passed by using the --proxy-user parameter.
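
To illustrate where these two parameters appear, the following sketch builds a `spark-submit` argument list with the `--proxy-user` option and a Hive-style JDBC URL that carries `hive.server2.proxy.user` as a session variable. This is an assumption-laden illustration, not a description of how DataWorks submits tasks internally: the function names, host, port, and user name are hypothetical placeholders.

```python
from typing import List, Optional


def spark_submit_args(app: str, proxy_user: Optional[str] = None) -> List[str]:
    """Build a spark-submit command line, adding --proxy-user when one is given."""
    args = ["spark-submit"]
    if proxy_user:
        args += ["--proxy-user", proxy_user]
    args.append(app)
    return args


def kyuubi_jdbc_url(host: str, port: int, proxy_user: Optional[str] = None,
                    database: str = "default") -> str:
    """Build a Hive-style JDBC URL, appending hive.server2.proxy.user when given."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if proxy_user:
        url += f";hive.server2.proxy.user={proxy_user}"
    return url


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(spark_submit_args("app.jar", proxy_user="alice")))
    print(kyuubi_jdbc_url("kyuubi.example.internal", 10009, proxy_user="alice"))
```

In both cases the connection itself is authenticated as the superuser, and the proxy user determines whose data access permissions are checked.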

Step 3: Initialize a resource group

You must initialize the resource group after you register a cluster for the first time, change cluster service configurations (for example, by modifying core-site.xml), or upgrade a component version. This ensures that the resource group can access EMR and that EMR tasks can run with the current environment configuration.

  1. On the Open Source Clusters page, select the tab for the registered EMR cluster and click Initialize Resource Group.

  2. Locate the required resource group and click Initialize in the Actions column.

    You can initialize a serverless resource group or a legacy exclusive resource group for scheduling.

  3. Wait 1 to 2 minutes for the initialization to complete, and then click OK.

Important
  • If the initialization fails, use the connectivity diagnosis tool to troubleshoot the cause.

  • Initialization can cause running tasks to fail. Unless immediate re-initialization is necessary, for example, to prevent widespread task failures after a configuration change, we recommend initializing the resource group during off-peak hours.

Next steps

  • Data development: Follow the data development workflow guide to configure component environments.

  • Configure cluster identity mappings: If the default access identity for the EMR cluster is not the hadoop account, you must configure identity mappings to control which resources each RAM user can access from DataWorks.

  • Set global YARN resource queues: Use YARN resource queue mappings to specify which YARN queue each module uses. You can also configure these settings to override individual module configurations.

  • Set global Spark parameters: Customize global Spark parameters based on the official Spark documentation. You can also specify whether workspace-level settings override module-specific configurations for parameters with the same name.

  • Set Kyuubi connection information: To use a custom account and password for Kyuubi tasks, customize the connection information as described in the linked topic.