DataWorks lets you create nodes such as Hive, MapReduce (MR), Presto, and Spark SQL on an E-MapReduce (EMR) cluster to configure task workflows, schedule jobs, and manage metadata. This topic describes how to register an EMR cluster that belongs to the same or a different Alibaba Cloud account.
Background
The open source big data development platform E-MapReduce (EMR) is a big data processing solution that runs on the Alibaba Cloud platform.
Built on open source Apache Hadoop and Apache Spark, EMR lets you use other systems in their ecosystems to analyze and process data. EMR can also transfer data to and from other Alibaba Cloud services, such as Object Storage Service (OSS) and RDS. Alibaba Cloud EMR is available in various forms, including on ECS, on ACK, and serverless, to meet different user needs.
When you run EMR tasks in DataWorks, you can choose from various EMR components. The optimal configuration varies by component. When you configure an EMR cluster, refer to EMR cluster configuration recommendations and select one based on your actual requirements.
Supported cluster types
Limitations
-
Permissions: Only the following identities can register an EMR cluster. To grant these permissions, see Grant permissions to a RAM user.
-
An Alibaba Cloud account.
-
A RAM user or RAM role with the DataWorks Workspace Administrator role and the AliyunEMRFullAccess policy.
-
A RAM user or RAM role with the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies.
-
Region: EMR Serverless Spark is available only in China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), and US (Virginia).
-
Task type: DataWorks does not support EMR Flink tasks.
-
Task execution: DataWorks supports running EMR tasks by using serverless resource groups (recommended) or legacy exclusive resource groups for scheduling.
-
Data governance:
-
Only SQL tasks on EMR Hive, EMR Spark, and EMR Spark SQL nodes support data lineage generation. For clusters of version 5.9.1 or 3.43.1 or later, these nodes support both table-level and column-level data lineage.
Note: For Spark nodes, table-level and column-level data lineage are supported on EMR clusters of version 5.8.0 or 3.42.0 or later. On earlier versions, only Spark 2.x supports table-level data lineage.
-
To manage metadata for a DataLake or custom cluster in DataWorks, you must first configure EMR-HOOK on the cluster. If EMR-HOOK is not configured, DataWorks cannot display metadata in real time, generate audit logs, or show data lineage. Currently, EMR-HOOK is supported only for the EMR Hive and EMR Spark SQL services. For more information, see Configure EMR-HOOK for Hive and Configure EMR-HOOK for Spark SQL.
-
For EMR clusters with Kerberos authentication enabled, you must add an inbound rule to the security group to allow UDP access from the CIDR block of the vSwitch that is associated with the resource group.
Note: On the Basic Information tab of the EMR cluster, click the icon for Cluster Security Group to open the Security Group Details page. On the Access Rules tab, click Inbound, and then click Added Manually. Set Protocol Type to Custom UDP. For Port Range, enter the KDC port specified in the /etc/krb5.conf file of the EMR cluster. Set Authorized Object to the CIDR block of the vSwitch that is associated with the resource group.
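The KDC port needed for the inbound rule can be read from the cluster's krb5.conf file. The following Python sketch shows one way to collect it. It assumes that the file uses the standard `kdc = host[:port]` entries and that entries without an explicit port use the Kerberos default port 88. The function names `kdc_ports` and `build_udp_rule`, the dictionary keys, and the sample host names are illustrative only and are not part of any Alibaba Cloud API.

```python
import ipaddress
import re

DEFAULT_KDC_PORT = 88  # Kerberos default, used when an entry names no port


def kdc_ports(krb5_conf_text):
    """Return the sorted KDC ports found in the text of a krb5.conf file."""
    ports = set()
    for line in krb5_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments
        match = re.match(r"kdc\s*=\s*(\S+)$", line)
        if not match:
            continue
        _host, sep, port = match.group(1).rpartition(":")
        ports.add(int(port) if sep and port.isdigit() else DEFAULT_KDC_PORT)
    return sorted(ports)


def build_udp_rule(vswitch_cidr, port):
    """Describe the inbound rule to enter in the console (keys are illustrative)."""
    ipaddress.ip_network(vswitch_cidr)  # raises ValueError on a malformed CIDR block
    return {
        "direction": "inbound",
        "protocol": "Custom UDP",
        "port_range": str(port),
        "authorized_object": vswitch_cidr,
    }


# Hypothetical krb5.conf excerpt; real host names and realms will differ.
SAMPLE_KRB5_CONF = """
[realms]
 EMR.EXAMPLE.COM = {
  kdc = emr-header-1.example.internal:88
  admin_server = emr-header-1.example.internal:749
 }
"""
```

Calling `kdc_ports(SAMPLE_KRB5_CONF)` on the sample text yields the port to enter for Port Range, and `build_udp_rule` pairs it with the vSwitch CIDR block so that both values can be reviewed before you add the rule in the console.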
Usage notes
-
To isolate the development and production environments for a DataWorks workspace in standard mode, you must register two separate EMR clusters: one for the development environment and one for the production environment. The metadata of these clusters must be stored by using one of the following methods:
-
Method 1 (recommended for data lake solutions): Store the metadata in two different catalogs in Data Lake Formation (DLF). For more information, see Switch the metadata storage type.
-
Method 2: Store the metadata in two different databases in RDS. For more information, see Configure a self-managed RDS instance.
-
An EMR cluster can be registered with multiple workspaces within the same Alibaba Cloud account, but not across different accounts.
-
If a DataWorks resource group still cannot connect to an EMR cluster after you bind the resource group to the same VPC and vSwitch as the cluster, check the cluster security group rules. Add inbound rules that allow access from the vSwitch CIDR block to the ports of common open source components. For more information, see Manage security groups of an EMR cluster.
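To narrow down whether a security group rule is blocking access, you can test TCP reachability from a machine in the same vSwitch as the resource group. The following Python sketch is a rough check under that assumption, not a replacement for the DataWorks connectivity diagnosis. The function name `check_ports` is illustrative, and the host and ports you pass in depend on which components your cluster runs.

```python
import socket


def check_ports(host, ports, timeout=3.0):
    """Return {port: True/False} for TCP reachability of each port on host."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, `check_ports("emr-master.example.internal", [10000, 8088])` would probe the Apache default ports for HiveServer2 and the YARN ResourceManager web UI on a hypothetical master node. Both the host name and the port choice are assumptions, so substitute the values from your cluster. The check covers TCP only and says nothing about the UDP rule required for Kerberos.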
Step 1: Go to the cluster registration page
Log on to the DataWorks console. In the target region, click in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Management Center.
-
In the left-side navigation pane, click Clusters to open the Clusters page. Click Register Cluster and set Select Cluster Type to E-MapReduce. The Register EMR Cluster page appears.
Step 2: Register an EMR cluster
On the Register EMR Cluster page, configure the cluster information.
For workspaces in standard mode, you must configure cluster information separately for the development and production environments. For more information about workspace modes, see Differences between workspace modes.
-
Display Name of Cluster: Enter a unique display name for the cluster.
-
Alibaba Cloud Account To Which Cluster Belongs: Select the account that owns the EMR cluster.
Note: EMR Serverless Spark clusters cannot be registered across accounts.
Configure the parameters based on the selected account type.
Current account
If the cluster belongs to the Current Alibaba Cloud Account, you must configure the following parameters:
| Parameter | Description |
| --- | --- |
| Cluster Type | Select the type of EMR cluster to register. For a list of supported cluster types, see Limitations. |
| Cluster | Select the EMR cluster in the current account that you want to register with DataWorks. Note: If you select EMR Serverless Spark, follow the on-screen instructions to select a Workspace Created in EMR Serverless Spark (the cluster to register), a default engine version, and a default resource queue. |
| Default Access Identity | Specify the identity used to access the EMR cluster in the current workspace. Note: If you select an identity mapped to a cluster account (such as task owner, Alibaba Cloud account, or RAM user), you can manually configure the mappings between DataWorks members and specified EMR cluster accounts. For more information, see Configure cluster identity mappings. If an access identity that requires a mapping is used to run an EMR task in DataWorks but no specific mapping is configured, DataWorks applies the following policies: |
| Pass Proxy User Information | Specify whether to pass proxy user information. Note: When an authentication method such as LDAP or Kerberos is enabled, the cluster issues a credential to each user. To centralize permission management, you can use a superuser (real user) to act as a proxy for a regular user (proxy user) for authentication. When a proxy user accesses the cluster, the identity credentials of the superuser are used. You only need to add the user as a proxy user. Proxy user information is passed in the following ways for different types of EMR tasks: |
| Configuration File | If you select HADOOP for Cluster Type, you can obtain the configuration files from the EMR console. To export them, go to the Basic Information tab of the cluster details page, click All Operations in the upper-right corner, and select Export Service Configuration in the Cluster Services section of the drop-down menu. After you export the files, rename them as required on the UI. For more information, see Export and import service configurations. You can also log on to the EMR cluster and obtain the configuration files from the following paths: |
Another account
If the cluster belongs to Another Alibaba Cloud Account, you must configure the following parameters:
| Parameter | Description |
| --- | --- |
| UID of Alibaba Cloud Account | Enter the UID of the Alibaba Cloud account that owns the EMR cluster. |
| RAM Role | The RAM role used to access the EMR cluster. The role must meet the following requirements: Note: For more information about how to register an EMR cluster that belongs to a different account, see Scenario: Register an EMR cluster that belongs to a different account. |
| EMR Cluster Type | Select the type of EMR cluster to register. For cross-account registration, only |
| EMR Cluster | Select the EMR cluster in that account that you want to register with DataWorks. |
| Configuration File | Configure each configuration file as prompted on the UI. To export the files, go to the Basic Information tab of the cluster details page, click All Operations in the upper-right corner, and select Export Service Configuration in the Cluster Services section of the drop-down menu. After you export the files, rename them as required on the UI. For more information, see Export and import service configurations. You can also log on to the EMR cluster and obtain the configuration files from the following paths: |
| Default Access Identity | Specify the identity used to access the EMR cluster in the current workspace. Note: If you select an identity mapped to a cluster account (such as task owner, Alibaba Cloud account, or RAM user), you can manually configure the mappings between DataWorks members and specified EMR cluster accounts. For more information, see Configure cluster identity mappings. If an access identity that requires a mapping is used to run an EMR task in DataWorks but no specific mapping is configured, DataWorks applies the following policies: |
| Pass Proxy User Information | Specify whether to pass proxy user information. Note: When an authentication method such as LDAP or Kerberos is enabled, the cluster issues a credential to each user. To centralize permission management, you can use a superuser (real user) to act as a proxy for a regular user (proxy user) for authentication. When a proxy user accesses the cluster, the identity credentials of the superuser are used. You only need to add the user as a proxy user. Proxy user information is passed in the following ways for different types of EMR tasks: |
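Several registration constraints stated in this topic can be checked before you fill in the form: a standard-mode workspace needs separate development and production clusters, their metadata must be stored separately, and EMR Serverless Spark clusters cannot be registered across accounts. The following Python sketch encodes only those three rules. The `ClusterPlan` type, its field names, and the sample identifiers are hypothetical and do not correspond to any DataWorks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPlan:
    cluster_id: str      # EMR cluster to register (hypothetical identifier)
    metadata_store: str  # DLF catalog or RDS database that holds the metadata
    cluster_type: str    # for example "DataLake" or "EMR Serverless Spark"
    same_account: bool   # True if the cluster is in the current Alibaba Cloud account


def check_standard_mode(dev, prod):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if dev.cluster_id == prod.cluster_id:
        problems.append("development and production must use separate EMR clusters")
    if dev.metadata_store == prod.metadata_store:
        problems.append("development and production metadata must be stored separately")
    for env, plan in (("development", dev), ("production", prod)):
        if plan.cluster_type == "EMR Serverless Spark" and not plan.same_account:
            problems.append(f"{env}: EMR Serverless Spark cannot be registered across accounts")
    return problems


# Illustrative plan with made-up identifiers.
dev_plan = ClusterPlan("c-dev-0001", "dlf_catalog_dev", "DataLake", True)
prod_plan = ClusterPlan("c-prod-0001", "dlf_catalog_prod", "DataLake", True)
```

With the sample plans, `check_standard_mode(dev_plan, prod_plan)` returns an empty list. The check is deliberately narrow: it does not verify permissions, regions, or the RAM role requirements for cross-account registration.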
Step 3: Initialize a resource group
You must initialize the resource group after you register a cluster for the first time, change cluster service configurations (for example, by modifying core-site.xml), or upgrade a component version. This ensures that the resource group can access EMR and that EMR tasks can run with the current environment configuration.
-
On the Open Source Clusters page, select the tab for the registered EMR cluster and click Initialize Resource Group.
-
Locate the required resource group and click Initialize in the Actions column.
You can initialize a serverless resource group or a legacy exclusive resource group for scheduling.
-
Wait 1 to 2 minutes for the initialization to complete, and then click OK.
-
If the initialization fails, use the connectivity diagnosis tool to troubleshoot the cause.
-
Initialization can cause running tasks to fail. Unless immediate re-initialization is necessary, for example, to prevent widespread task failures after a configuration change, we recommend initializing the resource group during off-peak hours.
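Because the resource group must be re-initialized after cluster service configurations change, it can help to detect such changes before you schedule the re-initialization. The following Python sketch fingerprints a directory of exported configuration files with SHA-256 and reports which files differ from an earlier snapshot. The directory layout and the function names `fingerprint` and `changed_files` are assumptions for illustration. DataWorks does not provide this helper.

```python
import hashlib
from pathlib import Path


def fingerprint(config_dir):
    """Map each *.xml file name in config_dir to the SHA-256 digest of its bytes."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(config_dir).glob("*.xml"))
    }


def changed_files(previous, current):
    """Return the names of files that were added, removed, or modified."""
    names = set(previous) | set(current)
    return sorted(name for name in names if previous.get(name) != current.get(name))
```

A non-empty result from `changed_files`, for example one that lists core-site.xml, suggests that the resource group should be re-initialized, preferably during off-peak hours as recommended above.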
Next steps
-
Data development: Follow the data development workflow guide to configure component environments.
-
Configure cluster identity mappings: If the default access identity for the EMR cluster is not the hadoop account, you must configure identity mappings to control which resources RAM users can access in DataWorks.
-
Set global YARN resource queues: Use YARN resource queue mappings to specify which YARN queue each module uses. You can also configure these settings to override individual module configurations.
-
Set global Spark parameters: Customize global Spark parameters based on the official Spark documentation. You can also specify whether workspace-level settings override module-specific configurations for parameters with the same name.
-
Set Kyuubi connection information: To use a custom account and password for Kyuubi tasks, customize the connection information as described in this topic.