Create an EMR Hive node

You can create an E-MapReduce (EMR) Hive node, which lets you use SQL-like statements to read, write, and manage large datasets stored in distributed systems. This is ideal for analyzing massive log volumes and performing related development work.

Prerequisites

  • An Alibaba Cloud EMR cluster is created and registered to DataWorks. For more information, see DataStudio (legacy): Register an EMR compute resource.

  • (Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add members to a workspace.

  • A serverless resource group is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use a serverless resource group.

  • A workflow is created in DataStudio.

    In DataStudio, development for every type of compute engine is organized in workflows, so you must create a workflow before you create a node. For more information, see Create a workflow.

Limitations

  • EMR Hive nodes can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling.

  • To manage metadata for a DataLake cluster or a custom cluster in DataWorks, you must configure the EMR-HOOK on the cluster. Without this configuration, DataWorks cannot display real-time metadata, generate audit logs, show data lineage, or perform EMR-related governance tasks. For instructions, see Configure the EMR-HOOK for Hive.

Step 1: Create an EMR Hive node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Create an EMR Hive node.

    1. Right-click the target workflow and choose Create Node > EMR > EMR Hive.

      Note

      You can also hover over Create and choose Create Node > EMR > EMR Hive.

    2. In the Create Node dialog box, enter a Name and select an Engine Instance, Node Type, and Path. Click Confirm to open the node editing page.

      Note

      The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
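As a rough illustration of the naming rule above, the following Python sketch checks a candidate name locally before you enter it in the dialog box. The function name `is_valid_node_name` is hypothetical, and treating "Chinese characters" as the CJK Unified Ideographs range U+4E00 to U+9FFF is an assumption; DataWorks may accept a different set and may enforce a length limit that this sketch does not model.

```python
import re

# Assumption: "Chinese characters" is approximated by the basic
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
_NODE_NAME = re.compile(r"[A-Za-z0-9_.\u4e00-\u9fff]+")


def is_valid_node_name(name: str) -> bool:
    """Return True if every character of name is in the allowed set."""
    return _NODE_NAME.fullmatch(name) is not None


print(is_valid_node_name("emr_hive_01.daily"))  # True
print(is_valid_node_name("bad-name"))           # False (hyphen is not allowed)
```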

Step 2: Develop the EMR Hive task

On the EMR Hive node editing page, double-click the node you created to open the task development page. Then, follow these steps.

Write SQL code

In the SQL editor, write the task code. You can define variables in the code by using the ${variable_name} format and assign values to them in the Scheduling > Scheduling Parameter section of the right-side panel. This enables dynamic parameter passing for scheduled runs. For more information about supported formats, see Supported formats for scheduling parameters. The following code provides an example:

-- List the tables in the current database.
show tables;
-- ${var} is replaced with the value assigned in Scheduling Parameter.
select '${var}';
-- Query the userinfo table.
select * from userinfo;

Note
  • The total size of the SQL statements cannot exceed 130 KB.

  • If multiple EMR compute engines are associated with your workspace in DataStudio, you must select the appropriate compute engine based on your business requirements. If only one EMR compute engine is associated, no selection is needed.

  • To change parameter assignments in your code, click Run with Parameters in the toolbar. For more information about parameter assignment logic, see Differences in parameter assignment logic among Run, Run with Parameters, and smoke testing.
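To make the ${variable_name} mechanism and the 130 KB limit more concrete, here is a minimal Python sketch that mimics the substitution locally. It is not how DataWorks implements scheduling parameters. The helper names `render_sql` and `within_size_limit` are hypothetical, and reading 130 KB as 130 × 1024 bytes of UTF-8 text is an assumption.

```python
import re

# Assumption: 130 KB means 130 * 1024 bytes of UTF-8 encoded SQL text.
MAX_SQL_BYTES = 130 * 1024

# Matches placeholders of the form ${name}.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_sql(template: str, params: dict) -> str:
    """Replace each ${name} placeholder with its assigned value."""
    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value assigned to ${{{name}}}")
        return params[name]

    return _PLACEHOLDER.sub(substitute, template)


def within_size_limit(sql: str, limit: int = MAX_SQL_BYTES) -> bool:
    """Check the encoded size of the SQL text against the limit."""
    return len(sql.encode("utf-8")) <= limit


rendered = render_sql("select '${var}';", {"var": "20240101"})
print(rendered)                     # select '20240101';
print(within_size_limit(rendered))  # True
```

Raising an error for an unassigned placeholder is a design choice of this sketch; it surfaces a missing assignment early instead of sending a literal ${var} to Hive.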

(Optional) Configure advanced parameters

You can configure node-specific properties in the Advanced Settings section. For more information about available settings, see Spark Configuration. The available advanced parameters vary by EMR cluster type, as shown in the following tables.

DataLake and custom clusters

| Parameter | Description |
| --- | --- |
| queue | The scheduling queue for job submission. The default queue is default. For more information about YARN in E-MapReduce, see Basic queue configurations. |
| priority | The job priority. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | Controls how SQL statements are executed. Valid values: true (multiple SQL statements are executed at a time) and false (default; one SQL statement is executed at a time). Note: This parameter is available only for workflow test runs in the development environment. |
| DATAWORKS_SESSION_DISABLE | Applies to direct test runs in the development environment. Valid values: true (a new JDBC connection is created each time an SQL statement is run) and false (default; the same JDBC connection is reused when you run different SQL statements in a node). Note: When this parameter is set to false, Hive does not print the YARN applicationId. To print the YARN applicationId, set this parameter to true. |
| Others | You can also add custom Hive connection parameters in the advanced settings. |

Hadoop cluster

| Parameter | Description |
| --- | --- |
| queue | The scheduling queue for job submission. The default queue is default. For more information about YARN in E-MapReduce, see Basic queue configurations. |
| priority | The job priority. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | Controls how SQL statements are executed. Valid values: true (multiple SQL statements are executed at a time) and false (default; one SQL statement is executed at a time). Note: This parameter is available only for workflow test runs in the development environment. |
| USE_GATEWAY | Specifies whether to submit jobs from this node through a gateway cluster. Valid values: true (jobs are submitted through a gateway cluster) and false (default; jobs are not submitted through a gateway cluster and go to the header node instead). Note: If the node's cluster is not associated with a gateway cluster, setting this parameter to true causes subsequent EMR job submissions to fail. |
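The defaults listed for the two cluster types can be summarized in a short Python sketch that merges your overrides onto them. This is only a way to reason about the effective settings, not a DataWorks API. The function `effective_settings`, the cluster-type keys, and the use of string values are all assumptions made for illustration.

```python
# Defaults shared by both cluster types, as listed in the parameter tables.
COMMON_DEFAULTS = {
    "queue": "default",
    "priority": "1",
    "FLOW_SKIP_SQL_ANALYZE": "false",
}

# Defaults that apply to one cluster type only.
# The dictionary keys are hypothetical labels, not DataWorks identifiers.
CLUSTER_DEFAULTS = {
    "datalake_or_custom": {"DATAWORKS_SESSION_DISABLE": "false"},
    "hadoop": {"USE_GATEWAY": "false"},
}


def effective_settings(cluster_type: str, overrides: dict = None) -> dict:
    """Merge user overrides onto the documented defaults for a cluster type."""
    if cluster_type not in CLUSTER_DEFAULTS:
        raise ValueError(f"unknown cluster type: {cluster_type}")
    settings = {**COMMON_DEFAULTS, **CLUSTER_DEFAULTS[cluster_type]}
    # Unknown keys pass through unchanged, mirroring the "Others" row
    # for custom Hive connection parameters.
    settings.update(overrides or {})
    return settings


print(effective_settings("hadoop", {"queue": "etl"}))
```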

Run the SQL task

  1. In the toolbar, click the Run with Parameters icon. In the Parameter dialog box, select the scheduling resource group that you created and click Run.

    Note
    • To access a compute engine on the public internet or in a VPC, use a scheduling resource group that can connect to that engine. For more information, see Network connection solutions.

    • To switch to a different resource group for subsequent runs, click the Run with Parameters icon and select another scheduling resource group.

    • When you query data using an EMR Hive node, query results are capped at 10,000 records and a total size of 10 MB.

  2. Click the Save icon to save the SQL statements.

  3. (Optional) Perform smoke testing.

    If you want to perform smoke testing in the development environment, you can run it before or after you submit the node. For more information, see Perform smoke testing.

Step 3: Configure scheduling properties

To run the task on a schedule, click Properties in the right-side navigation pane of the node's configuration tab and configure the scheduling properties based on your business requirements. For more information, see Overview.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

Step 4: Deploy the task

After you configure the task, commit and deploy it. The system then runs the task periodically based on its scheduling configuration.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, decide whether the task code should be reviewed after you commit the task.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • Code review helps ensure code quality and prevents execution errors caused by invalid task code. If you enable code review, committed task code can be deployed only after it passes the review. For more information, see Code review.

If you use a workspace in standard mode, you must also deploy the task to the production environment after you commit it. To do so, click Deploy in the upper-right corner of the node's configuration tab. For more information, see Deploying tasks.

More operations

After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see Manage auto triggered tasks.

FAQ

Q: Why do I receive a connection timeout (ConnectException) error when running a node?

A: This error can occur if the resource group and the cluster do not have network connectivity. To resolve this issue, go to the computing resource list page, find the resource, and click Resource Initialization. In the dialog box that appears, click Re-initialize and ensure that the resource is successfully initialized.
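Before re-initializing, it can help to confirm that the port is reachable at all. The Python sketch below attempts a plain TCP connection and reports whether it succeeds. It is only meaningful when run from a host that shares the resource group's network path to the cluster (for example, a machine in the same VPC), which is an assumption about your environment. The function name `can_connect` and the host name in the usage comment are hypothetical; 10000 is the default HiveServer2 port, and your cluster may use a different one.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Hypothetical usage; replace the host with your cluster's HiveServer2 address:
# can_connect("hive-server.example.internal", 10000)
```

A False result points to a network-level problem (routing, security group rules, or a stopped service) rather than a problem in the SQL itself.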
