EMR Hive node

The E-MapReduce (EMR) Hive node in DataWorks supports batch analysis of large-scale data stored in distributed systems. This node simplifies big data processing and improves development efficiency. In an EMR Hive node, you use SQL-like statements to read, write, and manage large datasets, which lets you efficiently analyze massive amounts of log data.

Prerequisites

  • Create an Alibaba Cloud EMR cluster and register it with DataWorks. For more information, see New Data Studio: Attach an EMR compute resource.

  • (Optional, for RAM users) Add the Resource Access Management (RAM) user who develops tasks to the workspace and assign the user the Development or Workspace Administrator role. The Workspace Administrator role includes extensive permissions, so grant it with caution. For more information, see Add workspace members.

    If you are using a root account, skip this step.

  • Configure a Hive data source in DataWorks and ensure that it passes the connectivity test. For more information, see Data Source Management.

Limits

  • This type of task can run only on Serverless resource groups (recommended) or exclusive resource groups for scheduling.

  • To manage metadata for DataLake or custom clusters in DataWorks, you must first configure EMR-HOOK on the cluster. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Hive.

    Note

    If you do not configure EMR-HOOK on the cluster, DataWorks cannot display metadata in real time, generate audit logs, display data lineage, or perform EMR-related administration tasks.

Step 1: Develop an EMR Hive node

You can develop the node on the EMR Hive node editing page.

Develop SQL code

Write the code for your task in the SQL editing area. You can define variables in the code using the ${variable_name} format and assign a value to each variable in the Scheduling Parameters section under Scheduling Settings on the right side of the node editing page. This lets you dynamically pass parameters to the code in scheduling scenarios. For more information about scheduling parameters, see Source and Expressions of Scheduling Parameters. The following is an example.

SHOW TABLES;
SELECT '${var}'; -- Use with scheduling parameters.
SELECT * FROM userinfo;

Note

The maximum size of an SQL statement is 130 KB.
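To make the placeholder mechanism concrete, the following Python sketch mimics how a `${variable_name}` placeholder might be replaced with a scheduling parameter value, and then checks the rendered statement against the 130 KB limit. It is an illustration only, not how DataWorks implements substitution. The helper names `render_sql` and `within_size_limit` and the sample value `20251018` are hypothetical, and the sketch assumes that the limit counts UTF-8 bytes with 1 KB = 1024 bytes.

```python
from string import Template

# Assumption: the 130 KB limit counts UTF-8 bytes, with 1 KB = 1024 bytes.
MAX_SQL_BYTES = 130 * 1024


def render_sql(sql: str, params: dict) -> str:
    """Replace ${name} placeholders with values from params.

    Placeholders without a value are left untouched (safe_substitute),
    so a missing scheduling parameter is easy to spot in the output.
    """
    return Template(sql).safe_substitute(params)


def within_size_limit(sql: str, limit: int = MAX_SQL_BYTES) -> bool:
    """Return True if the UTF-8 encoded statement fits within the limit."""
    return len(sql.encode("utf-8")) <= limit


statement = "SELECT '${var}';"
rendered = render_sql(statement, {"var": "20251018"})
print(rendered)                     # SELECT '20251018';
print(within_size_limit(rendered))  # True
```

Leaving unresolved placeholders in place, rather than raising an error, is a design choice for this sketch: it makes a forgotten parameter assignment visible in the rendered SQL.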

Step 2: Configure the EMR Hive node

(Optional) Configure advanced parameters

You can configure the advanced parameters listed in the following tables in the Scheduling Settings section on the right side of the node editing page, under EMR Node Parameters > DataWorks parameters.

Note
  • The available advanced parameters vary based on the EMR cluster type, as shown in the following tables.

  • You can configure more open-source Spark properties in the Scheduling Settings section on the right side of the node, under EMR Node Parameters > Spark parameter.

DataLake cluster/Custom cluster: EMR on ECS

| Advanced parameter | Description |
| --- | --- |
| queue | The scheduling queue to which the job is submitted. The default queue is `default`. For more information about EMR YARN, see Basic queue configuration. |
| priority | The priority. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | The execution mode for SQL statements. Valid values: true (executes multiple SQL statements at a time) and false (default; executes one SQL statement at a time). Note: This parameter is supported only for test runs in the development environment. |
| DATAWORKS_SESSION_DISABLE | Applies to test run scenarios in the development environment. Valid values: true (a new Java Database Connectivity (JDBC) connection is created each time an SQL statement is run) and false (default; the same JDBC connection is reused when a user runs different SQL statements within the same node). Note: If this parameter is set to false, the Hive yarn applicationId is not printed. To print the yarn applicationId, set this parameter to true. |
| Other | You can also append custom Hive connection parameters in the advanced configuration section. |

Hadoop cluster: EMR on ECS

| Advanced parameter | Description |
| --- | --- |
| queue | The scheduling queue to which the job is submitted. The default queue is `default`. For more information about EMR YARN, see Basic queue configuration. |
| priority | The priority. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | The execution mode for SQL statements. Valid values: true (executes multiple SQL statements at a time) and false (default; executes one SQL statement at a time). Note: This parameter is supported only for test runs in the development environment. |
| USE_GATEWAY | Specifies whether to submit the job for this node through a gateway cluster. Valid values: true (submits the job through a gateway cluster) and false (default; does not submit the job through a gateway cluster, and the job is submitted to the header node). Note: If the cluster where this node resides is not associated with a gateway cluster and you manually set this parameter to true, subsequent EMR job submissions fail. |
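The defaults listed above for the two cluster types can be summarized in a small lookup. The following Python sketch is only an aid for reasoning about which parameters apply to which cluster type and what happens when you override one. The dictionary layout, the cluster-type labels, the `resolve_params` helper, and the queue name `etl` are hypothetical and are not part of DataWorks.

```python
# Defaults copied from the parameter descriptions above.
# The cluster-type keys are hypothetical labels used only in this sketch.
DEFAULTS = {
    "datalake_or_custom": {
        "queue": "default",
        "priority": 1,
        "FLOW_SKIP_SQL_ANALYZE": False,
        "DATAWORKS_SESSION_DISABLE": False,
    },
    "hadoop": {
        "queue": "default",
        "priority": 1,
        "FLOW_SKIP_SQL_ANALYZE": False,
        "USE_GATEWAY": False,
    },
}


def resolve_params(cluster_type: str, overrides: dict) -> dict:
    """Merge user overrides onto the documented defaults for a cluster type."""
    if cluster_type not in DEFAULTS:
        raise ValueError(f"unknown cluster type: {cluster_type}")
    merged = dict(DEFAULTS[cluster_type])  # copy so DEFAULTS stays unchanged
    merged.update(overrides)
    return merged


params = resolve_params("hadoop", {"queue": "etl", "USE_GATEWAY": True})
print(params["queue"], params["USE_GATEWAY"], params["priority"])  # etl True 1
```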

To run the node task on a schedule, you must configure its scheduling properties. For more information, see Node scheduling configuration.

Step 3: Run and debug the node

Execute the SQL task

  1. In Run Configuration, go to Compute Resource and configure Compute Resource and Resource Group.

    Note
    • You can also set CUs for Scheduling based on the resources required for task execution. The default is 0.25.

    • To access data sources on the public internet or in a VPC, you must use a scheduling resource group that has passed the connectivity test for that data source. For more information, see Network connectivity solutions.

  2. In the parameter dialog box on the toolbar, select the Hive data source that you created and click Run to execute the SQL task.

    Note

    A query run from an EMR Hive node returns at most 10,000 records, and the total size of the returned data cannot exceed 10 MB.

  3. You can save the node task by clicking Save.

What to do next

  1. After you configure the node, you must publish it. For more information, see Publish nodes or workflows.

  2. After the task is published, you can view the status of the auto-triggered task in Operation Center. For more information, see Get started with Operation Center.

FAQ

Q: Why does a connection timeout (ConnectException) occur when I run a node?

EMR execute task failed!
SQL: {"name":"dw20251018","type":"HIVE_SQL","launcher":{"allocationSpec":{}},"properties":{"envs":{"FLOW_SKIP_SQL_ANALYZE":false},"arguments":["select * from default.dim_customers"],"tags":[],"description":"DataWorks"}}
TASK-MESSAGE:
FAILED: Build connection error! Could not open client transport with JDBC Uri: jdbc:hive2://xxx:10000: java.net.ConnectException: Connection timed out

A: Ensure that the resource group and the cluster can connect to each other over the network. Go to the computing resources list and click Resource Initialization. In the dialog box that appears, click Re-initialize. Verify that the initialization is successful.
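As a first diagnostic, you can check whether the HiveServer2 endpoint named in the error message accepts TCP connections. The following Python sketch is a rough illustration under stated assumptions: `can_connect` is a hypothetical helper, the host in the example call is a placeholder for the address shown as `xxx` in the log, and the result reflects connectivity only from the machine where the script runs, which may differ from the network of the resource group.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call (placeholder host; replace it with your HiveServer2 address):
# can_connect("hiveserver2.example.internal", 10000)
```

A result of False from a host in the same VPC as the resource group suggests a network or security-group issue rather than a problem with the SQL itself.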