Create an EMR Impala node

Impala is an interactive SQL query engine for performing fast, real-time queries on petabyte-scale big data. This topic describes how to create an EMR Impala node in DataWorks and use it for data development.

Prerequisites

  • An Alibaba Cloud EMR cluster is created and registered to DataWorks. For more information, see Data Studio (legacy): Register an EMR cluster.

  • (Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add members to a workspace.

  • A serverless resource group is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use a serverless resource group.

  • A workflow is created in DataStudio.

    Development operations in different types of compute engines are performed based on workflows in DataStudio. Therefore, before you create a node, you must create a workflow. For more information, see Create a workflow.

Limitations

  • This type of task can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling.

  • EMR Impala runs only on compute resources of the legacy data lake cluster (Hadoop) type. DataWorks no longer supports binding new Hadoop-type clusters, but you can continue to use Hadoop clusters that were previously bound.

Step 1: Create an EMR Impala node

  1. Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

  2. Create an EMR Impala node.

    1. Right-click the target workflow and choose Create Node > EMR > EMR Impala.

      Note

      You can also hover over Create and choose Create Node > EMR > EMR Impala.

    2. In the Create Node dialog box, enter a Name and select the Engine Instance, Node Type, and Path. Click OK to go to the EMR Impala node configuration tab.

      Note

      A node name can contain uppercase letters, lowercase letters, digits, underscores (_), and periods (.).
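      The naming rule can be checked locally before you create many nodes by script or by hand. The following Python snippet is a minimal sketch, not part of DataWorks. It assumes the characters listed in the note are the only ones allowed, and it enforces no length limit because this topic does not state one. The function name is hypothetical.

```python
import re

# Allowed characters per the note above: letters, digits, underscores (_), periods (.).
# Assumption: no other characters are accepted. No length limit is enforced here
# because this topic does not state one.
_NODE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_.]+")


def is_valid_node_name(name: str) -> bool:
    """Return True if name is non-empty and uses only the allowed characters."""
    return _NODE_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("impala_daily.v2", "impala-daily"):
        print(candidate, is_valid_node_name(candidate))
```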

Step 2: Develop the EMR Impala task

On the EMR Impala node configuration tab, double-click the node that you created to go to the task development tab and perform the following operations.

Develop SQL code

Develop your task code in the SQL editor. You can define variables in the format ${variable_name} and assign values to them in the Schedule Settings > Scheduling Parameter section in the right-side navigation pane. This enables dynamic parameter passing in scheduled scenarios. For more information about the supported formats of scheduling parameters, see Formats of scheduling parameters. The following code provides an example.

SHOW TABLES;
CREATE TABLE IF NOT EXISTS userinfo (
    ip STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
)
PARTITIONED BY (
    dt STRING
);
ALTER TABLE userinfo ADD IF NOT EXISTS PARTITION (dt='${bizdate}'); -- Can be used with scheduling parameters.
SELECT * FROM userinfo;

Note
  • The SQL statements cannot exceed 130 KB in size.

  • If multiple EMR compute resources are associated in the Data Studio of your workspace, select the appropriate compute resource based on your business requirements. If only one EMR compute resource is associated, no selection is required.

  • If you want to modify parameter values in the code, click Advanced Run on the toolbar. For more information about parameter assignment logic, see Parameter assignment logic.
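To preview what a statement looks like after a value is assigned to a ${variable_name} placeholder, the substitution can be imitated locally. The following Python snippet is a minimal sketch, not the DataWorks implementation. The function names are hypothetical and the value 20240101 is an example only. The size check assumes that 1 KB is 1,024 bytes, that the text is UTF-8 encoded, and that the 130 KB limit applies to the whole SQL text of the node.

```python
import re

# Assumption: 1 KB = 1,024 bytes, measured on the UTF-8 encoded SQL text.
MAX_SQL_BYTES = 130 * 1024

# Matches placeholders of the form ${variable_name}.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_sql(template: str, params: dict[str, str]) -> str:
    """Replace each ${name} placeholder with the value assigned in params."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value assigned to parameter: {name}")
        return params[name]

    return _PLACEHOLDER.sub(substitute, template)


def within_size_limit(sql: str) -> bool:
    """Return True if the SQL text is at most 130 KB under the assumptions above."""
    return len(sql.encode("utf-8")) <= MAX_SQL_BYTES


if __name__ == "__main__":
    template = "ALTER TABLE userinfo ADD IF NOT EXISTS PARTITION (dt='${bizdate}');"
    sql = render_sql(template, {"bizdate": "20240101"})  # example value only
    print(sql)
    print(within_size_limit(sql))
```

A placeholder without an assigned value raises KeyError in this sketch, which makes a missing scheduling parameter visible before the task is run.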

(Optional) Configure advanced parameters

You can configure specific property parameters in the Advanced Settings section of the node. For more property parameter settings, see Spark Configuration. The advanced parameters that can be configured vary by EMR cluster type, as described in the following tables.

DataLake cluster/custom cluster: EMR on ECS

Advanced parameter: FLOW_SKIP_SQL_ANALYZE

Description: The SQL statement execution method. Valid values:

  • true: Multiple SQL statements are executed at a time.

  • false (default): A single SQL statement is executed at a time.

Note

This parameter is supported only for test runs in the development environment.

Advanced parameter: DATAWORKS_SESSION_DISABLE

Description: Applicable to direct test run scenarios in the development environment. Valid values:

  • true: A new JDBC connection is created each time an SQL statement is run.

  • false (default): The same JDBC connection is reused when different SQL statements are run within a single node.

Note

When this parameter is set to false, the Hive yarn applicationId is not printed. To print the yarn applicationId, set this parameter to true.

Hadoop cluster: EMR on ECS

Advanced parameter: FLOW_SKIP_SQL_ANALYZE

Description: The SQL statement execution method. Valid values:

  • true: Multiple SQL statements are executed at a time.

  • false (default): A single SQL statement is executed at a time.

Note

This parameter is supported only for test runs in the development environment.

Advanced parameter: USE_GATEWAY

Description: Specifies whether to submit jobs through a Gateway cluster for this node. Valid values:

  • true: Jobs are submitted through the Gateway cluster.

  • false (default): Jobs are not submitted through the Gateway cluster and are submitted to the header node by default.

Note

If the cluster where this node resides is not associated with a Gateway cluster and you manually set this parameter to true, subsequent EMR job submissions will fail.
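
Because the two cluster types accept different parameters, a setting such as USE_GATEWAY is easy to apply to the wrong cluster type. The following Python snippet is a minimal sketch that checks a set of settings against only the parameters described in this topic. The cluster type keys and the function name are hypothetical, and other properties that the node may accept are not covered.

```python
# Parameter names per cluster type, taken from the descriptions in this topic.
# The dictionary keys are hypothetical labels, not DataWorks identifiers.
SUPPORTED_PARAMS = {
    "datalake_or_custom": {"FLOW_SKIP_SQL_ANALYZE", "DATAWORKS_SESSION_DISABLE"},
    "hadoop": {"FLOW_SKIP_SQL_ANALYZE", "USE_GATEWAY"},
}


def check_advanced_params(cluster_type: str, params: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    supported = SUPPORTED_PARAMS[cluster_type]
    for name, value in params.items():
        if name not in supported:
            problems.append(f"{name} is not described for {cluster_type} clusters")
        elif value not in ("true", "false"):
            problems.append(f"{name} must be 'true' or 'false', got {value!r}")
    return problems


if __name__ == "__main__":
    print(check_advanced_params("datalake_or_custom", {"USE_GATEWAY": "true"}))
```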

Run the SQL task

  1. Click the Advanced Run icon on the toolbar. In the Parameter dialog box, select the resource group for scheduling that you created, and click Run.

    Note
    • To access compute resources in a public network or VPC network environment, use a resource group for scheduling that has passed the connectivity test with the compute resource. For more information, see Network connectivity.

    • If you need to change the resource group for subsequent task runs, click the Advanced Run icon and select the resource group for scheduling that you want to use.

  2. Click the Save icon to save the SQL statements.

  3. (Optional) Perform smoke testing.

    If you want to perform smoke testing in the development environment, you can do so before or after submitting the node. For more information, see Smoke testing.

(Optional) View lineage information

To display table-level and column-level lineage for EMR Impala tasks in Data Map, you must first enable Impala lineage logs on the EMR cluster side. This feature is supported only for EMR DataLake clusters (both HMS and DLF metadata are supported). For more information, see Configure Impala lineage.

Note

This feature is currently in canary release (phased rollout). Before you use it, submit a ticket or contact Alibaba Cloud technical support to enable it.

Step 3: Configure scheduling properties

If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

Step 4: Deploy the task

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

More operations

After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see Manage scheduled tasks.

FAQ

  • Q: What do I do if the error "Impala JDBC Url is Empty" is reported?

    >>> [ERROR][LauncherFactory]: JobLauncher init Failed!
    java.lang.RuntimeException: Impala JDBC Url is Empty!
        at com.aliyun.emr.dataworks.dcc.launcher.type.ImpalaJobLauncher.initJdbcConnection

    A: Make sure that the Impala service has been added to the cluster. The Impala service is available only to existing users.

  • Q: What do I do if a connection timeout error occurs when the node runs?

    EMR execute task failed!
    FAILED: Build connection error! Could not open client transport with JDBC Uri: jdbc:hive2://<host>:21050/;auth=noSasl: java.net.ConnectException: Connection timed out (Connection timed out)

    A: Ensure that the resource group and the cluster can connect to each other over the network. Go to the computing resources list and click Resource Initialization. In the dialog box that appears, click Re-initialize. Verify that the initialization is successful.
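
    To narrow down a timeout like the one above, you can test whether the Impala port accepts TCP connections. The following Python snippet is a minimal sketch with a hypothetical function name. Port 21050 is taken from the JDBC URI in the error message. The result is meaningful only if the check runs from a machine in the same network environment as the resource group, and a successful TCP connection does not prove that authentication or the Impala service itself works.

```python
import socket


def can_reach(host: str, port: int = 21050, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection resolves the host and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False
```

    Call can_reach with the host shown in place of <host> in your own error message. If it returns False, check the network configuration of the resource group before you re-initialize the resource.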