EMR Presto node

Presto, also known as PrestoDB, is a flexible and scalable distributed SQL query engine that supports interactive analysis and queries of big data using standard SQL. DataWorks provides the EMR Presto node for developing and periodically scheduling Presto tasks. This topic describes the main workflow and key considerations for developing tasks using an EMR Presto node.

Prerequisites

  • You have created an Alibaba Cloud EMR cluster and registered it with DataWorks. For more information, see Bind an E-MapReduce (EMR) compute engine to a DataWorks workspace.

  • (Required only if you use a RAM user) Add the RAM user to the workspace and grant it the Development or Workspace Administrator role for task development. The Workspace Administrator role carries extensive permissions, so grant it with caution. For more information about how to add members to a workspace, see Add workspace members.

    If you are using an Alibaba Cloud account, you can skip this step.

Limitations

  • Only Hadoop-based data lake clusters (the Hadoop cluster type) are supported. Clusters of the DataLake type and custom clusters are not supported.

  • You can run this type of task only by using a Serverless resource group (recommended) or an exclusive resource group for scheduling.

  • Tasks that run on EMR Presto nodes do not generate data lineage.

Procedure

  1. On the EMR Presto node editor page, follow these development steps.

    Develop the SQL code

    In the SQL editor, develop the task code. You can define variables in the format of ${variable_name} in the code and assign values to them in the Scheduling Parameters section on the Scheduling Settings tab. This enables dynamic parameter passing for scheduled runs. For more information about scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example.

    select '${var}'; -- ${var} is replaced with the value assigned under Scheduling Parameters.

    select * from userinfo; -- Returns rows from the userinfo table.
    Note
    • An SQL statement cannot exceed 130 KB.

    • Queries run on an EMR Presto node can return a maximum of 10,000 rows, and the total data size cannot exceed 10 MB.
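
    As a hedged illustration of the points above, the following sketch filters on a scheduling parameter and caps the result set so that it stays within the 10,000-row limit. The columns user_id, user_name, and ds and the variable name bizdate are assumptions made for illustration and do not come from the original example. Assign bizdate a value in the Scheduling Parameters section before the node runs.

    ```sql
    -- Hypothetical columns: user_id, user_name, and a date partition column ds.
    -- ${bizdate} is an assumed variable name; assign it under Scheduling Parameters.
    SELECT user_id, user_name
    FROM userinfo
    WHERE ds = '${bizdate}'
    LIMIT 10000;  -- keep the result within the 10,000-row cap
    ```

    A LIMIT clause bounds only the row count. Selecting fewer or narrower columns is one way to also keep the result under the 10 MB size limit.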

    (Optional) Configure advanced parameters

    In the right-side pane on the Scheduling Settings tab, configure the parameters described in the following table under EMR Node Parameters > DataWorks parameters.

    Note

    More open-source Presto properties can be configured under EMR Node Parameters > Spark parameter in the right-side Scheduling Settings pane.

    EMR on ECS

    Parameter | Description

    DATAWORKS_SESSION_DISABLE

    This parameter applies only to test runs in the development environment. Valid values:

    • true: A new JDBC connection is created each time an SQL statement is run.

    • false (Default): The same JDBC connection is reused when different SQL statements are run in the same node.

    Note

    If this parameter is set to false, the YARN applicationId of the Presto job is not printed. To print the YARN applicationId, set this parameter to true.

    FLOW_SKIP_SQL_ANALYZE

    Controls how SQL statements are executed. Valid values:

    • true: Multiple SQL statements are executed at once.

    • false (Default): One SQL statement is executed at a time.

    Note

    This parameter applies only to test runs in the development environment.

    priority

    The job priority. Default value: 1.

    queue

    The YARN queue to which jobs are submitted. Default value: default. For more information about EMR YARN, see Basic queue configurations.
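
    To make DATAWORKS_SESSION_DISABLE and FLOW_SKIP_SQL_ANALYZE more concrete, the following sketch shows a node that contains two statements. It assumes that properties set with SET SESSION are scoped to the JDBC connection, as in the open-source Presto JDBC driver, so whether the second statement sees the setting presumably depends on DATAWORKS_SESSION_DISABLE. The property query_max_run_time is an open-source Presto session property, and the value 10m is an arbitrary choice for illustration.

    ```sql
    -- Statement 1: set a session property on the current connection.
    SET SESSION query_max_run_time = '10m';

    -- Statement 2: with DATAWORKS_SESSION_DISABLE = false (the default), this
    -- statement reuses the same JDBC connection, so the setting above should
    -- still apply. With true, it runs on a new connection, where the setting
    -- is assumed to be lost.
    SELECT count(*) FROM userinfo;
    ```

    With FLOW_SKIP_SQL_ANALYZE left at false, a test run executes these two statements one at a time. With true, they are executed at once.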

    Run the SQL task

    1. In the Run Configuration section, on the Compute Resource tab, select a Compute Resource and a DataWorks Resource Group.

      Note
      • You can also adjust the CUs for Scheduling value based on the task's resource requirements. The default is 0.25.

      • To access a data source over a public network or in a virtual private cloud (VPC), you must use a scheduling resource group that has passed the network connectivity test for the data source. For more information, see Network connectivity solutions.

    2. In the parameter dialog box in the toolbar, select the target data source and click Run.

  2. If you need to run the node task periodically, configure its scheduling properties based on your business requirements. For more information, see Configure scheduling for a node.

  3. After you configure the node task, you must publish the node. For more information, see Publish nodes and workflows.

  4. After you publish the task, you can view the status of its periodic runs in Operation Center. For more information, see Get started with Operation Center.

FAQ

  • Q: Why does the "Error executing query" message appear?

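    A: This error typically indicates an unsupported cluster type. Make sure that the cluster registered with DataWorks is a Hadoop-based data lake cluster, as described in Limitations.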

  • Q: Why does a connection timeout occur when the node runs?

    A: Make sure that the DataWorks resource group can reach the cluster over the network. Then go to the compute resource list page and initialize the resource: in the dialog box that appears, click Re-initialize and confirm that the initialization succeeds.
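
    After re-initialization succeeds, one simple way to confirm connectivity is to run a query that reads no tables. This is a suggested check and not a step from the procedure above.

    ```sql
    -- Minimal connectivity check: returns a single row without touching any table.
    SELECT 1;
    ```

    If this statement also times out, the cause is more likely the network path between the resource group and the cluster than the query itself.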
