Create an EMR Presto node

Presto, also known as PrestoDB, is a flexible and scalable distributed SQL query engine that supports fast, interactive analytics over large datasets using standard SQL. DataWorks provides E-MapReduce (EMR) Presto nodes that allow you to develop and periodically schedule Presto tasks. This topic describes the main workflow and key considerations for developing tasks with an EMR Presto node.

Prerequisites

  • An Alibaba Cloud EMR cluster is created and registered to DataWorks. For more information, see Data Studio (legacy): Associate an EMR compute resource.

  • (Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add members to a workspace.

  • A serverless resource group is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use a serverless resource group.

  • A workflow is created in DataStudio.

    Development operations in different types of compute engines are performed based on workflows in DataStudio. Therefore, before you create a node, you must create a workflow. For more information, see Create a workflow.

Limitations

  • Only legacy Hadoop data lake clusters are supported. DataLake clusters and custom clusters are not supported.

  • You can run this type of task only by using a serverless resource group (recommended) or an exclusive resource group for scheduling.

  • When you develop a Presto task, the SQL code cannot exceed 130 KB.

  • When you use an EMR Presto node to query data, the query result can contain a maximum of 10,000 records, and the total data size cannot exceed 10 MB.

  • EMR Presto nodes do not support data lineage.
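As a rough local pre-check before you paste a long script into the editor, the size limits above can be sketched in Python. The helper names are hypothetical and are not part of DataWorks or EMR. The sketch assumes that 1 KB is 1,024 bytes, that 1 MB is 1,048,576 bytes, and that size is measured on the UTF-8 encoded text; the limits above do not state how size is measured.

```python
# Hypothetical local pre-check; these helpers are not a DataWorks or EMR API.
MAX_SQL_BYTES = 130 * 1024            # assumption: 1 KB = 1,024 bytes
MAX_RESULT_ROWS = 10_000              # maximum records in a query result
MAX_RESULT_BYTES = 10 * 1024 * 1024   # assumption: 1 MB = 1,048,576 bytes


def sql_size_ok(sql: str) -> bool:
    """Return True if the UTF-8 encoded SQL text fits within the 130 KB limit."""
    return len(sql.encode("utf-8")) <= MAX_SQL_BYTES


def result_within_limits(row_count: int, total_bytes: int) -> bool:
    """Return True if a result set fits within both the row and size limits."""
    return row_count <= MAX_RESULT_ROWS and total_bytes <= MAX_RESULT_BYTES
```

A script that fails `sql_size_ok` likely needs to be split across several nodes or shortened before it is developed as a Presto task.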

Step 1: Create an EMR Presto node

  1. Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

  2. Create an EMR Presto node.

    1. Right-click the target workflow and choose Create Node > EMR > EMR Presto.

      Note

      Alternatively, you can hover over Create and choose Create Node > EMR > EMR Presto.

    2. In the Create Node dialog box, enter a Name, and select an engine instance, a Node Type, and a Path. Click Confirm to go to the EMR Presto node editing page.

      Note

      Node names can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
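The naming rule in the note above can be approximated with a regular expression, for example when you generate node names in a script. This is a minimal sketch: the function name is hypothetical, Chinese characters are approximated by the CJK Unified Ideographs range U+4E00 to U+9FFF, and no length limit is enforced because the rule above does not state one.

```python
import re

# Assumption: "Chinese characters" is approximated by U+4E00..U+9FFF.
NODE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_.\u4e00-\u9fff]+")


def is_valid_node_name(name: str) -> bool:
    """Return True if the name uses only letters, Chinese characters,
    digits, underscores (_), and periods (.)."""
    return NODE_NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_node_name("emr_presto.daily_01")` passes, while a name that contains a hyphen or a space does not.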

Step 2: Develop an EMR Presto task

In the EMR Presto node editor, double-click the node to open the task development page and perform the following operations.

Develop SQL code

In the SQL editor, write the task code. You can define variables in the code in the ${variable_name} format and assign values to them under Scheduling Configuration > Scheduling Parameter in the right-side navigation pane. The values are then passed to the code dynamically when the task is scheduled. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters. The following sample code is provided for reference:

select '${var}'; -- The variable can be used with scheduling parameters.

select * from userinfo;

Note
  • The SQL code for a task cannot exceed 130 KB.

  • If multiple EMR computing resources are bound to the DataStudio workspace, you must select the appropriate one based on your business needs. If only one EMR computing resource is bound, no selection is required.

  • If you need to change the parameter assignments in your code, click Run With Parameters in the toolbar. For more information about the parameter assignment logic, see Parameter assignment logic for Run, Run With Parameters, and smoke testing.
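To see what a statement looks like once ${variable_name} placeholders are filled in, you can mimic the replacement locally. The following Python sketch is only an illustration of plain textual substitution: DataWorks performs the real substitution when the task runs, and the function name and the sample value 20240101 are hypothetical.

```python
import re

# Matches placeholders written as ${variable_name}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def preview_sql(sql, params):
    """Return the SQL text with each ${name} replaced by params[name].

    Raises KeyError if a placeholder has no assigned value, so that a
    missing assignment is noticed before the task is run.
    """
    def replace(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value assigned to variable: {name}")
        return params[name]

    return PLACEHOLDER.sub(replace, sql)


# Hypothetical value; in DataWorks the value comes from a scheduling parameter.
print(preview_sql("select '${var}';", {"var": "20240101"}))
# prints: select '20240101';
```

Failing on an unassigned variable, rather than leaving the placeholder in place, is a design choice in this sketch and does not describe how DataWorks behaves.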

(Optional) Configure advanced parameters

You can configure specific properties in the advanced settings section of the node. For more information about property settings, see Spark Configuration. The available advanced parameters vary by EMR cluster type, as shown in the following table.

Hadoop: EMR on ECS

| Parameter | Description |
| --- | --- |
| FLOW_SKIP_SQL_ANALYZE | Specifies the SQL statement execution method. Valid values: true (executes multiple SQL statements at a time) and false (default; executes one SQL statement at a time). Note: This parameter applies only to test runs in the development environment. |
| USE_GATEWAY | Specifies whether to submit the job for this node through a gateway cluster. Valid values: true (submits the job through a gateway cluster) and false (default; does not submit the job through a gateway cluster, and the job is submitted to the header node). Note: If the node's cluster is not associated with a gateway cluster and you set this parameter to true, the EMR job submission fails. |
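The rules for these two parameters can be written down as a small check. The sketch below is illustrative only: representing the settings as a Python dictionary of string values is an assumption, not the format of the advanced settings section, and the function name is hypothetical.

```python
# Both parameters default to "false" according to the descriptions above.
KNOWN_PARAMS = ("FLOW_SKIP_SQL_ANALYZE", "USE_GATEWAY")


def check_advanced_params(params, has_gateway_cluster):
    """Return a list of problems found in the advanced parameter settings.

    params: dict of parameter name to "true" or "false" (assumed format).
    has_gateway_cluster: whether the node's cluster is associated with a
    gateway cluster.
    """
    problems = []
    for key in KNOWN_PARAMS:
        value = params.get(key, "false")
        if value not in ("true", "false"):
            problems.append(f"{key} must be true or false, got {value!r}")
    if params.get("USE_GATEWAY", "false") == "true" and not has_gateway_cluster:
        problems.append(
            "USE_GATEWAY is true but no gateway cluster is associated; "
            "job submission will fail"
        )
    return problems
```

An empty list means that the settings are consistent with the descriptions above.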

Run the SQL task

  1. In the toolbar, click the Run With Parameters icon. In the Parameters dialog box, select the scheduling resource group that you created and click Run.

    Note
    • To access a computing resource over the public internet or in a VPC, you must use a scheduling resource group that has passed a connectivity test with the computing resource. For more information, see Network connectivity solutions.

    • If you need to change the resource group for subsequent task runs, click the Run With Parameters icon again and select a different scheduling resource group.

  2. Click the Save icon to save the SQL statements.

  3. (Optional) Perform smoke testing.

    If you want to perform smoke testing in the development environment, you can run it when you submit the node or after the node is submitted. For more information, see Perform smoke testing.

Step 3: Configure scheduling properties

If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

Step 4: Deploy the task

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

More operations

After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see Manage auto triggered tasks.

FAQ

  • Q: Why does the "Error executing query" message appear?

    A: Make sure that the cluster is a legacy Hadoop data lake cluster. As stated in Limitations, DataLake clusters and custom clusters are not supported.

  • Q: Why does a connection timeout occur when the node runs?

    A: This error can occur if the resource group and the cluster do not have network connectivity. To resolve this issue, go to the computing resource list page, find the resource, and click Resource Initialization. In the dialog box that appears, click Re-initialize and ensure that the resource is successfully initialized.
