Presto, also known as PrestoDB, is a flexible and scalable distributed SQL query engine that supports interactive analysis and queries of big data using standard SQL. DataWorks provides the EMR Presto node for developing and periodically scheduling Presto tasks. This topic describes the main workflow and key considerations for developing tasks using an EMR Presto node.
Prerequisites
-
You have created an Alibaba Cloud EMR cluster and registered it with DataWorks. For more information, see Bind an E-MapReduce (EMR) compute engine to a DataWorks workspace.
-
(Optional. Required for a RAM user.) For task development, a RAM user must be added to the workspace and granted the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions. Grant this role with caution. For more information about how to add members to a workspace, see Add workspace members.
If you are using an Alibaba Cloud account, you can skip this step.
Limitations
-
Only Hadoop-based data lake clusters are supported. Clusters of the DataLake type and custom clusters are not supported.
-
You can run this type of task only on a Serverless resource group (recommended) or an exclusive resource group for scheduling.
-
Data lineage: Tasks run on EMR Presto nodes do not generate data lineage.
Procedure
-
On the EMR Presto node editor page, follow these development steps.
Develop the SQL code
In the SQL editor, develop the task code. You can define variables in the format of ${variable_name} in the code and assign values to them in the Scheduling Parameters section on the Scheduling Settings tab. This enables dynamic parameter passing for scheduled runs. For more information about scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example.
select '${var}'; -- Can be used with scheduling parameters.
select * from userinfo;
Note:
-
An SQL statement cannot exceed 130 KB.
-
Queries run on an EMR Presto node can return a maximum of 10,000 rows, and the total data size cannot exceed 10 MB.
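The placeholder substitution and the statement size limit described above can be checked locally before you paste code into the node editor. The following Python sketch is illustrative only: it imitates the ${variable_name} replacement with a regular expression and is not the DataWorks implementation. The helper names render_sql and check_statement_sizes, the sample value 20240101, and the reading of 130 KB as 130 * 1024 UTF-8 bytes are all assumptions.

```python
import re

# Assumption: 130 KB is interpreted as 130 * 1024 bytes of UTF-8 encoded text.
MAX_SQL_BYTES = 130 * 1024

# Matches placeholders of the form ${variable_name}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_sql(template: str, params: dict) -> str:
    """Replace each ${name} placeholder with its value from params.

    Raises KeyError if a placeholder has no assigned value.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value assigned for ${{{name}}}")
        return str(params[name])

    return PLACEHOLDER.sub(substitute, template)


def check_statement_sizes(sql: str) -> list:
    """Return the statements in sql that are larger than MAX_SQL_BYTES.

    Splitting on ';' is naive: it ignores semicolons inside string literals.
    """
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return [s for s in statements if len(s.encode("utf-8")) > MAX_SQL_BYTES]


template = "select '${var}';\nselect * from userinfo;"
rendered = render_sql(template, {"var": "20240101"})  # hypothetical value
print(rendered)
print(check_statement_sizes(rendered))  # an empty list means no statement is too large
```

In a scheduled run, the value of each variable comes from the Scheduling Parameters section rather than from a dictionary. The sketch only helps you preview the rendered SQL and catch oversized statements early.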
(Optional) Configure advanced parameters
On the Scheduling Settings tab in the right-side pane, configure the parameters described in the following table.
Note: You can also configure additional open-source Presto properties in the right-side Scheduling Settings pane.
EMR on ECS
Parameter
Description
DATAWORKS_SESSION_DISABLE
This parameter applies only to test runs in the development environment. Valid values:
-
true: A new JDBC connection is created each time an SQL statement is run.
-
false (default): The same JDBC connection is reused when different SQL statements are run in the same node.
Note: If this parameter is set to false, the Presto yarn applicationId is not printed. To print the yarn applicationId, set this parameter to true.
FLOW_SKIP_SQL_ANALYZE
Controls how SQL statements are executed. Valid values:
-
true: Multiple SQL statements are executed at once.
-
false (default): One SQL statement is executed at a time.
Note: This parameter applies only to test runs in the development environment.
priority
The job priority. Default value: 1.
queue
The YARN queue to which jobs are submitted. Default value: default. For more information about EMR YARN, see Basic queue configurations.
Run the SQL task
-
In the Run Configuration section, on the Compute Resource tab, select a Compute Resource and a DataWorks Resource Group.
Note:
-
You can also adjust the CUs for Scheduling value based on the task's resource requirements. The default is 0.25.
-
To access a data source over a public network or in a virtual private cloud (VPC), you must use a scheduling resource group that has passed the network connectivity test for the data source. For more information, see Network connectivity solutions.
-
In the parameter dialog box in the toolbar, select the target data source and click Run.
-
If you need to run the node task periodically, configure its scheduling properties based on your business requirements. For more information, see Configure scheduling for a node.
-
After you configure the node task, you must publish the node. For more information, see Publish nodes and workflows.
-
After you publish the task, you can view the status of its periodic runs in Operation Center. For more information, see Get started with Operation Center.
FAQ
-
Q: Why does the "Error executing query" message appear?

A: Ensure the cluster is a Hadoop-based data lake cluster.
-
Q: Why does a connection timeout occur when the node runs?
A: Ensure that network connectivity exists between the DataWorks resource group and the cluster. On the compute resource list page, initialize the resource again: in the dialog box that appears, click Re-initialize and confirm that the initialization succeeds.
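When you troubleshoot a timeout, a plain TCP probe can help separate network problems from Presto problems. The following Python sketch is a generic check, not a DataWorks feature. The helper name can_connect is hypothetical, and the sketch assumes that you run it from a host on the same network path as the resource group. A successful result from any other machine says nothing about the resource group itself.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Call can_connect with the address and port of your cluster's Presto endpoint. If it returns False, check the VPC, security group, and whitelist settings before you re-initialize the resource.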

