Create an EMR Trino node

更新时间:
复制 MD 格式

Trino is an open-source distributed SQL query engine for interactive analytics across multiple data sources. You can create an E-MapReduce (EMR) Trino node to perform large-scale multidimensional data aggregation and report analysis. For more information, see Trino.

Prerequisites

  • An Alibaba Cloud EMR cluster is created and registered with your DataWorks workspace.

    To create EMR-related nodes and develop EMR tasks, you must register the EMR cluster with your DataWorks workspace. For more information, see Bind EMR compute resources in legacy DataStudio.

  • (Optional, for RAM users) The RAM user for task development has been added to the workspace and granted the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions and must be assigned with caution. For more information, see Add members to a workspace.

  • A serverless resource group is purchased and configured. The configuration includes binding the resource group to a workspace and setting up the network. For more information, see Use a serverless resource group.

  • A workflow is created in DataStudio.

    In DataStudio, development operations for different engines are based on workflows. Therefore, you must create a workflow before you create a node. For more information, see Create a workflow.

Limits

  • This type of task can run only on a serverless resource group.

  • To manage metadata for DataLake or custom clusters in DataWorks, you must configure EMR-HOOK on the cluster. If EMR-HOOK is not configured, you cannot view metadata in real time, generate audit logs, display data lineage, or perform EMR-related governance tasks in DataWorks. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Hive.

  • If Lightweight Directory Access Protocol (LDAP) authentication is enabled for Trino, you must log on to the master node of the EMR cluster and download the keystore file from the /etc/taihao-apps/trino-conf directory. Log on to the DataWorks console. In the left-side navigation pane, click More > Management Center. Select the target workspace from the drop-down list and click Go to Management Center. In the left-side navigation pane, click Cluster Management. Find the registered EMR cluster. On the Account Mappings tab, click Edit Account Mappings, and then click Upload Keystore File to upload the file.

Step 1: Create an EMR Trino node

  1. Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

  2. Right-click the target workflow and choose Create Node > EMR > EMR Trino.

  3. In the Create Node dialog box, enter a Name and select an Compute Engine Instance, a Node Type, and a Path. Click Confirm to go to the EMR Trino node editor page.

    Note

    A node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).

Step 2: Develop an EMR Trino task

Double-click the created node to go to the task development page and perform the following operations.

(Optional) Select an EMR cluster instance

If multiple EMR clusters are registered with your workspace, select the appropriate cluster from the drop-down list at the top of the node editor page. If only one EMR cluster is registered, DataWorks selects it by default.

To access domains that have an IP address whitelist, you must use a dedicated scheduling resource group.

Connector configuration

  • Before you query MySQL tables, you must configure the built-in EMR Trino connector. For more information, see MySQL connector.

  • Before you query Hive tables, you must configure the built-in EMR Trino connector. For more information, see Hive connector.

  • To query tables in other data sources, see Configure connectors for instructions on configuring the built-in connectors.

Edit SQL code

Enter the task code in the SQL editor. The following code provides an example.

-- Usage
-- SELECT * FROM <catalog>.<schema>.<table>;
-- Parameter description
-- <catalog>: the name of the data source to connect to.
-- <schema>: the name of the database to use.
-- <table>: the table to query.
-- Example: To view data in the hive_table table in the default database of a Hive data source:
-- Query a Hive table
SELECT * FROM hive.default.hive_table;
-- Example: To view data in the rt_user table in the user-created rt_data database of a MySQL data source:
-- Query a MySQL table  
SELECT * FROM mysql.rt_data.rt_user;
-- Join a Hive table and a MySQL table
SELECT DISTINCT a.id, a.name,b.rt_name FROM hive.default.hive_table a INNER JOIN mysql.rt_data.rt_user b ON a.id = b.id;
-- Query a Hive table using a scheduling parameter
SELECT * FROM hive.default.${table_name};
Note

Run the SQL task

  1. In the toolbar, click the 高级运行 icon. In the Parameter dialog box, select a scheduling resource group and click Running.

    Note
    • To access compute resources over the public internet or in a VPC, you must use a scheduling resource group that can connect to the compute resources. For more information, see Network connectivity solutions.

    • If you need to change the resource group for subsequent task runs, click the Running with Parameters 高级运行 icon and select the desired resource group.

    • Queries on EMR Trino nodes return a maximum of 10,000 records, and the total data size cannot exceed 10 MB.

  2. Click the 保存 icon to save the SQL code.

(Optional) Configure advanced parameters

If you need to adjust how the SQL statements are executed, click Advanced Settings in the right-side navigation pane.

Parameter

Description

FLOW_SKIP_SQL_ANALYZE

Specifies how SQL statements are executed. Valid values:

  • true: Runs multiple SQL statements at a time.

  • false (default): Runs one SQL statement at a time.

DATAWORKS_SESSION_DISABLE

This parameter applies to test runs in the development environment. Valid values:

  • true: Creates a new JDBC connection for each SQL statement run.

  • false (default): Reuses the same JDBC connection when a user runs different SQL statements in a node.

Step 3: Configure task scheduling

If you need to run the task on a schedule, click Scheduling in the right-side navigation pane of the node editor. Configure the scheduling information for the node based on your requirements. For more information, see Configure scheduling properties for a task.

Note

You must set the Rerun attribute and Parent Nodes properties for the node before you can submit it.

Step 4: Commit and deploy the task

After you configure the node task, you must commit and deploy it. After the task is deployed, it runs periodically based on the scheduling configuration.

  1. Click the 保存 icon in the toolbar to save the node.

  2. Click the 提交 icon in the toolbar to commit the node task.

    In the Submission dialog box, enter a Change Description and select whether to perform a code review after the node is committed.

    Note
    • You must set the Rerun attribute property and specify the Parent Nodes for the node before you can commit it.

    • Code review helps control code quality and prevents errors caused by deploying unreviewed code to the production environment. If you enable code review, reviewers must approve the committed code before it can be deployed. For more information, see Code review.

If your workspace runs in standard mode, you must click Deploy in the upper-right corner of the node editor page after you commit the task. This deploys the task to the production environment. For more information, see Deploy tasks.

Next steps

After the task is committed and deployed, it runs periodically based on its scheduling configuration. You can click O&M in the upper-right corner of the node editor page to view the scheduling and operational status of the periodic task. For more information, see Manage periodic tasks.

FAQ

  • Q: A connection timeout occurs when I run the node. Why?

    EMR execute task failed!
    SQL: {"name":"dw20251018","type":"TRINO_SQL","launcher":{"allocationSpec":{}},"properties":{"envs":{"FLOW_SKIP_SQL_ANALYZE":false},"arguments":["select * from default.dim_customers;"],"tags":[]},"description":"DataWorks"}
    TASK-MESSAGE:
    FAILED: Error executing query

    A: This error can occur if the resource group and the cluster do not have network connectivity. To resolve this issue, go to the computing resource list page, find the resource, and click Resource Initialization. In the dialog box that appears, click Re-initialize and ensure that the resource is successfully initialized.

    The dialog box displays the following message at the top: If this is the first time you bind a cluster or if the cluster service configuration (for example, the hive-site file) has changed, initialize the resource group. Otherwise, tasks may fail to run.