Trino is an open-source distributed SQL query engine for interactive analytics across multiple data sources. You can create an E-MapReduce (EMR) Trino node to perform large-scale multidimensional data aggregation and report analysis. For more information, see Trino.
Prerequisites
-
An Alibaba Cloud EMR cluster is created and registered with your DataWorks workspace.
To create EMR-related nodes and develop EMR tasks, you must register the EMR cluster with your DataWorks workspace. For more information, see Bind EMR compute resources in legacy DataStudio.
-
(Optional, for RAM users) The RAM user for task development has been added to the workspace and granted the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions and must be assigned with caution. For more information, see Add members to a workspace.
-
A serverless resource group is purchased and configured. The configuration includes binding the resource group to a workspace and setting up the network. For more information, see Use a serverless resource group.
-
A workflow is created in DataStudio.
In DataStudio, development operations for different engines are based on workflows. Therefore, you must create a workflow before you create a node. For more information, see Create a workflow.
Limits
-
This type of task can run only on a serverless resource group.
-
To manage metadata for DataLake or custom clusters in DataWorks, you must configure EMR-HOOK on the cluster. If EMR-HOOK is not configured, you cannot view metadata in real time, generate audit logs, display data lineage, or perform EMR-related governance tasks in DataWorks. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Hive.
-
If Lightweight Directory Access Protocol (LDAP) authentication is enabled for Trino, you must log on to the master node of the EMR cluster and download the keystore file from the
/etc/taihao-apps/trino-confdirectory. Log on to the DataWorks console. In the left-side navigation pane, click . Select the target workspace from the drop-down list and click Go to Management Center. In the left-side navigation pane, click Cluster Management. Find the registered EMR cluster. On the Account Mappings tab, click Edit Account Mappings, and then click Upload Keystore File to upload the file.
Step 1: Create an EMR Trino node
Log on to the DataWorks console. In the target region, click in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
-
Right-click the target workflow and choose .
-
In the Create Node dialog box, enter a Name and select an Compute Engine Instance, a Node Type, and a Path. Click Confirm to go to the EMR Trino node editor page.
NoteA node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
Step 2: Develop an EMR Trino task
Double-click the created node to go to the task development page and perform the following operations.
(Optional) Select an EMR cluster instance
If multiple EMR clusters are registered with your workspace, select the appropriate cluster from the drop-down list at the top of the node editor page. If only one EMR cluster is registered, DataWorks selects it by default.
To access domains that have an IP address whitelist, you must use a dedicated scheduling resource group.
Connector configuration
-
Before you query MySQL tables, you must configure the built-in EMR Trino connector. For more information, see MySQL connector.
-
Before you query Hive tables, you must configure the built-in EMR Trino connector. For more information, see Hive connector.
-
To query tables in other data sources, see Configure connectors for instructions on configuring the built-in connectors.
Edit SQL code
Enter the task code in the SQL editor. The following code provides an example.
-- Usage
-- SELECT * FROM <catalog>.<schema>.<table>;
-- Parameter description
-- <catalog>: the name of the data source to connect to.
-- <schema>: the name of the database to use.
-- <table>: the table to query.
-- Example: To view data in the hive_table table in the default database of a Hive data source:
-- Query a Hive table
SELECT * FROM hive.default.hive_table;
-- Example: To view data in the rt_user table in the user-created rt_data database of a MySQL data source:
-- Query a MySQL table
SELECT * FROM mysql.rt_data.rt_user;
-- Join a Hive table and a MySQL table
SELECT DISTINCT a.id, a.name,b.rt_name FROM hive.default.hive_table a INNER JOIN mysql.rt_data.rt_user b ON a.id = b.id;
-- Query a Hive table using a scheduling parameter
SELECT * FROM hive.default.${table_name};
-
DataWorks provides Scheduling Parameter to dynamically pass values to code in periodic scheduling scenarios. You can define variables in the ${variable_name} format in the code of a node task and assign values to the variables on the Scheduling Settings tab in the right-side navigation pane. In the Scheduling Parameter section, assign a value to the variable. For more information about the supported formats and configurations of scheduling parameters, see Supported formats of scheduling parameters and Configure and use scheduling parameters.
-
If you need to change the values of parameters in the code, click Advanced Run in the toolbar. For more information about how parameter values are assigned, see What is the difference in the value assignment logic among Run, Advanced Run, and smoke testing in the development environment?
Run the SQL task
-
In the toolbar, click the
icon. In the Parameter dialog box, select a scheduling resource group and click Running.Note-
To access compute resources over the public internet or in a VPC, you must use a scheduling resource group that can connect to the compute resources. For more information, see Network connectivity solutions.
-
If you need to change the resource group for subsequent task runs, click the Running with Parameters
icon and select the desired resource group. -
Queries on EMR Trino nodes return a maximum of 10,000 records, and the total data size cannot exceed 10 MB.
-
-
Click the
icon to save the SQL code.
(Optional) Configure advanced parameters
If you need to adjust how the SQL statements are executed, click Advanced Settings in the right-side navigation pane.
|
Parameter |
Description |
|
FLOW_SKIP_SQL_ANALYZE |
Specifies how SQL statements are executed. Valid values:
|
|
DATAWORKS_SESSION_DISABLE |
This parameter applies to test runs in the development environment. Valid values:
|
Step 3: Configure task scheduling
If you need to run the task on a schedule, click Scheduling in the right-side navigation pane of the node editor. Configure the scheduling information for the node based on your requirements. For more information, see Configure scheduling properties for a task.
You must set the Rerun attribute and Parent Nodes properties for the node before you can submit it.
Step 4: Commit and deploy the task
After you configure the node task, you must commit and deploy it. After the task is deployed, it runs periodically based on the scheduling configuration.
-
Click the
icon in the toolbar to save the node. -
Click the
icon in the toolbar to commit the node task.In the Submission dialog box, enter a Change Description and select whether to perform a code review after the node is committed.
Note-
You must set the Rerun attribute property and specify the Parent Nodes for the node before you can commit it.
-
Code review helps control code quality and prevents errors caused by deploying unreviewed code to the production environment. If you enable code review, reviewers must approve the committed code before it can be deployed. For more information, see Code review.
-
If your workspace runs in standard mode, you must click Deploy in the upper-right corner of the node editor page after you commit the task. This deploys the task to the production environment. For more information, see Deploy tasks.
Next steps
After the task is committed and deployed, it runs periodically based on its scheduling configuration. You can click O&M in the upper-right corner of the node editor page to view the scheduling and operational status of the periodic task. For more information, see Manage periodic tasks.
FAQ
-
Q: A connection timeout occurs when I run the node. Why?
EMR execute task failed! SQL: {"name":"dw20251018","type":"TRINO_SQL","launcher":{"allocationSpec":{}},"properties":{"envs":{"FLOW_SKIP_SQL_ANALYZE":false},"arguments":["select * from default.dim_customers;"],"tags":[]},"description":"DataWorks"} TASK-MESSAGE: FAILED: Error executing queryA: This error can occur if the resource group and the cluster do not have network connectivity. To resolve this issue, go to the computing resource list page, find the resource, and click Resource Initialization. In the dialog box that appears, click Re-initialize and ensure that the resource is successfully initialized.
The dialog box displays the following message at the top: If this is the first time you bind a cluster or if the cluster service configuration (for example, the hive-site file) has changed, initialize the resource group. Otherwise, tasks may fail to run.