Apache Kyuubi is a distributed, multi-tenant gateway that provides SQL and other query services for data lake query engines such as Spark, Flink, and Trino. In DataWorks, you can use EMR Kyuubi nodes to develop, periodically schedule, and integrate Kyuubi tasks with other jobs. This topic describes how to create and configure an EMR Kyuubi node.
Prerequisites
An Alibaba Cloud EMR cluster is created and registered to DataWorks. For more information, see DataStudio (legacy): Register an EMR compute resource.
(Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add members to a workspace.
A serverless resource group is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use a serverless resource group.
A workflow is created in DataStudio.
Development operations in different types of compute engines are performed based on workflows in DataStudio. Therefore, before you create a node, you must create a workflow. For more information, see Create a workflow.
Limitations
This type of task can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling.
Step 1: Create an EMR Kyuubi node
- Go to the DataStudio page.
  Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
- Create an EMR Kyuubi node.
  - Right-click the target workflow and choose .
    Note: Alternatively, you can hover over the Create icon and choose .
  - In the Create Node dialog box, enter a Name and select an Engine Instance, a Node Type, and a Path. Click OK to open the EMR Kyuubi node editor.
    Note: The node name can contain uppercase and lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
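The node-name rule can be checked before you open the Create Node dialog box. The following Python sketch is an illustration only: the function name is_valid_node_name is hypothetical, and the range \u4e00-\u9fff (the basic CJK Unified Ideographs block) is an assumed approximation of "Chinese characters". DataWorks may accept a wider or narrower set, and any length limit is not covered here.

```python
import re

# Allowed characters per the note: letters, digits, underscores, periods,
# and Chinese characters (approximated here by the basic CJK block; assumption).
NODE_NAME = re.compile(r"[A-Za-z0-9_.\u4e00-\u9fff]+")


def is_valid_node_name(name: str) -> bool:
    """Return True if every character of a non-empty name is in the allowed set."""
    return NODE_NAME.fullmatch(name) is not None


print(is_valid_node_name("kyuubi_task.v1"))
print(is_valid_node_name("bad name"))
```

A name that contains a space or a hyphen, such as "bad name", is rejected by this check.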
Step 2: Develop an EMR Kyuubi task
In the EMR Kyuubi node editor, double-click the created node to open the task development page.
Develop SQL code
In the SQL editor, develop the task code. You can define variables in the code in the ${variable_name} format and then assign values to them in the right-side navigation pane under Scheduling > Scheduling Parameter. This lets you pass parameters dynamically during scheduled runs. For more information, see Supported formats of scheduling parameters. The following code provides an example:
show tables;
select * from kyuubi040702 where age >= '${a}'; -- You can use scheduling parameters.
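To make the substitution concrete, the following Python sketch mimics how a ${variable_name} placeholder could be replaced with an assigned value before a statement runs. It is an illustration only: DataWorks performs the actual substitution at scheduling time, and the function name render_sql, the placeholder pattern, and the sample value 18 are assumptions, not part of the product.

```python
import re

# Matches placeholders written as ${name}. Restricting names to word
# characters is an assumption made for this sketch.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_sql(sql: str, params: dict[str, str]) -> str:
    """Replace each ${name} in sql with params[name]; fail on a missing value."""

    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value assigned to scheduling parameter {name!r}")
        return params[name]

    return PLACEHOLDER.sub(lookup, sql)


sql = "select * from kyuubi040702 where age >= '${a}';"
print(render_sql(sql, {"a": "18"}))
```

Raising an error for an unassigned variable, instead of leaving the placeholder in the statement, makes a missing scheduling parameter visible before the query is sent.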
- The SQL statement cannot exceed 130 KB in size.
- If your workspace is associated with multiple EMR computing resources in DataStudio, you must select the appropriate engine. If only one EMR computing resource is bound, you do not need to select an engine.
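The 130 KB limit can be checked locally before you paste a long statement into the editor. The following Python sketch assumes that the limit applies to the UTF-8 encoded size of the statement and that 1 KB equals 1,024 bytes. Neither detail is stated in this topic, and the function name sql_size_ok is hypothetical.

```python
# Assumption: the limit is measured in bytes of UTF-8 text, with 1 KB = 1024 bytes.
MAX_SQL_BYTES = 130 * 1024


def sql_size_ok(sql: str, limit: int = MAX_SQL_BYTES) -> bool:
    """Return True if the UTF-8 encoded statement fits within the limit."""
    return len(sql.encode("utf-8")) <= limit


print(sql_size_ok("show tables;"))
```

Measuring encoded bytes rather than characters matters when the statement contains non-ASCII text, such as Chinese comments, because those characters occupy more than one byte each.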
(Optional) Configure advanced parameters
You can configure advanced properties for the node in the Advanced Settings section. For more information about property settings, see Spark Configuration.
| Parameter | Description |
| --- | --- |
| queue | The scheduling queue where jobs are submitted. The default value is the … Note: If you configure a workspace-level YARN resource queue when you register an EMR cluster with a DataWorks workspace, DataWorks uses the following rules to select a scheduling queue for a Kyuubi task at runtime: … For more information about EMR YARN, see Basic queue configurations. For more information about how to configure a queue when you register an EMR cluster, see Configure a global YARN resource queue. |
| priority | The priority. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | Controls how SQL statements are executed. Valid values: … Note: This parameter applies only to test runs in the development environment. |
| DATAWORKS_SESSION_DISABLE | Applies to direct test runs in the development environment. Valid values: … Note: If you set this parameter to … |
Run the SQL task
- In the toolbar, click the run icon. In the Parameter dialog box, select the created scheduling resource group and click Running.
  Note:
  - To access a data source over the Internet or a VPC, you must use a scheduling resource group with confirmed network connectivity to the data source. For more information, see Network connectivity solutions.
  - If you need to change the resource group for a subsequent task run, click the Run with Custom Parameters icon and select the resource group that you want to use.
- Click the save icon to save the SQL code.
- (Optional) Perform smoke testing.
  If you want to perform smoke testing in the development environment, you can do so before or after you commit the node. For more information, see Perform smoke testing.
Step 3: Configure scheduling properties
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
Step 4: Deploy the task
After you configure a task on a node, you must commit and deploy the task. The system then runs the task periodically based on the scheduling configurations.
- Click the save icon in the top toolbar to save the task.
- Click the commit icon in the top toolbar to commit the task. In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.
  Note:
  - You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
  - You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploying tasks.
More operations
After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see Manage auto triggered tasks.
FAQ
Q: The node fails with a connection timeout error. What should I do?
A: This error can occur if the resource group and the cluster do not have network connectivity. To resolve this issue, go to the computing resource list page, find the resource, and click Resource Initialization. In the dialog box that appears, click Re-initialize and ensure that the resource is successfully initialized.
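Before or after you re-initialize the resource, a plain TCP probe can help confirm whether the Kyuubi endpoint is reachable at all. The following Python sketch is an illustration under stated assumptions: the function name can_connect is hypothetical, and the probe is meaningful only when it runs from a machine in the same network as the resource group, because a successful probe from elsewhere says nothing about the resource group's connectivity.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, can_connect("kyuubi.example.internal", 10009) probes a hypothetical host on port 10009, the default Thrift binary frontend port of Apache Kyuubi. Your cluster may use a different host and port. A result of False points to a network or address problem rather than a problem in the SQL code.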



