The Serverless Kyuubi node in DataWorks lets you develop and periodically schedule Kyuubi tasks that run on an EMR Serverless Spark compute resource. You can also integrate these tasks with other jobs.
Prerequisites
Compute resource limits: Only an associated EMR Serverless Spark compute resource is supported. Ensure that the resource group and the compute resource are connected over the network.
Resource group constraints: This task runs only in a Serverless resource group.
(Optional, for RAM users) Add the RAM user who develops tasks to the corresponding workspace and grant the user the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information about how to add members, see Add members to a workspace.
If you use an Alibaba Cloud account, you can skip this step.
Create a node
For instructions, see Create a node.
Develop the node
Write the task code in the SQL editor. You can define variables in the code using the ${variable_name} syntax. Then, in the Scheduling Parameters section of the Scheduling Settings on the right, assign a value to each variable. The system then dynamically replaces the variables with their assigned values when the node runs on a schedule. For more information about scheduling parameters, see Scheduling parameter sources and expressions. The following code provides an example.
SHOW TABLES;
SELECT * FROM kyuubi040702 WHERE age >= '${a}'; -- Use with a scheduling parameter.
An SQL statement cannot exceed 130 KB.
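The substitution described above can be pictured with a minimal Python sketch. It is only an illustration of how a ${variable_name} placeholder is replaced by its assigned value, not how DataWorks implements it. The value "18" for the variable a is a hypothetical assignment, and Python's string.Template is used only because it happens to share the ${name} syntax.

```python
from string import Template

# SQL as written in the node editor; ${a} is the variable from the example above.
sql_template = Template("SELECT * FROM kyuubi040702 WHERE age >= '${a}';")

# Hypothetical value assigned to the variable `a` in Scheduling Parameters.
scheduling_parameters = {"a": "18"}

# substitute() raises KeyError if a variable has no assigned value,
# which mirrors the need to assign a value to every variable you define.
rendered_sql = sql_template.substitute(scheduling_parameters)
print(rendered_sql)
```

In this sketch the statement that would be sent for execution is the rendered string, with the quoted placeholder replaced by the assigned value.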
Debug the node
In the Run Configuration, configure parameters such as Compute Resource and Resource Group.
Parameter: Description
Compute Resource: Select a bound EMR Serverless Spark compute resource. You must first bind an EMR Serverless Spark compute resource. If no compute resources are available, select Create Compute Resource from the drop-down list.
Resource Group: Select a resource group that is bound to the workspace.
Script Parameters: If you define variables using the ${parameter_name} syntax in the node content, you must specify the Parameter name and Parameter Value in the Script Parameters section. The system replaces the variables with their actual values at runtime. For more information, see Scheduling parameter sources and expressions.
ServerlessSpark node parameter: Native Spark properties. For more information, see Open Source Spark Properties and Custom Spark parameters. Use the following format: "spark.eventLog.enabled": false.
Note: DataWorks lets you set global Spark parameters for different DataWorks modules at the workspace level. You can specify whether these global parameters take precedence over the parameters that are set within a specific module. For more information, see Set global Spark parameters.
On the toolbar at the top of the node editing page, click Run to run the task.
Important: Before deploying the node, synchronize the ServerlessSpark node parameter settings from the Run Configuration to the ServerlessSpark node parameter section of the Scheduling Settings.
Next steps
Configure node scheduling: If you need to run a node periodically, configure its Scheduling Policy in the Scheduling Settings panel on the right.
Publish a node: To run a task in the production environment, click the icon to publish the node. A node runs on schedule only after it is published to the production environment.
Task O&M: After a task is published, you can monitor the status of its periodic runs in the Operation Center. For more information, see Get started with Operation Center.