A Serverless Spark SQL node runs distributed SQL queries on structured data using EMR Serverless Spark compute resources, with no infrastructure to configure or manage. Write SQL in the node editor, configure compute settings, and schedule jobs to run in the production environment.
Prerequisites
Before you begin, make sure you have:
An EMR Serverless Spark compute resource attached to the workspace, with network connectivity available between the resource group and the compute resource. See Attach EMR Serverless Spark compute resources.
A Serverless resource group. Only Serverless resource groups can run this node type.
(RAM users only) Membership in the workspace with the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so assign it with caution. See Add members to a workspace. If you use an Alibaba Cloud account, skip this step.
Create a node
See Create a node for the general node creation steps.
Write SQL
Write task code in the SQL editing area. Use the catalog.database.tablename syntax to reference tables:
If catalog is omitted, the task uses the cluster's default catalog.
If catalog.database is omitted, the task uses the default database within the cluster's default catalog.
For information about data catalogs, see Manage data catalogs in EMR Serverless Spark.
-- Replace <catalog.database.tablename> with your actual table path.
SELECT * FROM <catalog.database.tablename>
The maximum size of a single SQL statement is 130 KB.
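To illustrate the three reference forms described above, the following sketch queries one hypothetical table with a fully qualified name, with the catalog omitted, and with both the catalog and database omitted. The names my_catalog, sales_db, and orders are placeholders and do not refer to objects in your environment. The three statements refer to the same table only if my_catalog is the cluster's default catalog and sales_db is the default database in that catalog.

```sql
-- Hypothetical names: my_catalog, sales_db, and orders are placeholders.

-- Fully qualified: catalog, database, and table are all explicit.
SELECT * FROM my_catalog.sales_db.orders LIMIT 10;

-- Catalog omitted: resolved against the cluster's default catalog.
SELECT * FROM sales_db.orders LIMIT 10;

-- Catalog and database omitted: resolved against the default database
-- in the cluster's default catalog.
SELECT * FROM orders LIMIT 10;
```

Using the fully qualified form avoids depending on the cluster's defaults, which can matter when the same node runs against different compute resources.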
Pass parameters to SQL
Use the ${variable_name} syntax to define variables in your SQL. At runtime, each variable is replaced with the value you assign in the scheduling configuration. This lets you reuse the same node logic across scheduled runs — for example, appending a date suffix to a table name.
To configure variables:
In your SQL, declare the variable using ${var}.
On the right side of the node editing page, go to Scheduling Configuration > Scheduling Parameters and assign a value to the variable.
The following example creates a partitioned table whose name includes a date suffix. Assigning ${yyyymmdd} to var produces a new table name on each scheduled run.
SHOW TABLES;
-- Define a variable named var using ${var}. If you assign the value ${yyyymmdd} to this variable, you can create a table with the data timestamp as a suffix using a scheduled task.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
    ip STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
) PARTITIONED BY (
    dt STRING
);
For supported variable formats, see Supported formats for scheduling parameters.
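Continuing the example above, the following sketch writes to and reads from the date-suffixed table using the same variable. It assumes that var is assigned ${yyyymmdd} in Scheduling Parameters, and the inserted row values are made-up sample data. Because each variable is replaced with its assigned value at runtime, the same variable can appear in both the table name and the partition value.

```sql
-- Assumes var is assigned ${yyyymmdd} in Scheduling Parameters.
-- The row values below are made-up sample data.
INSERT INTO userinfo_new_${var} PARTITION (dt = '${var}')
VALUES ('192.0.2.10', 'user_001');

-- Read back the partition written for this run.
SELECT ip, uid
FROM userinfo_new_${var}
WHERE dt = '${var}';
```

For instance, if var resolved to 20240101 for a given run, the statements would target the table userinfo_new_20240101 and the partition dt = '20240101'.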
Debug the node
In the Debug Configuration section, configure the following settings:
| Setting | Description |
|---|---|
| Compute resource | Select an attached EMR Serverless Spark compute resource. If none are available, select Create Computing Resource from the drop-down list. |
| Resource group | Select a resource group attached to the workspace. |
| Script parameters | Set Parameter Name and Parameter Value for any variables defined in your SQL using ${Parameter Name}. These values are applied when the task runs. See Supported formats for scheduling parameters. |
| Serverless Spark node parameters | Runtime parameters for the Spark program. Two types are supported: custom DataWorks parameters (see Appendix: DataWorks parameters) and Spark native properties (see Open source Spark properties and List of custom Spark Conf parameters). Format: spark.eventLog.enabled : false. To apply Spark parameters across the workspace, see Set global Spark parameters. |

In the toolbar at the top of the node editing page, click Run to execute the SQL task.
Before publishing, sync the ServerlessSpark Node Parameters under Debug Configuration with the ServerlessSpark Node Parameters under Scheduling Configuration.
What's next
Schedule the node: To run the node periodically, configure the Scheduling Policy and scheduling properties in the Scheduling pane. See Schedule the node.
Publish the node: To run the node in the production environment, click the publish icon to start the publishing process. Periodic scheduling takes effect only after the node is published. See Publish the node.
Monitor the node: After publishing, track the status of auto-triggered tasks in Operation Center. See Get started with Operation Center.
Appendix: DataWorks parameters
The following parameters control how the Serverless Spark SQL node submits and runs tasks in Data Studio.
| Parameter | Description |
|---|---|
| FLOW_SKIP_SQL_ANALYZE | Execution mode for SQL statements. true: run multiple SQL statements at once. false (default): run one SQL statement at a time. Note: Supported only for test runs in the development environment. |
| DATAWORKS_SESSION_DISABLE | Job submission method. false (default): submit the task to SQL Compute. true: submit the task to a resource queue. When set to true, use SERVERLESS_QUEUE_NAME to specify the queue. Note: Takes effect only during execution in Data Studio, not during scheduled runs. |
| SERVERLESS_QUEUE_NAME | The resource queue to submit tasks to. By default, the Default Resource Queue configured for the cluster in Cluster Management under Management Center is used. To specify a queue, set this parameter directly or set global Spark parameters. For queue setup, see Manage resource queues. Note: Takes effect only when |
| SERVERLESS_SQL_COMPUTE | The SQL Compute session to use. By default, the Default SQL Compute configured for the cluster under Computing Resources in Management Center is used. Set this parameter to assign different SQL sessions to different tasks. See Manage SQL sessions. |