Serverless Spark SQL node

A Serverless Spark SQL node runs distributed SQL queries on structured data using EMR Serverless Spark compute resources, with no infrastructure to configure or manage. Write SQL in the node editor, configure compute settings, and schedule jobs to run in the production environment.

Prerequisites

Before you begin, make sure you have:

  • An EMR Serverless Spark compute resource attached to the workspace, with network connectivity available between the resource group and the compute resource. See Attach EMR Serverless Spark compute resources.

  • A Serverless resource group. Only Serverless resource groups can run this node type.

  • (RAM users only) Workspace membership with the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so assign it with caution. See Add members to a workspace. If you use an Alibaba Cloud account, skip this step.

Create a node

See Create a node for the general node creation steps.

Write SQL

Write task code in the SQL editing area. Use the catalog.database.tablename syntax to reference tables:

  • If catalog is omitted, the task uses the cluster's default catalog.

  • If catalog.database is omitted, the task uses the default database within the cluster's default catalog.

For information about data catalogs, see Manage data catalogs in EMR Serverless Spark.

-- Replace <catalog.database.tablename> with your actual table path.
SELECT * FROM <catalog.database.tablename>;
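
The three reference forms described above can be sketched as follows. The names my_catalog, sales_db, and orders are hypothetical and only illustrate the syntax; substitute the catalog, database, and table that exist in your environment.

```sql
-- Hypothetical names: my_catalog, sales_db, and orders are assumptions for illustration.

-- Fully qualified: catalog, database, and table are all explicit.
SELECT * FROM my_catalog.sales_db.orders LIMIT 10;

-- Catalog omitted: the table is resolved in the cluster's default catalog.
SELECT * FROM sales_db.orders LIMIT 10;

-- Catalog and database omitted: the table is resolved in the default
-- database of the cluster's default catalog.
SELECT * FROM orders LIMIT 10;
```

Fully qualified names are the safer choice when the same node may run against clusters with different default catalogs.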

The maximum size of a single SQL statement is 130 KB.

Pass parameters to SQL

Use the ${variable_name} syntax to define variables in your SQL. At runtime, each variable is replaced with the value you assign in the scheduling configuration. This lets you reuse the same node logic across scheduled runs — for example, appending a date suffix to a table name.

To configure variables:

  1. In your SQL, declare the variable using ${var}.

  2. On the right side of the node editing page, go to Scheduling Configuration > Scheduling Parameters and assign a value to the variable.

The following example creates a partitioned table whose name includes a date suffix. Assigning ${yyyymmdd} to var produces a new table name on each scheduled run.

SHOW TABLES;
-- ${var} is a scheduling variable. If you assign ${yyyymmdd} to var, each scheduled run
-- creates a table whose name ends with the data timestamp.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
  ip  STRING COMMENT 'IP address',
  uid STRING COMMENT 'User ID'
) PARTITIONED BY (
  dt STRING
);

For supported variable formats, see Supported formats for scheduling parameters.
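
As a possible follow-up to the example above, the same variable can also select the partition to write. This is a minimal sketch: it assumes var is assigned ${yyyymmdd} in Scheduling Parameters, and source_db.userinfo_raw is a hypothetical source table with ip and uid columns.

```sql
-- Assumption: var is assigned ${yyyymmdd} in Scheduling Parameters.
-- Assumption: source_db.userinfo_raw is a hypothetical source table.
INSERT OVERWRITE TABLE userinfo_new_${var} PARTITION (dt = '${var}')
SELECT ip, uid
FROM source_db.userinfo_raw;

-- For a run in which var is replaced with the hypothetical value 20240101,
-- the statement submitted to Spark would read:
--   INSERT OVERWRITE TABLE userinfo_new_20240101 PARTITION (dt = '20240101')
--   SELECT ip, uid FROM source_db.userinfo_raw;
```

Because the variable is replaced as text before the SQL runs, string values such as the partition value need their own quotation marks.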

Debug the node

  1. In the Debug Configuration section, configure the following settings:

    Setting | Description
    Compute resource | Select an attached EMR Serverless Spark compute resource. If none are available, select Create Computing Resource from the drop-down list.
    Resource group | Select a resource group attached to the workspace.
    Script parameters | Set Parameter Name and Parameter Value for any variables defined in your SQL using ${Parameter Name}. These values are applied when the task runs. See Supported formats for scheduling parameters.
    Serverless Spark node parameters | Runtime parameters for the Spark program. Two types are supported: custom DataWorks parameters (see Appendix: DataWorks parameters) and Spark native properties (see Open source Spark properties and List of custom Spark Conf parameters). Format: spark.eventLog.enabled : false. To apply Spark parameters across the workspace, see Set global Spark parameters.

  2. In the toolbar at the top of the node editing page, click Run to execute the SQL task.
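
For the Serverless Spark node parameters setting in step 1, entries might look like the fragment below. spark.eventLog.enabled is the example given in the format description; spark.sql.shuffle.partitions is a standard open source Spark property added here for illustration, and its value is arbitrary. Placing one entry per line is an assumption, as is whether EMR Serverless Spark honors a given property, so check the List of custom Spark Conf parameters before relying on it.

```text
spark.eventLog.enabled : false
spark.sql.shuffle.partitions : 200
```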

Important

Before publishing, make sure the Serverless Spark node parameters under Debug Configuration match the Serverless Spark node parameters under Scheduling Configuration.

What's next

  • Schedule the node: To run the node periodically, configure the Scheduling Policy and scheduling properties in the Scheduling pane. See Schedule the node.

  • Publish the node: To run the node in the production environment, click the publish icon to start the publishing process. Periodic scheduling takes effect only after the node is published. See Publish the node.

  • Monitor the node: After publishing, track the status of auto-triggered tasks in Operation Center. See Get started with Operation Center.

Appendix: DataWorks parameters

The following parameters control how the Serverless Spark SQL node submits and runs tasks in Data Studio.

Parameter | Description
FLOW_SKIP_SQL_ANALYZE | Execution mode for SQL statements. true: run multiple SQL statements at once. false (default): run one SQL statement at a time. Note: Supported only for test runs in the development environment.
DATAWORKS_SESSION_DISABLE | Job submission method. false (default): submit the task to SQL Compute. true: submit the task to a resource queue. When set to true, use SERVERLESS_QUEUE_NAME to specify the queue. Note: Takes effect only during execution in Data Studio, not during scheduled runs.
SERVERLESS_QUEUE_NAME | The resource queue to submit tasks to. By default, the Default Resource Queue configured for the cluster in Cluster Management under Management Center is used. To specify a queue, set this parameter directly or set global Spark parameters. For queue setup, see Manage resource queues. Note: Takes effect only when DATAWORKS_SESSION_DISABLE is true and the SQL Compute session for the registered cluster is not started in the EMR Serverless Spark console. During scheduled execution in Operation Center, tasks are always submitted to a queue and cannot be submitted to SQL Compute.
SERVERLESS_SQL_COMPUTE | The SQL Compute session to use. By default, the Default SQL Compute configured for the cluster under Computing Resources in Management Center is used. Set this parameter to assign different SQL sessions to different tasks. See Manage SQL sessions.
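
As a sketch of how these parameters combine, the following fragment would route a Data Studio run to a specific resource queue instead of SQL Compute. It assumes the DataWorks parameters use the same key : value format shown for Spark properties, and dev_queue is a hypothetical queue name; replace it with a queue that exists in your EMR Serverless Spark workspace.

```text
DATAWORKS_SESSION_DISABLE : true
SERVERLESS_QUEUE_NAME : dev_queue
```

SERVERLESS_QUEUE_NAME has no effect unless DATAWORKS_SESSION_DISABLE is true, so the two parameters are set together.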