DataWorks Serverless Ray nodes let you develop and periodically schedule Python jobs that use the Ray distributed framework on EMR Serverless Ray computing resources.
Introduction
EMR Serverless Ray provides managed Ray computing capabilities on top of Spark workspaces. It is compatible with open source Ray APIs and supports the Python programming model for distributed computing, machine learning, and data processing. With DataWorks Serverless Ray nodes, you can write Python code online and configure the ray job submit command to develop, debug, and schedule your jobs.
Limitations
- Computing resources: You can only select a bound EMR Serverless Ray computing resource. The Serverless resource group must also have network connectivity with the computing resource.
- Language: Only the Python language is supported.
- Execution: You can only submit the entire script for execution. Running single lines or individual code blocks is not supported.
Prerequisites
- Bind an EMR Serverless Ray computing resource to the target DataWorks workspace and ensure that the Ray cluster is available.
- (Optional) If you are using a RAM user to develop tasks, add the RAM user to the workspace and grant it the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information about how to add a member, see Add a member to a workspace.
  Note: If you are using a root account, you can skip this step.
Create a Serverless Ray node
You create a Serverless Ray node the same way as other node types in Data Studio. For more information, see Nodes.
Develop a Serverless Ray node
Developing a Serverless Ray node involves writing Python code in the code editor and configuring the job submission command in the Submit Command section. When you create a file, the system automatically generates a submission command with a filename that matches the node name and has a .py extension.
Node configuration reference
The following table describes the configuration parameters for a Serverless Ray node.
| Section | Parameter | Description |
| Python code | Python code | Write Python code that uses the Ray framework. Ray APIs such as |
| Submit command | Submit command | Configure the submission command for the Ray job. The command format is |
| Submit command | runtime-env-json | Optional. Configure the runtime environment. For example, use the |
| | Parameters | Specify the parameters to pass to your code. You can configure a parameter as a dynamic parameter by using |
If your job depends on multiple Python files, you can create the dependency files as DataWorks Ray File resources, reference them in your code by using ##@resource_reference, and then organize the ray job submit command with --working-dir pointing to the working directory. For more information about creating resources, see Create an EMR resource.
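As a rough illustration of how these options fit together, the following sketch assembles a submission command that follows the open source Ray CLI conventions (`ray job submit` with `--working-dir` and `--runtime-env-json`). The exact command that DataWorks generates is not reproduced here. The script name `my_ray_node.py`, the `requests` dependency, and the `--date` argument are hypothetical values chosen for the example.

```python
import json
import shlex


def build_submit_command(script, working_dir=".", runtime_env=None, script_args=()):
    """Assemble a `ray job submit` command line as one shell-safe string."""
    cmd = ["ray", "job", "submit", "--working-dir", working_dir]
    if runtime_env:
        # --runtime-env-json takes the runtime environment as a JSON string.
        cmd += ["--runtime-env-json", json.dumps(runtime_env)]
    # Everything after "--" is the entrypoint that runs on the Ray cluster.
    cmd += ["--", "python", script, *script_args]
    return shlex.join(cmd)


command = build_submit_command(
    script="my_ray_node.py",             # hypothetical: node name plus .py
    runtime_env={"pip": ["requests"]},   # hypothetical dependency
    script_args=["--date", "20240101"],  # hypothetical script argument
)
print(command)
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids hand-escaping the quotes inside the runtime environment string, which is an easy place to make mistakes when the command is typed manually.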
Debug a Serverless Ray node
- Configure the run configuration.
  In the Run Configuration panel on the right side of the node, configure the following parameters.
  | Parameter | Description |
  | Compute Resource | Select the Serverless Ray compute resource that you have associated. |
  | Resource Group | Select a serverless resource group that has passed the network connectivity test. Serverless Ray nodes support only serverless resource groups. |
  | Script Parameters | When configuring the node content, you can define variables by using ${parameter name}. Configure the Script Parameters section with the Parameter name and Parameter Value. These values are dynamically replaced with actual values at runtime. For more information, see Configure scheduling parameters. |
- Debug and run the node.
  Click Save and then click Run to debug the node.
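The effect of Script Parameters can be pictured as a text substitution that happens before the job is submitted. The sketch below only illustrates that idea and is not the DataWorks implementation. The names `run_date` and `row_limit` and the sample command are hypothetical, and leaving unknown placeholders untouched is an assumption made for this example.

```python
import re

# Matches ${name}; the name may contain any character except a closing brace.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def render(text, params):
    """Replace each ${name} in text with params[name]; keep unknown names as-is."""
    def substitute(match):
        name = match.group(1)
        # Assumption: a placeholder with no configured value is left unchanged.
        return str(params.get(name, match.group(0)))
    return PLACEHOLDER.sub(substitute, text)


# Hypothetical node content and Script Parameters values.
node_content = "python my_ray_node.py --date ${run_date} --limit ${row_limit}"
script_parameters = {"run_date": "20240101", "row_limit": 500}

rendered = render(node_content, script_parameters)
print(rendered)
```

In this sketch, every `${...}` occurrence whose name appears in `script_parameters` is replaced by its value, so the script itself only ever sees concrete values.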
Next steps
- Configure schedule settings: If you want nodes in the project directory to be periodically scheduled, configure the Scheduling Policy and related scheduling properties in the Scheduling Settings panel on the right side of the node.
- Deploy a node: If you want to deploy the task to the production environment, click the icon on the page to initiate the deployment process. Nodes in the project directory are periodically scheduled only after they are deployed to the production environment.
Reference
For details on referencing Ray File resources, see Reference a Ray File resource.