Managed Ray service
AnalyticDB Ray is a fully managed Ray service from AnalyticDB for MySQL. This service enhances and optimizes open source Ray by improving kernel performance and simplifying operations management. AnalyticDB Ray is designed for complex AI scenarios, such as multimodal processing, search and recommendation, and financial risk control. It helps enterprises efficiently build integrated Data + AI architectures and deploy AI applications at scale.
Prerequisites
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
What is AnalyticDB Ray
Open source Ray is a distributed computing framework designed for AI and high-performance computing (HPC). It uses simple API abstractions for efficient distributed scheduling. You can scale a single-machine task to a cluster of thousands of nodes with just a few lines of code and schedule remote resources as if you were calling local functions. Its built-in modules, such as Ray Tune, Ray Train, and Ray Serve, are seamlessly compatible with frameworks such as TensorFlow and PyTorch. Driven by an active open source community and companies such as Anyscale, Ray has become an important tool for building AI applications.
Although open source Ray provides highly flexible distributed computing capabilities, enterprises still face challenges in production environments. These challenges include distributed job optimization, fine-grained resource scheduling, complex cluster operations and maintenance (O&M), and ensuring system stability and high availability.
To address these challenges, AnalyticDB for MySQL introduced a fully managed Ray service: AnalyticDB Ray (ADB Ray). ADB Ray builds on the rich ecosystem of open source Ray. It has been proven in scenarios such as multimodal processing, embodied intelligence, search and recommendation, and financial risk control. ADB Ray enhances the Ray kernel and its service capabilities, optimizes kernel performance, and simplifies cluster O&M. It also seamlessly integrates with the AnalyticDB for MySQL data lakehouse platform. This integration helps enterprises build Data + AI architectures and accelerate the large-scale deployment of enterprise AI.
Billing details
A Ray Cluster resource group is billed as follows:
- Worker Disk Storage is billed based on the configured storage space.
- When Worker Resource Type is set to CPU, you are billed based on the number of elastic resource group ACUs that are used.
- When Worker Resource Type is set to GPU, you are billed based on the GPU specifications and quantity.
Precautions
- Deleting or restarting a worker node has the following effects. To prevent unexpected data loss and job failures, change the worker configuration of the Ray Cluster resource group during off-peak hours and avoid scheduling jobs to worker nodes that are scheduled for a restart.
  - Drivers, Actors, and Tasks running on the worker node will fail. However, Ray automatically resubmits the Actors and Tasks.
  - Data in the Ray distributed object store is lost. If other Tasks depend on the data on the restarted worker, those Tasks will also fail.
- Resource group changes:
  - Delete a resource group: If tasks are running in the resource group, deleting the group will interrupt them.
  - Delete a Worker Group: Deleting a Worker Group in a Ray Cluster resource group causes the worker nodes to be deleted. For more information, see the effects of deleting a worker node.
  - Change the number of workers: If the new maximum number of workers is less than the previous minimum number of workers, worker nodes are deleted. For more information, see the effects of deleting a worker node.
  - Change other configurations: Changes to parameters other than Minimum number of workers and Maximum number of workers, such as Head resource type or Worker resource type, cause the head node or the worker nodes in the corresponding Worker Group to restart. For more information, see the effects of restarting a worker node.
- Auto scaling:
  - A Ray Cluster scales out based on logical resource demand, not physical resource utilization. Therefore, a Ray Cluster might trigger auto scaling even if physical resource utilization is low.
  - Some third-party applications create as many Tasks as possible to fully utilize resources. If auto scaling is enabled, many Tasks are created, causing Ray to quickly scale out to its maximum size. Therefore, before you run a third-party program, understand its internal logic for creating Tasks to avoid unnecessary resource consumption.
- Disaster recovery: ADB Ray uses a Redis-based disaster recovery mechanism. This mechanism ensures that the Ray Cluster state, Actor state, and Task state can be recovered if the head node restarts.
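The auto scaling precaution can be illustrated with a deliberately simplified model. The sketch below is not the algorithm that Ray or ADB Ray uses. It only assumes that the target worker count follows the sum of the CPUs that Tasks declare (logical demand), clamped to the configured minimum and maximum. The class `WorkerGroup`, the function `target_workers`, and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkerGroup:
    """Hypothetical description of one worker group (illustrative names)."""
    cpus_per_worker: int
    min_workers: int
    max_workers: int


def target_workers(requested_cpus, group):
    """Return the worker count needed to satisfy the logical CPU requests.

    requested_cpus lists the CPUs that each pending Task or Actor declares,
    regardless of how much CPU time it actually consumes.
    """
    demand = sum(requested_cpus)
    needed = math.ceil(demand / group.cpus_per_worker)
    # Clamp to the configured minimum and maximum number of workers.
    return max(group.min_workers, min(group.max_workers, needed))


group = WorkerGroup(cpus_per_worker=4, min_workers=1, max_workers=8)

# Ten Tasks that each declare 2 CPUs but mostly wait on I/O:
# physical utilization stays low, yet logical demand is 20 CPUs.
print(target_workers([2] * 10, group))   # 5

# A program that creates 100 one-CPU Tasks at once reaches the maximum.
print(target_workers([1] * 100, group))  # 8
```

In this model, the second call shows why a program that creates as many Tasks as possible drives the group to its maximum size even when each Task does little work.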
Create a Ray service
- Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
- In the left-side navigation pane, choose Cluster Management > Resource Management. Click the Resource Groups tab. Then, in the upper-right corner of the resource group list, click Create Resource Group.
- Name the resource group, set Job Type to AI, and configure the following parameters:
| Parameter | Description |
| --- | --- |
| Deployment Mode | The deployment mode of the resource group. Select RayCluster. |
| Head Resource Specifications | The head node manages Ray metadata, runs the Global Control Store (GCS) service, and schedules tasks, but does not execute tasks. The head resource specifications determine the number of CPU cores. You can choose specifications such as small, m.xlarge, and m.2xlarge. Each specification has the same number of CPU cores as the corresponding Spark resource specification. For more information, see Spark resource specifications. Important: The head node is responsible for job scheduling. Select the head resource specifications based on the overall scale of the Ray cluster. |
| Worker Group Name | The name of the worker group. You can configure multiple worker groups with different names in one AI resource group. |
| Worker Resource Type | The type of the worker group. Valid values: CPU and GPU. If your business involves daily computing tasks, multitasking, or complex logical operations, we recommend that you select CPU. If your business involves large-scale data parallel processing, machine learning, or deep learning training, we recommend that you select GPU. |
| Worker Resource Specifications | If you set the Worker Resource Type parameter to CPU, you can select specifications such as small, m.xlarge, and m.2xlarge. Each specification has the same number of CPU cores as the corresponding Spark resource specification. For more information, see Spark resource specifications. If you set the Worker Resource Type parameter to GPU, submit a ticket for technical assistance, because the available specifications depend on GPU models and inventory. |
| Worker Disk Storage | The disk stores Ray logs, temporary data, and data spilled from the Ray distributed object store. Unit: GB. Valid values: 30 to 2000. Default value: 100. Important: Disks are used for temporary data storage and cannot be used for long-term storage. |
| Minimum Workers, Maximum Workers | Minimum Workers: the minimum number of worker nodes that are required in a worker group. The value must be at least 1. Maximum Workers: the maximum number of worker nodes that are allowed in a worker group. The value cannot exceed 8. Each worker group can be automatically scaled. If the minimum and maximum numbers of worker nodes in a worker group are different, AnalyticDB for MySQL dynamically adjusts the number of worker nodes based on the number of current tasks. If multiple worker groups exist, AnalyticDB for MySQL performs automatic matching to prevent overloading or underutilizing a single worker group. |
| Distribution Unit | The number of GPUs that are allocated to each worker node. Example: 1/3. Important: This parameter is required only when you set the Worker Resource Type parameter to GPU. |
- Click OK.
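The value ranges in the parameter descriptions above can be checked before you fill in the console form. The following is a minimal sketch in plain Python. It is not an AnalyticDB for MySQL API. The class `WorkerGroupConfig`, its field names, and the function `validate` are hypothetical, and the rule that the minimum must not exceed the maximum is an assumption that the source does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class WorkerGroupConfig:
    """Hypothetical container for the worker group parameters."""
    name: str
    resource_type: str          # "CPU" or "GPU"
    disk_storage_gb: int = 100  # documented default value
    min_workers: int = 1
    max_workers: int = 1


def validate(cfg):
    """Return a list of human-readable problems (empty if none)."""
    problems = []
    if cfg.resource_type not in ("CPU", "GPU"):
        problems.append("Worker Resource Type must be CPU or GPU")
    if not 30 <= cfg.disk_storage_gb <= 2000:
        problems.append("Worker Disk Storage must be 30 to 2000 GB")
    if cfg.min_workers < 1:
        problems.append("Minimum Workers must be at least 1")
    if cfg.max_workers > 8:
        problems.append("Maximum Workers cannot exceed 8")
    # Assumption: the minimum should not be larger than the maximum.
    if cfg.min_workers > cfg.max_workers:
        problems.append("Minimum Workers exceeds Maximum Workers")
    return problems


ok = WorkerGroupConfig(name="cpu-group", resource_type="CPU",
                       min_workers=1, max_workers=4)
bad = WorkerGroupConfig(name="big-group", resource_type="CPU",
                        disk_storage_gb=10, max_workers=16)

print(validate(ok))   # []
print(validate(bad))  # two problems: disk size and maximum workers
```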
Connect to and use the Ray service
Step 1: Obtain the endpoint
- In the left-side navigation pane, choose Cluster Management > Resource Management and click the Resource Groups tab.
- In the Actions column for the resource group, view the endpoints:
  - Ray Grafana: The endpoint for the Grafana visualization tool. Click the endpoint to go to the Grafana visualization page.
  - Ray cluster endpoint: The internal endpoint.
  - Ray Dashboard: The public endpoint and Dashboard address. Click the endpoint to go to the Ray visualization interface, where you can view the status of the Ray Cluster resource group and jobs.
Step 2: Submit a job
Prerequisites
Python 3.7 or later is installed.
Procedure
You can submit a job in one of the following two ways:
- Submit a job using CTL (Recommended): Use CTL to package and upload the script file to the Ray Cluster for execution. The entry program runs in the Ray Cluster and consumes resources of the Ray Cluster resource group.
- Connect to the Ray Cluster using ray.init to run a job: Use the ray.init method to connect to the Ray Cluster. The entry program runs locally and does not consume resources from the Ray Cluster resource group. The local Ray and Python versions must be consistent with the versions on the Ray Cluster. If the Ray Cluster version changes, you must update your local environment configuration accordingly.
Submit a job using CTL
- Run the following command to install Ray.
      pip3 install "ray[default]"
- (Optional) Configure environment variables.
  Note: You can configure a global environment variable to specify the endpoint, or specify the endpoint when you submit the job.
      export RAY_ADDRESS="RAY_URL"
  Parameter description:
  RAY_URL: The Ray connection address. Set this parameter to the connection address that you obtained in the "Obtain the endpoint" step.
- Submit the job.
  Important: When you submit a job, the system packages and uploads all files from the directory specified by the working-dir parameter to the Ray head node for execution. Therefore, note the following:
  - Keep the directory specified by the working-dir parameter as small as possible. Otherwise, an excessively large directory may cause the upload to fail.
  - All dependent script files must be in the directory specified by the working-dir parameter. Otherwise, the job may fail due to missing dependencies.
  Submit the job in one of the following ways:
  - If you have configured the environment variable, run the following command to submit the job.
        ray job submit --working-dir your_working_directory -- python your_python.py
    Parameter description:
    - your_working_directory: The path where the script file is located. In this topic, the sample path is /root/Ray.
    - your_python.py: The script file. In this topic, the sample script file is scripts.py.
    Example:
        ray job submit --working-dir /root/Ray -- python scripts.py
  - If you have not configured the environment variable, run the following command to submit the job.
        ray job submit --address ray_url --working-dir your_working_directory -- python your_python.py
    Parameter description:
    - ray_url: The Ray connection address. Enter the connection address that you obtained in the "Obtain the endpoint" step.
    - your_working_directory: The path where the script file is located.
    - your_python.py: The script file. In this topic, the sample script file is scripts.py.
    Example:
        ray job submit --address http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265 --working-dir /root/Ray -- python scripts.py
- Check the job running status.
  You can check the job status in one of the following two ways:
  - Use a command:
        ray job list
  - Use the visualization interface:
    - On the Resource Groups tab, in the Actions column for the resource group, view the endpoints as described in the "Obtain the endpoint" step.
    - Click the Ray Dashboard address to open the visualization interface.
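The working-dir notes above lend themselves to a small pre-submission check. The following sketch uses only the Python standard library. It measures the directory that would be uploaded, confirms that the entry script is inside it, and assembles the same `ray job submit` command shown in this topic. The function names and the `max_bytes` threshold are hypothetical. This topic does not state a size limit, so choose a threshold that suits your environment.

```python
import os


def working_dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def preflight(working_dir, script, max_bytes):
    """Return a list of problems found before submitting the job."""
    problems = []
    if not os.path.isfile(os.path.join(working_dir, script)):
        problems.append(f"{script} is not in {working_dir}")
    size = working_dir_size(working_dir)
    if size > max_bytes:  # max_bytes is an arbitrary, user-chosen threshold
        problems.append(f"{working_dir} is {size} bytes, above {max_bytes}")
    return problems


def build_submit_command(working_dir, script, address=None):
    """Build the argument list for `ray job submit`."""
    cmd = ["ray", "job", "submit"]
    if address:  # omit when RAY_ADDRESS is already exported
        cmd += ["--address", address]
    cmd += ["--working-dir", working_dir, "--", "python", script]
    return cmd


# Sample values from this topic; the threshold is only an illustration.
for problem in preflight("/root/Ray", "scripts.py", max_bytes=50 * 1024 * 1024):
    print("warning:", problem)
print(" ".join(build_submit_command("/root/Ray", "scripts.py")))
```

The returned list can be passed to `subprocess.run` once the check reports no problems.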
Connect to the Ray Cluster using ray.init to run a job
- Run the following command to install Ray.
      pip3 install ray
- (Optional) Configure a global environment variable.
  Note: You can configure a global environment variable to specify the endpoint, or specify the endpoint in the script file.
      export RAY_ADDRESS="RAY_URL"
  Parameter description:
  RAY_URL: The Ray connection address. The address you obtained in the "Obtain the endpoint" step is the dashboard address, which uses port 8265. To connect using `ray.init()`, you must change the protocol to `ray` and the port to 10001. For example, if the dashboard address is http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265, you must change it to ray://amv-uf64gwe14****-rayo.ads.aliyuncs.com:10001.
- Run the program.
  - If you have configured the global environment variable, run the following command to run the program.
        python scripts.py
  - If you have not configured the global environment variable, perform the following steps:
    - Modify the script file to specify the endpoint.
          ray.init(address="RAY_URL")
      Parameter description:
      RAY_URL: The Ray connection address. The address you obtained in the "Obtain the endpoint" step is the dashboard address, which uses port 8265. To connect using `ray.init()`, you must change the protocol to `ray` and the port to 10001. For example, if the dashboard address is http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265, you must change it to ray://amv-uf64gwe14****-rayo.ads.aliyuncs.com:10001.
      Important: If the Ray endpoint is configured incorrectly, ray.init() starts a local cluster to run the program. Check the output logs to make sure that you are correctly connected to the Ray cluster.
    - Run the following command to run the program:
          python scripts.py
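The address conversion described above (protocol `ray`, port 10001) is easy to get wrong by hand, so it can be scripted. The following is a minimal sketch that uses only the Python standard library. The function name `to_client_address` and the host `ray-head.example.com` are hypothetical placeholders. Replace the host with the dashboard address of your own resource group.

```python
from urllib.parse import urlsplit


def to_client_address(dashboard_url, client_port=10001):
    """Convert a dashboard address (http, port 8265) into a ray:// address."""
    parts = urlsplit(dashboard_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a dashboard address: {dashboard_url!r}")
    return f"ray://{parts.hostname}:{client_port}"


# Placeholder host; use the Ray Dashboard address of your resource group.
address = to_client_address("http://ray-head.example.com:8265")
print(address)  # ray://ray-head.example.com:10001

# In scripts.py, the result would then be passed to ray.init, for example:
#   import ray
#   ray.init(address=address)
```

Raising an error for an unexpected address, instead of passing it through, helps avoid the case where a misconfigured endpoint goes unnoticed and the program runs on a local cluster.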