Managed Ray service

AnalyticDB Ray is a fully managed Ray service from AnalyticDB for MySQL. This service enhances and optimizes open source Ray by improving kernel performance and simplifying operations management. AnalyticDB Ray is designed for complex AI scenarios, such as multimodal processing, search and recommendation, and financial risk control. It helps enterprises efficiently build integrated Data + AI architectures and deploy AI applications at scale.

Prerequisites

An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.

What is AnalyticDB Ray

Open source Ray is a distributed computing framework designed for AI and high-performance computing (HPC). It uses simple API abstractions for efficient distributed scheduling. You can scale a single-machine task to a cluster of thousands of nodes with just a few lines of code and schedule remote resources as if you were calling local functions. Its built-in modules, such as Ray Tune, Ray Train, and Ray Serve, are seamlessly compatible with frameworks such as TensorFlow and PyTorch. Driven by an active open source community and companies such as Anyscale, Ray has become an important tool for building AI applications.

Although open source Ray provides highly flexible distributed computing capabilities, enterprises still face challenges in production environments. These challenges include distributed job optimization, fine-grained resource scheduling, complex cluster operations and maintenance (O&M), and ensuring system stability and high availability.

To address these challenges, AnalyticDB for MySQL introduced a fully managed Ray service: AnalyticDB Ray (ADB Ray). ADB Ray builds on the rich ecosystem of open source Ray. It has been proven in scenarios such as multimodal processing, embodied intelligence, search and recommendation, and financial risk control. ADB Ray enhances the Ray kernel and its service capabilities, optimizes kernel performance, and simplifies cluster O&M. It also seamlessly integrates with the AnalyticDB for MySQL data lakehouse platform. This integration helps enterprises build Data + AI architectures and accelerate the large-scale deployment of enterprise AI.

Billing details

A Ray Cluster resource group is billed as follows:

  • Worker Disk Storage is billed based on the configured storage space.

  • When Worker Resource Type is set to CPU, you are billed based on the number of elastic resource group ACUs that are used.

  • When Worker Resource Type is set to GPU, you are billed based on the GPU specifications and quantity.

Precautions

  • Deleting or restarting a worker node has the following effects. To prevent unexpected data loss and job failures, change the worker configuration of the Ray Cluster resource group during off-peak hours and avoid scheduling jobs to worker nodes that are scheduled for a restart.

    • Drivers, Actors, and Tasks running on the worker node will fail. However, Ray automatically resubmits the Actors and Tasks.

    • Data in the Ray distributed object store is lost. If other Tasks depend on the data on the restarted worker, those Tasks will also fail.

  • Resource group changes:

    • Delete a resource group: If tasks are running in the resource group, deleting the group will interrupt them.

    • Delete a Worker Group: Deleting a Worker Group in a Ray Cluster resource group causes the worker nodes to be deleted. For more information, see the effects of deleting a worker node.

    • Change the number of workers: If the new maximum number of workers is less than the previous minimum number of workers, worker nodes are deleted. For more information, see the effects of deleting a worker node.

    • Change other configurations: Changes to parameters other than Minimum number of workers and Maximum number of workers, such as Head resource type or Worker resource type, cause the head node or the worker nodes in the corresponding Worker Group to restart. For more information, see the effects of restarting a worker node.

  • Auto scaling:

    • A Ray Cluster scales out based on logical resource demand, not physical resource utilization. Therefore, a Ray Cluster might trigger auto scaling even if physical resource utilization is low.

    • Some third-party applications create as many Tasks as possible to fully utilize resources. If auto scaling is enabled, many Tasks are created, causing Ray to quickly scale out to its maximum size. Therefore, before you run a third-party program, understand its internal logic for creating Tasks to avoid unnecessary resource consumption.

  • Disaster recovery: ADB Ray uses a Redis-based disaster recovery mechanism. This mechanism ensures that the Ray Cluster state, Actor state, and Task state can be recovered if the head node restarts.
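The auto scaling precaution above can be illustrated with a small model. The following is a minimal sketch, not the actual ADB Ray or Ray autoscaler logic: it assumes that demand is the sum of the CPUs that Tasks declare, and the `WorkerGroup` class and `workers_for_demand` function are hypothetical names used only for illustration. It shows why a program that eagerly creates many Tasks pushes a worker group to its maximum size regardless of how busy the machines actually are.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class WorkerGroup:
    """Hypothetical description of one worker group (illustrative only)."""
    cpus_per_worker: int
    min_workers: int
    max_workers: int


def workers_for_demand(requested_cpus: Sequence[float], group: WorkerGroup) -> int:
    """Return the worker count that a demand-based scaler would target.

    requested_cpus holds the logical CPU request of each queued or running
    Task, that is, the value the Task declares, not what it actually uses.
    """
    logical_demand = sum(requested_cpus)
    needed = math.ceil(logical_demand / group.cpus_per_worker)
    # Clamp the result to the configured bounds of the worker group.
    return max(group.min_workers, min(needed, group.max_workers))


group = WorkerGroup(cpus_per_worker=8, min_workers=1, max_workers=8)

# Four Tasks that each declare 1 CPU fit on a single worker.
print(workers_for_demand([1.0] * 4, group))     # 1

# 1,000 eagerly created one-CPU Tasks drive the group to its maximum size,
# even if every Task spends most of its time idle.
print(workers_for_demand([1.0] * 1000, group))  # 8
```

In this model, physical utilization never appears in the calculation, which is the reason to review how a third-party program creates Tasks before you run it with auto scaling enabled.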

Create a Ray service

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.

  2. In the navigation pane on the left, choose Cluster Management > Resource Management. Click the Resource Groups tab. Then, in the upper-right corner of the resource group list, click Create Resource Group.

  3. Name the resource group, set Job Type to AI, and configure the following parameters:

    Parameter

    Description

    Deployment Mode

    The deployment mode of the resource group. Select RayCluster.

    Head Resource Specifications

    The head node is responsible for managing Ray metadata, running the Global Control Store (GCS) service, and scheduling tasks, but does not execute tasks.

    The head resource specification determines the number of CPU cores on the head node. You can choose specifications such as small, m.xlarge, and m.2xlarge. Each head resource specification has the same number of CPU cores as the Spark resource specification of the same name. For more information, see Spark resource specifications.

    Important

    The head node is responsible for job scheduling. Select the head resource specifications based on the overall scale of the Ray cluster.

    Worker Group Name

    The name of the worker group. You can configure multiple worker groups with different names in one AI resource group.

    Worker Resource Type

    The type of the worker group. Valid values: CPU and GPU.

    • If your business involves daily computing tasks, multitasking, or complex logical operations, we recommend that you select CPU.

    • If your business involves large-scale data parallel processing, machine learning, or deep learning training, we recommend that you select GPU.

    Worker Resource Specifications

    • If you set the Worker Resource Type parameter to CPU, you can select specifications such as small, m.xlarge, and m.2xlarge. Each worker resource specification has the same number of CPU cores as the Spark resource specification of the same name. For more information, see Spark resource specifications.

    • If you set the Worker Resource Type parameter to GPU, submit a ticket for technical assistance because the specifications are related to GPU models and inventory.

    Worker Disk Storage

    The disk storage is used to store Ray logs, temporary data, and overflow data from Ray distributed object storage. Unit: GB. Valid values: 30 to 2000. Default value: 100.

    Important

    Disks are used for temporary data storage and cannot be used for long-term storage.

    Minimum Workers

    Maximum Workers

    Minimum Workers: the minimum number of worker nodes that are required in a worker group, with a minimum value of 1.

    Maximum Workers: the maximum number of worker nodes that are allowed in a worker group, with a maximum value of 8.

    Each worker group can be automatically scaled. If the minimum and maximum numbers of worker nodes in a worker group are different, AnalyticDB for MySQL dynamically adjusts the number of worker nodes based on the number of current tasks. If multiple worker groups exist, AnalyticDB for MySQL automatically distributes tasks among them so that no single worker group is overloaded or underutilized.

    Distribution Unit

    The number of GPUs that are allocated to each worker node. Example: 1/3.

    Important

    This parameter is required only when you set the Worker Resource Type parameter to GPU.

  4. Click OK.
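Before you fill in the console form, it can help to check planned values against the limits listed in the parameter table. The following is a minimal sketch; the `WorkerGroupConfig` class and its field names are hypothetical and do not correspond to an official API. The rule that the minimum must not exceed the maximum is an assumption that the table does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerGroupConfig:
    """Illustrative container for one worker group; not an official API."""
    name: str
    resource_type: str      # "CPU" or "GPU"
    min_workers: int = 1
    max_workers: int = 1
    disk_gb: int = 100      # default value from the parameter table


def validate(cfg: WorkerGroupConfig) -> List[str]:
    """Return a list of problems; an empty list means the values are in range."""
    errors = []
    if cfg.resource_type not in ("CPU", "GPU"):
        errors.append("resource_type must be CPU or GPU")
    if cfg.min_workers < 1:
        errors.append("min_workers must be at least 1")
    if cfg.max_workers > 8:
        errors.append("max_workers must be at most 8")
    # Assumption: the minimum cannot be larger than the maximum.
    if cfg.min_workers > cfg.max_workers:
        errors.append("min_workers must not exceed max_workers")
    if not 30 <= cfg.disk_gb <= 2000:
        errors.append("disk_gb must be between 30 and 2000")
    return errors


print(validate(WorkerGroupConfig("cpu-group", "CPU", min_workers=1, max_workers=4)))
# []
print(validate(WorkerGroupConfig("big-group", "CPU", min_workers=0, max_workers=16, disk_gb=10)))
```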

Connect to and use the Ray service

Step 1: Obtain the endpoint

  1. In the navigation pane on the left, choose Cluster Management > Resource Management and click the Resource Groups tab.

  2. In the Actions column for the resource group, click More > Details to view the endpoints.

    • Ray Grafana: The endpoint for the Grafana visualization tool. Click the endpoint to go to the Grafana visualization page.

    • Ray cluster endpoint: The internal endpoint.

    • Ray Dashboard: The public endpoint and Dashboard address. Click the endpoint to go to the Ray visualization interface, where you can view the status of the Ray Cluster resource group and jobs.

Step 2: Submit a job

Prerequisites

Python 3.7 or later is installed.

Procedure

You can submit a job in one of the following two ways:

  • Submit a job using CTL (Recommended): Use CTL to package and upload the script file to the Ray Cluster for execution. The entry program runs in the Ray Cluster and consumes resources of the Ray Cluster resource group.

  • Connect to the Ray Cluster using ray.init to run a job: Use the ray.init method to connect to the Ray Cluster. The entry program runs locally and does not consume resources from the Ray Cluster resource group. The local Ray and Python versions must be consistent with the versions on the Ray Cluster. If the Ray Cluster version changes, you must update your local environment configuration accordingly.
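Because the ray.init method requires matching local and cluster versions, a quick local check can catch mismatches before you connect. The following is a minimal sketch under two assumptions: Python versions are compared at major.minor granularity and Ray versions must match exactly. The function names and the cluster version strings are hypothetical placeholders; use the versions reported for your own Ray Cluster.

```python
import sys
from importlib import metadata  # requires Python 3.8 or later


def local_versions():
    """Return (python_version, ray_version); ray_version is None if Ray is missing."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    try:
        ray_version = metadata.version("ray")
    except metadata.PackageNotFoundError:
        ray_version = None
    return python_version, ray_version


def find_mismatches(local_python, local_ray, cluster_python, cluster_ray):
    """Compare local and cluster versions and return a list of problems."""
    problems = []
    # Assumption: only the major.minor part of the Python version must match.
    if local_python.split(".")[:2] != cluster_python.split(".")[:2]:
        problems.append(
            "Python {} does not match cluster Python {}".format(local_python, cluster_python)
        )
    # Assumption: the Ray version must match exactly.
    if local_ray != cluster_ray:
        problems.append(
            "Ray {} does not match cluster Ray {}".format(local_ray, cluster_ray)
        )
    return problems


# Placeholder cluster versions, for illustration only.
CLUSTER_PYTHON = "3.10.12"
CLUSTER_RAY = "2.9.0"

for problem in find_mismatches(*local_versions(), CLUSTER_PYTHON, CLUSTER_RAY):
    print(problem)
```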

Submit a job using CTL

  1. Run the following command to install Ray.

    pip3 install "ray[default]"
  2. (Optional) Configure environment variables.

    Note

    You can configure a global environment variable to specify the endpoint, or specify the endpoint when you submit the job.

    export RAY_ADDRESS="RAY_URL"

    Parameter description:

    RAY_URL: The Ray connection address. Set this parameter to the connection address that you obtained in the "Obtain the endpoint" step.

  3. Submit the job.

    Important

    When you submit a job, the system packages all files in the directory specified by the working-dir parameter and uploads them to the Ray head node for execution. Therefore, note the following:

    • Keep the directory specified by the working-dir parameter as small as possible. Otherwise, an excessively large directory may cause the upload to fail.

    • All dependent script files must be in the directory specified by the working-dir parameter. Otherwise, the job may fail due to missing dependencies.

    • If you have configured the environment variable, run the following command to submit the job.

      ray job submit --working-dir your_working_directory -- python your_python.py 

      Parameter description:

      • your_working_directory: The path where the script file is located. In this topic, the sample path is /root/Ray.

      • your_python.py: The script file. In this topic, the sample script file is scripts.py.

      Example:

      ray job submit --working-dir /root/Ray -- python scripts.py
    • If you have not configured the environment variable, run the following command to submit the job.

      ray job submit --address ray_url --working-dir your_working_directory -- python your_python.py 

      Parameter description:

      • ray_url: The Ray connection address. Enter the connection address that you obtained in the "Obtain the endpoint" step.

      • your_working_directory: The path where the script file is located.

      • your_python.py: The script file. In this topic, the sample script file is scripts.py.

      Example:

      ray job submit --address http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265 --working-dir /root/Ray -- python scripts.py 
  4. Check the job running status.

    You can check the job status in one of the following two ways:

    • Use a command:

      ray job list
    • Use the visualization interface:

      1. On the Resource Groups tab, in the Actions column for the resource group, click More > Details.

      2. Click the Ray Dashboard address to open the visualization interface.
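The CTL submission steps above can also be scripted. The following is a minimal sketch that checks the working directory before it runs the same `ray job submit` command shown in this topic. The helper names are hypothetical, and the 100 MiB threshold is an arbitrary illustrative value, not a documented ADB Ray limit.

```python
import os
import subprocess


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symbolic links."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def check_working_dir(working_dir, script, max_bytes=100 * 1024 * 1024):
    """Raise if the entry script is missing or the directory looks too large.

    max_bytes is an illustrative threshold chosen for this sketch.
    """
    if not os.path.isfile(os.path.join(working_dir, script)):
        raise FileNotFoundError("{} is not in {}".format(script, working_dir))
    size = directory_size_bytes(working_dir)
    if size > max_bytes:
        raise ValueError("{} is {} bytes; trim it before uploading".format(working_dir, size))


def build_submit_command(working_dir, script, address=None):
    """Build the argument list for the ray job submit command."""
    command = ["ray", "job", "submit"]
    if address:  # omit --address when RAY_ADDRESS is already exported
        command += ["--address", address]
    command += ["--working-dir", working_dir, "--", "python", script]
    return command


def submit(working_dir, script, address=None):
    """Validate the directory, then run the ray CLI and fail on a non-zero exit."""
    check_working_dir(working_dir, script)
    return subprocess.run(build_submit_command(working_dir, script, address), check=True)


print(" ".join(build_submit_command("/root/Ray", "scripts.py")))
# ray job submit --working-dir /root/Ray -- python scripts.py
```

Calling `submit("/root/Ray", "scripts.py")` would then run the command, provided that the Ray CLI is installed and the endpoint is reachable.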

Connect to the Ray Cluster using ray.init to run a job

  1. Run the following command to install Ray.

    pip3 install "ray[client]"
  2. (Optional) Configure a global environment variable.

    Note

    You can configure a global environment variable to specify the endpoint, or specify the endpoint in the script file.

    export RAY_ADDRESS="RAY_URL"

    Parameter description:

    RAY_URL: The Ray connection address. The address you obtained in the "Obtain the endpoint" step is the dashboard address, which uses port 8265. To connect using `ray.init()`, you must change the protocol to `ray` and the port to 10001.

    For example, if the dashboard address is http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265, you must change it to ray://amv-uf64gwe14****-rayo.ads.aliyuncs.com:10001.

  3. Run the program.

    • If you have configured the global environment variable, run the program.

      python scripts.py
    • If you have not configured the global environment variable, perform the following steps.

      1. Modify the script file to specify the endpoint.

        ray.init(address="RAY_URL")

        Parameter description:

        RAY_URL: The Ray connection address. The address you obtained in the "Obtain the endpoint" step is the dashboard address, which uses port 8265. To connect using `ray.init()`, you must change the protocol to `ray` and the port to 10001.

        For example, if the dashboard address is http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265, you must change it to ray://amv-uf64gwe14****-rayo.ads.aliyuncs.com:10001.

        Important

        If the Ray endpoint is configured incorrectly, ray.init() starts a local cluster to run the program. Check the output logs to make sure that you are correctly connected to the Ray cluster.

      2. Run the program:

        python scripts.py
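The address conversion described in the steps above is easy to get wrong by hand. The following is a minimal sketch that derives the ray.init address from the dashboard address and refuses to connect when the input is not a dashboard URL, which avoids silently starting a local cluster. The helper names `dashboard_to_client_address` and `connect` are hypothetical; the port 10001 and the `ray` protocol come from this topic.

```python
from urllib.parse import urlsplit

RAY_CLIENT_PORT = 10001  # port used by ray.init, as described above


def dashboard_to_client_address(dashboard_url):
    """Convert http://host:8265 into ray://host:10001."""
    parts = urlsplit(dashboard_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("not a dashboard URL: {!r}".format(dashboard_url))
    return "ray://{}:{}".format(parts.hostname, RAY_CLIENT_PORT)


def connect(dashboard_url):
    """Connect to the remote Ray Cluster; raises instead of falling back to local."""
    address = dashboard_to_client_address(dashboard_url)
    # Imported lazily so that the conversion helper works without Ray installed.
    import ray
    return ray.init(address=address)


print(dashboard_to_client_address("http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265"))
# ray://amv-uf64gwe14****-rayo.ads.aliyuncs.com:10001
```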