
Elastic scaling with Ray autoscaler and ACK autoscaler


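The Ray distributed computing framework provides the Ray autoscaler component, which dynamically adjusts the computing resources of a Ray Cluster based on its workload. ACK clusters likewise provide the ACK autoscaler component, which automatically adjusts the number of nodes to match the actual needs of the workloads in the cluster. Combining the two makes fuller use of cloud elasticity and improves the efficiency and cost-effectiveness of computing resource provisioning.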
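
As a rough illustration of how the two autoscalers divide the work, the sketch below estimates how many worker Pods a batch of tasks needs and how many extra nodes those Pods might require. It is a simplified model, not the actual algorithm of either autoscaler: the function names and the `pods_per_node` parameter are hypothetical, the example numbers are illustrative, and real scaling is also bounded by settings such as the maximum number of worker replicas.

```python
import math


def workers_needed(num_tasks, cpus_per_task, cpus_per_worker, head_cpus=0):
    """Estimate the worker Pods needed to run all tasks at the same time."""
    # A task cannot be split across Pods, so count whole tasks per worker.
    tasks_per_worker = cpus_per_worker // cpus_per_task
    if tasks_per_worker == 0:
        raise ValueError("a worker Pod is too small to run a single task")
    # Tasks the head Pod can absorb; 0 when the head advertises 0 CPUs.
    tasks_on_head = head_cpus // cpus_per_task
    remaining = max(num_tasks - tasks_on_head, 0)
    return math.ceil(remaining / tasks_per_worker)


def nodes_needed(pending_pods, pods_per_node):
    """Estimate the nodes to add so that every Pending Pod can be scheduled.

    pods_per_node is a hypothetical capacity figure; it depends on the
    instance type of the node pool.
    """
    return math.ceil(pending_pods / pods_per_node)


# Illustrative numbers: 15 one-CPU tasks, one-CPU workers, head with 0 CPUs.
print(workers_needed(num_tasks=15, cpus_per_task=1, cpus_per_worker=1))  # 15
# Made-up capacity: 11 Pending Pods, 4 Pods fit on each new node.
print(nodes_needed(pending_pods=11, pods_per_node=4))  # 3
```

The first function stands in for the Ray level (task demand becomes worker Pods) and the second for the ACK level (unschedulable Pods become nodes).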

Prerequisites

Implement elastic scaling with Ray autoscaler and ACK autoscaler

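  1. Run the following commands to install the Ray Cluster application in the ACK cluster with Helm. The first command removes any existing release with the same name.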

    helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS}
    helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS} 
  2. Run the following command to check the status of the resources in the Ray Cluster.

    kubectl get pod -n ${RAY_CLUSTER_NS}
    # Expected output:
    NAME                                           READY   STATUS     RESTARTS   AGE
    myfirst-ray-cluster-head-kvvdf                 2/2     Running    0          22m
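  3. Run the following commands to log in to the head node and view the cluster status.

    Replace the Pod name with the actual name of the head Pod in your Ray Cluster.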

    kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-kvvdf -- bash
    (base) ray@myfirst-ray-cluster-head-kvvdf:~$ ray status

    Expected output:

    ======== Autoscaler status: 2024-01-25 00:00:19.879963 ========
    Node status
    ---------------------------------------------------------------
    Healthy:
     1 head-group
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)
    
    Resources
    ---------------------------------------------------------------
    Usage:
     0B/1.86GiB memory
     0B/452.00MiB object_store_memory
    
    Demands:
     (no resource demands)
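  4. Submit and run the following job in the Ray Cluster.

    The code below starts 15 tasks, each of which requests 1 CPU core. By default, the head Pod of the created Ray Cluster has --num-cpus set to 0, so no tasks can be scheduled on it, and each worker Pod has 1 CPU core and 1 GB of memory. The Ray Cluster therefore needs to scale out to 15 worker Pods. Because the ACK cluster does not have enough node resources, the Pending Pods automatically trigger ACK node autoscaling.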

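    import socket
    import time
    
    import ray
    
    # Initialize the Ray runtime.
    ray.init()
    
    
    @ray.remote(num_cpus=1)
    def get_task_hostname():
        # Keep the task busy for 120 seconds.
        time.sleep(120)
        # Return the IP address of the host that ran the task.
        return socket.gethostbyname(socket.gethostname())
    
    
    # Launch 15 tasks, each requesting 1 CPU core.
    object_refs = [get_task_hostname.remote() for _ in range(15)]
    
    # Block until all 15 tasks have finished (the default waits for only one).
    ray.wait(object_refs, num_returns=len(object_refs))
    
    for ref in object_refs:
        print(ray.get(ref))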
  5. Run the following command to check the status of the Pods in the Ray Cluster.

    kubectl get pod -n ${RAY_CLUSTER_NS} -w
    # Expected output:
    NAME                                           READY   STATUS    RESTARTS   AGE
    myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
    myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-gstcc   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-hfshs   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-nrfh8   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-pjbdw   0/1     Pending   0          29s
    myfirst-ray-cluster-worker-workergroup-qxq7v   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-sm8mt   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-wr87d   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-xc4kn   1/1     Running   0          30s
    ...
  6. Run the following command to check the node status.

    kubectl get node -w
    # Expected output:
    cn-hangzhou.172.16.0.204   Ready      <none>   44h   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   1s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   11s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.16    NotReady   <none>   10s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.16    NotReady   <none>   14s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   31s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   60s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    Ready      <none>   61s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.16    Ready      <none>   64s   v1.24.6-aliyun.1
    ...
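
To summarize a snapshot like the Pod listing in step 5, a small helper can tally Pods by status. This is a minimal sketch that parses the text printed by `kubectl get pod`; `count_pod_status` and its `name_contains` parameter are hypothetical names, and the sample rows are copied from the expected output above.

```python
from collections import Counter


def count_pod_status(kubectl_output, name_contains=""):
    """Count Pods per STATUS value in the text output of `kubectl get pod`."""
    counts = Counter()
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines, comment lines, the header row, and "..." elisions.
        if len(fields) < 3 or fields[0] in ("NAME", "#"):
            continue
        # Optionally keep only Pods whose name contains the given substring.
        if name_contains not in fields[0]:
            continue
        counts[fields[2]] += 1
    return counts


sample = """\
NAME                                           READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
"""

print(dict(count_pod_status(sample)))  # {'Running': 2, 'Pending': 1}
print(dict(count_pod_status(sample, name_contains="-worker-")))
```

Run it on a single snapshot (without `-w`), because watch output prints a Pod again each time its status changes and would be counted more than once.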

Related documentation
