The Ray distributed computing framework provides the Ray autoscaler component, which dynamically adjusts the compute resources of a Ray Cluster according to its workload. ACK clusters likewise provide the ACK autoscaler component, which automatically adjusts the number of nodes according to the actual needs of the workloads in the cluster. Combining the elasticity of the Ray autoscaler with that of the ACK autoscaler makes fuller use of the elasticity of the cloud and improves both the efficiency and the cost-effectiveness of compute resource provisioning.
Prerequisites
Use the Ray autoscaler with the ACK cluster-autoscaler to implement elastic scaling
Run the following commands to install the Ray Cluster application in the ACK cluster by using Helm.
# Remove any existing release with the same name, then install the Ray Cluster chart.
helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS}
helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS}
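The commands in this topic assume that the RAY_CLUSTER_NAME and RAY_CLUSTER_NS environment variables are already set. A minimal sketch with illustrative values (myfirst-ray-cluster matches the pod names shown below; the namespace is a placeholder to replace with your own):

export RAY_CLUSTER_NAME=myfirst-ray-cluster
export RAY_CLUSTER_NS=ray-cluster    # hypothetical namespace; replace with your own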
Run the following command to check the status of the resources in the Ray Cluster.
kubectl get pod -n ${RAY_CLUSTER_NS}

# Expected output:
NAME                             READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf   2/2     Running   0          22m
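The head pod reports 2/2 because it runs two containers, presumably the Ray head itself plus an autoscaler sidecar when in-tree autoscaling is enabled. This is an assumption about how the chart builds the pod; you can verify it by listing the container names:

# List the containers in the head pod (names depend on the chart).
kubectl -n ${RAY_CLUSTER_NS} get pod myfirst-ray-cluster-head-kvvdf \
  -o jsonpath='{.spec.containers[*].name}'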
Run the following commands to log in to the head node and view the cluster status.
Replace the pod name with the actual pod name of your Ray Cluster.
kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-kvvdf -- bash
(base) ray@myfirst-ray-cluster-head-kvvdf:~$ ray status
Expected output:
======== Autoscaler status: 2024-01-25 00:00:19.879963 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0B/1.86GiB memory
 0B/452.00MiB object_store_memory

Demands:
 (no resource demands)
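You can also read the same aggregate resource information programmatically with Ray's ray.cluster_resources() API instead of logging in interactively. A minimal sketch run through kubectl (if kubectl reports multiple containers, add -c with the Ray head container's name):

kubectl -n ${RAY_CLUSTER_NS} exec myfirst-ray-cluster-head-kvvdf -- \
  python -c 'import ray; ray.init(address="auto"); print(ray.cluster_resources())'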
Submit the following job to the Ray Cluster.
The following code starts 15 tasks, each of which requires 1 CPU core to be scheduled. By default, the head pod of the created Ray Cluster has --num-cpus set to 0, which means no tasks are scheduled onto it, and each worker pod defaults to 1 CPU core and 1 GB of memory. Therefore, a total of 15 worker pods must be scaled out. Because the nodes in the ACK cluster do not have sufficient resources, the pending pods automatically trigger ACK node auto scaling.

import socket
import time

import ray

ray.init()

@ray.remote(num_cpus=1)
def get_task_hostname():
    # Simulate two minutes of work, then report the IP address of
    # the host this task ran on.
    time.sleep(120)
    host = socket.gethostbyname(socket.gethostname())
    return host

object_refs = []
for _ in range(15):
    object_refs.append(get_task_hostname.remote())

# Block until at least one task completes; ray.get() below then
# blocks until each remaining result is ready.
ray.wait(object_refs)

for t in object_refs:
    print(ray.get(t))
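One way to run this job is to copy the script into the head pod and execute it there. A minimal sketch, assuming the script is saved locally as test.py (the file name and path are illustrative):

# Copy the script into the head pod, then run it with the pod's Python.
kubectl -n ${RAY_CLUSTER_NS} cp test.py myfirst-ray-cluster-head-kvvdf:/tmp/test.py
kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-kvvdf -- python /tmp/test.py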
Run the following command to watch the status of the pods in the Ray Cluster.
kubectl get pod -n ${RAY_CLUSTER_NS} -w

# Expected output:
NAME                                           READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-gstcc   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-hfshs   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-nrfh8   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-pjbdw   0/1     Pending   0          29s
myfirst-ray-cluster-worker-workergroup-qxq7v   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-sm8mt   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-wr87d   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-xc4kn   1/1     Running   0          30s
...
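To confirm that the pending worker pods are what triggered the node scale-out, you can filter for the scale-up events that cluster-autoscaler records on them (TriggeredScaleUp is the event reason used by the upstream cluster-autoscaler; the exact wording may vary by version):

kubectl -n ${RAY_CLUSTER_NS} get events --field-selector reason=TriggeredScaleUp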
Run the following command to watch the node status.
kubectl get node -w

# Expected output:
NAME                       STATUS     ROLES    AGE   VERSION
cn-hangzhou.172.16.0.204   Ready      <none>   44h   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   1s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   11s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    NotReady   <none>   10s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    NotReady   <none>   14s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   31s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   60s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    Ready      <none>   61s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    Ready      <none>   64s   v1.24.6-aliyun.1
...
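After all 15 tasks complete (each one sleeps for 120 seconds), the Ray autoscaler removes the idle worker pods once their idle timeout expires, and the ACK autoscaler then reclaims the emptied nodes. You can verify this by running ray status on the head pod again; the Demands section should return to (no resource demands):

kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-kvvdf -- ray status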
References
You can access Ray's visual web UI, the Ray Dashboard, from your local machine. For more information, see Access the Ray Dashboard locally.
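A minimal sketch of local access through port forwarding, assuming the head pod name from this walkthrough (the Ray Dashboard listens on port 8265 by default):

kubectl -n ${RAY_CLUSTER_NS} port-forward myfirst-ray-cluster-head-kvvdf 8265:8265
# Then open http://127.0.0.1:8265 in a browser.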