Enterprises managing cluster resources face a common challenge: a large volume of tasks competing for limited resources. Solving it requires giving key departments or individuals priority access to resources while retaining the flexibility to adjust allocations at any time. This topic describes how to improve the utilization of enterprise cluster resources through a unified task management platform that automatically processes large numbers of RayJobs from different departments, supports queue-jumping and dynamic priority adjustment, and ensures that high-priority tasks obtain resources first.
Environment preparation (choose one)
ACK managed cluster Pro
Cluster version: v1.24 or later.
kubectl is installed locally and connected to the Kubernetes cluster. For more information, see Obtain the kubeconfig of a cluster and use kubectl to connect to the cluster.
KubeRay component is installed. For more information, see Install the KubeRay-Operator component.
Kube Queue: v1.21.4 or later. For more information, see Use the ack-kube-queue task queue to manage AI/ML workloads.
Kube Queue is configured to support RayJob resources.
Default node pool: at least 3 ECS instances of 8 vCPU / 32 GiB or higher.
ACK Lingjun cluster
Cluster version: v1.24 or later.
kubectl is installed locally and connected to the Kubernetes cluster. For more information, see Obtain the kubeconfig of a cluster and use kubectl to connect to the cluster.
KubeRay component is installed. For more information, see Install the KubeRay-Operator component.
Kube Queue: v1.21.4 or later. For more information, see Use the ack-kube-queue task queue to manage AI/ML workloads.
Kube Queue is configured to support RayJob resources.
Default node pool: at least 3 ECS instances of 8 vCPU / 32 GiB or higher.
For more information about RayJob CR configurations, see RayJob Configuration.
The head Pod and worker Pods in this example use the official Ray image rayproject/ray:2.36.1. If you cannot pull this image, see [Product Notice] ACK clusters cannot pull images from overseas sources, and replace it with your subscribed image address.
Resource quotas
The ElasticQuotaTree capability of ACK clusters works together with RayJob to schedule tasks automatically and manage compute resources more flexibly. RayJobs are scheduled within the defined resource boundaries, so every team can make full use of its allocated resources while avoiding waste and excessive contention.
An ElasticQuotaTree defines quotas per department or individual, specifying the maximum amount of resources or number of machines each team may use; each node in the tree carries the minimum and maximum resource quota available to the corresponding team or department. After a RayJob is submitted, the system automatically checks whether the quota the RayJob belongs to can satisfy its request. Only when the quota can cover the requested resources is the RayJob assigned compute resources and started. This guarantees effective resource utilization and ensures that tasks are arranged according to priority.
The ElasticQuotaTree defines the quota information of the cluster: the quota hierarchy, the resource amounts associated with each quota, and the namespaces bound to each quota. Tasks submitted in these namespaces are automatically counted against the quota of the corresponding namespace. You can refer to the following example to configure resource quotas with an ElasticQuotaTree.
To build a quota hierarchy that matches your organization's needs, submit an ElasticQuotaTree configuration such as the following to the cluster.
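The following is a minimal sketch, assuming the scheduling.sigs.k8s.io/v1beta1 API used by ACK's elastic quota feature. The video node uses the values referenced later in this topic; the root, algorithm, and intern-video amounts are illustrative assumptions, and sibling nodes such as text, devops, and infrastructure follow the same pattern.

apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system    # the tree takes effect in the kube-system namespace
spec:
  root:
    name: root
    max:                    # cluster-wide ceiling (assumed values)
      cpu: 24
      memory: 96Gi
      nvidia.com/gpu: 8
    min:
      cpu: 0
      memory: 0
    children:
    - name: algorithm       # department-level quota (assumed values)
      max:
        cpu: 24
        memory: 48Gi
        nvidia.com/gpu: 8
      min:
        cpu: 12
        memory: 12Gi
      children:
      - name: video         # team-level quota; values match the video node shown later
        max:
          cpu: 14
          memory: 14Gi
          nvidia.com/gpu: 4
        min:
          cpu: 12
          memory: 12Gi
          nvidia.com/gpu: 2
        namespaces:         # jobs submitted in these namespaces count against this quota
        - video
      - name: intern-video  # min of 0: intern jobs can be preempted when guarantees elsewhere are unmet
        max:
          cpu: 8
          memory: 8Gi
        min:
          cpu: 0
          memory: 0
        namespaces:
        - intern-video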
An ElasticQuotaTree is a tree structure in which each node defines its maximum resource usage through the max field and its guaranteed minimum through the min field. When a quota's guaranteed minimum cannot be met, the scheduler attempts to reclaim resources from quotas that are currently using more than their guaranteed minimum. In particular, the intern-text and intern-video quotas have a guaranteed minimum of 0: if an algorithm team member submits a task that must run immediately while intern tasks are executing, the resources used by those intern tasks can be preempted so that the algorithm team's task runs first, ensuring that high-priority tasks proceed smoothly.

View the ElasticQuotaTree settings in effect in the kube-system namespace:

kubectl -n kube-system get elasticquotatree elasticquotatree -o yaml
Task queues
Queues can distribute RayJobs from different departments and teams into their respective queues. After an ElasticQuotaTree is submitted, ack-kube-queue automatically creates the corresponding Queues in the cluster for task queuing: each leaf quota corresponds to an independent Queue. When a RayJob is submitted to the cluster, ack-kube-queue associates it with the Queue that matches the RayJob's namespace, places it in that queue, and decides whether to dequeue it based on the queuing policy and the quota. For more information, see Use ack-kube-queue to manage task queues.
As a concrete example, the video team is associated with the video namespace and its quota is configured through min and max; Kube Queue automatically creates the associated queue root-algorithm-video for this quota. When a RayJob (with .spec.suspend set to true) is subsequently submitted in the video namespace, a corresponding QueueUnit resource object is created and enters the root-algorithm-video queue. For a RayJob, KubeQueue computes the job's total resource request after submission by adding the head Pod's request to the total resources of each WorkerGroup (that is, the number of replicas multiplied by the per-Pod request). If this total request fits within the currently available quota, the RayJob is dequeued from root-algorithm-video and handed to the scheduler.
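For the sample RayJob created later in this topic, that total works out to: head Pod (4 vCPU, 4 G) + 2 worker replicas × (4 vCPU, 4 G) = 12 vCPU and 12 G. This fits within the video quota's guaranteed minimum of cpu: 12, memory: 12Gi, so one such job can be dequeued immediately; a second identical job would push usage past the quota's max of cpu: 14 and must wait in the queue.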
After the ElasticQuotaTree is created, kube-queue automatically creates a Queue for each of its leaf nodes according to the tree's configuration.
For example, for the algorithm department's video team, the queue root-algorithm-video is created automatically.
kubectl get queue -n kube-queue root-algorithm-video-k42kq -o yaml

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
  annotations:
    kube-queue/parent-quota-fullname: algorithm
    kube-queue/quota-fullname: root/algorithm/video
  creationTimestamp: "2025-01-09T03:32:27Z"
  generateName: root-algorithm-video-
  generation: 1
  labels:
    create-by-kubequeue: "true"
  name: root-algorithm-video-k42kq
  namespace: kube-queue
  resourceVersion: "18282630"
  uid: 5606059e-acf5-4f92-b11a-48a02ef53cdf
spec:
  queuePolicy: Round
status:
  queueItemDetails:
    active: []
    backoff: []
active: tasks waiting to be scheduled, with their priorities and positions in the queue.
backoff: tasks that are temporarily dormant because of recent scheduling failures (for example, insufficient quota).
View the queues.
kubectl get queue -n kube-queue
The output is as follows.
NAME                               AGE
root-algorithm-n54fm               51s
root-algorithm-text-hgbvz          51s
root-algorithm-video-k42kq         51s
root-devops-2zccw                  51s
root-infrastructure-devops-d6zqq   51s
root-infrastructure-vbpkt          51s
root-k8htb                         51s
Create a RayJob
As a simulation, define a RayJob resource in the video namespace. It is automatically associated with the root-algorithm-video queue and with the video resource quota that backs that queue:

- name: video
  min:
    cpu: 12
    memory: 12Gi
    nvidia.com/gpu: 2
  max:
    cpu: 14
    memory: 14Gi
    nvidia.com/gpu: 4
  namespaces: # the namespace bound to this quota
  - video
Guaranteed minimum resources: cpu: 12, memory: 12Gi, nvidia.com/gpu: 2.
Maximum available resources: cpu: 14, memory: 14Gi, nvidia.com/gpu: 4.
Create a ConfigMap that defines the Python code the RayJob will execute on the RayCluster. The sample code uses the ray.remote decorator to create an actor and calls the actor's inc() and get_counter() methods.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rayjob-video
  namespace: video
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(2):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
Configure the RayJob. A sample YAML is shown below.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  labels:
    job-type: video
  generateName: rayjob-video-
  namespace: video
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  ttlSecondsAfterFinished: 10
  # If suspend is true, shutdownAfterJobFinishes should also be set to true.
  shutdownAfterJobFinishes: true
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  suspend: true
  rayClusterSpec:
    rayVersion: '2.36.1' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: "0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.36.1
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265 # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            resources:
              limits:
                cpu: "4"
                memory: 4G
              requests:
                cpu: "4"
                memory: 4G
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
          volumes:
          # You set volumes at the Pod level, then mount them into containers inside that Pod
          - name: code-sample
            configMap:
              # Provide the name of the ConfigMap you want to mount.
              name: rayjob-video
              # An array of keys from the ConfigMap to create as files
              items:
              - key: sample_code.py
                path: sample_code.py
    workerGroupSpecs:
    - replicas: 2
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.36.1
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:
                cpu: "4"
                memory: 4G
              requests:
                cpu: "4"
                memory: 4G
Run kubectl create -f on this file twice to generate two RayJobs, then list them.

kubectl get rayjob -n video

NAME                 JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
rayjob-video-g2lvn                Initializing        2025-01-10T01:36:24Z              6s
rayjob-video-h4x2q                Suspended           2025-01-10T01:36:25Z              5s
rayjob-video-g2lvn is in the Initializing state, while rayjob-video-h4x2q is in the Suspended state. In the annotations of rayjob-video-g2lvn, kube-queue/job-enqueue-timestamp and kube-queue/job-dequeue-timestamp show when the job entered and left the queue.

kubectl -n video get rayjob rayjob-video-g2lvn -o yaml

apiVersion: ray.io/v1
kind: RayJob
metadata:
  annotations:
    kube-queue/job-dequeue-timestamp: 2025-01-10 01:36:24.641181026 +0000 UTC m=+132100.596012828
    kube-queue/job-enqueue-timestamp: 2025-01-10 01:36:24.298639916 +0000 UTC m=+132100.253471714
  creationTimestamp: "2025-01-10T01:36:24Z"
The annotations of rayjob-video-h4x2q contain only the enqueue time, kube-queue/job-enqueue-timestamp, and no dequeue time. This means the RayJob has not yet left the queue and has not entered the scheduler.
kubectl -n video get rayjob rayjob-video-h4x2q -o yaml

apiVersion: ray.io/v1
kind: RayJob
metadata:
  annotations:
    kube-queue/job-enqueue-timestamp: 2025-01-10 01:36:25.505182364 +0000 UTC m=+132101.460014182
  creationTimestamp: "2025-01-10T01:36:25Z"
Check the Pods. At this stage, only the Pods of rayjob-video-g2lvn have started scheduling.
kubectl -n video get pod

NAME                                                           READY   STATUS    RESTARTS   AGE
rayjob-video-g2lvn-9gz66                                       1/1     Running   0          28s
rayjob-video-g2lvn-raycluster-v8tfh-head-6trq5                 1/1     Running   0          49s
rayjob-video-g2lvn-raycluster-v8tfh-small-group-worker-hkt7m   1/1     Running   0          49s
rayjob-video-g2lvn-raycluster-v8tfh-small-group-worker-rbzjn   1/1     Running   0          49s
View the queue's attributes. rayjob-video-h4x2q now sits in the backoff list.

kubectl -n kube-queue get queue root-algorithm-video-k42kq -o yaml

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
  annotations:
    kube-queue/parent-quota-fullname: algorithm
    kube-queue/quota-fullname: root/algorithm/video
  creationTimestamp: "2025-01-09T08:34:57Z"
  generateName: root-algorithm-video-
  generation: 1
  labels:
    create-by-kubequeue: "true"
  name: root-algorithm-video-k42kq
  namespace: kube-queue
  resourceVersion: "19070012"
  uid: 5606059e-acf5-4f92-b11a-48a02ef53cdf
spec:
  queuePolicy: Round
status:
  queueItemDetails:
    active: []
    backoff:
    - name: rayjob-video-h4x2q-ray-qu
      namespace: video
      position: 1
Configure gang scheduling
When a workload must harness multiple machines at once and you want to contain the programming complexity this brings, consider using Ray with gang scheduling in the following scenarios.
Large-scale machine learning training.
When training on very large datasets or complex models, a single machine may not provide enough compute, and a group of containers must work together. Gang scheduling ensures these containers are scheduled simultaneously, avoiding resource contention and deadlocks and improving training efficiency.
MPI computing framework.
Multi-process parallel computation under MPI requires the master and worker processes to cooperate. Gang scheduling ensures these processes are scheduled simultaneously, reducing communication latency and improving computational efficiency.
Data processing and analytics.
Applications that process massive amounts of data, such as log analysis and real-time stream processing, may need multiple tasks running at the same time to complete complex analyses. Gang scheduling ensures these tasks are scheduled together, improving overall processing speed.
Custom distributed application development.
For example, implementing a player-matchmaking service in a game server architecture, or coordinating data collection and processing from many thousands of devices in an IoT project.
To submit a RayJob with a gang scheduling requirement in an ACK Pro or ACK Lingjun environment, you only need to declare ray.io/scheduler-name=kube-scheduler in the RayJob's metadata. After the job is submitted, the Ray operator automatically injects the gang-related markers when it creates the Pods.
When the Pods are created, labels are set on them for later identification, grouping, filtering, and management; a sketch of the resulting RayJob declaration is shown below.
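A minimal sketch, reusing the RayJob sample from earlier in this topic; only the ray.io/scheduler-name label is added, everything else stays unchanged.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  generateName: rayjob-video-
  namespace: video
  labels:
    job-type: video
    ray.io/scheduler-name: kube-scheduler  # the Ray operator injects gang markers into the Pods it creates
spec:
  # ... identical to the RayJob sample above ...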
When resources are insufficient, you can obtain details of scheduling failures from events. Use --field-selector='type=Warning,reason=GangFailedScheduling' to filter gang-related scheduling failures. In the output, "cycle xx" identifies the scheduling round and explains why the Pods in that round could not be scheduled. An example follows.
kubectl get events -n algorithm-text --field-selector='type=Warning,reason=GangFailedScheduling' | grep "cycle 1"
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-89mlq rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-89mlq in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8fwmr rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8fwmr in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8g5wv rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8g5wv in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8tn4w rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8tn4w in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-97gpk rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-97gpk in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-9xsgw rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-9xsgw in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-gwxhg rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-gwxhg in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-jzw6k rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-jzw6k in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-kb55s rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-kb55s in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-lbvk7 rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-lbvk7 in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-ms96b rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-ms96b in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-sgr9g rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-sgr9g in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-svt6g rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-svt6g in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-wm5c6 rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-wm5c6 in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.