Automate RayJob execution in an ACK cluster

Last updated: 2025-03-12 02:22:53

A major challenge for enterprises managing cluster resources is that workloads are plentiful while resources are limited. Solving this requires allocating resources to key departments or individuals first, while staying flexible enough to adjust allocations at any time. This topic describes how to raise the utilization of enterprise cluster resources and automate large numbers of RayJobs from different departments through a unified task-management platform, with support for queue jumping and dynamic priority adjustment so that high-priority tasks obtain resources first.

Prerequisites (choose one)

  • ACK managed cluster Pro

    • Cluster version: v1.24 or later.

    • kubectl is installed locally and can connect to the Kubernetes cluster. For details, see Obtain the cluster KubeConfig and connect to the cluster with kubectl.

    • KubeRay component installed. For details, see Install the KubeRay-Operator component.

    • Kube Queue: v1.21.4 or later. For details, see Manage AI/ML workloads with the ack-kube-queue task queue.

    • Kube Queue is configured to support RayJob resources.

    • Default node pool: at least three ECS instances with specifications of 8 vCPUs and 32 GiB or higher.

  • ACK Lingjun cluster

    • Cluster version: v1.24 or later.

    • kubectl is installed locally and can connect to the Kubernetes cluster. For details, see Obtain the cluster KubeConfig and connect to the cluster with kubectl.

    • KubeRay component installed. For details, see Install the KubeRay-Operator component.

    • Kube Queue: v1.21.4 or later. For details, see Manage AI/ML workloads with the ack-kube-queue task queue.

    • Kube Queue is configured to support RayJob resources.

    • Default node pool: at least three ECS instances with specifications of 8 vCPUs and 32 GiB or higher.


Resource quotas

Use the ElasticQuotaTree capability of ACK clusters together with RayJob to schedule tasks automatically and to manage and allocate compute resources more flexibly. RayJob can then schedule jobs efficiently within these resource limits, so that each team makes full use of its allocated resources while avoiding waste and excessive contention.

An ElasticQuotaTree resource quota can be set per department or individual, defining the maximum amount of resources or number of machines each team may use; each node represents the minimum and maximum resource quota available to the team or department it belongs to. After a RayJob is submitted, the system automatically checks whether the quota the RayJob belongs to can satisfy its request. Only when the quota can support the required resources is the RayJob assigned compute resources and started. This both keeps resource usage efficient and ensures tasks are scheduled according to priority.

The ElasticQuotaTree defines the quota information for the cluster: the quota hierarchy, the resource amounts associated with each quota, and the namespaces bound to each quota. When a task is submitted in one of these namespaces, it is automatically counted against that namespace's quota. You can refer to the following example to configure resource quotas with an ElasticQuotaTree.

  1. To build a resource quota hierarchy that meets your organization's needs, submit an ElasticQuotaTree configuration such as the following to the cluster.


    ---
    apiVersion: v1
    kind: Namespace
    metadata:
      name: video 
    
    
    ---
    
    apiVersion: scheduling.sigs.k8s.io/v1beta1
    kind: ElasticQuotaTree
    metadata:
      name: elasticquotatree # Currently only a single ElasticQuotaTree is supported
      namespace: kube-system # Takes effect only in the kube-system namespace
    spec:
      root:
        name: root 
        min:
          cpu: 100
          memory: 50Gi
          nvidia.com/gpu: 16
        max:
          cpu: 100
          memory: 50Gi
          nvidia.com/gpu: 16
        children:
    
        - name: algorithm  
          min:
            cpu: 50
            memory: 25Gi
            nvidia.com/gpu: 10 
          max:
            cpu: 80
            memory: 50Gi
            nvidia.com/gpu: 14 
          children:
          - name: video 
            min:
              cpu: 12
              memory: 12Gi
              nvidia.com/gpu: 2 
            max:
              cpu: 14
              memory: 14Gi
              nvidia.com/gpu: 4 
            namespaces: # Bind the corresponding namespace
            - video 
     

    The ElasticQuotaTree is a tree structure in which each node's max field defines the maximum amount of resources the quota may use, while the min field specifies the node's guaranteed minimum. When a quota's minimum guarantee cannot be met, the scheduler tries to reclaim resources from quotas whose actual usage exceeds their own minimum guarantee. In particular, for tasks under quotas such as intern-text and intern-video whose guaranteed amount is set to 0, if an algorithm team member submits a task that needs to run immediately while intern tasks are executing, the resources used by those intern tasks can be preempted to run the algorithm team's task first, ensuring that high-priority tasks proceed.
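    The reclaim rule above can be sketched in a few lines of Python. This is illustrative only; the dictionary shapes and per-resource arithmetic are assumptions, not the scheduler's actual implementation.

    ```python
    def reclaimable(usage, quota_min):
        """Amount of each resource sitting above a quota's min guarantee.
        The scheduler may reclaim from this surplus when another quota
        cannot meet its own guaranteed minimum."""
        return {r: max(0, used - quota_min.get(r, 0)) for r, used in usage.items()}

    # An intern quota with min 0 currently using 8 CPUs: all 8 are reclaimable.
    print(reclaimable({"cpu": 8}, {"cpu": 0}))    # {'cpu': 8}
    # The video quota using 13 CPUs against a min of 12: only 1 CPU is reclaimable.
    print(reclaimable({"cpu": 13}, {"cpu": 12}))  # {'cpu': 1}
    ```

    A quota running at or below its min has nothing to reclaim, which is exactly why min: 0 quotas (such as the intern quotas) are always preemptible.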

  2. View the ElasticQuotaTree that is in effect in the kube-system namespace.

    kubectl -n kube-system get elasticquotatree elasticquotatree -o yaml

Task queues

The task queue (Queue) distributes RayJobs from different departments and teams into their corresponding queues. After the ElasticQuotaTree is submitted, ack-kube-queue automatically creates the corresponding Queues in the cluster for task queuing; each leaf quota maps to a separate Queue in the cluster. When a RayJob is submitted to the cluster, ack-kube-queue automatically associates it with the Queue that matches the RayJob's namespace, places the task in that queue, and decides whether to dequeue it based on the queuing policy and the quota. For details, see Manage task queues with ack-kube-queue.

As a concrete example, the video team's quota is bound to the video namespace and configures its resources through min and max; Kube Queue automatically creates the associated queue root-algorithm-video for this quota. When a RayJob (with the .spec.suspend field set to true) is later submitted in the video namespace, a corresponding QueueUnit object is created automatically and enters the root-algorithm-video queue. For a RayJob, Kube Queue adds the head pod's request to the total resources of each worker group (that is, the replica count multiplied by the per-pod request) and uses this sum as the RayJob's total requested resources. If the total request fits within the currently available quota, the RayJob is dequeued from root-algorithm-video and enters the scheduler logic.
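The request accounting described above can be sketched as follows. This is a minimal illustration; the dictionary shapes are assumptions, and memory is expressed in GiB for simplicity.

```python
def rayjob_total_request(head_request, worker_groups):
    """Total request = head pod request + sum over worker groups of
    (replicas * per-pod request)."""
    total = dict(head_request)
    for group in worker_groups:
        for resource, qty in group["request"].items():
            total[resource] = total.get(resource, 0) + group["replicas"] * qty
    return total

# Matches the sample RayJob in this topic: the head requests 4 CPU / 4 GiB,
# and one worker group has 2 replicas each requesting 4 CPU / 4 GiB.
total = rayjob_total_request(
    {"cpu": 4, "memory_gi": 4},
    [{"replicas": 2, "request": {"cpu": 4, "memory_gi": 4}}],
)
print(total)  # {'cpu': 12, 'memory_gi': 12}
```

The resulting 12 CPU / 12 GiB total is exactly the video quota's min, so the sample job can be admitted against its guaranteed resources.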


After the ElasticQuotaTree is created, Kube Queue automatically creates a corresponding Queue for each leaf node based on the configuration in the ElasticQuotaTree.

  1. For example, for the video team under the algorithm department, the queue root-algorithm-video is created automatically.

    kubectl get queue -n kube-queue root-algorithm-video-k42kq -o yaml
    
    apiVersion: scheduling.x-k8s.io/v1alpha1
    kind: Queue
    metadata:
      annotations:
        kube-queue/parent-quota-fullname: algorithm
        kube-queue/quota-fullname: root/algorithm/video
      creationTimestamp: "2025-01-09T03:32:27Z"
      generateName: root-algorithm-video-
      generation: 1
      labels:
        create-by-kubequeue: "true"
      name: root-algorithm-video-k42kq
      namespace: kube-queue
      resourceVersion: "18282630"
      uid: 5606059e-acf5-4f92-b11a-48a02ef53cdf
    spec:
      queuePolicy: Round
    status:
      queueItemDetails:
        active: []
        backoff: []
    Note

    active: the priority of tasks waiting to be scheduled and their position in the queue.

    backoff: tasks that are dormant because a recent scheduling attempt failed (for example, insufficient resource quota).
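    The active/backoff bookkeeping can be illustrated with a toy model. This is purely illustrative: ack-kube-queue's real data structures, retry timers, and queuing policies are more involved.

    ```python
    class ToyQueue:
        """A toy model of a queue with active and backoff lists."""
        def __init__(self):
            self.active = []   # units waiting to be scheduled, in order
            self.backoff = []  # units parked after a recent scheduling failure

        def enqueue(self, unit):
            self.active.append(unit)

        def try_dequeue(self, quota_ok):
            """Pop the head unit; hand it to the scheduler if the quota allows,
            otherwise move it to the backoff list."""
            if not self.active:
                return None
            unit = self.active.pop(0)
            if quota_ok(unit):
                return unit
            self.backoff.append(unit)
            return None

    q = ToyQueue()
    q.enqueue("rayjob-a")
    q.enqueue("rayjob-b")
    print(q.try_dequeue(lambda u: u == "rayjob-a"))  # rayjob-a is dequeued
    print(q.try_dequeue(lambda u: u == "rayjob-a"))  # None: rayjob-b lacks quota
    print(q.backoff)                                 # ['rayjob-b']
    ```

    In the real controller, units in backoff are periodically retried rather than parked forever.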

  2. List the queues.

    kubectl get queue -n kube-queue

    The output is as follows.

    NAME                               AGE
    root-algorithm-n54fm               51s
    root-algorithm-text-hgbvz          51s
    root-algorithm-video-k42kq         51s
    root-devops-2zccw                  51s
    root-infrastructure-devops-d6zqq   51s
    root-infrastructure-vbpkt          51s
    root-k8htb                         51s

Create a RayJob

  1. Define a RayJob resource in the video namespace. It is automatically associated with the root-algorithm-video queue and with the video resource quota that corresponds to that queue.

          - name: video 
            min:
              cpu: 12
              memory: 12Gi
              nvidia.com/gpu: 2 
            max:
              cpu: 14
              memory: 14Gi
              nvidia.com/gpu: 4 
            namespaces: # Bind the corresponding namespace
            - video 
    Note
    1. Minimum guaranteed resources: cpu: 12, memory: 12Gi, nvidia.com/gpu: 2.

    2. Maximum available resources: cpu: 14, memory: 14Gi, nvidia.com/gpu: 4.
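    The admission decision implied by these limits — a RayJob can only be dequeued if its total request fits the quota — can be checked like this. Illustrative only; the resource keys are simplified stand-ins for the Kubernetes resource names.

    ```python
    def fits_quota(request, quota_max):
        """A job can be dequeued only if every requested resource fits
        within the quota's max."""
        return all(request.get(r, 0) <= cap for r, cap in quota_max.items())

    video_max = {"cpu": 14, "memory_gi": 14, "gpu": 4}
    print(fits_quota({"cpu": 12, "memory_gi": 12}, video_max))  # True
    print(fits_quota({"cpu": 16, "memory_gi": 12}, video_max))  # False
    ```

    A second identical RayJob in this example would push the total to 24 CPU, above the 14-CPU max, which is why it stays suspended in the queue.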

  2. Create a ConfigMap that defines the Python code the RayJob will run on the RayCluster. The sample code uses the ray.remote decorator to create an actor and calls the actor's inc() and get_counter() methods.

    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rayjob-video
      namespace: video 
    data:
      sample_code.py: |
        import ray
        import os
        import requests
    
        ray.init()
    
        @ray.remote
        class Counter:
            def __init__(self):
                # Used to verify runtimeEnv
                self.name = os.getenv("counter_name")
                assert self.name == "test_counter"
                self.counter = 0
    
            def inc(self):
                self.counter += 1
    
            def get_counter(self):
                return "{} got {}".format(self.name, self.counter)
    
        counter = Counter.remote()
    
        for _ in range(2):
            ray.get(counter.inc.remote())
            print(ray.get(counter.get_counter.remote()))
    
        # Verify that the correct runtime env was used for the job.
        assert requests.__version__ == "2.26.0"
  3. Configure the RayJob using the following YAML example.

    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      labels:
        job-type: video
      generateName: rayjob-video-
      namespace: video 
    spec:
      entrypoint: python /home/ray/samples/sample_code.py
      runtimeEnvYAML: |
        pip:
          - requests==2.26.0
          - pendulum==2.1.2
        env_vars:
          counter_name: "test_counter"
      
      ttlSecondsAfterFinished: 10
      # If suspend is true, shutdownAfterJobFinishes should also be set to true
      shutdownAfterJobFinishes: true
      # Suspend specifies whether the RayJob controller should create a RayCluster instance.
      # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
      # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
      suspend: true
    
      rayClusterSpec:
        rayVersion: '2.36.1' # should match the Ray version in the image of the containers
        headGroupSpec:
          rayStartParams:
            dashboard-host: '0.0.0.0'
            num-cpus: "0"
          template:
            spec:
              containers:
                - name: ray-head
                  image: rayproject/ray:2.36.1
                  ports:
                    - containerPort: 6379
                      name: gcs-server
                    - containerPort: 8265 # Ray dashboard
                      name: dashboard
                    - containerPort: 10001
                      name: client
                  resources:
                    limits:
                      cpu: "4"
                      memory: 4G
                    requests:
                      cpu: "4"
                      memory: 4G
                  volumeMounts:
                    - mountPath: /home/ray/samples
                      name: code-sample
              volumes:
                # You set volumes at the Pod level, then mount them into containers inside that Pod
                - name: code-sample
                  configMap:
                    # Provide the name of the ConfigMap you want to mount.
                    name: rayjob-video
                    # An array of keys from the ConfigMap to create as files
                    items:
                      - key: sample_code.py
                        path: sample_code.py
        workerGroupSpecs:
          - replicas: 2 
            groupName: small-group
            rayStartParams: {}
            template:
              spec:
                containers:
                  - name: ray-worker 
                    image: rayproject/ray:2.36.1
                    lifecycle:
                      preStop:
                        exec:
                          command: [ "/bin/sh","-c","ray stop" ]
                    resources:
                      limits:
                        cpu: "4"
                        memory: 4G
                      requests:
                        cpu: "4"
                        memory: 4G
    
    

    The RayJob configuration items are described below.

    Configuration item

    Description

    namespace: video

    Sets the namespace to video.

    submissionMode: K8sJobMode

    Submits the Ray job to the RayCluster through a Kubernetes Job resource.

    ttlSecondsAfterFinished: 10

    Waits 10 seconds after the job finishes.

    shutdownAfterJobFinishes: true

    Shuts down the related resources once the wait ends, ensuring the system cleans up gracefully after the job completes and avoids resource leaks.

    Note

    If suspend: true is set, the job is suspended before execution continues.

    In that case, shutdownAfterJobFinishes should be set to true to ensure resources are shut down correctly after the wait ends.

    suspend: true

    The RayJob enters the queue only when suspend is true.

  4. Run kubectl create -f twice to create two RayJobs.

    kubectl get rayjob -n video
    NAME                 JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
    rayjob-video-g2lvn                Initializing        2025-01-10T01:36:24Z              6s
    rayjob-video-h4x2q                Suspended           2025-01-10T01:36:25Z              5s

    rayjob-video-g2lvn is in the Initializing state, and rayjob-video-h4x2q is in the Suspended state.

  5. In the annotations of rayjob-video-g2lvn, the kube-queue/job-enqueue-timestamp and kube-queue/job-dequeue-timestamp annotations show when the job entered and left the queue.

    kubectl -n video get rayjob rayjob-video-g2lvn -o yaml
    
    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      annotations:
        kube-queue/job-dequeue-timestamp: 2025-01-10 01:36:24.641181026 +0000 UTC m=+132100.596012828
        kube-queue/job-enqueue-timestamp: 2025-01-10 01:36:24.298639916 +0000 UTC m=+132100.253471714
      creationTimestamp: "2025-01-10T01:36:24Z"
      

    The annotations of rayjob-video-h4x2q show only the enqueue time (kube-queue/job-enqueue-timestamp) and no dequeue time, which means this RayJob has not been dequeued and has not entered the scheduler logic.

     kubectl -n video get rayjob rayjob-video-h4x2q -o yaml
     
    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      annotations:
        kube-queue/job-enqueue-timestamp: 2025-01-10 01:36:25.505182364 +0000 UTC m=+132101.460014182
      creationTimestamp: "2025-01-10T01:36:25Z"
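
    To measure how long a job waited in the queue, you can diff the two annotation timestamps. The value is Go's time.Time string form, whose monotonic-clock suffix (m=+…) must be stripped and whose nanoseconds must be truncated to microseconds before Python can parse it. A sketch, assuming the format shown above:

    ```python
    from datetime import datetime

    def parse_go_time(ts):
        """Parse a Go-formatted timestamp such as
        '2025-01-10 01:36:24.641181026 +0000 UTC m=+132100.596012828'.
        Drops the ' UTC m=+...' monotonic suffix and truncates nanoseconds
        to microseconds, since %f accepts at most 6 fractional digits."""
        base = ts.split(" UTC")[0]            # '... 01:36:24.641181026 +0000'
        stamp, frac_tz = base.split(".", 1)   # split off the fractional part
        frac, tz = frac_tz.split(" ", 1)
        return datetime.strptime(f"{stamp}.{frac[:6]} {tz}",
                                 "%Y-%m-%d %H:%M:%S.%f %z")

    enq = parse_go_time("2025-01-10 01:36:24.298639916 +0000 UTC m=+132100.253471714")
    deq = parse_go_time("2025-01-10 01:36:24.641181026 +0000 UTC m=+132100.596012828")
    print(f"queued for {(deq - enq).total_seconds():.3f}s")  # queued for 0.343s
    ```

    For rayjob-video-g2lvn the wait is well under a second, because the video quota was idle when the job arrived.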
  6. List the pods. At this stage, only the pods of rayjob-video-g2lvn have started scheduling.

     kubectl -n video get pod
    NAME                                                           READY   STATUS    RESTARTS   AGE
    rayjob-video-g2lvn-9gz66                                       1/1     Running   0          28s
    rayjob-video-g2lvn-raycluster-v8tfh-head-6trq5                 1/1     Running   0          49s
    rayjob-video-g2lvn-raycluster-v8tfh-small-group-worker-hkt7m   1/1     Running   0          49s
    rayjob-video-g2lvn-raycluster-v8tfh-small-group-worker-rbzjn   1/1     Running   0          49s
  7. View the attributes of the queue.

    kubectl -n kube-queue get queue root-algorithm-video-k42kq -o yaml
    
    apiVersion: scheduling.x-k8s.io/v1alpha1
    kind: Queue
    metadata:
      annotations:
        kube-queue/parent-quota-fullname: algorithm
        kube-queue/quota-fullname: root/algorithm/video
      creationTimestamp: "2025-01-09T08:34:57Z"
      generateName: root-algorithm-video-
      generation: 1
      labels:
        create-by-kubequeue: "true"
      name: root-algorithm-video-k42kq
      namespace: kube-queue
      resourceVersion: "19070012"
      uid: 5606059e-acf5-4f92-b11a-48a02ef53cdf
    spec:
      queuePolicy: Round
    status:
      queueItemDetails:
        active: []
        backoff:
        - name: rayjob-video-h4x2q-ray-qu
          namespace: video
          position: 1

Set up gang scheduling

When a problem requires multiple machines working together and you want to reduce the programming complexity this introduces, consider using Ray with gang (co-)scheduling in the following scenarios.

  • Large-scale machine learning training.

    When a dataset or model is too large for a single machine to supply enough compute, a group of containers must work together. Gang scheduling ensures these containers are scheduled at the same time, avoiding resource contention and deadlock and improving training efficiency.

  • MPI computing frameworks.

    Multi-process parallel computation under MPI requires master and worker processes to cooperate. Gang scheduling ensures these processes are scheduled together, reducing communication latency and improving computational efficiency.

  • Data processing and analytics.

    Applications that process huge volumes of data, such as log analysis and real-time stream processing, may need many tasks running simultaneously to complete a complex analysis. Gang scheduling ensures these tasks are scheduled together, improving overall throughput.

  • Custom distributed application development.

    For example, implementing player matchmaking in a game-server architecture, or coordinating data collection and processing from thousands of devices in an IoT project.

To submit a RayJob with gang scheduling requirements in an ACK Pro or ACK Lingjun environment, declare ray.io/scheduler-name=kube-scheduler in the RayJob metadata. After the job is submitted, the Ray operator automatically injects the gang-scheduling labels when it creates the pods. The related code is shown below.


apiVersion: ray.io/v1
kind: RayJob
metadata:
  generateName: rayjob-sample-
  namespace: algorithm-text
  labels:
    # Specify gang scheduling via ray.io/scheduler-name
    ray.io/scheduler-name: kube-scheduler
    # Specify the quota via quota.scheduling.alibabacloud.com/name
    quota.scheduling.alibabacloud.com/name: algorithm-video
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  shutdownAfterJobFinishes: true
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  suspend: true

  rayClusterSpec:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      - replicas: 30
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc'
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "1"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
  namespace: algorithm-text
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"

    import time
    time.sleep(30)
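
The pod-group.scheduling.sigs.k8s.io/min-available label on the generated pods appears to be one head pod plus the sum of all worker replicas. Under that assumption (inferred from the sample, not a documented formula), it can be computed as:

```python
def gang_min_available(worker_groups):
    """min-available for a RayCluster pod group, assuming it equals
    1 head pod plus the sum of all worker replicas."""
    return 1 + sum(g["replicas"] for g in worker_groups)

# The sample RayJob above has one worker group with 30 replicas,
# matching min-available: "31" on the generated pods.
print(gang_min_available([{"replicas": 30}]))  # 31
```

Gang scheduling therefore refuses to place any pod of this job until all 31 can be placed at once.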

When the pods are created, labels are set on them so that they can later be identified, grouped, filtered, and managed. The related code is shown below.


apiVersion: v1
kind: Pod
metadata:
  annotations:
    ray.io/ft-enabled: "false"
  creationTimestamp: "2024-10-10T02:38:29Z"
  generateName: rayjob-sample-hhbdr-raycluster-ljj69-small-group-worker-
  labels:
    app.kubernetes.io/created-by: kuberay-operator
    app.kubernetes.io/name: kuberay
    # Carries the coscheduling labels required by ACK
    pod-group.scheduling.sigs.k8s.io/min-available: "31"
    pod-group.scheduling.sigs.k8s.io/name: rayjob-sample-hhbdr-raycluster-ljj69
    # Carries the quota label recognized by ACK
    quota.scheduling.alibabacloud.com/name: algorithm-video
    ray.io/cluster: rayjob-sample-hhbdr-raycluster-ljj69
    ray.io/group: small-group
    ray.io/identifier: rayjob-sample-hhbdr-raycluster-ljj69-worker
    ray.io/is-ray-node: "yes"
    ray.io/node-type: worker
    scheduling.x-k8s.io/pod-group: rayjob-sample-hhbdr-raycluster-ljj69
  name: rayjob-sample-hhbdr-raycluster-ljj69-small-group-worker-xnzjh
  namespace: algorithm-text
  ownerReferences:
  - apiVersion: ray.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: rayjob-sample-hhbdr-raycluster-ljj69
    uid: 74259f20-86fd-4777-b826-73d201065931
  resourceVersion: "7482744"
  uid: 4f666efe-a25d-4620-824f-cab3f4fa0ce7
spec:
  containers:
  - args:
    - 'ulimit -n 65536; ray start  --address=rayjob-sample-hhbdr-raycluster-ljj69-head-svc.algorithm-text.svc.cluster.local:6379  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365  --num-cpus=1 '
    command:
    - /bin/bash
    - -lc
    - --
    env:
    - name: FQ_RAY_IP
      value: rayjob-sample-hhbdr-raycluster-ljj69-head-svc.algorithm-text.svc.cluster.local
    - name: RAY_IP
      value: rayjob-sample-hhbdr-raycluster-ljj69-head-svc
    - name: RAY_CLUSTER_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['ray.io/cluster']
    - name: RAY_CLOUD_INSTANCE_ID
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: RAY_NODE_TYPE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['ray.io/group']
    - name: KUBERAY_GEN_RAY_START_CMD
      value: 'ray start  --address=rayjob-sample-hhbdr-raycluster-ljj69-head-svc.algorithm-text.svc.cluster.local:6379  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365  --num-cpus=1 '
    - name: RAY_PORT
      value: "6379"
    - name: RAY_ADDRESS
      value: rayjob-sample-hhbdr-raycluster-ljj69-head-svc.algorithm-text.svc.cluster.local:6379
    - name: RAY_USAGE_STATS_KUBERAY_IN_USE
      value: "1"
    - name: REDIS_PASSWORD
    - name: RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE
      value: "1"
    image: rayproject/ray:2.9.0
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - ray stop
    livenessProbe:
      exec:
        command:
        - bash
        - -c
        - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep
          success
      failureThreshold: 120
      initialDelaySeconds: 30
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 2
    name: ray-worker
    ports:
    - containerPort: 8080
      name: metrics
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - bash
        - -c
        - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep
          success
      failureThreshold: 10
      initialDelaySeconds: 10
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 2
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: shared-mem
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-rq67v
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - args:
    - "\n\t\t\t\t\tSECONDS=0\n\t\t\t\t\twhile true; do\n\t\t\t\t\t\tif (( SECONDS
      <= 120 )); then\n\t\t\t\t\t\t\tif ray health-check --address rayjob-sample-hhbdr-raycluster-ljj69-head-svc.algorithm-text.svc.cluster.local:6379
      > /dev/null 2>&1; then\n\t\t\t\t\t\t\t\techo \"GCS is ready.\"\n\t\t\t\t\t\t\t\tbreak\n\t\t\t\t\t\t\tfi\n\t\t\t\t\t\t\techo
      \"$SECONDS seconds elapsed: Waiting for GCS to be ready.\"\n\t\t\t\t\t\telse\n\t\t\t\t\t\t\tif
      ray health-check --address rayjob-sample-hhbdr-raycluster-ljj69-head-svc.algorithm-text.svc.cluster.local:6379;
      then\n\t\t\t\t\t\t\t\techo \"GCS is ready. Any error messages above can be safely
      ignored.\"\n\t\t\t\t\t\t\t\tbreak\n\t\t\t\t\t\t\tfi\n\t\t\t\t\t\t\techo \"$SECONDS
      seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer
      to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.\"\n\t\t\t\t\t\tfi\n\t\t\t\t\t\tsleep
      5\n\t\t\t\t\tdone\n\t\t\t\t"
    command:
    - /bin/bash
    - -lc
    - --
    env:
    - name: FQ_RAY_IP
      value: rayjob-sample-hhbdr-raycluster-ljj69-head-svc.algorithm-text.svc.cluster.local
    - name: RAY_IP
      value: rayjob-sample-hhbdr-raycluster-ljj69-head-svc
    image: rayproject/ray:2.9.0
    imagePullPolicy: IfNotPresent
    name: wait-gcs-ready
    resources:
      limits:
        cpu: 200m
        memory: 256Mi
      requests:
        cpu: 200m
        memory: 256Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-rq67v
      readOnly: true
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
    name: shared-mem
  - name: kube-api-access-rq67v
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-10-10T02:38:29Z"
    message: '0/7 nodes are available: 3 node(s) had untolerated taint {virtual-kubelet.io/provider:
      alibabacloud}, 4 Insufficient cpu. failed to get current scheduling unit not
      found, preemption: 0/7 nodes are available: 1 No victims found on node cn-hongkong.10.0.118.181
      for preemptor pod rayjob-sample-hhbdr-raycluster-ljj69-small-group-worker-xnzjh,
      1 No victims found on node cn-hongkong.10.1.0.52 for preemptor pod rayjob-sample-hhbdr-raycluster-ljj69-small-group-worker-xnzjh,
      1 No victims found on node cn-hongkong.10.2.0.24 for preemptor pod rayjob-sample-hhbdr-raycluster-ljj69-small-group-worker-xnzjh,
      1 No victims found on node cn-hongkong.10.2.0.5 for preemptor pod rayjob-sample-hhbdr-raycluster-ljj69-small-group-worker-xnzjh,
      3 Preemption is not helpful for scheduling., '
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

When resources are insufficient, you can obtain detailed scheduling-failure information from events. Use --field-selector='type=Warning,reason=GangFailedScheduling' to filter out gang-related scheduling failures. In the event messages, "cycle xx" identifies the scheduling round and explains why the pod failed to be scheduled in that round. An example follows.

kubectl get events -n algorithm-text --field-selector='type=Warning,reason=GangFailedScheduling' | grep "cycle 1"
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-89mlq   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-89mlq in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8fwmr   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8fwmr in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8g5wv   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8g5wv in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8tn4w   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8tn4w in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-97gpk   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-97gpk in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-9xsgw   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-9xsgw in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-gwxhg   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-gwxhg in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-jzw6k   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-jzw6k in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-kb55s   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-kb55s in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-lbvk7   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-lbvk7 in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-ms96b   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-ms96b in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-sgr9g   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-sgr9g in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-svt6g   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-svt6g in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s       Warning   GangFailedScheduling   pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-wm5c6   rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-wm5c6 in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.