ACK implements GPU topology-aware scheduling based on the Kubernetes scheduling framework: among the GPUs available on a node, it selects the combination that delivers the best training speed. This topic describes how to activate GPU topology-aware scheduling and provides job examples both with and without it activated.
Limitations
- GPU topology-aware scheduling is supported only when all pods of a submitted job request identical resources (CPU, GPU, and so on).
- Pods are created and the job starts only after the resource requests of all pods in the job can be satisfied. Otherwise, the requests remain in a pending state, waiting for resources.
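The first limitation can be illustrated with a small sketch. The helper below is hypothetical (it is not part of ACK or any SDK); it performs the same kind of check the scheduler relies on, namely that every replica of a job requests exactly the same resources:

```python
# Hypothetical pre-submit check (not an ACK API): verify that every
# replica of a distributed job requests identical resource limits,
# as required by GPU topology-aware scheduling.
def resources_consistent(replica_specs):
    """Return True if all containers across all replicas request identical limits."""
    limits = [
        container["resources"]["limits"]
        for spec in replica_specs
        for container in spec["containers"]
    ]
    return all(l == limits[0] for l in limits)

# Two replicas with matching requests pass the check.
master = {"containers": [{"resources": {"limits": {"aliyun.com/gpu": 1, "cpu": "4"}}}]}
worker = {"containers": [{"resources": {"limits": {"aliyun.com/gpu": 1, "cpu": "4"}}}]}
print(resources_consistent([master, worker]))  # True
```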
Node configuration
Run the following command to set a node label that explicitly activates GPU topology-aware scheduling on the node:
kubectl label node <Your Node Name> ack.node.gpu.schedule=topology
Note: After GPU topology-aware scheduling is activated on a node, the node no longer supports regular GPU scheduling. Run the following command to change the label and restore regular GPU scheduling:
kubectl label node <Your Node Name> ack.node.gpu.schedule=default --overwrite
Example 1: GPU topology-aware scheduling not activated
Note: To run the following examples, Arena must be pre-installed, or the PyTorch Operator must be deployed separately.
The following is a PyTorch example with GPU topology-aware scheduling not enabled:
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/pytorch_dist_mnist:1.3-gpu-py3
              command:
                - python
              args: ["/opt/mnist/src/mnist.py", "--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_DEBUG_SUBSYS
                  value: "GRAPH"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/pytorch_dist_mnist:1.3-gpu-py3
              command:
                - python
              args: ["/opt/mnist/src/mnist.py", "--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_DEBUG_SUBSYS
                  value: "GRAPH"
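Comparing this manifest with the topology-aware one in Example 2, the differences are mechanical: two pod labels, the host namespaces, the dnsPolicy, and the GPU resource name. The helper below is a hypothetical sketch (mine, not part of ACK or Kubeflow) that applies exactly those changes to a plain pod template:

```python
import copy

# Hypothetical helper (not an ACK or Kubeflow API): convert a plain
# PyTorchJob pod template into the topology-aware form of Example 2.
def enable_gpu_topology(template, job_name, total_replicas):
    t = copy.deepcopy(template)  # leave the original template untouched
    labels = t.setdefault("metadata", {}).setdefault("labels", {})
    labels["gpu-topology"] = job_name                      # name of the training job
    labels["gpu-topology-replica"] = str(total_replicas)   # total replica count
    spec = t["spec"]
    spec["hostIPC"] = True
    spec["hostNetwork"] = True
    spec["hostPID"] = True
    spec["dnsPolicy"] = "ClusterFirstWithHostNet"
    for c in spec["containers"]:
        limits = c["resources"]["limits"]
        if "nvidia.com/gpu" in limits:
            # switch the resource name from nvidia.com/gpu to aliyun.com/gpu
            limits["aliyun.com/gpu"] = limits.pop("nvidia.com/gpu")
    return t

plain = {"spec": {"containers": [{"name": "pytorch",
                                  "resources": {"limits": {"nvidia.com/gpu": 1}}}]}}
aware = enable_gpu_topology(plain, "pytorch-dist-nccl", 2)
print(aware["metadata"]["labels"]["gpu-topology-replica"])  # prints 2
```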
Example 2: job configuration with GPU topology-aware scheduling activated
The following is a PyTorch example with GPU topology-aware scheduling enabled:
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            gpu-topology: pytorch-dist-nccl # The name of the distributed training job.
            gpu-topology-replica: "2" # The total number of replicas in the job.
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          hostIPC: true # Enable hostIPC.
          hostNetwork: true # Enable hostNetwork.
          hostPID: true # Enable hostPID.
          dnsPolicy: ClusterFirstWithHostNet # Set dnsPolicy to ClusterFirstWithHostNet.
          containers:
            - name: pytorch
              image: registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/pytorch_dist_mnist:1.3-gpu-py3
              command:
                - python
              args: ["/opt/mnist/src/mnist.py", "--backend", "nccl"]
              resources:
                limits:
                  aliyun.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            gpu-topology: pytorch-dist-nccl # The name of the distributed training job.
            gpu-topology-replica: "2" # The total number of replicas in the job.
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          hostIPC: true # Enable hostIPC.
          hostNetwork: true # Enable hostNetwork.
          hostPID: true # Enable hostPID.
          dnsPolicy: ClusterFirstWithHostNet # Set dnsPolicy to ClusterFirstWithHostNet.
          containers:
            - name: pytorch
              image: registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/pytorch_dist_mnist:1.3-gpu-py3
              command:
                - python
              args: ["/opt/mnist/src/mnist.py", "--backend", "nccl"]
              resources:
                limits:
                  aliyun.com/gpu: 1 # Change the resource request to aliyun.com/gpu.
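Note that the gpu-topology-replica label must equal the job's total replica count (Master plus Worker). A quick sanity check, again as a hypothetical sketch rather than an ACK tool:

```python
# Hypothetical validation (not an ACK API): the gpu-topology-replica
# label on every pod template must equal the sum of all replica counts.
def replica_label_valid(pytorch_job):
    specs = pytorch_job["spec"]["pytorchReplicaSpecs"]
    total = sum(s["replicas"] for s in specs.values())  # Master + Worker
    return all(
        s["template"]["metadata"]["labels"].get("gpu-topology-replica") == str(total)
        for s in specs.values()
    )

# Skeleton of the Example 2 manifest (only the fields the check needs).
job = {"spec": {"pytorchReplicaSpecs": {
    "Master": {"replicas": 1, "template": {"metadata": {"labels":
        {"gpu-topology": "pytorch-dist-nccl", "gpu-topology-replica": "2"}}}},
    "Worker": {"replicas": 1, "template": {"metadata": {"labels":
        {"gpu-topology": "pytorch-dist-nccl", "gpu-topology-replica": "2"}}}},
}}}
print(replica_label_valid(job))  # True
```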