How to install and use the eGPU Kubernetes components - Platform for AI (PAI) - Alibaba Cloud Help Center

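eGPU is a container virtualization solution that can be used directly with cloud-native resource platforms and provides GPU sharing for large-scale clusters. To share GPU resources through eGPU in a Kubernetes cluster, follow the steps below to install the eGPU device plugin, which enables scheduling of virtualized GPUs.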

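Prerequisites

- Docker 19.03.5 or later is recommended. This document is based on 19.03.5.
- Kubernetes 1.21.2 or later is recommended. This document is based on 1.21.2.
- helm 3.0.0 or later is installed on the Kubernetes master node. This document is based on 3.3.4.
- Nvidia-docker2 is installed.
- eGPU is installed.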

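If the cluster is managed by ACK, the GPU virtualization scheduling component that ships with ACK is installed by default. You can skip the installation steps in this document and go straight to the verification steps. Using eGPU through the ACK scheduling component currently does not support eGPU's compute partitioning feature.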

Installation procedure

Enable the NVIDIA runtime.

Edit /etc/docker/daemon.json and add the following content.

```
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
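Before restarting Docker, it can help to confirm that the edit actually landed in the file. The following is a minimal sketch, not part of the official procedure: the helper name has_nvidia_default is hypothetical, and it uses a plain text match rather than a JSON parser, so it only catches an obviously missing or different default-runtime entry.

```shell
# has_nvidia_default FILE: succeeds if FILE sets "default-runtime" to "nvidia".
# Hypothetical helper; a plain text match, not a JSON validity check.
has_nvidia_default() {
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Report the state of the Docker daemon configuration on this host.
if has_nvidia_default /etc/docker/daemon.json 2>/dev/null; then
  echo "default-runtime is set to nvidia"
else
  echo "default-runtime is not set to nvidia in /etc/docker/daemon.json"
fi
```

Once Docker has been restarted, the Default Runtime line in the output of docker info should also read nvidia.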

Restart Docker.
```
sudo systemctl restart docker
```

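Label the nodes.
1. Apply the egpu label.
  This label lets the device plugin identify the nodes that are eGPU-capable. Run the following command on the master node, where target_node is the node NAME reported by kubectl get no, for example cn-shanghai.gpu1.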
```
sudo kubectl label node <target_node> egpu=true
```
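2. Apply the node.gpu.placement label.
  This label sets the scheduling mode of a node. Each node uses one of two modes:
  - binpack: use up the resources of one GPU before moving on to the next.
  - spread: schedule the next task onto the GPU with the most remaining resources.
  Note
  If the label is not set, the node defaults to binpack mode.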
```
# Run only one of the following commands, depending on the mode you want.
sudo kubectl label node <target_node> node.gpu.placement=spread
sudo kubectl label node <target_node> node.gpu.placement=binpack
```
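The difference between the two modes can be illustrated with a small simulation. This is only a sketch of the behavior described above, not the scheduler's actual code: the function name pick_gpu is hypothetical, it considers GPU memory only, and it assumes that binpack means "the lowest-numbered GPU that still fits the request".

```shell
# pick_gpu MODE REQUEST FREE0 FREE1 ...
# Prints the index of the GPU chosen for a request of REQUEST GiB,
# or -1 if no GPU has enough free memory. Hypothetical illustration only.
pick_gpu() {
  local mode=$1 request=$2
  shift 2
  local best=-1 best_free=-1 i=0 free
  for free in "$@"; do
    if [ "$free" -ge "$request" ]; then
      if [ "$mode" = "binpack" ]; then
        # binpack: take the first GPU that still fits the request
        best=$i
        break
      elif [ "$free" -gt "$best_free" ]; then
        # spread: remember the GPU with the most free memory so far
        best=$i
        best_free=$free
      fi
    fi
    i=$((i + 1))
  done
  echo "$best"
}

# Two 39 GiB GPUs; GPU0 already hosts one 16 GiB task, so 23 GiB are free.
pick_gpu binpack 16 23 39   # prints 0
pick_gpu spread 16 23 39    # prints 1
```

With binpack the second task joins the first one on GPU0, whereas spread sends it to the still-empty GPU1.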

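Install with helm.

The master node needs the installation files. The following assumes that all of them are in the deployer folder.

Run the following helm command to install.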

```
sudo helm install egpu -n kube-system deployer/
```

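Verify that the installation succeeded and the components are running.

The following output comes from a healthy cluster that consists of a single master node and one GPU node.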

```
$ sudo kubectl get pod -n kube-system
NAME                                      READY   STATUS      RESTARTS   AGE
coredns-7d75679df-gw7w8                   1/1     Running     0          11d
coredns-7d75679df-pl4p9                   1/1     Running     0          11d
device-plugin-evict-ds-x5n6p              1/1     Running     0          84m
etcd-harp-03                              1/1     Running     0          11d
gpushare-device-plugin-ds-6hxxs           1/1     Running     0          84m
gpushare-installer-scckq                  0/2     Completed   0          84m
gpushare-schd-extender-56688dd8d4-kgc6z   1/1     Running     0          84m
kube-apiserver-harp-03                    1/1     Running     0          11d
kube-controller-manager-harp-03           1/1     Running     0          11d
kube-flannel-ds-lkh9g                     1/1     Running     1          116m
kube-flannel-ds-nj6mk                     1/1     Running     0          11d
kube-proxy-f9x7x                          1/1     Running     0          116m
kube-proxy-mdp59                          1/1     Running     0          11d
kube-scheduler-harp-03                    1/1     Running     0          83m
```
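On a larger cluster the listing is long, so a small filter can make problems easier to spot. The following is a sketch under the assumption that every healthy pod reports either Running or Completed, as in the output above. The function name not_ready_pods and the pod example-broken-pod are hypothetical.

```shell
# not_ready_pods: reads `kubectl get pod` output on stdin and prints the
# names of pods whose STATUS is neither Running nor Completed.
not_ready_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Sample input: two rows from a healthy cluster plus one hypothetical failing pod.
not_ready_pods <<'EOF'
NAME                              READY   STATUS             RESTARTS   AGE
gpushare-device-plugin-ds-6hxxs   1/1     Running            0          84m
gpushare-installer-scckq          0/2     Completed          0          84m
example-broken-pod                0/1     CrashLoopBackOff   3          5m
EOF
# prints example-broken-pod
```

Against a real cluster, the same filter would be used as sudo kubectl get pod -n kube-system | not_ready_pods; empty output means no pod was flagged.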

egpu inspector.

egpu inspector is a tool for querying the overall GPU resource usage in the cluster. It is installed on the master node.

```
$ sudo kubectl inspect egpu
NAME              IPADDRESS       GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU4(Allocated/Total)  GPU5(Allocated/Total)  GPU6(Allocated/Total)  GPU7(Allocated/Total)  GPU Memory(GiB)
cn-shanghai.gpu1  100.82.131.102  0/39                   0/39                   0/39                   0/39                   0/39                   0/39                   0/39                   0/39                   0/312
NAME              IPADDRESS       GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU4(Allocated/Total)  GPU5(Allocated/Total)  GPU6(Allocated/Total)  GPU7(Allocated/Total)  GPU Compute
cn-shanghai.gpu1  100.82.131.102  0/100                  0/100                  0/100                  0/100                  0/100                  0/100                  0/100                  0/100                  0/800
---------------------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/312 (0%)
Allocated/Total GPU Compute In Cluster:
0/800 (0%)
```

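This output shows a cluster with 1 GPU node that has 8 GPUs. Each GPU has 39 GiB of memory and 100% of its compute available, and none of it is currently in use.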
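The last column of each row is the sum of the per-GPU columns, which can be checked mechanically. The sketch below assumes the row layout shown in this document (node name, IP address, one Allocated/Total column per GPU, then the node total). The function name sum_gpu_columns is hypothetical.

```shell
# sum_gpu_columns: reads one data row of `kubectl inspect egpu` on stdin and
# prints the sum of the per-GPU Allocated/Total columns as "allocated/total".
sum_gpu_columns() {
  awk '{
    alloc = 0; total = 0
    for (i = 3; i < NF; i++) {   # fields 3..NF-1 are the per-GPU columns
      split($i, pair, "/")
      alloc += pair[1]
      total += pair[2]
    }
    print alloc "/" total
  }'
}

echo "cn-shanghai.gpu1 100.82.131.102 0/39 0/39 0/39 0/39 0/39 0/39 0/39 0/39 0/312" | sum_gpu_columns
# prints 0/312
```

Here 8 GPUs with 39 GiB each add up to the 312 GiB reported for the node, and the same sum over the compute row gives 800.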

Note

This feature is currently not supported in clusters managed by ACK.

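Verify that the device plugin is installed successfully.

Submit an eGPU job to verify that the device plugin was installed correctly. The following assumes an environment with two server nodes: one master node and one GPU node.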

```
$ sudo kubectl get no
NAME               STATUS   ROLES                  AGE    VERSION
harp-03            Ready    control-plane,master   11d    v1.21.2
cn-shanghai.gpu1   Ready    <none>                 127m   v1.21.2
```

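Edit the job's YAML file and add the following resource limits and environment variables.

Note

Compute partitioning is currently not supported in clusters managed by ACK, so the compute-related fields do not need to be filled in there.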

```
resources:
  limits:
    aliyun.com/gpu-mem: "16"
    aliyun.com/gpu-compute: "40"
env:
- name: AMP_VGPU_ENABLE
  value: "1"
- name: AMP_USE_HOST_DAEMON
  value: "1"
- name: NVIDIA_DRIVER_CAPABILITIES
  value: all
```

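The following is a complete YAML sample. Save it as egpu-test.yaml. It attempts to create 8 pods, each of which uses 16 GiB of GPU memory and 40% of the compute of a single GPU.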

```
apiVersion: batch/v1
kind: Job
metadata:
  name: amp-egpu-test
  labels:
    app: amp-egpu-test
spec:
  parallelism: 8
  template: # define the pods specifications
    metadata:
      labels:
        app: test-egpu-001
    spec:
      containers:
      - name: test-egpu-001
        image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            aliyun.com/gpu-mem: "16"
            aliyun.com/gpu-compute: "40"
        env:
        - name: AMP_VGPU_ENABLE
          value: "1"
        - name: AMP_USE_HOST_DAEMON
          value: "1"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
      restartPolicy: Never
```

Submit the YAML file.

```
sudo kubectl apply -f egpu-test.yaml
```

Wait a moment, then verify the result.

```
$ sudo kubectl get po
NAME                  READY   STATUS    RESTARTS   AGE
amp-egpu-test-6fntr   1/1     Running   0          66s
amp-egpu-test-6knks   1/1     Running   0          66s
amp-egpu-test-drwgq   1/1     Running   0          66s
amp-egpu-test-fsv48   1/1     Running   0          66s
amp-egpu-test-ldtqw   1/1     Running   0          66s
amp-egpu-test-m6zgz   1/1     Running   0          66s
amp-egpu-test-qhccq   1/1     Running   0          66s
amp-egpu-test-rfb8f   1/1     Running   0          66s
```

```
$ sudo kubectl inspect egpu
NAME              IPADDRESS       GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU4(Allocated/Total)  GPU5(Allocated/Total)  GPU6(Allocated/Total)  GPU7(Allocated/Total)  GPU Memory(GiB)
cn-shanghai.gpu1  100.82.131.102  32/39                  32/39                  32/39                  32/39                  0/39                   0/39                   0/39                   0/39                   128/312
NAME              IPADDRESS       GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU4(Allocated/Total)  GPU5(Allocated/Total)  GPU6(Allocated/Total)  GPU7(Allocated/Total)  GPU Compute
cn-shanghai.gpu1  100.82.131.102  80/100                 80/100                 80/100                 80/100                 0/100                  0/100                  0/100                  0/100                  320/800
---------------------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
128/312 (41%)
Allocated/Total GPU Compute In Cluster:
320/800 (40%)
```

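The output shows that 8 pods were scheduled on cn-shanghai.gpu1 and that they occupy 4 of the 8 GPUs (index 0, 1, 2, 3). Each of these GPUs hosts 2 pods, so 32 of its 39 GiB of memory and 80% of its compute are allocated. The scheduling result matches expectations.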
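The expected numbers can be derived from the requests in egpu-test.yaml. The following worked calculation is a sketch that assumes the default binpack mode and that a GPU accepts pods as long as both its memory and its compute can still cover the request. The variable names are illustrative.

```shell
# Values taken from the walkthrough above.
gpu_mem=39        # GiB of memory per GPU
gpu_compute=100   # compute per GPU, in percent
gpus=8            # GPUs on the node
pod_mem=16        # aliyun.com/gpu-mem per pod
pod_compute=40    # aliyun.com/gpu-compute per pod
pods=8            # parallelism of the Job

# A GPU can host as many pods as both its memory and its compute allow.
by_mem=$((gpu_mem / pod_mem))              # 39 / 16 = 2
by_compute=$((gpu_compute / pod_compute))  # 100 / 40 = 2
per_gpu=$(( by_mem < by_compute ? by_mem : by_compute ))

# binpack fills one GPU before moving to the next, so round up.
gpus_used=$(( (pods + per_gpu - 1) / per_gpu ))

echo "pods per GPU: $per_gpu"                                            # 2
echo "GPUs used: $gpus_used"                                             # 4
echo "memory per used GPU: $((per_gpu * pod_mem))/$gpu_mem"              # 32/39
echo "compute per used GPU: $((per_gpu * pod_compute))/$gpu_compute"     # 80/100
echo "cluster memory: $((pods * pod_mem))/$((gpus * gpu_mem))"           # 128/312
echo "cluster compute: $((pods * pod_compute))/$((gpus * gpu_compute))"  # 320/800
```

128/312 is roughly 41% and 320/800 is exactly 40%, which agrees with the cluster summary printed by the inspector.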

Uninstall

Remove the labels from the node.

```
sudo kubectl label no <target_node> egpu-
sudo kubectl label no <target_node> node.gpu.placement-
```
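To confirm that the labels are gone, the LABELS column printed by kubectl get node with the --show-labels flag can be inspected. The helper below is a hypothetical sketch that checks such a comma-separated label string for an exact key. The sample label string is made up for illustration.

```shell
# has_label LABELS KEY: succeeds if the comma-separated LABELS string
# contains a label whose key is exactly KEY. Hypothetical helper.
has_label() {
  case ",$1," in
    *",$2="*) return 0 ;;
    *) return 1 ;;
  esac
}

# Made-up example of a node that still carries both eGPU labels.
labels="kubernetes.io/os=linux,egpu=true,node.gpu.placement=spread"
if has_label "$labels" egpu; then
  echo "egpu label still present"
fi
# prints egpu label still present
```

Matching on the leading comma keeps a key such as gpu from being confused with egpu.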

Delete the device plugin.
```
sudo helm uninstall egpu -n kube-system
```