cGPU compatibility and troubleshooting FAQ - Container Service for Kubernetes (ACK) - Alibaba Cloud

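cGPU is a GPU memory and compute isolation module developed by Alibaba Cloud. With cGPU, GPU resources are isolated so that when multiple containers share one GPU, the GPU memory and compute resources used by each container do not affect those of the others. This topic describes known issues and usage notes for cGPU.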

Before you read

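If a GPU node in your cluster carries the label ack.node.gpu.schedule=cgpu, ack.node.gpu.schedule=core_mem, or cgpu=true, cGPU isolation is enabled on that node.
Check the release notes of the ack-ai-installer component to learn how ack-ai-installer versions map to cGPU versions.
For more information about cGPU, see the official NVIDIA documentation.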
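
The label rule above can be sketched as a small check. This is a minimal illustration, not part of any ACK tooling: the function name `cgpu_enabled` is hypothetical, and it assumes you have already obtained the node's labels as a dictionary (for example from the `.metadata.labels` field of `kubectl get node <node-name> -o json`).

```python
# Labels that this topic lists as indicating cGPU isolation is enabled.
CGPU_LABELS = {
    ("ack.node.gpu.schedule", "cgpu"),
    ("ack.node.gpu.schedule", "core_mem"),
    ("cgpu", "true"),
}


def cgpu_enabled(node_labels: dict[str, str]) -> bool:
    """Return True if any of the listed cGPU labels is set on the node.

    node_labels is assumed to be the node's label map (key -> value).
    """
    return any(node_labels.get(key) == value for key, value in CGPU_LABELS)


if __name__ == "__main__":
    print(cgpu_enabled({"ack.node.gpu.schedule": "core_mem"}))
    print(cgpu_enabled({"kubernetes.io/os": "linux"}))
```

A node that carries none of the three labels is reported as not having cGPU isolation enabled; the sketch does not inspect the node in any other way.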

cGPU version compatibility

NVIDIA driver compatibility

| cGPU version | Compatible NVIDIA drivers |
| --- | --- |
| 1.5.20, 1.5.19, 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9, 1.5.8, 1.5.7, 1.5.6, 1.5.5, 1.5.3 | Supported: 460, 470, 510, 515, 525, 535, 550, 560, 565, 570, and 575 series |
| 1.5.2, 1.0.10, 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5 | Supported: 460 series; 470 series <= 470.161.03; 510 series <= 510.108.03; 515 series <= 515.86.01; 525 series <= 525.89.03. Not supported: 535, 550, 560, 565, 570, and 575 series |
| 1.0.3, 0.8.17, 0.8.13 | Supported: 460 series; 470 series <= 470.161.03. Not supported: 510, 515, 525, 535, 550, 560, 565, 570, and 575 series |
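
The driver table above can be encoded as data and queried. The following is a sketch only: the names `parse`, `ROWS`, and `driver_supported` are hypothetical, and a driver series that the table does not mention at all is treated as unsupported, which is an assumption rather than something the table states.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '470.161.03' into (470, 161, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


ALL_SERIES = (460, 470, 510, 515, 525, 535, 550, 560, 565, 570, 575)

# One entry per table row: (cGPU versions in the row,
# {driver series: highest supported driver version, or None if uncapped}).
ROWS = [
    ({"1.5.20", "1.5.19", "1.5.18", "1.5.17", "1.5.16", "1.5.15", "1.5.13",
      "1.5.12", "1.5.11", "1.5.10", "1.5.9", "1.5.8", "1.5.7", "1.5.6",
      "1.5.5", "1.5.3"},
     {series: None for series in ALL_SERIES}),
    ({"1.5.2", "1.0.10", "1.0.9", "1.0.8", "1.0.7", "1.0.6", "1.0.5"},
     {460: None, 470: "470.161.03", 510: "510.108.03",
      515: "515.86.01", 525: "525.89.03"}),
    ({"1.0.3", "0.8.17", "0.8.13"},
     {460: None, 470: "470.161.03"}),
]


def driver_supported(cgpu_version: str, driver_version: str) -> bool:
    """Look up the table row for cgpu_version and check the driver against it."""
    series = parse(driver_version)[0]
    for versions, caps in ROWS:
        if cgpu_version in versions:
            if series not in caps:
                # Listed as unsupported, or (assumption) not listed at all.
                return False
            cap = caps[series]
            return cap is None or parse(driver_version) <= parse(cap)
    raise ValueError(f"cGPU version {cgpu_version} is not listed in the table")
```

The lookup raises `ValueError` for cGPU versions that do not appear in the table instead of guessing which row they would belong to.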

Instance family compatibility

| cGPU version | Compatible instance families |
| --- | --- |
| 1.5.20, 1.5.19 | Supported: gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e; gn8t / ebmgn8t; gn8is / gn8v / ebmgn8is / ebmgn8v; gn8ia / ebmgn8ia; ebmgn9t |
| 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9 | Supported: gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e; gn8t / ebmgn8t; gn8is / gn8v / ebmgn8is / ebmgn8v; gn8ia / ebmgn8ia. Not supported: ebmgn9t |
| 1.5.8, 1.5.7 | Supported: gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e; gn8t / ebmgn8t; gn8is / gn8v / ebmgn8is / ebmgn8v. Not supported: gn8ia / ebmgn8ia; ebmgn9t |
| 1.5.6, 1.5.5 | Supported: gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e; gn8t / ebmgn8t. Not supported: gn8is / gn8v / ebmgn8is / ebmgn8v; gn8ia / ebmgn8ia; ebmgn9t |
| 1.5.3, 1.5.2, 1.0.10, 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.3 | Supported: gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e. Not supported: gn8t / ebmgn8t; gn8is / gn8v / ebmgn8is / ebmgn8v; gn8ia / ebmgn8ia; ebmgn9t |
| 0.8.17, 0.8.13 | Supported: gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e. Not supported: gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e; gn8t / ebmgn8t; gn8is / gn8v / ebmgn8is / ebmgn8v; gn8ia / ebmgn8ia; ebmgn9t |
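
In the instance family table above, support only ever grows as the cGPU version increases, so each family can be summarized by the earliest listed cGPU version that supports it. The sketch below relies on that observation; `MIN_CGPU` and `family_supported` are hypothetical names, and applying the threshold to cGPU versions that the table does not list (for example 1.5.14) is an extrapolation.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '1.5.19' into (1, 5, 19) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Earliest cGPU version in the table that lists each family group as supported.
_GROUPS = {
    "0.8.13": ["gn6i", "gn6e", "gn6v", "gn6t", "ebmgn6i", "ebmgn6t", "ebmgn6e"],
    "1.0.3": ["gn7i", "gn7", "gn7e", "ebmgn7i", "ebmgn7e"],
    "1.5.5": ["gn8t", "ebmgn8t"],
    "1.5.7": ["gn8is", "gn8v", "ebmgn8is", "ebmgn8v"],
    "1.5.9": ["gn8ia", "ebmgn8ia"],
    "1.5.19": ["ebmgn9t"],
}
MIN_CGPU = {family: minimum
            for minimum, families in _GROUPS.items()
            for family in families}


def family_supported(cgpu_version: str, family: str) -> bool:
    """Return True if cgpu_version is at or above the family's threshold."""
    if family not in MIN_CGPU:
        raise ValueError(f"instance family {family} is not listed in the table")
    return parse(cgpu_version) >= parse(MIN_CGPU[family])
```

For example, `family_supported("1.5.18", "ebmgn9t")` is false because the table first lists ebmgn9t under cGPU 1.5.19.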

nvidia-container-toolkit compatibility

| cGPU version | Compatible nvidia-container-toolkit |
| --- | --- |
| 1.5.20, 1.5.19, 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9, 1.5.8, 1.5.7, 1.5.6, 1.5.5, 1.5.3, 1.5.2, 1.0.10 | Supported: nvidia-container-toolkit <= 1.10; nvidia-container-toolkit 1.11 ~ 1.17 |
| 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.3, 0.8.17, 0.8.13 | Supported: nvidia-container-toolkit <= 1.10. Not supported: nvidia-container-toolkit 1.11 ~ 1.17 |

Kernel version compatibility

| cGPU version | Compatible kernel versions |
| --- | --- |
| 1.5.20, 1.5.19, 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9 | Supported: kernel 3.x; kernel 4.x; kernel 5.x <= 5.15 |
| 1.5.8, 1.5.7, 1.5.6, 1.5.5, 1.5.3 | Supported: kernel 3.x; kernel 4.x; kernel 5.x <= 5.10 |
| 1.5.2, 1.0.10, 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.3 | Supported: kernel 3.x; kernel 4.x; kernel 5.x <= 5.1 |
| 0.8.17 | Supported: kernel 3.x; kernel 4.x; kernel 5.x <= 5.0 |
| 0.8.13, 0.8.12, 0.8.10 | Supported: kernel 3.x; kernel 4.x. Not supported: kernel 5.x |
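
The kernel table above can be checked against a kernel release string such as the output of `uname -r`. This sketch makes several assumptions that go beyond the table: the names `ROWS` and `kernel_supported` are hypothetical, each row is applied to every cGPU version at or above its lowest listed version (including versions the table skips), the 5.x limits are copied exactly as the table writes them, and kernel major versions the table does not mention (for example 6.x) are treated as unsupported.

```python
import re


def parse(version: str) -> tuple[int, ...]:
    """Turn '1.5.9' into (1, 5, 9) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# (lowest cGPU version of a table row, highest supported 5.x kernel as
# (major, minor), or None when the row lists kernel 5.x as unsupported).
# Ordered from newest row to oldest so the first match wins.
ROWS = [
    ((1, 5, 9), (5, 15)),
    ((1, 5, 3), (5, 10)),
    ((1, 0, 3), (5, 1)),
    ((0, 8, 17), (5, 0)),
    ((0, 8, 10), None),
]


def kernel_supported(cgpu_version: str, kernel_release: str) -> bool:
    """Check a kernel release like '5.10.134-16.al8.x86_64' against the table."""
    match = re.match(r"(\d+)\.(\d+)", kernel_release)
    if match is None:
        raise ValueError(f"cannot parse kernel release {kernel_release!r}")
    kernel = (int(match.group(1)), int(match.group(2)))
    cgpu = parse(cgpu_version)
    for lowest, cap in ROWS:
        if cgpu >= lowest:
            if kernel[0] in (3, 4):
                return True
            if kernel[0] == 5:
                return cap is not None and kernel <= cap
            return False  # assumption: unlisted kernel majors are unsupported
    raise ValueError(f"cGPU version {cgpu_version} is older than any table row")
```

On a node, the second argument could be taken from `platform.release()` in the Python standard library, which returns the same string as `uname -r` on Linux.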

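FAQ

A Linux kernel panic occurs when using cGPU.

Background: When the cGPU component is in use, the cGPU kernel driver deadlocks (concurrently running processes block one another), which causes a Linux kernel panic.
Cause: The installed cGPU version is 1.5.7 or earlier, which is too old.
Solution: Install cGPU 1.5.10 or later, or gradually upgrade older cGPU installations to 1.5.10 or later, so that new workloads do not hit the kernel error. For the upgrade procedure, see Upgrade the cGPU version on a node.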

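cGPU Pods fail to start in some scenarios.

Background: When the Alibaba Cloud container-optimized operating system image is used, cGPU Pods on cGPU nodes fail to start in some scenarios. The error message is as follows:

"Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 2, stdout: , stderr: Auto-detected mode as 'legacy': unknown"

Cause: With cGPU 1.5.18 or earlier, the first cGPU Pod on a cGPU node fails to start in some scenarios.
Solution: Upgrade ack-ai-installer to 1.12.6 or later. For the procedure, see Upgrade the shared GPU scheduling component.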

`modprobe: ERROR` is reported when creating a cGPU Pod.

Background: Creating a cGPU Pod fails with modprobe: ERROR: could not insert 'cgpu_procfs': Operation not permitted or modprobe: ERROR: could not insert 'km': Operation not permitted.

Cause: These errors usually indicate that the operating system version is incompatible with cGPU. The two errors appear in full as follows:

Error: failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 2, stdout: , stderr: modprobe: ERROR: could not insert 'cgpu_procfs': Operation not permitted modprobe: ERROR: could not insert 'cgpu_procfs': Operation not permitted Auto-detected mode as 'legacy': unknown

modprobe: ERROR: could not insert 'km': Operation not permitted

Solution: See Upgrade the shared GPU scheduling component and upgrade the component to the latest version.

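The container of a cGPU Pod fails to be created or exits on timeout.

Background: Creating the container of a cGPU Pod fails, or the container exits with a timeout error.
Cause: The installed cGPU version is 1.0.10 or earlier and the NVIDIA Toolkit version is 1.11 or later. The two versions are incompatible.
Solution: Because the NVIDIA Toolkit version is incompatible with cGPU, see Upgrade the shared GPU scheduling component and upgrade the component to the latest version.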

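`Error occurs when creating cGPU instance: unknown` is reported when creating a cGPU Pod.

Background: For performance reasons, at most 20 Pods can be created on a single GPU when cGPU is used.
Cause: Once the number of Pods created on a GPU exceeds this limit, Pods subsequently scheduled to that GPU cannot run and report Error occurs when creating cGPU instance: unknown.
Solution: When using cGPU, keep the number of Pods on a single GPU at 20 or fewer.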
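
To illustrate the 20-Pod limit described above, the sketch below counts Pods per GPU and reports any GPU that is over the limit. The input format is hypothetical: it assumes you have already collected (Pod name, GPU index) pairs for one node by some other means, and the function name `overloaded_gpus` is not part of cGPU or ACK.

```python
from collections import Counter
from typing import Iterable

# Limit stated in this topic: at most 20 Pods per GPU when cGPU is used.
MAX_PODS_PER_GPU = 20


def overloaded_gpus(assignments: Iterable[tuple[str, int]]) -> dict[int, int]:
    """Return {gpu_index: pod_count} for every GPU above the limit.

    assignments is an iterable of (pod_name, gpu_index) pairs for one node.
    """
    counts = Counter(gpu_index for _pod_name, gpu_index in assignments)
    return {gpu: n for gpu, n in counts.items() if n > MAX_PODS_PER_GPU}


if __name__ == "__main__":
    pods = [(f"job-a-{i}", 0) for i in range(21)]   # 21 Pods on GPU 0
    pods += [(f"job-b-{i}", 1) for i in range(20)]  # 20 Pods on GPU 1
    print(overloaded_gpus(pods))
```

A GPU with exactly 20 Pods is within the limit and is not reported; only the 21st and later Pods on a GPU would hit the error.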

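`Failed to initialize NVML` is reported when running `nvidia-smi` in a cGPU Pod.

Background: After a Pod that requests shared GPU scheduling resources enters the Running state, running nvidia-smi in the Pod produces the following output.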
```
Failed to initialize NVML: GPU access blocked by operating system
```
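Cause: The installed cGPU version is 1.5.2 or earlier and the GPU driver was released after July 2023, so the cGPU version is incompatible with the GPU driver version. To find the release date of a GPU driver, see View GPU driver release dates. For the default GPU driver version supported by each ACK cluster version, see the list of NVIDIA driver versions supported by ACK.
Solution: See Upgrade the shared GPU scheduling component and upgrade the component to the latest version.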
