Examples of using the PPU PIP service on ACS
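This topic provides examples of using the PPU PIP service on ACS in three scenarios: DinD, Buildah, and a normal PPU container.
Prerequisites
You are familiar with how to use the PPU PIP service provided by PTG and the PPU PIP service provided by Alibaba Cloud on ACS. For details, see Use the PPU PIP service on ACS.
You have recorded the Service Account name of the ACS cluster. Later configuration steps reference this name. For details, see Enable PPU PIP password-free authorization in ACS.
You are familiar with configuring the Secret for image authentication, preparing model files, and deploying GPU compute. For details, see Deploy inference compute.
PPU PIP service examples
DinD example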
The YAML file below defines the reserved-instance scenario. For other scenarios, modify metadata.labels.

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        alibabacloud.com/compute-class: gpu-hpn
        alibabacloud.com/gpu-model-series: PPU810E
        alibabacloud.com/compute-qos: default
        alibabacloud.com/hpn-type: "rdma"
      name: acs-dind-demo-fuyi01
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/acs-demo-ns/docker:27-dind
        name: main
        resources: # Adjust the resources as needed
          limits:
            cpu: "16"
            ephemeral-storage: 256Gi
            memory: 128Gi
            alibabacloud.com/ppu: "2"
          requests:
            cpu: "16"
            ephemeral-storage: 256Gi
            memory: 128Gi
            alibabacloud.com/ppu: "2"
        volumeMounts:
        - mountPath: /var/lib/docker
          name: docker
        # Optional: required only when BuildKit is used
        - mountPath: /var/lib/buildkit
          name: buildkit
        securityContext:
          # Declares a privileged container; submit a ticket to request the privileged capability
          privileged: true
      # Name of the ServiceAccount created in the earlier step
      serviceAccountName: pip-default
      volumes:
      - emptyDir: {}
        name: docker
      # Optional: required only when BuildKit is used
      - emptyDir: {}
        name: buildkit

Create and configure the DinD container.

    # Create the DinD container
    kubectl apply -f dind.yaml

    # Open a shell in the DinD container
    kubectl exec -it acs-dind-demo-fuyi01 -- sh

    # [Inside the DinD container] Log in to the registry so that the PPU image can be pulled
    docker login --username=public_pull@1903015075229209 egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com --password=CnpER062Qo!

    # [Inside the DinD container] Pull the image to local storage
    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-xpu-pytorch:25.06

    # [Inside the DinD container] Start the inner PPU container
    id=$(docker run --network host -tid --privileged \
      -e ALIBABA_CLOUD_OIDC_PROVIDER_ARN=$ALIBABA_CLOUD_OIDC_PROVIDER_ARN \
      -e ALIBABA_CLOUD_ROLE_ARN=$ALIBABA_CLOUD_ROLE_ARN \
      -e ALIBABA_CLOUD_OIDC_TOKEN_FILE=$ALIBABA_CLOUD_OIDC_TOKEN_FILE \
      -e ALIBABA_CLOUD_STS_ENDPOINT=$ALIBABA_CLOUD_STS_ENDPOINT \
      -v /var/run/secrets/ack.alibabacloud.com/rrsa-tokens/:/var/run/secrets/ack.alibabacloud.com/rrsa-tokens/ \
      egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-xpu-pytorch:25.06 bash)

    # [Inside the DinD container] Enter the inner PPU container
    docker exec -it $id bash

    # [Inside the inner PPU container] Install the password-free plugin
    pip install aiext-pypi-plugin -i http://mirrors.cloud.aliyuncs.com/aiext-pypi/aiext-pypi-plugin/simple --trusted-host mirrors.cloud.aliyuncs.com

    # [Inside the inner PPU container] Install torchdata, which is not present in the image (make sure the version matches)
    pip install torchdata -i https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu126/simple/

    # Inside the inner container you can run any apt install / pip install you need; the rest of the environment setup is omitted

Output of the password-free pip installation:

    root@acs-dind-demo-fuyi01:/workspace# pip install torchdata -i https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu126/simple/
    [AIEXT_PYPI_PLUGIN] Plugin has been loaded.
    Looking in indexes: https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu126/simple/, http://mirrors.cloud.aliyuncs.com/pypi/simple/
    Collecting torchdata
      Downloading https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu126/packages/torchdata/0.11.0%2Bppu1.5.0.ce/torchdata-0.11.0%2Bppu1.5.0.ce-py3-none-any.whl (59 kB)
    Requirement already satisfied: urllib3>=1.25 in /opt/ac2/lib/python3.12/site-packages (from torchdata) (2.5.0)
    Requirement already satisfied: requests in /opt/ac2/lib/python3.12/site-packages (from torchdata) (2.32.4)
    Requirement already satisfied: torch>=2 in /opt/ac2/lib/python3.12/site-packages (from torchdata) (2.6.0+ali.7.post1.ppu1.5.2.cu126)
    Requirement already satisfied: filelock in /opt/ac2/lib/python3.12/site-packages (from torch>=2->torchdata) (3.18.0)
    Requirement already satisfied: typing-extensions>=4.10.0 in /opt/ac2/lib/python3.12/site-packages (from torch>=2->torchdata) (4.14.0)
    Requirement already satisfied: setuptools in /opt/ac2/lib/python3.12/site-packages (from torch>=2->torchdata) (80.9.0)
    Requirement already satisfied: sympy==1.13.1 in /opt/ac2/lib/python3.12/site-packages (from torch>=2->torchdata) (1.13.1)
    Requirement already satisfied: networkx in /opt/ac2/lib/python3.12/site-packages (from torch>=2->torchdata) (3.5)
    Requirement already satisfied: jinja2 in /opt/ac2/lib/python3.12/site-packages (from torch>=2->torchdata) (3.1.4)
    Requirement already satisfied: fsspec in /opt/ac2/lib/python3.12/site-packages (from torch>=2->torchdata) (2024.9.0)
    Requirement already satisfied: triton==3.2.0 in /opt/ac2/lib/python3.12/site-packages (from torch>=2->torchdata) (3.2.0+ppu1.5.2.cu126)
    Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/ac2/lib/python3.12/site-packages (from sympy==1.13.1->torch>=2->torchdata) (1.3.0)
    Requirement already satisfied: MarkupSafe>=2.0 in /opt/ac2/lib/python3.12/site-packages (from jinja2->torch>=2->torchdata) (3.0.2)
    Requirement already satisfied: charset_normalizer<4,>=2 in /opt/ac2/lib/python3.12/site-packages (from requests->torchdata) (3.4.2)
    Requirement already satisfied: idna<4,>=2.5 in /opt/ac2/lib/python3.12/site-packages (from requests->torchdata) (3.10)
    Requirement already satisfied: certifi>=2017.4.17 in /opt/ac2/lib/python3.12/site-packages (from requests->torchdata) (2025.6.15)
    Installing collected packages: torchdata
    root@acs-dind-demo-fuyi01:/workspace# pip list | grep torchdata
    torchdata 0.11.0+ppu1.5.0.ce

Exit the inner container and save it from the DinD container.

    # [Inside the inner PPU container] Exit back to the DinD container
    exit

    # [Inside the DinD container] Save the inner PPU container as an image
    docker commit $id my-new-image:new-id

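The `docker run` step in the DinD example forwards four `ALIBABA_CLOUD_*` variables and the token directory into the inner container. If any of them is empty in the DinD container, the inner container receives an empty value. The following is a minimal pre-flight sketch, not part of ACS or the plugin: `check_rrsa_env` is a hypothetical helper name, and it assumes only that the four variables listed in the example are the ones that matter.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not an ACS or plugin command):
# verify that the four variables forwarded by `docker run -e ...` are
# non-empty and that the OIDC token file is readable.
check_rrsa_env() {
  missing=""
  for name in ALIBABA_CLOUD_OIDC_PROVIDER_ARN ALIBABA_CLOUD_ROLE_ARN \
              ALIBABA_CLOUD_OIDC_TOKEN_FILE ALIBABA_CLOUD_STS_ENDPOINT; do
    # Indirect lookup of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing variables:$missing" >&2
    return 1
  fi
  if [ ! -r "$ALIBABA_CLOUD_OIDC_TOKEN_FILE" ]; then
    echo "token file not readable: $ALIBABA_CLOUD_OIDC_TOKEN_FILE" >&2
    return 1
  fi
  echo "RRSA environment looks complete"
}
```

Inside the DinD container, it could be run as `check_rrsa_env` before the `docker run` command, so that a missing variable is reported there rather than surfacing later as a failed pip request.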
Buildah example
The YAML file below defines the reserved-instance scenario. For other scenarios, modify metadata.labels.

    kind: Pod
    apiVersion: v1
    metadata:
      name: acs-buildah-demo-fuyi
      labels:
        alibabacloud.com/compute-class: gpu-hpn
        alibabacloud.com/gpu-model-series: PPU810E
        alibabacloud.com/compute-qos: default
        alibabacloud.com/hpn-type: "rdma"
    spec:
      restartPolicy: OnFailure
      volumes:
      - emptyDir: {}
        name: buildah
      containers:
      - name: builder
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/buildah:sampleBuild
        command:
        - sh
        - -c
        - sleep infinity
        imagePullPolicy: Always
        volumeMounts:
        - mountPath: /var/lib/containers/storage
          name: buildah
        resources: # Adjust the resources as needed
          limits:
            cpu: "16"
            memory: "128Gi"
            ephemeral-storage: 500Gi
            alibabacloud.com/ppu: 2
          requests:
            cpu: "16"
            memory: "128Gi"
            ephemeral-storage: 500Gi
            alibabacloud.com/ppu: 2
      # Name of the ServiceAccount created in the earlier step
      serviceAccountName: pip-default

Create and configure the Buildah container.

    # Create the Buildah container
    kubectl apply -f buildah.yaml

    # Open a shell in the Buildah container
    kubectl exec -it acs-buildah-demo-fuyi -- sh

    # Inside the Buildah container, buildah commands can replace docker commands one-for-one

    # [Inside the Buildah container] Log in so that the public-network image can be pulled
    buildah login --username=public_pull@1903015075229209 egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com --password=CnpER062Qo!

    # [Inside the Buildah container] Pull the base image to local storage for the build
    buildah pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-xpu-pytorch-c:xp-fuyao-ppu1.4.0_hotfix2-cuda11.4-torch1.10-py38-24.12.27-squash

    # [Inside the Buildah container] Build a new image from the Dockerfile below (it only demonstrates password-free pip)
    buildah --storage-driver=overlay build --no-cache \
      --build-arg ALIBABA_CLOUD_OIDC_PROVIDER_ARN=$ALIBABA_CLOUD_OIDC_PROVIDER_ARN \
      --build-arg ALIBABA_CLOUD_ROLE_ARN=$ALIBABA_CLOUD_ROLE_ARN \
      --build-arg ALIBABA_CLOUD_STS_ENDPOINT=$ALIBABA_CLOUD_STS_ENDPOINT \
      -t demo:v1 .

Dockerfile for the password-free pip installation (other Dockerfile instructions are omitted):

    # Use a standard image provided by ACS; the image below is relatively small
    FROM egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-xpu-pytorch-c:xp-fuyao-ppu1.4.0_hotfix2-cuda11.4-torch1.10-py38-24.12.27-squash

    # Other Dockerfile steps are omitted.
    # Only the steps related to installing software from the Alibaba Cloud PIP service are shown.

    ## 1. Set the environment variables for password-free access
    ARG ALIBABA_CLOUD_STS_ENDPOINT
    ARG ALIBABA_CLOUD_OIDC_PROVIDER_ARN
    ARG ALIBABA_CLOUD_ROLE_ARN
    ENV ALIBABA_CLOUD_STS_ENDPOINT=${ALIBABA_CLOUD_STS_ENDPOINT}
    ENV ALIBABA_CLOUD_OIDC_PROVIDER_ARN=${ALIBABA_CLOUD_OIDC_PROVIDER_ARN}
    ENV ALIBABA_CLOUD_ROLE_ARN=${ALIBABA_CLOUD_ROLE_ARN}
    ENV ALIBABA_CLOUD_OIDC_TOKEN_FILE=/mount/rrsa-tokens/token

    ## 2. Install the password-free plugin aiext-pypi-plugin first.
    ##    Plugin limitation: requests issued through `python -m pip` are not yet supported.
    RUN pip install aiext-pypi-plugin -i http://mirrors.cloud.aliyuncs.com/aiext-pypi/aiext-pypi-plugin/simple --trusted-host mirrors.cloud.aliyuncs.com

    ## 3. Add the --mount option to every pip install that follows; pip install then works normally
    ##    (two packages that are not installed in the image are used here):
    RUN --mount=type=bind,source=/var/run/secrets/ack.alibabacloud.com/rrsa-tokens,target=/mount/rrsa-tokens \
        pip install cumm-cu114 -i https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu114/simple/
    RUN --mount=type=bind,source=/var/run/secrets/ack.alibabacloud.com/rrsa-tokens,target=/mount/rrsa-tokens \
        pip install spconv-cu114 -i https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu114/simple/

    ## 4. After the pip installation is complete, cleaning up the environment is recommended
    RUN pip uninstall -y aiext-pypi-plugin

The build produces output like the following:

    sh-5.2# buildah --storage-driver=overlay build --no-cache \
      --build-arg ALIBABA_CLOUD_OIDC_PROVIDER_ARN=$ALIBABA_CLOUD_OIDC_PROVIDER_ARN \
      --build-arg ALIBABA_CLOUD_ROLE_ARN=$ALIBABA_CLOUD_ROLE_ARN \
      --build-arg ALIBABA_CLOUD_OIDC_TOKEN_FILE=$ALIBABA_CLOUD_OIDC_TOKEN_FILE \
      --build-arg ALIBABA_CLOUD_STS_ENDPOINT=$ALIBABA_CLOUD_STS_ENDPOINT \
      -t demo:v1 .
    STEP 1/13: FROM egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-xpu-pytorch-c:xp-fuyao-ppu1.4.0_hotfix2-cuda11.4-torch1.10-py38-24.12.27-squash
    STEP 2/13: ARG ALIBABA_CLOUD_STS_ENDPOINT
    STEP 3/13: ARG ALIBABA_CLOUD_OIDC_PROVIDER_ARN
    STEP 4/13: ARG ALIBABA_CLOUD_ROLE_ARN
    STEP 5/13: ARG ALIBABA_CLOUD_OIDC_TOKEN_FILE
    STEP 6/13: ENV ALIBABA_CLOUD_STS_ENDPOINT=sts-vpc.cn-wulanchabu.aliyuncs.com
    STEP 7/13: ENV ALIBABA_CLOUD_OIDC_PROVIDER_ARN=acs:ram::1697237910442391:oidc-provider/ack-rrsa-c69a187d1281446df95887e01a6b26c6b
    STEP 8/13: ENV ALIBABA_CLOUD_ROLE_ARN=acs:ram::1697237910442391:role/pip-default-c69a187d1281446df95887e01a6b26c6b
    STEP 9/13: ENV ALIBABA_CLOUD_OIDC_TOKEN_FILE=/var/run/secrets/ack.alibabacloud.com/rrsa-tokens/token
    STEP 10/13: RUN pip install aiext-pypi-plugin -i http://mirrors.cloud.aliyuncs.com/aiext-pypi/aiext-pypi-plugin/simple --trusted-host mirrors.cloud.aliyuncs.com
    Looking in indexes: http://mirrors.cloud.aliyuncs.com/aiext-pypi/aiext-pypi-plugin/simple
    Collecting aiext-pypi-plugin
      Downloading http://mirrors.cloud.aliyuncs.com/aiext-pypi/aiext-pypi-plugin/packages/aiext_pypi_plugin-0.1.0-py3-none-any.whl (3.0 kB)
    Installing collected packages: aiext-pypi-plugin
    Successfully installed aiext-pypi-plugin-0.1.0
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
    STEP 11/13: RUN --mount=type=bind,source=/var/run/secrets/ack.alibabacloud.com/rrsa-tokens,target=/mount/rrsa-tokens pip install cumm-cu114 -i https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu114/simple/
    [AIEXT_PYPI_PLUGIN] Plugin has been loaded.
    [AIEXT_PYPI_PLUGIN] Missing OIDC token file
    Looking in indexes: https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu114/simple/
    Requirement already satisfied: cumm-cu114 in /opt/conda/lib/python3.8/site-packages (0.5.3)
    Requirement already satisfied: numpy in /opt/conda/lib/python3.8/site-packages (from cumm-cu114) (1.21.1)
    Requirement already satisfied: pybind11>=2.6.0 in /opt/conda/lib/python3.8/site-packages (from cumm-cu114) (2.7.0)
    Requirement already satisfied: pccm>=0.4.2 in /opt/conda/lib/python3.8/site-packages (from cumm-cu114) (0.4.16)
    Requirement already satisfied: fire in /opt/conda/lib/python3.8/site-packages (from cumm-cu114) (0.5.0)
    Requirement already satisfied: ccimport>=0.3.1 in /opt/conda/lib/python3.8/site-packages (from pccm>=0.4.2->cumm-cu114) (0.4.4)
    Requirement already satisfied: lark>=1.0.0 in /opt/conda/lib/python3.8/site-packages (from pccm>=0.4.2->cumm-cu114) (1.2.2)
    Requirement already satisfied: portalocker>=2.3.2 in /opt/conda/lib/python3.8/site-packages (from pccm>=0.4.2->cumm-cu114) (2.6.0)
    Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from fire->cumm-cu114) (1.16.0)
    Requirement already satisfied: termcolor in /opt/conda/lib/python3.8/site-packages (from fire->cumm-cu114) (2.1.1)
    Requirement already satisfied: ninja in /opt/conda/lib/python3.8/site-packages (from ccimport>=0.3.1->pccm>=0.4.2->cumm-cu114) (1.11.1.3)
    Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from ccimport>=0.3.1->pccm>=0.4.2->cumm-cu114) (2.26.0)
    Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->ccimport>=0.3.1->pccm>=0.4.2->cumm-cu114) (1.26.13)
    Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->ccimport>=0.3.1->pccm>=0.4.2->cumm-cu114) (3.1)
    Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.8/site-packages (from requests->ccimport>=0.3.1->pccm>=0.4.2->cumm-cu114) (2.0.0)
    Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->ccimport>=0.3.1->pccm>=0.4.2->cumm-cu114) (2021.5.30)
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
    [AIEXT_PYPI_PLUGIN] Missing OIDC token file
    STEP 12/13: RUN --mount=type=bind,source=/var/run/secrets/ack.alibabacloud.com/rrsa-tokens,target=/mount/rrsa-tokens pip install spconv-cu114 -i https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu114/simple/
    [AIEXT_PYPI_PLUGIN] Plugin has been loaded.
    [AIEXT_PYPI_PLUGIN] Missing OIDC token file
    Looking in indexes: https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu114/simple/
    Requirement already satisfied: spconv-cu114 in /opt/conda/lib/python3.8/site-packages (2.3.6)
    Requirement already satisfied: fire in /opt/conda/lib/python3.8/site-packages (from spconv-cu114) (0.5.0)
    Requirement already satisfied: numpy in /opt/conda/lib/python3.8/site-packages (from spconv-cu114) (1.21.1)
    Requirement already satisfied: pybind11>=2.6.0 in /opt/conda/lib/python3.8/site-packages (from spconv-cu114) (2.7.0)
    Requirement already satisfied: pccm>=0.4.0 in /opt/conda/lib/python3.8/site-packages (from spconv-cu114) (0.4.16)
    Requirement already satisfied: ccimport>=0.4.0 in /opt/conda/lib/python3.8/site-packages (from spconv-cu114) (0.4.4)
    Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from ccimport>=0.4.0->spconv-cu114) (2.26.0)
    Requirement already satisfied: ninja in /opt/conda/lib/python3.8/site-packages (from ccimport>=0.4.0->spconv-cu114) (1.11.1.3)
    Requirement already satisfied: portalocker>=2.3.2 in /opt/conda/lib/python3.8/site-packages (from pccm>=0.4.0->spconv-cu114) (2.6.0)
    Requirement already satisfied: lark>=1.0.0 in /opt/conda/lib/python3.8/site-packages (from pccm>=0.4.0->spconv-cu114) (1.2.2)
    Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from fire->spconv-cu114) (1.16.0)
    Requirement already satisfied: termcolor in /opt/conda/lib/python3.8/site-packages (from fire->spconv-cu114) (2.1.1)
    Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->ccimport>=0.4.0->spconv-cu114) (2021.5.30)
    Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->ccimport>=0.4.0->spconv-cu114) (3.1)
    Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.8/site-packages (from requests->ccimport>=0.4.0->spconv-cu114) (2.0.0)
    Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->ccimport>=0.4.0->spconv-cu114) (1.26.13)
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
    [AIEXT_PYPI_PLUGIN] Missing OIDC token file
    STEP 13/13: RUN pip uninstall -y aiext-pypi-plugin
    [AIEXT_PYPI_PLUGIN] Plugin has been loaded.
    [AIEXT_PYPI_PLUGIN] Missing OIDC token file
    Found existing installation: aiext-pypi-plugin 0.1.0
    Uninstalling aiext-pypi-plugin-0.1.0:
      Successfully uninstalled aiext-pypi-plugin-0.1.0
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
    COMMIT demo:v1
    --> 73a79226bf56
    Successfully tagged localhost/demo:v1
    73a79226bf5653a8dfed3f01c46dddc6c683f2c410b0e9d8f0ca3f5d6e4ad336

View and save the built image.

    # List the newly built image
    buildah --storage-driver=overlay images

    # Then push it to your own image registry as needed (details omitted)
    buildah login -u "{LOGIN_USERNAME}" -p "{LOGIN_PASSWORD}" {target-registry-domain}
    buildah --storage-driver=overlay push demo:v1 docker://{target-registry-prefix}/{target-image}

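The build command in the Buildah example repeats one `--build-arg` flag per variable, and an unset variable is passed silently as an empty value. As a rough sketch under stated assumptions, the flag list can be generated in a loop that stops on the first empty variable. `rrsa_build_args` is a hypothetical helper name, not a Buildah feature, and the sketch assumes the values contain no whitespace.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of Buildah): print the
# "--build-arg NAME=VALUE" pairs used in the build command, one word per
# line, and fail if any of the variables is empty.
rrsa_build_args() {
  for name in ALIBABA_CLOUD_OIDC_PROVIDER_ARN ALIBABA_CLOUD_ROLE_ARN \
              ALIBABA_CLOUD_STS_ENDPOINT; do
    # Indirect lookup of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is empty" >&2
      return 1
    fi
    printf '%s\n' "--build-arg" "$name=$value"
  done
}
```

One way to use it is `buildah --storage-driver=overlay build --no-cache $(rrsa_build_args) -t demo:v1 .`; the unquoted substitution relies on shell word splitting, which is why the no-whitespace assumption matters.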
Normal PPU container example
The YAML file below defines the normal container scenario, which can use the VPC PPU image.

    apiVersion: v1
    kind: Pod
    metadata:
      name: common-ppu-pod
      labels:
        alibabacloud.com/compute-class: gpu-hpn
        alibabacloud.com/gpu-model-series: PPU810E
        alibabacloud.com/compute-qos: default
        alibabacloud.com/hpn-type: "rdma"
    spec:
      imagePullSecrets: # Optional. A normal container that pulls the VPC image needs no Secret; a public-network image must be authenticated.
      - name: acs-image-secret # Must match the name of the image-authentication Secret configured in the prerequisites.
      containers:
      - name: demo
        image: acs-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-xpu-pytorch:25.06
        command:
        - sh
        - -c
        - sleep infinity
        resources:
          limits:
            cpu: 16
            memory: 128G
            alibabacloud.com/ppu: 2
          requests: # Adjust as needed for the command and args of your workload
            cpu: 16
            memory: 128G
            alibabacloud.com/ppu: 2
      # Name of the ServiceAccount created in the prerequisites
      serviceAccountName: pip-default

Create and configure the normal PPU container.
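
    # Create the normal PPU container
    kubectl apply -f commonppu.yaml

    # Open a shell in the normal PPU container (the Pod name comes from metadata.name in the YAML above)
    kubectl exec -it common-ppu-pod -- bash

    # [Inside the normal container] Install the password-free plugin aiext-pypi-plugin first.
    # Plugin limitation: requests issued through `python -m pip` are not yet supported.
    pip install aiext-pypi-plugin -i http://mirrors.cloud.aliyuncs.com/aiext-pypi/aiext-pypi-plugin/simple --trusted-host mirrors.cloud.aliyuncs.com

    # [Inside the normal container] pip wheel packages can now be installed without a password;
    # torchdata is used here because it is not present in the image:
    pip install torchdata -i https://aiext-pypi.mirrors.aliyuncs.com/pg1-pip/ubuntu_cu126/simple/
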
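After the installation, the DinD example confirms the result with `pip list | grep torchdata`, which prints `torchdata 0.11.0+ppu1.5.0.ce`. The sketch below automates that check. It assumes, based only on that one output line, that wheels from the PPU index carry a `+ppu` local version label; `is_ppu_build` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (an illustration): read `pip list` output on stdin and
# succeed only if package $1 is listed with a version containing "+ppu".
# Assumption: PPU-index wheels carry a "+ppu..." local version label, as in
# the torchdata output shown in the DinD example.
is_ppu_build() {
  # Column 1 is the package name, column 2 the version
  version=$(awk -v pkg="$1" '$1 == pkg { v = $2 } END { print v }')
  case "$version" in
    *+ppu*) echo "$version"; return 0 ;;
    *) return 1 ;;
  esac
}
```

Inside the container, it could be called as `pip list | is_ppu_build torchdata`. A non-zero exit status means the package is missing or, under the assumption above, was installed from a different index.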