To cope with the ever-increasing GPU requirements of DeepSeek inference services, you can use an ACK Edge cluster to manage the GPU machines in your local IDC and attach ACS serverless GPU compute in the cloud through the cluster's virtual nodes. With this solution, inference tasks run on IDC GPUs first; when local IDC GPU resources run short, tasks are automatically scheduled to ACS serverless GPUs in the cloud, meeting business growth while keeping costs under control.
Solution overview
Overall architecture
This solution relies on the cloud-edge unified management capability of ACK Edge clusters: the Kubernetes control plane is hosted in the cloud, while IDC machines join the cluster as data-plane nodes. The IDC machines are thus brought under containerized Kubernetes management, and the cluster's virtual nodes provide fast access to ACS serverless GPU compute in the cloud, so that resources on and off the cloud are managed in one place and compute tasks can be distributed dynamically.
Connect the local IDC to the cloud VPC over a leased line.
Add the local IDC machines to the ACK Edge cluster as edge nodes, so that IDC workloads can be managed and scheduled centrally from the cloud.
Configure a custom scheduling policy (ResourcePolicy) for the workload, so that tasks are scheduled to local IDC resources first and to cloud virtual nodes only when local resources are insufficient.
Configure HPA (Horizontal Pod Autoscaler) for the workload, so that scale-out is triggered automatically when resource usage reaches the threshold; a sketch of such an HPA follows this list.
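For reference, below is a minimal sketch of the kind of HPA this solution uses. In this walkthrough you do not create it by hand; the arena command in Step 3 creates an equivalent one automatically. The Deployment name deepseek-predictor and the availability of the DCGM_CUSTOM_PROCESS_SM_UTIL metric through the cluster's custom metrics API are assumptions here; adjust them to your environment.

# A minimal HPA sketch (assumption: the workload is a Deployment named
# deepseek-predictor, and DCGM_CUSTOM_PROCESS_SM_UTIL is served by the
# custom metrics API). The arena command in Step 3 creates the real HPA.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-predictor
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_CUSTOM_PROCESS_SM_UTIL
      target:
        type: AverageValue
        averageValue: "50"
EOF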
Solution advantages
Extreme elasticity: large-scale, second-level scaling that quickly absorbs traffic peaks.
Fine-grained cost control: no need to purchase your own servers; pay-as-you-go billing keeps costs transparent and controllable.
Diverse elastic resources: supports different instance types, such as CPU and GPU.
Prerequisites
Choose a region as the central region and create an ACK Edge cluster in that region.
Create an edge node pool of the dedicated-network type and add the IDC machines to that node pool.
Procedure
Step 1: Prepare the DeepSeek-R1-Distill-Qwen-7B model files
Downloading and uploading the model files usually takes 1 to 2 hours. You can submit a ticket to have the model files copied to your OSS bucket quickly.
Run the following commands to download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope.
Make sure the git-lfs plugin is installed. If it is not, install it with yum install git-lfs or apt-get install git-lfs. For more installation methods, see Install git-lfs.
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull
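Before uploading, you can optionally sanity-check the download. This is a sketch; the exact file layout on ModelScope may differ, and the size estimate assumes 16-bit weights for a 7B-parameter model.

# Optional sanity check from inside the repo directory: 16-bit weights for a
# 7B-parameter model should total roughly 14-15 GB once `git lfs pull` finishes.
du -sh .
ls -lh *.safetensors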
Create a directory in OSS and upload the model to it.
For how to install and use the ossutil tool, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
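You can then list the uploaded objects to confirm the copy completed; the path matches the commands above.

# Confirm the model files landed in the expected prefix.
ossutil ls oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B/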
Create a PV and a PVC. Configure a persistent volume (PV) and a persistent volume claim (PVC) named llm-model for the target cluster. For details, see Statically mount an OSS volume. The basic configuration of the example PV is as follows:
- Volume type: OSS
- Name: llm-model
- Access credentials: the AccessKey ID and AccessKey Secret used to access OSS.
- Bucket ID: the OSS bucket created in the previous step.
- OSS Path: the path where the model resides, for example /models/DeepSeek-R1-Distill-Qwen-7B.

The basic configuration of the example PVC is as follows:

- PVC type: OSS
- Name: llm-model
- Allocation mode: select Existing Volume.
- Existing volume: click the Select Existing Volume link and choose the PV created above.
The following is an example YAML:
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>         # AccessKey ID used to access OSS
  akSecret: <your-oss-sk>     # AccessKey Secret used to access OSS
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>    # Bucket name
      url: <your-bucket-endpoint>   # Endpoint, for example oss-cn-hangzhou-internal.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path>       # In this example, /models/DeepSeek-R1-Distill-Qwen-7B/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
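After applying the YAML, a quick check that the volume is usable (names follow the manifest above):

# Both commands should report STATUS "Bound".
kubectl get pv llm-model
kubectl get pvc llm-model -n default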
Step 2: Create a custom scheduling policy (ResourcePolicy)
Create a ResourcePolicy CR to define the priority in which elastic resources are scheduled. In this example, the labelSelector matches the application labeled isvc.deepseek-predictor, and the rule specifies that the application should be scheduled to the edge IDC resource pool first, and to cloud virtual nodes when edge IDC resources are insufficient. For more about ResourcePolicy, see Custom priority-based elastic resource scheduling.
When you create the application Pods later, add a label matching the labelSelector below to associate them with this scheduling policy.
Create the ResourcePolicy and save it as deepseek-resourcepolicy.yaml.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: deepseek
  namespace: default
spec:
  selector:
    app: isvc.deepseek-predictor    # Must match the label of the Pods created later.
  strategy: prefer
  units:
    - resource: ecs
      nodeSelector:
        alibabacloud.com/nodepool-id: np*********    # ID of the edge node pool.
    - resource: eci
Deploy the custom scheduling policy in the cluster to define the scheduling priority.
kubectl create -f deepseek-resourcepolicy.yaml
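To confirm the policy is in place, you can query the object; the kind and name follow the manifest above.

# Inspect the ResourcePolicy object and its unit priorities.
kubectl get resourcepolicy deepseek -n default
kubectl describe resourcepolicy deepseek -n default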
Step 3: Deploy the model
Query the status of the nodes in the cluster.
kubectl get nodes -owide
Expected output:
NAME                            STATUS   ROLES    AGE     VERSION            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                              KERNEL-VERSION           CONTAINER-RUNTIME
cn-hangzhou.10.4.XX.25          Ready    <none>   10d     v1.30.7-aliyun.1   10.4.0.25     <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.36
cn-hangzhou.10.4.XX.26          Ready    <none>   10d     v1.30.7-aliyun.1   10.4.0.26     <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.36
idc001                          Ready    <none>   31s     v1.30.7-aliyun.1   10.4.0.185    <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.36
virtual-kubelet-cn-hangzhou-b   Ready    agent    7d21h   v1.30.7-aliyun.1   10.4.0.180    <none>        <unknown>                                             <unknown>                <unknown>
The output shows that the cluster has one IDC node (idc001) and one virtual node (virtual-kubelet-cn-hangzhou-b). The IDC node has one V100 GPU, which you can confirm as shown below.
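To double-check that the GPU is registered with the scheduler, read the node's allocatable GPU count. This assumes the IDC node runs the NVIDIA device plugin and exposes the standard nvidia.com/gpu resource.

# Expected output: 1 (one V100 registered on the IDC node).
kubectl get node idc001 -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"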
Deploy the DeepSeek model inference service based on the vLLM inference framework.
arena serve kserve \
    --name=deepseek \
    --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6e-c12g1.3xlarge \
    --annotation=k8s.aliyun.com/eci-vswitch=vsw-*********,vsw-********* \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
    --scale-target=50 \
    --min-replicas=1 \
    --max-replicas=3 \
    --data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager --dtype=half"
The main parameters are described as follows:

- --name: the name of the submitted inference service; must be globally unique. Example: deepseek
- --image: the image address of the inference service. This example uses the vLLM inference framework. Example: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6
- --gpus: the number of GPUs the inference service uses. Default: 0. Example: 1
- --cpu: the number of CPUs the inference service uses. Example: 4
- --memory: the amount of memory the inference service uses. Example: 12Gi
- --scale-metric: the metric used for autoscaling. This example scales the application on the GPU utilization metric DCGM_CUSTOM_PROCESS_SM_UTIL. For more metrics, see Configure HPA. Example: DCGM_CUSTOM_PROCESS_SM_UTIL
- --scale-target: the scaling target. When GPU utilization exceeds 50%, replicas are scaled out. Example: 50
- --min-replicas: the minimum number of replicas. Example: 1
- --max-replicas: the maximum number of replicas. Example: 3
- --data: the model address of the service. This example mounts the model stored in the llm-model volume to /model/DeepSeek-R1-Distill-Qwen-7B in the container. Example: llm-model:/model/DeepSeek-R1-Distill-Qwen-7B
Expected output:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /Users/bingchang/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /Users/bingchang/.kube/config
horizontalpodautoscaler.autoscaling/deepseek-hpa created
inferenceservice.serving.kserve.io/deepseek created
INFO[0002] The Job deepseek has been submitted successfully
INFO[0002] You can run `arena serve get deepseek --type kserve -n default` to check the job status
Query the details of the inference service.
arena serve get deepseek
Expected output:
Name:       deepseek
Namespace:  default
Type:       KServe
Version:    1
Desired:    1
Available:  1
Age:        1m
Address:    http://deepseek-default.example.com
Port:       :80
GPU:        1

Instances:
  NAME                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                 ------   ---  -----  --------  ---  ----
  deepseek-predictor-6b9455f8c5-wl5lc  Running  1m   1/1    0         1    idc001
The output shows that the inference service's Pod was scheduled to the IDC node, matching the custom scheduling priority.
Verify that the inference service works by sending the following request. The request address can be obtained from the details of the Ingress resource that KServe creates automatically.
curl -H "Host: deepseek-default.example.com" -H "Content-Type: application/json" http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
{"id":"chatcmpl-efc1225ad2f33cc39a8ddbc4039a41b9","object":"chat.completion","created":1739861087,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, so I need to figure out how to say \"This is a test!\" in Spanish. Hmm, I'm not super fluent in Spanish, but I know some basic phrases. Let me think about how to approach this.\n\nFirst, I remember that \"test\" is \"prueba\" in Spanish. So maybe I can start with \"Esto es una prueba.\" But I'm not sure if that's the best way to say it. Maybe there's a more common expression or a different structure.\n\nWait, I think there's a phrase that's commonly used in tests. Isn't it something like \"This is a test.\" or \"This is a quiz.\"? I think the Spanish equivalent would be \"Este es un test.\" That sounds more natural. Let me check if that makes sense.\n\nI can also think about how people use phrases in tests. Maybe they use \"This is the test\" or \"This is an exam.\" So perhaps \"Este es el test.\" or \"Este es el examen.\" I'm not sure which one is more appropriate.\n\nI should also consider the grammar. \"This is a test\" is a simple statement, so the subject is \"this\" (using \"este\"), the verb is \"is\" (using \"es\"), and the object is \"a test\" (using \"un test\"). So putting it together, it would be \"Este es un test.\"\n\nWait, but sometimes people use \"This is the test\" when referring to an important one, so maybe \"Este es el test.\" But I'm not entirely sure if that's the correct structure. Let me think about other similar phrases.\n\nI also recall that in some contexts, people might say \"This is a practice test\" or \"This is a sample test.\" But since the user just said \"This is a test,\" the most straightforward translation would be \"Este es un test.\"\n\nI should also consider if there are any idiomatic expressions or common phrases that are used in this context. For example, \"This is the test\" is often used to mean a significant exam or evaluation, so \"Este es el test\" might be more appropriate in that context.\n\nBut I'm a bit confused because I'm not 100% sure about the correct structure. Maybe I should look up some examples. Oh, wait, I can't look things up right now, so I'll have to rely on my memory.\n\nI think the basic structure is subject + verb + object. So \"this\" (this is \"este","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":11,"total_tokens":523,"completion_tokens":512,"prompt_tokens_details":null},"prompt_logprobs":null}
Step 4: Simulate peak business traffic to trigger cloud elasticity
Use the stress-testing tool hey to send a large number of requests to the deployed inference service.
hey -z 5m -c 5 \
    -m POST -host deepseek-default.example.com \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
    http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions
These requests are served by the existing Pod, but under this load GPU utilization climbs past the 50% threshold, which triggers Pod scale-out. You can watch the process as shown below.
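While hey runs, you can watch the HPA that arena created (its name, deepseek-hpa, appears in the deployment output of Step 3) react to the rising metric:

# Watch the replica count climb as GPU utilization crosses the 50% target.
kubectl get hpa deepseek-hpa -n default -w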
Query the inference service details.
arena serve get deepseek
Expected output:
Name:       deepseek
Namespace:  default
Type:       KServe
Version:    1
Desired:    3
Available:  2
Age:        18m
Address:    http://deepseek-default.example.com
Port:       :80
GPU:        3

Instances:
  NAME                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                 ------   ---  -----  --------  ---  ----
  deepseek-predictor-6b9455f8c5-dtzdv  Running  1m   0/1    0         1    virtual-kubelet-cn-hangzhou-b
  deepseek-predictor-6b9455f8c5-wl5lc  Running  18m  1/1    0         1    idc001
  deepseek-predictor-6b9455f8c5-zmpg8  Running  5m   1/1    0         1    virtual-kubelet-cn-hangzhou-b
At this point, two additional Pod replicas of the inference task have been scaled out onto the virtual node. After the load subsides, you can watch the extra replicas scale back in, as shown below.
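Once the load test ends and GPU utilization falls back below the target, the HPA scales the service back in. The following is a sketch for watching this, assuming the Pods carry the app=isvc.deepseek-predictor label used in the ResourcePolicy; with the prefer strategy, the replicas on the virtual node are expected to be removed before the one on the IDC node.

# Watch the extra replicas on the virtual node terminate as the HPA scales in.
kubectl get pods -l app=isvc.deepseek-predictor -o wide -w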