Best Practice: PD-Disaggregated Deployment of MoE Models on ACS (EP-Optimized)


Since the release of DeepSeek and other MoE models, expert parallelism has become an unavoidable engineering challenge: deploying an MoE model with expert parallelism is one of the most critical steps in taking a DeepSeek model to production. With Container Compute Service (ACS) you can use GPU compute out of the box, without deep knowledge of the underlying hardware and without managing or configuring GPU nodes. ACS is simple to deploy and supports pay-as-you-go billing, which makes it well suited to LLM inference and effective at reducing inference cost. This topic describes how to use ACS GPU compute to deploy a production-ready, disaggregated expert-parallel inference service for DeepSeek-R1.

Background

Expert parallelism

Expert parallelism (EP) is key to the performance of MoE models in production. In the FFN stage, each token is dispatched to only the top-K experts, so each expert processes only a subset of tokens and compute is poorly utilized. The core idea of expert parallelism is to place the model's Expert modules (sub-models) on different compute devices, by function or by parameters, so that each device is responsible only for its assigned experts. By decoupling model size from the limits of a single device, this form of parallelism lets models scale to trillions of parameters, while a dynamic routing mechanism (such as a gating network) dispatches each input to the best-matching experts, improving compute efficiency.
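The top-K routing described above can be sketched in a few lines of plain Python. This is an illustrative toy only (real gating networks are learned, and production kernels batch the dispatch per device); all names here are ours, not SGLang's.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of gate logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(gate_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(probs[i] for i in chosen)
    return [(i, probs[i] / total) for i in chosen]

# 8 experts, one token's gate scores. Under expert parallelism the chosen
# experts may live on different GPUs, so the token is dispatched to them
# over the network and the weighted results are combined afterwards.
assignment = route_token([0.1, 2.0, -1.0, 0.5, 1.8, 0.0, -0.5, 0.3], top_k=2)
print(assignment)
```

Each token thus touches only `top_k` of the 8 experts, which is why spreading the experts over 16 GPUs (EP16) raises utilization instead of leaving most experts idle on one device.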

SGLang

SGLang is a high-performance serving framework designed specifically for large language models (LLMs). By co-designing the backend runtime and the frontend language, it makes interaction between users and models more efficient and easier to control.

Prerequisites

Deployment plan

ACS uses the RBG (RoleBasedGroup) controller to orchestrate the deployment, ensuring that all components of the PD-disaggregated, EP-parallel deployment start in order and run reliably. The overall architecture is as follows:


The deployment configuration is as follows:

| Configuration | Value | Description |
| --- | --- | --- |
| EP size | EP16 | EP16 is recommended for running DeepSeek on GU8TEF cards and gives the best performance. A GU8TEF Pod supports 8 GPUs, so one EP16 instance spans two GU8TEF Pods. |
| PD ratio | 2P:1D | Automatic adjustment of the PD ratio is not yet supported for EP + PD disaggregation. |
| Pod size | GPU: GU8TEF ×8; CPU: 184 vCPU; MEM: 1800 GiB | A single Pod contains 8 GU8TEF GPUs, 184 vCPU, and 1800 GiB of memory. |
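As a quick sanity check, the numbers in the table combine as follows (assuming the 2P:1D layout, with one Pod per prefill instance and two Pods per EP16 decode instance):

```python
gpus_per_pod = 8               # one GU8TEF Pod exposes 8 GPUs
pods_per_decode_instance = 2   # an EP16 instance spans two Pods
ep_size = gpus_per_pod * pods_per_decode_instance

prefill_pods = 2                               # 2P: two one-Pod prefill instances
decode_pods = 1 * pods_per_decode_instance     # 1D: one two-Pod decode instance
scheduler_pods = 1
total_pods = prefill_pods + decode_pods + scheduler_pods
print(ep_size, total_pods)
```

This matches the five `lingjun-pd-*` Pods listed later in Step 2.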

Procedure

Step 1: Prepare the model files

Because of their huge parameter counts, large language models require a lot of disk space for their model files. We recommend persisting the files on a NAS or OSS volume; this topic uses OSS. The following example uses the DeepSeek-R1 model.

  1. Run the following commands to download the DeepSeek-R1 model from ModelScope.

    Make sure the git-lfs plugin is installed. If it is not, install it with yum install git-lfs or apt-get install git-lfs. For other installation methods, see Installing git-lfs.

    yum install git-lfs -y
    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1.git
    cd DeepSeek-R1
    git lfs pull
  2. Install ossutil.

  3. Create a directory in OSS and upload the model to it.

    ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1
    ossutil cp -r ./DeepSeek-R1 oss://<your-bucket-name>/models/DeepSeek-R1
  4. With the model stored in OSS, read it through a PV and PVC.

    1. Configure the AK/SK used to access OSS.

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
      stringData:
        akId: <your-oss-ak> # AccessKey ID used to access OSS
        akSecret: <your-oss-sk> # AccessKey Secret used to access OSS
    2. Create the PV and PVC resources.

      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi 
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: <your-bucket-name> # bucket name
            url: <your-bucket-endpoint> # endpoint, for example oss-cn-hangzhou-internal.aliyuncs.com
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: <your-model-path> # /models/DeepSeek-R1/ in this example
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model
    3. After applying the above resources to the cluster, check them with kubectl:

      1. Check the PVC.

        kubectl get pvc

        Expected output:

        NAME        STATUS   VOLUME      CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
        llm-model   Bound    llm-model   30Gi       ROX                           <unset>                 3m19s            
      2. Check the PV.

        kubectl get pv

        Expected output:

        NAME        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
        llm-model   30Gi       ROX            Retain           Bound    default/llm-model                  <unset>                          3m53s
      3. Verify the model path.

        The model path is an important parameter in the PD-disaggregated deployment that follows, so it is worth launching a test Pod first to verify that the path is correct.

        Detailed steps

        1. Create the verification Pod.

          apiVersion: v1
          kind: Pod
          metadata:
            labels:
              name: test-model
            name: test-model
          spec:
            volumes:
              - name: llm-model
                persistentVolumeClaim:
                  ## name of the PVC configured above
                  claimName: llm-model
            containers:
              - name: alinux3
                image: alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/alinux3:latest
                command: ["/bin/sleep", "infinity"]
                volumeMounts:
                  ## path of the model inside the container
                  - mountPath: /models/DeepSeek-R1
                    name: llm-model
        2. Enter the container and check that the model path /models/DeepSeek-R1 is correct.

          1. Get the test Pod.

            kubectl get pods -l name=test-model

            Expected output:

            NAME         READY   STATUS    RESTARTS   AGE
            test-model   1/1     Running   0          49s
          2. List the model files.

            kubectl exec -it test-model -c alinux3 -- /bin/bash

            Inside the container, run:

            ls /models/DeepSeek-R1/

            Expected output:

            LICENSE                            model-00037-of-000163.safetensors  model-00081-of-000163.safetensors  model-00125-of-000163.safetensors
            README.md                          model-00038-of-000163.safetensors  model-00082-of-000163.safetensors  model-00126-of-000163.safetensors
            config.json                        model-00039-of-000163.safetensors  model-00083-of-000163.safetensors  model-00127-of-000163.safetensors
            configuration.json                 model-00040-of-000163.safetensors  model-00084-of-000163.safetensors  model-00128-of-000163.safetensors
            configuration_deepseek.py          model-00041-of-000163.safetensors  model-00085-of-000163.safetensors  model-00129-of-000163.safetensors
            figures                            model-00042-of-000163.safetensors  model-00086-of-000163.safetensors  model-00130-of-000163.safetensors
            generation_config.json             model-00043-of-000163.safetensors  model-00087-of-000163.safetensors  model-00131-of-000163.safetensors
            lfs_pull.log                       model-00044-of-000163.safetensors  model-00088-of-000163.safetensors  model-00132-of-000163.safetensors
            model-00001-of-000163.safetensors  model-00045-of-000163.safetensors  model-00089-of-000163.safetensors  model-00133-of-000163.safetensors
            model-00002-of-000163.safetensors  model-00046-of-000163.safetensors  model-00090-of-000163.safetensors  model-00134-of-000163.safetensors
            model-00003-of-000163.safetensors  model-00047-of-000163.safetensors  model-00091-of-000163.safetensors  model-00135-of-000163.safetensors
            model-00004-of-000163.safetensors  model-00048-of-000163.safetensors  model-00092-of-000163.safetensors  model-00136-of-000163.safetensors
            model-00005-of-000163.safetensors  model-00049-of-000163.safetensors  model-00093-of-000163.safetensors  model-00137-of-000163.safetensors
            model-00006-of-000163.safetensors  model-00050-of-000163.safetensors  model-00094-of-000163.safetensors  model-00138-of-000163.safetensors
            model-00007-of-000163.safetensors  model-00051-of-000163.safetensors  model-00095-of-000163.safetensors  model-00139-of-000163.safetensors
            model-00008-of-000163.safetensors  model-00052-of-000163.safetensors  model-00096-of-000163.safetensors  model-00140-of-000163.safetensors
            model-00009-of-000163.safetensors  model-00053-of-000163.safetensors  model-00097-of-000163.safetensors  model-00141-of-000163.safetensors
            model-00010-of-000163.safetensors  model-00054-of-000163.safetensors  model-00098-of-000163.safetensors  model-00142-of-000163.safetensors
            model-00011-of-000163.safetensors  model-00055-of-000163.safetensors  model-00099-of-000163.safetensors  model-00143-of-000163.safetensors
            model-00012-of-000163.safetensors  model-00056-of-000163.safetensors  model-00100-of-000163.safetensors  model-00144-of-000163.safetensors
            model-00013-of-000163.safetensors  model-00057-of-000163.safetensors  model-00101-of-000163.safetensors  model-00145-of-000163.safetensors
            model-00014-of-000163.safetensors  model-00058-of-000163.safetensors  model-00102-of-000163.safetensors  model-00146-of-000163.safetensors
            model-00015-of-000163.safetensors  model-00059-of-000163.safetensors  model-00103-of-000163.safetensors  model-00147-of-000163.safetensors
            model-00016-of-000163.safetensors  model-00060-of-000163.safetensors  model-00104-of-000163.safetensors  model-00148-of-000163.safetensors
            model-00017-of-000163.safetensors  model-00061-of-000163.safetensors  model-00105-of-000163.safetensors  model-00149-of-000163.safetensors
            model-00018-of-000163.safetensors  model-00062-of-000163.safetensors  model-00106-of-000163.safetensors  model-00150-of-000163.safetensors
            model-00019-of-000163.safetensors  model-00063-of-000163.safetensors  model-00107-of-000163.safetensors  model-00151-of-000163.safetensors
            model-00020-of-000163.safetensors  model-00064-of-000163.safetensors  model-00108-of-000163.safetensors  model-00152-of-000163.safetensors
            model-00021-of-000163.safetensors  model-00065-of-000163.safetensors  model-00109-of-000163.safetensors  model-00153-of-000163.safetensors
            model-00022-of-000163.safetensors  model-00066-of-000163.safetensors  model-00110-of-000163.safetensors  model-00154-of-000163.safetensors
            model-00023-of-000163.safetensors  model-00067-of-000163.safetensors  model-00111-of-000163.safetensors  model-00155-of-000163.safetensors
            model-00024-of-000163.safetensors  model-00068-of-000163.safetensors  model-00112-of-000163.safetensors  model-00156-of-000163.safetensors
            model-00025-of-000163.safetensors  model-00069-of-000163.safetensors  model-00113-of-000163.safetensors  model-00157-of-000163.safetensors
            model-00026-of-000163.safetensors  model-00070-of-000163.safetensors  model-00114-of-000163.safetensors  model-00158-of-000163.safetensors
            model-00027-of-000163.safetensors  model-00071-of-000163.safetensors  model-00115-of-000163.safetensors  model-00159-of-000163.safetensors
            model-00028-of-000163.safetensors  model-00072-of-000163.safetensors  model-00116-of-000163.safetensors  model-00160-of-000163.safetensors
            model-00029-of-000163.safetensors  model-00073-of-000163.safetensors  model-00117-of-000163.safetensors  model-00161-of-000163.safetensors
            model-00030-of-000163.safetensors  model-00074-of-000163.safetensors  model-00118-of-000163.safetensors  model-00162-of-000163.safetensors
            model-00031-of-000163.safetensors  model-00075-of-000163.safetensors  model-00119-of-000163.safetensors  model-00163-of-000163.safetensors
            model-00032-of-000163.safetensors  model-00076-of-000163.safetensors  model-00120-of-000163.safetensors  model.safetensors.index.json
            model-00033-of-000163.safetensors  model-00077-of-000163.safetensors  model-00121-of-000163.safetensors  modeling_deepseek.py
            model-00034-of-000163.safetensors  model-00078-of-000163.safetensors  model-00122-of-000163.safetensors  tokenizer.json
            model-00035-of-000163.safetensors  model-00079-of-000163.safetensors  model-00123-of-000163.safetensors  tokenizer_config.json
            model-00036-of-000163.safetensors  model-00080-of-000163.safetensors  model-00124-of-000163.safetensors
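The listing above can be checked programmatically: DeepSeek-R1 ships as 163 safetensors shards plus an index file. A minimal sketch (a temp directory simulates the mount here; inside the Pod you would point model_dir at /models/DeepSeek-R1 and drop the simulation loop):

```python
import pathlib
import tempfile

expected_shards = 163
model_dir = pathlib.Path(tempfile.mkdtemp())

# Simulate the mounted model directory (remove this block in the Pod).
for i in range(1, expected_shards + 1):
    (model_dir / f"model-{i:05d}-of-000163.safetensors").touch()
(model_dir / "model.safetensors.index.json").touch()

# The actual check: count the shards and confirm the index file exists.
found = len(list(model_dir.glob("model-*-of-000163.safetensors")))
index_ok = (model_dir / "model.safetensors.index.json").exists()
print(f"{found}/{expected_shards} shards, index present: {index_ok}")
```

A missing shard here would fail fast, instead of surfacing later as a cryptic load error in the prefill/decode Pods.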

Step 2: Deploy the MoE model with RBG (2P1D)

  1. Deploy the EP-parallel version of DeepSeek with RBG.

    1. Deploy the service registration component.

      Save the following content as lingjun-runtime.yaml, then run kubectl apply -f lingjun-runtime.yaml. It only needs to be applied once per ACS cluster.
      apiVersion: workloads.x-k8s.io/v1alpha1
      kind: ClusterEngineRuntimeProfile
      metadata:
        name: lingjun-runtime
      spec:
        containers:
          - image: registry-cn-hangzhou.ack.aliyuncs.com/acs/patio-runtime:v0.3.0
            imagePullPolicy: Always
            name: patio-runtime
            volumeMounts:
              - mountPath: /etc/patio
                name: patio-group-config
            env:
              - name: INFERENCE_ENGINE_ENDPOINT
                value: http://localhost:100
              - name: TOPO_TYPE
                value: "LingJun"
              - name: SCHEDULER_ROLE_NAME
                value: "scheduler"
        updateStrategy: NoUpdate
        volumes:
          - emptyDir: {}
            name: patio-group-config
    2. Deploy the PD-disaggregated instances; this creates the scheduler, prefill, and decode instances. Adjust the prefill and decode parameters according to the configuration table below.

      Save the following content as lingjun-pd.yaml, then run kubectl apply -f lingjun-pd.yaml.
      apiVersion: workloads.x-k8s.io/v1alpha1
      kind: RoleBasedGroup
      metadata:
        name: lingjun-pd
        namespace: default
      spec:
        roles:
          - engineRuntimes:
              - containers:
                  - env:
                      - name: INFERENCE_ENGINE_ENDPOINT
                        value: http://localhost:8008
                    name: patio-runtime
                profileName: lingjun-runtime
            name: scheduler
            replicas: 1
            template:
              metadata:
                labels:
                  alibabacloud.com/compute-class: performance
                  # fixed value: lingjun-pd
                  alibabacloud.com/inference-backend: lingjun-pd
                  # recommended: same as RoleBasedGroup.metadata.name
                  alibabacloud.com/inference-workload: "lingjun-pd"
                  pd-disagg: scheduler
              spec:
                containers:
                  - command:
                      - /bin/bash
                      - -c
                      - /app/scheduler/scheduler --port=8008 --PD-mode=advanced  --scheduler-mode=LoadBalance --tokenizer-path=/models/DeepSeek-R1/tokenizer.json --enable-health-check=false
                    image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.07-dpd-scheduler-250902
                    name: scheduler
                    ports:
                      - containerPort: 8008
                        name: http
                        protocol: TCP
                    env:
                      - name: CACHELINK_PORT_OFFSET
                        value: "100"
                      - name: aliyun_logs_scheduler
                        value: stdout
                    resources:
                      limits:
                        cpu: "8"
                        ephemeral-storage: 100Gi
                        memory: 16Gi
                      requests:
                        cpu: "8"
                        ephemeral-storage: 100Gi
                        memory: 16Gi
                    volumeMounts:
                      - mountPath: /models/DeepSeek-R1
                        name: llm-model
                volumes:
                  - name: llm-model
                    persistentVolumeClaim:
                      claimName: llm-model
            workload:
              apiVersion: apps/v1
              kind: StatefulSet
          - engineRuntimes:
              - containers:
                  - args:
                      - --instance-info={"data":{"port":100,"backend":"sglang","tags":{"worker_role":"prefill-only","gpu_per_node":8}},"topo_type":"LingJun"}
                    name: patio-runtime
                profileName: lingjun-runtime
            leaderWorkerSet:
              size: 1
            name: prefill
            replicas: 2
            restartPolicy: None
            template:
              metadata:
                labels:
                  alibabacloud.com/compute-class: gpu
                  alibabacloud.com/enable-gdr-copy: "true"
                  alibabacloud.com/enable-ibgda: "true"
                  alibabacloud.com/gpu-model-series: GU8TEF
                  alibabacloud.com/hpn-type: rdma
                  # fixed value: lingjun-pd
                  alibabacloud.com/inference-backend: lingjun-pd
                  # recommended: same as RoleBasedGroup.metadata.name
                  alibabacloud.com/inference-workload: "lingjun-pd"
              spec:
                containers:
                  - command:
                      - /bin/bash
                      - -c
                      - |
                        sysctl -w net.ipv4.ip_local_reserved_ports=100-300;
                        model_path=/models/DeepSeek-R1
                        DP8="--enable-dp-attention --dp-size 8 --dp-endpoint"
                        SPS="--speculative-algo=NEXTN --speculative-num-steps=2  --speculative-eagle-topk=1 --speculative-num-draft-tokens=3"
                        source set_env.sh && LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=1 SGL_CHUNKED_PREFIX_CACHE_THRESHOLD=1 python -m sglang.launch_server --model-path=$model_path \
                        --trust-remote-code --host=0.0.0.0 --port=100 --tp=8 --attention-backend=fa3 --disable-radix-cache --max-running-requests=1500  --mem-fraction-static=0.84 --disable-cuda-graph --chunked-prefill-size=32768 --kv-cache-dtype=fp8_e4m3 $SPS
                    env:
                      - name: OSS_ACCESS_KEY_ID
                        valueFrom:
                          secretKeyRef:
                            name: oss-secret
                            key: akId
                      - name: OSS_ACCESS_KEY_SECRET
                        valueFrom:
                          secretKeyRef:
                            name: oss-secret
                            key: akSecret
                      - name: OSS_REGION
                        value: {region}
                      - name: OSS_ENDPOINT
                        value: oss-{region}-internal.aliyuncs.com
                      - name: OSS_PATH
                        value: oss://{oss-bucket}/models/DeepSeek-R1/
                      - name: MODEL_DIR
                        value: /models/DeepSeek-R1
                      - name: DISABLE_PORT_OFFSET
                        value: "1"
                      - name: CACHELINK_UNIFY_NODE_PORT
                        value: "1"
                      - name: CACHELINK_PORT_OFFSET
                        value: "100"
                      - name: LWS_WORKER_INDEX
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
                      - name: aliyun_logs_prefill
                        value: stdout
                    image: acs-registry-vpc.{region}.cr.aliyuncs.com/egslingjun/sglang-nv:25.09-sglang-0.5.1.post2_ep0908-20250908
                    name: worker
                    ports:
                      - containerPort: 100
                        name: http
                        protocol: TCP
                    resources:
                      limits:
                        cpu: "184"
                        memory: 1800Gi
                        nvidia.com/gpu: 8
                      requests:
                        cpu: "184"
                        memory: 1800Gi
                        nvidia.com/gpu: 8
                    volumeMounts:
                      - mountPath: /models/DeepSeek-R1
                        name: llm-model
                      - mountPath: /dev/shm
                        name: shm
                securityContext:
                  sysctls:
                    - name: net.ipv4.tcp_mem
                      value: 1115298 1487065 2230596
                    - name: net.ipv4.tcp_rmem
                      value: 4096 12582912 16777216
                    - name: net.ipv4.tcp_wmem
                      value: 4096 12582912 16777216
                volumes:
                  - name: llm-model
                    emptyDir: {}
                  - emptyDir:
                      medium: Memory
                      sizeLimit: 460Gi
                    name: shm
            workload:
              apiVersion: leaderworkerset.x-k8s.io/v1
              kind: LeaderWorkerSet
          - engineRuntimes:
              - containers:
                  - args:
                      - --instance-info={"data":{"port":100,"backend":"sglang","tags":{"worker_role":"decode-only","ep_size":16,"gpu_per_node":8,"dp_size":16}}}
                    name: patio-runtime
                profileName: lingjun-runtime
            leaderWorkerSet:
              size: 2
            name: decode
            replicas: 1
            restartPolicy: None
            template:
              metadata:
                labels:
                  alibabacloud.com/compute-class: gpu
                  alibabacloud.com/enable-gdr-copy: "true"
                  alibabacloud.com/enable-ibgda: "true"
                  alibabacloud.com/gpu-model-series: GU8TEF
                  alibabacloud.com/hpn-type: rdma
                  # fixed value: lingjun-pd
                  alibabacloud.com/inference-backend: lingjun-pd
                  # recommended: same as RoleBasedGroup.metadata.name
                  alibabacloud.com/inference-workload: "lingjun-pd"
              spec:
                containers:
                  - command:
                      - /bin/bash
                      - -c
                      - |
                        sysctl -w net.ipv4.ip_local_reserved_ports=100-300;
                        model_path=/models/DeepSeek-R1
                        DP16="--enable-dp-attention --dp-size 16 --dp-endpoint"
                        SPS="--speculative-algo=NEXTN --speculative-num-steps=2  --speculative-eagle-topk=1 --speculative-num-draft-tokens=3"
                        EP="--moe-a2a-backend=deepep --deepep-mode=low_latency --enable-eplb --expert-distribution-recorder-mode=stat --ep-dispatch-algorithm=static"
                        DIST="--dist-init-addr $(LWS_LEADER_ADDRESS):6379 --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX)"
                        source set_env.sh && LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=1 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=288 python -m sglang.launch_server --model-path=$model_path --trust-remote-code --host=0.0.0.0 --port=100 --tp=16 $DP16 $EP --moe-dense-tp-size=1 --enable-dp-lm-head --cuda-graph-max-bs=96 --disable-chunked-prefix-cache --kv-cache-dtype=fp8_e4m3 --attention-backend=flashmla --mem-fraction-static=0.84 --chunked-prefill-size=2048 --max-running-requests=3000 $DIST $SPS
                    env:
                      - name: OSS_ACCESS_KEY_ID
                        valueFrom:
                          secretKeyRef:
                            name: oss-secret
                            key: akId
                      - name: OSS_ACCESS_KEY_SECRET
                        valueFrom:
                          secretKeyRef:
                            name: oss-secret
                            key: akSecret
                      - name: OSS_REGION
                        value: {region}
                      - name: OSS_ENDPOINT
                        value: oss-{region}-internal.aliyuncs.com
                      - name: OSS_PATH
                        value: oss://{oss-bucket}/models/DeepSeek-R1/
                      - name: MODEL_DIR
                        value: /models/DeepSeek-R1
                      - name: SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK
                        value: "384"
                      - name: DISABLE_PORT_OFFSET
                        value: "1"
                      - name: CACHELINK_UNIFY_NODE_PORT
                        value: "1"
                      - name: CACHELINK_PORT_OFFSET
                        value: "100"
                      - name: LWS_WORKER_INDEX
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
                      - name: aliyun_logs_decode
                        value: stdout
                    image: acs-registry-vpc.{region}.cr.aliyuncs.com/egslingjun/sglang-nv:25.09-sglang-0.5.1.post2_ep0908-20250908
                    name: worker
                    ports:
                      - containerPort: 100
                        name: http
                        protocol: TCP
                    resources:
                      limits:
                        cpu: "184"
                        memory: 1800Gi
                        nvidia.com/gpu: 8
                      requests:
                        cpu: "184"
                        memory: 1800Gi
                        nvidia.com/gpu: 8
                    volumeMounts:
                      - mountPath: /models/DeepSeek-R1
                        name: llm-model
                      - mountPath: /dev/shm
                        name: shm
                securityContext:
                  sysctls:
                    - name: net.ipv4.tcp_mem
                      value: 1115298 1487065 2230596
                    - name: net.ipv4.tcp_rmem
                      value: 4096 12582912 16777216
                    - name: net.ipv4.tcp_wmem
                      value: 4096 12582912 16777216
                volumes:
                  - name: llm-model
                    emptyDir: {}
                  - emptyDir:
                      medium: Memory
                      sizeLimit: 460Gi
                    name: shm
            workload:
              apiVersion: leaderworkerset.x-k8s.io/v1
              kind: LeaderWorkerSet

      The configuration above contains three roles: scheduler, prefill, and decode. Adjust their settings to your environment as follows:

      scheduler

        • command: the scheduler's start command; change --tokenizer-path=/models/DeepSeek-R1/tokenizer.json to the real model path.

        • volumeMounts: mountPath is the path of the model inside the container; adjust it as needed.

        • volumes: persistentVolumeClaim.claimName is the name of the OSS PVC created above.

      prefill/decode

        • replicas: the number of prefill/decode instances; in this example prefill: 2 and decode: 1.

        • command: the prefill/decode start command; change --model-path=/models/DeepSeek-R1 to the real model directory.

        • volumeMounts: mountPath is the path of the model inside the container; adjust it as needed.

        • volumes: persistentVolumeClaim.claimName is the name of the OSS PVC created above. emptyDir.sizeLimit is the size of the in-memory file system; about half of the Pod memory is recommended.

        • labels: alibabacloud.com/gpu-model-series is the GPU card series, for example GU8TEF.

        • resources: set cpu, memory, and gpu according to the card and instance type.

        • env: the following OSS settings must match those used in the "Create the PV and PVC resources" step:

          • OSS_REGION: the OSS region, for example cn-shanghai.

          • OSS_ENDPOINT: the OSS VPC-internal endpoint, for example oss-cn-shanghai-internal.aliyuncs.com.

          • OSS_PATH: the OSS path of the model, for example oss://<your-bucket-name>/models/DeepSeek-R1.

          The OSS Connector is used here to speed up model loading and is currently recommended in Beijing, Shanghai, Hangzhou, Shenzhen, and Singapore. For other regions, see Using Fluid to accelerate model access.

        • image: replace {region} in acs-registry-vpc.{region}.cr.aliyuncs.com/egslingjun/sglang-nv:25.09-sglang-0.5.1.post2_ep0908-20250908 with the actual region, for example cn-shanghai.

        • restartPolicy: the policy applied when a Pod fails:

          • RecreateRoleInstanceOnPodRestart: in the EP16 scenario, two Pods form one instance; if one Pod fails to start, the whole instance must be recreated to recover. Recommended for production. Logs can be persisted to SLS; see Collecting application logs through Pod environment variables.

          • None: a failed Pod does not trigger recreation of the whole instance. Recommended during testing, so that logs can be inspected with kubectl logs -p.
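Switching to the production behavior only changes the role-level restartPolicy in lingjun-pd.yaml. A fragment of the decode role with the production setting (field names exactly as in the manifest above):

```yaml
# Fragment of the decode role in lingjun-pd.yaml, production setting:
# a failed Pod triggers recreation of the whole two-Pod EP16 instance.
- name: decode
  replicas: 1
  restartPolicy: RecreateRoleInstanceOnPodRestart
  leaderWorkerSet:
    size: 2
```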

  2. Loading the model takes some time; wait until all the Pods are ready.

    kubectl get pods | grep lingjun-pd

    Expected output:

    NAME                                READY   STATUS      RESTARTS   AGE
    lingjun-pd-decode-0                 2/2     Running     0          22h
    lingjun-pd-decode-0-1               1/1     Running     0          22h
    lingjun-pd-prefill-0                2/2     Running     0          22h
    lingjun-pd-prefill-1                2/2     Running     0          22h
    lingjun-pd-scheduler-0              2/2     Running     0          22h
  3. Route traffic to the scheduler through an SLB to enable public access.

    Save the following content as lingjun-service.yaml, then run kubectl apply -f lingjun-service.yaml.
    apiVersion: v1
    kind: Service
    metadata:
      name: lingjun-service
      annotations:
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain: "on"
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain-timeout: "300"
    spec:
      ports:
        - port: 80
          targetPort: 8008
          protocol: TCP
      selector:
        ## labels of the scheduler role in the RBG above
        pd-disagg: scheduler
      type: LoadBalancer
  4. Access the SGLang service.

    EXTERNAL_IP=$(kubectl get svc lingjun-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    curl http://$EXTERNAL_IP/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "ds",
        "messages": [
          {
            "role": "user",
            "content": "给闺女写一份来自未来2035的信,同时告诉她要好好学习科技,做科技的主人,推动科技,经济发展;她现在是3年级"
          }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10
      }'

    Expected output:

    {"id":"18c7cd5bbce14e6fa1a34e7a68e4f934","object":"chat.completion","created":1767690287,"model":"ds","choices":[{"index":0,"message":{"role":"assistant","content":"好的,用户让我写一封来自2035年的信,给他的女儿,现在三年级。需要鼓励她好好学习科技,成为科技的主人,推动科技和经济发展。首先,我需要确定信的语气,既要有未来感,又要符合父亲对孩子的关爱。\n\n接下来,得考虑未来科技的可能发展,比如量子计算、脑机接口、太空探索,这些元素能让信显得真实可信。然后要结合三年级孩子的认知水平,用简单易懂的例子,比如全息投影、智能助手,这样她更容易理解和感兴趣。\n\n还要融入情感元素,避免过于说教。可能需要加入一些未来的生活场景,比如在火星建立基地,或者用纳米机器人治疗疾病,这样既展示科技的力量,又激发她的想象力。同时,要强调学习的重要性,尤其是数学、编程和科学思维,但要用鼓励的方式,而不是强迫。\n\n需要注意信的结构,开头问候,中间讲述未来的变化,然后联系到现在学习的建议,最后表达爱和期望。可能需要加入一些个人化的细节,比如提到她三年级的具体情况,比如作业或爱好,让信更亲切。\n\n还要检查是否传达了“做科技的主人”这个核心信息,避免科技控制人类,而是人类引导科技发展。可能需要举例子,比如AI管理城市,但人类制定规则,这样具体化概念。\n\n最后,确保语言温暖,充满希望,激励孩子主动学习,同时感受到父母的支持。可能需要在结尾加入未来的约定,比如一起探索科技,增强互动感。还要注意信的格式,比如日期用2035年,署名来自未来的爸爸,增加真实感。\n</think>\n\n**亲爱的宝贝:**\n\n现在是2035915日清晨,我坐在月球环形山基地的落地窗前给你写信。脚下的重力只有地球的六分之一,但爸爸对你的爱却比整个银河系还要辽阔。你知道吗?此刻我手腕上的神经接口正在把思绪转化成文字,而你的AI小助手\"星河\"会把这份思念折叠成一道量子讯息,瞬间穿越12年的时空,悄悄藏进你正在读的这封信里。\n\n昨天我刚结束了一场关于\"地球-火星经济带\"的会议,看着全息投影里穿梭如织的太空货船,突然想起你三年级时用乐高搭的歪歪扭扭的火箭模型。那时的你总爱追着问我:\"爸爸,为什么星星会眨眼?\"现在的你会知道,那是大气湍流在调皮,但更浪漫的是——那些星光里,可能藏着外星小朋友用引力波发来的早安问候。\n\n记得你上周数学作业本上那个被橡皮擦破的洞吗?2035年的纳米修复笔已经能补好时空裂缝了呢。不过爸爸最骄傲的,是现在的你正在用稚嫩的小手,握着铅笔解三元一次方程的样子。对,就是那道让你撅着嘴说\"永远算不出来\"的应用题。等你学到量子计算机原理时就会明白,今天这些看似枯燥的运算,正是打开未来之门的密码。\n\n上周六你蹲在花园观察蚂蚁搬家时,是否注意到叶尖的露珠里藏着整个宇宙?现在的生物纳米机器人正在珊瑚礁重建生态,而它们的核心算法,就源自你此刻生物课上画得歪歪扭扭的细胞结构图。你知道吗,那位总夸你\"观察仔细\"的科学老师,二十年后会在联合国青年科技峰会上,向世界展示你设计的生态城市模型。\n\n昨天我在太空电梯里遇见一位穿着粉色智能运动鞋的少女,她脑后的神经接口闪着和你一样的酒窝笑。那一瞬间我突然眼眶发热——是的,那就是2035年的你,正用自己编写的算法优化着地月轨道交通网。你三年级时在草稿纸上画的\"会飞的汽车\",现在正以反重力穿梭机的形态,载着探险家们驶向柯伊伯带。\n\n宝贝,这个周末和妈妈去科技馆时,记得多摸摸那个磁悬浮地球仪。二十年后,当你的指尖在全息星图上轻轻一划就能点亮整个太阳系的能源网络时,你会感谢此刻对世界充满好奇的自己。那些让你抓耳挠腮的编程启蒙题,那些看似\"没用\"的自然观察笔记,正在你大脑里编织着改变世界的神经回路。\n\n此刻,火星殖民地的晨曦正透过舷窗洒在我的键盘上。知道吗?你此刻书桌上那盏护眼台灯,在2035年已经进化成能调节生物节律的智能光场。但比这更明亮的是你眼中对知识渴望的光芒——那才是驱动人类文明向星辰大海远征的永恒能源。\n\n永远爱你的爸爸  \n2035年中秋·广寒宫科研站  \n(附:记得把今天捡的银杏叶夹在科学书里,二十年后它会成为你实验室门禁卡的生物密钥哦!)\n\n**P.S.** 
你此刻正在练习的《小星星变奏曲》,会在2035年国际空间站的跨年音乐会上,由你和智能钢琴协作演绎——用五维声波重构的","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":34,"total_tokens":1058,"completion_tokens":1024,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
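The same request can be issued from Python. Only the payload is constructed here so the sketch runs without the cluster; the commented lines show the actual call (they assume the `requests` package and a reachable EXTERNAL_IP, as obtained above).

```python
import json

# Same body as the curl example above; the prompt is a placeholder.
payload = {
    "model": "ds",
    "messages": [{"role": "user", "content": "Hello from ACS"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.9,
    "seed": 10,
}
body = json.dumps(payload)

# import requests
# resp = requests.post(f"http://{EXTERNAL_IP}/v1/chat/completions",
#                      headers={"Content-Type": "application/json"}, data=body)
# print(resp.json()["choices"][0]["message"]["content"])
print(len(body) > 0)
```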

Observability dashboard

The ACS PD-disaggregated deployment is integrated with the LLM inference service monitoring dashboard by default. For details, see Model-Level Panel.


Using Fluid to accelerate model access

The deployment above loads the model with the oss-connector by default. In Beijing, Shanghai, Shenzhen, Hangzhou, and Singapore, bandwidth is high and model loading is fast; in other regions where loading is slow, consider using Fluid to accelerate it.

  1. Install the Fluid component (version >= v1.0.14-*) with Helm from the ACS application marketplace. For details, see Managing ACS applications with Helm.


  2. Fluid requires the privileged and SYS_ADMIN capabilities; submit a ticket to have them enabled.

  3. Configure the AK/SK used to access OSS.

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <your-oss-ak> # AccessKey ID used to access OSS
      akSecret: <your-oss-sk> # AccessKey Secret used to access OSS
  4. Create the Fluid Dataset and JindoRuntime resources:

    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: llm-model-fluid
    spec:
      mounts:
        - encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  key: akId
                  name: oss-secret
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  key: akSecret
                  name: oss-secret
          mountPoint: oss://<your-bucket-name>/models/DeepSeek-R1
          name: llm-model-fluid
          options:
            fs.oss.endpoint: oss-<region>-internal.aliyuncs.com
          path: /
      placement: Shared
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      ## Must match the Dataset name
      name: llm-model-fluid
    spec:
      networkmode: ContainerNetwork
      ## Adjust as needed
      replicas: 16
      master:
        podMetadata:
          labels:
            alibabacloud.com/compute-class: performance
            alibabacloud.com/compute-qos: default
        resources:
          requests:
            cpu: 4
            memory: 8Gi
          limits:
            cpu: 4
            memory: 8Gi
      worker:
        podMetadata:
          labels:
            alibabacloud.com/compute-class: performance
            alibabacloud.com/compute-qos: default
        resources:
          requests:
            cpu: 16
            memory: 128Gi
          limits:
            cpu: 16
            memory: 128Gi
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            volumeType: emptyDir
            quota: 120Gi
            high: "0.99"
            low: "0.95"

    Dataset configuration reference:

      • mountPoint: the OSS path of the model, for example oss://<your-bucket-name>/models/DeepSeek-R1.

      • fs.oss.endpoint: the OSS VPC-internal endpoint, for example Singapore: oss-ap-southeast-1-internal.aliyuncs.com.

  5. Check the Pod resources. One JindoFS master Pod and 16 worker Pods (17 Pods in total) are created.

    kubectl get pods

    Expected output:

    NAME                               READY   STATUS      RESTARTS   AGE
    llm-model-fluid-jindofs-master-0    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-0    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-1    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-10   1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-11   1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-12   1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-13   1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-14   1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-15   1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-2    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-3    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-4    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-5    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-6    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-7    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-8    1/1     Running     0          21h
    llm-model-fluid-jindofs-worker-9    1/1     Running     0          21h
  6. Check the Dataset resource.

    kubectl get datasets

    Expected output (a PHASE of Bound means the dataset has started successfully):

    NAME              UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    llm-model-fluid   641.31GiB        0GiB        720.00GiB        0%                  Bound   9m33s
  7. Create a DataLoad resource to pre-warm the model:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: llm-model-fluid
    spec:
      dataset:
        ## dataset name
        name: llm-model-fluid
        namespace: default
      loadMetadata: true

    Depending on the model size and the region (OSS bandwidth varies by region), the warm-up takes a few minutes to complete. Run kubectl get dataloads; in the expected output below, the warm-up took 1m20s.

    NAME              DATASET           PHASE      AGE   DURATION
    llm-model-fluid   llm-model-fluid   Complete   15m   1m20s
  8. Check the warm-up status.

    kubectl get datasets

    Expected output: the dataset cache is at 100%, which means the warm-up is complete.

    NAME              UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    llm-model-fluid   641.31GiB        641.31GiB   720.00GiB        100.0%              Bound   9m33s
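The warm-up check above can be scripted by parsing the CACHED PERCENTAGE column of kubectl get datasets. Below is a minimal sketch; for offline illustration it parses a captured sample line, and in the cluster you would instead set line=$(kubectl get dataset llm-model-fluid --no-headers):

```shell
# Sketch of a warm-up check, assuming the Dataset name "llm-model-fluid" used above.
line='llm-model-fluid   641.31GiB   641.31GiB   720.00GiB   100.0%   Bound   9m33s'
# Column 5 is CACHED PERCENTAGE, e.g. "100.0%"; strip the percent sign.
pct=$(echo "$line" | awk '{gsub(/%/, "", $5); print $5}')
if [ "$pct" = "100.0" ]; then
  echo "warm-up complete"       # prints "warm-up complete"
else
  echo "still warming: ${pct}%"
fi
```

Wrapped in a loop with a sleep, the same check can gate the rollout of the inference workload until the cache is fully populated.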

Appendix - Fluid file prefetch configuration

annotations:
  file-prefetcher.fluid.io/inject: "true"
  # file-prefetcher.fluid.io/async-prefetch: "false"
  # file-prefetcher.fluid.io/image: "<fluid_prefetcher_image>"
  file-prefetcher.fluid.io/extra-envs: FILE_PREFETCHER_THREAD_POOL_SIZE=16
  file-prefetcher.fluid.io/file-list: pvc://llm-model-fluid/
  file-prefetcher.fluid.io/prefetch-timeout-seconds: "1200"

Annotation fields (field name / default / meaning):

inject

  • Default: false

  • Whether to enable file prefetching. The settings below take effect only when file prefetching is enabled.

file-list

  • Default: all files under every PVC that corresponds to a Fluid Dataset mounted in the Pod; equivalent to pvc://<pvc1>/**;pvc://<pvc2>/**;pvc://<pvc3>/**;...

  • The list of files to prefetch. Multiple entries are supported, separated by semicolons (;).

  • Each entry must have the form pvc://<pvc_name>/<glob_path>, where <pvc_name> must be a PVC mounted in the Pod that corresponds to a Fluid Dataset, and <glob_path> is a string supporting glob syntax.

  • For example, the following are all valid file lists:

    • pvc://jfs-llm-model/

    • pvc://jfs-llm-model/mymodel/*.safetensors

    • pvc://jfs-llm-model/mymodel/**;pvc://mydataset2/model/*.safetensors

async-prefetch

  • Default: false

  • Whether to enable asynchronous prefetching. If enabled, the main container starts immediately without waiting for the prefetch container to finish.

prefetch-timeout-seconds

  • Default: 120

  • Takes effect only when async-prefetch=false. The maximum time the main container waits for prefetching to complete.

After the file-prefetcher.fluid.io/inject: "true" annotation is configured, Fluid injects a sidecar container into the Pod to read the model files. With async-prefetch=false (the default), this sidecar blocks the startup of the subsequent application containers until the model files have been fully prefetched.
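Putting the pieces together, the annotations are set on the inference Pod, and the model is mounted through the PVC that Fluid automatically creates with the same name as the Dataset. The following is only a sketch: the Pod name, container name, image, and mount path are placeholders, not part of the deployment above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sglang-worker                  # placeholder name
  annotations:
    file-prefetcher.fluid.io/inject: "true"
    file-prefetcher.fluid.io/extra-envs: FILE_PREFETCHER_THREAD_POOL_SIZE=16
    file-prefetcher.fluid.io/file-list: pvc://llm-model-fluid/
    file-prefetcher.fluid.io/prefetch-timeout-seconds: "1200"
spec:
  containers:
    - name: inference                  # placeholder container
      image: <your-inference-image>
      volumeMounts:
        - name: model
          mountPath: /models/DeepSeek-R1   # placeholder path
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: llm-model-fluid     # PVC created by Fluid for the Dataset
```

With this layout, the injected prefetch sidecar warms the files listed under file-list before the inference container starts reading from the mount path.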