Deploy full-featured DeepSeek on a single GPU instance

DeepSeek-V3/R1 is an open-source, 671B-parameter Mixture-of-Experts (MoE) model. This topic describes how to build an inference service for DeepSeek-V3/R1 on an ebmgn8v instance by using SGLang as the inference framework. The process works out of the box and requires no extra configuration.

Key tools

  • NVIDIA GPU driver: The software that allows the operating system and applications to use NVIDIA GPUs. This topic uses driver version 550.127.08 as an example.

  • SGLang: A high-performance serving framework designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It combines a front-end structured programming language with an optimized back-end inference engine to accelerate complex LLM workloads. This topic uses SGLang v0.4.2.post1 as an example.
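
After the instance described in the procedure is running, the installed driver version can be compared with the version 550 requirement stated in Step 1. The following is a minimal sketch rather than an official check: the helper names driver_major and driver_is_supported are hypothetical, and the only NVIDIA interface used is the nvidia-smi --query-gpu option.

```shell
# Minimum driver major version required by this topic.
MIN_DRIVER_MAJOR=550

# Print the major component of a driver version string such as 550.127.08.
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Succeed if the given driver version meets the minimum major version.
driver_is_supported() {
    [ "$(driver_major "$1")" -ge "${MIN_DRIVER_MAJOR}" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # nvidia-smi prints one line per GPU; the first line is enough
    # because all GPUs on the instance share one driver.
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if driver_is_supported "${installed}"; then
        echo "Driver ${installed} meets the 550 or later requirement."
    else
        echo "Driver ${installed} is older than 550. Upgrade it before you deploy the model."
    fi
else
    echo "nvidia-smi was not found. Install the GPU driver first."
fi
```

The comparison looks only at the major version, which matches how the requirement is phrased in this topic.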

Procedure

Step 1: Prepare the environment

  1. Create a GPU instance and install the correct driver. For more information, see Create a GPU instance. Note the following key parameters:

    • Instance Type: Select ecs.ebmgn8v.48xlarge, which provides 192 vCPUs, 1,024 GiB of memory, and eight GPUs with 96 GB of GPU memory each. For more information, see GPU-accelerated compute type (gn/ebm/scc series).

    • Images: Select a public image. This topic uses Alibaba Cloud Linux 3.2104 LTS 64-bit as an example image.

      To deploy the DeepSeek-V3/R1 model on the GPU instance, you must install a GPU driver of version 550 or later. We recommend that you select Install GPU driver when you purchase the GPU instance in the ECS console. After the instance is created, the Tesla driver, CUDA, and cuDNN libraries are automatically installed. This method is faster than manual installation.

      After you select Install GPU driver, the console indicates that CUDA 12.4.1, driver 550.127.08, and cuDNN 9.2.0.82 will be installed. We also recommend that you select Free Security Reinforcement.

    • System Disk: We recommend that you set the system disk size to 200 GiB or larger.

    • Data Disk: The DeepSeek-R1 and DeepSeek-V3 model files are large, approximately 1.3 TiB each. Size the data disk at about 1.5 times the model size: we recommend that you purchase a separate data disk of 2 TiB or larger to store the downloaded models.

    • Public IP Address: Select Assign Public IPv4 Addresses. For network billing, select Pay-by-traffic. We recommend that you set the peak bandwidth to 100 Mbps to speed up model downloads.

    • Security Group: Open port 22 (SSH).

  2. Install Docker.

    1. Connect to the GPU instance.

      For more information, see Log on to a Linux instance by using Workbench.

    2. Run the following command to install Docker. This example uses an Alibaba Cloud Linux 3 system. For installation instructions on other systems, see Install and use Docker and Docker Compose.

      # Add the Docker software package source.
      sudo wget -O /etc/yum.repos.d/docker-ce.repo http://mirrors.cloud.aliyuncs.com/docker-ce/linux/centos/docker-ce.repo
      sudo sed -i 's|https://mirrors.aliyun.com|http://mirrors.cloud.aliyuncs.com|g' /etc/yum.repos.d/docker-ce.repo
      # DNF source compatibility plugin for Alibaba Cloud Linux 3.
      sudo dnf -y install dnf-plugin-releasever-adapter --repo alinux3-plus
      # Install Docker Community Edition, the containerd.io runtime, and the Docker Buildx and Compose plugins.
      sudo dnf -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
    3. Start Docker and enable auto-start on boot.

      # Start Docker.
      sudo systemctl start docker
      # Set the Docker daemon to start automatically on system boot.
      sudo systemctl enable docker
  3. Install the NVIDIA Container Toolkit.

    If the installation fails, see the troubleshooting guide.

    Alibaba Cloud Linux/CentOS

    # Configure the production repository
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
      sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    # Install the NVIDIA Container Toolkit package
    sudo yum install -y nvidia-container-toolkit
    # Restart Docker
    sudo systemctl restart docker

    Ubuntu/Debian

    # Configure the production repository
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    # Update the package list from the repository
    sudo apt-get update
    # Install the NVIDIA Container Toolkit package
    sudo apt-get install -y nvidia-container-toolkit
    # Restart Docker
    sudo systemctl restart docker
  4. Run the following command to verify that Docker is running.

    sudo systemctl status docker

    The following command output indicates that Docker is running.

    $ sudo systemctl status docker
    ● docker.service - Docker Application Container Engine
         Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
         Active: active (running) since Tue 2024-07-09 16:37:54 CST; 2min 9s ago
    TriggeredBy: ● docker.socket
           Docs: https://docs.docker.com
       Main PID: 6987 (dockerd)
          Tasks: 20
         Memory: 31.9M
         CGroup: /system.slice/docker.service
                 └─6987 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
  5. If you added a data disk when you created the GPU instance, you must initialize and mount it to the /mnt directory.

    1. Run the lsblk command to view information about the data disk.

      [root@iZxxx ~]# lsblk
      NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
      vda    253:0    0   200G  0 disk
      ├─vda1 253:1    0     2M  0 part
      ├─vda2 253:2    0   200M  0 part /boot/efi
      └─vda3 253:3    0 199.8G  0 part /
      vdb    253:16   0     2T  0 disk
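    2. Run the following commands to create an ext4 file system on the data disk (/dev/vdb in this example) and mount it to the /mnt directory. Creating a file system erases any existing data on the disk.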

      sudo mkfs.ext4 /dev/vdb
      sudo mount /dev/vdb /mnt
    3. Run the lsblk command to verify that the data disk is mounted to the /mnt directory.

      [root@iZu1xxxZ ~]# lsblk
      NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
      vda    253:0    0   200G  0 disk
      ├─vda1 253:1    0     2M  0 part
      ├─vda2 253:2    0   200M  0 part /boot/efi
      └─vda3 253:3    0 199.8G  0 part /
      vdb    253:16   0     2T  0 disk /mnt
  6. If you did not add a data disk when you created the GPU instance, you must purchase and mount one.

    The DeepSeek-R1 and DeepSeek-V3 model files are approximately 1.3 TiB each. We recommend that you purchase a separate data disk of 2 TiB or larger (about 1.5 times the model size) to store the downloaded models, and use /mnt as the mount point. For more information, see Mount a data disk.
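
Before you download a model, it can help to compare the free space on /mnt with the 1.5 times guideline described in this step. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and 1,331 GiB is only an approximation of the 1.3 TiB model size mentioned in this topic.

```shell
# Approximate model size in GiB (1.3 TiB is about 1,331 GiB). Assumption for illustration.
MODEL_SIZE_GIB=1331

# Print the recommended free space in GiB: 1.5 times the model size, rounded down.
recommended_gib() {
    echo $(( $1 * 3 / 2 ))
}

# Print the available space, in GiB, on the file system that holds the path in $1.
available_gib() {
    # With -P and -k, the fourth column of the second line is the available space in KiB.
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1048576) }'
}

# Report how the available space compares with the recommendation.
# $1: available GiB, $2: recommended GiB
report_space() {
    if [ "$1" -ge "$2" ]; then
        echo "OK: ${1} GiB available, ${2} GiB recommended."
    else
        echo "LOW: ${1} GiB available, ${2} GiB recommended."
    fi
}

if [ -d /mnt ]; then
    report_space "$(available_gib /mnt)" "$(recommended_gib "${MODEL_SIZE_GIB}")"
fi
```

A 2 TiB disk formatted with ext4 reports somewhat less than 2 TiB of available space because of file system overhead and reserved blocks, so treat a LOW result that is close to the recommendation as guidance rather than a hard limit.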

Step 2: Deploy and run DeepSeek

  1. Run the following command to pull the inference image.

    sudo docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207
  2. Download the model files. Visit ModelScope to select a model. You can find the model's name on its details page.

    # Define the name of the model to download. Get the MODEL_NAME from the model details page on ModelScope. This script uses DeepSeek-V3 as an example.
    MODEL_NAME="DeepSeek-V3"
    # Set the local storage path. Make sure this path has enough space for the model files (we recommend 1.5 times the model size). This example uses /mnt/V3.
    LOCAL_SAVE_PATH="/mnt/V3"
    # If the /mnt/V3 directory does not exist, create it.
    sudo mkdir -p ${LOCAL_SAVE_PATH}
    # Ensure the current user has write permissions for this directory. Adjust permissions as needed.
    sudo chmod ugo+rw ${LOCAL_SAVE_PATH}
    # Start the download. The container is automatically removed after the download is complete.
    sudo docker run -d -t --network=host --rm --name download \
    -v ${LOCAL_SAVE_PATH}:/data \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207 \
    /bin/bash -c "git-lfs clone https://www.modelscope.cn/models/deepseek-ai/${MODEL_NAME}.git /data"
  3. Run the following command to monitor the download progress. Wait for the download to complete.

    sudo docker logs -f download

    The model download can take a long time. After the download is complete, no new logs are generated. You can press Ctrl+C at any time to exit without affecting the container's operation. The download will not be interrupted even if you close the terminal.

  4. Start the model inference service.

    # Define the model name, which is also used as the container name. This example uses DeepSeek-V3.
    MODEL_NAME="DeepSeek-V3"
    # Set the local path where the model files were downloaded. This example uses /mnt/V3.
    LOCAL_SAVE_PATH="/mnt/V3"
    # Define the port for the service to listen on. You can change this as needed. The default is port 30000.
    PORT="30000"
    # Define the number of GPUs to use. This depends on the number of available GPUs on your instance, which you can check by running the nvidia-smi -L command.
    # This example assumes 8 GPUs are used.
    TENSOR_PARALLEL_SIZE="8"
    # Ensure the current user has read and write permissions for this directory. Adjust permissions as needed.
    sudo chmod ugo+rw ${LOCAL_SAVE_PATH}
    # Start the Docker container and run the service.
    sudo docker run -d -t --network=host --gpus all \
        --privileged \
        --ipc=host \
        --cap-add=SYS_PTRACE \
        --name ${MODEL_NAME} \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        -v ${LOCAL_SAVE_PATH}:/data \
        egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207 \
        /bin/bash -c "python3 -m sglang.launch_server \
            --port ${PORT} \
            --model-path /data \
            --mem-fraction-static 0.8 \
            --tp ${TENSOR_PARALLEL_SIZE} \
            --trust-remote-code"
  5. Run the following command to verify that the service started correctly.

    sudo docker logs ${MODEL_NAME}

    Look for a message similar to the following in the log output. This indicates that the service has started successfully and is listening on port 30000.

    INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
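
Loading the model can take a while, so the startup message may not appear the first time you check the log. The following sketch polls the container log until the Uvicorn line shown above appears. The function names, the default of 120 attempts, and the 30-second interval are assumptions chosen for illustration, not values from SGLang.

```shell
# Succeed if the log text on standard input contains the Uvicorn startup line.
log_shows_ready() {
    grep -q "Uvicorn running on"
}

# Poll the container log until the startup line appears or the attempts run out.
# $1: container name, $2: maximum attempts (default 120), $3: seconds between attempts (default 30)
wait_for_service() {
    local name="$1"
    local attempts="${2:-120}"
    local interval="${3:-30}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        # Merge stderr into stdout so that both log streams are searched.
        if sudo docker logs "$name" 2>&1 | log_shows_ready; then
            echo "Service in container ${name} is ready."
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "Timed out waiting for container ${name}." >&2
    return 1
}
```

For example, run wait_for_service "${MODEL_NAME}" in the same shell session that started the container. If the function times out, inspect the full log with sudo docker logs for error messages.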

Step 3: Test and verify inference

Run the following command to send an inference request and verify the model's output.

# The '\'' sequence inserts a literal apostrophe into the single-quoted JSON body.
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "How many letter e'\''s are in the word deepseek?",
    "sampling_params": {
      "max_new_tokens": 3000,
      "temperature": 0
    }
  }'

The following output is an example response.

{"text":"Please think step by step before answering.\n\ndeepseek\n\nOkay, the problem I need to solve now is: how many letter e's are in the word 'deepseek'? Let me think carefully to make sure I answer this question correctly.\n\nFirst, I need to understand the question. The question asks how many times the letter e appears in the word 'deepseek'. My task is to count each letter in this word and tally the number of e's.\n\nNext, I need to spell the word correctly to ensure I haven't misread or misspelled it. The word is 'deepseek', composed of the letters d, e, e, p, s, e, e, k? Let me double-check. It's easy to misspell sometimes, so I'll carefully check the position of each letter.\n\nLet me break down the word into individual letters and check them one by one:\n\nd - the first letter is d, not e.\ne - the second letter is e, count is now 1.\ne - the third letter is also e, count is now 2.\np - the fourth letter is p, not e.\ns - the fifth letter is s, not e.\ne - the sixth letter is e, count is now 3.\ne - the seventh letter is e, count is now 4.\nk - the eighth letter is k, not e.\n\nSo, the sequence of letters is: d, e, e, p, s, e, e, k. There are 8 letters in total. The letter e appears at positions 2, 3, 6, and 7, for a total of 4 times.\n\nHowever, did I miscount? Let me count again to be sure. Sometimes, double-counting or missing a letter can lead to errors.\n\nLet's break it down again:\n\n1. d → not e\n2. e → 1\n3. e → 2\n4. p → not e\n5. s → not e\n6. e → 3\n7. e → 4\n8. k → not e\n\nYes, there are indeed 4 e's. However, someone might wonder if the spelling is incorrect, for example, if there are other letters in the middle, but based on the given word 'deepseek', it should be correct.\n\nAlso, is there a case sensitivity issue? For example, is the word in the question all lowercase, or does it have uppercase letters? But the question writes 'deepseek', which appears to be all lowercase, so it doesn't affect the count.\n\nTo summarize, after two careful checks, I can confirm that there are 4 letter e's in 'deepseek'.\n</think>\n\nIn the word 'deepseek', the letter **e** appears **4 times**. The distribution is as follows:\n\n1. **d** (1st position) → not e \n2. **e** (2nd position) → 1st e \n3. **e**..."}
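
For repeated tests, the request body can be generated instead of written by hand, which avoids quoting problems when a prompt contains quotation marks. The following is a minimal sketch that targets the same /generate endpoint and fields as the request above. The function names and the default of 64 new tokens are hypothetical, and the escaping handles only backslashes and double quotes, not newlines or other control characters.

```shell
# Escape backslashes and double quotes so that the prompt is safe inside a JSON string.
# Newlines and other control characters are not handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for the /generate endpoint.
# $1: prompt, $2: max_new_tokens (default 64)
build_generate_payload() {
    printf '{"text":"%s","sampling_params":{"max_new_tokens":%s,"temperature":0}}' \
        "$(json_escape "$1")" "${2:-64}"
}

# Send a prompt to the local inference service and print the raw JSON response.
# $1: prompt, $2: max_new_tokens, $3: port (default 30000)
send_prompt() {
    curl -s "http://localhost:${3:-30000}/generate" \
        -H "Content-Type: application/json" \
        -d "$(build_generate_payload "$1" "$2")"
}
```

For example, send_prompt "How many letter e's are in the word deepseek?" 3000 sends the same request as the curl command above. A smaller max_new_tokens value returns faster and is usually enough for a quick smoke test.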