DeepSeek-V3 and DeepSeek-R1 are open-source Mixture-of-Experts (MoE) models with 671B parameters each. This topic describes how to build an inference service for DeepSeek-V3/R1 on an ebmgn8v instance by using SGLang as the inference framework. The process works out of the box and requires no extra configuration.
Key tools
- NVIDIA GPU driver: A program that drives NVIDIA GPUs. This topic uses driver version 550.127.08 as an example.
- SGLang: A high-performance serving framework designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It combines a front-end structured programming language with an optimized back-end inference engine to accelerate complex LLM workloads. This topic uses SGLang v0.4.2.post1 as an example.
Procedure
Step 1: Prepare the environment
- Create a GPU instance and install the correct driver. For more information, see Create a GPU instance. Note the following key parameters:
  - Instance Type: Select ecs.ebmgn8v.48xlarge, which provides 1,024 GiB of memory, 8 × 96 GB of GPU memory, and 192 vCPUs. For more information, see GPU-accelerated compute type (gn/ebm/scc series).
  - Images: Select a public image. This topic uses Alibaba Cloud Linux 3.2104 LTS 64-bit as an example.
    To deploy the DeepSeek-V3/R1 model on the GPU instance, you must install GPU driver version 550 or later. We recommend that you select Install GPU driver when you purchase the GPU instance in the ECS console. After the instance is created, the Tesla driver, CUDA, and cuDNN libraries are installed automatically, which is faster than manual installation.
    After you select Install GPU driver, the console indicates that CUDA 12.4.1, Driver 550.127.08, and cuDNN 9.2.0.82 will be installed. We also recommend that you select Free Security Reinforcement.
  - System Disk: We recommend that you set the system disk size to 200 GiB or larger.
  - Data Disk: The DeepSeek-R1 and DeepSeek-V3 model files are large, approximately 1.3 TiB each. We recommend a data disk of 1.5 times the model size, so purchase a separate data disk of 2 TiB or larger to store the downloaded models.
  - Public IP Address: Select Assign Public IPv4 Addresses. For network billing, select Pay-by-traffic. We recommend that you set the peak bandwidth to 100 Mbps to speed up model downloads.
  - Security Group: Open port 22.
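After the instance is created, the driver requirement above (version 550 or later) can be checked from a shell. The following is a minimal sketch, not an official check: `driver_meets_minimum` and `MIN_DRIVER_MAJOR` are hypothetical names introduced here, and the script only queries the driver when `nvidia-smi` is present.

```shell
#!/usr/bin/env bash
# Minimum driver major version required by this topic.
MIN_DRIVER_MAJOR=550

# driver_meets_minimum VERSION
# Hypothetical helper: returns 0 if the major component of VERSION
# (for example 550.127.08) is at least MIN_DRIVER_MAJOR.
driver_meets_minimum() {
  local major="${1%%.*}"
  # Reject empty or non-numeric input instead of letting the comparison fail.
  case "$major" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$major" -ge "$MIN_DRIVER_MAJOR" ]
}

# Only query the driver when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_meets_minimum "$version"; then
    echo "Driver ${version} meets the minimum (${MIN_DRIVER_MAJOR})."
  else
    echo "Driver '${version}' is older than ${MIN_DRIVER_MAJOR} or could not be read."
  fi
fi
```

The comparison uses only the major version because the requirement in this topic is stated as "550 or later".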
- Install the NVIDIA Container Toolkit.
  If the installation fails, see the troubleshooting guide.
  Alibaba Cloud Linux/CentOS
  # Configure the production repository
  curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
    sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
  # Install the NVIDIA Container Toolkit package
  sudo yum install -y nvidia-container-toolkit
  # Restart Docker
  sudo systemctl restart docker
  Ubuntu/Debian
  # Configure the production repository
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  # Update the package list from the repository
  sudo apt-get update
  # Install the NVIDIA Container Toolkit package
  sudo apt-get install -y nvidia-container-toolkit
  # Restart Docker
  sudo systemctl restart docker
- Run the following command to verify that Docker is running.
  sudo systemctl status docker
  The following command output indicates that Docker is running.
  $ sudo systemctl status docker
  ● docker.service - Docker Application Container Engine
       Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
       Active: active (running) since Tue 2024-07-09 16:37:54 CST; 2min 9s ago
  TriggeredBy: ● docker.socket
         Docs: https://docs.docker.com
     Main PID: 6987 (dockerd)
        Tasks: 20
       Memory: 31.9M
       CGroup: /system.slice/docker.service
               └─6987 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
- If you did not add a data disk when you created the GPU instance, you must purchase and mount one.
  The DeepSeek-R1 and DeepSeek-V3 model files are large, approximately 1.3 TiB each. We recommend a data disk of 1.5 times the model size, so purchase a separate data disk of 2 TiB or larger to store the downloaded models, and use /mnt as the mount point. For more information, see Mount a data disk.
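The sizing rule above is simple arithmetic: 1.3 TiB is about 1,331 GiB, and 1.5 times that is about 1,997 GiB, which is why a 2 TiB disk is suggested. The sketch below reproduces that calculation and compares it with the free space on the mount point. The helper names `recommended_gib` and `available_gib` are hypothetical, and the 1,331 GiB figure is only the approximate size stated in this topic.

```shell
#!/usr/bin/env bash
# Approximate model size from this topic: 1.3 TiB, or about 1331 GiB.
MODEL_SIZE_GIB=1331
MOUNT_POINT="/mnt"

# recommended_gib SIZE_GIB
# Prints 1.5 times SIZE_GIB, rounded up, using integer arithmetic.
recommended_gib() {
  echo $(( ($1 * 3 + 1) / 2 ))
}

# available_gib PATH
# Prints the free space of the file system that holds PATH, in whole GiB.
# df -Pk reports 1024-byte blocks; column 4 is the available space.
available_gib() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1048576) }'
}

need="$(recommended_gib "$MODEL_SIZE_GIB")"
echo "Recommended free space: ${need} GiB"

if [ -d "$MOUNT_POINT" ]; then
  have="$(available_gib "$MOUNT_POINT")"
  if [ "$have" -ge "$need" ]; then
    echo "${MOUNT_POINT} has ${have} GiB free: meets the 1.5x recommendation."
  elif [ "$have" -ge "$MODEL_SIZE_GIB" ]; then
    echo "${MOUNT_POINT} has ${have} GiB free: fits the model, but below the 1.5x margin."
  else
    echo "${MOUNT_POINT} has only ${have} GiB free: attach a larger data disk."
  fi
fi
```

File system overhead reduces the usable space of a formatted disk, so a freshly mounted 2 TiB disk may report slightly less than the 1.5x figure.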
Step 2: Deploy and run DeepSeek
- Run the following command to pull the inference image.
  sudo docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207
- Download the model files. Visit ModelScope to select a model. You can find the model's name on its details page.
  # Define the name of the model to download. Get the MODEL_NAME from the model details page on ModelScope. This script uses DeepSeek-V3 as an example.
  MODEL_NAME="DeepSeek-V3"
  # Set the local storage path. Make sure this path has enough space for the model files (we recommend 1.5 times the model size). This example uses /mnt/V3.
  LOCAL_SAVE_PATH="/mnt/V3"
  # If the /mnt/V3 directory does not exist, create it.
  sudo mkdir -p "${LOCAL_SAVE_PATH}"
  # Ensure the current user has write permissions for this directory. Adjust permissions as needed.
  sudo chmod ugo+rw "${LOCAL_SAVE_PATH}"
  # Start the download. The container is automatically removed after the download is complete.
  sudo docker run -d -t --network=host --rm --name download \
    -v "${LOCAL_SAVE_PATH}":/data \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207 \
    /bin/bash -c "git-lfs clone https://www.modelscope.cn/models/deepseek-ai/${MODEL_NAME}.git /data"
- Run the following command to monitor the download progress, and wait for the download to complete.
  sudo docker logs -f download
  The model download can take a long time. After the download is complete, no new logs are generated. You can press Ctrl+C at any time to exit without affecting the container. The download continues even if you close the terminal.
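Because the download container is started with `--rm` and the name `download`, it disappears from `docker ps` once the download process exits. The sketch below polls for that instead of watching the log. `container_running` and `wait_for_container` are hypothetical helpers, and the `DOCKER` variable is an assumption added so the command can be overridden. Note that a container that has stopped is not proof of a successful download, so check the size of the model directory afterwards.

```shell
#!/usr/bin/env bash
# Command used to reach Docker. Override it if your user can run docker without sudo.
DOCKER="${DOCKER:-sudo docker}"

# container_running NAME
# Returns 0 while a container with exactly this name is listed by "docker ps".
container_running() {
  $DOCKER ps --format '{{.Names}}' | grep -qxF -- "$1"
}

# wait_for_container NAME [INTERVAL_SECONDS]
# Hypothetical helper: polls until the container is no longer running.
wait_for_container() {
  local name="$1" interval="${2:-60}"
  while container_running "$name"; do
    sleep "$interval"
  done
  echo "Container ${name} is no longer running."
}

# Example (commented out because it needs Docker and the download container):
# wait_for_container download && du -sh /mnt/V3
```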
- Start the model inference service.
  # Define the model name. It is also used as the container name. This script uses DeepSeek-V3 as an example.
  MODEL_NAME="DeepSeek-V3"
  # Set the local path where the model files are stored. This example uses /mnt/V3.
  LOCAL_SAVE_PATH="/mnt/V3"
  # Define the port for the service to listen on. You can change this as needed. The default is port 30000.
  PORT="30000"
  # Define the number of GPUs to use. This depends on the number of available GPUs on your instance, which you can check by running the nvidia-smi -L command.
  # This example assumes 8 GPUs are used.
  TENSOR_PARALLEL_SIZE="8"
  # Ensure the current user has read and write permissions for this directory. Adjust permissions as needed.
  sudo chmod ugo+rw "${LOCAL_SAVE_PATH}"
  # Start the Docker container and run the service.
  sudo docker run -d -t --network=host --gpus all \
    --privileged \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --name "${MODEL_NAME}" \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -v "${LOCAL_SAVE_PATH}":/data \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207 \
    /bin/bash -c "python3 -m sglang.launch_server \
      --port ${PORT} \
      --model-path /data \
      --mem-fraction-static 0.8 \
      --tp ${TENSOR_PARALLEL_SIZE} \
      --trust-remote-code"
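The start script suggests running `nvidia-smi -L` to find the number of GPUs. As a small sketch, that count can be derived in the shell rather than typed by hand. `count_gpus` is a hypothetical helper that counts the lines beginning with `GPU <index>` in the `nvidia-smi -L` output, and the example assumes you want to use every GPU on the instance, as this topic does.

```shell
#!/usr/bin/env bash
# count_gpus
# Reads "nvidia-smi -L" output on stdin and prints the number of GPU lines.
count_gpus() {
  grep -c '^GPU [0-9]'
}

# Only query the GPUs when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  TENSOR_PARALLEL_SIZE="$(nvidia-smi -L | count_gpus)"
  echo "Detected ${TENSOR_PARALLEL_SIZE} GPUs; this value can be passed to --tp."
fi
```

On the instance type used in this topic the result is expected to be 8, which matches the hard-coded value in the start script.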
- Run the following command to verify that the service started correctly.
  sudo docker logs ${MODEL_NAME}
  Look for a message similar to the following in the log output. It indicates that the service has started and is listening on port 30000.
  INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
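Loading the model weights can take a while, so the Uvicorn line may not appear immediately. The following sketch polls the container log until that line shows up. `log_shows_ready` and `wait_until_ready` are hypothetical helpers, the `DOCKER` variable is an assumption added so the command can be overridden, and the default interval and retry count are arbitrary.

```shell
#!/usr/bin/env bash
# Command used to reach Docker. Override it if your user can run docker without sudo.
DOCKER="${DOCKER:-sudo docker}"

# log_shows_ready
# Reads log text on stdin and succeeds if the Uvicorn startup line is present.
log_shows_ready() {
  grep -q 'Uvicorn running on'
}

# wait_until_ready CONTAINER [INTERVAL_SECONDS] [MAX_TRIES]
# Hypothetical helper: checks the container log up to MAX_TRIES times.
wait_until_ready() {
  local name="$1" interval="${2:-30}" tries="${3:-120}" i=0
  while [ "$i" -lt "$tries" ]; do
    if $DOCKER logs "$name" 2>&1 | log_shows_ready; then
      echo "Service in ${name} is listening."
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for ${name}." >&2
  return 1
}

# Example (commented out because it needs Docker and the running container):
# wait_until_ready "DeepSeek-V3"
```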
Step 3: Test and verify inference
Run the following command to send an inference request and verify the model's output.
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "How many letter e's are in the word deepseek?",
"sampling_params": {
"max_new_tokens": 3000,
"temperature": 0
}
}'
A response similar to the following is returned. The text is truncated.
{"text":"Please think step by step before answering.\n\ndeepseek\n\nOkay, the problem I need to solve now is: how many letter e's are in the word 'deepseek'? Let me think carefully to make sure I answer this question correctly.\n\nFirst, I need to understand the question. The question asks how many times the letter e appears in the word 'deepseek'. My task is to count each letter in this word and tally the number of e's.\n\nNext, I need to spell the word correctly to ensure I haven't misread or misspelled it. The word is 'deepseek', composed of the letters d, e, e, p, s, e, e, k? Let me double-check. It's easy to misspell sometimes, so I'll carefully check the position of each letter.\n\nLet me break down the word into individual letters and check them one by one:\n\nd - the first letter is d, not e.\ne - the second letter is e, count is now 1.\ne - the third letter is also e, count is now 2.\np - the fourth letter is p, not e.\ns - the fifth letter is s, not e.\ne - the sixth letter is e, count is now 3.\ne - the seventh letter is e, count is now 4.\nk - the eighth letter is k, not e.\n\nSo, the sequence of letters is: d, e, e, p, s, e, e, k. There are 8 letters in total. The letter e appears at positions 2, 3, 6, and 7, for a total of 4 times.\n\nHowever, did I miscount? Let me count again to be sure. Sometimes, double-counting or missing a letter can lead to errors.\n\nLet's break it down again:\n\n1. d → not e\n2. e → 1\n3. e → 2\n4. p → not e\n5. s → not e\n6. e → 3\n7. e → 4\n8. k → not e\n\nYes, there are indeed 4 e's. However, someone might wonder if the spelling is incorrect, for example, if there are other letters in the middle, but based on the given word 'deepseek', it should be correct.\n\nAlso, is there a case sensitivity issue? For example, is the word in the question all lowercase, or does it have uppercase letters? 
But the question writes 'deepseek', which appears to be all lowercase, so it doesn't affect the count.\n\nTo summarize, after two careful checks, I can confirm that there are 4 letter e's in 'deepseek'.\n</think>\n\nIn the word 'deepseek', the letter **e** appears **4 times**. The distribution is as follows:\n\n1. **d** (1st position) → not e \n2. **e** (2nd position) → 1st e \n3. **e**..."}
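To send other prompts to the same `/generate` endpoint, the request above can be wrapped in a small function. This is a sketch under stated assumptions: `json_escape`, `generate_payload`, and `send_generate` are hypothetical helpers, and the escaping handles only backslashes and double quotes, so it assumes a single-line prompt without control characters. For arbitrary text, a JSON-aware tool is the safer choice.

```shell
#!/usr/bin/env bash
# json_escape STRING
# Escapes backslashes and double quotes only (single-line prompts assumed).
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# generate_payload PROMPT MAX_NEW_TOKENS
# Prints a request body with the same fields as the example request.
generate_payload() {
  printf '{"text": "%s", "sampling_params": {"max_new_tokens": %d, "temperature": 0}}' \
    "$(json_escape "$1")" "$2"
}

# send_generate PROMPT [MAX_NEW_TOKENS] [PORT]
# Posts the payload to the local service; defaults match this topic.
send_generate() {
  curl -s "http://localhost:${3:-30000}/generate" \
    -H "Content-Type: application/json" \
    -d "$(generate_payload "$1" "${2:-3000}")"
}

# Example (commented out because it needs the running service):
# send_generate "How many letter e's are in the word deepseek?" 3000
```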