Install and use DeepGPU-LLM for large language model inference

To run large language model (LLM) tasks, you can install the DeepGPU-LLM inference engine in a GPU instance or a Docker environment. DeepGPU-LLM optimizes large language models, such as the Llama, ChatGLM, Baichuan, and Qwen models, for high-performance inference on GPUs.

Note

LLM inference workloads are a natural fit for GPU compute-optimized instances. For more information, see GPU compute-optimized instances (gn/ebm/scc series). This topic uses the gn7i GPU compute-optimized instance family as an example.

Install DeepGPU-LLM on a GPU instance

To use DeepGPU-LLM on a GPU instance, you can either select an Alibaba Cloud Marketplace image that comes with DeepGPU-LLM pre-installed, or create a GPU instance and then install it manually.

Automated method (select a Marketplace image)

Obtain a Marketplace image and create a GPU instance.

Marketplace images come pre-installed with the DeepGPU-LLM tool. You can obtain a Marketplace image in two ways.

From the ECS creation page

Go to the Instance Creation Page.
Select the Custom Launch tab.

Configure parameters such as the billing method, region, instance type, and image.

Pay attention to the following parameters. For more information about other configuration items, see Configuration Description.

① Instance: This topic uses the ecs.gn7i-c8g1.2xlarge, 8 vCPU 30 GiB instance type as an example.

② Image: This example uses an image from the Marketplace image section. This free image is an AI inference solution from Alibaba Cloud for Large Language Model (LLM) scenarios. When you create a GPU compute-optimized instance, Marketplace offers more AI inference solution images for LLM scenarios to choose from. Available images include:

More images and version information

Supported instance types	Pre-installed LLM images	Latest version
GPU compute-optimized instance	CentOS 7.9 with DeepGPU-LLM pre-installed	24.3
	Ubuntu 20.04 with DeepGPU-LLM pre-installed	24.4
	Ubuntu 22.04 with DeepGPU-LLM pre-installed	24.3
	Ubuntu image for deploying DeepGPU-LLM	V 1.1.3
	deepgpu-llm-inference-ubuntu2004	V 0.1
	CentOS 8.5 with DeepGPU-LLM pre-installed (uefi erdma)	24.3
	CentOS 7.9 with DeepGPU-LLM pre-installed (uefi erdma)	24.3
	Ubuntu 20.04 with DeepGPU-LLM pre-installed (uefi + erdma)	24.3.1
	Ubuntu 22.04 with DeepGPU-LLM pre-installed (uefi + erdma)	24.3

Note

Only some instance types, such as ebmgn7ix and ebmgn8is, support images with uefi and erdma. The options available in the console are definitive.

Public IP: Select Assign Public IPv4 Address. Set the bandwidth billing method to Pay-By-Traffic and the peak bandwidth to 100 Mbps to accelerate model downloads.

Follow the on-screen instructions and click Confirm Order.
On the payment page, review the total cost of the instance. If the details are correct, complete the payment.

From Marketplace

Go to the Alibaba Cloud Marketplace page.
In the search box, enter deepgpu-llm and press Enter.

Select the desired image and click Details.

This topic uses the Ubuntu 22.04 with DeepGPU-LLM pre-installed image as an example.

Marketplace offers more AI inference solution images for LLM scenarios. Available images include:

More images and version information

Supported instance types	Pre-installed LLM images	Latest version
GPU compute-optimized instance	CentOS 7.9 with DeepGPU-LLM pre-installed	24.3
	Ubuntu 20.04 with DeepGPU-LLM pre-installed	24.4
	Ubuntu 22.04 with DeepGPU-LLM pre-installed	24.3
	Ubuntu image for deploying DeepGPU-LLM	V 1.1.3
	deepgpu-llm-inference-ubuntu2004	V 0.1
	CentOS 8.5 with DeepGPU-LLM pre-installed (uefi erdma)	24.3
	CentOS 7.9 with DeepGPU-LLM pre-installed (uefi erdma)	24.3
	Ubuntu 20.04 with DeepGPU-LLM pre-installed (uefi + erdma)	24.3.1
	Ubuntu 22.04 with DeepGPU-LLM pre-installed (uefi + erdma)	24.3

Note

Only some instance types, such as ebmgn7ix and ebmgn8is, support images with uefi and erdma. The options available in the console are definitive.

On the image details page, click Custom Launch.

Note
The image itself is free. You are charged only for the GPU instance.
On the instance creation page, in the Image section, verify that your purchased image is selected on the Marketplace image tab.

The purchased image is typically selected by default. If it is not selected, click Select Another Image and choose the desired image.
On the instance creation page, configure other parameters and create the GPU instance.

For the Public IP parameter, select Assign Public IPv4 Address. Set the bandwidth billing method to Pay-By-Traffic and the peak bandwidth to 100 Mbps to accelerate model downloads. For more information about other parameters, see Configuration Description.

Connect to the GPU instance remotely.

For more information, see Log on to a Linux instance by using Workbench.
Run the following command to check the installation status and version of DeepGPU-LLM.
```
sudo pip list | grep deepgpu-llm
```
The following output indicates that DeepGPU-LLM is installed and the current version is 24.3.
```
deepgpu-llm    24.3+pt2.1cu121
```
Note
You can also run the sudo pip show -f deepgpu-llm command to view detailed information about the installed DeepGPU-LLM package.
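The version string reported by pip combines three pieces of information: the DeepGPU-LLM release, the PyTorch build tag, and the CUDA build tag. The following Python sketch splits such a string into its parts. The helper name `parse_version` and the `DeepGpuLlmVersion` type are hypothetical and are not part of DeepGPU-LLM; the format is inferred from the example outputs in this topic.

```python
from typing import NamedTuple


class DeepGpuLlmVersion(NamedTuple):
    release: str  # DeepGPU-LLM release, for example "24.3"
    pytorch: str  # PyTorch build tag, for example "pt2.1"
    cuda: str     # CUDA build tag, for example "cu121"


def parse_version(version: str) -> DeepGpuLlmVersion:
    """Split a string such as '24.3+pt2.1cu121' into its three parts."""
    release, sep, local = version.partition("+")
    cu_index = local.find("cu")
    # Assumption: the part after '+' is always '<pt tag><cu tag>'.
    if not sep or not local.startswith("pt") or cu_index == -1:
        raise ValueError(f"unexpected version format: {version!r}")
    return DeepGpuLlmVersion(release, local[:cu_index], local[cu_index:])


print(parse_version("24.3+pt2.1cu121"))
```

Comparing the `pytorch` and `cuda` fields with the PyTorch and CUDA versions on the instance is a quick way to spot a mismatched package before you upgrade.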
(Optional) Upgrade DeepGPU-LLM.

If the current DeepGPU-LLM version does not meet your requirements, you can upgrade it by installing a later version.
1. Go to the DeepGPU-LLM Installation Package page.
2. Find the DeepGPU-LLM package that you want to install, right-click it, then select Copy Link Address.
3. In the remote session to the GPU instance, run the following commands to install a later version of DeepGPU-LLM.
  
  In this example, the deepgpu_llm-24.6+pt2.1cu121-py3-none-any.whl package is downloaded. Replace it with the DeepGPU-LLM version that you need.
```
sudo wget https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm-24.6%2Bpt2.1cu121-py3-none-any.whl
sudo pip install deepgpu_llm-24.6+pt2.1cu121-py3-none-any.whl
```
4. Run the following command to verify the upgrade.
```
sudo pip list | grep deepgpu-llm
```
  The following output indicates that DeepGPU-LLM has been upgraded to version 24.6.
```
deepgpu-llm    24.6+pt2.1cu121
```
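The download URL in the upgrade step writes the `+` in the wheel file name as `%2B`, while the `pip install` command uses the literal `+`. The following Python sketch shows the relationship between the two forms by using standard URL percent-encoding. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# Base URL taken from the wget command in the upgrade step.
BASE_URL = (
    "https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/"
    "aiacc-inference-llm/"
)
wheel_name = "deepgpu_llm-24.6+pt2.1cu121-py3-none-any.whl"

# quote() percent-encodes '+' as '%2B'; unquote() reverses it.
encoded_name = quote(wheel_name)
download_url = BASE_URL + encoded_name

print(download_url)
print(unquote(encoded_name))
```

If you copy a link for a different package version, the same rule applies: the file name that you pass to `pip install` is the decoded form of the last URL segment.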

Manual method (public image)

Create a GPU instance and then install DeepGPU-LLM on it. This topic uses a public image of Ubuntu 22.04 64-bit or Alibaba Cloud Linux 3 as an example.

Ubuntu 22.04

Create a GPU instance.
1. Go to the Instance Creation page.
2. Click the Custom Launch tab.
3. Configure the billing method, region, network, zone, instance type, and image.
  
  Note the following parameters. For more information about other configuration parameters, see Configuration options.
  - Instance: This example uses the ecs.gn7i-c8g1.2xlarge 8 vCPU 30 GiB instance type.
  - Image: From public images, select Ubuntu 22.04 64-bit. You can select the Install GPU Driver option to install the GPU driver, CUDA, and cuDNN at the same time.
  - Public IP: Select Assign Public IPv4 Address, set the bandwidth billing method to Pay-By-Traffic, and set the peak bandwidth to 100 Mbps to accelerate model downloads.
4. Complete the instance configuration and click Create Order.
5. On the payment page, review the total cost and complete the payment.
(Conditional) If you did not select the Install GPU Driver option when you created the GPU instance, you must manually install the Tesla driver and CUDA Toolkit.

For more information, see Manually install a Tesla driver on a GPU-accelerated compute instance (Linux) and Install CUDA.
Remotely connect to the GPU instance.

For more information, see Log on to a Linux instance by using Workbench.

Run the following commands to configure environment variables.

export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH

Run the following commands to verify that the GPU driver and CUDA are installed.

nvidia-smi
nvcc -V

The following output confirms that the driver and CUDA are installed.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07          Driver Version: 550.90.07     CUDA Version: 12.4         |
|--------------------------------------------+------------------------+-------------------+
| GPU  Name       Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC        |
| Fan  Temp  Perf Pwr:Usage/Cap |  Memory-Usage          | GPU-Util  Compute M.        |
|                               |                        |               MIG M.        |
|===============================+========================+=============================|
|   0  NVIDIA A10          On   | 00000000:00:07.0  Off  |                           0 |
|  0%   27C    P8     9W / 150W |     1MiB / 23028MiB    |      0%      Default        |
|                               |                        |                     N/A     |
+-------------------------------+------------------------+-----------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
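
The CUDA release shown by `nvcc -V` determines which `{cuda-version}` tag to use when you select a DeepGPU-LLM package later in this procedure. The following Python sketch derives the tag from the `release` line. It assumes that the tag is `cu` followed by the major and minor version digits, as in the `cu121` and `cu124` examples in this topic; the function name is hypothetical.

```python
import re


def cuda_tag_from_nvcc(nvcc_output: str) -> str:
    """Return a tag such as 'cu124' from the 'release X.Y' line of nvcc -V."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' line found in nvcc output")
    major, minor = match.groups()
    # Assumption: the package tag is 'cu' + major + minor, without a dot.
    return f"cu{major}{minor}"


sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(cuda_tag_from_nvcc(sample))  # prints cu124
```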

(Conditional) If your GPU instance belongs to the ebmgn7, ebmgn7e, ebmgn7ex, or sccgn7ex instance family, install the nvidia-fabricmanager service that corresponds to the driver version.

For more information, see Install the nvidia-fabricmanager service.

Run the following commands to install dependencies for DeepGPU-LLM.

sudo apt-get update
sudo apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev curl vim

Run the following command to install DeepGPU-LLM.

Note
The download and installation may take a long time.

Select an appropriate DeepGPU-LLM installation package based on the required versions of DeepGPU-LLM, PyTorch, and CUDA. To obtain the latest DeepGPU-LLM version number, see DeepGPU-LLM acceleration installation packages.
```
sudo pip3 install deepgpu_llm=={deepgpu-llm-version}+{pytorch-version}{cuda-version} \
    -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html
```
For example, if {deepgpu-llm-version} is 24.7.2, {pytorch-version} is pt2.4, and {cuda-version} is cu124, the command installs DeepGPU-LLM version 24.7.2.
```
sudo pip3 install deepgpu_llm==24.7.2+pt2.4cu124 \
    -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html
```
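If you script the installation, the package specifier can be assembled from the three placeholders described above. The following Python sketch builds the argument list for the `pip3 install` command. The function name `pip_install_args` is hypothetical; the index URL is the one used in the preceding commands.

```python
import shlex

FIND_LINKS = (
    "https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/"
    "aiacc-inference-llm/deepgpu_llm.html"
)


def pip_install_args(release: str, pytorch_tag: str, cuda_tag: str) -> list:
    """Build the pip3 arguments for a specific DeepGPU-LLM package."""
    requirement = f"deepgpu_llm=={release}+{pytorch_tag}{cuda_tag}"
    return ["pip3", "install", requirement, "-f", FIND_LINKS]


args = pip_install_args("24.7.2", "pt2.4", "cu124")
print(shlex.join(args))
```

The list form can be passed to `subprocess.run` directly, which avoids shell quoting issues with the `+` character.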
Run the following command to verify the installation and check the version of DeepGPU-LLM.
```
sudo pip list | grep deepgpu-llm
```
The following output confirms that DeepGPU-LLM version 24.7.2 is installed.
```
deepgpu-llm    24.7.2+pt2.4cu124
```

Alibaba Cloud Linux 3

Create a GPU instance.
1. Go to the Instance Creation page.
2. Click the Custom Launch tab.
3. Configure the billing method, region, network, zone, instance type, and image.
  
  Note the following parameters. For more information about other configuration parameters, see Configuration options.
  - Instance: This example uses the ecs.gn7i-c8g1.2xlarge 8 vCPU 30 GiB instance type.
  - Image: From public images, select Alibaba Cloud Linux 3.2104 LTS 64-bit, and select the Install GPU Driver option to install the GPU driver, CUDA, and cuDNN at the same time.
  - Public IP: Select Assign Public IPv4 Address, set the bandwidth billing method to Pay-By-Traffic, and set the peak bandwidth to 100 Mbps to accelerate model downloads.
4. Complete the instance configuration and click Create Order.
5. On the payment page, review the total cost and complete the payment.
Connect to the GPU instance remotely.

For more information, see Log on to a Linux instance by using Workbench.

Run the following commands to verify that the GPU driver and CUDA are installed.

nvidia-smi
nvcc -V

The following output confirms that the driver and CUDA are installed.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07          Driver Version: 550.90.07     CUDA Version: 12.4         |
|--------------------------------------------+------------------------+-------------------+
| GPU  Name       Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC        |
| Fan  Temp  Perf Pwr:Usage/Cap |  Memory-Usage          | GPU-Util  Compute M.        |
|                               |                        |               MIG M.        |
|===============================+========================+=============================|
|   0  NVIDIA A10          On   | 00000000:00:07.0  Off  |                           0 |
|  0%   27C    P8     9W / 150W |     1MiB / 23028MiB    |      0%      Default        |
|                               |                        |                     N/A     |
+-------------------------------+------------------------+-----------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Run the following commands to install dependencies for DeepGPU-LLM.

sudo yum install epel-release
sudo yum update
sudo yum install openmpi3 openmpi3-devel curl
sudo wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sudo chmod +x Miniconda3-latest-Linux-x86_64.sh
sudo ./Miniconda3-latest-Linux-x86_64.sh

Run the following commands to update environment variables.

export PATH=/usr/lib64/openmpi3/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi3/lib:$LD_LIBRARY_PATH

Run the following commands to initialize Miniconda and create a Python environment.

This example installs Python 3.10. If you want to install Python 3.9, adjust the command accordingly.
```
sudo su
/root/miniconda3/bin/conda init
source ~/.bashrc 
conda create -n py310 python=3.10
conda activate py310 
```
Run the following command to install DeepGPU-LLM.

Note
The download and installation may take a long time.

Select an appropriate DeepGPU-LLM installation package based on the required versions of DeepGPU-LLM, PyTorch, and CUDA. To obtain the latest DeepGPU-LLM version number, see DeepGPU-LLM acceleration installation packages.
```
pip3 install deepgpu_llm==24.7.2+pt2.4cu124 \
    -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html
```
Run the following command to verify the installation and check the version of DeepGPU-LLM.
```
pip list | grep deepgpu-llm
```
The following output confirms that DeepGPU-LLM version 24.7.2 is installed.
```
deepgpu-llm    24.7.2+pt2.4cu124
```

Install DeepGPU-LLM in Docker

Manual installation

Prepare the Docker environment.

Run the following commands to install or upgrade docker-ce.

On Ubuntu

sudo apt update
sudo apt remove docker docker-engine docker-ce docker.io containerd runc
sudo apt install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
sudo curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository "deb [arch=amd64] https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
sudo apt update
sudo apt install docker-ce
docker -v

On Alibaba Cloud Linux

sudo yum remove docker docker-client docker-client-latest docker-common docker-latest docker-latest-logrotate docker-logrotate docker-engine
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install docker-ce docker-ce-cli containerd.io
sudo systemctl start docker
sudo systemctl enable docker

If the preceding commands fail, run the following commands instead.

yum-config-manager --add-repo https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/centos/docker-ce.repo
sed -i 's+https://download.docker.com+https://mirrors.tuna.tsinghua.edu.cn/docker-ce+' /etc/yum.repos.d/docker-ce.repo

Run the following commands to install nvidia-container-toolkit.

On Ubuntu

sudo curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

On Alibaba Cloud Linux

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum clean expire-cache
yum install -y nvidia-docker2
systemctl restart docker

For more information, see Installing the NVIDIA Container Toolkit.

Run the following commands to pull the Docker image and start a container.

This example uses the pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel image.

sudo docker pull pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
sudo docker run -ti --gpus all --name="deepgpu_llm" --network=host \
           -v /root/workspace:/root/workspace \
           --shm-size 5g pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

Key parameters

Parameter	Description
`--shm-size`	Specifies the shared memory size of the container. This size affects Triton server deployment. Example: `--shm-size 5g` sets the shared memory size to 5 GB. Adjust this value based on the memory required for your model inference workload.
`-v /root/workspace:/root/workspace`	Maps a host directory to a directory in the container, enabling file sharing between the host and container. Configure this mapping based on your environment.
`pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel`	The PyTorch Docker image tag.
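
When the container is started from a script, the parameters in the preceding table can be exposed as arguments. The following Python sketch assembles the same `docker run` command as an argument list. The function name and its defaults are illustrative; the defaults mirror the example values in this topic.

```python
import shlex


def docker_run_args(image: str,
                    name: str = "deepgpu_llm",
                    workspace: str = "/root/workspace",
                    shm_size: str = "5g") -> list:
    """Build the docker run arguments used in this topic."""
    return [
        "docker", "run", "-ti", "--gpus", "all",
        f"--name={name}", "--network=host",
        # Map the same directory on the host and in the container.
        "-v", f"{workspace}:{workspace}",
        # Shared memory size; adjust it to your inference workload.
        "--shm-size", shm_size,
        image,
    ]


args = docker_run_args("pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel")
print(shlex.join(args))
```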

Run the following commands to install dependencies.
```
apt update
apt install openmpi-bin libopenmpi-dev curl
```
The preceding commands install the openmpi-bin, libopenmpi-dev, and curl packages.
Install DeepGPU-LLM.

Install DeepGPU-LLM by running the pip3 install command, based on the DeepGPU-LLM version and the required PyTorch version. To get the latest DeepGPU-LLM version number, see DeepGPU-LLM acceleration packages.
```
pip3 install deepgpu_llm=={deepgpu-llm-version}+{pytorch-version}{cuda-version} \
    -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html
```
For example, if {deepgpu-llm-version} is 24.3, {pytorch-version} is pt2.1, and {cuda-version} is cu121, the command installs DeepGPU-LLM version 24.3.
```
pip3 install deepgpu_llm==24.3+pt2.1cu121 \
    -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html
```
Run the following command to check the installation status and version of DeepGPU-LLM.
```
pip list | grep deepgpu-llm
```
If the following output is returned, DeepGPU-LLM is installed and the current version is 24.3.
```
deepgpu-llm    24.3+pt2.1cu121
```

Container image

The DeepGPU-LLM container image provides a quick installation method. It works out-of-the-box and does not require you to understand the underlying hardware optimizations.

Obtain the DeepGPU-LLM container image.

Log on to the Container Registry console.
In the navigation pane on the left, click Artifact Center.

In the Repository Name search box, enter deepgpu and select the target image egs/deepgpu-llm.

The DeepGPU-LLM container image is updated approximately every three months. The following table describes the image details.

Image name	Component information	Image address	Applicable GPU instances
DeepGPU-LLM	DeepGPU-LLM: 24.3; Python: 3.10; PyTorch: 2.1.0; CUDA: 12.1.1; cuDNN: 8.9.0.131; Base image: Ubuntu 22.04	egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/deepgpu-llm:24.3-pytorch2.1-cuda12.1-cudnn8-ubuntu22.04	gn6e, ebmgn6e; gn7i, ebmgn7i, ebmgn7ix; gn7e, ebmgn7e, ebmgn7ex

The DeepGPU-LLM image supports only the GPU instance families listed in the table. For more information, see GPU compute-optimized instances (gn/ebm/scc series).
Install DeepGPU-LLM.

After the Docker environment is ready, pull the DeepGPU-LLM container image. Then, follow the Procedure for installing DeepGPU-LLM to complete the installation.

Run a model with DeepGPU-LLM

Before downloading a model, log on to the GPU instance. For more information, see Connection method overview.

Download an open source model.

ModelScope is an open source model platform provided by Alibaba DAMO Academy. This section uses the Qwen-7B-Chat model as an example to show how to download a model from ModelScope. You can download the model using one of the following methods.

Important
If the model download fails due to insufficient disk space, resize your cloud disk. For more information, see Cloud disk resizing guide.
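A quick check of the free disk space before the download can help you avoid a failed transfer. The following Python sketch uses the standard library to test whether a path has enough free space. The 20 GiB threshold is a hypothetical placeholder, not a documented requirement of any model; choose a value based on the model file sizes listed on ModelScope.

```python
import shutil

GIB = 1024 ** 3


def has_free_space(path: str, required_gib: float) -> bool:
    """Return True if the file system that holds path has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * GIB


# Hypothetical threshold; adjust it to the size of the model you download.
print(has_free_space("/", 20))
```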
Git LFS clone command
1. Go to the ModelScope official website and search for the model name, for example, qwen.
2. In the model library section of the search results page, click Qwen-7B-Chat.
3. Find the ModelScope-specific model name and copy the model ID.
  
  The model ID Qwen/Qwen-7B-Chat is at the top of the model details page. Click the adjacent copy icon to copy it.
4. Run the following command to download the model.
```
sudo git-lfs clone https://modelscope.cn/qwen/Qwen-7B-Chat.git
```
  Note
  The git-lfs: command not found error indicates that Git LFS is not installed. Run the following commands to install it.
  sudo apt-get update
  sudo apt-get install git-lfs
Snapshot_download method
1. Go to the ModelScope official website and search for the model name, for example, qwen.
2. In the model library section of the search results page, click Qwen-7B-Chat.
3. Find the ModelScope-specific model name and copy the model ID.
  
  The model ID Qwen/Qwen-7B-Chat is at the top of the model details page. Click the adjacent copy icon to copy it.
4. Create the download_from_modelscope.py script with the following content:
  Sample script
  
import argparse
import shutil
from modelscope.hub.snapshot_download import snapshot_download

parser = argparse.ArgumentParser(description='download from modelscope')
parser.add_argument('--model_name', help='the download model name')
parser.add_argument('--version', help='the model version')
args = parser.parse_args()

base_dir = '/root/deepgpu/modelscope'
model_dir = snapshot_download(args.model_name, cache_dir=base_dir, revision=args.version)
print(model_dir)
5. Run the following command to download the model.
  
  Before you download the model, check the model version on the Model Files tab of the Qwen-7B-Chat page. This command uses v1.1.7 as an example.
```
python3 download_from_modelscope.py --model_name Qwen/Qwen-7B-Chat --version v1.1.7
```

Run the Qwen model for conversational inference.

Obtain details about the scripts provided by DeepGPU-LLM to run LLM models.

DeepGPU-LLM provides different scripts for running LLM models, depending on the version:
- DeepGPU-LLM versions earlier than 24.9 provide xxx_cli scripts, such as llama_cli, qwen_cli, baichuan_cli, and chatglm_cli, to run LLM models.
- DeepGPU-LLM versions 24.9 and later provide the deepgpu_cli script to run LLM models.
Run the xxx_cli -h or deepgpu_cli -h command to view help for a script. For example, deepgpu_cli -h prints usage information similar to the following.
```
usage: deepgpu_cli [-h] --model_dir MODEL_DIR [--tp_size TP_SIZE] [--precision {fp16,int8,int4}] [--tokenizer_dir TOKENIZER_DIR]

options:
  -h, --help            show this help message and exit
  --model_dir MODEL_DIR, -i MODEL_DIR
                        model dir
  --tp_size TP_SIZE, -g TP_SIZE
                        How many gpus for inference
  --precision {fp16,int8,int4}, -p {fp16,int8,int4}
  --tokenizer_dir TOKENIZER_DIR, -t TOKENIZER_DIR
                        tokenizer dir
```

Run the following command for conversational inference:

xxx_cli --model_dir [MODEL_DIR] --tp_size [TP_SIZE] --precision [Type]

xxx_cli: The name of the script. Replace this with the specific script name for your DeepGPU-LLM version, such as qwen_cli or deepgpu_cli.
[MODEL_DIR]: The directory of the downloaded model files.
[TP_SIZE]: The number of GPUs to use for inference.
[Type]: The precision type for inference, such as fp16, int8, or int4.
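
The script name depends on the installed DeepGPU-LLM version, as described earlier in this step. The following Python sketch selects the script name from the release number and assembles the inference command. The function names are hypothetical; the version rule (earlier than 24.9 uses a model-specific script, 24.9 and later uses deepgpu_cli) and the precision values come from this topic.

```python
def release_tuple(release: str) -> tuple:
    """Convert a release such as '24.7.2' to (24, 7, 2) for comparison."""
    return tuple(int(part) for part in release.split("."))


def cli_name(release: str, model_family: str) -> str:
    """Return deepgpu_cli for 24.9 and later, else '<family>_cli'."""
    if release_tuple(release) >= (24, 9):
        return "deepgpu_cli"
    return f"{model_family}_cli"


def inference_command(release: str, model_family: str, model_dir: str,
                      tp_size: int = 1, precision: str = "fp16") -> list:
    """Assemble the conversational inference command as an argument list."""
    if precision not in {"fp16", "int8", "int4"}:
        raise ValueError(f"unsupported precision: {precision}")
    return [cli_name(release, model_family),
            "--model_dir", model_dir,
            "--tp_size", str(tp_size),
            "--precision", precision]


print(" ".join(inference_command("24.3", "qwen", "/home/ecs-user/Qwen-7B-Chat")))
```

Comparing the release as a tuple of integers rather than as a string keeps versions such as 24.10 ordered after 24.9.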

The following examples show how to run the qwen_cli script to load the Qwen-7B-Chat or Qwen1.5-7B-Chat model for conversational inference.

Qwen-7B-Chat model

qwen_cli --model_dir /home/ecs-user/Qwen-7B-Chat --tp_size 1 --precision fp16

After the command runs, you can enter text to chat with the Qwen model. Example:

Welcome to the DeepGPU-accelerated Qwen model. Enter your text to start a conversation. Type 'clear' to clear the history, or 'stop' to exit.

User: The poem 'A Quiet Night's Thought'
床前明月光，疑是地上霜。
举头望明月，低头思故乡。

Cost time: 1.06 s
Throughput: 21.76 tokens/s

Qwen1.5-7B-Chat model

qwen_cli --model_dir /home/ecs-user/Qwen1.5-7B-Chat --tp_size 1 --precision fp16

After the command runs, you can enter text to chat with the Qwen model. Example:

Welcome to the DeepGPU-accelerated Qwen model. Enter your text to start a conversation. Type 'clear' to clear the history, or 'stop' to exit.

User: Who wrote the poem 'A Quiet Night's Thought'?
静夜思的作者是唐朝诗人李白。

Cost time: 0.41 s
Throughput: 29.6 tokens/s
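
The two statistics printed after each reply can be combined to estimate the length of the reply. The following Python sketch parses both lines and multiplies them. It assumes that the throughput is the number of generated tokens divided by the cost time; this topic does not define the metric, so treat the result as an approximation. The function name is hypothetical.

```python
import re


def approx_generated_tokens(cli_output: str) -> float:
    """Estimate the reply length as cost time multiplied by throughput."""
    time_match = re.search(r"Cost time:\s*([\d.]+)\s*s", cli_output)
    rate_match = re.search(r"Throughput:\s*([\d.]+)\s*tokens/s", cli_output)
    if time_match is None or rate_match is None:
        raise ValueError("statistics lines not found")
    return float(time_match.group(1)) * float(rate_match.group(1))


sample = "Cost time: 1.06 s\nThroughput: 21.76 tokens/s"
print(round(approx_generated_tokens(sample)))  # prints 23
```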

(Optional) Convert the model and run it for conversational inference.

In some restricted scenarios, you can convert the model before running it for conversational inference. This step uses the qwen1.5-7b-chat model as an example.

Convert the model format.

huggingface_model_convert --in_file /root/Qwen1.5-7B-Chat --saved_dir /root/qwen1.5-7b-chat --infer_gpu_num 1 --weight_data_type fp16 --model_name qwen1.5-7b-chat

Parameters

Parameter	Description
`huggingface_model_convert`	The script that converts the model. Note: If the command is not found, your DeepGPU-LLM version may be outdated. You can upgrade it by following the instructions in (Optional) Upgrade DeepGPU-LLM. Alternatively, depending on the LLM type, you can replace the `model` field with the specific LLM name to perform the model conversion. Check the `help` output to adjust the parameters accordingly.
`--in_file`	The directory of the downloaded model. The provided path is an example; replace it with your actual path.
`--saved_dir`	The directory where the converted model is saved. The provided path is an example; replace it with your desired path.
`--infer_gpu_num`	The number of GPUs to use for inference, which is also the number of model shards.
`--weight_data_type`	The data type for the model weights, which must be consistent with the expected computation type. Valid values are fp16 and bf16.
`--model_name`	The name of the model.

Run the following command for conversational inference.

qwen_cli --tokenizer_dir /root/Qwen1.5-7B-Chat --model_dir /root/qwen1.5-7b-chat/1-gpu/  --tp_size 1 --precision fp16

Parameters

Parameter	Description
`--tp_size`	This value must match the value of `--infer_gpu_num` used during model conversion.
`--precision`	The computational precision. int8 and int4 enable quantization on the weights, while fp16 does not.

Welcome to the DeepGPU-accelerated Qwen model. Enter your text to start a conversation. Type 'clear' to clear the history, or 'stop' to exit.

User: How do I purchase an ECS Savings Plan?
ECS (Elastic Compute Service) is an on-demand compute service provided by Alibaba Cloud. To help you manage and reduce costs, Alibaba Cloud offers ECS Savings Plans:

1. **Purchase from Cloud Marketplace**:
   First, log on to the Alibaba Cloud official website (https://www.aliyun.com/). On the home page, go to Cloud Marketplace, search for "Savings Plan", and then select a plan that meets your needs...
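
The conversion and inference commands in this step share two values: the GPU count must be the same in both, and the inference command reads the converted model from a subdirectory of the conversion output directory. The following Python sketch keeps the two commands consistent. The function name is hypothetical, and the `<N>-gpu` subdirectory pattern is inferred from the single example path in this step (`/root/qwen1.5-7b-chat/1-gpu/`), so verify the directory that the conversion actually creates.

```python
from pathlib import PurePosixPath


def convert_and_run_commands(hf_dir: str, saved_dir: str, model_name: str,
                             gpu_num: int = 1, dtype: str = "fp16"):
    """Build matching conversion and inference commands."""
    convert = ["huggingface_model_convert",
               "--in_file", hf_dir,
               "--saved_dir", saved_dir,
               "--infer_gpu_num", str(gpu_num),
               "--weight_data_type", dtype,
               "--model_name", model_name]
    # Assumption: the converted shards are written to '<saved_dir>/<N>-gpu'.
    sharded_dir = str(PurePosixPath(saved_dir) / f"{gpu_num}-gpu")
    run = ["qwen_cli",
           "--tokenizer_dir", hf_dir,
           "--model_dir", sharded_dir,
           "--tp_size", str(gpu_num),  # must equal --infer_gpu_num
           "--precision", "fp16"]
    return convert, run


convert, run = convert_and_run_commands(
    "/root/Qwen1.5-7B-Chat", "/root/qwen1.5-7b-chat", "qwen1.5-7b-chat")
print(" ".join(convert))
print(" ".join(run))
```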

FAQ

Problem: DeepGPU-LLM fails to install when you run the following commands on a GPU instance running Ubuntu 20.04.

apt-get update
apt-get -y install python3-pip openmpi-bin libopenmpi-dev curl vim
pip3 install deepgpu_llm -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html

Cause and solution: This issue occurs because apt cannot directly install Python 3.10. To resolve this, install the other required components, skipping the Python 3.10 installation. During this process, the gdm3 module may be installed as a dependency, which causes the system to boot into a graphical user interface (GUI) instead of the default command line. Run the following commands to disable the GUI.
```
systemctl disable gdm3
reboot
```

Contact us

For help with installing or using DeepGPU-LLM, join the DingTalk group 23210030587 (download the DingTalk client).
