Install and use DeepGPU-LLM

To run large language model (LLM) tasks, you can install the DeepGPU-LLM inference engine in a GPU instance or a Docker environment. DeepGPU-LLM optimizes large language models, such as the Llama, ChatGLM, Baichuan, and Qwen models, for high-performance inference on GPUs.

Note

LLM workloads are a natural fit for GPU compute-optimized instances. For more information, see GPU compute-optimized instances (gn/ebm/scc series). This topic uses the gn7i GPU compute-optimized instance family as an example.

Install DeepGPU-LLM on a GPU instance

To use DeepGPU-LLM on a GPU instance, you can either select an Alibaba Cloud Marketplace image that comes with DeepGPU-LLM pre-installed, or create a GPU instance and then install it manually.

Automated method (select a Marketplace image)

  1. Obtain a Marketplace image and create a GPU instance.

    Marketplace images come pre-installed with the DeepGPU-LLM tool. You can obtain a Marketplace image in two ways.

    From the ECS creation page

    1. Go to the Instance Creation Page.

    2. Select the Custom Launch tab.

    3. Configure parameters such as the billing method, region, instance type, and image.

      Pay attention to the following parameters. For more information about other configuration items, see Configuration Description.

      • Instance: This topic uses the ecs.gn7i-c8g1.2xlarge instance type (8 vCPUs, 30 GiB of memory) as an example.

      • Image: This example uses an image from the Marketplace image section. The image is free and provides an Alibaba Cloud AI inference solution for large language model (LLM) scenarios. When you create a GPU compute-optimized instance, you can choose from the following Marketplace images for LLM scenarios:

        More images and version information

        The following pre-installed LLM images are supported on GPU compute-optimized instances. Each entry lists the image name followed by its latest version.

        • CentOS 7.9 with DeepGPU-LLM pre-installed: 24.3

        • Ubuntu 20.04 with DeepGPU-LLM pre-installed: 24.4

        • Ubuntu 22.04 with DeepGPU-LLM pre-installed: 24.3

        • Ubuntu image for deploying DeepGPU-LLM: V 1.1.3

        • deepgpu-llm-inference-ubuntu2004: V 0.1

        • CentOS 8.5 with DeepGPU-LLM pre-installed (uefi erdma): 24.3

        • CentOS 7.9 with DeepGPU-LLM pre-installed (uefi erdma): 24.3

        • Ubuntu 20.04 with DeepGPU-LLM pre-installed (uefi + erdma): 24.3.1

        • Ubuntu 22.04 with DeepGPU-LLM pre-installed (uefi + erdma): 24.3

        Note

        Only some instance families, such as ebmgn7ix and ebmgn8is, support the images with UEFI and eRDMA. The images offered in the console take precedence.

      • Public IP: Select Assign Public IPv4 Address. Set the bandwidth billing method to Pay-By-Traffic and the peak bandwidth to 100 Mbps to accelerate model downloads.

    4. Follow the on-screen instructions and click Confirm Order.

    5. On the payment page, review the total cost of the instance. If the details are correct, complete the payment.

    From Marketplace

    1. Go to the Alibaba Cloud Marketplace page.

    2. In the search box, enter deepgpu-llm and press Enter.

    3. Select the desired image and click Details.

      This topic uses the Ubuntu 22.04 with DeepGPU-LLM pre-installed image as an example.

      Marketplace offers more AI inference solution images for LLM scenarios. Available images include:

      More images and version information

      The following pre-installed LLM images are supported on GPU compute-optimized instances. Each entry lists the image name followed by its latest version.

      • CentOS 7.9 with DeepGPU-LLM pre-installed: 24.3

      • Ubuntu 20.04 with DeepGPU-LLM pre-installed: 24.4

      • Ubuntu 22.04 with DeepGPU-LLM pre-installed: 24.3

      • Ubuntu image for deploying DeepGPU-LLM: V 1.1.3

      • deepgpu-llm-inference-ubuntu2004: V 0.1

      • CentOS 8.5 with DeepGPU-LLM pre-installed (uefi erdma): 24.3

      • CentOS 7.9 with DeepGPU-LLM pre-installed (uefi erdma): 24.3

      • Ubuntu 20.04 with DeepGPU-LLM pre-installed (uefi + erdma): 24.3.1

      • Ubuntu 22.04 with DeepGPU-LLM pre-installed (uefi + erdma): 24.3

      Note

      Only some instance families, such as ebmgn7ix and ebmgn8is, support the images with UEFI and eRDMA. The images offered in the console take precedence.

    4. On the image details page, click Custom Launch.

      Note

      The image itself is free. You are charged only for the GPU instance.

    5. On the instance creation page, in the Image section, verify that your purchased image is selected on the Marketplace image tab.

      The purchased image is typically selected by default. If it is not selected, click Select Another Image and choose the desired image.

    6. On the instance creation page, configure other parameters and create the GPU instance.

      For the Public IP parameter, select Assign Public IPv4 Address. Set the bandwidth billing method to Pay-By-Traffic and the peak bandwidth to 100 Mbps to accelerate model downloads. For more information about other parameters, see Configuration Description.

  2. Connect to the GPU instance remotely.

    For more information, see Log on to a Linux instance by using Workbench.

  3. Run the following command to check the installation status and version of DeepGPU-LLM.

    sudo pip list | grep deepgpu-llm

    The following output indicates that DeepGPU-LLM is installed and the current version is 24.3.

    deepgpu-llm    24.3+pt2.1cu121

    Note

    You can also run the sudo pip show -f deepgpu-llm command to view detailed information about the installed DeepGPU-LLM package.

  4. (Optional) Upgrade DeepGPU-LLM.

    If the current DeepGPU-LLM version does not meet your requirements, you can upgrade it by installing a later version.

    1. Go to the DeepGPU-LLM Installation Package page.

    2. Find the DeepGPU-LLM package that you want to install, right-click it, then select Copy Link Address.

    3. In the remote session to the GPU instance, run the following commands to install a later version of DeepGPU-LLM.

      In this example, the deepgpu_llm-24.6+pt2.1cu121-py3-none-any.whl package is downloaded. Replace it with the DeepGPU-LLM version that you need.

      sudo wget https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm-24.6%2Bpt2.1cu121-py3-none-any.whl
      sudo pip install deepgpu_llm-24.6+pt2.1cu121-py3-none-any.whl
    4. Run the following command to verify the upgrade.

      sudo pip list | grep deepgpu-llm

      The following output indicates that DeepGPU-LLM has been upgraded to version 24.6.

      deepgpu-llm    24.6+pt2.1cu121
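
The version strings in the pip output above combine the DeepGPU-LLM release with the PyTorch and CUDA build tags. The following Python sketch splits such a string and decides whether an upgrade is needed. The function names are hypothetical, and the pattern assumes the `<release>+pt<x.y>cu<nnn>` layout seen in this topic; other layouts are rejected.

```python
import re


def parse_deepgpu_version(version: str) -> dict:
    """Split a string such as '24.3+pt2.1cu121' into its parts.

    Assumes the '<release>+pt<pytorch>cu<cuda>' layout shown in this topic.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)*)\+pt(\d+\.\d+)cu(\d+)", version)
    if match is None:
        raise ValueError(f"unexpected version format: {version!r}")
    release, pytorch, cuda = match.groups()
    return {
        # Store the release as a tuple of integers so that it compares numerically.
        "release": tuple(int(part) for part in release.split(".")),
        "pytorch": pytorch,
        "cuda": cuda,
    }


def needs_upgrade(installed: str, required_release: tuple) -> bool:
    """Return True if the installed release is older than the required one."""
    return parse_deepgpu_version(installed)["release"] < required_release


print(parse_deepgpu_version("24.3+pt2.1cu121"))
print(needs_upgrade("24.3+pt2.1cu121", (24, 6)))
```

On an instance where the package is installed, the input string could be read with `importlib.metadata.version("deepgpu-llm")` from the Python standard library instead of being copied from the pip output.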

Manual method (public image)

Create a GPU instance and then install DeepGPU-LLM on it. This topic uses a public image of Ubuntu 22.04 64-bit or Alibaba Cloud Linux 3 as an example.

Ubuntu 22.04

  1. Create a GPU instance.

    1. Go to the Instance Creation page.

    2. Click the Custom Launch tab.

    3. Configure the billing method, region, network, zone, instance type, and image.

      Note the following parameters. For more information about other configuration parameters, see Configuration options.

      • Instance: This example uses the ecs.gn7i-c8g1.2xlarge 8 vCPU 30 GiB instance type.

      • Image: From public images, select Ubuntu 22.04 64-bit. You can select the Install GPU Driver option to install the GPU driver, CUDA, and cuDNN at the same time.

      • Public IP: Select Assign Public IPv4 Address, set the bandwidth billing method to Pay-By-Traffic, and set the peak bandwidth to 100 Mbps to accelerate model downloads.

    4. Complete the instance configuration and click Create Order.

    5. On the payment page, review the total cost and complete the payment.

  2. (Conditional) If you did not select the Install GPU Driver option when you created the GPU instance, you must manually install the Tesla driver and CUDA Toolkit.

    For more information, see Manually install a Tesla driver on a GPU-accelerated compute instance (Linux) and Install CUDA.

  3. Remotely connect to the GPU instance.

    For more information, see Log on to a Linux instance by using Workbench.

  4. Run the following commands to configure environment variables.

    export PATH=/usr/local/cuda-12.4/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
  5. Run the following commands to verify that the GPU driver and CUDA are installed.

    nvidia-smi
    nvcc -V

    The following output confirms that the driver and CUDA are installed.

    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.90.07          Driver Version: 550.90.07     CUDA Version: 12.4         |
    |--------------------------------------------+------------------------+-------------------+
    | GPU  Name       Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC        |
    | Fan  Temp  Perf Pwr:Usage/Cap |  Memory-Usage          | GPU-Util  Compute M.        |
    |                               |                        |               MIG M.        |
    |===============================+========================+=============================|
    |   0  NVIDIA A10          On   | 00000000:00:07.0  Off  |                           0 |
    |  0%   27C    P8     9W / 150W |     1MiB / 23028MiB    |      0%      Default        |
    |                               |                        |                     N/A     |
    +-------------------------------+------------------------+-----------------------------+
    
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2024 NVIDIA Corporation
    Built on Thu_Mar_28_02:18:24_PDT_2024
    Cuda compilation tools, release 12.4, V12.4.131
    Build cuda_12.4.r12.4/compiler.34097967_0
  6. (Conditional) If your GPU instance belongs to the ebmgn7, ebmgn7e, ebmgn7ex, or sccgn7ex instance family, install the nvidia-fabricmanager service that corresponds to the driver version.

    For more information, see Install the nvidia-fabricmanager service.

  7. Run the following commands to install dependencies for DeepGPU-LLM.

    sudo apt-get update
    sudo apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev curl vim
  8. Run the following command to install DeepGPU-LLM.

    Note

    The download and installation may take a long time.

    Select an appropriate DeepGPU-LLM installation package based on the required versions of DeepGPU-LLM, PyTorch, and CUDA. To obtain the latest DeepGPU-LLM version number, see DeepGPU-LLM acceleration installation packages.

    sudo pip3 install deepgpu_llm=={deepgpu-llm-version}+{pytorch-version}{cuda-version} \
        -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html

    For example, if {deepgpu-llm-version} is 24.7.2, {pytorch-version} is pt2.4, and {cuda-version} is cu124, the command installs DeepGPU-LLM version 24.7.2.

    sudo pip3 install deepgpu_llm==24.7.2+pt2.4cu124 \
        -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html
  9. Run the following command to verify the installation and check the version of DeepGPU-LLM.

    sudo pip list | grep deepgpu-llm

    The following output confirms that DeepGPU-LLM version 24.7.2 is installed.

    deepgpu-llm    24.7.2+pt2.4cu124
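
The install command in step 8 is assembled from three values: the DeepGPU-LLM version, the PyTorch tag, and the CUDA tag. The following Python sketch builds the same argument list from those values. The function name and the prefix checks are assumptions made for illustration; whether a package exists for a given combination still has to be confirmed on the installation package page.

```python
# Package index page used by the pip3 commands in this topic.
INDEX_URL = (
    "https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/"
    "aiacc-inference-llm/deepgpu_llm.html"
)


def pip_install_args(deepgpu_version: str, pytorch_tag: str, cuda_tag: str) -> list:
    """Build the pip3 argument list for one DeepGPU-LLM package.

    The tags follow the examples in this topic, such as 'pt2.4' and 'cu124'.
    """
    if not pytorch_tag.startswith("pt"):
        raise ValueError("the PyTorch tag is expected to start with 'pt'")
    if not cuda_tag.startswith("cu"):
        raise ValueError("the CUDA tag is expected to start with 'cu'")
    requirement = f"deepgpu_llm=={deepgpu_version}+{pytorch_tag}{cuda_tag}"
    return ["pip3", "install", requirement, "-f", INDEX_URL]


args = pip_install_args("24.7.2", "pt2.4", "cu124")
print(" ".join(args))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting.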

Alibaba Cloud Linux 3

  1. Create a GPU instance.

    1. Go to the Instance Creation page.

    2. Click the Custom Launch tab.

    3. Configure the billing method, region, network, zone, instance type, and image.

      Note the following parameters. For more information about other configuration parameters, see Configuration options.

      • Instance: This example uses the ecs.gn7i-c8g1.2xlarge 8 vCPU 30 GiB instance type.

      • Image: From public images, select Alibaba Cloud Linux 3.2104 LTS 64-bit, and select the Install GPU Driver option to install the GPU driver, CUDA, and cuDNN at the same time.

      • Public IP: Select Assign Public IPv4 Address, set the bandwidth billing method to Pay-By-Traffic, and set the peak bandwidth to 100 Mbps to accelerate model downloads.

    4. Complete the instance configuration and click Create Order.

    5. On the payment page, review the total cost and complete the payment.

  2. Connect to the GPU instance remotely.

    For more information, see Log on to a Linux instance by using Workbench.

  3. Run the following commands to verify that the GPU driver and CUDA are installed.

    nvidia-smi
    nvcc -V

    The following output confirms that the driver and CUDA are installed.

    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.90.07          Driver Version: 550.90.07     CUDA Version: 12.4         |
    |--------------------------------------------+------------------------+-------------------+
    | GPU  Name       Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC        |
    | Fan  Temp  Perf Pwr:Usage/Cap |  Memory-Usage          | GPU-Util  Compute M.        |
    |                               |                        |               MIG M.        |
    |===============================+========================+=============================|
    |   0  NVIDIA A10          On   | 00000000:00:07.0  Off  |                           0 |
    |  0%   27C    P8     9W / 150W |     1MiB / 23028MiB    |      0%      Default        |
    |                               |                        |                     N/A     |
    +-------------------------------+------------------------+-----------------------------+
    
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2024 NVIDIA Corporation
    Built on Thu_Mar_28_02:18:24_PDT_2024
    Cuda compilation tools, release 12.4, V12.4.131
    Build cuda_12.4.r12.4/compiler.34097967_0
  4. Run the following commands to install dependencies for DeepGPU-LLM.

    sudo yum install epel-release
    sudo yum update
    sudo yum install openmpi3 openmpi3-devel curl
    sudo wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    sudo chmod +x Miniconda3-latest-Linux-x86_64.sh
    sudo ./Miniconda3-latest-Linux-x86_64.sh
  5. Run the following commands to update environment variables.

    export PATH=/usr/lib64/openmpi3/bin:$PATH
    export LD_LIBRARY_PATH=/usr/lib64/openmpi3/lib:$LD_LIBRARY_PATH
  6. Run the following commands to initialize Miniconda and create a Python environment.

    This example installs Python 3.10. If you want to install Python 3.9, adjust the command accordingly.

    sudo su
    /root/miniconda3/bin/conda init
    source ~/.bashrc 
    conda create -n py310 python=3.10
    conda activate py310 
  7. Run the following command to install DeepGPU-LLM.

    Note

    The download and installation may take a long time.

    Select an appropriate DeepGPU-LLM installation package based on the required versions of DeepGPU-LLM, PyTorch, and CUDA. To obtain the latest DeepGPU-LLM version number, see DeepGPU-LLM acceleration installation packages.

    pip3 install deepgpu_llm==24.7.2+pt2.4cu124 \
        -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html
  8. Run the following command to verify the installation and check the version of DeepGPU-LLM.

    pip list | grep deepgpu-llm

    The following output confirms that DeepGPU-LLM version 24.7.2 is installed.

    deepgpu-llm    24.7.2+pt2.4cu124
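
The CUDA tag in the package name (cu124 in this example) corresponds to the CUDA release that nvcc reports in step 3. The following Python sketch derives the tag from the nvcc output. The function name is hypothetical, and the mapping from release 12.4 to cu124 is inferred from the examples in this topic rather than from a published naming rule.

```python
import re


def cuda_tag_from_nvcc(nvcc_output: str) -> str:
    """Derive a tag such as 'cu124' from the text printed by 'nvcc -V'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in the nvcc output")
    major, minor = match.groups()
    return f"cu{major}{minor}"


# Line copied from the sample nvcc output in this topic.
sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(cuda_tag_from_nvcc(sample))  # cu124
```

On a GPU instance, the input text could be captured with `subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout`.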

Install DeepGPU-LLM in Docker

Manual installation

  1. Prepare the Docker environment.

    1. Run the following commands to install or upgrade docker-ce.

      • On Ubuntu

        sudo apt update
        sudo apt remove docker docker-engine docker-ce docker.io containerd runc
        sudo apt install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
        sudo curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
        sudo apt-key fingerprint 0EBFCD88
        sudo add-apt-repository "deb [arch=amd64] https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
        sudo apt update
        sudo apt install docker-ce
        docker -v
      • On Alibaba Cloud Linux

        sudo yum remove docker docker-client docker-client-latest docker-common docker-latest docker-latest-logrotate docker-logrotate docker-engine
        sudo yum install -y yum-utils
        sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
        sudo yum install docker-ce docker-ce-cli containerd.io
        sudo systemctl start docker
        sudo systemctl enable docker

        If the preceding commands fail, run the following commands instead.

        yum-config-manager --add-repo https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/centos/docker-ce.repo
        sed -i 's+https://download.docker.com+https://mirrors.tuna.tsinghua.edu.cn/docker-ce+' /etc/yum.repos.d/docker-ce.repo
                          
    2. Run the following commands to install nvidia-container-toolkit.

      On Ubuntu

      sudo curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
        && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
          sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
          sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
        && \
          sudo apt-get update
      
      sudo apt-get install -y nvidia-container-toolkit
      sudo nvidia-ctk runtime configure --runtime=docker
      sudo systemctl restart docker

      On Alibaba Cloud Linux

      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
      yum clean expire-cache
      yum install -y nvidia-docker2
      systemctl restart docker

      For more information, see Installing the NVIDIA Container Toolkit.

  2. Run the following commands to pull the Docker image and start a container.

    This example uses the pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel image.

    sudo docker pull pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
    sudo docker run -ti --gpus all --name="deepgpu_llm" --network=host \
               -v /root/workspace:/root/workspace \
               --shm-size 5g pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

    Key parameters

    • --shm-size: Specifies the shared memory size of the container. This size affects Triton server deployment. For example, --shm-size 5g sets the shared memory size to 5 GB. Adjust this value based on the memory required for your model inference workload.

    • -v /root/workspace:/root/workspace: Maps a host directory to a directory in the container, enabling file sharing between the host and container. Configure this mapping based on your environment.

    • pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel: The PyTorch Docker image and tag.

  3. Run the following commands to install dependencies.

    apt update
    apt install openmpi-bin libopenmpi-dev curl

    The preceding command installs the openmpi-bin, libopenmpi-dev, and curl packages.

  4. Install DeepGPU-LLM.

    Select the DeepGPU-LLM package that matches the DeepGPU-LLM, PyTorch, and CUDA versions that you need, and install it by running the pip3 install command. To get the latest DeepGPU-LLM version number, see DeepGPU-LLM acceleration packages. The commands in this step run inside the container as the root user, so sudo is not required.

    pip3 install deepgpu_llm=={deepgpu-llm-version}+{pytorch-version}{cuda-version} \
        -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html

    For example, if {deepgpu-llm-version} is 24.3, {pytorch-version} is pt2.1, and {cuda-version} is cu121, the command installs DeepGPU-LLM version 24.3.

    pip3 install deepgpu_llm==24.3+pt2.1cu121 \
        -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html

  5. Run the following command to check the installation status and version of DeepGPU-LLM.

    pip list | grep deepgpu-llm

    If the following output is returned, DeepGPU-LLM is installed and the current version is 24.3.

    deepgpu-llm    24.3+pt2.1cu121
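
The docker run command in step 2 has three values that usually change per environment: the image, the shared memory size, and the directory mapping. The following Python sketch assembles the same command from those values. The function name and its defaults are illustrative; only flags that already appear in this topic are used.

```python
import shlex


def docker_run_args(image: str, name: str = "deepgpu_llm",
                    host_dir: str = "/root/workspace",
                    container_dir: str = "/root/workspace",
                    shm_size: str = "5g") -> list:
    """Build the 'docker run' argument list used in this topic."""
    return [
        "docker", "run", "-ti",
        "--gpus", "all",                      # expose all GPUs to the container
        f"--name={name}",
        "--network=host",
        "-v", f"{host_dir}:{container_dir}",  # share a host directory
        "--shm-size", shm_size,               # shared memory size
        image,
    ]


args = docker_run_args("pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel", shm_size="8g")
print(shlex.join(args))
```

Printing the command with `shlex.join` makes it easy to review before running it, for example after raising the shared memory size for a larger model.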

Container image

The DeepGPU-LLM container image provides a quick installation method. It works out-of-the-box and does not require you to understand the underlying hardware optimizations.

  1. Obtain the DeepGPU-LLM container image.

    1. Log on to the Container Registry console.

    2. In the navigation pane on the left, click Artifact Center.

    3. In the Repository Name search box, enter deepgpu and select the target image egs/deepgpu-llm.

      The DeepGPU-LLM container image is updated approximately every three months. The following table describes the image details.

      Image name

      Component information

      Image address

      Applicable GPU instances

      DeepGPU-LLM

      • DeepGPU-LLM: 24.3

      • Python: 3.10

      • PyTorch: 2.1.0

      • CUDA: 12.1.1

      • cuDNN: 8.9.0.131

      • Base image: Ubuntu 22.04

      egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/deepgpu-llm:24.3-pytorch2.1-cuda12.1-cudnn8-ubuntu22.04

      The DeepGPU-LLM image only supports the following GPU instances. For more information, see GPU compute-optimized instances (gn/ebm/scc series).

      • gn6e, ebmgn6e

      • gn7i, ebmgn7i, ebmgn7ix

      • gn7e, ebmgn7e, ebmgn7ex

  2. Install DeepGPU-LLM.

    After the Docker environment is ready, pull the DeepGPU-LLM container image. Then, follow the Procedure for installing DeepGPU-LLM to complete the installation.
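
Because the container image supports only the instance families listed above, it can help to check an instance type before you pull the image. The following Python sketch extracts the family from an instance type name and looks it up in that list. The parsing rule is an assumption based on names such as ecs.gn7i-c8g1.2xlarge (the family is the text between "ecs." and the next hyphen or period), and the function names are hypothetical.

```python
import re

# Instance families listed for the DeepGPU-LLM container image in this topic.
SUPPORTED_FAMILIES = {
    "gn6e", "ebmgn6e",
    "gn7i", "ebmgn7i", "ebmgn7ix",
    "gn7e", "ebmgn7e", "ebmgn7ex",
}


def instance_family(instance_type: str) -> str:
    """Return the family part of a name such as 'ecs.gn7i-c8g1.2xlarge'."""
    prefix, _, rest = instance_type.partition(".")
    if prefix != "ecs" or not rest:
        raise ValueError(f"unexpected instance type: {instance_type!r}")
    # Assumption: the family ends at the first hyphen or period.
    return re.split(r"[-.]", rest, maxsplit=1)[0]


def image_supported(instance_type: str) -> bool:
    """Check whether the instance family is in the supported list."""
    return instance_family(instance_type) in SUPPORTED_FAMILIES


print(instance_family("ecs.gn7i-c8g1.2xlarge"))
print(image_supported("ecs.gn7i-c8g1.2xlarge"))
```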

Run a model with DeepGPU-LLM

Before downloading a model, log on to the GPU instance. For more information, see Connection method overview.

  1. Download an open source model.

    ModelScope is an open source model platform provided by Alibaba DAMO Academy. This section uses the Qwen-7B-Chat model as an example to show how to download a model from ModelScope. You can download the model using one of the following methods.

    Important

    If the model download fails due to insufficient disk space, resize your cloud disk. For more information, see Cloud disk resizing guide.

    Git LFS clone command

    1. Go to the ModelScope official website and search for the model name, for example, qwen.

    2. In the model library section of the search results page, click Qwen-7B-Chat.

    3. Find the ModelScope-specific model name and copy the model ID.

      The model ID Qwen/Qwen-7B-Chat is at the top of the model details page. Click the adjacent copy icon to copy it.

    4. Run the following command to download the model.

      sudo git-lfs clone https://modelscope.cn/qwen/Qwen-7B-Chat.git

      Note

      The git-lfs: command not found error indicates that Git LFS is not installed. Run the following commands to install it.

      sudo apt-get update
      sudo apt-get install git-lfs

    snapshot_download method

    1. Go to the ModelScope official website and search for the model name, for example, qwen.

    2. In the model library section of the search results page, click Qwen-7B-Chat.

    3. Find the ModelScope-specific model name and copy the model ID.

      The model ID Qwen/Qwen-7B-Chat is at the top of the model details page. Click the adjacent copy icon to copy it.

    4. Create the download_from_modelscope.py script with the following content:

      Sample script

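      import argparse

      from modelscope.hub.snapshot_download import snapshot_download

      # Read the model ID and version from the command line.
      parser = argparse.ArgumentParser(description='Download a model from ModelScope')
      parser.add_argument('--model_name', required=True, help='model ID, for example Qwen/Qwen-7B-Chat')
      parser.add_argument('--version', help='model version (revision) to download')
      args = parser.parse_args()

      # Directory in which the downloaded model files are cached.
      base_dir = '/root/deepgpu/modelscope'

      # Download the model snapshot and print the local directory that contains it.
      model_dir = snapshot_download(args.model_name, cache_dir=base_dir, revision=args.version)
      print(model_dir)
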
    5. Run the following command to download the model.

      Before you download the model, check the model version on the Model Files tab of the Qwen-7B-Chat page. This command uses v1.1.7 as an example.

      python3 download_from_modelscope.py --model_name Qwen/Qwen-7B-Chat --version v1.1.7
  2. Run the Qwen model for conversational inference.

    1. Obtain details about the scripts provided by DeepGPU-LLM to run LLM models.

      DeepGPU-LLM provides different scripts for running LLM models, depending on the version:

      • DeepGPU-LLM versions earlier than 24.9 provide xxx_cli scripts, such as llama_cli, qwen_cli, baichuan_cli, and chatglm_cli, to run LLM models.

      • DeepGPU-LLM versions 24.9 and later provide the deepgpu_cli script to run LLM models.

      Run the xxx_cli -h or deepgpu_cli -h command to view the help for a script. For example, run qwen_cli -h to view the help for the qwen_cli script. The following sample shows the help output of deepgpu_cli.

      usage: deepgpu_cli [-h] --model_dir MODEL_DIR [--tp_size TP_SIZE] [--precision {fp16,int8,int4}] [--tokenizer_dir TOKENIZER_DIR]
      
      options:
        -h, --help            show this help message and exit
        --model_dir MODEL_DIR, -i MODEL_DIR
                              model dir
        --tp_size TP_SIZE, -g TP_SIZE
                              How many gpus for inference
        --precision {fp16,int8,int4}, -p {fp16,int8,int4}
        --tokenizer_dir TOKENIZER_DIR, -t TOKENIZER_DIR
                              tokenizer dir
    2. Run the following command for conversational inference:

      xxx_cli --model_dir [MODEL_DIR] --tp_size [TP_SIZE] --precision [Type]
      • xxx_cli: The name of the script. Replace this with the specific script name for your DeepGPU-LLM version, such as qwen_cli or deepgpu_cli.

      • [MODEL_DIR]: The directory of the downloaded model files.

      • [TP_SIZE]: The number of GPUs to use for inference.

      • [Type]: The precision type for inference, such as fp16, int8, or int4.

      The following examples show how to run the qwen_cli script to load the qwen-7b-chat or qwen1.5-7b-chat model for conversational inference.

      Qwen-7b-chat model

      qwen_cli --model_dir /home/ecs-user/Qwen-7B-Chat --tp_size 1 --precision fp16

      After the command runs, you can enter text to chat with the Qwen model. Example:

      Welcome to the DeepGPU-accelerated Qwen model. Enter your text to start a conversation. Type 'clear' to clear the history, or 'stop' to exit.
      
      User: The poem 'A Quiet Night's Thought'
      床前明月光,疑是地上霜。
      举头望明月,低头思故乡。
      
      Cost time: 1.06 s
      Throughput: 21.76 tokens/s

      Qwen1.5-7b-chat model

      qwen_cli --model_dir /home/ecs-user/Qwen1.5-7B-Chat --tp_size 1 --precision fp16

      After the command runs, you can enter text to chat with the Qwen model. Example:

      Welcome to the DeepGPU-accelerated Qwen model. Enter your text to start a conversation. Type 'clear' to clear the history, or 'stop' to exit.
      
      User: Who wrote the poem 'A Quiet Night's Thought'?
      静夜思的作者是唐朝诗人李白。
      
      Cost time: 0.41 s
      Throughput: 29.6 tokens/s
  3. (Optional) Convert the model and run it for conversational inference.

    In some restricted scenarios, you can convert the model before running it for conversational inference. This step uses the qwen1.5-7b-chat model as an example.

    1. Convert the model format.

      huggingface_model_convert --in_file /root/Qwen1.5-7B-Chat --saved_dir /root/qwen1.5-7b-chat --infer_gpu_num 1 --weight_data_type fp16 --model_name qwen1.5-7b-chat

      Parameters

      • huggingface_model_convert: The script that converts the model.

        Note

        If you cannot find this command, your DeepGPU-LLM version may be outdated. You can upgrade it by following the instructions in (Optional) Upgrade DeepGPU-LLM. Alternatively, depending on the LLM type, you can replace the model field with the specific LLM name to perform the model conversion. Check the help output to adjust the parameters accordingly.

      • --in_file: The directory of the downloaded model. The provided path is an example; replace it with your actual path.

      • --saved_dir: The directory where the converted model is saved. The provided path is an example; replace it with your desired path.

      • --infer_gpu_num: The number of GPUs to use for inference, which is also the number of model shards.

      • --weight_data_type: The data type for the model weights, which must be consistent with the expected computation type. Valid values are fp16 and bf16.

      • --model_name: The name of the model.

    2. Run the following command for conversational inference.

      qwen_cli --tokenizer_dir /root/Qwen1.5-7B-Chat --model_dir /root/qwen1.5-7b-chat/1-gpu/  --tp_size 1 --precision fp16

      Parameters

      • --tp_size: This value must match the value of --infer_gpu_num used during model conversion.

      • --precision: The computational precision. int8 and int4 enable quantization on the weights, while fp16 does not.

      Welcome to the DeepGPU-accelerated Qwen model. Enter your text to start a conversation. Type 'clear' to clear the history, or 'stop' to exit.
      
      User: How do I purchase an ECS Savings Plan?
      ECS (Elastic Compute Service) is an on-demand compute service provided by Alibaba Cloud. To help you manage and reduce costs, Alibaba Cloud offers ECS Savings Plans:
      
      1. **Purchase from Cloud Marketplace**:
         First, log on to the Alibaba Cloud official website (https://www.aliyun.com/). On the home page, go to Cloud Marketplace, search for "Savings Plan", and then select a plan that meets your needs...
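
The script name used for inference depends on the DeepGPU-LLM version: versions earlier than 24.9 provide per-model scripts such as qwen_cli, and versions 24.9 and later provide deepgpu_cli. The following Python sketch selects the script and builds the command from the parameters described in this section. The function names and the model family keys are hypothetical, and the release argument is assumed to be the part of the version string before the plus sign.

```python
# Per-model scripts named in this topic for versions earlier than 24.9.
LEGACY_SCRIPTS = {
    "llama": "llama_cli",
    "qwen": "qwen_cli",
    "baichuan": "baichuan_cli",
    "chatglm": "chatglm_cli",
}


def cli_script(release: str, model_family: str) -> str:
    """Pick the script name for a release such as '24.7.2' or '24.9'."""
    parts = tuple(int(part) for part in release.split("."))
    if parts >= (24, 9):
        return "deepgpu_cli"
    return LEGACY_SCRIPTS[model_family]


def inference_args(release: str, model_family: str, model_dir: str,
                   tp_size: int = 1, precision: str = "fp16") -> list:
    """Build the argument list for conversational inference."""
    if precision not in {"fp16", "int8", "int4"}:
        raise ValueError(f"unsupported precision: {precision!r}")
    return [
        cli_script(release, model_family),
        "--model_dir", model_dir,
        "--tp_size", str(tp_size),
        "--precision", precision,
    ]


print(" ".join(inference_args("24.7.2", "qwen", "/home/ecs-user/Qwen-7B-Chat")))
print(" ".join(inference_args("24.9", "qwen", "/home/ecs-user/Qwen-7B-Chat")))
```

If the model was converted first, `tp_size` should match the value of --infer_gpu_num that was used for the conversion.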

FAQ

  • Problem: DeepGPU-LLM fails to install when you run the following commands on a GPU instance running Ubuntu 20.04.

    apt-get update
    apt-get -y install python3-pip openmpi-bin libopenmpi-dev curl vim
    pip3 install deepgpu_llm -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-llm/deepgpu_llm.html
  • Cause and solution: This issue occurs because apt cannot directly install Python 3.10. To resolve this, install the other required components, skipping the Python 3.10 installation. During this process, the gdm3 module may be installed as a dependency, which causes the system to boot into a graphical user interface (GUI) instead of the default command line. Run the following commands to disable the GUI.

    systemctl disable gdm3
    reboot

Contact us

For help with installing or using DeepGPU-LLM, join the DingTalk group 23210030587 (download the DingTalk client).