DeepGPU-LLM: an inference engine for large language models (LLMs)

DeepGPU-LLM is an inference engine for Large Language Models (LLMs) developed by Alibaba Cloud. It runs on Elastic GPU Service and provides high-performance inference services for LLM tasks.

Product Introduction

DeepGPU-LLM is an inference engine developed by Alibaba Cloud that is easy to use and widely applicable. It is designed to optimize Large Language Model inference on Elastic GPU Service. Using optimization and parallel computing techniques, DeepGPU-LLM provides free, high-performance, and low-latency inference services.

The DeepGPU-LLM architecture consists of the following layers:

Mainstream Models: DeepGPU-LLM optimizes and accelerates mainstream Large Language Models, such as Qwen.
Open-source Platform: Open-source model platforms, such as ModelScope and Hugging Face, provide numerous pre-trained models. These platforms offer model storage, management, and distribution features that make it easy to obtain and use mainstream Large Language Models.
Inference Architecture: DeepGPU-LLM uses tensor parallelism (Tensor Parallel) to optimize Large Language Model inference on Elastic GPU Service and provides high-performance, low-latency inference services.
Underlying Hardware: GPU-accelerated instances with installed drivers, CUDA, and other basic environments serve as the underlying hardware for DeepGPU-LLM. They provide powerful compute resources to support efficient Large Language Model inference.
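
The tensor parallelism named in the Inference Architecture layer can be illustrated with a small, CPU-only sketch. It is not DeepGPU-LLM code: it assumes a column-wise split of one weight matrix, and the helper names (`split_columns`, `tensor_parallel_matmul`) are hypothetical. Each shard stands in for the part of the weights that one GPU would hold.

```python
def matmul(x, w):
    """Multiply x (rows x k) by w (k x cols); both are lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split w column-wise into num_shards equally sized shards."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    step = cols // num_shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_shards)]


def tensor_parallel_matmul(x, w, num_shards):
    """Compute x @ w by giving each (simulated) GPU one column shard of w."""
    shards = split_columns(w, num_shards)
    # In a real deployment each partial product would run on its own GPU.
    partials = [matmul(x, shard) for shard in shards]
    # Simulated all-gather: concatenate the partial outputs column-wise.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
sharded = tensor_parallel_matmul(x, w, num_shards=2)
reference = matmul(x, w)
```

Here `sharded` and `reference` hold the same values, which shows that splitting the weights changes where the work runs and not the result. The concatenation step is where inter-GPU communication would occur on real hardware.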

Function Introduction

The main features of DeepGPU-LLM include the following:

Supports Multi-GPU Parallelism (Tensor Parallel)
Splits a Large Language Model across multiple GPUs for parallel computing to improve computational efficiency.

Supports Various Mainstream Models
Supports mainstream models, such as the Qwen, Llama, ChatGLM, and Baichuan series, to meet model inference needs in various scenarios.

Supports FP16 Precision Inference
In addition to FP16 inference, supports weight quantization and KV-Cache quantization modes to enable low-precision model inference. This reduces compute resource consumption while maintaining model performance.

Supports Inter-Card Communication Optimization
Optimizes communication between GPUs to improve the efficiency and speed of multi-GPU parallel computing.

Supports Offline and Serving Mode Output
The offline mode supports streaming output and regular output. The serving mode provides API operations, such as the `generate_cb`, `generate_cb_async`, and `generate_cb_async_id` functions, to adapt to different scenarios.
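
The weight quantization mentioned in the feature list can be illustrated with a generic sketch. This is an assumption for illustration only: it shows a common symmetric 8-bit scheme with one scale per tensor, and the source does not state which quantization formats or algorithms DeepGPU-LLM uses. The function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
# Rounding bounds the per-element error by about half of one scale step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing `q_weights` plus a single `scale` takes roughly half the memory of FP16 weights, at the cost of the small rounding error captured in `max_error`. The same idea applies to a KV cache, where cached keys and values are stored in a lower-precision format.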

Basic Environment Dependencies

The basic environment dependencies for DeepGPU-LLM are as follows:

Category	Item	Specification or Version
Hardware Dependencies	GPU Specification	SM = 70, 75, 80, 86, 89, 90 (such as A800, A30, A10, V100, T4)
Software Dependencies	Operating System	Ubuntu 22.04, Ubuntu 20.04, CentOS series, and Alibaba Cloud Linux series
Software Dependencies	CUDA Version	12.4, 12.1, 11.8, 11.7
Software Dependencies	PyTorch Version	2.4, 2.3, 2.1
Software Dependencies	OpenMPI	4.0.3 and later
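
The requirements in the table can be checked with a short script before installation. The following is a minimal sketch, not an official tool: `check_environment` is a hypothetical helper, and the supported values are copied from the table above. It covers only the GPU, CUDA, and PyTorch rows.

```python
SUPPORTED_SM = {70, 75, 80, 86, 89, 90}
SUPPORTED_CUDA = {"12.4", "12.1", "11.8", "11.7"}
SUPPORTED_TORCH = {"2.4", "2.3", "2.1"}


def check_environment(sm, cuda_version, torch_version):
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []
    if sm not in SUPPORTED_SM:
        problems.append(f"unsupported GPU compute capability: SM {sm}")
    if cuda_version not in SUPPORTED_CUDA:
        problems.append(f"unsupported CUDA version: {cuda_version}")
    # Reduce a full version such as "2.3.1+cu121" to "major.minor".
    torch_mm = ".".join(torch_version.split("+")[0].split(".")[:2])
    if torch_mm not in SUPPORTED_TORCH:
        problems.append(f"unsupported PyTorch version: {torch_version}")
    return problems


ok_result = check_environment(80, "12.1", "2.3.1+cu121")
bad_result = check_environment(61, "12.1", "2.3.1")
```

On a machine with PyTorch installed, the inputs can be read from `torch.cuda.get_device_capability()` (which returns a `(major, minor)` tuple, so SM is `major * 10 + minor`), `torch.version.cuda`, and `torch.__version__`.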

Installation Package and Related File Descriptions

To use DeepGPU-LLM to optimize Large Language Model (LLM) inference on GPUs, you must first download the installation package. Download path: DeepGPU-LLM acceleration installation package. The installation package name has the format deepgpu_llm-x.x.x+ptx.xcuxxx-py3-none-any.whl, where the parts have the following meanings:

deepgpu_llm-x.x.x: The version number of DeepGPU-LLM to be installed.
ptx.x: The supported PyTorch version number.
cuxxx: The supported CUDA version number.
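
The naming format above can be parsed to confirm that a downloaded package matches the local PyTorch and CUDA versions. The sketch below rests on assumptions: the file name in the example is made up to fit the format, `parse_wheel_name` is a hypothetical helper, and the script assumes that the last digit of `cuxxx` is the CUDA minor version (so `cu121` means CUDA 12.1).

```python
import re

# Pattern for deepgpu_llm-x.x.x+ptx.xcuxxx-py3-none-any.whl
WHEEL_PATTERN = re.compile(
    r"^deepgpu_llm-(?P<version>\d+(?:\.\d+)*)"
    r"\+pt(?P<torch>\d+\.\d+)"
    r"cu(?P<cuda>\d+)"
    r"-py3-none-any\.whl$"
)


def parse_wheel_name(filename):
    """Extract the DeepGPU-LLM, PyTorch, and CUDA versions from a file name."""
    match = WHEEL_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected package name: {filename}")
    cuda_digits = match.group("cuda")
    return {
        "deepgpu_llm": match.group("version"),
        "pytorch": match.group("torch"),
        # Assumption: the final digit is the CUDA minor version.
        "cuda": f"{cuda_digits[:-1]}.{cuda_digits[-1]}",
    }


# Hypothetical file name, used only to exercise the parser.
info = parse_wheel_name("deepgpu_llm-1.2.3+pt2.1cu121-py3-none-any.whl")
```

Comparing `info["pytorch"]` and `info["cuda"]` with the installed versions before running `pip install` helps catch a mismatched package early.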

After you download the DeepGPU-LLM installation package, you can find the inference dependency code for mainstream models, weight conversion scripts for mainstream models, and runnable example code within the package.

How to Use DeepGPU-LLM

For more information about how to use the DeepGPU-LLM inference engine to optimize inference for different models, such as Llama, ChatGLM, Baichuan, and Qwen, see Install and Use DeepGPU-LLM.