DeepGPU-LLM is an inference engine for Large Language Models (LLMs) developed by Alibaba Cloud. It runs on Elastic GPU Service and provides high-performance inference services for LLM tasks.
Product Introduction
DeepGPU-LLM is an inference engine developed by Alibaba Cloud that is easy to use and widely applicable. It is designed to optimize Large Language Model inference on Elastic GPU Service. Using optimization and parallel computing techniques, DeepGPU-LLM provides free, high-performance, and low-latency inference services.
The DeepGPU-LLM architecture is shown below:
- Mainstream Models: DeepGPU-LLM optimizes and accelerates mainstream Large Language Models, such as Qwen.
- Open-source Platform: Open-source model platforms, such as ModelScope and Hugging Face, provide numerous pre-trained models. These platforms offer model storage, management, and distribution features that make it easy to obtain and use mainstream Large Language Models.
- Inference Architecture: DeepGPU-LLM uses tensor parallelism (Tensor Parallel) to optimize Large Language Model inference on Elastic GPU Service and provides high-performance, low-latency inference services.
- Underlying Hardware: GPU-accelerated instances with installed drivers, CUDA, and other basic environments serve as the underlying hardware for DeepGPU-LLM. They provide the compute resources needed for efficient Large Language Model inference.
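The idea behind tensor parallelism can be illustrated with a small, self-contained sketch. It is not DeepGPU-LLM code: the function names are hypothetical, each "GPU" is simulated by a plain Python list, and only the column-wise split of a single weight matrix is shown.

```python
# Illustrative sketch of column-wise tensor parallelism (not DeepGPU-LLM code).
# Each simulated device holds one column shard of the weight matrix.

def matmul(x, w):
    """Multiply x (n x k) by w (k x m); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, num_shards):
    """Split w column-wise into num_shards equally sized shards."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    step = cols // num_shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_shards)]

def tensor_parallel_matmul(x, w, num_shards):
    """Compute x @ w by letting each simulated device multiply by its own shard."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Concatenate the per-device results along the column axis,
    # which plays the role of an all-gather between GPUs.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
result = tensor_parallel_matmul(x, w, num_shards=2)
```

Each shard needs only its own slice of the weights, which is why a model too large for one GPU can be spread across several. The final concatenation step is where inter-GPU communication occurs in a real system.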
Function Introduction
The main features of DeepGPU-LLM include the following:
- Supports Multi-GPU Parallelism (Tensor Parallel): Splits Large Language Models across multiple GPUs for parallel computing to improve computational efficiency.
- Supports Various Mainstream Models: Supports mainstream models, such as the Qwen, Llama, ChatGLM, and Baichuan series, to meet model inference needs in various scenarios.
- Supports FP16 Precision Inference: Runs inference in FP16 and also supports weight quantization and KV-Cache quantization modes for lower-precision inference. This reduces compute resource consumption while maintaining model performance.
- Supports Inter-Card Communication Optimization: Optimizes communication between GPUs to improve the efficiency and speed of multi-GPU parallel computing.
- Supports Offline and Serving Mode Output: The offline mode supports streaming output and regular output. The serving mode provides API operations, such as the `generate_cb`, `generate_cb_async`, and `generate_cb_async_id` functions, to adapt to different scenarios.
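To make the weight quantization feature in the list above more concrete, the following is a minimal sketch of symmetric per-tensor quantization. It assumes an 8-bit integer target, which this page does not specify, and it is a generic illustration rather than the scheme DeepGPU-LLM implements. The function names are hypothetical.

```python
# Generic sketch of symmetric weight quantization (assumption: int8 target).
# This is not the DeepGPU-LLM implementation.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.27]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Storing integers plus one scale factor takes less memory than storing full-precision floats, at the cost of a rounding error of at most half a scale step per weight.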
Basic Environment Dependencies
The basic environment dependencies for DeepGPU-LLM are as follows:
| Category | Item | Specification or Version |
|---|---|---|
| Hardware Dependencies | GPU Specification | SM=70, 75, 80, 86, 89, 90 (such as A800, A30, A10, V100, T4) |
| Software Dependencies | Operating System | Ubuntu 22.04, Ubuntu 20.04, CentOS series, and Alibaba Cloud Linux series |
| | CUDA Version | 12.4, 12.1, 11.8, 11.7 |
| | PyTorch Version | 2.4, 2.3, 2.1 |
| | OpenMPI | 4.0.3 and later |
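The dependency table can be turned into a quick compatibility check. The sketch below is a hypothetical helper, not part of DeepGPU-LLM. It takes the values as plain arguments so that it runs anywhere. On a GPU instance, the inputs could be read with PyTorch's `torch.cuda.get_device_capability()`, `torch.version.cuda`, and `torch.__version__`.

```python
# Hypothetical helper that checks an environment against the table above.
SUPPORTED_SM = {70, 75, 80, 86, 89, 90}
SUPPORTED_CUDA = {"12.4", "12.1", "11.8", "11.7"}
SUPPORTED_TORCH = {"2.4", "2.3", "2.1"}

def major_minor(version):
    """Reduce a version string such as '2.3.1+cu121' to 'major.minor'."""
    parts = version.split("+")[0].split(".")
    return ".".join(parts[:2])

def check_environment(sm, cuda_version, torch_version):
    """Return a list of problems; an empty list means the values are supported."""
    problems = []
    if sm not in SUPPORTED_SM:
        problems.append(f"unsupported GPU compute capability: SM {sm}")
    if major_minor(cuda_version) not in SUPPORTED_CUDA:
        problems.append(f"unsupported CUDA version: {cuda_version}")
    if major_minor(torch_version) not in SUPPORTED_TORCH:
        problems.append(f"unsupported PyTorch version: {torch_version}")
    return problems

# SM 80 with CUDA 12.1 and PyTorch 2.3.x matches the table.
ok = check_environment(80, "12.1", "2.3.1+cu121")
# SM 61 and PyTorch 2.2 are not listed in the table.
bad = check_environment(61, "12.1", "2.2.0")
```

The check covers only the GPU, CUDA, and PyTorch rows. The operating system and OpenMPI version would need separate checks.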
Installation Package and Related File Descriptions
To use DeepGPU-LLM for Large Language Model (LLM) inference optimization on GPUs, you must first download the installation package. Download path: DeepGPU-LLM acceleration installation package. Installation package names have the format `deepgpu_llm-x.x.x+ptx.xcuxxx-py3-none-any.whl`, where:
- deepgpu_llm-x.x.x: The version number of DeepGPU-LLM to be installed.
- ptx.x: The supported PyTorch version number.
- cuxxx: The supported CUDA version number.
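The naming scheme above can be parsed mechanically, which helps when picking the package that matches an existing environment. The following sketch is a hypothetical helper, and the file name in the example is made up to fit the documented format. It does not refer to an actual release.

```python
import re

# Pattern derived from the documented format
# deepgpu_llm-x.x.x+ptx.xcuxxx-py3-none-any.whl
WHEEL_PATTERN = re.compile(
    r"^deepgpu_llm-(?P<version>[^+]+)"
    r"\+pt(?P<torch>[\d.]+)"
    r"cu(?P<cuda>\d+)"
    r"-py3-none-any\.whl$"
)

def parse_wheel_name(filename):
    """Split a DeepGPU-LLM package name into its version components."""
    match = WHEEL_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected package name: {filename}")
    return match.groupdict()

# Made-up file name used only to exercise the parser.
info = parse_wheel_name("deepgpu_llm-1.2.3+pt2.1cu121-py3-none-any.whl")
```

The `cuda` field is returned as the raw digits from the file name, without inserting a decimal point.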
The downloaded DeepGPU-LLM installation package contains the inference dependency code and weight conversion scripts for mainstream models, as well as runnable example code.
How to Use DeepGPU-LLM
For more information about how to use the DeepGPU-LLM inference engine to optimize inference for different models, such as Llama, ChatGLM, Baichuan, and Qwen, see Install and Use DeepGPU-LLM.