ChatLLM-WebUI is a managed Docker image for deploying open-source large language models (LLMs) on PAI Elastic Algorithm Service (EAS). Each release ships new image variants, updated library versions, and expanded model and backend support.
Release history
2024-06-21 — v3.0.4
Highlights: Adds Rerank model support and joint Embedding + Rerank + LLM deployment. Introduces FP8 inference and multi-quantization tooling (AWQ, HQQ, Quanto) on the Transformers backend. Extends H-series GPU support.
This release changes the model type identifier for Qwen1.5 from qwen to qwen1.5. Update any deployment configurations that reference this model type before upgrading.
Image versions:
| Image | Tag alias |
|---|---|
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4 | chat-llm-webui:3.0 |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-flash-attn | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm | chat-llm-webui:3.0-vllm |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm-flash-attn | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-blade | chat-llm-webui:3.0-blade |
Built-in library versions:
| Library | Version |
|---|---|
| Torch | 2.3.0 |
| Torchvision | 0.18.0 |
| Transformers | 4.41.2 |
| vLLM | 0.5.0.post1 |
| vllm-flash-attn | 2.5.9 |
| Blade | 0.7.0 |
Model support:
Transformers backend: adds DeepSeek-V2, Yi 1.5, and Qwen2.
vLLM backend: adds Qwen2.
BladeLLM backend: adds Llama 3 and Qwen2.
Changes the model type identifier of Qwen1.5 to
qwen1.5.
New features:
Supports Rerank model deployment.
Supports simultaneous or separate deployment of Embedding, Rerank, and LLM models.
HuggingFace (HF) backend: supports batch inputs.
BladeLLM backend: supports OpenAI Chat.
Transformers backend: supports 8-bit floating point (FP8) model deployment.
Transformers backend: supports quantization via AWQ, HQQ, and Quanto.
vLLM backend: supports FP8.
vLLM and Blade inference parameters: support configuring stop words.
Transformers backend: adapted for H-series GPUs.
Bug fixes:
Fixes BladeLLM
/metricsaccess.
2024-04-30 — v3.0.3
Highlights: Adds embedding and Sentence Transformers model deployment. Expands model support across all three backends, including Llama 3, Phi-3, and Qwen2-MoE. Releases flash attention runtime images for the Transformers and vLLM backends.
Image versions:
| Image | Tag alias |
|---|---|
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3 | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-flash-attn | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm-flash-attn | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-blade | — |
Built-in library versions:
| Library | Version |
|---|---|
| Torch | 2.3.0 |
| Torchvision | 0.18.0 |
| Transformers | 4.40.2 |
| vLLM | 0.4.2 |
| Blade | 0.5.1 |
Model support:
Transformers backend: adds Yi-9B, Qwen2-MoE, Llama 3, QwenCode, Qwen1.5-32G/110B, Phi-3, and Gemma 1.1 2B/7B.
vLLM backend: adds Yi-9B, Qwen2-MoE, SeaLLM, Llama 3, and Phi-3.
Blade backend: adds Qwen1.5 and SeaLLM.
New features:
Supports embedding model deployment.
Supports Sentence Transformers model deployment.
Supports multi-model deployment of LLM and Embedding models together.
vLLM backend: returns token usage in responses.
Releases flash attention runtime images for the Transformers and vLLM backends.
2024-03-28 — v3.0.2
Highlights: Introduces the Blade inference backend with multi-GPU and quantization support. Adds Multi-LoRA inference on the HF backend. Enables /metrics access on vLLM and Blade runtime images.
Image versions:
| Image | Tag alias |
|---|---|
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2 | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-vllm | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-blade | — |
Built-in library versions:
| Library | Version |
|---|---|
| Torch | 2.1.2 |
| Torchvision | 0.16.2 |
| Transformers | 4.38.2 |
| vLLM | 0.3.3 |
| Blade | 0.4.8 |
Model support:
Transformers backend: adds DeepSeek and Gemma.
vLLM backend: adds DeepSeek and Gemma.
Blade backend: adds Qwen1.5 and Yi.
New features:
Adds the Blade inference backend, which supports multi-GPU configurations on a single machine and quantization settings.
Blade: supports quantized model deployment and automatically splits models.
HF backend: supports Multi-LoRA inference.
Transformers backend: performs inference based on tokenizer chat templates.
Transformers backend: supports token statistics in streaming output.
vLLM and Blade runtime images: expose the
/metricsendpoint.
2024-02-22 — v3.0.1
Highlights: Extends vLLM inference parameter configurability and adds Multi-LoRA and quantized model deployment. Removes the LangChain demo dependency from the vLLM runtime image.
Image versions:
| Image | Tag alias |
|---|---|
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1 | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1-vllm | — |
Built-in library versions:
| Library | Version |
|---|---|
| Torch | 2.1.2 |
| Torchvision | 0.16.0 |
| Transformers | 4.37.2 |
| vLLM | 0.3.0 |
Model support:
Transformers backend: adds Qwen1.5 and Qwen2.
vLLM backend: adds Qwen1.5 and Qwen2.
New features:
vLLM: supports Multi-LoRA inference.
vLLM: supports quantized model deployment.
vLLM: all inference parameters are now configurable at inference time.
vLLM runtime image: no longer depends on the LangChain demo.
2024-01-23 — v3.0
Highlights: Splits backend runtime images for independent compilation. Introduces the BladeLLM backend and standard OpenAI API support. Optimizes asynchronous streaming and backend API calls.
Image versions:
| Image | Tag alias |
|---|---|
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0 | — |
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0-vllm | — |
Built-in library versions:
| Library | Version |
|---|---|
| Torch | 2.1.2 |
| Torchvision | 0.16.2 |
| Transformers | 4.37.2 |
| vLLM | 0.2.6 |
Model support:
Adds support for Yi-6B-Chat, Yi-34B-Chat, and SecGPT.
Baichuan and similar models: support performance statistics.
The
openai/v1/chat/completionsendpoint is adapted for the ChatGLM3 history format.
New features:
Backend runtime images are now split for independent compilation and publishing.
Adds the BladeLLM backend.
Supports the standard OpenAI API.
vLLM model support is aligned with HF.
Optimizes asynchronous streaming and backend API calls.
Improves error logs.
2023-12-06 — v2.1
Highlights: Expands HF backend model support to include Mistral, Zephyr, Yi, and multiple quantized Qwen variants. Adds flash attention support and the --history-format parameter.
Image versions:
| Image | Tag alias |
|---|---|
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.1 | chat-llm-webui:2.1 |
Built-in library versions:
| Library | Version |
|---|---|
| Torch | 2.0.1 |
| Torchvision | 0.15.2 |
| Transformers | 4.33.3 |
| vLLM | 0.2.0 |
Model support:
HF backend: adds Mistral, Zephyr, Yi-6B, Yi-34B, Qwen-72B, Qwen-1.8B, and quantized variants (qwen7b-int4, qwen14b-int4, qwen7b-int8, qwen14b-int8, qwen-72b-int4, qwen-72b-int8, qwen-1.8b-int4, qwen-1.8b-int8).
vLLM backend: adds Qwen and ChatGLM 1/2/3.
ChatGLM series: supports performance statistics.
New features:
HF backend: supports flash attention.
Adds the
--history-formatcommand-line parameter for configuring role settings.LangChain demo: supports the Qwen model.
Optimizes the FastAPI streaming access interface.
2023-09-13 — v2.0
Highlights: Initial multi-backend release with vLLM and HF support. Adds HTTP and WebSocket streaming, multi-turn conversations, system prompt configuration, and multi-GPU tensor parallelism (TP).
Image versions:
| Image | Tag alias |
|---|---|
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.0 | chat-llm-webui:2.0 |
Built-in library versions:
| Library | Version |
|---|---|
| Torch | 2.0.1+cu117 |
| Torchvision | 0.15.2+cu117 |
| Transformers | 4.33.3 |
| vLLM | 0.2.0 |
Model support:
Supports Baichuan, Baichuan2, Qwen, Falcon, Llama 2, ChatGLM, ChatGLM2, ChatGLM3, and Yi.
LangChain demo: supports ChatLLM and Llama 2 models.
New features:
Supports multiple backends: vLLM and HF.
Adds HTTP and WebSocket support for conversation streaming.
Non-streaming responses include the number of generated tokens.
All models support multi-turn conversations.
Supports exporting conversation records.
Supports system prompt settings and prompt concatenation for template-free inputs.
Inference parameters are configurable.
Supports debug mode for logs, including inference time in the output.
vLLM backend: uses transactional processing (TP) by default for multi-GPU configurations on a single machine.
Supports model deployment with Float32, Float16, Int8, and Int4 precision.
What's next
EAS provides a scenario-based deployment method for ChatLLM-WebUI that requires only a few parameter configurations. For deployment and calling instructions, see Deploy large language models.