ChatLLM-WebUI release notes

更新时间:
复制 MD 格式

ChatLLM-WebUI is a managed Docker image for deploying open-source large language models (LLMs) on PAI Elastic Algorithm Service (EAS). Each release ships new image variants, updated library versions, and expanded model and backend support.

Release history

2024-06-21 — v3.0.4

Highlights: Adds Rerank model support and joint Embedding + Rerank + LLM deployment. Introduces FP8 inference and multi-quantization tooling (AWQ, HQQ, Quanto) on the Transformers backend. Extends H-series GPU support.

Important

This release changes the model type identifier for Qwen1.5 from qwen to qwen1.5. Update any deployment configurations that reference this model type before upgrading.

Image versions:

ImageTag alias
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4chat-llm-webui:3.0
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-flash-attn
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllmchat-llm-webui:3.0-vllm
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm-flash-attn
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-bladechat-llm-webui:3.0-blade

Built-in library versions:

LibraryVersion
Torch2.3.0
Torchvision0.18.0
Transformers4.41.2
vLLM0.5.0.post1
vllm-flash-attn2.5.9
Blade0.7.0

Model support:

  • Transformers backend: adds DeepSeek-V2, Yi 1.5, and Qwen2.

  • vLLM backend: adds Qwen2.

  • BladeLLM backend: adds Llama 3 and Qwen2.

  • Changes the model type identifier of Qwen1.5 to qwen1.5.

New features:

  • Supports Rerank model deployment.

  • Supports simultaneous or separate deployment of Embedding, Rerank, and LLM models.

  • HuggingFace (HF) backend: supports batch inputs.

  • BladeLLM backend: supports OpenAI Chat.

  • Transformers backend: supports 8-bit floating point (FP8) model deployment.

  • Transformers backend: supports quantization via AWQ, HQQ, and Quanto.

  • vLLM backend: supports FP8.

  • vLLM and Blade inference parameters: support configuring stop words.

  • Transformers backend: adapted for H-series GPUs.

Bug fixes:

  • Fixes BladeLLM /metrics access.

2024-04-30 — v3.0.3

Highlights: Adds embedding and Sentence Transformers model deployment. Expands model support across all three backends, including Llama 3, Phi-3, and Qwen2-MoE. Releases flash attention runtime images for the Transformers and vLLM backends.

Image versions:

ImageTag alias
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-flash-attn
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm-flash-attn
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-blade

Built-in library versions:

LibraryVersion
Torch2.3.0
Torchvision0.18.0
Transformers4.40.2
vLLM0.4.2
Blade0.5.1

Model support:

  • Transformers backend: adds Yi-9B, Qwen2-MoE, Llama 3, QwenCode, Qwen1.5-32G/110B, Phi-3, and Gemma 1.1 2B/7B.

  • vLLM backend: adds Yi-9B, Qwen2-MoE, SeaLLM, Llama 3, and Phi-3.

  • Blade backend: adds Qwen1.5 and SeaLLM.

New features:

  • Supports embedding model deployment.

  • Supports Sentence Transformers model deployment.

  • Supports multi-model deployment of LLM and Embedding models together.

  • vLLM backend: returns token usage in responses.

  • Releases flash attention runtime images for the Transformers and vLLM backends.

2024-03-28 — v3.0.2

Highlights: Introduces the Blade inference backend with multi-GPU and quantization support. Adds Multi-LoRA inference on the HF backend. Enables /metrics access on vLLM and Blade runtime images.

Image versions:

ImageTag alias
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-vllm
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-blade

Built-in library versions:

LibraryVersion
Torch2.1.2
Torchvision0.16.2
Transformers4.38.2
vLLM0.3.3
Blade0.4.8

Model support:

  • Transformers backend: adds DeepSeek and Gemma.

  • vLLM backend: adds DeepSeek and Gemma.

  • Blade backend: adds Qwen1.5 and Yi.

New features:

  • Adds the Blade inference backend, which supports multi-GPU configurations on a single machine and quantization settings.

  • Blade: supports quantized model deployment and automatically splits models.

  • HF backend: supports Multi-LoRA inference.

  • Transformers backend: performs inference based on tokenizer chat templates.

  • Transformers backend: supports token statistics in streaming output.

  • vLLM and Blade runtime images: expose the /metrics endpoint.

2024-02-22 — v3.0.1

Highlights: Extends vLLM inference parameter configurability and adds Multi-LoRA and quantized model deployment. Removes the LangChain demo dependency from the vLLM runtime image.

Image versions:

ImageTag alias
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1-vllm

Built-in library versions:

LibraryVersion
Torch2.1.2
Torchvision0.16.0
Transformers4.37.2
vLLM0.3.0

Model support:

  • Transformers backend: adds Qwen1.5 and Qwen2.

  • vLLM backend: adds Qwen1.5 and Qwen2.

New features:

  • vLLM: supports Multi-LoRA inference.

  • vLLM: supports quantized model deployment.

  • vLLM: all inference parameters are now configurable at inference time.

  • vLLM runtime image: no longer depends on the LangChain demo.

2024-01-23 — v3.0

Highlights: Splits backend runtime images for independent compilation. Introduces the BladeLLM backend and standard OpenAI API support. Optimizes asynchronous streaming and backend API calls.

Image versions:

ImageTag alias
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0-vllm

Built-in library versions:

LibraryVersion
Torch2.1.2
Torchvision0.16.2
Transformers4.37.2
vLLM0.2.6

Model support:

  • Adds support for Yi-6B-Chat, Yi-34B-Chat, and SecGPT.

  • Baichuan and similar models: support performance statistics.

  • The openai/v1/chat/completions endpoint is adapted for the ChatGLM3 history format.

New features:

  • Backend runtime images are now split for independent compilation and publishing.

  • Adds the BladeLLM backend.

  • Supports the standard OpenAI API.

  • vLLM model support is aligned with HF.

  • Optimizes asynchronous streaming and backend API calls.

  • Improves error logs.

2023-12-06 — v2.1

Highlights: Expands HF backend model support to include Mistral, Zephyr, Yi, and multiple quantized Qwen variants. Adds flash attention support and the --history-format parameter.

Image versions:

ImageTag alias
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.1chat-llm-webui:2.1

Built-in library versions:

LibraryVersion
Torch2.0.1
Torchvision0.15.2
Transformers4.33.3
vLLM0.2.0

Model support:

  • HF backend: adds Mistral, Zephyr, Yi-6B, Yi-34B, Qwen-72B, Qwen-1.8B, and quantized variants (qwen7b-int4, qwen14b-int4, qwen7b-int8, qwen14b-int8, qwen-72b-int4, qwen-72b-int8, qwen-1.8b-int4, qwen-1.8b-int8).

  • vLLM backend: adds Qwen and ChatGLM 1/2/3.

  • ChatGLM series: supports performance statistics.

New features:

  • HF backend: supports flash attention.

  • Adds the --history-format command-line parameter for configuring role settings.

  • LangChain demo: supports the Qwen model.

  • Optimizes the FastAPI streaming access interface.

2023-09-13 — v2.0

Highlights: Initial multi-backend release with vLLM and HF support. Adds HTTP and WebSocket streaming, multi-turn conversations, system prompt configuration, and multi-GPU tensor parallelism (TP).

Image versions:

ImageTag alias
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.0chat-llm-webui:2.0

Built-in library versions:

LibraryVersion
Torch2.0.1+cu117
Torchvision0.15.2+cu117
Transformers4.33.3
vLLM0.2.0

Model support:

  • Supports Baichuan, Baichuan2, Qwen, Falcon, Llama 2, ChatGLM, ChatGLM2, ChatGLM3, and Yi.

  • LangChain demo: supports ChatLLM and Llama 2 models.

New features:

  • Supports multiple backends: vLLM and HF.

  • Adds HTTP and WebSocket support for conversation streaming.

  • Non-streaming responses include the number of generated tokens.

  • All models support multi-turn conversations.

  • Supports exporting conversation records.

  • Supports system prompt settings and prompt concatenation for template-free inputs.

  • Inference parameters are configurable.

  • Supports debug mode for logs, including inference time in the output.

  • vLLM backend: uses transactional processing (TP) by default for multi-GPU configurations on a single machine.

  • Supports model deployment with Float32, Float16, Int8, and Int4 precision.

What's next

EAS provides a scenario-based deployment method for ChatLLM-WebUI that requires only a few parameter configurations. For deployment and calling instructions, see Deploy large language models.