ChatLLM-WebUI release notes-Platform For AI(PAI)-阿里云帮助中心

Docs ICP Filing Console

ChatLLM-WebUI is an LLM inference service image from EAS. This document lists the image addresses, built-in library versions, and updates for each release.

Key releases

Date	Image version	Built-in library version	Updates
June 21, 2024	eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4 Tag: chat-llm-webui:3.0 eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-flash-attn eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm Tag: chat-llm-webui:3.0-vllm eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm-flash-attn eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-blade Tag: chat-llm-webui:3.0-blade	Torch: 2.3.0 Torchvision: 0.18.0 Transformers: 4.41.2 vLLM: 0.5.0.post1 vllm-flash-attn: 2.5.9 Blade: 0.7.0	Added support for Rerank model deployment. Added support for deploying Embedding, Rerank, and LLM models, either individually or in combination. The Transformers backend now supports Deepseek-V2, Yi 1.5, and Qwen2. Updated the model type for Qwen1.5 to qwen1.5. The vLLM backend now supports Qwen2. The BladeLLM backend now supports Llama3 and Qwen2. The HuggingFace backend now supports batch input. The BladeLLM backend now supports OpenAI Chat. Fixed access to BladeLLM metrics. The Transformers backend now supports FP8 model deployment. The Transformers backend now supports multiple quantization tools, including AWQ, HQQ, and Quanto. The vLLM backend now supports FP8. Inference parameters for vLLM and Blade now support stop words. The Transformers backend is now compatible with H-series GPUs.
April 30, 2024	eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3 eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-flash-attn eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm-flash-attn eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-blade	Torch: 2.3.0 Torchvision: 0.18.0 Transformers: 4.40.2 vllm: 0.4.2 Blade: 0.5.1	Added support for Embedding model deployment. The vLLM backend now returns token usage. Added support for Sentence-Transformers model deployment. The Transformers backend now supports yi-9B, qwen2-moe, llama3, qwencode, qwen1.5-32G/110B, phi-3, and gemma-1.1-2/7B. The vLLM backend now supports yi-9B, qwen2-moe, SeaLLM, llama3, and phi-3. The Blade backend now supports qwen1.5 and SeaLLM. Added support for multi-model deployment of LLM and Embedding models. Released a flash-attn image for the Transformers backend. Released a flash-attn image for the vLLM backend.
March 28, 2024	eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2 eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-vllm eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-blade	Torch: 2.1.2 Torchvision: 0.16.2 Transformers: 4.38.2 Vllm: 0.3.3 Blade: 0.4.8	Introduced the Blade inference backend with support for single-machine multi-GPU and quantization configurations. The Transformers backend now performs inference based on the tokenizer chat template. The HF backend now supports Multi-LoRA inference. Blade now supports quantized model deployment. Blade now automatically shards models. The Transformers backend now supports Deepseek and Gemma. The vLLM backend now supports Deepseek and Gemma. The Blade backend now supports qwen1.5 and yi models. Enabled access to the /metrics endpoint for vLLM and Blade images. Streaming responses in the Transformers backend now include token statistics.
February 22, 2024	eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1 eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1-vllm	Torch: 2.1.2 Torchvision: 0.16.0 Transformers: 4.37.2 vLLM: 0.3.0	Added extended parameter configurations for vLLM, allowing all vLLM inference parameters to be modified at runtime. vLLM now supports Multi-LoRA. vLLM now supports quantized model deployment. Removed the LangChain demo dependency from the vLLM image. The Transformers inference backend now supports qwen1.5 and qwen2 models. The vLLM inference backend now supports qwen1.5 and qwen2 models.
January 23, 2024	eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0 eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0-vllm	Torch: 2.1.2 Torchvision: 0.16.2 Transformers: 4.37.2 vLLM: 0.2.6	Decoupled backend images to enable independent compilation and release. Introduced the new BladeLLM backend. Added support for the standard OpenAI API. Added support for performance metrics in Baichuan and other models. Added support for models such as yi-6b-chat, yi-34b-chat, and secgpt. Adapted the openai/v1/chat/completions endpoint for the chatglm3 history format. Optimized asynchronous streaming. Synchronized the list of models supported by vLLM with HuggingFace. Optimized backend call interfaces. Improved error logs.
December 6, 2023	eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.1 Tag: chat-llm-webui:2.1	Torch: 2.0.1 Torchvision: 0.15.2 Transformers: 4.33.3 vLLM: 0.2.0	The HuggingFace backend now supports mistral, zephyr, yi-6b, yi-34b, qwen-72b, qwen-1.8b, qwen7b-int4, qwen14b-int4, qwen7b-int8, qwen14b-int8, qwen-72b-int4, qwen-72b-int8, qwen-1.8b-int4, and qwen-1.8b-int8 models. The vLLM backend now supports Qwen and ChatGLM1/2/3 models. The HuggingFace inference backend now supports flash attention. Added support for performance metrics in the ChatGLM series models. Added the --history-format command-line argument to support role settings. The LangChain demo now supports the Qwen model. Optimized the FastAPI streaming interface.
September 13, 2023	eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.0 Tag: chat-llm-webui:2.0	Torch: 2.0.1+cu117 Torchvision: 0.15.2+cu117 Transformers: 4.33.3 vLLM: 0.2.0	Added support for multiple backends: vLLM and HuggingFace. Added a LangChain demo for ChatLLM and Llama2 models. Added support for Baichuan, Baichuan2, Qwen, Falcon, Llama2, ChatGLM, ChatGLM2, ChatGLM3, and yi models. Added HTTP and WebSocket support for conversational streaming. Non-streaming responses now include generated token counts. Enabled multi-turn conversations for all models. Added support for exporting conversation history. Added support for System Prompt settings and prompt concatenation without a template. Made inference parameters configurable. Added a debug mode for logs to output inference times. The vLLM backend now defaults to the Tensor Parallelism (TP) scheme for single-machine multi-GPU deployments. Added support for deploying models at various precisions, including Float32, Float16, Int8, and Int4.

Related documentation

EAS provides a streamlined deployment method for ChatLLM that lets you deploy open-source LLM services with minimal configuration. For deployment and invocation instructions, see Deploy large language models (LLMs).

Previous：Deploy MoE models with expert parallelismNext：LLM intelligent router deployment

该文章对您有帮助吗？