通义千问VL
通义千问视觉理解大模型Qwen-VL于2023年12月1日发布重大更新,不仅大幅提升通用OCR、视觉推理、中文文本理解基础能力,还能处理各种分辨率和规格的图像,甚至能“看图做题”。
升级的Qwen-VL(qwen-vl-plus/qwen-vl-max/qwen-vl-max-0809/qwen-vl-plus-0809/qwen-vl-max-0201)模型现有几大特点:
大幅增强了图片中文字处理能力,能够提取、整理、总结文字,成为生产力帮手。
增加可处理分辨率范围,各分辨率和长宽比的图都能处理,大图和长图能看清。
增强视觉推理和决策能力,适于搭建视觉Agent,让大模型Agent的想象力进一步扩展。
升级看图做题能力,拍一拍习题图发给Qwen-VL,大模型能帮用户一步步解题。
qwen-vl-max、qwen-vl-max-0809、qwen-vl-plus-0809模型支持处理视频内容。
前提条件
请您参考获取API-KEY,开通百炼服务并获得API-KEY。
您可以使用OpenAI Python SDK、DashScope SDK或HTTP接口调用通义千问VL模型,请您根据您的需求,参考以下方式准备您的计算环境。
如果您之前使用OpenAI SDK以及HTTP方式调用OpenAI的服务,只需在原有框架下调整API-KEY、base_url、model等参数,就可以直接调用通义千问VL模型。
调用方式
准备条件
通过OpenAI Python SDK调用
您可以通过以下命令安装或更新OpenAI SDK:
# 如果下述命令报错,请将pip替换为pip3 pip install -U openai
您需要配置的base_url如下:
https://dashscope.aliyuncs.com/compatible-mode/v1
通过OpenAI兼容-HTTP调用
如果您需要通过OpenAI兼容的HTTP方式进行调用,需要配置的完整访问endpoint如下:
POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
通过DashScope SDK调用
DashScope SDK提供了Python和Java两个版本,请参考安装SDK,安装最新版SDK。
通过DashScope HTTP调用
如果您需要通过DashScope的HTTP方式进行调用,需要配置的完整访问endpoint如下:
POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
我们推荐您将API-KEY配置到环境变量中以降低API-KEY的泄露风险,详情可参考配置API-KEY到环境变量。您也可以在代码中配置API-KEY,但是会存在泄露风险。
示例代码
图像理解
您可以参考以下示例代码,通过OpenAI或者DashScope的方式,调用通义千问VL模型。
您可以输入单张或多张图片。
OpenAI兼容
您可以通过OpenAI SDK或OpenAI兼容的HTTP方式调用通义千问VL模型。
Python
示例代码
from openai import OpenAI
import os
def get_response():
client = OpenAI(
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-max",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
}
},
{
"type": "image_url",
"image_url": {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
}
},
{
"type": "text",
"text": "这些是什么"
}
]
}
]
)
print(completion.model_dump_json())
if __name__=='__main__':
get_response()
返回结果
{
"id": "chatcmpl-4b5a3bb9-221f-9687-bdd7-a7d56aae44df",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "图1中是一位女士和一只拉布拉多犬在海滩上互动的场景。女士穿着格子衬衫,坐在沙滩上,与狗进行握手的动作,背景是海浪和天空,整个画面充满了温馨和愉快的氛围。\n\n图2中是一只老虎在森林中行走的场景。老虎的毛色是橙色和黑色相间的条纹,它正向前迈步,周围是茂密的树木和植被,地面上覆盖着落叶,整个画面给人一种野生自然的感觉。",
"role": "assistant",
"function_call": null,
"tool_calls": null
}
}
],
"created": 1725948492,
"model": "qwen-vl-max",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": null,
"usage": {
"completion_tokens": 106,
"prompt_tokens": 2497,
"total_tokens": 2603
}
}
curl
示例代码
curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen-vl-max",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
}
},
{
"type": "image_url",
"image_url": {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
}
},
{
"type": "text",
"text": "这些是什么"
}
]
}
]
}'
返回结果
{
"choices": [
{
"message": {
"content": "图1中是一位女士和一只拉布拉多犬在海滩上互动的场景。女士穿着格子衬衫,坐在沙滩上,与狗进行握手的动作,背景是海景和日落的天空,整个画面显得非常温馨和谐。\n\n图2中是一只老虎在森林中行走的场景。老虎的毛色是橙色和黑色条纹相间,它正向前迈步,周围是茂密的树木和植被,地面上覆盖着落叶,整个画面充满了自然的野性和生机。",
"role": "assistant"
},
"finish_reason": "stop",
"index": 0,
"logprobs": null
}
],
"object": "chat.completion",
"usage": {
"prompt_tokens": 2497,
"completion_tokens": 109,
"total_tokens": 2606
},
"created": 1725948561,
"system_fingerprint": null,
"model": "qwen-vl-max",
"id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}
DashScope
您可以通过DashScope SDK或HTTP方式调用通义千问VL模型。
Python
示例代码
from http import HTTPStatus
import dashscope
def simple_multimodal_conversation_call():
messages = [
{
"role": "user",
"content": [
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"},
{"text": "这些是什么?"}
]
}
]
response = dashscope.MultiModalConversation.call(
model='qwen-vl-plus',
messages=messages
)
if response.status_code == HTTPStatus.OK:
print(response)
else:
print(response.code)
print(response.message)
if __name__ == '__main__':
simple_multimodal_conversation_call()
返回结果
{
"status_code": 200,
"request_id": "3a031529-707f-9b7d-968c-172e7533debc",
"code": "",
"message": "",
"output": {
"text": null,
"finish_reason": null,
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "图1中是一名女子和狗在沙滩上玩耍。\n图2是孟加拉虎的插画,它正向镜头走来。\n图3里是一只可爱的小白兔。"
}
]
}
}
]
},
"usage": {
"input_tokens": 3743,
"output_tokens": 41,
"image_tokens": 3697
}
}
Java
示例代码
// Copyright (c) Alibaba, Inc. and its affiliates.
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"),
Collections.singletonMap("text", "这些是什么?"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
.model("qwen-vl-plus")
.message(userMessage)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(JsonUtils.toJson(result));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
返回结果
{
"requestId": "dcb38a0f-fd69-9071-bcde-c4530f9a7559",
"usage": {
"input_tokens": 3740,
"output_tokens": 48
},
"output": {
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "图1中是一名女子和一只大金毛在沙滩上玩耍。\n图2是孟加拉虎的写实照片,老虎正向镜头走来。\n图3是一幅插画,主要展示了一只兔子。"
}
]
}
}
]
}
}
curl
示例代码
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen-vl-plus",
"input":{
"messages":[
{
"role": "user",
"content": [
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"},
{"text": "这些是什么?"}
]
}
]
}
}'
返回结果
{
"output": {
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "这张图片显示了一位女士和她的狗在海滩上。她们似乎正在享受彼此的陪伴,狗狗坐在沙滩上伸出爪子与女士握手或互动。背景是美丽的日落景色,海浪轻轻拍打着海岸线。\n\n请注意,我提供的描述基于图像中可见的内容,并不包括任何超出视觉信息之外的信息。如果您需要更多关于这个场景的具体细节,请告诉我!"
}
]
}
}
]
},
"usage": {
"output_tokens": 81,
"input_tokens": 1277,
"image_tokens": 1247
},
"request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}
视频理解
qwen-vl-max
qwen-vl-max-0809
、qwen-vl-plus-0809
模型支持对视频内容的理解功能。您可以直接传入视频文件,或以图片列表形式传入。
from http import HTTPStatus
import dashscope
def simple_multimodal_conversation_call():
"""Simple single round multimodal conversation call.
"""
messages = [
{
"role": "user",
"content": [
# 以视频文件传入
{"video": "https://cloud.video.taobao.com/vod/S8T54f_w1rkdfLdYjL3S5zKN9CrhkzuhRwOhF313tIQ.mp4"},
# 或以图片列表形式传入
# {"video":[
# "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg",
# "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
# ]},
{"text": "视频的内容是什么?"}
]
}
]
response = dashscope.MultiModalConversation.call(
model='qwen-vl-max',
messages=messages
)
if response.status_code == HTTPStatus.OK:
print(response)
else:
print(response.code) # The error code.
print(response.message) # The error message.
if __name__ == '__main__':
simple_multimodal_conversation_call()
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen-vl-max",
"input":{
"messages":[
{
"role": "user",
"content": [
{"video": ["https://cloud.video.taobao.com/vod/S8T54f_w1rkdfLdYjL3S5zKN9CrhkzuhRwOhF313tIQ.mp4"]},
{"text": "这是什么?"}
]
}
]
}
}'
返回结果:
{
"status_code": 200,
"request_id": "a6772f55-5509-9c2c-bcca-3b9132ed6f63",
"code": "",
"message": "",
"output": {
"text": null,
"finish_reason": null,
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "视频的内容是一个人使用阿里云的通义千问模型进行对话的演示。在视频中,用户向模型输入了“你好”作为问候语,模型回应了“你好!有什么我能为你效劳的吗?”这个演示展示了通义千问模型的对话功能,以及它如何与用户进行交互。"
}
]
}
}
]
},
"usage": {
"input_tokens": 5205,
"output_tokens": 69,
"video_tokens": 5180
}
}
了解更多
有关通义千问VL API的详细调用文档可前往API详情页面进行了解。