Quick start

Tongyi Qianwen VL (Qwen-VL)

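Qwen-VL, the Tongyi Qianwen vision-language model, received a major update on December 1, 2023. The update substantially improves the model's core capabilities in general OCR, visual reasoning, and Chinese text understanding. The model can also handle images of varying resolutions and sizes, and can even solve problems presented in an image.

The upgraded Qwen-VL models (qwen-vl-plus, qwen-vl-max, qwen-vl-max-0809, qwen-vl-plus-0809, and qwen-vl-max-0201) have the following key features: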

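  • Greatly improved handling of text in images: the model can extract, organize, and summarize text, which makes it a useful productivity tool.

  • A wider range of supported resolutions: images of any resolution and aspect ratio can be processed, and large or long images remain legible.

  • Stronger visual reasoning and decision-making, which makes the model well suited for building visual agents and broadens what LLM-based agents can do.

  • Improved problem solving from images: send Qwen-VL a photo of an exercise, and the model walks the user through the solution step by step.

  • The qwen-vl-max, qwen-vl-max-0809, and qwen-vl-plus-0809 models support video input.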

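Prerequisites

  • Activate the Bailian service and obtain an API key as described in "Obtain an API key".

  • You can call the Qwen-VL models through the OpenAI Python SDK, the DashScope SDK, or an HTTP interface. Prepare your environment for the method you choose as described below.

    If you already call OpenAI services through the OpenAI SDK or over HTTP, you only need to change the API key, base_url, and model parameters in your existing code to call the Qwen-VL models.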

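    The preparation required for each call method is listed below.

    Call through the OpenAI Python SDK

    Install or update the OpenAI SDK with the following command: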

    # If the command below fails, replace pip with pip3
    pip install -U openai

    Use the following base_url:

    https://dashscope.aliyuncs.com/compatible-mode/v1

    Call through OpenAI-compatible HTTP

    To call the model through the OpenAI-compatible HTTP interface, use the following full endpoint:

    POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions

    Call through the DashScope SDK

    The DashScope SDK is available for Python and Java. See "Install the SDK" to install the latest version.

    Call through DashScope HTTP

    To call the model through the DashScope HTTP interface, use the following full endpoint:

    POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
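Note

We recommend storing your API key in an environment variable to reduce the risk of leaking it. For details, see "Configure the API key as an environment variable". You can also set the API key directly in code, but doing so increases the risk of leaks.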
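As a minimal sketch of the recommendation above, the helper below reads the key from the `DASHSCOPE_API_KEY` environment variable (the name used by the sample code in this document) and fails early if it is missing. The function names `load_api_key` and `auth_headers` are illustrative and are not part of any SDK.

```python
import os


def load_api_key(env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


def auth_headers(api_key: str) -> dict:
    """Build the same two headers that the curl samples in this document send."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Keeping the lookup in one place means the key never has to appear in source code or version control.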

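Sample code

Image understanding

The sample code below shows how to call the Qwen-VL models through either the OpenAI-compatible interface or DashScope. A request can contain a single image or multiple images.

OpenAI compatible

You can call the Qwen-VL models through the OpenAI SDK or the OpenAI-compatible HTTP interface.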

Python

Sample code

from openai import OpenAI
import os


def get_response():
    client = OpenAI(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-max",
        messages=[
            {
              "role": "user",
              "content": [
                {
                  "type": "image_url",
                  "image_url": {
                    "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
                  }
                },
                {
                  "type": "image_url",
                  "image_url": {
                    "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
                  }
                },
                {
                  "type": "text",
                  "text": "这些是什么"
                }
              ]
            }
          ]
        )
    print(completion.model_dump_json())

if __name__=='__main__':
    get_response()

Response

{
  "id": "chatcmpl-4b5a3bb9-221f-9687-bdd7-a7d56aae44df",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "图1中是一位女士和一只拉布拉多犬在海滩上互动的场景。女士穿着格子衬衫,坐在沙滩上,与狗进行握手的动作,背景是海浪和天空,整个画面充满了温馨和愉快的氛围。\n\n图2中是一只老虎在森林中行走的场景。老虎的毛色是橙色和黑色相间的条纹,它正向前迈步,周围是茂密的树木和植被,地面上覆盖着落叶,整个画面给人一种野生自然的感觉。",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1725948492,
  "model": "qwen-vl-max",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 106,
    "prompt_tokens": 2497,
    "total_tokens": 2603
  }
}
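To work with a response like the one above, a small helper can pull out the answer text and the token counts. This is only a sketch that assumes the JSON has the shape shown above; `summarize_completion` is a hypothetical name, not an SDK function.

```python
import json


def summarize_completion(raw_json: str) -> dict:
    """Extract the answer text and token counts from a chat.completion JSON string."""
    data = json.loads(raw_json)
    choice = data["choices"][0]          # the samples in this document return one choice
    usage = data.get("usage") or {}      # tolerate a missing or null usage object
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }
```

With the Python sample above, this could be called as `summarize_completion(completion.model_dump_json())`.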

curl

Sample code

curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen-vl-max",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
          }
        },
        {
          "type": "text",
          "text": "这些是什么"
        }
      ]
    }
  ]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "图1中是一位女士和一只拉布拉多犬在海滩上互动的场景。女士穿着格子衬衫,坐在沙滩上,与狗进行握手的动作,背景是海景和日落的天空,整个画面显得非常温馨和谐。\n\n图2中是一只老虎在森林中行走的场景。老虎的毛色是橙色和黑色条纹相间,它正向前迈步,周围是茂密的树木和植被,地面上覆盖着落叶,整个画面充满了自然的野性和生机。",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 2497,
    "completion_tokens": 109,
    "total_tokens": 2606
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen-vl-max",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}
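The curl request above can also be assembled with the Python standard library alone. The sketch below builds the same POST request for any number of image URLs without sending it; `build_user_message` and `build_request` are illustrative names, and the caller is assumed to supply a valid API key.

```python
import json
import urllib.request

CHAT_COMPLETIONS_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"


def build_user_message(image_urls, prompt):
    """One user message: every image first, then the text prompt, as in the samples."""
    content = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


def build_request(api_key, model, image_urls, prompt):
    """Return an unsent urllib request that mirrors the curl sample."""
    body = {"model": model, "messages": [build_user_message(image_urls, prompt)]}
    return urllib.request.Request(
        CHAT_COMPLETIONS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; the response body is JSON with the same fields as shown above.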

DashScope

You can call the Qwen-VL models through the DashScope SDK or the DashScope HTTP interface.

Python

Sample code

from http import HTTPStatus
import dashscope


def simple_multimodal_conversation_call():
    messages = [
        {
            "role": "user",
            "content": [
                {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
                {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"},
                {"text": "这些是什么?"}
            ]
        }
    ]
    response = dashscope.MultiModalConversation.call(
        model='qwen-vl-plus',
        messages=messages
        )
    if response.status_code == HTTPStatus.OK:
        print(response)
    else:
        print(response.code)
        print(response.message)


if __name__ == '__main__':
    simple_multimodal_conversation_call()

Response

{
    "status_code": 200,
    "request_id": "3a031529-707f-9b7d-968c-172e7533debc",
    "code": "",
    "message": "",
    "output": {
        "text": null,
        "finish_reason": null,
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "图1中是一名女子和狗在沙滩上玩耍。\n图2是孟加拉虎的插画,它正向镜头走来。\n图3里是一只可爱的小白兔。"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "input_tokens": 3743,
        "output_tokens": 41,
        "image_tokens": 3697
    }
}
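In the DashScope response format, the message content is a list of parts rather than a single string. The sketch below reads the answer and the usage counters from a response with the shape shown above, loaded as a plain dict (for example with `json.loads`). The helper names are hypothetical, and `image_tokens` is treated as optional because not every sample response in this document includes it.

```python
def extract_answer(response: dict) -> str:
    """Join the text parts of the first choice in a DashScope multimodal response."""
    parts = response["output"]["choices"][0]["message"]["content"]
    return "".join(part["text"] for part in parts if "text" in part)


def usage_summary(response: dict) -> dict:
    """Return the token counters; image_tokens is None when the field is absent."""
    usage = response.get("usage", {})
    return {
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "image_tokens": usage.get("image_tokens"),
    }
```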

Java

Sample code

// Copyright (c) Alibaba, Inc. and its affiliates.

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"),
                        Collections.singletonMap("text", "这些是什么?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen-vl-plus")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(JsonUtils.toJson(result));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

{
  "requestId": "dcb38a0f-fd69-9071-bcde-c4530f9a7559",
  "usage": {
    "input_tokens": 3740,
    "output_tokens": 48
  },
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "图1中是一名女子和一只大金毛在沙滩上玩耍。\n图2是孟加拉虎的写实照片,老虎正向镜头走来。\n图3是一幅插画,主要展示了一只兔子。"
            }
          ]
        }
      }
    ]
  }
}

curl

Sample code

curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen-vl-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"},
                    {"text": "这些是什么?"}
                ]
            }
        ]
    }
}'

Response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "这张图片显示了一位女士和她的狗在海滩上。她们似乎正在享受彼此的陪伴,狗狗坐在沙滩上伸出爪子与女士握手或互动。背景是美丽的日落景色,海浪轻轻拍打着海岸线。\n\n请注意,我提供的描述基于图像中可见的内容,并不包括任何超出视觉信息之外的信息。如果您需要更多关于这个场景的具体细节,请告诉我!"
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 81,
    "input_tokens": 1277,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}
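The two interfaces in this section use different request shapes: the OpenAI-compatible samples use typed parts (`image_url`, `text`) and a top-level `messages` field, while the DashScope samples use `{"image": ...}` and `{"text": ...}` parts nested under `input`. The following sketch converts one form to the other, based only on the fields that appear in the samples above. Both function names are illustrative.

```python
def to_dashscope_content(openai_content: list) -> list:
    """Convert OpenAI-compatible content parts to DashScope-style parts."""
    converted = []
    for part in openai_content:
        if part["type"] == "image_url":
            converted.append({"image": part["image_url"]["url"]})
        elif part["type"] == "text":
            converted.append({"text": part["text"]})
        else:
            # Only the two part types used in this document are handled.
            raise ValueError(f"Unsupported content type: {part['type']}")
    return converted


def build_dashscope_body(model: str, messages: list) -> dict:
    """Wrap messages in the DashScope HTTP request body, which nests them under 'input'."""
    return {"model": model, "input": {"messages": messages}}
```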

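Video understanding

The qwen-vl-max, qwen-vl-max-0809, and qwen-vl-plus-0809 models support video understanding. You can pass in a video file directly, or pass the video as a list of images.

Python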

from http import HTTPStatus
import dashscope


def simple_multimodal_conversation_call():
    """Simple single round multimodal conversation call.
    """
    messages = [
        {
            "role": "user",
            "content": [
                # Pass the video as a video file
                {"video": "https://cloud.video.taobao.com/vod/S8T54f_w1rkdfLdYjL3S5zKN9CrhkzuhRwOhF313tIQ.mp4"},
                # Or pass the video as a list of images
                # {"video":[
                #     "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg",
                #     "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
                #     ]},
                {"text": "视频的内容是什么?"}
            ]
        }
    ]
    response = dashscope.MultiModalConversation.call(
        model='qwen-vl-max',
        messages=messages
        )
    if response.status_code == HTTPStatus.OK:
        print(response)
    else:
        print(response.code)  # The error code.
        print(response.message)  # The error message.


if __name__ == '__main__':
    simple_multimodal_conversation_call()

curl

curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen-vl-max",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"video": ["https://cloud.video.taobao.com/vod/S8T54f_w1rkdfLdYjL3S5zKN9CrhkzuhRwOhF313tIQ.mp4"]},
                    {"text": "这是什么?"}
                ]
            }
        ]
    }
}'

Response:

{
  "status_code": 200,
  "request_id": "a6772f55-5509-9c2c-bcca-3b9132ed6f63",
  "code": "",
  "message": "",
  "output": {
    "text": null,
    "finish_reason": null,
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "视频的内容是一个人使用阿里云的通义千问模型进行对话的演示。在视频中,用户向模型输入了“你好”作为问候语,模型回应了“你好!有什么我能为你效劳的吗?”这个演示展示了通义千问模型的对话功能,以及它如何与用户进行交互。"
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "input_tokens": 5205,
    "output_tokens": 69,
    "video_tokens": 5180
  }
}
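The two input forms used in the Python sample above (a single video URL, or a list of image URLs) can be wrapped in one helper. This is a sketch under the assumption that the `video` field accepts exactly those two forms as shown in that sample; `build_video_message` is a hypothetical name.

```python
def build_video_message(video, prompt: str) -> dict:
    """Build a user message for video understanding.

    video: either one URL string (a video file) or a list of image URLs.
    """
    if isinstance(video, str):
        source = video
    else:
        source = list(video)
        if not source:
            raise ValueError("The image list must not be empty")
    return {"role": "user", "content": [{"video": source}, {"text": prompt}]}
```

The returned dict can be placed in the `messages` list that the sample passes to `dashscope.MultiModalConversation.call`.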

Learn more

For the full Qwen-VL API reference, see the API details page.