Qwen-VL (通义千问VL)

Updated: 2025-04-11 06:36:08

The Qwen-VL models can answer questions about the images you pass in.

You can try the image understanding capability online by visiting the vision model page.

Application scenarios

  • Image Q&A: describe the content of an image, or classify and tag it, such as identifying people, places, flora, and fauna.

  • Math problem solving: solve math problems contained in an image, covering primary school, secondary school, university, and adult education levels.

  • Video understanding: analyze video content, such as locating a specific event and returning its timestamp, or summarizing key time segments.

  • Object localization: locate objects in an image and return either the top-left and bottom-right coordinates of the bounding box or the coordinates of its center point.

  • Document parsing: parse image-based documents (such as scans or image-only PDFs) into the QwenVL HTML format, which not only recognizes text accurately but also preserves the positions of images, tables, and other elements.

  • Text recognition and information extraction: recognize text and formulas in images, or extract information from receipts, ID documents, and forms, with support for formatted text output. Supported languages: Chinese, English, Japanese, Korean, Arabic, Vietnamese, French, German, Italian, Spanish, and Russian.

Supported models

Commercial models

Context window, max input, and max output are measured in tokens; prices are per 1,000 tokens. The free quota for the commercial models is 1,000,000 tokens, valid for 180 days after activating Model Studio (百炼).

| Model name | Description | Version | Context window | Max input | Max output | Input price | Output price |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-max | Further improves visual reasoning and instruction following over qwen-vl-plus, delivering the best performance on an even wider range of complex tasks. Currently identical to qwen-vl-max-2024-11-19. | Stable | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.003 (batch calls: 0.0015) | 0.009 (batch calls: 0.0045) |
| qwen-vl-max-latest | Always identical to the latest snapshot. | Latest | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.003 | 0.009 |
| qwen-vl-max-2025-04-02 | Also called qwen-vl-max-0402. | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.003 | 0.009 |
| qwen-vl-max-2025-01-25 | Also called qwen-vl-max-0125. Part of the Qwen2.5-VL series; extends the context to 128K and significantly improves image and video understanding. | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.003 | 0.009 |
| qwen-vl-max-2024-12-30 | Also called qwen-vl-max-1230. | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.003 | 0.009 |
| qwen-vl-max-2024-11-19 | Also called qwen-vl-max-1119. | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.003 | 0.009 |
| qwen-vl-max-2024-10-30 | Also called qwen-vl-max-1030. | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.02 | 0.02 |
| qwen-vl-max-2024-08-09 | Also called qwen-vl-max-0809. Extends the context to 32K, improves image understanding, and is better at recognizing multilingual and handwritten text in images. | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.02 | 0.02 |
| qwen-vl-max-2024-02-01 | Also called qwen-vl-max-0201. | Snapshot | 8,000 | 6,000 (up to 1,280 per image) | 2,000 | 0.02 | 0.02 |
| qwen-vl-plus | Greatly improves detail recognition and text recognition, and supports images above one million pixels at any aspect ratio. Delivers excellent performance across a broad range of visual tasks. | Stable | 8,000 | 6,000 (up to 1,280 per image) | 2,000 | 0.0015 (batch calls: 0.00075) | 0.0045 (batch calls: 0.00225) |
| qwen-vl-plus-latest | Always identical to the latest snapshot. | Latest | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0015 | 0.0045 |
| qwen-vl-plus-2025-01-25 | Also called qwen-vl-plus-0125. Part of the Qwen2.5-VL series; extends the context to 128K and significantly improves image and video understanding. | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0015 | 0.0045 |
| qwen-vl-plus-2025-01-02 | Also called qwen-vl-plus-0102. | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.0015 | 0.0045 |
| qwen-vl-plus-2024-08-09 | Also called qwen-vl-plus-0809. | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.0015 | 0.0045 |
| qwen-vl-plus-2023-12-01 |  | Snapshot | 8,000 | 6,000 | 2,000 | 0.008 | 0.008 |

Open-source models

The qvq-72b-preview model is an experimental research model developed by the Qwen team, focused on improving visual reasoning, especially mathematical reasoning. For details, see the official QVQ blog. If you want the model to output its thinking process before the final answer, use the Qwen QVQ models.

| Model name | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input price (per 1K tokens) | Output price (per 1K tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| qvq-72b-preview | 32,768 | 16,384 (up to 16,384 per image) | 16,384 | 0.012 | 0.036 | 100,000 tokens, valid for 180 days after activating Model Studio |

| Model name | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input price (per 1K tokens) | Output price (per 1K tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2.5-vl-72b-instruct | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.016 | 0.048 | 1,000,000 tokens, valid for 180 days after activating Model Studio |
| qwen2.5-vl-32b-instruct | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.008 | 0.024 | 1,000,000 tokens, valid for 180 days after activating Model Studio |
| qwen2.5-vl-7b-instruct | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.002 | 0.005 | 1,000,000 tokens, valid for 180 days after activating Model Studio |
| qwen2.5-vl-3b-instruct | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0012 | 0.0036 | 1,000,000 tokens, valid for 180 days after activating Model Studio |
| qwen2-vl-72b-instruct | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.016 | 0.048 | 1,000,000 tokens, valid for 180 days after activating Model Studio |
| qwen2-vl-7b-instruct | 32,000 | 30,000 (up to 16,384 per image) | 2,000 | Free trial only (see note) | Free trial only (see note) | 100,000 tokens, valid for 180 days after activating Model Studio |
| qwen2-vl-2b-instruct | 32,000 | 30,000 (up to 16,384 per image) | 2,000 | Free for a limited time | Free for a limited time | 100,000 tokens, valid for 180 days after activating Model Studio |
| qwen-vl-v1 | 8,000 | 6,000 (up to 1,280 per image) | 1,500 | Free trial only (see note) | Free trial only (see note) | 100,000 tokens, valid for 180 days after activating Model Studio |
| qwen-vl-chat-v1 | 8,000 | 6,000 (up to 1,280 per image) | 1,500 | Free trial only (see note) | Free trial only (see note) | 100,000 tokens, valid for 180 days after activating Model Studio |

Note: models marked "free trial only" can no longer be called once the free quota is used up; stay tuned for further updates.

Rules for converting images to tokens

Each 28×28-pixel block corresponds to one token, and an image requires at least 4 tokens.
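For example, a 1024×768 image is first rounded to 1036×756 (the nearest multiples of 28); 1036 × 756 = 783,216 pixels lies within the allowed range [28 × 28 × 4, 1280 × 28 × 28], so the image costs 783,216 / (28 × 28) = 999 tokens, plus 2 tokens for the automatically added <|vision_bos|> and <|vision_eos|> markers, or 1,001 in total. You can compute an image's token count with the following code: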

Python
Node.js
Java
import math
# Install the Pillow library with: pip install Pillow
from PIL import Image

def token_calculate(image_path):
    # Open the specified image file
    image = Image.open(image_path)

    # Get the original dimensions of the image
    height = image.height
    width = image.width

    # Round the height to the nearest multiple of 28
    h_bar = round(height / 28) * 28
    # Round the width to the nearest multiple of 28
    w_bar = round(width / 28) * 28

    # Lower token limit for an image: 4 tokens
    min_pixels = 28 * 28 * 4
    # Upper token limit for an image: 1,280 tokens
    max_pixels = 1280 * 28 * 28

    # Scale the image so that its total pixel count falls within [min_pixels, max_pixels]
    if h_bar * w_bar > max_pixels:
        # Compute the scale factor beta so that the scaled image has at most max_pixels pixels
        beta = math.sqrt((height * width) / max_pixels)
        # Recompute the adjusted height, keeping it a multiple of 28
        h_bar = math.floor(height / beta / 28) * 28
        # Recompute the adjusted width, keeping it a multiple of 28
        w_bar = math.floor(width / beta / 28) * 28
    elif h_bar * w_bar < min_pixels:
        # Compute the scale factor beta so that the scaled image has at least min_pixels pixels
        beta = math.sqrt(min_pixels / (height * width))
        # Recompute the adjusted height, keeping it a multiple of 28
        h_bar = math.ceil(height * beta / 28) * 28
        # Recompute the adjusted width, keeping it a multiple of 28
        w_bar = math.ceil(width * beta / 28) * 28
    return h_bar, w_bar

# Replace test.png with the path to your local image
h_bar, w_bar = token_calculate("test.png")
print(f"Scaled image size: height {h_bar}, width {w_bar}")

# Number of image tokens: total pixels divided by 28 * 28
token = int((h_bar * w_bar) / (28 * 28))

# The system automatically adds the <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
print(f"Total image tokens: {token + 2}")
// Install sharp with: npm install sharp
import sharp from 'sharp';

async function tokenCalculate(imagePath) {
    // Open the specified image file
    const image = sharp(imagePath);
    const metadata = await image.metadata();

    // Get the original dimensions of the image
    const height = metadata.height;
    const width = metadata.width;

    // Round the height to the nearest multiple of 28
    let hBar = Math.round(height / 28) * 28;
    // Round the width to the nearest multiple of 28
    let wBar = Math.round(width / 28) * 28;

    // Lower token limit for an image: 4 tokens
    const minPixels = 28 * 28 * 4;
    // Upper token limit for an image: 1,280 tokens
    const maxPixels = 1280 * 28 * 28;

    // Scale the image so that its total pixel count falls within [minPixels, maxPixels]
    if (hBar * wBar > maxPixels) {
        // Compute the scale factor beta so that the scaled image has at most maxPixels pixels
        const beta = Math.sqrt((height * width) / maxPixels);
        // Recompute the adjusted height, keeping it a multiple of 28
        hBar = Math.floor(height / beta / 28) * 28;
        // Recompute the adjusted width, keeping it a multiple of 28
        wBar = Math.floor(width / beta / 28) * 28;
    } else if (hBar * wBar < minPixels) {
        // Compute the scale factor beta so that the scaled image has at least minPixels pixels
        const beta = Math.sqrt(minPixels / (height * width));
        // Recompute the adjusted height, keeping it a multiple of 28
        hBar = Math.ceil(height * beta / 28) * 28;
        // Recompute the adjusted width, keeping it a multiple of 28
        wBar = Math.ceil(width * beta / 28) * 28;
    }

    return { hBar, wBar };
}

// Replace test.png with the path to your local image
const imagePath = 'test.png';
tokenCalculate(imagePath).then(({ hBar, wBar }) => {
    console.log(`Scaled image size: height ${hBar}, width ${wBar}`);

    // Number of image tokens: total pixels divided by 28 * 28
    const token = Math.floor((hBar * wBar) / (28 * 28));

    // The system automatically adds the <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
    console.log(`Total image tokens: ${token + 2}`);
}).catch(err => {
    console.error('Error processing image:', err);
});
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class Main {

    // Helper class that stores the adjusted dimensions
    public static class ResizedSize {
        public final int height;
        public final int width;

        public ResizedSize(int height, int width) {
            this.height = height;
            this.width = width;
        }
    }

    public static ResizedSize smartResize(String imagePath) throws IOException {
        // 1. Load the image
        BufferedImage image = ImageIO.read(new File(imagePath));
        if (image == null) {
            throw new IOException("Failed to load image file: " + imagePath);
        }

        int originalHeight = image.getHeight();
        int originalWidth = image.getWidth();

        final int minPixels = 28 * 28 * 4;
        final int maxPixels = 1280 * 28 * 28;
        // 2. Round both dimensions to multiples of 28
        int hBar = (int) (Math.round(originalHeight / 28.0) * 28);
        int wBar = (int) (Math.round(originalWidth / 28.0) * 28);
        int currentPixels = hBar * wBar;

        // 3. Rescale if the pixel count is out of range
        if (currentPixels > maxPixels) {
            // Too many pixels: scale the image down
            double beta = Math.sqrt(
                    (originalHeight * (double) originalWidth) / maxPixels
            );
            double scaledHeight = originalHeight / beta;
            double scaledWidth = originalWidth / beta;

            hBar = (int) (Math.floor(scaledHeight / 28) * 28);
            wBar = (int) (Math.floor(scaledWidth / 28) * 28);
        } else if (currentPixels < minPixels) {
            // Too few pixels: scale the image up
            double beta = Math.sqrt(
                    (double) minPixels / (originalHeight * originalWidth)
            );
            double scaledHeight = originalHeight * beta;
            double scaledWidth = originalWidth * beta;

            hBar = (int) (Math.ceil(scaledHeight / 28) * 28);
            wBar = (int) (Math.ceil(scaledWidth / 28) * 28);
        }

        return new ResizedSize(hBar, wBar);
    }

    public static void main(String[] args) {
        try {
            ResizedSize size = smartResize(
                    // Replace xxx/test.png with the path to your image
                    "xxx/test.png"
            );

            System.out.printf("Scaled image size: height %d, width %d%n", size.height, size.width);

            // Token count: total pixels / (28 × 28), plus 2 for the visual markers
            int token = (size.height * size.width) / (28 * 28) + 2;
            System.out.printf("Total image tokens: %d%n", token);

        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Model selection advice

  • The Qwen-VL-Max models have the strongest visual understanding; the Qwen-VL-Plus models offer a good balance between quality and cost. If you are unsure which model to use, start with Qwen-VL-Plus.

  • For images involving complex mathematical reasoning, use the Qwen QVQ models. Qwen QVQ is a visual reasoning model that accepts visual input and produces chain-of-thought output, with stronger performance on math, coding, visual analysis, creative, and general tasks.

How to use

You must first obtain an API key and configure the API key as an environment variable. If you call the models through the OpenAI SDK or DashScope SDK, you also need to install the latest SDK; make sure your DashScope Python SDK is at least version 1.20.7 and your DashScope Java SDK at least version 2.18.3. If you are a member of a sub-workspace, make sure the super administrator has authorized the model for that workspace.
You can adapt the recommended prompts from the Prompt guide to your scenario. Document parsing, object localization, and timestamp-based video understanding are supported only by Qwen2.5-VL series models.

Quick start

The following sample code passes an image URL for image understanding.

See Usage limits for the constraints on input images. To pass a local image, see Using local files.
OpenAI compatible
DashScope
Python
Node.js
curl
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
                    },
                },
                {"type": "text", "text": "图中描绘的是什么景象?"},
            ],
        },
    ],
)

print(completion.choices[0].message.content)

Sample response

这是一张在海滩上拍摄的照片。照片中,一个人和一只狗坐在沙滩上,背景是大海和天空。人和狗似乎在互动,狗的前爪搭在人的手上。阳光从画面的右侧照射过来,给整个场景增添了一种温暖的氛围。
import OpenAI from "openai";

const openai = new OpenAI({
  // If the environment variable is not configured, replace the line below with your Model Studio API key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
});

async function main() {
  const response = await openai.chat.completions.create({
    model: "qwen-vl-max-latest",
    messages: [{
        role: "system",
        content: [{
          type: "text",
          text: "You are a helpful assistant."
        }]
      },
      {
        role: "user",
        content: [{
            type: "image_url",
            image_url: {
              "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
            }
          },
          {
            type: "text",
            text: "图中描绘的是什么景象?"
          }
        ]
      }
    ]
  });
  console.log(response.choices[0].message.content);
}

main()

Sample response

这是一张在海滩上拍摄的照片。照片中,一位穿着格子衬衫的女性坐在沙滩上,与一只戴着项圈的黄色拉布拉多犬互动。背景是大海和天空,阳光洒在她们身上,营造出温暖的氛围。
curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen-vl-max",
  "messages": [
  {"role":"system",
  "content":[
    {"type": "text", "text": "You are a helpful assistant."}]},
  {
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
      {"type": "text", "text": "图中描绘的是什么景象?"}
    ]
  }]
}'

Sample response

{
  "choices": [
    {
      "message": {
        "content": "这张图片展示了一位女士和一只狗在海滩上互动。女士坐在沙滩上,微笑着与狗握手。背景是大海和天空,阳光洒在她们身上,营造出温暖的氛围。狗戴着项圈,显得很温顺。",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1270,
    "completion_tokens": 54,
    "total_tokens": 1324
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen-vl-max",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}
Python
Java
curl
import os
import dashscope
messages = [
{
    "role": "system",
    "content": [
    {"text": "You are a helpful assistant."}]
},
{
    "role": "user",
    "content": [
    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
    {"text": "图中描绘的是什么景象?"}]
}]
response = dashscope.MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key = os.getenv('DASHSCOPE_API_KEY'),
    model = 'qwen-vl-max-latest',
    messages = messages
)
print(response.output.choices[0].message.content[0]["text"])

Sample response

是一张在海滩上拍摄的照片。照片中有一位女士和一只狗。女士坐在沙滩上,微笑着与狗互动。狗戴着项圈,似乎在与女士握手。背景是大海和天空,阳光洒在她们身上,营造出温馨的氛围。
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "图中描绘的是什么景象?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                 // If the environment variable is not configured, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。人和狗面对面坐着,似乎在互动。背景是大海和天空,阳光洒在他们身上,营造出温暖的氛围。
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {"role": "system",
	     "content": [
	       {"text": "You are a helpful assistant."}]},
            {
             "role": "user",
             "content": [
               {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
               {"text": "图中描绘的是什么景象?"}
                ]
            }
        ]
    }
}'

Sample response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。他们坐在沙滩上,背景是大海和天空。阳光从画面的右侧照射过来,给整个场景增添了一种温暖的氛围。"
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 55,
    "input_tokens": 1271,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Multi-turn conversation (using conversation history)

The Qwen-VL models can use previous conversation history for multi-turn conversations. You maintain a messages array and append each round's conversation history, plus the new instruction, to it.

OpenAI compatible
DashScope
Python
Node.js
curl
from openai import OpenAI
import os

client = OpenAI(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
                },
            },
            {"type": "text", "text": "图中描绘的是什么景象?"},
        ],
    }
]
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=messages,
    )
print(f"第一轮输出:{completion.choices[0].message.content}")
assistant_message = completion.choices[0].message
messages.append(assistant_message.model_dump())
messages.append({
        "role": "user",
        "content": [
        {
            "type": "text",
            "text": "做一首诗描述这个场景"
        }
        ]
    })
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=messages,
    )
print(f"第二轮输出:{completion.choices[0].message.content}")

Sample response

第一轮输出:这是一张在海滩上拍摄的照片。照片中,一位穿着格子衬衫的女士坐在沙滩上,与一只戴着项圈的金毛犬互动。背景是大海和天空,阳光洒在她们身上,营造出温暖的氛围。
第二轮输出:沙滩上,阳光洒,
女子与犬,笑语哗。
海浪轻拍,风儿吹,
快乐时光,心儿醉。
import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If the environment variable is not configured, replace the line below with your Model Studio API key: apiKey: "sk-xxx",
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

let messages = [
    {
	role: "system",
	content: [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        role: "user",
	content: [
        { type: "image_url", image_url: { "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg" } },
        { type: "text", text: "图中描绘的是什么景象?" },
    ]
}]
async function main() {
    let response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: messages
    });
    console.log(`第一轮输出:${response.choices[0].message.content}`);
    messages.push(response.choices[0].message);
    messages.push({"role": "user", "content": "做一首诗描述这个场景"});
    response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: messages
    });
    console.log(`第二轮输出:${response.choices[0].message.content}`);
}

main()

Sample response

第一轮输出:这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。人和狗面对面坐着,似乎在互动。背景是大海和天空,阳光从画面的右侧照射过来,营造出温暖的氛围。
第二轮输出:沙滩上,人与狗,  
面对面,笑语稠。  
海风轻拂,阳光柔,  
心随波浪,共潮头。  

项圈闪亮,情意浓,  
格子衫下,心相通。  
海天一色,无尽空,  
此刻温馨,永铭中。
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen-vl-max",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "text",
          "text": "图中描绘的是什么景象?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "这是一个女孩和一只狗。"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "写一首诗描述这个场景"
        }
      ]
    }
  ]
}'

Sample response

{
    "choices": [
        {
            "message": {
                "content": "海风轻拂笑颜开,  \n沙滩上与犬相陪。  \n夕阳斜照人影短,  \n欢乐时光心自醉。",
                "role": "assistant"
            },
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null
        }
    ],
    "object": "chat.completion",
    "usage": {
        "prompt_tokens": 1295,
        "completion_tokens": 32,
        "total_tokens": 1327
    },
    "created": 1726324976,
    "system_fingerprint": null,
    "model": "qwen-vl-max",
    "id": "chatcmpl-3c953977-6107-96c5-9a13-c01e328b24ca"
}
Python
Java
curl
import os
from dashscope import MultiModalConversation

messages = [
    {
	"role": "system",
	"content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {
                "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
            },
            {"text": "图中描绘的是什么景象?"},
        ],
    }
]
response = MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages)
print(f"模型第一轮输出:{response.output.choices[0].message.content[0]['text']}")
messages.append(response['output']['choices'][0]['message'])
user_msg = {"role": "user", "content": [{"text": "做一首诗描述这个场景"}]}
messages.append(user_msg)
response = MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages)
print(f"模型第二轮输出:{response.output.choices[0].message.content[0]['text']}")

Sample response

模型第一轮输出:这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。人和狗面对面坐着,似乎在互动。背景是大海和天空,阳光洒在他们身上,营造出温暖的氛围。
模型第二轮输出:在阳光照耀的海滩上,人与狗共享欢乐时光。
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    private static final String modelName = "qwen-vl-max-latest";
    public static void MultiRoundConversationCall() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "图中描绘的是什么景象?"))).build();
        List<MultiModalMessage> messages = new ArrayList<>();
        messages.add(systemMessage);
        messages.add(userMessage);
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not configured, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))                
                .model(modelName)
                .messages(messages)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("第一轮输出:"+result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));        // add the result to conversation
        messages.add(result.getOutput().getChoices().get(0).getMessage());
        MultiModalMessage msg = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "做一首诗描述这个场景"))).build();
        messages.add(msg);
        param.setMessages((List)messages);
        result = conv.call(param);
        System.out.println("第二轮输出:"+result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));    }

    public static void main(String[] args) {
        try {
            MultiRoundConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

第一轮输出:这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。人和狗面对面坐着,似乎在互动。背景是大海和天空,阳光洒在他们身上,营造出温暖的氛围。
第二轮输出:在阳光洒满的海滩上,人与狗共享欢乐时光。
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"text": "图中描绘的是什么景象?"}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"text": "图中是一名女子和一只拉布拉多犬在沙滩上玩耍。"}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"text": "写一首七言绝句描述这个场景"}
                ]
            }
        ]
    }
}'

Sample response

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "海浪轻拍沙滩边,女孩与狗同嬉戏。阳光洒落笑颜开,快乐时光永铭记。"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "output_tokens": 27,
        "input_tokens": 1298,
        "image_tokens": 1247
    },
    "request_id": "bdf5ef59-c92e-92a6-9d69-a738ecee1590"
}

Streaming output

After receiving the input, the model generates intermediate results step by step, and the final result is the concatenation of these intermediate results. Emitting intermediate results while generation is still in progress is called streaming output. With streaming output you can read while the model is still writing, reducing the time spent waiting for a reply.

OpenAI compatible
DashScope

Simply set the stream parameter to true in your code to call the Qwen-VL models through the OpenAI SDK or OpenAI-compatible HTTP with streaming output.

Python
Node.js
curl
from openai import OpenAI
import os

client = OpenAI(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[
	{"role": "system",
         "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {"role": "user",
         "content": [{"type": "image_url",
                    "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
                    {"type": "text", "text": "图中描绘的是什么景象?"}]}],
    stream=True
)
full_content = ""
print("流式输出内容为:")
for chunk in completion:
    if chunk.choices[0].delta.content is None:
        continue
    full_content += chunk.choices[0].delta.content
    print(chunk.choices[0].delta.content)
print(f"完整内容为:{full_content}")

Sample response

流式输出内容为:

图
中
描绘
的是
一个
女人
......
温暖
和谐
的
氛围
。
完整内容为:图中描绘的是一个女人和一只狗在海滩上互动的场景。女人坐在沙滩上,微笑着与狗握手,显得非常开心。背景是大海和天空,阳光洒在她们身上,营造出一种温暖和谐的氛围。
import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If the environment variable is not configured, replace the line below with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

const completion = await openai.chat.completions.create({
    model: "qwen-vl-max-latest",
    messages: [
        {"role": "system",
         "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {"role": "user",
        "content": [{"type": "image_url",
                    "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
                    {"type": "text", "text": "图中描绘的是什么景象?"}]}],
    stream: true,
});

let fullContent = ""
console.log("流式输出内容为:")
for await (const chunk of completion) {
    if (chunk.choices[0].delta.content != null) {
      fullContent += chunk.choices[0].delta.content;
      console.log(chunk.choices[0].delta.content);
    }
}
console.log(`完整输出内容为:${fullContent}`)

Sample response

流式输出内容为:

图中描绘的是
一个女人和一只
狗在海滩上
互动的景象。
......
在她们身上,
营造出温暖和谐
的氛围。
完整内容为:图中描绘的是一个女人和一只狗在海滩上互动的景象。女人穿着格子衬衫,坐在沙滩上,微笑着与狗握手。狗戴着项圈,看起来很开心。背景是大海和天空,阳光洒在她们身上,营造出温暖和谐的氛围。
curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen-vl-max-latest",
    "messages": [
   {
      "role": "system",
      "content": [{"type":"text","text": "You are a helpful assistant."}]},
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "text",
          "text": "图中描绘的是什么景象?"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true}
}'

Sample response

data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[{"finish_reason":null,"delta":{"content":"图"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[{"delta":{"content":"中"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

......

data: {"choices":[{"delta":{"content":"分拍摄的照片。整体氛围显得非常"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[{"finish_reason":"stop","delta":{"content":"和谐而温馨。"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":1276,"completion_tokens":85,"total_tokens":1361},"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: [DONE]

You can call the Qwen-VL models through the DashScope SDK or HTTP with streaming output. Set the parameter that matches your calling method:

  • Python SDK: set the stream parameter to True.

  • Java SDK: call through the streamCall interface.

  • HTTP: set the X-DashScope-SSE header to enable.

By default, streaming output is non-incremental (each returned chunk contains everything generated so far). For incremental streaming, set the incremental_output parameter (incrementalOutput in Java) to true.
Python
Java
curl
import os
from dashscope import MultiModalConversation

messages = [
    {
    "role": "system",
    "content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
            {"text": "图中描绘的是什么景象?"}
        ]
    }
]
responses = MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen-vl-max-latest',
    messages=messages,
    stream=True,
    incremental_output=True
    )
full_content = ""
print("流式输出内容为:")
for response in responses:
    try:
        print(response["output"]["choices"][0]["message"].content[0]["text"])
        full_content += response["output"]["choices"][0]["message"].content[0]["text"]
    except:
        pass
print(f"完整内容为:{full_content}")

Sample response

流式输出内容为:
图中描绘的是
一个人和一只狗
在海滩上互动
......
阳光洒在他们
身上,营造出
温暖和谐的氛围
。
完整内容为:图中描绘的是一个人和一只狗在海滩上互动的景象。这个人穿着格子衬衫,坐在沙滩上,与一只戴着项圈的金毛猎犬握手。背景是海浪和天空,阳光洒在他们身上,营造出温暖和谐的氛围。
import java.util.Arrays;
import java.util.Collections;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;

public class Main {
    public static void streamCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // must create mutable map.
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "图中描绘的是什么景象?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not configured, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
            } catch (Exception e){
                System.exit(0);
            }
        });
    }

    public static void main(String[] args) {
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

图
中
描绘
的是
一个
女人
和
一只
狗
在
海滩
......
营造
出
一种
温暖
和谐
的
氛围
。
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [
                    {"text": "You are a helpful assistant."}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"text": "图中描绘的是什么景象?"}
                ]
            }
        ]
    },
    "parameters": {
        "incremental_output": true
    }
}'

Sample response

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"这张"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":1,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"图片"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":2,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

......

id:17
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"的欣赏。这是一个温馨的画面,展示了"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":112,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

id:18
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"人与动物之间深厚的情感纽带。"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":120,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

id:19
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"input_tokens":1276,"output_tokens":121,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

High-resolution image understanding

You can set the vl_high_resolution_images parameter to true to raise the per-image token limit of the Qwen-VL models from 1,280 to 16,384:

| Value | Per-image token limit | Description | When to use |
| --- | --- | --- | --- |
| true | 16,384 | Images above the limit are scaled down until they fit within 16,384 tokens. The model can process higher-resolution images directly and capture more image detail, but processing is slower and token usage higher. | Content-rich images where details matter |
| false (default) | 1,280 | Images above the limit are scaled down until they fit within 1,280 tokens. Processing is faster and token usage lower. | Images with little detail, when a rough understanding is enough or speed matters |

Models that support vl_high_resolution_images: qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-0402, qwen-vl-max-0125, qwen-vl-max-1230, qwen-vl-max-1119, qwen-vl-max-1030, qwen-vl-max-0809, qwen-vl-plus-latest, qwen-vl-plus-0125, qwen-vl-plus-0102, qwen-vl-plus-0809, qwen2-vl-72b-instruct, qwen2-vl-7b-instruct, qwen2-vl-2b-instruct, qwen2.5-vl-72b-instruct, qwen2.5-vl-32b-instruct, qwen2.5-vl-7b-instruct, qwen2.5-vl-3b-instruct

The vl_high_resolution_images parameter is available only through the DashScope Python SDK and HTTP.
Python
curl
import os
import dashscope

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
            {"text": "这张图表现了什么内容?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages,
    vl_high_resolution_images=True
)

print("大模型的回复:\n ",response.output.choices[0].message.content[0]["text"])
print("Token用量情况:","输入总Token:",response.usage["input_tokens"] , ",输入图像Token:" , response.usage["image_tokens"])

Sample response

大模型的回复:
  这张图片展示了一个温馨的圣诞装饰场景。图中可以看到以下元素:

1. **圣诞树**:两棵小型的圣诞树,上面覆盖着白色的雪。
2. **驯鹿摆件**:一只棕色的驯鹿摆件,带有大大的鹿角。
3. **蜡烛和烛台**:几个木制的烛台,里面点燃了小蜡烛,散发出温暖的光芒。
4. **圣诞装饰品**:包括金色的球形装饰、松果、红色浆果串等。
5. **圣诞礼物盒**:一个小巧的金色礼物盒,用金色丝带系着。
6. **圣诞字样**:木质的“MERRY CHRISTMAS”字样,增加了节日气氛。
7. **背景**:木质的背景板,给人一种自然和温暖的感觉。

整体氛围非常温馨和喜庆,充满了浓厚的圣诞节气息。
Token用量情况: 输入总Token: 5368 ,输入图像Token: 5342
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {"role": "system",
	     "content": [
	       {"text": "You are a helpful assistant."}]},
            {
             "role": "user",
             "content": [
               {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
               {"text": "这张图表现了什么内容?"}
                ]
            }
        ]
    },
    "parameters": {
        "vl_high_resolution_images": true
    }
}'

Sample response

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "这张图片展示了一个温馨的圣诞装饰场景。画面中包括以下元素:\n\n1. **圣诞树**:两棵小型的圣诞树,上面覆盖着白色的雪。\n2. **驯鹿摆件**:一只棕色的驯鹿摆件,位于画面中央偏右的位置。\n3. **蜡烛**:几根木制的蜡烛,其中两根已经点燃,发出温暖的光芒。\n4. **圣诞装饰品**:一些金色和红色的装饰球、松果、浆果和绿色的松枝。\n5. **圣诞礼物**:一个小巧的金色礼物盒,旁边还有一个带有圣诞图案的袋子。\n6. **“MERRY CHRISTMAS”字样**:用木质字母拼写的“MERRY CHRISTMAS”,放在画面左侧。\n\n整个场景布置在一个木质背景前,营造出一种温暖、节日的氛围,非常适合圣诞节的庆祝活动。"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "total_tokens": 5553,
        "output_tokens": 185,
        "input_tokens": 5368,
        "image_tokens": 5342
    },
    "request_id": "38cd5622-e78e-90f5-baa0-c6096ba39b04"
}

Multi-image input

The Qwen-VL models can accept multiple images in a single request and answer based on all of them. Images can be passed as image URLs or local files, or a combination of both; the sample code below passes image URLs.

The total token count of the input images must be smaller than the model's maximum input. You can estimate the maximum number of images you can pass from the image count limits, as sketched below.
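A rough sketch of that estimate (assuming you already know each image's token count, for example from the token calculator above; max_image_count and its parameters are illustrative names, not API parameters):

def max_image_count(max_input_tokens, tokens_per_image, reserved_text_tokens=1000):
    # Estimate how many images of a given token size fit into one request,
    # keeping some budget in reserve for the text part of the prompt.
    # Each image also carries the <|vision_bos|>/<|vision_eos|> markers (2 tokens).
    budget = max_input_tokens - reserved_text_tokens
    return budget // (tokens_per_image + 2)

# Example: 1,280-token images against the 129,024-token max input of qwen-vl-max-latest
print(max_image_count(129024, 1280))  # 99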
OpenAI compatible
DashScope
Python
Node.js
curl
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[
       {"role":"system","content":[{"type": "text", "text": "You are a helpful assistant."}]},
       {"role": "user","content": [
           # URL of the first image; to pass a local file, replace the url value with the image's Base64-encoded data
           {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
           # URL of the second image; to pass a local file, replace the url value with the image's Base64-encoded data
           {"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},},
           {"type": "text", "text": "这些图描绘了什么内容?"},
            ],
        }
    ],
)

print(completion.choices[0].message.content)

Sample response

图1中是一位女士和一只拉布拉多犬在海滩上互动的场景。女士穿着格子衬衫,坐在沙滩上,与狗进行握手的动作,背景是海浪和天空,整个画面充满了温馨和愉快的氛围。

图2中是一只老虎在森林中行走的场景。老虎的毛色是橙色和黑色条纹相间,它正向前迈步,周围是茂密的树木和植被,地面上覆盖着落叶,整个画面给人一种野生自然的感觉。
import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If the environment variable is not configured, replace the line below with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: [
	    {role: "system",content:[{ type: "text", text: "You are a helpful assistant." }]},
	    {role: "user",content: [
	    // URL of the first image; to pass a local file, replace the url value with the image's Base64-encoded data
            {type: "image_url",image_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
            // URL of the second image; to pass a local file, replace the url value with the image's Base64-encoded data
            {type: "image_url",image_url: {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"}},
            {type: "text", text: "这些图描绘了什么内容?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

Sample response

第一张图片中,一个人和一只狗在海滩上互动。人穿着格子衬衫,狗戴着项圈,他们似乎在握手或击掌。

第二张图片中,一只老虎在森林中行走。老虎的毛色是橙色和黑色条纹,背景是绿色的树木和植被。
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen-vl-max-latest",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
          }
        },
        {
          "type": "text",
          "text": "这些图描绘了什么内容?"
        }
      ]
    }
  ]
}'

Sample response

{
  "choices": [
    {
      "message": {
        "content": "图1中是一位女士和一只拉布拉多犬在海滩上互动的场景。女士穿着格子衬衫,坐在沙滩上,与狗进行握手的动作,背景是海景和日落的天空,整个画面显得非常温馨和谐。\n\n图2中是一只老虎在森林中行走的场景。老虎的毛色是橙色和黑色条纹相间,它正向前迈步,周围是茂密的树木和植被,地面上覆盖着落叶,整个画面充满了自然的野性和生机。",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 2497,
    "completion_tokens": 109,
    "total_tokens": 2606
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen-vl-max",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}
Python
Java
curl
import os
import dashscope

messages = [
    {
	"role": "system",
	"content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            # URL of the first image; to pass a local file, replace the url value with the image's Base64-encoded data
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
            # URL of the second image; to pass a local file, replace the url value with the image's Base64-encoded data
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
            # URL of the third image; to pass a local file, replace the url value with the image's Base64-encoded data
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/hbygyo/rabbit.jpg"},
            {"text": "这些图描绘了什么内容?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Sample response

这些图片展示了一些动物和自然场景。第一张图片中,一个人和一只狗在海滩上互动。第二张图片是一只老虎在森林中行走。第三张图片是一只卡通风格的兔子在草地上跳跃。
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import java.util.HashMap;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
         // To use a local image, add a String localPath parameter to this method and uncomment the lines below (java.util.HashMap is already imported)
        // String filePath = "file://"+localPath;
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        // URL of the first image
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        // To use a local image, uncomment the line below
                        // new HashMap<String, Object>(){{put("image", filePath);}},
                        // URL of the second image
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
                        // URL of the third image
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/hbygyo/rabbit.jpg"),
                        Collections.singletonMap("text", "这些图描绘了什么内容?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not configured, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

这些图片展示了一些动物和自然场景。

1. 第一张图片:一个女人和一只狗在海滩上互动。女人穿着格子衬衫,坐在沙滩上,狗戴着项圈,伸出爪子与女人握手。
2. 第二张图片:一只老虎在森林中行走。老虎的毛色是橙色和黑色条纹,背景是树木和树叶。
3. 第三张图片:一只卡通风格的兔子在草地上跳跃。兔子是白色的,耳朵是粉红色的,背景是蓝天和黄色的花朵。
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/hbygyo/rabbit.jpg"},
                    {"text": "这些图展现了什么内容?"}
                ]
            }
        ]
    }
}'

Sample response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "这张图片显示了一位女士和她的狗在海滩上。她们似乎正在享受彼此的陪伴,狗狗坐在沙滩上伸出爪子与女士握手或互动。背景是美丽的日落景色,海浪轻轻拍打着海岸线。\n\n请注意,我提供的描述基于图像中可见的内容,并不包括任何超出视觉信息之外的信息。如果您需要更多关于这个场景的具体细节,请告诉我!"
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 81,
    "input_tokens": 1277,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Video understanding

Some Qwen-VL models support understanding video content, passed either as an image list (video frames) or as a video file.

Image list
Video file
All models other than qwen-vl-plus, qwen-vl-plus-2023-12-01, qvq-72b-preview, qwen-vl-v1, and qwen-vl-chat-v1 support understanding video passed as an image list.
  • Qwen2.5-VL series models: at least 4 and at most 512 images

  • Other models: at least 4 and at most 80 images

The following sample code passes a list of image URLs. To pass a local video, see Using local files.

OpenAI compatible
DashScope
When passing video as an image list to the Qwen-VL models through the OpenAI SDK or HTTP, set the "type" parameter in the user message to "video".
Python
Node.js
curl
import os
from openai import OpenAI

client = OpenAI(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[{"role": "user","content": [
        # When passing an image list, the "type" parameter in the user message is "video"
        {"type": "video","video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
        {"type": "text","text": "描述这个视频的具体过程"},
    ]}]
)
print(completion.choices[0].message.content)
// Make sure "type": "module" is specified in package.json
import OpenAI from "openai";

const openai = new OpenAI({
    // If the environment variable is not configured, replace the line below with your Model Studio API key: apiKey: "sk-xxx",
    apiKey: process.env.DASHSCOPE_API_KEY,
    baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
});

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: [{
            role: "user",
            content: [
                {
                    // When passing an image list, the "type" parameter in the user message is "video"
                    type: "video",
                    video: [
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                    ]
                },
                {
                    type: "text",
                    text: "描述这个视频的具体过程"
                }
            ]
        }]
    });
    console.log(response.choices[0].message.content);
}

main();
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
                {"type": "text",
                "text": "描述这个视频的具体过程"}]}]
}'
Python
Java
curl
import os
# Requires dashscope version 1.20.10 or later
import dashscope

messages = [{"role": "user",
             "content": [
                  # For Qwen2.5-VL series models, when passing an image list you can set the fps parameter, meaning one frame was extracted from the original video every 1/fps seconds
                 {"video":["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
                   "fps":2},
                 {"text": "描述这个视频的具体过程"}]}]
response = dashscope.MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen-vl-max-latest',
    messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
// Requires DashScope SDK version 2.18.3 or later
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    private static final String MODEL_NAME = "qwen-vl-max-latest";
    public static void videoImageListSample() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder()
                .role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant.")))
                .build();
        // For Qwen2.5-VL series models, when passing an image list you can set the fps parameter, meaning one frame was extracted from the original video every 1/fps seconds
        Map<String, Object> params = Map.of(
                "video", Arrays.asList("https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"),
                "fps",2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "描述这个视频的具体过程")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(systemMessage, userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen-vl-max-latest",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
            ],
            "fps":2
                 
          },
          {
            "text": "描述这个视频的具体过程"
          }
        ]
      }
    ]
  }
}'
The qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-1119, qwen-vl-max-2025-01-25, qwen2.5-vl-32b-instruct, and qwen2.5-vl-72b-instruct models support passing video files directly. To use video files with qwen-vl-max-0809, qwen-vl-max-1030, qwen-vl-plus-latest, qwen-vl-plus-0809, qwen2.5-vl-3b-instruct, or qwen2.5-vl-7b-instruct, please submit a ticket to apply for access.

Video file limits:

  • Video file size: Qwen2.5-VL series models accept videos up to 1 GB; other models, up to 150 MB. A minimal pre-flight check sketch follows this list.

  • Video file format: MP4, AVI, MKV, MOV, FLV, WMV, and similar formats.

  • Video duration: 2 seconds to 10 minutes for Qwen2.5-VL series models; 2 seconds to 40 seconds for other models.

  • Video dimensions: no restriction, but the video is resized to roughly 600,000 total pixels, so larger videos do not yield better understanding.

  • Understanding the audio track of a video file is not yet supported.

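Before uploading, a quick local pre-flight check against these limits can fail fast. A minimal Python sketch, assuming only the documented size and format limits (duration would require a media library such as ffprobe; check_video is our own helper, not part of any SDK):

import os

# Hypothetical pre-flight check against the documented video limits above
ALLOWED_EXTS = {".mp4", ".avi", ".mkv", ".mov", ".flv", ".wmv"}

def check_video(path, is_qwen25_vl=True):
    """Return a list of limit violations for a local video file."""
    if not os.path.exists(path):
        return ["file not found"]
    problems = []
    if os.path.splitext(path)[1].lower() not in ALLOWED_EXTS:
        problems.append("unsupported container format")
    # 1 GB for Qwen2.5-VL series models, 150 MB for other models
    limit = 1024**3 if is_qwen25_vl else 150 * 1024**2
    if os.path.getsize(path) > limit:
        problems.append(f"file exceeds {limit} bytes")
    # Duration (2 s-10 min vs. 2 s-40 s) is not checked here
    return problems

print(check_video("test.mp4"))  # [] means the basic checks passed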
The sample code below passes a video URL. To pass a local video, see Local files.

OpenAI-compatible
DashScope
When passing a video file to the Qwen-VL model via the OpenAI SDK or HTTP, set the "type" parameter in the user message to "video_url".
Python
Node.js
curl
import os
from openai import OpenAI

client = OpenAI(
    # If the environment variable is not set, replace the line below with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[
        {"role": "system",
         "content": [{"type": "text","text": "You are a helpful assistant."}]},
        {"role": "user","content": [{
            # When passing a video file directly, set the value of type to video_url
            "type": "video_url",            
            "video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {"type": "text","text": "这段视频的内容是什么?"}]
         }]
)
print(completion.choices[0].message.content)
import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If the environment variable is not set, replace the line below with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: [
        {role:"system",content:["You are a helpful assistant."]},
        {role: "user",content: [
            // 直接传入视频文件时,请将type的值设置为video_url
            {type: "video_url", video_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {type: "text", text: "这段视频的内容是什么?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main();
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "messages": [
    {"role": "system", "content": [{"type": "text","text": "You are a helpful assistant."}]},
    {"role": "user","content": [{"type": "video_url","video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
    {"type": "text","text": "这段视频的内容是什么?"}]}]
}'
Python
Java
curl
import dashscope
import os
messages = [
    {"role":"system","content":[{"text": "You are a helpful assistant."}]},
    {"role": "user",
        "content": [
            # The fps parameter controls frame sampling: one frame is extracted every 1/fps seconds
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "这段视频的内容是什么?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If the environment variable is not set, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // The fps parameter controls frame sampling: one frame is extracted every 1/fps seconds
        Map<String, Object> params = Map.of(
                "video", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
                "fps",2);
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "这段视频的内容是什么?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {"role": "system","content": [{"text": "You are a helpful assistant."}]},
            {"role": "user","content": [{"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "这段视频的内容是什么?"}]}]}
}'

Using local files (pass Base64-encoded data or a local path)

The sample code below passes a local image or video file. With the OpenAI SDK or HTTP, encode the local file as Base64 before passing it; with the DashScope SDK, pass the local file path directly.

Image
Video

The example below uses a locally saved eagle.png.

OpenAI-compatible
DashScope

To process a local image with the OpenAI SDK or HTTP, follow these steps:

  1. Encode the image file: read the local image and encode it as Base64.

  2. Pass the Base64 data: pass it to the image_url parameter in the format data:image/{format};base64,{base64_image}.

    image/{format} is the image's format. Set it to the Content Type value from the usage-limits table that matches your actual image format; for example, use image/jpeg for a local JPG image.
  3. Call the model: invoke the Qwen-VL model and process the returned result.

Python
Node.js
from openai import OpenAI
import os
import base64


#  Base64 encoding helper
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/eagle.png with the absolute path to your local image
base64_image = encode_image("xxx/eagle.png")

client = OpenAI(
    # If the environment variable is not set, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # Note: when passing Base64 data, image/{format} must match the Content Type in the supported-image table. "f" denotes a Python f-string.
                    # PNG image:  f"data:image/png;base64,{base64_image}"
                    # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP image: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
                {"type": "text", "text": "图中描绘的是什么景象?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)
import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If the environment variable is not set, replace the line below with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
// Replace xxx/eagle.png with the absolute path to your local image
const base64Image = encodeImage("xxx/eagle.png")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: [
            {"role": "system", 
             "content": [{"type":"text","text": "You are a helpful assistant."}]},
            {"role": "user",
             "content": [{"type": "image_url",
                            // Note: when passing Base64 data, image/{format} must match the Content Type in the supported-image table.
                            // PNG image:  data:image/png;base64,${base64Image}
                            // JPEG image: data:image/jpeg;base64,${base64Image}
                            // WEBP image: data:image/webp;base64,${base64Image}
                        "image_url": {"url": `data:image/png;base64,${base64Image}`},},
                        {"type": "text", "text": "图中描绘的是什么景象?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

When processing a local image file with the DashScope SDK, pass a file path. Build the path according to the table below, based on your SDK and operating system; a small helper sketch follows the table.

| System         | SDK                   | File path to pass       | Example                      |
| -------------- | --------------------- | ----------------------- | ---------------------------- |
| Linux or macOS | Python SDK / Java SDK | file://{absolute path}  | file:///home/images/test.png |
| Windows        | Python SDK            | file://{absolute path}  | file://D:/images/test.png    |
| Windows        | Java SDK              | file:///{absolute path} | file:///D:/images/test.png   |
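The file:// forms in the table can be produced with a small helper rather than by hand. A minimal sketch (to_file_uri is our own name, not an SDK function); note it yields the Python SDK form on Windows:

import os

# Build the "file://" path the DashScope Python SDK expects (see table above)
def to_file_uri(local_path):
    abs_path = os.path.abspath(local_path).replace("\\", "/")
    return f"file://{abs_path}"

print(to_file_uri("eagle.png"))
# Linux/macOS -> file:///home/user/images/eagle.png
# Windows     -> file://D:/images/eagle.png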

Python
Java
import os
from dashscope import MultiModalConversation

# Replace xxx/eagle.png with the absolute path to your local image
local_path = "xxx/eagle.png"
image_path = f"file://{local_path}"
messages = [{"role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
                {'role':'user',
                'content': [{'image': image_path},
                            {'text': '图中描绘的是什么景象?'}]}]
response = MultiModalConversation.call(
    # If the environment variable is not set, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages)
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>(){{put("image", filePath);}},
                        new HashMap<String, Object>(){{put("text", "图中描绘的是什么景象?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));}

    public static void main(String[] args) {
        try {
            // Replace xxx/eagle.png with the absolute path to your local image
            callWithLocalFile("xxx/eagle.png");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
Image list
Video file

The example below uses locally saved football1.jpg, football2.jpg, football3.jpg, and football4.jpg.

OpenAI-compatible
DashScope
Python
Node.js
import os
from openai import OpenAI
import base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")
client = OpenAI(
    # If the environment variable is not set, replace the line below with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[
    {"role": "system",
     "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user","content": [
        {"type": "video","video": [
            f"data:image/jpeg;base64,{base64_image1}",
            f"data:image/jpeg;base64,{base64_image2}",
            f"data:image/jpeg;base64,{base64_image3}",
            f"data:image/jpeg;base64,{base64_image4}",]},
        {"type": "text","text": "描述这个视频的具体过程"},
    ]}]
)
print(completion.choices[0].message.content)
import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If the environment variable is not set, replace the line below with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
  
const base64Image1 = encodeImage("football1.jpg")
const base64Image2 = encodeImage("football2.jpg")
const base64Image3 = encodeImage("football3.jpg")
const base64Image4 = encodeImage("football4.jpg")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: [
            {"role": "system",
             "content": [{"type":"text","text": "You are a helpful assistant."}]},
            {"role": "user",
             "content": [{"type": "video",
                            // Note: when passing Base64 data, image/{format} must match the Content Type in the supported-image table.
                            // PNG image:  data:image/png;base64,${base64Image}
                            // JPEG image: data:image/jpeg;base64,${base64Image}
                            // WEBP image: data:image/webp;base64,${base64Image}
                        "video": [
                            `data:image/jpeg;base64,${base64Image1}`,
                            `data:image/jpeg;base64,${base64Image2}`,
                            `data:image/jpeg;base64,${base64Image3}`,
                            `data:image/jpeg;base64,${base64Image4}`]},
                        {"type": "text", "text": "这段视频描绘的是什么景象?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();
Python
Java
import os

from dashscope import MultiModalConversation

local_path1 = "football1.jpg"
local_path2 = "football2.jpg"
local_path3 = "football3.jpg"
local_path4 = "football4.jpg"

image_path1 = f"file://{local_path1}"
image_path2 = f"file://{local_path2}"
image_path3 = f"file://{local_path3}"
image_path4 = f"file://{local_path4}"

messages = [{"role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
                {'role':'user',
                # For Qwen2.5-VL series models, an fps parameter may be set when passing an image list, indicating the frames were sampled every 1/fps seconds; other models ignore it
                'content': [{'video': [image_path1,image_path2,image_path3,image_path4],"fps":2},
                            {'text': '这段视频描绘的是什么景象?'}]}]
response = MultiModalConversation.call(
    # If the environment variable is not set, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages)

print(response["output"]["choices"][0]["message"].content[0]["text"])
// Requires DashScope SDK version 2.18.3 or later
import java.util.Arrays;
import java.util.Map;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    private static final String MODEL_NAME = "qwen-vl-max-latest";
    public static void videoImageListSample(String localPath1, String localPath2, String localPath3, String localPath4)
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        String filePath1 = "file://" + localPath1;
        String filePath2 = "file://" + localPath2;
        String filePath3 = "file://" + localPath3;
        String filePath4 = "file://" + localPath4;
        MultiModalMessage systemMessage = MultiModalMessage.builder()
                .role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant.")))
                .build();
        Map<String, Object> params = Map.of(
                "video", Arrays.asList(filePath1,filePath2,filePath3,filePath4),
                // For Qwen2.5-VL series models, an fps parameter may be set when passing an image list, indicating the frames were sampled every 1/fps seconds; other models ignore it
                "fps",2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(params,
                        Collections.singletonMap("text", "描述这个视频的具体过程")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(systemMessage, userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample(
                    "xxx/football1.jpg",
                    "xxx/football2.jpg",
                    "xxx/football3.jpg",
                    "xxx/football4.jpg");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

The example below uses a locally saved test.mp4.

Local video files are limited to 100 MB in size.
OpenAI-compatible
DashScope
Python
Node.js
from openai import OpenAI
import os
import base64


#  Base64 encoding helper
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace xxx/test.mp4 with the absolute path to your local video
base64_video = encode_video("xxx/test.mp4")
client = OpenAI(
    # If the environment variable is not set, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[
        {
            "role": "system",
            "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {
                    # When passing a video file directly, set the value of type to video_url
                    "type": "video_url",
                    "video_url": {"url": f"data:video/mp4;base64,{base64_video}"},
                },
                {"type": "text", "text": "这段视频描绘的是什么景象?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)
import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If the environment variable is not set, replace the line below with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeVideo = (videoPath) => {
    const videoFile = readFileSync(videoPath);
    return videoFile.toString('base64');
  };
// Replace xxx/test.mp4 with the absolute path to your local video
const base64Video = encodeVideo("xxx/test.mp4")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: [
            {"role": "system",
             "content": [{"type":"text","text": "You are a helpful assistant."}]},
            {"role": "user",
             "content": [{
                 // When passing a video file directly, set the value of type to video_url
                "type": "video_url", 
                "video_url": {"url": `data:video/mp4;base64,${base64Video}`}},
                 {"type": "text", "text": "这段视频描绘的是什么景象?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();
Python
Java
import os
from dashscope import MultiModalConversation
# Replace xxx/test.mp4 with the absolute path to your local video
local_path = "xxx/test.mp4"
video_path = f"file://{local_path}"
messages = [{'role': 'system',
                'content': [{'text': 'You are a helpful assistant.'}]},
                {'role':'user',
                # The fps parameter controls frame sampling: one frame is extracted every 1/fps seconds
                'content': [{'video': video_path,"fps":2},
                            {'text': '这段视频描绘的是什么景象?'}]}]
response = MultiModalConversation.call(
    # If the environment variable is not set, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages)
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>()
                                       {{
                                           put("video", filePath);// fps参数控制视频抽帧数量,表示每隔1/fps 秒抽取一帧
                                           put("fps", 2);
                                       }}, 
                        new HashMap<String, Object>(){{put("text", "这段视频描绘的是什么景象?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));}

    public static void main(String[] args) {
        try {
            // Replace xxx/test.mp4 with the absolute path to your local video
            callWithLocalFile("xxx/test.mp4");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Usage limits

Supported image formats

The supported image formats are listed below. When passing a local image via the OpenAI SDK, set image/{format} in your code to the Content Type value that matches the actual image format.

| Image format | File extensions                    | Content Type |
| ------------ | ---------------------------------- | ------------ |
| BMP          | .bmp                               | image/bmp    |
| DIB          | .dib                               | image/bmp    |
| ICNS         | .icns                              | image/icns   |
| ICO          | .ico                               | image/x-icon |
| JPEG         | .jfif, .jpe, .jpeg, .jpg           | image/jpeg   |
| JPEG2000     | .j2c, .j2k, .jp2, .jpc, .jpf, .jpx | image/jp2    |
| PNG          | .apng, .png                        | image/png    |
| SGI          | .bw, .rgb, .rgba, .sgi             | image/sgi    |
| TIFF         | .tif, .tiff                        | image/tiff   |
| WEBP         | .webp                              | image/webp   |

Image size limits

  • A single image file must not exceed 10 MB.

  • Both width and height must be greater than 10 pixels, and the aspect ratio must not exceed 200:1 or 1:200.

  • There is no limit on total pixel count per image, because the model scales and preprocesses images before understanding them. Oversized images do not improve results; recommended pixel counts are as follows (a pre-flight check sketch follows this list):

    • For qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-1230, qwen-vl-max-1119, qwen-vl-max-1030, qwen-vl-max-0809, qwen-vl-plus-latest, qwen-vl-plus-0102, qwen-vl-plus-0809, qwen2-vl-72b-instruct, qwen2-vl-7b-instruct, and qwen2-vl-2b-instruct, keep a single image under about 12 million pixels, which covers standard 4K images.

    • For qwen-vl-max-0201 and qwen-vl-plus, keep a single image under 1,048,576 pixels, equivalent to a 1024 x 1024 image.

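These limits can be verified locally before calling the model. A minimal Pillow sketch, checking only the constraints listed above (check_image is our own helper, not part of any SDK):

import os
from PIL import Image  # install with: pip install Pillow

# Hypothetical pre-flight check against the documented image limits above
def check_image(path):
    problems = []
    if os.path.getsize(path) > 10 * 1024**2:
        problems.append("file larger than 10 MB")
    width, height = Image.open(path).size
    if min(width, height) <= 10:
        problems.append("width and height must both exceed 10 px")
    if max(width, height) / min(width, height) > 200:
        problems.append("aspect ratio exceeds 200:1 (or 1:200)")
    return problems

print(check_image("eagle.png"))  # [] means all checks passed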
Image input methods

  • Image URL: the URL must be publicly accessible.

    Note

    You can upload images to OSS to obtain a public URL.

    • For an OSS object whose read/write permission is private, use the public endpoint to generate a presigned URL, which grants others temporary access to the file; see Download objects using presigned URLs.

    • OSS's internal network does not interconnect with Model Studio, so do not use internal OSS URLs.

  • Local image file: with the OpenAI SDK, pass the image as Base64-encoded data; with the DashScope SDK, pass the local image path.

Image count limits

For multi-image input, the number of images is bounded by the model's total image-plus-text token limit (its maximum input): the total tokens across all images must stay below the model's maximum input.

For example, with qwen-vl-max (maximum input of 30,720 tokens) and input images of 1280 x 1280 pixels each (the arithmetic is sketched after the table):

| vl_high_resolution_images | Resized image dimensions | Tokens per image | Max number of images |
| ------------------------- | ------------------------ | ---------------- | -------------------- |
| True                      | 1288 x 1288              | 2118             | 14                   |
| False                     | 980 x 980                | 1227             | 25                   |
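The last column is simply the token budget divided by the per-image cost. A quick check of the arithmetic, using the values from the table:

# qwen-vl-max accepts at most 30,720 input tokens; the per-image token
# cost depends on vl_high_resolution_images (values from the table above)
MAX_INPUT_TOKENS = 30720

def max_images(tokens_per_image):
    return MAX_INPUT_TOKENS // tokens_per_image

print(max_images(2118))  # vl_high_resolution_images=True  -> 14
print(max_images(1227))  # vl_high_resolution_images=False -> 25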

Prompt guide

Solving problems from images

Prompt tip: use "chain-of-thought" prompting for complex math problems. This technique guides the model to generate its reasoning process, or to decompose a complex task and reason step by step, so the model produces more supporting rationale before its final answer and performs better on hard problems.

Input example

Sample code

Output example

Prompt: 请你分步骤解答这道题,输出对这道题的思考判断过程。

image

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen2.5-vl-72b-instruct",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i2/O1CN01e99Hxt1evMlWM6jUL_!!6000000003933-0-tps-1294-760.jpg"},
                    {"text": "请你分步骤解答这道题,输出对这道题的思考判断过程。"}
                ]
            }
        ]
    }
}'

image

Extracting information from invoices

The Qwen-VL model can extract information from invoices, certificates, and forms and return it in structured form.

Prompt tips:

  • Use delimiters to highlight the fields to extract.

  • Specify the output format explicitly, for example JSON.

  • Explicitly forbid possible ```json``` code fences in the prompt, e.g. "请你以JSON格式输出,不要输出```json```代码段". A defensive parsing sketch follows this list.

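Even with that instruction, a defensive parser is cheap insurance. A minimal sketch (not part of any SDK) that strips an accidental fenced block before parsing the reply:

import json
import re

# Strip an accidental ```json ... ``` fence, then parse the model's reply
def parse_model_json(reply):
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", reply, re.DOTALL)
    if match:
        reply = match.group(1)
    return json.loads(reply)

print(parse_model_json('```json\n{"发票代码": "221021325353"}\n```'))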
Input example

Sample code

Output example

Prompt: 提取图中的:['发票代码','发票号码','到站','燃油费','票价','乘车日期','开车时间','车次','座号'],请你以JSON格式输出,不要输出```json```代码段。

image

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen2.5-vl-72b-instruct",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg"},
                    {"text": "提取图中的:['发票代码','发票号码','到站','燃油费','票价','乘车日期','开车时间','车次','座号'],请你以JSON格式输出,不要输出```json```代码段”。"}
                ]
            }
        ]
    }
}'
{
    "发票代码": "221021325353",
    "发票号码": "10283819",
    "到站": "开发区",
    "燃油费": "2.0",
    "票价": "8.00<全>",
    "乘车日期": "2013-06-29",
    "开车时间": "流水",
    "车次": "040",
    "座号": "371"
}

Locating objects in images (Qwen2.5-VL models only)

Qwen2.5-VL supports two localization modes: Box localization returns the top-left and bottom-right coordinates of a bounding rectangle, and Point localization returns the coordinates of the rectangle's center point. Both are absolute pixel coordinates relative to the image's top-left corner.

Within the 480 x 480 to 2560 x 2560 resolution range, Qwen2.5-VL's object localization is fairly robust; outside this range, occasional bounding-box drift may occur.
  1. Prompt tips

| Localization mode | Supported output   | Recommended prompt                                            |
| ----------------- | ------------------ | ------------------------------------------------------------- |
| Box               | JSON or plain text | 检测图中所有{物体}并以{JSON/纯文本}格式输出其bbox的坐标         |
| Point             | JSON or XML        | 以点的形式定位图中所有{物体},以{JSON/XML}格式输出其point坐标  |

  2. Prompt refinements

  • When detecting densely packed objects, a prompt like "检测图中所有人" may blur the distinction between "each person" and "all people", causing the model to output a single box enclosing everyone. Emphasize per-object detection with prompts such as:

    • Box localization: 定位图中每一个{某类物体}并描述其各自的{某种特征},以{JSON/纯文本}格式输出其bbox坐标。

    • Point localization: 以点的形式定位图中每一个{某类物体}并描述各自的{某种特征},以{JSON/XML}格式输出其point坐标。

  • The output may contain extraneous content such as ```json``` or ```xml``` fences; forbid it explicitly in the prompt, e.g. "请你以JSON格式输出,不要输出```json```代码段". A visualization sketch for Box output follows this list.

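To sanity-check Box output, the returned coordinates can be drawn back onto the source image. A minimal Pillow sketch; the file names are placeholders and reply_text stands for the model's JSON answer:

import json
from PIL import Image, ImageDraw  # install with: pip install Pillow

# bbox values are absolute pixel coordinates [x1, y1, x2, y2],
# measured from the image's top-left corner
def draw_boxes(image_path, reply_text, out_path="annotated.png"):
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for item in json.loads(reply_text):
        draw.rectangle(item["bbox"], outline="red", width=3)
    image.save(out_path)

# draw_boxes("cakes.png", reply_text)  # hypothetical input image and model reply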
Input example

Sample code

Output example

Box localization:

Prompt: 定位每一个蛋糕的位置,并描述其各自的特征,以JSON格式输出所有的bbox的坐标,不要输出```json```代码段。

image

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen2.5-vl-72b-instruct",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i3/O1CN01I1CXf21UR0Ld20Yzs_!!6000000002513-2-tps-1024-1024.png"},
                    {"text":  "用一个个框定位图像每一个蛋糕的位置并描述其各自的特征,以JSON格式输出所有的bbox的坐标,不要输出```json```代码段"}
                ]
            }
        ]
    },
    "parameters":{
        "vl_high_resolution_images": true,
        "temperature": 0,
        "top_p": 0.00000000000001,
        "top_k": 1,
        "seed": 3407
    }
}'
[
  {
    "bbox": [60, 395, 204, 578],
    "description": "巧克力蛋糕,顶部覆盖红色糖霜和彩色糖粒"
  },
  {
    "bbox": [248, 381, 372, 542],
    "description": "粉色糖霜的蛋糕,顶部有白色和蓝色的糖粒"
  },
  {
    "bbox": [400, 368, 504, 504],
    "description": "粉色糖霜的蛋糕,顶部有白色和蓝色的糖粒"
  },
  {
    "bbox": [530, 355, 654, 526],
    "description": "粉色糖霜的蛋糕,顶部有白色和蓝色的糖粒"
  },
  {
    "bbox": [432, 445, 566, 606],
    "description": "粉红色糖霜的蛋糕,顶部有两个黑色眼睛"
  },
  {
    "bbox": [630, 475, 774, 646],
    "description": "黄色糖霜的蛋糕,顶部有多种颜色的糖粒"
  },
  {
    "bbox": [740, 380, 868, 539],
    "description": "巧克力蛋糕,顶部覆盖棕色糖霜"
  },
  {
    "bbox": [796, 512, 960, 693],
    "description": "黄色糖霜的蛋糕,顶部有多种颜色的糖粒"
  },
  {
    "bbox": [39, 555, 200, 736],
    "description": "黄色糖霜的蛋糕,顶部有多种颜色的糖粒"
  },
  {
    "bbox": [292, 546, 446, 707],
    "description": "黑色蛋糕,顶部有白色糖霜和两个黑色眼睛"
  },
  {
    "bbox": [516, 564, 666, 715],
    "description": "黄色糖霜的蛋糕,顶部有两个黑色眼睛"
  },
  {
    "bbox": [352, 655, 516, 822],
    "description": "白色糖霜的蛋糕,顶部有两个黑色眼睛"
  },
  {
    "bbox": [130, 746, 304, 924],
    "description": "白色糖霜的蛋糕,顶部有两个黑色眼睛"
  }
]

Point localization:

Prompt: 以点的形式定位图中见义勇为的人,并以XML格式输出结果,不要输出```xml```代码段。

image

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen2.5-vl-72b-instruct",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i1/O1CN01ILRlNK1gvU5xqbaxb_!!6000000004204-49-tps-1138-640.webp"},
                    {"text":  "以点的形式定位图中见义勇为的人,并以XML格式输出结果,不要输出```xml```代码段。"}
                ]
            }
        ]
    },
    "parameters":{
        "vl_high_resolution_images": true,
        "temperature": 0,
        "top_k": 1,
        "seed": 3407
    }
}'
< points x1 = "284"
y1 = "305"
alt = "见义勇为的人" > 见义勇为的人 < /points>

Parsing documents into QwenVL HTML (Qwen2.5-VL models only)

The Qwen2.5-VL model can parse image-based documents (such as scans or image PDFs) into QwenVL HTML, a format that not only recognizes text accurately but also captures the positions of elements such as images and tables.

Prompt tips: you must steer the model toward QwenVL HTML in the prompt; otherwise the document is parsed into plain HTML text without position information. A sketch for post-processing the output follows this list.

  • Recommended system prompt: "You are an AI specialized in recognizing and extracting text from images. Your mission is to analyze the image document and generate the result in QwenVL Document Parser HTML format using specified tags while maintaining user privacy and data integrity."

  • Recommended user prompt: "QwenVL HTML"

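Downstream code usually wants the data-bbox attributes out of the QwenVL HTML. A minimal sketch using a naive regex; for nested tags a real HTML parser is preferable:

import re

# Pull each element's tag, data-bbox ("x1 y1 x2 y2" in absolute pixels),
# and inner text from QwenVL HTML; naive and non-recursive by design
def extract_bboxes(qwen_html):
    pattern = r'<(\w+)[^>]*data-bbox="([\d ]+)"[^>]*>(.*?)</\1>'
    results = []
    for tag, bbox, inner in re.findall(pattern, qwen_html, re.DOTALL):
        coords = [int(v) for v in bbox.split()]
        text = re.sub(r"<[^>]+>", "", inner).strip()
        results.append({"tag": tag, "bbox": coords, "text": text})
    return results

sample = '<h2 data-bbox="285 290 671 347"> 1 Introduction</h2>'
print(extract_bboxes(sample))  # [{'tag': 'h2', 'bbox': [285, 290, 671, 347], 'text': '1 Introduction'}]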
Input example

Sample code

Output example

image

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen2.5-vl-72b-instruct",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": "You are an AI specialized in recognizing and extracting text from images. Your mission is to analyze the image document and generate the result in QwenVL Document Parser HTML format using specified tags while maintaining user privacy and data integrity."
            },
            {
                "role": "user",
                "content": [
                    {"image": "https://gw.alicdn.com/imgextra/i4/O1CN01VsHGUc1EfAkPpetRE_!!6000000000378-2-tps-1430-2022.png"},
                    {"text": "qwenvl html"}
                ]
            }
        ]
    }
}'
```html
<html><body>
<h2 data-bbox="285 290 671 347"> 1 Introduction</h2> 
 <p data-bbox="285 392 2202 948">The sparks of artificial general intelligence (AGI) are increasingly visible through the fast development of large foundation models, notably large language models (LLMs) (Brown et al., 2020; OpenAI, 2023; 2024a; Gemini Team, 2024; Anthropic, 2023a;b; 2024; Bai et al., 2023; Yang et al., 2024a; Touvron et al., $2023 \mathrm{a} ; \mathrm{b}$; Dubey et al., 2024). The continuous advancement in model and data scaling, combined with the paradigm of large-scale pre-training followed by high-quality supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022), has enabled large language models (LLMs) to develop emergent capabilities in language understanding, generation, and reasoning. Building on this foundation, recent breakthroughs in inference time scaling, particularly demonstrated by o1 (OpenAl, 2024b), have enhanced LLMs’ capacity for deep thinking through step-by-step reasoning and reflection. These developments have elevated the potential of language models, suggesting they may achieve significant breakthroughs in scientific exploration as they continue to demonstrate emergent capabilities indicative of more general artificial intelligence.</p> 
 <p data-bbox="285 964 2202 1293">Besides the fast development of model capabilities, the recent two years have witnessed a burst of open (open-weight) large language models in the LLM community, for example, the Llama series (Touvron et al., 2023a;b; Dubey et al., 2024), Mistral series (Jiang et al., 2023a; 2024a), and our Owen series (Bai et al., 2023; Yang et al., 2024a; Qwen Team, 2024a; Hui et al., 2024; Qwen Team, 2024c; Yang et al., 2024b). The open-weight models have democratized the access of large language models to common users and developers, enabling broader research participation, fostering innovation through community collaboration, and accelerating the development of AI applications across diverse domains.</p> 
 <p data-bbox="285 1308 2202 1692">Recently, we release the details of our latest version of the Owen series, Owen2.5. In terms of the openweight part, we release pre-trained and instruction-tuned models of 7 sizes, including $0.5 \mathrm{~B}, 1.5 \mathrm{~B}, 3 \mathrm{~B}, 7 \mathrm{~B}$, $14 \mathrm{~B}, 32 \mathrm{~B}$, and $72 \mathrm{~B}$, and we provide not only the original models in bfloat16 precision but also the quantized models in different precisions. Specifically, the flagship model Owen2.5-72B-Instruct demonstrates competitive performance against the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Additionally, we also release the proprietary models of Mixture-of-Experts (MoE, Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022), namely Owen2.5-Turbo and Owen2.5-Plus ${ }^{1}$, which performs competitively against GPT-4o-mini and GPT-4o respectively.</p> 
 <p data-bbox="285 1705 2202 1806">In this technical report, we introduce Owen2.5, the result of our continuous endeavor to create better LLMs. Below, we show the key features of the latest version of Owen:</p> 
 <ul data-bbox="392 1842 2202 2520"><li data-bbox="392 1842 2202 2032">Better in Size: Compared with Owen2, in addition to $0.5 \mathrm{~B}, 1.5 \mathrm{~B}, 7 \mathrm{~B}$, and $72 \mathrm{~B}$ models, Owen2.5 brings back the $3 \mathrm{~B}, 14 \mathrm{~B}$, and $32 \mathrm{~B}$ models, which are more cost-effective for resource-limited scenarios and are under-represented in the current field of open foundation models. Owen2.5Turbo and Owen2.5-Plus offer a great balance among accuracy, latency, and cost.</li><li data-bbox="392 2041 2202 2323">Better in Data: The pre-training and post-training data have been improved significantly. The pre-training data increased from 7 trillion tokens to 18 trillion tokens, with focus on knowledge, coding, and mathematics. The pre-training is staged to allow transitions among different mixtures. The post-training data amounts to 1 million examples, across the stage of supervised finetuning (SFT, Ouyang et al., 2022), direct preference optimization (DPO, Rafailov et al., 2023), and group relative policy optimization (GRPO, Shao et al., 2024).</li><li data-bbox="392 2328 2202 2520">Better in Use: Several key limitations of Owen2 in use have been eliminated, including larger generation length (from $2 \mathrm{~K}$ tokens to $8 \mathrm{~K}$ tokens), better support for structured input and output, (e.g., tables and JSON), and easier tool use. In addition, Owen2.5-Turbo supports a context length of up to 1 million tokens.</li></ul> 
 <h2 data-bbox="285 2586 954 2641"> 2 Architecture &amp; Tokenizer</h2> 
 <p data-bbox="285 2688 2202 2833">Basically, the Owen2.5 series include dense models for opensource, namely Owen2.5-0.5B / 1.5B / 3B / $7 \mathrm{~B} / 14 \mathrm{~B} / 32 \mathrm{~B} / 72 \mathrm{~B}$, and MoE models for API service, namely Owen2.5-Turbo and Owen2.5-Plus. Below, we provide details about the architecture of models.</p> 
 <p data-bbox="285 2848 2202 3040">For dense models, we maintain the Transformer-based decoder architecture (Vaswani et al., 2017; Radford et al., 2018) as Owen2 (Yang et al., 2024a). The architecture incorporates several key components: Grouped Query Attention (GQA, Ainslie et al., 2023) for efficient KV cache utilization, SwiGLU activation function (Dauphin et al., 2017) for non-linear activation, Rotary Positional Embeddings (RoPE, Su</p> 
 <hr/> 
 <section class="footnotes" data-bbox="285 3067 2202 3164"><ol class="footnotes-list" data-bbox="285 3067 2202 3164"><li class="footnote-item" data-bbox="285 3067 2202 3164"><p data-bbox="285 3067 2202 3164">${ }^{1}$ Owen2.5-Turbo is identified as qwen-turbo-2024-11-01 and Owen2.5-Plus is identified as qwen-plus-2024-xx-xx (to be released) in the API.</p></li></ol></section> 
</body></html>
```

Timestamp-aware video understanding (Qwen2.5-VL models only)

The Qwen2.5-VL model can perceive temporal information, search a video for specific events, and summarize the key points of different time spans.

Prompt tips:

  • State the task clearly:

    • Specify the time range to analyze, e.g. "请你描述下列视频中的一系列事件" or "请你描述00:05:00 至 00:10:00时间段中的一系列事件".

    • Event counting, e.g. "统计视频中'知识点讲解'场景出现的次数及总时长,并记录事件的起始和结束时间戳".

    • Action or scene localization, e.g. "视频00:03:25附近5秒内是否有'选手失误'事件?要求精确到最近0.5秒".

    • Segmenting long videos, e.g. "将下列2小时会议视频按每3分钟生成一个摘要(含时间戳),重点标注'提问环节'和'决议通过'事件".

  • Specify output requirements and format:

    • JSON structure constraints: ask the model to return timestamps (start_time and end_time), the event type (category), and the event itself (event) in JSON.

    • Timestamp representation: "请同时用HH:mm:ss和秒数(如:20秒)表示时间戳" (a conversion sketch follows this list).

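For the dual time format, the second-based timestamps the model returns can also be converted locally rather than requested twice. A minimal sketch:

# Convert a timestamp in seconds to "HH:mm:ss (Ns)" form
def format_timestamp(seconds):
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d} ({seconds}s)"

print(format_timestamp(8.3))  # 00:00:08 (8.3s)
print(format_timestamp(305))  # 00:05:05 (305s)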
Input example

Sample code

Output example

Prompt: 请你描述下视频中的一系列活动事件,以JSON格式输出开始时间(start_time)、结束时间(end_time)、事件(event),不要输出```json```代码段。

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen2.5-vl-72b-instruct",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"video": "https://cloud.video.taobao.com/vod/C6gCj5AJ3Qrd_UQ9kaMVRY9Ig9G-WToxVYSPRdNXCao.mp4","fps": 2.0},
                    {"text": "请你描述下视频中的一系列活动事件,以JSON格式输出开始时间(start_time)、结束事件(end_time)、事件(event)。"}
                ]
            }
        ]
    }
}'
[
    {
        "start_time": "0.0",
        "end_time": "2.4",
        "event": "一个穿着绿色衬衫和米色裤子的人拿着一个纸箱走进镜头。"
    },
    {
        "start_time": "2.4",
        "end_time": "5.1",
        "event": "这个人将纸箱放在桌子上,然后拿起一个条形码扫描器。"
    },
    {
        "start_time": "5.1",
        "end_time": "8.3",
        "event": "他用条形码扫描器扫描了纸箱上的标签,并将纸箱移到一边。"
    },
    {
        "start_time": "8.3",
        "end_time": "10.9",
        "event": "他放下条形码扫描器,拿起一支笔,在桌子上的笔记本上记录了一些信息。"
    }
]

API reference

For the input and output parameters of the Qwen-VL models, see Qwen.

FAQ

  • Do I need to delete uploaded images manually?

    No. After the model finishes generating text, the Model Studio server deletes the images automatically; no manual deletion is needed.

  • Does the Qwen-VL model support submitting tasks in batches?

    Currently the qwen-vl-max and qwen-vl-plus models are compatible with the OpenAI Batch API and support submitting tasks in bulk as files. Tasks run asynchronously and return results within 24 hours. Batch calls cost 50% of real-time calls.

  • Can the Qwen-VL model process text files such as PDF, XLSX, XLS, or DOC?

    No. Qwen-VL is a visual understanding model: it can only process image files, not text files. You can use the Qwen-Long model to parse document content.

  • Does the Qwen-VL model support video understanding?

    Yes; see Video understanding.

  • Can the Qwen-VL model solve math problems in images?

    Yes. There are currently two approaches:

    • Use the visual reasoning QVQ model: leverage QVQ's reasoning ability to solve the math problem in the image.

    • Use the general-purpose Qwen-VL model:

      • For simple math problems, use Qwen-VL directly;

      • For complex math problems, first use Qwen-VL's OCR capability to extract the problem from the image, then solve it with the Qwen math model.

  • How is the Qwen-VL model rate-limited and billed?

    • Rate limiting

      See Rate limits for the Qwen-VL model's rate limits. An Alibaba Cloud account and its RAM users share the same limits.

    • Free quota

      The validity period starts on the day you activate Model Studio or your model application is approved. Within the 180-day validity period, the Qwen-VL models provide a free quota of 1,000,000 or 100,000 tokens, depending on the model; see the model list for details.

    • Checking a model's remaining quota

      On the Models page of the Alibaba Cloud Model Studio console, find the Qwen-VL model and click to view details to see the free quota, remaining quota, and expiry time. If no free quota is shown, the model's free quota under your account has expired.

    • Billing

      Total cost = input tokens x input unit price + output tokens x output unit price, where images convert to tokens at one token per 28x28 pixels, with a minimum of 4 tokens per image. A worked example follows the FAQ list.

    • Viewing bills

      You can view bills or top up on the Expenses and Costs page of the Alibaba Cloud console.

    • For more billing questions, see Billable items.

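As a worked example of the billing formula, using qwen-vl-max's listed prices (0.003 per 1,000 input tokens, 0.009 per 1,000 output tokens):

# Total cost = input tokens x input unit price + output tokens x output unit price
def cost(input_tokens, output_tokens, in_price=0.003, out_price=0.009):
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

print(cost(30000, 2000))  # 30000/1000*0.003 + 2000/1000*0.009 = 0.108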
Error codes

If a model call fails and returns an error message, see Error messages for resolution.
