Visual understanding (Qwen-VL)

Updated: 2025-08-28 10:03:12

Qwen-VL models can answer questions about the images you pass in.

Use cases

  • Image Q&A: describe, classify, or tag what is in an image, such as people, places, plants, and animals.

  • Math problem solving: solve math problems shown in an image, from primary and secondary school through university and adult education.

  • Video understanding: analyze video content, for example locating a specific event and returning its timestamp, or summarizing key time segments.

  • Object localization: locate objects in an image and return either the top-left and bottom-right coordinates of the bounding box or the center-point coordinates.

  • Document parsing: parse image-based documents (such as scans or image-only PDFs) into the QwenVL HTML format, which not only recognizes text accurately but also captures the positions of images, tables, and other elements.

  • Text recognition and information extraction: recognize text and formulas in images, or extract information from receipts, ID documents, and forms, with support for formatted text output. Supported languages: Chinese, English, Japanese, Korean, Arabic, Vietnamese, French, German, Italian, Spanish, and Russian.

Visit Vision Models to try image understanding online. For better results, choose the recommended prompts from the application examples according to your actual business needs.

Model list and pricing

Alibaba Cloud Model Studio provides both commercial and open-source models; compared with the open-source models, the commercial models are continuously updated and upgraded and carry the latest capabilities.

The qwen-vl-max and qwen-vl-plus models already support context cache (Context Cache) and structured output.
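As a minimal sketch of the structured-output capability through the OpenAI-compatible endpoint (the response_format JSON-object mode is an assumption here; see the structured output documentation for the authoritative usage):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"}},
            # JSON mode only guarantees valid JSON; describe the keys you want in the prompt.
            {"type": "text", "text": "请以JSON格式描述这张图片,包含键\"subject\"和\"scene\"。"},
        ],
    }],
    response_format={"type": "json_object"},  # assumption: structured (JSON) output switch
)
print(completion.choices[0].message.content)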

Commercial models

Qwen-VL-Max series

The most capable models in the Qwen-VL series.

| Model | Version | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input price (per 1K tokens) | Output price (per 1K tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-max: further improves visual reasoning and instruction following over qwen-vl-plus, with best performance on more complex tasks; currently identical in capability to qwen-vl-max-2025-04-08 | Stable | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.003 CNY (half price for batch calls) | 0.009 CNY (half price for batch calls) | 1 million tokens |
| qwen-vl-max-latest: always identical in capability to the latest snapshot | Latest | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0016 CNY (half price for batch calls) | 0.004 CNY (half price for batch calls) | 1 million tokens |
| qwen-vl-max-2025-08-13 (a.k.a. qwen-vl-max-0813): across-the-board gains in visual understanding, with markedly stronger math, reasoning, object recognition, and multilingual capabilities | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0016 CNY | 0.004 CNY | 1 million tokens |
| qwen-vl-max-2025-04-08 (a.k.a. qwen-vl-max-0408): enhanced math and reasoning | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.003 CNY | 0.009 CNY | 1 million tokens |
| qwen-vl-max-2025-04-02 (a.k.a. qwen-vl-max-0402): significantly more accurate on complex math problems | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.003 CNY | 0.009 CNY | 1 million tokens |
| qwen-vl-max-2025-01-25 (a.k.a. qwen-vl-max-0125): upgraded to the Qwen2.5-VL series, with context extended to 128K and much stronger image and video understanding | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.003 CNY | 0.009 CNY | 1 million tokens |

Free quotas are valid for 180 days after activating Model Studio.
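For a sense of scale: the quick-start request later on this page reports 1,270 input tokens and 54 output tokens, which at qwen-vl-max's stable-version prices comes to 1.270 × 0.003 + 0.054 × 0.009 ≈ 0.0043 CNY per call.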

Qwen-VL-Plus series

Qwen-VL-Plus models offer a good balance of quality and cost.

| Model | Version | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input price (per 1K tokens) | Output price (per 1K tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-plus: currently identical in capability to qwen-vl-plus-2025-05-07 | Stable | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0015 CNY (half price for batch calls) | 0.0045 CNY (half price for batch calls) | 1 million tokens |
| qwen-vl-plus-latest: always identical in capability to the latest snapshot | Latest | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0008 CNY (half price for batch calls) | 0.002 CNY (half price for batch calls) | 1 million tokens |
| qwen-vl-plus-2025-08-15 (a.k.a. qwen-vl-plus-0815): markedly better object recognition and localization and multilingual capability | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0008 CNY | 0.002 CNY | 1 million tokens |
| qwen-vl-plus-2025-07-10 (a.k.a. qwen-vl-plus-0710): further improves understanding of surveillance-video content | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 8,192 | 0.00015 CNY | 0.0015 CNY | 1 million tokens |
| qwen-vl-plus-2025-05-07 (a.k.a. qwen-vl-plus-0507): significantly improves math, reasoning, and surveillance-video understanding | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0015 CNY | 0.0045 CNY | 1 million tokens |
| qwen-vl-plus-2025-01-25 (a.k.a. qwen-vl-plus-0125): upgraded to the Qwen2.5-VL series, with context extended to 128K and much stronger image and video understanding | Snapshot | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0015 CNY | 0.0045 CNY | 1 million tokens |

Free quotas are valid for 180 days after activating Model Studio.

More historical snapshot models

Qwen-VL-Max series

| Model | Version | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input price (per 1K tokens) | Output price (per 1K tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-max-2024-12-30 (a.k.a. qwen-vl-max-1230) | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.003 CNY | 0.009 CNY | 1 million tokens |
| qwen-vl-max-2024-11-19 (a.k.a. qwen-vl-max-1119) | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.003 CNY | 0.009 CNY | 1 million tokens |
| qwen-vl-max-2024-10-30 (a.k.a. qwen-vl-max-1030) | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.02 CNY | 0.02 CNY | 1 million tokens |
| qwen-vl-max-2024-08-09 (a.k.a. qwen-vl-max-0809) | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.02 CNY | 0.02 CNY | 1 million tokens |

Free quotas are valid for 180 days after activating Model Studio.

Qwen-VL-Plus series

| Model | Version | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input price (per 1K tokens) | Output price (per 1K tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-plus-2025-01-02 (a.k.a. qwen-vl-plus-0102) | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.0015 CNY | 0.0045 CNY | 1 million tokens |
| qwen-vl-plus-2024-08-09 (a.k.a. qwen-vl-plus-0809) | Snapshot | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.0015 CNY | 0.0045 CNY | 1 million tokens |

Free quotas are valid for 180 days after activating Model Studio.

Open-source models

The qvq-72b-preview model is an experimental research model developed by the Qwen team that focuses on improving visual reasoning, especially mathematical reasoning. For the model's limitations, see the official QVQ blog. Usage | API reference

If you want the model to output its thinking process before the final answer, use the commercial QVQ models.

| Model | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input price (per 1K tokens) | Output price (per 1K tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| qvq-72b-preview | 32,768 | 16,384 (up to 16,384 per image) | 16,384 | 0.012 CNY | 0.036 CNY | 100K tokens |

The free quota is valid for 180 days after activating Model Studio.

| Model | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input price (per 1K tokens) | Output price (per 1K tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2.5-vl-72b-instruct | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.016 CNY | 0.048 CNY | 1 million tokens |
| qwen2.5-vl-32b-instruct | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.008 CNY | 0.024 CNY | 1 million tokens |
| qwen2.5-vl-7b-instruct | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.002 CNY | 0.005 CNY | 1 million tokens |
| qwen2.5-vl-3b-instruct | 131,072 | 129,024 (up to 16,384 per image) | 8,192 | 0.0012 CNY | 0.0036 CNY | 1 million tokens |
| qwen2-vl-72b-instruct | 32,768 | 30,720 (up to 16,384 per image) | 2,048 | 0.016 CNY | 0.048 CNY | 1 million tokens |
| qwen2-vl-7b-instruct | 32,000 | 30,000 (up to 16,384 per image) | 2,000 | Free trial only | Free trial only | 100K tokens |
| qwen2-vl-2b-instruct | 32,000 | 30,000 (up to 16,384 per image) | 2,000 | Free for a limited time | Free for a limited time | 100K tokens |
| qwen-vl-v1 | 8,000 | 6,000 (up to 1,280 per image) | 1,500 | Free trial only | Free trial only | 100K tokens |
| qwen-vl-chat-v1 | 8,000 | 6,000 (up to 1,280 per image) | 1,500 | Free trial only | Free trial only | 100K tokens |

Free-trial-only models cannot be called once the free quota is exhausted; switch to qwen-vl-max or qwen-vl-plus instead. Free quotas are valid for 180 days after activating Model Studio.

Rules for converting images and videos to tokens

Image

Every 28×28-pixel patch corresponds to one token, and an image needs at least 4 tokens. You can estimate an image's token count with the following code:

import math
# Install the Pillow library first: pip install Pillow
from PIL import Image

def token_calculate(image_path):
    # Open the specified image file
    image = Image.open(image_path)

    # Get the original dimensions of the image
    height = image.height
    width = image.width

    # Round the height to the nearest multiple of 28
    h_bar = round(height / 28) * 28
    # Round the width to the nearest multiple of 28
    w_bar = round(width / 28) * 28

    # Lower bound on image tokens: 4 tokens
    min_pixels = 28 * 28 * 4
    # Upper bound on image tokens: 1280 tokens
    max_pixels = 1280 * 28 * 28

    # Scale the image so that its total pixel count falls within [min_pixels, max_pixels]
    if h_bar * w_bar > max_pixels:
        # Compute the scaling factor beta so that the scaled image has at most max_pixels pixels
        beta = math.sqrt((height * width) / max_pixels)
        # Recompute the height, keeping it a multiple of 28
        h_bar = math.floor(height / beta / 28) * 28
        # Recompute the width, keeping it a multiple of 28
        w_bar = math.floor(width / beta / 28) * 28
    elif h_bar * w_bar < min_pixels:
        # Compute the scaling factor beta so that the scaled image has at least min_pixels pixels
        beta = math.sqrt(min_pixels / (height * width))
        # Recompute the height, keeping it a multiple of 28
        h_bar = math.ceil(height * beta / 28) * 28
        # Recompute the width, keeping it a multiple of 28
        w_bar = math.ceil(width * beta / 28) * 28
    return h_bar, w_bar

# Replace test.png with the path to your local image
h_bar, w_bar = token_calculate("test.png")
print(f"Resized image dimensions: height {h_bar}, width {w_bar}")

# Image token count: total pixels divided by 28 * 28
token = int((h_bar * w_bar) / (28 * 28))

# The system automatically adds the <|vision_bos|> and <|vision_eos|> vision markers (1 token each)
print(f"Image token count: {token + 2}")
// Install sharp first: npm install sharp
import sharp from 'sharp';

async function tokenCalculate(imagePath) {
    // Open the specified image file
    const image = sharp(imagePath);
    const metadata = await image.metadata();

    // Get the original dimensions of the image
    const height = metadata.height;
    const width = metadata.width;

    // Round the height to the nearest multiple of 28
    let hBar = Math.round(height / 28) * 28;
    // Round the width to the nearest multiple of 28
    let wBar = Math.round(width / 28) * 28;

    // Lower bound on image tokens: 4 tokens
    const minPixels = 28 * 28 * 4;
    // Upper bound on image tokens: 1280 tokens
    const maxPixels = 1280 * 28 * 28;

    // Scale the image so that its total pixel count falls within [minPixels, maxPixels]
    if (hBar * wBar > maxPixels) {
        // Compute the scaling factor beta so that the scaled image has at most maxPixels pixels
        const beta = Math.sqrt((height * width) / maxPixels);
        // Recompute the height, keeping it a multiple of 28
        hBar = Math.floor(height / beta / 28) * 28;
        // Recompute the width, keeping it a multiple of 28
        wBar = Math.floor(width / beta / 28) * 28;
    } else if (hBar * wBar < minPixels) {
        // Compute the scaling factor beta so that the scaled image has at least minPixels pixels
        const beta = Math.sqrt(minPixels / (height * width));
        // Recompute the height, keeping it a multiple of 28
        hBar = Math.ceil(height * beta / 28) * 28;
        // Recompute the width, keeping it a multiple of 28
        wBar = Math.ceil(width * beta / 28) * 28;
    }

    return { hBar, wBar };
}

// Replace test.png with the path to your local image
const imagePath = 'test.png';
tokenCalculate(imagePath).then(({ hBar, wBar }) => {
    console.log(`Resized image dimensions: height ${hBar}, width ${wBar}`);

    // Image token count: total pixels divided by 28 * 28
    const token = Math.floor((hBar * wBar) / (28 * 28));

    // The system automatically adds the <|vision_bos|> and <|vision_eos|> vision markers (1 token each)
    console.log(`Total image tokens: ${token + 2}`);
}).catch(err => {
    console.error('Error processing image:', err);
});
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class Main {

    // Holder class for the resized dimensions
    public static class ResizedSize {
        public final int height;
        public final int width;

        public ResizedSize(int height, int width) {
            this.height = height;
            this.width = width;
        }
    }

    public static ResizedSize smartResize(String imagePath) throws IOException {
        // 1. Load the image
        BufferedImage image = ImageIO.read(new File(imagePath));
        if (image == null) {
            throw new IOException("Could not load image file: " + imagePath);
        }

        int originalHeight = image.getHeight();
        int originalWidth = image.getWidth();

        final int minPixels = 28 * 28 * 4;
        final int maxPixels = 1280 * 28 * 28;
        // 2. Round to multiples of 28
        int hBar = (int) (Math.round(originalHeight / 28.0) * 28);
        int wBar = (int) (Math.round(originalWidth / 28.0) * 28);
        int currentPixels = hBar * wBar;

        // 3. Scale if needed
        if (currentPixels > maxPixels) {
            // Pixel count above the maximum: scale down
            double beta = Math.sqrt(
                    (originalHeight * (double) originalWidth) / maxPixels
            );
            double scaledHeight = originalHeight / beta;
            double scaledWidth = originalWidth / beta;

            hBar = (int) (Math.floor(scaledHeight / 28) * 28);
            wBar = (int) (Math.floor(scaledWidth / 28) * 28);
        } else if (currentPixels < minPixels) {
            // Pixel count below the minimum: scale up
            double beta = Math.sqrt(
                    (double) minPixels / (originalHeight * originalWidth)
            );
            double scaledHeight = originalHeight * beta;
            double scaledWidth = originalWidth * beta;

            hBar = (int) (Math.ceil(scaledHeight / 28) * 28);
            wBar = (int) (Math.ceil(scaledWidth / 28) * 28);
        }

        return new ResizedSize(hBar, wBar);
    }

    public static void main(String[] args) {
        try {
            ResizedSize size = smartResize(
                    // Replace xxx/test.png with the path to your image
                    "xxx/test.png"
            );

            System.out.printf("Resized image dimensions: height %d, width %d%n", size.height, size.width);

            // Token count (total pixels / 28×28, plus 2)
            int token = (size.height * size.width) / (28 * 28) + 2;
            System.out.printf("Total image tokens: %d%n", token);

        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
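To make the rule concrete: a 1280×720 image rounds to 1288×728 (the nearest multiples of 28), which falls between the pixel bounds, so it needs 1288 × 728 / (28 × 28) = 1196 tokens, plus 2 for the vision markers, 1198 in total.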

Video

You can estimate a video's token count with the following code:

# Install first: pip install opencv-python
import math
import os
import logging
import cv2

logger = logging.getLogger(__name__)

FRAME_FACTOR = 2
IMAGE_FACTOR = 28
# Maximum aspect ratio of a video frame
MAX_RATIO = 200

# Lower bound on tokens per video frame
VIDEO_MIN_PIXELS = 128 * 28 * 28
# Upper bound on tokens per video frame
VIDEO_MAX_PIXELS = 768 * 28 * 28

# Default fps, used when the caller does not pass an FPS parameter
FPS = 2.0
# Minimum number of frames to sample
FPS_MIN_FRAMES = 4
# Maximum number of frames to sample; set FPS_MAX_FRAMES to 512 for qwen2.5-vl models and 80 for other models
FPS_MAX_FRAMES = 512

# Maximum total pixels of the video input;
# set VIDEO_TOTAL_PIXELS to 65536 * 28 * 28 for qwen2.5-vl models and 24576 * 28 * 28 for other models
VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 65536 * 28 * 28)))

def round_by_factor(number: int, factor: int) -> int:
    """Return the integer closest to "number" that is divisible by "factor"."""
    return round(number / factor) * factor

def ceil_by_factor(number: int, factor: int) -> int:
    """Return the smallest integer greater than or equal to "number" that is divisible by "factor"."""
    return math.ceil(number / factor) * factor

def floor_by_factor(number: int, factor: int) -> int:
    """Return the largest integer less than or equal to "number" that is divisible by "factor"."""
    return math.floor(number / factor) * factor

def smart_nframes(ele, total_frames, video_fps):
    """Compute how many video frames to sample.

    Args:
        ele (dict): video configuration
            - fps: controls how many frames are extracted as model input.
        total_frames (int): original total frame count of the video.
        video_fps (int | float): original frame rate of the video.

    Raises:
        ValueError: nframes must lie in the interval [FRAME_FACTOR, total_frames].

    Returns:
        The number of video frames used as model input.
    """
    assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
    fps = ele.get("fps", FPS)
    min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
    max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
    duration = total_frames / video_fps if video_fps != 0 else 0
    if duration - int(duration) > (1 / fps):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration) * video_fps)
    nframes = total_frames / video_fps * fps
    if nframes > total_frames:
        logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes and nframes <= total_frames):
        raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")

    return nframes

def get_video(video_path):
    # Read the video's metadata
    cap = cv2.VideoCapture(video_path)
    # Frame width
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    # Frame height
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return frame_height, frame_width, total_frames, video_fps

def smart_resize(ele, path, factor=IMAGE_FACTOR):
    # Get the original width and height of the video
    height, width, total_frames, video_fps = get_video(path)
    # Lower bound on tokens per video frame
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    # Number of frames to sample
    nframes = smart_nframes(ele, total_frames, video_fps)
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))

    # The aspect ratio must not exceed 200:1 or 1:200
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(
            f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
        )

    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar


def token_calculate(video_path, fps):
    # Pass in the video path and the fps sampling parameter
    messages = [{"content": [{"video": video_path, "fps": fps}]}]
    vision_infos = extract_vision_info(messages)[0]

    resized_height, resized_width = smart_resize(vision_infos, video_path)

    height, width, total_frames, video_fps = get_video(video_path)
    num_frames = smart_nframes(vision_infos, total_frames, video_fps)
    print(f"Original size: {height}*{width}, size fed to the model: {resized_height}*{resized_width}, "
          f"total frames: {total_frames}, frames sampled at fps={fps}: {num_frames}", end=", ")
    video_token = int(math.ceil(num_frames / 2) * resized_height / 28 * resized_width / 28)
    video_token += 2  # The system adds the <|vision_bos|> and <|vision_eos|> vision markers (1 token each)
    return video_token

def extract_vision_info(conversations):
    vision_infos = []
    if isinstance(conversations[0], dict):
        conversations = [conversations]
    for conversation in conversations:
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if (
                        "image" in ele
                        or "image_url" in ele
                        or "video" in ele
                        or ele.get("type", "") in ("image", "image_url", "video")
                    ):
                        vision_infos.append(ele)
    return vision_infos


video_token = token_calculate("xxx/test.mp4", 1)
print("Video tokens:", video_token)
// Install first: npm install node-ffprobe @ffprobe-installer/ffprobe
import ffprobeInstaller from '@ffprobe-installer/ffprobe';
import probe from "node-ffprobe";
// Point node-ffprobe at the bundled ffprobe binary (global configuration)
probe.FFPROBE_PATH = ffprobeInstaller.path;


// Read the video metadata
async function getVideoInfo(videoPath) {
  try {
    const probeData = await probe(videoPath);
    const videoStream = probeData.streams.find(
      stream => stream.codec_type === 'video'
    );

    if (!videoStream) {
      throw new Error('No video stream found in the file');
    }

    const width = videoStream.width;
    const height = videoStream.height;
    const totalFrames = Number(videoStream.nb_frames);
    const [numerator, denominator] = videoStream.avg_frame_rate.split('/');
    const frameRate = parseFloat(numerator) / parseFloat(denominator);

    return {
      width,
      height,
      totalFrames,
      frameRate
    };
  } catch (error) {
    console.error('Failed to read video info:', error);
    throw error;
  }
}

// Configuration
const FRAME_FACTOR = 2;
const IMAGE_FACTOR = 28;
// Maximum aspect ratio of a video frame
const MAX_RATIO = 200;
// Lower bound on tokens per video frame
const VIDEO_MIN_PIXELS = 128 * 28 * 28;
// Upper bound on tokens per video frame
const VIDEO_MAX_PIXELS = 768 * 28 * 28;
// Default fps, used when the caller does not pass an FPS parameter
const FPS = 2.0;
// Minimum number of frames to sample
const FPS_MIN_FRAMES = 4;
// Maximum number of frames to sample; set FPS_MAX_FRAMES to 512 for qwen2.5-vl models and 80 for other models
const FPS_MAX_FRAMES = 512;
// Maximum total pixels of the video input;
// set VIDEO_TOTAL_PIXELS to 65536 * 28 * 28 for qwen2.5-vl models and 24576 * 28 * 28 for other models
const VIDEO_TOTAL_PIXELS = parseInt(process.env.VIDEO_MAX_PIXELS) || 65536 * 28 * 28;

// Math helpers
function roundByFactor(number, factor) {
    return Math.round(number / factor) * factor;
}

function ceilByFactor(number, factor) {
    return Math.ceil(number / factor) * factor;
}

function floorByFactor(number, factor) {
    return Math.floor(number / factor) * factor;
}

// Compute how many frames to sample
function smartNFrames(ele, totalFrames, frameRate) {
    const fps = ele.fps || FPS;
    const minFrames = ceilByFactor(ele.min_frames || FPS_MIN_FRAMES, FRAME_FACTOR);
    const maxFrames = floorByFactor(
        ele.max_frames || Math.min(FPS_MAX_FRAMES, totalFrames),
        FRAME_FACTOR
    );
    const duration = frameRate !== 0 ? totalFrames / frameRate : 0;

    const totalFramesAdjusted = duration % 1 > (1 / fps)
        ? Math.ceil(duration * frameRate)
        : Math.ceil(Math.floor(duration) * frameRate);

    const nframes = (totalFramesAdjusted / frameRate) * fps;
    const finalNFrames = Math.floor(Math.min(
        Math.max(nframes, minFrames),
        Math.min(maxFrames, totalFramesAdjusted)
    ));

    if (finalNFrames < FRAME_FACTOR || finalNFrames > totalFramesAdjusted) {
        throw new Error(
            `nframes should be between ${FRAME_FACTOR} and ${totalFramesAdjusted}, got ${finalNFrames}`
        );
    }
    return finalNFrames;
}

// Smart resolution adjustment
async function smartResize(ele, videoPath) {
    const {height, width, totalFrames, frameRate} = await getVideoInfo(videoPath);
    const minPixels = VIDEO_MIN_PIXELS;
    const nframes = smartNFrames(ele, totalFrames, frameRate)
    const maxPixels = Math.max(
        Math.min(VIDEO_MAX_PIXELS, VIDEO_TOTAL_PIXELS / nframes * FRAME_FACTOR),
        Math.floor(minPixels * 1.05)
    );

    // Check the aspect ratio
    const ratio = Math.max(height, width) / Math.min(height, width);
    if (ratio > MAX_RATIO) {
        throw new Error(`Aspect ratio ${ratio} exceeds ${MAX_RATIO}`);
    }

    let hBar = Math.max(IMAGE_FACTOR, roundByFactor(height, IMAGE_FACTOR));
    let wBar = Math.max(IMAGE_FACTOR, roundByFactor(width, IMAGE_FACTOR));

    if (hBar * wBar > maxPixels) {
        const beta = Math.sqrt((height * width) / maxPixels);
        hBar = floorByFactor(height / beta, IMAGE_FACTOR);
        wBar = floorByFactor(width / beta, IMAGE_FACTOR);
    } else if (hBar * wBar < minPixels) {
        const beta = Math.sqrt(minPixels / (height * width));
        hBar = ceilByFactor(height * beta, IMAGE_FACTOR);
        wBar = ceilByFactor(width * beta, IMAGE_FACTOR);
    }

    return { hBar, wBar };
}

// Compute the token count
async function tokenCalculate(videoPath, fps) {
    const messages = [{ content: [{ video: videoPath, fps }] }];
    const visionInfos = extractVisionInfo(messages);

    const { hBar, wBar } = await smartResize(visionInfos[0], videoPath);
    const { height, width, totalFrames, frameRate } = await getVideoInfo(videoPath);
    const numFrames = smartNFrames(visionInfos[0], totalFrames, frameRate);

    console.log(
        `Original size: ${height}*${width}, size fed to the model: ${hBar}*${wBar}, total frames: ${totalFrames}, frames sampled at fps=${fps}: ${numFrames}`
    );

    const videoToken = Math.ceil(numFrames / 2) * Math.floor(hBar / 28) * Math.floor(wBar / 28) + 2;
    return videoToken;
}

// Extract the vision elements from the messages
function extractVisionInfo(conversations) {
    const visionInfos = [];
    if (!Array.isArray(conversations)) {
        conversations = [conversations];
    }
    conversations.forEach(conversation => {
        if (!Array.isArray(conversation)) {
            conversation = [conversation];
        }
        conversation.forEach(message => {
            if (Array.isArray(message.content)) {
                message.content.forEach(ele => {
                    if (ele.image || ele.image_url || ele.video || ['image', 'image_url', 'video'].includes(ele.type)) {
                        visionInfos.push(ele);
                    }
                });
            }
        });
    });
    return visionInfos;
}

// Usage example
(async () => {
    try {
        const videoPath = "xxx/test.mp4"; // replace with your local path
        const videoToken = await tokenCalculate(videoPath, 1);
        console.log('Video tokens:', videoToken);
    } catch (error) {
        console.error('Error:', error.message);
    }
})();
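As a rough worked example with the qwen2.5-vl defaults above: a 10-second, 30 fps, 1280×720 video sampled at fps=2 yields 20 frames; the per-frame pixel budget then works out to 768×28×28, so each frame is resized to 1008×560 (36×20 = 720 patches), giving ceil(20/2) × 720 + 2 = 7202 tokens.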

Model selection advice

  • Qwen-VL-Max models have the strongest visual understanding; Qwen-VL-Plus models balance quality and cost. If you are not yet sure which model to use, try Qwen-VL-Plus first.

  • For complex mathematical reasoning in images, use the QVQ models. QVQ is a visual reasoning model that supports visual input and chain-of-thought output, and it performs better on math, coding, visual analysis, creative, and general tasks.

  • For text-extraction tasks, use the Qwen-OCR model. Qwen-OCR is a dedicated text-extraction model that recognizes many scripts and focuses on extracting text from documents, tables, exam questions, handwriting, and similar images.

How to use

Prerequisites

You have obtained an API key and, as the samples below assume, configured it in the DASHSCOPE_API_KEY environment variable; to run the SDK samples, install the latest OpenAI or DashScope SDK.

Quick start

Below is sample code for understanding an online image (specified by URL, not a local image). See Passing local files for local images, and Image limits for constraints.

OpenAI-compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-max-latest", # 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages=[
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
                    },
                },
                {"type": "text", "text": "图中描绘的是什么景象?"},
            ],
        },
    ],
)
print(completion.choices[0].message.content)

Sample response

这是一张在海滩上拍摄的照片。照片中,一个人和一只狗坐在沙滩上,背景是大海和天空。人和狗似乎在互动,狗的前爪搭在人的手上。阳光从画面的右侧照射过来,给整个场景增添了一种温暖的氛围。

Node.js

import OpenAI from "openai";

const openai = new OpenAI({
  // If the environment variable is not configured, replace the next line with your Model Studio API key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
});

async function main() {
  const response = await openai.chat.completions.create({
    model: "qwen-vl-max-latest",  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages: [{
        role: "system",
        content: [{
          type: "text",
          text: "You are a helpful assistant."
        }]
      },
      {
        role: "user",
        content: [{
            type: "image_url",
            image_url: {
              "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
            }
          },
          {
            type: "text",
            text: "图中描绘的是什么景象?"
          }
        ]
      }
    ]
  });
  console.log(response.choices[0].message.content);
}
main()

Sample response

这是一张在海滩上拍摄的照片。照片中,一位穿着格子衬衫的女性坐在沙滩上,与一只戴着项圈的黄色拉布拉多犬互动。背景是大海和天空,阳光洒在她们身上,营造出温暖的氛围。

curl

curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen-vl-max",
  "messages": [
  {"role":"system",
  "content":[
    {"type": "text", "text": "You are a helpful assistant."}]},
  {
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
      {"type": "text", "text": "图中描绘的是什么景象?"}
    ]
  }]
}'

Sample response

{
  "choices": [
    {
      "message": {
        "content": "这张图片展示了一位女士和一只狗在海滩上互动。女士坐在沙滩上,微笑着与狗握手。背景是大海和天空,阳光洒在她们身上,营造出温暖的氛围。狗戴着项圈,显得很温顺。",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1270,
    "completion_tokens": 54,
    "total_tokens": 1324
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen-vl-max",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope
messages = [
{
    "role": "system",
    "content": [
    {"text": "You are a helpful assistant."}]
},
{
    "role": "user",
    "content": [
    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
    {"text": "图中描绘的是什么景象?"}]
}]
response = dashscope.MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key = os.getenv('DASHSCOPE_API_KEY'),
    model = 'qwen-vl-max-latest',  # qwen-vl-max-latest is used here as an example; change the model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    messages = messages
)
print(response.output.choices[0].message.content[0]["text"])

Sample response

是一张在海滩上拍摄的照片。照片中有一位女士和一只狗。女士坐在沙滩上,微笑着与狗互动。狗戴着项圈,似乎在与女士握手。背景是大海和天空,阳光洒在她们身上,营造出温馨的氛围。

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "图中描绘的是什么景象?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                 // If the environment variable is not configured, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。人和狗面对面坐着,似乎在互动。背景是大海和天空,阳光洒在他们身上,营造出温暖的氛围。

curl

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {"role": "system",
	     "content": [
	       {"text": "You are a helpful assistant."}]},
            {
             "role": "user",
             "content": [
               {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
               {"text": "图中描绘的是什么景象?"}
                ]
            }
        ]
    }
}'

Sample response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。他们坐在沙滩上,背景是大海和天空。阳光从画面的右侧照射过来,给整个场景增添了一种温暖的氛围。"
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 55,
    "input_tokens": 1271,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Multi-round conversation (using conversation history)

Qwen-VL models can use previous conversation turns for multi-round conversation. You maintain a messages array and append each turn's conversation history, plus the new instruction, to it.

OpenAI-compatible

Python

from openai import OpenAI
import os

client = OpenAI(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
                },
            },
            {"type": "text", "text": "图中描绘的是什么景象?"},
        ],
    }
]
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",  # 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages=messages,
    )
print(f"第一轮输出:{completion.choices[0].message.content}")
assistant_message = completion.choices[0].message
messages.append(assistant_message.model_dump())
messages.append({
        "role": "user",
        "content": [
        {
            "type": "text",
            "text": "做一首诗描述这个场景"
        }
        ]
    })
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=messages,
    )
print(f"第二轮输出:{completion.choices[0].message.content}")

Sample response

第一轮输出:这是一张在海滩上拍摄的照片。照片中,一位穿着格子衬衫的女士坐在沙滩上,与一只戴着项圈的金毛犬互动。背景是大海和天空,阳光洒在她们身上,营造出温暖的氛围。
第二轮输出:沙滩上,阳光洒,
女子与犬,笑语哗。
海浪轻拍,风儿吹,
快乐时光,心儿醉。

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If the environment variable is not configured, replace the next line with your Model Studio API key: apiKey: "sk-xxx",
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

let messages = [
    {
	role: "system",
	content: [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        role: "user",
	content: [
        { type: "image_url", image_url: { "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg" } },
        { type: "text", text: "图中描绘的是什么景象?" },
    ]
}]
async function main() {
    let response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
        messages: messages
    });
    console.log(`第一轮输出:${response.choices[0].message.content}`);
    messages.push(response.choices[0].message);
    messages.push({"role": "user", "content": "做一首诗描述这个场景"});
    response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: messages
    });
    console.log(`第二轮输出:${response.choices[0].message.content}`);
}

main()

Sample response

第一轮输出:这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。人和狗面对面坐着,似乎在互动。背景是大海和天空,阳光从画面的右侧照射过来,营造出温暖的氛围。
第二轮输出:沙滩上,人与狗,  
面对面,笑语稠。  
海风轻拂,阳光柔,  
心随波浪,共潮头。  

项圈闪亮,情意浓,  
格子衫下,心相通。  
海天一色,无尽空,  
此刻温馨,永铭中。

curl

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen-vl-max",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "text",
          "text": "图中描绘的是什么景象?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "这是一个女孩和一只狗。"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "写一首诗描述这个场景"
        }
      ]
    }
  ]
}'

Sample response

{
    "choices": [
        {
            "message": {
                "content": "海风轻拂笑颜开,  \n沙滩上与犬相陪。  \n夕阳斜照人影短,  \n欢乐时光心自醉。",
                "role": "assistant"
            },
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null
        }
    ],
    "object": "chat.completion",
    "usage": {
        "prompt_tokens": 1295,
        "completion_tokens": 32,
        "total_tokens": 1327
    },
    "created": 1726324976,
    "system_fingerprint": null,
    "model": "qwen-vl-max",
    "id": "chatcmpl-3c953977-6107-96c5-9a13-c01e328b24ca"
}

DashScope

Python

import os
from dashscope import MultiModalConversation

messages = [
    {
	"role": "system",
	"content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {
                "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
            },
            {"text": "图中描绘的是什么景象?"},
        ],
    }
]
response = MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',   # qwen-vl-max-latest is used here as an example; change the model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    messages=messages)
print(f"模型第一轮输出:{response.output.choices[0].message.content[0]['text']}")
messages.append(response['output']['choices'][0]['message'])
user_msg = {"role": "user", "content": [{"text": "做一首诗描述这个场景"}]}
messages.append(user_msg)
response = MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages)
print(f"模型第二轮输出:{response.output.choices[0].message.content[0]['text']}")

Sample response

模型第一轮输出:这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。人和狗面对面坐着,似乎在互动。背景是大海和天空,阳光洒在他们身上,营造出温暖的氛围。
模型第二轮输出:在阳光照耀的海滩上,人与狗共享欢乐时光。

Java

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    private static final String modelName = "qwen-vl-max-latest";  // qwen-vl-max-latest is used here as an example; change the model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    public static void MultiRoundConversationCall() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "图中描绘的是什么景象?"))).build();
        List<MultiModalMessage> messages = new ArrayList<>();
        messages.add(systemMessage);
        messages.add(userMessage);
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not configured, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))                
                .model(modelName)
                .messages(messages)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("第一轮输出:"+result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));        // add the result to conversation
        messages.add(result.getOutput().getChoices().get(0).getMessage());
        MultiModalMessage msg = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "做一首诗描述这个场景"))).build();
        messages.add(msg);
        param.setMessages((List)messages);
        result = conv.call(param);
        System.out.println("第二轮输出:"+result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));    }

    public static void main(String[] args) {
        try {
            MultiRoundConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

第一轮输出:这是一张在海滩上拍摄的照片。照片中有一个穿着格子衬衫的人和一只戴着项圈的狗。人和狗面对面坐着,似乎在互动。背景是大海和天空,阳光洒在他们身上,营造出温暖的氛围。
第二轮输出:在阳光洒满的海滩上,人与狗共享欢乐时光。

curl

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"text": "图中描绘的是什么景象?"}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"text": "图中是一名女子和一只拉布拉多犬在沙滩上玩耍。"}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"text": "写一首七言绝句描述这个场景"}
                ]
            }
        ]
    }
}'

Sample response

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "海浪轻拍沙滩边,女孩与狗同嬉戏。阳光洒落笑颜开,快乐时光永铭记。"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "output_tokens": 27,
        "input_tokens": 1298,
        "image_tokens": 1247
    },
    "request_id": "bdf5ef59-c92e-92a6-9d69-a738ecee1590"
}

Streaming output

After receiving the input, a large model generates intermediate results step by step, and the final result is the concatenation of those intermediate results. Emitting intermediate results while they are being generated is called streaming output. With streaming, you can start reading while the model is still writing and spend less time waiting for the reply.

OpenAI-compatible

Enabling streaming through the OpenAI-compatible interface is simple: set the stream parameter to true in the request.

By default, streaming responses do not include the token usage of the request. Set the stream_options parameter to {"include_usage": True} so that the last returned chunk reports the request's token usage.

Python

from openai import OpenAI
import os

client = OpenAI(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-max-latest",  # 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages=[
	{"role": "system",
         "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {"role": "user",
         "content": [{"type": "image_url",
                    "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
                    {"type": "text", "text": "图中描绘的是什么景象?"}]}],
    stream=True
)
full_content = ""
print("流式输出内容为:")
for chunk in completion:
    # If stream_options.include_usage is True, the last chunk has an empty choices list and must be skipped (its chunk.usage carries the token usage)
    if chunk.choices and chunk.choices[0].delta.content != "":
        full_content += chunk.choices[0].delta.content
        print(chunk.choices[0].delta.content)
print(f"完整内容为:{full_content}")

Sample response

流式输出内容为:

图
中
描绘
的是
一个
女人
......
温暖
和谐
的
氛围
。
完整内容为:图中描绘的是一个女人和一只狗在海滩上互动的场景。女人坐在沙滩上,微笑着与狗握手,显得非常开心。背景是大海和天空,阳光洒在她们身上,营造出一种温暖和谐的氛围。
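The Python sample above does not request usage statistics. A minimal sketch of adding them, using the same stream_options setting as the curl example further below (messages stands for the message list from the sample above):

completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=messages,  # the messages list from the sample above
    stream=True,
    stream_options={"include_usage": True},  # final chunk reports token usage
)
for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    else:
        # The usage-only chunk arrives last, with an empty choices list
        print(f"\nToken usage: {chunk.usage}")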

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If the environment variable is not configured, replace the next line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

const completion = await openai.chat.completions.create({
    model: "qwen-vl-max-latest",  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages: [
        {"role": "system",
         "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {"role": "user",
        "content": [{"type": "image_url",
                    "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
                    {"type": "text", "text": "图中描绘的是什么景象?"}]}],
    stream: true,
});

let fullContent = ""
console.log("流式输出内容为:")
for await (const chunk of completion) {
    // If stream_options.include_usage is true, the last chunk has an empty choices array and must be skipped (its chunk.usage carries the token usage)
    if (chunk.choices?.length && chunk.choices[0].delta.content != null) {
      fullContent += chunk.choices[0].delta.content;
      console.log(chunk.choices[0].delta.content);
    }
}
console.log(`完整输出内容为:${fullContent}`)

Sample response

流式输出内容为:

图中描绘的是
一个女人和一只
狗在海滩上
互动的景象。
......
在她们身上,
营造出温暖和谐
的氛围。
完整内容为:图中描绘的是一个女人和一只狗在海滩上互动的景象。女人穿着格子衬衫,坐在沙滩上,微笑着与狗握手。狗戴着项圈,看起来很开心。背景是大海和天空,阳光洒在她们身上,营造出温暖和谐的氛围。

curl

curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen-vl-max-latest",
    "messages": [
   {
      "role": "system",
      "content": [{"type":"text","text": "You are a helpful assistant."}]},
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "text",
          "text": "图中描绘的是什么景象?"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true}
}'

Sample response

data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[{"finish_reason":null,"delta":{"content":"图"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[{"delta":{"content":"中"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

......

data: {"choices":[{"delta":{"content":"分拍摄的照片。整体氛围显得非常"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[{"finish_reason":"stop","delta":{"content":"和谐而温馨。"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":1276,"completion_tokens":85,"total_tokens":1361},"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: [DONE]

DashScope

You can call Qwen-VL models through the DashScope SDK or over HTTP and stream the output. Set the relevant option for your calling method:

  • Python SDK: set the stream parameter to True.

  • Java SDK: call through the streamCall interface.

  • HTTP: set the X-DashScope-SSE header to enable.

By default the streamed content is non-incremental (each chunk contains everything generated so far). To stream incrementally, set the incremental_output parameter (incrementalOutput in Java) to true.

Python

import os
from dashscope import MultiModalConversation

messages = [
    {
    "role": "system",
    "content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
            {"text": "图中描绘的是什么景象?"}
        ]
    }
]
responses = MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen-vl-max-latest',  # qwen-vl-max-latest is used here as an example; change the model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    messages=messages,
    stream=True,
    incremental_output=True
    )
full_content = ""
print("流式输出内容为:")
for response in responses:
    if response["output"]["choices"][0]["message"].content:
        print(response["output"]["choices"][0]["message"].content[0]["text"])
        full_content += response["output"]["choices"][0]["message"].content[0]["text"]
print(f"完整内容为:{full_content}")

Sample response

流式输出内容为:
图中描绘的是
一个人和一只狗
在海滩上互动
......
阳光洒在他们
身上,营造出
温暖和谐的氛围
。
完整内容为:图中描绘的是一个人和一只狗在海滩上互动的景象。这个人穿着格子衬衫,坐在沙滩上,与一只戴着项圈的金毛猎犬握手。背景是海浪和天空,阳光洒在他们身上,营造出温暖和谐的氛围。

Java

import java.util.Arrays;
import java.util.Collections;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;

public class Main {
    public static void streamCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "图中描绘的是什么景象?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not configured, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
                .messages(Arrays.asList(systemMessage, userMessage))
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                var content = item.getOutput().getChoices().get(0).getMessage().getContent();
                    // Check that content exists and is not empty
                if (content != null &&  !content.isEmpty()) {
                    System.out.println(content.get(0).get("text"));
                    }
            } catch (Exception e) {
                System.out.println(e.getMessage());
            }
        });
    }

    public static void main(String[] args) {
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

图
中
描绘
的是
一个
女人
和
一只
狗
在
海滩
......
营造
出
一种
温暖
和谐
的
氛围
。

curl

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [
                    {"text": "You are a helpful assistant."}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"text": "图中描绘的是什么景象?"}
                ]
            }
        ]
    },
    "parameters": {
        "incremental_output": true
    }
}'

Sample response

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"这张"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":1,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"图片"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":2,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

......

id:17
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"的欣赏。这是一个温馨的画面,展示了"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":112,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

id:18
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"人与动物之间深厚的情感纽带。"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":120,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

id:19
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"input_tokens":1276,"output_tokens":121,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

High-resolution image understanding

Setting the vl_high_resolution_images parameter to true raises a Qwen-VL model's per-image token limit from 1,280 to 16,384:

| Value | Per-image token limit | Description | When to use |
| --- | --- | --- | --- |
| True | 16,384 | Images above the limit are scaled down until they fit within 16,384 tokens. The model can process higher-resolution images directly and pick up more detail, at the cost of slower processing and higher token usage. | Content-rich images where details matter |
| False (default) | 1,280 | Images above the limit are scaled down until they fit within 1,280 tokens. Processing is faster and token usage is lower. | Images with little detail, where a rough understanding is enough or speed matters |
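To see what the higher cap means for token usage, the image estimator from the token-rules section can be reused with the cap switched; a sketch (the 1,280 and 16,384 caps come from the table above, everything else mirrors the earlier code):

import math

def image_tokens(width: int, height: int, high_res: bool = False) -> int:
    # Per-image token cap: 16384 with vl_high_resolution_images=True, else 1280
    max_tokens = 16384 if high_res else 1280
    min_pixels, max_pixels = 28 * 28 * 4, max_tokens * 28 * 28
    h = round(height / 28) * 28
    w = round(width / 28) * 28
    if h * w > max_pixels:  # scale down to the cap, keeping multiples of 28
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / 28) * 28
        w = math.floor(width / beta / 28) * 28
    elif h * w < min_pixels:  # scale up to the 4-token floor
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / 28) * 28
        w = math.ceil(width * beta / 28) * 28
    return (h * w) // (28 * 28) + 2  # +2 for the vision markers

# A 4000x3000 photo: about 1232 tokens by default, about 15303 with high_res=True
print(image_tokens(4000, 3000), image_tokens(4000, 3000, high_res=True))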

Models that support vl_high_resolution_images

  • qwen-vl-max series: qwen-vl-max-0809 and later

  • qwen-vl-plus series: qwen-vl-plus-0809 and later

  • Open-source qwen-vl series: qwen2-vl and qwen2.5-vl models

  • QVQ models

The vl_high_resolution_images parameter is available only through the DashScope SDK and HTTP interface.

Python

import os
import dashscope

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
            {"text": "这张图表现了什么内容?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',  # qwen-vl-max-latest is used here as an example; change the model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    messages=messages,
    vl_high_resolution_images=True
)

print("大模型的回复:\n ",response.output.choices[0].message.content[0]["text"])
print("Token用量情况:","输入总Token:",response.usage["input_tokens"] , ",输入图像Token:" , response.usage["image_tokens"])

Sample response

大模型的回复:
  这张图片展示了一个温馨的圣诞装饰场景。图中可以看到以下元素:

1. **圣诞树**:两棵小型的圣诞树,上面覆盖着白色的雪。
2. **驯鹿摆件**:一只棕色的驯鹿摆件,带有大大的鹿角。
3. **蜡烛和烛台**:几个木制的烛台,里面点燃了小蜡烛,散发出温暖的光芒。
4. **圣诞装饰品**:包括金色的球形装饰、松果、红色浆果串等。
5. **圣诞礼物盒**:一个小巧的金色礼物盒,用金色丝带系着。
6. **圣诞字样**:木质的“MERRY CHRISTMAS”字样,增加了节日气氛。
7. **背景**:木质的背景板,给人一种自然和温暖的感觉。

整体氛围非常温馨和喜庆,充满了浓厚的圣诞节气息。
Token用量情况: 输入总Token: 5368 ,输入图像Token: 5342

Java

// Requires DashScope SDK version >= 2.20.8
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"),
                        Collections.singletonMap("text", "这张图表现了什么内容?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not configured, replace the line below with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
                .messages(Arrays.asList(systemMessage, userMessage))
                .vlHighResolutionImages(true)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("大模型的回复:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
        System.out.println("Token 用量情况:输入总Token:" + result.getUsage().getInputTokens() + ",输入图像的Token:" + result.getUsage().getImageTokens());
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {"role": "system",
	     "content": [
	       {"text": "You are a helpful assistant."}]},
            {
             "role": "user",
             "content": [
               {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
               {"text": "这张图表现了什么内容?"}
                ]
            }
        ]
    },
    "parameters": {
        "vl_high_resolution_images": true
    }
}'

Sample response

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "这张图片展示了一个温馨的圣诞装饰场景。画面中包括以下元素:\n\n1. **圣诞树**:两棵小型的圣诞树,上面覆盖着白色的雪。\n2. **驯鹿摆件**:一只棕色的驯鹿摆件,位于画面中央偏右的位置。\n3. **蜡烛**:几根木制的蜡烛,其中两根已经点燃,发出温暖的光芒。\n4. **圣诞装饰品**:一些金色和红色的装饰球、松果、浆果和绿色的松枝。\n5. **圣诞礼物**:一个小巧的金色礼物盒,旁边还有一个带有圣诞图案的袋子。\n6. **“MERRY CHRISTMAS”字样**:用木质字母拼写的“MERRY CHRISTMAS”,放在画面左侧。\n\n整个场景布置在一个木质背景前,营造出一种温暖、节日的氛围,非常适合圣诞节的庆祝活动。"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "total_tokens": 5553,
        "output_tokens": 185,
        "input_tokens": 5368,
        "image_tokens": 5342
    },
    "request_id": "38cd5622-e78e-90f5-baa0-c6096ba39b04"
}

Multi-image input

Qwen-VL models can take multiple images in a single request for joint analysis. The total token count of all images must fit within the model's maximum input; for the maximum number of images per request, see Image count limits.

The following sample code analyzes multiple online images (specified by URL, not local files). See Passing local files for local images and Image limits for constraints.

OpenAI-compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest", # 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages=[
       {"role":"system","content":[{"type": "text", "text": "You are a helpful assistant."}]},
       {"role": "user","content": [
           # First image URL; to pass a local file, replace the URL value with the image's Base64-encoded data URL
           {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
           # Second image URL; to pass a local file, replace the URL value with the image's Base64-encoded data URL
           {"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},},
           {"type": "text", "text": "这些图描绘了什么内容?"},
            ],
        }
    ],
)

print(completion.choices[0].message.content)

Sample response

图1中是一位女士和一只拉布拉多犬在海滩上互动的场景。女士穿着格子衬衫,坐在沙滩上,与狗进行握手的动作,背景是海浪和天空,整个画面充满了温馨和愉快的氛围。

图2中是一只老虎在森林中行走的场景。老虎的毛色是橙色和黑色条纹相间,它正向前迈步,周围是茂密的树木和植被,地面上覆盖着落叶,整个画面给人一种野生自然的感觉。

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If you have not set the environment variable, replace the next line with your Model Studio API Key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
        messages: [
	    {role: "system",content:[{ type: "text", text: "You are a helpful assistant." }]},
	    {role: "user",content: [
	    // First image URL; to pass a local file, replace the url value with the image's Base64-encoded data URL
            {type: "image_url",image_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
            // Second image URL; to pass a local file, replace the url value with the image's Base64-encoded data URL
            {type: "image_url",image_url: {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"}},
            {type: "text", text: "这些图描绘了什么内容?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

Sample response

第一张图片中,一个人和一只狗在海滩上互动。人穿着格子衬衫,狗戴着项圈,他们似乎在握手或击掌。

第二张图片中,一只老虎在森林中行走。老虎的毛色是橙色和黑色条纹,背景是绿色的树木和植被。

curl

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen-vl-max-latest",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
          }
        },
        {
          "type": "text",
          "text": "这些图描绘了什么内容?"
        }
      ]
    }
  ]
}'

Sample response

{
  "choices": [
    {
      "message": {
        "content": "图1中是一位女士和一只拉布拉多犬在海滩上互动的场景。女士穿着格子衬衫,坐在沙滩上,与狗进行握手的动作,背景是海景和日落的天空,整个画面显得非常温馨和谐。\n\n图2中是一只老虎在森林中行走的场景。老虎的毛色是橙色和黑色条纹相间,它正向前迈步,周围是茂密的树木和植被,地面上覆盖着落叶,整个画面充满了自然的野性和生机。",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 2497,
    "completion_tokens": 109,
    "total_tokens": 2606
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen-vl-max",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope

messages = [
    {
	"role": "system",
	"content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            # First image URL
            # To pass a local file, replace the URL with file://ABSOLUTE_PATH/test.jpg, where ABSOLUTE_PATH is the file's absolute path
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
            # Second image URL
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
            # Third image URL
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/hbygyo/rabbit.jpg"},
            {"text": "这些图描绘了什么内容?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest', # qwen-vl-max-latest is used here as an example; substitute any model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Sample response

这些图片展示了一些动物和自然场景。第一张图片中,一个人和一只狗在海滩上互动。第二张图片是一只老虎在森林中行走。第三张图片是一只卡通风格的兔子在草地上跳跃。

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import java.util.HashMap;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // To use a local image, add a String localPath parameter to this method for the file's actual path,
        // then uncomment the line below and the HashMap entry further down (java.util.HashMap is already imported above).
        // On Linux/macOS use "file://" + localPath; on Windows use "file:///" + localPath.
        // String filePath = "file://"+localPath;
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        // First image URL
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        // To use a local image, uncomment the line below
                        // new HashMap<String, Object>(){{put("image", filePath);}},
                        // Second image URL
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
                        // Third image URL
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/hbygyo/rabbit.jpg"),
                        Collections.singletonMap("text", "这些图描绘了什么内容?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

这些图片展示了一些动物和自然场景。

1. 第一张图片:一个女人和一只狗在海滩上互动。女人穿着格子衬衫,坐在沙滩上,狗戴着项圈,伸出爪子与女人握手。
2. 第二张图片:一只老虎在森林中行走。老虎的毛色是橙色和黑色条纹,背景是树木和树叶。
3. 第三张图片:一只卡通风格的兔子在草地上跳跃。兔子是白色的,耳朵是粉红色的,背景是蓝天和黄色的花朵。

curl

curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/hbygyo/rabbit.jpg"},
                    {"text": "这些图展现了什么内容?"}
                ]
            }
        ]
    }
}'

Sample response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "这张图片显示了一位女士和她的狗在海滩上。她们似乎正在享受彼此的陪伴,狗狗坐在沙滩上伸出爪子与女士握手或互动。背景是美丽的日落景色,海浪轻轻拍打着海岸线。\n\n请注意,我提供的描述基于图像中可见的内容,并不包括任何超出视觉信息之外的信息。如果您需要更多关于这个场景的具体细节,请告诉我!"
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 81,
    "input_tokens": 1277,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Video understanding

Some Qwen-VL models can understand video content, passed either as an image list (video frames) or as a video file.

For video files, the latest version or a recent snapshot model is recommended for better performance.

Video files

Video file limits

  • Video size:

    • Public URL: Qwen2.5-VL series models accept videos up to 1 GB; other models up to 150 MB.

    • Local file:

      • With the OpenAI SDK, the Base64-encoded video must be under 10 MB;

      • With the DashScope SDK, the video file itself must be under 100 MB. For details, see Passing local files.

  • Video duration:

    • Qwen2.5-VL series models: 2 seconds to 10 minutes;

    • Other models: 2 seconds to 40 seconds.

  • Video formats: MP4, AVI, MKV, MOV, FLV, WMV, etc.

  • Video dimensions: no specific limit; videos are resized to roughly 600,000 pixels before processing, so higher-resolution files do not improve understanding.

  • Audio tracks in video files are not yet understood.

Frame extraction

Qwen-VL models analyze a video by extracting frames from it; the extraction rate determines how fine-grained the analysis is. It is controlled differently per SDK:

  • Using the DashScope SDK:

    Set the fps parameter to control the sampling rate: one frame is extracted every 1/fps seconds. Use a larger fps for fast-moving content (such as sports events or action movies) and a smaller fps for static or long videos; a worked example follows this list.

  • Using the OpenAI SDK:

    The sampling rate is fixed at one frame every 0.5 seconds and cannot be changed via a parameter.
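To make the 1/fps relationship concrete, here is a short sketch (plain arithmetic, not an API call) estimating how many frames a given setting samples from a clip:

def sampled_frames(duration_seconds, fps):
    # One frame is extracted every 1/fps seconds.
    return int(duration_seconds * fps)

print(sampled_frames(60, 2))    # fps=2: one frame per 0.5 s -> 120 frames from a 60 s clip
print(sampled_frames(60, 0.2))  # fps=0.2: one frame per 5 s -> 12 frames
# The OpenAI SDK's fixed 0.5 s interval is equivalent to fps=2.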

The following sample code analyzes an online video (specified by URL). See Passing local files to learn how to pass a local video.

OpenAI-compatible

When passing a video file directly to a Qwen-VL model via the OpenAI SDK or HTTP, set the "type" field in the user message to "video_url".

Python

import os
from openai import OpenAI

client = OpenAI(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[
        {"role": "system",
         "content": [{"type": "text","text": "You are a helpful assistant."}]},
        {"role": "user","content": [{
            # When passing a video file directly, set type to video_url
            # With the OpenAI SDK, one frame is extracted every 0.5 seconds and this cannot be changed; to customize the sampling rate, use the DashScope SDK.
            "type": "video_url",            
            "video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {"type": "text","text": "这段视频的内容是什么?"}]
         }]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If you have not set the environment variable, replace the next line with your Model Studio API Key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",
        messages: [
        {role:"system",content:["You are a helpful assistant."]},
        {role: "user",content: [
            // When passing a video file directly, set type to video_url
            // With the OpenAI SDK, one frame is extracted every 0.5 seconds and this cannot be changed; to customize the sampling rate, use the DashScope SDK.
            {type: "video_url", video_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {type: "text", text: "这段视频的内容是什么?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

curl

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "messages": [
    {"role": "system", "content": [{"type": "text","text": "You are a helpful assistant."}]},
    {"role": "user","content": [{"type": "video_url","video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
    {"type": "text","text": "这段视频的内容是什么?"}]}]
}'

DashScope

Python

import dashscope
import os
messages = [
    {"role":"system","content":[{"text": "You are a helpful assistant."}]},
    {"role": "user",
        "content": [
            # The fps parameter controls the video sampling rate: one frame is extracted every 1/fps seconds. Full usage: https://help.aliyun.com/zh/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "这段视频的内容是什么?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // The fps parameter controls the video sampling rate: one frame is extracted every 1/fps seconds. Full usage: https://help.aliyun.com/zh/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
        Map<String, Object> params = Map.of(
                "video", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
                "fps",2);
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "这段视频的内容是什么?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not set the environment variable, replace the next line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {"role": "system","content": [{"text": "You are a helpful assistant."}]},
            {"role": "user","content": [{"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "这段视频的内容是什么?"}]}]}
}'

Image lists

Image count limits

  • Qwen2.5-VL series models: at least 4 and at most 512 images

  • Other models: at least 4 and at most 80 images

Frame extraction

When passing an image list (pre-extracted video frames), the fps parameter tells the model the time interval between frames, helping it understand event order, duration, and motion more accurately.

  • Using the DashScope SDK:

    When calling a Qwen2.5-VL series model, set fps to indicate that the frames were extracted from the source video every 1/fps seconds.

  • Using the OpenAI SDK:

    fps cannot be set; the model assumes the frames were extracted at one frame every 0.5 seconds.

The following sample code analyzes online video frames (specified by URL). See Passing local files to learn how to pass local frames.

OpenAI-compatible

When passing a video as an image list to a Qwen-VL model via the OpenAI SDK or HTTP, set the "type" field in the user message to "video".

Python

import os
from openai import OpenAI

client = OpenAI(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest", # 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages=[{"role": "user","content": [
        # When passing an image list, set the "type" field in the user message to "video",
        # With the OpenAI SDK, the frames are assumed to be sampled every 0.5 seconds and this cannot be changed; to customize the sampling rate, use the DashScope SDK.
        {"type": "video","video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
        {"type": "text","text": "描述这个视频的具体过程"},
    ]}]
)
print(completion.choices[0].message.content)

Node.js

// Make sure "type": "module" is specified in package.json
import OpenAI from "openai";

const openai = new OpenAI({
    // If you have not set the environment variable, replace the next line with your Model Studio API Key: apiKey: "sk-xxx",
    apiKey: process.env.DASHSCOPE_API_KEY,
    baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
});

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
        messages: [{
            role: "user",
            content: [
                {
                    // When passing an image list, set the "type" field in the user message to "video"
                    // With the OpenAI SDK, the frames are assumed to be sampled every 0.5 seconds and this cannot be changed; to customize the sampling rate, use the DashScope SDK.
                    type: "video",
                    video: [
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                    ]
                },
                {
                    type: "text",
                    text: "描述这个视频的具体过程"
                }
            ]
        }]
    });
    console.log(response.choices[0].message.content);
}

main();

curl

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
                {"type": "text",
                "text": "描述这个视频的具体过程"}]}]
}'

DashScope

Python

import os
# Requires dashscope >= 1.20.10
import dashscope

messages = [{"role": "user",
             "content": [
                  # For Qwen2.5-VL series models, fps may be set when passing an image list, indicating the frames were extracted from the source video every 1/fps seconds
                 {"video":["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
                   "fps":2},
                 {"text": "描述这个视频的具体过程"}]}]
response = dashscope.MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen-vl-max-latest', # qwen-vl-max-latest is used here as an example; substitute any model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

// Requires DashScope SDK >= 2.18.3
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    private static final String MODEL_NAME = "qwen-vl-max-latest"; // qwen-vl-max-latest is used here as an example; substitute any model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    public static void videoImageListSample() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder()
                .role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant.")))
                .build();
        // For Qwen2.5-VL series models, fps may be set when passing an image list, indicating the frames were extracted from the source video every 1/fps seconds
        Map<String, Object> params = Map.of(
                "video", Arrays.asList("https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"),
                "fps",2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "描述这个视频的具体过程")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not set the environment variable, replace the next line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(systemMessage, userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen-vl-max-latest",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
            ],
            "fps":2
                 
          },
          {
            "text": "描述这个视频的具体过程"
          }
        ]
      }
    ]
  }
}'

Passing local files (Base64 encoding or file path)

Qwen-VL supports two ways to pass local files:

  • Base64 encoding

  • Direct file-path upload (more stable transfer; recommended)

Upload methods:

Base64 encoding

Convert the file to a Base64-encoded string and pass it to the model. Works with the OpenAI and DashScope SDKs as well as HTTP.

Steps for passing a Base64-encoded string (using an image as an example; a sketch putting the steps together follows this list):

  1. Encode the file: convert the local image to Base64;

  2. Build the data URL in the form: data:[MIME_type];base64,{base64_image}

    1. Replace MIME_type with the actual media type, matching a MIME Type value in the supported images table (e.g. image/jpeg, image/png);

    2. base64_image is the Base64 string produced in the previous step;

  3. Call the model: pass the data URL via the image or image_url parameter.
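Putting the three steps together, a minimal Python sketch (the helper name to_data_url is illustrative; the MIME type guessed from the file extension should still be checked against the supported images table):

import base64
import mimetypes

def to_data_url(image_path):
    mime, _ = mimetypes.guess_type(image_path)            # step 2.1: MIME_type, e.g. image/png
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")  # step 1: Base64-encode the file
    return f"data:{mime};base64,{b64}"                    # step 2: assemble the data URL

# Step 3: pass the result as the "image" (DashScope) or "image_url" (OpenAI-compatible) value.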

File-path upload

Pass the local file path directly to the model. Supported only by the DashScope Python and Java SDKs; not supported over DashScope HTTP or the OpenAI-compatible interface.

Use the table below to format the file path for your programming language and operating system.

Specifying the file path (image example)

System           SDK         Path format                 Example

Linux or macOS   Python SDK  file://{absolute path}      file:///home/images/test.png
                 Java SDK

Windows          Python SDK  file://{absolute path}      file://D:/images/test.png

                 Java SDK    file:///{absolute path}     file:///D:/images/test.png
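As a rough illustration of the table, a small helper (the function name to_file_uri is hypothetical) that builds the URI from an absolute path:

import platform

def to_file_uri(abs_path, sdk="python"):
    # Per the table above: the Python SDK uses file://{path} on every system;
    # only the Java SDK on Windows needs the extra slash, file:///{path}.
    if platform.system() == "Windows" and sdk == "java":
        return "file:///" + abs_path
    return "file://" + abs_path

print(to_file_uri("/home/images/test.png"))  # file:///home/images/test.png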

Usage limits:

  • Prefer file-path upload (more stable); files under 1 MB may also be sent as Base64;

  • When passing a file path directly, a single image or video frame (in an image list) must be under 10 MB, and a single video under 100 MB;

  • When passing Base64, the encoding inflates the data, so a single encoded image or video must be under 10 MB (a quick pre-flight check follows this list).

To compress files, see How do I compress an image or video to meet the size limits?
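Base64 emits four output bytes for every three input bytes, so the encoded size is ceil(raw / 3) * 4, roughly a 33% increase over the raw file. A minimal sketch of the pre-flight check against the 10 MB encoded limit above (the file path is a placeholder):

import os

def fits_base64_limit(path, limit=10 * 1024 * 1024):
    # Exact Base64 output length including padding; the short data-URL prefix is negligible.
    encoded_size = (os.path.getsize(path) + 2) // 3 * 4
    return encoded_size < limit

print(fits_base64_limit("xxx/eagle.png"))  # replace with your local file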

Images

File-path upload

Passing a file path is supported only by the DashScope Python and Java SDKs; it is not supported over DashScope HTTP or the OpenAI-compatible interface.

Python

import os
from dashscope import MultiModalConversation

# Replace xxx/eagle.png with the absolute path of your local image
local_path = "xxx/eagle.png"
image_path = f"file://{local_path}"
messages = [{"role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
                {'role':'user',
                'content': [{'image': image_path},
                            {'text': '图中描绘的是什么景象?'}]}]
response = MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',  # qwen-vl-max-latest is used here as an example; substitute any model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    messages=messages)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>(){{put("image", filePath);}},
                        new HashMap<String, Object>(){{put("text", "图中描绘的是什么景象?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not set the environment variable, replace the next line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")  // qwen-vl-max-latest is used here as an example; substitute any model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/eagle.png with the absolute path of your local image
            callWithLocalFile("xxx/eagle.png");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Base64 encoding

OpenAI兼容

Python

from openai import OpenAI
import os
import base64


#  Encoding helper: converts a local file to a Base64-encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxx/eagle.png")

client = OpenAI(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest", # 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages=[
    	{
    	    "role": "system",
            "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # Note: when passing Base64, the image format (image/{format}) must match a Content Type in the supported images list. "f" is Python string formatting.
                    # PNG:  f"data:image/png;base64,{base64_image}"
                    # JPEG: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"}, 
                },
                {"type": "text", "text": "图中描绘的是什么景象?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If you have not set the environment variable, replace the next line with your Model Studio API Key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
// Replace xxx/eagle.png with the absolute path of your local image
const base64Image = encodeImage("xxx/eagle.png")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen-vl-max-latest",  // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
        messages: [
            {"role": "system", 
             "content": [{"type":"text","text": "You are a helpful assistant."}]},
            {"role": "user",
             "content": [{"type": "image_url",
                            // Note: when passing Base64, the image format (image/{format}) must match a Content Type in the supported images list.
                           // PNG:  data:image/png;base64,${base64Image}
                          // JPEG: data:image/jpeg;base64,${base64Image}
                         // WEBP: data:image/webp;base64,${base64Image}
                        "image_url": {"url": `data:image/png;base64,${base64Image}`},},
                        {"type": "text", "text": "图中描绘的是什么景象?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

curl

  • For how to convert a file to a Base64-encoded string, see the sample code above.

  • For readability, the Base64 string "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code below is truncated. In practice, always pass the complete encoded string.

curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen-vl-max",
  "messages": [
  {"role":"system",
  "content":[
    {"type": "text", "text": "You are a helpful assistant."}]},
  {
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": f"data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."}},
      {"type": "text", "text": "图中描绘的是什么景象?"}
    ]
  }]
}'

DashScope

Python

import base64
import os
from dashscope import MultiModalConversation


#  Encoding helper: converts a local file to a Base64-encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Replace xxxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxxx/eagle.png")

messages = [
    {"role": "system", "content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            # Note: when passing Base64, the image format (image/{format}) must match a Content Type in the supported images list. "f" is Python string formatting.
            # PNG:  f"data:image/png;base64,{base64_image}"
            # JPEG: f"data:image/jpeg;base64,{base64_image}"
            # WEBP: f"data:image/webp;base64,{base64_image}"
            {"image": f"data:image/png;base64,{base64_image}"},
            {"text": "图中描绘的是什么景象?"},
        ],
    },
]
response = MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-max-latest",  # 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Base64;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {

    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void callWithLocalFile(String localPath) throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image = encodeImageToBase64(localPath); // Base64-encode the file

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();

        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>() {{ put("image", "data:image/png;base64," + base64Image); }},
                        new HashMap<String, Object>() {{ put("text", "图中描绘的是什么景象?"); }}
                )).build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/eagle.png with the absolute path of your local image
            callWithLocalFile("xxx/eagle.png");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For how to convert a file to a Base64-encoded string, see the sample code above.

  • For readability, the Base64 string "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code below is truncated. In practice, always pass the complete encoded string.

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {"role": "system",
	     "content": [
	       {"text": "You are a helpful assistant."}]},
            {
             "role": "user",
             "content": [
               {"image": f"data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "图中描绘的是什么景象?"}
                ]
            }
        ]
    }
}'

Video files

The examples below use a local file, test.mp4.

File-path upload

Passing a file path is supported only by the DashScope Python and Java SDKs; it is not supported over DashScope HTTP or the OpenAI-compatible interface.

Python

import os
from dashscope import MultiModalConversation
# Replace xxx/test.mp4 with the absolute path of your local video
local_path = "xxx/test.mp4"
video_path = f"file://{local_path}"
messages = [{'role': 'system',
                'content': [{'text': 'You are a helpful assistant.'}]},
                {'role':'user',
                # The fps parameter controls frame sampling: one frame is extracted every 1/fps seconds
                'content': [{'video': video_path,"fps":2},
                            {'text': '这段视频描绘的是什么景象?'}]}]
response = MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',  
    messages=messages)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>()
                                       {{
                                           put("video", filePath);// fps参数控制视频抽帧数量,表示每隔1/fps 秒抽取一帧
                                           put("fps", 2);
                                       }}, 
                        new HashMap<String, Object>(){{put("text", "这段视频描绘的是什么景象?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API Keys differ between the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")  
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.mp4 with the absolute path of your local video
            callWithLocalFile("xxx/test.mp4");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Base64 encoding

OpenAI兼容

Python

from openai import OpenAI
import os
import base64


# Encoding helper: converts a local file to a Base64-encoded string
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace xxx/test.mp4 with the absolute path of your local video
base64_video = encode_video("xxx/test.mp4")
client = OpenAI(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",  
    messages=[
        {
            "role": "system",
            "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {
                    # When passing a video file directly, set type to video_url
                    "type": "video_url",
                    "video_url": {"url": f"data:video/mp4;base64,{base64_video}"},
                },
                {"type": "text", "text": "这段视频描绘的是什么景象?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If you have not set the environment variable, replace the next line with your Model Studio API Key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeVideo = (videoPath) => {
    const videoFile = readFileSync(videoPath);
    return videoFile.toString('base64');
  };
// Replace xxx/test.mp4 with the absolute path of your local video
const base64Video = encodeVideo("xxx/test.mp4")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen-vl-max-latest", 
        messages: [
            {"role": "system",
             "content": [{"type":"text","text": "You are a helpful assistant."}]},
            {"role": "user",
             "content": [{
                 // When passing a video file directly, set type to video_url
                "type": "video_url", 
                "video_url": {"url": `data:video/mp4;base64,${base64Video}`}},
                 {"type": "text", "text": "这段视频描绘的是什么景象?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

curl

  • For how to convert a file to a Base64-encoded string, see the sample code above.

  • For readability, the Base64 string "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code below is truncated. In practice, always pass the complete encoded string.

curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen-vl-max",
  "messages": [
  {"role":"system",
  "content":[
    {"type": "text", "text": "You are a helpful assistant."}]},
  {
    "role": "user",
    "content": [
      {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."}},
      {"type": "text", "text": "图中描绘的是什么景象?"}
    ]
  }]
}'

DashScope

Python

import base64
import os
from dashscope import MultiModalConversation

# Encoding helper: converts a local file to a Base64-encoded string
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace xxxx/test.mp4 with the absolute path of your local video
base64_video = encode_video("xxxx/test.mp4")

messages = [{'role': 'system',
                'content': [{'text': 'You are a helpful assistant.'}]},
                {'role':'user',
                # The fps parameter controls frame sampling: one frame is extracted every 1/fps seconds
                'content': [{'video': f"data:video/mp4;base64,{base64_video}","fps":2},
                            {'text': '这段视频描绘的是什么景象?'}]}]
response = MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',
    messages=messages)

print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.io.IOException;
import java.util.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    private static String encodeVideoToBase64(String videoPath) throws IOException {
        Path path = Paths.get(videoPath);
        byte[] videoBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(videoBytes);
    }

    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Video = encodeVideoToBase64(localPath); // Base64-encode the file

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();

        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>()
                                       {{
                                           put("video", "data:video/mp4;base64," + base64Video);// fps参数控制视频抽帧数量,表示每隔1/fps 秒抽取一帧
                                           put("fps", 2);
                                       }},
                        new HashMap<String, Object>(){{put("text", "这段视频描绘的是什么景象?");}})).build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not set the environment variable, replace the next line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.mp4 with the absolute path of your local video
            callWithLocalFile("xxx/test.mp4");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For how to convert a file to a Base64-encoded string, see the sample code above.

  • For readability, the Base64 string "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code below is truncated. In practice, always pass the complete encoded string.

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {"role": "system",
	     "content": [
	       {"text": "You are a helpful assistant."}]},
            {
             "role": "user",
             "content": [
               {"video": f"data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "图中描绘的是什么景象?"}
                ]
            }
        ]
    }
}'

Image lists

The examples below use four local files: football1.jpg, football2.jpg, football3.jpg, and football4.jpg.

File-path upload

Passing a file path is supported only by the DashScope Python and Java SDKs; it is not supported over DashScope HTTP or the OpenAI-compatible interface.

Python

import os

from dashscope import MultiModalConversation

local_path1 = "football1.jpg"
local_path2 = "football2.jpg"
local_path3 = "football3.jpg"
local_path4 = "football4.jpg"

image_path1 = f"file://{local_path1}"
image_path2 = f"file://{local_path2}"
image_path3 = f"file://{local_path3}"
image_path4 = f"file://{local_path4}"

messages = [{"role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
                {'role':'user',
                # For Qwen2.5-VL series models, fps may be set when passing an image list, indicating the frames were extracted from the source video every 1/fps seconds; it has no effect on other models
                'content': [{'video': [image_path1,image_path2,image_path3,image_path4],"fps":2},
                            {'text': '这段视频描绘的是什么景象?'}]}]
response = MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max-latest',  # qwen-vl-max-latest is used here as an example; substitute any model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    messages=messages)

print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

// Requires DashScope SDK >= 2.18.3
import java.util.Arrays;
import java.util.Map;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    private static final String MODEL_NAME = "qwen-vl-max-latest";  // qwen-vl-max-latest is used here as an example; substitute any model name as needed. Model list: https://help.aliyun.com/zh/model-studio/models
    public static void videoImageListSample(String localPath1, String localPath2, String localPath3, String localPath4)
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        String filePath1 = "file://" + localPath1;
        String filePath2 = "file://" + localPath2;
        String filePath3 = "file://" + localPath3;
        String filePath4 = "file://" + localPath4;
        MultiModalMessage systemMessage = MultiModalMessage.builder()
                .role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant.")))
                .build();
        Map<String, Object> params = Map.of(
                "video", Arrays.asList(filePath1,filePath2,filePath3,filePath4),
                // For Qwen2.5-VL series models, fps may be set when passing an image list, indicating the frames were extracted from the source video every 1/fps seconds; it has no effect on other models
                "fps",2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(params,
                        Collections.singletonMap("text", "描述这个视频的具体过程")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API Keys differ between the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(systemMessage, userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample(
                    "xxx/football1.jpg",
                    "xxx/football2.jpg",
                    "xxx/football3.jpg",
                    "xxx/football4.jpg");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Base64 encoding

OpenAI-compatible

Python

import os
from openai import OpenAI
import base64

# Encoding helper: converts a local file to a Base64-encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")
client = OpenAI(
    # If you have not set the environment variable, replace the next line with your Model Studio API Key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max-latest",  # 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
    messages=[
    {"role": "system",
     "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user","content": [
        {"type": "video","video": [
            f"data:image/jpeg;base64,{base64_image1}",
            f"data:image/jpeg;base64,{base64_image2}",
            f"data:image/jpeg;base64,{base64_image3}",
            f"data:image/jpeg;base64,{base64_image4}",]},
        {"type": "text","text": "描述这个视频的具体过程"},
    ]}]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // API Keys differ between the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        // If you have not set the environment variable, replace the next line with your Model Studio API Key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The base_url below is for the Singapore region; for models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
  
const base64Image1 = encodeImage("football1.jpg")
const base64Image2 = encodeImage("football2.jpg")
const base64Image3 = encodeImage("football3.jpg")
const base64Image4 = encodeImage("football4.jpg")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen-vl-max-latest", // 此处以qwen-vl-max-latest为例,可按需更换模型名称。模型列表:https://help.aliyun.com/zh/model-studio/models
        messages: [
            {"role": "system",
             "content": [{"type":"text","text": "You are a helpful assistant."}]},
            {"role": "user",
             "content": [{"type": "video",
                            // Note: when passing Base64, the image format (image/{format}) must match a Content Type in the supported images list.
                           // PNG:  data:image/png;base64,${base64Image}
                          // JPEG: data:image/jpeg;base64,${base64Image}
                         // WEBP: data:image/webp;base64,${base64Image}
                        "video": [
                            `data:image/jpeg;base64,${base64Image1}`,
                            `data:image/jpeg;base64,${base64Image2}`,
                            `data:image/jpeg;base64,${base64Image3}`,
                            `data:image/jpeg;base64,${base64Image4}`]},
                        {"type": "text", "text": "这段视频描绘的是什么景象?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

curl

  • For how to convert a file to a Base64-encoded string, see the sample code above.

  • For readability, the Base64 string "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code below is truncated. In practice, always pass the complete encoded string.

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": [
                          f"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...",
                          f"data:image/jpeg;base64,nEpp6jpnP57MoWSyOWwrkXMJhHRCWYeFYb...",
                          f"data:image/jpeg;base64,JHWQnJPc40GwQ7zERAtRMK6iIhnWw4080s...",
                          f"data:image/jpeg;base64,adB6QOU5HP7dAYBBOg/Fb7KIptlbyEOu58..."
                          ]},
                {"type": "text",
                "text": "描述这个视频的具体过程"}]}]
}'

DashScope

Python

import base64
import os
from dashscope import MultiModalConversation

# Helper: encode a local file as a Base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")


messages = [{"role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
                {'role':'user',
                'content': [
                    {'video':
                         [f"data:image/png;base64,{base64_image1}",
                          f"data:image/png;base64,{base64_image2}",
                          f"data:image/png;base64,{base64_image3}",
                          f"data:image/png;base64,{base64_image4}"
                         ]
                    },
                    {'text': '请描绘这个视频的具体过程?'}]}]
response = MultiModalConversation.call(
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen-vl-max-latest',  # qwen-vl-max-latest is used here as an example; substitute any supported model. Model list: https://help.aliyun.com/zh/model-studio/models
    messages=messages)

print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.io.IOException;
import java.util.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {

    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void videoImageListSample(String localPath1,String localPath2,String localPath3,String localPath4)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image1 = encodeImageToBase64(localPath1); // Base64-encode
        String base64Image2 = encodeImageToBase64(localPath2);
        String base64Image3 = encodeImageToBase64(localPath3);
        String base64Image4 = encodeImageToBase64(localPath4);

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();

        Map<String, Object> params = Map.of(
                "video", Arrays.asList(
                        "data:image/jpeg;base64," + base64Image1,
                        "data:image/jpeg;base64," + base64Image2,
                        "data:image/jpeg;base64," + base64Image3,
                        "data:image/jpeg;base64," + base64Image4),
                // For Qwen2.5-VL series models, when passing an image list you can set the fps parameter,
                // meaning the frames were sampled from the source video every 1/fps seconds; other models ignore this setting
                "fps", 2
        );
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(params,
                        Collections.singletonMap("text", "描述这个视频的具体过程")))
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max-latest")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/football1.jpg etc. with the absolute paths of your local images
            videoImageListSample(
                    "xxx/football1.jpg",
                    "xxx/football2.jpg",
                    "xxx/football3.jpg",
                    "xxx/football4.jpg"
            );
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For how to convert a file to a Base64-encoded string, see the sample code.

  • For readability, the Base64 strings in the example, such as "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", are truncated. In practice you must pass the complete encoded string.

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen-vl-max-latest",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
                      f"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...",
                      f"data:image/jpeg;base64,nEpp6jpnP57MoWSyOWwrkXMJhHRCWYeFYb...",
                      f"data:image/jpeg;base64,JHWQnJPc40GwQ7zERAtRMK6iIhnWw4080s...",
                      f"data:image/jpeg;base64,adB6QOU5HP7dAYBBOg/Fb7KIptlbyEOu58..."
            ],
            "fps":2     
          },
          {
            "text": "描述这个视频的具体过程"
          }
        ]
      }
    ]
  }
}'

Usage limits

Supported images

The image formats supported by the models are listed in the table below:

Note: when passing a local image with the OpenAI SDK, set image/{format} in the code to the MIME type that matches the actual image format.

| Image format | Common extensions | MIME type  |
| BMP          | .bmp              | image/bmp  |
| JPEG         | .jpe, .jpeg, .jpg | image/jpeg |
| PNG          | .png              | image/png  |
| TIFF         | .tif, .tiff       | image/tiff |
| WEBP         | .webp             | image/webp |
| HEIC         | .heic             | image/heic |

Image size limits

  • A single image file must not exceed 10 MB. When passing a Base64-encoded image, the encoded string must be under 10 MB; see Passing local files. To compress a file, see How do I compress an image or video to meet the size requirements?

  • There is no strict per-image pixel limit, but both width and height must exceed 10 pixels, and the aspect ratio must not exceed 200:1 or 1:200.

  • The model scales images before understanding them, so oversized images do not yield better results.

    Recommended pixel counts:

    • For a single image passed to qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-1230, qwen-vl-max-1119, qwen-vl-max-1030, qwen-vl-max-0809, qwen-vl-plus-latest, qwen-vl-plus-0102, qwen-vl-plus-0809, qwen2-vl-72b-instruct, qwen2-vl-7b-instruct, or qwen2-vl-2b-instruct, at most 12 million pixels is recommended, which covers standard 4K images.

    • For a single image passed to qwen-vl-plus, at most 1,003,520 pixels is recommended.
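
These limits are easy to validate locally before making a request. Below is a minimal sketch (Pillow assumed installed; check_image and its thresholds are illustrative helpers, not part of any SDK) that checks the file-size, side-length, aspect-ratio, and recommended-pixel limits above:

# pip install pillow
import os
from PIL import Image

MAX_FILE_BYTES = 10 * 1024 * 1024  # 10 MB per file (for Base64 input, the encoded string)
MIN_SIDE = 10                      # width and height must both exceed 10 px
MAX_ASPECT = 200                   # aspect ratio must not exceed 200:1 (or 1:200)

def check_image(path, max_pixels=12_000_000):
    """Validate an image against the limits above before sending a request."""
    if os.path.getsize(path) > MAX_FILE_BYTES:
        raise ValueError("file exceeds 10 MB")
    with Image.open(path) as img:
        w, h = img.size
        if w <= MIN_SIDE or h <= MIN_SIDE:
            raise ValueError("width and height must both exceed 10 px")
        if max(w, h) / min(w, h) > MAX_ASPECT:
            raise ValueError("aspect ratio exceeds 200:1")
        if w * h > max_pixels:
            print(f"warning: {w}x{h} exceeds the recommended pixel count")

check_image("football1.jpg")  # one of the local images from the examples above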

Image input methods

  • Image URL: the URL must be publicly accessible.

  • Local image file: pass Base64-encoded data or the path to a local file.

Image count limits

In multi-image input, the number of images is bounded by the model's combined image-and-text token limit (that is, its maximum input): the total token count of all images must stay below the model's maximum input.

For example, with qwen-vl-max (maximum input 129,024 tokens) and input images of 1280 × 1280 pixels each, the image token counts computed with the code in Converting images to tokens are:

| vl_high_resolution_images | Scaled image size | Tokens per image | Max number of images |
| True                      | 1288 x 1288       | 2118             | 60                   |
| False                     | 980 x 980         | 1227             | 105                  |
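
The table values can be reproduced with a short calculation. The sketch below is an approximation inferred from this table and the 28×28-pixels-per-token rule in the billing FAQ: it mirrors the resize behavior published for Qwen2.5-VL (round each side to a multiple of 28, then scale down under the pixel budget) and adds the 2 extra tokens the table implies (2118 = 46 × 46 + 2). Treat it as an estimate and use the linked token-conversion code for authoritative numbers.

import math

def estimate_image_tokens(width, height, vl_high_resolution_images=False):
    """Rough per-image token estimate, inferred from the table above.
    The resize rule mirrors Qwen2.5-VL's published smart_resize behavior;
    the +2 is inferred from the table (2118 = 46 * 46 + 2) and is an assumption."""
    max_pixels = 12_000_000 if vl_high_resolution_images else 1_003_520
    # Round each side to the nearest multiple of 28 (one token per 28x28 block).
    w = round(width / 28) * 28
    h = round(height / 28) * 28
    # If over the pixel budget, scale down and floor to multiples of 28.
    if w * h > max_pixels:
        beta = math.sqrt(width * height / max_pixels)
        w = math.floor(width / beta / 28) * 28
        h = math.floor(height / beta / 28) * 28
    blocks = (w // 28) * (h // 28)
    return max(blocks, 4) + 2

for high_res in (True, False):
    tokens = estimate_image_tokens(1280, 1280, high_res)
    # Reproduces the table: (True, 2118, 60) and (False, 1227, 105)
    print(high_res, tokens, 129_024 // tokens)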

Application examples

Solving problems from images

Prompt tip: chain-of-thought prompting helps with complex math problems. By guiding the model to produce its reasoning process, or by breaking a complex task into steps, the model generates more supporting reasoning before the final answer, which improves performance on hard problems.

Input example

Sample code

Output example

Prompt: 请你分步骤解答这道题,输出对这道题的思考判断过程。

image

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i2/O1CN01e99Hxt1evMlWM6jUL_!!6000000003933-0-tps-1294-760.jpg"},
                    {"text": "请你分步骤解答这道题,输出对这道题的思考判断过程。"}
                ]
            }
        ]
    }
}'

image

Information extraction

The Qwen-VL models can extract information from receipts, certificates, and forms, and return it in structured form.

Prompt tips:

  • Use delimiters to emphasize the fields that need to be extracted.

  • Specify the output format explicitly, e.g., JSON.

  • Explicitly forbid possible ```json``` code fences in the prompt, e.g., "请你以JSON格式输出,不要输出```json```代码段".

For text-extraction tasks with qwen-vl-plus-2025-01-25, setting presence_penalty to 1.5 and repetition_penalty to 1.0 is recommended to improve accuracy.
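
Putting these tips together, here is a hedged OpenAI-compatible sketch (using the Beijing-region base_url from the examples above; repetition_penalty is not a standard OpenAI parameter, so it is passed via extra_body) that sends the extraction prompt and parses the reply with json.loads:

import json
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

fields = ['发票代码', '发票号码', '到站', '燃油费', '票价', '乘车日期', '开车时间', '车次', '座号']
prompt = f"提取图中的:{fields},请你以JSON格式输出,不要输出```json```代码段。"

completion = client.chat.completions.create(
    model="qwen-vl-plus-2025-01-25",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg"}},
            {"type": "text", "text": prompt},
        ],
    }],
    presence_penalty=1.5,                    # recommended above for text extraction
    extra_body={"repetition_penalty": 1.0},  # non-standard OpenAI parameter
)

data = json.loads(completion.choices[0].message.content)  # fails fast if not pure JSON
print(data["发票号码"])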

Input example

Sample code

Output example

Prompt: 提取图中的:['发票代码','发票号码','到站','燃油费','票价','乘车日期','开车时间','车次','座号'],请你以JSON格式输出,不要输出```json```代码段。

image

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg"},
                    {"text": "提取图中的:['发票代码','发票号码','到站','燃油费','票价','乘车日期','开车时间','车次','座号'],请你以JSON格式输出,不要输出```json```代码段”。"}
                ]
            }
        ]
    }
}'
{
    "发票代码": "221021325353",
    "发票号码": "10283819",
    "到站": "开发区",
    "燃油费": "2.0",
    "票价": "8.00<全>",
    "乘车日期": "2013-06-29",
    "开车时间": "流水",
    "车次": "040",
    "座号": "371"
}

Object localization

Only the Qwen2.5-VL models support object localization, in two modes: Box localization returns the coordinates of a bounding rectangle's top-left and bottom-right corners, while Point localization returns the coordinates of the rectangle's center point. Both are absolute pixel coordinates relative to the top-left corner of the scaled image.

The model scales images before understanding them; you can use the code in Qwen2.5-VL to map coordinates back onto the original image, or set the vl_high_resolution_images parameter to True to keep the image unscaled as far as possible, at the cost of extra tokens.
Within the 480×480 to 2560×2560 resolution range, Qwen2.5-VL's object localization is fairly robust; outside this range, bounding boxes may occasionally drift.
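
Because the returned coordinates refer to the scaled image, mapping them back to the original is just two scale factors. A minimal sketch, assuming you already know the scaled width and height (for example from the resize logic sketched in the image-count section above; map_bbox_to_original is an illustrative helper):

def map_bbox_to_original(bbox, scaled_wh, original_wh):
    """Map [x1, y1, x2, y2] from the scaled image back to original-image pixels."""
    sw, sh = scaled_wh
    ow, oh = original_wh
    x1, y1, x2, y2 = bbox
    return [x1 * ow / sw, y1 * oh / sh, x2 * ow / sw, y2 * oh / sh]

# Example: a box returned on a 980x980 scaled image whose original is 1280x1280
print(map_bbox_to_original([60, 395, 204, 578], (980, 980), (1280, 1280)))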
  1. Prompt tips

| Localization mode | Supported output formats | Recommended prompt |
| Box               | JSON or plain text       | 检测图中所有{物体}并以{JSON/纯文本}格式输出其bbox的坐标 |
| Point             | JSON or XML              | 以点的形式定位图中所有{物体},以{JSON/XML}格式输出其point坐标 |

  2. Prompt refinement

  • When detecting densely packed objects, a prompt such as "检测图中所有人" may lead the model to conflate "every person" with "all people" and output a single box enclosing everyone. Emphasize per-object detection with prompts like the following:

    • Box localization: 定位图中每一个{某类物体}并描述其各自的{某种特征},以{JSON/纯文本}格式输出其bbox坐标。

    • Point localization: 以点的形式定位图中每一个{某类物体}并描述各自的{某种特征},以{JSON/XML}格式输出其point坐标

  • The result may include extraneous content such as ```json``` or ```xml``` fences; forbid it explicitly in the prompt, e.g., "请你以JSON格式输出,不要输出```json```代码段".

Input example

Sample code

Output example

Box localization:

Prompt: 定位每一个蛋糕的位置,并描述其各自的特征,以JSON格式输出所有的bbox的坐标,不要输出```json```代码段。

image

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i3/O1CN01I1CXf21UR0Ld20Yzs_!!6000000002513-2-tps-1024-1024.png"},
                    {"text":  "用一个个框定位图像每一个蛋糕的位置并描述其各自的特征,以JSON格式输出所有的bbox的坐标,不要输出```json```代码段"}
                ]
            }
        ]
    },
    "parameters": {
        "vl_high_resolution_images": true,
        "temperature": 0,
        "top_k": 1,
        "seed": 3407
    }
}'
[
  {
    "bbox": [60, 395, 204, 578],
    "description": "巧克力蛋糕,顶部覆盖红色糖霜和彩色糖粒"
  },
  {
    "bbox": [248, 381, 372, 542],
    "description": "粉色糖霜的蛋糕,顶部有白色和蓝色的糖粒"
  },
  {
    "bbox": [400, 368, 504, 504],
    "description": "粉色糖霜的蛋糕,顶部有白色和蓝色的糖粒"
  },
  {
    "bbox": [530, 355, 654, 526],
    "description": "粉色糖霜的蛋糕,顶部有白色和蓝色的糖粒"
  },
  {
    "bbox": [432, 445, 566, 606],
    "description": "粉红色糖霜的蛋糕,顶部有两个黑色眼睛"
  },
  {
    "bbox": [630, 475, 774, 646],
    "description": "黄色糖霜的蛋糕,顶部有多种颜色的糖粒"
  },
  {
    "bbox": [740, 380, 868, 539],
    "description": "巧克力蛋糕,顶部覆盖棕色糖霜"
  },
  {
    "bbox": [796, 512, 960, 693],
    "description": "黄色糖霜的蛋糕,顶部有多种颜色的糖粒"
  },
  {
    "bbox": [39, 555, 200, 736],
    "description": "黄色糖霜的蛋糕,顶部有多种颜色的糖粒"
  },
  {
    "bbox": [292, 546, 446, 707],
    "description": "黑色蛋糕,顶部有白色糖霜和两个黑色眼睛"
  },
  {
    "bbox": [516, 564, 666, 715],
    "description": "黄色糖霜的蛋糕,顶部有两个黑色眼睛"
  },
  {
    "bbox": [352, 655, 516, 822],
    "description": "白色糖霜的蛋糕,顶部有两个黑色眼睛"
  },
  {
    "bbox": [130, 746, 304, 924],
    "description": "白色糖霜的蛋糕,顶部有两个黑色眼睛"
  }
]
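
To sanity-check output like this, you can draw the boxes onto the image with Pillow. A short illustrative sketch (the local filename cakes.png is hypothetical, and the coordinates are assumed to refer to this image's scale, as discussed above):

# pip install pillow
from PIL import Image, ImageDraw

# First two detections from the output above
detections = [
    {"bbox": [60, 395, 204, 578]},
    {"bbox": [248, 381, 372, 542]},
]
with Image.open("cakes.png") as img:  # local copy of the input image (hypothetical filename)
    draw = ImageDraw.Draw(img)
    for det in detections:
        draw.rectangle(det["bbox"], outline="red", width=3)
    img.save("cakes_annotated.png")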

Point localization:

Prompt: 以点的形式定位图中见义勇为的人,并以XML格式输出结果,不要输出```xml```代码段。

image

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i1/O1CN01ILRlNK1gvU5xqbaxb_!!6000000004204-49-tps-1138-640.webp"},
                    {"text":  "以点的形式定位图中见义勇为的人,并以XML格式输出结果,不要输出```xml```代码段。"}
                ]
            }
        ]
    },
    "parameters": {
        "vl_high_resolution_images": true,
        "temperature": 0,
        "top_k": 1,
        "seed": 3407
    }
}'
< points x1 = "284"
y1 = "305"
alt = "见义勇为的人" > 见义勇为的人 < /points>

Document parsing

Only the Qwen2.5-VL models support parsing image-based documents (such as scans or image PDFs) into QwenVL HTML, a format that not only recognizes text accurately but also captures the positions of elements such as images and tables.

Prompt tip: you need to steer the model toward QwenVL HTML in the prompt; otherwise the result is plain HTML text without position information:

  • Recommended system prompt: "You are an AI specialized in recognizing and extracting text from images. Your mission is to analyze the image document and generate the result in QwenVL Document Parser HTML format using specified tags while maintaining user privacy and data integrity."

  • Recommended user prompt: "QwenVL HTML"

Input example

Sample code

Output example

image

To avoid the risk of timeouts from very long outputs, the following code uses streaming:
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": "You are an AI specialized in recognizing and extracting text from images. Your mission is to analyze the image document and generate the result in QwenVL Document Parser HTML format using specified tags while maintaining user privacy and data integrity."
            },
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i3/O1CN01nVbWzy1vx3iInC3z0_!!6000000006238-0-tps-1430-2022.jpg"},
                    {"text": "QwenVL HTML"}
                ]
            }
        ]
    },
    "parameters": {
        "incremental_output": true
    }
}'
The complete output is as follows:
```html
<html><body>
<h2 data-bbox=\"91 95 223 120\"> 1 Introduction</h2> 
 <p data-bbox=\"91 128 742 296\">The sparks of artificial general intelligence (AGI) are increasingly visible through the fast development of large foundation models, notably large language models (LLMs) (Brown et al., 2020; OpenAI, 2023; 2024; Gemini Team, 2024; Anthropic, 2023a,b; 2024; Bai et al., 2023; Yang et al., 2024a; Touvron et al., 2023a,b; Dubey et al., 2024). The continuous advancement in model and data scaling, combined with the paradigm of large-scale pre-training followed by high-quality supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022), has enabled large language models (LLMs) to develop emergent capabilities in language understanding, generation, and reasoning. Building on this foundation, recent breakthroughs in inference time scaling, particularly demonstrated by o1 (OpenAI, 2024b), have enhanced LLMs’ capacity for deep thinking through step-by-step reasoning and reflection. These developments have elevated the potential of language models, suggesting they may achieve significant breakthroughs in scientific exploration as they continue to demonstrate emergent capabilities indicative of more general artificial intelligence.</p> 
 <p data-bbox=\"91 296 742 448\">Besides the fast development of model capabilities, the recent two years have witnessed a burst of open (open-weight) large language models in the LLM community, for example, the Llama series (Touvron et al., 2023a,b; Dubey et al., 2024), Mistral series (Jiang et al., 2023a; 2024a), and our Qwen series (Bai et al., 2023; Yang et al., 2024a; Qwen Team, 2024a; Hui et al., 2024; Qwen Team, 2024c; Yang et al., 2024b). The open-weight models have democratized the access of large language models to common users and developers, enabling broader research participation, fostering innovation through community collaboration, and accelerating the development of AI applications across diverse domains.</p> 
 <p data-bbox=\"91 448 742 586\">Recently, we release the details of our latest version of the Qwen series, Qwen2.5. In terms of the openweight part, we release pre-trained and instruction-tuned models of 7 sizes, including $0.5 \\mathrm{~B}, 1.5 \\mathrm{~B}, 3 \\mathrm{~B}, 7 \\mathrm{~B}$, $14 \\mathrm{~B}, 32 \\mathrm{~B}$, and $72 \\mathrm{~B}$, and we provide not only the original models in bfloat16 precision but also the quantized models in different precisions. Specifically, the flagship model Qwen2.5-72B-Instruct demonstrates competitive performance against the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Additionally, we also release the proprietary models of Mixture-of-Experts (MoE, Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022), namely Qwen2.5-Turbo and Qwen2.5-Plus ${ }^{1}$, which performs competitively against GPT-4o-mini and GPT-4o respectively.</p> 
 <p data-bbox=\"91 586 742 868\">In this technical report, we introduce Qwen2.5, the result of our continuous endeavor to create better LLMs. Below, we show the key features of the latest version of Qwen: </p> 
 <ul data-bbox=\"136 614 742 868\"><li data-bbox=\"136 614 742 712\">Better in Size: Compared with Qwen2, in addition to $0.5 \\mathrm{~B}, 1.5 \\mathrm{~B}, 7 \\mathrm{~B}$, and $72 \\mathrm{~B}$ models, Qwen2.5 brings back the $3 \\mathrm{~B}, 14 \\mathrm{~B}$, and $32 \\mathrm{~B}$ models, which are more cost-effective for resource-limited scenarios and are under-represented in the current field of open foundation models. Qwen2.5Turbo and Qwen2.5-Plus offer a great balance among accuracy, latency, and cost.</li><li data-bbox=\"136 708 742 784\">Better in Data: The pre-training and post-training data have been improved significantly. The pre-training data increased from 7 trillion tokens to 18 trillion tokens, with focus on knowledge, coding, and mathematics. The pre-training is staged to allow transitions among different mixtures. The post-training data amounts to 1 million examples, across the stage of supervised finetuning (SFT, Ouyang et al., 2022), direct preference optimization (DPO, Raffelov et al., 2023), and group relative policy optimization (GRPO, Shao et al., 2024).</li><li data-bbox=\"136 780 742 868\">Better in Use: Several key limitations of Qwen2 in use have been eliminated, including larger generation length (from 2K tokens to 8K tokens), better support for structured input and output, (e.g., tables and JSON), and easier tool use. In addition, Qwen2.5-Turbo supports a context length of up to 1 million tokens.</li></ul> 
 <h2 data-bbox=\"91 892 338 920\"> 2 Architecture &amp; Tokenizer</h2> 
 <p data-bbox=\"91 926 742 978\">Basically, the Qwen2.5 series include dense models for opensource, namely Qwen2.5-0.5B / 1.5B / 3B / $7 \\mathrm{~B} / 14 \\mathrm{~B} / 32 \\mathrm{~B} / 72 \\mathrm{~B}$, and MoE models for API service, namely Qwen2.5-Turbo and Qwen2.5-Plus. Below, we provide details about the architecture of models.</p> 
 <p data-bbox=\"91 982 742 1070\">For dense models, we maintain the Transformer-based decoder architecture (Vaswani et al., 2017; Radford et al., 2018) as Qwen2 (Yang et al., 2024a). The architecture incorporates several key components: Grouped Query Attention (GQA, Ainslie et al., 2023) for efficient KV cache utilization, SwiGLU activation function (Dauphin et al., 2017) for non-linear activation, Rotary Positional Embeddings (RoPE, Su</p> 
 <hr/> 
 <section class=\"footnotes\" data-bbox=\"91 1028 742 1070\"><ol class=\"footnotes-list\" data-bbox=\"91 1028 742 1070\"><li class=\"footnote-item\" data-bbox=\"91 1028 742 1070\"><p data-bbox=\"91 1028 742 1070\">${ }^{1}$ Qwen2.5-Turbo is identified as qwen-turbo-2024-11-01 and Qwen2.5-Plus is identified as qwen-plus-2024-xx-xx (to be released) in the API.</p></li></ol></section> 
</body>
```
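
The equivalent streaming call through the DashScope Python SDK might look like the sketch below; stream=True with incremental_output=True is the SDK's streaming pattern, but treat the chunk-handling details as illustrative rather than authoritative:

import os
from dashscope import MultiModalConversation

system_prompt = ("You are an AI specialized in recognizing and extracting text from images. "
                 "Your mission is to analyze the image document and generate the result in "
                 "QwenVL Document Parser HTML format using specified tags while maintaining "
                 "user privacy and data integrity.")
messages = [
    {"role": "system", "content": [{"text": system_prompt}]},
    {"role": "user", "content": [
        {"image": "https://img.alicdn.com/imgextra/i3/O1CN01nVbWzy1vx3iInC3z0_!!6000000006238-0-tps-1430-2022.jpg"},
        {"text": "QwenVL HTML"},
    ]},
]
responses = MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-max-latest",
    messages=messages,
    stream=True,
    incremental_output=True,  # each chunk carries only the newly generated text
)
html = ""
for chunk in responses:
    content = chunk["output"]["choices"][0]["message"].content
    if content:
        html += content[0]["text"]
print(html)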

Video understanding

The Qwen2.5-VL models can perceive temporal information: they can locate specific events in a video and summarize the key points of different time ranges.

Prompt tips:

  • State the task explicitly:

    • Specify the time range for video understanding, e.g., "请你描述下列视频中的一系列事件" or "请你描述00:05:00 至 00:10:00时间段中的一系列事件"

    • Event counting, e.g., "统计视频中'知识点讲解'场景出现的次数及总时长,并记录事件的起始和结束时间戳"

    • Action or scene localization, e.g., "视频00:03:25附近5秒内是否有'选手失误'事件?要求精确到最近0.5秒"

    • Segmenting long videos, e.g., "将下列2小时会议视频按每3分钟生成一个摘要(含时间戳),重点标注'提问环节'和'决议通过'事件"

  • Specify the output requirements or format:

    • JSON structure constraints: require the model to return the timestamps (start_time, end_time), the event type (category), and the specific event (event) in JSON format.

    • Time format, e.g., "请使用HH:mm:ss或秒数(如:20秒)表示时间戳"

Input example

Sample code

Output example

Prompt: 请你描述下视频中的人物的一系列动作,以JSON格式输出开始时间(start_time)、结束时间(end_time)、事件(event),请使用HH:mm:ss表示时间戳,不要输出```json```代码段。

You can control the frame-sampling frequency with the fps parameter: one frame is extracted from the video every 1/fps seconds before the content is analyzed. For detailed usage, see Video understanding.
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"video": "https://cloud.video.taobao.com/vod/C6gCj5AJ3Qrd_UQ9kaMVRY9Ig9G-WToxVYSPRdNXCao.mp4","fps": 8.0},
                    {"text": "请你描述下视频中的人物的一系列动作,以JSON格式输出开始时间(start_time)、结束事件(end_time)、事件(event),请使用HH:mm:ss表示 时间戳,不要输出```json```代码段。"}
                ]
            }
        ]
    }
}'
[
    {
        "start_time": "00:00:00.00",
        "end_time": "00:00:04.00",
        "event": "人物从画面左侧走向桌子,手中拿着一个纸箱。"
    },
    {
        "start_time": "00:00:04.00",
        "end_time": "00:00:06.00",
        "event": "人物将纸箱放在桌子上。"
    },
    {
        "start_time": "00:00:06.00",
        "end_time": "00:00:10.00",
        "event": "人物用右手拿起一个扫描枪,对准纸箱上的条形码进行扫描。"
    },
    {
        "start_time": "00:00:10.00",
        "end_time": "00:00:12.00",
        "event": "人物将扫描枪放回原位。"
    },
    {
        "start_time": "00:00:12.00",
        "end_time": "00:00:15.00",
        "event": "人物用双手拿起纸箱,将其移至一旁。"
    },
    {
        "start_time": "00:00:15.00",
        "end_time": "00:00:20.00",
        "event": "人物用右手拿起一支笔,在桌上的笔记本上记录信息。"
    }
]
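
For downstream processing such as event counting or total-duration statistics, the HH:mm:ss timestamps convert to seconds in a few lines. An illustrative helper (the sample events are abbreviated from the output above):

def hms_to_seconds(ts):
    """Convert 'HH:mm:ss' (with an optional fractional part) to seconds."""
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

# Abbreviated from the output above
events = [
    {"start_time": "00:00:06.00", "end_time": "00:00:10.00", "event": "扫描条形码"},
    {"start_time": "00:00:12.00", "end_time": "00:00:15.00", "event": "移动纸箱"},
]
total = sum(hms_to_seconds(e["end_time"]) - hms_to_seconds(e["start_time"]) for e in events)
print(f"{len(events)} events, {total:.1f} s in total")  # 2 events, 7.0 s in total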

API reference

For the input and output parameters of the Qwen-VL models, see Qwen.

FAQ

How do I compress an image or video to meet the size requirements?

Qwen-VL limits the size of input files; files can be compressed with the following methods.

Image compression

  • Online tools: compress with online tools such as CompressJPEG or TinyPng.

  • Local software: use software such as Photoshop and adjust the quality on export.

  • Code:

    # pip install pillow
    
    from PIL import Image
    def compress_image(input_path, output_path, quality=85):
        with Image.open(input_path) as img:
            # JPEG has no alpha channel, so convert first (handles PNG/RGBA input)
            img = img.convert("RGB")
            img.save(output_path, "JPEG", optimize=True, quality=quality)
    
    # Pass in a local image
    compress_image("/xxx/before-large.jpeg", "/xxx/after-min.jpeg")

Video compression

  • Online tools: compress with online tools such as FreeConvert.

  • Local software: use software such as HandBrake.

  • Code: use the FFmpeg tool; for more usage, see the FFmpeg website.

    # Basic conversion command (general-purpose template)
    # -i: input file path, e.g., input.mp4
    # -vcodec: video encoder; common values are libx264 (general recommendation) and libx265 (higher compression)
    # -crf: controls video quality, range [18-28]; smaller values mean higher quality and larger files
    # -preset: balances encoding speed against compression efficiency; common values are slow, fast, faster
    # -y: overwrite an existing output file (takes no value)
    # output.mp4: output file path
    
    ffmpeg -i input.mp4 -vcodec libx264 -crf 28 -preset slow output.mp4

Does Qwen-VL support submitting tasks in batches?

Currently the qwen-vl-max, qwen-vl-max-latest, qwen-vl-plus, and qwen-vl-plus-latest models are compatible with the OpenAI Batch API, which lets you submit tasks in bulk as a file for asynchronous execution. For large-scale workloads that can tolerate some delay, Batch calls are recommended: they cost only 50% of real-time calls.
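
A hedged sketch of that OpenAI-compatible Batch flow (upload a JSONL file of requests, create the batch, poll until done; the filename batch_requests.jsonl and the polling loop are illustrative, and the endpoint value follows the OpenAI Batch convention):

import os
import time
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# batch_requests.jsonl: one chat.completions request per line (custom_id, method, url, body)
input_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
while batch.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)  # batches run asynchronously; poll at a low frequency
    batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    print(client.files.content(batch.output_file_id).text)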

Does Qwen-VL support PDF, XLSX, XLS, DOC, and other text files?

No. Qwen-VL is a visual understanding model: it can only process files in image formats and cannot process text files directly. If you need to, consider these alternatives:

  • Convert the text file to images. Note that the converted images may be too large; split the file across multiple images and pass them with multi-image input.

  • Qwen-Long supports text files and can be used to parse file contents.

How do I handle model timeouts?

We recommend streaming output to reduce the risk of timeouts. With non-streaming calls, a timeout error is usually triggered if the model has not finished generating within 180 seconds. To improve the experience, after a timeout the response body returns the content generated so far rather than a timeout error; if the response headers contain x-dashscope-partialresponse: true, the response was cut short by a timeout. The supported models are:

  • qwen-max-2024-09-19 and later

  • qwen-plus-2024-11-25 and later

  • qwen-flash-2025-07-28 and later

  • qwen-turbo-2024-11-01 and later

  • qwen-vl-max-2025-01-25 and later

  • qwen-vl-plus-2025-01-02 and later

  • qwen-long-2025-01-25 and later

  • qwen3 open-source models (qwen3-235b-a22b, qwen3-32b, qwen3-30b-a3b, qwen3-14b, qwen3-8b, qwen3-4b, qwen3-1.7b, qwen3-0.6b)

  • qwen2.5 open-source models (qwen2.5-14b-instruct-1m, qwen2.5-7b-instruct-1m, qwen2.5-72b-instruct, qwen2.5-32b-instruct, qwen2.5-14b-instruct, qwen2.5-7b-instruct, qwen2.5-3b-instruct, qwen2.5-1.5b-instruct, qwen2.5-0.5b-instruct)

If you cannot access the response headers (for example, when calling through an SDK), the returned finish_reason field is an auxiliary signal: if finish_reason is null, the generated content is incomplete (though not necessarily because a timeout was triggered).

Some Qwen-VL models support prefix continuation: append the content generated so far to the messages array and issue another request so the model keeps generating. For details, see Continue generating from content returned after a timeout.
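
With the OpenAI Python SDK, one way to inspect the response headers mentioned above is with_raw_response; the sketch below is illustrative, and the header casing may differ by environment:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
raw = client.chat.completions.with_raw_response.create(
    model="qwen-vl-max-latest",
    messages=[{"role": "user", "content": "..."}],  # placeholder prompt
)
completion = raw.parse()  # the usual ChatCompletion object
if raw.headers.get("x-dashscope-partialresponse") == "true":
    print("response was cut short by the 180 s timeout")
print(completion.choices[0].message.content, completion.choices[0].finish_reason)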

Can Qwen-VL solve math problems in images?

Yes. There are currently two approaches:

  • Use the visual reasoning QVQ model: leverage QVQ's reasoning ability to solve the math problem in the image.

  • Use a general-purpose Qwen-VL model:

    • For simple math problems, let a Qwen-VL model answer directly;

    • For complex math problems, first use Qwen-VL's OCR capability to extract the problem from the image, then solve it with a Qwen math model.

How is Qwen-VL billed?

  • How do I get the free quota?

    The validity period starts on the day Model Studio is activated or the model application is approved. Within the 180-day validity period, Qwen-VL models come with a free quota of 100,000 or 1,000,000 tokens; for details, see Model list and prices.

  • How do I check a model's remaining quota?

    • On the Models page of the Alibaba Cloud Model Studio console, find the Qwen-VL model and click Details to see the free quota, remaining quota, and expiry time. If no free quota is shown, the free quota for this model under your account has expired.

    • Free-quota data is updated hourly; during peak periods there may be delays of up to several hours.

  • Is there a reminder when the free quota runs out?

    • There is currently no reminder mechanism.

    • You can enable the stop-on-exhausted-free-quota feature: once the free quota is used up, calls fail (returning error code AllocationQuota.FreeTierOnly), which avoids additional charges.

  • How are fees calculated?

    Total cost = input tokens × input unit price + output tokens × output unit price. Images are converted to tokens at one token per 28 × 28 pixels, with a minimum of 4 tokens per image.

  • How do I view my bill?

    You can view bills and top up on the Expenses and Costs page of the Alibaba Cloud console.

  • How do I turn off billing?

    Model calls on the Alibaba Cloud Model Studio platform are billed by usage: no calls, no charges. You can manage the risk of charges as follows:

    • Delete the API keys you have created. Without an API key you cannot call Model Studio models through the API, so no model-call fees will be incurred.

    • Set a high-spend alert: when the daily bill for the configured product exceeds the alert threshold, an SMS reminder is sent once per day (based on the bill as of 24:00 of the previous day).

Related links

Error codes

If a model call fails and returns an error, see Error messages to resolve it.
