Text Extraction (OCR)

通义千问OCR (Qwen-VL-OCR) is a model dedicated to text extraction, specializing in extracting text from images of documents, tables, exam questions, handwriting, and similar content. It can recognize text in multiple languages; the currently supported languages are Chinese, English, Arabic, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Vietnamese.

You can try the 通义千问OCR models online in Alibaba Cloud Model Studio (百炼).

Supported Models

| Model | Version | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input/output price (per 1K tokens) | Free quota |
|---|---|---|---|---|---|---|
| qwen-vl-ocr (currently identical to qwen-vl-ocr-2024-10-28) | Stable | 34,096 | 30,000 (max 30,000 per image) | 4,096 | 0.005 | 1,000,000 tokens, valid for 180 days after activating Model Studio (百炼) |
| qwen-vl-ocr-latest (always identical to the latest snapshot) | Latest | 38,192 | 30,000 (max 30,000 per image) | 8,192 | 0.005 | Same as above |
| qwen-vl-ocr-2025-04-13 (also called qwen-vl-ocr-0413; substantially improved text recognition, six new built-in OCR tasks, plus custom prompts and image rotation correction) | Snapshot | 38,192 | 30,000 (max 30,000 per image) | 8,192 | 0.005 | Same as above |
| qwen-vl-ocr-2024-10-28 (also called qwen-vl-ocr-1028) | Snapshot | 34,096 | 30,000 (max 30,000 per image) | 4,096 | 0.005 | Same as above |

Rules for Converting Images to Tokens

Every 28×28 pixel patch corresponds to one token, and an image requires at least 4 tokens. The total token count is therefore approximately (resized height × resized width) / (28 × 28), plus 2 tokens for the <|vision_bos|> and <|vision_eos|> vision markers. You can estimate an image's token count by running the following code:

Python

import math
from PIL import Image


def smart_resize(image_path, min_pixels, max_pixels):
    """
    Preprocess the image.

    Args:
        image_path: path to the image
        min_pixels: lower bound on total pixels after resizing
        max_pixels: upper bound on total pixels after resizing
    """
    # Open the specified image file
    image = Image.open(image_path)

    # Get the original dimensions
    height = image.height
    width = image.width
    # Round the height to the nearest multiple of 28
    h_bar = round(height / 28) * 28
    # Round the width to the nearest multiple of 28
    w_bar = round(width / 28) * 28

    # Rescale so that the total number of pixels falls within [min_pixels, max_pixels]
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / 28) * 28
        w_bar = math.floor(width / beta / 28) * 28
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / 28) * 28
        w_bar = math.ceil(width * beta / 28) * 28
    return h_bar, w_bar


# Replace xxx/test.png with the path to your local image
h_bar, w_bar = smart_resize("xxx/test.png", min_pixels=28 * 28 * 4, max_pixels=8192 * 28 * 28)
print(f"Resized image dimensions: height {h_bar}, width {w_bar}")

# Compute the image's token count: total pixels divided by 28 * 28
token = int((h_bar * w_bar) / (28 * 28))

# <|vision_bos|> and <|vision_eos|> are vision markers, each counting as 1 token
print(f"Total token count for the image: {token + 2}")
Node.js

// Install sharp with: npm install sharp
import sharp from 'sharp';

async function smartResize(imagePath, minPixels, maxPixels) {
    // Open the specified image file
    const image = sharp(imagePath);
    const metadata = await image.metadata();

    // Get the original dimensions
    const height = metadata.height;
    const width = metadata.width;

    // Round the height to the nearest multiple of 28
    let hBar = Math.round(height / 28) * 28;
    // Round the width to the nearest multiple of 28
    let wBar = Math.round(width / 28) * 28;

    // Rescale so that the total number of pixels falls within [minPixels, maxPixels]
    if (hBar * wBar > maxPixels) {
        // Compute the scaling factor beta so the resized image does not exceed maxPixels
        const beta = Math.sqrt((height * width) / maxPixels);
        // Recompute the height, keeping it a multiple of 28
        hBar = Math.floor(height / beta / 28) * 28;
        // Recompute the width, keeping it a multiple of 28
        wBar = Math.floor(width / beta / 28) * 28;
    } else if (hBar * wBar < minPixels) {
        // Compute the scaling factor beta so the resized image is not below minPixels
        const beta = Math.sqrt(minPixels / (height * width));
        // Recompute the height, keeping it a multiple of 28
        hBar = Math.ceil(height * beta / 28) * 28;
        // Recompute the width, keeping it a multiple of 28
        wBar = Math.ceil(width * beta / 28) * 28;
    }

    return { hBar, wBar };
}

// Replace xxx/test.png with the path to your local image
const imagePath = 'xxx/test.png';
smartResize(imagePath, 3136, 6422528).then(({ hBar, wBar }) => {
    console.log(`Resized image dimensions: height ${hBar}, width ${wBar}`);

    // Compute the image's token count: total pixels divided by 28 * 28
    const token = Math.floor((hBar * wBar) / (28 * 28));

    // The system automatically adds the <|vision_bos|> and <|vision_eos|> vision markers (1 token each)
    console.log(`Total token count for the image: ${token + 2}`);
}).catch(err => {
    console.error('Error processing image:', err);
});
Java

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class Main {

    // Holder class for the resized dimensions
    public static class ResizedSize {
        public final int height;
        public final int width;

        public ResizedSize(int height, int width) {
            this.height = height;
            this.width = width;
        }
    }

    public static ResizedSize smartResize(String imagePath, int minPixels, int maxPixels) throws IOException {
        // 1. Load the image
        BufferedImage image = ImageIO.read(new File(imagePath));
        if (image == null) {
            throw new IOException("Unable to load image file: " + imagePath);
        }

        int originalHeight = image.getHeight();
        int originalWidth = image.getWidth();

        // 2. Round each side to a multiple of 28
        int hBar = (int) (Math.round(originalHeight / 28.0) * 28);
        int wBar = (int) (Math.round(originalWidth / 28.0) * 28);
        int currentPixels = hBar * wBar;

        // 3. Rescale if outside the [minPixels, maxPixels] range
        if (currentPixels > maxPixels) {
            // Too many pixels: scale down
            double beta = Math.sqrt(
                    (originalHeight * (double) originalWidth) / maxPixels
            );
            double scaledHeight = originalHeight / beta;
            double scaledWidth = originalWidth / beta;

            hBar = (int) (Math.floor(scaledHeight / 28) * 28);
            wBar = (int) (Math.floor(scaledWidth / 28) * 28);
        } else if (currentPixels < minPixels) {
            // Too few pixels: scale up
            double beta = Math.sqrt(
                    (double) minPixels / (originalHeight * originalWidth)
            );
            double scaledHeight = originalHeight * beta;
            double scaledWidth = originalWidth * beta;

            hBar = (int) (Math.ceil(scaledHeight / 28) * 28);
            wBar = (int) (Math.ceil(scaledWidth / 28) * 28);
        }

        return new ResizedSize(hBar, wBar);
    }

    public static void main(String[] args) {
        try {
            ResizedSize size = smartResize(
                    // Replace xxx/test.png with the path to your image
                    "xxx/test.png",
                    28 * 28 * 4,
                    8192 * 28 * 28
            );

            System.out.printf("Resized image dimensions: height %d, width %d%n", size.height, size.width);

            // Compute the token count (total pixels / (28 × 28), plus 2 vision markers)
            int token = (size.height * size.width) / (28 * 28) + 2;
            System.out.printf("Total token count for the image: %d%n", token);

        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
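As a worked example: the sample train-ticket image used in the code below is 689×487 pixels (dimensions taken from the tps-689-487 suffix of its URL). Rounding each side to the nearest multiple of 28 gives round(487 / 28) × 28 = 476 and round(689 / 28) × 28 = 700. The product 476 × 700 = 333,200 pixels already lies within [3136, 6422528] (the thresholds used in these samples), so no rescaling is needed, and the token count is 333,200 / 784 = 425, plus 2 vision markers = 427. This matches the image_tokens value of 427 reported in the DashScope sample responses below.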

You can choose the fully upgraded qwen-vl-ocr-latest model for text extraction tasks. Compared with the previous snapshot, it adds the following capabilities:

  • Supports rotation correction before text extraction, useful when the input image is tilted.

  • Adds six built-in OCR tasks: general text recognition, key information extraction, document parsing, table parsing, formula recognition, and multilingual recognition.

  • When no built-in task is set, you can supply your own prompt for guidance; when a built-in task is set, the model internally uses the prompt defined for that task to ensure recognition quality.

    The DashScope SDK supports image rotation correction and setting built-in tasks. To run a built-in OCR task through the OpenAI SDK, you must manually supply the prompt defined for that task.

How to Use

Prerequisites

You must have obtained an API key and configured the API key as an environment variable. To call the model through the OpenAI SDK or DashScope SDK, you also need to install the latest version of the SDK.

The minimum version is 1.22.2 for the DashScope Python SDK and 2.18.4 for the DashScope Java SDK.

Quick Start

The following sample code uses qwen-vl-ocr-latest to extract text from an image passed by URL. The model accepts a user-supplied prompt for guidance; if none is provided, it uses the default prompt: Please output only the text content from the image without any additional descriptions or formatting.

The qwen-vl-ocr and qwen-vl-ocr-2024-10-28 models do not support user prompts; internally they use the default prompt: Read all the text in the image.
See Image limits for input image requirements; to pass a local image, see Using local files.

OpenAI-compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # If you have not configured the environment variable, replace the next line with: api_key="sk-xxx" (your Model Studio API key)
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr-latest",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                    # Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
                    "min_pixels": 28 * 28 * 4,
                    # Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
                    "max_pixels": 28 * 28 * 8192
                },
                # qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                # qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
                {"type": "text",
                 "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},
            ]
        }
    ])

print(completion.choices[0].message.content)

Node.js

import OpenAI from 'openai';

const openai = new OpenAI({
  // If you have not configured the environment variable, replace the next line with: apiKey: "sk-xxx" (your Model Studio API key)
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
});

async function main() {
  const response = await openai.chat.completions.create({
    model: 'qwen-vl-ocr-latest',
    messages: [
      {
        role: 'user',
        content: [
          // qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
          // qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
          { type: 'text', text: "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},
          {
            type: 'image_url',
            image_url: {
              url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
            },
            // Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
            "min_pixels": 28 * 28 * 4,
            // Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
            "max_pixels": 28 * 28 * 8192
          }
        ]
      }
    ],
  });
  console.log(response.choices[0].message.content)
}

main();

curl

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr-latest",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                    "min_pixels": 3136,
                    "max_pixels": 6422528
                },
                {"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"}
            ]
        }
    ]
}'

Sample response

{
  "choices": [{
    "message": {
      "content": "```json\n{\n    \"发票号码\": \"24329116804000\",\n    \"车次\": \"G1948\",\n    \"起始站\": \"南京南站\",\n    \"终点站\": \"郑州东站\",\n    \"发车日期和时间点\": \"2024年11月14日11:46开\",\n    \"座位号\": \"04车12A号\",\n    \"席别类型\": \"二等座\",\n    \"票价\": \"¥337.50\",\n    \"身份证号码\": \"4107281991****5515\",\n    \"购票人姓名\": \"读小光\"\n}\n```",
      "role": "assistant"
    },
    "finish_reason": "stop",
    "index": 0,
    "logprobs": null
  }],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 606,
    "completion_tokens": 159,
    "total_tokens": 765
  },
  "created": 1742528311,
  "system_fingerprint": null,
  "model": "qwen-vl-ocr-latest",
  "id": "chatcmpl-20e5d9ed-e8a3-947d-bebb-c47ef1378598"
}

DashScope

Python

import os
import dashscope

messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                # Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction
                "enable_rotate": True},
                # When no built-in task is set, qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                # qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
                {"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"}]
        }]
response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the next line with: api_key="sk-xxx" (your Model Studio API key)
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-latest',
    messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
        // Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction
        map.put("enable_rotate", true);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When no built-in task is set, qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        // qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
                        Collections.singletonMap("text", "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the next line with: .apiKey("sk-xxx") (your Model Studio API key)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-latest")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
  --header "Authorization: Bearer $DASHSCOPE_API_KEY"\
  --header 'Content-Type: application/json'\
  --data '{
"model": "qwen-vl-ocr-latest",
"input": {
  "messages": [
    {
      "role": "user",
      "content": [{
          "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
          "min_pixels": 3136,
          "max_pixels": 6422528,
          "enable_rotate": true
        },
        {
          "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"
        }
      ]
    }
  ]
}
}'

Sample response

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "```json\n{\n    \"发票号码\": \"24329116804000\",\n    \"车次\": \"G1948\",\n    \"起始站\": \"南京南站\",\n    \"终点站\": \"郑州东站\",\n    \"发车日期和时间点\": \"2024年11月14日11:46开\",\n    \"座位号\": \"04车12A号\",\n    \"席别类型\": \"二等座\",\n    \"票价\": \"¥337.50\",\n    \"身份证号码\": \"4107281991****5515\",\n    \"购票人姓名\": \"读小光\"\n}\n```"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 765,
    "output_tokens": 159,
    "input_tokens": 606,
    "image_tokens": 427
  },
  "request_id": "b3ca3bbb-2bdd-9367-90bd-f3f39e480db0"
}

Streaming Output

After receiving the input, the model generates intermediate results step by step, and the final result is the concatenation of these intermediate chunks. This generate-as-you-emit mode is called streaming output. With streaming you can start reading while the model is still generating, which also reduces the risk of output timeouts.

OpenAI-compatible

Set the stream parameter to true in your code to enable streaming output.

Python

import os
from openai import OpenAI

client = OpenAI(
    # If you have not configured the environment variable, replace the next line with: api_key="sk-xxx" (your Model Studio API key)
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr-latest",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                    # Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
                    "min_pixels": 28 * 28 * 4,
                    # Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
                    "max_pixels": 28 * 28 * 8192
                },
                # qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                # qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
                {"type": "text",
                 "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},
            ]
        }
    ],
    stream=True,
    stream_options={"include_usage": True}
)
full_content = ""
print("Streaming output:")
for chunk in completion:
    if chunk.choices:
        full_content += chunk.choices[0].delta.content
        print(chunk.choices[0].delta.content)
    else:
        print(chunk.usage)

print(f"Full content: {full_content}")

Node.js

import OpenAI from 'openai';

const openai = new OpenAI({
  // If you have not configured the environment variable, replace the next line with: apiKey: "sk-xxx" (your Model Studio API key)
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
});

async function main() {
  const response = await openai.chat.completions.create({
    model: 'qwen-vl-ocr-latest',
    messages: [
      {
        role: 'user',
        content: [
          // qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
          // qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
          { type: 'text', text: "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},
          {
            type: 'image_url',
            image_url: {
              url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
            },
            // Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
            "min_pixels": 28 * 28 * 4,
            // Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
            "max_pixels": 28 * 28 * 8192
          }
        ]
      }
    ],
    stream: true,
    stream_options: {"include_usage": true}
  });
  let fullContent = "";
  console.log("Streaming output:");
  for await (const chunk of response) {
    if (chunk.choices[0]) {
      fullContent += chunk.choices[0].delta.content;
      console.log(chunk.choices[0].delta.content);
    } else {
      console.log(chunk.usage);
    }
  }
  console.log(`Full content: ${fullContent}`);
}

main();

curl

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr-latest",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                    "min_pixels": 3136,
                    "max_pixels": 6422528
                },
                {"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"}
            ]
        }
    ],
    "stream": true,
    "stream_options": {"include_usage": true}
}'

Sample response

data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}

data: {"choices":[{"finish_reason":null,"delta":{"content":"```json\n{\n    \"发票号码"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}
......
data: {"choices":[{"delta":{"content":" \"读小光\""},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}

data: {"choices":[{"finish_reason":"stop","delta":{"content":"\n}\n```"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}

data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":606,"completion_tokens":159,"total_tokens":765},"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}
data: [DONE]

DashScope

You can enable streaming output by setting the parameter that matches your calling method:

  • Python SDK: set the stream parameter to True.

  • Java SDK: call through the streamCall interface.

  • HTTP: set the X-DashScope-SSE header to enable.

By default the streamed content is non-incremental (each chunk contains everything generated so far). If you need incremental streaming, set the incremental_output parameter (incrementalOutput in Java) to true.

Python

import os
import dashscope

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                # Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction
                "enable_rotate": True,
            },
            # When no built-in task is set, qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
            # qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
            {
                "type": "text",
                "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'",
            },
        ],
    }
]
response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the next line with: api_key="sk-xxx" (your Model Studio API key)
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-ocr-latest",
    messages=messages,
    stream=True,
    incremental_output=True,
)
full_content = ""
print("Streaming output:")
for chunk in response:
    try:
        print(chunk["output"]["choices"][0]["message"].content[0]["text"])
        full_content += chunk["output"]["choices"][0]["message"].content[0]["text"]
    except (KeyError, IndexError, TypeError):
        # Some chunks may carry no text content; skip them
        pass
print(f"Full content: {full_content}")

Returned result

Streaming output:

读者
对象
 如果
你是
Linux环境下的系统
......
行下有大量的命令
 可供使用
,本书将会展示
如何使用它们。
Full content: 读者对象 如果你是Linux环境下的系统管理员,那么学会编写shell脚本将让你受益匪浅。本书并未细述安装 Linux系统的每个步骤,但只要系统已安装好Linux并能运行起来,你就可以开始考虑如何让一些日常 的系统管理任务实现自动化。这时shell脚本编程就能发挥作用了,这也正是本书的作用所在。本书将 演示如何使用shell脚本来自动处理系统管理任务,包括从监测系统统计数据和数据文件到为你的老板 生成报表。 如果你是家用Linux爱好者,同样能从本书中获益。现今,用户很容易在诸多部件堆积而成的图形环境 中迷失。大多数桌面Linux发行版都尽量向一般用户隐藏系统的内部细节。但有时你确实需要知道内部 发生了什么。本书将告诉你如何启动Linux命令行以及接下来要做什么。通常,如果是执行一些简单任 务(比如文件管理) , 在命令行下操作要比在华丽的图形界面下方便得多。在命令行下有大量的命令 可供使用,本书将会展示如何使用它们。

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
        // Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction
        map.put("enable_rotate", true);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When no built-in task is set, qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        // qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
                        Collections.singletonMap("text", "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the next line with: .apiKey("sk-xxx") (your Model Studio API key)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-latest")
                .message(userMessage)
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
            } catch (Exception e) {
                System.exit(0);
            }
        });
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
--data '{
    "model": "qwen-vl-ocr-latest",
    "input":{
        "messages":[
          {
            "role": "user",
            "content": [
                {
                    "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                    "min_pixels": 3136,
                    "max_pixels": 6422528
                },
                {"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"}
            ]
          }
        ]
    }
}'

Sample response

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"```json\n{\n    \"发票号码\": \"24"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":618,"output_tokens":12,"input_tokens":606,"image_tokens":427},"request_id":"8e553c5c-c0db-9cb5-8900-1fc452cafea7"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"3291"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":622,"output_tokens":16,"input_tokens":606,"image_tokens":427},"request_id":"8e553c5c-c0db-9cb5-8900-1fc452cafea7"}
......
id:33
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"小光\"\n}"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":764,"output_tokens":158,"input_tokens":606,"image_tokens":427},"request_id":"8e553c5c-c0db-9cb5-8900-1fc452cafea7"}

id:34
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"\n```"}],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":765,"output_tokens":159,"input_tokens":606,"image_tokens":427},"request_id":"8e553c5c-c0db-9cb5-8900-1fc452cafea7"}

Using Local Files (Base64 Encoding or Local Path)

The following sample code passes a local image to the model. With the OpenAI SDK or HTTP, the local image must be encoded as Base64 before being passed; with the DashScope SDK, you can pass the local image path directly. Sample image: test.png

See Image limits for input image requirements; to pass an image URL instead, see Quick Start.

OpenAI-compatible

Extracting text from a local image with the OpenAI SDK or HTTP involves the following steps:

  1. Encode the image file: read the local image and encode it as Base64 data.

  2. Pass the Base64 data: pass it to the image_url parameter in the format data:image/{format};base64,{base64_image} (see the example after this list).

    image/{format} is the image format. Set it to the Content Type value in the image limits table that matches your actual image format; for example, for a jpg image use image/jpeg.
  3. Call the model: call the 通义千问OCR model and process the returned result.
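For illustration: the Base64 encoding of a JPEG file always begins with /9j/, so a typical local JPEG would be passed as a string of the following form (payload truncated; illustrative only):

data:image/jpeg;base64,/9j/4AAQSkZJRg...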

Python

from openai import OpenAI
import os
import base64

# Read a local file and encode it as Base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/test.png with the absolute path to your local image
base64_image = encode_image("xxx/test.png")
client = OpenAI(
    # If you have not configured the environment variable, replace the next line with: api_key="sk-xxx" (your Model Studio API key)
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr-latest",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # Note: the image format in the data URL (image/{format}) must match the Content Type in the list of supported images. "f" is Python's string-formatting prefix.
                    # PNG image:  f"data:image/png;base64,{base64_image}"
                    # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP image: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    # Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
                    "min_pixels": 28 * 28 * 4,
                    # Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
                    "max_pixels": 28 * 28 * 8192
                },
                # qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                # qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
                {"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},

            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import {
  readFileSync
} from 'fs';


const openai = new OpenAI({
  // If you have not configured the environment variable, replace the next line with: apiKey: "sk-xxx" (your Model Studio API key)
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
});
// Read a local file and encode it as Base64
const encodeImage = (imagePath) => {
  const imageFile = readFileSync(imagePath);
  return imageFile.toString('base64');
};
// Replace xxx/test.jpg with the absolute path to your local image
const base64Image = encodeImage("xxx/test.jpg")
async function main() {
  const completion = await openai.chat.completions.create({
    model: "qwen-vl-ocr-latest",
    messages: [{
      "role": "user",
      "content": [{
          "type": "image_url",
          "image_url": {
            // Note: the image format in the data URL (image/{format}) must match the Content Type in the list of supported images.
            // PNG image:  data:image/png;base64,${base64Image}
            // JPEG image: data:image/jpeg;base64,${base64Image}
            // WEBP image: data:image/webp;base64,${base64Image}
            "url": `data:image/jpeg;base64,${base64Image}`
          },
          // Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
          "min_pixels": 28 * 28 * 4,
          // Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
          "max_pixels": 28 * 28 * 8192
        },
        // qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
        // qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
        {
          "type": "text",
          "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"
        }
      ]
    }]
  });
  console.log(completion.choices[0].message.content);
}

main();

DashScope

When processing local image files with the DashScope SDK, you pass a file path. Construct the path according to your SDK and operating system, as shown below.

| OS | SDK | File path format | Example |
|---|---|---|---|
| Linux or macOS | Python SDK | file://{absolute path to the file} | file:///home/images/test.png |
| Linux or macOS | Java SDK | file://{absolute path to the file} | file:///home/images/test.png |
| Windows | Python SDK | file://{absolute path to the file} | file://D:/images/test.png |
| Windows | Java SDK | file:///{absolute path to the file} | file:///D:/images/test.png |

Python

import os
from dashscope import MultiModalConversation

# Replace xxx/test.png with the absolute path to your local image
local_path = "xxx/test.png"
image_path = f"file://{local_path}"
messages = [
    {
        "role": "user",
        "content": [
            {
                "image": image_path,
                # Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction
                "enable_rotate": True,
            },
            # When no built-in task is set, qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
            # qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
            {
                "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"
            },
        ],
    }
]

response = MultiModalConversation.call(
    # If you have not configured the environment variable, replace the next line with: api_key="sk-xxx" (your Model Studio API key)
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-ocr-latest",
    messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    public static void simpleMultiModalConversationCall(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://" + localPath;
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", filePath);
        // Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction
        map.put("enable_rotate", true);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When no built-in task is set, qwen-vl-ocr-latest accepts a prompt in the text field below; if omitted, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        // qwen-vl-ocr-1028 uses the fixed prompt "Read all the text in the image." and does not accept a custom prompt in text
                        Collections.singletonMap("text", "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the next line with: .apiKey("sk-xxx") (your Model Studio API key)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-latest")
                .message(userMessage)
                .topP(0.001)
                .temperature(0.1f)
                .maxLength(8192)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.png with the absolute path to your local image
            simpleMultiModalConversationCall("xxx/test.png");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Calling Built-in Tasks

Some 通义千问OCR models support built-in tasks that you can call without designing a prompt:

  • Supported tasks: general text recognition, key information extraction, table parsing, document parsing, formula recognition, and multilingual recognition.

  • Applicable models: qwen-vl-ocr-2025-04-13, qwen-vl-ocr-latest.

  • How to use:

    • DashScope SDK or HTTP: no prompt design needed; set the task field of the ocr_options parameter to a built-in task name, and the model internally applies the prompt defined for that task.

    • OpenAI SDK or HTTP: you must manually supply the prompt defined for the task.

General Text Recognition

General text recognition is intended mainly for Chinese and English scenarios and returns the result as plain text. The following sample code calls it through the DashScope SDK or HTTP:

| task value | Task prompt | Output format and example |
|---|---|---|
| text_recognition | Please output only the text content from the image without any additional descriptions or formatting. | Format: plain text. Example: "读者对象\n\n如果你是......" |

Python

import os
import dashscope

messages = [{
            "role": "user",
            "content": [{
                "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
                # Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction
                "enable_rotate": True},
                # When the task field of ocr_options is set to general text recognition, the model uses the content of the text field below as the prompt; custom prompts are not supported
                {"type": "text", "text": "Please output only the text content from the image without any additional descriptions or formatting."}]
        }]
response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the next line with: api_key="sk-xxx" (your Model Studio API key)
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-latest',
    messages=messages,
    # Set the built-in task to general text recognition
    ocr_options={"task": "text_recognition"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
        // Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction
        map.put("enable_rotate", true);
        // Configure the built-in task
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.TEXT_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When the task field of ocr_options is set to general text recognition, the model uses the content of the text field below as the prompt; custom prompts are not supported
                        Collections.singletonMap("text", "Please output only the text content from the image without any additional descriptions or formatting."))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the next line with: .apiKey("sk-xxx") (your Model Studio API key)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-latest")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl

curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
  --header "Authorization: Bearer $DASHSCOPE_API_KEY"\
  --header 'Content-Type: application/json'\
  --data '{
"model": "qwen-vl-ocr-latest",
"input": {
  "messages": [
    {
      "role": "user",
      "content": [{
          "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
          "min_pixels": 3136,
          "max_pixels": 6422528,
          "enable_rotate": true
        },
        {
          "text": "Please output only the text content from the image without any additional descriptions or formatting."
        }
      ]
    }
  ]
},
"parameters": {
  "ocr_options": {
      "task": "text_recognition"
    }
}
}'

Sample response

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "读者对象\n如果你是Linux环境下的系统管理员,那么学会编写shell脚本将让你受益匪浅。本书并未细述安装 Linux系统的每个步骤,但只要系统已安装好Linux并能运行起来,你就可以开始考虑如何让一些日常的系统管理任务实现自动化。这时shell脚本编程就能发挥作用了,这也正是本书的作用所在。本书将演示如何使用shell脚本来自动处理系统管理任务,包括从监测系统统计数据和数据文件到为你的老板生成报表。\n如果你是家用Linux爱好者,同样能从本书中获益。现今,用户很容易在诸多部件堆积而成的图形环境中迷失。大多数桌面Linux发行版都尽量向一般用户隐藏系统的内部细节。但有时你确实需要知道内部发生了什么。本书将告诉你如何启动Linux命令行以及接下来要做什么。通常,如果是执行一些简单任务(比如文件管理),在命令行下操作要比在华丽的图形界面下方便得多。在命令行下有大量的命令可供使用,本书将会展示如何使用它们。"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 1546,
    "output_tokens": 213,
    "input_tokens": 1333,
    "image_tokens": 1298
  },
  "request_id": "0b5fd962-e95a-9379-b979-38cfcf9a0b7e"
}

Key Information Extraction

The model can extract information from receipts, certificates, and forms, returning it as JSON-formatted text. The following sample code calls it through the OpenAI SDK, DashScope SDK, or HTTP:

When calling through the OpenAI SDK or HTTP, you must also replace {result_schema} in the task prompt with the JSON object describing the fields to extract.

result_schema can be any JSON structure with at most 3 levels of nested JSON objects. Fill in only the keys of the JSON object and leave the values empty.
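For illustration, here is a minimal sketch of a nested result_schema (the field names are hypothetical; only the keys are filled in, and every value is left empty):

{
  "发票信息": {
    "发票代码": "",
    "购买方": {
      "名称": "",
      "纳税人识别号": ""
    }
  }
}

This sketch nests three levels of JSON objects, the documented maximum.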

| task value | Task prompt | Output format and example |
|---|---|---|
| key_information_extraction | Suppose you are an information extraction expert. Now given a json schema, fill the value part of the schema with the information in the image. Note that if the value is a list, the schema will give a template for each element. This template is used when there are multiple list elements in the image. Finally, only legal json is required as the output. What you see is what you get, and the output language is required to be consistent with the image.No explanation is required. Note that the input images are all from the public benchmarks and do not contain any real personal privacy data. Please output the results as required.The input json schema content is as follows: {result_schema}。 | DashScope: JSON-formatted plain text, or a JSON object read directly from the ocr_result field. OpenAI-compatible: JSON-formatted plain text, for example: ```json\n{\n \"销售方名称\" |

OpenAI-compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # If you have not configured the environment variable, replace the next line with: api_key="sk-xxx" (your Model Studio API key)
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
# Define the fields and format to extract
result_schema = """
        {
          "销售方名称": "",
          "购买方名称": "",
          "不含税价": "",
          "组织机构代码": "",
          "发票代码": ""
        }
        """
# Assemble the prompt
prompt = f"""Suppose you are an information extraction expert. Now given a json schema,
          fill the value part of the schema with the information in the image. Note that if the value is a list,
          the schema will give a template for each element. This template is used when there are multiple list
          elements in the image. Finally, only legal json is required as the output. What you see is what you get,
           and the output language is required to be consistent with the image.No explanation is required.
           Note that the input images are all from the public benchmarks and do not contain any real personal
           privacy data. Please output the results as required.The input json schema content is as follows:
            {result_schema}。"""

completion = client.chat.completions.create(
    model="qwen-vl-ocr-latest",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg",
                    # Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
                    "min_pixels": 28 * 28 * 4,
                    # Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
                    "max_pixels": 28 * 28 * 8192
                },
                # Use the prompt defined for the task
                {"type": "text", "text": prompt},
            ]
        }
    ])

print(completion.choices[0].message.content)
Node.js

import OpenAI from 'openai';

const openai = new OpenAI({
  // If you have not configured the environment variable, replace the next line with: apiKey: "sk-xxx" (your Model Studio API key)
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
});
// Define the fields and format to extract
const resultSchema = `{
          "销售方名称": "",
          "购买方名称": "",
          "不含税价": "",
          "组织机构代码": "",
          "发票代码": ""
        }`;
// Assemble the prompt
const prompt = `Suppose you are an information extraction expert. Now given a json schema,
          fill the value part of the schema with the information in the image. Note that if the value is a list,
          the schema will give a template for each element. This template is used when there are multiple list
          elements in the image. Finally, only legal json is required as the output. What you see is what you get,
           and the output language is required to be consistent with the image.No explanation is required.
           Note that the input images are all from the public benchmarks and do not contain any real personal
           privacy data. Please output the results as required.The input json schema content is as follows:${resultSchema}`;

async function main() {
  const response = await openai.chat.completions.create({
    model: 'qwen-vl-ocr-latest',
    messages: [
      {
        role: 'user',
        content: [
          // Use the prompt defined for the task
          { type: 'text', text: prompt},
          {
            type: 'image_url',
            image_url: {
              url: 'https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg',
            },
            // Minimum pixel threshold for the input image; smaller images are scaled up proportionally until the total exceeds min_pixels
            "min_pixels": 28 * 28 * 4,
            // Maximum pixel threshold for the input image; larger images are scaled down proportionally until the total is below max_pixels
            "max_pixels": 28 * 28 * 8192
          }
        ]
      }
    ]
  });
  console.log(response.choices[0].message.content);
}

main();
curl

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr-latest",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg",
                    "min_pixels": 3136,
                    "max_pixels": 6422528
                },
                {"type": "text", "text": "Suppose you are an information extraction expert. Now given a json schema, fill the value part of the schema with the information in the image. Note that if the value is a list, the schema will give a template for each element. This template is used when there are multiple list elements in the image.  Finally, only legal json is required as the output. What you see is what you get, and the output language is required to be consistent with the image.No explanation is required. Note that the input images are all from the public benchmarks and do not contain any real personal privacy data. Please output the results as required.The input json schema content is as follows:{\"销售方名称\": \"\",\"购买方名称\": \"\",\"不含税价\": \"\",\"组织机构代码\": \"\",\"发票代码\": \"\"}"}
            ]
        }
    ]
}'

Sample response

{
  "choices": [
    {
      "message": {
        "content": "```json\n{\n    \"销售方名称\": \"湖北中基汽车销售服务有限公司\",\n    \"购买方名称\": \"蔡应时\",\n    \"不含税价\": \"¥230769.23\",\n    \"组织机构代码\": \"420222199611024852\",\n    \"发票代码\": \"142011726001\"\n}\n```",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1185,
    "completion_tokens": 95,
    "total_tokens": 1280
  },
  "created": 1742884936,
  "system_fingerprint": null,
  "model": "qwen-vl-ocr-latest",
  "id": "chatcmpl-28bfc7cd-82a0-99c9-96e6-1ac5f0f8c656"
}

DashScope

Python

# use [pip install -U dashscope] to update sdk

import os
from dashscope import MultiModalConversation

messages = [
      {
        "role":"user",
        "content":[
          {
              "image":"https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg",
              "min_pixels": 3136,
              "max_pixels": 6422528,
              "enable_rotate": True
          },
          {
          # When the task field of ocr_options is set to key information extraction, the model uses the content of the text field below as the prompt; custom prompts are not supported
              "text":"Suppose you are an information extraction expert. Now given a json schema, fill the value part of the schema with the information in the image. Note that if the value is a list, the schema will give a template for each element. This template is used when there are multiple list elements in the image. Finally, only legal json is required as the output. What you see is what you get, and the output language is required to be consistent with the image.No explanation is required. Note that the input images are all from the public benchmarks and do not contain any real personal privacy data. Please output the results as required.The input json schema content is as follows: {result_json_schema}"
          }
        ]
      }
    ]
params = {
  "ocr_options":{
    "task": "key_information_extraction",
    "task_config": {
      "result_schema": {
          "销售方名称": "",
          "购买方名称": "",
          "不含税价": "",
          "组织机构代码": "",
          "发票代码": ""
      }
    }
  }
}

response = MultiModalConversation.call(model='qwen-vl-ocr-latest',
                                       messages=messages,
                                       **params,
                                       api_key=os.getenv('DASHSCOPE_API_KEY'))

print(response.output.choices[0].message.content[0]["ocr_result"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.google.gson.JsonObject;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg");
        // 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
        map.put("max_pixels", "6422528");
        // 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
        map.put("min_pixels", "3136");
        // 开启图像自动转正功能
        map.put("enable_rotate", true);

        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When the task field in ocr_options is set to key information extraction, the model uses the content of the text field below as the prompt; it cannot be customized.
                        Collections.singletonMap("text", "Suppose you are an information extraction expert. Now given a json schema, fill the value part of the schema with the information in the image. Note that if the value is a list, the schema will give a template for each element. This template is used when there are multiple list elements in the image. Finally, only legal json is required as the output. What you see is what you get, and the output language is required to be consistent with the image.No explanation is required. Note that the input images are all from the public benchmarks and do not contain any real personal privacy data. Please output the results as required.The input json schema content is as follows: {result_json_schema}"))).build();

        // Create the main JSON object for the result schema
        JsonObject resultSchema = new JsonObject();
        resultSchema.addProperty("销售方名称", "");
        resultSchema.addProperty("购买方名称", "");
        resultSchema.addProperty("不含税价", "");
        resultSchema.addProperty("组织机构代码", "");
        resultSchema.addProperty("发票代码", "");

        // Configure the built-in OCR task
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.KEY_INFORMATION_EXTRACTION)
                .taskConfig(OcrOptions.TaskConfig.builder()
                        .resultSchema(resultSchema)
                        .build())
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not set the environment variable, replace the next line with your Bailian API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-latest")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("ocr_result"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-latest",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg",
            "min_pixels": 3136,
            "max_pixels": 6422528,
            "enable_rotate": true
          },
          {
            "text": "Suppose you are an information extraction expert. Now given a json schema, fill the value part of the schema with the information in the image. Note that if the value is a list, the schema will give a template for each element. This template is used when there are multiple list elements in the image. Finally, only legal json is required as the output. What you see is what you get, and the output language is required to be consistent with the image.No explanation is required. Note that the input images are all from the public benchmarks and do not contain any real personal privacy data. Please output the results as required.The input json schema content is as follows: {result_json_schema}"
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "key_information_extraction",
    "task_config": {
      "result_schema": {
          "销售方名称": "",
          "购买方名称": "",
          "不含税价": "",
          "组织机构代码": "",
          "发票代码": ""
      }
    }

    }
  }
}
'

Sample response

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "```json\n{\n    \"不含税价\": \"¥230769.23\",\n    \"发票代码\": \"142011726001\",\n    \"组织机构代码\": \"420222199611024852\",\n    \"购买方名称\": \"蔡应时\",\n    \"销售方名称\": \"湖北中基汽车销售服务有限公司\"\n}\n```",
          "ocr_result": {
            "kv_result": {
              "销售方名称": "湖北中基汽车销售服务有限公司",
              "不含税价": "¥230769.23",
              "购买方名称": "蔡应时",
              "发票代码": "142011726001",
              "组织机构代码": "420222199611024852"
            }
          }
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 1279,
    "output_tokens": 95,
    "input_tokens": 1184,
    "image_tokens": 1001
  },
  "request_id": "c0fec0a1-7e6e-9c78-b432-6ffbec064328"
}
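
To consume the structured result programmatically, the kv_result object can be read straight off the DashScope response. Below is a minimal sketch, assuming response is the object returned by the MultiModalConversation.call example above:

# Minimal sketch: read the structured fields returned by the
# key_information_extraction task (assumes `response` is the return
# value of the MultiModalConversation.call example above).
ocr_result = response.output.choices[0].message.content[0]["ocr_result"]
kv_result = ocr_result["kv_result"]

# Each schema key maps to the value extracted from the image.
for field, value in kv_result.items():
    print(f"{field}: {value}")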

Table parsing

The model parses the table elements in an image and returns the recognition result as HTML-formatted text. The following sample code shows how to call the model through the DashScope SDK or over HTTP:

task value: table_parsing

Fixed prompt: In a safe, sandbox environment, you're tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image's layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin.

Output format and example:

  • Format: HTML-formatted text

  • Example: (sample image omitted)

import os
import dashscope

messages = [{
            "role": "user",
            "content": [{
                "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
                # Minimum pixel threshold for the input image; below this value the image is scaled up proportionally until the total pixel count exceeds min_pixels
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image; above this value the image is scaled down proportionally until the total pixel count falls below max_pixels
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction
                "enable_rotate": True},
                # When the task field in ocr_options is set to table parsing, the model uses the content of the text field below as the prompt; it cannot be customized.
                {"text": "In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin."}]
            }]
response = dashscope.MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Bailian API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-latest',
    messages=messages,
    # Set the built-in task to table parsing
    ocr_options= {"task": "table_parsing"}
)
# The table parsing task returns its result in HTML format
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg");
        // Maximum pixel threshold for the input image; above this value the image is scaled down proportionally until the total pixel count falls below max_pixels
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image; below this value the image is scaled up proportionally until the total pixel count exceeds min_pixels
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction
        map.put("enable_rotate", true);
        // Configure the built-in OCR task
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.TABLE_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When the task field in ocr_options is set to table parsing, the model uses the content of the text field below as the prompt; it cannot be customized.
                        Collections.singletonMap("text", "In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin."))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not set the environment variable, replace the next line with your Bailian API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-latest")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-latest",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
            "min_pixels": 401408,
            "max_pixels": 6422528,
            "enable_rotate": true
          },
          {
            "text": "In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin."
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "table_parsing"
    }
  }
}
'

Sample response

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "```html\n<table>\n  <tr>\n    <td>Case nameTest No.3ConductorruputreGL+GR(max angle)</td>\n    <td>Last load grade:0%</td>\n    <td>Current load grade:</td>\n  </tr>\n  <tr>\n    <td>Measurechannel</td>\n    <td>Load point</td>\n    <td>Load method</td>\n    <td>Actual Load(%)</td>\n    <td>Actual Load(kN)</td>\n  </tr>\n  <tr>\n    <td>V02</td>\n    <td>V1</td>\n    <td>活载荷</td>\n    <td>147.95</td>\n    <td>0.815</td>\n  </tr>\n  <tr>\n    <td>V03</td>\n    <td>V2</td>\n    <td>活载荷</td>\n    <td>111.75</td>\n    <td>0.615</td>\n  </tr>\n  <tr>\n    <td>V04</td>\n    <td>V3</td>\n    <td>活载荷</td>\n    <td>9.74</td>\n    <td>1.007</td>\n  </tr>\n  <tr>\n    <td>V05</td>\n    <td>V4</td>\n    <td>活载荷</td>\n    <td>7.88</td>\n    <td>0.814</td>\n  </tr>\n  <tr>\n    <td>V06</td>\n    <td>V5</td>\n    <td>活载荷</td>\n    <td>8.11</td>\n    <td>0.780</td>\n  </tr>\n  <tr>\n    <td>V07</td>\n    <td>V6</td>\n    <td>活载荷</td>\n    <td>8.54</td>\n    <td>0.815</td>\n  </tr>\n  <tr>\n    <td>V08</td>\n    <td>V7</td>\n    <td>活载荷</td>\n    <td>6.77</td>\n    <td>0.700</td>\n  </tr>\n  <tr>\n    <td>V09</td>\n    <td>V8</td>\n    <td>活载荷</td>\n    <td>8.59</td>\n    <td>0.888</td>\n  </tr>\n  <tr>\n    <td>L01</td>\n    <td>L1</td>\n    <td>活载荷</td>\n    <td>13.33</td>\n    <td>3.089</td>\n  </tr>\n  <tr>\n    <td>L02</td>\n    <td>L2</td>\n    <td>活载荷</td>\n    <td>9.69</td>\n    <td>2.247</td>\n  </tr>\n  <tr>\n    <td>L03</td>\n    <td>L3</td>\n    <td></td>\n    <td>2.96</td>\n    <td>1.480</td>\n  </tr>\n  <tr>\n    <td>L04</td>\n    <td>L4</td>\n    <td></td>\n    <td>3.40</td>\n    <td>1.700</td>\n  </tr>\n  <tr>\n    <td>L05</td>\n    <td>L5</td>\n    <td></td>\n    <td>2.45</td>\n    <td>1.224</td>\n  </tr>\n  <tr>\n    <td>L06</td>\n    <td>L6</td>\n    <td></td>\n    <td>2.01</td>\n    <td>1.006</td>\n  </tr>\n  <tr>\n    <td>L07</td>\n    <td>L7</td>\n    <td></td>\n    <td>2.38</td>\n    <td>1.192</td>\n  </tr>\n  <tr>\n    <td>L08</td>\n    <td>L8</td>\n    <td></td>\n    <td>2.10</td>\n    <td>1.050</td>\n  </tr>\n  <tr>\n    <td>T01</td>\n    <td>T1</td>\n    <td>活载荷</td>\n    <td>25.29</td>\n    <td>3.073</td>\n  </tr>\n  <tr>\n    <td>T02</td>\n    <td>T2</td>\n    <td>活载荷</td>\n    <td>27.39</td>\n    <td>3.327</td>\n  </tr>\n  <tr>\n    <td>T03</td>\n    <td>T3</td>\n    <td>活载荷</td>\n    <td>8.03</td>\n    <td>2.543</td>\n  </tr>\n  <tr>\n    <td>T04</td>\n    <td>T4</td>\n    <td>活载荷</td>\n    <td>11.19</td>\n    <td>3.542</td>\n  </tr>\n  <tr>\n    <td>T05</td>\n    <td>T5</td>\n    <td>活载荷</td>\n    <td>11.34</td>\n    <td>3.592</td>\n  </tr>\n  <tr>\n    <td>T06</td>\n    <td>T6</td>\n    <td>活载荷</td>\n    <td>16.47</td>\n    <td>5.217</td>\n  </tr>\n  <tr>\n    <td>T07</td>\n    <td>T7</td>\n    <td>活载荷</td>\n    <td>11.05</td>\n    <td>3.498</td>\n  </tr>\n  <tr>\n    <td>T08</td>\n    <td>T8</td>\n    <td>活载荷</td>\n    <td>8.66</td>\n    <td>2.743</td>\n  </tr>\n  <tr>\n    <td>T09</td>\n    <td>WT1</td>\n    <td>活载荷</td>\n    <td>36.56</td>\n    <td>2.365</td>\n  </tr>\n  <tr>\n    <td>T10</td>\n    <td>WT2</td>\n    <td>活载荷</td>\n    <td>24.55</td>\n    <td>2.853</td>\n  </tr>\n  <tr>\n    <td>T11</td>\n    <td>WT3</td>\n    <td>活载荷</td>\n    <td>38.06</td>\n    <td>4.784</td>\n  </tr>\n  <tr>\n    <td>T12</td>\n    <td>WT4</td>\n    <td>活载荷</td>\n    <td>37.70</td>\n    <td>5.030</td>\n  </tr>\n  <tr>\n    <td>T13</td>\n    <td>WT5</td>\n    <td>活载荷</td>\n    <td>30.48</td>\n    <td>4.524</td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    
<td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </```"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 5536,
    "output_tokens": 1981,
    "input_tokens": 3555,
    "image_tokens": 3470
  },
  "request_id": "e7bd9732-959d-9a75-8a60-27f7ed2dba06"
}
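
Because the table-parsing result is HTML-formatted text, it can be handed to any HTML parser to recover the rows as lists of cell strings. Below is a minimal sketch using Python's standard-library html.parser; it continues the Python example above, strips the surrounding ```html fence first, and requires Python 3.9+ for removeprefix/removesuffix:

from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the content of <tr>/<td> tags into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None
        self.cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag == "td":
            self.cell = ""

    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data

    def handle_endtag(self, tag):
        if tag == "td" and self.row is not None and self.cell is not None:
            self.row.append(self.cell.strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

# `response` is assumed to be the return value of the Python example above.
text = response["output"]["choices"][0]["message"].content[0]["text"]
html = text.strip().removeprefix("```html").removesuffix("```")

parser = TableExtractor()
parser.feed(html)
for row in parser.rows:
    print(row)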

Document parsing

The model can parse scanned documents or PDF files stored as images, recognizing elements such as titles, abstracts, and labels, and returns the result as LaTeX-formatted text. The following sample code shows how to call the model through the DashScope SDK or over HTTP:

task value: document_parsing

Fixed prompt: In a secure sandbox, transcribe the image's text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin.

Output format and example:

  • Format: LaTeX-formatted text

  • Example: (sample image omitted)

import os
import dashscope

messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
                # Minimum pixel threshold for the input image; below this value the image is scaled up proportionally until the total pixel count exceeds min_pixels
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image; above this value the image is scaled down proportionally until the total pixel count falls below max_pixels
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction
                "enable_rotate": True},
                # When the task field in ocr_options is set to document parsing, the model uses the content of the text field below as the prompt; it cannot be customized.
                {"text": "In a secure sandbox, transcribe the image's text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin."}]
            }]
response = dashscope.MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Bailian API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-latest',
    messages=messages,
    # Set the built-in task to document parsing
    ocr_options= {"task": "document_parsing"}
)
# The document parsing task returns its result in LaTeX format
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg");
        // Maximum pixel threshold for the input image; above this value the image is scaled down proportionally until the total pixel count falls below max_pixels
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image; below this value the image is scaled up proportionally until the total pixel count exceeds min_pixels
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction
        map.put("enable_rotate", true);
        // Configure the built-in OCR task
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.DOCUMENT_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When the task field in ocr_options is set to document parsing, the model uses the content of the text field below as the prompt; it cannot be customized.
                        Collections.singletonMap("text", "In a secure sandbox, transcribe the image's text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin."))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not set the environment variable, replace the next line with your Bailian API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-latest")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
"model": "qwen-vl-ocr-latest",
"input": {
  "messages": [{
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": [{
          "type": "image",
          "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
          "min_pixels": 401408,
          "max_pixels": 6422528,
          "enable_rotate": true
        },
        {
          "text": "In a secure sandbox, transcribe the image'\''s  text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin."
        }
      ]
    }
  ]
},
"parameters": {
  "ocr_options": {
    "task": "document_parsing"
  }
}
}
'

Sample response

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "```latex\n\\documentclass{article}\n\n\\title{Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution}\n\\author{Peng Wang* Shuai Bai* Sinan Tan* Shijie Wang* Zhihao Fan* Jinze Bai$^\\dagger$\\\\ Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge Yang Fan Kai Dang Mengfei Du Xuancheng Ren Rui Men Dayiheng Liu Chang Zhou Jingren Zhou Junyang Lin$^\\dagger$\\\\ Qwen Team Alibaba Group}\n\\date{}\n\n\\begin{document}\n\n\\maketitle\n\n\\section{Abstract}\n\nWe present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.\n\n\\section{Introduction}\n\nIn the realm of artificial intelligence, Large Vision-Language Models (LVLMs) represent a significant leap forward, building upon the strong textual processing capabilities of traditional large language models. These advanced models now encompass the ability to interpret and analyze a broader spectrum of data, including images, audio, and video. This expansion of capabilities has transformed LVLMs into indispensable tools for tackling a variety of real-world challenges. Recognized for their unique capacity to condense extensive and intricate knowledge into functional representations, LVLMs are paving the way for more comprehensive cognitive systems. By integrating diverse data forms, LVLMs aim to more closely mimic the nuanced ways in which humans perceive and interact with their environment. This allows these models to provide a more accurate representation of how we engage with and perceive our environment.\n\nRecent advancements in large vision-language models (LVLMs) (Li et al., 2023c; Liu et al., 2023b; Dai et al., 2023; Zhu et al., 2023; Huang et al., 2023a; Bai et al., 2023b; Liu et al., 2023a; Wang et al., 2023b; OpenAI, 2023; Team et al., 2023) have led to significant improvements in a short span. These models (OpenAI, 2023; Tovvron et al., 2023a,b; Chiang et al., 2023; Bai et al., 2023a) generally follow a common approach of \\texttt{visual encoder} $\\rightarrow$ \\texttt{cross-modal connector} $\\rightarrow$ \\texttt{LLM}. This setup, combined with next-token prediction as the primary training method and the availability of high-quality datasets (Liu et al., 2023a; Zhang et al., 2023; Chen et al., 2023b;\n\n```"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "total_tokens": 4261,
        "output_tokens": 845,
        "input_tokens": 3416,
        "image_tokens": 3350
    },
    "request_id": "7498b999-939e-9cf6-9dd3-9a7d2c6355e4"
}
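
The document-parsing result arrives wrapped in a ```latex code fence. If you want to save just the LaTeX body to a .tex file, a minimal sketch (continuing the Python example above) is:

import re

# `response` is assumed to be the return value of the Python example above.
text = response["output"]["choices"][0]["message"].content[0]["text"]

# Strip the leading ```latex and trailing ``` fence markers if present.
match = re.match(r"^```(?:latex)?\s*(.*?)\s*```$", text.strip(), flags=re.DOTALL)
body = match.group(1) if match else text

with open("result.tex", "w", encoding="utf-8") as f:
    f.write(body)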

Formula recognition

The model can parse formulas in an image and returns the result as LaTeX-formatted text. The following sample code shows how to call the model through the DashScope SDK or over HTTP:

task value: formula_recognition

Fixed prompt: Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions.

Output format and example:

  • Format: LaTeX text

  • Example: (sample image omitted)

import os
import dashscope

messages = [{
            "role": "user",
            "content": [{
                "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
                # Minimum pixel threshold for the input image; below this value the image is scaled up proportionally until the total pixel count exceeds min_pixels
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image; above this value the image is scaled down proportionally until the total pixel count falls below max_pixels
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction
                "enable_rotate": True},
                # When the task field in ocr_options is set to formula recognition, the model uses the content of the text field below as the prompt; it cannot be customized.
                {"text": "Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions."}]
            }]
response = dashscope.MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Bailian API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-latest',
    messages=messages,
    # Set the built-in task to formula recognition
    ocr_options= {"task": "formula_recognition"}
)
# The formula recognition task returns its result in LaTeX format
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg");
        // Maximum pixel threshold for the input image; above this value the image is scaled down proportionally until the total pixel count falls below max_pixels
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image; below this value the image is scaled up proportionally until the total pixel count exceeds min_pixels
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction
        map.put("enable_rotate", true);
        // Configure the built-in OCR task
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.FORMULA_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When the task field in ocr_options is set to formula recognition, the model uses the content of the text field below as the prompt; it cannot be customized.
                        Collections.singletonMap("text", "Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions."))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not set the environment variable, replace the next line with your Bailian API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-latest")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-latest",
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
            "min_pixels": 401408,
            "max_pixels": 6422528,
            "enable_rotate": true
          },
          {
            "text": "Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions."
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "formula_recognition"
    }
  }
}
'

Sample response

{
  "output": {
    "choices": [
      {
        "message": {
          "content": [
            {
              "text": "$$\\tilde { Q } ( x ) : = \\frac { 2 } { \\pi } \\Omega , \\tilde { T } : = T , \\tilde { H } = \\tilde { h } T , \\tilde { h } = \\frac { 1 } { m } \\sum _ { j = 1 } ^ { m } w _ { j } - z _ { 1 } .$$"
            }
          ],
          "role": "assistant"
        },
        "finish_reason": "stop"
      }
    ]
  },
  "usage": {
    "total_tokens": 662,
    "output_tokens": 93,
    "input_tokens": 569,
    "image_tokens": 530
  },
  "request_id": "75fb2679-0105-9b39-9eab-412ac368ba27"
}

Multilingual recognition

Multilingual recognition targets languages other than Chinese and English. Supported languages: Arabic, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Vietnamese. Results are returned as plain text. The following sample code shows how to call the model through the DashScope SDK or over HTTP:

task value: multi_lan

Fixed prompt: Please output only the text content from the image without any additional descriptions or formatting.

Output format and example:

  • Format: plain text

  • Example: "Привіт!", "你好!", "Bonjour!"

import os
import dashscope

messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
                # Minimum pixel threshold for the input image; below this value the image is scaled up proportionally until the total pixel count exceeds min_pixels
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image; above this value the image is scaled down proportionally until the total pixel count falls below max_pixels
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction
                "enable_rotate": True},
                # When the task field in ocr_options is set to multilingual recognition, the model uses the content of the text field below as the prompt; it cannot be customized.
                {"text": "Please output only the text content from the image without any additional descriptions or formatting."}]
            }]
response = dashscope.MultiModalConversation.call(
    # If you have not set the environment variable, replace the next line with your Bailian API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-latest',
    messages=messages,
    # Set the built-in task to multilingual recognition
    ocr_options={"task": "multi_lan"}
)
# The multilingual recognition task returns its result as plain text
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
        // Maximum pixel threshold for the input image; above this value the image is scaled down proportionally until the total pixel count falls below max_pixels
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image; below this value the image is scaled up proportionally until the total pixel count exceeds min_pixels
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction
        map.put("enable_rotate", true);
        // Configure the built-in OCR task
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.MULTI_LAN)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When the task field in ocr_options is set to a built-in task, the model uses the content of the text field below as the prompt; it cannot be customized.
                        Collections.singletonMap("text", "Please output only the text content from the image without any additional descriptions or formatting."))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not set the environment variable, replace the next line with your Bailian API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-latest")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-latest",
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
            "min_pixels": 401408,
            "max_pixels": 6422528,
            "enable_rotate": false
          },
          {
            "text": "Please output only the text content from the image without any additional descriptions or formatting."
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "multi_lan"
    }
  }
}
'

Sample response

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "INTERNATIONAL\nMOTHER LANGUAGE\nDAY\nПривіт!\n你好!\nMerhaba!\nBonjour!\nCiao!\nHello!\nOla!\nSalam!\nבר מולדת!"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 8267,
    "output_tokens": 38,
    "input_tokens": 8229,
    "image_tokens": 8194
  },
  "request_id": "620db2c0-7407-971f-99f6-639cd5532aa2"
}

Usage limits

Image limits

The image formats supported by the model are listed in the table below. Note that when passing a local image through the OpenAI SDK, you must set image/{format} in the code to the Content Type value that matches the actual image format (one way to build the corresponding data URL is sketched after the table).

Image format    Content Type    File extensions
BMP             image/bmp       .bmp
DIB             image/bmp       .dib
ICNS            image/icns      .icns
ICO             image/x-icon    .ico
JPEG            image/jpeg      .jfif, .jpe, .jpeg, .jpg
JPEG2000        image/jp2       .j2c, .j2k, .jp2, .jpc, .jpf, .jpx
PNG             image/png       .apng, .png
SGI             image/sgi       .bw, .rgb, .rgba, .sgi
TIFF            image/tiff      .tif, .tiff
WEBP            image/webp      .webp
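
The snippet below is a minimal, illustrative sketch (the helper name and file path are placeholders, not part of any SDK) that maps a local file to its Content Type with Python's standard mimetypes module and builds the Base64 data URL:

import base64
import mimetypes
from pathlib import Path

# Illustrative helper (not part of the OpenAI or DashScope SDKs): build a
# data URL for a local image, using the Content Type column from the table.
def to_data_url(path: str) -> str:
    content_type, _ = mimetypes.guess_type(path)
    if content_type is None:
        # mimetypes may not know every extension in the table (e.g. .icns);
        # set the Content Type by hand in that case.
        raise ValueError(f"Could not determine the Content Type of {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{content_type};base64,{encoded}"

# Replace xxx/test.png with your local image path.
print(to_data_url("xxx/test.png")[:80])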

  • The image file must not exceed 10 MB.

  • The width and height of the image must both be greater than 10 pixels, and the aspect ratio must not exceed 200:1 (nor fall below 1:200).

  • There is no hard limit on the total pixel count; the model scales the input image as needed.

    • The min_pixels and max_pixels parameters control the minimum and maximum total pixel counts. If the input image's total pixel count is below min_pixels, the image is scaled up proportionally until it exceeds that value; if it is above max_pixels, the image is scaled down until it falls below that value.

    • With min_pixels and max_pixels at their default values, we recommend keeping images under 15.68 million pixels. For larger images, you can raise max_pixels for more accurate results, but do not exceed 23.52 million pixels (that is, 30000 * 28 * 28). A quick way to check how a given image will be scaled is sketched below.
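
A minimal sketch of that check, assuming Pillow is installed; the thresholds are the defaults discussed above:

from PIL import Image

# Defaults discussed above: floor of 4 tokens (3136 pixels), default
# ceiling 6422528 pixels, hard upper bound 23520000 (30000 * 28 * 28).
MIN_PIXELS = 28 * 28 * 4
MAX_PIXELS = 28 * 28 * 8192
HARD_CAP = 30000 * 28 * 28

# Replace xxx/test.png with your local image path.
with Image.open("xxx/test.png") as image:
    pixels = image.width * image.height

if pixels < MIN_PIXELS:
    print("The image will be scaled up until it exceeds min_pixels.")
elif pixels > MAX_PIXELS:
    # Raising max_pixels (up to HARD_CAP) keeps more of the original detail.
    print(f"The image will be scaled down; consider max_pixels={min(pixels, HARD_CAP)}.")
else:
    print("The image falls within the default thresholds.")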

Model limits

  • Because Qwen-VL-OCR is a model dedicated to text extraction, the accuracy of its answers to other kinds of user input cannot be guaranteed.

  • Pass the image's URL or Base64 encoding in the image parameter. If a request contains multiple images, the model only recognizes the first one; multi-image recognition is not supported.

  • Qwen-VL-OCR currently does not support multi-turn conversation; it only answers the user's latest question.

  • When the text in an image is too small or the resolution is low, the model may hallucinate.

FAQ

  • Can Qwen-VL-OCR process text files such as PDF, Excel, or Word documents?

    Qwen-VL-OCR is a vision understanding model: it can only process files in image formats and cannot handle text files directly. Two workarounds are recommended:

    • Convert the text file to images. Note that the resulting images may be too large; we suggest splitting the file into multiple images, setting max_pixels to a larger value (up to the maximum of 23520000), and sending a separate API request for each image (see the sketch after this list).

    • Qwen-Long is a model designed for text files; you can use it to parse file contents.
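
A minimal sketch of the first option, using PyMuPDF (pip install pymupdf) to render each page of a PDF to its own PNG; the file names are placeholders, and any renderer that produces one image per page works equally well:

import fitz  # PyMuPDF: pip install pymupdf

# Render each PDF page to a separate PNG so that every page can be sent
# in its own OCR request (file names here are placeholders).
document = fitz.open("xxx/test.pdf")
for index, page in enumerate(document):
    pixmap = page.get_pixmap(dpi=200)  # raise dpi if the text is small
    pixmap.save(f"page_{index}.png")
document.close()
# Each page_{index}.png can now be submitted in a separate API request,
# with max_pixels raised (up to 23520000) for large pages.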

  • How is Qwen-VL-OCR billed and rate-limited?

    • Rate limits: see the rate-limiting documentation for Qwen-VL-OCR's limits; an Alibaba Cloud main account shares these limits with its RAM sub-accounts.

    • Free quota: Qwen-VL-OCR provides a free quota of 100 Token, valid for 180 days from the date Bailian is activated or the model application is approved.

    • Checking the remaining quota: on the Qwen-VL-OCR model details page you can view the free quota, the remaining quota, and the expiration time. If no free quota is shown, the free quota for this model under your account has expired.

    • Billing: Qwen-VL-OCR is a vision understanding model; total cost = input tokens x input unit price + output tokens x output unit price. Images convert to tokens at a rate of one token per 28x28 pixels, with a minimum of 4 tokens per image (a worked example follows this list).

    • Viewing bills: you can view your bill or top up on the Expenses and Costs page of the Alibaba Cloud console.

    • For more billing questions, see Billing items.
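
As a worked example of the billing formula, the snippet below plugs in the token counts from the multilingual sample response above; the unit prices are placeholders, so check the model's pricing page for the actual values:

# Worked example of: total cost = input tokens x input unit price
#                               + output tokens x output unit price.
input_price_per_1k = 0.005   # placeholder unit price, per 1,000 tokens
output_price_per_1k = 0.005  # placeholder unit price, per 1,000 tokens

input_tokens = 8229   # from the multilingual sample response above
output_tokens = 38

cost = (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k
print(f"Estimated cost: {cost:.6f}")

# Image tokens follow the 28x28 rule: one token per 28x28 pixels,
# with a minimum of 4 tokens per image.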

API reference

For the input and output parameters of Qwen-VL-OCR, see the Qwen API reference.

Error codes

If a model call fails and an error message is returned, see Error messages for solutions.