通义千问OCR是专用于文字提取的模型,专注于文档、表格、试题、手写体文字等类型图像的文字提取能力。它能够识别多种文字,目前支持的语言有:汉语、英语、阿拉伯语、法语、德语、意大利语、日语、韩语、葡萄牙语、俄语、西班牙语、越南语。
您可以在阿里云百炼平台在线体验通义千问OCR模型。
支持的模型
模型名称 | 版本 | 上下文长度(Token数) | 最大输入(Token数) | 最大输出(Token数) | 输入输出单价(每千Token) | 免费额度 |
qwen-vl-ocr(当前等同qwen-vl-ocr-2024-10-28) | 稳定版 | 34,096 | 30,000(单图最大30,000) | 4,096 | 0.005元 | 各100万Token(有效期:百炼开通后180天内) |
qwen-vl-ocr-latest(始终等同最新快照版) | 最新版 | 38,192 | 30,000(单图最大30,000) | 8,192 | 同上 | 同上 |
qwen-vl-ocr-2025-04-13(又称qwen-vl-ocr-0413,大幅提升文字识别能力,新增六种内置的OCR任务,增加了自定义Prompt、图像旋转矫正等功能) | 快照版 | 38,192 | 30,000(单图最大30,000) | 8,192 | 同上 | 同上 |
qwen-vl-ocr-2024-10-28(又称qwen-vl-ocr-1028) | 快照版 | 34,096 | 30,000(单图最大30,000) | 4,096 | 同上 | 同上 |
您可以选择全面升级的qwen-vl-ocr-latest模型用于文字提取任务,相较于上一个快照版本,该模型新增以下功能:
支持在文字提取前,对图像进行旋转矫正,适合图像倾斜的场景。
新增六种内置的OCR任务,分别是通用文字识别、信息抽取、文档解析、表格解析、公式识别、多语言识别。
未设置内置任务时,支持用户输入Prompt进行指引;若设置了内置任务,为保证识别效果,模型内部会使用任务指定的Prompt。
仅DashScope SDK支持对图像进行旋转矫正和设置内置任务。如需使用OpenAI SDK进行内置的OCR任务,需要手动填写任务指定的Prompt进行引导。
如何使用
前提条件
您需要已获取API Key并配置API Key到环境变量。如果通过OpenAI SDK或DashScope SDK进行调用,还需要安装最新版SDK。
DashScope Python SDK 最低版本为1.22.2, Java SDK 最低版本为2.18.4。
快速开始
以下是以qwen-vl-ocr-latest为例,传入图像URL进行文字提取的示例代码。模型支持用户传入Prompt进行指引,如未输入则使用模型默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
qwen-vl-ocr、qwen-vl-ocr-2024-10-28模型不支持传入Prompt进行指引,模型内部会使用默认的Prompt:Read all the text in the image.
您可以在图像限制处查看对输入图像的要求,如需传入本地图像请参见使用本地文件。
OpenAI兼容
Python
import os
from openai import OpenAI
client = OpenAI(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-ocr-latest",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192
},
# qwen-vl-ocr-latest支持在以下text字段中传入Prompt,若未传入,则会使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
# 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.不支持用户在text中传入自定义Prompt
{"type": "text",
"text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},
]
}
])
print(completion.choices[0].message.content)
Node.js
import OpenAI from 'openai';
const openai = new OpenAI({
// 若没有配置环境变量,请用百炼API Key将下行替换为:apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
});
async function main() {
const response = await openai.chat.completions.create({
model: 'qwen-vl-ocr-latest',
messages: [
{
role: 'user',
content: [
// qwen-vl-ocr-latest支持在以下text字段中传入Prompt,若未传入,则会使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
// 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.不支持用户在text中传入自定义Prompt
{ type: 'text', text: "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},
{
type: 'image_url',
image_url: {
url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
},
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192
}
]
}
],
});
console.log(response.choices[0].message.content)
}
main();
curl
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-vl-ocr-latest",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3136,
"max_pixels": 6422528
},
{"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"}
]
}
]
}'
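若您像上面的示例一样,在Prompt中要求模型以类JSON文本返回关键信息,可以参考下面的示意代码对返回内容做简单的后处理。该代码仅为示意:假设模型按照示例Prompt返回形如 {'发票号码':'xxx', ...} 的文本,parse_ocr_json是为演示自行定义的辅助函数,并非SDK提供的接口;实际返回内容可能包含多余字符或解析失败,请结合业务自行校验。
import ast

def parse_ocr_json(text: str) -> dict:
    # 尝试将模型返回的类JSON文本解析为字典(示意代码,假设返回内容形如 {'发票号码':'xxx', ...})
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1 or end <= start:
        raise ValueError("返回内容中未找到JSON对象")
    # 示例Prompt中的格式使用单引号,ast.literal_eval可直接解析这种Python字面量风格的文本
    return ast.literal_eval(text[start:end + 1])

# 示例用法:content为模型返回的文本,例如Python示例中的 completion.choices[0].message.content
# result = parse_ocr_json(content)
# print(result.get("车次"))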
DashScope
Python
import os
import dashscope
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192,
# 开启图像自动转正功能
"enable_rotate":True},
# qwen-vl-ocr-latest未设置内置任务时,支持在以下text字段中传入Prompt,若未传入则使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
# 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.不支持用户在text中传入自定义Prompt
{"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"}]
}]
response = dashscope.MultiModalConversation.call(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-latest',
messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
map.put("max_pixels", "6422528");
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
map.put("min_pixels", "3136");
// 开启图像自动转正功能
map.put("enable_rotate", true);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// qwen-vl-ocr-latest未设置内置任务时,支持在以下text字段中传入Prompt,若未传入则使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
// 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.不支持用户在text中传入自定义Prompt
Collections.singletonMap("text", "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-latest")
.message(userMessage)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
--header "Authorization: Bearer $DASHSCOPE_API_KEY"\
--header 'Content-Type: application/json'\
--data '{
"model": "qwen-vl-ocr-latest",
"input": {
"messages": [
{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3136,
"max_pixels": 6422528,
"enable_rotate": true
},
{
"text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"
}
]
}
]
}
}'
流式输出
大模型接收到输入后,会逐步生成中间结果,最终结果由这些中间结果拼接而成。这种一边生成一边输出中间结果的方式称为流式输出。采用流式输出时,您可以在模型进行输出的同时阅读,减少模型输出超时的风险。
OpenAI兼容
您只需将代码中的stream参数设置为true,即可体验流式输出的功能。
Python
import os
from openai import OpenAI
client = OpenAI(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-ocr-latest",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192
},
# qwen-vl-ocr-latest支持在以下text字段中传入Prompt,若未传入,则会使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
# 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.不支持用户在text中传入自定义Prompt
{"type": "text",
"text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},
]
}
],
stream=True,
stream_options={"include_usage": True}
)
full_content = ""
print("流式输出内容为:")
for chunk in completion:
if chunk.choices:
full_content += chunk.choices[0].delta.content
print(chunk.choices[0].delta.content)
else:
print(chunk.usage)
print(f"完整内容为:{full_content}")
Node.js
import OpenAI from 'openai';
const openai = new OpenAI({
// 若没有配置环境变量,请用百炼API Key将下行替换为:apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
});
async function main() {
const response = await openai.chat.completions.create({
model: 'qwen-vl-ocr-latest',
messages: [
{
role: 'user',
content: [
// qwen-vl-ocr-latest支持在以下text字段中传入Prompt,若未传入,则会使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
// 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.,不支持用户在text中传入自定义Prompt
{ type: 'text', text: "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},
{
type: 'image_url',
image_url: {
url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
},
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192
}
]
}
],
stream: true,
stream_options:{"include_usage": true}
});
let fullContent = ""
console.log("流式输出内容为:")
for await (const chunk of response) {
if (chunk.choices[0]) {
fullContent += chunk.choices[0].delta.content;
console.log(chunk.choices[0].delta.content);
} else {
console.log(chunk.usage);
}
}
console.log(`完整输出内容为:${fullContent}`)
}
main();
curl
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-vl-ocr-latest",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3136,
"max_pixels": 6422528
},
{"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"}
]
}
],
"stream": true,
"stream_options": {"include_usage": true}
}'
DashScope
您可以通过不同的调用方式,设置相应的参数来实现流式输出:
Python SDK方式:设置stream参数为True。
Java SDK方式:需要通过streamCall接口调用。
HTTP方式:需要在Header中指定X-DashScope-SSE为enable。
流式输出的内容默认是非增量式(即每次返回的内容都包含之前生成的内容),如果您需要使用增量式流式输出,请设置incremental_output(Java为incrementalOutput)参数为true。
Python
import os
import dashscope
messages = [
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192,
# 开启图像自动转正功能
"enable_rotate": True,
},
# qwen-vl-ocr-latest未设置内置任务时,支持在以下text字段中传入Prompt,若未传入则使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
# 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.,不支持用户在text中传入自定义Prompt
{
"type": "text",
"text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'",
},
],
}
]
response = dashscope.MultiModalConversation.call(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen-vl-ocr-latest",
messages=messages,
stream=True,
incremental_output=True,
)
full_content = ""
print("流式输出内容为:")
for response in response:
try:
print(response["output"]["choices"][0]["message"].content[0]["text"])
full_content += response["output"]["choices"][0]["message"].content[0]["text"]
except:
pass
print(f"完整内容为:{full_content}")
返回结果
流式输出内容为:
读者
对象
如果
你是
Linux环境下的系统
......
行下有大量的命令
可供使用
,本书将会展示
如何使用它们。
完整内容为:读者对象 如果你是Linux环境下的系统管理员,那么学会编写shell脚本将让你受益匪浅。本书并未细述安装 Linux系统的每个步骤,但只要系统已安装好Linux并能运行起来,你就可以开始考虑如何让一些日常 的系统管理任务实现自动化。这时shell脚本编程就能发挥作用了,这也正是本书的作用所在。本书将 演示如何使用shell脚本来自动处理系统管理任务,包括从监测系统统计数据和数据文件到为你的老板 生成报表。 如果你是家用Linux爱好者,同样能从本书中获益。现今,用户很容易在诸多部件堆积而成的图形环境 中迷失。大多数桌面Linux发行版都尽量向一般用户隐藏系统的内部细节。但有时你确实需要知道内部 发生了什么。本书将告诉你如何启动Linux命令行以及接下来要做什么。通常,如果是执行一些简单任 务(比如文件管理) , 在命令行下操作要比在华丽的图形界面下方便得多。在命令行下有大量的命令 可供使用,本书将会展示如何使用它们。
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
map.put("max_pixels", "6422528");
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
map.put("min_pixels", "3136");
// 开启图像自动转正功能
map.put("enable_rotate", true);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// qwen-vl-ocr-latest未设置内置任务时,支持在以下text字段中传入Prompt,若未传入则使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
// 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.,不支持用户在text中传入自定义Prompt
Collections.singletonMap("text", "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-latest")
.message(userMessage)
.incrementalOutput(true)
.build();
Flowable<MultiModalConversationResult> result = conv.streamCall(param);
result.blockingForEach(item -> {
try {
System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
} catch (Exception e){
System.exit(0);
}
});
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
--data '{
"model": "qwen-vl-ocr-latest",
"input":{
"messages":[
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3136,
"max_pixels": 6422528
},
{"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"}
]
}
]
}
}'
使用本地文件(输入Base64编码或本地路径)
以下是传入本地图像的示例代码,使用OpenAI SDK或HTTP方式时,需要将本地图像编码为Base64格式后再传入;使用DashScope SDK时,直接传入本地图像的路径即可,示例图像:test.png
您可以在图像限制处查看对输入图像的要求,如需传入图像URL请参见快速开始。
OpenAI兼容
使用OpenAI SDK或HTTP方式对本地图像进行文字提取的步骤如下:
编码图像文件:读取本地图像并编码为Base64数据;
传递Base64数据:请使用以下格式将Base64数据传入image_url参数:data:image/{format};base64,{base64_image}。其中image/{format}表示图像的格式,请根据实际的图像格式,将image/{format}设置为与图像限制表格中Content Type对应的值。例如:本地图像为jpg格式,则设置为image/jpeg。
调用模型:调用通义千问OCR模型,并处理返回的结果。
Python
from openai import OpenAI
import os
import base64
# 读取本地文件,并编码为 Base64 格式
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
# 将xxxx/test.png替换为您本地图像的绝对路径
base64_image = encode_image("xxx/test.png")
client = OpenAI(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx"
api_key=os.getenv('DASHSCOPE_API_KEY'),
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-ocr-latest",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
# 需要注意,传入Base64,图像格式(即image/{format})需要与支持的图片列表中的Content Type保持一致。"f"是字符串格式化的方法。
# PNG图像: f"data:image/png;base64,{base64_image}"
# JPEG图像: f"data:image/jpeg;base64,{base64_image}"
# WEBP图像: f"data:image/webp;base64,{base64_image}"
"image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192
},
# qwen-vl-ocr-latest支持在以下text字段中传入Prompt,若未传入,则会使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
# 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.,不支持用户在text中传入自定义Prompt
{"type": "text", "text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"},
],
}
],
)
print(completion.choices[0].message.content)
Node.js
import OpenAI from "openai";
import {
readFileSync
} from 'fs';
const openai = new OpenAI({
// 若没有配置环境变量,请用百炼API Key将下行替换为:apiKey: "sk-xxx"
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
});
// 读取本地文件,并编码为 base 64 格式
const encodeImage = (imagePath) => {
const imageFile = readFileSync(imagePath);
return imageFile.toString('base64');
};
// 将xxxx/test.png替换为您本地图像的绝对路径
const base64Image = encodeImage("xxx/test.jpg")
async function main() {
const completion = await openai.chat.completions.create({
model: "qwen-vl-ocr-latest",
messages: [{
"role": "user",
"content": [{
"type": "image_url",
"image_url": {
// 需要注意,传入Base64,图像格式(即image/{format})需要与支持的图片列表中的Content Type保持一致。
// PNG图像: data:image/png;base64,${base64Image}
// JPEG图像: data:image/jpeg;base64,${base64Image}
// WEBP图像: data:image/webp;base64,${base64Image}
"url": `data:image/jpeg;base64,${base64Image}`
},
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192
},
// qwen-vl-ocr-latest支持在以下text字段中传入Prompt,若未传入,则会使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
// 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.,不支持用户在text中传入自定义Prompt
{
"type": "text",
"text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"
}
]
}]
});
console.log(completion.choices[0].message.content);
}
main();
DashScope
使用DashScope SDK处理本地图像文件时,需要传入文件路径。请您参考下表,结合您的使用方式与操作系统进行文件路径的创建。
系统 | SDK | 传入的文件路径 | 示例 |
Linux或macOS系统 | Python SDK | file://{文件的绝对路径} | file:///home/images/test.png |
Linux或macOS系统 | Java SDK | file://{文件的绝对路径} | file:///home/images/test.png |
Windows系统 | Python SDK | file://{文件的绝对路径} | file://D:/images/test.png |
Windows系统 | Java SDK | file:///{文件的绝对路径} | file:///D:/images/test.png |
Python
import os
from dashscope import MultiModalConversation
# 将xxxx/test.png替换为您本地图像的绝对路径
local_path = "xxx/test.png"
image_path = f"file://{local_path}"
messages = [
{
"role": "user",
"content": [
{
"image": image_path,
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192,
# 开启图像自动转正功能
"enable_rotate": True,
},
# qwen-vl-ocr-latest未设置内置任务时,支持在以下text字段中传入Prompt,若未传入则使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
# 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.,不支持用户在text中传入自定义Prompt
{
"text": "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"
},
],
}
]
response = MultiModalConversation.call(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen-vl-ocr-latest",
messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
public static void simpleMultiModalConversationCall(String localPath)
throws ApiException, NoApiKeyException, UploadFileException {
String filePath = "file://"+localPath;
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", filePath);
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
map.put("max_pixels", "6422528");
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
map.put("min_pixels", "3136");
// 开启图像自动转正功能
map.put("enable_rotate", true);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// qwen-vl-ocr-latest未设置内置任务时,支持在以下text字段中传入Prompt,若未传入则使用默认的Prompt:Please output only the text content from the image without any additional descriptions or formatting.
// 如调用qwen-vl-ocr-1028,模型会使用固定Prompt:Read all the text in the image.,不支持用户在text中传入自定义Prompt
Collections.singletonMap("text", "请提取车票图像中的发票号码、车次、起始站、终点站、发车日期和时间点、座位号、席别类型、票价、身份证号码、购票人姓名。要求准确无误的提取上述关键信息、不要遗漏和捏造虚假信息,模糊或者强光遮挡的单个文字可以用英文问号?代替。返回数据格式以json方式输出,格式为:{'发票号码':'xxx', '车次':'xxx', '起始站':'xxx', '终点站':'xxx', '发车日期和时间点':'xxx', '座位号':'xxx', '席别类型':'xxx','票价':'xxx', '身份证号码':'xxx', '购票人姓名':'xxx'"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-latest")
.message(userMessage)
.topP(0.001)
.temperature(0.1f)
.maxLength(8192)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
// 将xxxx/test.png替换为您本地图像的绝对路径
simpleMultiModalConversationCall("xxx/test.png");
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
调用内置任务
部分通义千问OCR模型支持内置任务,您无需设计Prompt即可调用:
支持的任务:通用文字识别、信息抽取、表格解析、文档解析、公式识别、多语言识别
适用模型:qwen-vl-ocr-2025-04-13、qwen-vl-ocr-latest
使用方法:
DashScope SDK及DashScope HTTP方式:您无需设计Prompt,只需将ocr_options参数中的task字段设置为内置任务名称,模型内部会匹配任务指定的Prompt。
OpenAI SDK及OpenAI兼容的HTTP方式:您需手动填写任务指定的Prompt。
通用文字识别
通用文字识别主要适用于中英文场景,以纯文本格式返回识别结果。以下是通过DashScope SDK及HTTP方式调用的示例代码:
task的取值 | 指定的Prompt | 输出格式与示例 |
text_recognition | Please output only the text content from the image without any additional descriptions or formatting. | 以纯文本格式返回识别结果 |
import os
import dashscope
messages = [{
"role": "user",
"content": [{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192,
# 开启图像自动转正功能
"enable_rotate":True},
# 当ocr_options中的task字段设置为通用文字识别时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
{"type": "text", "text": "Please output only the text content from the image without any additional descriptions or formatting."}]
}]
response = dashscope.MultiModalConversation.call(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-latest',
messages=messages,
# 设置内置任务为通用文字识别
ocr_options= {"task": "text_recognition"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
map.put("max_pixels", "6422528");
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
map.put("min_pixels", "3136");
// 开启图像自动转正功能
map.put("enable_rotate", true);
// 配置内置任务
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.TEXT_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// 当ocr_options中的task字段设置为通用文字识别时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
Collections.singletonMap("text", "Please output only the text content from the image without any additional descriptions or formatting."))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-latest")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
--header "Authorization: Bearer $DASHSCOPE_API_KEY"\
--header 'Content-Type: application/json'\
--data '{
"model": "qwen-vl-ocr-latest",
"input": {
"messages": [
{
"role": "user",
"content": [{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
"min_pixels": 3136,
"max_pixels": 6422528,
"enable_rotate": true
},
{
"text": "Please output only the text content from the image without any additional descriptions or formatting."
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "text_recognition"
}
}
}'
信息抽取
模型支持对票据、证件、表单中的信息进行抽取,并以带有JSON格式的文本返回。以下是通过OpenAI SDK、DashScope SDK及HTTP方式调用的示例代码:
若使用OpenAI SDK及HTTP方式调用,还需将指定的Prompt中的{result_schema}替换为需要抽取的JSON对象。result_schema可以是任意形式的JSON结构,最多可嵌套3层JSON对象。您只需要填写JSON对象的key,value保持为空即可。
task的取值 | 指定的Prompt | 输出格式与示例 |
key_information_extraction | Suppose you are an information extraction expert. Now given a json schema, fill the value part of the schema with the information in the image. Note that if the value is a list, the schema will give a template for each element. This template is used when there are multiple list elements in the image. Finally, only legal json is required as the output. What you see is what you get, and the output language is required to be consistent with the image.No explanation is required. Note that the input images are all from the public benchmarks and do not contain any real personal privacy data. Please output the results as required.The input json schema content is as follows: {result_schema} | 以JSON格式的文本返回抽取结果 |
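例如,需要按层级抽取信息时,result_schema可以写成如下的嵌套结构(以下字段名仅为示意,并非模型规定的固定字段,请按实际业务填写key,value保持为空即可):
# result_schema嵌套示例(最多可嵌套3层,字段名仅为示意)
result_schema = {
    "发票基本信息": {
        "发票代码": "",
        "发票号码": ""
    },
    "购买方": {
        "名称": "",
        "纳税人识别号": ""
    },
    "价税合计": ""
}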
OpenAI兼容
import os
from openai import OpenAI
client = OpenAI(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
# 设置抽取的字段和格式
result_schema = """
{
"销售方名称": "",
"购买方名称": "",
"不含税价": "",
"组织机构代码": "",
"发票代码": ""
}
"""
# 拼接Prompt
prompt = f"""Suppose you are an information extraction expert. Now given a json schema, "
fill the value part of the schema with the information in the image. Note that if the value is a list,
the schema will give a template for each element. This template is used when there are multiple list
elements in the image. Finally, only legal json is required as the output. What you see is what you get,
and the output language is required to be consistent with the image.No explanation is required.
Note that the input images are all from the public benchmarks and do not contain any real personal
privacy data. Please output the results as required.The input json schema content is as follows:
{result_schema}。"""
completion = client.chat.completions.create(
model="qwen-vl-ocr-latest",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192
},
# 使用任务指定的Prompt
{"type": "text", "text": prompt},
]
}
])
print(completion.choices[0].message.content)
import OpenAI from 'openai';
const openai = new OpenAI({
// 若没有配置环境变量,请用百炼API Key将下行替换为:apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
});
// 设置抽取的字段和格式
const resultSchema = `{
"销售方名称": "",
"购买方名称": "",
"不含税价": "",
"组织机构代码": "",
"发票代码": ""
}`;
// 拼接Prompt
const prompt = `Suppose you are an information extraction expert. Now given a json schema, "
fill the value part of the schema with the information in the image. Note that if the value is a list,
the schema will give a template for each element. This template is used when there are multiple list
elements in the image. Finally, only legal json is required as the output. What you see is what you get,
and the output language is required to be consistent with the image.No explanation is required.
Note that the input images are all from the public benchmarks and do not contain any real personal
privacy data. Please output the results as required.The input json schema content is as follows:${resultSchema}`;
async function main() {
const response = await openai.chat.completions.create({
model: 'qwen-vl-ocr-latest',
messages: [
{
role: 'user',
content: [
// 可自定义Prompt,若未设置则使用默认的Prompt
{ type: 'text', text: prompt},
{
type: 'image_url',
image_url: {
url: 'https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg',
},
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192
}
]
}
]
});
console.log(response.choices[0].message.content);
}
main();
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-vl-ocr-latest",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg",
"min_pixels": 3136,
"max_pixels": 6422528
},
{"type": "text", "text": "Suppose you are an information extraction expert. Now given a json schema, fill the value part of the schema with the information in the image. Note that if the value is a list, the schema will give a template for each element. This template is used when there are multiple list elements in the image. Finally, only legal json is required as the output. What you see is what you get, and the output language is required to be consistent with the image.No explanation is required. Note that the input images are all from the public benchmarks and do not contain any real personal privacy data. Please output the results as required.The input json schema content is as follows:{\"销售方名称\": \"\",\"购买方名称\": \"\",\"不含税价\": \"\",\"组织机构代码\": \"\",\"发票代码\": \"\"}"}
]
}
]
}'
DashScope
# use [pip install -U dashscope] to update sdk
import os
from dashscope import MultiModalConversation
messages = [
{
"role":"user",
"content":[
{
"image":"https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg",
"min_pixels": 3136,
"max_pixels": 6422528,
"enable_rotate": True
},
{
# 当ocr_options中的task字段设置为信息抽取时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
"text":"Suppose you are an information extraction expert. Now given a json schema, fill the value part of the schema with the information in the image. Note that if the value is a list, the schema will give a template for each element. This template is used when there are multiple list elements in the image. Finally, only legal json is required as the output. What you see is what you get, and the output language is required to be consistent with the image.No explanation is required. Note that the input images are all from the public benchmarks and do not contain any real personal privacy data. Please output the results as required.The input json schema content is as follows: {result_json_schema}"
}
]
}
]
params = {
"ocr_options":{
"task": "key_information_extraction",
"task_config": {
"result_schema": {
"销售方名称": "",
"购买方名称": "",
"不含税价": "",
"组织机构代码": "",
"发票代码": ""
}
}
}
}
response = MultiModalConversation.call(model='qwen-vl-ocr-latest',
messages=messages,
**params,
api_key=os.getenv('DASHSCOPE_API_KEY'))
print(response.output.choices[0].message.content[0]["ocr_result"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.google.gson.JsonObject;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg");
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
map.put("max_pixels", "6422528");
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
map.put("min_pixels", "3136");
// 开启图像自动转正功能
map.put("enable_rotate", true);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// 当ocr_options中的task字段设置为信息抽取时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
Collections.singletonMap("text", "Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions."))).build();
// 创建主JSON对象
JsonObject resultSchema = new JsonObject();
resultSchema.addProperty("销售方名称", "");
resultSchema.addProperty("购买方名称", "");
resultSchema.addProperty("不含税价", "");
resultSchema.addProperty("组织机构代码", "");
resultSchema.addProperty("发票代码", "");
// 配置内置的OCR任务
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.KEY_INFORMATION_EXTRACTION)
.taskConfig(OcrOptions.TaskConfig.builder()
.resultSchema(resultSchema)
.build())
.build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-latest")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("ocr_result"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-latest",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "https://prism-test-data.oss-cn-hangzhou.aliyuncs.com/image/car_invoice/car-invoice-img00040.jpg",
"min_pixels": 3136,
"max_pixels": 6422528,
"enable_rotate": true
},
{
"text": "Suppose you are an information extraction expert. Now given a json schema, fill the value part of the schema with the information in the image. Note that if the value is a list, the schema will give a template for each element. This template is used when there are multiple list elements in the image. Finally, only legal json is required as the output. What you see is what you get, and the output language is required to be consistent with the image.No explanation is required. Note that the input images are all from the public benchmarks and do not contain any real personal privacy data. Please output the results as required.The input json schema content is as follows: {result_json_schema}"
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "key_information_extraction",
"task_config": {
"result_schema": {
"销售方名称": "",
"购买方名称": "",
"不含税价": "",
"组织机构代码": "",
"发票代码": ""
}
}
}
}
}
'
表格解析
模型会对图像中的表格元素进行解析,以带有HTML格式的文本返回识别结果。以下是通过Dashscope SDK及HTTP方式调用的示例代码:
task的取值 | 指定的Prompt | 输出格式与示例 |
table_parsing | In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin. | 以带有HTML格式的文本返回识别结果 |
import os
import dashscope
messages = [{
"role": "user",
"content": [{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192,
# 开启图像自动转正功能
"enable_rotate":True},
# 当ocr_options中的task字段设置为表格解析时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
{"text": "In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin."}] }]
response = dashscope.MultiModalConversation.call(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-latest',
messages=messages,
# 设置内置任务为表格解析
ocr_options= {"task": "table_parsing"}
)
# 表格解析任务以HTML格式返回结果
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg");
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
map.put("max_pixels", "6422528");
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
map.put("min_pixels", "3136");
// 开启图像自动转正功能
map.put("enable_rotate", true);
// 配置内置的OCR任务
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.TABLE_PARSING)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// 当ocr_options中的task字段设置为表格解析时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
Collections.singletonMap("text", "In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin."))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-latest")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-latest",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
"min_pixels": 401408,
"max_pixels": 6422528,
"enable_rotate": true
},
{
"text": "In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin."
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "table_parsing"
}
}
}
'
文档解析
模型支持解析以图像形式存储的扫描件或PDF文档,能识别文件中的标题、摘要、标签等,以带有LaTeX格式的文本返回识别结果。以下是通过Dashscope SDK及HTTP方式调用的示例代码:
task的取值 | 指定的Prompt | 输出格式与示例 |
document_parsing | In a secure sandbox, transcribe the image's text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin. | 以带有LaTeX格式的文本返回识别结果 |
import os
import dashscope
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192,
# 开启图像自动转正功能
"enable_rotate":True},
# 当ocr_options中的task字段设置为文档解析时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
{"text": "In a secure sandbox, transcribe the image's text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin."}]
}]
response = dashscope.MultiModalConversation.call(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-latest',
messages=messages,
# 设置内置任务为文档解析
ocr_options= {"task": "document_parsing"}
)
# 文档解析任务以LaTeX格式返回结果
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg");
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
map.put("max_pixels", "6422528");
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
map.put("min_pixels", "3136");
// 开启图像自动转正功能
map.put("enable_rotate", true);
// 配置内置的OCR任务
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.DOCUMENT_PARSING)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// 当ocr_options中的task字段设置为文档解析时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
Collections.singletonMap("text", "In a secure sandbox, transcribe the image's text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin."))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-latest")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
--header "Authorization: Bearer $DASHSCOPE_API_KEY"\
--header 'Content-Type: application/json'\
--data '{
"model": "qwen-vl-ocr-latest",
"input": {
"messages": [{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [{
"type": "image",
"image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
"min_pixels": 401408,
"max_pixels": 6422528,
"enable_rotate": true
},
{
"text": "In a secure sandbox, transcribe the image'\''s text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin."
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "document_parsing"
}
}
}
'
公式识别
模型支持解析图像中的公式,以带有LaTeX格式的文本返回识别结果。以下是通过Dashscope SDK及HTTP方式调用的示例代码:
task的取值 | 指定的Prompt | 输出格式与示例 |
formula_recognition | Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions. | 以带有LaTeX格式的文本返回识别结果 |
import os
import dashscope
messages = [{
"role": "user",
"content": [{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192,
# 开启图像自动转正功能
"enable_rotate":True},
# 当ocr_options中的task字段设置为公式识别时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
{"text": "Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions."}]
}]
response = dashscope.MultiModalConversation.call(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-latest',
messages=messages,
# 设置内置任务为公式识别
ocr_options= {"task": "formula_recognition"}
)
# 公式识别任务以LaTeX格式返回结果
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg");
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
map.put("max_pixels", "6422528");
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
map.put("min_pixels", "3136");
// 开启图像自动转正功能
map.put("enable_rotate", true);
// 配置内置的OCR任务
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.FORMULA_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// 当ocr_options中的task字段设置为公式识别时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
Collections.singletonMap("text", "Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions."))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-latest")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-latest",
"input": {
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
"min_pixels": 401408,
"max_pixels": 6422528,
"enable_rotate": true
},
{
"text": "Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions."
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "formula_recognition"
}
}
}
'
多语言识别
多语言识别适用于中英文之外的小语种场景,支持的小语种有:阿拉伯语、法语、德语、意大利语、日语、韩语、葡萄牙语、俄语、西班牙语、越南语。以纯文本格式返回识别结果,以下是通过DashScope SDK及HTTP方式调用的示例代码:
task的取值 | 指定的Prompt | 输出格式与示例 |
multi_lan | Please output only the text content from the image without any additional descriptions or formatting. | 以纯文本格式返回识别结果 |
import os
import dashscope
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
# 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
"min_pixels": 28 * 28 * 4,
# 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
"max_pixels": 28 * 28 * 8192,
# 开启图像自动转正功能
"enable_rotate": True},
# 当ocr_options中的task字段设置为多语种识别时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
{"text": "Please output only the text content from the image without any additional descriptions or formatting."}]
}]
response = dashscope.MultiModalConversation.call(
# 若没有配置环境变量,请用百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-latest',
messages=messages,
# 设置内置任务为多语言识别
ocr_options={"task": "multi_lan"}
)
# 多语言识别任务以纯文本的形式返回结果
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
// 输入图像的最大像素阈值,超过该值图像会按原比例缩小,直到总像素低于max_pixels
map.put("max_pixels", "6422528");
// 输入图像的最小像素阈值,小于该值图像会按原比例放大,直到总像素大于min_pixels
map.put("min_pixels", "3136");
// 开启图像自动转正功能
map.put("enable_rotate", true);
// 配置内置的OCR任务
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.MULTI_LAN)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// 当ocr_options中的task字段设置为内置任务时,模型会以下面text字段中的内容作为Prompt,不支持用户自定义
Collections.singletonMap("text", "Please output only the text content from the image without any additional descriptions or formatting."))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-latest")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-latest",
"input": {
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
"min_pixels": 401408,
"max_pixels": 6422528,
"enable_rotate": false
},
{
"text": "Please output only the text content from the image without any additional descriptions or formatting."
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "multi_lan"
}
}
}
'
使用限制
图像限制
模型支持的图像格式如下表,需注意在OpenAI SDK传入本地图像时,请根据实际的图像格式,将代码中的image/{format}设置为对应的Content Type值。
图片格式 | Content Type | 文件扩展名 |
BMP | image/bmp | .bmp |
DIB | image/bmp | .dib |
ICNS | image/icns | .icns |
ICO | image/x-icon | .ico |
JPEG | image/jpeg | .jfif, .jpe, .jpeg, .jpg |
JPEG2000 | image/jp2 | .j2c, .j2k, .jp2, .jpc, .jpf, .jpx |
PNG | image/png | .apng, .png |
SGI | image/sgi | .bw, .rgb, .rgba, .sgi |
TIFF | image/tiff | .tif, .tiff |
WEBP | image/webp | .webp |
图像文件的大小不超过10 MB。
图像的宽度和高度均应大于10像素,宽高比不应超过200:1或1:200。
对图像的像素总数无严格限制,模型会对输入的图像进行缩放处理。
min_pixels和max_pixels参数用于控制图像像素的最小值和最大值。若输入图像的像素总数低于min_pixels则按比例放大至超过该值;若超过max_pixels则缩小至低于该值。
min_pixels和max_pixels为默认值时,建议图像像素不超过1568万。若图像像素超过该值,可适当调整max_pixels为更大的值,使模型结果更加准确,但不应超过2352万像素(也即30000 * 28 * 28)。
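下面是一段示意代码,用于在本地估算图像按min_pixels与max_pixels等比缩放后的尺寸。该代码仅为近似实现,实际缩放由服务端完成,结果可能与此略有差异:
import math

def estimate_resized_size(width: int, height: int,
                          min_pixels: int = 28 * 28 * 4,
                          max_pixels: int = 28 * 28 * 8192):
    # 按原比例缩放图像,使总像素落在[min_pixels, max_pixels]区间内(近似示意)
    pixels = width * height
    scale = 1.0
    if pixels < min_pixels:
        scale = math.sqrt(min_pixels / pixels)  # 放大
    elif pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)  # 缩小
    return round(width * scale), round(height * scale)

# 示例:一张689 x 487的图像在默认阈值下无需缩放
print(estimate_resized_size(689, 487))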
模型限制
由于通义千问OCR模型是一个专用于文字提取的模型,对用户输入的其他文本的回答无法保证准确性。
用户需要在image参数中传入图像的URL或者Base64编码;如果请求中输入了多个图像,模型只会识别第一个图像,不支持对多图识别。
通义千问OCR模型目前不支持多轮对话能力,只会对用户最新的问题进行回答。
当图像中的文字太小或分辨率低时,模型可能会出现幻觉。
常见问题
通义千问OCR模型可以处理PDF、EXCEL、DOC等文本文件吗?
通义千问OCR模型属于视觉理解模型,只能处理图片格式的文件,无法直接处理文本文件,但是推荐以下两种解决方案:
通义千问OCR模型是如何计费与限流的?
限流:通义千问OCR模型的限流条件参见限流,阿里云主账号与其RAM子账号共享限流限制。
免费额度:从开通百炼或模型申请通过之日起计算有效期,有效期180天内,通义千问OCR模型提供100万Token的免费额度。
查询模型的剩余额度:您可以在通义千问OCR模型详情页面,查看免费额度、剩余额度及到期时间。如果没有显示免费额度,说明账号下该模型的免费额度已到期。
计费:通义千问OCR为视觉理解模型,总费用 = 输入Token数 x 模型输入单价 + 输出Token数 x 模型输出单价。其中,图片转换为Token的方法为每28x28像素对应一个Token,一张图最少4个Token(Token数的粗略估算可参考本节末尾的示意代码)。
查看账单:您可以在阿里云控制台的费用与成本页面查看账单或进行充值。
更多计费问题请参见计费项。
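按照上文所述每28x28像素对应一个Token、一张图最少4个Token的规则,可以用下面的示意代码粗略估算一张图像的输入Token数。该估算未考虑服务端缩放等细节,结果仅供参考:
import math

def estimate_image_tokens(width: int, height: int) -> int:
    # 粗略估算图像输入Token数:每28x28像素约对应1个Token,最少4个Token(仅为示意)
    return max(4, math.ceil(width * height / (28 * 28)))

# 示例:一张689 x 487的图像约消耗428个Token
print(estimate_image_tokens(689, 487))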
API参考
关于通义千问OCR模型的输入输出参数,请参见通义千问。
错误码
如果模型调用失败并返回报错信息,请参见错误信息进行解决。