文档

文档解析(大模型版)

更新时间:

文档介绍了文档解析(大模型版)API的调用方式,调用前,请先阅读API使用指南

文档解析(大模型版)接口可进行通用文档抽取和理解,从文档中提取出文本markdown内容等。

文档解析(大模型版)接口为异步接口,需要先调用文档解析异步提交服务SubmitDocParserJob接口进行异步任务提交,然后调用文档解析(大模型版)状态查询服务QueryDocParserStatus接口进行处理状态查询,最后根据处理状态,调用GetDocParserResult接口进行结果查询。

重要
  • 免费额度为每月3000页,用完即止。若您的免费额度或资源包消耗完毕,系统将默认采用按量付费的后付费计费方式。

  • 当异步任务处理提交后,用户可以在处理结束后的24小时之内查询处理结果,超过24小时后将无法查询到处理结果。

步骤一:调用文档解析(大模型版)异步提交服务

异步提交服务支持本地文件和URL文件两种方式:

URL上传的异步提交服务接口为:SubmitDocParserJob接口。

本地文件上传的异步提交服务接口为:SubmitDocParserJobAdvance接口。

请求参数

名称

类型

必填

描述

示例值

FileUrl

string

以文档URL方式使用。

单个文档(支持1.5万页以内、150 MB以内的PDF、Word等文档,支持20 MB以内的单张图片)。

https://example.com/example.pdf

FileUrlObject

stream

以本地文件上传方式调用接口时使用。

单个文档(支持1.5万页以内、150 MB以内的PDF、Word等文档,支持20 MB以内的单张图片)。

本地文件生成的FileInputStream

FileName

string

文件名需带文件类型后缀,与fileNameExtension二选一。

example.pdf

FileNameExtension

string

文件类型,与fileName二选一。

pdf

FormulaEnhancement

bool

开启公式识别增强选项,默认为false

true

说明

支持的文档格式:PDF、Word、PPT、PPTX、XLS、XLSX、XLSM和图片,图片支持JPG、JPEG、PNG、BMP、GIF,其余格式支持Markdown、HTML、EPUB、MOBI、RTF、TXT。

返回参数

名称

类型

描述

示例值

RequestId

string

请求唯一ID。

43A29C77-405E-4DC0-BC55-EE694AD0****

Data

object

返回数据。

{"Id": "docmind-20240712-b15f****"}

+Id

string

业务订单号,用于后续查询接口进行查询的唯一标识。

docmind-20220712-b15f****

Code

string

状态码。

200

Message

string

状态详细信息。

message

示例

本接口支持本地文档上传和传入文档URL这两种调用方式。

以Java SDK为例,本地文档上传调用方式的请求示例代码如下,调用submitDocStructureJobAdvance接口,通过fileUrlObject参数实现本地文档上传。

说明

获取并使用AccessKey信息的方式,可参考SDK概述中不同语言的SDK使用指南。

import com.aliyun.docmind_api20220711.models.*;
import com.aliyun.teaopenapi.models.Config;
import com.aliyun.docmind_api20220711.Client;
import com.aliyun.teautil.models.RuntimeOptions;
import java.io.File;
import java.io.FileInputStream;

public static void submit() throws Exception {
    // 使用默认凭证初始化Credentials Client。
    com.aliyun.credentials.Client credentialClient = new com.aliyun.credentials.Client();
    Config config = new Config()
        // 通过credentials获取配置中的AccessKey ID
        .setAccessKeyId(credentialClient.getAccessKeyId())
        // 通过credentials获取配置中的AccessKey Secret
        .setAccessKeySecret(credentialClient.getAccessKeySecret());
    // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
    config.endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
    Client client = new Client(config);
    // 创建RuntimeObject实例并设置运行参数
    RuntimeOptions runtime = new RuntimeOptions();
    SubmitDocParserJobAdvanceRequest advanceRequest = new SubmitDocParserJobAdvanceRequest();
    File file = new File("D:\\example.pdf");
    advanceRequest.fileUrlObject = new FileInputStream(file);
    advanceRequest.fileName = "example.pdf";
    // 发起请求并处理应答或异常。
    SubmitDocParserJobResponse response = client.submitDocParserJobAdvance(advanceRequest, runtime);
}
const Client = require('@alicloud/docmind-api20220711');
const Credential = require('@alicloud/credentials');
const Util = require('@alicloud/tea-util');
const fs = require('fs');

const getResult = async () => {
	// 使用默认凭证初始化Credentials Client
  const cred = new Credential.default();
  const client = new Client.default({
    // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
    endpoint: 'docmind-api.cn-hangzhou.aliyuncs.com',
    // 通过credentials获取配置中的AccessKey ID
    accessKeyId: cred.credential.accessKeyId,
    // 通过credentials获取配置中的AccessKey Secret
    accessKeySecret: cred.credential.accessKeySecret,
    type: 'access_key',
    regionId: 'cn-hangzhou'
  }); 
  const advanceRequest = new Client.SubmitDocParserJobAdvanceRequest();
  const file = fs.createReadStream('./example.pdf');
  advanceRequest.fileUrlObject = file;
  advanceRequest.fileName = 'example.pdf';
  const runtimeObject = new Util.RuntimeOptions({});
  const response = await client.submitDocParserJobAdvance(advanceRequest, runtimeObject);
  return response.body;
};
from alibabacloud_docmind_api20220711.client import Client as docmind_api20220711Client
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_docmind_api20220711 import models as docmind_api20220711_models
from alibabacloud_tea_util.client import Client as UtilClient
from alibabacloud_tea_util import models as util_models
from alibabacloud_credentials.client import Client as CredClient

def submit_file():
  	# 使用默认凭证初始化Credentials Client。
    cred=CredClient()
    config = open_api_models.Config(
        # 通过credentials获取配置中的AccessKey ID
        access_key_id=cred.get_access_key_id(),
        # 通过credentials获取配置中的AccessKey Secret
        access_key_secret=cred.get_access_key_secret()
    )
    # 访问的域名
    config.endpoint = f'docmind-api.cn-hangzhou.aliyuncs.com'
    client = docmind_api20220711Client(config)
    request = docmind_api20220711_models.SubmitDocParserJobAdvanceRequest(
        # file_url_object : 本地文件流
        file_url_object=open("./example.pdf", "rb"),
        # file_name :文件名称。名称必须包含文件类型
        file_name='123.pdf',
        # file_name_extension : 文件后缀格式。与文件名二选一
        file_name_extension='pdf'
    )
    runtime = util_models.RuntimeOptions()
    try:
        # 复制代码运行请自行打印 API 的返回值
        response = client.submit_doc_parser_job_advance(request, runtime)
        # API返回值格式层级为 body -> data -> 具体属性。可根据业务需要打印相应的结果。如下示例为打印返回的业务id格式
        # 获取属性值均以小写开头,
        print(response.body.data.id)
    except Exception as error:
        # 如有需要,请打印 error
        UtilClient.assert_as_string(error.message)
import (
	"fmt"
	"os"
  
	openClient "github.com/alibabacloud-go/darabonba-openapi/v2/client"
	"github.com/alibabacloud-go/docmind-api-20220711/client"
	"github.com/alibabacloud-go/tea-utils/v2/service"
  "github.com/aliyun/credentials-go/credentials"
)

func submit(){
 // 使用默认凭证初始化Credentials Client。
	credential, err := credentials.NewCredential(nil)
	// 通过credentials获取配置中的AccessKey ID
	accessKeyId, err := credential.GetAccessKeyId()
	// 通过credentials获取配置中的AccessKey Secret
	accessKeySecret, err := credential.GetAccessKeySecret()
  // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
  var endpoint string = "docmind-api.cn-hangzhou.aliyuncs.com"
	config := openClient.Config{AccessKeyId: accessKeyId, AccessKeySecret: accessKeySecret, Endpoint: &endpoint}
	// 初始化client
  cli, err := client.NewClient(&config)
	if err != nil {
		panic(err)
	}
  // 上传本地文档调用接口
  filename := "D:\\example.pdf"    
  f, err := os.Open(filename)
	if err != nil {
    panic(err)
	}
  // 初始化接口request
  request := client.SubmitDocParserJobAdvanceRequest{
		FileName:      &filename,
		FileUrlObject: f,
	}
  // 创建RuntimeObject实例并设置运行参数
  options := service.RuntimeOptions{}
  response, err := cli.SubmitDocParserJobAdvance(&request, &options)
  if err != nil {
		panic(err)
	}
  // 打印结果
	fmt.Println(response.Body.String())
}
using Newtonsoft.Json;
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

using Tea;
using Tea.Utils;

  public static void SubmitFile()
        {
            // 使用默认凭证初始化Credentials Client。
          	var akCredential = new Aliyun.Credentials.Client(null);
            AlibabaCloud.OpenApiClient.Models.Config config = new AlibabaCloud.OpenApiClient.Models.Config
            {
                // 通过credentials获取配置中的AccessKey Secret
                AccessKeyId = akCredential.GetAccessKeyId(),
                // 通过credentials获取配置中的AccessKey Secret
                AccessKeySecret = akCredential.GetAccessKeySecret(),
            };
            // 访问的域名
            config.Endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
            AlibabaCloud.SDK.Docmind_api20220711.Client client = new AlibabaCloud.SDK.Docmind_api20220711.Client(config);
               //需要安装额外的依赖库--> AlibabaCloud.DarabonbaStream        
    	     Stream bodySyream = AlibabaCloud.DarabonbaStream.StreamUtil.ReadFromFilePath("<YOUR-FILE-PATH>");
            AlibabaCloud.SDK.Docmind_api20220711.Models.SubmitDocParserJobAdvanceRequest request = new AlibabaCloud.SDK.Docmind_api20220711.Models.SubmitDocParserJobAdvanceRequest
            {
                FileUrlObject = bodySyream,
                FileNameExtension = "pdf"
            };
            AlibabaCloud.TeaUtil.Models.RuntimeOptions runtime = new AlibabaCloud.TeaUtil.Models.RuntimeOptions();
            try
            {
                // 复制代码运行请自行打印 API 的返回值
                client.SubmitDocParserJobAdvance(request, runtime);
            }
            catch (TeaException error)
            {
                // 如有需要,请打印 error
                AlibabaCloud.TeaUtil.Common.AssertAsString(error.Message);
            }
            catch (Exception _error)
            {
                TeaException error = new TeaException(new Dictionary<string, object>
                {
                    { "message", _error.Message }
                });
                // 如有需要,请打印 error
                AlibabaCloud.TeaUtil.Common.AssertAsString(error.Message);
            }
        }

以Java SDK为例,传入文档URL调用方式的请求示例代码如下,调用SubmitDocParserJob接口,通过fileUrl参数实现传入文档URL。请注意,您传入的文档URL必须为公网可访问下载的公网URL地址,无跨域限制,URL不带特殊转义字符。

说明

获取并使用AccessKey信息的方式,可参考SDK概述中不同语言的SDK使用指南。

import com.aliyun.docmind_api20220711.models.*;
import com.aliyun.teaopenapi.models.Config;
import com.aliyun.docmind_api20220711.Client;

public static void submit() throws Exception {
    // 使用默认凭证初始化Credentials Client。
    com.aliyun.credentials.Client credentialClient = new com.aliyun.credentials.Client();
    Config config = new Config()
        // 通过credentials获取配置中的AccessKey ID
        .setAccessKeyId(credentialClient.getAccessKeyId())
        // 通过credentials获取配置中的AccessKey Secret
        .setAccessKeySecret(credentialClient.getAccessKeySecret());
    // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
    config.endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
    Client client = new Client(config);
    SubmitDocParserJobRequest request = new SubmitDocParserJobRequest();
    request.fileName = "example.pdf";
    request.fileUrl = "https://example.com/example.pdf";
    SubmitDocParserJobResponse response = client.submitDocParserJob(request);
}
const Client = require('@alicloud/docmind-api20220711');
const Credential = require('@alicloud/credentials');

const getResult = async () => {
	// 使用默认凭证初始化Credentials Client
  const cred = new Credential.default();
  const client = new Client.default({
    // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
    endpoint: 'docmind-api.cn-hangzhou.aliyuncs.com',
    // 通过credentials获取配置中的AccessKey ID
    accessKeyId: cred.credential.accessKeyId,
    // 通过credentials获取配置中的AccessKey Secret
    accessKeySecret: cred.credential.accessKeySecret,
    type: 'access_key',
    regionId: 'cn-hangzhou'
  });
  
  const request = new Client.SubmitDocParserJobRequest();
  request.fileName = 'example.pdf';
  request.fileUrl = 'https://example.com/example.pdf';
  const response = await client.submitDocParserJob(request);
  
  return response.body;
}
from alibabacloud_docmind_api20220711.client import Client as docmind_api20220711Client
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_docmind_api20220711 import models as docmind_api20220711_models
from alibabacloud_tea_util.client import Client as UtilClient
from alibabacloud_credentials.client import Client as CredClient

def submit_url():
  	# 使用默认凭证初始化Credentials Client。
    cred=CredClient()
    config = open_api_models.Config(
        # 通过credentials获取配置中的AccessKey ID
        access_key_id=cred.get_access_key_id(),
        # 通过credentials获取配置中的AccessKey Secret
        access_key_secret=cred.get_access_key_secret()
    )
    # 访问的域名
    config.endpoint = f'docmind-api.cn-hangzhou.aliyuncs.com'
    client = docmind_api20220711Client(config)
    request = docmind_api20220711_models.SubmitDocParserJobRequest(
        # file_url : 文件url地址
        file_url='https://example.com/example.pdf',
        # file_name :文件名称。名称必须包含文件类型
        file_name='123.pdf',
        # file_name_extension : 文件后缀格式。与文件名二选一
        file_name_extension='pdf'
    )
    try:
        # 复制代码运行请自行打印 API 的返回值
        response = client.submit_doc_parser_job(request)
        # API返回值格式层级为 body -> data -> 具体属性。可根据业务需要打印相应的结果。如下示例为打印返回的业务id格式
        # 获取属性值均以小写开头,
        print(response.body.data.id)     
    except Exception as error:
        # 如有需要,请打印 error
        UtilClient.assert_as_string(error.message)
import (
	"fmt"

	openClient "github.com/alibabacloud-go/darabonba-openapi/v2/client"
  "github.com/alibabacloud-go/docmind-api-20220711/client"
  "github.com/aliyun/credentials-go/credentials"
)

func submit(){
  // 使用默认凭证初始化Credentials Client。
	credential, err := credentials.NewCredential(nil)
	// 通过credentials获取配置中的AccessKey ID
	accessKeyId, err := credential.GetAccessKeyId()
	// 通过credentials获取配置中的AccessKey Secret
	accessKeySecret, err := credential.GetAccessKeySecret()
  // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
  var endpoint string = "docmind-api.cn-hangzhou.aliyuncs.com"
	config := openClient.Config{AccessKeyId: accessKeyId, AccessKeySecret: accessKeySecret, Endpoint: &endpoint}
	// 初始化client
  cli, err := client.NewClient(&config)
	if err != nil {
		panic(err)
	}
  // 文件URL
  fileURL := "https://example.com/example.pdf"
  // 文件名
  fileName := "example.pdf"
  // 初始化接口request
  request := client.SubmitDocParserJobRequest{
		FileUrl:  &fileURL,
		FileName: &fileName,
	}
  response, err := cli.SubmitDocParserJob(&request)
  if err != nil {
		panic(err)
	}
  // 打印结果
	fmt.Println(response.Body.String())
}
using Newtonsoft.Json;
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

using Tea;
using Tea.Utils;

public static void SubmitUrl()
        {
            // 使用默认凭证初始化Credentials Client。
          	var akCredential = new Aliyun.Credentials.Client(null);
            AlibabaCloud.OpenApiClient.Models.Config config = new AlibabaCloud.OpenApiClient.Models.Config
            {
                // 通过credentials获取配置中的AccessKey Secret
                AccessKeyId = akCredential.GetAccessKeyId(),
                // 通过credentials获取配置中的AccessKey Secret
                AccessKeySecret = akCredential.GetAccessKeySecret(),
            };
            // 访问的域名
            config.Endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
            AlibabaCloud.SDK.Docmind_api20220711.Client client = new AlibabaCloud.SDK.Docmind_api20220711.Client(config);
            AlibabaCloud.SDK.Docmind_api20220711.Models.SubmitDocParserJobRequest request = new AlibabaCloud.SDK.Docmind_api20220711.Models.SubmitDocParserJobRequest
            {
                FileUrl = "https://example.pdf",
                FileNameExtension = "pdf"
            };
            try
            {
                // 复制代码运行请自行打印 API 的返回值
                client.SubmitDocParserJob(request);
            }
            catch (TeaException error)
            {
                // 如有需要,请打印 error
                AlibabaCloud.TeaUtil.Common.AssertAsString(error.Message);
            }
            catch (Exception _error)
            {
                TeaException error = new TeaException(new Dictionary<string, object>
                {
                    { "message", _error.Message }
                });
                // 如有需要,请打印 error
                AlibabaCloud.TeaUtil.Common.AssertAsString(error.Message);
            }
        }
use AlibabaCloud\SDK\Docmindapi\V20220711\Docmindapi;
use AlibabaCloud\SDK\Docmindapi\V20220711\Models\SubmitDocStructureJobRequest;
use Darabonba\OpenApi\Models\Config;
use AlibabaCloud\Tea\Utils\Utils\RuntimeOptions;
use AlibabaCloud\Tea\Exception\TeaUnableRetryError;
use AlibabaCloud\Credentials\Credential;

// 使用默认凭证初始化Credentials Client。
$bearerToken = new Credential();    
$config = new Config();
// 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
$config->endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
// 通过credentials获取配置中的AccessKey ID
$config->accessKeyId = $bearerToken->getCredential()->getAccessKeyId();
// 通过credentials获取配置中的AccessKey Secret
$config->accessKeySecret = $bearerToken->getCredential()->getAccessKeySecret();
$config->type = "access_key";
$config->regionId = "cn-hangzhou";
$client = new Docmindapi($config);
$request = new SubmitDocParserJobRequest();

$runtime = new RuntimeOptions();
$runtime->maxIdleConns = 3;
$runtime->connectTimeout = 10000;
$runtime->readTimeout = 10000;

$request->fileName = "example.pdf";
$request->fileUrl = "https://example.com/example.pdf";

try {
  $response = $client->submitDocParserJob($request, $runtime);
  var_dump($response->toMap());
} catch (TeaUnableRetryError $e) {
  var_dump($e->getMessage());
  var_dump($e->getErrorInfo());
  var_dump($e->getLastException());
  var_dump($e->getLastRequest());
}

正常返回示例

JSON格式

{
  "RequestId": "43A29C77-405E-4DC0-BC55-EE694AD0****",
  "Data": {
    "Id": "docmind-20240712-b15f****"
  }  
}

步骤二:调用文档解析(大模型版)状态查询服务QueryDocParserStatus接口

调用查询接口的入参ID就是前面异步任务提交接口返回的出参ID,查询结果有status状态和numberOfSuccessfulParsing已处理块。status状态有队列中、处理中、处理成功、处理失败三种情况。

请求参数

名称

类型

必填

描述

示例值

Id

string

需要查询的业务订单号,订单号从提交接口的返回结果中获取。

docmind-20220712-b15f****

返回参数

名称

类型

描述

示例值

RequestId

string

请求唯一ID。

43A29C77-405E-4CC0-BC55-EE694AD0****

Data

string

返回数据。

+Status

string

任务处理完成的状态。Success表示处理成功,Fail表示处理失败。

success

+NumberOfSuccessfulParsing

integer

已处理的模块数。

166

+Tokens

long

英文单词数,或中文字数。

4429

+ParagraphCount

integer

段落数量。

91

+TableCount

integer

表格数量。

2

+ImageCount

integer

图片数量。

0

+PageCountEstimate

integer

当前处理的页码(页码从0开始)。

3

Code

string

状态码。

200

Message

string

详细信息。

message

说明

订单状态Status类型:

  • init:订单处于待处理队列中;

  • processing: 正在解析处理。

  • success:文件处理成功,此时NumberOfSuccessfulParsing将不再变化;

  • fail:文件处理失败;

示例

以Java SDK为例,调用文档解析接口的结果查询类API示例代码如下,调用 接口,通过ID参数传入查询流水号。

说明

获取并使用AccessKey信息的方式,可参考SDK概述中不同语言的SDK使用指南。

import com.aliyun.docmind_api20220711.models.*;
import com.aliyun.teaopenapi.models.Config;
import com.aliyun.docmind_api20220711.Client;

public static void submit() throws Exception {
    // 使用默认凭证初始化Credentials Client。
    com.aliyun.credentials.Client credentialClient = new com.aliyun.credentials.Client();
    Config config = new Config()
        // 通过credentials获取配置中的AccessKey ID
        .setAccessKeyId(credentialClient.getAccessKeyId())
        // 通过credentials获取配置中的AccessKey Secret
        .setAccessKeySecret(credentialClient.getAccessKeySecret());
    // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
    config.endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
    Client client = new Client(config);
    QueryDocParserStatusRequest resultRequest = new QueryDocParserStatusRequest();
    resultRequest.id = "docmind-20220902-824b****";
    QueryDocParserStatusResponse response = client.queryDocParserStatus(resultRequest);
}
const Client = require('@alicloud/docmind-api20220711');
const Credential = require('@alicloud/credentials');

const getResult = async () => {
	// 使用默认凭证初始化Credentials Client
  const cred = new Credential.default();
  const client = new Client.default({
    // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
    endpoint: 'docmind-api.cn-hangzhou.aliyuncs.com',
    // 通过credentials获取配置中的AccessKey ID
    accessKeyId: cred.credential.accessKeyId,
    // 通过credentials获取配置中的AccessKey Secret
    accessKeySecret: cred.credential.accessKeySecret,
    type: 'access_key',
    regionId: 'cn-hangzhou'
  });
  
  const resultRequest = new Client.QueryDocParserStatusRequest();
  resultRequest.id = "docmind-20220902-824b****";
  const response = await client.queryDocParserStatus(resultRequest);
  
  return response.body;
}
from typing import List
from alibabacloud_docmind_api20220711.client import Client as docmind_api20220711Client
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_docmind_api20220711 import models as docmind_api20220711_models
from alibabacloud_tea_util.client import Client as UtilClient
from alibabacloud_credentials.client import Client as CredClient

def query():
  	# 使用默认凭证初始化Credentials Client。
    cred=CredClient()
    config = open_api_models.Config(
        # 通过credentials获取配置中的AccessKey ID
        access_key_id=cred.get_access_key_id(),
        # 通过credentials获取配置中的AccessKey Secret
        access_key_secret=cred.get_access_key_secret()
    )
    # 访问的域名
    config.endpoint = f'docmind-api.cn-hangzhou.aliyuncs.com'
    client = docmind_api20220711Client(config)
    request = docmind_api20220711_models.QueryDocParserStatusRequest(
        # id :  任务提交接口返回的id
        id='docmind-20220902-824b****'
    )
    try:
        # 复制代码运行请自行打印 API 的返回值
        response = client.query_doc_parser_status(request)
        # API返回值格式层级为 body -> data -> 具体属性。可根据业务需要打印相应的结果。获取属性值均以小写开头
        # 获取返回结果。建议先把response.body.data转成json,然后再从json里面取具体需要的值。
        print(response.body.data)
    except Exception as error:
        # 如有需要,请打印 error
        UtilClient.assert_as_string(error.message)  
import (
	"fmt"

	openClient "github.com/alibabacloud-go/darabonba-openapi/v2/client"
  "github.com/alibabacloud-go/docmind-api-20220711/client"
  "github.com/aliyun/credentials-go/credentials"
)

func submit(){
    // 使用默认凭证初始化Credentials Client。
		credential, err := credentials.NewCredential(nil)
		// 通过credentials获取配置中的AccessKey ID
		accessKeyId, err := credential.GetAccessKeyId()
		// 通过credentials获取配置中的AccessKey Secret
		accessKeySecret, err := credential.GetAccessKeySecret()
  	// 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
  	var endpoint string = "docmind-api.cn-hangzhou.aliyuncs.com"
		config := openClient.Config{AccessKeyId: accessKeyId, AccessKeySecret: accessKeySecret, Endpoint: &endpoint}
    // 初始化client
    cli, err := client.NewClient(&config)
    if err != nil {
      panic(err)
    }
    id := "docmind-20220925-76b1****"
    // 调用查询接口
    request := client.QueryDocParserStatusRequest{Id: &id}
    response, err := cli.QueryDocParserStatus(&request)
    if err != nil {
      panic(err)
    }
    // 打印查询结果
    fmt.Println(response.Body.String())
}
using Newtonsoft.Json;
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

using Tea;
using Tea.Utils;

 public static void GetResult() 
        {
            // 使用默认凭证初始化Credentials Client。
          	var akCredential = new Aliyun.Credentials.Client(null);
            AlibabaCloud.OpenApiClient.Models.Config config = new AlibabaCloud.OpenApiClient.Models.Config
            {
                // 通过credentials获取配置中的AccessKey Secret
                AccessKeyId = akCredential.GetAccessKeyId(),
                // 通过credentials获取配置中的AccessKey Secret
                AccessKeySecret = akCredential.GetAccessKeySecret(),
            };
            // 访问的域名
            config.Endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
            AlibabaCloud.SDK.Docmind_api20220711.Client client = new AlibabaCloud.SDK.Docmind_api20220711.Client(config);
            AlibabaCloud.SDK.Docmind_api20220711.Models.QueryDocParserStatusRequest request = new AlibabaCloud.SDK.Docmind_api20220711.Models.QueryDocParserStatusRequest
            {
                Id = "docmind-20240902-824b****"
            };
            AlibabaCloud.TeaUtil.Models.RuntimeOptions runtime = new AlibabaCloud.TeaUtil.Models.RuntimeOptions();
            try
            {
                // 复制代码运行请自行打印 API 的返回值
                client.QueryDocParserStatus(request);
            }
            catch (TeaException error)
            {
                // 如有需要,请打印 error
                AlibabaCloud.TeaUtil.Common.AssertAsString(error.Message);
            }
            catch (Exception _error)
            {
                TeaException error = new TeaException(new Dictionary<string, object>
                {
                    { "message", _error.Message }
                });
                // 如有需要,请打印 error
                AlibabaCloud.TeaUtil.Common.AssertAsString(error.Message);
            }
        }
use AlibabaCloud\SDK\Docmindapi\V20220711\Docmindapi;
use AlibabaCloud\SDK\Docmindapi\V20220711\Models\GetDocStructureResultRequest;
use Darabonba\OpenApi\Models\Config;
use AlibabaCloud\Tea\Utils\Utils\RuntimeOptions;
use AlibabaCloud\Tea\Exception\TeaUnableRetryError;
use AlibabaCloud\Credentials\Credential;

// 使用默认凭证初始化Credentials Client。
$bearerToken = new Credential();    
$config = new Config();
// 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
$config->endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
// 通过credentials获取配置中的AccessKey ID
$config->accessKeyId = $bearerToken->getCredential()->getAccessKeyId();
// 通过credentials获取配置中的AccessKey Secret
$config->accessKeySecret = $bearerToken->getCredential()->getAccessKeySecret();
$config->type = "access_key";
$config->regionId = "cn-hangzhou";
$client = new Docmindapi($config);
$request = new QueryDocParserStatusRequest();   
$request->id = "docmind-20220902-824b****";

$runtime = new RuntimeOptions();
$runtime->maxIdleConns = 3;
$runtime->connectTimeout = 10000;
$runtime->readTimeout = 10000; 

try {
  $response = $client->queryDocParserStatus($request, $runtime);
  var_dump($response->toMap());
} catch (TeaUnableRetryError $e) {
  var_dump($e->getMessage());
  var_dump($e->getErrorInfo());
  var_dump($e->getLastException());
  var_dump($e->getLastRequest());
}

正常返回示例

JSON格式

{
    "RequestId": "43A29C77-405E-4DC0-BC55-EE694AD0****",
    "Data": {
        "Status": "success",
        "NumberOfSuccessfulParsing": 93,
        "ImageCount": 0,
        "PageCountEstimate": 8,
        "ParagraphCount": 91,
        "TableCount": 2,
        "Tokens": 4429
    }
}

步骤三:调用文档解析(大模型版)结果获取服务GetDocParserResult接口

请求参数

名称

类型

必填

描述

示例值

Id

string

需要查询的业务订单号,订单号从提交接口的返回结果中获取。

docmind-20220712-b15f****

LayoutStepSize

integer

10

期望查询的layout的size大小(步长值)。

LayoutNum

integer

0

查询起始块位置

返回参数

名称

类型

描述

示例值

RequestId

string

请求唯一ID。

43A29C77-405E-4CC0-BC55-EE694AD0****

Data

string

返回数据,文档解析(大模型版)的解析结果。

Code

string

状态码。

200

Message

string

详细信息。

message

示例

以Java SDK为例,调用文档解析接口的结果查询类API示例代码如下。

说明

获取并使用AccessKey信息的方式,可参考SDK概述中不同语言的SDK使用指南。

import com.aliyun.docmind_api20220711.models.*;
import com.aliyun.teaopenapi.models.Config;
import com.aliyun.docmind_api20220711.Client;

public static void submit() throws Exception {
    // 使用默认凭证初始化Credentials Client。
    com.aliyun.credentials.Client credentialClient = new com.aliyun.credentials.Client();
    Config config = new Config()
        // 通过credentials获取配置中的AccessKey ID
        .setAccessKeyId(credentialClient.getAccessKeyId())
        // 通过credentials获取配置中的AccessKey Secret
        .setAccessKeySecret(credentialClient.getAccessKeySecret());
    // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
    config.endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
    Client client = new Client(config);
    GetDocParserResultRequest resultRequest = new GetDocParserResultRequest();
    resultRequest.id = "docmind-20220902-824b****";
    resultRequest.layoutStepSize = 10;
    resultRequest.layoutNum = 0;
    GetDocParserResultResponse response = client.getDocParserResult(resultRequest);
}
const Client = require('@alicloud/docmind-api20220711');
const Credential = require('@alicloud/credentials');

const getResult = async () => {
	// 使用默认凭证初始化Credentials Client
  const cred = new Credential.default();
  const client = new Client.default({
    // 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
    endpoint: 'docmind-api.cn-hangzhou.aliyuncs.com',
    // 通过credentials获取配置中的AccessKey ID
    accessKeyId: cred.credential.accessKeyId,
    // 通过credentials获取配置中的AccessKey Secret
    accessKeySecret: cred.credential.accessKeySecret,
    type: 'access_key',
    regionId: 'cn-hangzhou'
  });
  
  const resultRequest = new Client.GetDocParserResultRequest();
  resultRequest.id = "docmind-20220902-824b****";
  resultRequest.layoutStepSize = 10;
  resultRequest.layoutNum = 0;
  const response = await client.getDocParserResult(resultRequest);  
  return response.body;
}
from typing import List
from alibabacloud_docmind_api20220711.client import Client as docmind_api20220711Client
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_docmind_api20220711 import models as docmind_api20220711_models
from alibabacloud_tea_util.client import Client as UtilClient
from alibabacloud_credentials.client import Client as CredClient

def query():
  	# 使用默认凭证初始化Credentials Client。
    cred=CredClient()
    config = open_api_models.Config(
        # 通过credentials获取配置中的AccessKey ID
        access_key_id=cred.get_access_key_id(),
        # 通过credentials获取配置中的AccessKey Secret
        access_key_secret=cred.get_access_key_secret()
    )
    # 访问的域名
    config.endpoint = f'docmind-api.cn-hangzhou.aliyuncs.com'
    client = docmind_api20220711Client(config)
    request = docmind_api20220711_models.GetDocParserResultRequest(
        # id :  任务提交接口返回的id
        id='docmind-20220902-824b****',
        layout_step_size=10,
        layout_num=0
    )
    try:
        # 复制代码运行请自行打印 API 的返回值
        response = client.get_doc_parser_result(request)
        # API返回值格式层级为 body -> data -> 具体属性。可根据业务需要打印相应的结果。获取属性值均以小写开头
        # 获取返回结果。建议先把response.body.data转成json,然后再从json里面取具体需要的值。
        print(response.body.data)
    except Exception as error:
        # 如有需要,请打印 error
        UtilClient.assert_as_string(error.message) 
import (
	"fmt"

	openClient "github.com/alibabacloud-go/darabonba-openapi/v2/client"
  "github.com/alibabacloud-go/docmind-api-20220711/client"
  "github.com/aliyun/credentials-go/credentials"
)

func submit(){
    // 使用默认凭证初始化Credentials Client。
		credential, err := credentials.NewCredential(nil)
		// 通过credentials获取配置中的AccessKey ID
		accessKeyId, err := credential.GetAccessKeyId()
		// 通过credentials获取配置中的AccessKey Secret
		accessKeySecret, err := credential.GetAccessKeySecret()
  	// 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
  	var endpoint string = "docmind-api.cn-hangzhou.aliyuncs.com"
		config := openClient.Config{AccessKeyId: accessKeyId, AccessKeySecret: accessKeySecret, Endpoint: &endpoint}
    // 初始化client
    cli, err := client.NewClient(&config)
    if err != nil {
      panic(err)
    }
    id := "docmind-20220925-76b1****",
    layoutStepSize := 10,
    layoutNum := 0,
    // 调用查询接口
    request := client.GetDocParserResultRequest{Id: &id}
    response, err := cli.GetDocParserResult(&request)
    if err != nil {
      panic(err)
    }
    // 打印查询结果
    fmt.Println(response.Body.String())
}
using Newtonsoft.Json;
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

using Tea;
using Tea.Utils;

 public static void GetResult() 
        {
            // 使用默认凭证初始化Credentials Client。
          	var akCredential = new Aliyun.Credentials.Client(null);
            AlibabaCloud.OpenApiClient.Models.Config config = new AlibabaCloud.OpenApiClient.Models.Config
            {
                // 通过credentials获取配置中的AccessKey Secret
                AccessKeyId = akCredential.GetAccessKeyId(),
                // 通过credentials获取配置中的AccessKey Secret
                AccessKeySecret = akCredential.GetAccessKeySecret(),
            };
            // 访问的域名
            config.Endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
            AlibabaCloud.SDK.Docmind_api20220711.Client client = new AlibabaCloud.SDK.Docmind_api20220711.Client(config);
            AlibabaCloud.SDK.Docmind_api20220711.Models.GetDocParserResultRequest request = new AlibabaCloud.SDK.Docmind_api20220711.Models.GetDocParserResultRequest
            {
                Id = "docmind-20240902-824b****",
                LayoutStepSize = 10,
                LayoutNum = 0
            };
            AlibabaCloud.TeaUtil.Models.RuntimeOptions runtime = new AlibabaCloud.TeaUtil.Models.RuntimeOptions();
            try
            {
                // 复制代码运行请自行打印 API 的返回值
                client.GetDocParserResult(request);
            }
            catch (TeaException error)
            {
                // 如有需要,请打印 error
                AlibabaCloud.TeaUtil.Common.AssertAsString(error.Message);
            }
            catch (Exception _error)
            {
                TeaException error = new TeaException(new Dictionary<string, object>
                {
                    { "message", _error.Message }
                });
                // 如有需要,请打印 error
                AlibabaCloud.TeaUtil.Common.AssertAsString(error.Message);
            }
        }
use AlibabaCloud\SDK\Docmindapi\V20220711\Docmindapi;
use AlibabaCloud\SDK\Docmindapi\V20220711\Models\GetDocStructureResultRequest;
use Darabonba\OpenApi\Models\Config;
use AlibabaCloud\Tea\Utils\Utils\RuntimeOptions;
use AlibabaCloud\Tea\Exception\TeaUnableRetryError;
use AlibabaCloud\Credentials\Credential;

// 使用默认凭证初始化Credentials Client。
$bearerToken = new Credential();    
$config = new Config();
// 访问的域名,支持ipv4和ipv6两种方式,ipv6请使用docmind-api-dualstack.cn-hangzhou.aliyuncs.com
$config->endpoint = "docmind-api.cn-hangzhou.aliyuncs.com";
// 通过credentials获取配置中的AccessKey ID
$config->accessKeyId = $bearerToken->getCredential()->getAccessKeyId();
// 通过credentials获取配置中的AccessKey Secret
$config->accessKeySecret = $bearerToken->getCredential()->getAccessKeySecret();
$config->type = "access_key";
$config->regionId = "cn-hangzhou";
$client = new Docmindapi($config);
$request = new GetDocParserResultRequest();   
$request->id = "docmind-20220902-824b****";
$request->layoutStepSize = 10;
$request->layoutNum = 0;

$runtime = new RuntimeOptions();
$runtime->maxIdleConns = 3;
$runtime->connectTimeout = 10000;
$runtime->readTimeout = 10000; 

try {
  $response = $client->getDocParserResult($request, $runtime);
  var_dump($response->toMap());
} catch (TeaUnableRetryError $e) {
  var_dump($e->getMessage());
  var_dump($e->getErrorInfo());
  var_dump($e->getLastException());
  var_dump($e->getLastRequest());
}

  • 处理失败的返回结果如下所示:

{
  "RequestId": "A8EF3A36-1380-1116-A39E-B377BE27****",
  "Code": "UrlNotLegal",
  "Message": "Failed to process the document.  The document url you provided is not legal.",
  "HostId": "docmind-api.cn-hangzhou.aliyuncs.com",
  "Recommend": "https://next.api.aliyun.com/troubleshoot?q=IDP.UrlNotLegal&product=docmind-api"
}

如果请求处理成功失败,会返回失败Code和详细原因Message。访问错误码可以查看错误码详细介绍。

  • 处理成功的返回结果如下所示:

{
    "Data": {
        "layouts": [
            {
                "firstLinesChars": 0,
                "level": 0,
                "blocks": [
                    {
                        "pos": [
                            {
                                "x": 460,
                                "y": 116
                            },
                            {
                                "x": 984,
                                "y": 116
                            },
                            {
                                "x": 984,
                                "y": 172
                            },
                            {
                                "x": 460,
                                "y": 172
                            }
                        ],
                        "text": "PRACTITIONERS’ SECTION"
                    }
                ],
                "markdownContent": "# PRACTITIONERS’ SECTION  \n\n",
                "index": 2,
                "subType": "doc_title",
                "lineHeight": 0,
                "text": "PRACTITIONERS’ SECTION",
                "alignment": "center",
                "type": "title",
                "pageNum": 0,
                "uniqueId": "3d8ded229371f879eac478dee2574a82"
            },
            {
                "firstLinesChars": 0,
                "level": 0,
                "blocks": [
                    {
                        "pos": [
                            {
                                "x": 277,
                                "y": 192
                            },
                            {
                                "x": 1161,
                                "y": 192
                            },
                            {
                                "x": 1161,
                                "y": 236
                            },
                            {
                                "x": 277,
                                "y": 236
                            }
                        ],
                        "text": "RAGGING: A PUBLIC HEALTH PROBLEM IN INDIA"
                    }
                ],
                "markdownContent": "# RAGGING: A PUBLIC HEALTH PROBLEM IN INDIA  \n\n",
                "index": 3,
                "subType": "doc_subtitle",
                "lineHeight": 0,
                "text": "RAGGING: A PUBLIC HEALTH PROBLEM IN INDIA",
                "alignment": "center",
                "type": "title",
                "pageNum": 0,
                "uniqueId": "44256f2d41419956ff30c6590683217b"
            },
            {
                "firstLinesChars": 0,
                "level": 1,
                "blocks": [
                    {
                        "pos": [
                            {
                                "x": 623,
                                "y": 265
                            },
                            {
                                "x": 815,
                                "y": 265
                            },
                            {
                                "x": 815,
                                "y": 297
                            },
                            {
                                "x": 623,
                                "y": 297
                            }
                        ],
                        "text": "RAJESH GARG"
                    }
                ],
                "markdownContent": "RAJESH GARG  \n\n",
                "index": 4,
                "subType": "para",
                "lineHeight": 0,
                "text": "RAJESH GARG",
                "alignment": "center",
                "type": "text",
                "pageNum": 0,
                "uniqueId": "d0d64505df8cb4d73f80e0f63007bf60"
            }
        ]
    },
    "RequestId": "7B8CC68D-D498-5EDA-8352-FAF41591D97A"
}

Data

object

解析结果

+layouts

array

版面信息列表

++text

string

文本内容

++markdownContent

string

markdown 文本内容

++index

int

版面阅读顺序

++uniqueId

string

版面信息唯一id

++alignment

string

间距枚举

++pageNum

int

版面所在页

++level

int

版面层级(最小层级为0,表示根节点)

++type

string

版面类型(详见备注版面类型)

++subType

string

版面子类型(详见备注版面类型)

版面类型

文档解析(大模型版)返回结果中,版面的类型type及子类型subType列表如下:

type(类型)

类型描述

subType(子类型)

子类型描述

title

标题

doc_name

文档名称

doc_title

文档标题

doc_subtitle

文档副标题

para_title

段落标题

table

表格

table_name

表格名

table_name

表格标题

table_note

表注

table_note

表注

multicolumn

多栏文字

formula

公式

formula

公式

contents_title

目录标题

cate_title

目录标题

contents

目录主体

cate

目录主体

text

普通文字

para

段落

figure

图表

picture

图片

logo

logo

figure_name

图名

pic_title

图片标题

figure_note

图注

pic_caption

图注

foot

页脚

page_footer

页脚

foot_image

页脚图片

head

页眉

page_header

页眉

head_image

页眉图片

head_pagenum

页眉页码

page

页码

foot_pagenum

页脚页码

page

页码

corner_note

脚注

footer_note

脚注

end_note

尾注

endnode

尾注

side

侧栏

sidebar

侧栏