Call the ReadPageScrape API to dynamically render and parse web pages (Information Query Service, Alibaba Cloud)

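This topic describes the ReadPageScrape API, which uses a headless browser to load and dynamically render the target page. It covers the API's functionality, request parameters, response fields, and how to call it.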

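API description

Functionality:

Reads the HTML in a sandboxed browser environment and parses the page content.
Parsing starts once the target page's resources have fully loaded (the maximum wait can be adjusted with the pageTimeout parameter), so the overall latency of the API depends heavily on how quickly the target site's resources load.
If the Content-Type response header of the target address is application/pdf, PDF parsing is triggered automatically. The parsed content is what is shown in the current browser window; to download the PDF content, see ReadPageBasic - Standard Edition.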

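API definition

Request parameters

Field	Sub-field	Allowed values	Default	Description
url: string		Required		Target address to parse. Must start with http:// or https://.
timeout: int		[0, 180000]	60000	Processing timeout for the API call, in ms. Maximum: 180000.
pageTimeout: int		[0, 60000]	15000	Timeout for waiting until the target site has fully loaded. Note: a pageTimeout that is too small causes missing page content; pageTimeout should be less than timeout.
formats: array		["rawHtml", "html", "markdown", "text", "screenshot"]	["html", "markdown", "text"]	Output formats. rawHtml: the target site's HTML. html: the page content after processing according to readabilityMode. markdown: Markdown converted from html. text: the text content of html. screenshot: a screenshot of the page; this takes about 3-10 s, so set timeout accordingly.
maxAge: int		[0, ∞)	1296000	Maximum cache age, in seconds (the default equals 15 days). If the cached copy is younger than maxAge, the cached content is returned; if maxAge is 0, the cache is not used.
stealthMode: int		0, 1	0	0: stealthMode off. 1: stealthMode on.
readability: map	readabilityMode: string	"none", "normal", "article"	"none"	normal: uses a proprietary algorithm to strip irrelevant content (header, footer, navigation, etc.) and return the main body content. article: uses a proprietary algorithm to extract the site's main article content (suited to blogs and news sites; not suited to index or navigation pages).
	excludeAllImages: bool	false, true	false	Whether to remove all images.
	excludeAllLinks: bool	false, true	false	Whether to remove all links.
	excludedTags: array	[]	[]	Tags to exclude, for example ["form", "header", "footer", "nav"].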
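
The ranges and constraints in the request parameter table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the official SDK: the helper name build_scrape_body and its keyword arguments are hypothetical, and the JSON field names are taken from the table above.

```python
from typing import Any, Dict, List, Optional

# Values copied from the request parameter table.
ALLOWED_FORMATS = {"rawHtml", "html", "markdown", "text", "screenshot"}
ALLOWED_READABILITY_MODES = {"none", "normal", "article"}


def build_scrape_body(
    url: str,
    timeout: int = 60000,
    page_timeout: int = 15000,
    formats: Optional[List[str]] = None,
    max_age: int = 1296000,
    stealth_mode: int = 0,
    readability_mode: str = "none",
) -> Dict[str, Any]:
    """Validate arguments against the documented ranges and return a JSON-ready dict.

    Hypothetical helper: the function and argument names are illustrative only.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if not 0 <= timeout <= 180000:
        raise ValueError("timeout must be within [0, 180000] ms")
    if not 0 <= page_timeout <= 60000:
        raise ValueError("pageTimeout must be within [0, 60000]")
    if page_timeout >= timeout:
        raise ValueError("pageTimeout should be less than timeout")
    chosen = list(formats) if formats is not None else ["html", "markdown", "text"]
    unknown = [name for name in chosen if name not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    if max_age < 0:
        raise ValueError("maxAge must be >= 0 (0 disables the cache)")
    if stealth_mode not in (0, 1):
        raise ValueError("stealthMode must be 0 or 1")
    if readability_mode not in ALLOWED_READABILITY_MODES:
        raise ValueError("readabilityMode must be none, normal, or article")
    return {
        "url": url,
        "timeout": timeout,
        "pageTimeout": page_timeout,
        "formats": chosen,
        "maxAge": max_age,
        "stealthMode": stealth_mode,
        "readability": {"readabilityMode": readability_mode},
    }
```

For example, build_scrape_body("https://www.example.com", max_age=0) yields a body that bypasses the cache and requests the default html, markdown, and text formats.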

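Response parameters

Field	Sub-field	Nested field	Nullable	Description	Example
requestId: string			No	Request ID; provide it when troubleshooting.
errorCode: string			Yes	Error code.
errorMessage: string			Yes	Error message.
data: map	statusCode: int		No	If the request to the target site succeeds, the HTTP status code of the target url. If it fails, an IQS-specific error code. 4030: security restriction on the target site (robots.txt, security policy, etc.). 4080: request timed out. 4290: the site's rate limiting was triggered. 5010: unknown error.
	errorMessage: string		Yes	Error related to the target url.
	rawHtml: string		Yes	Raw HTML of the target url.
	html: string		Yes	Readable HTML of the target url.
	text: string		Yes	Text content of the target url.
	markdown: string		Yes	Markdown content of the target url.
	screenshot: string		Yes	Screenshot of the target site.
	links: map	internal: map	Yes	Links found in the target url (same-site links). href: link address. text: link display text. title: supplementary tooltip.	{ "href": "https://www.alibabagroup.com/cn/global/home", "text": "阿里巴巴集团", "title": "" }
		external: map	Yes	Links found in the target url (off-site links).
	media: map	images: map	Yes	Images found in the target url. type: image. src: image address. data: inline or additional data. alt: alternative text. desc: supplementary description. format: image format.	{ "type": "image", "src": "https://img.alicdn.com/tfs/TB1AOdINW6qK1RjSZFmXXX0PFXa-258-258.jpg", "data": "", "alt": "阿里云微信", "desc": null, "format": "jpg" }
		audios: array	Yes	Audio found in the target url. type: audio. src: audio address. data: inline or additional data. alt: alternative text. desc: supplementary description. format: format.
		videos: array	Yes	Videos found in the target url. type: video. src: video address. data: inline or additional data. alt: alternative text. desc: supplementary description. format: format.	[ { "type": "video", "src": "blob:https://xxxxx.com/xxx", "data": "", "alt": null, "desc": "", "format": null } ]
	metadata: map	url	No	Target address.
		title	No	Site title.
		hostname	Yes	Site hostname.
		hostLogo	Yes	Site logo.
		pdfParse	Yes	Whether the content was parsed as a PDF.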
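
A caller typically needs to tell an ordinary HTTP status from one of the IQS-specific codes in data.statusCode, and to read the link lists. The sketch below is one possible way to do that, with hypothetical helper names (describe_status, as_list, summarize). It accepts the link fields either as lists or as JSON-encoded strings, because the sample HTTP response in this document returns them as strings.

```python
import json
from typing import Any, Dict, List

# IQS-specific codes documented for data.statusCode.
IQS_STATUS_CODES = {
    4030: "security restriction on the target site",
    4080: "request timed out",
    4290: "target site rate limiting triggered",
    5010: "unknown error",
}


def describe_status(status_code: int) -> str:
    """Label a statusCode as an IQS-specific code or a plain HTTP status."""
    if status_code in IQS_STATUS_CODES:
        return "IQS: " + IQS_STATUS_CODES[status_code]
    return f"HTTP {status_code} from target site"


def as_list(value: Any) -> List[Any]:
    """Return a list whether the field is a list or a JSON-encoded string."""
    if value is None or value == "":
        return []
    if isinstance(value, str):
        value = json.loads(value)
    return value if isinstance(value, list) else [value]


def summarize(data: Dict[str, Any]) -> Dict[str, Any]:
    """Extract a few commonly needed facts from the data object."""
    links = data.get("links") or {}
    metadata = data.get("metadata") or {}
    return {
        "status": describe_status(data["statusCode"]),
        "title": metadata.get("title"),
        "is_pdf": bool(metadata.get("pdfParse")),
        "external_hrefs": [link["href"] for link in as_list(links.get("external"))],
    }
```

Only statusCode is read unconditionally, since it is the one field under data that the table marks as non-nullable besides metadata.url and metadata.title.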

Error codes

HttpCode	Error code	Error message	Resolution
404	InvalidAccessKeyId.NotFound	Specified access key is not found.	Check that the AccessKey ID and AccessKey secret are correct.
403	Retrieval.NotActivate	Please activate AI search service	Place an order, or contact your account manager to activate the service.
403	Retrieval.NotAuthorised	Please authorize the AliyunIQSFullAccess privilege to the sub-account.	The sub-account (RAM user) has not been authorized; see "Create a RAM user and grant permissions".
429	Retrieval.Throttling.User	Request was denied due to user flow control.	The rate limit quota was exceeded.
429	Retrieval.TestUserQueryPerDayExceeded	The query per day exceed the limit.	The trial quota was exceeded (1,000 requests per 30 days).
403	ReadPage.SecurityRestrict	Security restrictions on the target site (e.g., robots.txt).
400	ReadPage.RequestTimeout	Request target url timeout.	The request timed out; increase timeout as appropriate.
429	ReadPage.RateLimitByDomain	The domain has reached the rate limit.
500	ReadPage.UnknownError	Unknown error.
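
Some of the error codes above describe transient conditions (throttling, timeouts), while others need a configuration change and will not succeed on retry. The sketch below shows one possible retry policy with exponential backoff. Which codes count as retryable is an assumption made here, not something the table specifies, and call_with_retry and its call argument are hypothetical names.

```python
import time
from typing import Any, Callable, Optional, Tuple

# Assumption: these documented codes describe transient conditions worth retrying.
RETRYABLE_ERROR_CODES = {
    "Retrieval.Throttling.User",
    "ReadPage.RequestTimeout",
    "ReadPage.RateLimitByDomain",
}


def call_with_retry(
    call: Callable[[], Tuple[Optional[str], Any]],
    max_attempts: int = 3,
    base_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[str], Any]:
    """Invoke call() up to max_attempts times.

    call() is a hypothetical wrapper around the API that returns
    (error_code, payload), with error_code set to None on success.
    """
    error_code, payload = call()
    for attempt in range(1, max_attempts):
        if error_code is None or error_code not in RETRYABLE_ERROR_CODES:
            break
        # Exponential backoff: base, 2 * base, 4 * base, ...
        sleep(base_seconds * 2 ** (attempt - 1))
        error_code, payload = call()
    return error_code, payload
```

Codes such as InvalidAccessKeyId.NotFound or Retrieval.NotAuthorised are deliberately left out of the retryable set, because the listed resolutions require changing credentials or permissions.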

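Additional notes

1. stealth_mode (the stealthMode request parameter)

Stealth mode is designed to cope with anti-bot mechanisms. It is slower, but more reliable on some sites.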
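
Because stealth mode is slower, one option is to try a normal request first and enable stealthMode only when the result looks blocked. The sketch below illustrates that idea. The fetch callable, the function names, and the looks_blocked heuristic are all assumptions introduced for illustration; they are not part of the API.

```python
from typing import Any, Callable, Dict


def looks_blocked(data: Dict[str, Any]) -> bool:
    """Heuristic (assumption): HTTP 403 from the target, or a 200 with no text."""
    status = data.get("statusCode")
    text = (data.get("text") or "").strip()
    return status == 403 or (status == 200 and not text)


def scrape_with_stealth_fallback(
    fetch: Callable[[Dict[str, Any]], Dict[str, Any]],
    url: str,
) -> Dict[str, Any]:
    """Try stealthMode 0 first, then retry once with stealthMode 1 if blocked.

    fetch(body) is a hypothetical callable that sends the request body to
    ReadPageScrape and returns the data object of the response.
    """
    body = {"url": url, "stealthMode": 0}
    data = fetch(body)
    if looks_blocked(data):
        data = fetch({**body, "stealthMode": 1})
    return data
```

The second request is only issued when the first one appears blocked, so sites that respond normally keep the faster default path.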

[Figure: comparison before and after enabling stealth_mode]

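2. Restrictions

This API complies with the target website's robots protocol and user agreement, and with applicable laws and regulations.
This API only provides technical support and page-parsing capability, and assumes no responsibility for the generation, publication, display, or use of third-party website content. We make no commitment or warranty of any kind regarding third-party content crawled, accessed, or otherwise obtained through this service.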

Calling the API

Examples

Python SDK

Prerequisites

Make sure that Python 3.8 or later is installed.

Install the SDK

pip install alibabacloud_iqs20241111==1.6.0

Invocation code

import json

from Tea.exceptions import TeaException
from alibabacloud_iqs20241111 import models
from alibabacloud_iqs20241111.client import Client
from alibabacloud_tea_openapi import models as open_api_models


class Sample:
    def __init__(self):
        pass

    @staticmethod
    def create_client() -> Client:
        config = open_api_models.Config(
            # TODO: replace with your AK/SK (loading them from environment variables is recommended)
            access_key_id="$YOUR_ACCESS_KEY",
            access_key_secret="$YOUR_ACCESS_SECRET"
        )
        config.endpoint = "iqs.cn-zhangjiakou.aliyuncs.com"
        return Client(config)

    @staticmethod
    def main() -> None:
        client = Sample.create_client()
        request = models.ReadPageScrapeRequest(
            body=models.ReadPageScrapeBody(
                url="http://www.example.com",
                max_age=0,
            )
        )
        try:
            response = client.read_page_scrape(request)
            print(f"api success, request_id:{response.body.request_id}, result: ")
            print(json.dumps(response.body.data.to_map(), indent=2))

        except TeaException as e:
            # e.data may be None when no response body was received
            data = e.data or {}
            request_id = data.get("requestId")
            code = data.get("errorCode")
            message = data.get("errorMessage")
            print(f"api exception, requestId:{request_id}, code:{code}, message:{message}")


if __name__ == "__main__":
    Sample.main()

Java SDK

Prerequisites

Java 8 or later is installed.

Maven dependency

<dependency>
    <groupId>com.aliyun</groupId>
    <artifactId>iqs20241111</artifactId>
    <version>1.6.0</version>
</dependency>

Invocation code

package com.aliyun.iqs.readpage.example;

import com.aliyun.iqs20241111.Client;
import com.aliyun.iqs20241111.models.*;
import com.aliyun.teaopenapi.models.Config;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class Example {
    public static void main(String[] args) throws Exception {
        Client client = initClient();
        invoke(client, "http://www.example.com");
    }

    private static Client initClient() throws Exception {
        // TODO: replace with your AK/SK (loading them from environment variables is recommended)
        String accessKeyId = "$YOUR_ACCESS_KEY";
        String accessKeySecret = "$YOUR_ACCESS_SECRET";

        Config config = new Config()
                .setAccessKeyId(accessKeyId)
                .setAccessKeySecret(accessKeySecret);

        config.setEndpoint("iqs.cn-zhangjiakou.aliyuncs.com");
        return new Client(config);
    }

    private static void invoke(Client client, String url) {
        ReadPageScrapeBody input = new ReadPageScrapeBody();
        input.setUrl(url);

        ReadPageScrapeRequest request = new ReadPageScrapeRequest().setBody(input);

        try {
            ReadPageScrapeResponse response = client.readPageScrape(request);

            printOutput(response.getBody());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

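    // Declared as Object so that whatever body type response.getBody() returns is accepted as-is
    private static void printOutput(Object output) {
        // Build a Gson instance with pretty-printing enabled
        Gson gson = new GsonBuilder()
                .setPrettyPrinting()
                .disableHtmlEscaping()
                .create();

        // Print the formatted JSON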
        String prettyJson = gson.toJson(output);
        System.out.println(prettyJson);
    }
}

Go SDK

Prerequisites

Go 1.10.x or later is required.

Install the SDK

require (
  github.com/alibabacloud-go/iqs-20241111 v1.6.0
)

Invocation code

package main

import (
	"fmt"
	"log"

	openapi "github.com/alibabacloud-go/darabonba-openapi/v2/client"
	iqs20241111 "github.com/alibabacloud-go/iqs-20241111/client"
	util "github.com/alibabacloud-go/tea-utils/v2/service"
	"github.com/alibabacloud-go/tea/tea"
)

const endpointURL = "iqs.cn-zhangjiakou.aliyuncs.com"

func createClient() (*iqs20241111.Client, error) {
	// TODO: replace with your AK/SK
	accessKeyID := "YOUR_ACCESS_KEY"
	accessKeySecret := "YOUR_ACCESS_SECRET"

	if accessKeyID == "" || accessKeySecret == "" {
		return nil, fmt.Errorf("access key ID or access key secret is not set")
	}

	config := &openapi.Config{
		AccessKeyId:     tea.String(accessKeyID),
		AccessKeySecret: tea.String(accessKeySecret),
		Endpoint:        tea.String(endpointURL),
	}

	return iqs20241111.NewClient(config)
}

func runReadPage(client *iqs20241111.Client) error {
	body := &iqs20241111.ReadPageScrapeBody{
		Url: tea.String("http://www.example.com"),
	}
	request := &iqs20241111.ReadPageScrapeRequest{
		Body: body,
	}
	runtime := &util.RuntimeOptions{}

	resp, err := client.ReadPageScrapeWithOptions(request, nil, runtime)
	if err != nil {
		return fmt.Errorf("readpage failed: %w", err)
	}

	// tea.StringValue returns "" instead of panicking when RequestId is nil
	fmt.Printf("[%s] response: %s\n", tea.StringValue(resp.Body.RequestId), resp.Body)
	return nil
}

func main() {
	client, err := createClient()
	if err != nil {
		log.Fatalf("Failed to create client: %v", err)
	}

	if err := runReadPage(client); err != nil {
		log.Fatalf("Error running readpage: %v", err)
	}
}

HTTP call

Request (RequestBody)

curl --location "https://cloud-iqs.aliyuncs.com/readpage/scrape" \
--header "Content-Type: application/json" \
--header "X-API-Key: <YOUR-IQS-API-KEY>" \
--data '{
    "url": "https://www.example.com",
    "maxAge": 0
}'

Response (ResponseBody)

{
  "data": {
    "html": "<html>\n<head><title>Example Domain</title></head>\n<body>\n<div>\n<h1>Example Domain</h1>\n<p>This domain is for use in documentation examples without needing permission. Avoid use in operations.</p>\n<p><a href=\"https://iana.org/domains/example\">Learn more</a></p>\n</div>\n</body>\n</html>",
    "links": {
      "internal": "[]",
      "external": "[{\"href\":\"https://iana.org/domains/example\",\"text\":\"Learn more\",\"title\":\"\"}]"
    },
    "markdown": "# Example Domain\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n[Learn more](https://iana.org/domains/example)\n",
    "media": {
      "images": "[]",
      "audios": "[]",
      "videos": "[]"
    },
    "metadata": {
      "hostname": "www.example.com",
      "pdfParse": false,
      "title": "Example Domain",
      "url": "https://www.example.com"
    },
    "statusCode": 200,
    "text": "# Example Domain\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\nLearn more\n"
  },
  "requestId": "1d0ac13a-8c73-4134-a835-35d0126f733c"
}
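
The same HTTP call can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl command above: the endpoint and the X-API-Key header come from that command, while the function names and the 70-second client-side timeout (chosen here to sit slightly above the API's default 60000 ms timeout) are assumptions.

```python
import json
import urllib.request
from typing import Any, Dict

ENDPOINT = "https://cloud-iqs.aliyuncs.com/readpage/scrape"


def build_request(api_key: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def read_page_scrape(
    api_key: str, body: Dict[str, Any], http_timeout: float = 70.0
) -> Dict[str, Any]:
    """Send the request and decode the JSON response.

    http_timeout is the client-side socket timeout in seconds (an assumption).
    """
    request = build_request(api_key, body)
    with urllib.request.urlopen(request, timeout=http_timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a network call, so it is left commented out):
# result = read_page_scrape("<YOUR-IQS-API-KEY>", {"url": "https://www.example.com", "maxAge": 0})
# print(result["data"]["statusCode"])
```

Note that urllib.request.urlopen raises urllib.error.HTTPError for 4xx and 5xx responses, so the error codes listed earlier surface as exceptions rather than as a returned dict.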