This document describes the dynamic page parsing API: what it does, its request parameters, its response fields, and how to call it.
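API description
Functionality:
Loads the HTML of a web page in a sandboxed browser environment and parses the page content.
Parsing starts once the target page's resources have fully loaded; the maximum wait can be adjusted with the pageTimeout parameter. Total latency therefore depends heavily on how quickly the target site's resources load.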
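Billing:
Free for a limited time during the trial period (October 30 to November 30).
A call counts as one valid request when the target URL's HTTP response code (httpcode) is below 500.
If the Content-Type response header of the target URL is application/pdf, PDF parsing is triggered automatically (PDF files up to 5 MB are processed), and this counts as one additional valid request.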
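The counting rule above can be sketched as a small helper. This only illustrates the two stated rules and is not the service's metering logic; the function name and its parameters are hypothetical, and the treatment of a PDF response with a code of 500 or above is an assumption.

```python
def count_valid_requests(http_code: int, content_type: str = "") -> int:
    """Estimate how many valid requests a single call counts as.

    Assumption: mirrors the two billing rules stated in this document;
    the service's own accounting is authoritative.
    """
    # A target response code of 500 or above is not counted.
    # (Assumed to also apply when the response happens to be a PDF.)
    if http_code >= 500:
        return 0
    count = 1
    # Ignore parameters such as "; charset=..." when comparing the media type.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/pdf":
        # PDF parsing counts as one additional valid request.
        count += 1
    return count
```

For instance, an HTML page answered with 404 would count once, while a PDF answered with 200 would count twice under this reading.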
Trial limits:
During the trial period, the API is limited to 5 QPS.
The trial quota is 1,000 requests per 30 days.
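One way to stay under the 5 QPS limit is to space calls out on the client side. The following is a minimal sketch, not part of the SDK: the class name `MinIntervalThrottle` is hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class MinIntervalThrottle:
    """Spaces calls at least 1/qps seconds apart (client-side only)."""

    def __init__(self, qps: float = 5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / qps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay
```

Calling `throttle.wait()` before each request keeps a single process at or below the configured rate. It does not coordinate across processes, so a 429 response should still be handled.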
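API definition
Request parameters
Field | Allowed values | Default | Description
url: string | Required | | Target URL to parse. Must start with http:// or https://.
timeout: int | [0, 180000] | 60000 | Overall processing timeout for the API call, in ms. Maximum: 180000.
pageTimeout: int | [0, 60000] | 15000 | Timeout for waiting until the target site has fully loaded. Note:
formats: array | ["rawHtml", "html", "markdown", "text", "screenshot"] | ["html", "markdown", "text"] | Output formats of the parsed result.
maxAge: int | [0, ∞] | 1296000 | Maximum cache age, in seconds.
readability: map | readabilityMode: string | "none", "normal", "article" | none | normal: uses a proprietary algorithm to strip irrelevant content (page header, page footer, navigation, and so on). article: uses a proprietary algorithm to extract the site's main body text (suited to blogs and news sites; not suited to directory or navigation pages).
excludeAllImages: bool | false, true | false | Whether to remove all images.
excludeAllLinks: bool | false, true | false | Whether to remove all links.
excludedTags: array | [] | [] | Tags to exclude, for example: ["form", "header", "footer", "nav"].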
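The ranges and defaults in the table can be checked on the client before a request is sent. The sketch below is an assumption-laden illustration: `build_scrape_body` is a hypothetical helper, not an SDK function, and the readability options are left out for brevity.

```python
from typing import Any, Dict, List, Optional

ALLOWED_FORMATS = {"rawHtml", "html", "markdown", "text", "screenshot"}


def build_scrape_body(
    url: str,
    timeout: int = 60000,
    page_timeout: int = 15000,
    formats: Optional[List[str]] = None,
    max_age: int = 1296000,
) -> Dict[str, Any]:
    """Validate arguments against the documented ranges and build a body dict."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if not 0 <= timeout <= 180000:
        raise ValueError("timeout must be within [0, 180000]")
    if not 0 <= page_timeout <= 60000:
        raise ValueError("pageTimeout must be within [0, 60000]")
    if max_age < 0:
        raise ValueError("maxAge must be >= 0")
    # Fall back to the documented default formats when none are given.
    chosen = ["html", "markdown", "text"] if formats is None else list(formats)
    unknown = [f for f in chosen if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    return {
        "url": url,
        "timeout": timeout,
        "pageTimeout": page_timeout,
        "formats": chosen,
        "maxAge": max_age,
    }
```

Serializing the result with `json.dumps` would give a JSON body using the field names from the table, for example `build_scrape_body("https://www.example.com", max_age=0)`.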
Response parameters
Field | Nullable | Description | Example
requestId: string | No | Request ID. Provide it when troubleshooting. |
errorCode: string | Yes | Error code. |
errorMessage: string | Yes | Error message. |
data: map | statusCode: int | No | |
errorMessage: string | Yes | Error related to the target URL. |
rawHtml: string | Yes | Raw HTML of the target URL. |
html: string | Yes | Readable HTML of the target URL. |
text: string | Yes | Text content of the target URL. |
markdown: string | Yes | Markdown content of the target URL. |
screenshot: string | Yes | Screenshot of the target site. |
links: map | internal: map | Yes | Links found at the target URL (internal links). | { "href": "https://www.alibabagroup.com/cn/global/home", "text": "阿里巴巴集团", "title": "" }
external: map | Yes | Links found at the target URL (external links). |
media: map | images: map | Yes | Images found at the target URL. | { "type": "image", "src": "https://img.alicdn.com/tfs/TB1AOdINW6qK1RjSZFmXXX0PFXa-258-258.jpg", "data": "", "alt": "阿里云微信", "desc": null, "format": "jpg" }
audios: array | Yes | Audio found at the target URL. |
videos: array | Yes | Videos found at the target URL. | [ { "type": "video", "src": "blob:https://xxxxx.com/xxx", "data": "", "alt": null, "desc": "", "format": null } ]
metadata: map | url | No | Target URL. |
title | No | Site title. |
hostname | Yes | Site hostname. |
hostLogo | Yes | Site logo. |
pdfParse | Yes | Whether PDF parsing was performed. |
Error codes
HttpCode | Error code | Error message | Resolution
404 | InvalidAccessKeyId.NotFound | Specified access key is not found. | Check that the AccessKey ID and secret are correct.
403 | Retrieval.NotActivate | Please activate AI search service | Place an order, or contact your account manager to activate the service.
403 | Retrieval.NotAuthorised | Please authorize the AliyunIQSFullAccess privilege to the sub-account. | The sub-account has not been authorized. See "Create a RAM user and grant permissions".
429 | Retrieval.Throttling.User | Request was denied due to user flow control. | The rate limit was exceeded.
429 | Retrieval.TestUserQueryPerDayExceeded | The query per day exceed the limit. | The trial quota (1,000 requests per 30 days) was exceeded.
403 | ReadPage.SecurityRestrict | Security restrictions on the target site (e.g., robots.txt). |
400 | ReadPage.RequestTimeout | Request target url timeout. | The request timed out. Increase timeout as appropriate.
429 | ReadPage.RateLimitByDomain | The domain has reached the rate limit. |
500 | ReadPage.UnknownError | Unknown error. |
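A caller may want to turn these error codes into a next step. The mapping below is a hypothetical sketch: the action labels are invented for illustration, and where the table gives no resolution (for example ReadPage.RateLimitByDomain) the chosen action is an assumption rather than documented guidance.

```python
from typing import Optional

# Hypothetical action labels, grouped by the resolution column of the table.
_ACTIONS = {
    "InvalidAccessKeyId.NotFound": "fix_credentials",
    "Retrieval.NotActivate": "activate_service",
    "Retrieval.NotAuthorised": "grant_permission",
    "Retrieval.Throttling.User": "back_off",
    "Retrieval.TestUserQueryPerDayExceeded": "quota_exceeded",
    "ReadPage.RequestTimeout": "increase_timeout",
    # Assumption: the table lists no resolution for this code;
    # backing off is a guess based on the 429 status.
    "ReadPage.RateLimitByDomain": "back_off",
}


def suggest_action(error_code: Optional[str]) -> str:
    """Map an errorCode value to a suggested next step."""
    if not error_code:
        # errorCode is nullable; an empty value means nothing to handle.
        return "ok"
    # Codes without a listed resolution fall through to a support request,
    # for which the requestId is useful.
    return _ACTIONS.get(error_code, "report_with_request_id")
```

With this mapping, `suggest_action("ReadPage.RequestTimeout")` points at raising the timeout parameter, and unlisted codes fall back to reporting the requestId.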
Calling the API
Examples
Python SDK
Prerequisites
Python 3.8 or later must be installed.
Install the SDK
pip install alibabacloud_iqs20241111==1.6.0
Sample code
import json

from Tea.exceptions import TeaException
from alibabacloud_iqs20241111 import models
from alibabacloud_iqs20241111.client import Client
from alibabacloud_tea_openapi import models as open_api_models


class Sample:
    def __init__(self):
        pass

    @staticmethod
    def create_client() -> Client:
        config = open_api_models.Config(
            # TODO: replace with your own AK/SK (loading them from environment variables is recommended)
            access_key_id="$YOUR_ACCESS_KEY",
            access_key_secret="$YOUR_ACCESS_SECRET",
        )
        config.endpoint = "iqs.cn-zhangjiakou.aliyuncs.com"
        return Client(config)

    @staticmethod
    def main() -> None:
        client = Sample.create_client()
        run_instances_request = models.ReadPageScrapeRequest(
            body=models.ReadPageScrapeBody(
                url="http://www.example.com",
                max_age=0,  # 0 means cached results are not accepted
            )
        )
        try:
            response = client.read_page_scrape(run_instances_request)
            print(f"api success, request_id:{response.body.request_id}, result: ")
            print(json.dumps(response.body.data.to_map(), indent=2))
        except TeaException as e:
            # e.data may be empty for client-side failures, so guard the lookups
            data = e.data or {}
            request_id = data.get("requestId")
            code = data.get("errorCode")
            message = data.get("errorMessage")
            print(f"api exception, requestId:{request_id}, code:{code}, message:{message}")


if __name__ == "__main__":
    Sample.main()
Java SDK
Prerequisites
Java 8 or later is installed.
Maven dependency
<dependency>
    <groupId>com.aliyun</groupId>
    <artifactId>iqs20241111</artifactId>
    <version>1.6.0</version>
</dependency>
Sample code
package com.aliyun.iqs.readpage.example;

import com.aliyun.iqs20241111.Client;
import com.aliyun.iqs20241111.models.*;
import com.aliyun.teaopenapi.models.Config;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class Example {
    public static void main(String[] args) throws Exception {
        Client client = initClient();
        invoke(client, "http://www.example.com");
    }

    private static Client initClient() throws Exception {
        // TODO: replace with your own AK/SK (loading them from environment variables is recommended)
        String accessKeyId = "$YOUR_ACCESS_KEY";
        String accessKeySecret = "$YOUR_ACCESS_SECRET";
        Config config = new Config()
                .setAccessKeyId(accessKeyId)
                .setAccessKeySecret(accessKeySecret);
        config.setEndpoint("iqs.cn-zhangjiakou.aliyuncs.com");
        return new Client(config);
    }

    private static void invoke(Client client, String url) {
        ReadPageScrapeBody input = new ReadPageScrapeBody();
        input.setUrl(url);
        ReadPageScrapeRequest request = new ReadPageScrapeRequest().setBody(input);
        try {
            ReadPageScrapeResponse response = client.readPageScrape(request);
            printOutput(response.getBody());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void printOutput(ReadPageBasicResponseBody output) {
        // Build a Gson instance that pretty-prints and leaves HTML characters unescaped
        Gson gson = new GsonBuilder()
                .setPrettyPrinting()
                .disableHtmlEscaping()
                .create();
        // Print the formatted JSON
        String prettyJson = gson.toJson(output);
        System.out.println(prettyJson);
    }
}
Go SDK
Prerequisites
Go 1.10.x or later is required.
Install the SDK
require (
    github.com/alibabacloud-go/iqs-20241111 v1.6.0
)
Sample code
package main

import (
	"fmt"
	"log"

	openapi "github.com/alibabacloud-go/darabonba-openapi/v2/client"
	iqs20241111 "github.com/alibabacloud-go/iqs-20241111/client"
	util "github.com/alibabacloud-go/tea-utils/v2/service"
	"github.com/alibabacloud-go/tea/tea"
)

const endpointURL = "iqs.cn-zhangjiakou.aliyuncs.com"

func createClient() (*iqs20241111.Client, error) {
	// TODO: replace with your own AK/SK
	accessKeyID := "YOUR_ACCESS_KEY"
	accessKeySecret := "YOUR_ACCESS_SECRET"
	if accessKeyID == "" || accessKeySecret == "" {
		return nil, fmt.Errorf("access key ID or access key secret is empty")
	}
	config := &openapi.Config{
		AccessKeyId:     tea.String(accessKeyID),
		AccessKeySecret: tea.String(accessKeySecret),
		Endpoint:        tea.String(endpointURL),
	}
	return iqs20241111.NewClient(config)
}

func runReadPage(client *iqs20241111.Client) error {
	body := &iqs20241111.ReadPageScrapeBody{
		Url: tea.String("http://www.example.com"),
	}
	// Use a keyed field so the literal stays valid if the struct gains fields.
	request := &iqs20241111.ReadPageScrapeRequest{
		Body: body,
	}
	runtime := &util.RuntimeOptions{}
	resp, err := client.ReadPageScrapeWithOptions(request, nil, runtime)
	if err != nil {
		return fmt.Errorf("readpage failed: %w", err)
	}
	// tea.StringValue returns "" for a nil pointer instead of panicking.
	fmt.Printf("[%s] response: %s\n", tea.StringValue(resp.Body.RequestId), resp.Body)
	return nil
}

func main() {
	client, err := createClient()
	if err != nil {
		log.Fatalf("Failed to create client: %v", err)
	}
	if err := runReadPage(client); err != nil {
		log.Fatalf("Error running readpage: %v", err)
	}
}
HTTP call
Request (RequestBody)
curl --location "https://cloud-iqs.aliyuncs.com/readpage/scrape" \
--header "Content-Type: application/json" \
--header "X-API-Key: <YOUR-IQS-API-KEY>" \
--data '{
  "url": "https://www.example.com",
  "maxAge": 0
}'
Response
{
  "data": {
    "html": "<html>\n<head><title>Example Domain</title></head>\n<body>\n<div>\n<h1>Example Domain</h1>\n<p>This domain is for use in documentation examples without needing permission. Avoid use in operations.</p>\n<p><a href=\"https://iana.org/domains/example\">Learn more</a></p>\n</div>\n</body>\n</html>",
    "links": {
      "internal": "[]",
      "external": "[{\"href\":\"https://iana.org/domains/example\",\"text\":\"Learn more\",\"title\":\"\"}]"
    },
    "markdown": "# Example Domain\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n[Learn more](https://iana.org/domains/example)\n",
    "media": {
      "images": "[]",
      "audios": "[]",
      "videos": "[]"
    },
    "metadata": {
      "hostname": "www.example.com",
      "pdfParse": false,
      "title": "Example Domain",
      "url": "https://www.example.com"
    },
    "statusCode": 200,
    "text": "# Example Domain\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\nLearn more\n"
  },
  "requestId": "1d0ac13a-8c73-4134-a835-35d0126f733c"
}
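In the sample response, the values under links and media are JSON-encoded strings (for example "[]") rather than nested arrays. Assuming real responses follow the same shape, a caller might decode them as in this sketch; `decode_embedded_lists` is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Any, Dict


def decode_embedded_lists(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of `data` with string values under links/media parsed as JSON."""
    result = dict(data)
    for section in ("links", "media"):
        values = data.get(section)
        if not isinstance(values, dict):
            continue  # section absent or not a map: leave it untouched
        decoded = {}
        for key, value in values.items():
            # In the sample these values are JSON text; pass anything else through.
            decoded[key] = json.loads(value) if isinstance(value, str) else value
        result[section] = decoded
    return result
```

Applied to the data object of the sample, `links["external"]` would become a list with one entry whose href is https://iana.org/domains/example, while other fields such as statusCode are passed through unchanged.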