This document describes the Web Scrape Enhanced API, which uses a headless browser to render and access web pages. It covers the API's features, request and response parameters, and explains how to call the API.
API overview
Features:
Reads HTML and parses web content in a browser sandbox environment.
The API begins parsing after all resources on the target page have fully loaded. You can adjust the maximum wait time by using the pageTimeout parameter. The target website's resource loading speed significantly affects the overall API response time.
If the Content-Type header of the target URL's response is application/pdf, the system automatically triggers PDF parsing. The API parses the content visible in the browser viewport. To download the PDF content, see the Web Scrape Standard API.
API reference
Request parameters
Parameter | Value | Default | Description
url: string | Required | The target URL to parse. It must start with
pageTimeout: int | [0, 100000] | 10000 | The timeout in milliseconds for the target site to load. The API call completes within
formats: array | ["rawHtml", "html", "markdown", "text", "screenshot"] | ["html", "markdown", "text"] | The format of the parsed results.
maxAge: int | [0, ∞] | 1296000 | The maximum cache duration in seconds.
readability: map | readabilityMode: string | "none", "normal", "article" | none
excludeAllImages: bool | false, true | false | Specifies whether to exclude all images.
excludeAllLinks: bool | false, true | false | Specifies whether to exclude all links.
excludedTags: array | [] | [] | Specifies an array of HTML tags to exclude. Example: ["form", "header", "footer", "nav"]
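The ranges and defaults in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the SDK: the helper name build_scrape_body and its validation rules are assumptions derived only from the table (for reference, the default maxAge of 1296000 seconds equals 15 days).

```python
import json

# Values copied from the request parameter table.
ALLOWED_FORMATS = {"rawHtml", "html", "markdown", "text", "screenshot"}
DEFAULT_FORMATS = ["html", "markdown", "text"]


def build_scrape_body(url, page_timeout=10000, formats=None, max_age=1296000):
    """Hypothetical helper: validate documented ranges and return a JSON-ready body."""
    if not url:
        raise ValueError("url is required")
    if not 0 <= page_timeout <= 100000:
        raise ValueError("pageTimeout must be within [0, 100000] milliseconds")
    if max_age < 0:
        raise ValueError("maxAge must be 0 or greater (seconds)")
    formats = list(DEFAULT_FORMATS) if formats is None else list(formats)
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    return {
        "url": url,
        "pageTimeout": page_timeout,
        "formats": formats,
        "maxAge": max_age,
    }


if __name__ == "__main__":
    # maxAge=0 mirrors the samples in this document.
    print(json.dumps(build_scrape_body("https://www.example.com", max_age=0), indent=2))
```

The returned dictionary uses the camelCase parameter names from the table, so it can be serialized directly as the body of an HTTP call.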
Response parameters
Parameter | Nullable | Description | Example
requestId: string | No | The request ID. Provide this ID when contacting support.
errorCode: string | Yes | The error code.
errorMessage: string | Yes | The error message.
data: map | statusCode: int | No
errorMessage: string | Yes | An error message related to the request to the target URL.
rawHtml: string | Yes | The original HTML of the target URL.
html: string | Yes | The readable HTML content from the target URL, processed according to the
text: string | Yes | The text content of the target URL.
markdown: string | Yes | The Markdown content of the target URL.
screenshot: string | Yes | A screenshot of the target site.
links: map | internal: map | Yes | Link information from the target URL (internal links). | { "href": "https://www.alibabagroup.com/cn/global/home", "text": "Alibaba Group", "title": "" }
external: map | Yes | Link information from the target URL (external links).
media: map | images: map | Yes | Image information from the target URL. | { "type": "image", "src": "https://img.alicdn.com/tfs/TB1AOdINW6qK1RjSZFmXXX0PFXa-258-258.jpg", "data": "", "alt": "Alibaba Cloud WeChat", "desc": null, "format": "jpg" }
audios: array | Yes | Audio information from the target URL.
videos: array | Yes | Video information from the target URL. | [ { "type": "video", "src": "blob:https://xxxxx.com/xxx", "data": "", "alt": null, "desc": "", "format": null } ]
metadata: map | url | No | The target URL.
title | No | The site title.
hostname | Yes | The site hostname.
hostLogo | Yes | The site logo.
pdfParse | Yes | Indicates whether a PDF was parsed.
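A client typically checks errorCode first and then reads the nullable fields under data defensively. The sketch below shows one way to do that with a plain dictionary decoded from the JSON response. The helper names summarize_response and decode_list are hypothetical. The sample response in this document returns the links fields as JSON-encoded strings, so the sketch accepts a string, a list, or a single map.

```python
import json


def decode_list(value):
    """Normalize a field that may be null, a JSON-encoded string, a list, or one map."""
    if value is None or value == "":
        return []
    if isinstance(value, str):
        value = json.loads(value)
    if isinstance(value, dict):
        return [value]
    return list(value)


def summarize_response(resp):
    """Raise on an API-level error; otherwise collect a few commonly used fields."""
    if resp.get("errorCode"):
        raise RuntimeError(
            f"{resp['errorCode']}: {resp.get('errorMessage')} "
            f"(requestId={resp.get('requestId')})"
        )
    data = resp.get("data") or {}
    links = data.get("links") or {}
    metadata = data.get("metadata") or {}
    return {
        "requestId": resp.get("requestId"),
        "statusCode": data.get("statusCode"),
        "title": metadata.get("title"),
        "isPdf": bool(metadata.get("pdfParse")),
        "markdown": data.get("markdown"),
        "externalLinks": decode_list(links.get("external")),
    }
```

Keeping requestId in the raised error message makes it easy to pass along when contacting support.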
Error codes
HttpCode | Error code | Error message | Solution |
404 | InvalidAccessKeyId.NotFound | Specified access key is not found. | Verify that your AccessKey and Secret are correct. |
403 | Retrieval.NotActivate | Please activate AI search service | Place an order or contact your account manager to activate the service. |
403 | Retrieval.NotAuthorised | Please authorize the AliyunIQSFullAccess privilege to the sub-account. | The sub-account is not authorized. See Create and authorize a RAM user. |
429 | Retrieval.Throttling.User | Request was denied due to user flow control. | You have exceeded your rate limit. |
429 | Retrieval.TestUserQueryPerDayExceeded | The query per day exceed the limit. | Test quota exceeded (1,000 requests per 30 days). |
403 | ReadPage.SecurityRestrict | Security restrictions on the target site (e.g., robots.txt). | |
400 | ReadPage.RequestTimeout | Request target url timeout. | The request timed out. Try increasing the |
429 | ReadPage.RateLimitByDomain | The domain has reached the rate limit. | |
500 | ReadPage.UnknownError | Unknown error. |
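The error codes above fall into a few practical groups on the client side. The sketch below is one possible grouping, not guidance from the service: which codes are worth retrying, the action labels, and the exponential backoff values are all assumptions made for illustration.

```python
# Assumption: throttling errors may succeed after a pause.
RETRY_WITH_BACKOFF = {
    "Retrieval.Throttling.User",
    "ReadPage.RateLimitByDomain",
}


def next_action(error_code):
    """Map an error code from the table to a coarse client-side action (illustrative)."""
    if error_code in RETRY_WITH_BACKOFF:
        return "retry"
    if error_code == "ReadPage.RequestTimeout":
        # The target page loaded too slowly; a longer timeout may help.
        return "retry_with_longer_timeout"
    # Credential, activation, authorization, quota, and security errors need
    # a configuration change, so retrying the same request is unlikely to help.
    return "fail"


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff in seconds for a zero-based attempt number, capped at cap."""
    return min(cap, base * (2 ** attempt))
```

For example, a caller could sleep for backoff_delay(attempt) seconds whenever next_action returns "retry", and stop after a fixed number of attempts.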
Usage notes
1. Stealth mode
Enabled by default, this mode optimizes the browser fingerprint to closely resemble a real user's browser, improving the compatibility and stability of automated access.

2. Limitations
This API complies with the target site's robots.txt protocol, terms of use, and applicable laws and regulations.
This API provides only technical support and page parsing capabilities. We are not responsible for the generation, publication, display, or use of content from third-party websites. We make no commitments or warranties of any kind for third-party content obtained through this service.
API calls
Examples
Python SDK
Prerequisites
Python 3.8 or later is required.
Install the SDK
pip install alibabacloud_iqs20241111==1.6.0
Sample code
import json

from Tea.exceptions import TeaException
from alibabacloud_iqs20241111 import models
from alibabacloud_iqs20241111.client import Client
from alibabacloud_tea_openapi import models as open_api_models


class Sample:
    def __init__(self):
        pass

    @staticmethod
    def create_client() -> Client:
        config = open_api_models.Config(
            # TODO: Replace with your AccessKey/Secret. We recommend loading them from environment variables.
            access_key_id="$YOUR_ACCESS_KEY",
            access_key_secret="$YOUR_ACCESS_SECRET"
        )
        config.endpoint = "iqs.cn-zhangjiakou.aliyuncs.com"
        return Client(config)

    @staticmethod
    def main() -> None:
        client = Sample.create_client()
        # Build the scrape request; max_age=0 is the value used in this sample.
        request = models.ReadPageScrapeRequest(
            body=models.ReadPageScrapeBody(
                url="http://www.example.com",
                max_age=0,
            )
        )
        try:
            response = client.read_page_scrape(request)
            print(f"API call succeeded. Request ID: {response.body.request_id}, Result: ")
            print(json.dumps(response.body.data.to_map(), indent=2))
        except TeaException as e:
            # Guard against a missing error payload before reading its fields.
            data = e.data or {}
            request_id = data.get("requestId")
            code = data.get("errorCode")
            message = data.get("errorMessage")
            print(f"API call failed. Request ID: {request_id}, Code: {code}, Message: {message}")


if __name__ == "__main__":
    Sample.main()
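The TODO comment in the sample recommends loading the AccessKey pair from environment variables. A minimal sketch of that step follows. The variable names ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET and the helper load_credentials are assumptions chosen for illustration; use whatever names your deployment defines.

```python
import os

# Assumed environment variable names; adjust them to your own setup.
KEY_ID_VAR = "ALIBABA_CLOUD_ACCESS_KEY_ID"
KEY_SECRET_VAR = "ALIBABA_CLOUD_ACCESS_KEY_SECRET"


def load_credentials(env=None):
    """Return (access_key_id, access_key_secret) or raise if either is missing."""
    env = os.environ if env is None else env
    key_id = env.get(KEY_ID_VAR, "")
    key_secret = env.get(KEY_SECRET_VAR, "")
    if not key_id or not key_secret:
        raise RuntimeError(f"Set {KEY_ID_VAR} and {KEY_SECRET_VAR} before running")
    return key_id, key_secret
```

The returned pair can then be passed as access_key_id and access_key_secret to open_api_models.Config in create_client, in place of the hard-coded placeholders.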
Java SDK
Prerequisites
Java 8 or later is required.
Maven dependency
<dependency>
    <groupId>com.aliyun</groupId>
    <artifactId>iqs20241111</artifactId>
    <version>1.6.0</version>
</dependency>
Sample code
package com.aliyun.iqs.readpage.example;

import com.aliyun.iqs20241111.Client;
import com.aliyun.iqs20241111.models.*;
import com.aliyun.teaopenapi.models.Config;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class Example {
    public static void main(String[] args) throws Exception {
        Client client = initClient();
        invoke(client, "http://www.example.com");
    }

    private static Client initClient() throws Exception {
        // TODO: Replace with your AccessKey/Secret. We recommend loading them from environment variables.
        String accessKeyId = "$YOUR_ACCESS_KEY";
        String accessKeySecret = "$YOUR_ACCESS_SECRET";
        Config config = new Config()
                .setAccessKeyId(accessKeyId)
                .setAccessKeySecret(accessKeySecret);
        config.setEndpoint("iqs.cn-zhangjiakou.aliyuncs.com");
        return new Client(config);
    }

    private static void invoke(Client client, String url) {
        ReadPageScrapeBody input = new ReadPageScrapeBody();
        input.setUrl(url);
        ReadPageScrapeRequest request = new ReadPageScrapeRequest().setBody(input);
        try {
            ReadPageScrapeResponse response = client.readPageScrape(request);
            printOutput(response.getBody());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void printOutput(ReadPageBasicResponseBody output) {
        // Use GsonBuilder to create a formatted Gson instance.
        Gson gson = new GsonBuilder()
                .setPrettyPrinting()
                .disableHtmlEscaping()
                .create();
        // Print the formatted JSON.
        String prettyJson = gson.toJson(output);
        System.out.println(prettyJson);
    }
}
Go SDK
Prerequisites
Go version 1.10.x or later is required.
Install the SDK
require (
	github.com/alibabacloud-go/iqs-20241111 v1.6.0
)
Sample code
package main

import (
	"fmt"
	"log"

	openapi "github.com/alibabacloud-go/darabonba-openapi/v2/client"
	iqs20241111 "github.com/alibabacloud-go/iqs-20241111/client"
	util "github.com/alibabacloud-go/tea-utils/v2/service"
	"github.com/alibabacloud-go/tea/tea"
)

const endpointURL = "iqs.cn-zhangjiakou.aliyuncs.com"

func createClient() (*iqs20241111.Client, error) {
	// TODO: Replace with your AccessKey/Secret.
	accessKeyID := "YOUR_ACCESS_KEY"
	accessKeySecret := "YOUR_ACCESS_SECRET"
	if accessKeyID == "" || accessKeySecret == "" {
		return nil, fmt.Errorf("ACCESS_KEY or ACCESS_SECRET environment variable is not set")
	}
	config := &openapi.Config{
		AccessKeyId:     tea.String(accessKeyID),
		AccessKeySecret: tea.String(accessKeySecret),
		Endpoint:        tea.String(endpointURL),
	}
	return iqs20241111.NewClient(config)
}

func runReadPage(client *iqs20241111.Client) error {
	body := &iqs20241111.ReadPageScrapeBody{
		Url: tea.String("http://www.example.com"),
	}
	// Use a keyed field so the literal stays valid if the struct gains fields.
	request := &iqs20241111.ReadPageScrapeRequest{
		Body: body,
	}
	runtime := &util.RuntimeOptions{}
	resp, err := client.ReadPageScrapeWithOptions(request, nil, runtime)
	if err != nil {
		return fmt.Errorf("readpage failed: %w", err)
	}
	fmt.Printf("[%s] response: %s\n", *resp.Body.RequestId, resp.Body)
	return nil
}

func main() {
	client, err := createClient()
	if err != nil {
		log.Fatalf("Failed to create client: %v", err)
	}
	if err := runReadPage(client); err != nil {
		log.Fatalf("Error running readpage: %v", err)
	}
}
HTTP call
Request body
curl --location "https://cloud-iqs.aliyuncs.com/readpage/scrape" \
--header "Content-Type: application/json" \
--header "X-API-Key: <YOUR-IQS-API-KEY>" \
--data '{
"url": "https://www.example.com",
"maxAge": 0
}'
Response
{
  "data": {
    "html": "<html>\n<head><title>Example Domain</title></head>\n<body>\n<div>\n<h1>Example Domain</h1>\n<p>This domain is for use in illustrative examples in documents and should not be used in production.</p>\n<p><a href=\"https://iana.org/domains/example\">Learn more</a></p>\n</div>\n</body>\n</html>",
    "links": {
      "internal": "[]",
      "external": "[{\"href\":\"https://iana.org/domains/example\",\"text\":\"Learn more\",\"title\":\"\"}]"
    },
    "markdown": "# Example Domain\nThis domain is for use in illustrative examples in documents and should not be used in production.\n[Learn more](https://iana.org/domains/example)\n",
    "media": {
      "images": "[]",
      "audios": "[]",
      "videos": "[]"
    },
    "metadata": {
      "hostname": "www.example.com",
      "pdfParse": false,
      "title": "Example Domain",
      "url": "https://www.example.com"
    },
    "statusCode": 200,
    "text": "# Example Domain\nThis domain is for use in illustrative examples in documents and should not be used in production.\nLearn more\n"
  },
  "requestId": "1d0ac13a-8c73-4134-a835-35d0126f733c"
}
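The same HTTP call can be made from Python without the SDK. The sketch below mirrors the curl command using only the standard library. The function names build_request and scrape and the 60-second client-side timeout are assumptions for illustration; the endpoint, headers, and body fields are taken from the curl example.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this document.
ENDPOINT = "https://cloud-iqs.aliyuncs.com/readpage/scrape"


def build_request(api_key, url, max_age=0):
    """Build the POST request that corresponds to the curl command."""
    payload = json.dumps({"url": url, "maxAge": max_age}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def scrape(api_key, url, max_age=0, timeout=60):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_request(api_key, url, max_age)
    # Non-2xx statuses raise urllib.error.HTTPError; the caller decides how to handle it.
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling scrape("<YOUR-IQS-API-KEY>", "https://www.example.com") would return a dictionary shaped like the sample response, provided the key is valid.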