ReadPageScrape-Enhanced


This document describes the Web Scrape Enhanced API, which uses a headless browser to render and access web pages. It covers the API's features, request and response parameters, and explains how to call the API.

API overview

Features:

  1. Reads HTML and parses web content in a browser sandbox environment.

  2. The API begins parsing after all resources on the target page have fully loaded. You can adjust the maximum wait time by using the pageTimeout parameter. The target website's resource loading speed significantly affects the overall API response time.

  3. If the Content-Type header of the target URL's response is application/pdf, the system automatically triggers PDF parsing. The API parses the content visible in the browser viewport. To download the PDF content, see the Web Scrape Standard API.

API reference

Request parameters

Parameter

Value

Default

Description

url: string

Required

The target URL to parse. It must start with http:// or https://.

pageTimeout: int

[0, 100000]

10000

The timeout in milliseconds for the target site to load. The API call completes within pageTimeout + 6000 ms.

  • Most pages load within 5000 ms.

  • For pages with many resources, you can set this to 10000 ms.

formats: array

["rawHtml", "html", "markdown", "text", "screenshot"]

["html", "markdown", "text"]

The format of the parsed results.

  • rawHtml: The original HTML of the target site.

  • html: The page content processed according to readabilityMode.

  • markdown: The Markdown content converted from the HTML.

  • text: The text content from the HTML.

  • screenshot: A screenshot of the page. If you select this format, the API response time increases by 2 to 5 seconds, in addition to the pageTimeout value.

maxAge: int

[0, ∞]

1296000

The maximum cache duration in seconds. The default value of 1296000 corresponds to 15 days.

  • If the cache age is less than maxAge, the API returns cached content.

  • Setting maxAge to 0 disables caching.

readability: map

readabilityMode: string

"none", "normal", "article"

none

normal: Removes irrelevant information, such as headers, footers, and navigation, to return only the core content.

article: Extracts the main article content from a site. This mode is best for blogs and news sites and may not work well on directory or navigation pages.

excludeAllImages: bool

false, true

false

Specifies whether to exclude all images.

excludeAllLinks: bool

false, true

false

Specifies whether to exclude all links.

excludedTags: array

[]

[]

Specifies an array of HTML tags to exclude. Example:

["form", "header", "footer", "nav"]
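
The constraints in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the SDK: the helper names `build_scrape_body` and `client_timeout_seconds` and the `margin_ms` parameter are hypothetical, and nesting `readabilityMode` under `readability` is an assumption based on the table layout. The timeout helper applies the documented bound of pageTimeout + 6000 ms.

```python
from typing import Any, Dict, List, Optional

# Values taken from the request parameter table.
ALLOWED_FORMATS = {"rawHtml", "html", "markdown", "text", "screenshot"}
READABILITY_MODES = {"none", "normal", "article"}


def build_scrape_body(
    url: str,
    page_timeout: int = 10000,
    formats: Optional[List[str]] = None,
    max_age: int = 1296000,
    readability_mode: str = "none",
) -> Dict[str, Any]:
    """Validate arguments against the documented ranges and return a JSON-ready dict."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if not 0 <= page_timeout <= 100000:
        raise ValueError("pageTimeout must be in [0, 100000] ms")
    if max_age < 0:
        raise ValueError("maxAge must be >= 0")
    chosen = list(formats) if formats is not None else ["html", "markdown", "text"]
    unknown = [f for f in chosen if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    if readability_mode not in READABILITY_MODES:
        raise ValueError(f"unsupported readabilityMode: {readability_mode}")
    return {
        "url": url,
        "pageTimeout": page_timeout,
        "formats": chosen,
        "maxAge": max_age,
        # Assumption: readabilityMode is nested under the readability map.
        "readability": {"readabilityMode": readability_mode},
    }


def client_timeout_seconds(page_timeout_ms: int, margin_ms: int = 1000) -> float:
    """Documented upper bound (pageTimeout + 6000 ms) plus a hypothetical safety margin."""
    return (page_timeout_ms + 6000 + margin_ms) / 1000.0
```

If you request the screenshot format, consider a larger margin, because the documentation states that screenshots add 2 to 5 seconds to the response time.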

Response parameters

Parameter

Nullable

Description

Example

requestId: string

No

The request ID. Provide this ID when contacting support.

errorCode: string

Yes

The error code.

See Error codes.

errorMessage: string

Yes

The error message.

data: map

statusCode: int

No

  • If the request to the target site succeeds, this field contains the HTTP status code from the target URL's response.

  • If the request to the target site fails, the API returns an IQS custom error code:

    • 4030: Target site security restrictions, such as robots.txt or other security policies.

    • 4080: Request timeout.

    • 4290: The target site's rate limit was reached.

    • 5010: Unknown error.

errorMessage: string

Yes

An error message related to the request to the target URL.

rawHtml: string

Yes

The original HTML of the target URL.

html: string

Yes

The readable HTML content from the target URL, processed according to the readability settings.

text: string

Yes

The text content of the target URL.

markdown: string

Yes

The Markdown content of the target URL.

screenshot: string

Yes

A screenshot of the target site.

links: map

internal: map

Yes

Link information from the target URL (internal links).

  • href: The link address.

  • text: The link's anchor text.

  • title: The link's title attribute, often used for tooltips.

{

"href": "https://www.alibabagroup.com/cn/global/home",

"text": "Alibaba Group",

"title": ""

}

external: map

Yes

Link information from the target URL (external links).

media: map

images: map

Yes

Image information from the target URL.

  • type: image

  • src: The image address.

  • data: Embedded or additional data.

  • alt: The alternative text.

  • desc: A supplementary description.

  • format: The image format.

{

"type": "image",

"src": "https://img.alicdn.com/tfs/TB1AOdINW6qK1RjSZFmXXX0PFXa-258-258.jpg",

"data": "",

"alt": "Alibaba Cloud WeChat",

"desc": null,

"format": "jpg"

}

audios: array

Yes

Audio information from the target URL.

  • type: audio

  • src: The audio address.

  • data: Embedded or additional data.

  • alt: The alternative text.

  • desc: A supplementary description.

  • format: The format.

videos: array

Yes

Video information from the target URL.

  • type: video

  • src: The video address.

  • data: Embedded or additional data.

  • alt: The alternative text.

  • desc: A supplementary description.

  • format: The format.

[

{

"type": "video",

"src": "blob:https://xxxxx.com/xxx",

"data": "",

"alt": null,

"desc": "",

"format": null

}

]

metadata: map

url

No

The target URL.

title

No

The site title.

hostname

Yes

The site hostname.

hostLogo

Yes

The site logo.

pdfParse

Yes

Indicates whether a PDF was parsed.
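
Because data.statusCode can carry either the target site's HTTP status or one of the four IQS custom codes, callers may want to tell the two apart, and because the content fields are nullable, they may want a fallback order. The sketch below illustrates one way to do this on the decoded data map. The function names and the preferred fallback order are assumptions made for illustration.

```python
from typing import Any, Mapping, Optional, Sequence, Tuple

# IQS custom codes listed in the statusCode description.
IQS_CUSTOM_CODES = {
    4030: "blocked by target site security restrictions (for example robots.txt)",
    4080: "request to the target site timed out",
    4290: "target site rate limit reached",
    5010: "unknown error",
}


def describe_status(status_code: int) -> str:
    """Label a data.statusCode value as an IQS custom code or a plain HTTP status."""
    if status_code in IQS_CUSTOM_CODES:
        return "IQS: " + IQS_CUSTOM_CODES[status_code]
    return f"HTTP {status_code} from target site"


def extract_content(
    data: Mapping[str, Any],
    preferred: Sequence[str] = ("markdown", "text", "html", "rawHtml"),
) -> Optional[Tuple[str, str]]:
    """Return (format, content) for the first non-empty field, or None if all are empty."""
    for name in preferred:
        value = data.get(name)
        if value:
            return name, value
    return None
```

For example, `describe_status(4080)` identifies a timeout reported by IQS, while `describe_status(404)` reports a status returned by the target site itself.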

Error codes

| HttpCode | Error code | Error message | Solution |
| --- | --- | --- | --- |
| 404 | InvalidAccessKeyId.NotFound | Specified access key is not found. | Verify that your AccessKey and Secret are correct. |
| 403 | Retrieval.NotActivate | Please activate AI search service | Place an order or contact your account manager to activate the service. |
| 403 | Retrieval.NotAuthorised | Please authorize the AliyunIQSFullAccess privilege to the sub-account. | The sub-account is not authorized. See Create and authorize a RAM user. |
| 429 | Retrieval.Throttling.User | Request was denied due to user flow control. | You have exceeded your rate limit. |
| 429 | Retrieval.TestUserQueryPerDayExceeded | The query per day exceed the limit. | Test quota exceeded (1,000 requests per 30 days). |
| 403 | ReadPage.SecurityRestrict | Security restrictions on the target site (e.g., robots.txt). | |
| 400 | ReadPage.RequestTimeout | Request target url timeout. | The request timed out. Try increasing the pageTimeout value. |
| 429 | ReadPage.RateLimitByDomain | The domain has reached the rate limit. | |
| 500 | ReadPage.UnknownError | Unknown error. | |
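
The error codes above call for different reactions on the client side. The following sketch maps each documented code to a suggested action and generates a simple exponential backoff schedule for the rate-limit cases. The action labels, the choice of which codes to retry, and the backoff parameters are all assumptions for illustration. The documentation does not prescribe a retry policy.

```python
from typing import List

# Hypothetical action labels keyed by the documented error codes.
ACTIONS = {
    "InvalidAccessKeyId.NotFound": "check_credentials",
    "Retrieval.NotActivate": "activate_service",
    "Retrieval.NotAuthorised": "authorize_ram_user",
    "Retrieval.Throttling.User": "retry_with_backoff",
    "Retrieval.TestUserQueryPerDayExceeded": "quota_exceeded",
    "ReadPage.SecurityRestrict": "do_not_retry",
    "ReadPage.RequestTimeout": "increase_page_timeout",
    "ReadPage.RateLimitByDomain": "retry_with_backoff",
    "ReadPage.UnknownError": "contact_support_with_request_id",
}


def suggest_action(error_code: str) -> str:
    """Return a suggested reaction for an error code, or 'unknown_code' if it is not listed."""
    return ACTIONS.get(error_code, "unknown_code")


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```

A caller could sleep for each value in `backoff_delays(4)` between attempts when `suggest_action` returns `retry_with_backoff`, and stop immediately for any other action.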

Usage notes

1. Stealth mode

Enabled by default, this mode optimizes the browser fingerprint to closely resemble a real user's browser, improving the compatibility and stability of automated access.


2. Limitations

  1. This API complies with the target site's robots.txt protocol, terms of use, and applicable laws and regulations.

  2. This API provides only technical support and page parsing capabilities. We are not responsible for the generation, publication, display, or use of content from third-party websites. We make no commitments or warranties of any kind for third-party content obtained through this service.

API calls

Examples

Python SDK

Prerequisites

Python 3.8 or later is required.

Install the SDK
pip install alibabacloud_iqs20241111==1.6.0
Sample code
import json

from Tea.exceptions import TeaException
from alibabacloud_iqs20241111 import models
from alibabacloud_iqs20241111.client import Client
from alibabacloud_tea_openapi import models as open_api_models


class Sample:
    def __init__(self):
        pass

    @staticmethod
    def create_client() -> Client:
        config = open_api_models.Config(
            # TODO: Replace with your AccessKey/Secret. We recommend loading them from environment variables.
            access_key_id="$YOUR_ACCESS_KEY",
            access_key_secret="$YOUR_ACCESS_SECRET"
        )
        config.endpoint = "iqs.cn-zhangjiakou.aliyuncs.com"
        return Client(config)

    @staticmethod
    def main() -> None:
        client = Sample.create_client()
        # Build the scrape request; max_age=0 disables caching.
        request = models.ReadPageScrapeRequest(
            body=models.ReadPageScrapeBody(
                url="http://www.example.com",
                max_age=0,
            )
        )
        try:
            response = client.read_page_scrape(request)
            print(f"API call succeeded. Request ID: {response.body.request_id}, Result: ")
            print(f"{json.dumps(response.body.data.to_map(), indent=2)}")

        except TeaException as e:
            request_id = e.data.get("requestId")
            code = e.data.get("errorCode")
            message = e.data.get("errorMessage")
            print(f"API call failed. Request ID: {request_id}, Code: {code}, Message: {message}")


if __name__ == "__main__":
    Sample.main()
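
The sample above recommends loading the AccessKey pair from environment variables instead of hardcoding it. A minimal sketch of that approach follows. The helper name `load_credentials` and the variable names `ALIBABA_CLOUD_ACCESS_KEY_ID` and `ALIBABA_CLOUD_ACCESS_KEY_SECRET` are assumptions, so substitute whatever names your deployment uses.

```python
import os
from typing import Mapping, Tuple

# Assumed environment variable names; adjust to your own convention.
KEY_ID_VAR = "ALIBABA_CLOUD_ACCESS_KEY_ID"
KEY_SECRET_VAR = "ALIBABA_CLOUD_ACCESS_KEY_SECRET"


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the AccessKey ID and secret from env, failing fast if either is missing."""
    key_id = env.get(KEY_ID_VAR, "")
    secret = env.get(KEY_SECRET_VAR, "")
    if not key_id or not secret:
        raise RuntimeError(f"{KEY_ID_VAR} and {KEY_SECRET_VAR} must both be set")
    return key_id, secret
```

The returned pair can then be passed as `access_key_id` and `access_key_secret` to `open_api_models.Config` in `create_client`.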
    

Java SDK

Prerequisites

Java 8 or later is required.

Maven dependency
<dependency>
    <groupId>com.aliyun</groupId>
    <artifactId>iqs20241111</artifactId>
    <version>1.6.0</version>
</dependency>
Sample code
package com.aliyun.iqs.readpage.example;

import com.aliyun.iqs20241111.Client;
import com.aliyun.iqs20241111.models.*;
import com.aliyun.teaopenapi.models.Config;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class Example {
    public static void main(String[] args) throws Exception {
        Client client = initClient();
        invoke(client, "http://www.example.com");
    }

    private static Client initClient() throws Exception {
        // TODO: Replace with your AccessKey/Secret. We recommend loading them from environment variables.
        String accessKeyId = "$YOUR_ACCESS_KEY";
        String accessKeySecret = "$YOUR_ACCESS_SECRET";

        Config config = new Config()
                .setAccessKeyId(accessKeyId)
                .setAccessKeySecret(accessKeySecret);

        config.setEndpoint("iqs.cn-zhangjiakou.aliyuncs.com");
        return new Client(config);
    }

    private static void invoke(Client client, String url) {
        ReadPageScrapeBody input = new ReadPageScrapeBody();
        input.setUrl(url);

        ReadPageScrapeRequest request = new ReadPageScrapeRequest().setBody(input);

        try {
            ReadPageScrapeResponse response = client.readPageScrape(request);

            printOutput(response.getBody());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void printOutput(ReadPageScrapeResponseBody output) {
        // Use GsonBuilder to create a formatted Gson instance.
        Gson gson = new GsonBuilder()
                .setPrettyPrinting()
                .disableHtmlEscaping()
                .create();

        // Print the formatted JSON.
        String prettyJson = gson.toJson(output);
        System.out.println(prettyJson);
    }
}

Go SDK

Prerequisites

Go version 1.10.x or later is required.

Install the SDK
require (
  github.com/alibabacloud-go/iqs-20241111 v1.6.0
)
Sample code
package main

import (
	"fmt"
	"log"

	openapi "github.com/alibabacloud-go/darabonba-openapi/v2/client"
	iqs20241111 "github.com/alibabacloud-go/iqs-20241111/client"
	util "github.com/alibabacloud-go/tea-utils/v2/service"
	"github.com/alibabacloud-go/tea/tea"
)

const endpointURL = "iqs.cn-zhangjiakou.aliyuncs.com"

func createClient() (*iqs20241111.Client, error) {
	// TODO: Replace with your AccessKey/Secret.
	accessKeyID := "YOUR_ACCESS_KEY"
	accessKeySecret := "YOUR_ACCESS_SECRET"

	if accessKeyID == "" || accessKeySecret == "" {
		return nil, fmt.Errorf("access key ID or access key secret is empty")
	}

	config := &openapi.Config{
		AccessKeyId:     tea.String(accessKeyID),
		AccessKeySecret: tea.String(accessKeySecret),
		Endpoint:        tea.String(endpointURL),
	}

	return iqs20241111.NewClient(config)
}

func runReadPage(client *iqs20241111.Client) error {
	body := &iqs20241111.ReadPageScrapeBody{
		Url: tea.String("http://www.example.com"),
	}
	request := &iqs20241111.ReadPageScrapeRequest{
		Body: body,
	}
	runtime := &util.RuntimeOptions{}

	resp, err := client.ReadPageScrapeWithOptions(request, nil, runtime)
	if err != nil {
		return fmt.Errorf("readpage failed: %w", err)
	}

	fmt.Printf("[%s] response: %s\n", *resp.Body.RequestId, resp.Body)
	return nil
}

func main() {
	client, err := createClient()
	if err != nil {
		log.Fatalf("Failed to create client: %v", err)
	}

	if err := runReadPage(client); err != nil {
		log.Fatalf("Error running readpage: %v", err)
	}
}

HTTP call

  • Request

curl --location "https://cloud-iqs.aliyuncs.com/readpage/scrape" \
--header "Content-Type: application/json" \
--header "X-API-Key: <YOUR-IQS-API-KEY>" \
--data '{
    "url": "https://www.example.com",
    "maxAge": 0
}'
  • Response

{
  "data": {
    "html": "<html>\n<head><title>Example Domain</title></head>\n<body>\n<div>\n<h1>Example Domain</h1>\n<p>This domain is for use in illustrative examples in documents and should not be used in production.</p>\n<p><a href=\"https://iana.org/domains/example\">Learn more</a></p>\n</div>\n</body>\n</html>",
    "links": {
      "internal": "[]",
      "external": "[{\"href\":\"https://iana.org/domains/example\",\"text\":\"Learn more\",\"title\":\"\"}]"
    },
    "markdown": "# Example Domain\nThis domain is for use in illustrative examples in documents and should not be used in production.\n[Learn more](https://iana.org/domains/example)\n",
    "media": {
      "images": "[]",
      "audios": "[]",
      "videos": "[]"
    },
    "metadata": {
      "hostname": "www.example.com",
      "pdfParse": false,
      "title": "Example Domain",
      "url": "https://www.example.com"
    },
    "statusCode": 200,
    "text": "# Example Domain\nThis domain is for use in illustrative examples in documents and should not be used in production.\nLearn more\n"
  },
  "requestId": "1d0ac13a-8c73-4134-a835-35d0126f733c"
}
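
The curl call above can be reproduced with the Python standard library. In the sample response, the values under links and media are JSON-encoded strings rather than arrays, so they need a second decoding step. The sketch below builds the request object without sending it and decodes those nested strings. The helper names are hypothetical, and the assumption that these fields always arrive as strings comes only from the sample shown here.

```python
import json
import urllib.request
from typing import Any, Dict, Mapping

ENDPOINT = "https://cloud-iqs.aliyuncs.com/readpage/scrape"


def build_request(api_key: str, body: Mapping[str, Any]) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example (it is not sent here)."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(dict(body)).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def decode_embedded_lists(section: Mapping[str, Any]) -> Dict[str, Any]:
    """Decode values that are JSON-encoded strings; leave other values unchanged."""
    decoded: Dict[str, Any] = {}
    for key, value in section.items():
        decoded[key] = json.loads(value) if isinstance(value, str) else value
    return decoded
```

To send the request, pass the result of `build_request` to `urllib.request.urlopen` with a timeout that exceeds pageTimeout + 6000 ms, then apply `decode_embedded_lists` to `data["links"]` and `data["media"]` in the parsed response.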