How to use AI models to process data in a Retrieval-Augmented Application (version 8.17)-Elasticsearch(ES)-阿里云帮助中心

You can call AI models through the Inference API in a Retrieval-Augmented Application (version 8.17) to enable advanced features, such as structured text extraction, document splitting, and text embedding. This deep integration with AI models improves search accuracy and response efficiency. It helps you build a more precise retrieval system with enhanced data analysis and semantic understanding for complex business scenarios.

Background

AI Search Open Platform is a sub-product of OpenSearch. It provides algorithmic services from the AI search pipeline as modular components, focusing on intelligent search and Retrieval-Augmented Generation (RAG) scenarios. The platform includes built-in services for document analysis, document splitting, text embedding, query analysis, recall, reranking, performance evaluation, and large language model (LLM) services. This allows developers to flexibly select components to build their search applications.
In a Retrieval-Augmented Application (version 8.17), you can call model services through the Inference API to perform operations such as data processing, query rewriting, and text generation.

Billing

AI model services are billed by AI Search Open Platform based on the number of model calls. You are not charged if the service is not used.

Note

After charges are incurred, you can log in to the Costs and Billing system to view consumption details.

Prerequisites

You have created a Retrieval-Augmented Application (version 8.17).

Step 1: Activate and initialize AI models

Before you use AI models, you must activate the AI Search Open Platform service and complete the initialization process. After activation, the system automatically generates the information required to call the models and creates the corresponding models in your application.

Go to the application details page.
1. Log in to the Elasticsearch Serverless console and switch to the target region in the top menu bar.
2. In the left-side navigation pane, click Application Management. Then, click the name of the desired application to go to its details page.
In the left-side navigation pane, click AI Service Center > Model Management to go to the Model Management page.
Activate and initialize the AI model service.
If your account has not yet activated AI Search Open Platform, click Activate Service and Initialize Models on the Model Management page and follow the on-screen instructions to activate the AI model service. After activation, the system automatically initializes, generating the API key, model service namespace, and model service endpoint required to call the models, and registers them with AI Search Open Platform. The system also creates the AI models supported by the Retrieval-Augmented Application (version 8.17), allowing you to call them directly for data processing.
Note
- If your Alibaba Cloud account has activated AI Search Open Platform, follow the on-screen instructions to initialize the models.
- If the UI indicates that some models were not created after initialization completes, find the corresponding model in the model list and click Create Model in the Actions column to create it.
- Activating AI Search Open Platform is free of charge. AI model usage is billed based on the number of calls. For more information, see Billing.
- To ensure that models can be called correctly, do not delete or disable the generated model invocation information in AI Search Open Platform. Deleting this information will render the models in the Model List unusable. For a list of AI models supported by the Retrieval-Augmented Application (version 8.17), see Model List.
The following describes the model invocation information.
- Model Service Namespace: A namespace is used to isolate and manage data. When you first activate AI Search Open Platform, the system automatically creates a default namespace named Default. You can also create namespaces as needed.
  Note
  If you want to use a custom namespace, switch to it on the Model Management page and follow the on-screen instructions to reinitialize the models. This ensures the system generates the necessary API key for the current namespace and creates the supported AI models within it. You can use the models in the namespace only after they are created.
- Model Invocation API Key: When calling an AI model using an API or SDK, use an API key for authentication to ensure secure and reliable calls.
- Model Service Endpoint: The Retrieval-Augmented Application (version 8.17) can use this private network address to access AI Search Open Platform and call the relevant models. To call AI models over the public network, see Obtain a service endpoint.
Note
When you activate AI Search Open Platform for the first time, the system automatically generates an API key and a service endpoint. All applications under this account can use this API key and service endpoint to call models.

Step 2: Call AI models

After the models are initialized, you can call the appropriate model for various use cases, such as document analysis, document splitting, and query analysis. This article demonstrates how to use Kibana to call an AI model for inference operations.

Call an AI model

Go to the Kibana development page.
Note
Before logging in to Kibana, ensure the device you are using has been added to the public or private network allowlist for Kibana. See Procedure for the logon account and password.
1. On the Model Management page, click Access Kibana in the upper-right corner and follow the on-screen instructions to go to the Kibana interface.
2. Click the icon in the upper-left corner and select Management > Dev Tools to go to the Kibana development page.
Call the AI model.
You can directly use the system-created models in Kibana for inference operations. The request format for calling a model is as follows. You can find the model category ID and model ID in the Model List.
```
POST _inference/<model_category_id>/<model_id>
```
The following example calls the ops-query-analyze-001 model for query analysis.
```
POST _inference/query_analyze/ops-query-analyze-001
```
Note
- The Retrieval-Augmented Application (version 8.17) supports various categories of Inference APIs and AI models to suit different needs.
- For more information on creating and calling models, see Call built-in model services of AI Search Open Platform.

Examples

Example 1: Document analysis

Call the system-created ops-document-analyze-001 model in Kibana to parse unstructured document data and convert it into structured data.

You can choose between a synchronous or asynchronous call. The following is the request template.

Note

Synchronous and asynchronous calls are execution modes for document analysis tasks.

Synchronous call: After you submit a document analysis task, Elasticsearch Serverless immediately blocks the current request thread and waits for AI Search Open Platform to process it. Once processing is complete, the result is returned directly to the client.
Asynchronous call: After you submit a document analysis task, Elasticsearch Serverless immediately returns a task ID without waiting for the processing result. The client must then query the task status and result in a separate request.

POST _inference/doc_analyze/ops-document-analyze-001
{
  "input": ["<document_url_or_content>"],
  "task_settings": {
    "document": {
      "input_type": "<url_or_content_or_task_id>",
      "file_name": "<optional_file_name>",
      "file_type": "<optional_file_type>"
    },
    "output": {
      "image_storage": "<base64_or_url>"
    },
    "is_async": "<true_or_false>"
  }
}

The following table describes the core parameters.

Parameter	Description
input	The document to analyze. The input can be a URL, content, or a task ID.
task_settings	The task settings. The sub-parameters within `task_settings` are all optional.
document.input_type	The document type. Valid values are: `url` (default): The web address of the document. The system retrieves the document from this address. `content`: The content of the document. `task_id`: The ID used to query the result of a previous asynchronous task.
document.file_name	The file name. If this parameter is not provided, the system infers the file name from the URL. This parameter is required if the URL is also empty.
document.file_type	The file type. If this parameter is empty, the system infers it from the suffix of the `file_name`. If the type cannot be inferred, you must specify it, such as `.pdf`, `.doc`, or `.docx`.
output.image_storage	The storage method for images. Valid values are: `base64` (default): Converts the image to a Base64-encoded text string and embeds it directly in the response. `url`: Returns a URL for the image. The URL is valid for 3 days.
is_async	Specifies whether the call is asynchronous. Valid values are: `true`: Asynchronous call. `false` (default): Synchronous call.

Synchronous call example

Note

The system returns the processing result directly.

POST _inference/doc_analyze/ops-document-analyze-001
{
  "input": ["http://opensearch-shanghai.oss-cn-shanghai.aliyuncs.com/chatos/rag/file-parser/samples/GB10767.pdf"]
}

The following figure shows an example response.

Asynchronous call example

Make the call. The following is sample code.

Note

The system returns a task_id (the ID of the asynchronous task), which you can use to query the task result.

POST _inference/doc_analyze/ops-document-analyze-001
{
  "input": ["http://opensearch-shanghai.oss-cn-shanghai.aliyuncs.com/chatos/rag/file-parser/samples/GB10767.pdf"],
  "task_settings": {
    "document": {
      "input_type": "url"
    },
    "is_async": true
  }
}

The following figure shows an example response.

Get the asynchronous result. The following is sample code.

# The input here is the task_id returned in the previous step.
POST _inference/doc_analyze/ops-document-analyze-001
{
  "input": ["e0922ee0a6eba95b70e6f355fe8391bb"],
  "task_settings": {
       "input_type": "task_id",
       "is_async": true  
  }
}

The following figure shows an example response.

Example 2: Document splitting

Call the system-created ops-document-split-001 model in Kibana to split an input document.

The following is the request template. The input contains the text content to be split.

Note

According to the JSON standard, if a string field contains the characters \\, \", \/, \b, \f, \n, \r, or \t, they must be escaped. JSON strings generated by common JSON libraries are automatically escaped.

POST _inference/doc_split/ops-document-split-001
{
  "input":"<input>"
}

The following is an example call.

POST _inference/doc_split/ops-document-split-001
{
  "input":"Elasticsearch is an open-source search engine built on top of the Apache Lucene™ full-text search engine library. Lucene is arguably the most advanced, high-performance, and full-featured search engine library available today—whether open source or proprietary. But Lucene is just a library. To make full use of its features, you need to use Java and integrate Lucene directly into your application. Worse, you might need an information retrieval degree to understand how it works. Lucene is very complex. Elasticsearch is also written in Java and uses Lucene internally for indexing and searching, but its goal is to make full-text search simple by hiding the complexity of Lucene and providing a simple and consistent RESTful API instead. However, Elasticsearch is more than just Lucene, and it's more than just a full-text search engine. It can be accurately described as: a distributed, real-time document store where every field can be indexed and searched; a distributed, real-time analytics search engine; capable of scaling to hundreds of nodes and supporting petabytes of structured or unstructured data. Elasticsearch packages all its features into a single service, so you can communicate with it through its simple RESTful API. You can use your favorite programming language as a web client, or even use the command line. With Elasticsearch, getting started is easy. For beginners, it provides sensible defaults and hides complex search theory. It works out of the box. Just a little understanding is needed to quickly become productive. As your knowledge grows, you can leverage more of Elasticsearch's advanced features. Its entire engine is configurable and flexible. You can pick and choose from its many advanced features to tailor Elasticsearch to solve your specific problems. You can download, use, and modify Elasticsearch for free. It is released under the Apache 2 license, one of the most flexible open-source licenses. The source code for Elasticsearch is hosted on Github at github.com/elastic/elasticsearch. If you want to join our amazing community of contributors, see Contributing to Elasticsearch. If you have any questions about Elasticsearch, including specific features, language clients, or plugins, you can join the discussion at discuss.elastic.co."
}

The following is an example response.

{
  "doc_split": {
    "rich_texts": [],
    "nodes": [
      {
        "parent_id": "cfe3181f03344e208a6e63efd4a674ff",
        "id": "cfe3181f03344e208a6e63efd4a674ff",
        "type": "root"
      },
      {
        "parent_id": "cfe3181f03344e208a6e63efd4a674ff",
        "id": "7574e031dd8c4880b42975f601783305",
        "type": "sentence_node"
      },
      {
        "parent_id": "cfe3181f03344e208a6e63efd4a674ff",
        "id": "8ca382cf6d8f4cb38f4a2f6433de47af",
        "type": "sentence_node"
      }
    ],
    "chunks": [
      {
        "meta": {
          "parent_id": "cfe3181f03344e208a6e63efd4a674ff",
          "hierarchy": [],
          "id": "7574e031dd8c4880b42975f601783305",
          "type": "text",
          "token": 298
        },
        "content": "Elasticsearch is an open-source search engine built on top of the Apache Lucene™ full-text search engine library. Lucene is arguably the most advanced, high-performance, and full-featured search engine library available today—whether open source or proprietary. But Lucene is just a library. To make full use of its features, you need to use Java and integrate Lucene directly into your application. Worse, you might need an information retrieval degree to understand how it works. Lucene is very complex. Elasticsearch is also written in Java and uses Lucene internally for indexing and searching, but its goal is to make full-text search simple by hiding the complexity of Lucene and providing a simple and consistent RESTful API instead. However, Elasticsearch is more than just Lucene, and it's more than just a full-text search engine. It can be accurately described as: a distributed, real-time document store where every field can be indexed and searched; a distributed, real-time analytics search engine; capable of scaling to hundreds of nodes and supporting petabytes of structured or unstructured data. Elasticsearch packages all its features into a single service, so you can communicate with it through its simple RESTful API. You can use your favorite programming language as a web client, or even use the command line. With Elasticsearch, getting started is easy. For beginners, it provides sensible defaults and hides complex search theory. It works out of the box."
      },
      {
        "meta": {
          "parent_id": "cfe3181f03344e208a6e63efd4a674ff",
          "hierarchy": [],
          "id": "8ca382cf6d8f4cb38f4a2f6433de47af",
          "type": "text",
          "token": 160
        },
        "content": "Just a little understanding is needed to quickly become productive. As your knowledge grows, you can leverage more of Elasticsearch's advanced features. Its entire engine is configurable and flexible. You can pick and choose from its many advanced features to tailor Elasticsearch to solve your specific problems. You can download, use, and modify Elasticsearch for free. It is released under the Apache 2 license, one of the most flexible open-source licenses. The source code for Elasticsearch is hosted on Github at github.com/elastic/elasticsearch. If you want to join our amazing community of contributors, see Contributing to Elasticsearch. If you have any questions about Elasticsearch, including specific features, language clients, or plugins, you can join the discussion at discuss.elastic.co."
      }
    ],
    "sentences": []
  }
}

Example 3: Text embedding

Call the system-created ops-text-embedding-001 model in Kibana to convert input text into numerical vectors. The output vectors capture the semantic information of the text, which can be used for downstream tasks such as similarity calculation, clustering, and classification.

The following is the request template. The input contains the text content to be embedded.

Note

Each request supports a maximum of 32 texts. Each text can be up to 300 tokens long and cannot be an empty string. Text length limits vary by model. For more information, see Model List.

POST _inference/text_embedding/ops-text-embedding-001
{
  "input":[<input>]
}

The following is an example call.

POST _inference/text_embedding/ops-text-embedding-001
{
  "input":["Science and technology are the primary productive force",
        "Elasticsearch product documentation"]
}

The following is an example response.

{
  "text_embedding": [
    {
      "embedding": [
        -0.029408421,
        0.061318535,
        ... 
        ]
    },
    {
      "embedding": [
        0.01568979,
        0.065073475,
        ...
        ]
    }
  ]
}

Appendix 1: Supported inference APIs

The following table lists the Inference APIs supported by the Retrieval-Augmented Application (version 8.17).

API	Description
DOCUMENT ANALYZE (document analysis)	Extracts logical structures like titles and paragraphs, as well as text, tables, and images from unstructured documents, and outputs the information in a structured format.
IMAGE ANALYZE (image content extraction)	Provides an image content analysis service that uses a large multimodal model to parse image content and recognize text. Provides an Optical Character Recognition (OCR) service that uses OCR capabilities to recognize and extract text from images.
SPLIT_DOC (document splitting)	Provides a general-purpose text splitting strategy. It can split structured data in HTML, Markdown, and TXT formats based on document paragraph formatting, text semantics, or specified rules. It also supports extracting code, images, and tables from rich text.
TEXT_EMBEDDING (text embedding)	Converts text data into dense vector representations for use in scenarios like information retrieval, text classification, and similarity comparison.
SPARSE_EMBEDDING (text sparse vector)	Converts text data into sparse vector representations. Sparse vectors require less storage space and are often used to represent keywords and term frequencies. They can be used with dense vectors for hybrid search to improve retrieval performance.
QUERY ANALYZE (query analysis)	Provides a query analysis service. It uses large language models and NLP capabilities to perform intent recognition, similar query expansion, and NL2SQL processing on your input queries, effectively improving retrieval and Q&A performance in RAG scenarios.
RERANK (reranking service)	Provides a general-purpose document scoring capability. It can rerank documents in descending order based on the relevance between the query and the document content, and output the corresponding scores.
COMPLETION (content generation service)	Provides a dedicated RAG large model service that has been fine-tuned based on Alibaba's self-developed foundation models. It can be combined with document processing and retrieval services and is widely used in RAG scenarios to improve answer accuracy and reduce hallucinations.

Appendix 2: Model list

The following table lists the AI models integrated into the Retrieval-Augmented Application (version 8.17).

Model category	Model category ID	Model ID	Description
Large model service	completion	deepseek-r1	A large language model focused on complex reasoning tasks, with outstanding performance in understanding complex instructions and ensuring result accuracy.
		deepseek-r1-distill-qwen-14b	A model generated by applying knowledge distillation and using training samples from `DeepSeek-R1` for the fine-tuning of `Qwen-14B`.
		deepseek-r1-distill-qwen-7b	A model generated by applying knowledge distillation and using training samples from `DeepSeek-R1` for the fine-tuning of `Qwen-7B`.
		deepseek-v3	A Mixture-of-Experts (MoE) model that excels in handling long text, code, mathematics, encyclopedic knowledge, and Chinese.
		ops-qwen-turbo	This model uses the `Qwen-Turbo` large language model as its base and undergoes supervised fine-tuning to enhance retrieval capabilities and reduce harmfulness.
		qwen-max	Qwen `2.0`, a large language model with hundreds of billions of parameters that supports inputs in various languages, including Chinese and English.
		qwen-plus	An enhanced version of the Qwen large language model that supports inputs in various languages, including Chinese and English.
		qwen-turbo	The Qwen large language model that supports inputs in various languages, including Chinese and English.
Multimodal embedding	multi_modal_embedding	ops-gme-qwen2-vl-2b-instruct	A multimodal embedding service trained on the `Qwen2-VL` multimodal large language models (MLLMs). It supports single-modal and multi-modal combined inputs, efficiently processing text, images, and combined data types.
		ops-m2-encoder	A bilingual (Chinese and English) multimodal service. It is trained on `BM-6B` with a dataset of `6 billion` image-text pairs (with `3 billion` pairs for each language). The model supports cross-modal retrieval (including text-to-image and image-to-text search) and image classification tasks.
		ops-m2-encoder-large	A bilingual (Chinese and English) multimodal service. It has a larger parameter count than the `ops-m2-encoder` model, reaching `1 billion parameters`, which provides stronger expressive power and performance in multimodal tasks.
Reranking	rerank	ops-bge-reranker-larger	Provides a document scoring service based on the BGE model. It can rerank documents in descending order based on the relevance between the query and the document content, and output the corresponding scores. Supports Chinese and English. The maximum input length (query + doc) is 512 tokens.
Reranking	rerank	ops-text-reranker-001	A self-developed reranking model from OpenSearch. It is trained on datasets from multiple industries to provide a high-level reranking service. It reranks documents in descending order of their semantic relevance to the query. Supports Chinese and English. The maximum input length (query + doc) is 1,024 tokens.
Document analysis	doc_analyze	ops-document-analyze-001	Provides a general-purpose document analysis service. It supports extracting logical structures like titles and paragraphs, as well as text, tables, and images from unstructured documents, and outputs the information in a structured format.
Document splitting	doc_split	ops-document-split-001	Provides a general-purpose text splitting strategy. It can split structured data in HTML, Markdown, and TXT formats based on document paragraph formatting, text semantics, or specified rules. It also supports extracting code, images, and tables from rich text.
Image content analysis	img_analyze	ops-image-analyze-ocr-001	Provides an Optical Character Recognition (OCR) service for image content. It can recognize and extract text from images using OCR capabilities for use in image retrieval and Q&A scenarios.
Image content analysis	img_analyze	ops-image-analyze-vlm-001	Provides an image content analysis service. It can parse image content and recognize text based on a large multimodal model. The parsed text can be used for image retrieval and Q&A scenarios.
Query analysis	query_analyze	ops-query-analyze-001	Provides a general-purpose query analysis service. It can understand the intent of your input query and expand it to similar questions based on a large language model.
Text embedding	text_embedding	ops-text-embedding-001	Provides text embedding services for over 40 languages. The maximum input text length is `300` tokens, and the output vector dimension is 1536.
		ops-text-embedding-002	Provides text embedding services for over 100 languages. The maximum input text length is `8192` tokens, and the output vector dimension is 1024.
		ops-text-embedding-en-001	Provides English text embedding services. The maximum input text length is `512` tokens, and the output vector dimension is 768.
		ops-text-embedding-zh-001	Provides Chinese text embedding services. The maximum input text length is `1024` tokens, and the output vector dimension is 768.
Text sparse embedding	sparse_embedding	ops-text-sparse-embedding-001	Provides text embedding services for over 100 languages. The maximum input text length is `8192` tokens.

Use AI models

Background

Billing

Prerequisites

Step 1: Activate and initialize AI models

Step 2: Call AI models

Call an AI model

Examples

Example 1: Document analysis

Example 2: Document splitting

Example 3: Text embedding

Appendix 1: Supported inference APIs

Appendix 2: Model list

Related documents