Create a custom model inference service using the Inference API

Alibaba Cloud Elasticsearch extends the Inference API to provide a flexible and efficient method for deploying and managing custom AI models. This feature is ideal for scenarios such as recommendation systems, image retrieval, and natural language processing. You can quickly combine custom AI models with the high-performance Elasticsearch engine to enhance the efficiency and accuracy of your business applications.

Background information

  • The official Inference API from Elasticsearch (ES) supports calling model inference services from external platforms, such as Hugging Face and OpenAI. For more information, see Create inference API.

  • Alibaba Cloud ES extends the Inference API so that, in addition to the officially supported models, you can invoke models from other platforms and define custom model inference services in ES. Alibaba Cloud ES maps your custom model IDs to the model IDs used by the Inference API and integrates steps such as embedding and rerank into the read and write workflows of ES. This lets you quickly combine AI model capabilities with the high-performance ES engine to serve real-world business scenarios. Standardized templates are provided for common model platforms to speed up configuration.

Limits

Only Elasticsearch 8.15 and later versions support custom model inference services.

Create an inference model

General template

PUT _inference/<task_type>/<inference_id>
{
  "service": "alibaba-cloud-custom-model",
  "service_settings": {
    "secret_parameters": {
      ...
    },
    "url": "<url>",
    "path": {
      "<path>": {
        "<method>": {
          "query_string": "<query_string>",
          "headers": {
            ...
          },
          "request": {
            "format": "string",
            "content": "<content>"
          },
          "response": {
            "json_parser":{
              ...
            }
          }
        }
      }
    }
  },
  "task_settings": {
    "parameters":{
      ...
    }
  }
}

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| <task_type> | / | Yes | The model type. Valid values: text_embedding, sparse_embedding, rerank, completion, and custom. Use custom for inference models other than the four types listed, or when you need the complete response. With custom, the result is the complete HTTP response. |
| <inference_id> | / | Yes | A custom name for the inference model to create. |
| service | string | Yes | The service to use. For Alibaba Cloud custom models, the value is alibaba-cloud-custom-model. |
| service_settings | / | Yes | Service settings. |
| secret_parameters | object | Yes | Sensitive parameters, such as api_key and token. Define them here and reference them through placeholders wherever they are needed. |
| <url> | string | Yes | The URL of the service. |
| <path> | string | Yes | The path of the service. Combined with the URL, it forms the service invocation endpoint. |
| <method> | string | Yes | The HTTP request method. POST, PUT, and GET are supported. |
| <query_string> | string | No | The query string of the HTTP request. |
| headers | object | No | The headers of the HTTP request. |
| request | object | Yes | The HTTP request sent to the custom model. |
| request.format | string | Yes | The body type of the request. Currently, only string is supported. |
| request.content | string | Yes | The HTTP request body. Pass the JSON-formatted body as an escaped string. |
| response | object | No | When task_type is text_embedding, sparse_embedding, rerank, or completion, you must configure the corresponding response.json_parser. |
| response.json_parser | object | Not required when <task_type> is custom. For other types, you must specify the corresponding parameters. | Defines how to parse the HTTP response. It uses JSONPath syntax to parse the response into an object that ES can recognize. |
| response.json_parser.text_embeddings | string | Required only when <task_type> is text_embedding. | The JSONPath expression that selects the text embeddings in the response. |
| response.json_parser.sparse_result | object | Required only when <task_type> is sparse_embedding. | The parser settings for the sparse embedding result. |
| response.json_parser.sparse_result.path | string | | The JSONPath expression that selects sparse_result. |
| response.json_parser.sparse_result.value.sparse_token | string | | The JSONPath expression that selects sparse_token. |
| response.json_parser.sparse_result.value.sparse_weight | string | | The JSONPath expression that selects sparse_weight. |
| response.json_parser.relevance_score | string | Required when <task_type> is rerank. Do not specify this for other types. | The JSONPath expression that selects the rerank relevance_score. |
| response.json_parser.reranked_index | string | No | The JSONPath expression that selects the rerank reranked_index. |
| response.json_parser.document_text | string | No | The JSONPath expression that selects the rerank document_text. |
| response.json_parser.completion_result | string | Required when <task_type> is completion. Do not specify this for other types. | The JSONPath expression that selects the completion completion_result. |
| task_settings.parameters | object | No | Custom parameters. Define them here and reference them through placeholders wherever they are needed. |

task_settings.parameters holds the optional parameters of the model service together with their default values for the inference model. To change a value, set it when you create the service, or pass task_settings.parameters when you call the service.
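As a minimal sketch of how these parameters fit together, the following Python builds a creation request body for a text_embedding endpoint. The URL, the path, and the `input_type` parameter are placeholder assumptions modeled on the sample endpoint shown later in this topic, not values the service requires. The point is that `request.content` is a template string, so serializing the body produces the escaped form described in the table.

```python
import json

# Hypothetical values for illustration only; replace them with your own service details.
SERVICE_URL = "http://example.invalid"
SERVICE_PATH = "/v1/embeddings"

# The request template is plain text that contains ${...} placeholders.
# ${input} is left unquoted here, mirroring the sample endpoint in this topic.
content_template = '{"input":${input},"input_type":"${input_type}"}'

body = {
    "service": "alibaba-cloud-custom-model",
    "service_settings": {
        # Sensitive values live here and are referenced as ${api_key} elsewhere.
        "secret_parameters": {"api_key": "<your_api_key>"},
        "url": SERVICE_URL,
        "path": {
            SERVICE_PATH: {
                "POST": {
                    "headers": {"Authorization": "Bearer ${api_key}"},
                    "request": {"format": "string", "content": content_template},
                    "response": {
                        "json_parser": {
                            "text_embeddings": "$.result.embeddings[*].embedding"
                        }
                    },
                }
            }
        },
    },
    # Default value for the ${input_type} placeholder (an assumed parameter name).
    "task_settings": {"parameters": {"input_type": "document"}},
}

# json.dumps escapes the quotes inside the template, which yields the
# escaped-string form that request.content expects.
serialized = json.dumps(body, indent=2)
print(serialized)
```

The printed JSON can be used as the body of a PUT _inference/text_embedding/<inference_id> request.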

The response parameter parses the http response, converting it into an object that is recognizable by ES. This process integrates the model inference service with the ES write and query processes. You can use JSONPath expressions to parse the response.

ES supports custom models of the text_embedding, sparse_embedding, rerank, completion, and custom types. The custom type does not have a response parameter. The other four types have different response formats. The following sections provide examples of the response format for each type:

text_embedding type

For the text_embedding type, the input is a string or a List<string>. The required result is a List<List<Float>>, which represents the vector result for each input text.

The following is a sample Response:

{
    "request_id": "B4AB89C8-B135-****-A6F8-2BAB801A2CE4",
    "latency": 38,
    "usage": {
        "token_count": 3072
    },
    "result": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.02868066355586052,
                    0.022033605724573135,
                    ...
                ]
            }
        ]
    }
}

The corresponding parameter settings are:

"response":{
	"json_parser":{
		"text_embeddings":"$.result.embeddings[*].embedding"
	}
}
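To make the JSONPath expression concrete, the sketch below reproduces in plain Python what `$.result.embeddings[*].embedding` selects from the sample response. ES evaluates the expression internally; this code is only an illustration, and the vector is shortened to the two values visible in the sample.

```python
# Sample response reduced to the fields the parser reads (vector truncated as in the sample).
response = {
    "result": {
        "embeddings": [
            {"index": 0, "embedding": [-0.02868066355586052, 0.022033605724573135]}
        ]
    }
}

# Plain-Python equivalent of $.result.embeddings[*].embedding:
# [*] iterates over every element of "embeddings", and .embedding picks the vector,
# so the result is one vector per input text (List<List<Float>>).
text_embeddings = [item["embedding"] for item in response["result"]["embeddings"]]
print(text_embeddings)
```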

sparse_embedding type

For the sparse_embedding type, the input is a string or a List<string>. The required result includes a List<List<string>> of tokens and a List<List<Float>> of weights.

Example response:

{
	"request_id": "75C50B5B-E79E-4930-****-F48DBB392231",
	"latency": 22,
	"usage": {
		"token_count": 11
	},
	"result": {
		"sparse_embeddings": [
			{
				"index": 0,
				"embedding": [
					{
						"tokenId": 6,
						"weight": 0.10137939453125
					},
					{
						"tokenId": 163040,
						"weight": 0.2841796875
					},
					{
						"tokenId": 354,
						"weight": 0.1431884765625
					},
					{
						"tokenId": 5998,
						"weight": 0.161376953125
					},
					{
						"tokenId": 8550,
						"weight": 0.2388916015625
					},
					{
						"tokenId": 2017,
						"weight": 0.1614990234375
					}
				]
			},
			{
				"index": 1,
				"embedding": [
					{
						"tokenId": 9803,
						"weight": 0.1951904296875
					},
					{
						"tokenId": 86250,
						"weight": 0.317138671875
					},
					{
						"tokenId": 5889,
						"weight": 0.17529296875
					},
					{
						"tokenId": 2564,
						"weight": 0.11614990234375
					},
					{
						"tokenId": 59529,
						"weight": 0.1666259765625
					}
				]
			}
		]
	}
}

The corresponding parameter settings are:

"response":{
	"json_parser":{
		"sparse_result":{
			"path":"$.result.sparse_embeddings[*]",
			"value":{
				"sparse_token":"$.embedding[*].tokenId",
				"sparse_weight":"$.embedding[*].weight"
			}
		}
	}
}
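The two-level structure of sparse_result can be sketched in plain Python as follows. It assumes that the value paths are evaluated relative to each object selected by path, which is how the settings above read, and it uses the tokenId field name exactly as it appears in the sample response (shortened here to a few entries). ES performs this parsing internally; the code is only an illustration.

```python
# Sample response reduced to a few entries per input text.
response = {
    "result": {
        "sparse_embeddings": [
            {
                "index": 0,
                "embedding": [
                    {"tokenId": 6, "weight": 0.10137939453125},
                    {"tokenId": 163040, "weight": 0.2841796875},
                ],
            },
            {
                "index": 1,
                "embedding": [
                    {"tokenId": 9803, "weight": 0.1951904296875},
                ],
            },
        ]
    }
}

# path: $.result.sparse_embeddings[*] selects one object per input text.
per_input = response["result"]["sparse_embeddings"]

# value.sparse_token / value.sparse_weight: assumed to be evaluated relative
# to each selected object, giving one list of tokens and weights per input.
sparse_tokens = [[entry["tokenId"] for entry in obj["embedding"]] for obj in per_input]
sparse_weights = [[entry["weight"] for entry in obj["embedding"]] for obj in per_input]

print(sparse_tokens)
print(sparse_weights)
```

The token path must name the field exactly as the model service returns it; a mismatch in spelling or case selects nothing.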

rerank type

For the rerank type, the required result is a List<Float> of scores, which are the relevance scores of the input texts for the query. The result can also include an optional List<int> of indexes, each of which is the position of a document in the input text array. If the index is not specified, the default order is used. Another optional field is a List<string> of texts, which are the original input texts that correspond to the ranked results.

Example Response:

{
  "request_id": "24B004E0-ADEF-****-879B-F28359BFAD1D",
  "latency": 19,
  "usage": {
    "doc_count": 3
  },
  "result": {
    "scores": [
      {
        "index": 0,
        "score": 0.45026873385713345
      },
      {
        "index": 1,
        "score": 1.1412238544346029E-4
      },
      {
        "index": 2,
        "score": 8.029784284533197E-5
      }
    ]
  }
}

The corresponding parameter settings are:

"response":{
	"json_parser":{
		"relevance_score":"$.result.scores[*].score",
		"reranked_index":"$.result.scores[*].index"
	}
}
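The following plain-Python sketch shows which values the two rerank paths select from the sample response, and one way to read the "default order" rule when no index path is configured. Treating the default order as the input order (0, 1, 2, ...) is an assumption made for this illustration, and `resolve_indexes` is a hypothetical helper, not part of ES.

```python
response = {
    "result": {
        "scores": [
            {"index": 0, "score": 0.45026873385713345},
            {"index": 1, "score": 1.1412238544346029e-4},
            {"index": 2, "score": 8.029784284533197e-5},
        ]
    }
}

scores = response["result"]["scores"]
relevance_score = [s["score"] for s in scores]  # $.result.scores[*].score
reranked_index = [s["index"] for s in scores]   # $.result.scores[*].index


def resolve_indexes(parsed_index, score_count):
    """Hypothetical helper: fall back to input order when no index was parsed."""
    if parsed_index is not None:
        return parsed_index
    # Assumption: "default order" means the order of the input texts.
    return list(range(score_count))


with_index = resolve_indexes(reranked_index, len(relevance_score))
without_index = resolve_indexes(None, len(relevance_score))
print(list(zip(with_index, relevance_score)))
```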

completion type

For the completion type, the required result is a List<string>.

Example response:

{
  "request_id": "450fcb80-f796-****-8d69-e1e86d29aa9f",
  "latency": 564.903929,
  "result": {
    "text": "Zhengzhou is a modern city with a long and rich cultural history, and it has many fun places to visit. Here are some recommended tourist attractions:..."
  },
  "usage": {
    "output_tokens": 6320,
    "input_tokens": 35,
    "total_tokens": 6355
  }
}

The corresponding parameter settings are:

"response":{
	"json_parser":{
		"completion_result":"$.result.text"
	}
}

Creation examples for each type

TEXT_EMBEDDING (text embedding)

PUT _inference/text_embedding/<model_id>
{
  "service":"alibaba-cloud-custom-model",
  "service_settings":{
    "secret_parameters":{
      <secret_parameter_values>
    },
    "url":"<your_url>",
    "path":{
      "<your_path>":{
        "POST":{
          "query_string": "<query_string_values>",
          "headers":{
            <header_values>
          },
          "request":{
            "format":"string",
            "content":"<model_request_format>"
          },
          "response":{
            "json_parser":{
              "text_embeddings":"<path_to_parse_text_embeddings>"
            }
          }
        }
      }
    }
  },
  "task_settings":{
    "parameters":{
      <parameter_values>
    }
  }
}

SPARSE_EMBEDDING (sparse text embedding)

PUT _inference/sparse_embedding/<model_id>
{
  "service":"alibaba-cloud-custom-model",
  "service_settings":{
    "secret_parameters":{
      <secret_parameter_values>
    },
    "url":"<your_url>",
    "path":{
      "<your_path>":{
        "<method>":{
          "query_string": "<query_string_values>",
          "headers":{
            <header_values>
          },
          "request":{
            "format":"string",
            "content":"<model_request_format>"
          },
          "response":{
            "json_parser":{
              "sparse_result":{
                "path":"<path_to_parse_sparse_embedding>",
                "value":{
                  "sparse_token":"<path_to_parse_sparse_embedding_token>",
                  "sparse_weight":"<path_to_parse_sparse_embedding_weight>"
                }
              }
            }
          }
        }
      }
    }
  },
  "task_settings":{
    "parameters":{
      <parameter_values>
    }
  }
}

RERANK (reranking service)

PUT _inference/rerank/<model_id>
{
  "service":"alibaba-cloud-custom-model",
  "service_settings":{
    "secret_parameters":{
      <secret_parameter_values>
    },
    "url":"<your_url>",
    "path":{
      "<your_path>":{
        "<method>":{
          "query_string": "<query_string_values>",
          "headers":{
            <header_values>
          },
          "request":{
            "format":"string",
            "content":"<model_request_format>"
          },
          "response":{
            "json_parser":{
              "relevance_score":"<path_to_parse_rerank_relevance_score>",
              "reranked_index":"<path_to_parse_rerank_reranked_index>",
              "document_text":"<path_to_parse_rerank_document_text>"
            }
          }
        }
      }
    }
  }
}

COMPLETION (content generation service)

PUT _inference/completion/<model_id>
{
  "service":"alibaba-cloud-custom-model",
  "service_settings":{
    "secret_parameters":{
      <secret_parameter_values>
    },
    "url":"<your_url>",
    "path":{
      "<your_path>":{
        "<method>":{
          "query_string": "<query_string_values>",
          "headers":{
            <header_values>
          },
          "request":{
            "format":"string",
            "content":"<model_request_format>"
          },
          "response":{
            "json_parser":{
              "completion_result":"<path_to_parse_completion>"
            }
          }
        }
      }
    }
  },
  "task_settings":{
    "parameters":{
      <parameter_values>
    }
  }
}

CUSTOM (custom service)

For model types that ES does not currently support, or when you need to retrieve the full response, you can set the model type to custom. In this case, the response that ES returns contains the complete model response.

Models of the text_embedding, sparse_embedding, rerank, and completion types can also be defined as custom.

PUT _inference/custom/<model_id>
{
  "service":"alibaba-cloud-custom-model",
  "service_settings":{
    "secret_parameters":{
      <secret_parameter_values>
    },
    "url":"<your_url>",
    "path":{
      "<your_path>":{
        "<method>":{
          "query_string": "<query_string_values>",
          "headers":{
            <header_values>
          },
          "request":{
            "format":"string",
            "content":"<model_request_format>"
          }
        }
      }
    }
  },
  "task_settings":{
    "parameters":{
      <parameter_values>
    }
  }
}

Get an inference model

Syntax

GET /_inference/_all

GET /_inference/<inference_id>

GET /_inference/<task_type>/_all

GET /_inference/<task_type>/<inference_id>

Parameters

| Parameter | Description |
| --- | --- |
| <inference_id> | The identifier of the custom inference endpoint. |
| <task_type> | The type of the inference endpoint. Valid values: text_embedding, sparse_embedding, rerank, completion, and custom. |
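The four request forms differ only in which path segments are present. As a small sketch, the hypothetical helper below (`inference_path` is not part of any client library) builds the path from an optional task type and an optional inference ID, and rejects task types outside the documented list.

```python
from typing import Optional
from urllib.parse import quote

# Task types listed in this topic.
TASK_TYPES = {"text_embedding", "sparse_embedding", "rerank", "completion", "custom"}


def inference_path(inference_id: Optional[str] = None,
                   task_type: Optional[str] = None) -> str:
    """Build an _inference path; a missing inference_id means "_all"."""
    if task_type is not None and task_type not in TASK_TYPES:
        raise ValueError(f"unsupported task_type: {task_type!r}")
    parts = ["/_inference"]
    if task_type is not None:
        parts.append(task_type)
    # Percent-encode the ID so that it stays a single path segment.
    parts.append(quote(inference_id, safe="") if inference_id is not None else "_all")
    return "/".join(parts)


print(inference_path())
print(inference_path("os_deployment_emb"))
print(inference_path(task_type="rerank"))
print(inference_path("os_deployment_emb", "text_embedding"))
```

The forms that name an inference ID also match the paths used to call and delete a model.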

Example

Sample request:

GET _inference/_all

Response:

{
  "endpoints": [
    {
      "inference_id": "os_deployment_emb",
      "task_type": "text_embedding",
      "service": "alibaba-cloud-custom-model",
      "service_settings": {
        "url": "http://xxxx.opensearch.aliyuncs.com",
        "path": {
          "/v3/openapi/deployments/xxx/predict": {
            "POST": {
              "headers": {
                "Authorization": "Bearer ${api_key}",
                "Token": "${Token}"
              },
              "request": {
                "format": "string",
                "content": """{"input":${input},"input_type":"${input_type}"}"""
              },
              "response": {
                "json_parser": {
                  "text_embeddings": "$.embeddings[*].embedding"
                }
              }
            }
          }
        },
        "rate_limit": {
          "requests_per_minute": 10000
        }
      },
      "task_settings": {
        "parameters": {
          "input_type": "document"
        }
      }
    }
  ]
}

Call an inference model

Syntax

POST /_inference/<inference_id>

POST /_inference/<task_type>/<inference_id>

Parameters

Path parameters

| Parameter | Description |
| --- | --- |
| <inference_id> | The identifier of the custom inference endpoint. |
| <task_type> | The type of the inference endpoint. Valid values: text_embedding, sparse_embedding, rerank, completion, and custom. |

Query parameters

| Parameter | Description |
| --- | --- |
| timeout | Optional. string. The maximum time to wait for the request to complete. The default value is 30s. |

Request body parameters

| Parameter | Description |
| --- | --- |
| input | Required. string or array of strings. The input text for calling the model. |
| query | Optional. string. Used only for rerank endpoints. The query text. |
| task_settings | Optional. object. The task_settings for this call. They override the task_settings that were configured when the inference model was created. |
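The interaction between creation-time defaults, call-time task_settings, and the placeholders in request.content can be approximated as follows. This is a sketch under stated assumptions: the per-key override and the substitution rules are inferred from the sample endpoint in this topic, Python's `string.Template` merely stands in for whatever ES does internally, and `render_request` and the "query" value are hypothetical.

```python
import json
from string import Template

# Defaults stored when the endpoint was created (taken from the sample endpoint).
created_parameters = {"input_type": "document"}


def render_request(content_template, inputs, call_parameters=None):
    """Hypothetical helper that approximates placeholder substitution."""
    # Assumption: call-time parameters override creation-time values key by key.
    values = {**created_parameters, **(call_parameters or {})}
    # Assumption: ${input} is replaced by the JSON encoding of the input,
    # which is why the template leaves it unquoted.
    values["input"] = json.dumps(inputs)
    return Template(content_template).substitute(values)


template = '{"input":${input},"input_type":"${input_type}"}'

# No task_settings in the call: the creation-time default applies.
default_body = render_request(template, ["hello world"])
# task_settings.parameters in the call: the call-time value wins.
override_body = render_request(template, ["hello world"], {"input_type": "query"})

print(default_body)
print(override_body)
```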

Examples

text_embedding

Sample call:

POST _inference/text_embedding/os_deployment_emb
{
  "input":"hello world"
}

Response:

{
  "text_embedding": [
    {
      "embedding": [
        -0.026062012,
        0.01574707,
        -0.03842163,
        0.012580872,
        ...
      ]
    }
  ]
}

rerank

Sample call:

POST _inference/rerank/os_deployment_custom_rerank
{
  "input": ["luke", "like", "leia", "chewy","r2d2", "star", "wars"],
  "query": "star wars main character"
}

Response:

{
  "rerank": [
    {
      "index": 0,
      "relevance_score": 0.8502201
    },
    {
      "index": 1,
      "relevance_score": 0.062216982
    },
    {
      "index": 2,
      "relevance_score": 0.60352296
    },
    {
      "index": 3,
      "relevance_score": 0.35611072
    },
    {
      "index": 4,
      "relevance_score": 0.40951595
    },
    {
      "index": 5,
      "relevance_score": 0.16277891
    },
    {
      "index": 6,
      "relevance_score": 0.12918286
    }
  ]
}
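The sample response lists the scores by input index, so a client that wants the best matches first can sort them itself. The sketch below does this in Python and maps each index back to the original input text; the data is copied from the sample call and response.

```python
inputs = ["luke", "like", "leia", "chewy", "r2d2", "star", "wars"]

response = {
    "rerank": [
        {"index": 0, "relevance_score": 0.8502201},
        {"index": 1, "relevance_score": 0.062216982},
        {"index": 2, "relevance_score": 0.60352296},
        {"index": 3, "relevance_score": 0.35611072},
        {"index": 4, "relevance_score": 0.40951595},
        {"index": 5, "relevance_score": 0.16277891},
        {"index": 6, "relevance_score": 0.12918286},
    ]
}

# Sort by relevance_score, highest first.
ranked = sorted(response["rerank"], key=lambda r: r["relevance_score"], reverse=True)

# "index" refers to the position of the text in the input array.
top3 = [(inputs[r["index"]], r["relevance_score"]) for r in ranked[:3]]
for text, score in top3:
    print(f"{text}: {score}")
```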

Delete an inference model

Syntax

DELETE /_inference/<inference_id>

DELETE /_inference/<task_type>/<inference_id>

Parameters

| Parameter | Description |
| --- | --- |
| <inference_id> | The identifier of the custom inference endpoint. Example: DELETE /_inference/<inference_id>. |
| <task_type> | The type of the inference endpoint. Valid values: text_embedding, sparse_embedding, rerank, completion, and custom. Example: DELETE /_inference/<task_type>/<inference_id>. |

Example

Sample request:

DELETE _inference/custom-rerank

Response:

{
  "acknowledged": true,
  "pipelines": []
}