Transform text into vectors by calling Model Studio APIs

This topic describes how to use Alibaba Cloud Model Studio to transform text into vectors and store them in Vector Retrieval Service (DashVector) for retrieval.

Prerequisites

General-purpose text embedding

Introduction

General-purpose text embedding is a unified multilingual text embedding model from Qwen Lab. It is built on a large language model (LLM) and supports the languages listed for each model version below. Developers can use it to quickly transform text into high-quality vector data.

text-embedding-v1

  • Vector dimensions: 1536

  • Distance measure: Cosine

  • Vector data type: Float32

  • Maximum input characters per line: 2048

  • Maximum lines of text per request: 25

  • Supported languages: Chinese, English, Spanish, French, Portuguese, and Indonesian

text-embedding-v2

  • Vector dimensions: 1536

  • Distance measure: Cosine

  • Vector data type: Float32

  • Maximum input characters per line: 2048

  • Maximum lines of text per request: 25

  • Supported languages: Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian
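
The per-request limits listed above (25 lines of text, 2048 characters per line) mean that a larger corpus has to be split before it is sent for embedding. The following is a minimal sketch of one way to do that in plain Python. The helper name batch_texts is hypothetical and not part of any SDK, and the sketch assumes that Python's len() counts characters the same way the service does.

```python
from typing import Iterator, List

MAX_LINES_PER_REQUEST = 25   # limit listed above for both models
MAX_CHARS_PER_LINE = 2048    # limit listed above for both models


def batch_texts(texts: List[str]) -> Iterator[List[str]]:
    """Yield batches of at most 25 texts; reject any text over 2048 characters.

    Hypothetical helper for illustration, not part of the DashScope SDK.
    """
    for i, text in enumerate(texts):
        if len(text) > MAX_CHARS_PER_LINE:
            raise ValueError(
                f"text at index {i} has {len(text)} characters; "
                f"the limit is {MAX_CHARS_PER_LINE}"
            )
    for start in range(0, len(texts), MAX_LINES_PER_REQUEST):
        yield texts[start:start + MAX_LINES_PER_REQUEST]


batches = list(batch_texts([f"document {n}" for n in range(60)]))
print([len(b) for b in batches])  # prints [25, 25, 10]
```

Each batch can then be passed as the input of one embedding request. Raising on over-long text, rather than silently truncating it, is a design choice of this sketch.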

Note

For more information about general-purpose text embedding, see General-purpose text embedding.

Example

Note

To run the code, replace the following placeholders:

  1. Replace {your-dashvector-api-key} with your DashVector API key.

  2. Replace {your-dashvector-cluster-endpoint} with your DashVector cluster endpoint.

  3. Replace {your-dashscope-api-key} with your DashScope API key.
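
As an alternative to pasting keys into the source file, the three values could be read from environment variables. The sketch below is one possible approach, not a requirement of either service. The helper names and the variable names DASHVECTOR_API_KEY and DASHVECTOR_ENDPOINT are assumptions made for illustration. DASHSCOPE_API_KEY is the variable name the DashScope SDK is generally documented to read, but treat that as an assumption here as well.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


def load_credentials() -> dict:
    """Collect the three values that the example expects.

    The variable names are illustrative assumptions, not fixed by the services.
    """
    return {
        "dashscope_api_key": require_env("DASHSCOPE_API_KEY"),
        "dashvector_api_key": require_env("DASHVECTOR_API_KEY"),
        "dashvector_endpoint": require_env("DASHVECTOR_ENDPOINT"),
    }
```

The returned values can be substituted for the placeholder strings in the example that follows, for instance dashscope.api_key = load_credentials()["dashscope_api_key"].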

import dashscope
from dashscope import TextEmbedding
from dashvector import Client
from typing import List, Union


dashscope.api_key = '{your-dashscope-api-key}'


# Call the DashScope general-purpose text embedding model to transform text into vectors
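def generate_embeddings(texts: Union[List[str], str], text_type: str = 'document'):
    rsp = TextEmbedding.call(
        model=TextEmbedding.Models.text_embedding_v2,
        input=texts,
        text_type=text_type
    )
    # Surface the service's error details instead of failing later on rsp.output
    if rsp.status_code != 200:
        raise RuntimeError(f'Embedding request failed: {rsp.code} {rsp.message}')
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    # Return a list of vectors for list input, or a single vector for a single string
    return embeddings if isinstance(texts, list) else embeddings[0]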


# Create a DashVector client
client = Client(
    api_key='{your-dashvector-api-key}',
    endpoint='{your-dashvector-cluster-endpoint}'
)

# Create a DashVector collection
rsp = client.create('dashscope-text-embedding', 1536)
assert rsp
collection = client.get('dashscope-text-embedding')
assert collection

# Insert vectors into DashVector
collection.insert(
    ('ID1', generate_embeddings('Alibaba Cloud Vector Retrieval Service (DashVector) is one of the best-performing and most cost-effective vector databases'))
)

# Retrieve vectors
docs = collection.query(
    generate_embeddings('The best vector database', 'query')
)
print(docs)
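
Both models list cosine as their distance measure. To illustrate what that comparison computes, here is a small self-contained sketch of cosine similarity in plain Python. It is only an illustration of the formula, not DashVector's implementation, and the function name cosine_similarity is hypothetical. Whether DashVector reports the result as a similarity or as a distance is not stated in this topic, so check the DashVector documentation before interpreting scores.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
print(cosine_similarity(query, [2.0, 0.0]))  # same direction: prints 1.0
print(cosine_similarity(query, [0.0, 3.0]))  # orthogonal: prints 0.0
```

Because the value depends only on direction and not on vector length, texts with similar meaning score close to 1 even when their raw embedding magnitudes differ.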

Related best practices

ONE-PEACE multimodal vector representation

ONE-PEACE is a general-purpose representation model for three modalities: image, text, and audio. It also transforms text into vectors.

For more information, see Generate vectors from multiple modalities — ONE-PEACE multimodal vector representation.