Use open source embedding models from ModelScope to convert text into vectors

ModelScope hosts hundreds of open-source embedding models you can use free of charge. This guide shows how to pick a model, generate embeddings with it, and store the resulting vectors in Vector Retrieval Service (VRS) for semantic search.

Prerequisites

Before you begin, ensure that you have:

A VRS cluster — Create a cluster
A VRS API key — Manage API keys
The latest VRS SDK installed — Install DashVector SDK
The ModelScope SDK installed:
```
pip install -U modelscope
```

Choose a model

Browse all sentence embedding models on ModelScope to find a model that fits your language and domain. The tables below list the most commonly used model families. For each model, note the model ID and dimensions — you will use both values in the code.
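
One way to keep the two values together is a small lookup keyed by model ID. The sketch below is illustrative only: `MODEL_DIMENSIONS` and `dimension_for` are hypothetical names, not part of the ModelScope or DashVector SDKs, and the entries are a handful of rows copied from the tables in this guide.

```python
# Hypothetical helper: maps a few model IDs from the tables in this guide
# to their output dimensions. Extend it with the models you actually use.
MODEL_DIMENSIONS = {
    "damo/nlp_corom_sentence-embedding_chinese-base": 768,
    "damo/nlp_corom_sentence-embedding_english-base": 768,
    "damo/nlp_corom_sentence-embedding_chinese-tiny": 256,
    "damo/nlp_corom_sentence-embedding_english-tiny": 256,
    "damo/nlp_gte_sentence-embedding_english-small": 384,
    "damo/udever-bloom-560m": 1024,
}


def dimension_for(model_id: str) -> int:
    """Return the vector dimension recorded for model_id, or raise KeyError."""
    try:
        return MODEL_DIMENSIONS[model_id]
    except KeyError:
        raise KeyError(f"No dimension recorded for model {model_id!r}") from None
```

Looking the dimension up from the same key as the model ID makes it harder to create a collection whose dimension does not match the model.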

CoROM models

CoROM models cover Chinese and English across general, eCommerce, and healthcare domains. See the full CoROM model list.

Model ID	Dimensions	Distance metric	Data type	Domain	Max text length
damo/nlp_corom_sentence-embedding_chinese-base	768	Cosine	Float32	Chinese — general, base	512
damo/nlp_corom_sentence-embedding_english-base	768	Cosine	Float32	English — general, base	512
damo/nlp_corom_sentence-embedding_chinese-base-ecom	768	Cosine	Float32	Chinese — eCommerce, base	512
damo/nlp_corom_sentence-embedding_chinese-base-medical	768	Cosine	Float32	Chinese — healthcare, base	512
damo/nlp_corom_sentence-embedding_chinese-tiny	256	Cosine	Float32	Chinese — general, tiny	512
damo/nlp_corom_sentence-embedding_english-tiny	256	Cosine	Float32	English — general, tiny	512
damo/nlp_corom_sentence-embedding_chinese-tiny-ecom	256	Cosine	Float32	Chinese — eCommerce, tiny	512
damo/nlp_corom_sentence-embedding_chinese-tiny-medical	256	Cosine	Float32	Chinese — healthcare, tiny	512

GTE models

GTE models cover Chinese and English in small, base, and large sizes. See the full GTE model list.

Model ID	Dimensions	Distance metric	Data type	Domain	Max text length
damo/nlp_gte_sentence-embedding_chinese-base	768	Cosine	Float32	Chinese — general, base	512
damo/nlp_gte_sentence-embedding_chinese-large	1,024	Cosine	Float32	Chinese — general, large	512
damo/nlp_gte_sentence-embedding_chinese-small	512	Cosine	Float32	Chinese — general, small	512
damo/nlp_gte_sentence-embedding_english-base	768	Cosine	Float32	English — general, base	512
damo/nlp_gte_sentence-embedding_english-large	1,024	Cosine	Float32	English — general, large	512
damo/nlp_gte_sentence-embedding_english-small	384	Cosine	Float32	English — general, small	512

Udever multilingual models

Udever models support multiple languages with a maximum text length of 2,048 tokens. See the full Udever model list.

Model ID	Dimensions	Distance metric	Data type	Parameters	Max text length
damo/udever-bloom-560m	1,024	Cosine	Float32	560m	2,048
damo/udever-bloom-1b1	1,536	Cosine	Float32	1b1	2,048
damo/udever-bloom-3b	2,560	Cosine	Float32	3b	2,048
damo/udever-bloom-7b1	4,096	Cosine	Float32	7b1	2,048

StructBERT FAQ models

StructBERT models are optimized for FAQ question answering with no maximum text length limit. They use a different pipeline task — see Generate embeddings with StructBERT. See the full StructBERT model list.

Model ID	Dimensions	Distance metric	Data type	Domain
damo/nlp_structbert_faq-question-answering_chinese-base	768	Cosine	Float32	Chinese — general, base
damo/nlp_structbert_faq-question-answering_chinese-finance-base	768	Cosine	Float32	Chinese — finance, base
damo/nlp_structbert_faq-question-answering_chinese-gov-base	768	Cosine	Float32	Chinese — eGovernment, base

More models

The following models are also available on ModelScope:

Model name	Model ID	Dimensions	Distance metric	Data type	Max text length
BERT entity embedding — Chinese	damo/nlp_bert_entity-embedding_chinese-base	768	Cosine	Float32	128 (Details)
MiniLM — English, text retrieval	damo/nlp_minilm_ibkd_sentence-embedding_english-msmarco	384	Cosine	Float32	128 (Details)
MiniLM — English, STS	damo/nlp_minilm_ibkd_sentence-embedding_english-sts	384	Cosine	Float32	128 (Details)
text2vec-base-chinese	thomas/text2vec-base-chinese	768	Cosine	Float32	unknown (Details)
text2vec-large-chinese	thomas/text2vec-large-chinese	1,024	Cosine	Float32	unknown (Details)
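
Every model in these tables lists Cosine as its distance metric. As a reminder of what that metric measures, the following minimal sketch computes cosine similarity between two vectors in plain Python. `cosine_similarity` is a hypothetical helper for illustration only. VRS evaluates the metric itself during search, and the score it returns may be defined differently (for example, as a distance), so treat this as an explanation of the metric rather than a reproduction of VRS scores.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 1.0 means the same direction."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because the result depends only on direction, two vectors that differ only in length score as identical.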

Generate embeddings and store them in VRS

The code pattern is the same for the CoROM, GTE, and Udever models and for the additional models listed under More models. Replace <model-id> with your chosen model ID and <dimensions> with the corresponding dimensions value from the tables above.

StructBERT FAQ models use a different pipeline task. See Generate embeddings with StructBERT.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from typing import List
from dashvector import Client

# Replace <model-id> with your chosen model ID, e.g., damo/nlp_corom_sentence-embedding_english-base
pipeline_se = pipeline(Tasks.sentence_embedding, model='<model-id>')


def generate_embeddings(texts: List[str]):
    inputs = {'source_sentence': texts}
    result = pipeline_se(input=inputs)
    return result['text_embedding']


# Create a VRS client.
client = Client(
    api_key='{your-dashvector-api-key}',
    endpoint='{your-dashvector-cluster-endpoint}'
)

# Create a collection.
# Set dimension to match the model's output dimensions from the table above.
# For example, use 768 for CoROM base models and 256 for CoROM tiny models.
rsp = client.create('text-embedding', dimension=<dimensions>)
assert rsp
collection = client.get('text-embedding')
assert collection

# Insert a vector.
collection.insert(
    ('ID1', generate_embeddings(['Alibaba Cloud DashVector is one of the best vector databases in performance and cost-effectiveness.'])[0])
)

# Run a vector search.
docs = collection.query(
    generate_embeddings(['The best vector database'])[0]
)
print(docs)
```

Replace the following placeholders in the code:

Placeholder	Description	Example
`{your-dashvector-api-key}`	Your VRS API key	—
`{your-dashvector-cluster-endpoint}`	Your VRS cluster endpoint	—
`<model-id>`	Model ID from the tables above	`damo/nlp_corom_sentence-embedding_english-base`
`<dimensions>`	Vector dimensions for the chosen model	`768`
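
To index more than one text, it can help to embed a batch in one call and check each vector's length against the collection dimension before inserting. The sketch below assumes an `embed` callable with the same shape as `generate_embeddings` above (a list of strings in, one vector per string out). `build_docs` and its `id_prefix` parameter are hypothetical names, not SDK functions.

```python
from typing import Callable, List, Sequence, Tuple


def build_docs(
    texts: List[str],
    embed: Callable[[List[str]], Sequence[Sequence[float]]],
    dimension: int,
    id_prefix: str = "ID",
) -> List[Tuple[str, Sequence[float]]]:
    """Embed texts in one call and pair each vector with a generated ID."""
    vectors = embed(texts)
    if len(vectors) != len(texts):
        raise ValueError("Embedding function returned a different number of vectors")
    docs = []
    for i, vector in enumerate(vectors, start=1):
        # Fail early if the model output does not match the collection dimension.
        if len(vector) != dimension:
            raise ValueError(
                f"Vector {i} has {len(vector)} dimensions, expected {dimension}"
            )
        docs.append((f"{id_prefix}{i}", vector))
    return docs
```

Each item has the same `(id, vector)` shape as the tuple passed to `collection.insert()` above. Whether a whole list can be passed in a single `insert()` call, and how large that list may be, should be confirmed in the DashVector SDK reference.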

Generate embeddings with StructBERT

StructBERT FAQ models use the faq_question_answering task instead of sentence_embedding. Replace <model-id> with the StructBERT model ID you want to use.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from typing import List

# Replace <model-id> with a StructBERT model ID,
# e.g., damo/nlp_structbert_faq-question-answering_chinese-base
# The result is bound to a new name so that it does not shadow the imported pipeline() function.
pipeline_faq = pipeline(Tasks.faq_question_answering, model='<model-id>')


def generate_embeddings(texts: List[str]):
    # Returns one embedding per input text.
    return pipeline_faq.get_sentence_embedding(texts)
```

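To store the resulting vectors in VRS and run searches, use the same client, collection.insert(), and collection.query() calls shown in the previous section. Create the collection with dimension=768, the value listed for every StructBERT model in the table above.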
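
Because only the embedding function changes between the two pipeline types, the storage and search steps can be written once and reused. The following sketch assumes `collection` is the object returned by `client.get()` and `embed` is either version of `generate_embeddings`. `store_and_search` is a hypothetical wrapper, not an SDK function.

```python
from typing import Any, Callable, List, Sequence


def store_and_search(
    collection: Any,
    embed: Callable[[List[str]], Sequence[Sequence[float]]],
    doc_id: str,
    text: str,
    query_text: str,
) -> Any:
    """Insert one embedded text, then query the collection with an embedded query."""
    # Embed both texts in a single call so the model runs only once.
    doc_vector, query_vector = embed([text, query_text])
    collection.insert((doc_id, doc_vector))
    return collection.query(query_vector)
```

Passing the collection and the embedding function in as arguments also makes the flow easy to exercise with stand-in objects before connecting it to a real cluster.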