Use open-source embedding models from ModelScope to convert text into vectors


ModelScope hosts hundreds of open-source embedding models you can use free of charge. This guide shows how to pick a model, generate embeddings with it, and store the resulting vectors in Vector Retrieval Service (VRS) for semantic search.

Prerequisites

Before you begin, ensure that you have:

Choose a model

Browse all sentence embedding models on ModelScope to find a model that fits your language and domain. The tables below list the most commonly used model families. For each model, note the model ID and dimensions, because the code needs both values.

CoROM models

CoROM models cover Chinese and English across general, eCommerce, and healthcare domains. See the full CoROM model list.

| Model ID | Dimensions | Distance metric | Data type | Domain | Max text length |
| --- | --- | --- | --- | --- | --- |
| damo/nlp_corom_sentence-embedding_chinese-base | 768 | Cosine | Float32 | Chinese, general, base | 512 |
| damo/nlp_corom_sentence-embedding_english-base | 768 | Cosine | Float32 | English, general, base | 512 |
| damo/nlp_corom_sentence-embedding_chinese-base-ecom | 768 | Cosine | Float32 | Chinese, eCommerce, base | 512 |
| damo/nlp_corom_sentence-embedding_chinese-base-medical | 768 | Cosine | Float32 | Chinese, healthcare, base | 512 |
| damo/nlp_corom_sentence-embedding_chinese-tiny | 256 | Cosine | Float32 | Chinese, general, tiny | 512 |
| damo/nlp_corom_sentence-embedding_english-tiny | 256 | Cosine | Float32 | English, general, tiny | 512 |
| damo/nlp_corom_sentence-embedding_chinese-tiny-ecom | 256 | Cosine | Float32 | Chinese, eCommerce, tiny | 512 |
| damo/nlp_corom_sentence-embedding_chinese-tiny-medical | 256 | Cosine | Float32 | Chinese, healthcare, tiny | 512 |

GTE models

GTE models cover Chinese and English in small, base, and large sizes. See the full GTE model list.

| Model ID | Dimensions | Distance metric | Data type | Domain | Max text length |
| --- | --- | --- | --- | --- | --- |
| damo/nlp_gte_sentence-embedding_chinese-base | 768 | Cosine | Float32 | Chinese, general, base | 512 |
| damo/nlp_gte_sentence-embedding_chinese-large | 768 | Cosine | Float32 | Chinese, general, large | 512 |
| damo/nlp_gte_sentence-embedding_chinese-small | 512 | Cosine | Float32 | Chinese, general, small | 512 |
| damo/nlp_gte_sentence-embedding_english-base | 768 | Cosine | Float32 | English, general, base | 512 |
| damo/nlp_gte_sentence-embedding_english-large | 768 | Cosine | Float32 | English, general, large | 512 |
| damo/nlp_gte_sentence-embedding_english-small | 384 | Cosine | Float32 | English, general, small | 512 |

Udever multilingual models

Udever models support multiple languages with a maximum text length of 2,048 tokens. See the full Udever model list.

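| Model ID | Dimensions | Distance metric | Data type | Parameters | Max text length |
| --- | --- | --- | --- | --- | --- |
| damo/udever-bloom-560m | 1,024 | Cosine | Float32 | 560m | 2,048 |
| damo/udever-bloom-1b1 | 1,536 | Cosine | Float32 | 1b1 | 2,048 |
| damo/udever-bloom-3b | 2,048 | Cosine | Float32 | 3b | 2,048 |
| damo/udever-bloom-7b1 | 4,096 | Cosine | Float32 | 7b1 | 2,048 |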

StructBERT FAQ models

StructBERT models are optimized for FAQ question answering and have no maximum text length limit. They use a different pipeline task; see Generate embeddings with StructBERT. See the full StructBERT model list.

| Model ID | Dimensions | Distance metric | Data type | Domain |
| --- | --- | --- | --- | --- |
| damo/nlp_structbert_faq-question-answering_chinese-base | 768 | Cosine | Float32 | Chinese, general, base |
| damo/nlp_structbert_faq-question-answering_chinese-finance-base | 768 | Cosine | Float32 | Chinese, finance, base |
| damo/nlp_structbert_faq-question-answering_chinese-gov-base | 768 | Cosine | Float32 | Chinese, eGovernment, base |

More models

The following models are also available on ModelScope:

| Model name | Model ID | Dimensions | Distance metric | Data type | Max text length |
| --- | --- | --- | --- | --- | --- |
| BERT entity embedding (Chinese) | damo/nlp_bert_entity-embedding_chinese-base | 768 | Cosine | Float32 | 128 (Details) |
| MiniLM (English, text retrieval) | damo/nlp_minilm_ibkd_sentence-embedding_english-msmarco | 384 | Cosine | Float32 | 128 (Details) |
| MiniLM (English, STS) | damo/nlp_minilm_ibkd_sentence-embedding_english-sts | 384 | Cosine | Float32 | 128 (Details) |
| text2vec-base-chinese | thomas/text2vec-base-chinese | 768 | Cosine | Float32 | unknown (Details) |
| text2vec-large-chinese | thomas/text2vec-large-chinese | 1,024 | Cosine | Float32 | unknown (Details) |
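
Every model in the tables above is listed with the Cosine distance metric. As a rough illustration of what that metric measures, the following sketch computes cosine similarity between two vectors in plain Python. The function name cosine_similarity is made up for this example and is not part of ModelScope or VRS; how the service reports a score (for example, as a distance rather than a similarity) may differ.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because the result depends only on direction, two vectors that point the same way have a similarity of 1 regardless of their lengths, and orthogonal vectors have a similarity of 0.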

Generate embeddings and store them in VRS

The code pattern is the same for all models listed above (CoROM, GTE, Udever, and the additional models). Replace <model-id> with your chosen model ID and <dimensions> with the matching dimensions value from the tables above.

Note: StructBERT FAQ models use a different pipeline task. See Generate embeddings with StructBERT.

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from typing import List
from dashvector import Client

# Replace <model-id> with your chosen model ID, e.g., damo/nlp_corom_sentence-embedding_english-base
pipeline_se = pipeline(Tasks.sentence_embedding, model='<model-id>')


def generate_embeddings(texts: List[str]):
    inputs = {'source_sentence': texts}
    result = pipeline_se(input=inputs)
    return result['text_embedding']


# Create a VRS client.
client = Client(
    api_key='{your-dashvector-api-key}',
    endpoint='{your-dashvector-cluster-endpoint}'
)

# Create a collection.
# Set dimension to match the model's output dimensions from the table above.
# For example, use 768 for CoROM base models and 256 for CoROM tiny models.
rsp = client.create('text-embedding', dimension=<dimensions>)
assert rsp
collection = client.get('text-embedding')
assert collection

# Insert a vector
collection.insert(
    ('ID1', generate_embeddings(['Alibaba Cloud DashVector is one of the best vector databases in performance and cost-effectiveness.'])[0])
)

# Run a vector search
docs = collection.query(
    generate_embeddings(['The best vector database'])[0]
)
print(docs)

Replace the following placeholders in the code:

| Placeholder | Description | Example |
| --- | --- | --- |
| {your-dashvector-api-key} | Your VRS API key | |
| {your-dashvector-cluster-endpoint} | Your VRS cluster endpoint | |
| <model-id> | Model ID from the tables above | damo/nlp_corom_sentence-embedding_english-base |
| <dimensions> | Vector dimensions for the chosen model | 768 |
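
Because the model ID and the collection dimension must agree, one option is to keep both in a single lookup and check vectors before inserting them. The sketch below is only an illustration: MODEL_DIMENSIONS, dimension_for, and check_vector are hypothetical names, and the dictionary holds just a few entries copied from the tables in this guide.

```python
from typing import Dict, Sequence

# A few entries copied from the model tables in this guide; extend as needed.
MODEL_DIMENSIONS: Dict[str, int] = {
    "damo/nlp_corom_sentence-embedding_chinese-base": 768,
    "damo/nlp_corom_sentence-embedding_english-base": 768,
    "damo/nlp_corom_sentence-embedding_chinese-tiny": 256,
    "damo/nlp_corom_sentence-embedding_english-tiny": 256,
    "damo/nlp_gte_sentence-embedding_english-small": 384,
}


def dimension_for(model_id: str) -> int:
    """Look up the output dimensions recorded for a model ID."""
    try:
        return MODEL_DIMENSIONS[model_id]
    except KeyError:
        raise KeyError(f"no dimensions recorded for model {model_id!r}") from None


def check_vector(vector: Sequence[float], model_id: str) -> None:
    """Raise ValueError if the vector length does not match the model."""
    expected = dimension_for(model_id)
    if len(vector) != expected:
        raise ValueError(
            f"expected {expected} dimensions for {model_id}, got {len(vector)}"
        )
```

With a helper like this, the value passed as dimension when creating the collection can come from dimension_for('<model-id>') instead of being typed by hand.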

Generate embeddings with StructBERT

StructBERT FAQ models use the faq_question_answering task instead of sentence_embedding. Replace <model-id> with the StructBERT model ID you want to use.

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from typing import List

# Replace <model-id> with a StructBERT model ID,
# e.g., damo/nlp_structbert_faq-question-answering_chinese-base
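# Use a distinct variable name so the imported pipeline() factory is not overwritten.
pipeline_faq = pipeline(Tasks.faq_question_answering, model='<model-id>')


def generate_embeddings(texts: List[str]):
    # Returns one embedding per input text.
    return pipeline_faq.get_sentence_embedding(texts)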

To store the resulting vectors in VRS and run searches, use the same client, collection.insert(), and collection.query() calls shown in the previous section.
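
When there are many texts to store, embedding and inserting them in batches keeps each call small. The following is a minimal sketch under stated assumptions: index_texts, batch_size, and id_prefix are hypothetical names, the embed argument is any function shaped like generate_embeddings, and collection.insert() is assumed to accept a list of (id, vector) tuples. Checking the insert response is left out for brevity.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def index_texts(
    collection,
    embed: Callable[[List[str]], Sequence[Vector]],
    texts: List[str],
    batch_size: int = 16,
    id_prefix: str = "doc",
) -> int:
    """Embed texts in batches and insert them; return the number inserted."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    inserted = 0
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed(batch)
        # Build IDs from the position of each text in the full list.
        docs: List[Tuple[str, Vector]] = [
            (f"{id_prefix}-{start + offset}", vector)
            for offset, vector in enumerate(vectors)
        ]
        collection.insert(docs)
        inserted += len(docs)
    return inserted
```

A call such as index_texts(collection, generate_embeddings, texts) would then work with either the sentence_embedding or the StructBERT version of generate_embeddings, since both take a list of strings and return one vector per string.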

What's next