Vectorize multi-modal data by using open source multi-modal embedding models of ModelScope-DashVector(DashVector)-阿里云帮助中心

ModelScope is an open-source model-as-a-service (MaaS) platform that provides pre-trained models for AI developers. On ModelScope, you can:

Use and download pre-trained models free of charge.
Perform command line-based model prediction to validate model effects quickly.
Fine-tune models with your own data for customization.
Engage in theoretical and practical training to improve your R&D abilities.
Share your ideas with the entire community.

This guide shows how to use its multi-modal embedding models to generate vectors from images and text, store those vectors in DashVector, and run cross-modal vector searches.

How it works

CLIP (Contrastive Language-Image Pre-training) encodes images and text into vectors within the same latent space, so both types of content are directly comparable. Query with text and retrieve matching images, or query with an image and retrieve matching text — no separate indexes required.

For example, the query "The best vector database" returns semantically similar results regardless of whether they are text or images.

Prerequisites

Before you begin, ensure that you have:

A DashVector cluster. See Create a cluster.
A DashVector API key. See Manage API keys.
The DashVector SDK installed. See Install DashVector SDK.
The ModelScope SDK installed:
```
pip install -U modelscope
```

CLIP model

The Chinese CLIP model is trained on up to 0.2 billion text-graphic pairs in Chinese. It supports graphic search and feature extraction for images and text, and suits search and recommendation scenarios.

Choose a model variant based on the vector dimension you need:

Model ID	Dimensions	Distance metric	Data type	Language	Resolution	Max text length
`damo/multi-modal_clip-vit-base-patch16_zh`	512	Cosine	Float32	Chinese	224 px	512
`damo/multi-modal_clip-vit-large-patch14_zh`	768	Cosine	Float32	Chinese	224 px	512
`damo/multi-modal_clip-vit-huge-patch14_zh`	1,024	Cosine	Float32	Chinese	224 px	512
`damo/multi-modal_clip-vit-large-patch14_336_zh`	768	Cosine	Float32	Chinese	336 px	512

For the full list of CLIP models available on ModelScope, see the ModelScope CLIP model catalog.

Vectorize and search multi-modal data

The steps below use the CLIP model to generate text and image embeddings, store them in a DashVector collection, and run a cross-modal vector search.

Replace these placeholders in the code before running:

Placeholder	Description
`<your-dashvector-api-key>`	Your DashVector API key
`<your-dashvector-cluster-endpoint>`	The endpoint of your DashVector cluster
`<model-id>`	A model ID from the table above (e.g., `damo/multi-modal_clip-vit-large-patch14_zh`)
`<model-dim>`	The vector dimension matching your chosen model (e.g., `768`)

Step 1: Set up the embedding pipeline and DashVector client

from modelscope.utils.constant import Tasks
from modelscope.pipelines import pipeline
from modelscope.preprocessors.image import load_image
from typing import List
from dashvector import Client

# Load the CLIP model pipeline
clip_pipeline = pipeline(task=Tasks.multi_modal_embedding, model='<model-id>')


def generate_text_embeddings(texts: List[str]):
    inputs = {'text': texts}
    result = clip_pipeline.forward(input=inputs)
    return result['text_embedding'].numpy()


def generate_img_embeddings(img: str):
    input_img = load_image(img)
    inputs = {'img': input_img}
    result = clip_pipeline.forward(input=inputs)
    return result['img_embedding'].numpy()[0]


# Initialize the DashVector client
client = Client(
    api_key='<your-dashvector-api-key>',
    endpoint='<your-dashvector-cluster-endpoint>'
)

Step 2: Create a collection

Text and image vectors from the same CLIP model share the same dimension, so you can store both in a single collection and search across modalities.

# Create a collection with the dimension matching your chosen CLIP model
rsp = client.create('CLIP-embedding', dimension=<model-dim>)
assert rsp
collection = client.get('CLIP-embedding')
assert collection

Step 3: Insert text and image vectors

# Generate embeddings and insert them into the collection
collection.insert(
    [
        ('ID1', generate_text_embeddings(['DashVector developed by Alibaba Cloud is one of the most high-performance and cost-effective vector database services.'])[0]),
        ('ID2', generate_img_embeddings('https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg'))
    ]
)

Step 4: Run a cross-modal vector search

Use a text query to search across both text and image vectors in the collection.

# Query with text to find semantically similar results across all modalities
docs = collection.query(
    generate_text_embeddings(['The best vector database'])[0]
)
print(docs)