ModelScope is an open-source model-as-a-service (MaaS) platform that provides pre-trained models for AI developers. On ModelScope, you can:
Use and download pre-trained models free of charge.
Perform command line-based model prediction to validate model effects quickly.
Fine-tune models with your own data for customization.
Engage in theoretical and practical training to improve your R&D abilities.
Share your ideas with the entire community.
This guide shows how to use its multi-modal embedding models to generate vectors from images and text, store those vectors in DashVector, and run cross-modal vector searches.
How it works
CLIP (Contrastive Language-Image Pre-training) encodes images and text into vectors within the same latent space, so both types of content are directly comparable. Query with text and retrieve matching images, or query with an image and retrieve matching text — no separate indexes required.
For example, the query "The best vector database" returns semantically similar results regardless of whether they are text or images.
Prerequisites
Before you begin, ensure that you have:
A DashVector cluster. See Create a cluster.
A DashVector API key. See Manage API keys.
The DashVector SDK installed. See Install DashVector SDK.
The ModelScope SDK installed:
pip install -U modelscope
CLIP model
The Chinese CLIP model is trained on up to 0.2 billion text-graphic pairs in Chinese. It supports graphic search and feature extraction for images and text, and suits search and recommendation scenarios.
Choose a model variant based on the vector dimension you need:
| Model ID | Dimensions | Distance metric | Data type | Language | Resolution | Max text length |
|---|---|---|---|---|---|---|
damo/multi-modal_clip-vit-base-patch16_zh | 512 | Cosine | Float32 | Chinese | 224 px | 512 |
damo/multi-modal_clip-vit-large-patch14_zh | 768 | Cosine | Float32 | Chinese | 224 px | 512 |
damo/multi-modal_clip-vit-huge-patch14_zh | 1,024 | Cosine | Float32 | Chinese | 224 px | 512 |
damo/multi-modal_clip-vit-large-patch14_336_zh | 768 | Cosine | Float32 | Chinese | 336 px | 512 |
For the full list of CLIP models available on ModelScope, see the ModelScope CLIP model catalog.
Vectorize and search multi-modal data
The steps below use the CLIP model to generate text and image embeddings, store them in a DashVector collection, and run a cross-modal vector search.
Replace these placeholders in the code before running:
| Placeholder | Description |
|---|---|
<your-dashvector-api-key> | Your DashVector API key |
<your-dashvector-cluster-endpoint> | The endpoint of your DashVector cluster |
<model-id> | A model ID from the table above (e.g., damo/multi-modal_clip-vit-large-patch14_zh) |
<model-dim> | The vector dimension matching your chosen model (e.g., 768) |
Step 1: Set up the embedding pipeline and DashVector client
from modelscope.utils.constant import Tasks
from modelscope.pipelines import pipeline
from modelscope.preprocessors.image import load_image
from typing import List
from dashvector import Client
# Load the CLIP model pipeline
clip_pipeline = pipeline(task=Tasks.multi_modal_embedding, model='<model-id>')
def generate_text_embeddings(texts: List[str]):
inputs = {'text': texts}
result = clip_pipeline.forward(input=inputs)
return result['text_embedding'].numpy()
def generate_img_embeddings(img: str):
input_img = load_image(img)
inputs = {'img': input_img}
result = clip_pipeline.forward(input=inputs)
return result['img_embedding'].numpy()[0]
# Initialize the DashVector client
client = Client(
api_key='<your-dashvector-api-key>',
endpoint='<your-dashvector-cluster-endpoint>'
)Step 2: Create a collection
Text and image vectors from the same CLIP model share the same dimension, so you can store both in a single collection and search across modalities.
# Create a collection with the dimension matching your chosen CLIP model
rsp = client.create('CLIP-embedding', dimension=<model-dim>)
assert rsp
collection = client.get('CLIP-embedding')
assert collectionStep 3: Insert text and image vectors
# Generate embeddings and insert them into the collection
collection.insert(
[
('ID1', generate_text_embeddings(['DashVector developed by Alibaba Cloud is one of the most high-performance and cost-effective vector database services.'])[0]),
('ID2', generate_img_embeddings('https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg'))
]
)Step 4: Run a cross-modal vector search
Use a text query to search across both text and image vectors in the collection.
# Query with text to find semantically similar results across all modalities
docs = collection.query(
generate_text_embeddings(['The best vector database'])[0]
)
print(docs)