Multimodal embedding models convert text, images, and videos into embeddings in a shared semantic space to enable cross-modal retrieval, content classification, and similarity search.
Core capabilities
- Cross-modal retrieval: Perform semantic searches across different content types, such as text-to-image, image-to-video, or image-to-image.
- Semantic similarity: Measure the semantic similarity between different content types in a unified embedding space.
- Content classification and clustering: Group, label, and cluster content based on semantic embeddings.
Key feature: Embeddings for all modalities (text, images, and video) share the same semantic space, enabling direct cross-modal matching and comparison using methods such as cosine similarity. See text and multimodal embedding for details on model selection and usage.
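As a rough illustration of cross-modal matching, the sketch below ranks image embeddings against a text embedding by cosine similarity. The helper names (`cosine_similarity`, `rank_by_similarity`) and the toy 3-dimensional vectors are assumptions made for this example; real embeddings come from the API and have the dimensions listed in the model overview.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Sort (name, score) pairs from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for a text embedding and two image embeddings.
text_query = [1.0, 0.0, 0.0]
image_vectors = {
    "shoe.png": [0.9, 0.1, 0.0],
    "cat.png": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity(text_query, image_vectors)
print(ranking[0][0])  # shoe.png
```

Because all modalities share one space, the same function can compare text with images, images with video, or any other pair, provided both vectors were produced by the same model with the same dimension.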
Embedding types
The multimodal embedding model supports two methods for generating embeddings:
- Multimodal independent embedding: Generates a separate embedding for each input (text, an image, a video, or multiple images) within contents. For example, an input of one text string and one image returns two independent embeddings. This is ideal for comparing individual items, such as in image-to-image or text-to-image searches.
- Multimodal fused embedding: Fuses all inputs in contents into a single embedding to achieve a unified cross-modal semantic representation. This is suitable for scenarios that require a holistic understanding of multimodal content, such as fusing a product image and its description text into a unified representation for retrieval. For qwen3-vl-embedding, you enable fusion by setting enable_fusion=true; for tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06, you achieve fusion by placing text, image, and video in the same content object. The fused embedding supports the following combinations:
  - Text and image fusion
  - Text and video fusion
  - Fusing multiple images with text (by passing multiple image entries)
  - Fusion of images, video, and text
qwen2.5-vl-embedding supports only fused embeddings, not independent embeddings. tongyi-embedding-vision-plus and tongyi-embedding-vision-flash support only independent embeddings. tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06 support both independent embeddings and fused embeddings. With these two snapshot models, you create fused embeddings by placing text, image, and video in the same content object.
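The difference between the two input layouts can be sketched as follows. The helper `build_contents` and its `same_object` flag are hypothetical names introduced for illustration; only the resulting list shapes are taken from the description above.

```python
def build_contents(text=None, image=None, video=None, same_object=False):
    """Build a contents list from optional text, image URL, and video URL.

    same_object=False: one content object per modality.
    same_object=True: a single content object holding every modality.
    """
    fields = {"text": text, "image": image, "video": video}
    present = {key: value for key, value in fields.items() if value is not None}
    if not present:
        raise ValueError("at least one of text, image, or video is required")
    if same_object:
        return [present]
    return [{key: value} for key, value in present.items()]


# Placeholder URL used only to show the structure.
separate = build_contents(text="White sports shoes", image="https://example.com/shoe.png")
combined = build_contents(text="White sports shoes", image="https://example.com/shoe.png",
                          same_object=True)
print(len(separate), len(combined))  # 2 1
```

With the 2026-03-06 snapshot models, the single-object layout requests a fused embedding. With qwen3-vl-embedding, the per-modality layout is kept and fusion is requested through enable_fusion instead.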
For model introductions, selection guidance, and usage instructions, see Text and multimodal embedding.
Model overview
China (Beijing)
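| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price (per 1,000 input tokens) | Free quota (Note) |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-vl-embedding | 2560 (default), 2048, 1536, 1024, 768, 512, 256 | 32,000 tokens | Up to 5 MB per image | Up to 50 MB per video file | Image/Video: CNY 0.0018; Text: CNY 0.0007 | 1 million tokens, valid for 90 days after activating Model Studio |
| qwen2.5-vl-embedding | 2048, 1024 (default), 768, 512 | 32,000 tokens | Up to 5 MB per image | Up to 50 MB per video file | Image/Video: CNY 0.0018; Text: CNY 0.0007 | 1 million tokens, valid for 90 days after activating Model Studio |
| tongyi-embedding-vision-plus-2026-03-06 | 1152 (default), 1024, 512, 256, 128, 64 | 1,024 tokens | Recommended: up to 5 MB per image; maximum: 10 MB. Supports up to 64 images. | Up to 50 MB per video file. The file must be encoded in H.264 or H.265. | CNY 0.0005 | 1 million tokens, valid for 90 days after activating Model Studio |
| tongyi-embedding-vision-flash-2026-03-06 | 768 (default), 512, 256, 128, 64 | 1,024 tokens | Recommended: up to 5 MB per image; maximum: 10 MB. Supports up to 64 images. | Up to 50 MB per video file. The file must be encoded in H.264 or H.265. | CNY 0.00015 | 1 million tokens, valid for 90 days after activating Model Studio |
| tongyi-embedding-vision-plus | 1152 | 1,024 tokens | Up to 3 MB per image. Supports up to 8 images. | Up to 10 MB per video file | CNY 0.0005 | 1 million tokens, valid for 90 days after activating Model Studio |
| tongyi-embedding-vision-flash | 768 | 1,024 tokens | Up to 3 MB per image. Supports up to 8 images. | Up to 10 MB per video file | CNY 0.00015 | 1 million tokens, valid for 90 days after activating Model Studio |
| multimodal-embedding-v1 | 1024 | 512 tokens | Up to 3 MB per image | Up to 10 MB per video file | Image/Video: CNY 0.0009; Text: CNY 0.0007 | 1 million tokens, valid for 90 days after activating Model Studio |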
Singapore
| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price (per 1,000 input tokens) |
| --- | --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | 1152 | 1,024 tokens | Up to 8 images, 3 MB each | Up to 10 MB per video file | CNY 0.0005 |
| tongyi-embedding-vision-flash | 768 | 1,024 tokens | Up to 8 images, 3 MB each | Up to 10 MB per video file | CNY 0.00015 |
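Since prices are quoted per 1,000 input tokens, a request's cost can be estimated from its token counts. The sketch below is only an illustration: `PRICES` and `estimate_cost` are hypothetical names, the rates are copied from the China (Beijing) table for three of the models, and the free quota is ignored.

```python
from decimal import Decimal

# CNY per 1,000 input tokens, copied from the China (Beijing) table.
# "visual" covers image and video tokens; single-rate models use one price for both.
PRICES = {
    "qwen3-vl-embedding": {"visual": Decimal("0.0018"), "text": Decimal("0.0007")},
    "tongyi-embedding-vision-plus": {"visual": Decimal("0.0005"), "text": Decimal("0.0005")},
    "tongyi-embedding-vision-flash": {"visual": Decimal("0.00015"), "text": Decimal("0.00015")},
}


def estimate_cost(model, text_tokens=0, visual_tokens=0):
    """Return the estimated cost in CNY for the given input token counts."""
    price = PRICES[model]
    total = Decimal(text_tokens) * price["text"] + Decimal(visual_tokens) * price["visual"]
    return total / Decimal(1000)


# 10 text tokens plus 1,000 image tokens on qwen3-vl-embedding.
cost = estimate_cost("qwen3-vl-embedding", text_tokens=10, visual_tokens=1000)
print(f"Estimated cost: CNY {cost}")
```

Decimal is used instead of float so that very small per-token rates do not accumulate rounding error over many requests.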
Input formats and usage limits
Fused multimodal models
| Model | Text | Image | Video | Request limit |
| --- | --- | --- | --- | --- |
| qwen3-vl-embedding | Supports 33 major languages, including Chinese, English, Japanese, Korean, French, and German. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64 supported) | MP4, AVI, MOV (URL only) | Up to 20 content elements per request, with a maximum of 5 images and 1 video. |
| qwen2.5-vl-embedding | Supports 11 major languages, including Chinese, English, Japanese, Korean, French, and German. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64 supported) | MP4, AVI, MOV (URL only) | Each request is limited to one of each input type: image, text, video, or fused object. |

Independent multimodal models
| Model | Text | Image | Video | Request limit |
| --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus-2026-03-06 | Supports over 30 major languages, including Chinese, English, Japanese, and Korean. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | Up to 20 content elements per request, with a maximum of 64 images and 8 videos. |
| tongyi-embedding-vision-flash-2026-03-06 | Supports over 30 major languages, including Chinese, English, Japanese, and Korean. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | Up to 20 content elements per request, with a maximum of 64 images and 8 videos. |
| tongyi-embedding-vision-plus | Chinese and English | JPG, PNG, BMP (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | No limit on the number of content elements. The total number of input tokens must not exceed the batch processing token limit. |
| tongyi-embedding-vision-flash | Chinese and English | JPG, PNG, BMP (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | No limit on the number of content elements. The total number of input tokens must not exceed the batch processing token limit. |
| multimodal-embedding-v1 | Chinese and English | JPG, PNG, BMP (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | Up to 20 content elements per request, with a maximum of 20 text segments, 1 image, and 1 video. |
All models accept text, image, and video inputs, individually or in combination. The tongyi-embedding-vision-plus, tongyi-embedding-vision-flash, tongyi-embedding-vision-plus-2026-03-06, and tongyi-embedding-vision-flash-2026-03-06 models also support multi_images for image sequences.
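Checking a request against these limits before sending it can avoid a failed call. The following is a minimal sketch covering two of the models; `REQUEST_LIMITS` and `check_request_limits` are hypothetical names, and the numbers are copied from the request limits stated above.

```python
from collections import Counter

# Per-request limits copied from the tables above; a missing key means no stated cap.
REQUEST_LIMITS = {
    "qwen3-vl-embedding": {"elements": 20, "image": 5, "video": 1},
    "multimodal-embedding-v1": {"elements": 20, "text": 20, "image": 1, "video": 1},
}


def check_request_limits(model, contents):
    """Return a list of human-readable limit violations (empty if none)."""
    limits = REQUEST_LIMITS[model]
    problems = []
    if len(contents) > limits["elements"]:
        problems.append(
            f"{len(contents)} content elements exceeds the limit of {limits['elements']}"
        )
    # Count how many times each modality key appears across all content objects.
    counts = Counter(key for item in contents for key in item)
    for kind in ("text", "image", "video"):
        cap = limits.get(kind)
        if cap is not None and counts[kind] > cap:
            problems.append(f"{counts[kind]} {kind} inputs exceeds the limit of {cap}")
    return problems


# One text element plus six placeholder image URLs: one image too many for qwen3-vl-embedding.
contents = [{"text": "White sports shoes"}] + [
    {"image": f"https://example.com/{i}.png"} for i in range(6)
]
print(check_request_limits("qwen3-vl-embedding", contents))
```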
Model capabilities
|
Model |
Default dimension |
Vector type |
Supported inputs |
Description |
|
qwen3-vl-embedding |
2560 |
Independent / Fusion |
text, image, video, multiple images |
Fusion mode, enabled with the |
|
qwen2.5-vl-embedding |
1024 |
Fusion only |
text, image, video |
Always returns a single fused vector. Independent vectors and multi-image inputs are not supported. |
|
tongyi-embedding-vision-plus-2026-03-06 |
1152 |
Independent / Fusion |
text, image, video, multi_images |
Based on the Qwen3 foundation model. Supports multiple resolutions, 30+ languages, and fused vectors. |
|
tongyi-embedding-vision-flash-2026-03-06 |
768 |
|||
|
tongyi-embedding-vision-plus |
1152 |
Independent only |
Supports |
|
|
tongyi-embedding-vision-flash |
768 |
|||
|
multimodal-embedding-v1 |
1024 |
text, image, video |
The vector dimension is fixed at 1,024 and cannot be configured. |
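Because each model accepts only specific dimensions, it can help to validate the requested value on the client side. The sketch below assumes a hypothetical lookup table, `SUPPORTED_DIMENSIONS`, filled in from the dimension values listed in the model overview, with each model's default placed first.

```python
# Dimensions per model, default first, copied from the model overview.
SUPPORTED_DIMENSIONS = {
    "qwen3-vl-embedding": (2560, 2048, 1536, 1024, 768, 512, 256),
    "qwen2.5-vl-embedding": (1024, 2048, 768, 512),
    "tongyi-embedding-vision-plus-2026-03-06": (1152, 1024, 512, 256, 128, 64),
    "tongyi-embedding-vision-flash-2026-03-06": (768, 512, 256, 128, 64),
    "tongyi-embedding-vision-plus": (1152,),
    "tongyi-embedding-vision-flash": (768,),
    "multimodal-embedding-v1": (1024,),
}


def resolve_dimension(model, dimension=None):
    """Return the dimension to request, falling back to the model's default."""
    supported = SUPPORTED_DIMENSIONS[model]
    if dimension is None:
        return supported[0]
    if dimension not in supported:
        raise ValueError(
            f"{model} does not support dimension {dimension}; choose one of {supported}"
        )
    return dimension


print(resolve_dimension("qwen3-vl-embedding"))         # 2560
print(resolve_dimension("qwen2.5-vl-embedding", 768))  # 768
```

Vectors of different dimensions cannot be compared directly, so an index and its queries should use the same model and the same resolved dimension.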
Prerequisites
Obtain an API key and export the API key as an environment variable. If you use an SDK to make calls, install the DashScope SDK.
HTTP call
POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
Request
Multimodal independent embedding
Multimodal fused embedding
2026-03-06 snapshot example
To generate a fused embedding with the 2026-03-06 version, place text, image, and video in the same content object. The model combines all inputs into a single embedding of type fused.
|
Request headers |
|
|
Content-Type The content type of the request. Must be |
|
|
Authorization Authenticates the request with a Model Studio API key. Example: Bearer sk-xxxx. |
|
Request body |
|
|
model The model name. Select a model from the Model overview. |
|
|
input The input content. |
|
|
parameters Embedding processing parameters. For HTTP calls, you must wrap these parameters in the parameters object. For SDK calls, you can use these parameters directly. |
Response |
Successful response
Error response
|
|
output Task output. |
|
|
request_id Unique request identifier for tracing and troubleshooting. |
|
|
code Error code. Returned only for failed requests. See Error codes. |
|
|
message Detailed error message. Returned only for failed requests. See Error codes. |
|
|
usage Statistics about token usage. |
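The response fields above suggest a simple handling pattern: raise if code is present, otherwise read the vectors from output. The sketch below assumes the response has already been parsed into a dict shaped like the output example at the end of this page; `EmbeddingRequestError`, `extract_embeddings`, and the error code "ExampleError" are hypothetical names for illustration.

```python
class EmbeddingRequestError(Exception):
    """Raised when a response carries an error code."""


def extract_embeddings(response):
    """Return (type, vector) pairs ordered by index, or raise on failure."""
    if response.get("code"):
        raise EmbeddingRequestError(
            f"{response['code']}: {response.get('message', '')} "
            f"(request_id={response.get('request_id', '')})"
        )
    items = response["output"]["embeddings"]
    ordered = sorted(items, key=lambda item: item["index"])
    return [(item["type"], item["embedding"]) for item in ordered]


# Shortened sample response with two embeddings returned out of order.
sample = {
    "request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
    "code": "",
    "message": "",
    "output": {
        "embeddings": [
            {"index": 1, "embedding": [0.1, 0.2], "type": "image"},
            {"index": 0, "embedding": [0.3, 0.4], "type": "text"},
        ]
    },
}
pairs = extract_embeddings(sample)
print([kind for kind, _ in pairs])  # ['text', 'image']
```

Including request_id in the raised message keeps the identifier available for tracing and troubleshooting.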
SDK usage
The SDK's input parameter maps to input.contents in the HTTP request body, but their structures are different.
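To illustrate the mapping, the sketch below wraps an SDK-style list of content objects into an HTTP request using only the Python standard library. It rests on several assumptions that are not confirmed by this page: that the list is passed unchanged under input.contents, that Content-Type is application/json, and that `build_request` (a hypothetical helper) is an acceptable way to assemble the call.

```python
import json
import os
import urllib.request

ENDPOINT = (
    "https://dashscope.aliyuncs.com/api/v1/services/embeddings/"
    "multimodal-embedding/multimodal-embedding"
)


def build_request(model, contents, api_key, **parameters):
    """Build a POST request; SDK-style contents go under input.contents."""
    body = {"model": model, "input": {"contents": contents}}
    if parameters:
        # HTTP calls wrap processing parameters in a "parameters" object.
        body["parameters"] = parameters
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",  # assumed value
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_request(
    "tongyi-embedding-vision-plus-2026-03-06",
    [{"text": "White sports shoes"}],
    api_key=os.getenv("DASHSCOPE_API_KEY", "sk-xxxx"),
    dimension=512,
)
print(json.loads(request.data)["parameters"])  # {'dimension': 512}

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["output"]["embeddings"][0]["type"])
```

In the SDK the same call would pass the list as input and dimension as a keyword argument, without the input.contents and parameters wrappers.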
Code examples
Image embedding
Image URL
import dashscope
import json
from http import HTTPStatus
# Replace with your image URL.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
input = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
Local image
To generate an embedding from a local image, convert the image to a Base64 string:
import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64. Replace xxx.png with your image file.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
    # Read the file and convert it to Base64.
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format.
image_format = "png"  # Change this to your image's format (e.g., jpg, bmp).
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data
input = [{'image': image_data}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
Video embedding
Currently, the model only supports video input via URL. Local video files are not supported.
import dashscope
import json
from http import HTTPStatus
# Replace with your video URL.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
input = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
Text embedding
import dashscope
import json
from http import HTTPStatus
text = "General multimodal representation model example"
input = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
Fused embedding
import dashscope
import json
import os
from http import HTTPStatus
# Fuses text, image, and video into a single fused embedding.
# Ideal for use cases like cross-modal retrieval and image search.
text = "This is a test text for generating a multimodal fused embedding."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
# Input includes text, image, and video. Set enable_fusion=True to generate a fused embedding.
input_data = [
    {"text": text},
    {"image": image},
    {"video": video}
]
resp = dashscope.MultiModalEmbedding.call(
    # If the environment variable is not set, provide your Model Studio API key, e.g., api_key="sk-xxx".
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    enable_fusion=True,
    # Optional: Specify the embedding dimension. Valid values: 2560, 2048, 1536, 1024, 768, 512, and 256. Default: 2560.
    # parameters={"dimension": 1024}
)
print(json.dumps(resp, ensure_ascii=False, indent=4))
Multi-image fused embedding
Use qwen3-vl-embedding to fuse multiple images and text into a single embedding. To fuse multiple images, pass multiple image items. This is ideal for semantic retrieval using multi-angle product images and a text description.
import dashscope
import json
import os
from http import HTTPStatus
# Fuses multiple product images and a description into a single embedding.
# Ideal for comprehensive semantic retrieval using multi-angle product images and a text description.
text = "White sports shoes, lightweight and breathable, suitable for running and daily wear."
image1 = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
image2 = "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"
# Pass multiple image items and set enable_fusion=True to fuse all inputs into a single embedding.
input_data = [
    {"text": text},
    {"image": image1},
    {"image": image2}
]
resp = dashscope.MultiModalEmbedding.call(
    # If the environment variable is not set, provide your Model Studio API key, e.g., api_key="sk-xxx".
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    enable_fusion=True
)
print(json.dumps(resp, ensure_ascii=False, indent=4))
2026-03-06 snapshot version
This example shows how to use the tongyi-embedding-vision-plus-2026-03-06 model and its res_level (resolution) and max_video_frames (video frames) parameters. Built on the Qwen3 foundation, this model supports 30+ languages and generates both independent and fused embeddings.
import dashscope
import json
import os
from http import HTTPStatus
# Demonstrates using the res_level (resolution) and max_video_frames (video frames) parameters.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
text = "This is a visual multimodal representation model."
input_data = [
    {"text": text},
    {"image": image},
    {"video": video}
]
resp = dashscope.MultiModalEmbedding.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="tongyi-embedding-vision-plus-2026-03-06",
    input=input_data,
    dimension=1152,  # Valid values: 1152, 1024, 512, 256, 128, 64
    res_level=1,  # Resolution level: 0, 1, 2, or 3. The default is 1.
    max_video_frames=64  # Maximum number of sampled video frames. Default: 8. Maximum: 64.
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
To generate a fused embedding with the 2026-03-06 version, place text, image, and video in the same content object. The model fuses all inputs into a single embedding with the type fused.
import dashscope
import json
import os
from http import HTTPStatus
# To create a fused embedding, place text and image in the same content object.
# The model fuses all inputs into a single embedding of type `fused`.
text = "White sports shoes, lightweight and breathable, suitable for running and daily wear."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
input_data = [
    {"text": text, "image": image}
]
resp = dashscope.MultiModalEmbedding.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="tongyi-embedding-vision-plus-2026-03-06",
    input=input_data,
    dimension=1152
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
Output example
{
"status_code": 200,
"request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
"code": "",
"message": "",
"output": {
"embeddings": [
{
"index": 0,
"embedding": [
-0.009490966796875,
-0.024871826171875,
-0.031280517578125,
...
],
"type": "text"
}
]
},
"usage": {
"input_tokens": 10,
"input_tokens_details": {
"image_tokens": 0,
"text_tokens": 10
},
"output_tokens": 1,
"total_tokens": 11
}
}
Error codes
If the model call fails and returns an error message, see Error codes for resolution.