Multimodal embedding API


Multimodal embedding models convert text, images, and videos into embeddings in a shared semantic space to enable cross-modal retrieval, content classification, and similarity search.

Core capabilities

  • Cross-modal retrieval: Perform semantic searches across different content types, such as text-to-image, image-to-video, or image-to-image.

  • Semantic similarity: Measure the semantic similarity between different content types in a unified embedding space.

  • Content classification and clustering: Group, label, and cluster content based on semantic embeddings.

Key feature: Embeddings for all modalities (text, images, and video) share the same semantic space, enabling direct cross-modal matching and comparison using methods such as cosine similarity. See text and multimodal embedding for details on model selection and usage.
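Because all modalities share one space, comparing a text embedding with an image embedding reduces to a plain vector comparison. The following is a minimal sketch of cosine similarity and a simple ranking step in pure Python. The helper names cosine_similarity and rank_by_similarity and the toy three-dimensional vectors are illustrative assumptions, not part of the API.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same dimension.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector.")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate_index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for a text embedding and three image embeddings.
text_vec = [1.0, 0.0, 0.0]
image_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(text_vec, image_vecs)
print(ranking)
```

In practice the vectors would come from the embedding field of each result, and both sides of the comparison should be produced by the same model with the same dimension.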

Embedding types

Multimodal embedding models can generate embeddings in two ways:

  • Multimodal independent embedding: Generates a separate embedding for each input, such as text, an image, a video, or multiple images, within the contents. For example, an input of one text string and one image returns two independent embeddings. This is ideal for comparing individual items, such as in image-to-image or text-to-image searches.

  • Multimodal fused embedding: Fuses all inputs in contents into a single embedding to achieve a unified cross-modal semantic representation. This is suitable for scenarios that require a holistic understanding of multimodal content, such as fusing a product image and its description text into a unified representation for retrieval. For qwen3-vl-embedding, you enable fusion by setting enable_fusion=true; for tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06, you achieve fusion by placing text, image, and video in the same content object. The fused embedding supports the following combinations:

    • Text and image fusion

    • Text and video fusion

    • Fusing multiple images with text (by passing multiple image entries)

    • Fusion of images, video, and text

qwen2.5-vl-embedding supports only fused embeddings, not independent embeddings. tongyi-embedding-vision-plus and tongyi-embedding-vision-flash support only independent embeddings. tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06 support both independent and fused embeddings; to create a fused embedding with these two models, place text, image, and video in the same content object.
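The two fusion styles described above lead to differently shaped request bodies. The sketch below builds both shapes from the same list of items. The function name build_fused_request is hypothetical, and the sketch deliberately covers only the models whose fusion mechanism is spelled out above (it raises for any other model).

```python
from typing import Any, Dict, List

# Models that fuse inputs placed together in one content object.
SAME_OBJECT_FUSION_MODELS = {
    "tongyi-embedding-vision-plus-2026-03-06",
    "tongyi-embedding-vision-flash-2026-03-06",
}


def build_fused_request(model: str, items: List[Dict[str, str]]) -> Dict[str, Any]:
    """Build a request body that asks `model` for a single fused embedding.

    `items` is a list of single-key dicts such as {"text": "..."} or
    {"image": "<url>"}.
    """
    if model == "qwen3-vl-embedding":
        # Keep the items separate and switch fusion on explicitly.
        return {
            "model": model,
            "input": {"contents": items},
            "parameters": {"enable_fusion": True},
        }
    if model in SAME_OBJECT_FUSION_MODELS:
        # Merge every item into one content object.
        merged: Dict[str, str] = {}
        for item in items:
            for key, value in item.items():
                if key in merged:
                    raise ValueError(f"Duplicate modality in one content object: {key}")
                merged[key] = value
        return {"model": model, "input": {"contents": [merged]}}
    raise ValueError(f"This sketch has no fusion recipe for model: {model}")


items = [{"text": "Product description text"}, {"image": "https://example.com/a.png"}]
print(build_fused_request("qwen3-vl-embedding", items))
print(build_fused_request("tongyi-embedding-vision-plus-2026-03-06", items))
```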

For model introductions, selection guidance, and usage instructions, see Text and multimodal embedding.

Model overview

China (Beijing)

| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price (per 1,000 input tokens) |
| --- | --- | --- | --- | --- | --- |
| qwen3-vl-embedding | 2560 (default), 2048, 1536, 1024, 768, 512, 256 | 32,000 tokens | Up to 5 MB per image | Up to 50 MB per video file | Image/Video: CNY 0.0018; Text: CNY 0.0007 |
| qwen2.5-vl-embedding | 2048, 1024 (default), 768, 512 | 32,000 tokens | Up to 5 MB per image | Up to 50 MB per video file | Image/Video: CNY 0.0018; Text: CNY 0.0007 |
| tongyi-embedding-vision-plus-2026-03-06 | 1152 (default), 1024, 512, 256, 128, 64 | 1,024 tokens | Recommended: up to 5 MB per image; maximum: 10 MB. Supports up to 64 images. | Up to 50 MB per video file. The file must be encoded in H.264 or H.265. | CNY 0.0005 |
| tongyi-embedding-vision-flash-2026-03-06 | 768 (default), 512, 256, 128, 64 | 1,024 tokens | Recommended: up to 5 MB per image; maximum: 10 MB. Supports up to 64 images. | Up to 50 MB per video file. The file must be encoded in H.264 or H.265. | CNY 0.00015 |
| tongyi-embedding-vision-plus | 1152 | 1,024 tokens | Up to 3 MB per image. Supports up to 8 images. | Up to 10 MB per video file | CNY 0.0005 |
| tongyi-embedding-vision-flash | 768 | 1,024 tokens | Up to 3 MB per image. Supports up to 8 images. | Up to 10 MB per video file | CNY 0.00015 |
| multimodal-embedding-v1 | 1,024 | 512 tokens | Up to 3 MB per image | Up to 10 MB per video file | Image/Video: CNY 0.0009; Text: CNY 0.0007 |

Free quota (Note): 1 million tokens, valid for 90 days after activating Model Studio.

Singapore

| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price (per 1,000 input tokens) |
| --- | --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | 1152 | 1,024 tokens | Up to 8 images, 3 MB each | Up to 10 MB per video file | CNY 0.0005 |
| tongyi-embedding-vision-flash | 768 | 1,024 tokens | Up to 8 images, 3 MB each | Up to 10 MB per video file | CNY 0.00015 |
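As a rough illustration of how the per-1,000-token prices translate into a request cost, the sketch below multiplies token counts by the listed prices. It assumes billing is linear in tokens with no rounding, ignores the free quota, and applies the single listed price of the tongyi-embedding-vision models to both text and image tokens; these are assumptions, not documented billing rules. The name estimate_cost_cny is hypothetical.

```python
from decimal import Decimal

# CNY per 1,000 input tokens, copied from the tables above.
# Each entry is (text_price, image_or_video_price).
PRICES = {
    "qwen3-vl-embedding": (Decimal("0.0007"), Decimal("0.0018")),
    "tongyi-embedding-vision-plus": (Decimal("0.0005"), Decimal("0.0005")),
    "tongyi-embedding-vision-flash": (Decimal("0.00015"), Decimal("0.00015")),
    "multimodal-embedding-v1": (Decimal("0.0007"), Decimal("0.0009")),
}


def estimate_cost_cny(model: str, text_tokens: int, image_tokens: int) -> Decimal:
    """Estimate the cost of one request from its text and image/video tokens."""
    text_price, visual_price = PRICES[model]
    return (text_price * text_tokens + visual_price * image_tokens) / 1000


# 43 text tokens and 1,247 image tokens on qwen3-vl-embedding.
print(estimate_cost_cny("qwen3-vl-embedding", 43, 1247))
```

Decimal is used instead of float so that very small per-token prices do not pick up binary rounding noise.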

Input formats and usage limits

Fused multimodal models

| Model | Text | Image | Video | Request limit |
| --- | --- | --- | --- | --- |
| qwen3-vl-embedding | Supports 33 major languages, including Chinese, English, Japanese, Korean, French, and German. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64 supported) | MP4, AVI, MOV (URL only) | Up to 20 content elements per request, with a maximum of 5 images and 1 video. |
| qwen2.5-vl-embedding | Supports 11 major languages, including Chinese, English, Japanese, Korean, French, and German. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64 supported) | MP4, AVI, MOV (URL only) | Each request is limited to one of each input type: image, text, video, or fused object. |

Independent multimodal models

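| Model | Text | Image | Video | Request limit |
| --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus-2026-03-06 | Supports over 30 major languages, including Chinese, English, Japanese, and Korean. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | Up to 20 content elements per request, with a maximum of 64 images and 8 videos. |
| tongyi-embedding-vision-flash-2026-03-06 | Supports over 30 major languages, including Chinese, English, Japanese, and Korean. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | Up to 20 content elements per request, with a maximum of 64 images and 8 videos. |
| tongyi-embedding-vision-plus | Chinese and English | JPG, PNG, BMP (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | No limit on the number of content elements. The total number of input tokens must not exceed the batch processing token limit. |
| tongyi-embedding-vision-flash | Chinese and English | JPG, PNG, BMP (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | No limit on the number of content elements. The total number of input tokens must not exceed the batch processing token limit. |
| multimodal-embedding-v1 | Chinese and English | JPG, PNG, BMP (URL or Base64 supported) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | Up to 20 content elements per request, with a maximum of 20 text segments, 1 image, and 1 video. |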

All models accept text, image, and video inputs, individually or in combination. tongyi-embedding-vision-plus, tongyi-embedding-vision-flash, tongyi-embedding-vision-plus-2026-03-06, and tongyi-embedding-vision-flash-2026-03-06 models also support multi_images for image sequences.
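Checking a request against these limits on the client side can save a failed round trip. The sketch below encodes the limits of two models from the tables above and reports violations. The function name check_request_limits is hypothetical, and the sketch assumes that each {"image": ...} or {"video": ...} item counts as one image or one video.

```python
from collections import Counter
from typing import Dict, List

# Per-request limits copied from the tables above:
# maximum number of content elements plus per-type caps.
LIMITS = {
    "qwen3-vl-embedding": {"elements": 20, "image": 5, "video": 1},
    "multimodal-embedding-v1": {"elements": 20, "text": 20, "image": 1, "video": 1},
}


def check_request_limits(model: str, contents: List[Dict[str, object]]) -> List[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    limits = LIMITS[model]
    problems = []
    if len(contents) > limits["elements"]:
        problems.append(f"{len(contents)} content elements exceeds {limits['elements']}")
    # Count how many items use each modality key.
    counts = Counter(key for item in contents for key in item)
    for kind, cap in limits.items():
        if kind != "elements" and counts[kind] > cap:
            problems.append(f"{counts[kind]} {kind} inputs exceeds {cap}")
    return problems


six_images = [{"image": f"https://example.com/{i}.png"} for i in range(6)]
print(check_request_limits("qwen3-vl-embedding", six_images))
```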

Model capabilities

| Model | Default dimension | Vector type | Supported inputs | Description |
| --- | --- | --- | --- | --- |
| qwen3-vl-embedding | 2560 | Independent / Fusion | text, image, video, multiple images | Fusion mode, enabled with the enable_fusion parameter, combines multimodal inputs into a single vector. |
| qwen2.5-vl-embedding | 1024 | Fusion only | text, image, video | Always returns a single fused vector. Independent vectors and multi-image inputs are not supported. |
| tongyi-embedding-vision-plus-2026-03-06 | 1152 | Independent / Fusion | text, image, video, multi_images | Based on the Qwen3 foundation model. Supports multiple resolutions, 30+ languages, and fused vectors. |
| tongyi-embedding-vision-flash-2026-03-06 | 768 | Independent / Fusion | text, image, video, multi_images | Based on the Qwen3 foundation model. Supports multiple resolutions, 30+ languages, and fused vectors. |
| tongyi-embedding-vision-plus | 1152 | Independent only | text, image, video, multi_images | Supports multi_images sequences (up to 8 images). |
| tongyi-embedding-vision-flash | 768 | Independent only | text, image, video, multi_images | Supports multi_images sequences (up to 8 images). |
| multimodal-embedding-v1 | 1024 | Independent only | text, image, video | The vector dimension is fixed at 1,024 and cannot be configured. |

Prerequisites

Obtain an API key and export the API key as an environment variable. If you use an SDK to make calls, install the DashScope SDK.

HTTP call

POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding

Request

Multimodal independent embedding

The following example uses the tongyi-embedding-vision-plus model to generate an independent embedding for each input. You can replace the model name with another supported model. The multi_images type is supported only by tongyi-embedding-vision-plus, tongyi-embedding-vision-flash, tongyi-embedding-vision-plus-2026-03-06, and tongyi-embedding-vision-flash-2026-03-06. The qwen3-vl-embedding model also supports a fused embedding mode, which you can enable by setting enable_fusion=true. For details, see the "Multimodal fused embedding" section.
curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "tongyi-embedding-vision-plus",
        "input": {
            "contents": [ 
                {"text": "Multimodal embedding model"},
                {"image": "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"},
                {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"},
                {"multi_images": [
                    "https://img.alicdn.com/imgextra/i2/O1CN019eO00F1HDdlU4Syj5_!!6000000000724-2-tps-2476-1158.png",
                    "https://img.alicdn.com/imgextra/i2/O1CN01dSYhpw1nSoamp31CD_!!6000000005089-2-tps-1765-1639.png"
                    ]
                  }
            ]
        }
    }'

Multimodal fused embedding

The qwen3-vl-embedding model supports fused embedding generation. Set enable_fusion=true to combine all inputs into a single embedding. This supports various combinations, such as text and image, text and video, multiple images and text, or a mix of image, video, and text. The following example shows a fusion of multiple images, a video, and text.
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "qwen3-vl-embedding",
        "input": {
            "contents": [
                {"text": "Product description text"},
                {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"},
                {"image": "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"},
                {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"}
            ]
        },
        "parameters": {
            "enable_fusion": true
        }
    }'

2026-03-06 snapshot example

The tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06 models are new versions built on the Qwen3 foundation. They support the res_level (multi-resolution) and max_video_frames (video frame count) parameters, and can generate both independent and fused embeddings.
curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "tongyi-embedding-vision-plus-2026-03-06",
        "input": {
            "contents": [
                {"text": "This is a visual multimodal representation model"},
                {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"},
                {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"},
                {"multi_images": [
                    "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png",
                    "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"
                ]}
            ]
        },
        "parameters": {
            "dimension": 1152,
            "res_level": 1,
            "max_video_frames": 64
        }
    }'

The following example shows how to use the 2026-03-06 version to generate a fused embedding. By placing text, image, and video in the same content object, the model combines all inputs into a single embedding of type fused.

curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "tongyi-embedding-vision-plus-2026-03-06",
        "input": {
            "contents": [
                {
                    "text": "This is a visual multimodal representation model",
                    "image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png",
                    "video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
                }
            ]
        },
        "parameters": {
            "dimension": 1152
        }
    }'
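The curl calls above can be reproduced from Python with only the standard library. The sketch below is one possible way to do it: build_request assembles the same URL, headers, and JSON body, and send posts it. Both function names are hypothetical, and send is defined but not called here, since calling it requires a valid API key and network access.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

ENDPOINT = (
    "https://dashscope.aliyuncs.com/api/v1/services/embeddings/"
    "multimodal-embedding/multimodal-embedding"
)


def build_request(
    api_key: str,
    model: str,
    contents: List[Any],
    parameters: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Assemble the POST request without sending it."""
    body: Dict[str, Any] = {"model": model, "input": {"contents": contents}}
    if parameters:
        # For HTTP calls, optional settings are wrapped in "parameters".
        body["parameters"] = parameters
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


request = build_request(
    "sk-xxxx",  # Placeholder; read the real key from DASHSCOPE_API_KEY.
    "qwen3-vl-embedding",
    [{"text": "Product description text"}],
    {"enable_fusion": True},
)
print(request.get_method(), request.full_url)
```

Note that urllib.request.urlopen raises urllib.error.HTTPError for non-2xx status codes, so a caller that wants to inspect an error body needs to catch that exception and read it.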

Request headers

Content-Type string (Required)

The content type of the request. Must be application/json.

Authorization string (Required)

Authenticates the request with a Model Studio API key. Example: Bearer sk-xxxx.

Request body

model string (required)

The model name. Select a model from the Model overview.

input object (required)

The input content.

Properties

contents array (required)

The content items to process. Each item is a dictionary or string that specifies the content type and value in the format {"modality_type": "input_string_or_image/video_url"}. The supported modality types are text, image, video, and multi_images.

The qwen3-vl-embedding model supports both fused and independent embedding generation. To generate a fused embedding, add the boolean field enable_fusion and set it to true. The qwen2.5-vl-embedding model supports only fused embeddings. The tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06 models support both independent and fused embeddings. To generate a fused embedding, place the text, image, and video content in the same content object instead of using the enable_fusion parameter.
  • Text: The key is text, and the value is a string. You can also pass the string directly without a dictionary.

  • Image: Use the image key. The value can be a public URL or a Base64-encoded Data URI. The Base64 format is data:image/{format};base64,{data}, where {format} is the image format, such as jpeg or png, and {data} is the Base64-encoded string.

  • Multiple images: This type is supported only by the tongyi-embedding-vision-plus, tongyi-embedding-vision-flash, tongyi-embedding-vision-plus-2026-03-06, and tongyi-embedding-vision-flash-2026-03-06 models. The key is multi_images, and the value is a list of images. Each item in the list is an image that must follow the format described above.

  • Video: The key is video. The value must be a publicly accessible URL.

parameters object (optional)

Embedding processing parameters. For HTTP calls, you must wrap these parameters in the parameters object. For SDK calls, you can use these parameters directly.

Properties

output_type string (optional)

The format for the output embedding representation. Currently, only dense is supported.

dimension integer (optional)

The output embedding dimension. Supported values vary by model:

  • qwen3-vl-embedding: Supports 2560, 2048, 1536, 1024, 768, 512, and 256. The default is 2560.

  • qwen2.5-vl-embedding: Supports 2048, 1024, 768, and 512. The default is 1024.

  • tongyi-embedding-vision-plus: Does not support this parameter. Returns a fixed 1152-dimension embedding.

  • tongyi-embedding-vision-flash: Does not support this parameter. Returns a fixed 768-dimension embedding.

  • tongyi-embedding-vision-plus-2026-03-06: Supports 64, 128, 256, 512, 1024, and 1152. The default is 1152.

  • tongyi-embedding-vision-flash-2026-03-06: Supports 64, 128, 256, 512, and 768. The default is 768.

  • multimodal-embedding-v1: Does not support this parameter. Returns a fixed 1024-dimension embedding.
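The per-model rules above can be captured in a small lookup so that an unsupported dimension is caught before the request is sent. This is a minimal sketch: the table contents come from the list above, while the name resolve_dimension and the choice to raise ValueError are illustrative.

```python
from typing import Optional

# Supported dimensions per model; the first value in each tuple is the default.
# Models with a single value return a fixed size and do not accept `dimension`.
DIMENSIONS = {
    "qwen3-vl-embedding": (2560, 2048, 1536, 1024, 768, 512, 256),
    "qwen2.5-vl-embedding": (1024, 2048, 768, 512),
    "tongyi-embedding-vision-plus": (1152,),
    "tongyi-embedding-vision-flash": (768,),
    "tongyi-embedding-vision-plus-2026-03-06": (1152, 1024, 512, 256, 128, 64),
    "tongyi-embedding-vision-flash-2026-03-06": (768, 512, 256, 128, 64),
    "multimodal-embedding-v1": (1024,),
}


def resolve_dimension(model: str, requested: Optional[int] = None) -> int:
    """Return the dimension the model will produce, or raise if unsupported."""
    supported = DIMENSIONS[model]
    if requested is None:
        return supported[0]
    if len(supported) == 1:
        raise ValueError(f"{model} does not support the dimension parameter.")
    if requested not in supported:
        raise ValueError(f"{model} supports {sorted(supported)}, got {requested}.")
    return requested


print(resolve_dimension("qwen2.5-vl-embedding"))
print(resolve_dimension("qwen3-vl-embedding", 512))
```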

fps float (optional)

The video frame sampling rate. A smaller value extracts fewer frames. The valid range is [0, 1], and the default is 1.0.

instruct string (optional)

A custom task description to help the model understand the query's intent. English instructions are recommended and can improve performance by 1% to 5%.

enable_fusion bool (optional)

Specifies whether to generate a fused embedding. This parameter is supported only by the qwen3-vl-embedding model. When set to true, all multimodal content in the contents array is fused into a single embedding. The default value is false, which generates an independent embedding for each modality. Fused embeddings support combinations such as text and image, text and video, multiple images and text (by passing multiple image items), and a mix of image, video, and text. This is suitable for retrieval scenarios that require a comprehensive understanding of multimodal content.

The tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06 models do not use this parameter. Instead, they generate a fused embedding by placing text, image, and video in the same content object.

res_level integer (optional)

Specifies the input resolution level. You can set this to 0, 1, 2, or 3, which correspond to single-image token costs of 127, 402, 578, and 1,026, respectively. The default value is 1 (402 tokens). This parameter is supported only by the tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06 models. For use cases sensitive to image resolution, such as IPC, autonomous driving, or visual text recognition, a high resolution (res_level=3) can improve performance by 5% to 10%.
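To see how res_level affects token usage, the sketch below multiplies the per-image token costs quoted above by the number of images. It assumes the cost scales linearly with the image count, which is not stated explicitly; the name estimate_image_tokens is hypothetical.

```python
# Single-image token cost for each res_level, as listed above.
RES_LEVEL_TOKENS = {0: 127, 1: 402, 2: 578, 3: 1026}


def estimate_image_tokens(num_images: int, res_level: int = 1) -> int:
    """Estimate image tokens for `num_images` images at a given res_level."""
    if res_level not in RES_LEVEL_TOKENS:
        raise ValueError("res_level must be 0, 1, 2, or 3.")
    if num_images < 0:
        raise ValueError("num_images must not be negative.")
    return num_images * RES_LEVEL_TOKENS[res_level]


# Compare the default level with the highest level for three images.
for level in (1, 3):
    print(level, estimate_image_tokens(3, res_level=level))
```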

max_video_frames integer (optional)

Controls the maximum number of frames sampled from a video. The value cannot exceed 64. The default value is 8. This parameter is supported only by the tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06 models.

Response

Successful response

{
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.026611328125,
                    -0.016571044921875,
                    -0.02227783203125,
                    ...
                ],
                "type": "text"
            },
            {
                "index": 1,
                "embedding": [
                    0.051544189453125,
                    0.007717132568359375,
                    0.026611328125,
                    ...
                ],
                "type": "image"
            },
            {
                "index": 2,
                "embedding": [
                    -0.0217437744140625,
                    -0.016448974609375,
                    0.040679931640625,
                    ...
                ],
                "type": "video"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "input_tokens_details": {
            "image_tokens": 896,
            "text_tokens": 7
        },
        "output_tokens": 3,
        "total_tokens": 906
    },
    "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5"
}
Note

The usage field varies by model. See the following descriptions:

  • tongyi-embedding-vision-* series models: Return input_tokens (sum of text and image tokens), input_tokens_details (including image_tokens and text_tokens), output_tokens, and total_tokens. The response example above is for this type of model.

  • qwen3-vl-embedding: Returns only input_tokens (text tokens only, including system template tokens), image_tokens, and total_tokens (= input_tokens + image_tokens). Does not return input_tokens_details or output_tokens. Example:

{
    "usage": {
        "input_tokens": 43,
        "image_tokens": 1247,
        "total_tokens": 1290
    }
}

  • qwen2.5-vl-embedding: Returns only input_tokens and image_tokens. Does not return total_tokens, input_tokens_details, or output_tokens.

  • multimodal-embedding-v1: Returns input_tokens, image_tokens, image_count, and duration. Does not return total_tokens, input_tokens_details, or output_tokens.
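Since the usage object has a different shape per model family, client code that logs token counts may want to normalize it first. The sketch below handles the two shapes described above. The name summarize_usage is hypothetical, and for multimodal-embedding-v1 it assumes that input_tokens counts text only, which is not stated in this document.

```python
from typing import Any, Dict


def summarize_usage(usage: Dict[str, Any]) -> Dict[str, Any]:
    """Return text and image/video token counts regardless of response shape."""
    details = usage.get("input_tokens_details")
    if details is not None:
        # tongyi-embedding-vision-* shape: the breakdown is nested.
        text = details.get("text_tokens", 0)
        visual = details.get("image_tokens", 0)
    else:
        # qwen3-vl-embedding / qwen2.5-vl-embedding shape: input_tokens holds
        # text tokens and image_tokens is a top-level field.
        text = usage.get("input_tokens", 0)
        visual = usage.get("image_tokens", 0)
    return {
        "text_tokens": text,
        "image_tokens": visual,
        # None when the model does not return total_tokens.
        "total_tokens": usage.get("total_tokens"),
    }


qwen3_usage = {"input_tokens": 43, "image_tokens": 1247, "total_tokens": 1290}
print(summarize_usage(qwen3_usage))
```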

Error response

{
    "code":"InvalidApiKey",
    "message":"Invalid API-key provided.",
    "request_id":"fb53c4ec-1c12-4fc4-a580-cdb7c3261fc1"
}
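A caller usually wants failed requests to surface as exceptions rather than as dictionaries with a code field. The sketch below wraps that check around an already decoded response body. The EmbeddingAPIError class and the extract_embeddings function are hypothetical helpers, not part of the DashScope SDK.

```python
from typing import Any, Dict, List


class EmbeddingAPIError(Exception):
    """Raised when the response body carries an error code."""

    def __init__(self, code: str, message: str, request_id: str) -> None:
        super().__init__(f"{code}: {message} (request_id={request_id})")
        self.code = code
        self.message = message
        self.request_id = request_id


def extract_embeddings(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return output.embeddings, or raise if the response is an error."""
    # A non-empty "code" marks a failed request.
    if response.get("code"):
        raise EmbeddingAPIError(
            response["code"],
            response.get("message", ""),
            response.get("request_id", ""),
        )
    return response["output"]["embeddings"]


error_body = {
    "code": "InvalidApiKey",
    "message": "Invalid API-key provided.",
    "request_id": "fb53c4ec-1c12-4fc4-a580-cdb7c3261fc1",
}
try:
    extract_embeddings(error_body)
except EmbeddingAPIError as exc:
    print(exc)
```

Keeping request_id on the exception makes it easier to quote when troubleshooting a failed call.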

output object

Task output.

Properties

embeddings array

A list of the resulting embeddings, where each object corresponds to an input element.

Properties

index int

The index of the result in the input list.

embedding array

The embedding vector, returned as an array of floating-point numbers. Its length depends on the model and the dimension parameter.

type string

The input type for this result. The values text, image, video, and multi_images correspond to text, image, video, and multi-image inputs, respectively. Other values:

  • fused: The fused embedding type returned by the tongyi-embedding-vision-plus-2026-03-06 and tongyi-embedding-vision-flash-2026-03-06 models.

  • fusion: The type returned by the qwen3-vl-embedding or qwen2.5-vl-embedding model in fused embedding mode.

  • vl: The type returned by the qwen3-vl-embedding model in independent embedding mode.
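Because the type value differs by model and mode, it can help to separate fused results from per-input results before using them. The sketch below does that with the type values described for this field and orders the per-input results by index. The name split_embeddings is hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Type values that mark a fused embedding, depending on the model family.
FUSED_TYPES = {"fused", "fusion"}

Embedding = Dict[str, Any]


def split_embeddings(
    embeddings: List[Embedding],
) -> Tuple[List[Embedding], List[Embedding]]:
    """Return (fused, independent) results; independent ones sorted by index."""
    fused = [e for e in embeddings if e["type"] in FUSED_TYPES]
    independent = sorted(
        (e for e in embeddings if e["type"] not in FUSED_TYPES),
        key=lambda e: e["index"],
    )
    return fused, independent


sample = [
    {"index": 1, "embedding": [0.2], "type": "image"},
    {"index": 0, "embedding": [0.1], "type": "text"},
]
fused, independent = split_embeddings(sample)
print(len(fused), [e["type"] for e in independent])
```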

request_id string

Unique request identifier for tracing and troubleshooting.

code string

Error code. Returned only for failed requests. See Error codes.

message string

Detailed error message. Returned only for failed requests. See Error codes.

usage object

Statistics about token usage.

Properties

input_tokens int

The number of tokens in the input content for the current request. For the qwen3-vl-embedding and qwen2.5-vl-embedding models, this value includes only text tokens (including system template tokens) and does not include image or video tokens. For the tongyi-embedding-vision-* series models, this value includes the total number of text, image, and video tokens.

input_tokens_details object

A detailed breakdown of input tokens. This field is returned only by the tongyi-embedding-vision-* series models. It is not returned by the qwen3-vl-embedding, qwen2.5-vl-embedding, or multimodal-embedding-v1 models.

Properties

image_tokens int

The number of tokens for the input images or videos.

text_tokens int

The number of tokens for the input text.

output_tokens int

The number of tokens in the output for the current request. This field is returned only by the tongyi-embedding-vision-* series models.

total_tokens int

The total number of input and output tokens. This field is returned by the qwen3-vl-embedding and tongyi-embedding-vision-* models, but not by the qwen2.5-vl-embedding or multimodal-embedding-v1 models. For the qwen3-vl-embedding model, total_tokens = input_tokens + image_tokens.

image_tokens int

The number of tokens for the input images or videos in the current request. The system samples frames from input videos, with the maximum number of frames controlled by the system configuration, and then calculates the tokens based on the processed result. This field is returned as a top-level field only by the qwen3-vl-embedding, qwen2.5-vl-embedding, and multimodal-embedding-v1 models. For the tongyi-embedding-vision-* series models, the image token count is included in input_tokens_details.image_tokens.

image_count int

The number of images in the input for the current request. This field is returned only by the multimodal-embedding-v1 model.

duration int

The duration of the input video in seconds. This field is returned only by the multimodal-embedding-v1 model.

SDK usage

The SDK's input parameter maps to input.contents in the HTTP request body, but their structures are different.

Code examples

Image embedding

Image URL

import dashscope
import json
from http import HTTPStatus
# Replace with your image URL.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
input = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
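    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the failure instead of exiting silently.
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")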

Local image

To generate an embedding from a local image, convert the image to a Base64 string:

import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64. Replace xxx.png with your image file.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
    # Read the file and convert it to Base64.
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format.
image_format = "png"  # Change this to your image's format (e.g., jpg, bmp).
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data
input = [{'image': image_data}]

# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
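    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the failure instead of exiting silently.
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")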

Video embedding

Currently, the model only supports video input via URL. Local video files are not supported.
import dashscope
import json
from http import HTTPStatus
# Replace with your video URL.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
input = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
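    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the failure instead of exiting silently.
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")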

Text embedding

import dashscope
import json
from http import HTTPStatus

text = "General multimodal representation model example"
input = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
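    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the failure instead of exiting silently.
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")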

Fused embedding

import dashscope
import json
import os
from http import HTTPStatus

# Fuses text, image, and video into a single fused embedding.
# Ideal for use cases like cross-modal retrieval and image search.
text = "This is a test text for generating a multimodal fused embedding."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"

# Input includes text, image, and video. Set enable_fusion=True to generate a fused embedding.
input_data = [
    {"text": text},
    {"image": image},
    {"video": video}
]

resp = dashscope.MultiModalEmbedding.call(
    # If the environment variable is not set, provide your Model Studio API key, e.g., api_key="sk-xxx".
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    enable_fusion=True,
    # Optional: Specify the embedding dimension. Valid values: 2560, 2048, 1536, 1024, 768, 512, and 256. Default: 2560.
    # parameters={"dimension": 1024}
)

print(json.dumps(resp, ensure_ascii=False, indent=4))

Multi-image fused embedding

Use qwen3-vl-embedding to fuse multiple images and text into a single embedding. To fuse multiple images, pass multiple image items. This is ideal for semantic retrieval using multi-angle product images and a text description.

import dashscope
import json
import os
from http import HTTPStatus

# Fuses multiple product images and a description into a single embedding.
# Ideal for comprehensive semantic retrieval using multi-angle product images and a text description.
text = "White sports shoes, lightweight and breathable, suitable for running and daily wear."
image1 = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
image2 = "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"

# Pass multiple image items and set enable_fusion=True to fuse all inputs into a single embedding.
input_data = [
    {"text": text},
    {"image": image1},
    {"image": image2}
]

resp = dashscope.MultiModalEmbedding.call(
    # If the environment variable is not set, provide your Model Studio API key, e.g., api_key="sk-xxx".
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    enable_fusion=True
)

print(json.dumps(resp, ensure_ascii=False, indent=4))

2026-03-06 snapshot version

This example shows how to use the tongyi-embedding-vision-plus-2026-03-06 model and its res_level (resolution) and max_video_frames (video frames) parameters. Built on the Qwen3 foundation, this model supports 30+ languages and generates both independent and fused embeddings.
import dashscope
import json
import os
from http import HTTPStatus

# Demonstrates using the res_level (resolution) and max_video_frames (video frames) parameters.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
text = "This is a visual multimodal representation model."

input_data = [
    {"text": text},
    {"image": image},
    {"video": video}
]

resp = dashscope.MultiModalEmbedding.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="tongyi-embedding-vision-plus-2026-03-06",
    input=input_data,
    dimension=1152,      # Valid values: 1152, 1024, 512, 256, 128, 64
    res_level=1,         # Resolution level: 0, 1, 2, or 3. The default is 1.
    max_video_frames=64  # Maximum number of sampled video frames. Default: 8. Maximum: 64.
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "output": resp.output,
        "usage": resp.usage
    }
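    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the failure instead of exiting silently.
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")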

To generate a fused embedding with the 2026-03-06 version, place text, image, and video in the same content object. The model fuses all inputs into a single embedding with the type fused.

import dashscope
import json
import os
from http import HTTPStatus

# To create a fused embedding, place text and image in the same content object.
# The model fuses all inputs into a single embedding of type `fused`.
text = "White sports shoes, lightweight and breathable, suitable for running and daily wear."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"

input_data = [
    {"text": text, "image": image}
]

resp = dashscope.MultiModalEmbedding.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="tongyi-embedding-vision-plus-2026-03-06",
    input=input_data,
    dimension=1152
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "output": resp.output,
        "usage": resp.usage
    }
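    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the failure instead of exiting silently.
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")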

Output example

{
    "status_code": 200,
    "request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
    "code": "",
    "message": "",
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.009490966796875,
                    -0.024871826171875,
                    -0.031280517578125,
                    ...
                ],
                "type": "text"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "input_tokens_details": {
            "image_tokens": 0,
            "text_tokens": 10
        },
        "output_tokens": 1,
        "total_tokens": 11
    }
}

Error codes

If the model call fails and returns an error message, see Error codes for resolution.