Audio understanding (Qwen3-Omni-Captioner)

Qwen3-Omni-Captioner is an open-source model built on Qwen3-Omni. It automatically generates accurate, comprehensive descriptions of complex audio (speech, ambient sounds, music, and sound effects) without requiring a prompt. The model identifies speaker emotions, musical elements such as style and instruments, and sensitive information. It is well suited to audio content analysis, security audits, intent recognition, and video editing.

Supported models

Token conversion rule for audio: Total tokens = Audio duration (in seconds) × 12.5. If the audio duration is less than one second, it is counted as one second.
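As a rough illustration of the conversion rule above, the following Python sketch estimates the audio token count from a clip's duration. The function name `estimate_audio_tokens` is hypothetical and not part of any SDK. The rule as stated does not say how fractional results are rounded, so the sketch returns the unrounded value.

```python
def estimate_audio_tokens(duration_seconds: float) -> float:
    """Estimate audio tokens as duration (seconds) x 12.5, per the stated rule."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Audio shorter than one second is counted as one second.
    billable_seconds = max(duration_seconds, 1.0)
    # Rounding of fractional token counts is not specified, so no rounding is applied here.
    return billable_seconds * 12.5


print(estimate_audio_tokens(0.4))  # 12.5 (counted as one second)
print(estimate_audio_tokens(60))   # 750.0
```

For billing purposes, treat the `usage` field returned by the API as the authoritative count; this estimate is only useful for planning.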

Getting started

Prerequisites

Qwen3-Omni-Captioner supports API calls only. Online testing in the Model Studio console is not available.

These code samples analyze audio from an online URL, not a local file. For local files, see the section on passing local files; for size and format constraints, see the section on audio file limits.

OpenAI compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # If the environment variable is not set, pass your API key directly: api_key="sk-xxx"
    # API keys differ by region. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                    }
                }
            ]
        }
    ]
)
print(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces-likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise-likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction-possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If the environment variable is not set, pass your API key directly: apiKey: "sk-xxx"
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                }
            }]
        }]
});

console.log(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces-likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise-likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction-possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "The audio clip is a brief, low-fidelity recording-approximately six seconds long-captured in a small, reverberant indoor space, likely a home office or bedroom. It opens with a rapid, metallic, rhythmic hammering sound, repeating every 0.5 to 0.6 seconds, with each strike slightly uneven and accompanied by a short echo. This sound dominates the left side of the stereo field and is close to the microphone, suggesting the hammering is occurring nearby and slightly to the left.\n\nOverlaid with the hammering, a single male voice speaks in Mandarin Chinese, his tone clearly one of frustration and exasperation. He says, “Oh, with this, how am I supposed to work quietly?” His speech is clear despite the poor audio quality, and is delivered in a standard, unaccented Mandarin, indicative of a native speaker from northern or central China.\n\nThe voice is more distant and centered in the stereo field, with more room reverberation than the hammering. The emotional content is palpable: his voice rises slightly at the end, turning the phrase into a rhetorical complaint, underscoring his irritation. No other voices, music, or ambient sounds are present; the only non-speech sounds are the hammering and the faint hiss of the recording device.\n\nThe combination of the environmental sound, the speaker’s language, and his tone strongly suggests a scenario of home office disruption-perhaps someone working from home is being disturbed by renovation or repair work happening nearby. The recording ends abruptly, mid-hammer, further emphasizing the spontaneous and candid nature of the capture.\n\nIn summary, the audio is a realistic, low-fidelity snapshot of a Mandarin-speaking man, likely in China, expressing frustration at being unable to work in peace due to nearby construction or repair activity, captured in a personal, indoor setting.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 160,
    "completion_tokens": 387,
    "total_tokens": 547,
    "prompt_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "completion_tokens_details": {
      "text_tokens": 387
    }
  },
  "created": 1758002134,
  "system_fingerprint": null,
  "model": "qwen3-omni-30b-a3b-captioner",
  "id": "chatcmpl-f4155bf9-b860-49d6-8ee2-092da7359097"
}

DashScope

Python

import dashscope
import os

# If you use a model in the Singapore region, uncomment the following line and replace {WorkspaceId} with your actual workspace ID.
# dashscope.base_http_api_url = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys differ by region. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
    # If the environment variable is not set, pass your API key directly: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages
    )
print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces-likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise-likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction-possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    // If you use a model in the Singapore region, uncomment the following line and replace {WorkspaceId} with your actual workspace ID.
    //  static {Constants.baseHttpApiUrl="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";}
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, pass your API key directly: .apiKey("sk-xxx")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces-likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise-likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction-possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
                ]
            }
        ]
    }
}'

Response

{
  "output":{
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "The audio clip is a 6-second, high-fidelity recording set in a quiet, indoor environment. The primary sound is a male speaker, likely in his late teens to mid-20s, speaking Mandarin Chinese in a tone of mild exasperation. His speech is clear and natural, delivered in a conversational manner: “Oh, how can I possibly work quietly like this?” His voice is close to the microphone, and the room is acoustically neutral, with no noticeable echo or background noise, suggesting a small, well-furnished space.\n\nOverlaying the speech is a persistent, rhythmic mechanical sound-a series of sharp, metallic clicks or clatters that repeat every 0.6 seconds. The sound is dry and lacks any reverberation, further supporting the inference that it is produced by a mechanical device very close to the microphone. The regularity and timbre of the sound suggest a small, metallic object (such as a key, coin, or pen) being repeatedly tapped or struck on a hard surface, rather than a larger or more complex machine.\n\nThe speaker’s complaint is a direct response to the mechanical noise, expressing frustration at being unable to concentrate or work in peace due to the disturbance. The tone is not angry or urgent, but rather one of resigned annoyance, typical of someone encountering a minor, persistent annoyance in a personal or domestic setting.\n\nThere are no other voices, music, or environmental cues present. The overall impression is of a brief, candid moment-perhaps a student, office worker, or someone in a quiet home environment-caught on microphone while complaining (to themselves or a nearby companion) about a distracting, repetitive noise. The recording is technically clean and focused, with all attention on the speaker and the mechanical sound, making it highly plausible that the clip was captured intentionally, possibly for a voice note, social media post, or as a sample for a sound effect library."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "input_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "total_tokens": 559,
    "output_tokens": 399,
    "input_tokens": 160,
    "output_tokens_details": {
      "text_tokens": 399
    }
  },
  "request_id": "d532f72c-e75b-4ffb-a1ef-d2465e758958"
}

How it works

  • Single-turn interaction: Each request is an independent analysis task. Multi-turn conversation is not supported.

  • Fixed task: The model generates audio descriptions in English only. You cannot use instructions (e.g., system messages) to change behavior, output format, or content focus.

  • Audio input only: The model accepts audio only. Text prompts are not needed. The message parameter format is fixed.

    Example message format

    OpenAI compatible

    messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                        }
                    }
                ]
            }
        ]

    DashScope

    messages = [
        {
            "role": "user",
            "content": [
                {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
            ]
        }
    ]
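Because the message format is fixed and carries only one audio item, building it can be wrapped in a small helper. The following is a minimal Python sketch, assuming the two payload shapes shown above; the function name `build_messages` and the `api` argument are hypothetical and are not part of either SDK.

```python
from typing import Any


def build_messages(audio_url: str, api: str = "openai") -> list[dict[str, Any]]:
    """Build the single-turn, audio-only message list in either payload shape."""
    if api == "openai":
        # OpenAI-compatible shape: an input_audio content item with a data field.
        content = [{"type": "input_audio", "input_audio": {"data": audio_url}}]
    elif api == "dashscope":
        # DashScope shape: a content item with an audio key.
        content = [{"audio": audio_url}]
    else:
        raise ValueError("api must be 'openai' or 'dashscope'")
    # One user message only: no system message, no text prompt, no earlier turns.
    return [{"role": "user", "content": content}]


print(build_messages("https://example.com/sample.wav", api="dashscope"))
```

The example URL is a placeholder. Pass the returned list as the `messages` argument in the calls shown in the Getting started samples.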

Streaming output

Streaming output returns intermediate results as they are generated instead of waiting for the full response, so you can start reading the description sooner. This reduces the perceived wait time.

OpenAI compatible

Set stream to true to enable streaming output.

Python

import os
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    # If the environment variable is not set, pass your API key directly: api_key="sk-xxx"
    # API keys differ by region. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                    }
                }
            ]
        }
    ],
    stream=True,
    stream_options={"include_usage": True},

)
for chunk in completion:
    # With stream_options.include_usage=True, the last chunk has an empty choices list
    # and carries the token usage in chunk.usage.
    if chunk.choices:
        # delta.content can be None or empty on some chunks, so print only real text.
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    elif chunk.usage:
        print(f"\nUsage: {chunk.usage}")

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If the environment variable is not set, pass your API key directly: apiKey: "sk-xxx"
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                }
            }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
});

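for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        // delta.content can be null on some chunks; write text without a newline per chunk.
        process.stdout.write(chunk.choices[0].delta.content ?? "");
    } else {
        // With include_usage enabled, the final chunk has no choices and carries the token usage.
        process.stdout.write("\n");
        console.log(chunk.usage);
    }
}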

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{
        "include_usage":true
    }
}'

DashScope

Call via DashScope SDK or HTTP. Set parameters based on your method:

  • Python SDK: Set the stream parameter to True.

  • Java SDK: Use the streamCall method.

  • HTTP: In the header, set X-DashScope-SSE to enable.

By default, streaming output is non-incremental. This means each returned chunk contains all previously generated content. For incremental output, set incremental_output (incrementalOutput in Java) to true.
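To illustrate the difference between the two modes, the sketch below converts non-incremental (cumulative) chunks into incremental pieces. It is plain Python with no SDK calls; the function name `to_increments` and the sample strings are made up for illustration. In real use, each string would be the text of one streamed chunk.

```python
from typing import Iterable, Iterator


def to_increments(cumulative_chunks: Iterable[str]) -> Iterator[str]:
    """Yield only the newly added text from a stream of cumulative chunks."""
    seen = ""
    for text in cumulative_chunks:
        if text.startswith(seen):
            # Normal case: the chunk repeats everything so far plus new text.
            delta = text[len(seen):]
        else:
            # Defensive fallback: the chunk does not extend the previous one.
            delta = text
        seen = text
        if delta:
            yield delta


# Non-incremental mode: every chunk contains all previously generated content.
cumulative = ["The", "The audio", "The audio clip"]
print(list(to_increments(cumulative)))  # ['The', ' audio', ' clip']
```

Setting `incremental_output` to true makes this conversion unnecessary, because each chunk then contains only the new text.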

Python

import dashscope
import os

# If you use a model in the Singapore region, uncomment the following line and replace {WorkspaceId} with your actual workspace ID.
# dashscope.base_http_api_url = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys differ by region. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
    # If the environment variable is not set, pass your API key directly: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
    stream=True,
    incremental_output=True
    )
    
full_content = ""
print("Streaming output:")
# Use a separate loop variable so the response generator is not shadowed.
for chunk in response:
    content = chunk["output"]["choices"][0]["message"].content
    if content:
        # With incremental_output=True, each chunk carries only the new text.
        text = content[0]["text"]
        print(text, end="", flush=True)
        full_content += text
print(f"\nFull content: {full_content}")

Java

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    // If you use a model in the Singapore region, uncomment the following line and replace {WorkspaceId} with your actual workspace ID.
    //  static {Constants.baseHttpApiUrl="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";}
    
    public static void streamCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // qwen3-omni-30b-a3b-captioner supports only one audio file as input.
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav");}}
                )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, pass your API key directly: .apiKey("sk-xxx")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                List<com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult.Output.Choice.Message.Content> content = item.getOutput().getChoices().get(0).getMessage().getContent();
                // Check if content exists and is not empty.
                if (content != null &&  !content.isEmpty()) {
                    System.out.println(content.get(0).get("text"));
                }
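            } catch (Exception e){
                // Report the failure instead of exiting silently with a success code.
                System.err.println("Failed to read streamed output: " + e.getMessage());
                System.exit(1);
            }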
        });
    }

    public static void main(String[] args) {
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
                ]
            }
        ]
    },
    "parameters": {
      "incremental_output": true
    }
}'

Pass local file (Base64 encoding or file path)

You can pass a local file to the model in two ways:

  • Base64 encoding

  • File path (recommended for greater transmission stability)

Upload methods:

Pass by file path

Pass the file path directly to the model. This method is supported only by the DashScope Python and Java SDKs, not by HTTP calls. The required path format depends on the SDK and the operating system, as listed in the table below.

Specify the file path

| System | SDK | Input file path | Example |
| --- | --- | --- | --- |
| Linux or macOS | Python SDK | file://{absolute_path_of_the_file} | file:///home/images/test.mp3 |
| Linux or macOS | Java SDK | file://{absolute_path_of_the_file} | file:///home/images/test.mp3 |
| Windows | Python SDK | file://{absolute_path_of_the_file} | file://D:/images/test.mp3 |
| Windows | Java SDK | file:///{absolute_path_of_the_file} | file:///D:/images/test.mp3 |
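
The path rules in the table can be captured in a small helper. The following is a minimal sketch, not part of either SDK: the function name `to_dashscope_file_uri` and its `sdk` and `windows` parameters are hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_dashscope_file_uri(absolute_path: str, sdk: str = "python", windows: bool = False) -> str:
    """Build the file:// string from the table above (hypothetical helper, not an SDK function)."""
    if sdk not in ("python", "java"):
        raise ValueError("sdk must be 'python' or 'java'")
    if windows:
        # Normalize backslashes so a path such as D:\images\test.mp3 becomes D:/images/test.mp3.
        normalized = PureWindowsPath(absolute_path).as_posix()
        # Per the table, the Java SDK on Windows takes three slashes and the Python SDK two.
        prefix = "file:///" if sdk == "java" else "file://"
    else:
        # On Linux and macOS the absolute path already starts with "/",
        # so "file://" + "/home/..." yields "file:///home/...".
        normalized = PurePosixPath(absolute_path).as_posix()
        prefix = "file://"
    return prefix + normalized


print(to_dashscope_file_uri("/home/images/test.mp3"))
print(to_dashscope_file_uri(r"D:\images\test.mp3", windows=True))
print(to_dashscope_file_uri(r"D:\images\test.mp3", sdk="java", windows=True))
```

The returned string can be used as the value of the audio parameter in the DashScope SDK samples.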

Pass by Base64 encoding

Convert the file to a Base64 string and pass it to the model.

Steps to pass a Base64-encoded string

  1. Encode the file: Convert the local audio file to a Base64 string.

    Example: Converting an audio file to a Base64 string

    import base64
    
    # Encoding function: Converts a local file to a Base64-encoded string
    def encode_audio(audio_path):
        with open(audio_path, "rb") as audio_file:
            return base64.b64encode(audio_file.read()).decode("utf-8")
    
    # Replace xxxx/test.mp3 with the absolute path of your local audio file
    base64_audio = encode_audio("xxxx/test.mp3")
  2. Construct a Data URL: data:;base64,{base64_audio}, where base64_audio is the Base64 string from step 1.

  3. Pass the Data URL via audio (DashScope) or input_audio (OpenAI compatible) parameter.

Limits:

  • Recommended: pass the file path directly for greater transmission stability. For files under 1 MB, Base64 encoding also works.

  • When passing by file path, audio files must be under 10 MB.

  • When using Base64, the encoded string must be under 10 MB. Note: Base64 encoding increases the data size by about one third, so the source file must be correspondingly smaller.
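
Because Base64 turns every 3 bytes into 4 characters, the encoded size can be computed before encoding. The sketch below combines the steps above with a size check. It assumes that 10 MB means 10 × 1024 × 1024 bytes and that the limit applies to the Base64 string alone; neither is confirmed here. `build_audio_data_url`, `base64_length`, and `MAX_BASE64_BYTES` are illustrative names.

```python
import base64
import math

# Assumption: "10 MB" is interpreted as 10 MiB; adjust if the service counts differently.
MAX_BASE64_BYTES = 10 * 1024 * 1024


def base64_length(raw_size: int) -> int:
    """Length of the padded Base64 string for raw_size input bytes."""
    return 4 * math.ceil(raw_size / 3)


def build_audio_data_url(audio_bytes: bytes) -> str:
    """Encode audio bytes and wrap them in the data:;base64, prefix."""
    if base64_length(len(audio_bytes)) >= MAX_BASE64_BYTES:
        raise ValueError("Encoded audio would reach the 10 MB Base64 limit; compress the file first.")
    encoded = base64.b64encode(audio_bytes).decode("utf-8")
    return f"data:;base64,{encoded}"


# "ID3" is how an MP3 file with ID3v2 tags begins; it encodes to "SUQz".
print(build_audio_data_url(b"ID3"))
# A 7 MiB file still fits after encoding; an 8 MiB file does not.
print(base64_length(7 * 1024 * 1024) < MAX_BASE64_BYTES)
print(base64_length(8 * 1024 * 1024) < MAX_BASE64_BYTES)
```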

Pass by file path

File path passing is supported by DashScope Python and Java SDKs only, not HTTP.

Python

import dashscope
import os

# If you use a model in the Singapore region, uncomment the following line and replace {WorkspaceId} with your actual workspace ID.
# dashscope.base_http_api_url = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1"

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full path of the local file must be prefixed with file:// to ensure a valid path, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
    {
        "role": "user",
        # Pass the file path prefixed with file:// in the audio parameter.
        "content": [{"audio": audio_file_path}],
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    # API keys differ by region. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
)
print("Output:")
print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    // If you use a model in the Singapore region, uncomment the following line and replace {WorkspaceId} with your actual workspace ID.
    // static {Constants.baseHttpApiUrl="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";}
    
    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException {

        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
        // The full path of the local file must be prefixed with file:// to ensure a valid path, for example: file:///home/images/test.mp3
        // The current test system is macOS. If you use Windows, use "file:///ABSOLUTE_PATH/welcome.mp3" instead.

        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", localFilePath);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If environment variable not configured, replace with your API key: .apiKey("sk-xxx")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Pass by Base64 encoding

OpenAI compatible

Python

import os
from openai import OpenAI
import base64

client = OpenAI(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    # API keys differ by region. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    # Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")


# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        # When passing a local file with Base64 encoding, you must use the data: prefix to ensure a valid file URL.
                        # The "base64" keyword must be included before the Base64-encoded data (base64_audio), otherwise an error will occur.
                        "data": f"data:;base64,{base64_audio}"
                    },
                }
            ],
        },
    ]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeAudio = (audioPath) => {
    const audioFile = readFileSync(audioPath);
    return audioFile.toString('base64');
};
// Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
const base64Audio = encodeAudio("ABSOLUTE_PATH/welcome.mp3");

const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": { "data": `data:;base64,${base64Audio}`}
            }]
        }]
});

console.log(completion.choices[0].message.content);

curl

  • For information about how to convert a file to a Base64-encoded string, see the code sample.

  • For demonstration purposes, the "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." Base64 string is truncated. In practice, you must pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."
          }
        }
      ]
    }
  ]
}'

DashScope

Python

import os
import base64
import dashscope
from dashscope import MultiModalConversation

# If you use a model in the Singapore region, uncomment the following line and replace {WorkspaceId} with your actual workspace ID.
# dashscope.base_http_api_url = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1"

# Encoding function: Converts a local file to a Base64-encoded string
def encode_audio(audio_file_path):
    with open(audio_file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

messages = [
    {
        "role": "user",
        # When passing a local file with Base64 encoding, you must use the data: prefix to ensure a valid file URL.
        # The "base64" keyword must be included before the Base64-encoded data (base64_audio), otherwise an error will occur.
        "content": [{"audio":f"data:;base64,{base64_audio}"}],
    }
]

response = MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    # API keys differ by region. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    // If you use a model in the Singapore region, uncomment the following line and replace {WorkspaceId} with your actual workspace ID.
    // static {Constants.baseHttpApiUrl="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";}

    private static String encodeAudioToBase64(String audioPath) throws IOException {
        Path path = Paths.get(audioPath);
        byte[] audioBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(audioBytes);
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException, IOException {
        // Replace ABSOLUTE_PATH/welcome.mp3 with the actual path of your local file.
        String localFilePath = "ABSOLUTE_PATH/welcome.mp3";
        String base64Audio = encodeAudioToBase64(localFilePath);

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // When passing a local file with Base64 encoding, you must use the data: prefix to ensure a valid file URL.
                // The "base64" keyword must be included before the Base64-encoded data (base64Audio), otherwise an error will occur.
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "data:;base64," + base64Audio);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen3-omni-30b-a3b-captioner")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
                // If environment variable not configured, replace with your API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For information about how to convert a file to a Base64-encoded string, see the code sample.

  • For demonstration purposes, the "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." Base64 string is truncated. In practice, you must pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# Beijing region base_url. For Singapore, use: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."}
                ]
            }
        ]
    }
}'

API reference

For Qwen3-Omni-Captioner parameters, see Text generation.

Error codes

If the model call fails and returns an error message, see Error codes for resolution.

FAQ

How to compress an audio file to the required size?

  • Online tools: You can use tools like Compresss to compress audio.

  • Code implementation: Use FFmpeg. For usage details, see the official FFmpeg website.

    # Basic conversion command (general template)
    # -i: Specifies the input file path. Example: input.mp3
    #
    # -b:a: Sets the audio bitrate.
    #   Common values: 64k (low quality, for voice and low-bandwidth streaming), 128k (medium quality, for general audio and podcasts), 192k (high quality, for music and broadcasting).
    #   A higher bitrate results in better audio quality and a larger file size.
    #
    # -ar: Sets the audio sample rate, which is the number of samples per second.
    #   Common values: 8000 Hz, 22050 Hz, 44100 Hz (standard sample rate).
    #   A higher sample rate results in a larger file size.
    #
    # -ac: Sets the number of audio channels. Common values: 1 (mono), 2 (stereo). Mono files are smaller.
    #
    # -y: Overwrites the output file if it exists (no value needed).
    # output.mp3: Specifies the output file path.

    ffmpeg -i input.mp3 -b:a 128k -ar 44100 -ac 1 output.mp3 -y
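
To pick a value for -b:a, it can help to estimate the output size first. For constant-bitrate audio, the size is roughly bitrate × duration ÷ 8. The following sketch applies that arithmetic. It ignores container and metadata overhead, assumes that 10 MB means 10 × 1024 × 1024 bytes, and uses illustrative function names.

```python
SIZE_LIMIT_BYTES = 10 * 1024 * 1024  # assumption: 10 MB read as 10 MiB


def estimated_size_bytes(bitrate_kbps: int, duration_seconds: int) -> int:
    """Approximate size of constant-bitrate audio, ignoring container overhead."""
    return bitrate_kbps * 1000 * duration_seconds // 8


def max_bitrate_kbps(duration_seconds: int, size_limit_bytes: int = SIZE_LIMIT_BYTES) -> int:
    """Largest whole-kbps bitrate whose estimated size stays within the limit."""
    return size_limit_bytes * 8 // (duration_seconds * 1000)


# Estimated size in bytes of a 10-minute clip encoded at 128 kbps.
print(estimated_size_bytes(128, 10 * 60))
# Highest whole-kbps bitrate that keeps a 40-minute clip within the limit.
print(max_bitrate_kbps(40 * 60))
```

Leave some headroom below the computed bitrate, because the estimate excludes file headers and the documented size limit is a strict upper bound.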

Limitations

Audio file limits:

  • Duration: Up to 40 minutes.

  • Number of files: Only one audio file is supported per request.

  • File formats: AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3.

  • File input methods: Public URL, Base64 encoding, or local file path.

  • File size:

    • Public URL: No more than 1 GB.

    • File path: The audio file must be smaller than 10 MB.

    • Base64 encoding: The encoded Base64 string must be smaller than 10 MB. For more information, see Pass local file.

    To compress a file, see How to compress an audio file to the required size?
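
These limits can be checked on the client before a request is sent. The following is a minimal sketch under several assumptions: formats are recognized by file extension only (the codec is not inspected), 1 GB and 10 MB are read as binary units, and the duration is supplied by the caller. The function `check_audio_input` and its parameters are hypothetical and not part of any SDK.

```python
import os

# Assumption: extensions stand in for the documented formats; the actual codec is not checked.
SUPPORTED_EXTENSIONS = {".amr", ".wav", ".3gp", ".3gpp", ".aac", ".mp3"}
MAX_DURATION_SECONDS = 40 * 60
# Assumption: binary units (1 GB = 1024**3 bytes, 10 MB = 10 * 1024**2 bytes).
SIZE_LIMITS = {"url": 1024 ** 3, "path": 10 * 1024 ** 2, "base64": 10 * 1024 ** 2}


def check_audio_input(file_name: str, size_bytes: int, duration_seconds: float, method: str) -> list:
    """Return a list of limit violations; an empty list means no problem was found."""
    if method not in SIZE_LIMITS:
        return [f"unknown input method: {method}"]
    problems = []
    extension = os.path.splitext(file_name)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'no extension'}")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("duration exceeds 40 minutes")
    # For Base64 the limit applies to the encoded string: 4 characters per 3 bytes.
    effective_size = 4 * ((size_bytes + 2) // 3) if method == "base64" else size_bytes
    limit = SIZE_LIMITS[method]
    # A public URL may be up to 1 GB; the other methods must stay strictly below 10 MB.
    too_large = effective_size > limit if method == "url" else effective_size >= limit
    if too_large:
        problems.append(f"size exceeds the limit for {method} input")
    return problems


print(check_audio_input("test.mp3", 5 * 1024 ** 2, 60, "path"))
print(check_audio_input("test.mp3", 8 * 1024 ** 2, 60, "base64"))
```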