Audio understanding (Qwen-Audio)


Qwen-Audio is a large-scale audio language model developed by Alibaba Cloud. It can understand various types of audio, including human speech, natural sounds, music, and singing. The model's core features include audio transcription, content summarization, sentiment analysis, audio event detection, and voice chat.

Important
  • Applicable regions: This document applies only to the deployment mode in the Chinese mainland. The endpoints and data storage are located in the Beijing region. Compute resources for inference are limited to the Chinese mainland. To use the model, you must use an API key from the Beijing region.

  • For free trial only: The Qwen-Audio model is currently available for free trial only. After your free quota is used up, you can no longer call the model. Payment is not supported. For production-level applications, use Qwen-Omni as an alternative model.

Examples

Speech recognition and analysis

  • Recognizes human speech: In addition to converting speech to text, the model can analyze the speaker's gender, age group, accent, emotion, and intent.

  • Recognizes natural sounds: Identifies sounds such as car horns, bells, thunder, and breaking glass.

  • Recognizes music: Identifies instruments, rhythm, key, and style.

Input example: What is said in this audio? Is the speaker male or female? What is their approximate age? What language or dialect are they using? What is their emotion?

Output result: The original content of this audio is: 'This is a letter from the Panzhihua Iron and Steel Works in Sichuan.' The speaker is a male, approximately 30 years old, speaking the Southwestern Mandarin-Chongqing dialect. The emotion is calm.

Audio Q&A

Answers questions about audio content, such as locating the timestamp of specific information in the audio.

Input example: Where does "Ali" appear in this audio?

Output result: "Ali" appears from 1.53 seconds to 1.87 seconds.

Voice chat

The model can respond directly to audio content without any text instructions.

Output result: You can try using earplugs or finding a relatively quiet work environment to help you concentrate.

Supported models

Model: qwen-audio-turbo

  • Version: Stable

  • Context window: 8,000 tokens

  • Max input: 6,000 tokens

  • Max output: 1,500 tokens

Model: qwen-audio-turbo-latest

  • Version: Latest

  • Context window: 8,192 tokens

  • Max input: 6,144 tokens

  • Max output: 2,048 tokens

Input cost and output cost (per 1M tokens): This model is currently available for free trial only. You cannot invoke the model after the free quota is used. We recommend using Qwen-Omni as an alternative.

Free quota: 100,000 tokens, valid for 90 days after activating Model Studio.

Rules for converting audio to tokens

Each second of audio corresponds to 25 tokens. Audio with a duration of less than 1 second is also counted as 25 tokens.
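As a rough illustration of the rule above, the following sketch estimates the token count of an audio clip from its duration. The function name `estimate_audio_tokens` is hypothetical, and rounding partial tokens up is an assumption: the rule only states the rate of 25 tokens per second and the 25-token minimum.

```python
TOKENS_PER_SECOND = 25  # conversion rate stated in the rule above
MIN_AUDIO_TOKENS = 25   # clips shorter than 1 second still count as 25 tokens


def estimate_audio_tokens(duration_ms: int) -> int:
    """Estimate the tokens consumed by an audio clip of the given length.

    Assumption: fractional tokens are rounded up. The documented rule does
    not say how partial seconds are rounded.
    """
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    tokens = -(-duration_ms * TOKENS_PER_SECOND // 1000)
    return max(MIN_AUDIO_TOKENS, tokens)


print(estimate_audio_tokens(400))     # a clip shorter than 1 second
print(estimate_audio_tokens(30_000))  # a 30-second clip
```

An estimate like this can help you judge how quickly a batch of clips would consume the free quota, but the `usage` field returned by the API remains the authoritative count.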

Getting started

Prerequisites

The Qwen-Audio model can only be called through an API. It is not available for online trial in the Alibaba Cloud Model Studio console.

The following code examples show how to analyze an online audio file that is specified by a URL. To use a local file instead, see Passing local files (Base64 encoding or file path). For the restrictions on audio files, see Limits.

Python

import dashscope
import os

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"},
            {"text": "What is this audio about?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen-audio-turbo", 
    messages=messages,
    result_format="message"
    )
print("The output is:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Sample response

The output is:
This audio says:'Welcome to Alibaba Cloud'

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"),
                        Collections.singletonMap("text", "What is this audio about?")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                 // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-audio-turbo")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("The output is:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

The output is:
This audio says:'Welcome to Alibaba Cloud'

curl

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-audio-turbo",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"},
                    {"text": "What is this audio about?"}
                ]
            }
        ]
    }
}'

Sample response

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "This audio says:'Welcome to Alibaba Cloud'"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "audio_tokens": 85,
        "output_tokens": 10,
        "input_tokens": 33
    },
    "request_id": "1341c517-71bf-94f5-862a-18fbddb332e9"
}
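When you call the HTTP endpoint directly, you have to pull the answer out of the JSON body yourself. The following is a minimal sketch that parses an abbreviated copy of the sample response above. The helper name `extract_text` is hypothetical, and the sketch assumes that a successful response always has the shape shown in the sample.

```python
import json

# Abbreviated copy of the sample response shown above.
raw_response = """
{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [{"text": "This audio says:'Welcome to Alibaba Cloud'"}]
                }
            }
        ]
    },
    "usage": {"audio_tokens": 85, "output_tokens": 10, "input_tokens": 33},
    "request_id": "1341c517-71bf-94f5-862a-18fbddb332e9"
}
"""


def extract_text(response: dict) -> str:
    """Join the text parts of the first choice in the response."""
    parts = response["output"]["choices"][0]["message"]["content"]
    return "".join(part.get("text", "") for part in parts)


response = json.loads(raw_response)
usage = response["usage"]
print(extract_text(response))
print("audio tokens:", usage["audio_tokens"])
```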

Core features

Voice chat

The model can respond directly to audio content without any text instructions. For example, if the audio contains a question such as "What is a suitable activity for this environment?", the model replies with suitable activities.

Currently, the qwen-audio-turbo-latest and qwen2-audio-instruct models support voice chat.

Python

import dashscope
import os

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/kvkadk/%E6%8E%A8%E8%8D%90%E4%B9%A6.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen2-audio-instruct', 
    messages=messages)
print("The output is:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Sample response

The output is:
Of course. There are many types of literary books. Based on your interests and preferences, I can give you some suggestions. What type of literary works do you like? For example, modern literature, classical literature, science fiction, historical novels, poetry, essays, and so on.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("audio","https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/kvkadk/%E6%8E%A8%E8%8D%90%E4%B9%A6.wav")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                 // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-audio-turbo-latest")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("The output is:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

The output is:
Of course. There are many types of literary books. Based on your interests and preferences, I can give you some suggestions. What type of literary works do you like? For example, modern literature, classical literature, science fiction, historical novels, detective novels, novels, poetry, essays, drama, and so on.

curl

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-audio-turbo-latest",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [
                    {"text": "You are a helpful assistant."}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/kvkadk/%E6%8E%A8%E8%8D%90%E4%B9%A6.wav"}
                ]
            }
        ]
    }
}'

Sample response

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "Of course, but I need to understand your interests first. What type of literary works do you prefer? For example, novels, essays, poetry, etc."
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "audio_tokens": 237,
        "output_tokens": 29,
        "input_tokens": 28
    },
    "request_id": "ae407255-2fed-9e5a-90e6-6dab3178e913"
}

Passing local files (Base64 encoding or file path)

Qwen-Audio supports two methods for uploading local files: Base64 encoding and file path. You can choose an upload method based on the file size and SDK type. For specific recommendations, see How to choose a file upload method.

Pass by file path

You can pass the file path directly to the model. This method is supported only by the DashScope Python and Java SDKs, and not by HTTP calls. Refer to the following table to specify the file path based on your programming language and operating system. The file must meet the requirements described in Supported audio files.

Specify the file path

  • Linux or macOS (Python SDK or Java SDK)
    File path to pass: file://{absolute_path_of_the_file}
    Example: file:///home/images/test.mp3

  • Windows (Python SDK)
    File path to pass: file://{absolute_path_of_the_file}
    Example: file://D:/images/test.mp3

  • Windows (Java SDK)
    File path to pass: file:///{absolute_path_of_the_file}
    Example: file:///D:/images/test.mp3
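The path rules above can be sketched as a small helper. The function `build_file_uri` and its parameters are hypothetical and not part of the DashScope SDK; the sketch only assembles the string that you would place in the audio field.

```python
from pathlib import PurePosixPath, PureWindowsPath


def build_file_uri(absolute_path: str, sdk: str, windows: bool) -> str:
    """Return the file:// string for the given SDK and operating system.

    sdk is "python" or "java"; windows selects the Windows rules.
    """
    if sdk not in ("python", "java"):
        raise ValueError("sdk must be 'python' or 'java'")
    path_cls = PureWindowsPath if windows else PurePosixPath
    path = path_cls(absolute_path)
    if not path.is_absolute():
        raise ValueError("an absolute path is required")
    # as_posix() turns Windows backslashes into forward slashes.
    normalized = path.as_posix()
    # Only the Java SDK on Windows uses three slashes before the drive letter.
    prefix = "file:///" if (windows and sdk == "java") else "file://"
    return prefix + normalized


print(build_file_uri("/home/images/test.mp3", "python", windows=False))
print(build_file_uri(r"D:\images\test.mp3", "python", windows=True))
print(build_file_uri(r"D:\images\test.mp3", "java", windows=True))
```

On Linux and macOS the absolute path already starts with a slash, which is why the two-slash prefix yields the three slashes seen in the example.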

Base64-Encoded Input

You can encode the file into a Base64 string and then pass it to the model.

Steps to pass a Base64-encoded string

  1. Encode the file: Convert the local audio file to a Base64-encoded string.

  2. Construct a Data URL in the following format: data:;base64,{base64_audio}. base64_audio is the Base64-encoded string from the previous step.

  3. Call the model by passing the audio parameter with the Data URL.

Pass by file path

Passing a file by its path is supported only for calls made using the DashScope Python and Java SDKs. This method is not supported for HTTP calls.

Python

import os
from dashscope import MultiModalConversation

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full path of the local file must be prefixed with file:// to ensure its validity, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
    {
        "role": "system", 
        "content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        # Pass the file path prefixed with file:// in the audio parameter.
        "content": [{"audio": audio_file_path}, {"text": "What is the audio about?"}],
    }
]

response = MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen-audio-turbo", 
    messages=messages)
    
print("The output is:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Sample response

The output is:
This audio says: 'Welcome to Alibaba Cloud'.

Java

import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException {
            
        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
        // The full path of the local file must be prefixed with file:// to ensure its validity, for example: file:///home/images/test.mp3
        // The current test system is macOS. If you are using Windows, use "file:///ABSOLUTE_PATH/welcome.mp3" instead.
        
        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>(){{put("audio", localFilePath);}},
                        new HashMap<String, Object>(){{put("text", "What is the audio about?");}}))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-audio-turbo")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("The output is:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

The output is:
The audio says: 'Welcome to Alibaba Cloud'

Pass by Base64 encoding

Python

import os
import base64
from dashscope import MultiModalConversation

# Encoding function: Converts a local file to a Base64-encoded string.
def encode_audio(audio_file_path):
    with open(audio_file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file. 
audio_file_path = "ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

messages = [
    {
        "role": "system",
        "content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        # When passing a local file using Base64 encoding, it must be prefixed with data: to ensure the validity of the file URL.
        # You must include "base64," before the Base64-encoded data (base64_audio), otherwise an error will occur.
        "content": [{"audio":f"data:;base64,{base64_audio}"},
                    {"text": "What is the audio about? "}],
    }
]

response = MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen2-audio-instruct",
    messages=messages
    )

print("The output is:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Sample response

The output is:
This audio says: 'Welcome to Alibaba Cloud'.

Java

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;


public class Main {

    private static String encodeAudioToBase64(String audioPath) throws IOException {
        Path path = Paths.get(audioPath);
        byte[] audioBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(audioBytes);
    }
    
    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException, IOException {
        // Replace ABSOLUTE_PATH/welcome.mp3 with the actual path of your local file.
        String localFilePath = "ABSOLUTE_PATH/welcome.mp3";
        String base64Audio = encodeAudioToBase64(localFilePath);

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                 // When passing a local file using Base64 encoding, it must be prefixed with data: to ensure the validity of the file URL.
                 // You must include "base64," before the Base64-encoded data (base64Audio), otherwise an error will occur.
                .content(Arrays.asList(new HashMap<String, Object>(){{put("audio", "data:;base64," + base64Audio);}},
                        new HashMap<String, Object>(){{put("text", "What is the audio about?");}}))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                 // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY")) 
                .model("qwen-audio-turbo")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("The output is:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

The output is:
The audio says: 'Welcome to Alibaba Cloud'

curl

  • For a method to convert a file to a Base64-encoded string, see the Python sample code.

  • For demonstration purposes, the Base64-encoded string "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." in the code is truncated. In your application, you must pass the complete encoded string.

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-audio-turbo",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [
                    {"text": "You are a helpful assistant."}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"audio": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."},
                    {"text": "What is this audio about?"}
                ]
            }
        ]
    }
}'

More usage

Limits

Supported audio files

  • File size:

    • When passed as a public URL or local path: The audio file cannot exceed 10 MB.

    • When passed as a Base64-encoded string: The encoded string cannot exceed 10 MB.

  • Audio duration: The audio duration is limited to 30 seconds. If it exceeds 30 seconds, the model processes only the first 30 seconds.

  • File format: Common encoded audio formats are supported, such as AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3.

  • Supported languages: Chinese, English, Cantonese, French, Italian, Spanish, German, and Japanese.
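Before you send a request, you can check a file against the size and format limits above. The following is a minimal sketch with several assumptions: the names `check_audio` and `base64_length` are hypothetical, "10 MB" is interpreted as 10 × 1024 × 1024 bytes, and the extension set covers only the formats named above, although the list is introduced with "such as" and may not be exhaustive. The 30-second duration limit is not checked because that requires decoding the audio.

```python
import os
from typing import List

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" means 10 MiB
# Assumption: only the formats named in the limits list are accepted.
SUPPORTED_EXTENSIONS = {".amr", ".wav", ".3gp", ".3gpp", ".aac", ".mp3"}


def base64_length(raw_size: int) -> int:
    """Length of the padded Base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def check_audio(filename: str, raw_size: int, as_base64: bool) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    # For Base64 input, the limit applies to the encoded string, not the file.
    size = base64_length(raw_size) if as_base64 else raw_size
    if size > MAX_BYTES:
        problems.append(f"size {size} bytes exceeds the 10 MB limit")
    return problems


eight_mib = 8 * 1024 * 1024
print(check_audio("welcome.mp3", eight_mib, as_base64=False))
print(check_audio("welcome.mp3", eight_mib, as_base64=True))
```

Base64 encoding grows the data by about one third, so a file that passes the check as a URL or local path can still fail once it is encoded.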

Audio file input methods

  • Public URL: Provide a publicly accessible URL that uses the HTTP or HTTPS protocol. To obtain a public URL, you can upload a local file to OSS or upload a file to get a temporary URL.

  • Pass as Base64 encoding: Convert the file to a Base64-encoded string and pass it in the request.

  • Pass as a local file path: Pass the path of the local file directly.

Going live

The Qwen-Audio model is available for free trial only, has calling quota limits, and does not have a paid option. If you plan to use the audio understanding feature in a production environment, we recommend that you migrate to the Qwen-Omni model.

  • Migration advantages

    Commercial support
      Qwen-Audio: Free trial only, no paid option
      Qwen-Omni: Paid option available, suitable for production environments

    Features
      Qwen-Audio: Audio understanding capabilities
      Qwen-Omni: Omni-modal capabilities including audio

    Region restrictions
      Qwen-Audio: Only the Beijing region is supported
      Qwen-Omni: Supports international (Singapore) and Chinese mainland (Beijing) regions

  • Migration method

    The usage of the Qwen-Omni model is different from that of Qwen-Audio. For more information, see Asynchronous (Qwen-Omni).

API reference

For more information about the input and output parameters of the Qwen-Audio model, see Qwen.

Error codes

If the model call fails and returns an error message, see Error messages for resolution.

FAQ

How to choose a file upload method?

You can choose the most suitable upload method based on the SDK type, file size, and network stability.

  • Audio file greater than 7 MB and less than 10 MB
    DashScope SDK (Python, Java): Pass the local path.
    DashScope HTTP: Only public URLs are supported. Use Alibaba Cloud Object Storage Service.

  • Audio file less than 7 MB
    DashScope SDK (Python, Java): Pass the local path.
    DashScope HTTP: Use Base64 encoding.
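The recommendations above can be expressed as a small decision function. This is a sketch only: the name `choose_upload_method` is hypothetical, 1 MB is assumed to be 1024 × 1024 bytes, and a file of exactly 7 MB, which the recommendations do not cover, is treated conservatively as needing a public URL for HTTP calls.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def choose_upload_method(size_bytes: int, use_sdk: bool) -> str:
    """Pick an upload method from the file size and the call type.

    use_sdk is True for the DashScope Python or Java SDK and False for
    direct HTTP calls.
    """
    if size_bytes > 10 * MB:
        raise ValueError("audio files cannot exceed 10 MB")
    if use_sdk:
        # The SDKs accept a local file path for every allowed size.
        return "local path"
    if size_bytes < 7 * MB:
        return "base64"
    # Larger files would exceed the limit once Base64 encoded.
    return "public url"


print(choose_upload_method(5 * MB, use_sdk=True))
print(choose_upload_method(5 * MB, use_sdk=False))
print(choose_upload_method(8 * MB, use_sdk=False))
```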

How to compress an audio file to the required size?

  • Online tools: Use online tools such as Compresss to compress audio files.

  • Code implementation: Use the FFmpeg tool. For more information, see the FFmpeg official website.

    # Basic conversion command example
    # -y: Overwrites an existing output file without prompting (no value needed).
    # -i: Specifies the input file path. Example: input.mp3
    #
    # -b:a: Sets the audio bitrate.
    #   Common values include 64 kbps (low quality, suitable for voice and low-bandwidth streaming media), 128 kbps (medium quality, suitable for daily audio and podcasts), and 192 kbps (high quality, suitable for music and broadcasting).
    #   The higher the bitrate, the better the audio quality and the larger the file size.
    #
    # -ar: Sets the audio sample rate, which indicates the number of samples per second.
    #   Common values are 8000 Hz, 22050 Hz, and 44100 Hz (standard sample rate).
    #   The higher the sample rate, the larger the file size.
    #
    # -ac: Sets the number of audio channels. Common values are 1 (mono) and 2 (stereo). Mono files are smaller.
    #
    # output.mp3: Specifies the output file path.

    ffmpeg -y -i input.mp3 -b:a 128k -ar 44100 -ac 1 output.mp3
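If you prepare files from Python, the same FFmpeg options can be assembled programmatically. The following sketch builds the command as an argument list and also adds FFmpeg's `-t` option, which limits the output duration, so that the result stays within the 30-second limit. The function name `build_ffmpeg_command` and its default values are assumptions chosen for illustration, and running the command requires FFmpeg to be installed.

```python
import shlex
from typing import List


def build_ffmpeg_command(
    src: str,
    dst: str,
    bitrate: str = "64k",
    sample_rate: int = 22050,
    max_seconds: int = 30,
) -> List[str]:
    """Build an FFmpeg command that compresses and trims an audio file."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", src,               # input file
        "-t", str(max_seconds),  # keep only the first max_seconds seconds
        "-b:a", bitrate,         # audio bitrate
        "-ar", str(sample_rate), # sample rate in Hz
        "-ac", "1",              # mono output
        dst,
    ]


command = build_ffmpeg_command("input.mp3", "output.mp3")
print(shlex.join(command))
# To run it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.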