Qwen-Audio is a large-scale audio language model developed by Alibaba Cloud. It can understand various types of audio, including human speech, natural sounds, music, and singing. The model's core features include audio transcription, content summarization, sentiment analysis, audio event detection, and voice chat.
Applicable regions: This document applies only to the deployment mode in the Chinese mainland. The endpoints and data storage are located in the Beijing region. Compute resources for inference are limited to the Chinese mainland. To use the model, you must use an API key from the Beijing region.
For free trial only: The Qwen-Audio model is currently available for free trial only. After your free quota is used up, you can no longer call the model. Payment is not supported. For production-level applications, use Qwen-Omni as an alternative model.
Examples
Speech recognition and analysis
Audio Q&A: Asks and answers questions based on audio content, such as locating the timestamp of specific information in the audio.
Voice chat: The model can respond directly to audio content without any text instructions.
Supported models
Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per 1M tokens) | Output cost (per 1M tokens) | Free quota |
qwen-audio-turbo | Stable | 8,000 | 6,000 | 1,500 | Free trial only | Free trial only | 100,000 tokens, valid for 90 days after activating Model Studio |
qwen-audio-turbo-latest | Latest | 8,192 | 6,144 | 2,048 | Free trial only | Free trial only | 100,000 tokens, valid for 90 days after activating Model Studio |
These models are currently available for free trial only. You cannot invoke a model after its free quota is used. We recommend using Qwen-Omni as an alternative.
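As a rough sketch, the limits in the table above can be kept in a small lookup so that a caller can reject an oversized request before sending it. The names `MODEL_LIMITS` and `fits_input_limit` are illustrative assumptions and are not part of the DashScope SDK. The sketch also does not count tokens itself; it assumes you already have a token count, for example from the `usage` field of an earlier response.

```python
# Token limits copied from the table above.
MODEL_LIMITS = {
    "qwen-audio-turbo": {"context": 8000, "max_input": 6000, "max_output": 1500},
    "qwen-audio-turbo-latest": {"context": 8192, "max_input": 6144, "max_output": 2048},
}


def fits_input_limit(model: str, input_tokens: int) -> bool:
    """Return True if input_tokens does not exceed the model's maximum input."""
    return input_tokens <= MODEL_LIMITS[model]["max_input"]
```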
Getting started
Prerequisites
Obtain an API key and configure it as an environment variable.
If you use the DashScope SDK to call the model, install the latest version of the SDK.
The Qwen-Audio model can only be called through an API. It is not available for online trial in the Alibaba Cloud Model Studio console.
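As a minimal sketch of the first prerequisite, the helper below reads the key from the `DASHSCOPE_API_KEY` environment variable (the name used by the examples in this document) and fails early with a readable message when the variable is missing. The function name `load_api_key` is a hypothetical helper for illustration, not part of the DashScope SDK.

```python
import os


def load_api_key(env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is unset or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Configure a Beijing-region API key before calling the model."
        )
    return key
```

The returned value can then be passed as the `api_key` argument in the examples that follow.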
The following code examples show how to analyze online audio that is specified by a URL rather than a local file. For more information, see how to pass local files and the limits on audio files.
Python
import dashscope
import os

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"},
            {"text": "What is this audio about?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen-audio-turbo",
    messages=messages,
    result_format="message"
)
print("The output is:")
print(response["output"]["choices"][0]["message"].content[0]["text"])
Sample response
The output is:
This audio says:'Welcome to Alibaba Cloud'
Java
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"),
                        Collections.singletonMap("text", "What is this audio about?")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-audio-turbo")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("The output is:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
Sample response
The output is:
This audio says:'Welcome to Alibaba Cloud'
curl
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-audio-turbo",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"},
                    {"text": "What is this audio about?"}
                ]
            }
        ]
    }
}'
Sample response
{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "This audio says:'Welcome to Alibaba Cloud'"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "audio_tokens": 85,
        "output_tokens": 10,
        "input_tokens": 33
    },
    "request_id": "1341c517-71bf-94f5-862a-18fbddb332e9"
}
Core features
Voice conversation
The model can respond directly to audio content without any text instructions. For example, if the audio contains a question such as "What is a suitable activity for this environment?", the model replies with suitable activities.
Currently, the qwen-audio-turbo-latest and qwen2-audio-instruct models support voice chat.
Python
import dashscope
import os

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/kvkadk/%E6%8E%A8%E8%8D%90%E4%B9%A6.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen2-audio-instruct',
    messages=messages)
print("The output is:")
print(response["output"]["choices"][0]["message"].content[0]["text"])
Sample response
The output is:
Of course. There are many types of literary books. Based on your interests and preferences, I can give you some suggestions. What type of literary works do you like? For example, modern literature, classical literature, science fiction, historical novels, poetry, essays, and so on.
Java
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/kvkadk/%E6%8E%A8%E8%8D%90%E4%B9%A6.wav")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // Voice chat is supported by qwen-audio-turbo-latest and qwen2-audio-instruct.
                .model("qwen-audio-turbo-latest")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("The output is:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
Sample response
The output is:
Of course. There are many types of literary books. Based on your interests and preferences, I can give you some suggestions. What type of literary works do you like? For example, modern literature, classical literature, science fiction, historical novels, detective novels, novels, poetry, essays, drama, and so on.
curl
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-audio-turbo-latest",
    "input": {
        "messages": [
            {
                "role": "system",
                "content": [
                    {"text": "You are a helpful assistant."}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/kvkadk/%E6%8E%A8%E8%8D%90%E4%B9%A6.wav"}
                ]
            }
        ]
    }
}'
Sample response
{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "Of course, but I need to understand your interests first. What type of literary works do you prefer? For example, novels, essays, poetry, etc."
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "audio_tokens": 237,
        "output_tokens": 29,
        "input_tokens": 28
    },
    "request_id": "ae407255-2fed-9e5a-90e6-6dab3178e913"
}
Passing local files (Base64 encoding or file path)
Qwen-Audio supports two methods for uploading local files: Base64 encoding and file path. You can choose an upload method based on the file size and SDK type. For specific recommendations, see How to choose a file upload method.
Pass by file path
You can pass the file path directly to the model. This method is supported only by the DashScope Python and Java SDKs, and not by HTTP calls. Refer to the following table to specify the file path based on your programming language and operating system. The file must meet the requirements described in Supported audio files.
Base64-Encoded Input
You can encode the file into a Base64 string and then pass it to the model.
Pass by file path
Python
import os
from dashscope import MultiModalConversation

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full path of the local file must be prefixed with file:// to ensure its validity, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
    {
        "role": "system",
        "content": [{"text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        # Pass the file path prefixed with file:// in the audio parameter.
        "content": [{"audio": audio_file_path}, {"text": "What is the audio about?"}],
    }
]

response = MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen-audio-turbo",
    messages=messages)
print("The output is:")
print(response["output"]["choices"][0]["message"].content[0]["text"])
Sample response
The output is:
This audio says: 'Welcome to Alibaba Cloud'.
Java
import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException {
        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
        // The full path of the local file must be prefixed with file:// to ensure its validity, for example: file:///home/images/test.mp3
        // The current test system is macOS. If you are using Windows, use "file:///ABSOLUTE_PATH/welcome.mp3" instead.
        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>(){{put("audio", localFilePath);}},
                        new HashMap<String, Object>(){{put("text", "What is the audio about?");}}))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-audio-turbo")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("The output is:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
Sample response
The output is:
The audio says: 'Welcome to Alibaba Cloud'
Pass by Base64 encoding
Python
import os
import base64
from dashscope import MultiModalConversation


# Encoding function: Converts a local file to a Base64-encoded string.
def encode_audio(audio_file_path):
    with open(audio_file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")


# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)
messages = [
    {
        "role": "system",
        "content": [{"text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        # When passing a local file using Base64 encoding, it must be prefixed with data: to ensure the validity of the file URL.
        # You must include "base64," before the Base64-encoded data (base64_audio), otherwise an error will occur.
        "content": [{"audio": f"data:;base64,{base64_audio}"},
                    {"text": "What is the audio about?"}],
    }
]

response = MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen2-audio-instruct",
    messages=messages
)
# Print only the text item so that the output matches the sample response.
print("The output is:")
print(response["output"]["choices"][0]["message"].content[0]["text"])
Sample response
The output is:
This audio says: 'Welcome to Alibaba Cloud'.
Java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    private static String encodeAudioToBase64(String audioPath) throws IOException {
        Path path = Paths.get(audioPath);
        byte[] audioBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(audioBytes);
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException, IOException {
        // Replace ABSOLUTE_PATH/welcome.mp3 with the actual path of your local file.
        String localFilePath = "ABSOLUTE_PATH/welcome.mp3";
        String base64Audio = encodeAudioToBase64(localFilePath);
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // When passing a local file using Base64 encoding, it must be prefixed with data: to ensure the validity of the file URL.
                // You must include "base64," before the Base64-encoded data (base64Audio), otherwise an error will occur.
                .content(Arrays.asList(new HashMap<String, Object>(){{put("audio", "data:;base64," + base64Audio);}},
                        new HashMap<String, Object>(){{put("text", "What is the audio about?");}}))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-audio-turbo")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("The output is:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
Sample response
The output is:
The audio says: 'Welcome to Alibaba Cloud'
curl
For a method to convert a file to a Base64-encoded string, see the Python sample code.
For demonstration purposes, the Base64-encoded string "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." in the code is truncated. In your application, you must pass the complete encoded string.
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-audio-turbo",
    "input": {
        "messages": [
            {
                "role": "system",
                "content": [
                    {"text": "You are a helpful assistant."}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"audio": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."},
                    {"text": "What is this audio about?"}
                ]
            }
        ]
    }
}'
More usage
Limits
Supported audio files
File size:
When passed as a public URL or local path: The audio file size cannot exceed 10 MB.
When passed as a Base64-encoded string: The size of the encoded string cannot exceed 10 MB.
Audio duration: The audio duration is limited to 30 seconds. If it exceeds 30 seconds, the model processes only the first 30 seconds.
File format: Common encoded audio formats are supported, such as AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3.
Supported languages: Chinese, English, Cantonese, French, Italian, Spanish, German, and Japanese.
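The size and format limits above can be checked locally before a request is sent. The following is a sketch under stated assumptions: 10 MB is taken as 10 * 1024 * 1024 bytes, the extension set only mirrors the formats named above (the list is introduced with "such as", so it may not be exhaustive), and the names `check_audio_file` and `base64_length` are hypothetical helpers. The 30-second duration limit is not checked here because that requires decoding the audio.

```python
import math
from pathlib import Path
from typing import List

MAX_BYTES = 10 * 1024 * 1024  # Assumption: 1 MB = 1024 * 1024 bytes
SUPPORTED_EXTENSIONS = {".amr", ".wav", ".3gp", ".3gpp", ".aac", ".mp3"}


def base64_length(raw_size: int) -> int:
    """Length in characters of the padded Base64 encoding of raw_size bytes."""
    return 4 * math.ceil(raw_size / 3)


def check_audio_file(path: str, as_base64: bool = False) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    file = Path(path)
    if file.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {file.suffix or '(none)'}")
    size = file.stat().st_size
    # Base64 inflates the payload by about one third, so compare the encoded size.
    effective = base64_length(size) if as_base64 else size
    if effective > MAX_BYTES:
        problems.append(f"payload is {effective} bytes, limit is {MAX_BYTES}")
    return problems
```

Because Base64 grows the data by roughly one third, a file that passes the check as a URL or local path can still exceed the limit once encoded.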
Audio file input methods
Public URL: Provide a publicly accessible URL that uses the HTTP or HTTPS protocol. To obtain a public URL, you can upload a local file to OSS or upload a file to get a temporary URL.
Pass as Base64 encoding: Convert the file to a Base64-encoded string and pass it in the request.
Pass as a local file path: Pass the path of the local file directly.
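The three input methods differ only in the string placed in the `audio` field of a message. The sketch below builds that field for each method. The function name `audio_field` and its `method` labels are assumptions made for illustration; the `file://` and `data:;base64,` prefixes follow the examples in this document.

```python
import base64
from pathlib import Path


def audio_field(source: str, method: str) -> dict:
    """Build the {"audio": ...} content item for one of the three input methods."""
    if method == "url":
        if not source.startswith(("http://", "https://")):
            raise ValueError("a public URL must use HTTP or HTTPS")
        return {"audio": source}
    if method == "path":
        # resolve() makes the path absolute; as_uri() then yields a file:// URI.
        return {"audio": Path(source).resolve().as_uri()}
    if method == "base64":
        encoded = base64.b64encode(Path(source).read_bytes()).decode("utf-8")
        return {"audio": f"data:;base64,{encoded}"}
    raise ValueError(f"unknown method: {method}")
```

The returned dictionary can be placed in the `content` list of a user message next to an optional `{"text": ...}` item. Note that the `path` form is usable only with the DashScope Python and Java SDKs, not with HTTP calls.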
Going live
The Qwen-Audio model is available for free trial only, has calling quota limits, and does not have a paid option. If you plan to use the audio understanding feature in a production environment, we recommend that you migrate to the Qwen-Omni model.
Migration advantages
Comparison item | Qwen-Audio | Qwen-Omni |
Commercial support | Free trial only, no paid option | Paid option available, suitable for production environments |
Features | Audio understanding capabilities | Omni-modal capabilities including audio |
Region restrictions | Only the Beijing region is supported | Supports international (Singapore) and Chinese mainland (Beijing) regions |
Migration method
The usage of the Qwen-Omni model is different from that of Qwen-Audio. For more information, see Asynchronous (Qwen-Omni).
API reference
For more information about the input and output parameters of the Qwen-Audio model, see Qwen.
Error codes
If the model call fails and returns an error message, see Error messages for resolution.
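When you call the HTTP endpoint directly, it can help to separate successful responses from failures before reading the reply. The sketch below works on the parsed JSON body and follows the shape of the sample responses in this document. It assumes that a failed call returns top-level `code` and `message` fields instead of `output`; verify this against the error message reference. The function name `extract_text` is hypothetical.

```python
def extract_text(payload: dict) -> str:
    """Return the first text item of the first choice, or raise with the error details."""
    # Assumption: a failed call has no "output" and carries "code" and "message".
    if "output" not in payload:
        code = payload.get("code", "unknown")
        message = payload.get("message", "no message")
        raise RuntimeError(f"model call failed: {code}: {message}")
    content = payload["output"]["choices"][0]["message"]["content"]
    for item in content:
        if "text" in item:
            return item["text"]
    raise RuntimeError("response contained no text item")
```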