Transcribe pre-recorded audio into text. Non-real-time speech recognition models support multilingual recognition, sung-content recognition, noise reduction, and speaker diarization, making them well suited for meeting transcription, call analysis, subtitle generation, and similar use cases.
Overview
Transcribe pre-recorded audio and video files in bulk by submitting asynchronous tasks.
- Supports Context enhancement, which lets you provide contextual hints to improve recognition accuracy (fun-asr-flash-2026-06-15 only).
- Configurable features include speaker diarization, sensitive-word filtering, sentence- and word-level timestamps, and hotword enhancement.
- Asynchronously transcribes a single audio file of up to 12 hours and 2 GB.
- Accepts any sample rate and works with common audio and video formats, including AAC, WAV, and MP3.
For real-time scenarios such as live captioning, online meetings, or voice assistants, use Real-time speech recognition instead. For guidance on choosing the right model, see Speech-to-text.
Prerequisites
- You have obtained an API key and stored it as an environment variable.
- To call the API through the DashScope SDK, install the latest SDK.
Quick start
Fun-ASR
Asynchronous
Audio and video files are typically large, so the file-transcription API is asynchronous: submit the task, poll the query endpoint for its status, and retrieve the recognition result after the task completes.
cURL
When you call the API with cURL, first submit the task to obtain a task_id, then use that ID to query the result.
Submit a task
China (Beijing) region URL. The URL varies by region.
curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-DashScope-Async: enable" \
  -d '{
    "model": "fun-asr",
    "input": {
      "file_urls": [
        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
      ]
    },
    "parameters": {
      "channel_id": [0],
      "language_hints": ["zh", "en"]
    }
  }'
Get the task result
This query endpoint defaults to 20 QPS and can be scaled up to 100 QPS. For higher throughput, or to avoid polling-induced throttling, configure asynchronous task callbacks (see Replace polling with callbacks for high-concurrency workloads).
China (Beijing) region URL. The URL varies by region.
curl -X GET 'https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "X-DashScope-Async: enable" \
  -H "Content-Type: application/json"
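Because the query endpoint is rate limited, a polling loop that backs off between queries helps keep request volume low. The following is a minimal sketch, not an SDK feature: `poll_until_done`, its parameters, and the delay values are hypothetical, and the terminal statuses SUCCEEDED, FAILED, and UNKNOWN are taken from the polling examples elsewhere on this page. `fetch_task` is any callable that returns the parsed JSON body of the query endpoint.

```python
import time
from typing import Callable, Dict

# Statuses treated as final (assumption: same set as the polling examples on this page).
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "UNKNOWN"}


def poll_until_done(
    fetch_task: Callable[[], Dict],
    initial_delay: float = 2.0,
    max_delay: float = 30.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_task() until output.task_status is terminal, doubling the wait each time."""
    delay = initial_delay
    waited = 0.0
    while True:
        body = fetch_task()
        output = body.get("output") or {}
        status = str(output.get("task_status", "")).upper()
        if status in TERMINAL_STATUSES:
            return body
        if waited >= timeout:
            raise TimeoutError(f"task not finished after {waited:.0f} seconds")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # back off to reduce query traffic
```

In practice, `fetch_task` would wrap the GET request shown above and return its JSON body. The `sleep` parameter exists only so the loop can be exercised without real waiting.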
Download the recognition result
After the task succeeds, the output.results[].transcription_url field returned by the query endpoint points to a publicly downloadable JSON file that contains the full recognition result. The URL is valid for 24 hours by default, so download and persist the file promptly.
# Replace {transcription_url} with the transcription_url value returned by the query endpoint
curl -sS '{transcription_url}' -o transcription.json
cat transcription.json | jq .
Python
from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json
# China (Beijing) region URL. The URL varies by region.
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
# API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API Key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav'],
    language_hints=['zh', 'en']  # language_hints is an optional parameter used to specify the language code of the audio to be recognized. For the value range, see the API reference documentation.
)
transcription_response = Transcription.wait(task=task_response.output.task_id)
if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4,
                             ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)
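The example above only loads the result into memory. Because the transcription_url expires (after 24 hours by default), it can be useful to write the JSON to disk as soon as the task succeeds. The following is a minimal sketch using only the standard library; `persist_transcription` and the destination path are hypothetical names, not part of the DashScope SDK.

```python
import json
from pathlib import Path
from urllib import request


def persist_transcription(transcription_url: str, dest: str) -> dict:
    """Download the result JSON, save it unchanged to dest, and return the parsed content."""
    with request.urlopen(transcription_url, timeout=30) as resp:
        raw = resp.read()
    result = json.loads(raw.decode("utf-8"))  # fails early if the body is not valid JSON
    dest_path = Path(dest)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    dest_path.write_bytes(raw)  # keep the original bytes for later reprocessing
    return result
```

For example, `persist_transcription(url, "results/transcription.json")` could replace the `json.loads(request.urlopen(url)...)` line in the loop above.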
Java
import com.alibaba.dashscope.audio.asr.transcription.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
public class Main {
    public static void main(String[] args) {
        // China (Beijing) region URL. The URL varies by region.
        Constants.baseHttpApiUrl = "https://dashscope.aliyuncs.com/api/v1";
        // Create the transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
                        // If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("fun-asr")
                        // language_hints is an optional parameter used to specify the language code of the audio to be recognized. For the value range, see the API reference documentation.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete, then get the result
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Get the transcription result
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && taskResultList.size() > 0) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()));
                    Gson gson = new GsonBuilder().setPrettyPrinting().create();
                    JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                    System.out.println(gson.toJson(jsonResult));
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}
Synchronous (fun-asr-flash-2026-06-15)
fun-asr-flash-2026-06-15 supports synchronous calls for audio files up to 5 minutes. Results can be returned in streaming or non-streaming mode.
China (Beijing) region URL. The URL varies by region.
curl --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header "Content-Type: application/json" \
  --header "X-DashScope-SSE: disable" \
  --data '{
    "model": "fun-asr-flash-2026-06-15",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "input_audio",
              "input_audio": {
                "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
              }
            }
          ]
        }
      ]
    },
    "parameters": {
      "format": "wav",
      "sample_rate": "16000"
    }
  }'
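The same synchronous request can be assembled in Python with only the standard library. This sketch mirrors the cURL body above field by field. `build_sync_request` is a hypothetical helper, and sending `X-DashScope-SSE: enable` to request streaming is an assumption inferred from the `disable` value shown above, so confirm it against the API reference before relying on it.

```python
import json
import os
from urllib import request

# China (Beijing) region URL, copied from the cURL example. The URL varies by region.
SYNC_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation"


def build_sync_request(audio_url: str, audio_format: str = "wav",
                       sample_rate: str = "16000", stream: bool = False,
                       api_key: str = "") -> request.Request:
    """Build (but do not send) a synchronous recognition request."""
    payload = {
        "model": "fun-asr-flash-2026-06-15",
        "input": {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "input_audio", "input_audio": {"data": audio_url}}
                    ],
                }
            ]
        },
        "parameters": {"format": audio_format, "sample_rate": sample_rate},
    }
    headers = {
        "Authorization": f"Bearer {api_key or os.getenv('DASHSCOPE_API_KEY', '')}",
        "Content-Type": "application/json",
        # Assumption: "enable" switches the response to streaming (SSE) mode.
        "X-DashScope-SSE": "enable" if stream else "disable",
    }
    return request.Request(SYNC_URL, data=json.dumps(payload).encode("utf-8"),
                           headers=headers, method="POST")
```

For a non-streaming call, the request could then be sent with `request.urlopen(req)` and the body parsed with `json.load`.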
Qwen3-ASR-Flash-Filetrans
Qwen3-ASR-Flash-Filetrans is purpose-built for asynchronous transcription of audio files. It supports recordings of up to 12 hours, accepts only publicly accessible audio file URLs (local file upload is not supported), and returns the full recognition result in a single response after the task completes.
cURL
When you call the API with cURL, submit the task first to obtain a task_id, then use that ID to query the result.
Submit a task
China (Beijing) region URL. The URL varies by region.
curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-DashScope-Async: enable" \
  -d '{
    "model": "qwen3-asr-flash-filetrans",
    "input": {
      "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
    },
    "parameters": {
      "channel_id": [0],
      "enable_itn": false,
      "enable_words": true
    }
  }'
Get the task result
This query endpoint defaults to 20 QPS and can be scaled up to 100 QPS. For higher throughput, or to avoid polling-induced throttling, configure asynchronous task callbacks (see Replace polling with callbacks for high-concurrency workloads).
China (Beijing) region URL. The URL varies by region.
curl -X GET 'https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "X-DashScope-Async: enable" \
  -H "Content-Type: application/json"
Download the recognition result
After the task succeeds, the output.result.transcription_url field returned by the query endpoint points to a publicly downloadable JSON file that contains the full recognition result. The URL is valid for 24 hours by default, so download and persist the file promptly.
# Replace {transcription_url} with the transcription_url value returned by the query endpoint
curl -sS '{transcription_url}' -o transcription.json
cat transcription.json | jq .
Complete example
Java
import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import okhttp3.*;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
public class Main {
    // China (Beijing) region URL. The URL varies by region.
    private static final String API_URL_SUBMIT = "https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription";
    // China (Beijing) region URL. The URL varies by region.
    private static final String API_URL_QUERY = "https://dashscope.aliyuncs.com/api/v1/tasks/";
    private static final Gson gson = new Gson();
    public static void main(String[] args) {
        // API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
        // If the environment variable is not configured, replace the following line with your Model Studio API Key: String apiKey = "sk-xxx"
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        OkHttpClient client = new OkHttpClient();
        // 1. Submit the task
        /*String payloadJson = """
                {
                    "model": "qwen3-asr-flash-filetrans",
                    "input": {
                        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                    },
                    "parameters": {
                        "channel_id": [0],
                        "enable_itn": false,
                        "language": "zh"
                    }
                }
                """;*/
        String payloadJson = """
                {
                    "model": "qwen3-asr-flash-filetrans",
                    "input": {
                        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                    },
                    "parameters": {
                        "channel_id": [0],
                        "enable_itn": false,
                        "enable_words": true
                    }
                }
                """;
        RequestBody body = RequestBody.create(payloadJson, MediaType.get("application/json; charset=utf-8"));
        Request submitRequest = new Request.Builder()
                .url(API_URL_SUBMIT)
                .addHeader("Authorization", "Bearer " + apiKey)
                .addHeader("Content-Type", "application/json")
                .addHeader("X-DashScope-Async", "enable")
                .post(body)
                .build();
        String taskId = null;
        try (Response response = client.newCall(submitRequest).execute()) {
            if (response.isSuccessful() && response.body() != null) {
                String respBody = response.body().string();
                ApiResponse apiResp = gson.fromJson(respBody, ApiResponse.class);
                if (apiResp.output != null) {
                    taskId = apiResp.output.taskId;
                    System.out.println("Task submitted, task_id: " + taskId);
                } else {
                    System.out.println("Submit response content: " + respBody);
                    return;
                }
            } else {
                System.out.println("Task submission failed! HTTP code: " + response.code());
                if (response.body() != null) {
                    System.out.println(response.body().string());
                }
                return;
            }
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }
        // 2. Poll the task status
        boolean finished = false;
        while (!finished) {
            try {
                TimeUnit.SECONDS.sleep(2); // Wait 2 seconds before querying again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            String queryUrl = API_URL_QUERY + taskId;
            Request queryRequest = new Request.Builder()
                    .url(queryUrl)
                    .addHeader("Authorization", "Bearer " + apiKey)
                    .addHeader("X-DashScope-Async", "enable")
                    .addHeader("Content-Type", "application/json")
                    .get()
                    .build();
            try (Response response = client.newCall(queryRequest).execute()) {
                if (response.body() != null) {
                    String queryResponse = response.body().string();
                    ApiResponse apiResp = gson.fromJson(queryResponse, ApiResponse.class);
                    if (apiResp.output != null && apiResp.output.taskStatus != null) {
                        String status = apiResp.output.taskStatus;
                        System.out.println("Current task status: " + status);
                        if ("SUCCEEDED".equalsIgnoreCase(status)
                                || "FAILED".equalsIgnoreCase(status)
                                || "UNKNOWN".equalsIgnoreCase(status)) {
                            finished = true;
                            System.out.println("Task completed. Final result: ");
                            System.out.println(queryResponse);
                        }
                    } else {
                        System.out.println("Query response content: " + queryResponse);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
                return;
            }
        }
    }
    static class ApiResponse {
        @SerializedName("request_id")
        String requestId;
        Output output;
    }
    static class Output {
        @SerializedName("task_id")
        String taskId;
        @SerializedName("task_status")
        String taskStatus;
    }
}
Python
import os
import time
import requests
import json
# China (Beijing) region URL. The URL varies by region.
API_URL_SUBMIT = "https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription"
# China (Beijing) region URL. The URL varies by region.
API_URL_QUERY_BASE = "https://dashscope.aliyuncs.com/api/v1/tasks/"
def main():
    # API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-DashScope-Async": "enable"
    }
    # 1. Submit the task
    payload = {
        "model": "qwen3-asr-flash-filetrans",
        "input": {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
        },
        "parameters": {
            "channel_id": [0],
            # "language": "zh",
            "enable_itn": False,
            "enable_words": True
        }
    }
    print("Submitting ASR transcription task...")
    try:
        submit_resp = requests.post(API_URL_SUBMIT, headers=headers, data=json.dumps(payload))
    except requests.RequestException as e:
        print(f"Request to submit task failed: {e}")
        return
    if submit_resp.status_code != 200:
        print(f"Task submission failed! HTTP code: {submit_resp.status_code}")
        print(submit_resp.text)
        return
    resp_data = submit_resp.json()
    output = resp_data.get("output")
    if not output or "task_id" not in output:
        print("Unexpected submit response content:", resp_data)
        return
    task_id = output["task_id"]
    print(f"Task submitted, task_id: {task_id}")
    # 2. Poll the task status
    finished = False
    while not finished:
        time.sleep(2)  # Wait 2 seconds before querying again
        query_url = API_URL_QUERY_BASE + task_id
        try:
            query_resp = requests.get(query_url, headers=headers)
        except requests.RequestException as e:
            print(f"Request to query task failed: {e}")
            return
        if query_resp.status_code != 200:
            print(f"Task query failed! HTTP code: {query_resp.status_code}")
            print(query_resp.text)
            return
        query_data = query_resp.json()
        output = query_data.get("output")
        if output and "task_status" in output:
            status = output["task_status"]
            print(f"Current task status: {status}")
            if status.upper() in ("SUCCEEDED", "FAILED", "UNKNOWN"):
                finished = True
                print("Task completed. Final result:")
                print(json.dumps(query_data, indent=2, ensure_ascii=False))
        else:
            print("Query response content:", query_data)
if __name__ == "__main__":
    main()
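The example above stops at printing the final query response. To go one step further and locate the result file, note that this page describes two response shapes: output.result.transcription_url for Qwen3-ASR-Flash-Filetrans and output.results[].transcription_url for Fun-ASR. The following sketch handles both; `extract_transcription_urls` is a hypothetical helper, and skipping entries whose subtask_status is not SUCCEEDED follows the Fun-ASR Python example rather than a documented rule.

```python
from typing import Dict, List


def extract_transcription_urls(query_data: Dict) -> List[str]:
    """Collect transcription_url values from a parsed query response."""
    output = query_data.get("output") or {}
    urls: List[str] = []
    # Single-file shape: output.result.transcription_url
    single = output.get("result") or {}
    if single.get("transcription_url"):
        urls.append(single["transcription_url"])
    # Multi-file shape: output.results[].transcription_url
    for item in output.get("results") or []:
        status = item.get("subtask_status", "SUCCEEDED")
        if status == "SUCCEEDED" and item.get("transcription_url"):
            urls.append(item["transcription_url"])
    return urls
```

Each returned URL can then be downloaded in the same way as the cURL step shown earlier.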
Java SDK
import com.alibaba.dashscope.audio.qwen_asr.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonObject;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
public class Main {
    public static void main(String[] args) {
        // China (Beijing) region URL. The URL varies by region.
        Constants.baseHttpApiUrl = "https://dashscope.aliyuncs.com/api/v1";
        QwenTranscriptionParam param =
                QwenTranscriptionParam.builder()
                        // API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
                        // If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("qwen3-asr-flash-filetrans")
                        .fileUrl("https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav")
                        //.parameter("language", "zh")
                        //.parameter("channel_id", new ArrayList<String>(){{add("0");add("1");}})
                        .parameter("enable_itn", false)
                        .parameter("enable_words", true)
                        .build();
        try {
            QwenTranscription transcription = new QwenTranscription();
            // Submit the task
            QwenTranscriptionResult result = transcription.asyncCall(param);
            System.out.println("create task result: " + result);
            // Query the task status
            result = transcription.fetch(QwenTranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            System.out.println("task status: " + result);
            // Wait for the task to complete
            result =
                    transcription.wait(
                            QwenTranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            System.out.println("task result: " + result);
            // Get the speech recognition result
            QwenTranscriptionTaskResult taskResult = result.getResult();
            if (taskResult != null) {
                // Get the URL of the recognition result
                String transcriptionUrl = taskResult.getTranscriptionUrl();
                // Fetch the content at the URL
                HttpURLConnection connection =
                        (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                connection.setRequestMethod("GET");
                connection.connect();
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(connection.getInputStream()));
                // Pretty-print the JSON result
                Gson gson = new GsonBuilder().setPrettyPrinting().create();
                System.out.println(gson.toJson(gson.fromJson(reader, JsonObject.class)));
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
    }
}
Python SDK
import json
import os
import sys
from http import HTTPStatus
import dashscope
from dashscope.audio.qwen_asr import QwenTranscription
from dashscope.api_entities.dashscope_response import TranscriptionResponse
# run the transcription script
if __name__ == '__main__':
    # API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API Key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    # China (Beijing) region URL. The URL varies by region.
    dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
    task_response = QwenTranscription.async_call(
        model='qwen3-asr-flash-filetrans',
        file_url='https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav',
        #language="",
        enable_itn=False,
        enable_words=True
    )
    print(f'task_response: {task_response}')
    print(task_response.output.task_id)
    query_response = QwenTranscription.fetch(task=task_response.output.task_id)
    print(f'query_response: {query_response}')
    task_result = QwenTranscription.wait(task=task_response.output.task_id)
    print(f'task_result: {task_result}')
Qwen3-ASR-Flash
Qwen3-ASR-Flash supports recordings of up to 5 minutes. It accepts either a publicly accessible audio file URL or a local file upload, and can stream the recognition result back to you.
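Since Qwen3-ASR-Flash is limited to 5 minutes and Qwen3-ASR-Flash-Filetrans to 12 hours, an application may want to pick the model from the recording length. The following is a rough sketch for WAV input only, using the standard-library `wave` module; `wav_duration_seconds` and `pick_qwen_model` are hypothetical helpers, and treating the limits as inclusive is an assumption.

```python
import wave

SYNC_LIMIT_SECONDS = 5 * 60          # Qwen3-ASR-Flash: up to 5 minutes
ASYNC_LIMIT_SECONDS = 12 * 60 * 60   # Qwen3-ASR-Flash-Filetrans: up to 12 hours


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def pick_qwen_model(duration_seconds: float) -> str:
    """Choose a model name based on the duration limits stated on this page."""
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "qwen3-asr-flash"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "qwen3-asr-flash-filetrans"
    raise ValueError("recording exceeds the 12-hour limit; split it first")
```

Keep in mind that Qwen3-ASR-Flash-Filetrans accepts only publicly accessible URLs, so a long local recording must be uploaded before it can be submitted.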
Input: audio file URL
Python SDK
import os
import dashscope
# China (Beijing) region URL. The URL varies by region.
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
messages = [
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
    # API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh", # Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
        "enable_itn": False
    }
)
print(response)
Java SDK
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
                .build();
        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", false);
        // asrOptions.put("language", "zh"); // Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
                .model("qwen3-asr-flash")
                .message(userMessage)
                .parameter("asr_options", asrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(JsonUtils.toJson(result));
    }
    public static void main(String[] args) {
        try {
            // China (Beijing) region URL. The URL varies by region.
            Constants.baseHttpApiUrl = "https://dashscope.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
cURL
China (Beijing) region URL. The URL varies by region.
curl -X POST "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-asr-flash",
    "input": {
      "messages": [
        {
          "content": [
            {
              "text": ""
            }
          ],
          "role": "system"
        },
        {
          "content": [
            {
              "audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
            }
          ],
          "role": "user"
        }
      ]
    },
    "parameters": {
      "asr_options": {
        "enable_itn": false
      }
    }
  }'
Input: Base64-encoded audio file
Pass Base64-encoded audio as a data URL in the form data:<mediatype>;base64,<data>.
- <mediatype>: the MIME type. The value depends on the audio format. For example:
  - WAV: audio/wav
  - MP3: audio/mpeg
- <data>: the audio data encoded as a Base64 string. Base64 encoding increases the payload size, so keep the source file small enough that the encoded result stays within the 10 MB input limit.
- Example:
  data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9
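Base64 turns every 3 input bytes into 4 output characters, so the encoded size can be computed before any encoding is done. The following is a minimal sketch of a data URL builder with that pre-check. `to_data_url`, `encoded_size`, and the suffix-to-MIME table are hypothetical helpers, and interpreting "10 MB" as 10 × 1024 × 1024 bytes of Base64 text is an assumption.

```python
import base64
from pathlib import Path

# MIME types listed above; extend as needed for other formats.
MIME_BY_SUFFIX = {".wav": "audio/wav", ".mp3": "audio/mpeg"}
# Assumption: the 10 MB limit is counted in binary megabytes.
MAX_ENCODED_BYTES = 10 * 1024 * 1024


def encoded_size(raw_size: int) -> int:
    """Length of the padded Base64 string for raw_size input bytes."""
    return 4 * ((raw_size + 2) // 3)


def to_data_url(path: str) -> str:
    """Build data:<mediatype>;base64,<data> for a local audio file."""
    file_path = Path(path)
    mime = MIME_BY_SUFFIX.get(file_path.suffix.lower())
    if mime is None:
        raise ValueError(f"no MIME type configured for {file_path.suffix!r}")
    raw = file_path.read_bytes()
    if encoded_size(len(raw)) > MAX_ENCODED_BYTES:
        raise ValueError("encoded audio would exceed the 10 MB input limit")
    return f"data:{mime};base64,{base64.b64encode(raw).decode('ascii')}"
```

Under this assumption, the largest source file that still fits is roughly 7.5 MB, since the encoded form is about one third larger than the input.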
Python SDK
The example uses this audio file: welcome.mp3.
import base64
import dashscope
import os
import pathlib
# China (Beijing) region URL. The URL varies by region.
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
# Replace with the actual path to your audio file
file_path = "welcome.mp3"
# Replace with the actual MIME type of your audio file
audio_mime_type = "audio/mpeg"
file_path_obj = pathlib.Path(file_path)
if not file_path_obj.exists():
    raise FileNotFoundError(f"Audio file not found: {file_path}")
base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
data_uri = f"data:{audio_mime_type};base64,{base64_str}"
messages = [
    {"role": "user", "content": [{"audio": data_uri}]}
]
response = dashscope.MultiModalConversation.call(
    # API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh", # Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
        "enable_itn": False
    }
)
print(response)
Java SDK
The example uses this audio file: welcome.mp3.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
    // Replace with the actual path to your audio file
    private static final String AUDIO_FILE = "welcome.mp3";
    // Replace with the actual MIME type of your audio file
    private static final String AUDIO_MIME_TYPE = "audio/mpeg";
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException, IOException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", toDataUrl())))
                .build();
        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", false);
        // asrOptions.put("language", "zh"); // Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
                .model("qwen3-asr-flash")
                .message(userMessage)
                .parameter("asr_options", asrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(JsonUtils.toJson(result));
    }
    public static void main(String[] args) {
        try {
            // China (Beijing) region URL. The URL varies by region.
            Constants.baseHttpApiUrl = "https://dashscope.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
    // Generate a data URI
    public static String toDataUrl() throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(AUDIO_FILE));
        String encoded = Base64.getEncoder().encodeToString(bytes);
        return "data:" + AUDIO_MIME_TYPE + ";base64," + encoded;
    }
}
Input: absolute path to a local audio file
When you process a local audio file with the DashScope SDK, pass the file path as input. Build the path according to your SDK and operating system, as shown in the following table.
| Operating system | SDK | File path format | Example |
|---|---|---|---|
| Linux or macOS | Python SDK | file://{absolute_path_to_file} | file:///home/audio/test.wav |
| Linux or macOS | Java SDK | file://{absolute_path_to_file} | file:///home/audio/test.wav |
| Windows | Python SDK | file://{absolute_path_to_file} | file://D:/audio/test.wav |
| Windows | Java SDK | file:///{absolute_path_to_file} | file:///D:/audio/test.wav |
Local-file calls are capped at 100 QPS and the limit cannot be increased, so they are not suitable for production, high-concurrency, or load-testing workloads. For higher concurrency, upload the file to OSS and call the API with its URL.
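The path rules in the table can be captured in a small helper so that callers do not have to remember which prefix applies. This is a minimal sketch that encodes only the four combinations listed above; `sdk_file_uri` is a hypothetical function, and detecting Windows paths by their drive letter is an assumption that ignores UNC paths.

```python
from pathlib import PurePosixPath, PureWindowsPath


def sdk_file_uri(absolute_path: str, sdk: str = "python") -> str:
    """Format an absolute local path the way the given SDK ("python" or "java") expects."""
    if sdk not in ("python", "java"):
        raise ValueError(f"unknown sdk: {sdk!r}")
    win = PureWindowsPath(absolute_path)
    if win.drive:
        # Windows path such as D:\audio\test.wav or D:/audio/test.wav
        if not win.is_absolute():
            raise ValueError("path must be absolute")
        prefix = "file:///" if sdk == "java" else "file://"
        return prefix + win.as_posix()
    posix = PurePosixPath(absolute_path)
    if not posix.is_absolute():
        raise ValueError("path must be absolute")
    # Linux or macOS: the leading slash of the path supplies the third slash.
    return "file://" + str(posix)
```

For example, `sdk_file_uri("/home/audio/test.wav")` yields the Linux form from the table, and `sdk_file_uri("D:/audio/test.wav", "java")` yields the Windows Java form.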
Python SDK
The example uses this audio file: welcome.mp3.
import os
import dashscope
# China (Beijing) region URL. The URL varies by region.
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path to your local audio file
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
    {"role": "user", "content": [{"audio": audio_file_path}]}
]
response = dashscope.MultiModalConversation.call(
    # API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh", # Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
        "enable_itn": False
    }
)
print(response)
Java SDK
The example uses this audio file: welcome.mp3.
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path to your local file
        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", localFilePath)))
                .build();
        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", false);
        // asrOptions.put("language", "zh"); // Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
                .model("qwen3-asr-flash")
                .message(userMessage)
                .parameter("asr_options", asrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(JsonUtils.toJson(result));
    }
    public static void main(String[] args) {
        try {
            // China (Beijing) region URL. The URL varies by region.
            Constants.baseHttpApiUrl = "https://dashscope.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
Streaming output
The model generates intermediate results step by step, and the final result is assembled from them. A non-streaming call waits for the full result and returns it in one response, while a streaming call returns results as they are generated, which significantly reduces time to first token. Choose the streaming parameter that matches your call method:
-
DashScope Python SDK: set stream to True.
-
DashScope Java SDK: call the streamCall method.
-
DashScope HTTP: set the X-DashScope-SSE header to enable.
Python SDK
import os
import dashscope
# China (Beijing) region URL. The URL varies by region.
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
messages = [
{"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
# API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
# To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
model="qwen3-asr-flash",
messages=messages,
result_format="message",
asr_options={
# "language": "zh", # Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
"enable_itn":False
},
stream=True
)
for chunk in response:
    try:
        print(chunk["output"]["choices"][0]["message"].content[0]["text"])
    except Exception:
        # Skip chunks that carry no text content
        pass
Java SDK
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
.build();
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_itn", false);
// asrOptions.put("language", "zh"); // Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
// To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
.model("qwen3-asr-flash")
.message(userMessage)
.parameter("asr_options", asrOptions)
.build();
Flowable<MultiModalConversationResult> resultFlowable = conv.streamCall(param);
resultFlowable.blockingForEach(item -> {
try {
System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
} catch (Exception e){
System.exit(0);
}
});
}
public static void main(String[] args) {
try {
// China (Beijing) region URL. The URL varies by region.
Constants.baseHttpApiUrl = "https://dashscope.aliyuncs.com/api/v1";
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
cURL
China (Beijing) region URL. The URL varies by region.
curl -X POST "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
"model": "qwen3-asr-flash",
"input": {
"messages": [
{
"content": [
{
"text": ""
}
],
"role": "system"
},
{
"content": [
{
"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
],
"role": "user"
}
]
},
"parameters": {
"incremental_output": true,
"asr_options": {
"enable_itn": false
}
}
}'
Paraformer
Paraformer uses the same asynchronous call flow as Fun-ASR. Reuse the Fun-ASR asynchronous example and replace the model name with a Paraformer model name.
Advanced features
Use the OpenAI-compatible API
The OpenAI-compatible mode is not available in the US region.
Only the Qwen3-ASR-Flash model series supports OpenAI-compatible calls. This mode accepts only publicly accessible audio file URLs; absolute paths to local audio files are not accepted.
The OpenAI Python SDK must be 1.52.0 or later, and the Node.js SDK must be 4.68.0 or later. To install or upgrade:
# Python
pip install -U "openai>=1.52.0"
# Node.js
npm install openai@^4.68.0
asr_options is not a standard OpenAI parameter. When you call the API through the OpenAI SDK, pass it through extra_body.
Input: audio file URL
Python SDK
from openai import OpenAI
import os
try:
    client = OpenAI(
        # API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
        # If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # China (Beijing) region URL. The URL varies by region.
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    stream_enabled = False  # Whether to enable streaming output
    completion = client.chat.completions.create(
        model="qwen3-asr-flash",
        messages=[
            {
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                        }
                    }
                ],
                "role": "user"
            }
        ],
        stream=stream_enabled,
        # When stream is False, stream_options cannot be set
        # stream_options={"include_usage": True},
        extra_body={
            "asr_options": {
                # "language": "zh",
                "enable_itn": False
            }
        }
    )
    if stream_enabled:
        full_content = ""
        print("Streaming output:")
        for chunk in completion:
            # When stream_options.include_usage is True, the choices field of the last chunk is an empty list and must be skipped (you can get token usage via chunk.usage)
            print(chunk)
            if chunk.choices and chunk.choices[0].delta.content:
                full_content += chunk.choices[0].delta.content
        print(f"Full content: {full_content}")
    else:
        print(f"Non-streaming output: {completion.choices[0].message.content}")
except Exception as e:
    print(f"Error: {e}")
Node.js SDK
// Preparations before running:
// Common to Windows/Mac/Linux:
// 1. Make sure Node.js is installed (version >= 14 recommended)
// 2. Run the following command to install the required dependency: npm install openai
import OpenAI from "openai";
const client = new OpenAI({
// API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
// China (Beijing) region URL. The URL varies by region.
baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});
async function main() {
try {
const streamEnabled = false; // Whether to enable streaming output
const completion = await client.chat.completions.create({
model: "qwen3-asr-flash",
messages: [
{
role: "user",
content: [
{
type: "input_audio",
input_audio: {
data: "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
}
]
}
],
stream: streamEnabled,
// When stream is False, stream_options cannot be set
// stream_options: {
// "include_usage": true
// },
extra_body: {
asr_options: {
// language: "zh",
enable_itn: false
}
}
});
if (streamEnabled) {
let fullContent = "";
console.log("Streaming output:");
for await (const chunk of completion) {
console.log(JSON.stringify(chunk));
if (chunk.choices && chunk.choices.length > 0) {
const delta = chunk.choices[0].delta;
if (delta && delta.content) {
fullContent += delta.content;
}
}
}
console.log(`Full content: ${fullContent}`);
} else {
console.log(`Non-streaming output: ${completion.choices[0].message.content}`);
}
} catch (err) {
console.error(`Error: ${err}`);
}
}
main();
cURL
China (Beijing) region URL. The URL varies by region.
curl -X POST 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-asr-flash",
"messages": [
{
"content": [
{
"type": "input_audio",
"input_audio": {
"data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
}
],
"role": "user"
}
],
"stream":false,
"asr_options": {
"enable_itn": false
}
}'
Input: Base64-encoded audio file
You can also pass Base64-encoded audio as a data URL, in the format data:<mediatype>;base64,<data>.
-
<mediatype>: The MIME type. The value varies by audio format. For example:
-
WAV: audio/wav
-
MP3: audio/mpeg
-
<data>: The Base64-encoded string of the audio data. Base64 encoding inflates the payload size by about one third. Keep the source file small enough that the encoded data still fits within the 10 MB input limit.
-
Example:
data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9
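The size constraint above can be checked before any request is sent. The following is a minimal sketch, not part of the official SDK: the helper names are hypothetical, and it assumes the 10 MB limit means 10 * 1024 * 1024 bytes of encoded data (the exact accounting, including whether the data: prefix counts, is not stated here).

```python
import base64
import math

# Assumption: the 10 MB input limit is read as 10 * 1024 * 1024 bytes of encoded data.
MAX_ENCODED_BYTES = 10 * 1024 * 1024


def base64_length(raw_size: int) -> int:
    """Return the length of the padded Base64 text for raw_size input bytes."""
    return 4 * math.ceil(raw_size / 3)


def to_data_url(audio_bytes: bytes, mime_type: str) -> str:
    """Build data:<mediatype>;base64,<data>, rejecting input that encodes too large."""
    encoded_size = base64_length(len(audio_bytes))
    if encoded_size > MAX_ENCODED_BYTES:
        raise ValueError(
            f"Encoded audio is {encoded_size} bytes, above the {MAX_ENCODED_BYTES}-byte limit"
        )
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Under that assumption, a source file of up to about 7.5 MB still fits after encoding, because every 3 raw bytes become 4 Base64 characters.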
Python SDK
The example uses this audio file: welcome.mp3.
import base64
from openai import OpenAI
import os
import pathlib
try:
    # Replace with the actual path to your audio file
    file_path = "welcome.mp3"
    # Replace with the actual MIME type of your audio file
    audio_mime_type = "audio/mpeg"
    file_path_obj = pathlib.Path(file_path)
    if not file_path_obj.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
    data_uri = f"data:{audio_mime_type};base64,{base64_str}"
    client = OpenAI(
        # API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
        # If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # China (Beijing) region URL. The URL varies by region.
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    stream_enabled = False  # Whether to enable streaming output
    completion = client.chat.completions.create(
        model="qwen3-asr-flash",
        messages=[
            {
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": data_uri
                        }
                    }
                ],
                "role": "user"
            }
        ],
        stream=stream_enabled,
        # When stream is False, stream_options cannot be set
        # stream_options={"include_usage": True},
        extra_body={
            "asr_options": {
                # "language": "zh",
                "enable_itn": False
            }
        }
    )
    if stream_enabled:
        full_content = ""
        print("Streaming output:")
        for chunk in completion:
            # When stream_options.include_usage is True, the choices field of the last chunk is an empty list and must be skipped (you can get token usage via chunk.usage)
            print(chunk)
            if chunk.choices and chunk.choices[0].delta.content:
                full_content += chunk.choices[0].delta.content
        print(f"Full content: {full_content}")
    else:
        print(f"Non-streaming output: {completion.choices[0].message.content}")
except Exception as e:
    print(f"Error: {e}")
Node.js SDK
The example uses this audio file: welcome.mp3.
// Preparations before running:
// Common to Windows/Mac/Linux:
// 1. Make sure Node.js is installed (version >= 14 recommended)
// 2. Run the following command to install the required dependency: npm install openai
import OpenAI from "openai";
import { readFileSync } from 'fs';
const client = new OpenAI({
// API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://help.aliyun.com/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
// China (Beijing) region URL. The URL varies by region.
baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});
const encodeAudioFile = (audioFilePath) => {
const audioFile = readFileSync(audioFilePath);
return audioFile.toString('base64');
};
// Replace with the actual path to your audio file
const dataUri = `data:audio/mpeg;base64,${encodeAudioFile("welcome.mp3")}`;
async function main() {
try {
const streamEnabled = false; // Whether to enable streaming output
const completion = await client.chat.completions.create({
model: "qwen3-asr-flash",
messages: [
{
role: "user",
content: [
{
type: "input_audio",
input_audio: {
data: dataUri
}
}
]
}
],
stream: streamEnabled,
// When stream is False, stream_options cannot be set
// stream_options: {
// "include_usage": true
// },
extra_body: {
asr_options: {
// language: "zh",
enable_itn: false
}
}
});
if (streamEnabled) {
let fullContent = "";
console.log("Streaming output:");
for await (const chunk of completion) {
console.log(JSON.stringify(chunk));
if (chunk.choices && chunk.choices.length > 0) {
const delta = chunk.choices[0].delta;
if (delta && delta.content) {
fullContent += delta.content;
}
}
}
console.log(`Full content: ${fullContent}`);
} else {
console.log(`Non-streaming output: ${completion.choices[0].message.content}`);
}
} catch (err) {
console.error(`Error: ${err}`);
}
}
main();
Process long audio files
Non-real-time speech recognition transcribes long audio files asynchronously, which suits meeting minutes, interview transcripts, and call-recording review.
Limitations:
-
Qwen3-ASR-Flash-Filetrans, Fun-ASR, and Paraformer: Each audio file is capped at 2 GB in size and 12 hours in duration.
-
Qwen3-ASR-Flash: Each audio file is capped at 10 MB in size and 5 minutes in duration. For longer audio, use Qwen3-ASR-Flash-Filetrans or Fun-ASR.
-
When speaker diarization is enabled: Keep the audio duration under 2 hours to avoid recognition failures or timeouts. For details, see Speaker diarization.
How it works: Long-audio transcription runs as an asynchronous task in three steps:
-
Submit a transcription task to receive a task_id.
-
Poll the task status, or call the SDK's wait method to block until the task completes.
-
After the task completes, download the result JSON from the returned URL.
For code samples, see the quick start of Qwen3-ASR-Flash-Filetrans.
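The polling step of this flow can be sketched independently of any SDK. The example below is illustrative only: wait_for_task and fetch_task are hypothetical names, fetch_task stands for any callable that wraps the task query request and returns the parsed JSON, and the status values and the output.task_status path are assumptions based on common DashScope task responses. Submitting the task and downloading the result are left out.

```python
import time
from typing import Callable

# Assumption: these are the terminal task states returned by the query endpoint.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED"}


def wait_for_task(
    fetch_task: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_task() repeatedly until the task reaches a terminal status."""
    deadline = clock() + timeout_s
    while True:
        task = fetch_task()
        # Assumption: the status is located at output.task_status.
        status = task.get("output", {}).get("task_status")
        if status in TERMINAL_STATUSES:
            return task
        if clock() >= deadline:
            raise TimeoutError(f"Task still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

Injecting sleep and clock keeps the loop testable without real waiting; in production the defaults are sufficient.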
Streaming output
Qwen3-ASR-Flash supports streaming output: intermediate results are returned while the audio is being processed, which is well suited for use cases that require real-time progress feedback.
Fun-ASR, Paraformer, and Qwen3-ASR-Flash-Filetrans are asynchronous transcription models and do not support streaming output. Retrieve their final results through task polling (see Process long audio files).
To enable streaming output:
-
DashScope Python SDK: set stream to True.
-
DashScope Java SDK: call the API through the streamCall method.
-
DashScope HTTP: set the X-DashScope-SSE header to enable.
-
OpenAI-compatible SDK: set stream to True.
For a streaming code sample, see Streaming output in the Qwen3-ASR-Flash quick start.
Improve accuracy with hotwords
Fun-ASR and Paraformer improve recognition accuracy for domain-specific proper nouns (names, locations, product names) through hotwords. Create a hotword list in the Model Studio console, then pass its ID to the API through the vocabulary_id parameter.
For instructions on creating and using hotword lists, see Custom hotwords.
SDK naming conventions for these parameters vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference for each SDK.
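As a rough illustration of where the hotword list ID goes in a raw HTTP request, the sketch below builds a request body in the same model/input/parameters layout as the cURL submit example. The function name, the sample URL, and the ID value are hypothetical, and placing vocabulary_id under parameters is an assumption to confirm against the API reference.

```python
import json
from typing import List, Optional


def build_transcription_payload(
    model: str, file_urls: List[str], vocabulary_id: Optional[str] = None
) -> dict:
    """Build the JSON body for a transcription task, optionally with a hotword list."""
    payload = {
        "model": model,
        "input": {"file_urls": list(file_urls)},
        "parameters": {},
    }
    if vocabulary_id:
        # Assumption: the hotword list ID is passed under "parameters".
        payload["parameters"]["vocabulary_id"] = vocabulary_id
    return payload


body = build_transcription_payload(
    "fun-asr",
    ["https://example.com/meeting.wav"],  # placeholder URL
    vocabulary_id="vocab-example-id",  # placeholder ID from the console
)
print(json.dumps(body, indent=2))
```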
Speaker diarization
Speaker diarization identifies the different speakers in an audio file and tags each sentence in the transcript with a speaker label. It is well suited for multi-party meetings and interview recordings.
Supported models: Fun-ASR and Paraformer support speaker diarization (off by default). The Qwen-ASR series does not yet support it.
To enable: Set diarization_enabled to true in the API request. Each sentence in the result then includes a speaker_id field that identifies the speaker.
Response structure (excerpt):
{
"transcripts": [
{
"sentences": [
{ "begin_time": 100, "end_time": 3820, "text": "Hello, let's discuss the project progress today.", "speaker_id": 0 },
{ "begin_time": 3820, "end_time": 6500, "text": "Sure, I'll give the update first.", "speaker_id": 1 }
]
}
]
}
SDK naming conventions for these fields vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference for each SDK.
When speaker diarization is enabled, keep the audio duration under 2 hours to avoid recognition failures or timeouts. (For the audio duration limit when diarization is not enabled, see Process long audio files.) Diarization is supported only for mono audio.
For complete field definitions, see the API reference.
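Once the result JSON is downloaded, the speaker labels can be turned into a readable dialogue. This is a minimal sketch that assumes the result has the shape shown in the excerpt above; format_dialogue is a hypothetical helper, not an SDK function.

```python
from typing import List


def format_dialogue(result: dict) -> List[str]:
    """Render each sentence as '[start] Speaker N: text' using speaker_id."""
    lines = []
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            speaker = sentence.get("speaker_id")
            label = f"Speaker {speaker}" if speaker is not None else "Unknown speaker"
            start_s = sentence["begin_time"] / 1000  # milliseconds to seconds
            lines.append(f"[{start_s:.2f}s] {label}: {sentence['text']}")
    return lines


sample = {
    "transcripts": [
        {
            "sentences": [
                {"begin_time": 100, "end_time": 3820,
                 "text": "Hello, let's discuss the project progress today.", "speaker_id": 0},
                {"begin_time": 3820, "end_time": 6500,
                 "text": "Sure, I'll give the update first.", "speaker_id": 1},
            ]
        }
    ]
}
for line in format_dialogue(sample):
    print(line)
```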
Sensitive word filter
The sensitive word filter replaces or removes sensitive words from recognition results. It is well suited for customer service quality assurance (QA), content compliance, and subtitle moderation.
Supported models: Fun-ASR and Paraformer support the sensitive word filter. The Qwen-ASR series (Qwen3-ASR-Flash and Qwen3-ASR-Flash-Filetrans) does not yet support it.
Default behavior: When the special_word_filter parameter is not specified, the system applies the built-in Alibaba Cloud Model Studio sensitive word list. Matched words are replaced with the same number of * characters.
Custom configuration: special_word_filter is a JSON object with three fields:
-
filter_with_signed.word_list: An array of strings whose matches are replaced with the same number of * characters. For example, with ["test"], "please help me test it" becomes "please help me **** it".
-
filter_with_empty.word_list: An array of strings whose matches are removed from the result. For example, with ["start"], "is the game about to start now" becomes "is the game about to now".
-
system_reserved_filter: A boolean. Defaults to true. Controls whether the built-in sensitive word list is applied in addition to the custom lists.
Example configuration:
{
"special_word_filter": {
"filter_with_signed": {
"word_list": ["test"]
},
"filter_with_empty": {
"word_list": ["start", "happen"]
},
"system_reserved_filter": true
}
}
SDK naming conventions for these parameters vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference.
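To preview what a custom configuration would do before submitting a task, the two custom lists can be approximated locally. The sketch below is only an illustration of the documented replace and remove behavior: preview_filter is a hypothetical helper, it uses plain substring matching, it does not model the built-in list, and collapsing the leftover whitespace is a local choice. The service's actual matching rules may differ.

```python
import re


def preview_filter(text: str, config: dict) -> str:
    """Approximate locally how the custom word lists change a transcript."""
    rules = config.get("special_word_filter", {})
    for word in rules.get("filter_with_signed", {}).get("word_list", []):
        # Replace each match with the same number of * characters.
        text = text.replace(word, "*" * len(word))
    for word in rules.get("filter_with_empty", {}).get("word_list", []):
        # Remove each match entirely.
        text = text.replace(word, "")
    # Collapse the gap left by a removed word (local choice, not documented behavior).
    return re.sub(r"\s{2,}", " ", text).strip()


config = {
    "special_word_filter": {
        "filter_with_signed": {"word_list": ["test"]},
        "filter_with_empty": {"word_list": ["start", "happen"]},
        "system_reserved_filter": True,
    }
}
print(preview_filter("please help me test it", config))
print(preview_filter("is the game about to start now", config))
```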
Context enhancement
Supported model: Only fun-asr-flash-2026-06-15 supports context enhancement.
Use case: Designed for scenarios that combine ASR with a large language model. Passing previous conversation context (LLM replies and earlier recognition results) into the ASR model significantly improves transcription accuracy for proper nouns such as names, locations, and product terms — more flexible than traditional hotwords.
Usage: Pass the conversation history through input.messages. Use the assistant role for the LLM's previous replies and the user role with input_text type for earlier recognition results. Context pairs must appear before the current audio message. For details, see the API reference.
Supported text types include (but are not limited to):
-
Hotword lists in various delimiter formats (for example: hotword1, hotword2, hotword3, hotword4)
-
Free-form paragraphs or passages of any length
-
Mixed content: any combination of word lists and paragraphs
-
Irrelevant or meaningless text, including gibberish. The model tolerates irrelevant content well, and recognition quality rarely degrades because of it.
Example:
Consider an audio clip whose correct transcription is: "What insider jargon do you know in investment banking? First, the nine top foreign investment banks, Bulge Bracket, BB..."
| Without context enhancement | With context enhancement |
| --- | --- |
| Some investment bank names are misrecognized. For example, "Bird Rock" should be "Bulge Bracket". Result: "What insider jargon do you know in investment banking? First, the nine top foreign investment banks, Bird Rock, BB..." | The investment bank names are recognized correctly. Result: "What insider jargon do you know in investment banking? First, the nine top foreign investment banks, Bulge Bracket, BB..." |
To produce the corrected result, include any of the following in the context:
-
A word list:
-
List 1: Bulge Bracket, Boutique, Middle Market, domestic securities firms
-
List 2: Bulge Bracket Boutique Middle Market domestic securities firms
-
List 3: ['Bulge Bracket', 'Boutique', 'Middle Market', 'domestic securities firms']
-
Natural language:
Investment bank classification: a quick guide. Recently a few friends in Australia asked me what investment banks really are. Here is a quick primer. For students studying abroad, investment banks fall into four broad categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms. Bulge Bracket banks: the nine top investment banks we often refer to, including Goldman Sachs, Morgan Stanley, and so on. They are large in both business scope and scale. Boutique banks: relatively small in size but highly focused in their service areas. Firms such as Lazard and Evercore have deep expertise in specific fields. Middle Market banks: serve mid-sized companies with M&A, IPO, and similar services. Though smaller than the bulge brackets, they hold strong positions in specific markets. Domestic securities firms: as the Chinese market has risen, domestic firms play an increasingly important role internationally. There are also further breakdowns by position and business line you can find in related charts. Hopefully this helps you understand investment banks and prepare for your career.
-
Natural language with distracting content: some text is unrelated to the audio, such as the names in the example below.
Investment bank classification: a quick guide. Recently a few friends in Australia asked me what investment banks really are. Here is a quick primer. For students studying abroad, investment banks fall into four broad categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms. Bulge Bracket banks: the nine top investment banks we often refer to, including Goldman Sachs, Morgan Stanley, and so on. They are large in both business scope and scale. Boutique banks: relatively small in size but highly focused in their service areas. Firms such as Lazard and Evercore have deep expertise in specific fields. Middle Market banks: serve mid-sized companies with M&A, IPO, and similar services. Though smaller than the bulge brackets, they hold strong positions in specific markets. Domestic securities firms: as the Chinese market has risen, domestic firms play an increasingly important role internationally. There are also further breakdowns by position and business line you can find in related charts. Hopefully this helps you understand investment banks and prepare for your career. Wang Haoxuan, Li Zihan, Zhang Jingxing, Liu Xinyi, Chen Junjie, Yang Siyuan, Zhao Yutong, Huang Zhiqiang, Zhou Zimo, Wu Yajing, Xu Ruoxi, Sun Haoran, Hu Jinyu, Zhu Chenxi, Guo Wenbo, He Jingshu, Gao Yuhang, Lin Yifei, Zheng Xiaoyan, Liang Bowen, Luo Jiaqi, Song Mingzhe, Xie Wanting, Tang Ziqian, Han Mengyao, Feng Yiran, Cao Qinxue, Deng Zirui, Xiao Wangshu, Xu Jiashu, Cheng Yinuo, Yuan Zhiruo, Peng Haoyu, Dong Simiao, Fan Jingyu, Su Zijin, Lyu Wenxuan, Jiang Shihan, Ding Muchen, Wei Shuyao, Ren Tianyou, Jiang Yichen, Hua Qingyu, Shen Xinghe, Fu Jinyu, Yao Xingchen, Zhong Lingyu, Yan Licheng, Jin Ruoshui, Tao Ranting, Qi Shaoshang, Xue Zhilan, Zou Yunfan, Xiong Ziang, Bai Wenfeng, Yi Qianfan
Emotion recognition
Qwen3-ASR-Flash-Filetrans and Qwen3-ASR-Flash have emotion recognition always on, with no additional configuration required. The result includes an emotion tag for the speaker, drawn from seven fine-grained categories: surprised, neutral, happy, sad, disgusted, angry, and fearful.
Field paths (vary by interface):
-
OpenAI-compatible interface (Qwen3-ASR-Flash real-time transcription): nested at choices[].delta.annotations[].emotion (streaming) or choices[].message.annotations[].emotion (non-streaming).
-
DashScope synchronous interface (Qwen3-ASR-Flash): nested at output.choices[].message.annotations[].emotion.
-
DashScope asynchronous task interface (Qwen3-ASR-Flash-Filetrans, audio file transcription): nested at transcripts[].sentences[].emotion, alongside the timestamp and speaker fields on each sentence object.
Response structure (excerpt from the DashScope asynchronous task interface):
{
"transcripts": [{
"sentences": [{
"begin_time": 0,
"end_time": 1440,
"text": "Welcome to Alibaba Cloud.",
"emotion": "neutral",
"language": "en"
}]
}]
}
SDK naming conventions for these fields vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference.
The Fun-ASR and Paraformer non-real-time models do not yet support emotion recognition. To use emotion recognition with real-time recognition, see the corresponding section in Real-time speech recognition.
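For call analysis, a common next step is to summarize the emotion tags across a transcript. The following sketch assumes a downloaded result in the asynchronous task shape shown above (transcripts[].sentences[].emotion); count_emotions is a hypothetical helper name.

```python
from collections import Counter


def count_emotions(result: dict) -> Counter:
    """Tally the emotion tag of every sentence that has one."""
    counts = Counter()
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            emotion = sentence.get("emotion")
            if emotion:
                counts[emotion] += 1
    return counts


sample = {
    "transcripts": [{
        "sentences": [{
            "begin_time": 0,
            "end_time": 1440,
            "text": "Welcome to Alibaba Cloud.",
            "emotion": "neutral",
            "language": "en",
        }]
    }]
}
print(count_emotions(sample).most_common())
```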
Get timestamps
Non-real-time speech recognition can return timestamps in the transcript, which supports subtitle generation, keyword highlighting, and audio or video editing. All three asynchronous transcription models—Fun-ASR, Paraformer, and Qwen3-ASR-Flash-Filetrans—support timestamps, but the default behavior and the control method differ by model:
-
Qwen3-ASR-Flash-Filetrans: Only the DashScope asynchronous interface supports timestamps; the feature is permanently on. The enable_words request parameter controls the granularity: false (default) returns sentence-level timestamps; true returns word-level timestamps. Word-level timestamps are supported only for Chinese, English, Japanese, Korean, German, French, Spanish, Italian, Portuguese, and Russian. Accuracy is not guaranteed for other languages.
-
Fun-ASR: Timestamps are permanently on and cannot be disabled.
-
Paraformer: Timestamps are off by default. To enable them, set the timestamp_alignment_enabled request parameter to true.
When Qwen3-ASR-Flash is called through the OpenAI-compatible interface, the output is a chat.completion and does not include timestamp fields. For timestamps, use Qwen3-ASR-Flash-Filetrans (the asynchronous task interface).
Timestamps are returned in milliseconds at two levels:
-
Sentence level: sentences[].begin_time and sentences[].end_time mark the start and end of each sentence in the audio.
-
Word level: The sentences[].words[] array. Each element contains begin_time, end_time, and text (the word or character text).
Response structure (excerpt from the DashScope asynchronous task interface):
{
"transcripts": [{
"sentences": [{
"begin_time": 100,
"end_time": 3820,
"text": "Hello, let's discuss the project progress today.",
"words": [
{ "begin_time": 100, "end_time": 596, "text": "Hello" },
{ "begin_time": 596, "end_time": 844, "text": "let's" }
]
}]
}]
}
The in-audio timestamps are integer milliseconds (for example, 100). They are not the same as the task-level end_time (the task completion time, a string such as "2024-09-12 15:11:40.903"). Do not confuse them.
SDK naming conventions for these fields vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference.
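Since subtitle generation is a typical use of these timestamps, here is a small sketch that converts sentence-level timestamps into SubRip (SRT) text. It assumes a downloaded result in the shape of the excerpt above; ms_to_srt_time and to_srt are hypothetical helper names.

```python
def ms_to_srt_time(ms: int) -> str:
    """Convert integer milliseconds to the SRT time format HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(result: dict) -> str:
    """Emit one numbered SRT cue per sentence in the transcription result."""
    blocks = []
    index = 1
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            start = ms_to_srt_time(sentence["begin_time"])
            end = ms_to_srt_time(sentence["end_time"])
            blocks.append(f"{index}\n{start} --> {end}\n{sentence['text']}\n")
            index += 1
    return "\n".join(blocks)


sample = {
    "transcripts": [{
        "sentences": [{
            "begin_time": 100,
            "end_time": 3820,
            "text": "Hello, let's discuss the project progress today.",
        }]
    }]
}
print(to_srt(sample))
```

Word-level cues could be produced the same way by iterating over sentences[].words[] when word-level timestamps are returned.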
Replace polling with callbacks for high-concurrency workloads
After you submit an asynchronous transcription task (Fun-ASR, Qwen3-ASR-Flash-Filetrans, or Paraformer) through POST /api/v1/services/audio/asr/transcription, the usual pattern is to poll GET /api/v1/tasks/{task_id} for the result. That query endpoint defaults to 20 QPS and scales up to 100 QPS, so high-concurrency batch workloads can easily trigger throttling.
Configure callback notifications through EventBridge instead: when a task finishes, Model Studio automatically pushes a dashscope:System:AsyncTaskFinish event to the target you configured (an HTTP/HTTPS endpoint or a RocketMQ topic). Your consumer reads the result directly from the event and no longer needs to call the query endpoint, eliminating the risk of polling-induced throttling. For setup details, see Configure EventBridge callback notifications.
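A quick back-of-the-envelope calculation shows how fast polling reaches the query limit. The sketch assumes every in-flight task is polled once per interval and that all polls share a single quota; the 2 and 5 second intervals are the range suggested under production best practices.

```python
def max_polled_tasks(qps_limit: float, poll_interval_s: float) -> int:
    """Tasks that can be polled once per interval without exceeding qps_limit."""
    return int(qps_limit * poll_interval_s)


for qps in (20, 100):
    for interval in (2, 5):
        tasks = max_polled_tasks(qps, interval)
        print(f"{qps} QPS, {interval} s interval -> {tasks} concurrent tasks")
```

Under these assumptions, the default 20 QPS covers only 40 to 100 concurrent tasks, and even the 100 QPS ceiling covers 200 to 500, which is why callbacks suit larger batches.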
Applicable models
-
Supported: Fun-ASR, Qwen3-ASR-Flash-Filetrans, and Paraformer (all asynchronous transcription tasks).
-
Not supported: Qwen3-ASR-Flash, which uses synchronous and streaming calls rather than asynchronous tasks.
Callback message body
For all three models, the callback message body has data.contain_result set to true, and data.output_result carries the transcription_url directly. Your consumer can fetch the recognition result as soon as it receives the callback, without calling GET /api/v1/tasks/{task_id} again. The result field paths and structure differ across the three models—see the table below.
Pick the correct path based on the model you call; never hard-code a single path in your consumer. On failure, data.output_result.output no longer contains results or result. Instead, it contains code and message. Check data.task_status first, then read the result.
| Model | Submit parameter | Result field path (in the callback body) |
| --- | --- | --- |
| Fun-ASR | | |
| Paraformer | | Same as Fun-ASR |
| Qwen3-ASR-Flash-Filetrans | | |
Considerations
Security (HTTP/HTTPS delivery): In production, validate the X-Eventbridge-Signature* headers on every callback request before processing the request. Without validation, any external IP can spoof AsyncTaskFinish events and inject fake recognition results. Allow your endpoint at least 5 seconds to respond to each callback request. RocketMQ delivery has no per-message signature; security is enforced by RocketMQ authentication.
Delivery latency: Expect roughly 1–90 seconds between task completion (end_time) and message arrival at your target (HTTP/HTTPS endpoint or RocketMQ topic). The exact latency depends on the current EventBridge load.
Idempotency: The same event may be delivered more than once because of retries. Implement idempotent processing on the consumer side; we recommend using data.id or data.task_id from the CloudEvents envelope as the deduplication key.
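The deduplication described above can be sketched as a small wrapper around your handler. This is an illustrative sketch, assuming the event is already parsed into a dict and that data.task_id is the deduplication key. IdempotentConsumer is a hypothetical name, and the in-memory set stands in for whatever shared store (for example, a database or cache with expiry) you would use when several consumer instances run in parallel.

```python
import threading
from typing import Any, Callable, Mapping


class IdempotentConsumer:
    """Invoke the handler at most once per task_id, even if the event is redelivered."""

    def __init__(self, handler: Callable[[Mapping[str, Any]], None]) -> None:
        self._handler = handler
        self._seen = set()  # in-memory only; assumption: replace with a shared store in production
        self._lock = threading.Lock()

    def handle(self, event: Mapping[str, Any]) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        key = event["data"]["task_id"]
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
        try:
            self._handler(event)
        except Exception:
            # Forget the key so that a retried delivery can be processed again.
            with self._lock:
                self._seen.discard(key)
            raise
        return True
```

The key is recorded before the handler runs so that two concurrent deliveries of the same event cannot both pass the check; it is removed again if the handler fails, which lets a later retry succeed.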
Apply in production
The best practices below improve recognition quality and system stability when you use non-real-time speech recognition in production.
Production best practices
-
File hosting: Upload audio files to Alibaba Cloud OSS and call the API by URL. Avoid uploading local files (the local-file API is capped at 100 QPS and the limit cannot be increased).
-
Asynchronous polling: Long-audio transcription uses an asynchronous flow. Set a reasonable polling interval (for example, 2–5 seconds) to avoid burning through your quota with frequent queries. If you need higher throughput than the 100 QPS query ceiling, switch to event callback notifications. For details, see Replace polling with callbacks for high-concurrency workloads.
-
Error handling: Implement a robust retry mechanism. Retry network timeouts and transient server errors (5xx) with exponential backoff.
-
Noise reduction: For noisy audio, preprocess the file with tools such as FFmpeg before submitting it for recognition.
-
Model selection: Choose a model based on audio duration. Use Qwen3-ASR-Flash for short audio up to 5 minutes. Use Fun-ASR or Qwen3-ASR-Flash-Filetrans for longer audio.
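The polling and error-handling practices above can be combined in one loop. The following Python sketch is illustrative only and makes several assumptions: fetch_status is a hypothetical callable that you supply, it returns a dict with a task_status key, it raises TransientError for network timeouts and 5xx responses, and "SUCCEEDED" and "FAILED" are the terminal states.

```python
import random
import time
from typing import Any, Callable, Mapping


class TransientError(Exception):
    """A network timeout or 5xx response that is safe to retry."""


def poll_until_done(
    fetch_status: Callable[[], Mapping[str, Any]],
    interval: float = 3.0,       # steady polling interval, within the 2-5 second range
    max_retries: int = 5,        # consecutive transient failures tolerated
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Poll at a fixed interval; back off exponentially on transient errors."""
    deadline = time.monotonic() + timeout
    failures = 0
    while time.monotonic() < deadline:
        try:
            task = fetch_status()
        except TransientError:
            failures += 1
            if failures > max_retries:
                raise
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay, plus jitter.
            delay = min(max_delay, base_delay * 2 ** (failures - 1))
            sleep(delay + random.uniform(0, delay / 2))
            continue
        failures = 0
        if task.get("task_status") in ("SUCCEEDED", "FAILED"):  # assumed terminal states
            return task
        sleep(interval)
    raise TimeoutError("task did not finish before the deadline")
```

The sleep parameter exists so that the loop can be exercised in tests without waiting; in normal use, leave it at its default.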
Supported scope
Available models vary by region:
China (Beijing)
To call the models below, use an API key in the Beijing region:
-
Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)
-
Fun-ASR-Flash: fun-asr-flash-2026-06-15
-
Qwen3-ASR-Flash-Filetrans: qwen3-asr-flash-filetrans (stable, currently equivalent to qwen3-asr-flash-filetrans-2025-11-17), qwen3-asr-flash-filetrans-2025-11-17 (snapshot)
-
Qwen3-ASR-Flash: qwen3-asr-flash (stable, currently equivalent to qwen3-asr-flash-2025-09-08), qwen3-asr-flash-2026-02-10 (latest snapshot), qwen3-asr-flash-2025-09-08 (snapshot)
-
Paraformer: paraformer-v2, paraformer-8k-v2, paraformer-v1, paraformer-8k-v1, paraformer-mtl-v1
Singapore
To call the models below, use an API key in the Singapore region:
-
Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)
-
Fun-ASR-Flash: fun-asr-flash-2026-06-15
-
Qwen3-ASR-Flash-Filetrans: qwen3-asr-flash-filetrans (stable, currently equivalent to qwen3-asr-flash-filetrans-2025-11-17), qwen3-asr-flash-filetrans-2025-11-17 (snapshot)
-
Qwen3-ASR-Flash: qwen3-asr-flash (stable, currently equivalent to qwen3-asr-flash-2025-09-08), qwen3-asr-flash-2026-02-10 (latest snapshot), qwen3-asr-flash-2025-09-08 (snapshot)
US (Virginia)
To call the models below, use an API key in the US (Virginia) region:
Qwen3-ASR-Flash: qwen3-asr-flash-us (stable, currently equivalent to qwen3-asr-flash-2025-09-08-us), qwen3-asr-flash-2025-09-08-us (snapshot)
API reference
App listing and ICP filing
FAQ
Q: How do I provide a publicly accessible audio URL to the API?
Use Alibaba Cloud Object Storage Service (OSS). OSS provides highly available and durable storage and can generate publicly accessible URLs.
Verify that the URL is reachable from the public internet: open it in a browser or fetch it with curl, and confirm that the audio file downloads or plays and that the request returns HTTP status code 200.
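The same check can be scripted before you submit a batch of files. This is a minimal sketch using only the Python standard library. The function name is_publicly_reachable and its opener parameter are hypothetical, added so that the check can be tested without network access. Run it from a machine outside your private network, because a request made from inside the same network does not prove public reachability.

```python
import urllib.error
import urllib.request
from typing import Callable


def is_publicly_reachable(
    url: str,
    timeout: float = 10.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True only if a GET request to the URL answers with HTTP 200."""
    try:
        request = urllib.request.Request(url, method="GET")
        # The response body is not read; leaving the with-block closes the connection.
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError, ValueError):
        # URLError covers HTTP error statuses and connection failures;
        # ValueError covers malformed URLs.
        return False
```

A GET request is used instead of HEAD because some signed URLs are valid for one HTTP method only.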
Q: How do I check whether the audio format meets the requirements?
Use the open-source tool ffprobe to quickly inspect audio details:
# Inspect the container format (format_name), codec (codec_name), sample rate (sample_rate), and channel count (channels)
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 your_audio_file.mp3
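To run this check over many files, you can ask ffprobe for JSON output and read the same four fields in a script. The sketch below assumes ffprobe is installed and on your PATH; run_ffprobe and summarize_audio are hypothetical helper names, and the function only extracts the values, leaving the comparison against model requirements to you.

```python
import json
import subprocess
from typing import Any, Dict


def run_ffprobe(path: str) -> str:
    """Return ffprobe's JSON description of the file's container and streams."""
    completed = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        check=True, capture_output=True, text=True,
    )
    return completed.stdout


def summarize_audio(ffprobe_json: str) -> Dict[str, Any]:
    """Extract container format, codec, sample rate, and channel count."""
    info = json.loads(ffprobe_json)
    # Video files carry several streams; use the first audio stream.
    audio = next(
        (s for s in info.get("streams", []) if s.get("codec_type") == "audio"),
        None,
    )
    if audio is None:
        raise ValueError("no audio stream found")
    return {
        "format_name": info["format"]["format_name"],
        "codec_name": audio["codec_name"],
        "sample_rate": int(audio["sample_rate"]),  # ffprobe reports this as a string
        "channels": audio["channels"],
    }
```

For example, summarize_audio(run_ffprobe("your_audio_file.mp3")) returns a dict that you can log or validate before submitting the task.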
Q: How do I process audio to meet the model requirements?
Use the open-source tool FFmpeg to clip audio or convert formats:
-
Audio clipping: extract a segment from a long audio file
# -i: Input file
# -ss 00:01:30: Set the clip start time (start at 1 minute 30 seconds)
# -t 00:02:00: Set the clip duration (clip 2 minutes)
# -c copy: Copy the audio stream directly without re-encoding; faster
# output_clip.wav: Output file
ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav
-
Format conversion
For example, convert any audio to a 16 kHz, 16-bit, mono WAV file:
# -i: Input file
# -ac 1: Set the channel count to 1 (mono)
# -ar 16000: Set the sample rate to 16000 Hz (16 kHz)
# -sample_fmt s16: Set the sample format to 16-bit signed integer PCM
# output.wav: Output file
ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav
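If you preprocess files in bulk, the two FFmpeg commands above can be built and run from a script. The following Python sketch assumes ffmpeg is installed and on your PATH. The helper names clip_command, convert_command, and run are hypothetical, and the arguments are the same flags used in the commands above.

```python
import subprocess
from typing import List


def clip_command(src: str, dst: str, start: str, duration: str) -> List[str]:
    """Build the command that copies a segment without re-encoding."""
    return ["ffmpeg", "-i", src, "-ss", start, "-t", duration, "-c", "copy", dst]


def convert_command(src: str, dst: str, sample_rate: int = 16000, channels: int = 1) -> List[str]:
    """Build the command that converts to 16-bit PCM at the given rate and channel count."""
    return ["ffmpeg", "-i", src, "-ac", str(channels), "-ar", str(sample_rate),
            "-sample_fmt", "s16", dst]


def run(command: List[str]) -> None:
    """Run the command and raise CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces. For example, run(convert_command("input.mp3", "output.wav")) performs the conversion shown above.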
Q: How do I improve recognition accuracy?
The following factors affect recognition accuracy. Check each one and tune accordingly.
Main factors:
-
Sound quality: Recording equipment, sample rate, and ambient noise directly affect audio clarity. High-quality input is the foundation of accurate recognition.
-
Speaker characteristics: Variations in pitch, speaking rate, accent, and dialect make recognition harder, especially with uncommon dialects and strong accents.
-
Language and vocabulary: Mixed languages, technical terms, and slang make recognition harder. Configure hotwords to improve accuracy for domain-specific terminology.
How to optimize:
-
Improve audio quality: Use a high-quality microphone, record at the recommended sample rate, and minimize ambient noise and echo.
-
Adapt to the speaker: For audio with strong accents or distinct dialects, choose a model that supports the relevant dialect.
-
Configure hotwords: Set hotwords for technical terms and proper nouns.