Parameters and interface details for the Fun-ASR audio file recognition Python SDK.
Model Studio has released a workspace-specific domain for the Singapore region: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com. The new dedicated domain delivers superior performance and higher stability for inference requests. We recommend migrating from https://dashscope-intl.aliyuncs.com to the new domain.
{WorkspaceId} is your workspace ID, which can be found on the Workspace Details page in the Model Studio console. The existing domain remains fully functional.
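As a small illustration of the URL pattern above, the following sketch builds the workspace-specific domain from a workspace ID. The function name and the placeholder ID are hypothetical. The API path to append and the way the SDK is pointed at this domain are not stated here, so the sketch stops at the domain itself.

```python
def workspace_domain(workspace_id: str) -> str:
    """Return the Singapore workspace-specific domain for a workspace ID."""
    workspace_id = workspace_id.strip()
    if not workspace_id:
        raise ValueError("workspace_id must not be empty")
    return f"https://{workspace_id}.ap-southeast-1.maas.aliyuncs.com"


if __name__ == "__main__":
    # "ws-example" is a placeholder, not a real workspace ID.
    print(workspace_domain("ws-example"))
```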
User guide: Non-real-time speech recognition. For supported audio formats, file size limits, duration limits, and other input requirements, see Audio specifications.
Prerequisites
- You have activated Model Studio and created an API key. Export the key as an environment variable instead of hard-coding it, to reduce the risk of leaking it.
  Note: For temporary access or strict control over high-risk operations (such as accessing or deleting sensitive data), use a temporary authentication token instead. Compared with long-term API keys, temporary tokens are more secure (60-second lifespan) and reduce the risk of API key leakage. To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
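A minimal sketch of reading the credential from the environment rather than hard-coding it. It assumes the variable is named DASHSCOPE_API_KEY; load_api_key is a hypothetical helper, not part of the SDK. The same value slot can hold a temporary token.

```python
import os


def load_api_key(env=os.environ, name="DASHSCOPE_API_KEY"):
    """Read the API key (or temporary token) from the environment."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return value


# Usage (requires the dashscope package):
#   import dashscope
#   dashscope.api_key = load_api_key()
```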
Getting started
The Transcription core class provides interfaces to submit tasks, wait for completion, and query results. Two calling methods are available:
- Asynchronous submission and synchronous waiting: Submit a task and block the current thread until the task completes.
- Asynchronous submission and asynchronous query: Submit a task and query the result when needed.
Asynchronous submission and synchronous waiting
- Call the async_call method of the core class (Transcription) and set the request parameters.
  Note:
  - Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, speech recognition completes at a significantly accelerated speed.
  - Recognition results and download URLs expire 24 hours after the task completes. Tasks become unqueryable after expiration.
- Call the wait method of the core class (Transcription) to synchronously wait for the task to complete.
  Task statuses include PENDING, RUNNING, SUCCEEDED, and FAILED. While the task is in the PENDING or RUNNING state, wait blocks. Once the task is in the SUCCEEDED or FAILED state, wait stops blocking and returns the task result as a TranscriptionResponse.
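The two steps above can be sketched as follows. transcribe_and_wait is a hypothetical helper, the model name is left to the caller, and the optional client argument exists only so the flow can be exercised with a stand-in object; by default it uses the SDK's Transcription class.

```python
from http import HTTPStatus


def transcribe_and_wait(file_url, model, client=None, **params):
    """Submit one file with async_call, then block in wait until it finishes."""
    if client is None:
        # Imported lazily so the helper can also run against a stand-in client.
        from dashscope.audio.asr import Transcription
        client = Transcription
    submitted = client.async_call(model=model, file_urls=[file_url], **params)
    if submitted.status_code != HTTPStatus.OK:
        raise RuntimeError(f"Submission failed: {submitted.code} {submitted.message}")
    # wait blocks while the task is PENDING or RUNNING and returns
    # once it is SUCCEEDED or FAILED.
    return client.wait(task=submitted.output.task_id)
```

The caller should still inspect task_status and each subtask_status on the returned response, because wait also returns for failed tasks.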
Asynchronous submission and asynchronous query
- Call the async_call method of the core class (Transcription) and set the request parameters.
  Note:
  - Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, speech recognition completes at a significantly accelerated speed.
  - Recognition results and download URLs expire 24 hours after the task completes. Tasks become unqueryable after expiration.
- Repeatedly call the fetch method of the core class (Transcription) until you get the final task result.
  When the task status is SUCCEEDED or FAILED, stop polling and process the result. fetch returns a TranscriptionResponse.
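A minimal polling sketch for the fetch-based flow above. The 5-second interval and the 120-poll cap are illustrative assumptions, not documented limits, and poll_until_done is a hypothetical helper. Non-OK responses are retried after the same pause, which keeps the request rate low.

```python
import time
from http import HTTPStatus

FINAL_STATES = {"SUCCEEDED", "FAILED"}


def poll_until_done(task_id, client=None, interval_s=5.0, max_polls=120,
                    sleep=time.sleep):
    """Call fetch repeatedly until the task reaches SUCCEEDED or FAILED."""
    if client is None:
        # Imported lazily so the helper can also run against a stand-in client.
        from dashscope.audio.asr import Transcription
        client = Transcription
    for _ in range(max_polls):
        response = client.fetch(task=task_id)
        if (response.status_code == HTTPStatus.OK
                and response.output.task_status in FINAL_STATES):
            return response
        # Still PENDING/RUNNING, or the query itself failed: pause, then retry.
        sleep(interval_s)
    raise TimeoutError(f"Task {task_id} not finished after {max_polls} polls")
```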
Request parameters
Set request parameters using the async_call method of the Transcription core class.
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | str | - | Yes | The model used to transcribe the audio or video file. Valid values: |
| file_urls | list[str] | - | Yes | A list of URLs of the audio or video files to transcribe. HTTP and HTTPS are supported. A single request supports only 1 URL. If your audio files are stored in OSS, the SDK does not support temporary URLs that start with the oss:// prefix. |
| vocabulary_id | str | - | No | The ID of a custom vocabulary. Hotwords in this vocabulary are used for speech recognition. Disabled by default. See Customize hotwords. |
| channel_id | list[int] | [0] | No | Indexes of the sound channels to recognize in a multi-channel audio file. Indexes start from 0. For example, [0] recognizes the first channel, and [0, 1] recognizes the first and second channels. If omitted, only the first channel is processed. Important: Each specified sound channel is billed separately. For example, a request for [0, 1] for a single file incurs two separate charges. |
| special_word_filter | str | - | No | Specifies the sensitive words to handle during speech recognition. You can configure different handling actions for individual sensitive words. For details, see Sensitive word filter. |
| diarization_enabled | bool | False | No | Enables automatic speaker diarization, which is disabled by default. This feature applies to single-channel audio only and is not supported for multi-channel audio. When enabled, recognition results include speaker labels (see speaker_id in the recognition result description). Note: If speaker diarization is enabled, keep the audio duration under 2 hours. Audio exceeding 2 hours may cause recognition failures or timeouts. |
| speaker_count | int | - | No | A reference value for the number of speakers. The value must be an integer from 2 to 100, inclusive. Takes effect only when speaker diarization is enabled (see diarization_enabled). By default, the number of speakers is determined automatically. This parameter serves as a hint to the algorithm and does not guarantee the exact number of speakers in the output. |
| language_hints | list[str] | - | No | The language code for recognition. If the source language is unknown, leave it unset and the model detects the language automatically. The system reads only the first value in the array. Any extra values are ignored. |
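The constraints in the table can be checked before a request is sent. The sketch below is a hypothetical helper (build_request_params is not part of the SDK) that assembles keyword arguments for async_call and enforces the documented limits: one HTTP or HTTPS URL, speaker_count between 2 and 100 and only with diarization enabled, and a single language hint.

```python
def build_request_params(model, file_url, *, vocabulary_id=None, channel_id=None,
                         diarization_enabled=False, speaker_count=None,
                         language_hint=None):
    """Assemble keyword arguments for async_call, enforcing documented limits."""
    if not file_url.startswith(("http://", "https://")):
        raise ValueError("file_urls accepts HTTP or HTTPS URLs only")
    params = {"model": model, "file_urls": [file_url]}  # one URL per request
    if vocabulary_id is not None:
        params["vocabulary_id"] = vocabulary_id
    if channel_id is not None:
        params["channel_id"] = list(channel_id)  # each channel is billed separately
    if diarization_enabled:
        params["diarization_enabled"] = True
        if speaker_count is not None:
            if not 2 <= speaker_count <= 100:
                raise ValueError("speaker_count must be between 2 and 100")
            params["speaker_count"] = speaker_count
    elif speaker_count is not None:
        raise ValueError("speaker_count requires diarization_enabled=True")
    if language_hint is not None:
        params["language_hints"] = [language_hint]  # only the first value is read
    return params


# Usage (requires the dashscope package; the model name is left to the caller):
#   Transcription.async_call(**build_request_params(model_name, url))
```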
Response results
TranscriptionResponse
TranscriptionResponse encapsulates the basic information of the task, such as task_id and task_status, and the execution result. The execution result is the content of the output property. For more information, see TranscriptionOutput.
Parameters to note:
| Parameter | Description |
| --- | --- |
| status_code | The HTTP status code of the request. |
| code | |
| message | |
| task_id | The task ID. |
| task_status | The task status. The possible states are PENDING, RUNNING, SUCCEEDED, and FAILED. When a task includes multiple subtasks, the overall task status is marked as SUCCEEDED if at least one subtask succeeds. Check subtask_status to determine the result of each subtask. |
| results | The recognition results of the subtasks. |
| subtask_status | The subtask status. |
| file_url | The URL of the audio file to be recognized. |
| transcription_url | The URL for the audio recognition result. The recognition result is stored as a JSON file. Download the file or read its content by sending an HTTP request to the URL. |
TranscriptionOutput
TranscriptionOutput corresponds to the output property of TranscriptionResponse and represents the result of the current task execution.
Parameters to note:
| Parameter | Description |
| --- | --- |
| code | The error code. Use this with the message field. |
| message | The error message. Use this with the code field. |
| task_id | The task ID. |
| task_status | The task status. The possible states are PENDING, RUNNING, SUCCEEDED, and FAILED. When a task includes multiple subtasks, the overall task status is marked as SUCCEEDED if at least one subtask succeeds. Check subtask_status to determine the result of each subtask. |
| results | The recognition results of the subtasks. |
| subtask_status | The subtask status. |
| file_url | The URL of the audio file to be recognized. |
| transcription_url | The URL for the audio recognition result. The transcription result is stored in a JSON file. Use the URL to download the file or read its content over HTTP. |
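A sketch of turning a finished task's output into downloaded results. It assumes the output object can be read like a mapping with a results list whose items carry file_url, subtask_status, and transcription_url; both function names are hypothetical.

```python
import json
from urllib.request import urlopen


def succeeded_transcription_urls(output):
    """Map file_url -> transcription_url for subtasks that SUCCEEDED."""
    return {
        item["file_url"]: item["transcription_url"]
        for item in output.get("results", [])
        if item.get("subtask_status") == "SUCCEEDED"
    }


def download_transcription(transcription_url, timeout_s=30):
    """Fetch and parse the JSON result file behind transcription_url."""
    with urlopen(transcription_url, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because result URLs expire 24 hours after the task completes, it is safer to download and store the JSON soon after the task finishes.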
Recognition result description
The recognition result is saved as a JSON file.
The key parameters are as follows:
| Parameter | Type | Description |
| --- | --- | --- |
| audio_format | string | The format of the audio in the source file. |
| channels | array[integer] | The audio track indexes in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on. |
| original_sampling_rate | integer | The sample rate of the audio in the source file (Hz). |
| original_duration_in_milliseconds | integer | The original duration of the audio in the source file (ms). |
| channel_id | integer | The index of the transcribed audio track, starting from 0. |
| content_duration_in_milliseconds | integer | The duration of the content in the audio track that is identified as speech (ms). Important: Billing is based on speech content duration only (non-speech parts are not metered). Speech duration is typically shorter than total audio duration. The AI-based speech detection may have minor discrepancies. |
| transcript | string | The paragraph-level speech transcription result. |
| sentences | array | The sentence-level speech transcription result. |
| words | array | The word-level speech transcription result. |
| begin_time | integer | Start timestamp (ms). |
| end_time | integer | End timestamp (ms). |
| text | string | The speech transcription result. |
| speaker_id | integer | The index of the current speaker, starting from 0. This is used to distinguish different speakers. This field appears in the recognition result only when speaker diarization is enabled. |
| punctuation | string | The predicted punctuation mark after the word, if any. |
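As an illustration of the sentence-level fields, the sketch below renders a sentences array as readable lines. It assumes each sentence object carries begin_time, end_time, and text, plus speaker_id when diarization is enabled; the exact nesting of sentences inside the result file is not shown here, so the function takes the array directly. format_sentences is a hypothetical helper.

```python
def format_sentences(sentences):
    """Render sentence entries as '[start - end] speaker N: text' lines."""
    lines = []
    for sentence in sentences:
        start_s = sentence["begin_time"] / 1000  # timestamps are in ms
        end_s = sentence["end_time"] / 1000
        speaker = sentence.get("speaker_id")  # present only with diarization
        who = f"speaker {speaker}: " if speaker is not None else ""
        lines.append(f"[{start_s:.2f}s - {end_s:.2f}s] {who}{sentence['text']}")
    return lines
```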
Key interfaces
Core class (Transcription)
Import Transcription using from dashscope.audio.asr import Transcription.
| Member method | Method signature | Description |
| --- | --- | --- |
| async_call | | Asynchronously submits a speech recognition task. |
| wait | | Blocks the current thread until the asynchronous task is complete, that is, until the task status is SUCCEEDED or FAILED. This method returns TranscriptionResponse. |
| fetch | | Asynchronously queries the result of the current task. This method returns TranscriptionResponse. |
Other interfaces: Batch query task status/Cancel task
For more information, see Manage asynchronous tasks. You can batch query audio file recognition tasks submitted within 24 hours and cancel tasks that are in the PENDING state.
Error codes
If an error occurs, see Error codes to troubleshoot the issue.
If a task contains multiple subtasks, the overall task status is marked as SUCCEEDED if at least one subtask succeeds. You must check the subtask_status field to determine the result of each subtask.
Example of an error response:
{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 1,
        "SUCCEEDED": 0,
        "FAILED": 1
    }
}
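Because the overall status in this example is SUCCEEDED even though the only subtask failed, each subtask needs its own check. A minimal sketch, assuming the task output has been loaded as a dictionary shaped like the example above; failed_subtasks is a hypothetical helper.

```python
def failed_subtasks(task):
    """Return (file_url, code, message) for every subtask marked FAILED."""
    return [
        (item.get("file_url"), item.get("code"), item.get("message"))
        for item in task.get("results", [])
        if item.get("subtask_status") == "FAILED"
    ]


if __name__ == "__main__":
    example = {
        "task_status": "SUCCEEDED",
        "results": [{
            "file_url": "https://example.com/missing.wav",  # placeholder URL
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED",
        }],
    }
    for file_url, code, message in failed_subtasks(example):
        print(f"{file_url}: {code} ({message})")
```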
FAQ
Features
Q: Is audio in Base64 encoding supported?
This service recognizes audio from publicly accessible URLs only. It does not support audio in Base64 encoding, binary streams, or local files.
Q: How do I provide an audio file as a publicly accessible URL?
You can typically follow these steps. This is a general guide, and the specific steps may vary for different storage products. We recommend that you upload the audio to Object Storage Service (OSS).
When using the SDK to access a file stored in OSS, you cannot use a temporary URL with the oss:// prefix.
When using the RESTful API to access a file stored in OSS, you can use a temporary URL with the oss:// prefix:
The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.
The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.
For production environments, use a stable storage service such as OSS to ensure long-term file availability and avoid rate limiting issues.
Q: How long does it take to get the recognition result?
Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) varies with the queue length and file duration. The longer the audio file, the longer the processing time.
Troubleshooting
If an error code is returned, see Error codes to troubleshoot the issue.
Q: Why can't I get a result after continuous polling?
This may be because of rate limiting.
Q: Why is the audio not recognized (no recognition result)?
Check whether the audio format and sample rate are correct and meet the parameter constraints.
You can use the ffprobe tool to retrieve information about the audio container, codec, sample rate, and channels:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
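The same check can be scripted. The sketch below runs the command above through subprocess and parses its key=value output into a dictionary. It assumes ffprobe is installed and on PATH and that the file is local; both function names are hypothetical. If the file has several streams, later values overwrite earlier ones in this simple version.

```python
import subprocess


def parse_ffprobe_output(text):
    """Turn ffprobe 'key=value' lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not key=value pairs
            info[key.strip()] = value.strip()
    return info


def probe_audio(path):
    """Run the ffprobe command shown above on a local file."""
    cmd = [
        "ffprobe", "-v", "error",
        "-show_entries", "format=format_name",
        "-show_entries", "stream=codec_name,sample_rate,channels",
        "-of", "default=noprint_wrappers=1",
        path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_ffprobe_output(result.stdout)
```

The returned values are strings, so convert sample_rate and channels to integers before comparing them with the limits in Audio specifications.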