Python SDK - Alibaba Cloud Model Studio - Alibaba Cloud Help Center

User guide: Non-real-time speech recognition. For supported audio formats, file size limits, and duration limits, see Audio specifications.

Prerequisites

You have activated the service and obtained an API key. Configure the API key as an environment variable instead of hardcoding it, so that a code leak does not expose the key.

Note
When you need to provide temporary access to third-party applications or users, or when you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend using temporary authentication tokens.

Compared with long-term API Keys, temporary authentication tokens have a short validity period (60 seconds) and higher security, making them suitable for temporary call scenarios and effectively reducing the risk of API Key leakage.

Usage: In your code, replace the API Key originally used for authentication with the obtained temporary authentication token.
Install the latest DashScope SDK.
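
The recommendation above to keep the API key in an environment variable can be sketched as follows. This is a minimal illustration, not part of the DashScope SDK: the helper name `load_api_key` is hypothetical, and only the `DASHSCOPE_API_KEY` variable name is taken from the examples in this topic.

```python
import os


def load_api_key(env=None, name="DASHSCOPE_API_KEY"):
    """Return the API key stored in the environment, or raise a clear error.

    load_api_key is an illustrative helper, not an SDK function.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "configure it instead of hardcoding the key."
        )
    return key
```

The returned value can be assigned to `dashscope.api_key` in place of the `os.getenv` call, which makes a missing key fail immediately instead of at request time.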

Getting started

The Transcription core class provides interfaces to submit tasks, wait for completion, and query results. Two calling methods are available:

Asynchronous submission and synchronous waiting: Submit a task and block the current thread until the task completes.
Asynchronous submission and asynchronous query: Submit a task and query the result when needed.

Asynchronous submission and synchronous waiting

Call the async_call method of the core class (Transcription) and set the request parameters.
Note
- Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, recognition runs much faster than real time.
- Recognition results and download URLs expire 24 hours after the task completes. Tasks become unqueryable after expiration.
Call the wait method of the core class (Transcription) to synchronously wait for the task to complete.

Task statuses include PENDING, RUNNING, SUCCEEDED, and FAILED. While the task is PENDING or RUNNING, wait blocks. Once the task reaches SUCCEEDED or FAILED, wait stops blocking and returns the task result.

wait returns a TranscriptionResponse.

Complete example:

from http import HTTPStatus
from dashscope.audio.asr import Transcription
import dashscope
import os
import json

# The following URL is for the China (Beijing) region. The URLs vary by region.
dashscope.base_http_api_url = 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1'

# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# If you have not configured the environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav']
)

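# Block until the task reaches SUCCEEDED or FAILED.
transcribe_response = Transcription.wait(task=task_response.output.task_id)
if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('transcription done!')
else:
    # Report the request-level error instead of exiting silently.
    print(f'request failed: {transcribe_response.code} {transcribe_response.message}')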

Asynchronous submission and asynchronous query

Call the async_call method of the core class (Transcription) and set the request parameters.
Note
- Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, recognition runs much faster than real time.
- Recognition results and download URLs expire 24 hours after the task completes. Tasks become unqueryable after expiration.
Repeatedly call the fetch method of the core class (Transcription) until you get the final task result.

When the task status is SUCCEEDED or FAILED, stop polling and process the result.

fetch returns a TranscriptionResponse.

Complete example:

from http import HTTPStatus
from dashscope.audio.asr import Transcription
import dashscope
import os
import json
import time

# The following URL is for the China (Beijing) region. The URLs vary by region.
dashscope.base_http_api_url = 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1'

# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# If you have not configured the environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

transcribe_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav']
)

# Poll until the task reaches a final state. Pause between queries so that the
# loop does not trigger rate limiting (the 2-second interval is an arbitrary choice).
while transcribe_response.output.task_status not in ('SUCCEEDED', 'FAILED'):
    time.sleep(2)
    transcribe_response = Transcription.fetch(task=transcribe_response.output.task_id)

if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('transcription done!')
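
The polling step can also be wrapped in a reusable helper that adds a timeout. The following is a minimal sketch under stated assumptions: `poll_until_done` is a hypothetical name, the 2-second interval and 600-second timeout are arbitrary defaults, and `fetch` is any callable that returns an object exposing `output.task_status`, as the TranscriptionResponse in the example above does.

```python
import time

# Final states listed in this topic; polling stops when one is reached.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def poll_until_done(fetch, task_id, interval=2.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch(task_id) until the task reaches a final state.

    Hypothetical helper, not part of the SDK. Raises TimeoutError if the
    task is still PENDING or RUNNING when the timeout elapses.
    """
    deadline = clock() + timeout
    while True:
        response = fetch(task_id)
        if response.output.task_status in TERMINAL_STATES:
            return response
        if clock() >= deadline:
            raise TimeoutError(
                f"Task {task_id} did not finish within {timeout} seconds"
            )
        # Wait between queries to avoid hammering the service.
        sleep(interval)
```

With the SDK, `fetch` could be `lambda task_id: Transcription.fetch(task=task_id)`. Passing `sleep` and `clock` as parameters keeps the helper easy to test without real waiting.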

Request parameters

Set request parameters using the async_call method of the Transcription core class.

Parameter	Type	Default	Required	Description
model	str	-	Yes	The model name for audio and video file transcription. Valid values: fun-asr, fun-asr-2025-11-07, fun-asr-2025-08-25, fun-asr-mtl, fun-asr-mtl-2025-08-25.
file_urls	list[str]	-	Yes	A list of URLs for audio and video file transcription. The HTTP and HTTPS protocols are supported. A single request supports only 1 URL. For supported audio formats, file size limits, and duration limits, see Audio specifications. If audio files are stored in Alibaba Cloud OSS, the SDK does not support temporary URLs with the oss:// prefix.
vocabulary_id	str	-	No	The ID of a custom vocabulary. Hotwords in this vocabulary are used for speech recognition. Disabled by default. See Customize hotwords.
channel_id	list[int]	[0]	No	Indexes of sound channels to recognize in a multi-channel audio file. The index starts from 0. For example, [0] recognizes the first channel, and [0, 1] recognizes the first and second channels. If omitted, the first channel is processed by default. Important: Each specified sound channel is billed separately. For example, a request for [0, 1] for a single file incurs two separate charges.
special_word_filter	str	-	No	Specifies sensitive words to handle during transcription, with different processing methods for different words. For details, see Sensitive word filter.
diarization_enabled	bool	False	No	Specifies whether to enable automatic speaker diarization. Disabled by default. This feature applies to single-channel audio only (not supported for multi-channel audio). When enabled, recognition results include the `speaker_id` field to distinguish speakers. Note: If speaker diarization is enabled, keep the audio duration under 2 hours. Audio exceeding 2 hours may cause recognition failures or timeouts. For an example of `speaker_id`, see Recognition result description.
speaker_count	int	-	No	A reference value for the number of speakers. The value must be an integer from 2 to 100, inclusive. Takes effect only when speaker diarization is enabled (`diarization_enabled` is set to `True`). By default, the number of speakers is determined automatically. This parameter is a hint to the algorithm and does not guarantee the exact number of speakers in the output.
language_hints	list[str]	-	No	The language of the audio to transcribe. If the language is unknown, omit this parameter and the model detects the language automatically. Only the first value in the list is read; additional values are ignored. Supported language codes for fun-asr, fun-asr-2025-11-07, fun-asr-mtl, and fun-asr-mtl-2025-08-25: zh (Chinese), en (English), ja (Japanese), ko (Korean), vi (Vietnamese), th (Thai), id (Indonesian), ms (Malay), tl (Filipino), hi (Hindi), ar (Arabic), fr (French), de (German), es (Spanish), pt (Portuguese), ru (Russian), it (Italian), nl (Dutch), sv (Swedish), da (Danish), fi (Finnish), no (Norwegian), el (Greek), pl (Polish), cs (Czech), hu (Hungarian), ro (Romanian), bg (Bulgarian), hr (Croatian), sk (Slovak). Supported language codes for fun-asr-2025-08-25: zh (Chinese), en (English).
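
The constraints in the table above can be checked before a task is submitted. The following is a minimal sketch, not an SDK feature: `build_transcription_kwargs` is a hypothetical helper, and it enforces only the rules stated in the table (one HTTP or HTTPS URL per request, channel indexes starting from 0, `speaker_count` from 2 to 100 and only with diarization enabled, a single language hint).

```python
def build_transcription_kwargs(model, file_url, *, channel_id=None,
                               diarization_enabled=False, speaker_count=None,
                               language_hint=None, vocabulary_id=None):
    """Validate the documented constraints and return keyword arguments.

    Hypothetical helper; the result is meant to be passed to async_call.
    """
    if not file_url.startswith(("http://", "https://")):
        raise ValueError("file_urls accepts only HTTP or HTTPS URLs")
    # A single request supports only one URL, so wrap it in a one-item list.
    kwargs = {"model": model, "file_urls": [file_url]}
    if channel_id is not None:
        if any(index < 0 for index in channel_id):
            raise ValueError("channel indexes start from 0")
        kwargs["channel_id"] = list(channel_id)
    if speaker_count is not None:
        if not diarization_enabled:
            raise ValueError("speaker_count requires diarization_enabled=True")
        if not 2 <= speaker_count <= 100:
            raise ValueError("speaker_count must be between 2 and 100")
        kwargs["speaker_count"] = speaker_count
    if diarization_enabled:
        kwargs["diarization_enabled"] = True
    if language_hint is not None:
        # Only the first value is read, so send exactly one.
        kwargs["language_hints"] = [language_hint]
    if vocabulary_id is not None:
        kwargs["vocabulary_id"] = vocabulary_id
    return kwargs
```

The returned dictionary could then be unpacked into the call, for example `Transcription.async_call(**kwargs)`.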

Response results

`TranscriptionResponse`

TranscriptionResponse encapsulates the basic information of the task, such as task_id and task_status, and the execution result. The execution result is the content of the output property. For more information, see TranscriptionOutput.

Example of the TranscriptionResponse structure:

`PENDING` state

{
    "status_code":200,
    "request_id":"251aceab-a6aa-9fc4-b7f7-0cc6d3e2a9f3",
    "code":null,
    "message":"",
    "output":{
        "task_id":"7d0a58a3-1dbe-4de9-8cff-5f48213128b0",
        "task_status":"PENDING",
        "submit_time":"2025-02-13 16:55:08.573",
        "scheduled_time":"2025-02-13 16:55:08.592",
        "task_metrics":{
            "TOTAL":1,
            "SUCCEEDED":0,
            "FAILED":0
        }
    },
    "usage":null
}

`RUNNING` state

{
    "status_code":200,
    "request_id":"d9d530f1-853c-9848-a5f1-f5de59086ff7",
    "code":null,
    "message":"",
    "output":{
        "task_id":"6351feef-9694-45d2-9d32-63454f2ffb8d",
        "task_status":"RUNNING",
        "submit_time":"2025-02-13 17:31:20.681",
        "scheduled_time":"2025-02-13 17:31:20.703",
        "task_metrics":{
            "TOTAL":1,
            "SUCCEEDED":0,
            "FAILED":0
        }
    },
    "usage":null
}

`SUCCEEDED` state

{
    "status_code":200,
    "request_id":"16668704-6702-9e03-8ab7-a32a5d7bb095",
    "code":null,
    "message":"",
    "output":{
        "task_id":"6351feef-9694-45d2-9d32-63454f2ffb8d",
        "task_status":"SUCCEEDED",
        "submit_time":"2025-02-13 17:31:20.681",
        "scheduled_time":"2025-02-13 17:31:20.703",
        "end_time":"2025-02-13 17:31:21.867",
        "results":[
            {
                "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A31/20ee4e4f-0404-4806-b617-c7d4c62eed19-1.json?Expires=1739525481&OSSAccessKeyId=yourOSSAccessKeyId&Signature=3q%2B1uQmRwltd7FPn5HQM2mBKw74%3D",
                "subtask_status":"SUCCEEDED"
            }
        ],
        "task_metrics":{
            "TOTAL":1,
            "SUCCEEDED":1,
            "FAILED":0
        }
    },
    "usage":{
        "duration":9
    }
}

`FAILED` state

{
    "status_code":200,
    "request_id":"16668704-6702-9e03-8ab7-a32a5d7bb095",
    "code":null,
    "message":"",
    "output":{
        "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
        "task_status": "FAILED",
        "submit_time": "2024-12-16 16:30:59.170",
        "scheduled_time": "2024-12-16 16:30:59.204",
        "end_time": "2024-12-16 16:31:02.375",
        "results": [
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
                "code": "InvalidFile.DownloadFailed",
                "message": "The audio file cannot be downloaded.",
                "subtask_status": "FAILED"
            }
        ],
        "task_metrics": {
            "TOTAL": 1,
            "SUCCEEDED": 0,
            "FAILED": 1
        }
    },
    "usage":{
        "duration":9
    }
}

Parameters to note:

Parameter	Description
status_code	The HTTP status code of the request.
code	You can ignore the outermost `code`. `output.results` contains a `code` field for the error code. Check the `message` field and refer to Error codes to troubleshoot.
message	You can ignore the outermost `message`. The `message` field in `output.results` contains the error message. Use it with the `code` field and refer to Error codes to troubleshoot.
task_id	The task ID.
task_status	The task status. The possible states are `PENDING`, `RUNNING`, `SUCCEEDED`, and `FAILED`. When a task includes multiple subtasks, the overall task status is marked as `SUCCEEDED` as long as at least one subtask succeeds. Check the `subtask_status` field for the result of each specific subtask.
results	The recognition results of the subtasks.
subtask_status	The subtask status. The possible states are `PENDING`, `RUNNING`, `SUCCEEDED`, and `FAILED`.
file_url	The URL of the audio file to be recognized.
transcription_url	The URL for the audio recognition result. The recognition result is stored as a JSON file. Download the file or read its content by sending an HTTP request to the `transcription_url`. For the JSON file content, see Recognition result description.

`TranscriptionOutput`

TranscriptionOutput corresponds to the output property of TranscriptionResponse and represents the result of the current task execution.

Example of the TranscriptionOutput structure:

`PENDING` state

{
    "task_id":"f2f7c2fa-0cd9-4bb2-a283-27b26ee4bb67",
    "task_status":"PENDING",
    "submit_time":"2025-02-13 17:59:27.754",
    "scheduled_time":"2025-02-13 17:59:27.789",
    "task_metrics":{
        "TOTAL":1,
        "SUCCEEDED":0,
        "FAILED":0
    }
}

`RUNNING` state

{
    "task_id":"f2f7c2fa-0cd9-4bb2-a283-27b26ee4bb67",
    "task_status":"RUNNING",
    "submit_time":"2025-02-13 17:59:27.754",
    "scheduled_time":"2025-02-13 17:59:27.789",
    "task_metrics":{
        "TOTAL":1,
        "SUCCEEDED":0,
        "FAILED":0
    }
}

`SUCCEEDED` state

{
    "task_id":"f2f7c2fa-0cd9-4bb2-a283-27b26ee4bb67",
    "task_status":"SUCCEEDED",
    "submit_time":"2025-02-13 17:59:27.754",
    "scheduled_time":"2025-02-13 17:59:27.789",
    "end_time":"2025-02-13 17:59:28.828",
    "results":[
        {
            "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
            "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A59/70e737cc-bf8c-418b-b0c8-83fab192a0fa-1.json?Expires=1739527168&OSSAccessKeyId=yourOSSAccessKeyId&Signature=AtGjIKI%2BdgbzjJIu%2BHsr1R5nSAY%3D",
            "subtask_status":"SUCCEEDED"
        }
    ],
    "task_metrics":{
        "TOTAL":1,
        "SUCCEEDED":1,
        "FAILED":0
    }
}

`FAILED` state

code is the error code and message is the error message. These two fields are returned only when an error occurs. You can use these two fields to troubleshoot the issue. For more information, see Error codes.

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "FAILED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 1,
        "SUCCEEDED": 0,
        "FAILED": 1
    }
}

Parameters to note:

Parameter	Description
code	The error code. Use this with the `message` field and refer to Error codes to troubleshoot.
message	The error message. Use this with the `code` field and refer to Error codes to troubleshoot.
task_id	The task ID.
task_status	The task status. The possible states are `PENDING`, `RUNNING`, `SUCCEEDED`, and `FAILED`. When a task includes multiple subtasks, the overall task status is marked as `SUCCEEDED` as long as at least one subtask succeeds. Check the `subtask_status` field for the result of each specific subtask.
results	The recognition results of the subtasks.
subtask_status	The subtask status. The possible states are `PENDING`, `RUNNING`, `SUCCEEDED`, and `FAILED`.
file_url	The URL of the audio file to be recognized.
transcription_url	The URL for the audio recognition result. The recognition result is stored as a JSON file. Download the file or read its content by sending an HTTP request to the `transcription_url`. For the JSON file content, see Recognition result description.

Recognition result description

The recognition result is saved as a JSON file.

Recognition result example:

{
    "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties":{
        "audio_format":"pcm_s16le",
        "channels":[
            0
        ],
        "original_sampling_rate":16000,
        "original_duration_in_milliseconds":3834
    },
    "transcripts":[
        {
            "channel_id":0,
            "content_duration_in_milliseconds":3720,
            "text":"Hello world, this is Alibaba Speech Lab.",
            "sentences":[
                {
                    "begin_time":100,
                    "end_time":3820,
                    "text":"Hello world, this is Alibaba Speech Lab.",
                    "sentence_id":1,
                    "speaker_id":0, // This field is displayed only when automatic speaker diarization is enabled.
                    "words":[
                        {
                            "begin_time":100,
                            "end_time":596,
                            "text":"Hello ",
                            "punctuation":""
                        },
                        {
                            "begin_time":596,
                            "end_time":844,
                            "text":"world",
                            "punctuation":", "
                        }
                        // Other content is omitted here.
                    ]
                }
            ]
        }
    ]
}

The key parameters are as follows:

Parameter	Type	Description
audio_format	string	The format of the audio in the source file.
channels	array[integer]	The audio track index information in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on.
original_sampling_rate	integer	The sample rate of the audio in the source file (Hz).
original_duration_in_milliseconds	integer	The original duration of the audio in the source file (ms).
channel_id	integer	The index of the transcribed audio track, starting from 0.
content_duration_in_milliseconds	integer	The duration of the content in the audio track that is identified as speech (ms). Important: Billing is based on speech content duration only (non-speech parts are not metered). Speech duration is typically shorter than total audio duration. Because speech is detected by an AI model, the measured duration may deviate slightly.
transcript	string	The paragraph-level speech transcription result.
sentences	array	The sentence-level speech transcription result.
words	array	The word-level speech transcription result.
begin_time	integer	Start timestamp (ms).
end_time	integer	End timestamp (ms).
text	string	The speech transcription result.
speaker_id	integer	The index of the current speaker, starting from 0. This is used to distinguish different speakers. This field is displayed in the recognition result only when speaker diarization is enabled.
punctuation	string	The predicted punctuation mark after the word, if any.
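
The fields described above can be read with a few lines of Python once the JSON file behind `transcription_url` has been downloaded. This is a minimal sketch: `download_result` and `iter_sentences` are hypothetical helper names, and the code assumes the file has the structure shown in the recognition result example.

```python
import json
import urllib.request


def download_result(transcription_url, timeout=30.0):
    """Fetch the JSON file behind transcription_url and decode it."""
    with urllib.request.urlopen(transcription_url, timeout=timeout) as response:
        return json.loads(response.read())


def iter_sentences(result):
    """Yield (channel_id, begin_ms, end_ms, speaker_id, text) per sentence.

    speaker_id is None unless speaker diarization was enabled.
    """
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            yield (
                transcript["channel_id"],
                sentence["begin_time"],
                sentence["end_time"],
                sentence.get("speaker_id"),
                sentence["text"],
            )
```

A caller might combine the two as `for row in iter_sentences(download_result(url)): print(row)`, where `url` is the `transcription_url` of a succeeded subtask. Keep in mind that the URL expires 24 hours after the task completes.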

Key interfaces

Core class (`Transcription`)

Import Transcription using `from dashscope.audio.asr import Transcription`.

Member method	Method signature	Description
async_call	`@classmethod def async_call(cls, model: str, file_urls: List[str], phrase_id: str = None, api_key: str = None, workspace: str = None, **kwargs) -> TranscriptionResponse`	Asynchronously submits a speech recognition task.
wait	`@classmethod def wait(cls, task: Union[str, TranscriptionResponse], api_key: str = None, workspace: str = None, **kwargs) -> TranscriptionResponse`	Blocks the current thread until the asynchronous task completes, that is, until the task status is `SUCCEEDED` or `FAILED`. This method returns a TranscriptionResponse.
fetch	`@classmethod def fetch(cls, task: Union[str, TranscriptionResponse], api_key: str = None, workspace: str = None, **kwargs) -> TranscriptionResponse`	Queries the current status and result of the task without waiting for the task to complete. This method returns a TranscriptionResponse.

Other interfaces: Batch query task status/Cancel task

For more information, see Manage asynchronous tasks. You can batch query audio file recognition tasks submitted within 24 hours and cancel tasks that are in the PENDING state.

Error codes

If an error occurs, see Error codes to troubleshoot the issue.

If a task contains multiple subtasks, the overall task status is marked as SUCCEEDED if at least one subtask succeeds. You must check the subtask_status field to determine the result of each subtask.

Example of an error response:

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 1,
        "SUCCEEDED": 0,
        "FAILED": 1
    }
}
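
Because the overall status alone does not reveal which files failed, each entry in `results` has to be inspected. The following is a minimal sketch that assumes the output is dict-like, as in the JSON examples in this topic; `split_subtasks` is a hypothetical helper name.

```python
def split_subtasks(output):
    """Separate the entries of output["results"] by subtask_status.

    Returns (succeeded, failed). Entries that are still PENDING or
    RUNNING appear in neither list.
    """
    succeeded, failed = [], []
    for item in output.get("results") or []:
        status = item.get("subtask_status")
        if status == "SUCCEEDED":
            succeeded.append(item)
        elif status == "FAILED":
            failed.append(item)
    return succeeded, failed
```

For each failed entry, the `code` and `message` fields can then be looked up in Error codes, and each succeeded entry carries a `transcription_url` to download.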

FAQ

Features

Q: Is audio in Base64 encoding supported?

This service recognizes audio from publicly accessible URLs only. It does not support audio in Base64 encoding, binary streams, or local files.

Q: How do I provide an audio file as a publicly accessible URL?

You can typically follow these steps. This is a general guide, and the specific steps may vary for different storage products. We recommend that you upload the audio to Object Storage Service (OSS).

1. Choose a storage and hosting method

Examples include the following:

Object Storage Service (Recommended):
- Use a cloud provider's object storage service, such as OSS. Upload the audio file to a bucket and set its access permissions to public.
- Advantages: High availability, CDN acceleration support, and easy management.
Web server:
- Place the audio file on a web server that supports HTTP/HTTPS access, such as Nginx or Apache.
- Advantages: Suitable for small projects or local testing.
Content Delivery Network (CDN):
- Host the audio file on a CDN and access it through the URL provided by the CDN.
- Advantages: Accelerates file transfer, suitable for high-concurrency scenarios.

2. Upload the audio file

Upload the audio file based on your chosen storage/hosting method. For example:

Object Storage Service:
- Log on to the cloud provider's console and create a bucket.
- Upload the audio file and set its permissions to "public-read" or generate a temporary access link.
Web server:
- Place the audio file in a specified directory on the server, such as /var/www/html/audio/.
- Ensure the file is accessible via HTTP/HTTPS.

3. Generate a publicly accessible URL

For example:

Object Storage Service:
- After uploading the file, the system automatically generates a public access URL, typically in the format https://<bucket-name>.<region>.aliyuncs.com/<file-name>.
- For a more user-friendly domain name, you can attach a custom domain name and enable HTTPS.
Web server:
- The access URL for the file is usually the server address plus the file path, such as https://your-domain.com/audio/file.mp3.
CDN:
- After configuring CDN acceleration, use the URL provided by the CDN, such as https://cdn.your-domain.com/audio/file.mp3.

4. Verify the URL's availability

In a public network environment, ensure that the generated URL is accessible. For example:

- Open the URL in a browser to check if the audio file can be played.
- Use a tool, such as curl or Postman, to verify that the URL returns a correct HTTP response (status code 200).
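
The same check can be scripted. The following is a minimal sketch using only the Python standard library; `url_is_reachable` is a hypothetical helper name, and the 10-second timeout is an arbitrary default.

```python
import urllib.request


def url_is_reachable(url, timeout=10.0):
    """Return True if a HEAD request to url is answered with HTTP 200."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (ValueError, OSError):
        # ValueError: malformed URL. OSError covers urllib's URLError and
        # HTTPError (for example 403 or 404) as well as timeouts.
        return False
```

A HEAD request avoids downloading the whole file. Some servers and signed URLs may accept only GET, so a negative result from this check is worth confirming with curl or a browser.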

When using the SDK, if audio files are stored in Alibaba Cloud OSS, temporary URLs with the oss:// prefix are not supported.

When using the RESTful API, if audio files are stored in Alibaba Cloud OSS, temporary URLs with the oss:// prefix are supported:

- The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.
- The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.
- For production environments, use a stable storage service such as OSS to ensure long-term file availability and avoid rate limiting issues.

Q: How long does it take to get the recognition result?

Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) varies with the queue length and file duration. The longer the audio file, the longer the processing time.

Troubleshooting

If an error code is returned, see Error codes to troubleshoot the issue.

Q: Why can't I get a result after continuous polling?

This may be because of rate limiting.

Q: Why is the audio not recognized (no recognition result)?

Check whether the audio format and sample rate are correct and meet the parameter constraints.

You can use the ffprobe tool to retrieve information about the audio container, codec, sample rate, and channels:

ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
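
The ffprobe command above prints one `key=value` pair per line, which is easy to collect from a script. The following is a minimal sketch that assumes ffprobe is installed and on the PATH; `parse_ffprobe_output` and `probe_audio` are hypothetical helper names.

```python
import subprocess

# Same options as the ffprobe command shown above; the file path is appended.
FFPROBE_ARGS = [
    "ffprobe", "-v", "error",
    "-show_entries", "format=format_name",
    "-show_entries", "stream=codec_name,sample_rate,channels",
    "-of", "default=noprint_wrappers=1",
]


def parse_ffprobe_output(text):
    """Turn key=value lines into a dict (values stay strings).

    If a file has several streams, later streams overwrite earlier keys.
    """
    info = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if separator:
            info[key.strip()] = value.strip()
    return info


def probe_audio(path):
    """Run ffprobe on path and return the parsed fields."""
    completed = subprocess.run(
        FFPROBE_ARGS + [path], capture_output=True, text=True, check=True
    )
    return parse_ffprobe_output(completed.stdout)
```

The resulting values for `format_name`, `codec_name`, `sample_rate`, and `channels` can then be compared against the limits in Audio specifications.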
