Use the Fun-ASR HTTP API to submit audio files for transcription and retrieve results through DashScope.
User guide: Non-real-time speech recognition. For supported audio formats, file size limits, and duration limits, see Audio specifications.
DashScope asynchronous calls (Fun-ASR)
How it works
Unlike synchronous calls that return results immediately, the asynchronous mode handles long audio files and time-consuming tasks. This mode uses a submit-then-poll workflow to avoid request timeouts during lengthy processing:
Step 1: Submit a task
The client sends an asynchronous processing request.
After validating the request, the server returns a unique task_id without immediately executing the task, indicating that the task was created successfully.
Step 2: Get the results
The client uses the returned task_id to poll the result query endpoint repeatedly.
Once the task finishes, the result query endpoint returns the final transcription results.
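The two steps above can be sketched as a small polling loop. This is a minimal illustration, not an official client: `run_async_task`, its `submit` and `query` callables, the `task_status` field name, and the terminal status names are assumptions made for the example (this page itself names only the PENDING state).

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal status names; only PENDING is named in this document.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def run_async_task(
    submit: Callable[[], Dict[str, Any]],
    query: Callable[[str], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Submit once, then poll the query callable until a terminal state."""
    # Step 1: the server validates the request and returns a task_id.
    submitted = submit()
    task_id = submitted["output"]["task_id"]

    waited = 0.0
    while True:
        # Step 2: poll the query endpoint with the returned task_id.
        result = query(task_id)
        status = result["output"].get("task_status")  # assumed field name
        if status in TERMINAL_STATES:
            return result
        if waited >= timeout_s:
            raise TimeoutError(f"task {task_id} is still {status} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `submit` and `query` as callables keeps the loop independent of any HTTP library, so the same function works with whichever client you use to call the endpoints.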
Service endpoint
China (Beijing)
Submit task endpoint: POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/asr/transcription
Query task endpoint: GET https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/tasks/{task_id}
Singapore
Submit task endpoint: POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/audio/asr/transcription
Query task endpoint: GET https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/tasks/{task_id}
Replace {WorkspaceId} with your actual workspace ID.
Model Studio has released workspace-specific domains for the China (Beijing) and Singapore regions. The new dedicated domains deliver superior performance and higher stability for inference requests. We recommend migrating to the new domains:
- China (Beijing): from https://dashscope.aliyuncs.com to https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com
- Singapore: from https://dashscope-intl.aliyuncs.com to https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com
{WorkspaceId} is your workspace ID, which can be found on the Workspace Details page in the Model Studio console. The existing domain remains fully functional.
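As an illustration of the domain scheme, the following sketch builds the submit and query URLs from a workspace ID and rewrites a legacy URL to the workspace-specific host. The helper names are hypothetical, and `migrate_url` assumes that the path after the host is unchanged between the old and new domains.

```python
from urllib.parse import urlsplit, urlunsplit

# Host templates for the workspace-specific domains, keyed by region ID.
REGION_HOSTS = {
    "cn-beijing": "{workspace_id}.cn-beijing.maas.aliyuncs.com",
    "ap-southeast-1": "{workspace_id}.ap-southeast-1.maas.aliyuncs.com",
}

# Legacy shared domains and the region each one corresponds to.
LEGACY_HOSTS = {
    "dashscope.aliyuncs.com": "cn-beijing",
    "dashscope-intl.aliyuncs.com": "ap-southeast-1",
}

SUBMIT_PATH = "/api/v1/services/audio/asr/transcription"


def base_url(workspace_id: str, region: str) -> str:
    """Return the workspace-specific base URL for a region."""
    if not workspace_id:
        raise ValueError("workspace_id must not be empty")
    return "https://" + REGION_HOSTS[region].format(workspace_id=workspace_id)


def submit_url(workspace_id: str, region: str) -> str:
    """URL for the POST submit task endpoint."""
    return base_url(workspace_id, region) + SUBMIT_PATH


def query_url(workspace_id: str, region: str, task_id: str) -> str:
    """URL for the GET query task endpoint."""
    return f"{base_url(workspace_id, region)}/api/v1/tasks/{task_id}"


def migrate_url(legacy_url: str, workspace_id: str) -> str:
    """Swap a legacy host for the workspace-specific one (path assumed unchanged)."""
    parts = urlsplit(legacy_url)
    region = LEGACY_HOSTS.get(parts.netloc)
    if region is None:
        return legacy_url  # not a legacy DashScope host; leave untouched
    host = REGION_HOSTS[region].format(workspace_id=workspace_id)
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

For example, `submit_url("ws123", "cn-beijing")` yields the China (Beijing) submit endpoint for a workspace whose ID is the placeholder `ws123`.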
When submitting tasks through the new domain (https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com), the request body must include the parameters object. Even if you don't need to set any parameters, pass an empty object {}. Otherwise, the task submits successfully but transcription fails.
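One way to avoid omitting `parameters` by accident is to build the request body through a helper that always includes it. The sketch below is illustrative only: `build_submit_body` is a hypothetical name, and the `file_urls` key inside `input` is an assumption about the input object's shape, which this page does not spell out.

```python
import json
from typing import Any, Dict, Iterable, Optional


def build_submit_body(
    model: str,
    file_urls: Iterable[str],
    parameters: Optional[Dict[str, Any]] = None,
) -> str:
    """Serialize a submit-task body that always carries a `parameters` object."""
    body = {
        "model": model,
        # Assumed shape of `input`; check the request body reference.
        "input": {"file_urls": list(file_urls)},
        # Always present: an empty object {} when nothing is set.
        "parameters": dict(parameters) if parameters else {},
    }
    return json.dumps(body)
```

Calling the helper without `parameters` still produces `"parameters": {}` in the serialized body, which is what the new domain requires.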
Request headers
Parameter | Type | Required | Description |
Authorization | string | Yes | Authentication token in the format |
Content-Type | string | Yes | Media type of the request body. Required only for the submit task endpoint. Set to |
X-DashScope-Async | string | Yes | Async task identifier. Required only for the submit task endpoint. Set to |
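The header rules in the table can be expressed as two small helpers. The literal header values are not shown in the table above, so the ones used here (`Bearer <API key>`, `application/json`, and `enable`) are assumptions based on common DashScope usage; verify them against the console documentation before relying on them.

```python
from typing import Dict


def query_headers(api_key: str) -> Dict[str, str]:
    """Headers for the query task endpoint: only Authorization is needed."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    # Assumed token format: "Bearer <API key>".
    return {"Authorization": f"Bearer {api_key}"}


def submit_headers(api_key: str) -> Dict[str, str]:
    """Headers for the submit task endpoint."""
    headers = query_headers(api_key)
    # The next two headers are required only when submitting a task.
    headers["Content-Type"] = "application/json"  # assumed value
    headers["X-DashScope-Async"] = "enable"       # assumed value
    return headers
```

Keeping the submit-only headers in a separate function mirrors the table: `Content-Type` and `X-DashScope-Async` are required only for the submit task endpoint.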
Submit task
Submits an audio transcription task. The endpoint returns immediately with a task ID rather than the transcription; use the Query task endpoint to poll for results.
Request body | The following URL is for the China (Beijing) region. URLs differ by region. |
model The model name for audio and video file transcription. Valid values:
| |
input The input parameter object. | |
parameters The request parameter object. Important When using the new domain ( |
Response body | |
request_id The unique identifier of this request. | |
output The data returned when submitting a task. |
Query task
Returns the status and results of an audio transcription task. Poll this endpoint until the task reaches a terminal state.
Request body | The following URL is for the China (Beijing) region. URLs differ by region. |
task_id Important URL path parameter. No request body. The ID of the task to query, returned as | |
Response body | Success example / Error example |
request_id The unique identifier of this request. | |
output The data returned when querying a task. |
Other APIs: batch query task status and cancel tasks
For details, see Manage asynchronous tasks. That API supports batch querying of audio file transcription tasks submitted within the last 24 hours, and canceling tasks in the PENDING (queued) state.
Transcription result description
The recognition result is saved as a JSON file.
The key parameters are as follows:
Parameter | Type | Description |
audio_format | string | The format of the audio in the source file. |
channels | array[integer] | The audio track index information in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on. |
original_sampling_rate | integer | The sample rate of the audio in the source file (Hz). |
original_duration_in_milliseconds | integer | The original duration of the audio in the source file (ms). |
channel_id | integer | The index of the transcribed audio track, starting from 0. |
content_duration_in_milliseconds | integer | The duration of the content in the audio track that is identified as speech (ms). Important: Billing is based on speech content duration only (non-speech parts are not metered). Speech duration is typically shorter than total audio duration. The AI-based speech detection may have minor discrepancies. |
transcript | string | The paragraph-level speech transcription result. |
sentences | array | The sentence-level speech transcription result. |
words | array | The word-level speech transcription result. |
begin_time | integer | Start timestamp (ms). |
end_time | integer | End timestamp (ms). |
text | string | The speech transcription result. |
speaker_id | integer | The index of the current speaker, starting from 0. This is used to distinguish different speakers. This field is displayed in the recognition result only when speaker diarization is enabled. |
punctuation | string | The predicted punctuation mark after the word, if any. |
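To show how these fields might be consumed, the sketch below walks a parsed result and prints one line per sentence with timestamps and, when present, the speaker index. The nesting is an assumption made for illustration: the code expects a top-level `transcripts` list with one object per audio track, each holding `channel_id`, `content_duration_in_milliseconds`, and `sentences`. Adjust the keys to match the JSON file you actually receive.

```python
from typing import Any, Dict, Iterator, List, Tuple


def format_timestamp(ms: int) -> str:
    """Render a millisecond offset as MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def iter_sentences(result: Dict[str, Any]) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (channel_id, sentence) pairs; `transcripts` is an assumed key."""
    for track in result.get("transcripts", []):
        for sentence in track.get("sentences", []):
            yield track["channel_id"], sentence


def to_lines(result: Dict[str, Any]) -> List[str]:
    """Format every sentence as '[begin-end] track[/speaker]: text'."""
    lines = []
    for channel_id, sentence in iter_sentences(result):
        who = f"ch{channel_id}"
        # speaker_id appears only when speaker diarization is enabled.
        if sentence.get("speaker_id") is not None:
            who += f"/spk{sentence['speaker_id']}"
        span = f"{format_timestamp(sentence['begin_time'])}-{format_timestamp(sentence['end_time'])}"
        lines.append(f"[{span}] {who}: {sentence['text']}")
    return lines


def speech_duration_ms(result: Dict[str, Any]) -> int:
    """Sum the speech duration of all tracks (the quantity billing is based on)."""
    return sum(
        track.get("content_duration_in_milliseconds", 0)
        for track in result.get("transcripts", [])
    )
```

Load the result file with `json.load` and pass the resulting dictionary to `to_lines` or `speech_duration_ms`.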
DashScope synchronous calls (Fun-ASR-Flash)
SDK calls aren't supported for this feature.
Endpoint
China (Beijing)
POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
Singapore
POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
Replace {WorkspaceId} with your actual workspace ID.
Request headers
Parameter | Type | Required | Description |
Authorization | string | Yes | Authentication token in the format |
Content-Type | string | Yes | Media type of the request body. Set to |
X-DashScope-SSE | string | Yes | Controls whether results are returned as an SSE stream. Set to |
Request body | The following URL is for the China (Beijing) region. URLs differ by region. Examples: Non-streaming / Streaming / With context (non-streaming) / With context (streaming) / Base64. You can pass Base64-encoded data (Data URI) in the format:
|
model Model name. Set to | |
input Input information. | |
parameters Model parameters. |
Response body | Non-streamingStreamingWhen Sample response: |
request_id Unique identifier for this request. | |
output Output result. | |
usage Usage information. Returned only when |
SSE streaming result processing
In streaming mode, note the following:
- For each SSE event received, parse the JSON in the data field.
- Check output.sentence.sentence_end to determine whether the current sentence is complete: when the value is true, recognition for the sentence is finished and the word-level timestamps are finalized; when the value is false, recognition is still in progress and the text and timestamps may change in subsequent events.
- usage information is returned only in sentence-end events. Use it to track audio processing duration.
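The processing notes above can be sketched with the standard library alone. The code parses `data:` lines from an SSE stream, then keeps only the events whose `output.sentence.sentence_end` is true. The function names are hypothetical, and reading the sentence text from `output.sentence.text` is an assumption about the event layout.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple


def iter_sse_data(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield the parsed JSON payload of each SSE event (a blank line ends an event)."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, one leading space after the colon is dropped.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []
    if data_parts:  # stream ended without a trailing blank line
        yield json.loads("\n".join(data_parts))


def collect_final_sentences(
    events: Iterable[Dict[str, Any]],
) -> Tuple[List[str], Optional[Dict[str, Any]]]:
    """Return finalized sentence texts and the last usage object seen."""
    finals: List[str] = []
    usage: Optional[Dict[str, Any]] = None
    for event in events:
        sentence = event.get("output", {}).get("sentence") or {}
        if sentence.get("sentence_end"):
            finals.append(sentence.get("text", ""))  # assumed text location
            # usage is returned only in sentence-end events.
            usage = event.get("usage", usage)
    return finals, usage
```

`iter_sse_data` accepts any iterable of text lines, so it can be fed from a decoded HTTP response body read line by line.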
DashScope synchronous calls (Fun-ASR-Realtime)
This feature is available only in the China (Beijing) region.
SDK calls aren't supported.
Endpoint
POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
Request headers
Parameter | Type | Required | Description |
Authorization | string | Yes | Authentication token in the format |
Content-Type | string | Yes | Media type of the request body. Set to |
X-DashScope-SSE | string | Yes | Controls whether results are returned as an SSE stream. Set to |
Request body | Non-streaming / Streaming |
model The model name. Valid values:
| |
input The input information. Either this parameter or the audio file URL approach (specified through | |
parameters The model parameters. | |
resources The resource list. This is a reserved field. Pass an empty array |
Response body | Non-streaming / Streaming. When Intermediate results (sentence start, progressive word additions): Final result (sentence complete, includes
|
request_id The unique identifier for the request. | |
output The output result. | |
usage The usage information. Returned only when |
SSE streaming result processing
In streaming mode, note the following:
- For each SSE event received, parse the JSON in the data field.
- Check output.sentence.sentence_end to determine whether the current sentence is complete: when the value is true, recognition for the sentence is finished and the word-level timestamps are finalized; when the value is false, recognition is still in progress and the text and timestamps may change in subsequent events.
- usage information is returned only in sentence-end events. Use it to track audio processing duration.
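Because intermediate text may change until the sentence ends, a display layer should replace the in-progress sentence on every event instead of appending to it. The class below is one possible way to do that; `LiveTranscript` is a hypothetical name, it takes events that are already parsed into dictionaries, and it assumes the sentence text is at `output.sentence.text`.

```python
from typing import Any, Dict, List


class LiveTranscript:
    """Track finalized sentences plus the sentence still being recognized."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # sentences with sentence_end == true
        self.partial: str = ""          # latest text of the unfinished sentence

    def update(self, event: Dict[str, Any]) -> str:
        """Apply one parsed SSE event and return the text to display."""
        sentence = event.get("output", {}).get("sentence") or {}
        text = sentence.get("text", "")  # assumed text location
        if sentence.get("sentence_end"):
            # Final result: text and timestamps will not change any more.
            self.committed.append(text)
            self.partial = ""
        else:
            # Intermediate result: overwrite, because later events may revise it.
            self.partial = text
        return self.render()

    def render(self) -> str:
        """Join committed sentences with the current partial sentence."""
        return " ".join(part for part in [*self.committed, self.partial] if part)
```

Overwriting `partial` on each intermediate event means a revised word never shows up twice in the displayed text.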