Fun-ASR RESTful API: audio file transcription with DashScope (Alibaba Cloud Model Studio)

Use the Fun-ASR HTTP API to submit audio files for transcription and retrieve results through DashScope.

For the user guide, see Non-real-time speech recognition. For supported audio formats, file size limits, and duration limits, see Audio specifications.

DashScope asynchronous calls (Fun-ASR)

How it works

Unlike synchronous calls that return results immediately, the asynchronous mode handles long audio files and time-consuming tasks. This mode uses a submit-then-poll workflow to avoid request timeouts during lengthy processing:

Step 1: Submit a task
- The client sends an asynchronous processing request.
- After validating the request, the server returns a unique task_id to confirm that the task was created. The task is queued and is not executed immediately.
Step 2: Get the results
- The client uses the returned task_id to poll the result query endpoint repeatedly.
- Once the task finishes, the result query endpoint returns the final transcription results.
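
The polling half of this workflow can be sketched as a small loop. This is a minimal sketch, not the official client: `poll_until_done` and its parameters are hypothetical names, the caller supplies the function that performs the actual query, and `RUNNING` is an assumed in-progress status (only `PENDING` is named in this document).

```python
import time
from typing import Any, Callable, Dict

# PENDING appears in this document; RUNNING is an assumed in-progress status.
IN_PROGRESS_STATES = {"PENDING", "RUNNING"}


def poll_until_done(
    fetch_task: Callable[[], Dict[str, Any]],
    interval_seconds: float = 2.0,
    max_attempts: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_task repeatedly until the task leaves the in-progress states."""
    for attempt in range(max_attempts):
        response = fetch_task()
        status = response["output"]["task_status"]
        if status not in IN_PROGRESS_STATES:
            return response  # terminal state reached
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # wait before the next poll
    raise TimeoutError(f"task still in progress after {max_attempts} polls")
```

Passing the query function in as `fetch_task` keeps the loop independent of any HTTP library, and the bounded attempt count prevents a stuck task from blocking the caller forever.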

Service endpoint

China (Beijing)

Submit task endpoint: POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/asr/transcription

Query task endpoint: GET https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/tasks/{task_id}

Replace {WorkspaceId} with your actual workspace ID.

Singapore

Submit task endpoint: POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/audio/asr/transcription

Query task endpoint: GET https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/tasks/{task_id}

Replace {WorkspaceId} with your actual workspace ID.
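
The endpoint patterns above differ only in the workspace ID and the region segment, so they can be assembled programmatically. The helper names `submit_url` and `task_url` below are hypothetical and exist only for this sketch; the hosts and paths are copied from this section.

```python
# Region keys and host patterns as listed in this section.
REGION_HOSTS = {
    "cn-beijing": "{workspace_id}.cn-beijing.maas.aliyuncs.com",
    "ap-southeast-1": "{workspace_id}.ap-southeast-1.maas.aliyuncs.com",
}
SUBMIT_PATH = "/api/v1/services/audio/asr/transcription"
TASK_PATH = "/api/v1/tasks/{task_id}"


def _host(workspace_id: str, region: str) -> str:
    if region not in REGION_HOSTS:
        raise ValueError(f"unsupported region: {region}")
    return REGION_HOSTS[region].format(workspace_id=workspace_id)


def submit_url(workspace_id: str, region: str = "cn-beijing") -> str:
    """URL for the submit task endpoint (POST)."""
    return f"https://{_host(workspace_id, region)}{SUBMIT_PATH}"


def task_url(workspace_id: str, task_id: str, region: str = "cn-beijing") -> str:
    """URL for the query task endpoint (GET)."""
    return f"https://{_host(workspace_id, region)}{TASK_PATH.format(task_id=task_id)}"
```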

Important

Alibaba Cloud Model Studio has released workspace-specific domains for the China (Beijing) and Singapore regions. The new dedicated domains deliver superior performance and higher stability for inference requests. We recommend migrating to the new domains:

China (Beijing): from dashscope.aliyuncs.com to {WorkspaceId}.cn-beijing.maas.aliyuncs.com
Singapore: from dashscope-intl.aliyuncs.com to {WorkspaceId}.ap-southeast-1.maas.aliyuncs.com

Replace {WorkspaceId} with your actual Workspace ID. The existing domains remain fully functional.

Important

When submitting tasks through the new domain (https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com), the request body must include the parameters object. Even if you don't need to set any parameters, pass an empty object {}. Otherwise, the task submits successfully but transcription fails.

Replace {WorkspaceId} with your actual workspace ID.

Request headers

Parameter	Type	Required	Description
Authorization	string	Yes	Authentication token in the format `Bearer <your_api_key>`. Replace `<your_api_key>` with your actual API key. Required for both the submit task and query task endpoints.
Content-Type	string	Yes	Media type of the request body. Required only for the submit task endpoint. Set to `application/json`.
X-DashScope-Async	string	Yes	Async task identifier. Required only for the submit task endpoint. Set to `enable`. Omitting this header causes task submission to fail.

Submit task

Submits an audio transcription task. The endpoint returns a task ID without waiting for transcription to finish; use the Query task endpoint to poll for results.

Request body

The following URL is for the China (Beijing) region. URLs differ by region.

curl --location 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/asr/transcription' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header "Content-Type: application/json" \
--header "X-DashScope-Async: enable" \
--data '{
    "model": "fun-asr",
    "input": {
        "file_urls": [
            "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
        ]
    },
    "parameters": {
        "channel_id": [0]
    }
}'

model `string` (Required)
The model name for audio and video file transcription. Valid values: `fun-asr`, `fun-asr-2025-11-07`, `fun-asr-2025-08-25`, `fun-asr-mtl`, `fun-asr-mtl-2025-08-25`.

input `object` (Required)
The input parameter object.

Properties of input:

file_urls `array[string]` (Required)
A list of audio or video file URLs to transcribe. Supports HTTP and HTTPS. Each request accepts only one URL. For supported audio formats, file size limits, and duration limits, see Audio specifications.
If your audio files are stored in Alibaba Cloud OSS, the RESTful API also supports temporary URLs prefixed with `oss://`.
Important:
- Temporary URLs expire after 48 hours and can't be used afterward. Don't use them in production environments.
- The file upload credential API is throttled at 100 QPS and can't be scaled up. Don't use it in production, high-concurrency, or load testing scenarios.
- In production, store files in stable storage such as Alibaba Cloud OSS to ensure long-term availability and avoid throttling.
If the OSS temporary public URL can't be accessed, set `X-DashScope-OssResourceResolve` to `enable` in the Request headers (not recommended). The Java SDK and Python SDK don't support custom request headers.

parameters `object` (Optional)
The request parameter object.
Important: When using the new domain (`https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com`), `parameters` is required. Even if you don't need to set any parameters, pass an empty object `{}`. If you omit this field, the task submits successfully but the query results endpoint returns a transcription failure.

Properties of parameters:

vocabulary_id `string` (Optional)
The custom vocabulary ID. When set, the custom vocabulary associated with this ID takes effect during transcription. Disabled by default. For usage instructions, see Custom hotwords.

channel_id `array[integer]` (Optional)
Specifies the audio track indexes to transcribe in a multi-track audio file. Indexes start from 0. For example, [0] transcribes the first track, and [0, 1] transcribes both the first and second tracks. If omitted, only the first track is processed. Default: [0].
Important: Each specified track is billed independently. For example, requesting [0, 1] for a single file incurs two separate charges.

special_word_filter `string` (Optional)
Specifies sensitive words to handle during transcription, with different processing methods for different words. For details, see Sensitive word filter.

diarization_enabled `boolean` (Optional)
Whether to enable speaker diarization. Default: false. Applies only to mono audio. Multi-channel audio doesn't support speaker diarization. When enabled, the transcription results include a `speaker_id` field to distinguish different speakers. For a `speaker_id` example, see Transcription result description.
Note: If speaker diarization is enabled, keep the audio duration under 2 hours. Longer audio may cause transcription failures or timeouts.

speaker_count `integer` (Optional)
A hint for the expected number of speakers. Valid values: integers from 2 to 100, inclusive. By default, the system detects the speaker count automatically. Providing this value guides the algorithm toward the specified count but doesn't guarantee that exact number in the output.
Important: Takes effect only when speaker diarization is enabled (`diarization_enabled` set to `true`).

language_hints `array[string]` (Optional)
The language of the audio to transcribe. If the language is unknown, omit this parameter and the model detects the language automatically. Only the first value in the array is read. Additional values are ignored.
Supported language codes:
- fun-asr, fun-asr-2025-11-07, fun-asr-mtl, fun-asr-mtl-2025-08-25: zh (Chinese), en (English), ja (Japanese), ko (Korean), vi (Vietnamese), th (Thai), id (Indonesian), ms (Malay), tl (Filipino), hi (Hindi), ar (Arabic), fr (French), de (German), es (Spanish), pt (Portuguese), ru (Russian), it (Italian), nl (Dutch), sv (Swedish), da (Danish), fi (Finnish), no (Norwegian), el (Greek), pl (Polish), cs (Czech), hu (Hungarian), ro (Romanian), bg (Bulgarian), hr (Croatian), sk (Slovak)
- fun-asr-2025-08-25: zh (Chinese), en (English)

Response body

{
    "output": {
        "task_status": "PENDING",
        "task_id": "c2e5d63b-96e1-4607-bb91-**********"
    },
    "request_id": "77ae55ae-be17-97b8-9942-**********"
}

request_id `string`
The unique identifier of this request.

output `object`
The data returned when submitting a task.

Properties of output:

task_id `string`
The task ID. Pass this ID to the Query task endpoint as the `task_id` path parameter.

task_status `string`
The task status. Returns `PENDING` when the task is submitted successfully.
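
The submit request can also be assembled in Python. The sketch below uses only the standard library; `build_submit_request` and `submit_task` are hypothetical helper names, not part of any SDK. It always sends a `parameters` object, falling back to `{}`, which follows the note above about the new domain.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_submit_request(
    api_key: str,
    file_url: str,
    model: str = "fun-asr",
    parameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return the headers and JSON body for the submit task endpoint."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-DashScope-Async": "enable",  # submission fails without this header
    }
    body = {
        "model": model,
        "input": {"file_urls": [file_url]},  # one URL per request
        "parameters": parameters if parameters is not None else {},
    }
    return {"headers": headers, "body": body}


def submit_task(url: str, api_key: str, file_url: str) -> str:
    """Send the request and return the task_id. Performs a real HTTP call."""
    parts = build_submit_request(api_key, file_url)
    request = urllib.request.Request(
        url,
        data=json.dumps(parts["body"]).encode("utf-8"),
        headers=parts["headers"],
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["output"]["task_id"]
```

Splitting request construction from sending makes the payload easy to inspect or log before any network call is made.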

Query task

Returns the status and results of an audio transcription task. Poll this endpoint until the task reaches a terminal state.

Request body

The following URL is for the China (Beijing) region. URLs differ by region.

curl --location 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/tasks/{task_id}' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

task_id `string` (Required)
The ID of the task to query, returned as `task_id` by the Submit task endpoint.
Important: `task_id` is a URL path parameter. This endpoint takes no request body.

Response body

Success example

{
    "request_id": "f9e1afad-94d3-997e-a83b-**********",
    "output": {
        "task_id": "f86ec806-4d73-485f-a24f-********",
        "task_status": "SUCCEEDED",
        "submit_time": "2024-09-12 15:11:40.041",
        "scheduled_time": "2024-09-12 15:11:40.071",
        "end_time": "2024-09-12 15:11:40.903",
        "results": [
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/pre/filetrans-16k/20240912/15%3A11/409a4b92-445b-4dd8-8c1d-f110954d82d8-1.json?Expires=1726211500&OSSAccessKeyId=yourOSSAccessKeyId&Signature=v5Owy5qoAfT7mzGmQgH0g8C**%3D",
                "subtask_status": "SUCCEEDED"
            }
        ],
        "task_metrics": {
            "TOTAL": 1,
            "SUCCEEDED": 1,
            "FAILED": 0
        }
    },
    "usage": {
        "duration": 9
    }
}

Error example

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "FILE_DOWNLOAD_FAILED",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 1,
        "SUCCEEDED": 0,
        "FAILED": 1
    }
}

request_id `string`
The unique identifier of this request.

output `object`
The data returned when querying a task.

Properties of output:

task_id `string`
The ID of the queried task.

task_status `string`
The status of the queried task.
Note: When a task contains multiple subtasks, the overall task status is marked as `SUCCEEDED` as long as any subtask succeeds. Check the `subtask_status` field to determine individual subtask results.

submit_time `string`
The time when the task was submitted.

scheduled_time `string`
The time when the task was scheduled for execution.

end_time `string`
The time when the task finished.

results `array[object]`
The list of subtask results, one for each audio file to transcribe.

Properties of each results item:

subtask_status `string`
The subtask status.

file_url `string`
The URL of the file processed in this transcription subtask.

transcription_url `string`
The URL to download the transcription results. This URL is valid for 24 hours. Once expired, previously returned URLs can no longer be used to query the task or download results. The transcription results are stored as a JSON file. You can download this file through the URL or read its content directly through an HTTP request. For the meaning of each field in the JSON data, see Transcription result description.

code `string`
The error code of the failed subtask.
Important: Returned only when the subtask fails.

message `string`
The error message of the failed subtask.
Important: Returned only when the subtask fails.

task_metrics `object`
Overall task execution statistics.

Properties of task_metrics:

TOTAL `integer`
The total number of subtasks.

SUCCEEDED `integer`
The number of succeeded subtasks.

FAILED `integer`
The number of failed subtasks.
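
Because the overall `task_status` alone does not reveal per-file failures, a client typically inspects each entry in `results`. The sketch below shows one way to do that; `split_results` is a hypothetical helper name, and it takes the `output` object of a query response as its input.

```python
from typing import Any, Dict, List, Tuple


def split_results(output: Dict[str, Any]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Separate succeeded subtasks (download URLs) from failed ones (error details)."""
    transcription_urls: List[str] = []
    failures: List[Dict[str, str]] = []
    for item in output.get("results", []):
        if item.get("subtask_status") == "SUCCEEDED":
            transcription_urls.append(item["transcription_url"])
        else:
            # code and message are returned only for failed subtasks
            failures.append(
                {
                    "file_url": item.get("file_url", ""),
                    "code": item.get("code", ""),
                    "message": item.get("message", ""),
                }
            )
    return transcription_urls, failures
```

Download each returned `transcription_url` promptly, since the URL is valid for 24 hours.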

Other APIs: batch query task status and cancel tasks

For details, see Manage asynchronous tasks. These APIs support batch querying of audio file transcription tasks submitted within the last 24 hours, and canceling tasks that are still in the PENDING (queued) state.

Transcription result description

The recognition result is saved as a JSON file.

Example recognition result:

{
    "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties":{
        "audio_format":"pcm_s16le",
        "channels":[
            0
        ],
        "original_sampling_rate":16000,
        "original_duration_in_milliseconds":3834
    },
    "transcripts":[
        {
            "channel_id":0,
            "content_duration_in_milliseconds":3720,
            "text":"Hello world, this is Alibaba Speech Lab.",
            "sentences":[
                {
                    "begin_time":100,
                    "end_time":3820,
                    "text":"Hello world, this is Alibaba Speech Lab.",
                    "sentence_id":1,
                    "speaker_id":0, // This field is displayed only when automatic speaker diarization is enabled.
                    "words":[
                        {
                            "begin_time":100,
                            "end_time":596,
                            "text":"Hello ",
                            "punctuation":""
                        },
                        {
                            "begin_time":596,
                            "end_time":844,
                            "text":"world",
                            "punctuation":", "
                        }
                        // Other content is omitted here.
                    ]
                }
            ]
        }
    ]
}

The key parameters are as follows:

Parameter	Type	Description
audio_format	string	The format of the audio in the source file.
channels	array[integer]	The audio track index information in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on.
original_sampling_rate	integer	The sample rate of the audio in the source file (Hz).
original_duration_in_milliseconds	integer	The original duration of the audio in the source file (ms).
channel_id	integer	The index of the transcribed audio track, starting from 0.
content_duration_in_milliseconds	integer	The duration of the content in the audio track that is identified as speech (ms). Important Billing is based on speech content duration only (non-speech parts are not metered). Speech duration is typically shorter than total audio duration. The AI-based speech detection may have minor discrepancies.
transcript	string	The paragraph-level speech transcription result.
sentences	array	The sentence-level speech transcription result.
words	array	The word-level speech transcription result.
begin_time	integer	Start timestamp (ms).
end_time	integer	End timestamp (ms).
text	string	The speech transcription result.
speaker_id	integer	The index of the current speaker, starting from 0. This is used to distinguish different speakers. This field is displayed in the recognition result only when speaker diarization is enabled.
punctuation	string	The predicted punctuation mark after the word, if any.
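
Once the result JSON is downloaded and parsed, the sentence-level entries can be flattened into readable lines. This is a minimal sketch based on the field names in the table above; `sentence_lines` is a hypothetical helper, and `speaker_id` is treated as optional because it appears only when speaker diarization is enabled.

```python
from typing import Any, Dict, List


def sentence_lines(result: Dict[str, Any]) -> List[str]:
    """Format each sentence as 'start-end (seconds) channel [speaker] text'."""
    lines: List[str] = []
    for transcript in result.get("transcripts", []):
        channel = transcript["channel_id"]
        for sentence in transcript.get("sentences", []):
            begin = sentence["begin_time"] / 1000  # milliseconds to seconds
            end = sentence["end_time"] / 1000
            speaker = sentence.get("speaker_id")
            prefix = f"[speaker {speaker}] " if speaker is not None else ""
            lines.append(f"{begin:.2f}-{end:.2f}s ch{channel} {prefix}{sentence['text']}")
    return lines
```

Note that the example result above contains `//` comments for illustration, so it must be cleaned before a strict JSON parser accepts it.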

DashScope synchronous calls (Fun-ASR-Flash)

Important

SDK calls aren't supported for this feature.

Endpoint

China (Beijing)

POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Replace {WorkspaceId} with your actual workspace ID.

Singapore

POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Replace {WorkspaceId} with your actual Workspace ID.

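Important

Alibaba Cloud Model Studio has released workspace-specific domains for the China (Beijing) and Singapore regions. We recommend migrating to the new domains:
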
China (Beijing): from dashscope.aliyuncs.com to {WorkspaceId}.cn-beijing.maas.aliyuncs.com
Singapore: from dashscope-intl.aliyuncs.com to {WorkspaceId}.ap-southeast-1.maas.aliyuncs.com

Replace {WorkspaceId} with your actual Workspace ID. The existing domains remain fully functional.

Request headers

Parameter	Type	Required	Description
Authorization	string	Yes	Authentication token in the format `Bearer <your_api_key>`. Replace `<your_api_key>` with your actual API key.
Content-Type	string	Yes	Media type of the request body. Set to `application/json`.
X-DashScope-SSE	string	No	Controls whether results are returned as an SSE stream. Set to `enable` to receive intermediate and final results incrementally. Set to `disable` or omit this header to receive only the final result.

Request body

The following URL is for the China (Beijing) region. URLs differ by region.

Non-streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header "Content-Type: application/json" \
--header "X-DashScope-SSE: disable" \
--data '{
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "input_audio", "input_audio": {"data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"}}
                ]
            }
        ]
    },
    "parameters": {"format": "wav", "sample_rate": "16000"}
}'

Streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header "Content-Type: application/json" \
--header "X-DashScope-SSE: enable" \
--data '{
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "input_audio", "input_audio": {"data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"}}
                ]
            }
        ]
    },
    "parameters": {"format": "wav", "sample_rate": "16000"}
}'

With context - Non-streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header "Content-Type: application/json" \
--header "X-DashScope-SSE: disable" \
--data '{
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "input_text", "text": "Hello"}]
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": "Hello, I am Qwen. How can I help you?"}]
            },
            {
                "role": "user",
                "content": [
                    {"type": "input_audio", "input_audio": {"data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"}}
                ]
            }
        ]
    },
    "parameters": {"format": "wav", "sample_rate": "16000"}
}'

With context - Streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header "Content-Type: application/json" \
--header "X-DashScope-SSE: enable" \
--data '{
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "input_text", "text": "Hello"}]
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": "Hello, I am Qwen. How can I help you?"}]
            },
            {
                "role": "user",
                "content": [
                    {"type": "input_audio", "input_audio": {"data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"}}
                ]
            }
        ]
    },
    "parameters": {"format": "wav", "sample_rate": "16000"}
}'

Base64

You can pass Base64-encoded data (Data URI) in the format `data:<mediatype>;base64,<data>`.
- `<mediatype>`: The MIME type, which varies by audio format. Examples: WAV is `audio/wav`, MP3 is `audio/mpeg`.
- `<data>`: The Base64-encoded string of the audio file. Base64 encoding increases the payload size. Keep the original file small enough so the encoded data stays within the 10 MB input audio size limit.
- Example: `data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9`

Sample code for building the Data URI

Python

import base64, pathlib

# input.mp3 is the local audio file to transcribe. Replace with your own file path
# and make sure it meets the audio requirements.
file_path = pathlib.Path("input.mp3")
base64_str = base64.b64encode(file_path.read_bytes()).decode()
data_uri = f"data:audio/mpeg;base64,{base64_str}"

Java

import java.nio.file.*;
import java.util.Base64;

public class Main {
    /**
     * filePath is the local audio file to transcribe. Replace with your own file path
     * and make sure it meets the audio requirements.
     */
    public static String toDataUrl(String filePath) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(filePath));
        String encoded = Base64.getEncoder().encodeToString(bytes);
        return "data:audio/mpeg;base64," + encoded;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toDataUrl("input.mp3"));
    }
}

Sample request with Base64 audio (Python)

import base64, pathlib
import os
import requests

# input.wav is the local audio file to transcribe. Replace with your own file path
# and make sure it meets the audio requirements.
file_path = pathlib.Path("input.wav")
base64_str = base64.b64encode(file_path.read_bytes()).decode()
data_uri = f"data:audio/wav;base64,{base64_str}"

# Replace {WorkspaceId} with your actual workspace ID.
url = "https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation"
headers = {
    "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
    "Content-Type": "application/json",
    "X-DashScope-SSE": "disable",
}
payload = {
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": data_uri,
                        },
                    }
                ],
            }
        ]
    },
    "parameters": {
        "format": "wav",
        "sample_rate": "16000",
    },
}

response = requests.post(url, headers=headers, json=payload)
print(response.status_code)
print(response.json())
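
Since Base64 turns every 3 input bytes into 4 output characters, the encoded size can be estimated before the file is read and encoded. The sketch below is illustrative only: `base64_length` and `fits_limit` are hypothetical helpers, and treating the 10 MB limit as 10 × 1024 × 1024 bytes is an assumption, as is ignoring the short `data:<mediatype>;base64,` prefix.

```python
import math

# Assumption: the 10 MB limit is interpreted as 10 * 1024 * 1024 bytes.
MAX_INPUT_BYTES = 10 * 1024 * 1024


def base64_length(raw_size: int) -> int:
    """Length of the padded Base64 encoding of raw_size bytes."""
    return 4 * math.ceil(raw_size / 3)


def fits_limit(raw_size: int, limit: int = MAX_INPUT_BYTES) -> bool:
    """True if a file of raw_size bytes stays within the limit after encoding."""
    return base64_length(raw_size) <= limit
```

Under these assumptions, a 7 MiB file fits after encoding while an 8 MiB file does not, so files much above roughly 7.5 MiB are better passed by URL.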
model `string` (Required)
Model name. Set to `fun-asr-flash-2026-06-15`.

input `object` (Required)
Input information.

Properties of input:

messages `array(object)` (Required)
Message list. Contains the audio to transcribe, and optionally conversation context to improve recognition accuracy.
Important: The context feature improves recognition accuracy for specialized vocabulary. For usage details, see Quick start.
Constraints:
- Context messages (`input_text` and `text` types) are limited to 5 each. When exceeded, only the 5 most recent are retained.
- The total text length per turn (`user` and `assistant` `text` field lengths combined) must not exceed 400 characters (counted per character, where each character counts as 1). Excess content is silently truncated from the end.
Important: When providing context, the order of messages in the `messages` array matters:
- Context messages must follow conversational turn order, with each `user` message (`input_text` type) placed before the corresponding `assistant` message (`text` type).
- The `user` message containing `input_audio` must always be the last item in the `messages` array.

Properties of each messages item:

role `string` (Required)
Message role. Valid values:
- `user` (Required): A user message. When type is `input_audio`, this represents the audio to transcribe. When type is `input_text`, this represents a previous turn's transcription result or a domain-specific word list (optional, for context).
- `assistant` (Optional, for context): An LLM response from a previous turn.

content `array(object)` (Required)
Message content list.

Properties of each content item:

type `string` (Required)
Content type. Each request must include at least one `input_audio` message. Valid values:
- `input_audio` (Required): Audio input for the current transcription (role must be user). Include the `input_audio` object.
- `input_text` (Optional, for context): A previous turn's transcription result or a domain-specific word list (role must be user). Include the `text` field.
- `text` (Optional, for context): An LLM response from a previous turn (role must be assistant). Include the `text` field.

input_audio `object` (Conditionally required)
Required when `type` is `input_audio`.

Properties of input_audio:

data `string` (Required)
Audio data for transcription. For supported audio formats, file size limits, and duration limits, see Audio specifications. Two formats are supported:
- Audio file URL: A publicly accessible URL pointing to the audio file. Example: `https://example.com/audio/sample.wav`
- Base64 Data URI: A Data URI containing Base64-encoded audio data, formed by concatenating the `data:{MIME_TYPE};base64,` prefix with the Base64-encoded audio data. Supported MIME types include `audio/wav` and `audio/mp3`. Example: `data:audio/wav;base64,{BASE64_ENCODED_DATA}`

text `string` (Conditionally required)
When `type` is `input_text`, pass the transcription result from a previous turn or a domain-specific word list. When `type` is `text`, pass the LLM response from a previous turn. Text length is counted per character, where each character counts as 1. The combined `text` field lengths of all messages in a turn must not exceed 400 characters. Excess content is truncated from the end.

parameters `object` (Required)
Model parameters.

Properties of parameters:

format `string` (Required)
Audio format. Set this to the actual format of your audio file, such as `wav`, `mp3`, or `opus`. For details, see Audio specifications.

sample_rate `string` (Optional)
Audio sample rate in Hz. For example, `16000` indicates a 16 kHz sample rate. For details, see Audio specifications.
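
The ordering and size constraints on `messages` are easy to get wrong by hand, so it can help to build the array with a helper. The sketch below is one possible approach, not an official utility: `build_messages` is a hypothetical name, context is passed as (user text, assistant text) pairs, and the choice to raise an error for turns over 400 characters (instead of relying on server-side truncation) is a design decision of this example.

```python
from typing import Any, Dict, List, Sequence, Tuple

MAX_CONTEXT_TURNS = 5   # at most 5 input_text and 5 text messages are retained
MAX_TURN_CHARS = 400    # combined user + assistant text length per turn


def build_messages(
    audio_data: str,
    context_turns: Sequence[Tuple[str, str]] = (),
) -> List[Dict[str, Any]]:
    """Build the messages array: context turns in order, audio message last."""
    messages: List[Dict[str, Any]] = []
    # Keep only the most recent turns, mirroring the documented retention rule.
    for user_text, assistant_text in list(context_turns)[-MAX_CONTEXT_TURNS:]:
        if len(user_text) + len(assistant_text) > MAX_TURN_CHARS:
            raise ValueError("context turn exceeds 400 characters")
        messages.append(
            {"role": "user", "content": [{"type": "input_text", "text": user_text}]}
        )
        messages.append(
            {"role": "assistant", "content": [{"type": "text", "text": assistant_text}]}
        )
    # The user message carrying input_audio must be the final item.
    messages.append(
        {
            "role": "user",
            "content": [{"type": "input_audio", "input_audio": {"data": audio_data}}],
        }
    )
    return messages
```

Here `audio_data` can be either a public URL or a Base64 Data URI, matching the two formats accepted by `input_audio.data`.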

Response body

Non-streaming

{
    "output": {
        "sentence": {
            "begin_time": 760,
            "channel_id": 0,
            "end_time": 3800,
            "sentence_end": true,
            "sentence_id": 1,
            "text": "Hello world, this is Alibaba Speech Lab.",
            "words": [
                {"begin_time": 760, "end_time": 1040, "fixed": true, "punctuation": "", "text": "Hello"},
                {"begin_time": 1040, "end_time": 1240, "fixed": true, "punctuation": "，", "text": " world"},
                {"begin_time": 1360, "end_time": 1880, "fixed": true, "punctuation": "", "text": "this is"},
                {"begin_time": 1880, "end_time": 2520, "fixed": true, "punctuation": "", "text": "Alibaba"},
                {"begin_time": 2520, "end_time": 2840, "fixed": true, "punctuation": "", "text": "Speech"},
                {"begin_time": 2840, "end_time": 3800, "fixed": true, "punctuation": "。", "text": "Lab"}
            ]
        },
        "text": "Hello world, this is Alibaba Speech Lab."
    },
    "usage": {
        "duration": 4
    },
    "request_id": "40e0734d-096f-9ae3-86c1-a8c013287561"
}

Streaming

When `X-DashScope-SSE: enable` is set, the server returns results using the Server-Sent Events (SSE) protocol. Each SSE event has the following format:

id:{sequence_number}
event:result
:HTTP_STATUS/200
data:{JSON data}

Sample response:

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"sentence_id":1,"sentence_end":true,"end_time":3800,"words":[{"end_time":1040,"punctuation":"","begin_time":760,"fixed":true,"text":"Hello"},{"end_time":1240,"punctuation":"，","begin_time":1040,"fixed":true,"text":" World"},{"end_time":1880,"punctuation":"","begin_time":1360,"fixed":true,"text":"this is"},{"end_time":2520,"punctuation":"","begin_time":1880,"fixed":true,"text":"Alibaba"},{"end_time":2840,"punctuation":"","begin_time":2520,"fixed":true,"text":"Speech"},{"end_time":3800,"punctuation":"。","begin_time":2840,"fixed":true,"text":"Lab"}],"begin_time":760,"text":"Hello world, this is Alibaba Speech Lab","channel_id":0},"text":"Hello World, this is Alibaba Speech Lab."},"usage":{"duration":4},"request_id":"fc1582e4-935c-9fc2-a482-a98bf43daa69"}

request_id `string`
Unique identifier for this request.

output `object`
Output result.

Properties of output:

text `string`
Accumulated full transcription text up to this point.

sentence `object`
Details of the current sentence.

Properties of sentence:

sentence_id `integer`
Sentence number, starting from 1.

sentence_end `boolean`
Whether this is the final result for the sentence. When `true`, recognition for this sentence is complete.

begin_time `integer`
Start time of the sentence in milliseconds.

end_time `integer`
End time of the sentence in milliseconds. Returned only when `sentence_end` is `true`.

text `string`
Transcription text for the current sentence.

channel_id `integer`
Audio channel number, starting from 0.

words `array`
Word-level timestamp list.

Properties of each words item:

text `string`
Word text.

begin_time `integer`
Start time of the word in milliseconds.

end_time `integer`
End time of the word in milliseconds.

punctuation `string`
Punctuation mark after the word. Empty string if none.

fixed `boolean`
Whether the word is finalized. `false` means the timestamp may be adjusted in subsequent events.

usage `object`
Usage information. Returned only when `sentence_end` is `true`.

Properties of usage:

duration `integer`
Duration of processed audio in seconds.

SSE streaming result processing

In streaming mode, note the following:

For each SSE event received, parse the JSON in the data field.
Check output.sentence.sentence_end to determine whether the current sentence is complete: when the value is true, recognition for the sentence is finished and the word-level timestamps are finalized; when the value is false, recognition is still in progress and the text and timestamps may change in subsequent events.
usage information is returned only in sentence-end events. Use it to track audio processing duration.
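
The processing notes above can be sketched as a small parser over the lines of an SSE response. This is a minimal sketch under stated assumptions: `iter_sse_payloads` and `final_sentences` are hypothetical helpers, events are assumed to be separated by blank lines as in standard SSE, and only the `data:` field is interpreted while `id:`, `event:`, and `:` comment lines are skipped.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield the parsed JSON payload of each SSE event."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE allows one optional space after the colon
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line ends the event: parse the collected data field.
            yield json.loads("\n".join(data_parts))
            data_parts = []
    if data_parts:  # stream ended without a trailing blank line
        yield json.loads("\n".join(data_parts))


def final_sentences(lines: Iterable[str]) -> List[str]:
    """Collect the text of every sentence whose sentence_end flag is true."""
    return [
        payload["output"]["sentence"]["text"]
        for payload in iter_sse_payloads(lines)
        if payload["output"]["sentence"].get("sentence_end")
    ]
```

With an HTTP client that exposes the response as decoded text lines, the same functions can consume the stream incrementally instead of waiting for the full body.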

DashScope synchronous calls (Fun-ASR-Realtime)

Important

This feature is available only in the Beijing region.
SDK calls aren't supported.

Endpoint

POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Replace {WorkspaceId} with your actual workspace ID.

Important

Alibaba Cloud Model Studio has released a workspace-specific domain for the China (Beijing) region. The new dedicated domain delivers superior performance and higher stability for inference requests. We recommend migrating from dashscope.aliyuncs.com to {WorkspaceId}.cn-beijing.maas.aliyuncs.com.

Replace {WorkspaceId} with your actual Workspace ID. The existing domain remains fully functional.

Request headers

Parameter	Type	Required	Description
Authorization	string	Yes	Authentication token in the format `Bearer <your_api_key>`. Replace `<your_api_key>` with your actual API key.
Content-Type	string	Yes	Media type of the request body. Set to `application/json`.
X-DashScope-SSE	string	No	Controls whether results are returned as an SSE stream. Set to `enable` to turn on SSE streaming, which returns intermediate recognition results and final results in multiple responses. Set to `disable` or omit this header to return only the final result.

Request body	Non-streaming `curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \ --header "Authorization: Bearer $DASHSCOPE_API_KEY" \ --header "Content-Type: application/json" \ --header "X-DashScope-SSE: disable" \ --data '{ "model": "fun-asr-realtime", "input": { "messages": [] }, "parameters": { "audio_address": "https://example.com/audio/sample.mp3", "format": "mp3" }, "resources": [] }'` Streaming `curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \ --header "Authorization: Bearer $DASHSCOPE_API_KEY" \ --header "Content-Type: application/json" \ --header "X-DashScope-SSE: enable" \ --data '{ "model": "fun-asr-realtime", "input": { "messages": [] }, "parameters": { "audio_address": "https://example.com/audio/sample.mp3", "format": "mp3" }, "resources": [] }'`
model `string` (Required)
The model name. Valid values:
- `fun-asr-realtime` (stable version)
- `fun-asr-realtime-2026-02-28` (Fun-Realtime-ASR-preview)

input `object` (Conditionally Required)
The input information. Either this parameter or the audio file URL approach (specified through `parameters.audio_address`) is required. Use this parameter when uploading audio as Base64.
Properties:
- messages `array(object)` (Required) The message list.
  - content `array(object)` (Required) The content of the user message. Only one message group is allowed.
    - audio `string` (Required) The audio to transcribe, provided as a Base64-encoded Data URI. Required when uploading audio as Base64. The value is a `data:{MIME_TYPE};base64,` prefix followed by the Base64-encoded audio data. Supported MIME types include `audio/wav` and `audio/mp3`. Example: `data:audio/wav;base64,{BASE64_ENCODED_DATA}`. For supported audio formats, file size limits, and duration limits, see Audio specifications.
  - role `string` (Required) The role of the user message. Set to `user`. Required when uploading audio as Base64.

parameters `object` (Required)
The model parameters.
Properties:
- audio_address `string` (Conditionally Required) The URL of the audio file. Either this parameter or the Base64 approach (specified through `input.messages`) is required. Required when using the URL approach. The URL must be publicly accessible. For supported audio formats, file size limits, and duration limits, see Audio specifications.
- format `string` (Required) The audio format. Set this to the actual format of the audio file. Supported formats include `wav`, `mp3`, and `opus`. For more information, see Audio specifications.
- vad_enabled `boolean` (Optional) Specifies whether to enable Voice Activity Detection (VAD). Defaults to `true`. Set to `false` to disable VAD. However, if the audio duration exceeds 1 minute, the system automatically enables VAD regardless of this setting.

resources `array` (Optional)
The resource list. This is a reserved field. Pass an empty array `[]`.
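The two input approaches described above (a public URL in `parameters.audio_address`, or a Base64 Data URI in `input.messages`) can be sketched as a single payload builder. This is an illustrative sketch only: the helper names `audio_data_uri` and `build_payload` are hypothetical, and treating the two approaches as mutually exclusive is an assumption drawn from the "either ... or" wording, not a confirmed server rule.

```python
import base64
from typing import Optional


def audio_data_uri(audio_bytes: bytes, mime_type: str = "audio/wav") -> str:
    """Encode raw audio bytes as a `data:{MIME_TYPE};base64,...` Data URI."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_payload(
    audio_format: str,
    audio_url: Optional[str] = None,
    audio_bytes: Optional[bytes] = None,
    mime_type: str = "audio/wav",
    vad_enabled: Optional[bool] = None,
    model: str = "fun-asr-realtime",
) -> dict:
    """Build the JSON request body for either the URL or the Base64 approach."""
    # Assumption: exactly one audio source is supplied per request.
    if (audio_url is None) == (audio_bytes is None):
        raise ValueError("provide exactly one of audio_url or audio_bytes")

    parameters = {"format": audio_format}
    messages = []
    if audio_url is not None:
        # URL approach: the audio location goes into parameters.
        parameters["audio_address"] = audio_url
    else:
        # Base64 approach: a single user message carrying the Data URI.
        messages.append({
            "role": "user",
            "content": [{"audio": audio_data_uri(audio_bytes, mime_type)}],
        })
    if vad_enabled is not None:
        # Omitted by default so the server-side default (true) applies.
        parameters["vad_enabled"] = vad_enabled

    return {
        "model": model,
        "input": {"messages": messages},
        "parameters": parameters,
        "resources": [],  # reserved field, always an empty array
    }
```

Serialize the returned dictionary with `json.dumps` to get the request body. The URL branch reproduces the structure of the curl examples, including the empty `messages` array.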

Response body

Non-streaming

{
    "output": {
        "sentence": {
            "begin_time": 160,
            "channel_id": 0,
            "end_time": 1680,
            "sentence_end": true,
            "sentence_id": 1,
            "text": "Welcome to Alibaba Cloud.",
            "words": [
                {"begin_time": 160, "end_time": 520, "fixed": true, "punctuation": "", "text": "Welcome"},
                {"begin_time": 520, "end_time": 880, "fixed": true, "punctuation": "", "text": "to"},
                {"begin_time": 880, "end_time": 1280, "fixed": true, "punctuation": "", "text": "Alibaba"},
                {"begin_time": 1280, "end_time": 1680, "fixed": true, "punctuation": "。", "text": "Cloud"}
            ]
        },
        "text": "Welcome to Alibaba Cloud."
    },
    "usage": {
        "duration": 2
    },
    "request_id": "eff4c092-2289-9b43-a4cd-80e591fa90f5"
}

Streaming

When `X-DashScope-SSE: enable` is set, the server returns intermediate and final recognition results as Server-Sent Events (SSE). Each SSE event follows this format:

id:{sequence_number}
event:result
:HTTP_STATUS/200
data:{JSON data}

Intermediate results (sentence start, progressive word additions):

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"sentence_id":1,"sentence_end":false,"sentence_begin":true,"words":[],"begin_time":0,"text":"","channel_id":0},"text":""},"request_id":"372d19b3-993f-9288-adf0-a99f7606bd30"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"words":[{"end_time":520,"punctuation":"","begin_time":160,"fixed":false,"text":"Welcome"}],"begin_time":160,"text":"Welcome","channel_id":0,"sentence_id":1,"sentence_end":false},"text":"Welcome"},"request_id":"372d19b3-993f-9288-adf0-a99f7606bd30"}

id:3
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"words":[{"end_time":520,"punctuation":"","begin_time":160,"fixed":false,"text":"Welcome"},{"end_time":880,"punctuation":"","begin_time":520,"fixed":false,"text":"to"}],"begin_time":160,"text":"Welcome to","channel_id":0,"sentence_id":1,"sentence_end":false},"text":"Welcome to"},"request_id":"372d19b3-993f-9288-adf0-a99f7606bd30"}

Final result (sentence complete, includes `usage`):

id:4
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"sentence_id":1,"sentence_end":true,"end_time":1680,"words":[{"end_time":520,"punctuation":"","begin_time":160,"fixed":true,"text":"Welcome"},{"end_time":880,"punctuation":"","begin_time":520,"fixed":true,"text":"to"},{"end_time":1280,"punctuation":"","begin_time":880,"fixed":true,"text":"Alibaba"},{"end_time":1680,"punctuation":"。","begin_time":1280,"fixed":true,"text":"Cloud"}],"begin_time":160,"text":"Welcome to Alibaba Cloud.","channel_id":0},"text":"Welcome to Alibaba Cloud."},"usage":{"duration":2},"request_id":"372d19b3-993f-9288-adf0-a99f7606bd30"}
request_id `string`
The unique identifier for the request.

output `object`
The output result.
Properties:
- text `string` The full recognized text accumulated so far.
- sentence `object` Details about the current sentence.
  - sentence_id `integer` The sentence number, starting from 1.
  - sentence_end `boolean` Indicates whether this is the final result for the sentence. When `true`, recognition for the sentence is complete.
  - begin_time `integer` The start time of the sentence, in milliseconds.
  - end_time `integer` The end time of the sentence, in milliseconds. Returned only when `sentence_end` is `true`.
  - text `string` The recognized text for the current sentence.
  - channel_id `integer` The channel number, starting from 0.
  - words `array` The word-level timestamp list.
    - text `string` The word text.
    - begin_time `integer` The start time of the word, in milliseconds.
    - end_time `integer` The end time of the word, in milliseconds.
    - punctuation `string` The punctuation mark after the word. An empty string if there is no punctuation.
    - fixed `boolean` Indicates whether the word is finalized. `false` means the timestamp for this word may change in subsequent events.

usage `object`
The usage information. Returned only when `sentence_end` is `true`.
Properties:
- duration `integer` The processed audio duration, in seconds.
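As an illustration of the response fields above, the following sketch extracts the commonly used values from a non-streaming response body. The function name `summarize_response` and the shape of its return value are hypothetical; only the JSON field names come from this reference.

```python
import json


def summarize_response(body: str) -> dict:
    """Pull the main fields out of a non-streaming response body (JSON text)."""
    data = json.loads(body)
    sentence = data["output"]["sentence"]
    # Each word carries its own start and end time in milliseconds.
    words = [
        (w["text"], w["begin_time"], w["end_time"])
        for w in sentence.get("words", [])
    ]
    return {
        "request_id": data.get("request_id"),
        "text": data["output"]["text"],
        "sentence_complete": bool(sentence.get("sentence_end")),
        # end_time is present only once the sentence is complete.
        "span_ms": (sentence["begin_time"], sentence.get("end_time")),
        "words": words,
        # usage.duration is the processed audio duration in seconds.
        "duration_seconds": data.get("usage", {}).get("duration"),
    }
```

The `.get` lookups cover the fields that this reference describes as returned only when `sentence_end` is `true`.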

SSE streaming result processing

In streaming mode, note the following:

- For each SSE event received, parse the JSON in the `data` field.
- Check `output.sentence.sentence_end` to determine whether the current sentence is complete. When the value is `true`, recognition for the sentence is finished and the word-level timestamps are finalized. When the value is `false`, recognition is still in progress, and the text and timestamps may change in subsequent events.
- The `usage` object is returned only in sentence-end events. Use it to track the processed audio duration.
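The processing steps above can be sketched as a small parser that works on the lines of the response stream. This is a minimal sketch that assumes standard SSE framing, in which a blank line terminates each event and lines starting with `:` (such as `:HTTP_STATUS/200`) are comments. The function names `iter_sse_payloads` and `final_sentences` are hypothetical.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the parsed JSON from the data field of each SSE event."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Assumption: a blank line ends the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # SSE comment line, e.g. ":HTTP_STATUS/200"
        field, _, value = line.partition(":")
        if field == "data":
            # The SSE format allows one optional space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
    if data_parts:
        # Flush a trailing event that was not followed by a blank line.
        yield json.loads("\n".join(data_parts))


def final_sentences(payloads: Iterable[dict]) -> List[dict]:
    """Keep only sentence-end events, whose text and timestamps are final."""
    results = []
    for payload in payloads:
        sentence = payload["output"]["sentence"]
        if sentence.get("sentence_end"):
            results.append({
                "sentence_id": sentence["sentence_id"],
                "text": sentence["text"],
                "begin_time": sentence["begin_time"],
                "end_time": sentence["end_time"],
                # usage appears only on sentence-end events.
                "duration": payload.get("usage", {}).get("duration"),
            })
    return results
```

A caller that wants live captions can display `output.text` from every payload and treat only the entries returned by `final_sentences` as stable.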
