Fun-ASR HTTP API for audio file transcription


Use the Fun-ASR HTTP API to submit audio files for transcription and retrieve results through DashScope.

User guide: Non-real-time speech recognition. For supported audio formats, file size limits, and duration limits, see Audio specifications.

DashScope asynchronous calls (Fun-ASR)

How it works

Unlike synchronous calls that return results immediately, the asynchronous mode handles long audio files and time-consuming tasks. This mode uses a submit-then-poll workflow to avoid request timeouts during lengthy processing:

  1. Submit a task

    • The client sends an asynchronous processing request.

    • After validating the request, the server creates the task and returns a unique task_id. The task is not executed immediately; the task_id only confirms that the task was created successfully.

  2. Get the results

    • The client polls the query task endpoint with the returned task_id.

    • Once the task finishes, the query task endpoint returns the final transcription results.

Service endpoint

China (Beijing)

Submit task endpoint: POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/asr/transcription

Query task endpoint: GET https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/tasks/{task_id}

Singapore

Submit task endpoint: POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/audio/asr/transcription

Query task endpoint: GET https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/tasks/{task_id}

Replace WorkspaceId with your actual workspace ID.

Important

Model Studio has released workspace-specific domains for the China (Beijing) and Singapore regions. The new dedicated domains deliver superior performance and higher stability for inference requests. We recommend migrating to the new domains:

  • China (Beijing): from https://dashscope.aliyuncs.com to https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com

  • Singapore: from https://dashscope-intl.aliyuncs.com to https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com

{WorkspaceId} is your workspace ID, which can be found on the Workspace Details page in the Model Studio console. The existing domain remains fully functional.
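The endpoint patterns above can be assembled programmatically. The following is a minimal sketch, not an official helper: the function names (base_url, submit_url, query_url) and the region keys are illustrative assumptions, while the hosts and paths are copied from the endpoints listed in this section.

```python
# Hosts of the workspace-specific domains listed above.
REGION_HOSTS = {
    "cn-beijing": "cn-beijing.maas.aliyuncs.com",
    "ap-southeast-1": "ap-southeast-1.maas.aliyuncs.com",
}

SUBMIT_PATH = "/api/v1/services/audio/asr/transcription"
TASK_PATH = "/api/v1/tasks/{task_id}"


def base_url(workspace_id: str, region: str) -> str:
    """Return the workspace-specific base URL for a region (hypothetical helper)."""
    if not workspace_id:
        raise ValueError("workspace_id must not be empty")
    if region not in REGION_HOSTS:
        raise ValueError(f"unsupported region: {region!r}")
    return f"https://{workspace_id}.{REGION_HOSTS[region]}"


def submit_url(workspace_id: str, region: str) -> str:
    """URL of the submit task endpoint (POST)."""
    return base_url(workspace_id, region) + SUBMIT_PATH


def query_url(workspace_id: str, region: str, task_id: str) -> str:
    """URL of the query task endpoint (GET) for one task."""
    return base_url(workspace_id, region) + TASK_PATH.format(task_id=task_id)
```

Keeping the region-to-host mapping in one place makes it easier to switch regions without editing every request.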

Important

When submitting tasks through the new domain (https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com), the request body must include the parameters object. Even if you don't need to set any parameters, pass an empty object {}. Otherwise, the task submits successfully but transcription fails.

Request headers

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| Authorization | string | Yes | Authentication token in the format Bearer <your_api_key>. Replace <your_api_key> with your actual API key. Required for both the submit task and query task endpoints. |
| Content-Type | string | Yes | Media type of the request body. Required only for the submit task endpoint. Set to application/json. |
| X-DashScope-Async | string | Yes | Async task identifier. Required only for the submit task endpoint. Set to enable. Omitting this header causes task submission to fail. |

Submit task

Submits an audio transcription task. The endpoint returns immediately with a task_id; use the query task endpoint to poll for results.

Request body

The following URL is for the China (Beijing) region. URLs differ by region.

curl --location 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/asr/transcription' \
     --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
     --header "Content-Type: application/json" \
     --header "X-DashScope-Async: enable" \
     --data '{
    "model": "fun-asr",
    "input": {
        "file_urls": [
            "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
        ]
    },
    "parameters": {
        "channel_id": [0]
    }
}'
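The same request can be built in Python. This is a sketch under stated assumptions rather than an official client: build_submit_request and submit are hypothetical helper names, and only the standard library is used. The body always carries a parameters object (empty if nothing is set), which matches the note about the workspace-specific domain in this document.

```python
import json
import urllib.request


def build_submit_request(url, api_key, file_url, model="fun-asr", parameters=None):
    """Build (but do not send) the POST request for the submit task endpoint."""
    body = {
        "model": model,
        "input": {"file_urls": [file_url]},  # each request accepts one URL
        # Always include the parameters object, even when it is empty.
        "parameters": dict(parameters or {}),
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-DashScope-Async": "enable",  # required for task submission
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def submit(request, timeout=30):
    """Send the request and return the task_id from the response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["output"]["task_id"]


# Usage (requires network access and a valid key):
#   import os
#   req = build_submit_request(
#       "https://<your-workspace-id>.cn-beijing.maas.aliyuncs.com"
#       "/api/v1/services/audio/asr/transcription",
#       os.environ["DASHSCOPE_API_KEY"],
#       "https://example.com/audio/sample.wav",
#       parameters={"channel_id": [0]},
#   )
#   task_id = submit(req)
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.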

model string (Required)

The model name for audio and video file transcription.

Valid values:

  • fun-asr

  • fun-asr-2025-11-07

  • fun-asr-2025-08-25

  • fun-asr-mtl

  • fun-asr-mtl-2025-08-25

input object (Required)

The input parameter object.

Properties

file_urls array[string] (Required)

A list of audio or video file URLs to transcribe. Supports HTTP and HTTPS. Each request accepts only one URL. For supported audio formats, file size limits, and duration limits, see Audio specifications.

If your audio files are stored in Alibaba Cloud OSS, the RESTful API also supports temporary URLs prefixed with oss://.

Important
  • Temporary URLs expire after 48 hours and can't be used afterward. Don't use them in production environments.

  • The file upload credential API is throttled at 100 QPS and can't be scaled up. Don't use it in production, high-concurrency, or load testing scenarios.

  • In production, store files in stable storage such as Alibaba Cloud OSS to ensure long-term availability and avoid throttling.

  • If the OSS temporary public URL can't be accessed, set X-DashScope-OssResourceResolve to enable in the Request headers (not recommended).

    The Java SDK and Python SDK don't support custom request headers.

parameters object (Optional)

The request parameter object.

Important

When using the new domain (https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com), parameters is required. Even if you don't need to set any parameters, pass an empty object {}. If you omit this field, the task submits successfully but the query results endpoint returns a transcription failure.

Properties

vocabulary_id string (Optional)

The custom vocabulary ID. When set, the custom vocabulary associated with this ID takes effect during transcription. Disabled by default. For usage instructions, see Custom hotwords.

channel_id array[integer] (Optional)

Specifies the audio track indexes to transcribe in a multi-track audio file. Indexes start from 0. For example, [0] transcribes the first track, and [0, 1] transcribes both the first and second tracks. If omitted, only the first track is processed.

Important

Each specified track is billed independently. For example, requesting [0, 1] for a single file incurs two separate charges.

Default: [0].

special_word_filter string (Optional)

Specifies sensitive words to handle during transcription, with different processing methods for different words. For details, see Sensitive word filter.

diarization_enabled boolean (Optional)

Whether to enable speaker diarization. Disabled by default.

Applies only to mono audio. Multi-channel audio doesn't support speaker diarization.

When enabled, the transcription results include a speaker_id field to distinguish different speakers.

Note

If speaker diarization is enabled, keep the audio duration under 2 hours. Longer audio may cause transcription failures or timeouts.

For a speaker_id example, see Transcription result description.

Default: false.

speaker_count integer (Optional)

Important

Takes effect only when speaker diarization is enabled (diarization_enabled set to true).

A hint for the expected number of speakers. Valid values: integers from 2 to 100, inclusive.

By default, the system detects the speaker count automatically. Providing this value guides the algorithm toward the specified count but doesn't guarantee that exact number in the output.

language_hints array[string] (Optional)

The language of the audio to transcribe. If the language is unknown, omit this parameter. The model detects the language automatically.

Only the first value in the array is read. Additional values are ignored.

Supported language codes:

  • fun-asr, fun-asr-2025-11-07, fun-asr-mtl, fun-asr-mtl-2025-08-25:

    • zh: Chinese

    • en: English

    • ja: Japanese

    • ko: Korean

    • vi: Vietnamese

    • th: Thai

    • id: Indonesian

    • ms: Malay

    • tl: Filipino

    • hi: Hindi

    • ar: Arabic

    • fr: French

    • de: German

    • es: Spanish

    • pt: Portuguese

    • ru: Russian

    • it: Italian

    • nl: Dutch

    • sv: Swedish

    • da: Danish

    • fi: Finnish

    • no: Norwegian

    • el: Greek

    • pl: Polish

    • cs: Czech

    • hu: Hungarian

    • ro: Romanian

    • bg: Bulgarian

    • hr: Croatian

    • sk: Slovak

  • fun-asr-2025-08-25:

    • zh: Chinese

    • en: English
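The constraints on the parameters object described above can be checked on the client before a task is submitted. The sketch below is illustrative only: build_parameters is a hypothetical helper, and the checks simply restate the rules in this section (speaker_count needs diarization_enabled and a value from 2 to 100; only the first language hint is read).

```python
def build_parameters(channel_id=None, diarization_enabled=False,
                     speaker_count=None, language_hints=None,
                     vocabulary_id=None):
    """Assemble the optional parameters object, validating documented rules."""
    params = {}
    if channel_id is not None:
        if any(not isinstance(i, int) or i < 0 for i in channel_id):
            raise ValueError("channel_id must contain track indexes starting from 0")
        params["channel_id"] = list(channel_id)
    if diarization_enabled:
        params["diarization_enabled"] = True
    if speaker_count is not None:
        if not diarization_enabled:
            raise ValueError("speaker_count takes effect only with diarization_enabled")
        if not 2 <= speaker_count <= 100:
            raise ValueError("speaker_count must be between 2 and 100")
        params["speaker_count"] = speaker_count
    if language_hints:
        # Only the first value is read by the service, so send just that one.
        params["language_hints"] = [language_hints[0]]
    if vocabulary_id is not None:
        params["vocabulary_id"] = vocabulary_id
    return params
```

Calling build_parameters() with no arguments returns an empty dictionary, which is the value to send when no parameters are needed.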

Response body

{
  "output": {
    "task_status": "PENDING",
    "task_id": "c2e5d63b-96e1-4607-bb91-************"
  },
  "request_id": "77ae55ae-be17-97b8-9942--************"
}

request_id string

The unique identifier of this request.

output object

The data returned when submitting a task.

Properties

task_id string

The task ID. Pass this ID as the task_id path parameter of the query task endpoint.

task_status string

The task status. Returns PENDING when the task is submitted successfully.

Query task

Returns the status and results of an audio transcription task. Poll this endpoint until the task reaches a terminal state.

Request body

The following URL is for the China (Beijing) region. URLs differ by region.

curl --location 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/tasks/{task_id}' \
     --header "Authorization: Bearer $DASHSCOPE_API_KEY"

task_id string (Required)

Important

URL path parameter. No request body.

The ID of the task to query. Use the task_id value returned by the submit task endpoint.
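Polling can be sketched as a small loop around the query task endpoint. The code below is an illustration under assumptions, not a definitive client: wait_for_task and make_http_fetch are hypothetical names, the polling interval and timeout are arbitrary defaults, and treating SUCCEEDED and FAILED as the terminal states is an assumption, since this document does not list every possible task_status value.

```python
import json
import time
import urllib.request

# Assumption: these two values are treated as terminal task states.
TERMINAL_STATES = frozenset({"SUCCEEDED", "FAILED"})


def make_http_fetch(query_url_template, api_key):
    """Return a fetch(task_id) function that calls the query task endpoint.

    query_url_template is a full URL containing the literal text {task_id}.
    """
    def fetch(task_id):
        request = urllib.request.Request(
            query_url_template.format(task_id=task_id),
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(request, timeout=30) as resp:
            return json.load(resp)
    return fetch


def wait_for_task(fetch, task_id, interval=2.0, timeout=600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll fetch(task_id) until task_status is terminal or timeout elapses."""
    deadline = clock() + timeout
    while True:
        response = fetch(task_id)
        if response["output"]["task_status"] in TERMINAL_STATES:
            return response
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout} s")
        sleep(interval)
```

Passing fetch, sleep, and clock as arguments keeps the loop testable without network access; in real use, build fetch with make_http_fetch and keep the default sleep and clock.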

Response body

Success example

{
  "request_id": "f9e1afad-94d3-997e-a83b-************",
  "output": {
    "task_id": "f86ec806-4d73-485f-a24f-************",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-09-12 15:11:40.041",
    "scheduled_time": "2024-09-12 15:11:40.071",
    "end_time": "2024-09-12 15:11:40.903",
    "results": [
      {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
        "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/pre/filetrans-16k/20240912/15%3A11/409a4b92-445b-4dd8-8c1d-f110954d82d8-1.json?Expires=1726211500&OSSAccessKeyId=yourOSSAccessKeyId&Signature=v5Owy5qoAfT7mzGmQgH0g8C****%3D",
        "subtask_status": "SUCCEEDED"
      }
    ],
    "task_metrics": {
      "TOTAL": 1,
      "SUCCEEDED": 1,
      "FAILED": 0
    }
  },
  "usage": {
    "duration": 9
  }
}

Error example

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 1,
        "SUCCEEDED": 0,
        "FAILED": 1
    }
}

request_id string

The unique identifier of this request.

output object

The data returned when querying a task.

Properties

task_id string

The ID of the queried task.

task_status string

The status of the queried task.

Note

When a task contains multiple subtasks, the overall task status is marked as SUCCEEDED as long as any subtask succeeds. Check the subtask_status field to determine individual subtask results.

submit_time string

The time when the task was submitted.

scheduled_time string

The time when the task was scheduled for execution.

end_time string

The time when the task finished.

results array[object]

The list of subtask results, one for each audio file to transcribe.

Properties

subtask_status string

The subtask status.

file_url string

The URL of the file processed in this transcription subtask.

transcription_url string

The URL to download the transcription results. This URL is valid for 24 hours. Once expired, previously returned URLs can no longer be used to query the task or download results.

The transcription results are stored as a JSON file. You can download this file through the URL or read its content directly through an HTTP request. For the meaning of each field in the JSON data, see Transcription result description.

code string

Important

Returned only when the subtask fails.

The error code of the failed subtask.

message string

Important

Returned only when the subtask fails.

The error message of the failed subtask.

task_metrics object

Overall task execution statistics.

Properties

TOTAL integer

The total number of subtasks.

SUCCEEDED integer

The number of succeeded subtasks.

FAILED integer

The number of failed subtasks.
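Because the overall task_status can be SUCCEEDED while an individual subtask has failed, a client typically inspects each entry in results. A minimal sketch, assuming the response shapes shown in the examples above (split_results is a hypothetical helper name; it accepts a response with or without the output wrapper because the two examples differ on that point):

```python
def split_results(query_response):
    """Separate subtask results into download URLs and failure details."""
    # The success example nests fields under "output"; the error example
    # shows them at the top level, so both shapes are accepted here.
    output = query_response.get("output", query_response)
    succeeded, failed = [], []
    for item in output.get("results", []):
        if item.get("subtask_status") == "SUCCEEDED":
            succeeded.append(item["transcription_url"])
        else:
            failed.append({
                "file_url": item.get("file_url"),
                "code": item.get("code"),        # present only on failure
                "message": item.get("message"),  # present only on failure
            })
    return succeeded, failed
```

The returned URLs can then be fetched with a plain HTTP GET while they are still valid.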

Other APIs: batch query task status and cancel tasks

For details, see Manage asynchronous tasks. These APIs support batch querying of audio file transcription tasks submitted within the last 24 hours and canceling tasks that are in the PENDING (queued) state.

Transcription result description

The recognition result is saved as a JSON file.

Recognition result example:

{
    "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties":{
        "audio_format":"pcm_s16le",
        "channels":[
            0
        ],
        "original_sampling_rate":16000,
        "original_duration_in_milliseconds":3834
    },
    "transcripts":[
        {
            "channel_id":0,
            "content_duration_in_milliseconds":3720,
            "text":"Hello world, this is Alibaba Speech Lab.",
            "sentences":[
                {
                    "begin_time":100,
                    "end_time":3820,
                    "text":"Hello world, this is Alibaba Speech Lab.",
                    "sentence_id":1,
                    "speaker_id":0, // This field is displayed only when automatic speaker diarization is enabled.
                    "words":[
                        {
                            "begin_time":100,
                            "end_time":596,
                            "text":"Hello ",
                            "punctuation":""
                        },
                        {
                            "begin_time":596,
                            "end_time":844,
                            "text":"world",
                            "punctuation":", "
                        }
                        // Other content is omitted here.
                    ]
                }
            ]
        }
    ]
}

The key parameters are as follows:

| Parameter | Type | Description |
| --- | --- | --- |
| audio_format | string | The format of the audio in the source file. |
| channels | array[integer] | The audio track indexes in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on. |
| original_sampling_rate | integer | The sample rate of the audio in the source file (Hz). |
| original_duration_in_milliseconds | integer | The original duration of the audio in the source file (ms). |
| channel_id | integer | The index of the transcribed audio track, starting from 0. |
| content_duration_in_milliseconds | integer | The duration of the content in the audio track that is identified as speech (ms). Important: Billing is based on speech content duration only (non-speech parts are not metered). Speech duration is typically shorter than total audio duration. Because speech is detected by an AI model, the measured duration may deviate slightly. |
| transcript | string | The paragraph-level speech transcription result. |
| sentences | array | The sentence-level speech transcription result. |
| words | array | The word-level speech transcription result. |
| begin_time | integer | Start timestamp (ms). |
| end_time | integer | End timestamp (ms). |
| text | string | The speech transcription result. |
| speaker_id | integer | The index of the current speaker, starting from 0. Used to distinguish different speakers. Returned only when speaker diarization is enabled. |
| punctuation | string | The predicted punctuation mark after the word, if any. |
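The result fields described above can be flattened into readable lines once the JSON file has been downloaded and parsed. The following is a minimal sketch with hypothetical helper names (format_ms, render_sentences); it assumes the file has already been loaded into a dictionary with the structure of the example above.

```python
def format_ms(ms):
    """Format a millisecond timestamp as MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def render_sentences(result):
    """Return one display line per sentence across all transcribed tracks."""
    lines = []
    for transcript in result.get("transcripts", []):
        channel = transcript["channel_id"]
        for sentence in transcript.get("sentences", []):
            # speaker_id exists only when speaker diarization was enabled.
            speaker = sentence.get("speaker_id")
            label = f"ch{channel}" if speaker is None else f"ch{channel} speaker {speaker}"
            lines.append(
                f"[{format_ms(sentence['begin_time'])} - "
                f"{format_ms(sentence['end_time'])}] {label}: {sentence['text']}"
            )
    return lines
```

For the example result above, the single sentence would render with the time range 00:00.100 to 00:03.820 on channel 0.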

DashScope synchronous calls (Fun-ASR-Flash)

Important

SDK calls aren't supported for this feature.

Endpoint

China (Beijing)

POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Singapore

POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Replace WorkspaceId with your actual Workspace ID.

Request headers

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| Authorization | string | Yes | Authentication token in the format Bearer <your_api_key>. Replace <your_api_key> with your actual API key. |
| Content-Type | string | Yes | Media type of the request body. Set to application/json. |
| X-DashScope-SSE | string | Yes | Controls whether results are returned as an SSE stream. Set to enable to receive intermediate and final results incrementally. Set to disable or omit this header to receive only the final result. |

Request body

The following URL is for the China (Beijing) region. URLs differ by region.

Non-streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
     --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
     --header "Content-Type: application/json" \
     --header "X-DashScope-SSE: disable" \
     --data '{
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
                        }
                    }
                ]
            }
        ]
    },
    "parameters": {
        "format": "wav",
        "sample_rate": "16000"
    }
}'

Streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
     --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
     --header "Content-Type: application/json" \
     --header "X-DashScope-SSE: enable" \
     --data '{
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
                        }
                    }
                ]
            }
        ]
    },
    "parameters": {
        "format": "wav",
        "sample_rate": "16000"
    }
}'

With context - Non-streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
     --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
     --header "Content-Type: application/json" \
     --header "X-DashScope-SSE: disable" \
     --data '{
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": "你好啊,我是通义千问,有什么可以帮助你的?"
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": "你好啊"
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
                        }
                    }
                ]
            }
        ]
    },
    "parameters": {
        "format": "wav",
        "sample_rate": "16000"
    }
}'

With context - Streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
     --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
     --header "Content-Type: application/json" \
     --header "X-DashScope-SSE: enable" \
     --data '{
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": "你好啊,我是通义千问,有什么可以帮助你的?"
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": "你好啊"
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
                        }
                    }
                ]
            }
        ]
    },
    "parameters": {
        "format": "wav",
        "sample_rate": "16000"
    }
}'

Base64

You can pass Base64-encoded data (Data URI) in the format: data:<mediatype>;base64,<data>.

  • <mediatype>: MIME type

    Varies by audio format. Examples:

    • WAV: audio/wav

    • MP3: audio/mpeg

  • <data>: Base64-encoded string of the audio file

    Base64 encoding increases the payload size. Keep the original file small enough so the encoded data stays within the 10 MB input audio size limit.

  • Example: data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9

    Sample code (Python):

    import base64
    import pathlib

    # input.mp3 is the local audio file to transcribe. Replace it with your own file path and make sure the file meets the audio requirements.
    file_path = pathlib.Path("input.mp3")
    base64_str = base64.b64encode(file_path.read_bytes()).decode()
    data_uri = f"data:audio/mpeg;base64,{base64_str}"

    Sample code (Java):

    import java.nio.file.*;
    import java.util.Base64;

    public class Main {
        /**
         * filePath is the local audio file to transcribe. Replace it with your own file path and make sure the file meets the audio requirements.
         */
        public static String toDataUrl(String filePath) throws Exception {
            byte[] bytes = Files.readAllBytes(Paths.get(filePath));
            String encoded = Base64.getEncoder().encodeToString(bytes);
            return "data:audio/mpeg;base64," + encoded;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(toDataUrl("input.mp3"));
        }
    }

Complete request example with Base64 input (Python):

import base64
import pathlib
import os
import requests

# input.wav is the local audio file to transcribe. Replace with your own file path and make sure it meets the audio requirements.
file_path = pathlib.Path("input.wav")
base64_str = base64.b64encode(file_path.read_bytes()).decode()
data_uri = f"data:audio/wav;base64,{base64_str}"

url = "https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation"

headers = {
    "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
    "Content-Type": "application/json",
    "X-DashScope-SSE": "disable",
}

payload = {
    "model": "fun-asr-flash-2026-06-15",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": data_uri,
                        },
                    }
                ],
            }
        ]
    },
    "parameters": {
        "format": "wav",
        "sample_rate": "16000",
    },
}

response = requests.post(url, headers=headers, json=payload)
print(response.status_code)
print(response.json())
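Because Base64 encoding grows the payload, it can help to check the encoded size before sending. The sketch below relies on the standard property that padded Base64 produces 4 output characters for every 3 input bytes. The helper names are hypothetical, and two points are assumptions to verify: that 10 MB means 10 × 1024 × 1024 bytes, and that the limit applies to the encoded data rather than to the whole Data URI including its prefix.

```python
# Assumption: the 10 MB limit is interpreted as 10 * 1024 * 1024 bytes.
MAX_ENCODED_BYTES = 10 * 1024 * 1024


def base64_length(raw_size):
    """Length of the padded Base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def max_raw_size(limit=MAX_ENCODED_BYTES):
    """Largest raw file size whose Base64 encoding fits within limit."""
    return (limit // 4) * 3


def fits_limit(raw_size, limit=MAX_ENCODED_BYTES):
    """Return True if a file of raw_size bytes stays within limit once encoded."""
    return base64_length(raw_size) <= limit
```

Under these assumptions the original file can be roughly three quarters of the limit, so larger files are better passed as a URL.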

model string (Required)

Model name. Set to fun-asr-flash-2026-06-15.

input object (Required)

Input information.

Properties

messages array(object) (Required)

Message list. Contains the audio to transcribe, and optionally conversation context to improve recognition accuracy.

Important

Use this feature when combining ASR with a large language model (LLM): pass the user and assistant messages from previous turns as context to improve recognition accuracy for the current turn. The user and assistant messages must appear in pairs, and roles must be set correctly. Incorrect roles degrade recognition quality.

Properties

role string (Required)

Message role. Valid values:

  • user (Required): A user message. When type is input_audio, this represents the audio to transcribe. When type is input_text, this represents a previous turn's transcription result (optional, for context).

  • assistant (Optional, for context): An LLM response from a previous turn. Must be paired with a user (input_text) message.

content array(object) (Required)

Message content list.

Properties

type string (Required)

Content type. Each request must include at least one input_audio message. Valid values:

  • input_audio (Required): Audio input for the current transcription (role must be user). Include the input_audio object.

  • input_text (Optional, for context): A previous turn's transcription result (role must be user). Include the text field.

  • text (Optional, for context): An LLM response from a previous turn (role must be assistant). Include the text field.

input_audio object (Conditionally required)

Required when type is input_audio.

Properties

data string (Required)

Audio data for transcription. For supported audio formats, file size limits, and duration limits, see Audio specifications. Two formats are supported:

  • Audio file URL: A publicly accessible URL pointing to the audio file.

  • Base64 Data URI: A Data URI containing Base64-encoded audio data, formed by concatenating the data:{MIME_TYPE};base64, prefix with the Base64-encoded audio data. Supported MIME types include audio/wav and audio/mp3.

Example (URL): https://example.com/audio/sample.wav

Example (Base64): data:audio/wav;base64,{BASE64_ENCODED_DATA}

text string (Conditionally required)

When type is input_text, pass the transcription result from a previous turn. When type is text, pass the LLM response from a previous turn.

parameters object (Required)

Model parameters.

Properties

format string (Required)

Audio format. Set this to the actual format of your audio file, such as wav, mp3, or opus. For details, see Audio specifications.

sample_rate string (Optional)

Audio sample rate in Hz. For example, 16000 indicates a 16 kHz sample rate. For details, see Audio specifications.
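The message rules above (paired context messages plus exactly one audio message to transcribe) can be captured in a small builder. This is a sketch with hypothetical helper names (build_messages, build_flash_payload). It emits each context pair in chronological order, user text first and then the assistant reply; the curl example in this document lists the assistant message first, so treat the ordering within a pair as an assumption to verify.

```python
def build_messages(audio, history=()):
    """Build the messages list: optional context pairs, then the audio.

    audio is an audio file URL or a Base64 Data URI.
    history is an iterable of (user_text, assistant_text) pairs.
    """
    messages = []
    for user_text, assistant_text in history:
        # Previous transcription result: role user, type input_text.
        messages.append({"role": "user",
                         "content": [{"type": "input_text", "text": user_text}]})
        # Previous LLM response: role assistant, type text.
        messages.append({"role": "assistant",
                         "content": [{"type": "text", "text": assistant_text}]})
    # The audio to transcribe: role user, type input_audio.
    messages.append({"role": "user",
                     "content": [{"type": "input_audio",
                                  "input_audio": {"data": audio}}]})
    return messages


def build_flash_payload(audio, audio_format, sample_rate=None, history=(),
                        model="fun-asr-flash-2026-06-15"):
    """Build the request body for a synchronous call."""
    parameters = {"format": audio_format}  # format is required
    if sample_rate is not None:
        parameters["sample_rate"] = str(sample_rate)  # documented as a string
    return {
        "model": model,
        "input": {"messages": build_messages(audio, history)},
        "parameters": parameters,
    }
```

Taking the history as pairs makes it impossible to send an unpaired context message by accident.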

Response body

Non-streaming

{
    "output": {
        "sentence": {
            "begin_time": 760,
            "channel_id": 0,
            "end_time": 3800,
            "sentence_end": true,
            "sentence_id": 1,
            "text": "Hello world, this is Alibaba Speech Lab.",
            "words": [
                {"begin_time": 760, "end_time": 1040, "fixed": true, "punctuation": "", "text": "Hello"},
                {"begin_time": 1040, "end_time": 1240, "fixed": true, "punctuation": ",", "text": " world"},
                {"begin_time": 1360, "end_time": 1880, "fixed": true, "punctuation": "", "text": "this is"},
                {"begin_time": 1880, "end_time": 2520, "fixed": true, "punctuation": "", "text": "Alibaba"},
                {"begin_time": 2520, "end_time": 2840, "fixed": true, "punctuation": "", "text": "Speech"},
                {"begin_time": 2840, "end_time": 3800, "fixed": true, "punctuation": "。", "text": "Lab"}
            ]
        },
        "text": "Hello world, this is Alibaba Speech Lab."
    },
    "usage": {
        "duration": 4
    },
    "request_id": "40e0734d-096f-9ae3-86c1-a8c013287561"
}

Streaming

When X-DashScope-SSE: enable is set, the server returns results using the Server-Sent Events (SSE) protocol. Each SSE event has the following format:

id:{sequence_number}
event:result
:HTTP_STATUS/200
data:{JSON data}

Sample response:

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"sentence_id":1,"sentence_end":true,"end_time":3800,"words":[{"end_time":1040,"punctuation":"","begin_time":760,"fixed":true,"text":"Hello"},{"end_time":1240,"punctuation":",","begin_time":1040,"fixed":true,"text":" World"},{"end_time":1880,"punctuation":"","begin_time":1360,"fixed":true,"text":"this is"},{"end_time":2520,"punctuation":"","begin_time":1880,"fixed":true,"text":"Alibaba"},{"end_time":2840,"punctuation":"","begin_time":2520,"fixed":true,"text":"Speech"},{"end_time":3800,"punctuation":"。","begin_time":2840,"fixed":true,"text":"Lab"}],"begin_time":760,"text":"Hello world, this is Alibaba Speech Lab","channel_id":0},"text":"Hello World, this is Alibaba Speech Lab."},"usage":{"duration":4},"request_id":"fc1582e4-935c-9fc2-a482-a98bf43daa69"}

request_id string

Unique identifier for this request.

output object

Output result.

Properties

text string

Accumulated full transcription text up to this point.

sentence object

Details of the current sentence.

Properties

sentence_id integer

Sentence number, starting from 1.

sentence_end boolean

Whether this is the final result for the sentence. When true, recognition for this sentence is complete.

begin_time integer

Start time of the sentence in milliseconds.

end_time integer

End time of the sentence in milliseconds. Returned only when sentence_end is true.

text string

Transcription text for the current sentence.

channel_id integer

Audio channel number, starting from 0.

words array

Word-level timestamp list.

Properties

text string

Word text.

begin_time integer

Start time of the word in milliseconds.

end_time integer

End time of the word in milliseconds.

punctuation string

Punctuation mark after the word. Empty string if none.

fixed boolean

Whether the word is finalized. false means the timestamp may be adjusted in subsequent events.

usage object

Usage information. Returned only when sentence_end is true.

Properties

duration integer

Duration of processed audio in seconds.

SSE streaming result processing

In streaming mode, note the following:

  1. For each SSE event received, parse the JSON in the data field.

  2. Check output.sentence.sentence_end to determine whether the current sentence is complete: when the value is true, recognition for the sentence is finished and the word-level timestamps are finalized; when the value is false, recognition is still in progress and the text and timestamps may change in subsequent events.

  3. The usage object is returned only in sentence-end events (sentence_end is true). Use it to track the processed audio duration.
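The steps above can be sketched as a small parser over the lines of the event stream. This is an illustrative reading of the event format shown in this section, not an official client: parse_sse and collect_final_sentences are hypothetical names, events are assumed to be separated by blank lines as in standard SSE, and lines that start with a colon (such as :HTTP_STATUS/200) are treated as comments.

```python
import json


def parse_sse(lines):
    """Yield (event_id, event_name, data) for each event in an iterable of lines."""
    event_id, event_name, data_parts = None, None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_parts:
                yield event_id, event_name, json.loads("\n".join(data_parts))
            event_id, event_name, data_parts = None, None, []
            continue
        if line.startswith(":"):
            continue  # comment line, for example :HTTP_STATUS/200
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # SSE allows one optional space after the colon
        if field == "id":
            event_id = value
        elif field == "event":
            event_name = value
        elif field == "data":
            data_parts.append(value)
    if data_parts:  # stream ended without a trailing blank line
        yield event_id, event_name, json.loads("\n".join(data_parts))


def collect_final_sentences(lines):
    """Return (finished sentence texts, last reported usage duration)."""
    sentences, duration = [], None
    for _event_id, _event_name, data in parse_sse(lines):
        sentence = data.get("output", {}).get("sentence", {})
        if sentence.get("sentence_end"):
            sentences.append(sentence.get("text", ""))
            duration = data.get("usage", {}).get("duration", duration)
    return sentences, duration
```

Intermediate events (sentence_end is false) are skipped here; a live caption display would instead show their text and replace it when the final event for the sentence arrives.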

DashScope synchronous calls (Fun-ASR-Realtime)

Important
  • This feature is available only in the Beijing region.

  • SDK calls aren't supported.

Endpoint

POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Request headers

Authorization string (Required)

Authentication token in the format Bearer <your_api_key>. Replace <your_api_key> with your actual API key.

Content-Type string (Required)

Media type of the request body. Set to application/json.

X-DashScope-SSE string (optional)

Controls whether results are returned as an SSE stream. Set to enable to turn on SSE streaming, which returns intermediate recognition results and final results in multiple responses. Set to disable or omit this header to return only the final result.

Request body

Non-streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
     --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
     --header "Content-Type: application/json" \
     --header "X-DashScope-SSE: disable" \
     --data '{
    "model": "fun-asr-realtime",
    "input": {
        "messages": []
    },
    "parameters": {
        "audio_address": "https://example.com/audio/sample.mp3",
        "format": "mp3"
    },
    "resources": []
}'

Streaming

curl --location --request POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
     --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
     --header "Content-Type: application/json" \
     --header "X-DashScope-SSE: enable" \
     --data '{
    "model": "fun-asr-realtime",
    "input": {
        "messages": []
    },
    "parameters": {
        "audio_address": "https://example.com/audio/sample.mp3",
        "format": "mp3"
    },
    "resources": []
}'

model string (Required)

The model name.

Valid values:

  • fun-asr-realtime (stable version)

  • fun-asr-realtime-2026-02-28 (Fun-Realtime-ASR-preview)

input object (Conditionally Required)

The input information. Provide the audio either as Base64 through this parameter or as a URL through parameters.audio_address. When you use the URL approach, pass an empty messages array, as in the request examples above.

Properties

messages array(object) (Required)

The message list.

Properties

content array(object) (Required)

The content of the user message. Only one message group is allowed.

Properties

audio string (Required)

The audio to transcribe, provided as a Base64-encoded Data URI. For supported audio formats, file size limits, and duration limits, see Audio specifications.

Required when uploading audio as Base64.

The value is a data:{MIME_TYPE};base64, prefix followed by the Base64-encoded audio data. Supported MIME types include audio/wav and audio/mp3.

Example: data:audio/wav;base64,{BASE64_ENCODED_DATA}

role string (Required)

The role of the user message. Set to user. Required when uploading audio as Base64.
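For the Base64 approach, the input object can be assembled as in the following Python sketch. The helper names audio_data_uri and build_input are hypothetical, and the nesting shown (one user message whose content array holds one audio item) is an assumption inferred from the property list above rather than a confirmed schema.

```python
import base64


def audio_data_uri(audio_bytes, mime_type="audio/wav"):
    """Encode raw audio bytes as a data:{MIME_TYPE};base64,... URI."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_input(audio_bytes, mime_type="audio/wav"):
    """Build the input object for a Base64 upload (assumed nesting)."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [{"audio": audio_data_uri(audio_bytes, mime_type)}],
            }
        ]
    }


# Placeholder bytes stand in for the contents of a real audio file.
payload = build_input(b"abc")
print(payload["messages"][0]["content"][0]["audio"])
```

In practice the bytes would come from reading the audio file in binary mode, and mime_type should match the actual file format.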

parameters object (Required)

The model parameters.

Properties

audio_address string (Conditionally Required)

The URL of the audio file. Provide the audio either through this parameter or as Base64 through input.messages. The URL must be publicly accessible. For supported audio formats, file size limits, and duration limits, see Audio specifications.

format string (Required)

The audio format. Set this to the actual format of the audio file. Supported formats include wav, mp3, and opus. For more information, see Audio specifications.

vad_enabled boolean (optional)

Specifies whether to enable Voice Activity Detection (VAD).

Defaults to true.

Set to false to disable VAD. However, if the audio duration exceeds 1 minute, the system automatically enables VAD regardless of this setting.

resources array (optional)

The resource list. This is a reserved field. Pass an empty array [].
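The URL-based request from the curl examples can also be prepared in Python. The sketch below uses only the standard library and builds the request without sending it. The build_request helper and the placeholder values ws-demo and sk-placeholder are assumptions for illustration; the endpoint, headers, and body fields are the ones documented above.

```python
import json
import urllib.request


def build_request(workspace_id, api_key, audio_url, audio_format,
                  stream=False, vad_enabled=True):
    """Prepare (but do not send) a recognition request for an audio URL."""
    url = (
        f"https://{workspace_id}.cn-beijing.maas.aliyuncs.com"
        "/api/v1/services/aigc/multimodal-generation/generation"
    )
    body = {
        "model": "fun-asr-realtime",
        "input": {"messages": []},  # empty for the URL approach
        "parameters": {
            "audio_address": audio_url,
            "format": audio_format,
            "vad_enabled": vad_enabled,
        },
        "resources": [],  # reserved field
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-DashScope-SSE": "enable" if stream else "disable",
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


req = build_request(
    "ws-demo", "sk-placeholder",
    "https://example.com/audio/sample.mp3", "mp3", stream=True,
)
print(req.get_method(), req.full_url)
# Passing req to urllib.request.urlopen would send it; that step is omitted here.
```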

Response body

Non-streaming

{
    "output": {
        "sentence": {
            "begin_time": 160,
            "channel_id": 0,
            "end_time": 1680,
            "sentence_end": true,
            "sentence_id": 1,
            "text": "Welcome to Alibaba Cloud.",
            "words": [
                {"begin_time": 160, "end_time": 520, "fixed": true, "punctuation": "", "text": "Welcome"},
                {"begin_time": 520, "end_time": 880, "fixed": true, "punctuation": "", "text": "to"},
                {"begin_time": 880, "end_time": 1280, "fixed": true, "punctuation": "", "text": "Alibaba"},
                {"begin_time": 1280, "end_time": 1680, "fixed": true, "punctuation": "。", "text": "Cloud"}
            ]
        },
        "text": "Welcome to Alibaba Cloud."
    },
    "usage": {
        "duration": 2
    },
    "request_id": "eff4c092-2289-9b43-a4cd-80e591fa90f5"
}
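Once the non-streaming response has been received, the transcript and word timings can be read directly from the output object. The following Python sketch assumes the response has the shape shown above; the word_timeline helper and the shortened sample are illustrative only.

```python
import json


def word_timeline(response):
    """Return (text, begin_seconds, end_seconds) for each recognized word."""
    words = response["output"]["sentence"]["words"]
    return [
        # Timestamps are reported in milliseconds; convert them to seconds.
        (w["text"] + w["punctuation"], w["begin_time"] / 1000, w["end_time"] / 1000)
        for w in words
    ]


# Shortened sample with the same structure as the response above.
response = json.loads("""
{
    "output": {
        "sentence": {
            "begin_time": 160, "channel_id": 0, "end_time": 880,
            "sentence_end": true, "sentence_id": 1, "text": "Welcome to",
            "words": [
                {"begin_time": 160, "end_time": 520, "fixed": true, "punctuation": "", "text": "Welcome"},
                {"begin_time": 520, "end_time": 880, "fixed": true, "punctuation": "", "text": "to"}
            ]
        },
        "text": "Welcome to"
    },
    "usage": {"duration": 1},
    "request_id": "demo"
}
""")

print(response["output"]["text"])
for text, begin, end in word_timeline(response):
    print(f"{begin:.2f}s to {end:.2f}s: {text}")
```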

Streaming

When X-DashScope-SSE: enable is set, the server returns intermediate and final recognition results as Server-Sent Events (SSE). Each SSE event follows this format:

id:{sequence_number}
event:result
:HTTP_STATUS/200
data:{JSON data}
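A raw SSE response in this format can be split into events with a few lines of Python. The sketch below assumes that events are separated by blank lines and that each event carries a single data line, which matches the examples in this section but is not stated as a guarantee. The parse_sse helper is hypothetical.

```python
import json


def parse_sse(text):
    """Split raw SSE text into events and decode each data field as JSON."""
    events = []
    normalized = text.replace("\r\n", "\n").strip()
    for block in normalized.split("\n\n"):
        event = {}
        for line in block.splitlines():
            line = line.strip()
            if not line or line.startswith(":"):
                continue  # blank line, or a comment such as ":HTTP_STATUS/200"
            # Split on the first colon only, since the JSON value contains colons.
            field, _, value = line.partition(":")
            event[field] = value.strip()
        if "data" in event:
            event["data"] = json.loads(event["data"])
            events.append(event)
    return events


raw = (
    'id:1\nevent:result\n:HTTP_STATUS/200\ndata:{"output":{"text":""}}\n\n'
    'id:2\nevent:result\n:HTTP_STATUS/200\ndata:{"output":{"text":"Welcome"}}\n'
)
for event in parse_sse(raw):
    print(event["id"], event["event"], event["data"]["output"]["text"])
```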

Intermediate results (sentence start, progressive word additions):

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"sentence_id":1,"sentence_end":false,"sentence_begin":true,"words":[],"begin_time":0,"text":"","channel_id":0},"text":""},"request_id":"372d19b3-993f-9288-adf0-a99f7606bd30"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"words":[{"end_time":520,"punctuation":"","begin_time":160,"fixed":false,"text":"Welcome"}],"begin_time":160,"text":"Welcome","channel_id":0,"sentence_id":1,"sentence_end":false},"text":"Welcome"},"request_id":"372d19b3-993f-9288-adf0-a99f7606bd30"}

id:3
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"words":[{"end_time":520,"punctuation":"","begin_time":160,"fixed":false,"text":"Welcome"},{"end_time":880,"punctuation":"","begin_time":520,"fixed":false,"text":"to"}],"begin_time":160,"text":"Welcome to","channel_id":0,"sentence_id":1,"sentence_end":false},"text":"Welcome to"},"request_id":"372d19b3-993f-9288-adf0-a99f7606bd30"}

Final result (sentence complete, includes usage):

id:4
event:result
:HTTP_STATUS/200
data:{"output":{"sentence":{"sentence_id":1,"sentence_end":true,"end_time":1680,"words":[{"end_time":520,"punctuation":"","begin_time":160,"fixed":true,"text":"Welcome"},{"end_time":880,"punctuation":"","begin_time":520,"fixed":true,"text":"to"},{"end_time":1280,"punctuation":"","begin_time":880,"fixed":true,"text":"Alibaba"},{"end_time":1680,"punctuation":"。","begin_time":1280,"fixed":true,"text":"Cloud"}],"begin_time":160,"text":"Welcome to Alibaba Cloud.","channel_id":0},"text":"Welcome to Alibaba Cloud."},"usage":{"duration":2},"request_id":"372d19b3-993f-9288-adf0-a99f7606bd30"}

Important

Model Studio has released workspace-specific domains for the China (Beijing) and Singapore regions. The new dedicated domains deliver superior performance and higher stability for inference requests. We recommend migrating to the new domains:

  • China (Beijing): from https://dashscope.aliyuncs.com to https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com

  • Singapore: from https://dashscope-intl.aliyuncs.com to https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com

{WorkspaceId} is your workspace ID, which you can find on the Workspace Details page in the Model Studio console. The existing domains remain fully functional.

request_id string

The unique identifier for the request.

output object

The output result.

Properties

text string

The full recognized text accumulated so far.

sentence object

Details about the current sentence.

Properties

sentence_id integer

The sentence number, starting from 1.

sentence_end boolean

Indicates whether this is the final result for the sentence. When true, recognition for the sentence is complete.

begin_time integer

The start time of the sentence, in milliseconds.

end_time integer

The end time of the sentence, in milliseconds. Returned only when sentence_end is true.

text string

The recognized text for the current sentence.

channel_id integer

The channel number, starting from 0.

words array

The word-level timestamp list.

Properties

text string

The word text.

begin_time integer

The start time of the word, in milliseconds.

end_time integer

The end time of the word, in milliseconds.

punctuation string

The punctuation mark after the word. An empty string if there's no punctuation.

fixed boolean

Indicates whether the word is finalized. false means the timestamp for this word may change in subsequent events.

usage object

The usage information. Returned only when sentence_end is true.

Properties

duration integer

The processed audio duration, in seconds.

SSE streaming result processing

In streaming mode, process each event as follows:

  1. For each SSE event received, parse the JSON in the data field.

  2. Check output.sentence.sentence_end to determine whether the current sentence is complete. When the value is true, recognition for the sentence is finished and the word-level timestamps are finalized. When the value is false, recognition is still in progress and the text and timestamps may change in subsequent events.

  3. The usage object is returned only in sentence-end events. Use usage.duration to track the processed audio duration.
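While a sentence is still in progress, the per-word fixed flag indicates which words may still change. The Python sketch below separates the two groups so that a caller could, for example, render tentative words differently. The split_by_fixed helper and the sample sentence are hypothetical; in particular, an intermediate event that mixes fixed and non-fixed words is an assumption made for this example and does not appear in the samples above.

```python
def split_by_fixed(sentence):
    """Partition a sentence's words into finalized and tentative lists."""
    stable, tentative = [], []
    for word in sentence.get("words", []):
        # fixed == False means the word's timestamp may still be adjusted.
        target = stable if word["fixed"] else tentative
        target.append(word["text"] + word["punctuation"])
    return stable, tentative


# Hypothetical intermediate sentence built from the documented word fields.
sentence = {
    "sentence_id": 1,
    "sentence_end": False,
    "begin_time": 160,
    "text": "Welcome to",
    "channel_id": 0,
    "words": [
        {"begin_time": 160, "end_time": 520, "fixed": True, "punctuation": "", "text": "Welcome"},
        {"begin_time": 520, "end_time": 880, "fixed": False, "punctuation": "", "text": "to"},
    ],
}

stable, tentative = split_by_fixed(sentence)
print("stable:", stable, "tentative:", tentative)
```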