Fun-ASR real-time speech recognition client events

Client events are WebSocket commands that the client sends to the Fun-ASR real-time speech recognition service: run-task starts a recognition task, continue-task updates the conversation context of a running task, and finish-task ends the task. Each section below describes the message structure and field semantics of one event.

User guide: For model details and selection guidance, see Speech-to-text.

Event sequence: For the event interaction diagram, see WebSocket API.

run-task

Description: Starts a speech recognition task. Sets the model, audio format, sample rate, and other parameters.

When to send: Immediately after the WebSocket connection is established.

Response: The client can send audio only after the service returns a task-started event.

header object (Required)

Properties

action string (Required)

The command type. Fixed at run-task.

task_id string (Required)

A UUID-format task ID generated by the client. Subsequent events are correlated by this ID.

streaming string (Required)

Fixed at duplex.

Basic request

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "asr",
        "function": "recognition",
        "model": "fun-asr-realtime",
        "parameters": {
            "format": "pcm",
            "sample_rate": 16000
        },
        "input": {}
    }
}

With context

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "asr",
        "function": "recognition",
        "model": "fun-asr-realtime",
        "parameters": {
            "format": "pcm",
            "sample_rate": 16000
        },
        "input": {
            "context": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": "Hello"
                        }
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {
                            "type": "text",
                            "text": "Hello, I am Tongyi Qianwen. How can I help you?"
                        }
                    ]
                }
            ]
        }
    }
}
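The two requests above differ only in the input object, so a client can build both from one helper. The following is a minimal sketch, not an official SDK call: the function name build_run_task and its default arguments are illustrative assumptions, and only the field names and fixed values come from this page.

```python
import json
import uuid


def build_run_task(model="fun-asr-realtime", audio_format="pcm",
                   sample_rate=16000, context=None, task_id=None):
    """Return (task_id, message) for a run-task event."""
    # The task ID is a client-generated UUID; later events reuse it.
    task_id = task_id or str(uuid.uuid4())
    message = {
        "header": {
            "action": "run-task",
            "task_id": task_id,
            "streaming": "duplex",
        },
        "payload": {
            "task_group": "audio",
            "task": "asr",
            "function": "recognition",
            "model": model,
            "parameters": {"format": audio_format, "sample_rate": sample_rate},
            # input is required; it is {} when no context is supplied.
            "input": {"context": context} if context else {},
        },
    }
    return task_id, message


if __name__ == "__main__":
    task_id, message = build_run_task()
    print(json.dumps(message, indent=4))
```

The returned task_id should be kept, because continue-task and finish-task must carry the same value.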

payload object (Required)

Properties

task_group string (Required)

The task group. Fixed at audio.

task string (Required)

The task type. Fixed at asr.

function string (Required)

The function type. Fixed at recognition.

model string (Required)

The name of the model to use, for example fun-asr-realtime.

input object (Required)

The input object. Pass {} when no context is provided.

Important

Only the fun-asr-realtime and fun-asr-realtime-2025-11-07 models support the context parameter.

Properties

context array(object) (Optional)

Conversation context used to improve recognition accuracy for domain-specific vocabulary. For usage details, see Quick start.

Important

Constraints: The context can contain at most 5 messages of each type (input_text and text). If more are supplied, only the 5 most recent are retained. Within each turn, the combined length of the user and assistant text fields must not exceed 400 characters. Excess content is truncated from the end.

Important

When providing context, the messages in the context array must be ordered by conversation turn. Within each turn, the user message (input_text type) must precede the corresponding assistant message (text type).
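The limits and ordering rules above can also be applied on the client before sending, so that what the service keeps is predictable. The sketch below is one possible reading, under stated assumptions: each turn has exactly one user message and one assistant message, and "truncated from the end" is taken to mean that the assistant text is trimmed first. The names build_context, MAX_PER_TYPE, and MAX_TURN_CHARS are hypothetical.

```python
MAX_PER_TYPE = 5      # at most 5 input_text and 5 text messages
MAX_TURN_CHARS = 400  # combined user + assistant text per turn


def build_context(turns):
    """Build a context array from (user_text, assistant_text) pairs, oldest first."""
    context = []
    # Keep only the most recent turns, so each type appears at most 5 times.
    for user_text, assistant_text in turns[-MAX_PER_TYPE:]:
        # Assumption: trimming from the end cuts the assistant text first.
        user_text = user_text[:MAX_TURN_CHARS]
        assistant_text = assistant_text[:MAX_TURN_CHARS - len(user_text)]
        # Within a turn, the user message precedes the assistant message.
        context.append({
            "role": "user",
            "content": [{"type": "input_text", "text": user_text}],
        })
        context.append({
            "role": "assistant",
            "content": [{"type": "text", "text": assistant_text}],
        })
    return context


if __name__ == "__main__":
    history = [(f"question {i}", f"answer {i}") for i in range(7)]
    print(len(build_context(history)))  # 10 messages: the last 5 turns
```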

Properties

role string (Required)

The message role. Valid values:

user: The speech recognition results from previous turns or domain-specific vocabulary.
assistant: The LLM responses from previous turns.

content array(object) (Required)

The message content list.

Properties

type string (Required)

The content type. Valid values:

input_text: The speech recognition results from previous turns or domain-specific vocabulary (when role is user). Requires the text field.
text: The LLM responses from previous turns (when role is assistant). Requires the text field.

text string (Required)

The text content. When type is input_text, provide the speech recognition results from previous turns or domain-specific vocabulary. When type is text, provide the LLM responses from previous turns.

parameters object (Required)

Speech recognition parameters.

Properties

format string (Required)

The audio format.

Valid values:

pcm
wav
mp3
opus
speex
aac
amr

sample_rate integer (Required)

The sample rate, in Hz.

Valid values:

8 kHz models support only 8000 Hz; all other models support any sample rate.
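A client can reject an invalid format or sample rate before opening a task. This is a rough sketch only: the helper name check_audio_params is hypothetical, and detecting an 8 kHz model by the substring "8k" in its name is an assumption made for illustration, not a rule stated on this page.

```python
VALID_FORMATS = {"pcm", "wav", "mp3", "opus", "speex", "aac", "amr"}


def check_audio_params(model, audio_format, sample_rate):
    """Return the format/sample_rate fragment of parameters, or raise ValueError."""
    if audio_format not in VALID_FORMATS:
        raise ValueError(f"unsupported format: {audio_format!r}")
    # Assumption: 8 kHz models are those with "8k" in their name.
    if "8k" in model and sample_rate != 8000:
        raise ValueError(f"{model} accepts only 8000 Hz, got {sample_rate}")
    return {"format": audio_format, "sample_rate": sample_rate}


if __name__ == "__main__":
    print(check_audio_params("fun-asr-realtime", "pcm", 16000))
```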

vocabulary_id string (Optional)

The custom vocabulary (hotword) list ID.

language_hints array[string] (Optional)

The spoken language of the audio. No default; the model auto-detects the language when this parameter is omitted.

Only the first value in the array is used; additional values are ignored.

Valid values:

fun-asr-realtime, fun-asr-realtime-2025-11-07:
- zh: Chinese
- en: English
- ja: Japanese
- ko: Korean
- vi: Vietnamese
- th: Thai
- id: Indonesian
- ms: Malay
- tl: Filipino
- hi: Hindi
- ar: Arabic
- fr: French
- de: German
- es: Spanish
- pt: Portuguese
- ru: Russian
- it: Italian
- nl: Dutch
- sv: Swedish
- da: Danish
- fi: Finnish
- no: Norwegian
- el: Greek
- pl: Polish
- cs: Czech
- hu: Hungarian
- ro: Romanian
- bg: Bulgarian
- hr: Croatian
- sk: Slovak
fun-asr-realtime-2026-02-28:
- zh: Chinese
- en: English
- ja: Japanese
fun-asr-realtime-2025-09-15:
- zh: Chinese
- en: English
fun-asr-flash-8k-realtime, fun-asr-flash-8k-realtime-2026-01-28:
- zh: Chinese
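Because only the first array element is used and the accepted codes depend on the model, it can help to check the hint on the client. The following sketch encodes the lists above in a lookup table; the names SUPPORTED_LANGUAGES and language_hints are illustrative, and raising an error for an unlisted code is a client-side choice rather than documented service behavior.

```python
ALL_CODES = ("zh en ja ko vi th id ms tl hi ar fr de es pt ru it nl sv da "
             "fi no el pl cs hu ro bg hr sk").split()

SUPPORTED_LANGUAGES = {
    "fun-asr-realtime": ALL_CODES,
    "fun-asr-realtime-2025-11-07": ALL_CODES,
    "fun-asr-realtime-2026-02-28": ["zh", "en", "ja"],
    "fun-asr-realtime-2025-09-15": ["zh", "en"],
    "fun-asr-flash-8k-realtime": ["zh"],
    "fun-asr-flash-8k-realtime-2026-01-28": ["zh"],
}


def language_hints(model, language=None):
    """Return the language_hints fragment of parameters, or {} for auto-detection."""
    if language is None:
        return {}  # omit the parameter so the model detects the language
    if language not in SUPPORTED_LANGUAGES.get(model, []):
        raise ValueError(f"{model} does not list language {language!r}")
    # Only the first element is used, so send exactly one code.
    return {"language_hints": [language]}


if __name__ == "__main__":
    print(language_hints("fun-asr-realtime", "ko"))
```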

semantic_punctuation_enabled boolean (Optional)

Whether to enable semantic punctuation.

Default: false.

true: Enables semantic punctuation and disables VAD-based segmentation.
false (default): Enables VAD-based segmentation and disables semantic punctuation.

Semantic punctuation provides higher accuracy and is suitable for meeting transcription. VAD (Voice Activity Detection)-based segmentation has lower latency and is suitable for conversational scenarios.

max_sentence_silence integer (Optional)

Important

Effective only when semantic_punctuation_enabled is false.

The VAD silence threshold, in milliseconds. When the silence after a speech segment exceeds this threshold, the system marks the sentence as ended.

Default: 1300.

Valid range: [200, 6000].

multi_threshold_mode_enabled boolean (Optional)

Important

Effective only when semantic_punctuation_enabled is false.

Whether to enable multi-threshold mode. When enabled, prevents VAD from producing overly long segments.

Default: false.
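The three segmentation parameters interact: the two VAD options take effect only when semantic punctuation is off. A minimal sketch of a helper that enforces this on the client is shown below. The function name segmentation_params is hypothetical, and rejecting the ineffective combination is an illustrative design choice; this page only says that the VAD options are not effective in that case.

```python
def segmentation_params(semantic=False, max_sentence_silence=None,
                        multi_threshold=None):
    """Return the segmentation-related fragment of parameters."""
    params = {"semantic_punctuation_enabled": semantic}
    if semantic:
        # VAD options are effective only when semantic punctuation is disabled.
        if max_sentence_silence is not None or multi_threshold is not None:
            raise ValueError("VAD options have no effect with semantic punctuation")
        return params
    if max_sentence_silence is not None:
        if not 200 <= max_sentence_silence <= 6000:
            raise ValueError("max_sentence_silence must be in [200, 6000] ms")
        params["max_sentence_silence"] = max_sentence_silence
    if multi_threshold is not None:
        params["multi_threshold_mode_enabled"] = multi_threshold
    return params


if __name__ == "__main__":
    # Conversational use: VAD segmentation with a shorter silence threshold.
    print(segmentation_params(max_sentence_silence=800, multi_threshold=True))
```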

heartbeat boolean (Optional)

Whether to enable heartbeat packets.

Default: false.

true: Keeps the connection alive even when the client continuously sends silent audio.
false (default): The connection times out and disconnects after 60 seconds, even when the client continuously sends silent audio.
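With heartbeat set to true, the client keeps the connection open by continuing to send silent audio while nobody is speaking. As a rough illustration, silent PCM is a run of zero-valued samples. The sketch assumes 16-bit mono PCM and a 100 ms chunk size, neither of which is specified on this page, and the helper name pcm_silence is hypothetical.

```python
def pcm_silence(sample_rate=16000, duration_ms=100, sample_width=2):
    """Return duration_ms of silent mono PCM (all-zero samples)."""
    n_samples = sample_rate * duration_ms // 1000
    # bytes(n) yields n zero bytes, which is digital silence for signed PCM.
    return bytes(n_samples * sample_width)


if __name__ == "__main__":
    chunk = pcm_silence()
    print(len(chunk))  # 3200 bytes: 1600 samples x 2 bytes
```

The resulting chunk would be sent with the same call the client uses for regular audio data.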

speech_noise_threshold float (Optional)

Important

Only Fun-ASR supports this parameter.

The speech-noise discrimination threshold. Adjusts VAD sensitivity.

Valid range: [-1.0, 1.0].

Behavior:

Values closer to -1 lower the noise threshold. Noise is more likely to be classified as speech, which can cause additional noise to be transcribed.
Values closer to +1 raise the noise threshold. Speech is more likely to be classified as noise, which can cause some speech to be filtered out.

This is an advanced parameter, and changes can significantly affect recognition quality. Recommendations:

Test thoroughly to verify the effect before applying changes.
Adjust in small increments based on the actual audio environment (suggested step: 0.1).
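The small-increment recommendation can be expressed as a helper that moves the threshold one step at a time and keeps it inside the valid range. This is an illustrative sketch; the name step_threshold and the rounding to two decimals are assumptions.

```python
def step_threshold(current, direction, step=0.1):
    """Move speech_noise_threshold one step up (+1) or down (-1), clamped to [-1.0, 1.0]."""
    # Round to avoid accumulating floating-point drift over repeated steps.
    candidate = round(current + direction * step, 2)
    return max(-1.0, min(1.0, candidate))


if __name__ == "__main__":
    value = 0.0
    value = step_threshold(value, +1)  # filter slightly more noise
    print(value)  # 0.1
```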

continue-task

Description: Updates the conversation context during an active recognition task to improve recognition accuracy.

When to send: During an active task, when you need to update the conversation context.

Important

Only the fun-asr-realtime and fun-asr-realtime-2025-11-07 models support this event.

header object (Required)

Properties

action string (Required)

The command type. Fixed at continue-task.

task_id string (Required)

A UUID-format task ID generated by the client. Must match the task_id used in the corresponding run-task event.

streaming string (Required)

Fixed at duplex.

{
    "header": {
        "action": "continue-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {
            "context": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": "Hello"
                        }
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {
                            "type": "text",
                            "text": "Hello, I am Tongyi Qianwen. How can I help you?"
                        }
                    ]
                }
            ]
        }
    }
}
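The message above could be produced by a small helper that reuses the task ID from run-task. This is a minimal sketch: build_continue_task and CONTEXT_MODELS are hypothetical names, and the client-side model check simply mirrors the support note in this section.

```python
# Models that support context according to this page.
CONTEXT_MODELS = {"fun-asr-realtime", "fun-asr-realtime-2025-11-07"}


def build_continue_task(task_id, context, model="fun-asr-realtime"):
    """Return a continue-task message that replaces the conversation context."""
    if model not in CONTEXT_MODELS:
        raise ValueError(f"{model} does not support continue-task")
    return {
        "header": {
            "action": "continue-task",
            "task_id": task_id,  # must match the run-task task_id
            "streaming": "duplex",
        },
        "payload": {"input": {"context": context}},
    }


if __name__ == "__main__":
    context = [
        {"role": "user", "content": [{"type": "input_text", "text": "Hello"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "Hi there"}]},
    ]
    print(build_continue_task("2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", context))
```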

payload object (Required)

Properties

input object (Required)

The input object.

Properties

context array(object) (Optional)

Conversation context used to improve recognition accuracy for domain-specific vocabulary. For usage details, see Quick start.

Properties

role string (Required)

The message role. Valid values:

user: The speech recognition results from previous turns or domain-specific vocabulary.
assistant: The LLM responses from previous turns.

content array(object) (Required)

The message content list.

Properties

type string (Required)

The content type. Valid values:

input_text: The speech recognition results from previous turns or domain-specific vocabulary (when role is user). Requires the text field.
text: The LLM responses from previous turns (when role is assistant). Requires the text field.

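text string (Required)

The text content. When type is input_text, provide the speech recognition results from previous turns or domain-specific vocabulary. When type is text, provide the LLM responses from previous turns.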

finish-task

Description: Notifies the service that all audio has been sent and requests that the service finish the task.

When to send: After all audio data has been sent.

Response: The service returns a task-finished event.

header object (Required)

Properties

action string (Required)

The command type. Fixed at finish-task.

task_id string (Required)

A UUID-format task ID generated by the client. Must match the task_id used in the corresponding run-task event.

streaming string (Required)

Fixed at duplex.

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}
    }
}

payload object (Required)

Properties

input object (Required)

Fixed at {}.
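Taken together, the three events imply an ordering: audio may be sent only after task-started, and finish-task follows the last audio. The sketch below tracks that ordering on the client. The class RecognitionTask and its state names are hypothetical, and the server event is passed in by name because the layout of server messages is not described on this page.

```python
class RecognitionTask:
    """Tracks task state, assuming run-task has already been sent."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.state = "starting"  # starting -> started -> finishing -> finished

    @property
    def can_send_audio(self):
        # Audio is allowed only between task-started and finish-task.
        return self.state == "started"

    def on_server_event(self, name):
        if name == "task-started" and self.state == "starting":
            self.state = "started"
        elif name == "task-finished" and self.state == "finishing":
            self.state = "finished"

    def finish_task(self):
        """Return the finish-task message once all audio has been sent."""
        if self.state != "started":
            raise RuntimeError("finish-task requires a started task")
        self.state = "finishing"
        return {
            "header": {
                "action": "finish-task",
                "task_id": self.task_id,
                "streaming": "duplex",
            },
            "payload": {"input": {}},
        }


if __name__ == "__main__":
    task = RecognitionTask("2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx")
    task.on_server_event("task-started")
    print(task.can_send_audio)  # True
    print(task.finish_task()["header"]["action"])  # finish-task
```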