Interaction flow for real-time speech recognition (Qwen-ASR-Realtime) - Alibaba Cloud Model Studio - Alibaba Cloud Help Center

Qwen-ASR-Realtime receives audio streams and transcribes speech in real time over WebSocket. The service supports two interaction modes: VAD mode and Manual mode.

User guide: For model overviews and selection guidance, see Speech-to-text. For sample code, see Real-time speech recognition.

Service endpoint

Connect to the WebSocket URL for your region. The model query parameter selects the model; replace <model_name> with the model name:

China (Beijing)

wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/realtime?model=<model_name>

Replace {WorkspaceId} with your actual workspace ID.

Singapore

wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime?model=<model_name>

Replace {WorkspaceId} with your actual workspace ID.
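
As an illustration of how the two placeholders fit together, the following Python sketch assembles the endpoint URL from a workspace ID, a region, and a model name. The function name build_realtime_url and the short region keys ("beijing", "singapore") are hypothetical and used only for this example; the hostnames and path are taken from the URLs above.

```python
from urllib.parse import urlencode

# Region identifiers as they appear in the endpoint hostnames above.
# The dictionary keys are illustrative labels, not official region codes.
REGION_HOSTS = {
    "beijing": "cn-beijing",
    "singapore": "ap-southeast-1",
}


def build_realtime_url(workspace_id: str, region: str, model: str) -> str:
    """Return the wss:// endpoint for a workspace, region, and model."""
    if not workspace_id or not model:
        raise ValueError("workspace_id and model are required")
    if region not in REGION_HOSTS:
        raise ValueError(f"unknown region: {region!r}")
    host = f"{workspace_id}.{REGION_HOSTS[region]}.maas.aliyuncs.com"
    # urlencode escapes the model name so it is safe in a query string.
    return f"wss://{host}/api-ws/v1/realtime?{urlencode({'model': model})}"
```

Calling build_realtime_url with your own workspace ID, "singapore", and a model name yields a URL in the Singapore format shown above.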

Important

Alibaba Cloud Model Studio has released workspace-specific domains for the China (Beijing) and Singapore regions. The new dedicated domains deliver superior performance and higher stability for inference requests. We recommend migrating to the new domains:

China (Beijing): from dashscope.aliyuncs.com to {WorkspaceId}.cn-beijing.maas.aliyuncs.com
Singapore: from dashscope-intl.aliyuncs.com to {WorkspaceId}.ap-southeast-1.maas.aliyuncs.com

Replace {WorkspaceId} with your actual workspace ID. The existing domains remain fully functional.
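
If your client stores the hostname in configuration, the migration can be expressed as a simple lookup. The sketch below is one possible approach, not an official migration tool; the function name migrate_host is hypothetical, and the mapping contains only the two domain pairs listed above.

```python
# Legacy domain -> region segment of the new workspace-specific domain,
# copied from the migration list above.
LEGACY_TO_REGION = {
    "dashscope.aliyuncs.com": "cn-beijing",
    "dashscope-intl.aliyuncs.com": "ap-southeast-1",
}


def migrate_host(host: str, workspace_id: str) -> str:
    """Map a legacy domain to its workspace-specific equivalent.

    Hosts that are not in the mapping are returned unchanged, so the
    function is safe to run on configuration that was already migrated.
    """
    region = LEGACY_TO_REGION.get(host)
    if region is None:
        return host
    return f"{workspace_id}.{region}.maas.aliyuncs.com"
```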

Important

Use the wss:// scheme. Set authorization in the request headers (see Request headers). Specify the model through the model query parameter.

Request headers

Include the following fields in the request headers:

Parameter	Type	Required	Description
Authorization	string	Yes	Authentication token in the format `Bearer <your_api_key>`. Replace `<your_api_key>` with your API key.
user-agent	string	No	Client identifier that helps the server track request sources.
X-DashScope-WorkSpace	string	No	The Alibaba Cloud Model Studio workspace ID.
X-DashScope-DataInspection	string	No	Whether to enable data inspection. Omit this header unless data inspection is required; if it is, set the value to `enable`.

Important

Authorization is verified during the WebSocket handshake. If the API key is invalid or missing, the handshake fails with an HTTP 401 or 403 error.
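
The header table above can be turned into a small helper that builds the handshake headers. This is a minimal sketch: the function name build_headers and its parameter names are hypothetical, and how the resulting dictionary is passed to the handshake depends on the WebSocket client library you use.

```python
from typing import Dict, Optional


def build_headers(
    api_key: str,
    workspace_id: Optional[str] = None,
    user_agent: Optional[str] = None,
    data_inspection: bool = False,
) -> Dict[str, str]:
    """Build the request headers for the WebSocket handshake."""
    if not api_key:
        # Authorization is the only required header; without it the
        # handshake is rejected with HTTP 401 or 403.
        raise ValueError("api_key is required")
    headers = {"Authorization": f"Bearer {api_key}"}
    if user_agent:
        headers["user-agent"] = user_agent
    if workspace_id:
        headers["X-DashScope-WorkSpace"] = workspace_id
    if data_inspection:
        # The header is omitted entirely unless inspection is requested.
        headers["X-DashScope-DataInspection"] = "enable"
    return headers
```

Optional headers are left out rather than sent empty, which matches the guidance to omit X-DashScope-DataInspection unless data inspection is required.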

Interaction flows

For details about client and server events, see Client events for Qwen-ASR-Realtime and Server events.

Qwen-ASR-Realtime supports two interaction modes:

VAD mode (default): The server uses voice activity detection (VAD) to automatically detect the start and end of each utterance. Use this mode for real-time conversations, meeting transcription, and similar scenarios.
Manual mode: The client controls utterance boundaries. Use this mode when the client can determine those boundaries explicitly, such as sending a voice message in a messaging app.

VAD mode (default)

The server automatically detects the start and end of each utterance. Send the audio stream continuously; the server returns the final transcription for each utterance once it detects the end. Use this mode for real-time conversations, meeting transcription, and similar scenarios.

To enable: Configure the session.turn_detection parameter in the client session.update event.
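
A session.update message for VAD mode might be built as in the following sketch. The overall shape (a type field plus a session object) and the value {"type": "server_vad"} are assumptions made for illustration; this page only states that session.turn_detection is configured in session.update, so confirm the exact fields in the client events reference.

```python
import json
from typing import Optional


def vad_session_update(vad_options: Optional[dict] = None) -> str:
    """Serialize a session.update event that keeps server-side VAD on."""
    # ASSUMPTION: "server_vad" is an illustrative value, not confirmed
    # by this page. Extra options are merged in unchanged.
    turn_detection = {"type": "server_vad"}
    turn_detection.update(vad_options or {})
    event = {
        "type": "session.update",
        "session": {"turn_detection": turn_detection},
    }
    return json.dumps(event)
```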

The client sends input_audio_buffer.append events to append audio to the buffer.
The server returns an input_audio_buffer.speech_started event when speech is detected.

Note: If the client sends session.finish to end the session before this event arrives, the server immediately returns a session.finished event. The client must then close the connection.
The client continues to send input_audio_buffer.append events to submit audio.
After sending all audio, the client sends a session.finish event to end the session.
The server returns an input_audio_buffer.speech_stopped event when it detects the end of speech.
The server returns an input_audio_buffer.committed event.
The server returns a conversation.item.created event.
The server returns a conversation.item.input_audio_transcription.text event that contains partial transcription results.
The server returns a conversation.item.input_audio_transcription.completed event that contains the final transcription result.
The server returns a session.finished event to signal that recognition is complete. The client must then close the connection.
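
On the receiving side, the VAD flow above amounts to dispatching on the type of each server event until session.finished arrives. The sketch below processes already-decoded events (dictionaries) so that it runs without a network connection. The function name consume_server_events is hypothetical, and the transcript field read from the completed event is an assumption; check the server events reference for the actual payload.

```python
PARTIAL = "conversation.item.input_audio_transcription.text"
FINAL = "conversation.item.input_audio_transcription.completed"


def consume_server_events(events):
    """Walk decoded server events and summarize the session.

    Returns a dict with the number of detected utterances, the number
    of partial results, the final transcripts, and whether
    session.finished was seen (the signal to close the connection).
    """
    summary = {"utterances": 0, "partials": 0, "finals": [], "finished": False}
    for event in events:
        kind = event.get("type")
        if kind == "input_audio_buffer.speech_started":
            summary["utterances"] += 1
        elif kind == PARTIAL:
            summary["partials"] += 1
        elif kind == FINAL:
            # ASSUMPTION: the final text is carried in a "transcript" field.
            summary["finals"].append(event.get("transcript", ""))
        elif kind == "session.finished":
            summary["finished"] = True
            break  # nothing further is expected; close the connection
        # Other events (speech_stopped, committed, item.created) need no
        # action in this sketch and are skipped.
    return summary
```

If session.finished arrives before any speech_started event, as described in the note above, the summary reports zero utterances and finished set to True.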

Manual mode

The client controls utterance boundaries. After sending the audio for a complete utterance, the client sends an input_audio_buffer.commit event to notify the server. Use this mode when the client can determine those boundaries explicitly, such as sending a voice message in a messaging app.

To enable: Set session.turn_detection to null in the client session.update event.
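
In Python, the null value corresponds to None, which the json module serializes as null. The sketch below shows this; the message shape (a type field plus a session object) is an assumption made for illustration, and the function name manual_session_update is hypothetical.

```python
import json


def manual_session_update() -> str:
    """Serialize a session.update event that switches to Manual mode."""
    event = {
        "type": "session.update",
        # None becomes JSON null. The key must be present with a null
        # value; leaving it out would not express "set to null".
        "session": {"turn_detection": None},
    }
    return json.dumps(event)
```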

The client sends input_audio_buffer.append events to append audio to the buffer.
The client sends an input_audio_buffer.commit event to commit the input audio buffer. The commit creates a new user message item in the conversation.
The client sends a session.finish event to end the session.
The server returns an input_audio_buffer.committed event.
The server returns a conversation.item.input_audio_transcription.text event that contains partial transcription results.
The server returns a conversation.item.input_audio_transcription.completed event that contains the final transcription result.
The server returns a session.finished event to signal that recognition is complete. The client must then close the connection.
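
The client side of the Manual mode flow above can be sketched as a function that prepares the outgoing messages in order: one append event per audio chunk, then commit, then session.finish. The function name manual_mode_messages, the audio field carrying Base64-encoded data, and the default chunk size are assumptions made for this example; consult the client events reference for the actual payload format.

```python
import base64
import json
from typing import List


def manual_mode_messages(audio: bytes, chunk_size: int = 3200) -> List[str]:
    """Return the JSON messages a client sends for one utterance."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    messages = []
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        messages.append(json.dumps({
            "type": "input_audio_buffer.append",
            # ASSUMPTION: audio is sent Base64-encoded in an "audio" field.
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    # Mark the end of the utterance, then end the session.
    messages.append(json.dumps({"type": "input_audio_buffer.commit"}))
    messages.append(json.dumps({"type": "session.finish"}))
    return messages
```

After sending these messages, the client keeps reading server events until session.finished arrives and then closes the connection.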