Qwen-ASR-Realtime receives audio streams and transcribes speech in real time over WebSocket. The service supports two interaction modes: VAD mode and Manual mode.
User guide: For model overviews and selection guidance, see Speech-to-text. For sample code, see Real-time speech recognition.
Service endpoint
Use the following WebSocket URL. The model query parameter specifies the model. Replace <model_name> with the model name:
China (Beijing)
wss://dashscope.aliyuncs.com/api-ws/v1/realtime?model=<model_name>
Singapore
wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime?model=<model_name>
Replace {WorkspaceId} with your workspace ID, which you can find on the Workspace Details page in the Model Studio console.
Model Studio has released a workspace-specific domain for the Singapore region: wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com. The dedicated domain offers better performance and stability for inference requests. We recommend migrating from wss://dashscope-intl.aliyuncs.com to the new domain, although the existing domain remains fully functional.
Use the wss:// scheme. Set authorization in the request headers (see Request headers). Specify the model through the model query parameter.
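The endpoint rules above can be sketched as a small URL builder. This is a minimal illustration, not an official client: the function name `build_realtime_url`, the region keys, and the model name used in the usage note are hypothetical; only the two base URLs and the `model` query parameter come from this page.

```python
from urllib.parse import quote

# Base endpoints copied from the "Service endpoint" section.
BEIJING_BASE = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime"
SINGAPORE_BASE = (
    "wss://{workspace_id}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime"
)


def build_realtime_url(model_name, region="beijing", workspace_id=None):
    """Return the WebSocket URL for a region and model (hypothetical helper)."""
    if region == "beijing":
        base = BEIJING_BASE
    elif region == "singapore":
        # The workspace-specific Singapore domain embeds the workspace ID.
        if not workspace_id:
            raise ValueError("The Singapore endpoint requires a workspace ID")
        base = SINGAPORE_BASE.format(workspace_id=workspace_id)
    else:
        raise ValueError(f"Unknown region: {region!r}")
    # Percent-encode the model name so it is safe as a query value.
    return f"{base}?model={quote(model_name, safe='')}"
```

For example, `build_realtime_url("example-model", "singapore", "ws123")` produces a URL on the `ws123.ap-southeast-1.maas.aliyuncs.com` host; `example-model` and `ws123` are placeholders, not real values.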
Request headers
Include the following fields in the request headers:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| Authorization | string | Yes | Authentication token in the format |
| user-agent | string | No | Client identifier that helps the server track request sources. |
| X-DashScope-WorkSpace | string | No | The Alibaba Cloud Model Studio workspace ID. |
| X-DashScope-DataInspection | string | No | Whether to enable data inspection. Omit this header unless data inspection is required; if it is, set the value to |
Authorization is verified during the WebSocket handshake. If the API key is invalid or missing, the handshake fails with an HTTP 401 or 403 error.
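As a rough sketch, the handshake headers could be assembled as follows. The helper name `build_headers` is hypothetical. The exact Authorization format is not shown in the table above, so the code assumes the common `Bearer <api_key>` scheme; verify this against the official reference before relying on it. The X-DashScope-DataInspection header is left out because its value is not given here.

```python
def build_headers(api_key, workspace_id=None, user_agent=None):
    """Build WebSocket handshake headers (hypothetical helper)."""
    if not api_key:
        # A missing key would make the handshake fail with HTTP 401 or 403.
        raise ValueError("An API key is required for the Authorization header")
    # ASSUMPTION: the token uses the common "Bearer <api_key>" scheme.
    headers = {"Authorization": f"Bearer {api_key}"}
    if user_agent:
        headers["user-agent"] = user_agent
    if workspace_id:
        headers["X-DashScope-WorkSpace"] = workspace_id
    return headers
```

Most WebSocket client libraries accept a mapping like this at connection time. If the server rejects the handshake with 401 or 403, check the key before retrying.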
Interaction flows
For details about client and server events, see Client events for Qwen-ASR-Realtime and Server events.
Qwen-ASR-Realtime supports two interaction modes:
- VAD mode (default): The server uses voice activity detection (VAD) to automatically detect the start and end of each utterance. Use this mode for real-time conversations, meeting transcription, and similar scenarios.
- Manual mode: The client controls utterance boundaries. Use this mode when the client can determine those boundaries explicitly, such as sending a voice message in a messaging app.
VAD mode (default)
The server automatically detects the start and end of each utterance. Send the audio stream continuously; the server returns the final transcription for each utterance once it detects the end. Use this mode for real-time conversations, meeting transcription, and similar scenarios.
To enable: Configure the session.turn_detection parameter in the client session.update event.
- The client sends input_audio_buffer.append events to append audio to the buffer.
- The server returns an input_audio_buffer.speech_started event when speech is detected.
  Note: If the client sends session.finish to end the session before this event arrives, the server immediately returns a session.finished event. The client must then close the connection.
- The client continues to send input_audio_buffer.append events to submit audio.
- After sending all audio, the client sends a session.finish event to end the session.
- The server returns an input_audio_buffer.speech_stopped event when it detects the end of speech.
- The server returns an input_audio_buffer.committed event.
- The server returns a conversation.item.created event.
- The server returns a conversation.item.input_audio_transcription.text event that contains partial transcription results.
- The server returns a conversation.item.input_audio_transcription.completed event that contains the final transcription result.
- The server returns a session.finished event to signal that recognition is complete. The client must then close the connection.
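The server side of the flow above can be tracked with a small dispatcher. The sketch below is illustrative only: the class name `ServerEventTracker` is hypothetical, and it assumes each server message is a JSON object whose `type` field carries the event name. Because the payload fields of the transcription events are not described on this page, the tracker stores whole events instead of guessing field names.

```python
import json

PARTIAL = "conversation.item.input_audio_transcription.text"
FINAL = "conversation.item.input_audio_transcription.completed"


class ServerEventTracker:
    """Track server events for one session (hypothetical helper)."""

    def __init__(self):
        self.speaking = False      # True between speech_started and speech_stopped
        self.partial_events = []   # raw partial transcription events
        self.final_events = []     # raw final transcription events
        self.should_close = False  # True once session.finished arrives

    def handle(self, raw_message):
        """Process one JSON message; return True when the client must close."""
        event = json.loads(raw_message)
        # ASSUMPTION: the event name is carried in a "type" field.
        event_type = event.get("type")
        if event_type == "input_audio_buffer.speech_started":
            self.speaking = True
        elif event_type == "input_audio_buffer.speech_stopped":
            self.speaking = False
        elif event_type == PARTIAL:
            self.partial_events.append(event)
        elif event_type == FINAL:
            self.final_events.append(event)
        elif event_type == "session.finished":
            self.should_close = True
        # Other events (committed, item created) need no state change here.
        return self.should_close
```

A receive loop would call `handle` on every incoming message and close the WebSocket as soon as it returns `True`, which also covers the case where session.finished arrives before any speech is detected.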
Manual mode
The client controls utterance boundaries. After sending the audio for a complete utterance, the client sends an input_audio_buffer.commit event to notify the server. Use this mode when the client can determine those boundaries explicitly, such as sending a voice message in a messaging app.
To enable: Set session.turn_detection to null in the client session.update event.
- The client sends input_audio_buffer.append events to append audio to the buffer.
- The client sends an input_audio_buffer.commit event to commit the input audio buffer. The commit creates a new user message item in the conversation.
- The client sends a session.finish event to end the session.
- The server returns an input_audio_buffer.committed event.
- The server returns a conversation.item.input_audio_transcription.text event that contains partial transcription results.
- The server returns a conversation.item.input_audio_transcription.completed event that contains the final transcription result.
- The server returns a session.finished event to signal that recognition is complete. The client must then close the connection.
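The client side of the Manual mode flow can be sketched as an ordered sequence of JSON messages. This is a minimal illustration under stated assumptions: the generator name `manual_mode_events` is hypothetical, the `type` field as the carrier of the event name is assumed, and the base64-encoded `audio` field on input_audio_buffer.append is a guess at the payload shape that this page does not confirm. Only the event names and the `session.turn_detection` null setting come from the text above.

```python
import base64
import json


def manual_mode_events(audio_chunks):
    """Yield the client messages for one manually committed utterance."""
    # Disable server-side turn detection to switch to Manual mode.
    yield json.dumps(
        {"type": "session.update", "session": {"turn_detection": None}}
    )
    for chunk in audio_chunks:
        # ASSUMPTION: audio bytes travel base64-encoded in an "audio" field.
        yield json.dumps(
            {
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            }
        )
    # Mark the end of the utterance, then end the session.
    yield json.dumps({"type": "input_audio_buffer.commit"})
    yield json.dumps({"type": "session.finish"})
```

Each yielded string would be sent as one WebSocket text message, in order. After session.finish, the client keeps reading until session.finished arrives and then closes the connection.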