Interaction flow for real-time speech recognition (Qwen-ASR-Realtime)

Qwen-ASR-Realtime receives audio streams and transcribes speech in real time over WebSocket. The service supports two interaction modes: VAD mode and Manual mode.

User guide: For model overviews and selection guidance, see Speech-to-text. For sample code, see Real-time speech recognition.

Service endpoint

Connect to the WebSocket URL for your region. The model query parameter selects the model; replace <model_name> with the name of the model to use:

China (Beijing)

wss://dashscope.aliyuncs.com/api-ws/v1/realtime?model=<model_name>

Singapore

wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime?model=<model_name>

Replace WorkspaceId with your actual Workspace ID.
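
As a minimal sketch, the two URL patterns above can be assembled with a small helper. The function name build_realtime_url, the region keys, and the model name my-model used in the usage note are hypothetical placeholders for illustration, not names defined by the service.

```python
from urllib.parse import urlencode

# Hosts and path taken from the endpoint list above.
BEIJING_HOST = "dashscope.aliyuncs.com"
SINGAPORE_HOST_TEMPLATE = "{workspace_id}.ap-southeast-1.maas.aliyuncs.com"
REALTIME_PATH = "/api-ws/v1/realtime"


def build_realtime_url(model_name, region="beijing", workspace_id=None):
    """Return the wss:// URL for the given region and model (hypothetical helper)."""
    if region == "beijing":
        host = BEIJING_HOST
    elif region == "singapore":
        # The Singapore domain is workspace-specific, so the ID is mandatory here.
        if not workspace_id:
            raise ValueError("workspace_id is required for the Singapore endpoint")
        host = SINGAPORE_HOST_TEMPLATE.format(workspace_id=workspace_id)
    else:
        raise ValueError(f"unknown region: {region!r}")
    # urlencode escapes the model name so it is safe inside the query string.
    return f"wss://{host}{REALTIME_PATH}?{urlencode({'model': model_name})}"
```

For example, build_realtime_url("my-model", region="singapore", workspace_id="ws123") yields a URL on the ws123.ap-southeast-1.maas.aliyuncs.com host, where both my-model and ws123 stand in for your real values.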

Important

Model Studio has released a workspace-specific domain for the Singapore region: wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com. The new dedicated domain delivers superior performance and higher stability for inference requests. We recommend migrating from wss://dashscope-intl.aliyuncs.com to the new domain.

{WorkspaceId} is your workspace ID, which can be found on the Workspace Details page in the Model Studio console. The existing domain remains fully functional.

Important

Use the wss:// scheme. Set authorization in the request headers (see Request headers). Specify the model through the model query parameter.

Request headers

Include the following fields in the request headers:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| Authorization | string | Yes | Authentication token in the format Bearer <your_api_key>. Replace <your_api_key> with your API key. |
| user-agent | string | No | Client identifier that helps the server track request sources. |
| X-DashScope-WorkSpace | string | No | The Alibaba Cloud Model Studio workspace ID. |
| X-DashScope-DataInspection | string | No | Whether to enable data inspection. Omit this header unless data inspection is required; if it is, set the value to enable. |

Important

Authorization is verified during the WebSocket handshake. If the API key is invalid or missing, the handshake fails with an HTTP 401 or 403 error.
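
The header rules above can be sketched as a small helper that always sets Authorization and adds the optional fields only when requested. The function name build_request_headers and its parameters are hypothetical; how the resulting dictionary is passed to the handshake depends on the WebSocket client library you use.

```python
def build_request_headers(api_key, user_agent=None, workspace_id=None,
                          data_inspection=False):
    """Build handshake headers for the realtime endpoint (hypothetical helper)."""
    if not api_key:
        # A missing key would make the handshake fail with HTTP 401 or 403,
        # so reject it before opening the connection.
        raise ValueError("api_key is required")

    headers = {"Authorization": f"Bearer {api_key}"}

    # Optional headers are omitted entirely unless a value is supplied.
    if user_agent:
        headers["user-agent"] = user_agent
    if workspace_id:
        headers["X-DashScope-WorkSpace"] = workspace_id
    if data_inspection:
        headers["X-DashScope-DataInspection"] = "enable"
    return headers
```

Checking for an empty key on the client side is a convenience only; the server still performs the actual verification during the handshake.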

Interaction flows

For details about client and server events, see Client events for Qwen-ASR-Realtime and Server events.

Qwen-ASR-Realtime supports two interaction modes:

  • VAD mode (default): The server uses voice activity detection (VAD) to automatically detect the start and end of each utterance. Use this mode for real-time conversations, meeting transcription, and similar scenarios.

  • Manual mode: The client controls utterance boundaries. Use this mode when the client can determine those boundaries explicitly, such as sending a voice message in a messaging app.

VAD mode (default)

The server automatically detects the start and end of each utterance. Send the audio stream continuously; the server returns the final transcription for each utterance once it detects the end. Use this mode for real-time conversations, meeting transcription, and similar scenarios.

To enable: Configure the session.turn_detection parameter in the client session.update event.
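
A rough sketch of building such a session.update message follows. The JSON envelope (a type field naming the event and a session object) and the {"type": "server_vad"} value are assumptions made for illustration; consult the client events reference for the actual schema and the supported turn_detection fields.

```python
import json


def build_session_update(turn_detection):
    """Serialize a session.update event carrying session.turn_detection.

    The envelope layout is an assumption, not a confirmed schema.
    """
    event = {
        "type": "session.update",
        "session": {"turn_detection": turn_detection},
    }
    return json.dumps(event)


# Assumed VAD configuration object; the real field names may differ.
vad_message = build_session_update({"type": "server_vad"})
```

The resulting string would be sent as a text frame on the open WebSocket before streaming audio.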

Manual mode

The client controls utterance boundaries. After sending the audio for a complete utterance, the client sends an input_audio_buffer.commit event to notify the server. Use this mode when the client can determine those boundaries explicitly, such as sending a voice message in a messaging app.

To enable: Set session.turn_detection to null in the client session.update event.
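
The two control messages that Manual mode relies on can be sketched as follows. The JSON envelope (a type field plus a session object) is an assumption for illustration, and the helper names are hypothetical; the event names session.update and input_audio_buffer.commit come from the description above. Sending the audio itself is not shown here.

```python
import json


def build_manual_mode_update():
    """session.update that disables server-side turn detection.

    Python's None serializes to JSON null, which is the value Manual mode needs.
    """
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},
    })


def build_commit():
    """Event the client sends after the audio for one utterance is complete."""
    return json.dumps({"type": "input_audio_buffer.commit"})


# Order of control messages for one utterance (audio frames go in between).
control_messages = [build_manual_mode_update(), build_commit()]
```

In a real client, the first message would be sent once after connecting, and the commit would follow each complete utterance.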
