Use the DashScope Java SDK to call Qwen-ASR-Realtime.
User guide: For model overview, features, and complete sample code, see Real-time speech recognition - Qwen.
Prerequisites
- DashScope SDK 2.22.5 or later (Install the SDK)
- Understand the interaction flow between client and server
Model Studio has released a workspace-specific domain for the Singapore region: wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com. The new dedicated domain delivers superior performance and higher stability for inference requests. We recommend migrating from wss://dashscope-intl.aliyuncs.com to the new domain.
{WorkspaceId} is your workspace ID, which can be found on the Workspace Details page in the Model Studio console. The existing domain remains fully functional.
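As a small illustration of the substitution described above, the following sketch fills a workspace ID into the host template. The class name `WorkspaceEndpoint` and the ID `ws-example123` are made up for this example. The text above gives only the host, so the sketch stops there and does not guess at any path component.

```java
/** Builds the workspace-specific WebSocket host for the Singapore region. */
final class WorkspaceEndpoint {
    // Host template quoted from the text above; {WorkspaceId} is the placeholder.
    private static final String TEMPLATE =
            "wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com";

    private WorkspaceEndpoint() {}

    /** Substitutes the workspace ID into the host template. */
    static String baseUrl(String workspaceId) {
        if (workspaceId == null || workspaceId.trim().isEmpty()) {
            throw new IllegalArgumentException("workspaceId must not be blank");
        }
        return TEMPLATE.replace("{WorkspaceId}", workspaceId.trim());
    }

    public static void main(String[] args) {
        // "ws-example123" is a hypothetical ID used only for illustration.
        System.out.println(baseUrl("ws-example123"));
    }
}
```

Rejecting a blank ID early avoids producing a malformed host such as `wss://.ap-southeast-1.maas.aliyuncs.com`.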
Interaction modes
Qwen-ASR-Realtime supports two modes for deciding when to process audio:
| Mode | How it works |
|---|---|
| VAD mode (default) | The server detects speech boundaries using voice activity detection (VAD) and decides when to commit the audio buffer for recognition. |
| Manual mode | The client controls when to commit audio by calling |
For details on each mode, see VAD mode and Manual mode.
Request parameters
Connection parameters (OmniRealtimeParam)
Set these parameters with the chained methods of the OmniRealtimeParam class.
| Parameter | Type | Required | Description |
|---|---|---|---|
| | | Yes | The model to use. Example: |
| | | Yes | The service endpoint. China (Beijing): |
| | | No | The API key. |
Session configuration (OmniRealtimeConfig)
Set these parameters with the chained methods of the OmniRealtimeConfig class.
| Parameter | Type | Required | Description |
|---|---|---|---|
| | | Yes | Output modality. Fixed to |
| | | No | Enables server-side VAD. When disabled, call |
| | | No | VAD type. Fixed to |
| | | No | VAD sensitivity threshold. Recommended value: Default: Lower values increase sensitivity (may trigger on background noise). Higher values reduce sensitivity and help avoid false triggers in noisy environments. |
| | | No | Silence duration in milliseconds that marks the end of an utterance. Recommended value: Default: Shorter durations (e.g., 300 ms) speed up responses but may split natural pauses. Longer durations (e.g., 1200 ms) handle pauses better but increase latency. |
| | | No | Speech recognition settings. See Transcription parameters. |
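To make the silence-duration trade-off concrete, here is a toy client-side simulation. It is not the server's VAD algorithm, which this reference does not describe. It only assumes that audio is already labelled frame by frame as speech or silence, and it ends an utterance once consecutive silence reaches the configured duration. The class `SilenceSegmenter` and the 100 ms frame size are assumptions made for the example.

```java
/** Toy model of silence-based utterance splitting (illustrative only). */
final class SilenceSegmenter {
    private SilenceSegmenter() {}

    /**
     * Counts utterances in a sequence of per-frame speech flags.
     * An utterance ends once consecutive silence reaches silenceDurationMs.
     */
    static int countUtterances(boolean[] speechFrames, int frameMs, int silenceDurationMs) {
        int utterances = 0;
        boolean inUtterance = false;
        int silenceMs = 0;
        for (boolean speech : speechFrames) {
            if (speech) {
                if (!inUtterance) {
                    inUtterance = true;
                    utterances++;
                }
                silenceMs = 0; // speech resets the silence counter
            } else if (inUtterance) {
                silenceMs += frameMs;
                if (silenceMs >= silenceDurationMs) {
                    inUtterance = false; // pause was long enough to end the utterance
                    silenceMs = 0;
                }
            }
        }
        return utterances;
    }

    public static void main(String[] args) {
        // 100 ms frames: 500 ms speech, 500 ms pause, 500 ms speech.
        boolean[] frames = new boolean[15];
        for (int i = 0; i < 5; i++) frames[i] = true;
        for (int i = 10; i < 15; i++) frames[i] = true;
        System.out.println(countUtterances(frames, 100, 300));  // prints 2: the pause splits the speech
        System.out.println(countUtterances(frames, 100, 1200)); // prints 1: the pause is bridged
    }
}
```

With a 300 ms setting the 500 ms pause ends the first utterance early, while a 1200 ms setting keeps both bursts together at the cost of waiting longer before a result can be finalized.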
Transcription parameters (OmniRealtimeTranscriptionParam)
Set these parameters with the setter methods of the OmniRealtimeTranscriptionParam class.
| Parameter | Type | Required | Description |
|---|---|---|---|
| | | No | Language of the audio source. For supported languages, see Supported languages. |
| | | No | Audio sampling rate in Hz. Valid values: Default: Setting |
| | | No | Audio encoding format. Valid values: |
| | | No | Context text for contextual biasing. Provide background text, entity vocabularies, or reference material to improve recognition accuracy. Maximum: 10,000 tokens. |
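The sampling rate determines how many bytes a given duration of uncompressed audio occupies, which is useful when choosing how much audio to send at a time. The sketch below does that arithmetic for PCM audio. The values in `main` (16 kHz, 16-bit, mono, 100 ms) are illustrative assumptions and are not confirmed by this reference as valid settings; `PcmChunkSize` is a made-up helper name.

```java
/** Computes the byte size of a span of uncompressed PCM audio. */
final class PcmChunkSize {
    private PcmChunkSize() {}

    /** Bytes needed to hold durationMs of PCM audio. */
    static int bytesFor(int sampleRateHz, int bytesPerSample, int channels, int durationMs) {
        if (sampleRateHz <= 0 || bytesPerSample <= 0 || channels <= 0 || durationMs < 0) {
            throw new IllegalArgumentException("rate, sample size and channels must be positive");
        }
        // samples per second * bytes per sample * channels * seconds
        long bytes = (long) sampleRateHz * bytesPerSample * channels * durationMs / 1000;
        return Math.toIntExact(bytes);
    }

    public static void main(String[] args) {
        // Assumed values for illustration: 16 kHz, 16-bit (2 bytes), mono, 100 ms.
        System.out.println(bytesFor(16000, 2, 1, 100)); // prints 3200
    }
}
```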
Key interfaces
OmniRealtimeConversation
Import: com.alibaba.dashscope.audio.omni.OmniRealtimeConversation
This class manages the WebSocket lifecycle: connecting to the server, sending audio, and ending the session.
Create a conversation
OmniRealtimeConversation conversation =
    new OmniRealtimeConversation(param, callback);
Creates a new conversation instance with the specified connection parameters and callback handler.
Connect to the server
conversation.connect();
Opens a WebSocket connection. The server responds with session.created and session.updated events.
Throws: NoApiKeyException, InterruptedException.
Configure the session
conversation.updateSession(config);
Updates the session configuration after the connection is established. The server responds with a session.updated event. If not called, the server uses default settings.
Send audio data
conversation.appendAudio(audioBase64);
Appends a Base64-encoded audio segment to the server-side audio buffer.
- VAD mode (enableTurnDetection=true): The server detects speech boundaries and decides when to process the buffer.
- Manual mode (enableTurnDetection=false): Audio accumulates in the buffer until you call commit() to trigger recognition.
Each event can contain up to 15 MiB of audio data.
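One way to prepare input for appendAudio is to split the raw audio into fixed-size pieces and Base64-encode each piece. The sketch below does this with the JDK only. `AudioChunkEncoder` is a hypothetical helper, and it assumes the 15 MiB ceiling applies to the encoded payload, which is the more conservative reading of the limit stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Splits raw audio into chunks and Base64-encodes each one. */
final class AudioChunkEncoder {
    // Per-event ceiling quoted from the text: 15 MiB.
    static final int MAX_EVENT_BYTES = 15 * 1024 * 1024;

    private AudioChunkEncoder() {}

    /** Returns one Base64 string per chunkBytes-sized slice of the input. */
    static List<String> encodeChunks(byte[] audio, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        // Padded Base64 output length is 4 * ceil(n / 3).
        long encodedSize = 4L * ((chunkBytes + 2L) / 3);
        if (encodedSize > MAX_EVENT_BYTES) {
            throw new IllegalArgumentException("encoded chunk would exceed 15 MiB");
        }
        List<String> chunks = new ArrayList<>();
        for (int offset = 0; offset < audio.length; offset += chunkBytes) {
            int end = Math.min(audio.length, offset + chunkBytes);
            byte[] slice = Arrays.copyOfRange(audio, offset, end);
            chunks.add(Base64.getEncoder().encodeToString(slice));
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] audio = new byte[7000]; // placeholder bytes standing in for real audio
        System.out.println(encodeChunks(audio, 3200).size()); // prints 3
    }
}
```

Each returned string could then be passed to `conversation.appendAudio(...)` in order.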
Commit the audio buffer
conversation.commit();
Submits the buffered audio for recognition. The server responds with an input_audio_buffer.committed event.
This method is only available in manual mode (enableTurnDetection=false). An error occurs if the audio buffer is empty.
End the session
conversation.endSession(); // synchronous
// or
conversation.endSessionAsync(); // asynchronous
Notifies the server to finish processing any remaining audio and end the session. The server responds with a session.finished event.
When to call:
- VAD mode: After you finish sending audio.
- Manual mode: After you call commit().
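The call order for the two modes can be sketched without a server. The interface `RealtimeSession` below is a hypothetical stand-in that mirrors only the three method names described in this section; it is not the SDK class, and it ignores the checked exceptions the real methods may declare. `RecordingSession` simply records which calls were made.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Demonstrates the call order for VAD mode and manual mode. */
final class SessionFlow {
    private SessionFlow() {}

    /** Sends all chunks, commits only in manual mode, then ends the session. */
    static void sendAll(RealtimeSession session, List<String> chunks, boolean enableTurnDetection) {
        for (String chunk : chunks) {
            session.appendAudio(chunk);
        }
        // Manual mode: the client triggers recognition. Skip when nothing was
        // sent, because committing an empty buffer is an error.
        if (!enableTurnDetection && !chunks.isEmpty()) {
            session.commit();
        }
        session.endSession();
    }

    public static void main(String[] args) {
        RecordingSession manual = new RecordingSession();
        sendAll(manual, Arrays.asList("AAAA", "BBBB"), false);
        System.out.println(manual.calls); // prints [appendAudio, appendAudio, commit, endSession]
    }
}

/** Hypothetical stand-in exposing the three calls described in this section. */
interface RealtimeSession {
    void appendAudio(String audioBase64);
    void commit();
    void endSession();
}

/** Records call names so the ordering can be inspected. */
final class RecordingSession implements RealtimeSession {
    final List<String> calls = new ArrayList<>();
    @Override public void appendAudio(String audioBase64) { calls.add("appendAudio"); }
    @Override public void commit() { calls.add("commit"); }
    @Override public void endSession() { calls.add("endSession"); }
}
```

In VAD mode (`enableTurnDetection` set to true) the same helper skips `commit()` and goes straight from the last `appendAudio` to `endSession`.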
Close the connection
conversation.close();
Stops the task and closes the WebSocket connection immediately.
Get session and response IDs
String sessionId = conversation.getSessionId();
String responseId = conversation.getResponseId();
- getSessionId() returns the session ID for the current task.
- getResponseId() returns the response ID of the most recent server response.
OmniRealtimeCallback
Import: com.alibaba.dashscope.audio.omni.OmniRealtimeCallback
Extend this class and implement the callback methods to handle server events.
| Method | Parameters | Triggered when |
|---|---|---|
| | None | The WebSocket connection is established. |
| | | A server event is received. Parse the |
| | | The WebSocket connection is closed. |
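Inside the event callback, a handler typically branches on the event type. The sketch below covers only the four event names that appear in this reference and assumes the type has already been extracted from the received message as a string; how to extract it is not shown here. `ServerEvents` is a hypothetical helper, not part of the SDK.

```java
/** Maps server event types named in this reference to short descriptions. */
final class ServerEvents {
    private ServerEvents() {}

    /** Returns a description for known event types, or a fallback for others. */
    static String describe(String eventType) {
        if (eventType == null) {
            return "unhandled event: null";
        }
        switch (eventType) {
            case "session.created":
                return "connection accepted, session opened";
            case "session.updated":
                return "session configuration applied";
            case "input_audio_buffer.committed":
                return "audio buffer submitted for recognition";
            case "session.finished":
                return "session ended by the server";
            default:
                // Other event types exist but are not listed in this section.
                return "unhandled event: " + eventType;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("session.finished"));
    }
}
```

Keeping a fallback branch means event types that are not listed here are surfaced rather than silently dropped.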