Java SDK

User guide: For model overview, features, and complete sample code, see Real-time speech recognition - Qwen.

Prerequisites

DashScope SDK 2.22.5 or later (Install the SDK)
Obtain an API key
Understand the interaction flow between client and server

Important

Alibaba Cloud Model Studio has released workspace-specific domains for the China (Beijing) and Singapore regions. The new dedicated domains deliver superior performance and higher stability for inference requests. We recommend migrating to the new domains:

China (Beijing): from dashscope.aliyuncs.com to {WorkspaceId}.cn-beijing.maas.aliyuncs.com
Singapore: from dashscope-intl.aliyuncs.com to {WorkspaceId}.ap-southeast-1.maas.aliyuncs.com

Replace {WorkspaceId} with your actual Workspace ID. The existing domains remain fully functional.

Interaction modes

Qwen-ASR-Realtime supports two modes for deciding when to process audio:

Mode	`enableTurnDetection`	How it works
VAD mode (default)	`true`	The server detects speech boundaries using voice activity detection (VAD) and decides when to commit the audio buffer for recognition.
Manual mode	`false`	The client controls when to commit audio by calling `commit()`. This gives you full control over segmentation.

For details on each mode, see VAD mode and Manual mode.

Request parameters

Connection parameters (OmniRealtimeParam)

Set these parameters with the chained builder methods of the OmniRealtimeParam class.

Sample code:

OmniRealtimeParam param = OmniRealtimeParam.builder()
        .model("qwen3-asr-flash-realtime")
        // The following URL is for the China (Beijing) region. The URLs vary by region.
        .url("wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/realtime")
        // The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit https://help.aliyun.com/en/model-studio/get-api-key.
        // If you have not configured an environment variable, replace the following line with .apikey("sk-xxx") and use your Model Studio API key.
        .apikey(System.getenv("DASHSCOPE_API_KEY"))
        .build();

Parameter	Type	Required	Description
`model`	`String`	Yes	The model to use. Example: `qwen3-asr-flash-realtime`.
`url`	`String`	Yes	The service endpoint. China (Beijing): `wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/realtime`. Singapore: `wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime`. Replace `{WorkspaceId}` with your actual Workspace ID.
`apikey`	`String`	No	The API key.
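
The two endpoint patterns above differ only in the region segment, so the URL can be assembled from a workspace ID and a region. The following is a minimal sketch: `RealtimeEndpoint` and its constants are hypothetical helpers, not part of the DashScope SDK, and `ws-example` is a placeholder workspace ID.

```java
import java.util.Objects;

/** Hypothetical helper (not part of the DashScope SDK) that builds the realtime endpoint URL. */
final class RealtimeEndpoint {
    // Region segments as they appear in the endpoint patterns above.
    static final String CN_BEIJING = "cn-beijing";
    static final String AP_SOUTHEAST_1 = "ap-southeast-1";

    private RealtimeEndpoint() {}

    /** Returns wss://{workspaceId}.{region}.maas.aliyuncs.com/api-ws/v1/realtime. */
    static String forWorkspace(String workspaceId, String region) {
        Objects.requireNonNull(workspaceId, "workspaceId");
        Objects.requireNonNull(region, "region");
        // Reject an empty ID or an unreplaced {WorkspaceId} placeholder.
        if (workspaceId.isEmpty() || workspaceId.contains("{")) {
            throw new IllegalArgumentException("Replace {WorkspaceId} with a real workspace ID");
        }
        return "wss://" + workspaceId + "." + region + ".maas.aliyuncs.com/api-ws/v1/realtime";
    }

    public static void main(String[] args) {
        // Prints: wss://ws-example.cn-beijing.maas.aliyuncs.com/api-ws/v1/realtime
        System.out.println(forWorkspace("ws-example", CN_BEIJING));
    }
}
```

The returned string can be passed to `.url(...)` when building `OmniRealtimeParam`.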

Session configuration (OmniRealtimeConfig)

Set these parameters with the chained builder methods of the OmniRealtimeConfig class.

Sample code:

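import java.util.Collections;
import com.alibaba.dashscope.audio.omni.*;

// Transcription settings: Chinese audio, 16 kHz PCM input.
OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputSampleRate(16000);
transcriptionParam.setInputAudioFormat("pcm");

OmniRealtimeConfig config = OmniRealtimeConfig.builder()
        // Text is the only supported output modality.
        .modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
        // VAD mode: the server decides when to commit the audio buffer.
        .enableTurnDetection(true)
        .turnDetectionType("server_vad")
        // Recommended values; the defaults are 0.2 and 800 ms.
        .turnDetectionThreshold(0.0f)
        .turnDetectionSilenceDurationMs(400)
        .transcriptionConfig(transcriptionParam)
        .build();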

Parameter	Type	Required	Description
`modalities`	`List<OmniRealtimeModality>`	Yes	Output modality. Fixed to `[OmniRealtimeModality.TEXT]`.
`enableTurnDetection`	`boolean`	No	Enables server-side VAD. When disabled, call `commit()` to trigger recognition manually. Default: `true`.
`turnDetectionType`	`String`	No	VAD type. Fixed to `server_vad`.
`turnDetectionThreshold`	`float`	No	VAD sensitivity threshold. Recommended value: `0.0`. Default: `0.2`. Valid range: `[-1, 1]`. Lower values increase sensitivity (may trigger on background noise). Higher values reduce sensitivity and help avoid false triggers in noisy environments.
`turnDetectionSilenceDurationMs`	`int`	No	Silence duration in milliseconds that marks the end of an utterance. Recommended value: `400`. Default: `800`. Valid range: `[200, 6000]`. Shorter durations (e.g., 300 ms) speed up responses but may split natural pauses. Longer durations (e.g., 1200 ms) handle pauses better but increase latency.
`transcriptionConfig`	`OmniRealtimeTranscriptionParam`	No	Speech recognition settings. See Transcription parameters.
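
The valid ranges in the table can be checked on the client before the configuration is sent. This is only a sketch: `TurnDetectionLimits` is a hypothetical helper, not an SDK class, and the document does not say how the server reacts to out-of-range values.

```java
/** Hypothetical pre-flight check; the ranges are the ones listed in the table above. */
final class TurnDetectionLimits {
    static final float MIN_THRESHOLD = -1.0f;
    static final float MAX_THRESHOLD = 1.0f;
    static final int MIN_SILENCE_MS = 200;
    static final int MAX_SILENCE_MS = 6000;

    private TurnDetectionLimits() {}

    /** Returns the threshold unchanged if it lies in [-1, 1]; otherwise throws. */
    static float checkThreshold(float threshold) {
        if (Float.isNaN(threshold) || threshold < MIN_THRESHOLD || threshold > MAX_THRESHOLD) {
            throw new IllegalArgumentException("turnDetectionThreshold must be in [-1, 1]: " + threshold);
        }
        return threshold;
    }

    /** Returns the duration unchanged if it lies in [200, 6000] ms; otherwise throws. */
    static int checkSilenceDurationMs(int ms) {
        if (ms < MIN_SILENCE_MS || ms > MAX_SILENCE_MS) {
            throw new IllegalArgumentException("turnDetectionSilenceDurationMs must be in [200, 6000]: " + ms);
        }
        return ms;
    }

    public static void main(String[] args) {
        // The recommended values from the table pass both checks.
        System.out.println(checkThreshold(0.0f));
        System.out.println(checkSilenceDurationMs(400));
    }
}
```

A checked value can be passed straight into the builder, for example `.turnDetectionSilenceDurationMs(TurnDetectionLimits.checkSilenceDurationMs(400))`.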

Transcription parameters (OmniRealtimeTranscriptionParam)

Set these parameters with the setter methods of the OmniRealtimeTranscriptionParam class.

Sample code:

OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputSampleRate(16000);
transcriptionParam.setInputAudioFormat("pcm");

Parameter	Type	Required	Description
`language`	`String`	No	Language of the audio source. For supported languages, see Supported languages.
`inputSampleRate`	`int`	No	Audio sampling rate in Hz. Valid values: `16000`, `8000`. Default: `16000`. Setting `8000` causes server-side upsampling to 16,000 Hz, which may introduce minor latency. Use only for natively 8,000 Hz audio (e.g., telephony).
`inputAudioFormat`	`String`	No	Audio encoding format. Valid values: `pcm`, `opus`. Default: `pcm`.
`corpusText`	`String`	No	Context text for contextual biasing. Provide background text, entity vocabularies, or reference material to improve recognition accuracy. Maximum: 10,000 tokens.

Key interfaces

OmniRealtimeConversation

Import: com.alibaba.dashscope.audio.omni.OmniRealtimeConversation

This class manages the WebSocket lifecycle: connecting to the server, sending audio, and ending the session.

Create a conversation

OmniRealtimeConversation conversation =
        new OmniRealtimeConversation(param, callback);

Creates a new conversation instance with the specified connection parameters and callback handler.

Connect to the server

conversation.connect();

Opens a WebSocket connection. The server responds with session.created and session.updated events.

Throws: NoApiKeyException, InterruptedException.

Configure the session

conversation.updateSession(config);

Updates the session configuration after the connection is established. The server responds with a session.updated event. If not called, the server uses default settings.

Send audio data

conversation.appendAudio(audioBase64);

Appends a Base64-encoded audio segment to the server-side audio buffer. Each append event can contain up to 15 MiB of audio data.

VAD mode (enableTurnDetection=true): The server detects speech boundaries and decides when to process the buffer.
Manual mode (enableTurnDetection=false): Audio accumulates in the buffer until you call commit() to trigger recognition.
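
Because appendAudio() expects Base64 text, raw audio bytes have to be split and encoded first. The sketch below assumes 16 kHz, 16-bit, mono PCM and an arbitrary chunk size of 100 ms (3,200 bytes); the document does not prescribe a chunk size, and `AudioChunker` is a hypothetical helper rather than an SDK class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Hypothetical helper that turns raw PCM bytes into Base64 strings for appendAudio(). */
final class AudioChunker {
    // Assumed chunk size: 100 ms of 16 kHz, 16-bit, mono PCM
    // = 16000 samples/s * 2 bytes/sample * 0.1 s = 3200 bytes.
    static final int CHUNK_BYTES = 3200;

    private AudioChunker() {}

    /** Splits pcm into chunks of at most chunkBytes and Base64-encodes each chunk. */
    static List<String> toBase64Chunks(byte[] pcm, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<String> chunks = new ArrayList<>();
        Base64.Encoder encoder = Base64.getEncoder();
        for (int offset = 0; offset < pcm.length; offset += chunkBytes) {
            int end = Math.min(offset + chunkBytes, pcm.length);
            chunks.add(encoder.encodeToString(Arrays.copyOfRange(pcm, offset, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] silence = new byte[8000]; // 250 ms of silence under the assumptions above
        List<String> chunks = toBase64Chunks(silence, CHUNK_BYTES);
        // In a real client, each string would be passed to conversation.appendAudio(chunk).
        System.out.println(chunks.size()); // prints 3 (3200 + 3200 + 1600 bytes)
    }
}
```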

Commit the audio buffer

conversation.commit();

Submits the buffered audio for recognition. The server responds with an input_audio_buffer.committed event.

This method is only available in manual mode (enableTurnDetection=false). An error occurs if the audio buffer is empty.

End the session

conversation.endSession();  // synchronous
// or
conversation.endSessionAsync();  // asynchronous

Notifies the server to finish processing any remaining audio and end the session. The server responds with a session.finished event.

When to call:

VAD mode: After you finish sending audio.
Manual mode: After you call commit().
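
When the session is ended with endSessionAsync(), the caller may want to wait for the session.finished event before calling close(). One way to do that is a latch that the event callback releases. This is a sketch under the assumption that endSessionAsync() returns before session.finished arrives; `FinishGate` is a hypothetical helper, and the server is simulated with a plain thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that lets the main thread wait for the session.finished event. */
final class FinishGate {
    private final CountDownLatch finished = new CountDownLatch(1);

    /** Call from the event callback with the value of the event's "type" field. */
    void onEventType(String type) {
        if ("session.finished".equals(type)) {
            finished.countDown();
        }
    }

    /** Returns true if session.finished was seen before the timeout elapsed. */
    boolean awaitFinished(long timeout, TimeUnit unit) {
        try {
            return finished.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        FinishGate gate = new FinishGate();
        // Stand-in for the SDK delivering the server event on its own thread.
        new Thread(() -> gate.onEventType("session.finished")).start();
        boolean done = gate.awaitFinished(5, TimeUnit.SECONDS);
        System.out.println("finished=" + done);
        // A real client would call conversation.close() at this point.
    }
}
```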

Close the connection

conversation.close();

Stops the task and closes the WebSocket connection immediately.

Get session and response IDs

String sessionId = conversation.getSessionId();
String responseId = conversation.getResponseId();

getSessionId() returns the session ID for the current task.
getResponseId() returns the response ID of the most recent server response.

OmniRealtimeCallback

Import: com.alibaba.dashscope.audio.omni.OmniRealtimeCallback

Extend this class and override the callback methods to handle server events.

Method	Parameters	Triggered when
`onOpen()`	None	The WebSocket connection is established.
`onEvent(JsonObject message)`	`message`: A server event as JSON. Common event types: `session.created`, `session.updated`, `input_audio_buffer.committed`, `conversation.item.input_audio_transcription.completed`, `session.finished`.	A server event is received. Parse the `type` field to determine the event type.
`onClose(int code, String reason)`	`code`: Status code. `reason`: Reason for closing.	The WebSocket connection is closed.
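
Inside onEvent, handling usually branches on the `type` field. The sketch below shows only that branching, using the event types listed in the table. `EventTypes` and its `Action` values are hypothetical names. In a real callback the type string would come from the message, for example `message.get("type").getAsString()`, assuming `JsonObject` is Gson's `com.google.gson.JsonObject`.

```java
/** Hypothetical dispatcher that maps a server event type to a client-side action. */
final class EventTypes {
    enum Action { LOG_SESSION, BUFFER_COMMITTED, FINAL_TRANSCRIPT, SESSION_DONE, IGNORE }

    private EventTypes() {}

    /** Classifies the event types listed in the callback table; anything else is ignored. */
    static Action classify(String type) {
        if (type == null) {
            return Action.IGNORE;
        }
        switch (type) {
            case "session.created":
            case "session.updated":
                return Action.LOG_SESSION;
            case "input_audio_buffer.committed":
                return Action.BUFFER_COMMITTED;
            case "conversation.item.input_audio_transcription.completed":
                return Action.FINAL_TRANSCRIPT;
            case "session.finished":
                return Action.SESSION_DONE;
            default:
                return Action.IGNORE;
        }
    }

    public static void main(String[] args) {
        // Prints: FINAL_TRANSCRIPT
        System.out.println(classify("conversation.item.input_audio_transcription.completed"));
    }
}
```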