Qwen-ASR-Realtime Java SDK - API reference

Use the DashScope Java SDK to call Qwen-ASR-Realtime.

User guide: For model overview, features, and complete sample code, see Real-time speech recognition - Qwen.

Prerequisites

Important

Model Studio has released a workspace-specific domain for the Singapore region: wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com. The new dedicated domain delivers superior performance and higher stability for inference requests. We recommend migrating from wss://dashscope-intl.aliyuncs.com to the new domain.

{WorkspaceId} is your workspace ID, which can be found on the Workspace Details page in the Model Studio console. The existing domain remains fully functional.
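The {WorkspaceId} substitution could be wrapped in a small helper like the one below. This is only a sketch: the class name RealtimeEndpoints and its methods are hypothetical and not part of the DashScope SDK, and the /api-ws/v1/realtime path is taken from the url parameter description later in this document.

```java
// Hypothetical helper, not part of the DashScope SDK.
final class RealtimeEndpoints {
    // Path taken from the url parameter description in this document.
    private static final String PATH = "/api-ws/v1/realtime";

    private RealtimeEndpoints() {}

    /** Builds the workspace-specific Singapore endpoint for the given workspace ID. */
    static String singapore(String workspaceId) {
        if (workspaceId == null || workspaceId.trim().isEmpty()) {
            throw new IllegalArgumentException("workspaceId must not be empty");
        }
        return "wss://" + workspaceId.trim() + ".ap-southeast-1.maas.aliyuncs.com" + PATH;
    }

    /** Returns the China (Beijing) endpoint. */
    static String beijing() {
        return "wss://dashscope.aliyuncs.com" + PATH;
    }
}
```

The returned string is what you would pass to the url(...) method when building the connection parameters.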

Interaction modes

Qwen-ASR-Realtime supports two modes for deciding when to process audio:

| Mode | enableTurnDetection | How it works |
| --- | --- | --- |
| VAD mode (default) | true | The server detects speech boundaries using voice activity detection (VAD) and decides when to commit the audio buffer for recognition. |
| Manual mode | false | The client controls when to commit audio by calling commit(). This gives you full control over segmentation. |

For details on each mode, see VAD mode and Manual mode.

Request parameters

Connection parameters (OmniRealtimeParam)

Set these parameters with the chained methods of the OmniRealtimeParam class.

OmniRealtimeParam param = OmniRealtimeParam.builder()
        .model("qwen3-asr-flash-realtime")
        // The following URL is for the China (Beijing) region. The URLs vary by region.
        .url("wss://dashscope.aliyuncs.com/api-ws/v1/realtime")
        // The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit https://help.aliyun.com/en/model-studio/get-api-key.
        // If you have not configured an environment variable, replace the following line with .apikey("sk-xxx") and use your Model Studio API key.
        .apikey(System.getenv("DASHSCOPE_API_KEY"))
        .build();

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | String | Yes | The model to use. Example: qwen3-asr-flash-realtime. |
| url | String | Yes | The service endpoint. China (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/realtime. Singapore: wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime. Replace {WorkspaceId} with your workspace ID. |
| apikey | String | No | The API key. |

Session configuration (OmniRealtimeConfig)

Set these parameters with the chained methods of the OmniRealtimeConfig class.

OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputSampleRate(16000);
transcriptionParam.setInputAudioFormat("pcm");

OmniRealtimeConfig config = OmniRealtimeConfig.builder()
        .modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
        .enableTurnDetection(true)
        .turnDetectionType("server_vad")
        .turnDetectionThreshold(0.0f)
        .turnDetectionSilenceDurationMs(400)
        .transcriptionConfig(transcriptionParam)
        .build();

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| modalities | List<OmniRealtimeModality> | Yes | Output modality. Fixed to [OmniRealtimeModality.TEXT]. |
| enableTurnDetection | boolean | No | Enables server-side VAD. When disabled, call commit() to trigger recognition manually. Default: true. |
| turnDetectionType | String | No | VAD type. Fixed to server_vad. |
| turnDetectionThreshold | float | No | VAD sensitivity threshold. Recommended value: 0.0. Default: 0.2. Valid range: [-1, 1]. Lower values increase sensitivity (may trigger on background noise). Higher values reduce sensitivity and help avoid false triggers in noisy environments. |
| turnDetectionSilenceDurationMs | int | No | Silence duration in milliseconds that marks the end of an utterance. Recommended value: 400. Default: 800. Valid range: [200, 6000]. Shorter durations (e.g., 300 ms) speed up responses but may split natural pauses. Longer durations (e.g., 1200 ms) handle pauses better but increase latency. |
| transcriptionConfig | OmniRealtimeTranscriptionParam | No | Speech recognition settings. See Transcription parameters. |
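The valid ranges listed above can be checked on the client before the session configuration is built. The following is a minimal sketch under that assumption: VadSettings and its methods are hypothetical names, not SDK classes, and the document does not say how the server reacts to out-of-range values.

```java
// Hypothetical pre-flight checks; the ranges and defaults come from the parameter table above.
final class VadSettings {
    static final float DEFAULT_THRESHOLD = 0.2f;
    static final int DEFAULT_SILENCE_DURATION_MS = 800;

    private VadSettings() {}

    /** Returns the threshold unchanged if it lies in [-1, 1]; otherwise throws. */
    static float checkThreshold(float threshold) {
        if (Float.isNaN(threshold) || threshold < -1.0f || threshold > 1.0f) {
            throw new IllegalArgumentException("turnDetectionThreshold must be in [-1, 1]: " + threshold);
        }
        return threshold;
    }

    /** Returns the silence duration unchanged if it lies in [200, 6000] ms; otherwise throws. */
    static int checkSilenceDurationMs(int silenceDurationMs) {
        if (silenceDurationMs < 200 || silenceDurationMs > 6000) {
            throw new IllegalArgumentException(
                    "turnDetectionSilenceDurationMs must be in [200, 6000]: " + silenceDurationMs);
        }
        return silenceDurationMs;
    }
}
```

The checked values can then be passed to turnDetectionThreshold(...) and turnDetectionSilenceDurationMs(...), so that a misconfiguration fails fast on the client.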

Transcription parameters (OmniRealtimeTranscriptionParam)

Set these parameters with the setter methods of the OmniRealtimeTranscriptionParam class.

OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputSampleRate(16000);
transcriptionParam.setInputAudioFormat("pcm");

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| language | String | No | Language of the audio source. For supported languages, see Supported languages. |
| inputSampleRate | int | No | Audio sampling rate in Hz. Valid values: 16000, 8000. Default: 16000. Setting 8000 causes server-side upsampling to 16,000 Hz, which may introduce minor latency. Use only for natively 8,000 Hz audio (e.g., telephony). |
| inputAudioFormat | String | No | Audio encoding format. Valid values: pcm, opus. Default: pcm. |
| corpusText | String | No | Context text for contextual biasing. Provide background text, entity vocabularies, or reference material to improve recognition accuracy. Maximum: 10,000 tokens. |
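When streaming raw PCM, the sample rate determines how many bytes make up a chunk of a given duration. The sketch below assumes 16-bit mono samples, which this document does not state, so treat the constant as an assumption to verify against your audio source. PcmChunks is a hypothetical helper, not an SDK class.

```java
// Hypothetical helper for sizing PCM chunks.
final class PcmChunks {
    // Assumption: 16-bit (2-byte) samples, single channel. Not stated in this document.
    private static final int BYTES_PER_SAMPLE = 2;

    private PcmChunks() {}

    /** Returns the number of PCM bytes covering durationMs at the given sample rate. */
    static int bytesForDuration(int sampleRateHz, int durationMs) {
        if (sampleRateHz != 16000 && sampleRateHz != 8000) {
            throw new IllegalArgumentException("inputSampleRate must be 16000 or 8000: " + sampleRateHz);
        }
        if (durationMs <= 0) {
            throw new IllegalArgumentException("durationMs must be positive: " + durationMs);
        }
        // Use long arithmetic to avoid overflow for long durations.
        return (int) ((long) sampleRateHz * BYTES_PER_SAMPLE * durationMs / 1000);
    }
}
```

Under these assumptions, 100 ms of audio is 3200 bytes at 16000 Hz and 1600 bytes at 8000 Hz.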

Key interfaces

OmniRealtimeConversation

Import: com.alibaba.dashscope.audio.omni.OmniRealtimeConversation

This class manages the WebSocket lifecycle: connecting to the server, sending audio, and ending the session.

Create a conversation

OmniRealtimeConversation conversation =
        new OmniRealtimeConversation(param, callback);

Creates a new conversation instance with the specified connection parameters and callback handler.

Connect to the server

conversation.connect();

Opens a WebSocket connection. The server responds with session.created and session.updated events.

Throws: NoApiKeyException, InterruptedException.

Configure the session

conversation.updateSession(config);

Updates the session configuration after the connection is established. The server responds with a session.updated event. If not called, the server uses default settings.

Send audio data

conversation.appendAudio(audioBase64);

Appends a Base64-encoded audio segment to the server-side audio buffer.

  • VAD mode (enableTurnDetection=true): The server detects speech boundaries and decides when to process the buffer.

  • Manual mode (enableTurnDetection=false): Audio accumulates in the buffer until you call commit() to trigger recognition. Each event can contain up to 15 MiB of audio data.
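Preparing audio for appendAudio() amounts to splitting the raw bytes into chunks and Base64-encoding each one. The following sketch uses only the Java standard library. AudioChunkEncoder is a hypothetical helper, and because this document does not say whether the 15 MiB limit applies to the raw or the encoded size, the sketch conservatively checks the encoded size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical helper, not part of the DashScope SDK.
final class AudioChunkEncoder {
    // 15 MiB per event, as stated above.
    static final long MAX_EVENT_BYTES = 15L * 1024 * 1024;

    private AudioChunkEncoder() {}

    /** Splits audio into chunks of at most chunkBytes and Base64-encodes each chunk. */
    static List<String> encodeChunks(byte[] audio, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        // Base64 expands every 3 input bytes to 4 output characters (with padding).
        long encodedLength = 4L * ((chunkBytes + 2L) / 3L);
        if (encodedLength > MAX_EVENT_BYTES) {
            throw new IllegalArgumentException("chunk exceeds the 15 MiB per-event limit");
        }
        List<String> chunks = new ArrayList<>();
        for (int offset = 0; offset < audio.length; offset += chunkBytes) {
            int end = Math.min(audio.length, offset + chunkBytes);
            byte[] slice = Arrays.copyOfRange(audio, offset, end);
            chunks.add(Base64.getEncoder().encodeToString(slice));
        }
        return chunks;
    }
}
```

Each returned string can be passed to conversation.appendAudio(...) in order.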

Commit the audio buffer

conversation.commit();

Submits the buffered audio for recognition. The server responds with an input_audio_buffer.committed event.

This method is only available in manual mode (enableTurnDetection=false). An error occurs if the audio buffer is empty.

End the session

conversation.endSession();  // synchronous
// or
conversation.endSessionAsync();  // asynchronous

Notifies the server to finish processing any remaining audio and end the session. The server responds with a session.finished event.

When to call:

  • VAD mode: After you finish sending audio.

  • Manual mode: After you call commit().
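If you use endSessionAsync(), you may want to wait for the session.finished event before closing the connection. One way to do that, sketched here with the standard library only, is a latch that the event callback releases. SessionFinishedGate is a hypothetical helper, and the timeout handling is an illustrative choice rather than documented SDK behavior.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the DashScope SDK.
final class SessionFinishedGate {
    private final CountDownLatch finished = new CountDownLatch(1);

    /** Call this from the event callback with the value of the event's type field. */
    void onEventType(String type) {
        if ("session.finished".equals(type)) {
            finished.countDown();
        }
    }

    /** Waits up to timeoutMs for session.finished; returns false on timeout or interruption. */
    boolean await(long timeoutMs) {
        try {
            return finished.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller would invoke endSessionAsync(), then await(...) on the gate, and call close() afterward.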

Close the connection

conversation.close();

Stops the task and closes the WebSocket connection immediately.

Get session and response IDs

String sessionId = conversation.getSessionId();
String responseId = conversation.getResponseId();

  • getSessionId() returns the session ID for the current task.

  • getResponseId() returns the response ID of the most recent server response.

OmniRealtimeCallback

Import: com.alibaba.dashscope.audio.omni.OmniRealtimeCallback

Extend this class and override its callback methods to handle server events.

| Method | Parameters | Triggered when |
| --- | --- | --- |
| onOpen() | None | The WebSocket connection is established. |
| onEvent(JsonObject message) | message: A server event as JSON. Common event types: session.created, session.updated, input_audio_buffer.committed, conversation.item.input_audio_transcription.completed, session.finished. | A server event is received. Parse the type field to determine the event type. |
| onClose(int code, String reason) | code: Status code. reason: Reason for closing. | The WebSocket connection is closed. |
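Handling events by their type field can be organized as a small dispatch table rather than a long if/else chain. The sketch below is SDK-independent: EventRouter is a hypothetical class, generic over the event payload so that it runs without the SDK. Inside onEvent you would first extract the type string from the message; assuming JsonObject is Gson's class, that could be message.get("type").getAsString().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical dispatcher keyed by the event's type field.
final class EventRouter<E> {
    private final Map<String, Consumer<E>> handlers = new HashMap<>();
    private Consumer<E> fallback = event -> { };

    /** Registers a handler for one event type, e.g. "session.finished". */
    EventRouter<E> on(String type, Consumer<E> handler) {
        handlers.put(type, handler);
        return this;
    }

    /** Registers a handler for event types without a dedicated handler. */
    EventRouter<E> otherwise(Consumer<E> handler) {
        this.fallback = handler;
        return this;
    }

    /** Routes the event to the handler registered for its type, or to the fallback. */
    void dispatch(String type, E event) {
        handlers.getOrDefault(type, fallback).accept(event);
    }
}
```

Unknown event types fall through to the fallback handler, so new server events do not break the callback.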