Client events - Alibaba Cloud Model Studio

This topic describes the client events for the Qwen-TTS Realtime API.

Reference: Real-time speech synthesis - Qwen.

session.update

Use this event to update the session configuration. Send it as the first event after the WebSocket connection is established. If you do not send this event, the system uses the default configuration. After the server successfully processes this event, it returns a session.updated event as confirmation.

event_id string (Required)

A unique event ID that the client generates. The ID must be unique within a single WebSocket connection session. Use a Universally Unique Identifier (UUID).

{
    "event_id": "event_123",
    "type": "session.update",
    "session": {
        "voice": "Cherry",
        "mode": "server_commit",
        "language_type": "Chinese",
        "response_format": "pcm",
        "sample_rate": 24000,
        "instructions": "",
        "optimize_instructions": false
    }
}
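The event above can also be assembled in code. The following is a minimal Python sketch, not an official client: `make_event` is a hypothetical helper name, and it assumes that events are serialized as JSON text before being sent over the WebSocket connection. It generates a fresh UUID for each event_id, as the event_id description recommends.

```python
import json
import uuid


def make_event(event_type, **fields):
    """Build a client event dict with a freshly generated UUID as its event_id."""
    event = {"event_id": str(uuid.uuid4()), "type": event_type}
    event.update(fields)
    return event


# Session settings taken from the JSON example above.
session_update = make_event(
    "session.update",
    session={
        "voice": "Cherry",
        "mode": "server_commit",
        "language_type": "Chinese",
        "response_format": "pcm",
        "sample_rate": 24000,
    },
)

# Serialize to JSON text (assumed wire format for the WebSocket message).
payload = json.dumps(session_update)
```

Generating the ID with `uuid.uuid4()` keeps every event_id unique within the connection without the client having to track a counter.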

type string (Required)

The event type. The value is fixed to session.update.

session object (Optional)

The session configuration.

Properties

voice string (Required)

The voice for speech synthesis. For more information, see Supported voices.

System voices and custom voices are supported:

System voices: Available only for the Qwen3-TTS-Instruct-Flash-Realtime, Qwen3-TTS-Flash-Realtime, and Qwen-TTS-Realtime model series. For voice samples, see Supported voices.
Custom voices
- Voices customized by the Voice Cloning (Qwen) feature: Available only for the Qwen3-TTS-VC-Realtime model series.
- Voices customized by the Voice Design (Qwen) feature: Available only for the Qwen3-TTS-VD-Realtime model series.

mode string (Optional)

The interaction mode. Valid values:

server_commit (default): The server automatically determines when to synthesize speech, balancing latency and quality. This mode is recommended for most scenarios.
commit: The client manually triggers synthesis. This mode has the lowest latency, but you must manage sentence integrity yourself.

language_type string (Optional)

Specifies the language of the synthesized audio. The default value is Auto.

Auto: Use this value when the language of the text is uncertain or the text contains multiple languages. The model automatically matches the pronunciation to each language segment in the text, but accuracy is not guaranteed.
Specific language: Use this for single-language text. Specifying the language significantly improves synthesis quality compared with Auto. Valid values include the following:
- Chinese
- English
- German
- Italian
- Portuguese
- Spanish
- Japanese
- Korean
- French
- Russian

response_format string (Optional)

The format of the audio output from the model.

Supported formats:

pcm (default)
wav
mp3
opus

Qwen-TTS-Realtime (see Supported models) supports only pcm.

sample_rate integer (Optional)

The sample rate of the audio output from the model, in Hz.

Supported sample rates:

8000
16000
24000 (default)
48000

Qwen-TTS-Realtime (see Supported models) supports only 24000.

speech_rate float (Optional)

The speech rate of the audio. A value of 1.0 is the normal speed. A value less than 1.0 is slower, and a value greater than 1.0 is faster.

Default value: 1.0.

Valid range: [0.5, 2.0].

Qwen-TTS-Realtime (see Supported models) does not support this parameter.

volume integer (Optional)

The volume of the audio.

Default value: 50.

Valid range: [0, 100].

Qwen-TTS-Realtime (see Supported models) does not support this parameter.

pitch_rate float (Optional)

The pitch of the synthesized audio.

Default value: 1.0.

Valid range: [0.5, 2.0].

Qwen-TTS-Realtime (see Supported models) does not support this parameter.

bit_rate integer (Optional)

Specifies the bitrate of the audio in kbps. A higher bitrate results in better audio quality and a larger file size. This parameter is available only when the audio format (response_format) is set to opus.

Default value: 128.

Valid range: [6, 510].

Qwen-TTS-Realtime (see Supported models) does not support this parameter.
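The value constraints for the audio parameters above can be checked on the client before a session.update event is sent. The following Python sketch is illustrative only: `check_audio_settings` is a hypothetical helper, the allowed values and ranges are copied from the descriptions above, and the model-specific restrictions for Qwen-TTS-Realtime are not checked.

```python
# Allowed values and ranges copied from the parameter descriptions above.
ALLOWED_FORMATS = {"pcm", "wav", "mp3", "opus"}
ALLOWED_SAMPLE_RATES = {8000, 16000, 24000, 48000}
RANGES = {
    "speech_rate": (0.5, 2.0),
    "volume": (0, 100),
    "pitch_rate": (0.5, 2.0),
    "bit_rate": (6, 510),
}


def check_audio_settings(session):
    """Return a list of problems found in a session dict; an empty list means none."""
    problems = []

    # Fall back to the documented defaults when a field is absent.
    fmt = session.get("response_format", "pcm")
    if fmt not in ALLOWED_FORMATS:
        problems.append(f"response_format {fmt!r} is not supported")

    rate = session.get("sample_rate", 24000)
    if rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"sample_rate {rate!r} is not supported")

    # Range checks apply only to fields that are actually present.
    for name, (low, high) in RANGES.items():
        if name in session and not low <= session[name] <= high:
            problems.append(f"{name} must be in [{low}, {high}]")

    # bit_rate is available only with the opus format.
    if "bit_rate" in session and fmt != "opus":
        problems.append("bit_rate applies only when response_format is opus")

    return problems
```

For example, `check_audio_settings({"response_format": "mp3", "bit_rate": 128})` reports one problem, because bit_rate is set while the format is not opus.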

instructions string (Optional)

The instructions that guide speech synthesis. For more information, see Real-time speech synthesis - Qwen.

Default value: None. If you do not set this parameter, it has no effect.

Length limit: 1600 tokens.

Supported languages: Chinese and English only.

Scope: This feature is available only for the Qwen3-TTS-Instruct-Flash-Realtime model series.

optimize_instructions boolean (Optional)

Specifies whether to optimize the instructions to improve the naturalness and expressiveness of the speech synthesis.

Default value: false.

Behavior: When set to true, the system enhances and rewrites the content of instructions to generate internal instructions that are better suited for speech synthesis.

Scenarios: Recommended for scenarios that require high-quality, fine-grained voice expression.

Dependency: This parameter depends on the instructions parameter being set. If instructions is empty, this parameter has no effect.

Scope: This feature is available only for the Qwen3-TTS-Instruct-Flash-Realtime model series.
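Because optimize_instructions has no effect when instructions is empty, a client can leave both fields out in that case. The following Python sketch shows one way to do this. `instruction_settings` is a hypothetical helper, and omitting the fields relies on both being optional, as described above.

```python
def instruction_settings(instructions="", optimize=False):
    """Return the instruction-related session fields, omitting ones with no effect."""
    settings = {}
    if instructions:
        settings["instructions"] = instructions
        settings["optimize_instructions"] = bool(optimize)
    # With empty instructions, optimize_instructions has no effect,
    # so both fields are left out and the defaults apply.
    return settings
```

The returned dict can be merged into the session object of a session.update event, for example `{"voice": "Cherry", **instruction_settings("Speak slowly and warmly.", True)}`.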

input_text_buffer.append

Appends text for synthesis to the text buffer. In `server_commit` mode, the text is appended to the server-side text buffer. In `commit` mode, the text is appended to the client-side text buffer.

event_id string (Required)

A unique event ID that the client generates. The ID must be unique within a single WebSocket connection session. Use a Universally Unique Identifier (UUID).

{
  "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
  "type": "input_text_buffer.append",
  "text": "Hello, I am Qwen."
}

type string (Required)

The event type. The value is fixed to input_text_buffer.append.

text string (Required)

The text to be synthesized.

input_text_buffer.commit

Commits the user input text buffer to create a new user message item in the conversation. If the input text buffer is empty, this event results in an error. In `server_commit` mode, sending this event immediately synthesizes all previously appended text, and the server stops caching that text. In `commit` mode, the client must commit the text buffer to create a user message item. Committing the input text buffer does not create a response from the model. The server responds with an input_text_buffer.committed event.

event_id string (Required)

A unique event ID that the client generates. The ID must be unique within a single WebSocket connection session. Use a Universally Unique Identifier (UUID).

{
  "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
  "type": "input_text_buffer.commit"
}

type string (Required)

The event type. The value is fixed to input_text_buffer.commit.
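In `commit` mode, the client decides when synthesis is triggered, so a common pattern is to append one complete sentence and then commit it. The following Python sketch illustrates that pattern under stated assumptions: `event` and `commit_mode_events` are hypothetical helper names, and blank sentences are skipped so that an empty buffer is never committed.

```python
import uuid


def event(event_type, **fields):
    """Build a client event with a generated UUID as its event_id."""
    return {"event_id": str(uuid.uuid4()), "type": event_type, **fields}


def commit_mode_events(sentences):
    """Yield an append event followed by a commit event for each non-blank sentence."""
    for sentence in sentences:
        if not sentence.strip():
            # Skip blanks so that an empty text buffer is never committed.
            continue
        yield event("input_text_buffer.append", text=sentence)
        yield event("input_text_buffer.commit")
```

Calling `list(commit_mode_events(["Hello, I am Qwen.", "Nice to meet you."]))` produces four events: an append and a commit for each sentence, in order.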

input_text_buffer.clear

Clears the text in the buffer. The server responds with an input_text_buffer.cleared event.

event_id string (Required)

A unique event ID that the client generates. The ID must be unique within a single WebSocket connection session. Use a Universally Unique Identifier (UUID).

{
  "event_id": "event_2728",
  "type": "input_text_buffer.clear"
}

type string (Required)

The event type. The value is fixed to input_text_buffer.clear.

session.finish

Notifies the server that there is no more text input. The server returns the remaining audio and then closes the connection.

event_id string (Required)

A unique event ID that the client generates. The ID must be unique within a single WebSocket connection session. Use a Universally Unique Identifier (UUID).

{
  "event_id": "event_2239",
  "type": "session.finish"
}

type string (Required)

The event type. The value is fixed to session.finish.
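Taken together, the client events in this topic follow a simple order: session.update first, then one or more input_text_buffer.append events, and session.finish last. The following Python sketch shows that order only. `run_session` and its `send` argument are hypothetical: `send` stands in for whatever function writes a text message to the WebSocket connection, and the sketch does not read the audio or other events that the server returns.

```python
import json
import uuid


def run_session(send, session, texts):
    """Send session.update, then each text chunk, then session.finish."""

    def emit(event_type, **fields):
        # Every event gets its own UUID so that IDs are unique within the connection.
        message = {"event_id": str(uuid.uuid4()), "type": event_type, **fields}
        send(json.dumps(message))

    emit("session.update", session=session)
    for text in texts:
        emit("input_text_buffer.append", text=text)
    emit("session.finish")


# Usage with a stand-in transport that only records the outgoing messages.
sent = []
run_session(
    sent.append,
    {"voice": "Cherry", "mode": "server_commit"},
    ["Hello, ", "I am Qwen."],
)
```

Passing the transport in as a callable keeps the sketch independent of any particular WebSocket library and makes the event order easy to inspect.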