Client events

Client events are JSON messages that the client sends over a WebSocket connection to control a Qwen-TTS Realtime API session: configuring voice settings, streaming text for synthesis, and signaling completion.

For the full API overview, see Real-time speech synthesis - Qwen.

Event summary

Client event | Server response | Description
session.update | session.updated | Set voice, audio format, interaction mode, and other session parameters
input_text_buffer.append | None | Append text to the synthesis buffer
input_text_buffer.commit | input_text_buffer.committed | Commit buffered text to trigger synthesis
input_text_buffer.clear | input_text_buffer.cleared | Discard all buffered text
session.finish | None | End the session; the server flushes remaining audio and closes the connection
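
The summary table can be mirrored in client code as a lookup from each client event type to the server event that acknowledges it. This is a minimal sketch; the names EXPECTED_RESPONSE and expects_ack are illustrative and not part of the API.

```python
from typing import Optional

# Client event type -> server event that acknowledges it.
# None means the summary table lists no direct server response.
EXPECTED_RESPONSE: dict[str, Optional[str]] = {
    "session.update": "session.updated",
    "input_text_buffer.append": None,
    "input_text_buffer.commit": "input_text_buffer.committed",
    "input_text_buffer.clear": "input_text_buffer.cleared",
    "session.finish": None,
}


def expects_ack(event_type: str) -> bool:
    """Return True if the summary table lists a server response for this event."""
    if event_type not in EXPECTED_RESPONSE:
        raise ValueError(f"unknown client event type: {event_type!r}")
    return EXPECTED_RESPONSE[event_type] is not None
```

A client could use such a lookup to decide whether to wait for a confirmation before sending the next event.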

session.update

Configures the session. Send it as the first message after the WebSocket connection is established. If it is omitted, all parameters use their defaults. The server confirms with a session.updated event.

Request body

{
    "event_id": "event_123",
    "type": "session.update",
    "session": {
        "voice": "Cherry",
        "mode": "server_commit",
        "language_type": "Chinese",
        "response_format": "pcm",
        "sample_rate": 24000,
        "instructions": "",
        "optimize_instructions": false
    }
}

Parameters

Parameter | Type | Required | Description
event_id | string | Yes | Unique event identifier generated by the client (UUID recommended). Must be unique within the WebSocket session.
type | string | Yes | Set to session.update.
session | object | No | Session configuration. See the following subsections.
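
As an illustration of the request body and the UUID recommendation for event_id, the following sketch serializes a session.update event. The helper name make_session_update and the "event_" prefix on the generated ID are assumptions made for this example, not requirements of the API.

```python
import json
import uuid


def make_session_update(voice: str, **options) -> str:
    """Serialize a session.update event with a fresh UUID-based event_id."""
    session = {"voice": voice}
    session.update(options)  # e.g. mode, language_type, response_format
    event = {
        # The "event_" prefix is illustrative; only uniqueness is required.
        "event_id": f"event_{uuid.uuid4().hex}",
        "type": "session.update",
        "session": session,
    }
    return json.dumps(event)


payload = make_session_update(
    "Cherry",
    mode="server_commit",
    language_type="Chinese",
    response_format="pcm",
    sample_rate=24000,
)
```

The resulting string would be sent as a text frame over the WebSocket connection by whatever client library is in use.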

session properties

voice

Type: string | Required: Yes

The voice for speech synthesis. For voice samples, see Supported voices.

  • System voices: Available only for the Qwen3-TTS-Instruct-Flash-Realtime, Qwen3-TTS-Flash-Realtime, and Qwen-TTS-Realtime model series.

  • Custom voices:

    • Voices created through Voice cloning (Qwen): Available for the Qwen3-TTS-VC-Realtime series only.

    • Voices created through Voice design (Qwen): Available for the Qwen3-TTS-VD-Realtime series only.

mode

Type: string | Required: No | Default: server_commit

The interaction mode that controls when buffered text is synthesized.

Value | Behavior
server_commit | The server decides when to synthesize, balancing latency and quality. Recommended for most uses.
commit | Trigger synthesis manually by sending input_text_buffer.commit. Lowest latency, but you must manage sentence integrity.
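
The difference between the two modes shows up in the order of events the client sends. The sketch below plans that order for a list of text chunks; the function plan_event_types is hypothetical, and committing after every chunk in commit mode assumes that each chunk is already a complete sentence.

```python
def plan_event_types(chunks: list[str], mode: str = "server_commit") -> list[str]:
    """Return the client event types to send, in order, for the given mode."""
    if mode not in ("server_commit", "commit"):
        raise ValueError(f"unsupported mode: {mode!r}")
    plan = []
    for _ in chunks:
        plan.append("input_text_buffer.append")
        if mode == "commit":
            # In commit mode the client triggers synthesis itself.
            plan.append("input_text_buffer.commit")
    plan.append("session.finish")
    return plan
```

In server_commit mode the plan contains only append events followed by session.finish, because the server chooses when to synthesize.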

language_type

Type: string | Required: No | Default: Auto

The language of the synthesized audio.

  • Auto: For text whose language is unknown or mixed. The model matches pronunciation for each segment automatically, but accuracy is not guaranteed.

  • For single-language text, specifying the language significantly improves quality. Supported values: Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian.

response_format

Type: string | Required: No | Default: pcm

The audio output format.

Value | Notes
pcm | Default. The only format supported by the Qwen-TTS-Realtime series. See Supported models.
wav | -
mp3 | -
opus | Supports a configurable bitrate via the bit_rate parameter.

sample_rate

Type: integer | Required: No | Default: 24000

The sample rate of the audio output, in Hz.

Supported values: 8000, 16000, 24000, 48000.

Note: The Qwen-TTS-Realtime series supports only 24000. See Supported models.

speech_rate

Type: float | Required: No | Default: 1.0 | Range: 0.5 to 2.0

The playback speed. Values below 1.0 slow down the audio; values above 1.0 speed it up.

Note: Not supported by the Qwen-TTS-Realtime series. See Supported models.

volume

Type: integer | Required: No | Default: 50 | Range: 0 to 100

The audio volume.

Note: Not supported by the Qwen-TTS-Realtime series. See Supported models.

pitch_rate

Type: float | Required: No | Default: 1.0 | Range: 0.5 to 2.0

The pitch of the synthesized audio.

Note: Not supported by the Qwen-TTS-Realtime series. See Supported models.

bit_rate

Type: integer | Required: No | Default: 128 | Range: 6 to 510

Audio bitrate in kbps. Higher values produce better quality but larger files. Only applies when response_format is opus.

Note: Not supported by the Qwen-TTS-Realtime series. See Supported models.
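
The documented values and ranges for sample_rate, speech_rate, volume, pitch_rate, and bit_rate can be checked on the client before a session.update is sent. The following is a minimal sketch; validate_session is a hypothetical helper, and it does not check the per-model restrictions described in the notes.

```python
SUPPORTED_SAMPLE_RATES = {8000, 16000, 24000, 48000}

# Inclusive (low, high) bounds taken from the parameter descriptions.
RANGES = {
    "speech_rate": (0.5, 2.0),
    "volume": (0, 100),
    "pitch_rate": (0.5, 2.0),
    "bit_rate": (6, 510),
}


def validate_session(session: dict) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    rate = session.get("sample_rate", 24000)
    if rate not in SUPPORTED_SAMPLE_RATES:
        problems.append(f"sample_rate {rate} is not a supported value")
    for name, (low, high) in RANGES.items():
        if name in session and not low <= session[name] <= high:
            problems.append(f"{name} {session[name]} is outside {low} to {high}")
    # bit_rate is documented as applying only to the opus format.
    if "bit_rate" in session and session.get("response_format", "pcm") != "opus":
        problems.append("bit_rate only applies when response_format is opus")
    return problems
```

For example, validate_session({"bit_rate": 64}) reports one problem because response_format defaults to pcm.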

instructions

Type: string | Required: No | Default: None | Max length: 1600 tokens

Controls style and expressiveness of synthesized speech. For details, see Real-time speech synthesis - Qwen.

Supported languages: Chinese and English only.

Note: Available for the Qwen3-TTS-Instruct-Flash-Realtime series only.

optimize_instructions

Type: boolean | Required: No | Default: false

When true, rewrites instructions to improve naturalness and expressiveness. Enable for use cases requiring fine-grained vocal control.

Has no effect if instructions is empty.

Note: Available for the Qwen3-TTS-Instruct-Flash-Realtime series only.
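
Because optimize_instructions has no effect when instructions is empty, a client can leave both fields out in that case. The helper below is a hypothetical sketch of that choice; it does not check the 1600-token limit, which would require the model's tokenizer.

```python
def instruct_options(instructions: str = "", optimize: bool = False) -> dict:
    """Return the instruction-related session fields, or {} if none apply."""
    if not instructions:
        # optimize_instructions has no effect without instructions.
        return {}
    return {"instructions": instructions, "optimize_instructions": optimize}
```

The returned dictionary can be merged into the session object of a session.update event.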

input_text_buffer.append

Appends text to the synthesis buffer.

  • In server_commit mode, text is appended to the server-side buffer.

  • In commit mode, text is appended to the client-side buffer.

Request body

{
    "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
    "type": "input_text_buffer.append",
    "text": "Hello, I am Qwen."
}

Parameters

Parameter | Type | Required | Description
event_id | string | Yes | Unique event identifier generated by the client (UUID recommended). Must be unique within the WebSocket session.
type | string | Yes | Set to input_text_buffer.append.
text | string | Yes | The text to synthesize.
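
When text arrives incrementally, for example from another streaming source, each fragment can be wrapped in its own input_text_buffer.append event. The generator below is a sketch under that assumption; append_events is a hypothetical name, and skipping empty fragments is a choice made for this example.

```python
import json
import uuid
from typing import Iterable, Iterator


def append_events(chunks: Iterable[str]) -> Iterator[str]:
    """Yield one serialized input_text_buffer.append event per non-empty chunk."""
    for chunk in chunks:
        if not chunk:
            continue  # nothing to synthesize for an empty fragment
        yield json.dumps(
            {
                "event_id": f"event_{uuid.uuid4().hex}",
                "type": "input_text_buffer.append",
                "text": chunk,
            },
            ensure_ascii=False,  # keep non-ASCII text readable in the payload
        )
```

Each yielded string is a complete message with its own unique event_id.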

input_text_buffer.commit

Commits buffered text and creates a user message item. The server responds with an input_text_buffer.committed event.

The server returns an error if the buffer is empty.

Behavior differs by mode:

  • server_commit mode: All buffered text is synthesized immediately. The server stops caching and processes everything at once.

  • commit mode: Creates a user message item from the buffered text.

Note: Committing the buffer triggers synthesis only; it does not generate a model response.

Request body

{
    "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
    "type": "input_text_buffer.commit"
}

Parameters

Parameter | Type | Required | Description
event_id | string | Yes | Unique event identifier generated by the client (UUID recommended). Must be unique within the WebSocket session.
type | string | Yes | Set to input_text_buffer.commit.
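
Since committing an empty buffer produces an error, a client in commit mode may want to track locally whether anything has been appended since the last commit. The class below is a minimal sketch of that idea; BufferTracker and its methods are hypothetical, and the local counter is only an approximation of the buffer state in server_commit mode, where the server may already have consumed the text.

```python
import json
import uuid
from typing import Optional


class BufferTracker:
    """Tracks whether text has been appended since the last commit."""

    def __init__(self) -> None:
        self.pending_chars = 0

    def _event(self, event_type: str, **fields) -> str:
        event = {"event_id": f"event_{uuid.uuid4().hex}", "type": event_type}
        event.update(fields)
        return json.dumps(event)

    def append(self, text: str) -> str:
        """Record the appended text and return the append event to send."""
        self.pending_chars += len(text)
        return self._event("input_text_buffer.append", text=text)

    def commit(self) -> Optional[str]:
        """Return a commit event, or None if nothing is pending."""
        if self.pending_chars == 0:
            return None  # avoid the empty-buffer error
        self.pending_chars = 0
        return self._event("input_text_buffer.commit")
```

A caller sends whatever append and commit return and simply skips the send when commit returns None.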

input_text_buffer.clear

Discards all text in the buffer. The server responds with an input_text_buffer.cleared event.

Request body

{
    "event_id": "event_2728",
    "type": "input_text_buffer.clear"
}

Parameters

Parameter | Type | Required | Description
event_id | string | Yes | Unique event identifier generated by the client (UUID recommended). Must be unique within the WebSocket session.
type | string | Yes | Set to input_text_buffer.clear.

session.finish

Signals that no more text will be sent. The server returns any remaining audio and closes the connection.

Request body

{
    "event_id": "event_2239",
    "type": "session.finish"
}

Parameters

Parameter | Type | Required | Description
event_id | string | Yes | Unique event identifier generated by the client (UUID recommended). Must be unique within the WebSocket session.
type | string | Yes | Set to session.finish.
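
Putting the client events together, a session in server_commit mode might send session.update, a series of input_text_buffer.append events, and finally session.finish. The sketch below shows only the sending side and takes the transport as a send callable, so it is independent of any WebSocket library; run_session is a hypothetical name, and a real client would also read server events and audio concurrently.

```python
import json
import uuid
from typing import Callable, Iterable


def run_session(send: Callable[[str], None], voice: str, chunks: Iterable[str]) -> None:
    """Send the client events for one synthesis session, in order."""

    def emit(event_type: str, **fields) -> None:
        event = {"event_id": f"event_{uuid.uuid4().hex}", "type": event_type}
        event.update(fields)
        send(json.dumps(event))

    # 1. Configure the session first.
    emit("session.update", session={"voice": voice, "mode": "server_commit"})
    # 2. Stream the text; the server decides when to synthesize.
    for chunk in chunks:
        emit("input_text_buffer.append", text=chunk)
    # 3. Signal that no more text will follow.
    emit("session.finish")


# Usage with a list standing in for the WebSocket connection.
sent: list[str] = []
run_session(sent.append, "Cherry", ["Hello, ", "I am Qwen."])
```

Replacing sent.append with the send method of an actual WebSocket client would transmit the same sequence over the wire.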