Real-time speech synthesis converts text into natural speech over a WebSocket connection. It supports streaming input and output, voice cloning, voice design, and fine-grained audio control for use cases such as voice assistants, audiobooks, and intelligent customer service.
Overview
Low-latency real-time speech synthesis over WebSocket, built for voice assistants, intelligent customer service, live captioning, and other scenarios that require instant responses.
-
Streaming input and output (full-duplex WebSocket) with low time to first audio, ideal for real-time conversations such as voice assistants and intelligent customer service
-
Adjustable speech rate, pitch, volume, and bitrate for fine-grained voice control
-
Compatible with mainstream audio formats (PCM, WAV, MP3, Opus) and supports up to 48 kHz sample rate output
-
Supports instruction-based control, which uses natural-language instructions to adjust voice expressiveness
-
Supports voice customization through Voice cloning and Voice Design
If you don't need real-time output, use Non-real-time speech synthesis (HTTP API), which suits batch scenarios such as audiobooks and courseware dubbing. For model selection guidance, see Speech synthesis.
Sambert is a legacy speech synthesis model. For new projects, use CosyVoice or Qwen-TTS for better synthesis quality and richer feature support.
Prerequisites
-
If you call the service through the DashScope SDK, install the latest SDK.
Quick start
The following examples demonstrate speech synthesis for each model. For more examples and parameter descriptions, see the API reference of each model.
CosyVoice
cosyvoice-v3.5-plus and cosyvoice-v3.5-flash are available only in the Beijing region and support only voice design and voice cloning scenarios (no system voices). Before using these models, create a voice through Voice cloning or Voice Design, then set voice to the voice ID and model to the corresponding model name in your code.
The following example shows how to synthesize speech with a system voice (see CosyVoice Voice list).
To use instruction control, pass the instruction text through the instructions parameter.
Python
# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import *
# The API Key differs between the Singapore and Beijing regions. Obtain an API Key: https://help.aliyun.com/en/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API Key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# China (Beijing) region URL. The URL varies by region.
dashscope.base_websocket_api_url='wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/inference'
# Model
# Different model versions require corresponding voice types:
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# cosyvoice-v2: Use voices such as longxiaochun_v2.
# Select a voice that supports the target language
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"
# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send text for synthesis and obtain binary audio
audio = synthesizer.call("How is the weather today?")
# The first text submission requires establishing a WebSocket connection, so the first-packet latency includes connection setup time
print('[Metric] requestId: {}, first package delay: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))
# Save audio to a local file
with open('output.mp3', 'wb') as f:
    f.write(audio)
Java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // Model
    // Different model versions require corresponding voice types:
    // cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
    // cosyvoice-v2: Use voices such as longxiaochun_v2.
    // Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the target language. For details, see the CosyVoice voice list.
    private static String model = "cosyvoice-v3-flash";
    // Voice
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() {
        // Request parameters
        SpeechSynthesisParam param =
            SpeechSynthesisParam.builder()
                // The API Key differs between the Singapore and Beijing regions. Obtain an API Key: https://help.aliyun.com/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(model) // Model
                .voice(voice) // Voice
                .build();
        // Synchronous mode: disable callback (second parameter is null)
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until audio is returned
            audio = synthesizer.call("How is the weather today?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection after the task completes
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save the audio data to a local file "output.mp3"
            File file = new File("output.mp3");
            // The first text submission requires establishing a WebSocket connection, so the first-packet latency includes connection setup time
            // Note: getFirstPackageDelay() requires dashscope-sdk-java 2.18.0 or later
            System.out.println(
                "[Metric] requestId: "
                    + synthesizer.getLastRequestId()
                    + ", first package delay (ms): "
                    + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        // China (Beijing) region URL. The URL varies by region.
        Constants.baseWebsocketApiUrl = "wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}
Qwen-TTS
The following example shows how to synthesize speech with a system voice (see Supported voices).
To use Instruction-based control, set model to qwen3-tts-instruct-flash-realtime and configure the instruction through the instructions parameter.
Python
Server commit mode
import os
import base64
import threading
import time
import dashscope
from dashscope.audio.qwen_tts_realtime import *

qwen_tts_realtime: QwenTtsRealtime = None
text_to_synthesize = [
    'Right? I love supermarkets like this.',
    'Especially during Chinese New Year,',
    'I go shopping at supermarkets.',
    'And I feel',
    'absolutely thrilled!',
    'I want to buy so many things!'
]
DO_VIDEO_TEST = False

def init_dashscope_api_key():
    """
    Set your DashScope API key. More information:
    https://github.com/aliyun/alibabacloud-bailian-speech-demo/blob/master/PREREQUISITES.md
    """
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://help.aliyun.com/en/model-studio/get-api-key
    if 'DASHSCOPE_API_KEY' in os.environ:
        dashscope.api_key = os.environ[
            'DASHSCOPE_API_KEY']  # Load API key from environment variable DASHSCOPE_API_KEY
    else:
        dashscope.api_key = 'your-dashscope-api-key'  # Set API key manually

class MyCallback(QwenTtsRealtimeCallback):
    def __init__(self):
        self.complete_event = threading.Event()
        self.file = open('result_24k.pcm', 'wb')

    def on_open(self) -> None:
        print('connection opened, init player')

    def on_close(self, close_status_code, close_msg) -> None:
        self.file.close()
        print('connection closed with code: {}, msg: {}, destroy player'.format(close_status_code, close_msg))

    def on_event(self, response: str) -> None:
        try:
            global qwen_tts_realtime
            type = response['type']
            if 'session.created' == type:
                print('start session: {}'.format(response['session']['id']))
            if 'response.audio.delta' == type:
                recv_audio_b64 = response['delta']
                self.file.write(base64.b64decode(recv_audio_b64))
            if 'response.done' == type:
                print(f'response {qwen_tts_realtime.get_last_response_id()} done')
            if 'session.finished' == type:
                print('session finished')
                self.complete_event.set()
        except Exception as e:
            print('[Error] {}'.format(e))
            return

    def wait_for_finished(self):
        self.complete_event.wait()

if __name__ == '__main__':
    init_dashscope_api_key()
    print('Initializing ...')
    callback = MyCallback()
    qwen_tts_realtime = QwenTtsRealtime(
        # To use instruction control, replace the model with qwen3-tts-instruct-flash-realtime
        model='qwen3-tts-flash-realtime',
        callback=callback,
        # This URL is for the Beijing region. Replace WorkspaceId with your actual workspace ID. If you use the Singapore region, replace it with: wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime
        url='wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/realtime'
    )
    qwen_tts_realtime.connect()
    qwen_tts_realtime.update_session(
        voice = 'Cherry',
        response_format = AudioFormat.PCM_24000HZ_MONO_16BIT,
        # To use instruction control, uncomment the following lines and replace the model with qwen3-tts-instruct-flash-realtime
        # instructions='Speak quickly with a rising intonation, suitable for introducing fashion products.',
        # optimize_instructions=True,
        mode = 'server_commit'
    )
    for text_chunk in text_to_synthesize:
        print(f'send text: {text_chunk}')
        qwen_tts_realtime.append_text(text_chunk)
        time.sleep(0.1)
    qwen_tts_realtime.finish()
    callback.wait_for_finished()
    print('[Metric] session: {}, first audio delay: {}'.format(
        qwen_tts_realtime.get_session_id(),
        qwen_tts_realtime.get_first_audio_delay(),
    ))
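The example above writes headerless PCM to result_24k.pcm, which many players cannot open directly. The following is a minimal sketch of wrapping such data in a WAV container with Python's standard wave module, assuming the 24 kHz, mono, 16-bit layout implied by PCM_24000HZ_MONO_16BIT. The function name pcm_to_wav and its defaults are illustrative and are not part of the DashScope SDK.

```python
import wave


def pcm_to_wav(pcm_bytes: bytes, wav_path: str, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> float:
    """Write raw PCM bytes to a WAV file and return the audio duration in seconds.

    Defaults are an assumption matching 24 kHz, mono, 16-bit PCM.
    """
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    # One second of audio occupies sample_rate * channels * sample_width bytes.
    bytes_per_second = sample_rate * channels * sample_width
    return len(pcm_bytes) / bytes_per_second
```

For instance, after the session finishes, reading result_24k.pcm in binary mode and passing its contents to pcm_to_wav with a target path such as result_24k.wav should yield a file that ordinary audio players can open. The same bytes-per-second arithmetic is what the Java player uses to estimate how long a chunk takes to play.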
Commit mode
import base64
import os
import threading
import dashscope
from dashscope.audio.qwen_tts_realtime import *

qwen_tts_realtime: QwenTtsRealtime = None
text_to_synthesize = [
    'This is the first sentence.',
    'This is the second sentence.',
    'This is the third sentence.',
]
DO_VIDEO_TEST = False

def init_dashscope_api_key():
    """
    Set your DashScope API key. More information:
    https://github.com/aliyun/alibabacloud-bailian-speech-demo/blob/master/PREREQUISITES.md
    """
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://help.aliyun.com/en/model-studio/get-api-key
    if 'DASHSCOPE_API_KEY' in os.environ:
        dashscope.api_key = os.environ[
            'DASHSCOPE_API_KEY']  # Load API key from environment variable DASHSCOPE_API_KEY
    else:
        dashscope.api_key = 'your-dashscope-api-key'  # Set API key manually

class MyCallback(QwenTtsRealtimeCallback):
    def __init__(self):
        super().__init__()
        self.response_counter = 0
        self.complete_event = threading.Event()
        self.file = open(f'result_{self.response_counter}_24k.pcm', 'wb')

    def reset_event(self):
        # Start a new output file and a fresh event for the next response
        self.response_counter += 1
        self.file = open(f'result_{self.response_counter}_24k.pcm', 'wb')
        self.complete_event = threading.Event()

    def on_open(self) -> None:
        print('connection opened, init player')

    def on_close(self, close_status_code, close_msg) -> None:
        print('connection closed with code: {}, msg: {}, destroy player'.format(close_status_code, close_msg))

    def on_event(self, response: str) -> None:
        try:
            global qwen_tts_realtime
            type = response['type']
            if 'session.created' == type:
                print('start session: {}'.format(response['session']['id']))
            if 'response.audio.delta' == type:
                recv_audio_b64 = response['delta']
                self.file.write(base64.b64decode(recv_audio_b64))
            if 'response.done' == type:
                print(f'response {qwen_tts_realtime.get_last_response_id()} done')
                # Close the file before signaling completion, so the main thread
                # cannot call reset_event() and have the new file closed here instead
                self.file.close()
                self.complete_event.set()
            if 'session.finished' == type:
                print('session finished')
                self.complete_event.set()
        except Exception as e:
            print('[Error] {}'.format(e))
            return

    def wait_for_response_done(self):
        self.complete_event.wait()

if __name__ == '__main__':
    init_dashscope_api_key()
    print('Initializing ...')
    callback = MyCallback()
    qwen_tts_realtime = QwenTtsRealtime(
        # To use instruction control, replace the model with qwen3-tts-instruct-flash-realtime
        model='qwen3-tts-flash-realtime',
        callback=callback,
        # This URL is for the Beijing region. Replace WorkspaceId with your actual workspace ID. If you use the Singapore region, replace it with: wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime
        url='wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/realtime'
    )
    qwen_tts_realtime.connect()
    qwen_tts_realtime.update_session(
        voice = 'Cherry',
        response_format = AudioFormat.PCM_24000HZ_MONO_16BIT,
        # To use instruction control, uncomment the following lines and replace the model with qwen3-tts-instruct-flash-realtime
        # instructions='Speak quickly with a rising intonation, suitable for introducing fashion products.',
        # optimize_instructions=True,
        mode = 'commit'
    )
    print(f'send text: {text_to_synthesize[0]}')
    qwen_tts_realtime.append_text(text_to_synthesize[0])
    qwen_tts_realtime.commit()
    callback.wait_for_response_done()
    callback.reset_event()
    print(f'send text: {text_to_synthesize[1]}')
    qwen_tts_realtime.append_text(text_to_synthesize[1])
    qwen_tts_realtime.commit()
    callback.wait_for_response_done()
    callback.reset_event()
    print(f'send text: {text_to_synthesize[2]}')
    qwen_tts_realtime.append_text(text_to_synthesize[2])
    qwen_tts_realtime.commit()
    callback.wait_for_response_done()
    qwen_tts_realtime.finish()
    print('[Metric] session: {}, first audio delay: {}'.format(
        qwen_tts_realtime.get_session_id(),
        qwen_tts_realtime.get_first_audio_delay(),
    ))
Java
Server commit mode
appendText()
import com.alibaba.dashscope.audio.qwen_tts_realtime.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.SourceDataLine;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.AudioSystem;
import java.io.*;
import java.util.Base64;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
public class Main {
static String[] textToSynthesize = {
"Right? I really love this kind of supermarket.",
"Especially during the Chinese New Year.",
"Going to the supermarket.",
"It just makes me feel.",
"Super, super happy!",
"I want to buy so many things!"
};
public static QwenTtsRealtimeAudioFormat ttsFormat = QwenTtsRealtimeAudioFormat.PCM_24000HZ_MONO_16BIT;
// Real-time PCM audio player
public static class RealtimePcmPlayer {
private int sampleRate;
private SourceDataLine line;
private AudioFormat audioFormat;
private Thread decoderThread;
private Thread playerThread;
private AtomicBoolean stopped = new AtomicBoolean(false);
private Queue<String> b64AudioBuffer = new ConcurrentLinkedQueue<>();
private Queue<byte[]> RawAudioBuffer = new ConcurrentLinkedQueue<>();
private ByteArrayOutputStream totalAudioStream = new ByteArrayOutputStream();
// Initialize the audio format and audio line.
public RealtimePcmPlayer(int sampleRate) throws LineUnavailableException {
this.sampleRate = sampleRate;
this.audioFormat = new AudioFormat(this.sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(SourceDataLine.class, audioFormat);
line = (SourceDataLine) AudioSystem.getLine(info);
line.open(audioFormat);
line.start();
decoderThread = new Thread(new Runnable() {
@Override
public void run() {
while (!stopped.get()) {
String b64Audio = b64AudioBuffer.poll();
if (b64Audio != null) {
byte[] rawAudio = Base64.getDecoder().decode(b64Audio);
RawAudioBuffer.add(rawAudio);
// Write audio data to totalAudioStream.
try {
totalAudioStream.write(rawAudio);
} catch (IOException e) {
throw new RuntimeException(e);
}
} else {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
});
playerThread = new Thread(new Runnable() {
@Override
public void run() {
while (!stopped.get()) {
byte[] rawAudio = RawAudioBuffer.poll();
if (rawAudio != null) {
try {
playChunk(rawAudio);
} catch (IOException e) {
throw new RuntimeException(e);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
} else {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
});
decoderThread.start();
playerThread.start();
}
// Play an audio chunk and block until playback completes.
private void playChunk(byte[] chunk) throws IOException, InterruptedException {
if (chunk == null || chunk.length == 0) return;
int bytesWritten = 0;
while (bytesWritten < chunk.length) {
bytesWritten += line.write(chunk, bytesWritten, chunk.length - bytesWritten);
}
int audioLength = chunk.length / (this.sampleRate*2/1000);
// Wait for the buffered audio to finish playing.
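// Clamp to zero: Thread.sleep throws IllegalArgumentException for negative values (chunks shorter than 10 ms).
Thread.sleep(Math.max(0, audioLength - 10));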
}
public void write(String b64Audio) {
b64AudioBuffer.add(b64Audio);
}
public void cancel() {
b64AudioBuffer.clear();
RawAudioBuffer.clear();
}
public void waitForComplete() throws InterruptedException {
while (!b64AudioBuffer.isEmpty() || !RawAudioBuffer.isEmpty()) {
Thread.sleep(100);
}
line.drain();
}
public void shutdown() throws InterruptedException, IOException {
stopped.set(true);
decoderThread.join();
playerThread.join();
// Save the complete audio file.
File file = new File("TotalAudio_"+ttsFormat.getSampleRate()+"."+ttsFormat.getFormat());
try (FileOutputStream fos = new FileOutputStream(file)) {
fos.write(totalAudioStream.toByteArray());
}
if (line != null && line.isRunning()) {
line.drain();
line.close();
}
}
}
public static void main(String[] args) throws InterruptedException, LineUnavailableException, IOException {
QwenTtsRealtimeParam param = QwenTtsRealtimeParam.builder()
// To use instruction control, replace the model with qwen3-tts-instruct-flash-realtime.
.model("qwen3-tts-flash-realtime")
// China (Beijing) endpoint. Replace WorkspaceId with your actual workspace ID. For Singapore, use wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime.
.url("wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/realtime")
// API keys differ between Singapore and China (Beijing). See https://help.aliyun.com/en/model-studio/get-api-key.
.apikey(System.getenv("DASHSCOPE_API_KEY"))
.build();
AtomicReference<CountDownLatch> completeLatch = new AtomicReference<>(new CountDownLatch(1));
final AtomicReference<QwenTtsRealtime> qwenTtsRef = new AtomicReference<>(null);
// Create a real-time audio player instance.
RealtimePcmPlayer audioPlayer = new RealtimePcmPlayer(24000);
QwenTtsRealtime qwenTtsRealtime = new QwenTtsRealtime(param, new QwenTtsRealtimeCallback() {
@Override
public void onOpen() {
// Handle connection establishment.
}
@Override
public void onEvent(JsonObject message) {
String type = message.get("type").getAsString();
switch(type) {
case "session.created":
// Handle session creation.
if (message.has("session")) {
String eventId = message.get("event_id").getAsString();
String sessionId = message.get("session").getAsJsonObject().get("id").getAsString();
System.out.println("[onEvent] session.created, session_id: "
+ sessionId + ", event_id: " + eventId);
}
break;
case "response.audio.delta":
String recvAudioB64 = message.get("delta").getAsString();
// Play audio in real time.
audioPlayer.write(recvAudioB64);
break;
case "response.done":
// Handle response completion.
break;
case "session.finished":
// Handle session termination.
completeLatch.get().countDown();
break;
default:
break;
}
}
@Override
public void onClose(int code, String reason) {
// Handle connection closure.
}
});
qwenTtsRef.set(qwenTtsRealtime);
try {
qwenTtsRealtime.connect();
} catch (NoApiKeyException e) {
throw new RuntimeException(e);
}
QwenTtsRealtimeConfig config = QwenTtsRealtimeConfig.builder()
.voice("Cherry")
.responseFormat(ttsFormat)
.mode("server_commit")
// To use instruction control, uncomment the following lines and replace the model with qwen3-tts-instruct-flash-realtime.
// .instructions("")
// .optimizeInstructions(true)
.build();
qwenTtsRealtime.updateSession(config);
for (String text:textToSynthesize) {
qwenTtsRealtime.appendText(text);
Thread.sleep(100);
}
qwenTtsRealtime.finish();
completeLatch.get().await();
qwenTtsRealtime.close();
// Wait for audio playback to complete, then shut down the player.
audioPlayer.waitForComplete();
audioPlayer.shutdown();
System.exit(0);
}
}
Commit mode
commit()
import com.alibaba.dashscope.audio.qwen_tts_realtime.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.SourceDataLine;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.AudioSystem;
import java.io.*;
import java.util.Base64;
import java.util.Queue;
import java.util.Scanner;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
public class Main {
public static QwenTtsRealtimeAudioFormat ttsFormat = QwenTtsRealtimeAudioFormat.PCM_24000HZ_MONO_16BIT;
// Real-time PCM audio player
public static class RealtimePcmPlayer {
private int sampleRate;
private SourceDataLine line;
private AudioFormat audioFormat;
private Thread decoderThread;
private Thread playerThread;
private AtomicBoolean stopped = new AtomicBoolean(false);
private Queue<String> b64AudioBuffer = new ConcurrentLinkedQueue<>();
private Queue<byte[]> RawAudioBuffer = new ConcurrentLinkedQueue<>();
private ByteArrayOutputStream totalAudioStream = new ByteArrayOutputStream();
// Initialize the audio format and audio line.
public RealtimePcmPlayer(int sampleRate) throws LineUnavailableException {
this.sampleRate = sampleRate;
this.audioFormat = new AudioFormat(this.sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(SourceDataLine.class, audioFormat);
line = (SourceDataLine) AudioSystem.getLine(info);
line.open(audioFormat);
line.start();
decoderThread = new Thread(new Runnable() {
@Override
public void run() {
while (!stopped.get()) {
String b64Audio = b64AudioBuffer.poll();
if (b64Audio != null) {
byte[] rawAudio = Base64.getDecoder().decode(b64Audio);
RawAudioBuffer.add(rawAudio);
// Write audio data to totalAudioStream.
try {
totalAudioStream.write(rawAudio);
} catch (IOException e) {
throw new RuntimeException(e);
}
} else {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
});
playerThread = new Thread(new Runnable() {
@Override
public void run() {
while (!stopped.get()) {
byte[] rawAudio = RawAudioBuffer.poll();
if (rawAudio != null) {
try {
playChunk(rawAudio);
} catch (IOException e) {
throw new RuntimeException(e);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
} else {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
});
decoderThread.start();
playerThread.start();
}
// Play an audio chunk and block until playback completes.
private void playChunk(byte[] chunk) throws IOException, InterruptedException {
if (chunk == null || chunk.length == 0) return;
int bytesWritten = 0;
while (bytesWritten < chunk.length) {
bytesWritten += line.write(chunk, bytesWritten, chunk.length - bytesWritten);
}
int audioLength = chunk.length / (this.sampleRate*2/1000);
// Wait for the buffered audio to finish playing.
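// Clamp to zero: Thread.sleep throws IllegalArgumentException for negative values (chunks shorter than 10 ms).
Thread.sleep(Math.max(0, audioLength - 10));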
}
public void write(String b64Audio) {
b64AudioBuffer.add(b64Audio);
}
public void cancel() {
b64AudioBuffer.clear();
RawAudioBuffer.clear();
}
public void waitForComplete() throws InterruptedException {
// Wait for all buffered audio data to finish playing.
while (!b64AudioBuffer.isEmpty() || !RawAudioBuffer.isEmpty()) {
Thread.sleep(100);
}
// Wait for the audio line to drain.
line.drain();
}
public void shutdown() throws InterruptedException {
stopped.set(true);
decoderThread.join();
playerThread.join();
// Save the complete audio file.
File file = new File("TotalAudio_"+ttsFormat.getSampleRate()+"."+ttsFormat.getFormat());
try (FileOutputStream fos = new FileOutputStream(file)) {
fos.write(totalAudioStream.toByteArray());
} catch (FileNotFoundException e) {
throw new RuntimeException(e);
} catch (IOException e) {
throw new RuntimeException(e);
}
if (line != null && line.isRunning()) {
line.drain();
line.close();
}
}
}
public static void main(String[] args) throws InterruptedException, LineUnavailableException, FileNotFoundException {
Scanner scanner = new Scanner(System.in);
QwenTtsRealtimeParam param = QwenTtsRealtimeParam.builder()
// To use instruction control, replace the model with qwen3-tts-instruct-flash-realtime.
.model("qwen3-tts-flash-realtime")
// China (Beijing) endpoint. Replace WorkspaceId with your actual workspace ID. For Singapore, use wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/realtime.
.url("wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/realtime")
// API keys differ between Singapore and China (Beijing). See https://help.aliyun.com/en/model-studio/get-api-key.
.apikey(System.getenv("DASHSCOPE_API_KEY"))
.build();
AtomicReference<CountDownLatch> completeLatch = new AtomicReference<>(new CountDownLatch(1));
// Create a real-time player instance.
RealtimePcmPlayer audioPlayer = new RealtimePcmPlayer(24000);
final AtomicReference<QwenTtsRealtime> qwenTtsRef = new AtomicReference<>(null);
QwenTtsRealtime qwenTtsRealtime = new QwenTtsRealtime(param, new QwenTtsRealtimeCallback() {
@Override
public void onOpen() {
System.out.println("connection opened");
System.out.println("Enter text and press Enter to send. Enter 'quit' to exit the program.");
}
@Override
public void onEvent(JsonObject message) {
String type = message.get("type").getAsString();
switch(type) {
case "session.created":
System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
break;
case "response.audio.delta":
String recvAudioB64 = message.get("delta").getAsString();
// Play audio in real time. The player decodes the Base64 payload itself.
audioPlayer.write(recvAudioB64);
break;
case "response.done":
System.out.println("response done");
// Wait for audio playback to complete.
try {
audioPlayer.waitForComplete();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
// Prepare for the next input.
completeLatch.get().countDown();
break;
case "session.finished":
System.out.println("session finished");
if (qwenTtsRef.get() != null) {
System.out.println("[Metric] response: " + qwenTtsRef.get().getResponseId() +
", first audio delay: " + qwenTtsRef.get().getFirstAudioDelay() + " ms");
}
completeLatch.get().countDown();
break;
default:
break;
}
}
@Override
public void onClose(int code, String reason) {
System.out.println("connection closed code: " + code + ", reason: " + reason);
try {
// Wait for playback to complete, then shut down the player.
audioPlayer.waitForComplete();
audioPlayer.shutdown();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
});
qwenTtsRef.set(qwenTtsRealtime);
try {
qwenTtsRealtime.connect();
} catch (NoApiKeyException e) {
throw new RuntimeException(e);
}
QwenTtsRealtimeConfig config = QwenTtsRealtimeConfig.builder()
.voice("Cherry")
.responseFormat(ttsFormat)
.mode("commit")
// To use instruction control, uncomment the following lines and replace the model with qwen3-tts-instruct-flash-realtime.
// .instructions("")
// .optimizeInstructions(true)
.build();
qwenTtsRealtime.updateSession(config);
// Read user input in a loop.
while (true) {
System.out.print("Enter the text to synthesize: ");
String text = scanner.nextLine();
// Exit when the user enters 'quit'.
if ("quit".equalsIgnoreCase(text.trim())) {
System.out.println("Closing the connection...");
qwenTtsRealtime.finish();
completeLatch.get().await();
break;
}
// Skip empty input.
if (text.trim().isEmpty()) {
continue;
}
// Re-initialize the countdown latch.
completeLatch.set(new CountDownLatch(1));
// Send the text.
qwenTtsRealtime.appendText(text);
qwenTtsRealtime.commit();
// Wait for the current synthesis to complete.
completeLatch.get().await();
}
// Clean up resources.
audioPlayer.waitForComplete();
audioPlayer.shutdown();
scanner.close();
System.exit(0);
}
}
Advanced features
Qwen-TTS interaction modes
The Qwen-TTS Realtime API provides two interaction modes:
-
server_commit mode: The server decides how to segment the text and when to synthesize it. This mode suits continuous synthesis of long text, because the client only appends text and does not manage segmentation or submission.
-
commit mode: The client manually submits the text buffer to trigger synthesis. This mode suits scenarios that require precise control over synthesis timing, such as turn-by-turn synthesis in conversational AI.
Switch the interaction mode:
-
WebSocket: Set the mode field in the session.update event.
{
    "type": "session.update",
    "session": {
        "mode": "server_commit"
    }
}
-
Python SDK: Set the mode parameter in the update_session method.
qwen_tts_realtime.update_session(
    voice='Cherry',
    response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
    mode='server_commit'
)
-
Java SDK: Set the mode parameter through QwenTtsRealtimeConfig.builder().
QwenTtsRealtimeConfig config = QwenTtsRealtimeConfig.builder()
    .voice("Cherry")
    .responseFormat(ttsFormat)
    .mode("server_commit")
    .build();
qwenTtsRealtime.updateSession(config);
For complete SDK code examples, see Python SDK and Java SDK. For the WebSocket event lifecycle and connection reuse, see WebSocket API reference.
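When you work with the raw WebSocket protocol rather than an SDK, the mode switch is a small JSON message. The sketch below builds that session.update payload and rejects unknown modes before anything is sent. It assumes only the type, session, and mode fields described in this section. The helper name build_session_update is hypothetical, and sending the string over an actual connection is left out.

```python
import json

# The two interaction modes described in this section.
VALID_MODES = ("server_commit", "commit")


def build_session_update(mode: str) -> str:
    """Return the JSON text of a session.update event that sets the interaction mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    event = {
        "type": "session.update",
        "session": {"mode": mode},
    }
    return json.dumps(event)
```

Validating the mode on the client side is a design choice for this sketch. It surfaces typos locally instead of relying on a server-side error.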
Instruction-based control
Instruction-based control lets you shape tone, speed, emotion, and timbre through natural language descriptions, without adjusting complex audio parameters.
Instruction specifications by model:
CosyVoice
Supported models: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash
Different models have different instruction format requirements:
-
cosyvoice-v3.5-plus and cosyvoice-v3.5-flash:
-
Voice Clone/Design timbres: Accept arbitrary instructions.
-
System voices: v3.5 doesn't support system voices.
-
cosyvoice-v3-plus:
-
Voice Clone/Design timbres: Instruction control isn't supported.
-
System voices: Instructions must follow a fixed format. For details, see CosyVoice Voice list.
-
cosyvoice-v3-flash:
-
Voice Clone/Design timbres: Accept arbitrary instructions.
-
System voices: Instructions must follow a fixed format. For details, see CosyVoice Voice list.
How to use: Specify instruction content through the instructions parameter.
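The per-model rules above can be encoded as a lookup table so that an application can check, before sending a request, whether instructions are accepted for a given model and voice type. This is only a sketch of the rules as listed in this section. The names INSTRUCTION_SUPPORT and instruction_mode, and the labels "custom", "system", "arbitrary", and "fixed_format", are illustrative and do not come from the SDK.

```python
from typing import Optional

# "custom" = voices created through voice cloning or Voice Design.
# "system" = built-in system voices.
# None means instructions cannot be used for that combination.
INSTRUCTION_SUPPORT = {
    "cosyvoice-v3.5-plus": {"custom": "arbitrary", "system": None},
    "cosyvoice-v3.5-flash": {"custom": "arbitrary", "system": None},
    "cosyvoice-v3-plus": {"custom": None, "system": "fixed_format"},
    "cosyvoice-v3-flash": {"custom": "arbitrary", "system": "fixed_format"},
}


def instruction_mode(model: str, voice_kind: str) -> Optional[str]:
    """Return "arbitrary", "fixed_format", or None for a model and voice kind."""
    try:
        return INSTRUCTION_SUPPORT[model][voice_kind]
    except KeyError:
        raise ValueError(
            f"unknown model or voice kind: {model!r}, {voice_kind!r}"
        ) from None
```

A caller might skip the instructions parameter entirely when instruction_mode returns None, and consult the CosyVoice Voice list for the required wording when it returns "fixed_format".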
Supported languages for instruction text:
-
cosyvoice-v3.5-plus and cosyvoice-v3.5-flash:
-
Voice Clone/Design timbres: Chinese, English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.
-
System voices: v3.5 doesn't support system voices.
-
cosyvoice-v3-plus:
-
Voice Clone/Design timbres: Chinese, English, French, German, Japanese, Korean, and Russian.
-
System voices: Instructions must follow a fixed format. For details, see CosyVoice Voice list.
-
cosyvoice-v3-flash:
-
Voice Clone/Design timbres: Chinese, English, French, German, Japanese, Korean, and Russian.
-
System voices: Chinese.
Instruction text length limit: Up to 100 characters. Chinese characters (including Simplified/Traditional Chinese, Japanese Kanji, and Korean Hanja) count as 2 characters each. Other characters (punctuation, letters, numbers, Japanese Kana, Korean Hangul, etc.) count as 1 character each.
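The counting rule above can be checked locally before a request is sent. The sketch below counts ideographs as 2 and every other character as 1. The Unicode ranges used to recognize ideographs are an assumption, since the exact character classes the service applies are not listed here, and the function names are illustrative.

```python
# Assumed code point ranges for Chinese characters, Kanji, and Hanja.
_IDEOGRAPH_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2FA1F),  # Supplementary-plane extensions and compatibility supplement
)


def is_ideograph(ch: str) -> bool:
    """Return True if the character falls in one of the assumed ideograph ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in _IDEOGRAPH_RANGES)


def instruction_length(text: str) -> int:
    """Count ideographs as 2 and every other character (kana, hangul, letters) as 1."""
    return sum(2 if is_ideograph(ch) else 1 for ch in text)


def fits_limit(text: str, limit: int = 100) -> bool:
    """Check the weighted length against the 100-character limit stated above."""
    return instruction_length(text) <= limit
```

Under this rule, a two-ideograph string such as "你好" counts as 4, while five kana or two hangul syllables count as 5 and 2 respectively.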
Qwen-TTS
Supported models: Qwen3-TTS-Instruct-Flash-Realtime series models only.
How to use: Specify instruction content through the instructions parameter.
Supported languages for instruction text: Chinese and English only.
Instruction text length limit: Up to 1,600 tokens.
Use cases:
-
Audiobook and radio drama voiceover
-
Advertising and promotional voiceover
-
Game character and animation voiceover
-
Emotionally expressive voice assistants
-
Documentary narration and news broadcasting
Tips for writing high-quality voice descriptions:
-
Core principles:
-
Be specific, not vague: Use words that describe concrete vocal qualities, such as "deep," "crisp," or "slightly fast." Avoid subjective or vague terms like "nice" or "normal."
-
Be multidimensional, not single-faceted: A good description covers multiple dimensions (gender, age, emotion, etc.). Writing only "female voice" is too broad to produce a distinctive timbre.
-
Be objective, not subjective: Focus on the physical and perceptual qualities of the voice. For example, use "slightly high pitch with energy" rather than "my favorite voice."
-
Be original, not imitative: Describe the vocal qualities you want, rather than requesting imitation of specific public figures (such as celebrities or actors). The model doesn't support imitation, and such requests may carry copyright risks.
-
Be concise, not redundant: Make every word count. Avoid repeating synonyms or stacking meaningless modifiers.
-
-
Description dimensions:
Combining the following dimensions produces more accurate results. The more dimensions described, the more precise the output.
Dimension: Example descriptions
Gender: Male, female, neutral
Age: Child (5-12), teenager (13-18), young adult (19-35), middle-aged (36-55), elderly (55+)
Pitch: High, mid, low, slightly high, slightly low
Speed: Fast, moderate, slow, slightly fast, slightly slow
Emotion: Cheerful, calm, gentle, serious, lively, composed, soothing
Timbre: Magnetic, crisp, husky, mellow, sweet, rich, powerful
Use case: News broadcasting, advertising, audiobook, animation character, voice assistant, documentary narration
-
Examples:
-
Standard broadcasting style: Clear and precise articulation with standard pronunciation
-
Young, lively female voice with a slightly fast pace and a noticeable rising intonation, suitable for introducing fashion products
-
Calm middle-aged male voice with a slow pace, deep and magnetic timbre, suitable for reading news or narrating documentaries
-
Gentle, intellectual female voice, around 30 years old, with a calm tone, suitable for audiobook reading
-
Cute child voice, about 8-year-old girl, slightly childish speech, suitable for animation character voiceover
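One way to apply the principles above consistently is to assemble descriptions from named dimensions. The sketch below is purely illustrative: the function, the dimension keys, and the short list of vague terms are hypothetical helpers, not part of any SDK.

```python
# Dimension keys mirror the dimensions listed above; the names are hypothetical.
DIMENSIONS = ("gender", "age", "pitch", "speed", "emotion", "timbre", "use_case")
# Vague terms called out in the core principles above.
VAGUE_TERMS = {"nice", "normal"}


def build_voice_description(**traits: str) -> str:
    """Join the given dimensions into one description, in a fixed order."""
    unknown = set(traits) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    for value in traits.values():
        if any(word in VAGUE_TERMS for word in value.lower().split()):
            raise ValueError(f"vague term in description: {value!r}")
    parts = [traits[name] for name in DIMENSIONS
             if name != "use_case" and name in traits]
    if "use_case" in traits:
        parts.append(f"suitable for {traits['use_case']}")
    return ", ".join(parts)


description = build_voice_description(
    gender="female voice",
    age="around 30 years old",
    emotion="calm tone",
    use_case="audiobook reading",
)
```

Rejecting vague words up front is a design choice for this sketch. A real application would likely use a longer word list or skip the check entirely.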
-
Dialects
Use the model to output speech in Chinese dialects such as Henan, Sichuan, or Cantonese. Configuration varies by model and voice type.
Dialect setup by model:
CosyVoice
-
System voices: Pick one of the following voices from CosyVoice Voice list:
-
Voices with built-in dialect support (for example, longshange_v3) output that dialect without any extra configuration.
-
Voices that support Instruction-based control and allow dialect selection (for example, longanhuan_v3): specify the target dialect in the instruction text.
-
-
Voice Clone timbres: Use Instruction-based control to set the dialect. For example, set the instruction text to 请用河南话表达 ("Please speak in the Henan dialect").
-
Voice Design timbres: Dialects aren't supported yet.
Supported dialects per model: See the "Supported languages" entry for each model in CosyVoice.
Example: To produce Henan dialect speech, use the cosyvoice-v3-flash model with the longanhuan_v3 voice and set the instruction text to "请用河南话表达。".
# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer
# The API Key differs between the Singapore and Beijing regions. Obtain an API Key: https://help.aliyun.com/en/model-studio/get-api-key
# If the environment variable isn't configured, replace the following line with your Model Studio API Key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# China (Beijing) region URL. The URL varies by region.
dashscope.base_websocket_api_url='wss://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/inference'
# Model
# Different model versions require corresponding voice types:
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# cosyvoice-v2: Use voices such as longxiaochun_v2.
# Pick a dialect-capable voice
model = "cosyvoice-v3-flash"
# Voice
voice = "longanhuan_v3"
# Instantiate SpeechSynthesizer and pass request parameters such as model, voice, and instruction in the constructor
synthesizer = SpeechSynthesizer(model=model, voice=voice, instruction="请用河南话表达。")
# Send text for synthesis and obtain binary audio
audio = synthesizer.call("叫你去买盐,你买回来一袋面,这不是弄啥嘞吗!")
# The first text submission requires establishing a WebSocket connection, so the first-packet latency includes connection setup time
print('[Metric] requestId: {}, first package delay: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))
# Save audio to a local file
with open('output.mp3', 'wb') as f:
    f.write(audio)
Qwen-TTS
-
System voices: Use a system voice that supports dialects. For the Qwen-TTS voice list, see Supported voices.
-
Voice Clone/Design timbres: Dialects aren't supported.
Supported dialects per model: See the "Supported languages" entry for each model in Qwen3-TTS.
Raw WebSocket protocol
The following examples show how to connect directly to the server through the raw WebSocket protocol, for scenarios where the DashScope SDK isn't used. These are minimal working implementations. For the WebSocket protocol specification of each model, see the corresponding API reference.
Connection reuse (WebSocket)
WebSocket connections support reuse: after a synthesis task completes, the next task can start on the same connection without reconnecting.
Reuse flow:
-
CosyVoice / Sambert: The client sends finish-task. After the server returns a task-finished event, the client can send a run-task event to start a new task.
-
Qwen-TTS: The client sends session.finish. After the server returns session.finished, the client can establish a new session for the next task.
-
Wait for the server to return the completion event (task-finished or session.finished) before starting a new task.
-
CosyVoice and Sambert require a different task_id for each task on a reused connection.
-
If a task fails, the server returns an error event and closes the connection. The connection cannot be reused.
-
If no new task starts within 60 seconds, the connection automatically disconnects.
For event details, see the corresponding CosyVoice API reference, Qwen-TTS API reference, and Sambert API reference.
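The reuse rules above can be tracked with a small client-side state machine. The following sketch assumes nothing about the message payloads. It only records when a new task may start, issues a fresh task ID per task, and treats an error or a 60-second idle gap as the end of the connection. The class and method names are hypothetical, and measuring idle time from the last completion event is an assumption.

```python
import time
import uuid

IDLE_TIMEOUT_SECONDS = 60  # idle limit stated above


class ReusableConnectionState:
    """Tracks whether a new task may start on a reused WebSocket connection."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.state = "idle"  # idle -> running -> finishing -> idle, or closed
        self._idle_since = clock()

    def start_task(self) -> str:
        """Call before sending run-task; returns a new, unique task ID."""
        if self.state != "idle":
            raise RuntimeError(f"cannot start a task while {self.state}")
        if self._clock() - self._idle_since >= IDLE_TIMEOUT_SECONDS:
            self.state = "closed"
            raise RuntimeError("connection idled out; reconnect first")
        self.state = "running"
        return uuid.uuid4().hex

    def finish_sent(self) -> None:
        """Call after sending finish-task (or session.finish)."""
        if self.state != "running":
            raise RuntimeError(f"nothing to finish while {self.state}")
        self.state = "finishing"

    def completion_received(self) -> None:
        """Call when task-finished (or session.finished) arrives."""
        if self.state != "finishing":
            raise RuntimeError(f"unexpected completion while {self.state}")
        self.state = "idle"
        self._idle_since = self._clock()

    def error_received(self) -> None:
        """Call on an error event; the connection cannot be reused."""
        self.state = "closed"
```

In a real client, these methods would be called from the code that sends and receives the WebSocket messages. For Qwen-TTS, the returned ID can be ignored, because the unique task_id requirement above applies to CosyVoice and Sambert.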
High-concurrency best practices
The DashScope SDK includes built-in pooling mechanisms to reuse WebSocket connections and synthesis objects, reducing the overhead of frequent creation and destruction.
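To illustrate the general idea behind pooling, here is a generic object pool built on the Python standard library. It is not the DashScope SDK's pooling API, whose class names and options are documented in the SDK reference. ObjectPool and the factory argument are hypothetical stand-ins.

```python
import queue
from contextlib import contextmanager


class ObjectPool:
    """Pre-creates a fixed number of objects and lends them out one at a time."""

    def __init__(self, factory, size: int):
        self._items = queue.Queue(maxsize=size)
        for _ in range(size):
            self._items.put(factory())

    @contextmanager
    def borrow(self, timeout=None):
        """Lend one object; blocks (up to timeout seconds) if none is free."""
        item = self._items.get(timeout=timeout)
        try:
            yield item
        finally:
            # Return the object so the next caller can reuse it.
            self._items.put(item)
```

A production pool would also need to discard objects whose connection has failed, because a failed connection cannot be reused.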
Supported scope
Available models vary by region:
China (Beijing)
To call the following models, select an API Key from the Beijing region:
-
CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2, cosyvoice-v1
-
Qwen-TTS:
-
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime (stable, currently equivalent to qwen3-tts-instruct-flash-realtime-2026-01-22), qwen3-tts-instruct-flash-realtime-2026-01-22 (latest snapshot)
-
Qwen3-TTS-VD-Realtime: qwen3-tts-vd-realtime-2026-01-15 (latest snapshot), qwen3-tts-vd-realtime-2025-12-16 (snapshot)
-
Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
-
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime (stable, currently equivalent to qwen3-tts-flash-realtime-2025-11-27), qwen3-tts-flash-realtime-2025-11-27 (latest snapshot), qwen3-tts-flash-realtime-2025-09-18 (snapshot)
-
Qwen-TTS-Realtime: qwen-tts-realtime (stable, currently equivalent to qwen-tts-realtime-2025-07-15), qwen-tts-realtime-latest (latest, currently equivalent to qwen-tts-realtime-2025-07-15), qwen-tts-realtime-2025-07-15 (snapshot)
-
-
Sambert: sambert-zhinan-v1, sambert-zhiqi-v1, sambert-zhichu-v1, sambert-zhide-v1, sambert-zhijia-v1, sambert-zhiru-v1, sambert-zhiqian-v1, sambert-zhixiang-v1, sambert-zhiwei-v1, sambert-zhihao-v1, sambert-zhijing-v1, sambert-zhiming-v1, sambert-zhimo-v1, sambert-zhina-v1, sambert-zhishu-v1, sambert-zhistella-v1, sambert-zhiting-v1, sambert-zhixiao-v1, sambert-zhiya-v1, sambert-zhiye-v1, sambert-zhiying-v1, sambert-zhiyuan-v1, sambert-zhiyue-v1, sambert-zhigui-v1, sambert-zhishuo-v1, sambert-zhimiao-emo-v1, sambert-zhimao-v1, sambert-zhilun-v1, sambert-zhifei-v1, sambert-zhida-v1, sambert-camila-v1, sambert-perla-v1, sambert-indah-v1, sambert-clara-v1, sambert-hanna-v1, sambert-beth-v1, sambert-betty-v1, sambert-cally-v1, sambert-cindy-v1, sambert-eva-v1, sambert-donna-v1, sambert-brian-v1, sambert-waan-v1. For details, see Sambert voice list
Singapore
To call the following models, select an API Key from the Singapore region:
-
CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
-
Qwen-TTS:
-
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime (stable, currently equivalent to qwen3-tts-instruct-flash-realtime-2026-01-22), qwen3-tts-instruct-flash-realtime-2026-01-22 (latest snapshot)
-
Qwen3-TTS-VD-Realtime: qwen3-tts-vd-realtime-2026-01-15 (latest snapshot), qwen3-tts-vd-realtime-2025-12-16 (snapshot)
-
Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
-
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime (stable, currently equivalent to qwen3-tts-flash-realtime-2025-11-27), qwen3-tts-flash-realtime-2025-11-27 (latest snapshot), qwen3-tts-flash-realtime-2025-09-18 (snapshot)
-
Supported voices
Different models support different voices. Set the voice request parameter to the value in the voice parameter column of the voice list.
API reference
FAQ
Q: How do I fix incorrect pronunciation in speech synthesis? How do I control the pronunciation of polyphonic characters?
-
Replace the polyphonic character with a homophone to quickly fix the pronunciation issue.
-
Use SSML markup to control pronunciation. Both Sambert and CosyVoice support SSML.
Q: How do I troubleshoot silent audio when using a cloned voice?
-
Verify the voice status
Call the Voice cloning/design API and confirm that the voice status is OK.
-
Check model version consistency
Make sure the target_model parameter used during voice cloning matches the model parameter used during speech synthesis. For example:
-
If you used cosyvoice-v3-plus for cloning
-
You must also use cosyvoice-v3-plus for synthesis
-
-
Verify source audio quality
Check whether the source audio used for voice cloning meets the audio requirements and best practices:
-
Audio duration: 10-20 seconds
-
Clear audio quality
-
No background noise
-
-
Check request parameters
Confirm that the voice parameter in your speech synthesis request is set to the cloned voice ID.
Q: What should I do if the cloned voice produces unstable or incomplete speech?
If the synthesized speech from a cloned voice exhibits any of the following issues:
-
Incomplete playback that only reads part of the text
-
Inconsistent synthesis quality
-
Abnormal pauses or silent segments in the speech
Possible cause: The source audio quality doesn't meet the requirements.
Solution: Check whether the source audio meets the requirements in Recording guide for voice cloning. If it doesn't, re-record the audio following that guide.
Q: Why does the actual duration differ from the duration displayed in the WAV file header?
Speech synthesis uses a streaming mechanism that returns data progressively as it's generated. The duration in the saved WAV file header is an estimate and may contain inaccuracies. For precise duration, set format to pcm, wait for the complete synthesis result, and then add the WAV file header yourself.
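As a sketch of the last step, the standard-library wave module can wrap the complete PCM data in a WAV container with an exact header. The defaults below (16-bit mono) and the function names are assumptions. Use the sample rate, channel count, and sample width that match your synthesis request.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap complete raw PCM data in a WAV container with an exact header."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)   # bytes per sample (2 = 16-bit)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)             # header sizes are fixed on close
    return buffer.getvalue()


def pcm_duration_seconds(pcm: bytes, sample_rate: int,
                         channels: int = 1, sample_width: int = 2) -> float:
    """Exact duration computed from the byte count of the PCM data."""
    return len(pcm) / (sample_rate * channels * sample_width)
```

Because the header is written only after all audio has been received, the duration it reports matches the actual data.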
Q: Why won't the audio play?
Troubleshoot based on the following scenarios:
-
Audio saved as a complete file (such as xx.mp3)
-
Audio format consistency: The audio format in the request parameters must match the file extension (for example, if the format is set to wav, the file must be saved as .wav).
-
Player compatibility: Confirm that your player supports the audio format and sample rate.
-
-
Streaming audio playback
-
Save the audio stream as a complete file and try playing it with a media player. If the file won't play, follow the troubleshooting steps for the first scenario above (audio saved as a complete file).
-
If the file plays correctly, the issue is in the streaming playback implementation. Confirm that your player supports streaming playback (such as ffmpeg, pyaudio, AudioFormat, or MediaSource).
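When checking format consistency for a saved file, it can help to look at the first bytes of the data instead of trusting the file extension. The following heuristic is a sketch, not an exhaustive detector. The function name is hypothetical, raw PCM has no header and is reported as unknown, and the assumption that Opus output arrives in an Ogg container should be verified against the API reference.

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess the container format from the leading bytes of an audio file."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"OggS":
        return "ogg"  # assumption: Opus audio carried in an Ogg container
    if data[:3] == b"ID3":
        return "mp3"  # MP3 file that starts with an ID3v2 tag
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"  # MPEG audio frame sync (11 set bits)
    return "unknown"  # includes headerless raw PCM
```

If the detected format differs from the file extension, rename the file or correct the format request parameter before investigating the player.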
-
Q: Why is audio playback stuttering?
Troubleshoot with the following steps:
-
Check text send rate: Make sure the interval between text segments is short enough that the next segment arrives before the previous audio finishes playing.
-
Check callback function performance:
-
Confirm that the callback function has no blocking business logic.
-
The callback runs on the WebSocket thread. Blocking it delays data reception. Write audio data to a separate buffer and process it in a separate thread.
-
-
Check network stability: Network fluctuations can cause audio transmission interruptions or delays.
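The buffering advice above can be sketched with a queue and a worker thread: the callback only enqueues the chunk and returns, while playback or file writing happens on another thread. AudioBuffer and the sink argument are hypothetical names. Call on_data from whichever audio callback your client exposes.

```python
import queue
import threading


class AudioBuffer:
    """Hands audio chunks from the receiving thread to a worker thread."""

    def __init__(self, sink):
        self._queue = queue.Queue()
        self._sink = sink  # slow consumer, e.g. a player or file writer
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def on_data(self, chunk: bytes) -> None:
        """Call from the audio callback: enqueue and return immediately."""
        self._queue.put(chunk)

    def close(self) -> None:
        """Signal the end of the stream and wait for pending chunks."""
        self._queue.put(None)
        self._worker.join()

    def _drain(self) -> None:
        while True:
            chunk = self._queue.get()
            if chunk is None:  # sentinel queued by close()
                break
            self._sink(chunk)
```

Chunks reach the sink in arrival order, so a player that writes them sequentially produces continuous audio even when individual writes are slow.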
Q: Why is speech synthesis taking a long time?
Troubleshoot with the following steps:
-
Check input interval
For streaming synthesis, confirm that the interval between text segments isn't too long. Long intervals increase total synthesis time.
-
Analyze performance metrics
-
First-packet latency: typically around 500 ms.
-
RTF (Real-Time Factor = total synthesis time / audio duration): should be less than 1.0 under normal conditions.
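The RTF definition above translates directly into a small helper. The function name is hypothetical, and the timing values would come from your own measurements.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = total synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Example: 2 seconds spent synthesizing 8 seconds of audio.
rtf = real_time_factor(2.0, 8.0)
```

Here the RTF is 0.25, which is below 1.0 and means that audio is generated faster than it plays back.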
-
Q: How do I restrict an API key to speech synthesis only (permission isolation)?
Create a new workspace and authorize only specific models to limit the scope of an API key. For details, see Manage workspaces.
Q: Can an API key from a sub-workspace call CosyVoice models?
In the default workspace, all models are available.
In a sub-workspace, you need to grant model authorization to the sub-workspace associated with the API key. See Model calls in a sub-workspace.