Choose the right model for text-to-speech, voice cloning, and voice design scenarios.
Standard speech synthesis or custom voices?
TTS models convert text to natural-sounding speech. Decide whether built-in voices or custom voices fit your needs:
| | Standard speech synthesis | Custom voices |
| --- | --- | --- |
| Voice source | Built-in voice library, ready to use | Cloned from an audio sample or created from a text description |
| Getting started | No extra setup required: select a model and voice to start synthesizing | Provide an audio sample or text description to create a voice |
| Use cases | Customer service bots, audiobook narration, news broadcasts, e-commerce live streaming | Brand-specific voices, virtual streamers, game character dubbing |
| Recommended models | | |
- Use standard speech synthesis when built-in voices meet your needs and you want zero-configuration setup.
- Use custom voices when you need a brand-exclusive voice, want to replicate a specific speaker, or need to create a new character voice.
Voice cloning or voice design?
Custom voices offer two creation methods:
| | Voice cloning | Voice design |
| --- | --- | --- |
| Input | An audio sample from the target speaker | A text description of the desired voice (for example, "warm, low-pitched female voice") |
| Result | Synthesized speech closely resembles the original speaker | A brand-new voice generated from scratch based on the description |
| Use cases | Reusing a brand spokesperson or streamer's voice, virtual streamers, personalized voice assistants | Brand voice customization (no recordings available), game or animation character dubbing, creative content production |
| Recommended models | | |
| Voice management service | | |
- Use voice cloning when you have a recording of the target speaker and want to reproduce that voice.
- Use voice design when no recording is available and you want to create a voice from a text description.
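The two decisions above (standard or custom, then cloning or design) can be sketched as a small helper. This is only an illustration of the guidance in this section: the function name `choose_voice_approach` and its two boolean inputs are hypothetical and are not part of any SDK.

```python
def choose_voice_approach(builtin_voice_ok: bool, has_recording: bool) -> str:
    """Map the selection guidance to one of three labels.

    builtin_voice_ok: a voice from the built-in library meets the need.
    has_recording:    an audio sample of the target speaker is available.
    Both parameters are illustrative assumptions, not API fields.
    """
    if builtin_voice_ok:
        # Built-in voices need no extra setup.
        return "standard"
    if has_recording:
        # A recording exists, so the speaker's voice can be reproduced.
        return "voice_cloning"
    # No recording: create a new voice from a text description.
    return "voice_design"


# Example: a brand voice is required, but no recordings exist.
approach = choose_voice_approach(builtin_voice_ok=False, has_recording=False)
```

The helper checks the built-in option first because it is the only path that requires no voice creation step.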
WebSocket or HTTP?
- WebSocket: Bidirectional streaming that supports streaming input and output. Audio is returned as it is synthesized, providing the lowest latency. Best for real-time scenarios: customer service bots, voice assistants, and call centers.
- HTTP: Accepts full text input with streaming audio output delivered in segments. Best for audiobook narration, content generation, and offline production.
CosyVoice models share one model name for both WebSocket and HTTP. Qwen models use a -realtime suffix for WebSocket; models without this suffix use HTTP.
CosyVoice and Qwen WebSocket models can be accessed through the DashScope SDK (Java, Python). CosyVoice WebSocket models also support Android and iOS SDK access. Other models require direct calls using the corresponding WebSocket or HTTP protocol.
For WebSocket access, see Real-time speech synthesis. For HTTP access, see Non-real-time speech synthesis.
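The naming rule described in this section can be expressed as a short lookup. The sketch below assumes only what the text states: CosyVoice names are shared across both protocols, and Qwen names containing `-realtime` are WebSocket models. The function name `transports_for` is hypothetical, and the per-model tables remain the authority (for example, one CosyVoice row lists WebSocket only).

```python
def transports_for(model_id: str) -> tuple:
    """Infer the access protocol(s) from a model name.

    Illustrative helper only; it encodes the naming convention from the
    text and does not query any service.
    """
    name = model_id.lower()
    if name.startswith("cosyvoice"):
        # CosyVoice models share one name for WebSocket and HTTP.
        return ("websocket", "http")
    if name.startswith("qwen"):
        # Snapshot names place the date after "-realtime", so test for
        # the marker anywhere in the name rather than only at the end.
        return ("websocket",) if "-realtime" in name else ("http",)
    raise ValueError(f"No naming rule is stated for model: {model_id}")


print(transports_for("qwen3-tts-flash-realtime"))  # ('websocket',)
print(transports_for("qwen3-tts-flash"))           # ('http',)
```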
Instruction control
Use natural-language instructions to control speech rate, emotion, and style per request — for example, "speak gently at a slightly slower pace" or "use an excited broadcast style." Ideal for emotionally expressive content, professional broadcasts, and audiobook narration.
Supported models: CosyVoice (cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash) and Qwen-TTS (qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash). For details, see Real-time speech synthesis > Instruction control.
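Before attaching an instruction to a request, a client may want to confirm that the chosen model is on the supported list. The sketch below hard-codes the five model names given above. The helper name `supports_instruction_control` is hypothetical, and the exact-match rule is an assumption: dated snapshot variants are not named in the list, so they are treated as unsupported here.

```python
# Model names copied from the supported-models list in this section.
INSTRUCTION_CONTROL_MODELS = frozenset({
    "cosyvoice-v3.5-plus",
    "cosyvoice-v3.5-flash",
    "cosyvoice-v3-flash",
    "qwen3-tts-instruct-flash-realtime",
    "qwen3-tts-instruct-flash",
})


def supports_instruction_control(model_id: str) -> bool:
    """Return True only for models explicitly listed as supporting instructions."""
    return model_id.strip().lower() in INSTRUCTION_CONTROL_MODELS


model = "cosyvoice-v3-flash"
instruction = "speak gently at a slightly slower pace"
if not supports_instruction_control(model):
    # Drop the instruction rather than send it to a model that is not listed.
    instruction = None
```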
Recommended models
The following table lists the recommended model for each scenario. Visit the Model Gallery for a full catalog.
| Model ID | Series | API | Voice cloning | Voice design | Instruction control |
| --- | --- | --- | --- | --- | --- |
| | CosyVoice | WebSocket / HTTP | | | |
| | CosyVoice | WebSocket / HTTP | | | |
| | MiniMax | HTTP | | | |
All models
CosyVoice
Some CosyVoice models support SSML markup and reading LaTeX formulas aloud.
| Model ID | API | Voice cloning | Voice design | Instruction control |
| --- | --- | --- | --- | --- |
| | WebSocket / HTTP | | | |
| | WebSocket / HTTP | | | |
| | WebSocket / HTTP | | | |
| | WebSocket / HTTP | | | |
| | WebSocket / HTTP | | | |
| | WebSocket | | | |
Supported languages (by version):
- cosyvoice-v3.5-plus and cosyvoice-v3.5-flash (no system voices):
  - Voice cloning: Chinese (Mandarin; Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Min Nan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuanese, Tianjin, and Yunnan dialects via instruction control), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese
  - Voice design: Mandarin Chinese and English
- cosyvoice-v3-plus:
  - System voices: Mandarin Chinese, English, Japanese, and Korean (varies by voice)
  - Voice cloning: Chinese (Mandarin; Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Min Nan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuanese, Tianjin, and Yunnan dialects via instruction control), English, French, German, Japanese, Korean, and Russian
  - Voice design: Mandarin Chinese and English
- cosyvoice-v3-flash:
  - System voices (varies by voice): Mandarin Chinese (with Cantonese, Northeastern, Henan, Hunan, Shaanxi, Shandong, Sichuanese, and Anhui dialects available via instruction control) and English
  - Voice cloning: Chinese (Mandarin; Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Min Nan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuanese, Tianjin, and Yunnan dialects via instruction control), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese
  - Voice design: Mandarin Chinese and English
- cosyvoice-v2 (no voice design):
  - System voices: Mandarin Chinese, English, Korean, and Japanese (varies by voice)
  - Voice cloning: Mandarin Chinese and English
- cosyvoice-v1 (no voice design):
  - System voices: Mandarin Chinese and English (varies by voice)
  - Voice cloning: Mandarin Chinese and English
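The per-version notes above can be condensed into a lookup table for use in client code. This is a minimal sketch: the dictionary layout and the helper `cosyvoice_models_with` are illustrative, and the flags are derived only from the "no system voices" and "no voice design" notes in this section. Every version listed above supports voice cloning.

```python
# Feature flags per CosyVoice version, taken from the language notes above.
COSYVOICE_FEATURES = {
    "cosyvoice-v3.5-plus":  {"system_voices": False, "voice_cloning": True, "voice_design": True},
    "cosyvoice-v3.5-flash": {"system_voices": False, "voice_cloning": True, "voice_design": True},
    "cosyvoice-v3-plus":    {"system_voices": True,  "voice_cloning": True, "voice_design": True},
    "cosyvoice-v3-flash":   {"system_voices": True,  "voice_cloning": True, "voice_design": True},
    "cosyvoice-v2":         {"system_voices": True,  "voice_cloning": True, "voice_design": False},
    "cosyvoice-v1":         {"system_voices": True,  "voice_cloning": True, "voice_design": False},
}


def cosyvoice_models_with(feature: str) -> list:
    """List the CosyVoice versions for which the given feature flag is set."""
    return [model for model, flags in COSYVOICE_FEATURES.items() if flags[feature]]


# Versions that can create a voice from a text description.
design_capable = cosyvoice_models_with("voice_design")
```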
Qwen3-TTS
| Model ID | API | Voice cloning | Voice design | Instruction control |
| --- | --- | --- | --- | --- |
| | HTTP | | | |
| | HTTP | | | |
| | HTTP | | | |
| | WebSocket | | | |
| | WebSocket | | | |
| | WebSocket | | | |
| | HTTP | | | |
| | HTTP | | | |
| | WebSocket | | | |
| | WebSocket | | | |
| | HTTP | | | |
| | WebSocket | | | |
| | WebSocket | | | |
| | HTTP | | | |
| | WebSocket | | | |
| | WebSocket | | | |
Supported languages (by series):
- Qwen3-TTS-Flash series (system voices)
  - Models: qwen3-tts-flash, qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18, qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
  - Languages: Chinese (Mandarin; Beijing, Shanghai, Sichuan, Nanjing, Shaanxi, Min Nan, Tianjin, and Cantonese dialects, varies by voice), English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
- Qwen3-TTS-Instruct-Flash series (system voices)
  - Models: qwen3-tts-instruct-flash, qwen3-tts-instruct-flash-2026-01-26, qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
  - Languages: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
- Qwen3-TTS-VC series (voice cloning)
  - Models: qwen3-tts-vc-2026-01-22, qwen3-tts-vc-realtime-2026-01-15, qwen3-tts-vc-realtime-2025-11-27
  - Languages: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
- Qwen3-TTS-VD series (voice design)
  - Models: qwen3-tts-vd-2026-01-26, qwen3-tts-vd-realtime-2026-01-15, qwen3-tts-vd-realtime-2025-12-16
  - Languages: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
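The series names above follow a visible prefix pattern, so a model ID can be mapped to its series and voice source with simple string checks. The sketch below assumes that pattern holds for the IDs listed in this section. The function name `qwen3_tts_series` is hypothetical and the mapping is inferred from the names, not taken from an official API.

```python
PREFIX = "qwen3-tts-"

# (name fragment after the prefix, series label, voice source).
# "instruct-flash" and "flash" begin differently, so order is not critical,
# but the more specific entry is listed first for readability.
SERIES_RULES = (
    ("instruct-flash", "Qwen3-TTS-Instruct-Flash", "system voices"),
    ("flash", "Qwen3-TTS-Flash", "system voices"),
    ("vc", "Qwen3-TTS-VC", "voice cloning"),
    ("vd", "Qwen3-TTS-VD", "voice design"),
)


def qwen3_tts_series(model_id: str) -> tuple:
    """Return (series, voice_source) for a Qwen3-TTS model ID."""
    name = model_id.lower()
    if not name.startswith(PREFIX):
        raise ValueError(f"Not a Qwen3-TTS model ID: {model_id}")
    rest = name[len(PREFIX):]
    for fragment, series, voice_source in SERIES_RULES:
        if rest.startswith(fragment):
            return (series, voice_source)
    raise ValueError(f"Unrecognized Qwen3-TTS series: {model_id}")


print(qwen3_tts_series("qwen3-tts-vc-realtime-2026-01-15"))
# ('Qwen3-TTS-VC', 'voice cloning')
```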
MiniMax
| Model ID | API | Voice cloning | Voice design | Instruction control |
| --- | --- | --- | --- | --- |
| | HTTP | | | |
| | HTTP | | | |
| | HTTP | | | |
| | HTTP | | | |
Qwen-TTS (legacy, token-based billing)
These legacy Qwen-TTS models are billed by token. If you have migrated to Qwen3-TTS, use the recommended models listed earlier.
| Model ID | API | Description |
| --- | --- | --- |
| | HTTP | Non-streaming synthesis, billed by token |
| | HTTP | Non-streaming synthesis, billed by token |
| | HTTP | Snapshot version, billed by token |
| | HTTP | Snapshot version, billed by token |
| | WebSocket | Streaming synthesis, billed by token |
| | WebSocket | Streaming synthesis, billed by token |
| | WebSocket | Snapshot version, streaming synthesis, billed by token |
Supported languages (by series):
- Qwen-TTS series (system voices)
  - Models: qwen-tts, qwen-tts-latest, qwen-tts-2025-05-22, qwen-tts-2025-04-10
  - Languages: Chinese (Mandarin; Beijing, Shanghai, and Sichuan dialects, varies by voice), English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
- Qwen-TTS-Realtime series (system voices)
  - Models: qwen-tts-realtime, qwen-tts-realtime-latest, qwen-tts-realtime-2025-07-15
  - Languages: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian