Speech synthesis

Choose the right model for text-to-speech, voice cloning, and voice design scenarios.

Standard speech synthesis or custom voices?

TTS models convert text to natural-sounding speech. Decide whether built-in voices or custom voices fit your needs:

| | Standard speech synthesis | Custom voices |
| --- | --- | --- |
| Voice source | Built-in voice library, ready to use | Cloned from an audio sample or created from a text description |
| Getting started | No extra setup required: select a model and a voice to start synthesizing | Provide an audio sample or a text description to create a voice |
| Use cases | Customer service bots, audiobook narration, news broadcasts, e-commerce live streaming | Brand-specific voices, virtual streamers, game character dubbing |
| Recommended models | cosyvoice-v3-plus, MiniMax/speech-2.8-hd | cosyvoice-v3.5-plus (voice cloning + voice design), MiniMax/speech-2.8-hd (voice cloning) |

  • Use standard speech synthesis when built-in voices meet your needs and you want zero-configuration setup.

  • Use custom voices when you need a brand-exclusive voice, want to replicate a specific speaker, or need to create a new character voice.

Voice cloning or voice design?

Custom voices offer two creation methods:

| | Voice cloning | Voice design |
| --- | --- | --- |
| Input | An audio sample from the target speaker | A text description of the desired voice (for example, "warm, low-pitched female voice") |
| Result | Synthesized speech closely resembles the original speaker | A brand-new voice generated from scratch based on the description |
| Use cases | Reusing a brand spokesperson or streamer's voice, virtual streamers, personalized voice assistants | Brand voice customization (no recordings available), game or animation character dubbing, creative content production |
| Recommended models | cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, MiniMax/speech-2.8-hd | cosyvoice-v3.5-plus, cosyvoice-v3.5-flash |
| Voice management service | voice-enrollment (register and manage voices) | voice-enrollment (register and manage voices) |

  • Use voice cloning when you have a recording of the target speaker and want to reproduce that voice.

  • Use voice design when no recording is available and you want to create a voice from a text description.
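The two decisions above (standard or custom, then cloning or design) can be sketched as a small helper. This is only an illustration of the guidance on this page: the function name `choose_voice_method` and the `CANDIDATE_MODELS` mapping are hypothetical and are not part of any SDK. The model lists are copied from the "Recommended models" rows of the two comparisons.

```python
# Hypothetical helper that encodes the decision guidance from this page.
# Model lists are copied from the "Recommended models" rows above.
CANDIDATE_MODELS = {
    "standard": ["cosyvoice-v3-plus", "MiniMax/speech-2.8-hd"],
    "voice cloning": ["cosyvoice-v3.5-plus", "cosyvoice-v3.5-flash", "MiniMax/speech-2.8-hd"],
    "voice design": ["cosyvoice-v3.5-plus", "cosyvoice-v3.5-flash"],
}


def choose_voice_method(built_in_voice_suffices: bool, has_recording: bool) -> str:
    """Return 'standard', 'voice cloning', or 'voice design'."""
    if built_in_voice_suffices:
        # Built-in voices meet the need: no voice creation step is required.
        return "standard"
    if has_recording:
        # A recording of the target speaker exists: reproduce that voice.
        return "voice cloning"
    # No recording: create a new voice from a text description.
    return "voice design"


if __name__ == "__main__":
    method = choose_voice_method(built_in_voice_suffices=False, has_recording=True)
    print(method, CANDIDATE_MODELS[method])
```

Checking for a usable built-in voice first mirrors the order of the sections above, since standard synthesis needs no extra setup.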

WebSocket or HTTP?

  • WebSocket: Bidirectional streaming that supports streaming input and output. Audio is returned as it is synthesized, providing the lowest latency. Best for real-time scenarios: customer service bots, voice assistants, and call centers.

  • HTTP: Accepts full text input with streaming audio output delivered in segments. Best for audiobook narration, content generation, and offline production.

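CosyVoice models use the same model name for both WebSocket and HTTP (cosyvoice-v1 is WebSocket only). Qwen model names that contain -realtime (for example, qwen3-tts-flash-realtime or qwen3-tts-flash-realtime-2025-11-27) use WebSocket; Qwen models without -realtime use HTTP.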

CosyVoice and Qwen WebSocket models can be accessed through the DashScope SDK (Java, Python). CosyVoice WebSocket models also support Android and iOS SDK access. Other models require direct calls using the corresponding WebSocket or HTTP protocol.

For WebSocket access, see Real-time speech synthesis. For HTTP access, see Non-real-time speech synthesis.
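The naming convention described above can be turned into a quick lookup. The following is a minimal sketch, not an SDK feature: the function `protocols_for` is hypothetical, and the special cases (cosyvoice-v1 as WebSocket only, MiniMax as HTTP only) are taken from the model tables later on this page.

```python
from typing import Set


def protocols_for(model_id: str) -> Set[str]:
    """Infer the supported protocols from a model ID (hypothetical helper)."""
    name = model_id.lower()
    if name.startswith("cosyvoice-"):
        # One model name serves both protocols, except cosyvoice-v1,
        # which the CosyVoice table lists as WebSocket only.
        if name == "cosyvoice-v1":
            return {"WebSocket"}
        return {"WebSocket", "HTTP"}
    if name.startswith("qwen"):
        # Qwen models: "-realtime" in the name means WebSocket, otherwise HTTP.
        return {"WebSocket"} if "-realtime" in name else {"HTTP"}
    if name.startswith("minimax/"):
        # The MiniMax table lists HTTP only.
        return {"HTTP"}
    raise ValueError(f"Unknown model family: {model_id}")


if __name__ == "__main__":
    for model in ("cosyvoice-v3.5-plus", "qwen3-tts-flash-realtime-2025-11-27", "qwen3-tts-flash"):
        print(model, sorted(protocols_for(model)))
```

The check uses a substring match rather than a suffix match because dated snapshots place the date after -realtime.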

Instruction control

Use natural-language instructions to control speech rate, emotion, and style per request — for example, "speak gently at a slightly slower pace" or "use an excited broadcast style." Ideal for emotionally expressive content, professional broadcasts, and audiobook narration.

Supported models: CosyVoice (cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash) and Qwen3-TTS (qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash). For details, see Real-time speech synthesis > Instruction control.

Recommended models

The following table lists the recommended model for each scenario. Visit the Model Gallery for a full catalog.

| Model ID | Series | API | Voice cloning | Voice design | Instruction control |
| --- | --- | --- | --- | --- | --- |
| cosyvoice-v3.5-plus | CosyVoice | WebSocket / HTTP | Supported | Supported | Supported |
| cosyvoice-v3-plus | CosyVoice | WebSocket / HTTP | Supported | Supported | Unsupported |
| MiniMax/speech-2.8-hd | MiniMax | HTTP | Supported | Unsupported | Unsupported |
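To narrow the recommended models by requirement, the table can be expressed as data and filtered. This is a sketch under stated assumptions: the `Model` class and the `pick` function are hypothetical names introduced for illustration, and the capability values are copied from the recommended-models table.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Model:
    model_id: str
    apis: Tuple[str, ...]
    cloning: bool
    design: bool
    instruct: bool


# Values copied from the recommended-models table on this page.
RECOMMENDED = [
    Model("cosyvoice-v3.5-plus", ("WebSocket", "HTTP"), True, True, True),
    Model("cosyvoice-v3-plus", ("WebSocket", "HTTP"), True, True, False),
    Model("MiniMax/speech-2.8-hd", ("HTTP",), True, False, False),
]


def pick(api: Optional[str] = None, cloning: bool = False,
         design: bool = False, instruct: bool = False) -> List[str]:
    """Return the IDs of recommended models that meet every requested requirement."""
    return [
        m.model_id
        for m in RECOMMENDED
        if (api is None or api in m.apis)
        and (m.cloning or not cloning)
        and (m.design or not design)
        and (m.instruct or not instruct)
    ]


if __name__ == "__main__":
    print(pick(api="WebSocket", design=True))
    print(pick(instruct=True))
```

A requirement left at its default is treated as "don't care", so `pick()` with no arguments returns all three models.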

All models

CosyVoice

Some CosyVoice models support SSML markup and reading LaTeX formulas aloud.

| Model ID | API | Voice cloning | Voice design | Instruction control |
| --- | --- | --- | --- | --- |
| cosyvoice-v3.5-plus | WebSocket / HTTP | Supported | Supported | Supported |
| cosyvoice-v3.5-flash | WebSocket / HTTP | Supported | Supported | Supported |
| cosyvoice-v3-plus | WebSocket / HTTP | Supported | Supported | Unsupported |
| cosyvoice-v3-flash | WebSocket / HTTP | Supported | Supported | Supported |
| cosyvoice-v2 | WebSocket / HTTP | Supported | Unsupported | Unsupported |
| cosyvoice-v1 | WebSocket | Supported | Unsupported | Unsupported |

Supported languages (by version):

  • cosyvoice-v3.5-plus and cosyvoice-v3.5-flash (no system voices):

    • Voice cloning: Chinese (Mandarin; Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Min Nan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuanese, Tianjin, and Yunnan dialects via instruction control), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese

    • Voice design: Mandarin Chinese and English

  • cosyvoice-v3-plus:

    • System voices: Mandarin Chinese, English, Japanese, and Korean (varies by voice)

    • Voice cloning: Chinese (Mandarin; Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Min Nan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuanese, Tianjin, and Yunnan dialects via instruction control), English, French, German, Japanese, Korean, and Russian

    • Voice design: Mandarin Chinese and English

  • cosyvoice-v3-flash:

    • System voices (varies by voice): Mandarin Chinese (with Cantonese, Northeastern, Henan, Hunan, Shaanxi, Shandong, Sichuanese, and Anhui dialects available via instruction control) and English

    • Voice cloning: Chinese (Mandarin; Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Min Nan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuanese, Tianjin, and Yunnan dialects via instruction control), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese

    • Voice design: Mandarin Chinese and English

  • cosyvoice-v2 (no voice design):

    • System voices: Mandarin Chinese, English, Korean, and Japanese (varies by voice)

    • Voice cloning: Mandarin Chinese and English

  • cosyvoice-v1 (no voice design):

    • System voices: Mandarin Chinese and English (varies by voice)

    • Voice cloning: Mandarin Chinese and English
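The voice-cloning language lists above differ between versions in a way that is easy to miss: cosyvoice-v3-plus lacks Portuguese, Thai, Indonesian, and Vietnamese. The following sketch encodes those lists so they can be queried. The names `CLONING_LANGUAGES` and `models_that_clone` are hypothetical, and the data is simplified: "Chinese" stands for Mandarin, and dialect support is not modeled.

```python
# Voice-cloning languages per CosyVoice version, copied from the list above.
# Simplification: "Chinese" means Mandarin; dialects are not modeled here.
_BASE = frozenset({"Chinese", "English", "French", "German", "Japanese", "Korean", "Russian"})
_EXTENDED = _BASE | {"Portuguese", "Thai", "Indonesian", "Vietnamese"}
_ZH_EN = frozenset({"Chinese", "English"})

CLONING_LANGUAGES = {
    "cosyvoice-v3.5-plus": _EXTENDED,
    "cosyvoice-v3.5-flash": _EXTENDED,
    "cosyvoice-v3-plus": _BASE,
    "cosyvoice-v3-flash": _EXTENDED,
    "cosyvoice-v2": _ZH_EN,
    "cosyvoice-v1": _ZH_EN,
}


def models_that_clone(language: str) -> list:
    """Return the CosyVoice models whose cloned voices support the given language."""
    return sorted(m for m, langs in CLONING_LANGUAGES.items() if language in langs)


if __name__ == "__main__":
    print(models_that_clone("Thai"))
    print(models_that_clone("Russian"))
```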

Qwen3-TTS

| Model ID | API | Voice cloning | Voice design | Instruction control |
| --- | --- | --- | --- | --- |
| qwen3-tts-flash | HTTP | Unsupported | Unsupported | Unsupported |
| qwen3-tts-flash-2025-11-27 | HTTP | Unsupported | Unsupported | Unsupported |
| qwen3-tts-flash-2025-09-18 | HTTP | Unsupported | Unsupported | Unsupported |
| qwen3-tts-flash-realtime | WebSocket | Unsupported | Unsupported | Unsupported |
| qwen3-tts-flash-realtime-2025-11-27 | WebSocket | Unsupported | Unsupported | Unsupported |
| qwen3-tts-flash-realtime-2025-09-18 | WebSocket | Unsupported | Unsupported | Unsupported |
| qwen3-tts-instruct-flash | HTTP | Unsupported | Unsupported | Supported |
| qwen3-tts-instruct-flash-2026-01-26 | HTTP | Unsupported | Unsupported | Supported |
| qwen3-tts-instruct-flash-realtime | WebSocket | Unsupported | Unsupported | Supported |
| qwen3-tts-instruct-flash-realtime-2026-01-22 | WebSocket | Unsupported | Unsupported | Supported |
| qwen3-tts-vc-2026-01-22 | HTTP | Supported | Unsupported | Unsupported |
| qwen3-tts-vc-realtime-2026-01-15 | WebSocket | Supported | Unsupported | Unsupported |
| qwen3-tts-vc-realtime-2025-11-27 | WebSocket | Supported | Unsupported | Unsupported |
| qwen3-tts-vd-2026-01-26 | HTTP | Unsupported | Supported | Unsupported |
| qwen3-tts-vd-realtime-2026-01-15 | WebSocket | Unsupported | Supported | Unsupported |
| qwen3-tts-vd-realtime-2025-12-16 | WebSocket | Unsupported | Supported | Unsupported |

Supported languages (by series):

  • Qwen3-TTS-Flash series (system voices) — qwen3-tts-flash, qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18, qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18: Chinese (Mandarin; Beijing, Shanghai, Sichuan, Nanjing, Shaanxi, Min Nan, Tianjin, and Cantonese dialects, varies by voice), English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian

  • Qwen3-TTS-Instruct-Flash series (system voices) — qwen3-tts-instruct-flash, qwen3-tts-instruct-flash-2026-01-26, qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian

  • Qwen3-TTS-VC series (voice cloning) — qwen3-tts-vc-2026-01-22, qwen3-tts-vc-realtime-2026-01-15, qwen3-tts-vc-realtime-2025-11-27: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian

  • Qwen3-TTS-VD series (voice design) — qwen3-tts-vd-2026-01-26, qwen3-tts-vd-realtime-2026-01-15, qwen3-tts-vd-realtime-2025-12-16: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
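The Qwen3-TTS model IDs listed above appear to follow a regular pattern: a family (flash, instruct-flash, vc, or vd), an optional -realtime marker, and an optional snapshot date. The sketch below parses that pattern. The pattern is inferred from the IDs on this page rather than from a published naming specification, and the names `parse_qwen3_tts`, `QwenTTSModel`, and `FAMILY_FEATURE` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the model IDs listed on this page (an assumption).
_PATTERN = re.compile(
    r"^qwen3-tts-(?P<family>instruct-flash|flash|vc|vd)"
    r"(?P<realtime>-realtime)?"
    r"(?:-(?P<snapshot>\d{4}-\d{2}-\d{2}))?$"
)

# Main capability of each family, per the Qwen3-TTS table and language list.
FAMILY_FEATURE = {
    "flash": "system voices",
    "instruct-flash": "instruction control",
    "vc": "voice cloning",
    "vd": "voice design",
}


class QwenTTSModel(NamedTuple):
    family: str
    api: str
    snapshot: Optional[str]


def parse_qwen3_tts(model_id: str) -> QwenTTSModel:
    """Split a Qwen3-TTS model ID into family, API protocol, and snapshot date."""
    match = _PATTERN.match(model_id)
    if match is None:
        raise ValueError(f"Not a recognized Qwen3-TTS model ID: {model_id}")
    # "-realtime" in the name means WebSocket; otherwise the model uses HTTP.
    api = "WebSocket" if match.group("realtime") else "HTTP"
    return QwenTTSModel(match.group("family"), api, match.group("snapshot"))


if __name__ == "__main__":
    parsed = parse_qwen3_tts("qwen3-tts-vc-realtime-2026-01-15")
    print(parsed, FAMILY_FEATURE[parsed.family])
```

IDs without a date, such as qwen3-tts-flash, parse with `snapshot` set to `None`.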

MiniMax

| Model ID | API | Voice cloning | Voice design | Instruction control |
| --- | --- | --- | --- | --- |
| MiniMax/speech-2.8-hd | HTTP | Supported | Unsupported | Unsupported |
| MiniMax/speech-02-hd | HTTP | Supported | Unsupported | Unsupported |
| MiniMax/speech-2.8-turbo | HTTP | Supported | Unsupported | Unsupported |
| MiniMax/speech-02-turbo | HTTP | Supported | Unsupported | Unsupported |

Qwen-TTS (legacy, token-based billing)

These legacy Qwen-TTS models are billed by token. If you have migrated to Qwen3-TTS, use the recommended models listed earlier.

| Model ID | API | Description |
| --- | --- | --- |
| qwen-tts | HTTP | Non-streaming synthesis, billed by token |
| qwen-tts-latest | HTTP | Non-streaming synthesis, billed by token |
| qwen-tts-2025-05-22 | HTTP | Snapshot version, billed by token |
| qwen-tts-2025-04-10 | HTTP | Snapshot version, billed by token |
| qwen-tts-realtime | WebSocket | Streaming synthesis, billed by token |
| qwen-tts-realtime-latest | WebSocket | Streaming synthesis, billed by token |
| qwen-tts-realtime-2025-07-15 | WebSocket | Snapshot version, streaming synthesis, billed by token |

Supported languages (by series):

  • Qwen-TTS series (system voices) — qwen-tts, qwen-tts-latest, qwen-tts-2025-05-22, qwen-tts-2025-04-10: Chinese (Mandarin; Beijing, Shanghai, and Sichuan dialects, varies by voice), English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian

  • Qwen-TTS-Realtime series (system voices) — qwen-tts-realtime, qwen-tts-realtime-latest, qwen-tts-realtime-2025-07-15: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian