Non-real-time speech synthesis

Overview

Convert complete text into audio files through the HTTP API. Two output modes are available: non-streaming and streaming.

Non-streaming output returns an audio file URL that is valid for 24 hours; streaming output returns PCM audio data in chunks.
Multiple languages are supported, including Chinese dialects.
Voice cloning and Voice Design are supported for creating custom voices.
Instruction control adjusts speech expressiveness through natural language instructions.
Emotion and rich language tags can be embedded in the text to control emotional expression or insert sound effects.

For low-latency streaming scenarios, see Real-time speech synthesis. For model selection recommendations, see Speech synthesis.

Prerequisites

Before you begin, complete the following preparations:

Configure an API key and set it as an environment variable
(Optional) If you call the API through the DashScope SDK, install the latest SDK

Quick start

The following sections demonstrate speech synthesis for each model series. For examples in more languages and detailed parameter descriptions, see API reference.

Qwen-Audio-TTS

The following examples demonstrate how to synthesize speech with Qwen-Audio-TTS models.

Important

Qwen-Audio-TTS non-real-time speech synthesis is available only in the China (Beijing) region.

Non-streaming output

In non-streaming mode, the response contains a URL to the synthesized audio file. The URL is valid for 24 hours.

curl -X POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen-audio-3.0-tts-flash",
    "input": {
      "text": "There is a large garden behind my house.",
      "voice": "longanlingxi",
      "format": "wav",
      "sample_rate": 24000
    }
}'
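The same request can also be assembled in Python with only the standard library. The sketch below mirrors the cURL call above; `build_tts_request` and its `workspace_id` argument are illustrative names that stand in for the `{WorkspaceId}` placeholder, not part of any SDK.

```python
import json
import urllib.request


def build_tts_request(workspace_id: str, api_key: str, text: str,
                      model: str = "qwen-audio-3.0-tts-flash",
                      voice: str = "longanlingxi") -> urllib.request.Request:
    """Build the same POST request as the cURL example (hypothetical helper)."""
    url = (f"https://{workspace_id}.cn-beijing.maas.aliyuncs.com"
           "/api/v1/services/audio/tts/SpeechSynthesizer")
    body = {
        "model": model,
        "input": {
            "text": text,
            "voice": voice,
            "format": "wav",
            "sample_rate": 24000,
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen(...)` and decode the body with `json.loads`. The response schema is described in the API reference, so the sketch does not assume any field names.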

Streaming output

Add the X-DashScope-SSE: enable header to enable streaming output. The server returns audio data incrementally using Server-Sent Events (SSE).

curl -X POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
    "model": "qwen-audio-3.0-tts-flash",
    "input": {
      "text": "There is a large garden behind my house.",
      "voice": "longanlingxi",
      "format": "wav",
      "sample_rate": 24000
    }
}'
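On the client side, an SSE response arrives as text lines in which each event's payload is carried on `data:` lines and a blank line ends the event. The following is a minimal sketch of a parser for that framing; `iter_sse_data` is an illustrative helper, and it assumes each event's data is a JSON document. The fields inside that JSON are defined in the API reference and are not assumed here.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the parsed JSON payload of each SSE event (hypothetical helper)."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format drops one optional space after the colon
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line ends the event; multiple data lines are joined with "\n"
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Other fields (id:, event:, comments) are ignored in this sketch
```

With a standard-library HTTP response object `resp`, the lines can be supplied as `iter_sse_data(line.decode("utf-8") for line in resp)`.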

CosyVoice

The following examples demonstrate how to synthesize speech with CosyVoice models.

Important

CosyVoice non-real-time speech synthesis is available only in the China (Beijing) region.

Non-streaming output

In non-streaming mode, the response contains a URL to the synthesized audio file. The URL is valid for 24 hours.

curl -X POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "cosyvoice-v3-flash",
    "input": {
      "text": "There is a large garden behind my house.",
      "voice": "longanyang",
      "format": "wav",
      "sample_rate": 24000
    }
}'

Streaming output

Add the X-DashScope-SSE: enable header to enable streaming output. The server returns audio data incrementally using Server-Sent Events (SSE).

curl -X POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
    "model": "cosyvoice-v3-flash",
    "input": {
      "text": "There is a large garden behind my house.",
      "voice": "longanyang",
      "format": "wav",
      "sample_rate": 24000
    }
}'

Qwen-TTS

All examples in this section use system voices.

Non-streaming output

In non-streaming mode, the response contains a url field that points to the synthesized audio file. The URL is valid for 24 hours.

Python

import os
import dashscope

# The following is the configuration for the China (Beijing) region.
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

text = "Let me recommend this T-shirt. It looks absolutely amazing — the color is so elegant and it pairs well with almost anything. You can buy it without hesitation. It is very flattering and looks great on all body types. I highly recommend placing an order!"
# Alternative: the SpeechSynthesizer interface is called as dashscope.audio.qwen_tts.SpeechSynthesizer.call(...)
response = dashscope.MultiModalConversation.call(
    # To use the instruction control feature, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    # The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://help.aliyun.com/zh/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text=text,
    voice="Cherry",
    language_type="English", # We recommend matching this with the language of the text for correct pronunciation and natural intonation.
    # To use the instruction control feature, uncomment the lines below and replace model with qwen3-tts-instruct-flash
    # instructions='Fast-paced speech with noticeable upward intonation, ideal for presenting fashion products.',
    # optimize_instructions=True,
    stream=False
)
print(response)
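To keep the synthesized audio, download the file from the returned URL before it expires. The following is a minimal sketch using only the standard library. `download_audio` is an illustrative helper, and reading the URL as `response.output.audio.url` is an assumption based on the `getOutput().getAudio().getUrl()` accessors in the Java example.

```python
import shutil
import urllib.request


def download_audio(audio_url: str, save_path: str = "downloaded_audio.wav") -> str:
    """Stream the file at audio_url to save_path and return the path (hypothetical helper)."""
    with urllib.request.urlopen(audio_url) as resp, open(save_path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return save_path


# Assumed usage with the response printed above:
#   download_audio(response.output.audio.url)
```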

Java

You must import the Gson dependency. Add it using Maven or Gradle:

Maven

Add the following to pom.xml:

<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>

Gradle

Add the following to build.gradle:

// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")

import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

public class Main {
    // To use the instruction control feature, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void call() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://help.aliyun.com/zh/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
                .languageType("English") // We recommend matching this with the language of the text for correct pronunciation and natural intonation.
                // To use the instruction control feature, uncomment the lines below and replace model with qwen3-tts-instruct-flash
                // .parameter("instructions","Fast-paced speech with noticeable upward intonation, ideal for presenting fashion products.")
                // .parameter("optimize_instructions",true)
                .build();
        MultiModalConversationResult result = conv.call(param);
        String audioUrl = result.getOutput().getAudio().getUrl();
        System.out.print(audioUrl);

        // Download the audio file to local storage
        try (InputStream in = new URL(audioUrl).openStream();
             FileOutputStream out = new FileOutputStream("downloaded_audio.wav")) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            System.out.println("\nAudio file downloaded to: downloaded_audio.wav");
        } catch (Exception e) {
            System.out.println("\nFailed to download audio file: " + e.getMessage());
        }
    }
    public static void main(String[] args) {
        try {
            // The following is the configuration for the China (Beijing) region.
            Constants.baseHttpApiUrl = "https://dashscope.aliyuncs.com/api/v1";
            call();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

# ======= Important =======
# The following configuration is for the China (Beijing) region.
# The API keys for the Singapore region and the Beijing region are different. Get an API key: https://help.aliyun.com/zh/model-studio/get-api-key
# === Remove this comment before running ===

curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-tts-flash",
    "input": {
        "text": "Let me recommend this T-shirt. It looks absolutely amazing — the color is so elegant and it pairs well with almost anything. You can buy it without hesitation. It is very flattering and looks great on all body types. I highly recommend placing an order!",
        "voice": "Cherry",
        "language_type": "English"
    }
}'

Streaming output

In streaming mode, audio data is returned in chunks as Base64-encoded PCM. The last packet contains the URL to the complete audio file.

Python

# coding=utf-8
#
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
import dashscope
import pyaudio
import time
import base64
import numpy as np

# The following is the configuration for the China (Beijing) region.
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

p = pyaudio.PyAudio()
# Create an audio stream
stream = p.open(format=pyaudio.paInt16,
                channels=1,
                rate=24000,
                output=True)

text = "Hello, I am Qwen."
response = dashscope.MultiModalConversation.call(
    # The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://help.aliyun.com/zh/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # To use the instruction control feature, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    text=text,
    voice="Cherry",
    language_type="English",  # We recommend matching this with the language of the text for correct pronunciation and natural intonation.
    # To use the instruction control feature, uncomment the lines below and replace model with qwen3-tts-instruct-flash
    # instructions='Fast-paced speech with noticeable upward intonation, ideal for presenting fashion products.',
    # optimize_instructions=True,
    stream=True
)

for chunk in response:
    if chunk.output is not None:
        audio = chunk.output.audio
        if audio.data is not None:
            # Each chunk carries Base64-encoded 16-bit PCM samples
            pcm_bytes = base64.b64decode(audio.data)
            audio_np = np.frombuffer(pcm_bytes, dtype=np.int16)
            # Play audio data directly
            stream.write(audio_np.tobytes())
        if chunk.output.finish_reason == "stop":
            print(f"Finished; audio URL expires at: {chunk.output.audio.expires_at}")
time.sleep(0.8)
# Clean up resources
stream.stop_stream()
stream.close()
p.terminate()
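If playback is not needed, the same chunks can be written to a WAV file instead. The sketch below assumes the format used by the playback example above (mono, 16-bit PCM at 24000 Hz); `write_pcm_chunks_to_wav` is an illustrative helper, not an SDK function.

```python
import base64
import wave
from typing import Iterable


def write_pcm_chunks_to_wav(b64_chunks: Iterable[str], path: str,
                            sample_rate: int = 24000) -> int:
    """Decode Base64 PCM chunks into a mono 16-bit WAV file.

    Returns the number of PCM bytes written (hypothetical helper).
    """
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono, as in the playback example
        wav.setsampwidth(2)            # 16-bit samples (paInt16)
        wav.setframerate(sample_rate)
        for chunk in b64_chunks:
            if not chunk:              # skip packets that carry no audio data
                continue
            pcm = base64.b64decode(chunk)
            wav.writeframes(pcm)
            total += len(pcm)
    return total
```

With the streaming `response` from the example above, the chunks could be supplied as `(c.output.audio.data for c in response if c.output is not None)`.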

Java

You must import the Gson dependency. Add it using Maven or Gradle:

Maven

Add the following to pom.xml:

<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>

Gradle

Add the following to build.gradle:

// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")

import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;
import javax.sound.sampled.*;
import java.util.Base64;

public class Main {
    // To use the instruction control feature, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void streamCall() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://help.aliyun.com/zh/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
                .languageType("English") // We recommend matching this with the language of the text for correct pronunciation and natural intonation.
                // To use the instruction control feature, uncomment the lines below and replace model with qwen3-tts-instruct-flash
                // .parameter("instructions","Fast-paced speech with noticeable upward intonation, ideal for presenting fashion products.")
                // .parameter("optimize_instructions",true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(r -> {
            try {
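                // 1. Get the Base64-encoded audio data (skip packets that carry no audio, such as a final packet with only the URL)
                String base64Data = r.getOutput().getAudio().getData();
                if (base64Data == null || base64Data.isEmpty()) {
                    return;
                }
                byte[] audioBytes = Base64.getDecoder().decode(base64Data);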

                // 2. Configure the audio format (adjust according to the audio format returned by the API)
                AudioFormat format = new AudioFormat(
                        AudioFormat.Encoding.PCM_SIGNED,
                        24000, // Sample rate (must match the format returned by the API)
                        16,    // Bit depth
                        1,     // Number of channels
                        2,     // Frame size (in bytes)
                        24000, // Frame rate (must match the sample rate)
                        false  // bigEndian = false (little-endian byte order)
                );

                // 3. Play audio data in real time
                DataLine.Info info = new DataLine.Info(SourceDataLine.class, format);
                try (SourceDataLine line = (SourceDataLine) AudioSystem.getLine(info)) {
                    if (line != null) {
                        line.open(format);
                        line.start();
                        line.write(audioBytes, 0, audioBytes.length);
                        line.drain();
                    }
                }
            } catch (LineUnavailableException e) {
                e.printStackTrace();
            }
        });
    }
    public static void main(String[] args) {
        // The following is the configuration for the China (Beijing) region.
        Constants.baseHttpApiUrl = "https://dashscope.aliyuncs.com/api/v1";
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

# ======= Important =======
# The following configuration is for the China (Beijing) region.
# The API keys for the Singapore region and the Beijing region are different. Get an API key: https://help.aliyun.com/zh/model-studio/get-api-key
# === Remove this comment before running ===

curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-tts-flash",
    "input": {
        "text": "Let me recommend this T-shirt. It looks absolutely amazing — the color is so elegant and it pairs well with almost anything. You can buy it without hesitation. It is very flattering and looks great on all body types. I highly recommend placing an order!",
        "voice": "Cherry",
        "language_type": "English"
    }
}'

MiniMax

MiniMax supports emotion control, speed adjustment, and pitch adjustment.

Important

MiniMax non-real-time speech synthesis is available only in the China (Beijing) region.

Non-streaming output

In non-streaming mode, the response contains the complete synthesized audio.

curl -X POST "https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "MiniMax/speech-2.8-hd",
  "input": {
    "text": "The weather is great today, perfect for a walk.",
    "voice_setting": {
      "voice_id": "male-qn-qingse",
      "speed": 1,
      "vol": 1,
      "pitch": 0,
      "emotion": "happy"
    },
    "audio_setting": {
      "sample_rate": 32000,
      "bitrate": 128000,
      "format": "mp3",
      "channel": 1
    }
  }
}'

Streaming output

Add the X-DashScope-SSE: enable header to enable streaming output.

# Get an API Key: https://help.aliyun.com/zh/model-studio/get-api-key

curl -X POST "https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
  "model": "MiniMax/speech-2.8-hd",
  "input": {
    "text": "The weather is great today, perfect for a walk.",
    "voice_setting": {
      "voice_id": "male-qn-qingse",
      "speed": 1,
      "vol": 1,
      "pitch": 0,
      "emotion": "happy"
    },
    "audio_setting": {
      "sample_rate": 32000,
      "bitrate": 128000,
      "format": "mp3",
      "channel": 1
    }
  }
}'

Advanced features

Instruction control

Instruction specifications by model:

Qwen-Audio-TTS

Supported models: qwen-audio-3.0-tts-plus, qwen-audio-3.0-tts-flash

System voices and voice cloning voices: accept any instruction.

CosyVoice

Supported models: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash

Instruction format requirements vary by model:

cosyvoice-v3.5-plus, cosyvoice-v3.5-flash:
- Voice cloning/design voices: accept any instruction.
- System voices: v3.5 doesn't support system voices.
cosyvoice-v3-plus:
- Voice cloning/design voices: don't support instruction control.
- System voices: instructions must use a fixed format and content. See CosyVoice Voice list.
cosyvoice-v3-flash:
- Voice cloning/design voices: accept any instruction.
- System voices: instructions must use a fixed format and content. See CosyVoice Voice list.

Usage: Specify instruction content through the instruction parameter.

Supported languages for instruction text:

cosyvoice-v3.5-plus, cosyvoice-v3.5-flash:
- Voice cloning/design voices: Chinese, English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.
- System voices: v3.5 doesn't support system voices.
cosyvoice-v3-plus:
- Voice cloning/design voices: Chinese, English, French, German, Japanese, Korean, and Russian.
- System voices: instructions must use a fixed format and content. See CosyVoice Voice list.
cosyvoice-v3-flash:
- Voice cloning/design voices: Chinese, English, French, German, Japanese, Korean, and Russian.
- System voices: Chinese only.

Instruction text length limit: 100 characters maximum. Chinese characters (including simplified/traditional Chinese, Japanese kanji, and Korean hanja) count as 2 characters each. All other characters (such as punctuation, letters, digits, Japanese kana, and Korean hangul) count as 1 character each.
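The weighted counting rule can be checked on the client before sending a request. The following is a minimal sketch; the helper names are illustrative, and the Unicode ranges treated as "Chinese characters" are an assumption, because the exact classification used by the service is not specified here.

```python
# Han ideograph ranges counted as 2 characters each. These ranges are an
# assumption; the service's exact character classification is not documented here.
_HAN_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x323AF),  # Supplementary-plane ideograph extensions
)

MAX_INSTRUCTION_LENGTH = 100


def instruction_length(text: str) -> int:
    """Return the weighted length: Han ideographs count 2, everything else 1."""
    total = 0
    for ch in text:
        cp = ord(ch)
        is_han = any(lo <= cp <= hi for lo, hi in _HAN_RANGES)
        total += 2 if is_han else 1
    return total


def fits_instruction_limit(text: str) -> bool:
    """Check the weighted length against the 100-character limit."""
    return instruction_length(text) <= MAX_INSTRUCTION_LENGTH
```

For example, 请用河南话表达。 contains seven ideographs and one punctuation mark, so its weighted length is 7 × 2 + 1 = 15.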

Qwen-TTS

Supported models: Only Qwen3-TTS-Instruct-Flash series models are supported.

Usage: Pass the instruction content through the instructions parameter.

Supported languages for instruction text: Only Chinese and English are supported.

Instruction text length limit: Up to 1,600 tokens.

Dialects

This section describes how to generate speech in Chinese dialects (such as Henan dialect and Sichuan dialect). The configuration method varies by model and voice type.

Qwen-Audio-TTS

System voices: Choose one of the following voice types:
- System voices that support dialects natively — no additional configuration required.
- Voices that support Instruction control and allow dialect specification — specify the dialect through instruction text.
Voice cloning voices: Configure through the Instruction control feature. For example, set the instruction text to 请用河南话表达.
Voice design voices: Dialects are not supported.

Supported dialects: See the "Supported languages" section for each model in Qwen-Audio-TTS.

CosyVoice

System voices: Select one of the following voice types from the CosyVoice Voice list:
- System voices that support dialects (for example, longshange_v3) — no additional configuration required.
- Voices that support Instruction control and allow dialect specification (for example, longanhuan_v3) — specify the dialect through instruction text.
Voice cloning voices: Configure through the Instruction control feature. For example, set the instruction text to 请用河南话表达.
Voice design voices: Dialects are not supported.

Supported dialects: See the "Supported languages" section for each model in CosyVoice.

Example: Use cosyvoice-v3-flash with the longanhuan_v3 voice and the instruction text "请用河南话表达。" (roughly, "Please speak in Henan dialect.") to generate speech in Henan dialect.

curl -X POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "cosyvoice-v3-flash",
    "input": {
      "text": "叫你去买盐，你买回来一袋面，这不是弄啥嘞吗！",
      "voice": "longanhuan_v3",
      "format": "wav",
      "sample_rate": 24000,
      "instruction": "请用河南话表达。"
    }
}'

Qwen-TTS

System voices: Use system voices that support dialects. See Qwen-TTS voice list.
Voice cloning voices: Dialects are not supported.
Voice design voices: Dialects are not supported.

Supported dialects: See the "Supported languages" section for each model in Qwen3-TTS.

Emotion and rich language tags

Qwen-Audio-TTS series models support embedding emotion and rich language tags directly in the text to be synthesized (the text parameter). These tags control emotional expression or insert vocal effects (such as laughter and sighs) at specified positions, producing more expressive speech without complex audio parameter configuration.

Important

Supported models: Only qwen-audio-3.0-tts-plus and qwen-audio-3.0-tts-flash.

Control tags

Control tags set the emotion or style of the speech. Place a tag in the text to affect all subsequent text until the next control tag appears or the sentence is automatically segmented due to length.

Tag	Description
`[sad]`	Sad
`[amazed]`	Amazed
`[deep and loud shouting]`	Deep, loud shouting
`[trembling]`	Trembling
`[angry]`	Angry
`[excited]`	Excited
`[sarcastic]`	Sarcastic
`[curious]`	Curious
`[like dracula]`	Dracula style (deep, eerie)
`[bored]`	Bored
`[tired]`	Tired
`[singing]`	Singing
`[scornful]`	Scornful
`[shouting]`	Shouting
`[asmr]`	ASMR soft whisper
`[panicked]`	Panicked
`[mischievously]`	Mischievous
`[empathetic]`	Empathetic
`[whispers]`	Whisper
`[reluctantly]`	Reluctant
`[crying]`	Crying
`[serious]`	Serious
`[very slowly]`	Very slow speech
`[very fast]`	Very fast speech

Rich language tags

Rich language tags insert a vocal effect at the current position in the text without affecting the emotional style of surrounding text.

Tag	Description
`[gasp]`	Gasp
`[sighing]`	Sigh
`[clears throat]`	Throat clearing
`[giggles]`	Giggle
`[laughing]`	Laughter
`[cough]`	Cough
`[snorts]`	Snort

Usage examples

The following example shows how to combine control tags and rich language tags in the text parameter:

[excited]What a beautiful day today![laughing]Let's go out and have fun together!

In this text, [excited] is a control tag that applies an excited emotion to all subsequent text. [laughing] is a rich language tag that inserts a laugh at that position before continuing to synthesize the remaining text.

You can also switch between different emotions within the same text:

[serious]Please pay attention to the safety precautions.[excited]Alright, let's get started now!

Here, [serious] sets the first sentence to a serious tone, and [excited] switches to an excited tone starting from the second sentence.
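Because a misspelled tag is easy to miss, it can help to check the text against the two tag tables before sending it. The following is a minimal sketch; `classify_tags` is an illustrative helper, and it assumes tags must match the listed spelling exactly (whether the service accepts other casing is not stated here).

```python
import re

# Tag names copied from the "Control tags" and "Rich language tags" tables
CONTROL_TAGS = {
    "sad", "amazed", "deep and loud shouting", "trembling", "angry", "excited",
    "sarcastic", "curious", "like dracula", "bored", "tired", "singing",
    "scornful", "shouting", "asmr", "panicked", "mischievously", "empathetic",
    "whispers", "reluctantly", "crying", "serious", "very slowly", "very fast",
}
RICH_TAGS = {"gasp", "sighing", "clears throat", "giggles", "laughing", "cough", "snorts"}

_TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def classify_tags(text: str):
    """Return (control, rich, unknown) tag lists in order of appearance."""
    control, rich, unknown = [], [], []
    for name in _TAG_PATTERN.findall(text):
        if name in CONTROL_TAGS:
            control.append(name)
        elif name in RICH_TAGS:
            rich.append(name)
        else:
            unknown.append(name)
    return control, rich, unknown
```

Applied to the first example above, the function reports `excited` as a control tag and `laughing` as a rich language tag. Any bracketed name that appears in neither table ends up in the third list.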

Supported models and regions

China (Beijing)

To call the following models, use an API key for the Beijing region:

Qwen-Audio-TTS: qwen-audio-3.0-tts-plus, qwen-audio-3.0-tts-flash
CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
Qwen-TTS:
- Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable version, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)
- Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)
- Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
- Qwen3-TTS-Flash: qwen3-tts-flash (stable version, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18
- Qwen-TTS: qwen-tts (stable version, currently equivalent to qwen-tts-2025-04-10), qwen-tts-latest (latest version, currently equivalent to qwen-tts-2025-05-22), qwen-tts-2025-05-22 (snapshot), qwen-tts-2025-04-10 (snapshot)
MiniMax: MiniMax/speech-2.8-hd, MiniMax/speech-02-hd, MiniMax/speech-2.8-turbo, MiniMax/speech-02-turbo

Singapore

To call the following models, use an API key for the Singapore region:

Qwen-TTS:
- Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable version, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)
- Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)
- Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
- Qwen3-TTS-Flash: qwen3-tts-flash (stable version, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18

Supported system voices

Different models support different voices. Set the voice request parameter to a value from the voice parameter column in the following tables.

API reference

FAQ

Q: How long is the audio file URL valid?

A: The audio file URL is valid for 24 hours after generation. After the URL expires, call the API again to obtain a new URL.
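A client that caches audio URLs can track the 24-hour window itself. The sketch below assumes the application records the time at which each URL was generated; `url_still_valid` and the five-minute safety margin are illustrative choices, not part of the API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

URL_LIFETIME = timedelta(hours=24)  # validity period stated above


def url_still_valid(generated_at: datetime,
                    now: Optional[datetime] = None,
                    safety_margin: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if a URL generated at generated_at can still be used.

    Both datetimes should be timezone-aware. The safety margin (an arbitrary
    choice for this sketch) leaves time to finish the download.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now < generated_at + URL_LIFETIME - safety_margin
```

When the function returns False, call the synthesis API again to obtain a fresh URL.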