Non-real-time speech synthesis

Non-real-time speech synthesis converts text to speech (TTS) through an HTTP API. It suits latency-tolerant scenarios such as audiobook production, e-learning narration, and content production. The service offers a wide selection of voices, multilingual support, voice cloning, and voice design.

Overview

Convert complete text to speech files through an HTTP API. Two output modes are available: non-streaming and streaming.

  • Non-streaming returns an audio file URL that expires after 24 hours. Streaming returns PCM audio data in chunks.

  • Supports multiple languages, including Chinese dialects.

  • Supports voice cloning and voice design for custom voice creation.

  • Supports instruction control, which lets you control speech expressiveness through natural-language instructions.

For low-latency streaming synthesis, see Real-time speech synthesis. To choose a model, see Speech synthesis.

Prerequisites

Before you begin, ensure that you have:

Quick start

Each tab demonstrates synthesis with a different model family. For more code examples and parameter details, see the API reference.

CosyVoice

The examples below use CosyVoice to synthesize speech.

Important

CosyVoice non-real-time speech synthesis is available only in the Beijing region.

Non-streaming output

In non-streaming mode, the response contains a URL for the synthesized audio. The URL expires after 24 hours.

curl -X POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "cosyvoice-v3-flash",
    "input": {
      "text": "There is a large garden behind my house.",
      "voice": "longanyang",
      "format": "wav",
      "sample_rate": 24000
    }
}'
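The same request can be built from Python with only the standard library. This is a minimal sketch that assumes the endpoint and payload of the cURL example above; `build_tts_request`, `send`, and the `workspace_id` argument are hypothetical names introduced for illustration, and the response is returned as decoded JSON without assuming its schema.

```python
import json
import urllib.request


def build_tts_request(workspace_id: str, api_key: str, text: str,
                      voice: str = "longanyang") -> urllib.request.Request:
    """Build the POST request from the cURL example (hypothetical helper)."""
    url = (f"https://{workspace_id}.cn-beijing.maas.aliyuncs.com"
           "/api/v1/services/audio/tts/SpeechSynthesizer")
    payload = {
        "model": "cosyvoice-v3-flash",
        "input": {
            "text": text,
            "voice": voice,
            "format": "wav",
            "sample_rate": 24000,
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON body as-is."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

To run it, pass your workspace ID and the value of `DASHSCOPE_API_KEY` to `build_tts_request`, then call `send` on the result and inspect the printed JSON for the audio URL.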

Streaming output

Add the X-DashScope-SSE: enable header to enable streaming output. The server returns audio data incrementally as Server-Sent Events (SSE).

curl -X POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
    "model": "cosyvoice-v3-flash",
    "input": {
      "text": "There is a large garden behind my house.",
      "voice": "longanyang",
      "format": "wav",
      "sample_rate": 24000
    }
}'
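On the client side, an SSE response is a sequence of text lines in which each event's payload follows a `data:` prefix and events are separated by blank lines. The sketch below parses such a stream; it assumes every event's data is a JSON document, which this page does not confirm, and `iter_sse_data` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each SSE event.

    `lines` is any iterable of raw byte lines, for example an open
    HTTP response object that is iterated line by line.
    """
    data_parts = []
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data_parts))
            data_parts = []
    if data_parts:
        # Flush a final event that was not followed by a blank line.
        yield json.loads("\n".join(data_parts))
```

Lines that start with other field names, such as `id:` or `event:`, are ignored by this sketch.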

Qwen-TTS

All examples in this section use a built-in voice.

Non-streaming output

In non-streaming mode, the response includes a url field pointing to the synthesized audio file. The URL expires after 24 hours.

Python

import os
import dashscope

# The following is the Beijing region URL. To use models in the Singapore region, replace the URL with: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1'

text = "Today is a wonderful day to build something people love!"
# SpeechSynthesizer usage: dashscope.audio.qwen_tts.SpeechSynthesizer.call(...)
response = dashscope.MultiModalConversation.call(
    # To use the instruction control feature, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    # The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Alibaba Cloud Model Studio API Key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text=text,
    voice="Cherry",
    language_type="English", # It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
    # To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
    # instructions='Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.',
    # optimize_instructions=True,
    stream=False
)
print(response)

Java

Import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:

Maven

Add the following to pom.xml:

<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>

Gradle

Add the following to build.gradle:

// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")

import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.utils.Constants;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

public class Main {
    // To use the instruction control feature, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void call() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured an environment variable, replace the following line with your Alibaba Cloud Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
                .languageType("English") // It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
                // To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
                // .parameter("instructions","Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.")
                // .parameter("optimize_instructions",true)
                .build();
        MultiModalConversationResult result = conv.call(param);
        String audioUrl = result.getOutput().getAudio().getUrl();
        System.out.print(audioUrl);

        // Download the audio file to local storage
        try (InputStream in = new URL(audioUrl).openStream();
             FileOutputStream out = new FileOutputStream("downloaded_audio.wav")) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            System.out.println("\nAudio file downloaded to local storage: downloaded_audio.wav");
        } catch (Exception e) {
            System.out.println("\nError downloading audio file: " + e.getMessage());
        }
    }
    public static void main(String[] args) {
        try {
            // The following is the Beijing region URL. To use models in the Singapore region, replace the URL with: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl = "https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1";
            call();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

# ======= IMPORTANT =======
# The URL below points to the China (Beijing) region. If you are using a model in the Singapore region, replace it with: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# Note: API Keys differ between the Singapore and Beijing regions. To obtain an API Key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Remove this comment before running ===

curl -X POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-tts-flash",
    "input": {
        "text": "Today is a wonderful day to build something people love!",
        "voice": "Cherry",
        "language_type": "English"
    }
}'
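Because the URL expires after 24 hours, it is usually worth saving the file right after the call. The sketch below assumes the decoded JSON body nests the link as `output.audio.url`, mirroring the `getOutput().getAudio().getUrl()` accessor chain in the Java example; `extract_audio_url` and `download_audio` are hypothetical helper names.

```python
import urllib.request


def extract_audio_url(response_json: dict) -> str:
    """Return output.audio.url from a decoded response body (assumed shape)."""
    try:
        return response_json["output"]["audio"]["url"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response has no output.audio.url field") from exc


def download_audio(url: str, path: str) -> None:
    """Stream the audio file at `url` to a local file at `path`."""
    with urllib.request.urlopen(url, timeout=60) as resp, open(path, "wb") as out:
        while chunk := resp.read(8192):
            out.write(chunk)
```

A failed lookup raises `ValueError`, which typically means the body is an error response rather than a synthesis result.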

Streaming output

In streaming mode, audio data is returned incrementally as Base64-encoded PCM segments. The last packet includes a URL for the complete audio file.

Python

# coding=utf-8
#
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
import dashscope
import pyaudio
import time
import base64
import numpy as np

# The following is the Beijing region URL. To use models in the Singapore region, replace the URL with: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1'

p = pyaudio.PyAudio()
# Create an audio stream
stream = p.open(format=pyaudio.paInt16,
                channels=1,
                rate=24000,
                output=True)

text = "Today is a wonderful day to build something people love!"
response = dashscope.MultiModalConversation.call(
    # The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Alibaba Cloud Model Studio API Key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # To use the instruction control feature, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    text=text,
    voice="Cherry",
    language_type="English",  # It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
    # To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
    # instructions='Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.',
    # optimize_instructions=True,
    stream=True
)

for chunk in response:
    if chunk.output is not None:
        audio = chunk.output.audio
        if audio.data is not None:
            # Each chunk carries Base64-encoded 16-bit PCM samples
            wav_bytes = base64.b64decode(audio.data)
            audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
            # Play the audio data directly
            stream.write(audio_np.tobytes())
        if chunk.output.finish_reason == "stop":
            print("finish at: {}".format(chunk.output.audio.expires_at))
time.sleep(0.8)
# Clean up resources
stream.stop_stream()
stream.close()
p.terminate()

Java

Import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:

Maven

Add the following to pom.xml:

<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>

Gradle

Add the following to build.gradle:

// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")

import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;
import javax.sound.sampled.*;
import java.util.Base64;

public class Main {
    // To use the instruction control feature, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void streamCall() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured an environment variable, replace the following line with your Alibaba Cloud Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
                .languageType("English") // It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
                // To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
                // .parameter("instructions","Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.")
                // .parameter("optimize_instructions",true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(r -> {
            try {
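                // 1. Get the Base64-encoded audio data; skip packets that carry no audio
                //    (assumed possible for the last packet, which carries the file URL)
                String base64Data = r.getOutput().getAudio().getData();
                if (base64Data == null || base64Data.isEmpty()) {
                    return;
                }
                byte[] audioBytes = Base64.getDecoder().decode(base64Data);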

                // 2. Configure the audio format (adjust according to the audio format returned by the API)
                AudioFormat format = new AudioFormat(
                        AudioFormat.Encoding.PCM_SIGNED,
                        24000, // Sample rate (must match the format returned by the API)
                        16,    // Bits per sample
                        1,     // Number of channels
                        2,     // Frame size (bytes)
                        24000, // Frame rate (must match the sample rate)
                        false  // bigEndian flag: false means little-endian samples
                );

                // 3. Play the audio data in real time
                DataLine.Info info = new DataLine.Info(SourceDataLine.class, format);
                try (SourceDataLine line = (SourceDataLine) AudioSystem.getLine(info)) {
                    if (line != null) {
                        line.open(format);
                        line.start();
                        line.write(audioBytes, 0, audioBytes.length);
                        line.drain();
                    }
                }
            } catch (LineUnavailableException e) {
                e.printStackTrace();
            }
        });
    }
    public static void main(String[] args) {
        // The following is the Beijing region URL. To use models in the Singapore region, replace the URL with: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl = "https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1";
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

# ======= IMPORTANT =======
# The URL below points to the China (Beijing) region. If you are using a model in the Singapore region, replace it with: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# Note: API Keys differ between the Singapore and Beijing regions. To obtain an API Key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Remove this comment before running ===

curl -X POST 'https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-tts-flash",
    "input": {
        "text": "Today is a wonderful day to build something people love!",
        "voice": "Cherry",
        "language_type": "English"
    }
}'
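If you want a playable file rather than live playback, the Base64-encoded PCM segments can be collected into a WAV container. This is a minimal sketch that assumes 16-bit mono PCM at 24 kHz, the format the Python playback example above configures; `write_pcm_chunks_to_wav` is a hypothetical helper name.

```python
import base64
import wave
from typing import Iterable


def write_pcm_chunks_to_wav(b64_chunks: Iterable[str], path: str,
                            sample_rate: int = 24000) -> int:
    """Decode Base64 PCM chunks into one mono 16-bit WAV file.

    Returns the number of PCM bytes written.
    """
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono (assumed)
        wav.setsampwidth(2)      # 16-bit samples (assumed)
        wav.setframerate(sample_rate)
        for b64 in b64_chunks:
            if not b64:
                continue         # skip packets without audio data
            pcm = base64.b64decode(b64)
            wav.writeframes(pcm)
            total += len(pcm)
    return total
```

Pass it the `data` value of each streamed packet in arrival order; the WAV header is finalized when the file is closed.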

MiniMax

MiniMax supports emotion control, speech rate adjustment, and pitch tuning.

Important

MiniMax non-real-time speech synthesis is available only in the Beijing region.

Non-streaming output

In non-streaming mode, the response contains the complete synthesized audio.

curl -X POST "https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "MiniMax/speech-2.8-hd",
  "input": {
    "text": "The weather is really nice today — perfect for a walk outside.",
    "voice_setting": {
      "voice_id": "male-qn-qingse",
      "speed": 1,
      "vol": 1,
      "pitch": 0,
      "emotion": "happy"
    },
    "audio_setting": {
      "sample_rate": 32000,
      "bitrate": 128000,
      "format": "mp3",
      "channel": 1
    }
  }
}'

Streaming output

Add the X-DashScope-SSE: enable header to enable streaming output.

# Get an API Key: https://help.aliyun.com/en/model-studio/get-api-key

curl -X POST "https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
  "model": "MiniMax/speech-2.8-hd",
  "input": {
    "text": "The weather is really nice today — perfect for a walk outside.",
    "voice_setting": {
      "voice_id": "male-qn-qingse",
      "speed": 1,
      "vol": 1,
      "pitch": 0,
      "emotion": "happy"
    },
    "audio_setting": {
      "sample_rate": 32000,
      "bitrate": 128000,
      "format": "mp3",
      "channel": 1
    }
  }
}'

Advanced features

Instruction control

Instruction-based control lets you shape tone, speed, emotion, and timbre through natural language descriptions, without adjusting complex audio parameters.

Instruction specifications by model:

CosyVoice

Supported models: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash

Different models have different instruction format requirements:

  • cosyvoice-v3.5-plus and cosyvoice-v3.5-flash:

    • Voice Clone/Design timbres: Accept arbitrary instructions.

    • System voices: v3.5 doesn't support system voices.

  • cosyvoice-v3-plus:

    • Voice Clone/Design timbres: Instruction control isn't supported.

    • System voices: Instructions must follow a fixed format. For details, see CosyVoice Voice list.

  • cosyvoice-v3-flash:

    • Voice Clone/Design timbres: Accept arbitrary instructions.

    • System voices: Instructions must follow a fixed format. For details, see CosyVoice Voice list.

How to use: Specify instruction content through the instructions parameter.

Supported languages for instruction text:

  • cosyvoice-v3.5-plus and cosyvoice-v3.5-flash:

    • Voice Clone/Design timbres: Chinese, English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.

    • System voices: v3.5 doesn't support system voices.

  • cosyvoice-v3-plus:

    • Voice Clone/Design timbres: Chinese, English, French, German, Japanese, Korean, and Russian.

    • System voices: Instructions must follow a fixed format. For details, see CosyVoice Voice list.

  • cosyvoice-v3-flash:

    • Voice Clone/Design timbres: Chinese, English, French, German, Japanese, Korean, and Russian.

    • System voices: Chinese.

Instruction text length limit: Up to 100 characters. Chinese characters (including Simplified/Traditional Chinese, Japanese Kanji, and Korean Hanja) count as 2 characters each. Other characters (punctuation, letters, numbers, Japanese Kana, Korean Hangul, etc.) count as 1 character each.
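The counting rule can be checked locally before sending a request. The sketch below is one reading of the rule: it treats code points in the common CJK ideograph blocks as "Chinese characters" worth 2 and everything else as 1. The exact set of code points the service counts as 2 is an assumption, and `instruction_length` and `fits_limit` are hypothetical helper names.

```python
MAX_INSTRUCTION_LENGTH = 100  # limit stated above

# Assumed ideograph ranges: CJK Unified Ideographs, Extension A,
# Compatibility Ideographs, and the supplementary ideographic planes.
_IDEOGRAPH_RANGES = (
    (0x4E00, 0x9FFF),
    (0x3400, 0x4DBF),
    (0xF900, 0xFAFF),
    (0x20000, 0x2FFFF),
)


def _is_ideograph(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _IDEOGRAPH_RANGES)


def instruction_length(text: str) -> int:
    """Count ideographs as 2 and all other characters as 1."""
    return sum(2 if _is_ideograph(ch) else 1 for ch in text)


def fits_limit(text: str) -> bool:
    """Return True if the instruction is within the 100-character limit."""
    return instruction_length(text) <= MAX_INSTRUCTION_LENGTH
```

Under this reading, the instruction 请用河南话表达。 counts as 15: seven ideographs at 2 each plus one punctuation mark.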

Qwen-TTS

Supported models: Qwen3-TTS-Instruct-Flash family

Usage: Pass the instruction text in the instructions parameter.

Supported instruction languages: Chinese and English.

Maximum instruction length: 1,600 tokens.

Use cases:

  • Audiobook and radio drama voiceover

  • Advertising and promotional voiceover

  • Game character and animation voiceover

  • Emotionally expressive voice assistants

  • Documentary narration and news broadcasting

Tips for writing high-quality voice descriptions:

  • Core principles:

    1. Be specific, not vague: Use words that describe concrete vocal qualities, such as "deep," "crisp," or "slightly fast." Avoid subjective or vague terms like "nice" or "normal."

    2. Be multidimensional, not single-faceted: A good description covers multiple dimensions (gender, age, emotion, etc.). Writing only "female voice" is too broad to produce a distinctive timbre.

    3. Be objective, not subjective: Focus on the physical and perceptual qualities of the voice. For example, use "slightly high pitch with energy" rather than "my favorite voice."

    4. Be original, not imitative: Describe the vocal qualities you want, rather than requesting imitation of specific public figures (such as celebrities or actors). The model doesn't support imitation, and such requests may carry copyright risks.

    5. Be concise, not redundant: Make every word count. Avoid repeating synonyms or stacking meaningless modifiers.

  • Description dimensions:

    Combining the following dimensions produces more accurate results. The more dimensions described, the more precise the output.

    | Dimension | Example descriptions |
    |---|---|
    | Gender | Male, female, neutral |
    | Age | Child (5-12), teenager (13-18), young adult (19-35), middle-aged (36-55), elderly (55+) |
    | Pitch | High, mid, low, slightly high, slightly low |
    | Speed | Fast, moderate, slow, slightly fast, slightly slow |
    | Emotion | Cheerful, calm, gentle, serious, lively, composed, soothing |
    | Timbre | Magnetic, crisp, husky, mellow, sweet, rich, powerful |
    | Use case | News broadcasting, advertising, audiobook, animation character, voice assistant, documentary narration |

  • Examples:

    • Standard broadcasting style: Clear and precise articulation with standard pronunciation

    • Young, lively female voice with a slightly fast pace and a noticeable rising intonation, suitable for introducing fashion products

    • Calm middle-aged male voice with a slow pace, deep and magnetic timbre, suitable for reading news or narrating documentaries

    • Gentle, intellectual female voice, around 30 years old, with a calm tone, suitable for audiobook reading

    • Cute child voice, about 8-year-old girl, slightly childish speech, suitable for animation character voiceover
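One way to keep descriptions multidimensional is to assemble them from named fields. The sketch below is purely illustrative: `VoiceDescription` and its methods are hypothetical and not part of any SDK, and the generated wording is just one plausible phrasing for the `instructions` parameter.

```python
from dataclasses import dataclass, fields


@dataclass
class VoiceDescription:
    """Holds one phrase per description dimension (all optional)."""
    gender: str = ""
    age: str = ""
    pitch: str = ""
    speed: str = ""
    emotion: str = ""
    timbre: str = ""
    use_case: str = ""

    def described_dimensions(self) -> int:
        """Number of dimensions that were filled in."""
        return sum(1 for f in fields(self) if getattr(self, f.name))

    def to_instruction(self) -> str:
        """Join the filled-in dimensions into one instruction string."""
        parts = [getattr(self, f.name) for f in fields(self)
                 if f.name != "use_case" and getattr(self, f.name)]
        if self.use_case:
            parts.append(f"suitable for {self.use_case}")
        return ", ".join(parts)
```

For example, `VoiceDescription(gender="female voice", speed="slightly fast pace", use_case="introducing fashion products").to_instruction()` yields a three-dimension description; checking `described_dimensions()` is a simple way to catch descriptions that are too broad.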

Dialects

This section explains how to make the model output speech in a Chinese dialect (for example, Henan or Sichuan). Settings vary by model and by voice type.

CosyVoice

  • Built-in voices: Choose a voice type from the voice list:

    • Dialect-specific voice (for example, longshange_v3): No setup required; the voice always speaks that dialect.

    • Dialect-configurable voice (for example, longanhuan_v3): Specify the dialect in the instruction text.

  • Voice clone: Set the dialect in the instruction text. For example: 请用河南话表达 (Speak in Henan dialect).

  • Voice design: Dialects are not supported.

  • Supported dialects: See the Supported languages column for each model in the voice list.

  • Example: Using cosyvoice-v3-flash with the longanhuan_v3 voice and the instruction "请用河南话表达。" produces Henan dialect speech.

    curl -X POST https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cosyvoice-v3-flash",
        "input": {
          "text": "叫你去买盐,你买回来一袋面,这不是弄啥嘞吗!",
          "voice": "longanhuan_v3",
          "format": "wav",
          "sample_rate": 24000,
          "instruction": "请用河南话表达。"
        }
    }'
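The CosyVoice rules above reduce to a small decision: some voice types need no instruction, some need one, and voice design cannot produce dialects. The sketch below encodes that decision; the `voice_kind` labels and the function name are hypothetical, and the instruction wording simply follows the pattern of the example request, which may not be the only phrasing the model accepts.

```python
from typing import Optional


def cosyvoice_dialect_instruction(voice_kind: str, dialect: str) -> Optional[str]:
    """Return the instruction text needed for a dialect, or None if none is needed.

    voice_kind is one of the hypothetical labels:
    "dialect_specific", "dialect_configurable", "voice_clone", "voice_design".
    dialect is the dialect name in Chinese, for example "河南话".
    """
    if voice_kind == "dialect_specific":
        # The voice always speaks its own dialect; no setup required.
        return None
    if voice_kind in ("dialect_configurable", "voice_clone"):
        # Same pattern as the example instruction in the request above.
        return f"请用{dialect}表达。"
    if voice_kind == "voice_design":
        raise ValueError("voice design does not support dialects")
    raise ValueError(f"unknown voice kind: {voice_kind}")
```

When the function returns a string, place it in the `instruction` field of the request body as in the cURL example; when it returns `None`, omit the field.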

Qwen-TTS

  • Built-in voices: Use a built-in voice that supports dialects. See the voice list for details.

  • Voice clone: Dialects are not supported.

  • Voice design: Dialects are not supported.

  • Supported dialects: See the Supported languages column for each model in Qwen3-TTS.

Supported scope

Available models vary by region:

China (Beijing)

To call the following models, use an API key from the Beijing region:

  • CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2

  • Qwen-TTS:

    • Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)

    • Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)

    • Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)

    • Qwen3-TTS-Flash: qwen3-tts-flash (stable, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18

    • Qwen-TTS: qwen-tts (stable, currently equivalent to qwen-tts-2025-04-10), qwen-tts-latest (latest, currently equivalent to qwen-tts-2025-05-22), qwen-tts-2025-05-22 (snapshot), qwen-tts-2025-04-10 (snapshot)

  • MiniMax: MiniMax/speech-2.8-hd, MiniMax/speech-02-hd, MiniMax/speech-2.8-turbo, MiniMax/speech-02-turbo

Singapore

To call the following models, use an API key from the Singapore region:

  • Qwen-TTS:

    • Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)

    • Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)

    • Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)

    • Qwen3-TTS-Flash: qwen3-tts-flash (stable, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18
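A client that targets both regions can validate the model name before choosing a base URL. The sketch below copies the model lists and URL patterns from this page; the helper names and the lowercase region keys are hypothetical, and the sets need updating whenever the lists above change.

```python
SINGAPORE_MODELS = {
    "qwen3-tts-instruct-flash", "qwen3-tts-instruct-flash-2026-01-26",
    "qwen3-tts-vd-2026-01-26", "qwen3-tts-vc-2026-01-22",
    "qwen3-tts-flash", "qwen3-tts-flash-2025-11-27", "qwen3-tts-flash-2025-09-18",
}

# Beijing offers everything Singapore does, plus the models below.
BEIJING_MODELS = SINGAPORE_MODELS | {
    "cosyvoice-v3.5-plus", "cosyvoice-v3.5-flash", "cosyvoice-v3-plus",
    "cosyvoice-v3-flash", "cosyvoice-v2",
    "qwen-tts", "qwen-tts-latest", "qwen-tts-2025-05-22", "qwen-tts-2025-04-10",
    "MiniMax/speech-2.8-hd", "MiniMax/speech-02-hd",
    "MiniMax/speech-2.8-turbo", "MiniMax/speech-02-turbo",
}

REGIONS = {
    "beijing": ("cn-beijing", BEIJING_MODELS),
    "singapore": ("ap-southeast-1", SINGAPORE_MODELS),
}


def regions_for(model: str) -> list:
    """Return the region keys in which the model is listed."""
    return [name for name, (_, models) in REGIONS.items() if model in models]


def base_url(region: str, workspace_id: str) -> str:
    """Return the base HTTP API URL for a region key."""
    host_part, _ = REGIONS[region]
    return f"https://{workspace_id}.{host_part}.maas.aliyuncs.com/api/v1"
```

Remember that the API key must come from the same region as the base URL.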

Built-in voices

Voices vary by model. Set the voice parameter to the value in the voice parameter column of the tables below.

API reference

FAQ

How long does the audio file URL remain valid?

The audio file URL expires 24 hours after it is generated. To get a new URL, call the API again.
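If your application stores URLs for later use, it can track the generation time and decide locally whether a stored URL is still usable. This is a minimal sketch assuming you record the time of the API call yourself; the function names and the five-minute safety margin are hypothetical choices, not service behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

URL_TTL = timedelta(hours=24)  # validity period stated above


def url_expires_at(generated_at: datetime) -> datetime:
    """Return the moment a URL generated at `generated_at` expires."""
    return generated_at + URL_TTL


def is_url_usable(generated_at: datetime,
                  now: Optional[datetime] = None,
                  safety_margin: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if the URL should still work, leaving a safety margin."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now < url_expires_at(generated_at) - safety_margin
```

When `is_url_usable` returns `False`, call the synthesis API again to obtain a fresh URL, or download the file promptly after generation to avoid the issue altogether.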