CosyVoice voice cloning produces natural-sounding custom voices from 10–20 seconds of sample audio, with no training step required. Voice design generates custom voices from text descriptions, supporting multiple languages and voice characteristics. This topic covers the voice cloning and design APIs. For speech synthesis, see Real-time speech synthesis – CosyVoice.
User guide: For model overviews and selection recommendations, see Real-time speech synthesis – CosyVoice.
-
CosyVoice voice design is powered by the FunAudioGen-VD model.
-
Identical prompts may produce different voices. Generate multiple results and select the best one.
-
This topic describes CosyVoice voice cloning/design APIs. If you use Qwen models, see Qwen voice cloning and Qwen voice design.
Supported models
-
Voice cloning:
Chinese mainland
If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).
-
cosyvoice-v3.5-plus
-
cosyvoice-v3.5-flash
-
cosyvoice-v3-plus
-
cosyvoice-v3-flash
-
cosyvoice-v2
-
cosyvoice-v1
International
If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.
-
cosyvoice-v3-plus
-
cosyvoice-v3-flash
-
-
Voice design:
Chinese mainland
If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).
-
cosyvoice-v3.5-plus
-
cosyvoice-v3.5-flash
-
cosyvoice-v3-plus
-
cosyvoice-v3-flash
International
If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.
-
cosyvoice-v3-plus
-
cosyvoice-v3-flash
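The availability lists above can be captured in a small lookup table so that an unsupported combination is rejected before any request is sent. The following is a minimal sketch: SUPPORTED_MODELS, is_supported, and the short keys "cn" and "intl" are illustrative names, not part of any SDK, and the table only mirrors the lists in this topic.

```python
# Illustrative lookup table mirroring the "Supported models" lists in this topic.
# SUPPORTED_MODELS, is_supported, and the "cn"/"intl" keys are hypothetical names.
SUPPORTED_MODELS = {
    ("cloning", "cn"): {
        "cosyvoice-v3.5-plus", "cosyvoice-v3.5-flash",
        "cosyvoice-v3-plus", "cosyvoice-v3-flash",
        "cosyvoice-v2", "cosyvoice-v1",
    },
    ("cloning", "intl"): {"cosyvoice-v3-plus", "cosyvoice-v3-flash"},
    ("design", "cn"): {
        "cosyvoice-v3.5-plus", "cosyvoice-v3.5-flash",
        "cosyvoice-v3-plus", "cosyvoice-v3-flash",
    },
    ("design", "intl"): {"cosyvoice-v3-plus", "cosyvoice-v3-flash"},
}


def is_supported(feature: str, scope: str, model: str) -> bool:
    """Return True if the model is listed for the feature and deployment scope."""
    return model in SUPPORTED_MODELS.get((feature, scope), set())


print(is_supported("cloning", "cn", "cosyvoice-v2"))
print(is_supported("design", "intl", "cosyvoice-v3.5-plus"))
```

Keeping the table in one place makes it easy to update when the model lists change.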
-
Supported languages
-
Voice cloning: Depends on the speech synthesis model that determines the voice (specified by the target_model/targetModel parameter):
-
cosyvoice-v1, cosyvoice-v2: Chinese (Mandarin), English
-
cosyvoice-v3-flash: Chinese (Mandarin, Cantonese, Northeast dialect, Gansu dialect, Guizhou dialect, Henan dialect, Hubei dialect, Jiangxi dialect, Minnan dialect, Ningxia dialect, Shanxi dialect, Shaanxi dialect, Shandong dialect, Shanghai dialect, Sichuan dialect, Tianjin dialect, Yunnan dialect), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese
-
cosyvoice-v3-plus: Chinese (Mandarin), English, French, German, Japanese, Korean, Russian
-
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash: Chinese (Mandarin, Cantonese, Henan dialect, Hubei dialect, Minnan dialect, Ningxia dialect, Shaanxi dialect, Shandong dialect, Shanghai dialect, Sichuan dialect), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese
Voice cloning does not currently support other languages (Spanish, Italian, etc.).
-
-
Voice design: Chinese, English.
Getting started: from voice cloning to speech synthesis
Voice cloning and speech synthesis follow a "create first, then use" workflow:
-
Prepare an audio recording file
Upload an audio file that meets the requirements in Voice cloning: Input audio format to a publicly accessible location, such as Alibaba Cloud Object Storage Service (OSS), and ensure the URL is publicly accessible.
-
Create a voice
Call the Create voice API. In this step, specify target_model or targetModel to declare which speech synthesis model will drive the created voice.
If you already have a voice, skip this step. Check existing voices by calling the Query voice list API.
-
Use the voice for speech synthesis
After the Create voice API successfully creates a voice, the system returns a voice_id or voiceID:
-
Use this voice_id or voiceID as the voice parameter in the speech synthesis API or language SDKs.
-
Supports non-streaming, unidirectional streaming, and bidirectional streaming synthesis.
-
The speech synthesis model specified during synthesis must match the target_model or targetModel used when creating the voice. Otherwise, the synthesis fails.
-
Sample code:
import os
import time
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService, SpeechSynthesizer

# 1. Prepare the environment.
# Configure the API key using an environment variable.
# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# If an environment variable is not configured, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
if not dashscope.api_key:
    raise ValueError("DASHSCOPE_API_KEY environment variable not set.")

# The following is the WebSocket URL for the Beijing region. If you use a model in the Singapore region, replace the URL with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'
# The following is the HTTP URL for the Beijing region. If you use a model in the Singapore region, replace the URL with: https://dashscope-intl.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

# 2. Define the cloning parameters.
TARGET_MODEL = "cosyvoice-v3.5-plus"
# Give the voice a meaningful prefix.
VOICE_PREFIX = "myvoice"  # Only digits and lowercase letters are allowed. The prefix must be less than 10 characters long.
# A publicly accessible audio URL.
AUDIO_URL = "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/cosyvoice/cosyvoice-zeroshot-sample.wav"  # This is a sample URL. Replace it with your own.

# 3. Create a voice (asynchronous task).
print("--- Step 1: Creating voice enrollment ---")
service = VoiceEnrollmentService()
try:
    voice_id = service.create_voice(
        target_model=TARGET_MODEL,
        prefix=VOICE_PREFIX,
        url=AUDIO_URL
    )
    print(f"Voice enrollment submitted successfully. Request ID: {service.get_last_request_id()}")
    print(f"Generated Voice ID: {voice_id}")
except Exception as e:
    print(f"Error during voice creation: {e}")
    raise

# 4. Poll for the voice status.
print("\n--- Step 2: Polling for voice status ---")
max_attempts = 30
poll_interval = 10  # seconds
for attempt in range(max_attempts):
    try:
        voice_info = service.query_voice(voice_id=voice_id)
    except Exception as e:
        # Treat a failed status query as transient and retry after the interval.
        print(f"Error during status polling: {e}")
        time.sleep(poll_interval)
        continue
    status = voice_info.get("status")
    print(f"Attempt {attempt + 1}/{max_attempts}: Voice status is '{status}'")
    if status == "OK":
        print("Voice is ready for synthesis.")
        break
    if status == "UNDEPLOYED":
        # Raised outside the try block so that the failure is not swallowed by the retry logic.
        print(f"Voice processing failed with status: {status}. Please check audio quality or contact support.")
        raise RuntimeError(f"Voice processing failed with status: {status}")
    # For intermediate statuses such as "DEPLOYING", continue to wait.
    time.sleep(poll_interval)
else:
    print("Polling timed out. The voice is not ready after several attempts.")
    raise RuntimeError("Polling timed out. The voice is not ready after several attempts.")

# 5. Use the cloned voice for speech synthesis.
print("\n--- Step 3: Synthesizing speech with the new voice ---")
try:
    synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=voice_id)
    text_to_synthesize = "Congratulations! You have successfully cloned and synthesized your own voice."
    # The call() method returns binary audio data.
    audio_data = synthesizer.call(text_to_synthesize)
    print(f"Speech synthesis successful. Request ID: {synthesizer.get_last_request_id()}")
    # 6. Save the audio file.
    output_file = "my_custom_voice_output.mp3"
    with open(output_file, "wb") as f:
        f.write(audio_data)
    print(f"Audio saved to {output_file}")
except Exception as e:
    print(f"Error during speech synthesis: {e}")
Getting started: from voice design to speech synthesis
Voice design and speech synthesis follow a "create first, then use" workflow:
-
Prepare the voice description and preview text required for voice design.
-
Voice description (voice_prompt): Defines the features of the target voice. See Voice design: Write high-quality voice descriptions.
-
Preview text (preview_text): The content that the target voice reads for the preview audio, such as "Hello everyone, and welcome."
-
-
Call the Create voice API to create a custom voice and retrieve the voice name and preview audio.
In this step, specify target_model to declare which speech synthesis model will drive the created voice.
Listen to the preview audio to confirm the voice meets your expectations. If not, redesign the voice.
If you already have a voice, skip this step. Check existing voices by calling the Query voice list API.
-
Use the voice for speech synthesis.
When a voice is successfully created using the Create voice API, the system returns a voice_id/voiceID.
-
This voice_id/voiceID can be used as the voice parameter in the speech synthesis API or language SDKs.
-
Supports non-streaming, unidirectional streaming, and bidirectional streaming synthesis.
-
The speech synthesis model specified during synthesis must be the same as the target_model/targetModel specified when the voice was created. Otherwise, the synthesis will fail.
-
Sample code:
-
Generate a custom voice and preview the result. If the result meets your expectations, proceed to the next step. Otherwise, regenerate the voice.
Python
import requests
import base64
import os

def create_voice_and_play():
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")
    if not api_key:
        print("Error: The DASHSCOPE_API_KEY environment variable is not found. Set the API key first.")
        return None, None, None

    # Prepare the request data.
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich, and magnetic voice, a steady speaking speed, and clear articulation, suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "prefix": "announcer"
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }

    # The following is the URL for the Beijing region. If you use a model in the Singapore region, replace the URL with: https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
    url = "https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization"

    try:
        # Send the request.
        response = requests.post(
            url,
            headers=headers,
            json=data,
            timeout=60  # Add a timeout setting.
        )
        if response.status_code == 200:
            result = response.json()
            # Get the voice ID.
            voice_id = result["output"]["voice_id"]
            print(f"Voice ID: {voice_id}")
            # Get the preview audio data.
            base64_audio = result["output"]["preview_audio"]["data"]
            # Decode the Base64 audio data.
            audio_bytes = base64.b64decode(base64_audio)
            # Save the audio file to a local file.
            filename = f"{voice_id}_preview.wav"
            # Write the audio data to the local file.
            with open(filename, 'wb') as f:
                f.write(audio_bytes)
            print(f"Audio saved to local file: {filename}")
            print(f"File path: {os.path.abspath(filename)}")
            return voice_id, audio_bytes, filename
        else:
            print(f"Request failed. Status code: {response.status_code}")
            print(f"Response content: {response.text}")
            return None, None, None
    except requests.exceptions.RequestException as e:
        print(f"A network request error occurred: {e}")
        return None, None, None
    except KeyError as e:
        print(f"The response data format is invalid. A required field is missing: {e}")
        print(f"Response content: {response.text if 'response' in locals() else 'No response'}")
        return None, None, None
    except Exception as e:
        print(f"An unknown error occurred: {e}")
        return None, None, None

if __name__ == "__main__":
    print("Creating the voice...")
    voice_id, audio_data, saved_filename = create_voice_and_play()
    if voice_id:
        print(f"\nVoice '{voice_id}' created successfully.")
        print(f"Audio file saved: '{saved_filename}'")
        print(f"File size: {os.path.getsize(saved_filename)} bytes")
    else:
        print("\nFailed to create the voice.")
Java
Import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:
Maven
In the pom.xml file, add the following content:
<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>
Gradle
In the build.gradle file, add the following content:
// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class Main {
    public static void main(String[] args) {
        Main example = new Main();
        example.createVoice();
    }

    public void createVoice() {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
        // If you have not configured an environment variable, replace the following line with your Model Studio API key: String apiKey = "sk-xxx"
        String apiKey = System.getenv("DASHSCOPE_API_KEY");

        // Create a JSON request body string.
        String jsonBody = "{\n" +
                "    \"model\": \"voice-enrollment\",\n" +
                "    \"input\": {\n" +
                "        \"action\": \"create_voice\",\n" +
                "        \"target_model\": \"cosyvoice-v3.5-plus\",\n" +
                "        \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich, and magnetic voice, a steady speaking speed, and clear articulation, suitable for news broadcasting or documentary commentary.\",\n" +
                "        \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                "        \"prefix\": \"announcer\"\n" +
                "    },\n" +
                "    \"parameters\": {\n" +
                "        \"sample_rate\": 24000,\n" +
                "        \"response_format\": \"wav\"\n" +
                "    }\n" +
                "}";

        HttpURLConnection connection = null;
        try {
            // The following is the URL for the Beijing region. If you use a model in the Singapore region, replace the URL with: https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
            URL url = new URL("https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization");
            connection = (HttpURLConnection) url.openConnection();

            // Set the request method and headers.
            connection.setRequestMethod("POST");
            connection.setRequestProperty("Authorization", "Bearer " + apiKey);
            connection.setRequestProperty("Content-Type", "application/json");
            connection.setDoOutput(true);
            connection.setDoInput(true);

            // Send the request body.
            try (OutputStream os = connection.getOutputStream()) {
                byte[] input = jsonBody.getBytes("UTF-8");
                os.write(input, 0, input.length);
                os.flush();
            }

            // Get the response.
            int responseCode = connection.getResponseCode();
            if (responseCode == HttpURLConnection.HTTP_OK) {
                // Read the response content.
                StringBuilder response = new StringBuilder();
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                    String responseLine;
                    while ((responseLine = br.readLine()) != null) {
                        response.append(responseLine.trim());
                    }
                }

                // Parse the JSON response.
                JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");

                // Get the voice name.
                String voiceId = outputObj.get("voice_id").getAsString();
                System.out.println("Voice ID: " + voiceId);

                // Get the Base64-encoded audio data.
                String base64Audio = previewAudioObj.get("data").getAsString();

                // Decode the Base64 audio data.
                byte[] audioBytes = Base64.getDecoder().decode(base64Audio);

                // Save the audio to a local file.
                String filename = voiceId + "_preview.wav";
                saveAudioToFile(audioBytes, filename);
                System.out.println("Audio saved to local file: " + filename);
            } else {
                // Read the error response.
                StringBuilder errorResponse = new StringBuilder();
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                    String responseLine;
                    while ((responseLine = br.readLine()) != null) {
                        errorResponse.append(responseLine.trim());
                    }
                }
                System.out.println("Request failed. Status code: " + responseCode);
                System.out.println("Error response: " + errorResponse.toString());
            }
        } catch (Exception e) {
            System.err.println("Request error: " + e.getMessage());
            e.printStackTrace();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }

    private void saveAudioToFile(byte[] audioBytes, String filename) {
        try {
            File file = new File(filename);
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audioBytes);
            }
            System.out.println("Audio saved to: " + file.getAbsolutePath());
        } catch (IOException e) {
            System.err.println("Error saving audio file: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
-
Use the custom voice generated in the previous step for speech synthesis.
This example is based on the non-streaming call sample. Replace the voice parameter with the voice ID from voice design.
Key principle: The model used for voice design (target_model) must match the model used for synthesis (model). Mismatches cause synthesis to fail.
Python
# coding=utf-8
import dashscope
from dashscope.audio.tts_v2 import *
import os

# API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following is the URL for the Beijing region. If you use a model in the Singapore region, replace the URL with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'

# The same model must be used for voice design and speech synthesis.
model = "cosyvoice-v3.5-plus"
# Replace the voice parameter with the custom voice generated by voice design.
voice = "your_voice"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized and get the binary audio.
audio = synthesizer.call("How is the weather today?")
# When you send text for the first time, a WebSocket connection must be established. Therefore, the first-packet latency includes the time taken to establish the connection.
print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Save the audio to a local file.
with open('output.mp3', 'wb') as f:
    f.write(audio)
Java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // The same model must be used for voice design and speech synthesis.
    private static String model = "cosyvoice-v3.5-plus";
    // Replace the voice parameter with the custom voice ID generated by voice design.
    private static String voice = "your_voice_id";

    public static void streamAudioDataToSpeaker() {
        // Request parameters.
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://help.aliyun.com/en/model-studio/get-api-key
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();

        // Synchronous mode: Disable the callback (the second parameter is null).
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until the audio is returned.
            audio = synthesizer.call("How is the weather today?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // After the task is complete, close the WebSocket connection.
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save the audio data to the local file "output.mp3".
            File file = new File("output.mp3");
            // When you send text for the first time, a WebSocket connection must be established. Therefore, the first-packet latency includes the time taken to establish the connection.
            // Note: The getFirstPackageDelay() method requires dashscope-sdk-java 2.18.0 or later.
            System.out.println(
                    "[Metric] Request ID: "
                            + synthesizer.getLastRequestId()
                            + ", First-packet latency (ms): "
                            + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        // The following is the URL for the Beijing region. If you use a model in the Singapore region, replace the URL with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}
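Because a mismatch between target_model and the synthesis model makes synthesis fail, a client can check for it before calling the synthesis API. The sketch below assumes that voice IDs keep the naming pattern shown in the examples for the prefix parameter in the API reference (the target model name first, followed by a hyphen and the rest of the name). That pattern is inferred from those examples and is not a documented guarantee. voice_matches_model and require_matching_model are hypothetical helpers.

```python
def voice_matches_model(voice_id: str, model: str) -> bool:
    """Return True if voice_id looks like it was created for the given model.

    Assumption: the voice ID starts with the target model name followed by "-",
    as in the example names in this topic. This is not a documented contract.
    """
    return voice_id.startswith(model + "-")


def require_matching_model(voice_id: str, model: str) -> None:
    """Raise ValueError early instead of letting the synthesis call fail."""
    if not voice_matches_model(voice_id, model):
        raise ValueError(
            f"Voice '{voice_id}' does not appear to target model '{model}'."
        )


# Example voice ID with a made-up unique suffix.
designed_voice = "cosyvoice-v3.5-plus-vd-announcer-0123456789abcdef"
print(voice_matches_model(designed_voice, "cosyvoice-v3.5-plus"))
print(voice_matches_model(designed_voice, "cosyvoice-v3-plus"))
```

If the naming pattern ever changes, the safer option is to store the target_model alongside the voice ID when the voice is created.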
API reference
All API operations must use the same Alibaba Cloud account.
The DashScope SDKs for Java and Python do not support voice design. Use the RESTful API instead.
Create voice
RESTful API
-
URL
Beijing:
POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
Singapore:
POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
-
Request headers
Parameter
Type
Required
Description
Authorization
string
Authentication token. Format: Bearer <your_api_key>. Replace "<your_api_key>" with your actual API key.
Content-Type
string
Media type of data in the request body. Fixed value: application/json.
-
Request body
The request body contains all parameters (omit optional fields as needed).
Important: Distinguish between these two parameters:
-
model: Voice cloning/design model. Fixed value: voice-enrollment
-
target_model: Speech synthesis model that drives the cloned or designed voice. Must match the synthesis model used later; otherwise, synthesis fails.
Voice cloning
{
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3.5-plus",
        "prefix": "myvoice",
        "url": "https://yourAudioFileUrl",
        "language_hints": ["zh"]
    }
}
Voice design
{
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3.5-plus",
        "voice_prompt": "A composed middle-aged male announcer with a deep, rich, and magnetic voice, a steady speaking speed, and clear articulation, suitable for news broadcasting or documentary commentary.",
        "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
        "prefix": "announcer",
        "language_hints": ["zh"]
    },
    "parameters": {
        "sample_rate": 24000,
        "response_format": "wav"
    }
}
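The two request bodies above differ only in a few fields, so they can be produced by two small builder functions. The following is a minimal sketch, not an SDK API: build_clone_request, build_design_request, and the constants are hypothetical names. The length limits (500 characters for voice_prompt, 200 for preview_text) and the prefix rule (letters and digits, at most 10 characters) are taken from the request parameters table in this topic; the sample code comment earlier in this topic states a stricter prefix rule, so treat the prefix check as an approximation.

```python
import re

MAX_PROMPT_CHARS = 500    # voice_prompt limit from the request parameters table
MAX_PREVIEW_CHARS = 200   # preview_text limit from the request parameters table
PREFIX_PATTERN = re.compile(r"[A-Za-z0-9]{1,10}")  # approximation of the prefix rule


def _check_prefix(prefix: str) -> None:
    if not PREFIX_PATTERN.fullmatch(prefix):
        raise ValueError("prefix must be 1-10 letters or digits")


def build_clone_request(target_model, prefix, url, language_hint=None):
    """Build a voice cloning request body (hypothetical helper)."""
    _check_prefix(prefix)
    body = {
        "action": "create_voice",
        "target_model": target_model,
        "prefix": prefix,
        "url": url,
    }
    if language_hint is not None:
        # Only the first array element is used, so pass a single value.
        body["language_hints"] = [language_hint]
    return {"model": "voice-enrollment", "input": body}


def build_design_request(target_model, prefix, voice_prompt, preview_text,
                         sample_rate=24000, response_format="wav"):
    """Build a voice design request body (hypothetical helper)."""
    _check_prefix(prefix)
    if len(voice_prompt) > MAX_PROMPT_CHARS:
        raise ValueError("voice_prompt exceeds 500 characters")
    if len(preview_text) > MAX_PREVIEW_CHARS:
        raise ValueError("preview_text exceeds 200 characters")
    return {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": target_model,
            "voice_prompt": voice_prompt,
            "preview_text": preview_text,
            "prefix": prefix,
        },
        "parameters": {"sample_rate": sample_rate, "response_format": response_format},
    }


clone = build_clone_request("cosyvoice-v3.5-plus", "myvoice", "https://yourAudioFileUrl", "zh")
print(clone["input"]["language_hints"])
```

Validating on the client side gives a clearer error message than a rejected request, but the service remains the authority on what is accepted.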
-
Request parameters
Parameter
Type
Default
Required
Description
model
string
-
Voice cloning/design model. Fixed value: voice-enrollment.
action
string
-
Action type. Fixed value: create_voice.
target_model
string
-
Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail.
url
string
-
Important: Required only for voice cloning
Publicly accessible URL of the audio file used for voice cloning.
For audio format details, see Voice cloning: Input audio format.
For recording guidance, see Recording Guide.
voice_prompt
string
-
Important: Required only for voice design
Voice description. Maximum length: 500 characters.
Chinese and English only.
For guidance on writing voice descriptions, see Voice design: Write high-quality voice descriptions.
preview_text
string
-
Important: Required only for voice design
Text for the preview audio. Maximum length: 200 characters.
Supported languages: Chinese (zh), English (en).
prefix
string
-
The voice name (letters/numbers only, max 10 characters). Use role or scenario identifiers.
This keyword appears in the final voice name. For example, if the keyword is "announcer", the final voice names are:
Voice cloning: cosyvoice-v3.5-plus-announcer-8aae0c0397fa408ca60c29cf******
Voice design: cosyvoice-v3.5-plus-vd-announcer-8aae0c0397fa408ca60c29cf******
language_hints
array[string]
["zh"]
Specifies the language of the sample audio to improve voice feature extraction. Available for cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus. This parameter accepts an array, but only the first element is used — pass a single value.
Functionality:
Voice cloning
Identifies the sample audio language to improve voice feature extraction and cloning quality. If the hint does not match the actual audio language (for example, en for Chinese audio), the system ignores it and auto-detects the language.
Valid values (by model):
-
cosyvoice-v3-plus:
-
zh: Chinese (default)
-
en: English
-
fr: French
-
de: German
-
ja: Japanese
-
ko: Korean
-
ru: Russian
-
-
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:
-
zh: Chinese (default)
-
en: English
-
fr: French
-
de: German
-
ja: Japanese
-
ko: Korean
-
ru: Russian
-
pt: Portuguese
-
th: Thai
-
id: Indonesian
-
vi: Vietnamese
-
For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.
Voice design
Specifies the language preference for the generated voice. Affects pronunciation and language features. Must match the preview_text language.
Valid values:
-
zh: Chinese (default)
-
en: English
max_prompt_audio_length
float
10.0
No
Important:
-
This parameter is only available for voice cloning scenarios.
-
Longer audio produces better results. For optimal voice reproduction, use at least 20 seconds of audio.
Maximum duration (in seconds) of the sample audio after preprocessing for voice cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash. Valid range: [3.0, 30.0].
enable_preprocess
boolean
false
No
Important:
-
This parameter is only available for voice cloning scenarios.
-
If the audio has background noise, we recommend enabling this parameter. Otherwise, noise may appear at sentence breaks in the synthesized speech.
-
For quiet environments, we recommend disabling this parameter to maximize voice reproduction quality.
-
Regardless of the enable_preprocess value, basic VAD processing is still applied when the sample audio length exceeds max_prompt_audio_length. To ensure no modifications are applied to the sample audio, set max_prompt_audio_length to a value greater than or equal to the sample audio length.
Enables audio preprocessing (noise reduction, enhancement, and volume normalization) before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash.
Valid values:
-
true: Enable
-
false: Disable
sample_rate
int
24000
Important: Available only for voice design
The preview audio sample rate (Hz) for voice design.
Valid values:
-
16000
-
24000
-
48000
response_format
string
wav
Important: Available only for voice design
The preview audio format for voice design.
Valid values:
-
pcm
-
wav
-
mp3
-
-
Response parameters
Key parameters:
Parameter
Type
Description
voice_id
string
Voice ID. Use directly as the voice parameter in the speech synthesis API.
data
string
The preview audio data from voice design (Base64-encoded).
sample_rate
int
The preview audio sample rate (Hz) from voice design. This value matches the creation request. Default: 24000 Hz.
response_format
string
The preview audio format from voice design. This value matches the creation request. Default: wav.
target_model
string
Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail.
request_id
string
Request ID.
count
integer
Number of "create voice" operations in this request. Always 1 for voice creation.
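A response can be unpacked with a few lines of code. The sketch below follows the field paths used by the sample code earlier in this topic (output.voice_id and output.preview_audio.data). parse_create_voice_response is a hypothetical helper, the sample dictionaries are made up for illustration, and treating a missing preview_audio object as a voice cloning response is an assumption.

```python
import base64


def parse_create_voice_response(result: dict):
    """Return (voice_id, preview_audio_bytes_or_None) from a Create voice response."""
    output = result["output"]
    voice_id = output["voice_id"]
    # Assumption: only voice design responses carry preview audio.
    preview = output.get("preview_audio")
    audio_bytes = base64.b64decode(preview["data"]) if preview else None
    return voice_id, audio_bytes


# Made-up responses for illustration only; real responses contain more fields.
design_response = {
    "output": {
        "voice_id": "cosyvoice-v3.5-plus-vd-announcer-0123456789abcdef",
        "preview_audio": {"data": base64.b64encode(b"RIFF").decode("ascii")},
    },
    "request_id": "example-request-id",
}
clone_response = {
    "output": {"voice_id": "cosyvoice-v3.5-plus-myvoice-0123456789abcdef"},
    "request_id": "example-request-id",
}

voice_id, audio = parse_create_voice_response(design_response)
print(voice_id, len(audio))
```

The decoded bytes can be written to a file whose extension matches the response_format requested.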
-
Sample code
Important: Distinguish between these two parameters:
-
model: Voice cloning/design model. Fixed value: voice-enrollment
-
target_model: Speech synthesis model that drives the cloned or designed voice. Must match the synthesis model used later; otherwise, synthesis fails.
Voice cloning
If the API key isn't in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# This is the Beijing region URL. For Singapore region: use dashscope-intl.aliyuncs.com with a different regional API key
# Get your API key: https://help.aliyun.com/en/model-studio/get-api-key
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3.5-plus",
        "prefix": "myvoice",
        "url": "https://yourAudioFileUrl"
    }
}'
Voice design
If the API key isn't in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# This is the Beijing region URL. For Singapore region: use dashscope-intl.aliyuncs.com with a different regional API key
# Get your API key: https://help.aliyun.com/en/model-studio/get-api-key
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3.5-plus",
        "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
        "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
        "prefix": "announcer"
    },
    "parameters": {
        "sample_rate": 24000,
        "response_format": "wav"
    }
}'
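The curl calls above translate directly to Python's standard library. The following minimal sketch builds the same POST request for either region without sending it, which makes the URL and headers easy to inspect. ENDPOINTS, build_request, and the region keys "beijing" and "singapore" are illustrative names; the two URLs are the ones listed under URL in this section.

```python
import json
import urllib.request

# Endpoint URLs as listed in this topic; the dictionary keys are illustrative.
ENDPOINTS = {
    "beijing": "https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization",
    "singapore": "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
}


def build_request(region: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the Create voice request for the given region."""
    return urllib.request.Request(
        ENDPOINTS[region],
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3-plus",
        "prefix": "myvoice",
        "url": "https://yourAudioFileUrl",
    },
}
request = build_request("singapore", "sk-xxx", payload)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen(request, timeout=60) and parse the JSON body of the response. Use the API key that belongs to the selected region.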
Python SDK
Interface description
Before using this API, install the latest DashScope SDK.
def create_voice(self, target_model: str, prefix: str, url: str, language_hints: List[str] = None) -> str:
    '''
    Create a new custom voice.
    param: target_model Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
    param: prefix The voice name (letters, numbers, underscores only; max 10 characters). Use role or scenario identifiers. Format: model-name-prefix-unique-id (e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx).
    param: url Publicly accessible URL of the audio file used for voice cloning.
    param: language_hints The sample audio language for voice feature extraction. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus.
        Helps identify sample audio language to improve voice feature extraction and cloning quality.
        If the hint doesn't match actual audio, the system ignores it and auto-detects the language.
        Valid values (by model):
            cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
            cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
        This parameter is an array, but only the first element is processed. Pass only one value.
    param: max_prompt_audio_length The maximum sample audio duration (seconds) after preprocessing for voice cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
        Valid range: [3.0, 30.0]. Longer audio produces better results. For optimal voice reproduction, use at least 20 seconds of audio.
    param: enable_preprocess Enables audio preprocessing. When enabled, the system performs noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
        If the audio has background noise, we recommend enabling this parameter. Otherwise, noise may appear at sentence breaks in the synthesized speech.
        For quiet environments, we recommend disabling this parameter to maximize voice reproduction quality.
        Regardless of the enable_preprocess value, basic VAD processing is still applied when the sample audio length exceeds max_prompt_audio_length. To ensure no modifications are applied to the sample audio, set max_prompt_audio_length to a value greater than or equal to the sample audio length.
    return: voice_id The voice ID. Use directly as the voice parameter in the speech synthesis API.
    '''
-
target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
-
language_hints: Language of the sample audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
Functionality:
Voice cloning
Identifies the sample audio language to improve voice feature extraction and cloning quality. If the hint does not match the actual audio language (for example, en for Chinese audio), the system ignores it and auto-detects the language.
Valid values (by model):
-
cosyvoice-v3-plus:
-
zh: Chinese (default)
-
en: English
-
fr: French
-
de: German
-
ja: Japanese
-
ko: Korean
-
ru: Russian
-
-
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:
-
zh: Chinese (default)
-
en: English
-
fr: French
-
de: German
-
ja: Japanese
-
ko: Korean
-
ru: Russian
-
pt: Portuguese
-
th: Thai
-
id: Indonesian
-
vi: Vietnamese
-
For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.
Voice design
Specifies the language preference for the generated voice. Affects pronunciation and language features. Must match the preview_text language.
Valid values:
-
zh: Chinese (default)
-
en: English
-
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
# Avoid frequent calls. Each call creates a new voice. After reaching your quota limit, you cannot create more.
voice_id = service.create_voice(
    target_model='cosyvoice-v3.5-plus',
    prefix='myvoice',
    url='https://your-audio-file-url',  # trailing comma keeps the call valid if the optional lines below are uncommented
    # language_hints=['zh'],
    # max_prompt_audio_length=10.0,
    # enable_preprocess=False
)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice ID: {voice_id}")
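The constraints in the interface description can be checked locally before a quota-consuming call is made. The following is a minimal sketch, not part of the SDK: the helper name check_create_voice_args and its messages are hypothetical, and the rules are copied from the parameter descriptions above (prefix format, per-model language_hints values, the [3.0, 30.0] range).

```python
import re
from typing import List, Optional

# Values copied from the parameter descriptions above.
_HINTS_V3_PLUS = {"zh", "en", "fr", "de", "ja", "ko", "ru"}
_HINTS_EXTENDED = _HINTS_V3_PLUS | {"pt", "th", "id", "vi"}
LANGUAGE_HINTS_BY_MODEL = {
    "cosyvoice-v3-plus": _HINTS_V3_PLUS,
    "cosyvoice-v3.5-plus": _HINTS_EXTENDED,
    "cosyvoice-v3.5-flash": _HINTS_EXTENDED,
    "cosyvoice-v3-flash": _HINTS_EXTENDED,
}
# Letters, numbers, underscores only; max 10 characters.
_PREFIX_PATTERN = re.compile(r"[A-Za-z0-9_]{1,10}")


def check_create_voice_args(target_model: str, prefix: str,
                            language_hints: Optional[List[str]] = None,
                            max_prompt_audio_length: Optional[float] = None) -> List[str]:
    """Return a list of problems found; an empty list means the arguments look valid."""
    problems = []
    if not _PREFIX_PATTERN.fullmatch(prefix):
        problems.append("prefix must be 1-10 letters, numbers, or underscores")
    if language_hints:
        allowed = LANGUAGE_HINTS_BY_MODEL.get(target_model)
        if allowed is None:
            problems.append(f"language_hints does not apply to {target_model}")
        elif language_hints[0] not in allowed:
            problems.append(f"unsupported language hint: {language_hints[0]}")
        if len(language_hints) > 1:
            problems.append("only the first language hint is processed")
    if max_prompt_audio_length is not None and not 3.0 <= max_prompt_audio_length <= 30.0:
        problems.append("max_prompt_audio_length must be within [3.0, 30.0]")
    return problems
```

Call the helper with the same values you plan to pass to create_voice, and only proceed when it returns an empty list. The service remains the authority on validation; this sketch only catches the documented cases early.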
Java SDK
Interface description
Before using this API, install the latest DashScope SDK.
/**
* Create a new custom voice.
*
* @param targetModel Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
* @param prefix The voice name (letters, numbers, underscores only; max 10 characters). Use role or scenario identifiers. Format: model-name-prefix-unique-id (e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx).
* @param url Publicly accessible URL of the audio file used for voice cloning.
* @param customParam Custom parameters. Specify languageHints and maxPromptAudioLength here.
* languageHints: The sample audio language for voice feature extraction. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus.
* Helps identify sample audio language to improve voice feature extraction and cloning quality.
* If hint doesn't match actual audio, the system ignores it and auto-detects the language.
* Valid values (by model):
* cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
* cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
* Only the first element is processed. Pass only one value.
* maxPromptAudioLength: The maximum sample audio duration (seconds) after preprocessing for voice cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
* Valid range: [3.0, 30.0]. Longer audio produces better results. For optimal voice reproduction, use at least 20 seconds of audio.
* enable_preprocess: Configured through the generic parameter. Enables audio preprocessing. When enabled, the system performs noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
* If the audio has background noise, we recommend enabling this parameter. Otherwise, noise may appear at sentence breaks in the synthesized speech.
* For quiet environments, we recommend disabling this parameter to maximize voice reproduction quality.
* Regardless of the enable_preprocess value, basic VAD processing is still applied when the sample audio length exceeds max_prompt_audio_length. To ensure no modifications are applied to the sample audio, set max_prompt_audio_length to a value greater than or equal to the sample audio length.
* @return Voice New voice. Call Voice.getVoiceId() to get the voice ID. Use directly as the voice parameter in the speech synthesis API.
* @throws NoApiKeyException If the API key is empty.
* @throws InputRequiredException If a required parameter is empty.
*/
public Voice createVoice(String targetModel, String prefix, String url, VoiceEnrollmentParam customParam) throws NoApiKeyException, InputRequiredException
-
targetModel: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
-
languageHints: Language of the sample audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
Functionality:
Voice cloning
Identifies the sample audio language to improve voice feature extraction and cloning quality. If the hint does not match the actual audio language (for example, en for Chinese audio), the system ignores it and auto-detects the language.
Valid values (by model):
-
cosyvoice-v3-plus:
-
zh: Chinese (default)
-
en: English
-
fr: French
-
de: German
-
ja: Japanese
-
ko: Korean
-
ru: Russian
-
-
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:
-
zh: Chinese (default)
-
en: English
-
fr: French
-
de: German
-
ja: Japanese
-
ko: Korean
-
ru: Russian
-
pt: Portuguese
-
th: Thai
-
id: Indonesian
-
vi: Vietnamese
-
For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.
Voice design
Specifies the language preference for the generated voice. Affects pronunciation and language features. Must match the preview_text language.
Valid values:
-
zh: Chinese (default)
-
en: English
-
Request example
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentParam;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Collections;
public class Main {
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args) {
String apiKey = System.getenv("DASHSCOPE_API_KEY");
String targetModel = "cosyvoice-v3.5-plus";
String prefix = "myvoice";
String fileUrl = "https://your-audio-file-url";
String cloneModelName = "voice-enrollment";
try {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
Voice myVoice = service.createVoice(
targetModel,
prefix,
fileUrl,
VoiceEnrollmentParam.builder()
.model(cloneModelName)
.languageHints(Collections.singletonList("zh"))
// .maxPromptAudioLength(10.0f)
// .parameter("enable_preprocess", false)
.build());
logger.info("Voice creation submitted. Request ID: {}", service.getLastRequestId());
logger.info("Generated Voice ID: {}", myVoice.getVoiceId());
} catch (Exception e) {
logger.error("Failed to create voice", e);
}
}
}
List voices
Query created voices with pagination.
RESTful API
-
URL and Request headers are the same as the Create voice API
-
Request body
The request body contains all parameters. Omit optional fields as needed.
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
    "model": "voice-enrollment",
    "input": {
        "action": "list_voice",
        "prefix": "announcer",
        "page_size": 10,
        "page_index": 0
    }
}
-
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | | Voice cloning/design model. Fixed value: voice-enrollment. |
| action | string | - | | Action type. Fixed value: list_voice. |
| prefix | string | - | | The same prefix used when creating the voice (letters/numbers only, max 10 characters). |
| page_index | integer | 0 | | Page index (≥ 0). |
| page_size | integer | 10 | | The number of items per page. Valid range: [0, 1000]. |
-
Response parameters
Key parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| voice_id | string | Voice ID. Use directly as the voice parameter in the speech synthesis API. |
| target_model | string | Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail. |
| gmt_create | string | Time the voice was created. |
| gmt_modified | string | Time the voice was last modified. |
| voice_prompt | string | Voice description. |
| preview_text | string | Preview text. |
| request_id | string | Request ID. |
| status | string | Voice status: DEPLOYING (under review), OK (ready to use), UNDEPLOYED (unavailable). |
-
Sample code
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# This is the Beijing region URL. For the Singapore region, use dashscope-intl.aliyuncs.com with a different regional API key.
# Get your API key: https://help.aliyun.com/en/model-studio/get-api-key
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "list_voice",
        "prefix": "announcer",
        "page_size": 10,
        "page_index": 0
    }
}'
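The same request can be issued from Python without the SDK. The following is a minimal sketch using only the standard library; the function names build_list_voice_request and send are hypothetical, the URL is the Beijing region endpoint from the curl example, and treating prefix as optional is an assumption based on the SDK's default of None.

```python
import json
import urllib.request

# Beijing region endpoint from the curl example.
# For the Singapore region, use the dashscope-intl.aliyuncs.com host instead.
CUSTOMIZATION_URL = "https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization"


def build_list_voice_request(api_key, prefix=None, page_index=0, page_size=10):
    """Build (but do not send) the POST request for the list_voice action."""
    input_fields = {"action": "list_voice", "page_index": page_index, "page_size": page_size}
    if prefix is not None:  # assumption: omitting prefix lists all voices
        input_fields["prefix"] = prefix
    body = {"model": "voice-enrollment", "input": input_fields}
    return urllib.request.Request(
        CUSTOMIZATION_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, send(build_list_voice_request(os.environ["DASHSCOPE_API_KEY"], prefix="announcer")) performs the same call as the curl command (with os imported). Separating request construction from sending makes the payload easy to inspect before any network traffic occurs.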
Python SDK
Interface description
def list_voices(self, prefix=None, page_index: int = 0, page_size: int = 10) -> List[dict]:
    '''
    Query all created voices.
    param: prefix Custom prefix for the voice (digits and lowercase letters only; fewer than 10 characters).
    param: page_index Page index to query
    param: page_size Page size to query
    return: List[dict] Voice list containing ID, creation time, modification time, and status for each voice. Format: [{'gmt_create': '2025-10-09 14:51:01', 'gmt_modified': '2025-10-09 14:51:07', 'status': 'OK', 'voice_id': 'cosyvoice-v3-myvoice-xxx'}]
    Voice statuses:
        DEPLOYING: Under review
        OK: Approved and ready to use
        UNDEPLOYED: Rejected and unavailable
    '''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
# Filter by prefix, or set to None to query all
voices = service.list_voices(prefix='myvoice', page_index=0, page_size=10)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Found voices: {voices}")
Response example
[
{
"gmt_create": "2024-09-13 11:29:41",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
},
{
"gmt_create": "2024-09-13 13:22:38",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 13:22:38",
"status": "OK"
}
]
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| voice_id | string | Voice ID. Use directly as the voice parameter in the speech synthesis API. |
| target_model | string | Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail. |
| gmt_create | string | Time the voice was created. |
| gmt_modified | string | Time the voice was last modified. |
| voice_prompt | string | Voice description. |
| preview_text | string | Preview text. |
| request_id | string | Request ID. |
| status | string | Voice status: DEPLOYING (under review), OK (ready to use), UNDEPLOYED (unavailable). |
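Because list_voices returns one page at a time, collecting every voice means walking page_index upward. The following is a minimal sketch: list_all_voices is a hypothetical helper, and it assumes that a page shorter than page_size marks the end of the results, which this topic does not state explicitly.

```python
from typing import Callable, Dict, List


def list_all_voices(fetch_page: Callable[[int, int], List[Dict]], page_size: int = 10) -> List[Dict]:
    """Collect every voice by requesting pages until one comes back short."""
    voices: List[Dict] = []
    page_index = 0
    while True:
        page = fetch_page(page_index, page_size)
        voices.extend(page)
        if len(page) < page_size:  # assumption: a short or empty page is the last one
            return voices
        page_index += 1
```

With the SDK, the callable can wrap the documented method: list_all_voices(lambda i, n: service.list_voices(prefix='myvoice', page_index=i, page_size=n)). Passing the fetch function in keeps the paging logic independent of the SDK and easy to test.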
Java SDK
Interface description
// Voice statuses:
// DEPLOYING: Under review
// OK: Approved and ready to use
// UNDEPLOYED: Rejected and unavailable
/**
* Query all created voices. Default page index is 0, default page size is 10.
*
* @param prefix Custom prefix for the voice (digits and lowercase letters only; fewer than 10 characters). Can be null.
* @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
* @throws NoApiKeyException If the API key is empty.
* @throws InputRequiredException If a required parameter is empty.
*/
public Voice[] listVoice(String prefix) throws NoApiKeyException, InputRequiredException
/**
* Query all created voices.
*
* @param prefix Custom prefix for the voice (digits and lowercase letters only; fewer than 10 characters).
* @param pageIndex Page index to query.
* @param pageSize Page size to query.
* @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
* @throws NoApiKeyException If the API key is empty.
* @throws InputRequiredException If a required parameter is empty.
*/
public Voice[] listVoice(String prefix, int pageIndex, int pageSize) throws NoApiKeyException, InputRequiredException
Request example
You need to import the third-party library com.google.gson.Gson.
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String prefix = "myvoice"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Query voices
Voice[] voices = service.listVoice(prefix, 0, 10);
logger.info("List successful. Request ID: {}", service.getLastRequestId());
logger.info("Voices Details: {}", new Gson().toJson(voices));
}
}
Response example
[
{
"gmt_create": "2024-09-13 11:29:41",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
},
{
"gmt_create": "2024-09-13 13:22:38",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 13:22:38",
"status": "OK"
}
]
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| voice_id | string | Voice ID. Use directly as the voice parameter in the speech synthesis API. |
| target_model | string | Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail. |
| gmt_create | string | Time the voice was created. |
| gmt_modified | string | Time the voice was last modified. |
| voice_prompt | string | Voice description. |
| preview_text | string | Preview text. |
| request_id | string | Request ID. |
| status | string | Voice status: DEPLOYING (under review), OK (ready to use), UNDEPLOYED (unavailable). |
Query specific voice
Get detailed information about a specific voice by its voice ID.
RESTful API
-
URL and Request headers are the same as the Create voice API
-
Request body
The request body contains all parameters. Omit optional fields as needed.
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
    "model": "voice-enrollment",
    "input": {
        "action": "query_voice",
        "voice_id": "yourVoiceID"
    }
}
-
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | | Voice cloning/design model. Fixed value: voice-enrollment. |
| action | string | - | | Action type. Fixed value: query_voice. |
| voice_id | string | - | | ID of the voice to query. |
-
Response parameters
For parameter descriptions, see the List voices API.
-
Sample code
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# ======= Important Notice =======
# The following is the URL for the Beijing region. If you use the Singapore region model, replace the URL with: https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for Singapore and Beijing regions differ. Get your API key: https://help.aliyun.com/en/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "query_voice",
        "voice_id": "yourVoiceID"
    }
}'
Python SDK
Interface description
def query_voice(self, voice_id: str) -> List[str]:
    '''
    Query details for a specific voice
    param: voice_id ID of the voice to query
    return: List[str] Voice details, including status, creation time, audio link, etc.
    '''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
voice_id = 'cosyvoice-v3-plus-myvoice-xxxxxxxx'
voice_details = service.query_voice(voice_id=voice_id)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice Details: {voice_details}")
Response example
{
"gmt_create": "2024-09-13 11:29:41",
"resource_link": "https://yourAudioFileUrl",
"target_model": "cosyvoice-v3.5-plus",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
}
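A newly created or updated voice may report DEPLOYING before it becomes usable, so a caller can poll the query until the status settles. The following is a minimal sketch: wait_until_ready is a hypothetical helper, its timeout and interval defaults are arbitrary, and it assumes the query result is a dict with a status field as in the response example above.

```python
import time
from typing import Callable, Dict


def wait_until_ready(query: Callable[[str], Dict], voice_id: str,
                     timeout: float = 60.0, interval: float = 2.0) -> Dict:
    """Poll query(voice_id) until status is OK; raise on UNDEPLOYED or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        details = query(voice_id)
        status = details.get("status")
        if status == "OK":
            return details
        if status == "UNDEPLOYED":
            raise RuntimeError(f"voice {voice_id} is unavailable (UNDEPLOYED)")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"voice {voice_id} still has status {status}")
        time.sleep(interval)
```

With the SDK, pass the documented method as the callable, for example wait_until_ready(service.query_voice, voice_id). Keep the interval modest so that polling does not turn into frequent API calls.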
Response parameters
Java SDK
Interface description
/**
* Query details for a specific voice
*
* @param voiceId ID of the voice to query
* @return Voice Voice details, including status, creation time, audio link, etc.
* @throws NoApiKeyException If the API key is empty
* @throws InputRequiredException If a required parameter is empty
*/
public Voice queryVoice(String voiceId) throws NoApiKeyException, InputRequiredException
Request example
You need to import the third-party library com.google.gson.Gson.
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
Voice voice = service.queryVoice(voiceId);
logger.info("Query successful. Request ID: {}", service.getLastRequestId());
logger.info("Voice Details: {}", new Gson().toJson(voice));
}
}
Response example
{
"gmt_create": "2024-09-13 11:29:41",
"resource_link": "https://yourAudioFileUrl",
"target_model": "cosyvoice-v3.5-plus",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
}
Response parameters
Update voice (voice cloning only)
Updates an existing voice with a new audio file.
This feature is not supported for voice design.
RESTful API
-
URL and Request headers are the same as the Create voice API
-
Request body
The request body contains all parameters. Omit optional fields as needed:
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
    "model": "voice-enrollment",
    "input": {
        "action": "update_voice",
        "voice_id": "yourVoiceId",
        "url": "https://yourAudioFileUrl"
    }
}
-
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | | Voice cloning/design model. Fixed value: voice-enrollment. |
| action | string | - | | Action type. Fixed value: update_voice. |
| voice_id | string | - | | Voice to update. |
| url | string | - | | URL of the audio file to update the voice. The URL must be publicly accessible. |
-
Response parameters
-
Sample code
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# ======= Important Notice =======
# The following is the URL for the Beijing region. If you use the Singapore region model, replace the URL with: https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for Singapore and Beijing regions differ. Get your API key: https://help.aliyun.com/en/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "update_voice",
        "voice_id": "yourVoiceId",
        "url": "https://yourAudioFileUrl"
    }
}'
Python SDK
Interface description
def update_voice(self, voice_id: str, url: str) -> None:
    '''
    Update a voice
    param: voice_id Voice ID
    param: url URL of the audio file for voice cloning
    '''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
service.update_voice(
voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx',
url='https://your-new-audio-file-url'
)
print(f"Update submitted. Request ID: {service.get_last_request_id()}")
Java SDK
Interface description
/**
* Update a voice
*
* @param voiceId Voice to update
* @param url URL of the audio file for voice cloning
* @throws NoApiKeyException If the API key is empty
* @throws InputRequiredException If a required parameter is empty
*/
public void updateVoice(String voiceId, String url)
throws NoApiKeyException, InputRequiredException
Request example
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String fileUrl = "https://your-audio-file-url"; // Replace with your actual value
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Update voice
service.updateVoice(voiceId, fileUrl);
logger.info("Update submitted. Request ID: {}", service.getLastRequestId());
}
}
Delete voice
Deletes a voice you no longer need to free up the quota. This action cannot be undone.
RESTful API
-
URL and Request headers are the same as the Create voice API
-
Request body
The request body contains all parameters. Omit optional fields as needed:
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
    "model": "voice-enrollment",
    "input": {
        "action": "delete_voice",
        "voice_id": "yourVoiceID"
    }
}
-
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | | Voice cloning/design model. Fixed value: voice-enrollment. |
| action | string | - | | Action type. Fixed value: delete_voice. |
| voice_id | string | - | | Voice to delete. |
-
Response parameters
-
Sample code
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# ======= Important Notice =======
# The following is the URL for the Beijing region. If you use the Singapore region model, replace the URL with: https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for Singapore and Beijing regions differ. Get your API key: https://help.aliyun.com/en/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "delete_voice",
        "voice_id": "yourVoiceID"
    }
}'
Python SDK
Interface description
def delete_voice(self, voice_id: str) -> None:
    '''
    Delete a voice
    param: voice_id Voice to delete
    '''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
service.delete_voice(voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx')
print(f"Deletion submitted. Request ID: {service.get_last_request_id()}")
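Since deletion frees quota but cannot be undone, it helps to compute the list of candidates first and review it before deleting anything. The following is a minimal sketch of one possible cleanup policy: select_voices_to_delete and keep_latest are hypothetical, the policy itself (drop UNDEPLOYED voices and all but the newest few others) is an assumption, and the input is a list of dicts shaped like the list_voices response.

```python
from typing import Dict, List


def select_voices_to_delete(voices: List[Dict], keep_latest: int = 5) -> List[str]:
    """Pick voice IDs to delete: every UNDEPLOYED voice, plus all but the newest keep_latest others."""
    unusable = [v["voice_id"] for v in voices if v["status"] == "UNDEPLOYED"]
    usable = [v for v in voices if v["status"] != "UNDEPLOYED"]
    # gmt_modified uses 'YYYY-MM-DD HH:MM:SS', so string order equals chronological order.
    usable.sort(key=lambda v: v["gmt_modified"], reverse=True)
    stale = [v["voice_id"] for v in usable[keep_latest:]]
    return unusable + stale
```

After reviewing the returned IDs, pass each one to service.delete_voice(voice_id=...). The helper never deletes anything itself, which keeps the irreversible step explicit.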
Java SDK
Interface description
/**
* Delete a voice
*
* @param voiceId Voice to delete
* @throws NoApiKeyException If the API key is empty
* @throws InputRequiredException If a required parameter is empty
*/
public void deleteVoice(String voiceId) throws NoApiKeyException, InputRequiredException
Request example
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Delete voice
service.deleteVoice(voiceId);
logger.info("Deletion submitted. Request ID: {}", service.getLastRequestId());
}
}
Quota and cleanup
Billing
-
Voice cloning and design operations (create, query, update, delete) are free.
-
Speech synthesis with custom voices is billed by character count. For details, see Real-time speech synthesis – CosyVoice.
Copyright and legality
You are responsible for ensuring ownership and legal rights to any voice you provide. Read the terms of service.
Error codes
If you encounter an error, see Error messages for troubleshooting.
FAQ
Features
Q: How do I adjust the speed and volume of a custom voice?
The same way you adjust a preset voice. Pass the corresponding parameters when calling the speech synthesis API. For example, use speech_rate (Python) or speechRate (Java) to adjust speed, and volume to adjust volume. For more information, see the speech synthesis API documentation (Java SDK/Python SDK/WebSocket API).
Q: How do I call the API using languages other than Java and Python (such as Go, C#, or Node.js)?
For voice management, use the RESTful API provided in this document. For speech synthesis, use the WebSocket API and pass the cloned voice_id as the voice parameter.
Troubleshooting
If you encounter code errors, troubleshoot using the information in Error codes.
Q: What should I do if the synthesized audio from a cloned voice contains extra content?
If the synthesized audio from a cloned voice contains unexpected characters or noise, troubleshoot as follows:
-
Check the sample audio quality
Sample audio quality directly affects synthesis results. Verify that your sample audio meets these requirements:
-
No background noise or static
-
Clear sound quality (sample rate ≥ 16 kHz recommended)
-
Audio format: WAV preferred over MP3 (avoids lossy compression artifacts)
-
Mono channel (stereo may cause interference)
-
No silent segments or long pauses
-
Moderate speech rate (overly fast speech degrades feature extraction)
-
-
Check the input text
Verify that the input text contains no special symbols or markers:
-
Avoid special symbols such as **, "", and ''.
-
Unless used for LaTeX formulas, preprocess the text to filter out special symbols.
-
-
Verify voice cloning parameters
Ensure the language parameter (language_hints/languageHints) matches the language of the sample audio.
-
Try cloning again
Re-clone the voice with a higher-quality sample audio file and test the result.
-
Compare with system voices
Synthesize the same text with a preset system voice to confirm whether the issue is specific to the cloned voice.
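The "Check the input text" step above can be automated with a small preprocessing pass before text is sent for synthesis. The following is a minimal sketch: strip_special_symbols is a hypothetical helper, and the choice to remove ** markers, straight and curly double quotes, and single quotes that are not apostrophes inside a word is an assumption about what counts as a special symbol. It does not detect LaTeX formulas, so skip it for text that contains them.

```python
import re

# '**' runs, straight or curly double quotes, and single quotes that are
# not apostrophes inside a word (so "don't" is left intact).
_SPECIAL = re.compile(r"\*{2,}|[\"\u201c\u201d]|(?<!\w)'|'(?!\w)")


def strip_special_symbols(text: str) -> str:
    """Remove the symbols described above and collapse leftover whitespace."""
    cleaned = _SPECIAL.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()
```

Run the helper on each text segment before passing it to the speech synthesis API, then compare the audio with and without cleaning to confirm whether symbols were the cause.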
Q: What should I do if the audio generated from a cloned voice is silent?
-
Check voice status
Call the Query specific voice API to check whether the voice status is OK.
-
Check model version consistency
Ensure the target_model parameter used for voice cloning exactly matches the model parameter used for speech synthesis. For example:
-
When you clone the voice, use cosyvoice-v3-plus.
-
You must also use cosyvoice-v3-plus for synthesis.
-
-
Verify sample audio quality
Check if the sample audio used for cloning meets the voice cloning sample audio format requirements:
-
Audio duration: 10–20 seconds
-
Clear sound quality
-
No background noise
-
-
Check request parameters
Confirm the voice parameter is set to the cloned voice's ID during speech synthesis.
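The status, model, and parameter checks in this list can be run together before a synthesis call. The following is a minimal sketch: diagnose_silent_audio is a hypothetical helper, and it assumes voice_details is a dict with status and target_model fields, as in the Query specific voice response example. It does not inspect the sample audio itself.

```python
from typing import Dict, List


def diagnose_silent_audio(voice_id: str, voice_details: Dict,
                          synthesis_model: str, synthesis_voice: str) -> List[str]:
    """Return the checklist items that fail; an empty list means these checks pass."""
    findings = []
    status = voice_details.get("status")
    if status != "OK":
        findings.append(f"voice status is {status}, expected OK")
    target_model = voice_details.get("target_model")
    if target_model != synthesis_model:
        findings.append(f"voice was created for {target_model} but synthesis uses {synthesis_model}")
    if synthesis_voice != voice_id:
        findings.append("the voice parameter is not set to the cloned voice ID")
    return findings
```

Pass in the result of service.query_voice(voice_id) along with the model and voice values you intend to use for synthesis. If the list is empty and the audio is still silent, the sample audio quality is the next thing to review.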
Q: What should I do if the synthesized speech from a cloned voice is unstable or incomplete?
If synthesized speech from a cloned voice exhibits these issues:
-
Incomplete playback — only part of the text is spoken
-
Unstable quality — output varies between good and poor
-
Abnormal pauses or silent segments in the audio
Likely cause: The sample audio quality does not meet requirements.
Solution: Verify that the sample audio meets the following requirements, and re-record following the Recording Guide.
-
Check audio continuity: Verify that the speech is continuous with no pauses or silent segments longer than 2 seconds. Significant gaps can cause the model to learn silence or noise as voice features, degrading the result.
-
Check speech activity ratio: Active speech should exceed 60% of total audio duration. Excessive background noise or non-speech segments interfere with voice feature extraction.
-
Verify audio quality details:
-
Audio duration: 10–20 seconds (15 seconds recommended)
-
Clear pronunciation and steady speech rate
-
No background noise, echo, or static
-
Concentrated speech energy with no long silent segments
-
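Some of the sample audio requirements can be verified mechanically before uploading. The following is a minimal sketch for WAV files using Python's standard wave module: check_sample_wav is a hypothetical helper that covers only duration (10 to 20 seconds), channel count, and sample rate (the mono and 16 kHz guidance comes from the FAQ entry on extra content). It does not measure silence, noise, or speech activity ratio.

```python
import wave
from typing import List


def check_sample_wav(path: str) -> List[str]:
    """Check a WAV file against the duration, channel, and sample-rate guidance."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    issues = []
    if not 10.0 <= duration <= 20.0:
        issues.append(f"duration is {duration:.1f} s; 10-20 s is recommended")
    if channels != 1:
        issues.append(f"{channels} channels; mono is recommended")
    if sample_rate < 16000:
        issues.append(f"sample rate is {sample_rate} Hz; at least 16000 Hz is recommended")
    return issues
```

An empty result means the file passes these three checks only. Pauses longer than 2 seconds and a low speech activity ratio still need to be checked by listening or with an audio analysis tool.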
Q: Why can't I find the VoiceEnrollmentService class?
Your SDK version is too old. Install the latest SDK.
Q: What should I do if the voice cloning result is poor, with noise or unclear audio?
This is usually caused by low-quality sample audio. Re-record and upload the audio, strictly following the Recording Guide.
Q: Why is there a long silence at the beginning or an abnormal total duration when I synthesize very short text (like a single word) with a cloned voice?
The voice cloning model learns pauses and rhythm from the sample audio. If the original recording contains a long initial silence, the synthesized output retains a similar pattern. For single words or very short text, this silence ratio is amplified, making the audio appear long but mostly silent. To avoid this, trim long silences from the sample audio and use complete sentences or longer text for synthesis. If you must synthesize a single word, add context before or after it.
Permissions and authentication
Q: Can I use an API key from a sub-workspace for voice cloning?
Yes, but you must first grant model authorization to the sub-workspace corresponding to the API key. For more information, see Model Invocation in Sub-workspaces.