Voice Cloning - Intelligent Media Services (IMS) - Alibaba Cloud Help Center

The voice cloning feature allows you to quickly generate a custom voice from source audio. You can then use this voice in the tts node of an AI Real-time Interaction workflow. You can create a voice by recording in real time or uploading an audio file in the console. You can also batch clone voices by using Model Studio APIs.

Process overview

Follow these steps to clone a voice and apply it in AI Real-time Interaction:

Clone a voice to generate a voice ID. You can record audio in real time or upload an audio file in the AI Real-time Interaction console. Alternatively, you can batch clone voices by using Model Studio APIs.
In the workflow editor, use the generated voice ID to set the Model Studio model and voice parameters for the tts node.

Prerequisites

You have integrated the required SDK version. For more information, see Install SDK.
You have created an API key. For more information, see Get an API key.

Audio file requirements

All source audio must meet the following requirements:

Channels: Mono or stereo
Sample rate: 16 kHz or higher
Format: WAV (16-bit), MP3, or M4A
File size: Up to 10 MB
Duration: At least 10 seconds (at least 20 seconds for real-time recordings)
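Before you upload or submit a file, you can check it locally against the requirements above. The following is a minimal sketch for WAV files only, using the Python standard library. The function name check_wav and the constants are illustrative and not part of any SDK, and the 10 MB limit is assumed to mean 10 x 1024 x 1024 bytes.

```python
import os
import wave

# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_BYTES = 10 * 1024 * 1024
MIN_SAMPLE_RATE = 16_000  # "16 kHz or higher"
MIN_SECONDS = 10.0        # use 20.0 for real-time recordings


def check_wav(path, min_seconds=MIN_SECONDS):
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 10 MB")
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() not in (1, 2):
            problems.append("audio is neither mono nor stereo")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append("WAV is not 16-bit")
        if wav.getframerate() < MIN_SAMPLE_RATE:
            problems.append("sample rate is below 16 kHz")
        duration = wav.getnframes() / wav.getframerate()
        if duration < min_seconds:
            problems.append(
                f"duration {duration:.1f}s is shorter than {min_seconds:.0f}s"
            )
    return problems
```

Call check_wav("sample.wav") and submit the file only if the returned list is empty. MP3 and M4A files need a separate decoder and are not covered by this sketch.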

Step 1: Clone a voice

You can clone voices in AI Real-time Interaction using two methods. Choose the one that best suits your use case.

Cloning method	Use case	Features
Method 1: Clone in the AI Real-time Interaction console	Ideal for creating a small number of voices by recording in real time or uploading local audio files.	Requires no coding. Supports real-time recording and auditioning.
Method 2: Batch clone by using Model Studio APIs	Ideal for batch cloning when your audio files are already stored in OSS.	Requires coding. Supports automated workflows and batch processing.

Method 1: Clone in the console

When creating a voice in the console, you can either record audio in real time or upload a local audio file as the source audio. After creation, the system automatically reviews the voice. Once approved, you can use it in the tts node of a workflow.

Create a voice

Log on to the Intelligent Media Services console.
In the left-side navigation pane, choose Real-time Conversational AI > Voice Cloning.
Click Create Voice. In the upper-right corner, the page displays your current voice count and quota limit (30 voices).
On the Create Voice page, configure the following parameters:
- Voice Name: Enter a custom name for the voice. This helps you identify and select it in workflows.
- Original Audio: Select the audio source method:
  - Upload an audio file: Select and upload an audio file from your local machine that meets the audio file requirements.
  - Record: Click the record button to record audio from your microphone. The recording must be at least 20 seconds long.
- Cloning Model: Select the model to use for voice cloning. The following models are supported:
  - CosyVoice series: cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus
  For more information about these models and their capabilities, see Real-time speech synthesis - CosyVoice/Sambert and Real-time speech synthesis - Qwen.
Click Submit. The system starts processing the voice cloning request and initiates a review.
On the Voice Cloning list page, check the voice status:
- Reviewing: The voice is under review and cannot be used yet.
- Approved: The voice is ready to use. Click the Audition button to listen to a synthesized sample.
- Review Failed: The voice did not pass the review and cannot be used. Check if your audio content complies with the terms of service.
Once the voice is approved, find its voice ID on the list page. You will need this ID to configure the tts node in your workflow.

List page features:

Search for voices by name or ID.
Each account has a quota limit of 30 voices. To create a new voice after reaching this limit, you must first delete an existing one.

Method 2: Batch clone with APIs

Use the Model Studio API to batch clone voices or to integrate cloning into an automated process. Unlike the console method, the API method is designed for programmatically processing audio files that are already stored in OSS.

Note

Before you call the API, you must activate the Model Studio service in the Model Studio console.

Upload audio to OSS

After preparing your audio files, upload them to an OSS bucket. For instructions, see Procedures.
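If you prefer to upload from code, the following sketch uses the OSS Python SDK (oss2). The endpoint, bucket name, object key, and the environment variable names are placeholders that you must replace, and the helper names make_bucket and upload_and_sign are illustrative. The sketch returns a signed URL, on the assumption that a time-limited signed URL is acceptable wherever the cloning API expects an accessible audio URL.

```python
import os


def make_bucket(endpoint, bucket_name):
    """Build an oss2 Bucket from credentials stored in environment variables."""
    import oss2  # pip install oss2; imported lazily so upload_and_sign has no hard dependency

    auth = oss2.Auth(
        os.environ["OSS_ACCESS_KEY_ID"],      # assumed variable name
        os.environ["OSS_ACCESS_KEY_SECRET"],  # assumed variable name
    )
    return oss2.Bucket(auth, endpoint, bucket_name)


def upload_and_sign(bucket, local_path, object_key, expires_seconds=3600):
    """Upload a local file, then return a signed GET URL for the object."""
    bucket.put_object_from_file(object_key, local_path)
    return bucket.sign_url("GET", object_key, expires_seconds)


# Usage (placeholders, not real values):
#   bucket = make_bucket("https://oss-cn-hangzhou.aliyuncs.com", "your-bucket")
#   url = upload_and_sign(bucket, "sample.wav", "voices/sample.wav")
```

Passing the bucket in as an argument keeps the upload step easy to test with a stand-in object.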

Call the API to get a voice ID

The following sample code demonstrates how to call the Model Studio API to generate a voice ID from an audio file stored in OSS.

import os
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService

# Obtain the API key from an environment variable to avoid hardcoding.
api_key = os.getenv('DASHSCOPE_API_KEY')
if not api_key:
    raise RuntimeError("Set the DASHSCOPE_API_KEY environment variable first.")
dashscope.api_key = api_key

url = "https://your-audio-file-url"  # Replace this with the public URL of your audio file in OSS.
prefix = 'prefix'  # Define a custom prefix to identify the voice source.
target_model = "cosyvoice-v2"  # Select a cloning model.

# Create an instance of the voice enrollment service.
service = VoiceEnrollmentService()

# Call the create_voice method to clone the voice and generate a voice_id.
voice_id = service.create_voice(target_model=target_model, prefix=prefix, url=url)
print(f"your voice id is {voice_id}")
# Sample output: cosyvoice-prefix-xxxxx

After the call completes, save the returned voice_id. You need this voice ID to configure the tts node. Voices created via the API also appear on the Voice Cloning list page, where you can view their status and manage them.
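To clone several voices in one run, you can wrap the create_voice call in a loop. The following is a sketch under stated assumptions, not an official batch API: the helper name clone_batch is illustrative, and the service argument is expected to be a VoiceEnrollmentService instance (or any object with the same create_voice method). A failure on one file is recorded and the loop continues with the next file.

```python
def clone_batch(service, urls, target_model="cosyvoice-v2", prefix="prefix"):
    """Clone one voice per URL.

    Returns two dicts keyed by URL: the voice IDs that were created,
    and the error messages for the URLs that failed.
    """
    voice_ids, errors = {}, {}
    for url in urls:
        try:
            voice_ids[url] = service.create_voice(
                target_model=target_model, prefix=prefix, url=url
            )
        except Exception as exc:  # keep going so one bad file does not stop the batch
            errors[url] = str(exc)
    return voice_ids, errors


# Usage, assuming dashscope is installed and dashscope.api_key is set:
#   from dashscope.audio.tts_v2 import VoiceEnrollmentService
#   ids, errs = clone_batch(VoiceEnrollmentService(), ["https://your-audio-file-url"])
```

Keep the 30-voice quota in mind when you size a batch, because voices created via the API also appear on the Voice Cloning list page.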

Supported cloning models

Model Studio currently supports the following voice cloning models:

CosyVoice series: For model details and API call information, see Real-time speech synthesis - CosyVoice/Sambert.
Qwen3 series: For model details and API call information, see Real-time speech synthesis - Qwen.

Step 2: Configure the tts node

After you have a voice ID for your cloned voice, configure the tts node in an AI Real-time Interaction workflow to use it.

Procedure

Log on to the Intelligent Media Services console.
In the left-side navigation pane, choose Workflow Management.
Find the desired workflow and click its name to open the details page.
In the upper-right corner, click Edit to open the workflow editor.
On the workflow canvas, select the Text-to-speech node.
In the node configuration panel, select the Model Studio model and set the voice parameter to the voice ID generated in Step 1.
Click Save to complete the tts node configuration.

Once configured, your AI Real-time Interaction workflow will use the cloned voice for speech synthesis.

FAQ and troubleshooting

Handling review failures

If your voice fails the review, check the following:

Ensure that the audio content complies with the terms of service and does not contain illegal or prohibited content.
Ensure that the audio file meets the technical requirements, including channels, sample rate, format, file size, and duration.
Ensure that the audio quality is clear and background noise is minimal.

Quota limits

The quota limit is 30 voices per account.
To create a new voice after reaching the limit, you must first delete an unused one.