Fine-tune a CosyVoice speech synthesis model on Alibaba Cloud Model Studio using SFT (efficient_sft). Fine-tuning jobs can only be submitted through the API (HTTP). The console doesn't support this feature yet.
Scope
CosyVoice fine-tuning trains a high-fidelity custom voice model from multiple recordings of the same speaker. Use fine-tuning when Voice cloning with a single audio sample doesn't meet your fidelity requirements and you have multiple recordings (or hours of audio) from the same speaker available.
Use cases
-
Brand or IP voice customization: You have hours of high-quality recordings from a single speaker — a brand spokesperson, virtual streamer, or IP character — and need fidelity that exceeds what single-sample voice cloning delivers.
-
Long-form voice production: Audiobooks, podcasts, documentary narration, or corporate training material that requires consistent timbre, tone, and rhythm across tens of hours of output — provided you already have sufficient recordings of the target speaker.
Training data must come from a single speaker: CosyVoice fine-tuning produces a single-voice model. Mixing recordings from multiple speakers in the training set degrades voice fidelity.
Specifications
-
Fine-tuning method: SFT efficient fine-tuning (efficient_sft). Other training methods such as CPT and DPO aren't supported.
-
Supported base model: cosyvoice-v3-flash.
-
Supported region: CosyVoice fine-tuning is available only in the China (Beijing) region.
Fine-tuning output
The fine-tuning output is a standalone deployed model, not a voice ID under the base model. It has a single fixed voice learned during training. Set voice to default when calling the model — other voice values aren't supported.
Limitations
The following capabilities are determined by the base model cosyvoice-v3-flash and can't be added, extended, or changed through fine-tuning:
-
Language support: Training data must be in a language already supported by the base model. Training with data in an unsupported language (for example, Bulgarian) doesn't enable the model to synthesize speech in that language.
-
Request-level control interfaces: Fine-tuned models support SSML (fine-grained control over speed, pitch, pauses, volume, and pronunciation within a request) and LaTeX (math formula reading). Instruction control isn't supported, so don't pass the instruction parameter.
-
Voice cloning and voice design: A fine-tuned model is a single-voice standalone model (locked to voice="default"). It doesn't provide voice cloning or voice design capabilities. To create new voice IDs, use the base model's Voice cloning or Voice Design features.
Do you need fine-tuning?
Fine-tuning involves dataset preparation, training time, model deployment, and additional costs — significantly more overhead than calling the base model directly.
For the following scenarios, use these built-in features instead of fine-tuning:
| Goal | Recommended feature | Description |
| --- | --- | --- |
| Clone a speaker's voice from a single audio sample | Voice cloning | Voice cloning is a built-in base model feature. Provide one audio sample to create a dedicated voice ID. |
| Generate a new voice from a text description (no recordings) | Voice design | Voice design generates a new voice ID directly from a text description. |
| Dynamically adjust speaking style per request (speed, pitch, pauses, emotion, volume) | Instruction control or Speech Synthesis Markup Language (SSML) | CosyVoice offers two request-level control mechanisms, neither requiring a new model. Instruction control uses natural language descriptions to adjust speed, emotion, and style (see Real-time speech synthesis). SSML embeds tags in the text for fine-grained control over speed, pitch, pauses, volume, and pronunciation, but not emotion or style (see SSML). |
| Enable a language the base model doesn't support | Not achievable with fine-tuning | Fine-tuning can't extend the base model's language support. See the Language support item in Limitations. |
Billing
CosyVoice fine-tuning involves two cost components: training costs (billed by token consumption) and deployment costs (billed by model unit usage duration).
Training costs
Training is billed at CNY 0.2 per 1,000 tokens.
Estimate the token consumption for a single job using this formula:
Here, lm_max_epoch and fm_max_epoch are the LM and FM training epochs set in the hyperparameters. The total training set duration is the combined length (in seconds) of all audio files in the training dataset. Increasing epoch count or training-set size linearly increases token consumption.
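As a quick sanity check on the per-token rate above, the following minimal Python sketch converts a token count into a training cost. The helper name training_cost_cny is hypothetical and not part of any SDK; only the rate (CNY 0.2 per 1,000 tokens) comes from this section.

```python
# Illustrative helper, not part of any SDK: convert consumed tokens to CNY.
PRICE_CNY_PER_1K_TOKENS = 0.2  # training rate stated in this section


def training_cost_cny(tokens: int) -> float:
    """Return the training cost in CNY for a given token count."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens / 1000 * PRICE_CNY_PER_1K_TOKENS


if __name__ == "__main__":
    # Token count taken from the benchmark run reported under Hyperparameters.
    print(f"CNY {training_cost_cny(99_406):.2f}")
```

Because cost is proportional to tokens, any change that scales token consumption (more epochs, more audio) scales the bill by the same factor.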
Deployment costs
Deployed fine-tuned models are billed by model unit usage duration:
Here,
Prerequisites
-
Familiarity with Introduction to Model Tuning — the basic concepts, workflow, and data format requirements for model fine-tuning.
-
An activated service and an API key. See Obtain an API key for details.
-
If you use an Alibaba Cloud sub-account (RAM user), grant the sub-account the permissions required for calling, training, and deployment.
All curl examples in this document use the macOS/Linux environment variable syntax ${DASHSCOPE_API_KEY}. For Windows CMD, replace with %DASHSCOPE_API_KEY%. For PowerShell, replace with $env:DASHSCOPE_API_KEY.
Workflow
Terminology: In this document, "fine-tuning" refers specifically to SFT efficient fine-tuning (efficient_sft) of the base model cosyvoice-v3-flash. "Fine-tuned model", "fine-tuning output", and "checkpoint" all refer to the model artifact produced after training. "Deployment instance" refers to a callable deployed_model created from that artifact.
-
Prepare the training dataset: Organize training audio files (.wav) and a sample manifest (data.jsonl) in the required directory structure, then package them as a .zip archive.
-
Upload the training file: Upload the .zip archive to Alibaba Cloud Model Studio and obtain a file ID (file_id).
-
Create a fine-tuning job: Start training with the file_id and obtain a job ID (job_id) and a fine-tuned model ID (finetuned_output).
-
Check job status and logs: Wait for training to complete (status becomes SUCCEEDED). Pull training logs if troubleshooting is needed.
-
Deploy and call the fine-tuned model: Deploy the successful model ID as a callable service, then use it the same way as the base model.
For real-world timing and cost benchmarks (minimal vs. recommended hyperparameters), see the comparison table at the end of Hyperparameters.
Prepare the training dataset
Language constraints: The following constraints apply to all specifications in this section. The tables below don't repeat these details:
-
Audio language must be one already supported by the base model cosyvoice-v3-flash.
-
Text language (the text field in data.jsonl) must match the language of its corresponding audio. For single samples containing mixed-language speech, see the Mixed languages allowed item below.
-
Not extensible: Languages not supported by the base model can't be added through fine-tuning. For example, training with Bulgarian audio won't enable Bulgarian speech synthesis.
-
Mixed languages allowed: Different samples in the same training set can be in different languages (for example, some in Chinese, some in English) without affecting single-language synthesis quality. When a single sample contains mixed-language speech, the text field should transcribe the text exactly as spoken.
-
Supplementary language phoneme coverage: When the training set is primarily one language with a small amount of another (for example, mainly Chinese with some English), the supplementary language samples should cover as many phonemes as possible (for American English, this means covering all 20 vowels and 24 consonants) to ensure voice fidelity in the supplementary language.
Audio specifications
| Item | Requirement |
| --- | --- |
| Audio format | .wav |
| Sample rate | 16 kHz or higher. |
| Duration per clip | 2–30 seconds recommended; minimum 1 second. |
| Language | See "Language constraints" at the beginning of this section. |
| Text language | See "Language constraints" at the beginning of this section. |
| Recording environment and content | Record in a studio or low-noise environment to minimize background noise. Paralinguistic content in the training audio (laughter, sighs, coughs, breath sounds, and long pauses) is learned during fine-tuning. The model reproduces similar sounds in appropriate contexts during inference. To produce cleaner output, remove this content during data preparation. |
-
Training data characteristics directly shape the fine-tuned model: Fine-tuning fits the acoustic features of the training dataset. Statistical properties such as speaking speed, emotional tendency, and pause rhythm in the training audio are reflected in the model's default synthesis style. For example, if training data consists mostly of slow, calm readings, the model's default output will also tend toward slower speed and calmer tone. Choose training audio that matches your target synthesis scenario.
-
Watch for mispronunciations in training audio: If training data contains non-standard pronunciations, the fine-tuned model may reproduce them. The more frequently a mispronunciation appears, the more likely it is to be learned. Review audio pronunciations for accuracy during the labeling stage.
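The sample-rate and duration requirements in the table above can be checked before upload. The following is a minimal sketch using Python's standard wave module; the function name check_clip and the message wording are hypothetical, and the sketch assumes uncompressed PCM .wav files, which is the only kind the wave module reads.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000        # "16 kHz or higher"
MIN_DURATION_S = 1.0               # hard minimum per clip
RECOMMENDED_RANGE_S = (2.0, 30.0)  # recommended clip length


def check_clip(path: str) -> list[str]:
    """Return a list of human-readable problems found in one .wav file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below 16 kHz")
    if duration < MIN_DURATION_S:
        problems.append(f"duration {duration:.2f} s is below the 1 s minimum")
    elif not RECOMMENDED_RANGE_S[0] <= duration <= RECOMMENDED_RANGE_S[1]:
        problems.append(f"duration {duration:.2f} s is outside the recommended 2-30 s")
    return problems
```

Running check_clip over every file in train/ and reviewing the non-empty results is one way to catch out-of-spec clips early. It does not assess background noise or pronunciation, which still need manual review.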
Directory structure
Organize the training samples in the following directory structure, then package the entire directory as a .zip archive for upload:
user_data/
├── data.jsonl # Training sample manifest (required)
└── train/ # Training audio directory (all training .wav files go here)
├── 100001.wav
├── 100002.wav
└── ...
data.jsonl format
The data.jsonl file contains one training sample per line:
{"wav_fn": "train/100001.wav", "text": "Hello."}
{"wav_fn": "train/100002.wav", "text": "I see."}
Field descriptions
| Field | Required | Description |
| --- | --- | --- |
| wav_fn | Yes | Relative path to the training audio file. Must start with train/ (matching the directory structure), for example, train/100001.wav. |
| text | Yes | The corresponding transcript. Required in the current version; automatic ASR backfill isn't available. |
Text formatting guidelines:
-
No normalization needed: Numbers, mixed English/Chinese text, and punctuation don't require special normalization. Write them as naturally spoken in the audio.
-
Remove special characters: Remove non-pronounced special characters (such as emoji, decorative symbols, and control characters) from the text. Keep only content that corresponds to the actual audio pronunciation.
-
No markup languages: The text field must be plain text, with no SSML tags, LaTeX formulas, instruction control statements, or emotion annotations. The fine-tuning stage doesn't parse markup; any markup included will be incorrectly learned as a character sequence in the text-to-speech mapping. Use request-level markup (SSML/LaTeX) when calling the fine-tuned model. For the fine-tuned model's supported request-level control interfaces, see the Request-level control interfaces item in Limitations.
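The manifest rules above (one JSON object per line, a wav_fn under train/, a non-empty plain-text transcript) lend themselves to an automated pre-upload check. The sketch below is one possible approach, not an official validator: the function name validate_manifest is hypothetical, the tag regex is a crude heuristic for SSML-style markup, and skipping blank lines is an assumption, since the platform's handling of them isn't documented here.

```python
import json
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # crude heuristic for SSML-style tags


def validate_manifest(lines):
    """Yield (line_number, message) for each problem found in data.jsonl lines."""
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # assumption: blank lines are ignored by this check
        try:
            sample = json.loads(raw)
        except json.JSONDecodeError as exc:
            yield number, f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(sample, dict):
            yield number, "each line must be a JSON object"
            continue

        wav_fn = sample.get("wav_fn")
        text = sample.get("text")
        if not (isinstance(wav_fn, str)
                and wav_fn.startswith("train/")
                and wav_fn.endswith(".wav")):
            yield number, "wav_fn must be a .wav path starting with train/"
        if not isinstance(text, str) or not text.strip():
            yield number, "text is required"
        elif TAG_PATTERN.search(text):
            yield number, "text looks like it contains markup tags"
```

A typical call would be list(validate_manifest(open("user_data/data.jsonl", encoding="utf-8"))), where an empty list means no problems were detected by these particular checks.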
Data volume recommendations
-
Recommended volume: A total training audio duration of 1–10 hours generally produces good results. Beyond 10 hours, returns diminish.
-
Sample count: At least 150 training audio clips are recommended for stable fine-tuning results. Neither the algorithm nor the platform enforces an upper limit on the number of samples in data.jsonl, so add as many as your target total duration requires.
Total training data duration directly affects token consumption and training time. See Billing for the billing formula.
Upload the training file
Upload the packaged train_data.zip to Alibaba Cloud Model Studio via multipart/form-data. Key fields: files (local zip path), purpose (set to fine-tune), and descriptions (optional).
curl --location 'https://dashscope.aliyuncs.com/api/v1/files' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--form 'files=@"train_data.zip"' \
--form 'purpose="fine-tune"' \
--form 'descriptions="Training voice package"'
Save the file_id from the response — it's the unique identifier for your uploaded dataset and is required in the next step to create a fine-tuning job. For complete request/response field descriptions, see Upload a file.
Create a fine-tuning job
Start a training job using the file_id obtained in the previous step.
The platform runs only one training job at a time. If a job is running when you submit a new one, the new job enters a QUEUING state until the running job completes. Account for queue wait time when planning your schedule.
The following example uses minimal hyperparameters (lm_max_epoch=4, fm_max_epoch=4, etc.) intended only for quick validation — don't use these values in production. For production fine-tuning, use the recommended values in Hyperparameters (lm_max_epoch=60, fm_max_epoch=100, etc.). Training with recommended hyperparameters costs approximately 20x the token consumption and wall time of this minimal example.
curl --location --request POST 'https://dashscope.aliyuncs.com/api/v1/fine-tunes' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--header 'Content-Type: application/json' \
--data '{
"model": "cosyvoice-v3-flash",
"training_file_ids": [
"<replace with the file_id of the training dataset>"
],
"hyper_parameters": {
"lm_max_epoch": 4,
"lm_step": 1,
"lm_num": 2,
"fm_max_epoch": 4,
"fm_step": 2,
"fm_num": 2,
"lm_batch_size": 1000,
"fm_batch_size": 2000
},
"training_type": "efficient_sft"
}'
Key request fields: model is fixed to cosyvoice-v3-flash, training_type is fixed to efficient_sft, training_file_ids accepts only one training file ID, and all 8 LM/FM subfields in hyper_parameters are required. For value ranges, types, and parameter formats, see Create a tuning job.
After a successful request, save two key fields from the response: output.job_id (the job ID, used to check job status and retrieve logs) and output.finetuned_output (the fine-tuned model ID, used to deploy the model after training completes). The initial status is PENDING and changes as training progresses. For complete response field descriptions, see Create a tuning job.
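When the request body is generated by a script, it can help to enforce the fixed fields and the eight required hyperparameters before sending anything. The following minimal sketch builds the same JSON body as the curl example; the helper name build_job_payload is hypothetical, and the sketch only checks that the keys are present, not their value ranges.

```python
import json

REQUIRED_HYPER_PARAMETERS = (
    "lm_max_epoch", "lm_step", "lm_num", "lm_batch_size",
    "fm_max_epoch", "fm_step", "fm_num", "fm_batch_size",
)


def build_job_payload(file_id: str, hyper_parameters: dict) -> dict:
    """Build the request body for creating a CosyVoice fine-tuning job."""
    missing = [k for k in REQUIRED_HYPER_PARAMETERS if k not in hyper_parameters]
    if missing:
        raise ValueError(f"missing hyperparameters: {', '.join(missing)}")
    return {
        "model": "cosyvoice-v3-flash",       # fixed base model
        "training_file_ids": [file_id],      # exactly one training file ID
        "hyper_parameters": dict(hyper_parameters),
        "training_type": "efficient_sft",    # fixed training type
    }


if __name__ == "__main__":
    # Minimal validation values from the curl example above.
    minimal = {
        "lm_max_epoch": 4, "lm_step": 1, "lm_num": 2, "lm_batch_size": 1000,
        "fm_max_epoch": 4, "fm_step": 2, "fm_num": 2, "fm_batch_size": 2000,
    }
    print(json.dumps(build_job_payload("<file_id>", minimal), indent=2))
```

The printed JSON can be passed to curl with --data or sent with any HTTP client.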
Hyperparameters
CosyVoice fine-tuning trains two sub-networks: LM (Language Model) and FM (Flow Matching). LM is an autoregressive model that converts text to discrete speech tokens and has the most impact on prosody. FM is a flow-matching model that converts speech tokens to Mel spectrograms and has the most impact on voice fidelity. The two networks are decoupled and can be tuned independently using lm_* and fm_* hyperparameter prefixes. Hyperparameters control training epochs and checkpoint-saving frequency, directly affecting training time, token consumption, and model quality.
Start with the following recommended values for your first full run, then adjust based on results:
-
LM network: lm_max_epoch=60, lm_step=5, lm_num=3, lm_batch_size=1000.
-
FM network: fm_max_epoch=100, fm_step=10, fm_num=3, fm_batch_size=2000.
Here, *_max_epoch is the total number of training epochs, *_step is the checkpoint-saving interval (in epochs), *_num is the maximum number of checkpoints to retain, and *_batch_size is the training batch size. For complete value ranges, see CosyVoice speech synthesis model hyper_parameters.
Increasing lm_max_epoch or fm_max_epoch linearly increases token consumption and training time (see Training costs). Additionally, higher epoch counts increase "forgetting" of the base model's original capabilities, potentially degrading long-text stability, polyphone accuracy, and other qualities. The recommended values (lm_max_epoch=60, fm_max_epoch=100) balance voice fidelity with base capability retention.
Benchmark data: The following table shows a real-world sample run to help you estimate time and cost. Actual values vary with data volume, hyperparameters, and queue wait time.
|
Item |
Measured value |
|
Training samples |
99 audio clips (dataset zip approximately 37 MB) |
|
Hyperparameters |
|
|
Total training time |
Approximately 37 minutes (from |
|
Tokens consumed |
Approximately 99,406 tokens |
|
Estimated cost |
Approximately CNY 19.88 (at CNY 0.2 per 1,000 tokens) |
These figures are based on minimal hyperparameters. With recommended hyperparameters, token consumption, cost, and training time are approximately 20x higher. Don't use this table to estimate production costs.
Checkpoints
A single fine-tuning job can produce multiple checkpoints (candidate models; view them through the List checkpoints API). The exact count and ordering are determined by lm_num, fm_num, and *_step in the hyperparameters.
-
Select models: Starting from the maximum epoch for each network, step backwards by *_step, selecting *_num checkpoints total.
-
Combine: Take the Cartesian product of the LM and FM selections. Total candidate checkpoints = lm_num × fm_num.
-
Sort: Order by LM epoch × FM epoch descending (a higher product means both networks trained more thoroughly, so it ranks first).
-
Truncate: From the descending sorted result, retain at most 10 checkpoints. If fewer than 10 candidates exist, output all of them.
-
Naming: checkpoint-{LM epoch zero-padded to 4 digits}{FM epoch zero-padded to 4 digits}. For example, LM 4 / FM 4 produces checkpoint-00040004.
Example: With parameters lm_max_epoch=4, lm_step=1, lm_num=2, fm_max_epoch=4, fm_step=2, fm_num=2:
-
LM selects: 4, 3 (starting from 4, stepping back by lm_step=1, taking lm_num=2).
-
FM selects: 4, 2 (starting from 4, stepping back by fm_step=2, taking fm_num=2).
-
Cartesian product yields 4 candidates, sorted descending by product:
| Rank | LM / FM epoch | Product | Checkpoint name |
| --- | --- | --- | --- |
| 1 | LM 4 / FM 4 | 16 | checkpoint-00040004 |
| 2 | LM 3 / FM 4 | 12 | checkpoint-00030004 |
| 3 | LM 4 / FM 2 | 8 | checkpoint-00040002 |
| 4 | LM 3 / FM 2 | 6 | checkpoint-00030002 |
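The selection, combination, sorting, truncation, and naming rules described in this section can be expressed as a short Python sketch. It is an illustration of the documented rules rather than the platform's actual implementation: the function names are hypothetical, dropping epochs below 1 is an assumption, and the order of candidates with equal products is not specified in this document (the sketch keeps the Cartesian-product order for ties).

```python
from itertools import product


def select_epochs(max_epoch: int, step: int, num: int) -> list[int]:
    """Step back from max_epoch by `step`, keeping up to `num` epochs."""
    epochs = [max_epoch - i * step for i in range(num)]
    return [e for e in epochs if e > 0]  # assumption: epochs below 1 are dropped


def list_checkpoints(lm_max_epoch, lm_step, lm_num,
                     fm_max_epoch, fm_step, fm_num, limit=10):
    """Return checkpoint names ranked by LM epoch x FM epoch, descending."""
    lm_epochs = select_epochs(lm_max_epoch, lm_step, lm_num)
    fm_epochs = select_epochs(fm_max_epoch, fm_step, fm_num)
    pairs = sorted(product(lm_epochs, fm_epochs),
                   key=lambda pair: pair[0] * pair[1], reverse=True)
    return [f"checkpoint-{lm:04d}{fm:04d}" for lm, fm in pairs[:limit]]


if __name__ == "__main__":
    # Hyperparameters from the worked example in this section.
    for name in list_checkpoints(4, 1, 2, 4, 2, 2):
        print(name)
```

With the recommended hyperparameters (lm_max_epoch=60, lm_step=5, lm_num=3, fm_max_epoch=100, fm_step=10, fm_num=3), the same rules give 9 candidates, so nothing is truncated.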
Check job status and logs
Training typically takes tens of minutes to several hours depending on the hyperparameters and dataset size. Two APIs are available for tracking progress: query job details to check the current stage, and get training logs to troubleshoot issues or confirm progress.
Query job details
Use the job_id returned when creating the job to check its status.
curl --location 'https://dashscope.aliyuncs.com/api/v1/fine-tunes/<job_id>' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--header 'Content-Type: application/json'
The output.status field in the response indicates the current stage: PENDING (waiting to start) → QUEUING (queued, because only one job runs at a time) → RUNNING (training in progress) → SUCCEEDED (training complete). Abnormal termination states include FAILED (training failed), CANCELING (cancellation in progress), and CANCELED (canceled). For full status field definitions, see Get fine-tuning job details.
Once status becomes SUCCEEDED, use the finetuned_output saved earlier to proceed to deploying the model.
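For automation, the status sequence above can be wrapped in a simple polling loop. The sketch below is deliberately transport-agnostic: fetch_status is a hypothetical callable that you supply, which performs the job-details request and returns the output.status string. The poll interval and timeout defaults are illustrative choices, not platform recommendations.

```python
import time

# States after which the job will not change again, per the list above.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def wait_for_job(fetch_status, poll_seconds=60.0,
                 timeout_seconds=6 * 3600, sleep=time.sleep):
    """Poll fetch_status() until a terminal state is reached and return it."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited:.0f} s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Treating CANCELING as non-terminal means the loop keeps polling until the job reaches CANCELED. Remember that QUEUING time counts toward the timeout, since only one job runs at a time.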
Get training logs
If a job stays in one state for an extended period or enters the FAILED state, pull the training logs to help diagnose the issue.
curl --location 'https://dashscope.aliyuncs.com/api/v1/fine-tunes/<job_id>/logs?offset=0&line=1000' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--header 'Content-Type: application/json'
Use offset to control the starting position and line to control the maximum number of lines returned (this example fetches 1,000 lines per request). For response field descriptions, see Query fine-tuning logs.
Deploy the model
Once the job status becomes SUCCEEDED, the fine-tuning output finetuned_output (the unique ID of the fine-tuned model, corresponding to the top-ranked checkpoint in Checkpoints) is ready for deployment.
Deployment templates
A model unit (MU) is the smallest unit of inference compute on the Alibaba Cloud Model Studio platform. Different model unit types correspond to different compute capacities and unit prices. Model units per replica indicates how many model units a single replica occupies. Deployment costs are directly determined by the total number of model units in use.
Deployment templates are preset deployment specifications (including a model unit type and model units per replica) that define the compute resources for a single replica. Each template defines a different combination of inference performance and unit cost. In the table below, V/II in the Model unit type column are Roman numeral identifiers for model unit specifications, corresponding to model_unit_spec field values of MU5/MU2 in API responses. CosyVoice fine-tuned models currently offer two deployment templates:
| Template name | Model unit type | Model units per replica | Use case |
| --- | --- | --- | --- |
| Single-machine deployment | Type V model unit | 1 | Cost-effective deployment that balances inference performance with unit cost. Suitable for budget-sensitive, low-to-moderate load scenarios that still require stable service. |
| Single-machine deployment - flagship complex inference | Type II model unit | 8 | For high-complexity workloads (such as very large models with long contexts). Each replica provides stronger inference capabilities. |
Billing: Deployment costs are directly tied to the total model unit count (Total Model Unit), calculated as: Total Model Unit = Model units per replica × Deployed Replicas.
Choosing a different template or setting a different replica count directly changes the billing amount. Evaluate based on your workload and budget before deploying. For the complete cost formula and unit prices per model unit type, see Deployment costs.
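The relationship between replicas and total model units, as implied by the capacity examples in this document (for instance, 2 replicas at 8 model units each give a total of 16), can be sketched in Python as follows. Both function names are hypothetical helpers for illustration only.

```python
def total_model_units(units_per_replica: int, replicas: int) -> int:
    """Total model units billed for a deployment."""
    if units_per_replica < 1 or replicas < 1:
        raise ValueError("units_per_replica and replicas must both be >= 1")
    return units_per_replica * replicas


def replicas_for_capacity(capacity: int, units_per_replica: int) -> int:
    """Inverse check: how many replicas a given capacity value represents."""
    if units_per_replica < 1:
        raise ValueError("units_per_replica must be >= 1")
    if capacity < units_per_replica or capacity % units_per_replica != 0:
        raise ValueError(
            f"capacity {capacity} is not a positive multiple of {units_per_replica}")
    return capacity // units_per_replica


if __name__ == "__main__":
    print(total_model_units(units_per_replica=8, replicas=2))      # flagship template
    print(replicas_for_capacity(capacity=4, units_per_replica=1))  # single-machine template
```

Running the inverse check before submitting a deployment or scaling request catches values such as capacity=1 on the 8-unit template, which is not a multiple of 8.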
Deployment methods: After selecting a deployment template, complete the deployment using either of the following methods based on your use case:
Method 1: Deploy through the console (recommended for daily use)
Entry point: Go to the My Models page in the Alibaba Cloud Model Studio console, find the successfully fine-tuned model, and submit it for deployment. For complete steps, see Model deployment.
Key parameters: Only two selections are required — the template and replica values automatically determine all other fields:
-
Deployment Template: Choose between Single-machine deployment and Single-machine deployment - flagship complex inference. See Deployment templates for the differences.
-
Deployed Replicas: Number of instances. Set based on your concurrency and throughput requirements (integer >= 1).
Post-deployment status: The console displays the deployment instance ID and status transitions (PENDING → DEPLOYING → RUNNING). The model is ready for calls once status reaches RUNNING.
Method 2: Deploy through the API (recommended for automation)
API deployment requires calling three endpoints in sequence. For complete field constraints, see Model deployment.
-
Query deployable models (GET /api/v1/deployments/models): Retrieve the model_name, template_id, and capacity_unit_per_instance needed for creating a deployment:
curl 'https://dashscope.aliyuncs.com/api/v1/deployments/models?page_no=1&page_size=100&version=v1.0&model_source=custom' \
--header "Authorization: Bearer ${DASHSCOPE_API_KEY}" \
--header 'Content-Type: application/json'
Sample response (key fields only):
{
  "output": {
    "models": [
      {
        "model_name": "cosyvoice-v3-flash-ft-202605271743-dd2a",
        "plans": [
          {
            "plan": "mu",
            "templates": [
              {
                "template_name": "Single-machine deployment",
                "template_id": "dps-20260521172224-1vabse",
                "template_desc": "Cost-effective deployment plan, ...",
                "roles": {
                  "unified": {
                    "model_unit_spec": "MU5",
                    "capacity_unit_per_instance": 1
                  }
                }
              },
              {
                "template_name": "Single-machine deployment - flagship complex inference",
                "template_id": "MU2",
                "template_desc": "High-complexity workloads, ultra-large models with long context.",
                "roles": {
                  "unified": {
                    "model_unit_spec": "MU2",
                    "capacity_unit_per_instance": 8
                  }
                }
              }
            ]
          }
        ]
      }
    ]
  }
}
Locate your model: The output.models array in the response lists all deployable models. Find the entry whose model_name matches the finetuned_output returned by Create a fine-tuning job. Its plans[].templates field contains the available deployment templates for that model.
-
Create a deployment (POST /api/v1/deployments): Submit a deployment request using the parameters from the previous step. The following example uses the Single-machine deployment template with 1 replica:
curl --location 'https://dashscope.aliyuncs.com/api/v1/deployments' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--header 'Content-Type: application/json' \
--data '{
  "model_name": "<MODEL_NAME>",
  "plan": "mu",
  "deploy_spec": "<TEMPLATE_ID>",
  "capacity": 1,
  "billing_method": "POST_PAY"
}'
Response: The output.deployed_model field in the response is the deployment instance ID. Pass this value as the model parameter when calling the model.
Key parameters (replace placeholders <MODEL_NAME> and <TEMPLATE_ID> with values from the previous step's response):
  -
  model_name: The model_name from the entry matching your finetuned_output in the previous response.
  -
  plan: Fixed to "mu" (billed by model unit usage duration).
  -
  deploy_spec: The template_id of your chosen template from the previous response. For example, Single-machine deployment is currently dps-20260521172224-1vabse, and Single-machine deployment - flagship complex inference is MU2. Always use the real-time values from the previous step; don't hard-code these.
  -
  capacity: The total number of model units for this deployment, directly tied to billing. Must be an integer multiple of capacity_unit_per_instance from the previous step. Examples:
    -
    Single-machine deployment (capacity_unit_per_instance = 1): capacity=1 → 1 replica; capacity=4 → 4 replicas.
    -
    Single-machine deployment - flagship complex inference (capacity_unit_per_instance = 8): capacity=8 → 1 replica; capacity=16 → 2 replicas; capacity=1 is rejected (not a multiple of 8).
  -
  billing_method: Currently supports "POST_PAY" (pay-as-you-go).
API field to console mapping:
| API field | Console equivalent |
| --- | --- |
| model_name | The fine-tuned model selected on the My Models page |
| deploy_spec | The selection in Deployment Template: Single-machine deployment or Single-machine deployment - flagship complex inference |
| capacity | Total Model Unit (the console calculates and displays this automatically from the selected template and replica count) |
plan and billing_method have no console equivalents (console deployments default to pay-as-you-go billing by model unit usage duration).
Important: capacity is not the replica count. It's the total number of model units. Verify against the examples above before submitting.
-
Check deployment status (GET /api/v1/deployments/<deployed_model>): The deployment progresses through PENDING → DEPLOYING → RUNNING. Poll the status using:
curl --location 'https://dashscope.aliyuncs.com/api/v1/deployments/<replace with deployed_model field value>' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--header 'Content-Type: application/json'
When status is RUNNING, the model is ready for calls. For additional deployment operations (scaling, decommissioning, etc.), see Model deployment.
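The first two API steps can be chained in code: look up the templates for your fine-tuned model in the deployable-models response, then build the create-deployment body from the chosen template. The sketch below works on an already-parsed response dict shaped like the sample response shown for the query step; the function names and the units_per_replica key are hypothetical, and no HTTP call is made here.

```python
def find_templates(response: dict, model_name: str) -> list[dict]:
    """Return the deployment templates listed for `model_name`."""
    for model in response.get("output", {}).get("models", []):
        if model.get("model_name") != model_name:
            continue
        templates = []
        for plan in model.get("plans", []):
            for tpl in plan.get("templates", []):
                role = tpl.get("roles", {}).get("unified", {})
                templates.append({
                    "plan": plan.get("plan"),
                    "template_name": tpl.get("template_name"),
                    "template_id": tpl.get("template_id"),
                    "units_per_replica": role.get("capacity_unit_per_instance"),
                })
        return templates
    raise LookupError(f"{model_name} is not in the deployable models list")


def build_deployment_payload(model_name: str, template: dict, replicas: int = 1) -> dict:
    """Build the create-deployment body; capacity is total model units."""
    return {
        "model_name": model_name,
        "plan": template["plan"],
        "deploy_spec": template["template_id"],  # read from the response, not hard-coded
        "capacity": template["units_per_replica"] * replicas,
        "billing_method": "POST_PAY",
    }
```

Deriving capacity from the replica count in one place keeps it a valid multiple of capacity_unit_per_instance by construction.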
Call the model
Once the deployment status is RUNNING, the fine-tuned model is ready for production calls. The calling method (endpoint, request body fields, typical response, and audio retrieval) is the same as other CosyVoice models — see Speech synthesis for the full guide. Compared to calling the base model cosyvoice-v3-flash, only two request parameters differ — all other fields (text, format, sample_rate, etc.) remain the same:
-
model: Set to the deployment instance ID, which is the output.deployed_model value from the deployment response.
-
voice: Must be "default", representing the dedicated voice learned from training data. Passing a preset voice name or voice-cloning ID will cause the request to fail.
Full examples follow. Choose non-real-time (HTTP) or real-time (WebSocket) based on your use case:
Non-real-time (HTTP)
Synthesizes the complete text in a single request. The response contains a URL to the synthesized audio, valid for 24 hours.
curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "<replace with the deployed_model from the deployment response>",
"input": {
"text": "There is a large garden behind my house.",
"voice": "default",
"format": "wav",
"sample_rate": 24000
}
}'
Real-time (Python WebSocket)
Uses the DashScope Python SDK to connect to the WebSocket synthesis endpoint. Requires the dashscope package. The WebSocket interface supports streaming callbacks (synthesize and push audio simultaneously). For quick validation, this example uses the synchronous call synthesizer.call(text) to return the complete audio at once. For a full streaming callback implementation (synthesize while playing), see the streaming synthesis example in the user guide.
# coding=utf-8
import os

import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer

# Read the API key from the environment instead of hard-coding it
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'

# Replace with the deployed_model from the deployment response; voice is fixed to "default"
model = "<replace with the deployed_model from the deployment response>"
voice = "default"

synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Synchronous call: blocks until synthesis finishes and returns the complete audio
audio = synthesizer.call("There is a large garden behind my house.")
print('[Metric] requestId: {}, first-package latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Write the returned audio bytes to a local file
with open('output.mp3', 'wb') as f:
    f.write(audio)
Observability and troubleshooting
Log every call to support investigation of anomalous requests and performance issues. Capture these three categories of information:
-
Request ID (request_id): Retrieved via synthesizer.get_last_request_id() in the WebSocket SDK (see the Real-time (Python WebSocket) example above). For HTTP calls, see Speech synthesis for the response field location. Always retain this ID when troubleshooting.
-
First-packet latency: Retrieved via synthesizer.get_first_package_delay() in the WebSocket SDK. This is the time (in milliseconds) until the first audio packet arrives, and it is the core metric for real-time synthesis experience. The Real-time (Python WebSocket) example above prints this value.
-
Error code and message: The code and message fields in error responses distinguish between authentication failures, quota exhaustion, model not ready, text length exceeded, and other failure types. Set up alerts and troubleshooting paths per error code.
Recommended fields to log (minimum set for tracing):
|
Metric |
HTTP response field |
WebSocket SDK method |
Suggested log key |
|
Request ID |
|
|
|
|
First-packet latency |
Not applicable (HTTP returns complete audio) |
|
|
|
Error code and message |
|
|
|
Correlate the request_id with your application's request ID in logs, and retain records for at least your SLA retention period.
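One way to capture these fields is a small structured-logging helper that emits one JSON record per call. The sketch below uses only the Python standard library; the function name log_tts_call and every log key in it (tts_request_id, app_request_id, first_packet_ms, error_code, error_message) are hypothetical choices for illustration, not names prescribed by the platform.

```python
import json
import logging
import time

logger = logging.getLogger("tts_calls")


def log_tts_call(request_id, app_request_id=None, first_packet_ms=None,
                 error_code=None, error_message=None):
    """Log one synthesis call as a JSON line and return the logged record."""
    record = {
        "ts": time.time(),
        "tts_request_id": request_id,        # service-side request ID
        "app_request_id": app_request_id,    # your application's own request ID
        "first_packet_ms": first_packet_ms,  # WebSocket calls only
        "error_code": error_code,
        "error_message": error_message,
    }
    # Drop fields that don't apply to this call (for example, latency on HTTP).
    record = {key: value for key, value in record.items() if value is not None}
    level = logging.ERROR if error_code else logging.INFO
    logger.log(level, json.dumps(record, ensure_ascii=False))
    return record
```

With the WebSocket SDK, the first two metrics would come from synthesizer.get_last_request_id() and synthesizer.get_first_package_delay() after each call.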
Apply in production
A RUNNING status doesn't mean the deployment is ready for production. Before routing traffic, address the following four fine-tuning-specific practices to improve launch quality, control costs, and reduce maintenance risk. For call-level observability and logging, see Observability and troubleshooting.
Validate base capabilities before going live: Non-voice capabilities of the fine-tuned model (polyphone and proper-noun pronunciation, long-text stability, etc.) are inherited from cosyvoice-v3-flash, but may be affected by training data quality and epoch count (see Audio specifications and Hyperparameters). Sample-test in your target use case before scaling up traffic.
Choose a better checkpoint
The default finetuned_output deployed earlier corresponds only to the top-ranked checkpoint from Checkpoints, which isn't necessarily the one that sounds best to human ears. Before going live, manually evaluate the top 2–3 checkpoints and select the one closest to your target voice.
Steps to switch to a different checkpoint:
-
Call GET /api/v1/fine-tunes/{job_id}/checkpoints to list all checkpoints for the job. Get the model ID for each from the output[*].model_name field (only returned when status=SUCCEEDED). For full field descriptions, see List checkpoints for a fine-tuning job.
-
Use the selected model_name as the input for the "Create a deployment" step in Method 2: Deploy through the API (recommended for automation) to get a separate deployed_model. Each checkpoint requires its own independent deployment instance; you can't swap the underlying model on an existing deployment.
-
Running multiple evaluation deployments simultaneously incurs billing for each deployment's model units. Refer to Deployment costs to estimate concurrent evaluation costs, and decommission rejected instances promptly after comparison (see Decommission unused deployments).
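The checkpoint listing step can be sketched in Python as follows. This is a hedged illustration that only builds the request and parses an already-decoded response. The base URL is left as a parameter because it isn't stated here, the `Authorization: Bearer` header is an assumption about how the API key is passed, and treating `output` as a list in ranked order is inferred from the `output[*].model_name` notation. The helper names are hypothetical.

```python
import urllib.request


def build_list_checkpoints_request(base_url: str, api_key: str,
                                   job_id: str) -> urllib.request.Request:
    """Build (but do not send) the GET request that lists a job's checkpoints."""
    url = f"{base_url.rstrip('/')}/api/v1/fine-tunes/{job_id}/checkpoints"
    # Assumption: the API key is sent as a Bearer token.
    return urllib.request.Request(
        url, method="GET", headers={"Authorization": f"Bearer {api_key}"}
    )


def candidate_model_names(response_body: dict, top_n: int = 3) -> list:
    """Return up to top_n model_name values, preserving response order.

    Entries without model_name are skipped, since the field is only
    returned when status=SUCCEEDED.
    """
    items = response_body.get("output") or []
    names = [
        item["model_name"]
        for item in items
        if isinstance(item, dict) and item.get("model_name")
    ]
    return names[:top_n]
```

Each returned name would then be deployed separately for listening tests, since one deployment serves exactly one checkpoint.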
Decommission unused deployments
Decommission deployment instances immediately once they no longer serve traffic. Common scenarios: instances rejected after multi-checkpoint evaluation, old instances replaced by a new version, instances used only for load testing or validation, and instances from failed deployments that need to be recreated.
How to decommission: Call DELETE /api/v1/deployments/{deployed_model}. When the response shows output.status as DELETING, the request has been accepted. For full field descriptions, see Delete a deployment.
Deployment costs are billed by model unit usage duration: billing continues as long as the deployment is running, even with zero call traffic. The multi-checkpoint evaluation period and version-switching period are especially prone to accumulating idle instances. Decommission immediately once evaluation or switching is complete. For the billing formula and start/end times, see Deployment costs.
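A decommission call and its acceptance check might look like the sketch below. As before, the base URL is a parameter, the Bearer-token header is an assumption, and the helper names are hypothetical. Only the path and the `output.status == DELETING` check come from the description above.

```python
import urllib.parse
import urllib.request


def build_delete_request(base_url: str, api_key: str,
                         deployed_model: str) -> urllib.request.Request:
    """Build (but do not send) the DELETE request for one deployment."""
    # Percent-encode the identifier so unusual characters cannot alter the path.
    name = urllib.parse.quote(deployed_model, safe="")
    url = f"{base_url.rstrip('/')}/api/v1/deployments/{name}"
    # Assumption: the API key is sent as a Bearer token.
    return urllib.request.Request(
        url, method="DELETE", headers={"Authorization": f"Bearer {api_key}"}
    )


def delete_accepted(response_body: dict) -> bool:
    """True when the decoded response reports output.status as DELETING."""
    output = response_body.get("output") or {}
    return output.get("status") == "DELETING"
```

Because billing continues while a deployment runs, it is worth alerting when `delete_accepted` returns False rather than assuming the instance is gone.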
Plan replica count and scale
Adjust the replica count when business concurrency increases, overall throughput is insufficient, or you need elastic scaling for peak events. Different deployment templates (Single-machine deployment and Single-machine deployment - flagship complex inference) offer different per-replica compute and suit different scenarios — see Deployment templates. Determine replica count based on your concurrency requirements.
How to scale: Call PUT /api/v1/deployments/{deployed_model}/scale with the new capacity in the request body. Note that capacity is the total model unit count (not the replica count), and must be an integer multiple of that template's capacity_unit_per_instance. See Examples for value examples. For full field descriptions, see Scale a deployment.
Replica count linearly affects billing. Estimate cost changes using the formula in Deployment costs before scaling up.
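The relationship between replica count and capacity can be checked before calling the scale endpoint. The sketch below assumes that one replica consumes exactly `capacity_unit_per_instance` model units, and that the request body is a JSON object with a single `capacity` key. Both are inferences from the description above, and the function names are hypothetical.

```python
def capacity_for_replicas(replicas: int, capacity_unit_per_instance: int) -> int:
    """Convert a desired replica count into the total model unit count."""
    if replicas < 1 or capacity_unit_per_instance < 1:
        raise ValueError("replicas and capacity_unit_per_instance must be positive")
    return replicas * capacity_unit_per_instance


def scale_request_body(capacity: int, capacity_unit_per_instance: int) -> dict:
    """Validate capacity and return the body for the scale request.

    Raises ValueError when capacity is not a positive integer multiple of
    the template's capacity_unit_per_instance.
    """
    if capacity_unit_per_instance < 1:
        raise ValueError("capacity_unit_per_instance must be positive")
    if capacity < 1 or capacity % capacity_unit_per_instance != 0:
        raise ValueError(
            f"capacity {capacity} is not a positive multiple of "
            f"{capacity_unit_per_instance}"
        )
    return {"capacity": capacity}
```

For example, with a hypothetical template whose capacity_unit_per_instance is 2, three replicas correspond to a capacity of 6, while a capacity of 5 is rejected locally before any request is sent.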
Switch versions after re-training
When adding new data or adjusting hyperparameters, always re-train from the base model cosyvoice-v3-flash — don't train incrementally on an existing fine-tuned model. Re-training produces a new finetuned_output: a different model, not a new version of the same model. To use it, create a new deployment instance — you can't swap the underlying model on an existing deployed_model.
Recommended switching workflow:
-
Deploy the new model model_name following Method 2: Deploy through the API (recommended for automation) to create a separate deployment with a new deployed_model. Don't scale or modify settings on the old instance; neither operation changes its underlying model.
-
Route canary or shadow traffic to both old and new deployed_model instances simultaneously. Compare audio quality, first-packet latency, and error rate.
-
After validation passes, shift all traffic to the new deployment, then decommission the old one (see Decommission unused deployments) to complete the version switch.
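The canary step in the workflow above can be implemented on the client side by choosing which deployed_model name to call for each request. The sketch below uses deterministic hash-based bucketing, which is one common approach and not something the service prescribes. The function name and the idea of keying on your application's request ID are assumptions.

```python
import hashlib


def pick_deployment(request_key: str, old_model: str, new_model: str,
                    canary_percent: int) -> str:
    """Route roughly canary_percent of requests to new_model, the rest to old_model.

    The choice is derived from a hash of request_key, so the same key
    always maps to the same deployment, which keeps comparisons repeatable.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return new_model if bucket < canary_percent else old_model
```

Raising canary_percent in stages and finally setting it to 100 shifts all traffic to the new deployment, at which point the old one can be decommissioned.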
Fine-tuned models are frozen at the base model version used during training: When cosyvoice-v3-flash receives future upgrades (new languages, expanded SSML tags, etc.), deployed fine-tuned models won't automatically inherit those enhancements — their capabilities are locked to the base model version at training time. To use upgraded base model capabilities, re-train and re-deploy on the new version.
API reference
The following table summarizes all APIs used in this guide. The In this guide column links to the relevant step, and the Full reference column links to the field-level API documentation.
| API | Method / Path | Purpose | In this guide | Full reference |
| --- | --- | --- | --- | --- |
| Upload training file | | Upload the training dataset zip | | |
| Create fine-tuning job | | Start a training job | | |
| Query job details | | Track training progress | | |
| Get training logs | | Pull training logs for troubleshooting | | |
| List checkpoints | GET /api/v1/fine-tunes/{job_id}/checkpoints | Enumerate checkpoint candidates and choose the best one for deployment | | |
| Query deployable models | | Get deployment templates and specs | | |
| Create deployment | | Deploy the fine-tuned model | | |
| Query deployment status | | Poll deployment status | | |
| Scale deployment | PUT /api/v1/deployments/{deployed_model}/scale | Adjust total model unit count | | |
| Decommission deployment | DELETE /api/v1/deployments/{deployed_model} | Remove unused deployments and stop billing | | |
| Speech synthesis (HTTP) | | Synthesize complete text in one request | | |
| Speech synthesis (WebSocket) | | Streaming synthesis | | |