Fine-tune CosyVoice models

Fine-tune a CosyVoice speech synthesis model on Alibaba Cloud Model Studio using SFT (efficient_sft). Fine-tuning jobs can only be submitted through the API (HTTP). The console doesn't support this feature yet.

Scope

CosyVoice fine-tuning trains a high-fidelity custom voice model from multiple recordings of the same speaker. Use fine-tuning when Voice cloning with a single audio sample doesn't meet your fidelity requirements and you have multiple recordings (or hours of audio) from the same speaker available.

Use cases

  • Brand or IP voice customization: You have hours of high-quality recordings from a single speaker — a brand spokesperson, virtual streamer, or IP character — and need fidelity that exceeds what single-sample voice cloning delivers.

  • Long-form voice production: Audiobooks, podcasts, documentary narration, or corporate training material that requires consistent timbre, tone, and rhythm across tens of hours of output — provided you already have sufficient recordings of the target speaker.

Note

Training data must come from a single speaker: CosyVoice fine-tuning produces a single-voice model. Mixing recordings from multiple speakers in the training set degrades voice fidelity.

Specifications

  • Fine-tuning method: SFT efficient fine-tuning (efficient_sft). Other training methods such as CPT and DPO aren't supported.

  • Supported base model: cosyvoice-v3-flash.

  • Supported region: CosyVoice fine-tuning is available only in the China (Beijing) region.

Fine-tuning output

The fine-tuning output is a standalone deployed model, not a voice ID under the base model. It has a single fixed voice learned during training. Set voice to default when calling the model — other voice values aren't supported.

Limitations

The following capabilities are determined by the base model cosyvoice-v3-flash and can't be added, extended, or changed through fine-tuning:

  • Language support: Training data must be in a language already supported by the base model. Training with data in an unsupported language (for example, Bulgarian) doesn't enable the model to synthesize speech in that language.

  • Request-level control interfaces: Fine-tuned models support SSML (fine-grained control over speed, pitch, pauses, volume, and pronunciation within a request) and LaTeX (math formula reading). Instruction control isn't supported — don't pass the instruction parameter.

  • Voice cloning and voice design: A fine-tuned model is a single-voice standalone model (voice="default" locked). It doesn't provide voice cloning or voice design capabilities. To create new voice IDs, use the base model's Voice cloning or Voice Design features.

Do you need fine-tuning?

Fine-tuning involves dataset preparation, training time, model deployment, and additional costs — significantly more overhead than calling the base model directly.

For the following scenarios, use these built-in features instead of fine-tuning:

| Goal | Recommended feature | Description |
| --- | --- | --- |
| Clone a speaker's voice from a single audio sample | Voice cloning | Voice cloning is a built-in base model feature. Provide one audio sample to create a dedicated voice ID. |
| Generate a new voice from a text description (no recordings) | Voice Design | Voice design generates a new voice ID directly from a text description. |
| Dynamically adjust speaking style per request (speed, pitch, pauses, emotion, volume) | Instruction control or Speech Synthesis Markup Language (SSML) | CosyVoice offers two request-level control mechanisms, neither requiring a new model. Instruction control uses natural language descriptions to adjust speed, emotion, and style; see Real-time speech synthesis. SSML embeds tags in the text for fine-grained control over speed, pitch, pauses, volume, and pronunciation (not emotion or style); see SSML. |
| Enable a language the base model doesn't support | Not achievable with fine-tuning | Fine-tuning can't extend the base model's language support. See the Language support item in Limitations. |

Billing

CosyVoice fine-tuning involves two cost components: training costs (billed by token consumption) and deployment costs (billed by model unit usage duration).

Training costs

Training is billed at CNY 0.2 per 1,000 tokens.

Estimate the token consumption for a single job using this formula:

.

Here, lm_max_epoch and fm_max_epoch are the LM and FM training epochs set in the hyperparameters. The total training set duration is the combined length (in seconds) of all audio files in the training dataset. Increasing epoch count or training-set size linearly increases token consumption.
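As a rough illustration of the billing rate above, the following sketch converts a token count into a training cost at CNY 0.2 per 1,000 tokens. The function name is hypothetical, and the token count has to come from your own estimate or from a completed job. The example value is the benchmark run reported in Hyperparameters.

```python
# Minimal sketch: convert a token count into a training cost.
# The rate comes from this document; the function name is illustrative only.
PRICE_CNY_PER_1K_TOKENS = 0.2


def training_cost_cny(tokens: int) -> float:
    """Return the training cost in CNY for a given token count."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens / 1000 * PRICE_CNY_PER_1K_TOKENS


if __name__ == "__main__":
    # 99,406 tokens is the benchmark run reported in Hyperparameters.
    print(f"{training_cost_cny(99_406):.2f} CNY")
```

Because token consumption grows linearly with epoch count and dataset duration, the same function can be reused once you have a token estimate for a larger run.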

Deployment costs

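Deployed fine-tuned models are billed by model unit usage duration:

Deployment cost = Total model units × Unit price per model unit × Usage duration

Here, Total model units = Model units per replica × Number of replicas. For available deployment templates and their model unit types, see Deployment templates. For unit prices and billing start/end times per model unit type, see Model deployment.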

Prerequisites

Note

All curl examples in this document use the macOS/Linux environment variable syntax ${DASHSCOPE_API_KEY}. For Windows CMD, replace with %DASHSCOPE_API_KEY%. For PowerShell, replace with $env:DASHSCOPE_API_KEY.

Workflow

Terminology: In this document, "fine-tuning" refers specifically to SFT efficient fine-tuning (efficient_sft) of the base model cosyvoice-v3-flash. "Fine-tuned model", "fine-tuning output", and "checkpoint" all refer to the model artifact produced after training. "Deployment instance" refers to a callable deployed_model created from that artifact.

  1. Prepare the training dataset: Organize training audio files (.wav) and a sample manifest (data.jsonl) in the required directory structure, then package them as a .zip archive.

  2. Upload the training file: Upload the .zip archive to Alibaba Cloud Model Studio and obtain a file ID (file_id).

  3. Create a fine-tuning job: Start training with the file_id and obtain a job ID (job_id) and a fine-tuned model ID (finetuned_output).

  4. Check job status and logs: Wait for training to complete (status becomes SUCCEEDED). Pull training logs if troubleshooting is needed.

  5. Deploy and call the fine-tuned model: Deploy the successful model ID as a callable service, then use it the same way as the base model.

For real-world timing and cost benchmarks (minimal vs. recommended hyperparameters), see the comparison table at the end of Hyperparameters.

Prepare the training dataset

Language constraints: The following constraints apply to all specifications in this section. The tables below don't repeat these details:

  • Audio language must be one already supported by the base model cosyvoice-v3-flash.

  • Text language (the text field in data.jsonl) must match the language of its corresponding audio. For single samples containing mixed-language speech, see the Mixed languages allowed item below.

  • Not extensible: Languages not supported by the base model can't be added through fine-tuning. For example, training with Bulgarian audio won't enable Bulgarian speech synthesis.

  • Mixed languages allowed: Different samples in the same training set can be in different languages (for example, some in Chinese, some in English) without affecting single-language synthesis quality. When a single sample contains mixed-language speech, the text field should transcribe the text exactly as spoken.

  • Supplementary language phoneme coverage: When the training set is primarily one language with a small amount of another (for example, mainly Chinese with some English), the supplementary language samples should cover as many phonemes as possible (for American English, this means covering all 20 vowels and 24 consonants) to ensure voice fidelity in the supplementary language.

Audio specifications

| Item | Requirement |
| --- | --- |
| Audio format | .wav |
| Sample rate | 16 kHz or higher. |
| Duration per clip | 2–30 seconds recommended; minimum 1 second. |
| Language | See "Language constraints" at the beginning of this section. |
| Text language | See "Language constraints" at the beginning of this section. |
| Recording environment and content | Record in a studio or low-noise environment to minimize background noise. Paralinguistic content in the training audio (laughter, sighs, coughs, breath sounds, and long pauses) is learned during fine-tuning. The model reproduces similar sounds in appropriate contexts during inference. To produce cleaner output, remove this content during data preparation. |

Important
  • Training data characteristics directly shape the fine-tuned model: Fine-tuning fits the acoustic features of the training dataset. Statistical properties such as speaking speed, emotional tendency, and pause rhythm in the training audio are reflected in the model's default synthesis style. For example, if training data consists mostly of slow, calm readings, the model's default output will also tend toward slower speed and calmer tone. Choose training audio that matches your target synthesis scenario.

  • Watch for mispronunciations in training audio: If training data contains non-standard pronunciations, the fine-tuned model may reproduce them. The more frequently a mispronunciation appears, the more likely it is to be learned. Review audio pronunciations for accuracy during the labeling stage.
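The audio specifications above can be checked before upload. The following is a minimal sketch, assuming PCM .wav files that Python's standard-library wave module can read. The function name and the wording of the messages are illustrative and are not part of the platform.

```python
# Minimal sketch: check one training clip against the audio specifications.
# Uses only the standard-library wave module, so it handles PCM .wav files.
import wave

MIN_SAMPLE_RATE_HZ = 16_000          # "16 kHz or higher"
MIN_DURATION_S = 1.0                 # hard minimum per clip
RECOMMENDED_RANGE_S = (2.0, 30.0)    # recommended clip length


def check_clip(path: str) -> list[str]:
    """Return a list of human-readable problems found in one .wav file."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below 16 kHz")
    if duration < MIN_DURATION_S:
        problems.append(f"duration {duration:.2f} s is below the 1 s minimum")
    elif not RECOMMENDED_RANGE_S[0] <= duration <= RECOMMENDED_RANGE_S[1]:
        problems.append(f"duration {duration:.2f} s is outside the recommended 2-30 s")
    return problems
```

An empty list means the clip meets the numeric requirements. Noise level, pronunciation, and paralinguistic content still need a manual listen.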

Directory structure

Organize the training samples in the following directory structure, then package the entire directory as a .zip archive for upload:

user_data/
├── data.jsonl           # Training sample manifest (required)
└── train/               # Training audio directory (all training .wav files go here)
    ├── 100001.wav
    ├── 100002.wav
    └── ...

data.jsonl format

The data.jsonl file contains one training sample per line:

{"wav_fn": "train/100001.wav", "text": "Hello."}
{"wav_fn": "train/100002.wav", "text": "I see."}

Field descriptions

| Field | Required | Description |
| --- | --- | --- |
| wav_fn | Yes | Relative path to the training audio file. Must start with train/ (matching the directory structure), for example train/100001.wav. The system resolves paths as {data_dir}/{wav_fn}, where {data_dir} is the root directory after zip extraction (user_data/). |
| text | Yes | The corresponding transcript. Required in the current version; automatic ASR backfill isn't available. |

Text formatting guidelines:

  • No normalization needed: Numbers, mixed English/Chinese text, and punctuation don't require special normalization. Write them as naturally spoken in the audio.

  • Remove special characters: Remove non-pronounced special characters (such as emoji, decorative symbols, and control characters) from the text. Keep only content that corresponds to the actual audio pronunciation.

  • No markup languages: The text field must be plain text, with no SSML tags, LaTeX formulas, instruction control statements, or emotion annotations. The fine-tuning stage doesn't parse markup; any markup included will be incorrectly learned as a character sequence in the text-to-speech mapping. Use request-level markup (SSML/LaTeX) when calling the fine-tuned model. For the fine-tuned model's supported request-level control interfaces, see the Request-level control interfaces item in Limitations.
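The field rules and text guidelines above lend themselves to a quick pre-upload check. The sketch below validates one data.jsonl line at a time. The function name is hypothetical, and the tag-detection regular expression is only a heuristic assumed for this example, not a rule the platform enforces.

```python
# Minimal sketch: validate one data.jsonl line against the field rules.
import json
import re

# Heuristic for SSML-like tags; an assumption of this sketch, not a platform rule.
TAG_PATTERN = re.compile(r"<[^>]+>")


def validate_manifest_line(line: str) -> list[str]:
    """Return the problems found in one data.jsonl line (empty list if none)."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(sample, dict):
        return ["line is not a JSON object"]
    problems = []
    wav_fn = sample.get("wav_fn")
    text = sample.get("text")
    if not isinstance(wav_fn, str) or not wav_fn.startswith("train/"):
        problems.append("wav_fn must be a relative path starting with train/")
    elif not wav_fn.lower().endswith(".wav"):
        problems.append("wav_fn must point to a .wav file")
    if not isinstance(text, str) or not text.strip():
        problems.append("text is required and must be non-empty")
    elif TAG_PATTERN.search(text):
        problems.append("text appears to contain markup tags; use plain text")
    return problems
```

Running this over every line of the manifest before packaging catches path and transcript mistakes early. It does not verify that the transcript matches the audio.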

Data volume recommendations

  • Recommended volume: A total training audio duration of 1–10 hours generally produces good results. Beyond 10 hours, returns diminish.

  • Sample count: At least 150 training audio clips are recommended for stable fine-tuning results. Neither the algorithm nor the platform enforces an upper limit on the number of samples in data.jsonl — add as many as your target total duration requires.

Note

Total training data duration directly affects token consumption and training time. See Billing for the billing formula.
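To compare a dataset with the volume recommendations above, a small summary function is enough. This is a sketch only: the function name and the returned keys are illustrative, and the thresholds are the recommended values from this section, not hard platform limits.

```python
# Minimal sketch: compare a dataset's size with the volume recommendations.
RECOMMENDED_MIN_CLIPS = 150          # "At least 150 training audio clips"
RECOMMENDED_HOURS = (1.0, 10.0)      # "1-10 hours generally produces good results"


def summarize_dataset(durations_s: list[float]) -> dict:
    """Summarize clip count and total duration from clip lengths in seconds."""
    total_hours = sum(durations_s) / 3600
    return {
        "clips": len(durations_s),
        "total_hours": total_hours,
        "enough_clips": len(durations_s) >= RECOMMENDED_MIN_CLIPS,
        "in_recommended_range": RECOMMENDED_HOURS[0] <= total_hours <= RECOMMENDED_HOURS[1],
    }
```

The total duration in seconds is also the quantity the billing section uses for token estimation.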

Upload the training file

Upload the packaged train_data.zip to Alibaba Cloud Model Studio via multipart/form-data. Key fields: files (local zip path), purpose (set to fine-tune), and descriptions (optional).

curl --location 'https://dashscope.aliyuncs.com/api/v1/files' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--form 'files=@"train_data.zip"' \
--form 'purpose="fine-tune"' \
--form 'descriptions="Training voice package"'

Note

Save the file_id from the response — it's the unique identifier for your uploaded dataset and is required in the next step to create a fine-tuning job. For complete request/response field descriptions, see Upload a file.

Create a fine-tuning job

Start a training job using the file_id obtained in the previous step.

The platform runs only one training job at a time. If a job is running when you submit a new one, the new job enters a QUEUING state until the running job completes. Account for queue wait time when planning your schedule.

Note

The following example uses minimal hyperparameters (lm_max_epoch=4, fm_max_epoch=4, etc.) intended only for quick validation; don't use these values in production. For production fine-tuning, use the recommended values in Hyperparameters (lm_max_epoch=60, fm_max_epoch=100, etc.). Training with the recommended hyperparameters consumes approximately 20 times the tokens and wall-clock time of this minimal example.

curl --location --request POST 'https://dashscope.aliyuncs.com/api/v1/fine-tunes' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--header 'Content-Type: application/json' \
--data '{
    "model": "cosyvoice-v3-flash",
    "training_file_ids": [
        "<replace with the file_id of the training dataset>"
    ],
    "hyper_parameters": {
        "lm_max_epoch": 4,
        "lm_step": 1,
        "lm_num": 2,
        "fm_max_epoch": 4,
        "fm_step": 2,
        "fm_num": 2,
        "lm_batch_size": 1000,
        "fm_batch_size": 2000
    },
    "training_type": "efficient_sft"
}'

Key request fields: model is fixed to cosyvoice-v3-flash, training_type is fixed to efficient_sft, training_file_ids accepts only one training file ID, and all 8 LM/FM subfields in hyper_parameters are required. For value ranges, types, and parameter formats, see Create a tuning job.

Note

After a successful request, save two key fields from the response: output.job_id (the job ID, used to check job status and retrieve logs) and output.finetuned_output (the fine-tuned model ID, used to deploy the model after training completes). The initial status is PENDING and changes as training progresses. For complete response field descriptions, see Create a tuning job.
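The request constraints above (fixed model, fixed training type, one file ID, eight required hyperparameters) can be enforced before the request is sent. The following sketch only builds the JSON body and sends nothing. The function name is hypothetical, and value ranges are not checked because they are defined in the API reference rather than here.

```python
# Minimal sketch: build and sanity-check the body for POST /api/v1/fine-tunes.
import json

REQUIRED_HYPER_PARAMETERS = (
    "lm_max_epoch", "lm_step", "lm_num", "lm_batch_size",
    "fm_max_epoch", "fm_step", "fm_num", "fm_batch_size",
)


def build_finetune_payload(file_id: str, hyper_parameters: dict) -> dict:
    """Return the request body, or raise if a required hyperparameter is missing."""
    missing = [k for k in REQUIRED_HYPER_PARAMETERS if k not in hyper_parameters]
    if missing:
        raise ValueError(f"missing hyperparameters: {', '.join(missing)}")
    return {
        "model": "cosyvoice-v3-flash",       # fixed base model
        "training_file_ids": [file_id],      # exactly one file ID is accepted
        "hyper_parameters": dict(hyper_parameters),
        "training_type": "efficient_sft",    # fixed training type
    }


if __name__ == "__main__":
    recommended = {
        "lm_max_epoch": 60, "lm_step": 5, "lm_num": 3, "lm_batch_size": 1000,
        "fm_max_epoch": 100, "fm_step": 10, "fm_num": 3, "fm_batch_size": 2000,
    }
    print(json.dumps(build_finetune_payload("<file_id>", recommended), indent=2))
```

The printed body can be passed to curl with --data or posted from any HTTP client with the same headers as the curl example.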

Hyperparameters

CosyVoice fine-tuning trains two sub-networks: LM (Language Model) and FM (Flow Matching). LM is an autoregressive model that converts text to discrete speech tokens and has the most impact on prosody. FM is a flow-matching model that converts speech tokens to Mel spectrograms and has the most impact on voice fidelity. The two networks are decoupled and can be tuned independently using lm_* and fm_* hyperparameter prefixes. Hyperparameters control training epochs and checkpoint-saving frequency, directly affecting training time, token consumption, and model quality.

Start with the following recommended values for your first full run, then adjust based on results:

  • LM network: lm_max_epoch=60, lm_step=5, lm_num=3, lm_batch_size=1000.

  • FM network: fm_max_epoch=100, fm_step=10, fm_num=3, fm_batch_size=2000.

Here, *_max_epoch is the total number of training epochs, *_step is the checkpoint-saving interval (in epochs), *_num is the maximum number of checkpoints to retain, and *_batch_size is the training batch size. For complete value ranges, see CosyVoice speech synthesis model hyper_parameters.

Note

Increasing lm_max_epoch or fm_max_epoch linearly increases token consumption and training time (see Training costs). Additionally, higher epoch counts increase "forgetting" of the base model's original capabilities, potentially degrading long-text stability, polyphone accuracy, and other qualities. The recommended values (lm_max_epoch=60, fm_max_epoch=100) balance voice fidelity with base capability retention.

Benchmark data: The following table shows a real-world sample run to help you estimate time and cost. Actual values vary with data volume, hyperparameters, and queue wait time.

| Item | Measured value |
| --- | --- |
| Training samples | 99 audio clips (dataset zip approximately 37 MB) |
| Hyperparameters | lm_max_epoch=4 / lm_step=1 / lm_num=2 / lm_batch_size=1000; fm_max_epoch=4 / fm_step=2 / fm_num=2 / fm_batch_size=2000 |
| Total training time | Approximately 37 minutes (from PENDING to SUCCEEDED) |
| Tokens consumed | Approximately 99,406 tokens |
| Estimated cost | Approximately CNY 19.88 (at CNY 0.2 per 1,000 tokens) |

Note

These figures are based on minimal hyperparameters. With recommended hyperparameters, token consumption, cost, and training time are approximately 20x higher. Don't use this table to estimate production costs.

Checkpoints

A single fine-tuning job can produce multiple checkpoints (candidate models; view them through the List checkpoints API). The exact count and ordering are determined by lm_num, fm_num, and *_step in the hyperparameters.

  1. Select models: Starting from the maximum epoch for each network, step backwards by *_step, selecting *_num checkpoints total.

  2. Combine: Take the Cartesian product of the LM and FM selections. Total candidate checkpoints = lm_num × fm_num.

  3. Sort: Order by LM epoch × FM epoch descending (higher product means both networks trained more thoroughly, so it ranks first).

  4. Truncate: From the descending sorted result, retain at most 10 checkpoints. If fewer than 10 candidates exist, output all of them.

  5. Naming: checkpoint-{LM epoch zero-padded to 4 digits}{FM epoch zero-padded to 4 digits} — for example, LM 4 / FM 4 produces checkpoint-00040004.

Example: With parameters lm_max_epoch=4, lm_step=1, lm_num=2, fm_max_epoch=4, fm_step=2, fm_num=2:

  • LM selects: 4, 3 (starting from 4, stepping back by lm_step=1, taking lm_num=2).

  • FM selects: 4, 2 (starting from 4, stepping back by fm_step=2, taking fm_num=2).

  • Cartesian product yields 4 candidates, sorted descending by product:

| Rank | LM / FM epoch | Product | Checkpoint name |
| --- | --- | --- | --- |
| 1 | LM 4 / FM 4 | 16 | checkpoint-00040004 |
| 2 | LM 3 / FM 4 | 12 | checkpoint-00030004 |
| 3 | LM 4 / FM 2 | 8 | checkpoint-00040002 |
| 4 | LM 3 / FM 2 | 6 | checkpoint-00030002 |
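The selection, combination, sorting, truncation, and naming rules above can be sketched in a few lines. This is an illustration of the rules as described, not the platform's implementation. The function names are hypothetical, and the order of candidates with equal products is an assumption (Python's stable sort keeps LM-major order), since the document doesn't specify tie-breaking.

```python
# Minimal sketch of the checkpoint rules described in this section.
from itertools import product

MAX_CHECKPOINTS = 10  # at most 10 checkpoints are retained


def select_epochs(max_epoch: int, step: int, num: int) -> list[int]:
    """Step back from max_epoch by step, keeping up to num positive epochs."""
    if step < 1 or num < 1:
        raise ValueError("step and num must be >= 1")
    epochs = []
    epoch = max_epoch
    while epoch > 0 and len(epochs) < num:
        epochs.append(epoch)
        epoch -= step
    return epochs


def list_checkpoints(lm_max_epoch: int, lm_step: int, lm_num: int,
                     fm_max_epoch: int, fm_step: int, fm_num: int) -> list[str]:
    """Return checkpoint names ranked by LM epoch x FM epoch, descending."""
    lm_epochs = select_epochs(lm_max_epoch, lm_step, lm_num)
    fm_epochs = select_epochs(fm_max_epoch, fm_step, fm_num)
    candidates = sorted(product(lm_epochs, fm_epochs),
                        key=lambda pair: pair[0] * pair[1], reverse=True)
    return [f"checkpoint-{lm:04d}{fm:04d}" for lm, fm in candidates[:MAX_CHECKPOINTS]]


if __name__ == "__main__":
    # Reproduces the worked example: LM selects 4, 3 and FM selects 4, 2.
    for name in list_checkpoints(4, 1, 2, 4, 2, 2):
        print(name)
```

With the recommended hyperparameters (lm_num=3, fm_num=3), the same function yields 9 candidates, which is below the limit of 10, so none are dropped.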

Check job status and logs

Training typically takes tens of minutes to several hours depending on the hyperparameters and dataset size. Two APIs are available for tracking progress: query job details to check the current stage, and get training logs to troubleshoot issues or confirm progress.

Query job details

Use the job_id returned when creating the job to check its status.

curl --location 'https://dashscope.aliyuncs.com/api/v1/fine-tunes/<job_id>' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--header 'Content-Type: application/json'

The output.status field in the response indicates the current stage: PENDING (waiting to start) → QUEUING (queued, because only one job runs at a time) → RUNNING (training in progress) → SUCCEEDED (training complete). Abnormal states include FAILED (training failed), CANCELING (cancellation in progress), and CANCELED (canceled). For full status field definitions, see Get fine-tuning job details.

Note

Once status becomes SUCCEEDED, use the finetuned_output saved earlier to proceed to deploying the model.
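Because training can take hours, a polling loop is usually more practical than manual queries. The sketch below wraps the job-details endpoint shown above using only the standard library. The function names and the 60-second interval are illustrative choices. Treating SUCCEEDED, FAILED, and CANCELED as the terminal states is an assumption based on the status list in this section.

```python
# Minimal sketch: poll a fine-tuning job until it reaches a terminal state.
import json
import os
import time
import urllib.request

# Assumed terminal states, taken from the status list in this section.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def fetch_job_status(job_id: str) -> str:
    """Query the job-details endpoint and return output.status."""
    request = urllib.request.Request(
        f"https://dashscope.aliyuncs.com/api/v1/fine-tunes/{job_id}",
        headers={
            "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["output"]["status"]


def wait_for_job(job_id, fetch=fetch_job_status, poll_seconds=60.0, sleep=time.sleep):
    """Poll until the job reaches a terminal state and return that state."""
    while True:
        status = fetch(job_id)
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
```

The fetch and sleep parameters are injectable so the loop can be exercised without network access. In real use, call wait_for_job("<job_id>") and proceed to deployment only when it returns SUCCEEDED.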

Get training logs

If a job stays in one state for an extended period or enters the FAILED state, pull the training logs to help diagnose the issue.

curl --location 'https://dashscope.aliyuncs.com/api/v1/fine-tunes/<job_id>/logs?offset=0&line=1000' \
--header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
--header 'Content-Type: application/json'

Use offset to control the starting position and line to control the maximum number of lines returned (this example fetches 1,000 lines per request). For response field descriptions, see Query fine-tuning logs.

Deploy the model

Once the job status becomes SUCCEEDED, the fine-tuning output finetuned_output (the unique ID of the fine-tuned model, corresponding to the top-ranked checkpoint in Checkpoints) is ready for deployment.

Deployment templates

A model unit (MU) is the smallest unit of inference compute on the Alibaba Cloud Model Studio platform. Different model unit types correspond to different compute capacities and unit prices. Model units per replica indicates how many model units a single replica occupies. Deployment costs are directly determined by the total number of model units in use.

Deployment templates are preset deployment specifications (including a model unit type and model units per replica) that define the compute resources for a single replica. Each template defines a different combination of inference performance and unit cost. In the table below, V/II in the Model unit type column are Roman numeral identifiers for model unit specifications, corresponding to model_unit_spec field values of MU5/MU2 in API responses. CosyVoice fine-tuned models currently offer two deployment templates:

| Template name | Model unit type | Model units per replica | Use case |
| --- | --- | --- | --- |
| Single-machine deployment | Type V model unit | 1 | Cost-effective deployment that balances inference performance with unit cost. Suitable for budget-sensitive, low-to-moderate load scenarios that still require stable service. |
| Single-machine deployment - flagship complex inference | Type II model unit | 8 | For high-complexity workloads (such as very large models with long contexts). Each replica provides stronger inference capabilities. |

Billing: Deployment costs are directly tied to the total model unit count (Total Model Unit), calculated as: Total Model Unit = Model units per replica × Deployed Replicas.

Choosing a different template or setting a different replica count directly changes the billing amount. Evaluate based on your workload and budget before deploying. For the complete cost formula and unit prices per model unit type, see Deployment costs.
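The relationship between replicas and billed model units can be checked with a small helper. In this sketch, the per-replica values (1 and 8) come from the template table above, and the same total is what the API deployment method later passes as capacity. The function names are hypothetical.

```python
# Minimal sketch: relate replica count to total model units (the API's capacity).
# Function names are illustrative; per-replica values come from the template table.


def total_model_units(units_per_replica: int, replicas: int) -> int:
    """Total model units billed for a deployment with the given replica count."""
    if units_per_replica < 1 or replicas < 1:
        raise ValueError("units_per_replica and replicas must be >= 1")
    return units_per_replica * replicas


def replicas_for_capacity(capacity: int, units_per_replica: int) -> int:
    """Inverse mapping; rejects a capacity that is not a whole number of replicas."""
    if units_per_replica < 1:
        raise ValueError("units_per_replica must be >= 1")
    if capacity < 1 or capacity % units_per_replica != 0:
        raise ValueError(f"capacity must be a positive multiple of {units_per_replica}")
    return capacity // units_per_replica


if __name__ == "__main__":
    print(total_model_units(1, 4))   # Single-machine deployment, 4 replicas
    print(total_model_units(8, 2))   # flagship complex inference, 2 replicas
```

Multiplying the result by the unit price and the usage duration gives a deployment cost estimate. Unit prices are listed in Model deployment and are not reproduced here.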

Deployment methods: After selecting a deployment template, complete the deployment using either of the following methods based on your use case:

Method 1: Deploy through the console (recommended for daily use)

Entry point: Go to the My Models page in the Alibaba Cloud Model Studio console, find the successfully fine-tuned model, and submit it for deployment. For complete steps, see Model deployment.

Key parameters: Only two selections are required — the template and replica values automatically determine all other fields:

  • Deployment Template: Choose between Single-machine deployment and Single-machine deployment - flagship complex inference. See Deployment templates for the differences.

  • Deployed Replicas: Number of instances. Set based on your concurrency and throughput requirements (integer >= 1).

Post-deployment status: The console displays the deployment instance ID and status transitions (PENDING → DEPLOYING → RUNNING). The model is ready for calls once status reaches RUNNING.

Method 2: Deploy through the API (recommended for automation)

API deployment requires calling three endpoints in sequence. For complete field constraints, see Model deployment.

  1. Query deployable models (GET /api/v1/deployments/models): Retrieve the model_name, template_id, and capacity_unit_per_instance needed for creating a deployment:

    curl 'https://dashscope.aliyuncs.com/api/v1/deployments/models?page_no=1&page_size=100&version=v1.0&model_source=custom' \
        --header "Authorization: Bearer ${DASHSCOPE_API_KEY}" \
        --header 'Content-Type: application/json'

    Sample response (key fields only):

    {
      "output": {
        "models": [
          {
            "model_name": "cosyvoice-v3-flash-ft-202605271743-dd2a",
            "plans": [
              {
                "plan": "mu",
                "templates": [
                  {
                    "template_name": "Single-machine deployment",
                    "template_id": "dps-20260521172224-1vabse",
                    "template_desc": "Cost-effective deployment plan, ...",
                    "roles": {
                      "unified": {
                        "model_unit_spec": "MU5",
                        "capacity_unit_per_instance": 1
                      }
                    }
                  },
                  {
                    "template_name": "Single-machine deployment - flagship complex inference",
                    "template_id": "MU2",
                    "template_desc": "High-complexity workloads, ultra-large models with long context.",
                    "roles": {
                      "unified": {
                        "model_unit_spec": "MU2",
                        "capacity_unit_per_instance": 8
                      }
                    }
                  }
                ]
              }
            ]
          }
        ]
      }
    }

    Locate your model: The output.models array in the response lists all deployable models. Find the entry whose model_name matches the finetuned_output returned by Create a fine-tuning job. Its plans[].templates field contains the available deployment templates for that model.

  2. Create a deployment (POST /api/v1/deployments): Submit a deployment request using the parameters from the previous step. The following example uses the Single-machine deployment template with 1 replica:

    curl --location 'https://dashscope.aliyuncs.com/api/v1/deployments' \
    --header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
    --header 'Content-Type: application/json' \
    --data '{
        "model_name": "<MODEL_NAME>",
        "plan": "mu",
        "deploy_spec": "<TEMPLATE_ID>",
        "capacity": 1,
        "billing_method": "POST_PAY"
    }'

    Response: The output.deployed_model field in the response is the deployment instance ID. Pass this value as the model parameter when calling the model.

    Key parameters (replace placeholders <MODEL_NAME> and <TEMPLATE_ID> with values from the previous step's response):

    • model_name: The model_name from the entry matching your finetuned_output in the previous response.

    • plan: Fixed to "mu" (billed by model unit usage duration).

    • deploy_spec: The template_id of your chosen template from the previous response — for example, Single-machine deployment is currently dps-20260521172224-1vabse, and Single-machine deployment - flagship complex inference is MU2. Always use the real-time values from the previous step — don't hard-code these.

    • capacity: The total number of model units for this deployment, directly tied to billing. Must be an integer multiple of capacity_unit_per_instance from the previous step; capacity = capacity_unit_per_instance × number of replicas.

      Examples:

      • Single-machine deployment (capacity_unit_per_instance = 1): capacity=1 → 1 replica; capacity=4 → 4 replicas.

      • Single-machine deployment - flagship complex inference (capacity_unit_per_instance = 8): capacity=8 → 1 replica; capacity=16 → 2 replicas; capacity=1 is rejected (not a multiple of 8).

    • billing_method: Currently supports "POST_PAY" (pay-as-you-go).

    API field to console mapping:

    | API field | Console equivalent |
    | --- | --- |
    | model_name | The fine-tuned model selected on the My Models page |
    | deploy_spec | The selection in Deployment Template: Single-machine deployment or Single-machine deployment - flagship complex inference |
    | capacity | Total Model Unit (the console calculates and displays this automatically as model units per replica × Deployed Replicas) |

    plan and billing_method have no console equivalents (console deployments default to pay-as-you-go billing by model unit usage duration).

    Important

    capacity is not the replica count — it's the total number of model units. Verify against the formula and examples above before submitting.

  3. Check deployment status (GET /api/v1/deployments/<deployed_model>): The deployment progresses through PENDING → DEPLOYING → RUNNING. Poll the status using:

    curl --location 'https://dashscope.aliyuncs.com/api/v1/deployments/<replace with deployed_model field value>' \
    --header 'Authorization: Bearer '${DASHSCOPE_API_KEY} \
    --header 'Content-Type: application/json'

    When status is RUNNING, the model is ready for calls. For additional deployment operations (scaling, decommissioning, etc.), see Model deployment.
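Steps 1 and 2 above can be tied together in code so that template_id and capacity are always read from the live response rather than hard-coded. The following sketch assumes the response has the shape of the sample shown in step 1 (output.models[].plans[].templates[].roles.unified). The function name is hypothetical, and the function only builds the request body for POST /api/v1/deployments without sending it.

```python
# Minimal sketch: derive the deployment request body from the
# "query deployable models" response. Assumes the sample response shape.


def build_deployment_payload(models_response: dict, finetuned_output: str,
                             template_name: str, replicas: int = 1) -> dict:
    """Find the model and template, then compute capacity from the replica count."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    for model in models_response["output"]["models"]:
        if model["model_name"] != finetuned_output:
            continue
        for plan in model["plans"]:
            for template in plan["templates"]:
                if template["template_name"] != template_name:
                    continue
                units = template["roles"]["unified"]["capacity_unit_per_instance"]
                return {
                    "model_name": model["model_name"],
                    "plan": plan["plan"],
                    "deploy_spec": template["template_id"],
                    # capacity is total model units, not the replica count
                    "capacity": units * replicas,
                    "billing_method": "POST_PAY",
                }
    raise LookupError(f"no template {template_name!r} for model {finetuned_output!r}")
```

Passing replicas instead of capacity keeps the multiple-of-capacity_unit_per_instance rule satisfied by construction.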

Call the model

Once the deployment status is RUNNING, the fine-tuned model is ready for production calls. The calling method (endpoint, request body fields, typical response, and audio retrieval) is the same as other CosyVoice models — see Speech synthesis for the full guide. Compared to calling the base model cosyvoice-v3-flash, only two request parameters differ — all other fields (text, format, sample_rate, etc.) remain the same:

  • model: Set to the deployment instance ID — the output.deployed_model value from the deployment response.

  • voice: Must be "default", representing the dedicated voice learned from training data. Passing a preset voice name or voice-cloning ID will cause the request to fail.

Full examples follow. Choose non-real-time (HTTP) or real-time (WebSocket) based on your use case:

Non-real-time (HTTP)

Synthesizes the complete text in a single request. The response contains a URL to the synthesized audio, valid for 24 hours.

curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "<replace with the deployed_model from the deployment response>",
    "input": {
      "text": "There is a large garden behind my house.",
      "voice": "default",
      "format": "wav",
      "sample_rate": 24000
    }
}'

Real-time (Python WebSocket)

Uses the DashScope Python SDK to connect to the WebSocket synthesis endpoint. Requires the dashscope package. The WebSocket interface supports streaming callbacks (synthesize and push audio simultaneously). For quick validation, this example uses the synchronous call synthesizer.call(text) to return the complete audio at once. For a full streaming callback implementation (synthesize while playing), see the streaming synthesis example in the user guide.

# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer

dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'

# Replace with the deployed_model from the deployment response; voice is fixed to "default"
model = "<replace with the deployed_model from the deployment response>"
voice = "default"

synthesizer = SpeechSynthesizer(model=model, voice=voice)
audio = synthesizer.call("There is a large garden behind my house.")

print('[Metric] requestId: {}, first-package latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

with open('output.mp3', 'wb') as f:
    f.write(audio)

Observability and troubleshooting

Log every call to support investigation of anomalous requests and performance issues. Capture these three categories of information:

  • Request ID (request_id): Retrieved via synthesizer.get_last_request_id() in the WebSocket SDK (see the Real-time (Python WebSocket) example above). For HTTP calls, see Speech synthesis for the response field location. Always retain this ID when troubleshooting.

  • First-packet latency: Retrieved via synthesizer.get_first_package_delay() in the WebSocket SDK. This is the time (in milliseconds) until the first audio packet arrives, and it is the core metric for real-time synthesis experience. The Real-time (Python WebSocket) example above prints this value.

  • Error code and message: The code and message fields in error responses distinguish between authentication failures, quota exhaustion, model not ready, text length exceeded, and other failure types. Set up alerts and troubleshooting paths per error code.

Recommended fields to log (minimum set for tracing):

| Metric | HTTP response field | WebSocket SDK method | Suggested log key |
| --- | --- | --- | --- |
| Request ID | request_id | synthesizer.get_last_request_id() | tts_request_id |
| First-packet latency | Not applicable (HTTP returns complete audio) | synthesizer.get_first_package_delay() | tts_first_package_ms |
| Error code and message | code / message | code / message in error callback | tts_error_code / tts_error_msg |

Correlate the request_id with your application's request ID in logs, and retain records for at least your SLA retention period.
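The logging fields above can be assembled with a small helper. This is a minimal sketch, not part of the DashScope SDK: the function names and the app_request_id key are hypothetical, and only the tts_* keys come from the table above.

```python
import json
import logging

logger = logging.getLogger("tts")


def build_tts_log_record(app_request_id, tts_request_id,
                         first_package_ms=None,
                         error_code=None, error_msg=None):
    """Assemble the minimum tracing fields under the suggested log keys."""
    record = {
        # Hypothetical key: your own application's request ID, for correlation.
        "app_request_id": app_request_id,
        "tts_request_id": tts_request_id,
        # Stays None for HTTP calls, which return the complete audio at once.
        "tts_first_package_ms": first_package_ms,
    }
    if error_code is not None:
        record["tts_error_code"] = error_code
        record["tts_error_msg"] = error_msg
    return record


def log_tts_call(record):
    """Emit one JSON line per call; failed calls are logged at ERROR level."""
    line = json.dumps(record, ensure_ascii=False, sort_keys=True)
    level = logging.ERROR if "tts_error_code" in record else logging.INFO
    logger.log(level, line)
    return line
```

With the WebSocket SDK, you would pass synthesizer.get_last_request_id() and synthesizer.get_first_package_delay() as the second and third arguments after each call.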

Apply in production

A RUNNING status doesn't mean the deployment is ready for production. Before routing traffic, address the following four fine-tuning-specific practices to improve launch quality, control costs, and reduce maintenance risk. For call-level observability and logging, see Observability and troubleshooting.

Note

Validate base capabilities before going live: Non-voice capabilities of the fine-tuned model (polyphone and proper-noun pronunciation, long-text stability, etc.) are inherited from cosyvoice-v3-flash, but may be affected by training data quality and epoch count (see Audio specifications and Hyperparameters). Sample-test in your target use case before scaling up traffic.

Choose a better checkpoint

The default finetuned_output deployed earlier corresponds only to the top-ranked checkpoint from Checkpoints, which isn't necessarily the one that sounds best to human ears. Before going live, manually evaluate the top 2–3 checkpoints and select the one closest to your target voice.

Steps to switch to a different checkpoint:

  1. Call GET /api/v1/fine-tunes/{job_id}/checkpoints to list all checkpoints for the job. Get the model ID for each from the output[*].model_name field (only returned when status=SUCCEEDED). For full field descriptions, see List checkpoints for a fine-tuning job.

  2. Use the selected model_name as the input for the "Create a deployment" step in Method 2: Deploy through the API (recommended for automation) to get a separate deployed_model. Each checkpoint requires its own independent deployment instance — you can't swap the underlying model on an existing deployment.

  3. Running multiple evaluation deployments simultaneously incurs billing for each deployment's model units. Refer to Deployment costs to estimate concurrent evaluation costs, and decommission rejected instances promptly after comparison (see Decommission unused deployments).
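Step 1 above might be scripted as follows. This is a sketch under stated assumptions: the host https://dashscope.aliyuncs.com and the Bearer authorization header are assumed rather than taken from this section, output is assumed to be a list in ranked order, and both function names are hypothetical. Check the List checkpoints reference for the actual response shape.

```python
import json
import os
import urllib.request

# Assumption: HTTP endpoint host for the China (Beijing) region.
BASE_URL = "https://dashscope.aliyuncs.com"


def list_checkpoints(job_id, api_key=None):
    """Call GET /api/v1/fine-tunes/{job_id}/checkpoints and return the parsed JSON."""
    api_key = api_key or os.environ.get("DASHSCOPE_API_KEY")
    req = urllib.request.Request(
        f"{BASE_URL}/api/v1/fine-tunes/{job_id}/checkpoints",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="GET",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def deployable_model_names(response, limit=3):
    """Collect model_name values from output[*], skipping entries without one.

    Entries lacking model_name are treated as not yet deployable.
    """
    names = [
        item["model_name"]
        for item in response.get("output", [])
        if isinstance(item, dict) and item.get("model_name")
    ]
    return names[:limit]
```

Each returned model_name can then be deployed separately for a listening comparison, as described in step 2.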

Decommission unused deployments

Decommission deployment instances immediately once they no longer serve traffic. Common scenarios: instances rejected after multi-checkpoint evaluation, old instances replaced by a new version, instances used only for load testing or validation, and instances from failed deployments that need to be recreated.

How to decommission: Call DELETE /api/v1/deployments/{deployed_model}. When the response shows output.status as DELETING, the request has been accepted. For full field descriptions, see Delete a deployment.

Important

Deployment costs are billed by model unit usage duration: billing continues as long as the deployment is running, even with zero call traffic. The multi-checkpoint evaluation period and version-switching period are especially prone to accumulating idle instances. Decommission immediately once evaluation or switching is complete. For the billing formula and start/end times, see Deployment costs.
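A decommission call might look like the sketch below. The host and the Bearer authorization header are assumptions, and the function names are hypothetical. Only the DELETE path and the output.status value DELETING come from this guide.

```python
import json
import os
import urllib.parse
import urllib.request

# Assumption: HTTP endpoint host for the China (Beijing) region.
BASE_URL = "https://dashscope.aliyuncs.com"


def delete_deployment(deployed_model, api_key=None):
    """Call DELETE /api/v1/deployments/{deployed_model} and return the parsed JSON."""
    api_key = api_key or os.environ.get("DASHSCOPE_API_KEY")
    # Percent-encode the ID so unusual characters cannot alter the URL path.
    model_path = urllib.parse.quote(deployed_model, safe="")
    req = urllib.request.Request(
        f"{BASE_URL}/api/v1/deployments/{model_path}",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="DELETE",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def delete_accepted(response):
    """Return True when output.status is DELETING, meaning the request was accepted."""
    return (response.get("output") or {}).get("status") == "DELETING"
```

Checking delete_accepted on every response, and alerting when it is False, helps catch instances that keep accruing charges after a failed delete.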

Plan replica count and scale

Adjust the replica count when business concurrency increases, overall throughput is insufficient, or you need elastic scaling for peak events. Different deployment templates (Single-machine deployment and Single-machine deployment - flagship complex inference) offer different per-replica compute and suit different scenarios — see Deployment templates. Determine replica count based on your concurrency requirements.

How to scale: Call PUT /api/v1/deployments/{deployed_model}/scale with the new capacity in the request body. Note that capacity is the total model unit count (not the replica count), and must be an integer multiple of that template's capacity_unit_per_instance. See Examples for value examples. For full field descriptions, see Scale a deployment.

Replica count linearly affects billing. Estimate cost changes using the formula in Deployment costs before scaling up.
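The capacity rule can be checked before sending the scale request. The sketch below assumes that total capacity equals replica count multiplied by capacity_unit_per_instance, which follows from the multiple-of rule above but is not stated as a formula in this guide. The function names are hypothetical, and the request body is reduced to the capacity field alone.

```python
def capacity_for_replicas(replicas, capacity_unit_per_instance):
    """Convert a desired replica count into the total model unit count."""
    if replicas < 1 or capacity_unit_per_instance < 1:
        raise ValueError("replicas and capacity_unit_per_instance must be positive")
    return replicas * capacity_unit_per_instance


def build_scale_body(capacity, capacity_unit_per_instance):
    """Validate capacity and build the body for PUT .../scale.

    Raises ValueError when capacity is not a positive integer multiple
    of the template's capacity_unit_per_instance.
    """
    if capacity_unit_per_instance < 1:
        raise ValueError("capacity_unit_per_instance must be positive")
    if capacity < 1 or capacity % capacity_unit_per_instance != 0:
        raise ValueError(
            f"capacity {capacity} is not a positive multiple of "
            f"{capacity_unit_per_instance}"
        )
    return {"capacity": capacity}
```

For example, three replicas on a template whose capacity_unit_per_instance is 2 give capacity_for_replicas(3, 2), which is 6, and build_scale_body(6, 2) returns the body to send.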

Switch versions after re-training

When adding new data or adjusting hyperparameters, always re-train from the base model cosyvoice-v3-flash — don't train incrementally on an existing fine-tuned model. Re-training produces a new finetuned_output: a different model, not a new version of the same model. To use it, create a new deployment instance — you can't swap the underlying model on an existing deployed_model.

Recommended switching workflow:

  1. Deploy the new model, identified by its model_name, following Method 2: Deploy through the API (recommended for automation). This creates a separate deployment with a new deployed_model. Don't scale or modify settings on the old instance, because neither operation changes its underlying model.

  2. Route canary or shadow traffic to both old and new deployed_model instances simultaneously. Compare audio quality, first-packet latency, and error rate.

  3. After validation passes, shift all traffic to the new deployment, then decommission the old one (see Decommission unused deployments) to complete the version switch.
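For the canary variant of step 2, traffic can be split in the application layer. The sketch below is one possible approach, not a Model Studio feature: it hashes a request key into 100 buckets so that the same key always reaches the same deployment. The function name and parameters are hypothetical, and shadow traffic (sending every request to both deployments) is not covered.

```python
import hashlib


def pick_deployed_model(request_key, old_model, new_model, canary_percent):
    """Return the deployed_model that should serve this request.

    Routing is deterministic: a given request_key always maps to the same
    bucket, so a user or session stays on one voice during the comparison.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return new_model if bucket < canary_percent else old_model
```

Pass the result as the model argument when constructing the synthesizer, and record it next to the request ID so that latency and error rate can be compared per deployment. Raising canary_percent to 100 completes the shift described in step 3.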

Important

Fine-tuned models are frozen at the base model version used during training: When cosyvoice-v3-flash receives future upgrades (new languages, expanded SSML tags, etc.), deployed fine-tuned models won't automatically inherit those enhancements — their capabilities are locked to the base model version at training time. To use upgraded base model capabilities, re-train and re-deploy on the new version.

API reference

The following table summarizes all APIs used in this guide. The In this guide column links to the relevant step, and the Full reference column links to the field-level API documentation.

| API | Method / Path | Purpose | In this guide | Full reference |
| --- | --- | --- | --- | --- |
| Upload training file | POST /api/v1/files | Upload the training dataset zip | Upload the training file | Upload a file |
| Create fine-tuning job | POST /api/v1/fine-tunes | Start a training job | Create a fine-tuning job | Create a tuning job |
| Query job details | GET /api/v1/fine-tunes/{job_id} | Track training progress | Query job details | Get fine-tuning job details |
| Get training logs | GET /api/v1/fine-tunes/{job_id}/logs | Pull training logs for troubleshooting | Get training logs | Query fine-tuning logs |
| List checkpoints | GET /api/v1/fine-tunes/{job_id}/checkpoints | Enumerate checkpoint candidates and choose the best one for deployment | Choose a better checkpoint | List checkpoints |
| Query deployable models | GET /api/v1/deployments/models | Get deployment templates and specs | Method 2, Step 1 | List deployable models |
| Create deployment | POST /api/v1/deployments | Deploy the fine-tuned model | Method 2, Step 2 | Create deployment |
| Query deployment status | GET /api/v1/deployments/{deployed_model} | Poll deployment status | Method 2, Step 3 | Get deployment details |
| Scale deployment | PUT /api/v1/deployments/{deployed_model}/scale | Adjust total model unit count | Plan replica count and scale | Scale a deployment |
| Decommission deployment | DELETE /api/v1/deployments/{deployed_model} | Remove unused deployments and stop billing | Decommission unused deployments | Delete a deployment |
| Speech synthesis (HTTP) | POST /api/v1/services/audio/tts/SpeechSynthesizer | Synthesize complete text in one request | Call the model | Non-real-time speech synthesis - CosyVoice API reference |
| Speech synthesis (WebSocket) | wss://dashscope.aliyuncs.com/api-ws/v1/inference | Streaming synthesis | Call the model | Real-time speech synthesis - CosyVoice API reference |