TTS FAQ - Intelligent Speech Interaction (ISI) - Alibaba Cloud Help Center

This topic answers common questions about the TTS service.

This FAQ is divided into the following categories:

Features
Performance

Features

WAV file duration and actual audio length discrepancy

TTS uses streaming synthesis, which returns data as it is synthesized. Therefore, the duration in the WAV file header is an estimate and may be inaccurate. If you require an exact duration, set the format parameter to PCM. After you receive the complete audio data, you can add your own WAV header to obtain a more accurate duration. For details, see the API reference.
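
Adding your own WAV header can be sketched with Python's standard wave module. This is a minimal sketch, not the service's SDK: the function names are illustrative, and the defaults (16-bit mono PCM at 16,000 Hz) are assumptions that must match the sample rate and format you actually requested.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap complete raw PCM audio in a WAV container.

    The header is written once all data is known, so the stored frame
    count (and therefore the duration) matches the audio exactly.
    Defaults are assumptions: 16-bit mono PCM at 16 kHz.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Read the duration back from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Call pcm_to_wav only after the last audio chunk has arrived; the duration is then derived from the real byte count rather than an estimate.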

What is the TTS timestamp feature?

The real-time TTS service outputs the time position of each Chinese character or English word within the audio stream together with the audio. This is known as the timestamp feature, also referred to as the character-level phoneme boundary interface. The timestamp information, including the begin_index and end_index values for each character, can be used to drive virtual avatar lip movements and to generate subtitle timecodes for video dubbing. For details, see Introduction to the TTS timestamp feature.

Control how digits are read

Use the SSML feature to control how digits are read. SSML is an XML-based markup language for speech synthesis that provides fine-grained control over how text is converted to speech. It lets you control what is read and how it is read, including word segmentation, pronunciation, speed, pauses, pitch, volume, and even background music. For details, see Introduction to SSML.
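
As an illustration, the following sketch builds an SSML string that asks for a number to be read digit by digit. The say-as element with interpret-as="digits" follows common SSML usage; whether this service accepts exactly this tag and value is an assumption here, so confirm the supported tags in Introduction to SSML. The helper name digits_ssml is hypothetical.

```python
from xml.sax.saxutils import escape


def digits_ssml(before: str, number: str, after: str = "") -> str:
    """Build an SSML string that asks for `number` to be read digit by digit.

    The say-as tag and its "digits" value are assumptions based on
    common SSML usage; check the service's SSML reference.
    """
    return (
        "<speak>"
        f"{escape(before)}"
        f'<say-as interpret-as="digits">{escape(number)}</say-as>'
        f"{escape(after)}"
        "</speak>"
    )
```

Escaping the surrounding text keeps characters such as & and < from breaking the XML.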

Handling polyphonic characters

When the service encounters a polyphonic character that is not part of a common word or phrase, it predicts the most likely pronunciation based on the surrounding context.

Are there invocation limits for Long Text-to-Speech?

The Long Text-to-Speech feature synthesizes very long texts (such as thousands or tens of thousands of characters) into audio data. It supports output in PCM, WAV, and MP3 formats. You can also set the speech rate, pitch, and volume, and select from various male and female voices. You can retrieve the synthesis results through both real-time and asynchronous methods.

Note

Long Text-to-Speech only supports API calls. No web page entry is available.

The main difference between Long Text-to-Speech and the standard TTS service is the character limit: the standard TTS service supports texts up to 300 characters, whereas Long Text-to-Speech is designed for longer content and supports up to 100,000 characters in a single, fast synthesis request. For details, see the API reference.

Commercial edition: Audio synthesized using the Long Text-to-Speech commercial edition may be resold, provided the data complies with applicable laws and regulations.

How can I download or save audio generated by speech synthesis in the Model Studio console?

The Model Studio console and web-based interface only support online playback. They do not provide a direct download or save button. To obtain the audio file, call the API or use the SDK to write the generated audio data to a local file.

As a temporary workaround, on the Playground page, you can open your browser's developer tools (F12), go to the Network tab, and find the audio stream URL in the network requests to download the audio directly.

Does speech synthesis support multi-character dialogue and dynamic dialect switching, and which languages are supported?

Multi-character dialogue: Directly generating a single audio file that contains multiple voices is not supported. Synthesize each character's lines separately with a different voice, then concatenate the audio files.
Dialect support: The SDK supports passing specified text and selecting the corresponding dialect voice when making calls. However, the product only provides the underlying synthesis capability and does not include dialogue logic generation.
Language support: Thai, Vietnamese, Indonesian, and Filipino are currently supported. Malay is not yet supported. CosyVoice system voices are expected to support Portuguese in June.
Voice query: There is currently no API to query the complete list of available voices. Refer to the official documentation or the product page to preview voices.
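
The multi-character workaround above (synthesize each voice separately, then concatenate) can be sketched as follows. This is a minimal sketch that assumes every clip was requested as raw 16-bit mono PCM at the same sample rate; the function name join_dialogue and the 300 ms gap between turns are illustrative choices, not service behavior.

```python
import io
import wave


def join_dialogue(pcm_segments: list[bytes], sample_rate: int = 16000,
                  gap_ms: int = 300) -> bytes:
    """Concatenate per-speaker PCM clips into one WAV file.

    Assumes 16-bit mono PCM at `sample_rate` for every clip. A short
    silence is inserted between consecutive turns.
    """
    if not pcm_segments:
        raise ValueError("at least one segment is required")
    # 16-bit mono silence: two zero bytes per sample.
    silence = b"\x00\x00" * (sample_rate * gap_ms // 1000)
    pcm = silence.join(pcm_segments)
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return out.getvalue()
```

Requesting PCM rather than WAV for each turn avoids having to strip per-clip headers before joining.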

What are the commercial authorization, resale rights, and hardware device usage restrictions for speech synthesis?

Commercial use and resale: Audio synthesized using the Long Text-to-Speech commercial edition may be resold, provided the data complies with applicable laws and regulations.
Hardware use: Synthesized audio may be embedded in hardware devices. Mechanical reproduction rights restrictions do not apply.
Free quota: All voices provided after enabling the speech synthesis API can be used free of charge. The offline speech synthesis service also includes a free quota and can be previewed without purchase.

Do WeChat mini programs, Douyin mini programs, and UniApp provide speech synthesis SDKs?

WeChat mini programs: The SDK supports one-time short-text synthesis only and does not support streaming output. CosyVoice does not yet provide a dedicated WeChat mini program SDK. For streaming playback, refer to the WebSocket protocol documentation for custom integration.
Douyin mini programs / UniApp: No dedicated SDK is available. Use the RESTful API for integration.
C# offline development: The offline SDK only supports iOS and Android and does not support C#. C# development must call the online service.

Does long-text speech synthesis have a web page entry, and what are the recommended polling intervals?

Access method: Long Text-to-Speech currently only supports API calls. No web page entry is available.
Polling recommendations: For approximately 300 characters of text, synthesis typically completes within tens of seconds. Wait 30 seconds before the first polling attempt, then query every 10 seconds thereafter.
Latency note: CosyVoice long-text synthesis has slightly higher latency than standard synthesis. The system has been optimized to address this.
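
The polling recommendation above can be sketched as a small loop. The fetch callable is hypothetical: it stands in for whatever status query your integration uses and is assumed to return the result when synthesis is complete and None otherwise. The max_polls cap is also an assumption added so the loop cannot run forever.

```python
import time
from typing import Callable, Optional


def wait_for_result(fetch: Callable[[], Optional[str]],
                    first_wait: float = 30.0,
                    interval: float = 10.0,
                    max_polls: int = 30,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Wait `first_wait` seconds, then poll every `interval` seconds.

    `fetch` is a hypothetical status query that returns the result
    once synthesis is done, or None while it is still running.
    """
    sleep(first_wait)
    for attempt in range(max_polls):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"no result after {max_polls} polls")
```

Passing sleep as a parameter makes the timing easy to test without real delays.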

What are the usage restrictions for voice cloning, custom voices, and specific models?

Voice cloning model: The qwen3-tts-vc series models are recommended for voice cloning.
Qwen-Voice-Design voices: Cloned voices cannot be downloaded. Use the system-generated voice ID for subsequent synthesis.
MRCP protocol restrictions: Only voices listed in the short-text speech synthesis API documentation are supported. Custom voices generated by Qwen-TTS are not supported.
Children's voices: API calls are supported. Specify the voice directly in your code. You do not need to create a separate project.
Offline Cantonese voice: Purchasing Cantonese voice packages for offline use is not supported.

Where can I try the Model Studio speech synthesis feature?

To try speech synthesis on Model Studio, visit the speech synthesis demo page in the Model Studio console.

Performance

Character limit for TTS requests

The TTS service limits characters per request to avoid wasting server resources on long audio files that might not be fully used. When using the API or SDK, you can split the text into segments and then concatenate the results. For calls using the MRCP protocol, common in customer service or call center scenarios, synthesizing long text is impractical because the resulting lengthy playback is inconsistent with typical user interaction and is often interrupted. For very long texts, such as news articles with thousands of characters, use the Long Text-to-Speech API, which supports a single, fast synthesis request of up to 100,000 characters. For details, see the API reference.
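
Splitting text into segments can be sketched as follows. This is a rough sketch: the 300-character default comes from the limit quoted in this FAQ, and it assumes the limit is counted in Python string characters. The function name split_text is illustrative, and the punctuation rule is deliberately simple, so a production splitter would need to handle decimals and abbreviations.

```python
import re


def split_text(text: str, limit: int = 300) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Chunks end at sentence punctuation where possible; a sentence
    longer than `limit` is cut at the limit as a last resort.
    """
    # Break after Chinese or Western sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[。！？.!?;；])", text) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks reproduces the original text, so each chunk can be synthesized in order and the audio concatenated.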

Handling error code 144005

Error code 144005 (TTS_CLOUD_EXCEED_CONCURRENCY) indicates that you have exceeded the concurrency limit. The free tier of the TTS service provides a default concurrency of 2. When the number of simultaneous synthesis requests exceeds this limit, the system returns this error code, and subsequent tasks are queued, leading to response delays.
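
One way to stay under the limit is to cap the number of in-flight requests on the client side. The sketch below uses a thread pool sized to the concurrency quota; the synthesize callable is hypothetical and stands in for your own function that sends one synthesis request and returns the audio bytes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def synthesize_all(texts: Iterable[str],
                   synthesize: Callable[[str], bytes],
                   max_concurrency: int = 2) -> list[bytes]:
    """Run `synthesize` over all texts with a bounded number of workers.

    `synthesize` is a hypothetical callable that performs one request.
    With `max_concurrency` workers, at most that many calls run at once.
    Results are returned in the same order as the input texts.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(synthesize, texts))
```

The default of 2 mirrors the free-tier concurrency described above; raise it only after your quota has been increased.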

You can monitor your concurrency usage on the Monitoring Statistics page of the Intelligent Speech Interaction console. This page provides statistics for both call volume and concurrency, letting you filter data by service, project, region, and time range. If your concurrency usage is consistently high, we recommend increasing your quota in advance.

To increase your concurrency quota, navigate to the Overview page in the console and click Upgrade to Commercial Edition next to the TTS service. Upgrading provides a higher concurrency limit.

Slow synthesis or high latency

As speech synthesis quality improves, the underlying algorithms become more complex, which can increase synthesis latency. This is more noticeable with computationally intensive, high-quality voices. To significantly reduce perceived latency, use the streaming synthesis feature. This lets you save or play audio data as you receive it from the server, instead of waiting for the entire synthesis to complete.

First, ensure you are measuring the correct metric. For most applications, the key metric is first-packet latency: the time between the client sending the complete synthesis request and receiving the first packet of audio data from the server.
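
First-packet latency as defined above can be measured with a small helper. This is a sketch under assumptions: start_stream is a hypothetical callable that sends the complete synthesis request and returns an iterable of audio chunks as they arrive; it is not part of any SDK.

```python
import time
from typing import Callable, Iterable, Optional


def first_packet_latency(start_stream: Callable[[], Iterable[bytes]],
                         clock: Callable[[], float] = time.perf_counter
                         ) -> Optional[float]:
    """Seconds from sending the request to the first non-empty audio chunk.

    `start_stream` is a hypothetical callable that sends the request
    and yields audio chunks. Returns None if no audio arrives.
    """
    sent_at = clock()
    for chunk in start_stream():
        if chunk:
            return clock() - sent_at
    return None
```

Measuring to the first chunk rather than the last separates perceived responsiveness from total synthesis time.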

When calling the TTS service using an SDK, you can implement streaming synthesis by using the streaming callback API to receive audio data. The client can begin playback upon receiving the first audio chunk. For specific API and parameter details, see the API reference.

For latency-sensitive scenarios, use the CosyVoice large model streaming synthesis service. This service returns audio as it is being synthesized, further reducing first-packet latency.

Key points for optimizing first-packet latency: Choose a standard voice with lower algorithmic complexity to reduce latency. High-quality voices produce better results but require more processing time. If your concurrency limit is reached, the system returns error code 144005; in this case, refer to the troubleshooting steps for exceeding the concurrency limit.

Pronunciation accuracy in TTS

TTS is based on probabilistic models, so some mispronunciations are unavoidable. Industry-standard pronunciation accuracy is typically between 96% and 98%. In general use cases, Alibaba Cloud Intelligent Speech Interaction products achieve an accuracy of approximately 97%. This means not all pronunciation errors can be fixed automatically. We recommend working around mispronunciations by rephrasing the text or using the SSML feature for more control.

Fix pronunciation errors and control polyphonic characters

You can handle pronunciation issues in the following ways:

Try replacing the mispronounced character with a homophone (a different character with the same sound) to quickly fix the issue.
Use the SSML feature. SSML is an XML-based markup language for speech synthesis that provides fine-grained control over how text is converted to speech. It lets you control what is read and how it is read, including word segmentation, pronunciation, speed, pauses, pitch, volume, and even adding background music. For details, see Introduction to SSML.

Latency differences among voices

The real-time factor of speech synthesis depends on the model's algorithmic complexity. The fastest models can synthesize 33 seconds of audio in one second, while the slowest may synthesize only 0.7 seconds. Standard and high-quality voices have different latencies. Higher-quality voices use more advanced algorithms and thus require more processing time.
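
To make these figures concrete, the sketch below uses the common definition of real-time factor: processing time divided by the duration of the audio produced. That definition is an assumption here, since this FAQ does not state the exact formula the service uses.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted above: audio synthesized per second of processing.
fastest = real_time_factor(1.0, 33.0)  # about 0.03, far faster than real time
slowest = real_time_factor(1.0, 0.7)   # about 1.43, slower than real time
```

A factor above 1 means audio is produced more slowly than it plays back, so streamed playback of such a voice may need buffering.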

Recognized punctuation marks

The TTS service reads special symbols aloud and pauses where appropriate. For example, α, β, γ, ρ, sin, cos, and tan are read aloud, the percent sign (%) is read as 'percent', and the service pauses at colons and parentheses. However, title marks and em dashes are not currently recognized.

Partial text speed control

You can change the speech rate of part of the text with the SSML feature. For details, see Introduction to SSML.

What are the timestamp calculation rules, and what splicing approach is recommended for segmented synthesis?

SSML pause timestamps: When using the <break> tag to control pauses, the returned subtitle timestamps do not include the pause duration. The index (begin_index, end_index) is calculated based on the plain-text position after removing SSML tags.
Splicing recommendations: Splitting text into multiple segments, synthesizing each separately, and then concatenating with timestamp corrections is not recommended, as this approach introduces errors. Use SSML to synthesize the full text in a single call to ensure audio continuity and timestamp accuracy.
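
Because the indexes refer to the text with SSML tags removed, mapping a timestamp entry back to its characters can be sketched as follows. This sketch assumes end_index is exclusive and that the text contains no XML entities such as &amp;amp;; both are assumptions, so check the API reference and adjust the slice if needed. The function names are illustrative.

```python
import re


def strip_ssml(ssml: str) -> str:
    """Remove SSML tags, leaving the plain text that indexes refer to."""
    return re.sub(r"<[^>]+>", "", ssml)


def text_for(ssml: str, begin_index: int, end_index: int) -> str:
    """Return the text covered by one timestamp entry.

    Assumes end_index is exclusive; adjust the slice if the service
    reports an inclusive end.
    """
    return strip_ssml(ssml)[begin_index:end_index]
```

For example, with the input below the break tag contributes nothing to the index positions.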

How do I handle connection errors and interaction flow anomalies in streaming speech synthesis (CosyVoice/WebSocket)?

Common scenarios and their solutions are as follows:

WebSocket connection rejected (readyState=3): Confirm that you are connecting to the Beijing endpoint (wss://nls-gateway-cn-beijing.aliyuncs.com/ws/v1), which is the only endpoint the service supports. Also send text within 10 seconds of receiving the SynthesisStarted event to avoid an idle timeout (error code 40000004).
MESSAGE_INVALID / BALANCING error: This occurs when RunSynthesis is sent before the SynthesisStarted event is received. The correct sequence is: StartSynthesis → wait for SynthesisStarted → RunSynthesis.
STATE_FAIL error: Streaming synthesis does not support the trial edition. Activate the commercial edition to use this feature.
Protocol confusion: CosyVoice streaming synthesis and real-time speech recognition may use the same WSS address, but their protocol parameters are completely different and must not be used interchangeably.
Abnormal data (fin=1, opcode=8): Provide the complete return log and a screenshot of the Demo parameters for further investigation.
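
The ordering and idle-timeout rules above can be enforced on the client with a small guard object. This is a sketch, not the real protocol: the send callable is hypothetical, and the strings it receives are placeholders for the actual message payloads, which are defined in the WebSocket protocol documentation.

```python
import time
from typing import Callable, Optional


class SynthesisFlow:
    """Enforce StartSynthesis -> SynthesisStarted -> RunSynthesis."""

    def __init__(self, send: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic,
                 idle_limit: float = 10.0) -> None:
        self._send = send            # hypothetical: writes one message
        self._clock = clock
        self._idle_limit = idle_limit
        self._last_activity: Optional[float] = None

    def start(self) -> None:
        self._send("StartSynthesis")  # placeholder payload

    def on_event(self, name: str) -> None:
        # Call this for every event name received from the server.
        if name == "SynthesisStarted":
            self._last_activity = self._clock()

    def run(self, text: str) -> None:
        if self._last_activity is None:
            raise RuntimeError("wait for SynthesisStarted before RunSynthesis")
        if self._clock() - self._last_activity > self._idle_limit:
            raise TimeoutError("idle limit exceeded; start a new session")
        self._send(f"RunSynthesis:{text}")  # placeholder payload
        self._last_activity = self._clock()
```

Failing fast on the client makes an ordering mistake visible in your own logs instead of surfacing later as a server-side error.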

How do I resolve integration errors and environment dependency issues with the offline speech synthesis SDK?

System dependencies: Offline synthesis relies on the SDK's own capabilities and does not depend on the phone's system TTS engine.
tdata.bin stat is invalid (ret:-1) / error 140900: This is caused by a missing voice file or insufficient read and write permissions. Check the /data/user/0/com.aliyun.nls/cache/nls_tts/tts/voices directory for correct permissions and proper ZIP extraction. Follow the official Demo configuration as a reference.
SDK differences: The offline speech synthesis SDK and the offline speech-to-text SDK are separate products. Fully offline speech-to-text is not currently supported.
Device restrictions: There is no minimum device count limit. However, the system enforces a maximum of 5 DeviceID registrations per day. This is a system-level limit and cannot be adjusted.

How do I troubleshoot abnormal speech synthesis output, missing audio, or poor audio quality?

Response received but no audio: The server returns binary audio data. The client must write it to a player or save it as a file. The data itself does not include a playback function.
Fixed 32-byte stream returned: This is typically caused by using a language that the selected voice does not support. No error is returned in this case, but output quality cannot be guaranteed.
No monitoring data / TaskID retrieval: The console cannot display the TaskID directly. Retrieve it from the API response or application logs. You can also view call statistics on the console statistics page.
Console monitoring capability selection unavailable: Deselect the currently selected capability, then make your selection again.
Voice priority: The voice configured in the SDK code takes precedence over the console configuration.
Singapore endpoint error 418: This indicates an unsupported voice is being used (for example, cally_ecmix). Switch to a supported voice such as Aixia or Xiaoyun.
Audio content does not match text: For voice cloning, re-record a clean audio clip that has no background noise and a complete ending.