This document answers common questions about the Speech Synthesis service.
This FAQ is organized into the following categories:
Features
Performance
Features
WAV file duration mismatch
TTS uses streaming synthesis, which returns audio data as it is generated. As a result, the WAV file header contains only an estimated duration, which can be inaccurate. If a precise duration is critical, set the format parameter to pcm. After receiving the complete audio for an utterance, you can add a new WAV header to the file to ensure an accurate duration. For details, see the API Reference.
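As a rough illustration of the re-wrapping step, the sketch below uses Python's standard wave module to put a fresh WAV header around PCM data once the whole utterance has been received. The sample rate, channel count, and sample width are assumptions and must match the values used in your synthesis request; the function names are hypothetical.

```python
import io
import wave


def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap complete raw PCM audio in a WAV container with exact sizes in the header."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit (assumed)
        wav_file.setframerate(sample_rate)   # assumed; use the rate you requested
        wav_file.writeframes(pcm_bytes)      # header is finalized when the writer closes
    return buffer.getvalue()


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Read the duration back from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

Because the header is written after all frames are known, the duration reported by players matches the actual audio length.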
Timestamp feature
The timestamp feature provides the time position for each character or word as the audio stream is generated. This feature is also referred to as the word-level phoneme boundary interface. You can use this timing information to drive lip-sync for virtual avatars, create subtitles for voice-over videos, and more. For more information, see Introduction to the Speech Synthesis Timestamp Feature.
Controlling pronunciation of digits
To control how digits are read, use Speech Synthesis Markup Language (SSML). SSML is an XML-based markup language that lets you control how the service synthesizes text, including sentence breaking and tokenization, pronunciation, speech rate, pauses, pitch, and volume. You can even use it to add background music. For details, see the SSML Markup Language Guide.
Pronunciation of polyphonic characters
For polyphonic characters not found in common phrases, TTS predicts the pronunciation based on the surrounding context.
Usage limits for Long Text-to-Speech
Long Text-to-Speech converts very long texts (thousands or tens of thousands of characters) into binary audio data. It supports PCM, WAV, and MP3 output formats. You can also adjust the speech rate, pitch, and volume, and choose between male and female voices. Results can be retrieved in real time or asynchronously.
The main difference between Long Text-to-Speech and the standard Speech Synthesis service is the text length limit. The standard service supports texts up to 300 characters, while Long Text-to-Speech is designed for synthesizing texts up to 100,000 characters in a single request. For more information, see the API Reference.
Performance
Character limits for requests
Character limits on Speech Synthesis requests help conserve server-side resources, as very long synthesized texts are often not fully used. If you are using an API or SDK, you can split the text into smaller segments, synthesize them individually, and then concatenate the results. For calls using the MRCP protocol, which is common in customer service and call center scenarios, synthesizing very long texts is impractical for human-computer interaction, as the audio playback would be too long and likely interrupted. For very long texts, such as news articles with thousands of words, use the Long Text-to-Speech feature, which supports up to 100,000 characters in a single request. For details, see the API Reference.
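The split-and-concatenate approach for API or SDK callers could look roughly like the sketch below. It assumes the 300-character limit of the standard service and breaks text after common sentence-ending punctuation where possible; the function name and the choice of punctuation are illustrative, not part of any SDK.

```python
import re


def split_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into segments of at most max_chars, preferring sentence boundaries."""
    # Zero-width split after sentence-ending punctuation (Latin and CJK),
    # so joining the pieces reproduces the original text exactly.
    sentences = re.split(r"(?<=[.!?;。！？；])", text)
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if len(current) + len(sentence) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        segments.append(current)
    return segments
```

Each segment can then be synthesized in its own request. If you request PCM output, the resulting audio buffers can be concatenated in order before a single WAV header is added.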
Slow synthesis and high latency
As the quality of Speech Synthesis improves, the complexity of the underlying algorithms also increases. This can lead to longer synthesis times, especially when using advanced voices that require more computation. To minimize perceived latency, we recommend using streaming synthesis. This allows you to save or play the audio as you receive it from the server, significantly improving the user experience.
When measuring performance, it is important to distinguish between total synthesis time and first-packet latency. First-packet latency is the time from when the client sends the synthesis request to when it receives the first part of the binary stream from the server. In most real-time applications, this is the most critical metric to monitor.
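To make the two metrics concrete, the sketch below times any iterable of audio chunks. It is not tied to a particular SDK: fake_stream is a hypothetical stand-in for a streaming response, and the timer is assumed to start at the moment the request is sent.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_stream(chunks: Iterable[bytes]) -> tuple[bytes, Optional[float], float]:
    """Consume audio chunks and return (audio, first_packet_latency, total_time) in seconds."""
    start = time.monotonic()  # assumed to coincide with sending the request
    first_packet_latency: Optional[float] = None
    received = bytearray()
    for chunk in chunks:
        if first_packet_latency is None and chunk:
            # Time until the first non-empty piece of audio arrives.
            first_packet_latency = time.monotonic() - start
        received.extend(chunk)
    total_time = time.monotonic() - start
    return bytes(received), first_packet_latency, total_time


def fake_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesis response."""
    for _ in range(2):
        time.sleep(0.05)       # simulated server and network delay
        yield b"\x00" * 320    # simulated audio chunk


audio, first_packet, total = measure_stream(fake_stream())
print(f"first packet: {first_packet:.3f} s, total: {total:.3f} s, bytes: {len(audio)}")
```

In a real client, replace fake_stream() with the chunk iterator or callback data from your SDK, and log both values per request so that first-packet latency can be tracked separately from total synthesis time.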
Pronunciation accuracy
Speech Synthesis (TTS) is based on probabilistic models. The industry-standard pronunciation accuracy is typically between 96% and 98%. Alibaba Cloud's Intelligent Speech Interaction products achieve an accuracy of around 97% in general use cases. This means the model itself cannot eliminate every pronunciation error. If you encounter an error, we recommend either rephrasing the text with a synonym or using SSML to specify the correct pronunciation.
Correcting mispronunciations
To resolve pronunciation issues:
Try replacing the word with a homophone to quickly correct the pronunciation.
Use Speech Synthesis Markup Language (SSML) to precisely control pronunciation, speech rate, pauses, and other audio characteristics. For details, see the SSML Markup Language Guide.
Latency variation between voices
The synthesis speed depends on the complexity of the model and algorithm. Our fastest models can generate 33 seconds of audio in 1 second, while our most complex models generate 0.7 seconds of audio in 1 second. Latency differs between standard and premium voices. Generally, higher-quality voices use more advanced algorithms and thus have higher latency.
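These speeds are often expressed as a real-time factor (RTF), the synthesis time needed per second of audio. The sketch below applies that arithmetic to the two figures above; the function names are illustrative, and actual speeds depend on the voice and on server load.

```python
def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Synthesis time per second of audio; values below 1.0 are faster than real time."""
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate how long it takes to synthesize audio of the given duration."""
    return audio_seconds * rtf


fast_rtf = real_time_factor(audio_seconds=33.0, synthesis_seconds=1.0)  # about 0.03
slow_rtf = real_time_factor(audio_seconds=0.7, synthesis_seconds=1.0)   # about 1.43

# Rough time to synthesize a 10-second utterance with each kind of model.
print(round(estimated_synthesis_seconds(10.0, fast_rtf), 2))
print(round(estimated_synthesis_seconds(10.0, slow_rtf), 2))
```

An RTF above 1.0 means audio is produced more slowly than it plays back, so a client that plays such a stream as it arrives may need to buffer before starting playback.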
Recognized punctuation
The service reads many special characters and terms, such as Greek letters (α, β, γ, ρ), mathematical functions (sin, cos, tan), and the percent sign (%). Punctuation such as colons and parentheses creates natural pauses in the speech. Currently, book title marks and em dashes are not recognized.
Partial text speech rate adjustment
You can adjust the speech rate for only part of a text by using SSML. For details, see the SSML Markup Language Guide.
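As a loose illustration, the sketch below builds an SSML string in which only the middle portion of the text is slowed down. It uses the prosody element and rate attribute defined by the W3C SSML standard; whether this service accepts those exact tags and values is an assumption, so check the SSML Markup Language Guide before relying on them. The helper name is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def with_partial_rate(before: str, slowed: str, after: str, rate: str = "slow") -> str:
    """Build SSML in which only the middle part uses a different speech rate."""
    # <speak> and <prosody rate="..."> follow W3C SSML; support by this
    # service is assumed, not confirmed.
    return (
        "<speak>"
        f"{escape(before)}"
        f"<prosody rate={quoteattr(rate)}>{escape(slowed)}</prosody>"
        f"{escape(after)}"
        "</speak>"
    )


ssml = with_partial_rate("Your verification code is ", "4 7 2 9", ". Do not share it.")
print(ssml)
```

Escaping the text keeps characters such as & and < from breaking the XML when the input comes from users.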