This topic describes the latest feature updates for Intelligent Speech Interaction and lists the related documents.
April 2023 to January 2024
Feature category | Feature name | Feature description | Update type | Related documents |
Speech recognition | On-screen captions | Audio File Transcription, Audio File Transcription - Express Edition, and Audio File Transcription - Off-Peak Edition now support on-screen captions. | New | |
Speech recognition | Model Studio Service | Cost-effective real-time speech recognition is now available. | New | |
Speech synthesis | Model Studio Service | Cost-effective speech synthesis is now available. | New | |
Speech recognition | Model Studio Service | Model service - Audio File Transcription now supports the following languages and dialects: Mandarin Chinese, Chinese dialects (Cantonese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin), English, Japanese, Korean, Spanish, Indonesian, French, German, Italian, and Malay. | New | |
Speech synthesis | Minor language voices | Speech synthesis now supports voices in the following languages: Russian, Korean, Vietnamese, Thai, Italian, Spanish, French, German, and American English (male and female). | New | |
Speech recognition | Dialects | Added a 16 kHz Cantonese free-talk dialect model. | New | |
Speech synthesis | Digital human and multi-emotional voices | Added seven digital human voices: Zhibai, Zhixiaoxia, Zhixiaomei, Zhigui, Zhishuo, Aixia, and Cally. Added two multi-emotional voices: Zhifeng and Zhibing. | New |
March 2022 to March 2023
Feature category | Feature name | Feature description | Update type | Documentation |
Speech recognition | Added four new product specifications for voice analysis | New product specifications: | New | |
MP4 format supported for input audio files | Three services now accept MP4 files as input: | New | ||
Mobile Android/iOS SDK | | New | ||
C++ SDK | | New | ||
Added 16 kHz recognition capabilities | Chinese-English free-talk (mixed recognition), Cantonese (Traditional), Portuguese, Turkish, Greek, Javanese, Bengali, Czech, Urdu, Nepali, Mongolian (Outer Mongolia), Uzbek, Sinhala, Marathi, Telugu, Punjabi, Swedish, Bulgarian, Catalan, Hebrew, Croatian, Hausa, Burmese, Lao, Swahili, Azerbaijani, Persian, Danish, Norwegian, Malayalam, and Kannada. | New | ||
Added 8 kHz recognition capabilities | Cantonese (Traditional), Vietnamese, Thai, Malay, and Spanish. | New | ||
Increased the number of hotwords that can be added | The maximum number of words per group is increased from 128 to 500. | Optimized | ||
Speech synthesis | Added Pinyin-level phoneme timestamps | The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support Pinyin-level phoneme timestamps. | New | |
Added word-by-word timestamps | The Real-time Long-Text-to-Speech service now supports word-by-word timestamps. | Optimized | ||
Added multi-emotional voices | The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support the following voices: | New | ||
Added multilingual voices | The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support the following voices: | New | ||
Added premium Chinese voices | The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support the following voices: | New |
March 21, 2022
Feature category | Feature name | Feature description | Update type | Related documents |
Regions and Domain Names | Multiple regions | To further reduce network latency for users in North and South China, Intelligent Speech Interaction has added the China (Beijing) and China (Shenzhen) regions in addition to the existing China (Shanghai) region. | New | Related updated documents: |
March 04, 2022
Feature category | Feature name | Feature description | Update type | References |
Speech recognition | New SDKs | Added four SDKs: C#, Go, Node.js, and WeChat mini program. | New | |
Speech synthesis | New SDKs | Added four SDKs: C#, Go, Node.js, and WeChat mini program. | New |
February 17, 2022
Feature category | Feature name | Feature description | Update type | Related documentation |
Speech recognition | Optimized SDK features | Optimized the C++ SDK features. | Optimized | |
Speech synthesis | Optimized SDK features | Optimized the C++ SDK features. | Optimized |
February 09, 2022
Feature category | Feature name | Feature description | Update type | Related documents |
Speech recognition | Audio File Transcription - Off-Peak Edition | | New | What dialect models and languages are supported by the speech recognition service? |
January 21, 2022
Feature category | Feature name | Feature description | Update type | Related documents |
Speech recognition | Audio File Transcription - Off-Peak Edition | Audio File Transcription - Off-Peak Edition is a service for offline transcription of pre-recorded audio files. It differs from Audio File Transcription in its response time. The Off-Peak Edition returns results within 24 hours. | New | |
Speech synthesis | New voices - Chinese | | New | |
New voices - Multilingual | | New | |
December 23, 2021
Feature category | Feature name | Feature description | Update type | Documentation |
Speech recognition | Optimized SDK features | Optimized the Python SDK features. | Optimized | |
Speech synthesis | Optimized SDK features | Optimized the Python SDK features. | Optimized |
July 30, 2021
Feature category | Feature name | Feature description | Update type | Documentation |
Speech recognition | Shiyinshi model | The Shiyinshi model replaced 17 general-purpose or domain-specific models. | Optimized | None |
Console | Manage projects | Optimized the project creation flow. After a project is created, you are automatically guided to configure a recognition model or a synthesis voice. | Optimized | |
Self-learning - Customize language models | Optimized the voice model customization flow. Added clearer instructions for data format requirements to prevent incorrect operations due to unclear guidance. Provided more detailed error messages and suggested solutions. | Optimized | ||
Automated testing | Added shortcut buttons for viewing test results. | Optimized | ||
Billing | Clarified rules for metering and billing reports | Added clearer explanations in the console about the rules for displaying metering and billing statistics. For example, usage and fees for the current day can be viewed on the next day. | Optimized | None |
July 08, 2021
Feature category | Feature name | Feature description | Update type | Related documents |
Speech recognition | C++ SDK optimization | Published the user documentation for C++ SDK 3.0.10. | Optimized | |
Fixed a crash issue in the C++ SDK when processing WebSocket data. | Optimized | None | ||
Russian recognition optimization | Fixed an issue where spaces were missing in Russian recognition results. | Optimized | None | |
Speech synthesis | New voices | | New | |
Engine update | Voices in the ultra-high definition scenario now support streaming playback. | New | None | |
Improved the stability of the synthesis service. | Optimized | None | ||
English voice pause optimization | Updated the English voices Abby, Emily, and Eric to fix an issue with long pauses in some sentences. | Optimized | None |
June 03, 2021
Feature category | Feature name | Feature description | Update type | Related links |
Speech recognition | Semantic segmentation update | After semantic segmentation is enabled for real-time transcription, intermediate recognition results are processed by streaming inverse text normalization (ITN) by default. This fixes the issue of flickering numbers (changing from Chinese characters to Arabic numerals) in on-screen caption scenarios. | Optimized | None |
Speech synthesis | Supports free trial and self-service access for offline speech synthesis | | New | |
Model update | | New | None | |
Engine update | Supports the say-as tag in English Speech Synthesis Markup Language (SSML). | New | ||
SDK update | The SDK now supports setting the sample rate to 24 kHz and 48 kHz, in addition to the previously supported 8 kHz and 16 kHz. | New | None |
May 13, 2021
Feature category | Feature name | Feature description | Update type | Related links |
Speech recognition | Shiyinshi V1 - End-to-end Mandarin Chinese recognition model | | New | None |
Post-processing model update | | Optimized | None | |
VAD model update | The common_8k human-machine noise optimized model is now available. | Optimized | None | |
Speaker diarization model update | | Optimized | None | |
Speech synthesis | Added a UI-based download feature | On the Speech Synthesis configuration page in the console, you can now adjust the sample rate and format, and download the audio. | New | A TTS tool for beginners—synthesize and download audio without writing code |
Engine update | Optimized performance for the ultra-high definition scenario. | Optimized | None | |
Model update | Added six voices for the ultra-high definition scenario: Zhixiang, Zhiqian, Zhinan, Zhide, Zhiru, and Zhijia. | New | None |
March 23, 2021
Feature category | Feature name | Feature description | Update type | Related documents |
Speech synthesis | Added offline speech synthesis | The offline speech synthesis feature is released. | New | |
Added ultra-high definition synthesis voices | Added ultra-high definition voices: Zhiqi and Zhichu. | New | ||
Added synthesis voices | | New | ||
Optimized pause control | Upgraded the frontend pause model and added post-processing rules. The rate of unacceptable pauses in scenarios such as customer service, novels, news, and encyclopedias has decreased significantly. | Optimized | None | |
Fixed dictionary and number/symbol normalization rules | | Fixed | None | |
Speech recognition | Mandarin Chinese model | Improved recognition of rare characters. Improved the recognition performance of the 8 kHz general-purpose telephone customer service model on low-volume speech. | Optimized | None |
Mandarin Chinese model (upgraded) | | Optimized | None | |
Added a parameter for audio channel selection to Audio File Transcription (including Express Edition) | For multi-channel files, you can specify the channel to be transcribed using a parameter. This lets you skip unnecessary channels to save costs. | New | ||
Added semantic segmentation to Audio File Transcription (including Express Edition) | You can use a parameter to control whether to enable semantic segmentation. | New | ||
Product documentation update | | New |
November 27, 2020
Feature category | Feature name | Feature description | Update type | References |
Speech recognition | Audio File Transcription - Express Edition | Audio File Transcription - Express Edition supports speech recognition models for all scenarios. The console supports querying the call volume of Audio File Transcription - Express Edition. | New | |
Optimized support for WAV files in speech recognition | Optimized ASR support for WAV files. Supports more WAV file header formats to reduce the impact of file headers on recognition results. | Optimized | None | |
Audio File Transcription - Express Edition timeout | Fixed an issue where using a 16 kHz model for 8 kHz speech recognition in Audio File Transcription - Express Edition did not immediately return an error, causing a timeout. | Fixed | None | |
Access token | Optimized token generation mechanism | Improved the token generation mechanism by adding a token validity period. This avoids potential request failures caused by the original "update token every 24 hours" mechanism. | Optimized |
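
The validity period described above lends itself to a simple client-side pattern: reuse a token until shortly before it expires instead of refreshing on a fixed 24-hour schedule. The following is a minimal Python sketch of that pattern, not the service's actual API. It assumes a hypothetical `fetch_token` callable that returns the token string together with its expiration time as a Unix timestamp; the real token request and its response fields are not described in these notes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CachedToken:
    value: str
    expire_time: float  # Unix timestamp (seconds) at which the token expires


class TokenCache:
    """Reuses a token until shortly before its validity period ends."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        margin_seconds: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        # fetch_token is a hypothetical callable (an assumption for this
        # sketch) that returns (token, expire_time).
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._clock = clock
        self._cached: Optional[CachedToken] = None

    def get(self) -> str:
        """Return a valid token, fetching a new one only when needed."""
        now = self._clock()
        expired = (
            self._cached is None
            or now >= self._cached.expire_time - self._margin
        )
        if expired:
            value, expire_time = self._fetch_token()
            self._cached = CachedToken(value, expire_time)
        return self._cached.value
```

The safety margin refreshes the token slightly early, so a request that starts just before the expiration time is not sent with a token that expires in flight.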
August 23, 2020
Feature category | Feature name | Feature description | Update type | References |
Speech synthesis | Added resource and scenario configurations for speech synthesis | | New | |
Speech recognition | Optimized segmentation duration for real-time speech recognition | The default maximum segmentation duration for real-time speech recognition is shortened from 60 seconds to 15 seconds to simplify related API calls. | Optimized | |
Speech recognition general-purpose model and fixes for customer service quality inspection | | Fixed |
July 23, 2020
Feature category | Feature name | Feature description | Update type | References |
Self-learning training | Free use of self-learning model development | Self-learning models are now available for free. This provides a zero-cost custom voice service to help drive business innovation. | New | |
Self-learning platform training flow | | New | ||
Speech synthesis | Long-text-to-speech | The long-text RESTful API with integrated caption capabilities is officially released. The developer documentation is available on the official website. | New | |
SDK | New versions of Android and iOS SDKs are available | | Optimized | None |
Speech recognition | Fixed speech recognition issues | Optimized English post-processing. This resolves an issue where recognition results were incorrectly formatted in some cases after punctuation was enabled. | Fixed | None |
July 09, 2020
Feature category | Feature name | Feature description | Update type | References |
Speech recognition | Optimized speech recognition model | The English recognition model for 8 kHz audio in Short-sentence Recognition, Real-time Speech Recognition, and Audio File Transcription has been updated. The update broadens the model's accent coverage and makes the language model more general-purpose, without reducing word recognition accuracy on the general test set. | Optimized | |
Speech synthesis | Fixed speech synthesis model | | Fixed | None |