Latest updates for Intelligent Speech Interaction features

This topic describes the latest feature updates for Intelligent Speech Interaction and related documents.

April 2023 to January 2024

Feature category	Feature name	Feature description	Update type	Related documents
Speech recognition	On-screen captions	Audio File Transcription, Audio File Transcription - Express Edition, and Audio File Transcription - Off-Peak Edition now support on-screen captions.	New	API reference
Speech recognition	Model Studio Service	Cost-effective real-time speech recognition is now available.	New	Real-time Speech Recognition (Paraformer)
Speech synthesis	Model Studio Service	Cost-effective speech synthesis is now available.	New	Speech Synthesis
Speech recognition	Model Studio service	Model service - Audio File Transcription now supports the following languages and dialects: Mandarin Chinese, Chinese dialects (Cantonese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin), English, Japanese, Korean, Spanish, Indonesian, French, German, Italian, and Malay.	New	Audio File Transcription (Paraformer)
Speech synthesis	Minor language voices	Speech synthesis now supports the following minor language voices: Russian, Korean, Vietnamese, Thai, Italian, Spanish, French, German, and American English (male and female).	New	API reference
Speech recognition	Dialects	Added a 16 kHz Cantonese free-talk dialect model.	New	Speech Recognition
Speech synthesis	Digital human and multi-emotional voices	Added seven digital human voices: Zhibai, Zhixiaoxia, Zhixiaomei, Zhigui, Zhishuo, Aixia, and Cally. Added two multi-emotional voices: Zhifeng and Zhibing.	New	Speech Synthesis

March 2022 to March 2023

Feature category	Feature name	Feature description	Update type	Documentation
Speech recognition	Added four new product specifications for voice analysis	New product specifications: sound event detection, speaker recognition, gender recognition, and language identification.	New	Voice Analysis
	Audio files support MP4 format as an input parameter	Three services now support MP4 as an input parameter: Audio File Transcription, Audio File Transcription - Express Edition, and Audio File Transcription - Off-Peak Edition.	New	API reference
	Mobile Android/iOS SDK	Supports long-text-to-speech. Supports secure access using Security Token Service (STS). Provides a more accurate offline authentication solution. iOS supports Xcode 14.	New	SDK and API overview
	C++ SDK	Supports Windows x86 and x64, and supports UE5. Supports Windows C# and Unity. Supports long-text-to-speech. Supports Linux-Aarch64 platforms. Supports CXX11. Added the audio file transcription feature.	New	SDK and API overview
	Added 16 kHz recognition capabilities	Chinese-English free-talk (mixed recognition), Cantonese (Traditional), Portuguese, Turkish, Greek, Javanese, Bengali, Czech, Urdu, Nepali, Mongolian (Outer Mongolia), Uzbek, Sinhala, Marathi, Telugu, Punjabi, Swedish, Bulgarian, Catalan, Hebrew, Croatian, Hausa, Burmese, Lao, Swahili, Azerbaijani, Persian, Danish, Norwegian, Malayalam, and Kannada.	New	Speech Recognition
	Added 8 kHz recognition capabilities	Cantonese (Traditional), Vietnamese, Thai, Malay, and Spanish.	New	Speech Recognition
	Increased the number of hotwords that can be added	The maximum number of words per group is increased from 128 to 500.	Optimized	Overview
Speech synthesis	Added Pinyin-level phoneme timestamps	The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support Pinyin-level phoneme timestamps.	New	Introduction to the speech synthesis timestamp feature
	Added word-by-word timestamps	The Real-time Long-Text-to-Speech service now supports word-by-word timestamps.	Optimized	Introduction to the speech synthesis timestamp feature
	Added multi-emotional voices	The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support the following voices: Zhimiao_Multi-emotional, Zhiyan_Multi-emotional, Zhibei_Multi-emotional, Zhitian_Multi-emotional, and Zhimi_Multi-emotional.	New	API reference
	Added multilingual voices	The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support the following voices: Filipino female, Vietnamese female, Russian female, Korean female, American English customer service female, Spanish female, and Italian female.	New	API reference
	Added premium Chinese voices	The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support the following voices: Zhimao, Zhiyuan, Zhigui, Zhiya, Zhishuo, Zhida, Zhiyue, Zhisha, and Kelly (China (Hong Kong) Cantonese).	New	API reference
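
The hotword limit listed in the table above (500 words per group, up from 128) can be checked on the client before a vocabulary is uploaded. The following is a minimal sketch, assuming hotwords are held as a plain list of strings. `MAX_WORDS_PER_GROUP` and `split_into_groups` are hypothetical names used for illustration and are not part of any Intelligent Speech Interaction SDK.

```python
from typing import List

# Per-group limit stated in the release note (previously 128).
MAX_WORDS_PER_GROUP = 500


def split_into_groups(words: List[str], limit: int = MAX_WORDS_PER_GROUP) -> List[List[str]]:
    """Deduplicate hotwords (keeping first-seen order) and chunk them
    into groups of at most `limit` words."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Strip whitespace, drop empty entries, and remove duplicates in order.
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    return [unique[i:i + limit] for i in range(0, len(unique), limit)]


groups = split_into_groups([f"word{i}" for i in range(1200)])
print([len(g) for g in groups])  # [500, 500, 200]
```

Splitting on the client keeps each group within the documented limit. How groups are then created and referenced is defined by the hotword documentation, not by this sketch.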

March 21, 2022

Feature category	Feature name	Feature description	Update type	Related documents
Regions and Domain Names	Multiple regions	To further reduce network latency for users in North and South China, Intelligent Speech Interaction has added the China (Beijing) and China (Shenzhen) regions in addition to the existing China (Shanghai) region.	New	New: Regions and endpoints

March 04, 2022

Feature category	Feature name	Feature description	Update type	References
Speech recognition	New SDKs	Added four SDKs: C# SDK, Go SDK, Node.js SDK, and WeChat mini program SDK.	New	Short-sentence recognition: Go SDK, Node.js SDK, WeChat mini program; Real-time speech recognition: C# SDK, Go SDK, Node.js SDK, WeChat mini program
Speech synthesis	New SDKs	Added four SDKs: C# SDK, Go SDK, Node.js SDK, and WeChat mini program SDK.	New	C# SDK, Go SDK, Node.js SDK, WeChat mini program

February 17, 2022

Feature category	Feature name	Feature description	Update type	Related documentation
Speech recognition	Optimized SDK features	Optimized the C++ SDK features.	Optimized	Short-sentence recognition: C++ SDK; Real-time speech recognition: C++ SDK
Speech synthesis	Optimized SDK features	Optimized the C++ SDK features.	Optimized	C++ SDK

February 09, 2022

Feature category	Feature name	Feature description	Update type	Related documents
Speech recognition	Audio File Transcription - Off-Peak Edition	Added languages: Tamil (16 kHz), Polish (16 kHz), Ukrainian (16 kHz), Romanian (16 kHz), Dutch (16 kHz), Hungarian (16 kHz), Khmer (16 kHz), Filipino (16 kHz, 8 kHz), Spanish (16 kHz, 8 kHz), Indonesian (8 kHz), and Vietnamese (8 kHz).	New	What dialect models and languages are supported by the speech recognition service?

January 21, 2022

Feature category	Feature name	Feature description	Update type	Related documents
Speech recognition	Audio File Transcription - Off-Peak Edition	Audio File Transcription - Off-Peak Edition is a service for offline transcription of pre-recorded audio files. It differs from Audio File Transcription in its response time. The Off-Peak Edition returns results within 24 hours.	New	Audio File Transcription - Off-Peak Edition
Speech synthesis	New voices - Chinese	Soothing child voice Jielidou, Northeastern male voice Laotie, Lolita female voice Zhiwei, livestreaming female voice Laomei, Tianjin male voice Aikan, Taiwanese female voice Zhiqing, and sweet female voice Zhitian.	New	Speech synthesis: API reference; Speech synthesis for mobile: API reference; Long-text-to-speech: API reference
Speech synthesis	New voices - Multilingual	American English female voice Annie and Filipino female voice Tala.	New	Speech synthesis: API reference; Speech synthesis for mobile: API reference; Long-text-to-speech: API reference

December 23, 2021

Feature category	Feature name	Feature description	Update type	Documentation
Speech recognition	Optimized SDK features	Optimized the Python SDK features.	Optimized	Short-sentence recognition: Python SDK; Real-time speech recognition: Python SDK
Speech synthesis	Optimized SDK features	Optimized the Python SDK features.	Optimized	Python SDK

July 30, 2021

Feature category	Feature name	Feature description	Update type	Documentation
Speech recognition	Shiyinshi model	The Shiyinshi model replaced 17 general-purpose or domain-specific models.	Optimized	None
Console	Manage projects	Optimized the project creation flow. After a project is created, you are automatically guided to configure a recognition model or a synthesis voice.	Optimized	Manage projects
	Self-learning - Customize language models	Optimized the voice model customization flow. Added clearer instructions for data format requirements to prevent incorrect operations due to unclear guidance. Provided more detailed error messages and suggested solutions.	Optimized	Customize language models
	Automated testing	Added shortcut buttons for viewing test results.	Optimized	Automated testing
Billing	Clarified rules for metering and billing reports	Added clearer explanations in the console about the rules for displaying metering and billing statistics. For example, usage and fees for the current day can be viewed on the next day.	Optimized	None

July 08, 2021

Feature category	Feature name	Feature description	Update type	Related documents
Speech recognition	C++ SDK optimization	Published the user documentation for C++ SDK 3.0.10.	Optimized	Short-sentence recognition, Real-time speech recognition
	C++ SDK optimization	Fixed a crash issue in the C++ SDK when processing WebSocket data.	Optimized	None
	Russian recognition optimization	Fixed an issue where spaces were missing in Russian recognition results.	Optimized	None
Speech synthesis	New voices	Ultra-high definition scenario: Lolita child voice - Zhiwei. Livestreaming scenario: Northeastern male voice - Laotie, hawker female voice - Laomei. Child voice: soothing male child - Jielidou.	New	Speech synthesis API reference
	Engine update	Voices in the ultra-high definition scenario now support streaming playback.	New	None
	Engine update	Improved the stability of the synthesis service.	Optimized	None
	English voice pause optimization	Updated the English voices Abby, Emily, and Eric to fix an issue with long pauses in some sentences.	Optimized	None

June 03, 2021

Feature category	Feature name	Feature description	Update type	Related links
Speech recognition	Semantic segmentation update	After semantic segmentation is enabled for real-time transcription, intermediate recognition results are processed by streaming inverse text normalization (ITN) by default. This fixes the issue of flickering numbers (changing from Chinese characters to Arabic numerals) in on-screen caption scenarios.	Optimized	None
Speech synthesis	Supports free trial and self-service access for offline speech synthesis	You can try five Standard Edition offline speech synthesis SDKs and five premium offline speech synthesis SDKs for free. You can also purchase commercial version SDKs with a perpetual license.	New	Offline Speech Synthesis Product Details Activate Authorization
	Model update	Added two voices for livestreaming and video dubbing: Aifei and Ailun. Added two voices for the ultra-high definition scenario: Zhifei and Zhilun. Added the American English voice Ava.	New	None
	Engine update	Supports the say-as tag in English Speech Synthesis Markup Language (SSML).	New	Introduction to the SSML markup language
	SDK update	The SDK now supports setting the sample rate to 24 kHz and 48 kHz, in addition to the original 8 kHz and 16 kHz.	New	None

May 13, 2021

Feature category	Feature name	Feature description	Update type	Related links
Speech recognition	Shiyinshi V1 - End-to-end Mandarin Chinese recognition model	High recognition accuracy: Based on a self-developed end-to-end speech recognition framework, the Chinese recognition accuracy is among the highest in the industry. The character error rate is reduced by 10% to 30% compared to the previous generation system in scenarios such as customer service, input methods, and meetings. Supports both real-time and offline speech recognition, and supports 8 kHz and 16 kHz models. Fast recognition speed: Uses character-level modeling units and a self-developed model inference engine. The concurrent inference speed is more than 10 times faster than mainstream inference frameworks. The service response latency is at the millisecond level.	New	None
	Post-processing model update	Fixed the English ITN timestamp issue. Fixed the issue of platform differences in offline ITN timestamp output. Fixed the issue of extra spaces at the end of streaming ITN. Fixed typical bugs: "Twenty to thirty years" -> "Twenty to thirty years"; "One hundred and two years" -> "102 years"; "Wenyi West Road No. nine six nine" -> "Wenyi West Road No. 969".	Optimized	None
	VAD model update	The common_8k human-machine noise optimized model is now available.	Optimized	None
	Speaker diarization model update	The 8 kHz supervised speaker diarization algorithm for audio file transcription now includes a parallel mode, which reduces the time required to obtain the output for a single request. Improved robustness to noise, further reducing single-speaker output bugs caused by noise interference.	Optimized	None
Speech synthesis	Added a UI-based download feature	On the Speech Synthesis configuration page in the console, you can now adjust the sample rate and format, and download the audio.	New	A TTS tool for beginners—synthesize and download audio without writing code
	Engine update	Optimized performance for the ultra-high definition scenario.	Optimized	None
	Model update	Added six voices for the ultra-high definition scenario: Zhixiang, Zhiqian, Zhinan, Zhide, Zhiru, and Zhijia.	New	None
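
The post-processing row above gives "Wenyi West Road No. nine six nine" -> "Wenyi West Road No. 969" as an inverse text normalization (ITN) example. The toy sketch below illustrates only that one kind of rule: collapsing a run of two or more spelled-out digits into Arabic numerals. It is an illustrative assumption, not the service's ITN implementation, and `normalize_digit_runs` is a hypothetical name.

```python
import re

# Spelled-out single digits and their Arabic equivalents.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

_WORD = "|".join(DIGITS)
# A run of two or more digit words separated by single spaces.
_RUN = re.compile(rf"\b(?:{_WORD})(?: (?:{_WORD}))+\b", re.IGNORECASE)


def normalize_digit_runs(text: str) -> str:
    """Replace runs such as 'nine six nine' with '969'.

    Isolated digit words ('one hundred and two') are left untouched,
    because converting them needs cardinal-number rules that this
    sketch does not implement.
    """
    def repl(match: "re.Match[str]") -> str:
        return "".join(DIGITS[w.lower()] for w in match.group(0).split(" "))

    return _RUN.sub(repl, text)


print(normalize_digit_runs("Wenyi West Road No. nine six nine"))
# Wenyi West Road No. 969
```

A production ITN system also handles cardinals, dates, and ranges such as "twenty to thirty years", which is where the bugs listed in the table arose.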

March 23, 2021

Feature category	Feature name	Feature description	Update type	Related documents
Speech synthesis	Added offline speech synthesis	The offline speech synthesis feature is released.	New	API reference
	Added ultra-high definition synthesis voices	Added ultra-high definition voices: Zhiqi and Zhichu.	New	API reference
	Added synthesis voices	Added speakers: Cantonese female Jiajia, Cantonese female Taozi, Japanese male Tomoya, Japanese male Tomoka, American English Annie, and Indonesian female Indah. Added voices for literary scenarios: Aixiao, Aishu, Airu, and Aiqian. Added voices for livestreaming scenarios, such as sales hosts and specific hosts like Stella.	New	API reference
	Optimized pause control	Upgraded the frontend pause model and added post-processing rules. The unacceptable rate for scenarios such as customer service, novels, news, and encyclopedias has significantly decreased.	Optimized	None
	Fixed dictionary and number/symbol regularization rules	Added entries, such as U+4DAE (yan3) and U+7180 (huang3). Fixed issues with the synthesized pronunciation of Chinese polyphonic characters in terms such as "COVID-19 pneumonia", "novel coronavirus", and "COVID-19 vaccine". Optimized regularization rules for numbers and symbols, such as adding support for uppercase and lowercase Roman numerals from 1 to 10. Added some British and American English entries, such as "EB virus" and "iOS". Updated Indonesian regularization rules and dictionary.	Fixed	None
Speech recognition	Mandarin Chinese model	Improved recognition of rare characters. Improved the recognition effect of the 8 kHz general-purpose telephone customer service model for low-volume speech.	Optimized	None
	Mandarin Chinese model (upgraded)	Improved recognition in noisy scenarios. Improved recognition of rare characters. Improved recognition of accents mixed with Mandarin. Improved recognition of nonsensical audio and reduced abnormal repetitions in recognition results. Improved recognition of mixed Chinese and English in livestreaming scenarios.	Optimized	None
	Added a parameter for audio channel selection to Audio File Transcription (including Express Edition)	For multi-channel files, you can specify the channel to be transcribed using a parameter. This lets you skip unnecessary channels to save costs.	New	API reference
	Added semantic segmentation to Audio File Transcription (including Express Edition)	You can use a parameter to control whether to enable semantic segmentation.	New	API reference
	Product documentation update	Added more easy-to-understand explanations about dialects and accents. Added product application videos. Added instructions on the queries per second (QPS) for Audio File Transcription calls.	New	API reference

November 27, 2020

Feature category	Feature name	Feature description	Update type	References
Speech recognition	Audio File Transcription - Express Edition	Audio File Transcription - Express Edition supports speech recognition models for all scenarios. The console supports querying the call volume of Audio File Transcription - Express Edition.	New	API reference
	Optimized support for WAV files in speech recognition	Optimized ASR support for WAV files. Supports more WAV file header formats to reduce the impact of file headers on recognition results.	Optimized	None
	Audio File Transcription - Express Edition timeout	Fixed an issue where using a 16 kHz model for 8 kHz speech recognition in Audio File Transcription - Express Edition did not immediately return an error, causing a timeout.	Fixed	None
Access token	Optimized token generation mechanism	Improved the token generation mechanism by adding a token validity period. This avoids potential request failures caused by the original "update token every 24 hours" mechanism.	Optimized	Obtain a token using an SDK
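
The access token row above says tokens now carry a validity period, so a client can refresh based on the stated expiry instead of a fixed 24-hour schedule. The following is a minimal sketch of that pattern, assuming the token request returns the token together with its expiry time in Unix seconds. `TokenCache`, `fetch`, and `fake_fetch` are hypothetical names, and the real request should follow "Obtain a token using an SDK".

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Caches an access token and refreshes it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        margin_s: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch        # returns (token, expiry as Unix seconds)
        self._margin_s = margin_s  # refresh this many seconds before expiry
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, fetching a new one when needed."""
        if self._token is None or self._clock() >= self._expires_at - self._margin_s:
            self._token, self._expires_at = self._fetch()
        return self._token


def fake_fetch() -> Tuple[str, float]:
    # Stand-in for a real token request (assumption for this sketch).
    return "demo-token", time.time() + 3600


cache = TokenCache(fake_fetch)
print(cache.get())  # demo-token
```

Refreshing slightly before the expiry time leaves room for clock skew and in-flight requests. The 60-second margin is an arbitrary choice for this sketch.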

August 23, 2020

Feature category	Feature name	Feature description	Update type	References
Speech synthesis	Added resource and scenario configurations for speech synthesis	Added a resource tag to the SSML for speech synthesis. This tag can parse "offline resources for multimodal interaction" and can replace the position information of each character in the timestamp. The RESTful API for speech synthesis now supports configuring speaker, volume, speech rate, and pitch parameters in the console. This simplifies API call parameter configuration. Added speakers for literary scenarios: Ainan, Aiyan, Aihao, and Aiming, providing you with more choices.	New	API reference
Speech recognition	Optimized segmentation duration for real-time speech recognition	The default maximum segmentation duration for real-time speech recognition is shortened from 60 seconds to 15 seconds to simplify related API calls.	Optimized	API reference
Speech recognition	Speech recognition general-purpose model and fixes for customer service quality inspection	Improved the Voice Activity Detection (VAD) effect for the 16 kHz Chinese general-purpose model. This resolves the issue of misdetecting speech in silent data. Performed routine language model updates for the 8 kHz Chinese customer service quality inspection, 8 kHz English customer service quality inspection, and 16 kHz Korean models. This fixes some recognition errors.	Fixed	API reference

July 23, 2020

Feature category	Feature name	Feature description	Update type	References
Self-learning training	Free use of self-learning model development	Self-learning models are now available for free. This provides a zero-cost custom voice service to help drive business innovation.	New	Overview
Self-learning training	Self-learning platform training flow	Added recommendations for the best baseline model to simplify your training. Integrated with automated testing to provide quantifiable test metric results for models.	New	Overview
Speech synthesis	Long-text-to-speech	The long-text RESTful API with integrated caption capabilities is officially released. The developer documentation is available on the official website.	New	RESTful API
SDK	New versions of Android and iOS SDKs are available	The Android SDK size is reduced by 34.6% and the iOS SDK size is reduced by 17.5%. The SDKs have been tested with hundreds of millions of daily calls and are highly stable. Improved SDK state management (such as turning audio on/off and data push). You can focus on business implementation without complex state and thread management. Maintains API consistency with the end-to-end solution. This allows for seamless integration with Intelligent Speech Interaction scenarios such as wake-up, voice, and dialogue understanding, and offline speech synthesis in the future.	Optimized	None
Speech recognition	Fixed speech recognition issues	Optimized the English post-processing effect. This resolves an issue where the recognition result format was incorrect in some cases after enabling punctuation.	Fixed	None

July 09, 2020

Feature category	Feature name	Feature description	Update type	References
Speech recognition	Optimized speech recognition model	The English recognition model for 8 kHz audio sample rate in Short-sentence Recognition, Real-time Speech Recognition, and Audio File Transcription has been updated. This update improves the model's accent coverage and makes the language model more general-purpose, without decreasing the word recognition accuracy on the general test set.	Optimized	API reference
Speech synthesis	Fixed speech synthesis model	Abby (speaker name): Reduced the rate of missed words. Wendy (speaker name): Fixed the instability issue when synthesizing long text. English scenario: Fixed the issue where non-standard spaces in English text caused word parsing to fail, improving word recognition accuracy. Chinese scenario: Fixed issues with polyphonic characters and word segmentation.	Fixed	None