Release notes

This topic describes the latest feature updates for Intelligent Speech Interaction and related documents.

April 2023 to January 2024

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

On-screen captions

Audio File Transcription, Audio File Transcription - Express Edition, and Audio File Transcription - Off-Peak Edition now support on-screen captions.

New

API reference

Speech recognition

Model Studio Service

Cost-effective real-time speech recognition is now available.

New

Real-time Speech Recognition (Paraformer)

Speech synthesis

Model Studio Service

Cost-effective speech synthesis is now available.

New

Speech Synthesis

Speech recognition

Model Studio Service

Model service - Audio File Transcription now supports the following languages and dialects: Mandarin Chinese, Chinese dialects (Cantonese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin), English, Japanese, Korean, Spanish, Indonesian, French, German, Italian, and Malay.

New

Audio File Transcription (Paraformer)

Speech synthesis

Minor language voices

Speech synthesis now supports the following minor language voices: Russian, Korean, Vietnamese, Thai, Italian, Spanish, French, German, and American English (male and female).

New

API reference

Speech recognition

Dialects

Added a 16 kHz Cantonese free-talk dialect model.

New

Speech Recognition

Speech synthesis

Digital human and multi-emotional voices

Added seven digital human voices: Zhibai, Zhixiaoxia, Zhixiaomei, Zhigui, Zhishuo, Aixia, and Cally.

Added two multi-emotional voices: Zhifeng and Zhibing.

New

Speech Synthesis

March 2022 to March 2023

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Added four new product specifications for voice analysis

New product specifications:

  1. Sound event detection

  2. Speaker recognition

  3. Gender recognition

  4. Language identification

New

Voice Analysis

MP4 supported as an input format for audio files

Three services now accept MP4 files as input:

  • Audio File Transcription

  • Audio File Transcription - Express Edition

  • Audio File Transcription - Off-Peak Edition

New

API reference

Mobile Android/iOS SDK

  1. Supports long-text-to-speech.

  2. Supports secure access using Security Token Service (STS).

  3. Provides a more accurate offline authentication solution.

  4. iOS supports Xcode 14.

New

SDK and API overview

C++ SDK

  1. Supports Windows x86 and x64, and supports UE5.

  2. Supports Windows C# and Unity.

  3. Supports long-text-to-speech.

  4. Supports Linux AArch64 platforms.

  5. Supports CXX11.

  6. Added the audio file transcription feature.

New

SDK and API overview

Added 16 kHz recognition capabilities

Chinese-English free-talk (mixed recognition), Cantonese (Traditional), Portuguese, Turkish, Greek, Javanese, Bengali, Czech, Urdu, Nepali, Mongolian (Outer Mongolia), Uzbek, Sinhala, Marathi, Telugu, Punjabi, Swedish, Bulgarian, Catalan, Hebrew, Croatian, Hausa, Burmese, Lao, Swahili, Azerbaijani, Persian, Danish, Norwegian, Malayalam, and Kannada.

New

Speech Recognition

Added 8 kHz recognition capabilities

Cantonese (Traditional), Vietnamese, Thai, Malay, and Spanish.

New

Speech Recognition

Increased the number of hotwords that can be added

The maximum number of words per group is increased from 128 to 500.

Optimized

Overview
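As a rough illustration of the 500-word limit per hotword group noted above, the following sketch splits a long hotword list into groups that each stay within the limit. The function name `split_hotwords` and the constant `MAX_WORDS_PER_GROUP` are hypothetical helpers for illustration and are not part of any SDK.

```python
from typing import List

MAX_WORDS_PER_GROUP = 500  # per-group limit stated in this release note


def split_hotwords(words: List[str], limit: int = MAX_WORDS_PER_GROUP) -> List[List[str]]:
    """Drop blanks and duplicates (keeping order), then split into groups of at most `limit`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    return [unique[i:i + limit] for i in range(0, len(unique), limit)]


groups = split_hotwords([f"term{i}" for i in range(1200)])
print([len(g) for g in groups])  # → [500, 500, 200]
```

Under the previous limit of 128, the same 1,200 words would need ten groups instead of three.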

Speech synthesis

Added Pinyin-level phoneme timestamps

The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support Pinyin-level phoneme timestamps.

New

Introduction to the speech synthesis timestamp feature

Added word-by-word timestamps

The Real-time Long-Text-to-Speech service now supports word-by-word timestamps.

Optimized

Introduction to the speech synthesis timestamp feature

Added multi-emotional voices

The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support the following voices:

  • Zhimiao_Multi-emotional

  • Zhiyan_Multi-emotional

  • Zhibei_Multi-emotional

  • Zhitian_Multi-emotional

  • Zhimi_Multi-emotional

New

API reference

Added multilingual voices

The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support the following voices:

  • Filipino female

  • Vietnamese female

  • Russian female

  • Korean female

  • American English customer service female

  • Spanish female

  • Italian female

New

API reference

Added premium Chinese voices

The Speech Synthesis, Real-time Long-Text-to-Speech, and Asynchronous Long-Text-to-Speech services now support the following voices:

  • Zhimao

  • Zhiyuan

  • Zhigui

  • Zhiya

  • Zhishuo

  • Zhida

  • Zhiyue

  • Zhisha

  • Kelly China (Hong Kong) Cantonese

New

API reference

March 21, 2022

Feature category

Feature name

Feature description

Update type

Related documents

Regions and Domain Names

Multiple regions

To further reduce network latency for users in North and South China, Intelligent Speech Interaction has added the China (Beijing) and China (Shenzhen) regions in addition to the existing China (Shanghai) region.

New

New: Regions and endpoints

Related updated documents:

March 04, 2022

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

New SDKs

Added SDKs for four programming languages: C# SDK, Go SDK, Node.js SDK, and WeChat mini program.

New

Speech synthesis

New SDKs

Added SDKs for four programming languages: C# SDK, Go SDK, Node.js SDK, and WeChat mini program.

New

February 17, 2022

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Optimized SDK features

Optimized the C++ SDK features.

Optimized

  • Short-sentence recognition: C++ SDK

  • Real-time speech recognition: C++ SDK

Speech synthesis

Optimized SDK features

Optimized the C++ SDK features.

Optimized

C++ SDK

February 09, 2022

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Audio File Transcription - Off-Peak Edition

  • Tamil (16 kHz)

  • Polish (16 kHz)

  • Ukrainian (16 kHz)

  • Romanian (16 kHz)

  • Dutch (16 kHz)

  • Hungarian (16 kHz)

  • Khmer (16 kHz)

  • Filipino (16 kHz, 8 kHz)

  • Spanish (16 kHz, 8 kHz)

  • Indonesian (8 kHz)

  • Vietnamese (8 kHz)

New

What dialect models and languages are supported by the speech recognition service?

January 21, 2022

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Audio File Transcription - Off-Peak Edition

Audio File Transcription - Off-Peak Edition is a service for offline transcription of pre-recorded audio files. It differs from Audio File Transcription in its response time. The Off-Peak Edition returns results within 24 hours.

New

Audio File Transcription - Off-Peak Edition

Speech synthesis

New voices - Chinese

  • Soothing child voice Jielidou

  • Northeastern male voice Laotie

  • Lolita female voice Zhiwei

  • Livestreaming female voice Laomei

  • Tianjin male voice Aikan

  • Taiwanese female voice Zhiqing

  • Sweet female voice Zhitian

New

New voices - Multilingual

  • American English female voice Annie

  • Filipino female voice Tala

New

December 23, 2021

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Optimized SDK features

Optimized the Python SDK features.

Optimized

Speech synthesis

Optimized SDK features

Optimized the Python SDK features.

Optimized

Python SDK

July 30, 2021

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Shiyinshi model

The Shiyinshi model replaced 17 general-purpose or domain-specific models.

Optimized

None

Console

Manage projects

Optimized the project creation flow. After a project is created, you are automatically guided to configure a recognition model or a synthesis voice.

Optimized

Manage projects

Self-learning - Customize language models

Optimized the language model customization flow. Added clearer instructions on data format requirements to prevent mistakes caused by unclear guidance. Error messages are now more detailed and include suggested solutions.

Optimized

Customize language models

Automated testing

Added shortcut buttons for viewing test results.

Optimized

Automated testing

Billing

Clarified rules for metering and billing reports

Added clearer explanations in the console about the rules for displaying metering and billing statistics. For example, usage and fees for the current day can be viewed on the next day.

Optimized

None

July 08, 2021

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

C++ SDK optimization

Published the user documentation for C++ SDK 3.0.10.

Optimized

Fixed a crash issue in the C++ SDK when processing WebSocket data.

Optimized

None

Russian recognition optimization

Fixed an issue where spaces were missing in Russian recognition results.

Optimized

None

Speech synthesis

New voices

  • Ultra-high definition scenario: Lolita child voice - Zhiwei

  • Livestreaming scenario: Northeastern male voice - Laotie, Hawkers female voice - Laomei

  • Child voice: Soothing male child - Jielidou

New

Speech synthesis API reference

Engine update

Voices in the ultra-high definition scenario now support streaming playback.

New

None

Improved the stability of the synthesis service.

Optimized

None

English voice pause optimization

Updated the English voices Abby, Emily, and Eric to fix an issue with long pauses in some sentences.

Optimized

None

June 03, 2021

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Semantic segmentation update

After semantic segmentation is enabled for real-time transcription, intermediate recognition results are processed by streaming inverse text normalization (ITN) by default. This fixes the issue of flickering numbers (changing from Chinese characters to Arabic numerals) in on-screen caption scenarios.

Optimized

None

Speech synthesis

Supports free trial and self-service access for offline speech synthesis

  • You can try five Standard Edition offline speech synthesis SDKs and five premium offline speech synthesis SDKs for free.

  • You can also purchase commercial version SDKs with a perpetual license.

New

Model update

  • Added two voices for livestreaming and video dubbing: Aifei and Ailun.

  • Added two voices for the ultra-high definition scenario: Zhifei and Zhilun.

  • Added the American English voice Ava.

New

None

Engine update

Supports the say-as tag in English Speech Synthesis Markup Language (SSML).

New

Introduction to the SSML markup language
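To make the say-as entry above more concrete, here is a minimal sketch that builds an SSML string containing a `say-as` element with Python's standard `xml.etree.ElementTree`. The `interpret-as` attribute comes from the W3C SSML specification. The value `"digits"` and the helper name `build_say_as_ssml` are assumptions for illustration; the values the service actually accepts are listed in its SSML documentation.

```python
import xml.etree.ElementTree as ET


def build_say_as_ssml(prefix: str, value: str, interpret_as: str) -> str:
    """Return '<speak>prefix<say-as interpret-as="...">value</say-as></speak>'."""
    speak = ET.Element("speak")
    speak.text = prefix
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": interpret_as})
    say_as.text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(speak, encoding="unicode")


# "digits" is an assumed interpret-as value, used only for illustration
print(build_say_as_ssml("Your order number is ", "12345", "digits"))
```

Building the markup with an XML library rather than string concatenation keeps special characters in the text correctly escaped.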

SDK update

The SDK now supports setting the sample rate to 24 kHz and 48 kHz, in addition to the original 8 kHz and 16 kHz.

New

None

May 13, 2021

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Shiyinshi V1 - End-to-end Mandarin Chinese recognition model

  • High recognition accuracy:

    Based on a self-developed end-to-end speech recognition framework, the Chinese recognition accuracy is among the highest in the industry. The character error rate is reduced by 10% to 30% compared to the previous generation system in scenarios such as customer service, input methods, and meetings.

  • Supports both real-time and offline speech recognition, and supports 8 kHz and 16 kHz models.

  • Fast recognition speed:

    Uses character-level modeling units and a self-developed model inference engine. The concurrent inference speed is more than 10 times faster than mainstream inference frameworks. The service response latency is at the millisecond level.

New

None

Post-processing model update

  • Fixed the English ITN timestamp issue.

  • Fixed the issue of platform differences in offline ITN timestamp output.

  • Fixed the issue of extra spaces at the end of streaming ITN.

  • Fixed typical bugs:

    • Twenty to thirty years -> Twenty to thirty years

    • One hundred and two years -> 102 years

    • Wenyi West Road No. nine six nine -> Wenyi West Road No. 969

Optimized

None
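The last ITN fix listed above ("nine six nine" becoming "969") can be pictured with a toy sketch. This is not the service's post-processing model, which works on Chinese text and handles far more cases; it only shows the idea of merging a run of spoken digits into one numeral. The names `DIGIT_WORDS` and `merge_spoken_digits` are hypothetical.

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def merge_spoken_digits(text: str) -> str:
    """Replace each run of spelled-out digits with a single numeral string."""
    out, run = [], []
    for token in text.split():
        digit = DIGIT_WORDS.get(token.lower())
        if digit is not None:
            run.append(digit)          # extend the current run of digits
            continue
        if run:                        # a non-digit token ends the run
            out.append("".join(run))
            run = []
        out.append(token)
    if run:                            # flush a run at the end of the text
        out.append("".join(run))
    return " ".join(out)


print(merge_spoken_digits("Wenyi West Road No. nine six nine"))
# → Wenyi West Road No. 969
```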

VAD model update

The common_8k human-machine noise optimized model is now available.

Optimized

None

Speaker diarization model update

  • The 8 kHz supervised speaker diarization algorithm for audio file transcription now includes a parallel mode, which reduces the time required to obtain the output for a single request.

  • Improved robustness to noise, further reducing single-speaker output bugs caused by noise interference.

Optimized

None

Speech synthesis

Added a UI-based download feature

On the Speech Synthesis configuration page in the console, you can now adjust the sample rate and format, and download the audio.

New

A TTS tool for beginners—synthesize and download audio without writing code

Engine update

Optimized performance for the ultra-high definition scenario.

Optimized

None

Model update

Added six voices for the ultra-high definition scenario: Zhixiang, Zhiqian, Zhinan, Zhide, Zhiru, and Zhijia.

New

None

March 23, 2021

Feature category

Feature name

Feature description

Update type

Related documents

Speech synthesis

Added offline speech synthesis

The offline speech synthesis feature is released.

New

API reference

Added ultra-high definition synthesis voices

Added ultra-high definition voices: Zhiqi and Zhichu.

New

API reference

Added synthesis voices

  • Added speakers: Cantonese female Jiajia, Cantonese female Taozi, Japanese male Tomoya, Japanese female Tomoka, American English Annie, and Indonesian female Indah.

  • Voices for literary scenarios: Aixiao, Aishu, Airu, and Aiqian.

  • Voices for livestreaming scenarios, such as sales host voices, including Stella.

New

API reference

Optimized pause control

Upgraded the frontend pause model and added post-processing rules. The rate of unacceptable pauses in scenarios such as customer service, novels, news, and encyclopedias has decreased significantly.

Optimized

None

Fixed dictionary and number/symbol normalization rules

  • Added dictionary entries, such as U+4DAE (yan3) and U+7180 (huang3).

  • Fixed issues with the synthesized pronunciation of Chinese polyphonic characters in terms such as "COVID-19 pneumonia", "novel coronavirus", and "COVID-19 vaccine".

  • Optimized normalization rules for numbers and symbols, such as adding support for uppercase and lowercase Roman numerals from 1 to 10.

  • Added some British and American English entries, such as "EB virus" and "iOS".

  • Updated the Indonesian normalization rules and dictionary.

Fixed

None
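The Roman numeral item above (uppercase and lowercase numerals from 1 to 10) can be sketched as a simple lookup. This is only an illustration of what such a rule covers, not the service's actual rule set; `ROMAN_1_TO_10` and `roman_to_int` are hypothetical names.

```python
from typing import Optional

ROMAN_1_TO_10 = {
    "i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5,
    "vi": 6, "vii": 7, "viii": 8, "ix": 9, "x": 10,
}


def roman_to_int(token: str) -> Optional[int]:
    """Map an all-uppercase or all-lowercase Roman numeral (1 to 10) to an int, else None."""
    if not (token.isupper() or token.islower()):
        return None  # reject mixed case such as "Iv" and empty strings
    return ROMAN_1_TO_10.get(token.lower())


print(roman_to_int("IV"), roman_to_int("viii"), roman_to_int("XI"))
# → 4 8 None
```

A real normalizer would also need context to decide whether a token such as "I" or "x" is a numeral at all, which this sketch does not attempt.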

Speech recognition

Mandarin Chinese model

Improved recognition of rare characters. Improved the recognition accuracy of the 8 kHz general-purpose telephone customer service model for low-volume speech.

Optimized

None

Mandarin Chinese model (upgraded)

  • Improved recognition in noisy scenarios.

  • Improved recognition of rare characters.

  • Improved recognition of accents mixed with Mandarin.

  • Improved recognition of nonsensical audio and reduced abnormal repetitions in recognition results.

  • Improved recognition of mixed Chinese and English in livestreaming scenarios.

Optimized

None

Added a parameter for audio channel selection to Audio File Transcription (including Express Edition)

For multi-channel files, you can specify the channel to be transcribed using a parameter. This lets you skip unnecessary channels to save costs.

New

API reference
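To picture what selecting a single channel from a multi-channel file means, the sketch below extracts one channel from an uncompressed PCM WAV using only Python's standard `wave` module. It is a local illustration under stated assumptions, not how the service implements the parameter, and `extract_channel` is a hypothetical helper. The actual request parameter name is given in the API reference.

```python
import io
import wave


def extract_channel(wav_bytes: bytes, channel: int) -> bytes:
    """Return a mono PCM WAV that contains only the given zero-based channel."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        n_channels = src.getnchannels()
        width = src.getsampwidth()       # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    if not 0 <= channel < n_channels:
        raise ValueError(f"channel must be in [0, {n_channels - 1}]")

    frame_size = width * n_channels      # one interleaved frame holds every channel
    mono = b"".join(
        frames[i + channel * width: i + (channel + 1) * width]
        for i in range(0, len(frames), frame_size)
    )

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(mono)
    return out.getvalue()
```

For a two-channel call recording, `extract_channel(data, 0)` would keep only the first channel, which halves the audio that needs to be transcribed.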

Added semantic segmentation to Audio File Transcription (including Express Edition)

You can use a parameter to control whether to enable semantic segmentation.

New

API reference

Product documentation update

  • Added more easy-to-understand explanations about dialects and accents.

  • Added product application videos.

  • Added instructions on the queries per second (QPS) for Audio File Transcription calls.

New

API reference

November 27, 2020

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Audio File Transcription - Express Edition

Audio File Transcription - Express Edition supports speech recognition models for all scenarios. The console supports querying the call volume of Audio File Transcription - Express Edition.

New

API reference

Optimized support for WAV files in speech recognition

Optimized automatic speech recognition (ASR) support for WAV files. More WAV file header formats are now supported, which reduces the impact of file headers on recognition results.

Optimized

None

Audio File Transcription - Express Edition timeout

Fixed an issue where using a 16 kHz model for 8 kHz speech recognition in Audio File Transcription - Express Edition did not immediately return an error, causing a timeout.

Fixed

None
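The mismatch described above (a 16 kHz model applied to 8 kHz audio) can also be caught on the client before a request is sent. The following is a minimal sketch that assumes the input is an uncompressed PCM WAV readable by Python's standard `wave` module; `check_sample_rate` is a hypothetical helper, not an SDK function.

```python
import io
import wave


def check_sample_rate(wav_bytes: bytes, model_rate_hz: int) -> None:
    """Raise ValueError if the WAV sample rate differs from the model's expected rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        actual = wav.getframerate()
    if actual != model_rate_hz:
        raise ValueError(
            f"audio is {actual} Hz but the selected model expects {model_rate_hz} Hz"
        )
```

Calling `check_sample_rate(data, 16000)` on an 8 kHz recording raises immediately, so the caller can pick the matching model or resample instead of waiting for a server-side error.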

Access token

Optimized token generation mechanism

Improved the token generation mechanism by adding a token validity period. This avoids potential request failures caused by the original "update token every 24 hours" mechanism.

Optimized

Obtain a token using an SDK
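One way a client might use the token validity period described above is to cache the token and request a new one only shortly before it expires. The sketch below assumes a caller-supplied `fetch` function that returns the token together with its expiry time in epoch seconds. `TokenCache`, `fetch`, and `margin_s` are hypothetical names for illustration; the real call for obtaining a token is described in the linked SDK document.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Cache an access token and refresh it shortly before its expiry time."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # hypothetical: returns (token, expiry in epoch seconds)
        margin_s: float = 60.0,                  # refresh this many seconds before expiry
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin_s = margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if it is missing or about to expire."""
        if self._token is None or self._clock() >= self._expires_at - self._margin_s:
            self._token, self._expires_at = self._fetch()
        return self._token
```

Because the refresh decision relies on the expiry time returned with the token rather than a fixed 24-hour timer, a token that expires earlier or later than expected is still handled correctly.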

August 23, 2020

Feature category

Feature name

Feature description

Update type

Related documents

Speech synthesis

Added resource and scenario configurations for speech synthesis

  • Added a resource tag to the SSML for speech synthesis. This tag can parse "offline resources for multimodal interaction" and can replace the position information of each character in the timestamp.

  • The RESTful API for speech synthesis now supports configuring speaker, volume, speech rate, and pitch parameters in the console. This simplifies API call parameter configuration.

  • Added speakers for literary scenarios: Ainan, Aiyan, Aihao, and Aiming, providing you with more choices.

New

API reference

Speech recognition

Optimized segmentation duration for real-time speech recognition

The default maximum segmentation duration for real-time speech recognition is shortened from 60 seconds to 15 seconds to simplify related API calls.

Optimized

API reference

Speech recognition general-purpose model and fixes for customer service quality inspection

  • Improved voice activity detection (VAD) for the 16 kHz Chinese general-purpose model. This resolves an issue where speech was falsely detected in silent audio.

  • Performed routine language model updates for the 8 kHz Chinese customer service quality inspection, 8 kHz English customer service quality inspection, and 16 kHz Korean models. This fixes some recognition errors.

Fixed

API reference

July 23, 2020

Feature category

Feature name

Feature description

Update type

Related documents

Self-learning training

Free use of self-learning model development

Self-learning models are now available for free. This provides a zero-cost custom voice service to help drive business innovation.

New

Overview

Self-learning platform training flow

  • Added recommendations for the best baseline model to simplify your training.

  • Integrated with automated testing to provide quantifiable test metric results for models.

New

Overview

Speech synthesis

Long-text-to-speech

The long-text RESTful API with integrated caption capabilities is officially released. The developer documentation is available on the official website.

New

RESTful API

SDK

New versions of Android and iOS SDKs are available

  • The Android SDK size is reduced by 34.6% and the iOS SDK size is reduced by 17.5%. The SDKs have been tested with hundreds of millions of daily calls and are highly stable.

  • Improved SDK state management (such as turning audio on/off and data push). You can focus on business implementation without complex state and thread management.

  • Maintains API consistency with the end-to-end solution. This allows for seamless integration with Intelligent Speech Interaction scenarios such as wake-up, voice, and dialogue understanding, and offline speech synthesis in the future.

Optimized

None

Speech recognition

Fixed speech recognition issues

Improved English post-processing. This resolves an issue where recognition results were incorrectly formatted in some cases after punctuation was enabled.

Fixed

None

July 09, 2020

Feature category

Feature name

Feature description

Update type

Related documents

Speech recognition

Optimized speech recognition model

The English recognition model for 8 kHz audio sample rate in Short-sentence Recognition, Real-time Speech Recognition, and Audio File Transcription has been updated. This update improves the model's accent coverage and makes the language model more general-purpose, without decreasing the word recognition accuracy on the general test set.

Optimized

API reference

Speech synthesis

Fixed speech synthesis model

  • Abby (speaker name): Reduced the rate of missed words.

  • Wendy (speaker name): Fixed the instability issue when synthesizing long text.

  • English scenario: Fixed the issue where non-standard spaces in English text caused word parsing to fail, improving word recognition accuracy.

  • Chinese scenario: Fixed issues with polyphonic characters and word segmentation.

Fixed

None