Core benefits of Intelligent Speech Interaction

This topic describes the core benefits of Intelligent Speech Interaction.

Speech recognition

High recognition accuracy
Our proprietary 'Shi Yin Shi' general end-to-end speech recognition framework is based on SAN-M and provides Chinese recognition accuracy that is among the highest in the industry.
For applications such as input methods, customer service, and conferences, the word error rate is 10% to 30% lower than that of our previous system.
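As a rough illustration of how a word error rate (WER) figure like the one above is typically measured, the following sketch computes WER as the token-level edit distance divided by the reference length. The function names are hypothetical and are not part of the service. The helper `relative_reduction` assumes the quoted improvement is a relative reduction, which the text above does not state explicitly.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Relative WER reduction (assumed interpretation of 'X% lower')."""
    return (old_wer - new_wer) / old_wer


# One deleted word out of four reference words.
wer = word_error_rate("turn on the light".split(), "turn on light".split())
```

For Chinese, the same computation is usually applied per character instead of per word, which can be done by passing `list(text)` for both arguments.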
Fast recognition speed
Character-level modeling units and a proprietary model inference engine make concurrent inference more than 10 times faster than mainstream industry frameworks.
Our proprietary LFR decoding technology increases decoding speed by more than three times without any loss of recognition accuracy. This significantly shortens feedback time and improves the user experience.
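The following is a minimal sketch of the general idea, assuming that LFR here stands for low frame rate processing: adjacent acoustic feature frames are stacked and then subsampled, so the decoder handles roughly one third as many frames. This is not the proprietary implementation. The function name, the stacking width, and the stride are illustrative assumptions.

```python
from typing import List, Sequence


def to_low_frame_rate(frames: Sequence[Sequence[float]],
                      stack: int = 3,
                      stride: int = 3) -> List[List[float]]:
    """Concatenate `stack` consecutive frames and keep one window every `stride` frames."""
    if stack < 1 or stride < 1:
        raise ValueError("stack and stride must be positive")
    out: List[List[float]] = []
    for start in range(0, len(frames), stride):
        window = [list(f) for f in frames[start:start + stack]]
        # Pad the final window by repeating its last frame (assumed padding rule).
        while len(window) < stack:
            window.append(window[-1])
        out.append([value for frame in window for value in frame])
    return out


# Nine 2-dimensional frames become three 6-dimensional frames.
frames = [[float(i), float(i)] for i in range(9)]
lfr = to_low_frame_rate(frames)
```

With `stride=3`, the number of frames passed to the decoder drops by a factor of three, which is the kind of reduction that could account for a speedup of that order.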
Unique model optimization tools
You can use the model optimization tool to customize models for specific domains. This maximizes recognition performance.
Rich features
The service supports a wide range of features, including audio-text synchronization, language identification, and voiceprint recognition.
Wide domain coverage
The service suits a variety of scenarios, such as AI chat, voice commands, audio and video captioning, voice search, conference transcription, voice-based quality inspection, public security and fire emergency calls, and court hearing records.

Speech synthesis

Leading technology
Our technology accounts for multi-level prosodic breaks to create a natural-sounding rhythm. It uses a combination of acoustic and linguistic parameters to build multiple automatic prediction models based on deep learning.
Realistic effect
Our on-device speech synthesis uses Knowledge-Aware Neural Text-to-Speech (KAN-TTS) technology. This technology is based on deep neural networks and machine learning. It transforms text into realistic, rich, and expressive speech. This makes the quality of offline speech synthesis nearly identical to online synthesis. Similarly, the output from custom voices is almost indistinguishable from real human recordings.
Personalized timbres
The service supports multiple languages, including Chinese and English, and a variety of timbres, scenarios, and styles. You can also create custom offline voices with a small amount of data.
Natural sound
The models are trained on massive amounts of audio data, so the synthesized voice is realistic, rich, and expressive. Its Mean Opinion Score (MOS) is among the highest in the industry.
Deep customization
You can customize voice libraries to meet the needs of your application. Choose from multiple styles, such as standard male and female voices or a gentle, sweet female voice. Use Speech Synthesis Markup Language (SSML) to control synthesis and dynamically adjust parameters such as volume, speed, and pitch. You can also use your own data to synthesize Text-to-Speech (TTS) voices.
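To illustrate how SSML can adjust volume, speed, and pitch, the sketch below builds a simplified SSML fragment with Python's standard library. It uses the `speak` and `prosody` elements from the W3C SSML specification and omits the namespace and version attributes. The tags and attribute values that this service accepts may differ, and the function name `build_ssml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(text: str,
               rate: str = "medium",
               pitch: str = "medium",
               volume: str = "medium") -> str:
    """Wrap text in <speak><prosody ...> and return the serialized XML."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(
        speak, "prosody", {"rate": rate, "pitch": pitch, "volume": volume}
    )
    # ElementTree escapes special characters such as '&' on serialization.
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Hello & welcome", rate="slow", pitch="high", volume="loud")
```

Building the markup with an XML library instead of string formatting keeps the output well-formed when the input text contains characters such as `&` or `<`.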
Efficient and stable
The API is simple and easy to integrate. The service is stable and highly compatible. It features low first-packet latency, low memory usage, and low CPU consumption. Solutions are also available for low-spec hardware.
Cost-effective
Offline speech synthesis performs real-time synthesis without an internet connection. Licensing is per device, which makes costs predictable. Voice customization requires less data. For Mandarin Chinese, you can create a natural, fluent voice with just 2,000 sentences. You can add English data to enable mixed Chinese and English speech. This greatly reduces the time and cost of recording and annotation, which offers a significant price advantage.
Multi-domain coverage
We have built extensive vocabularies for many domains, including smart home, automotive, navigation, finance, telecommunications, logistics, real estate, education, and audiobooks. This ensures that our speech synthesis technology pronounces industry-specific terms accurately.

Self-learning platform

Easy to use
The self-learning platform provides a one-click, self-service solution for voice optimization. This greatly lowers the barrier to entry. Even non-technical staff can use it to significantly improve recognition accuracy for their business.
Fast
The self-learning platform can optimize, test, and deploy a custom model for your business in minutes. It also supports real-time optimization for hotwords. This avoids the long delivery times of traditional customization, which can take weeks or months.
Accurate
The self-learning platform has proven effective in many internal and external projects. In many of them, the platform not only resolved usability issues but also achieved better optimization results than competitors who used traditional methods.