Real-time Conversational AI enables efficient audio and video interactions between an AI and users. This document describes the capabilities and benefits of Real-time Conversational AI.
Product introduction
Real-time Conversational AI is a solution that helps businesses quickly build audio and video call applications between AI agents and users. You can build a custom AI agent in 10 minutes using a visual configuration interface. The agent can interact with end users in real time over the ApsaraVideo Real-time Communication network. This solution is suitable for scenarios such as online customer service, AI assistants, AI companions, matchmaking assistants, and virtual teachers.
Capabilities
In Real-time Conversational AI, an AI agent is a highly realistic, cloud-based entity that can interact with users through audio and video calls, chat messages, and AI-powered outbound and inbound phone calls. To meet different interaction needs, you can configure workflows for the agent to enable the following capabilities:
Audio and video calls
Voice calls: Users communicate with the AI agent through voice.
Digital human calls: Users interact with a digital human through video for a more realistic experience.
Visual understanding calls: The agent responds to both voice and visual input during a video interaction.
Video calls: A digital human with visual understanding holds bidirectional video calls with users.
For example, in the Audio and Video Call Quick Start, you configure just three nodes to create a voice call workflow.
Chat messages
Users can communicate directly with the agent through voice or text in a chat dialog box.
For example, in the Quick Start for chat messages, you configure a simple flow to create a chat message workflow.
AI-powered outbound/inbound phone calls
The AI voice agent supports both Real-Time Communication (RTC) calls and traditional phone calls. A single agent can serve multiple lines, so a business can run AI calls across all of its lines from one system.
AI phone calls can run smoothly in noisy environments with background sounds such as music or table tapping.
For example, in the Outbound & Inbound Call Quick Start, you configure three nodes to create a voice call workflow and make outbound and inbound phone calls.

Visit the AI Experience Center to quickly try Real-time Conversational AI.
Visit the Demo Center to fully understand the capabilities described above.
Benefits
Global high availability and low latency: The solution is powered by the Alibaba Cloud ApsaraVideo Real-time Communication network. The network's more than 3,200 global nodes and Quality of Service (QoS) optimization let users hold smooth audio and video calls with the AI agent from anywhere in the world.
Easy component integration and debugging: You can integrate AI components, such as speech-to-text, large language models, speech synthesis, and proprietary vector databases, as plugins into your workflow. This lets you quickly launch your business and easily debug the entire technical solution.
Highly human-like: Alibaba Cloud continuously iterates and optimizes features such as intelligent noise reduction, smart interruption, and intelligent sentence segmentation. This makes the agent's interactive behavior more human-like.
Easy system integration: Alibaba Cloud provides four integration methods for building your Real-time Conversational AI system, so you can choose the one that fits your scenario and application.
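The plugin-style workflow mentioned under "Easy component integration and debugging" can be pictured as a chain of interchangeable stages. The sketch below is illustrative only and does not use the product's API: the `Workflow` class, the node names, and the `fake_*` functions are hypothetical stand-ins, and text strings stand in for audio. It shows how speech-to-text, a large language model, and speech synthesis might be chained, with a per-node trace that makes each stage easy to inspect while debugging.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical node signature: each node takes a payload and returns a new one.
Node = Callable[[str], str]


@dataclass
class Workflow:
    """Illustrative chain of named nodes; not the product's actual API."""
    nodes: List[Tuple[str, Node]] = field(default_factory=list)

    def add(self, name: str, node: Node) -> "Workflow":
        # Register a node under a name and return self so calls can be chained.
        self.nodes.append((name, node))
        return self

    def run(self, payload: str) -> Tuple[str, List[Tuple[str, str]]]:
        # Pass the payload through every node in order and record each
        # intermediate result so a single stage can be inspected.
        trace: List[Tuple[str, str]] = []
        for name, node in self.nodes:
            payload = node(payload)
            trace.append((name, payload))
        return payload, trace


# Stand-in plugins (assumptions): real components would call actual services.
def fake_stt(audio: str) -> str:
    prefix = "audio:"
    return audio[len(prefix):] if audio.startswith(prefix) else audio


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> str:
    return f"audio:{text}"


workflow = (
    Workflow()
    .add("speech_to_text", fake_stt)
    .add("llm", fake_llm)
    .add("speech_synthesis", fake_tts)
)

reply, trace = workflow.run("audio:hello")
for name, output in trace:
    print(f"{name}: {output}")
```

Because every node shares one signature, swapping a component (for example, a different speech synthesis plugin) only changes the function passed to `add`, and the trace shows where in the chain an unexpected result first appears.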
For more information about Real-time Conversational AI, see Overview of Real-time Conversational AI.






