Build AI companionship apps with Real-time Conversational AI

Build AI companionship applications with Alibaba Cloud Real-time Conversational AI, supporting audio-only and avatar-based interactions.

Background

AI companionship products span role-playing, emotional chat, and psychological therapy. While most AI chat apps rely on offline text or voice in IM-style interfaces, multimodal models like GPT-4o are driving real-time voice and video interactions for more immersive virtual experiences.

Alibaba Cloud integrates third-party LLMs and TTS to deliver real-time interactive companionship with dynamic, evolving storylines. Users can both consume and create content for a personalized experience.

Options

Interaction modes

Real-time Conversational AI supports two interaction modes. Select a mode by specifying the call type when creating your agent, then integrate the corresponding SDK. Try the demo first. To integrate the service, see Quick start for audio/video calls.

Item	Audio-only call	Avatar call
Interaction	User: audio; AI companion: audio	User: audio; AI companion: video
Cost	Low	Medium

Client SDKs

For more information about SDK integration, see Developer guide.

SDK	Description
Web SDK	Recommended for: desktop browsers, such as Chrome; mobile H5 pages, such as Alipay H5, DingTalk H5, and WeChat mini program H5; in-app WebView. Note: Do not use native mobile browsers, because some devices are not compatible with Web Real-Time Communication (WebRTC). Native components of WeChat mini programs are not supported; use WeChat mini program H5 instead. For more information about integration, see Integrate the Web SDK in a WeChat mini program.
Android/iOS SDK	Recommended for: applications that run on Android or iOS.
Others	To develop for Windows or macOS desktop, search for DingTalk group ID 106730016696, join the group, and contact us.

Basic features

Personalized calls

Customize each user's call experience by configuring call startup parameters at call initiation.

Setting	Description	Modifiable during call?
LLM prompt	Pass user-specific information in the initial prompt for a more personalized companionship experience.	Yes
ASR language	Set the speech recognition language (such as Chinese or English).	Yes
TTS voice	Set the AI's voice and timbre.	Yes
Avatar	If using a `VideoAgent` with multiple avatars, you can specify which one to use for the call.	No
Welcome message	Set a custom welcome message for each user, such as "Hi, Alice, it's great to see you again!"	No
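
The table above can be sketched as a small helper that assembles per-user startup settings and refuses mid-call changes to the settings marked as not modifiable. This is only an illustration: the key names (`llm_prompt`, `asr_language`, `tts_voice`, `avatar_id`, `welcome_message`), the voice names, and both functions are hypothetical placeholders, not the service's actual parameter names. See the Developer guide for the real ones.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

# Hypothetical setting names; the real startup parameters are defined by the service.
MODIFIABLE_DURING_CALL = {"llm_prompt", "asr_language", "tts_voice"}
FIXED_AT_START = {"avatar_id", "welcome_message"}


@dataclass
class CallSettings:
    llm_prompt: str
    asr_language: str
    tts_voice: str
    welcome_message: str
    avatar_id: Optional[str] = None  # only meaningful for avatar calls


def build_call_settings(user_name: str, interests: List[str],
                        avatar_id: Optional[str] = None) -> CallSettings:
    """Assemble per-user settings to send when the call is initiated."""
    prompt = (
        f"You are a friendly companion talking to {user_name}. "
        f"Their interests: {', '.join(interests)}."
    )
    return CallSettings(
        llm_prompt=prompt,
        asr_language="en",
        tts_voice="warm_female_1",  # placeholder voice name
        welcome_message=f"Hi, {user_name}, it's great to see you again!",
        avatar_id=avatar_id,
    )


def apply_update(settings: CallSettings, **changes: str) -> CallSettings:
    """Return a copy with the changes applied, refusing start-only fields."""
    blocked = sorted(set(changes) & FIXED_AT_START)
    if blocked:
        raise ValueError(f"cannot change during a call: {', '.join(blocked)}")
    unknown = sorted(set(changes) - MODIFIABLE_DURING_CALL)
    if unknown:
        raise KeyError(f"unknown setting: {', '.join(unknown)}")
    return CallSettings(**{**asdict(settings), **changes})
```

For example, `apply_update(settings, tts_voice="calm_male_2")` returns a new settings object, while `apply_update(settings, welcome_message="...")` raises `ValueError`, mirroring the "Modifiable during call?" column.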

Knowledge base

If you need a knowledge base, perform the following steps:

Use Alibaba Cloud Model Studio to create an agent and publish it to Real-time Conversational AI. For more information about publishing an agent, see Publish a Real-time Conversational AI agent from Alibaba Cloud Model Studio.
Set up the question library in Alibaba Cloud Model Studio. For more information about how to set up a question library, see Quick Start.

Pass user information to the model

If multiple users are online during a call, the LLM must be able to accurately distinguish which user sent the current input. Real-time Conversational AI lets you pass information to the LLM. You can use this feature to pass custom information, such as a UserID, to the model. For more information, see Pass business parameters to an Alibaba Cloud Model Studio LLM.
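
As a rough sketch of the idea, per-user information can be serialized once and paired with each input so the model can attribute it to the right speaker. The field names (`user_id`, `display_name`) and both helpers are assumptions made for illustration; the actual parameter format is described in Pass business parameters to an Alibaba Cloud Model Studio LLM.

```python
import json


def build_user_context(user_id: str, display_name: str) -> str:
    """Serialize hypothetical per-user fields as a JSON string."""
    if not user_id:
        raise ValueError("user_id must not be empty")
    context = {"user_id": user_id, "display_name": display_name}
    # sort_keys keeps the output stable, which helps when logging or diffing.
    return json.dumps(context, sort_keys=True, ensure_ascii=False)


def tag_utterance(user_context: str, text: str) -> dict:
    """Pair an utterance with the sender's context so it can be attributed."""
    return {"user": json.loads(user_context), "text": text}
```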

Detect and handle user silence

Listen for the intent_recognized parameter in the callback to obtain the time of each user utterance, which lets you detect when a user has been silent for a long time. For more information, see Agent callbacks. Common handling methods are as follows:

End the conversation: For more information, see StopAIAgentInstance - Stop an agent instance.
Play a reminder: If the user is silent for a specified number of seconds, the AI proactively plays a message to prompt the user. For more information, see Voice broadcast.
Have the LLM ask the next question: If the user is not speaking and you want the AI to continue, you can drive the model's output directly with text. For more information, see How to use text as input for a large language model.
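
The handling methods above can be combined into a simple escalation policy: remind first, then have the LLM continue, and finally end the conversation. The sketch below only decides which action is due. The thresholds, the escalation order, and the class itself are illustrative assumptions, and the actual calls to the service (stopping the agent, voice broadcast, text input) are left to your own code.

```python
from enum import Enum
from typing import Optional


class SilenceAction(Enum):
    NONE = "none"
    REMIND = "remind"        # play a reminder message
    ASK_NEXT = "ask_next"    # drive the LLM with text so it continues
    END_CALL = "end_call"    # stop the agent instance


class SilenceTracker:
    """Tracks the last user utterance and escalates as silence grows.

    Thresholds are illustrative defaults, not values recommended by the service.
    """

    def __init__(self, remind_after: float = 10.0, ask_after: float = 20.0,
                 end_after: float = 60.0) -> None:
        if not remind_after < ask_after < end_after:
            raise ValueError("thresholds must be strictly increasing")
        # Checked from longest to shortest so the most severe due action wins.
        self.thresholds = [
            (end_after, SilenceAction.END_CALL),
            (ask_after, SilenceAction.ASK_NEXT),
            (remind_after, SilenceAction.REMIND),
        ]
        self.last_utterance: Optional[float] = None
        self.last_action = SilenceAction.NONE

    def on_user_utterance(self, timestamp: float) -> None:
        """Call this from the callback handler each time the user speaks."""
        self.last_utterance = timestamp
        self.last_action = SilenceAction.NONE

    def check(self, now: float) -> SilenceAction:
        """Return the action that has newly become due, or NONE."""
        if self.last_utterance is None:
            return SilenceAction.NONE
        silent_for = now - self.last_utterance
        for limit, action in self.thresholds:
            if silent_for >= limit:
                if action is self.last_action:
                    return SilenceAction.NONE  # this step already fired
                self.last_action = action
                return action
        return SilenceAction.NONE
```

Call `check()` on a timer, for example once per second, and dispatch on the returned action. Each step fires once per silent stretch, and any new utterance resets the escalation.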

Conversation archiving

Save audio data and text transcripts from companionship sessions. For instructions, see Data archiving.

Advanced features

Spoken language assessment (per-sentence)

Real-time Conversational AI can record each user utterance as a separate audio file, saved in real time to your OSS bucket for pronunciation assessment.

Note

Real-time Conversational AI provides per-sentence audio recording but not the assessment feature itself. To configure per-sentence audio callbacks, see AI agent callbacks.
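
Because the assessment itself happens outside the service, one possible arrangement is to turn each per-sentence audio callback into a job for your own or a third-party assessment pipeline. The sketch below assumes a JSON callback body with `user_id`, `sentence_id`, and `object_key` fields. These names and the payload shape are hypothetical, so check AI agent callbacks for the real format.

```python
import json
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass(frozen=True)
class AssessmentJob:
    user_id: str
    sentence_id: int
    audio_uri: str  # location of the per-sentence recording in OSS


def parse_audio_callback(body: str, bucket: str) -> AssessmentJob:
    """Turn a (hypothetical) per-sentence audio callback into an assessment job."""
    data = json.loads(body)
    missing = [k for k in ("user_id", "sentence_id", "object_key") if k not in data]
    if missing:
        raise ValueError(f"callback is missing fields: {', '.join(missing)}")
    return AssessmentJob(
        user_id=str(data["user_id"]),
        sentence_id=int(data["sentence_id"]),
        audio_uri=f"oss://{bucket}/{data['object_key']}",
    )


# Jobs waiting to be picked up by the assessment worker, oldest first.
pending: Deque[AssessmentJob] = deque()


def enqueue(body: str, bucket: str) -> None:
    """Queue the recording for pronunciation assessment."""
    pending.append(parse_audio_callback(body, bucket))
```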