AI speaking tutor

Build an AI-powered speaking practice service with Real-time Conversational AI to help learners improve oral skills through personalized, on-demand sessions.

Overview

AI-powered speaking practice eliminates the need for a human partner and removes time and location constraints. The AI analyzes learner history to create personalized content, provides instant feedback and corrections, and simulates diverse scenarios to broaden language application. A stress-free environment helps learners build confidence and improve speaking skills.

Solution options

Practice formats

Real-time Conversational AI supports two call formats. Specify a call type when you create an agent, then integrate it. Try the demo to preview the experience. To integrate, follow the Quick Start for audio and video calls.

Call type	Audio-only call	Digital human call
Practice format	Learner: audio; AI partner: audio	Learner: audio; AI partner: video
Cost	Low	Medium

Client SDKs

For more information about SDK integration, see Developer guide.

SDK	Description
Web SDK	Recommended for: desktop browsers, such as Chrome; mobile H5 pages, such as Alipay H5, DingTalk H5, and WeChat mini program H5; and in-app WebView. Note: Do not use native mobile browsers, because some devices are not compatible with Web Real-Time Communication (WebRTC). Native components of WeChat mini programs are not supported; use WeChat mini program H5 instead. For more information about integration, see Integrate the Web SDK in a WeChat mini program.
Android/iOS SDK	Recommended for: applications that run on Android or iOS.
Others	To develop for Windows or macOS desktop, search for DingTalk group ID 106730016696, join the group, and contact us.

Core features

Personalized calls and scenario switching

Customize each call by configuring call startup and personalization parameters. Learners can switch scenarios mid-call without disconnecting. For example, switching from a "directions" scenario to a "shopping" scenario requires redefining the Large Language Model (LLM) prompt.

Setting	Description	Can be modified during a call
LLM prompt	Include learner information in the prompt. Pass it as an input parameter when starting the call for more targeted practice.	Yes
Automatic Speech Recognition (ASR) language	Set the language, such as Chinese or English.	Yes
Text-to-Speech (TTS) voice	Set the AI's voice.	Yes
Digital human avatar	If your agent is a VideoAgent and you have multiple digital human avatars, specify an avatar for the call.	No
Welcome message	Set a welcome message for different users, such as "Hello, Xiaoyun. Today, we will simulate a shopping scenario..."	No
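
The settings in the table above can be sketched as plain data. The following Python example is a minimal illustration, not the product's API: the key names (`llm_prompt`, `asr_language`, `tts_voice`, `avatar_id`, `welcome_message`) and the `Learner` fields are hypothetical placeholders. It shows how learner information can be embedded in the prompt at call start, and how a mid-call update can be limited to the settings the table marks as modifiable.

```python
from dataclasses import dataclass

# Hypothetical setting names. The real parameter names come from the
# product's API reference, not from this sketch.
MODIFIABLE_DURING_CALL = {"llm_prompt", "asr_language", "tts_voice"}
START_ONLY = {"avatar_id", "welcome_message"}


@dataclass
class Learner:
    name: str
    level: str
    target_language: str


def build_prompt(learner: Learner, scenario: str) -> str:
    """Compose an LLM prompt that embeds learner information and the scenario."""
    return (
        f"You are a {learner.target_language} speaking partner. "
        f"The learner is {learner.name}, level {learner.level}. "
        f"Role-play a '{scenario}' scenario and gently correct mistakes."
    )


def build_start_config(learner: Learner, scenario: str, tts_voice: str) -> dict:
    """Settings passed once, when the call starts."""
    return {
        "llm_prompt": build_prompt(learner, scenario),
        "asr_language": learner.target_language,
        "tts_voice": tts_voice,
        "welcome_message": (
            f"Hello, {learner.name}. Today, we will simulate a {scenario} scenario."
        ),
    }


def validate_update(update: dict) -> dict:
    """Reject any setting that the table marks as not modifiable during a call."""
    rejected = set(update) - MODIFIABLE_DURING_CALL
    if rejected:
        raise ValueError(f"Not modifiable during a call: {sorted(rejected)}")
    return update


def build_scenario_update(learner: Learner, new_scenario: str) -> dict:
    """Switching scenarios mid-call only requires a new LLM prompt."""
    return validate_update({"llm_prompt": build_prompt(learner, new_scenario)})
```

For example, `build_scenario_update(learner, "shopping")` returns a dictionary that contains only a new prompt, while `validate_update({"welcome_message": "..."})` raises `ValueError` because the welcome message is fixed once the call starts.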

Knowledge base

To set up a knowledge base:

Use Alibaba Cloud Model Studio to create an agent and publish it to Real-time Conversational AI. For more information about publishing an agent, see Publish a Real-time Conversational AI agent from Alibaba Cloud Model Studio.
Set up the question library in Alibaba Cloud Model Studio. For more information about how to set up a question library, see Quick Start.

Send custom messages to users

To send information such as cards or questions to the client in real time during a call, use the dedicated messaging channel that Real-time Conversational AI provides. After the client receives a custom message, it can perform custom business actions, such as downloading resources or rendering interactive content.

Alibaba Cloud provides two solutions:

Solution 1: Send custom messages to the client from your AppServer. For more information, see Send custom messages to a client.
Solution 2: Include custom messages in the LLM response. The message is delivered to the client in real time along with the captions.
Note
You can hide instructions in the model response and mark them with special symbols, such as `{}` or `[]`. To do this, go to Console > Workflow > TTS Node > Filter Broadcast. The marked content is not spoken. You can then parse this content to handle your custom business logic.
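
On the client side, parsing the marked content might look like the following Python sketch. It assumes that instructions are wrapped in `{}` and contain flat JSON, which is a convention chosen for this example rather than something the product requires. The function name `split_caption` and the instruction fields (`type`, `id`) are hypothetical.

```python
import json
import re

# Matches one {...} span with no nested braces. Nested JSON would need a
# real parser instead of a regular expression.
MARKED = re.compile(r"\{[^{}]*\}")


def split_caption(caption: str) -> tuple[str, list[dict]]:
    """Split caption text into the displayable part and embedded instructions.

    Every marked span is removed from the displayable text, mirroring the
    TTS filter that keeps marked content from being spoken. Only spans that
    parse as JSON objects are returned as instructions.
    """
    instructions = []
    for match in MARKED.finditer(caption):
        try:
            payload = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # Marked, but not valid JSON: drop it silently.
        if isinstance(payload, dict):
            instructions.append(payload)
    display = MARKED.sub("", caption)
    display = re.sub(r"\s{2,}", " ", display).strip()
    return display, instructions


display, instructions = split_caption(
    'Great job! {"type": "card", "id": "q1"} Now try again.'
)
# display is the text to show; instructions drive custom business logic.
```

Choosing a structured payload such as JSON keeps the parsing step independent of the business logic that consumes the instructions.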

User information pass-through

When multiple users are online during a call, the LLM must distinguish which user sent the current input. To make this possible, pass custom information, such as a UserID, to the model. For more information, see Pass business parameters to an Alibaba Cloud Model Studio LLM.

Detect and handle user silence

To handle cases where a user is silent for a long time, listen for the intent_recognized parameter in the callback to obtain the time of each user utterance. For more information, see Agent callbacks. Common handling methods are as follows:

End the conversation: For more information, see StopAIAgentInstance - Stop an agent instance.
Play a reminder: If the user is silent for a specified number of seconds, the AI proactively plays a message to prompt the user. For more information, see Voice broadcast.
Have the LLM ask the next question: If the user is not speaking and you want the AI to continue, you can drive the model's output directly with text. For more information, see How to use text as input for a large language model.
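
The handling methods above can be combined into an escalating policy on your server. The following Python sketch is one possible design, not part of the product: the class names, thresholds, and action labels are hypothetical, and the callback payload is not modeled. It assumes your callback handler extracts a timestamp for each user utterance and that a timer polls the tracker periodically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SilencePolicy:
    """Seconds of silence before each action. Values are illustrative."""
    remind_after: float = 10.0
    ask_after: float = 20.0
    end_after: float = 60.0


class SilenceTracker:
    """Tracks the last user utterance and decides what to do about silence."""

    def __init__(self, policy: SilencePolicy, started_at: float) -> None:
        # Steps are ordered from the longest threshold to the shortest.
        self._steps = [
            (policy.end_after, "end_conversation"),
            (policy.ask_after, "ask_next_question"),
            (policy.remind_after, "play_reminder"),
        ]
        self._fired = set()
        self.last_utterance = started_at

    def on_utterance(self, timestamp: float) -> None:
        """Call this when a callback reports a new user utterance."""
        self.last_utterance = timestamp
        self._fired.clear()

    def next_action(self, now: float) -> Optional[str]:
        """Return an action the first time its threshold is crossed, else None."""
        silent_for = now - self.last_utterance
        for threshold, action in self._steps:
            if silent_for >= threshold:
                if action in self._fired:
                    return None
                self._fired.add(action)
                return action
        return None
```

Your own code then maps each returned label to the corresponding handling method, for example calling StopAIAgentInstance when the label is `end_conversation`. Each action fires once per silent period, so a frequent polling timer does not repeat the reminder.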

Transcription and recording

Save audio or text data from practice sessions. For more information, see Data archiving.

Advanced features

Sentence-by-sentence evaluation

Real-time Conversational AI records each user utterance as a separate audio file in real time and stores it in your specified Object Storage Service (OSS) bucket. You can then run pronunciation evaluation on these files.

Note

Real-time Conversational AI provides sentence-by-sentence recording only, not audio evaluation. To set up sentence-by-sentence audio callbacks, use Agent callbacks.