This topic describes how to use the Alibaba Real-Time Communication (ARTC) SDK solely as a transport channel for Real-time Conversational AI.
Background
Alibaba Cloud offers two solutions for integrating AI with real-time communication (RTC):
End-to-end Real-time Conversational AI solution. For more information, see Quick Start for Real-time Conversational AI.
RTC as a transport channel only. You implement the AI service orchestration yourself.
This topic focuses on the second solution, which is designed for application scenarios that use ARTC for audio and video transmission and require highly customized AI processing. The ARTC SDK helps you establish an efficient and stable audio and video data transmission link between the client and the server. You can then add or integrate the AI features you need.
Solution details
Alibaba Cloud provides the following two SDKs for audio and video transmission between clients and servers:
The ARTC SDK for clients, which supports Android, iOS, Windows, and H5.
The Linux SDK for servers.
Audio-only scenario
For audio-only scenarios, Alibaba Cloud recommends the following architecture:
In this architecture, the ARTC SDK and the Linux SDK join the same RTC channel. The Linux SDK receives the audio stream from the ARTC SDK, decodes it, and passes the decoded audio data to the business layer. There, you can orchestrate AI services such as ASR, an LLM, and TTS to process the audio, and then send the resulting raw (not yet encoded) audio data back to the Linux SDK. The Linux SDK encodes the data and sends it to the ARTC SDK for playback in the application.
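The audio-only flow above can be sketched as a small business-layer pipeline. This is a minimal illustration, not the Linux SDK's API: `PcmFrame`, `AudioPipeline`, `on_remote_audio`, and the `fake_*` stages are hypothetical names, and the 16 kHz mono format is an assumption. In a real integration, the Linux SDK's decoded-audio callback would feed `on_remote_audio`, and `send_pcm` would hand raw audio back to the Linux SDK for encoding.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class PcmFrame:
    """Raw (unencoded) 16-bit PCM audio exchanged with the business layer."""
    data: bytes
    sample_rate: int = 16000  # assumed format, not mandated by the SDK
    channels: int = 1


class AudioPipeline:
    """Chains ASR -> LLM -> TTS. Every stage is a caller-supplied callable."""

    def __init__(
        self,
        asr: Callable[[PcmFrame], Optional[str]],
        llm: Callable[[str], str],
        tts: Callable[[str], Iterable[PcmFrame]],
        send_pcm: Callable[[PcmFrame], None],
    ) -> None:
        self.asr = asr
        self.llm = llm
        self.tts = tts
        self.send_pcm = send_pcm

    def on_remote_audio(self, frame: PcmFrame) -> None:
        """Hypothetical entry point for decoded audio from the end user."""
        text = self.asr(frame)
        if not text:
            return  # nothing recognized yet, so nothing to answer
        reply = self.llm(text)
        for out_frame in self.tts(reply):
            # Raw PCM goes back to the transport layer, which encodes it.
            self.send_pcm(out_frame)


# Stand-in stages so the sketch runs without any external service.
def fake_asr(frame: PcmFrame) -> Optional[str]:
    return "hello" if frame.data else None


def fake_llm(text: str) -> str:
    return f"echo: {text}"


def fake_tts(text: str) -> List[PcmFrame]:
    # One 10 ms frame of silence (160 samples at 16 kHz) per word.
    return [PcmFrame(data=b"\x00\x00" * 160) for _ in text.split()]


sent: List[PcmFrame] = []
pipeline = AudioPipeline(fake_asr, fake_llm, fake_tts, sent.append)
pipeline.on_remote_audio(PcmFrame(data=b"\x01\x00" * 160))  # produces a reply
pipeline.on_remote_audio(PcmFrame(data=b""))                # ignored by ASR
```

Keeping each stage behind a plain callable lets you swap ASR, LLM, or TTS vendors without touching the transport side.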
Digital human scenario
For digital human scenarios, Alibaba Cloud recommends the following architecture:
In this architecture, the ARTC SDK and the Linux SDK join the same RTC channel. The Linux SDK receives the audio and video streams from the ARTC SDK, decodes them, and passes the decoded data to the business layer. There, you can orchestrate AI services as needed. After processing by ASR, an LLM, TTS, and the digital human service, the resulting raw (not yet encoded) audio and video data is sent to the Linux SDK. The Linux SDK encodes the data and sends it back to the ARTC SDK for playback or rendering in the application.
If your digital human is from a third-party vendor and the service is not deployed on your backend server, Alibaba Cloud recommends the following architecture:
In this architecture, the third-party digital human vendor integrates the Alibaba Cloud Linux SDK and sends the digital human's raw audio and video data (PCM and YUV) to the SDK. The SDK encodes this data into Opus and H.264 and sends it to the end user over the Alibaba Cloud Global Real-time Transport Network (GRTN). The RTC channel then has three participants: the end user and two Linux SDK instances.
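Because the Linux SDK is handed raw PCM and YUV rather than encoded media, it helps to know how large each raw frame is when sizing buffers between the digital human service and the SDK. The sketch below is a rough calculation under stated assumptions: 16-bit PCM samples and the I420 (planar 4:2:0) YUV layout. The source does not say which PCM or YUV format the SDK expects, and the function names are illustrative.

```python
def pcm_frame_bytes(sample_rate_hz: int, frame_ms: int,
                    channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Size of one raw PCM frame (assumes 16-bit samples by default)."""
    samples_per_channel = sample_rate_hz * frame_ms // 1000
    return samples_per_channel * channels * bytes_per_sample


def i420_frame_bytes(width: int, height: int) -> int:
    """Size of one raw I420 frame: a full-resolution Y plane plus
    U and V planes subsampled by 2 in each dimension (rounded up)."""
    chroma_w = (width + 1) // 2
    chroma_h = (height + 1) // 2
    return width * height + 2 * chroma_w * chroma_h


audio_bytes = pcm_frame_bytes(48000, 20)   # 20 ms of 48 kHz mono: 1920 bytes
video_bytes = i420_frame_bytes(1280, 720)  # one 720p frame: 1382400 bytes
```

At 25 frames per second, raw 720p I420 video alone is about 34.6 MB per second, which is why the data is encoded to H.264 before it crosses the network.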
(Optional) Linux SDK extension features
Audio 3A
The Alibaba Cloud Linux SDK supports 3A features, including the following:
Applying further denoising to data from the ARTC SDK to improve ASR accuracy.
Applying gain to the audio from TTS or the digital human to increase the volume based on your configured level.
The denoising feature is implemented in both the ARTC SDK and the Linux SDK. The reasons for enabling denoising on the Linux SDK are:
Denoising on the Linux SDK requires more computing power than on the ARTC SDK, which makes it better suited for server-side deployment.
Denoising optimization for ASR cannot be enabled in H5 scenarios.
These 3A features are disabled by default. You can set the volume gain explicitly using an API. To enable the denoising feature, submit a ticket.
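To illustrate what a volume gain does to TTS or digital human audio, here is a minimal sketch that scales 16-bit little-endian PCM samples by a gain in decibels and clips the result. It is not the Linux SDK's implementation or API. The function name `apply_gain`, the decibel unit, and the sample format are assumptions made for illustration.

```python
import struct


def apply_gain(pcm: bytes, gain_db: float) -> bytes:
    """Scale 16-bit little-endian PCM by gain_db, clipping to the int16 range."""
    if len(pcm) % 2:
        raise ValueError("16-bit PCM must contain an even number of bytes")
    factor = 10.0 ** (gain_db / 20.0)  # convert decibels to a linear factor
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm)
    scaled = [max(-32768, min(32767, round(s * factor))) for s in samples]
    return struct.pack(f"<{count}h", *scaled)


quiet = struct.pack("<3h", 1000, -1000, 30000)
louder = struct.unpack("<3h", apply_gain(quiet, 20.0))
# A 20 dB gain is a factor of 10; the last sample clips at 32767.
```

Clipping distorts loud passages, so a large gain on audio that is already near full scale will degrade quality rather than improve it.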
H.265
The Alibaba Cloud Linux SDK supports H.265 encoding. Because this feature is not in high demand, the official SDK version does not include the H.265 encoding module. If you need the H.265 encoding feature, contact us.