Elastic Remote Direct Memory Access (eRDMA) is an RDMA network service provided by Alibaba Cloud that features low latency, high throughput, and high elasticity. To use eRDMA on a large scale, you must use instance types that support eRDMA together with elastic network interfaces (ENIs) that support Elastic RDMA Interfaces (ERIs). This topic introduces eRDMA and describes its benefits, use scenarios, and limits.
Introduction
What is eRDMA?
eRDMA is an elastic Remote Direct Memory Access (RDMA) network that Alibaba Cloud developed for the cloud. eRDMA reuses virtual private clouds (VPCs) as the underlying link and uses a congestion control (CC) algorithm developed by Alibaba Cloud. eRDMA retains the high throughput and low latency of RDMA. Unlike traditional RDMA, eRDMA can set up large-scale RDMA networking within seconds. eRDMA supports traditional high-performance computing (HPC) applications and Transmission Control Protocol/Internet Protocol (TCP/IP) applications.
You can use eRDMA to deploy HPC applications in the cloud and build highly elastic, high-performance application clusters at low cost. You can also switch application traffic from the standard VPC network to the eRDMA network to accelerate applications.
Implementation of eRDMA capabilities
eRDMA is available only on instance types that support it. On Elastic Compute Service (ECS) instances of these instance types, you can create and bind eRDMA-capable elastic network interfaces (ENIs) to provide large-scale RDMA network service capabilities.
Elastic RDMA Interfaces (ERIs) are virtual network interfaces that can be bound to ECS instances. An ERI depends on an ENI to enable the RDMA device and reuses the network to which the ENI belongs. This allows you to use RDMA in your existing network and benefit from its low latency without modifying your service networking.
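After an ERI is bound to an instance, one way to confirm that the operating system sees an RDMA device is to look at the RDMA entries in sysfs. The following minimal sketch assumes a Linux guest in which the kernel RDMA subsystem registers devices under /sys/class/infiniband and in which the required driver is already installed. The function name list_rdma_devices is illustrative and is not part of any Alibaba Cloud tool. This topic does not specify the device names that eRDMA uses, so the sketch prints whatever it finds.

```python
from pathlib import Path


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the names of RDMA devices registered under a sysfs directory.

    Assumption: a Linux system whose RDMA subsystem exposes one entry per
    device under /sys/class/infiniband. The path is a parameter so that the
    function can also be pointed at a test directory.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        # The directory is absent when no RDMA driver is loaded
        # or when the system is not Linux.
        return []
    return sorted(entry.name for entry in root.iterdir())


if __name__ == "__main__":
    devices = list_rdma_devices()
    if devices:
        print("RDMA devices found:", ", ".join(devices))
    else:
        print("No RDMA devices found. Check that an ERI is bound and the driver is loaded.")
```

An empty result does not necessarily mean that the instance type lacks eRDMA support. It may only mean that the driver has not been installed or that no ERI is bound yet.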
Benefits
eRDMA provides the following benefits:
-
High performance
RDMA bypasses the kernel stack and transfers data directly from user-mode programs to a Host Channel Adapter (HCA) for network transmission. This greatly reduces CPU load and latency. eRDMA retains the advantages of traditional RDMA interfaces and applies RDMA to VPCs, which brings the ultra-low latency of RDMA to cloud networks.
Note: An HCA is a hardware network interface card (NIC) that connects a server to a network and provides support for RDMA.
-
Inclusiveness
You can enable eRDMA free of charge. To enable eRDMA, you only need to select the Elastic RDMA Interface option when you purchase an ECS instance.
-
Large-scale deployment
Traditional RDMA relies on lossless networks, which makes large-scale deployment costly and difficult. eRDMA uses the CC algorithm developed by Alibaba Cloud to handle variations in transmission quality in VPCs, such as latency and packet loss. As a result, eRDMA provides good performance even in lossy networks.
-
Scalability
Unlike traditional RDMA, which requires a separate hardware NIC, eRDMA uses an HCA that has cloud attributes and is based on the Shenlong architecture. You can dynamically add eRDMA devices while you use ECS instances, and eRDMA supports hot migration, which allows for flexible deployment.
-
Shared VPCs
eRDMA depends on ENIs and reuses the networks to which the ENIs belong. This allows you to enable RDMA in existing networks without modifying your service networking.
Scenarios
The TCP/IP protocol stack provides the mainstream network communication protocols on which many applications are built. As data center workloads evolve, they impose higher requirements on network performance, such as lower latency and higher throughput. TCP/IP has become a bottleneck for network performance because of limits such as high copy overheads, cross-protocol stack processing, complex CC algorithms, and frequent context switching.
RDMA addresses these pain points. Features such as zero-copy and kernel bypass eliminate the overheads of copying data and frequently switching context. Compared with TCP/IP communication, RDMA provides lower latency, higher throughput, and lower CPU utilization. However, RDMA is used in only a few scenarios because of its high price and O&M costs.
Alibaba Cloud eRDMA is designed to be inclusive and compatible with diverse cloud environments. eRDMA provides low latency and lowers the barrier for a wide range of applications to adopt RDMA in the cloud and improve their performance. Compared with traditional RDMA, eRDMA can be used in more scenarios, such as Redis-based cache databases, Spark-based big data analytics, Weather Research and Forecasting Model (WRF) in HPC, and AI training, and offers considerable performance gains in these scenarios.
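The difference between copying data and referencing it in place can be illustrated inside a single process. The following sketch is only an analogy and does not perform any RDMA or network operation. It uses the standard Python memoryview type to show that a view shares memory with the original buffer, whereas a copied slice does not. The function name compare_copy_and_view is hypothetical.

```python
def compare_copy_and_view(size=1024):
    """Contrast a copied slice with a zero-copy view over the same buffer.

    In-process analogy only: no RDMA or network I/O is involved.
    """
    source = bytearray(size)                 # stands in for an application buffer
    copied = bytes(source[: size // 2])      # allocates new memory and copies the data
    view = memoryview(source)[: size // 2]   # references the same memory without copying
    source[0] = 0xFF                         # modify the buffer after both were taken
    return copied[0], view[0]


if __name__ == "__main__":
    copied_byte, viewed_byte = compare_copy_and_view()
    print("copied slice sees:", copied_byte)
    print("zero-copy view sees:", viewed_byte)
```

The copied slice still holds the old value because it owns separate memory, while the view reflects the change because it points at the original buffer. Zero-copy transfers avoid the extra allocation and copy in the same spirit, although RDMA achieves this in hardware between the application buffer and the HCA.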
Limits
Before you use eRDMA, make sure that the following conditions are met.
-
ECS instances: For more information about configuration constraints, see Configure eRDMA on an enterprise-level instance.
-
GPU instances: For more information about configuration constraints, see Configure eRDMA on a GPU-accelerated instance.