RPC


Service Mesh is the core of Ant Group's next-generation architecture. At Ant Group's current scale, migrating the existing Service-Oriented Architecture (SOA) system to a Service Mesh architecture is like changing the wheels on a running train. This topic describes the design and migration plan for the RPC layer and how Ant Group's core applications transitioned smoothly to a Service Mesh architecture to handle the high traffic challenges of the Double 11 Shopping Festival and reduce costs.

Introduction to Service Mesh

Ant Group's Service Mesh, similar to community versions, is divided into two parts:

  • Control plane: Named SOFAMesh. We plan to contribute more openly to Istio in the future.

  • Data plane: Named MOSN. It supports HTTP, SOFARPC, Dubbo, and WebService. This topic focuses on the data plane implementation.

Comparison of the community's Service Mesh architecture and Ant Group's Service Mesh architecture

[Figure: architecture comparison diagram]

Why use Service Mesh?

Service Mesh solves the following urgent problems in an SOA:

  • Coupling between the infrastructure and business development

  • Challenges in achieving transparent stability and high availability

Before using Service Mesh

Before implementing Service Mesh, the evolution of SOFAStack faced the following key challenges:

  • The framework and business logic were too tightly coupled.

  • Supporting RPC-layer requirements, such as traffic scheduling, traffic mirroring, and grayscale traffic routing, required significant development resources.

  • Business teams were required to upgrade their middleware versions.

Key challenges:

[Figure: main problems]

  • Framework and business coupling: Upgrade costs were high. Many requirements could not be rolled out on the client side alone and needed matching features on the server side, which led to less elegant solutions.

  • Annual increase in machine resources: It was difficult to handle the massive traffic of the Double 11 Shopping Festival without adding more machines.

  • Traffic scheduling demands: New demands, such as traffic rerouting, grayscale traffic routing, blue-green deployments, and A/B testing, were difficult to meet.

  • Framework upgrade demands: After the basic framework was ready, it was uncertain whether new features would be compatible with older versions if users did not upgrade the API layer.

  • Inconsistent versions: Client framework versions were inconsistent in the online environment.

In an SOA, the collaboration between business and infrastructure teams was as follows:

  • Business teams were decoupled from one another: Each business team could independently manage one or more services. Upgrades and maintenance for these services did not require intervention from other teams. SOA achieved this decoupling through interface contracts.

  • Infrastructure and business teams were tightly coupled: For a long time, the infrastructure team required close cooperation from service teams to push many changes, such as upgrading JAR packages. This coupling led to various problems, such as inconsistent client versions in the online environment and high upgrade costs.

After using Service Mesh

Service Mesh moves infrastructure capabilities out of the application and down into a separate layer. This decouples the infrastructure from business teams, allowing each team to iterate more quickly.

Before and after decoupling

[Figure: before-and-after decoupling comparison]

Service Mesh solution

Selection considerations

  • Problem: To achieve the desired results with Service Mesh, we first had to choose the right technology. The main considerations were:

    • Open source vs. self-developed: Because of proprietary protocols and legacy issues, a full migration to Envoy was not suitable.

    • SDK vs. transparent hijacking: Transparent hijacking is hard to operate and observe, performs poorly, and carries risks that are difficult to control.

  • Final choice: A self-developed data plane with a lightweight SDK, which is MOSN (Modular Open Smart Network-Proxy).

Overall target architecture

[Figure: overall architecture]

MOSN currently supports the following components:

  • Pilot

  • Ant Group's service discovery component, SOFARegistry

  • Ant Group's messaging component, SOFAMQ

  • Database component

At the product level, MOSN provides developers with the following capabilities:

  • O&M

  • Monitoring

  • Security

To achieve this, we had to address three major business needs:

  • Framework upgrade solution

  • Container replacement solution

  • MOSN upgrade solution

[Figure: three major needs]

Framework upgrade solution

  • Before the upgrade, the online environment was as follows:

    • Application code and the framework were decoupled to some extent.

    • Application code interacted with the framework through an API.

    • Code needed to be packaged and run in SOFABoot.

      [Figure: framework upgrade]

  • Upgrade solution: The main steps are as follows.

    1. After an assessment, you can directly upgrade the underlying SOFABoot if the risks are manageable.

    2. The RPC framework checks its runtime information to determine whether the current pod needs to enable MOSN capabilities.

    3. The system detects the container identity passed by the Platform as a Service (PaaS) to determine whether MOSN is enabled. It then passes the publishing and subscribing information to MOSN, so that calls are completed directly, without an address lookup.

       [Figure: framework upgrade, part 2]

  • Pros and cons:

    • Pros: Batch O&M operations directly modified the online SOFABoot version, which provided existing applications with MOSN capabilities.

    • Cons:

      • The upgrade required runtime operations, such as shutting off traffic.

      • It did not provide smooth upgrade capabilities.

      • It was tightly coupled with business code and not suitable for long-term use.

  • Reasons for not using the community's traffic hijacking solution:

    • When many iptables rules are configured, performance degrades severely.

    • It has poor controllability and observability, and problems are difficult to troubleshoot.

    • From the beginning, the goal for Service Mesh was a full migration of Ant Group's online systems. This required very high performance and O&M standards. We could not accept service disruptions or a significant increase in resource consumption.
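Steps 2 and 3 of the upgrade solution above can be pictured with a short Go sketch. It is only an illustration, not SOFABoot or SOFARPC code: the `MOSN_ENABLE` and `MOSN_ADDR` variable names, the `Env`, `Registry`, and `Endpoint` types, and the port numbers are all hypothetical. It assumes the PaaS exposes the container identity as environment-style key/value pairs and that, when MOSN is enabled, the client sends calls to the local MOSN instead of looking up a provider address itself.

```go
package main

import "fmt"

// Env is what the framework can observe at startup (hypothetical fields).
type Env struct {
	MOSNEnabled bool   // container identity flag passed by the PaaS (assumed)
	MOSNAddr    string // address of the local MOSN (assumed)
}

// Registry maps a service name to provider addresses. It stands in for
// the address lookup the client performs when MOSN is not enabled.
type Registry map[string][]string

// Endpoint is the target the RPC client finally sends a request to.
type Endpoint struct {
	Addr    string
	ViaMOSN bool
}

// detectEnv reads the hypothetical MOSN_ENABLE / MOSN_ADDR keys through a
// lookup function, so the sketch does not depend on real process state.
func detectEnv(lookup func(string) string) Env {
	return Env{
		MOSNEnabled: lookup("MOSN_ENABLE") == "true",
		MOSNAddr:    lookup("MOSN_ADDR"),
	}
}

// resolve picks the call target: the local MOSN when it is enabled,
// otherwise the first provider found by the client-side address lookup.
func resolve(env Env, reg Registry, service string) (Endpoint, error) {
	if env.MOSNEnabled {
		return Endpoint{Addr: env.MOSNAddr, ViaMOSN: true}, nil
	}
	addrs := reg[service]
	if len(addrs) == 0 {
		return Endpoint{}, fmt.Errorf("no provider for %s", service)
	}
	return Endpoint{Addr: addrs[0]}, nil
}
```

With `MOSN_ENABLE=true`, `resolve` returns the MOSN address and never consults the registry. That is the property the text describes as completing calls without address lookup.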

Container replacement solution

The framework upgrade solution only proved that migration was possible, but not that it was efficient or fast. Faced with hundreds of thousands of business containers handling live traffic, we needed a way to integrate them quickly and stably. Traditional replacement integration would incur enormous costs with such high traffic. Therefore, the Ant Group team chose in-place integration.

Traditional integration vs. in-place integration

[Figure: traditional vs. in-place integration]

Comparison of traditional and in-place integration:

  • Traditional integration:

    • The upgrade requires a resource buffer. Containers are replaced in batches, with the buffer rolling forward from one batch to the next.

    • This requires the PaaS layer to have a large buffer, which increases resource costs.

    • Low CPU utilization.

  • In-place integration:

    • Using the PaaS layer, Operator operations directly inject MOSN into the existing container and perform an in-place restart. This completes the upgrade at the container level. After the upgrade, the pod has MOSN capabilities.

    • Increased CPU utilization.

    • This is similar to an oversubscription solution. MOSN appears to be allocated additional CPU and memory, but the actual increase in usage is negligible.

MOSN upgrade solution

After completing the container replacement, we faced a third challenge: with large-scale container deployment, MOSN would inevitably have bugs. How do we upgrade MOSN when a problem occurs? Upgrading a component across hundreds of thousands of online containers is very difficult. Therefore, we had to consider the MOSN upgrade solution from the beginning.

Simple upgrade solution

  • Approach: Destroy the container and rebuild it, as shown in the following figure.

    [Figure: simple upgrade]

  • Drawbacks:

    • When the number of containers is large, the O&M costs are prohibitive.

    • After a container is destroyed, if the rebuilding is not fast enough, it can affect business capacity and cause service failures.

Non-disruptive upgrade solution

To address the shortcomings of the simple upgrade solution, we worked with the PaaS team to develop a non-disruptive upgrade solution for MOSN, as shown in the following figure:

[Figure: non-disruptive upgrade]

  • Approach:

    • MOSN can perceive its own state.

    • When a new MOSN starts, it uses a domain socket on a shared volume to check if an old MOSN is already running in the same pod. If so, it notifies the old MOSN to perform a smooth upgrade. The process is as follows:

      1. The new MOSN notifies the old MOSN to start the smooth upgrade process.

      2. The old MOSN passes the service's listen file descriptor (Fd) to the new MOSN. After the new MOSN receives the Fd, it starts. At this point, both the old and new MOSN instances are providing services.

      3. The new MOSN notifies the old MOSN to close its listen Fd and then begins to migrate existing persistent connections.

      4. The old MOSN completes the migration and is then destroyed.

  • Example solution:

    [Figure: upgrade solution]

  • Advantage: Any online MOSN version upgrade can be performed without affecting existing services.
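The listen-Fd handover in step 2 above relies on a standard Unix mechanism: a process can send an open file descriptor to another process over a Unix domain socket (`SCM_RIGHTS`). The Go sketch below illustrates that mechanism only and is not MOSN's actual code. The function names are hypothetical, both roles run in one process, and a `socketpair` stands in for the domain socket on the shared volume. It runs on Unix-like systems only and does not cover the migration of existing persistent connections.

```go
package main

import (
	"errors"
	"net"
	"os"
	"syscall"
)

// sendListener plays the old proxy: it sends the listening socket's file
// descriptor to the peer over a Unix domain socket using SCM_RIGHTS.
func sendListener(conn *net.UnixConn, ln *net.TCPListener) error {
	f, err := ln.File() // duplicate of the listening fd
	if err != nil {
		return err
	}
	defer f.Close()
	_, _, err = conn.WriteMsgUnix([]byte{0}, syscall.UnixRights(int(f.Fd())), nil)
	return err
}

// recvListener plays the new proxy: it receives the fd and rebuilds a
// net.Listener on it, so it can accept connections on the same port.
func recvListener(conn *net.UnixConn) (net.Listener, error) {
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 4-byte fd
	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return nil, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return nil, err
	}
	if len(msgs) != 1 {
		return nil, errors.New("expected exactly one control message")
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return nil, err
	}
	if len(fds) != 1 {
		return nil, errors.New("expected exactly one file descriptor")
	}
	f := os.NewFile(uintptr(fds[0]), "inherited-listener")
	defer f.Close() // net.FileListener works on its own duplicate
	return net.FileListener(f)
}

// unixPair returns two connected Unix sockets (stand-in for the shared socket).
func unixPair() (*net.UnixConn, *net.UnixConn, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return nil, nil, err
	}
	var conns [2]*net.UnixConn
	for i, fd := range fds {
		f := os.NewFile(uintptr(fd), "pair")
		c, err := net.FileConn(f)
		f.Close()
		if err != nil {
			return nil, nil, err
		}
		uc, ok := c.(*net.UnixConn)
		if !ok {
			return nil, nil, errors.New("not a Unix connection")
		}
		conns[i] = uc
	}
	return conns[0], conns[1], nil
}

// handoverDemo returns the address of the old and of the inherited listener.
func handoverDemo() (string, string, error) {
	oldSide, newSide, err := unixPair()
	if err != nil {
		return "", "", err
	}
	defer oldSide.Close()
	defer newSide.Close()
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	defer ln.Close()
	if err := sendListener(oldSide, ln.(*net.TCPListener)); err != nil {
		return "", "", err
	}
	inherited, err := recvListener(newSide)
	if err != nil {
		return "", "", err
	}
	defer inherited.Close()
	return ln.Addr().String(), inherited.Addr().String(), nil
}
```

`handoverDemo` returns the same address twice, because both listeners refer to the same underlying socket. This is why the old and new instances can serve traffic at the same time during the handover, and why the old instance can later close its listen Fd without the port going away.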

MOSN application: Time-sharing scheduling

Technological transformation is usually driven by business needs and specific scenarios, not by the technology itself. Few people upgrade for the sake of upgrading or innovate for the sake of innovation. Business needs drive technology, and technology in turn drives the business forward.

In the Alibaba ecosystem, the continuous expansion of activities such as Taobao Live, real-time red envelopes, and Ant Forest presents complex technological challenges.

For businesses, these scenarios mean that the traffic is almost unsupportable, even with optimized code. The typical solution is to scale out by adding more machines, but this increases costs. Service Mesh provides a better solution for this situation.

By collaborating with the JVM and systems departments, we use advanced time-sharing scheduling to achieve flexible resource scheduling. This approach achieves better results without adding machine resources. The process is as follows:

  • Resource requirements: Assume the following resource requirements for two large resource pools.

    • At time X, resource domain A needs more resources.

    • At time Y, resource domain B needs more resources.

    • The total amount of resources cannot be increased.

      [Figure: time-sharing scheduling]

  • Borrowing solution: The preceding resource requirements must be met by borrowing machines. There are two borrowing solutions.

    [Figure: comparison of borrowing solutions]

    • Conventional solution:

      • Release the resources, destroy the process, rebuild the resources, and then start them for resource domain B.

      • This process is a heavy operation for many machines. Change introduces risk, and making such a change at a critical time can have unintended side effects.

    • Time-sharing scheduling solution: MOSN's approach is as follows.

      • A portion of the resources always runs two applications through oversubscription.

      • At time X: Resource domain A is in a running state, and resource domain B is in a keepalive state. Resource domain A uses MOSN to reroute all its traffic. The application's CPU and memory are then limited to a very low level, retaining about 1% of its capacity. This way, the machine can still be prefetched, and the process does not need to stop.

      • At time Y: Resource domain B is in a running state, and resource domain A is in a keepalive state. When a large-scale resource scheduling is needed, a switch can be pushed to turn on resource limits and cancel the keepalive state. Resource domain B can be instantly restored to full capacity. Resource domain A then enters the state it was in at time X, with its CPU and memory limited.

      • MOSN's ability to maintain traffic keepalive with an extremely low resource footprint makes rapid resource borrowing possible.

      Details of MOSN time-sharing scheduling

      [Figure: MOSN time-sharing scheduling details]
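The switch between the running and keepalive states can be modelled as a small state change on two co-located applications. The Go sketch below is a toy model under stated assumptions, not MOSN's scheduler. The `Domain` type and `activate` function are hypothetical, the resource limit is reduced to a single CPU percentage, and the 1% figure is taken from the "about 1%" mentioned above. A real system would apply the limits through the container runtime and reroute traffic through MOSN.

```go
package main

// State is the scheduling state of an application on a shared machine.
type State string

const (
	Running   State = "running"
	Keepalive State = "keepalive"
)

// keepalivePercent is the capacity a keepalive application retains
// ("about 1%" in the text above).
const keepalivePercent = 1

// Domain models one resource domain's application on an oversubscribed
// machine. All fields are hypothetical simplifications.
type Domain struct {
	Name            string
	State           State
	CPUPercent      int  // share of machine capacity the application may use
	TrafficRerouted bool // true when MOSN sends this application's traffic elsewhere
}

// activate makes one domain the running one and parks the other.
// The idle application is limited first, so that the capacity is free
// before the active application is restored. Neither process stops.
func activate(active, idle *Domain) {
	idle.State = Keepalive
	idle.TrafficRerouted = true
	idle.CPUPercent = keepalivePercent

	active.State = Running
	active.TrafficRerouted = false
	active.CPUPercent = 100 - keepalivePercent
}
```

Calling `activate(&a, &b)` at time X and `activate(&b, &a)` at time Y swaps the two roles without destroying or rebuilding either process. This is the difference from the conventional borrowing solution.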

Future outlook for Service Mesh

After two years of development at Ant Group, Service Mesh withstood the test of the Double 11 Shopping Festival. During the festival, it covered hundreds of core transaction links. The number of containers with MOSN injected reached hundreds of thousands. On Double 11, it handled tens of millions of queries per second (QPS) with an average response time (RT) of less than 0.2 ms. MOSN itself completed dozens of online upgrades during the sales promotions. It achieved its design goals and completed the first step of separating the infrastructure from the business, which demonstrated the increased iteration speed of the infrastructure after adopting Service Mesh.

There is no silver bullet in software engineering. Architecture design and solution implementation are always a matter of balance and trade-offs. Although there are still challenges to overcome, we at Ant Group believe that cloud-native is the future.

After two years of exploration and practice, Ant Group has accumulated a wealth of experience. Service Mesh may be the closest thing to a "silver bullet" in the cloud-native world. In the future, Service Mesh will become the standard solution for microservices in a cloud-native environment. Ant Group and Alibaba Group plan to participate deeply in the Istio community. We will work with the community to make Istio the de facto standard for Service Mesh.

The expected Service Mesh architecture

[Figure: future architecture]

References