Service Mesh includes a control plane and a data plane. Its data plane is the self-developed MOSN. This topic describes how Ant Group ensures quality during the MOSN implementation and how to smoothly evolve from the existing microservices model to the Service Mesh architecture.
MOSN overview
The main differences between the classic microservices model and the Service Mesh architecture include the following:
Classic microservices model: Uses a software development kit (SDK) to implement service registration and discovery.
Service Mesh model: Uses the MOSN data plane to implement service registration and discovery. Service invocations are also handled by MOSN.
Classic microservices model

Service Mesh microservices model

In the Service Mesh model, offloading the original SDK logic to the MOSN data plane provides the following benefits:
After basic capabilities are offloaded, upgrades to these capabilities no longer depend on application modifications and releases. This reduces application disruptions.
After basic capabilities are upgraded, they can be rapidly iterated and rolled out transparently to applications.
The capabilities of Ant Group's self-developed MOSN data plane:

Implementation challenges
MOSN has rich capabilities. However, its implementation process faces several challenges.
What are the challenges in quality assurance?
How can these challenges be addressed?
Quality assurance challenges

New product: Ant Group did not adopt community solutions such as Envoy or Linkerd. Instead, the team chose to develop its own data plane, MOSN. The entire stack, from the underlying network model to the upper-layer business modules, had to be written from scratch. This new infrastructure was intended to replace the stable SDK already running in the production environment, and a production change of this kind carries high risk.
Non-standard language: This self-developed MOSN data plane uses the Go language. The primary technology stack within Ant Group is Java, and the corresponding development processes are built around Java. This newly introduced technology stack required a new process platform for support.
Sales promotions: The timeline was tight from the start of the infrastructure upgrade within Ant Group to the arrival of the Double 11 Shopping Festival. The task was to complete the meshification of all core-link applications. This involved over 100 applications and hundreds of thousands of pods. The new system also had to withstand the test of the Double 11 Shopping Festival.
Online stability: The upgrade process required no service interruptions. After the large-scale upgrade, business operations had to run normally.
Output sites: The solution needed to support multiple sites, including Ant Group, MYbank, and the public cloud. However, the infrastructure dependencies and required capabilities differ across these sites, and these differing requirements can lead to code branch fragmentation. This raised two questions:
How can we ensure the quality of output for multiple sites simultaneously?
How can these fragmented branches be managed effectively?
How to address quality assurance challenges
Standardize the development process
To promote the implementation of meshification, Ant Group's R&D effectiveness department launched the ANT CODE development platform. This platform supports the Go language and provides custom pipeline orchestration capabilities and native plugins. Based on this platform, the Ant Group team standardized the development procedure as follows:
Git-flow branch management
Code management requires a clear process and standard. The Ant Group team adopted the code management solution proposed by Vincent Driessen. For more information, see A Successful Git Branching Model.

Code Review (CR): Among existing quality assurance methods, CR is a cost-effective way to identify problems. Based on the ANT CODE platform, the Ant Group team defined the following CR standards.

Delivery pipeline: The MOSN continuous delivery pipeline is an example of the delivery process, as shown in the following figure.

Pipeline checkpoints

If any of the preceding steps fail, the entire pipeline fails, and the code cannot be merged.
The checkpoint rules are located in a YAML configuration file in the project's root directory. A custom plugin centralizes these rules, which simplifies the YAML configuration file. This also means that rule changes only require updating the plugin's in-memory configuration.
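The all-or-nothing gate described above can be sketched as follows. This is only an illustration under stated assumptions: the checkpoint names and the `run_pipeline` function are hypothetical and do not reflect the ANT CODE plugin's real interface.

```python
from typing import Callable, List, Optional, Tuple

# A checkpoint is a (name, check) pair; check() returns True on success.
Checkpoint = Tuple[str, Callable[[], bool]]


def run_pipeline(checkpoints: List[Checkpoint]) -> Tuple[bool, Optional[str]]:
    """Run checkpoints in order and stop at the first failure.

    Returns (mergeable, name_of_failed_checkpoint).
    """
    for name, check in checkpoints:
        if not check():
            # One failed checkpoint fails the whole pipeline.
            return False, name
    return True, None


# Hypothetical checkpoints, used only to exercise the gate.
example = [
    ("unit_test", lambda: True),
    ("lint", lambda: False),
    ("integration_test", lambda: True),
]
mergeable, failed = run_pipeline(example)
print(mergeable, failed)
```

Stopping at the first failure keeps feedback fast. A real pipeline might instead run every checkpoint and report all failures at once.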
Integration testing
To validate MOSN, the Ant Group team built a complete staging environment. This section uses the RPC feature in MOSN as an example to describe the components and deployment architecture of this environment.
Components of the integration testing environment

Pros and cons of the integration testing environment:
Pros:
MOSN's routing capabilities are fully compatible with the original SDK's capabilities. They are also continuously optimized. For example, a routing cache is used to improve performance.
This environment enables the use of automated scripts and continuous integration during iterations.
Cons: This staging environment cannot identify all issues. Some problems may not be discovered until they reach the production environment and cause business disruptions.
The following section uses RPC routing as an example to discuss offline integration testing.
When business applications perform cross-IDC routing, they mainly use ANTVIP. This requires applications to set the virtual IP address (VIP) in their code. The format is as follows:
<sofa:reference interface="com.alipay.APPNAME.facade.SampleService" id="sampleRpcService">
    <sofa:binding.tr>
        <sofa:vip url="APPNAME-pool.zone.alipay.net:12200"/>
    </sofa:binding.tr>
</sofa:reference>
However, at runtime, some applications have invalid URLs configured for historical reasons, such as:
<sofa:reference interface="com.alipay.APPNAME.facade.SampleService" id="sampleRpcService">
    <sofa:binding.tr>
        <sofa:vip url="http://APPNAME-pool.zone.alipay.net:12200?_TIMEOUT=3000"/>
    </sofa:binding.tr>
</sofa:reference>
The preceding VIP URL specifies port 12200 but also specifies the http protocol, which makes the configuration invalid. For historical reasons, the original Cloud Engine (CE) tolerated this case. However, this compatibility was not carried over to MOSN.
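To make the difference between the two forms concrete, the following sketch shows one way a tolerant parser could reduce both URLs to the same host and port. It is an assumption for illustration only, not the actual logic in CE or MOSN. The function name `normalize_vip_url` is hypothetical, and so is the choice to ignore the scheme and return the query parameters separately.

```python
from typing import Dict, Tuple


def normalize_vip_url(raw: str) -> Tuple[str, int, Dict[str, str]]:
    """Reduce a VIP URL to (host, port, params), ignoring any scheme prefix."""
    rest = raw.strip()
    # Drop a legacy scheme such as "http://" if one is present.
    if "://" in rest:
        _, rest = rest.split("://", 1)
    # Separate legacy query parameters such as "_TIMEOUT=3000".
    address, _, query = rest.partition("?")
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {raw!r}")
    params = dict(
        pair.split("=", 1) for pair in query.split("&") if "=" in pair
    )
    return host, int(port), params


print(normalize_vip_url("APPNAME-pool.zone.alipay.net:12200"))
print(normalize_vip_url("http://APPNAME-pool.zone.alipay.net:12200?_TIMEOUT=3000"))
```

Both calls yield the same host and port. Only the legacy form carries an extra `_TIMEOUT` entry in the returned parameters.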
This type of legacy issue is common and can be classified as an omission in test scenario analysis. Typically, for such scenarios, online traffic playback can be used to copy real online traffic to an offline environment to supplement the test scenarios. However, the existing traffic playback capability cannot be used directly for MOSN because RPC route addressing is related to the deployment structure, and online traffic cannot run directly in an offline environment. Therefore, a new traffic playback solution is needed. This capability is currently under development.
Special testing
In addition to the functional testing mentioned above, the Ant Group team also introduced the following special tests:
Compatibility testing
Performance testing
Fault injection testing
Compatibility testing
MOSN compatibility validation diagram
Issues found: Compatibility testing primarily identified issues in scenarios involving a mix of services with and without MOSN.
For example, during offline validation, calls from a client with MOSN to a server without MOSN would occasionally fail. The server would report the following protocol parsing error:
[SocketAcceptorIoProcessor-1.1]Error com.taobao.remoting -Decode meetexception
java.io.IOException:Invalid protocol header:1
Analysis showed that the cause was a protocol mismatch. The old version of RPC supported the TR protocol, while the new version supports the BOLT protocol. During application upgrades, some services provided both TR and BOLT protocols simultaneously. This means the same interface offered services with different protocols, as shown in the following example:

Figure description:
The application sends a service subscription request to MOSN. MOSN then subscribes to the configuration center. The configuration center returns two addresses to MOSN, one supporting the TR protocol and the other supporting the BOLT protocol. MOSN selects one of these addresses and returns it to the application.
The address that MOSN returns to the application is the first entry from the data returned by the configuration center. This leads to two possibilities:
The address is for a server that supports the BOLT protocol: When the application later initiates a service call, it sends the request to MOSN using the BOLT protocol. When MOSN selects an address, it alternates between the two service providers in round-robin fashion. If the call is routed to Server1, which does not support BOLT, the protocol parsing error shown above occurs.
The address is for a server that supports the TR protocol: Because BOLT is compatible with TR, the problem does not occur.
Solution: After MOSN receives the address list from the configuration center, it applies a filter. If the server list contains an address that uses the TR protocol, all connections are downgraded to use the TR protocol.
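The downgrade filter can be illustrated with a short sketch. The `Provider` type, the `choose_protocol` function, and the sample addresses are hypothetical names introduced here. The real filter lives inside MOSN and is written in Go.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Provider:
    address: str   # host:port returned by the configuration center
    protocol: str  # "TR" or "BOLT"


def choose_protocol(providers: List[Provider]) -> str:
    """Pick one protocol for all connections to this interface.

    If any provider speaks TR, downgrade everything to TR, which the
    source describes as safe because BOLT is compatible with TR.
    """
    if not providers:
        raise ValueError("empty provider list")
    if any(p.protocol == "TR" for p in providers):
        return "TR"
    return "BOLT"


mixed = [Provider("10.0.0.1:12200", "TR"), Provider("10.0.0.2:12200", "BOLT")]
print(choose_protocol(mixed))
```

Deciding once per address list, rather than per request, means the outcome no longer depends on which provider the load balancer happens to pick.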
Performance testing
To verify the performance impact of adopting MOSN, the Ant Group team connected the business applications in the performance stress testing environment to MOSN. They used the business's stress testing traffic for validation. The following figure shows a performance comparison after a business application adopted MOSN:

Fault injection testing
From MOSN's perspective, its external dependencies are as follows:
In addition to verifying MOSN's features, the Ant Group team also conducted special testing on its external dependencies using fault injection. This method helped discover scenarios not covered by the functional tests.
The following section uses the 12199 port between the application and MOSN as an example.
Diagram of MOSN and application heartbeat disconnection handling

Figure description:
After an application is connected to MOSN, the 12200 port that the application originally exposed is now used by MOSN for listening. The application's port is changed to 12199. MOSN sends heartbeats to the application's 12199 port to check if the application is alive.
If a problem occurs while the application is running, MOSN can detect it promptly through heartbeats.
If MOSN detects a heartbeat abnormality, it deregisters the service from the configuration center and shuts down the service on the externally provided 12200 port. This prevents the server-side from receiving service invocations after a problem occurs, which would otherwise cause request failures.
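The heartbeat handling in the figure description can be sketched as a small state machine. The source does not say how many missed heartbeats count as an abnormality, so the `max_failures` threshold is an assumption. The class and callback names are also hypothetical.

```python
from typing import Callable


class HeartbeatMonitor:
    """Tracks heartbeats to the application and unpublishes it on failure."""

    def __init__(
        self,
        deregister: Callable[[], None],
        close_listener: Callable[[], None],
        max_failures: int = 3,  # assumed threshold, not from the source
    ) -> None:
        self._deregister = deregister
        self._close_listener = close_listener
        self._max_failures = max_failures
        self._failures = 0
        self.published = True

    def record(self, heartbeat_ok: bool) -> bool:
        """Record one heartbeat result; return whether the service is still published."""
        if not self.published:
            return False
        if heartbeat_ok:
            self._failures = 0
            return True
        self._failures += 1
        if self._failures >= self._max_failures:
            # Remove the service from the configuration center first,
            # then stop accepting calls on the external port.
            self._deregister()
            self._close_listener()
            self.published = False
        return self.published


events = []
monitor = HeartbeatMonitor(
    deregister=lambda: events.append("deregister"),
    close_listener=lambda: events.append("close:12200"),
    max_failures=2,
)
for ok in (True, False, False):
    monitor.record(ok)
print(events)
```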
To validate this scenario, the Ant Group team used the `iptables` command in the offline staging environment to drop response data from the application to MOSN. This created an artificial scenario where the application was abnormal. Using this method, the team also found and fixed other unexpected bugs.
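The source does not give the exact rule the team used. As one plausible form, the sketch below builds an `iptables` command that drops TCP packets whose source port is the application's 12199 port, which silences heartbeat responses. The helper name and the choice of the `OUTPUT` chain are assumptions. The code only constructs the argument list and does not execute it.

```python
from typing import List


def drop_app_responses_rule(app_port: int, delete: bool = False) -> List[str]:
    """Build an iptables command that drops packets sent from app_port.

    With delete=True, the same rule is removed (-D) to end the fault.
    """
    action = "-D" if delete else "-A"
    return [
        "iptables", action, "OUTPUT",
        "-p", "tcp",
        "--sport", str(app_port),
        "-j", "DROP",
    ]


inject = drop_app_responses_rule(12199)
recover = drop_app_responses_rule(12199, delete=True)
print(" ".join(inject))
print(" ".join(recover))
```

Running the generated commands requires root privileges inside the pod's network namespace. Pairing every injection with its matching delete rule keeps the staging environment recoverable.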
Rapid iteration
The time from project kickoff to the Double 11 event was limited. To ensure a smooth transition during Double 11, the Ant Group team adopted a rapid iteration strategy. This strategy allowed MOSN to receive sufficient validation time in the production environment within a controllable quality range. This was possible because of the basic testing capabilities mentioned earlier and Ant Group's three key principles for online changes.
Online stress testing and drills
After MOSN was fully online, it underwent multiple rounds of online stress testing triggered by the business's Site Reliability Engineers (SREs), which helped discover more issues. Online drills focused on MOSN's emergency plans, such as reducing the online log level.
Monitoring and inspection
Before meshification, online monitoring was performed mainly at the pod level. After meshification, and in cooperation with the monitoring team, online monitoring now supports independent monitoring for MOSN. It also supports monitoring for the Sidecar and application containers within a pod.
Inspection supplements monitoring. Some information, such as MOSN version consistency, cannot be obtained directly through monitoring. Inspection can identify which pods have not been upgraded to the latest MOSN version and whether any problematic versions are still running online.
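A version inspection of this kind can be sketched as a comparison of each pod's reported MOSN version against the expected one. The function name, the pod names, and the version strings below are hypothetical. How versions are actually collected from pods is outside the scope of this sketch.

```python
from typing import Dict, List, Set, Tuple


def inspect_versions(
    pod_versions: Dict[str, str],
    latest: str,
    blocked: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (pods not on the latest version, pods running a blocked version)."""
    outdated = sorted(p for p, v in pod_versions.items() if v != latest)
    problematic = sorted(p for p, v in pod_versions.items() if v in blocked)
    return outdated, problematic


# Hypothetical inventory, used only to exercise the check.
inventory = {"pod-a": "1.2.0", "pod-b": "1.1.0", "pod-c": "1.0.3"}
outdated, problematic = inspect_versions(inventory, latest="1.2.0", blocked={"1.0.3"})
print(outdated, problematic)
```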
Future product improvements
The Double 11 event proved the performance and online stability of Service Mesh. However, the foundation for managing technical risk is not yet solid enough. In addition to building new features, the Ant Group team will work to strengthen technical risk management. In terms of quality, this work mainly includes:

In this process, the Ant Group team hopes to introduce new testing technologies and achieve new quality innovations to continuously improve Service Mesh.


