In recent years, the concept of cloud-native has grown in popularity. Ant Group is committed to technological innovation and actively applies the Service Mesh concept in the cloud-native field. By integrating Service Mesh with our existing technical architecture, we extracted common capabilities, such as communication, data, and security, to create MOSN. We also built on Istio's capabilities to extend the Service Mesh control plane, which provides advanced management for MOSN. This topic describes how we ensure quality and reliability when implementing the Service Mesh control plane at Ant Group.
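This topic covers the following aspects: the Service Mesh components, the differences between the control plane and classic microservices, the control plane implementation, the challenges we faced and their solutions, and future plans.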
Service Mesh components
Service Mesh component diagram

Service Mesh component description:
Data plane: The data plane, called MOSN, is an independent proxy module that processes application data requests. It is separate from the application and provides request proxying and complex communication logic processing.
Control plane: The control plane, called SOFAMesh, manages application configurations and business rules, such as feature toggles and service routing rules. It pushes these configurations to the data plane, which executes them to meet different business requirements.
Differences between the control plane and classic microservices
Classic microservice architecture

Integrated Service Mesh architecture

The main differences between the two are:
Service Mesh adds a MOSN sidecar to the data plane pod.
Service Mesh removes the Config Server. In the classic microservice architecture, a Config Server pushes configuration data. In Service Mesh, the control plane pushes configuration data to MOSN instead.
Service Mesh control plane implementation
The Service Mesh control plane is built on Kubernetes and uses Kubernetes capabilities such as CustomResourceDefinition (CRD) and Role-Based Access Control (RBAC). It also carries business information over the xDS protocol, which originates in the open source Envoy project and is used by Istio. This allows the control plane to send service routing rules and configuration toggles to MOSN for execution to meet various business requirements. During the Double 11 shopping festival, the control plane implemented the following two capabilities:
CRD configuration delivery: CRD stands for CustomResourceDefinition, a mechanism natively supported by Kubernetes (K8s) for defining custom resource types.
TLS encrypted communication: TLS stands for Transport Layer Security.
CRD configuration delivery
Ant Group extended ScopeConfig to limit the effective scope of a custom resource (CR). It supports settings at the data center, cluster, application, and IP levels. On top of ScopeConfig, we added PolicyConfig CRDs for policies, rules, and toggles. This enables controllable grayscale releases, degradation rollbacks, and online A/B testing.
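The scope levels described above can be illustrated with a minimal sketch. The type and field names below (Scope, Workload, Matches) are hypothetical, because the actual ScopeConfig schema is not shown in this topic. The sketch only assumes that a level left empty places no restriction.

```go
package main

import "fmt"

// Scope is a hypothetical stand-in for the selector part of a ScopeConfig.
// An empty field means "no restriction" at that level.
type Scope struct {
	DataCenter string
	Cluster    string
	App        string
	IP         string
}

// Workload describes one MOSN instance that could receive a configuration.
type Workload struct {
	DataCenter string
	Cluster    string
	App        string
	IP         string
}

// Matches reports whether the workload falls inside the scope.
func (s Scope) Matches(w Workload) bool {
	match := func(want, got string) bool { return want == "" || want == got }
	return match(s.DataCenter, w.DataCenter) &&
		match(s.Cluster, w.Cluster) &&
		match(s.App, w.App) &&
		match(s.IP, w.IP)
}

func main() {
	// A grayscale release could start with a single IP and later widen to the application.
	gray := Scope{Cluster: "cluster-a", App: "demo-app", IP: "10.0.0.1"}
	full := Scope{Cluster: "cluster-a", App: "demo-app"}
	w := Workload{DataCenter: "dc-1", Cluster: "cluster-a", App: "demo-app", IP: "10.0.0.2"}
	fmt.Println("gray scope matches:", gray.Matches(w))
	fmt.Println("full scope matches:", full.Matches(w))
}
```

Narrowing a scope to one IP and then widening it to an application or cluster is one way a grayscale release could be staged. Rolling back would then amount to narrowing or removing the scope again.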

TLS encrypted communication
The TLS encrypted communication process is as follows:
After the TLS toggle is enabled, MOSN requests TLS certificate information from the Citadel Agent in the control plane through a Unix domain socket (UDS).
The TLS certificate is issued by Ant Group's KMS to the Citadel service in the control plane. Citadel then validates the certificate and performs other processing.
The validated certificate is synced to the Citadel Agent. This cycle repeats to keep the certificate up to date.
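The first step, in which the sidecar asks a local agent for certificate material over a Unix domain socket, can be sketched as follows. This is a simplified illustration, not the actual MOSN or Citadel Agent protocol. The one-line text exchange, the socket file name, and the function names (serveCert, fetchCert, startAgent) are assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// serveCert plays the role of a local certificate agent: it answers every
// connection with a single line of placeholder certificate data.
func serveCert(l net.Listener, cert string) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return // the listener was closed
		}
		fmt.Fprintln(conn, cert)
		conn.Close()
	}
}

// fetchCert plays the role of the sidecar: it dials the socket and reads one line.
func fetchCert(socketPath string) (string, error) {
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(line), nil
}

// startAgent creates a socket in a temporary directory and serves cert on it.
// The returned stop function closes the listener and removes the directory.
func startAgent(cert string) (socketPath string, stop func(), err error) {
	dir, err := os.MkdirTemp("", "agent")
	if err != nil {
		return "", nil, err
	}
	socketPath = filepath.Join(dir, "agent.sock")
	l, err := net.Listen("unix", socketPath)
	if err != nil {
		os.RemoveAll(dir)
		return "", nil, err
	}
	go serveCert(l, cert)
	stop = func() {
		l.Close()
		os.RemoveAll(dir)
	}
	return socketPath, stop, nil
}

func main() {
	path, stop, err := startAgent("placeholder-certificate")
	if err != nil {
		fmt.Println("start agent:", err)
		return
	}
	defer stop()
	cert, err := fetchCert(path)
	fmt.Println(cert, err)
}
```

Because the socket is a file on the local node, the exchange never leaves the host. Access to it can be restricted with ordinary file permissions.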
Example of the TLS encrypted communication process

Challenges and solutions
Test architecture improvements
In standard mode, a pod that runs a service packaged as a Docker image typically contains only one application container. In this scenario, the testing framework only needs to focus on the application.
Service Mesh operates in a non-standard mode. MOSN acts as a sidecar and coexists with the application container in the same pod, sharing resources. All business traffic must be processed through MOSN. Therefore, testing and validation must extend to MOSN, not just the application.
MOSN integrates the control plane's xDS client to communicate with the Pilot component and receive configuration information. Ant Group's technical architecture includes disaster recovery capabilities such as 'three regions, five data centers' and 'intra-city active-active'. This results in scenarios with Logical Data Centers (LDCs) and multiple zones within a single cluster.
The control plane's Pilot component sends configurations with a granularity of cluster, zone, application, and IP. To verify the accuracy of these multi-zone delivery rules, you must create multiple xDS clients or MOSN instances.
The sidecar cannot be accessed directly. You must expose an interface through the test application to enable higher-level testing.
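The multi-zone delivery check described above can be sketched with in-memory stand-ins for the xDS clients. All names here (Target, Client, push, mismatches) are hypothetical. A real test would create actual xDS clients or MOSN instances and read their state through the interface exposed by the test application.

```go
package main

import "fmt"

// Target is a hypothetical delivery rule; empty fields match everything.
type Target struct{ Cluster, Zone, App, IP string }

// Client is a simulated xDS client that records the configs pushed to it.
type Client struct {
	Cluster, Zone, App, IP string
	Received               []string
}

// selects reports whether the target applies to the client.
func selects(t Target, c *Client) bool {
	ok := func(want, got string) bool { return want == "" || want == got }
	return ok(t.Cluster, c.Cluster) && ok(t.Zone, c.Zone) && ok(t.App, c.App) && ok(t.IP, c.IP)
}

// push stands in for Pilot: it delivers config to every client the target selects.
func push(clients []*Client, t Target, config string) {
	for _, c := range clients {
		if selects(t, c) {
			c.Received = append(c.Received, config)
		}
	}
}

// mismatches returns the IPs of clients whose received state differs from
// the expectation (want maps an IP to "should have received config").
func mismatches(clients []*Client, config string, want map[string]bool) []string {
	var bad []string
	for _, c := range clients {
		got := false
		for _, r := range c.Received {
			if r == config {
				got = true
			}
		}
		if got != want[c.IP] {
			bad = append(bad, c.IP)
		}
	}
	return bad
}

func main() {
	clients := []*Client{
		{Cluster: "c1", Zone: "zone-a", App: "demo", IP: "10.0.0.1"},
		{Cluster: "c1", Zone: "zone-a", App: "demo", IP: "10.0.0.2"},
		{Cluster: "c1", Zone: "zone-b", App: "demo", IP: "10.0.1.1"},
	}
	push(clients, Target{Cluster: "c1", Zone: "zone-a"}, "toggle-v1")
	want := map[string]bool{"10.0.0.1": true, "10.0.0.2": true}
	fmt.Println("mismatched clients:", mismatches(clients, "toggle-v1", want))
}
```

The check covers both directions: clients inside the target zone must receive the configuration, and clients outside it must not.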

Code quality control
For new projects and new code, we must ensure code quality during rapid iteration. The control plane is developed almost entirely in Go. Compared with a long-established language such as Java, Go had less supporting tooling available to us, so fewer quality metrics could be measured.
Using the Ant Code service from Ant Group's efficiency team, our team configured a pipeline for the Service Mesh control plane. It currently supports security scans, specification scans, and code coverage. It also supports code reviews. Code can be merged only after it passes reviews from both developers and quality assurance staff. When merging, unit tests (UTs) are automatically triggered to prevent defective code from being integrated.
After reviews and UTs pass, a test image is automatically built through a unified image packaging process. This improves the quality of the test package. The following figure provides an example.
Building on Ant Code, our team developed custom components to dynamically manage code coverage. This practice continuously improves our code quality and encourages developers to perform self-testing.
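As an illustration of what a coverage check in such a pipeline might compute, the following sketch parses the text profile written by `go test -coverprofile` and derives a statement coverage percentage. It is not the team's custom component. The function name and the 60% threshold are assumptions, and the sketch does not merge duplicate blocks that can appear when several packages are profiled together.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// coverage returns the percentage of statements covered in a Go cover profile.
// Each data line has the form "file:start,end numStatements hitCount".
func coverage(profile string) (float64, error) {
	var total, covered int
	sc := bufio.NewScanner(strings.NewReader(profile))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "mode:") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 3 {
			return 0, fmt.Errorf("malformed profile line: %q", line)
		}
		stmts, err := strconv.Atoi(fields[1])
		if err != nil {
			return 0, err
		}
		hits, err := strconv.Atoi(fields[2])
		if err != nil {
			return 0, err
		}
		total += stmts
		if hits > 0 {
			covered += stmts
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	if total == 0 {
		return 0, nil
	}
	return 100 * float64(covered) / float64(total), nil
}

func main() {
	profile := "mode: set\n" +
		"example.com/mesh/a.go:3.1,5.2 3 1\n" +
		"example.com/mesh/a.go:7.1,9.2 1 0\n"
	pct, err := coverage(profile)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	const threshold = 60.0 // hypothetical gate value
	fmt.Printf("coverage %.1f%%, passes gate: %v\n", pct, pct >= threshold)
}
```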

Stability requirements
CRD delivery is a core capability of the control plane. TLS encrypted communication is also triggered by a CRD delivery toggle. The key performance factors for delivery include the following:
The number of concurrent clients that Pilot can support.
The delivery time to the client, because real-time configuration delivery is critical.
During stress testing, we did not have enough resources to create many xDS clients. Therefore, we developed a mock client, which is a simplified xDS client that retains only the core communication module. A single pod can simulate tens of thousands of clients. After a period of continuous stress testing, our team found that frequent garbage collection (GC) caused high latency.
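The mock-client idea can be sketched as follows: each simulated client keeps only a receive path, so thousands of them fit in one process. This sketch uses in-process channels instead of real xDS streams, and the names (mockClient, runMockClients) are hypothetical. It measures the two factors listed above: how many clients received the push and how long the slowest delivery took.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// mockClient keeps only what the stress test needs: an inbox on which
// pushed configuration versions arrive.
type mockClient struct {
	id    int
	inbox chan string
}

// runMockClients starts n mock clients, pushes one config to all of them,
// and returns how many received it and the slowest delivery time.
func runMockClients(n int, config string) (received int, slowest time.Duration) {
	clients := make([]*mockClient, n)
	for i := range clients {
		clients[i] = &mockClient{id: i, inbox: make(chan string, 1)}
	}

	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	start := time.Now()
	for _, c := range clients {
		wg.Add(1)
		go func(c *mockClient) {
			defer wg.Done()
			got := <-c.inbox
			elapsed := time.Since(start)
			mu.Lock()
			defer mu.Unlock()
			if got == config {
				received++
			}
			if elapsed > slowest {
				slowest = elapsed
			}
		}(c)
	}

	// The push loop stands in for the server side of the delivery.
	for _, c := range clients {
		c.inbox <- config
	}
	wg.Wait()
	return received, slowest
}

func main() {
	received, slowest := runMockClients(20000, "config-v1")
	fmt.Printf("clients that received the push: %d, slowest delivery: %v\n", received, slowest)
}
```

A goroutine per client is cheap enough that client count is limited mainly by memory, which is also why allocation and GC behavior become visible under this kind of load.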
PProf memory analysis graph

PProf memory analysis graph description:
MessageToStruct and FromJsonMap consume the most memory. Both are used for data transformation.
We had previously made similar optimizations for MessageToStruct, so this issue was quickly resolved.
FromJsonMap: This is the core of CRD data transformation. It transforms data into a YAML format that Kubernetes can recognize. Our team refactored it to reuse memory and optimize the transformation function. This significantly reduced the execution time, bringing it down to the millisecond level.
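The general technique of reusing memory during data transformation can be illustrated with `sync.Pool` from the Go standard library. This is not the refactored FromJsonMap code, which is not shown in this topic. The function name and the choice of JSON encoding are assumptions made to keep the example self-contained.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so that each transformation does not
// allocate a fresh one, which reduces pressure on the garbage collector.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// encode serializes v to JSON using a pooled buffer.
func encode(v map[string]interface{}) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	if err := json.NewEncoder(buf).Encode(v); err != nil {
		return nil, err
	}
	// Copy the result out, because the buffer goes back to the pool
	// and will be overwritten by a later call.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}

func main() {
	out, err := encode(map[string]interface{}{"kind": "PolicyConfig", "enabled": true})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(string(out))
}
```

The copy at the end is the price of safety: returning `buf.Bytes()` directly would hand the caller memory that the next call can overwrite.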
O&M control
The control plane service is built on Kubernetes and is released to production as a Deployment and a DaemonSet. Each release affects the entire cluster, so risk control measures are required.
Relying on Ant Code's review mechanism, our team established a three-party review process involving developers, quality assurance, and Site Reliability Engineering (SRE) to control the scope of changes. After the review is complete, the changes are automatically deployed to the corresponding cluster. We use monitoring for immediate post-deployment checks. If a problem is found, an alert is promptly triggered. This collaborative process ensures the quality of our online production environment.
Future plans
