This topic describes the O&M challenges and evolution of the large-scale service mesh at Ant Group. It covers how the service mesh was rolled out to support the Double 11 Shopping Festival.
Cloud-native choices and issues
Traditional Service Mesh:
Software architecture: Middleware capabilities are decoupled from the application framework and packaged as independent software.
Deployment: A common deployment method is to run the service mesh as an independent process alongside the application process, both within the same application container.
From the beginning, Ant Group chose to embrace a cloud-native approach.
Sidecar pattern
The approach of running an independent process within an application container has the following advantages and disadvantages:
Advantages: It is compatible with legacy deployment modes and allows for rapid online deployment.
Disadvantages: It is highly intrusive to the application container and is difficult to manage for image-based containers.
A cloud-native approach solves these disadvantages and provides several other advantages:
It decouples the O&M of the service mesh from the application container. This allows middleware O&M to be managed at the infrastructure level.
Only stable, long-term JVM parameters related to the service mesh are kept in the application image. This allows the application to connect to the service mesh using only a few environment variables.
To support the evolution of container-based O&M, applications must complete an image-based migration to connect to the service mesh. This process lays the foundation for further cloud-native development.
Deployment mode | Advantages | Disadvantages |
Independent process | Compatible with legacy deployment modes. Low migration cost. Fast online deployment. | Creating images from manually modified application containers complicates O&M. |
Sidecar | Decoupling through desired state-oriented O&M. | Depends on the Kubernetes infrastructure. High cost to migrate the O&M environment. Applications require image-based migration. |
After connecting to the service mesh, a typical pod structure may contain multiple sidecars:
MOSN: RPC Mesh, MSG Mesh, and more.
Other sidecars.

These sidecar containers share the same network namespace as the application container. This allows the application process to access services provided by the service mesh through a local port. This approach ensures a user experience consistent with the traditional deployment method.
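Because the sidecars share the pod's network namespace, the application can reach the mesh over the loopback address. The following is a minimal sketch of that idea, not the actual Ant Group client logic. The environment variable names (`MESH_ENABLED`, `MESH_RPC_PORT`), the default port, and the direct endpoint are all hypothetical.

```python
import os

# Hypothetical names and values, used only for illustration.
DEFAULT_MESH_RPC_PORT = "12200"
DIRECT_ENDPOINT = "rpc.example.internal:9600"


def resolve_rpc_endpoint(env=None):
    """Return the (host, port) that the application sends RPC traffic to."""
    env = os.environ if env is None else env
    if env.get("MESH_ENABLED", "false").lower() != "true":
        # Mesh not enabled: fall back to the (hypothetical) direct endpoint.
        host, port = DIRECT_ENDPOINT.rsplit(":", 1)
        return host, int(port)
    # The sidecar shares the pod's network namespace, so it is reachable
    # on the loopback address through a local port.
    return "127.0.0.1", int(env.get("MESH_RPC_PORT", DEFAULT_MESH_RPC_PORT))
```

With this shape, switching an application onto the mesh only requires changing a few environment variables, which matches the goal of keeping the application image stable.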
Cloud-native infrastructure support
The Ant Group team also promoted cloud-native migration at the infrastructure level to support the implementation of the service mesh.
Full image-based migration for applications
First, Ant Group promoted a full image-based migration. This process involved migrating all containers of core internal applications to images. The migration included the following tasks:
Adding support for service mesh environment variables at the base image layer.
Adapting application Dockerfiles for the service mesh.
Migrating legacy static files that were managed separately from the frontend for historical reasons.
Performing a push-to-pull migration for many applications that used frontend block distribution.
Upgrading and replacing a large batch of VM-mode containers.
Podifying containers
In addition to the application image migration, the sidecar pattern requires all application containers to run in pods so that multiple containers can share a network. Upgrading the existing containers directly would have incurred high development and trial-and-error costs. The hundreds of applications connecting to the service mesh ran on tens of thousands of non-Kubernetes containers. The Ant Group team ultimately chose to replace them all with Kubernetes pods through large-scale scale-out and scale-in operations.
After these two rounds of migration, the Ant Group team completed the cloud-native migration at the infrastructure level.
Challenges to the resource model
A key issue with the sidecar pattern is how to allocate resources.
The ideal ratio assumption
The initial resource design was based on the principle that memory cannot be overcommitted. Ant Group assumed that the basic resource usage of MOSN is proportional to the specifications selected by the application. This means the extra CPU and memory requested for the sidecar are proportional to the resources of the application container. This ratio was set to 1/4 for CPU and 1/16 for memory.
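The ratio rule can be worked through with a short calculation. This is a sketch of the arithmetic only; the function names and the example specification (4 CPU cores, 8 GiB of memory) are illustrative assumptions, not figures from Ant Group.

```python
from fractions import Fraction

# Ratios stated in the text: sidecar CPU is 1/4 and sidecar memory is 1/16
# of the application container's resources.
CPU_RATIO = Fraction(1, 4)
MEM_RATIO = Fraction(1, 16)


def sidecar_request(app_cpu_millicores, app_mem_mib):
    """Extra resources requested for the sidecar under the ideal ratio."""
    return {
        "cpu_millicores": int(app_cpu_millicores * CPU_RATIO),
        "mem_mib": int(app_mem_mib * MEM_RATIO),
    }


def pod_total(app_cpu_millicores, app_mem_mib):
    """Total pod resources: the application plus the extra sidecar request."""
    sidecar = sidecar_request(app_cpu_millicores, app_mem_mib)
    return {
        "cpu_millicores": app_cpu_millicores + sidecar["cpu_millicores"],
        "mem_mib": app_mem_mib + sidecar["mem_mib"],
    }
```

For an application with 4000 millicores and 8192 MiB, the sidecar requests an extra 1000 millicores and 512 MiB, so the pod as a whole needs 5000 millicores and 8704 MiB.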
The resource allocation for a typical pod at this stage is shown in the following figure:

Limitations of the ideal ratio
The ideal ratio assumption led to two problems:
Ant Group had already implemented quota control for application resources. However, the sidecar ran outside the application container, so the service mesh container became a source of resource leakage.
Due to application diversity, the service mesh containers of some high-traffic applications experienced severe out-of-memory (OOM) errors.
To quickly support the rollout of the service mesh in non-cloud environments, an in-place injection feature was launched. However, resources for in-place injection could not be allocated separately. Because memory could not be overcommitted, a two-part allocation method was used. The pod's memory resources were allocated as follows:
1/16 of the memory was allocated to the sidecar.
15/16 of the memory was allocated to the application container.
However, this approach introduced new issues:
Inconsistent memory visible to the application.
Business monitoring drift.
Risk of OOM errors for the application process.
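The two-part split and its side effects can be illustrated with a small calculation. This is a sketch under assumptions: the 8192 MiB pod size and the 7900 MiB JVM footprint are hypothetical numbers chosen to show how an application tuned for the original specification can exceed its reduced limit.

```python
def split_pod_memory(pod_mem_mib):
    """Split pod memory for in-place injection: 1/16 sidecar, 15/16 application."""
    sidecar_mib = pod_mem_mib // 16
    app_mib = pod_mem_mib - sidecar_mib
    return sidecar_mib, app_mib


def jvm_fits(app_limit_mib, jvm_footprint_mib):
    """Check whether a JVM footprint fits inside the application container limit."""
    return jvm_footprint_mib <= app_limit_mib
```

For an 8192 MiB pod, the sidecar receives 512 MiB and the application container sees only 7680 MiB. A JVM whose total footprint was tuned to 7900 MiB for the original specification no longer fits, which is the source of the application OOM risk.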
Solution
To solve these problems, the Ant Group team made another assumption: the resources used by the application before connecting to the service mesh are equivalent to the resources that will be occupied by the service mesh container. Therefore, the process of connecting to the service mesh is treated as a resource swap.
Based on this assumption, the scheduling layer was updated to support resource overcommitment within a pod. The CPU and memory for the service mesh container were overcommitted from the pod's total resources. The application container could still see all the originally allocated resources.
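The overcommitment model can be sketched as follows. The sidecar sizes are passed in as parameters because the text does not state how they were chosen in the new scheme; the function name and the example values are assumptions for illustration.

```python
def overcommitted_layout(pod_cpu_m, pod_mem_mib, sidecar_cpu_m, sidecar_mem_mib):
    """Describe a pod in which the sidecar is overcommitted from the pod total.

    The pod total stays equal to the application's original specification,
    and the application container still sees all of it.
    """
    containers = {
        "app": {"cpu_m": pod_cpu_m, "mem_mib": pod_mem_mib},
        "sidecar": {"cpu_m": sidecar_cpu_m, "mem_mib": sidecar_mem_mib},
    }
    return {
        "pod": {"cpu_m": pod_cpu_m, "mem_mib": pod_mem_mib},
        "containers": containers,
        # Amount by which the container limits exceed the pod total.
        "overcommit": {
            "cpu_m": sum(c["cpu_m"] for c in containers.values()) - pod_cpu_m,
            "mem_mib": sum(c["mem_mib"] for c in containers.values()) - pod_mem_mib,
        },
    }
```

For a pod of 4000 millicores and 8192 MiB with a sidecar of 1000 millicores and 512 MiB, the application container still sees 4000 millicores and 8192 MiB, and the container limits exceed the pod total by exactly the sidecar's share.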
An example of the new resource allocation solution is shown below:

The new allocation solution considered the following factors:
Memory overcommitment
The risk of a pod OOM error
Therefore, the OOM score of the sidecar containers was also adjusted so that, during an out-of-memory event, the service mesh process is prioritized to start ahead of the Java business process, which further reduces the impact.
The new allocation solution also solved the two problems mentioned earlier. It also smoothly supported multiple rounds of stress testing before the sales promotions.
Reconstruction
When the new allocation solution went online, the service mesh had already been deployed as part of the elastic site build-out. The Ant Group team then found that, in some scenarios, the service mesh container could not obtain sufficient CPU resources, which caused significant jitter in application response time (RT). The reason was that, in CPU share mode, no CPU quota was allocated to the sidecar within the pod by default.
This led to two new problems:
Existing allocated sidecars still had an OOM risk.
The sidecar could not obtain CPU resources.
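The CPU starvation can be illustrated with the way CPU shares behave under contention. The sketch below assumes the conversion that Kubernetes uses on cgroup v1 (1024 shares per CPU core, with a minimum of 2). Whether Ant Group's scheduler used exactly these values is not stated in the text, and the container sizes are hypothetical.

```python
MIN_SHARES = 2
SHARES_PER_CPU = 1024


def millicpu_to_shares(millicpu):
    """Convert a CPU request in millicores to cgroup v1 CPU shares."""
    if millicpu <= 0:
        return MIN_SHARES
    return max(MIN_SHARES, millicpu * SHARES_PER_CPU // 1000)


def contended_cpu_fraction(shares_by_container):
    """Fraction of CPU time each container receives when all are busy."""
    total = sum(shares_by_container.values())
    return {name: shares / total for name, shares in shares_by_container.items()}
```

If the application container requests 4000 millicores (4096 shares) and the sidecar has no explicit CPU allocation (2 shares), the sidecar receives well under 0.1% of CPU time whenever the application is busy. A starved sidecar on the request path would show up as RT jitter.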
Ant Group could not afford the cost of replacing all pods. With support from the scheduling team, the team manually recalculated and modified the pod annotations. This process reallocated all resources within the pods to mitigate these two risks. The total number of containers fixed was approximately 250,000.
Evolution of O&M facilities in large-scale scenarios
Changes to the service mesh include injection and upgrades. All underlying changes are handled by the Operator component. The Operator accepts identifiers written to the pod annotation from a higher layer and then modifies the corresponding Pod Spec to complete the change. This is a typical cloud-native approach. Due to Ant Group's resource constraints and O&M needs, in-place injection and smooth upgrades were also developed.
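The annotation-driven flow can be sketched as a small reconcile function. This is not the Ant Group Operator. The annotation key, the sidecar container name, and the rule that a missing annotation means the sidecar is removed are all hypothetical choices made for illustration.

```python
import copy

# Hypothetical identifiers, used only for illustration.
ANNOTATION = "mesh.example.com/desired-sidecar-image"
SIDECAR_NAME = "mosn-sidecar"


def reconcile(pod):
    """Return a copy of the pod whose spec matches the desired annotation."""
    desired = pod.get("metadata", {}).get("annotations", {}).get(ANNOTATION)
    new_pod = copy.deepcopy(pod)
    containers = new_pod["spec"]["containers"]
    current = next((c for c in containers if c["name"] == SIDECAR_NAME), None)
    if desired is None:
        # No identifier from the higher layer: remove the sidecar (rollback).
        new_pod["spec"]["containers"] = [
            c for c in containers if c["name"] != SIDECAR_NAME
        ]
    elif current is None:
        # Injection: add the sidecar container.
        containers.append({"name": SIDECAR_NAME, "image": desired})
    else:
        # Upgrade: point the existing sidecar at the desired image.
        current["image"] = desired
    return new_pod
```

Because the function only compares the desired state with the current spec, running it repeatedly on an already reconciled pod changes nothing, which is the property that makes desired-state O&M safe to retry.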
Access
There are two connection types:
Creation-time injection: Initially, the service mesh only supported sidecar injection at creation time.
In-place injection: This was later introduced to support large-scale, rapid injection and rollback.
The two connection types have the following advantages and disadvantages:
Creation-time injection:
Replacing resources requires a large buffer.
Rollback is difficult.
In-place injection:
Does not require resource reallocation.
Allows for in-place rollback.
In-place injection and rollback require fine-grained modification of the Pod Spec. Many problems were discovered during implementation. This feature has only been tested on a small scale.
Upgrade
The service mesh is deeply involved in service traffic. Therefore, the initial sidecar upgrade method also required the application to restart. Although this process seems simple, Ant Group encountered two critical problems:
The startup order of containers within a pod was random, which sometimes prevented the application from starting. This problem was solved by modifying the startup logic at the scheduling layer to require the pod to wait for all sidecars to start first. However, this solution led to a new problem.
The sidecar started too slowly, causing an upper-layer timeout. This issue is still being resolved.
In the sidecar, MOSN provides a more flexible smooth upgrade mechanism. The Operator controls the startup of a second MOSN sidecar, completes the connection migration, and then shuts down the legacy sidecar. Small-scale tests show that the application experiences no traffic interruption and the process is nearly seamless. Smooth upgrades also involve many operations on the Pod Spec. For stability reasons before the sales promotions, this method has not been widely used.
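The order of operations in a smooth upgrade can be sketched with a toy model. The `Sidecar` class and the list of connection names are stand-ins. The text does not describe how MOSN actually migrates connections, so moving items between lists here only represents the sequence: start the new sidecar, migrate, then stop the old one.

```python
class Sidecar:
    """Toy model of a sidecar process that holds live connections."""

    def __init__(self, version):
        self.version = version
        self.connections = []
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        # Refuse to stop while connections are still attached, so that
        # no traffic is dropped by the shutdown.
        if self.connections:
            raise RuntimeError("cannot stop a sidecar that still holds connections")
        self.running = False


def smooth_upgrade(old, new_version):
    """Start a second sidecar, migrate connections, then stop the old one."""
    new = Sidecar(new_version)
    new.start()
    while old.connections:
        new.connections.append(old.connections.pop(0))
    old.stop()
    return new
```

The key design point is that the legacy sidecar is shut down only after it holds no connections, so the application never sees a moment without a running sidecar.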
Scalability issues
As the system gradually reached the state required for the sales promotions, the number of containers connected to the service mesh began to increase explosively. The number of containers grew rapidly from thousands to over 100,000 and eventually reached hundreds of thousands across the entire platform. This expansion was followed by several version changes.
This rapid growth, combined with a lack of corresponding platform capabilities, brought significant challenges to large-scale sidecar O&M:
Chaotic version management:
The mapping between sidecar versions and applications or zones was maintained in the configuration of an internal metadata platform. After many applications were connected, global versions, experiment versions, and special bugfix versions became mixed in multiple configuration items. This broke the unified baseline and made maintenance difficult.
Metadata inconsistency:
The metadata platform maintained pod-granularity sidecar version information. However, because the Operator is oriented towards the desired state, inconsistencies between the metadata and the underlying reality could occur. These inconsistencies are currently found through inspections.
Lack of a comprehensive support platform for sidecar operations:
Lack of a multi-dimensional global view.
Lack of a standardized phased release flow.
Reliance on manual experience for configuration management changes.
Excessive monitoring noise.
Currently, the development teams for the service mesh and Platform as a Service (PaaS) are building the corresponding capabilities. These problems are gradually being mitigated.
Building peripheral technical risk capabilities
Monitoring capability
Ant Group's monitoring platform provides basic monitoring capabilities and dashboards for the service mesh. It also provides application-dimension sidecar monitoring, including the following:
System monitoring:
CPU
MEM
LOAD
Business monitoring:
RT
RPC traffic
MSG traffic
Error log monitoring
The service mesh process also provides a corresponding metrics interface for service-granularity data ingestion and calculation.
Inspections
After the service mesh went online, the following inspections were gradually added:
Log volume check
Version consistency check
Time-sharing scheduling status consistency check
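A version consistency check can be sketched as a comparison between the versions recorded in the metadata platform and the versions actually running. The data shapes (dictionaries keyed by pod name) and the function name are assumptions; the real inspection and its data sources are not described here.

```python
def find_version_drift(metadata_versions, actual_versions):
    """Return pods whose recorded sidecar version differs from the running one.

    Both arguments map pod name to sidecar version. A pod missing from one
    side is reported with None for that side.
    """
    drift = {}
    for pod in sorted(set(metadata_versions) | set(actual_versions)):
        recorded = metadata_versions.get(pod)
        actual = actual_versions.get(pod)
        if recorded != actual:
            drift[pod] = (recorded, actual)
    return drift
```

For example, if the metadata platform records `pod-b` at version `1.1` while the pod is still running `1.0`, the check reports `pod-b` together with both versions so that an operator can decide which side to correct.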
Contingency plans and emergency response
The service mesh itself can disable some features as needed. The following features are currently implemented through the configuration center:
Log level degradation
Tracelog log level degradation
Control plane (Pilot) dependency degradation
Server Load Balancer long polling degradation
For services that the service mesh depends on, corresponding contingency plans were also added to prevent potential jitter risks:
Stop changes to the Server Load Balancer list.
Disable push from the service registry during peak hours.
The service mesh is a fundamental component. The current emergency measures are mainly the following restart methods:
Individual sidecar restart
Pod restart
Change risk control
In addition to the traditional three-step change process, Ant Group also introduced unattended changes. This provides automatic detection, automatic analysis, and change circuit breaking for service mesh changes.
Unattended change risk control focuses on the impact on the system and application after a change. It integrates multiple layers of detection, mainly including the following:
System metrics: machine memory, disk, CPU.
Business metrics: RT, QPS of the application and the service mesh.
Application-related links: abnormal conditions in upstream and downstream applications.
Global business metrics.
With this series of control facilities, platform-wide service mesh change risks can be detected and blocked within a single batch of changes, which prevents the risk from escalating.
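Blocking a risky change within a single batch can be sketched as a batched rollout with a health gate. The function and its callbacks are hypothetical. In practice, `is_healthy` would stand for the layered checks listed above (system metrics, business metrics, upstream and downstream applications, and global business metrics).

```python
def run_batched_change(batches, apply_change, is_healthy):
    """Apply a change batch by batch and stop at the first unhealthy batch."""
    completed = []
    for index, batch in enumerate(batches):
        for target in batch:
            apply_change(target)
        # Change circuit breaking: do not proceed past an unhealthy batch.
        if not all(is_healthy(target) for target in batch):
            return {
                "status": "blocked",
                "failed_batch": index,
                "completed_batches": completed,
            }
        completed.append(index)
    return {"status": "done", "failed_batch": None, "completed_batches": completed}
```

Because the health gate runs after every batch, a bad change affects at most the targets in the batch where the problem first appears, and later batches are never touched.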
Future outlook
During the rapid implementation of the service mesh, a series of problems were encountered and solved. However, many more problems still need to be addressed. As one of the core components of the next-generation cloud-native middleware, the technical risk capabilities of the service mesh need continuous improvement and refinement.
In the future, continuous development is needed in the following areas:
Support for large-scale, efficient injection and rollback.
More flexible change capabilities, including smooth and non-smooth changes that are transparent to applications.
More precise change risk control capabilities.
More efficient, low-noise monitoring.
More complete control plane support.
Application-dimension parameter customization capabilities.