This topic describes the evolution of the wireless gateway, explains the reasons for and value of its Service Mesh transformation, and shows how to smoothly migrate service traffic to the new Mesh gateway architecture during the Double 11 sales promotion.
This topic covers the following aspects:
Evolution of the gateway: Explains why the gateway was transformed using a mesh architecture.
Gateway mesh transformation: Describes how to perform the mesh transformation.
Double 11 implementation: Introduces how the core capabilities were implemented.
Evolution of the gateway
Ant Group's wireless gateway is the primary traffic entry point for its clients, connecting to hundreds of business systems and providing tens of thousands of API services. It supports various business scenarios, such as Alipay, MYbank, Ant Fortune, MYbank Loan, Zhima Credit, and Ant International's A+ services. The architecture of the wireless gateway is constantly evolving to handle diverse business models and complex network structures.
Centralized gateway
In the "All In Wireless" era, a centralized gateway was developed to consolidate common capabilities. The following figure shows an example:

Process description:
The client connects to the Spanner cluster, which is the gateway access layer.
Spanner forwards the client request to the wireless gateway cluster.
The gateway processes the request using common capabilities, such as authentication and traffic limiting. It then routes the request to the correct backend service based on the service identity. After the service processes the request, the response is returned along the same path.
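The processing steps above can be sketched as a small pipeline. This is a minimal illustration only, not the gateway's actual code: the `Request` fields, the `Gateway` type, the token check, and the countdown quota are all hypothetical stand-ins for the real authentication and traffic limiting logic.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a simplified client request. The field names are hypothetical.
type Request struct {
	ServiceID string // service identity used for routing
	Token     string // credential checked by the authentication step
}

var (
	ErrUnauthenticated = errors.New("unauthenticated")
	ErrRateLimited     = errors.New("rate limited")
	ErrNoRoute         = errors.New("no backend for service identity")
)

// Gateway holds the state a centralized gateway needs for this sketch.
type Gateway struct {
	Routes map[string]string // service identity -> backend address
	Quota  map[string]int    // remaining requests allowed per service identity
}

// Handle applies the common capabilities and then resolves the backend.
func (g *Gateway) Handle(req Request) (string, error) {
	// Authentication (stand-in: any non-empty token passes).
	if req.Token == "" {
		return "", ErrUnauthenticated
	}
	// Traffic limiting (stand-in: a simple countdown per service identity).
	if left, limited := g.Quota[req.ServiceID]; limited {
		if left <= 0 {
			return "", ErrRateLimited
		}
		g.Quota[req.ServiceID] = left - 1
	}
	// Routing based on the service identity.
	backend, ok := g.Routes[req.ServiceID]
	if !ok {
		return "", ErrNoRoute
	}
	return backend, nil
}

func main() {
	g := &Gateway{
		Routes: map[string]string{"demo.pay.create": "pay-service:12200"},
		Quota:  map[string]int{"demo.pay.create": 1},
	}
	req := Request{ServiceID: "demo.pay.create", Token: "t"}
	fmt.Println(g.Handle(req)) // first call is routed to the backend
	fmt.Println(g.Handle(req)) // second call exceeds the demo quota
}
```

Because every request for every service passes through this one component, its capacity and its failures are shared by all backends, which is the coupling that the later architectures set out to remove.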
The 2016 Spring Festival "Red Packet" campaign and the growth of new services, such as Ant Forest, led to a continuous increase in the number of machines in the gateway cluster. This increased operations and maintenance (O&M) complexity and created unsustainable IT costs. At the same time, core business links, such as the wireless cashier and Scan, urgently needed to be isolated from other services. This isolation was necessary to prevent unpredictable traffic bursts from affecting the availability of these high-priority services. Therefore, the development and promotion of a decentralized gateway began in the second half of 2016.
Decentralized gateway
Example of a decentralized gateway

The decentralized gateway splits the capabilities of the original centralized gateway:
Forwarding logic: The capability to forward requests based on the service identity is migrated from the gateway to Spanner.
Gateway logic: Common gateway capabilities, such as authentication, traffic limiting, and LDC, are extracted into an mgw JAR file. This file is integrated into the business system and runs in the same Java Virtual Machine (JVM) process as the backend service.
The client request processing flow is as follows:
After a client request reaches Spanner, Spanner forwards the request to the mgw of the backend service based on the service identity.
The mgw applies the common gateway capabilities. 90% of requests then reach the business service through an internal JVM call.
Compared with the centralized gateway, the decentralized gateway has the following advantages:
Improved performance:
It removes one layer from the gateway link because the gateway JAR file calls the business service through an internal JVM call.
During major sales promotions, you do not need to worry about gateway capacity because the number of gateways scales with the number of business services.
Improved stability:
In a centralized gateway architecture, a problem with the gateway affects all business services.
With decentralization, a gateway problem does not affect decentralized applications.
However, every approach has its downsides. As the decentralized gateway was implemented in the top 30 gateway applications, its disadvantages became apparent:
Low development efficiency:
Difficult to integrate: Integration requires adding more than 15 pom dependencies and more than 20 configurations, which is highly intrusive to the business configuration.
Difficult version convergence: The mgw.jar file has more than 40 versions, but some services still use the initial version. This makes version convergence difficult.
Difficult to promote new features: Rolling out new features requires businesses to upgrade and release their services, a process that often takes a month or longer.
Interference with service stability:
Dependency conflicts interfered with service stability on multiple occasions.
Gateway features could not be released in grayscale or monitored independently, and their resource usage could not be assessed or isolated.
No support for heterogeneous access: Non-Java applications cannot use the decentralized gateway.
Mesh gateway
Most of the problems with the decentralized gateway were caused by bundling gateway functions with business processes. This led the Ant Group team to consider whether these problems could be solved by separating gateway functions from business processes. This idea aligns with the Sidecar pattern of Service Mesh. Therefore, after three years of using the decentralized gateway, the team transformed the Ant Group wireless gateway into a mesh architecture to address these pain points.
Example of a Mesh gateway

The Mesh gateway is deployed in the same pod as the backend service. This means the Mesh gateway and the business system run on the same server but in different processes. The Mesh gateway has all the capabilities of a complete gateway. The client request processing flow is as follows:
After a client request reaches Spanner, Spanner forwards the request to the Mesh gateway in the same pod as the backend service based on the service identity.
The Mesh gateway executes common logic and then calls the business service, which runs in a different process on the same machine, to complete the request.
Mesh gateways have the following advantages over decentralized gateways:
Development efficiency:
Easy integration: Businesses can integrate with the Mesh gateway without adding complex dependencies or configurations.
Easy version convergence and fast promotion of new features: After a new version passes grayscale validation, it can be rolled out across the entire network. This eliminates the need to maintain multiple versions and to troubleshoot the problems they cause.
Business stability:
Seamless upgrades: Business systems do not need to be aware of gateway upgrades. Gateway iterations can be upgraded seamlessly using existing MOSN features for fine-grained grayscale releases and validation.
Independent monitoring: Because the Mesh gateway runs in a separate process from the business system, you can use real-time telemetry to monitor its performance. This data can be used for evaluation and optimization to enhance backend service stability.
Heterogeneous system access: The Mesh gateway acts as a proxy that hides backend differences from the frontend. It supports major Remote Procedure Call (RPC) protocols, such as SOFARPC, Dubbo, and gRPC. It also supports access for heterogeneous systems that are not part of the SOFA framework.
This concludes the history of the wireless gateway's evolution.
You may have the following questions:
After the mesh transformation, the request processing flow is no longer an in-process call. It adds an extra hop compared with the decentralized gateway. Does this cause performance loss?
With such a major change, how can stability be ensured during the online deployment process?
The following sections will answer these questions.
Gateway mesh transformation
Architecture
Following the layered model of Service Mesh, the Mesh gateway is divided into a data plane and a control plane. The following figure shows an example:

Description of the layered model for gateway mesh transformation:
The blue arrow shows the client request processing flow. The Mesh gateway data plane runs in Ant Group's internal MOSN Sidecar.
The green arrow shows the gateway configuration delivery process. The Mesh gateway control plane is currently managed by the gateway management platform.
The red arrow shows the O&M system for the MOSN Sidecar.
Data plane
The data plane implements all the original gateway capabilities in the Go language. It is integrated into the MOSN model and reuses the capabilities of other existing components. The Spanner network access layer also implements request distribution decision-making. The data plane includes the following features:
Data transformation: Transforms client request data into backend request data and backend response data into a client response. It uses the extension capabilities of the MOSN protocol layer to parse Mmtp, which is the gateway's proprietary protocol.
Common features: Provides common gateway capabilities, such as authorization, security, traffic limiting, LDC, and retries. These are extended using the MOSN Stream Filter registration mechanism and the unified Stream Send/Receive Filter interfaces.
Request routing: Routes client requests to the correct backend system based on specific rules. After implementing the LDC logic at the gateway layer, it reuses the route matching capability of the MOSN RPC component. Most requests are routed to the current zone, which allows them to be sent directly to the business process port of the current pod.
Backend calls: Supports calls to various types of backend services, such as SOFARPC and Chair. Support for more RPC frameworks and heterogeneous systems is planned for the future.
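The filter-based extension model described under "Common features" can be sketched as follows. This is a hedged illustration of the general pattern (named filters registered once, then run in order until one rejects the request). The `ReceiveFilter` interface, `Registry`, `Context`, and both demo filters are hypothetical and deliberately simpler than MOSN's real Stream Filter interfaces.

```go
package main

import "fmt"

// Status tells the chain whether to continue with the next filter.
type Status int

const (
	Continue Status = iota
	Stop
)

// Context carries per-request data through the chain (hypothetical shape).
type Context struct {
	Headers map[string]string
	Reply   string // set by the filter that stops the chain
}

// ReceiveFilter is a hypothetical stand-in for a stream receive filter.
type ReceiveFilter interface {
	OnReceive(ctx *Context) Status
}

// Registry maps filter names to factories so features can be plugged in by name.
type Registry map[string]func() ReceiveFilter

// Register adds a filter factory under a name.
func (r Registry) Register(name string, factory func() ReceiveFilter) {
	r[name] = factory
}

// Build creates a filter chain from an ordered list of registered names.
func (r Registry) Build(names ...string) ([]ReceiveFilter, error) {
	chain := make([]ReceiveFilter, 0, len(names))
	for _, n := range names {
		factory, ok := r[n]
		if !ok {
			return nil, fmt.Errorf("filter %q is not registered", n)
		}
		chain = append(chain, factory())
	}
	return chain, nil
}

// Run executes the filters in order and reports whether the request
// may proceed to routing.
func Run(chain []ReceiveFilter, ctx *Context) bool {
	for _, f := range chain {
		if f.OnReceive(ctx) == Stop {
			return false
		}
	}
	return true
}

// authFilter rejects requests that carry no token header.
type authFilter struct{}

func (authFilter) OnReceive(ctx *Context) Status {
	if ctx.Headers["token"] == "" {
		ctx.Reply = "unauthorized"
		return Stop
	}
	return Continue
}

// limitFilter allows a fixed number of requests and then rejects.
type limitFilter struct{ remaining int }

func (f *limitFilter) OnReceive(ctx *Context) Status {
	if f.remaining <= 0 {
		ctx.Reply = "rate limited"
		return Stop
	}
	f.remaining--
	return Continue
}

// NewDemoRegistry registers the two demo filters.
func NewDemoRegistry(limit int) Registry {
	r := Registry{}
	r.Register("auth", func() ReceiveFilter { return authFilter{} })
	r.Register("limit", func() ReceiveFilter { return &limitFilter{remaining: limit} })
	return r
}

func main() {
	chain, err := NewDemoRegistry(1).Build("auth", "limit")
	if err != nil {
		panic(err)
	}
	for i := 0; i < 2; i++ {
		ctx := &Context{Headers: map[string]string{"token": "t"}}
		fmt.Println(Run(chain, ctx), ctx.Reply)
	}
}
```

Keeping each capability behind one small interface is what lets features such as authorization or traffic limiting be added, ordered, or removed without touching the routing and backend-call code.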
Control plane
The control plane is a management system that provides services for gateway users, such as adding and configuring APIs. It can deliver gateway configuration data to the Mesh gateway's Sidecar instances.
Does the extra hop affect business performance?
Performance loss analysis at the MOSN layer

For more information about performance analysis, see Ant Financial Service Mesh In-depth Practice.
The analysis concludes that, compared with the cost of complex business logic, the performance loss introduced by the mesh is within an acceptable range, and that the mesh allows its benefits to be realized quickly. During this integration, the Mesh gateway also simplified sessions:
Content simplification: Reduced from 7 KB to 650 bytes.
No decompression: Avoids the performance loss from GZip decompression.
Wireless and PC isolation: Solves the session pollution problem.
In scenarios with session validation, stress testing of the baseline performance shows that compared with the decentralized gateway, Queries Per Second (QPS) nearly doubled and response time (RT) decreased by about 15%.
Double 11 implementation
During the 2019 Double 11 sales promotion, the Mesh gateway was implemented as follows:
100% of the traffic for the Taobao payment API in the sales promotion payment link was directed to the Mesh gateway.
100% of the sales promotion traffic for the core application in the member link was directed to the Mesh gateway.
5% of the traffic for the top gateway APIs was directed to the Mesh gateway.
This traffic distribution shows that the Mesh gateway supports multi-dimensional, percentage-based traffic direction. Of course, implementing a new architecture during a major sales promotion is full of risks, as the following figure shows:

To manage these risks, we built capabilities for the three core requirements: monitoring, grayscale release, and rollback. Currently, the Mesh gateway routing provides switches at the API percentage, server, zone, and application levels. This supports grayscale releases for businesses and immediate remediation.
| Switch | Effective Time | Function |
|---|---|---|
| Mesh Percentage | Immediately | Controls whether mesh routing is enabled for a specific API. Supports percentages. |
| Label | Automatic detection | Controls whether mesh routing is enabled for a specific server. |
| Zone | Spanner Reload | Controls whether mesh routing is enabled for a specific zone. |
| MeshOnly | Spanner Reload | Controls whether mesh routing is enabled for a specific application. |
This section focuses on the API percentage-based traffic splitting policy for the Mesh gateway. In collaboration with the Spanner network access layer, the Ant Group team developed a mesh traffic splitting feature. This feature lets you switch a percentage of traffic for a single API to the Mesh gateway, with changes taking effect in seconds. If an API is configured with the rule decentralized x%, mesh y%, the traffic flow is shown in the following figure:

Decentralized gateway service: Served by Port 1 (Http) or Port 2 (Mmtp).
Mesh gateway service: Served by Port 3 (Http) or Port 4 (Mmtp).
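As a rough sketch of the percentage rule and port mapping above, the code below maps a number in [0, 100) to a target and then to one of the four symbolic ports. The type and function names are hypothetical. It also assumes that any traffic not covered by either percentage goes to the centralized gateway, which the text does not state.

```go
package main

import "fmt"

// Target is where a request is sent.
type Target int

const (
	Decentralized Target = iota
	Mesh
	CentralGateway
)

// Protocol is the client protocol of the request.
type Protocol int

const (
	HTTP Protocol = iota
	MMTP
)

// Rule is the per-API split "decentralized x%, mesh y%".
type Rule struct {
	DecentralizedPct int
	MeshPct          int
}

// Valid reports whether both percentages are non-negative and sum to at most 100.
func (r Rule) Valid() bool {
	return r.DecentralizedPct >= 0 && r.MeshPct >= 0 &&
		r.DecentralizedPct+r.MeshPct <= 100
}

// Pick maps a roll in [0, 100) to a target.
// Assumption: traffic covered by neither percentage goes to the centralized gateway.
func (r Rule) Pick(roll int) Target {
	switch {
	case roll < r.MeshPct:
		return Mesh
	case roll < r.MeshPct+r.DecentralizedPct:
		return Decentralized
	default:
		return CentralGateway
	}
}

// Port returns the symbolic port number used in the text (Port 1 to Port 4),
// or 0 when the request is not sent to a port in the service pod.
func Port(t Target, p Protocol) int {
	switch t {
	case Decentralized:
		if p == HTTP {
			return 1
		}
		return 2
	case Mesh:
		if p == HTTP {
			return 3
		}
		return 4
	default:
		return 0
	}
}

func main() {
	rule := Rule{DecentralizedPct: 95, MeshPct: 5}
	counts := map[Target]int{}
	// A real worker would draw the roll at random per request;
	// iterating over every value shows the resulting distribution.
	for roll := 0; roll < 100; roll++ {
		counts[rule.Pick(roll)]++
	}
	fmt.Println("mesh:", counts[Mesh], "decentralized:", counts[Decentralized])
	fmt.Println("mesh over Mmtp uses port", Port(Mesh, MMTP))
}
```

Because the decision depends only on the rule and a per-request roll, changing the percentages takes effect on the next request, which is consistent with the switch taking effect within seconds.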
The percentage information is configured by the business owner on the API management page and is returned to the Spanner Worker in the API response header. The Worker reads this information and makes traffic splitting decisions based on the specified percentages. A route-failure fallback mechanism is also implemented. The priority is Service:decentralized port > Service:mesh port > Gateway. The centralized gateway acts as the final fallback to ensure that the business service does not fail.
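One possible reading of the priority order is sketched below: the route chosen by the traffic split is tried first, and on failure the remaining routes are tried in the stated priority order, with the centralized gateway last. The route names, `SendWithFallback`, and the `failing` helper are hypothetical, and the real Spanner logic may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Route names follow the priority stated in the text.
const (
	RouteDecentralized = "service:decentralized-port"
	RouteMesh          = "service:mesh-port"
	RouteGateway       = "gateway"
)

// Priority is the fallback order: decentralized port, mesh port, centralized gateway.
var Priority = []string{RouteDecentralized, RouteMesh, RouteGateway}

var errRouteDown = errors.New("route down")

// SendWithFallback tries the chosen route first and, if it fails, the remaining
// routes in priority order. It returns the route that served the request.
func SendWithFallback(chosen string, send func(route string) error) (string, error) {
	order := []string{chosen}
	for _, r := range Priority {
		if r != chosen {
			order = append(order, r)
		}
	}
	var lastErr error
	for _, r := range order {
		if err := send(r); err != nil {
			lastErr = err
			continue
		}
		return r, nil
	}
	return "", fmt.Errorf("all routes failed: %w", lastErr)
}

// failing returns a send function that fails for the listed routes (demo helper).
func failing(down ...string) func(string) error {
	return func(route string) error {
		for _, d := range down {
			if d == route {
				return errRouteDown
			}
		}
		return nil
	}
}

func main() {
	// The split picked the mesh port, but the mesh port is unavailable.
	fmt.Println(SendWithFallback(RouteMesh, failing(RouteMesh)))
	// Both in-pod ports are unavailable, so the centralized gateway serves the request.
	fmt.Println(SendWithFallback(RouteMesh, failing(RouteMesh, RouteDecentralized)))
}
```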