Introduction to unitization

Many large internet systems, such as Taobao, Alipay, and MYbank, use a unitized architecture. Because this approach provides significant benefits, more companies are adopting it. This topic explains the importance of unitization and its benefits to your system. It also describes the principles and implementation of unitization, using the Alipay system from Ant Group as an example.

Single-point bottlenecks

As an internet system grows, it inevitably faces single-point bottlenecks. This is a common challenge for large-scale systems such as Alipay, Taobao, Google, and Facebook. These bottlenecks appear in different forms as a system evolves.

Figure: Single points in system evolution

Single-point servers and applications

In the early stages of a system's development, servers and applications are the first components to become bottlenecks. The solution is straightforward: add more machines and split the applications.

Single-point database

Next, the database becomes a bottleneck. Resolving this issue is more complex. A typical approach is to split the database vertically first and then horizontally. This process requires you to manage multiple data sources, implement data partitioning, and keep data access transparent to applications.

Data center bottlenecks

As applications, servers, and databases multiply, a single data center reaches its capacity bottleneck and can no longer accommodate more servers. Increased service traffic also increases the risk associated with running the system in a single data center. If a power outage or other disaster causes the data center to fail, the entire system will become unavailable. To mitigate this risk, the system must be deployed across two or more data centers.

Multi-data-center deployments typically follow one of two models:

  • Vertical model: Applications and databases are partitioned and deployed across different data centers. A single business operation might require services from applications in multiple data centers. This approach logically expands a single physical data center, which resolves the capacity issue.

  • Horizontal model: Identical applications are deployed in each data center, and each data center is capable of handling all business operations for the entire system. At runtime, each data center processes only a portion of the total service traffic.

From an implementation perspective, the vertical model is easier. Although it can overcome the data center capacity bottleneck, it does not provide disaster recovery: each data center hosts only part of the system, so losing any one of them still breaks the business operations that depend on it. Disaster recovery is a critical requirement for large-scale internet systems and an essential capability for financial services. Therefore, most large systems adopt the horizontal model across multiple data centers.

Single-region deployment

Disaster recovery is especially important for systems with hundreds of millions of users or for data-intensive systems. Data center-level disaster recovery is not sufficient. You must also consider regional disaster recovery, which means you cannot deploy all data centers in the same geographical area. This strategy prevents catastrophic events, such as earthquakes, tsunamis, or nuclear explosions, from destroying the entire system. This requirement is typical for financial systems such as banks and third-party payment services. For example, banks often have a standard two-region, three-data-center requirement for their data center deployment. As a result, a single-region deployment becomes a bottleneck for business growth.

The ability to deploy parts of a system to a distant region or city indicates a mature, large-scale internet system. Multi-region deployment is fundamentally similar to multi-data-center deployment. It faces almost the same problems, but with the added challenge of distance. Distance introduces latency, and longer distances result in higher latency. A latency of a few milliseconds is generally not a problem. However, when latency reaches tens of milliseconds, it becomes a significant issue because many business operations cannot tolerate the effects of this delay.
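A rough calculation shows why tens of milliseconds matter. The sketch below assumes a hypothetical request that makes 20 remote calls one after another. The call count and the round-trip times are illustrative assumptions, not measurements from any real system.

```python
def added_latency_ms(sequential_remote_calls: int, round_trip_ms: float) -> float:
    """Total network delay when each remote call waits for the previous one."""
    return sequential_remote_calls * round_trip_ms


# Hypothetical request that makes 20 sequential remote calls.
CALLS = 20

same_city_ms = added_latency_ms(CALLS, 2)      # assumed 2 ms round trip
cross_region_ms = added_latency_ms(CALLS, 30)  # assumed 30 ms round trip

print(f"same city: {same_city_ms} ms, cross region: {cross_region_ms} ms")
```

Under these assumptions, the network delay for a single request grows from 40 ms to 600 ms once the calls cross regions, which is why call chains should stay within one location wherever possible.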

Unitization

A multi-region, multi-data-center deployment is the inevitable evolutionary path for large-scale internet systems. To achieve this, a system must address the challenges mentioned earlier: traffic distribution, data splitting, and latency. While many technical solutions exist for these problems, they all rely on a specific deployment architecture. Although multiple deployment architectures are available, both theoretical research and practical experience from pioneering systems indicate that a unitized deployment is the optimal solution.

A unit is a self-contained collection of resources at the application service layer that can perform all business operations. This collection includes all the necessary services and the data partition allocated to that unit. A unitized architecture uses these units as its basic deployment components. Multiple units are deployed across data centers, and the number of units in each data center can vary. Each unit contains all the applications that the system requires. The data within a unit is a subset of the full data, partitioned along a specific dimension.

In a traditional Service-Oriented Architecture (SOA), services are hierarchical, and the number of nodes in each layer can differ. When an upper-layer service calls a lower-layer service, it randomly selects a node.

Figure: SOA (service-oriented) architecture

In a unitized architecture, services are also hierarchical. The key difference is that every node in each layer belongs to one and only one unit. When an upper-layer service calls a lower-layer service, it selects a node only from within its own unit.

Figure: Unitized architecture (lhc)
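The difference between the two node-selection rules can be sketched in a few lines. This is a minimal illustration, not the routing logic of any real framework: the `Node` type, the `REGISTRY` contents, the unit names, and both function names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    service: str
    unit: str
    address: str


# Hypothetical service registry: every node belongs to exactly one unit.
REGISTRY = [
    Node("payment", "unit-a", "10.0.0.1"),
    Node("payment", "unit-a", "10.0.0.2"),
    Node("payment", "unit-b", "10.0.1.1"),
]


def pick_node_soa(service: str, registry=REGISTRY) -> Node:
    """Traditional SOA: choose at random among all nodes of the service."""
    candidates = [n for n in registry if n.service == service]
    if not candidates:
        raise LookupError(f"no node for service {service!r}")
    return random.choice(candidates)


def pick_node_unitized(service: str, caller_unit: str, registry=REGISTRY) -> Node:
    """Unitized: choose only among nodes in the caller's own unit."""
    candidates = [
        n for n in registry if n.service == service and n.unit == caller_unit
    ]
    if not candidates:
        raise LookupError(f"no node for {service!r} in unit {caller_unit!r}")
    return random.choice(candidates)


# A caller running in unit-a never reaches the unit-b node.
print(pick_node_unitized("payment", "unit-a").unit)
```

The only change from the SOA version is the extra filter on `caller_unit`, which keeps an entire call chain inside one unit.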

A unit is similar to a complete, miniature version of the entire system. It is fully functional because it contains all applications. However, it does not contain all data, because it can operate only on a subset of the data. A unitized system is easy to deploy across multiple data centers. For example, you can deploy some units in one data center and other units in different data centers. By placing a traffic distributor at the business entry point, you can adjust the distribution of service traffic among the units.

Figure: A unit

Based on the definition and characteristics of a unit, we can identify a core requirement for a unitized architecture: data partitioning. In fact, the data partitioning scheme determines the proportion of service traffic that each unit can handle. Data partitioning, also known as sharding, involves horizontally splitting the global data along a specific dimension. The data in each resulting partition does not overlap. This is the goal of database horizontal splitting.

However, simply partitioning the data is not enough. Another requirement for unitization is that the partitioning dimension and sharding rules must be consistent for all business data across the system. For example, if you partition data by user, then all end-to-end services, such as transactions, acquiring, micro-loans, payments, and accounting, must also be partitioned by user. These services must also use the same rules to create the same number of partitions. For example, you can use the last two digits of a user ID as an identifier to divide the full data of each service into 100 partitions (00 to 99).
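The last-two-digits rule can be sketched as a single shared function. This is a minimal illustration that assumes user IDs are strings ending in at least two decimal digits; the function name, the table-naming scheme, and the sample ID are hypothetical.

```python
def shard_id(user_id: str) -> str:
    """Return the partition identifier ("00" to "99") for a user ID."""
    suffix = user_id[-2:]
    if len(suffix) != 2 or not (suffix.isascii() and suffix.isdigit()):
        raise ValueError(f"user ID must end in two digits: {user_id!r}")
    return suffix


def table_for(service: str, user_id: str) -> str:
    """Hypothetical naming scheme: one table per service per partition."""
    return f"{service}_{shard_id(user_id)}"


# Every service applies the same rule, so one user's data always lands in
# the same partition number, whichever service stores it.
user = "2088000000000042"  # made-up user ID
for service in ("transaction", "payment", "accounting"):
    print(table_for(service, user))
```

Because every service calls the same `shard_id` function, a request for a given user touches partition 42 in each service, and all of those partitions can be placed in the same unit.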

With these two principles in place, unitization becomes feasible. You can deploy one or more data partitions within a unit. The ratio of this partitioned data to the total data volume determines the proportion of service traffic the unit can handle. When you partition data, the choice of dimension is critical. A good dimension has the following attributes:

  • Appropriate granularity: If the granularity is too coarse, you lose flexibility and precision in traffic distribution. If the granularity is too fine, it places a heavy load on the resources and access logic needed to support the data.

  • Sufficiently balanced: After partitioning along this dimension, the data volume in each deployment unit should be almost identical.
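The relationship between partitions, units, and traffic share can be sketched as follows. The example assumes the 100 partitions derived from the last two digits of a user ID, and it assumes that traffic is spread evenly across partitions. The unit names, the partition ranges, and all function names are hypothetical.

```python
from collections import Counter

TOTAL_PARTITIONS = 100

# Hypothetical assignment of partitions 00-99 to three units.
UNIT_PARTITIONS = {
    "unit-a": range(0, 40),
    "unit-b": range(40, 80),
    "unit-c": range(80, 100),
}


def unit_for_user(user_id: str) -> str:
    """Traffic distributor: map a user to the unit that holds their partition."""
    partition = int(user_id[-2:])
    for unit, partitions in UNIT_PARTITIONS.items():
        if partition in partitions:
            return unit
    raise LookupError(f"partition {partition:02d} is not assigned to any unit")


def traffic_share(unit: str) -> float:
    """Fraction of traffic a unit handles, assuming evenly loaded partitions."""
    return len(UNIT_PARTITIONS[unit]) / TOTAL_PARTITIONS


def imbalance(user_ids) -> float:
    """Largest partition size divided by the mean size (1.0 is perfectly even)."""
    counts = Counter(uid[-2:] for uid in user_ids)
    sizes = [counts.get(f"{i:02d}", 0) for i in range(TOTAL_PARTITIONS)]
    mean = sum(sizes) / TOTAL_PARTITIONS
    return max(sizes) / mean if mean else 0.0


print(unit_for_user("2088000000000042"), traffic_share("unit-b"))
```

Moving a partition from one range to another changes the share of traffic each unit receives, which is how the granularity of the partitioning dimension limits the precision of traffic distribution. A result from `imbalance` well above 1.0 would suggest that the chosen dimension is not sufficiently balanced.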

For user-facing systems such as Alipay, where the user is the primary entity, partitioning data by user is a best practice.

Logical units

Logical units are the foundation of a unitized architecture. In this context, a unit is also called a Zone. Based on your business characteristics, you can deploy your system in different types of logical units. For more information, see Create a logical unit.