Hologres provides a mechanism of single-instance fast recovery to quickly recover the system from failures. This topic describes the trigger conditions and behavior of the single-instance fast recovery mechanism.
Implementation logic
In versions earlier than Hologres V2.0, Hologres compute nodes (the worker nodes in the following figure) are scheduled like containers, and the resource manager performs regular health checks. If a compute node takes more than 1 minute to respond due to out-of-memory (OOM) errors or faults in hardware or software, the resource manager automatically starts another normal compute node and migrates shards from the faulty compute node to the new node. For example, if Worker Node 3 responds late, the resource manager starts Worker Node 4 to replace Worker Node 3. This implements fast recovery. Data is stored in Apsara Distributed File System and does not need to be migrated between compute nodes. Compute nodes are lightweight and stateless to support fast recovery. By default, the single-instance fast recovery feature is enabled for each Hologres instance. If an exception occurs on a node, the instance can automatically recover without the need of manual O&M. If a query operator attempts to access a node that is in automatic recovery, the query immediately fails. The node recovery takes about one minute. If the node contains a large number of tables, the recovery time is longer.
In Hologres V2.0 and later, a fast recovery mechanism is introduced. For instances that use high specifications, fragment resources of worker nodes can be used to accelerate recovery and reduce the impact on online services. In this fast recovery mechanism, if a worker node on an instance becomes faulty, metadata of the shards that are allocated to the faulty worker node can be quickly loaded to other normal worker nodes. This mechanism is supported only if the total number of worker nodes reaches a specific threshold. The allowed number of faulty worker nodes on an instance depends on the total number of worker nodes on the instance.
The following table lists mappings between instance specifications and the allowed number of faulty worker nodes.
Number of CUs
Number of instance nodes
Allowed number of faulty worker nodes
160 ≤ Number of CUs < 320
10 ≤ Number of nodes on an instance < 20
1
320 < Number of CUs
20 ≤ Number of nodes on an instance
2
Metadata of the shards that are originally allocated to a faulty worker node is temporarily loaded to other normal worker nodes. This ensures quick recovery.
Sample scenario
By default, metadata of shards is evenly loaded to all worker nodes of an instance. In this example, the instance contains 10 worker nodes and 10 shards. One shard is loaded to each worker node.
If Worker 2 fails, the instance uses Worker 1 to load the metadata of Shard 2 within 10 seconds after a failure is detected.

After Worker 2 restarts, the system does not automatically load the metadata of Shard 2 back to Worker 2. The metadata of Shard 2 is still loaded to Worker 1. If load imbalance is detected after a fast recovery is triggered, you can use the Rebalance function for load balancing of the instance. For more information, see Rebalance.