In a cluster, a service often has multiple providers. Some providers might become unresponsive even if their persistent connections are still active. This can be caused by network problems, misconfigurations, prolonged full garbage collection (GC), full thread pools, or hardware failures. The single-node failure removal feature degrades these abnormal providers and directs more client requests to healthy nodes. When an abnormal node starts to function normally again, the feature recovers it. This process prevents service failures from impacting the business, avoids an avalanche effect, and improves system availability.
How it works
The single-node failure removal feature counts calls and exceptions within a time window. From these counts it calculates the exception rate for each service IP address and the average exception rate for the entire service. If the exception rate of an IP address exceeds the service's average rate by a specified ratio, the feature degrades the weight of that service and IP address dimension, that is, the weight of that provider address for that service. Unless the weight has been degraded to 0, the feature restores it after calls to that dimension return to normal. The entire calculation and adjustment process is asynchronous and does not block calls.
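The per-window calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the feature's actual implementation: the class FailureRemovalSketch, the WindowStats type, the regulate method, the ratioThreshold and fullWeight parameters, and the one-step restore to full weight are all hypothetical names and simplifications introduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; none of these names come from the framework.
final class FailureRemovalSketch {

    /** Calls and exceptions counted for one provider IP in one time window. */
    static final class WindowStats {
        final long calls;
        final long exceptions;

        WindowStats(long calls, long exceptions) {
            this.calls = calls;
            this.exceptions = exceptions;
        }

        double exceptionRate() {
            return calls == 0 ? 0.0 : (double) exceptions / calls;
        }
    }

    /**
     * Returns the new weight per IP for one service after one time window.
     * An IP is treated as abnormal when its exception rate exceeds the
     * service-wide average rate multiplied by ratioThreshold (assumed rule).
     */
    static Map<String, Double> regulate(Map<String, WindowStats> statsByIp,
                                        Map<String, Double> weights,
                                        double ratioThreshold,
                                        double weightDegradeRate,
                                        double fullWeight) {
        long totalCalls = 0;
        long totalExceptions = 0;
        for (WindowStats s : statsByIp.values()) {
            totalCalls += s.calls;
            totalExceptions += s.exceptions;
        }
        // Average exception rate of the whole service in this window.
        double serviceRate = totalCalls == 0 ? 0.0 : (double) totalExceptions / totalCalls;

        Map<String, Double> updated = new LinkedHashMap<>();
        for (Map.Entry<String, WindowStats> e : statsByIp.entrySet()) {
            double current = weights.getOrDefault(e.getKey(), fullWeight);
            boolean abnormal = e.getValue().exceptionRate() > serviceRate * ratioThreshold;
            double next;
            if (abnormal) {
                next = current * weightDegradeRate; // degrade the abnormal node
            } else if (current == 0.0) {
                next = 0.0;                         // a weight of 0 is not restored
            } else {
                next = fullWeight;                  // assumed: restore in one step
            }
            updated.put(e.getKey(), next);
        }
        return updated;
    }

    /** Three providers, one of which fails 60% of its calls. */
    static Map<String, Double> demo() {
        Map<String, WindowStats> stats = new LinkedHashMap<>();
        stats.put("10.0.0.1", new WindowStats(100, 1));
        stats.put("10.0.0.2", new WindowStats(100, 2));
        stats.put("10.0.0.3", new WindowStats(100, 60));
        // No prior weights: every IP starts at the assumed full weight of 100.
        return regulate(stats, new LinkedHashMap<>(), 2.0, 0.5, 100.0);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, the service-wide exception rate is 63/300 = 0.21, so with a ratio of 2 only an IP whose rate exceeds 0.42 is degraded. The third provider (rate 0.60) has its weight halved, while the other two keep their full weight. A real implementation would run this off the calling thread, consistent with the asynchronous behavior described above.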
Usage
The following is a configuration example:
FaultToleranceConfig faultToleranceConfig = new FaultToleranceConfig();
faultToleranceConfig.setRegulationEffective(true);
faultToleranceConfig.setDegradeEffective(true);
faultToleranceConfig.setTimeWindow(20);
faultToleranceConfig.setWeightDegradeRate(0.5);
FaultToleranceConfigManager.putAppConfig("appName", faultToleranceConfig);
With this configuration, the target application checks for abnormal nodes in each 20-second time window. If a service and IP address dimension is identified as an abnormal node, its weight is degraded at the configured rate of 0.5. For more information, see Automatic failure removal.