High availability for serverless resource groups

DataWorks Serverless Resource Groups use a multi-zone deployment architecture by default, spanning at least two Availability Zones within the same city. When a zone fails, the system automatically reschedules tasks to other active zones — keeping your data development workloads running without manual intervention.

How it works

Multi-zone deployment is enabled by default for Serverless Resource Groups under both pay-as-you-go and subscription billing. The resources of a group are distributed across at least two Availability Zones in the same city. When a zone becomes unavailable, the failover process automatically reschedules the tasks assigned to that zone onto computing resources in the remaining active zones.

The following diagram illustrates the multi-zone architecture and how tasks are redistributed during a zone failure.

[Figure: multi-zone architecture and task redistribution during a zone failure]
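
The rescheduling behavior described above can be sketched as a small simulation. This is an illustrative model only, not DataWorks code: the `Zone` class, the `fail_over` function, the zone and task names, and the least-loaded placement policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """A hypothetical Availability Zone holding the tasks scheduled to it."""
    name: str
    healthy: bool = True
    tasks: list[str] = field(default_factory=list)


def fail_over(zones: list[Zone], failed_name: str) -> dict[str, str]:
    """Mark one zone as failed and move its tasks to the healthy zones.

    Returns a mapping of task name -> name of the zone it was moved to.
    """
    failed = next(z for z in zones if z.name == failed_name)
    failed.healthy = False
    survivors = [z for z in zones if z.healthy]
    if not survivors:
        raise RuntimeError("no healthy zone left to fail over to")

    moves: dict[str, str] = {}
    for task in failed.tasks:
        # Assumed policy: place each task in the least-loaded surviving zone.
        target = min(survivors, key=lambda z: len(z.tasks))
        target.tasks.append(task)
        moves[task] = target.name
    failed.tasks = []
    return moves


zones = [
    Zone("zone-a", tasks=["etl_orders", "etl_users"]),
    Zone("zone-b", tasks=["sync_logs"]),
]
moves = fail_over(zones, "zone-a")
print(moves)
```

With only two zones, every task from the failed zone lands in the single surviving zone, which is why the remaining capacity matters (see Limitations).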

Core concepts

| Concept | Description |
| --- | --- |
| High availability | Serverless Resource Groups are deployed across multiple Availability Zones. If one zone fails, tasks are automatically rescheduled to run in other zones, ensuring business continuity. |
| Failover | The process by which the system automatically reschedules tasks from a failed zone to other active zones. Failover is triggered when computing resources or services in the original zone become unavailable. |
| Resource availability ratio | The percentage of computing units (CUs) in a resource group that are available for tasks at a given time. A single-zone failure reduces the overall resource pool, which lowers the resource availability ratio. |
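
As a worked example of the resource availability ratio, the sketch below computes it for a resource group before and after a zone failure. The function name `availability_ratio`, the zone names, and the 60/40 split of CUs are hypothetical. The source does not state how CUs are distributed across zones.

```python
def availability_ratio(cus_by_zone: dict[str, int], failed: set[str]) -> float:
    """Return the percentage of CUs still available when `failed` zones are down."""
    total = sum(cus_by_zone.values())
    if total == 0:
        raise ValueError("resource group has no CUs")
    # CUs in failed zones are treated as unavailable (assumption of this model).
    available = sum(cu for zone, cu in cus_by_zone.items() if zone not in failed)
    return 100.0 * available / total


cus = {"zone-a": 60, "zone-b": 40}                 # assumed CU split
normal = availability_ratio(cus, set())            # all zones healthy
degraded = availability_ratio(cus, {"zone-a"})     # zone-a has failed
print(normal, degraded)
```

In this model the ratio drops from 100% to 40% when `zone-a` fails, so tasks that fit comfortably before the failure may have to wait afterwards.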

Limitations

High availability ensures task scheduling continuity — not unlimited resources or a 100% task success rate.

During a single-zone failure, the following situations may occur:

  • Reduced resource availability. When a zone fails, the overall CU pool of the resource group shrinks. Tasks may queue while waiting for available resources in the remaining zones.

  • Task failure and retry. Tasks running in the failed zone will fail. The system attempts to reschedule them in other zones through failover. For this to succeed, tasks must be rerunnable and must have an automatic retry policy configured.

  • External dependency requirements. If a task depends on an external system — such as a database or Message Queue (MQ) service — that system must also support high availability. A DataWorks failover cannot recover a task that cannot reach an unavailable external dependency.
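
The queueing effect in the first item can be illustrated with a toy admission model. The strict first-in, first-out policy, the `admit_fifo` function, and the CU figures are assumptions for illustration, not a description of the actual DataWorks scheduler.

```python
def admit_fifo(
    tasks: list[tuple[str, int]], capacity_cu: int
) -> tuple[list[str], list[str]]:
    """Split tasks into (running, queued) given the CUs currently available."""
    running: list[str] = []
    queued: list[str] = []
    free_cu = capacity_cu
    for name, needed_cu in tasks:
        # Strict FIFO: once one task has to wait, later tasks wait behind it.
        if not queued and needed_cu <= free_cu:
            running.append(name)
            free_cu -= needed_cu
        else:
            queued.append(name)
    return running, queued


tasks = [("etl_orders", 4), ("etl_users", 4), ("sync_logs", 4)]
healthy = admit_fifo(tasks, capacity_cu=12)  # both zones available
reduced = admit_fifo(tasks, capacity_cu=6)   # one zone lost, half the CUs remain
print(healthy)
print(reduced)
```

With the full pool all three tasks run at once. With half the pool only the first task runs and the other two queue until CUs are released.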

Unsupported scenarios

The following DataWorks resource group use cases do not support high availability by default:

| Scenario | High availability support |
| --- | --- |
| Personal developer environments | Not supported by default |
| DataService Studio | Not supported by default |
| Model Service | Not supported by default |

Configure your environment for effective failover

High availability is a shared responsibility: DataWorks handles zone-level redundancy automatically, but your workloads must be configured to take full advantage of it. Complete the following steps to ensure your tasks recover reliably from a zone failure.

  1. Configure an automatic retry policy for tasks. When a zone fails, the task instances running in it fail. If a task is rerunnable and has an automatic retry policy, the system restarts it in another zone. Without a retry policy, those task instances remain failed. Configure automatic retry in your task settings.

  2. Ensure external dependencies support high availability. If your tasks depend on external systems — such as databases or Message Queue (MQ) services — those systems must also be configured for high availability. A DataWorks failover cannot recover a task that cannot reach an unavailable external dependency.
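
To show why step 1 depends on tasks being both retried and rerunnable, here is a minimal sketch of retry semantics. In DataWorks, retry is configured in the task settings rather than written in code. The `run_with_retry` helper, the `load_partition` task, and the simulated first-attempt failure are hypothetical stand-ins.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(task: Callable[[], T], max_retries: int = 3,
                   interval_s: float = 0.0) -> T:
    """Run `task`, retrying up to `max_retries` times after a failure."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    attempts = max_retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted: the instance stays failed
            time.sleep(interval_s)
    raise AssertionError("unreachable")


calls = {"n": 0}
output: dict[str, int] = {}


def load_partition() -> str:
    """A rerunnable task: it overwrites its partition instead of appending."""
    calls["n"] += 1
    if calls["n"] == 1:
        # Simulated failure of the first attempt, standing in for a zone outage.
        raise ConnectionError("zone unavailable")
    output["dt=2024-01-01"] = 42  # overwrite, so a rerun cannot duplicate data
    return "ok"


result = run_with_retry(load_partition, max_retries=2)
print(result, calls["n"], output)
```

Because the task overwrites its output, running it a second time after the failure leaves exactly one copy of the result. With `max_retries=0`, the first failure would propagate and the instance would not recover.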