Reliably process distributed multi-step transactions

This topic describes how to use Serverless workflow to implement long-running distributed transactions, allowing you to focus on your business logic.

Overview

Complex business applications, such as e-commerce, hotel, or flight booking systems, often involve multiple remote services and require strong transactional semantics. This means all steps in a process must either complete successfully or fail together, leaving no intermediate states. In applications with low traffic and centralized data storage, the atomicity, consistency, isolation, and durability (ACID) properties of a relational database can fulfill this requirement. However, to achieve high availability and scalability for high-traffic scenarios, businesses often adopt a distributed microservices architecture. In such an architecture, ensuring transactional integrity typically requires complex solutions involving message queues and databases to persist messages and process state. This adds significant development and operational overhead. Serverless workflow simplifies this by providing built-in support for long-running distributed transactions.

Scenarios

Assume an application allows users to book train tickets, flights, and hotels, and requires these three operations to be transactional. This feature requires three remote calls (for example, booking a train ticket requires calling the 12306 API). If all three calls succeed, the order is successful. In practice, any remote call can fail. Therefore, the application must implement compensation logic to roll back completed operations for different failure scenarios, as shown in the following figure:

  • If booking the train ticket (BuyTrainTicket) succeeds but reserving the flight (ReserveFlight) fails, the system must cancel the train ticket (CancelTrainTicket) and notify the user that the order has failed.
  • If booking the train ticket (BuyTrainTicket) and reserving the flight (ReserveFlight) both succeed but booking the hotel (ReserveHotel) fails, the system must cancel the flight (CancelFlight) and the train ticket (CancelTrainTicket), and then notify the user that the order has failed.
[Figure: longtxn-saga_train_flight_hotel]
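
The compensation rules above follow the saga pattern: each completed operation registers an undo action, and a failure runs the registered undo actions in reverse order. The following is a minimal in-process sketch of that logic, the kind of code you would otherwise write and make durable yourself. The `run_saga` function and the `failing` parameter are hypothetical names used only for this simulation, and no remote calls are made.

```python
def run_saga(steps, failing=None):
    """Run steps in order; on failure, undo completed steps in reverse order.

    steps:   list of (action_name, compensation_name) pairs; the compensation
             may be None when there is nothing to undo.
    failing: name of the action that should fail (simulation only).
    Returns (succeeded, trace), where trace lists the names in execution order.
    """
    trace = []
    compensations = []  # undo actions of the operations completed so far
    for action, compensation in steps:
        trace.append(action)
        if action == failing:
            # Roll back: the most recently completed operation is undone first.
            trace.extend(reversed(compensations))
            trace.append("OrderFailed")
            return False, trace
        if compensation is not None:
            compensations.append(compensation)
    trace.append("OrderSucceeded")
    return True, trace


BOOKING = [
    ("BuyTrainTicket", "CancelTrainTicket"),
    ("ReserveFlight", "CancelFlight"),
    ("ReserveHotel", None),  # last step: nothing runs after it, so no undo is needed
]

succeeded, trace = run_saga(BOOKING, failing="ReserveHotel")
print(succeeded, trace)
```

This sketch keeps its state in memory, so a crash between two steps would lose track of which compensations are still owed. That gap is what a persisted workflow closes.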

Implementation with Serverless workflow

The following example demonstrates how to orchestrate Function Compute (FC) functions into a Serverless workflow to implement a reliable, multi-step, long-running process. The example consists of three steps:

  1. Create Function Compute functions
  2. Create a flow
  3. Execute and view results

Step 1: Create Function Compute functions

This step simulates the three operations in the use case: booking a train ticket, reserving a flight, and booking a hotel.

Create the following Python 2.7 function. For detailed instructions, see the Function Compute documentation. We recommend the following names:
  • Service: fnf-demo
  • Function: Operation

The Operation function simulates each operation, such as reserving a flight or hotel. It determines whether the operation succeeds or fails based on its input.

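import json
import logging
import uuid


def handler(event, context):
  # The event arrives as a JSON string, for example:
  # {"operation": "reserve_hotel", "reserve_hotel": "fail"}
  evt = json.loads(event)
  logger = logging.getLogger()
  txn_id = uuid.uuid4()
  op = evt.get('operation', 'operation')
  # The field named after the operation carries the simulated result.
  # Treat both False and the string "fail" (used in the sample input) as a failure.
  if evt.get(op) in (False, 'fail'):
    logger.info("%s failed" % op)
    # Exiting the process makes Function Compute report an FC.Unknown error.
    exit()
  logger.info("%s succeeded, id %s" % (op, txn_id))
  # Return the result and a transaction ID that compensation steps can use.
  return json.dumps({op: 'success', '%s_txnID' % op: str(txn_id)})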

Step 2: Create a flow

Use the Serverless workflow console to create the following flow.

  1. Configure the flow's RAM role. The role's trust policy must allow Serverless workflow (fnf.aliyuncs.com) to assume the role:
    {
        "Statement": [
            {
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "fnf.aliyuncs.com"
                    ]
                }
            }
        ],
        "Version": "1"
    }                               
  2. Define the flow. In each resourceArn, replace {region} and {accountID} with your own region and account ID.
    version: v1
    type: flow
    steps:
      - type: task
        resourceArn: acs:fc:{region}:{accountID}:services/fnf-demo/functions/Operation
        name: BuyTrainTicket
        inputMappings:
        - target: operation
          source: buy_train_ticket
        - target: buy_train_ticket
          source: $input.buy_train_ticket_result
        catch: 
        - errors:
          - FC.Unknown
          goto: OrderFailed
      - type: task
        resourceArn: acs:fc:{region}:{accountID}:services/fnf-demo/functions/Operation
        name: ReserveFlight
        inputMappings:
        - target: operation
          source: reserve_flight
        - target: reserve_flight
          source: $input.reserve_flight_result
        catch:  # If the ReserveFlight task fails with an FC.Unknown error, jump to the CancelTrainTicket step.
        - errors:
          - FC.Unknown
          goto: CancelTrainTicket
      - type: task
        resourceArn: acs:fc:{region}:{accountID}:services/fnf-demo/functions/Operation
        name: ReserveHotel
        inputMappings:
        - target: operation
          source: reserve_hotel
        - target: reserve_hotel
          source: $input.reserve_hotel_result
        retry:  # Retry up to 3 times with exponential backoff for FC.Unknown errors. The initial interval is 1s, and subsequent intervals are doubled.
        - errors:
          - FC.Unknown
          intervalSeconds: 1
          maxAttempts: 3
          multiplier: 2
        catch:  # If the ReserveHotel task fails with an FC.Unknown error after all retries, jump to the CancelFlight step.
          - errors:
            - FC.Unknown
            goto: CancelFlight
      - type: succeed
        name: OrderSucceeded
      - type: task
        resourceArn: acs:fc:{region}:{accountID}:services/fnf-demo/functions/Operation
        name: CancelFlight
        inputMappings:
        - target: operation
          source: cancel_flight
        - target: reserve_flight_txnID
          source: $local.reserve_flight_txnID
      - type: task
        resourceArn: acs:fc:{region}:{accountID}:services/fnf-demo/functions/Operation
        name: CancelTrainTicket
        inputMappings:
        - target: operation
          source: cancel_train_ticket
        - target: buy_train_ticket_txnID
          source: $local.buy_train_ticket_txnID
      - type: fail
        name: OrderFailed                              
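
The order of the steps in the flow definition matters. The compensation steps are placed after OrderSucceeded so that a catch can jump into the middle of the compensation chain and let execution continue downward to OrderFailed. The following sketch traces that control flow under a simplified model: steps run from top to bottom, a caught error jumps to the step named in goto, and a succeed or fail step ends the flow. The `FLOW` table and `trace_flow` function are illustrative assumptions, not part of Serverless workflow, and retries are not modeled.

```python
# Simplified mirror of the flow definition: name, type, and the catch target.
FLOW = [
    {"name": "BuyTrainTicket", "type": "task", "catch": "OrderFailed"},
    {"name": "ReserveFlight", "type": "task", "catch": "CancelTrainTicket"},
    {"name": "ReserveHotel", "type": "task", "catch": "CancelFlight"},
    {"name": "OrderSucceeded", "type": "succeed"},
    {"name": "CancelFlight", "type": "task"},
    {"name": "CancelTrainTicket", "type": "task"},
    {"name": "OrderFailed", "type": "fail"},
]


def trace_flow(flow, failing=None):
    """Return the names of the steps visited when the step `failing` fails."""
    index = {step["name"]: i for i, step in enumerate(flow)}
    visited = []
    i = 0
    while i < len(flow):
        step = flow[i]
        visited.append(step["name"])
        if step["type"] in ("succeed", "fail"):
            break  # terminal step: the flow ends here
        if step["name"] == failing:
            target = step.get("catch")
            if target is None:
                break  # uncaught error: the execution stops
            i = index[target]  # caught error: jump to the goto target
        else:
            i += 1  # normal completion: continue with the next step
    return visited


print(trace_flow(FLOW, failing="ReserveHotel"))
```

With `failing="ReserveFlight"`, the trace enters the chain one step later, at CancelTrainTicket, which matches the first failure case in the scenario.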

Step 3: Execute and view results

In the console, start a new execution for the flow that you created. The StartExecution API requires input in JSON format. The following JSON object can be used to simulate the success or failure of each step. For example, "reserve_hotel_result":"fail" simulates a failure in the hotel booking step. The StartExecution API is asynchronous. When called, Serverless workflow returns an execution name that you can use to query the execution's status.

{
  "buy_train_ticket_result":"success",
  "reserve_flight_result":"success",
  "reserve_hotel_result":"fail"
}                       
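
To try the other failure paths, only the value of the corresponding field needs to change. As a convenience, the input can be generated with a small helper such as the one sketched below. The `build_execution_input` function is a hypothetical name, and the sketch only builds the JSON string. It does not call the StartExecution API.

```python
import json

# Operation names used by the inputMappings in the flow definition.
STEPS = ("buy_train_ticket", "reserve_flight", "reserve_hotel")


def build_execution_input(failing_step=None):
    """Build the execution input, marking at most one step as failing."""
    if failing_step is not None and failing_step not in STEPS:
        raise ValueError("unknown step: %s" % failing_step)
    payload = {}
    for step in STEPS:
        payload[step + "_result"] = "fail" if step == failing_step else "success"
    return json.dumps(payload, sort_keys=True)


# Input that makes the hotel booking step fail.
print(build_execution_input("reserve_hotel"))
```

The printed string can be pasted into the console's execution input field.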

After the execution starts, you can view its progress and results in the Serverless workflow console. Because the input sets "reserve_hotel_result":"fail", the ReserveHotel function call fails. The step details tab shows that Serverless workflow then follows the flow definition and cancels the flight (CancelFlight) and the train ticket (CancelTrainTicket) in sequence. Serverless workflow persists the state of each step transition, so network interruptions or process crashes do not affect the transactional integrity of the flow.

The step details panel displays the error details for the failed step: the error is FC.Unknown, the cause is Process exited unexpectedly, and the retry count is 3.

A flow execution generates execution history events. You can view these events in the console, or query them by calling the GetExecutionHistory API through the SDK or CLI.

The lifecycle of each step generates events such as StepEntered, TaskScheduled, TaskStarted, TaskSucceeded, and StepExited. In the Execution History tab, you can view each event's ID, type, step name, timestamp, and relative duration.

Error handling and retries

  1. In the preceding example, remote calls such as reserving a flight and booking a hotel can fail due to network or service errors. Adding retries for transient errors can increase the order success rate. Serverless workflow provides a built-in retry feature for task type steps. For example, the ReserveHotel step is configured to use exponential backoff for FC.Unknown errors. If the ReserveHotel step still fails after reaching the maximum number of retries, the catch definition ensures that the FC.Unknown error is caught and the execution jumps to the CancelFlight step to run the defined compensation logic.
      - type: task
        resourceArn: acs:fc:{region}:{accountID}:services/fnf-demo/functions/Operation
        name: ReserveHotel
        inputMappings:
        - target: operation
          source: reserve_hotel
        retry:  # Retry up to 3 times with exponential backoff for FC.Unknown errors. The initial interval is 1s, and subsequent intervals are doubled.
        - errors:
          - FC.Unknown
          intervalSeconds: 1
          maxAttempts: 3
          multiplier: 2
        catch: # If the ReserveHotel task fails with an FC.Unknown error after all retries, jump to the CancelFlight step.
          - errors:
            - FC.Unknown
            goto: CancelFlight           
  2. From the execution history, you can see that after adding retries, the ReserveHotel task was executed multiple times, up to the maximum retry count. Each retry attempt triggers a sequence of three events: TaskScheduled, TaskStarted, and TaskFailed. The execution history shows this cycle repeating, with each TaskScheduled event marking a new retry.
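
The retry settings translate into concrete wait times. The sketch below computes them, assuming that maxAttempts counts retries (not the initial attempt) and that the wait before the n-th retry is intervalSeconds multiplied by multiplier n-1 times, as the comment in the flow definition describes. The `retry_intervals` function is a hypothetical helper, and the actual scheduling in the service may differ slightly.

```python
def retry_intervals(interval_seconds, multiplier, max_attempts):
    """Return the wait time in seconds before each retry."""
    return [interval_seconds * multiplier ** n for n in range(max_attempts)]


# Settings of the ReserveHotel step: intervalSeconds 1, multiplier 2, maxAttempts 3.
waits = retry_intervals(1, 2, 3)
print(waits, sum(waits))
```

Under these assumptions the step waits 1, 2, and 4 seconds before its three retries, so a persistent failure adds about 7 seconds of waiting before the catch rule jumps to CancelFlight.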

Data transfer between steps

  1. If the hotel reservation fails, the workflow must cancel the flight and train ticket. These compensation actions require the transaction IDs (txnID) returned by the corresponding ReserveFlight and BuyTrainTicket steps. The following inputMappings object shows how to pass the output from a previous step as input to the CancelFlight step.
      - type: task
        resourceArn: acs:fc:{region}:{accountID}:services/fnf-demo/functions/Operation
        name: CancelFlight
        inputMappings:
        - target: operation
          source: cancel_flight
        - target: reserve_flight_txnID
          source: $local.reserve_flight_txnID
                        
  2. The output from each completed step is stored in the local object within the EventDetail of the corresponding StepExited event.
      {  
         "input":{
            "operation":"reserve_hotel",
            "reserve_hotel_result":"fail"
         },
         "local":{
            "buy_train_ticket":"success",
            "buy_train_ticket_txnID":"d37412b3-bb68-4d04-9d90-c8c15643d45e",
            "reserve_flight_result":"success",
            "reserve_flight_txnID":"024caecf-cfa3-43a6-b561-9b6fe0571b55"
         },
         "resourceArn":"acs:fc:{region}:{accountID}:services/fnf-demo/functions/Operation",
         "cause":"{\"errorMessage\":\"Process exited unexpectedly before completing request (duration: 12ms, maxMemoryUsage: 9.18MB)\"}",
         "error":"FC.Unknown",
         "retryCount":3,
         "goto":"CancelFlight"
      }         
  3. After the mapping defined in inputMappings is applied to the data in the EventDetail, the input for the CancelFlight step becomes the following JSON object. This ensures the CancelFlight function receives the reserve_flight_txnID field.
      "input":{
        "operation":"cancel_flight",
        "reserve_flight_txnID":"024caecf-cfa3-43a6-b561-9b6fe0571b55"
      }
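
The mapping in the preceding steps can be modeled in a few lines. The sketch below is a simplified illustration, not the service's implementation. It assumes that a source of the form `$input.key` or `$local.key` reads a top-level field from the corresponding object and that any other source is a constant. The `apply_input_mappings` function is a hypothetical name, and richer path expressions are not handled.

```python
def apply_input_mappings(mappings, event_input, local):
    """Build a step's input from its inputMappings (simplified model)."""
    scopes = {"$input": event_input, "$local": local}
    result = {}
    for mapping in mappings:
        source = mapping["source"]
        if isinstance(source, str) and source.startswith("$"):
            # Reference such as "$local.reserve_flight_txnID".
            scope, _, key = source.partition(".")
            result[mapping["target"]] = scopes[scope].get(key)
        else:
            # Anything else is passed through as a constant.
            result[mapping["target"]] = source
    return result


cancel_flight_mappings = [
    {"target": "operation", "source": "cancel_flight"},
    {"target": "reserve_flight_txnID", "source": "$local.reserve_flight_txnID"},
]
local = {
    "buy_train_ticket_txnID": "d37412b3-bb68-4d04-9d90-c8c15643d45e",
    "reserve_flight_txnID": "024caecf-cfa3-43a6-b561-9b6fe0571b55",
}
print(apply_input_mappings(cancel_flight_mappings, {}, local))
```

Only the fields named as targets reach the function, which is why each compensation step must map the transaction ID it needs explicitly.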