Create a data backfill task


A data backfill task refreshes historical data. Two scheduling methods are available: for backfills that recur on a regular schedule, use a recurring (Recurrency Triggered) schedule; for nodes that need frequent backfills but whose timing and data timestamps cannot be predicted, create a backfill task that you run manually (Manual Run).

Procedure

  1. In the top navigation bar of the Dataphin homepage, choose Development > Task Operations.

  2. In the left navigation bar, choose Task Operations > Data Backfill Flow.

  3. In the top menu bar, select the production or development environment.

  4. On the Data Backfill Flow page, click +Create Data Backfill Task.

  5. In the Create Data Backfill Task dialog box, configure the data backfill parameters.

    Parameter

    Description

    Basic Information

    Data Backfill Task Name

    Enter a name for the data backfill task. The name can be up to 128 characters long.

    Data Backfill Task Owner

    Select the owner of the data backfill task.

    Project Of Data Backfill

    Select the project for the data backfill task. Only projects for which you have the O&M-Access Directory permission are available.

    Data Backfill Range

    Start Node

    Select the start node for the data backfill range.

    Downstream Task Selection

    Note

    If the start task is a logical table, the display range of downstream tasks depends on the selected logical table fields that need data backfill.

    • List Mode: Applicable to downstream tasks of all levels, up to 2000 tasks. Task dependencies can be quickly selected from 1 to 10 levels or all levels.

      Filter Paused Tasks And Their Downstream:

      • Selected by default. When selected, the list hides nodes whose scheduling mode is set to skip execution (paused nodes) within the specified levels and filter conditions, together with all of their downstream nodes. Paused tasks that were already selected are deselected.

      • A logical table is filtered out if it contains any paused field. The downstream tasks of every field in such a logical table are filtered out as well.

        Note

        Downstream logical table fields can only be selected as a whole for data backfill. You cannot filter out only the paused fields.

    • Mass Mode: If list mode cannot meet your requirements for selecting downstream nodes (for example, there are too many nodes or you need to batch-select certain nodes), use mass mode. Mass mode searches for tasks within the selected range from the current node downward based on filter conditions and orchestrates them by dependency. It is suitable for global data backfill scenarios. Mass mode also supports the following filter parameters:

      • Coverage Range: Specify the range by using Specify Project, Specify Node Output Name, All Downstream Of Current Node, Specify First-level Child Nodes And All Their Downstream, Specify End Point, or Specify Node Name.

        • Specify Project: Specify the data backfill range by selecting a project.

        • Specify Node Output Name: Specify the data backfill range by entering node output names. When entering multiple names, use line breaks to separate them. You can enter up to 1000 names.

        • All Downstream Of Current Node: Backfill data for all downstream nodes of the current node.

        • Specify First-level Child Nodes And All Their Downstream: Backfill data for several first-level child nodes of the current node and all their downstream nodes.

        • Specify End Point: Backfill data for all nodes on the link from the start point to the end point. The start point defaults to the current node and cannot be modified. You can select multiple end point nodes.

        • Specify Node Name: Specify the data backfill range by entering node names. Separate multiple nodes with line breaks. You can enter up to 5000 characters. When a node name corresponds to multiple tasks, you can click Select Data Backfill Node in the prompt message. In the Nodes With Duplicate Node Names dialog box, select the corresponding node to confirm which node needs data backfill.

          Note

          The following applies to Specify End Point:

          • If a selected end point node is not downstream of the start point, only the two isolated nodes (the start point and the end point) are backfilled.

          • For logical table tasks, an end point can only be the entire table (all fields).

      • Exclude Within Selected Range: Specify the Node Output Names or Node Names to exclude from the coverage range. Exclude Paused Nodes And Their Downstream is selected by default and works the same way as Filter Paused Tasks And Their Downstream in list mode.

        Note

        • Excluding tasks within the selected range can leave isolated task nodes on the DAG of the data backfill instance.

        • Exclusion is suitable when only a specific downstream task node needs a data backfill.

      • Selected Node List: In mass mode, click View Selected Node List to confirm the nodes to backfill, or click Export Selected Node List to export the list as a local CSV file.
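The coverage and exclusion options above amount to graph selections on the task DAG. The following Python sketch illustrates three of them under stated assumptions: the DAG is a plain dictionary of hypothetical node names, and the functions `downstream`, `nodes_between`, and `exclude_paused` are illustrative helpers, not part of any Dataphin API.

```python
from collections import deque

def downstream(dag, start):
    """Return start plus every node reachable from it (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in dag.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def nodes_between(dag, start, end):
    """Specify End Point: nodes on any link from start to end.

    If end is not downstream of start, only the two isolated nodes remain.
    """
    # Build the reversed graph so the same traversal finds upstream nodes.
    reverse = {}
    for parent, children in dag.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    on_path = downstream(dag, start) & downstream(reverse, end)
    return on_path if on_path else {start, end}

def exclude_paused(dag, selected, paused):
    """Drop paused nodes and everything downstream of them."""
    blocked = set()
    for node in paused & selected:
        blocked |= downstream(dag, node)
    return selected - blocked

# Hypothetical DAG: a -> b -> d -> e, a -> c -> d, and an unrelated node x.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "x": []}

all_downstream = downstream(dag, "a")          # All Downstream Of Current Node
to_end_point = nodes_between(dag, "a", "d")    # Specify End Point
isolated = nodes_between(dag, "a", "x")        # end point not downstream
without_paused = exclude_paused(dag, all_downstream, {"c"})
```

In this sketch, pausing `c` also removes `d` and `e`, because they are downstream of a paused node even though they are also reachable through `b`.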

    Run Configuration

    Schedule Type

    Select Recurrency Triggered or Manual Run.

    • Recurrency Triggered: The system generates a data backfill instance before 11 PM on the day before each scheduled running time, then schedules and runs it. This type requires you to configure Scheduled Running Time and Data Backfill Data Timestamp.

      • Scheduled Running Time: Supports Daily, Weekly, and Monthly.

        Note

        Monthly scheduled running time supports selecting Last Day Of Month (the last day of each month) to run.

      • Scheduling Time Zone: Displays the configured scheduling time zone, which cannot be modified.

      • Data Backfill Data Timestamp: Supports Last N Days, Last N Weeks (Sunday To Monday), Last N Months (First Day To Last Day), or Custom data backfill data timestamp.

      • Preview Recent Running Time And Data Backfill Data Timestamp: Based on the scheduled running time and data backfill data timestamp configured above, preview the task running time and corresponding data backfill data timestamp in list form (only 5 groups are displayed).

    • Manual Run: Manually generate and run data backfill instances.
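As a rough illustration of what the preview list contains, the following Python sketch pairs monthly run dates on the last day of the month with a Last N Days data timestamp range. It assumes that Last N Days means the N days ending on the day before the run; the source does not state this, and the function names and the year 2024 are hypothetical.

```python
import calendar
from datetime import date, timedelta

def last_day_of_month(year, month):
    """Return the last calendar day of the given month."""
    return date(year, month, calendar.monthrange(year, month)[1])

def preview(start, runs=5, last_n_days=3):
    """List (run_day, timestamp_start, timestamp_end) for upcoming runs.

    Assumption: the data timestamp range ends on the day before the run.
    """
    rows = []
    year, month = start.year, start.month
    while len(rows) < runs:
        run_day = last_day_of_month(year, month)
        if run_day >= start:
            ts_end = run_day - timedelta(days=1)
            ts_start = ts_end - timedelta(days=last_n_days - 1)
            rows.append((run_day, ts_start, ts_end))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return rows

# Five upcoming runs, matching the five groups shown in the preview.
for run_day, ts_start, ts_end in preview(date(2024, 1, 15)):
    print(run_day, ts_start, ts_end)
```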

    Concurrent Running Groups

    Concurrent running groups control how many backfill processes run simultaneously. You can select from 1 to 12 concurrent groups.

    • If the span of the data timestamp is less than the number of concurrent running groups, the actual number of parallel groups will be the number of days in the data timestamp.

    • If the span of the data timestamp is greater than the number of concurrent running groups, there may be both serial and parallel execution. Instances within the same group run in the order of data timestamps, while instances in different groups run in parallel. For example, if the data timestamp is January 11 to January 13 and the number of concurrent running groups is 2, January 11 and January 12 form one group, and January 13 forms another group. The instances for January 11 and January 13 start running simultaneously, while the instance for January 12 starts running after the instance for January 11 completes.

      Note

      Concurrent running is not supported when there are cross-cycle dependencies among the selected nodes.
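The grouping rule can be sketched in Python as follows. The source gives only the three-day example, so the contiguous, near-even split used here is an assumption that reproduces that example; the function name `split_into_groups` and the year 2024 are hypothetical.

```python
from datetime import date, timedelta

def split_into_groups(start, end, max_groups):
    """Split the data timestamps start..end into contiguous groups.

    Instances inside a group run serially in timestamp order; groups
    run in parallel. The group count never exceeds the number of days.
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    n_groups = min(max_groups, len(days))
    base, extra = divmod(len(days), n_groups)
    groups, idx = [], 0
    for g in range(n_groups):
        # Earlier groups absorb the remainder, one extra day each.
        size = base + (1 if g < extra else 0)
        groups.append(days[idx:idx + size])
        idx += size
    return groups

# January 11 to January 13 with 2 concurrent running groups.
groups = split_into_groups(date(2024, 1, 11), date(2024, 1, 13), 2)
```

Here `groups[0]` holds January 11 and January 12 and `groups[1]` holds January 13, so January 11 and January 13 start together and January 12 waits for January 11.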

    Data Backfill Order

    Choose whether to backfill data in ascending or descending order of data timestamp.

    Note

    Descending order by data timestamp is not supported when there are cross-cycle dependencies among the selected nodes.

    Specify Temporary Scheduling Resource Group

    You can specify a temporary resource group for this backfill to meet temporary resource needs. For more information, see Overview of custom scheduling resource groups. If no temporary resource group is specified, the task scheduling resource group configured for each task is used.

    Note

    Only resource groups whose application scenarios include Batch Operations can be selected.

    Dry Run This Node

    Specify whether to dry-run this task:

    • Yes: The backfill instance runs in dry-run mode. Once scheduled, it returns success immediately without executing the task.

      Note

      Suitable when the current node does not need a backfill but you need to select its downstream nodes for backfill.

    • No: The node runs normally.

    Instances Corresponding To Paused Scheduling Tasks

    Configure the running status of backfill instances generated by paused scheduling tasks:

    • Pause Running (May Block Data Backfill Process): All backfill instances generated by paused scheduling tasks are paused, which blocks downstream instances from running.

      Note

      Suitable when neither the current task nor its downstream tasks need to run.

    • Dry Run: Backfill instances generated by the selected paused tasks succeed directly in dry-run mode.

      Note

      Suitable when the current task does not need to run but downstream tasks need to run normally.

    • Normal Run: All backfill instances generated by paused tasks run normally.

      Note

      Suitable when the current node is set to skip execution but needs to run for the selected data timestamp of the backfill.
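The effect of the three options on a dependency chain can be sketched as follows. This is a simplified Python model under stated assumptions: the tasks form a single linear chain, and the function name `simulate`, the policy strings, and the status labels are hypothetical, not Dataphin identifiers.

```python
def simulate(chain, paused, policy):
    """Return the status of each backfill instance in a linear chain.

    policy is one of "pause", "dry_run", or "normal" (illustrative names).
    """
    result = {}
    blocked = False
    for task in chain:
        if blocked:
            # An upstream paused instance prevents this one from running.
            result[task] = "blocked"
        elif task in paused and policy == "pause":
            result[task] = "paused"
            blocked = True
        elif task in paused and policy == "dry_run":
            # Succeeds immediately without executing the task.
            result[task] = "dry-run success"
        else:
            result[task] = "ran"
    return result

chain = ["a", "b", "c"]   # a -> b -> c, where b is a paused scheduling task
paused_run = simulate(chain, {"b"}, "pause")
dry_run = simulate(chain, {"b"}, "dry_run")
normal_run = simulate(chain, {"b"}, "normal")
```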

    Instances Corresponding To Dry-Run Scheduling Tasks

    Configure the running status of backfill instances generated by dry-run scheduling tasks:

    • Dry Run: Backfill instances generated by the selected dry-run scheduling tasks succeed directly in dry-run mode.

    • Normal Run: All backfill instances generated by dry-run tasks run normally.

    Hourly Interval Impact Range

    For hourly or minute-level tasks, configure the effective range:

    • Do Not Affect Daily/Weekly/Monthly Scheduling Tasks (Run When Selected): Downstream daily, weekly, and monthly tasks are not affected by the selected hourly interval. All selected tasks run.

    • Daily/Weekly/Monthly Scheduling Tasks Only Run When Their Scheduled Running Time Is Within The Selected Hourly Interval: Downstream daily, weekly, and monthly tasks run only if their scheduled running time falls within the selected hourly interval.
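The difference between the two options can be sketched as a simple time check in Python. The function name `should_run`, the `restrict` flag, and the inclusive interval boundaries are assumptions made for illustration; the source does not say whether the boundaries are inclusive.

```python
from datetime import time

def should_run(scheduled, interval_start, interval_end, restrict):
    """Decide whether a daily/weekly/monthly downstream task runs.

    restrict=False models the first option (the interval is ignored);
    restrict=True models the second option (run only inside the interval).
    """
    if not restrict:
        return True
    return interval_start <= scheduled <= interval_end

# Hypothetical hourly interval 00:00 to 05:59.
start, end = time(0, 0), time(5, 59)
inside = should_run(time(2, 0), start, end, restrict=True)
outside = should_run(time(8, 0), start, end, restrict=True)
unaffected = should_run(time(8, 0), start, end, restrict=False)
```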

  6. Click OK to complete the creation of the data backfill task.

What to do next

After you create a data backfill task, you can manage it based on its scheduling type, such as manually running the task, deleting it, or changing the owner. For more information, see: