Create offline datasets using form-based processing

Dataphin labels are based on an offline computing engine. Form-based processing lets you define dataset metrics for offline labels by applying aggregate functions such as count, sum, max, and min to source table fields.

Prerequisites

Before creating an offline dataset, you must create a label project for it. For more information, see Create a label project.

Procedure

On the Dataphin homepage, in the top navigation bar, choose Label > Label Workbench.
In the top navigation bar, select a project.
In the left-side navigation pane, choose Data Preparation > Offline Dataset.
On the Offline Dataset page, click Add Dataset. In the Add Offline Dataset dialog box, select form-based processing.

On the Create Table Mapping page, configure the Basic Information, Processing Logic, and O&M Configuration for the dataset.

Basic Information

Parameter	Description
Dataset name	Enter a name for the dataset. The name must be 64 characters or less and can contain Chinese characters, letters, digits, and underscores (_).
Dataset code	The dataset code is a unique identifier that helps distinguish between datasets with the same name. The code must be 64 characters or less, start with a letter, and contain only lowercase letters, digits, and underscores (_).
Dataset update method	Supported methods are Periodic Update and Manual Update. Periodic Update: The dataset is updated automatically at a scheduled interval. Manual Update: Updates must be triggered manually.
Owner	Select an owner for this offline dataset.
Description	Enter a brief description of the offline dataset, up to 1,000 characters.
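
The naming rules above can be checked before you submit the form. The following is a minimal sketch, not Dataphin's own validation: the function names are hypothetical, the CJK range only approximates "Chinese characters", and the code rule is read as requiring a lowercase first letter because only lowercase letters are allowed.

```python
import re

# Assumed pattern for the dataset code rule: a lowercase letter first, then
# lowercase letters, digits, or underscores, 64 characters at most.
DATASET_CODE_RE = re.compile(r"[a-z][a-z0-9_]{0,63}")

# Assumed pattern for the dataset name rule. The \u4e00-\u9fff range is an
# approximation of "Chinese characters", not Dataphin's actual check.
DATASET_NAME_RE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_]{1,64}")


def is_valid_dataset_code(code: str) -> bool:
    """Return True if the whole string satisfies the assumed code rule."""
    return DATASET_CODE_RE.fullmatch(code) is not None


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the whole string satisfies the assumed name rule."""
    return DATASET_NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_dataset_code("user_orders_30d")` passes, while a code that starts with a digit or contains an uppercase letter does not.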

Processing Logic

Parameter	Description
Project/Data plate	Select the project or data plate that the offline dataset references. The drop-down list includes all projects that are bound to an offline computing source and all data plates in the current tenant. Note: If you have not purchased the Smart R&D edition, you can select only a project.
Logical table/Source table	Select the logical table or source table for which you want to define the dataset.
  - Logical Table: This option is available if you selected Data Plate for Project/Data plate. You can select only logical tables for which you have synchronous read permissions. First choose a logical table type (fact logical table, dimension logical table, or aggregated logical table), then a subject domain, and then the target logical table in that subject domain. You can search for the subject domain and the logical table by keyword. Note: By default, the output of a logical table does not include its associations.
  - Source Table: This option is available if you selected Project for Project/Data plate. You can select only tables that the project's production account has permission to read. If you do not have the required permissions, click Request Permissions to apply for them. Note: Currently, only partitioned tables are supported.
Date partition	Select the partition field from the source table. If the source table is a partitioned table, the system defaults to using a specific field name for the date partition. If a field with this name is not found in the table's partition fields, the system uses the first partition field instead.
Partition field format	Enter a date format or select a predefined format. Available options are yyyymmdd, yyyy-mm-dd, yyyy/mm/dd, and yyyy.mm.dd.
Entity ID-value type	Select the entity ID field from the source table. Only fields of type String or Bigint are supported.
Metric configuration	Click + Add Metric, select the field to process, and configure the aggregate function, time window, metric name, and description. The system determines the value type automatically.
  - Aggregate function: The available aggregate functions depend on the field's data type.
    - Bigint: `count`, `sum`, `max`, and `min`.
    - String: `count`, `max`, and `min`.
  - Time window: Supports Last 1 day, Last 7 days, Last 15 days, Last 30 days, and custom ranges. For custom ranges, you can use a normal calendar or switch to a previously created calendar.
  - Metric name: The name can contain Chinese characters, letters, digits, and underscores (_), and must be 64 characters or less.
  - Value type: After you configure the source field and aggregate function, the system automatically identifies the metric's value type.
  - Configure code: You can configure a code table for fields of type Integer, Decimal(M,0), Boolean, or String. Click the icon to open the Configure Code Table dialog box, and then configure the following parameters.
    - Configure code table: By default, no code table is configured. Select Code Table to assign a code table to the metric.
    - Code table source: Currently, only Manual Configuration is supported.
    - Code table name: Enter a name for the code table. The name can contain Chinese characters, letters, digits, and special characters, and must be 128 characters or less.
    - Code table description: Enter a brief description of the code table, up to 1,000 characters.
    - Code information: You can add up to 500 code pairs by using single or batch input.
      - Single input: Click Add Code Value and enter a Code Value and a Code Name. Both fields are required and must be unique. The code value type must match the metric's value type. To remove a row, click the icon in that row.
      - Batch input: Click Batch Input to open the Batch Input Code Information dialog box, and enter multiple code values and names. Separate entries with line breaks, and separate each code value from its code name with a colon (`:`). Click Click to Identify, and the system parses the input and populates the list.
      - Clear all: Click Clear All to clear the code information list.
    Click OK to save the code configuration.
    Note: When you use batch input, if there are duplicate code values or names, the system highlights the first duplicate entry when you click OK.
  - Description: Enter a brief description of the metric, up to 128 characters.
  - Actions: Click the icon to remove the corresponding metric.
  To add more metrics, click + Add Metric.
Filter conditions	Use filter conditions to filter the source data before aggregation. Supported operators include: Greater than or equal to, Greater than, Less than or equal to, Less than, Is not null, Is null, In, Not in, OR, AND, Later than or equal to, Later than, Earlier than or equal to, and Earlier than. To add multiple filter conditions, click + Add Filter Condition. When multiple conditions exist, you can combine them with OR or AND logic. OR: A row is kept if it meets any one of the conditions. AND: A row is kept only if it meets all of the conditions.
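
To make the processing logic concrete, the following sketch mimics what a single metric computes: rows are restricted to the time window by their date partition, passed through the filter conditions, grouped by entity ID, and then aggregated. It is an illustration under stated assumptions, not Dataphin's implementation: `compute_metric` and its parameters are hypothetical, the window is assumed to end on and include the business date, null values are assumed to be skipped as SQL aggregates usually do, and the mapping from the partition format options to Python format codes is an assumption.

```python
from datetime import date, datetime, timedelta

# Assumed mapping from the partition format options to strptime patterns.
PARTITION_FORMATS = {
    "yyyymmdd": "%Y%m%d",
    "yyyy-mm-dd": "%Y-%m-%d",
    "yyyy/mm/dd": "%Y/%m/%d",
    "yyyy.mm.dd": "%Y.%m.%d",
}

# The four aggregate functions named in the metric configuration.
AGGREGATES = {"count": len, "sum": sum, "max": max, "min": min}


def compute_metric(rows, entity_field, value_field, agg, business_date: date,
                   window_days, partition_field, partition_format="yyyymmdd",
                   row_filter=None):
    """Aggregate value_field per entity over the last window_days days."""
    fmt = PARTITION_FORMATS[partition_format]
    # Assumption: "Last N days" covers N days ending on the business date.
    start = business_date - timedelta(days=window_days - 1)
    grouped = {}
    for row in rows:
        day = datetime.strptime(row[partition_field], fmt).date()
        if not (start <= day <= business_date):
            continue  # partition is outside the time window
        if row_filter is not None and not row_filter(row):
            continue  # row removed by the filter conditions
        value = row[value_field]
        if value is None:
            continue  # assumption: nulls do not take part in the aggregate
        grouped.setdefault(row[entity_field], []).append(value)
    return {entity: AGGREGATES[agg](values)
            for entity, values in grouped.items()}
```

A call such as `compute_metric(rows, "user_id", "amount", "sum", date(2024, 1, 5), 7, "part_date")` would correspond to a `sum` metric on a Bigint field with a Last 7 days window; the field names here are placeholders.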

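The batch input format for code information (one entry per line, code value and code name separated by a colon) can be prepared and checked offline. The sketch below is a hypothetical helper, not the parser behind Click to Identify: it assumes blank lines are ignored, splits on the first colon only, and raises an error at the first repeated value or name, whereas the dialog box highlights the duplicate instead.

```python
def parse_code_pairs(text: str, max_pairs: int = 500):
    """Parse 'code value:code name' lines into a list of (value, name) pairs."""
    pairs = []
    seen_values, seen_names = set(), set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        # Split on the first colon only, so a name may itself contain colons.
        value, sep, name = line.partition(":")
        value, name = value.strip(), name.strip()
        if not sep or not value or not name:
            raise ValueError(f"line {line_no}: expected 'code value:code name'")
        if value in seen_values or name in seen_names:
            raise ValueError(f"line {line_no}: duplicate code value or name")
        seen_values.add(value)
        seen_names.add(name)
        pairs.append((value, name))
    if len(pairs) > max_pairs:
        raise ValueError(f"at most {max_pairs} code pairs are allowed")
    return pairs
```

Values are returned as strings; converting them to match the metric's value type is left out of this sketch.
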
O&M Configuration

Note
This configuration is not required if the dataset update method is set to Manual Update.
1. Scheduling cycle
  - Scheduled update time: Supports scheduling at a specific time of day. The task runs daily at the specified time.
  - Scheduling run plan: Click Preview. The scheduling run plan shows all scheduling instances and their scheduling types for each day in a given month based on the configured scheduling cycle and conditions. You can preview the schedule by Business Date or Run Date (Scheduling Date).
    
    If a day has instances of more than one scheduling type, the calendar color-codes each type and shows the instance count for each. For example, the following figure shows that on the 4th of the month, the task has 44 normal scheduling instances, 2 suspended instances, and 12 dry-run instances.
    
    Hover over a day to view a detailed list of scheduled instances, including the scheduling type, scheduling condition, and condition name.
  - Conditional scheduling: You can set up multiple scheduling conditions. The system evaluates conditions sequentially. Once a condition is met, its schedule is triggered, and no further conditions are evaluated. If no conditions are met, the default scheduling configuration is used. For more information, see Conditional scheduling rules.
    
    Important
    Conditional scheduling takes effect only when the scheduling type is set to Normal Scheduling.
2. Scheduling dependency
  
  A scheduling dependency defines the upstream and downstream relationship between nodes. In Dataphin, a downstream task node runs only after its upstream task nodes complete successfully.
  - Automatic resolution
    
    The system automatically resolves upstream dependency nodes based on task lineage. Data updates depend on the output from these upstream nodes.
    Note
    
    If the result from automatic resolution is incorrect, you can disable the dependency on the affected node.
    
    By default, dependencies are for the current cycle.
  - Add dependency
    
    If Automatic resolution fails to resolve the scheduling dependencies, or the upstream dependencies that it generates do not match your actual requirements, you can manually add upstream dependencies for the node.
    
    Click Add Dependency, choose to add a Physical Node or a Logical Table Node, select one or more target nodes in the dialog box, and then click OK.
    Note
    
    If you have not purchased the Smart R&D edition, you can only add Physical Node dependencies.
    
    If you run automatic resolution after adding a dependency manually, the system overwrites your manual configuration for any node that it also identifies as a dependency.
  - Edit dependency
    
    In the scheduling dependency list, click the icon in the Actions column for an upstream dependency. In the dialog box that appears, you can modify the Dependency Cycle, Dependency Policy, and Dependency Field (this setting is modifiable only for logical table nodes). For more information on dependency configurations, see Configure scheduling dependencies for offline tasks and Scheduling dependency scenarios, rules, and examples.
    
    Click the icon in the Actions column to remove a dependency node.
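
The evaluation order described under Conditional scheduling (conditions are checked in sequence, the first match wins, and the default applies when nothing matches) can be sketched as follows. The `ScheduleCondition` class, the `resolve_schedule` function, and the scheduling type strings are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Sequence


@dataclass
class ScheduleCondition:
    name: str                        # hypothetical condition name
    matches: Callable[[date], bool]  # predicate evaluated on the date
    schedule_type: str               # e.g. "normal", "suspended", "dry-run"


def resolve_schedule(day: date, conditions: Sequence[ScheduleCondition],
                     default: str = "normal") -> str:
    """Return the schedule type of the first matching condition."""
    for condition in conditions:
        if condition.matches(day):
            # First match wins; later conditions are not evaluated.
            return condition.schedule_type
    # No condition matched, so the default scheduling configuration applies.
    return default


conditions = [
    ScheduleCondition("weekend", lambda d: d.weekday() >= 5, "dry-run"),
    ScheduleCondition("month_start", lambda d: d.day == 1, "suspended"),
]
```

Because order matters, a day that satisfies both conditions (a weekend that is also the first of the month) resolves to the first one in the list.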

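Two of the scheduling dependency rules above lend themselves to a short illustration: a node runs only after all of its upstream nodes succeed, and automatic resolution overwrites a manual configuration for any node it also identifies. The sketch below assumes a plain dictionary model; the function names, status strings, and the `cycle` key are hypothetical and do not reflect Dataphin's internal representation.

```python
def can_run(node, upstreams, status):
    """Return True when every upstream of node has completed successfully.

    A node without upstreams is treated as runnable in this sketch.
    """
    return all(status.get(up) == "success" for up in upstreams.get(node, ()))


def merge_dependencies(manual, resolved):
    """Combine manual and automatically resolved upstream dependencies.

    For a node present in both, the resolved entry replaces the manual one;
    manual entries that were not resolved automatically are kept.
    """
    merged = dict(manual)
    merged.update(resolved)
    return merged
```

For instance, a dataset node that depends on two upstream nodes stays blocked while either of them is still running or has failed.
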
Click Save and Publish to create the offline dataset.

Note
After the dataset is saved, you can click Data Preview. This preview helps you verify that your processing logic is correct.

Next steps

After you create and configure the offline dataset, you can create offline labels for it. For more information, see Offline labels.