Create an offline dataset through form processing

The Dataphin Tag module uses the offline computing engine and lets you configure dataset metrics through form processing. Form processing applies operations such as count, sum, maximum, and minimum to source table fields to define dataset metrics that offline tags can use. This topic describes how to create an offline dataset by using form processing.

Prerequisites

Before you create an offline dataset, create the tag project that it will belong to. For more information, see Create a tag project.

Procedure

On the Dataphin home page, choose Tag > Tag Workbench in the top menu bar.
In the top menu bar, select the project.
In the left-side navigation pane, choose Data Preparation > Offline Dataset.
On the Offline Dataset page, click New Dataset. In the New Offline Dataset dialog box, select Form Processing.

On the New Table Mapping configuration page, configure the dataset's Basic Information, Processing Logic, and Operations Configuration.

Basic Information

Parameter	Description
Dataset Name	Enter the name of the dataset. Chinese characters, English letters, digits, and underscores (_) are supported, up to 64 characters.
Dataset Code	The unique identifier of the offline dataset. When several offline datasets share the same name, the code helps you locate a specific one. The code must start with a letter and can contain lowercase English letters, digits, and underscores (_), up to 64 characters.
Dataset Update Method	Supports Periodic Update and Manual Update. Periodic Update: The dataset is updated automatically at regular intervals. Manual Update: The dataset is updated only when you trigger the update manually.
Owner	Select the owner of the offline dataset.
Description	Enter a brief description of the offline dataset, up to 1000 characters.
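
The naming rules for Dataset Name and Dataset Code can be checked before you fill in the form. The following is a minimal sketch, not part of Dataphin: the function names are hypothetical, "Chinese" is approximated by the basic CJK Unified Ideographs block, and the leading letter of the code is assumed to be lowercase.

```python
import re

# Assumption: "Chinese" is approximated by the basic CJK Unified Ideographs block.
DATASET_NAME_RE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_]{1,64}")
# Assumption: the leading letter of the code must also be lowercase.
DATASET_CODE_RE = re.compile(r"[a-z][a-z0-9_]{0,63}")


def is_valid_dataset_name(name: str) -> bool:
    """Chinese characters, English letters, digits, underscores; 1 to 64 characters."""
    return DATASET_NAME_RE.fullmatch(name) is not None


def is_valid_dataset_code(code: str) -> bool:
    """Starts with a letter; lowercase letters, digits, underscores; 1 to 64 characters."""
    return DATASET_CODE_RE.fullmatch(code) is not None
```

For example, `is_valid_dataset_code("order_stats_1d")` passes, while a code that starts with a digit or contains an uppercase letter is rejected.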

Processing Logic

Parameter	Description
Project/Data Section	Select the project or data section that the offline dataset references. The drop-down list includes all projects (bound to offline computing sources) and data sections under the current tenant. Note: If the Intelligent R&D version is not purchased, only Project can be selected.
Logical Table/Source Table	Select the logical table or source table on which the dataset is defined.
  - Logical Table: Available if a data section is selected for Project/Data Section. Only logical tables on which you have read-through permission can be selected. First select the Logical Table Type, then the Subject Area, and finally the target Logical Table from all logical tables in that subject area. Both subject areas and logical tables support keyword search. Logical table types are Logical Fact Table, Logical Dimension Table, and Logical Aggregate Table. Note: The default output method of the logical table does not include associations.
  - Source Table: Available if a project is selected for Project/Data Section. Only tables on which the tenant account has table data permission can be selected. If you do not have permission, click Request Permission to apply.
Date Partition	Select the partition field of the source table. If the selected source table is a partitioned table, the system uses the default field name as the date partition. If the default field name is not in the partition field list of the source table, the system uses the first partition field of the table as the date partition. If the selected source table is a non-partitioned table, you do not need to select a date partition.
Partition Field Format	Enter a date format or select an existing one. Available formats: yyyymmdd, yyyy-mm-dd, yyyy/mm/dd, and yyyy.mm.dd.
Entity Id-value Type	Select the entity ID field in the source table. Only character or long integer field types are supported.
Metric Configuration	Click +add Metric, select the statistical field to process, and then configure the statistical function, time window, metric name, and description. To add more metrics, click +add Metric again.
  - Statistical Function: The supported functions depend on the type of the statistical field.
    - Long Integer: Count, Sum, Max, Min.
    - String: Count, Max, Min.
  - Time Window: Supports the previous 1 day, 7 days, 15 days, 30 days, and custom options. For a custom window, you can use the regular calendar or switch to a calendar that has already been created.
  - Metric Name: Chinese characters, English letters, digits, and underscores (_) are supported, up to 64 characters.
  - Value Type: After you configure the statistical field and statistical function, the system automatically detects the value type of the metric.
  - Configure Code Value: Lookup tables can be configured for fields of the Integer, Decimal(M,0), Boolean, and String types. Open the Configure Lookup Table dialog box and configure the following parameters.
    - Configuration Lookup Table: Not configured by default. Select Reference Tables to configure a lookup table for the metric.
    - Lookup Table Source: Currently, only Manual Configuration is supported.
    - Lookup Table Name: Enter the lookup table name. Chinese characters, English letters, digits, and special characters are supported, up to 128 characters.
    - Lookup Table Description: Enter a brief description of the lookup table, up to 1000 characters.
    - Code Information: Supports single and batch input, with a maximum of 500 groups.
      - Single Input: Click Add Code Value, and enter a Code Value and a Code Name. Both are required and must be unique, and the type of the code value must match the value type of the metric. You can delete the current row.
      - Batch Input: Click Batch Input, and enter code values and code names in the Batch Input Code Information dialog box. Enter one group per line, and separate the code value and the code name with a half-width colon (:). Click Click To Detect, and the system parses the code information in the batch input box and fills it into the code information list.
      - One-click Purge: Click One-click Purge, and the system clears the code information list.
    Click OK to complete the code value configuration.
    Note: When you batch input code information, if code values or code names are duplicated, the system locates the first error row after you click OK.
  - Description: Enter a brief description of the metric, up to 128 characters.
  - Operation: Delete the currently configured metric.
Filter Condition	To filter the statistical field data, add filter conditions. Supported conditions: Greater Than Or Equal To, Greater Than, Less Than Or Equal To, Less Than, Not Empty, Empty, In Range, Not In Range, Or, And, Later Than Or Equal To, Later Than, Earlier Than Or Equal To, Earlier Than. To add more conditions, click +add Filter Condition. Multiple filter conditions can be combined with Or or And. Or: The filter takes effect when any one of the conditions is met. And: The filter takes effect only when all conditions are met.
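
The processing logic configured in this table amounts to a filtered, windowed aggregation per entity. The following is a rough sketch of that idea in plain Python, not Dataphin's actual implementation. Everything named here is an assumption for illustration: the `compute_metric` function, the `ds` partition field, the sample rows, the mapping of the listed partition formats to `strptime` patterns, the reading of "previous N days" as the N days ending at the data date, and the choice to skip empty values.

```python
from datetime import date, datetime, timedelta

# Assumed mapping from the listed partition formats to strptime patterns.
PARTITION_FORMATS = {
    "yyyymmdd": "%Y%m%d",
    "yyyy-mm-dd": "%Y-%m-%d",
    "yyyy/mm/dd": "%Y/%m/%d",
    "yyyy.mm.dd": "%Y.%m.%d",
}

# The statistical functions named in the table (Sum applies to numeric fields only).
STAT_FUNCTIONS = {"count": len, "sum": sum, "max": max, "min": min}

# Hypothetical source rows: one date partition, one entity ID, one statistical field.
SAMPLE_ROWS = [
    {"ds": "20240101", "user_id": "u1", "amount": 10},
    {"ds": "20240105", "user_id": "u1", "amount": 30},
    {"ds": "20240107", "user_id": "u1", "amount": 5},
    {"ds": "20240107", "user_id": "u2", "amount": 50},
]


def compute_metric(rows, entity_field, stat_field, func, window_days, data_date,
                   partition_field="ds", partition_format="yyyymmdd",
                   row_filter=None):
    """Aggregate stat_field per entity over the window_days ending at data_date."""
    fmt = PARTITION_FORMATS[partition_format]
    # Assumption: the window includes data_date and the window_days - 1 days before it.
    start = data_date - timedelta(days=window_days - 1)
    grouped = {}
    for row in rows:
        day = datetime.strptime(row[partition_field], fmt).date()
        if not (start <= day <= data_date):
            continue  # outside the time window
        if row_filter is not None and not row_filter(row):
            continue  # rejected by the filter condition
        value = row[stat_field]
        if value is None:
            continue  # assumption: empty values are ignored by every function
        grouped.setdefault(row[entity_field], []).append(value)
    return {entity: STAT_FUNCTIONS[func](values) for entity, values in grouped.items()}
```

With the sample rows, `compute_metric(SAMPLE_ROWS, "user_id", "amount", "sum", 7, date(2024, 1, 7))` sums each user's amounts over the seven days ending on 2024-01-07, and passing `row_filter=lambda r: r["amount"] >= 10` mimics a Greater Than Or Equal To filter condition.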

Operations Configuration
Note
If the dataset update method is Manual Update, no configuration is necessary.
1. Scheduling Cycle
  - Scheduled Update Time: Supports scheduling at a specific time of day. The task automatically runs once a day, and you can specify the running time as needed.
  - Scheduling Run Plan: Click Preview. The scheduling run plan displays all scheduling instances and their types for each day of a specific month according to the configured scheduling cycle and conditions. The preview date type can be selected based on Data Timestamp or Run Date (Scheduling Date).
    If the instances on a single day have different scheduling types, the system shows each type in its own color, together with the type name and the number of instances of that type. For example, the figure below shows that on the 4th of a certain month, the current scheduling task has 44 normal scheduling instances, 2 paused instances, and 12 dry-run instances.
    Hover the mouse over the scheduling type module of a certain day to view the detailed scheduling instance list of the current scheduling task on that day, including scheduling type, scheduling condition, and condition name.
  - Conditional Scheduling: You can set multiple scheduling conditions. The system evaluates these conditions sequentially from top to bottom. When a condition is met, the corresponding scheduling is executed, and the evaluation process for any remaining conditions stops. If no conditions are met, the system executes the default scheduling configuration. For more information, see the conditional scheduling rule description.
    Important
    Conditional scheduling is only effective when the scheduling type is Normal Scheduling.
2. Scheduling Dependency
  Scheduling dependency refers to the relationships between upstream and downstream nodes. In Dataphin, the downstream task node will start running only after the upstream task node has completed successfully.
  - Automatic Parsing
    The system automatically parses the upstream dependency nodes based on task lineage and creates associations. Data updates will depend on upstream data output.
    Note
    If the automatically parsed result does not meet expectations, you can click the Effective button of a node to deactivate it. A deactivated node is no longer treated as a dependency.
    By default, the dependency is on the current cycle.
  - Add Dependency
    If Automatic Parsing does not detect a scheduling dependency, or the upstream dependencies it generates do not match your actual needs, you can manually add the node's Upstream Dependency.
    Click Add Dependency, choose Physical Node or Logical Table Node, select one or more target nodes in the dialog box that appears, and then click OK.
    Note
    If the Intelligent R&D version is not purchased, only Physical Node dependencies can be added.
    After you manually add dependencies, running Automatic Parsing again overwrites any manually added node that matches a parsed node.
  - Edit Dependency
    In the scheduling dependency list, click the edit icon in the Actions column of the target upstream dependency. In the dialog box that appears, you can modify the Dependency Cycle, Dependency Policy, and Dependency Field (the dependency field can be modified only for logical table nodes). For details on dependency configuration, see Configure offline task scheduling dependency and Scheduling dependency scene rules and examples.
    To delete a dependency node, click the delete icon in the Actions column of the target upstream dependency.
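
The conditional scheduling rule described under Scheduling Cycle (conditions are evaluated from top to bottom, the first match wins, and the default configuration applies when nothing matches) can be sketched as follows. This is an illustration only: `ScheduleCondition`, `resolve_schedule`, and the example conditions are hypothetical and do not reflect Dataphin's internal API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class ScheduleCondition:
    name: str                        # condition name shown in the run plan
    matches: Callable[[date], bool]  # hypothetical predicate on the run date
    schedule_type: str               # for example "normal", "paused", or "dry-run"


def resolve_schedule(run_date: date,
                     conditions: Sequence[ScheduleCondition],
                     default_type: str = "normal") -> Tuple[str, str]:
    """Return (condition name, scheduling type) for the first matching condition."""
    for condition in conditions:  # evaluated in order, top to bottom
        if condition.matches(run_date):
            return condition.name, condition.schedule_type  # stop at the first match
    return "default", default_type  # no condition met: use the default configuration


# Hypothetical conditions, listed in evaluation order.
EXAMPLE_CONDITIONS = [
    ScheduleCondition("weekend", lambda d: d.weekday() >= 5, "paused"),
    ScheduleCondition("first_of_month", lambda d: d.day == 1, "dry-run"),
]
```

Because evaluation stops at the first match, a date that satisfies both example conditions (a first of the month that falls on a weekend) resolves to the "weekend" condition, so the order of conditions matters.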

Click Save And Publish to complete the creation of the offline dataset.
Note
After saving successfully, you can click Data Preview. The system will display the corresponding data information according to the configured processing logic to help you verify the accuracy of the processing logic.

What to do next

Once you have completed the creation and configuration of the offline dataset, you can proceed to create corresponding offline tags for it. For more information, see offline tags.
