Create and manage identification features
Identification features are defined based on field data content and metadata properties, using regular expressions, inclusion, exclusion, and other conditions to intelligently recommend data classifications or standards. Dataphin includes built-in identification features for entities such as phone numbers and ID numbers, and also lets you customize identification features. This topic describes how to create and manage identification features.
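To make the idea concrete, the following is a minimal sketch of how a feature that combines a content regular expression with inclusion and exclusion conditions on the field name could behave. The `IdentificationFeature` class, its attribute names, and the phone number pattern are hypothetical illustrations, not Dataphin's internal model or its built-in phone number feature.

```python
import re
from dataclasses import dataclass


@dataclass
class IdentificationFeature:
    """Hypothetical model of an identification feature (not Dataphin's schema)."""
    name: str
    content_pattern: str                 # regex applied to the field's data content
    include_field_keywords: tuple = ()   # field name must contain at least one, if set
    exclude_field_keywords: tuple = ()   # field name must contain none of these

    def matches(self, field_name: str, value: str) -> bool:
        lowered = field_name.lower()
        # Exclusion condition on metadata: reject the field outright.
        if any(k in lowered for k in self.exclude_field_keywords):
            return False
        # Inclusion condition on metadata: require a keyword hit when configured.
        if self.include_field_keywords and not any(
            k in lowered for k in self.include_field_keywords
        ):
            return False
        # Content condition: the whole value must match the regex.
        return re.fullmatch(self.content_pattern, value) is not None


# Illustrative pattern for an 11-digit mobile number (an assumption).
phone = IdentificationFeature(
    name="phone_number",
    content_pattern=r"1[3-9]\d{9}",
    include_field_keywords=("phone", "mobile", "tel"),
    exclude_field_keywords=("fax",),
)
```

With these assumed settings, `phone.matches("user_mobile", "13812345678")` returns `True`, while the same value in a field named `fax_phone` or `order_id` is rejected by the metadata conditions.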
Prerequisites
You must activate X-Data Standard to use the intelligent generation feature for identification features.
Permission description
Super administrators, data standard administrators, security administrators, and custom global roles with Feature-Management permissions can create and manage identification features.
Introduction to identification features
Identification features are instrumental in the intelligent recommendation of data standard mappings and field classification and labeling. The feature scan configuration affects both the standard module's mapping rule tasks and the security module's identification rule execution. Configure these settings carefully to avoid semantic conflicts, resource waste, and other potential issues.
Identification feature scan configuration
In the top menu bar on the Dataphin home page, choose Administration > Standard.
In the navigation pane on the left, select General Configuration > Feature. On the Feature page, click Scan Configuration.
In the Scan Configuration dialog box, configure the parameters.
Parameter
Description
Scan configuration
Scan Range
Select the scan range of identification rules. Filter View is selected by default. You can switch to Include View.
Note: Batch import and manual addition of identification results are not affected by this setting. Both can add identification results for view objects directly.
When the scan range includes views, both rule-based automatic scanning and lineage-based automatic inheritance scanning classify and label view objects.
View objects include physical views, logical views, data source views, data source materialized views, and materialized views.
Concurrent Runs
Controls the number of identification tasks that run globally at the same time. These tasks include mapping rule tasks in the standard module that are intelligently mapped by identification features, and tasks in the security module such as scheduled scans, manual scans, real-time scans, and automatic inheritance scans triggered by lineage updates. The default value is 16. You can set this parameter to a positive integer from 1 to 100.
Note: This parameter takes effect only when auto-triggered sampling query is disabled.
Increasing the degree of parallelism can speed up scans but consumes more cluster computing resources. To ensure system stability, configure this parameter based on your business needs.
Sampling configuration
Note: This applies to both auto sampling and temporary sampling queries that are triggered for content-based identification when auto sampling is disabled.
Auto sampling
This feature is enabled when data sampling is turned on in Administration > Metadata > Sampling Configuration and the trigger scenario is the execution of security identification rules or standard mapping rules. Otherwise, the feature is disabled.
When enabled, automatic data sampling is performed based on the settings in Administration > Metadata > Sampling Configuration. When an identification rule runs, the system first checks the data range for sample values to determine whether data sampling is needed. It then performs automatic sampling based on the automatic sampling update policy.
Note: Enable this feature when security identification rules involve content-based identification or when standard mapping is configured for intelligent mapping by identification feature. This helps prevent data from becoming outdated and avoids extra resource consumption from temporary data queries.
Execution space
If no sample data is available for content-based identification, you must select compute resources to run temporary data queries. You can modify the configuration in Administration > Metadata > Sampling Configuration > Compute Source.
Note: Temporary data query tasks consume compute resources. In most cases, you can select the project where the data resides.
If you want to reduce the resource load and query costs on the data's source project, you can allocate dedicated project resources or queues for temporary data queries. For example, you can choose a separate subscription project to avoid interference with normal business projects.
Ensure that the account configured for the compute source in the selected project has read permissions for the relevant data tables.
Scan disable period
During the specified time period, auto-triggered data sampling query tasks are not run. Any such task that is triggered in this period fails immediately. This prevents the tasks from consuming compute resources that production tasks need and helps keep online data tasks stable. You can modify the configuration in Administration > Metadata > Sampling Configuration > Compute Source.
Click OK to complete the identification feature scan configuration.
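The constraints described in the parameter table can be summarized in a small sketch. It assumes a hypothetical `ScanConfig` object (Dataphin exposes these settings through the dialog box, not through this class) and shows the documented default of 16 concurrent runs, the allowed range of 1 to 100, the default Filter View scan range, and a scan disable period that may wrap past midnight. The wrap-around handling and the string labels for object types are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional

# View object types listed in the Scan Range description.
VIEW_TYPES = {
    "physical view",
    "logical view",
    "data source view",
    "data source materialized view",
    "materialized view",
}


@dataclass(frozen=True)
class ScanConfig:
    """Hypothetical stand-in for the Scan Configuration dialog box."""
    include_views: bool = False            # default scan range is Filter View
    concurrent_runs: int = 16              # documented default
    disable_start: Optional[time] = None   # scan disable period (assumed shape)
    disable_end: Optional[time] = None

    def __post_init__(self):
        if not 1 <= self.concurrent_runs <= 100:
            raise ValueError("concurrent_runs must be an integer from 1 to 100")

    def in_scan_range(self, object_type: str) -> bool:
        # Views are scanned automatically only when Include View is selected.
        return self.include_views or object_type not in VIEW_TYPES

    def sampling_blocked(self, now: time) -> bool:
        # True when an auto-triggered sampling query would fall in the disable period.
        if self.disable_start is None or self.disable_end is None:
            return False
        if self.disable_start <= self.disable_end:
            return self.disable_start <= now < self.disable_end
        # The period wraps past midnight, for example 22:00 to 06:00.
        return now >= self.disable_start or now < self.disable_end
```

For example, `ScanConfig(disable_start=time(22, 0), disable_end=time(6, 0)).sampling_blocked(time(23, 30))` evaluates to `True`, and `ScanConfig(concurrent_runs=0)` raises `ValueError`.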
Create identification features
On the Feature page, click the Create Feature button.
In the Add Feature dialog box, configure the parameters. The configuration is the same as on the Data Security > Feature page. For more information, see Create identification features.
Click OK to add the identification feature.
Manage identification features
The Feature page lists the name, description, type, last updated by, and last update time of each identification feature.
(Optional) You can search for a specific identification feature by name or filter by type.
You can perform the following operations on the target identification feature. Supported operations are consistent with Data Security > Feature. For more information, see Manage identification features.
What to do next
When creating data standards, you can specify associated identification features, such as linking the ID number standard with the ID number feature. For configuration details, see Create and manage data standards.
When creating mapping rules, you can select the Intelligent Match by Identification Feature method. This rule matches features according to the identification features configured in the chosen data standard and the selected asset objects, suggesting appropriate mapping relationships. For configuration details, see Create and manage mapping rules.
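As a rough illustration of matching by identification feature, the sketch below suggests a mapping when enough of a field's sample values match the regular expression of the feature linked to a standard. The function name `suggest_mappings`, the hit-rate threshold of 0.8, and the ID number pattern are all assumptions chosen for the example. The source does not describe how Dataphin scores or ranks suggestions.

```python
import re


def suggest_mappings(standards, fields, min_hit_rate=0.8):
    """Suggest (field, standard, hit_rate) tuples.

    standards: {standard name: regex of its associated feature}  (hypothetical shape)
    fields:    {field name: list of sampled string values}       (hypothetical shape)
    """
    suggestions = []
    for field_name, samples in fields.items():
        if not samples:
            continue  # no sample data, so content-based matching is not possible
        for standard, pattern in standards.items():
            hits = sum(1 for value in samples if re.fullmatch(pattern, value))
            rate = hits / len(samples)
            if rate >= min_hit_rate:
                suggestions.append((field_name, standard, rate))
    return suggestions


# Illustrative input: an 18-character ID number pattern (an assumption).
standards = {"ID number standard": r"\d{17}[\dXx]"}
fields = {
    "cert_no": ["11010519491231002X", "440308199901010012"],
    "remark": ["hello", "n/a"],
    "empty_col": [],
}
result = suggest_mappings(standards, fields)
```

Here only `cert_no` is suggested for the ID number standard. The `remark` field has no matching values, and `empty_col` is skipped because it has no samples.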
Intelligent generation
Dataphin uses Alibaba Cloud Model Studio and X-Data Standard to intelligently generate regular expressions and suggest possible field names based on the feature name that you enter. This process quickly recommends feature expressions and their explanations, reducing configuration costs and improving the accuracy of standard mapping.
To use this intelligent feature, you must first enable the intelligent application for X-Data Standard.
On the Dataphin home page, choose Administration > Standard from the top menu bar.
In the navigation pane on the left, choose General Configuration > Feature. On the Feature page, click Create Feature.
In the Add Feature dialog box, enter a name and click Intelligent Generation. The configuration is the same as on the Data Security > Feature page. For more information, see Intelligent generation.