Scan for sensitive data using identification tasks

Data Security Center (DSC) provides data discovery capabilities. By managing sensitive data identification tasks, you can identify sensitive information in authorized assets, including its location, type, and sensitivity level, and manage it through classification and grading. Understanding your sensitive data helps you properly manage access permissions and enhance data security. This topic describes how to use identification tasks to scan for sensitive data.

Prerequisites

You have completed asset authorization in DSC for the target assets, allowing DSC to access their data.

Identification tasks

An identification task scans data in your authorized assets based on the identification models within an identification template. After a scan, it generates results and classifies and grades the discovered sensitive data. For more information about identification templates, see View and configure identification templates.

Task types

Data Security Center provides two types of identification tasks: default tasks and custom identification tasks.

Default tasks

After you authorize an asset, DSC automatically creates a scan task for each asset instance using the main identification template. These tasks are referred to as default tasks. For more information about the main identification template, see Use identification templates.

The following table describes the configurations of default tasks.

Parameter

Description

Identification template

A default task uses the main identification template, which cannot be modified. If the main identification template is a built-in identification template, the common identification template is also used.

  • Main identification template: You can set a built-in industry template, such as the template for the Internet industry or the Internet of Vehicles (IoV) industry, or a custom template as the main identification template.

  • Common identification template: This template is formulated based on the GB/T 35273-2020 Personal Information Security Specification released by the Standardization Administration of China. This template helps you manage personal information and control risks.

Scan cycle (default)

  • For databases, buckets, or Logstores connected using a one-click connection, a default task is created after the connection is complete.

    • If you select the Scan assets and identify sensitive data now option during the Connect process, the corresponding default task runs immediately.

    • If you do not select the Scan assets and identify sensitive data now option during the Connect process, go to the Classification and Grading > Tasks page. On the Identification Tasks tab, find the task in the Default Tasks list and click Rescan to run the default task manually.

  • For databases connected using an account and password, a default task is created after the connection is complete. Starting from the next day, the scan runs every day in the early morning.

A minimum of 24 hours must elapse between two scans.

Scan scope

For all authorized assets:

  • For database and OSS assets, the initial scan is a full scan of all authorized data. Subsequent scans cover only incremental data.

  • For Simple Log Service assets, each scan processes all data stored from 00:00 to 24:00 on the day before yesterday.

    If you want to scan more Simple Log Service data, create a custom identification task and configure the scan scope. For more information, see Create a custom identification task in this topic.

If you switch the main identification template, a scan is not immediately triggered. The new identification template is used in the next run of the default task.
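
As a rough illustration of the default Simple Log Service scan scope described above, the following Python sketch computes the time window that a default task would cover for a given run date. The function name `default_sls_window` is hypothetical and not part of any DSC API. The sketch also assumes naive local dates, because this topic does not state which time zone DSC uses.

```python
from datetime import date, datetime, timedelta


def default_sls_window(run_date: date) -> tuple[datetime, datetime]:
    """Return the [start, end) window for the day before yesterday.

    Assumption: "00:00 to 24:00" is modeled as a half-open interval
    that ends at 00:00 of the following day.
    """
    target_day = run_date - timedelta(days=2)
    start = datetime.combine(target_day, datetime.min.time())
    end = start + timedelta(days=1)
    return start, end


if __name__ == "__main__":
    start, end = default_sls_window(date(2024, 5, 10))
    print(f"Default task scans logs from {start} up to {end}")
```

To cover a different range, a custom identification task with its own Time Range is the documented route.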

Custom identification tasks

You can add a custom identification task to scan a specified data asset using an enabled identification template. If the identification template that you want to use is not enabled, you must enable it first. For more information, see Enable an identification template.

Scan specifications

Scan limits

To prevent oversized files or tables from affecting the overall scan progress, DSC imposes limits on the size of scannable files and table fields. Before you scan for sensitive data, note the following limits:

  • For structured data (such as RDS MySQL, RDS PostgreSQL, and PolarDB) and big data (such as Tablestore and MaxCompute), the first 200 rows of a table are sampled by default. You can increase the number of sampled rows to a maximum of 1,000. For each sampled row, only the first 10 KB of data in each field is scanned.

  • For unstructured data (such as OSS and Simple Log Service):

    • By default, the system does not scan individual files larger than 200 MB.

    • For data in OSS:

      • You can manually adjust the scan threshold for a single file up to a maximum of 1,000 MB.

      • For compressed or archived files, only the first 1,000 sub-files are scanned.

      • When a single OSS bucket is scanned, a maximum of four files can be scanned concurrently.

      • QPS limit: When a single task is running, a maximum of 100 API calls per second are made to the corresponding OSS bucket.

      • Bandwidth limit: When a single task is running, the maximum downstream bandwidth of internal network traffic for the corresponding OSS bucket is 200 MB/s.

    • DSC supports scanning more than 800 file types in OSS, including text files, office documents, image files, design documents, code files, data files, binary files, signature verification files, archive files, applications, audio and video files, chemical structure files, and other categories. For more information, see Supported OSS file types for identification.

For more information about the limits of identification tasks, see Limitations.
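
The structured-data limits above can be sketched in Python to show what a scan would actually see. This is a minimal illustration, not DSC's implementation: the names `effective_sample_rows` and `sample_table` are hypothetical, fields are modeled as `bytes`, and 10 KB is assumed to mean 10 × 1024 bytes.

```python
DEFAULT_SAMPLE_ROWS = 200      # default number of sampled rows per table
MAX_SAMPLE_ROWS = 1000         # documented upper bound for sampled rows
MAX_FIELD_BYTES = 10 * 1024    # assumption: 10 KB = 10,240 bytes


def effective_sample_rows(requested=None):
    """Clamp a requested sample size to the documented range."""
    if requested is None:
        return DEFAULT_SAMPLE_ROWS
    return max(1, min(requested, MAX_SAMPLE_ROWS))


def sample_table(rows, requested_rows=None):
    """Keep the first N rows and the first 10 KB of every field."""
    limit = effective_sample_rows(requested_rows)
    return [[field[:MAX_FIELD_BYTES] for field in row] for row in rows[:limit]]


if __name__ == "__main__":
    table = [[b"x" * 20_000, b"short"] for _ in range(1500)]
    sampled = sample_table(table, requested_rows=5000)
    print(len(sampled), len(sampled[0][0]))
```

A practical consequence is that sensitive values located beyond the sampled rows, or beyond the first 10 KB of a field, are not seen by the scan.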

Scannable data objects

  • Database asset: <instance>/<database>/<table_name>. An identification task treats each data table as a data object.

  • Big data: <instance>/<table_name>. An identification task treats each data table as a data object.

  • OSS asset: <OSS bucket>/<file_name>. An identification task treats each file as a data object.

  • Simple Log Service asset: <SLS Project>/<logstore>/<Time Window>. An identification task treats the data within each 5-minute time window as a single data object.
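
To make the Simple Log Service data object concrete, the sketch below maps a log timestamp to its 5-minute window and builds an identifier in the `<SLS Project>/<logstore>/<Time Window>` shape. The function names and the text format of the time window are assumptions for illustration, as is the alignment of windows to clock boundaries (12:00, 12:05, and so on); this topic does not specify either.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)


def sls_window_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def sls_data_object(project: str, logstore: str, ts: datetime) -> str:
    """Build a hypothetical data object identifier for a log timestamp."""
    start = sls_window_start(ts)
    end = start + WINDOW
    return f"{project}/{logstore}/{start:%Y-%m-%dT%H:%M}-{end:%H:%M}"


if __name__ == "__main__":
    print(sls_data_object("my-project", "app-logs", datetime(2024, 5, 10, 12, 7, 33)))
```

Under these assumptions, one full day of logs in a single Logstore corresponds to 288 data objects (24 hours × 12 windows per hour).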

Scan speed

Scan speeds vary based on the data asset type. The following speeds are for reference only.

  • Structured data (such as RDS MySQL, RDS PostgreSQL, and PolarDB) and big data (such as Tablestore and MaxCompute): For large databases with more than 1,000 tables, the scan speed is 1,000 columns per minute, based on 200 rows per column.

  • Unstructured data (such as OSS and Simple Log Service): It takes 6 to 48 hours to scan 1 TB of data. The wide range in duration is due to the varying distribution of file types within the data. The average scan duration is 24 hours.
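
The reference speeds above can be turned into a rough planning estimate. The following Python sketch assumes that scan time scales linearly with the number of columns or the data volume, which this topic does not guarantee; the function names are hypothetical.

```python
def structured_scan_minutes(total_columns: int, columns_per_minute: int = 1000) -> float:
    """Estimate scan time for structured data at the reference speed."""
    return total_columns / columns_per_minute


def unstructured_scan_hours(size_tb: float) -> tuple[float, float, float]:
    """Return (fastest, average, slowest) hours, assuming linear scaling
    from the reference figures of 6, 24, and 48 hours per TB."""
    return (6 * size_tb, 24 * size_tb, 48 * size_tb)


if __name__ == "__main__":
    # Example: 1,500 tables with 20 columns each.
    print(structured_scan_minutes(1500 * 20), "minutes")
    print(unstructured_scan_hours(2.5), "hours (fastest, average, slowest)")
```

Treat the output as an order-of-magnitude guide only, because actual speed depends on the file types and the asset.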

Scan mechanism

| Task type | Initial scan | Subsequent scans |
| --- | --- | --- |
| Default task | Performs a full scan on all existing data in the target asset. | Scans new or modified data objects. You can manually perform a rescan or configure the scan cycle for the default task. |
| Custom identification task | Scans data based on your custom scan scope. | Scans new or modified data objects within the custom scan scope based on your custom scan cycle. |

During subsequent automatic scans, DSC does not rescan data objects that have not changed since the last scan.
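
The difference between an initial scan and a subsequent scan can be sketched as a simple selection step. This is a conceptual illustration only: this topic does not describe how DSC detects changes, so the use of a last-modified timestamp, the `DataObject` class, and the `objects_to_scan` function are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DataObject:
    path: str                 # for example "<instance>/<database>/<table_name>"
    last_modified: datetime   # assumption: change detection by timestamp


def objects_to_scan(objects, last_scan: Optional[datetime]):
    """Initial scan (last_scan is None) returns every object;
    later scans return only objects changed after the last scan."""
    if last_scan is None:
        return list(objects)
    return [obj for obj in objects if obj.last_modified > last_scan]


if __name__ == "__main__":
    inventory = [
        DataObject("rds-1/shop/orders", datetime(2024, 5, 9, 8, 0)),
        DataObject("rds-1/shop/users", datetime(2024, 5, 1, 8, 0)),
    ]
    changed = objects_to_scan(inventory, last_scan=datetime(2024, 5, 8))
    print([obj.path for obj in changed])
```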

Scan results

The sensitivity level of a scan result is determined by the models in the identification template used for the task. The result takes on the highest sensitivity level of all matched models. DSC defines sensitivity levels from S1 to S10, where a larger number indicates a higher sensitivity level. N/A indicates that no sensitive data is identified.

The range of sensitivity levels available for an identification model is determined by the sensitivity levels included in the associated identification template. For more information about how to set these levels, see Set sensitivity levels for an identification template.
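
The "highest matched level wins" rule can be expressed in a few lines of Python. The function name `result_level` is hypothetical, and levels are modeled as strings such as `"S3"`. Note that the numeric part is compared as an integer, because a plain string comparison would rank `"S3"` above `"S10"`.

```python
def result_level(matched_levels):
    """Return the highest sensitivity level among matched models,
    or "N/A" when no model matched."""
    if not matched_levels:
        return "N/A"
    return max(matched_levels, key=lambda level: int(level[1:]))


if __name__ == "__main__":
    print(result_level(["S2", "S10", "S3"]))
    print(result_level([]))
```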

Recommendations

Recommendation

Description

Confirm scan scope and priority

When you begin scanning for sensitive data, you may have a large backlog of data to classify and grade, making it impractical to scan all data at once. First, evaluate which of your data assets have a higher scan priority. Prioritize scanning data that poses higher potential risks, such as data that is frequently accessed, updated, or subject to unknown operations.

Limit the initial scan scope

To achieve optimal scan results, specify a limited scan scope instead of performing a full scan initially. For example, start with a single database, an OSS bucket, or a few files. By limiting the initial scan scope, you can better determine which identification features to enable and which rules to use, helping you discover meaningful sensitive data.

If you do not need all identification features, do not enable all of them, as false positives or invalid results can complicate risk assessment. While enabling all features can match a broad range of results for certain data types like dates, times, and URLs, this approach may not be suitable for large-scale data scanning.

When you scan structured data, ensure that the sample size is sufficient to produce scan results.

Set the task start time

Schedule identification tasks to run automatically at a specific time daily, weekly, or monthly, based on the update frequency of your data assets. This allows you to check for changes since the last scan and promptly identify sensitive information in the modified data. Running scans regularly also helps you identify trends or anomalies in the scan results.

Manage default tasks

View default tasks

  1. Log on to the Data Security Center console.

  2. In the navigation pane on the left, select Classification and Grading > Tasks.

  3. On the Tasks page, click the Identification Tasks tab, and then click Default Tasks.

  4. On the Discovery Task Monitoring page, view the list of default tasks.

  5. In the Actions column of a default task, you can perform the following operations:

    • Rescan: Use this operation if an identification model is upgraded, you change the main identification template, or the database content has changed and you need to obtain scan results as soon as possible.

    • Pause: If you notice abnormal database behavior, you can click Pause in the Actions column of a default identification task to temporarily stop the scan.

    • Terminate: Stops the execution of the current and subsequent default tasks.

    • Enable: Re-enables a terminated default task.

    Note

    Default tasks cannot be deleted.

Adjust default task scan settings

Default tasks support periodic scanning. Set the scan cycle to match the update frequency of your database to promptly discover sensitive information in new or changed data. The minimum scan cycle is daily.

  1. On the Discovery Task Monitoring page, select the checkbox next to the task for which you want to configure a scan cycle and click Scan Settings.

  2. In the Scan Settings dialog box, set the scan cycle and automatic scan start time, and then click OK.

    Important
    • To minimize the impact of scans on your database, set the scan start time to an off-peak period for your data asset.

    • During a scan task, monitor your database or service status for anomalies, such as sudden spikes in CPU or memory usage. If you detect an anomaly related to the scan task, promptly pause or terminate the identification task. On the Tasks page, click Pause or Terminate in the Actions column of the target identification task to stop the scan.

Create a custom identification task

If you want to use an enabled template other than the main one to scan a specific database, create a custom identification task.

Important

The system supports a maximum of five active identification tasks. Each periodic scan task occupies one active task slot. If you have five periodic tasks configured, you cannot create new identification tasks.

Create a custom identification task

  1. In the navigation pane on the left, select Classification and Grading > Tasks.

  2. On the Identification Tasks tab, select an Asset Type for which you want to create an identification task, and then click Create.

    Asset types are categorized into structured data (RDS, PolarDB, PolarDB-X, PolarDB-X 2.0, MongoDB, OceanBase, and self-managed databases), unstructured data (OSS and Simple Log Service), and big data (Tablestore, MaxCompute, ADB-MYSQL, and ADB-PG). The number of connected data sources is displayed next to each sub-item.

  3. In the Create panel, configure the parameters for the identification task and click OK.

    Category

    Parameter

    Description

    Basic Information

    Asset Type

    Displays the asset type you selected. This parameter cannot be changed.

    Task Name

    Enter a name for the task.

    Task notes

    Enter a description for the task.

    Task and Plan

    Select a start time for the task. Valid values:

    • Immediate Scan: The task runs immediately after it is created.

    • Periodic Scan: In the Scan Frequency and Scan Time (Structured Data Only) drop-down lists, select a scan frequency and a time period to run the scan. To run a scan immediately, select Scan Once Now.

      Note

      The Scan Time setting applies only to structured data, not to unstructured data.

    Identification Template

    Select the identification template to use for the scan. You can select up to two enabled identification templates. For more information about how to enable an identification template, see Use identification templates.

    Identification Scope

    Identification Scope of Structured Data

    Select a scan scope for structured data, such as RDS and PolarDB. Valid values:

    • Global Scan: Scans all your structured data assets.

    • Specify Scan Scope: Configure the Instance Name, Database name, and Scan Limit.

      • Configure the instance and database names. To add multiple instances, click Add Identification Scope.

      • Configure Scan Limit. By default, the first 200 rows are scanned. The maximum value is 1,000.

    Unstructured data OSS identification scope

    For unstructured data in OSS, select an Object, a Sampling Method, a Scan Depth, and a Scan Limit.

    • Valid values for Object:

      • Global Scan: Scans your unstructured data assets in OSS.

      • Specify Scan Scope: Select the buckets that you want to scan. You can select multiple buckets.

        After you specify the bucket files to scan, you can add filter conditions to define a more precise scan scope. You can specify whether to include or exclude values for Prefix, Directory, and Suffix to filter the scan scope.

    • Sampling Method: Obtains data from your unstructured data assets in OSS using the ListObjects API operation and scans the data based on your configurations.

      • Global Scan: Scans all data.

      • Specify the sampling ratio: Set a Sampling Rate. Data is scanned based on the specified ratio.

        Note

        For example, if you set a Sampling Rate of 1/10, the first file is scanned, the next nine files are skipped, and then the 11th file is scanned.

    • Valid values for Scan Depth:

      • Global Scan: Scans the data in the full path of an asset.

      • Specify Scan Scope: Specify the depth of the bucket path. The path depth is delimited by forward slashes (/). Valid values: 1 to 10. For example, if you set the value to 5, bucket paths that are up to five levels deep are scanned.

    • Scan Limit: The default value is 200 MB, and the maximum value is 1,000 MB. For a file that exceeds the scan limit, only the data up to the configured size is scanned. For example, if you set the limit to 200 MB and a file is 300 MB, only the first 200 MB is scanned.

    • AI-Powered Image Detection: If you have an available quota for AI-powered image detection, you can enable this feature. It uses large AI models to improve the accuracy of sensitive information detection in images.

    • Synchronize All Identification Results to SLS: Select whether to synchronize all identification results to Simple Log Service.

    • Synchronize Identification Results to DMS: When selected, identification results are automatically synced to column labels in Data Management (DMS) after the task completes. Make sure that the target asset is connected to DMS before enabling this option.

    Unstructured data SLS identification scope

    Set the Asset Scope and Time Range for Simple Log Service.

    • Valid values for Asset Scope:

      • Global Scan: Scans your unstructured data assets in Simple Log Service.

      • Specify Scan Scope: Select the Project and its Logstores that you want to scan. You can select one Project and multiple Logstores.

    • Valid values for Time Range:

      • Last 15 Minutes, Last 1 hour, Yesterday, Last 1 Day, Last 7 Days, or Last 30 Days.

      • Custom: You can select a time range in minutes with a step of 5 minutes.

    Other settings

    Tagging Result Overwriting

    Specify how to handle sensitive data that was previously corrected. Valid values:

    • Skip Manual Tagging Result: Preserves the original manual correction results. This is the recommended option.

    • Overwrite Manual Tagging Result: Overwrites the manual correction results with the new identification results.
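
The OSS Scan Depth and Sampling Rate settings described in the table above can be illustrated with a short Python sketch. Several points are assumptions rather than documented behavior: an object key's depth is counted as its number of slash-separated segments, the depth filter is applied before sampling, and the names `key_depth` and `select_objects` are hypothetical.

```python
def key_depth(key: str) -> int:
    """Count slash-separated segments in an object key (assumed depth definition)."""
    return len([part for part in key.split("/") if part])


def select_objects(keys, max_depth=None, sampling_denominator=1):
    """Filter keys by depth, then keep every Nth key for a 1/N sampling rate."""
    in_scope = [k for k in keys if max_depth is None or key_depth(k) <= max_depth]
    # A 1/10 rate keeps the 1st, 11th, 21st, ... key, matching the documented example.
    return in_scope[::sampling_denominator]


if __name__ == "__main__":
    listing = [f"logs/2024/file-{i}.txt" for i in range(25)] + ["a/b/c/d/e/f/deep.txt"]
    print(select_objects(listing, max_depth=5, sampling_denominator=10))
```

Because sampling skips files, a lower Sampling Rate shortens the scan but can miss sensitive files; a full scan remains the safer choice for small or high-risk buckets.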

Edit or delete custom tasks

The task list includes columns such as Task ID, Task Name, Operator, Identification Template, Scan Status, Start Time, End Time, and Actions. The scan status can be Finished, Not Started, or Terminated. Depending on the status, the Actions column provides options such as Rescan, Details, Edit, Pause, and Terminate.

  • Edit: Reconfigures a custom identification task. You can modify all parameters.

  • Delete (from the icon menu): Deletes custom identification tasks that you no longer need.

Manage task status

Rescan a task

If an identification model is upgraded or your database content changes and you need to view the scan results immediately, perform a rescan. A rescan performs a full scan on the target asset and starts running as soon as you trigger it, so start the rescan during an off-peak period.

Before you perform a rescan, make sure that the relevant identification template is enabled.

Note

You cannot rescan a custom identification task whose Scan Type is set to Immediate Scan.

  1. On the Identification Tasks tab, perform a rescan:

    • To rescan a custom identification task, find the task in the task list and click Rescan in the Actions column.

    • To rescan a default task, click Default Tasks, find the target asset, and click Rescan in the Actions column.

  2. You can view the scan progress in the Scan Status column of the identification task.

Pause or terminate a task

  • Pause: If you detect an exception in your database service, you can click Pause in the Actions column of a custom identification task to temporarily stop the running task.

  • Terminate: Stops the execution of the current and subsequent identification tasks. This operation is supported for both custom identification tasks and default tasks.

Revise identification models

The revision feature allows you to correct misidentified or missed sensitive data, enabling more precise data management and protection. DSC lets you revise and restore sensitive data identification models. You can perform the following steps:

  1. On the Tasks page, click the Revision Tasks tab.

  2. In the data type navigation pane on the left, click the asset type you want to revise.

  3. Click Revision or Restore in the Actions column of the target sensitive data. Follow the on-screen instructions to modify the Revised Model, and then click OK.

    From the Revised Model drop-down list, you can select a sensitive data category, such as Private KEY, PEM certificate, AccessKey ID, AccessKey Secret, GPS location, or Password.

    The Restore operation restores the identification model to its state before the revision.

View and export identification results

On the Data Classification > Asset Insight page of the DSC console, you can view the latest sensitive data results detected using the main identification template. For more information, see View sensitive data identification results.

Use an export task to download the sensitive data identification results from any enabled identification template (either the main or an active one).

Important

The identification template and asset type that you select for an export task must have a corresponding identification task that has already completed successfully. Otherwise, the export file will be empty.

Create an export task

You can perform the following steps to create an export task and download the exported results.

  1. On the Tasks page, click the Export Tasks tab.

  2. On the Export Tasks tab, click Create.

  3. Configure the export task and click OK.

    1. In the Basic Information section, enter a task name and select the template used by the identification task.

      You can select only enabled templates.

    2. In the Export Dimension section, select Asset Type or Asset Instance.

      • Asset Type: Select the asset types whose results you want to export.

      • Asset Instance: Select the asset instances whose results you want to export.

    After you create an export task, you can view its status in the export task list. Larger exports take longer to complete.

Download exported results

After the Export Status changes to Finished, click Download in the Actions column of the export task.

Important

After the export is complete, you must download the exported data within three days. After three days, the export task expires, and you can no longer download the exported sensitive data.
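
If you track export tasks in your own tooling, the three-day download window can be checked with a small helper such as the Python sketch below. It assumes that "three days" means 72 hours counted from the moment the export finishes, which this topic does not state precisely; the function names are hypothetical.

```python
from datetime import datetime, timedelta

DOWNLOAD_WINDOW = timedelta(days=3)  # assumption: 72 hours after completion


def download_deadline(export_finished_at: datetime) -> datetime:
    """Return the last moment at which the export can still be downloaded."""
    return export_finished_at + DOWNLOAD_WINDOW


def can_download(export_finished_at: datetime, now: datetime) -> bool:
    """Check whether the export is still within its download window."""
    return now <= download_deadline(export_finished_at)


if __name__ == "__main__":
    finished = datetime(2024, 5, 10, 9, 30)
    print(download_deadline(finished))
    print(can_download(finished, now=datetime(2024, 5, 14, 9, 0)))
```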

Related documents

FAQ