Data Security Center (DSC) provides data discovery capabilities. By managing sensitive data identification tasks, DSC helps you identify sensitive information in authorized assets and manage it through classification and grading, including the location, type, and sensitivity level of the data. Understanding your sensitive data helps you properly manage access permissions and enhance data security. This topic describes how to use identification tasks to scan for sensitive data.
Prerequisites
You have completed asset authorization in DSC for the target assets, allowing DSC to access their data.
Identification tasks
An identification task scans data in your authorized assets based on the identification models within an identification template. After a scan, it generates results and classifies and grades the discovered sensitive data. For more information about identification templates, see View and configure identification templates.
Task types
Data Security Center provides two types of identification tasks: default tasks and custom identification tasks.
Default tasks
After you authorize an asset, DSC automatically creates a scan task for each asset instance using the main identification template. These tasks are referred to as default tasks. For more information about the main identification template, see Use identification templates.
The following table describes the configurations of default tasks.
| Parameter | Description |
| Identification template | A default task uses the main identification template, which cannot be modified. If the main identification template is a built-in identification template, the common identification template is also used. |
| Scan cycle (default) | A minimum of 24 hours must elapse between two scans. |
| Scan scope | For all authorized assets: If you switch the main identification template, a scan is not immediately triggered. The new identification template is used in the next run of the default task. |
Custom identification tasks
You can add a custom identification task to scan a specified data asset using an enabled identification template. If the identification template that you want to use is not enabled, you must enable it first. For more information, see Enable an identification template.
Scan specifications
Scan limits
To prevent oversized files or tables from affecting the overall scan progress, DSC imposes limits on the size of scannable files and table fields. Before you scan for sensitive data, note the following limits:
-
For structured data (such as RDS MySQL, RDS PostgreSQL, and PolarDB) and big data (such as Tablestore and MaxCompute), the first 200 rows of a table are sampled by default. You can increase the number of sampled rows to a maximum of 1,000. For each sampled row, only the first 10 KB of data in each field is scanned.
-
For unstructured data (such as OSS and Simple Log Service):
-
By default, the system does not scan individual files larger than 200 MB.
-
For data in OSS:
-
You can manually adjust the scan threshold for a single file up to a maximum of 1,000 MB.
-
For compressed or archived files, only the first 1,000 sub-files are scanned.
-
When a single OSS bucket is scanned, a maximum of four files can be scanned concurrently.
-
QPS limit: When a single task is running, a maximum of 100 API calls per second are made to the corresponding OSS bucket.
-
Bandwidth limit: When a single task is running, the maximum downstream bandwidth of internal network traffic for the corresponding OSS bucket is 200 MB/s.
-
DSC supports scanning more than 800 file types in OSS, including text files, office documents, image files, design documents, code files, data files, binary files, signature verification files, archive files, applications, audio and video files, chemical structure files, and other categories. For more information, see Supported OSS file types for identification.
-
For more information about the limits of identification tasks, see Limitations.
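The limits above can be summarized in a small sketch. The constants are the values quoted in this topic. The function names are hypothetical and are not part of any DSC API. Clamping an oversized request to the maximum is a modeling assumption, because the console may reject such a value instead.

```python
# Limits quoted in this topic. Names are illustrative, not DSC API fields.
DEFAULT_SAMPLE_ROWS = 200
MAX_SAMPLE_ROWS = 1000
FIELD_SCAN_BYTES = 10 * 1024  # only the first 10 KB of each field is scanned
DEFAULT_FILE_LIMIT_MB = 200
MAX_FILE_LIMIT_MB = 1000
MAX_ARCHIVE_SUBFILES = 1000


def effective_sample_rows(requested=None):
    """Rows sampled per table: the default, or the request capped at the maximum."""
    if requested is None:
        return DEFAULT_SAMPLE_ROWS
    if requested < 1:
        raise ValueError("requested row count must be positive")
    return min(requested, MAX_SAMPLE_ROWS)


def max_scanned_bytes_per_column(sample_rows):
    """Upper bound on the bytes read from one column of one table."""
    return sample_rows * FIELD_SCAN_BYTES


def effective_file_limit_mb(requested=None):
    """Per-file OSS scan threshold: the default, or the request capped at the maximum."""
    if requested is None:
        return DEFAULT_FILE_LIMIT_MB
    if requested < 1:
        raise ValueError("requested limit must be positive")
    return min(requested, MAX_FILE_LIMIT_MB)


def archive_subfiles_scanned(subfile_count):
    """Only the first 1,000 sub-files of an archive are scanned."""
    return min(subfile_count, MAX_ARCHIVE_SUBFILES)
```

For example, with the default sample size, at most 200 × 10 KB (about 2 MB) is read from any single column.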
Scannable data objects
-
Database asset: <instance>/<database>/<table_name>. An identification task treats each data table as a data object.
-
Big data: <instance>/<table_name>. An identification task treats each data table as a data object.
-
OSS asset: <OSS bucket>/<file_name>. An identification task treats each file as a data object.
-
Simple Log Service asset: <SLS Project>/<logstore>/<Time Window>. An identification task treats the data within each 5-minute time window as a single data object.
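The following sketch shows how these data object identifiers could be built. It assumes that the 5-minute windows for Simple Log Service are aligned to clock multiples of five minutes and that the window is rendered as a timestamp string. Both are illustrative assumptions, and all function names are hypothetical.

```python
from datetime import datetime, timedelta


def sls_window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute window (assumed alignment)."""
    return ts - timedelta(
        minutes=ts.minute % 5, seconds=ts.second, microseconds=ts.microsecond
    )


def database_object(instance: str, database: str, table: str) -> str:
    # Database asset: one data object per table.
    return f"{instance}/{database}/{table}"


def big_data_object(instance: str, table: str) -> str:
    # Big data asset: one data object per table.
    return f"{instance}/{table}"


def oss_object(bucket: str, file_name: str) -> str:
    # OSS asset: one data object per file.
    return f"{bucket}/{file_name}"


def sls_object(project: str, logstore: str, ts: datetime) -> str:
    # Simple Log Service asset: one data object per 5-minute window.
    start = sls_window_start(ts)
    return f"{project}/{logstore}/{start:%Y-%m-%dT%H:%M}"
```

Under these assumptions, two log entries written at 12:06 and 12:09 fall into the same data object, while an entry written at 12:10 starts a new one.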
Scan speed
Scan speeds vary based on the data asset type. The following speeds are for reference only.
-
Structured data (such as RDS MySQL, RDS PostgreSQL, and PolarDB) and big data (such as Tablestore and MaxCompute): For large databases with more than 1,000 tables, the scan speed is approximately 1,000 columns per minute, assuming 200 sampled rows per column.
-
Unstructured data (such as OSS and Simple Log Service): It takes 6 to 48 hours to scan 1 TB of data. The wide range in duration is due to the varying distribution of file types within the data. The average scan duration is 24 hours.
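As a rough planning aid, the reference speeds above can be turned into an estimate. This sketch assumes that scan time scales linearly with the number of columns or the data size, which this topic does not guarantee. The function names are hypothetical.

```python
def structured_scan_minutes(column_count, columns_per_minute=1000):
    """Estimated minutes to scan structured data at the reference speed."""
    if columns_per_minute <= 0:
        raise ValueError("columns_per_minute must be positive")
    return column_count / columns_per_minute


def unstructured_scan_hours(size_tb):
    """Estimated hours to scan unstructured data: 6 to 48 hours per TB, 24 on average."""
    return {
        "best": 6 * size_tb,
        "average": 24 * size_tb,
        "worst": 48 * size_tb,
    }
```

For example, a database with 30,000 columns would take about 30 minutes, and 2 TB of OSS data would take between 12 and 96 hours under the linear assumption.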
Scan mechanism
| Task type | Initial scan | Subsequent scans |
| Default task | Performs a full scan on all existing data in the target asset. | Scans new or modified data objects. You can manually perform a rescan or configure the scan cycle for the default task. |
| Custom identification task | Scans data based on your custom scan scope. | Scans new or modified data objects within the custom scan scope based on your custom scan cycle. |
During subsequent automatic scans, DSC does not rescan data objects that have not changed since the last scan.
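The incremental behavior described above can be sketched as follows. This topic does not state how DSC detects changes, so the sketch assumes a last-modified timestamp per data object. The `DataObject` type and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class DataObject:
    path: str
    last_modified: datetime  # assumed change indicator


def objects_to_scan(
    objects: Iterable[DataObject], last_scan_at: Optional[datetime]
) -> List[DataObject]:
    """Select the data objects that a scan run would process."""
    if last_scan_at is None:
        # Initial scan: every existing data object in scope is scanned.
        return list(objects)
    # Subsequent scan: only objects created or modified after the last scan.
    return [obj for obj in objects if obj.last_modified > last_scan_at]
```

A manual rescan corresponds to calling the function with `last_scan_at=None`, because a rescan performs a full scan.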
Scan results
The sensitivity level of a scan result is determined by the models in the identification template used for the task. The result takes on the highest sensitivity level of all matched models. DSC defines sensitivity levels from S1 to S10, where a larger number indicates a higher sensitivity level. N/A indicates that no sensitive data is identified.
The range of sensitivity levels available for an identification model is determined by the sensitivity levels included in the associated identification template. For more information about how to set these levels, see Set sensitivity levels for an identification template.
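The "highest level wins" rule can be illustrated with a short sketch. The function names are hypothetical. Note that levels must be compared numerically, because a plain string comparison would rank S10 below S2.

```python
def level_rank(level: str) -> int:
    """Convert a level such as 'S3' to its numeric rank. 'N/A' ranks as 0."""
    if level == "N/A":
        return 0
    if not (level.startswith("S") and level[1:].isdigit()):
        raise ValueError(f"unknown sensitivity level: {level!r}")
    rank = int(level[1:])
    if not 1 <= rank <= 10:
        raise ValueError(f"sensitivity level out of range: {level!r}")
    return rank


def result_level(matched_levels):
    """Return the highest level among all matched models, or 'N/A' if none matched."""
    top = max((level_rank(level) for level in matched_levels), default=0)
    return f"S{top}" if top else "N/A"
```

For example, a data object that matches models graded S2, S10, and S3 is reported as S10, and an object with no matches is reported as N/A.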
Recommendations
| Recommendation | Description |
| Confirm scan scope and priority | When you begin scanning for sensitive data, you may have a large backlog of data to classify and grade, making it impractical to scan all data at once. First, evaluate which of your data assets have a higher scan priority. Prioritize scanning data that poses higher potential risks, such as data that is frequently accessed, updated, or subject to unknown operations. |
| Limit the initial scan scope | To achieve optimal scan results, specify a limited scan scope instead of performing a full scan initially. For example, start with a single database, an OSS bucket, or a few files. By limiting the initial scan scope, you can better determine which identification features to enable and which rules to use, helping you discover meaningful sensitive data. If you do not need all identification features, do not enable all of them, as false positives or invalid results can complicate risk assessment. While enabling all features can match a broad range of results for certain data types like dates, times, and URLs, this approach may not be suitable for large-scale data scanning. When you scan structured data, ensure that the sample size is sufficient to produce scan results. |
| Set the task start time | Schedule identification tasks to run automatically at a specific time daily, weekly, or monthly, based on the update frequency of your data assets. This allows you to check for changes since the last scan and promptly identify sensitive information in the modified data. Running scans regularly also helps you identify trends or anomalies in the scan results. |
Manage default tasks
View default tasks
Log on to the Data Security Center console.
In the navigation pane on the left, go to the Tasks page.
-
On the Tasks page, click the Identification Tasks tab, and then click Default Tasks.
-
On the Discovery Task Monitoring page, view the list of default tasks.
-
In the Actions column of a default task, you can perform the following operations:
-
Rescan: Use this operation if an identification model is upgraded, you change the main identification template, or the database content has changed and you need to obtain scan results as soon as possible.
-
Pause: If you notice abnormal database behavior, you can click Pause in the Actions column of a default identification task to temporarily stop the scan.
-
Terminate: Stops the execution of the current and subsequent default tasks.
-
Enable: Re-enables a terminated default task.
Note: Default tasks cannot be deleted.
Adjust default task scan settings
Default tasks support periodic scanning. Set the scan cycle to match the update frequency of your database to promptly discover sensitive information in new or changed data. The minimum scan cycle is daily.
-
On the Discovery Task Monitoring page, select the checkbox next to the task for which you want to configure a scan cycle and click Scan Settings.
-
In the Scan Settings dialog box, set the scan cycle and automatic scan start time, and then click OK.
Important
-
To minimize the impact of scans on your database, set the scan start time to an off-peak period for your data asset.
-
During a scan task, monitor your database or service status for anomalies, such as sudden spikes in CPU or memory usage. If you detect an anomaly related to the scan task, promptly pause or terminate the identification task. On the Tasks page, click Pause or Terminate in the Actions column of the target identification task to stop the scan.
Create a custom identification task
If you want to use an enabled template other than the main one to scan a specific database, create a custom identification task.
The system supports a maximum of five active identification tasks. Each periodic scan task occupies one active task slot. If you have five periodic tasks configured, you cannot create new identification tasks.
Create a custom identification task
In the navigation pane on the left, go to the Tasks page.
-
On the Identification Tasks tab, select an Asset Type for which you want to create an identification task, and then click Create.
Asset types are categorized into structured data (RDS, PolarDB, PolarDB-X, PolarDB-X 2.0, MongoDB, OceanBase, and self-managed databases), unstructured data (OSS and Simple Log Service), and big data (Tablestore, MaxCompute, ADB-MYSQL, and ADB-PG). The number of connected data sources is displayed next to each sub-item.
-
In the Create panel, configure the parameters for the identification task and click OK.
Category
Parameter
Description
Basic Information
Asset Type
Displays the asset type you selected. This parameter cannot be changed.
Task Name
Enter a name for the task.
Task notes
Enter a description for the task.
Task and Plan
Select a start time for the task. Valid values:
-
Immediate Scan: The task runs immediately after it is created.
-
Periodic Scan: In the Scan Frequency and Scan Time (Structured Data Only) drop-down lists, select a scan frequency and a time period to run the scan. To run a scan immediately, select Scan Once Now.
Note: The Scan Time setting applies only to structured data, not to unstructured data.
Identification Template
Select the identification template to use for the scan. You can select up to two enabled identification templates. For more information about how to enable an identification template, see Use identification templates.
Identification Scope
Identification Scope of Structured Data
Select a scan scope for structured data, such as RDS and PolarDB. Valid values:
-
Global Scan: Scans all your structured data assets.
-
Specify Scan Scope: Configure the Instance Name, Database name, and Scan Limit.
-
Configure the instance and database names. To add multiple instances, click Add Identification Scope.
-
Configure Scan Limit. By default, the first 200 rows are scanned. The maximum value is 1,000.
Unstructured data OSS identification scope
For unstructured data in OSS, select an Object, a Sampling Method, a Scan Depth, and a Scan Limit.
-
Valid values for Object:
-
Global Scan: Scans your unstructured data assets in OSS.
-
Specify Scan Scope: Select the buckets that you want to scan. You can select multiple buckets.
After you specify the bucket files to scan, you can add filter conditions to define a more precise scan scope. You can specify whether to include or exclude values for Prefix, Directory, and Suffix to filter the scan scope.
-
Sampling Method: Obtains data from your unstructured data assets in OSS using the ListObjects API operation and scans the data based on your configurations.
-
Global Scan: Scans all data.
-
Specify the sampling ratio: Set Sampling Rate to scan data based on the specified ratio.
Note: For example, if you set Sampling Rate to 1/10, the first file is scanned, the next nine files are skipped, and then the 11th file is scanned.
-
Valid values for Scan Depth:
-
Global Scan: Scans the data in the full path of an asset.
-
Specify Scan Scope: Specify the depth of the bucket path. The path depth is delimited by forward slashes (/). Valid values: 1 to 10. For example, if you set the value to 5, bucket paths that are up to five levels deep are scanned.
-
Scan Limit: The default value is 200 MB, and the maximum value is 1,000 MB. If a file exceeds the scan limit, only the data up to the configured size is scanned. For example, if you set the limit to 200 MB and a file is 300 MB, only the first 200 MB is scanned and the remaining 100 MB is not.
-
AI-Powered Image Detection: If you have an available quota for AI-powered image detection, you can enable this feature. It uses large AI models to improve the accuracy of sensitive information detection in images.
-
Synchronize All Identification Results to SLS: Select whether to synchronize all identification results to Simple Log Service.
-
Synchronize Identification Results to DMS: When selected, identification results are automatically synced to column labels in Data Management (DMS) after the task completes. Make sure that the target asset is connected to DMS before enabling this option.
Unstructured data SLS identification scope
Set the Asset Scope and Time Range for Simple Log Service.
-
Valid values for Asset Scope:
-
Global Scan: Scans your unstructured data assets in Simple Log Service.
-
Specify Scan Scope: Select the Project and its Logstores that you want to scan. You can select one Project and multiple Logstores.
-
Valid values for Time Range:
-
Last 15 Minutes, Last 1 hour, Yesterday, Last 1 Day, Last 7 Days, or Last 30 Days.
-
Custom: You can select a time range in minutes with a step of 5 minutes.
Other settings
Tagging Result Overwriting
Specify how to handle sensitive data that was previously corrected. Valid values:
-
Skip Manual Tagging Result: Preserves the original manual correction results. This is the recommended option.
-
Overwrite Manual Tagging Result: Overwrites the manual correction results with the new identification results.
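The OSS scope parameters in the preceding table (Sampling Method, Scan Depth, and Scan Limit) can be illustrated with a minimal sketch. It assumes that path depth is the number of slash-delimited segments in an object key and that sampling follows the listing order. These are interpretations of the descriptions above, not documented DSC behavior, and all function names are hypothetical.

```python
def sample_keys(keys, rate_denominator):
    """Keep one object out of every N: a rate of 1/10 keeps the 1st, 11th, 21st, ..."""
    if rate_denominator < 1:
        raise ValueError("rate_denominator must be at least 1")
    return keys[::rate_denominator]


def within_depth(key, max_depth=None):
    """Check whether an object key lies within the configured Scan Depth."""
    if max_depth is None:
        return True  # Global Scan: the full path is scanned
    if not 1 <= max_depth <= 10:
        raise ValueError("Scan Depth must be between 1 and 10")
    # Assumption: depth is the count of non-empty segments split on "/".
    depth = len([part for part in key.split("/") if part])
    return depth <= max_depth


def scanned_megabytes(file_size_mb, scan_limit_mb=200):
    """Amount of a file that is scanned: data beyond the Scan Limit is skipped."""
    if not 0 < scan_limit_mb <= 1000:
        raise ValueError("Scan Limit must be between 1 and 1,000 MB")
    return min(file_size_mb, scan_limit_mb)
```

For example, with a Sampling Rate of 1/10 and 25 listed objects, the objects at positions 1, 11, and 21 are scanned.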
Edit or delete custom tasks
The task list includes columns such as Task ID, Task Name, Operator, Identification Template, Scan Status, Start Time, End Time, and Actions. The scan status can be Finished, Not Started, or Terminated. Depending on the status, the Actions column provides options such as Rescan, Details, Edit, Pause, and Terminate.
-
Edit: Reconfigures a custom identification task. You can modify all parameters.
-
Delete: Deletes custom identification tasks that you no longer need.
Manage task status
Rescan a task
If an identification model is upgraded or your database content changes and you need to view the scan results immediately, perform a rescan. A rescan performs a full scan on the target asset and starts running as soon as you trigger it, so trigger it during an off-peak period.
Before you perform a rescan, make sure that the relevant identification template is enabled.
You cannot rescan a custom identification task whose Scan Type is set to Immediate Scan.
-
On the Identification Tasks tab, perform a rescan:
-
To rescan a custom identification task, find the task in the task list and click Rescan in the Actions column.
-
To rescan a default task, click Default Tasks, find the target asset, and click Rescan in the Actions column.
-
You can view the scan progress in the Scan Status column of the identification task.
Pause or terminate a task
-
Pause: If you detect an exception in your database service, you can click Pause in the Actions column of a custom identification task to temporarily stop the running task.
-
Terminate: Stops the execution of the current and subsequent identification tasks. This operation is supported for both custom identification tasks and default tasks.
Revise identification models
The revision feature allows you to correct misidentified or missed sensitive data, enabling more precise data management and protection. DSC lets you revise and restore sensitive data identification models. You can perform the following steps:
-
On the Tasks page, click the Revision Tasks tab.
-
In the data type navigation pane on the left, click the asset type you want to revise.
-
Click Revision or Resume in the Actions column of the target sensitive data. Follow the on-screen instructions to modify the Revised Model, and then click OK.
From the Revised Model drop-down list, you can select a sensitive data category, such as Private KEY, PEM certificate, AccessKey ID, AccessKey Secret, GPS location, or Password.
The Restore operation restores the identification model to its state before the revision.
View and export identification results
In the DSC console, you can view the latest sensitive data results detected using the main identification template. For more information, see View sensitive data identification results.
Use an export task to download the sensitive data identification results of any enabled identification template, whether it is the main template or another enabled template.
The identification template and asset type that you select for an export task must have a corresponding identification task that has already completed successfully. Otherwise, the export file will be empty.
Create an export task
You can perform the following steps to create an export task and download the exported results.
-
On the Tasks page, click the Export Tasks tab.
-
On the Export Tasks tab, click Create.
-
Configure the export task and click OK.
-
In the Basic Information section, enter a task name and select the template used by the identification task.
You can select only enabled templates.
-
In the Export Dimension section, select Asset Type or Asset Instance.
-
Asset Type: Select the asset types whose results you want to export.
-
Asset Instance: Select the asset instances whose results you want to export.
-
After you create an export task, you can view its status in the export task list. The larger the amount of data, the longer the export takes.
Download exported results
After the Export Status changes to Finished, click Download in the Actions column of the export task.
After the export is complete, you must download the exported data within three days. After three days, the export task expires, and you can no longer download the exported sensitive data.
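If you track export tasks in your own tooling, the download window can be modeled as shown in the following sketch. It assumes that "three days" means 72 hours from the time the export finishes, which this topic does not specify precisely. The function names are hypothetical.

```python
from datetime import datetime, timedelta

DOWNLOAD_WINDOW = timedelta(days=3)  # assumed to be 72 hours after completion


def download_deadline(finished_at: datetime) -> datetime:
    """Latest time at which the exported results can still be downloaded."""
    return finished_at + DOWNLOAD_WINDOW


def can_download(finished_at: datetime, now: datetime) -> bool:
    """Return True while the export task has not expired."""
    return now <= download_deadline(finished_at)
```

For example, an export that finishes at 09:00 on June 1 would remain downloadable until 09:00 on June 4 under this assumption.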
Related documents
-
For more information about the identification templates used in identification tasks and the types of sensitive data that can be identified, see View and configure identification templates.
-
For the asset types that DSC supports for sensitive data identification, see Supported data asset types.
-
For common issues that may occur when you run an identification task, see Data scanning and identification.
FAQ
-
How long does it take to complete a scan after a data source is authorized?
-
How does the scan mechanism of DSC work for unstructured data sources (OSS and Simple Log Service)?
-
How am I charged for scanning unstructured data sources (OSS and Simple Log Service)?
-
Can encrypted data in an authorized data asset be identified?
-
Does a full table scan on MongoDB cause excessive I/O and affect online services?
-
Can DSC identify sensitive data in compressed packages and text documents in OSS?
-
Does DSC support exporting sensitive data identification results?
-
Can the scan results for MongoDB be accurate to specific fields?
-
Why am I unable to select the common identification template?
-
Why is my identification task always in the waiting state in the Free Edition?