LLM - Sensitive Information Masking (DLC)

The LLM - Sensitive Information Masking (DLC) component replaces sensitive information with placeholders such as [EMAIL], [TELEPHONE], [MOBILEPHONE], and [IDNUM]. The input OSS data file must be in JSON Lines (JSONL) format, where each line is a standalone JSON object.
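As an illustration of the expected input shape, the following minimal sketch parses a small JSONL sample and rejects any line that is not a standalone JSON object. The field names `id` and `content` and the helper `parse_jsonl` are hypothetical and are not part of the component.

```python
import json

# Hypothetical JSONL input: one JSON object per line.
# The field names "id" and "content" are assumptions for illustration only.
SAMPLE_JSONL = (
    '{"id": 1, "content": "Contact: someone@example.com"}\n'
    '{"id": 2, "content": "No sensitive data here."}\n'
)

def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, rejecting lines that are not JSON objects."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        obj = json.loads(line)
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_no} is not a JSON object")
        records.append(obj)
    return records

records = parse_jsonl(SAMPLE_JSONL)
```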

Supported computing resources

DLC

How it works

The component detects and masks the following types of sensitive information:

  • Mobile phone numbers: Strings that match the following regular expressions are replaced with [MOBILEPHONE].

    • r'(?<!\d)(1(3[0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|8[0-9]|9[89])\d{8})(?!\d)'

    • r'(?<!\d)(1[\d]{2}-\d{4}-\d{4}\D|\D1\d{10}\D|\D1[\d]{2} \d{4} \d{4})(?!\d)'

    • r'(?<!\d)(1[3-9]\d{9})(?!\d)'

  • Landline phone numbers: Strings that match the following regular expression are replaced with [TELEPHONE].

    • r'(?<!\d)(\(?0\d{2,3}[-\s)]?\d{7,8})(?!\d)'

  • Email addresses: Strings that match the following regular expression are replaced with [EMAIL].

    • r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+'

  • ID card numbers: Strings that match the following regular expressions are replaced with [IDNUM].

    • r'(?<!\d)([1-6]\d{5}[12]\d{3}(0[1-9]|1[12])(0[1-9]|1[0-9]|2[0-9]|3[01])\d{3}(\d|X|x))(?!\d)'

    • r'(?<!\d)([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])(?!\d)'
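The rules above can be approximated locally with Python's `re` module. This is a minimal sketch, not the component's implementation: the patterns are copied verbatim from the list, while the order in which they are applied (ID card numbers first, email addresses last) and the names `RULES` and `mask_sensitive` are assumptions.

```python
import re

# Patterns copied verbatim from the documented list.
# The application order is an assumption; the component's actual order is not documented.
RULES = [
    ("[IDNUM]", [
        r'(?<!\d)([1-6]\d{5}[12]\d{3}(0[1-9]|1[12])(0[1-9]|1[0-9]|2[0-9]|3[01])\d{3}(\d|X|x))(?!\d)',
        r'(?<!\d)([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])(?!\d)',
    ]),
    ("[MOBILEPHONE]", [
        r'(?<!\d)(1(3[0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|8[0-9]|9[89])\d{8})(?!\d)',
        r'(?<!\d)(1[\d]{2}-\d{4}-\d{4}\D|\D1\d{10}\D|\D1[\d]{2} \d{4} \d{4})(?!\d)',
        r'(?<!\d)(1[3-9]\d{9})(?!\d)',
    ]),
    ("[TELEPHONE]", [
        r'(?<!\d)(\(?0\d{2,3}[-\s)]?\d{7,8})(?!\d)',
    ]),
    ("[EMAIL]", [
        r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+',
    ]),
]

def mask_sensitive(text):
    """Replace every match of each documented pattern with its placeholder."""
    for placeholder, patterns in RULES:
        for pattern in patterns:
            text = re.sub(pattern, placeholder, text)
    return text
```

Calling `mask_sensitive("Author: Kepoweran <xxx@gmail.com>")` returns the string with the address replaced by `[EMAIL]`. Note that the `.` after the domain label in the email pattern is unescaped as listed, so it matches any single character rather than only a literal dot.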

For example, to mask an email address:

Before

The target field value is a JavaScript code snippet for a Select2 Malay translation plugin. Its Author line contains an email address (xxx@gmail.com), which the component masks.

After

/**
 * Select2 Malay translation.
 *
 * Author: Kepoweran <[EMAIL]>
 */
(function ($) {
    "use strict";
    // ...
})(jQuery);

Configure the component

On the Designer workflow page, add the LLM - Sensitive Information Masking (DLC) component and configure its parameters in the right pane.

| Parameter type | Parameter | Required | Description | Default |
| --- | --- | --- | --- | --- |
| Field settings | Target field | Yes | The field to process. | None |
| Field settings | Data output OSS directory | No | The OSS directory for the processed data. If empty, the component uses the default workspace path. | None |
| Execution tuning | Number of processes | No | The number of processes to use for parallel execution. | 8 |
| Execution tuning | Select resource group: Public resource group | No | Specify the node specifications (CPU or GPU instance types), the number of nodes, and the VPC. | None |
| Execution tuning | Select resource group: Dedicated resource group | No | Specify the number of CPU cores, amount of memory, shared memory size, number of GPUs, and number of nodes. | None |
| Execution tuning | Maximum runtime | No | The maximum runtime of the component. The system terminates the job if this limit is exceeded. | None |
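To make the Target field and Number of processes parameters concrete, the sketch below masks only one field of each JSONL record and spreads the lines across worker processes. It is an assumption-laden illustration, not the component's code: `mask_line`, `mask_lines`, and the field names are hypothetical, and only the documented email pattern is applied as a stand-in for the full rule set.

```python
import json
import re
from functools import partial
from multiprocessing import Pool

# Email pattern copied from the documented list; a stand-in for the full rule set.
EMAIL = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+')

def mask_line(line, target_field):
    """Mask only the target field of one JSONL line; other fields pass through unchanged."""
    record = json.loads(line)
    value = record.get(target_field)
    if isinstance(value, str):
        record[target_field] = EMAIL.sub("[EMAIL]", value)
    return json.dumps(record, ensure_ascii=False)

def mask_lines(lines, target_field, processes=8):
    """Distribute lines across worker processes (8 mirrors the documented default)."""
    with Pool(processes=processes) as pool:
        return pool.map(partial(mask_line, target_field=target_field), lines)

if __name__ == "__main__":
    # Hypothetical record: only "content" is selected as the target field.
    sample = ['{"content": "mail a@b.com", "author": "a@b.com"}']
    print(mask_lines(sample, target_field="content", processes=2))
```

In this sketch a field that is not selected as the target is left untouched, even if it contains text that would match a pattern.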