The LLM-Sensitive Content Mask (MaxCompute) component masks sensitive information in text data used to train large language models (LLMs). It detects personally identifiable details and replaces them with fixed tokens while leaving the surrounding text unchanged, so downstream LLM training pipelines receive sanitized text.
Limitations
- This component requires MaxCompute as the compute engine.
How it works
The component scans each selected text column using regular expressions and replaces matching strings with a predefined token. The rest of the text is left unchanged.
The following table describes each supported entity type, its replacement token, and the regular expressions used for detection.
| Entity type | Replacement token | Regular expressions |
|---|---|---|
| Mobile phone numbers | [MOBILEPHONE] | `r'(?<!\d)(1(3[0-9]\|4[579]\|5[0-3,5-9]\|6[6]\|7[0135678]\|8[0-9]\|9[89])\d{8})(?!\d)'`<br>`r'(?<!\d)(1[\d]{2}-\d{4}-\d{4}\D\|\D1\d{10}\D\|\D1[\d]{2} \d{4} \d{4})(?!\d)'`<br>`r'(?<!\d)(1[3-9]\d{9})(?!\d)'` |
| Landline phone numbers | [TELEPHONE] | `r'(?<!\d)(\\(?0\d{2,3}[-\s)]?\d{7,8})(?!\d)'` |
| Email addresses | [EMAIL] | `r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+'` |
| China resident identity card (PRC) numbers | [IDNUM] | `r'(?<!\d)([1-6]\d{5}[12]\d{3}(0[1-9]\|1[12])(0[1-9]\|1[0-9]\|2[0-9]\|3[01])\d{3}(\d\|X\|x))(?!\d)'`<br>`r'(?<!\d)([1-9]\d{5}[12]\d{3}(0[1-9]\|1[012])(0[1-9]\|[12][0-9]\|3[01])\d{3}[0-9xX])(?!\d)'` |
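Most of the patterns above are wrapped in `(?<!\d)` and `(?!\d)`. The following minimal Python sketch illustrates what those lookarounds do, using the simplest mobile phone pattern from the table. The helper name `find_mobile_numbers` is hypothetical and is not part of the component.

```python
import re

# Simplest mobile phone pattern listed in the table, copied as-is.
MOBILE = re.compile(r'(?<!\d)(1[3-9]\d{9})(?!\d)')

def find_mobile_numbers(text: str) -> list[str]:
    """Return every substring matched by the mobile phone pattern."""
    return [m.group(1) for m in MOBILE.finditer(text)]

# A standalone 11-digit number is matched.
print(find_mobile_numbers("call 13812345678 now"))    # ['13812345678']

# The same digits inside a longer digit run are not matched, because the
# lookbehind and lookahead reject candidates that touch another digit.
print(find_mobile_numbers("order 9913812345678001"))  # []
```

The lookarounds act as digit boundaries, so an 11-digit sequence inside a longer number, such as an order ID or an 18-digit identity card number, is not treated as a mobile phone number.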
Example: masking sensitive content in a sentence
Input:
Contact zhang.wei@company.com or call 13812345678 for assistance.
Output:
Contact [EMAIL] or call [MOBILEPHONE] for assistance.
The same masking rules apply to all selected columns. Each text column is scanned independently and all matching strings in that column are replaced.
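The behavior described above can be approximated locally with a short Python sketch. It is an illustration under stated assumptions, not the component's implementation: it uses one pattern per entity type, applies the rules in an order chosen here (the actual order is not documented), represents table rows as plain dictionaries, and writes the landline pattern with a single backslash before the optional parenthesis, on the assumption that the doubled backslash in the table is an escaping artifact. The names `RULES`, `mask_text`, and `mask_rows` are hypothetical.

```python
import re

# (token, pattern) pairs. The ID, email, and mobile patterns are taken from
# the table; the landline pattern is an assumed variant (single backslash).
RULES = [
    ("[IDNUM]", re.compile(
        r'(?<!\d)([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])(?!\d)')),
    ("[EMAIL]", re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+')),
    ("[MOBILEPHONE]", re.compile(r'(?<!\d)(1[3-9]\d{9})(?!\d)')),
    ("[TELEPHONE]", re.compile(r'(?<!\d)(\(?0\d{2,3}[-\s)]?\d{7,8})(?!\d)')),
]

def mask_text(text: str) -> str:
    """Replace every match of every rule with that rule's token."""
    for token, pattern in RULES:
        text = pattern.sub(token, text)
    return text

def mask_rows(rows: list[dict], target_columns: set[str]) -> list[dict]:
    """Mask only the selected string columns; other columns pass through."""
    return [
        {
            col: mask_text(val) if col in target_columns and isinstance(val, str) else val
            for col, val in row.items()
        }
        for row in rows
    ]

rows = [{"id": 1,
         "question": "Contact zhang.wei@company.com or call 13812345678 for assistance.",
         "note": "backup line 010-12345678"}]
print(mask_rows(rows, {"question"}))
```

In this sketch only the `question` column is selected, so the `note` column keeps its landline number. Selecting both columns would apply the same rules to each column independently.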
Configure the component
Configure the LLM-Sensitive Content Mask (MaxCompute) component in Machine Learning Designer. The following tables describe the parameters.
Fields setting
| Parameter | Required | Description | Default value |
|---|---|---|---|
| Select Target Column | Yes | The text columns to process. Multiple columns can be selected. The same masking rules apply to all selected columns. | None |
| Output table lifecycle | No | How long (in days) temporary tables generated by the component are retained before being automatically recycled. Must be a positive integer. | 28 |
Tuning
| Parameter | Required | Description | Default value |
|---|---|---|---|
| Number of CPUs per instance of map task | No | Number of CPUs allocated to each map task instance. Valid values: 50–800. | 100 |
| The memory size per instance of map task | No | Memory allocated to each map task instance, in MB. Valid values: 256–12,288. | 1024 |
| The maximum size of input data for a map | No | Maximum amount of data each map task instance processes, in MB. Valid values: 1–2,147,483,647. Reduce this value if you encounter memory errors; increase it to reduce the number of map tasks for small datasets. | 256 |
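To reason about the maximum input size per map task, a rough estimate of the resulting task count can help. The sketch below assumes the input is divided evenly into chunks no larger than the configured size. The actual number of map tasks is decided by MaxCompute and may differ, and the helper name `estimate_map_tasks` is hypothetical.

```python
import math

def estimate_map_tasks(total_input_mb: float, max_input_mb_per_map: int = 256) -> int:
    """Rough estimate: one map task per chunk of at most max_input_mb_per_map MB."""
    if not 1 <= max_input_mb_per_map <= 2_147_483_647:
        raise ValueError("max_input_mb_per_map must be between 1 and 2,147,483,647")
    return max(1, math.ceil(total_input_mb / max_input_mb_per_map))

# Estimated task counts for a 10 GB (10,240 MB) input at three settings.
for split_mb in (64, 256, 1024):
    print(split_mb, estimate_map_tasks(10 * 1024, split_mb))
# 64 160
# 256 40
# 1024 10
```

A smaller value gives each map task less data to hold at once at the cost of more tasks, while a larger value reduces the task count.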
What's next
For an overview of Machine Learning Designer and how to build pipelines with components, see Overview of Machine Learning Designer.