LLM-Sensitive Keywords Filter (MaxCompute)

LLM training datasets often contain sensitive content that can compromise model safety and compliance. The LLM-Sensitive Keywords Filter (MaxCompute) component scans text samples against a keyword list and removes the samples that match before training. Add it to a Machine Learning Designer pipeline to clean your dataset at scale. The built-in list covers more than 12,000 sensitive keywords; you can also supply a custom keyword file.

How it works

The component scans each text sample in the target column against the sensitive keyword list.
Matched samples are marked in an optional Boolean output column (`is_sensitive`) and an optional keyword column (`sensitive_words`).
A SQL `WHERE` clause filters out the flagged samples. The default clause is `where not is_sensitive`.
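
The steps above can be sketched in plain Python. This is only an illustration: the component's actual matching algorithm is not described here, so plain substring matching is an assumption, and `scan_sample` and `filter_samples` are hypothetical names rather than part of the component.

```python
def scan_sample(text, keywords):
    """Flag one sample. Substring matching is an assumption, not documented behavior."""
    hits = [kw for kw in keywords if kw and kw in text]
    return {"text": text, "is_sensitive": bool(hits), "sensitive_words": hits}


def filter_samples(samples, keywords):
    """Keep only unflagged rows, mirroring the default clause `where not is_sensitive`."""
    keywords = list(keywords)
    rows = [scan_sample(sample, keywords) for sample in samples]
    return [row for row in rows if not row["is_sensitive"]]


kept = filter_samples(
    ["a harmless sentence", "this one contains keyword_one"],
    ["keyword_one", "keyword_two"],
)
print([row["text"] for row in kept])  # ['a harmless sentence']
```

The two dictionary keys correspond to the optional result columns, and the final list comprehension plays the role of the SQL filter.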

Supported computing resources

MaxCompute

Configure the component

In Machine Learning Designer, open the pipeline details page, add the LLM-Sensitive Keywords Filter (MaxCompute) component to the pipeline, and set the following parameters.

Fields setting

Parameter	Default	Description
Select Target Column	—	The column to scan for sensitive keywords.
Whether to Save the Sensitive Results	—	Saves detection results to the output table as two additional columns. When enabled, configure the sub-parameters below.
Sensitive bool value saved column name	`is_sensitive`	Name of the Boolean column that indicates whether a sample contains sensitive keywords.
Sensitive words saved column name	`sensitive_words`	Name of the column that stores the detected sensitive keywords.
SQL Script	`where not is_sensitive`	The `WHERE` clause that determines which samples to keep. If you rename the result columns, update this clause to match.
Sensitive Keywords File	Default keyword file	Path to a custom keyword file. Leave blank to use the built-in list of 12,000+ keywords.
Output table lifecycle	`28`	Number of days before temporary output tables are recycled. Must be a positive integer.

Custom keyword file format

Each line in the file contains one keyword:

keyword_one
keyword_two
keyword_three
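
A file in this format could be produced with a small script such as the sketch below. The helper names are hypothetical, and the UTF-8 encoding and the removal of blank lines and duplicates are precautions assumed here, not documented requirements of the component.

```python
from pathlib import Path


def write_keyword_file(keywords, path):
    """Write one keyword per line, dropping blanks and duplicates while keeping order."""
    seen = set()
    cleaned = []
    for kw in keywords:
        kw = kw.strip()
        if kw and kw not in seen:
            seen.add(kw)
            cleaned.append(kw)
    # UTF-8 is an assumption; the expected encoding is not stated in this document.
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def read_keyword_file(path):
    """Read the file back as a list of keywords, one per non-empty line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Calling `write_keyword_file([" keyword_one ", "", "keyword_two", "keyword_one"], "keywords.txt")` would write two lines, `keyword_one` and `keyword_two`.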

If you rename the result columns from their defaults, update the SQL Script field to reference the new column names.
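
As an illustration of keeping the clause in sync with a renamed column, the hypothetical helper below derives the SQL Script value from the Boolean column name. The identifier check is an assumption about acceptable column names, not a rule taken from the component.

```python
import re

# Assumed shape of a safe column name: letters, digits, and underscores,
# not starting with a digit.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_where_clause(bool_column="is_sensitive"):
    """Return a clause that keeps only samples not flagged in `bool_column`."""
    if not _IDENTIFIER.match(bool_column):
        raise ValueError(f"unexpected column name: {bool_column!r}")
    return f"where not {bool_column}"


print(build_where_clause())                    # where not is_sensitive
print(build_where_clause("has_sensitive_kw"))  # where not has_sensitive_kw
```

Here `has_sensitive_kw` is a made-up column name; the point is that the clause must name whatever value is set in the Sensitive bool value saved column name field.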

Tuning

Parameter	Default	Valid range	When to adjust
Number of CPUs per instance of map task	`100`	50–800	Increase if map tasks are slow on text-heavy samples.
The memory size per instance of map task	`1024` MB	256–12288 MB	Increase if map tasks run out of memory for large samples.
The maximum size of input data for a map	`256` MB	1–2147483647 MB (Integer.MAX_VALUE)	Decrease to split large inputs into smaller map tasks; increase to reduce task overhead on small inputs.
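
To get a feel for the last parameter, the sketch below estimates how many map tasks a given input would be split into. It assumes the task count is roughly the input size divided by the maximum input size per map, rounded up; the actual split planning in MaxCompute may differ, and `estimate_map_tasks` is a hypothetical helper.

```python
import math


def estimate_map_tasks(total_input_mb, max_input_per_map_mb=256):
    """Rough estimate: ceil(total input / maximum input per map task)."""
    if total_input_mb < 0:
        raise ValueError("total_input_mb must be non-negative")
    if max_input_per_map_mb < 1:
        raise ValueError("max_input_per_map_mb must be at least 1")
    return math.ceil(total_input_mb / max_input_per_map_mb)


# A 10 GB (10240 MB) input under three different settings.
for split_mb in (64, 256, 1024):
    print(split_mb, estimate_map_tasks(10240, split_mb))
# 64 160
# 256 40
# 1024 10
```

Under this approximation, a smaller setting yields more, shorter map tasks, while a larger setting yields fewer tasks that each process more data and may need more memory per instance.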