LLM training datasets often contain sensitive content that can compromise model safety and compliance. The LLM-Sensitive Keywords Filter (MaxCompute) component scans text samples against a keyword list and removes matches before training. Add it to a Machine Learning Designer pipeline to clean your dataset at scale. The built-in list covers 12,000+ sensitive keywords; you can also supply a custom keyword file.
How it works
The component scans each text sample in the target column against the sensitive keyword list.
Matched samples are marked in an optional Boolean output column (is_sensitive) and an optional keyword column (sensitive_words).
A SQL WHERE clause filters out the flagged samples. The default clause is where not is_sensitive.
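The flow above can be sketched in a few lines of Python. This is only an illustration of the scan, mark, and filter steps: the function names are hypothetical, and plain substring matching is an assumption, since the component's actual matching rules are not described here.

```python
from typing import Dict, Iterable, List


def scan_sample(text: str, keywords: Iterable[str]) -> Dict[str, object]:
    """Build the two result columns for one text sample."""
    # Assumption: simple substring matching; the real matcher may differ.
    hits = [kw for kw in keywords if kw and kw in text]
    return {"text": text, "is_sensitive": bool(hits), "sensitive_words": hits}


def filter_samples(samples: Iterable[str], keywords: Iterable[str]) -> List[Dict[str, object]]:
    """Scan every sample, then keep only the rows that were not flagged."""
    keyword_list = list(keywords)
    scanned = [scan_sample(sample, keyword_list) for sample in samples]
    # Same intent as the default clause: where not is_sensitive
    return [row for row in scanned if not row["is_sensitive"]]


samples = ["a harmless sentence", "this one mentions keyword_one"]
kept = filter_samples(samples, ["keyword_one", "keyword_two"])
print([row["text"] for row in kept])  # ['a harmless sentence']
```

Only the unflagged sample survives, which mirrors what the default WHERE clause does to the output table.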
Supported computing resources
Configure the component
In Machine Learning Designer, open the pipeline details page, add the LLM-Sensitive Keywords Filter (MaxCompute) component to the pipeline, and set the following parameters.
Fields setting
| Parameter | Default | Description |
|---|---|---|
| Select Target Column | — | The column to scan for sensitive keywords. |
| Whether to Save the Sensitive Results | — | Saves detection results to the output table as two additional columns. When enabled, configure the sub-parameters below. |
| Sensitive bool value saved column name | is_sensitive | Name of the Boolean column that indicates whether a sample contains sensitive keywords. |
| Sensitive words saved column name | sensitive_words | Name of the column that stores the detected sensitive keywords. |
| SQL Script | where not is_sensitive | The WHERE clause that determines which samples to keep. If you rename the result columns, update this clause to match. |
| Sensitive Keywords File | Default keyword file | Path to a custom keyword file. Leave blank to use the built-in list of 12,000+ keywords. |
| Output table lifecycle | 28 | Number of days before temporary output tables are recycled. Must be a positive integer. |
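
To illustrate how the SQL Script value depends on the Boolean column name, the sketch below applies a where not clause to a renamed column. It uses Python's built-in sqlite3 module as a stand-in for MaxCompute SQL, which is an assumption made purely for local experimentation; the table name scanned and the column name has_banned_terms are hypothetical.

```python
import sqlite3

# Hypothetical renamed result column; the default name is is_sensitive.
BOOL_COLUMN = "has_banned_terms"
# The SQL Script value must reference the renamed column.
SQL_SCRIPT = f"where not {BOOL_COLUMN}"

conn = sqlite3.connect(":memory:")
conn.execute(f"create table scanned (content text, {BOOL_COLUMN} integer)")
conn.executemany(
    "insert into scanned values (?, ?)",
    [("clean sample", 0), ("flagged sample", 1)],
)

# SQLite stands in for MaxCompute SQL here and stores Booleans as 0/1.
kept_rows = [row[0] for row in conn.execute(f"select content from scanned {SQL_SCRIPT}")]
conn.close()
print(kept_rows)  # ['clean sample']
```

Leaving the clause as where not is_sensitive after renaming the column would reference a column that no longer exists, so the clause and the column name need to change together.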
Custom keyword file format
Each line in the file contains one keyword:
keyword_one
keyword_two
keyword_three

If you rename the result columns from their defaults, update the SQL Script field to reference the new column names.
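
A custom keyword file in the one-keyword-per-line format can be generated with a short script. The sketch below is a minimal example under a few assumptions: the file name custom_keywords.txt is hypothetical, UTF-8 encoding is assumed, and trimming whitespace and dropping blanks and duplicates is a precaution chosen here, not documented behavior of the component.

```python
import tempfile
from pathlib import Path
from typing import List


def write_keyword_file(path: Path, keywords: List[str]) -> List[str]:
    """Write one keyword per line, dropping blank entries and duplicates."""
    cleaned: List[str] = []
    seen = set()
    for keyword in keywords:
        keyword = keyword.strip()
        if keyword and keyword not in seen:
            seen.add(keyword)
            cleaned.append(keyword)
    # Assumption: UTF-8 text with a trailing newline.
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def read_keyword_file(path: Path) -> List[str]:
    """Read the file back, ignoring empty lines."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    keyword_path = Path(tmp) / "custom_keywords.txt"  # hypothetical file name
    raw = ["keyword_one", " keyword_two ", "", "keyword_one", "keyword_three"]
    write_keyword_file(keyword_path, raw)
    loaded = read_keyword_file(keyword_path)

print(loaded)  # ['keyword_one', 'keyword_two', 'keyword_three']
```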
Tuning
| Parameter | Default | Valid range | When to adjust |
|---|---|---|---|
| Number of CPUs per instance of map task | 100 | 50–800 | Increase if map tasks are slow on text-heavy samples. |
| The memory size per instance of map task | 1024 MB | 256–12288 MB | Increase if map tasks run out of memory for large samples. |
| The maximum size of input data for a map | 256 MB | 1–Integer.MAX_VALUE MB | Decrease to split large inputs into smaller map tasks; increase to reduce task overhead on small inputs. |
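
As a rough guide to the last parameter, the number of map tasks scales with the input size divided by the maximum input size per map. The helper below is a back-of-the-envelope estimate only: the function name is hypothetical, and actual split planning in MaxCompute may depend on file layout and other factors not modeled here.

```python
import math


def estimate_map_tasks(input_mb: float, max_input_per_map_mb: int = 256) -> int:
    """Approximate the map task count as ceil(input size / split size)."""
    if max_input_per_map_mb < 1:
        raise ValueError("maximum input size per map must be at least 1 MB")
    # Assumption: inputs split evenly; real planning may differ.
    return max(1, math.ceil(input_mb / max_input_per_map_mb))


# A hypothetical 10 GB (10,240 MB) input at three split sizes.
for split_mb in (256, 128, 64):
    print(split_mb, estimate_map_tasks(10_240, split_mb))
# 256 40
# 128 80
# 64 160
```

Halving the split size roughly doubles the number of map tasks, which is the trade-off between parallelism and per-task overhead described in the table.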