LLM-Sensitive Keywords Filter (MaxCompute)

LLM training datasets often contain sensitive content that can compromise model safety and compliance. The LLM-Sensitive Keywords Filter (MaxCompute) component scans text samples against a keyword list and removes matches before training. Add it to a Machine Learning Designer pipeline to clean your dataset at scale. The built-in list covers 12,000+ sensitive keywords; you can also supply a custom keyword file.

How it works

  1. The component scans each text sample in the target column against the sensitive keyword list.

  2. Matched samples are marked in an optional Boolean output column (is_sensitive) and an optional keyword column (sensitive_words).

  3. A SQL WHERE clause filters out the flagged samples. The default clause is where not is_sensitive.
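The three steps above can be sketched in plain Python. This is a minimal illustration, not the component's actual implementation: it assumes simple case-sensitive substring matching, and the function name `flag_samples` and the sample data are hypothetical. Only the column names `is_sensitive` and `sensitive_words` come from the component's defaults.

```python
def flag_samples(samples, keywords):
    """Return one row per sample with a Boolean flag and the matched keywords."""
    rows = []
    for text in samples:
        # Step 1: scan the text against every keyword (substring match is an assumption).
        hits = [kw for kw in keywords if kw in text]
        # Step 2: record the flag column and the matched-keyword column.
        rows.append({"text": text, "is_sensitive": bool(hits), "sensitive_words": hits})
    return rows


samples = ["a harmless sentence", "this mentions keyword_one"]  # hypothetical data
keywords = ["keyword_one", "keyword_two"]
rows = flag_samples(samples, keywords)

# Step 3: the equivalent of the default clause `where not is_sensitive`.
kept = [row["text"] for row in rows if not row["is_sensitive"]]
print(kept)
```

Keeping the flag and keyword columns separate from the filter step mirrors the component's design: the detection results can be inspected on their own, and the filter condition can be changed without rescanning.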

Supported computing resources

MaxCompute

Configure the component

In Machine Learning Designer, open the pipeline details page, add the LLM-Sensitive Keywords Filter (MaxCompute) component to the pipeline, and set the following parameters.

Fields setting

| Parameter | Default | Description |
| --- | --- | --- |
| Select Target Column | | The columns to scan for sensitive keywords. |
| Whether to Save the Sensitive Results | | Saves detection results to the output table as two additional columns. When enabled, configure the sub-parameters below. |
| Sensitive bool value saved column name | is_sensitive | Name of the Boolean column that indicates whether a sample contains sensitive keywords. |
| Sensitive words saved column name | sensitive_words | Name of the column that stores the detected sensitive keywords. |
| SQL Script | where not is_sensitive | The WHERE clause that determines which samples to keep. If you rename the result columns, update this clause to match. |
| Sensitive Keywords File | Default keyword file | Path to a custom keyword file. Leave blank to use the built-in list of 12,000+ keywords. |
| Output table lifecycle | 28 | Number of days before temporary output tables are recycled. Must be a positive integer. Unit: days. |

Custom keyword file format

Each line in the file contains one keyword:

keyword_one
keyword_two
keyword_three
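A file in this format can be checked locally before you supply it to the component. The sketch below is only an illustration: the helper `load_keywords`, the file name `keywords.txt`, the UTF-8 encoding, and the choice to skip blank lines are all assumptions, not documented behavior of the component.

```python
import tempfile
from pathlib import Path


def load_keywords(path):
    """Read one keyword per line, trimming whitespace and skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Write the three-line example to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    keyword_file = Path(tmp) / "keywords.txt"  # hypothetical file name
    keyword_file.write_text("keyword_one\nkeyword_two\nkeyword_three\n", encoding="utf-8")
    keywords = load_keywords(keyword_file)

print(keywords)
```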

If you rename the result columns from their defaults, update the SQL Script field to reference the new column names.
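As a small illustration of keeping the two settings in sync, the helper below derives the SQL Script value from the Boolean column name. The function `build_where_clause` and the column name `has_blocked_term` are hypothetical; only the default `where not is_sensitive` comes from the component.

```python
def build_where_clause(flag_column="is_sensitive"):
    """Build the SQL Script value for a given Boolean result column name."""
    return f"where not {flag_column}"


default_clause = build_where_clause()
renamed_clause = build_where_clause("has_blocked_term")  # hypothetical column name
print(default_clause)
print(renamed_clause)
```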

Tuning

| Parameter | Default | Valid range | When to adjust |
| --- | --- | --- | --- |
| Number of CPUs per instance of map task | 100 | 50–800 | Increase if map tasks are slow on text-heavy samples. |
| The memory size per instance of map task | 1024 MB | 256–12288 MB | Increase if map tasks run out of memory for large samples. |
| The maximum size of input data for a map | 256 MB | 1–Integer.MAX_VALUE MB | Decrease to split large inputs into smaller map tasks; increase to reduce task overhead on small inputs. |
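To get a feel for how the maximum input size per map affects parallelism, the sketch below estimates the number of map tasks for a given input. It assumes, as a rough approximation rather than documented scheduler behavior, that the input is divided into splits of at most that size with one map task per split. The function name `estimate_map_tasks` and the 10 GB input are hypothetical.

```python
import math


def estimate_map_tasks(input_size_mb, split_size_mb=256):
    """Rough map-task count, assuming one task per split of at most split_size_mb."""
    if split_size_mb < 1:
        raise ValueError("split size must be at least 1 MB")
    return math.ceil(input_size_mb / split_size_mb)


tasks_default = estimate_map_tasks(10_240)      # 10 GB of input, default 256 MB splits
tasks_smaller = estimate_map_tasks(10_240, 64)  # same input, 64 MB splits
print(tasks_default, tasks_smaller)  # 40 160
```

Under this approximation, quartering the split size quadruples the number of map tasks, which helps with large inputs but adds scheduling overhead on small ones.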