LLM-Length Filter (DLC)


LLM training corpora often contain texts that are too short to carry meaningful information, or so long that they are likely malformed crawl artifacts. The LLM-Length Filter (DLC) component removes these texts from your dataset by filtering on total character count, average line length, and maximum line length. This reduces the volume of data that downstream, more expensive quality filters need to process.

The component runs on Deep Learning Containers (DLC) and reads input data from Object Storage Service (OSS). Input files must be in JSON Lines format: each line is a complete, valid JSON object, but the file as a whole is not a single valid JSON document. See the example data file for reference.
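As an illustration of the JSON Lines layout described above, the following minimal sketch parses a small in-memory sample one line at a time. The field name `text` and the helper `read_jsonl` are hypothetical, chosen only for this example; how the component itself handles blank or malformed lines is not documented here, so the blank-line tolerance below is an assumption.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            # Assumption: blank lines are skipped. The component may behave differently.
            continue
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        yield record


# Two records, one per line. A line break inside a text value is escaped as \n,
# so each record still occupies exactly one physical line of the file.
sample = io.StringIO(
    '{"text": "first document"}\n'
    '{"text": "second document\\nwith two lines"}\n'
)
records = list(read_jsonl(sample))
```

Each parsed record is an ordinary dictionary, so the field named in Target Process Field would be looked up as `record["text"]` in this hypothetical sample.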

Configure the component

On the pipeline page of Machine Learning Designer, configure the LLM-Length Filter (DLC) component using the tabs described below.

Fields setting tab

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| Target Process Field | Yes | The name of the JSON field to filter on. | |
| Whether to Filter with Text Length | No | Filters texts by total character count. When selected, configure: Minimum Length (texts shorter than this are removed) and Maximum Length (texts longer than this are removed). | Unselected |
| Whether to Filter with the Average Length of the Sample | No | Splits the text on line breaks, computes the average line length, and filters based on that value. When selected, configure: Minimum average length and Maximum Average Length. | Unselected |
| Whether to Filter with the Longest Line Length of the Sample | No | Splits the text on line breaks, computes the maximum line length, and filters based on that value. When selected, configure: Minimum length of the Longest Line and Maximum length of the Longest Line. | Unselected |
| OSS Directory for Saving OutputData | No | The OSS path where filtered output is saved. If left blank, the default workspace path is used. | Default workspace path |

Note: The three filter types are independent; enable any combination. Each enabled filter type requires its own minimum and maximum thresholds.

Tip: If you are unsure what thresholds to set, run a statistical analysis on a representative sample of your data first to understand its length distribution, then configure the ranges accordingly.
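The three length measures and the threshold check can be sketched in a few lines of Python. This is not the component's implementation, only an approximation of the behavior described above. The function names, the metric keys, the inclusive bounds, and the choice to count line-break characters in the total length are all assumptions made for illustration.

```python
import statistics


def length_metrics(text):
    """Compute the three measures the filters are based on."""
    lines = text.split("\n")  # assumption: lines are separated by "\n"
    line_lengths = [len(line) for line in lines]
    return {
        "total": len(text),  # assumption: line-break characters count
        "avg_line": sum(line_lengths) / len(line_lengths),
        "max_line": max(line_lengths),
    }


def keep(text, ranges):
    """Return True if every enabled filter accepts the text.

    `ranges` maps a metric name to an inclusive (minimum, maximum) pair and
    contains only the filters that are enabled.
    """
    metrics = length_metrics(text)
    return all(lo <= metrics[name] <= hi for name, (lo, hi) in ranges.items())


def summarize(texts, metric):
    """Profile one metric over a sample to help choose thresholds."""
    values = sorted(length_metrics(t)[metric] for t in texts)
    return {
        "min": values[0],
        "median": statistics.median(values),
        "max": values[-1],
    }


texts = [
    "ok",
    "a reasonably sized paragraph\nwith two lines",
    "x" * 500,
]

# Enable the total-length and longest-line filters; leave the average-line filter off.
ranges = {"total": (10, 200), "max_line": (5, 100)}
kept = [t for t in texts if keep(t, ranges)]
profile = summarize(texts, "total")
```

With these hypothetical thresholds, only the second text is kept: the first falls below the minimum total length and the third exceeds the maximum. Running `summarize` on a representative sample before choosing ranges is one simple way to follow the tip above.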

Tuning tab

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| Number of Processes | No | The number of parallel processes for filtering. | 8 |

Select resource group tab

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC). | |
| Dedicated resource group | No | The number of vCPUs, memory, shared memory, GPUs, and instances. | |
| Maximum Running Duration | No | The maximum time the component can run. If exceeded, the job is terminated. | |