Description of the LLM-Length Filter (DLC) component - Platform for AI (PAI)

LLM training corpora often contain texts that are too short to carry meaningful information, or so long that they likely indicate malformed crawl artifacts. The LLM-Length Filter (DLC) component removes these texts from your dataset by filtering on total character count, average line length, and maximum line length. This reduces the volume of data that downstream, more expensive quality filters need to process.

The component runs on Deep Learning Containers (DLC) and reads input data from Object Storage Service (OSS). Input files must be in JSON Lines format: each line is a self-contained, valid JSON object, so the file as a whole is not a single valid JSON document. See the example data file for reference.
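To make the JSON Lines requirement concrete, the following is a minimal Python sketch that parses such input line by line. The helper name `read_jsonl` and the field names `id` and `text` are assumptions for illustration only; they are not defined by the component.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts.

    Blank lines are skipped. A line that is not a JSON object raises ValueError.
    """
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        records.append(obj)
    return records


# Hypothetical sample data: one JSON object per line.
sample = [
    '{"id": 1, "text": "first sample"}',
    '{"id": 2, "text": "second sample\\nwith two lines"}',
]
records = read_jsonl(sample)

# Parsing the whole file as one JSON document fails, which is why the
# file must be read one line at a time.
try:
    json.loads("\n".join(sample))
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```

Note that a line break inside a text value is stored as the escape sequence `\n` in the file, so each record still occupies exactly one physical line.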

Configure the component

On the pipeline page of Machine Learning Designer, configure the LLM-Length Filter (DLC) component using the tabs described below.

Fields setting tab

Parameter	Required	Description	Default
Target Process Field	Yes	The name of the JSON field to filter on.	—
Whether to Filter with Text Length	No	Filters texts by total character count. When selected, configure: Minimum Length (texts shorter than this are removed) and Maximum Length (texts longer than this are removed).	Unselected
Whether to Filter with the Average Length of the Sample	No	Splits the text on line breaks, computes the average line length, and removes texts whose average falls outside the configured range. When selected, configure: Minimum average length and Maximum Average Length.	Unselected
Whether to Filter with the Longest Line Length of the Sample	No	Splits the text on line breaks, computes the maximum line length, and removes texts whose longest line falls outside the configured range. When selected, configure: Minimum length of the Longest Line and Maximum length of the Longest Line.	Unselected
OSS Directory for Saving OutputData	No	The OSS path where filtered output is saved. If left blank, the default workspace path is used.	Default workspace path

Note: The three filter types are independent, so you can enable any combination of them. Each enabled filter type requires its own minimum and maximum thresholds.
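The three filters described above can be sketched in Python as follows. This is an illustration only, not the component's implementation. It assumes that length is counted in Python characters, that lines are split on `\n`, that both bounds are inclusive, and that a sample must pass every enabled filter to be kept. The function and argument names are hypothetical.

```python
def length_stats(text):
    """Compute the three length measures for one text."""
    line_lengths = [len(line) for line in text.split("\n")]
    return {
        "total": len(text),                                 # total character count
        "avg_line": sum(line_lengths) / len(line_lengths),  # average line length
        "max_line": max(line_lengths),                      # longest line length
    }


def keep_sample(text, total=None, avg_line=None, max_line=None):
    """Return True if the text passes every enabled filter.

    Each argument is an optional (minimum, maximum) pair.
    None leaves that filter disabled.
    """
    stats = length_stats(text)
    ranges = {"total": total, "avg_line": avg_line, "max_line": max_line}
    for name, bounds in ranges.items():
        if bounds is None:
            continue
        low, high = bounds
        if not (low <= stats[name] <= high):
            return False
    return True


text = "abc\ndefgh"  # 9 characters; line lengths 3 and 5
stats = length_stats(text)
kept = keep_sample(text, total=(5, 20), max_line=(1, 10))
```

Here `stats` holds a total of 9, an average line length of 4.0, and a longest line of 5, so the sample is kept under the ranges shown.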

Tip: If you are unsure what thresholds to set, run a statistical analysis on a representative sample of your data first to understand its length distribution, then configure the ranges accordingly.
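One way to run such an analysis is to compute percentiles of the text lengths in a sample and use them as a starting range. The sketch below uses a nearest-rank percentile. The helper names and the 10th and 90th percentile cut-offs are arbitrary assumptions chosen for illustration, not recommendations from the component.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_range(lengths, low_pct=10, high_pct=90):
    """Suggest (minimum, maximum) thresholds from observed lengths."""
    return percentile(lengths, low_pct), percentile(lengths, high_pct)


# Hypothetical total character counts from a data sample,
# including one very long outlier.
lengths = [3, 10, 12, 15, 18, 20, 22, 25, 30, 5000]
minimum, maximum = suggest_range(lengths)
```

With this sample the suggested range is 3 to 30, which excludes the 5000-character outlier. The same approach applies to average line length and longest line length.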

Tuning tab

Parameter	Required	Description	Default
Number of Processes	No	The number of parallel processes for filtering.	8

Select resource group tab

Parameter	Required	Description	Default
Public Resource Group	No	Settings for running on a public resource group: the instance type (CPU or GPU), number of instances, and virtual private cloud (VPC).	—
Dedicated resource group	No	Settings for running on a dedicated resource group: the number of vCPUs, memory, shared memory, GPUs, and instances.	—
Maximum Running Duration	No	The maximum time the component can run. If exceeded, the job is terminated.	—