LLM-Count Filter (DLC)

The LLM-Count Filter (DLC) component filters text samples based on the number or ratio of digits and letters. The input Object Storage Service (OSS) data file must be in the JSON Lines format: each line is a valid JSON object, but the file as a whole is not a single valid JSON object. For more information, see Example.
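As an illustration of the JSON Lines requirement, the following minimal Python sketch parses such input one line at a time. The function name `read_jsonl` and the field name `text` are hypothetical and chosen only for this example; skipping blank lines is also an assumption, not documented behavior of the component.

```python
import json


def read_jsonl(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated and skipped
        obj = json.loads(line)
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_no} is not a JSON object")
        records.append(obj)
    return records


# Two samples, one JSON object per line ("text" is a hypothetical field name).
sample = '{"text": "Hello world 123"}\n{"text": "!!! ???"}\n'
records = read_jsonl(sample)
```

Passing the whole `sample` string to `json.loads` raises `json.JSONDecodeError`, which is why the file must be read line by line.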

Supported computing resources

Deep Learning Containers (DLC)

Algorithm description

This component supports the following features:

  • Filter text based on the number or ratio of digits and letters

    The algorithm counts the digits and letters in the text and filters the text based on the threshold value.

  • Filter text based on the ratio of letters to text tokens

    The algorithm tokenizes the text by using the pythia-6.9b-deduped model, calculates the ratio of digits and letters to tokens, and filters the text accordingly.
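The two measures described above can be sketched in Python as follows. This is not the component's implementation. It assumes that `str.isalnum` and `str.isalpha` are acceptable definitions of "digits and letters" and "letters", that the character ratio is taken over the full text length, and that the token-based measure counts letters only (following the parameter name; the description also mentions digits). A whitespace tokenizer stands in for the pythia-6.9b-deduped tokenizer so that the example runs without downloading a model.

```python
def alnum_count(text):
    """Number of characters that are letters or digits (Unicode-aware)."""
    return sum(1 for ch in text if ch.isalnum())


def alnum_ratio(text):
    """Share of letters and digits among all characters; 0.0 for empty text."""
    return alnum_count(text) / len(text) if text else 0.0


def alpha_token_ratio(text, tokenize=str.split):
    """Letters per token. `tokenize` is a stand-in (whitespace split by
    default); the component itself uses the pythia-6.9b-deduped tokenizer."""
    tokens = tokenize(text)
    alpha = sum(1 for ch in text if ch.isalpha())
    return alpha / len(tokens) if tokens else 0.0


text = "abc 123!"
count = alnum_count(text)              # letters and digits: a, b, c, 1, 2, 3
char_ratio = alnum_ratio(text)         # 6 of 8 characters
token_ratio = alpha_token_ratio(text)  # 3 letters over 2 whitespace tokens
```

Any callable that maps a string to a list of tokens can be passed as `tokenize`, so a real subword tokenizer can replace the whitespace split without changing the rest of the sketch.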

Configure the component

On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-Count Filter (DLC) component.

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Target Process Field | Yes | The field to process. | N/A |
| Fields Setting | Whether to Filter with AlphaNumeric Count or Ratio | No | Specifies whether to filter text based on the ratio of digits and letters to the text length. If you select this option, configure the following parameters: Minimum Counts or Ratio of AlphaNumeric Chars; Maximum Counts or Ratio of AlphaNumeric Chars. | Unselected |
| Fields Setting | Whether to Filter with the Ratio of the Number of alpha chars to the Number of Text Tokens | No | The algorithm tokenizes the text by using the pythia-6.9b-deduped model, calculates the ratio of digits and letters to tokens, and filters the text accordingly. If you select this option, configure the following parameters: Minimum Ratio of Alpha Chars to Text Tokens; Maximum Ratio of Alpha Chars to Text Tokens. | Unselected |
| Fields Setting | OSS Directory for Saving OutputData | No | The OSS directory for storing the output data. If you leave this parameter empty, the default workspace path is used. | N/A |
| Tuning | Number of Processes | No | The number of processes. | 8 |
| Tuning | Select Resource Group: Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) to use. | N/A |
| Tuning | Select Resource Group: Dedicated Resource Group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances to use. | N/A |
| Tuning | Maximum Running Duration | No | The maximum duration for which the component can run. The job is terminated if this duration is exceeded. | N/A |
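To show how the field and threshold parameters might fit together, the following Python sketch applies a minimum and maximum alphanumeric ratio to one field of JSON Lines input. It is an illustration under stated assumptions, not the component's code: the names `filter_jsonl`, `min_ratio`, and `max_ratio` are hypothetical, thresholds are treated as inclusive, and samples whose target field is missing or not a string are dropped.

```python
import json


def alnum_ratio(text):
    """Share of letters and digits among all characters; 0.0 for empty text."""
    return sum(1 for ch in text if ch.isalnum()) / len(text) if text else 0.0


def within(value, minimum=None, maximum=None):
    """Inclusive range check; a bound of None means 'no limit' (assumption)."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


def filter_jsonl(lines, field, min_ratio=None, max_ratio=None):
    """Keep records whose `field` text has an alphanumeric ratio in range."""
    kept = []
    for line in lines:
        record = json.loads(line)
        text = record.get(field)
        if not isinstance(text, str):
            continue  # assumption: samples without a usable field are dropped
        if within(alnum_ratio(text), min_ratio, max_ratio):
            kept.append(record)
    return kept


# "text" plays the role of Target Process Field in this example.
lines = ['{"text": "abcd1234"}', '{"text": "a!!!"}', '{"id": 3}']
kept = filter_jsonl(lines, field="text", min_ratio=0.5, max_ratio=1.0)
```

In this example only the first record is kept: the second has a ratio of 0.25, which is below the minimum, and the third has no `text` field.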