LLM-Text Normalizer (MaxCompute)

The LLM-Text Normalizer component preprocesses text for large language models (LLMs). It normalizes Unicode text and converts Traditional Chinese to Simplified Chinese.

Limitations

This component is supported only on the MaxCompute compute engine.

Algorithm

The LLM-Text Normalizer component supports the following features:

  • Normalizes Unicode text using the NFKC (Normalization Form Compatibility Composition) method.

    ftfy.fix_text(text, normalization='NFKC')

  • Converts Traditional Chinese characters to Simplified Chinese using the opencc package.
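The two steps above can be sketched together in Python. This is a minimal sketch, not the component's actual implementation: it assumes the third-party `ftfy` and `opencc` packages are installed, that normalization runs before conversion, and that the OpenCC `t2s` (Traditional to Simplified) configuration is the one used. The function name `normalize_text` is hypothetical. If either package is missing, the sketch falls back to the standard library's `unicodedata`, which applies NFKC only and does not repair encoding damage or convert Chinese characters.

```python
import unicodedata

try:
    import ftfy  # third-party; repairs mojibake in addition to normalizing
except ImportError:
    ftfy = None

try:
    from opencc import OpenCC  # third-party; Traditional -> Simplified conversion
except ImportError:
    OpenCC = None

# Assumption: 't2s' (Traditional to Simplified) is the relevant OpenCC config.
_T2S = OpenCC("t2s") if OpenCC is not None else None


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, then Traditional-to-Simplified conversion.

    The step order is an assumption made for this sketch.
    """
    if ftfy is not None:
        # Same call as shown in the documentation above.
        text = ftfy.fix_text(text, normalization="NFKC")
    else:
        # Fallback: plain NFKC only; garbled encodings are left untouched.
        text = unicodedata.normalize("NFKC", text)
    if _T2S is not None:
        text = _T2S.convert(text)
    return text


if __name__ == "__main__":
    # Full-width letters and digits fold to their ASCII forms under NFKC.
    print(normalize_text("ＡＢＣ１２３"))
```

NFKC is a compatibility normalization, so it deliberately merges characters that look different but mean the same thing (full-width forms, ligatures). That is usually what LLM preprocessing wants, but it is lossy and cannot be reversed.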

The following compares the data before and after processing:

  • Before processing: The data table contains six rows of test data. The column type is text. The data includes mixed Chinese and English text, Traditional and Simplified Chinese, special characters, and garbled text caused by encoding issues.

  • After processing: The table contains six rows of data. Traditional Chinese characters have been converted to Simplified Chinese characters, and English letters, numbers, and special characters remain unchanged.

    Row 1: ✔ No problems
    Row 2: The Mona Lisa doesn't have eyebrows.
    Row 3: No problems
    Row 4: Alibaba
    Row 5: These are a few traditional characters, which will be converted to simplified characters.
    Row 6: Test the conversion effect of a combination of traditional afadf characters $#@#, simplified characters, and various other characters and numbers 123213*&dasd.

Visual configuration parameters

You can configure the component parameters visually in Machine Learning Designer.

| Tab | Parameter | Required | Description | Default |
| --- | --- | --- | --- | --- |
| Field Settings | Select target column | Yes | The columns to process. You can select multiple columns. | None |
| Field Settings | Set output table lifecycle | No | Specifies the lifecycle (in days) for the temporary table generated by this component. After this period, the table is deleted. | 28 |
| Tuning | Number of CPUs per instance | No | The number of vCPUs for each map task instance. Valid values: 50 to 800. | 100 |
| Tuning | Memory size per instance (MB) | No | The memory size in MB for each map task instance. Valid values: 256 to 12288. | 1024 |
| Tuning | Data size per instance (MB) | No | The maximum data size (in MB) that each map task instance can process. This parameter controls the input size for the map phase. Valid values: 1 to Integer.MAX_VALUE. | 256 |
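
To make the tuning ranges listed above concrete, the following sketch checks candidate values against the documented limits and gives a rough estimate of how many map instances a given input would need. It is illustrative only: the key names, `resolve_tuning`, and `estimate_map_instances` are hypothetical, and the estimate assumes the input is split evenly by the data size per instance. The actual instance count is decided by MaxCompute.

```python
import math

INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE

# (min, max, default) for each tuning parameter, taken from the table above.
# The key names are hypothetical and exist only for this sketch.
TUNING_RANGES = {
    "cpu_per_instance": (50, 800, 100),
    "memory_mb_per_instance": (256, 12288, 1024),
    "data_mb_per_instance": (1, INT_MAX, 256),
}


def resolve_tuning(**overrides):
    """Fill in defaults and reject values outside the documented ranges."""
    resolved = {}
    for name, (low, high, default) in TUNING_RANGES.items():
        value = overrides.pop(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        resolved[name] = value
    if overrides:
        raise TypeError(f"unknown parameters: {sorted(overrides)}")
    return resolved


def estimate_map_instances(input_mb, data_mb_per_instance=256):
    """Rough estimate, assuming the input is split evenly across instances."""
    return max(1, math.ceil(input_mb / data_mb_per_instance))


if __name__ == "__main__":
    params = resolve_tuning(memory_mb_per_instance=2048)
    print(params)
    print(estimate_map_instances(1000, params["data_mb_per_instance"]))
```

Under this assumption, lowering the data size per instance increases parallelism at the cost of more instances, while raising it reduces the instance count but gives each instance more data to process.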

Related documents

For more information about Machine Learning Designer components, see Machine Learning Designer overview.