The LLM-Text Normalizer component preprocesses text for large language models (LLMs). It normalizes Unicode text and converts Traditional Chinese to Simplified Chinese.
Limitations
This component is supported only on the MaxCompute compute engine.
Algorithm
The LLM-Text Normalizer component supports the following features:
- Normalizes Unicode text using the NFKC (Normalization Form Compatibility Composition) method:
  ftfy.fix_text(text, normalization='NFKC')
- Converts Traditional Chinese characters to Simplified Chinese using the opencc package.
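The two steps listed above can be sketched in Python roughly as follows. This is a minimal illustration, not the component's actual implementation: the helper name normalize_text, the 't2s' OpenCC configuration, and the standard-library fallback are assumptions made for this example. Only the ftfy.fix_text call is taken from this page.

```python
import unicodedata

try:
    import ftfy  # third-party; repairs encoding damage in addition to normalizing
except ImportError:
    ftfy = None

try:
    from opencc import OpenCC  # third-party
    # Assumption: 't2s' (Traditional to Simplified) is the configuration in use.
    _t2s = OpenCC("t2s")
except ImportError:
    _t2s = None


def normalize_text(text: str) -> str:
    """Hypothetical helper that applies both steps to one string."""
    if ftfy is not None:
        # The call quoted in this document.
        text = ftfy.fix_text(text, normalization="NFKC")
    else:
        # Fallback: NFKC normalization only, without ftfy's encoding repairs.
        text = unicodedata.normalize("NFKC", text)
    if _t2s is not None:
        # Convert Traditional Chinese characters to Simplified Chinese.
        text = _t2s.convert(text)
    return text


if __name__ == "__main__":
    # Full-width letters and digits are folded to their ASCII forms by NFKC.
    print(normalize_text("ＡＢＣ１２３"))  # prints "ABC123"
```

If ftfy or opencc is not installed, the sketch still runs, but it then skips the encoding repair or the Traditional-to-Simplified conversion, so its output will differ from what the component produces.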
The following compares the data before and after processing:
- Before processing: The data table contains six rows of test data. The column type is text. The data includes mixed Chinese and English text, Traditional and Simplified Chinese, special characters, and garbled text caused by encoding issues.
- After processing: The table contains six rows of data:
  - Row 1: ✔ No problems
  - Row 2: The Mona Lisa doesn't have eyebrows.
  - Row 3: No problems
  - Row 4: Alibaba
  - Row 5: These are a few traditional characters, which will be converted to simplified characters.
  - Row 6: Test the conversion effect of a combination of traditional afadf characters $#@#, simplified characters, and various other characters and numbers 123213*&dasd
  Traditional Chinese characters have been converted to Simplified Chinese characters, and English letters, numbers, and special characters remain unchanged.
Visual configuration parameters
You can configure the component parameters visually in Machine Learning Designer.
| Tab | Parameter | Required | Description | Default |
| --- | --- | --- | --- | --- |
| Field Settings | Select target column | Yes | The columns to process. You can select multiple columns. | None |
| Field Settings | Set output table lifecycle | No | Specifies the lifecycle (in days) for the temporary table generated by this component. After this period, the table is deleted. | 28 |
| Tuning | Number of CPUs per instance | No | The number of vCPUs for each map task instance. Valid values: 50 to 800. | 100 |
| Tuning | Memory size per instance (MB) | No | The memory size in MB for each map task instance. Valid values: 256 to 12288. | 1024 |
| Tuning | Data size per instance (MB) | No | The maximum data size (in MB) that each map task instance can process. This parameter controls the input size for the map phase. Valid values: 1 to Integer.MAX_VALUE. | 256 |
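As a quick sanity check for the Tuning values, the documented ranges can be encoded and tested before a run. The sketch below is illustrative only: the dictionary keys and function names are hypothetical and are not the component's real parameter identifiers. The instance estimate also assumes that the input is split evenly by the per-instance data size, which this page does not state explicitly.

```python
import math

# Inclusive valid ranges from the parameter table.
# Keys are hypothetical names chosen for this example.
VALID_RANGES = {
    "cpu_per_instance": (50, 800),
    "memory_per_instance_mb": (256, 12288),
    "data_size_per_instance_mb": (1, 2**31 - 1),  # Integer.MAX_VALUE
}


def check_tuning(params: dict) -> list:
    """Return one message per value that falls outside its documented range."""
    problems = []
    for name, value in params.items():
        low, high = VALID_RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}..{high}")
    return problems


def estimate_map_instances(input_mb: float, data_size_per_instance_mb: int = 256) -> int:
    """Rough estimate, assuming an even split by the per-instance data size."""
    return max(1, math.ceil(input_mb / data_size_per_instance_mb))


if __name__ == "__main__":
    defaults = {
        "cpu_per_instance": 100,
        "memory_per_instance_mb": 1024,
        "data_size_per_instance_mb": 256,
    }
    print(check_tuning(defaults))        # prints []
    print(estimate_map_instances(1000))  # prints 4
```

Under that even-split assumption, lowering the data size per instance increases the number of map task instances for the same input, and raising it decreases that number.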
Related documents
For more information about Machine Learning Designer components, see Machine Learning Designer overview.