The LLM-Text Normalization (DLC) component normalizes Unicode text and converts Traditional Chinese to Simplified Chinese. The input OSS data file must be in JSONL format. In a JSONL file, each line is a valid JSON object, but the file as a whole is not a valid JSON document.
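The JSONL requirement can be illustrated with a short Python sketch. It is only an illustration of the format, not the component's own reader. The field name "text" and the helper read_jsonl are hypothetical, chosen for the example.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two records, one JSON object per line ("text" is a hypothetical field name).
raw = '{"text": "first record"}\n{"text": "second record"}\n'
records = list(read_jsonl(io.StringIO(raw)))

# Parsing the whole file in one call fails, because two top-level
# objects in a row do not form a single JSON document.
try:
    json.loads(raw)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```

Reading line by line also keeps memory use flat for large files, because only one record is held at a time.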
Supported compute resources
Algorithm description
The LLM-Text Normalization component supports the following features:
- Normalizes Unicode text using the NFKC method: ftfy.fix_text(text, normalization='NFKC')
- Converts Traditional Chinese to Simplified Chinese. It uses the opencc package for the conversion.
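The two steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the component's actual implementation: the function name normalize_text and its flags are hypothetical, the "t2s" (Traditional to Simplified) OpenCC configuration is an assumed choice, and the fallback to the standard-library unicodedata module is a stand-in that applies NFKC only, without the additional text repairs that ftfy performs.

```python
import unicodedata

try:
    import ftfy  # third-party package named in this document
except ImportError:
    ftfy = None

try:
    from opencc import OpenCC  # third-party package named in this document
    _t2s = OpenCC("t2s")  # assumed config: Traditional -> Simplified
except ImportError:
    _t2s = None


def normalize_text(text, nfkc=True, to_simplified=True):
    """Apply NFKC normalization, then Traditional-to-Simplified conversion.

    Hypothetical helper: the two flags loosely mirror the component's
    two checkboxes, but the names are made up for this sketch.
    """
    if nfkc:
        if ftfy is not None:
            text = ftfy.fix_text(text, normalization="NFKC")
        else:
            # Stand-in when ftfy is not installed: plain NFKC only.
            text = unicodedata.normalize("NFKC", text)
    if to_simplified and _t2s is not None:
        # Skipped silently when opencc is not installed.
        text = _t2s.convert(text)
    return text


# Full-width letters and digits fold to their ASCII forms under NFKC.
ascii_form = normalize_text("ＡＢＣ１２３", to_simplified=False)
```

NFKC runs before the script conversion in this sketch so that compatibility forms are folded first. Whether the component applies the steps in that order is an assumption.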
The following table describes the results of the process.
| Before processing | After processing |
| --- | --- |
| Column A in the data table is the | The column is the |
Configure the component
You can add the LLM-Text Normalization (DLC) component to your workflow in Designer and configure the parameters in the pane on the right.
| Parameter type | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Field settings | Target process field | Yes | The name of the field to process. | None |
| | Normalize Unicode text (NFKC) | No | Specifies whether to normalize Unicode text using the NFKC method. | Selected |
| | Convert Traditional Chinese to Simplified Chinese | No | Specifies whether to convert Traditional Chinese to Simplified Chinese. | Selected |
| | OSS output directory for data | No | The OSS storage directory for the processed data. If you leave this empty, the default path of the workspace is used. | None |
| Execution tuning | Number of processes | No | The number of processes to use. | 8 |
| | Select resource group: Public resource group | No | Select the node specifications (CPU or GPU-accelerated instance specifications), number of nodes, and virtual private cloud (VPC). | None |
| | Select resource group: Dedicated resource group | No | Select the number of CPU cores, memory, shared memory, number of GPUs, and number of nodes. | None |
| | Maximum runtime | No | The maximum runtime of the component. If the runtime exceeds this value, the job is terminated. | None |