LLM-Text Normalization (DLC)

The LLM-Text Normalization (DLC) component normalizes Unicode text and converts Traditional Chinese to Simplified Chinese. The input data file, stored in OSS, must be in JSONL format. In a JSONL file, each line is a valid JSON object, but the file as a whole is not a valid JSON document.
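
To illustrate the JSONL requirement, here is a minimal Python sketch that parses a small in-memory sample line by line. The sample records, the `text` field name, and the `parse_jsonl` helper are hypothetical; they are not taken from the component or its example file.

```python
import json

# Hypothetical two-record JSONL sample; the "text" field name is an assumption.
sample = '{"text": "first record"}\n{"text": "second record"}\n'


def parse_jsonl(content):
    """Parse JSONL content: one JSON object per non-empty line."""
    records = []
    for line_number, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
    return records


records = parse_jsonl(sample)
print(records)

# Parsing the whole file as a single JSON document fails,
# because two top-level objects do not form one valid JSON value.
try:
    json.loads(sample)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
print(whole_file_is_json)
```

A check like this can be run on a data file before uploading it to OSS to catch malformed lines early.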

Supported compute resources

DLC

Algorithm description

The LLM-Text Normalization component supports the following features:

  • Normalizes Unicode text using the NFKC method.

    ftfy.fix_text(text, normalization='NFKC')

  • Converts Traditional Chinese to Simplified Chinese.

    It uses the opencc package for the conversion.
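
The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the component's implementation: it uses the standard library's `unicodedata.normalize` as a stand-in for the NFKC step (ftfy additionally repairs encoding problems, which this stand-in does not), and a tiny hypothetical character map in place of opencc's full dictionaries. The function names and the order of the two steps are also assumptions.

```python
import unicodedata

# Tiny hypothetical Traditional -> Simplified map, for illustration only.
# A real conversion would rely on opencc's dictionaries instead.
DEMO_T2S = str.maketrans({"體": "体", "漢": "汉", "語": "语", "這": "这", "個": "个"})


def normalize_nfkc(text):
    """Stand-in for the NFKC step, using only the standard library."""
    return unicodedata.normalize("NFKC", text)


def to_simplified(text):
    """Stand-in for the opencc step, limited to the demo map above."""
    return text.translate(DEMO_T2S)


def normalize_text(text, nfkc=True, t2s=True):
    """Apply the enabled steps in order: NFKC first, then conversion."""
    if nfkc:
        text = normalize_nfkc(text)
    if t2s:
        text = to_simplified(text)
    return text


# Fullwidth letters and digits become ASCII; mapped characters are simplified.
print(normalize_text("ＡＢＣ１２３ 繁體漢語"))
```

Keeping each step behind its own flag mirrors the two independent check boxes that the component exposes.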

The following table describes the results of the process.

Before processing

Column A of the data table is of the text type. It contains six rows of text with encoding anomalies and a mix of Traditional and Simplified Chinese.

After processing

The output column is of the text type and contains six rows of results. English text and special characters, such as "✔ No problems" and "The Mona Lisa doesn't have eyebrows.", remain unchanged. Traditional Chinese characters are converted to Simplified Chinese, as in the row that reads "These are a few traditional characters that will be converted to simplified characters." The conversion also applies to the Traditional Chinese parts of mixed text that contains Simplified Chinese, English, numbers, and special characters.

Configure the component

You can add the LLM-Text Normalization (DLC) component to your workflow in Designer and configure the parameters in the pane on the right.

| Parameter type | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Field settings | Target process field | Yes | The name of the field to process. | None |
| Field settings | Normalize Unicode text (NFKC) | No | Specifies whether to normalize Unicode text using the NFKC method. | Selected |
| Field settings | Convert Traditional Chinese to Simplified Chinese | No | Specifies whether to convert Traditional Chinese to Simplified Chinese. | Selected |
| Field settings | OSS output directory for data | No | The OSS storage directory for the processed data. If you leave this empty, the default path of the workspace is used. | None |
| Execution tuning | Number of processes | No | The number of processes to use. | 8 |
| Execution tuning | Select resource group (Public resource group) | No | Select the node specifications (CPU or GPU-accelerated instance specifications), number of nodes, and virtual private cloud (VPC). | None |
| Execution tuning | Select resource group (Dedicated resource group) | No | Select the number of CPU cores, memory, shared memory, number of GPUs, and number of nodes. | None |
| Execution tuning | Maximum runtime | No | The maximum runtime of the component. If the runtime exceeds this value, the job is terminated. | None |
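
As a rough illustration of how the field settings relate to one another, the sketch below applies the two optional steps to a single target field of one JSONL record. The `process_line` function, its argument names, and the sample data are hypothetical; they mirror the parameters above rather than any real API of the component. The NFKC step uses the standard library as a stand-in, and the Traditional-to-Simplified step is left as a caller-supplied function.

```python
import json
import unicodedata


def process_line(line, target_field, normalize_unicode=True,
                 convert_t2s=True, t2s=None):
    """Process one JSONL line and return the updated line.

    target_field      -> "Target process field" (required)
    normalize_unicode -> "Normalize Unicode text (NFKC)"
    convert_t2s       -> "Convert Traditional Chinese to Simplified Chinese"
    t2s               -> hypothetical caller-supplied converter function
    """
    record = json.loads(line)
    text = record[target_field]
    if normalize_unicode:
        # Standard-library stand-in for the NFKC step.
        text = unicodedata.normalize("NFKC", text)
    if convert_t2s and t2s is not None:
        text = t2s(text)
    record[target_field] = text
    # ensure_ascii=False keeps Chinese characters readable in the output.
    return json.dumps(record, ensure_ascii=False)


# Toy converter that handles a single character, for demonstration only.
def demo_converter(text):
    return text.replace("體", "体")


result = process_line('{"id": 1, "text": "繁體 ＯＫ"}', "text", t2s=demo_converter)
print(result)
```

Only the target field is rewritten; other fields in the record, such as `id` here, pass through unchanged.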