LLM-MD5 Deduplicator (DLC)

The LLM-MD5 Deduplicator (DLC) removes duplicate text entries from Large Language Model (LLM) training datasets. It computes an MD5 hash for each text entry and keeps one copy when multiple entries produce the same hash.

Input data must be stored in Object Storage Service (OSS) as a JSON Lines file. Each line is a standalone, valid JSON object, so the file as a whole is not a single valid JSON document. See example data for the expected format.
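
As an illustration of the JSON Lines layout, the sketch below parses a small in-memory sample one line at a time and shows that the same content fails to parse as a single JSON document. The field name `text`, the sample entries, and the helper names are hypothetical and are not taken from the component.

```python
import json

# Hypothetical sample: three JSON objects, one per line. The field name
# "text" is an assumption; use whichever field you set as Target Process Field.
SAMPLE_JSONL = (
    '{"text": "Hello world"}\n'
    '{"text": "hello world"}\n'
    '{"text": "  Hello world  "}\n'
)


def parse_jsonl(content):
    """Parse JSON Lines content into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in content.splitlines() if line.strip()]


def is_single_json_document(content):
    """Return True if the whole content parses as one JSON value."""
    try:
        json.loads(content)
        return True
    except json.JSONDecodeError:
        return False


records = parse_jsonl(SAMPLE_JSONL)
print(len(records))                           # number of parsed entries
print(is_single_json_document(SAMPLE_JSONL))  # whole file is not one JSON value
```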

Supported computing resources

Deep Learning Containers (DLC)

How it works

The component deduplicates text entries in three steps:

  1. Strips leading and trailing whitespace from each text entry.

  2. Computes an MD5 hash using Python's hashlib.md5 function. Character casing is preserved, so Hello and hello produce different hashes.

  3. Retains one entry per unique hash value and discards the rest.
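
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the component's actual implementation: the UTF-8 encoding, the choice to keep the first occurrence, the choice to return the original (unstripped) record, and the names `md5_of_entry` and `deduplicate` are all assumptions.

```python
import hashlib


def md5_of_entry(text):
    """Strip surrounding whitespace, then return the MD5 hex digest.

    Encoding the text as UTF-8 is an assumption; the source does not
    state which encoding the component uses.
    """
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()


def deduplicate(records, field):
    """Keep one record per unique hash of record[field].

    Keeping the first occurrence is an assumption; the source only
    says that one entry per hash is retained.
    """
    seen = set()
    kept = []
    for record in records:
        digest = md5_of_entry(record[field])
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept


# Hypothetical entries: the first and third differ only in surrounding
# whitespace, the second differs in casing.
entries = [
    {"text": "Hello world"},
    {"text": "hello world"},
    {"text": "  Hello world  "},
]
unique_entries = deduplicate(entries, "text")
```

In this sample the third entry hashes the same as the first once whitespace is stripped, so it is dropped, while the lowercase variant is kept because casing is preserved.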

Configure the component

On the pipeline page of Machine Learning Designer, configure the LLM-MD5 Deduplicator (DLC) component with the following parameters.

| Tab | Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- | --- |
| Fields Setting | Target Process Field | String | Yes | | The name of the JSON field containing the text to deduplicate. |
| Fields Setting | OSS Directory for Saving OutputData | String | No | Workspace default path | The OSS directory where the deduplicated output is stored. |
| Tuning | Number of Processes | Integer | No | 8 | The number of parallel processes to use. |
| Select Resource Group | Public Resource Group | | No | | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) to use. |
| Select Resource Group | Dedicated resource group | | No | | The number of vCPUs, memory, shared memory, GPUs, and instances to use. |
| Select Resource Group | Maximum Running Duration | | No | | The maximum run time for the component. The job terminates if this limit is exceeded. |