The LLM-MD5 Deduplicator (DLC) removes duplicate text entries from Large Language Model (LLM) training datasets. It computes an MD5 hash for each text entry and keeps one copy when multiple entries produce the same hash.
Input data must be stored in Object Storage Service (OSS) as a JSON Lines file, where each line is a valid JSON object and the file as a whole is not a valid JSON object. See example data for the expected format.
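As a rough illustration of the JSON Lines layout described above, the Python sketch below parses a small sample one line at a time. The field name `text` and the sample values are hypothetical and are not taken from the component's example data.

```python
import json

# Hypothetical sample: two JSON objects, one per line (the "text" field name is assumed).
sample = '{"text": "first entry"}\n{"text": "second entry"}\n'

# Each non-empty line parses on its own as a JSON object.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

# The file as a whole is not valid JSON, so parsing it in one call fails.
try:
    json.loads(sample)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```

Here `records` holds one dictionary per line, and `whole_file_is_json` ends up `False` because the second object counts as extra data after the first.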
Supported computing resources
How it works
The component deduplicates text entries in three steps:
- Strips leading and trailing whitespace from each text entry.
- Computes an MD5 hash using Python's `hashlib.md5` function. Character casing is preserved: `Hello` and `hello` produce different hashes.
- Retains one entry per unique hash value and discards the rest.
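The three steps above can be sketched in Python as follows. This is a minimal illustration, not the component's actual implementation: the function name `dedupe_jsonl_lines`, the `text` field, UTF-8 encoding, and the choice to keep the first occurrence (and to keep the record unmodified) are all assumptions.

```python
import hashlib
import json


def dedupe_jsonl_lines(lines, field):
    """Keep one record per MD5 hash of the stripped field text (hypothetical helper)."""
    seen = set()
    kept = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        # Step 1: strip leading and trailing whitespace.
        text = record[field].strip()
        # Step 2: hash the text; casing is left untouched. UTF-8 is an assumption.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        # Step 3: keep one entry per hash. Keeping the first one seen is an assumption.
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept


sample_lines = [
    '{"text": "Hello"}',
    '{"text": "  Hello  "}',
    '{"text": "hello"}',
]
result = dedupe_jsonl_lines(sample_lines, "text")
```

With this sample, the second entry matches the first once whitespace is stripped and is dropped, while `hello` is kept because the different casing yields a different hash.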
Configure the component
On the pipeline page of Machine Learning Designer, configure the LLM-MD5 Deduplicator (DLC) component with the following parameters.
| Tab | Parameter | Type | Required | Default | Description |
|---|---|---|---|---|---|
| Fields Setting | Target Process Field | String | Yes | — | The name of the JSON field containing the text to deduplicate. |
| Fields Setting | OSS Directory for Saving OutputData | String | No | Workspace default path | The OSS directory where the deduplicated output is stored. |
| Tuning | Number of Processes | Integer | No | 8 | The number of parallel processes to use. |
| Select Resource Group | Public Resource Group | — | No | — | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) to use. |
| Select Resource Group | Dedicated resource group | — | No | — | The number of vCPUs, memory, shared memory, GPUs, and instances to use. |
| Select Resource Group | Maximum Running Duration | — | No | — | The maximum run time for the component. The job terminates if this limit is exceeded. |