Description of the LLM-MD5 Deduplicator (DLC) component - Platform for AI (PAI) - Alibaba Cloud Help Center

The LLM-MD5 Deduplicator (DLC) removes duplicate text entries from Large Language Model (LLM) training datasets. It computes an MD5 hash for each text entry and keeps one copy when multiple entries produce the same hash.

Input data must be stored in Object Storage Service (OSS) as a JSON Lines file: each line is a standalone, valid JSON object, and the file as a whole is not a single valid JSON document. See example data for the expected format.
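The JSON Lines requirement can be illustrated with a minimal Python sketch. The sample content and the field name text are assumptions made for illustration only; they are not taken from the component's example data.

```python
import json

# Hypothetical sample: two records in JSON Lines form. The field name "text"
# is an assumption chosen for illustration.
sample = '{"text": "Hello world"}\n{"text": "hello world"}\n'


def parse_json_lines(content):
    """Parse each non-empty line as its own JSON object."""
    records = []
    for line in content.splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records


records = parse_json_lines(sample)

# Parsing the whole file in one call fails, because two top-level objects
# do not form a single JSON document.
try:
    json.loads(sample)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```

Reading line by line yields one record per line, while passing the entire file to json.loads raises an error. This matches the description of the format given above.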

Supported computing resources

Deep Learning Containers (DLC)

How it works

The component deduplicates text entries in three steps:

1. Strips leading and trailing whitespace from each text entry.
2. Computes an MD5 hash using Python's hashlib.md5 function. Character casing is preserved, so Hello and hello produce different hashes.
3. Retains one entry per unique hash value and discards the rest.
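The three steps above can be sketched in Python as follows. This is an illustrative approximation, not the component's actual code: the UTF-8 encoding, the choice to keep the first occurrence of each hash, and the names md5_of_text and deduplicate are all assumptions.

```python
import hashlib


def md5_of_text(text):
    """Strip surrounding whitespace, then hash the UTF-8 bytes (encoding assumed)."""
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()


def deduplicate(records, field):
    """Keep the first record seen for each MD5 value (first-wins is an assumption)."""
    seen = set()
    kept = []
    for record in records:
        digest = md5_of_text(record[field])
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept


# Hypothetical records; "text" plays the role of the Target Process Field.
rows = [
    {"text": "Hello"},
    {"text": "  Hello\n"},  # same as the first entry once whitespace is stripped
    {"text": "hello"},      # different casing, so a different hash
]
unique_rows = deduplicate(rows, "text")
```

In this sketch unique_rows keeps two records, Hello and hello: the second entry collapses into the first after stripping, and the third survives because casing is not normalized before hashing.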

Configure the component

On the pipeline page of Machine Learning Designer, configure the LLM-MD5 Deduplicator (DLC) component with the following parameters.

Tab	Parameter	Type	Required	Default	Description
Fields Setting	Target Process Field	String	Yes	—	The name of the JSON field containing the text to deduplicate.
Fields Setting	OSS Directory for Saving OutputData	String	No	Workspace default path	The OSS directory where the deduplicated output is stored.
Tuning	Number of Processes	Integer	No	8	The number of parallel processes to use.
Select Resource Group	Public Resource Group	—	No	—	The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) to use.
Select Resource Group	Dedicated resource group	—	No	—	The number of vCPUs, memory, shared memory, GPUs, and instances to use.
Select Resource Group	Maximum Running Duration	—	No	—	The maximum run time for the component. The job terminates if this limit is exceeded.