Description of the LLM-Document Deduplicator (DLC) component

The LLM-Document Deduplicator (DLC) component uses the SimHash algorithm to calculate text similarity and deduplicate texts. The input Object Storage Service (OSS) data file must be in JSON Lines format, where each line is a valid JSON object but the file itself is not a single JSON object. For more information, see Example.
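The JSON Lines requirement can be illustrated with a minimal Python sketch. It only shows what "one JSON object per line" means; it is not the component's own loader. The function name `read_jsonl` and the field name `text` are hypothetical, and skipping blank lines is a choice made for this sketch only.

```python
import json

def read_jsonl(lines):
    """Parse an iterable of text lines as JSON Lines.

    Each non-empty line must be a complete JSON object on its own.
    """
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption of this sketch: blank lines are ignored
        obj = json.loads(line)  # raises json.JSONDecodeError on an invalid line
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        records.append(obj)
    return records

# Two lines, each a valid JSON object; the file as a whole is not one JSON value.
sample = [
    '{"text": "the cute alibaba mascot"}',
    '{"text": "another document"}',
]
records = read_jsonl(sample)

target_field = "text"  # hypothetical value for Target Process Field
texts = [record[target_field] for record in records]
```

In a real pipeline the lines would come from the OSS data file rather than from an in-memory list.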

Supported computing resources

DLC

Configure the component

On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-Document Deduplicator (DLC) component.

Tab	Parameter		Required	Description	Default value
Fields Setting	Target Process Field		Yes	The name of the field that you want to process.	N/A
	Text Separator, default is space		No	The delimiter used to split text into words. By default, spaces are used. If you leave this parameter empty, the algorithm deduplicates text based on single characters. Enclose the delimiter in double quotation marks ("").	" "
	window_size		Yes	The length, in tokens, of the substrings used as document features. Tokens are produced by splitting the text with the text separator. For example, if the text is "the cute alibaba mascot" and window_size is 2, the substrings are: ["the cute", "cute alibaba", "alibaba mascot"]. The algorithm calculates SimHash values based on the hash values of these substrings. A smaller value generates more distinct features but is more susceptible to edits. A larger value captures more context but may ignore details.	6
	num_blocks		Yes	The number of blocks into which the SimHash value is divided for similarity comparison. For example, if the SimHash value is a 64-bit integer and num_blocks is 4, the value is divided into 4 separate 16-bit blocks. More blocks produce finer-grained comparisons, which reduces false positives but may increase false negatives. The num_blocks value must be smaller than the number of bits in the SimHash value.	6
	hamming_distance		Yes	The Hamming distance threshold for determining text similarity. If the number of differing bits between two SimHash values is less than or equal to this value, the texts are considered similar. A smaller value identifies only highly similar texts as duplicates, which may miss some duplicated content. A larger value identifies more similar texts but may increase false positives. Recommended values: 3, 4, or 5.	4
	OSS Directory for Saving OutputData		No	The OSS directory for storing generated data. If not specified, the default workspace path is used.	N/A
Tuning	Number of Processes		No	The number of processes used to run the job.	8
	Select Resource Group	Public Resource Group	No	The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) that you want to use.	N/A
	Select Resource Group	Dedicated resource group	No	The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use.	N/A
	Maximum Running Duration		No	The maximum running time for the component. The job is terminated if this duration is exceeded.	N/A
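
The way window_size, num_blocks, and hamming_distance interact can be sketched in Python as follows. This is a minimal illustration of a generic SimHash scheme, not the component's implementation. The function names are hypothetical, the 64-bit width follows the example in the table, and using MD5 as the per-feature hash is an assumption, because this page does not say which hash function the component uses. The block split assumes that the bit width is evenly divisible by num_blocks.

```python
import hashlib

def shingles(text, separator=" ", window_size=2):
    """Split text into tokens and return every run of window_size tokens."""
    # An empty separator means character-level tokens, as described in the table.
    tokens = text.split(separator) if separator else list(text)
    if len(tokens) < window_size:
        return [separator.join(tokens)] if tokens else []
    return [
        separator.join(tokens[i:i + window_size])
        for i in range(len(tokens) - window_size + 1)
    ]

def simhash(features, bits=64):
    """Combine per-feature hashes into one fingerprint by bitwise majority vote."""
    weights = [0] * bits
    for feature in features:
        # Assumption: MD5 truncated to `bits` bits serves as the feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            value |= 1 << i
    return value

def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_similar(a, b, hamming_distance=4):
    """Two texts count as similar if their fingerprints differ in at most hamming_distance bits."""
    return hamming(a, b) <= hamming_distance

def split_blocks(value, num_blocks=4, bits=64):
    """Cut a fingerprint into num_blocks equal-width blocks, lowest bits first."""
    width = bits // num_blocks  # assumes bits is divisible by num_blocks
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(num_blocks)]

features = shingles("the cute alibaba mascot", " ", 2)
# features == ["the cute", "cute alibaba", "alibaba mascot"]
fingerprint = simhash(features)
blocks = split_blocks(fingerprint, num_blocks=4)  # four 16-bit blocks
```

One common reason to split fingerprints into blocks is candidate lookup: if two fingerprints differ in at most k bits and there are more than k blocks, at least one block must be identical in both, so documents can be grouped by block value before the full Hamming comparison. Whether the component uses the blocks in exactly this way is not stated on this page.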