The LLM-Document Deduplicator (DLC) component uses the SimHash algorithm to calculate text similarity and deduplicate texts. The input Object Storage Service (OSS) data file must be in JSON Lines format, where each line is a valid JSON object but the file itself is not a single JSON object. For more information, see Example.
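As a minimal sketch of the JSON Lines requirement, the following Python snippet parses input line by line and rejects any line that is not a JSON object. The field name `text` and the helper `read_jsonl` are hypothetical and used only for illustration; they are not part of the component.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts.

    Blank lines are skipped; every other line must be one JSON object.
    """
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # raises json.JSONDecodeError on a malformed line
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number} is not a JSON object")
        records.append(record)
    return records


# Hypothetical sample data: one JSON object per line, no enclosing array.
sample = [
    '{"text": "the cute alibaba mascot"}',
    '{"text": "a different document"}',
]
records = read_jsonl(sample)
```

With a real file, the same function can be called on an open file handle, for example `read_jsonl(open(path, encoding="utf-8"))`.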
Supported computing resources
Configure the component
On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-Document Deduplicator (DLC) component.
| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Target Process Field | Yes | The name of the field that you want to process. | N/A |
| | Text Separator, default is space | No | The delimiter used to split text into words. By default, spaces are used. If you leave this parameter empty, the algorithm deduplicates text based on single characters. Enclose the delimiter in double quotation marks (""). | " " |
| | window_size | Yes | The length of substrings used as document features. For example, if the text is "the cute alibaba mascot" and window_size is 2, the substrings are: ["the cute", "cute alibaba", "alibaba mascot"]. The algorithm calculates SimHash values based on the hash values of these substrings. A smaller value generates more distinct features but is more susceptible to edits. A larger value captures more context but may ignore details. | 6 |
| | num_blocks | Yes | The number of blocks into which the SimHash value is divided for similarity comparison. For example, if the SimHash value is a 64-bit integer and num_blocks is 4, the value is divided into 4 separate 16-bit blocks. More blocks produce finer-grained comparisons, which reduces false positives but may increase false negatives. The num_blocks value must be smaller than the number of bits in the SimHash value. | 6 |
| | hamming_distance | Yes | The Hamming distance threshold for determining text similarity. If the number of differing bits between two SimHash values is less than or equal to this value, the texts are considered similar. A smaller value identifies only highly similar texts as duplicates, which may miss some duplicated content. A larger value identifies more similar texts but may increase false positives. Recommended values: 3, 4, or 5. | 4 |
| | OSS Directory for Saving OutputData | No | The OSS directory for storing generated data. If not specified, the default workspace path is used. | N/A |
| Tuning | Number of Processes | No | The number of processes. | 8 |
| Select Resource Group | Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) that you want to use. | N/A |
| | Dedicated resource group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use. | N/A |
| | Maximum Running Duration | No | The maximum running time for the component. The job is terminated if this duration is exceeded. | N/A |
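The following Python sketch illustrates how the window_size, num_blocks, and hamming_distance parameters described above typically interact in a SimHash pipeline. It is an illustration under stated assumptions, not the component's implementation: the use of MD5 as the per-feature hash, the 64-bit fingerprint width, and all function names (`shingles`, `simhash`, `hamming`, `split_blocks`, `is_near_duplicate`) are hypothetical choices made for this example.

```python
import hashlib


def shingles(text, window_size, separator=" "):
    """Return the overlapping substrings of window_size tokens.

    An empty separator switches to single-character tokens.
    """
    tokens = text.split(separator) if separator else list(text)
    tokens = [t for t in tokens if t]
    joiner = separator if separator else ""
    if len(tokens) < window_size:
        return [joiner.join(tokens)] if tokens else []
    return [
        joiner.join(tokens[i:i + window_size])
        for i in range(len(tokens) - window_size + 1)
    ]


def simhash(features, bits=64):
    """Combine per-feature hashes into one fingerprint by per-bit majority vote."""
    counts = [0] * bits
    for feature in features:
        # Assumption: MD5 truncated to `bits` bits serves as the feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, count in enumerate(counts):
        if count > 0:
            value |= 1 << i
    return value


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def split_blocks(value, num_blocks, bits=64):
    """Split a fingerprint into num_blocks contiguous blocks, lowest bits first.

    If bits is not divisible by num_blocks, the first blocks are one bit wider.
    """
    base, extra = divmod(bits, num_blocks)
    blocks, shift = [], 0
    for i in range(num_blocks):
        width = base + (1 if i < extra else 0)
        blocks.append((value >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def is_near_duplicate(a, b, window_size=2, hamming_distance=4):
    """Treat two texts as duplicates if their fingerprints differ in at most hamming_distance bits."""
    ha = simhash(shingles(a, window_size))
    hb = simhash(shingles(b, window_size))
    return hamming(ha, hb) <= hamming_distance


features = shingles("the cute alibaba mascot", 2)
# features == ["the cute", "cute alibaba", "alibaba mascot"]
```

In block-based SimHash schemes, blocks are commonly used to shortlist candidates: only fingerprints that share at least one identical block are compared by full Hamming distance. Whether the component follows exactly this scheme is not stated here.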