The LLM-Clean Copyright Information (MaxCompute) component preprocesses text data for large language models (LLMs). It primarily removes copyright comment headers from code.
Supported computing resources
Algorithm
The component removes copyright or other comment information from text in the following two steps:
- Step 1: The algorithm checks whether the text contains a string that matches the regular expression `'/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/'`, which matches C-style block comments (`/* ... */`).
  - If a matching string is found, the algorithm checks whether it contains the `copyright` keyword. If so, it deletes the entire matched string and returns the result. Otherwise, it returns the text unchanged.
  - If no match for the regular expression is found, the algorithm proceeds to Step 2.
- Step 2: The algorithm splits the text by line breaks, then iterates through the lines to check whether each one starts with a comment marker, such as `//`, `#`, or `--`. Once a matching line is found, the algorithm continues to count consecutive comment lines until a non-comment line is found. Finally, the algorithm deletes that consecutive block of comments from the text and returns the result.
Both preceding steps detect only the first matching comment block, which means the component only checks the header of the text. The component does not process the remaining content. The following example shows the before and after.
| Before | After |
|---|---|
| The Current field value pop-up window displays the JavaScript source code for angular-spinner 0.3.1. At the top of the code is an MIT license comment block that includes the version number, license type (License: MIT), and the notice | |
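The two steps described above can be sketched in Python as follows. This is a minimal illustration, not the component's actual implementation. The function name `clean_copyright` is hypothetical, and two details are assumptions because the description does not specify them: the `copyright` keyword is matched case-insensitively, and a line counts as a comment only if the marker is its very first character (no leading whitespace is stripped).

```python
import re

# The regular expression from the description, with the string-level
# backslash escaping removed. It matches C-style /* ... */ block comments.
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

# Line-comment markers named in the description.
LINE_COMMENT_MARKERS = ("//", "#", "--")


def clean_copyright(text: str) -> str:
    """Remove the first copyright comment block from text (illustrative sketch)."""
    # Step 1: look for the first block comment.
    match = BLOCK_COMMENT.search(text)
    if match:
        # Assumption: the keyword check ignores case.
        if "copyright" in match.group(0).lower():
            return text[:match.start()] + text[match.end():]
        # A block comment without the keyword leaves the text unchanged.
        return text

    # Step 2: find the first line that starts with a comment marker.
    lines = text.split("\n")
    start = None
    for i, line in enumerate(lines):
        if line.startswith(LINE_COMMENT_MARKERS):
            start = i
            break
    if start is None:
        return text

    # Extend over the consecutive comment lines that follow.
    end = start
    while end < len(lines) and lines[end].startswith(LINE_COMMENT_MARKERS):
        end += 1

    # Delete only that first consecutive block; later comments are kept.
    return "\n".join(lines[:start] + lines[end:])
```

Note that in this sketch, as in the description, Step 2 removes the first run of line comments whether or not it mentions copyright, while Step 1 removes a block comment only when the keyword is present.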
Configure the component
In Machine Learning Designer, add the LLM-Clean Copyright Information (MaxCompute) component to a pipeline and configure its parameters in the pane on the right.
| Type | Parameter | Default | Description |
|---|---|---|---|
| Field settings | Select target column | None | Select the column or columns to process. |
| Field settings | Output table lifecycle | 28 | The lifecycle of the output temporary table, in days. The table is recycled after this period. Default: 28. |
| Tuning | Number of CPUs per instance | 100 | The number of CPUs for each map task instance. The value must be in the range [50, 800]. |
| Tuning | Memory size per instance (MB) | 1024 | The memory size for each map task instance, in MB. The value must be in the range [256, 12288]. |
| Tuning | Data size per instance (MB) | 256 | The maximum amount of data that each map task instance can process. You can control the map-side input by setting this parameter. The value, in MB, must be in the range [1, Integer.MAX_VALUE]. |
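The tuning ranges in the table can be checked before a pipeline run with a small helper like the one below. This is only a sketch: the key names (`cpu_per_instance` and so on) and both functions are hypothetical and are not part of the component's interface. The instance estimate additionally assumes that the input is divided evenly by the per-instance data size, which the table does not state.

```python
import math

# (min, max, default) for each tuning parameter, taken from the table.
# The key names are hypothetical labels, not real component parameter IDs.
# Integer.MAX_VALUE in Java is 2**31 - 1.
TUNING_RANGES = {
    "cpu_per_instance": (50, 800, 100),
    "memory_per_instance_mb": (256, 12288, 1024),
    "data_size_per_instance_mb": (1, 2**31 - 1, 256),
}


def validate_tuning(**overrides: int) -> dict:
    """Return the defaults merged with overrides, rejecting out-of-range values."""
    settings = {name: default for name, (_, _, default) in TUNING_RANGES.items()}
    for name, value in overrides.items():
        if name not in TUNING_RANGES:
            raise KeyError(f"unknown tuning parameter: {name}")
        low, high, _ = TUNING_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        settings[name] = value
    return settings


def estimate_map_instances(input_mb: float, data_size_per_instance_mb: int = 256) -> int:
    """Rough instance count, assuming the input is split evenly (an assumption)."""
    return max(1, math.ceil(input_mb / data_size_per_instance_mb))
```

Under that even-split assumption, lowering the data size per instance increases the number of map task instances, which is one way to read the statement that this parameter controls the map-side input.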