LLM-Clean Copyright Information component

The LLM-Clean Copyright Information (MaxCompute) component preprocesses text data for large language models (LLMs). It primarily removes copyright comment headers from code.

Supported computing resources

MaxCompute

Algorithm

The component removes copyright or other comment information from text in the following two steps:

Step 1: The algorithm checks whether the text contains a string that matches the regular expression '/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/', that is, a C-style block comment delimited by /* and */.
- If a matching string is found, the algorithm checks whether it contains the copyright keyword. If so, it deletes the entire matched string and returns the result. Otherwise, it returns the text unchanged.
- If no match for the regular expression is found, the algorithm proceeds to Step 2.
Step 2: The algorithm splits the text by line breaks and iterates through the lines to find the first one that starts with a comment marker, such as //, #, or --. From that line, it counts consecutive comment lines until it reaches a non-comment line. It then deletes that consecutive block of comment lines from the text and returns the result.
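
The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the component's actual implementation: the function name clean_copyright is hypothetical, the keyword check is assumed to be a case-insensitive search for "copyright", the marker list contains only the three markers named above, and the pattern is written as a raw string on the assumption that the doubled backslashes in the text are string escapes.

```python
import re

# Block-comment pattern from the text, as a Python raw string (assumption:
# the doubled backslashes in the documentation are string-literal escapes).
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

# Only the markers named in the text; the real component may recognize more.
LINE_MARKERS = ("//", "#", "--")


def clean_copyright(text: str) -> str:
    """Hypothetical sketch of the two-step header cleanup."""
    # Step 1: inspect only the first block comment, if there is one.
    match = BLOCK_COMMENT.search(text)
    if match:
        # Assumption: the keyword check is case-insensitive.
        if "copyright" in match.group(0).lower():
            return text[:match.start()] + text[match.end():]
        return text  # block comment without the keyword: leave text as is

    # Step 2: remove the first run of consecutive line comments.
    lines = text.split("\n")
    start = None
    for i, line in enumerate(lines):
        is_comment = line.startswith(LINE_MARKERS)
        if is_comment and start is None:
            start = i  # first comment line found
        elif not is_comment and start is not None:
            # First non-comment line after the run: drop lines[start:i].
            return "\n".join(lines[:start] + lines[i:])
    if start is not None:
        # The comment run extends to the end of the text.
        return "\n".join(lines[:start])
    return text
```

In this sketch, a block comment that lacks the keyword stops processing in Step 1, so line comments later in the same text are left in place, which mirrors the description above.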

Both steps act only on the first matching comment block, so the component effectively checks only the header of the text and leaves the remaining content untouched. The following example shows a text before and after processing.

Before

The input is the JavaScript source code of angular-spinner 0.3.1. It begins with a comment block that contains the version number, the license type (License: MIT), and the notice Copyright (C) 2013, 2014, Uri Shaked and contributors. The comment block is followed by an IIFE and code lines such as angular.module('angularSpinner', []).

After

(function(window, angular, undefined) {
'use strict';
angular.module('angularSpinner', [])
.factory('usSpinnerService', ['$rootScope', function ($rootScope) {
    var config = {};
    config.spin = function (key) {
        $rootScope.$broadcast('us-spinner:spin', key);
    };
}]);
})(window, window.angular);

Configure the component

In Machine Learning Designer, add the LLM-Clean Copyright Information (MaxCompute) component to a pipeline and configure its parameters in the pane on the right.

Type	Parameter	Default	Description
Field settings	Select target column	None	Select the column or columns to process.
Field settings	Output table lifecycle	28	The lifecycle of the output temporary table, in days. The table is recycled after this period. Default: 28.
Tuning	Number of CPUs per instance	100	The number of CPUs for each map task instance. The value must be in the range [50, 800].
Tuning	Memory size per instance (MB)	1024	The memory size for each map task instance, in MB. The value must be in the range [256, 12288].
Tuning	Data size per instance (MB)	256	The maximum amount of data, in MB, that each map task instance can process. Set this parameter to control the map-side input size. The value must be in the range [1, Integer.MAX_VALUE].
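
The tuning ranges in the table can be checked before a pipeline is submitted. The following is a minimal sketch, assuming hypothetical field names (the component itself exposes these settings only through the Designer pane); the defaults and ranges are taken from the table, and Integer.MAX_VALUE is Java's 2^31 - 1.

```python
from dataclasses import dataclass

INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE


@dataclass
class TuningParams:
    # Hypothetical field names; defaults are the ones listed in the table.
    cpus_per_instance: int = 100
    memory_mb_per_instance: int = 1024
    data_mb_per_instance: int = 256

    def validate(self) -> None:
        """Raise ValueError if any value is outside its documented range."""
        limits = {
            "cpus_per_instance": (50, 800),
            "memory_mb_per_instance": (256, 12288),
            "data_mb_per_instance": (1, INT_MAX),
        }
        for name, (low, high) in limits.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
```

For example, TuningParams().validate() passes with the defaults, while TuningParams(cpus_per_instance=40).validate() raises a ValueError.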