LLM-Clean Copyright Information (MaxCompute)

The LLM-Clean Copyright Information (MaxCompute) component preprocesses text data for large language models (LLMs). It primarily removes copyright comment headers from code.

Supported computing resources

MaxCompute

Algorithm

The component removes copyright or other comment information from text in the following two steps:

  1. The algorithm checks whether the text contains a string that matches the regular expression '/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/', that is, a block comment of the form /* ... */.

    • If a matching string is found, the algorithm checks whether it contains the copyright keyword. If so, it deletes the entire string and returns the result. Otherwise, it returns the text unchanged.

    • If no match for the regular expression is found, the algorithm proceeds to Step 2.

  2. The algorithm splits the text by line breaks and iterates through the lines to find the first one that starts with a comment marker, such as //, #, or --. From that line, it counts consecutive comment lines until it reaches a non-comment line. It then deletes that consecutive block of comments from the text and returns the result.
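
The two steps above can be sketched in Python as follows. This is a minimal illustration based only on the description in this section, not the component's actual implementation. The function name clean_copyright is hypothetical, and three details are assumptions: the keyword check is a case-insensitive substring test, leading whitespace is ignored when testing for a comment marker, and only the three markers named above are recognized.

```python
import re

# Block-comment pattern from the algorithm description, written as a raw string
# so that each backslash appears only once.
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

# Line-comment markers named in the description (assumption: the real list may differ).
LINE_MARKERS = ("//", "#", "--")


def clean_copyright(text: str) -> str:
    """Hypothetical sketch of the two-step cleaning logic."""
    # Step 1: inspect only the first /* ... */ block.
    match = BLOCK_COMMENT.search(text)
    if match:
        # Assumption: the keyword check is a case-insensitive substring test.
        if "copyright" in match.group(0).lower():
            return text[:match.start()] + text[match.end():]
        # A block comment without the keyword leaves the text unchanged.
        return text

    # Step 2: find the first line that starts with a comment marker.
    lines = text.split("\n")
    start = None
    for i, line in enumerate(lines):
        # Assumption: leading whitespace is ignored when testing for a marker.
        if line.lstrip().startswith(LINE_MARKERS):
            start = i
            break
    if start is None:
        return text

    # Extend the block over consecutive comment lines, then drop it.
    end = start
    while end < len(lines) and lines[end].lstrip().startswith(LINE_MARKERS):
        end += 1
    return "\n".join(lines[:start] + lines[end:])
```

Calling clean_copyright on a source file whose first block comment contains a copyright notice returns the file without that comment. A file whose first block comment lacks the keyword is returned unchanged, even if line comments follow, because Step 2 runs only when Step 1 finds no block comment.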

Both preceding steps detect only the first matching comment block, which means that the component checks only the header of the text and does not process the remaining content. The following example shows a text before and after processing.

Before

[Figure: the Current field value pop-up window displays the JavaScript source code for angular-spinner 0.3.1. The code begins with a comment block that includes the version number, the license type (License: MIT), and the notice Copyright (C) 2013, 2014, Uri Shaked and contributors. The comment block is followed by an IIFE and code lines such as angular.module('angularSpinner', []).]

After

(function(window, angular, undefined) {
'use strict';
angular.module('angularSpinner', [])
.factory('usSpinnerService', ['$rootScope', function ($rootScope) {
    var config = {};
    config.spin = function (key) {
        $rootScope.$broadcast('us-spinner:spin', key);
    };
}]);
})(window, window.angular);

Configure the component

In Machine Learning Designer, add the LLM-Clean Copyright Information (MaxCompute) component to a pipeline and configure its parameters in the pane on the right.

| Type | Parameter | Default | Description |
| --- | --- | --- | --- |
| Field settings | Select target column | None | Select the column or columns to process. |
| Field settings | Output table lifecycle | 28 | The lifecycle of the output temporary table, in days. The table is recycled after this period. |
| Tuning | Number of CPUs per instance | 100 | The number of CPUs for each map task instance. The value must be in the range [50, 800]. |
| Tuning | Memory size per instance (MB) | 1024 | The memory size for each map task instance, in MB. The value must be in the range [256, 12288]. |
| Tuning | Data size per instance (MB) | 256 | The maximum amount of data, in MB, that each map task instance can process. Use this parameter to control the map-side input. The value must be in the range [1, Integer.MAX_VALUE]. |