LLM-Clean Special Content component - Platform for AI (PAI) - Alibaba Cloud Help Center

The LLM-Clean Special Content component preprocesses text data for a large language model (LLM). It removes special content such as navigation information, author information, source information, URLs, non-printable characters, and HTML markup. It also parses and extracts text from HTML structures.

Limitations

This component supports only the MaxCompute compute engine.

How it works

The LLM-Clean Special Content component first splits the text into lines at line breaks and then applies the following features:

Remove navigation information
- Navigation information keywords include: 'Home>', 'Main page>', 'Home»', 'Home/', and 'Home|'.
- Navigation information regular expressions: 'Current location:.*[>]{1,}' and 'Location:.*[>]{1,}'.
- The component deletes lines that contain these keywords or match these regular expressions.
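
The navigation rule above can be sketched in Python as follows. This is a minimal illustration, not the component's actual implementation: the function names are hypothetical, and it assumes that keywords are matched as case-sensitive substrings, that the regular expressions may match anywhere in a line, and that lines are split and rejoined on '\n'.

```python
import re

# Keywords and patterns copied from the list above.
NAV_KEYWORDS = ["Home>", "Main page>", "Home»", "Home/", "Home|"]
NAV_PATTERNS = [
    re.compile(r"Current location:.*[>]{1,}"),
    re.compile(r"Location:.*[>]{1,}"),
]


def is_navigation_line(line):
    """Return True if the line contains a keyword or matches a pattern."""
    if any(keyword in line for keyword in NAV_KEYWORDS):
        return True
    # Assumption: a pattern may match anywhere in the line (search, not fullmatch).
    return any(pattern.search(line) for pattern in NAV_PATTERNS)


def remove_navigation(text):
    """Split on line breaks, drop navigation lines, and rejoin the rest."""
    kept = [line for line in text.split("\n") if not is_navigation_line(line)]
    return "\n".join(kept)
```

For example, a line such as 'Home> News > Tech' is dropped because it contains the keyword 'Home>', while ordinary body text is kept.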

Remove author information

The component deletes a line if it contains one of the specified author-related keywords and at least one punctuation mark from the set '.?!;:,'.

Author information keywords include: 'Reporter ', 'Source:', 'Editor:', 'Login|Register', 'This article URL:', 'Publish date:', 'Time added:', 'Share to:', '“Scan”', 'Related links:', 'Lottery', 'Site navigation ', '| Contact us', 'Homepage ', 'Current location:', 'Published at ', and 'Location: '.
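
As a rough sketch of this rule, the following Python checks each line for both conditions. The function names are hypothetical, and the sketch assumes case-sensitive substring matching and ASCII punctuation only; the component's real matching behavior may differ.

```python
# Keywords and punctuation set copied from the description above.
AUTHOR_KEYWORDS = [
    "Reporter ", "Source:", "Editor:", "Login|Register", "This article URL:",
    "Publish date:", "Time added:", "Share to:", "“Scan”", "Related links:",
    "Lottery", "Site navigation ", "| Contact us", "Homepage ",
    "Current location:", "Published at ", "Location: ",
]
PUNCTUATION = ".?!;:,"


def is_author_line(line):
    """Return True only if the line has a keyword AND a punctuation mark."""
    has_keyword = any(keyword in line for keyword in AUTHOR_KEYWORDS)
    has_punctuation = any(char in PUNCTUATION for char in line)
    return has_keyword and has_punctuation


def remove_author_info(text):
    """Drop every line that satisfies both conditions."""
    kept = [line for line in text.split("\n") if not is_author_line(line)]
    return "\n".join(kept)
```

Under these assumptions, 'Reporter Jane Doe, Beijing' is removed, whereas 'Lottery results' is kept because it contains a keyword but no punctuation mark from the set.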

Remove source information

Regular expressions for the article source include: r'(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})', r'\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[source:|editor:]'.

The component checks only the first five lines for matches with the regular expressions and deletes any matching lines.

Note
If the removal of navigation and author information is also enabled, the component checks the first five lines of the text that remains after those removals, not of the original text.
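
The first-five-lines check can be illustrated with the short Python sketch below. The patterns are copied verbatim from the text above; note that bracketed parts such as `[-/year]` are character classes, so each matches a single character. The function name and the `HEAD_LINES` constant are hypothetical, and the sketch assumes a pattern may match anywhere in a line.

```python
import re

# Patterns copied verbatim from the description above.
SOURCE_PATTERNS = [
    re.compile(r"(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})"),
    re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[source:|editor:]"),
]
HEAD_LINES = 5  # only this many leading lines are checked


def remove_source_info(text):
    """Delete matching lines among the first five lines; leave the rest alone."""
    lines = text.split("\n")
    head, tail = lines[:HEAD_LINES], lines[HEAD_LINES:]
    head = [
        line for line in head
        if not any(pattern.search(line) for pattern in SOURCE_PATTERNS)
    ]
    return "\n".join(head + tail)
```

To reproduce the behavior described in the note, pass in the text that is left after navigation and author removal rather than the original text.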

Remove URLs

The component deletes any text that matches the regular expression r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+'.
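
A minimal Python sketch of this step, using the regular expression exactly as given above. The function name is hypothetical, and the sketch assumes matches are replaced with an empty string, so surrounding whitespace is left in place.

```python
import re

# Pattern copied verbatim from the description above.
URL_PATTERN = re.compile(r"(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+")


def remove_urls(text):
    """Delete every substring that matches the URL pattern."""
    return URL_PATTERN.sub("", text)


if __name__ == "__main__":
    # The comment line from the AngularJS example on this page.
    print(repr(remove_urls(" (c) 2010-2014 Google, Inc. http://angularjs.org")))
```

Because the scheme group is optional in the pattern, text that starts with '://' is also matched.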

Remove non-printable characters

The component deletes any characters that match the regular expression '[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+'.
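
The following Python sketch applies the same character class. The function name is hypothetical. Note that the class includes the tab (\x09) and carriage return (\x0d) characters but not the line feed (\x0a), so line breaks written as '\n' are preserved.

```python
import re

# Character class copied from the description above. The escapes are resolved
# by Python, so the compiled pattern contains the control characters themselves.
NON_PRINTABLE = re.compile(
    "[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f"
    "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+"
)


def remove_non_printable(text):
    """Delete every run of characters listed in the class."""
    return NON_PRINTABLE.sub("", text)
```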

Remove HTML markup and parse HTML text

The component replaces '<li>' and '<ol>' with '\n*' and removes the '</li>' and '</ol>' tags. It then parses the remaining HTML and returns the text content.
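
The two steps above can be sketched with Python's standard-library `html.parser` module. This is an illustration under stated assumptions, not the component's implementation: the parser the component actually uses is not documented here, the class and function names are hypothetical, and '\n*' is read as a newline followed by an asterisk.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(text):
    # Step 1: turn list markup into plain-text bullets.
    # Assumption: '\n*' means a newline followed by an asterisk.
    for tag in ("<li>", "<ol>"):
        text = text.replace(tag, "\n*")
    for tag in ("</li>", "</ol>"):
        text = text.replace(tag, "")
    # Step 2: parse what is left and keep only the text content.
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)
```

With this sketch, each '<li>' item starts on a new line prefixed with an asterisk, and tags such as '<p>' or '<b>' are dropped while their text is kept.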

Example of removing a URL:

Before:

/*
 AngularJS v1.3.0-beta.2
 (c) 2010-2014 Google, Inc. http://angularjs.org
 License: MIT
*/
(function(H,a,A){'use strict';function D(p,g){g=g||
{};a.forEach(g,function(a,c){delete g[c]});for(var c in
p)!p.hasOwnProperty(c)||"$"===c.charAt(0)&&"$"===c.charAt(1)||(g[c]=p[c])}})

After:

After processing, the URL http://angularjs.org is removed from the comment block.

Visual configuration parameters

You can configure the component parameters in the Machine Learning Designer interface.

Tab	Parameter	Required	Description	Default
Field Settings	Select target column	Yes	The column to process.	None
Field Settings	Output table lifecycle	No	The retention period, in days, for the temporary table that this component generates. The table is recycled after this period.	28
Tuning	Number of CPUs per instance of map task	No	The number of CPUs per map task instance. Valid values range from 50 to 800.	100
Tuning	The memory size per instance of map task	No	The memory size, in MB, per map task instance. Valid values range from 256 to 12,288.	1024
Tuning	The maximum size of input data for a map	No	The maximum amount of input data, in MB, that a map task instance can process. This parameter controls the input split size for the map phase. The valid range is [1, Integer.MAX_VALUE].	256
