The LLM-Clean Special Content component preprocesses text data for a large language model (LLM). It removes special content such as navigation information, author information, source information, URLs, non-printable characters, and HTML markup. It also parses and extracts text from HTML structures.
Limitations
This component supports only the MaxCompute compute engine.
How it works
The LLM-Clean Special Content component first splits the text into lines at line breaks. It then supports the following features:
- Remove navigation information
  - Navigation information keywords include: 'Home>', 'Main page>', 'Home»', 'Home/', and 'Home|'.
  - Navigation information regular expressions: 'Current location:.*[>]{1,}' and 'Location:.*[>]{1,}'.
  - The component deletes lines that contain these keywords or match these regular expressions.
- Remove author information
  The component deletes a line if it contains one of the specified author-related keywords and at least one punctuation mark from the set '.?!;:,'.
  Author information keywords include: 'Reporter ', 'Source:', 'Editor:', 'Login|Register', 'This article URL:', 'Publish date:', 'Time added:', 'Share to:', '“Scan”', 'Related links:', 'Lottery', 'Site navigation ', '| Contact us', 'Homepage ', 'Current location:', 'Published at ', and 'Location: '.
- Remove source information
  Regular expressions for the article source include: r'(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})' and r'\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[source:|editor:]'.
  The component checks only the first five lines for matches with the regular expressions and deletes any matching lines.
  Note: If you enable the removal of navigation and author information, the component checks the first five lines of the remaining text, not the original text.
- Remove URLs
  The component deletes any text that matches the regular expression r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+'.
- Remove non-printable characters
  The component deletes any characters that match the regular expression '[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+'.
- Remove HTML markup and parse HTML text
  The component replaces '<li>' and '<ol>' with '\n*' and removes the '</li>' and '</ol>' tags. It then parses the remaining HTML and returns the text content.
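The three line-level features (navigation, author, and source removal) can be sketched in Python as follows. This is a minimal illustration, not the component's actual implementation: the function names are hypothetical, the keywords and patterns are copied verbatim from the list above, and the sketch assumes substring matching for keywords, `re.search` for the regular expressions, and the order navigation, author, source that the note implies.

```python
import re

# Keywords and patterns copied verbatim from the feature list above.
NAV_KEYWORDS = ['Home>', 'Main page>', 'Home»', 'Home/', 'Home|']
NAV_PATTERNS = [
    re.compile(r'Current location:.*[>]{1,}'),
    re.compile(r'Location:.*[>]{1,}'),
]
AUTHOR_KEYWORDS = [
    'Reporter ', 'Source:', 'Editor:', 'Login|Register', 'This article URL:',
    'Publish date:', 'Time added:', 'Share to:', '“Scan”', 'Related links:',
    'Lottery', 'Site navigation ', '| Contact us', 'Homepage ',
    'Current location:', 'Published at ', 'Location: ',
]
AUTHOR_PUNCTUATION = '.?!;:,'
# Note: [source:|editor:] is a character class, so it matches any single
# character from that set rather than the whole words.
SOURCE_PATTERNS = [
    re.compile(r'(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})'),
    re.compile(r'\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[source:|editor:]'),
]
SOURCE_HEAD_LINES = 5  # only the first five remaining lines are checked


def is_navigation_line(line):
    """True if the line contains a navigation keyword or matches a pattern."""
    return (any(k in line for k in NAV_KEYWORDS)
            or any(p.search(line) for p in NAV_PATTERNS))


def is_author_line(line):
    """True if the line has an author keyword AND a listed punctuation mark."""
    return (any(k in line for k in AUTHOR_KEYWORDS)
            and any(ch in line for ch in AUTHOR_PUNCTUATION))


def is_source_line(line):
    """True if the line matches one of the article-source patterns."""
    return any(p.search(line) for p in SOURCE_PATTERNS)


def clean_lines(text, remove_navigation=True, remove_author=True,
                remove_source=True):
    """Hypothetical helper: apply the enabled line-level filters in order."""
    lines = text.split('\n')
    if remove_navigation:
        lines = [ln for ln in lines if not is_navigation_line(ln)]
    if remove_author:
        lines = [ln for ln in lines if not is_author_line(ln)]
    if remove_source:
        # Checked against the lines that remain, not the original text.
        head = [ln for ln in lines[:SOURCE_HEAD_LINES] if not is_source_line(ln)]
        lines = head + lines[SOURCE_HEAD_LINES:]
    return '\n'.join(lines)


sample = ("Home> News > Tech\n"
          "Reporter Jane Doe, staff writer.\n"
          "2023-05-01 12:30:45\n"
          "The model was trained on cleaned text.")
print(clean_lines(sample))
```

Because the source check runs last, a timestamp that starts on line six of the original text can still be removed if earlier navigation or author lines were deleted first.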
Example of removing a URL:
- Before:
  /* AngularJS v1.3.0-beta.2 (c) 2010-2014 Google, Inc. http://angularjs.org License: MIT */ (function(H,a,A){'use strict';function D(p,g){g=g|| {};a.forEach(g,function(a,c){delete g[c]});for(var c in p)!p.hasOwnProperty(c)||"$"===c.charAt(0)&&"$"===c.charAt(1)||(g[c]=p[c])}})
- After:
  The URL http://angularjs.org is removed from the comment block. The rest of the text is unchanged.
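The character-level features (URL removal, non-printable character removal, and HTML handling) can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the two regular expressions are copied from the feature list, and the standard-library `html.parser` stands in for whatever HTML parser the component actually uses, which the documentation does not name.

```python
import re
from html.parser import HTMLParser

# Patterns copied from the feature list above.
URL_PATTERN = re.compile(r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+')
NON_PRINTABLE_PATTERN = re.compile(
    '[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f'
    '\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+'
)


def remove_urls(text):
    """Delete every substring that matches the URL pattern."""
    return URL_PATTERN.sub('', text)


def remove_non_printable(text):
    """Delete the listed control characters (newline \\x0a is not in the set)."""
    return NON_PRINTABLE_PATTERN.sub('', text)


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Rewrite list tags as described, then keep only the text content."""
    for tag in ('<li>', '<ol>'):
        text = text.replace(tag, '\n*')
    for tag in ('</li>', '</ol>'):
        text = text.replace(tag, '')
    extractor = _TextExtractor()  # assumption: stdlib parser as a stand-in
    extractor.feed(text)
    extractor.close()  # flush any buffered trailing text
    return ''.join(extractor.parts)


before = ("/* AngularJS v1.3.0-beta.2 (c) 2010-2014 Google, Inc. "
          "http://angularjs.org License: MIT */")
print(remove_urls(before))
print(strip_html('<p>Steps:</p><ol><li>First</li><li>Second</li></ol>'))
```

Only the matched URL is deleted, so the whitespace on either side of it stays in place.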
Visual configuration parameters
You can configure the component parameters in the Machine Learning Designer interface.
| Tab | Parameter | Required | Description | Default |
|---|---|---|---|---|
| Field Settings | Select target column | Yes | The column to process. | None |
| | Output table lifecycle | No | The retention period, in days, for the temporary table that this component generates. The table is recycled after this period. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs per map task instance. Valid values range from 50 to 800. | 100 |
| | The memory size per instance of map task | No | The memory size, in MB, per map task instance. Valid values range from 256 to 12,288. | 1024 |
| | The maximum size of input data for a map | No | The maximum amount of input data, in MB, that a map task instance can process. This parameter controls the input split size for the map phase. The valid range is [1, Integer.MAX_VALUE]. | 256 |
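The defaults and valid ranges in the table can be captured in a small validation sketch. The class and field names below are hypothetical and are not part of the component or of any Machine Learning Designer API. Only the default values and ranges come from the table.

```python
from dataclasses import dataclass

INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


@dataclass
class CleanSpecialContentParams:
    """Hypothetical container mirroring the parameter table above."""
    target_column: str                     # required, no default
    output_table_lifecycle_days: int = 28
    map_cpu: int = 100                     # valid range: 50 to 800
    map_memory_mb: int = 1024              # valid range: 256 to 12288
    map_max_input_mb: int = 256            # valid range: 1 to INT_MAX

    def validate(self):
        """Raise ValueError if a value falls outside its documented range."""
        if not self.target_column:
            raise ValueError("target_column is required")
        checks = [
            ("map_cpu", self.map_cpu, 50, 800),
            ("map_memory_mb", self.map_memory_mb, 256, 12288),
            ("map_max_input_mb", self.map_max_input_mb, 1, INT_MAX),
        ]
        for name, value, low, high in checks:
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")


params = CleanSpecialContentParams(target_column="content")
params.validate()  # the documented defaults pass
```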
Related documentation
For more information about components in Machine Learning Designer, see Overview of Machine Learning Designer.