LLM-Clean Special Content (MaxCompute)

The LLM-Clean Special Content component preprocesses text data for a large language model (LLM). It removes special content such as navigation information, author information, source information, URLs, non-printable characters, and HTML markup. It also parses and extracts text from HTML structures.

Limitations

This component supports only the MaxCompute compute engine.

How it works

The component first splits the text into lines at line breaks. It then applies the following features:

  • Remove navigation information

    • Navigation information keywords include: 'Home>', 'Main page>', 'Home»', 'Home/', and 'Home|'.

    • Navigation information regular expressions: 'Current location:.*[>]{1,}' and 'Location:.*[>]{1,}'.

    • The component deletes lines that contain these keywords or match these regular expressions.

  • Remove author information

    The component deletes a line if it contains one of the specified author-related keywords and at least one punctuation mark from the set '.?!;:,'.

    Author information keywords include: 'Reporter ', 'Source:', 'Editor:', 'Login|Register', 'This article URL:', 'Publish date:', 'Time added:', 'Share to:', '“Scan”', 'Related links:', 'Lottery', 'Site navigation ', '| Contact us', 'Homepage ', 'Current location:', 'Published at ', and 'Location: '.

  • Remove source information

    The component uses two regular expressions to detect article source information: r'(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})' and r'\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[source:|editor:]'.

    The component checks only the first five lines for matches with the regular expressions and deletes any matching lines.

    Note

    If you enable the removal of navigation and author information, the component checks the first five lines of the remaining text, not the original text.

  • Remove URLs

    The component deletes any text that matches the regular expression r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+'.

  • Remove non-printable characters

    The component deletes any characters that match the regular expression '[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+'.

  • Remove HTML markup and parse HTML text

    The component replaces '<li>' and '<ol>' with '\n*' and removes the '</li>' and '</ol>' tags. It then parses the remaining HTML and returns the text content.
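
The three line-level filters above (navigation, author, and source information) can be sketched in Python as follows. This is a minimal illustration, not the component's actual implementation. The keyword lists and patterns are copied from the descriptions above, while the function names, the use of `re.search` rather than a full-line match, and the order in which the filters run are assumptions. Note that, as written, the bracketed parts of the source patterns are character classes, so `[source:|editor:]` matches any single one of those characters.

```python
import re

# Keywords and patterns copied from the feature descriptions above.
NAV_KEYWORDS = ["Home>", "Main page>", "Home»", "Home/", "Home|"]
NAV_PATTERNS = [
    re.compile(r"Current location:.*[>]{1,}"),
    re.compile(r"Location:.*[>]{1,}"),
]
AUTHOR_KEYWORDS = [
    "Reporter ", "Source:", "Editor:", "Login|Register", "This article URL:",
    "Publish date:", "Time added:", "Share to:", "“Scan”", "Related links:",
    "Lottery", "Site navigation ", "| Contact us", "Homepage ",
    "Current location:", "Published at ", "Location: ",
]
AUTHOR_PUNCTUATION = ".?!;:,"
SOURCE_PATTERNS = [
    re.compile(r"(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})"),
    re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[source:|editor:]"),
]
SOURCE_HEAD_LINES = 5  # only the first five lines are checked for source info


def is_navigation(line):
    """True if the line contains a navigation keyword or matches a pattern."""
    return (any(k in line for k in NAV_KEYWORDS)
            or any(p.search(line) for p in NAV_PATTERNS))


def is_author(line):
    """True if the line has an author keyword AND at least one punctuation mark."""
    has_keyword = any(k in line for k in AUTHOR_KEYWORDS)
    has_punctuation = any(ch in AUTHOR_PUNCTUATION for ch in line)
    return has_keyword and has_punctuation


def is_source(line):
    """True if the line matches one of the source-information patterns."""
    return any(p.search(line) for p in SOURCE_PATTERNS)


def clean_lines(text):
    """Apply the navigation, author, and source filters (assumed order)."""
    lines = text.split("\n")
    lines = [ln for ln in lines if not is_navigation(ln)]
    lines = [ln for ln in lines if not is_author(ln)]
    # The source check looks at the first five REMAINING lines, as the note says.
    head = [ln for ln in lines[:SOURCE_HEAD_LINES] if not is_source(ln)]
    return "\n".join(head + lines[SOURCE_HEAD_LINES:])
```

In this sketch, a line such as "Lottery results tonight" is kept because it contains an author keyword but no punctuation mark from the set, and a timestamp line is removed only if it falls within the first five remaining lines.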

Example of removing a URL:

  • Before:

    /*
     AngularJS v1.3.0-beta.2
     (c) 2010-2014 Google, Inc. http://angularjs.org
     License: MIT
    */
    (function(H,a,A){'use strict';function D(p,g){g=g||
    {};a.forEach(g,function(a,c){delete g[c]});for(var c in
    p)!p.hasOwnProperty(c)||"$"===c.charAt(0)&&"$"===c.charAt(1)||(g[c]=p[c])}})
  • After:

    After processing, the URL http://angularjs.org is removed from the comment block.
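
The URL, non-printable character, and HTML steps described earlier can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the component's own code. The two regular expressions are copied from the descriptions above. The function names are hypothetical, and the standard-library `html.parser` module stands in for whatever HTML parser the component actually uses, so details such as the handling of script or style content may differ.

```python
import re
from html.parser import HTMLParser

# Patterns copied from the feature descriptions above.
URL_PATTERN = re.compile(r"(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+")
NON_PRINTABLE_PATTERN = re.compile(
    "[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f"
    "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+"
)


def remove_urls(text):
    """Delete every substring that matches the URL pattern."""
    return URL_PATTERN.sub("", text)


def remove_non_printable(text):
    """Delete control characters in the listed set (line feeds are not in it)."""
    return NON_PRINTABLE_PATTERN.sub("", text)


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (stand-in parser)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Rewrite list tags as described above, then return the text content."""
    for tag in ("<li>", "<ol>"):
        text = text.replace(tag, "\n*")
    for tag in ("</li>", "</ol>"):
        text = text.replace(tag, "")
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

Applied to the comment line from the example, `remove_urls("(c) 2010-2014 Google, Inc. http://angularjs.org")` returns `"(c) 2010-2014 Google, Inc. "`. Because the scheme group in the pattern is optional, the pattern as written also matches a bare `://host` sequence, so for an input such as `ftp://example.com` only the `://example.com` part is removed.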

Visual configuration parameters

You can configure the component parameters in the Machine Learning Designer interface.

| Tab | Parameter | Required | Description | Default |
| --- | --- | --- | --- | --- |
| Field Settings | Select target column | Yes | The column to process. | None |
| Field Settings | Output table lifecycle | No | The retention period, in days, for the temporary table that this component generates. The table is recycled after this period. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs per map task instance. Valid values range from 50 to 800. | 100 |
| Tuning | The memory size per instance of map task | No | The memory size, in MB, per map task instance. Valid values range from 256 to 12,288. | 1024 |
| Tuning | The maximum size of input data for a map | No | The maximum amount of input data, in MB, that a map task instance can process. This parameter controls the input split size for the map phase. The valid range is [1, Integer.MAX_VALUE]. | 256 |
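
As a quick way to sanity-check tuning values before entering them, the documented ranges and defaults can be encoded in a small Python helper. This is only a sketch: the dictionary keys and the `resolve_tuning` function are hypothetical names introduced for illustration and are not parameters or APIs of the component. The ranges and defaults are taken from the Tuning parameter descriptions above, and `Integer.MAX_VALUE` is Java's 2^31 - 1.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

# (minimum, maximum, default) per tuning parameter; key names are hypothetical.
TUNING_RANGES = {
    "map_cpu": (50, 800, 100),
    "map_memory_mb": (256, 12288, 1024),
    "map_split_size_mb": (1, INT_MAX, 256),
}


def resolve_tuning(overrides=None):
    """Return tuning values, filling in defaults and rejecting out-of-range input."""
    overrides = overrides or {}
    unknown = set(overrides) - set(TUNING_RANGES)
    if unknown:
        raise KeyError(f"unknown tuning parameters: {sorted(unknown)}")
    resolved = {}
    for name, (low, high, default) in TUNING_RANGES.items():
        value = overrides.get(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        resolved[name] = value
    return resolved
```

For example, `resolve_tuning({"map_memory_mb": 4096})` keeps the default CPU and split-size values, while `resolve_tuning({"map_cpu": 900})` raises a `ValueError` because 900 exceeds the documented maximum of 800.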

Related documentation

For more information about components in Machine Learning Designer, see Overview of Machine Learning Designer.