LLM-LaTeX Remove Header (DLC)

LaTeX documents typically begin with a preamble containing document class declarations, package imports, author metadata, and abstract formatting commands. Including this preamble in LLM training data introduces noise that degrades model quality. The LLM-LaTeX Remove Header (DLC) component strips everything before the first section command, keeping only the body content that starts at \section, \chapter, or another structural heading.

How it works

The component scans each LaTeX document in your input file using the following regular expression to locate the first structural section command:

r'^(.*?)(\\\bchapter\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bpart\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubsubsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bparagraph\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubparagraph\b\*?(?:\[(.*?)\])?\{(.*?)\})'

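The expression matches section commands of the form \<section-type>[optional-args]{name}, with an optional * after the command name for unnumbered variants such as \section*{name}. It covers \chapter, \part, \section, \subsection, \subsubsection, \paragraph, and \subparagraph. The alternatives for the individual section types are separated by |.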

All content before the first matched section command is removed. The matched section line and everything after it are retained.
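The matching step can be sketched in Python as follows. This is a minimal illustration, not the component's actual source: the helper name `remove_header` is hypothetical, and the `re.DOTALL` flag is an assumption (the documentation does not state which flags are used; without a flag such as this one, the lazy prefix group could not span multiple lines). The pattern is rebuilt programmatically from the same seven alternatives shown above.

```python
import re

# Section commands in the same order as the documented pattern.
SECTION_TYPES = ["chapter", "part", "section", "subsection",
                 "subsubsection", "paragraph", "subparagraph"]

# One alternative per command: \<type>, optional *, optional [..], then {..}.
_ALTERNATIVES = "|".join(
    r"\\\b" + name + r"\b\*?(?:\[(.*?)\])?\{(.*?)\}" for name in SECTION_TYPES
)

# Assumption: DOTALL lets the lazy prefix group (.*?) cross line breaks.
HEADER_RE = re.compile(r"^(.*?)(" + _ALTERNATIVES + r")", re.DOTALL)


def remove_header(text):
    """Return text from the first section command onward, or None if none is found."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    # Group 1 is the preamble; group 2 starts at the first section command.
    return text[match.start(2):]
```

Calling `remove_header` on a document whose body starts with `\section*{Intro}` returns the text beginning at that command. A document with no sectioning command returns `None`, so the caller decides what to do with it.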

[Figure: a sample document before processing and after processing]

Boundary behavior: If a LaTeX document contains no matching section command, the component's behavior depends on the Whether Remove no Header Sample setting. With the default (selected), the document is dropped from the output. If you clear this option, the document is kept unchanged.
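The boundary behavior can be illustrated with a short Python sketch that processes one JSON Lines record at a time. The function name `process_line`, the field name `text`, and the keyword `remove_no_header_sample` are hypothetical stand-ins for the component's settings. The regular expression is a simplified equivalent of the documented pattern, and `re.DOTALL` is an assumption.

```python
import json
import re

# Simplified equivalent of the documented pattern (assumption: DOTALL).
SECTION_RE = re.compile(
    r"\\(?:chapter|part|section|subsection|subsubsection|paragraph|subparagraph)"
    r"\b\*?(?:\[.*?\])?\{.*?\}",
    re.DOTALL,
)


def process_line(line, field="text", remove_no_header_sample=True):
    """Return the processed JSON line, or None if the record is dropped."""
    record = json.loads(line)
    match = SECTION_RE.search(record[field])
    if match is None:
        # No section command: drop the record or keep it unchanged.
        return None if remove_no_header_sample else line
    # Keep the matched section command and everything after it.
    record[field] = record[field][match.start():]
    return json.dumps(record, ensure_ascii=False)
```

With `remove_no_header_sample=False`, a record that contains no section command is returned exactly as it was read, which mirrors clearing the option.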

Supported computing resources

DLC

Prerequisites

Before you begin, make sure you have:

  • An input file in JSON Lines format, where each line is a valid JSON object containing a LaTeX text field. The file as a whole is not a valid JSON object. Download a sample file to see the expected structure.

  • An Object Storage Service (OSS) bucket accessible from your PAI workspace
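Before uploading, the input format described above can be checked locally with a few lines of Python. This is an optional sketch, not part of the component. The function name `check_jsonl` is hypothetical, and requiring the field value to be a string is an assumption about what the component expects.

```python
import json


def check_jsonl(lines, field):
    """Return (line_number, problem) pairs for lines that do not fit the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
        elif not isinstance(record.get(field), str):
            # Assumption: the target field must hold the LaTeX text as a string.
            problems.append((number, "missing or non-string field"))
    return problems
```

To check a local file, pass the open file object, for example `check_jsonl(open("input.jsonl", encoding="utf-8"), "text")`, where both the file name and the field name are placeholders.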

Configure the component

Configure the LLM-LaTeX Remove Header (DLC) component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console.

Fields setting

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| Target Process Field | Yes | The name of the JSON field containing the LaTeX text to process | |
| Whether Remove no Header Sample | No | When selected, documents with no matching section command are dropped from the output. Clear this option to keep such documents unchanged, which is useful when your dataset contains a mix of LaTeX and non-LaTeX content that you want to preserve. | Selected |
| OSS Directory for Saving OutputData | No | The OSS path where processed data is written. If left blank, the default workspace path is used. | |

Tuning

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| Number of Processes | No | The number of parallel processes for processing. Increase this value for large datasets to reduce processing time. | 8 |

Select resource group

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| Public Resource Group | No | Specify the instance type (CPU or GPU), number of instances, and VPC | |
| Dedicated resource group | No | Specify the number of vCPUs, memory, shared memory, number of GPUs, and number of instances | |
| Maximum Running Duration (seconds) | No | The maximum time the component can run. The job is terminated if this limit is exceeded. | |