LaTeX documents typically begin with a preamble containing document class declarations, package imports, author metadata, and abstract formatting commands. Including this preamble in LLM training data adds boilerplate markup that carries little content and can degrade model quality. The LLM-LaTeX Remove Header (DLC) component strips everything before the first section command, keeping only the body content that starts at \section, \chapter, or another structural heading.
How it works
The component scans each LaTeX document in your input file using the following regular expression to locate the first structural section command:
r'^(.*?)(\\\bchapter\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bpart\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubsubsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bparagraph\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubparagraph\b\*?(?:\[(.*?)\])?\{(.*?)\})'
The expression matches section commands in the \<section-type>[optional-args]{name} format, with an optional * after the command name for unnumbered variants. It covers \chapter, \part, \section, \subsection, \subsubsection, \paragraph, and \subparagraph. The alternatives for the different section types are separated by |.
All content before the first matched section command is removed. The matched section line and everything after it are retained.
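The stripping step can be sketched in a few lines of Python. This is an illustration under assumptions, not the component's implementation: `SECTION_RE` is a condensed equivalent of the pattern above (one alternation instead of seven repeated branches), `strip_header` is a hypothetical helper name, and the sketch uses `re.search` so that it does not depend on which regex flags the component passes.

```python
import re

# Condensed stand-in for the component's pattern (an assumption, not a copy):
# a backslash, one of the seven command names, an optional "*", an optional
# [short title], and the {title} argument.
SECTION_RE = re.compile(
    r"\\(?:part|chapter|section|subsection|subsubsection|paragraph|subparagraph)\b"
    r"\*?(?:\[.*?\])?\{.*?\}"
)


def strip_header(latex):
    """Return the text from the first section command onward, or None if absent."""
    match = SECTION_RE.search(latex)
    if match is None:
        return None
    # Keep the matched command itself and everything after it.
    return latex[match.start():]


doc = (
    "\\documentclass{article}\n"
    "\\usepackage{amsmath}\n"
    "\\title{A Title}\n"
    "\\begin{document}\n"
    "\\maketitle\n"
    "\\section{Introduction}\n"
    "Body text.\n"
    "\\end{document}\n"
)
print(strip_header(doc))
```

For this sample, the printed text starts at `\section{Introduction}`; the `\documentclass`, `\usepackage`, `\title`, `\begin{document}`, and `\maketitle` lines are gone. Commands such as `\partial` are not treated as section commands because the word boundary after the command name must hold.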
Boundary behavior: If a LaTeX document contains no matching section command, the component's behavior depends on the Whether Remove no Header Sample setting. With the default setting (selected), the document is dropped from the output. If you clear this option, the document is kept unchanged.
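The two outcomes of this setting can be illustrated with a small Python sketch. The function name `filter_documents` and the parameter `remove_no_header_sample` are hypothetical names chosen to mirror the console option, and `HAS_SECTION` is a simplified approximation of the component's pattern.

```python
import re

# Simplified detector for the seven section commands (an approximation of
# the component's pattern, not a copy of it).
HAS_SECTION = re.compile(
    r"\\(?:part|chapter|section|subsection|subsubsection|paragraph|subparagraph)\b"
    r"\*?(?:\[.*?\])?\{"
)


def filter_documents(docs, remove_no_header_sample=True):
    """Strip headers; drop or keep documents that have no section command."""
    kept = []
    for text in docs:
        match = HAS_SECTION.search(text)
        if match:
            # Normal case: keep the section command and everything after it.
            kept.append(text[match.start():])
        elif not remove_no_header_sample:
            # Option cleared: keep the document unchanged.
            kept.append(text)
        # Option selected (default): the document is dropped.
    return kept


docs = ["\\title{T}\n\\section{A}\nbody", "plain text without LaTeX"]
print(filter_documents(docs))
print(filter_documents(docs, remove_no_header_sample=False))
```

With the default, only the first document survives (with its header removed). With the option cleared, the second document is also returned, exactly as it was supplied.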
Supported computing resources
Prerequisites
Before you begin, make sure you have:
- An input file in JSON Lines format, where each line is a valid JSON object containing a LaTeX text field. The file as a whole is not a valid JSON object. Download a sample file to see the expected structure.
- An Object Storage Service (OSS) bucket accessible from your PAI workspace.
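As a rough illustration of the JSON Lines requirement, the following Python sketch builds a two-record input and checks it line by line. The field name `text` is a hypothetical choice; use whatever field name you later enter as the Target Process Field.

```python
import json

# Two sample records; "text" is a hypothetical field name.
records = [
    {"text": "\\documentclass{article}\n\\section{Intro}\nFirst document."},
    {"text": "\\documentclass{book}\n\\chapter{One}\nSecond document."},
]

# json.dumps escapes the newlines inside each LaTeX string, so every record
# occupies exactly one physical line of the file.
jsonl = "\n".join(json.dumps(record) for record in records)

# Each line parses on its own as a JSON object.
parsed = [json.loads(line) for line in jsonl.splitlines()]

# The file as a whole is not a single JSON value.
try:
    json.loads(jsonl)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False

print(len(parsed), whole_file_is_json)
```

Writing `jsonl` to a file and uploading it to your OSS bucket would give an input of the expected shape, assuming the field name matches your configuration.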
Configure the component
Configure the LLM-LaTeX Remove Header (DLC) component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console.
Fields setting
| Parameter | Required | Description | Default |
|---|---|---|---|
| Target Process Field | Yes | The name of the JSON field containing the LaTeX text to process | — |
| Whether Remove no Header Sample | No | When selected, documents with no matching section command are dropped from the output. Clear this option to keep such documents unchanged—useful when your dataset contains a mix of LaTeX and non-LaTeX content that you want to preserve. | Selected |
| OSS Directory for Saving OutputData | No | The OSS path where processed data is written. If left blank, the default workspace path is used. | — |
Tuning
| Parameter | Required | Description | Default |
|---|---|---|---|
| Number of Processes | No | The number of parallel processes for processing. Increase this value for large datasets to reduce processing time. | 8 |
Select resource group
| Parameter | Required | Description | Default |
|---|---|---|---|
| Public Resource Group | No | Specify the instance type (CPU or GPU), number of instances, and VPC | — |
| Dedicated Resource Group | No | Specify the number of vCPUs, memory, shared memory, number of GPUs, and number of instances | — |
| Maximum Running Duration (seconds) | No | The maximum time the component can run. The job is terminated if this limit is exceeded. | — |

