Clean GitHub code data for LLM training

更新时间:
复制 MD 格式

Raw GitHub code data contains privacy risks, noise, and near-duplicate files that degrade large language model (LLM) training quality. This tutorial shows you how to use the built-in LLM data processing components in PAI (Platform for AI) to clean 5,000 samples from the open source RedPajama project — covering sensitive data masking, noise removal, quality filtering, and deduplication.

Prerequisites

Before you begin, ensure that you have:

How it works

The LLM Data Processing - GitHub Code preset template builds a nine-stage pipeline. Each component passes its output to the next, so only samples that pass every filter reach the final data table.

Stage Component Problem it solves What it does
1 LLM-MaskSensitiveInfo-1 Training data contains real personally identifiable information (PII) that the model could reproduce Replaces email addresses with [EMAIL], phone numbers with [TELEPHONE] or [MOBILEPHONE], and ID card numbers with IDNUM
2 LLM-RemoveSpecialContent-1 URLs add noise and skew the model toward learning URL patterns instead of code logic Deletes all URLs from the content field
3 LLM-NormalizeText-1 Scraped code repositories often contain improperly encoded Unicode characters from mixed-encoding environments Applies Unicode normalization to the content field
4 LLM-RemoveCopyright-1 Copyright headers repeat across many files and add redundant tokens that inflate training cost without improving quality Deletes copyright information from the content field
5 LLM-CountFilter-1 Raw GitHub data includes binary files, configuration dumps, and other non-code content Removes samples where the ratio of letters and digits falls below the threshold. GitHub code is predominantly alphanumeric, so a low ratio indicates binary or non-code data
6 LLM-LengthFilter-1 Very short files are often auto-generated stubs; files with very long lines are often minified or obfuscated Filters by total content length, average line length, and maximum line length. Line lengths are measured by splitting on the newline character (\n)
7 LLM-FilterByNGramRepetitionRatio-1 Repetitive code — such as auto-generated getters and setters — degrades training quality Uses a sliding window of size N to generate character-level and word-level grams. Computes the repetition ratio as (grams appearing more than once) / (all grams) and removes samples above the threshold. Word-level statistics use lowercase comparison
8 LLM-LengthFilter-2 Files with too few or too many words are outliers that hurt model generalization Splits each sample into words on spaces and filters by word count
9 LLM-DeduplicateByMinHash-1 Near-duplicate files (for example, copied code with minor formatting edits) inflate the training set and cause overfitting Removes similar text using MinHash

Clean and process GitHub code data

Step 1: Open Machine Learning Designer

  1. Log on to the PAI console.

  2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of your workspace.

  3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer).

Step 2: Build the workflow from a preset template

  1. On the Machine Learning Designer page, click the Preset Templates tab.

  2. On the LLM tab, find LLM Data Processing - GitHub Code and click Create.

  3. In the Create Workflow dialog box, configure the parameters and click OK. Default values work for this tutorial.

    The Workflow Data Storage parameter sets the path of the OSS bucket used to store data that is generated when the workflow runs.
  4. In the workflow list, double-click the workflow to open it. The system automatically assembles the nine-component pipeline from the preset template:

    Workflow canvas showing the nine LLM data processing components connected in sequence

Step 3: Review the component behavior

The following examples show the data transformations applied by key components.

LLM-MaskSensitiveInfo-1 replaces email addresses, phone numbers, and ID card numbers with placeholders:

LLM-RemoveSpecialContent-1 deletes all URLs from the content field:

  • Before: URL visible in content field

  • After: URL removed

LLM-RemoveCopyright-1 deletes copyright information from the content field:

  • Before: Copyright header in content

  • After: Copyright removed

LLM-CountFilter-1 removes samples with a low alphanumeric ratio:

Samples removed by count filter

LLM-LengthFilter-1 removes samples that are too short or have excessively long lines:

Samples removed by length filter

Step 4: Run the workflow

Click the Run button Run button above the canvas.

Step 5: View the output

After the workflow completes, right-click the Write to Data Table-1 component and choose View Data > Output from the shortcut menu.

The output table shows the data samples that passed all filters and processing steps.

Output data table showing cleaned and filtered samples