LLM data processing: Wikipedia

LLM data processing algorithms allow you to edit and transform data, filter low-quality samples, and remove duplicate samples. You can combine algorithms based on your requirements to filter for suitable data and generate text that meets your needs, yielding high-quality data for subsequent LLM training. This topic demonstrates how to use the LLM data processing components provided by PAI to clean and process Wikipedia data, using a small subset from the open-source RedPajama Wikipedia dataset.

Dataset

The LLM Data Processing-Wikipedia (web text data) preset template in Designer uses a dataset of 5,000 samples extracted from the raw data of the open-source RedPajama project.

Create and run a workflow

  1. Go to the Designer page.

    1. Log on to the PAI console.

    2. In the upper-left corner, select a region.

    3. In the left navigation pane, click Workspaces, and then click the name of the workspace that you want to use.

    4. In the left navigation pane, choose Model Training > Visualized Modeling (Designer) to go to the Designer page.

  2. Create a workflow.

    1. On the Preset Templates tab, choose Business Area > LLM, and then click Create on the LLM Data Processing-Wikipedia (web text data) template card.

    2. Configure the workflow parameters (or keep the default settings), and then click Confirm.

    3. In the workflow list, select the workflow you created and click Open.

  3. Review the workflow.

    The following list describes the key algorithm components in the workflow:

    • LLM-Sensitive Content Mask (MaxCompute)-1

      Masks sensitive information in the "text" field. For example:

      • Replaces email address strings with [EMAIL].

      • Replaces telephone or mobile phone numbers with [TELEPHONE] or [MOBILEPHONE].

      • Replaces ID card numbers with [IDNUM].

    • LLM-Clean Special Content (MaxCompute)-1

      Removes URLs from the "text" field.

    • LLM-Text Normalizer (MaxCompute)-1

      Normalizes Unicode characters in the "text" field and converts Traditional Chinese to Simplified Chinese.

    • LLM-Count Filter (MaxCompute)-1

      Removes samples whose "text" field does not meet the specified count or ratio of alphanumeric characters. Because most characters in the Wikipedia dataset are letters and digits, samples that fall outside the thresholds are likely to be dirty data.

    • LLM-Length Filter (MaxCompute)-1

      Filters samples based on the average line length in the "text" field. The component splits the text of each sample on the newline character \n and averages the lengths of the resulting lines.

    • LLM-N-Gram Repetition Filter (MaxCompute)-1

      Filters samples based on the character-level N-gram repetition rate in the "text" field. The component slides a window of size N over the text one character at a time, producing a sequence of segments of length N. Each segment is a gram. The component counts the occurrences of every distinct gram and computes the repetition rate as the total number of occurrences of grams that appear more than once, divided by the total number of occurrences of all grams. Samples are filtered based on this rate.

    • LLM-Sensitive Words Filter (MaxCompute)-1

      Filters out samples whose "text" field contains words from the system's preset sensitive word list.

    • LLM-Language Recognition and Filter (MaxCompute)-1

      Identifies the language of the text in the "text" field, calculates a confidence score for the result, and filters samples based on the configured confidence threshold.

    • LLM-Length Filter (MaxCompute)-2

      Filters samples based on the maximum line length in the "text" field. The component splits the text of each sample on the newline character \n and takes the length of the longest resulting line.

    • LLM-Perplexity Filter (MaxCompute)-1

      Calculates the perplexity of the text in the "text" field and filters samples based on a specified perplexity threshold.

    • LLM-Special Characters Ratio Filter (MaxCompute)-1

      Removes samples whose "text" field does not meet the specified ratio of special characters.

    • LLM-Length Filter (MaxCompute)-3

      Filters samples based on the length of the "text" field.

    • LLM-Tokenization (MaxCompute)-1

      Tokenizes the text in the "text" field and saves the results to a new column.

    • LLM-Length Filter (MaxCompute)-4

      Splits the text in each sample into a list of words using a space (" ") as the separator, then filters the sample based on the word count.

    • LLM-N-Gram Repetition Filter (MaxCompute)-2

      Filters samples based on the word-level N-gram repetition rate in the "text" field. The component first converts all words to lowercase, and then slides a window of size N over the text one word at a time, producing a sequence of segments of N words. Each segment is a gram. The component counts the occurrences of every distinct gram and computes the repetition rate as the total number of occurrences of grams that appear more than once, divided by the total number of occurrences of all grams. Samples are filtered based on this rate.

    • LLM-MinHash Deduplicator (MaxCompute)-1

      Removes similar (near-duplicate) samples based on the MinHash algorithm.
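The repetition rate used by the two N-Gram Repetition Filter components can be illustrated with a short Python sketch. This is not the PAI implementation. The function names, the `level` parameter, the whitespace-based word splitting, and the rule that samples above a maximum rate are dropped are all assumptions made for illustration.

```python
from collections import Counter


def ngram_repetition_rate(text: str, n: int, level: str = "char") -> float:
    """Return the share of gram occurrences that belong to repeated grams.

    level="char" slides the window over characters.
    level="word" lowercases the text and slides the window over words
    (splitting on whitespace is an assumption of this sketch).
    """
    if level == "word":
        units = text.lower().split()
    else:
        units = list(text)

    # A text shorter than the window yields no grams at all.
    if len(units) < n:
        return 0.0

    # Sliding window of size n, advancing one unit at a time.
    grams = [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]
    counts = Counter(grams)

    # Total occurrences of grams that appear more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def keep_sample(text: str, n: int, max_rate: float, level: str = "char") -> bool:
    """Keep a sample only if its repetition rate is at most max_rate.

    max_rate is a hypothetical threshold; the real component exposes its
    own parameters in Designer.
    """
    return ngram_repetition_rate(text, n, level) <= max_rate


if __name__ == "__main__":
    # Character level: grams of "abab" are ab, ba, ab; "ab" repeats twice.
    print(ngram_repetition_rate("abab", 2))
    # Word level: "the cat" appears twice among four 2-grams.
    print(ngram_repetition_rate("The cat the cat sat", 2, level="word"))
```

In the sketch, a gram that occurs three times contributes three to the numerator, which matches the description of summing the frequencies of all grams whose frequency is greater than 1.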

  4. Run the workflow.

    After the run completes, right-click the Write To Data Table-1 component and choose View Data > Output to view the processed samples.
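To give a sense of what the final deduplication step in the workflow does, the following Python sketch shows a generic MinHash similarity estimate. It is not the code behind the LLM-MinHash Deduplicator component. The character shingle size `k`, the number of hash functions `num_perm`, the use of salted BLAKE2b hashes, the similarity threshold, and the pairwise `deduplicate` helper are all assumptions chosen to keep the example small.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Split lowercased text into the set of its character k-grams."""
    text = text.lower()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64, k: int = 5) -> list[int]:
    """Build a MinHash signature: one minimum hash value per salted hash."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        # Each seed acts as a different hash function via the BLAKE2b salt.
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a text only if it is not too similar to any text already kept.

    This pairwise loop is quadratic and only suitable for small inputs;
    large-scale systems usually bucket signatures first.
    """
    kept: list[tuple[str, list[int]]] = []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_similarity(sig, other) < threshold for _, other in kept):
            kept.append((text, sig))
    return [text for text, _ in kept]


if __name__ == "__main__":
    a = "The quick brown fox jumps over the lazy dog near the river bank."
    b = a + "!"
    c = "Completely unrelated content: 1234567890 zzzz qqqq."
    print(deduplicate([a, b, c]))
```

Because the signatures have a fixed length, comparing two samples costs the same regardless of how long the original texts are, which is the main reason MinHash is used for near-duplicate removal on large corpora.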

