Create and manage datasets

PAI AI Asset Management lets you create and manage datasets and their versions. Use dataset versioning to reproduce experiments, track data lineage, and revert to previous versions when needed.

Dataset overview

PAI supports two dataset types: basic datasets and labeled datasets. Basic datasets contain raw, unlabeled data for pre-training models. Labeled datasets contain manually annotated data for fine-tuning and evaluation.

| Item | Basic dataset | Labeled dataset |
| --- | --- | --- |
| Definition | Raw, unlabeled data. | Data with manually added labels. You can export this data from iTAG. See Export annotated data. |
| Data processing | Data cleaning, deduplication, and more. | Data labeling, validation, and more. |
| Use cases | Unsupervised learning. Pre-train models to capture broad features. | Supervised learning and model evaluation. Fine-tune models to improve performance on specific tasks. |

Access the Datasets page

  1. Log on to the PAI console.

  2. In the upper-left corner, select a region.

  3. In the left-side navigation pane, click Workspaces, and then click the name of the target workspace.

  4. In the left-side navigation pane, choose AI Computing Asset Management > Dataset.

Create a basic dataset

On the Custom Dataset tab, click Create Dataset, and select Basic for Data Type. You can create a dataset from various Storage options, including Object Storage Service (OSS) and file systems (General-purpose NAS, Extreme NAS, CPFS, and AI-CPFS).

Storage type: OSS

Parameter

Description

Content Type

Select the data type: Image, Text, Audio, Video, Tabular, or General. Selecting a specific type helps filter datasets in labeling scenarios.

Owner

Select the owner of the dataset. Only a workspace administrator can configure this parameter.

Import Format/OSS Path

  • When Import Format is File, select a single file as the OSS path. The dataset maps to this file. This format is commonly used for iTAG datasets.

  • When Import Format is Folder, the OSS path must be a mountable folder path. This format is commonly used for DSW, DLC, or EAS.

Default Mount Path

Default data mount path, commonly used in DSW and DLC:

  • In DSW, mount the file system to this path when creating an instance.

  • In DLC, the system reads files from this directory. For example, with a mount path of /root/data, a job can run python /root/data/file.py.

Enable Version Acceleration

When you set Import Format to Folder, you can enable dataset version acceleration. Key parameters:

  • Maximum Capacity: The acceleration slot capacity. Must be greater than or equal to the dataset size.

  • Accelerated Mount Target: Uses an internal mount target by default. You can select an existing target or create one.

    Note

    When using Lingjun Intelligent Computing Resources, if you select Create Mount Target for Accelerated Mount Target, set Mount Target Type to VPC. The VPC and vSwitch must match those of your Lingjun Intelligent Computing Resources.

  • Accelerated Version Default Mount Path: Default mount path for the accelerated version.
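
The constraints in the table above can be checked before you fill in the console form. The following is a minimal sketch, not a PAI API: OssDatasetConfig and validate are hypothetical names, the oss:// path form and the trailing-slash test for a folder path are assumptions made for illustration, and capacity and size only need to share the same unit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OssDatasetConfig:
    """Hypothetical container that mirrors the console parameters above."""
    import_format: str                      # "File" or "Folder"
    oss_path: str                           # assumed form: "oss://bucket/key"
    version_acceleration: bool = False
    max_capacity: Optional[float] = None    # same unit as dataset_size
    dataset_size: Optional[float] = None


def validate(cfg: OssDatasetConfig) -> List[str]:
    """Return a list of problems; an empty list means the rules above hold."""
    problems = []
    if cfg.import_format not in ("File", "Folder"):
        problems.append("Import Format must be File or Folder")
    # Assumption: a folder path ends with "/", a file path does not.
    is_folder_path = cfg.oss_path.endswith("/")
    if cfg.import_format == "Folder" and not is_folder_path:
        problems.append("Folder import needs a folder path")
    if cfg.import_format == "File" and is_folder_path:
        problems.append("File import needs a path to a single file")
    if cfg.version_acceleration:
        # Version acceleration is only offered for folder imports.
        if cfg.import_format != "Folder":
            problems.append("Version acceleration requires Import Format = Folder")
        if cfg.max_capacity is None or cfg.dataset_size is None:
            problems.append("Maximum Capacity and dataset size are both required")
        elif cfg.max_capacity < cfg.dataset_size:
            problems.append("Maximum Capacity must be >= dataset size")
    return problems
```

For example, validate(OssDatasetConfig("Folder", "oss://example-bucket/train/", True, 50, 80)) reports that Maximum Capacity is too small, because 50 is less than the dataset size of 80.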

Storage type: file system

Parameter

Description

Content Type

Select the data type: Image, Text, Audio, Video, Tabular, or General. Selecting a specific type helps filter datasets in labeling scenarios.

Owner

Select the owner of the dataset. Only a workspace administrator can configure this parameter.

File System

Select a file system that corresponds to the selected Storage.

Mount Target

Configure a mount target to access the NAS file system.

File System Path

Specify an existing storage path in the NAS file system. For example, /.

Default Mount Path

Default data mount path, commonly used in DSW and DLC:

  • In DSW, mount the file system to this path when creating an instance.

  • In DLC, the system reads files from this directory. For example, with a mount path of /root/data, a job can run python /root/data/file.py.

Enable Version Acceleration

When the Storage is General-purpose NAS, Extreme NAS, or CPFS, you can enable dataset version acceleration. Key parameters:

  • Maximum Capacity: The acceleration slot capacity. Must be greater than or equal to the dataset size.

  • Accelerated Version Default Mount Path: Default mount path for the accelerated version.
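
Because Maximum Capacity must be at least the dataset size, it can help to measure the data before you enable acceleration. The following is a minimal sketch, assuming the data is reachable as a local directory (for example, already mounted in a DSW instance). The helper names dataset_size_bytes and fits_capacity are illustrative and not part of PAI, and the demo uses a temporary directory in place of a real mount path.

```python
import os
import tempfile


def dataset_size_bytes(mount_path: str) -> int:
    """Sum the sizes of all regular files under mount_path (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(mount_path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def fits_capacity(mount_path: str, max_capacity_bytes: int) -> bool:
    """True if the data under mount_path fits in the given capacity."""
    return dataset_size_bytes(mount_path) <= max_capacity_bytes


# Demo: a temporary directory stands in for the mounted dataset.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.bin"), "wb") as f:
        f.write(b"x" * 1024)
    with open(os.path.join(tmp, "sub", "b.bin"), "wb") as f:
        f.write(b"y" * 512)
    demo_size = dataset_size_bytes(tmp)      # 1024 + 512 bytes
    demo_fits = fits_capacity(tmp, 2048)
```

On a real instance you would pass the dataset's mount path, such as the /root/data path used in the example above, instead of the temporary directory.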

Create a basic dataset version

On the Custom Dataset tab, click Create Version in the Actions column for the target dataset.

Important notes:

  • Dataset name, storage type, and data type are inherited from V1 and cannot be changed.

  • The version number is auto-generated and cannot be changed.

  • Other parameters match those in Create a basic dataset.
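
The inheritance rules above can be pictured with a small data model. This is a sketch only: DatasetVersion and create_version are hypothetical names, not PAI's implementation, and numbering versions by incrementing the latest number by one is an assumption, since the page only states that the number is auto-generated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    name: str                # inherited from V1, cannot be changed
    storage_type: str        # inherited from V1, cannot be changed
    data_type: str           # inherited from V1, cannot be changed
    version: int             # auto-generated, cannot be changed
    path: str                # may differ between versions
    default_mount_path: str  # may differ between versions


def create_version(history: List[DatasetVersion], path: str,
                   default_mount_path: Optional[str] = None) -> List[DatasetVersion]:
    """Return a new history with one more version appended."""
    v1, latest = history[0], history[-1]
    new = DatasetVersion(
        name=v1.name,                   # fixed fields are copied from V1
        storage_type=v1.storage_type,
        data_type=v1.data_type,
        version=latest.version + 1,     # assumption: sequential numbering
        path=path,
        default_mount_path=default_mount_path or latest.default_mount_path,
    )
    return history + [new]


# Demo with placeholder values.
v1 = DatasetVersion("images", "OSS", "Basic", 1,
                    "oss://example-bucket/images/v1/", "/mnt/data")
history = create_version([v1], "oss://example-bucket/images/v2/")
```

Only the path and mount path are supplied by the caller here. The name, storage type, and data type are always taken from V1, which matches the fields the console locks when you create a version.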

View public datasets

PAI provides built-in public datasets such as MMLU, CMMLU, and GSM8K. On the Public Dataset tab, you can click a dataset name to view its basic information.

Each dataset shows the Dataset Name/ID, Type, Task, Language, Size, Data Volume, and Publisher.

Manage datasets

For custom datasets, you can view version history, create a new version, make the dataset public, or delete it. For labeled datasets, you can view data, make the dataset public, or delete it.

Important notes:

  • For a dataset whose Visibility is set to Visible Only to the Dataset Owner, you can click Set Dataset to Public to share it with all members of the workspace. Making a dataset public cannot be reversed. Proceed with caution.

  • If a RAM user encounters an access denied error when viewing dataset data, grant the required permissions to the RAM user.

  • Deleting a dataset may affect running tasks that depend on it. This action is irreversible. Proceed with caution.

Related documents

For iTAG dataset creation: Create a dataset or .