PAI AI Asset Management lets you create and manage datasets and their versions. Use dataset versioning to reproduce experiments, track data lineage, and revert to previous versions when needed.
Dataset overview
PAI supports two dataset types: basic datasets and labeled datasets. Basic datasets contain raw, unlabeled data for pre-training models. Labeled datasets contain manually annotated data for fine-tuning and evaluation.
| Item | Basic dataset | Labeled dataset |
| Definition | Raw, unlabeled data. | Data with manually added labels. You can export this data from iTAG (see Export annotated data). |
| Data processing | Data cleaning, deduplication, and more. | Data labeling, validation, and more. |
| Use cases | | |
Access the Datasets page
- Log on to the PAI console.
- In the upper-left corner, select a region.
- In the left-side navigation pane, click Workspaces, and then click the name of the target workspace.
- In the left-side navigation pane, choose AI Computing Asset Management > Dataset.
Create a basic dataset
On the Custom Dataset tab, click Create Dataset, and select Basic for Data Type. You can create a dataset from various Storage options, including Object Storage Service (OSS) and file systems (General-purpose NAS, Extreme NAS, CPFS, and AI-CPFS).
Storage type: OSS
| Parameter | Description |
| Content Type | Select the data type: Image, Text, Audio, Video, Tabular, or General. Selecting a specific type helps filter datasets in labeling scenarios. |
| Owner | Select the owner of the dataset. Only a workspace administrator can configure this parameter. |
| Import Format/OSS Path | |
| Default Mount Path | Default data mount path, commonly used in DSW and DLC: |
| Enable Version Acceleration | When you set Import Format to Folder, you can enable dataset version acceleration. Key parameters: |
Storage type: file system
| Parameter | Description |
| Content Type | Select the data type: Image, Text, Audio, Video, Tabular, or General. Selecting a specific type helps filter datasets in labeling scenarios. |
| Owner | Select the owner of the dataset. Only a workspace administrator can configure this parameter. |
| File System | Select a file system that corresponds to the selected Storage. |
| Mount Target | Configure a mount target to access the NAS file system. |
| File System Path | Specify an existing storage path in the NAS file system. For example, |
| Default Mount Path | Default data mount path, commonly used in DSW and DLC: |
| Enable Version Acceleration | When the Storage is General-purpose NAS, Extreme NAS, or CPFS, you can enable dataset version acceleration. Key parameters: |
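The conditions for enabling version acceleration, as stated in the two parameter tables, can be summarized in a short Python sketch. It is an illustration only, not part of PAI or its SDK: the function name, the string labels for storage options, and the non-Folder import format value `"File"` used in the demo are all hypothetical. AI-CPFS is not listed as eligible in the text, so the sketch treats it as ineligible, which is an inference rather than a documented rule.

```python
from typing import Optional

# File-system storage options that the text lists as supporting acceleration.
ACCELERATED_FILE_SYSTEMS = {"General-purpose NAS", "Extreme NAS", "CPFS"}


def can_enable_version_acceleration(storage: str,
                                    import_format: Optional[str] = None) -> bool:
    """Return True if acceleration can be enabled (hypothetical helper)."""
    if storage == "OSS":
        # For OSS, the text requires Import Format to be set to Folder.
        return import_format == "Folder"
    # For file systems, only the listed storage options qualify.
    # AI-CPFS is not listed, so this sketch treats it as ineligible (assumption).
    return storage in ACCELERATED_FILE_SYSTEMS


if __name__ == "__main__":
    # "File" is a made-up stand-in for any import format other than Folder.
    cases = [("OSS", "Folder"), ("OSS", "File"), ("CPFS", None), ("AI-CPFS", None)]
    for storage, import_format in cases:
        print(storage, import_format,
              can_enable_version_acceleration(storage, import_format))
```

A check like this could be run before submitting a dataset configuration, but the console remains the authority on which options are actually available.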
Create a basic dataset version
On the Custom Dataset tab, click Create Version in the Actions column for the target dataset.
Important notes:
- Dataset name, storage type, and data type are inherited from V1 and cannot be changed.
- The version number is auto-generated and cannot be changed.
- Other parameters match those in Create a basic dataset.
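The inheritance rules in the notes above can be modeled in a few lines of Python. This is a minimal sketch for reasoning about versions, not PAI's implementation or SDK: the `Dataset` and `DatasetVersion` classes, the `path` field (standing in for the other per-version parameters), and the example OSS paths are hypothetical. It assumes version labels continue the V1 pattern (V2, V3, and so on).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable version record (hypothetical model, not a PAI type)."""
    name: str          # inherited from V1
    storage_type: str  # inherited from V1
    data_type: str     # inherited from V1
    version: str       # auto-generated label such as "V1" or "V2"
    path: str          # stands in for parameters that may differ per version


class Dataset:
    """A dataset whose first version fixes the inherited fields."""

    def __init__(self, name: str, storage_type: str, data_type: str, path: str):
        self.versions: List[DatasetVersion] = [
            DatasetVersion(name, storage_type, data_type, "V1", path)
        ]

    def create_version(self, path: str) -> DatasetVersion:
        """Create the next version; callers cannot set inherited fields."""
        first = self.versions[0]
        new_version = DatasetVersion(
            name=first.name,
            storage_type=first.storage_type,
            data_type=first.data_type,
            version=f"V{len(self.versions) + 1}",  # auto-generated
            path=path,
        )
        self.versions.append(new_version)
        return new_version


if __name__ == "__main__":
    dataset = Dataset("demo", "OSS", "Image", "oss://example-bucket/data/v1/")
    v2 = dataset.create_version("oss://example-bucket/data/v2/")
    print(v2.version, v2.name, v2.storage_type, v2.data_type)
```

Because `create_version` accepts only the per-version parameter, the name, storage type, and data type cannot drift between versions, and the frozen dataclass prevents editing a version after it is created.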
View public datasets
PAI provides built-in public datasets such as MMLU, CMMLU, and GSM8K. On the Public Dataset tab, you can click a dataset name to view its basic information.
Each dataset shows the Dataset Name/ID, Type, Task, Language, Size, Data Volume, and Publisher.
Manage datasets
For custom datasets, you can view version history, create a new version, make the dataset public, or delete it. For labeled datasets, you can view data, make the dataset public, or delete it.
Important notes:
- For a dataset whose Visibility is set to Visible Only to the Dataset Owner, you can click Set Dataset to Public to share it with all members in the workspace. Once a dataset is made public, this action cannot be reversed. Proceed with caution.
- If a RAM user encounters an access denied error when viewing dataset data, grant the required permissions to the RAM user.
- Deleting a dataset may affect running tasks that depend on it. This action is irreversible. Proceed with caution.
Related documents
To create a dataset for iTAG, see Create a dataset.