DataWorks datasets let you manage unstructured data such as images and documents. You can create datasets backed by OSS or NAS storage and track data versions.
Overview
Datasets let you read and write data stored in OSS and NAS from DataWorks. You can create multiple dataset versions, track changes, and revert to a previous version if needed.
Precautions
The dataset feature is currently in beta. Its features and stability may change before the final release.
Billing
The DataWorks dataset feature is free. OSS and NAS incur separate storage and network fees. For details, see OSS billing and NAS billing.
Create a dataset
- Log on to the DataWorks console. In the target region, click in the left-side navigation pane. On the page that appears, click Go to Data Map.
- On the Data Map page, click Data Catalog in the left-side navigation pane. Then, under Catalogs, click Dataset Catalog.
- Click your target workspace. On the dataset list page, click Create Dataset and configure the settings described below.
If an administrator has configured custom attributes for the dataset type, you can set values such as business domain and data sensitivity level when creating the dataset. Custom attributes can inherit workspace-level values.
Storage class: OSS
- Dataset Configuration:

| Configuration item | Description |
| --- | --- |
| Storage class | OSS |
| Content type | Optional. Select the data type. Default: General. |
- Import Configuration:

| Configuration item | Description |
| --- | --- |
| OSS path | Path of the OSS folder to mount. Note: Make sure you have the required OSS Bucket permissions. |
| Default mount path | Mount path for the OSS folder in DataWorks. Default: /mnt/data/. You can change the mount path. |
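The relationship between the OSS path and the default mount path can be sketched as a simple path translation. This is a minimal illustration, not DataWorks behavior confirmed by this page: it assumes that objects under the mounted OSS folder appear at the same relative path under the mount path. The function name `local_path_for`, the `oss://` URI form, and the bucket name `my-bucket` are hypothetical.

```python
from pathlib import PurePosixPath


def local_path_for(object_uri: str, oss_folder: str, mount_path: str = "/mnt/data/") -> str:
    """Translate an OSS object URI into the path it would have under the mount point.

    Assumption: files keep their path relative to the mounted OSS folder.
    """
    # Normalize the folder so that "train" does not match "training/...".
    folder = oss_folder.rstrip("/") + "/"
    if not object_uri.startswith(folder):
        raise ValueError(f"{object_uri} is not under {oss_folder}")
    relative = object_uri[len(folder):]
    return str(PurePosixPath(mount_path) / relative)


# Hypothetical bucket and folder names, for illustration only.
print(local_path_for("oss://my-bucket/train/img/cat.png", "oss://my-bucket/train/"))
```

If you change the mount path when creating the dataset, pass the new value as `mount_path`.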
Storage class: NAS
- Dataset Configuration:

| Configuration item | Description |
| --- | --- |
| Storage class | Select File Storage (General-purpose NAS) or File Storage (Extreme NAS file systems). |
| Content type | Optional. Select the data type. Default: General. |
- Import Configuration:

| Configuration item | Description |
| --- | --- |
| File system | Select a NAS file system in the current region under your Alibaba Cloud account. |
| File system mount target | Configure a mount target for the NAS file system. Important: The mount target VPC must connect to the resource group VPC. Either use the same VPC for both the NAS mount target and the resource group, or, for cross-VPC scenarios, follow the Overview to connect the NAS mount target VPC to the resource group VPC. |
| File system path | Path of the NAS folder to mount. Default: /. The path must exist in the NAS file system. |
| Default mount path | Mount path for the NAS folder in DataWorks. Default: /mnt/data/. You can change the mount path. |
Manage datasets
Go to the dataset list for the target workspace. In the Actions column of the target dataset, click Details. The details page shows the Overview and Dataset Version information, and supports these operations:
- Create Version: Click Create Version in the upper-right corner to customize the OSS Path or NAS File System Configuration and set the Default Mount Path.
- Delete Dataset: Click Delete in the upper-right corner.
- View Dataset Data: Available only for OSS datasets. In the Dataset Version section, select the target version and click View in OSS to view the corresponding storage path in the OSS console.
- Delete Version: In the Dataset Version section, select the version and click Delete.
Deleting a dataset or version does not delete the original files, but the deletion is irreversible in DataWorks. Proceed with caution.
Use a dataset
Use datasets in Data Studio with Shell nodes, Python nodes, Notebook development, or in your personal development environment.
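As a rough illustration of reading a dataset from a Python node or Notebook, the sketch below walks the mounted folder and counts files by extension. It assumes the dataset is mounted at the default path `/mnt/data/` described above and is readable as an ordinary directory. The function name `summarize_dataset` is hypothetical and not part of any DataWorks API.

```python
from pathlib import Path

# Assumption: the dataset uses the default mount path from this page.
MOUNT_PATH = Path("/mnt/data")


def summarize_dataset(root: Path) -> dict:
    """Count files under root, grouped by lowercase file extension."""
    counts = {}
    if not root.is_dir():
        # Nothing is mounted at this path; return an empty summary.
        return counts
    for path in root.rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(no extension)"
            counts[ext] = counts.get(ext, 0) + 1
    return counts


if __name__ == "__main__":
    for ext, n in sorted(summarize_dataset(MOUNT_PATH).items()):
        print(f"{ext}: {n}")
```

If the dataset was created with a custom mount path, change `MOUNT_PATH` to match.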