Manage datasets

DataWorks datasets let you manage unstructured data such as images and documents. You can create datasets backed by OSS or NAS storage and track data versions.

Overview

Datasets let you read and write data stored in OSS and NAS from DataWorks. You can create multiple dataset versions, track changes, and revert to a previous version if needed.

Precautions

The dataset feature is currently in beta. Its features and stability may change before the final release.

Billing

The DataWorks dataset feature itself is free of charge. The underlying OSS and NAS storage is billed separately for storage and network usage. For details, see OSS billing and NAS billing.

Create a dataset

  1. Log on to the DataWorks console. In the target region, click Data Governance > Data Map in the left-side navigation pane. On the page that appears, click Go to Data Map.

  2. On the Data Map page, click the Data Catalog icon in the left-side navigation pane. Then, under Catalogs, click Dataset Catalog.

  3. Click your target workspace. On the dataset list page, click Create Dataset and configure the settings described below.

Note

If an administrator has configured custom attributes for the dataset type, you can set values such as business domain and data sensitivity level when creating the dataset. Custom attributes can inherit workspace-level values.

Storage class: OSS

  • Dataset Configuration:

    Storage class: OSS

    Content type: Optional. Select the data type. Default: General.

  • Import Configuration:

    OSS path: Path of the OSS folder to mount. Note: Make sure you have the required permissions on the OSS bucket.

    Default mount path: Mount path for the OSS folder in DataWorks. Default: /mnt/data/. You can change the mount path.
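
To illustrate how the OSS path and the mount path relate, the sketch below maps an object inside the mounted OSS folder to the local path a node would see. It assumes that objects under the mounted folder appear at the same relative location under the mount path. The bucket name, folder, and the helper name to_mounted_path are hypothetical and are not part of any DataWorks API.

```python
from pathlib import PurePosixPath


def to_mounted_path(object_uri: str, oss_folder: str,
                    mount_path: str = "/mnt/data/") -> str:
    """Map an OSS object URI to its assumed location under the mount path.

    Assumption: the mounted OSS folder is exposed one-to-one under
    mount_path, so only the prefix changes.
    """
    # Normalize the folder so that it always ends with exactly one slash.
    folder = oss_folder.rstrip("/") + "/"
    if not object_uri.startswith(folder):
        raise ValueError(f"{object_uri} is not inside {oss_folder}")
    # Keep the part of the object key that is relative to the mounted folder.
    relative = object_uri[len(folder):]
    return str(PurePosixPath(mount_path) / relative)
```

For example, with the hypothetical folder oss://examplebucket/datasets/ mounted at the default path, to_mounted_path("oss://examplebucket/datasets/images/cat.jpg", "oss://examplebucket/datasets/") returns /mnt/data/images/cat.jpg.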

Storage class: NAS

  • Dataset Configuration:

    Storage class: Select File Storage (General-purpose NAS) or File Storage (Extreme NAS file systems).

    Content type: Optional. Select the data type. Default: General.

  • Import Configuration:

    File system: Select a NAS file system in the current region under your Alibaba Cloud account.

    File system mount target: Configure a mount target for the NAS file system.

    Important: The VPC of the mount target must be connected to the VPC of the resource group. Use one of the following approaches:

    • Use the same VPC for both the NAS mount target and the resource group.

    • For cross-VPC scenarios, follow the Overview topic to connect the NAS mount target VPC to the resource group VPC.

    File system path: Path of the NAS folder to mount. Default: /. The path must exist in the NAS file system.

    Default mount path: Mount path for the NAS folder in DataWorks. Default: /mnt/data/. You can change the mount path.
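
Because a mounted folder is only usable when the mount target is reachable and the configured path exists, a quick check from a node can help confirm the setup. The following is a minimal sketch that uses only the Python standard library. The helper name check_mount is hypothetical, and it assumes the dataset is mounted at the default mount path /mnt/data/ described above.

```python
import os


def check_mount(mount_path: str = "/mnt/data/") -> dict:
    """Report whether the mount path exists and can be read and written.

    This only inspects the local file system view of the running node.
    It does not call any DataWorks, OSS, or NAS API.
    """
    return {
        # True only if the path exists and is a directory.
        "exists": os.path.isdir(mount_path),
        # Listing a directory needs both read and execute permission.
        "readable": os.access(mount_path, os.R_OK | os.X_OK),
        "writable": os.access(mount_path, os.W_OK),
    }
```

If exists is False, check the File system path and Default mount path settings. If the path exists but is not readable or writable, check permissions on the underlying storage.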

Manage datasets

In Data Catalog > Dataset Catalog, go to the dataset list for the target workspace. In the Actions column of the target dataset, click Details. The details page shows the Overview and Dataset Version information, and supports these operations:

  • Create Version: Click Create Version in the upper-right corner to customize the OSS Path or NAS File System Configuration and set the Default Mount Path.

  • Delete Dataset: Click Delete in the upper-right corner.

  • View Dataset Data: Available only for OSS datasets. In the Dataset Version section, select the target version and click View in OSS to view the corresponding storage path in the OSS console.

  • Delete Version: In the Dataset Version section, select the version and click Delete.

Important

Deleting a dataset or a version does not delete the underlying files in OSS or NAS. However, the deleted dataset or version cannot be restored in DataWorks. Proceed with caution.

Use a dataset

You can use datasets in Data Studio from Shell nodes, Python nodes, and notebooks, or from your personal development environment.
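
As a rough idea of what working with a mounted dataset from a Python node might look like, the sketch below walks the mount path and counts files by extension. It assumes the dataset is mounted at the default path /mnt/data/ and relies only on the Python standard library. The function name summarize_dataset is hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_dataset(mount_path: str = "/mnt/data/") -> Counter:
    """Count files under the mount path, grouped by lowercase extension."""
    counts = Counter()
    # rglob("*") visits every entry below the mount path, recursively.
    for path in Path(mount_path).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling summarize_dataset().most_common() in a node then lists the extensions found in the dataset, most frequent first, which is a simple way to confirm that the expected images or documents are visible.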

For details, see Use a dataset.