Shared sample datasets

Shared sample datasets enable you to quickly validate data processing performance, optimize query efficiency, or verify features. They are ideal resources for development, testing, and learning.

Create a dataset

  1. Log on to the Data Lake Formation console.

  2. In the left navigation pane, click Data catalog.

  3. Click Data sharing > Received, locate the data share named dlf_samples, and click Create Catalog.

    The catalog created from a received share is read-only.

  4. Click the Catalogs tab to view the newly created catalog.

    In the Catalogs list, the status of the new catalog appears as Running.

Query the dataset

The shared catalog contains the TPC-DS standard sample database at several scale factors and in two table formats (Paimon and Iceberg), suitable for data testing, analysis, and baseline performance evaluation at different scales. The following databases are available:

Sample database name    Sample data description
tpcds_paimon_sf1        TPC-DS 1 GB Paimon table
tpcds_paimon_sf2        TPC-DS 2 GB Paimon table
tpcds_paimon_sf10       TPC-DS 10 GB Paimon table
tpcds_paimon_sf100      TPC-DS 100 GB Paimon table
tpcds_iceberg_sf1       TPC-DS 1 GB Iceberg table

Note

You can associate this catalog with compute engines such as EMR and Flink to query the data. For more information, see Engine integration.
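
Once the catalog is associated with an engine, picking a sample database comes down to choosing a table format and a scale factor. The following is a minimal Python sketch of that lookup, not an official client: it only builds SQL text from the database names listed above. The catalog name `my_samples_catalog`, the helper names, and the three-part `catalog.database.table` naming are assumptions for illustration. `store_sales` is a table from the TPC-DS specification; check the table names actually present in the share and how your engine references catalogs before running the generated statement.

```python
# Sample databases listed above, keyed by (table format, scale in GB).
SAMPLE_DATABASES = {
    ("paimon", 1): "tpcds_paimon_sf1",
    ("paimon", 2): "tpcds_paimon_sf2",
    ("paimon", 10): "tpcds_paimon_sf10",
    ("paimon", 100): "tpcds_paimon_sf100",
    ("iceberg", 1): "tpcds_iceberg_sf1",
}


def sample_table(catalog: str, table_format: str, scale_gb: int, table: str) -> str:
    """Return a fully qualified table name (hypothetical helper).

    Assumes the engine accepts three-part names: catalog.database.table.
    """
    key = (table_format.lower(), scale_gb)
    if key not in SAMPLE_DATABASES:
        raise ValueError(f"No sample database for {table_format} at {scale_gb} GB")
    return f"{catalog}.{SAMPLE_DATABASES[key]}.{table}"


def row_count_query(catalog: str, table_format: str, scale_gb: int, table: str) -> str:
    """Build a read-only row-count query (the shared catalog is read-only)."""
    return f"SELECT COUNT(*) AS row_count FROM {sample_table(catalog, table_format, scale_gb, table)}"


if __name__ == "__main__":
    # "my_samples_catalog" is a placeholder for the name you chose in Create Catalog.
    print(row_count_query("my_samples_catalog", "paimon", 10, "store_sales"))
```

Starting with the 1 GB database and moving to the larger scale factors only after a query is validated keeps early test runs short.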