Deliver data to OSS using the console

Create a delivery task in the Tablestore console to deliver data from a Tablestore table to an OSS bucket.

Prerequisites

Activate the OSS service and create a bucket in the same region as your Tablestore instance. For more information, see Activate OSS.

Note

Data delivery supports delivering data to any OSS bucket in the same region as the Tablestore instance. To deliver data to other data warehouses, such as MaxCompute, submit a ticket to apply for this feature.

Usage notes

  • Data delivery is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), and China (Shenzhen).

  • Data delivery ignores delete operations. Data that you delete in Tablestore is not deleted from the destination OSS bucket.

  • Initializing a new delivery task can take up to one minute.

  • With a stable write rate, synchronization latency is typically within 3 minutes. The P99 latency is within 10 minutes.

    Note

    P99 latency is the 99th percentile of latency: 99% of requests complete within this value.
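
To make the P99 definition concrete, the following minimal sketch computes a percentile with the nearest-rank method. The function name `percentile_nearest_rank` and the sample latencies are hypothetical, chosen only for illustration. Tablestore may compute its latency figures differently.

```python
import math

def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers p percent of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical delivery latencies in seconds: 99 fast deliveries and one slow outlier.
latencies = [60] * 99 + [900]

print(percentile_nearest_rank(latencies, 99))   # 60
print(percentile_nearest_rank(latencies, 100))  # 900
```

In this sample the single 900-second outlier does not affect the P99 value, because 99 of the 100 samples complete within 60 seconds.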

Create a delivery task

  1. Go to the Instance Management page.

    1. Log on to the Tablestore console.

    2. In the top navigation bar, select a resource group and a region. Click the name of the instance or click Manage Instance in the Actions column.

  2. On the Instance Management page, click Deliver Data to OSS.

  3. (Optional) Create the service-linked role AliyunServiceRoleForOTSDataDelivery.

    The first time you configure data delivery, you must create the Tablestore service-linked role AliyunServiceRoleForOTSDataDelivery. This role grants Tablestore permission to write data to an OSS bucket. For more information, see Tablestore service-linked role.

    Note

    For more information about service-linked roles, see Service-linked roles.

    1. On the Deliver Data to OSS page, click Role for Delivery Service.

    2. In the Role Details dialog box, review the information and click OK.

  4. Create a delivery task.

    1. On the Deliver Data to OSS page, click Create Task.

    2. In the Create Task dialog box, configure the parameters.

      Parameter

      Description

      Task Name

      The name of the delivery task.

      The name must be 3 to 16 characters in length and can contain only lowercase letters (a-z), digits, and hyphens (-). The name must start and end with a lowercase letter or a digit.

      Destination Region

      The region where the Tablestore instance and the OSS bucket are located.

      Source Table

      The name of the source Tablestore table.

      Destination Bucket

      The name of the OSS bucket.

      Important

      The OSS bucket must already exist and be in the same region as the Tablestore instance.

      Destination Prefix

      The directory prefix in the OSS bucket for the delivered Tablestore data. The destination prefix supports five time variables: $yyyy, $MM, $dd, $HH, and $mm. For more information, see time partitioning.

      • Including time variables in the destination prefix dynamically generates OSS directories based on the data's write time. This partitions data by time in a style similar to Hive partition naming.

      • If you do not include time variables in the destination prefix, all files are delivered to the specified static directory prefix.

      Synchronization Mode

      The type of the delivery task. Valid values:

      • Incremental: Synchronizes only incremental data.

      • Full: Performs a one-time full-table data synchronization.

      • Differential: After a full data synchronization is complete, the task synchronizes incremental data.

      For incremental data synchronization, you can view the latest delivery time and the current delivery status.

      Destination Object Format

      Delivered data is stored in the Parquet columnar format. By default, data delivery uses PLAIN encoding, which supports data of any type.

      Schema Generation Type

      You can select which source fields to write to the destination file, specify their order, and assign them new names. The order of columns in the schema configuration determines the final data layout in OSS.

      Configure the delivery schema based on the selected schema generation type.

      • If you set Schema Generation Type to Manual, you must manually configure the source field, destination field name, and destination field type for each delivery field.

      • If you set Schema Generation Type to Auto Generate, the system automatically identifies and matches the fields for delivery.

      Important

      The data type of a delivered field must match the data type of the corresponding source field. Otherwise, the field is discarded as dirty data. For more information about data type mappings, see Data type mapping.

      When you configure the delivery schema, you can perform the following operations:

      • Click Add Field to add a new delivery field.

      • In the Actions column, click the up arrow or down arrow icon to adjust the order of the delivery fields.

      • In the Actions column, click the delete icon to remove a delivery field.

      Schema Configurations

  5. Click OK.

    In the View Statement to Create Table dialog box, you can view the automatically generated create table statement for an EMR external table. You can copy this statement to quickly create an external table in EMR and access the data in OSS.

    After you create the delivery task, you can perform the following operations:

    • View delivery details, such as the task name, table name, destination bucket, destination prefix, latest synchronization time, and status.

    • View or copy the create table statement.

      In the Actions column, click View Statement to Create Table to view or copy the create table statement used to create an external table in a computing engine such as EMR. For more information, see Use EMR.

    • View delivery error messages.

      If the OSS bucket or delivery permissions are configured incorrectly, the data delivery fails. In this case, you can view the relevant error messages on the task status page. For more information about how to handle errors, see Error handling.

    • Delete the delivery task.

      In the Actions column, click Delete to delete the delivery task. If you attempt to delete a task that is in the initialization stage, the system returns an error. In this case, try again later.
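
The Task Name rule in the parameter table (3 to 16 characters, lowercase letters, digits, and hyphens, starting and ending with a lowercase letter or a digit) can be checked locally before you fill in the dialog box. The following is a minimal sketch. `TASK_NAME_PATTERN` and `is_valid_task_name` are hypothetical names, and the console performs its own validation, which may differ.

```python
import re

# Derived from the documented rule: 3 to 16 characters, only a-z, 0-9, and "-",
# with the first and last character being a lowercase letter or a digit.
TASK_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9-]{1,14}[a-z0-9]")

def is_valid_task_name(name: str) -> bool:
    """Return True if the name satisfies the documented Task Name rule."""
    return TASK_NAME_PATTERN.fullmatch(name) is not None

print(is_valid_task_name("orders-to-oss"))  # True
print(is_valid_task_name("ab"))             # False: shorter than 3 characters
print(is_valid_task_name("-orders"))        # False: starts with a hyphen
print(is_valid_task_name("Orders"))         # False: contains an uppercase letter
```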

View data in OSS

After the delivery task is initialized and data has been delivered, you can view the data in OSS by using the OSS console, an API or SDK, or a computing engine such as EMR. For more information, see File overview.

The address format for an OSS Object is as follows:

oss://BucketName/TaskPrefix/TaskName_ConcurrentID_TaskPrefix__SequenceID

The address consists of the following components:

  • BucketName: the name of the OSS bucket.

  • TaskPrefix: the directory prefix. The prefix also appears in the file name, where the examples in the Time partitioning section show each forward slash (/) replaced with an underscore (_).

  • TaskName: the name of the delivery task.

  • ConcurrentID: an internal concurrency ID that starts at 0. The delivery system automatically increases concurrency as throughput increases.

  • SequenceID: the file sequence number, which increments starting from 1.
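
The address format can be sketched as a small helper that assembles the path from its components. `build_object_path` is a hypothetical function, not part of any SDK. The replacement of "/" with "_" in the file name is an assumption inferred from the example paths in the Time partitioning section.

```python
def build_object_path(bucket, prefix, task_name, concurrent_id, sequence_id):
    """Assemble an OSS object address in the documented format."""
    # Assumption inferred from the documented examples: "/" in the prefix
    # becomes "_" when the prefix is repeated inside the file name.
    flattened_prefix = prefix.replace("/", "_")
    file_name = f"{task_name}_{concurrent_id}_{flattened_prefix}__{sequence_id}"
    return f"oss://{bucket}/{prefix}/{file_name}"

print(build_object_path("myBucket", "myPrefix", "testTask", 0, 1))
# oss://myBucket/myPrefix/testTask_0_myPrefix__1
```

A helper like this can be useful for predicting where the first file of a task lands, for example when configuring a downstream job that reads from the bucket.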

Time partitioning

Data delivery can extract the time when data was written to Tablestore. You can use the following variables to convert the data's write time into a directory prefix for the OSS bucket: $yyyy (four-digit year), $MM (two-digit month), $dd (two-digit day), $HH (two-digit hour), and $mm (two-digit minute).

Note

For optimal performance, we recommend that each file in OSS be at least 4 MB. When a computing engine loads data from OSS, a larger number of partitions leads to longer execution times, so avoid overly fine time partitioning. In most real-time write scenarios, partitioning by day or hour is sufficient, and minute-level partitioning is usually unnecessary.

For example, consider data written to Tablestore at 16:03 on August 31, 2020. The following table shows the resulting object paths in OSS for different destination prefix configurations.

| OSS bucket | TaskName | Destination prefix | OSS object path |
| --- | --- | --- | --- |
| myBucket | testTask | myPrefix | oss://myBucket/myPrefix/testTask_0_myPrefix__1 |
| myBucket | testTaskTimeParitioned | myPrefix/$yyyy/$MM/$dd/$HH/$mm | oss://myBucket/myPrefix/2020/08/31/16/03/testTaskTimeParitioned_0_myPrefix_2020_08_31_16_03__1 |
| myBucket | testTaskTimeParitionedHiveNamingStyle | myPrefix/year=$yyyy/month=$MM/day=$dd | oss://myBucket/myPrefix/year=2020/month=08/day=31/testTaskTimeParitionedHiveNamingStyle_0_myPrefix_year=2020_month=08_day=31__1 |
| myBucket | testTaskDs | ds=$yyyy$MM$dd | oss://myBucket/ds=20200831/testTaskDs_0_ds=20200831__1 |
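
The variable substitution described in this section can be sketched in a few lines. `expand_prefix` is a hypothetical helper that mimics the documented behavior for a given write time. The time zone that the delivery service applies is not stated here, so the sketch takes a naive `datetime` as an assumption.

```python
from datetime import datetime

def expand_prefix(template: str, write_time: datetime) -> str:
    """Replace the five documented time variables with zero-padded values."""
    replacements = {
        "$yyyy": f"{write_time.year:04d}",
        "$MM": f"{write_time.month:02d}",   # month (uppercase)
        "$dd": f"{write_time.day:02d}",
        "$HH": f"{write_time.hour:02d}",
        "$mm": f"{write_time.minute:02d}",  # minute (lowercase)
    }
    for variable, value in replacements.items():
        template = template.replace(variable, value)
    return template

# Write time from the example: 16:03 on August 31, 2020.
write_time = datetime(2020, 8, 31, 16, 3)

print(expand_prefix("myPrefix/$yyyy/$MM/$dd/$HH/$mm", write_time))
# myPrefix/2020/08/31/16/03
print(expand_prefix("myPrefix/year=$yyyy/month=$MM/day=$dd", write_time))
# myPrefix/year=2020/month=08/day=31
print(expand_prefix("ds=$yyyy$MM$dd", write_time))
# ds=20200831
```

The variables are case-sensitive: `$MM` is the month and `$mm` is the minute. A prefix without variables is returned unchanged, which corresponds to the static directory case.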

Data type mapping

| Parquet logical type | Tablestore data type |
| --- | --- |
| Boolean | Boolean |
| Int64 | Int64 |
| Double | Double |
| UTF8 | String |
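
Because a field whose destination type does not match its source type is discarded as dirty data, it can help to check a manually configured schema against this mapping before creating the task. The following is a minimal sketch under stated assumptions: `PARQUET_TYPE_FOR`, `find_mismatched_fields`, and the sample field names are hypothetical, and the check covers only the four types listed in the mapping.

```python
# Tablestore data type -> Parquet logical type, copied from the data type mapping.
PARQUET_TYPE_FOR = {
    "Boolean": "Boolean",
    "Int64": "Int64",
    "Double": "Double",
    "String": "UTF8",
}

def find_mismatched_fields(delivery_schema, source_types):
    """Return destination field names whose type does not match the source column.

    delivery_schema: list of (source_field, destination_field, destination_type)
    source_types: dict that maps each source field to its Tablestore data type
    """
    mismatched = []
    for source_field, destination_field, destination_type in delivery_schema:
        # Unknown source fields or unmapped types yield None and count as mismatches.
        expected = PARQUET_TYPE_FOR.get(source_types.get(source_field))
        if expected != destination_type:
            mismatched.append(destination_field)
    return mismatched

# Hypothetical source table columns and a manually configured delivery schema.
source_types = {"order_id": "String", "amount": "Double", "paid": "Boolean"}
delivery_schema = [
    ("order_id", "id", "UTF8"),
    ("amount", "amount", "Int64"),   # mismatch: the source column is Double
    ("paid", "is_paid", "Boolean"),
]

print(find_mismatched_fields(delivery_schema, source_types))  # ['amount']
```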

Error handling

| Error message | Cause | Solution |
| --- | --- | --- |
| UnAuthorized | Tablestore does not have the necessary permissions. | Verify that the service-linked role AliyunServiceRoleForOTSDataDelivery exists in RAM. If the role does not exist, create it from the Deliver Data to OSS page. When you start to create a delivery task, the console prompts you to create the required role. |
| InvalidOssBucket | The specified OSS bucket does not exist. | Verify that the OSS bucket is in the same region as the Tablestore instance. Verify that the OSS bucket exists. After you create the OSS bucket, the system automatically retries writing data to it and updates the delivery progress. |