Introduction to data lake shipping

Tablestore data lake shipping enables you to perform full backups or ship data in real time to a data lake in Object Storage Service (OSS). This feature supports low-cost historical data storage and large-scale offline and near-real-time data analytics.

Scenarios

Data lake shipping is suitable for the following scenarios:

Data tiering

You can combine data lake shipping with the Tablestore time to live (TTL) feature to tier your data. Full data is shipped to OSS, where it is stored quickly and at a low cost, while Tablestore retains the hot data and serves low-latency queries and analytics on it.
Full data backup

Data lake shipping automatically delivers full data from Tablestore tables to an OSS bucket for data backup and archiving.
Large-scale real-time data analytics

Data lake shipping delivers incremental Tablestore data to OSS in near real time, writing data every two minutes. The shipped data can be partitioned by system time and is stored in the Parquet column store format. You can use the high read bandwidth of OSS and the scan-optimized column store format to perform efficient, near-real-time data analytics.
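
The following Python sketch illustrates what partitioning by system time can look like when the partition directories use Hive-style key=value names. The function name partition_prefix, the base prefix, and the year/month/day/hour layout are assumptions made for illustration; the layout that a real shipping task produces depends on how the task is configured.

```python
from datetime import datetime, timezone


def partition_prefix(base_prefix: str, event_time: datetime) -> str:
    """Build a Hive-style key=value partition prefix from a timestamp.

    The year/month/day/hour layout is an assumed example, not the
    layout that a shipping task is guaranteed to use.
    """
    t = event_time.astimezone(timezone.utc)
    return (
        f"{base_prefix.rstrip('/')}/"
        f"year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/"
    )


# Hypothetical base prefix inside an OSS bucket.
example = partition_prefix(
    "shipping/orders", datetime(2024, 3, 5, 14, 7, tzinfo=timezone.utc)
)
print(example)
```

Because each partition value is encoded in the object path, an analytics engine can skip entire directories when a query filters on time.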

Features

The main features of data lake shipping are as follows:

Data lake shipping automatically pulls full and incremental data from Tablestore. The data is written to OSS for persistent storage once it reaches a specific size or after a two-minute interval.
You can configure one of three data shipping modes: incremental, full, or full and incremental. All shipped data is stored in the Parquet column store format.
You can monitor the synchronization checkpoint for real-time shipping. Data lake shipping provides the DescribeDeliveryTask API. This API returns the checkpoint for the real-time data that the task has successfully delivered.
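
The write behavior described in the first feature (flush when the buffered data reaches a certain size, or when two minutes have passed) can be sketched as a small buffer class. This is only a conceptual model, not the service's implementation: the class name ShippingBuffer, its fields, and the 64 MiB size threshold are assumptions, because the actual size threshold is not stated here.

```python
from dataclasses import dataclass, field
from typing import List

FLUSH_INTERVAL_SECONDS = 120.0       # the two-minute interval from the text
MAX_BUFFER_BYTES = 64 * 1024 * 1024  # assumed value; the real threshold is not documented here


@dataclass
class ShippingBuffer:
    """Hypothetical model of a size-or-time flush policy."""
    max_bytes: int = MAX_BUFFER_BYTES
    interval: float = FLUSH_INTERVAL_SECONDS
    last_flush: float = 0.0
    rows: List[bytes] = field(default_factory=list)
    size: int = 0

    def add(self, row: bytes, now: float) -> List[bytes]:
        """Buffer a row; return the flushed batch, or [] if nothing was flushed."""
        self.rows.append(row)
        self.size += len(row)
        size_reached = self.size >= self.max_bytes
        interval_elapsed = now - self.last_flush >= self.interval
        if size_reached or interval_elapsed:
            return self.flush(now)
        return []

    def flush(self, now: float) -> List[bytes]:
        """Hand back the buffered rows and reset the buffer and timer."""
        batch, self.rows, self.size = self.rows, [], 0
        self.last_flush = now
        return batch
```

In this model, a low-traffic table is flushed by the timer and a high-traffic table is flushed by the size limit, which is why shipped data lags the source table by at most roughly the flush interval.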

Core advantages

Easy to use

You can complete a simple configuration in the console to set up fully managed, automatic data shipping from Tablestore to OSS. No monitoring or Operations and Maintenance (O&M) is required. The shipping task runs stably within the Service-Level Agreement (SLA) and scales with throughput.
Integrated full and incremental shipping

Data lake shipping provides integrated capabilities for both full and incremental data shipping. Incremental shipping tasks provide a near-real-time experience by continuously pulling new data, caching it for two minutes, and then writing it to OSS.
Seamless integration with the computing ecosystem

Shipped data is compatible with open source standards. It is stored in the Parquet column store format and follows Hive naming conventions. You can use E-MapReduce to directly analyze the data in OSS using external tables.
Tiered data storage and access

After data is shipped to OSS, Tablestore provides tiered data access across data tables, index tables, and the data in OSS. This structure supports the analytics needs of different scenarios.
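
To illustrate the integration with the computing ecosystem, the following Python sketch generates a standard Hive DDL statement for an external table over shipped Parquet files. The helper external_table_ddl, the table name, the column names and types, the partition columns, and the oss:// location are all hypothetical; replace them with the schema and path of your own shipping task.

```python
from typing import Dict


def external_table_ddl(table: str, columns: Dict[str, str],
                       partitions: Dict[str, str], location: str) -> str:
    """Render a Hive CREATE EXTERNAL TABLE statement for Parquet data."""
    cols = ",\n  ".join(f"`{name}` {typ}" for name, typ in columns.items())
    parts = ", ".join(f"`{name}` {typ}" for name, typ in partitions.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )


# All names below are placeholders, not values produced by the service.
ddl = external_table_ddl(
    table="orders_shipped",
    columns={"order_id": "STRING", "amount": "DOUBLE"},
    partitions={"year": "INT", "month": "INT", "day": "INT"},
    location="oss://example-bucket/shipping/orders/",
)
print(ddl)
```

After the table is created, running MSCK REPAIR TABLE on it in Hive registers the key=value partition directories that already exist under the location.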

Notes

Data lake shipping is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), and China (Shenzhen).

Workflow

You can create a shipping task to deliver Tablestore data to OSS. For more information, see Ship data to OSS using the console and Ship data to OSS using the SDK.
You can use E-MapReduce (EMR) to analyze the Tablestore data in OSS. For more information, see Analyze data using EMR.