Incremental archive to MaxCompute


This topic explains how to incrementally archive data from HBase to MaxCompute.

Notice

The incremental archive feature for MaxCompute was discontinued on June 16, 2023. Lindorm Tunnel Service (LTS) instances purchased after June 16, 2023 cannot use this feature. LTS instances purchased before this date can continue to use this feature.

Prerequisites

  1. Lindorm Tunnel Service (LTS) is activated.

  2. An HBase data source is added.

  3. A MaxCompute data source is added.

Supported versions

  • Self-managed HBase 1.x and 2.x.

  • E-MapReduce HBase.

  • ApsaraDB for HBase Standard Edition, ApsaraDB for HBase Performance-enhanced Edition, and Lindorm.

Limitations

  • The incremental archive feature is based on HBase logs. Therefore, you cannot export data that was imported by using bulk loading.

Log lifecycle

  • After you enable archiving, if data is not consumed, logs are retained for 48 hours by default. After this period, the system automatically cancels the subscription and deletes the retained data.

  • Data consumption may fail if the LTS cluster is released before the task terminates, or if the synchronization task is paused.
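The retention window above can be sketched as simple date arithmetic. This is only an illustration: the function names are hypothetical, and it assumes the 48-hour window is measured from the last time data was consumed, which the text does not state explicitly.

```python
from datetime import datetime, timedelta

# Default retention stated above; a given deployment may be configured differently.
DEFAULT_RETENTION = timedelta(hours=48)


def retention_deadline(last_consumed_at: datetime,
                       retention: timedelta = DEFAULT_RETENTION) -> datetime:
    """Return the moment after which unconsumed logs may be deleted.

    Assumption: the window starts at the last successful consumption.
    """
    return last_consumed_at + retention


def is_at_risk(last_consumed_at: datetime, now: datetime,
               retention: timedelta = DEFAULT_RETENTION) -> bool:
    """Return True once `now` is past the retention deadline."""
    return now > retention_deadline(last_consumed_at, retention)
```

A check like this could be run before pausing a synchronization task, to estimate how long the pause can last before retained logs are lost.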

Submit archive task

  1. In the left-side navigation pane of the LTS console, choose Lindorm/HBase Export > Incremental archive to MaxCompute.

  2. Click Create Task. On the Incremental archive to MaxCompute page, select the Source Cluster and Destination Cluster, and specify the Table Mapping. For details about the mapping parameters, such as setting tableMode to wideTable, click Parameter Description. Optionally, specify a Task Name and set Advanced Configurations. Before you submit the task, confirm that the log retention period (hbase.master.logcleaner.ttl) of the source cluster is long enough to prevent task failure. After you finish the configuration, click Create. The configuration in this example archives data in real time from the HBase table wal-test to MaxCompute:

    • The columns to be archived are cf1:a, cf1:b, cf1:c, and cf1:d.

    • mergeInterval specifies the archive interval in milliseconds. The default value is 86400000, which is one day.

    • mergeStartAt specifies the start time for the merge operation in yyyyMMddHHmmss format. For example, a value of 20190930000000 indicates that the operation starts at 00:00:00 on September 30, 2019. You can specify a time in the past.

  3. After you submit the archive task, monitor the workflow status on the task details page. This page has three sections:

    • Table Creation Details: shows the table creation status. A SUCCEEDED status indicates that the table is created.

    • Real-time Synchronization Channel: shows the channel status (for example, RUNNING), the synchronization latency, and the checkpoint (offset) of the log synchronization.

    • Table Merge: shows the status (for example, RUNNING) and progress (for example, 47.50%) of merge tasks. After a merge is complete, you can query the latest partition in MaxCompute.

  4. Log on to MaxCompute. After the merge is complete, run the SELECT * FROM wal_test WHERE pt = 'xxxxxx' statement to query the archived data. The result includes columns such as rowkey, cf1_string, cf1_long, cf1_boolean, cf1_short, cf1_bigdecimal, cf1_double, cf1_float, cf1_null, and pt. If the query returns these rows, the data was synchronized from HBase to MaxCompute.
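To make the mergeStartAt and mergeInterval settings from step 2 concrete, the following Python sketch parses a yyyyMMddHHmmss value and lists the first few merge times. It assumes that merges are triggered at mergeStartAt plus whole multiples of mergeInterval, which is an interpretation rather than documented scheduler behavior. The function names are hypothetical, and time zones are ignored.

```python
from datetime import datetime, timedelta

# Python equivalent of the yyyyMMddHHmmss pattern used by mergeStartAt.
MERGE_TIME_FORMAT = "%Y%m%d%H%M%S"


def parse_merge_start(value: str) -> datetime:
    """Parse a mergeStartAt string such as '20190930000000'."""
    return datetime.strptime(value, MERGE_TIME_FORMAT)


def merge_times(merge_start_at: str,
                merge_interval_ms: int = 86_400_000,
                count: int = 3) -> list:
    """List the first `count` merge times.

    Assumption: merges run at start + k * interval for k = 0, 1, 2, ...
    """
    start = parse_merge_start(merge_start_at)
    step = timedelta(milliseconds=merge_interval_ms)
    return [start + k * step for k in range(count)]


if __name__ == "__main__":
    for t in merge_times("20190930000000"):
        print(t.isoformat())
```

With the default interval of 86400000 ms, consecutive merge times are exactly one day apart.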

Parameters

The exported table is specified in the following format:

hbaseTable/odpsTable {"cols": ["cf1:a|string", "cf1:b|int", "cf1:c|long", "cf1:d|short", "cf1:e|decimal", "cf1:f|double", "cf1:g|float", "cf1:h|boolean", "cf1:i"], "mergeInterval": 86400000, "mergeStartAt": "20191008100547"}
hbaseTable/odpsTable {"cols": ["cf1:a", "cf1:b", "cf1:c"], "mergeStartAt": "20191008000000"}
hbaseTable {"mergeEnabled": false} // Disables the merge operation.
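Each line above consists of a table part (hbaseTable, optionally followed by /odpsTable) and a braces-delimited configuration. The Python sketch below splits such a line into its parts. It assumes the configuration is valid JSON, which the examples suggest but the text does not guarantee. TableMapping and parse_mapping are hypothetical names used for illustration, not part of LTS.

```python
import json
from typing import NamedTuple, Optional


class TableMapping(NamedTuple):
    hbase_table: str
    odps_table: Optional[str]  # None when the destination name is omitted
    conf: dict


def parse_mapping(line: str) -> TableMapping:
    """Split one mapping line into source table, destination table, and config."""
    brace = line.find("{")
    if brace == -1:
        # No configuration given; treat it as empty.
        names, conf = line.strip(), {}
    else:
        names = line[:brace].strip()
        # raw_decode stops at the end of the JSON object, so trailing text
        # such as a "// ..." annotation is ignored.
        conf, _end = json.JSONDecoder().raw_decode(line[brace:])
    hbase_table, _sep, odps_table = names.partition("/")
    return TableMapping(hbase_table, odps_table or None, conf)


if __name__ == "__main__":
    print(parse_mapping('hbaseTable {"mergeEnabled": false} // Disables the merge operation.'))
```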

The table mapping expression consists of three parts: {{hbaseTable}}, {{odpsTable}}, and {{tbConf}}. {{hbaseTable}} specifies the source HBase table. {{odpsTable}} is optional and specifies the destination table name. By default, the name is the same as the HBase table name. Note that MaxCompute converts unsupported characters such as . and - in table names to _. {{tbConf}} specifies the archive behavior for the table. The following table describes the parameters supported by {{tbConf}}.

| Parameter | Description | Example |
| --- | --- | --- |
| cols | Specifies the columns and their data types to export. By default, data is converted to the HexString format. | "cols": ["cf1:a", "cf1:b", "cf1:c"] |
| mergeEnabled | Specifies whether to convert the key-value (KV) table to a wide table. The default value is true. | "mergeEnabled": false |
| mergeStartAt | The start time for the merge operation. The format is yyyyMMddHHmmss. You can specify a time in the past. | "mergeStartAt": "20191008000000" |
| mergeInterval | The interval for the merge operation, in milliseconds. The default value is 86400000 (one day), which means data is archived daily. | "mergeInterval": 86400000 |
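As a rough illustration of how the parameters combine, the following Python sketch builds a table mapping line and derives the default MaxCompute table name. The helper names are hypothetical. The name conversion covers only the two characters mentioned in this topic (. and -); the full set of characters that MaxCompute converts is not listed here.

```python
import json


def default_odps_name(hbase_table: str) -> str:
    """Approximate the default destination name for an HBase table.

    Only "." and "-" are handled, because those are the only characters
    this topic names; other unsupported characters are left untouched.
    """
    return hbase_table.replace(".", "_").replace("-", "_")


def build_mapping(hbase_table, odps_table=None, cols=None,
                  merge_enabled=None, merge_start_at=None,
                  merge_interval_ms=None) -> str:
    """Assemble one mapping line; parameters left as None are omitted."""
    conf = {}
    if cols is not None:
        conf["cols"] = list(cols)
    if merge_enabled is not None:
        conf["mergeEnabled"] = merge_enabled
    if merge_interval_ms is not None:
        conf["mergeInterval"] = merge_interval_ms
    if merge_start_at is not None:
        conf["mergeStartAt"] = merge_start_at
    names = hbase_table if odps_table is None else f"{hbase_table}/{odps_table}"
    # Whether an empty {} configuration is accepted is not stated in this topic.
    return f"{names} {json.dumps(conf)}"


if __name__ == "__main__":
    print(build_mapping("wal-test", cols=["cf1:a", "cf1:b", "cf1:c", "cf1:d"]))
    print(default_odps_name("wal-test"))
```

Omitting unset parameters keeps the line short and leaves the defaults described in the table in effect.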