With DataX, you can synchronize full data from an HBase database to a data table in Tablestore.
Usage notes
DataX supports only full data synchronization from HBase to Tablestore. Incremental data synchronization is not supported.
Prerequisites
- You have a Linux server with the following software installed.
  Note: If you do not have a server that runs a Linux system, you can use an ECS instance to deploy a Linux system before you proceed. For more information, see Use an ECS instance from the console (Quick Start).
  - Java 8 (64-bit) is installed.
  - Python 2 or Python 3 is installed.
- You have completed the following tasks in RAM:
  - You have created a RAM user and granted it the AliyunOTSFullAccess permission. For more information, see Create a RAM user and Grant permissions to a RAM user.
    Warning: Using your Alibaba Cloud account's AccessKey directly poses a security risk. We recommend using a RAM user's AccessKey to minimize risks from a potential leak.
  - You have created an AccessKey for the RAM user. For more information, see Create an AccessKey.
- You have completed the following tasks in Tablestore:
  - You have created a Tablestore instance. For more information, see Create an instance.
  - You have obtained the endpoint from the instance details page.
    On the Overview page, click the instance name. On the Instance Details tab, find the Instance Endpoint section and select the endpoint that matches your network environment.
    The available network types are classic network, public network, VPC, and public network (dual-stack).
  - You have created a Tablestore data table to store the migrated data. For more information, see Create a data table.
    Note: When creating the data table, we recommend using the original HBase primary key or a unique index as the primary key for the Tablestore table.
Step 1: Download DataX
You can either download the DataX source code and compile it locally, or download the pre-compiled package directly.
- Download and compile the DataX source code.
  - Run the following command to download the DataX source code with Git.
    git clone https://github.com/alibaba/DataX.git
  - Go to the downloaded source code directory and run the following command to package the project with Maven.
    Note: This step compiles reader and writer plugins for all data sources and may take a significant amount of time.
    mvn -U clean package assembly:assembly -Dmaven.test.skip=true
    After the compilation is complete, go to the target/datax/datax directory. The following table describes the subdirectories.
    | Directory | Description |
    | bin | Contains the executable file datax.py, which is the entry point for the DataX tool. |
    | plugin | Contains the reader and writer plugins for various data sources. |
    | conf | Contains the core.json file, which defines default parameter values such as channel flow control and buffer size. You generally do not need to modify this file. |
- Download the DataX toolkit directly.
Step 2: Prepare the JSON file
The HbaseReader plugin uses the HBase Java client to connect to a remote HBase service, scan data within a specified rowkey range, and assemble it into a custom DataX dataset. A downstream writer then processes this dataset.
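The rowkey-range scan described above can be pictured with a small Python sketch. It assumes the usual HBase scan convention: rowkeys are compared as raw bytes in lexicographic order, the start key is inclusive, the end key is exclusive, and an empty key means "no bound on this side". The function names are hypothetical and are not part of DataX or HBase.

```python
def in_scan_range(rowkey, start=b"", end=b""):
    """Return True if rowkey falls in the half-open range [start, end).

    An empty start or end is treated as unbounded, which is the assumed
    meaning of leaving startRowkey or endRowkey empty in the reader config.
    """
    if start and rowkey < start:
        return False
    if end and rowkey >= end:
        return False
    return True


def scan(rowkeys, start=b"", end=b""):
    """Return the rowkeys in range, in byte-wise sorted order like an HBase scan."""
    return [key for key in sorted(rowkeys) if in_scan_range(key, start, end)]


stored = [b"user3", b"user1", b"user2"]
full_scan = scan(stored)                        # both bounds empty: every row
partial_scan = scan(stored, b"user1", b"user3")  # user3 is excluded
```

Under these assumptions, leaving both bounds empty reads the whole table, which is what a full synchronization needs.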
Select the reader plugin that corresponds to your HBase version.
DataX provides reader plugins only for HBase 0.94, HBase 1.1, and HBase 2.0. To read data from other HBase versions, you must build a custom tool by using the HBase API.
- Hbase2.0 XReader: This plugin reads data from Phoenix.
Configuration example
The following example shows how to configure a job to extract data from HBase to a local file in normal mode by using the Hbase1.1 XReader. For detailed descriptions of the configuration parameters, see the Hbase11XReader plugin documentation.
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hbase11xreader",
                    "parameter": {
                        "hbaseConfig": {
                            "hbase.zookeeper.quorum": "your.zookeeper.quorum.host:2181"
                        },
                        "table": "users",
                        "encoding": "utf-8",
                        "mode": "normal",
                        "column": [
                            {
                                "name": "rowkey",
                                "type": "string"
                            },
                            {
                                "name": "info:age",
                                "type": "string"
                            },
                            {
                                "name": "info:birthday",
                                "type": "date",
                                "format": "yyyy-MM-dd"
                            },
                            {
                                "name": "info:company",
                                "type": "string"
                            },
                            {
                                "name": "address:country",
                                "type": "string"
                            },
                            {
                                "name": "address:province",
                                "type": "string"
                            },
                            {
                                "name": "address:city",
                                "type": "string"
                            }
                        ],
                        "range": {
                            "startRowkey": "",
                            "endRowkey": "",
                            "isBinaryRowkey": true
                        }
                    }
                },
                "writer": {
                    "name": "txtfilewriter",
                    "parameter": {
                        "path": "/path/to/your/result/directory",
                        "fileName": "hbase_data",
                        "writeMode": "truncate"
                    }
                }
            }
        ]
    }
}
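The example above writes to a local file. To write to Tablestore instead, the writer section would need to point at a Tablestore writer plugin. The following Python sketch shows one way to swap the writer in a job definition. It is an illustration under assumptions, not a verified configuration: the plugin name otswriter and the parameter names (endpoint, accessId, accessKey, instanceName, table, primaryKey, column, writeMode) are recalled from the DataX OTSWriter plugin documentation and should be checked against the version you run. All values in angle brackets are placeholders, the Tablestore column names are hypothetical, and the sketch assumes that the writer maps reader columns by position, primary key columns first.

```python
import json


def to_tablestore_job(job, writer_parameter):
    """Return a copy of a DataX job whose writer is replaced by otswriter."""
    new_job = json.loads(json.dumps(job))  # deep copy through a JSON round trip
    for content in new_job["job"]["content"]:
        # Assumed plugin name; verify against your DataX plugin directory.
        content["writer"] = {"name": "otswriter", "parameter": writer_parameter}
    return new_job


# Assumed OTSWriter parameters; every value in angle brackets is a placeholder.
ots_parameter = {
    "endpoint": "<endpoint from the instance details page>",
    "accessId": "<AccessKey ID of the RAM user>",
    "accessKey": "<AccessKey secret of the RAM user>",
    "instanceName": "<your Tablestore instance name>",
    "table": "users",
    # One primary key column plus six attribute columns, matching the seven
    # reader columns in order (hypothetical Tablestore column names).
    "primaryKey": [{"name": "rowkey", "type": "string"}],
    "column": [
        {"name": "age", "type": "string"},
        {"name": "birthday", "type": "string"},
        {"name": "company", "type": "string"},
        {"name": "country", "type": "string"},
        {"name": "province", "type": "string"},
        {"name": "city", "type": "string"},
    ],
    "writeMode": "PutRow",
}

# Trimmed stand-in for the job shown above; only the parts needed here.
sample_job = {
    "job": {
        "setting": {"speed": {"channel": 1}},
        "content": [
            {
                "reader": {"name": "hbase11xreader", "parameter": {"table": "users"}},
                "writer": {"name": "txtfilewriter", "parameter": {}},
            }
        ],
    }
}

job = to_tablestore_job(sample_job, ots_parameter)
print(json.dumps(job, indent=4))
```

The printed JSON could be saved as the job file that is passed to datax.py, for example hbase_to_ots.json.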
Step 3: Run the synchronization command
Run the following command to synchronize the data.
python datax.py -j"-Xms4g -Xmx4g" hbase_to_ots.json
The -j"-Xms4g -Xmx4g" parameter sets the initial and maximum JVM heap size to 4 GB, which limits the amount of memory used by the JVM. If you do not specify this parameter, the system uses the configuration in conf/core.json, which defaults to 1 GB.
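If you launch the job from a script, the command line can be assembled as an argument list. The following Python sketch is a minimal example under assumptions: the helper names build_datax_command and run_job are hypothetical, the script is assumed to run from the DataX bin directory, and the heap size and job file name simply mirror the command above.

```python
import subprocess


def build_datax_command(job_file, heap="4g", datax_py="datax.py", python="python"):
    """Build the argument list for a DataX run with a fixed JVM heap size."""
    jvm_options = "-Xms{0} -Xmx{0}".format(heap)
    # Passing the JVM options as a separate argument after -j avoids shell quoting.
    return [python, datax_py, "-j", jvm_options, job_file]


def run_job(job_file, heap="4g"):
    """Run the job and raise CalledProcessError if DataX exits with a non-zero code."""
    subprocess.run(build_datax_command(job_file, heap), check=True)


command = build_datax_command("hbase_to_ots.json")
print(" ".join(command))
```

Calling run_job("hbase_to_ots.json") would start the synchronization. The sketch only builds and prints the command so that it can be inspected first.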
Related operations
After you migrate your HBase data to Tablestore, you can read it by using the Tablestore SDK or the Tablestore HBase Client. For more information, see Migrate from HBase Client to Tablestore HBase Client.