The HBase data source lets you read from and write to HBase. This topic describes its data synchronization capabilities in DataWorks.
Supported versions
There are two types of HBase plugins: the HBase plugin and the HBase{xx}xsql plugin. The HBase{xx}xsql plugin requires both HBase and Phoenix.
- HBase plugin: Supports HBase0.94.x, HBase1.1.x, and HBase2.x. Both the codeless UI and the code editor are supported. Use the hbaseVersion parameter to specify the version.
  - If your HBase version is HBase0.94.x, set hbaseVersion to 094x for both the reader and the writer.
    "reader": { "hbaseVersion": "094x" }
    "writer": { "hbaseVersion": "094x" }
  - If your HBase version is HBase1.1.x or HBase2.x, set hbaseVersion to 11x for both the reader and the writer.
    "reader": { "hbaseVersion": "11x" }
    "writer": { "hbaseVersion": "11x" }
    The HBase1.1.x plugin is compatible with HBase 2.0.
- HBase{xx}xsql plugin:
  - HBase20xsql plugin: Supports HBase2.x and Phoenix5.x. Only the code editor is supported.
  - HBase11xsql plugin: Supports HBase1.1.x and Phoenix5.x. Only the code editor is supported.
- The HBase{xx}xsql writer plugin provides a simple way to import data in batches into SQL tables (Phoenix) in HBase. Because Phoenix encodes the rowkey, writing data directly through the HBase API requires manual data conversion, which is complex and error-prone.
  Note: The plugin uses the Phoenix JDBC driver to execute UPSERT statements and writes data to the table in batches. Because it works through this high-level interface, index tables are also updated synchronously.
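The batched UPSERT behavior described in the note can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the plugin's actual code: the helper names `build_upsert` and `chunk_rows`, the table `TEST`, and the columns `ID` and `NAME` are hypothetical. The sketch only builds the parameterized statement text and groups rows into batches. It does not open a Phoenix connection.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def build_upsert(table: str, columns: Sequence[str]) -> str:
    """Build a parameterized Phoenix UPSERT statement (hypothetical helper)."""
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"UPSERT INTO {table} ({column_list}) VALUES ({placeholders})"


def chunk_rows(rows: Iterable[Tuple], batch_size: int = 256) -> Iterator[List[Tuple]]:
    """Group rows into batches of at most batch_size rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Tuple] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller batch
        yield batch


# Hypothetical table and columns, used only for illustration.
statement = build_upsert("TEST", ["ID", "NAME"])
batches = list(chunk_rows([(i, f"name-{i}") for i in range(600)], batch_size=256))
print(statement)
print([len(b) for b in batches])
```

Each batch would be bound to the same statement and submitted together, which is why a larger batch reduces round trips at the cost of more memory per batch.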
Limitations
|
HBase Reader |
HBase20xsql Reader |
HBase11xsql Writer |
|
|
|
Features
HBase Reader
HBase Reader supports the normal and multiVersionFixedColumn modes. See HBase field mapping guide for configuration instructions.
- In normal mode, HBase Reader treats an HBase table as a standard two-dimensional table (wide table) and reads the latest version of the data.
  hbase:007:0> scan 'student'
  ROW     COLUMN+CELL
  s001    column=basic:age, timestamp=2026-03-09T14:41:40.240, value=20
  s001    column=basic:name, timestamp=2026-03-09T14:41:40.214, value=Tom
  s001    column=score:english, timestamp=2026-03-09T14:41:40.333, value=90
  s001    column=score:math, timestamp=2026-03-09T14:41:40.277, value=85
  1 row(s) in 0.0580 seconds
  The following table shows the output data.
  | Row key | basic:age | basic:name | score:english | score:math |
  | s001 | 20 | Tom | 90 | 85 |
- In multiVersionFixedColumn mode, HBase Reader reads an HBase table as a narrow table. Each returned record consists of four columns: rowKey, family:qualifier, timestamp, and value. You must explicitly specify the columns to read. The value of each cell is treated as a record. If multiple versions exist, multiple records are returned.
  hbase:007:0> scan 'student',{VERSIONS=>5}
  ROW     COLUMN+CELL
  s001    column=basic:age, timestamp=2026-03-09T14:41:40.240, value=20
  s001    column=basic:age, timestamp=2026-03-09T14:30:00.100, value=19
  s001    column=basic:name, timestamp=2026-03-09T14:41:40.214, value=Tom
  s001    column=score:english, timestamp=2026-03-09T14:41:40.333, value=90
  s001    column=score:math, timestamp=2026-03-09T14:41:40.277, value=85
  1 row(s) in 0.0260 seconds
  The following table shows the output data.
  | Row key | family:qualifier | Timestamp | Value |
  | s001 | basic:age | 2026-03-09T14:41:40.240 | 20 |
  | s001 | basic:age | 2026-03-09T14:30:00.100 | 19 |
  | s001 | basic:name | 2026-03-09T14:41:40.214 | Tom |
  | s001 | score:english | 2026-03-09T14:41:40.333 | 90 |
  | s001 | score:math | 2026-03-09T14:41:40.277 | 85 |
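The difference between the two read modes can be illustrated with a small in-memory model of the `student` cells shown above. This is a sketch only: the functions `read_normal` and `read_multi_version` are hypothetical names, the cells are plain tuples rather than HBase results, and timestamps are kept as the ISO-8601 strings printed by the shell.

```python
from typing import Dict, List, Optional, Tuple

# One cell: (rowkey, "family:qualifier", timestamp, value).
Cell = Tuple[str, str, str, str]

CELLS: List[Cell] = [
    ("s001", "basic:age", "2026-03-09T14:41:40.240", "20"),
    ("s001", "basic:age", "2026-03-09T14:30:00.100", "19"),
    ("s001", "basic:name", "2026-03-09T14:41:40.214", "Tom"),
    ("s001", "score:english", "2026-03-09T14:41:40.333", "90"),
    ("s001", "score:math", "2026-03-09T14:41:40.277", "85"),
]
COLUMNS = ["basic:age", "basic:name", "score:english", "score:math"]


def read_normal(cells: List[Cell], columns: List[str]) -> List[List[Optional[str]]]:
    """Wide table: one record per rowkey, keeping the newest version of each column."""
    latest: Dict[Tuple[str, str], Tuple[str, str]] = {}
    for rowkey, column, ts, value in cells:
        key = (rowkey, column)
        # Same-format ISO-8601 strings compare chronologically.
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    rowkeys = sorted({cell[0] for cell in cells})
    return [
        [rowkey] + [latest.get((rowkey, c), (None, None))[1] for c in columns]
        for rowkey in rowkeys
    ]


def read_multi_version(cells: List[Cell], columns: List[str], max_version: int = -1) -> List[Cell]:
    """Narrow table: one record per cell version; -1 keeps every version."""
    records: List[Cell] = []
    for rowkey in sorted({cell[0] for cell in cells}):
        for column in columns:
            versions = sorted(
                (c for c in cells if c[0] == rowkey and c[1] == column),
                key=lambda c: c[2],
                reverse=True,  # newest version first
            )
            if max_version != -1:
                versions = versions[:max_version]
            records.extend(versions)
    return records


wide = read_normal(CELLS, COLUMNS)
narrow = read_multi_version(CELLS, COLUMNS)
print(wide)
print(len(narrow))
```

In this model, normal mode yields a single record for `s001`, while multiVersionFixedColumn mode yields five records because `basic:age` has two versions.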
HBase Writer
- rowkey generation rule: HBase Writer currently supports concatenating multiple source fields to form the rowkey of an HBase table.
- Version (timestamp): You can specify the version used when writing data to HBase. The available options are:
  - Use the current time as the version.
  - Use a source column as the version.
  - Use a specified time as the version.
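The rowkey concatenation and the three version options can be sketched as follows. This is an illustration under assumptions, not the writer's implementation: `build_rowkey` and `resolve_version` are hypothetical helpers, and treating an `index` of -1 as a constant taken from `value` is inferred from the `rowkeyColumn` entry in the HBase Writer script in the appendix.

```python
import time
from typing import Any, Dict, List, Optional, Sequence


def build_rowkey(record: Sequence[Any], rowkey_columns: List[Dict[str, str]]) -> str:
    """Concatenate source fields and constants into one rowkey string."""
    parts = []
    for spec in rowkey_columns:
        index = int(spec["index"])
        if index == -1:
            # Assumption: index -1 marks a constant supplied in "value".
            parts.append(spec["value"])
        else:
            parts.append(str(record[index]))
    return "".join(parts)


def resolve_version(
    record: Sequence[Any],
    source_index: Optional[int] = None,
    constant: Optional[int] = None,
) -> int:
    """Pick the cell version in milliseconds: source column, fixed time, or current time."""
    if source_index is not None:
        return int(record[source_index])
    if constant is not None:
        return int(constant)
    return int(time.time() * 1000)


ROWKEY_COLUMNS = [
    {"index": "0", "type": "string"},
    {"index": "-1", "type": "string", "value": "_"},
    {"index": "1", "type": "string"},
]
record = ["s001", "Tom", 20, 1700000000000]
rowkey = build_rowkey(record, ROWKEY_COLUMNS)
print(rowkey)
print(resolve_version(record, source_index=3))
```

A separator constant such as `_` between fields keeps concatenated rowkeys unambiguous when the source fields have variable length.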
Supported data types
Batch read
- The following table shows the supported HBase data types and how HBase Reader converts them.
  | Category | Data Integration column type | Database data type |
  | Integer | long | short, int, and long |
  | Floating-point | double | float and double |
  | String | string | binary_string and string |
  | Date and time | date | date |
  | Byte | bytes | bytes |
  | Boolean | boolean | boolean |
- HBase20xsql Reader supports most, but not all, Phoenix data types. Make sure that your data types are supported.
- The following table shows how HBase20xsql Reader maps Phoenix data types to DataX internal types.
  | DataX internal type | Phoenix data type |
  | long | INTEGER, TINYINT, SMALLINT, BIGINT |
  | double | FLOAT, DECIMAL, DOUBLE |
  | string | CHAR, VARCHAR |
  | date | DATE, TIME, TIMESTAMP |
  | bytes | BINARY, VARBINARY |
  | boolean | BOOLEAN |
Batch write
The following table lists the data types that HBase Writer supports.
- Each column's configured data type must match the corresponding data type in the HBase table.
- Only the data types listed in the table are supported.
| Category | Database data type |
| Integer | INT, LONG, and SHORT |
| Floating-point | FLOAT and DOUBLE |
| Boolean | BOOLEAN |
| String | STRING |
Precautions
If you encounter the "tried to access method com.google.common.base.Stopwatch" error when testing connectivity, add the hbaseVersion property to the data source configuration to specify the HBase version.
Add a data source
Before you develop a synchronization task in DataWorks, add the required data source to DataWorks by following the instructions in Data source management. When you add the data source, the DataWorks console shows a description of each parameter.
Develop a data synchronization task
For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure single-table batch synchronization
- For details, see Codeless UI configuration and Script mode configuration.
  Because HBase is a schemaless data source, the codeless UI does not display field mappings by default. You must configure them manually.
  When you use HBase as the data source, first select an Output Mode: normal mode or multiVersionFixedColumn mode.
  The field mapping configuration differs for each mode:
  - normal mode: This is the default mode. It reads an HBase table as a standard two-dimensional table and retrieves the latest version of the data. When HBase is the data source, you must configure the mapping between the Source Field and the Target Field. Source and target fields map one-to-one. Because the source table has no fixed fields, fields are mapped by order by default. To change the mapping, manually edit the field order.
    New version
    In the field mapping configuration, the mappings are: rowkey → rowkey, basic:age → age, basic:name → name, score:english → english, and score:math → math. The source fields are displayed in JSON format and include the name and type properties (both are string).
    Legacy version
    In the field mapping configuration, source fields are displayed in the Type|ColumnFamily:ColumnName format, including string|rowkey, string|basic:age, string|basic:name, string|score:english, and string|score:math. These map to the target fields rowkey, age, name, english, and math, respectively.
    The target table contains fields such as rowkey, age, name, english, math, and pt. For example: rowkey=s001, age=20, name=Tom, english=90, math=85, pt=222222.
  - multiVersionFixedColumn mode: Each output record consists of four columns (rowKey, family:qualifier, timestamp, and value), and this mode lets you read multiple versions of the data. The Source Field is configured in the ColumnFamily:Qualifier format, such as basic:age. The destination table has four fixed columns: row_key, cf, timestamp_col, and value. No mapping configuration is required.
    New version
    In the field mapping area, map source fields to target fields. The source fields are in JSON format. Example mappings: {"name":"rowkey","type":"string"} maps to the target field rowkey, {"name":"basic:age","type":"string"} maps to family, {"name":"basic:name","type":"string"} maps to timestamp, and {"name":"score:english","type":"string"} maps to value. The system does not synchronize the unmapped source field {"name":"score:math","type":"string"}. You can click the Edit button on either side to edit the source fields or the target fields.
    Legacy version
    In the field mapping area, the source fields include string|rowkey, string|basic:age, string|basic:name, string|score:english, and string|score:math. The target fields include rowkey, family, timestamp, and value. The source and target fields are mapped row by row.
    The target table contains four fixed columns: row_key, cf, timestamp_col, and value. For example: row_key=s001, cf=basic:age, timestamp_col=1234567890, value=20.
  - When HBase is the data destination (only normal mode is supported), you must configure the Target Field and the rowkey. You can form the rowkey by concatenating multiple source fields.
- For parameters and script examples for script mode, see Appendix: Sample script and parameters.
FAQ
- Q: What is an appropriate concurrency setting? Does increasing concurrency help when the import speed is slow?
  A: The default heap size of the Java Virtual Machine (JVM) for a data import process is 2 GB. Concurrency is implemented through multi-threading and is configured as the number of channels. Too many threads cause frequent garbage collection, which can degrade performance instead of improving import speed. We recommend 5 to 10 concurrent threads (channels).
- Q: What is an optimal value for batchSize?
  A: The default value is 256. Calculate the optimal batchSize based on the row size. A single batch should typically contain 2 MB to 4 MB of data. Divide this data volume by the row size to determine the recommended batchSize.
- Q: When reading data from HBase in multiVersionFixedColumn mode, I receive a java.lang.StringIndexOutOfBoundsException: String index out of range: -1 error. How can I resolve this?
  A: This error typically occurs because the name field in the column configuration does not follow the columnFamily:qualifier format. For example, you might have specified only the qualifier, such as age, instead of basic:age. To resolve this, make sure that the name of every column except rowkey is in the columnFamily:qualifier format.
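Two of the FAQ answers above reduce to simple checks that can be run before a task is submitted. The following is a rough sketch: `recommended_batch_size` and `invalid_column_names` are hypothetical helper names, the 2 MB default target comes from the lower end of the range given in the answer, and the average row size is something you would measure yourself.

```python
from typing import Dict, List

MIB = 1024 * 1024


def recommended_batch_size(avg_row_bytes: int, target_batch_bytes: int = 2 * MIB) -> int:
    """Divide the target batch volume by the row size, as the FAQ suggests."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return max(1, target_batch_bytes // avg_row_bytes)


def invalid_column_names(columns: List[Dict[str, str]]) -> List[str]:
    """Return names that are neither rowkey nor in columnFamily:qualifier form."""
    bad = []
    for column in columns:
        name = column.get("name", "")
        if name == "rowkey":
            continue
        family, separator, qualifier = name.partition(":")
        if not (separator and family and qualifier):
            bad.append(name)
    return bad


columns = [
    {"name": "rowkey", "type": "string"},
    {"name": "age", "type": "string"},        # missing the column family
    {"name": "basic:name", "type": "string"},
]
print(recommended_batch_size(1024))           # 1 KB rows, 2 MB target
print(recommended_batch_size(1024, 4 * MIB))  # 1 KB rows, 4 MB target
print(invalid_column_names(columns))
```

For 1 KB rows the sketch suggests a batchSize between 2048 and 4096, well above the default of 256, and it flags `age` as the column name that would trigger the exception.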
Appendix: Sample script and parameters
Configure a batch synchronization task by using the code editor
To configure a batch synchronization task in the code editor, set the parameters in the script according to the unified script format requirements. For more information, see Script mode configuration. The following sections describe the data source parameters that you configure in the code editor.
HBase Reader example
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"hbase",// The plugin name.
"parameter":{
"mode":"normal",// Specifies the mode for reading data from HBase. Valid values: `normal` and `multiVersionFixedColumn`.
"scanCacheSize":"256",// Specifies the number of rows to fetch from the server per RPC.
"scanBatchSize":"100",// Specifies the number of columns to fetch from the server per RPC.
"hbaseVersion":"094x/11x",// The HBase version. Specify either 094x or 11x, not the literal placeholder shown here.
"column":[// The fields to read.
{
"name":"rowkey",// The field name.
"type":"string"// The data type.
},
{
"name":"basic:age",
"type":"string"
},
{
"name":"basic:name",
"type":"string"
},
{
"name":"score:english",
"type":"string"
},
{
"name":"score:math",
"type":"string"
}
],
"range":{// Specifies the rowkey range to read.
"endRowkey":"",// The end rowkey.
"isBinaryRowkey":true,// Specifies whether to use binary conversion for startRowkey and endRowkey. The default value is false.
"startRowkey":""// The start rowkey.
},
"maxVersion":"",// Specifies the number of versions to read in multi-version mode.
"encoding":"UTF-8",// The encoding format.
"table":"student",// The table name.
"hbaseConfig":{// The connection configuration for the HBase cluster, in JSON format.
"hbase.zookeeper.quorum":"hostname",
"hbase.rootdir":"hdfs://ip:port/database",
"hbase.cluster.distributed":"true"
}
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"odps",// The plugin name for the destination. This example uses MaxCompute. You can replace it with another Writer plugin.
"parameter":{
"partition":"",// Partition information for the destination table. Not required for non-partitioned tables.
"truncate":true,// Specifies whether to clear the destination table or partition before writing data.
"datasource":"odps_datasource",// The MaxCompute data source name.
"column":[// The destination fields.
"rowkey",
"basic_age",
"basic_name",
"score_english",
"score_math"
],
"table":"student_target"// The name of the destination MaxCompute table.
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of error records to allow before the job fails.
},
"speed":{
"throttle":true,// Specifies whether to enable rate limiting. If set to false, the mbps parameter is ignored.
"concurrent":1,// The number of concurrent jobs.
"mbps":"12"// The rate limit in megabytes per second (MB/s).
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
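The scripts in this appendix carry `//` comments for explanation, which strict JSON parsers such as Python's `json` module reject. If you want to inspect or lint a script locally, one option is to strip the comments first. The sketch below is an assumption-laden illustration rather than a DataWorks tool: `strip_line_comments` is a hypothetical helper, and the embedded `SCRIPT` is a shortened fragment of the reader example. The helper skips `//` only outside string literals, so values such as `hdfs://ip:port/database` survive.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Shortened, hypothetical fragment modeled on the reader example above.
SCRIPT = '''{
    "stepType":"hbase",// The plugin name.
    "parameter":{
        "mode":"normal",// The read mode.
        "hbaseConfig":{
            "hbase.rootdir":"hdfs://ip:port/database"
        }
    }
}'''

config = json.loads(strip_line_comments(SCRIPT))
print(config["parameter"]["hbaseConfig"]["hbase.rootdir"])
```

A plain regular expression that deletes everything after `//` would also truncate the `hdfs://` URL, which is why the sketch tracks whether it is inside a string.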
HBase Reader script (multiVersionFixedColumn mode)
The following example shows a complete script for reading data from HBase in multiVersionFixedColumn mode and writing it to MaxCompute. In this mode, the value of each cell in HBase is converted into a separate record. Each record consists of four columns: rowkey, family:qualifier, timestamp, and value.
{
"type":"job",
"version":"2.0",
"steps":[
{
"stepType":"hbase",// Plugin name.
"parameter":{
"mode":"multiVersionFixedColumn",// The mode for reading data from HBase. This example uses multiVersionFixedColumn mode.
"scanCacheSize":"256",// The number of rows that the HBase client reads from the server in each remote procedure call (RPC).
"scanBatchSize":"100",// The number of columns that the HBase client reads from the server in each RPC.
"hbaseVersion":"20x",// HBase version.
"datasource":"hbase_datasource",// HBase data source name.
"column":[// The columns to read. The first column must be rowkey. The names of other columns must be in the "column family:qualifier" format.
{
"name":"rowkey",// The rowkey column.
"type":"string"
},
{
"name":"basic:age",// The age column in the basic column family.
"type":"string"
},
{
"name":"basic:name",// The name column in the basic column family.
"type":"string"
},
{
"name":"score:english",// The english column in the score column family.
"type":"string"
},
{
"name":"score:math",// The math column in the score column family.
"type":"string"
}
],
"range":{
"isBinaryRowkey":false
},
"maxVersion":"-1",// The maximum number of versions to read. This parameter is required in multiVersionFixedColumn mode. A value of -1 specifies that all versions are read.
"encoding":"UTF-8",// Encoding format.
"table":"student"// HBase table name.
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"odps",// The name of the destination plugin. This example uses MaxCompute.
"parameter":{
"partition":"",// The partition of the destination table. This parameter is not required for non-partitioned tables.
"truncate":true,// If set to true, this clears the destination table or partition before writing data.
"datasource":"odps_datasource",// MaxCompute data source name.
"column":[// The destination has four fixed columns that correspond to the rowkey, family:qualifier, timestamp, and value from the source, respectively.
"row_key",
"cf",
"timestamp_col",
"value"
],
"table":"hbase_multiversion_target"// The name of the destination MaxCompute table.
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of error records allowed.
},
"speed":{
"throttle":false,// No rate limiting.
"concurrent":2// Job concurrency.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
The destination table in MaxCompute must be created in advance. For example: CREATE TABLE IF NOT EXISTS hbase_multiversion_target (row_key STRING, cf STRING, timestamp_col STRING, value STRING);
HBase Reader script parameters
| Parameter | Description | Required | Default |
| haveKerberos | If haveKerberos is true, the HBase cluster requires Kerberos authentication. | No | false |
| hbaseConfig | The configuration required to connect to the HBase cluster, in JSON format. The hbase.zookeeper.quorum parameter, which specifies the ZooKeeper endpoint for HBase, is required. You can also add other HBase client configurations, such as scan cache and batch settings, to optimize interaction with the server. Note: If you are connecting to an ApsaraDB for HBase instance, you must use its internal network endpoint. | Yes | None |
| mode | The mode for reading data from HBase. Valid values: normal and multiVersionFixedColumn. | Yes | None |
| table | The name of the HBase table to read. Table names are case-sensitive. | Yes | None |
| encoding | The encoding format, such as UTF-8 or GBK, used to convert a binary HBase byte[] value to a String. | No | utf-8 |
| column | The HBase fields to read. This parameter is required in both normal mode and multiVersionFixedColumn mode. | Yes | None |
| maxVersion | The maximum number of cell versions to read in multi-version mode. A value of -1 reads all versions. | Required in multiVersionFixedColumn mode | None |
| range | Specifies the rowkey range to read. | No | None |
| scanCacheSize | The number of rows to fetch from HBase in a single remote procedure call (RPC). | No | 256 |
| scanBatchSize | The number of columns to fetch from HBase in a single RPC. Set this to -1 to fetch all columns. Note: The value of scanBatchSize should be greater than the actual number of columns to avoid data quality risks. | No | 100 |
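The requirements in the parameter table can be turned into a quick pre-flight check. The following is a sketch under assumptions, not a DataWorks API: `check_reader_parameter` is a hypothetical helper, and accepting either `hbaseConfig` or `datasource` as the connection setting is inferred from the two reader scripts above, which use one each.

```python
from typing import Any, Dict, List


def check_reader_parameter(parameter: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an HBase Reader "parameter" object."""
    problems = []
    for key in ("mode", "table", "column"):
        if not parameter.get(key):
            problems.append(f"missing required parameter: {key}")
    # Assumption: an inline hbaseConfig or a named datasource supplies the connection.
    if not parameter.get("hbaseConfig") and not parameter.get("datasource"):
        problems.append("missing connection: set hbaseConfig or datasource")
    mode = parameter.get("mode")
    if mode and mode not in ("normal", "multiVersionFixedColumn"):
        problems.append(f"unsupported mode: {mode}")
    if mode == "multiVersionFixedColumn" and parameter.get("maxVersion") in (None, ""):
        problems.append("maxVersion is required in multiVersionFixedColumn mode")
    batch = int(parameter.get("scanBatchSize", 100))
    data_columns = [c for c in parameter.get("column") or [] if c.get("name") != "rowkey"]
    if batch != -1 and batch <= len(data_columns):
        problems.append("scanBatchSize should be greater than the number of columns")
    return problems


bad = {
    "mode": "multiVersionFixedColumn",
    "table": "student",
    "datasource": "hbase_datasource",
    "scanBatchSize": "1",
    "column": [
        {"name": "rowkey", "type": "string"},
        {"name": "basic:age", "type": "string"},
    ],
}
good = dict(bad, maxVersion="-1", scanBatchSize="100")
print(check_reader_parameter(bad))
print(check_reader_parameter(good))
```

The first call reports the missing maxVersion and the undersized scanBatchSize, and the second returns an empty list.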
HBase Writer script
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"stream",
"parameter":{},
"name":"Reader",
"category":"reader"
},
{
"stepType":"hbase",// The plugin name.
"parameter":{
"mode":"normal",// The write mode for HBase.
"walFlag":"false",// Set to `false` to disable write-ahead logging (WAL).
"hbaseVersion":"094x",// The HBase version.
"rowkeyColumn":[// The columns that form the HBase rowkey.
{
"index":"0",// The index of the source data column.
"type":"string"// The data type for this part of the rowkey.
},
{
"index":"-1",
"type":"string",
"value":"_"
}
],
"nullMode":"skip",// Specifies how to handle null values from the source.
"column":[// The destination columns in the HBase table.
{
"name":"columnFamilyName1:columnName1",// The column name, in `family:qualifier` format.
"index":"0",// The index of the source data column.
"type":"string"// The data type of the column value.
},
{
"name":"columnFamilyName2:columnName2",
"index":"1",
"type":"string"
},
{
"name":"columnFamilyName3:columnName3",
"index":"2",
"type":"string"
}
],
"encoding":"utf-8",// The character encoding.
"table":"",// The name of the destination HBase table.
"hbaseConfig":{// Configuration for the HBase cluster connection, in JSON format.
"hbase.zookeeper.quorum":"hostname",
"hbase.rootdir":"hdfs://ip:port/database",
"hbase.cluster.distributed":"true"
}
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of allowed error records.
},
"speed":{
"throttle":true,// Enables (`true`) or disables (`false`) rate limiting. If `true`, the rate is defined by the `mbps` parameter.
"concurrent":1, // The number of concurrent write tasks.
"mbps":"12"// The maximum transfer rate in megabytes per second (MB/s).
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
HBase Writer script parameters
|
Parameter |
Description |
Required |
Default |
|
haveKerberos |
Specifies whether the HBase cluster requires Kerberos authentication. Set this parameter to Note
|
No |
|
|
hbaseConfig |
The JSON configuration for connecting to the HBase cluster. The Note
To connect to an ApsaraDB for HBase database, you must use its internal network endpoint. |
Yes |
None |
|
mode |
The mode for writing data to HBase. Only the |
Yes |
None |
|
table |
The name of the HBase table to write to. This parameter is case-sensitive. |
Yes |
None |
|
encoding |
The encoding format for converting STRING data to |
No |
|
|
column |
The configuration for the columns to which you are writing data:
|
Yes |
None |
|
rowkeyColumn |
The rowkey column in the HBase table to write to:
The format is as follows.
|
Yes |
None |
|
versionColumn |
Specifies the timestamp for the write operation. The value can be from a source column or a constant. If this parameter is not configured, the system's current time is used.
The following examples show the format.
|
No |
None |
|
nullMode |
Specifies how to handle null values in the source data:
|
No |
|
|
walFlag |
When an HBase client submits data, it first writes the operations to a write-ahead log (WAL) before writing to the MemStore. This process ensures data durability. To improve write performance, you can disable the WAL by setting this parameter to |
No |
|
|
writeBufferSize |
The size of the write buffer for the HBase client, in bytes. This parameter is used with
|
No |
8 MB |
|
fileSystemUsername |
To resolve Ranger permission issues during a synchronization task, convert the wizard-based task to script mode. Then, set the |
No |
None |
HBase20xsql Reader demo
{
"type":"job",
"version":"2.0",// Version number.
"steps":[
{
"stepType":"hbase20xsql",// Plugin name.
"parameter":{
"queryServerAddress": "http://127.0.0.1:8765", // Phoenix QueryServer endpoint.
"serialization": "PROTOBUF", // QueryServer serialization format.
"table": "TEST", // Table to read.
"column": ["ID", "NAME"], // Columns to read.
"splitKey": "ID" // Sharding key, which must be the primary key of the table.
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// Maximum allowed error records.
},
"speed":{
"throttle":true,// Toggles rate limiting. If true, the rate is limited by the mbps parameter.
"concurrent":1,// Number of concurrent jobs.
"mbps":"12"// Rate limit in MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
HBase20xsql Reader parameters
| Parameter | Description | Required | Default |
| queryServerAddress | The endpoint of the Phoenix QueryServer. The HBase20xsql Reader plugin uses a lightweight client to connect to this endpoint. To pass user credentials for ApsaraDB for HBase Performance-enhanced Edition (Lindorm), append the | Yes | None |
| serialization | The serialization protocol used by the Phoenix QueryServer. | No | PROTOBUF |
| table | The name of the table to read. The name is case-sensitive. | Yes | None |
| schema | The schema that contains the table. | No | None |
| column | The columns to synchronize. Use a JSON array to define the column names. If you do not specify this parameter or leave it empty, the reader reads all columns. | No | All columns |
| splitKey | The table is sharded when it is read. If you specify the splitKey parameter, the field that splitKey represents is used for data sharding. This allows data synchronization to start concurrent tasks and improves performance. You can choose between two sharding methods. If the splitPoints parameter is empty, the table is automatically sharded based on method one by default: | Yes | None |
| splitPoints | Automatic sharding based on the minimum and maximum values of the sharding key may not prevent data hot spots. For optimal performance, we recommend defining custom sharding points based on the startkey and endkey of your HBase Regions. This ensures that each concurrent task queries a single Region. | No | None |
| where | A filter condition to add to the table query. The HBase20xsql Reader constructs a SQL query based on the column, table, and where parameters to extract data. | No | None |
| querySql | For complex filtering scenarios where the where parameter is insufficient, you can provide a custom SQL query. If you configure this parameter, the reader ignores the column, table, where, and splitKey parameters. The queryServerAddress parameter is still required. | No | None |
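How the column, table, where, splitKey, and splitPoints parameters could combine into per-shard queries is sketched below. This is an illustration only and does not reproduce the plugin's actual SQL: `build_split_queries` is a hypothetical helper, the half-open range layout is an assumption, and the split points are numeric so that no quoting is needed.

```python
from typing import List, Optional, Sequence


def build_split_queries(
    table: str,
    columns: Sequence[str],
    split_key: str,
    split_points: Sequence[int],
    where: Optional[str] = None,
) -> List[str]:
    """Build one SELECT per shard, bounded by consecutive split points."""
    column_list = ", ".join(columns) if columns else "*"
    select = f"SELECT {column_list} FROM {table}"
    # None marks an open-ended lower or upper bound.
    bounds = [None, *split_points, None]
    queries = []
    for lower, upper in zip(bounds, bounds[1:]):
        conditions = []
        if where:
            conditions.append(f"({where})")
        if lower is not None:
            conditions.append(f"{split_key} >= {lower}")
        if upper is not None:
            conditions.append(f"{split_key} < {upper}")
        clause = " WHERE " + " AND ".join(conditions) if conditions else ""
        queries.append(select + clause)
    return queries


queries = build_split_queries(
    table="TEST",
    columns=["ID", "NAME"],
    split_key="ID",
    split_points=[100, 200],
    where="NAME IS NOT NULL",
)
for query in queries:
    print(query)
```

Two split points produce three shards. Aligning the points with Region boundaries, as the splitPoints description recommends, would let each concurrent task read from a single Region.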
HBase11xsql Writer example
{
"type": "job",
"version": "1.0",
"configuration": {
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle":true,// Enables rate limiting. If set to false, the mbps parameter is ignored.
"concurrent":1, // The number of concurrent jobs.
"mbps":"1"// Rate limit in MB/s.
}
},
"reader": {
"plugin": "odps",
"parameter": {
"datasource": "",
"table": "",
"column": [],
"partition": ""
}
},
"writer": {
"plugin": "hbase11xsql",
"parameter": {
"table": "The name of the destination HBase table. The name is case-sensitive.",
"hbaseConfig": {
"hbase.zookeeper.quorum": "The ZooKeeper endpoint of the destination HBase cluster.",
"zookeeper.znode.parent": "The znode of the destination HBase cluster."
},
"column": [
"columnName"
],
"batchSize": 256,
"nullMode": "skip"
}
}
}
}
HBase11xsql Writer parameters
| Parameter | Description | Required | Default |
| plugin | Specifies the name of the plugin. The value must be hbase11xsql. | Yes | None |
| table | Specifies the name of the destination table. This parameter is case-sensitive. Phoenix table names are typically uppercase. | Yes | None |
| column | Specifies the names of the columns. The names are case-sensitive. Phoenix column names are typically uppercase. | Yes | None |
| hbaseConfig | Specifies the HBase cluster endpoint. You must specify the ZooKeeper (ZK) endpoint in the format ip1,ip2,ip3. | Yes | None |
| batchSize | Specifies the maximum number of rows for a batch write. | No | 256 |
| nullMode | Specifies how to handle null values from the source data. | No | skip |