HBase data synchronization with DataWorks

The HBase data source lets you read from and write to HBase. This topic describes its data synchronization capabilities in DataWorks.

Supported versions

There are two types of HBase plugins: the HBase plugin and the HBase{xx}xsql plugin. The HBase{xx}xsql plugin requires both HBase and Phoenix.

HBase plugin:

Supports HBase0.94.x, HBase1.1.x, and HBase2.x. Both the codeless UI and code editor are supported. Use the hbaseVersion parameter to specify the version.
- If your HBase version is HBase0.94.x, set hbaseVersion to 094x for both the reader and writer.
```
"reader": {
        "hbaseVersion": "094x"
    }
```
```
"writer": {
        "hbaseVersion": "094x"
    }
```
- If your HBase version is HBase1.1.x or HBase2.x, set hbaseVersion to 11x for both the reader and writer.
```
"reader": {
        "hbaseVersion": "11x"
    }
```
```
"writer": {
        "hbaseVersion": "11x"
    }
```
  The HBase1.1.x plugin is compatible with HBase 2.0.
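
The version rule above can be expressed as a small lookup. The following is a minimal sketch, not part of DataWorks: the function name `hbase_version_param` and the use of a dotted release string such as `2.4.17` are assumptions made for illustration.

```python
def hbase_version_param(hbase_release: str) -> str:
    """Return the hbaseVersion value for an HBase release string.

    Hypothetical helper: it only encodes the rule stated in this topic
    (0.94.x -> 094x, 1.1.x and 2.x -> 11x).
    """
    if hbase_release.startswith("0.94."):
        return "094x"
    if hbase_release.startswith("1.1.") or hbase_release.startswith("2."):
        return "11x"
    # Releases that this topic does not list are rejected rather than guessed.
    raise ValueError(f"HBase {hbase_release} is not listed as supported")


print(hbase_version_param("0.94.27"))  # 094x
print(hbase_version_param("2.4.17"))   # 11x
```
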
HBase{xx}xsql plugin:
1. HBase20xsql plugin: Supports HBase2.x and Phoenix5.x. Only the code editor is supported.
  
  HBase11xsql plugin: Supports HBase1.1.x and Phoenix5.x. Only the code editor is supported.
2. The HBase{xx}xsql writer plugin provides a simple way to import data in batches to SQL tables (Phoenix) in HBase. Phoenix encodes the rowkey. Writing data by using the HBase API requires manual data conversion, a complex and error-prone process.
  
  Note
  The plugin uses the Phoenix JDBC driver to execute UPSERT statements, writing data to the table in batches. Its high-level interface also enables synchronous updates to index tables.
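
To make the batch UPSERT idea concrete, the following Python sketch builds a parameterized Phoenix UPSERT statement and splits rows into batches. It is an illustration only, not the plugin's implementation: the function names, the table name, and the column names are hypothetical, and the sketch does not open a JDBC connection.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple


def build_upsert(table: str, columns: Sequence[str]) -> str:
    """Build a parameterized Phoenix UPSERT statement."""
    # Double-quoted identifiers keep their exact casing in Phoenix.
    cols = ", ".join(f'"{c}"' for c in columns)
    marks = ", ".join("?" for _ in columns)
    return f'UPSERT INTO "{table}" ({cols}) VALUES ({marks})'


def batches(rows: Iterable[Tuple], batch_size: int) -> Iterator[List[Tuple]]:
    """Yield lists of at most batch_size rows, in input order."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


# Hypothetical table and columns, for illustration only.
sql = build_upsert("STUDENT", ["ID", "NAME", "AGE"])
print(sql)
for chunk in batches([("s001", "Tom", 20), ("s002", "Ann", 21), ("s003", "Bob", 19)], 2):
    print(len(chunk), "rows would be sent with one batch execution")
```

Each batch would be bound to the statement and executed together, which is how a batch size setting bounds the amount of data sent per round trip.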

Limitations

HBase Reader:
- HBase Reader cannot read data written by Phoenix because Phoenix applies special processing to the data.
- Only serverless resource groups for Data Integration (recommended) and exclusive resource groups for Data Integration are supported. The default resource group and custom resource groups are not supported.

HBase20xsql Reader:
- Table sharding is restricted to a single column, which must be the primary key.
- For even sharding based on job concurrency, the sharding column must be an integer or a string.
- Table names, schema names, and column names are case-sensitive and must match the casing in the Phoenix table.
- HBase20xsql Reader can read data only through Phoenix QueryServer. Therefore, you must start Phoenix QueryServer before you use HBase20xsql Reader.

HBase11xsql Writer:
- Only serverless resource groups for Data Integration (recommended) are supported.
- Importing timestamped data is not supported.
- Only tables created with Phoenix are supported, not native HBase tables.
- The column order in the writer must match the column order in the reader. The reader's column order determines the sequence of columns in each output row, and the writer's column order determines the expected sequence of columns in the received data. For example, if the column order in the reader is c1, c2, c3, and c4 and the column order in the writer is x1, x2, x3, and x4, the value of column c1 is assigned to column x1. If the column order in the writer is x1, x2, x4, and x3, the value of column c3 is assigned to column x4, and the value of column c4 is assigned to column x3.
- Importing data into an indexed table synchronously updates all related indexes.
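
The column-order rule for HBase11xsql Writer is a purely positional assignment. The following minimal Python sketch reproduces the c1..c4 and x1..x4 example; the function name `map_by_position` and the sample values `v1`..`v4` are hypothetical.

```python
from typing import Dict, Sequence


def map_by_position(reader_columns: Sequence[str],
                    writer_columns: Sequence[str],
                    row: Sequence[str]) -> Dict[str, str]:
    """Assign the i-th value of a reader row to the i-th writer column."""
    if not (len(reader_columns) == len(writer_columns) == len(row)):
        raise ValueError("reader columns, writer columns, and row must have the same length")
    return dict(zip(writer_columns, row))


reader = ["c1", "c2", "c3", "c4"]
row = ["v1", "v2", "v3", "v4"]  # values of c1..c4 in reader order

print(map_by_position(reader, ["x1", "x2", "x3", "x4"], row))
# {'x1': 'v1', 'x2': 'v2', 'x3': 'v3', 'x4': 'v4'}
print(map_by_position(reader, ["x1", "x2", "x4", "x3"], row))
# {'x1': 'v1', 'x2': 'v2', 'x4': 'v3', 'x3': 'v4'}
```

In the second call the value of c3 lands in x4 and the value of c4 lands in x3, because only the position in the list matters, not the column name.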

Features

HBase Reader

HBase Reader supports the normal and multiVersionFixedColumn modes. See HBase field mapping guide for configuration instructions.

In normal mode, HBase Reader treats an HBase table as a standard two-dimensional table (wide table) and reads the latest version of the data.

hbase:007:0> scan 'student'
ROW                                   COLUMN+CELL
s001                                 column=basic:age, timestamp=2026-03-09T14:41:40.240, value=20
s001                                 column=basic:name, timestamp=2026-03-09T14:41:40.214, value=Tom
s001                                 column=score:english, timestamp=2026-03-09T14:41:40.333, value=90
s001                                 column=score:math, timestamp=2026-03-09T14:41:40.277, value=85
1 row(s) in 0.0580 seconds

The following table shows the output data.

Row key	basic:age	basic:name	score:english	score:math
s001	20	Tom	90	85
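
The normal-mode view can be sketched in Python as "keep the latest version of each cell, then pivot to one row per rowkey". This is an illustration of the output shape only, not the reader's implementation; the function name `to_wide_rows` and the tuple layout are assumptions.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (rowkey, family:qualifier, timestamp, value); hypothetical in-memory layout.
Cell = Tuple[str, str, str, str]


def to_wide_rows(cells: Sequence[Cell], columns: Sequence[str]) -> List[List[Optional[str]]]:
    """Return one row per rowkey with the latest value of each requested column."""
    latest: Dict[Tuple[str, str], Tuple[str, str]] = {}
    for rowkey, column, timestamp, value in cells:
        key = (rowkey, column)
        # ISO-8601 timestamps in the same format compare correctly as strings.
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, value)
    rows: Dict[str, Dict[str, str]] = {}
    for (rowkey, column), (_, value) in latest.items():
        rows.setdefault(rowkey, {})[column] = value
    return [[rowkey] + [values.get(c) for c in columns]
            for rowkey, values in sorted(rows.items())]


cells = [
    ("s001", "basic:age", "2026-03-09T14:41:40.240", "20"),
    ("s001", "basic:name", "2026-03-09T14:41:40.214", "Tom"),
    ("s001", "score:english", "2026-03-09T14:41:40.333", "90"),
    ("s001", "score:math", "2026-03-09T14:41:40.277", "85"),
]
print(to_wide_rows(cells, ["basic:age", "basic:name", "score:english", "score:math"]))
# [['s001', '20', 'Tom', '90', '85']]
```
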

multiVersionFixedColumn mode: Reads an HBase table as a narrow table. Each returned record consists of four columns: rowKey, family:qualifier, timestamp, and value. You must explicitly specify the columns to read. The value of each cell is treated as a record. If multiple versions exist, multiple records are returned.

hbase:007:0> scan 'student',{VERSIONS=>5}
ROW                                   COLUMN+CELL
s001                                 column=basic:age, timestamp=2026-03-09T14:41:40.240, value=20
s001                                 column=basic:age, timestamp=2026-03-09T14:30:00.100, value=19
s001                                 column=basic:name, timestamp=2026-03-09T14:41:40.214, value=Tom
s001                                 column=score:english, timestamp=2026-03-09T14:41:40.333, value=90
s001                                 column=score:math, timestamp=2026-03-09T14:41:40.277, value=85
1 row(s) in 0.0260 seconds

Row key	family:qualifier	Timestamp	Value
s001	basic:age	2026-03-09T14:41:40.240	20
s001	basic:age	2026-03-09T14:30:00.100	19
s001	basic:name	2026-03-09T14:41:40.214	Tom
s001	score:english	2026-03-09T14:41:40.333	90
s001	score:math	2026-03-09T14:41:40.277	85
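
The narrow-table output of multiVersionFixedColumn mode can be sketched as "one record per cell version". The following Python example is illustrative only; the function name `to_narrow_records` and its `max_version` argument are assumptions modeled on the maxVersion parameter, where -1 means that all versions are read.

```python
from typing import Dict, List, Sequence, Tuple

# (rowkey, family:qualifier, timestamp, value); hypothetical in-memory layout.
Cell = Tuple[str, str, str, str]


def to_narrow_records(cells: Sequence[Cell], columns: Sequence[str],
                      max_version: int = -1) -> List[Cell]:
    """Return one four-column record per cell version, newest version first."""
    wanted = set(columns)  # columns must be listed explicitly in this mode
    versions: Dict[Tuple[str, str], List[Tuple[str, str]]] = {}
    for rowkey, column, timestamp, value in cells:
        if column in wanted:
            versions.setdefault((rowkey, column), []).append((timestamp, value))
    records: List[Cell] = []
    for rowkey, column in sorted(versions):
        ordered = sorted(versions[(rowkey, column)], reverse=True)
        if max_version != -1:
            ordered = ordered[:max_version]  # keep only the newest N versions
        records.extend((rowkey, column, ts, value) for ts, value in ordered)
    return records


cells = [
    ("s001", "basic:age", "2026-03-09T14:41:40.240", "20"),
    ("s001", "basic:age", "2026-03-09T14:30:00.100", "19"),
    ("s001", "basic:name", "2026-03-09T14:41:40.214", "Tom"),
]
for record in to_narrow_records(cells, ["basic:age", "basic:name"]):
    print(record)
```

With `max_version=1` the same input yields a single record per column, which mirrors reading only the latest version.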

HBase Writer

Rowkey generation rule: HBase Writer can concatenate multiple source fields to form the rowkey of an HBase table.
Version (timestamp): You can specify the version that is used when data is written to HBase. The available options are:
- Use the current time as the version.
- Specify a source column as the version.
- Specify a time as the version.
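
The rowkey and version rules can be sketched as follows. This is a conceptual illustration, not the writer's configuration syntax: the function names, the `_` separator, and the mode labels `current`, `column`, and `fixed` are all hypothetical.

```python
import time
from typing import Optional, Sequence


def build_rowkey(record: Sequence[object], indexes: Sequence[int], separator: str = "_") -> str:
    """Concatenate the selected source fields into one rowkey string."""
    return separator.join(str(record[i]) for i in indexes)


def pick_version(record: Sequence[object], mode: str,
                 column_index: Optional[int] = None,
                 fixed_ms: Optional[int] = None) -> int:
    """Return the cell version (timestamp in milliseconds) for a record."""
    if mode == "current":   # use the current time as the version
        return int(time.time() * 1000)
    if mode == "column":    # use a source column as the version
        if column_index is None:
            raise ValueError("column_index is required for mode 'column'")
        return int(record[column_index])
    if mode == "fixed":     # use a specified time as the version
        if fixed_ms is None:
            raise ValueError("fixed_ms is required for mode 'fixed'")
        return int(fixed_ms)
    raise ValueError(f"unknown mode: {mode}")


record = ["s001", "Tom", 20, 1700000000000]
print(build_rowkey(record, [0, 1]))                    # s001_Tom
print(pick_version(record, "column", column_index=3))  # 1700000000000
```
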

Supported data types

Batch read

This table shows the supported HBase data types and how HBase Reader converts them.

Category	Data Integration column type	Database data type
Integer	long	short, int, and long
Floating-point	double	float and double
String	string	binary_string and string
Date and time	date	date
Byte	bytes	bytes
Boolean	boolean	boolean

HBase20xsql Reader supports most, but not all, Phoenix data types. Ensure your data types are supported.

This table shows how HBase20xsql Reader maps Phoenix data types to DataX internal types.

DataX internal type	Phoenix data type
long	INTEGER, TINYINT, SMALLINT, BIGINT
double	FLOAT, DECIMAL, DOUBLE
string	CHAR, VARCHAR
date	DATE, TIME, TIMESTAMP
bytes	BINARY, VARBINARY
boolean	BOOLEAN
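
The mapping table above can be kept as a lookup to check a Phoenix schema before you configure a task. The following is a minimal sketch: the function name `datax_type` is hypothetical, and treating any type that is not in the table as unsupported is an assumption based on the statement that not all Phoenix data types are supported.

```python
PHOENIX_TO_DATAX = {
    "INTEGER": "long", "TINYINT": "long", "SMALLINT": "long", "BIGINT": "long",
    "FLOAT": "double", "DECIMAL": "double", "DOUBLE": "double",
    "CHAR": "string", "VARCHAR": "string",
    "DATE": "date", "TIME": "date", "TIMESTAMP": "date",
    "BINARY": "bytes", "VARBINARY": "bytes",
    "BOOLEAN": "boolean",
}


def datax_type(phoenix_type: str) -> str:
    """Map a Phoenix type such as 'VARCHAR(20)' to its DataX internal type."""
    # Drop any length or precision suffix, for example "(10, 2)".
    base = phoenix_type.strip().upper().split("(")[0].strip()
    if base not in PHOENIX_TO_DATAX:
        raise ValueError(f"Phoenix type {phoenix_type!r} is not in the mapping table")
    return PHOENIX_TO_DATAX[base]


print(datax_type("varchar(20)"))     # string
print(datax_type("DECIMAL(10, 2)"))  # double
```
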

Batch write

This table lists the data types that HBase Writer supports.

Note

Each column's configured data type must match the corresponding data type in the HBase table.
Only the data types listed in the table are supported.

Category	Database data type
Integer	INT, LONG, and SHORT
Floating-point	FLOAT and DOUBLE
Boolean	BOOLEAN
String	STRING

Precautions

If you encounter the "tried to access method com.google.common.base.Stopwatch" error when testing connectivity, add the hbaseVersion property to the data source configuration to specify the HBase version.

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. When you add the data source, refer to the parameter descriptions in the DataWorks console to understand what each parameter means.

Develop a data synchronization task

For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.

Configure single-table batch synchronization

For details, see Codeless UI configuration and script mode configuration.

Because HBase is a schemaless data source, the codeless UI does not display field mappings by default. You must configure them manually.

When using HBase as a data source, you must first select an Output Mode: normal mode or multiVersionFixedColumn mode.

The field mapping configuration differs for each mode:
- normal mode: This is the default mode. This mode reads an HBase table as a standard two-dimensional table and retrieves the latest version of the data. When using HBase as a data source, you must configure the mapping between the Source Field and Target Field. The source and destination fields have a one-to-one mapping. Because the source table has no fixed fields, fields are mapped by their order by default. To change the mapping, you must manually edit the field order.
  
  New version
  
  In the field mapping configuration, the mappings are: rowkey → rowkey, basic:age → age, basic:name → name, score:english → english, and score:math → math. The source fields are displayed in JSON format and include the name and type (both are string) properties.
  
  Legacy version
  
  In the field mapping configuration, source fields are displayed in the Type|ColumnFamily:ColumnName format, including string|rowkey, string|basic:age, string|basic:name, string|score:english, and string|score:math. These map to the target fields rowkey, age, name, english, and math, respectively.
  
  The target table contains fields such as rowkey, age, name, english, math, and pt. For example: rowkey=s001, age=20, name=Tom, english=90, math=85, pt=222222.
- multiVersionFixedColumn mode: Each output record consists of four columns (rowKey, family:qualifier, timestamp, and value), and this mode allows you to read multiple data versions. The Source Field is configured in the ColumnFamily:Qualifier format, such as basic:age. The destination table has four fixed columns: row_key, cf, timestamp_col, and value. No mapping configuration is required.
  
  New version
  
  In the field mapping area, map source fields to target fields. The source fields are in JSON format. Example mappings: {"name":"rowkey","type":"string"} maps to the target field rowkey, {"name":"basic:age","type":"string"} maps to family, {"name":"basic:name","type":"string"} maps to timestamp, and {"name":"score:english","type":"string"} maps to value. The system does not synchronize the unmapped source field {"name":"score:math","type":"string"}. You can click the Edit button on either side to edit the source and target fields respectively.
  
  Legacy version
  
  In the field mapping area, the source field includes string|rowkey, string|basic:age, string|basic:name, string|score:english, and string|score:math. The target field includes rowkey, family, timestamp, and value. The source and target fields are mapped row-by-row.
  
  The target table contains four fixed columns: row_key, cf, timestamp_col, and value. For example: row_key=s001, cf=basic:age, timestamp_col=1234567890, value=20.
- When using HBase as the data destination (only normal mode is supported), you must configure the Target Field and rowkey. You can form the rowkey field by concatenating multiple source fields.
For parameters and script examples for script mode, see Appendix: Sample script and parameters.

FAQ

Q: What is an appropriate concurrency setting? Does increasing concurrency help when the import speed is slow?

A: The default heap size for the Java Virtual Machine (JVM) in a data import process is 2 GB. Concurrency is implemented through multi-threading and is configured by the number of channels. Too many threads do not necessarily improve import speed and can degrade performance because of frequent garbage collection (GC). We recommend using 5–10 concurrent threads (channels).

Q: What is an optimal value for batchSize?

A: The default value is 256. Calculate the optimal batchSize based on the row size. A single batch should typically contain 2–4 MB of data. Divide this data volume by the row size to determine the recommended batchSize.
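
The batchSize rule is a single division. The following minimal sketch assumes that you know the average row size in bytes; the function name `recommended_batch_size` is hypothetical, and the 2 MB default is the lower end of the 2–4 MB range given above.

```python
def recommended_batch_size(avg_row_bytes: int,
                           target_batch_bytes: int = 2 * 1024 * 1024) -> int:
    """Divide the target batch volume by the average row size."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    # Never return less than one row per batch.
    return max(1, target_batch_bytes // avg_row_bytes)


print(recommended_batch_size(1024))                   # 2048 rows for a 2 MB batch
print(recommended_batch_size(1024, 4 * 1024 * 1024))  # 4096 rows for a 4 MB batch
```
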

Q: When reading data from HBase in multiVersionFixedColumn mode, I receive a java.lang.StringIndexOutOfBoundsException: String index out of range: -1 error. How can I resolve this?

A: This error typically occurs because the name field in the column configuration does not follow the columnFamily:qualifier format. For example, you might have specified only the qualifier, such as age, instead of basic:age. To resolve this, ensure that the name of every column except rowkey is formatted as columnFamily:qualifier.
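
You can check a column list for this problem before you run the task. The following Python sketch is a hypothetical pre-flight check, not part of DataWorks; the function name `invalid_column_names` is an assumption. The -1 in the error message is consistent with a failed search for the `:` separator, although the plugin's internals are not shown in this topic.

```python
from typing import Dict, List, Sequence


def invalid_column_names(columns: Sequence[Dict[str, str]]) -> List[str]:
    """Return the names that are neither rowkey nor columnFamily:qualifier."""
    bad = []
    for column in columns:
        name = column.get("name", "")
        if name == "rowkey":
            continue
        family, separator, qualifier = name.partition(":")
        # Both parts must be present on either side of the colon.
        if not (separator and family and qualifier):
            bad.append(name)
    return bad


columns = [
    {"name": "rowkey", "type": "string"},
    {"name": "age", "type": "string"},         # missing column family
    {"name": "basic:name", "type": "string"},
]
print(invalid_column_names(columns))  # ['age']
```
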

Appendix: Sample script and parameters

Configure a batch synchronization task by using the code editor

To configure a batch synchronization task in the code editor, set the parameters in the script according to the unified script format requirements. For more information, see Script mode configuration. The following information describes the data source parameters that you must configure in the script.

HBase Reader example

{
    "type":"job",
    "version":"2.0",// The version number.
    "steps":[
        {
            "stepType":"hbase",// The plugin name.
            "parameter":{
                "mode":"normal",// Specifies the mode for reading data from HBase. Valid values: `normal` and `multiVersionFixedColumn`.
                "scanCacheSize":"256",// Specifies the number of rows to fetch from the server per RPC.
                "scanBatchSize":"100",// Specifies the number of columns to fetch from the server per RPC. 
                "hbaseVersion":"11x",// The HBase version. Valid values: 094x and 11x.
                "column":[// The fields to read.
                    {
                        "name":"rowkey",// The field name.
                        "type":"string"// The data type.
                    },
                    {
                        "name":"basic:age",
                        "type":"string"
                    },
                    {
                        "name":"basic:name",
                        "type":"string"
                    },
                    {
                        "name":"score:english",
                        "type":"string"
                    },
                    {
                        "name":"score:math",
                        "type":"string"
                    }
                ],
                "range":{// Specifies the rowkey range to read.
                    "endRowkey":"",// The end rowkey.
                    "isBinaryRowkey":true,// Specifies whether to use binary conversion for startRowkey and endRowkey. The default value is false.
                    "startRowkey":""// The start rowkey.
                },
                "maxVersion":"",// Specifies the number of versions to read in multi-version mode.
                "encoding":"UTF-8",// The encoding format.
                "table":"student",// The table name.
                "hbaseConfig":{// The connection configuration for the HBase cluster, in JSON format.
                    "hbase.zookeeper.quorum":"hostname",
                    "hbase.rootdir":"hdfs://ip:port/database",
                    "hbase.cluster.distributed":"true"
                }
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"odps",// The plugin name for the destination. This example uses MaxCompute. You can replace it with another Writer plugin.
            "parameter":{
                "partition":"",// Partition information for the destination table. Not required for non-partitioned tables.
                "truncate":true,// Specifies whether to clear the destination table or partition before writing data.
                "datasource":"odps_datasource",// The MaxCompute data source name.
                "column":[// The destination fields.
                    "rowkey",
                    "basic_age",
                    "basic_name",
                    "score_english",
                    "score_math"
                ],
                "table":"student_target"// The name of the destination MaxCompute table.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of error records to allow before the job fails.
        },
        "speed":{
            "throttle":true,// Specifies whether to enable rate limiting. If set to false, the mbps parameter is ignored.
            "concurrent":1,// The number of concurrent jobs.
            "mbps":"12"// The rate limit in megabytes per second (MB/s).
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
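
The `//` comments in the sample script are explanatory, and a strict JSON parser such as Python's `json` module rejects them. If you want to inspect or validate a local copy of the script, the following minimal sketch removes line comments while leaving `//` inside string values (such as `hdfs://ip:port/database`) untouched. The function name `strip_line_comments` is hypothetical, and the sketch handles only `//` comments, not `/* */` blocks.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


script = """{
    "hbaseVersion":"11x",// The HBase version.
    "hbaseConfig":{
        "hbase.rootdir":"hdfs://ip:port/database"
    }
}"""
config = json.loads(strip_line_comments(script))
print(config["hbaseConfig"]["hbase.rootdir"])  # hdfs://ip:port/database
```
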

HBase Reader script (multiVersionFixedColumn mode)

The following example shows a complete script for reading data from HBase in multiVersionFixedColumn mode and writing it to MaxCompute. In this mode, the value of each cell in HBase is converted into a separate record. Each record consists of four columns: rowkey, family:qualifier, timestamp, and value.

{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"hbase",// Plugin name.
            "parameter":{
                "mode":"multiVersionFixedColumn",// The mode for reading data from HBase. This example uses multiVersionFixedColumn mode.
                "scanCacheSize":"256",// The number of rows that the HBase client reads from the server in each remote procedure call (RPC).
                "scanBatchSize":"100",// The number of columns that the HBase client reads from the server in each RPC.
                "hbaseVersion":"20x",// HBase version.
                "datasource":"hbase_datasource",// HBase data source name.
                "column":[// The columns to read. The first column must be rowkey. The names of other columns must be in the "column family:qualifier" format.
                    {
                        "name":"rowkey",// The rowkey column.
                        "type":"string"
                    },
                    {
                        "name":"basic:age",// The age column in the basic column family.
                        "type":"string"
                    },
                    {
                        "name":"basic:name",// The name column in the basic column family.
                        "type":"string"
                    },
                    {
                        "name":"score:english",// The english column in the score column family.
                        "type":"string"
                    },
                    {
                        "name":"score:math",// The math column in the score column family.
                        "type":"string"
                    }
                ],
                "range":{
                    "isBinaryRowkey":false
                },
                "maxVersion":"-1",// The maximum number of versions to read. This parameter is required in multiVersionFixedColumn mode. A value of -1 specifies that all versions are read.
                "encoding":"UTF-8",// Encoding format.
                "table":"student"// HBase table name.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"odps",// The name of the destination plugin. This example uses MaxCompute.
            "parameter":{
                "partition":"",// The partition of the destination table. This parameter is not required for non-partitioned tables.
                "truncate":true,// If set to true, this clears the destination table or partition before writing data.
                "datasource":"odps_datasource",// MaxCompute data source name.
                "column":[// The destination has four fixed columns that correspond to the rowkey, family:qualifier, timestamp, and value from the source, respectively.
                    "row_key",
                    "cf",
                    "timestamp_col",
                    "value"
                ],
                "table":"hbase_multiversion_target"// The name of the destination MaxCompute table.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of error records allowed.
        },
        "speed":{
            "throttle":false,// No rate limiting.
            "concurrent":2// Job concurrency.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}