HBase data source

The HBase data source lets you read from and write to HBase. This topic describes its data synchronization capabilities in DataWorks.

Supported versions

There are two types of HBase plugins: the HBase plugin and the HBase{xx}xsql plugin. The HBase{xx}xsql plugin requires both HBase and Phoenix.

  1. HBase plugin:

    Supports HBase0.94.x, HBase1.1.x, and HBase2.x. Both the codeless UI and code editor are supported. Use the hbaseVersion parameter to specify the version.

    • If your HBase version is HBase0.94.x, set hbaseVersion to 094x for both the reader and writer.

      "reader": {
          "hbaseVersion": "094x"
      },
      "writer": {
          "hbaseVersion": "094x"
      }
    • If your HBase version is HBase1.1.x or HBase2.x, set hbaseVersion to 11x for both the reader and writer.

      "reader": {
          "hbaseVersion": "11x"
      },
      "writer": {
          "hbaseVersion": "11x"
      }
      The HBase1.1.x plugin is compatible with HBase 2.0.
  2. HBase{xx}xsql plugin

    • HBase20xsql plugin: Supports HBase2.x and Phoenix5.x. Only the code editor is supported.

    • HBase11xsql plugin: Supports HBase1.1.x and Phoenix5.x. Only the code editor is supported.

    The HBase{xx}xsql writer plugin provides a simple way to import data in batches into SQL tables (Phoenix tables) in HBase. Because Phoenix encodes the rowkey, writing the same data directly through the HBase API requires manual data conversion, which is complex and error-prone.

      Note

      The plugin uses the Phoenix JDBC driver to execute UPSERT statements, writing data to the table in batches. Its high-level interface also enables synchronous updates to index tables.
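The hbaseVersion rules for the HBase plugin described above can be sketched as a small lookup. This is a minimal illustration, not part of DataWorks: the function names `hbase_version_setting` and `reader_writer_fragment` are hypothetical, and the version strings are assumed to look like `0.94.27` or `2.4.9`.

```python
def hbase_version_setting(hbase_version: str) -> str:
    """Return the hbaseVersion value for an HBase release string.

    Hypothetical helper that mirrors the rules in this topic:
    HBase0.94.x -> "094x"; HBase1.1.x and HBase2.x -> "11x".
    """
    if hbase_version.startswith("0.94."):
        return "094x"
    if hbase_version.startswith("1.1.") or hbase_version.startswith("2."):
        return "11x"
    raise ValueError(f"Unsupported HBase version: {hbase_version}")


def reader_writer_fragment(hbase_version: str) -> dict:
    """Build reader and writer fragments that carry the same hbaseVersion."""
    setting = hbase_version_setting(hbase_version)
    return {
        "reader": {"hbaseVersion": setting},
        "writer": {"hbaseVersion": setting},
    }


print(reader_writer_fragment("2.4.9"))
```

Versions outside the three supported families raise an error in this sketch instead of guessing a value.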

Limitations

HBase Reader

HBase20xsql Reader

HBase11xsql Writer

  • Table sharding is restricted to a single column, which must be the primary key.

  • For even sharding based on job concurrency, the sharding column must be an integer or a string.

  • Table names, schema names, and column names are case-sensitive and must match the casing in the Phoenix table.

  • HBase20xsql Reader can read data only through Phoenix QueryServer. Therefore, you must start Phoenix QueryServer to use the HBase20xsql Reader.

  • Only serverless resource group for Data Integration (recommended) is supported.

  • Importing timestamped data is not supported.

  • Only tables created with Phoenix are supported, not native HBase tables.

  • The column order in the writer must match the column order in the reader. The reader's column order determines the sequence of columns in each output row. The writer's column order determines the expected sequence of columns in the received data. For example:

    • The column order in the reader is c1, c2, c3, and c4.

    • The column order in the writer is x1, x2, x3, and x4.

    In this case, the value of column c1 is assigned to column x1. If the column order in the writer is x1, x2, x4, and x3, the value of column c3 is assigned to column x4, and the value of column c4 is assigned to column x3.

  • Importing data into an indexed table synchronously updates all related indexes.
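The column-order rule above amounts to positional assignment. The following sketch only illustrates that rule with the c1..c4 and x1..x4 names from the example; `map_by_position` is a hypothetical helper, and the values `v1`..`v4` are placeholders.

```python
def map_by_position(reader_columns, writer_columns, row):
    """Assign each value in a reader row to the writer column at the same position."""
    if len(reader_columns) != len(writer_columns) or len(row) != len(reader_columns):
        raise ValueError("reader columns, writer columns, and row must have the same length")
    return dict(zip(writer_columns, row))


reader_columns = ["c1", "c2", "c3", "c4"]
row = ["v1", "v2", "v3", "v4"]  # values of c1..c4, in reader order

# Writer order x1, x2, x3, x4: c1 -> x1, c2 -> x2, c3 -> x3, c4 -> x4.
print(map_by_position(reader_columns, ["x1", "x2", "x3", "x4"], row))

# Writer order x1, x2, x4, x3: the value of c3 lands in x4 and the value of c4 in x3.
print(map_by_position(reader_columns, ["x1", "x2", "x4", "x3"], row))
```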

Features

HBase Reader

HBase Reader supports the normal and multiVersionFixedColumn modes. See HBase field mapping guide for configuration instructions.

  • In normal mode, HBase Reader treats an HBase table as a standard two-dimensional table (wide table) and reads the latest version of the data.

    hbase:007:0> scan 'student'
    ROW                                   COLUMN+CELL
    s001                                 column=basic:age, timestamp=2026-03-09T14:41:40.240, value=20
    s001                                 column=basic:name, timestamp=2026-03-09T14:41:40.214, value=Tom
    s001                                 column=score:english, timestamp=2026-03-09T14:41:40.333, value=90
    s001                                 column=score:math, timestamp=2026-03-09T14:41:40.277, value=85
    1 row(s) in 0.0580 seconds 

    The following table shows the output data.

    | Row key | basic:age | basic:name | score:english | score:math |
    | --- | --- | --- | --- | --- |
    | s001 | 20 | Tom | 90 | 85 |

  • multiVersionFixedColumn mode: Reads an HBase table as a narrow table. Each returned record consists of four columns: rowKey, family:qualifier, timestamp, and value. You must explicitly specify the columns to read. The value of each cell is treated as a record. If multiple versions exist, multiple records are returned.

    hbase:007:0> scan 'student',{VERSIONS=>5}
    ROW                                   COLUMN+CELL
    s001                                 column=basic:age, timestamp=2026-03-09T14:41:40.240, value=20
    s001                                 column=basic:age, timestamp=2026-03-09T14:30:00.100, value=19
    s001                                 column=basic:name, timestamp=2026-03-09T14:41:40.214, value=Tom
    s001                                 column=score:english, timestamp=2026-03-09T14:41:40.333, value=90
    s001                                 column=score:math, timestamp=2026-03-09T14:41:40.277, value=85
    1 row(s) in 0.0260 seconds

    | Row key | family:qualifier | Timestamp | Value |
    | --- | --- | --- | --- |
    | s001 | basic:age | 2026-03-09T14:41:40.240 | 20 |
    | s001 | basic:age | 2026-03-09T14:30:00.100 | 19 |
    | s001 | basic:name | 2026-03-09T14:41:40.214 | Tom |
    | s001 | score:english | 2026-03-09T14:41:40.333 | 90 |
    | s001 | score:math | 2026-03-09T14:41:40.277 | 85 |
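The difference between the two modes can be sketched by applying both views to the same cells from the student table. This is an illustration of the output shapes only, not the reader's implementation: `read_normal` and `read_multi_version` are hypothetical names, and the sketch ignores maxVersion.

```python
# Cells as (rowkey, "family:qualifier", timestamp, value), copied from the scan output.
cells = [
    ("s001", "basic:age", "2026-03-09T14:41:40.240", "20"),
    ("s001", "basic:age", "2026-03-09T14:30:00.100", "19"),
    ("s001", "basic:name", "2026-03-09T14:41:40.214", "Tom"),
    ("s001", "score:english", "2026-03-09T14:41:40.333", "90"),
    ("s001", "score:math", "2026-03-09T14:41:40.277", "85"),
]
columns = ["basic:age", "basic:name", "score:english", "score:math"]


def read_normal(cells, columns):
    """Wide view: one record per rowkey, keeping the latest version of each column."""
    latest = {}
    for rowkey, column, timestamp, value in cells:
        key = (rowkey, column)
        # ISO-formatted timestamps of equal length compare correctly as strings.
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, value)
    rowkeys = sorted({cell[0] for cell in cells})
    return [
        [rk] + [latest.get((rk, col), (None, None))[1] for col in columns]
        for rk in rowkeys
    ]


def read_multi_version(cells, columns):
    """Narrow view: one four-column record per cell version of the requested columns."""
    return [list(cell) for cell in cells if cell[1] in columns]


print(read_normal(cells, columns))
for record in read_multi_version(cells, columns):
    print(record)
```

In the wide view the older basic:age value (19) is dropped; in the narrow view it appears as its own record.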

HBase Writer

  • Rowkey generation: HBase Writer can concatenate multiple source fields to form the rowkey of an HBase table.

  • You can specify a version (timestamp) for writing data to HBase. The available options are:

    • Use the current time as the version.

    • Specify a source column as the version.

    • Specify a time as the version.
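Rowkey concatenation can be sketched as follows. The entry shape (index, type, value) follows the rowkeyColumn parameter described in the appendix of this topic; `build_rowkey` is a hypothetical helper, and the sketch joins strings only, without the conversion to byte[] that the writer performs.

```python
def build_rowkey(record, rowkey_column):
    """Concatenate source fields and constants into a rowkey string.

    Each entry mirrors a rowkeyColumn item: index >= 0 picks a source field,
    and index == -1 uses the constant given in "value".
    """
    if all(int(part["index"]) == -1 for part in rowkey_column):
        raise ValueError("the rowkey cannot be composed entirely of constants")
    pieces = []
    for part in rowkey_column:
        index = int(part["index"])
        pieces.append(part["value"] if index == -1 else str(record[index]))
    return "".join(pieces)


rowkey_column = [
    {"index": 0, "type": "string"},
    {"index": -1, "type": "string", "value": "_"},
    {"index": 1, "type": "string"},
]
print(build_rowkey(["s001", "Tom", 20], rowkey_column))  # s001_Tom
```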

Supported data types

Batch read

  • This table shows the supported HBase data types and how HBase Reader converts them.

    | Category | Data Integration column type | Database data type |
    | --- | --- | --- |
    | Integer | long | short, int, and long |
    | Floating-point | double | float and double |
    | String | string | binary_string and string |
    | Date and time | date | date |
    | Byte | bytes | bytes |
    | Boolean | boolean | boolean |

  • HBase20xsql Reader supports most, but not all, Phoenix data types. Ensure your data types are supported.

  • This table shows how HBase20xsql Reader maps Phoenix data types to DataX internal types.

    | DataX internal type | Phoenix data type |
    | --- | --- |
    | long | INTEGER, TINYINT, SMALLINT, BIGINT |
    | double | FLOAT, DECIMAL, DOUBLE |
    | string | CHAR, VARCHAR |
    | date | DATE, TIME, TIMESTAMP |
    | bytes | BINARY, VARBINARY |
    | boolean | BOOLEAN |
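Because HBase20xsql Reader does not support every Phoenix data type, it can help to check a table's column types against the mapping above before configuring a task. The following is a minimal sketch: the dictionary copies the mapping table, `unsupported_columns` is a hypothetical helper, and type names are assumed to be given without length or precision qualifiers.

```python
# Phoenix data type -> DataX internal type, copied from the mapping table above.
PHOENIX_TO_DATAX = {
    "INTEGER": "long", "TINYINT": "long", "SMALLINT": "long", "BIGINT": "long",
    "FLOAT": "double", "DECIMAL": "double", "DOUBLE": "double",
    "CHAR": "string", "VARCHAR": "string",
    "DATE": "date", "TIME": "date", "TIMESTAMP": "date",
    "BINARY": "bytes", "VARBINARY": "bytes",
    "BOOLEAN": "boolean",
}


def unsupported_columns(schema):
    """Return the (column, type) pairs whose Phoenix type is not in the mapping."""
    return [
        (name, phoenix_type)
        for name, phoenix_type in schema.items()
        if phoenix_type.upper() not in PHOENIX_TO_DATAX
    ]


# Hypothetical table schema used only for illustration.
schema = {"ID": "BIGINT", "NAME": "VARCHAR", "TAGS": "VARCHAR ARRAY"}
print(unsupported_columns(schema))
```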

Batch write

This table lists the data types that HBase Writer supports.

Note
  • Each column's configured data type must match the corresponding data type in the HBase table.

  • Only the data types listed in the table are supported.

| Category | Database data type |
| --- | --- |
| Integer | INT, LONG, and SHORT |
| Floating-point | FLOAT and DOUBLE |
| Boolean | BOOLEAN |
| String | STRING |

Precautions

If you encounter the "tried to access method com.google.common.base.Stopwatch" error when testing connectivity, add the hbaseVersion property to the data source configuration to specify the HBase version.

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.

Develop a data synchronization task

For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.

Configure single-table batch synchronization

  • For details, see Codeless UI configuration and script mode configuration.

    Because HBase is a schemaless data source, the Codeless UI does not display field mappings by default. You must configure them manually.

    When using HBase as a data source, you must first select an Output Mode: normal mode or multiVersionFixedColumn mode.

    The field mapping configuration differs for each mode:

    • normal mode: This is the default mode. This mode reads an HBase table as a standard two-dimensional table and retrieves the latest version of the data. When using HBase as a data source, you must configure the mapping between the Source Field and Target Field. The source and destination fields have a one-to-one mapping. Because the source table has no fixed fields, fields are mapped by their order by default. To change the mapping, you must manually edit the field order.

      New version

      In the field mapping configuration, the mappings are: rowkey → rowkey, basic:age → age, basic:name → name, score:english → english, and score:math → math. The source fields are displayed in JSON format and include the name and type (both are string) properties.

      Legacy version

      In the field mapping configuration, source fields are displayed in the Type|ColumnFamily:ColumnName format, including string|rowkey, string|basic:age, string|basic:name, string|score:english, and string|score:math. These map to the target fields rowkey, age, name, english, and math, respectively.

      The target table contains fields such as rowkey, age, name, english, math, and pt. For example: rowkey=s001, age=20, name=Tom, english=90, math=85, pt=222222.

    • multiVersionFixedColumn mode: Each output record consists of four columns (rowKey, family:qualifier, timestamp, and value), and this mode allows you to read multiple data versions. The Source Field is configured in the ColumnFamily:Qualifier format, such as basic:age. The destination table has four fixed columns: row_key, cf, timestamp_col, and value. No mapping configuration is required.

      New version

      In the field mapping area, map source fields to target fields. The source fields are in JSON format. Example mappings: {"name":"rowkey","type":"string"} maps to the target field rowkey, {"name":"basic:age","type":"string"} maps to family, {"name":"basic:name","type":"string"} maps to timestamp, and {"name":"score:english","type":"string"} maps to value. The system does not synchronize the unmapped source field {"name":"score:math","type":"string"}. You can click the Edit button on either side to edit the source and target fields respectively.

      Legacy version

      In the field mapping area, the source field includes string|rowkey, string|basic:age, string|basic:name, string|score:english, and string|score:math. The target field includes rowkey, family, timestamp, and value. The source and target fields are mapped row-by-row.

      The target table contains four fixed columns: row_key, cf, timestamp_col, and value. For example: row_key=s001, cf=basic:age, timestamp_col=1234567890, value=20.

    • When using HBase as the data destination (only normal mode is supported), you must configure the Target Field and rowkey. You can form the rowkey field by concatenating multiple source fields.

  • For parameters and script examples for script mode, see Appendix: Script examples and parameter descriptions.

FAQ

  • Q: What is an appropriate concurrency setting? Does increasing concurrency help when the import speed is slow?

    A: The default heap size for the Java Virtual Machine (JVM) in a data import process is 2 GB. Concurrency is implemented through multi-threading and configured by the number of channels. Too many threads trigger frequent garbage collection, which can degrade performance rather than improve import speed. We recommend using 5–10 concurrent threads (channels).

  • Q: What is an optimal value for batchSize?

    A: The default value is 256. Calculate the optimal batchSize based on the row size. A single batch should typically contain 2–4 MB of data. Divide this data volume by the row size to determine the recommended batchSize.

  • Q: When reading data from HBase in multiVersionFixedColumn mode, I receive a java.lang.StringIndexOutOfBoundsException: String index out of range: -1 error. How can I resolve this?

    A: This error typically occurs because the name field in the column configuration does not follow the columnFamily:qualifier format. For example, you might have specified only the qualifier, such as age, instead of basic:age. To resolve this, ensure the name for every column except rowkey is formatted as columnFamily:qualifier.
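The batchSize guidance in the FAQ above is a simple division. The sketch below works it through for an assumed average row size of 1 KB; `recommended_batch_size` is a hypothetical helper, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
MB = 1024 * 1024  # assumption: 1 MB is treated as 1024 * 1024 bytes


def recommended_batch_size(avg_row_bytes, target_batch_bytes=2 * MB):
    """Divide the target data volume per batch by the average row size."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return max(1, target_batch_bytes // avg_row_bytes)


avg_row_bytes = 1024  # assumed average row size of 1 KB
low = recommended_batch_size(avg_row_bytes, 2 * MB)
high = recommended_batch_size(avg_row_bytes, 4 * MB)
print(f"batchSize between {low} and {high}")  # batchSize between 2048 and 4096
```

With rows of about 1 KB, the 2–4 MB target gives a batchSize well above the default of 256; with much larger rows the result drops accordingly.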

Appendix: Sample script and parameters

Configure a batch synchronization task by using the code editor

To configure a batch synchronization task in the code editor, set the related parameters in the script according to the unified script format. For more information, see Script mode configuration. The following sections describe the data source parameters that you configure in the code editor.

HBase Reader example

{
    "type":"job",
    "version":"2.0",// The version number.
    "steps":[
        {
            "stepType":"hbase",// The plugin name.
            "parameter":{
                "mode":"normal",// Specifies the mode for reading data from HBase. Valid values: `normal` and `multiVersionFixedColumn`.
                "scanCacheSize":"256",// Specifies the number of rows to fetch from the server per RPC.
                "scanBatchSize":"100",// Specifies the number of columns to fetch from the server per RPC.
                "hbaseVersion":"094x/11x",// The HBase version. Set this to either 094x or 11x.
                "column":[// The fields to read.
                    {
                        "name":"rowkey",// The field name.
                        "type":"string"// The data type.
                    },
                    {
                        "name":"basic:age",
                        "type":"string"
                    },
                    {
                        "name":"basic:name",
                        "type":"string"
                    },
                    {
                        "name":"score:english",
                        "type":"string"
                    },
                    {
                        "name":"score:math",
                        "type":"string"
                    }
                ],
                "range":{// Specifies the rowkey range to read.
                    "endRowkey":"",// The end rowkey.
                    "isBinaryRowkey":true,// Specifies whether to use binary conversion for startRowkey and endRowkey. The default value is false.
                    "startRowkey":""// The start rowkey.
                },
                "maxVersion":"",// Specifies the number of versions to read in multi-version mode.
                "encoding":"UTF-8",// The encoding format.
                "table":"student",// The table name.
                "hbaseConfig":{// The connection configuration for the HBase cluster, in JSON format.
                    "hbase.zookeeper.quorum":"hostname",
                    "hbase.rootdir":"hdfs://ip:port/database",
                    "hbase.cluster.distributed":"true"
                }
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"odps",// The plugin name for the destination. This example uses MaxCompute. You can replace it with another Writer plugin.
            "parameter":{
                "partition":"",// Partition information for the destination table. Not required for non-partitioned tables.
                "truncate":true,// Specifies whether to clear the destination table or partition before writing data.
                "datasource":"odps_datasource",// The MaxCompute data source name.
                "column":[// The destination fields.
                    "rowkey",
                    "basic_age",
                    "basic_name",
                    "score_english",
                    "score_math"
                ],
                "table":"student_target"// The name of the destination MaxCompute table.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of error records to allow before the job fails.
        },
        "speed":{
            "throttle":true,// Specifies whether to enable rate limiting. If set to false, the mbps parameter is ignored.
            "concurrent":1,// The number of concurrent jobs.
            "mbps":"12"// The rate limit in megabytes per second (MB/s).
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

HBase Reader script (multiVersionFixedColumn mode)

The following example shows a complete script for reading data from HBase in multiVersionFixedColumn mode and writing it to MaxCompute. In this mode, the value of each cell in HBase is converted into a separate record. Each record consists of four columns: rowkey, family:qualifier, timestamp, and value.

{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"hbase",// Plugin name.
            "parameter":{
                "mode":"multiVersionFixedColumn",// The mode for reading data from HBase. This example uses multiVersionFixedColumn mode.
                "scanCacheSize":"256",// The number of rows that the HBase client reads from the server in each remote procedure call (RPC).
                "scanBatchSize":"100",// The number of columns that the HBase client reads from the server in each RPC.
                "hbaseVersion":"20x",// HBase version.
                "datasource":"hbase_datasource",// HBase data source name.
                "column":[// The columns to read. The first column must be rowkey. The names of other columns must be in the "column family:qualifier" format.
                    {
                        "name":"rowkey",// The rowkey column.
                        "type":"string"
                    },
                    {
                        "name":"basic:age",// The age column in the basic column family.
                        "type":"string"
                    },
                    {
                        "name":"basic:name",// The name column in the basic column family.
                        "type":"string"
                    },
                    {
                        "name":"score:english",// The english column in the score column family.
                        "type":"string"
                    },
                    {
                        "name":"score:math",// The math column in the score column family.
                        "type":"string"
                    }
                ],
                "range":{
                    "isBinaryRowkey":false
                },
                "maxVersion":"-1",// The maximum number of versions to read. This parameter is required in multiVersionFixedColumn mode. A value of -1 specifies that all versions are read.
                "encoding":"UTF-8",// Encoding format.
                "table":"student"// HBase table name.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"odps",// The name of the destination plugin. This example uses MaxCompute.
            "parameter":{
                "partition":"",// The partition of the destination table. This parameter is not required for non-partitioned tables.
                "truncate":true,// If set to true, this clears the destination table or partition before writing data.
                "datasource":"odps_datasource",// MaxCompute data source name.
                "column":[// The destination has four fixed columns that correspond to the rowkey, family:qualifier, timestamp, and value from the source, respectively.
                    "row_key",
                    "cf",
                    "timestamp_col",
                    "value"
                ],
                "table":"hbase_multiversion_target"// The name of the destination MaxCompute table.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of error records allowed.
        },
        "speed":{
            "throttle":false,// No rate limiting.
            "concurrent":2// Job concurrency.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
Note

The destination table in MaxCompute must be created in advance. For example: CREATE TABLE IF NOT EXISTS hbase_multiversion_target (row_key STRING, cf STRING, timestamp_col STRING, value STRING);
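Because every column name other than rowkey must use the columnFamily:qualifier format in this mode, the column list of a script can be checked before the task runs. The following is a minimal sketch with a hypothetical helper, `invalid_column_names`; it inspects only the name field of each column entry.

```python
def invalid_column_names(columns):
    """Return names that are neither "rowkey" nor in columnFamily:qualifier form."""
    bad = []
    for column in columns:
        name = column.get("name", "")
        if name == "rowkey":
            continue
        family, separator, qualifier = name.partition(":")
        # A valid name has a separator with non-empty text on both sides.
        if not separator or not family or not qualifier:
            bad.append(name)
    return bad


columns = [
    {"name": "rowkey", "type": "string"},
    {"name": "basic:age", "type": "string"},
    {"name": "age", "type": "string"},  # missing the column family
]
print(invalid_column_names(columns))  # ['age']
```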

HBase Reader script parameters

Parameter

Description

Required

Default

haveKerberos

Specifies whether the HBase cluster requires Kerberos authentication. Set this parameter to true if the cluster requires it.

Note
  • If you set this parameter to true, you must also configure the following parameters:

    • kerberosKeytabFilePath

    • kerberosPrincipal

    • hbaseMasterKerberosPrincipal

    • hbaseRegionserverKerberosPrincipal

    • hbaseRpcProtection

  • If Kerberos authentication is not enabled for the HBase cluster, you do not need to configure these parameters.

No

false

hbaseConfig

The configuration required to connect to the HBase cluster, in JSON format. The hbase.zookeeper.quorum parameter, which specifies the ZooKeeper endpoint for HBase, is required. You can also add other HBase client configurations, such as scan cache and batch settings, to optimize interaction with the server.

Note

If you are connecting to an ApsaraDB for HBase instance, you must use its internal network endpoint.

Yes

None

mode

The supported read modes for HBase are normal and multiVersionFixedColumn.

Yes

None

table

The name of the HBase table to read. Table names are case-sensitive.

Yes

None

encoding

The encoding, such as UTF-8 or GBK, used to convert a binary HBase byte[] value to a String.

No

UTF-8

column

The HBase field to read. This parameter is required in normal mode and multiVersionFixedColumn mode.

  • In normal mode:

    The name parameter specifies the HBase column to read. Except for rowkey, the value for this parameter must be in the column family:qualifier format. The type parameter specifies the type of the source data, and the format parameter specifies the format for date types. The value parameter specifies that the column is a constant. If you use the value parameter, data is not read from HBase. Instead, a corresponding column is automatically generated based on the specified value. The configuration format is as follows:

    "column": 
    [
    {
      "name": "rowkey",
      "type": "string"
    },
    {
      "value": "test",
      "type": "string"
    }
    ]

    In normal mode, each column entry must specify type and either name or value.

  • multiVersionFixedColumn mode

    The name parameter specifies the HBase column to read. Except for rowkey, the value must be in the column family:qualifier format. The type parameter specifies the type of the source data, and the format parameter specifies the format for date types. Constant columns are not supported in multiVersionFixedColumn mode. The configuration format is as follows:

    "column": 
    [
    {
      "name": "rowkey",
      "type": "string"
    },
    {
      "name": "info:age",
      "type": "string"
    }
    ]

Yes

None

maxVersion

The maximum number of cell versions to read in multi-version mode. Valid values are -1 (all versions) or an integer greater than 1.

Required in multiVersionFixedColumn mode.

None

range

Specifies the rowkey range to read.

  • startRowkey: Specifies the start rowkey.

  • endRowkey: Specifies the end rowkey.

  • isBinaryRowkey: Specifies how startRowkey and endRowkey are converted to a byte[]. The default value is false. If this parameter is set to true, the Bytes.toBytesBinary(rowkey) method is called. If this parameter is set to false, the Bytes.toBytes(rowkey) method is called. The configuration format is as follows:

    "range": {
    "startRowkey": "aaa",
    "endRowkey": "ccc",
    "isBinaryRowkey":false
    }

No

None

scanCacheSize

The number of rows to fetch from HBase in a single remote procedure call (RPC).

No

256

scanBatchSize

The number of columns to fetch from HBase in a single RPC. Set this to -1 to fetch all columns.

Note

The value for scanBatchSize should be greater than the actual number of columns to avoid data quality risks.

No

100
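The isBinaryRowkey setting of the range parameter decides whether escape sequences in startRowkey and endRowkey are interpreted. The sketch below approximates the two conversions in Python to show the difference; it is a simplified emulation, not the HBase Bytes class, and unlike the Java method it rejects malformed escapes instead of handling them silently.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def to_bytes(rowkey):
    """Approximates Bytes.toBytes(String): the text is encoded as-is in UTF-8."""
    return rowkey.encode("utf-8")


def to_bytes_binary(rowkey):
    """Approximates Bytes.toBytesBinary(String): each \\xNN escape becomes one byte."""
    out = bytearray()
    i = 0
    while i < len(rowkey):
        if rowkey[i] == "\\" and rowkey[i + 1:i + 2] == "x":
            pair = rowkey[i + 2:i + 4]
            if len(pair) != 2 or any(ch not in HEX_DIGITS for ch in pair):
                # Simplification: this sketch raises on malformed escapes.
                raise ValueError(f"malformed escape at position {i}")
            out.append(int(pair, 16))
            i += 4
        else:
            out.append(ord(rowkey[i]) & 0xFF)
            i += 1
    return bytes(out)


key = "\\x00\\x01a"  # the 9 characters \x00\x01a as typed in the script
print(to_bytes(key))         # 9 bytes: the escape text itself
print(to_bytes_binary(key))  # 3 bytes: 0x00, 0x01, and "a"
```

If rowkeys were written as raw binary values, setting isBinaryRowkey to true lets the range boundaries be expressed with such escapes; for plain text rowkeys, false keeps them unchanged.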

HBase Writer script

{
    "type":"job",
    "version":"2.0",// The version number.
    "steps":[
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"hbase",// The plugin name.
            "parameter":{
                "mode":"normal",// The write mode for HBase.
                "walFlag":"false",// Set to `false` to disable write-ahead logging (WAL).
                "hbaseVersion":"094x",// The HBase version.
                "rowkeyColumn":[// The columns that form the HBase rowkey.
                    {
                        "index":"0",// The index of the source data column.
                        "type":"string"// The data type for this part of the rowkey.
                    },
                    {
                        "index":"-1",
                        "type":"string",
                        "value":"_"
                    }
                ],
                "nullMode":"skip",// Specifies how to handle null values from the source.
                "column":[// The destination columns in the HBase table.
                    {
                        "name":"columnFamilyName1:columnName1",// The column name, in `family:qualifier` format.
                        "index":"0",// The index of the source data column.
                        "type":"string"// The data type of the column value.
                    },
                    {
                        "name":"columnFamilyName2:columnName2",
                        "index":"1",
                        "type":"string"
                    },
                    {
                        "name":"columnFamilyName3:columnName3",
                        "index":"2",
                        "type":"string"
                    }
                ],
                "encoding":"utf-8",// The character encoding.
                "table":"",// The name of the destination HBase table.
                "hbaseConfig":{// Configuration for the HBase cluster connection, in JSON format.
                    "hbase.zookeeper.quorum":"hostname",
                    "hbase.rootdir":"hdfs://ip:port/database",
                    "hbase.cluster.distributed":"true"
                }
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of allowed error records.
        },
        "speed":{
            "throttle":true,// Enables (`true`) or disables (`false`) rate limiting. If `true`, the rate is defined by the `mbps` parameter.
            "concurrent":1, // The number of concurrent write tasks.
            "mbps":"12"// The maximum transfer rate in megabytes per second (MB/s).
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

HBase Writer script parameters

Parameter

Description

Required

Default

haveKerberos

Specifies whether the HBase cluster requires Kerberos authentication. Set this parameter to true to enable Kerberos authentication.

Note
  • If you set this parameter to true, you must configure the following Kerberos authentication parameters:

    • kerberosKeytabFilePath

    • kerberosPrincipal

    • hbaseMasterKerberosPrincipal

    • hbaseRegionserverKerberosPrincipal

    • hbaseRpcProtection

  • These parameters are not required if Kerberos authentication is disabled.

No

false

hbaseConfig

The JSON configuration for connecting to the HBase cluster. The hbase.zookeeper.quorum parameter is required and specifies the ZooKeeper endpoint for the HBase cluster. You can also add more HBase client configurations, such as scan cache and batch settings, to optimize server interaction.

Note

To connect to an ApsaraDB for HBase database, you must use its internal network endpoint.

Yes

None

mode

The mode for writing data to HBase. Only the normal mode is currently supported.

Yes

None

table

The name of the HBase table to write to. This parameter is case-sensitive.

Yes

None

encoding

The encoding format for converting STRING data to HBase byte[]. Valid values: UTF-8 and GBK.

No

UTF-8

column

The configuration for the columns to which you are writing data:

  • index: Specifies the index of the corresponding column from the reader, starting from 0.

  • name: Specifies the column in the HBase table. The format must be column family:column name.

  • type: Specifies the data type for the write operation. This is used to convert the data to the HBase byte[] format.

Yes

None

rowkeyColumn

The rowkey column in the HBase table to write to:

  • index: Specifies the index of the corresponding column from the reader, starting from 0. If the column is a constant, set this parameter to -1.

  • type: Specifies the data type for the write operation. This is used to convert the data to the HBase byte[] format.

  • value: A constant, often used as a separator to concatenate multiple fields. The rowkey cannot be composed entirely of constants.

The format is as follows.

"rowkeyColumn": [
          {
            "index":0,
            "type":"string"
          },
          {
            "index":-1,
            "type":"string",
            "value":"_"
          }
      ]

Yes

None

versionColumn

Specifies the timestamp for the write operation. The value can be from a source column or a constant. If this parameter is not configured, the system's current time is used.

  • index: Specifies the index of the corresponding column from the reader, starting from 0. Ensure that the value can be converted to the LONG type. To use a constant timestamp instead of a source column, set index to -1.

  • type: If the data type is Date, the system attempts to parse the value using the yyyy-MM-dd HH:mm:ss and yyyy-MM-dd HH:mm:ss SSS formats.

  • value: A constant timestamp value of the LONG type. Used when index is -1.

The following examples show the format.

  • "versionColumn": {
      "index": 1
    }
  • "versionColumn": {
      "index": -1,
      "value": 123456789
    }

No

None

nullMode

Specifies how to handle null values in the source data:

  • skip: Skips writing the column to HBase.

  • empty: Writes HConstants.EMPTY_BYTE_ARRAY, which is new byte[0].

No

skip

walFlag

When an HBase client submits data, it first writes the operations to a write-ahead log (WAL) before writing to the MemStore. This process ensures data durability. To improve write performance, you can disable the WAL by setting this parameter to false.

No

false

writeBufferSize

The size of the write buffer for the HBase client, in bytes. This parameter is used together with autoflush.

autoflush (false by default):

  • true: The client sends a request for each put operation, and the write buffer is not used.

  • false: The client sends a write request to the HBase server only when the client-side write buffer is full.

No

8 MB

fileSystemUsername

To resolve Ranger permission issues during a synchronization task, switch the task from the codeless UI to the code editor. Then, set the fileSystemUsername parameter to a user that has the required permissions. DataWorks then accesses HBase as the specified user.

No

None

HBase20xsql Reader example

{
    "type":"job",
    "version":"2.0", // Version number.
    "steps":[
        {
            "stepType":"hbase20xsql", // Plugin name.
            "parameter":{
                "queryServerAddress": "http://127.0.0.1:8765",  // Phoenix QueryServer endpoint.
                "serialization": "PROTOBUF",  // QueryServer serialization format.
                "table": "TEST",    // Table to read.
                "column": ["ID", "NAME"],   // Columns to read.
                "splitKey": "ID"    // Sharding key, which must be the primary key of the table.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0" // Maximum allowed error records.
        },
        "speed":{
            "throttle":true, // Toggles rate limiting. If true, the rate is limited by the mbps parameter.
            "concurrent":1, // Number of concurrent jobs.
            "mbps":"12" // Rate limit in MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

HBase20xsql Reader parameters

Parameter

Description

Required

Default

queryServerAddress

The endpoint of the Phoenix QueryServer. The HBase20xsql Reader plugin uses a lightweight client to connect to this endpoint. To pass user credentials for ApsaraDB for HBase Performance-enhanced Edition (Lindorm), append the user and password properties to the queryServerAddress string. For example: http://127.0.0.1:8765;user=root;password=root.
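
For illustration, the following Python sketch assembles a queryServerAddress value with the optional credential properties appended. The helper name with_credentials is hypothetical and exists only for this example; the only format assumed is the semicolon-separated form shown in the example address above.

```python
def with_credentials(address, user=None, password=None):
    """Append optional user and password properties to a QueryServer address."""
    parts = [address]
    if user is not None:
        parts.append(f"user={user}")
    if password is not None:
        parts.append(f"password={password}")
    # Properties are joined with semicolons, as in the documented example.
    return ";".join(parts)


print(with_credentials("http://127.0.0.1:8765", "root", "root"))
# prints http://127.0.0.1:8765;user=root;password=root
```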

Yes

None

serialization

The serialization protocol used by the Phoenix QueryServer.

No

PROTOBUF

table

The name of the table to read. The name is case-sensitive.

Yes

None

schema

The schema that contains the table.

No

None

column

The columns to synchronize. Use a JSON array to define the column names. If you do not specify this parameter or leave it empty, the reader reads all columns.

No

All columns

splitKey

The reader shards the table when it reads data. The column that splitKey specifies is used as the sharding key, which lets the synchronization task run concurrent subtasks and improves performance. The column must be the primary key of the table. Two sharding methods are available. If the splitPoints parameter is empty, Method 1 is used by default:

  • Method 1: The reader finds the minimum and maximum values of splitKey and then splits that range evenly based on the configured concurrency.

    Note

    The sharding key must be of an integer or string type.

  • Method 2: The reader partitions the data based on the manually configured splitPoints. The data is then synchronized based on the configured number of concurrent tasks.
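
Method 1 can be pictured with the following Python sketch for an integer sharding key. It is an illustration under assumptions, not the plugin's code: the function name split_evenly, the inclusive range boundaries, and the way the remainder is distributed across shards are hypothetical choices made for this example.

```python
def split_evenly(min_key, max_key, concurrent):
    """Split the inclusive range [min_key, max_key] into near-equal shards."""
    if concurrent < 1:
        raise ValueError("concurrent must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must not be smaller than min_key")
    total = max_key - min_key + 1
    shards = min(concurrent, total)  # never create empty shards
    base, extra = divmod(total, shards)
    ranges = []
    low = min_key
    for i in range(shards):
        # The first `extra` shards take one additional key each.
        size = base + (1 if i < extra else 0)
        high = low + size - 1
        ranges.append((low, high))
        low = high + 1
    return ranges


print(split_evenly(1, 10, 3))  # prints [(1, 4), (5, 7), (8, 10)]
```

Each resulting range would correspond to one concurrent subtask. Because the ranges follow key values rather than Region boundaries, skewed key distributions can still produce uneven load.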

Yes

None

splitPoints

Automatic sharding based on the minimum and maximum values of the sharding key may not prevent data hot spots. For optimal performance, we recommend defining custom sharding points based on the startkey and endkey of your HBase Regions. This ensures that each concurrent task queries a single Region.

No

None

where

A filter condition to add to the table query. The HBase20xsql Reader constructs a SQL query based on the column, table, and where parameters to extract data.
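
A minimal Python sketch of how such a query might be assembled from the column, table, and where parameters is shown below. The exact SQL text that the reader generates is not documented here, so the function name build_query, the optional schema prefix, and the unquoted identifiers are assumptions for illustration only.

```python
def build_query(table, columns=None, where=None, schema=None):
    """Assemble a SELECT statement from reader parameters (illustrative sketch)."""
    # An empty or missing column list means all columns.
    column_list = ", ".join(columns) if columns else "*"
    # Assumption: the schema, when present, prefixes the table name.
    target = f"{schema}.{table}" if schema else table
    sql = f"SELECT {column_list} FROM {target}"
    if where:
        sql += f" WHERE {where}"
    return sql


print(build_query("TEST", ["ID", "NAME"], "ID > 100"))
# prints SELECT ID, NAME FROM TEST WHERE ID > 100
```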

No

None

querySql

For complex filtering scenarios where the where parameter is insufficient, you can provide a custom SQL query. If you configure this parameter, the reader ignores the column, table, where, and splitKey parameters. The queryServerAddress parameter is still required.

No

None

HBase11xsql Writer example

{
  "type": "job",
  "version": "1.0",
  "configuration": {
    "setting": {
      "errorLimit": {
        "record": "0"
      },
      "speed": {
        "throttle": true, // Enables rate limiting. If set to false, the mbps parameter is ignored.
        "concurrent": 1, // The number of concurrent jobs.
        "mbps": "1" // Rate limit in MB/s.
      }
    },
    "reader": {
      "plugin": "odps",
      "parameter": {
        "datasource": "",
        "table": "",
        "column": [],
        "partition": ""
      }
    },
    "writer": {
      "plugin": "hbase11xsql",
      "parameter": {
        "table": "The name of the destination HBase table. The name is case-sensitive.",
        "hbaseConfig": {
          "hbase.zookeeper.quorum": "The ZooKeeper endpoint of the destination HBase cluster.",
          "zookeeper.znode.parent": "The znode of the destination HBase cluster."
        },
        "column": [
          "columnName"
        ],
        "batchSize": 256,
        "nullMode": "skip"
      }
    }
  }
}

HBase11xsql Writer parameters

Parameter

Description

Required

Default

plugin

Specifies the name of the plugin. The value must be hbase11xsql.

Yes

None

table

Specifies the name of the destination table. This parameter is case-sensitive. Phoenix table names are typically uppercase.

Yes

None

column

Specifies the names of the columns. The names are case-sensitive. Phoenix column names are typically uppercase.

Note
  • The column order must match that of the reader output.

  • You do not need to specify data types. The writer automatically retrieves the column metadata from Phoenix.

Yes

None

hbaseConfig

Specifies the connection settings of the HBase cluster. You must specify the ZooKeeper (ZK) endpoint in hbase.zookeeper.quorum, in the format ip1,ip2,ip3.

Note
  • Use commas (,) to separate multiple IP addresses.

  • The znode setting (zookeeper.znode.parent) is optional. The default value is /hbase.

Yes

None

batchSize

Specifies the maximum number of rows for a batch write.

No

256

nullMode

Specifies how to handle null values from the source data.

  • skip: The writer skips writing the column. If a value for this column already exists in the destination row, the writer deletes it.

  • empty: The writer inserts an empty value. The empty value is 0 for numeric types and an empty string for varchar types.
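
The two modes can be contrasted with the following Python sketch. It is an illustration under assumptions: the function name apply_null_mode and the type labels "numeric" and "varchar" are hypothetical, and the sketch only decides which values would be written. It does not model the deletion of an existing destination value that skip performs.

```python
def apply_null_mode(row, column_types, null_mode="skip"):
    """Return the column values to write for one row (illustrative sketch)."""
    if null_mode not in ("skip", "empty"):
        raise ValueError(f"unsupported nullMode: {null_mode}")
    result = {}
    for name, value in row.items():
        if value is not None:
            result[name] = value
        elif null_mode == "empty":
            # Empty string for varchar columns, 0 for numeric columns.
            result[name] = "" if column_types[name] == "varchar" else 0
        # In skip mode, a null column is left out of the write.
    return result


row = {"ID": 1, "NAME": None, "AGE": None}
types = {"ID": "numeric", "NAME": "varchar", "AGE": "numeric"}
print(apply_null_mode(row, types, "skip"))   # prints {'ID': 1}
print(apply_null_mode(row, types, "empty"))  # prints {'ID': 1, 'NAME': '', 'AGE': 0}
```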

No

skip