Access ApsaraDB for HBase HDFS

To run bulk loads or other operations that require direct Hadoop Distributed File System (HDFS) access, enable the HDFS ports on your ApsaraDB for HBase cluster and configure a Hadoop client to connect to the cluster's HDFS.

Warning

Opening the HDFS ports exposes your cluster to potential attacks, which can cause unstable performance or data loss. Alibaba Cloud is not responsible for HDFS data loss caused by user error. Make sure you are familiar with HDFS operations before you proceed.

Prerequisites

Before you begin, ensure that you have:

  • A machine with the Hadoop client installed and the hadoop command available

  • Access to the ApsaraDB for HBase Q&A DingTalk group (required to activate HDFS and obtain header node hostnames)

Enable HDFS access

HDFS access is not self-service. To activate it, contact the ApsaraDB for HBase Q&A DingTalk group. After you complete your tasks, Alibaba Cloud disables HDFS again to protect your cluster.

The DingTalk group also provides the actual values for the two header node hostnames ({hbase-header-1-host} and {hbase-header-2-host}) that you need in the configuration files below.

Configure the Hadoop client

These are client-side configuration files only. You do not need to configure the full HDFS cluster — just the parameters that tell the Hadoop client how to reach the NameNodes.

  1. Create a configuration folder named conf in your Hadoop client directory. If the folder already exists, skip this step.

  2. Add the following two configuration files to the conf folder.

    core-site.xml — sets the default HDFS nameservice your client connects to:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hbase-cluster</value>
      </property>
    </configuration>

    hdfs-site.xml — configures high availability (HA) failover so the client can locate the active NameNode automatically:

    <configuration>
      <property>
        <name>dfs.nameservices</name>
        <value>hbase-cluster</value>
      </property>
      <property>
        <name>dfs.client.failover.proxy.provider.hbase-cluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
      <property>
        <name>dfs.ha.automatic-failover.enabled.hbase-cluster</name>
        <value>true</value>
      </property>
      <property>
        <name>dfs.ha.namenodes.hbase-cluster</name>
        <value>nn1,nn2</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.hbase-cluster.nn1</name>
        <value>{hbase-header-1-host}:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.hbase-cluster.nn2</name>
        <value>{hbase-header-2-host}:8020</value>
      </property>
    </configuration>

    Replace {hbase-header-1-host} and {hbase-header-2-host} with the actual hostnames provided by the ApsaraDB for HBase Q&A DingTalk group.

    The key parameters in hdfs-site.xml and what they do:

    Parameter                                           Description
    dfs.nameservices                                    The logical name of the HDFS nameservice. Used in fs.defaultFS and as a suffix in the other configuration keys.
    dfs.ha.namenodes.hbase-cluster                      The IDs of the two NameNodes (nn1, nn2) in the HA pair.
    dfs.namenode.rpc-address.hbase-cluster.nn1/nn2      The RPC address (hostname and port 8020) of each NameNode.
    dfs.client.failover.proxy.provider.hbase-cluster    The Java class that the client uses to determine which NameNode is currently active.
    dfs.ha.automatic-failover.enabled.hbase-cluster     Declares that automatic failover is enabled for the nameservice, so the standby NameNode takes over if the active one becomes unavailable.
  3. Add the conf folder to the classpath of the Hadoop client.
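
Steps 1 to 3 can also be scripted. The following is a minimal sketch under stated assumptions, not an official procedure. The variable names NN1_HOST, NN2_HOST, and CONF_DIR and the example hostnames are hypothetical; replace the hostnames with the values provided for your cluster. The sketch assumes a POSIX shell, assumes you run it from your Hadoop client directory, and uses the standard HADOOP_CONF_DIR environment variable as one way to put the conf folder on the client classpath. It overwrites any existing core-site.xml and hdfs-site.xml in that folder.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders: use the hostnames provided for your cluster.
NN1_HOST="${NN1_HOST:-header-1.example.internal}"
NN2_HOST="${NN2_HOST:-header-2.example.internal}"
# Assumed location: a conf folder in the current (Hadoop client) directory.
CONF_DIR="${CONF_DIR:-$PWD/conf}"

# Step 1: create the configuration folder if it does not exist yet.
mkdir -p "$CONF_DIR"

# Step 2a: core-site.xml points the client at the logical nameservice.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hbase-cluster</value>
  </property>
</configuration>
EOF

# Step 2b: hdfs-site.xml maps the nameservice to the two NameNodes.
# The heredoc is unquoted so that the hostnames are substituted.
cat > "$CONF_DIR/hdfs-site.xml" <<EOF
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>hbase-cluster</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.hbase-cluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled.hbase-cluster</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.hbase-cluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hbase-cluster.nn1</name>
    <value>${NN1_HOST}:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hbase-cluster.nn2</name>
    <value>${NN2_HOST}:8020</value>
  </property>
</configuration>
EOF

# Step 3: the hadoop launcher adds HADOOP_CONF_DIR to the client classpath.
export HADOOP_CONF_DIR="$CONF_DIR"
echo "Wrote client configuration to $HADOOP_CONF_DIR"
```

The export only affects the current shell session. To make it persistent, you would add the same export line to your shell profile.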

Verify the connection

Run the following commands to confirm the HDFS ports are accessible and read/write operations work:

echo "hdfs port test" > /tmp/test
hadoop fs -put /tmp/test /
hadoop fs -cat /test

If the connection is successful, the last command prints:

hdfs port test
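
The commands above leave the file /test in the HDFS root directory, and a second run fails at the -put step because the destination already exists. One possible way to make the check repeatable is sketched below. The function name hdfs_roundtrip and the file name prefix hdfs-port-test are hypothetical; the sketch assumes a POSIX shell and uses only the hadoop fs subcommands -put, -cat, and -rm.

```shell
# Sketch: hdfs_roundtrip is a hypothetical helper, not part of Hadoop.
hdfs_roundtrip() {
  rt_local="$(mktemp)"              # temporary local source file
  rt_remote="/hdfs-port-test-$$"    # per-run name in the HDFS root ($$ is the shell PID)
  echo "hdfs port test" > "$rt_local"

  # Upload the file; stop early if the upload fails.
  if ! hadoop fs -put "$rt_local" "$rt_remote"; then
    rm -f "$rt_local"
    return 1
  fi

  # Read the file back; its content is written to stdout.
  hadoop fs -cat "$rt_remote"
  rt_status=$?

  # Clean up the HDFS test file and the local temporary file.
  hadoop fs -rm "$rt_remote" > /dev/null
  rm -f "$rt_local"
  return "$rt_status"
}
```

After the function is defined, running hdfs_roundtrip should print the test string if the upload and the read both succeed, and it returns a non-zero status otherwise.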

If the command returns an error or produces no output, check that:

  • HDFS has been activated by the ApsaraDB for HBase Q&A DingTalk group

  • The conf folder is on the Hadoop client classpath

  • The header node hostnames in hdfs-site.xml match the values provided by the DingTalk group
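
The local part of this checklist can be checked mechanically. The following is a rough sketch, assuming a POSIX shell; the function name check_client_conf is hypothetical. It only inspects the files in the configuration folder that you pass to it: it cannot tell whether HDFS has been activated for your cluster, or whether the hostnames you entered are the correct ones.

```shell
# Sketch: check_client_conf is a hypothetical helper, not part of Hadoop.
# Usage: check_client_conf /path/to/conf
check_client_conf() {
  cc_dir="$1"
  cc_problems=0

  # Both client-side files must exist in the conf folder.
  for cc_file in core-site.xml hdfs-site.xml; do
    if [ ! -f "$cc_dir/$cc_file" ]; then
      echo "missing: $cc_dir/$cc_file"
      cc_problems=$((cc_problems + 1))
    fi
  done

  # The {hbase-header-N-host} placeholders must have been replaced.
  if [ -f "$cc_dir/hdfs-site.xml" ] && grep -q '{hbase-header-' "$cc_dir/hdfs-site.xml"; then
    echo "placeholder hostnames are still present in $cc_dir/hdfs-site.xml"
    cc_problems=$((cc_problems + 1))
  fi

  if [ "$cc_problems" -eq 0 ]; then
    echo "client configuration looks complete"
  fi
  # The exit status is the number of problems found (0 means none).
  return "$cc_problems"
}
```

If the hdfs command is also installed on the client, running hdfs getconf -confKey fs.defaultFS is another way to see whether the conf folder is being picked up; with the configuration above it should print hdfs://hbase-cluster.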