To run bulk loads or other operations that require direct Hadoop Distributed File System (HDFS) access, enable the HDFS ports on your ApsaraDB for HBase cluster and configure a Hadoop client to connect to the cluster's HDFS.
Enabling HDFS ports exposes your cluster to malicious attacks, which may cause performance instability or even data loss. Alibaba Cloud is not responsible for data loss in HDFS caused by user mistakes. Make sure you are familiar with HDFS operations before proceeding.
Prerequisites
Before you begin, ensure that you have:
A machine with the Hadoop client installed and the hadoop command available
Access to the ApsaraDB for HBase Q&A DingTalk group (required to activate HDFS and obtain header node hostnames)
Enable HDFS access
HDFS access is not self-service. To activate it, contact the ApsaraDB for HBase Q&A DingTalk group. After you complete your tasks, Alibaba Cloud disables HDFS again to protect your cluster.
The DingTalk group also provides the actual values for the two header node hostnames ({hbase-header-1-host} and {hbase-header-2-host}) that you need in the configuration files below.
Configure the Hadoop client
The files in this section configure the client side only. You do not need to configure the full HDFS cluster, only the parameters that tell the Hadoop client how to reach the NameNodes.
1. Create a configuration folder named conf in your Hadoop client directory. If the folder already exists, skip this step.

2. Add the following two configuration files to the conf folder.

   core-site.xml sets the default HDFS nameservice your client connects to:

       <configuration>
         <property>
           <name>fs.defaultFS</name>
           <value>hdfs://hbase-cluster</value>
         </property>
       </configuration>

   hdfs-site.xml configures high availability (HA) failover so the client can locate the active NameNode automatically:

       <configuration>
         <property>
           <name>dfs.nameservices</name>
           <value>hbase-cluster</value>
         </property>
         <property>
           <name>dfs.client.failover.proxy.provider.hbase-cluster</name>
           <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
         </property>
         <property>
           <name>dfs.ha.automatic-failover.enabled.hbase-cluster</name>
           <value>true</value>
         </property>
         <property>
           <name>dfs.ha.namenodes.hbase-cluster</name>
           <value>nn1,nn2</value>
         </property>
         <property>
           <name>dfs.namenode.rpc-address.hbase-cluster.nn1</name>
           <value>{hbase-header-1-host}:8020</value>
         </property>
         <property>
           <name>dfs.namenode.rpc-address.hbase-cluster.nn2</name>
           <value>{hbase-header-2-host}:8020</value>
         </property>
       </configuration>

   Replace {hbase-header-1-host} and {hbase-header-2-host} with the actual hostnames provided by the ApsaraDB for HBase Q&A DingTalk group.

   The key parameters in hdfs-site.xml and what they do:

   | Parameter | Description |
   | --- | --- |
   | dfs.nameservices | The logical name for the HDFS nameservice. Used in fs.defaultFS and other configuration keys. |
   | dfs.ha.namenodes.hbase-cluster | The IDs of the two NameNodes (nn1, nn2) in the HA pair. |
   | dfs.namenode.rpc-address.hbase-cluster.nn1 / nn2 | The RPC addresses (port 8020) of each NameNode. |
   | dfs.client.failover.proxy.provider.hbase-cluster | The Java class the client uses to determine which NameNode is currently active. |
   | dfs.ha.automatic-failover.enabled.hbase-cluster | Enables automatic failover so the client switches to the standby NameNode if the active one becomes unavailable. |

3. Add the conf folder to the classpath of the Hadoop client.
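The steps above can be sketched as a single shell script. This is an illustration under stated assumptions, not the official procedure: the client directory and both hostnames are hypothetical placeholders, and setting the HADOOP_CONF_DIR environment variable is only one common way to get the conf folder onto the Hadoop client classpath. The file contents mirror the configuration shown above.

```shell
#!/bin/sh
# Sketch only: every path and hostname below is a hypothetical placeholder.
set -eu

# Hypothetical Hadoop client directory (defaults to a scratch location).
HADOOP_CLIENT_DIR="${HADOOP_CLIENT_DIR:-${TMPDIR:-/tmp}/hadoop-client-demo}"
CONF_DIR="$HADOOP_CLIENT_DIR/conf"

# Hypothetical hostnames; use the values from the DingTalk group instead.
HEADER_1_HOST="${HEADER_1_HOST:-header-1.example.internal}"
HEADER_2_HOST="${HEADER_2_HOST:-header-2.example.internal}"

# Create the conf folder (no-op if it already exists).
mkdir -p "$CONF_DIR"

# core-site.xml: default nameservice for the client.
cat > "$CONF_DIR/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hbase-cluster</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: HA settings, with the hostnames substituted in.
cat > "$CONF_DIR/hdfs-site.xml" <<EOF
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>hbase-cluster</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.hbase-cluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled.hbase-cluster</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.hbase-cluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hbase-cluster.nn1</name>
    <value>${HEADER_1_HOST}:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hbase-cluster.nn2</name>
    <value>${HEADER_2_HOST}:8020</value>
  </property>
</configuration>
EOF

# Assumption: the hadoop launcher reads HADOOP_CONF_DIR to find client config.
export HADOOP_CONF_DIR="$CONF_DIR"
echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
```

The export applies only to the script's own process and to any hadoop command started from it. To use the same configuration interactively, export HADOOP_CONF_DIR with the same path in your own shell.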
Verify the connection
Run the following commands to confirm the HDFS ports are accessible and read/write operations work:
    echo "hdfs port test" >/tmp/test
    hadoop fs -put /tmp/test /
    hadoop fs -cat /test

If the connection is successful, the last command prints:

    hdfs port test

If the command returns an error or produces no output, check that:
HDFS has been activated by the ApsaraDB for HBase Q&A DingTalk group
The conf folder is on the Hadoop client classpath
The header node hostnames in hdfs-site.xml match the values provided by the DingTalk group
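The local parts of this checklist can be partly automated. The following is a minimal sketch, assuming the conf folder is located through the HADOOP_CONF_DIR environment variable (falling back to ./conf); check_client_conf is a hypothetical helper, not part of Hadoop. It only detects missing files and unreplaced hostname placeholders. It cannot tell whether HDFS has been activated or whether the hostnames are the right ones, which only the DingTalk group can confirm.

```shell
#!/bin/sh
# Sketch only: check_client_conf is a hypothetical helper, not part of Hadoop.

# Prints one line per problem found in the given conf folder and
# returns the number of problems (0 means the local checks passed).
check_client_conf() {
  conf_dir="$1"
  problems=0

  # Both client configuration files must exist.
  for f in core-site.xml hdfs-site.xml; do
    if [ ! -f "$conf_dir/$f" ]; then
      echo "missing: $conf_dir/$f"
      problems=$((problems + 1))
    fi
  done

  # Placeholders such as {hbase-header-1-host} must be replaced
  # with the real hostnames before the client can connect.
  if [ -f "$conf_dir/hdfs-site.xml" ] &&
     grep -F -q '{hbase-header-' "$conf_dir/hdfs-site.xml"; then
    echo "unreplaced placeholder in: $conf_dir/hdfs-site.xml"
    problems=$((problems + 1))
  fi

  return "$problems"
}

# Example run against the assumed conf folder location.
if check_client_conf "${HADOOP_CONF_DIR:-./conf}"; then
  echo "local configuration checks passed"
else
  echo "fix the problems listed above, then rerun the verification commands"
fi
```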