This topic describes how to install and use Apache Spark on a Hadoop cluster with an Apsara File Storage for HDFS file system mounted.
Prerequisites
- You have activated Apsara File Storage for HDFS and created a file system instance and a mount target. For more information, see Quick Start for Apsara File Storage for HDFS.
- You have installed JDK 1.8 or later on all nodes of the Hadoop cluster.
- You have downloaded the Apache Hadoop package. We recommend that you use Hadoop 2.7.2 or later. This topic uses Apache Hadoop 2.7.2.
- You have downloaded the Scala package. This topic uses Scala 2.12.11.
- You have downloaded the Apache Spark package. This topic uses Apache Spark 2.4.8.
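The JDK prerequisite can be checked from a script. The following is a minimal sketch, not part of the original procedure: `is_java8_or_later` is a hypothetical helper name, and it assumes you pass it the version string that `java -version` reports, in either the legacy form (1.8.0_421) or the newer form (11.0.2).

```shell
# Hypothetical helper: succeeds if the given Java version string is 1.8 or later.
is_java8_or_later() {
  version=$1
  major=${version%%.*}            # text before the first dot
  if [ "$major" = "1" ]; then     # legacy scheme: 1.8.0_421 means Java 8
    rest=${version#1.}
    major=${rest%%.*}
  fi
  major=${major%%[!0-9]*}         # drop suffixes such as "-ea"
  [ -n "$major" ] && [ "$major" -ge 8 ]
}

# Example call with the JDK version that appears in this topic's sample output.
if is_java8_or_later "1.8.0_421"; then
  echo "JDK version is sufficient"
fi
```

On a node, the version string could be obtained with something like `java -version 2>&1 | awk -F'"' 'NR==1 {print $2}'`, assuming the first line of output has the usual `java version "..."` shape.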
Step 1: Configure Hadoop
- Decompress the Hadoop package to a specified directory.
    # Replace /usr/local/ with the actual path.
    tar -zxf hadoop-2.7.2.tar.gz -C /usr/local/
- Modify the hadoop-env.sh configuration file.
  - Open the hadoop-env.sh configuration file.
      vim /usr/local/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
  - Set the JAVA_HOME path.
      export JAVA_HOME=/usr/java/default
- Modify the core-site.xml configuration file.
  - Open the core-site.xml configuration file.
      vim /usr/local/hadoop-2.7.2/etc/hadoop/core-site.xml
  - Add the following properties to the core-site.xml configuration file. For more information, see Mount an Apsara File Storage HDFS file system.
      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>dfs://x-xxxxxxxx.cn-xxxxx.dfs.aliyuncs.com:10290</value>
          <!-- Enter the address of your mount target. -->
        </property>
        <property>
          <name>fs.dfs.impl</name>
          <value>com.alibaba.dfs.DistributedFileSystem</value>
        </property>
        <property>
          <name>fs.AbstractFileSystem.dfs.impl</name>
          <value>com.alibaba.dfs.DFS</value>
        </property>
      </configuration>
- Modify the yarn-site.xml configuration file.
  - Open the yarn-site.xml configuration file.
      vim /usr/local/hadoop-2.7.2/etc/hadoop/yarn-site.xml
  - Add the following properties to the yarn-site.xml configuration file.
      <configuration>
        <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>xxxx</value>
          <!-- Enter the hostname of the ResourceManager in your cluster. -->
        </property>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
          <name>yarn.nodemanager.resource.memory-mb</name>
          <value>16384</value>
          <!-- Configure this parameter based on the capacity of your cluster. -->
        </property>
        <property>
          <name>yarn.nodemanager.resource.cpu-vcores</name>
          <value>4</value>
          <!-- Configure this parameter based on the capacity of your cluster. -->
        </property>
        <property>
          <name>yarn.scheduler.maximum-allocation-vcores</name>
          <value>4</value>
          <!-- Configure this parameter based on the capacity of your cluster. -->
        </property>
        <property>
          <name>yarn.scheduler.minimum-allocation-mb</name>
          <value>3584</value>
          <!-- Configure this parameter based on the capacity of your cluster. -->
        </property>
        <property>
          <name>yarn.scheduler.maximum-allocation-mb</name>
          <value>14336</value>
          <!-- Configure this parameter based on the capacity of your cluster. -->
        </property>
      </configuration>
- Modify the slaves configuration file.
  - Open the slaves configuration file.
      vim /usr/local/hadoop-2.7.2/etc/hadoop/slaves
  - Add the hostnames of the compute nodes in your cluster to the slaves configuration file, one per line.
      cluster-header-1
      cluster-worker-1
- Configure environment variables.
  - Open the /etc/profile configuration file.
      vim /etc/profile
  - Configure HADOOP_HOME in the /etc/profile configuration file.
      export HADOOP_HOME=/usr/local/hadoop-2.7.2
      export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
      export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
  - Apply the configuration.
      source /etc/profile
- Install the Java SDK for Apsara File Storage for HDFS.
  Download the latest Java SDK for Apsara File Storage for HDFS and deploy it to the CLASSPATH of your Hadoop ecosystem components. For more information, see Mount an Apsara File Storage HDFS file system.
      cp aliyun-sdk-dfs-x.y.z.jar /usr/local/hadoop-2.7.2/share/hadoop/hdfs
  Note: Replace x.y.z with the actual version number of the Java SDK.
- Synchronize the ${HADOOP_HOME} directory to the same directory on the other nodes in the cluster, and then repeat the "Configure environment variables" substep above on each of those nodes.
    # Run from /usr/local, the parent directory of hadoop-2.7.2.
    scp -r hadoop-2.7.2/ hadoop@cluster-worker-1:/usr/local/
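The memory values in the yarn-site.xml example in this step are related by simple arithmetic, which can help when you adapt them to your own nodes. The sketch below only reproduces that arithmetic with the sample values. The variable names are illustrative, and the interpretation (how many minimum-size containers fit on one NodeManager) is one reasonable way to read the numbers, not a sizing rule stated in this topic.

```shell
# Sample values copied from the yarn-site.xml example in this topic.
node_mb=16384        # yarn.nodemanager.resource.memory-mb
min_alloc_mb=3584    # yarn.scheduler.minimum-allocation-mb
max_alloc_mb=14336   # yarn.scheduler.maximum-allocation-mb

# How many minimum-size containers fit on one NodeManager (integer division).
containers=$(( node_mb / min_alloc_mb ))
# Memory those containers use, and what remains on the node.
used_mb=$(( containers * min_alloc_mb ))
spare_mb=$(( node_mb - used_mb ))

echo "${containers} containers of ${min_alloc_mb} MB use ${used_mb} MB; ${spare_mb} MB remain"
```

With the sample values, four minimum-size containers add up to 14336 MB, which equals the configured maximum allocation, and 2048 MB of the 16384 MB node capacity remains. If you change one of the three values, rerunning the arithmetic shows whether the others still line up.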
Step 2: Verify Hadoop configuration
After you configure Hadoop, you do not need to format the NameNode or run the start-dfs.sh script to start HDFS-related services. If you use the YARN service, start it only on the ResourceManager node. For detailed instructions on how to verify the Hadoop configuration, see Verify the installation.
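Before running the verification, it can help to confirm that the settings from Step 1 are visible in the current shell. The sketch below is a convenience under stated assumptions, not part of the official procedure: `check_hadoop_env` and `is_dfs_uri` are hypothetical helper names, and the URI check only tests the `dfs://<host>:<port>` shape used in this topic.

```shell
# Hypothetical helper: report which Hadoop variables from /etc/profile are missing.
check_hadoop_env() {
  missing=0
  for name in HADOOP_HOME HADOOP_CONF_DIR; do
    eval "value=\${$name:-}"      # read the variable whose name is in $name
    if [ -z "$value" ]; then
      echo "$name is not set"
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: succeed only if the argument looks like dfs://<host>:<port>.
is_dfs_uri() {
  uri=$1
  case "$uri" in
    dfs://*:*) ;;
    *) return 1 ;;
  esac
  rest=${uri#dfs://}              # strip the scheme
  host=${rest%%:*}                # text before the first colon
  port=${rest##*:}                # text after the last colon
  port=${port%%/*}                # drop any trailing path
  [ -n "$host" ] || return 1
  case "$port" in
    ''|*[!0-9]*) return 1 ;;      # port must be all digits
  esac
  return 0
}
```

A possible use, assuming the `hdfs` command is on your PATH and reports the configured value: `check_hadoop_env && is_dfs_uri "$(hdfs getconf -confKey fs.defaultFS)" && echo "configuration looks consistent"`.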
Step 3: Configure Scala
- Decompress the Scala package to a specified directory.
    # Replace /usr/local/ with the actual path.
    tar -zxf scala-2.12.11.tgz -C /usr/local
- Configure environment variables.
  - Open the /etc/profile configuration file.
      vim /etc/profile
  - Configure SCALA_HOME in the /etc/profile configuration file.
      export SCALA_HOME=/usr/local/scala-2.12.11
      export PATH=$PATH:${SCALA_HOME}/bin
  - Apply the configuration.
      source /etc/profile
- Verify the Scala configuration.
    scala -version
    scala
  Output similar to the following indicates a successful configuration.
    [root@iZ8vb2q81658mubcgf0yprZ local]# scala -version
    Scala code runner version 2.12.11 -- Copyright 2002-2020, LAMP/EPFL and Lightbend, Inc.
    [root@iZ8vb2q81658mubcgf0yprZ local]# scala
    Welcome to Scala 2.12.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_421).
    Type in expressions for evaluation. Or try :help.
    scala>
Step 4: Configure Apache Spark
- Decompress the Apache Spark package to a specified directory.
    tar -zxf spark-2.4.8-bin-hadoop2.7.tgz -C /usr/local/
- Copy the Apsara File Storage for HDFS Java SDK to Spark's jars directory.
    cp aliyun-sdk-dfs-x.y.z.jar /usr/local/spark-2.4.8-bin-hadoop2.7/jars
  Note: For additional Spark configurations, see the official Spark documentation.
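Step 5 invokes `${SPARK_HOME}/bin/spark-shell`, but the steps above do not define SPARK_HOME. One option, mirroring the SCALA_HOME setup in Step 3, is to add lines like the following to /etc/profile and then run `source /etc/profile`. The path is an assumption based on the decompression target used in this step; adjust it if you extracted Spark elsewhere.

```shell
# Assumes Spark was decompressed to /usr/local/ as in the step above.
export SPARK_HOME=/usr/local/spark-2.4.8-bin-hadoop2.7
# Make spark-shell and spark-submit available without the full path.
export PATH=$PATH:${SPARK_HOME}/bin
```

Appending to PATH, as done for Scala, keeps any existing tools of the same name ahead of the Spark binaries.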
Step 5: Verify Apache Spark configuration
Use Spark to read a file from Apsara File Storage for HDFS, run a WordCount job, and write the results to Apsara File Storage for HDFS.
- Generate test data on Apsara File Storage for HDFS.
    hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      randomtextwriter \
      -D mapreduce.randomtextwriter.totalbytes=10240 \
      -D mapreduce.randomtextwriter.bytespermap=1024 \
      dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/input
  Note: f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com is the mount target address for Apsara File Storage for HDFS. Replace it with your actual address.
- Start spark-shell and run the WordCount job. SPARK_HOME refers to the Spark installation directory, which is /usr/local/spark-2.4.8-bin-hadoop2.7 in this topic.
    ${SPARK_HOME}/bin/spark-shell --master yarn
    scala> val res = sc.textFile("dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/input").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
    scala> res.collect.foreach(println)
    scala> res.saveAsTextFile("dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output")
- View the results on Apsara File Storage for HDFS.
    hadoop fs -ls dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output
  Output similar to the following indicates a successful configuration.
    Found 11 items
    -rwxrwxrwx 3 root root 0 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/_SUCCESS
    -rwxrwxrwx 3 root root 1215 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00000
    -rwxrwxrwx 3 root root 1171 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00001
    -rwxrwxrwx 3 root root 1405 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00002
    -rwxrwxrwx 3 root root 1532 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00003
    -rwxrwxrwx 3 root root 1008 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00004
    -rwxrwxrwx 3 root root 1061 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00005
    -rwxrwxrwx 3 root root 1381 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00006
    -rwxrwxrwx 3 root root 1497 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00007
    -rwxrwxrwx 3 root root 1439 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00008
    -rwxrwxrwx 3 root root 1294 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00009
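To see what the WordCount chain in this step computes, it may help to run a local analog on a few lines of text. The sketch below is not Spark and not part of the original procedure: `word_count` is a hypothetical helper built from standard tr, awk, and sort, and it prints pairs in the same `(word,count)` form that Spark uses for tuples. Unlike `split(" ")` in the Spark job, it collapses repeated spaces and produces no empty tokens.

```shell
# Hypothetical local analog of the Spark WordCount chain, for small text only.
word_count() {
  tr -s ' ' '\n' |              # like flatMap(_.split(" ")): one word per line
    awk 'NF { count[$0]++ }     # like map((_,1)).reduceByKey(_+_): tally each word
         END { for (w in count) print "(" w "," count[w] ")" }' |
    sort                        # stable order for reading; Spark output is unordered
}

# Example: count the words in two short lines.
printf 'a b a\nb c\n' | word_count
```

For the sample input, the words a and b each occur twice and c occurs once. Comparing a small known input against the job's output this way is a quick check that the pipeline, not just the file listing, behaves as expected.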