Install and use Apache Spark on a Hadoop cluster with an Apsara File Storage for HDFS file system mounted

This topic describes how to install and use Apache Spark on a Hadoop cluster with an Apsara File Storage for HDFS file system mounted.

Prerequisites

You have activated Apsara File Storage for HDFS and created a file system instance and a mount target. For more information, see Quick Start for Apsara File Storage for HDFS.
You have installed JDK 1.8 or later on all nodes of the Hadoop cluster.
You have downloaded the Apache Hadoop package. We recommend that you use Hadoop 2.7.2 or later. This topic uses Apache Hadoop 2.7.2.
You have downloaded the Scala package. This topic uses Scala 2.12.11.
You have downloaded the Apache Spark package. This topic uses Apache Spark 2.4.8.

Step 1: Configure Hadoop

Decompress the Hadoop package to a specified directory.

# Replace /usr/local/ with the actual path.
tar -zxf hadoop-2.7.2.tar.gz -C /usr/local/

Modify the hadoop-env.sh configuration file.
1. Open the hadoop-env.sh configuration file.
```
vim /usr/local/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
```
2. Set the JAVA_HOME path.
```
export JAVA_HOME=/usr/java/default
```
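Before you continue, it can help to confirm that the JAVA_HOME path really contains a JDK. The following is a minimal sketch, not part of the official procedure. The function name `check_java_home` is hypothetical, and `/usr/java/default` is only the example path used in this topic.

```shell
# Hypothetical helper: succeeds only if the given directory contains an
# executable bin/java, which is what Hadoop expects to find under JAVA_HOME.
# $1 = candidate JAVA_HOME directory
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "OK: $1/bin/java is executable"
    else
        echo "ERROR: no executable java found under $1/bin" >&2
        return 1
    fi
}
```

For example, `check_java_home /usr/java/default` would be expected to print an OK line on a node where the JDK is installed at that path.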

Modify the core-site.xml configuration file.

Open the core-site.xml configuration file.

vim /usr/local/hadoop-2.7.2/etc/hadoop/core-site.xml

Add the following properties to the core-site.xml configuration file. For more information, see Mount an Apsara File Storage HDFS file system.

<configuration>
    <property>
         <name>fs.defaultFS</name>
         <value>dfs://x-xxxxxxxx.cn-xxxxx.dfs.aliyuncs.com:10290</value>
         <!-- Enter the address of your mount target. -->
    </property>
    <property>
         <name>fs.dfs.impl</name>
         <value>com.alibaba.dfs.DistributedFileSystem</value>
    </property>
    <property>
         <name>fs.AbstractFileSystem.dfs.impl</name>
         <value>com.alibaba.dfs.DFS</value>
    </property>
</configuration>
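To double-check the edit, you could read the value back from the file. The following sketch assumes that each `<name>` and its `<value>` are on separate lines, as in the listing above. It is not a full XML parser, and the function names `get_hadoop_property` and `check_default_fs` are hypothetical.

```shell
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop XML configuration file. Assumes one element per line.
# $1 = configuration file, $2 = property name
get_hadoop_property() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}

# Hypothetical check: the default file system should use the dfs:// scheme.
# $1 = path to core-site.xml
check_default_fs() {
    case "$(get_hadoop_property "$1" fs.defaultFS)" in
        dfs://*) echo "OK: fs.defaultFS uses the dfs:// scheme" ;;
        *) echo "ERROR: fs.defaultFS is missing or not a dfs:// URI" >&2; return 1 ;;
    esac
}
```

For example, `check_default_fs /usr/local/hadoop-2.7.2/etc/hadoop/core-site.xml` would be expected to succeed after the properties above are added. Once the Hadoop environment variables are configured, `hdfs getconf -confKey fs.defaultFS` is another way to read the effective value.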

Modify the yarn-site.xml configuration file.

Open the yarn-site.xml configuration file.

vim /usr/local/hadoop-2.7.2/etc/hadoop/yarn-site.xml

Add the following properties to the yarn-site.xml configuration file.

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>xxxx</value>
        <!-- Enter the hostname of the ResourceManager in your cluster. -->
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>16384</value>
        <!-- Configure this parameter based on the capacity of your cluster. -->
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
        <!-- Configure this parameter based on the capacity of your cluster. -->
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>4</value>
        <!-- Configure this parameter based on the capacity of your cluster. -->
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>3584</value>
        <!-- Configure this parameter based on the capacity of your cluster. -->
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>14336</value>
        <!-- Configure this parameter based on the capacity of your cluster. -->
    </property>
</configuration>
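The memory values above are related: the minimum allocation should not exceed the maximum allocation, and a container request larger than the NodeManager memory could never be placed on that node. The following sketch checks these relationships as a rule of thumb, using the sample values from this topic. The function name `check_yarn_memory` is hypothetical.

```shell
# Hypothetical helper: sanity-check the YARN memory settings (all in MB).
# $1 = yarn.nodemanager.resource.memory-mb
# $2 = yarn.scheduler.minimum-allocation-mb
# $3 = yarn.scheduler.maximum-allocation-mb
check_yarn_memory() {
    if [ "$2" -gt "$3" ]; then
        echo "WARNING: minimum allocation ($2) exceeds maximum allocation ($3)" >&2
        return 1
    fi
    if [ "$3" -gt "$1" ]; then
        echo "WARNING: maximum allocation ($3) exceeds NodeManager memory ($1)" >&2
        return 1
    fi
    # Integer division: how many minimum-sized containers fit on one node.
    echo "minimum-sized containers per node: $(( $1 / $2 ))"
}

# Sample values from the yarn-site.xml listing in this topic.
check_yarn_memory 16384 3584 14336
# prints "minimum-sized containers per node: 4"
```

With the sample values, the maximum allocation of 14336 MB is exactly four times the minimum allocation of 3584 MB, and four minimum-sized containers fit into the 16384 MB available to each NodeManager.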

Modify the slaves configuration file.
1. Open the slaves configuration file.
```
vim /usr/local/hadoop-2.7.2/etc/hadoop/slaves
```
2. Add the hostnames of the compute nodes in your cluster to the slaves configuration file.
```
cluster-header-1
cluster-worker-1
```

Configure environment variables.

Open the /etc/profile configuration file.
```
vim /etc/profile
```

Configure HADOOP_HOME in the /etc/profile configuration file.

export HADOOP_HOME=/usr/local/hadoop-2.7.2
export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Apply the configuration.
```
source /etc/profile
```
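After you apply the configuration, you may want to confirm that the variables are visible in the current shell. The following is a minimal sketch. The function name `check_hadoop_env` is hypothetical, and the variable names are the ones exported in /etc/profile above.

```shell
# Hypothetical helper: report whether the Hadoop variables from /etc/profile
# are set in the current shell. Returns non-zero if any is unset or empty.
check_hadoop_env() {
    status=0
    for var in HADOOP_HOME HADOOP_CONF_DIR HADOOP_CLASSPATH; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -n "$value" ]; then
            echo "$var=$value"
        else
            echo "ERROR: $var is not set" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running `check_hadoop_env` after `source /etc/profile` should list all three variables. Running `command -v hadoop` additionally confirms that `$HADOOP_HOME/bin` is on the PATH.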

Install the Java SDK for Apsara File Storage for HDFS.
Download the latest Java SDK for Apsara File Storage for HDFS and copy it to a directory on the CLASSPATH of your Hadoop ecosystem components. For more information, see Mount an Apsara File Storage HDFS file system.
```
cp aliyun-sdk-dfs-x.y.z.jar  /usr/local/hadoop-2.7.2/share/hadoop/hdfs
```
Note
Replace x.y.z with the actual version number of the Java SDK.
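To confirm that the SDK is in place, you could list the matching JAR files in the target directory. The following sketch assumes the `aliyun-sdk-dfs-*.jar` file name pattern used in this topic. The function name `find_dfs_sdk` is hypothetical.

```shell
# Hypothetical helper: print the SDK JAR files found in a directory and
# return non-zero if there are none.
# $1 = directory to inspect
find_dfs_sdk() {
    found=1
    for jar in "$1"/aliyun-sdk-dfs-*.jar; do
        # If nothing matches, the pattern stays literal and -f is false.
        if [ -f "$jar" ]; then
            echo "$jar"
            found=0
        fi
    done
    if [ "$found" -ne 0 ]; then
        echo "ERROR: no aliyun-sdk-dfs-*.jar in $1" >&2
    fi
    return "$found"
}
```

For example, `find_dfs_sdk /usr/local/hadoop-2.7.2/share/hadoop/hdfs` would be expected to print the path of the JAR file that you copied.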
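Synchronize the ${HADOOP_HOME} directory to the same path on the other nodes in the cluster, and then configure the Hadoop environment variables on those nodes as described in "Configure environment variables" earlier in this step. The following command uses a relative source path, so run it from /usr/local.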
```
scp -r hadoop-2.7.2/ hadoop@cluster-worker-1:/usr/local/
```
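If the cluster has many workers, the copy can be driven from the slaves file. The following sketch only prints the scp commands so that you can review them before running anything. The function name `print_sync_commands` is hypothetical, and the paths and the `hadoop` user are the examples used in this topic.

```shell
# Hypothetical helper: print (not run) one scp command per host listed in
# the slaves file, skipping blank lines, comment lines, and the local host.
# $1 = slaves file, $2 = local hostname to skip, $3 = remote user
print_sync_commands() {
    while IFS= read -r host || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*|"$2") continue ;;
        esac
        echo "scp -r /usr/local/hadoop-2.7.2/ $3@$host:/usr/local/"
    done < "$1"
}
```

For example, `print_sync_commands /usr/local/hadoop-2.7.2/etc/hadoop/slaves cluster-header-1 hadoop` would print one command for cluster-worker-1 when the slaves file contains the two sample hostnames.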

Step 2: Verify Hadoop configuration

After you configure Hadoop, you do not need to format the NameNode or run the start-dfs.sh script to start HDFS-related services. If you use the YARN service, start it only on the ResourceManager node. For detailed instructions on how to verify the Hadoop configuration, see Verify the installation.
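As a quick smoke test before the full verification, you could create, list, and remove a scratch directory on the mounted file system. The following is a sketch under the assumption that `fs.defaultFS` points to your mount target. The function name `verify_dfs_access` and the scratch path are hypothetical.

```shell
# Hypothetical helper: create, list, and remove a scratch directory on the
# default file system to confirm that the mount target is reachable.
# $1 = scratch directory; an empty path or "/" is rejected.
verify_dfs_access() {
    case "$1" in
        ''|/) echo "ERROR: refusing to use '$1' as scratch directory" >&2; return 1 ;;
    esac
    hadoop fs -mkdir -p "$1" &&
        hadoop fs -ls "$1" &&
        hadoop fs -rm -r "$1"
}
```

For example, `verify_dfs_access /tmp/dfs-smoke-test` returns zero only if all three commands succeed. Choose a path that does not hold real data, because the directory is deleted at the end.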

Step 3: Configure Scala

Decompress the Scala package to a specified directory.

# Replace /usr/local/ with the actual path.
tar -zxf scala-2.12.11.tgz -C /usr/local

Configure environment variables.
1. Open the /etc/profile configuration file.
```
vim /etc/profile
```
2. Configure SCALA_HOME in the /etc/profile configuration file.
```
export SCALA_HOME=/usr/local/scala-2.12.11
export PATH=$PATH:${SCALA_HOME}/bin
```
3. Apply the configuration.
```
source /etc/profile
```

Verify the Scala configuration.

scala -version
scala

Output similar to the following indicates a successful configuration.

[root@iZ8vb2q81658mubcgf0yprZ local]# scala -version
Scala code runner version 2.12.11 -- Copyright 2002-2020, LAMP/EPFL and Lightbend, Inc.
[root@iZ8vb2q81658mubcgf0yprZ local]# scala
Welcome to Scala 2.12.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_421).
Type in expressions for evaluation. Or try :help.
scala>

Step 4: Configure Apache Spark

Decompress the Apache Spark package to a specified directory.
```
tar -zxf spark-2.4.8-bin-hadoop2.7.tgz -C /usr/local/
```
Copy the Apsara File Storage for HDFS Java SDK to Spark's jars directory.
```
cp aliyun-sdk-dfs-x.y.z.jar  /usr/local/spark-2.4.8-bin-hadoop2.7/jars
```
Note
For additional Spark configurations, see the official Spark documentation.
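The verification step later in this topic calls `${SPARK_HOME}/bin/spark-shell`, but this step does not set that variable. One possible way to define it, assuming Spark was decompressed to the directory used in this topic, is to add lines like the following to /etc/profile and then run `source /etc/profile`.

```shell
# Assumption: Spark was decompressed to /usr/local as shown in this step.
export SPARK_HOME=/usr/local/spark-2.4.8-bin-hadoop2.7
# Optional: put the Spark launch scripts on the PATH.
export PATH=$SPARK_HOME/bin:$PATH
```

Adjust the path if you decompressed the package elsewhere.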

Step 5: Verify Apache Spark configuration

Use Spark to read a file from Apsara File Storage for HDFS, run a WordCount job, and write the results to Apsara File Storage for HDFS.

Generate test data on Apsara File Storage for HDFS.

hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
randomtextwriter \
-D mapreduce.randomtextwriter.totalbytes=10240 \
-D mapreduce.randomtextwriter.bytespermap=1024 \
dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/input

Note

f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com is the mount target address for Apsara File Storage for HDFS. Replace it with your actual address.
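The two `-D` options determine how the test data is split. The following sketch assumes that the example job derives the number of map tasks by dividing the total byte count by the bytes per map, so treat the result as an estimate. The function name `expected_map_tasks` is hypothetical.

```shell
# Hypothetical helper: number of map tasks implied by the two settings,
# assuming the job divides the total byte count by the bytes per map.
# $1 = mapreduce.randomtextwriter.totalbytes
# $2 = mapreduce.randomtextwriter.bytespermap
expected_map_tasks() {
    echo $(( $1 / $2 ))
}

# Values from the command in this topic.
expected_map_tasks 10240 1024
# prints 10
```

Under that assumption, the command above would write roughly ten small files of about 1 KB each to the input directory.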

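Start spark-shell and run the WordCount job. The command assumes that SPARK_HOME points to the Spark installation directory, for example /usr/local/spark-2.4.8-bin-hadoop2.7.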

${SPARK_HOME}/bin/spark-shell --master yarn
scala> val res = sc.textFile("dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/input").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
scala> res.collect.foreach(println)
scala> res.saveAsTextFile("dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output")
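To see what the job computes, the same counting logic can be approximated locally on a small plain-text sample. The following sketch is not a replacement for the Spark job. The function name `local_word_count` is hypothetical, and unlike `split(" ")` it drops empty tokens and prints the pairs sorted by word.

```shell
# Hypothetical helper: count space-separated words on standard input and
# print them as (word,count) pairs, sorted by word. Empty tokens are dropped.
local_word_count() {
    tr ' ' '\n' | grep -v '^$' | sort | uniq -c | awk '{ print "(" $2 "," $1 ")" }'
}

printf 'spark on yarn\nspark on dfs\n' | local_word_count
# prints:
# (dfs,1)
# (on,2)
# (spark,2)
# (yarn,1)
```

The `(word,count)` shape mirrors the tuples that `res.collect.foreach(println)` prints in spark-shell, although Spark does not guarantee any particular order.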

View the results on Apsara File Storage for HDFS.

hadoop fs -ls dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output

Output similar to the following indicates a successful configuration.

Found 11 items
-rwxrwxrwx   3 root root          0 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/_SUCCESS
-rwxrwxrwx   3 root root       1215 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00000
-rwxrwxrwx   3 root root       1171 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00001
-rwxrwxrwx   3 root root       1405 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00002
-rwxrwxrwx   3 root root       1532 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00003
-rwxrwxrwx   3 root root       1008 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00004
-rwxrwxrwx   3 root root       1061 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00005
-rwxrwxrwx   3 root root       1381 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00006
-rwxrwxrwx   3 root root       1497 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00007
-rwxrwxrwx   3 root root       1439 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00008
-rwxrwxrwx   3 root root       1294 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00009
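In a script, the `_SUCCESS` marker shown in the listing is a convenient signal that the job finished. The following sketch relies on `hadoop fs -test -e`, which exits with 0 when the path exists. The function name `job_succeeded` is hypothetical.

```shell
# Hypothetical helper: succeed only if the job left a _SUCCESS marker in the
# output directory.
# $1 = output directory URI
job_succeeded() {
    if hadoop fs -test -e "$1/_SUCCESS"; then
        echo "OK: $1/_SUCCESS exists"
    else
        echo "ERROR: $1/_SUCCESS not found" >&2
        return 1
    fi
}
```

For example, call `job_succeeded` with the output URI used above, replacing the mount target address with your own. To inspect the counts themselves, you can print the first lines of one part file with `hadoop fs -cat` piped to `head`.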