Use Apache Spark with Apsara File Storage for HDFS

This topic describes how to install and use Apache Spark on a Hadoop cluster that has an Apsara File Storage for HDFS file system mounted.

Prerequisites

Step 1: Configure Hadoop

  1. Decompress the Hadoop package to a specified directory.

    # Replace /usr/local/ with the actual path.
    tar -zxf hadoop-2.7.2.tar.gz -C /usr/local/
  2. Modify the hadoop-env.sh configuration file.

    1. Open the hadoop-env.sh configuration file.

      vim /usr/local/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
    2. Set the JAVA_HOME path.

      export JAVA_HOME=/usr/java/default
  3. Modify the core-site.xml configuration file.

    1. Open the core-site.xml configuration file.

      vim /usr/local/hadoop-2.7.2/etc/hadoop/core-site.xml
    2. Add the following properties to the core-site.xml configuration file. For more information, see Mount an Apsara File Storage for HDFS file system.

      <configuration>
          <property>
               <name>fs.defaultFS</name>
               <value>dfs://x-xxxxxxxx.cn-xxxxx.dfs.aliyuncs.com:10290</value>
               <!-- Enter the address of your mount target. -->
          </property>
          <property>
               <name>fs.dfs.impl</name>
               <value>com.alibaba.dfs.DistributedFileSystem</value>
          </property>
          <property>
               <name>fs.AbstractFileSystem.dfs.impl</name>
               <value>com.alibaba.dfs.DFS</value>
          </property>
      </configuration>
  4. Modify the yarn-site.xml configuration file.

    1. Open the yarn-site.xml configuration file.

      vim /usr/local/hadoop-2.7.2/etc/hadoop/yarn-site.xml
    2. Add the following properties to the yarn-site.xml configuration file.

      <configuration>
          <property>
              <name>yarn.resourcemanager.hostname</name>
              <value>xxxx</value>
              <!-- Enter the hostname of the ResourceManager in your cluster. -->
          </property>
          <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce_shuffle</value>
          </property>
          <property>
              <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
              <value>org.apache.hadoop.mapred.ShuffleHandler</value>
          </property>
          <property>
              <name>yarn.nodemanager.resource.memory-mb</name>
              <value>16384</value>
              <!-- Configure this parameter based on the capacity of your cluster. -->
          </property>
          <property>
              <name>yarn.nodemanager.resource.cpu-vcores</name>
              <value>4</value>
              <!-- Configure this parameter based on the capacity of your cluster. -->
          </property>
          <property>
              <name>yarn.scheduler.maximum-allocation-vcores</name>
              <value>4</value>
              <!-- Configure this parameter based on the capacity of your cluster. -->
          </property>
          <property>
              <name>yarn.scheduler.minimum-allocation-mb</name>
              <value>3584</value>
              <!-- Configure this parameter based on the capacity of your cluster. -->
          </property>
          <property>
              <name>yarn.scheduler.maximum-allocation-mb</name>
              <value>14336</value>
              <!-- Configure this parameter based on the capacity of your cluster. -->
          </property>
      </configuration>
  5. Modify the slaves configuration file.

    1. Open the slaves configuration file.

      vim /usr/local/hadoop-2.7.2/etc/hadoop/slaves
    2. Add the hostnames of the compute nodes in your cluster to the slaves configuration file.

      cluster-header-1
      cluster-worker-1
  6. Configure environment variables.

    1. Open the /etc/profile configuration file.

      vim /etc/profile
    2. Add HADOOP_HOME and the related Hadoop environment variables to the /etc/profile configuration file.

      export HADOOP_HOME=/usr/local/hadoop-2.7.2
      export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
      export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
    3. Apply the configuration.

      source /etc/profile
  7. Install the Java SDK for Apsara File Storage for HDFS.

    Download the latest Java SDK for Apsara File Storage for HDFS and copy it to a directory on the CLASSPATH of your Hadoop ecosystem components. For more information, see Mount an Apsara File Storage for HDFS file system.

    cp aliyun-sdk-dfs-x.y.z.jar  /usr/local/hadoop-2.7.2/share/hadoop/hdfs
    Note

    Replace x.y.z with the actual version number of the Java SDK.

  8. Synchronize the ${HADOOP_HOME} directory to the same path on every other node in the cluster, and then repeat substep 6 (Configure environment variables) on each of those nodes.

    # Repeat this command for each additional node in the cluster.
    scp -r /usr/local/hadoop-2.7.2/ hadoop@cluster-worker-1:/usr/local/
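
The memory values in the yarn-site.xml example depend on one another: the scheduler minimum should not exceed the scheduler maximum, and the scheduler maximum should not exceed the memory that each NodeManager offers. The following is a minimal sketch of that arithmetic, not part of Hadoop. The function name check_yarn_memory and its argument order are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop tool): check that three YARN memory
# settings are mutually consistent. All arguments are in MB:
#   $1 = yarn.nodemanager.resource.memory-mb
#   $2 = yarn.scheduler.minimum-allocation-mb
#   $3 = yarn.scheduler.maximum-allocation-mb
check_yarn_memory() {
    node_mb=$1
    min_mb=$2
    max_mb=$3
    if [ "$min_mb" -gt "$max_mb" ]; then
        echo "error: minimum allocation ($min_mb MB) exceeds maximum allocation ($max_mb MB)"
        return 1
    fi
    if [ "$max_mb" -gt "$node_mb" ]; then
        echo "error: maximum allocation ($max_mb MB) exceeds NodeManager memory ($node_mb MB)"
        return 1
    fi
    # Integer division: how many minimum-size containers fit on one node.
    echo "ok: up to $((node_mb / min_mb)) minimum-size containers per node"
}

# Values taken from the yarn-site.xml example in this step.
check_yarn_memory 16384 3584 14336
```

With the example values, four containers of 3584 MB fit into 16384 MB, and the 14336 MB maximum equals exactly four minimum-size allocations. Adjust the arguments to match the capacity of your own nodes.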

Step 2: Verify Hadoop configuration

Because Apsara File Storage for HDFS provides the storage layer, you do not need to format the NameNode or run the start-dfs.sh script to start HDFS-related services after you configure Hadoop. If you use the YARN service, start it only on the ResourceManager node. For detailed instructions on how to verify the Hadoop configuration, see Verify the installation.
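
Before you run the verification, it can help to confirm that core-site.xml contains the three properties added in Step 1. The following is a minimal sketch under stated assumptions: check_core_site is a hypothetical helper, not a Hadoop command, and it expects each property name to be written as <name>...</name> with no surrounding whitespace, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop tool): report which of the property
# names required by this topic are missing from a core-site.xml file.
# Returns 0 if all three are present, 1 otherwise.
check_core_site() {
    file=$1
    missing=0
    for name in fs.defaultFS fs.dfs.impl fs.AbstractFileSystem.dfs.impl; do
        # -F matches the string literally, so dots are not regex wildcards.
        if ! grep -qF "<name>${name}</name>" "$file"; then
            echo "missing property: ${name}"
            missing=1
        fi
    done
    return "$missing"
}

# Run the check only if the file exists at the path configured in Step 1.
conf_file="${HADOOP_CONF_DIR:-/usr/local/hadoop-2.7.2/etc/hadoop}/core-site.xml"
if [ -f "$conf_file" ]; then
    if check_core_site "$conf_file"; then
        echo "core-site.xml contains all three properties"
    fi
fi
```

The check looks only at property names. It does not validate the mount target address or confirm that the SDK classes can be loaded.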

Step 3: Configure Scala

  1. Decompress the Scala package to a specified directory.

    # Replace /usr/local/ with the actual path.
    tar -zxf scala-2.12.11.tgz -C /usr/local
  2. Configure environment variables.

    1. Open the /etc/profile configuration file.

      vim /etc/profile
    2. Add SCALA_HOME to the /etc/profile configuration file and append its bin directory to PATH.

      export SCALA_HOME=/usr/local/scala-2.12.11
      export PATH=$PATH:${SCALA_HOME}/bin
    3. Apply the configuration.

      source /etc/profile
  3. Verify the Scala configuration.

    scala -version
    scala

    Output similar to the following indicates a successful configuration.

    [root@iZ8vb2q81658mubcgf0yprZ local]# scala -version
    Scala code runner version 2.12.11 -- Copyright 2002-2020, LAMP/EPFL and Lightbend, Inc.
    [root@iZ8vb2q81658mubcgf0yprZ local]# scala
    Welcome to Scala 2.12.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_421).
    Type in expressions for evaluation. Or try :help.
    scala>

Step 4: Configure Apache Spark

  1. Decompress the Apache Spark package to a specified directory.

    tar -zxf spark-2.4.8-bin-hadoop2.7.tgz -C /usr/local/
  2. Copy the Apsara File Storage for HDFS Java SDK to Spark's jars directory.

    cp aliyun-sdk-dfs-x.y.z.jar  /usr/local/spark-2.4.8-bin-hadoop2.7/jars
    Note

    For additional Spark configurations, see the official Spark documentation.
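
The SDK JAR file must be present in both the Hadoop directory from Step 1 and the Spark jars directory from this step. The following is a minimal sketch that checks both locations. The helper name has_sdk_jar is hypothetical, and the two paths assume the installation directories used in this topic.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the directory contains at least one
# file named aliyun-sdk-dfs-<version>.jar.
has_sdk_jar() {
    dir=$1
    for jar in "$dir"/aliyun-sdk-dfs-*.jar; do
        # If the glob matches nothing, the loop sees the literal pattern;
        # the -f test filters that case out.
        [ -f "$jar" ] && return 0
    done
    return 1
}

# Paths assume the installation directories used in this topic.
for dir in /usr/local/hadoop-2.7.2/share/hadoop/hdfs \
           /usr/local/spark-2.4.8-bin-hadoop2.7/jars; do
    if has_sdk_jar "$dir"; then
        echo "found SDK jar in $dir"
    else
        echo "no SDK jar in $dir"
    fi
done
```

Run the check on every node in the cluster, because each node needs its own copy of the JAR file.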

Step 5: Verify Apache Spark configuration

Use Spark to read a file from Apsara File Storage for HDFS, run a WordCount job, and write the results to Apsara File Storage for HDFS.

  1. Generate test data on Apsara File Storage for HDFS.

    hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    randomtextwriter \
    -D mapreduce.randomtextwriter.totalbytes=10240 \
    -D mapreduce.randomtextwriter.bytespermap=1024 \
    dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/input
    Note

    f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com is the mount target address for Apsara File Storage for HDFS. Replace it with your actual address.

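  2. Start spark-shell and run the WordCount job. The command assumes that SPARK_HOME is set to the Spark installation directory, for example /usr/local/spark-2.4.8-bin-hadoop2.7.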

    ${SPARK_HOME}/bin/spark-shell --master yarn
    scala> val res = sc.textFile("dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/input").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
    scala> res.collect.foreach(println)
    scala> res.saveAsTextFile("dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output")
  3. View the results on Apsara File Storage for HDFS.

    hadoop fs -ls dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output

    Output similar to the following indicates a successful configuration.

    Found 11 items
    -rwxrwxrwx   3 root root          0 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/_SUCCESS
    -rwxrwxrwx   3 root root       1215 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00000
    -rwxrwxrwx   3 root root       1171 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00001
    -rwxrwxrwx   3 root root       1405 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00002
    -rwxrwxrwx   3 root root       1532 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00003
    -rwxrwxrwx   3 root root       1008 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00004
    -rwxrwxrwx   3 root root       1061 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00005
    -rwxrwxrwx   3 root root       1381 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00006
    -rwxrwxrwx   3 root root       1497 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00007
    -rwxrwxrwx   3 root root       1439 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00008
    -rwxrwxrwx   3 root root       1294 2021-11-25 14:14 dfs://f-xxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output/part-00009
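
To check the listing in a script rather than by eye, you can look for the _SUCCESS marker and count the part files. The following is a minimal sketch. The helper name summarize_output is hypothetical, and it assumes that each line of the listing ends with the file path, as in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper: read `hadoop fs -ls` output on stdin, count the
# part-NNNNN files, and report whether the _SUCCESS marker is present.
# Returns 0 if _SUCCESS was found, 1 otherwise.
summarize_output() {
    awk '
        /\/_SUCCESS$/    { success = 1 }
        /\/part-[0-9]+$/ { parts++ }
        END {
            printf "parts=%d success=%s\n", parts, (success ? "yes" : "no")
            exit (success ? 0 : 1)
        }
    '
}

# Example usage (requires a configured Hadoop client):
#   hadoop fs -ls dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/output | summarize_output
```

For the sample listing above, the helper would print parts=10 success=yes. The number of part files depends on how many partitions the job produces, so treat the count as informational.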