
Accessing HDFS Data from ECI

Last updated: 2019-12-10 20:18:27

Data Preparation

HDFS is one of the most commonly used data stores for Hadoop/Spark batch jobs, and Alibaba Cloud HDFS is now in public beta. This article demonstrates how to create a file in HDFS and then access it from a Spark application.

1. Activate the HDFS service and create a file system.
2. Configure a permission group.

1. Create a permission group.

2. Configure the rules of the permission group (eci-hdfs-3).

3. Add the permission group to the mount point.

The HDFS file system is now ready.

3. Install the Apache Hadoop client.

Once the HDFS file system is ready, the next step is to write data into it. We do this with the HDFS client.

Apache Hadoop download address: official link. Apache Hadoop 2.7.2 or later is recommended; this document uses Apache Hadoop 2.7.2.
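
If you do not already have the release tarball locally, it can be fetched from the Apache release archive, for example (the mirror URL below is an assumption; any official mirror works):

  # Download the Apache Hadoop 2.7.2 release tarball (example mirror)
  wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz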

1. Run the following command to extract the Apache Hadoop archive to the target directory.

  tar -zxvf hadoop-2.7.2.tar.gz -C /usr/local/

2. Run the following command to open the core-site.xml configuration file.

  vim /usr/local/hadoop-2.7.2/etc/hadoop/core-site.xml

Modify core-site.xml as follows:

  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
      http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
  -->
  <!-- Put site-specific property overrides in this file. -->
  <configuration>
      <property>
          <name>fs.defaultFS</name>
          <value>dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290</value>
          <!-- Replace this value with your own HDFS mount point address -->
      </property>
      <property>
          <name>fs.dfs.impl</name>
          <value>com.alibaba.dfs.DistributedFileSystem</value>
      </property>
      <property>
          <name>fs.AbstractFileSystem.dfs.impl</name>
          <value>com.alibaba.dfs.DFS</value>
      </property>
      <property>
          <name>io.file.buffer.size</name>
          <value>8388608</value>
      </property>
      <property>
          <name>alidfs.use.buffer.size.setting</name>
          <value>false</value>
          <!-- Recommended to leave disabled; in our tests, enabling it significantly reduced the I/O size and hurt throughput -->
      </property>
      <property>
          <name>dfs.usergroupservice.impl</name>
          <value>com.alibaba.dfs.security.LinuxUserGroupService.class</value>
      </property>
      <property>
          <name>dfs.connection.count</name>
          <value>256</value>
      </property>
  </configuration>

Note: since we are running on Kubernetes, the YARN-related configuration items are not needed; only the HDFS-related items above have to be set. The modified core-site.xml file will be reused in several later steps.

3. Run the following command to open the /etc/profile configuration file.

  vim /etc/profile

Add the following environment variables:

  export HADOOP_HOME=/usr/local/hadoop-2.7.2
  export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.2/etc/hadoop:/usr/local/hadoop-2.7.2/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/common/*:/usr/local/hadoop-2.7.2/share/hadoop/hdfs:/usr/local/hadoop-2.7.2/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.2/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/yarn/*:/usr/local/hadoop-2.7.2/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.2/contrib/capacity-scheduler/*.jar
  export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.2/etc/hadoop

Run the following command to make the configuration take effect.

  source /etc/profile

Note: only an HDFS client is needed; there is no need to deploy an HDFS cluster.
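
As a quick sanity check (a suggestion of ours, not part of the original steps), you can confirm that the client and the environment variables are picked up correctly:

  # Print the Hadoop version; this verifies HADOOP_HOME and the classpath settings
  $HADOOP_HOME/bin/hadoop version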

4. Add the Alibaba Cloud HDFS dependency.

  cp aliyun-sdk-dfs-1.0.3.jar /usr/local/hadoop-2.7.2/share/hadoop/hdfs

Download address: download the File Storage HDFS SDK here.
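
With the SDK jar in place, you can optionally verify that the client can reach the mount point configured in core-site.xml (a sanity check we suggest, simply listing the file system root):

  # List the root of the DFS mount point to confirm connectivity
  $HADOOP_HOME/bin/hadoop fs -ls /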

4. Upload data

  # Create the data directory
  [root@liumi-hdfs ~]# $HADOOP_HOME/bin/hadoop fs -mkdir -p /pod/data
  # Upload the locally prepared file (the text of a novel) to HDFS
  [root@liumi-hdfs ~]# $HADOOP_HOME/bin/hadoop fs -put ./A-Game-of-Thrones.txt /pod/data/A-Game-of-Thrones.txt
  # Check the result; the file is about 30 GB
  [root@liumi-hdfs local]# $HADOOP_HOME/bin/hadoop fs -ls /pod/data
  Found 1 items
  -rwxrwxrwx 3 root root 33710040000 2019-11-10 13:02 /pod/data/A-Game-of-Thrones.txt

At this point the HDFS data preparation is complete.

Reading HDFS Data in a Spark Application

1. Develop the application

Application development is no different from development for a traditional deployment.

  SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
  JavaSparkContext sc = new JavaSparkContext(conf);
  JavaRDD<String> lines = sc.textFile("dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones.txt", 250);
  ...
  wordsCountResult.saveAsTextFile("dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones-Result");
  sc.close();
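
The word-count transformation itself is elided above. A minimal sketch of what it might look like is given below (the whitespace split and the variable name are assumptions for illustration, not the original code; it needs java.util.Arrays, scala.Tuple2 and org.apache.spark.api.java.JavaPairRDD):

  // Hypothetical completion of the elided word-count step, for illustration only
  JavaPairRDD<String, Integer> wordsCountResult = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split each line into words
          .mapToPair(word -> new Tuple2<>(word, 1))                      // emit (word, 1) pairs
          .reduceByKey(Integer::sum);                                    // sum the counts per word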

2. Put the core-site.xml from the previous section into the resources directory of the application project.

  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
      http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
  -->
  <!-- Put site-specific property overrides in this file. -->
  <!-- HDFS configuration -->
  <configuration>
      <property>
          <name>fs.defaultFS</name>
          <value>dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290</value>
          <!-- Replace this value with your own HDFS mount point address -->
      </property>
      <property>
          <name>fs.dfs.impl</name>
          <value>com.alibaba.dfs.DistributedFileSystem</value>
      </property>
      <property>
          <name>fs.AbstractFileSystem.dfs.impl</name>
          <value>com.alibaba.dfs.DFS</value>
      </property>
      <property>
          <name>io.file.buffer.size</name>
          <value>8388608</value>
      </property>
      <property>
          <name>alidfs.use.buffer.size.setting</name>
          <value>false</value>
          <!-- Recommended to leave disabled; in our tests, enabling it significantly reduced the I/O size and hurt throughput -->
      </property>
      <property>
          <name>dfs.usergroupservice.impl</name>
          <value>com.alibaba.dfs.security.LinuxUserGroupService.class</value>
      </property>
      <property>
          <name>dfs.connection.count</name>
          <value>256</value>
      </property>
  </configuration>

3. The packaged jar file must include all of the dependencies.

  mvn assembly:assembly

The application's pom.xml is attached below:

  <?xml version="1.0" encoding="UTF-8"?>
  <project xmlns="http://maven.apache.org/POM/4.0.0"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <groupId>com.aliyun.liumi.spark</groupId>
      <artifactId>SparkExampleJava</artifactId>
      <version>1.0-SNAPSHOT</version>

      <dependencies>
          <dependency>
              <groupId>org.apache.spark</groupId>
              <artifactId>spark-core_2.12</artifactId>
              <version>2.4.3</version>
          </dependency>

          <dependency>
              <groupId>com.aliyun.dfs</groupId>
              <artifactId>aliyun-sdk-dfs</artifactId>
              <version>1.0.3</version>
          </dependency>
      </dependencies>

      <build>
          <plugins>
              <plugin>
                  <groupId>org.apache.maven.plugins</groupId>
                  <artifactId>maven-assembly-plugin</artifactId>
                  <version>2.6</version>
                  <configuration>
                      <appendAssemblyId>false</appendAssemblyId>
                      <descriptorRefs>
                          <descriptorRef>jar-with-dependencies</descriptorRef>
                      </descriptorRefs>
                      <archive>
                          <manifest>
                              <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
                          </manifest>
                      </archive>
                  </configuration>
                  <executions>
                      <execution>
                          <id>make-assembly</id>
                          <phase>package</phase>
                          <goals>
                              <goal>assembly</goal>
                          </goals>
                      </execution>
                  </executions>
              </plugin>
          </plugins>
      </build>
  </project>
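
Optionally, you can check that the Aliyun DFS classes were actually bundled into the assembled jar (a sanity check we suggest; the jar path follows Maven's default target/ layout):

  # Verify that the fat jar contains the DFS implementation class
  jar tf target/SparkExampleJava-1.0-SNAPSHOT.jar | grep com/alibaba/dfs/DistributedFileSystem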

4. Write the Dockerfile

  # spark base image
  FROM registry.cn-hangzhou.aliyuncs.com/eci_open/spark:2.4.4
  # The default kubernetes-client version has issues; the latest version is recommended
  RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
  ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
  # Copy the local application jar
  RUN mkdir -p /opt/spark/jars
  COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars
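
The COPY instruction expects SparkExampleJava-1.0-SNAPSHOT.jar to be present in the Docker build context; if you build from a separate directory, copy the assembled jar there first, for example (the paths are an assumption based on Maven's default output location):

  # Copy the assembled application jar into the Docker build context (example paths)
  cp target/SparkExampleJava-1.0-SNAPSHOT.jar ./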

5. Build the application image

  docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .

6. Push the image to Alibaba Cloud ACR

  docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example
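
Pushing assumes that you have already logged in to the target Container Registry instance; if you have not, log in first (the registry endpoint matches the image tag above, and the credentials are your ACR account):

  # Log in to the ACR registry before pushing
  docker login registry.cn-beijing.aliyuncs.com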

At this point the image is ready. The next step is to deploy the Spark application in the Kubernetes cluster.