Accessing OSS Data from ECI

Last updated: 2019-12-13 09:34:32

OSS is one of the optional data stores for Hadoop/Spark batch jobs. This article demonstrates creating a file in OSS and then reading it from a Spark application.

Preparing the OSS data

Create a bucket
Upload the file

Files can be uploaded with the OSS SDK or through the HDFS interface; here we simply use the console as the example:

Note down the object's address and the endpoint:

oss://liumi-hust/A-Game-of-Thrones.txt
endpoint: oss-cn-hangzhou-internal.aliyuncs.com

With that, the OSS data preparation is complete.
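For reference, the upload can also be scripted instead of done in the console. Below is a minimal sketch using the Alibaba Cloud OSS Java SDK (aliyun-sdk-oss); the class name and credential placeholders are illustrative, while the bucket, key, and endpoint match the example above:

  import java.io.File;

  import com.aliyun.oss.OSS;
  import com.aliyun.oss.OSSClientBuilder;

  public class UploadExample {
      public static void main(String[] args) {
          // The internal endpoint is only reachable from inside Alibaba Cloud;
          // replace the credential placeholders with your own AccessKey pair.
          String endpoint = "oss-cn-hangzhou-internal.aliyuncs.com";
          OSS ossClient = new OSSClientBuilder().build(endpoint, "{AK_ID}", "{AK_SECRET}");
          try {
              // Upload the local file as oss://liumi-hust/A-Game-of-Thrones.txt
              ossClient.putObject("liumi-hust", "A-Game-of-Thrones.txt",
                      new File("A-Game-of-Thrones.txt"));
          } finally {
              ossClient.shutdown();
          }
      }
  }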

Reading OSS data in a Spark application

1. Develop the application

Developing the application is no different from the traditional deployment approach.

  SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
  JavaSparkContext sc = new JavaSparkContext(conf);
  JavaRDD<String> lines = sc.textFile("oss://liumi-hust/A-Game-of-Thrones.txt", 250);
  ...
  wordsCountResult.saveAsTextFile("oss://liumi-hust/A-Game-of-Thrones-result");
  sc.close();
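The "..." above elides the word-count transformation. A complete, self-contained sketch of how it could look (not necessarily the original author's exact code) follows:

  import java.util.Arrays;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  import scala.Tuple2;

  public class WordCount {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
          JavaSparkContext sc = new JavaSparkContext(conf);

          // Read the input file from OSS, split into 250 partitions.
          JavaRDD<String> lines = sc.textFile("oss://liumi-hust/A-Game-of-Thrones.txt", 250);

          // Split lines into words, pair each word with 1, then sum the counts per word.
          JavaPairRDD<String, Integer> wordsCountResult = lines
                  .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                  .mapToPair(word -> new Tuple2<>(word, 1))
                  .reduceByKey(Integer::sum);

          // Write the result back to OSS.
          wordsCountResult.saveAsTextFile("oss://liumi-hust/A-Game-of-Thrones-result");
          sc.close();
      }
  }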

2. Place the core-site.xml below in the resources directory of the application project

  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
  -->
  <!-- Put site-specific property overrides in this file. -->
  <configuration>
    <!-- OSS configuration -->
    <property>
      <name>fs.oss.impl</name>
      <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
    </property>
    <property>
      <name>fs.oss.endpoint</name>
      <value>oss-cn-hangzhou-internal.aliyuncs.com</value>
    </property>
    <property>
      <name>fs.oss.accessKeyId</name>
      <value>{temporary AK_ID}</value>
    </property>
    <property>
      <name>fs.oss.accessKeySecret</name>
      <value>{temporary AK_SECRET}</value>
    </property>
    <property>
      <name>fs.oss.buffer.dir</name>
      <value>/tmp/oss</value>
    </property>
    <property>
      <name>fs.oss.connection.secure.enabled</name>
      <value>false</value>
    </property>
    <property>
      <name>fs.oss.connection.maximum</name>
      <value>2048</value>
    </property>
  </configuration>
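Alternatively (a sketch, not part of the original walkthrough), the same settings can be applied programmatically on the job's Hadoop configuration instead of shipping a core-site.xml; the helper class and method names below are hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.spark.api.java.JavaSparkContext;

  public class OssConfig {
      // Hypothetical helper: applies the core-site.xml settings above at runtime.
      public static void configureOss(JavaSparkContext sc, String akId, String akSecret) {
          Configuration hadoopConf = sc.hadoopConfiguration();
          hadoopConf.set("fs.oss.impl", "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem");
          hadoopConf.set("fs.oss.endpoint", "oss-cn-hangzhou-internal.aliyuncs.com");
          hadoopConf.set("fs.oss.accessKeyId", akId);
          hadoopConf.set("fs.oss.accessKeySecret", akSecret);
      }
  }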

3. Package the jar with all dependencies included

  mvn assembly:assembly

For reference, the application's pom.xml:

  <?xml version="1.0" encoding="UTF-8"?>
  <project xmlns="http://maven.apache.org/POM/4.0.0"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.aliyun.liumi.spark</groupId>
    <artifactId>SparkExampleJava</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>2.4.3</version>
      </dependency>

      <dependency>
        <groupId>com.aliyun.dfs</groupId>
        <artifactId>aliyun-sdk-dfs</artifactId>
        <version>1.0.3</version>
      </dependency>
    </dependencies>

    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-assembly-plugin</artifactId>
          <version>2.6</version>
          <configuration>
            <appendAssemblyId>false</appendAssemblyId>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
            <archive>
              <manifest>
                <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
              </manifest>
            </archive>
          </configuration>
          <executions>
            <execution>
              <id>make-assembly</id>
              <phase>package</phase>
              <goals>
                <goal>assembly</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>
  </project>

4. Write the Dockerfile

OSS:

  # Spark base image
  FROM registry.cn-beijing.aliyuncs.com/eci_open/spark:2.4.4
  RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
  ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
  RUN mkdir -p /opt/spark/jars
  COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars
  # OSS dependency jars
  COPY aliyun-sdk-oss-3.4.1.jar /opt/spark/jars
  COPY hadoop-aliyun-2.7.3.2.6.1.0-129.jar /opt/spark/jars
  COPY jdom-1.1.jar /opt/spark/jars

For download links for the OSS dependency jars, see: https://help.aliyun.com/document_detail/97854.html

5. Build the application image

  docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .

6. Push the image to Alibaba Cloud ACR

  docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example

At this point, the Spark application image is ready. The next step is to deploy the Spark application in the Kubernetes cluster.