
Access OSS data in ECI


When you run batch jobs with Hadoop, Spark, or similar frameworks, you can use Object Storage Service (OSS) as the storage backend. This topic uses Spark as an example to demonstrate how to upload a file to OSS and access it from a Spark application.

Prepare data and upload it to OSS

  1. Log on to the OSS console.

  2. Create a bucket. For details, see Create buckets.

  3. Upload the file to OSS. For details, see Simple upload.

    After the file is uploaded, record its address in the OSS bucket (for example, oss://test***-hust/test.txt) and the OSS endpoint (for example, oss-cn-hangzhou-internal.aliyuncs.com).
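Besides the console, the upload can also be scripted. A minimal sketch using the ossutil command-line tool (an assumption here: ossutil is installed and already configured with your AccessKey; the bucket name follows the example above):

```shell
# Upload the local file to the example bucket
ossutil cp test.txt oss://test***-hust/test.txt

# List the bucket contents to confirm the upload
ossutil ls oss://test***-hust/
```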

Read OSS data in a Spark application

  1. Develop the application.

    SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> lines = sc.textFile("oss://test***-hust/test.txt", 250);
    ...
    wordsCountResult.saveAsTextFile("oss://test***-hust/test-result");
    sc.close();
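The word-count transformation elided above typically splits each line into words, pairs each word with a count, and sums per word. The same counting logic can be sketched in plain Java without a Spark dependency (the class name and sample lines here are hypothetical, for illustration only):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Count word occurrences the way the Spark job does:
    // split each line on whitespace, then sum occurrences per word.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello oss", "hello spark");
        Map<String, Long> counts = countWords(lines);
        System.out.println(counts.get("hello")); // 2
        System.out.println(counts.get("oss"));   // 1
    }
}
```

In the Spark version, the same result is reached with flatMap, mapToPair, and reduceByKey on the RDD, which distributes the counting across executors.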
  2. Configure the OSS information in the application.

    Note

    Replace the OSS endpoint, AccessKey ID, and AccessKey Secret with your actual values.

    • Method 1: Use a static configuration file

      Modify core-site.xml and place it in the resources directory of the application project.

      <?xml version="1.0" encoding="UTF-8"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <!--
        Licensed under the Apache License, Version 2.0 (the "License");
        you may not use this file except in compliance with the License.
        You may obtain a copy of the License at
          http://www.apache.org/licenses/LICENSE-2.0
        Unless required by applicable law or agreed to in writing, software
        distributed under the License is distributed on an "AS IS" BASIS,
        WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        See the License for the specific language governing permissions and
        limitations under the License. See accompanying LICENSE file.
      -->
      <!-- Put site-specific property overrides in this file. -->
      <configuration>
        <!-- OSS configuration -->
          <property>
              <name>fs.oss.impl</name>
              <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
          </property>
          <property>
              <name>fs.oss.endpoint</name>
              <value>oss-cn-hangzhou-internal.aliyuncs.com</value>
          </property>
          <property>
              <name>fs.oss.accessKeyId</name>
              <value>{your AccessKey ID}</value>
          </property>
          <property>
              <name>fs.oss.accessKeySecret</name>
              <value>{your AccessKey Secret}</value>
          </property>
          <property>
              <name>fs.oss.buffer.dir</name>
              <value>/tmp/oss</value>
          </property>
          <property>
              <name>fs.oss.connection.secure.enabled</name>
              <value>false</value>
          </property>
          <property>
              <name>fs.oss.connection.maximum</name>
              <value>2048</value>
          </property>
      </configuration>
    • Method 2: Set the configuration dynamically when you submit the application

      Using Spark as an example, set the configuration at submission time:

      hadoopConf:
          # OSS
          "fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
          "fs.oss.endpoint": "oss-cn-hangzhou-internal.aliyuncs.com"
          "fs.oss.accessKeyId": "your AccessKey ID"
          "fs.oss.accessKeySecret": "your AccessKey Secret"
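The hadoopConf block above follows the Spark Operator's manifest format. If the application is instead launched directly with spark-submit, the same properties can be passed through Spark's spark.hadoop.* configuration prefix; a sketch (placeholder values, and the JAR/class names taken from the example below):

```shell
spark-submit \
  --conf spark.hadoop.fs.oss.impl=org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem \
  --conf spark.hadoop.fs.oss.endpoint=oss-cn-hangzhou-internal.aliyuncs.com \
  --conf "spark.hadoop.fs.oss.accessKeyId=<your AccessKey ID>" \
  --conf "spark.hadoop.fs.oss.accessKeySecret=<your AccessKey Secret>" \
  --class com.aliyun.liumi.spark.example.WordCount \
  SparkExampleJava-1.0-SNAPSHOT.jar
```

Spark strips the spark.hadoop. prefix and injects the remainder into the Hadoop Configuration seen by the OSS filesystem implementation.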
  3. Package the JAR file.

    The packaged JAR file must contain all dependencies. The pom.xml of the application is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <groupId>com.aliyun.liumi.spark</groupId>
        <artifactId>SparkExampleJava</artifactId>
        <version>1.0-SNAPSHOT</version>

        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.12</artifactId>
                <version>2.4.3</version>
            </dependency>

            <dependency>
                <groupId>com.aliyun.dfs</groupId>
                <artifactId>aliyun-sdk-dfs</artifactId>
                <version>1.0.3</version>
            </dependency>

        </dependencies>

        <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>assembly</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
        </build>
    </project>
  4. Write the Dockerfile.

    # spark base image
    FROM registry.cn-beijing.aliyuncs.com/eci_open/spark:2.4.4
    RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
    ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
    RUN mkdir -p /opt/spark/jars
    COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars
    # OSS dependency JARs
    COPY aliyun-sdk-oss-3.4.1.jar /opt/spark/jars
    COPY hadoop-aliyun-2.7.3.2.6.1.0-129.jar /opt/spark/jars
    COPY jdom-1.1.jar /opt/spark/jars
    Note

    For download links of the OSS dependency JARs, see Use HDP 2.6 Hadoop to read and write OSS data.

  5. Build the application image.

    docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .
  6. Push the image to Alibaba Cloud Container Registry (ACR).

    docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example

After you complete the preceding operations, the Spark application image is ready. You can then use this image to deploy the Spark application in a Kubernetes cluster.
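As a sketch of such a deployment, assuming the Kubernetes cluster runs the Spark Operator (the manifest name, resource sizes, and the spark service account are placeholders, not part of this topic), a minimal SparkApplication manifest referencing the image could look like this:

```
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: wordcount-oss
spec:
  type: Java
  mode: cluster
  image: registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example
  mainClass: com.aliyun.liumi.spark.example.WordCount
  mainApplicationFile: "local:///opt/spark/jars/SparkExampleJava-1.0-SNAPSHOT.jar"
  sparkVersion: "2.4.4"
  hadoopConf:
    "fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
    "fs.oss.endpoint": "oss-cn-hangzhou-internal.aliyuncs.com"
    "fs.oss.accessKeyId": "your AccessKey ID"
    "fs.oss.accessKeySecret": "your AccessKey Secret"
  driver:
    cores: 1
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
```

The hadoopConf map here corresponds to the dynamic configuration shown in Method 2 above.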