Access OSS from Spark


This topic describes the configurations required to access Object Storage Service (OSS) from Spark.

OSS endpoint configuration

For testing, use the public endpoint of the region where your OSS bucket resides. When you submit a job to the cluster, replace the public endpoint with the VPC endpoint of the same region, for example oss-cn-hangzhou-internal.aliyuncs.com. For more information, see Endpoints and data centers.
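As a minimal sketch of the rule above, the following Scala helper derives the endpoint host from a region ID. The object and method names are hypothetical, and the host pattern (oss-<region>.aliyuncs.com for the public endpoint, oss-<region>-internal.aliyuncs.com for the VPC endpoint) is an assumption based on the usual OSS naming convention. Confirm the exact value for your region in Endpoints and data centers.

```scala
// Hypothetical helper for illustration; not part of Spark or any OSS SDK.
object OssEndpointSketch {
  /** Returns the assumed OSS endpoint host for a region ID such as "cn-hangzhou". */
  def endpoint(regionId: String, insideVpc: Boolean): String = {
    // Jobs submitted to the cluster use the VPC (internal) endpoint;
    // testing uses the public endpoint.
    val suffix = if (insideVpc) "-internal" else ""
    s"oss-${regionId}${suffix}.aliyuncs.com"
  }
}
```

Calling `OssEndpointSketch.endpoint("cn-hangzhou", insideVpc = true)` yields the same host that the StsToken configuration in this topic uses.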

OSS access method configuration

  • Access OSS using an AccessKey ID and an AccessKey secret.

    spark.hadoop.fs.oss.accessKeyId=xxxxxx
    spark.hadoop.fs.oss.accessKeySecret=xxxxxx
    spark.hadoop.fs.oss.endpoint=oss-xxxxxx-internal.aliyuncs.com
  • Access OSS using an StsToken.

    Accessing OSS using an AccessKey ID and an AccessKey secret poses a security risk because the credentials must be written in plaintext in the configuration file. To avoid this risk, access OSS using an StsToken instead.

    1. Complete one-click authorization to grant your MaxCompute project direct access, by means of an StsToken, to the OSS resources in your Alibaba Cloud account.

      Note

      One-click authorization works only if the MaxCompute project owner is the same Alibaba Cloud account that owns the OSS resources.

    2. Obtain the role ARN.

      1. Log on to the Resource Access Management (RAM) console.

      2. In the navigation pane on the left, choose Identity > Roles.

      3. On the Roles page, search for AliyunODPSDefaultRole.

      4. Click AliyunODPSDefaultRole. In the Basic Information section, obtain the ARN. The format is acs:ram::xxxxxxxxxxxxxxx:role/aliyunodpsdefaultrole.

    3. Add the following content to your Spark configuration to access OSS resources.

      # Use an StsToken as the credentials provider for OSS access.
      spark.hadoop.fs.oss.credentials.provider=org.apache.hadoop.fs.aliyun.oss.AliyunStsTokenCredentialsProvider

      # Role ARN of AliyunODPSDefaultRole, obtained from the RAM console after one-click authorization.
      spark.hadoop.fs.oss.ststoken.roleArn=acs:ram::xxxxxxxxxxxxxxx:role/aliyunodpsdefaultrole

      # VPC endpoint of the region where the OSS resources reside.
      spark.hadoop.fs.oss.endpoint=oss-cn-hangzhou-internal.aliyuncs.com
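If you prefer to set the StsToken properties in code rather than in a configuration file, they can be assembled as key/value pairs first. The following is a sketch only: the object name, the method name, and the ARN sanity check are hypothetical, and only the three property keys and the provider class name come from the configuration above.

```scala
// Hypothetical helper; only the property keys and the provider class
// are taken from the StsToken configuration in this topic.
object StsTokenSettingsSketch {
  private val ProviderClass =
    "org.apache.hadoop.fs.aliyun.oss.AliyunStsTokenCredentialsProvider"

  /** Builds the StsToken-based OSS settings as key/value pairs. */
  def settings(roleArn: String, vpcEndpoint: String): Seq[(String, String)] = {
    // Assumed sanity check, based on the format acs:ram::<account-id>:role/<role-name>.
    require(
      roleArn.startsWith("acs:ram::") && roleArn.contains(":role/"),
      s"unexpected role ARN format: ${roleArn}"
    )
    Seq(
      "spark.hadoop.fs.oss.credentials.provider" -> ProviderClass,
      "spark.hadoop.fs.oss.ststoken.roleArn" -> roleArn,
      "spark.hadoop.fs.oss.endpoint" -> vpcEndpoint
    )
  }
}
```

The resulting pairs could then be applied with `new SparkConf().setAll(StsTokenSettingsSketch.settings(arn, endpoint))`. Because no AccessKey pair appears in the settings, nothing secret is stored in the job configuration.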

Network whitelist configuration

By default, no network whitelist configuration is required to access OSS.

In some cases, such as when your OSS bucket handles high traffic, you may be unable to access OSS. If this occurs, add the following configuration.

spark.hadoop.odps.cupid.trusted.services.access.list=[your_bucket_name].oss-xxxxxx-internal.aliyuncs.com

Note

This configuration applies to yarn-cluster mode. Add it to the configuration file or pass it as a command-line parameter.
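To illustrate the command-line option, the sketch below builds the whitelist value for one bucket and wraps it in a `--conf key=value` argument pair, which is the form spark-submit accepts for arbitrary Spark properties. The object and method names are hypothetical, and the bucket name and endpoint in the usage note are placeholders.

```scala
// Hypothetical helper; only the property key comes from this topic.
object TrustedServicesSketch {
  val Key = "spark.hadoop.odps.cupid.trusted.services.access.list"

  /** Bucket-specific host in the form <bucket>.<internal endpoint>. */
  def bucketHost(bucket: String, internalEndpoint: String): String =
    s"${bucket}.${internalEndpoint}"

  /** Argument pair for passing the whitelist entry on the command line. */
  def submitArgs(bucket: String, internalEndpoint: String): Seq[String] =
    Seq("--conf", s"${Key}=${bucketHost(bucket, internalEndpoint)}")
}
```

For example, `TrustedServicesSketch.submitArgs("my-bucket", "oss-cn-hangzhou-internal.aliyuncs.com")` produces the two arguments to append to the submit command.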

Access OSS using JindoSDK

To access OSS through JindoSDK, set spark.hadoop.fs.AbstractFileSystem.oss.impl and spark.hadoop.fs.oss.impl in SparkConf, as in the following example.

import org.apache.spark.SparkConf

// Route the oss:// scheme to the JindoSDK file system implementations.
val conf = new SparkConf()
  .setAppName("jindo-sdk-demo")
  .set("spark.hadoop.fs.AbstractFileSystem.oss.impl", "com.aliyun.emr.fs.oss.OSS")
  .set("spark.hadoop.fs.oss.impl", "com.aliyun.emr.fs.oss.JindoOssFileSystem")

Note

You must set spark.hadoop.fs.oss.impl. Otherwise, a "No FileSystem for scheme: oss" error occurs.