What is OSS-HDFS?


OSS-HDFS (JindoFS) is a cloud-native data lake storage feature with unified metadata management and full Hadoop Distributed File System (HDFS) API compatibility for big data and AI workloads.

Usage notes

Warning
  • After you enable the OSS-HDFS service for a bucket, the service's data is stored in the bucket's .dlsdata/ directory. Do not perform write operations, such as renaming or deleting, on this directory or its objects by any means other than OSS-HDFS. Doing so can cause service disruptions or data loss.

  • If your account has an overdue payment, or if the dependent RAM role AliyunOSSDlsDefaultRole is deleted, the HDFS background service may enter safe mode. In safe mode, all background tasks, such as audit logging, asynchronous deletion, and automatic storage tiering, are paused. The service automatically resumes after the issue is resolved.

After you enable OSS-HDFS, writing to the .dlsdata/ directory through other OSS features can cause data loss, corruption, or data inaccessibility, as described in Prerequisites.
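
One way to act on this warning is to add a guard in any script that modifies objects through regular OSS tooling. The following is a minimal sketch, not part of any Alibaba Cloud SDK: the helper names `touches_reserved_dir` and `safe_to_modify` are hypothetical, and the check assumes that object keys are relative to the bucket root.

```python
# Hypothetical guard for scripts that rename or delete objects through
# non-OSS-HDFS tooling. It only inspects the object key; it performs no I/O.
RESERVED_PREFIX = ".dlsdata/"  # directory named in the warning above


def touches_reserved_dir(object_key: str) -> bool:
    """Return True if the key is the reserved directory or lies under it."""
    key = object_key.lstrip("/")
    return key == RESERVED_PREFIX.rstrip("/") or key.startswith(RESERVED_PREFIX)


def safe_to_modify(object_key: str) -> bool:
    """Return True if a write, rename, or delete on this key avoids .dlsdata/."""
    return not touches_reserved_dir(object_key)
```

A script could call `safe_to_modify(key)` before each rename or delete and skip or abort when it returns `False`.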

Billing

  • Metadata management fees

    Metadata management is currently free of charge.

  • Data usage fees

    OSS-HDFS stores data blocks in OSS. Therefore, standard OSS billing applies to data blocks in OSS-HDFS. For more information, see Billing overview.

Benefits

OSS-HDFS works with existing Hadoop and Spark applications without modification. After basic configuration, you manage data as with native HDFS, with the added benefits of OSS: virtually unlimited capacity, elastic scalability, and enhanced security, reliability, and availability.
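
The basic configuration mentioned above usually amounts to a few Hadoop properties in core-site.xml. The sketch below generates such a file. The property names and implementation class names are recalled from JindoSDK documentation and are assumptions here: verify them against the JindoSDK version you deploy. The endpoint and credential values are placeholders.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties: dict) -> str:
    """Render a name/value mapping as Hadoop core-site.xml text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root, space="  ")  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Assumed property names and classes (check the JindoSDK docs for your version).
# All values below are placeholders, not working credentials.
oss_hdfs_properties = {
    "fs.AbstractFileSystem.oss.impl": "com.aliyun.jindodata.oss.JindoOSS",
    "fs.oss.impl": "com.aliyun.jindodata.oss.JindoOssFileSystem",
    "fs.oss.endpoint": "cn-hangzhou.oss-dls.aliyuncs.com",
    "fs.oss.accessKeyId": "YOUR_ACCESS_KEY_ID",
    "fs.oss.accessKeySecret": "YOUR_ACCESS_KEY_SECRET",
}

print(build_core_site(oss_hdfs_properties))
```

In practice, credentials are better supplied through a credential provider or instance role than written into a configuration file in plain text.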

OSS-HDFS handles exabytes of data and billions of files at terabyte-level throughput. Beyond the flat namespace of standard object storage, it provides a hierarchical namespace that organizes objects into directories, with automatic namespace conversion through unified metadata management. Instead of active-standby NameNode redundancy in traditional HDFS, OSS-HDFS uses multi-node active-active redundancy for superior resiliency. Hadoop users access data as efficiently as on local HDFS without replication or conversion, improving job performance and reducing maintenance costs.

Features

| Feature | Description | References |
| --- | --- | --- |
| recycle bin | Deleted files move to the recycle bin instead of being permanently removed. Default retention: 3 days (configurable: 1–14 days). Restore files before the retention period expires. | Use the recycle bin |
| export inventory | Export the file inventory of an OSS-HDFS-enabled Bucket to a specified path as a JSON file for metadata analysis. | Export a metadata inventory |
| export audit log | OSS-HDFS logs client requests that query, modify, or delete file metadata. Use audit logs to review operations, analyze access statistics, and identify abnormal requests. | Export an audit log |
| storage tiering | Not all data requires frequent access, but some must be retained for compliance or archival purposes. Storage tiering keeps hot data in Standard and moves cold data to Infrequent Access, Archive, or Cold Archive to reduce costs. | Use storage tiering |
| metadata conversion | Convert OSS metadata to OSS-HDFS metadata directly without import or export tools. | Convert metadata |
| RootPolicy | Configure a custom prefix for OSS-HDFS so jobs run without changing their original hdfs:// access prefix. | Access OSS-HDFS by using RootPolicy |
| ProxyUser | Authorize a user to perform file system operations on behalf of other users, such as accessing sensitive data. | ProxyUser (Configure a proxy user) |
| UserGroupsMapping | Configure mappings between users and user groups. | UserGroupsMapping (Manage user and group mappings) |
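
The recycle bin retention rule listed under Features (3 days by default, configurable from 1 to 14 days) can be expressed as a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes retention is counted in whole days from the moment of deletion, which may not match how the service schedules cleanup.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 3   # default stated in the feature description
MIN_RETENTION_DAYS = 1       # lower bound of the configurable range
MAX_RETENTION_DAYS = 14      # upper bound of the configurable range


def validate_retention(days: int = DEFAULT_RETENTION_DAYS) -> int:
    """Return days unchanged if it lies in the documented 1-14 range."""
    if isinstance(days, bool) or not isinstance(days, int):
        raise TypeError("retention must be a whole number of days")
    if not MIN_RETENTION_DAYS <= days <= MAX_RETENTION_DAYS:
        raise ValueError(
            f"retention must be between {MIN_RETENTION_DAYS} "
            f"and {MAX_RETENTION_DAYS} days"
        )
    return days


def restore_deadline(deleted_at: datetime,
                     retention_days: int = DEFAULT_RETENTION_DAYS) -> datetime:
    """Latest moment a file could still be restored (assumed semantics)."""
    return deleted_at + timedelta(days=validate_retention(retention_days))


def can_restore(deleted_at: datetime, now: datetime,
                retention_days: int = DEFAULT_RETENTION_DAYS) -> bool:
    """True while the assumed retention window is still open."""
    return now < restore_deadline(deleted_at, retention_days)
```

For example, a file deleted at 2024-01-01 00:00 UTC with the default retention would have an assumed restore deadline of 2024-01-04 00:00 UTC.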

Use cases

OSS-HDFS supports big data and AI use cases:

Hive and Spark

OSS-HDFS suits offline data warehouses built with Hive and Spark. It natively supports file and directory semantics, permissions, atomic directory operations, millisecond-level renames, setTimes, extended attributes (XAttrs), ACLs, and local read cache acceleration. In ETL workloads, OSS-HDFS significantly outperforms standard OSS buckets. For more information, see Access OSS-HDFS from Hive or Spark on EMR.

OLAP

OSS-HDFS supports append, truncate, flush, sync, and pwrite with full POSIX support through JindoFuse. This lets you replace local disks in OLAP scenarios (such as ClickHouse) to decouple storage and compute. Built-in caching accelerates performance.
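
The operations named above correspond to ordinary POSIX calls once a file system is mounted. The sketch below exercises write, pwrite, truncate, sync, and append through Python's `os` module. It runs against a temporary local directory so that it works anywhere on Linux or macOS. Pointing `directory` at a JindoFuse mount point is the assumed real-world usage and is not shown here.

```python
import os
import tempfile


def exercise_posix_writes(directory: str) -> bytes:
    """Run write, pwrite, truncate, fsync, and append on one file."""
    path = os.path.join(directory, "segment.bin")

    fd = os.open(path, os.O_CREAT | os.O_TRUNC | os.O_RDWR, 0o644)
    try:
        os.write(fd, b"0123456789")  # sequential write
        os.pwrite(fd, b"AB", 2)      # positional write at offset 2
        os.ftruncate(fd, 8)          # shrink the file to 8 bytes
        os.fsync(fd)                 # ask the file system to persist the data
    finally:
        os.close(fd)

    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    try:
        os.write(fd, b"xyz")         # append to the end of the file
        os.fsync(fd)
    finally:
        os.close(fd)

    with open(path, "rb") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    print(exercise_posix_writes(tmp))  # b'01AB4567xyz'
```

Storage engines that treat their data directory as a local disk rely on this kind of in-place update and append behavior, which plain object storage does not offer.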

HBase decoupling

OSS-HDFS natively supports file and directory semantics and flush operations, enabling it to replace HDFS in a decoupled storage-compute architecture for HBase. Unlike standard OSS, this stores the Write-Ahead Log (WAL) directly in OSS-HDFS, simplifying the architecture. For more information, see Use OSS-HDFS as the underlying storage for HBase.

Real-time computing

OSS-HDFS supports flush and truncate and seamlessly replaces HDFS for sinks and checkpoints in Flink real-time computing.

Data migration

OSS-HDFS enables smooth migration of HDFS data from on-premises to cloud, reducing storage costs through elastic scaling and pay-as-you-go pricing. JindoDistCp migrates HDFS data (including file attributes and metadata) to OSS-HDFS and provides fast data comparison using HDFS checksums.
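
The idea of checksum-based comparison after a migration can be illustrated with a small sketch. This is not how JindoDistCp is implemented. It assumes you have already collected, by some other means, a manifest for the source and for the target that maps each relative path to a `(length, checksum)` pair, and it treats checksums as opaque strings. The function name is hypothetical.

```python
def compare_manifests(source: dict, target: dict) -> dict:
    """Compare two {relative_path: (length, checksum)} manifests.

    Returns sorted lists of paths that are missing from the target, present
    on both sides with different length or checksum, or present only on the
    target.
    """
    missing = sorted(path for path in source if path not in target)
    extra = sorted(path for path in target if path not in source)
    mismatched = sorted(
        path for path in source
        if path in target and source[path] != target[path]
    )
    return {"missing": missing, "mismatched": mismatched, "extra": extra}


# Illustrative data only; lengths and checksum strings are made up.
source_manifest = {
    "warehouse/a.orc": (1024, "c1"),
    "warehouse/b.orc": (2048, "c2"),
    "warehouse/c.orc": (4096, "c3"),
}
target_manifest = {
    "warehouse/a.orc": (1024, "c1"),
    "warehouse/b.orc": (2048, "zz"),
}

print(compare_manifests(source_manifest, target_manifest))
```

Comparing checksums rather than file contents avoids reading every byte on both sides, which is what makes this kind of verification fast.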

Supported engines

| Ecosystem | Engine/Platform | References |
| --- | --- | --- |
| open source ecosystem | Flink | Use open source Flink with JindoSDK to process data in OSS-HDFS |
| | Flume | Use Flume with JindoSDK to write data to OSS-HDFS |
| | Hadoop | Use Hadoop with JindoSDK to access OSS-HDFS |
| | HBase | Use OSS-HDFS as the underlying storage for HBase |
| | Hive | Use Hive with JindoSDK to process data in OSS-HDFS |
| | Impala | Use Impala with JindoSDK to query data in OSS-HDFS |
| | Presto | Use Trino with JindoSDK to query data in OSS-HDFS |
| | Spark | Use Spark with JindoSDK to query data in OSS-HDFS |
| Alibaba Cloud ecosystem | EMR | Access OSS-HDFS from Hive or Spark on EMR |
| | Flink | |
| | Flume | Use Flume to synchronize data from an EMR Kafka cluster to OSS-HDFS |
| | HBase | Use OSS-HDFS as the underlying storage for HBase on an EMR cluster |
| | Hive | Use Hive on an EMR cluster to process data in OSS-HDFS |
| | Impala | Use Impala on an EMR cluster to query data in OSS-HDFS |
| | Presto | Use Trino on an EMR cluster to query data in OSS-HDFS |
| | Spark | Use Spark on an EMR cluster to process data in OSS-HDFS |
| | Sqoop | Use Sqoop on an EMR cluster to read and write data in OSS-HDFS |
| third-party ecosystem | SeaTunnel | Use the SeaTunnel integration platform to write data to OSS-HDFS |

More references

Cloud Lakehouse provides hands-on data lake analytics on an EMR cluster using OSS-HDFS in a decoupled storage-compute architecture. For more information, see Analyze data lakes by using EMR, DLF, and OSS-HDFS.