What is OSS-HDFS?

Usage notes

Warning

After you enable the OSS-HDFS service for a bucket, the service's data is stored in the bucket's .dlsdata/ directory. Do not perform write operations, such as renaming or deleting, on this directory or its objects by using non-OSS-HDFS methods. Doing so can cause service disruptions or data loss.
If your account has an overdue payment or if the dependent RAM role AliyunOSSDlsDefaultRole is deleted, the HDFS background service may enter safe mode. In safe mode, all background tasks, such as audit logging, asynchronous deletion, and automatic storage tiering, are paused. The service automatically resumes after the issue is resolved.

After you enable OSS-HDFS, writing to the .dlsdata/ directory through other OSS features can cause data loss, corruption, or data inaccessibility, as described in Prerequisites.
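As an illustration of the warning above, the following minimal sketch shows a client-side guard that refuses to write to keys under .dlsdata/ before any request is sent through a non-OSS-HDFS path. The function names are hypothetical and are not part of any OSS SDK; the only detail taken from this page is the .dlsdata/ directory name.

```python
# Hypothetical client-side guard; not part of any OSS SDK.
RESERVED_DIR = ".dlsdata"            # directory named in the warning above
RESERVED_PREFIX = RESERVED_DIR + "/"


def touches_reserved_dir(object_key: str) -> bool:
    """Return True if the key is the bucket-root .dlsdata/ directory or lies beneath it."""
    key = object_key.lstrip("/")
    return key == RESERVED_DIR or key.startswith(RESERVED_PREFIX)


def check_write_allowed(object_key: str) -> None:
    """Raise PermissionError before a rename, delete, or overwrite reaches .dlsdata/."""
    if touches_reserved_dir(object_key):
        raise PermissionError(
            f"Refusing to modify '{object_key}': .dlsdata/ is managed by OSS-HDFS"
        )
```

A caller would run check_write_allowed(key) before issuing a delete or rename through the standard OSS interface. The check is only a local safeguard and does not replace access policies on the bucket.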

Billing

Metadata management fees

Metadata management is currently free of charge.

Data usage fees

OSS-HDFS stores data blocks in OSS. Therefore, standard OSS billing applies to data blocks in OSS-HDFS. For more information, see Billing overview.

Benefits

OSS-HDFS works with existing Hadoop and Spark applications without modification. After basic configuration, you can manage data as you would in native HDFS, with the added benefits of OSS: virtually unlimited capacity, elastic scalability, and enhanced security, reliability, and availability.

OSS-HDFS handles exabytes of data and billions of files at terabyte-level throughput. Beyond the flat namespace of standard object storage, it provides a hierarchical namespace that organizes objects into directories, with automatic namespace conversion through unified metadata management. Instead of active-standby NameNode redundancy in traditional HDFS, OSS-HDFS uses multi-node active-active redundancy for superior resiliency. Hadoop users access data as efficiently as on local HDFS without replication or conversion, improving job performance and reducing maintenance costs.

Features

Feature	Description	References
Recycle bin	Deleted files move to the recycle bin instead of being permanently removed. Default retention: 3 days (configurable: 1–14 days). Restore files before the retention period expires.	Use the recycle bin
Export inventory	Export the file inventory of an OSS-HDFS-enabled bucket to a specified path as a JSON file for metadata analysis.	Export a metadata inventory
Export audit log	OSS-HDFS logs client requests that query, modify, or delete file metadata. Use audit logs to review operations, analyze access statistics, and identify abnormal requests.	Export an audit log
Storage tiering	Not all data requires frequent access, but some must be retained for compliance or archival purposes. Storage tiering keeps hot data in Standard and moves cold data to Infrequent Access, Archive, or Cold Archive to reduce costs.	Use storage tiering
Metadata conversion	Convert OSS metadata to OSS-HDFS metadata directly without import or export tools.	Convert metadata
RootPolicy	Configure a custom prefix for OSS-HDFS so jobs run without changing their original `hdfs://` access prefix.	Access OSS-HDFS by using RootPolicy
ProxyUser	Authorize a user to perform file system operations on behalf of other users, such as accessing sensitive data.	ProxyUser (Configure a proxy user)
UserGroupsMapping	Configure mappings between users and user groups.	UserGroupsMapping (Manage user and group mappings)
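
The recycle bin retention rule in the table above (3 days by default, configurable from 1 to 14 days) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the assumption that a file becomes unrecoverable exactly retention_days after deletion is a simplification, since the service's actual cleanup timing is not described here.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 3   # default stated in the features table
MIN_RETENTION_DAYS = 1       # lower bound stated in the features table
MAX_RETENTION_DAYS = 14      # upper bound stated in the features table


def restore_deadline(deleted_at: datetime,
                     retention_days: int = DEFAULT_RETENTION_DAYS) -> datetime:
    """Return the assumed last moment at which a deleted file can be restored."""
    if not MIN_RETENTION_DAYS <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError("retention_days must be between 1 and 14")
    return deleted_at + timedelta(days=retention_days)


def can_restore(deleted_at: datetime, now: datetime,
                retention_days: int = DEFAULT_RETENTION_DAYS) -> bool:
    """Return True while the retention period has not yet expired."""
    return now < restore_deadline(deleted_at, retention_days)


# Example: a file deleted on 1 January with the default retention.
deleted = datetime(2024, 1, 1, tzinfo=timezone.utc)
deadline = restore_deadline(deleted)
```

With these inputs, deadline falls on 4 January 2024 (UTC), so a restore attempted on 3 January passes the check and one attempted on 5 January does not.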

Use cases

OSS-HDFS supports big data and AI use cases:

Hive and Spark

OSS-HDFS suits offline data warehouses built with Hive and Spark. It natively supports file and directory semantics, permissions, atomic directory operations, millisecond-level renames, setTimes, extended attributes (XAttrs), ACLs, and local read cache acceleration. In ETL workloads, OSS-HDFS significantly outperforms standard OSS buckets. For more information, see Access OSS-HDFS from Hive or Spark on EMR.
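
Hive and Spark jobs address OSS-HDFS data through a path rather than a local HDFS NameNode. The sketch below builds such a path. It assumes the oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path> form, which does not appear on this page, so verify the endpoint for your region in the console before relying on it. The helper name oss_hdfs_uri and the bucket and path values are hypothetical.

```python
def oss_hdfs_uri(bucket: str, region: str, path: str = "") -> str:
    """Build an OSS-HDFS path for use in Hive or Spark jobs.

    Assumption: the OSS-HDFS endpoint follows the pattern
    <region>.oss-dls.aliyuncs.com; confirm it for your bucket.
    """
    endpoint = f"{region}.oss-dls.aliyuncs.com"
    return f"oss://{bucket}.{endpoint}/{path.lstrip('/')}"


# Hypothetical bucket and directory, for illustration only.
table_path = oss_hdfs_uri("examplebucket", "cn-hangzhou", "/warehouse/sales")
```

In a Spark job on a cluster where JindoSDK is already configured, a path built this way could be passed to a reader such as spark.read.parquet(table_path).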

OLAP

OSS-HDFS supports append, truncate, flush, sync, and pwrite with full POSIX support through JindoFuse. This lets you replace local disks in OLAP scenarios (such as ClickHouse) to decouple storage and compute. Built-in caching accelerates performance.

HBase decoupling

OSS-HDFS natively supports file and directory semantics and flush operations, enabling it to replace HDFS in a decoupled storage-compute architecture for HBase. Unlike standard OSS, OSS-HDFS can store the Write-Ahead Log (WAL) directly, which simplifies the architecture. For more information, see Use OSS-HDFS as the underlying storage for HBase.

Real-time computing

OSS-HDFS supports flush and truncate and seamlessly replaces HDFS for sinks and checkpoints in Flink real-time computing.

Data migration

OSS-HDFS enables smooth migration of HDFS data from on-premises to cloud, reducing storage costs through elastic scaling and pay-as-you-go pricing. JindoDistCp migrates HDFS data (including file attributes and metadata) to OSS-HDFS and provides fast data comparison using HDFS checksums.

Supported engines

Ecosystem	Engine/Platform	References
Open source ecosystem	Flink	Use open source Flink with JindoSDK to process data in OSS-HDFS
	Flume	Use Flume with JindoSDK to write data to OSS-HDFS
	Hadoop	Use Hadoop with JindoSDK to access OSS-HDFS
	HBase	Use OSS-HDFS as the underlying storage for HBase
	Hive	Use Hive with JindoSDK to process data in OSS-HDFS
	Impala	Use Impala with JindoSDK to query data in OSS-HDFS
	Presto	Use Trino with JindoSDK to query data in OSS-HDFS
	Spark	Use Spark with JindoSDK to query data in OSS-HDFS
Alibaba Cloud ecosystem	EMR	Access OSS-HDFS from Hive or Spark on EMR
	Flink	Perform recoverable writes from EMR Flink to OSS-HDFS; Use Realtime Compute for Apache Flink to read from and write to OSS or OSS-HDFS
	Flume	Use Flume to synchronize data from an EMR Kafka cluster to OSS-HDFS
	HBase	Use OSS-HDFS as the underlying storage for HBase on an EMR cluster
	Hive	Use Hive on an EMR cluster to process data in OSS-HDFS
	Impala	Use Impala on an EMR cluster to query data in OSS-HDFS
	Presto	Use Trino on an EMR cluster to query data in OSS-HDFS
	Spark	Use Spark on an EMR cluster to process data in OSS-HDFS
	Sqoop	Use Sqoop on an EMR cluster to read and write data in OSS-HDFS
Third-party ecosystem	SeaTunnel	Use the SeaTunnel integration platform to write data to OSS-HDFS

More references

Cloud Lakehouse provides hands-on data lake analytics on an EMR cluster using OSS-HDFS in a decoupled storage-compute architecture. For more information, see Analyze data lakes by using EMR, DLF, and OSS-HDFS.