OSS-HDFS (JindoFS) is a cloud-native data lake storage feature with unified metadata management and full Hadoop Distributed File System (HDFS) API compatibility for big data and AI workloads.
Usage notes
- After you enable the OSS-HDFS service for a bucket, the service's data is stored in the bucket's .dlsdata/ directory. Do not perform write operations, such as renaming or deleting, on this directory and its objects using non-OSS-HDFS methods. This can cause service disruptions or data loss.
- If your account has an overdue payment or if the dependent RAM role AliyunOSSDlsDefaultRole is deleted, the HDFS background service may enter safe mode. In safe mode, all background tasks, such as audit logging, asynchronous deletion, and automatic storage tiering, are paused. The service automatically resumes after the issue is resolved.
After you enable OSS-HDFS, writing to the .dlsdata/ directory through other OSS features can cause data loss, corruption, or data inaccessibility, as described in Prerequisites.
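Scripts that write to the bucket through plain OSS APIs can guard against touching the reserved directory by accident. The following is a minimal sketch of such a pre-flight check. The function names are hypothetical and are not part of any OSS SDK; only the .dlsdata/ prefix comes from the notes above.

```python
from pathlib import PurePosixPath

# Reserved directory named in the usage notes above.
RESERVED_PREFIX = ".dlsdata"


def touches_oss_hdfs_data(object_key: str) -> bool:
    """Return True if the key is the .dlsdata/ directory or lies inside it."""
    parts = PurePosixPath(object_key.lstrip("/")).parts
    return bool(parts) and parts[0] == RESERVED_PREFIX


def assert_safe_to_write(object_key: str) -> None:
    """Hypothetical guard to call before a rename, delete, or overwrite via plain OSS."""
    if touches_oss_hdfs_data(object_key):
        raise PermissionError(
            f"refusing to modify {object_key!r}: it is under {RESERVED_PREFIX}/"
        )
```

A key such as `.dlsdata-backup/x` or `logs/.dlsdata/x` passes the check, because only the top-level .dlsdata/ directory is treated as reserved in this sketch.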
Billing
- Metadata management fees
  Metadata management is currently free of charge.
- Data usage fees
  OSS-HDFS stores data blocks in OSS. Therefore, standard OSS billing applies to data blocks in OSS-HDFS. For more information, see Billing overview.
Benefits
OSS-HDFS works with existing Hadoop and Spark applications without modification. After basic configuration, you manage data as with native HDFS, with the added benefits of OSS: virtually unlimited capacity, elastic scalability, and enhanced security, reliability, and availability.
OSS-HDFS handles exabytes of data and billions of files at terabyte-level throughput. In addition to the flat namespace of standard object storage, it provides a hierarchical namespace that organizes objects into directories, and unified metadata management converts between the two namespaces automatically. Instead of the active-standby NameNode redundancy used in traditional HDFS, OSS-HDFS uses multi-node active-active redundancy for higher resiliency. Hadoop users can access data as efficiently as on local HDFS, without replicating or converting it, which improves job performance and reduces maintenance costs.
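To see why a hierarchical namespace matters for directory operations, consider the toy model below. It is only an illustration and does not reflect how OSS-HDFS implements its metadata. In a flat key space, renaming a "directory" means rewriting every key that shares the prefix. In a tree, it means rewriting a single entry in the parent directory. All names in the sketch are hypothetical.

```python
def rename_flat(keys: set[str], src: str, dst: str) -> tuple[set[str], int]:
    """Rename a prefix in a flat key space.

    Returns the new key set and the number of keys that had to be rewritten.
    """
    renamed = {dst + k[len(src):] if k.startswith(src) else k for k in keys}
    touched = sum(1 for k in keys if k.startswith(src))
    return renamed, touched


class Node:
    """A directory or file entry in a toy hierarchical namespace."""

    def __init__(self):
        self.children: dict[str, "Node"] = {}


def rename_tree(root: Node, parent_path: list[str], old: str, new: str) -> int:
    """Rename one child of a directory node.

    Returns the number of entries rewritten, which is 1 regardless of how
    many files sit below the renamed directory.
    """
    parent = root
    for name in parent_path:
        parent = parent.children[name]
    parent.children[new] = parent.children.pop(old)
    return 1
```

With a million files under `warehouse/tmp/`, the flat rename touches a million keys, while the tree rename still touches one entry. That single-entry update is also what makes it practical to treat a directory rename as one atomic step.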
Features
| Feature | Description | References |
| --- | --- | --- |
| Recycle bin | Deleted files move to the recycle bin instead of being permanently removed. Default retention: 3 days (configurable: 1–14 days). Restore files before the retention period expires. | |
| Export inventory | Export the file inventory of an OSS-HDFS-enabled bucket to a specified path as a JSON file for metadata analysis. | |
| Export audit log | OSS-HDFS logs client requests that query, modify, or delete file metadata. Use audit logs to review operations, analyze access statistics, and identify abnormal requests. | |
| Storage tiering | Not all data requires frequent access, but some must be retained for compliance or archival purposes. Storage tiering keeps hot data in Standard and moves cold data to Infrequent Access, Archive, or Cold Archive to reduce costs. | |
| Metadata conversion | Convert OSS metadata to OSS-HDFS metadata directly without import or export tools. | |
| RootPolicy | Configure a custom prefix for OSS-HDFS so jobs run without changing their original | |
| ProxyUser | Authorize a user to perform file system operations on behalf of other users, such as accessing sensitive data. | |
| UserGroupsMapping | Configure mappings between users and user groups. | |
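The recycle bin retention rule in the table can be expressed as a small calculation. The sketch below assumes that retention is counted in whole days from the moment of deletion. The service's actual purge schedule may differ, and the function names are hypothetical rather than part of any OSS-HDFS API.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 3  # default from the table above
MIN_RETENTION_DAYS, MAX_RETENTION_DAYS = 1, 14  # configurable range from the table above


def restore_deadline(deleted_at: datetime,
                     retention_days: int = DEFAULT_RETENTION_DAYS) -> datetime:
    """Return the assumed last moment at which a deleted file can be restored."""
    if not MIN_RETENTION_DAYS <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError(
            f"retention_days must be between {MIN_RETENTION_DAYS} and {MAX_RETENTION_DAYS}"
        )
    return deleted_at + timedelta(days=retention_days)


def can_restore(deleted_at: datetime, now: datetime,
                retention_days: int = DEFAULT_RETENTION_DAYS) -> bool:
    """Return True while the file is still inside its retention window."""
    return now < restore_deadline(deleted_at, retention_days)
```

For example, under these assumptions a file deleted at 2024-01-01 00:00 UTC with the default retention would be restorable until 2024-01-04 00:00 UTC.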
Use cases
OSS-HDFS supports big data and AI use cases:
Hive and Spark
OSS-HDFS suits offline data warehouses built with Hive and Spark. It natively supports file and directory semantics, permissions, atomic directory operations, millisecond-level renames, setTimes, extended attributes (XAttrs), ACLs, and local read cache acceleration. In ETL workloads, OSS-HDFS significantly outperforms standard OSS buckets. Access OSS-HDFS from Hive or Spark on EMR.
OLAP
OSS-HDFS supports append, truncate, flush, sync, and pwrite with full POSIX support through JindoFuse. This lets you replace local disks in OLAP scenarios (such as ClickHouse) to decouple storage and compute. Built-in caching accelerates performance.
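The operations listed above correspond to standard POSIX calls. As a rough illustration, the sketch below exercises append, positional write (pwrite), truncate, and sync through Python's `os` module. It runs against any local directory. Passing the path of a JindoFuse mount point instead is an assumption of this example, and `os.pwrite` and `os.pread` are available only on Unix-like systems.

```python
import os
import tempfile


def exercise_posix_calls(directory: str) -> bytes:
    """Append, pwrite, truncate, and sync a file; return its final contents."""
    path = os.path.join(directory, "part-000.bin")

    # Append: with O_APPEND every write lands at the current end of the file.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_APPEND, 0o644)
    try:
        os.write(fd, b"hello ")
        os.write(fd, b"world")
        os.fsync(fd)  # sync: force the data to stable storage
    finally:
        os.close(fd)

    # Reopen without O_APPEND so that pwrite honors the requested offset.
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, b"HELLO", 0)  # positional write at offset 0
        os.ftruncate(fd, 8)         # truncate the file to 8 bytes
        os.fsync(fd)
        return os.pread(fd, 8, 0)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(exercise_posix_calls(scratch))  # prints b'HELLO wo'
```

The file is reopened before the positional write because, on Linux, a descriptor opened with `O_APPEND` appends on `pwrite` regardless of the offset argument.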
HBase decoupling
OSS-HDFS natively supports file and directory semantics and flush operations, enabling it to replace HDFS in a decoupled storage-compute architecture for HBase. Unlike standard OSS, this stores the Write-Ahead Log (WAL) directly in OSS-HDFS, simplifying the architecture. Use OSS-HDFS as the underlying storage for HBase.
Real-time computing
OSS-HDFS supports flush and truncate and seamlessly replaces HDFS for sinks and checkpoints in Flink real-time computing.
Data migration
OSS-HDFS enables smooth migration of HDFS data from on-premises to cloud, reducing storage costs through elastic scaling and pay-as-you-go pricing. JindoDistCp migrates HDFS data (including file attributes and metadata) to OSS-HDFS and provides fast data comparison using HDFS checksums.
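The idea behind checksum-based comparison after a migration can be sketched as follows. This is not how JindoDistCp is implemented: it uses HDFS checksums, whereas the example below substitutes SHA-256 digests over two local directory trees to show the comparison logic. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def checksum_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to the SHA-256 digest of its bytes."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(source: dict[str, str], target: dict[str, str]) -> dict[str, list[str]]:
    """Report files missing from the target, extra in the target, or with differing digests."""
    common = source.keys() & target.keys()
    return {
        "missing": sorted(source.keys() - target.keys()),
        "extra": sorted(target.keys() - source.keys()),
        "mismatched": sorted(k for k in common if source[k] != target[k]),
    }
```

Comparing digests instead of file contents means each side only has to be read once, and the two manifests can be produced independently on the source and the destination.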
Supported engines
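| Ecosystem | Engine/Platform | References |
| --- | --- | --- |
| Open source ecosystem | Flink | Use open source Flink with JindoSDK to process data in OSS-HDFS |
| Open source ecosystem | Flume | |
| Open source ecosystem | Hadoop | |
| Open source ecosystem | HBase | |
| Open source ecosystem | Hive | |
| Open source ecosystem | Impala | |
| Open source ecosystem | Presto | |
| Open source ecosystem | Spark | |
| Alibaba Cloud ecosystem | EMR | |
| Alibaba Cloud ecosystem | Flink | |
| Alibaba Cloud ecosystem | Flume | Use Flume to synchronize data from an EMR Kafka cluster to OSS-HDFS |
| Alibaba Cloud ecosystem | HBase | Use OSS-HDFS as the underlying storage for HBase on an EMR cluster |
| Alibaba Cloud ecosystem | Hive | |
| Alibaba Cloud ecosystem | Impala | |
| Alibaba Cloud ecosystem | Presto | |
| Alibaba Cloud ecosystem | Spark | |
| Alibaba Cloud ecosystem | Sqoop | Use Sqoop on an EMR cluster to read and write data in OSS-HDFS |
| Third-party ecosystem | SeaTunnel | Use the SeaTunnel integration platform to write data to OSS-HDFS |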
More references
Cloud Lakehouse provides hands-on data lake analytics on an EMR cluster using OSS-HDFS in a decoupled storage-compute architecture. For more information, see Analyze data lakes by using EMR, DLF, and OSS-HDFS.