What is CPFS dataflow - Cloud Parallel File Storage (CPFS) - Alibaba Cloud Help Center

Cloud Parallel File Storage (CPFS) dataflow synchronizes data between CPFS file systems and Object Storage Service (OSS) buckets. You can create dataflow tasks to import, export, or delete data across the two storage services.

Overview

After a dataflow is created between a CPFS fileset and an OSS bucket, the CPFS file system automatically synchronizes object metadata from the OSS bucket. You can then access and process OSS data through high-performance, POSIX-compatible file interfaces. You can also export data to OSS buckets from the CPFS console or by calling OpenAPI operations.

On-demand loading

When you access a directory or file in a CPFS file system linked to an OSS bucket, the file system automatically loads the required metadata or data on demand. For example, running the ls command on a linked directory triggers metadata loading from OSS. Accessing a file triggers loading of the required data blocks from OSS.
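
The access pattern described above can be sketched in Python. This is a minimal illustration, not CPFS-specific code: the function name and the `header_bytes` parameter are made up for this example, and the loading from OSS happens transparently inside the file system rather than in the script.

```python
import os

def touch_linked_directory(directory, header_bytes=4096):
    """List a directory, then read the first bytes of each regular file.

    On a CPFS directory linked to an OSS bucket, the listing is the kind of
    access that triggers metadata loading, and the read is the kind of access
    that triggers loading of the required data blocks. On any other file
    system this is an ordinary listing and read.
    """
    bytes_read = {}
    for name in sorted(os.listdir(directory)):   # comparable to running `ls`
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue                             # skip subdirectories
        with open(path, "rb") as handle:         # data access
            bytes_read[name] = len(handle.read(header_bytes))
    return bytes_read
```

For example, `touch_linked_directory("/mnt/cpfs/fileset1")` would walk a linked directory, where `/mnt/cpfs/fileset1` is a hypothetical mount path.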
Data import and export

You can create dataflow tasks to import or export data between CPFS and OSS, for example to load data into the high-performance CPFS file system before compute jobs start. A task can import or export a full directory tree or a specific list of files. After a task completes, you can review the task report for execution details.
Important
- When CPFS exports a file, it writes file system metadata to the custom metadata of the OSS object. This metadata is named x-oss-meta-afm-xxx. Do not delete or modify this metadata. Otherwise, file system metadata errors may occur.
- Task reports are for reference only. The final state of the data at the destination is the definitive record. You must verify data consistency between the source and the destination after the dataflow completes.
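
One way to approach the verification mentioned above is to compare a manifest of the source with a manifest of the destination. The following is a minimal sketch under stated assumptions: both function names are hypothetical, the destination manifest (object key to size) is assumed to be produced separately by whatever OSS listing tool you use, and matching sizes alone do not prove that contents are identical, so a checksum comparison would be stronger.

```python
import os

def build_manifest(root):
    """Map each regular file under root to its size in bytes.

    Keys are '/'-separated paths relative to root so that they can be
    compared with object keys. Symbolic links are skipped because
    dataflows do not export them.
    """
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = os.path.getsize(full)
    return manifest

def compare_manifests(source, destination):
    """Report entries that are missing, unexpected, or differ in size."""
    shared = set(source) & set(destination)
    return {
        "missing": sorted(set(source) - set(destination)),
        "unexpected": sorted(set(destination) - set(source)),
        "size_mismatch": sorted(k for k in shared if source[k] != destination[k]),
    }
```

An empty list under all three keys means that no difference was found at the level of names and sizes.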
Automatic metadata updates

After you enable automatic metadata updates, CPFS monitors OSS data modification events and synchronizes updated file metadata to the CPFS file system. This ensures eventual consistency between CPFS and OSS and reduces O&M costs.
Elastic scaling

You can scale dataflow bandwidth up or down as needed. Increase bandwidth during peak traffic and decrease it during off-peak periods.

Limits

Filesets
- Filesets are supported only in CPFS 2.2.0 and later.
- A single CPFS file system supports a maximum of 10 filesets.
- A fileset can be linked to a directory up to eight levels deep within the CPFS file system.
- A fileset can contain a maximum of 1 million files or directories.
- Nested filesets are not supported.
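
The depth and entry-count limits above can be checked before a fileset is created. This is a rough sketch, not an official tool: the function names are hypothetical, depth is assumed to be the number of path components counted from the file system root, and whether the fileset's own directory counts toward the 1 million limit is not stated here, so it is not counted.

```python
import os

MAX_FILESET_DEPTH = 8            # directory levels, from the limits above
MAX_FILESET_ENTRIES = 1_000_000  # files or directories, from the limits above

def fileset_depth(fileset_path):
    """Count directory levels in a path given relative to the file system root."""
    return len([part for part in fileset_path.split("/") if part])

def count_entries(root, limit=MAX_FILESET_ENTRIES):
    """Count files and directories under root, stopping once limit is exceeded."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
        if total > limit:
            break  # no need to scan further once the limit is exceeded
    return total
```

A candidate directory fits these two limits when `fileset_depth(path) <= MAX_FILESET_DEPTH` and `count_entries(mounted_path) <= MAX_FILESET_ENTRIES`.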
Dataflows
- Dataflows are supported only in CPFS 2.2.0 and later.
- A single CPFS file system supports a maximum of 10 dataflows.
- A single dataflow can have a maximum of five auto-update directories.
- A fileset in a CPFS file system can be linked to only one OSS bucket.
- Dataflow task records are retained for a maximum of 90 days.
- Dataflow task reports are stored in the CPFS file system and consume storage space. A maximum of 1 million reports can be stored.
- Cross-region dataflows are not supported. The CPFS file system and OSS bucket must be in the same region.
Dataflow limits on file systems
- Do not rename a non-empty directory in a fileset associated with a dataflow. Renaming a non-empty directory may cause a Permission Denied error or a "directory not empty" error.
- Dataflows do not support Archive or Cold Archive objects in OSS.
- Use special characters in directory and file names with caution. Supported characters are uppercase and lowercase letters, digits, exclamation points (!), hyphens (-), underscores (_), periods (.), asterisks (*), and parentheses ().
- Long paths are not supported. The maximum path length for a dataflow is 1,023 characters.
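
The naming and path-length limits above lend themselves to a pre-flight check. The following is a minimal sketch: `check_path` is a hypothetical helper, the length is counted in characters as stated above (the limit is assumed to apply to the whole '/'-separated path), and a name outside the listed character set is only flagged for review, because such characters call for caution rather than being documented as forbidden.

```python
import re

MAX_PATH_CHARS = 1023
# Letters, digits, and ! - _ . * ( ) as listed in the limits above.
SUPPORTED_NAME = re.compile(r"[A-Za-z0-9!\-_.*()]+")

def check_path(path):
    """Return a list of problems for a '/'-separated path; empty means none found."""
    problems = []
    if len(path) > MAX_PATH_CHARS:
        problems.append(f"path is {len(path)} characters; limit is {MAX_PATH_CHARS}")
    for component in path.split("/"):
        if component and not SUPPORTED_NAME.fullmatch(component):
            problems.append(f"name outside the listed character set: {component!r}")
    return problems
```

For example, `check_path("data/train_01/img(1).jpg")` returns an empty list, while `check_path("data/my file.txt")` reports the name that contains a space.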
Limits on data export
- Dataflows do not support exporting hard links or symbolic links to an OSS bucket.
- Dataflows do not support exporting empty directories to an OSS bucket.
- Dataflows do not support exporting the ChangeTime (ctime) attribute of a file to an OSS bucket.
- When a dataflow exports a sparse file, the holes in the file are filled with zeros before the data is exported to an OSS bucket.
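
Before an export task is created, a directory tree can be scanned for the items that the limits above exclude. This is an illustrative sketch using only the Python standard library: `find_unexportable` is a hypothetical name, and hard links are detected by assuming that a regular file with a link count greater than 1 is hard-linked.

```python
import os
import stat

def find_unexportable(root):
    """Collect symbolic links, hard-linked files, and empty directories under root."""
    findings = {"symlinks": [], "hard_links": [], "empty_dirs": []}
    for dirpath, dirnames, filenames in os.walk(root):
        if not dirnames and not filenames:
            findings["empty_dirs"].append(dirpath)
        # Symbolic links to directories appear in dirnames, so check both lists.
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            info = os.lstat(full)  # lstat inspects the link itself, not its target
            if stat.S_ISLNK(info.st_mode):
                findings["symlinks"].append(full)
            elif stat.S_ISREG(info.st_mode) and info.st_nlink > 1:
                findings["hard_links"].append(full)
    return findings
```

Every name that shares an inode is reported under `hard_links`, so a file with one extra hard link appears twice, once per name.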
Limits on automatic metadata updates

The automatic metadata update feature is available only in the following regions: China (Hangzhou), China (Chengdu), China (Shanghai), China (Shenzhen), China (Zhangjiakou), and China (Beijing).

Procedure

1. Create a CPFS fileset. For more information, see Create a fileset.
2. Create a dataflow. For more information, see Create a dataflow.
3. Create a data import, data export, or data deletion task. For more information, see Create a dataflow task.
4. Verify the data. After the dataflow task is complete, verify that the data at the destination is accurate.

Warning
If you delete the source data before verifying that the data was transferred to the destination correctly, you are solely responsible for any resulting data loss and all consequences.

Performance metrics

Operation type	Metric	Description
Data import	Throughput for files larger than 1 GB	Single-file import throughput: 200 MB/s. Multi-file import throughput can reach the configured bandwidth.
Data import	OPS for megabyte-scale files	Single-directory and multi-directory import: 1,000.
Data export	Throughput for files larger than 1 GB	Single-file export throughput: 200 MB/s. Multi-file export throughput can reach the configured bandwidth.
Data export	OPS for megabyte-scale files	Single-directory and multi-directory export: 600.
Data deletion	OPS	Single-directory and multi-directory deletion: 2,000.
On-demand loading (lazy load)	Throughput for files larger than 1 GB	Single-file import throughput: 200 MB/s. Multi-file import throughput can reach the configured bandwidth.
On-demand loading (lazy load)	OPS for megabyte-scale files	Single-directory and multi-directory import: 1,000.
Automatic metadata update	OPS	Dataflow at 600 MB/s: 2,000. Dataflow at 1,200 MB/s: 3,000. Dataflow at 1,500 MB/s: 4,000.
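
The figures in the table can be combined into a rough duration estimate for a task. This is a back-of-the-envelope sketch under several assumptions that the table does not state: OPS is treated as files per second, 1 MB is taken as 10**6 bytes, aggregate throughput is assumed to be capped by both the configured bandwidth and 200 MB/s per file, and the slower of the two limits is assumed to dominate. The function name and parameters are hypothetical.

```python
def estimate_seconds(total_bytes, file_count, bandwidth_mb_s, ops_limit,
                     single_file_mb_s=200):
    """Estimate a lower bound on task duration in seconds."""
    total_mb = total_bytes / 1_000_000
    # Each file is capped at single_file_mb_s; all files share the bandwidth.
    throughput = min(bandwidth_mb_s, file_count * single_file_mb_s)
    time_by_throughput = total_mb / throughput
    time_by_ops = file_count / ops_limit
    return max(time_by_throughput, time_by_ops)

# One 10 GB file imported over a 600 MB/s dataflow: limited by 200 MB/s.
print(estimate_seconds(10**10, 1, 600, 1000))               # 50.0
# 120,000 files of 1 MB imported over a 1,200 MB/s dataflow: limited by OPS.
print(estimate_seconds(120_000 * 10**6, 120_000, 1200, 1000))  # 120.0
```

Real tasks can take longer than this estimate, so treat the result as a planning aid rather than a guarantee.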