CPFS dataflow
Cloud Parallel File Storage (CPFS) dataflow synchronizes data between CPFS file systems and Object Storage Service (OSS) buckets. You can create dataflow tasks to import, export, or delete data across the two storage services.
Overview
After a dataflow is created between a CPFS fileset and an OSS bucket, the CPFS file system automatically synchronizes object metadata from the OSS bucket. You can then access and process OSS data through high-performance, POSIX-compatible file interfaces. You can also export data to OSS buckets from the CPFS console or by calling OpenAPI operations.
- On-demand loading
  When you access a directory or file in a CPFS file system linked to an OSS bucket, the file system automatically loads the required metadata or data on demand. For example, running the ls command on a linked directory triggers metadata loading from OSS. Accessing a file triggers loading of the required data blocks from OSS.
- Data import and export
  You can create dataflow tasks to import or export data between CPFS and OSS. This way, data is synchronized to the high-performance CPFS file system before compute jobs start. CPFS supports importing and exporting full directory trees or specific file lists. After a task completes, you can review the task report for execution details.
  Important:
  - CPFS exports metadata to the custom metadata of an OSS object. This metadata is named x-oss-meta-afm-xxx. Do not delete or modify this metadata. Otherwise, file system metadata errors may occur.
  - Task reports are for reference only. The final state of the data at the destination is the definitive record. You must verify data consistency between the source and the destination after the dataflow completes.
- Automatic metadata updates
  After you enable automatic metadata updates, CPFS monitors OSS data modification events and synchronizes updated file metadata to the CPFS file system. This ensures eventual consistency between CPFS and OSS and reduces O&M costs.
- Elastic scaling
  You can scale dataflow bandwidth up or down as needed. Increase bandwidth during peak traffic and decrease it during off-peak periods.
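The warning about x-oss-meta-afm-xxx metadata can be enforced in client code that rewrites object metadata. The following is a minimal Python sketch, not part of CPFS or any OSS SDK. It assumes that the reserved keys all share the prefix x-oss-meta-afm- (inferred from the placeholder name above) and that your tooling represents object metadata as a plain dictionary. The names is_reserved and merge_user_metadata are hypothetical.

```python
# Assumption: CPFS-managed keys share this prefix (inferred from "x-oss-meta-afm-xxx").
RESERVED_PREFIX = "x-oss-meta-afm-"


def is_reserved(key: str) -> bool:
    """Return True for metadata keys that CPFS is assumed to manage."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    return key.lower().startswith(RESERVED_PREFIX)


def merge_user_metadata(existing: dict, updates: dict) -> dict:
    """Build the metadata to write back, leaving CPFS-managed keys untouched.

    `existing` is the object's current metadata and `updates` holds the
    caller's changes. Hypothetical helper, not an OSS SDK call.
    """
    rejected = [key for key in updates if is_reserved(key)]
    if rejected:
        raise ValueError(f"refusing to modify CPFS-managed metadata: {rejected}")
    merged = dict(existing)  # carries every x-oss-meta-afm-* entry forward as-is
    merged.update(updates)
    return merged
```

If your tooling rewrites the whole metadata set of an object, passing the merged dictionary carries the reserved entries forward unchanged, and any attempt to overwrite one of them fails before a request is sent.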
Limits
- Fileset
  - Filesets are supported only in CPFS 2.2.0 and later.
  - A single CPFS file system supports a maximum of 10 filesets.
  - A fileset can be linked to a directory up to eight levels deep within the CPFS file system.
  - A fileset can contain a maximum of 1 million files or directories.
  - Nested filesets are not supported.
- Dataflows
  - Dataflows are supported only in CPFS 2.2.0 and later.
  - A single CPFS file system supports a maximum of 10 dataflows.
  - A single dataflow can have a maximum of five auto-update directories.
  - A fileset in a CPFS file system can be linked to only one OSS bucket.
  - Dataflow task records are retained for a maximum of 90 days.
  - Dataflow task reports are stored in the CPFS file system and consume storage space. A maximum of 1 million reports can be stored.
  - Cross-region dataflows are not supported. The CPFS file system and OSS bucket must be in the same region.
- Dataflow limits on file systems
  - Do not rename a non-empty directory in a fileset associated with a dataflow. Renaming a non-empty directory may cause a Permission Denied error or a "directory not empty" error.
  - Dataflows do not support Archive or Cold Archive objects in OSS.
  - Use special characters in directory and file names with caution. Supported characters are uppercase and lowercase letters, digits, exclamation points (!), hyphens (-), underscores (_), periods (.), asterisks (*), and parentheses ().
  - Long paths are not supported. The maximum path length for a dataflow is 1,023 characters.
- Limits on data export
  - Dataflows do not support exporting hard links or symbolic links to an OSS bucket.
  - Dataflows do not support exporting empty directories to an OSS bucket.
  - Dataflows do not support exporting ChangeTime properties to an OSS bucket.
  - When a dataflow exports a sparse file, the holes are padded with zeros before the data is exported to an OSS bucket.
- Limits on automatic metadata updates
  The automatic metadata update feature is available only in the following regions: China (Hangzhou), China (Chengdu), China (Shanghai), China (Shenzhen), China (Zhangjiakou), and China (Beijing).
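Several of the limits above can be checked on the CPFS side before you create an export task. The sketch below is one possible pre-flight scan in Python, not an official tool. It walks a directory tree and reports symbolic links, files with more than one hard link, empty directories, names containing characters outside the supported set, and paths longer than 1,023 characters. It assumes that the path limit is counted in characters relative to the scanned root, which the limits above do not specify. The function name scan_for_export_issues is hypothetical.

```python
import os
import re
import stat

# Characters listed as supported: letters, digits, ! - _ . * ( )
SUPPORTED_NAME = re.compile(r"[A-Za-z0-9!\-_.*()]+")
MAX_PATH_CHARS = 1023  # assumption: counted relative to the scanned root


def scan_for_export_issues(root: str) -> list[tuple[str, str]]:
    """Return (relative_path, issue) pairs for entries under `root`."""
    issues = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        if not dirnames and not filenames and rel_dir != ".":
            issues.append((rel_dir, "empty directory is not exported"))
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if not SUPPORTED_NAME.fullmatch(name):
                issues.append((rel, "name has characters outside the supported set"))
            if len(rel) > MAX_PATH_CHARS:
                issues.append((rel, "path exceeds 1,023 characters"))
            info = os.lstat(full)  # lstat inspects the link itself, not its target
            if stat.S_ISLNK(info.st_mode):
                issues.append((rel, "symbolic link is not exported"))
            elif stat.S_ISREG(info.st_mode) and info.st_nlink > 1:
                issues.append((rel, "file has multiple hard links"))
    return issues
```

A name flagged for its characters is not necessarily rejected by a dataflow, because the limit only advises caution. Treat that category as a list of entries to review.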
Procedure
- Create a CPFS fileset. For more information, see Create a fileset.
- Create a dataflow. For more information, see Create a dataflow.
- Create a data import, data export, or data deletion task. For more information, see Create a dataflow task.
- Verify the data. After the dataflow task is complete, verify that the data at the destination is accurate.
  Warning: If you delete the source data before verifying that the data was transferred to the destination correctly, you are solely responsible for any resulting data loss and all consequences.
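The verification step can be scripted. The following Python sketch compares two manifests, each mapping a relative path to a (size, checksum) pair, and lists the entries that are missing, unexpected, or different at the destination. It assumes that you build the destination manifest yourself, for example by listing the OSS prefix with a tool of your choice; that part is outside this sketch. SHA-256 is an arbitrary choice here, and the function names are hypothetical.

```python
import hashlib
import os


def build_local_manifest(root: str) -> dict[str, tuple[int, str]]:
    """Map each regular file under `root` to (size, SHA-256 hex digest)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # symbolic links are not exported, so skip them
            digest = hashlib.sha256()
            with open(full, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = (os.path.getsize(full), digest.hexdigest())
    return manifest


def diff_manifests(source: dict, destination: dict) -> dict[str, list[str]]:
    """Compare two {path: (size, checksum)} manifests."""
    shared = source.keys() & destination.keys()
    return {
        "missing": sorted(source.keys() - destination.keys()),
        "extra": sorted(destination.keys() - source.keys()),
        "changed": sorted(p for p in shared if source[p] != destination[p]),
    }
```

Three empty lists are a reasonable precondition for deleting source data. A non-empty list points to the paths that need another look, regardless of what the task report says.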
Performance metrics
| Operation type | Metric | Description |
| --- | --- | --- |
| Data import | Throughput for files larger than 1 GB | |
| | OPS for megabyte-scale files | Single-directory and multi-directory import: 1,000. |
| Data export | Throughput for files larger than 1 GB | |
| | OPS for megabyte-scale files | Single-directory and multi-directory export: 600. |
| Data deletion | OPS | Single-directory and multi-directory deletion: 2,000. |
| On-demand loading (lazy load) | Throughput for files larger than 1 GB | |
| | OPS for megabyte-scale files | Single-directory and multi-directory import: 1,000. |
| Automatic metadata update | OPS | |
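The OPS figures in the table allow a rough lower bound on task duration. The Python sketch below assumes that one operation corresponds to one file and that the listed rate is sustained for the whole task. Neither assumption is stated in the table, so treat the result as a planning estimate only. The function name min_task_seconds is hypothetical.

```python
import math

# OPS values copied from the table above. The import and export figures
# apply to megabyte-scale files.
OPS = {"import": 1000, "export": 600, "delete": 2000}


def min_task_seconds(file_count: int, operation: str) -> int:
    """Lower-bound duration in seconds, assuming one operation per file."""
    return math.ceil(file_count / OPS[operation])
```

Under these assumptions, exporting 1,000,000 megabyte-scale files takes at least 1,000,000 / 600, rounded up to 1,667 seconds, or about 28 minutes. Importing the same number of files takes at least 1,000 seconds.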